Enhancement/detection optimization #42
Conversation
docs/user_guide/sensitive.rst
Outdated
.. ipython:: python

    df = pd.read_csv("../datasets/compas.csv")

    dt.detect_names_df(df)

As we can see, some sensitive categories from the dataframe have been picked out by the shallow search.
Let's see now if by enabling deep search we are able to detect more attributes:

.. ipython:: python

    dt.detect_names_df(df, deep_search=True)
Add COMPAS example in the user guides.
    "Age": ["age", "DOB", "birth", "youth", "elder", "senior", "date of birth"],
    "Gender": ["gender", "sex"],
-   "Ethnicity": ["race", "color", "ethnic", "breed", "culture"],
-   "Religion": ["religion", "creed", "cult", "doctrine"],
-   "Nationality": ["nation", "geography", "location", "native", "country", "region"],
-   "Family Status": ["family status", "family", "house", "marital", "children", "partner", "pregnant"],
+   "Ethnicity": ["race", "color", "ethnic", "breed", "culture", "ethnicity"],
+   "Religion": ["religion", "creed", "cult", "doctrine", "faith"],
+   "Nationality": ["nationality", "nation", "geography", "location", "language", "native", "country", "region"],
+   "Family Status": ["family status", "family", "house", "marital", "children", "partner", "pregnant", "marital status"],
    "Disability": ["disability", "impairment"],
-   "Sexual Orientation": ["sexual orientation", "sexual", "orientation", "attracted"]
+   "Sexual Orientation": ["sexual orientation", "sexual preference", "sexual", "orientation", "attracted"]
Slightly update the configuration to contain the category itself.
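The point of this change can be sketched with a minimal example. The match_category helper and the one-entry dictionary below are hypothetical illustrations, not code from the PR; only the synonym words come from the configuration above. Under exact-title matching, a column literally named after its category only matches once the category name itself is in the synonym list:

```python
# Hypothetical sketch of shallow, title-based matching. Only the synonym
# list itself comes from the PR's configuration; the helper is invented.
synonyms = {
    "Nationality": ["nationality", "nation", "geography", "location",
                    "language", "native", "country", "region"],
}

def match_category(column_name):
    # Exact match against the lowercased title. Before this PR added
    # "nationality" to the list, a column titled "Nationality" would
    # not have matched here.
    name = column_name.strip().lower()
    for category, words in synonyms.items():
        if name in words:
            return category
    return None

print(match_category("Nationality"))  # Nationality
print(match_category("Income"))       # None
```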
src/fairlens/sensitive/detection.py
Outdated
for col in cols:
    # If the series are larger than the provided n_samples, we take a sample to increase speed.
    if df[col].size > n_samples:
        column = df[col].sample(n=n_samples)
    else:
        column = df[col]
Limits the amount of series elements analyzed to improve execution speed.
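The guard above can be exercised in isolation. This is a standalone sketch (the maybe_sample name is invented for illustration) showing why the size check matters before sampling:

```python
import pandas as pd

def maybe_sample(series, n_samples):
    # Series.sample(n=...) raises a ValueError when n exceeds the series
    # length, so we only sample when the series is actually larger.
    if series.size > n_samples:
        return series.sample(n=n_samples)
    return series

big = pd.Series(range(1000))
small = pd.Series(range(5))
print(maybe_sample(big, 20).size)    # 20
print(maybe_sample(small, 20).size)  # 5
```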
src/fairlens/sensitive/detection.py
Outdated
@@ -151,8 +162,9 @@ def _detect_name(

    # Check startswith / endswith
    for group_name, attrs in attr_synonym_dict.items():
+       separator = "|".join(" ,.-:")
nit: maybe directly writing separator = " |,|.|-|:" would be clearer?
Might be a bit clearer, yes. Also, we could do separator = " ,.-:" and then call the function with "|".join(separator).
Yes, probably separator = " ,.-:" together with "|".join(separator) is a good option; it is easier to change if we need to add any extra symbol.
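The two spellings discussed here produce the same pattern. A quick sketch, assuming the joined string is fed to the re module as the use of "|" suggests:

```python
import re

# Keep the separator characters in one string, per the suggestion above,
# so adding a new symbol is a one-character change.
separator_chars = " ,.-:"

# "|".join over the characters yields exactly the alternation from the nit.
pattern = "|".join(separator_chars)
assert pattern == " |,|.|-|:"

# If the pattern is used as a regex, escaping each character first is
# safer, since an unescaped "." would match any character.
safe_pattern = "|".join(re.escape(c) for c in separator_chars)
print(re.split(safe_pattern, "date-of-birth"))  # ['date', 'of', 'birth']
```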
tests/test_detection.py
Outdated
def test_compas_detect_shallow():
    res = {
        "DateOfBirth": "Age",
        "Ethnicity": "Ethnicity",
        "Language": "Nationality",
        "MaritalStatus": "Family Status",
        "Sex": "Gender",
    }
    assert dt.detect_names_df(dfc) == res


def test_compas_detect_deep():
    res = {
        "DateOfBirth": "Age",
        "Ethnicity": "Ethnicity",
        "Language": "Nationality",
        "MaritalStatus": "Family Status",
        "Sex": "Gender",
    }
    assert dt.detect_names_df(dfc, deep_search=True) == res
In both cases, the COMPAS dataset seems to behave the same. For testing, can we find an example where deep_search has an impact on the result?
There are a few tests that showcase this in the previous version (that is, on main). In order to do it on a dataset, I think we will need to find a common one that presents this behavior, but that means solving #9.
The problem is that this test currently doesn't really check the correct behaviour, because even with deep_search=True we first do a "non-deep" search, and we know from the previous test that this works. What about something like dfc = dfc.rename(columns={"DateOfBirth": "A", "Ethnicity": "B", [...]}) and res = {"A": "Age", "B": "Ethnicity", [...]}? (Make sure you copy/re-load the dataframe, otherwise it might affect other tests.)
Oh, that actually sounds good. It might not detect Age if we change the column name, since the values are just numeric, but otherwise it should be good.
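The rename-based test suggested here could look roughly like the sketch below. The stand-in dataframe and its values are invented for illustration, and the final assertion against dt.detect_names_df is left as a comment since it needs the real library and dataset:

```python
import pandas as pd

# Hypothetical stand-in for the loaded COMPAS dataframe used in the tests.
dfc = pd.DataFrame({
    "Ethnicity": ["African-American", "Caucasian"],
    "MaritalStatus": ["Single", "Married"],
    "Sex": ["Male", "Female"],
})

# DataFrame.rename returns a copy by default, so the original dfc
# (shared by other tests) is left untouched.
renamed = dfc.rename(columns={"Ethnicity": "A", "MaritalStatus": "B", "Sex": "C"})

# With non-indicative titles, only the deep search over values can match:
# assert dt.detect_names_df(renamed, deep_search=True) == {
#     "A": "Ethnicity", "B": "Family Status", "C": "Gender"}
print(list(renamed.columns))  # ['A', 'B', 'C']
print(list(dfc.columns))      # ['Ethnicity', 'MaritalStatus', 'Sex']
```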
@@ -18,6 +18,7 @@ def detect_names_df(
    threshold: float = 0.1,
    str_distance: Callable[[Optional[str], Optional[str]], float] = None,
    deep_search: bool = False,
+   n_samples: int = 20,
Open to discussion: is 20 representative enough? With a big dataset, one could easily get 20 samples that don't contain enough data to do a proper search. What's the impact of increasing this number a bit more?
Assuming the dictionary is very comprehensive, it could be enough. However, I don't think increasing it a few times will be troublesome, as it is already pretty fast. As an alternative suggestion, what do you think about calling .unique() on the series and then sampling from that new series?
Yeah, good idea. .unique() seems the best option, as it's quite fast and obviously the most representative sample! Maybe you can still apply the n_samples threshold after getting the unique values; otherwise, a column with many values (e.g. names, continuous values, etc.) can heavily affect the performance.
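The approach settled on here, deduplicate first and then cap at n_samples, might be sketched as follows; the representative_sample name is illustrative, not from the PR:

```python
import pandas as pd

def representative_sample(column, n_samples=20):
    # Unique values are the most representative sample of a series;
    # capping at n_samples afterwards keeps high-cardinality columns
    # (names, continuous values) from hurting performance.
    uniques = pd.Series(column.unique())
    if len(uniques) > n_samples:
        return uniques.sample(n=n_samples)
    return uniques

# A long low-cardinality column reduces to its distinct values.
sex = pd.Series(["Male", "Female"] * 500)
print(sorted(representative_sample(sex)))  # ['Female', 'Male']

# A high-cardinality column is still capped at n_samples.
ids = pd.Series(range(10_000))
print(representative_sample(ids).size)  # 20
```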
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
👌 💯
* make deep_search override shallow search and skip analyzing column titles
* add sampling for series of large dataframes
* update ENGB configuration so that synonyms contain the categories
* update user guide with COMPAS example
* update detection module and tests, fix COMPAS bugs
* fix SonarCloud code smell
* remove extraneous local variable
* slightly reduce redundant code
* fix pre-commit issues
* remove debugging snippet from test_detection.py
* implement sampling from uniques, change separators
* update COMPAS deep search test columns to have non-indicative names
* add more detail to COMPAS detection example
* add resampling methods and 2 metrics with p-values, integrate with distance metrics
* add p-value module with resampling methods, tests for binomial
* fix flake8 error
* add mean distance, tests for p-vals, refactor histogram logic into utility
* add tests for p-value tests
* fix sonarcloud bug
* fix trailing whitespace error
* add additional tests
* add additional visualization tools
* add title to plt_attr_dist
* Enhancement/detection optimization (#42)
* add demo notebook, use histogram method from p-value pr
* histograms now work for dates
* update viz docs
* update compas to fix date ambiguity
* fix code smells
* fix viz bug w/ categorical data
* quantize dates method
* add import to docs
* make suggested changes
* attempt to fix code smell
* attempt to fix code smell pt 2
* attempt to fix code smell pt 3
* fix code smell
* create methods to use, reset style
* fix code smell

Co-authored-by: George-Bogdan Surdu <51715053+bogdansurdu@users.noreply.github.com>
Co-authored-by: bogdansurdu <gs2318@ic.ac.uk>
This pull request adds optimizations to the sensitive attribute detection algorithm to speed up execution on large datasets, by making detect_names_df() sample from each series. Additionally, it makes deep_search override the shallow search, as opposed to executing both, and makes the shallow search algorithm more robust to minimize false positives.

Solves #41