
Enhancement/detection optimization #42

Merged

14 commits merged into main on Jun 11, 2021

Conversation

@bogdansurdu (Contributor) commented Jun 10, 2021

This pull request optimizes the sensitive attribute detection algorithm to speed up execution on large datasets, by making detect_names_df() sample from each series.

Additionally, it makes deep_search override the shallow search instead of executing both, and makes the shallow search algorithm more robust in order to minimize false positives.

Solves #41
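For context, a sketch of how the changed entry point is called after this PR (illustrative only: `dt` stands for the project's detection module, imported as in the user-guide snippets below, and the csv path is made up):

import pandas as pd

# `dt` is the detection module; the import path is assumed here.
df = pd.read_csv("compas.csv")

# Shallow search: match column names against the synonym configuration.
shallow = dt.detect_names_df(df)

# Deep search now overrides the shallow result instead of running in
# addition to it, and samples at most n_samples values per column.
deep = dt.detect_names_df(df, deep_search=True, n_samples=20)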

@bogdansurdu self-assigned this Jun 10, 2021
@bogdansurdu marked this pull request as draft June 10, 2021 11:14
@Hilly12 added the category:sensitive-attribute-detection label (Given a dataset, automatically discover the fields that contain sensitive information.) Jun 11, 2021
@Hilly12 linked an issue Jun 11, 2021 that may be closed by this pull request
@bogdansurdu marked this pull request as ready for review June 11, 2021 09:08
Comment on lines 64 to 75
.. ipython:: python

    df = pd.read_csv("../datasets/compas.csv")

    dt.detect_names_df(df)

As we can see, some sensitive categories from the dataframe have been picked out by the shallow search.
Let's now see whether enabling deep search detects more attributes:

.. ipython:: python

    dt.detect_names_df(df, deep_search=True)
bogdansurdu (Contributor, Author) commented:
Add COMPAS example in the user guides.

Comment on lines +4 to +11
"Age": ["age", "DOB", "birth", "youth", "elder", "senior", "date of birth"],
"Gender": ["gender", "sex"],
"Ethnicity": ["race", "color", "ethnic", "breed", "culture"],
"Religion": ["religion", "creed", "cult", "doctrine"],
"Nationality": ["nation", "geography", "location", "native", "country", "region"],
"Family Status": ["family status", "family", "house", "marital", "children", "partner", "pregnant"],
"Ethnicity": ["race", "color", "ethnic", "breed", "culture", "ethnicity"],
"Religion": ["religion", "creed", "cult", "doctrine", "faith"],
"Nationality": ["nationality", "nation", "geography", "location", "language", "native", "country", "region"],
"Family Status": ["family status", "family", "house", "marital", "children", "partner", "pregnant", "marital status"],
"Disability": ["disability", "impairment"],
"Sexual Orientation": ["sexual orientation", "sexual", "orientation", "attracted"]
"Sexual Orientation": ["sexual orientation", "sexual preference", "sexual", "orientation", "attracted"]
bogdansurdu (Contributor, Author) commented:
Slightly update the configuration so that each synonym list contains the category name itself.
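As a toy illustration of why this helps (hypothetical lookup; the real shallow search uses prefix/suffix checks and string distance rather than this exact membership test):

synonyms = {
    "Ethnicity": ["race", "color", "ethnic", "breed", "culture", "ethnicity"],
}

# A column literally named after the category is now an exact synonym hit,
# instead of relying on fuzzier matching against "ethnic".
col_name = "Ethnicity"
matches = [cat for cat, words in synonyms.items() if col_name.lower() in words]
print(matches)  # ['Ethnicity']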

Comment on lines 81 to 86
for col in cols:
    # If the series is larger than the provided n_samples, take a sample to increase speed.
    if df[col].size > n_samples:
        column = df[col].sample(n=n_samples)
    else:
        column = df[col]
bogdansurdu (Contributor, Author) commented:
Limit the number of series elements analyzed to improve execution speed.

@@ -151,8 +162,9 @@ def _detect_name(

     # Check startswith / endswith
     for group_name, attrs in attr_synonym_dict.items():
+        separator = "|".join(" ,.-:")
A reviewer (Contributor) commented:
nit: maybe directly writing separator = " |,|.|-|:" would be clearer?

bogdansurdu (Contributor, Author) replied:
Might be a bit clearer, yes. Also, we could do:

separator = " ,.-:"

and then call the function with "|".join(separator).

The reviewer replied:

Yes, probably separator = " ,.-:" with "|".join(separator) is a good option; it's easier to change if we need to add any extra symbol.
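For reference, a small sketch of what the two variants evaluate to (purely illustrative; the surrounding regex logic of _detect_name is not shown in this diff):

# Original one-liner from the diff:
separator = "|".join(" ,.-:")
print(separator)  # " |,|.|-|:"

# Refactor agreed in this thread: keep the characters in one string,
# so adding a new symbol is a one-character change.
separator_chars = " ,.-:"
separator = "|".join(separator_chars)
print(separator)  # same result: " |,|.|-|:"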

tests/test_detection.py: outdated review thread, resolved.
Comment on lines 115 to 134
def test_compas_detect_shallow():
    res = {
        "DateOfBirth": "Age",
        "Ethnicity": "Ethnicity",
        "Language": "Nationality",
        "MaritalStatus": "Family Status",
        "Sex": "Gender",
    }
    assert dt.detect_names_df(dfc) == res


def test_compas_detect_deep():
    res = {
        "DateOfBirth": "Age",
        "Ethnicity": "Ethnicity",
        "Language": "Nationality",
        "MaritalStatus": "Family Status",
        "Sex": "Gender",
    }
    assert dt.detect_names_df(dfc, deep_search=True) == res
A reviewer (Contributor) commented:
In both cases, the COMPAS dataset seems to behave the same. For testing, can we find an example where deep_search has an impact on the result?

bogdansurdu (Contributor, Author) replied:

There are a few tests that showcase this in the previous version (the one on main). To do it on a dataset, I think we would need to find a common one that presents this behavior, but that means solving #9

The reviewer replied:

The problem is that this test currently doesn't really check the correct behaviour: even with deep_search=True, a "non-deep" search is run first, and we already know from the previous test that it works.

What about something like dfc = dfc.rename(columns={"DateOfBirth": "A", "Ethnicity": "B", [...]}) and res = {"A": "Age", "B": "Ethnicity", [...]}? (Make sure you copy/re-load the dataframe, otherwise it might affect other tests.)

bogdansurdu (Contributor, Author) replied:

Oh, that actually sounds good. It might not detect Age if we rename it, since that column is just numeric values, but otherwise it should be fine.
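A minimal sketch of the suggested test (assumptions: dfc is the COMPAS dataframe loaded at module level, the rename targets are illustrative, and the expected result mirrors the existing tests above minus the numeric DateOfBirth column):

def test_compas_detect_deep_renamed():
    # rename() returns a copy, so the module-level dataframe used by the
    # other tests is left untouched.
    df_renamed = dfc.rename(
        columns={
            "DateOfBirth": "A",
            "Ethnicity": "B",
            "Language": "C",
            "MaritalStatus": "D",
            "Sex": "E",
        }
    )
    # With uninformative column names, only a deep search over the values
    # can classify the columns; "A" may go undetected, as discussed above.
    res = {"B": "Ethnicity", "C": "Nationality", "D": "Family Status", "E": "Gender"}
    assert dt.detect_names_df(df_renamed, deep_search=True) == res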

@@ -18,6 +18,7 @@ def detect_names_df(
     threshold: float = 0.1,
     str_distance: Callable[[Optional[str], Optional[str]], float] = None,
     deep_search: bool = False,
+    n_samples: int = 20,
A reviewer (Contributor) commented:
Open to discussion: is 20 representative enough? With a big dataset, one could easily get 20 samples that don't contain enough data to do a proper search.

What's the impact of increasing this number a bit more?

bogdansurdu (Contributor, Author) replied:
Assuming the dictionary is very comprehensive, it could be enough. However, I don't think increasing it a few times will be troublesome, as the search is already pretty fast.

As an alternative suggestion, what do you think about calling .unique() on the series and then sampling from that new series?

The reviewer replied:
Yeah, good idea: .unique() seems the best option, as it's quite fast and obviously gives the most representative sample! Maybe you can still apply the n_samples threshold after getting the unique values; otherwise a column with many distinct values (e.g. names, continuous values, etc.) can heavily affect performance.
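A minimal sketch of the agreed approach (sample_column is a hypothetical helper for illustration; the merged implementation may differ):

import pandas as pd

def sample_column(series: pd.Series, n_samples: int = 20) -> pd.Series:
    """Return at most n_samples distinct values from the series."""
    # Deduplicate first so the sample is as representative as possible.
    unique_values = pd.Series(series.unique())
    # Still cap the work at n_samples: columns with many distinct values
    # (e.g. names or continuous data) would otherwise dominate runtime.
    if unique_values.size > n_samples:
        return unique_values.sample(n=n_samples)
    return unique_values

# Example: a large, highly repetitive column collapses to two values.
col = pd.Series(["male", "female"] * 10_000)
print(sample_column(col).tolist())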

@sonarcloud bot commented Jun 11, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0)
Vulnerabilities: A (0)
Security Hotspots: A (0)
Code Smells: A (0)
Coverage: no coverage information
Duplication: 0.0%

@tonbadal (Contributor) left a review comment: 👌 💯

@bogdansurdu merged commit 6ea607c into main Jun 11, 2021
@bogdansurdu deleted the enhancement/detection-optimization branch June 11, 2021 13:15
Hilly12 pushed a commit that referenced this pull request Jun 30, 2021
* make deep_search override shallow search and skip analyzing column titles

* add sampling for series of large dataframes

* update ENGB configuration so that synonyms contain the categories

* update user guide with COMPAS example

* update detection module and tests, fix COMPAS bugs

* fix SonarCloud code smell

* remove extraneous local variable

* slightly reduce redundant code

* fix pre-commit issues

* remove debugging snippet from test_detection.py

* implement sampling from uniques, change separators

* update COMPAS deep search test columns to have non-indicative names

* add more detail to COMPAS detection example
Hilly12 added a commit that referenced this pull request Jul 26, 2021
* add resampling methods and 2 metrics with p-values, integrate with distance metrics

* add p-value module with resampling methods, tests for binomial

* fix flake8 error

* add mean distance, tests for p-vals, refactor histogram logic into utility

* add tests for p-value tests

* fix sonarcloud bug

* trailing whitepace error

* add additional tests

* add additional visualization tools

* add title to plt_attr_dist

* Enhancement/detection optimization (#42)


* add additional visualization tools

* add title to plt_attr_dist

* add demo notebook, use histogram method from p-value pr

* histograms now work for dates

* update viz docs

* update compas to fix date ambiguity

* fix code smells

* fix viz bug w categorical data

* quantize dates method

* add import to docs

* make suggested changes

* attempt to fix code smell

* attempt to fix code smell p

* attempt to fix code smell pt 3

* fix code smell

* create methods to use, reset style

* fix code smell

Co-authored-by: George-Bogdan Surdu <51715053+bogdansurdu@users.noreply.github.com>
Co-authored-by: bogdansurdu <gs2318@ic.ac.uk>
Labels
category:sensitive-attribute-detection Given a dataset, automatically discover the fields that contain sensitive information.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bottleneck in deep search
3 participants