
Enhancement/detection optimization #42

Merged

14 commits merged into main on Jun 11, 2021

Conversation

@bogdansurdu (Contributor) commented Jun 10, 2021

This pull request optimizes the sensitive attribute detection algorithm to speed up execution on large datasets, by making detect_names_df() sample from each series.

Additionally, it makes deep_search override the shallow search instead of executing both, and makes the shallow search algorithm more robust in order to minimize false positives.

Solves #41
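For context, a sketch of how the changed entry point is called after this PR (illustrative only: `dt` stands for the project's detection module, imported as in the user-guide snippets below, and the csv path is made up):

import pandas as pd

# `dt` is the detection module; the import path is assumed here.
df = pd.read_csv("compas.csv")

# Shallow search: match column names against the synonym configuration.
shallow = dt.detect_names_df(df)

# Deep search now overrides the shallow result instead of running in
# addition to it, and samples at most n_samples values per column.
deep = dt.detect_names_df(df, deep_search=True, n_samples=20)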

@bogdansurdu self-assigned this Jun 10, 2021
@bogdansurdu marked this pull request as draft June 10, 2021 11:14
@Hilly12 added the category:sensitive-attribute-detection label (Given a dataset, automatically discover the fields that contain sensitive information.) Jun 11, 2021
@Hilly12 linked an issue Jun 11, 2021 that may be closed by this pull request
@bogdansurdu marked this pull request as ready for review June 11, 2021 09:08
Comment on lines 64 to 75
.. ipython:: python

    df = pd.read_csv("../datasets/compas.csv")

    dt.detect_names_df(df)

As we can see, some sensitive categories from the dataframe have been picked out by the shallow search.
Let's now see whether enabling deep search detects more attributes:

.. ipython:: python

    dt.detect_names_df(df, deep_search=True)
bogdansurdu (Contributor, Author) commented:
Add COMPAS example in the user guides.

Comment on lines +4 to +11
"Age": ["age", "DOB", "birth", "youth", "elder", "senior", "date of birth"],
"Gender": ["gender", "sex"],
"Ethnicity": ["race", "color", "ethnic", "breed", "culture"],
"Religion": ["religion", "creed", "cult", "doctrine"],
"Nationality": ["nation", "geography", "location", "native", "country", "region"],
"Family Status": ["family status", "family", "house", "marital", "children", "partner", "pregnant"],
"Ethnicity": ["race", "color", "ethnic", "breed", "culture", "ethnicity"],
"Religion": ["religion", "creed", "cult", "doctrine", "faith"],
"Nationality": ["nationality", "nation", "geography", "location", "language", "native", "country", "region"],
"Family Status": ["family status", "family", "house", "marital", "children", "partner", "pregnant", "marital status"],
"Disability": ["disability", "impairment"],
"Sexual Orientation": ["sexual orientation", "sexual", "orientation", "attracted"]
"Sexual Orientation": ["sexual orientation", "sexual preference", "sexual", "orientation", "attracted"]
bogdansurdu (Contributor, Author) commented:
Slightly update the configuration so that each synonym list contains the category name itself.
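As a toy illustration of why this helps (hypothetical lookup; the real shallow search uses prefix/suffix checks and string distance rather than this exact membership test):

synonyms = {
    "Ethnicity": ["race", "color", "ethnic", "breed", "culture", "ethnicity"],
}

# A column literally named after the category is now an exact synonym hit,
# instead of relying on fuzzier matching against "ethnic".
col_name = "Ethnicity"
matches = [cat for cat, words in synonyms.items() if col_name.lower() in words]
print(matches)  # ['Ethnicity']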

Comment on lines 81 to 86
for col in cols:
    # If the series is larger than the provided n_samples, take a sample to increase speed.
    if df[col].size > n_samples:
        column = df[col].sample(n=n_samples)
    else:
        column = df[col]
bogdansurdu (Contributor, Author) commented:
Limit the number of series elements analyzed to improve execution speed.

@@ -151,8 +162,9 @@ def _detect_name(

     # Check startswith / endswith
     for group_name, attrs in attr_synonym_dict.items():
+        separator = "|".join(" ,.-:")
A reviewer (Contributor) commented:
nit: maybe directly writing separator = " |,|.|-|:" would be clearer?

bogdansurdu (Contributor, Author) replied:
Might be a bit clearer, yes. Also, we could do:

separator = " ,.-:"

and then call the function with "|".join(separator).

The reviewer replied:

Yes, probably separator = " ,.-:" with "|".join(separator) is a good option; it's easier to change if we need to add any extra symbol.
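For reference, a small sketch of what the two variants evaluate to (purely illustrative; the surrounding regex logic of _detect_name is not shown in this diff):

# Original one-liner from the diff:
separator = "|".join(" ,.-:")
print(separator)  # " |,|.|-|:"

# Refactor agreed in this thread: keep the characters in one string,
# so adding a new symbol is a one-character change.
separator_chars = " ,.-:"
separator = "|".join(separator_chars)
print(separator)  # same result: " |,|.|-|:"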

tests/test_detection.py: outdated review thread, resolved.
Comment on lines 115 to 134
def test_compas_detect_shallow():
    res = {
        "DateOfBirth": "Age",
        "Ethnicity": "Ethnicity",
        "Language": "Nationality",
        "MaritalStatus": "Family Status",
        "Sex": "Gender",
    }
    assert dt.detect_names_df(dfc) == res


def test_compas_detect_deep():
    res = {
        "DateOfBirth": "Age",
        "Ethnicity": "Ethnicity",
        "Language": "Nationality",
        "MaritalStatus": "Family Status",
        "Sex": "Gender",
    }
    assert dt.detect_names_df(dfc, deep_search=True) == res
A reviewer (Contributor) commented:
In both cases, the COMPAS dataset seems to behave the same. For testing, can we find an example where deep_search has an impact on the result?

bogdansurdu (Contributor, Author) replied:

There are a few tests that showcase this in the previous version (the one on main). To do it on a dataset, I think we would need to find a common one that presents this behavior, but that means solving #9

The reviewer replied:

The problem is that this test currently doesn't really check the correct behaviour: even with deep_search=True, a "non-deep" search is run first, and we already know from the previous test that it works.

What about something like dfc = dfc.rename(columns={"DateOfBirth": "A", "Ethnicity": "B", [...]}) and res = {"A": "Age", "B": "Ethnicity", [...]}? (Make sure you copy/re-load the dataframe, otherwise it might affect other tests.)

bogdansurdu (Contributor, Author) replied:

Oh, that actually sounds good. It might not detect Age if we rename it, since that column is just numeric values, but otherwise it should be fine.
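A minimal sketch of the suggested test (assumptions: dfc is the COMPAS dataframe loaded at module level, the rename targets are illustrative, and the expected result mirrors the existing tests above minus the numeric DateOfBirth column):

def test_compas_detect_deep_renamed():
    # rename() returns a copy, so the module-level dataframe used by the
    # other tests is left untouched.
    df_renamed = dfc.rename(
        columns={
            "DateOfBirth": "A",
            "Ethnicity": "B",
            "Language": "C",
            "MaritalStatus": "D",
            "Sex": "E",
        }
    )
    # With uninformative column names, only a deep search over the values
    # can classify the columns; "A" may go undetected, as discussed above.
    res = {"B": "Ethnicity", "C": "Nationality", "D": "Family Status", "E": "Gender"}
    assert dt.detect_names_df(df_renamed, deep_search=True) == res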

@@ -18,6 +18,7 @@ def detect_names_df(
     threshold: float = 0.1,
     str_distance: Callable[[Optional[str], Optional[str]], float] = None,
     deep_search: bool = False,
+    n_samples: int = 20,
A reviewer (Contributor) commented:
Open to discussion: is 20 representative enough? With a big dataset, one could easily get 20 samples that don't contain enough data to do a proper search.

What's the impact of increasing this number a bit more?

bogdansurdu (Contributor, Author) replied:
Assuming the dictionary is very comprehensive, it could be enough. However, I don't think increasing it a few times will be troublesome, as the search is already pretty fast.

As an alternative suggestion, what do you think about calling .unique() on the series and then sampling from that new series?

The reviewer replied:
Yeah, good idea: .unique() seems the best option, as it's quite fast and obviously gives the most representative sample! Maybe you can still apply the n_samples threshold after getting the unique values; otherwise a column with many distinct values (e.g. names, continuous values, etc.) can heavily affect performance.
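A minimal sketch of the agreed approach (sample_column is a hypothetical helper for illustration; the merged implementation may differ):

import pandas as pd

def sample_column(series: pd.Series, n_samples: int = 20) -> pd.Series:
    """Return at most n_samples distinct values from the series."""
    # Deduplicate first so the sample is as representative as possible.
    unique_values = pd.Series(series.unique())
    # Still cap the work at n_samples: columns with many distinct values
    # (e.g. names or continuous data) would otherwise dominate runtime.
    if unique_values.size > n_samples:
        return unique_values.sample(n=n_samples)
    return unique_values

# Example: a large, highly repetitive column collapses to two values.
col = pd.Series(["male", "female"] * 10_000)
print(sample_column(col).tolist())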

@sonarcloud bot commented Jun 11, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0)
Vulnerabilities: A (0)
Security Hotspots: A (0)
Code Smells: A (0)
Coverage: no coverage information
Duplication: 0.0%

@tonbadal (Contributor) left a review comment: 👌 💯

@bogdansurdu merged commit 6ea607c into main Jun 11, 2021
@bogdansurdu deleted the enhancement/detection-optimization branch June 11, 2021 13:15
Hilly12 pushed a commit that referenced this pull request Jun 30, 2021
* make deep_search override shallow search and skip analyzing column titles

* add sampling for series of large dataframes

* update ENGB configuration so that synonyms contain the categories

* update user guide with COMPAS example

* update detection module and tests, fix COMPAS bugs

* fix SonarCloud code smell

* remove extraneous local variable

* slightly reduce redundant code

* fix pre-commit issues

* remove debugging snippet from test_detection.py

* implement sampling from uniques, change separators

* update COMPAS deep search test columns to have non-indicative names

* add more detail to COMPAS detection example
Hilly12 added a commit that referenced this pull request Jul 26, 2021
* add resampling methods and 2 metrics with p-values, integrate with distance metrics

* add p-value module with resampling methods, tests for binomial

* fix flake8 error

* add mean distance, tests for p-vals, refactor histogram logic into utility

* add tests for p-value tests

* fix sonarcloud bug

* trailing whitepace error

* add additional tests

* add additional visualization tools

* add title to plt_attr_dist

* Enhancement/detection optimization (#42)


* add additional visualization tools

* add title to plt_attr_dist

* add demo notebook, use histogram method from p-value pr

* histograms now work for dates

* update viz docs

* update compas to fix date ambiguity

* fix code smells

* fix viz bug w categorical data

* quantize dates method

* add import to docs

* make suggested changes

* attempt to fix code smell

* attempt to fix code smell p

* attempt to fix code smell pt 3

* fix code smell

* create methods to use, reset style

* fix code smell

Co-authored-by: George-Bogdan Surdu <51715053+bogdansurdu@users.noreply.github.com>
Co-authored-by: bogdansurdu <gs2318@ic.ac.uk>
Labels
category:sensitive-attribute-detection Given a dataset, automatically discover the fields that contain sensitive information.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bottleneck in deep search
3 participants