Adds Multi localisations feature for PII fields defined in #308 #609

xamm · 2021-10-14T16:38:25Z

This PR adds a feature to use multiple localizations for PII fields. See #308.

Localizations can be specified per field with the pii_locales attribute in the metadata.
Localizations can be OrderedDicts to give weight to the localizations and change the probability of how often a particular localization is used to create fake values for the field.
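As a rough illustration of the two forms described above, field metadata might look like the following. Only pii_locales comes from this PR; the field contents and the surrounding keys (type, pii, pii_category) are assumed for the example. The helper locale_probabilities is hypothetical: it only shows how weights translate into selection probabilities, assuming an unweighted list is treated as equally likely, and it does not reproduce how Faker picks a locale internally.

```python
from collections import OrderedDict

# Field with two localizations given as a plain list (no weights).
company_field = {
    'type': 'categorical',
    'pii': True,
    'pii_category': 'company',
    'pii_locales': ['en_US', 'de_DE'],
}

# Field with weighted localizations: en_US is meant to be used about
# three times as often as de_DE.
name_field = {
    'type': 'categorical',
    'pii': True,
    'pii_category': 'name',
    'pii_locales': OrderedDict([('en_US', 3), ('de_DE', 1)]),
}


def locale_probabilities(pii_locales):
    """Hypothetical helper: turn a list or weighted mapping into probabilities."""
    if isinstance(pii_locales, dict):
        total = sum(pii_locales.values())
        return {locale: weight / total for locale, weight in pii_locales.items()}

    # Assumption: an unweighted list means every locale is equally likely.
    return {locale: 1 / len(pii_locales) for locale in pii_locales}


print(locale_probabilities(name_field['pii_locales']))
# {'en_US': 0.75, 'de_DE': 0.25}
```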

- pull Faker creation outside of loop
- can be reused inside
- get faker fn by category from Faker created outside of loop

katxiao

Thanks for the PR! This seems useful. I left some comments about the unit tests.

Could you also update the documentation here with an example that uses pii_locales?

katxiao · 2021-10-20T18:41:18Z

sdv/metadata/table.py

@@ -157,10 +157,16 @@ class Table:
        ('id', 'string'): 'str'
    }

-    def _get_faker(self, category):
-        """Return the faker object to anonymize data.
+    def _get_faker(self, field_metadata):


nit: Add a docstring here, with the same format that the other methods have.

I added docstrings for the new _get_faker method.

I just realized that this should be a @staticmethod, since it does not use anything from self.

That's right, the other methods don't use anything from self either, so they can be static as well.

tests/unit/metadata/test_table.py

- safer testing as not relying on random.choice for localisation selection
- Renames test methods
- Uses anonymization function directly not calling fit

katxiao · 2021-10-21T17:35:37Z

Looks great, thanks for making the changes!

One more thing - could you also update the documentation here with an example that uses pii_locales?

xamm · 2021-10-25T09:10:17Z

@katxiao I added the documentation for using localizations when anonymizing values.

The note I included could probably also be a warning.

tests/unit/metadata/test_table.py

- prior to faker version 3.0.0 only single localization can be specified
- after 3.0.0 multiple localizations are possible

- when using older versions only one localization can be specified

- skipping multiple localization if Faker version is too low

csala

Thanks for the PR @xamm !
I think the concept of the proposed changes is fine, but I added a few suggestions to improve the overall implementation, especially on the tests side.
Would you mind having a look at those?

csala · 2021-10-26T14:08:13Z

docs/developer_guides/sdv/metadata.rst

+.. note:: Specifying localizations and using ``Faker`` categories may result in an error 
+          if the defined ``pii_category`` is not available for all specified languages.
+
+.. warning:: When using versions of ``Faker`` prior to ``3.0.0``, 


I do not think there is any strong reason to keep supporting older Faker versions if they lack features, so rather than having multiple behaviors on our side depending on their version I would simply change the supported range to >3.0.0.
Could we do the change on the setup.py and remove this comment and the skip in the tests below?

I changed the minimum version for Faker to 3.0.0 in setup.py and the .yaml meta files.

If the minimum version is changed, I suppose this warning can be removed.

@xamm did you see my previous message? This warning can be removed.

Ah sorry, I did not see that.

sdv/metadata/table.py

csala · 2021-10-26T16:37:17Z

tests/unit/metadata/test_table.py

+        assert len(ean_8) == 8
+        assert len(ean_13) == 13
+
+    def test__make_anonymization_mappings_unique_faked_value_in_field(self):


These tests seem to be too broad, since they test multiple things at once. I would rather make tests for the individual methods that are involved to test each piece of code on its own. I suggest the following changes:

Create a couple of tests for the _get_faker method (with and without locales) where we ensure that when locales are defined we call Faker passing them properly (we can patch Faker for this), and that we do not otherwise.

Create one test for the _get_fake_values suggested above, mocking the faker method to ensure that faker is called as required.

Make a single test for _make_anonymization_mappings to ensure that the mapping is created properly, mocking the _get_fake_values to make the test easier.

On the other hand, when testing the _make_anonymization_mappings method I think that we should not really be validating the returned dict (see comment above about removing the return statement), but rather the values that are set in the self._ANONYMIZATION_MAPPINGS attribute, since those are the ones that will end up being used.
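A minimal sketch of the last two suggestions, under stated assumptions: TableSketch is a hypothetical stand-in written for this example, not the SDV Table class, and the signatures of _get_fake_values and _make_anonymization_mappings are guessed for illustration. It shows _get_fake_values being mocked and the assertions being made on the _ANONYMIZATION_MAPPINGS attribute rather than on a return value.

```python
from unittest.mock import patch


class TableSketch:
    """Hypothetical stand-in for the Table class, not the SDV implementation."""

    def __init__(self):
        self._ANONYMIZATION_MAPPINGS = {}

    @staticmethod
    def _get_fake_values(field_metadata, num_values):
        raise NotImplementedError('replaced by a mock in the test')

    def _make_anonymization_mappings(self, field_name, field_metadata, values):
        # The mapping is stored on the instance instead of being returned.
        fake_values = self._get_fake_values(field_metadata, len(values))
        self._ANONYMIZATION_MAPPINGS[field_name] = dict(zip(values, fake_values))


@patch.object(TableSketch, '_get_fake_values')
def test__make_anonymization_mappings(get_fake_values_mock):
    # Setup
    get_fake_values_mock.return_value = ['fake1@example.com', 'fake2@example.com']
    table = TableSketch()
    field_metadata = {'pii': True, 'pii_category': 'email'}
    values = ['test1@example.com', 'test2@example.com']

    # Run
    result = table._make_anonymization_mappings('foo', field_metadata, values)

    # Assert
    assert result is None
    get_fake_values_mock.assert_called_once_with(field_metadata, 2)
    assert table._ANONYMIZATION_MAPPINGS['foo'] == {
        'test1@example.com': 'fake1@example.com',
        'test2@example.com': 'fake2@example.com',
    }
```

Because the fake values come from the mock, the test does not depend on Faker's random output.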

I do not understand how this should be mocked. Mocking Faker seems to be quite complicated.
What exactly should be mocked for the _get_faker method?

I was thinking about using patch for this, but now that I see the tests that you implemented, I think it is not necessary.
Just validating that the output is a Faker instance and that the locales are being properly set is OK.

I hope this is changed as you expected with the new commits.

csala · 2021-10-26T16:40:23Z

tests/unit/metadata/test_table.py

+        assert list(foo_mappings.keys()) == ['test1@example.com', 'test2@example.com']
+
+    @pytest.mark.skipif(faker.VERSION < "3.0.0", reason="Higher version of Faker required.")
+    def test__make_anonymization_mappings_multiple_localizations(self):


I think it is not necessary to distinguish between single and multiple localizations, since this is something that is handled by Faker internally. Here we should be testing only the SDV code, so all we need to ensure is that the localizations are passed properly when creating the Faker instance, regardless of their number.
On the other hand, we could add an integration test where this functionality is fully tested end to end, and there we can add 1 field with a single locale and another one with 2 or more, so the Faker behavior is implicitly tested too.

I would like to keep both of the tests to ensure that the API of Faker is not changed in future versions.
The unit tests might be more explicit about what fails than an end-to-end integration test.

I believe that this ultimately should be an integration test, as it is testing how the SDV code integrates with a 3rd party library. Adding it as an integration test would allow us to test the interaction without knowing the specific inner-workings of Faker. However, I think these unit tests are sufficient for now, and we can migrate this to integration tests further down the line, as we flesh out the testing scaffold. Tracking here: #624

csala · 2021-10-26T16:42:26Z

tests/unit/metadata/test_table.py

+        - The mappings created from the original values to localized faked values.
+        """
+        # Setup
+        def _mock_faker_getattr(obj, fn_name):


This may not apply here any more if we change the tests, but I wanted to share one thought that may apply to the new tests: When mocking third party libraries we generally use the patch decorator on the test case function, accepting a Mock instance as an additional argument, and then if necessary set the return_value or side_effect function on it.
By doing it this way, we can fully configure the Mock before running the tested method, and then assert on the calls that the Mock received.
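The pattern described here could look roughly like the sketch below. The standard library function uuid.uuid4 stands in for the third-party call, and make_token is a hypothetical function under test; neither is part of this PR.

```python
import uuid
from unittest.mock import patch


def make_token():
    """Hypothetical code under test that relies on a library call."""
    return 'token-' + uuid.uuid4().hex


@patch('uuid.uuid4')
def test_make_token(uuid4_mock):
    # Setup: configure the mock before running the tested function.
    uuid4_mock.return_value.hex = 'abc123'

    # Run
    token = make_token()

    # Assert: check the result and the calls that the mock received.
    assert token == 'token-abc123'
    uuid4_mock.assert_called_once_with()
```

The decorator injects the Mock as an extra argument, so the test can set return_value or side_effect up front and inspect the recorded calls afterwards.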

Is it also possible to define the mock function outside of the test functions?
The same mock is used multiple times and for me it would make sense to pull this outside of the Test setup of each test.

Like this:

    def _mock_faker_getattr(obj, fn_name):
        if fn_name == "company":
            return lambda: obj.__lang__
        else:
            return getattr(obj, fn_name)

    @patch('faker.generator.getattr', _mock_faker_getattr)

Then no mock needs to be created or changed inside the test.

sdv/metadata/table.py

- adds new function to generate fake values

- use the private attribute _ANONYMIZATION_MAPPINGS for tests

- remove skipping test when lower version is used

csala · 2021-10-29T13:38:54Z

csala · 2021-10-29T13:40:03Z

sdv/metadata/table.py

+                The Faker object to anonymize the data in the field using its functions.
+        """
+        pii_locales = field_metadata.get('pii_locales', None)
+        return Faker(locale=pii_locales) if pii_locales is not None else Faker()


It seems like the default value for the locale argument is None, so I think this if/else is not needed.
We can just pass pii_locales down to Faker and it will work in all cases.

This is removed, but I kept the local pii_locales variable for clarity.
If wanted, this could also be moved inside the call:

Faker(locale=field_metadata.get('pii_locales', None))
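A minimal sketch of the simplified lookup, written so that it runs without Faker installed: the class is injected as faker_cls, and RecordingFaker is a hypothetical stand-in that only records the locale it receives. With the real library one would pass the Faker class instead, relying on the default of None for locale mentioned in the comment above.

```python
def get_faker(field_metadata, faker_cls):
    """Sketch of the simplified lookup; faker_cls is injected for illustration."""
    # dict.get already returns None for a missing key, so neither an
    # explicit default nor an if/else branch is needed.
    return faker_cls(locale=field_metadata.get('pii_locales'))


class RecordingFaker:
    """Hypothetical stand-in that only records the locale it was given."""

    def __init__(self, locale=None):
        self.locale = locale


with_locales = get_faker({'pii_locales': ['en_US', 'de_DE']}, RecordingFaker)
without_locales = get_faker({}, RecordingFaker)

print(with_locales.locale)     # ['en_US', 'de_DE']
print(without_locales.locale)  # None
```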

csala · 2021-10-29T13:42:37Z

tests/unit/metadata/test_table.py

+                }
+            }
+        }
+        metadata = Table.from_dict(metadata_dict)


I think creating an instance will not really be needed if we make the method static, so we can remove this setup phase after that change.

I removed this local variable in tests where metadata was only used once.
When this local variable was used more than once I kept the local variable inside the Setup section.

tests/unit/metadata/test_table.py

sdv/metadata/table.py

xamm · 2021-11-01T14:04:37Z

I will look at the comments as soon as I have time to edit them. I should have time this week.

- moves getting concrete faker attribute to closure
- moves functions in order to always define before using

- only import Faker object from the library

- when only used once use static method directly in Run section

xamm · 2021-11-02T10:14:50Z

Also what might be fixable in this PR (while it technically would be out of scope) are these unused lines:

    _fakers = None
    _constraint_instances = None

They are defined but I could not find an instance of their usage.

csala

This is almost ready @xamm so I think it can already be approved.
I just added a couple of minor comments, and once those are addressed we can merge.

sdv/metadata/table.py

tests/unit/metadata/test_table.py

csala · 2021-11-02T17:43:12Z

tests/unit/metadata/test_table.py

katxiao · 2021-11-02T18:37:40Z

tests/unit/metadata/test_table.py

- version is not supported anymore

- _get_faker
- _get_faker_method
- _get_fake_values
- add doc string

xamm added 2 commits October 14, 2021 18:05

Adds multi localisation to pii marked fields

a158c5d

Refectors multi locale faker usage

fcd4858

- pull Faker creation outside of loop
- can be reused inside
- get faker fn by category from Faker created outside of loop

xamm requested a review from a team as a code owner October 14, 2021 16:38

xamm requested review from katxiao and removed request for a team October 14, 2021 16:38

xamm mentioned this pull request Oct 14, 2021

Add custom providers to Faker #308

Open

katxiao reviewed Oct 20, 2021

katxiao requested a review from csala October 20, 2021 18:59

xamm added 5 commits October 21, 2021 11:27

- Mocks faker method

c70020f

- safer testing as not relying on random.choice for localisation selection
- Renames test methods
- Uses anonymization function directly not calling fit

Renames constant for n values created

6fa859f

adds docstring for _get_faker function

20351ca

Adds docstring to test methods

5c97608

creates test for passing arguments to the function creating fake values

e80d264

adds documentation for localized data anonymization

8bca69f

katxiao reviewed Oct 25, 2021

xamm added 4 commits October 26, 2021 11:24

adds version specific test logic

01cfa3f

- prior to faker version 3.0.0 only single localization can be specified
- after 3.0.0 multiple localizations are possible

adds warning to documentation for faker versions prior to 3.0.0

0d1a582

- when using older versions only one localization can be specified

adds tests for different versions of Faker

087464e

- skipping multiple localization if Faker version is too low

Merge branch 'sdv-dev:master' into gh-308-anonymization-specify-locales

558b2aa

csala suggested changes Oct 29, 2021

xamm added 5 commits October 29, 2021 14:17

- renames get_faker_fn to get_faker_method

f5a0332

- adds new function to generate fake values

removes return value of _make_anonymization_mappings

038d5f2

- use the private attribute _ANONYMIZATION_MAPPINGS for tests

change minimal faker version to 3.0.0

6cd992f

- remove skipping test when lower version is used

rename test name and docstrings

bb7cb79

add tests for _get_faker

7920159

csala reviewed Oct 29, 2021

sorts imports

f3fe2c9

xamm added 5 commits November 2, 2021 10:35

incorporates review:

994d5f4

- moves getting concrete faker attribute to closure
- moves functions in order to always define before using

asserts isinstance faker to get_faker tests

bc515d4

removes faker import

aad11ae

- only import Faker object from the library

removes metadata variable

3320764

- when only used once use static method directly in Run section

moves mock fn outside of test setup

ef55a95

xamm requested a review from csala November 2, 2021 10:10

csala approved these changes Nov 2, 2021

katxiao reviewed Nov 2, 2021

xamm added 5 commits November 3, 2021 09:07

fix typo strig-> string

e0e2593

change double quotes to single quotes

bb524a9

remove specific version warning from doc

7b0ad38

- version is not supported anymore

add blank line after double intendation

dab8530

make faker methods static

2523ebb

- _get_faker
- _get_faker_method
- _get_fake_values
- add doc string

katxiao approved these changes Nov 3, 2021

katxiao merged commit 77dce0f into sdv-dev:master Nov 3, 2021

xamm deleted the gh-308-anonymization-specify-locales branch November 3, 2021 14:22

amontanez24 added this to the 0.13.0 milestone Nov 19, 2021
