Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Multi localisations feature for PII fields defined in #308 #609

Merged
merged 28 commits into from
Nov 3, 2021
Merged

Adds Multi localisations feature for PII fields defined in #308 #609

merged 28 commits into from
Nov 3, 2021

Conversation

xamm
Copy link
Contributor

@xamm xamm commented Oct 14, 2021

This PR adds a feature to use multiple localizations for PII fields. see #308

Localizations can be specified per field with the pii_locales attribute in the metadata.
Localizations can be OrderedDicts to give weight to the localizations and change the probability of how often a particular localization is used to create fake values for the field.

- pull Faker creation outside of loop
  - can be reused inside
- get faker fn by category from Faker created outside of loop
@xamm xamm requested a review from a team as a code owner October 14, 2021 16:38
@xamm xamm requested review from katxiao and removed request for a team October 14, 2021 16:38
Copy link
Contributor

@katxiao katxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! This seems useful. I left some comments about the unit tests.

Could you also update the documentation here with an example that uses pii_locales?

@@ -157,10 +157,16 @@ class Table:
('id', 'string'): 'str'
}

def _get_faker(self, category):
"""Return the faker object to anonymize data.
def _get_faker(self, field_metadata):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Add a docstring here, with the same format that the other methods have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added docstrings for the new _get_faker method.

Copy link
Contributor

@csala csala Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that his should be @staticmethod, since it does not use anything from self.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, the other methods don't use anything from self either, so they can be static as well.

tests/unit/metadata/test_table.py Outdated Show resolved Hide resolved
tests/unit/metadata/test_table.py Outdated Show resolved Hide resolved
tests/unit/metadata/test_table.py Outdated Show resolved Hide resolved
@katxiao katxiao requested a review from csala October 20, 2021 18:59
    - safer testing as not relying on random.choice for localisation selection
- Renames test methods
- Uses anonymization function directly not calling fit
@katxiao
Copy link
Contributor

katxiao commented Oct 21, 2021

Looks great, thanks for making the changes!

One more thing - could you also update the documentation here with an example that uses pii_locales?

@xamm
Copy link
Contributor Author

xamm commented Oct 25, 2021

@katxiao I added the documentation for using localizations when anonymizing values.

The note I included could probably also be a warning.

- prior to faker version 3.0.0 only single localization can be specified
- after 3.0.0 multiple localizations are possible
- when using older versions only one localization can be specified
- skipping multiple localization if Faker version is too low
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @xamm !
I think the concept of the proposed changes is fine, but I added a few suggestions to make improve the overall implementation, especially on the tests side.
Would you mind having a look at those?

.. note:: Specifying localizations and using ``Faker`` categories may result in an error
if the defined ``pii_category`` is not available for all specified languages.

.. warning:: When using versions of ``Faker`` prior to ``3.0.0``,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think there is any strong reason to keep supporting older Faker versions if they lack features, so rather than having multiple behaviors on our side depending on their version I would simply change the supported range to >3.0.0.
Could we do the change on the setup.py and remove this comment and the skip in the tests below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this the minimal version for Faker to 3.0.0 in the setup.py and the .yaml meta files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the minimum version is changed I suppose this warning can be removed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xamm did you see my previous message? This warning can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I did not see that.

sdv/metadata/table.py Outdated Show resolved Hide resolved
sdv/metadata/table.py Outdated Show resolved Hide resolved
sdv/metadata/table.py Outdated Show resolved Hide resolved
assert len(ean_8) == 8
assert len(ean_13) == 13

def test__make_anonymization_mappings_unique_faked_value_in_field(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests seem to be too broad, since they test multiple things at once. I would rather make tests for the individual methods that are involved to test each piece of code on its own. I suggest the following changes:

  1. Create a couple of tests for the _get_faker method (with and without locales) where we ensure that when locales are defined we call Faker passing them properly (we can patch Faker for this), and that we do not otherwise.
  2. Create one test for the _get_fake_values suggested above, mocking the faker method to ensure that faker is called as required.
  3. Make a single test for _make_anonymization_mappings to ensure that the mapping is created properly, mocking the _get_fake_values to make the test easier.

On the other hand, when testing the _make_anonymization_mappings method I think that we should not really be validating the returned dict (see comment above about removing the return statement), but rather the values that are set in the self._ANONYMIZATION_MAPPINGS attribute, since those are the ones that will end up being used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand how this should be mocked. Mocking Faker seems to be quite complicated.
What exactly should be mocked for the _get_faker method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about using patch for this, but now that I see the tests that you implemented, I think it is not necessary.
Just validating that the output is a Faker instance and that the locales are being properly set is OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope, that this is changed as you expected with the new commits.

assert list(foo_mappings.keys()) == ['test1@example.com', 'test2@example.com']

@pytest.mark.skipif(faker.VERSION < "3.0.0", reason="Higher version of Faker required.")
def test__make_anonymization_mappings_multiple_localizations(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not necessary to distinguish between single and multiple localizations, since this is something that is handled by Faker internally. Here we should be testing only the SDV code, so all we need to ensure is that the localizations are passed properly when creating the Faker instance, regardless of their number.
On the other hand, we could add an integration test where this functionality is fully tested end to end, and there we can add 1 field with a single locale and another one with 2 or more, so the Faker behavior is implicitly tested too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep both of the tests to ensure that the API of Faker is not changed in future versions.
The Unit tests might be more explicit what fails than an end-to-end integration test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this ultimately should be an integration test, as it is testing how the SDV code integrates with a 3rd party library. Adding it as an integration test would allow us to test the interaction without knowing the specific inner-workings of Faker. However, I think these unit tests are sufficient for now, and we can migrate this to integration tests further down the line, as we flesh out the testing scaffold. Tracking here: #624

- The mappings created from the original values to localized faked values.
"""
# Setup
def _mock_faker_getattr(obj, fn_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not apply here any more if we change the tests, but I wanted to share one thought that may apply to the new tests: When mocking third party libraries we generally use the patch decorator on the test case function, accepting a Mock instance as an additional argument, and then if necessary set the return_value or side_effect function on it.
By doing it this way, we can fully configure the Mock before running the tested method, and then assert on the calls that the Mock received.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it also possible to define the mock function outside of the tests functions?
The same mock is used multiple times and for me it would make sense to pull this outside of the Test setup of each test.

Like this:

    def _mock_faker_getattr(obj, fn_name):
        if fn_name == "company":
            return lambda: obj.__lang__
        else:
            return getattr(obj, fn_name)

    @patch('faker.generator.getattr', _mock_faker_getattr)

Then no mock must be created / changed inside the test.

sdv/metadata/table.py Show resolved Hide resolved
- adds new function to generate fake values
- use the private attribute _ANONYMIZATION_MAPPINGS for tests
- remove skipping test when lower version is used
assert len(ean_8) == 8
assert len(ean_13) == 13

def test__make_anonymization_mappings_unique_faked_value_in_field(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about using patch for this, but now that I see the tests that you implemented, I think it is not necessary.
Just validating that the output is a Faker instance and that the locales are being properly set is OK.

The Faker object to anonymize the data in the field using its functions.
"""
pii_locales = field_metadata.get('pii_locales', None)
return Faker(locale=pii_locales) if pii_locales is not None else Faker()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the default value for the locale argument is None, so I think this if/else is not needed.
We can just pass pii_locales down to Faker and it will work in all cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removed, but i kept the local pii_locales varialbe for clarity.
If wanted, this could also be moved inside the call:

Faker(locale=field_metadata.get('pii_locales', None))

}
}
}
metadata = Table.from_dict(metadata_dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think creating an instance will not really needed if we make the method static, so this can remove this setup phase after that change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this local variable in tests where metadata was only used once.
When this local variable was used more than once I kept the local variable inside the Setup section.

tests/unit/metadata/test_table.py Show resolved Hide resolved
sdv/metadata/table.py Outdated Show resolved Hide resolved
@xamm
Copy link
Contributor Author

xamm commented Nov 1, 2021

I will look at the comments as soon as I have time to edit them. I should have time this week.

- moves getting concrete faker attribute to closure
- moves functions in order to always define before using
- only import Faker object from the library
- when only used once use static method directly in Run section
@xamm xamm requested a review from csala November 2, 2021 10:10
@xamm
Copy link
Contributor Author

xamm commented Nov 2, 2021

Also what might be fixable in this PR (while it technically would be out of scope) are these unused lines:

    _fakers = None
    _constraint_instances = None

They are defined but I could not find an instance of their usage.

Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost ready @xamm so I think it can already be approved.
I just added a couple of minor comments, and once those are addressed we can merge,

sdv/metadata/table.py Show resolved Hide resolved
tests/unit/metadata/test_table.py Outdated Show resolved Hide resolved
.. note:: Specifying localizations and using ``Faker`` categories may result in an error
if the defined ``pii_category`` is not available for all specified languages.

.. warning:: When using versions of ``Faker`` prior to ``3.0.0``,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xamm did you see my previous message? This warning can be removed.

tests/unit/metadata/test_table.py Outdated Show resolved Hide resolved
assert list(foo_mappings.keys()) == ['test1@example.com', 'test2@example.com']

@pytest.mark.skipif(faker.VERSION < "3.0.0", reason="Higher version of Faker required.")
def test__make_anonymization_mappings_multiple_localizations(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this ultimately should be an integration test, as it is testing how the SDV code integrates with a 3rd party library. Adding it as an integration test would allow us to test the interaction without knowing the specific inner-workings of Faker. However, I think these unit tests are sufficient for now, and we can migrate this to integration tests further down the line, as we flesh out the testing scaffold. Tracking here: #624

tests/unit/metadata/test_table.py Outdated Show resolved Hide resolved
@katxiao katxiao merged commit 77dce0f into sdv-dev:master Nov 3, 2021
@xamm xamm deleted the gh-308-anonymization-specify-locales branch November 3, 2021 14:22
@amontanez24 amontanez24 added this to the 0.13.0 milestone Nov 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants