-
Notifications
You must be signed in to change notification settings - Fork 317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds Multi localisations feature for PII fields defined in #308 #609
Adds Multi localisations feature for PII fields defined in #308 #609
Conversation
- pull Faker creation outside of loop - can be reused inside - get faker fn by category from Faker created outside of loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! This seems useful. I left some comments about the unit tests.
Could you also update the documentation here with an example that uses pii_locales
?
sdv/metadata/table.py
Outdated
@@ -157,10 +157,16 @@ class Table: | |||
('id', 'string'): 'str' | |||
} | |||
|
|||
def _get_faker(self, category): | |||
"""Return the faker object to anonymize data. | |||
def _get_faker(self, field_metadata): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Add a docstring here, with the same format that the other methods have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added docstrings for the new _get_faker
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that his should be @staticmethod
, since it does not use anything from self
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, the other methods don't use anything from self either, so they can be static as well.
- safer testing as not relying on random.choice for localisation selection - Renames test methods - Uses anonymization function directly not calling fit
Looks great, thanks for making the changes! One more thing - could you also update the documentation here with an example that uses |
@katxiao I added the documentation for using localizations when anonymizing values. The note I included could probably also be a warning. |
- prior to faker version 3.0.0 only single localization can be specified - after 3.0.0 multiple localizations are possible
- when using older versions only one localization can be specified
- skipping multiple localization if Faker version is too low
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @xamm !
I think the concept of the proposed changes is fine, but I added a few suggestions to make improve the overall implementation, especially on the tests side.
Would you mind having a look at those?
.. note:: Specifying localizations and using ``Faker`` categories may result in an error | ||
if the defined ``pii_category`` is not available for all specified languages. | ||
|
||
.. warning:: When using versions of ``Faker`` prior to ``3.0.0``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think there is any strong reason to keep supporting older Faker versions if they lack features, so rather than having multiple behaviors on our side depending on their version I would simply change the supported range to >3.0.0
.
Could we do the change on the setup.py
and remove this comment and the skip in the tests below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this the minimal version for Faker to 3.0.0 in the setup.py
and the .yaml
meta files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the minimum version is changed I suppose this warning
can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xamm did you see my previous message? This warning
can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry, I did not see that.
assert len(ean_8) == 8 | ||
assert len(ean_13) == 13 | ||
|
||
def test__make_anonymization_mappings_unique_faked_value_in_field(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests seem to be too broad, since they test multiple things at once. I would rather make tests for the individual methods that are involved to test each piece of code on its own. I suggest the following changes:
- Create a couple of tests for the
_get_faker
method (with and without locales) where we ensure that when locales are defined we call Faker passing them properly (we canpatch
Faker for this), and that we do not otherwise. - Create one test for the
_get_fake_values
suggested above, mocking the faker method to ensure that faker is called as required. - Make a single test for
_make_anonymization_mappings
to ensure that the mapping is created properly, mocking the_get_fake_values
to make the test easier.
On the other hand, when testing the _make_anonymization_mappings
method I think that we should not really be validating the returned dict (see comment above about removing the return
statement), but rather the values that are set in the self._ANONYMIZATION_MAPPINGS
attribute, since those are the ones that will end up being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand how this should be mocked. Mocking Faker seems to be quite complicated.
What exactly should be mocked for the _get_faker
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about using patch
for this, but now that I see the tests that you implemented, I think it is not necessary.
Just validating that the output is a Faker
instance and that the locales
are being properly set is OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope, that this is changed as you expected with the new commits.
assert list(foo_mappings.keys()) == ['test1@example.com', 'test2@example.com'] | ||
|
||
@pytest.mark.skipif(faker.VERSION < "3.0.0", reason="Higher version of Faker required.") | ||
def test__make_anonymization_mappings_multiple_localizations(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not necessary to distinguish between single and multiple localizations, since this is something that is handled by Faker internally. Here we should be testing only the SDV code, so all we need to ensure is that the localizations are passed properly when creating the Faker instance, regardless of their number.
On the other hand, we could add an integration
test where this functionality is fully tested end to end, and there we can add 1 field with a single locale and another one with 2 or more, so the Faker
behavior is implicitly tested too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to keep both of the tests to ensure that the API of Faker is not changed in future versions.
The Unit tests might be more explicit what fails than an end-to-end integration test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this ultimately should be an integration test, as it is testing how the SDV code integrates with a 3rd party library. Adding it as an integration test would allow us to test the interaction without knowing the specific inner-workings of Faker. However, I think these unit tests are sufficient for now, and we can migrate this to integration tests further down the line, as we flesh out the testing scaffold. Tracking here: #624
tests/unit/metadata/test_table.py
Outdated
- The mappings created from the original values to localized faked values. | ||
""" | ||
# Setup | ||
def _mock_faker_getattr(obj, fn_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not apply here any more if we change the tests, but I wanted to share one thought that may apply to the new tests: When mocking third party libraries we generally use the patch
decorator on the test case function, accepting a Mock
instance as an additional argument, and then if necessary set the return_value
or side_effect
function on it.
By doing it this way, we can fully configure the Mock before running the tested method, and then assert on the calls that the Mock received.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it also possible to define the mock function outside of the tests functions?
The same mock is used multiple times and for me it would make sense to pull this outside of the Test setup of each test.
Like this:
def _mock_faker_getattr(obj, fn_name):
if fn_name == "company":
return lambda: obj.__lang__
else:
return getattr(obj, fn_name)
@patch('faker.generator.getattr', _mock_faker_getattr)
Then no mock must be created / changed inside the test.
- adds new function to generate fake values
- use the private attribute _ANONYMIZATION_MAPPINGS for tests
- remove skipping test when lower version is used
assert len(ean_8) == 8 | ||
assert len(ean_13) == 13 | ||
|
||
def test__make_anonymization_mappings_unique_faked_value_in_field(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about using patch
for this, but now that I see the tests that you implemented, I think it is not necessary.
Just validating that the output is a Faker
instance and that the locales
are being properly set is OK.
sdv/metadata/table.py
Outdated
The Faker object to anonymize the data in the field using its functions. | ||
""" | ||
pii_locales = field_metadata.get('pii_locales', None) | ||
return Faker(locale=pii_locales) if pii_locales is not None else Faker() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the default value for the locale
argument is None
, so I think this if/else
is not needed.
We can just pass pii_locales
down to Faker
and it will work in all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is removed, but i kept the local pii_locales
varialbe for clarity.
If wanted, this could also be moved inside the call:
Faker(locale=field_metadata.get('pii_locales', None))
tests/unit/metadata/test_table.py
Outdated
} | ||
} | ||
} | ||
metadata = Table.from_dict(metadata_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think creating an instance will not really needed if we make the method static
, so this can remove this setup phase after that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this local variable in tests where metadata
was only used once.
When this local variable was used more than once I kept the local variable inside the Setup section.
I will look at the comments as soon as I have time to edit them. I should have time this week. |
- moves getting concrete faker attribute to closure - moves functions in order to always define before using
- only import Faker object from the library
- when only used once use static method directly in Run section
Also what might be fixable in this PR (while it technically would be out of scope) are these unused lines:
They are defined but I could not find an instance of their usage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost ready @xamm so I think it can already be approved.
I just added a couple of minor comments, and once those are addressed we can merge,
.. note:: Specifying localizations and using ``Faker`` categories may result in an error | ||
if the defined ``pii_category`` is not available for all specified languages. | ||
|
||
.. warning:: When using versions of ``Faker`` prior to ``3.0.0``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xamm did you see my previous message? This warning
can be removed.
assert list(foo_mappings.keys()) == ['test1@example.com', 'test2@example.com'] | ||
|
||
@pytest.mark.skipif(faker.VERSION < "3.0.0", reason="Higher version of Faker required.") | ||
def test__make_anonymization_mappings_multiple_localizations(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this ultimately should be an integration test, as it is testing how the SDV code integrates with a 3rd party library. Adding it as an integration test would allow us to test the interaction without knowing the specific inner-workings of Faker. However, I think these unit tests are sufficient for now, and we can migrate this to integration tests further down the line, as we flesh out the testing scaffold. Tracking here: #624
- version is not supported anymore
- _get_faker - _get_faker_method - _get_fake_values - add doc string
This PR adds a feature to use multiple localizations for PII fields. see #308
Localizations can be specified per field with the
pii_locales
attribute in the metadata.Localizations can be OrderedDicts to give weight to the localizations and change the probability of how often a particular localization is used to create fake values for the field.