Setting a locale for all my anonymized (PII) columns #1371

Closed
npatki opened this issue Apr 12, 2023 · 0 comments · Fixed by #1383
Labels
feature request Request for a new feature
npatki commented Apr 12, 2023

Problem Description

Sometimes, I have a dataset with many sensitive columns such as address, phone_number, etc. Usually, the entire dataset comes from users in a specific region, so I want to set the same locales for all of these columns. SDV 1.0 supports this, but only one column at a time, which is cumbersome.

Current Functionality: Currently, I can set the locales individually on each transformer object using the Anonymization Settings:

from sdv.single_table import GaussianCopulaSynthesizer
from rdt.transformers.pii import AnonymizedFaker

synth = GaussianCopulaSynthesizer(my_metadata)
synth.auto_assign_transformers(my_data)

# update each PII column to use the desired locale, such as 'en_CA'
synth.update_transformers(column_name_to_transformer={
    'address': AnonymizedFaker(provider_name='address', function_name='address', locales=['en_CA']),
    'phone_number': AnonymizedFaker(provider_name='phone_number', function_name='phone_number', locales=['en_CA'])
})

synth.fit(my_data)
synthetic_data = synth.sample(num_rows=100)

Expected behavior

All synthesizers (single table, multi table and sequential) should accept a global locales parameter during initialization. This should set the locales for all the relevant column transformers at once.

from sdv.single_table import GaussianCopulaSynthesizer

synth = GaussianCopulaSynthesizer(my_metadata, locales=['en_CA'])

Note: if needed, it should still be possible to update the locales on individual columns by using auto_assign_transformers and update_transformers, as shown above.

Additional context

Internally, we should make sure that any time we assign an AnonymizedFaker, we pass in the locales specified in the synthesizer parameters.
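
As a rough illustration of that internal rule, here is a minimal, self-contained sketch. It does not use SDV or RDT code: `AnonymizedFakerStub`, `PII_SDTYPES`, and `assign_pii_transformers` are hypothetical names invented for this example, and the stub only records the arguments that a real `AnonymizedFaker` would receive. The idea is that the synthesizer-level `locales` value is passed to every PII transformer it creates, while a transformer supplied explicitly for a column is left untouched.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AnonymizedFakerStub:
    """Hypothetical stand-in for rdt's AnonymizedFaker (stores constructor args only)."""
    provider_name: str
    function_name: str
    locales: Optional[List[str]] = None  # None leaves the choice to Faker's default


# Hypothetical mapping from a PII sdtype to a (provider_name, function_name) pair.
PII_SDTYPES = {
    'address': ('address', 'address'),
    'phone_number': ('phone_number', 'phone_number'),
}


def assign_pii_transformers(
    column_sdtypes: Dict[str, str],
    locales: Optional[List[str]] = None,
    overrides: Optional[Dict[str, AnonymizedFakerStub]] = None,
) -> Dict[str, AnonymizedFakerStub]:
    """Build one transformer per PII column, passing the global locales to each."""
    overrides = overrides or {}
    transformers = {}
    for column, sdtype in column_sdtypes.items():
        if column in overrides:
            # A per-column transformer set by the user wins over the global setting.
            transformers[column] = overrides[column]
        elif sdtype in PII_SDTYPES:
            provider, function = PII_SDTYPES[sdtype]
            # Copy the list so transformers do not share one mutable object.
            column_locales = list(locales) if locales is not None else None
            transformers[column] = AnonymizedFakerStub(provider, function, column_locales)
    return transformers


columns = {'address': 'address', 'phone_number': 'phone_number', 'age': 'numerical'}
global_only = assign_pii_transformers(columns, locales=['en_CA'])
with_override = assign_pii_transformers(
    columns,
    locales=['en_CA'],
    overrides={'address': AnonymizedFakerStub('address', 'address', ['fr_CA'])},
)
```

In this sketch, non-PII columns such as `age` get no anonymizing transformer, and calling the helper without `locales` leaves each transformer's `locales` as `None`, which corresponds to letting Faker fall back to its default.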

If no locales are specified, Faker defaults to en_US, which matches the current behavior. So this change does not require any special handling for backwards compatibility.
