Add anonymization to DataProcessor #950

Closed
amontanez24 opened this issue Aug 11, 2022 · 0 comments · Fixed by #1008

Comments

@amontanez24
Contributor

Problem Description

As a user, I would like data that I mark as pii in the metadata to be anonymized.
The new DataProcessor class should use the AnonymizedFaker class to handle this anonymization.

Expected behavior

  • When configuring which transformer to use for each field, if the sdtype is not one of the basic sdtypes with a predefined transformer, we should do the following:
    • If the field is marked as pii, configure an AnonymizedFaker instance based on the details in the Additional context section.
    • Otherwise, use the default categorical transformer.
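
The decision described in the list above could be sketched roughly as follows. This is only an illustration, not the DataProcessor implementation: the function name `choose_transformer_kind`, the returned labels, and the contents of `BASIC_SDTYPES` are assumptions made for the example.

```python
# Assumption: the "basic" sdtypes that already have a predefined transformer.
# The issue does not list them; this set is a placeholder for illustration.
BASIC_SDTYPES = {"numerical", "categorical", "boolean", "datetime"}


def choose_transformer_kind(sdtype, pii=False):
    """Return a label for the kind of transformer the rules would select.

    Hypothetical helper: it only names the branch taken and does not
    build any real transformer object.
    """
    if sdtype in BASIC_SDTYPES:
        # Basic sdtypes keep their predefined transformer.
        return "predefined"
    if pii:
        # Non-basic sdtype marked as pii: anonymize with AnonymizedFaker.
        return "anonymized_faker"
    # Non-basic sdtype that is not pii: fall back to the categorical default.
    return "categorical"
```

For example, `choose_transformer_kind("email", pii=True)` takes the anonymization branch, while the same sdtype without the pii flag falls back to the categorical default.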

Additional context

The transformer we select when the field is marked as pii depends on the sdtype. We should follow the rules below:

| Data Type | Transformer (if PII) |
|---|---|
| "physical_address" | AnonymizedFaker(provider_name="address", function_name="address") |
| "email" | AnonymizedFaker(provider_name="internet", function_name="email") |
| "latitude" | AnonymizedFaker(provider_name="geo", function_name="latitude") |
| "longitude" | AnonymizedFaker(provider_name="geo", function_name="longitude") |
| "ipv4_address" | AnonymizedFaker(provider_name="internet", function_name="ipv4") |
| "ipv6_address" | AnonymizedFaker(provider_name="internet", function_name="ipv6") |
| "mac_address" | AnonymizedFaker(provider_name="internet", function_name="mac_address") |
| "name" | AnonymizedFaker(provider_name="person", function_name="name") |
| "phone_number" | If premium: AnonymizedGeoExtractor(). If not premium: AnonymizedFaker(provider_name="phone_number", function_name="phone_number") |
| "ssn" | AnonymizedFaker(provider_name="ssn", function_name="ssn") |
| "user_agent_string" | AnonymizedFaker(provider_name="user_agent", function_name="user_agent") |
| "<function_name>" OR "<provider>.<function>", e.g. "lorem.sentence" | AnonymizedFaker(provider_name="lorem", function_name="sentence") |
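
One way to encode the fixed rows of the table above is a plain lookup from sdtype to the two AnonymizedFaker arguments. This is a sketch under assumptions: `PII_FAKER_ARGS` and `faker_kwargs_for` are hypothetical names, the premium `phone_number` branch is left out, and nothing here imports the real transformer.

```python
# Hypothetical lookup built from the table: sdtype -> (provider_name, function_name).
# Only the non-premium phone_number rule is represented.
PII_FAKER_ARGS = {
    "physical_address": ("address", "address"),
    "email": ("internet", "email"),
    "latitude": ("geo", "latitude"),
    "longitude": ("geo", "longitude"),
    "ipv4_address": ("internet", "ipv4"),
    "ipv6_address": ("internet", "ipv6"),
    "mac_address": ("internet", "mac_address"),
    "name": ("person", "name"),
    "phone_number": ("phone_number", "phone_number"),
    "ssn": ("ssn", "ssn"),
    "user_agent_string": ("user_agent", "user_agent"),
}


def faker_kwargs_for(sdtype):
    """Return the keyword arguments for AnonymizedFaker, or None if unlisted."""
    args = PII_FAKER_ARGS.get(sdtype)
    if args is None:
        # Not one of the fixed rows; the caller has to use a fallback rule.
        return None
    provider_name, function_name = args
    return {"provider_name": provider_name, "function_name": function_name}
```

The returned dictionary uses the same keyword names as the table, so it could presumably be unpacked into the transformer as `AnonymizedFaker(**kwargs)`.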

If any other sdtype is provided, we should search the Faker function names for one that matches the sdtype and use it if a match is found. If we cannot perform that search, we will require users to provide the sdtype in the following format: provider_name.function_name
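
The fallback for unlisted sdtypes might look something like the sketch below. It is an assumption-laden illustration: `resolve_faker_function` is a hypothetical helper, and `provider_functions` stands in for whatever index of Faker providers and function names the real implementation would build. Here it is a small hand-written sample rather than anything read from Faker itself.

```python
def resolve_faker_function(sdtype, provider_functions):
    """Resolve an unlisted sdtype to a (provider_name, function_name) pair.

    provider_functions is assumed to map each provider name to the set of
    function names it offers.
    """
    if "." in sdtype:
        # Explicit "provider_name.function_name" format, e.g. "lorem.sentence".
        provider_name, function_name = sdtype.split(".", 1)
        return provider_name, function_name
    # Otherwise search the known function names for one matching the sdtype.
    for provider_name, function_names in provider_functions.items():
        if sdtype in function_names:
            return provider_name, sdtype
    raise ValueError(
        f"No Faker function matches sdtype {sdtype!r}; "
        "use the 'provider_name.function_name' format instead."
    )


# Illustrative sample index only, not a complete list of Faker functions.
SAMPLE_INDEX = {
    "lorem": {"sentence", "word"},
    "company": {"company"},
}
```

With this sample index, both `"sentence"` and `"lorem.sentence"` resolve to the `lorem` provider and the `sentence` function, while an sdtype with no match raises an error that points the user to the explicit format.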

@amontanez24 amontanez24 added feature request Request for a new feature new Automatic label applied to new issues labels Aug 11, 2022
@npatki npatki removed the new Automatic label applied to new issues label Aug 11, 2022
@amontanez24 amontanez24 added this to the 1.0.0 milestone Aug 16, 2022
3 participants