# Anonymizing PII

Masking and anonymization protect sensitive information and help organizations meet regulations such as GDPR and HIPAA. They let teams use data in development, testing, and analytics without exposing personal details, which reduces risk and builds trust.

Integrating these techniques into recurring data pipelines makes data protection automated, consistent, and scalable: data is anonymized or masked on every run, the rules can be adapted as regulations evolve, and each run provides traceability for audits.

## Reading the data from the Data Catalog

In [2]:
# Import YData's packages
from ydata.labs import DataSources
# Read the dataset from the DataSource
datasource = DataSources.get(uid='{insert-uid}', namespace='{insert-namespace-id}')
dataset = datasource.dataset
# Get the precomputed metadata, which holds the profile overview shown in the Labs
metadata = datasource.metadata
print(metadata)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 10
Number of rows: 50000
Duplicate rows: 0
Target column:

Column detail:
           Column    Data type Variable type Characteristics
0   customer_name       string        string    name, person
1           email       string        string       email, id
2    phone_number       string        string           phone
3         address     longtext        string         address
4   purchase_date       string        string                
5         product       string        string                
6        quantity  categorical           int                
7           price    numerical           int                
8  payment_method  categorical        string                
9    order_status  categorical        string                

0      cardinality  [customer_name, email, phone_number, purchase_date, product]
1            zeros       

In [3]:
dataset.head()

| | customer_name | email | phone_number | address | purchase_date | product | quantity | price | payment_method | order_status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alexandra Harris | ncarrillo@example.net | 017.770.9160x21749 | Unit 3045 Box 9416 DPO AA 76688 | 2021-12-23 | idea | 2 | 243 | Gift Card | Delivered |
| 1 | Richard Graham | chasejessica@example.org | 001-785-880-4267x26931 | 5431 Christine Lake Suite 050 Lake Kevin, AZ 7... | 2022-10-24 | treat | 3 | 237 | Credit Card | Returned |
| 2 | Christopher Carlson MD | charris@example.org | +1-206-791-1312x020 | 4754 White Pass Apt. 984 Smithchester, IA 58897 | 2021-01-04 | occur | 3 | 351 | PayPal | Cancelled |
| 3 | Katie Gould | griffingina@example.net | 840-176-1884x3130 | 10711 Ruiz Islands Krystalfurt, IL 75583 | 2023-08-23 | able | 5 | 495 | Gift Card | Shipped |
| 4 | Theresa Young | troy45@example.com | (132)320-3616x6550 | USNS Maldonado FPO AP 26854 | 2022-02-03 | sense | 3 | 230 | PayPal | Pending |


## Anonymizing the information

### Leveraging Fabric suggestions

In [5]:
from ydata.characteristics.characteristics import suggest_anonymizer_config
from ydata.preprocessors.methods.anonymization import AnonymizerConfigurationBuilder, AnonymizerType

# Ask Fabric to suggest anonymizers based on the dataset metadata
anonymizer_config = suggest_anonymizer_config(metadata)

# Keep the first suggested anonymizer type for each column
config = {}
for col, v in anonymizer_config.items():
    config[col] = v[0]['type']

In [6]:
builder = AnonymizerConfigurationBuilder(config)

{'customer_name': <AnonymizerType.NAME: 27>,
 'email': <AnonymizerType.EMAIL: 16>,
 'phone_number': <AnonymizerType.PHONE: 43>,
 'address': <AnonymizerType.FULL_ADDRESS: 36>}
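
The loop in the previous cell keeps only the first suggestion for each column. The snippet below is a minimal sketch of that selection step in plain Python. The `suggestions` dict is hand-written for illustration: its shape (column name mapped to a list of dicts with a `type` key) is inferred from the `v[0]['type']` access, the strings stand in for `AnonymizerType` members, and treating the first entry as the preferred one is an assumption rather than documented Fabric behavior.

```python
# Hypothetical stand-in for the result of suggest_anonymizer_config(metadata).
# Assumed shape: column name -> list of suggestions, each a dict with a "type" key.
suggestions = {
    "customer_name": [{"type": "NAME"}],
    "email": [{"type": "EMAIL"}, {"type": "REGEX"}],
    "phone_number": [{"type": "PHONE"}],
}

# Keep the first suggestion for each column (assumed to be the preferred one).
first_choice = {col: options[0]["type"] for col, options in suggestions.items()}

# The mapping is an ordinary dict, so it can be adjusted before anonymizing,
# for example by dropping a column that should not be part of the configuration.
first_choice.pop("email")

print(first_choice)
```

Because the result is a regular dictionary, reviewing or editing the suggestions before passing them on requires no special tooling.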

In [8]:
from ydata.preprocessors.preprocess_methods import AnonymizerEngine

anonymizer = AnonymizerEngine()
anon_dataset = anonymizer.fit_transform(X=dataset, config=config, metadata=metadata)

In [9]:
anon_dataset.head()

| | customer_name | email | phone_number | address | purchase_date | product | quantity | price | payment_method | order_status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mrs. Jennifer Yoder MD | lwest@example.com | 9124138147 | Unit 1039 Box 3940\nDPO AE 32431 | 2021-12-23 | idea | 2 | 243 | Gift Card | Delivered |
| 1 | John Walker | pamela00@example.net | 9380480273 | 178 Suzanne Shoals\nLake Vicki, WA 93295 | 2022-10-24 | treat | 3 | 237 | Credit Card | Returned |
| 2 | Brandi Howe | vhernandez@example.org | 6386415424 | 12169 Soto Street Apt. 219\nEast Ryanchester, ... | 2021-01-04 | occur | 3 | 351 | PayPal | Cancelled |
| 3 | Misty Graham | jenningsalexander@example.org | 6880352781 | 3537 Arnold Bypass Apt. 949\nJohntown, DE 70208 | 2023-08-23 | able | 5 | 495 | Gift Card | Shipped |
| 4 | Phillip Harris | patricia32@example.org | 3904188171 | 0386 Lee Extensions\nStephaniefort, ME 70718 | 2022-02-03 | sense | 3 | 230 | PayPal | Pending |
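
A quick sanity check after anonymizing is to confirm that only the configured columns changed. The sketch below does this with plain dictionaries, using the first row of the original and anonymized previews above. `changed_columns` is a hypothetical helper written for this example, not part of the Fabric SDK.

```python
def changed_columns(before: dict, after: dict) -> set:
    """Return the column names whose values differ between two rows."""
    return {col for col in before if before[col] != after.get(col)}


# First row of the original preview.
original_row = {
    "customer_name": "Alexandra Harris",
    "email": "ncarrillo@example.net",
    "phone_number": "017.770.9160x21749",
    "address": "Unit 3045 Box 9416 DPO AA 76688",
    "purchase_date": "2021-12-23",
    "product": "idea",
    "quantity": 2,
    "price": 243,
    "payment_method": "Gift Card",
    "order_status": "Delivered",
}

# First row of the anonymized preview: only the four PII columns differ.
anonymized_row = {
    **original_row,
    "customer_name": "Mrs. Jennifer Yoder MD",
    "email": "lwest@example.com",
    "phone_number": "9124138147",
    "address": "Unit 1039 Box 3940\nDPO AE 32431",
}

pii_columns = {"customer_name", "email", "phone_number", "address"}

# Only the configured PII columns should differ between the two rows.
assert changed_columns(original_row, anonymized_row) == pii_columns
```

This only shows that the values changed and that the remaining columns were left intact; it says nothing about how hard the new values are to link back to the originals.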


### User-provided configuration

In [27]:
config = {
    # The anonymizer type can be given as a string or as an AnonymizerType member
    'customer_name': {
        "type": "name",
    },
    # The string "regex" is resolved to AnonymizerType.REGEX;
    # the pattern below generates nine-digit values
    'phone_number': {
        "type": "regex",
        "regex": r'[0-9]{9}',
    },
    'email': {
        "type": "email",
    },
    'address': {
        "type": AnonymizerType.FULL_ADDRESS,
    },
}

builder = AnonymizerConfigurationBuilder(config)

In [28]:
from ydata.preprocessors.preprocess_methods import AnonymizerEngine

anonymizer = AnonymizerEngine()
anon_dataset = anonymizer.fit_transform(X=dataset, config=config, metadata=metadata)

In [30]:
anon_dataset.head()

| | customer_name | email | phone_number | address | purchase_date | product | quantity | price | payment_method | order_status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Heather Johnson | williamguerrero@example.com | 134476637 | 7630 Smith Wall Suite 015\nWest Jordanville, M... | 2021-12-23 | idea | 2 | 243 | Gift Card | Delivered |
| 1 | Michael Payne | karen46@example.org | 142075635 | 6592 Ronald Stream\nEast Bernardtown, WY 79262 | 2022-10-24 | treat | 3 | 237 | Credit Card | Returned |
| 2 | Ronnie Phillips | curtis79@example.com | 771248879 | 882 Jackson Square\nDavisbury, MI 60158 | 2021-01-04 | occur | 3 | 351 | PayPal | Cancelled |
| 3 | Henry Chapman | ashley06@example.com | 657801551 | 54911 Jose Knoll Apt. 278\nGilesmouth, OR 91171 | 2023-08-23 | able | 5 | 495 | Gift Card | Shipped |
| 4 | Rebecca Crawford | rebeccarice@example.net | 306272013 | 384 Marcus Tunnel Suite 163\nLake Lisatown, MP... | 2022-02-03 | sense | 3 | 230 | PayPal | Pending |
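
The `regex` entry in the configuration asks for values matching `[0-9]{9}`, that is, exactly nine digits. As a minimal check, the sketch below uses Python's standard `re` module to confirm that the `phone_number` values from the preview above fit the pattern, while one of the original values does not. The variable names are illustrative.

```python
import re

# Same pattern as in the anonymizer configuration.
pattern = re.compile(r"[0-9]{9}")

# phone_number values copied from the anonymized preview.
generated = ["134476637", "142075635", "771248879", "657801551", "306272013"]

# One original value, for contrast.
original = "017.770.9160x21749"

# fullmatch succeeds only if the whole string consists of exactly nine digits.
all_match = all(pattern.fullmatch(value) for value in generated)

print(all_match)                    # True
print(pattern.fullmatch(original))  # None
```

A check like this can run as a validation step in a recurring pipeline so that a misconfigured pattern is caught before the data is written out.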


## Write dataset to storage

In [33]:
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
connector = Connectors.get(uid='{insert-connector-uid}')
connector.write_table(data=anon_dataset, name='anonymized_data')

