# Identifying and anonymizing Personally Identifiable Information (PII) with ydata-sdk

Many datasets contain personal information (PI) and therefore cannot easily be shared. Synthesizing a dataset is not enough to remove that information: although the synthetic dataset does not contain any records from the original data, it may still include values that represent personally identifiable information (PII). For instance, in some contexts, the mere presence of a city name might leak information about the entire dataset.

To solve this problem, YData lets you anonymize any field so that the synthetic data does not contain any PI. The anonymizer mechanism provides several pre-configured anonymizers that correspond to the most common scenarios (city, address, names, IP address) and also lets you specify a regular expression to match any format that you might have (e.g. an internal customer ID format).

By the end of this notebook, you will learn how to:
- Detect columns that contain PII
- Apply built-in anonymization strategies (e.g., names, locations, IDs)
- Generate a sanitized version of your dataset using `ydata-sdk`

The dataset used in this notebook can be found at https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset

## Authenticate with your YData account

In [None]:
# Authenticate with your ydata-sdk token - https://dashboard.ydata.ai/
import os

os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'

## Identifying PII with ydata-sdk

In this section, we load the dataset and display a few rows to observe that it contains personal information that should be anonymized.

In [None]:
import pandas as pd

from ydata.dataset import Dataset

# Step 1: Load or create your Dataset
df = pd.read_csv('insert-file-path')
dataset = Dataset(df)

In [5]:
df.head()

|  | CustomerID | Count | Country | State | City | Zip Code | Lat Long | Latitude | Longitude | Gender | ... | Contract | Paperless Billing | Payment Method | Monthly Charges | Total Charges | Churn Label | Churn Value | Churn Score | CLTV | Churn Reason |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3668-QPYBK | 1 | United States | California | Los Angeles | 90003 | 33.964131, -118.272783 | 33.964131 | -118.272783 | Male | ... | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 | 86 | 3239 | Competitor made better offer |
| 1 | 9237-HQITU | 1 | United States | California | Los Angeles | 90005 | 34.059281, -118.30742 | 34.059281 | -118.30742 | Female | ... | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | 1 | 67 | 2701 | Moved |
| 2 | 9305-CDSKC | 1 | United States | California | Los Angeles | 90006 | 34.048013, -118.293953 | 34.048013 | -118.293953 | Female | ... | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes | 1 | 86 | 5372 | Moved |
| 3 | 7892-POOKP | 1 | United States | California | Los Angeles | 90010 | 34.062125, -118.315709 | 34.062125 | -118.315709 | Female | ... | Month-to-month | Yes | Electronic check | 104.8 | 3046.05 | Yes | 1 | 84 | 5003 | Moved |
| 4 | 0280-XJGEX | 1 | United States | California | Los Angeles | 90015 | 34.039224, -118.266293 | 34.039224 | -118.266293 | Male | ... | Month-to-month | Yes | Bank transfer (automatic) | 103.7 | 5036.3 | Yes | 1 | 89 | 5340 | Competitor had better devices |


In [1]:
from ydata.metadata import Metadata

# Step 2: Calculate the metadata
# When infer_characteristics is set to True, ydata-sdk will automatically infer data characteristics such as potential PII (e.g., email, name, etc.). By default, this option is set to False.
metadata = Metadata(dataset, infer_characteristics = True)
metadata


In [None]:
metadata.summary['characteristics']
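As a rough cross-check of the inferred characteristics, we can scan the raw columns ourselves with plain pandas. The sketch below is independent of `ydata-sdk`: the `PATTERNS` table, the `flag_columns` helper, and the 0.9 threshold are all hypothetical choices made for illustration, and two regular expressions are nowhere near a complete PII detector.

```python
import re

import pandas as pd

# Hypothetical patterns for illustration only; real PII detection needs far more care.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "id": re.compile(r"[0-9]{4}-[A-Z]{5}"),
}


def flag_columns(df: pd.DataFrame, threshold: float = 0.9) -> dict:
    """Map column name -> pattern label when at least `threshold`
    of the non-null values fully match that pattern."""
    flagged = {}
    for col in df.columns:
        values = df[col].dropna().astype(str)
        if values.empty:
            continue
        for label, pattern in PATTERNS.items():
            share = values.map(lambda v: pattern.fullmatch(v) is not None).mean()
            if share >= threshold:
                flagged[col] = label
                break
    return flagged


# A few values copied from the rows displayed earlier in this notebook
sample = pd.DataFrame({
    "CustomerID": ["3668-QPYBK", "9237-HQITU", "9305-CDSKC"],
    "City": ["Los Angeles", "Los Angeles", "Los Angeles"],
})
flagged = flag_columns(sample)
print(flagged)
```

Columns flagged here but missing from the inferred characteristics (or the reverse) are good candidates for a manual review.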

You can also define characteristics manually using the `characteristics` parameter.

This parameter expects a dictionary that explicitly maps column names to known PII types. This is useful when you already know which columns contain sensitive information. The format should be:

{ "column_name": "pii_type" }

For example:
{ "email": "email", "customer_id": "id" }
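Because the mapping is keyed by column name, a typo in a key can be easy to miss. The sketch below checks a mapping against a DataFrame before it is used anywhere; `validate_characteristics` is a hypothetical helper written for this notebook, not part of `ydata-sdk`, and the toy frame stands in for your real data.

```python
import pandas as pd


def validate_characteristics(df: pd.DataFrame, characteristics: dict) -> dict:
    """Return the mapping unchanged if every key is a column of df; raise otherwise."""
    missing = [col for col in characteristics if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in the dataset: {missing}")
    return characteristics


# Toy frame standing in for the real data
toy = pd.DataFrame({"email": ["a@example.com"], "customer_id": ["0001-ABCDE"]})
characteristics = validate_characteristics(
    toy, {"email": "email", "customer_id": "id"}
)
print(characteristics)
```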

## Anonymizing PII with ydata-sdk

Now that we've identified potential PII columns, we can define the logic to anonymize them using the Fabric Anonymizer.

In this example, we'll demonstrate how to anonymize two specific columns: `CustomerID` and `City`.

**Note:** This example focuses on demonstrating the anonymization workflow, not on fully securing the dataset. Other columns such as `Lat Long`, `Latitude`, `Longitude`, `Zip Code`, `State`, and `Country` may also contain sensitive information and should be reviewed carefully in a real-world scenario.

To view the list of available built-in anonymization methods, use the following command:

In [6]:
from ydata.preprocessors.methods.anonymization import AnonymizerType

# List the available anonymizer types
dict(AnonymizerType.__members__)

{'REGEX': <AnonymizerType.REGEX: 0>,
 'IP': <AnonymizerType.IP: 1>,
 'IPV4': <AnonymizerType.IPV4: 2>,
 'IPV6': <AnonymizerType.IPV6: 3>,
 'HOSTNAME': <AnonymizerType.HOSTNAME: 4>,
 'LICENCE_PLATE': <AnonymizerType.LICENCE_PLATE: 5>,
 'ABA': <AnonymizerType.ABA: 6>,
 'BANK_COUNTRY': <AnonymizerType.BANK_COUNTRY: 7>,
 'BBAN': <AnonymizerType.BBAN: 8>,
 'IBAN': <AnonymizerType.IBAN: 9>,
 'SWIFT': <AnonymizerType.SWIFT: 10>,
 'BARCODE': <AnonymizerType.BARCODE: 11>,
 'COLOR': <AnonymizerType.COLOR: 12>,
 'COLOR_NAME': <AnonymizerType.COLOR_NAME: 13>,
 'COMPANY': <AnonymizerType.COMPANY: 14>,
 'COMPANY_SUFFIX': <AnonymizerType.COMPANY_SUFFIX: 15>,
 'CRYPTOCURRENCY': <AnonymizerType.CRYPTOCURRENCY: 16>,
 'CRYPTOCURRENCY_CODE': <AnonymizerType.CRYPTOCURRENCY_CODE: 17>,
 'CRYPTOCURRENCY_NAME': <AnonymizerType.CRYPTOCURRENCY_NAME: 18>,
 'CURRENCY': <AnonymizerType.CURRENCY: 19>,
 'CURRENCY_CODE': <AnonymizerType.CURRENCY_CODE: 20>,
 'CURRENCY_NAME': <AnonymizerType.CURRENCY_NAME: 21>,
 'CURREN

For `CustomerID` anonymization we will use a REGEX anonymizer, as the values follow a format specific to this dataset.
On the other hand, `City` can leverage `AnonymizerType.CITY` to generate fake city names.

The configuration is to be passed to the Synthesizer model and looks like the following mapping:

In [7]:
# Step 1: Configure the Anonymizer – Define which masking or replacement strategy should be applied to each PII column based on its type and sensitivity.
anonymizer_config = {
    'CustomerID': {'type': 'regex', 'regex': r'[0-9]{4}-[A-Z]{5}'},  # Explicit regex type: 4 digits, a hyphen, then 5 uppercase letters
    'City': AnonymizerType.CITY  # Direct usage of AnonymizerType
}
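Before relying on the regular expression, it is worth confirming that it really describes the existing identifiers. The sketch below uses only the standard `re` module and the five `CustomerID` values displayed earlier in this notebook; on the full dataset the same check would run over the whole column.

```python
import re

# Same pattern as in anonymizer_config
customer_id_pattern = re.compile(r"[0-9]{4}-[A-Z]{5}")

# The five CustomerID values shown by df.head() above
sample_ids = ["3668-QPYBK", "9237-HQITU", "9305-CDSKC", "7892-POOKP", "0280-XJGEX"]

# fullmatch requires the entire string to fit the pattern, not just a prefix
non_matching = [cid for cid in sample_ids if customer_id_pattern.fullmatch(cid) is None]
print(non_matching)
```

An empty list means every sampled identifier fits the pattern, so the configured format is consistent with the original column.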

In [None]:
from ydata.preprocessors.preprocess_methods import AnonymizerEngine

# Step 2: Create your Anonymizer Engine
anonymizer = AnonymizerEngine()
dataset_anonymized = anonymizer.fit_transform(
    X=dataset,
    config=anonymizer_config,
    metadata=metadata,
)

dataset_anonymized.head()
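A simple sanity check after anonymization is to look for original values that survived in the anonymized columns. The sketch below works on two plain pandas DataFrames; `leaked_values` is a hypothetical helper, the anonymized values in the toy frame are made up for illustration, and converting the `Dataset` objects to pandas first (for example with a `to_pandas()` call) is an assumption about your setup.

```python
import pandas as pd


def leaked_values(original: pd.DataFrame, anonymized: pd.DataFrame, columns: list) -> dict:
    """For each column, return the set of original values still present after anonymization."""
    return {
        col: set(original[col].dropna()) & set(anonymized[col].dropna())
        for col in columns
    }


# Toy frames: originals copied from the rows above, anonymized values invented for the example
original = pd.DataFrame({
    "CustomerID": ["3668-QPYBK", "9237-HQITU"],
    "City": ["Los Angeles", "Los Angeles"],
})
anonymized = pd.DataFrame({
    "CustomerID": ["1042-ABCDE", "7731-ZXCVB"],
    "City": ["Springfield", "Riverton"],
})
leaks = leaked_values(original, anonymized, ["CustomerID", "City"])
print(leaks)
```

For `City`, a non-empty result is not necessarily a problem, since a generated city name can coincide with a real one by chance; for `CustomerID`, any overlap deserves a closer look.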