# Using YData to synthesize anonymized Personal Information

Many datasets contain Personal Information (PI) and therefore cannot easily be shared. Simply synthesizing a dataset is not enough to remove the personal information. Although the synthetic dataset does not contain any record from the original data, it might still contain individual values from the original dataset that represent PI. For instance, in some contexts, the mere presence of a city name might leak information about the entire dataset.

To solve this problem, YData makes it possible to anonymize any field so that the synthetic data does not contain any PI. The anonymizer mechanism provides several pre-configured anonymizers that correspond to the most common scenarios (city, address, names, IP address) and also lets you specify a regular expression to match any format that you might have (e.g. an internal customer ID format).

In this notebook, we demonstrate how we can configure the synthesizer to anonymize some of our fields, in particular the customer ID.

The dataset used in this notebook can be found at https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset

## Dataset exploration

In this section, we simply load the dataset and display a few rows to observe that it contains personal information that should be anonymized.

In [3]:
# Importing YData's packages
from ydata.platform.datasources import DataSources
from ydata.metadata import Metadata
# Creating a Dataset from the Data Source
datasource = DataSources.get(uid='{insert-uid}', namespace='{insert-namespace}')
dataset = datasource.dataset
df = dataset.to_pandas()

In [5]:
df.head()

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,3668-QPYBK,1,United States,California,Los Angeles,90003,"33.964131, -118.272783",33.964131,-118.272783,Male,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,1,86,3239,Competitor made better offer
1,9237-HQITU,1,United States,California,Los Angeles,90005,"34.059281, -118.30742",34.059281,-118.30742,Female,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,1,67,2701,Moved
2,9305-CDSKC,1,United States,California,Los Angeles,90006,"34.048013, -118.293953",34.048013,-118.293953,Female,...,Month-to-month,Yes,Electronic check,99.65,820.5,Yes,1,86,5372,Moved
3,7892-POOKP,1,United States,California,Los Angeles,90010,"34.062125, -118.315709",34.062125,-118.315709,Female,...,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes,1,84,5003,Moved
4,0280-XJGEX,1,United States,California,Los Angeles,90015,"34.039224, -118.266293",34.039224,-118.266293,Male,...,Month-to-month,Yes,Bank transfer (automatic),103.7,5036.3,Yes,1,89,5340,Competitor had better devices


## Anonymizing data during synthesis

In this example, we demonstrate how to anonymize the `CustomerID` column and the `City` column.

**Remark:** Keep in mind that this is not enough to rid this dataset of all personal information. The columns `Lat Long`, `Latitude`, `Longitude`, `Zip Code`, `State` and `Country` might also be considered sensitive. The purpose of this notebook, however, is only to demonstrate the Fabric Anonymizer.

The list of pre-configured anonymizer methods can be displayed as follows:

In [6]:
from ydata.preprocessors.methods.anonymization import AnonymizerType

dict(AnonymizerType.__members__)

{'REGEX': <AnonymizerType.REGEX: 0>,
 'IP': <AnonymizerType.IP: 1>,
 'IPV4': <AnonymizerType.IPV4: 2>,
 'IPV6': <AnonymizerType.IPV6: 3>,
 'HOSTNAME': <AnonymizerType.HOSTNAME: 4>,
 'LICENCE_PLATE': <AnonymizerType.LICENCE_PLATE: 5>,
 'ABA': <AnonymizerType.ABA: 6>,
 'BANK_COUNTRY': <AnonymizerType.BANK_COUNTRY: 7>,
 'BBAN': <AnonymizerType.BBAN: 8>,
 'IBAN': <AnonymizerType.IBAN: 9>,
 'SWIFT': <AnonymizerType.SWIFT: 10>,
 'BARCODE': <AnonymizerType.BARCODE: 11>,
 'COLOR': <AnonymizerType.COLOR: 12>,
 'COLOR_NAME': <AnonymizerType.COLOR_NAME: 13>,
 'COMPANY': <AnonymizerType.COMPANY: 14>,
 'COMPANY_SUFFIX': <AnonymizerType.COMPANY_SUFFIX: 15>,
 'CRYPTOCURRENCY': <AnonymizerType.CRYPTOCURRENCY: 16>,
 'CRYPTOCURRENCY_CODE': <AnonymizerType.CRYPTOCURRENCY_CODE: 17>,
 'CRYPTOCURRENCY_NAME': <AnonymizerType.CRYPTOCURRENCY_NAME: 18>,
 'CURRENCY': <AnonymizerType.CURRENCY: 19>,
 'CURRENCY_CODE': <AnonymizerType.CURRENCY_CODE: 20>,
 'CURRENCY_NAME': <AnonymizerType.CURRENCY_NAME: 21>,
 'CURREN
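
The printed dictionary is long and is cut off here, so it can be easier to query the enum by name than to scan the output. The sketch below uses a small stand-in enum, `DemoAnonymizerType` (hypothetical, holding only a few of the members shown above), so that it runs without YData installed. The helper `find_members` is likewise an assumption made for illustration and is not part of the YData API.

```python
from enum import Enum


class DemoAnonymizerType(Enum):
    """Stand-in for AnonymizerType; values copied from the output above."""
    REGEX = 0
    IP = 1
    IPV4 = 2
    IPV6 = 3
    HOSTNAME = 4


def find_members(enum_cls, fragment):
    """Return the member names that contain `fragment`, case-insensitively."""
    fragment = fragment.upper()
    # __members__ maps names to members, in definition order.
    return [name for name in enum_cls.__members__ if fragment in name]


print(find_members(DemoAnonymizerType, "ip"))    # names containing "IP"
print("CITY" in DemoAnonymizerType.__members__)  # membership test by name
```

With the real class, passing `AnonymizerType` instead of the stand-in should work the same way, since `__members__` is standard `Enum` behaviour.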

For `CustomerID` anonymization we will use a REGEX, as the values are specific to this dataset.    
On the other hand, `City` can leverage `AnonymizerType.CITY` to generate fake city names.
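
To give an intuition for what regex-driven anonymization means, here is a minimal sketch that produces random strings shaped like the customer IDs seen earlier (for example `3668-QPYBK`): four digits, a hyphen, and five uppercase letters. It is not YData's implementation, which this notebook does not show; `fake_customer_id` and `ID_PATTERN` are hypothetical names, and the generator is hand-written for this one format rather than derived from an arbitrary regular expression.

```python
import random
import re
import string

# Assumed ID shape: four digits, a hyphen, five uppercase letters.
ID_PATTERN = re.compile(r"[0-9]{4}-[A-Z]{5}")


def fake_customer_id(rng):
    """Build one random ID that follows the assumed shape."""
    digits = "".join(rng.choice(string.digits) for _ in range(4))
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(5))
    return f"{digits}-{letters}"


rng = random.Random(0)  # seeded only to make the sketch repeatable
fake_ids = [fake_customer_id(rng) for _ in range(5)]
print(fake_ids)

# Every generated value matches the pattern in full.
print(all(ID_PATTERN.fullmatch(value) for value in fake_ids))
```

The generated values carry no information about real customers; they only preserve the format of the column.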

The configuration is passed to the Synthesizer model as a mapping like the following:

In [7]:
anonymize = {
    'CustomerID': r'[0-9]{4}-[A-Z]{5}',  # Regex as a string is deduced automatically as AnonymizerType.REGEX
    'City': AnonymizerType.CITY  # Direct usage of AnonymizerType
}
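
Before training, it can be worth checking that the regular expression really describes the values in the original column, since a pattern that misses the real format would produce IDs that look different from the source data. The sketch below does this with the standard `re` module on the five IDs displayed by `df.head()`; the helper `non_matching` is a hypothetical name introduced for illustration.

```python
import re

CUSTOMER_ID_REGEX = r"[0-9]{4}-[A-Z]{5}"

# The five CustomerID values shown in the df.head() output above.
observed_ids = ["3668-QPYBK", "9237-HQITU", "9305-CDSKC", "7892-POOKP", "0280-XJGEX"]


def non_matching(values, pattern):
    """Return the values that the pattern does not match in full."""
    compiled = re.compile(pattern)
    return [v for v in values if compiled.fullmatch(str(v)) is None]


print(non_matching(observed_ids, CUSTOMER_ID_REGEX))  # → []
```

In the notebook, the same helper could presumably be applied to the whole column with `non_matching(df['CustomerID'], anonymize['CustomerID'])`.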

## Training the synthesizer

In [8]:
meta = Metadata(dataset)

[########################################] | 100% Completed | 101.39 ms
[########################################] | 100% Completed | 1.81 s


In [9]:
print(meta)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 33
Duplicate rows: 9
Target column:

Column detail:
               Column    Data type Variable type
0          CustomerID  categorical        string
1               Count    numerical           int
2             Country  categorical        string
3               State  categorical        string
4                City  categorical        string
5            Zip Code    numerical           int
6            Lat Long  categorical        string
7            Latitude    numerical         float
8           Longitude    numerical         float
9              Gender  categorical        string
10     Senior Citizen  categorical        string
11            Partner  categorical        string
12         Dependents  categorical        string
13      Tenure Months    numerical           int
14      Phone Service  categorical        string
15     Multiple Lines  c

In [10]:
from ydata.synthesizers.regular import RegularSynthesizer

synth = RegularSynthesizer()
synth.fit(dataset, 
          metadata=meta,
          anonymize=anonymize)

INFO: 2022-12-08 15:54:47,307 [SYNTHESIZER] - Number columns considered for synth: 33
INFO: 2022-12-08 15:54:51,177 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2022-12-08 15:54:51,179 [SYNTHESIZER] - Preprocess segment
INFO: 2022-12-08 15:54:51,185 [SYNTHESIZER] - Synthesizer init.
INFO: 2022-12-08 15:54:51,186 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.


<ydata.synthesizers.regular.model.RegularSynthesizer at 0x7f7294b827f0>

In [11]:
synth_sample = synth.sample(len(dataset))

INFO: 2022-12-08 15:54:54,021 [SYNTHESIZER] - Start generating model samples.


In [12]:
synth_sample.head(100)

Unnamed: 0,CustomerID,Count,Country,State,City,Zip Code,Lat Long,Latitude,Longitude,Gender,...,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn Label,Churn Value,Churn Score,CLTV,Churn Reason
0,9112-AAOKW,1,United States,California,Andrewton,92867,"33.81859, -117.821288",33.819385,-117.821288,Female,...,Month-to-month,Yes,Bank transfer (automatic),29.85,381.2,No,0,29,5974,
1,2810-MYEQY,1,United States,California,North Stevenport,90623,"33.850504, -118.039892",33.859171,-118.039892,Female,...,Month-to-month,Yes,Electronic check,69.50,1108,No,0,75,3585,
2,5368-AVJEK,1,United States,California,South Brucemouth,95912,"38.982373, -122.047751",38.982373,-122.047751,Female,...,One year,No,Bank transfer (automatic),24.60,692.1,Yes,1,97,5638,Moved
3,5373-RYBED,1,United States,California,New William,96057,"41.251322, -122.105209",41.251322,-121.160249,Female,...,Two year,No,Credit card (automatic),19.25,1240.8,No,0,46,4149,
4,2649-EVORV,1,United States,California,East Michael,90260,"33.97803, -118.217141",33.978030,-118.217141,Male,...,One year,Yes,Bank transfer (automatic),79.85,5662.25,No,0,73,5327,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,5303-CUWHH,1,United States,California,Evanstad,92054,"33.194742, -117.29032",33.200369,-117.285879,Female,...,One year,No,Mailed check,20.35,826,No,0,46,3573,
96,4581-SQWIG,1,United States,California,Rodriguezshire,96096,"40.759401, -122.939337",40.759401,-121.906949,Male,...,Two year,Yes,Credit card (automatic),20.25,1270.55,No,0,52,6186,
97,7366-NITBG,1,United States,California,Jeffreyville,91709,"33.942895, -117.725644",33.942895,-117.725644,Female,...,Month-to-month,No,Mailed check,78.95,319.6,Yes,1,68,4127,Lack of self-service on Website
98,2804-WQGLD,1,United States,California,Port Garytown,91504,"34.188339, -118.300942",34.188339,-118.310030,Female,...,Month-to-month,Yes,Electronic check,69.95,325.45,No,0,21,2547,


As expected, the final dataset contains neither the original `CustomerID` values nor the original cities.

In [13]:
sample_df = synth_sample.to_pandas()

In [14]:
sample_customers = list(sample_df['CustomerID'].unique())
original_customers = list(df['CustomerID'].unique())
len([c for c in original_customers if c in sample_customers])

0
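
The overlap check above can be combined with a format check, so that one call reports both leaked values and synthetic values that do not follow the expected pattern. The following is a minimal sketch using only the standard library; `leakage_report` is a hypothetical helper, and the short lists are a few IDs copied from the tables above rather than the full columns.

```python
import re


def leakage_report(original, synthetic, pattern):
    """Report values shared with the original and values with a bad format."""
    original_set = set(original)
    synthetic_set = set(synthetic)
    compiled = re.compile(pattern)
    return {
        # Values present in both columns, i.e. potential leaks.
        "shared_values": sorted(original_set & synthetic_set),
        # Synthetic values that the pattern does not match in full.
        "bad_format": sorted(
            v for v in synthetic_set if compiled.fullmatch(str(v)) is None
        ),
    }


original_ids = ["3668-QPYBK", "9237-HQITU", "9305-CDSKC"]
synthetic_ids = ["9112-AAOKW", "2810-MYEQY", "5368-AVJEK"]

report = leakage_report(original_ids, synthetic_ids, r"[0-9]{4}-[A-Z]{5}")
print(report)  # → {'shared_values': [], 'bad_format': []}
```

Using sets keeps the comparison fast on large columns; in the notebook the inputs would be `df['CustomerID']` and `sample_df['CustomerID']`.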

In [15]:
sample_cities = set(sample_df['City'].unique())
original_cities = set(df['City'].unique())
# Compare city names; iterating over value_counts() would yield the counts instead
len(original_cities & sample_cities)

0