# DATA ANONYMIZATION

## Introduction

This notebook shows how to use the anonymization feature with a small example. We will start by setting up the notebook, then create our dummy dataset and its metadata, and finally `model` and `sample` the data, checking the differences in both the data and the internal state of the objects.

## Notebook preparation

In [1]:
import json

import numpy as np
import pandas as pd
import rdt

from sdv import SDV

## Creating dataset and metadata

We are going to create a dataset with a single table containing three columns: `index` (the primary key), `name` and `credit_card_number`. We will also create two metadata entries for that table: one that uses anonymization and one that does not.

In [2]:
metadata = {
    "tables": [
        {
            "fields": [
                {
                    "name": "index",
                    "type": "id"
                },
                {
                    "name": "name",
                    "type": "categorical",
                    "pii": True,
                    "pii_category": "first_name"
                },
                {
                    "name": "credit_card_number",
                    "type": "categorical",
                    "pii": True,
                    "pii_category": [
                        "credit_card_number",
                        "visa"
                    ]
                }
            ],
            "name": "anonymized",
            "primary_key": "index",
        },
        {
            "fields": [
                {
                    "name": "index",
                    "type": "id"
                },
                {
                    "name": "name",
                    "type": "categorical"
                },
                {
                    "name": "credit_card_number",
                    "type": "categorical"
                }
            ],
            "name": "normal",
            "primary_key": "index",
        }
    ]
}
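To make the difference between the two table entries explicit, the following sketch lists the fields flagged with `pii` in each table. It uses a trimmed copy of the metadata so that it runs on its own; `example_metadata` and `list_pii_fields` are illustrative names, not part of SDV.

```python
# Trimmed copy of the metadata above, so the sketch is self-contained.
example_metadata = {
    "tables": [
        {
            "name": "anonymized",
            "fields": [
                {"name": "index", "type": "id"},
                {"name": "name", "type": "categorical",
                 "pii": True, "pii_category": "first_name"},
                {"name": "credit_card_number", "type": "categorical",
                 "pii": True, "pii_category": ["credit_card_number", "visa"]},
            ],
        },
        {
            "name": "normal",
            "fields": [
                {"name": "index", "type": "id"},
                {"name": "name", "type": "categorical"},
                {"name": "credit_card_number", "type": "categorical"},
            ],
        },
    ]
}


def list_pii_fields(metadata):
    """Return {table name: {field name: pii_category}} for fields flagged as PII."""
    return {
        table["name"]: {
            field["name"]: field.get("pii_category")
            for field in table["fields"]
            if field.get("pii", False)
        }
        for table in metadata["tables"]
    }


pii_fields = list_pii_fields(example_metadata)
for table_name, fields in pii_fields.items():
    print(table_name, sorted(fields))
```

Only the `anonymized` entry should report `name` and `credit_card_number`; the `normal` entry should report no PII fields.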

In [3]:
# Generating data for table.
data = pd.DataFrame([
    {
        'index': 1,
        'name': 'Bill',
        'credit_card_number': '1111222233334444'
    },
    {
        'index': 2,
        'name': 'Jeff',
        'credit_card_number': '0000000000000000'
    },
    {
        'index': 3,
        'name': 'Bill',
        'credit_card_number': '9999999999999999'
    },
    {
        'index': 4,
        'name': 'Jeff',
        'credit_card_number': '8888888888888888'
    },
])

In [4]:
tables = {
    'anonymized': data,
    'normal': data.copy()
}
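Both entries are meant to hold the same content in two independent objects. A minimal sketch of how that could be checked, using a shortened stand-in for the data above (`example_data` and `example_tables` are illustrative names):

```python
import pandas as pd

# Shortened stand-in for the `data` DataFrame defined above.
example_data = pd.DataFrame({
    'index': [1, 2],
    'name': ['Bill', 'Jeff'],
    'credit_card_number': ['1111222233334444', '0000000000000000'],
})
example_tables = {
    'anonymized': example_data,
    'normal': example_data.copy(),
}

# Same content in both entries...
same_content = example_tables['anonymized'].equals(example_tables['normal'])
# ...but stored in two separate DataFrame objects.
independent = example_tables['anonymized'] is not example_tables['normal']

print(same_content, independent)
```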

Now we are going to generate the metadata. There are, apart from the anonymization parameters, two major differences with the other example metadata:

Now we have everything we need in order to model and sample our example dataset, that is:

- A table of data, held in the `data` DataFrame.
- Two table metadata entries, both **referring to the same table**, but only **one of them anonymizing the data**, each under a different name (`anonymized` and `normal`).
- A full metadata specification, including the two table entries mentioned above, held in the `metadata` dictionary.


## Modelling the dataset


Now that we have prepared our data and metadata, it is time to model and sample them. To do so, we will:

1. Create an instance of `SDV`.
2. Model the database calling its `fit` method.
3. Generate samples for each table.

In [5]:
from sdv import SDV

sdv = SDV()
sdv.fit(metadata, tables)

2019-11-03 16:06:36,940 - INFO - modeler - Modeling anonymized
2019-11-03 16:06:36,941 - INFO - metadata - Loading transformer CategoricalTransformer for field name
2019-11-03 16:06:36,942 - INFO - metadata - Loading transformer CategoricalTransformer for field credit_card_number
2019-11-03 16:06:36,989 - INFO - modeler - Modeling normal
2019-11-03 16:06:36,989 - INFO - metadata - Loading transformer CategoricalTransformer for field name
2019-11-03 16:06:36,989 - INFO - metadata - Loading transformer CategoricalTransformer for field credit_card_number
2019-11-03 16:06:37,006 - INFO - modeler - Modeling Complete


## Sample and compare results

Now we are ready to sample some data.

We will sample data from the tables `anonymized` and `normal`, which originate from the exact same dataframe. The behavior we expect is that, in the anonymized table, the unique values of the columns `credit_card_number` and `name` are not a subset of the unique values of the same columns in the original data table.
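One way to obtain this kind of behavior is to replace every distinct original value with a freshly generated fake one before modeling. The sketch below illustrates the idea only and is not presented as SDV's implementation: `fake_visa_number` is a hypothetical stand-in for a fake-data provider, and `anonymize_column` is an illustrative helper.

```python
import random

import pandas as pd


def fake_visa_number(rng):
    """Hypothetical stand-in for a fake-data provider: 16 digits starting with 4."""
    return '4' + ''.join(str(rng.randint(0, 9)) for _ in range(15))


def anonymize_column(column, make_fake, rng):
    """Replace each distinct value in `column` with its own generated fake value."""
    real_values = set(column)
    mapping = {}
    for value in column.unique():
        fake = make_fake(rng)
        # Regenerate if the fake value collides with a real or already used one.
        while fake in real_values or fake in mapping.values():
            fake = make_fake(rng)
        mapping[value] = fake
    return column.map(mapping)


rng = random.Random(0)
original = pd.Series(['1111222233334444', '0000000000000000', '1111222233334444'])
anonymized = anonymize_column(original, fake_visa_number, rng)
print(anonymized.tolist())
```

Because the mapping is built per distinct value, repeated original values stay repeated after anonymization, while none of the original values survive.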

In [16]:
sampled = sdv.sample_all()

In [17]:
sampled['anonymized']

Unnamed: 0,index,name,credit_card_number
0,30,Connie,4892780642269054
1,31,Blake,4762337792635670
2,32,Blake,4762337792635670
3,33,Connie,4776292250767081
4,34,Connie,4776292250767081


Here we can see that `name` and `credit_card_number` have different values than in the original data. For example, in the `name` column the unique values have changed from `['Bill', 'Jeff']` to `['Connie', 'Blake']`. (Please note that these concrete values come from this execution; running this notebook again may yield different results.)

In the `credit_card_number` column the difference is even more noticeable, as the values do not keep the same format. This is not an issue, as this data will be transformed before being passed to the `Modeler`, and the transformation of categorical values into numeric ones should yield close enough results.
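To see why swapping the labels need not change what the model receives, consider a simplified frequency-based encoding in which each category is mapped to the midpoint of an interval of `[0, 1]` sized by its relative frequency. This is an assumption made for illustration, not the code of `CategoricalTransformer`, and `frequency_encode` is a hypothetical helper.

```python
import pandas as pd


def frequency_encode(column):
    """Map each category to the midpoint of a [0, 1] sub-interval sized by its frequency."""
    frequencies = column.value_counts(normalize=True)
    encoding = {}
    start = 0.0
    # Walk the categories in order of first appearance so the result is deterministic.
    for category in column.unique():
        end = start + frequencies[category]
        encoding[category] = (start + end) / 2
        start = end
    return column.map(encoding)


original_names = pd.Series(['Bill', 'Jeff', 'Bill', 'Jeff'])
anonymized_names = pd.Series(['Connie', 'Blake', 'Connie', 'Blake'])

encoded_original = frequency_encode(original_names)
encoded_anonymized = frequency_encode(anonymized_names)
print(encoded_original.tolist())
print(encoded_anonymized.tolist())
```

Under this simplified scheme both columns encode to the same numbers, since only the labels differ and the frequencies are unchanged.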



In [18]:
sampled['normal']

Unnamed: 0,index,name,credit_card_number
0,30,Bill,9999999999999999
1,31,Jeff,9999999999999999
2,32,Bill,8888888888888888
3,33,Bill,9999999999999999
4,34,Jeff,1111222233334444
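The expectation stated earlier can be expressed as a small check: no sampled value of a PII column in the anonymized table should appear in the original data, while the `normal` table is free to reuse original values. The sketch below uses the `name` values shown in this notebook; `leaked_values` is an illustrative helper, not an SDV function.

```python
import pandas as pd


def leaked_values(original, sampled, column):
    """Return the values of `column` in `sampled` that also appear in `original`."""
    return set(sampled[column]) & set(original[column])


# `name` values copied from the original data and from the sampled outputs above.
original = pd.DataFrame({'name': ['Bill', 'Jeff', 'Bill', 'Jeff']})
sampled_anonymized = pd.DataFrame({'name': ['Connie', 'Blake', 'Blake', 'Connie', 'Connie']})
sampled_normal = pd.DataFrame({'name': ['Bill', 'Jeff', 'Bill', 'Bill', 'Jeff']})

leaked_anonymized = leaked_values(original, sampled_anonymized, 'name')
leaked_normal = leaked_values(original, sampled_normal, 'name')

print(sorted(leaked_anonymized))  # → []
print(sorted(leaked_normal))      # → ['Bill', 'Jeff']
```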
