# Multi-Table Synthesis combined with Anonymization

Combining synthetic data with traditional anonymization strengthens privacy while preserving data utility and supporting regulatory compliance. Because synthetic records are not directly tied to real individuals, they reduce re-identification risk while keeping the data useful for analysis. Anonymization adds an extra layer of protection on top: the synthetic database replicates the original schema while masking selected identifiers, such as zip codes or unique identifiers. This makes data sharing and collaboration safer, and makes the combination a strategic choice for organizations handling sensitive information.

In this notebook we explore how to combine the benefits of the `MultiTableSynthesizer` with the YData Fabric Anonymizer.

## Getting your database from the Data Catalog

In this example we have created our database in a MySQL server and [created a Dataset in Fabric Data Catalog](https://docs.sdk.ydata.ai/0.10/get-started/create_multitable_dataset/).

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='{insert-database-id}')

dataset = datasource.dataset

MultiMetadata Summary

Number of tables: 9

  Table name  # cols Primary keys             Foreign keys PK characteristics                           FK characteristics Notes
0     append       3                                                                                                            
1   district      16           a1                                        [id]                                                   
2    account       4   account_id            [district_id]               [id]                      {'district_id': ['id']}      
3     client       6    client_id            [district_id]               [id]                      {'district_id': ['id']}      
4       disp       4      disp_id  [client_id, account_id]               [id]  {'client_id': ['id'], 'account_id': ['id']}      
5       loan       9      loan_id             [account_id]               [id]                       {'account_id': ['id']}      
6      order       6     order_id

## Training & sampling a Database Synthetic Data generator

We can now define the **Anonymize** configuration, which adds an extra layer of protection over some of the database properties while ensuring that the original schema is reflected.
By default, columns defined as **PK** are always anonymized with incremental integers. The user can change this pattern.
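
To make the default behaviour more concrete, the following is a minimal sketch in plain Python of replacing primary keys with incremental integers while keeping foreign keys consistent. It is not Fabric's implementation: the helper `build_key_mapping` and the toy rows are hypothetical and exist only for illustration.

```python
def build_key_mapping(keys):
    """Map each distinct original key to an incremental integer (1, 2, 3, ...),
    in order of first appearance."""
    mapping = {}
    for key in keys:
        if key not in mapping:
            mapping[key] = len(mapping) + 1
    return mapping


# Hypothetical toy rows, not taken from the Berka database
accounts = [
    {"account_id": "A-77", "district_id": 5},
    {"account_id": "A-12", "district_id": 9},
]
loans = [
    {"loan_id": 1, "account_id": "A-12"},
    {"loan_id": 2, "account_id": "A-77"},
    {"loan_id": 3, "account_id": "A-12"},
]

# Build the mapping once from the parent table's primary key...
mapping = build_key_mapping(row["account_id"] for row in accounts)

# ...and apply the same mapping to the primary key and to every foreign key
anon_accounts = [{**row, "account_id": mapping[row["account_id"]]} for row in accounts]
anon_loans = [{**row, "account_id": mapping[row["account_id"]]} for row in loans]

print(anon_accounts)
print(anon_loans)
```

Reusing one mapping for the primary key and all of its references is what keeps joins between the anonymized tables valid.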

In this example, the transactions table of the `Berka` database can be considered a time series. For that reason, the table **trans** is set as a `timeseries`, with the column `date` as the table's time-order reference (**sortbykey**). This requires computing a new `MultiMetadata`.

In [2]:
from ydata.metadata.multimetadata import MultiMetadata

dataset_type = {
    'trans': 'timeseries'
}

dataset_attrs = {
    'trans': {
        'sortbykey': 'date',
        'entities': []
    }
}

metadata = MultiMetadata(dataset, dataset_attrs=dataset_attrs, dataset_type=dataset_type)


In this example, the following columns are anonymized:
- The `district` table primary key (`district.a1`) and all its references (i.e., foreign keys) across the database, such as `account.district_id`. The replacement data will be generated according to the specified regular expression.
- All the primary keys from the `account` table, as well as their references across the database. The replacement data are integers.
- The values from the `bank_to` column from the `order` table will be replaced by city names (other strategies are available according to the specified `AnonymizerType`).
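
As an illustration of what the regular expression used for `district.a1` in this example (`[0-9]{4}-[A-Z]{5}`) describes, the sketch below builds a few conforming strings with the standard library and checks them with `re.fullmatch`. The helper `random_district_id` is hypothetical, and this is not how the Fabric Anonymizer generates its values.

```python
import random
import re
import string

# Same pattern as the one configured for `district.a1` in this example
PATTERN = re.compile(r"[0-9]{4}-[A-Z]{5}")


def random_district_id(rng):
    """Build one string of four digits, a dash, and five uppercase letters."""
    digits = "".join(rng.choice(string.digits) for _ in range(4))
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(5))
    return f"{digits}-{letters}"


rng = random.Random(0)  # seeded only to make the sketch reproducible
candidates = [random_district_id(rng) for _ in range(3)]

for value in candidates:
    # fullmatch requires the whole string to conform, not just a prefix
    assert PATTERN.fullmatch(value) is not None
    print(value)
```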

In [4]:
# Setting the Anonymizer definition
from ydata.preprocessors.methods.anonymization import AnonymizerType

anonymizer_config = {
    'district': {
        'a1': r'[0-9]{4}-[A-Z]{5}'
    },
    'account': {
        'anonymize_primary_keys': True
    },
    'order': {
        'bank_to': AnonymizerType.CITY
    }
}

In [5]:
from ydata.synthesizers.multitable.model import MultiTableSynthesizer

synth = MultiTableSynthesizer()
synth.fit(dataset, metadata, anonymize=anonymizer_config)

INFO: 2024-02-08 00:02:52,096 (1/9) - Fitting table: [district]
INFO: 2024-02-08 00:02:54,713 [SYNTHESIZER] - Number columns considered for synth: 16
INFO: 2024-02-08 00:02:55,012 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2024-02-08 00:02:55,017 [SYNTHESIZER] - Preprocess segment
INFO: 2024-02-08 00:02:55,025 [SYNTHESIZER] - Synthesizer init.
INFO: 2024-02-08 00:02:55,026 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2024-02-08 00:02:55,334 (2/9) - Fitting table: [client]
INFO: 2024-02-08 00:03:06,846 [SYNTHESIZER] - Number columns considered for synth: 22
INFO: 2024-02-08 00:03:07,369 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2024-02-08 00:03:07,373 [SYNTHESIZER] - Preprocess segment
INFO: 2024-02-08 00:03:07,382 [SYNTHESIZER] - Synthesizer init.
INFO: 2024-02-08 00:03:07,383 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2024-02-08 00:03:09,060 (3/9

<ydata.synthesizers.multitable.model.MultiTableSynthesizer at 0x7f6b601fcdc0>

To generate the synthetic data we call the `sample` method.

To preserve consistency between tables and referential integrity, the number of records to sample from a trained synthesizer is set as a ratio of the original number of records (e.g., 1.0 yields a database of the same size as the original).
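
As a rough mental model of the ratio, the sketch below scales per-table row counts by a single factor. It is an assumption-laden simplification: the helper `target_rows` and the row counts are hypothetical, and the synthesizer may size child tables through their relationships rather than by a plain multiplication.

```python
def target_rows(original_counts, ratio):
    """Scale every table's row count by the same ratio, keeping at least one row."""
    return {table: max(1, round(n * ratio)) for table, n in original_counts.items()}


# Hypothetical row counts, for illustration only
original_counts = {"district": 100, "account": 4500}

print(target_rows(original_counts, 1.0))  # same size as the original
print(target_rows(original_counts, 0.5))  # roughly half the original size
```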

In [6]:
sample = synth.sample(n_samples=1.)
print(sample)

INFO: 2024-02-08 00:04:36,569 (1/9) - Synthesizing table: district
INFO: 2024-02-08 00:04:36,570 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-08 00:04:37,436 (2/9) - Synthesizing table: client
INFO: 2024-02-08 00:04:37,700 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-08 00:04:38,796 (3/9) - Synthesizing table: disp
INFO: 2024-02-08 00:04:38,924 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-08 00:04:39,668 (4/9) - Synthesizing table: account
INFO: 2024-02-08 00:04:39,797 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-08 00:04:40,679 (5/9) - Synthesizing table: trans
INFO: 2024-02-08 00:04:40,938 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-08 00:04:40,944 [SYNTHESIZER] - Sample segment (-0.001, 67499.5]
INFO: 2024-02-08 00:04:42,644 [SYNTHESIZER] - Sample segment (67499.5, 134999.0]
INFO: 2024-02-08 00:04:48,974 (6/9) - Synthesizing table: order
INFO: 2024-02-08 00:04:49,071 [SYNTHESIZER] - Start generatin

In [7]:
sample['district'].head()


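|   | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | a10 | a11 | a12 | a13 | a14 | a15 | a16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7674-LTGIU | Domazlice | west Bohemia | 133777 | 9 | 11 | 7 | 2 | 8 | 85.0 | 9832 | 3.32 | 4.0 | 90 | 6949 | 5273 |
| 1 | 8793-ICNMI | Brno - venkov | south Moravia | 75685 | 31 | 33 | 1 | 2 | 4 | 59.0 | 8772 | 6.43 | 2.0 | 106 | 2595 | 2487 |
| 2 | 2097-SXOVW | Uherske Hradiste | south Moravia | 145688 | 22 | 41 | 13 | 2 | 6 | 57.0 | 8369 | 1.29 | 2.0 | 110 | 2212 | 2906 |
| 3 | 2734-VPGRI | Rokycany | west Bohemia | 159617 | 29 | 26 | 10 | 0 | 9 | 70.0 | 8678 | 3.83 | 5.0 | 131 | 4355 | 4265 |
| 4 | 2009-ALOYW | Domazlice | north Bohemia | 133777 | 24 | 16 | 7 | 4 | 10 | 85.0 | 8705 | 1.39 | 6.0 | 131 | 4650 | 4505 |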
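
One quick sanity check on this output is to confirm that the anonymized `a1` values follow the configured pattern and remain unique, as a primary key should. The sketch below does this for the five values displayed above, copied by hand into a list. In a real notebook you would read the column from the sampled table instead, which is not shown here because the exact accessor depends on the object returned by `sample`.

```python
import re

pattern = re.compile(r"[0-9]{4}-[A-Z]{5}")

# The five `a1` values shown in the output above
a1_values = ["7674-LTGIU", "8793-ICNMI", "2097-SXOVW", "2734-VPGRI", "2009-ALOYW"]

# Every value must match the whole pattern
all_match = all(pattern.fullmatch(value) is not None for value in a1_values)

# Primary keys must not repeat
all_unique = len(set(a1_values)) == len(a1_values)

print(all_match)   # → True
print(all_unique)  # → True
```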
