# Multi-Table Synthesis with Business Rules

Combining synthetic data with traditional anonymization can improve both privacy and data utility while supporting regulatory compliance. Because synthetic records are not tied directly to real individuals, the risk of re-identification is reduced while the data remains useful for analysis. The combination also makes data sharing and collaboration safer: it adds an extra layer of privacy protection and makes it possible to replicate the original schema while protecting specific identifiers, such as zip codes or unique identifiers. This makes it a strategic choice for organizations that handle sensitive information.

In this notebook we explore how to combine the benefits of the `MultiTableSynthesizer` with the YData Fabric Anonymizer.

## Getting your database from the Data Catalog

In this example we have created our database in a MySQL server and [created a Dataset in Fabric Data Catalog](https://docs.sdk.ydata.ai/0.10/get-started/create_multitable_dataset/).

In [1]:
# Importing YData's packages
from ydata.labs import DataSources

# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='{insert-datasource-uid}')

dataset = datasource.dataset

## Training & sampling a Database Synthetic Data generator

The calculated features functionality allows the generation of specific columns based on data from other columns according to the business rules specified in custom functions.

In this example, the transactions table of the `Berka` database can be treated as a time series. The table **trans** is therefore set as `timeseries`, with the column `date` as its time-order reference (**sortbykey**). This requires computing a new `MultiMetadata`.

In [2]:
from ydata.metadata.multimetadata import MultiMetadata

dataset_type = {
    'trans': 'timeseries'
}

dataset_attrs = {
    'trans': {
        'sortbykey': 'date',
        'entities': []
    }
}

metadata = MultiMetadata(dataset, dataset_attrs=dataset_attrs, dataset_type=dataset_type)

This may cause some slowdown.
Consider scattering data ahead of time and using futures.


In this example, the following columns are calculated features:
- The `full_name` column from the `client` table is generated by concatenating the first and last names of each client, which are available in the `first_name` and `last_name` columns of the same table.
- The `a10_sum` column from the `client` table is generated by summing all the values from the `a10` column of the `district` table for each client. Since this is an inter-table calculated feature (i.e., several tables are used), there is a need to establish the relationship between the tables (in this case, between the `client` and the `district`). The user should include the primary and foreign keys in the base columns, and establish the relationship inside the custom function (see the `get_a10_sum` function).

In [5]:
import pandas as pd
import numpy as np

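def get_full_name(first_name, last_name):
    # Intra-table rule: join the stripped first and last names with a single space.
    full_names = []
    for ix in range(first_name.shape[0]):
        full_names.append(first_name[ix].strip() + " " + last_name[ix].strip())
    return np.asarray(full_names)

def get_a10_sum(client_id, district_id, a1, a10):
    # Inter-table rule: rebuild the district lookup from its key (a1) and value (a10).
    a1_s = pd.Series(a1, name="a1")
    a10_s = pd.Series(a10, name="a10")
    district_data = pd.concat([a1_s, a10_s], axis=1)
    # Float accumulator so that non-integer sums are stored without a dtype change.
    a10_sum = pd.Series(0.0, index=client_id)
    # Relate each client to its district through the foreign key district_id.
    for c, d in zip(client_id, district_id):
        a10_sum[c] = district_data[district_data["a1"] == d]["a10"].sum()
    return a10_sum.values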

calculated_features=[
    {
      "calculated_features": "client.full_name",
      "function": get_full_name,
      "calculated_from": ["client.first_name", "client.last_name"],
    }
]
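
Before fitting, it can help to run the custom functions on a few hand-made rows. The sketch below repeats the two functions (with the accumulator initialised as a float), applies them to toy NumPy arrays whose values are invented for illustration, and then shows how the inter-table feature might be registered. The `client.a10_sum` entry and its `calculated_from` identifiers are assumptions inferred from the `get_a10_sum` signature and the `table.column` naming used for `client.full_name`; they are not part of the original list.

```python
import numpy as np
import pandas as pd

def get_full_name(first_name, last_name):
    full_names = []
    for ix in range(first_name.shape[0]):
        full_names.append(first_name[ix].strip() + " " + last_name[ix].strip())
    return np.asarray(full_names)

def get_a10_sum(client_id, district_id, a1, a10):
    district_data = pd.concat(
        [pd.Series(a1, name="a1"), pd.Series(a10, name="a10")], axis=1
    )
    a10_sum = pd.Series(0.0, index=client_id)
    for c, d in zip(client_id, district_id):
        a10_sum[c] = district_data[district_data["a1"] == d]["a10"].sum()
    return a10_sum.values

# Toy inputs (invented values, not taken from the Berka database).
first_name = np.array(["Ana ", " Rui", "Eva"])
last_name = np.array(["Silva", "Costa ", " Melo"])
client_id = np.array([1, 2, 3])
district_id = np.array([10, 20, 10])
a1 = np.array([10, 20])          # district key
a10 = np.array([46.7, 100.0])    # district value to aggregate

full_names = get_full_name(first_name, last_name)
a10_sums = get_a10_sum(client_id, district_id, a1, a10)
print(full_names)
print(a10_sums)

# Hypothetical registration of both features; the second entry is an assumption.
calculated_features_sketch = [
    {
        "calculated_features": "client.full_name",
        "function": get_full_name,
        "calculated_from": ["client.first_name", "client.last_name"],
    },
    {
        "calculated_features": "client.a10_sum",
        "function": get_a10_sum,
        "calculated_from": [
            "client.client_id",
            "client.district_id",
            "district.a1",
            "district.a10",
        ],
    },
]
```

Note that the order of `calculated_from` mirrors the order of the function arguments in this sketch, which keeps the mapping between columns and parameters easy to read.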

In [6]:
from ydata.synthesizers.multitable.model import MultiTableSynthesizer

synth = MultiTableSynthesizer()
synth.fit(dataset, metadata, calculated_features=calculated_features)

INFO: 2024-11-19 12:09:27,858 (1/9) - Fitting table: [district]
INFO: 2024-11-19 12:09:34,692 [SYNTHESIZER] - Number columns considered for synth: 16
INFO: 2024-11-19 12:09:35,407 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2024-11-19 12:09:35,410 [SYNTHESIZER] - Preprocess segment
INFO: 2024-11-19 12:09:35,420 [SYNTHESIZER] - Synthesizer init.
INFO: 2024-11-19 12:09:35,421 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2024-11-19 12:09:35,792 (2/9) - Fitting table: [client]
INFO: 2024-11-19 12:09:39,229 [SYNTHESIZER] - Number columns considered for synth: 23
INFO: 2024-11-19 12:09:40,312 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2024-11-19 12:09:40,316 [SYNTHESIZER] - Preprocess segment
INFO: 2024-11-19 12:09:40,326 [SYNTHESIZER] - Synthesizer init.
INFO: 2024-11-19 12:09:40,328 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2024-11-19 12:09:42,594 (3/9

<ydata.synthesizers.multitable.model.MultiTableSynthesizer at 0x7fa5ec539ab0>

To generate the synthetic data we call the `sample` method.

To preserve consistency across tables and referential integrity, the number of records sampled from a trained synthesizer is set as a ratio of the original number of records (e.g., 1.0 produces a database of the same size as the original).
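
As a rough illustration of what the ratio means, the sketch below scales a set of table sizes by `n_samples`. The row counts are hypothetical, and `approx_rows` is a helper written for this example, not part of the SDK. The sizes actually produced by the synthesizer may differ, since it also has to keep the relationships between tables consistent.

```python
# Hypothetical original row counts; real values come from the source database.
original_rows = {"district": 100, "client": 1000, "account": 4000}

def approx_rows(original_rows, n_samples):
    """Approximate per-table size for a given sampling ratio."""
    return {table: round(rows * n_samples) for table, rows in original_rows.items()}

same_size = approx_rows(original_rows, 1.0)   # same size as the original
half_size = approx_rows(original_rows, 0.5)   # roughly half the records
print(same_size)
print(half_size)
```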

### Generate data and sample

In [7]:
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
connector = Connectors.get(uid='{insert-connector-uid}')
print(connector)


MySQLConnector(
  
  uid='ac7ec4a8-ea81-4725-8c0d-b40b88db0c6a',
  name='Berka database synth',
  type=ConnectorType.MYSQL,
  connection=Connection(host='data-science-mysql-41955.c1xxv3f18hni.eu-west-1.rds.amazonaws.com', port=3306),
  database=berka_synth)


In [12]:
synth.sample(n_samples=1., connector=connector.connector)

INFO: 2024-11-19 12:59:13,588 (1/9) - Synthesizing table: district
INFO: 2024-11-19 12:59:13,590 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-11-19 12:59:13,591 [SYNTHESIZER] - Init Dask cluster for sampling.
INFO: 2024-11-19 12:59:13,702 [SYNTHESIZER] - Postprocessing.
INFO: 2024-11-19 12:59:17,254 let write into the connector None
INFO: 2024-11-19 12:59:17,256 using kwargs {}
INFO: 2024-11-19 12:59:17,316 [SYNTHESIZER] - Numerical clipping
INFO: 2024-11-19 12:59:18,671 (2/9) - Synthesizing table: client
INFO: 2024-11-19 12:59:19,009 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-11-19 12:59:19,011 [SYNTHESIZER] - Init Dask cluster for sampling.
INFO: 2024-11-19 12:59:19,528 [SYNTHESIZER] - Postprocessing.
INFO: 2024-11-19 12:59:19,629 let write into the connector None
INFO: 2024-11-19 12:59:19,631 using kwargs {}
INFO: 2024-11-19 12:59:19,765 [SYNTHESIZER] - Numerical clipping
INFO: 2024-11-19 12:59:20,100 (3/9) - Synthesizing table: account
INFO: 2024-11-19

<ydata.dataset.multidataset.MultiDataset at 0x7fa5ec984700>

In [13]:
synth_data = connector.read_database()

In [14]:
print(synth_data)

MultiDataset Summary

Number of tables: 9
 
  Table name  Num cols                         Num rows Primary keys Foreign keys Notes
0    account         4  Number of rows not yet computed                                
1     append         3  Number of rows not yet computed                                
2       card         4  Number of rows not yet computed                                
3     client         6  Number of rows not yet computed                                
4       disp         4  Number of rows not yet computed                                
5   district        16  Number of rows not yet computed                                
6       loan         9  Number of rows not yet computed                                
7      order         6  Number of rows not yet computed                                
8      trans        10  Number of rows not yet computed                                
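
Once the synthetic tables are available as pandas DataFrames, the business rule and the referential integrity can be checked directly. The following is a minimal sketch on toy frames: how individual tables are extracted from the `MultiDataset` is not shown, the helper functions are hypothetical, and the column names `client_id`, `district_id` and `a1` are assumptions taken from the `get_a10_sum` signature.

```python
import pandas as pd

# Toy stand-ins for the synthetic `client` and `district` tables.
client = pd.DataFrame({
    "client_id": [1, 2, 3],
    "district_id": [10, 20, 10],
    "first_name": ["Ana", "Rui", "Eva"],
    "last_name": ["Silva", "Costa", "Melo"],
    "full_name": ["Ana Silva", "Rui Costa", "Eva Melo"],
})
district = pd.DataFrame({"a1": [10, 20], "a10": [46.7, 100.0]})

def full_name_rule_holds(df):
    """True when full_name equals the stripped first and last names joined by a space."""
    expected = df["first_name"].str.strip() + " " + df["last_name"].str.strip()
    return bool((df["full_name"] == expected).all())

def orphan_foreign_keys(child, fk, parent, pk):
    """Foreign-key values in `child` that have no matching key in `parent`."""
    return child.loc[~child[fk].isin(parent[pk]), fk].tolist()

rule_ok = full_name_rule_holds(client)
orphans = orphan_foreign_keys(client, "district_id", district, "a1")
print(rule_ok, orphans)
```

An empty list of orphans means every `district_id` in the client table points to an existing district, which is the property the synthesizer is expected to preserve.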
