# Multi-Table Synthesis from a Relational Database

Relational databases store data in multiple tables connected through primary and foreign keys, and are typically managed by a relational database management system (RDBMS). They serve a wide range of use cases and offer benefits such as security and reliability.

For many data science scenarios, a single-table model is the usual choice. Relational databases and other table-based stores still matter for more complex use cases, such as systems testing, replicating a database for customer segmentation, or migrating data between on-premises and cloud environments.

YData Fabric supports Multi-Table Synthesis through an easy-to-use, familiar SDK interface. With a few lines of code, users can replicate a full relational database while preserving the consistency of all keys and the statistical properties of cross-table relationships.

[Berka](https://data.world/lpetrocelli/czech-financial-dataset-real-anonymized-transactions) is the dataset used here to demonstrate Fabric's Multi-Table Synthesis capabilities and interface.

## Getting your database from the Data Catalog

In this example, we have created our database on a MySQL server and [created a Dataset in Fabric Data Catalog](https://docs.sdk.ydata.ai/0.10/get-started/create_multitable_dataset/). Copy the required code snippet by clicking the "Explore in Labs" button in the dataset detail view, as shown in the image below.

![explore_in_labs.png](attachment:4c68426e-b85f-464a-a22d-45ce05175628.png)

In [1]:
# Importing YData's packages
from ydata.labs import DataSources

# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='f1a18828-bb2b-442d-b0cd-b581ad96b1e9')

dataset = datasource.dataset
# Getting the precomputed Metadata, which provides the profile overview information in Labs
metadata = datasource.metadata
print(metadata)


MultiMetadata Summary

Tables Summary
Number of tables: 9
 
  Table name  # cols  # nrows  Primary keys             Foreign keys PK characteristics                           FK characteristics Notes
0     append       3       20            []                                                                                               
1   district      16       77          [a1]                                        [id]                                                   
2    account       4     4500  [account_id]            [district_id]               [id]                      {'district_id': ['id']}      
3     client       6     5369   [client_id]            [district_id]               [id]                      {'district_id': ['id']}      
4       disp       4     5369     [disp_id]  [client_id, account_id]               [id]  {'client_id': ['id'], 'account_id': ['id']}      
5       loan       9      682     [loan_id]             [account_id]          

## Training & sampling a Database Synthetic Data generator

We can now train the synthesizer by creating a `MultiTableSynthesizer` and passing the data and the metadata.

In [None]:
from ydata.synthesizers import MultiTableSynthesizer

synth = MultiTableSynthesizer()
synth.fit(dataset, metadata)

To generate the synthetic data we call the `sample` method.

Because the tables must stay consistent and referential integrity must be preserved, the number of records to sample from a trained synthesizer is set as a ratio of the original number of records (e.g., 1.0 is equivalent to the size of the original database).
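As a rough illustration of what the ratio means, the sketch below scales the row counts reported in the metadata summary above. It assumes every table is scaled by the same factor and rounded to the nearest integer; the SDK may round differently or adjust counts to preserve relationships. `expected_rows` is a hypothetical helper, not part of the SDK.

```python
# Row counts copied from the MultiMetadata summary printed earlier
# (only the tables visible in that output are listed).
original_rows = {
    "append": 20,
    "district": 77,
    "account": 4500,
    "client": 5369,
    "disp": 5369,
    "loan": 682,
}


def expected_rows(row_counts, ratio):
    """Hypothetical helper: scale each table's row count by the same ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return {table: round(n * ratio) for table, n in row_counts.items()}


# A ratio of 1.0 reproduces the original sizes; 2.0 doubles every table.
for ratio in (1.0, 2.0):
    print(ratio, expected_rows(original_rows, ratio))
```

Expressing the size as a single ratio, rather than a row count per table, keeps parent and child tables in proportion to one another.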

### Sampling & writing to a database

In [3]:
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
connector = Connectors.get(uid='{insert-connector-uid}')

conn = connector.connector

In [6]:
sample = synth.sample(n_samples=1.,
                     connector=conn,
                     if_exists='replace')

INFO: 2024-12-18 15:34:13,702 (1/9) - Synthesizing table: district
INFO: 2024-12-18 15:34:13,703 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-12-18 15:34:13,704 [SYNTHESIZER] - Init Dask cluster for sampling.
INFO: 2024-12-18 15:34:13,801 [SYNTHESIZER] - Postprocessing.
INFO: 2024-12-18 15:34:19,032 let write into the connector None
INFO: 2024-12-18 15:34:19,032 using kwargs {}
INFO: 2024-12-18 15:34:19,088 [SYNTHESIZER] - Numerical clipping
INFO: 2024-12-18 15:34:20,451 (2/9) - Synthesizing table: account
INFO: 2024-12-18 15:34:20,787 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-12-18 15:34:20,788 [SYNTHESIZER] - Init Dask cluster for sampling.
INFO: 2024-12-18 15:34:21,005 [SYNTHESIZER] - Postprocessing.
INFO: 2024-12-18 15:34:21,092 let write into the connector None
INFO: 2024-12-18 15:34:21,093 using kwargs {}
INFO: 2024-12-18 15:34:21,202 [SYNTHESIZER] - Numerical clipping
INFO: 2024-12-18 15:34:21,387 (3/9) - Synthesizing table: loan
INFO: 2024-12-18 1

### Getting the data as CSV instead

In [7]:
sample = synth.sample(n_samples=1.)

INFO: 2024-12-18 15:35:37,991 (1/9) - Synthesizing table: district
INFO: 2024-12-18 15:35:37,992 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-12-18 15:35:37,993 [SYNTHESIZER] - Init Dask cluster for sampling.
INFO: 2024-12-18 15:35:38,074 [SYNTHESIZER] - Postprocessing.
INFO: 2024-12-18 15:35:38,114 let write into the connector None
INFO: 2024-12-18 15:35:38,115 using kwargs {}
INFO: 2024-12-18 15:35:38,167 [SYNTHESIZER] - Numerical clipping
INFO: 2024-12-18 15:35:38,784 (2/9) - Synthesizing table: account
INFO: 2024-12-18 15:35:39,129 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-12-18 15:35:39,131 [SYNTHESIZER] - Init Dask cluster for sampling.
INFO: 2024-12-18 15:35:39,344 [SYNTHESIZER] - Postprocessing.
INFO: 2024-12-18 15:35:39,430 let write into the connector None
INFO: 2024-12-18 15:35:39,431 using kwargs {}
INFO: 2024-12-18 15:35:39,536 [SYNTHESIZER] - Numerical clipping
INFO: 2024-12-18 15:35:39,733 (3/9) - Synthesizing table: loan
INFO: 2024-12-18 1

In [8]:
sample['account'].head()

Unnamed: 0,account_id,district_id,frequency,date
0,1,1,POPLATEK MESICNE,950327.0
1,2,1,POPLATEK MESICNE,950620.0
2,3,1,POPLATEK MESICNE,970227.0
3,4,1,POPLATEK MESICNE,930916.0
4,5,1,POPLATEK MESICNE,960822.0


In [11]:
# Write each synthetic table to its own CSV file in the working directory
for table in sample.schema.keys():
    df = sample[table].to_pandas()
    df.to_csv(f'{table}.csv')
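One detail of the loop above: by default, pandas `DataFrame.to_csv` also writes the DataFrame index as an extra, unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below uses a toy DataFrame (illustrative values, not the synthetic data) and in-memory buffers to show the difference that `index=False` makes.

```python
import io

import pandas as pd

# Toy stand-in for one synthetic table (values are illustrative only).
df = pd.DataFrame({"account_id": [1, 2], "district_id": [1, 1]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

# Read both buffers back and compare the resulting column names.
cols_with_index = list(pd.read_csv(io.StringIO(with_index.getvalue())).columns)
cols_without_index = list(pd.read_csv(io.StringIO(without_index.getvalue())).columns)

print(cols_with_index)     # ['Unnamed: 0', 'account_id', 'district_id']
print(cols_without_index)  # ['account_id', 'district_id']
```

If the exported files are meant to be loaded into another database, passing `index=False` to `to_csv` is usually the safer choice, since it avoids adding a column that was not in the schema.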

## Validate the quality of the Synthetic database relations

To validate the relationships in the synthetic database against those observed in the original dataset, first compute the `MultiMetadata` for the synthetic sample, since this object is used in the comparison.
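For intuition about what such a validation looks for, the sketch below checks two properties on a toy pair of tables using the standard-library `sqlite3` module: primary key values are unique and non-null, and every foreign key value exists in the parent table. The table contents and the helpers `count_pk_violations` and `count_fk_violations` are illustrative assumptions, not the SDK's implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE district (a1 INTEGER);
    CREATE TABLE account (account_id INTEGER, district_id INTEGER);
    INSERT INTO district VALUES (1), (2);
    -- account 3 points at a district that does not exist (orphaned row)
    INSERT INTO account VALUES (1, 1), (2, 2), (3, 99);
    """
)


def count_pk_violations(conn, table, pk):
    """Count rows whose key is NULL or repeats a value already present."""
    # Identifiers are interpolated directly, so pass trusted names only.
    total, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {pk}) FROM {table}"
    ).fetchone()
    return total - distinct


def count_fk_violations(conn, child, fk, parent, pk):
    """Count child rows whose non-NULL foreign key has no match in the parent."""
    (orphans,) = conn.execute(
        f"""
        SELECT COUNT(*)
        FROM {child} AS c
        LEFT JOIN {parent} AS p ON c.{fk} = p.{pk}
        WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL
        """
    ).fetchone()
    return orphans


print(count_pk_violations(conn, "account", "account_id"))                     # 0
print(count_fk_violations(conn, "account", "district_id", "district", "a1"))  # 1
```

The `MultiMetadata` summary computed in the next cell reports primary and foreign key violations for every table in the synthetic database.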

In [7]:
from ydata.metadata.multimetadata import MultiMetadata

m_sample = MultiMetadata(sample)
print(m_sample.get_schema_validation_summary(metadata, sample, dataset))

Schema Validation Summary

Number of Primary Key Violations: 0
Number of Foreign Key Violations: 0
Relationship Quality: 100%


Table append

	Primary Keys
		Current Schema: None
		Reference Schema: None

	Non-Matching Primary Keys: None

	Non-Matching Foreign Keys: None


Table district

	Primary Keys
		Current Schema: a1 [VALID]
		Reference Schema: a1 [VALID]

	Non-Matching Primary Keys: None

	Non-Matching Foreign Keys: None


Table account

	Primary Keys
		Current Schema: account_id [VALID]
		Reference Schema: account_id [VALID]

	Foreign Key 1
		Current Schema: district_id -> district.a1 (1-N)
		Reference Schema: district_id -> district.a1 (1-N)

	Percentage of Valid Foreign Keys
		Current Schema: 100%
		Reference Schema: 100%

	Non-Matching Primary Keys: None

	Non-Matching Foreign Keys: None


Table client

	Primary Keys