# Multi-Table Synthesis - Manual Schema

### How to synthesize data from a database schema

Relational database management systems (RDBMS) store data in multiple tables that are connected through primary and foreign keys. They serve a wide range of use cases and offer benefits such as security and reliability.

For many data science scenarios, a single-table model is usually the go-to, but relational databases and other table-based stores matter for more complex use cases, such as systems testing, replicating a database for customer segmentation, or migrating data between on-premises and cloud environments.

YData Fabric offers an easy-to-use and familiar interface through the SDK to support Multi-Table Synthesis. With the SDK and a few lines of code, users can replicate full relational databases while maintaining the consistency of all the keys and the statistical information of cross-table relations. 

This tutorial uses the [Berka](https://data.world/lpetrocelli/czech-financial-dataset-real-anonymized-transactions) dataset to demonstrate the properties and interface of Fabric's Multi-Table Synthesis.

## Getting the data from an RDBMS

We start by creating the RDBMS connector through the SDK.

In [1]:
from ydata.connectors.storages.rdbms_connector import MySQLConnector

USERNAME = '<username>'
PASSWORD = '<password>'
HOSTNAME = '<hostname>'
PORT = '3306'
DATABASE_NAME = '<database>'

conn_str = {
    "hostname": f'{HOSTNAME}',
    "username": f'{USERNAME}',
    "password": f'{PASSWORD}',
    "port": f'{PORT}',
    "database": f'{DATABASE_NAME}'
}

connector = MySQLConnector(conn_string=conn_str)
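
Hardcoding credentials in a notebook is easy to leak. As a minimal sketch, the same connection dictionary could be assembled from environment variables. The variable names below (`MYSQL_HOSTNAME`, and so on) and the `build_conn_str` helper are hypothetical and not part of the SDK; only the dictionary keys come from the cell above.

```python
import os

# Hypothetical environment variable names; adapt them to your deployment.
ENV_KEYS = {
    "hostname": "MYSQL_HOSTNAME",
    "username": "MYSQL_USERNAME",
    "password": "MYSQL_PASSWORD",
    "port": "MYSQL_PORT",
    "database": "MYSQL_DATABASE",
}


def build_conn_str(env=None):
    """Build the connector dictionary from environment variables.

    Only the port has a default (3306, as in the cell above); any other
    missing variable raises a KeyError that names what is missing.
    """
    env = os.environ if env is None else env
    defaults = {"port": "3306"}
    conn_str = {}
    missing = []
    for field, var in ENV_KEYS.items():
        value = env.get(var, defaults.get(field))
        if value is None:
            missing.append(var)
        else:
            conn_str[field] = str(value)
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return conn_str
```

The result has the same shape as `conn_str` above, so it could be passed as `MySQLConnector(conn_string=build_conn_str())`.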

We use the `get_tables` method to retrieve only the `account` and `order` tables. Lazy loading is disabled (`lazy=False`) so that the data is fetched immediately.

In [2]:
tables = ["account", "order"]
data = connector.get_tables(tables, lazy=False)
account_data = data["account"]
order_data = data["order"]

## Defining the database schema

We now manually define the schema for this two-table subset of the database. The `account` table only has a primary key (`account_id`). The `order` table has a primary key (`order_id`) and a foreign key (`account_id`) that references the `account` table in a 1-n relation: one account can have many orders.

In [3]:
from ydata.dataset.multidataset import MultiTableSchema

schema = MultiTableSchema({"account": {"primary_keys": ["account_id"]},
                           "order": {"primary_keys": ["order_id"]}})
schema.add_foreign_key(table="order", column="account_id",
                       parent_table="account", parent_column="account_id",
                       relation_type="1-n")
schema

{'account': TableSchema(primary_keys=['account_id'], foreign_keys=[]),
 'order': TableSchema(primary_keys=['order_id'], foreign_keys=[ForeignReference(table='order', column='account_id', parent_table='account', parent_column='account_id', relation_type=<RelationType.ONE_TO_MANY: '1-n'>)])}

Alternatively, we can also load the schema from a JSON string or file.

In [4]:
from json import loads as json_loads
from ydata.dataset.multidataset import MultiTableSchema

json_schema = '{"account": {"primary_keys": ["account_id"], "foreign_keys": []}, "order": {"primary_keys": ["order_id"], "foreign_keys": [{"table": "order", "column": "account_id", "parent_table": "account", "parent_column": "account_id", "relation_type": "1-n"}]}}'
schema = MultiTableSchema(json_loads(json_schema))
schema

{'account': TableSchema(primary_keys=['account_id'], foreign_keys=[]),
 'order': TableSchema(primary_keys=['order_id'], foreign_keys=[ForeignReference(table='order', column='account_id', parent_table='account', parent_column='account_id', relation_type=<RelationType.ONE_TO_MANY: '1-n'>)])}
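
Before handing a hand-written schema to the SDK, it can help to sanity-check the JSON itself. The sketch below uses only the standard library and the dictionary layout shown above. The `validate_schema` helper is hypothetical, and the rule that a foreign key must point at a primary key of its parent table is an assumption made for illustration, not a documented SDK requirement.

```python
import json


def validate_schema(schema):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for table, spec in schema.items():
        if not spec.get("primary_keys"):
            problems.append(f"{table}: no primary key declared")
        for fk in spec.get("foreign_keys", []):
            parent = fk["parent_table"]
            if fk["table"] != table:
                problems.append(f"{table}: foreign key listed under the wrong table ({fk['table']})")
            if parent not in schema:
                problems.append(f"{table}: unknown parent table {parent}")
            elif fk["parent_column"] not in schema[parent].get("primary_keys", []):
                # Assumption: the referenced column should be a primary key of the parent.
                problems.append(f"{table}: {parent}.{fk['parent_column']} is not a primary key")
    return problems


# Same content as the JSON schema used in the cell above.
json_schema = """
{"account": {"primary_keys": ["account_id"], "foreign_keys": []},
 "order": {"primary_keys": ["order_id"],
           "foreign_keys": [{"table": "order", "column": "account_id",
                             "parent_table": "account", "parent_column": "account_id",
                             "relation_type": "1-n"}]}}
"""

problems = validate_schema(json.loads(json_schema))
print(problems or "schema looks consistent")
```

To read the schema from a file instead, `json.load` on an open file handle returns the same dictionary.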

## Creating the MultiDataset

We can now manually create the `MultiDataset` from the database schema and the independent datasets that represent each table.

In [5]:
from ydata.dataset.multidataset import MultiDataset

new_data = MultiDataset({"account": account_data, "order": order_data}, schema=schema)
print(new_data)

MultiDataset Summary

Number of tables: 2
 
  Table name  Num cols  Num rows  Primary keys  Foreign keys Notes
0    account         4      4500  [account_id]                    
1      order         6      6471    [order_id]  [account_id]      


Let's display the data of the `order` table.

In [6]:
new_data['order'].to_pandas()

|  | order_id | account_id | bank_to | account_to | amount | k_symbol |
|---|---|---|---|---|---|---|
| 0 | 29401 | 1 | YZ | 87144583 | 2452.0 | SIPO |
| 1 | 29402 | 2 | ST | 89597016 | 3373.0 | UVER |
| 2 | 29403 | 2 | QR | 13943797 | 7266.0 | SIPO |
| 3 | 29404 | 3 | WX | 83084338 | 1135.0 | SIPO |
| 4 | 29405 | 3 | CD | 24485939 | 327.0 |  |
| ... | ... | ... | ... | ... | ... | ... |
| 6466 | 46334 | 11362 | YZ | 70641225 | 4780.0 | SIPO |
| 6467 | 46335 | 11362 | MN | 78507822 | 56.0 |  |
| 6468 | 46336 | 11362 | ST | 40799850 | 330.0 | POJISTNE |
| 6469 | 46337 | 11362 | KL | 20009470 | 129.0 |  |
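
The first rows of this output already show what the 1-n relation means in practice: `order_id` never repeats, while the same `account_id` can appear in several orders. A small standard-library sketch over those five rows (copied by hand, not read through the SDK) makes this concrete:

```python
from collections import Counter

# (order_id, account_id) pairs copied from the first five rows of the output above.
orders = [(29401, 1), (29402, 2), (29403, 2), (29404, 3), (29405, 3)]

# Primary key: every order_id must be unique.
order_ids = [order_id for order_id, _ in orders]
pk_is_unique = len(order_ids) == len(set(order_ids))

# 1-n foreign key: one account_id may appear in many orders.
orders_per_account = Counter(account_id for _, account_id in orders)

print(pk_is_unique)              # True
print(dict(orders_per_account))  # {1: 1, 2: 2, 3: 2}
```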


## Creating the MultiMetadata

As with the `MultiDataset`, we can create the `MultiMetadata` by providing the individual `Metadata` objects for each table.

In [7]:
from ydata.metadata import Metadata

account_meta = Metadata(account_data)
order_meta = Metadata(order_data)

In [8]:
from ydata.metadata.multimetadata import MultiMetadata

metadata = MultiMetadata(new_data, {"account": account_meta, "order": order_meta}, schema=schema)
print(metadata)

MultiMetadata Summary

Number of tables: 2
 
  Table name  # cols Primary keys  Foreign keys PK characteristics      FK characteristics Notes
0    account       4   account_id                             [id]                              
1      order       6     order_id  [account_id]               [id]  {'account_id': ['id']}      


## Synthesizer definition, training, and sampling

We can now train the synthesizer by creating a `MultiTableSynthesizer` and passing the data and the metadata.

In [9]:
from ydata.synthesizers.multitable.model import MultiTableSynthesizer

synth = MultiTableSynthesizer()
synth.fit(new_data, metadata)

INFO: 2023-12-28 10:24:19,962 (1/2) - Fitting table: [account]
INFO: 2023-12-28 10:24:20,314 [SYNTHESIZER] - Number columns considered for synth: 4
INFO: 2023-12-28 10:24:20,376 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2023-12-28 10:24:20,379 [SYNTHESIZER] - Preprocess segment
INFO: 2023-12-28 10:24:20,381 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-12-28 10:24:20,382 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2023-12-28 10:24:20,686 (2/2) - Fitting table: [order]
INFO: 2023-12-28 10:24:21,466 [SYNTHESIZER] - Number columns considered for synth: 10
INFO: 2023-12-28 10:24:21,678 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2023-12-28 10:24:21,681 [SYNTHESIZER] - Preprocess segment
INFO: 2023-12-28 10:24:21,684 [SYNTHESIZER] - Synthesizer init.
INFO: 2023-12-28 10:24:21,685 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.


<ydata.synthesizers.multitable.model.MultiTableSynthesizer at 0x7f5767f77af0>

To generate the synthetic data, we call the `sample` method.

To preserve consistency across tables and referential integrity, the number of records to sample from a trained synthesizer is set as a ratio of the original number of records rather than as an absolute count (e.g., `1.0` is equivalent to the size of the original database).

In [10]:
sample = synth.sample(n_samples=1.)

INFO: 2023-12-28 10:24:23,732 (1/2) - Synthesizing table: account
INFO: 2023-12-28 10:24:23,733 [SYNTHESIZER] - Start generating model samples.
INFO: 2023-12-28 10:24:24,024 (2/2) - Synthesizing table: order
INFO: 2023-12-28 10:24:24,086 [SYNTHESIZER] - Start generating model samples.
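
To get a feel for what a given ratio means, the target size of each table can be estimated from the original row counts reported in the `MultiDataset` summary (4500 accounts, 6471 orders). The `expected_rows` helper below is a hypothetical illustration, not an SDK function, and the numbers are targets rather than guarantees: the synthesizer decides how many child rows each synthetic parent gets.

```python
def expected_rows(original_counts, ratio):
    """Scale each table's original row count by the sampling ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return {table: round(n * ratio) for table, n in original_counts.items()}


# Row counts taken from the MultiDataset summary earlier in this tutorial.
original_counts = {"account": 4500, "order": 6471}

print(expected_rows(original_counts, 1.0))  # {'account': 4500, 'order': 6471}
print(expected_rows(original_counts, 2.0))  # {'account': 9000, 'order': 12942}
```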


We can now display the sampled data of the `account` and `order` tables.

In [11]:
sample['account'].to_pandas()

|  | account_id | district_id | frequency | date |
|---|---|---|---|---|
| 0 | 1 | 28 | POPLATEK MESICNE | 931123 |
| 1 | 2 | 44 | POPLATEK MESICNE | 960725 |
| 2 | 3 | 55 | POPLATEK MESICNE | 950310 |
| 3 | 4 | 73 | POPLATEK MESICNE | 971209 |
| 4 | 5 | 60 | POPLATEK PO OBRATU | 930414 |
| ... | ... | ... | ... | ... |
| 4495 | 4496 | 69 | POPLATEK MESICNE | 960830 |
| 4496 | 4497 | 55 | POPLATEK MESICNE | 931222 |
| 4497 | 4498 | 74 | POPLATEK MESICNE | 951019 |
| 4498 | 4499 | 66 | POPLATEK MESICNE | 970523 |


In [12]:
sample['order'].to_pandas()

|  | order_id | account_id | bank_to | account_to | amount | k_symbol |
|---|---|---|---|---|---|---|
| 0 | 1 | 1.0 | QR | 24663144.0 | 3600.0 | SIPO |
| 1 | 2 | 2.0 | GH | 9780645.0 | 3146.0 | SIPO |
| 2 | 3 | 3.0 | AB | 60398054.0 | 3954.0 | SIPO |
| 3 | 4 | 3.0 | AB | 57657045.0 | 2885.0 | SIPO |
| 4 | 6 | 4.0 | AB | 88882463.0 | 1836.0 | POJISTNE |
| ... | ... | ... | ... | ... | ... | ... |
| 6383 | 7668 | 3749.0 | AB | 4569364.0 | 1616.0 | UVER |
| 6384 | 7669 | 3750.0 | AB | 8696937.0 | 2989.0 | UVER |
| 6385 | 7670 | 3751.0 | MN | 34760767.0 | 3312.0 | UVER |
| 6386 | 7671 | 3751.0 | WX | 26197701.0 | 4285.0 | LEASING |
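
A quick way to confirm that referential integrity survived sampling is to check that every `account_id` in the synthetic `order` table also exists in the synthetic `account` table. In the output above, the foreign key is rendered as a float (`1.0`) while the parent key is an integer, so the sketch below normalizes both sides before comparing. The `orphan_keys` helper is hypothetical, assumes integer-valued keys, and is shown on the first five rows of each sampled table rather than on the full data.

```python
import math


def orphan_keys(child_keys, parent_keys):
    """Return the sorted foreign-key values that have no matching parent key."""
    def normalize(keys):
        # Skip missing values (None or NaN) and compare the rest as integers,
        # so that 3.0 in the child table matches 3 in the parent table.
        return {
            int(k) for k in keys
            if k is not None and not (isinstance(k, float) and math.isnan(k))
        }
    return sorted(normalize(child_keys) - normalize(parent_keys))


# Keys copied from the first five rows of the sampled tables shown above.
account_ids = [1, 2, 3, 4, 5]
order_account_ids = [1.0, 2.0, 3.0, 3.0, 4.0]

print(orphan_keys(order_account_ids, account_ids))  # []
```

On the full sample, the two arguments could be the `account_id` columns of `sample['order'].to_pandas()` and `sample['account'].to_pandas()`. An empty list means no order points to a missing account.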
