# Multi-Table Synthesis - Database from multiple CSV files

### Multiple files can also serve as databases

For many data science scenarios, a single-table model is usually the go-to, but databases and other relational storage systems matter for more complex use cases, such as systems testing, replicating a database for customer segmentation, or migrating data between on-premises and cloud environments.

YData Fabric offers an easy-to-use and familiar interface through code to support Multi-Table Synthesis from multiple local or cloud storage files. With a few lines of code, users can replicate full relational databases while maintaining the consistency of all the keys and the statistical information of cross-table relations. 

[Berka](https://data.world/lpetrocelli/czech-financial-dataset-real-anonymized-transactions) is the dataset chosen to demo Fabric Multi-Table synthesis properties and interface. 

## Creating your Database

In this example, we downloaded multiple CSV files and use them to define our Database for synthesis. It is also possible to use a connector to read the data from another type of storage, such as AWS S3 or Azure Blob.

Each data file must represent a table of our Database. To create the **Database** definition we must respect the following structure: 

```python
{
    'table1_name': Dataset,
    'table2_name': Dataset,
    # ... one entry per table
}
```

For this flow, we added all the database CSV files to a folder called `data` and used pandas to read every file with a `.csv` extension.

In [1]:
import os
import pprint

import pandas as pd 

from ydata.dataset import Dataset

# To print the dictionary structure (Optional)
pp = pprint.PrettyPrinter(indent=4)

datasets = {}

for f in os.listdir('data'):
    if f.endswith(".csv"):
        path = os.path.join('data', f)
        df = pd.read_csv(path, sep=';')
        dataset = Dataset(df)
        
        datasets[f.replace('.csv', '')] = dataset
        
pp.pprint(datasets)

{   'account': <ydata.dataset.dataset.Dataset object at 0x7f51a418bbb0>,
    'card': <ydata.dataset.dataset.Dataset object at 0x7f520d1d9180>,
    'client': <ydata.dataset.dataset.Dataset object at 0x7f51a3fe5420>,
    'disp': <ydata.dataset.dataset.Dataset object at 0x7f520d1d9a50>,
    'district': <ydata.dataset.dataset.Dataset object at 0x7f51c572bd00>,
    'loan': <ydata.dataset.dataset.Dataset object at 0x7f51a418b1f0>,
    'order': <ydata.dataset.dataset.Dataset object at 0x7f51a418ad10>}
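
As a variation on the loop above, the table names can be derived from the file names with `pathlib`. This is only a minimal sketch: `discover_csv_tables` is a hypothetical helper, not part of the Fabric SDK, and it assumes that each table name should be the file name without its extension.

```python
from pathlib import Path


def discover_csv_tables(folder):
    """Map table name (file stem) to path for every CSV file in `folder`.

    Hypothetical helper for illustration; not part of the ydata SDK.
    """
    tables = {}
    for path in sorted(Path(folder).iterdir()):
        # Compare the lowercased suffix so 'loan.csv' and 'LOAN.CSV' both match.
        if path.is_file() and path.suffix.lower() == ".csv":
            tables[path.stem] = path
    return tables
```

Each returned path could then be read with `pd.read_csv(path, sep=';')` and wrapped in a `Dataset`, exactly as in the cell above.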


## Setting the Database schema

To define our Database properly, a schema is needed. When using an `RDBMS`, the storage system itself already provides it, but when using files we need to define the relationships between the tables/files manually.

You can do that by reading a pre-defined schema from a YAML file or by defining it manually through code. See the code examples below.

### Reading from a yaml file

The relationships are represented as a dictionary with each table's primary and foreign keys. Below is an example of what your YAML file with the table relationships is expected to look like:

```yml
table1_name: 
  primary_keys: [table1_PK]

table2_name:
  primary_keys: [table2_PK]
  foreign_keys:
    - {column: table1_PK, parent_table: table1_name, parent_column: table1_PK}
```
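
Once a file with this structure is loaded into a Python dictionary (as in the next cell), a quick sanity check can catch typos in table or column names before they reach the synthesizer. The sketch below is an illustration under stated assumptions: `check_schema` is a hypothetical helper, not a Fabric API, and it assumes every foreign key is expected to point at a declared primary key of an existing table.

```python
def check_schema(schema):
    """Return a list of problems found in a schema dictionary.

    Hypothetical helper for illustration; not part of the ydata SDK.
    Assumes each table maps to a dict with an optional 'primary_keys' list
    and an optional 'foreign_keys' list of
    {'column', 'parent_table', 'parent_column'} entries.
    """
    problems = []
    for table, spec in schema.items():
        spec = spec or {}  # an empty YAML entry is loaded as None
        for fk in spec.get("foreign_keys") or []:
            parent = fk["parent_table"]
            if parent not in schema:
                problems.append(
                    f"{table}.{fk['column']}: unknown parent table '{parent}'"
                )
                continue
            parent_pks = (schema[parent] or {}).get("primary_keys") or []
            if fk["parent_column"] not in parent_pks:
                problems.append(
                    f"{table}.{fk['column']}: '{fk['parent_column']}' "
                    f"is not a primary key of '{parent}'"
                )
    return problems


# The same structure as the YAML example above, written as a dictionary.
example_schema = {
    "table1_name": {"primary_keys": ["table1_PK"]},
    "table2_name": {
        "primary_keys": ["table2_PK"],
        "foreign_keys": [
            {
                "column": "table1_PK",
                "parent_table": "table1_name",
                "parent_column": "table1_PK",
            }
        ],
    },
}

print(check_schema(example_schema))  # prints []
```

Returning a list of messages rather than raising on the first problem makes it easier to fix several mistakes in one pass.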

In [2]:
from yaml import safe_load

with open("data/schema.yml", "r") as f:
    schema = safe_load(f)
    
pp.pprint(schema)

{   'account': {   'foreign_keys': [   {   'column': 'district_id',
                                           'parent_column': 'A1',
                                           'parent_table': 'district'}],
                   'primary_keys': ['account_id']},
    'card': {   'foreign_keys': [   {   'column': 'disp_id',
                                        'parent_column': 'disp_id',
                                        'parent_table': 'disp'}],
                'primary_keys': ['card_id']},
    'client': {   'foreign_keys': [   {   'column': 'district_id',
                                          'parent_column': 'A1',
                                          'parent_table': 'district'}],
                  'primary_keys': ['client_id']},
    'disp': {   'foreign_keys': [   {   'column': 'account_id',
                                        'parent_column': 'account_id',
                                        'parent_table': 'account'},
                                    {   'co

### Defining it manually through code

In [17]:
from ydata.dataset.multidataset import MultiTableSchema

schema = MultiTableSchema({"account": {"primary_keys": ["account_id"]},
                           "order": {"primary_keys": ["order_id"]},
                           "district": {}})

# Add a new primary key to an existing table
schema.add_primary_key("district", "A1")

# Add a new foreign key to the schema definition
schema.add_foreign_key(table="order", column="account_id",
                       parent_table="account", parent_column="account_id",
                       relation_type="1-n")

pp.pprint(schema)

{   'account': TableSchema(primary_keys=['account_id'], foreign_keys=[]),
    'district': TableSchema(primary_keys=['A1'], foreign_keys=[]),
    'order': TableSchema(primary_keys=['order_id'], foreign_keys=[ForeignReference(table='order', column='account_id', parent_table='account', parent_column='account_id', relation_type=<RelationType.ONE_TO_MANY: '1-n'>)])}


## Creating the MultiDataset

Now that we have both the dictionary of tables and the schema, we can define our Database as a `MultiDataset` object.

In [4]:
from ydata.dataset.multidataset import MultiDataset

database = MultiDataset(datasets, schema=schema)
print(database)

MultiDataset Summary

Number of tables: 7
 
  Table name  Num cols  Num rows  Primary keys             Foreign keys Notes
0   district        16        77          [A1]                               
1     client         3      5369   [client_id]            [district_id]      
2    account         4      4500  [account_id]            [district_id]      
3       disp         4      5369     [disp_id]  [account_id, client_id]      
4       card         4       892     [card_id]                [disp_id]      
5       loan         7       682     [loan_id]             [account_id]      
6      order         6      6471    [order_id]             [account_id]      


Let's display the data of the `order` table.

In [5]:
database['order'].head()

|  | order_id | account_id | bank_to | account_to | amount | k_symbol |
|---|---|---|---|---|---|---|
| 0 | 29401 | 1 | YZ | 87144583 | 2452.0 | SIPO |
| 1 | 29402 | 2 | ST | 89597016 | 3372.7 | UVER |
| 2 | 29403 | 2 | QR | 13943797 | 7266.0 | SIPO |
| 3 | 29404 | 3 | WX | 83084338 | 1135.0 | SIPO |
| 4 | 29405 | 3 | CD | 24485939 | 327.0 |  |


## Creating the MultiMetadata

As with the `MultiDataset`, we need to create the metadata of our Database before we can train our `MultiTableSynthesizer`. For that purpose, we will use the `MultiMetadata` object.

The metadata is an important step in the synthesis process: it not only describes the quality of the data in general terms, but also extracts insights about the relations between the different tables.

In [6]:
from ydata.metadata.multimetadata import MultiMetadata

database_metadata = MultiMetadata(database)
print(database_metadata)

[########################################] | 100% Completed | 101.93 ms

## Training & sampling a Database Synthetic Data generator

We can now train the synthesizer by creating a `MultiTableSynthesizer` and passing the data and the metadata.

In [7]:
from ydata.synthesizers import MultiTableSynthesizer

synth = MultiTableSynthesizer()
synth.fit(database, database_metadata)

INFO: 2024-02-07 23:11:30,932 (1/7) - Fitting table: [district]
[########################################] | 100% Completed | 103.17 ms


<ydata.synthesizers.multitable.model.MultiTableSynthesizer at 0x7f5177d623b0>

To generate the synthetic data we call the `sample` method.

To preserve the consistency of the tables and their referential integrity, the number of records to sample from a trained synthesizer is set as a ratio of the original number of records (e.g., 1.0 is equivalent to the size of the original database).
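
To make the ratio concrete, the sketch below scales the row counts from the `MultiDataset` summary shown earlier. It only illustrates the arithmetic: `approximate_sizes` is a hypothetical helper, it assumes the ratio applies to each table's original size, and the synthesizer may return different counts in order to keep the keys consistent.

```python
# Row counts copied from the MultiDataset summary shown earlier.
original_rows = {
    "district": 77,
    "client": 5369,
    "account": 4500,
    "disp": 5369,
    "card": 892,
    "loan": 682,
    "order": 6471,
}


def approximate_sizes(row_counts, ratio):
    """Scale each table's row count by `ratio`, rounded to whole rows.

    Hypothetical helper for illustration; the actual sizes produced by the
    synthesizer may differ.
    """
    return {table: round(n * ratio) for table, n in row_counts.items()}


# A ratio of 1.0 keeps the original sizes; 0.5 targets roughly half of them.
print(approximate_sizes(original_rows, 0.5))
```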

In [10]:
sample = synth.sample(n_samples=1.)
print(sample)

INFO: 2024-02-07 23:14:39,041 (1/7) - Synthesizing table: district
INFO: 2024-02-07 23:14:39,043 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-07 23:14:39,584 (2/7) - Synthesizing table: client
INFO: 2024-02-07 23:14:39,686 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-07 23:14:40,346 (3/7) - Synthesizing table: disp
INFO: 2024-02-07 23:14:40,378 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-07 23:14:41,149 (4/7) - Synthesizing table: card
INFO: 2024-02-07 23:14:41,189 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-07 23:14:41,561 (5/7) - Synthesizing table: account
INFO: 2024-02-07 23:14:41,666 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-07 23:14:42,052 (6/7) - Synthesizing table: order
INFO: 2024-02-07 23:14:42,095 [SYNTHESIZER] - Start generating model samples.
INFO: 2024-02-07 23:14:43,210 (7/7) - Synthesizing table: loan
INFO: 2024-02-07 23:14:43,267 [SYNTHESIZER] - Start generating model samples.

We can now display the sampled data of the `account` and `order` tables.

In [11]:
sample['account'].head()

|  | account_id | district_id | frequency | date |
|---|---|---|---|---|
| 0 | 1 | 5.0 | POPLATEK MESICNE | 950324 |
| 1 | 2 | 1.0 | POPLATEK MESICNE | 930226 |
| 2 | 3 | 1.0 | POPLATEK MESICNE | 970707 |
| 3 | 4 | 18.0 | POPLATEK MESICNE | 970530 |
| 4 | 5 | 5.0 | POPLATEK MESICNE | 970707 |
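
Because key consistency is the main promise of multi-table synthesis, it can be worth checking that every foreign-key value in a child table has a matching key in its parent table. The following is a minimal sketch of such a check on plain Python rows: `orphan_keys` is a hypothetical helper, not a Fabric API, and the rows are toy values for illustration, not output from the synthesizer.

```python
def orphan_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return the foreign-key values in child_rows with no match in parent_rows.

    Hypothetical helper for illustration; not part of the ydata SDK.
    """
    parent_keys = {row[pk_column] for row in parent_rows}
    return sorted({row[fk_column] for row in child_rows} - parent_keys)


# Toy rows for illustration only; these are not output from the synthesizer.
accounts = [{"account_id": 1}, {"account_id": 2}, {"account_id": 3}]
orders = [
    {"order_id": 29401, "account_id": 1},
    {"order_id": 29402, "account_id": 2},
    {"order_id": 29403, "account_id": 7},  # no such account
]

print(orphan_keys(orders, "account_id", accounts, "account_id"))  # prints [7]
```

An empty result means every child row points at an existing parent row. With real tables, the same check can be run on rows exported from the original or the sampled data.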
