# Database synthetic data generation

This notebook shows how to generate a synthetic version of a database using YData Fabric's proprietary process. It covers reading the source database, training a multi-table synthesizer on the original data, generating new data that replicates the properties of the real data, validating referential integrity, and writing the result to a destination database.

## Get the data from the database

In [3]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='{datasource-id}')
dataset = datasource.dataset
# Getting the precomputed Metadata, which provides the profile overview information in the labs
metadata = datasource.metadata
print(metadata)

MultiMetadata Summary

Tables Summary
Number of tables: 9
 
  Table name  # cols  # nrows  Primary keys             Foreign keys PK characteristics                           FK characteristics Notes
0     append       3       20            []                                                                                               
1   district      16       77          [a1]                                        [id]                                                   
2    account       4     4500  [account_id]            [district_id]               [id]                      {'district_id': ['id']}      
3     client       6     5369   [client_id]            [district_id]               [id]                      {'district_id': ['id']}      
4       disp       4     5369     [disp_id]  [client_id, account_id]               [id]  {'client_id': ['id'], 'account_id': ['id']}      
5       loan       9      682     [loan_id]             [account_id]          

## Synthetic data generation



### Configure the MultiTableSynthesizer 

The configuration of the synthetic data generator depends on the type of database and on the quality and intended use expected of the synthetic database. For other configuration options, see the [YData Fabric Academy](https://github.com/ydataai/academy/blob/master/2-%20Synthetic%20Data/MultiTable).


In [None]:
from ydata.synthesizers import MultiTableSynthesizer

synth = MultiTableSynthesizer()
synth.fit(dataset, metadata)

INFO: 2024-08-07 10:27:46,066 (1/9) - Fitting table: [district]
INFO: 2024-08-07 10:27:49,473 [SYNTHESIZER] - Number columns considered for synth: 16
INFO: 2024-08-07 10:27:49,945 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2024-08-07 10:27:49,949 [SYNTHESIZER] - Preprocess segment
INFO: 2024-08-07 10:27:49,957 [SYNTHESIZER] - Synthesizer init.
INFO: 2024-08-07 10:27:49,959 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2024-08-07 10:27:50,291 (2/9) - Fitting table: [account]
INFO: 2024-08-07 10:27:54,232 [SYNTHESIZER] - Number columns considered for synth: 21
INFO: 2024-08-07 10:27:54,864 [SYNTHESIZER] - Starting the synthetic data modeling process over 1x1 blocks.
INFO: 2024-08-07 10:27:54,867 [SYNTHESIZER] - Preprocess segment
INFO: 2024-08-07 10:27:54,874 [SYNTHESIZER] - Synthesizer init.
INFO: 2024-08-07 10:27:54,876 [SYNTHESIZER] - Processing the data prior fitting the synthesizer.
INFO: 2024-08-07 10:27:55,897 (3/
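
The log shows `district` being fitted before `account`, which is consistent with parent tables being processed before the tables that reference them. How Fabric actually schedules the tables is not stated here, so the following is only a minimal sketch of how a parent-first order could be derived from the foreign keys listed in the metadata summary. It uses Python's standard `graphlib` module (Python 3.9+); the `parents` mapping is a hand-written assumption copied from that summary, not something returned by the Fabric API.

```python
from graphlib import TopologicalSorter

# Child table -> set of parent tables it references through a foreign key.
# Hand-copied from the metadata summary; not produced by the Fabric API.
parents = {
    "district": set(),
    "account": {"district"},
    "client": {"district"},
    "disp": {"client", "account"},
    "loan": {"account"},
}

# static_order() yields each table only after all the tables it depends on.
fit_order = list(TopologicalSorter(parents).static_order())
print(fit_order)
```

Any order produced this way places `district` first, since every other table in the mapping depends on it directly or indirectly.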

### Generate synthetic data

In [None]:
sample = synth.sample(n_samples=1.)
print(sample)

## Validate generated data referential integrity

Validating referential integrity between the real and synthetic databases is essential to ensure that relationships between data entities are preserved. YData Fabric's metadata validation checks the synthetic data against the structure and dependencies of the real data, confirming the logical consistency needed for reliable testing and analysis.

In [None]:
from ydata.metadata.multimetadata import MultiMetadata

m_sample = MultiMetadata(sample)
print(m_sample.get_schema_validation_summary(metadata, sample, dataset))
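
Independently of Fabric's own validation summary, the core idea of a referential integrity check can be illustrated in plain Python: every foreign-key value in a child table should also appear as a primary-key value in its parent table. The sketch below is not the Fabric implementation. The toy rows, the `foreign_keys` mapping and the `find_orphans` helper are hypothetical, and only the table and column names (`district.a1`, `account.district_id`) are taken from the metadata summary.

```python
# Toy tables standing in for a sampled database (hypothetical values).
tables = {
    "district": [{"a1": 1}, {"a1": 2}],
    "account": [
        {"account_id": 10, "district_id": 1},
        {"account_id": 11, "district_id": 2},
        {"account_id": 12, "district_id": 3},  # orphan: there is no district 3
    ],
}

# (child table, foreign-key column) -> (parent table, primary-key column)
foreign_keys = {
    ("account", "district_id"): ("district", "a1"),
}


def find_orphans(tables, foreign_keys):
    """Return, per relationship, the child key values with no matching parent key."""
    orphans = {}
    for (child, fk_col), (parent, pk_col) in foreign_keys.items():
        parent_keys = {row[pk_col] for row in tables[parent]}
        child_keys = {row[fk_col] for row in tables[child]}
        orphans[(child, fk_col)] = sorted(child_keys - parent_keys)
    return orphans


orphans = find_orphans(tables, foreign_keys)
print(orphans)
```

An empty list for every relationship means no child row points at a missing parent. Here the third `account` row is reported because its `district_id` has no match.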

In [None]:
import pandas as pd

tables_info = []
# Collect the name, column count and row count of each table in the metadata
for k, table in metadata.items():
    tables_info.append({
        "Table name": k,
        "# cols": table.ncols,
        "# nrows": table.summary['nrows'],
    })

tables_info_synth = pd.DataFrame(tables_info)
tables_info_synth.to_csv('tables_info_synth.csv', index=True)

## Write to a destination database

In [None]:
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
connector = Connectors.get(uid='{connector-id}')

In [None]:
connector.write_database(data=sample)

### Pipeline outputs

In [None]:
# Expose the table summary as an inline CSV table in the pipeline UI metadata
import json

profile_pipeline_output = {
    'outputs': [
        {
            'type': 'table',
            'storage': 'inline',
            'format': 'csv',
            'header': list(tables_info_synth.columns),
            # index=False keeps each row aligned with the header columns
            'source': tables_info_synth.to_csv(header=False, index=False)
        },
    ]
}
with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(profile_pipeline_output, metadata_file)
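
One check worth making on an inline CSV table like this is that every row in `source` has exactly one field per entry in `header`. The following is a minimal, standard-library-only sketch of that check. The `header` and `rows` values are hypothetical stand-ins for `tables_info_synth`, and the payload keys simply mirror the cell above.

```python
import csv
import io
import json

# Hypothetical summary rows; in the notebook these come from tables_info_synth.
header = ["Table name", "# cols", "# nrows"]
rows = [["district", 16, 77], ["account", 4, 4500]]

# Serialise the rows without a header line, mirroring header=False above.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)

payload = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": header,
            "source": buffer.getvalue(),
        }
    ]
}

# Parse the CSV back and confirm each row has one field per header entry.
parsed = list(csv.reader(io.StringIO(payload["outputs"][0]["source"])))
aligned = all(len(row) == len(header) for row in parsed)

# The payload must also survive JSON serialisation unchanged.
encoded = json.dumps(payload)
print(aligned)
```

If a DataFrame index were written into `source` without a matching header entry, `aligned` would be `False`.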
