# Validate and profile synthetic databases

Visual profiling of real and synthetic databases is crucial to ensure that synthetic data accurately reflects the original data's structure and statistical properties. This process highlights key insights such as distribution differences, correlations, and anomalies. It enables stakeholders to assess the quality of synthetic data, ensuring it meets privacy, compliance, and usability requirements without compromising data utility.

- Why is **profiling** the best option?

*ProfileReport* is well suited to this task because of its automated profiling capabilities. It generates visual reports that include detailed summaries, distributions, correlations, and missing data insights. Its Python interface can be used inside notebooks and supports multiple output formats (e.g., HTML and JSON). Data quality profiling simplifies the comparison, making it easier to identify gaps or inconsistencies between real and synthetic datasets.
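To make "gaps or inconsistencies" concrete, here is a minimal sketch of the kind of per-column comparison that a profile report automates. It is not how *ProfileReport* is implemented; the function names `summarize` and `summary_gap` and the sample values are hypothetical, and only the Python standard library is used.

```python
from statistics import mean, pstdev


def summarize(values):
    """Summarize one column; assumes at least one non-missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": mean(present),
        "std": pstdev(present),
    }


def summary_gap(real, synth):
    """Return synthetic minus real for every summary statistic."""
    real_stats, synth_stats = summarize(real), summarize(synth)
    return {name: synth_stats[name] - real_stats[name] for name in real_stats}


# Hypothetical values for a single numeric column.
real_amounts = [100.0, 250.0, None, 400.0]
synth_amounts = [120.0, 240.0, 390.0, 410.0]

gap = summary_gap(real_amounts, synth_amounts)
print(gap)
```

A large gap in any statistic (for example, a synthetic column with no missing values where the real one has many) is the sort of signal the comparison report surfaces visually.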

## Read data

### Read the real data source (dataset & metadata)

In [1]:
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
og_connector = Connectors.get(uid='{insert-connector-uid}')

og_dataset = og_connector.read_database()

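# Build the synthetic data quality report (exported as PDF).
# NOTE: SyntheticDataProfile, ReportType, metadata and synth_sample are not
# imported or defined in this notebook; replace every {insert-...} placeholder
# with your own objects before running.
sdf = SyntheticDataProfile(report_type=ReportType.TABULAR)
data_types = {k: v.datatype for k, v in metadata.columns.items()}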

sdf.generate_report(real={insert-holdout-dataset},
                    synth=synth_sample,
                    target="{insert-target-col-name}",
                    data_types=data_types,
                    training_data={insert-training-dataset},
                    metadata=metadata,
                    pdf=True)

INFO: 2024-12-28 18:33:38,537 [MULTIMETADATA] - Initializing characteristics.

This may cause some slowdown.
Consider scattering data ahead of time and using futures.

INFO: 2024-12-28 18:34:08,154 [MULTIMETADATA] - Validating schema.
INFO: 2024-12-28 18:34:08,156 [MULTIMETADATA] - Update relationship types.

### Read the synthetic data source (dataset & metadata)

In [2]:
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
synth_connector = Connectors.get(uid='{insert-connector-uid}')

synth_dataset = synth_connector.read_database()
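Before profiling table by table, it can help to confirm that both databases expose the same tables. The sketch below works on plain lists of table names. The helper `split_tables` and the sample names are hypothetical. In the notebook the names would presumably come from `og_dataset.rdbms_schema.tables.keys()` and the equivalent attribute on `synth_dataset`, which is an assumption about the synthetic dataset object.

```python
def split_tables(og_tables, synth_tables):
    """Return (shared, only_in_original, only_in_synthetic), each sorted."""
    og, synth = set(og_tables), set(synth_tables)
    return sorted(og & synth), sorted(og - synth), sorted(synth - og)


# Hypothetical table names standing in for the two schemas.
shared, missing_in_synth, extra_in_synth = split_tables(
    ["account", "client", "loan"],
    ["account", "loan", "audit"],
)

print("Tables to compare:", shared)
print("Missing from synthetic:", missing_in_synth)
print("Only in synthetic:", extra_in_synth)
```

Looping over `shared` rather than over the original schema alone avoids a lookup error when a table was not synthesized.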

## Generate PDF report & profile

### Use a connector to store each table's profiling report

In [3]:
## Get a file storage connector to write the profiling comparison files
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
aws_conn = Connectors.get(uid='{insert-connector-uid}')
aws_client = aws_conn.connector.client

### Calculate a profile per table & write to AWS S3

In [8]:
from ydata.profiling import ProfileReport
bucket_name = 'ydata-dev'
folder = 'data-profiling'

for table in og_dataset.rdbms_schema.tables.keys():
    og_table = og_dataset[table]
    synth_table = synth_dataset[table]
    
    # Generate a profile report for each version of the table
    print(f'Calculate profile report - {table}')
    og_report = ProfileReport(og_table, title=f'Original {table}')
    synth_report = ProfileReport(synth_table, title=f'Synthetic {table}')
    
    compare = og_report.compare(synth_report)
    compare_html = compare.to_html()
    
    print(f'Writing report to AWS S3 - {table}')
    
    file_name = f'{folder}/{table}_compare_report.html'
    
    # Upload the comparison report to AWS S3
    aws_client.put_object(
        Bucket=bucket_name,
        Key=file_name,
        Body=compare_html,
        ContentType='text/html'
    )
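The upload step can be pulled out of the loop so it is easy to dry-run without touching a real bucket. This is a minimal sketch under stated assumptions: `report_key`, `upload_html`, and `RecordingClient` are hypothetical names, and encoding the HTML to UTF-8 bytes with an explicit charset is a defensive choice rather than something the notebook requires. The `put_object` keyword arguments are the same ones used in the cell above.

```python
def report_key(folder, table):
    """Build the object key used for a table's comparison report."""
    return f"{folder.strip('/')}/{table}_compare_report.html"


def upload_html(client, bucket, key, html):
    """Upload an HTML string as UTF-8 bytes and return the key written."""
    client.put_object(
        Bucket=bucket,
        Key=key,
        Body=html.encode("utf-8"),
        ContentType="text/html; charset=utf-8",
    )
    return key


class RecordingClient:
    """Hypothetical stand-in for an S3 client; it only records calls."""

    def __init__(self):
        self.calls = []

    def put_object(self, **kwargs):
        self.calls.append(kwargs)


# Dry run with the stand-in client and a placeholder HTML body.
dry_run_client = RecordingClient()
written = upload_html(
    dry_run_client,
    "ydata-dev",
    report_key("data-profiling", "loan"),
    "<html><body>report</body></html>",
)
print(written)  # data-profiling/loan_compare_report.html
```

In the notebook, `aws_client` would take the place of `RecordingClient()` and `compare_html` would be the HTML body.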

Calculate profile report - append
Writing report to AWS S3 - append
Calculate profile report - district
Writing report to AWS S3 - district
Calculate profile report - account
Writing report to AWS S3 - account
Calculate profile report - client
Writing report to AWS S3 - client
Calculate profile report - disp
Writing report to AWS S3 - disp
Calculate profile report - loan
Writing report to AWS S3 - loan
Calculate profile report - order
Writing report to AWS S3 - order
Calculate profile report - trans

This may cause some slowdown.
Consider scattering data ahead of time and using futures.

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'cannot reindex on an axis with duplicate labels')

Writing report to AWS S3 - trans
Calculate profile report - card
Writing report to AWS S3 - card
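
The warning printed for the `trans` table suggests switching off the automatic correlation calculation with `correlations={"auto": {"calculate": False}}`. One way to apply that setting only where it is needed is to build the report keyword arguments per table. This is a sketch: `profile_kwargs` and `NO_AUTO_CORRELATION` are hypothetical names, and it assumes `ProfileReport` accepts the same `correlations` setting that the warning shows for `df.profile_report`.

```python
# Hypothetical: tables for which the auto correlation step should be skipped.
NO_AUTO_CORRELATION = {"trans"}


def profile_kwargs(role, table):
    """Build keyword arguments for one table's profile report."""
    kwargs = {"title": f"{role} {table}"}
    if table in NO_AUTO_CORRELATION:
        # Setting copied from the warning message in the output above.
        kwargs["correlations"] = {"auto": {"calculate": False}}
    return kwargs


print(profile_kwargs("Original", "trans"))
print(profile_kwargs("Synthetic", "loan"))

# Assumed usage inside the loop:
# og_report = ProfileReport(og_table, **profile_kwargs("Original", table))
```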
