# Generate synthetic data quality report for multitable dataset

Evaluating the quality of synthetic data is essential to ensure it effectively mirrors the original data's characteristics while safeguarding privacy. YData Fabric employs a comprehensive set of metrics to assess synthetic data across three fundamental pillars: utility, fidelity, and privacy.

1. **Utility:** Measures how well synthetic data can replace real data in downstream applications, such as analytics or machine learning tasks. High utility ensures that models trained on synthetic data perform comparably to those trained on real data, making the synthetic data practically applicable.

2. **Fidelity:** Evaluates the degree to which synthetic data preserves the statistical properties of the original dataset. This includes maintaining distributions, correlations, and variability. High fidelity ensures that the synthetic data accurately reflects the original data's structure, making it suitable for exploratory analysis and pattern recognition.

3. **Privacy:** Assesses the extent to which synthetic data protects sensitive information from the original dataset. Effective privacy measures ensure that the synthetic data cannot be reverse-engineered to disclose personal or confidential details, thus complying with data protection regulations and ethical standards.

**Generating the Quality Report**
In this notebook, users will learn how to generate a PDF quality report to compare two databases: the synthetic and the original database. This report will leverage YData Fabric’s comprehensive metrics to provide a clear assessment of the synthetic data's utility, fidelity, and privacy for each table of a multitable dataset, ensuring its suitability for real-world applications.

The process includes:

- **Loading the datasets using Fabric connectors:** Reading the original and synthetic databases.
- **Generating a comprehensive PDF report:** Compiling results into a structured PDF document for easy sharing and interpretation.

This workflow streamlines the validation process, ensuring that stakeholders have clear and interpretable insights into the synthetic data's quality.
For more information about [Fabric's Synthetic Data quality report you can download in this link a detailed whitepaper](https://ydata.ai/synthetic-data-quality-metrics).

## Read data

### Read data real datasource (dataset & metadata)

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
og_datasource = DataSources.get(uid='{insert-datasource-id}')
og_dataset = og_datasource.dataset
# Getting the calculated Metadata to get the profile overview information in the labs
og_metadata = og_datasource.metadata

INFO: 2024-12-28 19:03:37,991 [MULTIMETADATA] - Initializing characteristics.


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


INFO: 2024-12-28 19:04:25,773 [MULTIMETADATA] - Validating schema.
INFO: 2024-12-28 19:04:25,786 [MULTIMETADATA] - Update relationship types.


### Read data synthetic datasource (dataset & metadata)

In [2]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
synth_datasource = DataSources.get(uid='{insert-datasource-id}')
synth_dataset = synth_datasource.dataset
# Getting the calculated Metadata to get the profile overview information in the labs
synth_metadata = synth_datasource.metadata

INFO: 2024-12-28 19:04:45,948 [MULTIMETADATA] - Initializing characteristics.


This may cause some slowdown.
Consider scattering data ahead of time and using futures.


INFO: 2024-12-28 19:05:16,348 [MULTIMETADATA] - Validating schema.
INFO: 2024-12-28 19:05:16,350 [MULTIMETADATA] - Update relationship types.


## Generate PDF report & profile

In [None]:
from ydata.report import SyntheticDataProfile
from ydata.report.reports.report_type import ReportType

for table in og_dataset.rdbms_schema.tables.keys():
    og_table = og_dataset[table]
    synth_table = og_dataset[table]
    
    table_metadata = og_metadata[table]
    data_types = {k: v.datatype for k, v in table_metadata.columns.items()}
    
    print(f'Calculate report metrics - {table}')
    
    try:
        #calculate PDF report per table
        sdf = SyntheticDataProfile(report_type=ReportType.TABULAR,
                                   real=og_table,
                                   synth=synth_table,
                                   training_data=og_table,
                                   data_types=data_types,
                                   metadata=table_metadata)


        sdf.generate_report(output_path=f'report_{table}.pdf')
    except:
        print(f'{table} has not enough rows to run the data quality report.')

Calculate report metrics - append
append has not enough rows to run the data quality report.
Calculate report metrics - district
INFO: 2024-12-28 19:12:14,949 [PROFILEREPORT] - Starting metrics calculation.


/home/ydata/.venv/lib/python3.10/site-packages/ydata/report/reports/syntheticdata/syntheticdata_profile.py:91: SmallTrainingDataset: Small training dataset detected. For optimal results, training data should have at least 100 rows.
  warn(


INFO: 2024-12-28 19:12:20,886 [PROFILEREPORT] - Synthetic data quality report selected target variable: None
INFO: 2024-12-28 19:12:20,906 [PROFILEREPORT] - preparing data format.
INFO: 2024-12-28 19:12:20,966 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2024-12-28 19:12:21,065 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2024-12-28 19:12:21,068 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2024-12-28 19:12:21,073 [PROFILEREPORT] - Metric [Exact Matches] took 0.00s.
INFO: 2024-12-28 19:12:21,076 [PROFILEREPORT] - Calculating metric [Identifiability Risk].
INFO: 2024-12-28 19:12:21,138 [PROFILEREPORT] - Metric [Identifiability Risk] took 0.06s.
INFO: 2024-12-28 19:12:21,142 [PROFILEREPORT] - Calculating metric [Membership Inference Risk].
INFO: 2024-12-28 19:12:21,142 [PROFILEREPORT] - Membership Disclosure Score sample size was reduce to match the dataset with size 77.
INFO: 2024-12-28 19:12:21,152 [PROFILEREPORT] - Metric [Membership Inferen



INFO: 2024-12-28 19:12:24,793 [PROFILEREPORT] - Metric [QScore] took 0.59s.


[1m# Real data records[0m: 77
[1m# Synthetic data records generated[0m: 77
[1m# Columns[0m: 16
[1mFidelity Score[0m: Excellent
[1mUtility Score[0m: Excellent
[1mPrivacy Score[0m: Poor


[1mCorrelation Similarity[0m: 1.00
[1mDistance Distribution[0m: 1.00
[1mSynthetic Classifier[0m: 1.00
[1mMutual Information[0m: 1.00
[1mMissing Values Similarity[0m: 1.00
[1mMean Similarity[0m: 1.00
[1mStd. Dev. Similarity[0m: 1.00
[1mMedian Similarity[0m: 1.00
[1mQ25% Similarity[0m: 1.00
[1mQ75% Similarity[0m: 1.00
[1mKolmogorov-Smirnov Test[0m: 1.00
[1mTotal Variation Distance[0m: 1.00
[1mCategory Coverage[0m: 1.00
[1mMissing Category Coverage[0m: 1.00
[1mRange Coverage[0m: 1.00


[1mQScore[0m: 1.00


[1mExact Matches[0m: 1.0
[1mIdentifiability Risk[0m: 1.0
[1mMembership Inference Risk[0m: 0.0
Calculate report metrics - account
INFO: 2024-12-28 19:12:33,105 [PROFILEREPORT] - Starting metrics calculation.
INFO: 2024-12-28 19:12:38,375 [PROFILEREPORT] - Synthetic data quality report selected target variable: None
INFO: 2024-12-28 19:12:38,386 [PROFILEREPORT] - preparing data format.
INFO: 2024-12-28 19:12:38,406 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2024-12-28 19:12:38,477 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2024-12-28 19:12:38,482 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2024-12-28 19:12:38,568 [PROFILEREPORT] - Metric [Exact Matches] took 0.08s.
INFO: 2024-12-28 19:12:38,572 [PROFILEREPORT] - Calculating metric [Identifiability Risk].
INFO: 2024-12-28 19:12:38,601 [PROFILEREPORT] - Metric [Identifiability Risk] took 0.03s.
INFO: 2024-12-28 19:12:38,605 [PROFILEREPORT] - Calculating metric [Membership Inference Ri



[1m# Real data records[0m: 4K
[1m# Synthetic data records generated[0m: 4K
[1m# Columns[0m: 4
[1mFidelity Score[0m: Excellent
[1mUtility Score[0m: Excellent
[1mPrivacy Score[0m: Poor


[1mCorrelation Similarity[0m: 1.00
[1mDistance Distribution[0m: 1.00
[1mSynthetic Classifier[0m: 1.00
[1mMutual Information[0m: 1.00
[1mMissing Values Similarity[0m: 1.00
[1mMean Similarity[0m: 1.00
[1mStd. Dev. Similarity[0m: 1.00
[1mMedian Similarity[0m: 1.00
[1mQ25% Similarity[0m: 1.00
[1mQ75% Similarity[0m: 1.00
[1mKolmogorov-Smirnov Test[0m: 1.00
[1mTotal Variation Distance[0m: 1.00
[1mCategory Coverage[0m: 1.00
[1mMissing Category Coverage[0m: 1.00
[1mRange Coverage[0m: 1.00


[1mQScore[0m: 1.00


[1mExact Matches[0m: 1.0
[1mIdentifiability Risk[0m: 1.0
[1mMembership Inference Risk[0m: 0.0
Calculate report metrics - client
INFO: 2024-12-28 19:12:49,214 [PROFILEREPORT] - Starting metrics calculation.
INFO: 2024-12-28 19:12:53,486 [PROFILEREPORT] - Synthetic data quality report selected target variable: None
INFO: 2024-12-28 19:12:53,503 [PROFILEREPORT] - preparing data format.
INFO: 2024-12-28 19:12:53,550 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2024-12-28 19:12:53,696 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2024-12-28 19:12:53,701 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2024-12-28 19:12:53,800 [PROFILEREPORT] - Metric [Exact Matches] took 0.10s.
INFO: 2024-12-28 19:12:53,806 [PROFILEREPORT] - Calculating metric [Identifiability Risk].
INFO: 2024-12-28 19:12:53,854 [PROFILEREPORT] - Metric [Identifiability Risk] took 0.05s.
INFO: 2024-12-28 19:12:53,859 [PROFILEREPORT] - Calculating metric [Membership Inference Ris



[1m# Real data records[0m: 5K
[1m# Synthetic data records generated[0m: 5K
[1m# Columns[0m: 6
[1mFidelity Score[0m: Excellent
[1mUtility Score[0m: Excellent
[1mPrivacy Score[0m: Poor


[1mCorrelation Similarity[0m: 1.00
[1mDistance Distribution[0m: 1.00
[1mSynthetic Classifier[0m: 1.00
[1mMutual Information[0m: 1.00
[1mMissing Values Similarity[0m: 1.00
[1mMean Similarity[0m: 1.00
[1mStd. Dev. Similarity[0m: 1.00
[1mMedian Similarity[0m: 1.00
[1mQ25% Similarity[0m: 1.00
[1mQ75% Similarity[0m: 1.00
[1mKolmogorov-Smirnov Test[0m: 1.00
[1mTotal Variation Distance[0m: 1.00
[1mCategory Coverage[0m: 1.00
[1mMissing Category Coverage[0m: 1.00
[1mRange Coverage[0m: 1.00


[1mQScore[0m: 1.00


[1mExact Matches[0m: 1.0
[1mIdentifiability Risk[0m: 1.0
[1mMembership Inference Risk[0m: 0.0
Calculate report metrics - disp
INFO: 2024-12-28 19:13:13,267 [PROFILEREPORT] - Starting metrics calculation.
INFO: 2024-12-28 19:13:17,322 [PROFILEREPORT] - Synthetic data quality report selected target variable: None
INFO: 2024-12-28 19:13:17,335 [PROFILEREPORT] - preparing data format.
INFO: 2024-12-28 19:13:17,361 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2024-12-28 19:13:17,442 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2024-12-28 19:13:17,446 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2024-12-28 19:13:17,543 [PROFILEREPORT] - Metric [Exact Matches] took 0.10s.
INFO: 2024-12-28 19:13:17,547 [PROFILEREPORT] - Calculating metric [Identifiability Risk].
INFO: 2024-12-28 19:13:17,583 [PROFILEREPORT] - Metric [Identifiability Risk] took 0.03s.
INFO: 2024-12-28 19:13:17,587 [PROFILEREPORT] - Calculating metric [Membership Inference Risk]

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


INFO: 2024-12-28 19:13:26,514 [PROFILEREPORT] - Metric [Category Coverage] took 6.71s.
INFO: 2024-12-28 19:13:26,520 [PROFILEREPORT] - Calculating metric [Missing Category Coverage].
INFO: 2024-12-28 19:13:26,722 [PROFILEREPORT] - Metric [Missing Category Coverage] took 0.20s.
INFO: 2024-12-28 19:13:26,729 [PROFILEREPORT] - Calculating metric [Range Coverage].
INFO: 2024-12-28 19:13:26,740 [PROFILEREPORT] - Metric [Range Coverage] took 0.01s.
INFO: 2024-12-28 19:13:26,743 [PROFILEREPORT] - Calculating metric [Kolmogorov-Smirnov Test].
INFO: 2024-12-28 19:13:26,751 [PROFILEREPORT] - Metric [Kolmogorov-Smirnov Test] took 0.00s.
INFO: 2024-12-28 19:13:26,756 [PROFILEREPORT] - Calculating metric [Total Variation Distance].


  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


INFO: 2024-12-28 19:13:26,960 [PROFILEREPORT] - Metric [Total Variation Distance] took 0.20s.
INFO: 2024-12-28 19:13:26,964 [PROFILEREPORT] - Calculating metric [Missing Values Similarity].
INFO: 2024-12-28 19:13:26,969 [PROFILEREPORT] - Metric [Missing Values Similarity] took 0.00s.
INFO: 2024-12-28 19:13:26,973 [PROFILEREPORT] - Calculating metric [Mutual Information].
INFO: 2024-12-28 19:13:27,270 [PROFILEREPORT] - Metric [Mutual Information] took 0.30s.
INFO: 2024-12-28 19:13:27,274 [PROFILEREPORT] - Calculating metric [Dimensionality Reduction].
INFO: 2024-12-28 19:13:27,612 [PROFILEREPORT] - Metric [Dimensionality Reduction] took 0.34s.
INFO: 2024-12-28 19:13:27,615 [PROFILEREPORT] - Calculating metric [Synthetic Classifier].
INFO: 2024-12-28 19:13:29,055 [PROFILEREPORT] - Metric [Synthetic Classifier] took 1.44s.
INFO: 2024-12-28 19:13:29,056 [PROFILEREPORT] - Calculating utility metrics.
INFO: 2024-12-28 19:13:29,066 [PROFILEREPORT] - Calculating metric [QScore].
INFO: 2024-12-

[1m# Real data records[0m: 5K
[1m# Synthetic data records generated[0m: 5K
[1m# Columns[0m: 4
[1mFidelity Score[0m: Excellent
[1mUtility Score[0m: Excellent
[1mPrivacy Score[0m: Poor


[1mCorrelation Similarity[0m: 1.00
[1mDistance Distribution[0m: 1.00
[1mSynthetic Classifier[0m: 1.00
[1mMutual Information[0m: 1.00
[1mMissing Values Similarity[0m: 1.00
[1mMean Similarity[0m: N/A
[1mStd. Dev. Similarity[0m: N/A
[1mMedian Similarity[0m: N/A
[1mQ25% Similarity[0m: N/A
[1mQ75% Similarity[0m: N/A
[1mKolmogorov-Smirnov Test[0m: N/A
[1mTotal Variation Distance[0m: 1.00
[1mCategory Coverage[0m: 1.00
[1mMissing Category Coverage[0m: 1.00
[1mRange Coverage[0m: N/A


[1mQScore[0m: 1.00


[1mExact Matches[0m: 1.0
[1mIdentifiability Risk[0m: 1.0
[1mMembership Inference Risk[0m: 0.0
Calculate report metrics - loan
INFO: 2024-12-28 19:13:33,187 [PROFILEREPORT] - Starting metrics calculation.
INFO: 2024-12-28 19:13:38,334 [PROFILEREPORT] - Synthetic data quality report selected target variable: None
INFO: 2024-12-28 19:13:38,356 [PROFILEREPORT] - preparing data format.
INFO: 2024-12-28 19:13:38,410 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2024-12-28 19:13:38,494 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2024-12-28 19:13:38,498 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2024-12-28 19:13:38,515 [PROFILEREPORT] - Metric [Exact Matches] took 0.02s.
INFO: 2024-12-28 19:13:38,518 [PROFILEREPORT] - Calculating metric [Identifiability Risk].
INFO: 2024-12-28 19:13:38,526 [PROFILEREPORT] - Metric [Identifiability Risk] took 0.01s.
INFO: 2024-12-28 19:13:38,529 [PROFILEREPORT] - Calculating metric [Membership Inference Risk]



[1m# Real data records[0m: 682
[1m# Synthetic data records generated[0m: 682
[1m# Columns[0m: 9
[1mFidelity Score[0m: Excellent
[1mUtility Score[0m: Excellent
[1mPrivacy Score[0m: Poor


[1mCorrelation Similarity[0m: 1.00
[1mDistance Distribution[0m: 1.00
[1mSynthetic Classifier[0m: 1.00
[1mMutual Information[0m: 1.00
[1mMissing Values Similarity[0m: 1.00
[1mMean Similarity[0m: 1.00
[1mStd. Dev. Similarity[0m: 1.00
[1mMedian Similarity[0m: 1.00
[1mQ25% Similarity[0m: 1.00
[1mQ75% Similarity[0m: 1.00
[1mKolmogorov-Smirnov Test[0m: 1.00
[1mTotal Variation Distance[0m: 1.00
[1mCategory Coverage[0m: 1.00
[1mMissing Category Coverage[0m: 1.00
[1mRange Coverage[0m: 1.00


[1mQScore[0m: 1.00


[1mExact Matches[0m: 1.0
[1mIdentifiability Risk[0m: 1.0
[1mMembership Inference Risk[0m: 0.0
Calculate report metrics - order
INFO: 2024-12-28 19:13:46,241 [PROFILEREPORT] - Starting metrics calculation.
INFO: 2024-12-28 19:13:50,384 [PROFILEREPORT] - Synthetic data quality report selected target variable: None
INFO: 2024-12-28 19:13:50,399 [PROFILEREPORT] - preparing data format.
INFO: 2024-12-28 19:13:50,442 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2024-12-28 19:13:50,571 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2024-12-28 19:13:50,575 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2024-12-28 19:13:50,712 [PROFILEREPORT] - Metric [Exact Matches] took 0.14s.
INFO: 2024-12-28 19:13:50,716 [PROFILEREPORT] - Calculating metric [Identifiability Risk].
INFO: 2024-12-28 19:13:50,781 [PROFILEREPORT] - Metric [Identifiability Risk] took 0.06s.
INFO: 2024-12-28 19:13:50,785 [PROFILEREPORT] - Calculating metric [Membership Inference Risk



[1m# Real data records[0m: 6K
[1m# Synthetic data records generated[0m: 6K
[1m# Columns[0m: 6
[1mFidelity Score[0m: Excellent
[1mUtility Score[0m: Excellent
[1mPrivacy Score[0m: Poor


[1mCorrelation Similarity[0m: 1.00
[1mDistance Distribution[0m: 1.00
[1mSynthetic Classifier[0m: 1.00
[1mMutual Information[0m: 1.00
[1mMissing Values Similarity[0m: 1.00
[1mMean Similarity[0m: 1.00
[1mStd. Dev. Similarity[0m: 1.00
[1mMedian Similarity[0m: 1.00
[1mQ25% Similarity[0m: 1.00
[1mQ75% Similarity[0m: 1.00
[1mKolmogorov-Smirnov Test[0m: 1.00
[1mTotal Variation Distance[0m: 1.00
[1mCategory Coverage[0m: 1.00
[1mMissing Category Coverage[0m: 1.00
[1mRange Coverage[0m: 1.00


[1mQScore[0m: 1.00


[1mExact Matches[0m: 1.0
[1mIdentifiability Risk[0m: 1.0
[1mMembership Inference Risk[0m: 0.0
Calculate report metrics - trans
INFO: 2024-12-28 19:14:20,783 [PROFILEREPORT] - Starting metrics calculation.


This may cause some slowdown.
Consider scattering data ahead of time and using futures.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.


INFO: 2024-12-28 19:14:29,206 [PROFILEREPORT] - Synthetic data quality report selected target variable: None
INFO: 2024-12-28 19:14:29,308 [PROFILEREPORT] - preparing data format.
INFO: 2024-12-28 19:14:29,739 [PROFILEREPORT] - Preparing the data for metrics calculation
INFO: 2024-12-28 19:14:31,318 [PROFILEREPORT] - Calculating privacy metrics.
INFO: 2024-12-28 19:14:31,340 [PROFILEREPORT] - Calculating metric [Exact Matches].
INFO: 2024-12-28 19:14:33,995 [PROFILEREPORT] - Metric [Exact Matches] took 2.65s.
INFO: 2024-12-28 19:14:34,009 [PROFILEREPORT] - Calculating metric [Identifiability Risk].
INFO: 2024-12-28 19:14:36,111 [PROFILEREPORT] - Metric [Identifiability Risk] took 2.10s.
INFO: 2024-12-28 19:14:36,124 [PROFILEREPORT] - Calculating metric [Membership Inference Risk].
INFO: 2024-12-28 19:14:44,401 [PROFILEREPORT] - Metric [Membership Inference Risk] took 8.28s.
INFO: 2024-12-28 19:14:44,402 [PROFILEREPORT] - Calculating fidelity metrics.
INFO: 2024-12-28 19:14:44,433 [PROF