# Profile comparison of a database table: real vs. synthetic data

Profiling is key to comparing the quality of real and synthetic data in a database because it gives a detailed view of each dataset's characteristics, such as distribution, completeness, and consistency. Profiling both datasets reveals where they differ and where they agree, which helps verify that the synthetic data mirrors the attributes of the real data. This validation indicates whether the synthetic data is a trustworthy stand-in for the real data in testing, modeling, and decision-making.

This notebook shows how to profile a table from the original database and the corresponding table from the synthetic database, and how to compare the two profiles.
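
To make the idea concrete before running the full report, here is a minimal sketch of the kind of per-column check that profiling automates: completeness (share of non-missing values) and the mean of a numeric column, computed for a real and a synthetic table. The `column_summary` helper and the toy rows are hypothetical and only illustrate the idea; the actual report computes many more statistics.

```python
from statistics import mean


def column_summary(rows, column):
    """Return completeness and mean for one column of a list-of-dicts table."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    # Share of rows where the column has a value
    completeness = len(present) / len(values) if values else 0.0
    numeric = [v for v in present if isinstance(v, (int, float))]
    return {
        "completeness": completeness,
        "mean": mean(numeric) if numeric else None,
    }


# Toy stand-ins for a real and a synthetic table (hypothetical values)
real_rows = [{"amount": 100.0}, {"amount": 300.0}, {"amount": None}, {"amount": 200.0}]
synth_rows = [{"amount": 150.0}, {"amount": 250.0}, {"amount": 200.0}, {"amount": 200.0}]

real_summary = column_summary(real_rows, "amount")
synth_summary = column_summary(synth_rows, "amount")
print("real: ", real_summary)
print("synth:", synth_summary)
```

In this toy case the means agree but the completeness differs, which is the sort of discrepancy a side-by-side profile is meant to surface.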

## Get the original data

In [1]:
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
connector = Connectors.get(uid='{datasource-id}')

In [2]:
dataset = connector.query("SELECT * FROM berka.trans;")

## Get the synthetic data

In [3]:
# Importing YData's packages
from ydata.labs import Connectors
# Getting a previously created Connector
connectorb = Connectors.get(uid='{connector-id}')


In [4]:
synth_dataset = connectorb.query("SELECT * FROM berka_synth.trans;")
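
A comparison is easiest to read when both tables expose the same columns. The sketch below shows one way to check that, assuming the column names of each query result are available as plain lists. The `schema_diff` helper and the column lists are hypothetical and are not part of the YData API.

```python
def schema_diff(real_columns, synth_columns):
    """Return the columns that appear on only one side, preserving order."""
    real_set, synth_set = set(real_columns), set(synth_columns)
    return {
        "only_in_real": [c for c in real_columns if c not in synth_set],
        "only_in_synth": [c for c in synth_columns if c not in real_set],
    }


# Hypothetical column lists, used only for illustration
real_columns = ["trans_id", "account_id", "date", "amount", "balance"]
synth_columns = ["trans_id", "account_id", "date", "amount"]

diff = schema_diff(real_columns, synth_columns)
print(diff)
```

An empty list on both sides would suggest that the two tables can be compared column by column.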

## Generate the profile comparison

In [5]:
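from ydata.profiling import ProfileReport

# Profile the original and the synthetic table separately
report_og = ProfileReport(dataset)
report_synth = ProfileReport(synth_dataset)

# Build the comparison report and render it as an HTML string
compare = report_og.compare(report_synth)
compare_html = compare.to_html()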

INFO: 2024-08-07 15:12:13,181 Pandas backend loaded 1.5.3
INFO: 2024-08-07 15:12:13,190 Numpy backend loaded 1.23.5
INFO: 2024-08-07 15:12:13,192 Pyspark backend NOT loaded
INFO: 2024-08-07 15:12:13,193 Python backend loaded




Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'cannot reindex on an axis with duplicate labels')



Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
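
Since `compare_html` is a plain HTML string, it can also be written to disk for viewing outside the pipeline. The sketch below uses a placeholder string and a temporary directory so that it runs on its own. The `save_report` helper and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def save_report(html, directory, filename="profile_compare.html"):
    """Write an HTML string to `directory/filename` and return the path."""
    path = Path(directory) / filename
    path.write_text(html, encoding="utf-8")
    return path


# Placeholder standing in for the `compare_html` string from the cell above
compare_html_example = "<html><body><h1>Comparison</h1></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_report(compare_html_example, tmp)
    saved_text = saved.read_text(encoding="utf-8")
    print(saved.name, len(saved_text))
```

In the notebook itself, `compare_html` would be passed in place of the placeholder, together with a persistent directory.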

### Pipeline output

In [6]:
# Add the outputs logic here
import json

# Expose the comparison report as an inline web-app output of the pipeline
profile_pipeline_output = {
    'outputs': [
        {
            'type': 'web-app',
            'storage': 'inline',
            'source': compare_html,
        },
    ]
}

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(profile_pipeline_output, metadata_file)
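
As a sanity check, the metadata file can be read back and inspected before the pipeline step finishes. The sketch below assumes the only required keys are the three used in the cell above (`type`, `storage`, `source`). That is an assumption drawn from this notebook, not a confirmed specification, and the `build_ui_metadata` and `validate_ui_metadata` helpers are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def build_ui_metadata(html):
    """Wrap an HTML string in the same structure used in the cell above."""
    return {"outputs": [{"type": "web-app", "storage": "inline", "source": html}]}


def validate_ui_metadata(metadata):
    """Check that every output entry carries the keys assumed to be required."""
    outputs = metadata.get("outputs")
    if not isinstance(outputs, list) or not outputs:
        return False
    required = {"type", "storage", "source"}  # assumption, see lead-in
    return all(isinstance(o, dict) and required <= o.keys() for o in outputs)


# Placeholder standing in for `compare_html`
example_html = "<html><body>compare</body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mlpipeline-ui-metadata.json"
    path.write_text(json.dumps(build_ui_metadata(example_html)), encoding="utf-8")
    # Read the file back to confirm it is valid JSON with the expected shape
    loaded = json.loads(path.read_text(encoding="utf-8"))

is_valid = validate_ui_metadata(loaded)
print("metadata valid:", is_valid)
```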
