# Comparing dataset profiles

Comparing data profiles across datasets is useful for several reasons:

**Ensuring Consistency and Quality:** Identifies discrepancies and maintains data quality when merging datasets.

**Detecting Anomalies:** Highlights outliers and unexpected patterns that are not visible when a dataset is inspected in isolation.

**Schema Comparison:** Checks that key fields match in name and type for seamless data integration.

**Benchmarking:** Assesses data against industry or internal standards for quality and completeness.

**Resource Allocation:** Guides where to focus cleaning and preprocessing efforts.

**Historical Analysis:** Tracks data evolution over time for trend analysis and quality monitoring.

**Feature Engineering:** Aids in creating robust machine learning models by understanding feature variations.

By comparing dataset profiles, you can enhance the reliability and effectiveness of your data-driven processes.

## Read the datasets from the Catalog

### Real dataset

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Read the real Dataset from its DataSource (replace the placeholder with the real data source's uid)
datasource = DataSources.get(uid='{datasource_id}')
dataset = datasource.dataset
# Retrieve the precomputed Metadata, which holds the profile overview shown in the Labs
metadata = datasource.metadata
print(metadata)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 15
Number of rows: 32561
Duplicate rows: 0
Target column:

Column detail:
            Column    Data type Variable type Characteristics
0              age    numerical           int                
1        workclass  categorical        string                
2           fnlwgt    numerical           int                
3        education  categorical        string                
4    education-num  categorical           int                
5   marital-status  categorical        string                
6       occupation  categorical        string                
7     relationship  categorical        string                
8             race  categorical        string                
9              sex  categorical        string                
10    capital-gain    numerical           int                
11    capital-loss    numerical   

### Synthetic dataset

In [3]:
# Importing YData's packages
from ydata.labs import DataSources
# Read the synthetic Dataset from its DataSource (replace the placeholder with the synthetic data source's uid)
datasource_synth = DataSources.get(uid='{datasource_id}')
dataset_synth = datasource_synth.dataset
# Retrieve the precomputed Metadata, which holds the profile overview shown in the Labs
metadata_synth = datasource_synth.metadata
print(metadata_synth)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 15
Number of rows: 32561
Duplicate rows: 0
Target column:

Column detail:
            Column    Data type Variable type Characteristics
0              age    numerical           int                
1        workclass  categorical        string                
2           fnlwgt    numerical           int                
3        education  categorical        string                
4    education-num  categorical           int                
5   marital-status  categorical        string                
6       occupation  categorical        string                
7     relationship  categorical        string                
8             race  categorical        string                
9              sex  categorical        string                
10    capital-gain    numerical           int                
11    capital-loss    numerical   
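Since both summaries list the same columns, a quick programmatic check can confirm that the schemas line up before profiling. The following is a minimal sketch in plain Python. The two dictionaries are hand-copied from the fully printed rows of the summaries above (the truncated rows are left out), and `compare_schemas` is a hypothetical helper for illustration, not part of the YData SDK.

```python
# Column -> (data type, variable type), hand-copied from the printed metadata
# summaries above. Only the rows that are fully printed are included.
REAL_SCHEMA = {
    "age": ("numerical", "int"),
    "workclass": ("categorical", "string"),
    "fnlwgt": ("numerical", "int"),
    "education": ("categorical", "string"),
    "education-num": ("categorical", "int"),
    "marital-status": ("categorical", "string"),
    "occupation": ("categorical", "string"),
    "relationship": ("categorical", "string"),
    "race": ("categorical", "string"),
    "sex": ("categorical", "string"),
    "capital-gain": ("numerical", "int"),
}
# The synthetic summary prints identical rows, so it starts as a copy.
SYNTH_SCHEMA = dict(REAL_SCHEMA)


def compare_schemas(left, right):
    """Return columns missing on either side and columns whose types differ."""
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    type_mismatch = {
        col: (left[col], right[col])
        for col in left
        if col in right and left[col] != right[col]
    }
    return {
        "only_left": only_left,
        "only_right": only_right,
        "type_mismatch": type_mismatch,
    }


result = compare_schemas(REAL_SCHEMA, SYNTH_SCHEMA)
print(result)
```

Three empty entries in the result mean the listed columns agree in name, data type, and variable type. Any non-empty entry points at a column worth inspecting before the profiles are compared.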

## Profile a dataset

In [4]:
from ydata.profiling import ProfileReport

INFO: 2024-08-08 23:38:05,375 Pandas backend loaded 1.5.3
INFO: 2024-08-08 23:38:05,383 Numpy backend loaded 1.23.5
INFO: 2024-08-08 23:38:05,385 Pyspark backend NOT loaded
INFO: 2024-08-08 23:38:05,386 Python backend loaded


In [None]:
report = ProfileReport(dataset, title='Census original')
report_synth = ProfileReport(dataset_synth, title='Census synthetic')
report.to_file('census_original_profiling.html')  # Saves the report as a shareable HTML file
report_synth.to_file('census_synthetic_profiling.html')
# Compare the synthetic data with the original data
comparison_report = report.compare(report_synth)
comparison_report.to_file("comparison_profiling_report.html")

# The profiling report can also be stored as JSON using the command below
#report.to_file('census_compare_profiling.json')
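Once reports are exported as JSON, a few overview numbers can also be compared programmatically. The sketch below assumes that each JSON export has a top-level `table` object holding dataset-level statistics. That key name, the helpers `load_table_stats` and `diff_numeric_stats`, and the two file names are assumptions made for illustration, so check them against the JSON your version produces.

```python
import json
from pathlib import Path


def load_table_stats(path):
    """Load a JSON report and return its overview section.

    Assumption: the overview lives under a top-level "table" key.
    An empty dict is returned when that key is absent.
    """
    with open(path, encoding="utf-8") as fh:
        return json.load(fh).get("table", {})


def diff_numeric_stats(left, right):
    """Return {key: (left, right, right - left)} for numeric keys in both dicts."""
    diffs = {}
    for key in sorted(set(left) & set(right)):
        a, b = left[key], right[key]
        # bool is a subclass of int in Python, so exclude flags explicitly.
        if isinstance(a, bool) or isinstance(b, bool):
            continue
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            diffs[key] = (a, b, b - a)
    return diffs


# Hypothetical file names: one JSON export per report.
original_path = Path("census_original_profiling.json")
synthetic_path = Path("census_synthetic_profiling.json")

# Only run the comparison when both exports are actually present.
if original_path.exists() and synthetic_path.exists():
    stats = diff_numeric_stats(
        load_table_stats(original_path),
        load_table_stats(synthetic_path),
    )
    for key, (orig_value, synth_value, delta) in stats.items():
        print(f"{key}: original={orig_value} synthetic={synth_value} delta={delta}")
```

Non-numeric entries such as nested objects are skipped on purpose, so the output stays a flat list of counts that is easy to scan or to assert on in a data quality check.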