# How to compare two dataset profiles with ydata-sdk

Understanding how datasets differ, whether across time periods, environments, or sources, is critical for building trustworthy data pipelines and robust machine learning models. This notebook demonstrates how to use `ydata-sdk` to compare dataset profiles in a structured, scalable way.

We’ll work through a real-world example from [Kaggle: Loan Eligible Dataset](https://www.kaggle.com/datasets/vikasukani/loan-eligible-dataset), splitting it by `Gender` into two subsets, profiling each one, and comparing the resulting reports.

### Why compare dataset profiles?
Comparing profiles across datasets can help you:

- **Ensure Data Consistency:** Detect schema misalignments or variable inconsistencies across environments.
- **Maintain Data Quality:** Spot nulls, skewness, duplicates, or low-information variables before integration.
- **Detect Anomalies:** Identify unexpected changes or outliers not visible in isolated analysis.
- **Support Schema Validation:** Ensure critical fields align structurally and statistically when joining data.
- **Benchmark Against Standards:** Measure datasets against internal baselines or industry quality benchmarks.
- **Track Historical Trends:** Understand how data evolves over time to support quality monitoring or drift detection.
- **Guide Preprocessing and Feature Engineering:** Focus cleaning and transformation efforts where they matter most.

By comparing dataset profiles using ydata-sdk, you can ensure alignment, spot risks early, and gain deeper trust in your data-driven workflows.
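Several of the checks listed above (schema alignment, missing values) can be approximated by hand to get a feel for what a profile comparison surfaces. The following is a minimal sketch in plain pandas, not part of `ydata-sdk`; the helper name `quick_compare` and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd


def quick_compare(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Report, per column, presence, dtype, and missing-value rate in each DataFrame."""
    rows = []
    for col in sorted(set(left.columns) | set(right.columns)):
        row = {"column": col}
        for side, frame in (("left", left), ("right", right)):
            present = col in frame.columns
            row[f"in_{side}"] = present
            row[f"dtype_{side}"] = str(frame[col].dtype) if present else None
            row[f"null_rate_{side}"] = (
                float(frame[col].isna().mean()) if present else float("nan")
            )
        rows.append(row)
    return pd.DataFrame(rows).set_index("column")


# Hypothetical toy data: "area" exists only on the left, "term" only on the right
left = pd.DataFrame({"income": [5720.0, 3076.0, None], "area": ["Urban", "Urban", "Rural"]})
right = pd.DataFrame({"income": [2226, 4666, 2083], "term": [360, 360, 180]})

summary = quick_compare(left, right)
print(summary)
```

A table like this only covers a small part of what a full profile report contains, but it makes schema mismatches and differences in missing-value rates visible at a glance.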

### Authenticate with your YData account

In [1]:
# Authenticate with your ydata-sdk token - https://dashboard.ydata.ai/
import os

os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'
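If you would rather not hardcode the key in the notebook, one option is to export `YDATA_LICENSE_KEY` in your shell and only verify that it is present before continuing. The sketch below uses the standard library only; the helper `require_env` is a hypothetical name introduced here, not a `ydata-sdk` function.

```python
import os


def require_env(name: str, placeholder: str = "{add-your-key}") -> str:
    """Return an environment variable, failing fast if it is unset or still the placeholder."""
    value = os.environ.get(name, "").strip()
    if not value or value == placeholder:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return value


# Check for the key without printing its value
try:
    require_env("YDATA_LICENSE_KEY")
    print("License key found.")
except RuntimeError as err:
    print(err)
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later, while profiling.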

## Read the datasets from a CSV file

### First dataset: male applicants

In [3]:
import pandas as pd
from ydata.dataset import Dataset

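# Load the loan data; replace the placeholder with the path to your CSV file
df = pd.read_csv('insert-file-path.csv')

# Split the data into the two subsets to compare
df_a = df[df['Gender'] == 'Male']
df_b = df[df['Gender'] == 'Female']

# Wrap the first subset in a ydata-sdk Dataset and preview it
dataset = Dataset(df_a)
dataset.head()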


| Unnamed: 0 | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 | 0 | 110.0 | 360.0 | 1.0 | Urban |
| 1 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 | 1500 | 126.0 | 360.0 | 1.0 | Urban |
| 2 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 | 1800 | 208.0 | 360.0 | 1.0 | Urban |
| 3 | LP001035 | Male | Yes | 2 | Graduate | No | 2340 | 2546 | 100.0 | 360.0 |  | Urban |
| 4 | LP001051 | Male | No | 0 | Not Graduate | No | 3276 | 0 | 78.0 | 360.0 | 1.0 | Urban |
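One detail of the split is worth checking: filtering with `== 'Male'` and `== 'Female'` leaves out any row whose `Gender` is missing or holds another value, so the two subsets may not add up to the full dataset. The sketch below shows one way to make that visible in plain pandas; `split_by_value` and the toy frame are hypothetical and not taken from the notebook's data.

```python
import pandas as pd


def split_by_value(df: pd.DataFrame, column: str, first, second):
    """Split df into two subsets by column value and return the rows that match neither."""
    part_a = df[df[column] == first].copy()
    part_b = df[df[column] == second].copy()
    leftover = df[~df[column].isin([first, second])]
    return part_a, part_b, leftover


# Hypothetical toy frame with one missing Gender value
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", None, "Male"],
        "LoanAmount": [110.0, 59.0, 75.0, 126.0],
    }
)

males, females, leftover = split_by_value(toy, "Gender", "Male", "Female")
print(f"male={len(males)} female={len(females)} leftover={len(leftover)}")
```

If `leftover` is not empty on your data, decide explicitly whether those rows should be dropped, imputed, or profiled as a third group.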


### Creating a second dataset to compare

In [4]:
dataset_b = Dataset(df_b)
dataset_b.head()

| Unnamed: 0 | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | LP001055 | Female | No | 1 | Not Graduate | No | 2226 | 0 | 59.0 | 360.0 | 1.0 | Semiurban |
| 14 | LP001096 | Female | No | 0 | Graduate | No | 4666 | 0 | 124.0 | 360.0 | 1.0 | Semiurban |
| 21 | LP001124 | Female | No | 3+ | Not Graduate | No | 2083 | 0 | 28.0 | 180.0 | 1.0 | Urban |
| 23 | LP001135 | Female | No | 0 | Not Graduate | No | 3765 | 0 | 125.0 | 360.0 | 1.0 | Urban |
| 30 | LP001177 | Female | No | 0 | Not Graduate | No | 2478 | 0 | 75.0 | 360.0 | 1.0 | Semiurban |


## Profile a dataset

In [7]:
from ydata.profiling import ProfileReport

# Step 1: Create a ProfileReport from the first dataset
report = ProfileReport(dataset, title='Male loans profiling')
report.to_file('male_loans_profiling.html')


In [8]:
# Step 2: Create a ProfileReport from the second dataset
report2 = ProfileReport(dataset_b, title='Female loans profiling')
report2.to_file('female_loans_profiling.html')


In [9]:
# Step 3: Compare the results of both reports
compare_report = report.compare(report2)
compare_report.to_file("comparison_profiling_report.html")


In [10]:
# Display the comparison report inline in the notebook
compare_report
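As a lightweight cross-check of the numeric statistics in the comparison report, the same subsets can be summarized side by side with plain pandas. This is a minimal sketch and not a replacement for the report; `describe_side_by_side` is a hypothetical helper, and the toy values below are made up for illustration.

```python
import pandas as pd


def describe_side_by_side(frames: dict) -> pd.DataFrame:
    """Place describe() statistics of several DataFrames next to each other.

    Rows are the described columns; the result's columns form a two-level
    index of (dataset name, statistic).
    """
    return pd.concat(
        {name: frame.describe().T for name, frame in frames.items()},
        axis=1,
    )


# Hypothetical toy subsets standing in for df_a and df_b
male = pd.DataFrame({"LoanAmount": [100.0, 200.0], "ApplicantIncome": [3000, 5000]})
female = pd.DataFrame({"LoanAmount": [50.0, 150.0], "ApplicantIncome": [2000, 4000]})

table = describe_side_by_side({"male": male, "female": female})

# Compare the mean of each numeric column across the two subsets
print(table.loc[:, [("male", "mean"), ("female", "mean")]])
```

In the notebook itself, the call would be `describe_side_by_side({"male": df_a, "female": df_b})`, assuming `df_a` and `df_b` are the pandas DataFrames created earlier.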

