# Dataset profiling

Data profiling is a crucial step in the data analysis process, especially when working with large datasets. It involves examining the data to understand its structure, content, and quality. Through data profiling, analysts can identify patterns, anomalies, missing values, and inconsistencies, which are essential for ensuring the accuracy and reliability of the data before any further analysis or modeling.

For larger datasets, Fabric Labs let you trigger data profiling on demand, so profiling runs only when it is needed and can be tailored to the dataset at hand. Integrating pipelines into the workflow also makes profiling a recurring process: data quality checks are applied consistently as the dataset evolves, which keeps the data ready for analysis.

This notebook shows how to trigger profiling for a dataset.

## Authenticate with your YData account

In [1]:
# Authenticate with your ydata-sdk token - https://dashboard.ydata.ai/
import os

os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'

## Read your dataset
You can read your dataset with the pandas `read_csv` function as usual, or you can use the [`ydata-sdk` connectors](https://docs.sdk.ydata.ai/latest/connectors/).

In [2]:
import pandas as pd
from ydata.dataset import Dataset

df = pd.read_csv('insert-file-path.csv')
dataset = Dataset(df)
dataset.head()


| Unnamed: 0 | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 | 0 | 110.0 | 360.0 | 1.0 | Urban |
| 1 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 | 1500 | 126.0 | 360.0 | 1.0 | Urban |
| 2 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 | 1800 | 208.0 | 360.0 | 1.0 | Urban |
| 3 | LP001035 | Male | Yes | 2 | Graduate | No | 2340 | 2546 | 100.0 | 360.0 |  | Urban |
| 4 | LP001051 | Male | No | 0 | Not Graduate | No | 3276 | 0 | 78.0 | 360.0 | 1.0 | Urban |
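
As a rough illustration of what profiling measures, the sketch below computes a few per-column statistics with plain pandas on the five rows shown above (restricted to a subset of the columns). `quick_profile` is a hypothetical helper written for this example; it is not part of `ydata-sdk` and covers only a small fraction of what a full profiling report contains.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, distinct values.

    Hypothetical helper for illustration only, not a ydata-sdk function.
    """
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(1),
            "distinct": df.nunique(dropna=True),
        }
    )


# The five preview rows from the output above, limited to a few columns.
sample = pd.DataFrame(
    {
        "Loan_ID": ["LP001015", "LP001022", "LP001031", "LP001035", "LP001051"],
        "Education": ["Graduate", "Graduate", "Graduate", "Graduate", "Not Graduate"],
        "ApplicantIncome": [5720, 3076, 5000, 2340, 3276],
        "LoanAmount": [110.0, 126.0, 208.0, 100.0, 78.0],
        "Credit_History": [1.0, 1.0, 1.0, None, 1.0],  # one missing value
    }
)

summary = quick_profile(sample)
print(summary)
```

On this sample, `Credit_History` is the only column with a missing value (one of five rows). A check like this is handy for a first look, while the full report adds distributions, correlations, and alerts on top.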


## Profile a dataset

In [3]:
from ydata.profiling import ProfileReport

report = ProfileReport(dataset, title='Loans profiling')
report.to_file('profiling.html')  # Save the report as a shareable HTML file

# The profiling report can also be stored as JSON using the command below
# report.to_file('profiling.json')

INFO: 2025-05-07 17:29:34,308 Pandas backend loaded 2.2.3
INFO: 2025-05-07 17:29:34,313 Numpy backend loaded 2.1.3
INFO: 2025-05-07 17:29:34,313 Pyspark backend NOT loaded
INFO: 2025-05-07 17:29:34,314 Python backend loaded



In [5]:
report

