# Dataset profiling

Data profiling is a crucial step in the data analysis process, especially when working with large datasets. It involves examining the data to understand its structure, content, and quality. Through data profiling, analysts can identify patterns, anomalies, missing values, and inconsistencies, which are essential for ensuring the accuracy and reliability of the data before any further analysis or modeling.
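As a rough illustration of the checks described above, the following sketch computes a few basic quality indicators with pandas. It is not how Fabric computes its profile. The helper name `quick_profile` and the sample values are hypothetical and exist only for this example.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> dict:
    """Return a few basic quality indicators for a DataFrame."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Count of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Inferred storage type of each column
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Tiny illustrative frame (made-up values, not the census data)
sample = pd.DataFrame(
    {
        "age": [39, 50, 50, None],
        "workclass": ["State-gov", "Private", "Private", "Private"],
    }
)
summary = quick_profile(sample)
print(summary)
```

A full profiling report goes well beyond these counts (distributions, correlations, alerts), but the same questions sit underneath it.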

For larger datasets, Fabric Labs can trigger data profiling on demand, so profiling runs only when it is needed and can be tailored to the dataset at hand. Integrating pipelines into the workflow also makes profiling a recurring process: data quality checks are applied consistently as the dataset evolves, keeping the data ready for analysis.
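When profiling runs on a schedule, one practical detail is keeping each run's report separate so that earlier results are not overwritten. The sketch below builds a timestamped output path using only the standard library. The helper `report_path`, the `reports` directory, and the naming scheme are assumptions for illustration, not part of Fabric.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def report_path(
    output_dir: str, dataset_name: str, run_time: Optional[datetime] = None
) -> Path:
    """Build a per-run report path with a UTC timestamp in the file name."""
    # Assumes run_time, when given, is already expressed in UTC
    run_time = run_time or datetime.now(timezone.utc)
    stamp = run_time.strftime("%Y%m%dT%H%M%SZ")
    return Path(output_dir) / f"{dataset_name}_profiling_{stamp}.html"


# Path for a run started at 2024-08-08 23:27:10 UTC
example = report_path(
    "reports", "census", datetime(2024, 8, 8, 23, 27, 10, tzinfo=timezone.utc)
)
print(example.name)  # census_profiling_20240808T232710Z.html
```

A pipeline step could then pass the resulting path to the report's `to_file` call, so every scheduled run leaves its own file behind.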

This notebook shows how to trigger profiling for a dataset.

## Read the data from the Catalog

In [1]:
# Importing YData's packages
from ydata.labs import DataSources
# Reading the Dataset from the DataSource
datasource = DataSources.get(uid='{datasource_id}')
dataset = datasource.dataset
# Retrieving the precomputed Metadata, which holds the profile overview shown in the Labs
metadata = datasource.metadata
print(metadata)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 15
Number of rows: 32561
Duplicate rows: 0
Target column:

Column detail:
            Column    Data type Variable type Characteristics
0              age    numerical           int                
1        workclass  categorical        string                
2           fnlwgt    numerical           int                
3        education  categorical        string                
4    education-num  categorical           int                
5   marital-status  categorical        string                
6       occupation  categorical        string                
7     relationship  categorical        string                
8             race  categorical        string                
9              sex  categorical        string                
10    capital-gain    numerical           int                
11    capital-loss    numerical   

## Profile a dataset

In [2]:
from ydata.profiling import ProfileReport

INFO: 2024-08-08 23:27:10,160 Pandas backend loaded 1.5.3
INFO: 2024-08-08 23:27:10,168 Numpy backend loaded 1.23.5
INFO: 2024-08-08 23:27:10,170 Pyspark backend NOT loaded
INFO: 2024-08-08 23:27:10,171 Python backend loaded


In [3]:
report = ProfileReport(dataset, title='Census dataset')
report.to_file('census_profiling.html')  # Saves the report as a shareable HTML file

# The profiling report can also be stored as JSON by using a .json file extension
#report.to_file('census_profiling.json')


In [4]:
report
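If the report is also saved as JSON, it can be inspected programmatically outside the notebook. The sketch below loads such a file with the standard library and lists its top-level sections. The helper `summarize_report` is hypothetical, and the `variables` key is an assumption about the export layout. The demo file is a made-up stand-in, not real profiling output.

```python
import json
import tempfile
from pathlib import Path


def summarize_report(json_path: str) -> dict:
    """Load a JSON profiling export and list its top-level sections."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return {
        "sections": sorted(data),
        # Assumes per-column details live under a "variables" key
        "n_variables": len(data.get("variables", {})),
    }


# Stand-in file with a made-up structure, used only to exercise the function
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_profile.json"
    demo.write_text(
        json.dumps({"table": {"n": 3}, "variables": {"age": {}, "sex": {}}}),
        encoding="utf-8",
    )
    result = summarize_report(str(demo))

print(result)
```

Listing the sections first is a cheap way to confirm what an export actually contains before relying on any particular key.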


