# **2. DATA PROFILING**

**YDATA-PROFILING LIBRARY**

*ydata_profiling* automatically generates a profiling report from a pandas DataFrame, giving a quick first understanding of the data.

The following information is presented in an interactive report:

*Overview*: global details and summary statistics for the whole dataset.

*Alerts*: a list of potential data quality issues (*e.g.,* high correlation, skewness, uniformity, zeros, missing values, constant values).

*Reproduction*: technical details about the analysis (time, version and configuration).

*Variables* (for each column): data type, distinct values, missing values, quantile statistics (*e.g.,* min, median, max, Q1, Q3, range), descriptive statistics (*e.g.,* mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, skewness), histograms, common and extreme values.

*Interactions & Correlations* between variables (heatmap).

*Missing Values*: count, matrix, heatmap.

*Sample of the data* and *Duplicated rows*.
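To make the per-variable statistics listed above more concrete, the following is a minimal sketch of how a few of them could be computed by hand with pandas. The helper `describe_column` and the toy series are hypothetical and only for illustration; the exact definitions used inside ydata_profiling (for example, the degrees of freedom of the standard deviation) may differ.

```python
import pandas as pd


def describe_column(s: pd.Series) -> dict:
    """Hand-rolled subset of per-variable statistics for a numeric column.

    Hypothetical helper: it mimics the kind of numbers a profiling report
    shows, not the library's actual implementation.
    """
    values = s.dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    mean = values.mean()
    std = values.std()  # sample standard deviation (ddof=1)
    return {
        "n_missing": int(s.isna().sum()),
        "n_distinct": int(values.nunique()),
        "min": values.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values.max(),
        "range": values.max() - values.min(),
        "iqr": q3 - q1,
        "mean": mean,
        "std": std,
        "cv": std / mean,  # coefficient of variation
        "mad": (values - median).abs().median(),  # median absolute deviation
        "skewness": values.skew(),
    }


# Toy data (made up for this example), with one missing value
toy = pd.Series([1.0, 2.0, 2.0, 3.0, None, 10.0], name="toy")
stats = describe_column(toy)
print(stats)
```

The profiling report computes these values for every column at once, which is the main reason to prefer it over hand-written code for a first look at a dataset.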

In [1]:
import sys
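# install into the same interpreter that runs this notebook;
# the quotes protect the [notebook] extra from shell globbing
!{sys.executable} -m pip install -U "ydata-profiling[notebook]"
!{sys.executable} -m pip install jupyter-contrib-nbextensions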



In [2]:
!jupyter nbextension enable --py widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [3]:
!pip install dataprofiler



In [4]:
from ydata_profiling import ProfileReport
import pandas as pd
import json

In [5]:
BEERS = pd.read_csv('https://raw.githubusercontent.com/camillasancricca/DATADIQ/master/BEERS.csv')

In [6]:
#create a profile report
profile = ProfileReport(BEERS, title="BEERS REPORT")

In [7]:
# export the report to HTML
profile.to_file("BEERS_REPORT.html")


In [9]:
#create a profile report in json
profile.to_file("BEERS_REPORT.json")


In [10]:
# load the JSON report; the context manager closes the file afterwards
with open("BEERS_REPORT.json") as file:
    jsonfile = json.load(file)

In [12]:
#inspect json profile report
jsonfile

{'analysis': {'title': 'BEERS REPORT',
  'date_start': '2025-10-16 19:40:58.178124',
  'date_end': '2025-10-16 19:41:05.548209'},
 'time_index_analysis': 'None',
 'table': {'n': 2419,
  'n_var': 7,
  'memory_size': 135596,
  'record_size': 56.05456800330715,
  'n_cells_missing': 1074,
  'n_vars_with_missing': 3,
  'n_vars_all_missing': 0,
  'p_cells_missing': 0.06342644540246856,
  'types': {'Numeric': 4, 'Text': 3},
  'n_duplicates': 9,
  'p_duplicates': 0.0037205456800330715},
 'variables': {'abv': {'n_distinct': 74,
   'p_distinct': 0.031395842172252865,
   'is_unique': False,
   'n_unique': 9,
   'p_unique': 0.003818413237165889,
   'type': 'Numeric',
   'hashable': True,
   'value_counts_without_nan': {'0.05': 216,
    '55.0': 159,
    '0.06': 126,
    '65.0': 124,
    '0.052': 107,
    '0.07': 93,
    '45.0': 90,
    '48.0': 72,
    '0.0579999999999999': 66,
    '0.0559999999999999': 66,
    '51.0': 62,
    '62.0': 60,
    '53.0': 60,
    '49.0': 59,
    '0.08': 58,
    '47.0': 5

In [13]:
jsonfile['table']['n']  # number of rows

2419
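The ratios in the `table` block of the JSON report follow directly from its counts. As a small worked check, the sketch below copies the relevant values from the output above into a plain dictionary and recomputes the ratios; the formulas are inferred from the numbers shown, not taken from the library's documentation.

```python
# Values copied from the 'table' block of the JSON report shown above
table = {
    "n": 2419,                 # rows
    "n_var": 7,                # columns
    "memory_size": 135596,     # bytes
    "n_cells_missing": 1074,
    "n_duplicates": 9,
}

n_cells = table["n"] * table["n_var"]                    # 2419 * 7 = 16933 cells
p_cells_missing = table["n_cells_missing"] / n_cells     # share of missing cells
p_duplicates = table["n_duplicates"] / table["n"]        # share of duplicated rows
record_size = table["memory_size"] / table["n"]          # average bytes per row

print(f"p_cells_missing = {p_cells_missing:.6f}")
print(f"p_duplicates    = {p_duplicates:.6f}")
print(f"record_size     = {record_size:.4f}")
```

The recomputed values agree with `p_cells_missing`, `p_duplicates` and `record_size` in the report, so roughly 6.3% of all cells are missing and about 0.37% of the rows are duplicates.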

In [14]:
jsonfile['variables']['ibu']['n_distinct']

107
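Looking up one key at a time becomes tedious with many columns. A minimal sketch of a helper that flattens the `variables` block into one row per column is shown below. The function `summarize_variables` is hypothetical, and the field names `n_missing` and `p_missing` are assumptions (they do not appear in the truncated output above), so missing fields are returned as `None` rather than raising an error. The `excerpt` dictionary only reuses values visible in the outputs above.

```python
def summarize_variables(report, fields=("type", "n_distinct", "n_missing", "p_missing")):
    """Return one dict per variable with the requested fields.

    Hypothetical helper: fields absent from the report are set to None.
    """
    rows = []
    for name, info in report.get("variables", {}).items():
        row = {"variable": name}
        for field in fields:
            row[field] = info.get(field)
        rows.append(row)
    return rows


# Small excerpt shaped like the JSON report, using values shown above
excerpt = {
    "table": {"n": 2419, "n_var": 7},
    "variables": {
        "abv": {"n_distinct": 74, "type": "Numeric", "is_unique": False},
        "ibu": {"n_distinct": 107},
    },
}

rows = summarize_variables(excerpt)
for row in rows:
    print(row)
```

With the full report loaded, the same call would be `summarize_variables(jsonfile)`, and the resulting list can be passed to `pd.DataFrame` for a tabular view.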

**dataprofiler LIBRARY** (**alternative library**; reports similar information to ydata_profiling, in a different structure)

In [16]:
from dataprofiler import Profiler

# build the profile directly from the pandas DataFrame
profile = Profiler(BEERS)

# the "compact" format omits the visualization info, so the report is more readable
readable_report = profile.report(report_options={"output_format": "compact"})
readable_report

INFO:DataProfiler.profilers.profile_builder:Finding the Null values in the columns... 
100%|██████████| 7/7 [00:00<00:00, 173.11it/s]
INFO:DataProfiler.profilers.profile_builder:Calculating the statistics... 
100%|██████████| 7/7 [00:01<00:00,  4.13it/s]


{'global_stats': {'samples_used': 2419,
  'column_count': 7,
  'row_count': 2419,
  'row_has_null_ratio': 0.4171,
  'row_is_null_ratio': 0.0,
  'unique_row_ratio': 0.9963,
  'duplicate_row_count': 9,
  'file_type': "<class 'pandas.core.frame.DataFrame'>",
  'encoding': None,
  'correlation_matrix': None,
  'chi2_matrix': '[[ 1.,  0., nan, nan,  0., nan,  0.], ... , [ 0.,  0., nan, nan,  0., nan,  1.]]',
  'profile_schema': {'abv': [0],
   'ibu': [1],
   'id': [2],
   'name': [3],
   'style': [4],
   'brewery_id': [5],
   'ounces': [6]},
  'times': {'row_stats': 0.0038}},
 'data_stats': [{'column_name': 'abv',
   'data_type': 'float',
   'data_label': 'FLOAT',
   'categorical': True,
   'order': 'random',
   'samples': "['0.05', '0.05', '63.0', '0.052', '0.06']",
   'statistics': {'min': 0.027,
    'max': 128.0,
    'mode': '[0.0909865]',
    'median': 46.0033,
    'sum': 81624.327,
    'mean': 34.6306,
    'variance': 994.5591,
    'stddev': 31.5366,
    'skewness': 0.0752,
    'kurtos