# 2. Exploratory data analysis - In-depth profiling

The first step in a data preparation pipeline is exploratory data analysis (EDA). Data exploration and data cleansing go hand in hand: they are iterative steps, and each pass through one informs the next pass through the other.

*But what does data exploration include? And how can we explore the data more effectively, given that we are building a credit scorecard model?*


Data exploration includes both univariate and bivariate analysis and ranges from univariate statistics and frequency distributions to correlations, cross-tabulation, and characteristic analysis.
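As a rough illustration of the analyses listed above, the sketch below runs univariate statistics, a frequency distribution, correlations, and a cross-tabulation with plain pandas. The `toy` frame and its values are invented for illustration; only the column names mirror the credit dataset used in this notebook, and the age bands are an arbitrary assumption.

```python
import pandas as pd

# Toy data, invented for illustration only; column names mirror the credit dataset.
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [0, 0, 1, 0, 1, 0, 0, 0],
    "age": [45, 38, 27, 61, 33, 52, 49, 40],
    "DebtRatio": [0.30, 0.12, 0.85, 0.05, 0.67, 0.22, 0.18, 0.40],
})

# Univariate: summary statistics and the frequency distribution of the target.
summary = toy[["age", "DebtRatio"]].describe()
target_freq = toy["SeriousDlqin2yrs"].value_counts(normalize=True)

# Bivariate: rank correlations between all columns.
correlations = toy.corr(method="spearman")

# Bivariate: share of each target class within (arbitrarily chosen) age bands.
age_band = pd.cut(toy["age"], bins=[0, 35, 50, 120], labels=["<=35", "36-50", ">50"])
crosstab = pd.crosstab(age_band, toy["SeriousDlqin2yrs"], normalize="index")

print(summary, target_freq, correlations, crosstab, sep="\n\n")
```

A profiling report automates these same checks for every column at once, which is why the rest of this notebook relies on one instead of hand-written calls.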

## Read the data & computed metadata

### Import needed packages

In [1]:
import os

from pickle import load
import pandas as pd

from ydata.utils.formats import read_json
from ydata.metadata import Metadata
from ydata.labs.datasources import DataSources

from ydata.profiling import ProfileReport

INFO: 2022-12-08 22:37:45,959 Pandas backend loaded 1.2.3
INFO: 2022-12-08 22:37:45,967 Numpy backend loaded 1.21.2
INFO: 2022-12-08 22:37:45,969 Pyspark backend NOT loaded
INFO: 2022-12-08 22:37:45,969 Python backend loaded
INFO: 2022-12-08 22:37:46,380 generated new fontManager


In [2]:
dataset = DataSources.get(uid='{datasource-id}').read()

In [3]:
# Use the label set by the pipeline, falling back to the dataset's target column.
label = os.environ.get('LABEL_NAME', 'SeriousDlqin2yrs')

In [4]:
meta = Metadata.load('metadata.pkl')
print(meta)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 11
Duplicate rows: 274
Target column:

Column detail:
                                  Column    Data type Variable type
0                       SeriousDlqin2yrs  categorical           int
1   RevolvingUtilizationOfUnsecuredLines    numerical         float
2                                    age    numerical           int
3   NumberOfTime30-59DaysPastDueNotWorse    numerical           int
4                              DebtRatio    numerical         float
5                          MonthlyIncome    numerical         float
6        NumberOfOpenCreditLinesAndLoans    numerical           int
7                NumberOfTimes90DaysLate    numerical           int
8           NumberRealEstateLoansOrLines    numerical           int
9   NumberOfTime60-89DaysPastDueNotWorse    numerical           int
10                    NumberOfDependents    numerical     

## Generating the full data profile

In [5]:
# Fall back to local defaults when the pipeline does not set these variables.
data_path = os.environ.get('DATASET_PATH', 'train.csv')
data_split = os.environ.get('DATA_SPLIT', 'train')

data = pd.read_csv(data_path)
data = data.drop('Unnamed: 0', axis=1)

In [6]:
# Note: this rebinds ProfileReport, replacing the ydata.profiling import from cell 1.
from pandas_profiling import ProfileReport

print(f'Profile Name: {data_split}_profile')
profile = ProfileReport(df=data, title=f'Credit scoring data profile ({data_split})')
profile.config.html.navbar_show = False

profile.to_file(f'{data_split}_profile.html')

Profile Name: train_profile



## Exporting the label ratio and the HTML report as pipeline artifacts

In [7]:
y = data[label]
ratio_labels = pd.DataFrame(y.value_counts(normalize=True))

In [8]:
print(ratio_labels)

   SeriousDlqin2yrs
0           0.93332
1           0.06668


In [9]:
import json

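metadata = {
    'outputs': [
        {
            # Class balance of the target, rendered as a two-column table.
            'type': 'table',
            'storage': 'inline',
            'format': 'csv',
            # One header entry per CSV column: the index (class) and its share.
            'header': [label, 'ratio'],
            'source': ratio_labels.to_csv(header=False, index=True),
        },
        {
            # Full profiling report, embedded as inline HTML.
            'type': 'web-app',
            'storage': 'inline',
            'source': profile.to_html(),
        },
    ]
}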

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(metadata, metadata_file)
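Before handing the file to the pipeline UI, it can help to confirm that the inline table is internally consistent. The sketch below is one possible check, not part of the original notebook: `check_table_output` is a hypothetical helper, and the `example` payload uses made-up values shaped like the table output above. It writes the payload to a temporary directory, reads it back, and verifies that every CSV row has as many fields as the header.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def check_table_output(output: dict) -> None:
    """Raise ValueError if an inline CSV table's header does not match its rows."""
    rows = list(csv.reader(io.StringIO(output["source"])))
    widths = {len(row) for row in rows}
    expected = len(output["header"])
    if widths != {expected}:
        raise ValueError(f"header has {expected} columns, rows have {sorted(widths)}")


# Hypothetical payload with made-up values, shaped like the table output above.
example = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": ["label", "ratio"],
            "source": "0,0.9\n1,0.1\n",
        }
    ]
}

# Round-trip through a temporary file so the working directory is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mlpipeline-ui-metadata.json"
    path.write_text(json.dumps(example))
    loaded = json.loads(path.read_text())

for output in loaded["outputs"]:
    if output["type"] == "table":
        check_table_output(output)
```

The same loop could be pointed at the real `mlpipeline-ui-metadata.json` once it has been written.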