# 2. Exploratory data analysis - In-depth profiling

The first step in a data preparation pipeline is exploratory data analysis (EDA). In practice, data exploration and data cleansing go hand in hand: each informs the other, and the two steps are repeated iteratively.

*But what does data exploration include? And how can we explore the data more effectively, given that we are building a credit scorecard model?*


Data exploration includes both univariate and bivariate analysis, ranging from univariate statistics and frequency distributions to correlations, cross-tabulation, and characteristic analysis. The `pandas-profiling` package automates much of this work: from a single DataFrame it generates an HTML report covering per-column statistics, distributions, missing values, and correlations.
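To make these terms concrete, here is a minimal sketch of each kind of analysis in plain pandas. The toy DataFrame is hypothetical: its column names mirror the credit dataset used in this notebook, but the values are made up, and the age bands are an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "age": [25, 38, 47, 52, 61, 33, 45, 29],
    "DebtRatio": [0.80, 0.35, 0.20, 0.15, 0.10, 0.60, 0.25, 0.90],
    "SeriousDlqin2yrs": [1, 0, 0, 0, 0, 1, 0, 0],
})

# Univariate: summary statistics and a frequency distribution.
summary = toy["age"].describe()
target_freq = toy["SeriousDlqin2yrs"].value_counts(normalize=True)

# Bivariate: pairwise correlations and a cross-tabulation against the target.
corr = toy.corr()
age_band = pd.cut(toy["age"], bins=[0, 35, 50, 120], labels=["<=35", "36-50", ">50"])
crosstab = pd.crosstab(age_band, toy["SeriousDlqin2yrs"])

# Characteristic analysis: share of "bad" customers (target == 1) per age band.
bad_rate = toy.groupby(age_band, observed=False)["SeriousDlqin2yrs"].mean()

print(summary, target_freq, corr, crosstab, bad_rate, sep="\n\n")
```

A profiling report bundles most of these views for every column at once; the manual version is mainly useful when a single characteristic needs a closer look.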

## Read the data & computed metadata

In [1]:
%%capture
!pip install pandas-profiling

### Import needed packages

In [2]:
import os

from pickle import load
import pandas as pd

from ydata.metadata import Metadata

In [3]:
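try:
    dataset_path = os.environ['DATASET_PATH']
    print(dataset_path)
except KeyError:
    # Fall back to the local training file when the variable is not set
    dataset_path = 'cs-training.csv'

# The first CSV column holds the row index
data = pd.read_csv(dataset_path, index_col=[0])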

test.csv


In [4]:
try:
    label = os.environ['LABEL_NAME']
except KeyError:
    # Default target column of the credit dataset
    label = 'SeriousDlqin2yrs'

In [5]:
meta = Metadata.load('metadata.pkl')
print(meta)

Metadata Summary

Dataset type: TABULAR
Dataset attributes:
Number of columns: 11
% of duplicate rows: 0
Target column:

Column detail:
                                  Column  Data type Variable type
0                       SeriousDlqin2yrs  numerical           int
1   RevolvingUtilizationOfUnsecuredLines  numerical         float
2                                    age  numerical           int
3   NumberOfTime30-59DaysPastDueNotWorse  numerical           int
4                              DebtRatio  numerical         float
5                          MonthlyIncome  numerical         float
6        NumberOfOpenCreditLinesAndLoans  numerical           int
7                NumberOfTimes90DaysLate  numerical           int
8           NumberRealEstateLoansOrLines  numerical           int
9   NumberOfTime60-89DaysPastDueNotWorse  numerical           int
10                    NumberOfDependents  numerical         float

0     skew
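The column summary above can be cross-checked with plain pandas. The sketch below is only an illustration on a hypothetical toy frame (the values are made up); in the notebook the same calls would run on `data`. A float type on a count-like column often hints at missing values, since pandas stores `NaN` as a float.

```python
import pandas as pd

# Hypothetical toy frame; the notebook itself would use `data` loaded from the CSV.
toy = pd.DataFrame({
    "age": [45, 40, 45, 30],
    "MonthlyIncome": [9120.0, None, 9120.0, 3300.0],
    "SeriousDlqin2yrs": [1, 0, 1, 0],
})

# One row per column: storage type, share of missing values, distinct values.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing_pct": toy.isna().mean() * 100,
    "n_unique": toy.nunique(),
})

# Share of rows that repeat an earlier row exactly.
duplicate_pct = toy.duplicated().mean() * 100

print(overview)
print(f"% of duplicate rows: {duplicate_pct:.1f}")
```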

## Generating the full data profile

In [6]:
try:
    data_split = os.environ['DATA_SPLIT']
except KeyError:
    # Default to the training split when the variable is not set
    data_split = 'train'

In [7]:
from pandas_profiling import ProfileReport

print(f'Profile Name: {data_split}_profile')
# Title the report after the credit dataset and the split being profiled
profile = ProfileReport(df=data, title=f'Credit scorecard data profile ({data_split})')
profile.config.html.navbar_show = False

profile.to_file(f'{data_split}_profile.html')

Profile Name: Test_profile



## Export the label distribution and the HTML profile as pipeline artifacts

In [8]:
y = data[label]
ratio_labels = pd.DataFrame(y.value_counts(normalize=True))

In [9]:
print(ratio_labels)

   SeriousDlqin2yrs
0          0.934575
1          0.065425
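The split printed above (about 93.5% versus 6.5%) means the target is heavily imbalanced, roughly 14 to 1. As a rough sketch of how to quantify this, the snippet below computes an imbalance ratio and the accuracy of a trivial majority-class predictor. The series is a hypothetical stand-in with made-up values; `imbalance_ratio` and `baseline_accuracy` are illustrative names, not part of any library.

```python
import pandas as pd

# Hypothetical target column: 1 = serious delinquency, 0 = none. Values are made up.
y = pd.Series([0] * 14 + [1], name="SeriousDlqin2yrs")

counts = y.value_counts()
ratios = y.value_counts(normalize=True)

# Imbalance ratio: majority-class count divided by minority-class count.
imbalance_ratio = counts.max() / counts.min()

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = ratios.max()

print(f"imbalance ratio: {imbalance_ratio:.1f}, baseline accuracy: {baseline_accuracy:.3f}")
```

With a split like this, plain accuracy says little about a scorecard model, so the baseline is worth keeping in mind when evaluating later steps.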


In [10]:
import json

# Outputs rendered by the pipeline UI: the label distribution as a table
# and the profile report as an inline web app.
metadata = {
    'outputs': [
        {
            'type': 'table',
            'storage': 'inline',
            'format': 'csv',
            # One header per CSV column: the class value and its share
            'header': [label, 'ratio'],
            'source': ratio_labels.to_csv(header=False, index=True),
        },
        {
            'type': 'web-app',
            'storage': 'inline',
            'source': profile.to_html(),
        },
    ]
}

with open('mlpipeline-ui-metadata.json', 'w') as metadata_file:
    json.dump(metadata, metadata_file)
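Before handing the file to the pipeline, it can help to reload it and check its shape. The sketch below is one possible sanity check, not part of the original notebook: `check_ui_metadata` is a hypothetical helper, the dict is a stand-in with short made-up sources, and the rule that every CSV row should have as many fields as the header is an assumption of this sketch.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the metadata dict, with short inline sources.
ui_metadata = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": ["label", "ratio"],
            "source": "0,0.93\n1,0.07\n",
        },
        {
            "type": "web-app",
            "storage": "inline",
            "source": "<html><body>profile placeholder</body></html>",
        },
    ]
}


def check_ui_metadata(path):
    """Reload the file and return the output types, raising on malformed entries."""
    with open(path) as fh:
        loaded = json.load(fh)
    types = []
    for output in loaded["outputs"]:
        # Every entry in this sketch needs a type and an inline source.
        if "type" not in output or "source" not in output:
            raise ValueError(f"malformed output entry: {output}")
        if output["type"] == "table":
            # Assumption: each CSV row has as many fields as the header.
            for row in output["source"].strip().splitlines():
                if len(row.split(",")) != len(output["header"]):
                    raise ValueError(f"row/header mismatch: {row!r}")
        types.append(output["type"])
    return types


# Write to a temporary directory so the real artifact is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mlpipeline-ui-metadata.json")
    with open(path, "w") as fh:
        json.dump(ui_metadata, fh)
    output_types = check_ui_metadata(path)

print(output_types)
```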