# Introduction

This notebook extracts useful insights from the datasets of the ‘Tabular Playground Series - Feb 2021’ competition on Kaggle. The competition poses a regression problem: predict a continuous target from the feature columns given in the data. The feature columns cat0 - cat9 are categorical, and the feature columns cont0 - cont13 are continuous.

We are going to perform a comprehensive EDA as follows:
- Automate the generic aspects of EDA with AutoViz, one of the leading free rapid EDA tools in the Python data science world
- Dive into the problem-specific advanced analytical questions and discoveries with custom manual EDA routines built on top of the standard capabilities of Plotly and Matplotlib


In [None]:
!pip install xlrd

In [None]:
!pip install AutoViz

# Initial Preparations

We are going to start with the essential prerequisites as follows:

- installing *AutoViz* into this notebook
- importing the standard Python packages we need to use down the road

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
from typing import Tuple, List, Dict

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline


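# locate the data files (set in_kaggle to False when running locally)
in_kaggle = True

def get_data_file_path(is_in_kaggle: bool) -> Tuple[str, str, str]:
    """Return the paths to the train, test, and sample submission CSV files."""
    train_path = ''
    test_path = ''
    sample_submission_path = ''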

    if is_in_kaggle:
        # running in Kaggle, inside the competition
        train_path = '../input/tabular-playground-series-feb-2021/train.csv'
        test_path = '../input/tabular-playground-series-feb-2021/test.csv'
        sample_submission_path = '../input/tabular-playground-series-feb-2021/sample_submission.csv'
    else:
        # running locally
        train_path = 'data/train.csv'
        test_path = 'data/test.csv'
        sample_submission_path = 'data/sample_submission.csv'

    return train_path, test_path, sample_submission_path



In [None]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

In [None]:
%%time
# load the training set, the test set, and the sample submission
train_set_path, test_set_path, sample_subm_path = get_data_file_path(in_kaggle)

df_train = pd.read_csv(train_set_path)
df_test = pd.read_csv(test_set_path)

subm = pd.read_csv(sample_subm_path)

# Basic Data Overview

In [None]:
df_train.info()

# Express EDA Analysis 

We are going to invoke *AutoViz*, one of the prominent free rapid EDA tools for Python, to quickly draw basic insights from the data.

## Express Analysis of Training Set

In [None]:

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
dftc = AV.AutoViz(
    filename='', 
    sep='' , 
    depVar='target', 
    dfte=df_train, 
    header=0, 
    verbose=2, 
    lowess=False, 
    chart_format='png', 
    max_rows_analyzed=300000, 
    max_cols_analyzed=30
)


## Express Analysis Insights

As we can see, the simple express EDA yielded a lot of useful insights out of the box, in less than 20 minutes of data crunching. Below are the key findings from the charts that *AutoViz* generated on a generic basis.

### Feature-to-Target Relations

We find that the training set shows the following relations between the *target* and the feature variables:

- The training set observations with *target* < 3.5 and/or *cont5* < 0.1 appear to be outliers
- The *target* variable is slightly skewed to the right
- No numeric feature is highly correlated with *target*
- The *target* distribution by the labels of the respective cat variables shows a relatively strong association of the *target* with *cat2, cat5, cat6, cat7, cat9*
- The strongest association between the *target* values and the cat labels is demonstrated by *cat7*
- In turn, the association of the *target* variable with the rest of the categorical features is weaker
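The skewness and correlation checks listed above can be reproduced by hand. The block below is a minimal sketch, not what *AutoViz* runs internally: the helper name `numeric_target_correlations` is hypothetical, and the small data frame is synthetic so that the cell runs on its own. With the real data, one would pass `df_train` instead (possibly after dropping an identifier column, if one is present).

```python
import numpy as np
import pandas as pd


def numeric_target_correlations(df: pd.DataFrame, target: str = "target") -> pd.Series:
    """Absolute Pearson correlation of every numeric column with the target,
    sorted from strongest to weakest (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Synthetic stand-in for the training set (an assumption for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cont0": rng.normal(size=200),
    "cont1": rng.normal(size=200),
})
demo["target"] = 2.0 * demo["cont0"] + rng.normal(scale=0.1, size=200)

target_skew = demo["target"].skew()  # positive values indicate a right skew
corr_demo = numeric_target_correlations(demo)

print("target skewness:", target_skew)
print(corr_demo)
```

On the competition data, a right-skewed target would show up as a positive skewness value, and the absence of a strong linear relation would show up as uniformly small correlations.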


### Numeric Feature Findings

It is demonstrated that

- There is a clear separation of the observations in the training and test sets into well-contained and well-separable clusters by the values of *cont1* (6-8 clusters observed, subject to further clustering experiments)
- The distributions of the continuous variables are identical on the training and test sets (the details for each variable are provided below)
- *cont3, cont4, cont5, cont6*, and *cont12* are highly skewed to the left
- *cont8, cont9, cont10, cont11*, and *cont13* have a multimodal distribution (presumably bimodal)
- *cont1* and *cont11* are skewed to the right
- *cont0* and *cont2* have almost normal distributions
- as per the respective violin plots on the training set, the extreme tail values of *cont0, cont5, cont6*, and *cont12* could be used for outlier removal when training ML models on the training set
- several fairly highly correlated numeric feature pairs are detected on the training and test sets (with Pearson’s correlation coefficient >= 0.6): *cont5-cont8, cont5-cont9*, and *cont5-cont12* (among them, *cont5* has the highest absolute correlation with the target variable on the training set)
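The correlated-pair finding above can be cross-checked without the charts. The following is a minimal sketch under stated assumptions: `correlated_pairs` is a hypothetical helper, the 0.6 threshold mirrors the one quoted in the list, and the columns `x`, `y`, `z` are synthetic placeholders rather than competition features.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """Numeric column pairs whose absolute Pearson correlation is at least
    `threshold`, each pair reported once (hypothetical helper)."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so (a, b) and (b, a) are not both listed
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Synthetic data: y closely follows x, z is independent of both
rng = np.random.default_rng(1)
x = rng.normal(size=500)
demo_numeric = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.2, size=500),
    "z": rng.normal(size=500),
})

pairs_demo = correlated_pairs(demo_numeric, threshold=0.6)
print(pairs_demo)
```

Running the same helper on the continuous columns of `df_train` and `df_test` separately would show whether the same pairs exceed the threshold on both sets.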


### Categorical Feature Findings

It has been detected that

- *cat0* is a two-label categorical variable, and its label distribution is unbalanced on the training set (‘A’ heavily outnumbers ‘B’)
- *cat1* is a two-label categorical variable, and it is slightly unbalanced (‘A’ vs. ‘B’)
- *cat2* is a two-label categorical variable, and its label distribution is unbalanced on the training set (‘A’ heavily outnumbers ‘B’)
- *cat3* is a four-label categorical variable, and two of its labels (‘C’, ‘A’) predominate over the rest of the labels (the latter ones can be binned into a single category label ‘Other’, to reduce the dimensionality of the respective feature space)
- *cat4* is a four-label categorical variable, and one of its labels (‘A’) predominates over the others (such labels can be binned into a single category label ‘Other’, to reduce the dimensionality of the respective feature space)
- *cat5* is a four-label categorical variable, and two of its labels (‘B’, ‘D’) predominate over the rest of the labels (the latter ones can be binned into a single category label ‘Other’, to reduce the dimensionality of the respective feature space)
- *cat6* is an eight-label categorical variable, and one of its labels (‘E’) predominates over the others (such labels can be binned into a single category label ‘Other’, to reduce the dimensionality of the respective feature space)
- *cat8* is a 7-label categorical variable, and 4 of its categories (‘C’, ‘E’, ‘G’, and ‘A’) predominate the rest of the categories on the training set (the latter ones can be binned into a single category label ‘Other’, to reduce the dimensionality of the respective feature space)
- *cat9* is a 15-label categorical variable, and 3 of its labels (‘F’, ‘I’, and ‘L’) predominate others (the latter ones can be binned into a single category label ‘Other’, to reduce the dimensionality of the respective feature space)
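Binning the infrequent labels into ‘Other’, as suggested in the list above, could look roughly like the sketch below. The helper name `bin_rare_labels` and the `min_share` cut-off are assumptions made for illustration; the right cut-off for each competition feature would need to be chosen from its actual label frequencies.

```python
import pandas as pd


def bin_rare_labels(series: pd.Series, min_share: float = 0.05,
                    other_label: str = "Other") -> pd.Series:
    """Replace labels whose relative frequency is below `min_share`
    with a single `other_label` (hypothetical helper)."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep the frequent labels as they are, overwrite the rare ones
    return series.where(~series.isin(rare), other_label)


# Synthetic four-label column: 'A' and 'C' dominate, 'B' and 'D' are rare
demo_cat = pd.Series(["A"] * 60 + ["C"] * 30 + ["B"] * 6 + ["D"] * 4)
binned = bin_rare_labels(demo_cat, min_share=0.10)

print(binned.value_counts())
```

If this is applied to the real data, the set of rare labels should be derived on the training set and then reused on the test set, so that both sets are encoded consistently.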

### Categorical-to-Numerical Feature Associations

There are quite strong associations found between the following categorical and numerical features on the training set

- cont1 by cat3
- cont5 by cat3
- cont6 by cat3
- cont9 by cat3
- cont10 by cat3
- cont11 by cat3
- cont12 by cat3
- cont0 by cat4
- cont5 by cat4
- cont6 by cat4
- cont8 by cat4
- cont9 by cat4
- cont10 by cat4
- cont11 by cat4
- cont12 by cat4
- cont13 by cat4
- all continuous variables by cat5
- all continuous variables by cat6
- all continuous variables by cat7
- all continuous variables by cat8
- cont0 by cat9
- cont1 by cat9
- cont2 by cat9
- cont5 by cat9
- cont6 by cat9
- cont8 by cat9
- cont9 by cat9
- cont10 by cat9
- cont11 by cat9
- cont12 by cat9
- cont13 by cat9
- cont1 by cat1

The above-mentioned continuous-to-categorical feature associations are also confirmed on the test set.
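The associations above were read off the *AutoViz* charts. One way to put a number on them is the correlation ratio (eta squared), which is the share of a numeric feature's variance explained by the group means of a categorical feature. This is a swapped-in technique, not something the express analysis computes; the helper name `correlation_ratio` and the toy data are assumptions for illustration.

```python
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta squared: between-group sum of squares divided by the total
    sum of squares, in the range 0..1 (hypothetical helper)."""
    grand_mean = values.mean()
    groups = values.groupby(categories)
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total) if ss_total > 0 else 0.0


# Toy data: the first numeric column is fully determined by the label,
# the second one has identical group means, so the label explains nothing
labels = pd.Series(["A", "A", "B", "B"])
eta_strong = correlation_ratio(labels, pd.Series([1.0, 1.0, 3.0, 3.0]))
eta_none = correlation_ratio(labels, pd.Series([1.0, 3.0, 1.0, 3.0]))

print(eta_strong, eta_none)
```

On the real data, a call such as `correlation_ratio(df_train["cat3"], df_train["cont1"])` would give one value per pair in the list, which makes the pairs easier to rank and to compare between the training and test sets.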

# Roadmap For Additional EDA Visualizations

The insights we quickly got from the express EDA with *AutoViz* above were very helpful per se. However, they did not address every analytical issue we would like to cover when tackling the fundamental question of what impact the features have on the *target*.

Now we are going to undertake additional manual EDA discoveries to review

- pairwise associations between selected cat variables
- multivariate associations between selected cat variables, factored by the impact of such associations on the conditional distributions of the *target* and the numeric features on the training set

While doing so, we will pay the most attention to the cat features highlighted in the express EDA above. These are

- *cat2, cat5, cat6, cat7, and cat9*, which have a good association with *target* on the training set
- *cat8*, which has a good association with every feature variable in both the training and test sets
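For the pairwise associations between cat variables, one commonly used measure is Cramér's V, which is derived from the chi-square statistic of the contingency table. The sketch below is an assumption about how such a check could be written, not part of the original analysis: `cramers_v` is a hypothetical helper, it applies no bias correction, and the label columns are toy data.

```python
import numpy as np
import pandas as pd


def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical series:
    0 means no association, 1 means a perfect one (hypothetical helper)."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence of the two variables
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0


# Toy data: identical columns are perfectly associated,
# a balanced crossing of labels shows no association
left = pd.Series(["A", "A", "B", "B"] * 25)
v_perfect = cramers_v(left, left)
v_none = cramers_v(pd.Series(["A", "A", "B", "B"]), pd.Series(["X", "Y", "X", "Y"]))

print(v_perfect, v_none)
```

Applied to the highlighted features, for example `cramers_v(df_train["cat5"], df_train["cat8"])`, this gives a single comparable score per pair before any plots are drawn.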

In [None]:
print('We are done. That is all, folks!')
finish_time = dt.datetime.now()
print("Finished at ", finish_time)
elapsed = finish_time - start_time
print("Elapsed time: ", elapsed)