# Introduction

This notebook extracts useful insights from the datasets of the ‘Tabular Playground Series - Mar 2021’ competition on Kaggle. The competition poses a binary classification problem: predict a binary target from a number of feature columns given in the data. The feature columns cat0 - cat18 are categorical, and the feature columns cont0 - cont10 are continuous.

We are going to perform a comprehensive EDA as follows
-	Automate the generic aspects of EDA with AutoViz, one of the leading free rapid EDA tools in the Python data science world
-	Dive into the problem-specific advanced analytical questions/discoveries with custom manual EDA routines built on top of the standard capabilities of Plotly and Matplotlib


In [None]:
!pip install xlrd

In [None]:
!pip install AutoViz

# Initial Preparations

We are going to start with the essential prerequisites as follows

- installing *AutoViz* into this notebook
- importing the standard Python packages we need down the road
- programming the automation routines for the repeatable data visualizations we are going to draw in the advanced analytical EDA trials down the road

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
from typing import Tuple, List, Dict

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline


# read data
in_kaggle = True

def get_data_file_path(is_in_kaggle: bool) -> Tuple[str, str, str]:
    train_path = ''
    test_path = ''
    sample_submission_path = ''

    if is_in_kaggle:
        # running in Kaggle, inside the competition
        train_path = '../input/tabular-playground-series-mar-2021/train.csv'
        test_path = '../input/tabular-playground-series-mar-2021/test.csv'
        sample_submission_path = '../input/tabular-playground-series-mar-2021/sample_submission.csv'
    else:
        # running locally
        train_path = 'data/train.csv'
        test_path = 'data/test.csv'
        sample_submission_path = 'data/sample_submission.csv'

    return train_path, test_path, sample_submission_path



In [None]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

In [None]:
%%time
# load the training set, the test set, and the sample submission
train_set_path, test_set_path, sample_subm_path = get_data_file_path(in_kaggle)

df_train = pd.read_csv(train_set_path)
df_test = pd.read_csv(test_set_path)

subm = pd.read_csv(sample_subm_path)

# Basic Data Overview

In [None]:
df_train.info()

# Express EDA Analysis 

We are going to invoke *AutoViz*, one of the prominent free rapid EDA tools for Python, to quickly draw the basic insights from the data

## Express Analysis of Training Set

In [None]:

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
dftc = AV.AutoViz(
    filename='', 
    sep='' , 
    depVar='target', 
    dfte=df_train, 
    header=0, 
    verbose=1, 
    lowess=False, 
    chart_format='png', 
    max_rows_analyzed=300000, 
    max_cols_analyzed=30
)


## Express Analysis Insights

As we can see, the simple express EDA yielded a lot of useful insights out of the box, in less than 20 minutes of data crunching. Below are the key findings from the charts generated by *AutoViz* on a generic basis.

### Target Class Labels

The training dataset has unbalanced class labels for the *target* variable. Therefore, one of the following techniques has to be adopted in pre-processing and ML down the road

- oversampling the data with SMOTE or a similar technique to balance the class labels in the resulting training set
- smart undersampling of the data to balance the class labels in the resulting training set
- using adequate class label weights when modelling with GBDT-style algorithms, as well as with any other algorithms that support class label weights
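
As a rough illustration of the class-weight option, the sketch below reports per-class counts and inverse-frequency weights for a target column. It is a minimal example under stated assumptions, not the routine used in this notebook: `class_balance_report` is a hypothetical helper, and the toy series only stands in for `df_train['target']`.

```python
import pandas as pd


def class_balance_report(target: pd.Series) -> pd.DataFrame:
    """Return per-class counts, shares, and inverse-frequency weights."""
    counts = target.value_counts().sort_index()
    n_samples = len(target)
    n_classes = len(counts)
    return pd.DataFrame({
        "count": counts,
        "share": counts / n_samples,
        # 'balanced' heuristic: n_samples / (n_classes * class_count)
        "weight": n_samples / (n_classes * counts),
    })


# Toy target standing in for df_train['target'] (hypothetical values)
toy_target = pd.Series([0] * 6 + [1] * 2, name="target")
report = class_balance_report(toy_target)
print(report)

# Ratio of negatives to positives, a common starting point for a
# positive-class weight parameter in GBDT libraries
neg_to_pos_ratio = report.loc[0, "count"] / report.loc[1, "count"]
print("negatives per positive:", neg_to_pos_ratio)
```

The weights could then be passed to whatever class-weight or sample-weight option the chosen model exposes; which option that is depends on the library and is not covered here.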

### Feature-to-Target Relations

We find that the training set data manifests the following relations between the *target* and the feature variables

- all numeric variables except *cont9* demonstrate a good association with the target
- we may want to try ML experiments with and without *cont9* to see which gives an edge
- the dataset seems to be somewhat similar to those of the Jan 2021 and Feb 2021 contests
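
One simple way to double-check the feature-to-target associations outside of *AutoViz* is to rank the numeric features by their absolute Pearson correlation with the target. The sketch below assumes a 0/1 target and uses synthetic data; `target_association` and the `cont_a` / `cont_b` columns are hypothetical names, not part of the competition data.

```python
import numpy as np
import pandas as pd


def target_association(df: pd.DataFrame, feature_cols, target_col: str = "target") -> pd.Series:
    """Absolute Pearson correlation of each numeric feature with the target."""
    corr = df[feature_cols].corrwith(df[target_col]).abs()
    return corr.sort_values(ascending=False)


# Synthetic stand-in for the training set (hypothetical columns)
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({"target": rng.integers(0, 2, size=n)})
toy["cont_a"] = toy["target"] + rng.normal(0, 0.5, size=n)  # related to the target
toy["cont_b"] = rng.normal(0, 1, size=n)                    # unrelated noise

assoc = target_association(toy, ["cont_a", "cont_b"])
print(assoc)
```

A feature that lands at the bottom of such a ranking is a natural candidate for the with/without experiments mentioned above, although a linear correlation will miss non-linear relations.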


### Numeric Feature Findings

The charts demonstrate that

- There is a clear separation of the observations in the training and test sets into well-contained and well-separable clusters by the values of *cont4* (2 clusters detected for it, subject to further clustering experiments)
- *cont5* demonstrates many more clusters in the data, however it is almost sure to be less productive in ML down the road (similar to what we observed in Jan 2021 and Feb 2021)
- The distribution of the continuous variables is identical in the training and test sets (the details for each variable are provided below)
- There are certain pairs of highly correlated numeric features with corr >= 0.7 (cont0-cont01, cont0-cont7, cont1-cont2, cont1-cont8, cont7-cont10)
- *cont0, cont1, cont2, cont3, cont6, cont7, cont8, cont9*, and *cont10* are highly skewed to the left
- *cont5* is skewed to the right
- *cont4* demonstrates a clearly bimodal distribution (and it can be a good feature to use in possible clustering experiments)
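
The correlation and skewness findings above can be cross-checked with plain pandas. The following is a minimal sketch on synthetic data, assuming the same 0.7 threshold; `high_corr_pairs` and the columns `x`, `y`, `z` are hypothetical and only illustrate the technique.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears only once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Synthetic numeric features (hypothetical stand-ins for cont columns)
rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=n)
toy = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.1, size=n),  # strongly correlated with x
    "z": rng.exponential(size=n),            # independent, long right tail
})

pairs = high_corr_pairs(toy, threshold=0.7)
print(pairs)

# Sample skewness: negative means a longer left tail, positive a longer right tail
skewness = toy.skew()
print(skewness)
```

On the real data, the same calls would be applied to the continuous columns of `df_train`, for example `df_train[cont_cols]` with `cont_cols` being a list of the `cont*` column names.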



### Categorical Feature Findings

First of all, unlike the datasets of the Jan 2021 and Feb 2021 tabular playground competitions, this dataset proved to have irrelevant (noisy) categorical variables. *featurewiz*, the secret sauce of *AutoViz*, detected four such variables as follows

- *'cat5'*
- *'cat7'*
- *'cat8'*
- *'cat10'*

It is suggested to exclude such variables from the ML experiments down the road.
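
Excluding these variables can be as simple as dropping the columns before training. The sketch below is an illustration on a toy frame; `NOISY_CATS` and `drop_noisy_columns` are hypothetical names, and the same call would be applied to both `df_train` and `df_test`.

```python
import pandas as pd

# Categorical variables flagged as noisy in the findings above
NOISY_CATS = ["cat5", "cat7", "cat8", "cat10"]


def drop_noisy_columns(df: pd.DataFrame, noisy_cols) -> pd.DataFrame:
    """Return a copy of df without the flagged columns (absent ones are ignored)."""
    return df.drop(columns=noisy_cols, errors="ignore")


# Toy frame standing in for df_train (hypothetical values)
toy = pd.DataFrame({
    "cat0": ["A", "B"],
    "cat5": ["C", "D"],
    "cont0": [0.1, 0.2],
    "target": [0, 1],
})

reduced = drop_noisy_columns(toy, NOISY_CATS)
print(list(reduced.columns))
```

Using `errors="ignore"` keeps the helper safe to run on frames where some of the flagged columns are absent, and the original frame is left unchanged.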

It has additionally been detected that the rest of the categorical variables show weak relations with the target and with each other, as well as with the numeric variables (similar to what was observed in the contests for Jan 2021 and Feb 2021). Therefore, the ML approaches that worked in the previous tabular playground competitions should be applicable here as well.




In [None]:
print('We are done. That is all, folks!')
finish_time = dt.datetime.now()
print("Finished at ", finish_time)
elapsed = finish_time - start_time
print("Elapsed time: ", elapsed)

# References

Since the dataset in this competition is quite similar to the ones for the Jan 2021 and Feb 2021 tabular playground competitions, it could be useful to review the EDA findings for those competitions too

- Feb 2021 Tabular Playground Contest: https://www.kaggle.com/gvyshnya/generic-express-eda-with-comprehensive-insights
- Jan 2021 Tabular Playground Contest: https://www.kaggle.com/gvyshnya/using-autoviz-to-build-a-comprehensive-eda