# <div style="color:white;background-color:#1d1545;padding:3%;border-radius:50px 50px;font-size:1em;text-align:center">Introduction</div>

This notebook extracts useful insights from the datasets of the ‘Tabular Playground Series - Dec 2021’ competition on Kaggle.

For this competition, you will be predicting a categorical target based on a number of feature columns given in the data. 

The data is synthetically generated by a GAN that was trained on the data from the [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction/overview) competition. This dataset (a) is much larger and (b) may or may not have the same relationship to the target as the original data.

**Note:** Please refer to this [data page](https://www.kaggle.com/c/forest-cover-type-prediction/data) for a detailed explanation of the features.

We are going to perform a comprehensive EDA as follows:
- Automate the generic aspects of EDA with AutoViz, one of the leading free Rapid EDA tools in the Python data science world
- Dive deep into the problem-specific analytical questions and discoveries with custom manual EDA routines built on top of the standard Python visualization libraries

# <div style="color:white;background-color:#1d1545;padding:3%;border-radius:50px 50px;font-size:1em;text-align:center">EDA Findings in Training Set</div>

This section focuses on the discoveries and insights we obtained from the *AutoViz*-automated EDA of the contest dataset.

**Note:** The diagrams used in the subsections have been generated automatically by running the code in the next chapter. The charts were generated in a matter of minutes, freeing you to spend time on more thinking-intensive activities and on drawing insights from your data.

## <div style="font-size:20px;text-align:center;color:black;border-bottom:5px #0026d6 solid;padding-bottom:3%">Target Class Label Distribution</div>

One of the plots auto-generated by *AutoViz* is a set of bar charts that visualize the distribution of the class labels of the target variable.

<img src="https://raw.githubusercontent.com/gvyshnya/tab-dec-21/main/AutoViz_Plots/Cover_Type/Dist_Plots_target.png">

Based on the charts above, we find that
- the target class labels are seriously imbalanced, with labels *4* and *5* being negligibly small relative to the other class labels (this may justify excluding the observations with such class labels from the training set to improve the ML model accuracy)
- one of the established techniques for handling imbalanced target classes will be required in the ML modelling experiments down the road (see the section on tackling imbalanced target class labels)
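
The imbalance seen in the bar charts can also be quantified directly. The snippet below is a minimal sketch on a toy label series, not the competition data: `toy_target` and the 15% rarity threshold are assumptions made for illustration only. In the notebook, `df_train['Cover_Type']` would take the place of the toy series.

```python
import pandas as pd

# Toy stand-in for df_train['Cover_Type']; the values are illustrative only.
toy_target = pd.Series([1, 1, 1, 2, 2, 2, 2, 3, 5], name="Cover_Type")

# Absolute counts and relative shares per class label, sorted by label.
class_counts = toy_target.value_counts().sort_index()
class_shares = toy_target.value_counts(normalize=True).sort_index()

summary = pd.DataFrame({"count": class_counts, "share": class_shares})
print(summary)

# Labels whose share falls below a hypothetical 15% threshold.
rare_labels = summary.index[summary["share"] < 0.15].tolist()
print("Rare labels:", rare_labels)
```

The threshold is arbitrary; on a real dataset it should reflect how few observations a model can realistically learn from.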

## <div style="font-size:20px;text-align:center;color:black;border-bottom:5px #0026d6 solid;padding-bottom:3%">Relations Between Numeric Feature and Target</div>

<img src="https://raw.githubusercontent.com/gvyshnya/tab-dec-21/main/AutoViz_Plots/Cover_Type/Box_Plots.png">

Reviewing the box plots above reveals several interesting insights:
- Certain numeric features have a strong association with the target variable (*Cover_Type*) and are therefore likely to be good predictors (for instance, *Elevation*, *Horizontal_Distance_to_Hydrology*, *Vertical_Distance_to_Hydrology*, *Hillshade_9am*, *Horizontal_Distance_to_Fire_Points*)
- Other features seem to be weaker predictors in terms of their association with the target class labels (however, this alone does not justify excluding such features from the model training in ML experiments)

<img src="https://raw.githubusercontent.com/gvyshnya/tab-dec-21/main/AutoViz_Plots/Cover_Type/Scatter_Plots.png">

The scatter plots above give further detail on the relations between the feature variables and the target class labels.

## <div style="font-size:20px;text-align:center;color:black;border-bottom:5px #0026d6 solid;padding-bottom:3%">Pair Associations Between Features by Target</div>

<img src="https://raw.githubusercontent.com/gvyshnya/tab-dec-21/main/AutoViz_Plots/Cover_Type/Pair_Scatter_Plots.png">

When we review the pairwise associations between the feature variables (colored by the target class labels of the training set observations), we see that *Elevation* separates the target class labels quite clearly in its interactions with the other feature variables.
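
A numeric counterpart to this visual check is a per-class summary of a feature. The sketch below uses a small toy frame with made-up values (`toy_df` is hypothetical and does not reflect the actual competition data); with the real data, `df_train` would be grouped the same way.

```python
import pandas as pd

# Toy data with made-up values; not taken from the competition dataset.
toy_df = pd.DataFrame({
    "Elevation": [2000, 2100, 2050, 3000, 3100, 3050],
    "Cover_Type": [3, 3, 3, 1, 1, 1],
})

# Summarize the feature within each target class.
elevation_by_class = (
    toy_df.groupby("Cover_Type")["Elevation"].agg(["min", "median", "max"])
)
print(elevation_by_class)
```

When the per-class ranges barely overlap, as in this toy frame, the feature separates the classes well on its own.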

## <div style="font-size:20px;text-align:center;color:black;border-bottom:5px #0026d6 solid;padding-bottom:3%">Pair Correlations Between Features</div>

<img src="https://raw.githubusercontent.com/gvyshnya/tab-dec-21/main/AutoViz_Plots/Cover_Type/Heat_Maps.png">

Although many numeric features in this dataset are in fact categories encoded as discrete numeric values, looking at the Pearson correlation between such variables can still bring more insight into highly associated ('correlated') features. From that standpoint, we observe that

- *'Wilderness_Area1'* and *'Wilderness_Area3'* are highly correlated, and one of these features can probably be removed from the training set without compromising the model accuracy
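
Highly correlated pairs can also be listed programmatically instead of being read off the heat map. The helper below is a minimal sketch: `find_correlated_pairs`, the 0.8 threshold and the toy columns are assumptions for illustration, and the toy values are not taken from the competition data.

```python
import pandas as pd

def find_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    # Walk the strict upper triangle so every pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            # NaN correlations (e.g. constant columns) fail this comparison.
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy binary indicators; area_a and area_b are mutually exclusive.
toy_df = pd.DataFrame({
    "area_a": [1, 0, 1, 0, 1, 0],
    "area_b": [0, 1, 0, 1, 0, 1],
    "other":  [1, 1, 0, 0, 1, 0],
})

pairs = find_correlated_pairs(toy_df, threshold=0.8)
print(pairs)
```

The absolute value matters here: mutually exclusive indicator columns show a strong negative correlation, which carries the same redundancy as a strong positive one.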

# <div style="color:white;background-color:#1d1545;padding:3%;border-radius:50px 50px;font-size:1em;text-align:center">Call for Action: Let's Get Our Hands Dirty</div>

The sections below present the source code of the express EDA experiment that led to the insights collected above.

Executing this code will generate the charts used as images in the previous chapter.

In [None]:
!pip install AutoViz

## <div style="font-size:20px;text-align:center;color:black;border-bottom:5px #0026d6 solid;padding-bottom:3%">Initial Preparations</div>

We are going to start with the essential prerequisites as follows:

- installing *AutoViz* into this notebook (done in the cell above)
- importing the standard Python packages we need down the road
- programming the helper routines for the repeatable data visualizations used in the advanced analytical EDA trials down the road

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
from typing import Tuple, List, Dict

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline


# read data
in_kaggle = True

def get_data_file_path(is_in_kaggle: bool) -> Tuple[str, str, str]:
    train_path = ''
    test_path = ''
    sample_submission_path = ''

    if is_in_kaggle:
        # running in Kaggle, inside the competition
        train_path = '../input/tabular-playground-series-dec-2021/train.csv'
        test_path = '../input/tabular-playground-series-dec-2021/test.csv'
        sample_submission_path = '../input/tabular-playground-series-dec-2021/sample_submission.csv'
    else:
        # running locally
        train_path = 'data/train.csv'
        test_path = 'data/test.csv'
        sample_submission_path = 'data/sample_submission.csv'

    return train_path, test_path, sample_submission_path

In [None]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

In [None]:
%%time
# get the training set and labels
train_set_path, test_set_path, sample_subm_path = get_data_file_path(in_kaggle)

df_train = pd.read_csv(train_set_path)
df_test = pd.read_csv(test_set_path)

subm = pd.read_csv(sample_subm_path)

## <div style="font-size:20px;text-align:center;color:black;border-bottom:5px #0026d6 solid;padding-bottom:3%">Training Set Overview</div>

In [None]:
df_train.info()

## <div style="font-size:20px;text-align:center;color:black;border-bottom:5px #0026d6 solid;padding-bottom:3%">Detecting Cardinality of the Variables in Training Set</div>

In [None]:
# count the distinct values of every column in the training set
cols = df_train.columns
for col in cols:
    n_distinct = df_train[col].nunique()
    print('Variable {:>40} has {} distinct values'.format(col, n_distinct))

As a result, we see that *'Soil_Type15'* and *'Soil_Type7'* have just one value across all training records. Therefore, it does not make sense to use these features in the model training down the road.

The *'Id'* feature is just a nominal identifier, and therefore it should also be excluded from the training set at model training time.
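
The constant columns noted above can also be detected programmatically rather than by reading the printed cardinalities. This is a minimal sketch on a toy frame: `find_constant_columns` and the toy column names are hypothetical, and `df_train` would be passed in place of `toy_df` in the notebook.

```python
import pandas as pd

def find_constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    # dropna=False counts NaN as a value, so an all-NaN column is also flagged.
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy data; the column names and values are illustrative only.
toy_df = pd.DataFrame({
    "soil_a": [0, 0, 0, 0],
    "soil_b": [0, 1, 0, 1],
    "elevation": [2500, 2700, 3100, 2900],
})

constant_cols = find_constant_columns(toy_df)
print(constant_cols)
```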

In [None]:
features_to_drop = ['Soil_Type15', 'Soil_Type7']
df_train = df_train.drop(features_to_drop, axis=1)

## <div style="font-size:20px;text-align:center;color:black;border-bottom:5px #0026d6 solid;padding-bottom:3%">Express EDA Analysis with AutoViz</div>

We are going to invoke *AutoViz*, one of the prominent free Python Rapid EDA tools, to quickly draw basic insights from the data.

In [None]:
# uncomment the block below (remove the triple quotes) to run it in your own environment
'''from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
dftc = AV.AutoViz(
    filename='', 
    sep='' , 
    depVar='Cover_Type', 
    dfte=df_train, 
    header=0, 
    verbose=1, 
    lowess=False, 
    chart_format='png', 
    max_rows_analyzed=400000, 
    max_cols_analyzed=55
)'''


# <div style="color:white;background-color:#1d1545;padding:3%;border-radius:50px 50px;font-size:1em;text-align:center">Tackling Imbalanced Target Class Labels</div>

Since we detected extremely imbalanced target class labels, we should take this into account when building ML models down the road. We can choose from the conventional methods below:

- undersampling
- oversampling (for example, with SMOTE)
- assigning different class label weights in the ML model training
- etc.

**Note:** The discussion thread [Imbalanced classes vs. imbalanced cost](https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/294305) presents arguments on why resampling methods are likely to be less effective than class weighting and other model-level tweaks.
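
As an illustration of the class weighting option, the sketch below derives weights that are inversely proportional to class frequency, using the common "balanced" heuristic `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration; how the weights are then passed to a model depends on the library used.

```python
import numpy as np

def balanced_class_weights(y):
    """Map each class label to n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    # Assumes integer class labels, as is the case for Cover_Type.
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels; the distribution is illustrative only.
toy_labels = np.array([1, 1, 1, 1, 2, 2, 2, 3])

weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarest class receives the largest weight, so its misclassification costs more during training without any rows being duplicated or removed.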

# <div style="color:white;background-color:#1d1545;padding:3%;border-radius:50px 50px;font-size:1em;text-align:center">Benefits</div>

This notebook provides a real-world example of how **AutoViz**, one of the best free Rapid EDA tools, can save you time on the routine steps of EDA while allowing you to spend more time on drawing real insights from your data, using the auto-generated charts produced by **AutoViz**.

The contribution in this notebook expands on the research showing **AutoViz** to be a good automation tool for Data Analysts and Data Scientists. Previous EDA experiments with **AutoViz** can be reviewed via the links below:

- [Using AutoViz to Build a Comprehensive EDA](https://www.kaggle.com/gvyshnya/using-autoviz-to-build-a-comprehensive-eda)
- [Express EDA with AutoViz](https://www.kaggle.com/gvyshnya/mar-21-tpc-express-eda-with-autoviz)
- [EDA and Feature Importance Findings](https://www.kaggle.com/c/lish-moa/discussion/190647#1047649)

# <div style="color:white;background-color:#1d1545;padding:3%;border-radius:50px 50px;font-size:1em;text-align:center">References</div>

- https://www.kaggle.com/damagejun/tps-dec-2021-eda
- https://towardsdatascience.com/heres-what-i-ve-learnt-about-sklearn-resample-ab735ae1abc4
- https://towardsdatascience.com/how-to-handle-imbalance-data-and-small-training-sets-in-ml-989f8053531d


In [None]:
print('We are done. That is all, folks!')
finish_time = dt.datetime.now()
print("Finished at ", finish_time)
elapsed = finish_time - start_time
print("Elapsed time: ", elapsed)