# Introduction

In this notebook, we check whether one class of AutoML tools, namely Rapid EDA Automation tools, can be used effectively on the kind of problem and dataset posed in this competition.

For the experiments below, we use *AutoViz* (https://github.com/AutoViML/AutoViz).

In my experience, AutoViz stands out among the free Python Rapid EDA Automation tools: it runs considerably faster than close rivals such as *SweetViz* or *Pandas Profiling*.

*Notes:* 

- The motivation for using AutoViz whenever feasible is described in one of my earlier case studies: https://www.kaggle.com/c/lish-moa/discussion/190647
- A few blog posts about AutoViz are listed in the *References* section at the bottom of this notebook

# Preparation Activities

First, we go through the usual preparation steps:

- install the latest version of AutoViz from its GitHub repository
- import the packages we need for the analysis
- read the competition data into memory for further manipulation

In [None]:
!pip install git+https://github.com/AutoViML/AutoViz.git

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
from typing import Tuple

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline


# read data
in_kaggle = True


def get_data_file_path(is_in_kaggle: bool) -> Tuple[str, str, str]:
    train_path = ''
    test_path = ''
    sample_submission_path = ''

    if is_in_kaggle:
        # running in Kaggle, inside the competition
        train_path = '../input/tabular-playground-series-jan-2021/train.csv'
        test_path = '../input/tabular-playground-series-jan-2021/test.csv'
        sample_submission_path = '../input/tabular-playground-series-jan-2021/sample_submission.csv'
    else:
        # running locally
        train_path = 'data/train.csv'
        test_path = 'data/test.csv'
        sample_submission_path = 'data/sample_submission.csv'

    return train_path, test_path, sample_submission_path
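
The manual `in_kaggle` flag can also be replaced by a check of which data directory exists on disk. The sketch below is one possible variant, not part of the original flow: `resolve_data_paths` is a hypothetical helper, and it assumes the same two directory layouts used above.

```python
from pathlib import Path
from typing import Tuple

# Assumed locations, copied from the helper above.
KAGGLE_DATA_DIR = Path('../input/tabular-playground-series-jan-2021')
LOCAL_DATA_DIR = Path('data')


def resolve_data_paths(kaggle_dir: Path = KAGGLE_DATA_DIR,
                       local_dir: Path = LOCAL_DATA_DIR) -> Tuple[str, str, str]:
    """Return (train, test, sample_submission) paths.

    Hypothetical helper: uses the Kaggle directory when it exists,
    otherwise falls back to the local directory.
    """
    base = kaggle_dir if kaggle_dir.is_dir() else local_dir
    return (
        str(base / 'train.csv'),
        str(base / 'test.csv'),
        str(base / 'sample_submission.csv'),
    )


train_path, test_path, sample_submission_path = resolve_data_paths()
print(train_path)
```

With this variant the same notebook runs in both environments without editing a flag, at the cost of silently falling back to the local path when the Kaggle directory is missing.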

In [None]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

In [None]:
%%time
# get the training set and labels
train_set_path, test_set_path, sample_subm_path = get_data_file_path(in_kaggle)

df_train = pd.read_csv(train_set_path)
df_test = pd.read_csv(test_set_path)

subm = pd.read_csv(sample_subm_path)

Before running the AutoViz-based EDA, we check the basic information about our training dataset (record count, data types of the variables, non-null counts per column, etc.)

In [None]:
df_train.info()
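
`info()` reports non-null counts rather than percentages. If the share of missing values per column is needed explicitly, it can be computed as in the minimal sketch below. The toy frame `df_demo` and its column names are illustrative stand-ins for `df_train`, not the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; column names are illustrative only.
df_demo = pd.DataFrame({
    'feature_a': [0.1, np.nan, 0.3, 0.4],
    'feature_b': [1.0, 2.0, 3.0, 4.0],
    'target': [7.2, 8.1, np.nan, np.nan],
})

# One row per column: its dtype and the percentage of missing values.
summary = pd.DataFrame({
    'dtype': df_demo.dtypes.astype(str),
    'missing_pct': df_demo.isna().mean() * 100,
})

print(f'records: {len(df_demo)}')
print(summary)
```

Replacing `df_demo` with `df_train` gives the same summary for the real training set.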

# AutoViz-based EDA

Now we are ready to run the AutoViz-based EDA. It is as simple as the code fragment below.

*Note*: See the *Usage* section of the documentation at https://github.com/AutoViML/AutoViz for more information on each of the arguments used in the *AV.AutoViz* method invocation.

In [None]:

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
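dftc = AV.AutoViz(
    filename='',               # no file to read: the data is passed as a DataFrame via dfte
    sep='',                    # separator, only relevant when reading from a file
    depVar='target',           # name of the dependent (target) variable
    dfte=df_train,             # DataFrame to analyze
    header=0,                  # header row index, used when reading from a file
    verbose=1,
    lowess=False,              # skip LOWESS regression lines (slow on large datasets)
    chart_format='png',        # image format of the generated charts
    max_rows_analyzed=300000,  # upper limit on rows used for the charts
    max_cols_analyzed=30       # upper limit on columns analyzed
)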


As we can see, AutoViz took a few minutes to run the EDA flow and generate the data visualization charts (105 charts in total).

After about 20 minutes of reviewing the charts, I arrived at the data-driven insights below:

- there are no missing values for any of the feature variables in any observations in the training set
- most of the feature variables have multimodal distributions (several distinct peaks)
- values of *count5* variable seem to be extremely skewed
- there are potential outliers in the training set with *target* values below 5, so it could be reasonable to drop such records from the training set in later ML experiments
- there is a group of highly correlated features (*count1*, *count6*, *count9*, and *count10*); we may want to drop all of them except the one with the highest absolute correlation coefficient vs. *target* (this seems to be *count1*)
- there are also other pairs of highly correlated features (*count6* and *count11*, *count6* and *count12*); dropping *count6* from the training set would resolve the correlation in both pairs
- *count11* and *count12* are also highly correlated, so we may want to keep just one of them in later ML experiments; retaining *count11* seems the better option, as it has the higher absolute correlation coefficient vs. *target*
- *count2* and *count14* seem to have a nice separation of values into relatively contained clusters vs. the values of *target*
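
The rule used in the correlation bullets above (within a group of highly correlated features, keep the one most correlated with *target*) can be sketched as a simple greedy filter. This is an illustration under stated assumptions, not what AutoViz does internally: `select_uncorrelated_features` is a hypothetical helper, the 0.9 threshold is an arbitrary choice, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def select_uncorrelated_features(df, target, threshold=0.9):
    """Greedily keep features, strongest target correlation first,
    skipping any feature too correlated with one already kept."""
    features = [c for c in df.columns if c != target]
    feature_corr = df[features].corr().abs()
    target_corr = df[features].corrwith(df[target]).abs()

    kept = []
    for col in target_corr.sort_values(ascending=False).index:
        if all(feature_corr.loc[col, other] < threshold for other in kept):
            kept.append(col)
    # Return the survivors in their original column order.
    return [c for c in features if c in kept]


# Synthetic data: x2 is a noisy copy of x1, x3 is independent of both.
rng = np.random.default_rng(0)
n = 1_000
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
df_demo = pd.DataFrame({
    'x1': x1,
    'x2': x2,
    'x3': x3,
    'target': 2 * x1 + x3 + 0.1 * rng.normal(size=n),
})

kept_features = select_uncorrelated_features(df_demo, target='target', threshold=0.9)
print(kept_features)
```

Only one of `x1` and `x2` survives, while `x3` is always kept. On the real training set the threshold would need to be tuned against the correlation heatmap.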


*Note:* Building an EDA with data visualizations of comparable granularity manually, using mainstream data visualization libraries, would have taken me around 3 hours.
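
The skewness and outlier observations from the list above can also be checked numerically. The sketch below is a minimal example on synthetic data: the column names, the planted low-target rows, and the `TARGET_FLOOR` cut-off of 5 (taken from the outlier bullet) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training set; column names are illustrative.
rng = np.random.default_rng(42)
df_demo = pd.DataFrame({
    'skewed_feature': rng.exponential(scale=1.0, size=1_000),
    'symmetric_feature': rng.normal(size=1_000),
    'target': rng.normal(loc=8.0, scale=0.7, size=1_000),
})
# Plant five artificial low-target records to mimic the suspected outliers.
df_demo.loc[:4, 'target'] = 0.0

# Sample skewness per feature; values far from 0 indicate a skewed distribution.
skewness = df_demo.drop(columns='target').skew()
print(skewness)

# Drop records whose target falls below the assumed cut-off.
TARGET_FLOOR = 5.0
df_filtered = df_demo[df_demo['target'] >= TARGET_FLOOR].reset_index(drop=True)
print(f'dropped {len(df_demo) - len(df_filtered)} records')
```

Whether dropping such records actually helps should be confirmed by validation scores in the ML experiments, not by the charts alone.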

In [None]:
print('We are done. That is all, folks!')
finish_time = dt.datetime.now()
print("Finished at ", finish_time)
elapsed = finish_time - start_time
print("Elapsed time: ", elapsed)

# References

The blog posts below may be helpful if you want to dig deeper into AutoViz:

* Dan Roth, AutoViz: A New Tool for Automated Visualization - https://towardsdatascience.com/autoviz-a-new-tool-for-automated-visualization-ec9c1744a6ad
* George Vyshnya, PROs and CONs of Rapid EDA Tools - https://medium.com/sbc-group-blog/pros-and-cons-of-rapid-eda-tools-e1ccd159ab07