## Perform exploratory data analysis (EDA) efficiently using smart tools

When we do EDA, we typically perform these routine steps:

1) check the distribution of each variable

2) check the association between the variables 

3) check the association between all variables and target variable for modelling purposes

4) check missing values

5) check for outliers

and other nitty-gritty analyses.
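To give a sense of the manual work involved, here is a minimal pandas sketch of the first four steps. The `toy` frame is a small made-up stand-in for the real dataset (its values are for illustration only), and the variable names are my own.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "female", "male", "male"],
    "Age":      [22.0, 38.0, 26.0, 35.0, np.nan, 54.0],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 512.33],
})

# 1) distribution of each variable
numeric_summary = toy.describe()          # count, mean, std, quartiles per numeric column
sex_counts = toy["Sex"].value_counts()    # frequency of each category

# 2) association between the numeric variables (Pearson by default)
corr_matrix = toy.select_dtypes("number").corr()

# 3) association between a variable and the target
survival_by_sex = toy.groupby("Sex")["Survived"].mean()

# 4) missing values per column
missing_counts = toy.isna().sum()

print(numeric_summary, sex_counts, corr_matrix, survival_by_sex, missing_counts, sep="\n\n")
```

Multiply this by every column and add the plotting code, and it is easy to see where the hours go.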

All of that takes a lot of time (easily a few hours spent creating the plots and encoding the variables properly so that they can be interpreted correctly), and by the time you are done, you are <b>too mentally drained to start the real work</b>.

Therefore, it is more productive to skip the menial work and jump directly to what moves the needle. One good way is to make use of a library like Sweetviz, which can automate a large part of these EDA steps for you.

In [2]:
import pandas as pd
import numpy as np

In [3]:
# load the famous Titanic dataset
df = pd.read_csv('titanic/train.csv')

In [4]:
df.shape

(891, 12)

In [5]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Sweetviz

In [None]:
import sweetviz as sv

In [7]:
#sweetviz
#https://github.com/fbdesignpro/sweetviz

# in just 2 lines of code
# target_feat indicates the dependent (target) variable
my_report = sv.analyze(df, target_feat='Survived')
my_report.show_html() 


Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


## Checking distributions and missingness of each variable

Executing those 2 lines of code generates a report that visualizes every variable in the dataframe.

Here is some of the cool stuff we will see:

The report shows different information depending on the type of variable: categorical (nominal/ordinal), continuous, or text. Each type is indicated by a different icon at the top left corner.

<img src='http://datageeko.com/imgs/titanic/all_distri.png'/>

1) It automatically highlights (literally) the presence of missing values and how many there are. The highlight comes in 3 colors (yellow, green, and red) to express the severity of the missingness.

2) At one glance, we can see the distribution of both categorical (Embarked) and continuous (Fare) variables. At the same time, the distribution of the target variable (Survived) for each bin is overlaid on a second axis. This lets you see the association between the groups: the proportion of rows in each bin next to the proportion of target values equal to "1".

3) For variables that contain text, no bar plot is shown, but the summary still gives you a good idea of the nature of the values.

4) One problem with Sweetviz is that it doesn't highlight outliers.
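For comparison, the numbers behind points 1) and 2) can be approximated by hand in pandas. This is only a sketch on a made-up `toy` frame, not how Sweetviz computes its report internally; the choice of 4 equal-width bins is my own assumption.

```python
import pandas as pd

# Small made-up frame (illustrative values only)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Embarked": ["S", "C", "S", "S", "S", "Q", "C", None],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 30.07, 21.08],
})

# Point 1: share of missing values per column
missing_share = toy.isna().mean()

# Point 2 (categorical): rows per category and proportion of Survived == 1
embarked_summary = toy.groupby("Embarked")["Survived"].agg(["count", "mean"])

# Point 2 (continuous): same summary after cutting Fare into 4 equal-width bins
fare_bins = pd.cut(toy["Fare"], bins=4)
fare_summary = toy.groupby(fare_bins, observed=False)["Survived"].agg(["count", "mean"])

print(missing_share, embarked_summary, fare_summary, sep="\n\n")
```

Note that `groupby` silently drops the row with a missing `Embarked` value, which is one more thing the report keeps track of for you.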

You might ask: does it support a continuous target variable? Sure! Here's what it looks like if I set the target variable to "Fare":

<img src='http://datageeko.com/imgs/titanic/all_distri_conti.png'/>

Similarly, the target variable (Fare) is binned into a few groups and overlaid on the continuous variable, which gives a sense of the association between these 2 continuous variables.

In [21]:
# check whether a continuous variable can be used as the target
#my_report = sv.analyze(df, target_feat='Fare')


In [22]:
#my_report.show_html() 



## Checking the associations between variables
A key task in EDA is to check for associations between variables, and the appropriate measure depends on the types of the variables involved. If we select a variable and scroll to the right, we can see its associations with all other variables computed for us.

### Continuous-Continuous variables & Continuous-categorical variables
<img src='http://datageeko.com/imgs/titanic/conti_2_conti.png'/>

It computes the Pearson correlation against every other continuous variable detected in this dataset. Additionally, it computes the association against all categorical variables.

At a glance, we can see which variables have some relationship with the selected variable.
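To make the two measures concrete, here is a rough sketch. Pearson correlation is built into pandas; for the continuous-categorical case, my understanding from the Sweetviz README is that it reports the correlation ratio, so I have written a small `correlation_ratio` helper by hand (the helper and the `toy` data are my own, not Sweetviz code).

```python
import numpy as np
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta): sqrt(between-group sum of squares / total sum of squares)."""
    frame = pd.DataFrame({"cat": categories, "val": values}).dropna()
    grand_mean = frame["val"].mean()
    total_ss = ((frame["val"] - grand_mean) ** 2).sum()
    if total_ss == 0:
        return 0.0  # a constant variable cannot be explained by anything
    stats = frame.groupby("cat")["val"].agg(["count", "mean"])
    between_ss = (stats["count"] * (stats["mean"] - grand_mean) ** 2).sum()
    return float(np.sqrt(between_ss / total_ss))

# Small made-up frame (illustrative values only)
toy = pd.DataFrame({
    "Sex":  ["male", "female", "female", "female", "male", "male"],
    "Age":  [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
})

# Continuous-continuous: Pearson correlation, between -1 and 1
pearson_age_fare = toy["Age"].corr(toy["Fare"])

# Continuous-categorical: correlation ratio, between 0 and 1
eta_sex_fare = correlation_ratio(toy["Sex"], toy["Fare"])

print(pearson_age_fare, eta_sex_fare)
```

The correlation ratio is 1 when the category fully determines the value and 0 when every category has the same mean.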

### Categorical-Categorical variables
<img src='http://datageeko.com/imgs/titanic/cat_2_cat.png'/>
In this case, it uses Theil's U (the uncertainty coefficient), an asymmetric measure of association between categorical variables.
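To see why the asymmetry matters, here is a minimal hand-written sketch of Theil's U using only the standard library. It follows the textbook definition U(x|y) = (H(x) - H(x|y)) / H(x); the function names and the tiny example data are my own, and this is not Sweetviz's implementation.

```python
from collections import Counter
from math import log

def entropy(values):
    """Shannon entropy H(x) of a sequence of labels."""
    total = len(values)
    return -sum((n / total) * log(n / total) for n in Counter(values).values())

def conditional_entropy(x, y):
    """Conditional entropy H(x | y): uncertainty left in x once y is known."""
    total = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum((n / total) * log(n / y_counts[y_val])
                for (_, y_val), n in xy_counts.items())

def theils_u(x, y):
    """U(x | y): fraction of the uncertainty in x that is explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x

# Made-up example: every ticket is unique, and each ticket belongs to one class
pclass = [1, 1, 3, 3]
ticket = ["a", "b", "c", "d"]

u_pclass_given_ticket = theils_u(pclass, ticket)  # ticket fully determines pclass
u_ticket_given_pclass = theils_u(ticket, pclass)  # pclass only narrows down the ticket

print(u_pclass_given_ticket, u_ticket_given_pclass)
```

Knowing the ticket tells you the class with certainty, but knowing the class only halves the uncertainty about the ticket, so the two directions give different scores. A symmetric measure would hide this.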

### Overview of all associations
This is even better: a bird's-eye view of the associations between all variables regardless of type:
<img src='http://datageeko.com/imgs/titanic/sweet_viz_associ.png' width=600/>

## D-Tale
There's another similar library called D-Tale, which complements Sweetviz pretty well. Here's why.

In [None]:
import dtale

In [5]:
# Assigning a reference to a running D-Tale process
d = dtale.show(df)

In [6]:
d.open_browser()

<img src='http://datageeko.com/imgs/titanic/d_tale_main.png'/>
The main screen looks basically like Excel, but it's actually much more than that. The magic happens when you click on the arrow button at the top left:
<img src='http://datageeko.com/imgs/titanic/d_tale_top_left.png' width=500/>

### Predictive Power Score (overview of all associations)
There are a few especially useful features. If you click on "Predictive Power Score", you will see a bird's-eye view of the associations between all variables so that you can <b>narrow down to the most important variables</b>. This is similar to the "associations" screen in Sweetviz, but the underlying measure is different.
<img src='http://datageeko.com/imgs/titanic/d_tale_pps.png' width=600/>

### Outlier detection
Another useful feature is "<b>highlight outliers</b>", which is something that Sweetviz is missing:
<img src='http://datageeko.com/imgs/titanic/d_tale_outliers.png' width=600/>
It brightly highlights the columns that contain outliers, and if you mouse over a column header, it also shows the number of affected rows.
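If you want a comparable count outside the D-Tale UI, a common rule of thumb is to flag values more than 1.5 times the interquartile range beyond the quartiles. The sketch below assumes that rule; I have not verified that it matches D-Tale's exact computation, and `iqr_outlier_mask` and the sample fares are my own.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up fares with one extreme value (illustrative only)
fares = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 512.33], name="Fare")

mask = iqr_outlier_mask(fares)
outlier_count = int(mask.sum())   # number of affected rows
outlier_values = fares[mask].tolist()

print(outlier_count, outlier_values)
```

Missing values compare as `False` on both sides, so they are never flagged as outliers by this mask.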

### Missingness analysis
If you click on <b>"highlight missing"</b>, it similarly highlights the missing values:
<img src='http://datageeko.com/imgs/titanic/d_tale_missing.png' width=400/>
One last useful feature that I want to show is <b>"missing analysis"</b>. Sometimes missingness in one variable is related to missingness in another, or to the order of the dataset, and this diagram can give you a clue:
<img src='http://datageeko.com/imgs/titanic/d_tale_missing2.png' width=600/>
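
The relationship between missing variables can also be checked numerically by correlating the "is missing" flags of each column. This is a rough pandas sketch on made-up data; as far as I know D-Tale builds this screen on the missingno plots, so treat the code as an approximation of the idea rather than what D-Tale runs.

```python
import numpy as np
import pandas as pd

# Small made-up frame where Age and Cabin tend to be missing together
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan, 35.0, 54.0],
    "Cabin": ["C85", None, "C123", None, None, "E46"],
    "Fare":  [7.25, 71.28, 7.93, 53.10, 8.05, 51.86],
})

# True where a value is missing
missing_flags = toy.isna()

# Keep columns that are partly missing; fully present or fully missing
# columns are constant and have no defined correlation
partly_missing = missing_flags.loc[:, missing_flags.any() & ~missing_flags.all()]

# Correlation between the missingness indicators
nullity_corr = partly_missing.astype(int).corr()

print(nullity_corr)
```

A value close to 1 means two columns tend to be missing in the same rows, which usually hints at a shared cause in how the data was collected.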