## TPS (Feb): How to EDA

EDA (exploratory data analysis) is one of the most important and most often ignored parts of data science.<br/>
EDA allows us to understand our data better and helps us with data preprocessing.<br/>

When we are solving a real-world problem with the help of data, building models is not enough; we need to communicate our findings as well.

In this notebook I will explain how to get started with EDA and what to look for in a dataset.

This notebook uses matplotlib, seaborn and plotly for making graphs and plots.

That said, let's get started.

## Importing Libraries 📗

Here I am importing the basic libraries required for loading, manipulating and plotting data.

In [None]:
import os
import gc
import sys
import time
import tqdm
import random
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 

import plotly.express as px 
import plotly.graph_objs as go
import plotly.figure_factory as ff

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import train_test_split

from colorama import Fore, Back, Style
y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
c_ = Fore.CYAN
sr_ = Style.RESET_ALL

import warnings
warnings.filterwarnings('ignore')

## What problem are we trying to solve?

Kaggle provides us with a predetermined problem to solve, but that is not always the case in the real world, where we need to define the problem before solving it.

So the first step is to ask what problem we are trying to solve. The answer is not always as simple as regression, classification or time series.
For example, it could be something like how to increase sales of a certain product, how to advertise so that the ad reaches interested customers, how to recommend better movies and shows, or how to protect people from spam emails.

Well, here we are doing no such thing. Here we need to create a regression model that efficiently captures the relationship between the independent variables and the target variable.

## Loading Data 💽

Here we are provided with 3 files. train.csv contains all the training data; it consists of 10 categorical and 14 continuous features and a target column. test.csv is the file on which we have to make predictions. The third file, sample_submission.csv, shows the format of the submission file.

Let's load these files and see the shape of each.

In [None]:
folder_path = '../input/tabular-playground-series-feb-2021'
train_data = pd.read_csv(f'{folder_path}/train.csv')
test_data = pd.read_csv(f'{folder_path}/test.csv')
sample = pd.read_csv(f'{folder_path}/sample_submission.csv')

In [None]:
print("{0}Number of rows in train data: {1}{2}\n{0}Number of columns in train data: {1}{3}".format(y_,r_,train_data.shape[0],train_data.shape[1]))
print("{0}Number of rows in test data: {1}{2}\n{0}Number of columns in test data: {1}{3}".format(m_,r_,test_data.shape[0],test_data.shape[1]))
print("{0}Number of rows in sample : {1}{2}\n{0}Number of columns in sample : {1}{3}".format(c_,r_,sample.shape[0],sample.shape[1]))

## Looking at the Data

The next step, obviously, is to look at the data and see what columns we have.

We can see that the data is not very wide, only 26 columns, which is good: narrow data is easy to analyse.

In [None]:
pd.set_option('display.max_columns', None) # for displaying all the columns of dataframe

In [None]:
train_data.head()

In [None]:
test_data.head()

So the data is very clear. First is id, which is basically a unique number for each row.

We can see there are 10 categorical variables, cat0 to cat9, and 14 continuous variables, cont0 to cont13. There is no mention of whether the categorical columns are nominal or ordinal.
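
When the column types are not documented, one quick way to get a first impression is to split the columns by dtype and count the distinct values in each categorical column. The sketch below runs on a small made-up frame (`toy`) that only mimics the naming scheme of the competition data; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame; it only imitates the column names of the real data.
toy = pd.DataFrame({
    "id":     [1, 2, 3, 4],
    "cat0":   ["A", "B", "A", "A"],
    "cat1":   ["C", "C", "D", "E"],
    "cont0":  [0.10, 0.52, 0.33, 0.91],
    "target": [7.1, 6.8, 8.0, 7.4],
})

# Non-numeric columns are treated as categorical, numeric ones as continuous
# (after setting aside the id and the target).
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
cont_cols = toy.select_dtypes(include="number").columns.drop(["id", "target"]).tolist()

# Number of distinct categories per categorical column.
cardinality = toy[cat_cols].nunique()

print("categorical:", cat_cols)
print("continuous:", cont_cols)
print(cardinality)
```

Whether a categorical column is nominal or ordinal still cannot be read from the data alone, so this only tells us which columns to inspect more closely.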

## Always check for missing values.

In Kaggle competitions the data is usually very clean and requires very little preprocessing, but that is not always true in the real world. Real-world data is often messy and inconsistent and has many missing values, so we should always look for missing values in each column before any further analysis.

In [None]:
train_data.isnull().sum()

As I said, in Kaggle competitions the data is very clean, so there are no missing values in any column. If there were missing values, we would need to preprocess them. There are three basic methods for dealing with missing values.

If there are too many missing values in a column, it is better to remove that column. If there are some missing values in a column, we should fill them using one of the various filling techniques. If there are only a few missing values, maybe we can remove the rows that contain them.

Sometimes missing values in a column can act as additional information: we can create a separate column indicating whether the value in a particular column is missing and use it as a feature.
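
The options above could look roughly like this in pandas. This is a minimal sketch on a made-up frame (`toy`), since the competition data has no missing values; the 0.5 cut-off and the choice of the median as fill value are arbitrary assumptions, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; column names are invented for illustration.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
    "some_missing":   [0.2, np.nan, 0.4, 0.6, np.nan, 0.8],
    "complete":       [1, 2, 3, 4, 5, 6],
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# 1) Drop columns where most values are missing (0.5 is an arbitrary cut-off).
cleaned = toy.drop(columns=missing_share[missing_share > 0.5].index)

# 2) Keep the "was missing" information as an extra 0/1 feature.
cleaned["some_missing_was_na"] = cleaned["some_missing"].isnull().astype(int)

# 3) Fill the remaining gaps, here with the column median.
#    (Alternatively, cleaned.dropna() would remove the affected rows instead.)
cleaned["some_missing"] = cleaned["some_missing"].fillna(cleaned["some_missing"].median())

print(missing_share)
print(cleaned)
```

The indicator column is created before filling, otherwise the information about which rows were missing would already be lost.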

## Let's begin plotting.

Raw data does not provide much information on its own, so we make graphs to draw insight from it.

Graphs and plots are a good way to communicate our findings to others.

There are some basics we need to know about plotting.
There are two basic types of plots: univariate and multivariate. In univariate plots we try to detect patterns and anomalies in a single variable, and in multivariate plots we try to find relationships between different variables.

## Distribution of continuous columns

The distribution of a continuous column shows which values are more and less likely to occur. It also reveals whether there are values that lie far away from the rest.

The go-to method for looking at the distribution of a continuous column is distplot in seaborn (deprecated in recent seaborn versions in favour of histplot and displot); another method is the box plot.

Mean, min, max and standard deviation are some statistics that reveal more about the data. So we will write a function that does all of this in one go.

In [None]:
cat_features = [f'cat{i}' for i in range(10)]
cont_features = [f'cont{i}' for i in range(14)]
all_features = cat_features + cont_features

In [None]:
plt.style.use('fivethirtyeight')
def distribution1(feature,color1,color2,df=train_data):
    """Plot histogram + box plot of a continuous feature and print summary stats."""
    plt.figure(figsize=(15,7))
    
    # Left panel: histogram with KDE. distplot is deprecated since seaborn 0.11;
    # sns.histplot(..., kde=True, stat='density') is the suggested replacement.
    plt.subplot(121)
    dist = sns.distplot(df[feature],color=color1)
    a = dist.patches
    # Local peaks: bars that are taller than both of their neighbours.
    xy = [(a[i].get_x() + a[i].get_width() / 2,a[i].get_height()) \
          for i in range(1,len(a)-1) if (a[i].get_height() > a[i-1].get_height() and a[i].get_height() > a[i+1].get_height())]
    
    # Label each peak with its x position.
    for i,j in xy:
        dist.annotate(
            text=f"{i:.3f}",  # the `s` keyword was renamed to `text` in matplotlib 3.3
            xy=(i,j), 
            xycoords='data',
            ha='center', 
            va='center', 
            fontsize=11, 
            color='black',
            xytext=(0,7), 
            textcoords='offset points',
        )
    
    # Right panel: box plot annotated with the 25%, 50% and 75% quantiles.
    qnt = df[feature].quantile([.25, .5, .75]).reset_index(level=0).to_numpy()
    plt.subplot(122)
    box = sns.boxplot(x=df[feature],color=color2)  # pass the data explicitly as x
    for i,j in qnt:
        box.annotate(str(j)[:4],xy= (j-.05,-0.01),horizontalalignment='center')
        
    print("{}Max value of {} is: {} {:.2f} \n{}Min value of {} is: {} {:.2f}\n{}Mean of {} is: {}{:.2f}\n{}Standard Deviation of {} is:{}{:.2f}"\
      .format(y_,feature,r_,df[feature].max(),g_,feature,r_,df[feature].min(),b_,feature,r_,df[feature].mean(),m_,feature,r_,df[feature].std()))

In [None]:
distribution1('cont0','blue','yellow');

As we can see, there is no resemblance to a normal distribution here, so the data might require some transformation before feeding it to a model.
The data lies almost entirely in the range 0 to 1, and the box plot reveals that most of it is between 0.4 and 0.6.

Let's look at another feature.

In [None]:
distribution1('cont1','magenta','red');

Well, cont1 is a bit strange, as there are gaps in the distribution. You can see pillars in the data; it looks as if the data is partially continuous and partially categorical.

In [None]:
distribution1('cont2','yellow','pink');

cont2 also does not look anything like a normal distribution, and there is some anomaly in the range 0.9 to 1. Maybe we should have a closer look at that area.

In [None]:
distribution1('cont2','green','yellow',df=train_data[train_data['cont2']>=0.9]);

Well, this part of cont2 looks like a normal distribution, and there are outliers on the right side of the distribution.
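
One common rule of thumb for flagging such points is the 1.5 × IQR rule, which is also what box plot whiskers use by default. The sketch below applies it to synthetic values rather than the competition data; `values` and the two injected extreme points are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic feature: a bell-shaped bulk around 0.5 plus two hand-made extreme points.
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(0.5, 0.05, 500), [0.95, 1.02]), name="toy_feature")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged as potential outliers.
outliers = values[(values < lower) | (values > upper)]

print(f"fences: [{lower:.3f}, {upper:.3f}]")
print(f"flagged {len(outliers)} of {len(values)} points")
```

Flagged points are only candidates for a closer look; whether they should be removed, clipped or kept depends on the problem.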

Plotting and analysing all the continuous features one by one is time-consuming, so let's see all the distributions at the same time.

In [None]:
plt.figure(figsize=(20,10))

colors = ['#8ECAE6','#219EBC','#023047',
          '#023047','#023047','#0E402D',
          '#023047','#023047','#F77F00',
          '#D62828','#4285F4','#EA4335',
          '#FBBC05','#34A853']

for i,feature in enumerate(cont_features):
    plt.subplot(2,7,i+1)
    sns.distplot(train_data[feature],color=colors[i])

Well, none of the distributions are normal, and there are peaks in all of them. One other thing is that all the peaks are sharp; what I mean is that the slopes around them are very steep.

## Categorical columns EDA

Categorical columns add valuable information to the data.

Suppose that we have a store which sells various products. Categorical data could be something like male or female, which can be used to find out which sex is more likely to buy a certain product. If we have data from many stores, then the store can be a categorical variable, which can be used to plan the product requirements of each store. These were examples of nominal categorical data. Ordinal categorical data could be a movie rating between 1 and 10.

Another thing to look for in a categorical variable is the balance in the count of each category, which we will see in this notebook.

The first graph that I usually make for categorical data is a count plot.

I will write a function which shows a count plot and also annotates it.

In [None]:
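def countplot1(feature,df=train_data):
    # Pass the column explicitly as x; newer seaborn versions no longer accept it positionally.
    cnt = sns.countplot(x=df[feature])
    # Write the count above each bar.
    for g in cnt.patches:
        cnt.annotate(f"{g.get_height()}",(g.get_x()+g.get_width()/3,g.get_height()+50))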

In [None]:
plt.figure(dpi=100)
countplot1('cat0');

Here we can see that there is a huge imbalance between categories A and B. Such a variable does not provide much information when we feed it into a model, unless the distributions of the target for categories A and B are very different, which we will see later.

Let's see the count plots of all the categorical features at the same time.

In [None]:
plt.figure(figsize=(20,10))
for i, feature in enumerate(cat_features):
    plt.subplot(2,5,i+1)
    countplot1(feature)

cat1 is very well balanced. cat0, cat2, cat4, cat6 and cat7 have one category which dominates the others. In cat3, cat5, cat8 and cat9 two categories dominate.
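
The same balance check can be expressed as a number: the share of rows that fall into the most frequent category of each column. This is a small sketch on an invented frame (`toy`, with hypothetical columns `cat_a` and `cat_b`), not on the competition data.

```python
import pandas as pd

# Hypothetical toy frame: cat_a is dominated by one category, cat_b is balanced.
toy = pd.DataFrame({
    "cat_a": ["A"] * 9 + ["B"] * 1,
    "cat_b": ["A"] * 5 + ["B"] * 5,
})

# value_counts(normalize=True) is sorted in descending order,
# so the first entry is the share of the most frequent category.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

print(top_share)
```

A value close to 1 means a single category dominates the column, while a value close to 1 / (number of categories) means the column is balanced.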

Now we will look at interactions between cont-cont, cat-cat and cont-cat features.

This is the part where we can find patterns which help us in making business decisions. We can answer questions like: What is the effect of an increase in price on the sales of a product? On which day do people buy a certain product? What are the chances of catching a certain disease in a certain month?


When you are looking at continuous data, the first thing you should do is plot a correlation matrix. The correlation matrix tells us how strongly each pair of variables is linearly related, so we can focus on the relationships between strongly correlated features.

I like using plotly for the correlation matrix; you can use seaborn as well.

In [None]:
corr = train_data[cont_features+['target']].corr()
fig = px.imshow(corr)
fig.show()

If we look at the variables from cont5 to cont12, we see some strong positive correlations.
cont12 and cont2 have the strongest negative correlation, -0.3.
cont12 and cont5 show the highest correlation, 0.63, so let's analyse those.
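
Instead of reading the strongest pairs off the heatmap by eye, they can also be extracted from the matrix. The sketch below is one possible way to do it, shown on synthetic columns (`x0`, `x1`, `x2` are made up) rather than on the competition features.

```python
import numpy as np
import pandas as pd

# Synthetic data: x1 is built from x0, x2 is independent noise.
rng = np.random.default_rng(42)
base = rng.normal(size=2000)
toy = pd.DataFrame({
    "x0": base,
    "x1": base + rng.normal(scale=0.5, size=2000),
    "x2": rng.normal(size=2000),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: one row per pair, sorted by absolute correlation.
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a plain descending sort would push to the bottom.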

To look at the relationship between continuous variables we can use a scatter plot, but seaborn has a more powerful tool, lmplot, which draws a scatter plot together with a fitted regression line.

In [None]:
plt.style.use('ggplot')
sns.lmplot(x='cont12',y='cont5',line_kws={"color":"green"},data=train_data)
sns.lmplot(x='cont12',y='cont2',line_kws={"color":"green"},data=train_data);

Well, it looks like there is no pattern to be found here, which was expected, as correlation values of 0.63 and -0.3 are not large enough.

So there is not much to find in the relationships between the continuous variables, but if you want to see the relationships between all of them, you can use PairGrid in seaborn and plot scatter and KDE plots on it.

We will use a sample of only 1000 rows for PairGrid, as it takes a lot of time.

In [None]:
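def plot_grid(data,color1,color2,color3):
    # PairGrid creates its own figure, so a separate plt.figure() call
    # would only produce an extra empty figure.
    f = sns.PairGrid(data)
    f.map_upper(plt.scatter,color = color1)   # scatter plots above the diagonal
    f.map_lower(sns.kdeplot,color = color2)   # 2D KDE plots below the diagonal
    f.map_diag(sns.kdeplot, lw=3, legend=False,color = color3)  # 1D KDE on the diagonal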

In [None]:
plot_grid(train_data.sample(n=1000),'#EA4335','#FBBC05','#34A853');

This plot helps us see all the variables against each other, so we can draw it at the beginning and further analyse the variables which look interesting.

At first glance I do not see any variable that stands out.

So now let us move on to finding relationships between continuous and categorical variables.
The first thing I do when comparing a continuous variable with a categorical one is to check whether the categorical variable can separate the distribution of the target.

In [None]:
def distribution2(cat_feature,cont_feature,df=train_data):
    # Use the df argument (the original always plotted train_data).
    sns.histplot(df,x=cont_feature,hue=cat_feature)

In [None]:
plt.figure(figsize=(15, 7))
distribution2('cat1','target');

One might say that most of the distribution is overlapping, but I would say that there is a good difference. At modelling time, the model will probably give high importance to this feature.

Let us see this plot for all the cat features.

In [None]:
plt.figure(figsize=(20,20))
for i, feature in enumerate(cat_features):
    plt.subplot(5,2,i+1)
    distribution2(feature,'target')

It looks like cat1, cat3 and cat5 are good at separating the target data.
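
A rough numeric companion to these histograms is to compare the target mean per category. The sketch below uses synthetic data in which one made-up column (`cat_sep`) shifts the target and another (`cat_noise`) does not; the column names and the "gap" measure are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends on cat_sep but not on cat_noise.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "cat_sep":   np.repeat(["A", "B"], 500),
    "cat_noise": np.tile(["A", "B"], 500),
})
toy["target"] = np.where(toy["cat_sep"] == "A", 7.0, 8.0) + rng.normal(scale=0.5, size=1000)

# Per-category summary of the target for each categorical column.
summary = {
    col: toy.groupby(col)["target"].agg(["mean", "std", "count"])
    for col in ["cat_sep", "cat_noise"]
}

# Gap between the highest and lowest category mean as a crude separation score.
gap = {col: s["mean"].max() - s["mean"].min() for col, s in summary.items()}

for col, s in summary.items():
    print(col, f"(gap between category means: {gap[col]:.2f})")
    print(s)
```

A large gap relative to the within-category standard deviation suggests the feature separates the target; rare categories should be read with care, since their means are noisy.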

We can use the same plots for finding the relationship between any other continuous variable and a categorical variable, but we can also use a box plot for this purpose. Let us see how to do that.

We will look at the relationship between cont1 and cat9.

In [None]:
def boxploting1(feature,category,df=train_data):
    sns.boxplot(x=feature, y=category, data=df,whis=[0, 100], width=.6, palette="vlag")

In [None]:
plt.figure(figsize=(15,7))
boxploting1('cont1','cat9')

Let us see how cat1 separates the continuous variables.

In [None]:
plt.figure(figsize=(20,10))
for i,feature in enumerate(cont_features):
    plt.subplot(2,7,i+1)
    boxploting1(feature,'cat1')

We can also use more than one categorical variable to make different types of count plots.

This type of graph can be used to answer questions like which products sell more on each day of the week.
Or, for example, if we have data on patients with diseases X and Y, we can plot how many patients had disease Y among those who do and do not have disease X.

Let's make a function for this plot.

In [None]:
def countplot2(cat1,cat2,df=train_data):
    cnt = sns.countplot(data =df,x=cat1,hue=cat2)
    for g in cnt.patches:
        cnt.annotate(f"{g.get_height()}",(g.get_x()+g.get_width()/3,g.get_height()+50))

In [None]:
plt.figure(figsize=(15,7))
countplot2('cat1','cat9')

Here we can see that when cat1 is A, there is a higher chance of cat9 being L (green) than when cat1 is B.
Let's make one more such plot, between cat3 and cat5.

In [None]:
plt.figure(figsize=(15,7))
countplot2('cat3','cat5');

We can also use more than two categorical variables to make a count plot.

In [None]:
sns.catplot(data=train_data,x='cat3', hue='cat5', col='cat1',kind='count',height=7, aspect=.7);
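
The counts behind plots like these can also be tabulated with `pd.crosstab`, and normalising per row turns statements such as "more chance of L when the first category is A" into numbers. This is a minimal sketch on an invented frame (`toy`, with hypothetical columns `cat_x` and `cat_y`).

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({
    "cat_x": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "cat_y": ["L", "L", "L", "M", "L", "M", "M", "M"],
})

# Raw counts: the numbers a grouped count plot would draw as bars.
counts = pd.crosstab(toy["cat_x"], toy["cat_y"])

# Row-wise shares: distribution of cat_y within each value of cat_x.
shares = pd.crosstab(toy["cat_x"], toy["cat_y"], normalize="index")

print(counts)
print(shares)
```

Row-normalised shares are easier to compare than raw counts when the categories of the first variable have very different sizes.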

### How to use Dimensionality Reduction in EDA?

Dimensionality reduction is exactly what it sounds like: it reduces the number of dimensions. The basic idea is to find a few dimensions which represent the whole data. The most used dimensionality reduction technique is PCA (principal component analysis). Dimensionality reduction can be used to reduce the data to a number of dimensions that we can plot. We cannot plot 14 dimensions, but we can plot 2 or 3.

So I am going to write a function which shows a pair plot of all the PCA components, and I will use color to add the target to the plot.

In [None]:
def pca_plot1(features,n_components,target,nrows=10**4):
    pca = PCA(n_components=n_components)
    # Work on a random sample to keep the plot responsive; numeric_only=True avoids
    # errors from taking the mean of the categorical columns.
    train_d = train_data.sample(n=nrows).fillna(train_data.mean(numeric_only=True))
    train_g_pca = pca.fit_transform(train_d[features])

    # Share of the total variance captured by the kept components, in percent.
    total_var = pca.explained_variance_ratio_.sum()*100
    labels = {str(i): f"PC {i+1}" for i in range(n_components)}

    fig = px.scatter_matrix(
        train_g_pca,
        dimensions=range(n_components),
        labels=labels,
        title=f"Total explained variance ratio: {total_var:.2f}%",
        color=train_d[target].values
    )

    fig.update_traces(diagonal_visible=True,opacity=0.5)
    fig.show()

In [None]:
pca_plot1(cont_features,4,'target')

Well, nothing much stands out clearly here, but sometimes you might find something useful.

We can also use 3 components and plot a 3D graph.

In [None]:
def pca_plot_3d(features,target,nrows=10**4):
    pca = PCA(n_components=3)
    # numeric_only=True avoids errors from taking the mean of the categorical columns.
    train_d = train_data.sample(n=nrows).fillna(train_data.mean(numeric_only=True))
    train_g_pca = pca.fit_transform(train_d[features])

    # Share of the total variance captured by the three components, in percent.
    total_var = pca.explained_variance_ratio_.sum()*100

    fig = px.scatter_3d(
        train_g_pca,x=0,y=1,z=2,
        title=f"Total explained variance ratio: {total_var:.2f}%",
        color=train_d[target].values,
        labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
    )

    fig.show()

In [None]:
pca_plot_3d(cont_features,'target')
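
How much a 2D or 3D PCA plot can be trusted depends on how much variance the first few components capture. One way to check this is the cumulative explained variance, sketched below on synthetic data; `X`, the two hidden factors and the 95% threshold are assumptions made for illustration, not properties of the competition data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 6 observed columns generated from 2 hidden factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + rng.normal(scale=0.05, size=(1000, 6))

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(cumulative, 3))
print("components needed for 95% of the variance:", n_needed)
```

If the first two or three components capture only a small share of the variance, the scatter plots above show just a thin slice of the data, and a lack of visible structure in them is weak evidence.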

## 🚧 Work in Progress 🚧

I will keep on editing and adding stuff in this notebook