# Exploratory Data Analysis (EDA) for Categorical Variables - A Beginner's Way

In this Kernel I cover 2 very basic EDAs that are essential for getting your hands on the data and knowing just enough when you are starting off with a competition.
1. Basic Statistics for each Variable
2. Frequency Distribution for each Variable
3. Relationship between the Dependent Variable & Independent Variables

You might be wondering why I mentioned 2 EDAs earlier and then listed 3 :)<br> 
Well, 2 & 3 make more sense when looked at together. So the structure is really:
1. Basic Statistics <br>
2. Distribution Plots <br>
 2.1 Frequency distribution for each Independent Variable <br>
 2.2 Relationship between the Dependent Variable & Independent Variables

The main focus of this Kernel is to explain how the building blocks are created and how everything is then put together to create the final output (for people who are getting started with Data Science / Kaggle). Once you are comfortable with this, you can directly reuse the relevant parts in future projects.

#### Let's Get Started
---

## Notebook Content

1. [Basic Statistics](#1) <br>
    1.1 [Distinct Categories](#1.1) <br>
    1.2 [Count of Distinct Categories (including 'nan')](#1.2) <br>
    1.3 [Count of Distinct Categories (excluding 'nan')](#1.3) <br>
    1.4 [Number of Missing Values](#1.4) <br>
    1.5 [Percentage of Missing Values](#1.5) <br>
    1.6 [Putting it together](#1.6) <br>
2. [Frequency Distribution](#2) <br>
    2.1 [Count Plot](#2.1) <br>
    2.2 [Box Plot](#2.2) <br>
    2.3 [Violin Plot](#2.3) <br>
    2.4 [Swarm Plot](#2.4) <br>
    2.5 [Violin Plot & Swarm Plot Combined](#2.5) <br>
    2.6 [Putting it together](#2.6) <br>
3. [Understanding Fig & Subplot](#s3) <br>
4. [Conclusion](#4) <br>

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# This is to suppress the warning messages (if any) generated in our code
import warnings
warnings.filterwarnings('ignore')

# Comment this out if the data visualisations don't work on your side
%matplotlib inline

# We are using the whitegrid style for our seaborn plots, one of seaborn's built-in styles
sns.set_style(style = 'whitegrid')

In [None]:
dataset = pd.read_csv('../input/train.csv')

In [None]:
nrow, ncol = dataset.shape
nrow, ncol

Our Data has 1460 Rows and 81 Columns

In [None]:
# Let's look at the first few rows of our dataset
dataset.head(3)

In [None]:
dataset.info()

We have the following data types of variables: float64(3), int64(35), object(43).<br>
In this Kernel we will look at the categorical variables, i.e. the 43 variables of type object.

# EDA for Categorical Variables 

#### Create a separate dataframe which has only Categorical Variables

In [None]:
ds_cat = dataset.select_dtypes(include = 'object').copy()
ds_cat.head(2)

<a id ='1'>  
    
## 1. Basic Stats for each variable
</a>     


We will build the code piece by piece: pick one categorical variable from the dataframe, do all the steps on that variable, and then combine everything together.

We will start with the variable MSZoning.

<a id = '1.1'> 
#### 1.1 Look at the distinct categories in our variable. This method also lists missing values (nan), if any
</a>

In [None]:
ds_cat['MSZoning'].unique()

<a id = '1.2' >
#### 1.2 Count of distinct categories in our variable. Here we have counted nan values also (if any)
</a>    

In [None]:
len(ds_cat['MSZoning'].unique())

<a id = "1.3">
#### 1.3 Count of distinct categories in our variable, but this time we don't want to count any nan values
</a>

In [None]:
ds_cat['MSZoning'].nunique()
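
To make the difference between 1.2 and 1.3 concrete, here is a minimal sketch on a tiny hand-made Series. The values are invented for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not from train.csv)
s = pd.Series(["RL", "RM", np.nan, "RL", "FV", np.nan])

# unique() lists every distinct value and shows nan once
distinct_values = s.unique()
print(distinct_values)

# len(unique()) therefore counts nan as one more category
count_with_nan = len(distinct_values)
print(count_with_nan)

# nunique() skips nan by default ...
count_without_nan = s.nunique()
print(count_without_nan)

# ... unless we ask it to keep nan with dropna=False
count_with_nan_again = s.nunique(dropna=False)
print(count_with_nan_again)
```

If the two counts differ for a variable, that variable has at least one missing value.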

<a id = "1.4">
#### 1.4 Number of Missing Values in that variable (for all the rows)
</a>

In [None]:
ds_cat['MSZoning'].isnull().sum()

<a id = "1.5">
#### 1.5 Percentage of Missing Values in that variable
</a>

In [None]:
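# Fraction of rows where the value is missing
ds_cat['MSZoning'].isnull().sum()/ nrow

# Let's multiply by 100 and keep only 1 decimal place
# (rounding after the multiplication avoids floating point noise in the result)
(ds_cat['MSZoning'].isnull().sum()/ nrow * 100).round(1)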

<a id='1.6'>
#### 1.6 Putting it together
</a>

[Look at this kernel](https://www.kaggle.com/nextbigwhat/eda-for-categorical-variables-part-2) for a clean and concise way to have one simple function do all of this for you
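
As a rough idea of what such a function could look like, here is a minimal sketch that applies steps 1.2 to 1.5 to every column at once. The name `categorical_summary`, the output column names, and the toy data are my own assumptions for illustration; this is not necessarily how the linked kernel does it.

```python
import numpy as np
import pandas as pd


def categorical_summary(df):
    """Hypothetical helper: one row of basic statistics per column of df."""
    return pd.DataFrame({
        # 1.2 distinct categories, counting nan as a category
        "n_categories_incl_nan": df.nunique(dropna=False),
        # 1.3 distinct categories, ignoring nan
        "n_categories_excl_nan": df.nunique(),
        # 1.4 number of missing values
        "n_missing": df.isnull().sum(),
        # 1.5 percentage of missing values, 1 decimal place
        "pct_missing": (df.isnull().mean() * 100).round(1),
    })


# Toy data standing in for ds_cat; the values are made up
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "Alley": ["Grvl", np.nan, np.nan, np.nan],
})

summary = categorical_summary(toy)
print(summary)
```

On the real data you would call it as `categorical_summary(ds_cat)` and get one row per categorical variable.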

---
<a id = "2">
## 2. Frequency Distribution
</a>

Here again we will start with one variable (MSZoning) as we did in part 1 to build our code and then subsequently we will put everything together 

<a id = "2.1">
#### 2.1 Count Plot 
</a>

In [None]:
sns.countplot(data = ds_cat, x = 'MSZoning')
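
If you also want the numbers behind the bars, `value_counts` gives the same frequencies as a table. This is a small sketch on made-up values (not from train.csv); `dropna=False` is my choice here so that missing values show up as their own row.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
s = pd.Series(["RL", "RM", np.nan, "RL", "FV", "RL"], name="MSZoning")

# Absolute frequency of each category, nan included
counts = s.value_counts(dropna=False)
print(counts)

# Relative frequency (share of rows) of each category, nan included
shares = s.value_counts(normalize=True, dropna=False).round(3)
print(shares)
```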

#### Since we are working on a supervised ML problem, we should also look at the relationship between the dependent variable and the independent variables. To do that, let's add our dependent variable to this dataset.

In [None]:
ds_cat['SalePrice'] = dataset.loc[ds_cat.index, 'SalePrice'].copy()

In [None]:
len(ds_cat.columns) # 44 columns now: the 43 categorical variables plus SalePrice. At 4 plots per row, 43 variables need 11 rows

<a id = "2.2">
#### 2.2 Box Plot
</a>

In [None]:
sns.boxplot(data = ds_cat, x='MSZoning', y='SalePrice')
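
A box plot draws the median and the quartiles of SalePrice for each category. To read the same numbers as a table, a `groupby` followed by `describe` is one option. The sketch below uses a toy frame with invented prices, so the output says nothing about the real data.

```python
import pandas as pd

# Toy data standing in for ds_cat; categories and prices are made up
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RL", "RM", "RM", "RM"],
    "SalePrice": [200000, 180000, 250000, 120000, 110000, 140000],
})

# One row per category with count, mean, std, min, quartiles and max
stats = toy.groupby("MSZoning")["SalePrice"].describe()

# The box in a box plot spans 25% to 75%, with a line at 50% (the median)
print(stats[["count", "25%", "50%", "75%"]])
```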

<a id = "2.3">
#### 2.3 Violin Plot
</a>

In [None]:
sns.violinplot(data = ds_cat, x='MSZoning', y='SalePrice')

<a id = "2.4">
#### 2.4 Swarm Plot
</a>

In [None]:
sns.swarmplot(data = ds_cat, x='MSZoning', y='SalePrice')

<a id = "2.5">
#### 2.5 Combine Violin Plot & Swarm Plot 
</a>

In [None]:
sns.violinplot(data = ds_cat, x='MSZoning', y='SalePrice')
sns.swarmplot(data = ds_cat, x='MSZoning', y='SalePrice', color = 'k', alpha = 0.6)

A better way to analyze is to look at the countplot and the boxplot together; otherwise we have to keep switching between two plots. 

Here's how we are going to put them together: the countplot goes on top and the box plot (or violinplot) at the bottom. <br>
Or, in plotting terms, we create 2 rows, with the countplot in the 1st row and the box plot (or violinplot) in the 2nd row. In Python we do that with the subplot functionality of matplotlib, the library that seaborn is built on.

<a id = "s3"> </a>
**Read below for a quick learning note on how to combine multiple charts in Python using Seaborn**

There are 2 key elements to understand in the matplotlib machinery that seaborn draws on:
 - Figure
 - Axes (your graphs)
 
First, we’ll start with the figure: it’s your entire graphic. Your actual graph is not your figure (weird!). The figure is the canvas around your graph, and your chart (or charts) sits on top of it. It is usually named *fig*. <br>
>fig = plt.figure() # Actual python code that we will use eventually 

The Axes object (ax for short; older matplotlib versions display it as AxesSubplot) is your actual graph. add_subplot takes three numbers: the number of rows, the number of columns, and the position of this graph in that grid. So we call <br>
>ax = fig.add_subplot(2,1,1) # Second line of python code that we will use

Here we are telling Python: "you have created a figure, now add a subplot to it. Lay the figure out as a 2x1 grid, because I intend to put 2 graphs in one figure, and give me the 1st cell of that grid."<br>
Since we need both cells of the grid, we define ax1 & ax2 in this way - 

>ax1 = fig.add_subplot(2,1,1) <br>
>ax2 = fig.add_subplot(2,1,2)

The benefit of doing it this way is that whenever you pass ax1 (or ax2) to any of our plots, Python knows where to put that graph.<br>
Normally you don’t see a lot of people using add_subplot. Why? I don’t know; people like sleek code and like to do it all at once. That’s why plt.subplots() exists: it creates the figure and all of its axes in a single call.

You can refer to this article [here](http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/how-pandas-uses-matplotlib-plus-figures-axes-and-subplots/) for an even more detailed explanation. 
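
For comparison, here is a minimal sketch of the same 2x1 layout built with plt.subplots() in one call. The toy frame and the figsize are assumptions for illustration; with the real data you would pass `ds_cat` instead of `toy`.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data standing in for ds_cat; categories and prices are made up
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "RM", "FV", "RL"],
    "SalePrice": [200000, 180000, 120000, 140000, 210000, 250000],
})

# One call creates the figure and both axes of the 2x1 grid
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(6, 8))

# Top cell: frequency of each category
sns.countplot(data=toy, x="MSZoning", ax=ax1)

# Bottom cell: SalePrice distribution per category
sns.boxplot(data=toy, x="MSZoning", y="SalePrice", ax=ax2)

# Keep titles and tick labels from overlapping
fig.tight_layout()
```

Both approaches give the same kind of figure, so which one you use is mostly a matter of taste. The add_subplot route just makes each step visible.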

In [None]:
fig = plt.figure()

ax1 = fig.add_subplot(2,1,1) 
sns.countplot(data = ds_cat, x = 'MSZoning', ax = ax1)

ax2 = fig.add_subplot(2,1,2) 
sns.boxplot(data = ds_cat, x='MSZoning', y='SalePrice' , ax = ax2)
#sns.violinplot(data = ds_cat, x='MSZoning', y='SalePrice' , ax = ax2)

# Try using VIOLIN PLOT as well. This can give you a lot of details on your underlying data

#### Let's stack 3 variables (6 charts) and see how it looks

In [None]:
fig = plt.figure(figsize = (15,10))

ax1 = fig.add_subplot(2,3,1)
sns.countplot(data = ds_cat, x = 'MSZoning', ax=ax1)

ax2 = fig.add_subplot(2,3,2)
sns.countplot(data = ds_cat, x = 'LotShape', ax=ax2)

ax3 = fig.add_subplot(2,3,3)
sns.countplot(data = ds_cat, x = 'LotConfig', ax=ax3)

ax4 = fig.add_subplot(2,3,4)
sns.boxplot(data = ds_cat, x = 'MSZoning', y = 'SalePrice' , ax=ax4)
#sns.violinplot(data = ds_cat, x = 'MSZoning', y = 'SalePrice' , ax=ax4)
#sns.swarmplot(data = ds_cat, x = 'MSZoning', y='SalePrice', color = 'k', alpha = 0.4, ax=ax4  )

ax5 = fig.add_subplot(2,3,5)
sns.boxplot(data = ds_cat, x = 'LotShape', y = 'SalePrice', ax=ax5)
#sns.violinplot(data = ds_cat, x = 'LotShape', y = 'SalePrice', ax=ax5)
#sns.swarmplot(data = ds_cat, x = 'LotShape', y='SalePrice', color = 'k', alpha = 0.4, ax=ax5  )

ax6 = fig.add_subplot(2,3,6)
sns.boxplot(data = ds_cat, x = 'LotConfig', y = 'SalePrice', ax=ax6)
#sns.violinplot(data = ds_cat, x = 'LotConfig', y = 'SalePrice', ax=ax6)
#sns.swarmplot(data = ds_cat, x = 'LotConfig', y='SalePrice', color = 'k', alpha = 0.4, ax=ax6  )

#### Looks good!

<a id = "2.6">
### 2.6 Putting it all together
</a>

[Look at this kernel](https://www.kaggle.com/nextbigwhat/eda-for-categorical-variables-part-2) for a clean and concise way to have one simple function do all of this for you
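
To give a feel for how the 6-chart cell above could be generalised, here is a minimal sketch that loops over a list of columns and places a count plot over a box plot for each one. The function name `plot_categoricals`, its `per_row` parameter, the figure sizing, and the toy data are all my own assumptions; the linked kernel may do this differently.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_categoricals(df, columns, target, per_row=3):
    """Hypothetical helper: count plot over box plot for each column."""
    # Each block of `per_row` variables takes 2 grid rows (count + box)
    n_blocks = math.ceil(len(columns) / per_row)
    fig, axes = plt.subplots(
        nrows=2 * n_blocks, ncols=per_row,
        figsize=(5 * per_row, 8 * n_blocks),
        squeeze=False,  # always get a 2D array of axes
    )
    for i, col in enumerate(columns):
        block, pos = divmod(i, per_row)
        sns.countplot(data=df, x=col, ax=axes[2 * block, pos])
        sns.boxplot(data=df, x=col, y=target, ax=axes[2 * block + 1, pos])
    # Hide the grid cells left over in the last block
    for j in range(len(columns), n_blocks * per_row):
        block, pos = divmod(j, per_row)
        axes[2 * block, pos].set_visible(False)
        axes[2 * block + 1, pos].set_visible(False)
    fig.tight_layout()
    return fig


# Toy data standing in for ds_cat; the values are made up
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "RM", "FV", "RL"],
    "LotShape": ["Reg", "IR1", "Reg", "Reg", "IR1", "IR2"],
    "SalePrice": [200000, 180000, 120000, 140000, 210000, 250000],
})

fig = plot_categoricals(toy, ["MSZoning", "LotShape"], target="SalePrice")
```

With `per_row=4` and the 43 categorical variables, this layout would need 11 blocks, which matches the row count noted earlier.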

<a id='4'>
### Conclusion
</a>
That's it for this kernel. 

The idea of this kernel is just to lay out the process of getting started with EDA. In practice you will use the combined version directly and won't have to go through the building-block process each time.

We definitely don't need to put too much weight on the insights that can be gained from an EDA like this. At most we get 2-dimensional relationships, which can be misleading. We would rather focus more on Machine Learning driven EDA.

But the quest is not over yet! 

Based on the feedback and suggestions that I receive for this kernel, I will incorporate them into a separate Kernel so that we can see the progression of our learning.