# CrashDS

#### Module 1 : Data Exploration

Datasets from ISLR by *James et al.* : `Advertising.csv` and `Heart.csv`         
Source: http://faculty.marshall.usc.edu/gareth-james/ISL/data.html     

---

### Essential Libraries

Let us begin by importing the essential Python Libraries.    
You may install any library using `conda install <library>`.    
Most of the libraries come by default with the Anaconda platform.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

---

## Case Study 1 : Advertising Budget vs Sales


### Import the Dataset

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
# Load the CSV file and check the format
advData = pd.read_csv('Advertising.csv')
advData.head()

Check the vital statistics of the dataset using the `type` function and the `shape` attribute.

In [None]:
print("Data type : ", type(advData))
print("Data dims : ", advData.shape)

Check the variables (and their types) in the dataset using the `info()` method.

In [None]:
advData.info()

*NOTE : At this stage, it is good practice to study the data description, if available.*

### Format the Dataset

Drop the `Unnamed: 0` column as it contributes nothing to the problem.   

In [None]:
# Drop the first column (axis = 1) by its name
advData = advData.drop('Unnamed: 0', axis = 1)

Rename the other columns for homogeneity in nomenclature and style.   

In [None]:
# Rename the other columns as per your choice
advData = advData.rename(columns={"TV": "TV", "radio": "RD", "newspaper" : "NP", "sales" : "Sales"})
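
The two formatting steps above can also be chained, since `drop` and `rename` each return a new dataframe. The following is a minimal sketch on a made-up frame (`toyAdv` and its values are illustrative, not taken from `Advertising.csv`).

```python
import pandas as pd

# Made-up rows that mimic the column layout of Advertising.csv (values are illustrative only)
toyAdv = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "TV": [100.0, 50.0, 20.0],
    "radio": [30.0, 40.0, 45.0],
    "newspaper": [60.0, 45.0, 70.0],
    "sales": [20.0, 10.0, 9.0],
})

# Drop the index-like column, then rename the remaining columns in one chain
toyAdv = (toyAdv
          .drop(columns = "Unnamed: 0")
          .rename(columns = {"radio": "RD", "newspaper": "NP", "sales": "Sales"}))

print(toyAdv.columns.tolist())
```

Columns that are not listed in the `rename` mapping (here `TV`) keep their names. If the first column is only a row index, `pd.read_csv(..., index_col = 0)` is another option that avoids the `drop` step.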

Check the format and vital statistics of the modified dataframe.

In [None]:
advData.head()

In [None]:
advData.info()

### Uni-Variate Exploration

Let us take `Sales` as our target variable for the Uni-Variate analysis.    
Check the Summary Statistics of Uni-Variate Series using `describe`.

In [None]:
advData["Sales"].describe()

Check the Summary Statistics visually using a standard `boxplot`.

In [None]:
f = plt.figure(figsize=(16, 2))
sb.boxplot(x = advData["Sales"])  # passing the Series as `x` draws a horizontal boxplot
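
The numbers behind a boxplot can be reproduced by hand. This sketch assumes the common convention of whiskers bounded by 1.5 times the inter-quartile range, and uses a made-up series (`toySales`) rather than the real `Sales` column.

```python
import pandas as pd

# Made-up values standing in for a numeric column such as Sales
toySales = pd.Series([5.0, 8.0, 10.0, 12.0, 13.0, 15.0, 18.0, 40.0])

# Quartiles mark the edges of the box
q1 = toySales.quantile(0.25)
q3 = toySales.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are usually drawn as individual outliers
lowerFence = q1 - 1.5 * iqr
upperFence = q3 + 1.5 * iqr
outliers = toySales[(toySales < lowerFence) | (toySales > upperFence)]

print("Box   :", q1, "to", q3)
print("Fences:", lowerFence, "to", upperFence)
print("Outliers:", outliers.tolist())
```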

Extend the summary to visualize the complete distribution of the Series.  
The first visualization is a simple Histogram with automatic bin sizes.

In [None]:
f = plt.figure(figsize=(16, 8))
sb.histplot(x = advData["Sales"])  # `distplot` is deprecated in recent Seaborn versions
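
To see what "automatic bin sizes" means in numbers, the bins can be computed directly with NumPy. This is only a sketch: it uses NumPy's `"auto"` rule on randomly generated stand-in data (`toySales`), and the rule Seaborn applies may differ depending on its version.

```python
import numpy as np

# Randomly generated stand-in for a numeric column (not the real Sales data)
rng = np.random.default_rng(0)
toySales = rng.normal(loc = 14.0, scale = 5.0, size = 200)

# Let NumPy choose the number of bins, then inspect the result
counts, edges = np.histogram(toySales, bins = "auto")

print("Number of bins :", len(counts))
print("Bin width      :", edges[1] - edges[0])
```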

Adding `kde = True` to `histplot` overlays the Kernel Density Estimate on the Histogram.

In [None]:
f = plt.figure(figsize=(16, 8))
sb.histplot(x = advData["Sales"], kde = True)
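
A Kernel Density Estimate places a small smooth bump on every observation and averages the bumps. The sketch below assumes a Gaussian kernel and Scott's rule of thumb for the bandwidth; the plotting library may choose its bandwidth differently, and `toySales` is made-up data.

```python
import numpy as np

# Made-up values standing in for a numeric column such as Sales
toySales = np.array([5.0, 8.0, 10.0, 12.0, 13.0, 15.0, 18.0, 22.0])
n = len(toySales)

# Scott's rule of thumb for the bandwidth (an assumption made for this sketch)
bandwidth = toySales.std(ddof = 1) * n ** (-1 / 5)

# Evaluate the density on a grid that extends past the data on both sides
grid = np.linspace(toySales.min() - 4 * bandwidth, toySales.max() + 4 * bandwidth, 400)

# One Gaussian bump per observation, averaged over all observations
z = (grid[:, None] - toySales[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis = 1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to (approximately) one
area = (density * (grid[1] - grid[0])).sum()
print("Area under the estimate :", area)
```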

Finally, the `violinplot` combines the Boxplot with the Kernel Density Estimate.

In [None]:
f = plt.figure(figsize=(16, 8))
sb.violinplot(x = advData["Sales"])

### Multi-Variate Exploration

A quick way to check the Uni-Variate Summary Statistics for all variables in the dataset is as follows.

In [None]:
# Check summary statistics of all variables
advData.describe()

The following piece of code is a quick way to visualize the Uni-Variate plots for all variables.

In [None]:
# Draw the distributions of all variables
f, axes = plt.subplots(4, 3, figsize=(18, 16))
colors = ["r", "g", "b", "m"]

count = 0
for var in advData:
    sb.boxplot(x = advData[var], color = colors[count], ax = axes[count,0])
    sb.histplot(x = advData[var], kde = True, color = colors[count], ax = axes[count,1])
    sb.violinplot(x = advData[var], color = colors[count], ax = axes[count,2])
    count += 1

However, multi-variate analysis requires knowing how the variables relate to one another.  
Check the Pearson Correlation between the variables using `corr()`.

In [None]:
# Correlation Matrix
cormat = advData.corr()
print(cormat)
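
Each entry of the correlation matrix is a Pearson coefficient: the sum of products of the centred values, divided by the square root of the product of their sums of squares. The sketch below checks this formula against Pandas on a made-up two-column frame (`toyAdv`).

```python
import numpy as np
import pandas as pd

# Made-up values, not taken from Advertising.csv
toyAdv = pd.DataFrame({
    "TV":    [230.0, 44.0, 17.0, 151.0, 180.0, 9.0],
    "Sales": [22.0, 10.0, 9.0, 18.0, 13.0, 7.0],
})

# Centre both variables around their means
x = toyAdv["TV"] - toyAdv["TV"].mean()
y = toyAdv["Sales"] - toyAdv["Sales"].mean()

# Pearson correlation from its definition
rManual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same quantity as computed by Pandas
rPandas = toyAdv["TV"].corr(toyAdv["Sales"])

print("Manual :", rManual)
print("Pandas :", rPandas)
```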

In [None]:
# Heatmap of the Correlation Matrix
cormat = advData.corr()

f, axes = plt.subplots(1, 1, figsize=(16, 12))
sb.heatmap(cormat, vmin = -1, vmax = 1, linewidths = 1,
           annot = True, fmt = ".2f", annot_kws = {"size": 18}, cmap = "RdBu")
axes.set_ylim(len(cormat), 0)  # temporary fix for heatmap
plt.show()

Let us consider `Sales` against `TV` first, as they show the largest correlation.

In [None]:
# 2D scatterplot of two variables to observe their relationship
f = plt.figure(figsize=(16, 8))
sb.scatterplot(x = "TV", y = "Sales", data = advData)

Next, let us consider `Sales` against `RD`, with the second largest correlation.

In [None]:
# 2D scatterplot of two variables to observe their relationship
f = plt.figure(figsize=(16, 8))
sb.scatterplot(x = "RD", y = "Sales", data = advData)

Finally, let us consider `Sales` against `NP`, with the minimum correlation value.

In [None]:
# 2D scatterplot of two variables to observe their relationship
f = plt.figure(figsize=(16, 8))
sb.scatterplot(x = "NP", y = "Sales", data = advData)

You may also explore a pairwise scatterplot between all variables in the dataset.

In [None]:
# Draw pairs of variables against one another
sb.pairplot(data = advData, height = 4)

---

## Case Study 2 : Personal Parameters vs Heart Disease


### Import the Dataset

The dataset is in CSV format; hence we use the `read_csv` function from Pandas.  
Immediately after importing, take a quick look at the data using the `head` function.

In [None]:
# Load the CSV file and check the format
heartData = pd.read_csv('Heart.csv')
heartData.head()

Check the vital statistics of the dataset using the `type` function and the `shape` attribute.

In [None]:
print("Data type : ", type(heartData))
print("Data dims : ", heartData.shape)

Check the variables (and their types) in the dataset using the `info()` method.

In [None]:
heartData.info()

*NOTE : At this stage, it is good practice to study the data description, if available.*

### Format the Dataset

Drop the `Unnamed: 0` column as it contributes nothing to the problem.   

In [None]:
# Drop the first column (axis = 1) by its name
heartData = heartData.drop('Unnamed: 0', axis = 1)

Drop the rows where values are missing in any column using `dropna()`.    
You may instead choose the `fillna()` method to fill in missing values. 

In [None]:
# Drop the rows with `NA` values
heartData = heartData.dropna()
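
The difference between the two options is easiest to see side by side. This sketch uses a made-up frame (`toyHeart`); filling with the median or the most frequent value is just one possible choice, not a recommendation for the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows with a few missing entries (not taken from Heart.csv)
toyHeart = pd.DataFrame({
    "Age": [63, 67, 41, 56],
    "Ca": [0.0, np.nan, 0.0, 2.0],
    "Thal": ["fixed", "normal", None, "normal"],
})

# Option 1 : drop every row that has at least one missing value
dropped = toyHeart.dropna()

# Option 2 : keep all rows and fill the gaps column by column
filled = toyHeart.copy()
filled["Ca"] = filled["Ca"].fillna(filled["Ca"].median())          # numeric : median
filled["Thal"] = filled["Thal"].fillna(filled["Thal"].mode()[0])   # text : most frequent value

print("Rows after dropna :", dropped.shape[0])
print("Rows after fillna :", filled.shape[0])
```

Dropping is simpler but loses rows; filling keeps every row at the cost of introducing values that were never observed.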

Check the summary statistics to get a first feel for the data.

In [None]:
heartData.describe()

Convert the columns of type `object` to categorical data (factor) format.   

In [None]:
heartData["ChestPain"] = heartData["ChestPain"].astype('category')
heartData["Thal"] = heartData["Thal"].astype('category')
heartData["AHD"] = heartData["AHD"].astype('category')
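
What `astype('category')` does can be inspected through the `.cat` accessor. A minimal sketch on a made-up series (`toyPain`):

```python
import pandas as pd

# Made-up text column (values are illustrative only)
toyPain = pd.Series(["typical", "asymptomatic", "typical", "nonanginal"])

# Convert to categorical : each distinct value becomes a level
asCategory = toyPain.astype("category")

levels = asCategory.cat.categories.tolist()   # the distinct levels
codes = asCategory.cat.codes.tolist()         # integer code of each row's level

print("Levels :", levels)
print("Codes  :", codes)
```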

Convert the non-obvious *categorical* columns to `category` format as well.    
You may use the `nunique()` method on each column to identify *categoricals*.   

In [None]:
heartData["Chol"].nunique()
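
Instead of checking one column at a time, `nunique()` can be called on the whole dataframe. In this sketch, `toyHeart` is made-up data and the cut-off of 5 distinct values is an arbitrary choice for illustration; the final decision should still rest on the data description.

```python
import pandas as pd

# Made-up rows (not taken from Heart.csv)
toyHeart = pd.DataFrame({
    "Age": [63, 67, 41, 56, 62, 57],
    "Sex": [1, 1, 0, 1, 0, 0],
    "Chol": [233, 286, 204, 236, 268, 354],
    "Fbs": [1, 0, 0, 0, 0, 0],
})

# Number of distinct values in every column
uniqueCounts = toyHeart.nunique()
print(uniqueCounts)

# Columns with very few distinct values are candidates for `category`
candidates = uniqueCounts[uniqueCounts <= 5].index.tolist()
print("Candidates :", sorted(candidates))
```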

In [None]:
heartData["Sex"] = heartData["Sex"].astype('category')
heartData["Fbs"] = heartData["Fbs"].astype('category')
heartData["RestECG"] = heartData["RestECG"].astype('category')
heartData["ExAng"] = heartData["ExAng"].astype('category')
heartData["Ca"] = heartData["Ca"].astype('category')
heartData["Slope"] = heartData["Slope"].astype('category')

Check the format and vital statistics of the modified dataframe.

In [None]:
heartData.head()

In [None]:
heartData.info()

### Uni-Variate Exploration

Let us take `AHD` as our target variable for the Uni-Variate analysis.    
Check the Summary Statistics of Uni-Variate Series using `describe`.

In [None]:
heartData["AHD"].describe()

Check the Summary Statistics visually using a standard `countplot`.

In [None]:
print(heartData["AHD"].value_counts())
sb.countplot(y = "AHD", data = heartData)
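
Raw counts are often easier to compare as proportions, which `value_counts` returns when `normalize = True` is passed. A minimal sketch on a made-up series (`toyAHD`):

```python
import pandas as pd

# Made-up labels standing in for the AHD column
toyAHD = pd.Series(["No", "Yes", "No", "No", "Yes"], dtype = "category")

counts = toyAHD.value_counts()                     # absolute frequencies
shares = toyAHD.value_counts(normalize = True)     # relative frequencies

print(counts)
print(shares)
```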

You may check the same for all categorical variables in the dataset.    
Uni-Variate Summary Statistics for all numeric variables are as follows.

In [None]:
# Check summary statistics of all numeric variables
heartData.describe()

### Multi-Variate Exploration

Note that correlation has no natural meaning for categorical variables, so we compute it for the numeric variables only.

In [None]:
# Correlation Matrix (numeric columns only)
print(heartData.select_dtypes(include = "number").corr())

Let us consider `AHD` against `RestBP` first, as we think they may have a connection.   

In [None]:
# Boxplot of numeric variable against categorical variable
f = plt.figure(figsize=(16, 4))
sb.boxplot(x = "RestBP", y = "AHD", data = heartData)

Let us consider `AHD` against `Chol` too, as we think they may have a connection.   

In [None]:
# Boxplot of numeric variable against categorical variable
f = plt.figure(figsize=(16, 4))
sb.boxplot(x = "Chol", y = "AHD", data = heartData)
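
The grouped boxplots above have a numeric counterpart: `describe` applied per group. This sketch uses a made-up frame (`toyHeart`) with only two columns.

```python
import pandas as pd

# Made-up rows (not taken from Heart.csv)
toyHeart = pd.DataFrame({
    "AHD": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "RestBP": [120, 150, 130, 160, 110, 140],
})

# Summary statistics of the numeric variable within each level of the categorical one
summary = toyHeart.groupby("AHD")["RestBP"].describe()
print(summary)
```

Comparing the `50%` column across the rows is the tabular equivalent of comparing the median lines of the boxes.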

Next, let us consider `AHD` against `ChestPain`, as it seems like an obvious connection.   

In [None]:
grouped = heartData.groupby(['AHD', 'ChestPain']).size().unstack()
grouped
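
The same table of counts can be obtained with `pd.crosstab`, which also offers row-wise proportions through `normalize = "index"`. A sketch on made-up data (`toyHeart`), shown as an alternative rather than a replacement for the `groupby` approach:

```python
import pandas as pd

# Made-up rows (not taken from Heart.csv)
toyHeart = pd.DataFrame({
    "AHD": ["No", "Yes", "No", "Yes", "Yes", "No"],
    "ChestPain": ["typical", "asymptomatic", "nonanginal",
                  "asymptomatic", "typical", "nonanginal"],
})

# Counts via groupby, filling absent combinations with zero
viaGroupby = toyHeart.groupby(["AHD", "ChestPain"]).size().unstack(fill_value = 0)

# The same counts via crosstab
viaCrosstab = pd.crosstab(toyHeart["AHD"], toyHeart["ChestPain"])

# Share of each ChestPain level within every AHD level (each row sums to one)
rowShares = pd.crosstab(toyHeart["AHD"], toyHeart["ChestPain"], normalize = "index")

print(viaCrosstab)
print(rowShares)
```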

In [None]:
# Heatmap of counts for two categorical variables
grouped = heartData.groupby(['AHD', 'ChestPain']).size().unstack()

f, axes = plt.subplots(1, 1, figsize=(16, 6))
sb.heatmap(grouped, linewidths = 1, annot = True, fmt = ".2f", annot_kws = {"size": 18}, cmap = "Greys")
axes.set_ylim(len(grouped), 0)  # temporary fix for heatmap
plt.show()

Next, let us consider `AHD` against `Sex`, to find out whether there is any connection at all.   

In [None]:
grouped = heartData.groupby(['AHD', 'Sex']).size().unstack()
grouped

In [None]:
# Heatmap of counts for two categorical variables
grouped = heartData.groupby(['AHD', 'Sex']).size().unstack()

f, axes = plt.subplots(1, 1, figsize=(8, 6))
sb.heatmap(grouped, linewidths = 1, annot = True, fmt = ".2f", annot_kws = {"size": 18}, cmap = "Greys")
axes.set_ylim(len(grouped), 0)  # temporary fix for heatmap
plt.show()

**Now it's time to take this basic Data Exploration forward, to prediction of `Sales` and `AHD` based on the two datasets.**