# EDA and Choosing Best Classifier including Hyperparameters 

Below I go through a relatively simple but comprehensive pipeline that shows how to analyze, process and impute data, and then apply Machine Learning algorithms to perform binary classification. The main points covered in this notebook are:

1. Imputing Missing Data. 
    
    1.1. Imputing via Mean of Feature and via Interpolation. 

2. Learn about the Dataset through properly analyzing distributions. 
    
    2.1. Feature Distribution based on Outcome. 

    2.2. Feature distribution using Pair Plot of Seaborn.  
    
    2.3. Are there outliers? Using Box plots.


3. Learn to simply deal with Outliers. 


4. Apply and Test Several ML Algorithms. 
    
    4.1. Create a pipeline of standardizing the features and ML Algorithms. 
    
    4.2. Select the best hyperparameter for each model. 
    
    4.3. Store the models and corresponding scores in a dataframe. 

Enjoy!    

## 1. Necessary Imports 

In [None]:
import time
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

## 2. Load the CSV File with Pandas  

In [None]:
diab_df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
diab_df.head(6)

## 3. Check Basic Information About the Dataframe 

In [None]:
print ('dataframe shape: ', diab_df.shape)
print ('null values in the entire dataframe: ', diab_df.isnull().values.any()) # check for NaNs
print ('total number of null values: ', diab_df.isnull().sum().sum()) # total number of NaNs
print ('number of positive (1) and negative (0) diabetic patients: ', '\n', diab_df.Outcome.value_counts())

We see that the number of negative patients is higher (almost twice) than the number of positive patients. We will keep this in mind later, when we split the data for training and testing. 
For now, let's rename the column 'DiabetesPedigreeFunction' to something shorter, as it is a rather long name. 

In [None]:
diab_df.rename(columns={'DiabetesPedigreeFunction': 'DiabPedgFunct'}, inplace=True)

print ('check dataframe columns :', diab_df.columns)

## 4. Knowing the Data (Data Analysis)

_So the dataframe has no null (NaN) values_. Don't get fooled by this though, because as you can see at least for the first 6 rows, 'Insulin' and 'SkinThickness' have entries with value 0. These are impossible entries, so we have to test ways to impute those zero values. 

So first we check for 0 values in specific columns (Glucose, BloodPressure, SkinThickness, Insulin, BMI) where measuring 0 does not make sense (should be treated as missing value). We can check later that Age column has no 0 values.  We will proceed step by step. 

### 4.1. Fraction of Missing Values (How much data aren't available)  

In [None]:
glucose_val_count0 = diab_df['Glucose'].value_counts()[0]
# print (glucose_val_count0)

bp_val_count0 = diab_df['BloodPressure'].value_counts()[0]
# print (bp_val_count0)

skin_th_count0 = diab_df['SkinThickness'].value_counts()[0]
# print (skin_th_count0)

insulin_count0 = diab_df['Insulin'].value_counts()[0]
# print(insulin_count0)

BMI_count0 = diab_df['BMI'].value_counts()[0]
# print(BMI_count0)

# Age_count0 = diab_df['Age'].value_counts()[0]
# print (Age_count0) # no zero values for age, so this lookup raises a KeyError 

val_list0 = [glucose_val_count0/diab_df.shape[0], bp_val_count0/diab_df.shape[0], skin_th_count0/diab_df.shape[0], 
             insulin_count0/diab_df.shape[0], BMI_count0/diab_df.shape[0]]

labels0 = ['Glucose', 'BP', 'SkinThick', 'Insulin', 'BMI'] 
x = np.arange(len(labels0))

fig = plt.figure(figsize=(6, 5))
plt.bar(x, height=val_list0, width=0.4, align='center', color='magenta', alpha=0.7)
plt.xticks(ticks=x, labels=labels0, fontsize=12)
plt.title('Fraction of Missing Values', fontsize=14)
plt.show()

We see here that for 'Insulin' almost half the values are missing.  
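The per-column counting above can also be written as a single vectorized expression. The sketch below is only an illustration on a small made-up frame (`toy_df` and its values are assumptions, not the real dataset): comparing the columns with 0 gives a boolean frame, and its column-wise mean is the fraction of zeros.

```python
import pandas as pd

# Toy stand-in for diab_df; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

cols = ["Glucose", "Insulin", "BMI"]

# (toy_df[cols] == 0) is a boolean frame; the mean of each boolean column
# is the fraction of entries that are exactly zero.
zero_fraction = (toy_df[cols] == 0).mean()
print(zero_fraction)
```

A side benefit of this form is that a column without any zeros simply reports 0.0 instead of raising a `KeyError`.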

### 4.2. Check Distribution of Features 

Below we will plot _Histogram Plots_ of the features to see how they are distributed. 

In [None]:
fig = plt.figure(figsize=(12, 7))
fig.add_subplot(241)
plt.hist(diab_df['Age'], bins=int(np.sqrt(diab_df.shape[0])), density= True, color='lime', alpha=0.6)
plt.xlabel('Age', fontsize=12)

fig.add_subplot(242)
plt.hist(diab_df['BloodPressure'], bins=int(np.sqrt(diab_df.shape[0])), density= True, color='lime', alpha=0.6)
plt.xlabel('Blood Pressure', fontsize=12)

fig.add_subplot(243)
plt.hist(diab_df['BMI'], bins=int(np.sqrt(diab_df.shape[0])), density= True, color='lime', alpha=0.6)
plt.xlabel('BMI', fontsize=12)

fig.add_subplot(244)
plt.hist(diab_df['DiabPedgFunct'], bins=int(np.sqrt(diab_df.shape[0])), density= True, color='lime', alpha=0.6)
plt.xlabel('Diabetes Pedigree Function', fontsize=12)

fig.add_subplot(245)
plt.hist(diab_df['Glucose'], bins=int(np.sqrt(diab_df.shape[0])), density= True, color='lime', alpha=0.6)
plt.xlabel('Glucose', fontsize=12)

fig.add_subplot(246)
plt.hist(diab_df['Insulin'], bins=int(np.sqrt(diab_df.shape[0])), density= True, color='lime', alpha=0.6)
plt.xlabel('Insulin', fontsize=12)

fig.add_subplot(247)
plt.hist(diab_df['Pregnancies'], bins=int(np.sqrt(diab_df.shape[0])), density= True, color='lime', alpha=0.6)
plt.xlabel('Pregnancies', fontsize=12)

fig.add_subplot(248)
plt.hist(diab_df['SkinThickness'], bins=int(np.sqrt(diab_df.shape[0])), density= True, color='lime', alpha=0.6)
plt.xlabel('Skin  Thickness', fontsize=12)

plt.tight_layout()
plt.show()

### 4.3. Strategy to Impute Missing Values
#### Try Imputing with _Mean Value_ of Feature

We can once again verify from the histograms that the Insulin and SkinThickness features have many zero values. Apart from those zeros, the BloodPressure, BMI, Glucose and SkinThickness features follow a nearly _Normal_ distribution. So we start by replacing the missing values with the feature mean (for a perfectly normal distribution the mean, median and mode coincide). We will use the DataFrame [replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) method. However, since Insulin and SkinThickness have many missing values, this strategy may not be good enough for them. 

To get started we first replace the zero values with NaNs so that handling them later would be easier.   

In [None]:
# use np.nan: the np.NaN alias was removed in NumPy 2.0
zero_as_missing_cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
diab_df[zero_as_missing_cols] = diab_df[zero_as_missing_cols].replace(0, np.nan)

In [None]:
# replace NaNs with the mean of each feature (mean() skips NaNs by default)

diab_df['BMI_New'] = diab_df['BMI'].replace(np.nan, diab_df['BMI'].mean())
diab_df['BloodPressure_New'] = diab_df['BloodPressure'].replace(np.nan, diab_df['BloodPressure'].mean())
diab_df['Glucose_New'] = diab_df['Glucose'].replace(np.nan, diab_df['Glucose'].mean())
diab_df['Insulin_New'] = diab_df['Insulin'].replace(np.nan, diab_df['Insulin'].mean())
diab_df['SkinThickness_New'] = diab_df['SkinThickness'].replace(np.nan, diab_df['SkinThickness'].mean())

We plot the distributions again to see what they look like after imputation with the mean value. 

In [None]:
fig = plt.figure(figsize=(10, 8))

fig.add_subplot(231)
plt.hist(diab_df['BMI_New'], density=True, bins=int(np.sqrt(diab_df.shape[0])), color='lime', alpha=0.7)
plt.xlabel('BMI', fontsize=12)

fig.add_subplot(232)
plt.hist(diab_df['BloodPressure_New'], density=True, bins=int(np.sqrt(diab_df.shape[0])), color='lime', alpha=0.7)
plt.xlabel('BP', fontsize=12)

fig.add_subplot(233)
plt.hist(diab_df['Glucose_New'], density=True, bins=int(np.sqrt(diab_df.shape[0])), color='lime', alpha=0.7)
plt.xlabel('Glucose', fontsize=12)

fig.add_subplot(234)
plt.hist(diab_df['Insulin_New'], density=True, bins=int(np.sqrt(diab_df.shape[0])), color='lime', alpha=0.7)
plt.xlabel('Insulin', fontsize=12)

fig.add_subplot(235)
plt.hist(diab_df['SkinThickness_New'], density=True, bins=int(np.sqrt(diab_df.shape[0])), color='lime', alpha=0.7)
plt.xlabel('Skin Thickness', fontsize=12)

plt.tight_layout()
plt.show()

Here we see the problem with mean imputation when a large share of a feature's values is missing. Even though the distributions for BMI, BP and Glucose look reasonable, the distributions of Insulin and SkinThickness now have a tall, narrow spike at the mean and closely resemble a [Cauchy distribution](https://en.wikipedia.org/wiki/Cauchy_distribution) with a low $\gamma$ value. They look abnormal only because of our imputing strategy, so for these 2 features we have to take a different approach.  
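The effect can be reproduced on synthetic data. The following is a minimal sketch, assuming a made-up right-skewed feature with roughly half of its entries hidden (it is not the real Insulin column): filling the gaps with the mean leaves the mean unchanged but piles many rows onto one value and shrinks the spread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed feature; an assumption for illustration only.
values = pd.Series(rng.lognormal(mean=4.5, sigma=0.6, size=1000))

# Hide roughly half of the entries to mimic a heavily incomplete column.
missing_mask = rng.random(1000) < 0.5
with_gaps = values.mask(missing_mask)

# Mean imputation: every gap receives the same value.
observed_mean = with_gaps.mean()
mean_filled = with_gaps.fillna(observed_mean)

print("std of observed values  :", with_gaps.std())
print("std after mean imputing :", mean_filled.std())
print("share of rows at the mean:", (mean_filled == observed_mean).mean())
```

The spike in the histogram corresponds to the last number printed: the fraction of rows that all carry the identical imputed value.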

#### Impute Using _Interpolation._ 

For SkinThickness and Insulin we replace the NaN values by linear interpolation, using the Pandas Series [_interpolate_](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html) method. Keep in mind that linear interpolation fills each gap from the neighbouring rows, so the result depends on the order of the rows. Pandas has nice [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) about different ways to handle missing data. 
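To make the behaviour of linear interpolation concrete, here is a small sketch on a made-up Series (the values are assumptions chosen so the arithmetic is easy to follow). Gaps between two observed values are filled along a straight line, and with `limit_direction='both'` gaps at either end are filled as well.

```python
import numpy as np
import pandas as pd

# Toy series with a leading gap, an interior gap of two rows and a trailing gap.
toy = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Linear interpolation over the row positions; 'both' also fills the edges.
filled = toy.interpolate(method="linear", limit_direction="both")
print(filled.tolist())
```

The two interior gaps become 20 and 30, evenly spaced between 10 and 40, while the edge gaps take the nearest observed value.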


In [None]:
diab_df = diab_df.astype(float)
diab_df['SkinThickness_New1'] = diab_df.SkinThickness.interpolate(method='linear', limit=400, limit_direction='both')
diab_df['Insulin_New1'] = diab_df.Insulin.interpolate(method='linear', limit=600, limit_direction='both')


fig = plt.figure(figsize=(6, 5))

fig.add_subplot(121)
plt.hist(diab_df['SkinThickness_New1'], density=True, color='lime', alpha=0.6)
plt.xlabel('Skin Thickness', fontsize=12)

fig.add_subplot(122)
plt.hist(diab_df['Insulin_New1'], density=True, color='lime', alpha=0.7)
plt.xlabel('Insulin', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
### check the min and max values of the above 2 features. 

print ('Insulin max and min; ', max(diab_df['Insulin_New1']) , min(diab_df['Insulin_New1']))
print ('Skin Thickness max and min; ', max(diab_df['SkinThickness_New1']), min(diab_df['SkinThickness_New1']))

After successful imputation we only keep the relevant columns for our analysis. 

In [None]:
#### select the relevant features
diab_df_selected = diab_df[['Pregnancies', 'Glucose_New', 'BloodPressure_New', 'SkinThickness_New1', 'Insulin_New1',
       'BMI_New', 'DiabPedgFunct', 'Age', 'Outcome']]   

### 4.4. Distribution of Features Based on Outcome

Below we look at how each feature is distributed for the two outcomes, i.e. positive and negative diabetic patients. 
If the histograms of positive and negative patients are widely separated for some feature, that feature plays an important role in classifying the patients.  

In [None]:
features = diab_df_selected.drop(['Outcome'], axis=1)

features_arr = features.to_numpy()
feature_names_list = features.columns.to_list()

positive_diab = features_arr[diab_df_selected.Outcome==1]
negative_diab = features_arr[diab_df_selected.Outcome==0]

fig,axes =plt.subplots(4,2, figsize=(10, 8))
ax = axes.ravel()

for i in range(8):
    _,bins= np.histogram(features_arr[:, i], bins=int(np.sqrt(768)) )
    # plt.close()
    ax[i].hist(positive_diab[:, i], bins=bins, histtype='stepfilled', edgecolor='red', linewidth=1.2, fill=False, alpha=0.8,)
    ax[i].hist(negative_diab[:, i], bins=bins, color='green', alpha=0.6)
    ax[i].set_title(feature_names_list[i],fontsize=12)
ax[0].legend(['Positive','Negative'],loc='best',fontsize=11)
plt.tight_layout()
plt.show() 

From the age distribution we can say that young people are less likely to be diabetic. But there is no single feature for which we see a wide separation between the two classes. So for the machine learning part we may need to include all the features to classify the patients. 

This becomes clearer when we look at the correlation between the different features. 
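The visual impression of separation from the histograms can be complemented with a number. The sketch below is one possible way to do that, not part of the original analysis: it computes a standardized mean difference per feature (difference of class means divided by a pooled standard deviation). The helper name `standardized_mean_difference` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(df, label_col):
    """Difference of class means over the pooled standard deviation, per feature."""
    pos = df[df[label_col] == 1].drop(columns=label_col)
    neg = df[df[label_col] == 0].drop(columns=label_col)
    pooled_std = np.sqrt((pos.var() + neg.var()) / 2.0)
    return (pos.mean() - neg.mean()) / pooled_std

# Made-up data: Glucose separates the classes, Age does not.
toy_df = pd.DataFrame({
    "Glucose": [150, 160, 170, 90, 100, 110],
    "Age":     [30, 40, 50, 30, 40, 50],
    "Outcome": [1, 1, 1, 0, 0, 0],
})

smd = standardized_mean_difference(toy_df, "Outcome")
print(smd)
```

Values near zero mean the two class histograms largely overlap; larger absolute values mean the histograms are further apart.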

### 4.5. Feature Correlation  

In [None]:
# features = list(diab_df_selected.columns[0:7])
# feature_corr = diab_df_selected[features].corr() # alternate way 


feature_corr = features.corr() 

fig = plt.figure(figsize=(10, 7))
g1 = sns.heatmap(feature_corr, cmap='coolwarm', vmin=0., vmax=1., )
g1.set_xticklabels(g1.get_xticklabels(), rotation=40, fontsize=10)
g1.set_yticklabels(g1.get_yticklabels(), rotation=40, fontsize=10)
plt.title('Correlation Plot of Features', fontsize=12)
plt.show()

The correlation between the variables is very low. Some moderate correlation exists between Age and Pregnancies, and between BMI_New and BloodPressure_New.  
This also tells us that we can skip PCA (this is relevant for the ML part), because we want all our features to be present. 
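Reading the strongest pairs off a heatmap by eye can be error-prone, so it may help to list them. The following is a sketch under assumptions: the helper `top_correlated_pairs` and the toy columns `a`, `b`, `c` are hypothetical, and only the strict upper triangle of the correlation matrix is kept so that each pair appears once.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.corr()
    # Boolean mask of the strict upper triangle (diagonal excluded).
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up data: 'a' and 'b' are almost perfectly correlated.
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [3, 1, 4, 1, 5],
})

pairs = top_correlated_pairs(toy_df)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations in view as well.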

We can see the feature dependence in detail with the pair plot. 

In [None]:
sns.set(font_scale=1.2)
sns.pairplot(diab_df_selected, hue='Outcome', palette='Set2')

### 4.6. Are there Outliers ? How to Deal with Them?  

To understand the feature distribution we will now consider [Box Plots](https://en.wikipedia.org/wiki/Box_plot) and mainly focus on the outliers. 

In [None]:
f, axes = plt.subplots(2, 3, figsize=(12, 8))

sns.set(font_scale=0.8)
sns.boxplot(x=diab_df_selected['Age'], ax=axes[0][0],)
sns.boxplot(x=diab_df_selected['Glucose_New'], ax=axes[1][0])
sns.boxplot(x=diab_df_selected['Insulin_New1'], ax=axes[1][1]) ### many outliers !!!
sns.boxplot(x=diab_df_selected['BMI_New'], ax=axes[0][1])
sns.boxplot(x=diab_df_selected['BloodPressure_New'], ax=axes[0][2])
sns.boxplot(x=diab_df_selected['Pregnancies'], ax=axes[1][2])  # pass x= explicitly, like the other panels

As we can see, all features except Glucose have outliers. Insulin in particular has lots of them. We can use the dataframe [describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method to get summary statistics for the Insulin feature.  

In [None]:
print (diab_df_selected['Insulin_New1'].describe())

As we can see, the 75th percentile is 190.13 whereas the maximum value is 846.0, so it is no wonder we have tons of outliers. 

#### Keep Only Rows Where Every Z-score Is Within $3\sigma$ 

We keep a row only if every feature lies within $3\sigma$ of its mean. For a normal distribution this covers about 99.7% of the data; any row with a value beyond that is dropped. We apply this to all features.    

In [None]:
diab_df_selected_Zscore = diab_df_selected[(np.abs(stats.zscore(diab_df_selected)) < 3).all(axis=1)]

print ('check new dataframe shape after rejecting outliers: ', diab_df_selected_Zscore.shape) # 50 rows are gone
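The filter above can be followed step by step on a small example. This sketch uses made-up data (`toy_df` is an assumption) and computes the Z-scores by hand with the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Made-up data: 20 ordinary 'Insulin' values plus one extreme value in the last row.
toy_df = pd.DataFrame({
    "Insulin": np.append(np.linspace(48.0, 52.0, 20), 500.0),
    "BMI": np.linspace(20.0, 40.0, 21),
})

# Column-wise Z-scores using the population standard deviation (ddof=0).
z = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

# Keep a row only if all of its features are within 3 standard deviations.
keep = (z.abs() < 3).all(axis=1)
filtered = toy_df[keep]

print("rows before:", len(toy_df), "rows after:", len(filtered))
```

Note that a single extreme value inflates the standard deviation itself, so in very small samples no point can reach a Z-score of 3; the rule only becomes useful with enough rows.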

Let's plot the distributions again to see whether they now have fewer outliers.  

In [None]:
f, axes = plt.subplots(2, 3, figsize=(12, 8))

sns.set(font_scale=0.8)
sns.boxplot(x=diab_df_selected_Zscore['Age'], ax=axes[0][0],)
sns.boxplot(x=diab_df_selected_Zscore['Glucose_New'], ax=axes[1][0])
sns.boxplot(x=diab_df_selected_Zscore['Insulin_New1'], ax=axes[1][1]) 
sns.boxplot(x=diab_df_selected_Zscore['BMI_New'], ax=axes[0][1])
sns.boxplot(x=diab_df_selected_Zscore['BloodPressure_New'], ax=axes[0][2])
sns.boxplot(x=diab_df_selected_Zscore['Pregnancies'], ax=axes[1][2])  # pass x= explicitly, like the other panels

## 5. Testing Machine Learning Algorithms

After feature engineering, we are now ready to prepare our data for testing various ML algorithms.  

### 5.1. Select Features and Labels 

In [None]:
Outcome_arr = diab_df_selected_Zscore['Outcome'].to_numpy()
features_Zscore = diab_df_selected_Zscore.drop(['Outcome'], axis=1)
features_Zscore_arr = features_Zscore.to_numpy()

print ('check shapes for features and outcome: ', features_Zscore_arr.shape, Outcome_arr.shape)

### 5.2. Split the Data in Train and Test Set 

Since the positive and negative samples are not equally distributed, we stratify the split on the outcome so that the train and test sets keep the same class proportions. 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features_Zscore_arr, Outcome_arr, test_size=0.20, random_state=42, shuffle=True, stratify=Outcome_arr)

print ('check shape of training data: ', X_train.shape, y_train.shape)
print ('check shape of test data: ', X_test.shape, y_test.shape)
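To see what `stratify` does, here is a small sketch on made-up labels with a 65/35 class split (the arrays are assumptions, not the diabetes data). With stratification, both the training and the test set keep that proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 65 negatives and 35 positives.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 65 + [1] * 35)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, shuffle=True, stratify=y
)

# Fraction of positives in each split.
print("train positives:", y_tr.mean())
print("test positives :", y_te.mean())
```

Without `stratify`, the share of positives in a small test set can drift noticeably from the overall share just by chance.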

### 5.3. Necessary Import for Machine Learning Part 

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

### 5.4. Create a List of Classifiers  

For the classification task, we would like to check the following classifiers:

* Support Vector Machine. 
* Logistic Regression. 
* Random Forest. 
* AdaBoost (with a decision tree as base classifier). 
* Naive Bayes.  

Apart from the classifier, we would also like to pre-process the features a bit, so we include [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). For tree-based methods standardization has no effect, so we leave it out there. 

P.S: If you are interested in learning in detail how SVM, Logistic Regression and Decision Tree classifiers work, please check my articles. 
1. [Complete Theory of SVM](https://towardsdatascience.com/understanding-support-vector-machine-part-1-lagrange-multipliers-5c24a52ffc5e). 
2. [Understanding Logit of Logistic Regression](https://towardsdatascience.com/logit-of-logistic-regression-understanding-the-fundamentals-f384152a33d1). 
3. [Understanding Decision Tree Classification](https://towardsdatascience.com/understanding-decision-tree-classification-with-scikit-learn-2ddf272731bd). 

In [None]:
pipelines = [ [('scaler', StandardScaler()), ('SVM', SVC())], [('scaler', StandardScaler()), ('LR', LogisticRegression())], 
             [ ('RF', RandomForestClassifier())],  [('ADB', AdaBoostClassifier(DecisionTreeClassifier(max_depth=2)))], 
             [('scaler', StandardScaler()), ('GNB', GaussianNB())]]

### 5.5. Create the Grids for Parameter Search for Each Classifier  

In [None]:
svm_param_grid = {'SVM__C': [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30, 40, 50, 75, 100, 200],  
                  'SVM__kernel': ['linear']}

LR_param_grid = {'LR__C': [0.01, 0.05, 0.1, 0.5, 1., 2., 5., 10.], 'LR__class_weight':['balanced']}

RF_param_grid = {'RF__criterion': ['gini', 'entropy'], 'RF__n_estimators': [30, 50, 75, 100, 125, 150, 200], 'RF__max_depth': [2, 3, 4]}

ADB_param_grid = {'ADB__n_estimators': [20, 40, 50, 75, 100, 200], 'ADB__learning_rate': [0.01, 0.05, 0.1, 0.5, 1., 2]}

GNB_param_grid = {'GNB__priors': [[0.35, 0.65], [0.4, 0.6]], 'GNB__var_smoothing': [1e-9, 1e-8]}

all_param_grid = [svm_param_grid, LR_param_grid, RF_param_grid, ADB_param_grid, GNB_param_grid]
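The keys in these grids follow scikit-learn's `<step name>__<parameter>` convention, where the step name is the label given to the estimator inside the pipeline. A minimal sketch showing how such names resolve on a pipeline built like the SVM one above:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])

# Every tunable parameter of the 'SVM' step is exposed with the 'SVM__' prefix.
svm_keys = sorted(k for k in pipe.get_params() if k.startswith("SVM__"))
print(svm_keys)

# The same names can be used to set values, which is what GridSearchCV does internally.
pipe.set_params(SVM__C=0.5, SVM__kernel="linear")
print(pipe.named_steps["SVM"])
```

A misspelled key such as `svm__C` (wrong case for the step name) would raise an error when the grid search tries to set it, so printing the available keys first is a cheap check.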

### 5.6. Create a Pipeline of Standardization and Classifier 

In [None]:
all_pipelines = []

for p in pipelines:
    all_pipelines.append(Pipeline(p))
print ('check one of the pipelines: ', all_pipelines[4])  

### 5.7. Select the Best Parameter for Each Classifier using GridSearchCV

In [None]:
from tqdm.notebook import tqdm

grid_scores = []
grid_best_params = []

time1 = time.time()

for x in tqdm(range(len(all_pipelines))):
    grid = GridSearchCV(all_pipelines[x], param_grid=all_param_grid[x], cv=5)
    grid.fit(X_train, y_train)
    score = grid.score(X_test, y_test)
    grid_scores.append(score)
    grid_best_params.append(grid.best_params_)
print ('!!!!! out of the loop !!!!!')  
print ('time taken: ', time.time() - time1, 'seconds')
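`GridSearchCV` reports two different numbers that are easy to mix up: `best_score_` is the mean cross-validated score on the training folds, while `grid.score(X_test, y_test)` evaluates the refitted best pipeline on the held-out test set. The sketch below illustrates the difference on synthetic data from `make_classification` (an assumption for illustration; it is not the diabetes data and the scores say nothing about it).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem with 8 features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("LR", LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, param_grid={"LR__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_tr, y_tr)

cv_score = grid.best_score_          # mean accuracy over the 5 validation folds
test_score = grid.score(X_te, y_te)  # accuracy of the refitted best model on the test set

print("best params:", grid.best_params_)
print("cross-validated score:", cv_score)
print("held-out test score  :", test_score)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of each fold, so the validation folds do not leak into the standardization.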

### 5.8. Results 

#### 5.8.1. Check the best hyperparameter for every model.  

Print out the best hyperparameters for every model.   

In [None]:
print ('Below are the Selected Best Hyperparameter for Each Classifier: ')
print ('\n')
for x in range(len(grid_best_params)):
    print (grid_best_params[x])

#### 5.8.2. Score for Each Model 

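Print out and check the score for each classifier. The best one appears to be the SVM with a linear kernel, with a test accuracy of $78.4\%$. Since the tree-based models are not seeded, the exact numbers can vary slightly from run to run.   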

In [None]:
models = ['SVM', 'Logistic Reg', 'Random Forest', 'AdaBoost', 'GaussianNB']
score_dict = dict(zip(models, grid_scores))
print ('check score for each model : \n', score_dict)

### 5.9. Store the Result in Dataframe

We finally create a dataframe to store Model Name and Score. 

In [None]:
score_df = pd.DataFrame(score_dict.items(), columns=['Model', 'Score'])

In [None]:
score_df