This notebook is a guide to Hyperparameter Tuning.
In machine learning, hyperparameter optimization (or tuning) is the problem of choosing an optimal set of hyperparameters for a learning algorithm.
A hyperparameter is a parameter whose value is set before training and controls the learning process, unlike model parameters, which are learned from the data.
Please upvote if you like this kernel.
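
To make the difference between a hyperparameter and a learned parameter concrete, here is a minimal sketch on synthetic data. The dataset and the names `X_demo`, `y_demo` and `shallow_tree` are illustrative assumptions and are not part of the analysis in this notebook. `max_depth` is fixed by us before training, while the split structure of the tree is found by `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameter: chosen before training
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
print(shallow_tree.get_params()['max_depth'])

# Learned parameters: the split structure found during fit
shallow_tree.fit(X_demo, y_demo)
print(shallow_tree.tree_.node_count, shallow_tree.get_depth())
```

Changing `max_depth` changes how large a tree `fit` is allowed to learn, which is why such values are worth tuning.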


In [None]:

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(color_codes=True)
import itertools
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



In [None]:
diabetes = pd.read_csv("/kaggle/input/diabetes.csv")
print(diabetes.columns)


In [None]:
diabetes.shape

In [None]:
diabetes.head()

In [None]:
diabetes.info()

In [None]:
diabetes['Outcome'].value_counts()

In [None]:
print(diabetes.isnull().values.any())

In [None]:
0 in diabetes.values


Count the zero values in each column of the diabetes dataset

In [None]:
print("# rows in dataframe {0}".format(len(diabetes)))
print("Zero in Pregnancies : {0}".format(len(diabetes.loc[diabetes['Pregnancies'] == 0])))
print("Zero in Glucose : {0}".format(len(diabetes.loc[diabetes['Glucose'] == 0])))
print("Zero in BloodPressure: {0}".format(len(diabetes.loc[diabetes['BloodPressure'] == 0])))
print("Zero in SkinThickness : {0}".format(len(diabetes.loc[diabetes['SkinThickness'] == 0])))
print("Zero in Insulin  : {0}".format(len(diabetes.loc[diabetes['Insulin'] == 0])))
print("Zero in BMI : {0}".format(len(diabetes.loc[diabetes['BMI'] == 0])))
print("Zero in DiabetesPedigreeFunction  : {0}".format(len(diabetes.loc[diabetes['DiabetesPedigreeFunction'] == 0])))
print("Zero in Age: {0}".format(len(diabetes.loc[diabetes['Age'] == 0])))

In [None]:
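# Replace zeros with NaN in the columns where zero is not a valid measurement
cols_with_invalid_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes[cols_with_invalid_zeros] = diabetes[cols_with_invalid_zeros].replace(0, np.nan)
R_d = diabetes[cols_with_invalid_zeros]
R_d.head()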

In [None]:
R_d.isnull().sum()


Handle the missing values by replacing NaN with the column mean


In [None]:
pd.options.display.float_format ='{:,.2f}'.format

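# Assign the result back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and may silently ignore
diabetes['Glucose'] = diabetes['Glucose'].fillna(diabetes['Glucose'].mean())
diabetes['BloodPressure'] = diabetes['BloodPressure'].fillna(diabetes['BloodPressure'].mean())
diabetes['SkinThickness'] = diabetes['SkinThickness'].fillna(diabetes['SkinThickness'].mean())
diabetes['Insulin'] = diabetes['Insulin'].fillna(diabetes['Insulin'].mean())
diabetes['BMI'] = diabetes['BMI'].fillna(diabetes['BMI'].mean())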
diabetes.head()
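
The zero-to-NaN-to-mean steps above can be traced by hand on a toy column. This is a minimal sketch with made-up values. The frame `toy` and the variable `col_mean` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Toy column; the zeros stand in for missing measurements
toy = pd.DataFrame({'Glucose': [100.0, 0.0, 140.0, 0.0, 120.0]})

# Step 1: mark zeros as missing
toy['Glucose'] = toy['Glucose'].replace(0, np.nan)

# Step 2: mean() skips NaN, so it is (100 + 140 + 120) / 3 = 120
col_mean = toy['Glucose'].mean()

# Step 3: fill the gaps with that mean
toy['Glucose'] = toy['Glucose'].fillna(col_mean)
print(toy['Glucose'].tolist())
```

Had the zeros been left in place, the column mean would have been 360 / 5 = 72, which is why the zeros are converted to NaN before the mean is taken.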

# Data Visualization

Analyse the Outcome column to get the number of diabetic and healthy people


In [None]:
diabetes['Outcome'].value_counts()

Outcome 0 means non-diabetic and Outcome 1 means diabetic,
so the data is skewed towards non-diabetic people.

In [None]:
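fig1, ax1 = plt.subplots(1,2,figsize=(8,8))

# Left panel: class counts (recent seaborn versions need the x= keyword)
sns.countplot(x=diabetes['Outcome'],ax=ax1[0])

labels = 'Healthy', 'Diabetic'

# Right panel: class shares, drawn explicitly on the second axes
diabetes.Outcome.value_counts().plot.pie(labels=labels, autopct='%1.1f%%',shadow=True, startangle=90, ax=ax1[1])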

The count plot shows the number of rows for each outcome.
The pie plot shows that 65.1% of people are healthy and 34.9% are diabetic.

A distribution plot (a histogram with a density curve) flexibly shows the univariate distribution of a set of observations.


In [None]:
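fig, ax = plt.subplots(4,2, figsize=(16,16))
# sns.distplot is deprecated in recent seaborn; histplot with kde=True draws an equivalent histogram plus density curve
sns.histplot(diabetes.Age, bins=20, kde=True, stat='density', ax=ax[0,0])
sns.histplot(diabetes.Pregnancies, bins=20, kde=True, stat='density', ax=ax[0,1])
sns.histplot(diabetes.Glucose, bins=20, kde=True, stat='density', ax=ax[1,0])
sns.histplot(diabetes.BloodPressure, bins=20, kde=True, stat='density', ax=ax[1,1])
sns.histplot(diabetes.SkinThickness, bins=20, kde=True, stat='density', ax=ax[2,0])
sns.histplot(diabetes.Insulin, bins=20, kde=True, stat='density', ax=ax[2,1])
sns.histplot(diabetes.DiabetesPedigreeFunction, bins=20, kde=True, stat='density', ax=ax[3,0])
sns.histplot(diabetes.BMI, bins=20, kde=True, stat='density', ax=ax[3,1])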

## Pair Plots

A pair plot shows the pairwise relationships between features, which helps identify the features that best explain a relationship between two variables or that form the most clearly separated clusters. Where the classes separate cleanly in a panel, a simple linear boundary may already classify them reasonably well.

In [None]:
sns.pairplot(diabetes,hue='Outcome', diag_kind='kde')


# Correlation between features

Variables within a dataset can be related for lots of reasons. It can be useful in data analysis and modeling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation. 

A correlation can be positive, meaning both variables move in the same direction, or negative, meaning that when one variable's value increases, the other's decreases. Correlation can also be neutral, or zero, meaning that the variables have no linear relationship.
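
The three cases can be reproduced with a few made-up arrays. This is a small sketch using NumPy's Pearson correlation. The arrays and the names `r_pos`, `r_neg` and `r_zero` are illustrative and do not come from the diabetes data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Moves with x: perfect positive correlation
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# Moves against x: perfect negative correlation
r_neg = np.corrcoef(x, -3 * x)[0, 1]

# Symmetric around the middle of x: no linear relationship
r_zero = np.corrcoef(x, (x - 3) ** 2)[0, 1]

print(r_pos, r_neg, r_zero)
```

In the last case the second array is fully determined by `x`, yet the coefficient is zero. A zero Pearson correlation rules out a linear relationship only, not every kind of dependence.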

In [None]:
pd.options.display.float_format ='{:,.3f}'.format

correlation=diabetes.corr()
correlation

Correlation plot (heatmap)


In [None]:
sns.set(font_scale=1.15)
plt.figure(figsize=(12, 8))

ax =sns.heatmap(correlation, linewidths=0.01,
            annot=True,square=True,cmap='BuPu',linecolor="black")

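# Known workaround for matplotlib 3.1.1, which cut the top and bottom heatmap rows in half;
# other matplotlib versions do not need these two lines
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

plt.title('Correlation between features')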

plt.show()

Observations:
The correlation plot shows the pairwise correlations between the features and the Outcome.

Glucose, Age, BMI and Pregnancies are the features most correlated with the Outcome. Insulin and DiabetesPedigreeFunction have little correlation with the Outcome, and BloodPressure and SkinThickness have even less. There is also some correlation between Age and Pregnancies, Insulin and SkinThickness, BMI and SkinThickness, and Insulin and Glucose.

# Feature selection

In [None]:
Feature = diabetes[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']]

In [None]:
X= Feature
X[0:5]

In [None]:
y = diabetes['Outcome'].values
y[0:5]

# Normalize Data 

In [None]:
from sklearn import preprocessing

# Standardize each feature to zero mean and unit variance
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
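
For reference, the scaling step above amounts to a per-column z-score. The sketch below checks this on a tiny made-up array. The names `raw`, `scaled` and `manual` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only; the two columns have very different scales
raw = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

scaled = StandardScaler().fit_transform(raw)

# Manual z-score using the population standard deviation (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Because the scaler here is fit on the full dataset before the split, statistics from the test rows influence the training features. A stricter variant, not used in this notebook, would fit the scaler on the training split only and then apply it to the test split.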

Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=42)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Check that the data was split into the desired 70% train and 30% test sets

In [None]:
trainval = (1.0 * len(X_train)) / (1.0 * len(diabetes.index))
testval = (1.0 * len(X_test)) / (1.0 * len(diabetes.index))
print("{0:0.2f}% in training set".format(trainval * 100))
print("{0:0.2f}% in test set".format(testval * 100))
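
Since the Outcome classes are imbalanced, a plain random split can leave the train and test sets with somewhat different class ratios. One optional variation, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels with roughly the same imbalance. All names in it (`X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`) are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 65 negatives and 35 positives
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Share of positives in each split; both stay close to 0.35
print(y_tr.mean(), y_te.mean())
```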

# Predictive Modeling with Hyperparameter Tuning

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
diabetesTree = DecisionTreeClassifier()
diabetesTree

In [None]:
diabetesTree.fit(X_train,y_train)

In [None]:
y_predict = diabetesTree.predict(X_test)

In [None]:
from sklearn import metrics
print("Decision Tree's Accuracy on Train set: ", metrics.accuracy_score(y_train, diabetesTree.predict(X_train)))
print("Decision Tree's Accuracy on Test set : ", metrics.accuracy_score(y_test, y_predict))

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Plot the confusion matrix
sns.set(font_scale=1.5)
cm = confusion_matrix(y_test, y_predict)
ax = sns.heatmap(cm, annot=True, cmap='BuPu', fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.xlabel("Predicted label")
plt.ylabel("True label")

plt.show()

In [None]:
print (classification_report(y_test,y_predict))

In [None]:
from sklearn.model_selection import GridSearchCV


In [None]:
parameters = {'max_depth': (2,4,6,8,10),
             'criterion': ('gini','entropy'),
             'min_samples_leaf' : (1,2,3,4,5),
             'max_leaf_nodes' : (3,4,5,6,7,8,9,10)
        
             }

In [None]:
gridsearch_tree = GridSearchCV(estimator = diabetesTree,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                        n_jobs = -1
                          )
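
A grid search tries every combination in the grid, so it helps to know how much work that is before fitting. The sketch below counts the candidates with `ParameterGrid`. The dictionary `parameters_demo` repeats the grid defined above, and `n_folds` is assumed to match the `cv` value passed to `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated here so the sketch is self-contained
parameters_demo = {
    'max_depth': [2, 4, 6, 8, 10],
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'max_leaf_nodes': [3, 4, 5, 6, 7, 8, 9, 10],
}

# Number of parameter combinations: 5 * 2 * 5 * 8
n_candidates = len(ParameterGrid(parameters_demo))

# Assumed to match cv=10 in the GridSearchCV call
n_folds = 10

print(n_candidates, n_candidates * n_folds)
```

That is 400 candidate trees and 4,000 cross-validation fits, plus one final refit of the best candidate on the full training set. Small trees fit quickly, so this stays manageable, but the count grows multiplicatively with every value added to the grid.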

In [None]:
gridsearch_tree.fit(X_train, y_train)


In [None]:
print("Tuned decision tree parameters: {}".format(gridsearch_tree.best_params_))
print("Best score: {}".format(gridsearch_tree.best_score_))

In [None]:
print("best estimator: {}" .format(gridsearch_tree.best_estimator_))
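
The tuned model does not have to be retyped from the printed output. With the default `refit=True`, `GridSearchCV` retrains the best candidate on the whole training set and exposes it as `best_estimator_`. The sketch below shows this on synthetic data. The dataset, the reduced grid and the names `search`, `best_tree` and `rebuilt` are illustrative assumptions, not the notebook's own objects.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4], 'min_samples_leaf': [1, 4]},
    scoring='accuracy',
    cv=5,
)
search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already trained
# on all of X_demo with the winning parameters
best_tree = search.best_estimator_
print(search.best_params_)
print(best_tree.predict(X_demo[:5]))

# Alternative: build a fresh, unfitted model from the winning parameters
rebuilt = DecisionTreeClassifier(random_state=0, **search.best_params_)
```

Either route avoids copying values by hand, which keeps the model in step with the search if the grid or the data changes.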

In [None]:
# Rebuild the tree with the tuned values reported by the grid search.
# Arguments removed from recent scikit-learn (min_impurity_split, presort) and the
# invalid min_weight_fraction_leaf='deprecated' are dropped; the rest stay at defaults.
diabetesTree = DecisionTreeClassifier(criterion='gini', max_depth=4,
            max_leaf_nodes=9, min_samples_leaf=4)

In [None]:
diabetesTree.fit(X_train,y_train)

In [None]:
y_predict = diabetesTree.predict(X_test)

In [None]:
from sklearn import metrics
print("Tuned Decision Tree's Accuracy on Train set: ", metrics.accuracy_score(y_train, diabetesTree.predict(X_train)))
print("Tuned Decision Tree's Accuracy on Test set : ", metrics.accuracy_score(y_test, y_predict))

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Plot the confusion matrix
sns.set(font_scale=1.5)
cm = confusion_matrix(y_test, y_predict)
ax = sns.heatmap(cm, annot=True, cmap='BuPu', fmt='g')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.xlabel("Predicted label")
plt.ylabel("True label")

plt.show()

In [None]:
print (classification_report(y_test,y_predict))