# PIMA INDIAN DIABETES CLASSIFICATION

* Who are 'Pima Indians'?
* What is 'Diabetes'?
* What is 'Classification'?

**Who are 'Pima Indians'?**

> Pima, North American Indians who traditionally lived along the Gila and Salt rivers in Arizona, U.S., in what was the core area of the prehistoric Hohokam culture. The Pima, who speak a Uto-Aztecan language and call themselves the “River People,” are usually considered to be the descendants of the Hohokam.

**What is 'Diabetes'?**

> Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. The most common types of diabetes are type 1, type 2, and gestational diabetes. In type 1 diabetes, the body does not make insulin because the immune system attacks and destroys the cells in the pancreas that make it. People who have type 1 diabetes need to take insulin to stay alive. Type 2 is the most common type of diabetes, in which the body does not make or use insulin well. Gestational diabetes develops in some women when they are pregnant. Most of the time, this type of diabetes goes away after the baby is born.

**What is 'Classification'?**

> Classification is a type of supervised learning. It specifies the class to which data elements belong and is best used when the output has finite and discrete values. A classification problem is one where the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.

***That's all for the background study. Before making predictions, we need a good picture of the dataset, so we first have to become familiar with it.***

First we import the libraries we need and load the dataset to take a first look at it.

In [None]:
import pandas as pd  # data processing
import numpy as np   # linear algebra
import matplotlib.pyplot as plt  #Plotting
import seaborn as sns

In [None]:
data = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
data.head()

Looking at the results, we can see that there are some misleading data points. For example, the SkinThickness variable contains 0 as a value, which is not plausible. The variables 'Pregnancies', 'Insulin' and 'Outcome' can have 0 as a value, but the other variables cannot. Therefore we have to treat those zeros as missing or misleading values.

In [None]:
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
data[['Glucose','BloodPressure','SkinThickness','BMI','DiabetesPedigreeFunction','Age']] = data[['Glucose','BloodPressure','SkinThickness','BMI','DiabetesPedigreeFunction','Age']].replace(0,np.nan)
data.head()
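
The zero-to-missing step can also be sketched in plain Python, without pandas, to make the rule explicit. The rows below are a made-up mini-sample (only the column names come from the dataset), and `mark_missing` and `count_missing` are hypothetical helper names, not part of the notebook.

```python
# Hypothetical mini-sample: column names follow the dataset, values are made up.
sample_rows = [
    {"Pregnancies": 0, "Glucose": 148, "SkinThickness": 35, "BMI": 33.6},
    {"Pregnancies": 1, "Glucose": 0, "SkinThickness": 0, "BMI": 26.6},
    {"Pregnancies": 8, "Glucose": 183, "SkinThickness": 0, "BMI": 0.0},
]

# Columns where a zero cannot be a real measurement.
ZERO_IS_MISSING = ("Glucose", "SkinThickness", "BMI")


def mark_missing(rows, columns):
    """Return copies of the rows with zeros in `columns` replaced by None."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            if new_row[col] == 0:
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned


def count_missing(rows, columns):
    """Count None entries per column."""
    return {col: sum(r[col] is None for r in rows) for col in columns}


cleaned_rows = mark_missing(sample_rows, ZERO_IS_MISSING)
missing_counts = count_missing(cleaned_rows, ZERO_IS_MISSING)
print(missing_counts)
```

Note that `Pregnancies` is left alone, so a legitimate zero there is kept as a zero.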

Now that we have identified the misleading values, we need to know how many of them are present and how they may affect the final prediction.

In [None]:
data_nan = data.isna().sum()
data_nan = pd.DataFrame(data_nan, columns=['NaN count'])
data_nan

In [None]:
data_nan = data_nan.reset_index()
plt.figure(figsize = (12,8))
plot = sns.barplot(x = 'index', y = 'NaN count', data = data_nan, palette = 'rocket')
for p in plot.patches:
    plot.annotate(format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

plt.xticks(fontsize = 12, rotation=40)
plt.xlabel("Variable", fontsize=15)
plt.yticks(fontsize = 12)
plt.ylabel("NaN Count", fontsize=15)
plt.title('NaN Count of variables', fontsize=20)
plt.show()

Here we can see that SkinThickness has the highest number of misleading values. We need to take care of this issue when building prediction models.

In [None]:
data.info()

Now let's start Exploratory Data Analysis (EDA) to get a good idea of the dataset.

1. Univariate Analysis

In [None]:
plt.figure(figsize=(12,8))
plot_outcome = sns.countplot(x = 'Outcome', data = data, palette="husl")
for p in plot_outcome.patches:
    plot_outcome.annotate(format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

plt.title('Count of Outcome', fontsize = 20)
plt.xlabel('Outcome', fontsize = 15)
plt.xticks(np.arange(2), ('No', 'Yes'), fontsize = 15)
plt.yticks(fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
plt_preg = sns.countplot(x = 'Pregnancies', data = data, palette="husl")
for p in plt_preg.patches:
    plt_preg.annotate(format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

plt.title('Count of Number of Pregnancies', fontsize = 20)
plt.xlabel('Number of Pregnancies', fontsize = 15)
plt.xticks(np.arange(18), fontsize = 15)
plt.yticks(fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(data['Glucose'], kde = True, color = 'orange')  # histplot replaces the deprecated distplot
plt.title('Histogram of Glucose', fontsize = 20)
plt.xlabel('Glucose Level', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(data['BloodPressure'], kde = True, color = 'purple')  # histplot replaces the deprecated distplot
plt.title('Histogram of BloodPressure', fontsize = 20)
plt.xlabel('BloodPressure Level', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(data['SkinThickness'], kde = True, color = 'red')  # histplot replaces the deprecated distplot
plt.title('Histogram of SkinThickness', fontsize = 20)
plt.xlabel('SkinThickness Level', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(data['Insulin'], kde = True, color = 'orange')  # histplot replaces the deprecated distplot
plt.title('Histogram of Insulin', fontsize = 20)
plt.xlabel('Insulin Level', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(data['BMI'], kde = True, color = 'blue')  # histplot replaces the deprecated distplot
plt.title('Histogram of BMI', fontsize = 20)
plt.xlabel('BMI Value', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(data['DiabetesPedigreeFunction'], kde = True, color = 'brown')  # histplot replaces the deprecated distplot
plt.title('Histogram of Diabetes Pedigree Function', fontsize = 20)
plt.xlabel('Diabetes Pedigree Function Value', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(data['Age'], kde = True, color = 'black')  # histplot replaces the deprecated distplot
plt.title('Histogram of Age', fontsize = 20)
plt.xlabel('Age in years', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

**Let's briefly discuss the results of the Univariate Analysis.**

1. Outcome : There are 500 non-diabetic (0) patients and 268 diabetic (1) patients, which makes the dataset imbalanced. An imbalanced dataset can lead to misleading predictions, so we need to address this issue as well.
2. Pregnancies : The number of pregnancies the patient has had, ranging from 0 to 17.
3. Glucose : The glucose level of the patient. The data is approximately normally distributed.
4. BloodPressure : The blood pressure level of the patient. The data is approximately normally distributed.
5. SkinThickness : The skin thickness of the patient. The data is approximately normally distributed.
6. Insulin : The insulin level of the patient. The data is skewed because there are many 0 values, which may be due to the presence of type 1 diabetic patients.
7. BMI : The BMI value of the patient. The data is approximately normally distributed.
8. DiabetesPedigreeFunction : A function that scores the likelihood of diabetes based on family history. The data is skewed.
9. Age : The age of the patient, ranging from 21 to 81.
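
To see why the imbalance in item 1 matters, here is a small worked example using only the two class counts reported above. It is a sketch of a trivial "always predict non-diabetic" baseline, not a model from this notebook; the variable names are illustrative.

```python
n_negative = 500  # non-diabetic patients (Outcome = 0)
n_positive = 268  # diabetic patients (Outcome = 1)
total = n_negative + n_positive

# A "model" that always predicts the majority class (non-diabetic).
baseline_accuracy = n_negative / total
# It never predicts class 1, so it finds none of the diabetic patients.
baseline_recall = 0 / n_positive

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline recall for the diabetic class: {baseline_recall:.3f}")
```

Such a baseline is right about 65% of the time while missing every diabetic patient, which is why accuracy alone can be misleading on imbalanced data.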

2. Bivariate Analysis

In [None]:
plt.figure(figsize=(12,8))
df1 = data[data['Outcome'] == 0]
sns.histplot(df1['Age'], kde = True, stat = 'density', label = 'No', color='red')  # histplot replaces the deprecated distplot
df2 = data[data['Outcome'] == 1]
sns.histplot(df2['Age'], kde = True, stat = 'density', label = 'Yes', color='blue')

# Plot formatting
plt.legend(prop={'size': 12})
plt.title('Age Vs. Outcome', fontsize = 20)
plt.xlabel('Age', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
df1 = data[data['Outcome'] == 0]
sns.histplot(df1['DiabetesPedigreeFunction'], kde = True, stat = 'density', label = 'No', color='green')  # histplot replaces the deprecated distplot
df2 = data[data['Outcome'] == 1]
sns.histplot(df2['DiabetesPedigreeFunction'], kde = True, stat = 'density', label = 'Yes', color='orange')

# Plot formatting
plt.legend(prop={'size': 12})
plt.title('Diabetes Pedigree Function Vs. Outcome', fontsize = 20)
plt.xlabel('Diabetes Pedigree Function Value', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
df1 = data[data['Outcome'] == 0]
sns.histplot(df1['BMI'], kde = True, stat = 'density', label = 'No', color='purple')  # histplot replaces the deprecated distplot
df2 = data[data['Outcome'] == 1]
sns.histplot(df2['BMI'], kde = True, stat = 'density', label = 'Yes', color='red')

# Plot formatting
plt.legend(prop={'size': 12})
plt.title('BMI Vs. Outcome', fontsize = 20)
plt.xlabel('BMI Value', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
df1 = data[data['Outcome'] == 0]
sns.histplot(df1['Insulin'], kde = True, stat = 'density', label = 'No', color='black')  # histplot replaces the deprecated distplot
df2 = data[data['Outcome'] == 1]
sns.histplot(df2['Insulin'], kde = True, stat = 'density', label = 'Yes', color='yellow')

# Plot formatting
plt.legend(prop={'size': 12})
plt.title('Insulin Vs. Outcome', fontsize = 20)
plt.xlabel('Insulin Level', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
df1 = data[data['Outcome'] == 0]
sns.histplot(df1['SkinThickness'], kde = True, stat = 'density', label = 'No', color='red')  # histplot replaces the deprecated distplot
df2 = data[data['Outcome'] == 1]
sns.histplot(df2['SkinThickness'], kde = True, stat = 'density', label = 'Yes', color='green')

# Plot formatting
plt.legend(prop={'size': 12})
plt.title('SkinThickness Vs. Outcome', fontsize = 20)
plt.xlabel('SkinThickness', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
df1 = data[data['Outcome'] == 0]
sns.histplot(df1['BloodPressure'], kde = True, stat = 'density', label = 'No', color='crimson')  # histplot replaces the deprecated distplot
df2 = data[data['Outcome'] == 1]
sns.histplot(df2['BloodPressure'], kde = True, stat = 'density', label = 'Yes', color='gold')

# Plot formatting
plt.legend(prop={'size': 12})
plt.title('Blood Pressure level Vs. Outcome', fontsize = 20)
plt.xlabel('Blood Pressure Level', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
df1 = data[data['Outcome'] == 0]
sns.histplot(df1['Glucose'], kde = True, stat = 'density', label = 'No', color='teal')  # histplot replaces the deprecated distplot
df2 = data[data['Outcome'] == 1]
sns.histplot(df2['Glucose'], kde = True, stat = 'density', label = 'Yes', color='peru')

# Plot formatting
plt.legend(prop={'size': 12})
plt.title('Glucose level Vs. Outcome', fontsize = 20)
plt.xlabel('Glucose Level', fontsize = 15)
plt.ylabel('Density', fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
plt_preg = sns.countplot(x = 'Pregnancies', data = data, palette="rocket", hue = 'Outcome')
for p in plt_preg.patches:
    plt_preg.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)

plt.title('Number of Pregnancies Vs. Outcome', fontsize = 20)
plt.xlabel('Number of Pregnancies', fontsize = 15)
plt.xticks(np.arange(18), fontsize = 15)
plt.yticks(fontsize = 15)
plt.ylabel('Count', fontsize = 15)
plt.legend(['No','Yes'], loc = 'upper right', fontsize = 15)
plt.show()

**Let's briefly discuss the results of the Bivariate Analysis.**

1. Age Vs. Outcome : The histograms show that the risk of diabetes is higher for older patients.
2. DiabetesPedigreeFunction Vs. Outcome : The histograms show that as the Diabetes Pedigree Function value increases, there is a considerable risk of being a diabetic patient.
3. BMI Vs. Outcome : The plot shows that patients with a high BMI value tend to have diabetes.
4. Insulin Vs. Outcome : Both histograms are skewed because of the 0 values, and, surprisingly, many of the patients with an insulin level of 0 are non-diabetic.
5. SkinThickness Vs. Outcome : According to the plot, patients with high skin thickness tend to have diabetes.
6. BloodPressure Vs. Outcome : The histograms for diabetic and non-diabetic patients largely overlap, indicating that blood pressure has little association with diabetes.
7. Glucose Vs. Outcome : The histograms show that a patient with a high glucose level is more likely to be diabetic.
8. Pregnancies Vs. Outcome : The bar plot shows that when the number of pregnancies goes above 8, there is a high risk of being a diabetic patient.

3. Correlation

In [None]:
plt.subplots(figsize = (12,8))
sns.set(font_scale = 1.5)
sns.heatmap(data.corr(), annot=True, fmt='.2f')
plt.title('Correlation Plot', fontsize = 20)
plt.show()

Here we can see that the variables 'SkinThickness' and 'BMI' have a considerable amount of correlation, which indicates the presence of multicollinearity. We also have to take care of this issue when building prediction models.
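
As a reminder of what each cell of the heatmap measures, here is a minimal from-scratch sketch of the Pearson correlation coefficient. The two lists are made-up values rather than data from the dataset, `pearson` is a hypothetical helper name, and the sketch assumes there are no missing values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists without missing values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements that rise together.
skin_values = [20.0, 25.0, 30.0, 35.0, 40.0]
bmi_values = [24.0, 27.0, 29.0, 34.0, 36.0]

r_skin_bmi = pearson(skin_values, bmi_values)
print(f"r = {r_skin_bmi:.2f}")
```

A value close to 1 means the two variables carry largely overlapping information, which is the situation described as multicollinearity.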

**That's all for Exploratory Data Analysis. Next is data modelling, but before that we need to solve the issues we identified in EDA.**

* **Imputing misleading/missing data**
* **Addressing imbalanced data**
* **Addressing multicollinearity**

1. **Imputing misleading/missing data**

There are many ways of imputing misleading/missing data; I have used the K-Nearest Neighbour method here.

In [None]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights="uniform")
data_imputed = imputer.fit_transform(data)
data_imputed = pd.DataFrame(data_imputed, columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'])
data_imputed.head()
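
The idea behind K-Nearest Neighbour imputation can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the `KNNImputer` implementation: distances are computed over the coordinates both rows have and rescaled by the share of usable coordinates, and the missing value is filled with the mean of the `k` nearest rows that have that column. The names `nan_distance` and `knn_impute` and the toy rows are hypothetical.

```python
import math


def nan_distance(a, b):
    """Euclidean distance over coordinates present in both rows, rescaled
    by the share of usable coordinates; inf if no coordinate is shared."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(len(a) / len(shared) * squared)


def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest rows.

    Assumes every column has at least one observed value in another row.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [
                (nan_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy rows: the last row is close to the first two and far from the third.
toy_rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [5.0, 6.0, 50.0],
    [1.05, 2.05, None],
]
filled_rows = knn_impute(toy_rows, k=2)
print(filled_rows[3])
```

The missing entry is filled from the two nearby rows (10 and 12), not from the distant one, which is the behaviour that motivates using neighbours instead of a global mean.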

2. Addressing imbalanced data

There are many ways to address the issue of imbalanced data. Here I have used the upsampling method.

In [None]:
data_imputed['Outcome'].value_counts()

In [None]:
from sklearn.utils import resample

df_majority = data_imputed[data_imputed.Outcome==0]
df_minority = data_imputed[data_imputed.Outcome==1]

data_minority_upsampled = resample(df_minority, replace=True, n_samples=500, random_state=123) 
data_upsampled = pd.concat([df_majority, data_minority_upsampled])

data_upsampled['Outcome'].value_counts()
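
Upsampling simply means drawing extra copies of minority-class rows, with replacement, until the classes are the same size. The following plain-Python sketch shows the mechanism on made-up rows. It differs slightly from the `resample` call above: it keeps every original minority row and only draws the additional ones. `upsample_minority` is a hypothetical helper name.

```python
import random


def upsample_minority(rows, label_key, seed=123):
    """Return rows with every class padded up to the size of the largest
    class by sampling extra rows with replacement."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the shortfall with replacement (nothing is drawn if k == 0).
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical rows: five of class 0 and two of class 1.
imbalanced_rows = [{"id": i, "Outcome": 0} for i in range(5)]
imbalanced_rows += [{"id": 100 + i, "Outcome": 1} for i in range(2)]

balanced_rows = upsample_minority(imbalanced_rows, "Outcome")
class_counts = {}
for row in balanced_rows:
    class_counts[row["Outcome"]] = class_counts.get(row["Outcome"], 0) + 1
print(class_counts)
```

Because the extra rows are exact copies, upsampling adds no new information; it only changes how much weight the minority class receives during training.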

3. Addressing multicollinearity

To address the issue of multicollinearity, we can use regularized models, such as ridge (L2) and lasso (L1) logistic regression, instead of standard logistic regression.
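
As a sketch of what the two penalties do, the objectives can be written as below. This follows one common formulation, with labels $y_i \in \{-1, 1\}$, weight vector $w$, intercept $c$ and inverse regularization strength $C$; these symbols are introduced here for illustration and do not appear elsewhere in the notebook.

```latex
\text{Ridge (L2):}\quad
\min_{w,\,c}\; \frac{1}{2}\, w^{\top} w
  + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \left(x_i^{\top} w + c\right)}\right)

\text{Lasso (L1):}\quad
\min_{w,\,c}\; \lVert w \rVert_{1}
  + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \left(x_i^{\top} w + c\right)}\right)
```

A smaller $C$ means a stronger penalty. The L2 penalty shrinks the coefficients of correlated variables towards each other, while the L1 penalty can set some coefficients exactly to zero, which is why both are used against multicollinearity.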

**Now that we have addressed all the issues, let's start data modelling. Here I build 5 different prediction models and compare them using accuracy and the Area Under the Curve (AUC) value.**

* **Logistic Ridge Classification**
* **Logistic Lasso Classification**
* **Neural Network**
* **K-Nearest Neighbor Classification**
* **Random Forest**

Before we start building models, we need to scale the data and define the explanatory and response variables.

In [None]:
x = data_upsampled.drop(columns = ['Outcome'])  # 'columns' already implies axis=1
y = data_upsampled.Outcome

* Scale Data : Differences in scale between the explanatory variables can affect the prediction results.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x = scaler.fit_transform(x)  # reuse the scaler object created above
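
Standard scaling replaces each value by its z-score: subtract the column mean and divide by the column standard deviation. A minimal sketch for a single column, using made-up glucose values and the population standard deviation (the convention `StandardScaler` uses), looks like this. `standardize` is a hypothetical helper name.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, i.e. ddof = 0
    return [(v - mean) / std for v in values]


# Hypothetical glucose readings.
glucose_values = [85.0, 100.0, 115.0, 130.0, 170.0]
glucose_scaled = standardize(glucose_values)
print([round(z, 2) for z in glucose_scaled])
```

After scaling, every column has mean 0 and standard deviation 1, so distance-based and penalized models treat the variables on an equal footing.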

Then we have to split the dataset into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=0)
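
One point worth keeping in mind: when upsampling happens before the split, as in this notebook, copies of the same row can end up in both the training and the test set, which tends to make test scores look better than they are. A possible alternative ordering, sketched in plain Python on made-up rows, is to hold out the test set first and upsample only the training part. `split_then_upsample` is a hypothetical helper, and the sketch assumes label 1 is the minority class.

```python
import random


def split_then_upsample(rows, label_key, test_fraction=0.2, seed=0):
    """Hold out a test set first, then upsample class 1 in the training part only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Assumes label 1 is the minority class and occurs at least once in `train`.
    minority = [r for r in train if r[label_key] == 1]
    majority = [r for r in train if r[label_key] == 0]
    extra = rng.choices(minority, k=max(0, len(majority) - len(minority)))
    return train + extra, test


# Hypothetical rows: 20 non-diabetic and 10 diabetic records with unique ids.
records = [{"id": i, "Outcome": 0} for i in range(20)]
records += [{"id": 100 + i, "Outcome": 1} for i in range(10)]

train_rows, test_rows = split_then_upsample(records, "Outcome")
train_ids = {r["id"] for r in train_rows}
test_ids = {r["id"] for r in test_rows}
print(len(test_rows), train_ids.isdisjoint(test_ids))
```

With this ordering no test record has a duplicate in the training data, and the test set keeps the original class proportions of whatever rows were held out.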

In [None]:
result_table = pd.DataFrame(columns=['classifiers', 'fpr','tpr','auc'])
accuracy = pd.DataFrame(columns=['classifiers', 'accuracy','auc'])

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
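
The two comparison metrics can be sketched from scratch to show what they measure. Accuracy is the share of correct hard predictions. AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so it is meant to be computed from scores or probabilities, not from 0/1 predictions. The function names and the toy values below are illustrative only.

```python
def accuracy_from_scratch(true_labels, predicted_labels):
    """Share of predictions that match the true labels."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def auc_from_scratch(true_labels, scores):
    """Probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(true_labels, scores) if t == 1]
    neg = [s for t, s in zip(true_labels, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: the scores rank both positives above both negatives,
# but every score is below the 0.5 cut-off.
toy_truth = [0, 0, 1, 1]
toy_scores = [0.1, 0.3, 0.4, 0.45]
toy_hard = [1 if s >= 0.5 else 0 for s in toy_scores]

print(auc_from_scratch(toy_truth, toy_scores))    # ranking quality of the scores
print(auc_from_scratch(toy_truth, toy_hard))      # same metric on hard labels
print(accuracy_from_scratch(toy_truth, toy_hard))
```

Here the scores separate the classes perfectly, yet the hard labels do not, which is why AUC values are only comparable between models when all of them are computed from predicted probabilities.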

1. **Logistic Ridge Classification**

In [None]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

In [None]:
alphas = np.linspace(1,10,100)
ridgeClassifiercv = LogisticRegressionCV(penalty = 'l2', Cs = 1/alphas, solver = 'liblinear')
ridgeClassifiercv.fit(x_train, y_train)
ridgeClassifiercv.C_  #Inverse of best alpha value

In [None]:
LR = LogisticRegression(penalty = 'l2', C = ridgeClassifiercv.C_[0], solver = 'liblinear')
LR.fit(x_train, y_train)
y_predLR = LR.predict(x_test)
accLR = accuracy_score(y_test, y_predLR)

In [None]:
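proba_LR = LR.predict_proba(x_test)[:,1]  # predicted probability of the positive class
clf_roc_auc = roc_auc_score(y_test, proba_LR)  # AUC from probabilities, so it matches the plotted curve
fpr, tpr, thresholds = roc_curve(y_test, proba_LR)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
result_table = pd.concat([result_table, pd.DataFrame([{'classifiers':'Logistic Ridge', 'fpr':fpr, 'tpr':tpr, 'auc':clf_roc_auc}])], ignore_index=True)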
plt.figure(figsize = (12,8))
plt.plot(fpr, tpr, label='Logistic Ridge (area = %0.2f)' % clf_roc_auc, lw = 2)
plt.plot([0, 1], [0, 1],'r--', lw = 2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize = 15)
plt.ylabel('True Positive Rate', fontsize = 15)
plt.title('Receiver operating characteristic curve for Logistic Ridge Classification', fontsize = 20)
plt.legend(loc="lower right", fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
accuracy = pd.concat([accuracy, pd.DataFrame([{'classifiers':'Logistic Ridge', 'accuracy':accLR, 'auc':clf_roc_auc}])], ignore_index=True)

2. Logistic Lasso Classification

In [None]:
lassoClassifiercv = LogisticRegressionCV(penalty = 'l1', Cs = 1/alphas, solver = 'liblinear')
lassoClassifiercv.fit(x_train, y_train)
lassoClassifiercv.C_

In [None]:
LL = LogisticRegression(penalty = 'l1', C = lassoClassifiercv.C_[0], solver = 'liblinear')
LL.fit(x_train, y_train)
y_predLL = LL.predict(x_test)
accLL = accuracy_score(y_test, y_predLL)

In [None]:
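proba_LL = LL.predict_proba(x_test)[:,1]  # predicted probability of the positive class
clf_roc_auc = roc_auc_score(y_test, proba_LL)  # AUC from probabilities, so it matches the plotted curve
fpr, tpr, thresholds = roc_curve(y_test, proba_LL)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
result_table = pd.concat([result_table, pd.DataFrame([{'classifiers':'Logistic Lasso', 'fpr':fpr, 'tpr':tpr, 'auc':clf_roc_auc}])], ignore_index=True)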
plt.figure(figsize = (12,8))
plt.plot(fpr, tpr, label='Logistic Lasso (area = %0.2f)' % clf_roc_auc, lw = 2)
plt.plot([0, 1], [0, 1],'r--', lw = 2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize = 15)
plt.ylabel('True Positive Rate', fontsize = 15)
plt.title('Receiver operating characteristic curve for Logistic Lasso Classification', fontsize = 20)
plt.legend(loc="lower right", fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
accuracy = pd.concat([accuracy, pd.DataFrame([{'classifiers':'Logistic Lasso', 'accuracy':accLL, 'auc':clf_roc_auc}])], ignore_index=True)

3. Neural Network

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

In [None]:
np.random.seed(123)
tf.random.set_seed(123)  # Keras weight initialisation uses TensorFlow's own random generator
model = Sequential()
model.add(Dense(8, activation = 'tanh', input_dim = 8))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(256, activation = 'relu'))
model.add(Dense(512, activation = 'relu'))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
model.fit(x_train, y_train, batch_size = 10, epochs = 10, verbose = 1)

In [None]:
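# predict_classes was removed from Keras; threshold the sigmoid output at 0.5 instead
y_pred = (model.predict(x_test) > 0.5).astype(int).ravel()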
accNN = accuracy_score(y_test, y_pred)

In [None]:
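proba_NN = model.predict(x_test).ravel()  # sigmoid output = probability of the positive class (predict_proba was removed from Keras)
clf_roc_auc = roc_auc_score(y_test, proba_NN)
fpr, tpr, thresholds = roc_curve(y_test, proba_NN)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
result_table = pd.concat([result_table, pd.DataFrame([{'classifiers':'Neural Network', 'fpr':fpr, 'tpr':tpr, 'auc':clf_roc_auc}])], ignore_index=True)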
plt.figure(figsize = (12,8))
plt.plot(fpr, tpr, label='Neural Network (area = %0.2f)' % clf_roc_auc, lw = 2)
plt.plot([0, 1], [0, 1],'r--', lw = 2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize = 15)
plt.ylabel('True Positive Rate', fontsize = 15)
plt.title('Receiver operating characteristic curve for Neural Network', fontsize = 20)
plt.legend(loc="lower right", fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
accuracy = pd.concat([accuracy, pd.DataFrame([{'classifiers':'Neural Network', 'accuracy':accNN, 'auc':clf_roc_auc}])], ignore_index=True)

4. K-Nearest Neighbor Classification

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
error_rate = []
k = range(1,30)
for i in k:
  knn = KNeighborsClassifier(n_neighbors=i)
  knn.fit(x_train, y_train)
  pred_i = knn.predict(x_test)
  error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(12,8))
plt.plot(k,error_rate, color='black', linestyle='dashed', marker='o', markerfacecolor='pink')
plt.title('Error Rate vs. K Value', fontsize = 20)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlabel('K', fontsize = 15)
plt.ylabel('Error Rate', fontsize = 15)
plt.show()

The graph above can be used to decide the best K value. Here we can see that k=1 is the best for this dataset.

In [None]:
KNN = KNeighborsClassifier(n_neighbors=1, p=2, metric='euclidean')
KNN.fit(x_train,y_train)
y_predKNN = KNN.predict(x_test)
accKNN = accuracy_score(y_test, y_predKNN)
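
For reference, the prediction rule of a K-Nearest Neighbour classifier fits in a few lines: find the `k` training points closest to the query and take a majority vote over their labels. The sketch below uses made-up two-dimensional points and a hypothetical `knn_predict` helper; it is not the scikit-learn implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=1):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Two hypothetical clusters: class 0 near the origin, class 1 near (5, 5).
toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(toy_points, toy_labels, (0.5, 0.5), k=3)
near_five = knn_predict(toy_points, toy_labels, (5.2, 5.1), k=1)
print(near_origin, near_five)
```

With k=1 the single closest training point decides the prediction, so an exact copy of a test row in the training data (as upsampling can produce) would be classified by its own duplicate.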

In [None]:
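proba_KNN = KNN.predict_proba(x_test)[:,1]  # share of neighbours in the positive class
clf_roc_auc = roc_auc_score(y_test, proba_KNN)  # AUC from the same scores as the plotted curve
fpr, tpr, thresholds = roc_curve(y_test, proba_KNN)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
result_table = pd.concat([result_table, pd.DataFrame([{'classifiers':'KNN', 'fpr':fpr, 'tpr':tpr, 'auc':clf_roc_auc}])], ignore_index=True)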
plt.figure(figsize = (12,8))
plt.plot(fpr, tpr, label='K Nearest Neighbour (area = %0.2f)' % clf_roc_auc, lw = 2)
plt.plot([0, 1], [0, 1],'r--', lw = 2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize = 15)
plt.ylabel('True Positive Rate', fontsize = 15)
plt.title('Receiver operating characteristic curve for K Nearest Neighbour Classification', fontsize = 20)
plt.legend(loc="lower right", fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
accuracy = pd.concat([accuracy, pd.DataFrame([{'classifiers':'KNN', 'accuracy':accKNN, 'auc':clf_roc_auc}])], ignore_index=True)

5. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
error_rate_RF = []
n = range(1,20)
for i in n:
  RFC = RandomForestClassifier(n_estimators=i)
  RFC.fit(x_train, y_train)
  pred_i = RFC.predict(x_test)
  error_rate_RF.append(np.mean(pred_i != y_test))

plt.figure(figsize=(12,8))
plt.plot(n,error_rate_RF, color='black', linestyle='dashed', marker='o', markerfacecolor='maroon')
plt.title('Error Rate vs. Number of estimators(Trees)', fontsize = 20)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlabel('Number of estimators(Trees)', fontsize = 15)
plt.ylabel('Error Rate', fontsize = 15)
plt.show()

According to the graph above, the optimal number of trees is 18.

In [None]:
RFC = RandomForestClassifier(n_estimators=18)
RFC.fit(x_train, y_train)
y_predRFC = RFC.predict(x_test)
accRF = accuracy_score(y_test, y_predRFC)

In [None]:
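proba_RFC = RFC.predict_proba(x_test)[:,1]  # predicted probability of the positive class
clf_roc_auc = roc_auc_score(y_test, proba_RFC)  # AUC from probabilities, so it matches the plotted curve
fpr, tpr, thresholds = roc_curve(y_test, proba_RFC)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
result_table = pd.concat([result_table, pd.DataFrame([{'classifiers':'Random Forest', 'fpr':fpr, 'tpr':tpr, 'auc':clf_roc_auc}])], ignore_index=True)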
plt.figure(figsize = (12,8))
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % clf_roc_auc, lw = 2)
plt.plot([0, 1], [0, 1],'r--', lw = 2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize = 15)
plt.ylabel('True Positive Rate', fontsize = 15)
plt.title('Receiver operating characteristic curve for Random Forest Classification', fontsize = 20)
plt.legend(loc="lower right", fontsize = 15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15)
plt.show()

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
accuracy = pd.concat([accuracy, pd.DataFrame([{'classifiers':'Random Forest', 'accuracy':accRF, 'auc':clf_roc_auc}])], ignore_index=True)

In [None]:
result_table.set_index('classifiers', inplace=True)
accuracy.set_index('classifiers', inplace=True)

In [None]:
fig = plt.figure(figsize=(12,8))

for i in result_table.index:
    plt.plot(result_table.loc[i]['fpr'], 
             result_table.loc[i]['tpr'], 
             label="{}, AUC={:.3f}".format(i, result_table.loc[i]['auc']))
    
plt.plot([0,1], [0,1], color='black', linestyle='--')

plt.xticks(np.arange(0.0, 1.1, step=0.1), fontsize = 12)
plt.xlabel("False Positive Rate", fontsize=15)

plt.yticks(np.arange(0.0, 1.1, step=0.1), fontsize = 12)
plt.ylabel("True Positive Rate", fontsize=15)

plt.title('ROC Curve Analysis', fontsize=20)
plt.legend(prop={'size':13}, loc='lower right')
plt.show()

In [None]:
accuracy

The table above summarises the final results. Both accuracy and the AUC value can be used to identify the best model. By accuracy, **Random Forest** is the best model; by AUC, **Neural Network** is the best. Either of these two models is a reasonable choice.

***That's all.***

# Thank You. 