# Evaluation of Classification Methods

![](https://healthnavigator.blob.core.windows.net/cache/f/b/e/b/8/f/fbeb8f908aefb57046466d208d4e4444c4ee8f2e.jpg)

## Contents
1. Introduction
2. Data Analysis
3. Preparing Data for Machine Learning
4. Classification Methods
    * Logistic Regression Classification
    * KNN Classification
    * SVM Classification
    * Naive Bayes Classification
    * Decision Tree Classification
    * Random Forest Classification
5. Checking Classification Results with Confusion Matrix
6. Conclusion

## 1. Introduction

### What is Diabetes?

Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high. Blood glucose is your main source of energy and comes from the food you eat. Insulin, a hormone made by the pancreas, helps glucose from food get into your cells to be used for energy. Sometimes your body doesn’t make enough—or any—insulin or doesn’t use insulin well. Glucose then stays in your blood and doesn’t reach your cells.

Over time, having too much glucose in your blood can cause health problems. Although diabetes has no cure, you can take steps to manage your diabetes and stay healthy.

### Purpose

I will examine the Pima Indians Diabetes Database with supervised learning algorithms and then evaluate the classification methods.

### Data


The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

* **Pregnancies**: Number of times pregnant
* **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* **BloodPressure**: Diastolic blood pressure (mm Hg)
* **SkinThickness**: Triceps skin fold thickness (mm)
* **Insulin**: 2-Hour serum insulin (mu U/ml)
* **BMI**: Body mass index (weight in kg/(height in m)^2)
* **DiabetesPedigreeFunction**: Diabetes pedigree function
* **Age**: Age (years)
* **Outcome**: Class variable (0 = non-diabetic, 1 = diabetic)


## 2. Data Analysis

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
data=pd.read_csv("../input/diabetes.csv")

In [None]:
data.sample(5)

In [None]:
data.info()

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x="Outcome", data=data)
plt.title("Diabetes Status",color="black",fontsize=15)

In [None]:
data.Outcome.value_counts()
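
The raw counts are easier to compare as proportions. A minimal sketch, assuming a small hypothetical `outcome` series in place of `data.Outcome` (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for data.Outcome (values are made up for illustration)
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Outcome")

# normalize=True returns class fractions instead of counts
class_share = outcome.value_counts(normalize=True)
print(class_share)

# Ratio of majority to minority class; 1.0 would mean perfectly balanced
imbalance_ratio = class_share.max() / class_share.min()
print("Imbalance ratio:", imbalance_ratio)
```

On the real data the same two calls would apply to `data.Outcome` directly.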

In [None]:
f,ax=plt.subplots(figsize=(15,15))
sns.heatmap(data.corr(),annot=True,linecolor="blue",fmt=".2f",ax=ax)
plt.show()
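
The heatmap shows every pairwise correlation; for classification, the column of correlations with `Outcome` is usually the part of interest. A small sketch on a hypothetical miniature frame (`demo` and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Hypothetical miniature frame standing in for `data` (values are made up)
demo = pd.DataFrame({
    "Glucose": [85, 168, 99, 150, 110, 180],
    "BMI": [26.6, 43.1, 28.1, 35.0, 25.0, 33.0],
    "Age": [31, 33, 21, 50, 29, 45],
    "Outcome": [0, 1, 0, 1, 0, 1],
})

# Correlation of every predictor with the target, strongest first
target_corr = demo.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
print(target_corr)
```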

In [None]:
g = sns.pairplot(data, hue="Outcome",palette="Set2",diag_kind = "kde",kind = "scatter")

## 3. Preparing Data for Machine Learning

In [None]:
x=data.drop(["Outcome"],axis=1)
# Keep the target 1-D: scikit-learn estimators expect y with shape (n_samples,)
y=data.Outcome.values

In [None]:
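#Split first so the test rows stay unseen while the scaler is fitted
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2,random_state=0)

In [None]:
#Scaling: fit on the training set only, then apply the same transform to the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)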
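
Fitting the scaler and the model inside one `Pipeline` is one way to make sure the scaling statistics come from the training rows only. A minimal sketch on synthetic data (the `make_classification` set and the names `X_demo`, `pipe`, and `demo_accuracy` are illustrative assumptions, not part of this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (8 predictors, binary target)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# stratify keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when it transforms the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
demo_accuracy = pipe.score(X_te, y_te)
print("Pipeline accuracy on synthetic data:", demo_accuracy)
```

The `stratify` argument is optional; it is shown here because the class balance matters for this dataset.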

## 4. Classification Methods

### Logistic Regression Classification

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr_prediction= lr.predict(x_test)

lr_cm = confusion_matrix(y_test,lr_prediction)
print("Logistic Regression Accuracy :",lr.score(x_test, y_test))

### KNN Classification

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 16)
knn.fit(x_train,y_train)
knn_prediction= knn.predict(x_test)

knn_cm = confusion_matrix(y_test,knn_prediction)
print("KNN Classification Accuracy :",knn.score(x_test,y_test))

In [None]:
score_list = []
for each in range(1,25):
    knn2 = KNeighborsClassifier(n_neighbors = each)
    knn2.fit(x_train,y_train)
    score_list.append(knn2.score(x_test,y_test))
    
plt.plot(range(1,25),score_list)
plt.xlabel("k values")
plt.ylabel("accuracy")
plt.show()

print("Best accuracy is {} with K = {}".format(np.max(score_list),1+score_list.index(np.max(score_list))))
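
Picking K by its score on `x_test` tends to give an optimistic estimate, because the test set is then used for both tuning and evaluation. One common alternative is to compare K values with cross-validation on the training portion. A rough sketch, assuming synthetic data in place of `x_train` and `y_train` (the names `X_demo`, `cv_means`, and `best_k` are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

k_values = list(range(1, 25))
cv_means = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: mean accuracy over five held-out folds
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means.append(scores.mean())

# argmax returns the first index of the highest mean score
best_k = k_values[int(np.argmax(cv_means))]
print("Best mean CV accuracy is {:.3f} with K = {}".format(max(cv_means), best_k))
```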

### SVM Classification

In [None]:
from sklearn.svm import SVC

svm=SVC(random_state=1)
svm.fit(x_train,y_train)
svm_prediction= svm.predict(x_test)

svm_cm = confusion_matrix(y_test,svm_prediction)
print("Support Vector Classification Accuracy :",svm.score(x_test,y_test))

### Naive Bayes Classification

In [None]:
from sklearn.naive_bayes import GaussianNB

nb=GaussianNB()
nb.fit(x_train,y_train)
nb_prediction= nb.predict(x_test)

nb_cm = confusion_matrix(y_test,nb_prediction)
print("Naive Bayes Classification Accuracy :",nb.score(x_test,y_test))

### Decision Tree Classification

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier(random_state=0) # fixed seed so the reported accuracy is reproducible
dt.fit(x_train,y_train)
dt_prediction= dt.predict(x_test)

dt_cm = confusion_matrix(y_test,dt_prediction)
print("Decision Tree Classification Accuracy :",dt.score(x_test,y_test))

### Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(random_state=0) # fixed seed so the reported accuracy is reproducible
rf.fit(x_train,y_train)
rf_prediction= rf.predict(x_test)

rf_cm = confusion_matrix(y_test,rf_prediction)
print("Random Forest Classification Accuracy :",rf.score(x_test,y_test))

## 5. Checking Classification Results with Confusion Matrix

In [None]:
fig = plt.figure(figsize=(15,15))

ax1 = fig.add_subplot(3, 3, 1) # row, column, position
ax1.set_title('Logistic Regression Classification')

ax2 = fig.add_subplot(3, 3, 2)
ax2.set_title('KNN Classification')

ax3 = fig.add_subplot(3, 3, 3)
ax3.set_title('SVM Classification')

ax4 = fig.add_subplot(3, 3, 4)
ax4.set_title('Naive Bayes Classification')

ax5 = fig.add_subplot(3, 3, 5)
ax5.set_title('Decision Tree Classification')

ax6 = fig.add_subplot(3, 3, 6)
ax6.set_title('Random Forest Classification')


# seaborn's heatmap takes the cell border width as `linewidths`
sns.heatmap(data=lr_cm, annot=True, linewidths=0.5, linecolor='mintcream', fmt='.0f', ax=ax1, cmap='RdGy')
sns.heatmap(data=knn_cm, annot=True, linewidths=0.5, linecolor='mintcream', fmt='.0f', ax=ax2, cmap='RdGy')
sns.heatmap(data=svm_cm, annot=True, linewidths=0.5, linecolor='mintcream', fmt='.0f', ax=ax3, cmap='RdGy')
sns.heatmap(data=nb_cm, annot=True, linewidths=0.5, linecolor='mintcream', fmt='.0f', ax=ax4, cmap='RdGy')
sns.heatmap(data=dt_cm, annot=True, linewidths=0.5, linecolor='mintcream', fmt='.0f', ax=ax5, cmap='RdGy')
sns.heatmap(data=rf_cm, annot=True, linewidths=0.5, linecolor='mintcream', fmt='.0f', ax=ax6, cmap='RdGy')
plt.show()
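
Accuracy alone hides how the errors split between the two classes. As a rough sketch, the four cells of a binary confusion matrix can be turned into precision, recall, and F1; the labels below are hypothetical and only illustrate the arithmetic:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (made up for illustration)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy={:.3f} precision={:.3f} recall={:.3f} f1={:.3f}".format(
    accuracy, precision, recall, f1))
```

The same unpacking would apply to `lr_cm`, `knn_cm`, and the other matrices plotted above.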

## 6. Conclusion
* Our data is not balanced: the two Outcome classes are not equally represented.
* We need more diabetic patient records (Outcome = 1).
* The classification methods reached fairly low accuracy, between 75% and 82%.
* Logistic Regression gave the best accuracy.
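
With unbalanced classes, an accuracy figure is easier to judge next to a majority-class baseline, that is, the score obtained by always predicting the most frequent class. A minimal sketch using scikit-learn's `DummyClassifier`; the 65/35 split below is a hypothetical example, not the measured class balance of this dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: 65 negatives, 35 positives (made-up split)
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

A model is only informative to the extent that it beats this baseline on the same test rows.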