# PIMA Diabetes Classification

Diabetes mellitus (DM), commonly known as diabetes, is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period of time. Symptoms often include frequent urination, increased thirst, and increased appetite. If left untreated, diabetes can cause many complications. Acute complications can include diabetic ketoacidosis, hyperosmolar hyperglycemic state, or death. Serious long-term complications include cardiovascular disease, stroke, chronic kidney disease, foot ulcers, damage to the nerves, damage to the eyes and cognitive impairment.

Diabetes is due to either the pancreas not producing enough insulin, or the cells of the body not responding properly to the insulin produced. There are three main types of diabetes mellitus:

* Type 1 diabetes results from the pancreas's failure to produce enough insulin due to loss of beta cells. This form was previously referred to as "insulin-dependent diabetes mellitus" (IDDM) or "juvenile diabetes". The loss of beta cells is caused by an autoimmune response. The cause of this autoimmune response is unknown.
* Type 2 diabetes begins with insulin resistance, a condition in which cells fail to respond to insulin properly. As the disease progresses, a lack of insulin may also develop. This form was previously referred to as "non insulin-dependent diabetes mellitus" (NIDDM) or "adult-onset diabetes". The most common cause is a combination of excessive body weight and insufficient exercise.
* Gestational diabetes is the third main form, and occurs when pregnant women without a previous history of diabetes develop high blood sugar levels.

## Here we'll be performing:
* Exploratory data analysis
* Data Visualization
* Feature Engineering
* Classification

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

# Reading the data

In [None]:
df=pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

# Checking for null values

In [None]:
sns.heatmap(df.isnull(),cmap='viridis')

The heatmap shows no explicit null values, so we are good to go for now.

# Exploratory data analysis

## Outcome

In [None]:
df['Outcome'].value_counts()

In [None]:
# value_counts() sorts by frequency, so the majority class (Outcome 0, non-diabetic) comes first
labels=['False','True']
explode=[0.03,0.03]
color=['pink','lightgreen']

In [None]:
f,ax = plt.subplots(1,2,figsize = (15, 7))
_=df.Outcome.value_counts().plot.bar(ax=ax[0],cmap='viridis')
_=df.Outcome.value_counts().plot.pie(ax=ax[1],labels=labels,autopct='%.2f%%',colors=color,explode=explode)

# Looking for Relationships

# Which attributes heavily affect the outcome?

In [None]:
sns.pairplot(df,hue='Outcome')

That's a bit hard to interpret.
Let's try another approach.

# Plotting a heatmap

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(),annot=True,cmap='plasma',linecolor='black',linewidths=0.01)

# Distribution Plot

In [None]:
fig, ax = plt.subplots(4,2,figsize=(16,16))
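# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(df.Age,bins=20,kde=True,stat='density',ax=ax[0,0])
sns.histplot(df.Pregnancies,bins=20,kde=True,stat='density',ax=ax[0,1])
sns.histplot(df.Glucose,bins=20,kde=True,stat='density',ax=ax[1,0])
sns.histplot(df.BloodPressure,bins=20,kde=True,stat='density',ax=ax[1,1])
sns.histplot(df.SkinThickness,bins=20,kde=True,stat='density',ax=ax[2,0])
sns.histplot(df.Insulin,bins=20,kde=True,stat='density',ax=ax[2,1])
sns.histplot(df.DiabetesPedigreeFunction,bins=20,kde=True,stat='density',ax=ax[3,0])
sns.histplot(df.BMI,bins=20,kde=True,stat='density',ax=ax[3,1])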

# Data Visualization

## Count of total number of pregnancies

In [None]:
plt.figure(figsize=(15,6))
# Recent seaborn versions require the column to be passed by keyword (x=...)
sns.countplot(x='Pregnancies',hue='Outcome',data=df,palette='viridis')
plt.legend(loc='upper right',labels=['False','True'])

* Low pregnancy counts (roughly 0 to 2) are the most common.
* It looks like the higher the number of pregnancies, the higher the chance of being diabetic.
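
One way to check the second observation numerically is to compute the share of diabetic outcomes for each pregnancy count. The sketch below uses a tiny made-up frame (`toy` is hypothetical and only reuses the `Pregnancies` and `Outcome` column names); on the real data you would substitute `df`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    'Pregnancies': [0, 0, 1, 1, 1, 5, 5, 8, 8, 8],
    'Outcome':     [0, 0, 0, 1, 0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of positives in each group;
# the count shows how many rows support each estimate
rate = toy.groupby('Pregnancies')['Outcome'].agg(['mean', 'count'])
rate.columns = ['diabetic_rate', 'n']
print(rate)
```

Keeping the group size `n` next to the rate helps avoid over-reading groups that contain only a handful of rows.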

# Feature Engineering

## Replacing zero values with NaN

In [None]:
data=df.copy(deep=True)
# A zero in these columns is treated here as a missing measurement rather than a real value
cols=['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
      'BMI', 'DiabetesPedigreeFunction']
data[cols]=data[cols].replace(0,np.nan)  # np.NaN was removed in NumPy 2.0

In [None]:
data.isnull().sum()

# Replacing null values with mean

In [None]:
data=data.fillna(data.mean())
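
As an alternative to `fillna`, the same mean imputation can be sketched with `SimpleImputer`, which is imported at the top of the notebook. The frame `toy` and its values are made up for illustration; this is a minimal sketch, not the approach used in the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame where 0 stands for a missing measurement
toy = pd.DataFrame({'Glucose': [100.0, 0.0, 140.0],
                    'BMI':     [30.0, 20.0, 0.0]})
toy = toy.replace(0, np.nan)

# strategy='mean' fills each column's NaN with that column's mean
imputer = SimpleImputer(strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

A fitted imputer stores the column means, so the same object could later be applied to new data with `imputer.transform`.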

# Some more Exploratory Data Analysis

## Let's checkout the heatmap now

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(data.corr(),annot=True,cmap='plasma',linecolor='black',linewidths=0.01)

OK! Here we find a good correlation between:
* SkinThickness and BMI
* Age and Pregnancies
* Glucose and Outcome
* Glucose and Insulin
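
Reading pairs off a heatmap by eye is error-prone, so the strongest correlations can also be listed programmatically. The sketch below runs on synthetic data (`toy`, with made-up columns `A`, `B`, `C`); on the notebook's data you would start from `data.corr()` instead.

```python
import numpy as np
import pandas as pd

# Synthetic example: B is built from A, C is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'A': a,
    'B': a + rng.normal(scale=0.1, size=200),
    'C': rng.normal(size=200),
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# Rank pairs by absolute correlation, strongest first
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```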

## Let's check the distribution again

In [None]:
fig,ax=plt.subplots(4,2,figsize=(16,16))
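# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same kind of plot
sns.histplot(data['Pregnancies'],ax=ax[0,0],bins=20,kde=True,stat='density')
sns.histplot(data['Glucose'],ax=ax[0,1],bins=20,kde=True,stat='density')
sns.histplot(data['BloodPressure'],ax=ax[1,0],bins=20,kde=True,stat='density')
sns.histplot(data['SkinThickness'],ax=ax[1,1],bins=20,kde=True,stat='density')
sns.histplot(data['Insulin'],ax=ax[2,0],bins=20,kde=True,stat='density')
sns.histplot(data['BMI'],ax=ax[2,1],bins=20,kde=True,stat='density')
sns.histplot(data['DiabetesPedigreeFunction'],ax=ax[3,0],bins=20,kde=True,stat='density')
sns.histplot(data['Age'],ax=ax[3,1],bins=20,kde=True,stat='density')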

# Separating features and target

In [None]:
X=data.drop('Outcome',axis=1)
y=data['Outcome']

# Splitting into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Scaling the data

In [None]:
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
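# Reuse the statistics learned from the training set; refitting on the test set would leak test information
X_test=scaler.transform(X_test)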
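
A minimal sketch of the usual scaling convention, on made-up numbers: the scaler learns its mean and standard deviation from the training split only, and those same statistics are then applied to the test split. The arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature splits
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

sc = StandardScaler()
train_s = sc.fit_transform(train)  # learns mean and std from the training data
test_s = sc.transform(test)        # applies the training mean and std unchanged

print(sc.mean_, sc.scale_)
print(test_s.ravel())
```

Because the test values are scaled with the training statistics, the scaled test set does not need to have zero mean or unit variance.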

# Classifying the data

## Hyperparameter Tuning:

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are learned. The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss.
For more info: https://en.wikipedia.org/wiki/Hyperparameter_optimization
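
As a rough illustration of the idea, the sketch below runs an exhaustive grid search over a single hyperparameter with cross-validation. The data comes from `make_classification` and the grid values are arbitrary assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values for the regularisation strength C (arbitrary choices)
grid = {'C': [0.01, 0.1, 1, 10]}

# Each candidate is scored by 5-fold cross-validated accuracy
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid=grid, cv=5, scoring='accuracy')
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated score of the winning candidate, so it is an estimate from the training data and not a test-set result.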


## Receiver Operating Characteristics

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as probability of false alarm and can be calculated as (1 − specificity). It can also be thought of as a plot of the power as a function of the Type I Error of the decision rule (when the performance is calculated from just a sample of the population, it can be thought of as estimators of these quantities). The ROC curve is thus the sensitivity or recall as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (area under the probability distribution from −∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.
For more info: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
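
To make the definition concrete, the sketch below computes the ROC points by hand for four made-up predictions and integrates them with the trapezoidal rule. The helper `roc_points` is hypothetical and written for illustration; the notebook itself relies on `roc_curve` and `roc_auc_score` from scikit-learn.

```python
import numpy as np

# Made-up labels and predicted scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def roc_points(y_true, scores):
    """Return FPR and TPR for every distinct threshold, highest first."""
    # Start above every score so the curve begins at (0, 0)
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t  # predicted positive at this threshold
        tpr.append((pred & (y_true == 1)).sum() / n_pos)
        fpr.append((pred & (y_true == 0)).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

fpr, tpr = roc_points(y_true, scores)
# Area under the curve via the trapezoidal rule
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(fpr, tpr, auc)
```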

# KNeighborsClassifier

In [None]:
knn=KNeighborsClassifier()
params={'n_neighbors':range(1,21),'p':[1,2,3,4,5,6,7,8,9,10],
        'weights':['distance','uniform'],'leaf_size':range(1,21)}

In [None]:
gs_knn=GridSearchCV(knn,param_grid=params,cv=10,n_jobs=-1)

In [None]:
gs_knn.fit(X_train,y_train)
gs_knn.best_params_

In [None]:
prediction=gs_knn.predict(X_test)

In [None]:
acc_knn=accuracy_score(y_test,prediction)
print(acc_knn)
print(confusion_matrix(y_test,prediction))

In [None]:
probability=gs_knn.predict_proba(X_test)[:,1]

In [None]:
fpr_knn,tpr_knn,thresh=roc_curve(y_test,probability)

In [None]:
plt.figure(figsize=(12,6))
plt.plot(fpr_knn,tpr_knn)
plt.plot([0,1],ls='--')
plt.plot([0,0],[1,0],c='0.5')
plt.plot([1,1],c='0.5')

In [None]:
roc_auc_score(y_test,probability)*100

# LogisticRegression

In [None]:
log_reg=LogisticRegression()
params={'C':[0.01,0.1,1,10],'max_iter':[100,300,600]}

In [None]:
gs_lr=GridSearchCV(log_reg,param_grid=params,n_jobs=-1,cv=10)

In [None]:
gs_lr.fit(X_train,y_train)
gs_lr.best_params_

In [None]:
prediction=gs_lr.predict(X_test)

In [None]:
acc_lr=accuracy_score(y_test,prediction)
print(acc_lr)
print(confusion_matrix(y_test,prediction))

In [None]:
probability=gs_lr.predict_proba(X_test)[:,1]

In [None]:
fpr_lr,tpr_lr,thresh=roc_curve(y_test,probability)

In [None]:
plt.figure(figsize=(14,6))
plt.plot(fpr_lr,tpr_lr)
plt.plot([0,1],ls='--')
plt.plot([0,0],[1,0],c='0.5')
plt.plot([1,1],c='0.5')

In [None]:
roc_auc_score(y_test,probability)*100

# DecisionTreeClassifier

In [None]:
dtr=DecisionTreeClassifier()
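# "auto" is no longer accepted by recent scikit-learn releases, and min_samples_split must be at least 2
params={'max_features':[None, "sqrt", "log2"],'min_samples_leaf':range(1,11),'min_samples_split':range(2,11)}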

In [None]:
gs_dtr=GridSearchCV(dtr,param_grid=params,n_jobs=-1,cv=5)

In [None]:
gs_dtr.fit(X_train,y_train)
gs_dtr.best_params_

In [None]:
prediction=gs_dtr.predict(X_test)

In [None]:
acc_dtr=accuracy_score(y_test,prediction)
print(acc_dtr)
print(confusion_matrix(y_test,prediction))

In [None]:
probability=gs_dtr.predict_proba(X_test)[:,1]

In [None]:
fpr_dtr,tpr_dtr,thresh=roc_curve(y_test,probability)

In [None]:
plt.figure(figsize=(14,6))
plt.plot(fpr_dtr,tpr_dtr)
plt.plot([0,1],ls='--')
plt.plot([0,0],[1,0],c='0.5')
plt.plot([1,1],c='0.5')

In [None]:
roc_auc_score(y_test,probability)*100

# RandomForestClassifier

In [None]:
rfc=RandomForestClassifier()
params={'n_estimators':[100,300,500],'min_samples_leaf':range(1,11)}

In [None]:
gs_rfc=GridSearchCV(rfc,param_grid=params,n_jobs=-1,cv=5)

In [None]:
gs_rfc.fit(X_train,y_train)
gs_rfc.best_params_

In [None]:
prediction=gs_rfc.predict(X_test)

In [None]:
acc_rfc=accuracy_score(y_test,prediction)
print(acc_rfc)
print(confusion_matrix(y_test,prediction))

In [None]:
probability=gs_rfc.predict_proba(X_test)[:,1]

In [None]:
fpr_rfc,tpr_rfc,thresh=roc_curve(y_test,probability)

In [None]:
plt.figure(figsize=(14,6))
plt.plot(fpr_rfc,tpr_rfc)
plt.plot([0,1],ls='--')
plt.plot([0,0],[1,0],c='0.5')
plt.plot([1,1],c='0.5')

In [None]:
roc_auc_score(y_test,probability)*100

# Comparing the accuracies

In [None]:
report=pd.DataFrame({'Model':['KNeighborsClassifier','LogisticRegression','DecisionTreeClassifier','RandomForestClassifier'],
                    'Score':[acc_knn,acc_lr,acc_dtr,acc_rfc]})
report.sort_values(by='Score',ascending=False)