## Problem Statement:

Build a model to accurately predict whether the patients in the dataset have diabetes or not.
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.


1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction: Diabetes pedigree function
8. Age: Age (years)
9. Outcome: Class variable (0 or 1); 268 of 768 are 1, the others are 0

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing required packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [None]:
warnings.filterwarnings("ignore")

# Data Exploration:

In [None]:
health_df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv', header=0)
health_df.head()

In [None]:
health_df.info()

In [None]:
health_df.describe().T

There are 9 variables in this dataset, all of them numeric. Outcome is our target (dependent) variable; it is categorical and takes the value 0 or 1. None of the variables contains a NaN value.

However, the features Glucose, BloodPressure, SkinThickness, Insulin, and BMI have a minimum value of 0, which is not physiologically plausible. We will check each of these features one by one.
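The implausible zeros can be counted directly before any cleaning. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative, not rows from the dataset); the same expression applied to `health_df` would report how many zeros each suspect column holds.

```python
import pandas as pd

# Hypothetical toy data standing in for health_df; the values are made up.
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Insulin": [0, 0, 0, 94],
})

# Columns where a reading of 0 is not physiologically plausible.
suspect_cols = ["Glucose", "BloodPressure", "Insulin"]

# Comparing to 0 gives a boolean frame; summing counts the True values per column.
zero_counts = (toy_df[suspect_cols] == 0).sum()
print(zero_counts)
```

Counting zeros first makes it easier to judge how much of each column would be affected by treating them as missing.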

## Visually exploring these variables using histograms and treating missing values

In [None]:
plt.figure(figsize=[20,12])
ax1 = plt.subplot(3,2,1)
ax2 = plt.subplot(3,2,2)
ax3 = plt.subplot(3,2,3)
ax4 = plt.subplot(3,2,4)
ax5 = plt.subplot(3,2,5)
sns.histplot(data= health_df, x='Glucose', kde=True, ax=ax1)
sns.histplot(data= health_df, x='SkinThickness', kde=True, ax=ax2)
sns.histplot(data= health_df, x='BloodPressure', kde=True, ax=ax3)
sns.histplot(data= health_df, x='Insulin', kde=True, ax=ax4)
sns.histplot(data= health_df, x='BMI', kde=True, ax=ax5)
plt.show()

Glucose, BMI, and BloodPressure have only a few 0 values, whereas SkinThickness and Insulin have a much larger number of 0 values.

### Calculating percentage of missing values in these features

In [None]:
# Replacing 0 with NaN for Glucose, BloodPressure, BMI, SkinThickness and Insulin

health_df.replace({'Glucose': 0, 'BloodPressure': 0, 'BMI': 0, 'SkinThickness' : 0, 'Insulin' : 0}, np.nan, inplace=True)

In [None]:
#percentage of missing value

health_df.isna().sum() * 100 / health_df.shape[0]

BloodPressure has an almost normal distribution once missing values are ignored, so mean imputation should be fine here. BMI, Glucose, and SkinThickness are slightly skewed, so median imputation can be used.

For Insulin, the percentage of missing values is very high (48.7%). I'll go with median imputation and build my predictive models in two different ways.

##### Approach 1: Including SkinThickness and Insulin in our model.

##### Approach 2: Building models after excluding Insulin.
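The imputation rule described above (mean for roughly symmetric columns, median for skewed ones) can be sketched as a small helper. The function name `impute_by_skew` and the 0.5 skewness cutoff are assumptions made for illustration; the notebook itself picks the statistic by eye from the histograms.

```python
import numpy as np
import pandas as pd

def impute_by_skew(series, skew_threshold=0.5):
    """Fill NaNs with the mean if the column is roughly symmetric, else the median.

    The threshold is a hypothetical cutoff, not a value taken from the notebook.
    """
    skew = series.skew()  # NaNs are skipped by default
    fill_value = series.mean() if abs(skew) < skew_threshold else series.median()
    return series.fillna(fill_value)

# Made-up example columns, each with one missing value at the end.
symmetric = pd.Series([60.0, 70.0, 80.0, np.nan])   # symmetric, so the mean (70) is used
skewed = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])  # right-skewed, so the median (2.5) is used

filled_symmetric = impute_by_skew(symmetric)
filled_skewed = impute_by_skew(skewed)
print(filled_symmetric.iloc[-1], filled_skewed.iloc[-1])
```

The median is less sensitive to the long tail in the second column, which is why it is preferred for skewed features.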

In [None]:
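# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may operate on a temporary copy in recent pandas versions
health_df['BloodPressure'] = health_df['BloodPressure'].fillna(health_df['BloodPressure'].mean())
health_df['Glucose'] = health_df['Glucose'].fillna(health_df['Glucose'].median())
health_df['BMI'] = health_df['BMI'].fillna(health_df['BMI'].median())
health_df['SkinThickness'] = health_df['SkinThickness'].fillna(health_df['SkinThickness'].median())
health_df['Insulin'] = health_df['Insulin'].fillna(health_df['Insulin'].median())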

In [None]:
health_df.isna().sum()

### Data types and the count of variables

There are 9 numeric variables; of these, 3 are of integer type and 6 are of float type.

In [None]:
plt.figure(figsize=[8,5])
sns.countplot(x=health_df.dtypes.map(str))  # pass the values as x; newer seaborn treats the first positional argument as data
plt.show()

In [None]:
health_df.Outcome.value_counts()

Our dataset has imbalanced classes: 500 observations of class 0 and 268 observations of class 1. To handle this imbalance, stratification should be used both in K-fold cross-validation and in the train/test split.
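What stratification buys can be checked on labels alone. The sketch below builds a label vector with the same class counts as the dataset (the placeholder feature `X` is an assumption and carries no information) and confirms that a stratified split keeps the positive rate nearly identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the dataset: 500 zeros and 268 ones.
y = np.array([0] * 500 + [1] * 268)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature; its values are irrelevant here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The mean of a 0/1 vector is the share of positive cases.
overall_rate = y.mean()
train_rate = y_train.mean()
test_rate = y_test.mean()
print(round(overall_rate, 3), round(train_rate, 3), round(test_rate, 3))
```

Without `stratify`, the share of class 1 in a small test set can drift noticeably from the overall share, which makes accuracy harder to compare across splits.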

### Scatter charts between pairs of variables

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(data=health_df, kind='reg', hue='Outcome')
plt.show()

In [None]:
# Checking the features below in more detail

plt.figure(figsize=[16,6])
plt.subplot(1,3,1)
sns.scatterplot(data= health_df, x= 'Glucose', y='BloodPressure', hue='Outcome')
plt.subplot(1,3,2)
sns.scatterplot(data= health_df, x= 'BMI', y='DiabetesPedigreeFunction', hue='Outcome')
plt.subplot(1,3,3)
sns.scatterplot(data= health_df, x= 'SkinThickness', y='Age', hue='Outcome')
plt.show()

There appears to be a somewhat positive linear relationship between Insulin and Glucose, and likewise between Age and Pregnancies. Whether these relationships are strong can be checked later with the correlation heatmap.

From the scatterplots, observations with outcome 0 and outcome 1 almost overlap for most features. Glucose is the exception: below 90 there is a very low chance of outcome 1, and above 150 there is a very high chance of outcome 1.

Similarly, almost no observations with BMI below 25 have outcome = 1, whereas both outcomes appear when BMI is above 25.

The probability of outcome = 1 is also lower when age < 25 than when age > 25.
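Threshold observations like the Glucose one above can be quantified by binning the feature and taking the share of positive outcomes per band. The frame below is a made-up stand-in for `health_df` (its values are illustrative only), and the band edges 90 and 150 are simply the thresholds read off the scatterplots.

```python
import pandas as pd

# Hypothetical rows standing in for health_df; the values are made up.
toy_df = pd.DataFrame({
    "Glucose": [80, 85, 100, 120, 140, 155, 170, 190],
    "Outcome": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Left-closed bands: [0, 90), [90, 150), [150, inf).
bands = pd.cut(
    toy_df["Glucose"],
    bins=[0, 90, 150, float("inf")],
    labels=["<90", "90-150", ">=150"],
    right=False,
)

# The mean of a 0/1 column is the share of positive cases in each band.
positive_rate = toy_df.groupby(bands, observed=True)["Outcome"].mean()
print(positive_rate)
```

The same pattern works for BMI or Age by swapping the column name and the bin edges.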

### Correlation Analysis

In [None]:
plt.figure(figsize=[12,8])
sns.heatmap(health_df.corr(), annot=True, cmap='RdYlGn', vmin=-1, vmax=1, center= 0)
plt.show()

There are no very strong linear relationships between any of the variables. There are moderate linear relationships between Age and number of Pregnancies, BMI and SkinThickness, and Insulin and Glucose.
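Reading the strongest pairs off a heatmap by eye gets harder as the number of features grows, so it can help to list them. The sketch below ranks pairs by absolute correlation; the synthetic columns `a`, `b`, and `c` are assumptions for illustration, with `b` deliberately built to track `a`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic columns (not the real data): "b" follows "a", "c" is independent noise.
a = rng.normal(size=n)
toy_df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=n),
    "c": rng.normal(size=n),
})

corr = toy_df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Applied to `health_df.corr()`, the top of this list would show which feature pairs carry the moderate correlations mentioned above.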

# Data Modeling:

## Approach 1 :

In [None]:
label= health_df.iloc[:,-1]
label

In [None]:
data= health_df.iloc[:,:-1]
data

In [None]:
ss= StandardScaler()
data_scaled= pd.DataFrame(ss.fit_transform(data))
data_scaled.head()

## Finding the best performing model
1. Logistic Regression
2. Support Vector Classifier
3. K Neighbors Classifier
4. Decision Tree Classifier
5. Random Forest Classifier
6. XGBoost Classifier
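One compact way to compare candidates like these is `cross_val_score` with a `Pipeline`. This is an alternative sketch rather than the notebook's own procedure: it refits the scaler inside every fold, runs on synthetic data from `make_classification` (an assumption, not the diabetes data), and includes only three of the six models to stay short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the features and labels (assumption, not the real data).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=7)
mean_scores = {}
for name, model in models.items():
    # The pipeline fits the scaler on the training part of each fold only.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Keeping the scaler inside the pipeline means no statistics from a validation fold leak into training, at the cost of refitting it once per fold.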

In [None]:
# Using StratifiedKFold for cross validation to find best performing model

kf= StratifiedKFold(n_splits= 7, random_state=None)

In [None]:
# Testing with 6 Models

lr= LogisticRegression(solver='liblinear') # as dataset is small
svc= SVC()
knn= KNeighborsClassifier()
dt= DecisionTreeClassifier()
rf= RandomForestClassifier()
xgb= XGBClassifier()

In [None]:
lr_accuracy= []
svc_accuracy= []
knn_accuracy= []
dt_accuracy= []
rf_accuracy= []
xgb_accuracy= []

In [None]:
for train_idx, test_idx in kf.split(data,label):
    X_train, X_test= data_scaled.iloc[train_idx,:], data_scaled.iloc[test_idx,:]
    y_train, y_test= label.iloc[train_idx], label.iloc[test_idx]  # positional indexing, matching the fold indices
    
    # Logistic Regression
    lr.fit(X_train, y_train)
    lr_prediction= lr.predict(X_test)
    lr_acc= accuracy_score(lr_prediction, y_test)
    lr_accuracy.append(lr_acc)
    
    # SVC
    svc.fit(X_train, y_train)
    svc_prediction= svc.predict(X_test)
    svc_acc= accuracy_score(svc_prediction, y_test)
    svc_accuracy.append(svc_acc)
    
    # KNN
    knn.fit(X_train, y_train)
    knn_prediction= knn.predict(X_test)
    knn_acc= accuracy_score(knn_prediction, y_test)
    knn_accuracy.append(knn_acc)
    
    # Decision Tree
    dt.fit(X_train, y_train)
    dt_prediction= dt.predict(X_test)
    dt_acc= accuracy_score(dt_prediction, y_test)
    dt_accuracy.append(dt_acc)
    
    # Random Forest
    rf.fit(X_train, y_train)
    rf_prediction= rf.predict(X_test)
    rf_acc= accuracy_score(rf_prediction, y_test)
    rf_accuracy.append(rf_acc)
    
    # XGB Classifier
    xgb.fit(X_train, y_train)
    xgb_prediction= xgb.predict(X_test)
    xgb_acc= accuracy_score(xgb_prediction, y_test)
    xgb_accuracy.append(xgb_acc)

In [None]:
print('Logistic Regression- Accuracy of each fold:',*lr_accuracy)
print('Average accuracy of Logistic Regression: ', np.mean(lr_accuracy))
print('Standard deviation of accuracy:', np.std(lr_accuracy))
print('='*50)
print('SVC- Accuracy of each fold:',*svc_accuracy)
print('Average accuracy of SVC: ', np.mean(svc_accuracy))
print('Standard deviation of accuracy:', np.std(svc_accuracy))
print('='*50)
print('KNN- Accuracy of each fold:',*knn_accuracy)
print('Average accuracy of KNN: ', np.mean(knn_accuracy))
print('Standard deviation of accuracy:', np.std(knn_accuracy))
print('='*50)
print('Decision Tree- Accuracy of each fold:',*dt_accuracy)
print('Average accuracy of Decision Tree: ', np.mean(dt_accuracy))
print('Standard deviation of accuracy:', np.std(dt_accuracy))
print('='*50)
print('Random Forest- Accuracy of each fold:',*rf_accuracy)
print('Average accuracy of Random Forest: ', np.mean(rf_accuracy))
print('Standard deviation of accuracy:', np.std(rf_accuracy))
print('='*50)
print('XGB Classifier- Accuracy of each fold:',*xgb_accuracy)
print('Average accuracy of XGB Classifier: ', np.mean(xgb_accuracy))
print('Standard deviation of accuracy:', np.std(xgb_accuracy))

## Hyperparameter Tuning and Comparing the Best 2 Models with KNN

In [None]:
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(data_scaled, label, test_size= 0.2, stratify= label, random_state= 42)

### 1. Logistic Regression

In [None]:
log_reg = LogisticRegression(solver='liblinear')
param_grid= {"C": np.logspace(-5,5,22), "penalty": ["l1","l2"]}
log_reg_grid = GridSearchCV(log_reg, param_grid= param_grid, cv= 25, verbose= True, n_jobs= -1)
log_reg_grid.fit(X_train, y_train)

In [None]:
# Checking the best mean cross-validated score on the training data

print(log_reg_grid.best_score_)
print(log_reg_grid.best_params_)

In [None]:
# Testing on test data
log_reg_opt = LogisticRegression(solver='liblinear', C= 1.7301957388458944, penalty= 'l1')
log_reg_opt.fit(X_train, y_train)
log_reg_opt.score(X_test, y_test)

In [None]:
y_pred= log_reg_opt.predict(X_test)

In [None]:
#Confusion Matrix

tn, fp, fn, tp= confusion_matrix(y_test, y_pred).ravel()

print('True Negative:', tn)
print('False Positive:', fp)
print('False Negative:', fn)
print('True Positive:', tp)

In [None]:
# Classification Report

print(classification_report(y_test, y_pred))

The overall accuracy of the model is 0.70.

Recall is the fraction of actual members of a class that were correctly identified. For class 0 it is good at 0.81, but for class 1 it is weaker: TP/(TP + FN) = 27/(27 + 27) = 0.50.

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. For class 1: TP/(TP + FP) = 27/(27 + 19) = 0.59. For class 0 it is 0.75.

The F1 score is calculated as 2 × (precision × recall) / (precision + recall). It is 0.78 for class 0 and 0.54 for class 1.
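The figures quoted above can be reproduced from the four confusion-matrix counts. TP, FN, and FP are taken from the text; TN = 81 is inferred from the class 0 recall of 0.81 with 19 false positives, so treat it as an assumption.

```python
# Counts for the logistic regression test set, with class 1 as the positive class.
tp, fn, fp = 27, 27, 19
tn = 81  # inferred: recall for class 0 is tn / (tn + fp) = 0.81 with fp = 19

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 metrics.
recall_1 = tp / (tp + fn)
precision_1 = tp / (tp + fp)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 metrics: the roles of the counts swap when class 0 is treated as "positive".
recall_0 = tn / (tn + fp)
precision_0 = tn / (tn + fn)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy={accuracy:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
```

The gap between the two classes shows why accuracy alone is a weak summary here: half of the actual diabetic cases in the test set are missed.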

#### Receiver Operating Characteristics Curve

In [None]:
# Score the held-out test set only; including training rows would inflate the AUC
predict_pr = log_reg_opt.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, predict_pr)
print('AUC:', round(auc, 4))
fpr, tpr, thresholds = roc_curve(y_test, predict_pr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(fpr, tpr, marker='.')
plt.show()

### 2. Random Forest Classifier

In [None]:
rf_model= RandomForestClassifier()

param_grid= {'n_estimators': list(range(20,41,1)),
    'max_features': ['sqrt', 'log2'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'max_depth': [3,4,5,6,7,8],
    'criterion': ['gini', 'entropy']}

rf_grid= GridSearchCV(rf_model, param_grid= param_grid, cv= 25, verbose= True, n_jobs= -1)
rf_grid.fit(X_train, y_train)

In [None]:
# Checking the best mean cross-validated score on the training data

print(rf_grid.best_score_)
#print(rf_grid.best_params_)

In [None]:
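# 'sqrt' is what max_features='auto' meant for classifiers before it was removed in scikit-learn 1.3
rf_model_opt= RandomForestClassifier(criterion= 'gini', max_depth= 6, max_features= 'sqrt', n_estimators= 32)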
rf_model_opt.fit(X_train, y_train)
rf_model_opt.score(X_test, y_test)

In [None]:
y_pred= rf_model_opt.predict(X_test)

In [None]:
#Confusion Matrix

tn, fp, fn, tp= confusion_matrix(y_test, y_pred).ravel()

print('True Negative:', tn)
print('False Positive:', fp)
print('False Negative:', fn)
print('True Positive:', tp)

In [None]:
# Classification Report

print(classification_report(y_test, y_pred))

In [None]:
probs = rf_model_opt.predict_proba(data_scaled)
probs 

#### ROC Curve

In [None]:
# Score the held-out test set only; including training rows would inflate the AUC
predict_pr = rf_model_opt.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, predict_pr)
print('AUC:', round(auc, 4))
fpr, tpr, thresholds = roc_curve(y_test, predict_pr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(fpr, tpr, marker='.')
plt.show()

### 3. KNN

In [None]:
knn_model= KNeighborsClassifier()
param_grid= {'n_neighbors': list(range(1,20)), 'weights': ["uniform", "distance"], 'metric': ['minkowski','manhattan','euclidean']}
knn_grid= GridSearchCV(knn_model, param_grid= param_grid, cv= 25, verbose= True, n_jobs= -1)
knn_grid.fit(X_train, y_train)

In [None]:
# Checking the best mean cross-validated score on the training data

print(knn_grid.best_score_)
print(knn_grid.best_params_)

In [None]:
# Testing on test data

knn_opt = KNeighborsClassifier(n_neighbors= 15, weights= 'uniform', metric= 'minkowski')
knn_opt.fit(X_train, y_train)
knn_opt.score(X_test, y_test)

In [None]:
y_pred= knn_opt.predict(X_test)

In [None]:
#Confusion Matrix

tn, fp, fn, tp= confusion_matrix(y_test, y_pred).ravel()

print('True Negative:', tn)
print('False Positive:', fp)
print('False Negative:', fn)
print('True Positive:', tp)

In [None]:
# Classification Report

print(classification_report(y_test, y_pred))

#### ROC Curve

In [None]:
# Score the held-out test set only; including training rows would inflate the AUC
predict_pr = knn_opt.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, predict_pr)
print('AUC:', round(auc, 4))
fpr, tpr, thresholds = roc_curve(y_test, predict_pr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(fpr, tpr, marker='.')
plt.show()

## Approach 2: (Dropping Insulin and checking model performance)

In [None]:
data_ap2 = data.drop('Insulin', axis=1)
data_ap2

In [None]:
ss1= StandardScaler()
# Scale the reduced feature set; scaling `data` here would silently keep the Insulin column
data_ap2_scaled= pd.DataFrame(ss1.fit_transform(data_ap2))
data_ap2_scaled.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data_ap2_scaled, label, test_size= 0.2, stratify= label, random_state= 42)

### 1. Random Forest Classifier

In [None]:
rf_model= RandomForestClassifier()

param_grid= {'n_estimators': list(range(20,41,1)),
    'max_features': ['sqrt', 'log2'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'max_depth': [3,4,5,6,7,8],
    'criterion': ['gini', 'entropy']}

rf_grid= GridSearchCV(rf_model, param_grid= param_grid, cv= 25, verbose= True, n_jobs= -1)
rf_grid.fit(X_train, y_train)

In [None]:
# Checking the best mean cross-validated score on the training data

print(rf_grid.best_score_)
#print(rf_grid.best_params_)

In [None]:
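# 'sqrt' is what max_features='auto' meant for classifiers before it was removed in scikit-learn 1.3
rf_model_opt2= RandomForestClassifier(criterion= 'entropy', max_depth= 4, max_features= 'sqrt', n_estimators= 30)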
rf_model_opt2.fit(X_train, y_train)
rf_model_opt2.score(X_test, y_test)

In [None]:
y_pred= rf_model_opt2.predict(X_test)

In [None]:
#Confusion Matrix

tn, fp, fn, tp= confusion_matrix(y_test, y_pred).ravel()

print('True Negative:', tn)
print('False Positive:', fp)
print('False Negative:', fn)
print('True Positive:', tp)

In [None]:
# Classification Report

print(classification_report(y_test, y_pred))

In [None]:
# Score the held-out test set only; including training rows would inflate the AUC
predict_pr = rf_model_opt2.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, predict_pr)
print('AUC:', round(auc, 4))
fpr, tpr, thresholds = roc_curve(y_test, predict_pr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(fpr, tpr, marker='.')
plt.show()

After dropping the Insulin column from the features, the model performs slightly worse.

### 2. KNN

In [None]:
knn_model= KNeighborsClassifier()
param_grid= {'n_neighbors': list(range(1,20)), 'weights': ["uniform", "distance"], 'metric': ['minkowski','manhattan','euclidean']}
knn_grid= GridSearchCV(knn_model, param_grid= param_grid, cv= 25, verbose= True, n_jobs= -1)
knn_grid.fit(X_train, y_train)

In [None]:
# Checking the best mean cross-validated score on the training data

print(knn_grid.best_score_)
print(knn_grid.best_params_)

In [None]:
# Testing on test data

knn_opt2 = KNeighborsClassifier(n_neighbors= 15, weights= 'uniform', metric= 'minkowski')
knn_opt2.fit(X_train, y_train)
knn_opt2.score(X_test, y_test)

In [None]:
y_pred= knn_opt2.predict(X_test)

In [None]:
# Classification Report

print(classification_report(y_test, y_pred))

There is no change in the KNN classification report whether we include or exclude the Insulin feature in our model.

# Tableau Dashboard Link
https://public.tableau.com/profile/anik.chakraborty#!/vizhome/Healthcare-DiabetesAnalysis_16205897996520/Dashboard