# Diabetes Classification Predictive Modelling

**Context**

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases, Maryland, USA. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females, at least 21 years old and of Pima Indian heritage.

**Dataset**

The dataset consists of several medical predictor variables and one binary target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level and age.

**Notebook Contents**

- 1. Exploratory Data Analysis
- 2. Feature Engineering
- 3. Machine Learning
- 4. Summary

In [None]:
#Import Libraries
import numpy as np
import pandas as pd 
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

from numpy import sort

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV,RandomizedSearchCV
from sklearn.utils import resample
from sklearn.feature_selection import SelectFromModel,SelectKBest,f_classif

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score,recall_score,classification_report,roc_curve,roc_auc_score

In [None]:
data = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
data.head()

In [None]:
data.info()

We have 768 subjects in the dataset and 8 features, all of which are shown as numeric. We have a target variable, Outcome, which we know to be a binary indicator of the presence or absence of diabetes. `info()` reports no null values.

# 1. EDA

First we will examine the distributions of the dependent and independent variables.

In [None]:
#Sample proportion of diabetes:
prop = 100*data['Outcome'].value_counts()/len(data)

labels = ['No Diabetes '+ str(round(prop[0],2)) + "%",'Diabetes '+ str(round(prop[1],2)) + "%"]
colormap = ['lightgrey','tab:orange']  #a list keeps the order: no diabetes, diabetes
data['Outcome'].value_counts().plot.pie(startangle=90, colors=colormap, labels=labels)
plt.title("Diabetes Overall Sample Proportion", fontsize=15)
plt.ylabel('')
circle = plt.Circle((0,0),0.7,color="white")
p = plt.gcf()
p.gca().add_artist(circle)
plt.show()

34.9% of the women in the sample have diabetes. This means our classes are imbalanced.

In [None]:
data.describe()

In [None]:
#Distributions of independent variables:
features = data.columns.drop('Outcome')

nrows=4
ncols=2
fig, axes = plt.subplots(nrows,ncols,figsize=(8,8))

for ax, i in zip(axes.flat, features):
    #histplot replaces the deprecated distplot (seaborn >= 0.11):
    sns.histplot(data[i], bins=50, color='teal', ax=ax)

#Adjust spacing once all panels are drawn:
plt.tight_layout()

With the exception of the number of pregnancies, which is discrete, all other features are continuous numeric values. Missing values have evidently been coded as 0 in the dataset, because a value of zero is not physiologically possible for glucose, blood pressure, skin thickness, insulin or BMI. There is also a skin thickness value of 99, which is likely to be another missing value as the subject's BMI is only average.
Before we explore the relationship between outcome and the independent variables, we will examine the relationship between outcome and missing status.

## 1.1 Missing Values

We will examine whether there is a difference in diabetes status between the missing and non-missing data. We will have to assume that there are no missing pregnancy values, since zero is a valid count.

In [None]:
#Create missing markers:
miss_markers = []

for i in features:
    if i in ['Pregnancies','Age','DiabetesPedigreeFunction']:
        pass
    else:
        data['missing_'+i]  = data[i].apply(lambda x: 1 if x==0 else 0)
        miss_markers.append('missing_'+i)
        
#Create DF of proportions:
props = pd.DataFrame(index=[0,1])

for i in miss_markers:
    props[i] = data.groupby(i).Outcome.sum()/data.groupby(i).Outcome.count()

props['Total']=1

#Proportion of missing values by variable:
for i in miss_markers:
    percent_missing = 100*data[i].value_counts()/len(data)
    print("Percentage missing in ", i, ": ",round(percent_missing[1],2),"%")
    


Almost half the subjects have missing insulin values, 48.7%, and almost a third have missing skin thickness measurements, 29.6%.

In [None]:
#Chart proportion diabetic by missing status:

for i in miss_markers:
    plt.figure(figsize=(4,4))
    #Recent seaborn versions require x and y as keyword arguments:
    sns.barplot(x=props.index,y=props['Total'],color='lightgrey')
    sns.barplot(x=props.index,y=props[i],color='tab:orange')
    plt.title(i,fontsize=20)
    plt.ylabel("Proportion Diabetic")
    plt.show()
    
data.drop(miss_markers,axis=1,inplace=True)

The proportion of diabetics is similar across missing status for every variable except BMI, so we will impute with variable means or medians where appropriate. The proportion of diabetics is lower in the missing BMI group, but only 11 (1.4%) of the BMI values are missing, so this difference is not statistically significant (Fisher's exact test, p=0.35).
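
As an illustration of how a Fisher's exact test like the one quoted above could be run, here is a minimal sketch using `scipy.stats.fisher_exact`. The `toy` frame and its values are made up for the example and are not the notebook's data, so the printed p-value will not match the figure in the text.

```python
import pandas as pd
from scipy.stats import fisher_exact

# Toy stand-in for the notebook's data; the values are illustrative only.
toy = pd.DataFrame({
    "BMI":     [0.0, 0.0, 31.2, 28.4, 35.1, 22.9, 40.3, 26.7],
    "Outcome": [0,   1,   1,    0,    1,    0,    1,    0],
})

# 2x2 table: rows = BMI missing (coded as 0) or not, columns = Outcome.
missing_bmi = (toy["BMI"] == 0).rename("missing_BMI")
table = pd.crosstab(missing_bmi, toy["Outcome"])

# fisher_exact returns the odds ratio and the two-sided p-value.
odds_ratio, p_value = fisher_exact(table.to_numpy())
print(table)
print("p-value:", round(p_value, 3))
```

Applied to the real data, the same pattern would use `data['BMI']` and `data['Outcome']` in place of the toy columns.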

## 1.2 Features Relationship with Outcome and Each Other

We will examine the features against outcome, excluding the missing values identified earlier.

In [None]:
#Boxplots of features by diabetes status:
nrows=3
ncols=3
fig, axes = plt.subplots(nrows,ncols,figsize=(9,9))

for ax, i in zip(axes.flat, features):
    #Exclude zero-coded missing values for the measured features:
    if i in ['Pregnancies','Age','DiabetesPedigreeFunction']:
        subset = data
    else:
        subset = data[data[i]!=0]
    #Keyword x/y and an ordered palette (a set has no guaranteed order):
    sns.boxplot(x=subset['Outcome'],y=subset[i], palette=['lightgrey','tab:orange'], ax=ax)
    ax.set_xlabel("")
    ax.set_title(i)

#8 features on a 3x3 grid, so hide the unused panel:
axes.flat[-1].set_visible(False)
plt.tight_layout()
        

We can see that the diabetic group have higher glucose levels, age, BMI, pregnancies and insulin measures.

In [None]:
#Recode missing values as nan:
miss_data = data.copy()

for i in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI']:
    miss_data[i] = miss_data[i].replace(0,np.nan)
    
miss_data['SkinThickness'] = miss_data['SkinThickness'].replace(99,np.nan)

In [None]:
#Feature correlation excluding missing:
corr_mat = miss_data[features].corr()

mask = np.zeros_like(corr_mat,dtype=bool)  #np.bool was removed in NumPy 1.24
mask[np.triu_indices_from(mask)]=True
sns.heatmap(corr_mat,annot=True,mask=mask,cmap=sns.diverging_palette(240,10,as_cmap=True),vmin=-1,vmax=1)
plt.title("Feature Correlation Matrix",fontsize=20)
plt.show()

We can see that a few of the features are moderately correlated - Age and number of pregnancies, Insulin and glucose levels, skin thickness and BMI - but not so much as to cause concern. 
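
The reason for recoding zeros as NaN before computing correlations can be seen in a small sketch: `DataFrame.corr` skips incomplete pairs, whereas a zero-coded missing value is treated as a real measurement. The `toy` frame below is invented for the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: the last value of "a" is missing and the last value of "b" is large.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [2.0, 4.0, 6.0, 100.0],
})

# DataFrame.corr drops incomplete pairs, so only the first three rows are used.
r_nan = toy.corr().loc["a", "b"]

# Treating the missing value as 0 (as in the raw file) changes the estimate.
r_zero = toy.fillna(0).corr().loc["a", "b"]

print("NaN-coded:", round(r_nan, 3), "| zero-coded:", round(r_zero, 3))
```

In this toy case the sign of the correlation flips once the missing value is read as zero, which is why `miss_data` rather than `data` is used for the matrix.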

In [None]:
#Pairplots of features:
sns.pairplot(miss_data, hue='Outcome', palette={0:'darkgrey',1:'tab:orange'})
plt.show()

We can clearly see the correlation between some of the features, such as skin thickness and BMI. We can also see the difference across features of the diabetic status, clearest with glucose and more subtly with features like BMI.

# 2. Feature Engineering

We will balance the diabetes classes. We will also impute missing values and scale the data using a pipeline.

## 2.1 Upscale Diabetes Group

We saw earlier that our diabetes classes are not balanced, so we will upsample (upscale) the diabetes group by resampling it with replacement until it matches the size of the non-diabetic group.

In [None]:
#Upscale diabetes group to 500 rows, the size of the non-diabetic group:

df_min = miss_data[miss_data['Outcome']==1] 
df_maj = miss_data[miss_data['Outcome']==0]

df_min_upscaled = resample(df_min, replace=True,
                           n_samples=500,
                           random_state=0) 
miss_data = pd.concat([df_maj,df_min_upscaled]) 
miss_data.Outcome.value_counts()
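
Because the resampling above happens before the train/test split, copies of the same subject can end up in both the training and the test set (and in different cross-validation folds), which tends to make scores look better than they are. A minimal sketch of one alternative, splitting first and upsampling only the training part, is shown below. The `toy` frame is synthetic and `stratify` is an added assumption, not something the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data standing in for miss_data (values are synthetic).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=100),
    "Outcome": [1] * 30 + [0] * 70,
})

# Split first, stratifying so both parts keep the original class ratio.
train, test = train_test_split(toy, test_size=0.2,
                               stratify=toy["Outcome"], random_state=0)

# Upsample the minority class inside the training part only.
train_min = train[train["Outcome"] == 1]
train_maj = train[train["Outcome"] == 0]
train_min_up = resample(train_min, replace=True,
                        n_samples=len(train_maj), random_state=0)
train_balanced = pd.concat([train_maj, train_min_up])

print(train_balanced["Outcome"].value_counts())
print("Rows shared with the test set:",
      len(set(train_balanced.index) & set(test.index)))
```

With this ordering the test set keeps the original class proportions and contains no duplicated training rows.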


## 2.2 Imputation and Scaling


We will replace the missing (previously zero) values with the variable medians, as some of the distributions are skewed. Unregularised logistic regression does not generally need scaled inputs, but scikit-learn's LogisticRegression applies L2 regularisation by default, and the penalty is sensitive to feature scale. We will also be fitting a KNN model, which requires scaling because it is distance-based. We will create a modelling pipeline to do this.
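
To make the two preprocessing steps concrete, here is a small sketch of median imputation followed by min-max scaling on a hand-made matrix. `X_toy` is illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy matrix: the second column has one missing value.
X_toy = np.array([[1.0, np.nan],
                  [3.0, 4.0],
                  [5.0, 8.0]])

prep = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                       ("scaler", MinMaxScaler())])

# The missing entry becomes the column median (6.0), then each column is
# rescaled so that its minimum maps to 0 and its maximum to 1.
X_scaled = prep.fit_transform(X_toy)
print(X_scaled)
```

Wrapping both steps in a `Pipeline` means the medians and ranges are learned from the training folds only when the pipeline is used inside cross-validation.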

## 2.3 Feature Selection Using L1 Regularization

We will make an initial selection of the best features using L1-regularised logistic regression, which can shrink the coefficients of uninformative features to exactly zero. Later, we will use feature importances to examine the features in the tree-based models.

In [None]:
#Create a test set:
y = miss_data['Outcome'].copy()
X = miss_data[features].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("X",X.shape)
print("X_train",X_train.shape)
print("X_test",X_test.shape)

In [None]:
#Create a modeling pipeline:
# Create num transformer for imputing and scaling then bundle into a preprocessor:

num_transformer = Pipeline( steps=[('imputer',SimpleImputer(strategy='median')),
                                        ('scaler',MinMaxScaler())
                                       ])


preprocessor = ColumnTransformer( transformers = [('num',num_transformer, features)
                                                 ])

In [None]:
#Define the L1 Logistic Regression model and
#bundle into a processing and modelling pipeline:

clf = LogisticRegression(penalty="l1",solver='liblinear',random_state=0)

pipe = Pipeline(steps = [('preprocessing',preprocessor),
                            ('modelling', clf)
                            ])

In [None]:
#Tune C with GridSearchCV:

hparams = {'modelling__C':[0.001,0.003,0.01,0.03,0.1,0.3,1,3,10,30,100]}

grid = GridSearchCV(pipe,param_grid = hparams,cv=5,scoring='roc_auc',n_jobs=3,verbose=1)
grid.fit(X_train,y_train)
print("Best params",grid.best_params_)
print("Best score:",grid.best_score_)

## 2.3.1 Examine Selected Features

In [None]:
#Plot kept coefficients:
coefs = pd.Series(grid.best_estimator_.named_steps['modelling'].coef_[0],index = X.columns)
kept = coefs[coefs!=0]
kept.sort_values().plot.barh()

plt.title("L1 Regularised Coefficients", fontsize=15);

None of the features were dropped through L1 regularisation. Glucose and BMI are the features most strongly associated with diabetes status.  
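
Whether L1 regularisation drops any features depends on the value of C chosen by the grid search. The sketch below, on synthetic data from `make_classification` rather than the diabetes data, counts the non-zero coefficients at a few values of C to show the effect.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the scaled training set.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = MinMaxScaler().fit_transform(X_toy)

# Count the coefficients that survive at each regularisation strength.
# Smaller C means a stronger penalty.
kept_by_C = {}
for C in [0.001, 0.1, 100]:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=0)
    model.fit(X_toy, y_toy)
    kept_by_C[C] = int(np.count_nonzero(model.coef_))

print(kept_by_C)
```

A large best C therefore means a weak penalty, in which case it is unsurprising for every feature to be kept.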

# 3. Machine Learning

We will start by fitting baseline models. We will fit the following models:
- L2 Logistic Regression,
- KNN,
- SVCs,
- Random Forest,
- XGBoost

In [None]:
#Fit baseline models and collect the cross-validated AUC of each:
results = []

for i in [LogisticRegression(),KNeighborsClassifier(), SVC(kernel='linear'), SVC(kernel='rbf'),RandomForestClassifier(random_state=0),XGBClassifier(random_state=0)]:
    #define processing pipeline:
    num_transformer = Pipeline( steps=[('imputer',SimpleImputer(strategy='median')),
                                        ('scaler',MinMaxScaler())
                                       ])
    preprocessor = ColumnTransformer( transformers = [('num',num_transformer, features)
                                                 ])
    #Fit model:
    clf = i

    pipe = Pipeline(steps = [('preprocessing',preprocessor),
                            ('modelling', clf)
                            ])

    scores = cross_val_score(pipe,X_train,y_train,cv=5,scoring='roc_auc')

    #Record the result:
    results.append({"Model":i,"AUC":scores.mean()})

#Create table of results (DataFrame.append was removed in pandas 2.0):
model_table = pd.DataFrame(results,columns=["Model","AUC"])
round(model_table,3)

Our baseline Random Forest and XGBoost models achieve the highest AUC scores so we will tune them:

## 3.1 Tuned Random Forest Model

In [None]:
#Tune RF model with GridSearchCV:
#Create pipeline:
num_transformer = Pipeline( steps=[('imputer',SimpleImputer(strategy='median')),
                                        ('scaler',MinMaxScaler())
                                       ])
preprocessor = ColumnTransformer( transformers = [('num',num_transformer, features)
                                                 ])
clf = RandomForestClassifier(random_state=0)   
    
pipe = Pipeline(steps = [('preprocessing',preprocessor),
                            ('modelling', clf)
                            ])

hparams = {'modelling__n_estimators':[n for n in range(100,501,100)],
          'modelling__max_depth':[4,6,8]}

grid = GridSearchCV(pipe,param_grid = hparams,cv=5,scoring='roc_auc',n_jobs=3,verbose=1)
grid.fit(X_train,y_train)
print("Best params",grid.best_params_)
print("Best score:",grid.best_score_)


In [None]:
#Add Tuned RF model to table:
clf = RandomForestClassifier(n_estimators=100,max_depth=8, random_state=0)

pipe = Pipeline(steps = [('preprocessing',preprocessor),
                            ('modelling', clf)
                            ])
    
scores = cross_val_score(pipe,X_train,y_train,cv=5,scoring='roc_auc')
    
#Add to model table:
model_table = pd.concat([model_table,pd.DataFrame([{"Model":clf,"AUC":scores.mean()}])],ignore_index=True)
  

round(model_table,3)

Our tuned Random Forest classifier performed worse than our default RF classifier. The default RF model places no limit on max_depth, so its higher cross-validated score may partly reflect overfitting rather than a genuinely better model.
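
One way to probe this is to compare training AUC with cross-validated AUC for an unrestricted forest and a depth-limited one. The sketch below does so on synthetic data; the data, the `flip_y` noise level and the depth of 3 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the training set; flip_y adds label noise.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, n_informative=4,
                                   flip_y=0.1, random_state=0)

results = {}
for depth in [None, 3]:
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    # Cross-validated AUC estimates performance on unseen data.
    cv_auc = cross_val_score(rf, X_toy, y_toy, cv=5, scoring="roc_auc").mean()
    # Training AUC is measured on the same rows the model was fitted to.
    train_auc = roc_auc_score(y_toy, rf.fit(X_toy, y_toy).predict_proba(X_toy)[:, 1])
    results[depth] = {"train_auc": train_auc, "cv_auc": cv_auc}
    print("max_depth =", depth,
          "| train AUC:", round(train_auc, 3),
          "| CV AUC:", round(cv_auc, 3))
```

A large gap between the two numbers for a model is a sign that it is fitting noise in the training rows.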

## 3.2 Tuned XGB Model

In [None]:
#Tune XGBoost model with GridSearchCV:
#Create pipeline:
num_transformer = Pipeline( steps=[('imputer',SimpleImputer(strategy='median')),
                                        ('scaler',MinMaxScaler())
                                       ])
preprocessor = ColumnTransformer( transformers = [('num',num_transformer, features)
                                                 ])
clf = XGBClassifier(random_state=0)   
    
pipe = Pipeline(steps = [('preprocessing',preprocessor),
                            ('modelling', clf)
                            ])

hparams = {'modelling__learning_rate':[0.001,0.003,0.01,0.03,0.1,0.3],
           'modelling__n_estimators':[n for n in range(100,501,100)],
          'modelling__max_depth':[4,6,8]}

grid = GridSearchCV(pipe,param_grid = hparams,cv=5,scoring='roc_auc',n_jobs=-1,verbose=3)
grid.fit(X_train,y_train)
print("Best params",grid.best_params_)
print("Best score:",grid.best_score_)

In [None]:
#Fit the tuned XGBoost model:
clf = XGBClassifier(learning_rate=0.03,n_estimators=500,max_depth=6, random_state=0)

pipe = Pipeline(steps = [('preprocessing',preprocessor),
                            ('modelling', clf)
                            ])
    
scores = cross_val_score(pipe,X_train,y_train,cv=5,scoring='roc_auc')

#Add to table of results:
model_table = pd.concat([model_table,pd.DataFrame([{"Model":clf,"AUC":scores.mean()}])],ignore_index=True)
round(model_table,3)

The tuned RF model is the best so far. 

## 3.3 Tuned RF Model Feature Importances
Refit the tuned RF model to all the training data to obtain Feature Importances and predictions.

In [None]:
#Refit tuned RF model:
clf = RandomForestClassifier(n_estimators=100,max_depth=8, random_state=0)

pipe = Pipeline(steps = [('preprocessing',preprocessor),
                            ('modelling', clf)
                            ])

pipe.fit(X_train,y_train)
y_preds = pipe.predict(X_test)
y_probs = pipe.predict_proba(X_test)[:,1]

#Save tuned RF model Feature Importances:
tuned_feat_imps = pipe.named_steps['modelling'].feature_importances_

In [None]:
#Plot Tuned Random Forest Feature Importances:

feat_imp = pd.Series(tuned_feat_imps,index=X.columns)
feat_imp = feat_imp.sort_values(ascending=False)
sns.barplot(x=feat_imp.values,y=feat_imp.index)
plt.title("Feature Importances in Tuned Random Forest Model",fontsize=15);

In the tuned Random Forest model Glucose is the most important feature while SkinThickness is the least important feature. 

In [None]:
#Confusion matrix for tuned RF model:
con_mat = pd.crosstab(y_test,y_preds,rownames=['Actual'],colnames=['Predicted'])
sns.heatmap(con_mat,annot=True,cmap='Blues',vmin=0)
plt.title("Confusion Matrix for tuned RF Model", fontsize=15)
plt.show()

In [None]:
#ROC curve:
fpr,tpr,_ = roc_curve(y_test,y_probs)
auc = roc_auc_score(y_test,y_probs)
plt.plot(fpr,tpr,label="RF Model,auc=" +  str(round(auc,4)))
plt.title(" Random Forest Model ROC Curve",fontsize=15)
plt.legend(loc=4);

In [None]:
print(classification_report(y_test, y_preds))
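
The recall reported for class 1 in the classification report is the model's sensitivity. A minimal sketch of how sensitivity and specificity follow from the confusion matrix is given below, using hand-made labels rather than the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hand-made labels for illustration only (1 = diabetic, 0 = not diabetic).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # share of diabetics correctly flagged
specificity = tn / (tn + fp)   # share of non-diabetics correctly cleared

# recall_score for the positive class is the same quantity as sensitivity.
print(sensitivity, specificity, recall_score(y_true, y_pred))
```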

# 4. Summary

The Random Forest classifier is a good predictor of diabetes in Pima Indian women aged 21 and over (AUC = 90.1%, sensitivity = 82.0%). The most important feature for prediction is Glucose.

NOTE: Almost half the subjects (48.7%) had missing insulin values, and almost a third (29.6%) had missing skin thickness measurements.