<a href="https://www.kaggle.com/code/orjiugochukwu/ml-models-for-detecting-diabetes-with-99-accuracy?scriptVersionId=106036132" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# ML Models for Detecting Onset Diabetes
By

**Ugochukwu Orji**

# Executive Summary of Project

* Diabetes mellitus is fast rising to epidemic levels all over the world, according to WHO reports. When it is not effectively treated, it can lead to organ failure, cardiovascular disease, disruption of other bodily functions, and ultimately death.

* Stakeholders in the health industry have been seeking machine learning (ML) tools and techniques that can assist healthcare practitioners in the early stage of diabetes diagnosis so as to halt the lethal disease's progression. Recently, clinical trials have employed ML, data mining, and big data technologies to predict diabetes diagnosis and other diseases in patients. 

* Exploratory data analysis showed severe skewness in the class distribution, which was handled with a random under-sampling technique. Five machine learning (ML) models were then built: Random Forest Classifier (RF), Decision Tree Classifier (DT), XGBoost Classifier (XGB), KNeighbors Classifier (KNN), and Logistic Regression (LR).

* The models were evaluated using confusion matrices and classification reports, along with ROC curves and AUC scores. The RF model obtained the best accuracy score of 99%, followed by the DT, XGB, KNN, and LR models with accuracy scores of 98%, 97%, 80%, and 75% respectively.

## Importing libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split, ParameterGrid, cross_val_score, RepeatedKFold, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import warnings  as ws
ws.filterwarnings("ignore")

### Load data

In [None]:
# load the dataframe
df=pd.read_csv("../input/diabetes-data-set/diabetes-dataset.csv")

### Surface-level information and Shape

In [None]:
# print out the head of the dataframe
df.head()

# Exploratory Data Analysis

In [None]:
# get the shape
df.shape

**Insight**: We have 2000 rows/samples and 9 columns. 8 of the 9 columns are our *feature* columns, while the last column (Outcome) represents our *target* column. Moreover, all of our columns appear to consist of numeric data. Let's look at how many samples of each class we have:

In [None]:
df["Outcome"].value_counts()

In [None]:
sns.set()
# pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(data=df, x="Outcome")
plt.show()

**Insight**: Our data set consists of 1316 healthy individuals and only 684 diabetic individuals; evidently, our data set is imbalanced. We will need to deal with this later.

### Checking for Null and Duplicate Entries

In [None]:
# check for null entries
df.isna().sum()

In [None]:
# check for duplicate entries
num_duplicate_entries = df.duplicated(subset=None, keep='first').sum()
num_duplicate_entries

In [None]:
# which class do the duplicates belong to?
duplicate_data = df[df.duplicated(subset=None, keep='first')]
duplicate_data.Outcome.value_counts()

**Insight**: We don't have any null/missing values. However, we do have 1256 duplicate rows: 825 belong to the healthy class and 431 belong to the diabetic class.

### Feature Analysis
Now that we have a decent understanding of the structure of our data, let's dive deeper by exploring the features themselves and how they might impact our target variable. To start, we can take a look at the correlation of the features:

In [None]:
corr = df.corr() # compute the correlation matrix
mask = np.triu(np.ones_like(corr, dtype=bool)) # define the upper-triangular mask for the heatmap
cmap = sns.color_palette("Blues", as_cmap=True) # define the color palette to use
plt.figure(figsize=(12, 10)) # update the figure size to display nicely
sns.heatmap(corr, mask=mask, cmap=cmap, square=True, annot=True)

**Insight**: The features are not strongly correlated with one another or with the target variable. Thus, we should be able to incorporate all of them when building our final model. Let's take a closer look at their correlations and distributions.

In [None]:
# use pairplot to show relationships between features and individual distributions
# (pairplot creates its own figure, so a separate plt.figure call would only add an empty one)
sns.pairplot(data=df, hue="Outcome", corner=True, diag_kind="kde")

**Insight**: From the scatter plots, we can see that outliers are prevalent within the data set. Moreover, from the KDE distributions on the diagonal, there is a great deal of overlap between healthy and diabetic patients. In other words, there is no subset of features that easily distinguishes healthy individuals from diabetic individuals.

### Outlier Detection
Let's confirm our assumption that outliers exist in the data set by taking a quick glance at the distribution of the data using pandas' `describe` method:

In [None]:
df.describe().T

**Insight**: In the Glucose, BloodPressure, and BMI columns, there is a large gap between the minimum value and the 25th percentile. Moreover, in many of these columns, the minimum value is zero, which makes little sense in the context of the task (for example, a BloodPressure value of zero would indicate that the patient's heart is no longer beating!). Similarly, in the Pregnancies, BloodPressure, SkinThickness, Insulin, BMI, and Age columns, there is a large gap between the 75th percentile and the maximum value, indicating the presence of outliers.
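
The zero readings mentioned above can be counted directly. The snippet below is a minimal sketch on a made-up toy frame (`toy_df` and `suspect_columns` are illustrative names, not part of the notebook); the same expression could be applied to the real `df`, assuming it uses these column names.

```python
import pandas as pd

# Toy frame standing in for the notebook's `df`; the values are made up.
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Columns where a zero is physiologically implausible
# (an assumption for illustration, not a rule taken from the data set).
suspect_columns = ["Glucose", "BloodPressure", "BMI"]

# Count the zero entries in each suspect column.
zero_counts = (toy_df[suspect_columns] == 0).sum()
print(zero_counts)
```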

To get a sense for just how many outliers we are dealing with, we can use a box and whisker plot over each of the features:

In [None]:
for column in df.columns[:-1]:
    plt.figure(figsize=(12, 10))
    plt.title(f"{column} Boxplot")
    sns.boxplot(data=df, x=column)

To make things even more concrete, we can compute the number of outliers present in each of the columns of our data set:

In [None]:
for column in df.columns[:-1]:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    outlier_range = (df[column] < (Q1 - 1.5*IQR)) | (df[column] > (Q3 + 1.5 * IQR))
    num_outliers = df[column][outlier_range].count()
    
    print(f"{column}: {num_outliers} outliers")

**Insight**: Looks like our data set contains a large number of outliers; we will need to deal with this later.

#  Data Preprocessing
Now that we have a solid understanding of our data, we can begin to clean and prepare it for our models. Because all of our data is already numeric and no two features have a high correlation, there is not much left for us to do, other than address our class imbalance and standardize our data.

### Splitting our Data
We begin by splitting our data into train and test sets; we must do this *before* any preprocessing so as to avoid any type of [data leakage](https://machinelearningmastery.com/data-leakage-machine-learning/).

In [None]:
# split our data into train/test sets to avoid data leakage
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)
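
Because the data set contains many duplicate rows, it may be worth checking whether identical rows end up on both sides of the split, since that is another possible form of leakage. The snippet below is a hedged sketch on a made-up toy frame (`toy_df`, `toy_train`, `toy_test`, and `overlap` are illustrative names); it is not part of the original workflow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with deliberately repeated rows; the values are made up.
toy_df = pd.DataFrame({
    "Glucose": [148, 85, 148, 85, 183, 89, 148, 137],
    "BMI": [33.6, 26.6, 33.6, 26.6, 23.3, 28.1, 33.6, 43.1],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 1],
})

toy_train, toy_test = train_test_split(toy_df, test_size=0.25, random_state=0)

# De-duplicate the training rows so each test row can match at most once,
# then inner-merge on every column to find test rows that also occur in training.
train_unique = toy_train.drop_duplicates()
overlap = toy_test.merge(train_unique, how="inner", on=list(toy_df.columns))

print(f"{len(overlap)} of {len(toy_test)} test rows also appear in the training set")
```

If the overlap is large on the real data, test scores may partly reflect memorized rows rather than generalization.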

### Addressing Class Imbalance
As we saw earlier, our data is severely imbalanced; we have a greater number of healthy individuals in our data set than diabetic individuals. If this is still the case in our training set, we will perform [random undersampling](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/#:~:text=Random%20undersampling%20involves%20randomly%20selecting,more%20balanced%20distribution%20is%20reached.) to balance our training set.

In [None]:
train_df["Outcome"].value_counts()

Evidently, our training set is still heavily imbalanced, so undersampling is needed. One way we can achieve this is by removing all the duplicate entries that were classified as healthy individuals from our training set.

In [None]:
# get all the duplicated rows
train_dups = train_df[train_df.duplicated(subset=None, keep='first')]
# get the index of all the healthy duplicates (mask built from train_dups itself so the indexes match)
healthy_dups = train_dups.loc[train_dups["Outcome"] == 0].index
# drop the healthy duplicates
train_df = train_df.drop(healthy_dups)

In [None]:
train_df["Outcome"].value_counts()
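
For comparison, random undersampling in the strict sense draws a random subset of the majority class rather than dropping duplicates. The snippet below is a minimal sketch of that variant on a made-up toy frame (`toy_train`, `minority`, `majority`, and `balanced` are illustrative names); it is an alternative shown under stated assumptions, not the method used in this notebook.

```python
import pandas as pd

# Toy training frame with six healthy rows (0) and two diabetic rows (1); values are made up.
toy_train = pd.DataFrame({
    "Glucose": [85, 89, 90, 95, 99, 110, 148, 183],
    "Outcome": [0, 0, 0, 0, 0, 0, 1, 1],
})

minority = toy_train[toy_train["Outcome"] == 1]
majority = toy_train[toy_train["Outcome"] == 0]

# Randomly keep only as many majority rows as there are minority rows.
majority_sampled = majority.sample(n=len(minority), random_state=0)

# Recombine and shuffle so the classes are not grouped together.
balanced = pd.concat([minority, majority_sampled]).sample(frac=1, random_state=0)
print(balanced["Outcome"].value_counts())
```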

Now that we've performed undersampling, we can standardize our data. Because we have a large number of outliers, we will be using sklearn's [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html).

In [None]:
# separate the data from the labels
X_train, y_train = train_df.drop(columns=["Outcome"]), train_df["Outcome"]
X_test, y_test = test_df.drop(columns=["Outcome"]), test_df["Outcome"]

# create the scaler
scaler = RobustScaler()
# fit the scaler and transform our data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
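
To see why `RobustScaler` suits data with outliers: with its default settings it centers each feature on its median and divides by the interquartile range, $x' = (x - \text{median}) / (Q_3 - Q_1)$, so a single extreme value does not shift the center or the scale. The snippet below is a small sketch on a made-up column (`X_toy` and `X_manual` are illustrative names) that compares the scaler's output with the formula computed by hand.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One toy feature with a single extreme value; the numbers are made up.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Default RobustScaler: center on the median, scale by the 25th-75th percentile range.
toy_scaler = RobustScaler()
X_scaled = toy_scaler.fit_transform(X_toy)

# The same transformation computed by hand.
median = np.median(X_toy, axis=0)
q1, q3 = np.percentile(X_toy, [25, 75], axis=0)
X_manual = (X_toy - median) / (q3 - q1)

print(np.allclose(X_scaled, X_manual))
```

Here the median is 3 and the interquartile range is 2, so the four ordinary values land between -1 and 0.5 while the outlier stays far away instead of compressing them.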

# Model Predictions and Evaluation
Now all that we have left to do is generate some models and make predictions!

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
def model_Evaluate(model):
    
    # Predict values for Test dataset
    y_pred = model.predict(X_test)

    # Print the evaluation metrics for the dataset.
    print(classification_report(y_test, y_pred))
    
    # Compute and plot the Confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred)

    categories  = ['Negative','Positive']
    group_names = ['True Neg','False Pos', 'False Neg','True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]

    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)

    sns.heatmap(cf_matrix, annot = labels, cmap = 'Blues',fmt = '',
                xticklabels = categories, yticklabels = categories)

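    plt.xlabel("Predicted values", fontdict = {'size':14}, labelpad = 10)
    plt.ylabel("Actual values"   , fontdict = {'size':14}, labelpad = 10)
    plt.title ("Confusion Matrix", fontdict = {'size':18}, pad = 20)
    plt.show()

    # Return the accuracy so that callers can store it (e.g. acc_rf)
    return accuracy_score(y_test, y_pred)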

### RandomForest

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
acc_rf= model_Evaluate(rf)

### DecisionTreeClassifier

In [None]:
dc = DecisionTreeClassifier()
dc.fit(X_train, y_train)
acc_dc= model_Evaluate(dc)

### KNeighborsClassifier

In [None]:
kn = KNeighborsClassifier()
kn.fit(X_train, y_train)
acc_kn= model_Evaluate(kn)

### Logistic Regression

In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
acc_lr= model_Evaluate(lr)

### XGBoost

In [None]:
xg = XGBClassifier()
xg.fit(X_train, y_train)
acc_xg= model_Evaluate(xg)
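
A single train/test split gives one accuracy figure per model. The import cell already pulls in `cross_val_score` and `RepeatedStratifiedKFold`, which could be used to estimate how much that figure varies across splits. The snippet below is a hedged sketch on synthetic data from `make_classification` (the data, `X_toy`, `y_toy`, and the fold settings are assumptions for illustration, not results from this notebook).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the scaled training data (8 features, binary target).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# 5 stratified folds repeated 3 times gives 15 accuracy estimates.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, scoring="accuracy", cv=cv
)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```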

## ROC curve

In [None]:
model_lr = LogisticRegression().fit(X_train, y_train)
probs_lr = model_lr.predict_proba(X_test)[:, 1]

model_dt = DecisionTreeClassifier().fit(X_train, y_train)
probs_dt = model_dt.predict_proba(X_test)[:, 1]

model_kn = KNeighborsClassifier().fit(X_train, y_train)
probs_kn = model_kn.predict_proba(X_test)[:, 1]

model_rf = RandomForestClassifier().fit(X_train, y_train)
probs_rf = model_rf.predict_proba(X_test)[:, 1]

model_xg = XGBClassifier().fit(X_train, y_train)
probs_xg = model_xg.predict_proba(X_test)[:, 1]

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

# Outcome is already encoded as 0/1, so no label mapping is needed
y_test_int = y_test.astype(int)
auc_lr = roc_auc_score(y_test_int, probs_lr)
fpr_lr, tpr_lr, thresholds_lr = roc_curve(y_test_int, probs_lr)

auc_dt = roc_auc_score(y_test_int, probs_dt)
fpr_dt, tpr_dt, thresholds_dt = roc_curve(y_test_int, probs_dt)

auc_kn = roc_auc_score(y_test_int, probs_kn)
fpr_kn, tpr_kn, thresholds_kn = roc_curve(y_test_int, probs_kn)

auc_rf = roc_auc_score(y_test_int, probs_rf)
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test_int, probs_rf)

auc_xg = roc_auc_score(y_test_int, probs_xg)
fpr_xg, tpr_xg, thresholds_xg = roc_curve(y_test_int, probs_xg)

plt.figure(figsize=(12, 7))
plt.plot(fpr_lr, tpr_lr, label=f'AUC (Logistic Regression) = {auc_lr:.2f}')
plt.plot(fpr_dt, tpr_dt, label=f'AUC (Decision Tree) = {auc_dt:.2f}')
plt.plot(fpr_kn, tpr_kn, label=f'AUC (K-nearest Neighbors) = {auc_kn:.2f}')
plt.plot(fpr_rf, tpr_rf, label=f'AUC (Random Forests) = {auc_rf:.2f}')
plt.plot(fpr_xg, tpr_xg, label=f'AUC (XGBoost) = {auc_xg:.2f}')
plt.plot([0, 1], [0, 1], color='blue', linestyle='--', label='Baseline')
plt.title('ROC Curve', size=20)
plt.xlabel('False Positive Rate', size=14)
plt.ylabel('True Positive Rate', size=14)
plt.legend();

**Thanks to https://www.kaggle.com/code/anandmural/diabetesprediction#Diabetes-Prediction for the insight...**