

# DIABETES

### The raw dataset was obtained from the UCI Machine Learning Repository and originates from the National Institute of Diabetes and Digestive and Kidney Diseases. The preprocessing shown here is my own. The objective is to predict whether or not a patient has diabetes, based on the diagnostic features included in the dataset.


### Number of Instances: 768
### Number of Attributes: 8 plus class
### For Each Attribute: (all numeric-valued)

* **Pregnancies**: Number of times pregnant

* **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

* **BloodPressure**: Diastolic blood pressure (mm Hg)

* **SkinThickness**: Triceps skin fold thickness (mm)

* **Insulin**: 2-hour serum insulin (mu U/ml)

* **BMI**: Body mass index (weight in kg/(height in m)^2)

* **DiabetesPedigreeFunction**: Diabetes pedigree function

* **Age**: Age (years)

* **Outcome**: Class variable (0 or 1)

* **Missing Attribute Values**: Yes

### Class Distribution: (class value 1 is interpreted as "tested positive for diabetes")


### Attributes Normal Value Range:

* **Glucose**: < 140 = Normal, 140-200 = Pre-Diabetic, > 200 = Diabetic

* **BloodPressure**: < 60 = Below Normal, 60-80 = Normal, 80-90 = Stage 1 Hypertension, 90-120 = Stage 2 Hypertension, > 120 = Hypertensive Crisis

* **SkinThickness**: < 10 = Below Normal, 10-30 = Normal, > 30 = Above Normal

* **Insulin**: < 200 = Normal, > 200 = Above Normal

* **BMI**: < 18.5 = Underweight, 18.5-25 = Normal, 25-30 = Overweight, > 30 = Obese




### Class Value: Number of Instances
* 0 : 500

* 1 : 268
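
The class counts above imply a reference point for the accuracies reported later: a model that always predicts the majority class. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the dataset description above.
class_counts = {0: 500, 1: 268}

total = sum(class_counts.values())

# Share of each class in the dataset.
class_share = {label: count / total for label, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
baseline_accuracy = max(class_share.values())

print(f"total instances: {total}")
print(f"share of positive class: {class_share[1]:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained below is only informative to the extent that it beats this baseline.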

# Load libraries


In [None]:
import numpy as np
import pandas as pd 
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler, RobustScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
# plot_confusion_matrix is not imported: it is unused here and was removed in scikit-learn 1.2
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report, precision_recall_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
import missingno as msno

import warnings
warnings.simplefilter(action = 'ignore')

# Read data

In [None]:
diabetes = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')


# Overview

In [None]:
diabetes.head()

In [None]:
diabetes.info()

In [None]:
diabetes.describe([0.10, 0.25, 0.40, 0.50,0.70, 0.90,0.95, 0.99]).T

In [None]:
df = diabetes.copy()
df["Outcome"].value_counts()

## Correlation:

In [None]:
df.corr()

In [None]:
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
df.head()

In [None]:
df.groupby("Outcome").agg({"Pregnancies":"mean"})

In [None]:
df.groupby("Outcome").agg({"Age":"mean"})

In [None]:
df.groupby("Outcome").agg({"Age":"max"})

In [None]:
df.groupby("Outcome").agg({"Insulin": "mean"})

In [None]:
df.groupby("Outcome").agg({"Insulin": "max"})

In [None]:
df.groupby("Outcome").agg({"Glucose": "mean"})

In [None]:
df.groupby("Outcome").agg({"Glucose": "max"})

In [None]:
df.groupby("Outcome").agg({"BMI": "mean"})

## Missing Values

In [None]:
# A value of 0 is not physiologically plausible for these columns, so it is treated as missing
df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.nan)

In [None]:
df.isnull().sum()

###  Visualization of the Missing Values

In [None]:
msno.bar(df);

Missing values are filled with the median of the corresponding variable, computed separately for each Outcome class (0 and 1).

In [None]:
def median_target(sfy):
    # Median of the observed (non-null) values of column `sfy`, per Outcome class
    temp = df[df[sfy].notnull()]
    temp = temp[[sfy, 'Outcome']].groupby(['Outcome'])[[sfy]].median().reset_index()
    return temp

columns = df.columns
columns = columns.drop("Outcome")
for i in columns:
    medians = median_target(i)   # row 0 -> Outcome 0, row 1 -> Outcome 1
    df.loc[(df['Outcome'] == 0 ) & (df[i].isnull()), i] = medians[i][0]
    df.loc[(df['Outcome'] == 1 ) & (df[i].isnull()), i] = medians[i][1]
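
The cell above fills each gap with the median of the patient's own Outcome class. The same idea can be sketched on a toy table without pandas; the rows and values below are invented purely for illustration.

```python
from statistics import median

# Toy rows: (Outcome, Insulin); None marks a missing measurement.
rows = [(0, 90.0), (0, None), (0, 110.0),
        (1, 200.0), (1, None), (1, 160.0), (1, 180.0)]

# Median of the observed values within each Outcome class.
class_median = {}
for outcome in {o for o, _ in rows}:
    observed = [v for o, v in rows if o == outcome and v is not None]
    class_median[outcome] = median(observed)

# Replace each missing value with the median of its own class.
filled = [(o, v if v is not None else class_median[o]) for o, v in rows]
print(class_median)
print(filled)
```

Because the fill value depends on `Outcome`, this scheme can only be applied to rows whose label is already known.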

In [None]:
df.head()

In [None]:
df.isnull().sum()

### Visualization of outliers in all columns with boxplot.

In [None]:
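sns.set(font_scale=0.7)
n_rows = (len(df.columns) + 1) // 2   # round up so that every column gets its own axis
fig, axes = plt.subplots(nrows=n_rows, ncols=2, figsize=(7,12))
for ax,col in zip(axes.flatten(),df.columns):
    sns.boxplot(x=df[col],ax=ax)
fig.tight_layout()   # adjust spacing after the plots have been drawn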

In [None]:
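for feature in df:

    # Despite the names, these are the 5th and 95th percentiles, not the quartiles
    Q1 = df[feature].quantile(0.05)
    Q3 = df[feature].quantile(0.95)
    IQR = Q3-Q1
    upper = Q3 + 1.5*IQR

    # True if at least one value of this feature exceeds the upper threshold
    if (df[feature] > upper).any():
        print(feature,"yes")
    else:
        print(feature, "no")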

### The analysis shows outliers in two variables. Values above the upper threshold are replaced with the threshold value (shown below for Insulin and SkinThickness).

In [None]:
Q1 = df.Insulin.quantile(0.25)
Q3 = df.Insulin.quantile(0.75)
IQR = Q3-Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR
df.loc[df["Insulin"] > upper,"Insulin"] = upper

In [None]:
Q1 = df.SkinThickness.quantile(0.25)
Q3 = df.SkinThickness.quantile(0.75)
IQR = Q3-Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR
df.loc[df["SkinThickness"] > upper,"SkinThickness"] = upper
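
The two cells above cap values at the upper fence `Q3 + 1.5 * IQR`. A small worked example of that rule on invented numbers is sketched below; it uses only the standard library, and `method="inclusive"` is chosen on the assumption that its linear interpolation corresponds to the default behaviour of `Series.quantile`.

```python
from statistics import quantiles

values = [10, 12, 13, 15, 16, 18, 20, 95]   # toy data; 95 is an obvious outlier

# Quartiles with linear interpolation between neighbouring data points.
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
upper = q3 + 1.5 * iqr

# Cap everything above the upper fence at the fence itself.
capped = [min(v, upper) for v in values]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper fence={upper}")
print(capped)
```

Only the value 95 is changed; all other values lie below the fence and are left as they are.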

# Local Outlier Factor (LOF)

In [None]:
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors = 20, contamination = 0.1)
lof.fit_predict(df)
df_scores = lof.negative_outlier_factor_
df_scores = pd.DataFrame(np.sort(df_scores))
df_scores.plot(stacked=True, xlim=[0,60], style='.-'); # zoom in on the lowest (most anomalous) scores
    
df_scores[0:20]

In [None]:
df_scores.iloc[4,:]

In [None]:
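scores = lof.negative_outlier_factor_   # unsorted scores, aligned with the rows of df
threshold = np.sort(scores)[4]          # fifth-lowest score
new_df = df[scores > threshold]         # keep only rows scoring above the threshold
new_df.info()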

In [None]:
df = new_df.copy()   # work on an independent copy before adding new columns
df.describe().T

##  Feature Engineering




The BMI variable is divided into groups according to general standards and a new categorical variable named NewBMI is created.

In [None]:
NewBMI = pd.Series(["Underweight", "Normal", "Overweight", "Obesity 1", "Obesity 2", "Obesity 3"], dtype = "category")
df["NewBMI"] = NewBMI
df.loc[df["BMI"] < 18.5, "NewBMI"] = NewBMI[0]
df.loc[(df["BMI"] >= 18.5) & (df["BMI"] <= 24.9), "NewBMI"] = NewBMI[1]  # >= so that exactly 18.5 is not left unassigned
df.loc[(df["BMI"] > 24.9) & (df["BMI"] <= 29.9), "NewBMI"] = NewBMI[2]
df.loc[(df["BMI"] > 29.9) & (df["BMI"] <= 34.9), "NewBMI"] = NewBMI[3]
df.loc[(df["BMI"] > 34.9) & (df["BMI"] <= 39.9), "NewBMI"] = NewBMI[4]
df.loc[df["BMI"] > 39.9 ,"NewBMI"] = NewBMI[5]
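
The grouping above can also be written as a single function, which makes it easier to check that every BMI value lands in exactly one group. This is only a sketch: `bmi_group` is a hypothetical helper, the thresholds are copied from the cell above, and assigning the boundary value 18.5 to "Normal" is an assumption.

```python
def bmi_group(bmi):
    """Return the BMI group label for a single BMI value."""
    if bmi < 18.5:
        return "Underweight"
    if bmi <= 24.9:
        return "Normal"        # covers 18.5 up to and including 24.9
    if bmi <= 29.9:
        return "Overweight"
    if bmi <= 34.9:
        return "Obesity 1"
    if bmi <= 39.9:
        return "Obesity 2"
    return "Obesity 3"

for value in (17.0, 18.5, 24.9, 27.3, 33.0, 39.9, 45.2):
    print(value, bmi_group(value))
```

With pandas, such a function could be applied column-wise, for example via `df["BMI"].apply(bmi_group)`.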

In [None]:
df.head()

The data in the insulin variable was divided into normal and abnormal groups and a new variable called NewInsulinScore was created.

In [None]:
def set_insulin(row):
    if row["Insulin"] >= 16 and row["Insulin"] <= 166:
        return "Normal"
    else:
        return "Abnormal"
    
df = df.assign(NewInsulinScore=df.apply(set_insulin, axis=1))
df.head()

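The glucose variable is divided into groups according to general standards and a new variable named NewGlucose is defined. The labels follow the code below: "Overweight" marks values above 99 and up to 126, "Secret" marks values above 126, and "High" is declared but never assigned.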

In [None]:
NewGlucose = pd.Series(["Low", "Normal", "Overweight", "Secret", "High"], dtype = "category")
df["NewGlucose"] = NewGlucose
df.loc[df["Glucose"] <= 70, "NewGlucose"] = NewGlucose[0]
df.loc[(df["Glucose"] > 70) & (df["Glucose"] <= 99), "NewGlucose"] = NewGlucose[1]
df.loc[(df["Glucose"] > 99) & (df["Glucose"] <= 126), "NewGlucose"] = NewGlucose[2]
df.loc[df["Glucose"] > 126 ,"NewGlucose"] = NewGlucose[3]

In [None]:
df.head()

# One Hot Encoding

With the One Hot Encoding method, the values in categorical variables have been converted into numerical expressions.

In [None]:
df = pd.get_dummies(df, columns =["NewBMI","NewInsulinScore", "NewGlucose"], drop_first = True)
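
To make the effect of `drop_first = True` concrete, here is a dependency-free sketch of one-hot encoding. The helper `one_hot` and the `is_` column prefix are hypothetical; pandas itself names the columns after the original variable, for example `NewInsulinScore_Normal`.

```python
def one_hot(values, drop_first=True):
    """Encode a list of category labels as 0/1 indicator columns."""
    categories = sorted(set(values))
    # The first category becomes the baseline: it is implied when all indicators are 0.
    kept = categories[1:] if drop_first else categories
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]

insulin_score = ["Normal", "Abnormal", "Normal"]
encoded = one_hot(insulin_score)
print(encoded)
```

Dropping the baseline category removes a redundant column, since its indicator can always be derived from the remaining ones.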

In [None]:
df.head()

In [None]:
categorical_df = df[['NewBMI_Obesity 1','NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight','NewBMI_Underweight',
                     'NewInsulinScore_Normal','NewGlucose_Low','NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret']]

In [None]:
categorical_df.head()

In [None]:
X_ = df.drop(['NewBMI_Obesity 1','NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight','NewBMI_Underweight',
                     'NewInsulinScore_Normal','NewGlucose_Low','NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'], axis = 1)


In [None]:
X_.head()

In [None]:
df = pd.concat([X_,categorical_df], axis = 1)

In [None]:
df.head()

# Splitting Dataset
### Storing the target variable in y and all other features in X

In [None]:
y = df['Outcome']
X = df.drop('Outcome', axis = 1)
X.head()

# Scaling (Min-Max Normalization):

In [None]:
MinMax = MinMaxScaler(feature_range = (0, 1)).fit(X)
X_st = MinMax.transform(X)
X_st = pd.DataFrame(X_st, columns = X.columns)
X_st.head()
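
`MinMaxScaler` with `feature_range = (0, 1)` rescales each column to x' = (x - min) / (max - min). A minimal sketch of that formula for one column follows; `min_max_scale` is a hypothetical helper, the ages are invented, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero.

```python
def min_max_scale(column):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: no spread to rescale (assumed convention).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

ages = [21, 30, 45, 81]
scaled = min_max_scale(ages)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and the relative spacing of the values in between is preserved.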

In [None]:
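# X_st has a fresh 0..n-1 index, so reset y's index to keep the rows aligned
sdf = pd.concat([X_st, y.reset_index(drop = True)], axis = 1)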
sdf.describe().T


# Machine Learning:


### We will train several machine learning models on our data, evaluate each one with the same set of techniques, and compare the findings at the end to determine which model works best for our data.





## -----   Model Performance and Comparison   -----

### To measure the performance of a model, we need several elements

**Confusion matrix** : also known as the error matrix, allows visualization of the performance of an algorithm

    True Positive (TP) : Diabetic correctly identified as diabetic
    True Negative (TN) : Healthy correctly identified as healthy
    False Positive (FP) : Healthy incorrectly identified as diabetic
    False Negative (FN) : Diabetic incorrectly identified as healthy

**Metrics**

    Accuracy : (TP + TN) / (TP + TN + FP +FN)
    Precision : TP / (TP + FP)
    Recall : TP / (TP + FN)
    F1 score : 2 x ((Precision x Recall) / (Precision + Recall))
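
The formulas above can be checked with a few lines of code. The counts below are invented for illustration, and `classification_metrics` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four metrics defined above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy counts: 40 diabetics found, 14 missed, 10 false alarms, 90 healthy confirmed.
m = classification_metrics(tp=40, tn=90, fp=10, fn=14)
for metric_name, value in m.items():
    print(f"{metric_name}: {value:.3f}")
```

In this example precision is higher than recall: most positive predictions are correct, but a noticeable share of diabetic patients is missed.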



### Defining variables to store the outputs.

In [None]:
avg_accuracies={}
accuracies={}
roc_auc={}
pr_auc={}

### Defining function to calculate the Cross-Validation score.




In [None]:
def cal_score(name,model,folds):
    # scikit-learn scoring names mapped to the labels shown on the plot
    metrics = {'accuracy': 'Accuracy', 'precision': 'Precision', 'recall': 'Recall',
               'f1': 'F1 score', 'roc_auc': 'Roc auc'}
    values = []
    for sc in metrics:
        cv_scores = cross_val_score(model, X_st, y, cv = folds, scoring = sc)
        values.append(np.round(np.average(cv_scores) * 100, 2))   # mean over folds, in percent
    avg_accuracies[name] = values[0]   # mean cross-validated accuracy
    plt.figure(figsize = (15, 8))
    sns.set_palette('mako')
    ax = sns.barplot(x = list(metrics.values()), y = values)
    plt.yticks(np.arange(0, 100, 10))
    plt.ylabel('Percentage %', labelpad = 10)
    plt.xlabel('Scoring Parameters', labelpad = 10)
    plt.title('Cross Validation ' + str(folds) + '-Folds Average Scores', pad = 20)
    for p in ax.patches:
        ax.annotate(str(p.get_height()), (p.get_x(), p.get_height()), xytext = (p.get_x() + 0.3, p.get_height() + 1.02))
    plt.show()
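
The function above relies on k-fold cross-validation: the data is cut into `folds` parts, and each part serves once as the test set while the rest is used for training. The sketch below shows only the index bookkeeping. `k_fold_indices` is a hypothetical helper that uses plain contiguous folds, whereas scikit-learn uses stratified folds by default when an integer `cv` is passed with a classifier.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

splits = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so the averaged score uses each observation for evaluation once.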

### Defining function to create Confusion Matrix.

In [None]:
def conf_matrix(ytest, pred):
    plt.figure(figsize = (15, 8))
    global cm1
    cm1 = confusion_matrix(ytest, pred)
    ax = sns.heatmap(cm1, annot = True, fmt = 'd', cmap = 'Blues')   # fmt='d' prints counts as plain integers
    plt.title('Confusion Matrix', pad = 30)

### Defining function to calculate the Metrics Scores.

In [None]:
def metrics_score(cm):
    total = sum(sum(cm))
    accuracy = (cm[0, 0] + cm[1, 1]) / total
    precision = cm[1, 1] / (cm[0, 1] + cm[1, 1])
    sensitivity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
    f1 = 2 * (precision * sensitivity) / (precision + sensitivity)
    specificity = cm[0,0] / (cm[0, 1] + cm[0, 0])
    values = [np.round(accuracy * 100, 2),
            np.round(precision * 100, 2),
            np.round(sensitivity * 100, 2),
            np.round(f1 * 100, 2),
            np.round(specificity * 100, 2)]
    plt.figure(figsize = (15, 8))
    sns.set_palette('magma')
    ax = sns.barplot(x = ['Accuracy', 'Precision', 'Recall', 'F1 score', 'Specificity'], y = values)
    plt.yticks(np.arange(0, 100, 10))
    plt.ylabel('Percentage %', labelpad = 10)
    plt.xlabel('Scoring Parameter', labelpad = 10)
    plt.title('Metrics Scores', pad = 20)
    for p in ax.patches:
        ax.annotate(str(p.get_height()), (p.get_x(), p.get_height()), xytext = (p.get_x() + 0.3, p.get_height() + 1.02))
    plt.show()

### Defining function to plot ROC Curve.

In [None]:
def plot_roc_curve(fpr, tpr):
    plt.figure(figsize = (8, 6))
    plt.plot(fpr, tpr, color = 'Orange', label = 'ROC')
    plt.plot([0, 1], [0, 1], color = 'black', linestyle = '--')
    plt.ylabel('True Positive Rate', labelpad = 10)
    plt.xlabel('False Positive Rate', labelpad = 10)
    plt.title('Receiver Operating Characteristic (ROC) Curve', pad = 20)
    plt.legend()
    plt.show()

### Defining function to plot Precision-Recall Curve.

In [None]:
def plot_precision_recall_curve(recall, precision):
    plt.figure(figsize=(8,6))
    plt.plot(recall, precision, color='orange', label='PRC')
    plt.ylabel('Precision',labelpad=10)
    plt.xlabel('Recall',labelpad=10)
    plt.title('Precision Recall Curve',pad=20)
    plt.legend()
    plt.show()

## 1. Logistic Regression Classifier:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
prediction1 = log_model.predict(X_test)
accuracy1 = log_model.score(X_test, y_test) 
print ('Model Accuracy:',accuracy1 * 100)

Storing model accuracy to plot for comparison with other Machine Learning models.

In [None]:
accuracies['Logistic Regression'] = np.round(accuracy1 * 100, 2)

1. Plotting Confusion Matrix to describe the performance of Logistic Regression Classifier on a set of test data.

In [None]:
conf_matrix(y_test, prediction1)

Plotting different metrics scores for the Logistic Regression Classifier for evaluation.

In [None]:
metrics_score(cm1)

* Plotting the average of different metrics scores for further evaluation.

In [None]:
cal_score('Logistic Regression', log_model, 5)

Plotting the Receiver Operating Characteristic (ROC) Curve to illustrate the diagnostic ability of the Logistic Regression Classifier as its discrimination threshold is varied, and showing the Area under the ROC Curve (AUC), which tells us how well the model distinguishes between healthy and diabetic patients.

In [None]:
probs = log_model.predict_proba(X_test)
probs = probs[:, 1]
auc1 = roc_auc_score(y_test, probs)
roc_auc['Logistic Regression'] = np.round(auc1, 2)
print('Area under the ROC Curve (AUC): %.2f' % auc1)
fpr1, tpr1, _ = roc_curve(y_test, probs)
plot_roc_curve(fpr1, tpr1)
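
One way to read the AUC value is as the probability that a randomly chosen diabetic patient receives a higher score than a randomly chosen healthy one. The sketch below computes it directly from that definition on four invented predictions; `pairwise_auc` is a hypothetical helper, not the routine scikit-learn uses internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = pairwise_auc(y_true, scores)
print(auc_value)
```

Three of the four positive/negative pairs are ordered correctly here, which gives an AUC of 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.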

Plotting the Precision-Recall Curve across classification thresholds, much like the ROC Curve, and showing the Area under the Precision-Recall Curve (AUCPR), which summarizes the curve in a single number.

In [None]:
precision1, recall1, _ = precision_recall_curve(y_test, probs)
auc_score1 = auc(recall1, precision1)
pr_auc['Logistic Regression'] = np.round(auc_score1, 2)
print('Area under the PR Curve (AUCPR): %.2f' % auc_score1)
plot_precision_recall_curve(recall1, precision1)

## 2. KNNeighbors Classifier:

KNN is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.
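
"Lazy" here means that no model is fitted in advance: the stored training points are consulted only when a prediction is requested. The sketch below illustrates this with a majority vote among the nearest points. `knn_predict` and the toy data are hypothetical, and `k=5` mirrors the default `n_neighbors` of `KNeighborsClassifier`.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Rank all training points by Euclidean distance to the query.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(0.2, 0.2), k=3))
print(knn_predict(points, labels, query=(0.9, 0.9), k=3))
```

Because the prediction depends on distances, features on very different scales would dominate the vote, which is one reason the features are scaled before modelling.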

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
KNN_model = KNeighborsClassifier()
KNN_model.fit(X_train, y_train)
prediction2 = KNN_model.predict(X_test)
accuracy2 = KNN_model.score(X_test, y_test) 
print ('Model Accuracy:',accuracy2 * 100)

Storing model accuracy to plot for comparison with other Machine Learning models.

In [None]:
accuracies['KNeighbors Classifier'] = np.round(accuracy2 * 100, 2)

1. Plotting Confusion Matrix to describe the performance of KNN Classifier on a set of test data.

In [None]:
conf_matrix(y_test, prediction2)

Plotting different metrics scores for the KNN Classifier for evaluation.

In [None]:
metrics_score(cm1)

* Plotting the average of different metrics scores for further evaluation.

In [None]:
cal_score('KNeighbors Classifier', KNN_model, 5)

Plotting Receiver Operating Characteristic (ROC) Curve, to illustrate the diagnostic ability of KNN Classifier as its discrimination threshold is varied and showing the Area under the ROC Curve (AUC) value which will tell us how much our model is capable of distinguishing between healthy and diabetic patients.

In [None]:
probs = KNN_model.predict_proba(X_test)
probs = probs[:, 1]
auc2 = roc_auc_score(y_test, probs)
roc_auc['KNeighbors Classifier'] = np.round(auc2, 2)
print('Area under the ROC Curve (AUC): %.2f' % auc2)
fpr2, tpr2, _ = roc_curve(y_test, probs)
plot_roc_curve(fpr2, tpr2)

Plotting the Precision-Recall Curve across classification thresholds, much like the ROC Curve, and showing the Area under the Precision-Recall Curve (AUCPR), which summarizes the curve in a single number.

In [None]:
precision2, recall2, _ = precision_recall_curve(y_test, probs)
auc_score2 = auc(recall2, precision2)
pr_auc['KNeighbors Classifier'] = np.round(auc_score2, 2)
print('Area under the PR Curve (AUCPR): %.2f' % auc_score2)
plot_precision_recall_curve(recall2, precision2)

## 3. Support Vector Machine Classifier:


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
SVC_model = SVC(probability = True)
SVC_model.fit(X_train, y_train)
prediction3 = SVC_model.predict(X_test)
accuracy3 = SVC_model.score(X_test, y_test) 
print ('Model Accuracy:',accuracy3 * 100)

Storing model accuracy to plot for comparison with other Machine Learning models.

In [None]:
accuracies['Support Vector Machine Classifier'] = np.round(accuracy3 * 100, 2)

1. Plotting Confusion Matrix to describe the performance of SVM Classifier on a set of test data.

In [None]:
conf_matrix(y_test, prediction3)

Plotting different metrics scores for the SVM Classifier for evaluation.

In [None]:
metrics_score(cm1)

* Plotting the average of different metrics scores for further evaluation.

In [None]:
cal_score('Support Vector Machine Classifier', SVC_model, 5)

Plotting Receiver Operating Characteristic (ROC) Curve, to illustrate the diagnostic ability of SVM Classifier as its discrimination threshold is varied and showing the Area under the ROC Curve (AUC) value which will tell us how much our model is capable of distinguishing between healthy and diabetic patients.

In [None]:
probs = SVC_model.predict_proba(X_test)
probs = probs[:, 1]
auc3 = roc_auc_score(y_test, probs)
roc_auc['Support Vector Machine Classifier'] = np.round(auc3, 2)
print('Area under the ROC Curve (AUC): %.2f' % auc3)
fpr3, tpr3, _ = roc_curve(y_test, probs)
plot_roc_curve(fpr3, tpr3)

Plotting the Precision-Recall Curve across classification thresholds, much like the ROC Curve, and showing the Area under the Precision-Recall Curve (AUCPR), which summarizes the curve in a single number.

In [None]:
precision3, recall3, _ = precision_recall_curve(y_test, probs)
auc_score3 = auc(recall3, precision3)
pr_auc['Support Vector Machine Classifier'] = np.round(auc_score3, 2)
print('Area under the PR Curve (AUCPR): %.2f' % auc_score3)
plot_precision_recall_curve(recall3, precision3)

## 4. Classification and Regression Tree:


Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
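
A decision rule of the form "feature <= threshold" is typically chosen so that the two resulting groups are as pure as possible. The sketch below scores candidate thresholds with Gini impurity, which is the default criterion of `DecisionTreeClassifier`. The helpers `gini` and `split_gini` and the toy glucose values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

glucose = [85, 90, 110, 150, 160, 170]
outcome = [0, 0, 0, 1, 1, 1]

print("before split:", gini(outcome))
for threshold in (100, 120):
    print("threshold", threshold, "->", split_gini(glucose, outcome, threshold))
```

On this toy data a threshold of 120 separates the classes completely, so it would be preferred over a threshold of 100.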

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
CART_model = DecisionTreeClassifier(max_depth = 10, min_samples_split = 50)
CART_model.fit(X_train, y_train)
prediction4 = CART_model.predict(X_test)
accuracy4 = CART_model.score(X_test, y_test) 
print ('Model Accuracy:',accuracy4 * 100)

Storing model accuracy to plot for comparison with other Machine Learning models.

In [None]:
accuracies['Classification and Regression Tree'] = np.round(accuracy4 * 100, 2)

1. Plotting Confusion Matrix to describe the performance of CART Classifier on a set of test data.

In [None]:
conf_matrix(y_test, prediction4)

Plotting different metrics scores for the CART Classifier for evaluation.

In [None]:
metrics_score(cm1)

* Plotting the average of different metrics scores for further evaluation.

In [None]:
cal_score('Classification and Regression Tree', CART_model, 5)

Plotting Receiver Operating Characteristic (ROC) Curve, to illustrate the diagnostic ability of CART Classifier as its discrimination threshold is varied and showing the Area under the ROC Curve (AUC) value which will tell us how much our model is capable of distinguishing between healthy and diabetic patients.

In [None]:
probs = CART_model.predict_proba(X_test)
probs = probs[:, 1]
auc4 = roc_auc_score(y_test, probs)
roc_auc['Decision Tree Classifier']=np.round(auc4, 2)
print('Area under the ROC Curve (AUC): %.2f' % auc4)
fpr4, tpr4, _ = roc_curve(y_test, probs)
plot_roc_curve(fpr4, tpr4)

Plotting the Precision-Recall Curve across classification thresholds, much like the ROC Curve, and showing the Area under the Precision-Recall Curve (AUCPR), which summarizes the curve in a single number.

In [None]:
precision4, recall4, _ = precision_recall_curve(y_test, probs)
auc_score4 = auc(recall4, precision4)
pr_auc['Decision Tree Classifier'] = np.round(auc_score4, 2)
print('Area under the PR Curve (AUCPR): %.2f' % auc_score4)
plot_precision_recall_curve(recall4, precision4)

## 5. Random Forests:


A Random Forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
rf_model = RandomForestClassifier(max_features = 8, min_samples_split = 12, n_estimators = 120)
rf_model.fit(X_train, y_train)
prediction5 = rf_model.predict(X_test)
accuracy5 = rf_model.score(X_test, y_test) 
print ('Model Accuracy:',accuracy5 * 100)

Storing model accuracy to plot for comparison with other Machine Learning models.

In [None]:
accuracies['Random Forests'] = np.round(accuracy5 * 100, 2)

1. Plotting Confusion Matrix to describe the performance of Random Forest Classifier on a set of test data.

In [None]:
conf_matrix(y_test, prediction5)

Plotting different metrics scores for the Random Forest Classifier for evaluation.

In [None]:
metrics_score(cm1)

* Plotting the average of different metrics scores for further evaluation.

In [None]:
cal_score('Random Forests', rf_model, 5)

Plotting Receiver Operating Characteristic (ROC) Curve, to illustrate the diagnostic ability of Random Forest Classifier as its discrimination threshold is varied and showing the Area under the ROC Curve (AUC) value which will tell us how much our model is capable of distinguishing between healthy and diabetic patients.

In [None]:
probs = rf_model.predict_proba(X_test)
probs = probs[:, 1]
auc5 = roc_auc_score(y_test, probs)
roc_auc['Random Forests Classifier']=np.round(auc5, 2)
print('Area under the ROC Curve (AUC): %.2f' % auc5)
fpr5, tpr5, _ = roc_curve(y_test, probs)
plot_roc_curve(fpr5, tpr5)

Plotting the Precision-Recall Curve across classification thresholds, much like the ROC Curve, and showing the Area under the Precision-Recall Curve (AUCPR), which summarizes the curve in a single number.

In [None]:
precision5, recall5, _ = precision_recall_curve(y_test, probs)
auc_score5 = auc(recall5, precision5)
pr_auc['Random Forests'] = np.round(auc_score5, 2)
print('Area under the PR Curve (AUCPR): %.2f' % auc_score5)
plot_precision_recall_curve(recall5, precision5)

## Feature Importance:

In [None]:
feature_imp = pd.Series(rf_model.feature_importances_,
                        index = X_train.columns).sort_values(ascending = False)

sns.barplot(x = feature_imp, y = feature_imp.index)
plt.xlabel('Feature Importance Scores')
plt.ylabel('Features')
plt.title("Feature Importance Ranking")
plt.show()

## 6. Gradient Boosting Machines

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
gbm_model = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2, n_estimators = 500)
gbm_model.fit(X_train, y_train)
prediction6 = gbm_model.predict(X_test)
accuracy6 = gbm_model.score(X_test, y_test) 
print ('Model Accuracy:',accuracy6 * 100)

Storing model accuracy to plot for comparison with other Machine Learning models.

In [None]:
accuracies['Gradient Boosting Machines'] = np.round(accuracy6 * 100, 2)

1. Plotting Confusion Matrix to describe the performance of GBM Classifier on a set of test data.

In [None]:
conf_matrix(y_test, prediction6)

Plotting different metrics scores for the GBM Classifier for evaluation.

In [None]:
metrics_score(cm1)

* Plotting the average of different metrics scores for further evaluation.

In [None]:
cal_score('Gradient Boosting Machines', gbm_model, 5)

Plotting Receiver Operating Characteristic (ROC) Curve, to illustrate the diagnostic ability of GBM Classifier as its discrimination threshold is varied and showing the Area under the ROC Curve (AUC) value which will tell us how much our model is capable of distinguishing between healthy and diabetic patients.

In [None]:
probs = gbm_model.predict_proba(X_test)
probs = probs[:, 1]
auc6 = roc_auc_score(y_test, probs)
roc_auc['Gradient Boosting Machine Classifier'] = np.round(auc6, 2)
print('Area under the ROC Curve (AUC): %.2f' % auc6)
fpr6, tpr6, _ = roc_curve(y_test, probs)
plot_roc_curve(fpr6, tpr6)

Plotting the Precision-Recall Curve across classification thresholds, much like the ROC Curve, and showing the Area under the Precision-Recall Curve (AUCPR), which summarizes the curve in a single number.

In [None]:
precision6, recall6, _ = precision_recall_curve(y_test, probs)
auc_score6 = auc(recall6, precision6)
pr_auc['Gradient Boosting Machine Classifier'] = np.round(auc_score6, 2)
print('Area under the PR Curve (AUCPR): %.2f' % auc_score6)
plot_precision_recall_curve(recall6, precision6)

## Feature Importance:

In [None]:
feature_imp = pd.Series(gbm_model.feature_importances_,
                        index = X_train.columns).sort_values(ascending = False)

sns.barplot(x = feature_imp, y = feature_imp.index)
plt.xlabel('Feature Importance Scores')
plt.ylabel('Features')
plt.title("Feature Importance Ranking")
plt.show()

## 7. XGBoost:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
xgb_model = XGBClassifier(learning_rate = 0.01,max_depth = 3, n_estimators = 500, subsample = 1 )
xgb_model.fit(X_train, y_train)
prediction7 = xgb_model.predict(X_test)
accuracy7 = xgb_model.score(X_test, y_test) 
print ('Model Accuracy:',accuracy7 * 100)

Storing model accuracy to plot for comparison with other Machine Learning models.

In [None]:
accuracies['XGBoost Classifier'] = np.round(accuracy7 * 100, 2)

1. Plotting Confusion Matrix to describe the performance of XGBM Classifier on a set of test data.

In [None]:
conf_matrix(y_test, prediction7)

Plotting different metrics scores for the XGBM Classifier for evaluation.

In [None]:
metrics_score(cm1)

* Plotting the average of different metrics scores for further evaluation.

In [None]:
cal_score('XGBoost Classifier', xgb_model, 5)

Plotting Receiver Operating Characteristic (ROC) Curve, to illustrate the diagnostic ability of XGBM Classifier as its discrimination threshold is varied and showing the Area under the ROC Curve (AUC) value which will tell us how much our model is capable of distinguishing between healthy and diabetic patients.

In [None]:
probs = xgb_model.predict_proba(X_test)
probs = probs[:, 1]
auc7 = roc_auc_score(y_test, probs)
roc_auc['XGB Machine Classifier']=np.round(auc7, 2)
print('Area under the ROC Curve (AUC): %.2f' % auc7)
fpr7, tpr7, _ = roc_curve(y_test, probs)
plot_roc_curve(fpr7, tpr7)

Plotting the Precision-Recall Curve across classification thresholds, much like the ROC Curve, and showing the Area under the Precision-Recall Curve (AUCPR), which summarizes the curve in a single number.

In [None]:
precision7, recall7, _ = precision_recall_curve(y_test, probs)
auc_score7 = auc(recall7, precision7)
pr_auc['XGB Machine Classifier'] = np.round(auc_score7, 2)
print('Area under the PR Curve (AUCPR): %.2f' % auc_score7)
plot_precision_recall_curve(recall7, precision7)

## Feature Importance:

In [None]:
# Use the XGBoost model fitted in this section (not gbm_model)
feature_imp = pd.Series(xgb_model.feature_importances_,
                        index = X_train.columns).sort_values(ascending = False)

sns.barplot(x = feature_imp, y = feature_imp.index)
plt.xlabel('Feature Importance Scores')
plt.ylabel('Features')
plt.title("Feature Importance Ranking")
plt.show()

## 8. Light GBM

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
lgbm_model = LGBMClassifier(learning_rate = 0.1, max_depth = 1, n_estimators = 200)
lgbm_model.fit(X_train, y_train)
prediction8 = lgbm_model.predict(X_test)
accuracy8 = lgbm_model.score(X_test, y_test) 
print ('Model Accuracy:',accuracy8 * 100)

Storing model accuracy to plot for comparison with other Machine Learning models.

In [None]:
accuracies['LightGBM Classifier'] = np.round(accuracy8 * 100, 2)

1. Plotting Confusion Matrix to describe the performance of LGBM Classifier on a set of test data.

In [None]:
conf_matrix(y_test, prediction8)

Plotting different metrics scores for the LGBM Classifier for evaluation.

In [None]:
metrics_score(cm1)

* Plotting the average of different metrics scores for further evaluation.

In [None]:
cal_score('LightGBM Classifier', lgbm_model, 5)

Plotting Receiver Operating Characteristic (ROC) Curve, to illustrate the diagnostic ability of LGBM Classifier as its discrimination threshold is varied and showing the Area under the ROC Curve (AUC) value which will tell us how much our model is capable of distinguishing between healthy and diabetic patients.

In [None]:
probs = lgbm_model.predict_proba(X_test)
probs = probs[:, 1]
auc8 = roc_auc_score(y_test, probs)
roc_auc['LightGBM Classifier'] = np.round(auc8, 2)
print('Area under the ROC Curve (AUC): %.2f' % auc8)
fpr8, tpr8, _ = roc_curve(y_test, probs)
plot_roc_curve(fpr8, tpr8)

Plotting the Precision-Recall Curve across classification thresholds, much like the ROC Curve, and showing the Area under the Precision-Recall Curve (AUCPR), which summarizes the curve in a single number.

In [None]:
precision8, recall8, _ = precision_recall_curve(y_test, probs)
auc_score8 = auc(recall8, precision8)
pr_auc['LightGBM Classifier'] = np.round(auc_score8, 2)
print('Area under the PR Curve (AUCPR): %.2f' % auc_score8)
plot_precision_recall_curve(recall8, precision8)
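
Each point on a Precision-Recall curve comes from one choice of threshold on the predicted probabilities. The sketch below computes a few such points by hand. The helper `precision_recall_at` and the toy data are hypothetical, and returning a precision of 1.0 when nothing is flagged is a convention assumed here for the end of the curve.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Convention assumed here: precision is 1.0 when no sample is flagged
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and predicted probabilities, made up for illustration only
toy_y = [0, 0, 1, 1]
toy_probs = [0.1, 0.4, 0.35, 0.8]
toy_curve = [(t, *precision_recall_at(toy_y, toy_probs, t)) for t in (0.3, 0.5, 0.9)]
for threshold, precision, recall in toy_curve:
    print(threshold, round(precision, 3), recall)
```

Raising the threshold trades recall for precision: at 0.3 both positives are found but one healthy patient is flagged as well, while at 0.5 every flagged patient is truly positive but one positive is missed.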

## Feature Importance:

In [None]:
feature_imp = pd.Series(lgbm_model.feature_importances_,
                        index = X_train.columns).sort_values(ascending = False)

sns.barplot(x = feature_imp, y = feature_imp.index)
plt.xlabel('Feature Importance Scores')
plt.ylabel('Features')
plt.title("Feature Importance Ranking")
plt.show()
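
One caveat when comparing these charts across models: the importance scales are not necessarily the same. With its default settings, LightGBM reports how many times each feature was used in a split, while scikit-learn's tree ensembles report values that sum to 1. A small sketch, using made-up split counts and a hypothetical helper `normalize_importances`, puts any such vector on a common 0 to 1 scale:

```python
def normalize_importances(importances):
    """Rescale non-negative importances so that they sum to 1."""
    total = sum(importances.values())
    if total == 0:
        return {name: 0.0 for name in importances}
    return {name: value / total for name, value in importances.items()}


# Made-up split counts, for illustration only
toy_split_counts = {"Glucose": 60, "BMI": 45, "Age": 30, "Insulin": 15}
toy_shares = normalize_importances(toy_split_counts)
toy_ranked = sorted(toy_shares.items(), key=lambda item: item[1], reverse=True)
print(toy_ranked)
```

Normalizing changes only the axis scale, not the ranking, so the order of the bars in each chart stays the same.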

## 9. CatBoost

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
catboost_model = CatBoostClassifier(depth = 8, iterations = 100, learning_rate = 0.1)
catboost_model.fit(X_train, y_train)
prediction9 = catboost_model.predict(X_test)
accuracy9 = catboost_model.score(X_test, y_test)
print('Model Accuracy:', accuracy9 * 100)

Storing model accuracy to plot for comparison with other Machine Learning models.

In [None]:
accuracies['CatBoost Classifier'] = np.round(accuracy9 * 100, 2)

Plotting the confusion matrix to describe the performance of the CatBoost classifier on the test set.

In [None]:
conf_matrix(y_test, prediction9)

Plotting different metric scores for the CatBoost classifier for evaluation.

In [None]:
metrics_score(cm1)

Plotting the average of the different metric scores for further evaluation.

In [None]:
cal_score('CatBoost Classifier', catboost_model, 5)

Plotting the Receiver Operating Characteristic (ROC) curve, which illustrates the diagnostic ability of the CatBoost classifier as its discrimination threshold is varied, and reporting the Area Under the ROC Curve (AUC), which tells us how well the model distinguishes between healthy and diabetic patients.

In [None]:
probs = catboost_model.predict_proba(X_test)
probs = probs[:, 1]
auc9 = roc_auc_score(y_test, probs)
roc_auc['CatBoost Classifier'] = np.round(auc9, 2)
print('Area under the ROC Curve (AUC): %.2f' % auc9)
fpr9, tpr9, _ = roc_curve(y_test, probs)
plot_roc_curve(fpr9, tpr9)

Plotting the Precision-Recall curve, which, much like the ROC curve, shows precision and recall at different thresholds, and reporting the Area Under the Precision-Recall Curve (AUCPR), a single-number summary of the information in that curve.

In [None]:
precision9, recall9, _ = precision_recall_curve(y_test, probs)
auc_score9 = auc(recall9, precision9)
pr_auc['CatBoost Classifier'] = np.round(auc_score9, 2)
print('Area under the PR Curve (AUCPR): %.2f' % auc_score9)
plot_precision_recall_curve(recall9, precision9)

## Feature Importance:

In [None]:
feature_imp = pd.Series(catboost_model.feature_importances_,
                        index = X_train.columns).sort_values(ascending = False)

sns.barplot(x = feature_imp, y = feature_imp.index)
plt.xlabel('Feature Importance Scores')
plt.ylabel('Features')
plt.title("Feature Importance Ranking")
plt.show()

## Performance Comparison

Plotting the accuracy metric score of the machine learning models for comparison.

In [None]:
models_tuned = [
    log_model,
    KNN_model,
    SVC_model,
    CART_model,
    rf_model,
    gbm_model,
    catboost_model,
    lgbm_model,
    xgb_model]

# Collect one row per model and build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0).
rows = []

for model in models_tuned:
    name = model.__class__.__name__
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    rows.append([name, acc * 100])

results = pd.DataFrame(rows, columns = ["Models", "Accuracy"])
results

In [None]:
plt.figure(figsize = (15, 8))
sns.set_palette('cividis')
ax = sns.barplot(x = list(accuracies.keys()), y = list(accuracies.values()))
plt.yticks(np.arange(0, 100, 10))
plt.ylabel('Percentage %', labelpad = 10)
plt.xlabel('Algorithms', labelpad = 10)
plt.title('Accuracy Scores Comparison', pad = 20)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x(), p.get_height()), xytext = (p.get_x() + 0.3, p.get_height() + 1.02))
plt.show()


Plotting the average accuracy metric score of the machine learning models for comparison.

In [None]:
plt.figure(figsize = (15, 8))
sns.set_palette('viridis')
ax = sns.barplot(x = list(avg_accuracies.keys()), y = list(avg_accuracies.values()))
plt.yticks(np.arange(0, 100, 10))
plt.ylabel('Percentage %', labelpad = 10)
plt.xlabel('Algorithms', labelpad = 10)
plt.title('Average Accuracy Scores Comparison', pad = 20)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x(), p.get_height()), xytext = (p.get_x() + 0.3, p.get_height() + 1.02))
plt.show()


Plotting the ROC Curve of the machine learning models for comparison.

In [None]:
plt.figure(figsize = (8, 6))
sns.set_palette('Set1')
plt.plot(fpr1, tpr1, label = 'Logistic Regression')
plt.plot(fpr2, tpr2, label = 'KNeighbors Classifier')
plt.plot(fpr3, tpr3, label = 'SVM')
plt.plot(fpr4, tpr4, label = 'Decision Tree')
plt.plot(fpr5, tpr5, label = 'Random Forests')
plt.plot(fpr6, tpr6, label = 'Gradient Boosting Machine')
plt.plot(fpr7, tpr7, label = 'XGBoost')
plt.plot(fpr8, tpr8, label = 'LightGBM')
plt.plot(fpr9, tpr9, label = 'CatBoost')
plt.plot([0, 1], [0, 1], linestyle = '--')
plt.ylabel('True Positive Rate', labelpad = 10)
plt.xlabel('False Positive Rate', labelpad = 10)
plt.title('Receiver Operating Characteristic (ROC) Curves', pad = 20)
plt.legend()
plt.show()

Plotting the AUC values of ROC Curve of the machine learning models for comparison.

In [None]:
plt.figure(figsize = (15, 8))
sns.set_palette('magma')
ax = sns.barplot(x = list(roc_auc.keys()), y = list(roc_auc.values()))
#plt.yticks(np.arange(0,100,10))
plt.ylabel('Score', labelpad = 10)
plt.xlabel('Algorithms', labelpad = 10)
plt.title('Area under the ROC Curves (AUC)', pad = 20)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x(), p.get_height()), xytext = (p.get_x() + 0.3, p.get_height() + 0.01))
plt.show()


Plotting the PR Curve of the machine learning models for comparison.

In [None]:
plt.figure(figsize = (8, 6))
sns.set_palette('Set1')
plt.plot(recall1, precision1, label = 'Logistic Regression PRC')
plt.plot(recall2, precision2, label = 'KNN PRC')
plt.plot(recall3, precision3, label = 'SVM PRC')
plt.plot(recall4, precision4, label = 'CART PRC')
plt.plot(recall5, precision5, label = 'Random Forests PRC')
plt.plot(recall6, precision6, label = 'GBM PRC')
plt.plot(recall7, precision7, label = 'XGB PRC')
# LightGBM and CatBoost use their own curves (model 8 and model 9)
plt.plot(recall8, precision8, label = 'LGBM PRC')
plt.plot(recall9, precision9, label = 'CatBoost PRC')
plt.ylabel('Precision', labelpad = 10)
plt.xlabel('Recall', labelpad = 10)
plt.title('Precision Recall Curves', pad = 20)
plt.legend()
plt.show()

Plotting the AUC values of PR Curve of the machine learning models for comparison.

In [None]:
plt.figure(figsize = (15, 8))
sns.set_palette('mako')
ax = sns.barplot(x = list(pr_auc.keys()), y = list(pr_auc.values()))
plt.ylabel('Score', labelpad = 10)
plt.xlabel('Algorithms', labelpad = 10)
plt.title('Area under the PR Curves (AUCPR)', pad = 20)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x(), p.get_height()), xytext = (p.get_x() + 0.3, p.get_height() + 0.01))
plt.show()

## Final Model:

## Random Forests:


A Random Forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
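
The two ingredients in that description, bootstrap sub-samples and averaging, can be sketched without any library. The snippet below is a simplified illustration with hypothetical names (`bootstrap_indices`, `average_votes`) and hand-written per-tree probabilities; it does not train real trees.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one tree's sub-sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def average_votes(per_tree_probs, threshold=0.5):
    """Average each sample's positive-class probability over all trees."""
    n_trees = len(per_tree_probs)
    n_samples = len(per_tree_probs[0])
    means = [sum(tree[i] for tree in per_tree_probs) / n_trees
             for i in range(n_samples)]
    labels = [1 if m >= threshold else 0 for m in means]
    return means, labels


toy_rng = random.Random(5)
toy_sample = bootstrap_indices(8, toy_rng)  # rows may repeat or be left out

# Pretend outputs of three trees for two patients (made-up numbers)
toy_tree_probs = [[0.9, 0.2], [0.6, 0.4], [0.3, 0.3]]
toy_means, toy_tree_labels = average_votes(toy_tree_probs)
print(toy_sample, toy_means, toy_tree_labels)
```

Because each tree sees a slightly different sample, their individual errors tend to differ, and averaging smooths them out. That is the sense in which the ensemble controls over-fitting.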

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_st, y, test_size = 0.20, random_state = 5)
rf_model = RandomForestClassifier(max_features = 8, min_samples_split = 12, n_estimators = 120)
rf_model.fit(X_train, y_train)
prediction5 = rf_model.predict(X_test)
accuracy5 = rf_model.score(X_test, y_test)
print('Model Accuracy:', accuracy5 * 100)

Plotting the confusion matrix to describe the performance of the Random Forest classifier on the test set.

In [None]:
conf_matrix(y_test, prediction5)

Plotting different metric scores for the Random Forest classifier for evaluation.

In [None]:
metrics_score(cm1)

Plotting the average of the different metric scores for further evaluation.

In [None]:
cal_score('Random Forests', rf_model, 10)

# REPORTING



### Our aim in this study was to predict diabetes by applying different classification models to the 'diabetes' data set.

### First, the data set was read and displayed.

### Missing values were filled with the median of the variable in which they occurred.

### Then outliers were detected and suppressed.

### Next, the values of the variables were divided into groups according to general health standards, and new variables were created from these groups.

### Values in all variables were standardized from 0 to 1.

### Then predictions were made with 9 different classification models.

### The predictions were evaluated with different metrics. All results were visualized.

### Finally, a prediction accuracy of over 91% was achieved with the random forests model.