# CSCE 585: Machine Learning Systems Final Project

## Project Title: Diabetes Prediction Using Machine Learning

*Diabetes mellitus* is a serious chronic disease that affects a large and growing number of people. Recent studies link it to age, obesity, lack of exercise, family history of diabetes, lifestyle, poor diet, and high blood pressure, among other factors. People with diabetes are at high risk of complications such as heart disease, kidney disease, stroke, eye problems, and nerve damage. With recent advances in machine learning (ML), several researchers have applied ML models to predict diabetes in patients from such factors. However, to our knowledge there is no rigorous and comprehensive evaluation of different ML models that establishes best practices for this specific problem.

In this project, we implement and evaluate several classification methods (e.g., decision tree, random forest, support vector machine, and neural network) on one dataset to determine which methods perform better and under which conditions. We use the Pima Indians onset of diabetes dataset, a standard machine learning dataset from the UCI Machine Learning Repository. It contains medical record data for Pima Indian patients and records whether each patient had an onset of diabetes within five years.


The dataset has the following properties:
* Number of instances: 768

* Number of attributes: 8 plus the class label

* Each row in the dataset (all values numeric) has the following columns:
   * Number of times pregnant
   * Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   * Diastolic blood pressure (mm Hg)
   * Triceps skin fold thickness (mm)
   * 2-Hour serum insulin (mu U/ml)
   * Body mass index (weight in kg/(height in m)^2)
   * Diabetes pedigree function
   * Age (years)
   * Class variable (0 or 1)

In [1]:
# import the necessary modules here!
import pandas as pd
import os
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model
import numpy as np
#import matplotlib

In [2]:
# form the path to the dataset: <parent of the working directory>/Dataset/diabetes.csv
# (os.path keeps this portable across operating systems)
path = os.path.join(os.path.dirname(os.getcwd()), "Dataset", "diabetes.csv")

# load the dataset to pandas dataframe
df = pd.read_csv(path)

In [3]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
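
The rows above contain zeros for `SkinThickness` and `Insulin`. A common reading of this dataset is that a zero in such columns marks a measurement that was not recorded; that reading is an assumption here, as is the choice of columns in `zero_means_missing`. The sketch below retypes the five printed rows so it runs on its own, counts the zeros, and converts them to `NaN` so that an imputation step could treat them as missing.

```python
import numpy as np
import pandas as pd

# The five rows printed by df.head(), retyped so the sketch is self-contained.
head = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns stands for "not recorded".
zero_means_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column, then replace them with NaN in a copy of the data.
zero_counts = (head[zero_means_missing] == 0).sum()
cleaned = head.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

print(zero_counts)
```

On the full dataset the same two steps would apply to `df` instead of `head`.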


In [4]:
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [5]:
experiment = setup(df, target="Outcome")

Unnamed: 0,Description,Value
0,Session id,1384
1,Target,Outcome
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,8
8,Preprocess,True
9,Imputation type,simple
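
The train and test shapes reported above are consistent with a 70/30 split of the 768 rows. The arithmetic can be checked with a minimal sketch, assuming `setup` was left at its default `train_size` of 0.7 (the cell does not set it) and that the training share is rounded down; the rounding rule is an assumption chosen because it reproduces the reported shapes.

```python
import math

n_rows = 768            # "Original data shape" reported by setup()
train_fraction = 0.7    # assumption: default train_size, not set in the cell

n_train = math.floor(train_fraction * n_rows)   # rows used for training
n_test = n_rows - n_train                       # rows held out for testing

print(n_train, n_test)
```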


In [6]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7746,0.825,0.5652,0.7284,0.6277,0.4722,0.4854,0.007
lr,Logistic Regression,0.7709,0.8224,0.5652,0.7182,0.6238,0.4649,0.4772,0.258
ridge,Ridge Classifier,0.7709,0.0,0.5494,0.7296,0.6189,0.4618,0.4761,0.006
et,Extra Trees Classifier,0.7616,0.8157,0.5395,0.7138,0.6088,0.4431,0.4555,0.033
rf,Random Forest Classifier,0.7598,0.8202,0.5825,0.6785,0.6192,0.4477,0.4552,0.037
lightgbm,Light Gradient Boosting Machine,0.7524,0.7983,0.6199,0.6627,0.632,0.4474,0.4544,0.012
nb,Naive Bayes,0.7485,0.8012,0.5865,0.6602,0.6159,0.4308,0.4362,0.11
qda,Quadratic Discriminant Analysis,0.7392,0.806,0.5278,0.6624,0.5796,0.3958,0.4064,0.007
gbc,Gradient Boosting Classifier,0.7376,0.8097,0.5719,0.6477,0.5997,0.4071,0.4141,0.019
ada,Ada Boost Classifier,0.7265,0.7963,0.5561,0.6224,0.5815,0.3808,0.3861,0.017


In [7]:
predict_model(best_model, df.tail())

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.8,0.75,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,prediction_label,prediction_score
0,10.0,101.0,76.0,48.0,180.0,32.900002,0.171,63.0,0,0,0.7959
1,2.0,122.0,70.0,27.0,0.0,36.799999,0.34,27.0,0,0,0.6561
2,5.0,121.0,72.0,23.0,112.0,26.200001,0.245,30.0,0,0,0.8345
3,1.0,126.0,60.0,0.0,0.0,30.1,0.349,47.0,1,0,0.7538
4,1.0,93.0,70.0,31.0,0.0,30.4,0.315,23.0,0,0,0.9256
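
The metrics row above can be reproduced by hand from the five predictions, which helps explain why accuracy is 0.8 while recall, precision, F1, Kappa, and MCC are all 0: the model predicts class 0 for every row, so it misses the single positive case. The sketch below is plain Python and makes two assumptions: a ratio with a zero denominator is reported as 0, and `prediction_score` is the probability of the predicted label. With only five rows, these numbers illustrate the definitions rather than the model's quality.

```python
import math

# Labels, predictions, and scores copied from the five rows shown above.
y_true = [0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0]
score = [0.7959, 0.6561, 0.8345, 0.7538, 0.9256]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
n = len(y_true)

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0

accuracy = (tp + tn) / n
recall = safe_div(tp, tp + fn)
precision = safe_div(tp, tp + fp)
f1 = safe_div(2 * precision * recall, precision + recall)

# Cohen's kappa: observed agreement compared with agreement expected by chance.
p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
kappa = safe_div(accuracy - p_expected, 1 - p_expected)

# Matthews correlation coefficient.
mcc = safe_div(tp * tn - fp * fn,
               math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))

# AUC as the share of (positive, negative) pairs that are ranked correctly,
# using the probability of class 1 derived from prediction_score.
p_class1 = [s if p == 1 else 1 - s for s, p in zip(score, y_pred)]
pos = [s for s, t in zip(p_class1, y_true) if t == 1]
neg = [s for s, t in zip(p_class1, y_true) if t == 0]
auc = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg) / (len(pos) * len(neg))

print(accuracy, auc, recall, precision, f1, kappa, mcc)
```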


In [8]:
# save the best model, together with its preprocessing pipeline, as a pickle file that can be reloaded later
save_model(best_model, model_name = "best_classification_model")

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=/var/folders/36/2yr7cfr96b983psr0fv9b4w00000gn/T/joblib),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Pregnancies', 'Glucose',
                                              'BloodPressure', 'SkinThickness',
                                              'Insulin', 'BMI',
                                              'DiabetesPedigreeFunction',
                                              'Age'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               missing_values...
                                                               fill_value='constant',
                                                               missing

In [9]:
# testing the model loading
# load_model("logistic-regression-model")
trained_model_diabetes = load_model("best_classification_model")

Transformation Pipeline and Model Successfully Loaded


In [10]:
# perform inference with the loaded model
predict_model(trained_model_diabetes, df.tail())

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,0.8,0.75,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,prediction_label,prediction_score
0,10.0,101.0,76.0,48.0,180.0,32.900002,0.171,63.0,0,0,0.7959
1,2.0,122.0,70.0,27.0,0.0,36.799999,0.34,27.0,0,0,0.6561
2,5.0,121.0,72.0,23.0,112.0,26.200001,0.245,30.0,0,0,0.8345
3,1.0,126.0,60.0,0.0,0.0,30.1,0.349,47.0,1,0,0.7538
4,1.0,93.0,70.0,31.0,0.0,30.4,0.315,23.0,0,0,0.9256


In [25]:
from simple_nn import Network

# helper function to one-hot encode a label as a (2, 1) column: 0 (negative) -> [1, 0], 1 (positive) -> [0, 1]
def one_hot_encode(y):
    encoded = np.zeros((2, 1))
    encoded[y] = 1.0
    return encoded

# transform the pandas data frame to numpy arrays associated with features and labels
dataset_numpy = df.to_numpy()
features = dataset_numpy[:,:8]
label = dataset_numpy[:,8:]
label = label.astype(int)

# Break the data down into training (first 60%) and test (remaining 40%) sets
breakdown = int(0.6 * len(label))
features_train = features[:breakdown,:]
label_train = label[:breakdown,:]
features_test = features[breakdown:,:]
label_test = label[breakdown:,:]

train_x = [np.reshape(i, (8, 1)) for i in features_train]
train_y = [one_hot_encode(j) for j in label_train]
test_x = [np.reshape(i, (8, 1)) for i in features_test]
test_y = [one_hot_encode(j) for j in label_test]

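# materialize the pairs as lists: zip() returns a one-shot iterator in Python 3,
# which would be exhausted after a single pass over the data
train = list(zip(train_x, train_y))
test = list(zip(test_x, test_y))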
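
The data preparation above turns each sample into an (8, 1) feature column and a (2, 1) one-hot label column, then pairs them. A minimal sketch of the same steps on three made-up samples follows; `one_hot_column`, `toy_features`, and `toy_labels` are hypothetical names used only for illustration. The pairs are stored in a list because a bare `zip` object can be traversed only once in Python 3; whether `simple_nn` makes its own copy is not shown in this notebook.

```python
import numpy as np

def one_hot_column(label, num_classes=2):
    """Return a (num_classes, 1) column vector with 1.0 at index `label`."""
    encoded = np.zeros((num_classes, 1))
    encoded[int(label)] = 1.0
    return encoded

# Three made-up samples: 8 features each, labels stored as an (n, 1) integer array.
toy_features = np.arange(24, dtype=float).reshape(3, 8)
toy_labels = np.array([[1], [0], [1]])

xs = [row.reshape(8, 1) for row in toy_features]
ys = [one_hot_column(row[0]) for row in toy_labels]

# list(...) keeps the pairs reusable across several passes over the data.
pairs = list(zip(xs, ys))

# Vectorized alternative: row i of the identity matrix is the encoding of class i.
one_hot_matrix = np.eye(2)[toy_labels.ravel()]   # shape (3, 2)
```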

In [26]:
# initialize the network and perform the training/inference
network = Network([8, 8, 8, 2])
network.stochastic_gradient_decent(train, 10, 10, 2, test)

Epoch 0: accuracy = 69.8051948051948%
Epoch 1: accuracy = 29.87012987012987%
Epoch 2: accuracy = 69.8051948051948%
Epoch 3: accuracy = 69.8051948051948%
Epoch 4: accuracy = 69.8051948051948%
Epoch 5: accuracy = 30.1948051948052%
Epoch 6: accuracy = 69.8051948051948%
Epoch 7: accuracy = 69.8051948051948%
Epoch 8: accuracy = 69.8051948051948%
Epoch 9: accuracy = 69.8051948051948%


In [None]:
'''
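The implemented neural network reaches at most 69.8% test accuracy, below the 77.5% accuracy of the best
model found with the pycaret library (Linear Discriminant Analysis). The two figures are not measured on the
same split, so the comparison is only indicative. The network's accuracy also jumps between roughly 70% and 30%
from epoch to epoch instead of improving steadily. We will perform additional analysis to determine the reasons
behind these observations. We will also try to improve the accuracy of the neural network by changing its key
parameters.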
'''
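
One candidate cause for the follow-up analysis, offered as a hypothesis and not as a result, is that the features reach the network on their raw scales: in the rows shown earlier, `Insulin` takes values in the hundreds while `DiabetesPedigreeFunction` is mostly below 1. A minimal sketch of feature standardization is given below. It is a technique swapped in for illustration, not part of the original pipeline, and `standardize`, `toy_train`, and `toy_test` are hypothetical names with made-up values.

```python
import numpy as np

def standardize(train, test):
    """Scale each column to zero mean and unit variance using training statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0    # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Toy stand-ins for features_train / features_test (values are illustrative only).
rng = np.random.default_rng(0)
toy_train = rng.uniform(0, 200, size=(6, 8))
toy_test = rng.uniform(0, 200, size=(4, 8))

scaled_train, scaled_test = standardize(toy_train, toy_test)
```

The test set is scaled with the training mean and standard deviation so that no information from the test samples leaks into the preprocessing.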