# GSB 545: Advanced Machine Learning for Business Analytics

## Predicting Diabetes

In this lab we'll be using a dataset from Kaggle yet again... it's just so fun and rich! We're using publicly available data from the Centers for Disease Control and Prevention (CDC), in particular the Behavioral Risk Factor Surveillance System (BRFSS).

### DATASET: 
https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?resource=download&select=diabetes_binary_health_indicators_BRFSS2015.csv 

### Primary Goals:

- Predict diabetes (binary classification)

### Assignment Specs:

You need to explore multiple neural network models to solve this problem. You may use at most one model from earlier in our course, if you wish to see if neural networks can be beaten (I think this should be your best model from the heart disease lab).


In [74]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, accuracy_score, f1_score
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from keras import models
import pydot
import graphviz
from tensorflow.keras.utils import plot_model
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.compose import ColumnTransformer, make_column_selector


In [75]:
df = pd.read_csv("data/diabetes_binary_5050split_health_indicators_BRFSS2015.csv")


In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Diabetes_binary       70692 non-null  float64
 1   HighBP                70692 non-null  float64
 2   HighChol              70692 non-null  float64
 3   CholCheck             70692 non-null  float64
 4   BMI                   70692 non-null  float64
 5   Smoker                70692 non-null  float64
 6   Stroke                70692 non-null  float64
 7   HeartDiseaseorAttack  70692 non-null  float64
 8   PhysActivity          70692 non-null  float64
 9   Fruits                70692 non-null  float64
 10  Veggies               70692 non-null  float64
 11  HvyAlcoholConsump     70692 non-null  float64
 12  AnyHealthcare         70692 non-null  float64
 13  NoDocbcCost           70692 non-null  float64
 14  GenHlth               70692 non-null  float64
 15  MentHlth           

There are no missing values in the dataset, but all the values are currently stored as floats even though none of them has a fractional part.

In [77]:
df.head()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,3.0,5.0,30.0,0.0,1.0,4.0,6.0,8.0
1,0.0,1.0,1.0,1.0,26.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,12.0,6.0,8.0
2,0.0,0.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,10.0,0.0,1.0,13.0,6.0,8.0
3,0.0,1.0,1.0,1.0,28.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,0.0,3.0,0.0,1.0,11.0,6.0,8.0
4,0.0,0.0,0.0,1.0,29.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,8.0


In [78]:
# Convert float columns to integers (every value is a whole number, so nothing is lost)
for column in df.columns:
    if df[column].dtype == 'float64':
        df[column] = df[column].astype(np.int64)

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Diabetes_binary       70692 non-null  int64
 1   HighBP                70692 non-null  int64
 2   HighChol              70692 non-null  int64
 3   CholCheck             70692 non-null  int64
 4   BMI                   70692 non-null  int64
 5   Smoker                70692 non-null  int64
 6   Stroke                70692 non-null  int64
 7   HeartDiseaseorAttack  70692 non-null  int64
 8   PhysActivity          70692 non-null  int64
 9   Fruits                70692 non-null  int64
 10  Veggies               70692 non-null  int64
 11  HvyAlcoholConsump     70692 non-null  int64
 12  AnyHealthcare         70692 non-null  int64
 13  NoDocbcCost           70692 non-null  int64
 14  GenHlth               70692 non-null  int64
 15  MentHlth              70692 non-null  int64
 16  Phys

All float columns are now stored as integers.

In [80]:
df['Diabetes_binary'].value_counts()

Diabetes_binary
0    35346
1    35346
Name: count, dtype: int64

I used the balanced (50/50 split) dataset from Kaggle so that I did not have to reweight or resample the classes myself.
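For context, here is a minimal sketch of the kind of class weighting a balanced dataset makes unnecessary. It uses the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the skewed 9000/1000 counts are hypothetical and only there for illustration; the 35346/35346 counts come from the `value_counts()` output above.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Counts from the balanced dataset used in this notebook: both weights come out as 1.0
balanced = balanced_class_weights({0: 35346, 1: 35346})

# Hypothetical imbalanced counts, for comparison only
skewed = balanced_class_weights({0: 9000, 1: 1000})

print(balanced)
print(skewed)
```

With equal class counts every weight is 1.0, so weighting would change nothing here.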

In [81]:
X = df.drop(["Diabetes_binary"], axis=1)
y = df["Diabetes_binary"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train test split the data with Diabetes_binary being the response variable. 

## First Neural Network

In [82]:
#construct the model
inputs = keras.Input(shape=(X_train.shape[1],))
x = layers.Dense(22, activation = 'relu')(inputs)
x = layers.Dense(15, activation = 'relu')(x)
outputs = layers.Dense(2, activation='softmax')(x)

model = keras.Model(inputs=inputs, outputs=outputs, name="Diabetes_model")

In [83]:
model.summary()

Model: "Diabetes_model"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer_6 (InputLayer)           │ (None, 21)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_20 (Dense)                     │ (None, 22)                  │             484 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_21 (Dense)                     │ (None, 15)                  │             345 │
├──────────────────────────────────────┼────────

Total params: 861 (3.36 KB)
Trainable params: 861 (3.36 KB)
Non-trainable params: 0 (0.00 B)


This neural network is made up of an input layer, two hidden layers, and an output layer. The input layer takes in the health data for each person. The first hidden layer has 22 units that learn patterns in the data, followed by a second hidden layer with 15 units for deeper analysis. The output layer has 2 units with a softmax function, which gives the model’s prediction: the probability of either having diabetes or not.
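As a sanity check on the architecture described above, the parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below is plain Python, and the helper name `dense_params` is made up for this example.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 21 input features -> 22 units -> 15 units -> 2 output units
layer_sizes = [21, 22, 15, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)      # parameters per dense layer
print(total_params)   # should match "Total params" in the summary
```

The first two values agree with the 484 and 345 shown in the summary, and the total agrees with the reported 861.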


In [84]:
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=["accuracy"],
)

history = model.fit(X_train, y_train, batch_size=64, epochs=10, validation_split=0.1, verbose=0)



I used SparseCategoricalCrossentropy because my response variable is only 0 or 1, and this loss function is designed to work directly with that format while still comparing them to the model’s two-class probability output.
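To make the loss concrete, here is a small hand-rolled sketch of what sparse categorical cross-entropy computes: for each row, take the probability the model assigned to the true integer label and average the negative logs. This is an illustration only, not the Keras implementation (a real implementation would also guard against `log(0)`), and the labels and probabilities are made-up values.

```python
import math

def sparse_categorical_crossentropy(y_true, y_prob):
    """Mean of -log(probability assigned to the true class)."""
    losses = [-math.log(probs[label]) for label, probs in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

# Hypothetical integer labels and two-class softmax outputs
y_true = [0, 1, 1]
y_prob = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

loss = sparse_categorical_crossentropy(y_true, y_prob)
print(round(loss, 4))
```

Because the labels index directly into the probability vector, there is no need to one-hot encode `Diabetes_binary` first.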

In [85]:
# Predict class labels
y_pred_probs = model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)

# Show metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n {confusion_matrix(y_test, y_pred)}")
print(classification_report(y_test, y_pred, digits=4))

Accuracy: 0.747719074899215
Confusion Matrix:
 [[5193 1897]
 [1670 5379]]
              precision    recall  f1-score   support

           0     0.7567    0.7324    0.7444      7090
           1     0.7393    0.7631    0.7510      7049

    accuracy                         0.7477     14139
   macro avg     0.7480    0.7478    0.7477     14139
weighted avg     0.7480    0.7477    0.7477     14139




## Second Neural Network

In [86]:
# Construct the model
inputs = keras.Input(shape=(X_train.shape[1],))
x = layers.Dense(64, activation='relu')(inputs)
x = layers.Dropout(0.3)(x)
x = layers.Dense(32, activation='relu')(x)
x = layers.Dense(16, activation='relu')(x)
outputs = layers.Dense(2, activation='softmax')(x)

model2 = keras.Model(inputs=inputs, outputs=outputs, name="Diabetes_model_2")

In [87]:
model2.summary()

Model: "Diabetes_model_2"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer_7 (InputLayer)           │ (None, 21)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_23 (Dense)                     │ (None, 64)                  │           1,408 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_2 (Dropout)                  │ (None, 64)                  │               0 │
├──────────────────────────────────────┼────────

Total params: 4,050 (15.82 KB)
Trainable params: 4,050 (15.82 KB)
Non-trainable params: 0 (0.00 B)

This neural network has three hidden layers that help the model learn more complex patterns, with a dropout layer (rate 0.3) after the first hidden layer to reduce overfitting. The final layer uses softmax to output the probabilities of having or not having diabetes.
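The dropout idea can be sketched in a few lines of plain Python. This is the usual "inverted dropout" formulation: during training each activation is zeroed with probability `rate`, and the survivors are scaled by `1 / (1 - rate)` so the expected value is unchanged. At prediction time dropout is skipped. The function name, the seed, and the all-ones activations are assumptions made for illustration.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
activations = [1.0] * 10000       # hypothetical activations from a hidden layer
dropped = dropout(activations, rate=0.3, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

Roughly 30% of the activations are zeroed while the mean stays close to 1.0, which is why the layer adds no parameters to the summary.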

In [88]:
model2.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=["accuracy"],
)

history = model2.fit(X_train, y_train, batch_size=64, epochs=10, validation_split=0.1, verbose=0)

scores = model2.evaluate(X_test, y_test, verbose=0)

In [89]:
# Predict class labels
y_pred_probs = model2.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)

# Show metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n {confusion_matrix(y_test, y_pred)}")
print(classification_report(y_test, y_pred, digits=4))

Accuracy: 0.7252988188697927
Confusion Matrix:
 [[5873 1217]
 [2667 4382]]
              precision    recall  f1-score   support

           0     0.6877    0.8283    0.7515      7090
           1     0.7826    0.6216    0.6929      7049

    accuracy                         0.7253     14139
   macro avg     0.7352    0.7250    0.7222     14139
weighted avg     0.7350    0.7253    0.7223     14139


## Third Neural Network

In [90]:
inputs = keras.Input(shape=(X_train.shape[1],))
x = layers.Dense(32, activation='relu')(inputs)
x = layers.BatchNormalization()(x)
x = layers.Dense(16, activation='relu')(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(2, activation='softmax')(x)

model3 = keras.Model(inputs=inputs, outputs=outputs, name="Diabetes_model_3")


In [91]:
model3.summary()

Model: "Diabetes_model_3"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ input_layer_8 (InputLayer)           │ (None, 21)                  │               0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_27 (Dense)                     │ (None, 32)                  │             704 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_4                │ (None, 32)                  │             128 │
│ (BatchNormalization)                 │        

Total params: 1,458 (5.70 KB)
Trainable params: 1,362 (5.32 KB)
Non-trainable params: 96 (384.00 B)


This model processes health data through two hidden layers, each followed by batch normalization, which helps stabilize and speed up training. 
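A minimal sketch of what batch normalization does to one unit's activations during training: subtract the batch mean, divide by the batch standard deviation (with a small epsilon for stability), then apply a learned scale `gamma` and shift `beta`. The input values are hypothetical, and this omits the moving averages the layer keeps for use at prediction time.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize a batch of activations for a single unit, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# Hypothetical activations of one unit across a mini-batch of four rows
normalized = batch_norm([18.0, 26.0, 31.0, 45.0])
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)

# Each normalization layer stores 4 values per unit: gamma and beta (trainable)
# plus a moving mean and moving variance (non-trainable).
units = [32, 16]
bn_params = [4 * u for u in units]
non_trainable = sum(2 * u for u in units)
print(bn_params, non_trainable)
```

The 128 parameters on the first normalization layer and the 96 non-trainable parameters in the summary both follow from this four-values-per-unit bookkeeping.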

In [92]:
model3.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer=keras.optimizers.RMSprop(),
    metrics=["accuracy"],
)

history = model3.fit(X_train, y_train, batch_size=64, epochs=10, validation_split=0.1, verbose=0)

scores = model3.evaluate(X_test, y_test, verbose=0)

In [93]:
# Predict class labels
y_pred_probs = model3.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)

# Show metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n {confusion_matrix(y_test, y_pred)}")
print(classification_report(y_test, y_pred, digits=4))

Accuracy: 0.7543673527123559
Confusion Matrix:
 [[5098 1992]
 [1481 5568]]
              precision    recall  f1-score   support

           0     0.7749    0.7190    0.7459      7090
           1     0.7365    0.7899    0.7623      7049

    accuracy                         0.7544     14139
   macro avg     0.7557    0.7545    0.7541     14139
weighted avg     0.7558    0.7544    0.7541     14139


## XGBoost Model

In [94]:
# Preprocessing pipeline for categorical and numerical columns
ct = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore', drop="first"), make_column_selector(dtype_include=object)),
    ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))
  ],
  remainder="passthrough"
).set_output(transform="pandas")

# XGBoost pipeline: preprocessing followed by the classifier
xgboost_pipeline = Pipeline(
  [("preprocessing", ct),
   ("xgboost", XGBClassifier())]
).set_output(transform="pandas")

xgboost_pipeline.fit(X_train, y_train)
y_pred = xgboost_pipeline.predict(X_test)

# Show metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Confusion Matrix:\n {confusion_matrix(y_test, y_pred)}")
print(classification_report(y_test, y_pred, digits=4))


Accuracy: 0.7484263384963576
Confusion Matrix:
 [[5009 2081]
 [1476 5573]]
              precision    recall  f1-score   support

           0     0.7724    0.7065    0.7380      7090
           1     0.7281    0.7906    0.7581      7049

    accuracy                         0.7484     14139
   macro avg     0.7503    0.7485    0.7480     14139
weighted avg     0.7503    0.7484    0.7480     14139




## Conclusion

All the models perform very similarly, with test accuracies between roughly 0.73 and 0.75. The third neural network, with batch normalization, performs best overall because it has the highest accuracy (0.7544). Its recall on the diabetes class (0.7899) is essentially tied with XGBoost's (0.7906) and well ahead of the first two networks (0.7631 and 0.6216). This suggests that it was not only the most reliable at making correct predictions overall, but also among the most effective at correctly identifying individuals with diabetes, which is especially important in a healthcare context where missing positive cases can have serious consequences.
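The accuracy and recall figures compared here can be recomputed directly from the confusion matrices printed earlier. The sketch below assumes the scikit-learn layout, where rows are true labels and columns are predicted labels, so each matrix reads `[[TN, FP], [FN, TP]]`. The model labels and the helper name `summarize` are just for this example.

```python
# Confusion matrices copied from the outputs above
confusion = {
    "NN 1": [[5193, 1897], [1670, 5379]],
    "NN 2": [[5873, 1217], [2667, 4382]],
    "NN 3": [[5098, 1992], [1481, 5568]],
    "XGBoost": [[5009, 2081], [1476, 5573]],
}

def summarize(cm):
    """Accuracy, plus recall and precision for the positive (diabetes) class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

summary = {name: summarize(cm) for name, cm in confusion.items()}
for name, m in summary.items():
    print(f"{name:8s} accuracy={m['accuracy']:.4f} "
          f"recall={m['recall']:.4f} precision={m['precision']:.4f}")
```

Laying the four models side by side this way makes it easier to see how close they are and where false negatives (the `FN` cell) differ.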