# Supervised Learning Final Project

Diabetes is a chronic condition that occurs when the pancreas can no longer make insulin, or when the body cannot use insulin effectively. Insulin is a hormone produced in the pancreas that helps move glucose into the body's cells, where it serves as a source of energy. A person with diabetes is unable to control the level of glucose in their blood. High glucose levels over a long period are associated with organ failure and other damage to the body. According to the [2023 annual report](https://idf.org/media/uploads/2024/06/IDF-Annual-Report-2023.pdf) of the International Diabetes Federation, an estimated 540 million people live with diabetes.

Early detection and prediction are important tools in preventing diabetes. Although several modern tests and features can be used for detection, this project uses historically common ones that are cheaper and readily available, such as BMI, glucose, insulin, and age.

In [None]:
%pip install pandas numpy matplotlib seaborn scikit-learn catboost

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, confusion_matrix, ConfusionMatrixDisplay

from catboost import CatBoostClassifier

## Data

In [None]:
# Load the data and take a first look
data = pd.read_csv('data/diabetes.csv')
print(data.head())
# info() prints its summary directly and returns None, so it is not wrapped in print()
data.info()

## Data Cleaning

In [None]:
# Plot the distribution of every column, including the target variable
data.hist(figsize = (10,10))
plt.show()

In [None]:
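# Zeros in these columns are treated as missing measurements.
# Replace them with the column median computed from the non-zero values,
# so the placeholder zeros do not pull the median down.
zero_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in zero_columns:
    data[col] = data[col].replace(0, np.nan)
    data[col] = data[col].fillna(data[col].median())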
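
Before imputing, it can help to check how many zeros each column contains and how much they shift the median. The sketch below uses a small hypothetical frame (`toy`, with made-up values) and assumes, as the cell above does, that a zero in these columns marks a missing measurement rather than a real reading.

```python
import pandas as pd

# Hypothetical toy data; column names mirror two of the dataset's columns
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
})

# Number of zero placeholders per column
zero_counts = (toy == 0).sum()

# Median with the zeros left in, versus median of the non-zero values only
median_with_zeros = toy.median()
median_nonzero = toy.mask(toy == 0).median()

print(zero_counts)
print(pd.DataFrame({"with_zeros": median_with_zeros, "nonzero_only": median_nonzero}))
```

When a column has many zeros, as `Insulin` does in this toy frame, the two medians can differ substantially, which is worth keeping in mind when choosing the fill value.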

## Exploratory Data Analysis

In [None]:
sns.pairplot(data, hue='Outcome', diag_kind='hist', height=1.5)
plt.show()

In [None]:
corr_matrix = data.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Heatmap')
plt.show()

## Models

In [None]:
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
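
The scaler above is fit once on the whole training split, so any later cross-validation on `X_train_scaled` uses validation folds that already contributed to the scaling statistics. One common alternative is to put the scaler and the model in a single pipeline so that scaling is refit inside each fold. The sketch below illustrates this on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `pipe` are illustrative and not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 200 rows, 8 numeric features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# The scaler is fit only on the training part of each fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")

print("Mean cross-validated ROC AUC:", scores.mean())
```

For a dataset of this size the effect on the scores is usually small, but the pipeline form keeps preprocessing and model selection consistent.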

### Feature Selection using Logistic Regression as the base estimator


In [None]:
# Use Logistic Regression as the base estimator for feature selection
lr = LogisticRegression(random_state=42)

# Perform forward selection to select the top 5 features
sfs = SequentialFeatureSelector(estimator=lr, n_features_to_select=5, direction='forward', scoring='accuracy', cv=5)
sfs.fit(X_train_scaled, y_train)

# Get the selected features
selected_features = X.columns[sfs.get_support()]
print("Selected features:", selected_features)

# Create new datasets with selected features
X_train_sfs = sfs.transform(X_train_scaled)
X_test_sfs = sfs.transform(X_test_scaled)


In [None]:
## Logistic Regression
lr_params = {
    'C': [0.01, 0.1, 1],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}
lr_grid = GridSearchCV(LogisticRegression(random_state=42), lr_params, cv=5, scoring='roc_auc')
lr_grid.fit(X_train_sfs, y_train)
lr_best = lr_grid.best_estimator_

print("\nBest Logistic Regression Parameters:", lr_grid.best_params_)
print("Logistic Regression Performance:")
print(classification_report(y_test, lr_best.predict(X_test_sfs)))
print("ROC AUC Score:", roc_auc_score(y_test, lr_best.predict_proba(X_test_sfs)[:, 1]))


In [None]:
## Random Forest
rf_params = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10],
    'min_samples_split': [5, 10],
    'min_samples_leaf': [1, 4]
}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=5, scoring='roc_auc')
rf_grid.fit(X_train_sfs, y_train)
rf_best = rf_grid.best_estimator_

print("\nBest Random Forest Parameters:", rf_grid.best_params_)
print("Random Forest Performance:")
print(classification_report(y_test, rf_best.predict(X_test_sfs)))
print("ROC AUC Score:", roc_auc_score(y_test, rf_best.predict_proba(X_test_sfs)[:, 1]))

In [None]:
## CatBoost Classifier
catboost_params = {
    'iterations': [100, 200, 300],
    'depth': [4, 8],
    'learning_rate': [0.01, 0.1],
    'l2_leaf_reg': [1, 3, 9]
}
catboost_grid = GridSearchCV(CatBoostClassifier(verbose=0, random_state=42), catboost_params, cv=5, scoring='roc_auc')
catboost_grid.fit(X_train_sfs, y_train)
catboost_best = catboost_grid.best_estimator_

print("\nBest CatBoost Parameters:", catboost_grid.best_params_)
print("CatBoost Performance:")
print(classification_report(y_test, catboost_best.predict(X_test_sfs)))
print("ROC AUC Score:", roc_auc_score(y_test, catboost_best.predict_proba(X_test_sfs)[:, 1]))


## Results and Analysis

In [None]:

def plot_confusion_matrix(model, X_test, y_test, title):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(cmap='Blues')
    plt.title(title)
    plt.show()

# For Logistic Regression
plot_confusion_matrix(lr_best, X_test_sfs, y_test, "Confusion Matrix - Logistic Regression")

# For Random Forest
plot_confusion_matrix(rf_best, X_test_sfs, y_test, "Confusion Matrix - Random Forest")

# For CatBoost
plot_confusion_matrix(catboost_best, X_test_sfs, y_test, "Confusion Matrix - CatBoost")
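
For a screening task, the counts in a confusion matrix are often summarised as sensitivity (the share of actual positives that are caught) and specificity (the share of actual negatives that are correctly cleared). The sketch below shows the arithmetic on hypothetical labels; `y_true_demo` and `y_pred_demo` are made up and are not results from the models above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = diabetic, 0 = non-diabetic)
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sensitivity = tp / (tp + fn)  # recall for the positive class
specificity = tn / (tn + fp)  # recall for the negative class

print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")
```

The same two quantities appear in `classification_report` as the recall of class 1 and class 0 respectively.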

In [None]:
models = [lr_best, rf_best, catboost_best]
model_names = ['Logistic Regression', 'Random Forest', 'CatBoost']

plt.figure(figsize=(10, 6))
for model, name in zip(models, model_names):
    # ROC curves are built from predicted probabilities of the positive class
    y_pred_proba = model.predict_proba(X_test_sfs)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    auc = roc_auc_score(y_test, y_pred_proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC={auc:.2f})')

plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend()
plt.show()
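
The AUC shown in the legend can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python on four made-up scores; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

auc_demo = pairwise_auc(y_demo, scores_demo)
print(auc_demo)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

This quadratic pairwise count is only practical for small examples; `roc_auc_score` is the function to use on real data.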

## References

Xi Li, Michele Curiger, Rolf Dornberger, and Thomas Hanne. 2023. Optimized Computational Diabetes Prediction with Feature Selection Algorithms. In Proceedings of the 2023 7th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence (ISMSI '23). Association for Computing Machinery, New York, NY, USA, 36–43. https://doi.org/10.1145/3596947.3596948