### Introduction

This project uses the Pima Indians Diabetes dataset to predict diabetes diagnoses, which has practical value for identifying high-risk patients. It builds a baseline model, an ensemble model, and an automated machine learning (AutoML) workflow, then compares the three approaches. [1][5]

### Dataset Exploration

Every machine learning problem begins with the data, so data exploration is the first step here. A quick review revealed a slight imbalance in the binary Outcome column: 500 records are negative (0) and 268 are positive (1), a ratio of roughly 65:35.

The dataset contains no null or NaN values. However, missing values are encoded as zeros in the BMI, SkinThickness, BloodPressure, and Glucose columns. From a medical perspective, these measurements should rarely or never equal zero, so a zero most likely means the value was not recorded.
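
One way to make these hidden gaps explicit is to convert the implausible zeros to NaN before modelling. The snippet below is a minimal sketch on a small made-up frame, not the real dataset; the `zero_as_missing` list and the median fill are illustrative assumptions rather than steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up rows, not the real dataset).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "SkinThickness": [35, 29, 0, 23],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Pregnancies": [6, 0, 8, 1],  # zero is a valid value here
})

# Columns where a zero is treated as "not recorded" (assumption for this sketch).
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "BMI"]

cleaned = toy.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Count the values now flagged as missing in each affected column.
missing_counts = cleaned[zero_as_missing].isna().sum()

# One simple option: fill each gap with the median of its column.
imputed = cleaned.fillna(cleaned.median())
print(missing_counts)
```

Columns such as Pregnancies are left alone because zero is a legitimate value there.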

There is also little correlation between the features. No pair of columns has a correlation above 0.54 (Pregnancies and Age). The mean of the correlation matrix is 0.25; with the 1.0 self-correlations removed, the mean falls to 0.14, indicating low correlation across the features.
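
The two averages can be computed by masking the diagonal of the correlation matrix. This is a sketch on synthetic random features (the column names `a` to `d` are placeholders); which columns were included in the means reported above is not stated, so the numbers here are not meant to reproduce them.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in features (random numbers, not the real dataset).
rng = np.random.default_rng(42)
features = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])

corr = features.corr()

# Mean over the full matrix, including the 1.0 self-correlations.
mean_with_diagonal = corr.values.mean()

# Boolean mask that is True everywhere except on the diagonal.
off_diagonal = ~np.eye(len(corr), dtype=bool)

# Mean over the off-diagonal entries only.
mean_without_diagonal = corr.values[off_diagonal].mean()

# Largest absolute correlation between two different features.
max_pairwise = np.abs(corr.values[off_diagonal]).max()

print(mean_with_diagonal, mean_without_diagonal, max_pairwise)
```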

In [1]:
# Install the PyCaret library, which is not bundled with Python or Anaconda
!pip install pycaret



In [2]:
# import popular data processing package
import pandas as pd

In [3]:
# Import data from csv
df = pd.read_csv('diabetes.csv')

In [4]:
# Exploratory analysis of the dataset
df.head()

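|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |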


In [5]:
# slightly unbalanced dataset no: 500 yes: 268 (roughly 65:35)
print('\n\nNo Diabetes: ',df.Outcome.eq(0).sum(),'\n  ', 'Diabetes: ', df.Outcome.eq(1).sum())

# Inappropriate zero values in BMI, SkinThickness, BloodPressure, Glucose
print('\nZeros in each column:\n', df.eq(0).sum()) # AI written (Prompt: In pandas, how can I check the number of 0 values in the columns?; ChatGPT 3.5-Turbo)



No Diabetes:  500 
   Diabetes:  268

Zeros in each column:
 Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64


In [6]:
# Correlation between features
df.corr()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.0 | 0.129459 | 0.141282 | -0.081672 | -0.073535 | 0.017683 | -0.033523 | 0.544341 | 0.221898 |
| Glucose | 0.129459 | 1.0 | 0.15259 | 0.057328 | 0.331357 | 0.221071 | 0.137337 | 0.263514 | 0.466581 |
| BloodPressure | 0.141282 | 0.15259 | 1.0 | 0.207371 | 0.088933 | 0.281805 | 0.041265 | 0.239528 | 0.065068 |
| SkinThickness | -0.081672 | 0.057328 | 0.207371 | 1.0 | 0.436783 | 0.392573 | 0.183928 | -0.11397 | 0.074752 |
| Insulin | -0.073535 | 0.331357 | 0.088933 | 0.436783 | 1.0 | 0.197859 | 0.185071 | -0.042163 | 0.130548 |
| BMI | 0.017683 | 0.221071 | 0.281805 | 0.392573 | 0.197859 | 1.0 | 0.140647 | 0.036242 | 0.292695 |
| DiabetesPedigreeFunction | -0.033523 | 0.137337 | 0.041265 | 0.183928 | 0.185071 | 0.140647 | 1.0 | 0.033561 | 0.173844 |
| Age | 0.544341 | 0.263514 | 0.239528 | -0.11397 | -0.042163 | 0.036242 | 0.033561 | 1.0 | 0.238356 |
| Outcome | 0.221898 | 0.466581 | 0.065068 | 0.074752 | 0.130548 | 0.292695 | 0.173844 | 0.238356 | 1.0 |


In [7]:
# import machine learning modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier


In [8]:
# define random seed
seed = 42 # 42 because it's the answer to life, the universe, and everything.

# split the data into features and the label
X = df.drop(['Outcome'], axis=1).values
y = df['Outcome'].values

In [9]:
# Split the dataset into training and testing data 85:15 split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed, test_size=0.15)
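
Because the classes are slightly imbalanced, one option is to stratify the split so that both parts keep the same class ratio. The sketch below is an illustration only and is not what the cell above does; it uses synthetic labels with the same 500:268 counts, and `X_demo` and `y_demo` are placeholder names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same 500:268 class counts as the dataset.
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42, stratify=y_demo
)

# Share of positive labels in each part; both stay close to 268/768.
train_share = y_tr.mean()
test_share = y_te.mean()
print(len(y_te), train_share, test_share)
```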

In [10]:
# Evaluation Function
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

# target_names follow the sorted label order: 0 = No Diabetes, 1 = Diabetes
classes = ['No Diabetes','Diabetes']
def evaluate_model(true_labels, predictions):
    # The report includes precision, recall, and F1 scores for each class.
    print(classification_report(true_labels, predictions, target_names=classes))
    # Accuracy is the core performance metric, although on its own it does not convey how the model performs on each class.
    print('Accuracy Score:   ', accuracy_score(true_labels, predictions))
    # AUC is the area under the ROC curve and indicates predictive performance. Computed from hard 0/1 predictions, as here, it equals the mean of the two per-class recalls.
    print('AUC Score:        ', roc_auc_score(true_labels, predictions))
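
The AUC printed by this function is computed from hard class predictions. A minimal sketch of the difference between that and an AUC computed from predicted probabilities follows; it uses synthetic data from `make_classification` as a stand-in, so the numbers will not match this notebook's outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the diabetes features.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_preds = clf.predict(X_te)              # 0/1 labels
pos_scores = clf.predict_proba(X_te)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_preds)
auc_from_scores = roc_auc_score(y_te, pos_scores)

# With hard labels the ROC curve has a single interior point, so its area
# reduces to the mean of sensitivity and specificity (balanced accuracy).
balanced = balanced_accuracy_score(y_te, hard_preds)

print(auc_from_labels, balanced, auc_from_scores)
```

Passing probabilities lets the metric evaluate the ranking across all thresholds, which is the more common way to report AUC.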

### Baseline: Logistic Regression

Logistic regression passes a linear combination of the features through the logistic function, which yields a linear decision boundary. It generally performs well on binary and multiclass classification. It is also simple to implement. For those two reasons, it is a compelling choice for a baseline model.
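
To make the link between the logistic function and the linear boundary concrete, the sketch below rebuilds a fitted model's probabilities and predictions by hand. It runs on synthetic data, and the `sigmoid` helper is defined here purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score w.x + b for every sample.
linear_score = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# Passing the score through the logistic function gives P(class 1).
manual_proba = sigmoid(linear_score)

# A probability above 0.5 corresponds to a positive linear score,
# so the decision boundary is the hyperplane w.x + b = 0.
manual_pred = (linear_score > 0).astype(int)

print(np.allclose(manual_proba, clf.predict_proba(X_demo)[:, 1]))
```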

In [11]:
# Logistic Regression Baseline
# LR makes for an ideal baseline due to its simple implementation and strong performance.
baseline = LogisticRegression(max_iter=1000) # increase max iterations from the default 100, which failed to converge.
baseline.fit(X_train,y_train) # trains the model on the training features and labels

baseline_preds = baseline.predict(X_test) # produces an array of predictions for the held-out test features.
evaluate_model(y_test, baseline_preds) # calls the pre-defined function to print the chosen evaluation metrics.

              precision    recall  f1-score   support

 No Diabetes       0.83      0.79      0.81        76
    Diabetes       0.64      0.70      0.67        40

    accuracy                           0.76       116
   macro avg       0.73      0.74      0.74       116
weighted avg       0.77      0.76      0.76       116

Accuracy Score:    0.7586206896551724
AUC Score:         0.7447368421052631


### Gradient Boosting

Gradient boosting is a form of boosting, a machine learning approach that iteratively combines multiple weak classifiers into a stronger one. The use of several models is what makes this an ensemble approach. Gradient boosting differs from standard boosting in how it handles errors: each new model is fit to the gradient of the loss with respect to the current ensemble's predictions, so every round moves the ensemble a step further along that gradient and reduces the errors left by the preceding models.
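
The idea can be sketched from scratch in a few lines. The example below is a simplified illustration, not scikit-learn's implementation: it uses regression with squared error (where the negative gradient is simply the residual) instead of the classification loss used in this notebook, and the data, `learning_rate`, and `n_rounds` values are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (a noisy sine wave).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean minimises squared error).
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is the residual.
    residual = y_demo - prediction
    # Fit a weak learner (a depth-1 tree, or "stump") to the residuals.
    stump = DecisionTreeRegressor(max_depth=1).fit(X_demo, residual)
    # Take a small step in the direction that reduces the error.
    prediction += learning_rate * stump.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

Each round corrects part of what the previous rounds got wrong, so the training error shrinks as weak learners are added.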

In [12]:
# Gradient boosting with scikit-learn's GradientBoostingClassifier (not XGBoost); the usage is the same as for Logistic Regression
model = GradientBoostingClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
evaluate_model(y_test, y_pred)

              precision    recall  f1-score   support

 No Diabetes       0.81      0.74      0.77        76
    Diabetes       0.57      0.68      0.62        40

    accuracy                           0.72       116
   macro avg       0.69      0.71      0.70       116
weighted avg       0.73      0.72      0.72       116

Accuracy Score:    0.7155172413793104
AUC Score:         0.705921052631579


### Automated Machine Learning

Automated machine learning (AutoML) rests on the observation that much machine learning code is repetitive and can therefore be automated. Given a defined task, an AutoML package rapidly trains a number of models that generally perform well on that task and compares them on several metrics, primarily accuracy. To do this, AutoML makes a number of assumptions and decisions that an expert would normally code explicitly. The trade-off is rapid, easy model comparison in exchange for less control. For this reason, experts who need fine-grained control over model parameters might choose other solutions, or start with AutoML and then write custom code for production. [4]
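
The repetitive core that AutoML automates can be sketched as a loop over candidate models. This is a hand-rolled illustration with scikit-learn on synthetic data, not how any particular AutoML package is implemented; the candidate list and the short names used as keys are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Candidate models to compare (an arbitrary selection for this sketch).
candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "nb": GaussianNB(),
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
    "gbc": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Score every candidate with 5-fold cross-validated accuracy.
leaderboard = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank from best to worst and keep the winner's name.
ranking = sorted(leaderboard.items(), key=lambda item: item[1], reverse=True)
best_name = ranking[0][0]

for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

An AutoML package wraps this kind of loop together with preprocessing, more metrics, and reporting, which is where the loss of fine-grained control comes from.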

### PyCaret

PyCaret is an open-source, low-code Python machine learning library that exposes an API for AutoML. It can be used across an entire machine learning pipeline, from ingestion to production, and relies on several underlying machine learning libraries for its functionality. It can produce results quickly compared with more conventional solutions, and it is intended to help data scientists of all experience levels increase productivity and prototype rapidly. [2]

In [13]:
# imports the experiment class from PyCaret's classification module
from pycaret.classification import ClassificationExperiment
exp = ClassificationExperiment()

# inputs the data, sets the column to predict, and defines a random seed to be used throughout the models and functions for reproducibility.
exp.setup(df, target='Outcome', session_id=seed)

|   | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | Outcome |
| 2 | Target type | Binary |
| 3 | Original data shape | (768, 9) |
| 4 | Transformed data shape | (768, 9) |
| 5 | Transformed train set shape | (537, 9) |
| 6 | Transformed test set shape | (231, 9) |
| 7 | Numeric features | 8 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


<pycaret.classification.oop.ClassificationExperiment at 0x7c702755de70>

In [16]:
# PyCaret / AutoML
# training & selection; trains a variety of classification models with cross-validation and selects the model with the highest accuracy.
best = exp.compare_models()

# opens PyCaret's interactive evaluation plots for the selected model
exp.evaluate_model(best)

# scores the selected model on the hold-out test set
pycaret_pred = exp.predict_model(best)

|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.7784 | 0.8284 | 0.5819 | 0.7369 | 0.6397 | 0.4851 | 0.4994 | 0.087 |
| lda | Linear Discriminant Analysis | 0.7784 | 0.8304 | 0.5819 | 0.7327 | 0.6395 | 0.4848 | 0.4977 | 0.027 |
| ridge | Ridge Classifier | 0.7747 | 0.0 | 0.5661 | 0.7344 | 0.6297 | 0.474 | 0.4883 | 0.028 |
| nb | Naive Bayes | 0.7599 | 0.816 | 0.593 | 0.6858 | 0.6275 | 0.4537 | 0.4616 | 0.029 |
| qda | Quadratic Discriminant Analysis | 0.7525 | 0.8166 | 0.5775 | 0.6718 | 0.6149 | 0.4356 | 0.4424 | 0.03 |
| et | Extra Trees Classifier | 0.7524 | 0.81 | 0.5605 | 0.6824 | 0.5995 | 0.4275 | 0.4427 | 0.256 |
| gbc | Gradient Boosting Classifier | 0.7449 | 0.8157 | 0.5398 | 0.6773 | 0.5874 | 0.4096 | 0.4239 | 0.152 |
| rf | Random Forest Classifier | 0.7393 | 0.8114 | 0.5401 | 0.6601 | 0.5857 | 0.4005 | 0.4101 | 0.237 |
| ada | Ada Boost Classifier | 0.7336 | 0.7854 | 0.5398 | 0.6396 | 0.5749 | 0.3865 | 0.3961 | 0.127 |
| lightgbm | Light Gradient Boosting Machine | 0.7225 | 0.7811 | 0.5506 | 0.6231 | 0.5779 | 0.3739 | 0.3804 | 0.162 |



|   | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.7403 | 0.8376 | 0.5185 | 0.6667 | 0.5833 | 0.3989 | 0.4056 |


### Discussion

As with every tool, there are trade-offs. AutoML drastically speeds up model selection and experimentation. PyCaret produced detailed evaluation results, and its interactive output lets users explore different aspects of each model's evaluation to build a more comprehensive understanding of performance and make a more informed model selection. In this experiment, PyCaret ranked logistic regression first (0.78 cross-validated accuracy, 0.74 on its hold-out set), in line with the hand-built logistic regression baseline (0.76 accuracy on its own test split), while the hand-built gradient boosting model reached 0.72. Because the hand-built models and PyCaret used different train-test splits, these figures are only roughly comparable.

On the other hand, AutoML makes decisions and assumptions for the user. This makes it a less suitable choice when users need to prioritize a particular aspect of model performance, such as maximizing precision rather than overall accuracy. AutoML tools do offer options that allow more control than the basic implementation, but they do not provide the same level of flexibility as coding the models directly. In conclusion, AutoML is a powerful tool that naturally includes trade-offs.
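
As an illustration of the kind of control that hand-written code gives, the sketch below raises the decision threshold of a logistic regression so that fewer cases are flagged as positive, which typically raises precision at the cost of recall. It uses synthetic data, and the 0.5 and 0.7 thresholds are arbitrary values chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy binary data (not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pos_scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

results = {}
for threshold in (0.5, 0.7):
    # Flag a case as positive only if its probability clears the threshold.
    preds = (pos_scores >= threshold).astype(int)
    results[threshold] = {
        "n_flagged": int(preds.sum()),
        "precision": precision_score(y_te, preds, zero_division=0),
        "recall": recall_score(y_te, preds),
    }

for threshold, metrics in results.items():
    print(threshold, metrics)
```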

### Citations

[1] J. W. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. S. Johannes, “Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus,” Proc Annu Symp Comput Appl Med Care, pp. 261–265, Nov. 1988.
[2] “PyCaret 3.0 - Docs.” Accessed: Feb. 02, 2024. [Online]. Available: https://pycaret.gitbook.io/docs/
[3] “Frequently Asked Questions — xgboost 2.0.3 documentation.” Accessed: Feb. 02, 2024. [Online]. Available: https://xgboost.readthedocs.io/en/stable/faq.html
[4] “Eight years of AutoML: categorisation, review and trends,” Knowledge and Information Systems. Accessed: Feb. 02, 2024. [Online]. Available: https://link.springer.com/article/10.1007/s10115-023-01935-1#Sec19
