# AI4Health - 01 - Tabular Classification

---

## Introduction

Tabular data—structured information organised in rows and columns—is the foundation of most clinical records and healthcare analytics. From electronic health records to laboratory results, much of the data used in medical decision-making is tabular in nature. Machine learning models that can interpret and classify this data have the potential to support clinicians in diagnosing diseases, identifying at-risk patients, and improving outcomes.

In this notebook, you will build a machine learning classifier to predict diabetes using the well-known **Pima Indians dataset**. Diabetes is a chronic condition with significant health impacts, and early detection is crucial for effective management. The dataset reflects a real-world scenario in which routine health measurements are used to assess a patient's risk. We use it because the Pima Indian population has historically had a high prevalence of diabetes, which makes the dataset a valuable resource for studying risk factors and prediction in a real clinical context.

You will learn how to:

- Explore and visualise clinical tabular data
- Prepare data for modelling, including handling class imbalance
- Train and evaluate two common classifiers: **Logistic Regression & Decision Trees**
- Interpret model performance using **Confusion Matrices** and **ROC Curves**
- Reflect on the practical and ethical implications of deploying such models in healthcare

By the end of this notebook, you will have hands-on experience with the end-to-end workflow of building a clinical prediction model, and a deeper understanding of both the opportunities and challenges of applying machine learning in medicine.

### Learning Objectives

- Understand binary classification and model evaluation in healthcare
- Learn how to train logistic regression and decision tree models
- Interpret model performance using ROC curves and confusion matrices

---

## Additional Context

### What is Tabular Clinical Data?

Tabular data is the most common format for clinical information. Each row typically represents a patient or a clinical encounter, and each column is a variable such as age, blood pressure, lab results, or diagnosis codes. This structure makes it easy to store, query, and analyse patient data using spreadsheets or databases.

In healthcare, tabular datasets are used for:
- **Risk prediction** (e.g., likelihood of diabetes, heart disease)
- **Resource planning** (e.g., predicting hospital admissions)
- **Quality improvement** (e.g., identifying gaps in care)

### Why Binary Classification?

Many clinical questions are naturally framed as binary classification problems:
- Does this patient have diabetes? (Yes/No)
- Will this patient be readmitted within 30 days? (Yes/No)
- Is this test result abnormal? (Yes/No)

Binary classifiers help automate decision support, triage, and screening, but their predictions must be interpreted carefully, especially when errors can have real-world consequences.

### Key Concepts in Clinical Machine Learning

Before building predictive models, you need a few foundational concepts. In clinical settings, both the data and the way models are evaluated carry implications for patient care. The most important concepts are:

- **Features**: Patient measurements (e.g., glucose, BMI, age)
- **Target**: The outcome to predict (e.g., diabetes diagnosis)
- **Class Imbalance**: Often, one outcome (e.g., "no disease") is much more common than the other. Left unaddressed, this can bias a model towards the majority class.
- **Model Evaluation**: Metrics like accuracy, precision, recall, and ROC-AUC help assess how well a model performs, but their clinical meaning depends on the context.
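
To make the evaluation metrics concrete, the sketch below computes them from a set of confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they do not come from the diabetes dataset.

```python
# Hypothetical confusion-matrix counts for a screening model (illustrative only).
tp, fn = 30, 20    # diabetic patients: correctly flagged / missed
tn, fp = 130, 20   # non-diabetic patients: correctly cleared / falsely flagged

total = tp + fn + tn + fp
accuracy = (tp + tn) / total     # share of all predictions that are correct
precision = tp / (tp + fp)       # of flagged patients, share who are diabetic
recall = tp / (tp + fn)          # of diabetic patients, share flagged (sensitivity)
specificity = tn / (tn + fp)     # of non-diabetic patients, share cleared

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f}")
```

With these counts, accuracy is 0.80 even though the model misses 40% of the diabetic patients (recall of 0.60). This is why accuracy alone can be misleading when classes are imbalanced.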

### Why Use Logistic Regression and Decision Trees?

Choosing the right machine learning model is crucial in healthcare, where interpretability and transparency are often as important as predictive accuracy. Two commonly used models in clinical prediction are logistic regression and decision trees, each with their own strengths:

- **Logistic Regression**: Simple, interpretable, and widely used for clinical risk prediction. Exponentiating a coefficient gives an odds ratio, a measure already familiar to clinicians.
- **Decision Trees**: Provide clear, rule-based decisions that can be visualised and explained, which is valuable for transparency in healthcare.
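
As a rough illustration of how a logistic regression coefficient relates to an odds ratio, the sketch below exponentiates a single coefficient. The value `0.035` is a hypothetical coefficient for glucose, not one taken from a fitted model.

```python
import math

# Hypothetical coefficient: change in the log-odds of diabetes
# per 1 mg/dL increase in glucose (not taken from a real model).
coef_glucose = 0.035

odds_ratio_per_unit = math.exp(coef_glucose)       # effect of +1 mg/dL
odds_ratio_per_10 = math.exp(coef_glucose * 10)    # effect of +10 mg/dL

print(f"Odds ratio per 1 mg/dL:  {odds_ratio_per_unit:.3f}")
print(f"Odds ratio per 10 mg/dL: {odds_ratio_per_10:.3f}")
```

An odds ratio above 1 means the odds of the outcome rise as the feature increases, holding the other features fixed. This reading assumes the feature was not rescaled before fitting; in scikit-learn, the fitted coefficients are available in the model's `coef_` attribute.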

### Clinical Impact and Caution

Machine learning can support, but not replace, clinical judgment. Models must be validated, monitored, and used alongside other information. Ethical considerations include fairness, transparency, and the risk of automation bias.

---

## Related Guides

- *MatPlotLib - Pyplot:* https://matplotlib.org/stable/tutorials/pyplot.html
- *Pandas - DataFrame:* https://pandas.pydata.org/docs/user_guide/dsintro.html#basics-dataframe
- *SciKit-Learn - Classification Report:* https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report
- *SciKit-Learn - Confusion Matrix:* https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
- *SciKit-Learn - Cross Validation (train_test_split):* https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
- *SciKit-Learn - Decision Tree:* https://scikit-learn.org/stable/modules/tree.html
- *SciKit-Learn - Logistic Regression:* https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- *SciKit-Learn - ROC Curve:* https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics

---

## Step 1: Load Required Libraries

Before we begin, let's import the essential Python libraries for data analysis, visualisation, and machine learning.

- **pandas**: for data manipulation and analysis
- **matplotlib** and **seaborn**: for creating informative plots
- **scikit-learn**: for splitting data, building models, and evaluating performance

Understanding the purpose of each library will help you build, visualise, and evaluate your models effectively.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Ensure plots show inline in Jupyter
%matplotlib inline

print("OK")

**Questions:**

- **1.1.** What is the purpose of `train_test_split`?
- **1.2.** Why might we use `roc_curve` and `auc` in a clinical context?

---

## Step 2: Load the Dataset

Now, we will load the Pima Indians diabetes dataset from a local CSV file. This dataset contains clinical measurements used to predict diabetes.
It is important to handle potential errors during loading, such as missing files or corrupted data, especially in healthcare where data integrity is critical.

After loading, always verify that the data looks as expected before proceeding.
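
One possible way to verify a freshly loaded dataset is sketched below. The helper `check_dataset` is hypothetical, and the column names are an assumption based on the Kaggle version of the dataset; only `Outcome` is used elsewhere in this notebook. A small made-up frame stands in for the real CSV.

```python
import pandas as pd

# Assumed column names, following the Kaggle version of the dataset;
# adjust this list if your CSV uses different headers.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

def check_dataset(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if frame.empty:
        problems.append("DataFrame is empty")
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"Missing columns: {missing}")
    if "Outcome" in frame.columns and not frame["Outcome"].isin([0, 1]).all():
        problems.append("Outcome contains values other than 0 and 1")
    return problems

# Tiny made-up frame standing in for the real CSV.
demo = pd.DataFrame(
    [[1, 120, 70, 20, 80, 25.0, 0.5, 30, 0],
     [3, 160, 80, 30, 0, 33.1, 0.7, 45, 1]],
    columns=EXPECTED_COLUMNS,
)
print(check_dataset(demo))            # passes every check
print(check_dataset(pd.DataFrame()))  # reports the empty frame and missing columns
```

Running a check like this immediately after loading surfaces problems at the source, rather than as a confusing error several cells later.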

In [None]:
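file_path = "./datasets/pima_indians_diabetes.csv"
try:
    df = pd.read_csv(file_path)
    print(f"OK: loaded {df.shape[0]} rows and {df.shape[1]} columns")
except Exception as e:
    # Report the problem instead of stopping the notebook with a traceback
    print("Failed to load dataset:", e)
    # Empty placeholder: later cells will fail until the file loads successfully
    df = pd.DataFrame()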

**Questions:**

- **2.1.** What might go wrong when loading a dataset from a local file?
- **2.2.** Why is it important to handle exceptions when reading files?
- **2.3.** How can you verify that the dataset was loaded correctly and contains the expected data?
- **2.4.** Why is it important to ensure the dataset has not been tampered with or corrupted, especially in a clinical setting?

---

## Step 3: Explore the Data

Before modelling, it is crucial to understand the dataset's structure and contents.
We will look at the first few rows, summary statistics, and check for missing values.
This helps us spot potential issues, such as implausible values (e.g., zero blood pressure), and understand what each feature represents clinically.

Reflect on how data quality and feature meaning can impact downstream analysis and model performance.
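
Implausible zeros are not detected by `isnull()`, because a zero is still a valid number. The sketch below shows one way to count such zeros and recode them as missing. The rows are made up, the column names assume the Kaggle headers, and the list of columns where zero means "missing" is an assumption to confirm with clinical input.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; column names assume the Kaggle headers.
demo = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI":           [33.6, 26.6, 23.3, 0.0],
    "Pregnancies":   [6, 0, 8, 1],   # zero is a valid value here
})

# Columns where a zero is physiologically implausible (an assumption).
zero_means_missing = ["Glucose", "BloodPressure", "BMI"]

# Count the zeros in each affected column.
zero_counts = (demo[zero_means_missing] == 0).sum()
print(zero_counts)

# Recode those zeros as NaN so that isnull() and describe() reflect them.
cleaned = demo.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Once the zeros are recoded, you still have to decide how to handle the missing entries, for example by dropping rows or imputing values; that choice is outside the scope of this sketch.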

In [None]:
print("\nFirst 5 rows:")
print(df.head())

print("\nSummary statistics:")
print(df.describe())

print("\nMissing values:")
print(df.isnull().sum())

**Questions:**

- **3.1.** Why is it important to know the structure of your dataset before starting analysis?
- **3.2.** What do the columns represent clinically?
- **3.3.** Are there any features with unusual min/max values?
- **3.4.** How might you handle features with implausible values (e.g., zero for blood pressure)?

---

## Step 4: Visualise Class Distribution

Let's examine the distribution of the target variable—whether patients are diabetic or not.
Visualising class balance is important because imbalanced datasets can bias models and affect their ability to generalise.

Consider how class imbalance might influence your model's predictions and what strategies you could use to address it.
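
A count plot shows imbalance visually; the sketch below quantifies it and illustrates one common mitigation, class weighting. The labels are made up (65 negative, 35 positive) and do not reproduce the real class counts.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 65 non-diabetic (0) and 35 diabetic (1) patients.
y_demo = pd.Series([0] * 65 + [1] * 35, name="Outcome")

# Class proportions quantify what the count plot shows visually.
print(y_demo.value_counts(normalize=True))

# "balanced" weights are n_samples / (n_classes * class_count), so the
# rarer class receives the larger weight.
classes = np.array([0, 1])
weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=y_demo.to_numpy()
)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Both `LogisticRegression` and `DecisionTreeClassifier` accept `class_weight="balanced"`, which applies this weighting during training. Resampling the training data is another option, though whether either approach helps depends on the dataset.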

- *Seaborn - countplot:* https://seaborn.pydata.org/generated/seaborn.countplot.html

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='Outcome', data=df)
plt.title("Diabetes Diagnosis Distribution (0 = No, 1 = Yes)")
plt.xlabel("Outcome")
plt.ylabel("Count")
plt.show()

**Questions:**

- **4.1.** Are the classes balanced?
- **4.2.** What could be the impact of imbalance on model training?
- **4.3.** What techniques can you use to address class imbalance in your dataset?

---

## Step 5: Split the Data

To fairly evaluate our models, we will split the data into training and testing sets.
We use stratified sampling to ensure the class proportions remain consistent in both sets, which is especially important for imbalanced data.

Think about why this split is necessary and how improper splitting could lead to misleading results.
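
The effect of stratification can be checked directly. The sketch below splits made-up labels (35% positive) with and without `stratify` and compares the positive rate in each test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 patients, 35% positive, one dummy feature.
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.arange(100).reshape(-1, 1)

# Stratified split: class proportions are preserved in the test set.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
# Plain random split: the test-set proportion can drift by chance.
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print("Overall positive rate:     ", y_demo.mean())
print("Stratified test set rate:  ", y_test_strat.mean())
print("Unstratified test set rate:", y_test_plain.mean())
```

The stratified test set matches the overall rate, while the unstratified rate depends on the random seed. On small datasets such a drift can noticeably change the reported metrics.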

- *SciKit-Learn - train_test_split:* https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
X = df.drop('Outcome', axis=1)  # Features
y = df['Outcome']               # Target label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("OK")

**Questions:**

- **5.1.** Why do we split data into training and testing sets?
- **5.2.** What is the role of `stratify=y`?
- **5.3.** What could happen if you do not stratify by the target variable when splitting the data?

---

## Step 6: Train Logistic Regression

We will train a logistic regression model to predict diabetes.
Logistic regression estimates the probability of an outcome using a linear combination of features.
After training, we will evaluate its performance using metrics like precision, recall, and F1-score.

Consider how these metrics relate to clinical decision-making and what the model's coefficients might reveal about risk factors.
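
To show what "a linear combination of features" means in practice, the sketch below computes a predicted probability by hand. The intercept, coefficients, and patient values are all hypothetical; a fitted model would supply its own.

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficients (not from a fitted model).
intercept = -6.0
coefs = np.array([0.035, 0.08])      # glucose (mg/dL), BMI (kg/m^2)
patient = np.array([140.0, 32.0])    # one made-up patient

log_odds = intercept + coefs @ patient   # linear combination of features
probability = sigmoid(log_odds)          # squashed into the range (0, 1)

print(f"log-odds = {log_odds:.2f}, predicted probability = {probability:.2f}")
```

By default, `predict` labels a patient as positive when this probability exceeds 0.5, whereas `predict_proba` returns the probability itself.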

- *SciKit-Learn - classification_report:* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
- *SciKit-Learn - LogisticRegression:* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [None]:
print("\n--- Logistic Regression ---")
lr = LogisticRegression(max_iter=1000)  # Raise the iteration limit to help the solver converge
lr.fit(X_train, y_train)                # Train the model
y_pred_lr = lr.predict(X_test)          # Predict on test set

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))

**Questions:**

- **6.1.** How does logistic regression estimate the likelihood of diabetes?
- **6.2.** What performance metrics are most relevant here?
- **6.3.** How can you interpret the coefficients of a logistic regression model in a clinical context?

---

## Step 7: Train Decision Tree

Next, we will train a decision tree classifier.
Decision trees make predictions by splitting data based on feature thresholds.
We will limit the tree's depth to prevent overfitting and improve interpretability.

Compare the strengths and weaknesses of decision trees and logistic regression, especially in terms of transparency and clinical usefulness.
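
A shallow tree can be printed as a set of if/else rules, which is one way to see both the threshold splits and the effect of `max_depth`. The sketch below uses synthetic data and placeholder feature names, so the rules it prints have no clinical meaning.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; feature names are placeholders.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["feature_a", "feature_b", "feature_c"]

# max_depth=2 allows at most two successive splits on any path.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(X_demo, y_demo)

# Print the learned rules as indented text.
rules = export_text(shallow_tree, feature_names=feature_names)
print(rules)
print("Depth:", shallow_tree.get_depth(), "Leaves:", shallow_tree.get_n_leaves())
```

A tree of depth `d` has at most `2**d` leaves, so limiting the depth caps the number of distinct rules a clinician would need to review.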

- *SciKit-Learn - classification_report:* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
- *SciKit-Learn - DecisionTreeClassifier:* https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [None]:
print("\n--- Decision Tree ---")
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_dt))

**Questions:**

- **7.1.** What does `max_depth=4` mean in terms of model complexity?
- **7.2.** What are the pros and cons of decision trees compared to logistic regression?
- **7.3.** How might the interpretability of a decision tree benefit clinicians?

---

## Step 8: Confusion Matrix Comparison

Let's compare the confusion matrices for both models.
The confusion matrix shows how many predictions were correct or incorrect, broken down by class.
This helps us understand the types of errors each model makes—such as false positives and false negatives—and their potential clinical consequences.

Reflect on which errors are more critical in a healthcare context and how you might adjust your model to minimise them.
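
One way to trade false negatives for false positives is to lower the decision threshold applied to predicted probabilities. The sketch below illustrates this on synthetic data; the thresholds 0.5 and 0.3 are arbitrary examples, not clinical recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 65% negative, 35% positive).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]   # probability of the positive class

results = {}
for threshold in (0.5, 0.3):
    # Label a patient positive when the probability reaches the threshold.
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, preds, labels=[0, 1]).ravel()
    results[threshold] = {"fn": int(fn), "fp": int(fp)}
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

Lowering the threshold can only keep or reduce the number of false negatives, at the cost of keeping or increasing the number of false positives. Which balance is acceptable depends on the clinical cost of each error type.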

- *SciKit-Learn - confusion_matrix:* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
- *Seaborn - heatmap:* https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12,5))
for ax, y_pred, title in zip(axes, [y_pred_lr, y_pred_dt], ['Logistic Regression', 'Decision Tree']):
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
    ax.set_title(f"Confusion Matrix: {title}")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
plt.tight_layout()
plt.show()

**Questions:**

- **8.1.** Which model makes more false positives or false negatives?
- **8.2.** What would be the consequence of each in a clinical context?
- **8.3.** How could you adjust your model or threshold to prioritise reducing false negatives?

---

## Step 9: ROC Curves

We will plot ROC (Receiver Operating Characteristic) curves to visualise the trade-off between sensitivity (true positive rate) and the false positive rate (1 - specificity) at different thresholds.
The AUC (Area Under the Curve) summarises the model's ability to distinguish between classes: 0.5 corresponds to random guessing and 1.0 to perfect separation.

Think about what a good AUC means in practice, how you would select a threshold for clinical use, and what factors might influence that choice.
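
The arrays returned by `roc_curve` can also be used to pick an operating threshold. The sketch below uses four made-up predictions and selects the highest threshold that reaches a target sensitivity; the target of 0.9 is an arbitrary illustration, not a clinical recommendation.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Tiny made-up example: true labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Thresholds are returned in decreasing order, so the first point that
# meets the target is the highest threshold achieving that sensitivity.
target_sensitivity = 0.9
idx = int(np.argmax(tpr >= target_sensitivity))
chosen_threshold = thresholds[idx]

print(f"AUC = {roc_auc:.2f}")
print(f"threshold = {chosen_threshold}, sensitivity = {tpr[idx]}, "
      f"specificity = {1 - fpr[idx]}")
```

In this toy example, reaching the target sensitivity means accepting a specificity of 0.5. Other selection rules exist; which one is appropriate depends on the relative cost of missed cases and false alarms.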

- *SciKit-Learn - roc_curve:* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
- *SciKit-Learn - auc:* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

In [None]:
def plot_roc(model, model_name):
    """Plot ROC curve for a given model."""
    y_probs = model.predict_proba(X_test)[:, 1]  # Probabilities for class 1
    fpr, tpr, _ = roc_curve(y_test, y_probs)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.2f})')

plt.figure(figsize=(8,6))
plot_roc(lr, "Logistic Regression")
plot_roc(dt, "Decision Tree")
plt.plot([0,1],[0,1],'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.grid(True)
plt.show()

**Questions:**

- **9.1.** What do AUC values such as 0.79 and 0.82 mean in a clinical context?
- **9.2.** Are both models better than chance?
- **9.3.** Are both models good enough to use in a clinic?
- **9.4.** How would you choose an appropriate threshold for clinical use, and what factors would influence your decision?

---

## Step 10: Summary and Reflection

In this notebook, we walked through the complete workflow of building a machine learning classifier for a real-world healthcare problem—predicting diabetes from clinical tabular data. We started by loading and exploring the Pima Indians diabetes dataset, visualising the distribution of the target classes, and discussing the importance of class balance. We then prepared the data by splitting it into training and testing sets, ensuring the class proportions were preserved.

Next, we trained two different models: logistic regression and a decision tree, and evaluated their performance using classification reports, confusion matrices, and ROC curves. Along the way, we reflected on the clinical meaning of various metrics, the impact of class imbalance, and the practical consequences of model errors. This hands-on process highlighted both the power and the limitations of machine learning in a clinical context, emphasising the need for careful evaluation and interpretation before deploying such models in practice.

### Summary

- We trained two models: Logistic Regression & Decision Tree.
- Both models show moderate predictive ability on the diabetes dataset.
- ROC curves and confusion matrices help interpret the real-world consequences of predictions.

### What's next?

- **10.1.** What additional clinical variables could improve prediction (e.g., family history, lab tests)?
- **10.2.** How could such a model be integrated into real clinical workflows?
- **10.3.** What safeguards are needed to handle incorrect predictions?
- **10.4.** How would you monitor and update the model to ensure ongoing accuracy and fairness?

---

## Explore Further

### Datasets

- **Pima Indians Diabetes Database**
  - https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

### Articles

- **Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus**
<br>*Proceedings of the Annual Symposium on Computer Application in Medical Care*
  - https://pmc.ncbi.nlm.nih.gov/articles/PMC2245318/

- **Machine Learning in Medicine**
<br>*New England Journal of Medicine*
  - https://www.nejm.org/doi/full/10.1056/NEJMra1814259

- **Predictive models for diabetes mellitus using machine learning techniques**
<br>*BMC Endocrine Disorders*
  - https://bmcendocrdisord.biomedcentral.com/articles/10.1186/s12902-019-0436-6

- **Interpretable machine learning method to predict the risk of pre-diabetes using a national-wide cross-sectional data: evidence from CHNS**
<br>*BMC Public Health*
  - https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-025-22419-7