# Model evaluation

In this notebook, you will explore key techniques for evaluating machine learning models.
Specifically, you will focus on:

- Manually calculating performance metrics from a confusion matrix.
- Understanding the differences between ROC and PR curves, especially when dealing with imbalanced data.
- Identifying and mitigating the issue of data leakage when performing feature selection.

## Learning objectives

By the end of this notebook, you should be able to:

- Calculate and interpret key performance metrics such as accuracy, precision, recall, and F1-score.
- Understand the significance of ROC and PR curves and when to use each.
- Identify data leakage and prevent it by applying proper cross-validation strategies.

## Model evaluation recap

Before we dive into the exercises, let’s briefly review the key concepts that are essential for evaluating machine learning models.
These concepts will guide you through the tasks you will be working on.

### Train-validation-test split

Properly evaluating your machine learning model is essential to assess its performance.
Importantly, we want the model to be **generalizable**, i.e., it should perform well on unseen (future) data.

Typically, the data is split into a **train and test set**.
The model is then trained on the former, and its performance is evaluated in the end on the latter.
It is very important to evaluate your model on the test set only once, as the very last step.
If the test performance is used to guide model changes, the test set is no longer independent: the model will **overfit** to it, and the reported performance will be overly optimistic.

If your model has hyperparameters that need tuning, a third subset, the **validation set**, is split off from the data.
The validation set is used to compare different hyperparameter values, so that the test set stays untouched for the final performance evaluation.
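
As a minimal sketch of such a three-way split, the snippet below calls `train_test_split` twice. The toy data and the 60/20/20 proportions are assumptions chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 samples, 5 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.repeat([0, 1], 50)

# First hold out 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Then split the remaining 80% into 75% train and 25% validation,
# which gives 60% / 20% / 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Passing `stratify` keeps the class proportions similar across the three subsets, which matters most when one class is rare.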

### Cross-validation

Cross-validation is an essential technique used to assess model performance more reliably.
It involves splitting the data into multiple subsets (folds), training the model on some folds, and testing it on the remaining folds.
The most common form is **k-fold cross-validation**, where the data is split into `k` folds and the model is trained `k` times, each time holding out a different fold for testing and training on the remaining `k - 1` folds.

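Because every sample is used for testing exactly once, cross-validation gives a more reliable estimate of how well the model generalizes to unseen data than a single split, which makes overfitting easier to detect.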
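
A short sketch of 5-fold cross-validation with scikit-learn follows. The synthetic dataset, the choice of `k = 5`, and the logistic regression settings are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Stratified 5-fold split: each fold serves as the test fold exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f"Fold accuracies: {scores.round(2)}")
print(f"Mean accuracy: {scores.mean():.2f} (std: {scores.std():.2f})")
```

Reporting the spread across folds next to the mean gives a first impression of how stable the estimate is.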

### Bootstrap

Bootstrap is another resampling technique used to assess the uncertainty in model performance.
It involves repeatedly sampling the data (with replacement) to create new training sets, training the model on these sets, and evaluating its performance on the unsampled data.
Bootstrap helps to estimate the variance of the model's performance metrics.
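
The procedure can be sketched as follows, assuming a synthetic dataset, 50 bootstrap rounds, and a percentile interval as the uncertainty summary. All of these choices are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

rng = np.random.default_rng(0)
n_samples = len(y)
oob_scores = []

for _ in range(50):
    # Draw sample indices with replacement to form the bootstrap training set.
    train_idx = rng.integers(0, n_samples, size=n_samples)
    # Samples that were never drawn form the out-of-bag evaluation set.
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[train_idx] = False

    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    oob_scores.append(model.score(X[oob_mask], y[oob_mask]))

oob_scores = np.array(oob_scores)
low, high = np.percentile(oob_scores, [2.5, 97.5])
print(f"Out-of-bag accuracy: {oob_scores.mean():.2f} "
      f"(95% percentile interval: {low:.2f} to {high:.2f})")
```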

### Confusion matrix & performance metrics

For binary classification, the **confusion matrix** is a 2x2 table that describes the performance of a classification model by cross-tabulating the true and the predicted classes.
The entries are:

- **True positives (TP)**: The model correctly predicts the positive class.
- **False positives (FP)**: The model incorrectly predicts the positive class.
- **True negatives (TN)**: The model correctly predicts the negative class.
- **False negatives (FN)**: The model incorrectly predicts the negative class.

From the confusion matrix, you can derive several performance metrics:

- **Accuracy**: The proportion of correct predictions.
- **Precision**: The proportion of positive predictions that are actually positive.
- **Recall (sensitivity)**: The proportion of actual positives that are correctly predicted.
- **F1-score**: The harmonic mean of precision and recall, useful when classes are imbalanced.
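
As a worked example, the sketch below builds labels with hand-picked counts (TP = 40, FP = 10, TN = 45, FN = 5, all hypothetical) and derives the four metrics from them. The cross-check against scikit-learn's own metric functions is included only as a sanity check.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels built to give TP = 40, FP = 10, TN = 45, FN = 5.
y_true = np.array([1] * 40 + [0] * 10 + [0] * 45 + [1] * 5)
y_pred = np.array([1] * 40 + [1] * 10 + [0] * 45 + [0] * 5)

# For binary labels {0, 1}, ravel() returns the counts as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 85 / 100 = 0.85
precision = tp / (tp + fp)                  # 40 / 50 = 0.80
recall = tp / (tp + fn)                     # 40 / 45 = 0.889
f1 = 2 * precision * recall / (precision + recall)  # 80 / 95 = 0.842

print(f"Accuracy: {accuracy:.3f}, precision: {precision:.3f}, "
      f"recall: {recall:.3f}, F1-score: {f1:.3f}")

# Sanity check against scikit-learn's implementations.
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```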

### ROC and PR curves

The **receiver operating characteristic (ROC) curve** plots the true positive rate (recall) against the false positive rate at various threshold settings.
The **area under the ROC curve (AUROC)** is commonly used to summarize the model's performance, with 1.0 indicating a perfect classifier and 0.5 indicating a random classifier.

The **precision-recall (PR) curve** plots precision against recall at various thresholds.
PR curves are particularly useful when dealing with imbalanced data, where the positive class is much smaller than the negative class.
The **average precision (AP)** (approximately equivalent to the area under the PR curve) summarizes the trade-off between precision and recall.
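
To get a feel for these two summary numbers, the sketch below scores simulated labels with purely random, uninformative scores. The sample size and the 10% positive rate are assumptions chosen for illustration. Under these assumptions the AUROC lands near 0.5, while the AP lands near the fraction of positives rather than near 0.5.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_samples = 20_000

# Simulated labels with roughly 10% positives (illustrative only).
y_true = (rng.random(n_samples) < 0.1).astype(int)
# Uninformative scores: drawn independently of the labels.
random_scores = rng.random(n_samples)

auroc = roc_auc_score(y_true, random_scores)
ap = average_precision_score(y_true, random_scores)
prevalence = y_true.mean()

print(f"AUROC of random scores: {auroc:.2f}")
print(f"AP of random scores:    {ap:.2f}")
print(f"Fraction of positives:  {prevalence:.2f}")
```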

## Task: Calculate performance metrics from a confusion matrix

In this task, you will:

- Train a logistic regression model on a simple binary classification dataset.
- Generate a confusion matrix based on the predictions.
- Manually calculate key performance metrics: accuracy, precision, recall, and F1-score.

**Steps**

- Create the dataset: Create a binary classification dataset with 1,000 samples and 20 features.
- Split the data: Divide the data into training and test sets.
- Train a logistic regression model: Use scikit-learn's `LogisticRegression` to train a model.
- Evaluate the model: Make predictions on the test set and compute the confusion matrix. Using the values from the confusion matrix, calculate the following metrics manually:
    - Accuracy = (TP + TN) / (TP + TN + FP + FN)
    - Precision = TP / (TP + FP)
    - Recall = TP / (TP + FN)
    - F1-score = 2 * (Precision * Recall) / (Precision + Recall)

In [None]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


# Generate synthetic dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split into training and test sets.
# TODO
X_train, X_test, y_train, y_test = None

# Train logistic regression model.
# TODO
model = None

# Generate confusion matrix.
# TODO
y_pred = None
cm = None

# Calculate performance metrics.
# TODO
accuracy = None
precision = None
recall = None
f1_score = None

# Print the results.
print(f"Confusion matrix: \n{cm}")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1_score}")

## Task: ROC vs PR curves for imbalanced data

In biomedical applications, it is common to encounter imbalanced datasets where the number of positive cases (e.g., patients with a rare disease) is significantly smaller than the number of negative cases (e.g., healthy patients).
For example, in cancer detection, only a small fraction of patients might actually have cancer, making the dataset highly skewed toward healthy individuals.

This class imbalance can lead to misleading performance metrics when evaluating classifiers.
For instance, accuracy can be very high if the model simply predicts the majority class, but it would fail to detect the minority class, which is often the class of interest in medical diagnostics.
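
A minimal illustration of this effect, assuming 100 hypothetical patients of whom 10 have the disease and a "classifier" that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical cohort: 90 healthy patients (0) and 10 diseased patients (1).
y_true = np.array([0] * 90 + [1] * 10)
# Trivial classifier that always predicts the majority class (healthy).
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")  # 0.90
print(f"Recall:   {recall:.2f}")    # 0.00
```

The accuracy looks respectable, yet not a single diseased patient is detected.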

In this task, you will:

- Simulate a highly imbalanced dataset to mimic real-world scenarios in biomedicine, such as disease diagnosis where the positive cases (disease presence) are much rarer than negative cases (no disease).
- Train a logistic regression model and generate ROC and PR curves.
- Analyze the differences between ROC and PR curves for imbalanced datasets, and explain which is more appropriate for imbalanced biomedical problems like rare disease detection.

**Steps**

- Generate an imbalanced dataset: Create a dataset with 1,000 samples and 20 features, where 90% of the samples belong to the negative class (representing healthy individuals), and 10% belong to the positive class (representing diseased individuals).
- Split the data: Divide the data into training and test sets.
- Train a logistic regression model: Use scikit-learn's `LogisticRegression` to train a model.
- Plot the ROC curve: Use `roc_curve` from `sklearn.metrics` to plot the ROC curve.
- Plot the PR curve: Use `precision_recall_curve` from `sklearn.metrics` to plot the PR curve. On both plots, draw a dashed line indicating the performance of a random classifier.
- Compare the curves: Explain why the PR curve may give more insight than the ROC curve in biomedical problems with imbalanced datasets, where detecting the minority class (diseased patients) is more important than simply maximizing overall accuracy.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    auc,
    average_precision_score,
    precision_recall_curve,
    roc_curve,
)
from sklearn.model_selection import train_test_split


# Generate imbalanced dataset.
# TODO
X, y = None
X_train, X_test, y_train, y_test = None

# Train logistic regression model.
# TODO
model = None

# Predict on the test set.
# TODO
y_proba = None

# Plot ROC and PR curves.
# TODO
fpr, tpr, _ = None
roc_auc = None
precision, recall, _ = None
ap = None

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12.8, 4.8))

axes[0].plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
axes[0].set_xlim(-0.05, 1.05)
axes[0].set_ylim(-0.05, 1.05)
axes[0].set_xlabel("False positive rate")
axes[0].set_ylabel("True positive rate")
axes[0].set_title("ROC curve")
axes[0].legend(loc="lower right")

axes[1].plot(recall, precision, label=f"PR curve (AP = {ap:.2f})")
axes[1].set_xlim(-0.05, 1.05)
axes[1].set_ylim(-0.05, 1.05)
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].set_title("PR curve")
axes[1].legend()

# Draw dashed lines indicating random performance.
# TODO

plt.show()
plt.close()

# Analyze the differences between ROC and PR curves for imbalanced datasets.

## Task: Data leakage due to feature selection outside of cross-validation

In biomedical research, especially when working with molecular data (e.g., gene expression, proteomics, or metabolomics), it is common to have datasets with many more features (e.g., thousands of genes, proteins, or metabolites) than samples (e.g., dozens to hundreds of patients).
For example, when developing a classifier for diagnostic purposes, such as distinguishing between patients with and without a disease based on their molecular profiles, the feature set can be vast, but the number of samples may be limited due to the difficulty and cost of collecting biological data.

Feature selection is a crucial step in such situations, as models trained on all available features might suffer from overfitting due to the high dimensionality of the data.
The goal of feature selection is to reduce the feature space to only the most informative features, improving model generalization and interpretability.

There are different approaches to feature selection:

- Filter methods: Select features based on statistical measures, such as correlation or mutual information with the target variable, before training the model.
- Wrapper methods: Select features by evaluating model performance on different subsets of features, iteratively optimizing the subset.
- Embedded methods: Perform feature selection as part of the model training process, such as Lasso (L1 regularization).
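
The three families can be sketched with scikit-learn as shown below. The synthetic dataset, the number of selected features, and the specific estimators (`RFE` as the wrapper, an L1-penalized `LinearSVC` as the embedded method) are assumptions chosen for illustration. The selectors are fit on the full dataset here only to show the API; when estimating performance, selection belongs inside the cross-validation loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic data (illustrative only): 30 features, 5 of them informative.
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)

# Filter: rank features by ANOVA F-score, independent of any model.
filter_selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper: recursive feature elimination, repeatedly refitting a model
# and dropping the weakest feature until 5 remain.
wrapper_selector = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=5
).fit(X, y)

# Embedded: the L1 penalty drives some coefficients to zero during training.
embedded_selector = SelectFromModel(
    LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000)
).fit(X, y)

for name, selector in [
    ("Filter", filter_selector),
    ("Wrapper", wrapper_selector),
    ("Embedded", embedded_selector),
]:
    print(f"{name}: {np.flatnonzero(selector.get_support())}")
```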

For this task, we will use **filter-based feature selection** with the ANOVA F-test, which selects features based on their ability to differentiate between the classes.
However, it is important to perform this selection properly to avoid data leakage.

If feature selection is done **before** cross-validation, information about the test data can "leak" into the training process, resulting in overly optimistic performance estimates.
This is a common mistake in molecular data analysis, and even in the scientific literature, leading to classifiers that appear to work well but fail when applied to new, unseen data.

In this task, you will:

- Implement a naive feature selection approach, where feature selection is done outside of cross-validation (incorrect), and observe inflated performance.
- Implement the correct feature selection approach, where feature selection is applied within cross-validation, and compare the results.

**Steps**

- Simulate data: Generate a random dataset with 100 samples and 10,000 features. Such a large number of features mimics real-world molecular datasets where the majority of features (genes, proteins, metabolites) may not be relevant to the classification task.
- Perform naive feature selection (outside cross-validation):
    - Use `SelectKBest` with the ANOVA F-test to select the top 50 features **before** cross-validation.
    - Train a logistic regression model using these selected features and evaluate its performance using cross-validation.
    - Inspect the cross-validation scores and observe that the performance will likely appear better than random, even though the data is purely random.
- Perform correct feature selection (inside cross-validation):
    - Use a `Pipeline` to combine feature selection and logistic regression, ensuring that feature selection happens inside cross-validation, preventing data leakage.
    - Compare the cross-validation scores with the naive approach and observe that the performance is now close to random, as expected with random data.

In [None]:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline


# Generate random dataset.
# 100 samples, 10,000 features (many more features than samples).
np.random.seed(42)
X = np.random.rand(100, 10000)  # 100 samples, 10,000 random features.
y = np.random.randint(0, 2, 100)  # Random binary labels (0 or 1).

# Define cross-validation scheme.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Naive feature selection outside CV (incorrect).
# Select the top 50 features before cross-validation.
# TODO
selector = None
X_selected = None

# Train a logistic regression model and
# evaluate performance using cross-validation.
model = LogisticRegression()
scores_naive = cross_val_score(model, X_selected, y, cv=cv)

# Print the performance.
print(f"Naive feature selection (outside CV) accuracy: {np.mean(scores_naive):.2f}")

# Correct feature selection inside CV.
# Use a Pipeline to ensure feature selection is done inside each CV fold.
# TODO
pipeline = None

# Evaluate performance using cross-validation.
scores_correct = cross_val_score(pipeline, X, y, cv=cv)

# Print the performance.
print(f"Correct feature selection (inside CV) accuracy: {np.mean(scores_correct):.2f}")

**Expected outcome**

When feature selection is done outside of cross-validation, you may observe an accuracy much higher than expected, even though the dataset is purely random.
This happens because information about the test set "leaks" into the training process, causing the model to appear better than it really is.

When feature selection is done inside cross-validation (i.e., the correct way), the accuracy should drop significantly, reflecting the fact that the dataset is random and the model cannot find meaningful patterns.
The performance will be closer to random guessing (around 50%).

Remember the **prime directive of machine learning**:

> Only evaluate your models on data that has never been used for training.

Performing feature selection on the full dataset violates this rule, even though we subsequently use cross-validation to train and evaluate our model.
This leads to data leakage, which is a subtle but common issue.
In this case, the features were ranked using the labels of all samples, including those that later served as test folds, so class information from the test data leaked into model training.