# Applied Machine Learning: In-class Exercise 02-1

## Goal

After this exercise, you should be able to control the resampling process when using scikit-learn in order to account for data specificities, such as class imbalances in classification settings or grouping phenomena. Further, you will have learned how to construct and utilize custom metrics for performance evaluation within scikit-learn.

## Prerequisites

We load the most important libraries and use a fixed seed for reproducibility:

In [1]:
import numpy as np
import pandas as pd

rng = np.random.default_rng(7832)

## 1 Stratified Resampling

In classification tasks, the class proportions of the target should be similar in each train/test split, which is achieved by stratification. This is particularly useful for imbalanced classes and small datasets.
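To make the effect concrete, the following minimal sketch compares plain and stratified 5-fold splitting on a hypothetical toy target with 90 majority and 10 minority labels. The names `y_toy`, `X_toy`, and `minority_counts` are illustrative assumptions and not part of the exercise data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy target: 90 majority-class and 10 minority-class labels.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for splitting


def minority_counts(splitter):
    """Return the number of minority-class samples in each test fold."""
    return [int(y_toy[test].sum()) for _, test in splitter.split(X_toy, y_toy)]


plain_counts = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold minority counts per fold:          ", plain_counts)
print("StratifiedKFold minority counts per fold:", strat_counts)
```

With stratification, every test fold receives exactly 2 of the 10 minority samples, whereas the plain split may distribute them unevenly.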

Stratification can also be performed with respect to explanatory categorical variables to ensure that all subgroups are represented in both training and test sets.

In `scikit-learn`, stratified sampling is performed using `StratifiedKFold`, which ensures that each fold maintains the proportion of classes found in the complete dataset. You can stratify based on the target variable directly, or, if needed, by categorical explanatory variables by creating custom stratification groups.

We illustrate this using the German Credit dataset:

In [2]:
from sklearn.datasets import fetch_openml

X, y = fetch_openml("credit-g", version=1, as_frame=True, return_X_y=True)

## 1.1 & 1.2 Create StratifiedKFold

We use the classification target as the stratification variable and instantiate a `StratifiedKFold` object.

<details><summary>Hint 1:</summary>
    Use `pd.Series.value_counts()` to see the strata information, i.e., the class distribution.
</details>

In [3]:
#===SOLUTION===

from sklearn.model_selection import StratifiedKFold


# Display stratification information: the number of observations in each class of the target.
print(f"Strata information:\n{y.value_counts()}\n")

# Create a 3-fold stratified cross validation
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=7832)

Strata information:
class
good    700
bad     300
Name: count, dtype: int64



## 1.3 Sanity Check

We check if the classes are distributed similarly within each CV fold, matching the overall class distribution:

In [4]:
#===SOLUTION===


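for i, (train_indices, test_indices) in enumerate(cv.split(X, y)):
    # Use positional indexing, since split() returns integer positions.
    y_test_fold = y.iloc[test_indices]
    num_pos_class = (y_test_fold == 'good').sum()
    num_neg_class = (y_test_fold != 'good').sum()
    # Ratio of 'bad' to 'good' in the test fold; overall it is 300/700 ≈ 0.4286.
    class_ratio = num_neg_class / num_pos_class
    print(f"Fold {i}: {class_ratio:.4f}")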

Fold 0: 0.4274
Fold 1: 0.4292
Fold 2: 0.4292
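Section 1 notes that stratification can also take categorical explanatory variables into account. One possible way to do this is to combine the target and the feature into a single stratification label and pass that label to `split`. The sketch below uses a hypothetical toy frame (the column names `target` and `housing` and their values are assumptions for illustration only), not the German Credit data.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: a binary target and one categorical feature.
df = pd.DataFrame({
    "target": ["good"] * 60 + ["bad"] * 30,
    "housing": (["own"] * 40 + ["rent"] * 20) + (["own"] * 10 + ["rent"] * 20),
})

# Combine target and feature into one stratification label per row.
strata = df["target"] + "_" + df["housing"]

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_tables = []
for _, test_idx in cv_toy.split(df, strata):
    # Count how often each combined stratum occurs in this test fold.
    fold_tables.append(strata.iloc[test_idx].value_counts().sort_index())

# One column per fold, one row per combined stratum.
print(pd.concat(fold_tables, axis=1, keys=range(len(fold_tables))))
```

Because every stratum size in this toy example is divisible by the number of folds, each test fold contains the same count per stratum. With real data the counts can differ by one between folds.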


## 2 Block Resampling

An additional concern when specifying resampling is respecting the natural grouping of the data. Blocking refers to the situation where subsets of observations belong together and must not be separated during resampling. Hence, for one train/test set pair the entire block is either in the training set or in the test set.

In the following example, we will consider the Breast Cancer dataset from OpenML:

In [5]:
from sklearn.datasets import fetch_openml

# Load the breast cancer dataset from openml with dataset ID 46591
X, y = fetch_openml(data_id=46591, as_frame=True, return_X_y=True)

In this dataset, multiple samples may have the same patient identifier (column "Sample code number"), 
which implies these are samples taken from the same patient at different times.

## 2.1 Count groups

Let's count how many Ids occur more than once:

In [6]:
#===SOLUTION===

id_col = 'Sample code number'
id_counts = X[id_col].value_counts()
num_multiple = (id_counts > 1).sum()
print(f"Number of Ids with more than one observation: {num_multiple}")

Number of Ids with more than one observation: 46


The model trained on this dataset will be used to predict the cancer status of new patients. Hence, we have to make sure that each Id occurs in exactly one fold, so that all observations with the same Id are used either for training or for evaluating the model, but never for both. Otherwise, the model would be evaluated on patients it has already seen during training, which tends to make the performance estimate overly optimistic. Keeping Ids together gives less biased performance estimates via k-fold cross-validation. This can be achieved by block cross-validation.
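The following sketch illustrates the problem with ordinary k-fold splitting under stated assumptions: a hypothetical toy dataset of 20 patients with 3 visits each, stored visit round by visit round, and an unshuffled `KFold`. It counts how many patients end up on both sides of each split.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 20 patients, 3 visits each, stored round by round
# (patients 0..19 for visit 1, then 0..19 for visit 2, and so on).
groups = np.tile(np.arange(20), 3)
X_toy = np.zeros((len(groups), 1))

kf = KFold(n_splits=5)  # ordinary k-fold, unaware of the patient Ids
leaked_per_fold = []
for train_idx, test_idx in kf.split(X_toy):
    # Patients that occur in both the training and the test set.
    shared = np.intersect1d(groups[train_idx], groups[test_idx])
    leaked_per_fold.append(len(shared))

print("Patients shared between train and test, per fold:", leaked_per_fold)
# → Patients shared between train and test, per fold: [12, 12, 12, 12, 12]
```

In this toy setup, every patient in a test fold also has observations in the corresponding training set, which is exactly what block resampling is meant to prevent.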

## 2.2 & 2.3 Instantiate GroupKFold

In scikit-learn, block resampling is performed using `GroupKFold`. This method ensures that observations belonging to the same group, identified by the grouping variable (e.g., Id), are kept together within the same fold. Below we set up block resampling using the Id column as the grouping variable. 

Then we check which fold each sample is assigned to, i.e., we print a table with the two columns "Sample code number" and "fold".

In [7]:
#===SOLUTION===

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)

# Create a mapping from each group (Id) to a fold number.
fold_frames = []
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=X[id_col]), start=1):
    unique_ids = X.iloc[test_idx][id_col].unique()
    fold_frames.append(pd.DataFrame({id_col: unique_ids, "fold": fold}))
# Concatenate once instead of growing a DataFrame inside the loop.
fold_assignments = pd.concat(fold_frames, ignore_index=True)
fold_assignments = fold_assignments.sort_values(by=id_col).reset_index(drop=True)
print("Block resampling instance (group to fold mapping):")
fold_assignments

Block resampling instance (group to fold mapping):


| | Sample code number | fold |
|---|---|---|
| 0 | 61634 | 4 |
| 1 | 63375 | 4 |
| 2 | 76389 | 3 |
| 3 | 95719 | 2 |
| 4 | 128059 | 1 |
| ... | ... | ... |
| 640 | 1369821 | 1 |
| 641 | 1371026 | 5 |
| 642 | 1371920 | 4 |
| 643 | 8233704 | 1 |


## 2.4 Sanity check

If the specified blocking groups are respected, each Id appears in exactly one fold. To inspect whether blocking was successful when generating the folds, count how often each Id appears in each fold and print any Ids that appear in more than one fold:

In [8]:
#===SOLUTION===

id_counts = fold_assignments[id_col].value_counts()

# Identify any Id that appears more than once.
duplicates = id_counts[id_counts > 1]

if duplicates.empty:
    print("Sanity check passed: Each Id appears in exactly one fold.")
else:
    print("Sanity check failed: Some Ids appear in more than one fold:")
    print(duplicates)

Sanity check passed: Each Id appears in exactly one fold.
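An alternative check, sketched here on hypothetical toy groups rather than the exercise data, compares the sets of group labels in the training and test indices of every split directly. The names `sizes`, `groups`, and `gkf_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 20 groups with 1 to 4 observations each.
sizes = np.array([1, 2, 3, 4] * 5)
groups = np.repeat(np.arange(20), sizes)
X_toy = np.zeros((len(groups), 1))

gkf_toy = GroupKFold(n_splits=5)
overlaps = []
for train_idx, test_idx in gkf_toy.split(X_toy, groups=groups):
    # Groups that appear on both sides of this split (should be none).
    overlaps.append(set(groups[train_idx]) & set(groups[test_idx]))

all_disjoint = all(len(o) == 0 for o in overlaps)
print("Train and test groups disjoint in every fold:", all_disjoint)
# → Train and test groups disjoint in every fold: True
```

This variant inspects the splits themselves, so it does not depend on how the group-to-fold table was assembled.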


## 3 Custom Performance Measures

Many domain applications require custom measures for performance evaluations not supported in scikit-learn. You can inspect all available measures by checking the scikit-learn documentation on metrics (`sklearn.metrics`). Luckily, you can design your own measures for evaluating model performance. To do so, we simply define a custom function that takes as input the true and predicted values and returns the performance measure.

Let's see how this works in practice. Consider a regression measure that scores a prediction as 1 if the absolute difference between the true and predicted values is less than one standard deviation of the true values, and scores the prediction as 0 otherwise. In mathematical terms, this would be defined as $f(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(|y_i - \hat{y}_i| < \sigma_y)$, where $\sigma_y$ is the standard deviation of the true values and $\mathbb{I}$ is the indicator function.

To create such a measure in Python, we implement a performance measure as follows:

<details><summary>Hint 1:</summary>
To create a custom performance measure, define a function that takes the true values and the predicted values as input and returns a single value that summarizes the model's performance. This value can be a score similar to accuracy, but it can also be a loss. We can then use `make_scorer` from `sklearn.metrics` to convert this function into a scorer object that can be used with scikit-learn's cross-validation and model selection tools.
</details>

<details><summary>Hint 2:</summary>
In scikit-learn, there are two types of evaluation metrics: score functions and loss functions. Score functions (like accuracy, $R^2$) return higher values for better models - we want to maximize these. Loss functions (like MSE, MAE) return lower values for better models - we want to minimize these. When using `make_scorer`, you need to specify `greater_is_better=True` for score functions (the default) and `greater_is_better=False` for loss functions. Alternatively, you can negate a loss function to convert it to a score function.
</details>
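The sign convention from Hint 2 can be checked on a small example. This sketch assumes hypothetical toy data and a `DummyRegressor` that always predicts the training mean. It wraps `mean_absolute_error` with `greater_is_better=False` and compares the scorer's value with the raw loss.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_absolute_error

# Hypothetical toy data; DummyRegressor always predicts the training mean (3.0).
X_toy = np.zeros((4, 1))
y_toy = np.array([1.0, 2.0, 3.0, 6.0])
model_toy = DummyRegressor(strategy="mean").fit(X_toy, y_toy)

# Raw loss: mean of |1-3|, |2-3|, |3-3|, |6-3| = 6 / 4 = 1.5
mae = mean_absolute_error(y_toy, model_toy.predict(X_toy))

# The scorer negates the loss so that "higher is better" holds for scorers.
loss_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
score = loss_scorer(model_toy, X_toy, y_toy)

print(f"MAE:          {mae:.2f}")    # → MAE:          1.50
print(f"Scorer value: {score:.2f}")  # → Scorer value: -1.50
```

Note that a scorer is called with `(estimator, X, y)`, whereas the underlying metric is called with `(y_true, y_pred)`.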

In [9]:
from numpy.typing import ArrayLike
from sklearn.metrics import make_scorer


def threshold_accuracy(y_true: ArrayLike, y_pred: ArrayLike) -> float:
    """
    Compute the threshold accuracy for regression.

    For each observation, if the absolute difference between y_true and y_pred
    is less than the sample standard deviation of y_true, the prediction is
    considered accurate (scored as 1), otherwise 0. The final score is the mean
    of these indicators.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        True target values.
    y_pred : array-like of shape (n_samples,)
        Predicted target values.

    Returns
    -------
    accuracy : float
        The threshold accuracy, a value between 0 and 1.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    std_y = np.std(y_true, ddof=1)

    # If the standard deviation is 0 (all values are identical),
    # define accuracy as the fraction of perfect predictions.
    if std_y == 0:
        return np.mean(y_true == y_pred)

    # Compute the indicator: 1 if the prediction is within std_y, else 0.
    indicator = np.abs(y_true - y_pred) < std_y
    return np.mean(indicator)


# Create a scorer object for use with scikit-learn's model evaluation tools
threshold_accuracy_score = make_scorer(threshold_accuracy)
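As a worked example of the formula, the following sketch recomputes the measure step by step on five hypothetical values (chosen for illustration only), so that each intermediate quantity can be inspected.

```python
import numpy as np

# Hypothetical true and predicted values.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.5, 4.0, 3.0, 2.0, 5.2])

std_y = np.std(y_true, ddof=1)     # sample standard deviation: sqrt(2.5), about 1.58
abs_err = np.abs(y_true - y_pred)  # approximately [0.5, 2.0, 0.0, 2.0, 0.2]
hits = abs_err < std_y             # [True, False, True, False, True]
score = hits.mean()                # 3 of 5 predictions lie within std_y

print(f"std_y = {std_y:.4f}, score = {score:.2f}")
# → std_y = 1.5811, score = 0.60
```

The two predictions that miss by 2.0 exceed the threshold of about 1.58, so they contribute 0 to the mean.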

## 3.1 MSE-MAE

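Define your own risk measure for regression, the mean of the per-observation maximum of the squared error and the absolute error: $f(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \max\left((y_i - \hat{y}_i)^2, |y_i - \hat{y}_i|\right)$, using the Python scoring function approach described above. Note that the maximum is taken for each observation before averaging, so the result is in general not equal to the maximum of MSE and MAE.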

<details><summary>Hint 1:</summary>
    The function's signature is the same as above.
</details>

In [10]:
#===SOLUTION===

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer
import statsmodels.api as sm


def max_mse_mae(y_true: ArrayLike, y_pred: ArrayLike) -> float:
    """
    Compute the custom risk measure defined as the average of the maximum
    between the squared error and absolute error for each observation.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        True target values.
    y_pred : array-like of shape (n_samples,)
        Predicted target values.

    Returns
    -------
    risk_measure : float
        The custom risk measure.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    squared_error = (y_true - y_pred) ** 2
    absolute_error = np.abs(y_true - y_pred)
    loss = np.maximum(squared_error, absolute_error)
    return np.mean(loss)


# Create a scorer object for use with scikit-learn's model evaluation tools
# Note: higher is better for scorers, but our metric is a loss (lower is better)
max_mse_mae_scorer = make_scorer(max_mse_mae, greater_is_better=False)
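The difference between averaging per-observation maxima and taking the maximum of the two averaged errors can be seen on a tiny hypothetical example with two observations:

```python
import numpy as np

# Hypothetical values: one small error (0.5) and one large error (2.0).
y_true = np.array([0.0, 0.0])
y_pred = np.array([0.5, 2.0])

sq_err = (y_true - y_pred) ** 2    # [0.25, 4.0]
abs_err = np.abs(y_true - y_pred)  # [0.5, 2.0]

# Measure from the exercise: take the maximum per observation, then average.
mean_of_max = np.maximum(sq_err, abs_err).mean()   # (0.5 + 4.0) / 2 = 2.25

# For comparison: maximum of the two averaged errors, max(MSE, MAE).
max_of_means = max(sq_err.mean(), abs_err.mean())  # max(2.125, 1.25) = 2.125

print(mean_of_max, max_of_means)
# → 2.25 2.125
```

For errors below 1 the absolute error dominates, for errors above 1 the squared error dominates, which is why the two quantities differ here.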


## 3.2 Evaluate a custom measure

Use your custom scoring function with scikit-learn's `make_scorer` utility to evaluate the performance of the following model prediction:

In [11]:
data_mtcars = sm.datasets.get_rdataset("mtcars", "datasets")
df_mtcars = data_mtcars.data

# For this example, we treat 'mpg' (miles per gallon) as the target variable.
X = df_mtcars.drop(columns=["mpg"])
y = df_mtcars["mpg"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7832)

model = RandomForestRegressor(random_state=7832)
model.fit(X_train, y_train)

In [12]:
#===SOLUTION===

# Evaluate the predictions with the custom risk measure.
# The scorer was created with greater_is_better=False, so it returns the negated loss.
score_value = max_mse_mae_scorer(model, X_test, y_test)
print(f"Custom max MSE-MAE score: {score_value:.5f}")

Custom max MSE-MAE score: -4.76320
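A scorer created with `make_scorer` can also be passed to cross-validation utilities through their `scoring` argument. The sketch below is self-contained and therefore redefines the loss under the hypothetical name `mean_max_error`. The synthetic data and the choice of `LinearRegression` are assumptions for illustration and unrelated to the mtcars example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold, cross_val_score


def mean_max_error(y_true, y_pred):
    """Mean of the per-observation maximum of squared and absolute error."""
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return float(np.mean(np.maximum(err ** 2, err)))


# Hypothetical synthetic regression data: linear signal plus noise.
rng_toy = np.random.default_rng(0)
X_toy = rng_toy.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng_toy.normal(scale=0.5, size=100)

# greater_is_better=False makes the scorer return the negated loss.
scorer = make_scorer(mean_max_error, greater_is_better=False)
cv_toy = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_toy, y_toy, scoring=scorer, cv=cv_toy)

print("Per-fold scores (negated loss):", np.round(scores, 3))
print("Mean loss across folds:", -scores.mean())
```

Since the scorer negates the loss, all per-fold scores are non-positive, and flipping the sign of their mean recovers the average loss.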


## Summary

1. Stratified resampling preserves the class proportions of the target (and, if desired, of categorical features) within the CV folds, so that each fold represents the data well.

2. Block resampling (`GroupKFold`) reduces bias in generalization error estimates by ensuring that observations from the same group end up in the same fold.

3. Many domain applications require custom performance measures. In scikit-learn, you can define one by implementing a function of the true and predicted values and wrapping it with the `make_scorer` utility.