Trust but Verify:
Cross-Validation, Metric Selection, and Class Imbalance

Learning Objectives
• Apply cross-validation (KFold, StratifiedKFold)
• Run basic hyperparameter search
• Choose metrics that match the task and class balance (ROC-AUC, PR-AUC, F1, MCC)
• Detect and prevent data leakage
• Handle imbalance with class weights, resampling, and threshold tuning
• Document seeds, splits, and pipelines for reproducibility

Why “Trust but Verify”?
• Models can look strong for the wrong reasons
• Evaluation is easy to accidentally “cheat”
• A single bad choice (leakage, the wrong metric) can inflate reported scores by tens of percentage points

Accuracy can be misleading
Accuracy = (TP + TN) / (TP + TN + FP + FN)

When classes are imbalanced:
• Predicting “all negative” can look “accurate”
• The model can be useless for the minority class

In [1]:
import numpy as np

# 95 negatives, 5 positives
y_true = np.array([0]*95 + [1]*5)

# model predicts all zeros
y_pred = np.zeros_like(y_true)

acc = (y_true == y_pred).mean()
acc


np.float64(0.95)

Confusion Matrix (binary classification)

                Pred 0      Pred 1
Actual 0           TN         FP
Actual 1           FN         TP

Key idea:
• Errors come in two types: false positives and false negatives
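
A minimal sketch of reading the four cells with scikit-learn's confusion_matrix, reusing the 95/5 all-negative example from the accuracy slide. Variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Same toy setup as the accuracy example: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)  # model predicts all zeros

# With labels=[0, 1], rows are actual classes and columns are predictions,
# so ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=95 FP=0 FN=5 TP=0
```

All five positives land in the FN cell, which accuracy alone does not reveal.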

Core classification metrics
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 * (Precision * Recall) / (Precision + Recall)

Interpretation
• Precision: “When I predict positive, how often am I right?”
• Recall:    “Of all real positives, how many did I catch?”
• F1:        balance of precision and recall

In [2]:

from sklearn.metrics import precision_score, recall_score, f1_score

# The model predicts no positives, so precision is 0/0.
# zero_division=0 reports it as 0 instead of raising an UndefinedMetricWarning.
precision_score(y_true, y_pred, zero_division=0), recall_score(y_true, y_pred), f1_score(y_true, y_pred)


(0.0, 0.0, 0.0)

ROC-AUC (Receiver Operating Characteristic)
• Uses predicted scores, not just hard labels
• Varies the classification threshold
• Plots:
  TPR (Recall) vs FPR

AUC meaning:
• Probability the model ranks a random positive above a random negative
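
The ranking interpretation can be checked numerically. This is a sketch on synthetic scores (the score distributions are an assumption chosen for illustration): it compares roc_auc_score with the fraction of positive/negative pairs that are ordered correctly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores: positives score higher on average than negatives
y_true = np.array([0] * 900 + [1] * 100)
scores = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(1.0, 1.0, 100)])

auc = roc_auc_score(y_true, scores)

# Fraction of (positive, negative) pairs where the positive gets the
# higher score; ties count as one half
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(round(auc, 4), round(pairwise, 4))  # the two numbers agree
```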


Precision–Recall (PR) curve
• Focuses on the positive class
• Better diagnostic when positives are rare

PR-AUC meaning:
• Average precision across recall levels
• Sensitive to false positives when positives are rare

Rule of thumb:
• If the positive class is rare → consider PR-AUC
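
One way to see why the rule of thumb holds: a no-skill model scores about 0.5 on ROC-AUC at any class balance, while its PR-AUC sits near the positive rate. A sketch with random scores and an assumed 1% positive rate:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# 1% positives; scores are pure noise, so the "model" has no skill
y_true = np.array([0] * 9900 + [1] * 100)
random_scores = rng.random(y_true.size)

prevalence = y_true.mean()
roc_random = roc_auc_score(y_true, random_scores)
ap_random = average_precision_score(y_true, random_scores)

print("positive rate:", prevalence)
print("ROC-AUC, random scores:", round(roc_random, 3))
print("PR-AUC,  random scores:", round(ap_random, 3))
```

A PR-AUC of 0.20 is therefore far above chance at 1% positives, while a ROC-AUC of 0.60 is only modestly above it.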

MCC (Matthews Correlation Coefficient)
• Uses all four confusion matrix cells
• Works well with imbalance
• Range: -1 to +1
  +1 perfect prediction, 0 no better than chance, -1 every prediction wrong

MCC formula:
(TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
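
A small sketch that evaluates the formula by hand and compares it with scikit-learn's matthews_corrcoef. The predictions are hypothetical, chosen so that all four cells are non-zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Hypothetical predictions on an imbalanced set (90 negatives, 10 positives)
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 85 + [1] * 5 + [0] * 4 + [1] * 6)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Direct translation of the formula; treat MCC as 0 when a factor is zero
denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
mcc_manual = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
mcc_sklearn = matthews_corrcoef(y_true, y_pred)

print(round(mcc_manual, 4), round(mcc_sklearn, 4))
```

Accuracy on this example is 0.91, while MCC is close to 0.52, which reflects the weak performance on the positive class.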


Choosing the right metric
Ask:
• What type of error is worse? FP or FN?
• Are classes imbalanced?
• Do we need calibrated probabilities or just ranking?
• Do we care about performance at a specific threshold?


Why cross-validation (CV)?
A single train/test split:
• depends on the random seed
• can be unusually easy or hard
• gives a high-variance estimate

Cross-validation:
• repeats evaluation across multiple splits
• reduces variance
• produces a more reliable estimate

K-Fold Cross-Validation
1) Split data into K folds
2) For each fold:
   • train on K-1 folds
   • validate on the remaining fold
3) Average the scores

Benefits:
• Uses data efficiently
• Reduces dependence on one split
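
The three steps can be written out as an explicit loop. This is a sketch on synthetic data; in practice cross_val_score wraps the same logic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1) Split data into K folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                    # 2a) train on K-1 folds
    fold_scores.append(model.score(X[val_idx], y[val_idx]))  # 2b) validate on the held-out fold

# 3) Average the scores
print("per-fold accuracy:", np.round(fold_scores, 3))
print("mean accuracy:", round(float(np.mean(fold_scores)), 3))
```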

StratifiedKFold (classification)
Goal:
• preserve class proportions in each fold

Why it matters:
• avoids folds with “almost no positives”
• makes metrics comparable across folds
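
A quick sketch of the difference, counting positives per validation fold on toy labels with 5% positives (the features are placeholders, since only the labels matter for splitting):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 200 samples, 10 of them positive
y = np.array([0] * 190 + [1] * 10)
X = np.zeros((y.size, 1))

def positives_per_fold(splitter):
    """Count positive labels in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold:          ", plain)  # counts depend on the shuffle
print("StratifiedKFold:", strat)  # 2 positives in every fold
```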


What does CV estimate?
• An estimate of generalization performance
• Under the assumption data is i.i.d. (independent and identically distributed)

CV does NOT fix:
• data leakage
• dataset shift
• bad features
• label noise



Data leakage = using information in training that would not be available at prediction time

Common leakage sources:
• fitting preprocessing on all data before splitting
• target leakage features (post-outcome variables)
• duplicates across train and test
• time leakage (future information in the past)
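
For the time-leakage case, one option is scikit-learn's TimeSeriesSplit, which only ever validates on rows that come after the training rows. The slides do not name this splitter, so treat it as a suggested sketch; it assumes the rows are already sorted chronologically.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations, assumed to be in chronological order
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, val_idx in splits:
    # Every training index precedes every validation index,
    # so the model never sees the future
    print("train:", train_idx, "validate:", val_idx)
```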

Leakage demo: scaling before CV

Wrong approach:
• scale on full dataset
• then do cross-validation

Right approach:
• use Pipeline so scaling happens inside each fold

In [3]:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.95, 0.05], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# WRONG: scaling on full data before CV
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # leakage
wrong_scores = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=cv, scoring="f1")

# RIGHT: pipeline scales inside each fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000))
])
right_scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")

wrong_scores.mean(), right_scores.mean()



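(np.float64(0.0), np.float64(0.0))

Note:
• Both means are 0.0: without class weights the model finds no true positives at the 0.5 threshold, so F1 cannot separate the two approaches on this data
• The Pipeline version is still the correct pattern; a threshold-free scorer such as "roc_auc" or "average_precision" would at least give nonzero scores to compare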

Reproducibility checklist
• Set random_state seeds
• Record split strategy (KFold vs StratifiedKFold)
• Use Pipelines for preprocessing + model
• Report mean ± std across folds
• Keep a final untouched test set (optional but recommended)
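
A sketch of the checklist in code: one recorded seed, a stratified split, a Pipeline, and a mean ± std report. The SEED name and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42  # hypothetical name; record whatever seed you actually use

X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=SEED
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

def run():
    """One full evaluation with the fixed seed, split, and pipeline."""
    return cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")

scores = run()
print(f"ROC-AUC: {scores.mean():.3f} ± {scores.std():.3f} over {cv.get_n_splits()} folds")

# Same seed, same splits, same pipeline: a rerun should reproduce the scores
reproduced = np.allclose(scores, run())
print("rerun matches:", reproduced)
```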

Hyperparameter search
Goal:
• choose model settings using ONLY training data

Two common methods:
• GridSearchCV: tries every combination
• RandomizedSearchCV: samples combinations

Important:
• score every candidate with CV on the training data (GridSearchCV and RandomizedSearchCV do this internally)
• do NOT tune on the test set

In [None]:
# Reuses X, y, cv and the Pipeline, StandardScaler, LogisticRegression imports from the leakage demo
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced"))
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"]
}

search = GridSearchCV(
    pipe,
    param_grid=param_grid,
    scoring="f1",
    cv=cv,
    n_jobs=-1
)

search.fit(X, y)
search.best_params_, search.best_score_
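
RandomizedSearchCV follows the same pattern but samples candidates from a distribution instead of enumerating a grid. A self-contained sketch; the log-uniform range for C, n_iter=10, and the smaller synthetic dataset are illustrative choices, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

# Sample C on a log scale between 0.001 and 100
param_distributions = {"clf__C": loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=10,                    # number of sampled candidates
    scoring="average_precision",
    cv=cv,
    random_state=42,              # makes the sampled candidates repeatable
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```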

Class imbalance changes:
• metrics (accuracy becomes less useful)
• training (the model may ignore the minority class)
• decision threshold (0.5 may be wrong)

Three simple fixes:
• class weights
• resampling (over/under-sampling)
• threshold tuning


Fix #1: class weights
Idea:
• penalize mistakes on minority class more heavily
Common:
• class_weight="balanced" (scikit-learn)

Use when:
• you want a simple, low-risk improvement
• you do not want to change the dataset size


In [4]:

from sklearn.model_selection import cross_validate

pipe_unweighted = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000))
])

pipe_weighted = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced"))
])

scoring = {"f1": "f1", "roc_auc": "roc_auc", "average_precision": "average_precision"}

res1 = cross_validate(pipe_unweighted, X, y, cv=cv, scoring=scoring)
res2 = cross_validate(pipe_weighted, X, y, cv=cv, scoring=scoring)

{m: res1["test_"+m].mean() for m in scoring}, {m: res2["test_"+m].mean() for m in scoring}

({'f1': np.float64(0.0),
  'roc_auc': np.float64(0.782779738584488),
  'average_precision': np.float64(0.18361580994336668)},
 {'f1': np.float64(0.24312387390733486),
  'roc_auc': np.float64(0.7979325863230876),
  'average_precision': np.float64(0.1504554393162627)})

Fix #2: resampling
• Oversampling: duplicate or synthesize minority examples
• Undersampling: drop majority examples

Pros:
• can help models learn minority patterns

Cons:
• risk of overfitting (oversampling)
• loss of information (undersampling)

Important:
• resampling must happen inside CV folds (Pipeline!)
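
A sketch of random oversampling done inside each fold. scikit-learn's own Pipeline does not resample rows, so this version uses an explicit loop with sklearn.utils.resample; the key point is that only the training portion of each fold is resampled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.95, 0.05], random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_f1 = []
for train_idx, val_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Oversample the minority class using TRAINING rows only
    minority = np.flatnonzero(y_tr == 1)
    majority = np.flatnonzero(y_tr == 0)
    extra = resample(
        minority, replace=True,
        n_samples=len(majority) - len(minority), random_state=42
    )
    keep = np.concatenate([majority, minority, extra])

    model = LogisticRegression(max_iter=2000).fit(X_tr[keep], y_tr[keep])

    # The validation fold keeps its original class balance
    fold_f1.append(f1_score(y[val_idx], model.predict(X[val_idx])))

print("F1 per fold:", np.round(fold_f1, 3))
```

Resampling before the split would copy minority rows into both training and validation folds, which is the duplicate form of leakage.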


Fix #3: decision-threshold tuning
Default threshold = 0.5 is not sacred

Workflow:
1) Train a model that outputs probabilities or scores
2) Choose a threshold based on a metric or constraint
   • maximize F1
   • target recall ≥ 0.90
   • control false positive rate



In [5]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
import numpy as np

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced"))
])

pipe.fit(X_tr, y_tr)
# Illustration only: in practice, pick the threshold on a validation split
# or with CV, not on the test set
probs = pipe.predict_proba(X_te)[:, 1]

prec, rec, thr = precision_recall_curve(y_te, probs)

# prec[i] and rec[i] pair with thr[i]; the final (precision=1, recall=0)
# point has no threshold, so drop it before searching
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
best_idx = np.argmax(f1)
best_threshold = thr[best_idx]

f1[best_idx], prec[best_idx], rec[best_idx]

(np.float64(0.30645161290293393),
 np.float64(0.18627450980392157),
 np.float64(0.8636363636363636))

Important: avoid “double dipping”
• If you tune thresholds or hyperparameters,
  do it using CV or a validation split
• Keep the final test set untouched until the end
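
One way to tune a threshold without touching the test set is to use out-of-fold predictions on the training data. A sketch under the same synthetic setup as the earlier cells; cross_val_predict is a suggested tool here, not something the slides prescribe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold probabilities: each training row is scored by a model
# that was fit without that row
oof = cross_val_predict(pipe, X_tr, y_tr, cv=cv, method="predict_proba")[:, 1]

# prec[i], rec[i] pair with thr[i]; the final point has no threshold
prec, rec, thr = precision_recall_curve(y_tr, oof)
f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
threshold = thr[np.argmax(f1)]

# Only now touch the test set, once, with the threshold frozen
pipe.fit(X_tr, y_tr)
test_pred = (pipe.predict_proba(X_te)[:, 1] >= threshold).astype(int)
test_f1 = f1_score(y_te, test_pred)
print("threshold:", round(float(threshold), 3), "test F1:", round(test_f1, 3))
```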


Best practice:
• Nested CV for research-grade estimates
• Simple CV + final test set for most projects

Nested Cross-Validation (concept)
Outer loop:
• estimates generalization
Inner loop:
• chooses hyperparameters

Why it matters:
• prevents over-optimistic tuning results
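
In scikit-learn, a common way to sketch nested CV is to pass a GridSearchCV object to cross_val_score. The fold counts, grid, and synthetic data below are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop: choose C using only the outer-training portion
search = GridSearchCV(
    pipe, {"clf__C": [0.01, 0.1, 1, 10]},
    scoring="average_precision", cv=inner_cv
)

# Outer loop: score the whole "search, then refit" procedure on held-out folds
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="average_precision")
print(f"nested CV PR-AUC: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

The outer score estimates how the tuning procedure generalizes, which is why it is usually lower than best_score_ from a single search.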

Reliable evaluation workflow
1) Define the metric(s) that match your task
2) Build a Pipeline (preprocessing + model)
3) Use StratifiedKFold for imbalanced classification
4) Run CV to estimate baseline performance
5) Tune hyperparameters with GridSearchCV / RandomizedSearchCV
6) (Optional) Tune threshold on validation
7) Report mean ± std across folds, then final test score

Key takeaways
• Accuracy can hide failure under imbalance
• Choose metrics based on consequences
• Use StratifiedKFold for representative folds and Pipelines to avoid leakage
• Tune hyperparameters inside CV
• Handle imbalance: weights, resampling, threshold tuning
• Document everything so results are reproducible


If positives are 1% of the data, which is usually more informative:
ROC-AUC or PR-AUC? Why?

Cheat sheet
• Imbalanced classification: StratifiedKFold + PR-AUC / F1 / MCC
• Use Pipeline for preprocessing to prevent leakage
• Tune hyperparameters with GridSearchCV inside CV
• Consider class_weight="balanced"
• Tune threshold if recall/precision trade-off matters