# Appendix I — Failure Modes and Troubleshooting
## *Python for AI/ML: A Complete Learning Journey*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/APP_I_Failure_Modes.ipynb)
&nbsp;&nbsp;[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)

---

Most ML tutorials show you the happy path: clean data, well-behaved gradients,
metrics that improve monotonically. Real projects don't work that way.

This appendix documents the **failure modes that silently ruin ML projects** —
bugs that don't crash your code but produce models that look fine and aren't.
Each section follows the same structure:

- **What goes wrong** — the failure and why it's dangerous
- **How to detect it** — reproducible diagnostic code
- **How to fix it** — the correct pattern
- **Worked example** — applied to the SO 2025 salary dataset

### Contents

| Section | Failure Mode | Chapters it affects |
|---------|-------------|---------------------|
| I.1 | Silent data leakage | Ch 6, 8, 12 |
| I.2 | Train/test contamination | Ch 6, 7 |
| I.3 | NaN propagation | Ch 3, 6 |
| I.4 | Overfitting that looks like underfitting | Ch 6, 7 |
| I.5 | Class imbalance disasters | Ch 6, 10 |
| I.6 | Gradient problems in deep learning | Ch 7, 11 |
| I.7 | Tokenisation and embedding gotchas | Ch 8 |
| I.8 | Training-serving skew | Ch 12 |
| I.9 | Diagnostic checklist | All |

**How to use this appendix:** Read it end-to-end once, then return to specific
sections when something in your project doesn't make sense.


---

## Setup


In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from typing import Optional

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                              precision_recall_curve, confusion_matrix,
                              ConfusionMatrixDisplay)
from sklearn.impute import SimpleImputer

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# ── Synthetic SO 2025 dataset (consistent with all chapters) ──────
def make_so_dataset(n: int = 8000, nan_frac: float = 0.0,
                    seed: int = RANDOM_STATE) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    years   = rng.exponential(6, n).clip(0, 35)
    salary  = np.exp(10.8 + 0.07 * years + rng.normal(0, 0.5, n)).clip(20000, 500000)
    df = pd.DataFrame({
        'YearsCodePro':        years,
        'ConvertedCompYearly': salary,
        'uses_python':         rng.integers(0, 2, n),
        'uses_sql':            rng.integers(0, 2, n),
        'uses_js':             rng.integers(0, 2, n),
        'uses_ai':             rng.integers(0, 2, n),
        'EdLevel':             rng.choice(
            ['Bachelor', 'Master', 'PhD', 'No degree'], n,
            p=[0.45, 0.30, 0.10, 0.15]),
        'survey_year':         rng.integers(2020, 2026, n),  # for temporal leakage demo
    })
    if nan_frac > 0:
        mask = rng.random((n, 4)) < nan_frac
        for j, col in enumerate(['YearsCodePro','uses_python','uses_sql','uses_js']):
            df[col] = df[col].astype(float)  # integer columns cannot hold NaN
            df.loc[mask[:, j], col] = np.nan
    return df

df = make_so_dataset()
print(f'Dataset: {len(df):,} rows, {df.shape[1]} columns')
print(df.dtypes)


---

## I.1 — Silent Data Leakage

**What goes wrong:** The model has access to information during training that
it cannot possibly have at prediction time. It learns to exploit this shortcut
instead of the actual signal you want it to learn.

Data leakage is dangerous because it is *silent*: the model produces excellent
validation scores that completely evaporate in production.

### Three common leakage patterns

**1. Temporal leakage** — using future data to predict the past.
Example: training a salary model on 2020–2025 data, including a feature
derived from 2025 survey trends, then 'predicting' 2022 salaries.

**2. Target leakage** — a feature that is computed *from* the target,
or is only known *because* you know the target.
Example: `total_compensation` as a feature when predicting `base_salary`.

**3. Row leakage** — test set rows appear in the training set.
Checking `len(X_train) + len(X_test) == len(df)` is not enough
if you have duplicate rows in the dataset.
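
One way to guard against the temporal and row patterns together is to deduplicate first, split by time, and derive any statistics from the training period only. The sketch below uses a toy frame; the column names, the 2024 cutoff, and the `years_exp_centered` feature are illustrative assumptions, not part of the SO 2025 dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "survey_year": rng.integers(2020, 2026, 1000),
    "years_exp":   rng.exponential(6, 1000).round(1),
    "salary":      rng.normal(90_000, 20_000, 1000).round(-2),
})
# Simulate accidental duplicates (e.g. a survey export merged twice)
toy = pd.concat([toy, toy.sample(50, random_state=0)], ignore_index=True)

# 1. Drop exact duplicates BEFORE splitting, so no row can land on both sides
toy = toy.drop_duplicates().reset_index(drop=True)

# 2. Split by time: train on the past, evaluate on the "future"
cutoff = 2024  # assumed cutoff year
train = toy[toy["survey_year"] < cutoff].copy()
test = toy[toy["survey_year"] >= cutoff].copy()

# 3. Derive statistics from the training period only, then apply to both
train_median_exp = train["years_exp"].median()
train["years_exp_centered"] = train["years_exp"] - train_median_exp
test["years_exp_centered"] = test["years_exp"] - train_median_exp

print(f"train rows: {len(train)}, test rows: {len(test)}")
print(f"latest train year: {train['survey_year'].max()}, "
      f"earliest test year: {test['survey_year'].min()}")
```

A random split would mix years on both sides; the time-based split keeps the evaluation closer to how the model would actually be used.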


In [None]:
# I.1 -- Demonstrating and detecting data leakage

# ── Pattern 1: Temporal leakage ───────────────────────────────────
df['high_earner'] = (df['ConvertedCompYearly'] >=
                     df['ConvertedCompYearly'].quantile(0.60)).astype(int)

# LEAKY: compute a per-year salary statistic on the full dataset, then split.
# The feature is derived from the target's source column, and test-row
# salaries contribute to the means that training rows see.
df['mean_salary_by_year'] = df.groupby('survey_year')['ConvertedCompYearly']\
                              .transform('mean')  # ← uses ALL rows including test rows

FEATURES_LEAKY  = ['YearsCodePro','uses_python','uses_sql','mean_salary_by_year']
FEATURES_CLEAN  = ['YearsCodePro','uses_python','uses_sql','uses_js']

def quick_eval(X: pd.DataFrame, y: pd.Series, label: str) -> float:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE)
    clf = GradientBoostingClassifier(n_estimators=50, random_state=RANDOM_STATE)
    clf.fit(X_tr.fillna(0), y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te.fillna(0)))
    print(f'  {label:<45} accuracy = {acc:.4f}')
    return acc

print('Leaky vs clean features:')
acc_leaky = quick_eval(df[FEATURES_LEAKY], df['high_earner'], 'Leaky (mean_salary_by_year included)')
acc_clean = quick_eval(df[FEATURES_CLEAN], df['high_earner'], 'Clean (no future-derived features)')
print(f'\nLeakage inflation: {acc_leaky - acc_clean:+.4f} accuracy points')
print('Any gain from the leaky feature is phantom: it cannot be computed at prediction time.')

# ── Pattern 3: Row leakage via duplicates ─────────────────────────
print('\nRow leakage detection:')
df_with_dupes = pd.concat([df, df.sample(200, random_state=0)], ignore_index=True)
X = df_with_dupes[FEATURES_CLEAN]
y = df_with_dupes['high_earner']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

# Detect: hash every row and check for train/test overlap
def row_hashes(df_: pd.DataFrame) -> set:
    return set(pd.util.hash_pandas_object(df_, index=False))

train_hashes = row_hashes(X_tr)
test_hashes  = row_hashes(X_te)
overlap      = train_hashes & test_hashes
print(f'  Duplicate rows in dataset:      {df_with_dupes.duplicated().sum()}')
print(f'  Train/test row overlap (hashed): {len(overlap)} rows')
print(f'  Fix: df.drop_duplicates() before splitting')


In [None]:
# I.1 -- Target leakage: how to spot it in feature importances

from sklearn.inspection import permutation_importance

# Introduce a leaky feature: partial_salary = salary * noise (correlated with target)
rng = np.random.default_rng(1)
df['partial_salary'] = df['ConvertedCompYearly'] * rng.uniform(0.85, 1.15, len(df))

FEATURES_WITH_LEAK = FEATURES_CLEAN + ['partial_salary']
X = df[FEATURES_WITH_LEAK].fillna(0)
y = df['high_earner']

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)
clf = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE)
clf.fit(X_tr, y_tr)

importances = pd.Series(clf.feature_importances_, index=FEATURES_WITH_LEAK).sort_values()

fig, ax = plt.subplots(figsize=(8, 3.5))
colours = ['#C0392B' if f == 'partial_salary' else '#2E75B6' for f in importances.index]
importances.plot.barh(ax=ax, color=colours)
ax.set_title('Feature Importances — Red = Leaky Feature\n'
             'A single feature dominating importance is a leakage red flag')
ax.set_xlabel('Mean Decrease in Impurity')

# Add legend
from matplotlib.patches import Patch
ax.legend(handles=[Patch(color='#C0392B', label='Leaky feature'),
                   Patch(color='#2E75B6', label='Legitimate feature')])
plt.tight_layout()
plt.show()

print('Rule of thumb: if one feature carries well over 50% of the importance, investigate for leakage.')
print('Permutation importance on the TEST SET is a useful second check, but a')
print('target-derived feature stays dominant there too. The decisive question is')
print('whether the feature is actually available at prediction time.')

# Cleanup
df.drop(columns=['mean_salary_by_year','partial_salary'], inplace=True)


---

## I.2 — Train/Test Contamination

**What goes wrong:** Preprocessing steps are fitted on the *entire* dataset
(including the test set) before the train/test split. The test set is no longer
a fair representation of unseen data — the model has seen its statistical
properties during fitting.

This is one of the most common mistakes made by practitioners who have
learned sklearn but haven't yet internalised the Pipeline pattern.

```python
# ❌ WRONG: scaler sees test data during fit
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)          # full dataset scaled
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# ✅ CORRECT: scaler only sees training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)     # fit on train only
X_test  = scaler.transform(X_test)          # apply to test

# ✅ BEST — Pipeline handles this automatically and correctly
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)                  # scaler.fit only on X_train
pipe.predict(X_test)                        # scaler.transform applied cleanly
```

**Why it matters with encoders:** With `StandardScaler`, contamination is subtle
(a small shift in mean and std). With a target encoder fitted on the full dataset,
it is far worse: target statistics from test rows leak directly into the training features.
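
To make the encoder case concrete, here is a minimal hand-rolled target encoding on toy data (not sklearn's `TargetEncoder`, and not the SO dataset). The `city` column and its 200 levels are assumptions chosen for illustration. The target is pure noise, so any correlation between the encoded feature and the test labels can only come from leakage.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "city": rng.choice([f"city_{i}" for i in range(200)], n),  # high-cardinality category
    "y":    rng.integers(0, 2, n),                              # target is pure noise
})
train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# LEAKY: category means computed on ALL rows, so each test row's own
# label is baked into the value it is encoded with
full_means = toy.groupby("city")["y"].mean()
test["city_te_leaky"] = test["city"].map(full_means)

# CLEAN: category means computed on training rows only; categories
# unseen in training fall back to the training global mean
global_mean = train["y"].mean()
train_means = train.groupby("city")["y"].mean()
test["city_te_clean"] = test["city"].map(train_means).fillna(global_mean)

leaky_corr = test["city_te_leaky"].corr(test["y"])
clean_corr = test["city_te_clean"].corr(test["y"])
print(f"corr(encoded feature, test target)  leaky: {leaky_corr:+.3f}  clean: {clean_corr:+.3f}")
```

With a noise target, the clean encoding should hover near zero correlation while the leaky one shows a clearly positive value. Placing the encoder inside a `Pipeline` gives the clean behaviour automatically.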


In [None]:
# I.2 -- Quantifying contamination inflation

from sklearn.preprocessing import TargetEncoder

df['EdLevel_code'] = df['EdLevel'].astype('category').cat.codes
FEATURES = ['YearsCodePro','uses_python','uses_sql','uses_js','uses_ai','EdLevel_code']
X = df[FEATURES].fillna(0)
y = df['high_earner']

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

# ── Contaminated: scale entire X before split ────────────────────
scaler_contam = StandardScaler()
X_all_scaled  = pd.DataFrame(scaler_contam.fit_transform(X), columns=FEATURES, index=X.index)
X_tr_c, X_te_c = X_all_scaled.loc[X_tr.index], X_all_scaled.loc[X_te.index]  # select by label, not position

# ── Clean: Pipeline enforces correct order ────────────────────────
pipe_clean = Pipeline([
    ('scaler', StandardScaler()),
    ('clf',    GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE))
])

# Contaminated model
clf_c = GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE)
clf_c.fit(X_tr_c, y_tr)
acc_contam = accuracy_score(y_te, clf_c.predict(X_te_c))

# Clean model
pipe_clean.fit(X_tr, y_tr)
acc_clean = accuracy_score(y_te, pipe_clean.predict(X_te))

print('Train/test contamination via full-dataset scaling:')
print(f'  Contaminated accuracy: {acc_contam:.4f}')
print(f'  Clean accuracy:        {acc_clean:.4f}')
print(f'  Inflation:             {acc_contam - acc_clean:+.4f}')
print()
print('With StandardScaler the inflation is usually small: train and test share')
print('nearly the same mean and std, and tree models ignore monotonic rescaling.')
print('With a target encoder fitted on the full dataset, test-row targets leak')
print('into the features and the inflation can be much larger.')
print()
print('Detection: re-evaluate with cross_val_score on a Pipeline that contains')
print('every preprocessing step. If your original score is clearly higher than')
print('the Pipeline CV score, contamination is likely.')

# Cleanup
df.drop(columns=['EdLevel_code'], inplace=True)


---

## I.3 — NaN Propagation

**What goes wrong:** Missing values silently corrupt downstream computations.
The failure modes are model-specific:

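- **sklearn tree models**: `HistGradientBoostingClassifier` handles NaN natively,
  as do decision trees (scikit-learn 1.3+) and random forests (1.4+); the classic
  `GradientBoostingClassifier` raises a `ValueError` instead. Where NaN is accepted,
  the *missingness pattern itself* becomes a learned signal: your model may be
  predicting salary from *who didn't answer the survey*, not from the features
  you think you're using.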

- **Linear models / SVM / KNN**: crash with `ValueError: Input X contains NaN`.
  This is the *good* failure mode — it tells you immediately.

- **PyTorch / deep learning**: NaN in a single training example propagates
  through the entire batch via backprop. Loss goes to `nan`; gradients
  become `nan`; all weights become `nan`. Model is silently destroyed.

- **Target column NaN**: rows are silently dropped by some libraries,
  or cause a crash in others. The silent drop is more dangerous.
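
If missingness may carry signal, one option is to make it explicit rather than leave it implicit. The sketch below uses `SimpleImputer(add_indicator=True)` on a tiny hand-made array (illustrative values only): imputed values fill the gaps, and extra 0/1 columns record where the gaps were, so their importance can be inspected separately.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0,    10.0],
    [np.nan, 20.0],
    [3.0,    np.nan],
    [5.0,    40.0],
])

# add_indicator=True appends one binary column per feature that had
# missing values during fit
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out)
# Layout: [imputed col 0, imputed col 1, col 0 was missing, col 1 was missing]

# Defensive check: nothing missing survives the imputer
assert not np.isnan(X_out).any(), "NaN escaped imputer"
```

Fitting the imputer inside a `Pipeline` keeps the medians tied to the training data only.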


In [None]:
# I.3 -- NaN propagation: detection and safe imputation patterns

df_nan = make_so_dataset(n=5000, nan_frac=0.15)
df_nan['high_earner'] = (df_nan['ConvertedCompYearly'] >=
                          df_nan['ConvertedCompYearly'].quantile(0.60)).astype(int)

FEATURES = ['YearsCodePro','uses_python','uses_sql','uses_js','uses_ai']

# ── Step 1: Audit NaN before any processing ───────────────────────
print('=== NaN Audit ===')
nan_summary = df_nan[FEATURES].isnull().sum()
nan_pct     = (nan_summary / len(df_nan) * 100).round(1)
audit = pd.DataFrame({'missing_count': nan_summary, 'missing_pct': nan_pct})
print(audit[audit['missing_count'] > 0].to_string())

# ── Step 2: Detect NaN-as-signal (missingness correlation with target) ─
print('\nNaN-as-signal check (missingness correlation with target):')
for col in FEATURES:
    if df_nan[col].isnull().any():
        is_missing  = df_nan[col].isnull().astype(int)
        corr        = is_missing.corr(df_nan['high_earner'])
        flag        = '⚠ potential signal' if abs(corr) > 0.05 else '  ok'
        print(f'  {col:<20} corr(missing, target) = {corr:+.3f}  {flag}')

# ── Step 3: Safe imputation inside Pipeline ───────────────────────
print('\nImputation strategies:')
X = df_nan[FEATURES]
y = df_nan['high_earner']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

for strategy in ['mean', 'median', 'most_frequent']:
    pipe = Pipeline([
        ('imputer', SimpleImputer(strategy=strategy)),
        ('clf',     GradientBoostingClassifier(n_estimators=50, random_state=RANDOM_STATE))
    ])
    pipe.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, pipe.predict(X_te))
    print(f'  SimpleImputer(strategy={strategy!r:<15}) accuracy = {acc:.4f}')

# ── Step 4: Assert no NaN escapes the pipeline ────────────────────
print('\nDefensive assertion pattern:')
print('  After .fit_transform() or .transform(), always assert:')
print('  assert not np.isnan(X_transformed).any(), "NaN escaped imputer"')
print()
print('For PyTorch: add NaN checks at dataloader and loss computation:')
print('  assert not torch.isnan(batch).any(), f"NaN in batch at step {step}"')
print('  assert not torch.isnan(loss),        "NaN loss — check inputs"')


---

## I.4 — Overfitting That Looks Like Underfitting

**What goes wrong:** Your model's validation metrics suggest it isn't learning,
but the true problem is that it's learning the *wrong thing* — and the metric
you're watching isn't sensitive enough to show it.

### Symptom 1: High train accuracy, low test accuracy (classic overfitting)
The model memorised training examples. Fix: regularisation, more data, simpler model.

### Symptom 2: High accuracy on both sets, but terrible real-world performance
Usually class imbalance + accuracy as metric. A model that always predicts the
majority class can hit 90%+ accuracy while being completely useless.

### Symptom 3: Train loss and val loss both high and not improving
Could be: wrong learning rate, wrong architecture, wrong loss function,
or the labels are too noisy to learn from. Learning curves distinguish them.

### Symptom 4: Perfect training metrics, random metrics on genuinely new data
Almost always leakage (see I.1): the model learned the shortcut perfectly, and
the shortcut does not exist at prediction time.
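
The four symptoms above can be turned into a rough first-pass triage. The `triage` helper below is hypothetical, and its thresholds (`gap_tol`, `lift_tol`, the 0.99 cut-off) are illustrative assumptions rather than established rules. It compares the train score, the validation score, and the score of a trivial baseline such as always predicting the majority class.

```python
def triage(train_score: float, val_score: float, baseline_score: float,
           gap_tol: float = 0.05, lift_tol: float = 0.02) -> str:
    """Map three scores to a first hypothesis. Thresholds are illustrative."""
    gap = train_score - val_score        # how much worse on unseen data
    lift = val_score - baseline_score    # how much better than a trivial model

    if gap > gap_tol:
        return "overfitting: large train/validation gap (Symptom 1)"
    if lift < lift_tol:
        return "no lift over baseline: metric may be hiding imbalance, or the model is not learning (Symptoms 2-3)"
    if val_score > 0.99:
        return "suspiciously perfect: audit features for leakage (Symptom 4)"
    return "plausible fit: confirm with learning curves"


print(triage(train_score=0.99, val_score=0.70, baseline_score=0.60))
print(triage(train_score=0.91, val_score=0.90, baseline_score=0.90))
print(triage(train_score=1.00, val_score=1.00, baseline_score=0.60))
```

The output is a hypothesis to investigate, not a diagnosis; learning curves and a confusion matrix are still needed to confirm it.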


In [None]:
# I.4 -- Learning curves: the right diagnostic for training problems

from sklearn.model_selection import learning_curve

df_lc = make_so_dataset(n=8000)
df_lc['high_earner'] = (df_lc['ConvertedCompYearly'] >=
                         df_lc['ConvertedCompYearly'].quantile(0.60)).astype(int)
FEATURES = ['YearsCodePro','uses_python','uses_sql','uses_js','uses_ai']
X = df_lc[FEATURES].fillna(0)
y = df_lc['high_earner']

def plot_learning_curve(estimator, X: pd.DataFrame, y: pd.Series,
                         title: str, ax, cv: int = 5) -> None:
    """Plot learning curves showing train vs validation score as n_samples grows."""
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y,
        train_sizes=np.linspace(0.1, 1.0, 8),
        cv=cv, scoring='accuracy', n_jobs=-1)
    ts_mean = train_scores.mean(axis=1)
    vs_mean = val_scores.mean(axis=1)
    ts_std  = train_scores.std(axis=1)
    vs_std  = val_scores.std(axis=1)

    ax.fill_between(train_sizes, ts_mean-ts_std, ts_mean+ts_std, alpha=0.15, color='#2E75B6')
    ax.fill_between(train_sizes, vs_mean-vs_std, vs_mean+vs_std, alpha=0.15, color='#C0392B')
    ax.plot(train_sizes, ts_mean, 'o-', color='#2E75B6', label='Train')
    ax.plot(train_sizes, vs_mean, 's-', color='#C0392B', label='Validation')
    gap = ts_mean[-1] - vs_mean[-1]
    ax.set_title(f'{title}\nFinal gap: {gap:.3f}')
    ax.set_xlabel('Training examples')
    ax.set_ylabel('Accuracy')
    ax.legend(loc='lower right')
    ax.set_ylim(0.5, 1.01)

fig, axes = plt.subplots(1, 3, figsize=(15, 4.5))

# Underfitting: logistic regression on this nonlinear problem
plot_learning_curve(LogisticRegression(max_iter=200), X, y,
                    'Underfitting\n(too simple: LogReg)', axes[0])

# Good fit
plot_learning_curve(
    GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=RANDOM_STATE),
    X, y, 'Good fit\n(GBM depth=3)', axes[1])

# Overfitting: very deep trees, no regularisation
plot_learning_curve(
    GradientBoostingClassifier(n_estimators=300, max_depth=8,
                                min_samples_leaf=1, random_state=RANDOM_STATE),
    X, y, 'Overfitting\n(GBM depth=8, 300 trees)', axes[2])

fig.suptitle('Learning Curves: Diagnosing Fit Quality', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print('Interpretation guide:')
print('  Underfitting: both curves low and flat — add features or model capacity')
print('  Good fit:     small gap, both curves converging at reasonable accuracy')
print('  Overfitting:  large gap between train and validation — regularise')
print('  Leakage:      validation curve ABOVE or equal to train curve')


In [None]:
# I.4 -- Confusion matrix: what accuracy hides

from sklearn.dummy import DummyClassifier

# Create a moderately imbalanced dataset (80/20)
df_imb = make_so_dataset(n=5000)
df_imb['rare_event'] = (df_imb['ConvertedCompYearly'] >=
                         df_imb['ConvertedCompYearly'].quantile(0.80)).astype(int)
X = df_imb[['YearsCodePro','uses_python','uses_sql','uses_js','uses_ai']].fillna(0)
y = df_imb['rare_event']

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

models = {
    'Always predict majority': DummyClassifier(strategy='most_frequent'),
    'GBM (no rebalancing)':    GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE),
    'GBM (oversampled)':       GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE),
}

# GradientBoostingClassifier has no class_weight parameter, so the third
# model is trained on a manually oversampled training set instead.
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
print(f'Class distribution: {dict(y.value_counts().sort_index())}')
print(f'{"Model":<30} {"Accuracy":>10} {"F1 (minority)":>15} {"AUC-ROC":>10}')
print('-' * 70)

for ax, (name, clf) in zip(axes, models.items()):
    if 'oversampled' in name:
        # Add 3 extra copies of each minority row, in the TRAINING set only
        min_idx = y_tr[y_tr == 1].index
        oversample = pd.concat([X_tr, X_tr.loc[min_idx.repeat(3)]])
        y_over     = pd.concat([y_tr, y_tr.loc[min_idx.repeat(3)]])
        clf.fit(oversample, y_over)
    else:
        clf.fit(X_tr, y_tr)

    preds = clf.predict(X_te)
    acc   = accuracy_score(y_te, preds)
    f1    = f1_score(y_te, preds, zero_division=0)
    try:
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    except Exception:
        auc = float('nan')

    print(f'{name:<30} {acc:>10.4f} {f1:>15.4f} {auc:>10.4f}')
    cm = confusion_matrix(y_te, preds)
    ConfusionMatrixDisplay(cm, display_labels=['Majority','Minority']).plot(ax=ax, colorbar=False)
    ax.set_title(f'{name}\nAcc={acc:.3f}, F1={f1:.3f}')

plt.tight_layout()
plt.show()
print('\nConclusion: the majority classifier achieves high accuracy but zero F1.')
print('Always report F1 or AUC-PR alongside accuracy on imbalanced problems.')


---

## I.5 — Class Imbalance Disasters

**What goes wrong:** Your dataset has many more examples of one class.
Models minimising cross-entropy loss will learn to mostly predict the majority
class — it's the locally optimal thing to do given the loss surface.

### The metrics that actually matter

| Metric | Use when... | Watch out for... |
|--------|------------|------------------|
| Accuracy | Classes roughly balanced | Completely misleading on imbalanced data |
| F1 score | Moderate imbalance | Doesn't account for true negatives |
| AUC-ROC | Binary classification | Optimistic when positives are rare |
| **AUC-PR** | **Severe imbalance, positives rare** | **The right metric for rare events** |
| MCC | You want one number that reflects all four confusion-matrix cells | Less interpretable |
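
To see how differently these metrics treat the same model, the sketch below scores a classifier that always predicts the majority class on a hand-made 95/5 label vector (toy data, chosen for illustration).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

y_true = np.array([0] * 95 + [1] * 5)   # 5% positives
y_pred = np.zeros(100, dtype=int)        # always predict the majority class
y_score = np.full(100, 0.05)             # constant score: no ranking ability

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1":       f1_score(y_true, y_pred, zero_division=0),
    "mcc":      matthews_corrcoef(y_true, y_pred),
    "roc_auc":  roc_auc_score(y_true, y_score),
    "pr_auc":   average_precision_score(y_true, y_score),
}
for name, value in metrics.items():
    print(f"{name:<9} {value:.3f}")
```

Accuracy rewards the useless model, F1 does not, and a constant score lands on the chance level for both curve-based metrics: 0.5 for AUC-ROC and the positive prevalence for AUC-PR.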

### SMOTE: what it does and when it backfires
SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic examples by
interpolating between existing minority class examples. It helps with moderate
imbalance on structured data. It **backfires** when:
- Applied before train/test split (data leakage)
- Applied to high-dimensional data (interpolated examples are unrealistic)
- Applied to text or images (interpolation in raw feature space is meaningless)
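
The ordering rule in the first bullet applies to any resampling method. The sketch below uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE (it duplicates rows rather than interpolating), on synthetic data with roughly 10% positives. The split happens first, and only the training portion is rebalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.10).astype(int)   # roughly 10% positives

# 1. Split FIRST, so the test set keeps the real-world class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2. Oversample the minority class in the training set only
minority = X_tr[y_tr == 1]
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=0)

X_bal = np.vstack([X_tr, extra])
y_bal = np.concatenate([y_tr, np.ones(n_extra, dtype=int)])

print(f"train positives before: {y_tr.mean():.1%}, after: {y_bal.mean():.1%}")
print(f"test positives (untouched): {y_te.mean():.1%}")
```

Resampling before the split would place copies (or interpolations) of test rows in the training set, which is the row leakage described in I.1.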


In [None]:
# I.5 -- PR-AUC vs ROC-AUC: the metric that survives severe imbalance

from sklearn.metrics import average_precision_score, PrecisionRecallDisplay, RocCurveDisplay

# Severe imbalance: 95/5
df_sev = make_so_dataset(n=10000)
df_sev['rare_event'] = (df_sev['ConvertedCompYearly'] >=
                         df_sev['ConvertedCompYearly'].quantile(0.95)).astype(int)
print(f'Class balance: {dict(df_sev["rare_event"].value_counts().sort_index())}')
print(f'Minority fraction: {df_sev["rare_event"].mean():.1%}')

FEATURES = ['YearsCodePro','uses_python','uses_sql','uses_js','uses_ai']
X = df_sev[FEATURES].fillna(0)
y = df_sev['rare_event']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=RANDOM_STATE)

clf = GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

roc_auc = roc_auc_score(y_te, proba)
pr_auc  = average_precision_score(y_te, proba)
dummy_pr_auc = y_te.mean()  # random classifier's PR-AUC baseline

fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))
RocCurveDisplay.from_predictions(y_te, proba, ax=axes[0],
                                  name=f'GBM (AUC={roc_auc:.3f})')
axes[0].plot([0,1],[0,1],'k--',label='Random (AUC=0.500)')
axes[0].set_title('ROC Curve\n(optimistic on imbalanced data)')
axes[0].legend()

PrecisionRecallDisplay.from_predictions(y_te, proba, ax=axes[1],
                                         name=f'GBM (AP={pr_auc:.3f})')
axes[1].axhline(dummy_pr_auc, color='k', linestyle='--',
                label=f'Random baseline ({dummy_pr_auc:.3f})')
axes[1].set_title('Precision-Recall Curve\n(honest on imbalanced data)')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f'ROC-AUC:   {roc_auc:.3f}  (can look strong even when few flagged cases are true positives)')
print(f'PR-AUC:    {pr_auc:.3f}  (compare against the random baseline below, not against 0.5)')
print(f'Random PR: {dummy_pr_auc:.3f}  (baseline = positive-class prevalence)')
print()
print('Rule: if minority class < 10%, use PR-AUC as your primary metric.')


---

## I.6 — Gradient Problems in Deep Learning

**What goes wrong:** Gradients in deep networks can vanish (become too small
to update early layers) or explode (become so large they corrupt weights).
Both failures are subtle — the training loop runs without errors.

| Problem | Symptom | Root cause | Fix |
|---------|---------|-----------|-----|
| Vanishing | Early layer weights barely change; loss plateaus | Sigmoid/tanh in deep nets; bad init | BatchNorm, ReLU, residual connections |
| Exploding | Loss goes to `nan` or `inf` after a few steps | LR too high; gradient accumulation | Gradient clipping; lower LR |
| Dead ReLU | Some neurons output zero forever | Negative bias at init; high LR | LeakyReLU; careful LR; batch norm |
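
Two rows of the table can be illustrated numerically without a deep learning framework. The NumPy sketch below is an assumption-laden simplification: it looks only at the activation-derivative factor of the chain rule (real gradients also depend on the weights), and `clip_by_global_norm` is a hypothetical helper that mirrors the idea behind gradient-norm clipping rather than any library's implementation.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

# Vanishing: the sigmoid derivative s(z) * (1 - s(z)) peaks at z = 0
z = np.linspace(-6.0, 6.0, 1001)
slope = sigmoid(z) * (1.0 - sigmoid(z))
max_slope = slope.max()

depth = 6
# Best-case product of activation slopes across `depth` sigmoid layers
activation_bound = max_slope ** depth
print(f"max sigmoid slope: {max_slope:.3f}")
print(f"best-case activation factor over {depth} layers: {activation_bound:.2e}")

# Exploding: rescale all gradients so their combined L2 norm is <= max_norm
def clip_by_global_norm(grads: list, max_norm: float) -> tuple:
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total

grads = [np.full((3, 3), 10.0), np.full(3, -10.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in clipped))
print(f"gradient norm before clipping: {norm_before:.2f}, after: {norm_after:.2f}")
```

Clipping preserves the direction of the update and only limits its size, which is why it stabilises training without changing what the gradient points towards.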


In [None]:
# I.6 -- Detecting and fixing gradient problems

import torch
import torch.nn as nn
import torch.nn.functional as F

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def make_salary_tensors(n: int = 4000) -> tuple[torch.Tensor, torch.Tensor]:
    """Synthetic salary regression data as tensors."""
    df_ = make_so_dataset(n)
    FEAT = ['YearsCodePro','uses_python','uses_sql','uses_js','uses_ai']
    X = torch.tensor(df_[FEAT].fillna(0).values, dtype=torch.float32)
    y = torch.tensor(
        np.log(df_['ConvertedCompYearly'].values / 1000), dtype=torch.float32).unsqueeze(1)
    return X.to(DEVICE), y.to(DEVICE)

X_all, y_all = make_salary_tensors()
n_tr = int(len(X_all) * 0.8)
X_tr, X_te = X_all[:n_tr], X_all[n_tr:]
y_tr, y_te = y_all[:n_tr], y_all[n_tr:]

# ── Gradient norm logger ─────────────────────────────────────────
def train_with_grad_logging(
    model:        nn.Module,
    lr:           float,
    n_epochs:     int = 20,
    clip:         Optional[float] = None,
    label:        str = '',
) -> dict:
    """Train and log per-epoch gradient norms and losses."""
    model = model.to(DEVICE)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    history = {'loss': [], 'grad_norm': []}

    for epoch in range(n_epochs):
        model.train()
        optimiser.zero_grad()
        out  = model(X_tr)
        loss = F.mse_loss(out, y_tr)
        if not torch.isfinite(loss):  # catches both nan and inf
            print(f'  [{label}] non-finite loss at epoch {epoch}: training has diverged')
            history['loss'].append(float('nan'))
            history['grad_norm'].append(float('nan'))
            break
        loss.backward()

        # Log gradient norm before clipping
        total_norm = sum(p.grad.data.norm(2).item() ** 2
                         for p in model.parameters() if p.grad is not None) ** 0.5
        history['grad_norm'].append(total_norm)

        if clip:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)

        optimiser.step()
        history['loss'].append(loss.item())

    return history


def make_deep_sigmoid(depth: int = 6) -> nn.Sequential:
    """Deep network with sigmoid activations -- prone to vanishing gradients."""
    layers = []
    in_f = 5
    for _ in range(depth):
        layers += [nn.Linear(in_f, 32), nn.Sigmoid()]
        in_f = 32
    layers.append(nn.Linear(32, 1))
    return nn.Sequential(*layers)

def make_deep_relu(depth: int = 6) -> nn.Sequential:
    """Deep network with ReLU + BatchNorm -- healthy gradient flow."""
    layers = []
    in_f = 5
    for _ in range(depth):
        layers += [nn.Linear(in_f, 32), nn.BatchNorm1d(32), nn.ReLU()]
        in_f = 32
    layers.append(nn.Linear(32, 1))
    return nn.Sequential(*layers)


print('Training comparison (20 epochs):')
h_vanish  = train_with_grad_logging(make_deep_sigmoid(), lr=0.001,  label='Sigmoid (vanishing)')
h_explode = train_with_grad_logging(make_deep_sigmoid(), lr=5.0,    label='High LR (exploding)')
h_clip    = train_with_grad_logging(make_deep_sigmoid(), lr=5.0, clip=1.0, label='High LR + clip')
h_relu    = train_with_grad_logging(make_deep_relu(),    lr=0.001,  label='ReLU+BN (healthy)')

fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))
for h, label, ls in [
    (h_vanish,  'Sigmoid deep (vanishing)', '-'),
    (h_explode, 'High LR (exploding/NaN)',  '--'),
    (h_clip,    'High LR + grad clip',      '-.'),
    (h_relu,    'ReLU + BatchNorm',         ':'),
]:
    valid = [v for v in h['loss'] if not (isinstance(v, float) and v != v)]
    valid_norms = [v for v in h['grad_norm'] if not (isinstance(v, float) and v != v)]
    axes[0].plot(range(len(valid)), valid,       label=label, linestyle=ls, linewidth=2)
    axes[1].plot(range(len(valid_norms)), valid_norms, label=label, linestyle=ls, linewidth=2)

axes[0].set_title('Training Loss'); axes[0].set_xlabel('Epoch'); axes[0].legend(fontsize=8)
axes[1].set_title('Gradient Norm'); axes[1].set_xlabel('Epoch')
axes[1].set_yscale('log'); axes[1].legend(fontsize=8)
plt.tight_layout(); plt.show()

print('Vanishing: deep sigmoid net, gradient norm stays small and loss barely moves')
print('Exploding: high LR, gradient norm and loss swing wildly or become non-finite')
print('Fix:       gradient clipping (max_norm=1.0) plus a ReLU/BatchNorm architecture')


---

## I.7 — Tokenisation and Embedding Gotchas

**What goes wrong:** Transformer pipelines have several silent failure modes
that produce plausible-looking outputs with degraded (or random) quality.

### Truncation silently cuts off content
Most BERT-family models have a 512-token maximum. Hugging Face tokenisers do not
truncate by default, but nearly every tutorial and pipeline sets `truncation=True`,
which silently drops everything after token 512.
For a developer job description (often 600–1000 tokens), the model never
sees the final part of the text, which is often where the salary range,
requirements, or benefits appear.
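
One common alternative to truncation is to split a long token sequence into overlapping windows and run the model on each. The `sliding_windows` helper below is a hypothetical pure-Python sketch that operates on a list of token ids; the 510-token window (leaving room for `[CLS]` and `[SEP]`) and the 255-token stride are assumed values.

```python
def sliding_windows(token_ids: list[int], window: int = 510,
                    stride: int = 255) -> list[list[int]]:
    """Split token ids into overlapping windows so no token is dropped."""
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break               # this window reached the end of the text
        start += stride
    return chunks


ids = list(range(1200))          # stand-in for a 1200-token document
chunks = sliding_windows(ids)
print(f"{len(chunks)} windows, sizes: {[len(c) for c in chunks]}")
```

The per-window outputs still need to be combined, for example by averaging embeddings or taking the maximum predicted score; which pooling works best depends on the task.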

### Attention mask not passed to the model
Padding tokens (`[PAD]`) are added to make batches uniform. If you don't pass
`attention_mask` to the model, it attends to padding tokens as if they were
real content. This degrades performance and is a very common beginner mistake.
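
The effect of ignoring the mask is easiest to see at the pooling step, where the same mistake occurs in a simpler form. The NumPy sketch below uses made-up 2-dimensional "embeddings" and is an analogy for what happens inside attention, not a reproduction of it.

```python
import numpy as np

# One sequence: two real tokens followed by two [PAD] positions.
# Padding positions still carry (meaningless) vectors in model outputs.
emb = np.array([[[1.0, 1.0],
                 [3.0, 3.0],
                 [9.0, 9.0],     # [PAD]
                 [9.0, 9.0]]])   # [PAD]    shape: (batch, seq_len, hidden)
mask = np.array([[1, 1, 0, 0]])  # 1 = real token, 0 = padding

# Naive mean: padding vectors pollute the sentence embedding
naive = emb.mean(axis=1)

# Masked mean: zero out padding, divide by the number of real tokens
m = mask[..., None].astype(float)
masked = (emb * m).sum(axis=1) / m.sum(axis=1)

print("naive mean: ", naive)
print("masked mean:", masked)
```

The naive result changes whenever the batch is padded to a different length, so the same text can yield different embeddings depending on its batch neighbours.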

### Tokeniser/model mismatch at inference
If you fine-tune with `bert-base-uncased` but accidentally load
`bert-base-cased` at inference time, the vocabulary is different.
The model runs without error — it just produces garbage predictions.


In [None]:
# I.7 -- Tokenisation diagnostics

try:
    from transformers import AutoTokenizer
    HF_AVAILABLE = True
except ImportError:
    HF_AVAILABLE = False
    print('transformers not installed — showing conceptual examples only')

# ── Truncation audit ─────────────────────────────────────────────
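# Repeat counts are chosen so that one text fits and two exceed 512 tokens;
# the token counts in the comments are rough estimates.
sample_texts = [
    'Python developer with 5 years experience. ' * 30,   # roughly 200 tokens: fits
    'Senior ML engineer. ' * 200,                         # roughly 800 tokens: truncated
    'Full stack developer experienced in React, Node, and AWS. ' * 60,  # roughly 700 tokens: truncated
]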

if HF_AVAILABLE:
    tokeniser = AutoTokenizer.from_pretrained('bert-base-uncased')

    print('Truncation audit (max_length=512):')
    print(f'{"Text preview":<45} {"Tokens":>8} {"Truncated?":>12} {"% kept":>8}')
    print('-' * 78)
    for text in sample_texts:
        n_raw = len(tokeniser.encode(text, truncation=False))
        n_trunc = len(tokeniser.encode(text, truncation=True, max_length=512))
        truncated = n_raw > 512
        pct_kept = min(512, n_raw) / n_raw * 100
        print(f'{text[:42]!r:<45} {n_raw:>8} {str(truncated):>12} {pct_kept:>7.0f}%')

    print()
    print('Mitigation strategies for long documents:')
    print('  1. Sliding window: tokenise with stride, pool CLS embeddings')
    print('  2. First+last: concatenate first 128 and last 384 tokens')
    print('  3. Longformer/BigBird: models supporting up to 4096 tokens')
    print()

    # ── Attention mask demo ──────────────────────────────────────
    print('Attention mask check:')
    batch = tokeniser(['short text', 'a much longer text with more tokens'],
                       padding=True, return_tensors='pt')
    print(f'  input_ids shape:      {batch["input_ids"].shape}')
    print(f'  attention_mask shape: {batch["attention_mask"].shape}')
    print(f'  Mask row 0: {batch["attention_mask"][0].tolist()}')
    print(f'  Mask row 1: {batch["attention_mask"][1].tolist()}')
    print(f'  Zeros in mask = padding positions model must ignore')
    print(f'  Always pass attention_mask= to model(**batch)')
else:
    print('Conceptual example (transformers not installed):')
    print('  tokens = tokeniser(text, truncation=True, max_length=512)')
    print('  # Check: does len(tokens["input_ids"]) == 512?')
    print('  # If yes — content was cut. Audit long texts explicitly.')
    print()
    print('  Always audit your corpus:')
    print('  lengths = [len(tokeniser.encode(t)) for t in texts]')
    print('  pct_truncated = sum(l > 512 for l in lengths) / len(lengths)')
    print('  print(f"{pct_truncated:.1%} of texts will be truncated")')
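
The sliding-window strategy listed in the cell above can be sketched in plain Python on a list of token IDs. This is a minimal illustration, not the Hugging Face implementation: `sliding_windows`, `window`, and `overlap` are hypothetical names, and the default of 510 assumes a 512-token model that adds two special tokens to every chunk.

```python
def sliding_windows(token_ids: list[int], window: int = 510,
                    overlap: int = 128) -> list[list[int]]:
    """Split token IDs into chunks of at most `window` tokens.

    Consecutive chunks share `overlap` tokens so that no sentence is
    seen only at a chunk boundary.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# Fake token IDs standing in for a 1000-token document.
token_ids = list(range(1000))
chunks = sliding_windows(token_ids)
print(f"{len(chunks)} chunks with lengths {[len(c) for c in chunks]}")
```

Each chunk would then be wrapped in special tokens and encoded separately, with the per-chunk embeddings pooled (for example, averaged) into one document vector.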


---

## I.8 — Training-Serving Skew

**What goes wrong:** The feature computation during training and during
inference produces different values for the same raw input.
The model was trained on correctly computed features; at serving time,
slightly different features are fed in. The model degrades silently.

### Common causes

- **Different libraries:** pandas vs SQL vs Spark computing the same feature differently
- **Different imputation:** training uses `df.fillna(median)` with the training median;
  production code uses a hardcoded number that drifts from the actual median
- **Feature order mismatch:** model trained on `[A, B, C]`; served `[B, A, C]`.
  Every fitted model, tree-based or linear, reads features by position, so a
  reordered array silently yields wrong predictions
- **Timezone or rounding differences:** `days_since_survey` computed differently
  in training pipeline vs real-time inference

### The fix: one feature computation function, used everywhere
Export the *entire* sklearn Pipeline (including all preprocessing), not just
the model. The pipeline is your contract: same input → same features → same output.
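
When two code paths do have to coexist, a parity test can flag disagreements before deployment: run the same raw inputs through both and list every case where the outputs differ. The sketch below applies this to the timezone cause from the list above. Both feature functions, the survey date, and the UTC-8 server timezone are assumptions made up for illustration.

```python
from datetime import datetime, timedelta, timezone

SURVEY_DATE_UTC = datetime(2025, 5, 1, tzinfo=timezone.utc)  # assumed reference date


def days_since_survey_training(event: datetime) -> int:
    """Training pipeline: whole calendar days elapsed, computed in UTC."""
    return (event.astimezone(timezone.utc).date() - SURVEY_DATE_UTC.date()).days


def days_since_survey_serving(event: datetime) -> int:
    """Serving code (buggy on purpose): uses the server's local calendar date."""
    local = timezone(timedelta(hours=-8))  # hypothetical server timezone
    return (event.astimezone(local).date() - SURVEY_DATE_UTC.date()).days


def parity_report(raw_inputs, train_fn, serve_fn):
    """Return (input, training value, serving value) for every disagreement."""
    return [
        (x, train_fn(x), serve_fn(x))
        for x in raw_inputs
        if train_fn(x) != serve_fn(x)
    ]


events = [
    datetime(2025, 5, 10, 12, 0, tzinfo=timezone.utc),  # midday UTC: same date in both zones
    datetime(2025, 5, 10, 3, 0, tzinfo=timezone.utc),   # 03:00 UTC is still 9 May in UTC-8
]
mismatches = parity_report(events, days_since_survey_training, days_since_survey_serving)
for event, train_value, serve_value in mismatches:
    print(f"{event.isoformat()}: training={train_value}, serving={serve_value}")
```

A check of this kind fits naturally into a test suite, run on a sample of real raw records whenever either code path changes.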


In [None]:
# I.8 -- Detecting and preventing training-serving skew

import joblib
from io import BytesIO

# ── Simulate skew: training uses Pipeline median imputation;
# serving code accidentally uses a hardcoded value ─────────────────
df_skew = make_so_dataset(n=5000, nan_frac=0.10)
df_skew['high_earner'] = (df_skew['ConvertedCompYearly'] >=
                           df_skew['ConvertedCompYearly'].quantile(0.60)).astype(int)
FEATURES = ['YearsCodePro','uses_python','uses_sql','uses_js','uses_ai']
X = df_skew[FEATURES]
y = df_skew['high_earner']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

# Train with Pipeline (correct)
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('clf',     GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE))
])
pipe.fit(X_tr, y_tr)
acc_correct = accuracy_score(y_te, pipe.predict(X_te))

# Simulate skew: serving code fills NaN with 0 instead of learned median
X_te_skewed = X_te.fillna(0)  # wrong imputation at serving time
# Extract just the GBM (no pipeline wrapper) as if someone exported model separately
gbm_only = pipe.named_steps['clf']
# Feed skewed features directly to the model
X_te_skewed_imputed_wrong = X_te_skewed.values  # filled with 0
X_te_correct_imputed = pipe.named_steps['imputer'].transform(X_te)  # pipeline median

acc_skewed  = accuracy_score(y_te, gbm_only.predict(X_te_skewed_imputed_wrong))
acc_pipeline = accuracy_score(y_te, gbm_only.predict(X_te_correct_imputed))

print('Training-serving skew demonstration:')
print(f'  Pipeline (correct imputation):         {acc_pipeline:.4f}')
print(f'  Skewed  (fillna(0) instead of median): {acc_skewed:.4f}')
print(f'  Accuracy drop due to skew:             {acc_pipeline - acc_skewed:+.4f}')
print()

# ── Feature schema validation at serving time ─────────────────────
print('Feature schema validation (add to your FastAPI endpoint):')
expected_features = list(X_tr.columns)
expected_dtypes   = dict(X_tr.dtypes)

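def validate_serving_features(X_input: pd.DataFrame) -> list[str]:
    """Return list of schema violations. Empty = clean."""
    errors = []
    missing = set(expected_features) - set(X_input.columns)
    extra   = set(X_input.columns) - set(expected_features)
    if missing: errors.append(f'Missing features: {sorted(missing)}')
    if extra:   errors.append(f'Unexpected features: {sorted(extra)}')
    # Column order is only meaningful once the column sets agree
    if not missing and not extra and list(X_input.columns) != expected_features:
        errors.append('Feature order mismatch')
    # Compare dtypes against training for the columns that are present
    for col in expected_features:
        if col in X_input.columns and X_input[col].dtype != expected_dtypes[col]:
            errors.append(f'dtype mismatch for {col}: {X_input[col].dtype} != {expected_dtypes[col]}')
    return errors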

# Good input
print('  Good input:', validate_serving_features(X_te) or 'OK')
# Bad input: wrong order
X_wrong_order = X_te[list(reversed(FEATURES))]
print('  Wrong order:', validate_serving_features(X_wrong_order))
# Bad input: missing feature
print('  Missing col:', validate_serving_features(X_te.drop(columns=['uses_ai'])))

print()
print('Best practice: serialise and serve the full Pipeline, not just the model.')
buf = BytesIO()
joblib.dump(pipe, buf)
buf.seek(0)
pipe_reloaded = joblib.load(buf)
acc_reloaded = accuracy_score(y_te, pipe_reloaded.predict(X_te))
print(f'Reloaded pipeline accuracy: {acc_reloaded:.4f} vs original {acc_correct:.4f} (identical = no skew)')


---

## I.9 — Diagnostic Checklist

Use this checklist when your model's behaviour is puzzling.
Work through it top to bottom — each stage rules out a class of failures.

### Stage 1 — Data integrity
- [ ] `df.isnull().sum()` — any unexpected NaN?
- [ ] `df.duplicated().sum()` — any duplicate rows that could cause row leakage?
- [ ] `df.dtypes` — are all columns the expected types? (int vs float, object vs category)
- [ ] `df[target].value_counts()` — is the class distribution what you expect?
- [ ] Is there a time column? Have you ensured no future data leaks into training?
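
The Stage 1 checks above can be bundled into a single helper so they run the same way every time. This is a minimal sketch: `data_integrity_report` is a hypothetical name, the toy frame is made up and is not the survey data, and the time-column check is left out because it depends on how each project defines "future" data.

```python
import numpy as np
import pandas as pd


def data_integrity_report(df: pd.DataFrame, target: str) -> dict:
    """Collect the Stage 1 checks into one dictionary for quick inspection."""
    null_counts = df.isnull().sum()
    return {
        # Only columns that actually contain NaN
        "null_counts": null_counts[null_counts > 0].to_dict(),
        # Rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Class shares, so imbalance is visible at a glance
        "target_distribution": df[target].value_counts(normalize=True).round(3).to_dict(),
    }


# Tiny made-up frame: one NaN and one exact duplicate row.
toy = pd.DataFrame({
    "YearsCodePro": [1.0, 5.0, np.nan, 5.0],
    "uses_python":  [1, 0, 1, 0],
    "high_earner":  [0, 1, 0, 1],
})
report = data_integrity_report(toy, target="high_earner")
for check, result in report.items():
    print(f"{check}: {result}")
```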

### Stage 2 — Preprocessing integrity
- [ ] Is all preprocessing inside a `Pipeline`? No `fit_transform` on full dataset?
- [ ] Does the training pipeline include imputers for all potentially null columns?
- [ ] Are categorical encoders seeing all categories they'll encounter at inference?
- [ ] Are any features derived from the target or from post-event information?

### Stage 3 — Model sanity checks
- [ ] `train_accuracy >> test_accuracy`? → overfitting: regularise or reduce capacity
- [ ] Both low? → underfitting: add features or model capacity
- [ ] Both high but F1 / PR-AUC low? → class imbalance: check confusion matrix
- [ ] `test_accuracy ≈ train_accuracy` and both suspiciously high? → likely leakage
- [ ] Loss = `nan` after step 1? → NaN in data, exploding gradient, or wrong loss function

### Stage 4 — Deep learning specific
- [ ] Log gradient norms per epoch. Near zero = vanishing. Spike then NaN = exploding
- [ ] Visualise activations. All-zero layers = dead ReLUs
- [ ] Is `optimizer.zero_grad()` called before each backward pass?
- [ ] Is `model.eval()` called before inference (turns off Dropout; makes BatchNorm use its running statistics)?
- [ ] Is `torch.no_grad()` used during evaluation (skips building the autograd graph, saving memory and time)?

### Stage 5 — NLP / transformer specific
- [ ] What fraction of your texts exceed `max_length`? Is truncation acceptable?
- [ ] Is `attention_mask` passed to the model in every forward call?
- [ ] Are you using the same tokeniser checkpoint at train and inference time?
- [ ] Are special tokens `[CLS]`, `[SEP]` handled consistently?

### Stage 6 — Production / serving
- [ ] Is the full Pipeline (not just the model) serialised and loaded at serving time?
- [ ] Are feature names and order validated at the API boundary?
- [ ] Is PSI monitoring running? Has it alerted recently?
- [ ] Is the serving environment using the same library versions as training?
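
For the PSI item above, the sketch below shows one common way to compute the Population Stability Index for a single numeric feature, with bins taken from the quantiles of the training sample. It is an illustration under stated assumptions, not necessarily the implementation used in Chapter 12: the function name, the ten bins, and the `eps` floor are choices made here, and the synthetic data is random.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI between a reference sample (training) and a live sample (serving)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges come from quantiles of the reference distribution;
    # the first and last bins are open-ended so out-of-range values still count.
    interior = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    exp_counts = np.bincount(np.searchsorted(interior, expected, side="right"),
                             minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(interior, actual, side="right"),
                             minlength=n_bins)
    # Floor the fractions so an empty bin does not produce log(0).
    exp_frac = np.clip(exp_counts / len(expected), eps, None)
    act_frac = np.clip(act_counts / len(actual), eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)  # stands in for a training feature
same_dist = rng.normal(0.0, 1.0, 5000)  # serving data, no drift
shifted = rng.normal(1.0, 1.0, 5000)    # serving data, mean shifted by 1 sd

psi_same = population_stability_index(reference, same_dist)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI without drift: {psi_same:.4f}")
print(f"PSI with drift:    {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but alert thresholds are best tuned per feature.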

### Decision tree: 'My model is performing worse than expected'

```
Model performing worse than expected
│
├─► Was performance good during development?
│   ├─► NO  → Was it good on training data?
│   │         ├─► NO  → Underfitting: more features, more capacity, fewer constraints
│   │         └─► YES → Overfitting: regularise, more data, early stopping
│   │
│   └─► YES → Check for leakage: shuffle test, feature importances, temporal audit
│
├─► Did performance drop after deployment?
│   ├─► Suddenly → Training-serving skew: feature computation changed?
│   │              Leakage source removed in production?
│   └─► Gradually → Data drift: PSI monitoring (Chapter 12)
│
└─► Is the metric the right one?
    ├─► Imbalanced classes → Switch to F1 / PR-AUC
    └─► Deep learning NaN → Check gradients, inputs, loss function
```


---

## Summary

The eight failure modes covered in this appendix share a common property:
**they don't crash your code**. They produce models that train, evaluate,
and deploy — but underperform, mislead, or silently deteriorate.

| Failure mode | Detection method | Key fix |
|-------------|-----------------|--------|
| Data leakage | Feature importance dominance; shuffle test | Temporal hold-out; no future-derived features |
| Train/test contamination | CV score vs hold-out comparison | Always use Pipeline |
| NaN propagation | `isnull().sum()`; missingness correlation | Explicit imputation inside Pipeline |
| Wrong fit diagnosis | Learning curves | Plot train vs val score by sample size |
| Class imbalance | Confusion matrix; PR-AUC vs ROC-AUC | Report F1 and PR-AUC; oversample or class weights |
| Gradient problems | Gradient norm logging | Gradient clipping; ReLU + BatchNorm |
| Tokenisation | Truncation audit; attention mask check | `len(tokens) > max_length` alert; always pass mask |
| Training-serving skew | Schema validation; pipeline reload test | Serialise full Pipeline |

**The single most impactful habit:** run the diagnostic checklist (Section I.9)
at the end of every modelling session — before declaring a result final.

---

*End of Appendix I — Python for AI/ML*  
[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)
