# Model Performance Monitoring with Weights & Biases

## Prerequisites

- Python 3.8+
- Google Colab (recommended) or local Jupyter
- A W&B account and API key (students: find the key under Settings → API Keys)

### Files/Resources Provided
- This notebook (you are reading it)

---
Run the cells below in order. Cells that require manual input are clearly indicated.


In [1]:
# Install required packages. In Colab this will install into the runtime.
!pip install --quiet wandb scikit-learn pandas matplotlib

print('Install complete.')

Install complete.





In [2]:
# Imports and quick version check
import sys
import sklearn
import pandas as pd
import wandb

print('Python:', sys.version.splitlines()[0])
print('scikit-learn:', sklearn.__version__)
print('pandas:', pd.__version__)
print('wandb import ok')

Python: 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)]
scikit-learn: 1.3.2
pandas: 2.1.4
wandb import ok


## Step 1 — Login to Weights & Biases

Run the cell below and follow the prompt to authenticate with your W&B API key. In Colab you will be asked to paste the key.

**Instructor tip:** For a classroom, you can create a shared team project and have students pass the team's `entity` name to `train_and_log(...)`, which forwards it to `wandb.init()`.
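
One way to share the team name without editing every call is an environment variable. The sketch below assumes that the W&B client falls back to `WANDB_ENTITY` when no `entity` argument is passed; `my-class-team` is a placeholder, not a real team.

```python
import os

# Placeholder team name (hypothetical); replace it with the entity your instructor provides.
CLASS_ENTITY = "my-class-team"

# setdefault keeps any value that is already present in the environment.
os.environ.setdefault("WANDB_ENTITY", CLASS_ENTITY)

print("W&B entity:", os.environ["WANDB_ENTITY"])
```

Run this before the first `wandb.init()` call so that the setting is visible when the run is created.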


In [3]:
# Login to W&B (interactive). In Colab this will prompt for your API key.
import wandb
wandb.login()
print('If login succeeded, you will see your W&B username above.')

wandb: Currently logged in as: sairohith (ir2023) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


If login succeeded, you will see your W&B username above.


## Step 2 — Load data and helper functions

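This notebook uses the Iris dataset for simplicity. The cell below loads it, imports the scikit-learn utilities used in later steps, and holds out 30% of the samples as a test set. The helper functions that train, evaluate, and log runs to W&B are defined in Step 3.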


In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
import pandas as pd

# Load dataset
data = load_iris()
X = data['data']
y = data['target']
feature_names = data['feature_names']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print('Shapes: X_train', X_train.shape, 'X_test', X_test.shape)

Shapes: X_train (105, 4) X_test (45, 4)


## Step 3 — Training function: runs & logging

The `train_and_log` function trains a `RandomForestClassifier`, then logs its metrics, a confusion-matrix image, and the fitted model (as an artifact) to W&B.


In [7]:
import matplotlib.pyplot as plt
import io

def plot_confusion_matrix(cm, labels):
    fig, ax = plt.subplots(figsize=(4,4))
    ax.imshow(cm, interpolation='nearest')
    ax.set_title('Confusion matrix')
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45)
    ax.set_yticklabels(labels)
    for i in range(len(labels)):
        for j in range(len(labels)):
            ax.text(j, i, str(cm[i, j]), ha='center', va='center')
    plt.tight_layout()
    return fig

def train_and_log(run_name, n_estimators=50, random_state=42, simulate_shift=False, entity=None, project='mlops-performance-monitoring'):
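    """Train a RandomForest on the global train split and log the run to W&B.

    If simulate_shift=True, Gaussian noise (mean 0.5, std 0.1) is added to X_test to emulate data drift.
    Returns (accuracy, run); the caller is responsible for calling run.finish().
    """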
    run = wandb.init(project=project, name=run_name, entity=entity, reinit=True)
    wandb.config.update({'n_estimators': n_estimators, 'random_state': random_state})

    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    clf.fit(X_train, y_train)
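    if simulate_shift:
        # Add Gaussian noise with mean 0.5 to every feature of the test set.
        # np.random is not seeded here, so the drifted accuracy varies between executions.
        X_eval = X_test + np.random.normal(loc=0.5, scale=0.1, size=X_test.shape)
    else:
        X_eval = X_test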
    preds = clf.predict(X_eval)
    acc = accuracy_score(y_test, preds)
    cr = classification_report(y_test, preds, output_dict=True)
    cm = confusion_matrix(y_test, preds)

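    # Log accuracy and the per-class metrics in a single call so they share one step
    metrics = {'accuracy': acc}
    for k, v in cr.items():
        if k.isdigit():  # class labels '0', '1', '2'; skips 'accuracy', 'macro avg', 'weighted avg'
            metrics[f'class_{k}_precision'] = v['precision']
            metrics[f'class_{k}_recall'] = v['recall']
            metrics[f'class_{k}_f1'] = v['f1-score']
    wandb.log(metrics)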

    # Confusion matrix image (wandb.Image accepts a matplotlib figure directly)
    fig = plot_confusion_matrix(cm, labels=data['target_names'])
    wandb.log({"confusion_matrix": wandb.Image(fig)})
    plt.close(fig)

    # Log model as artifact
    artifact = wandb.Artifact('rf-model', type='model')
    import joblib
    joblib.dump(clf, 'rf_model.joblib')
    artifact.add_file('rf_model.joblib')
    run.log_artifact(artifact)

    print(f'Run {run_name} logged. Accuracy = {acc:.4f}')

    return acc, run
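
The `k.isdigit()` filter in `train_and_log` relies on the shape of the dictionary that `classification_report(..., output_dict=True)` returns: one entry per class label (as a string), plus `accuracy`, `macro avg`, and `weighted avg`. The sketch below uses a hand-written dictionary of that shape with illustrative numbers, so it runs without scikit-learn; `flatten_report` is a hypothetical helper name, not part of the notebook.

```python
# Hand-written stand-in for classification_report(y_true, y_pred, output_dict=True).
# The numbers are illustrative only (a two-class problem with 6 samples).
report = {
    "0": {"precision": 1.0, "recall": 0.5, "f1-score": 0.6667, "support": 4},
    "1": {"precision": 0.5, "recall": 1.0, "f1-score": 0.6667, "support": 2},
    "accuracy": 0.6667,
    "macro avg": {"precision": 0.75, "recall": 0.75, "f1-score": 0.6667, "support": 6},
    "weighted avg": {"precision": 0.8333, "recall": 0.6667, "f1-score": 0.6667, "support": 6},
}


def flatten_report(report):
    """Turn the nested report into one flat dict of scalar metrics."""
    metrics = {"accuracy": report["accuracy"]}
    for label, scores in report.items():
        # Only per-class entries have purely numeric keys; the averages are skipped.
        if label.isdigit():
            metrics[f"class_{label}_precision"] = scores["precision"]
            metrics[f"class_{label}_recall"] = scores["recall"]
            metrics[f"class_{label}_f1"] = scores["f1-score"]
    return metrics


flat = flatten_report(report)
for name, value in flat.items():
    print(name, value)
```

A flat dictionary like this can be passed to a single logging call, which keeps all metrics of one evaluation at the same step.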

## Step 4 — Baseline run

Execute a baseline training run, then inspect its metrics in your W&B project dashboard.


In [8]:
# Baseline run
baseline_acc, run = train_and_log('baseline-run', n_estimators=50, random_state=42)
run.finish()
baseline_acc

Run baseline-run logged. Accuracy = 1.0000


Run summary:

accuracy: 1

| Class | Precision | Recall | F1 |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 |


1.0

## Step 5 — Simulate drift and run again

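Now train the same model configuration again, but add a simulated distribution shift to the evaluation data so that performance degrades. Note that the drifted-run cell also changes `random_state` to 99, so the forest itself differs slightly from the baseline. Compare the new run with the baseline in the W&B UI.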
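
To make the shift itself concrete, here is a dependency-free sketch of the same idea: add Gaussian noise with mean 0.5 and standard deviation 0.1 to every value, then compare per-feature means. The data is synthetic and the helper names (`add_shift`, `feature_means`) are hypothetical; unlike the notebook's NumPy call, this version is seeded so that repeated calls give the same result.

```python
import random
from statistics import mean


def add_shift(rows, loc=0.5, scale=0.1, seed=0):
    """Return a copy of rows with Gaussian noise N(loc, scale) added to every value."""
    rng = random.Random(seed)  # seeded so repeated calls produce the same shift
    return [[value + rng.gauss(loc, scale) for value in row] for row in rows]


def feature_means(rows):
    """Mean of each column."""
    return [mean(column) for column in zip(*rows)]


# Synthetic reference data: 200 rows, 4 features (a stand-in for X_test).
data_rng = random.Random(1)
reference = [[data_rng.uniform(0.0, 8.0) for _ in range(4)] for _ in range(200)]
shifted = add_shift(reference)

# The difference in per-feature means should be close to loc=0.5.
mean_shift = [s - r for s, r in zip(feature_means(shifted), feature_means(reference))]
print([round(d, 2) for d in mean_shift])
```

A per-feature mean comparison like this is a crude drift check; it is used here only to confirm that the simulated shift has the intended size.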


In [9]:
# Drifted run (simulate data shift)
drifted_acc, run = train_and_log('drifted-run', n_estimators=50, random_state=99, simulate_shift=True)
run.finish()
drifted_acc

Run drifted-run logged. Accuracy = 0.6222


Run summary:

accuracy: 0.62222

| Class | Precision | Recall | F1 |
|---|---|---|---|
| 0 | 1.0 | 0.68421 | 0.8125 |
| 1 | 0.25 | 0.15385 | 0.19048 |
| 2 | 0.54167 | 1.0 | 0.7027 |


0.6222222222222222
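
The per-class numbers in the run summary can be cross-checked by hand. The confusion matrix below is not copied from the notebook output; it is a reconstruction that is consistent with the reported metrics, assuming the 45 test samples split 19/13/13 across the three classes. `per_class_metrics` is a hypothetical helper written for this check.

```python
# Rows are true classes, columns are predicted classes.
# Reconstructed to match the reported metrics; not taken from the notebook output.
cm = [
    [13, 6, 0],
    [0, 2, 11],
    [0, 0, 13],
]


def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for each class of a square confusion matrix."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: samples predicted as k
        actual_k = sum(cm[k])                          # row sum: samples that truly are k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Accuracy is the diagonal (correct predictions) divided by the total sample count.
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy = {accuracy:.5f}")
for k, (p, r, f1) in enumerate(per_class_metrics(cm)):
    print(f"class {k}: precision={p:.5f} recall={r:.5f} f1={f1:.5f}")
```

Under this reconstruction, the shift pushes six class-0 samples into class 1 and eleven class-1 samples into class 2, which is why class 1 has both the lowest precision and the lowest recall.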

## Step 6 — Simple alerting example

W&B supports programmatic alerts. The snippet below raises an alert when accuracy drops below a threshold. Alerts appear in the W&B UI and, depending on how notifications are configured in your W&B settings, can also be delivered by email or Slack.


In [10]:
# This cell demonstrates programmatic alerting. Replace threshold as appropriate.
drifted_acc, run = train_and_log('drifted-run', simulate_shift=True)
threshold = 0.85
if drifted_acc < threshold:
    wandb.alert(title='Low accuracy detected', text=f'Accuracy {drifted_acc:.3f} below threshold {threshold}', level=wandb.AlertLevel.WARN)
    print('Alert sent (check W&B)')
else:
    print('Accuracy OK')

run.finish()

Run drifted-run logged. Accuracy = 0.7556
Alert sent (check W&B)


Run summary:

accuracy: 0.75556

| Class | Precision | Recall | F1 |
|---|---|---|---|
| 0 | 1.0 | 0.94444 | 0.89474 recall, see below |
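
The threshold check can also be separated from the delivery mechanism, which makes it testable without an active W&B run. The following is a minimal sketch under that assumption; `check_accuracy` and `record_alert` are hypothetical names, and the callback stands in for `wandb.alert`.

```python
def check_accuracy(accuracy, threshold, send_alert):
    """Call send_alert(title, text) when accuracy is below threshold.

    Returns True if an alert was sent, False otherwise.
    """
    if accuracy < threshold:
        send_alert(
            "Low accuracy detected",
            f"Accuracy {accuracy:.3f} below threshold {threshold}",
        )
        return True
    return False


# Stand-in for wandb.alert: collect alerts in a list instead of sending them.
sent = []


def record_alert(title, text):
    sent.append((title, text))


print(check_accuracy(0.7556, 0.85, record_alert))
print(check_accuracy(0.9000, 0.85, record_alert))
print(sent)
```

In the notebook, the callback could be a small wrapper such as `lambda title, text: wandb.alert(title=title, text=text, level=wandb.AlertLevel.WARN)`, called while the run is still active.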
