Week 8 · Day 5 — Practical Fine-Tuning Workflow
Why this matters

Once you know how to freeze/unfreeze layers, you need a workflow to fine-tune models reliably. Production setups include class imbalance handling, checkpointing, and planned unfreezing to get the most from pretrained networks without overfitting.

Theory Essentials

Weighted Loss: counteracts class imbalance by giving more weight to under-represented classes.

Checkpointing: save best models (by val acc/loss) → reproducible & avoids losing progress.

Early Unfreeze: start with frozen backbone, then unfreeze gradually when head stabilizes.

Config-driven training: argparse/YAML configs help organize hyperparameters.

Trade-off: faster convergence vs risk of catastrophic forgetting when unfreezing too early.


### **Weighted Loss**

* Problem: if some classes are rare in your dataset, the model can “ignore” them and still get high accuracy.
* Fix: give **higher weight** to mistakes on under-represented classes.
* Effect: balances learning so every class matters equally.
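
To make the effect concrete, here is a plain-Python sketch of what a class-weighted cross-entropy with mean reduction computes, assuming the weighted-mean definition given in the PyTorch documentation for `nn.CrossEntropyLoss`. The function name `weighted_cross_entropy` and the toy batch are illustrative, not part of the notebook.

```python
import math

def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample cross-entropy losses (illustrative helper)."""
    num, den = 0.0, 0.0
    for row, t in zip(logits, targets):
        # Numerically stable log-sum-exp over the logits of one sample
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        loss = log_sum_exp - row[t]          # -log softmax(row)[t]
        num += class_weights[t] * loss       # each sample is scaled by its class weight
        den += class_weights[t]              # ...and the total is divided by the weight sum
    return num / den

# Toy batch: class 0 is common (3 samples), class 1 is rare (1 sample).
# The model always favors class 0, so the rare sample is the only mistake.
logits = [[2.0, 0.0]] * 4
targets = [0, 0, 0, 1]

plain = weighted_cross_entropy(logits, targets, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, targets, [1.0, 3.0])   # inverse-frequency ratio 1:3
rescaled = weighted_cross_entropy(logits, targets, [10.0, 30.0]) # same ratio, different scale

print(f"plain={plain:.4f} weighted={weighted:.4f} rescaled={rescaled:.4f}")
```

The weighted loss is larger because the single rare-class mistake now counts as much as the three common-class samples together. Rescaling all weights by a constant leaves the value unchanged, so only the ratios between class weights matter.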

---

### **Checkpointing**

* During training, save a copy of the **best model so far** (based on validation accuracy or loss).
* This ensures you:

  1. Don’t lose progress if training crashes.
  2. Can always reload the **best-performing** version, not just the last epoch.
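
A minimal sketch of a resumable checkpoint, assuming you also want the optimizer state and epoch counter back after a crash (the notebook below saves only `model.state_dict()`). The helper names `save_checkpoint` and `load_checkpoint`, the file name `ckpt.pt`, and the tiny `nn.Linear` stand-in model are hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                       # stand-in for the real network
opt = optim.Adam(model.parameters(), lr=1e-3)

def save_checkpoint(path, model, opt, epoch, best_acc):
    # Bundle everything needed to resume training into one file
    torch.save({
        "model": model.state_dict(),
        "optimizer": opt.state_dict(),
        "epoch": epoch,
        "best_acc": best_acc,
    }, path)

def load_checkpoint(path, model, opt):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["optimizer"])
    return ckpt["epoch"], ckpt["best_acc"]

# Usage: save, simulate lost progress by corrupting the weights, then restore.
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
save_checkpoint(path, model, opt, epoch=3, best_acc=0.8)
saved_weight = model.weight.detach().clone()
with torch.no_grad():
    model.weight.add_(1.0)
epoch, best_acc = load_checkpoint(path, model, opt)
print("resumed at epoch", epoch, "with best_acc", best_acc)
```

Saving the optimizer state matters for Adam in particular, since its moment estimates would otherwise restart from zero on resume.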

---

### **Early Unfreeze**

* Strategy for transfer learning:

  1. Start with the pretrained backbone **frozen** (only train the new head).
  2. Once the head is stable, **unfreeze** the backbone and fine-tune with a lower LR.
* Benefit: avoids destroying useful pretrained features too early.
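
The two phases can be sketched on a toy model as follows. This variant registers the backbone through `Optimizer.add_param_group` instead of rebuilding the optimizer, which is a design choice of this sketch and not what the notebook code below does. The small `backbone` and `head` modules are stand-ins for a pretrained network and its new classifier.

```python
import torch
import torch.nn as nn
import torch.optim as optim

backbone = nn.Sequential(nn.Linear(8, 8), nn.ReLU())  # stand-in for a pretrained backbone
head = nn.Linear(8, 3)                                 # new task-specific head
model = nn.Sequential(backbone, head)

# Phase 1: freeze the backbone and optimize only the head.
for p in backbone.parameters():
    p.requires_grad = False
opt = optim.Adam(head.parameters(), lr=1e-3)

# ... train the head until validation accuracy stabilizes ...

# Phase 2: unfreeze the backbone and give it its own, lower learning rate.
for p in backbone.parameters():
    p.requires_grad = True
opt.add_param_group({"params": list(backbone.parameters()), "lr": 1e-4})

lrs = [g["lr"] for g in opt.param_groups]
print(lrs)  # [0.001, 0.0001]
```

Adding a parameter group keeps the Adam statistics already accumulated for the head, whereas constructing a fresh optimizer resets them. Either approach works for short runs like the ones in this notebook.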

---

### **Config-driven Training**

* Instead of hardcoding hyperparameters (like LR, batch size, smoothing), put them in a **YAML config file** or expose them as **argparse command-line flags**.
* This makes experiments reproducible and easier to organize → you just change the config, not the code.
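
One possible shape for this, using only the standard library: a dataclass holds the hyperparameters and `argparse` fills it from the command line. The class name `TrainConfig`, the function `parse_config`, and the flag names are assumptions made for this sketch.

```python
import argparse
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class TrainConfig:
    lr_head: float = 1e-3
    lr_backbone: float = 1e-4
    batch_size: int = 32
    epochs: int = 2
    freeze_backbone: bool = True
    unfreeze_epoch: Optional[int] = None   # 1-based epoch at which to unfreeze, or None

def parse_config(argv=None):
    parser = argparse.ArgumentParser(description="Fine-tuning config (illustrative)")
    parser.add_argument("--lr-head", type=float, default=1e-3)
    parser.add_argument("--lr-backbone", type=float, default=1e-4)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=2)
    # The backbone is frozen by default; pass --no-freeze to train it from the start
    parser.add_argument("--no-freeze", dest="freeze_backbone", action="store_false")
    parser.add_argument("--unfreeze-epoch", type=int, default=None)
    args = parser.parse_args(argv)
    return TrainConfig(**vars(args))

# Passing a list simulates: python train.py --unfreeze-epoch 1 --lr-backbone 1e-5
cfg = parse_config(["--unfreeze-epoch", "1", "--lr-backbone", "1e-5"])
print(asdict(cfg))
```

Logging `asdict(cfg)` next to each result makes it easy to see later which settings produced which accuracy.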

---

### **Trade-off** (faster convergence vs catastrophic forgetting)

* **Faster convergence**: unfreezing earlier lets the backbone adapt to your dataset sooner.
* **Risk**: if you unfreeze too early (with high LR), pretrained features can be overwritten → the model “forgets” useful general features.

---

👉 In short:

* Weighted loss = fair learning.
* Checkpointing = safety + reproducibility.
* Early unfreeze = careful fine-tuning.
* Configs = clean experiment management.
* Trade-off = speed vs stability in transfer learning.


In [1]:
# Setup
import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms, models
import numpy as np, time

torch.manual_seed(42)
device = torch.device("cpu")

# ---------- Small CIFAR-10 subset ----------
SUB_TRAIN, SUB_VAL = 3000, 800
tf = transforms.Compose([
    transforms.Resize((224,224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5)),
])
train_full = datasets.CIFAR10("data", train=True, download=True, transform=tf)
val_full   = datasets.CIFAR10("data", train=False, download=True, transform=tf)

trainset = Subset(train_full, range(SUB_TRAIN))
valset   = Subset(val_full,   range(SUB_VAL))
trainloader = DataLoader(trainset, batch_size=32, shuffle=True)
valloader  = DataLoader(valset,   batch_size=64)

# ---------- Weighted Loss ----------
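# Read labels straight from the dataset targets; iterating over `trainset` would load and transform every image
labels = [train_full.targets[i] for i in range(SUB_TRAIN)]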
class_counts = np.bincount(labels, minlength=10)
weights = 1.0 / (class_counts + 1e-6)
weights = torch.tensor(weights / weights.sum(), dtype=torch.float32)
print("Class weights:", weights)

# ---------- Model ----------
def get_model(freeze=True):
    m = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    if freeze:
        for p in m.parameters(): p.requires_grad=False
    in_feats = m.fc.in_features
    m.fc = nn.Linear(in_feats, 10)
    return m

# ---------- Training ----------
@torch.inference_mode()
def evaluate(model, loader):
    model.eval(); correct=0; total=0
    for X,y in loader:
        X,y = X.to(device), y.to(device)
        preds = model(X).argmax(1)
        correct += (preds==y).sum().item(); total += y.size(0)
    return correct/total

def train_finetune(epochs=2, freeze=True, unfreeze_at=None):
    model = get_model(freeze=freeze).to(device)
    crit = nn.CrossEntropyLoss(weight=weights)
    opt = optim.Adam(filter(lambda p:p.requires_grad, model.parameters()), lr=1e-3)
    best_acc=0
    for ep in range(1,epochs+1):
        model.train()
        for X,y in trainloader:
            X,y = X.to(device), y.to(device)
            opt.zero_grad()
            loss = crit(model(X), y)
            loss.backward(); opt.step()
        val_acc = evaluate(model,valloader)
        print(f"Epoch {ep}: val_acc {val_acc:.3f}")
        # Early unfreeze step
        if unfreeze_at and ep==unfreeze_at:
            for p in model.parameters(): p.requires_grad=True
            opt = optim.Adam(model.parameters(), lr=1e-4)  # lower LR for fine-tuning
            print(">>> Unfroze backbone at epoch", ep)
        # Save best checkpoint
        if val_acc > best_acc:
            best_acc = val_acc
            # Note: every run writes to the same path, so a later run overwrites an earlier run's file.
            torch.save(model.state_dict(),"best_resnet18.pt")
            print(">>> Saved new best checkpoint")
    return best_acc

# ---------- Run ----------
print("\nFeature Extract (frozen):")
acc1 = train_finetune(epochs=2, freeze=True)

print("\nEarly Unfreeze (after 1 epoch):")
acc2 = train_finetune(epochs=2, freeze=True, unfreeze_at=1)

print("\nFinal best checkpoint stored as best_resnet18.pt")


Class weights: tensor([0.1001, 0.1043, 0.0930, 0.1051, 0.0963, 0.1073, 0.0960, 0.1008, 0.0972,
        0.0998])

Feature Extract (frozen):
Epoch 1: val_acc 0.689
>>> Saved new best checkpoint
Epoch 2: val_acc 0.711
>>> Saved new best checkpoint

Early Unfreeze (after 1 epoch):
Epoch 1: val_acc 0.647
>>> Unfroze backbone at epoch 1
>>> Saved new best checkpoint
Epoch 2: val_acc 0.875
>>> Saved new best checkpoint

Final best checkpoint stored as best_resnet18.pt


1) Core (10–15 min)

Task: Run training once with freeze=True and once with unfreeze_at=1. Compare accuracies.

Frozen backbone: val_acc 0.711
Unfreeze at 1: val_acc 0.875

2) Practice (10–15 min)

Task: Change the optimizer LR after unfreezing (try 1e-5 vs 1e-4). Which gives smoother accuracy?

In [2]:
def train_finetune_lr(epochs=2, freeze=True, unfreeze_at=None, lr_ft=1e-4):
    model = get_model(freeze=freeze).to(device)
    crit = nn.CrossEntropyLoss(weight=weights)
    opt = optim.Adam(filter(lambda p:p.requires_grad, model.parameters()), lr=1e-3)
    best_acc=0
    for ep in range(1,epochs+1):
        model.train()
        for X,y in trainloader:
            X,y = X.to(device), y.to(device)
            opt.zero_grad()
            loss = crit(model(X), y)
            loss.backward(); opt.step()
        val_acc = evaluate(model,valloader)
        print(f"Epoch {ep}: val_acc {val_acc:.3f}")
        # Early unfreeze step
        if unfreeze_at and ep==unfreeze_at:
            for p in model.parameters(): p.requires_grad=True
            opt = optim.Adam(model.parameters(), lr=lr_ft)  # fine-tuning LR is set by the lr_ft argument
            print(">>> Unfroze backbone at epoch", ep)
        # Save best checkpoint
        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(model.state_dict(),"best_resnet18.pt")
            print(">>> Saved new best checkpoint")
    return best_acc

print("\nEarly Unfreeze (after 1 epoch). Lr after unfreezing 1e-5:")
acc3 = train_finetune_lr(epochs=2, freeze=True, unfreeze_at=1, lr_ft=1e-5)



Early Unfreeze (after 1 epoch). Lr after unfreezing 1e-5:
Epoch 1: val_acc 0.625
>>> Unfroze backbone at epoch 1
>>> Saved new best checkpoint
Epoch 2: val_acc 0.769
>>> Saved new best checkpoint


Unfreeze at 1 with lr 1e-4: val_acc 0.875
Unfreeze at 1 with lr 1e-5: val_acc 0.769

3) Stretch (optional, 10–15 min)

Task: Simulate imbalance: train only on classes 0–4. Does weighted loss help validation accuracy across all 10 classes?

In [3]:
# ====== Stretch: train only on classes 0–4, then eval on all 10 ======

import torch, numpy as np
from torch.utils.data import Subset
from collections import Counter

# 1) Build a TRAIN subset that only contains classes 0–4
subset_labels = {0,1,2,3,4}
idx_0_4 = [i for i, y in enumerate(train_full.targets) if y in subset_labels]
# keep the first SUB_TRAIN from those for speed
idx_0_4 = idx_0_4[:SUB_TRAIN]
train_0_4 = Subset(train_full, idx_0_4)
trainloader_0_4 = DataLoader(train_0_4, batch_size=32, shuffle=True)

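# 2) Recompute class weights for weighted loss (len=10 for CrossEntropy)
#    Unseen classes (5–9) get an epsilon count, so their inverse-frequency weights are huge
#    and the normalized weights of the seen classes (0–4) print as ~0. Only the relative
#    weights among 0–4 affect training, because no training target belongs to 5–9 and the
#    mean-reduced loss divides by the sum of the target weights.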
labels_seen = [train_full.targets[i] for i in idx_0_4]
cnt = Counter(labels_seen)
counts = np.zeros(10, dtype=np.float32)
for c in range(10):
    counts[c] = cnt.get(c, 1e-6)  # tiny epsilon for unseen classes to avoid div-by-zero
w_invfreq = 1.0 / counts
w_invfreq = w_invfreq / w_invfreq.sum()
weights_0_4 = torch.tensor(w_invfreq, dtype=torch.float32)

print("Train size (classes 0–4 only):", len(train_0_4))
print("Counts per class (0..9):", counts.astype(int))
print("Weights (sum=1):", weights_0_4.numpy().round(4))

# 3) Eval helper with per-class accuracy
@torch.inference_mode()
def eval_report(model, loader):
    model.eval()
    total = np.zeros(10, dtype=np.int64)
    correct = np.zeros(10, dtype=np.int64)
    for X, y in loader:
        X = X.to(device)
        pred = model(X).argmax(1).cpu().numpy()
        y = y.numpy()
        for t, p in zip(y, pred):
            total[t] += 1
            correct[t] += int(t == p)
    overall = correct.sum()/total.sum()
    per_class = {c: (correct[c]/total[c] if total[c] else 0.0) for c in range(10)}
    return overall, per_class

# 4) Training runner that lets us pass a custom loader and optional weights
def run_once(name, use_weights=False, freeze=True, epochs=2, unfreeze_at=None):
    print(f"\n=== {name} | weights={use_weights} | epochs={epochs} ===")
    model = get_model(freeze=freeze).to(device)
    weight_vec = (weights_0_4 if use_weights else None)
    if weight_vec is not None:
        # ensure device match if you ever switch to GPU
        weight_vec = weight_vec.to(device)
    criterion = nn.CrossEntropyLoss(weight=weight_vec)
    opt = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)

    for ep in range(1, epochs+1):
        model.train()
        for X, y in trainloader_0_4:
            X, y = X.to(device), y.to(device)
            opt.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            opt.step()

        # optional early unfreeze (kept off here to stay fair/fast)
        if unfreeze_at and ep == unfreeze_at:
            for p in model.parameters(): p.requires_grad = True
            opt = optim.Adam(model.parameters(), lr=1e-4)
            print(">>> Unfroze backbone")

        overall, per_class = eval_report(model, valloader)
        print(f"Ep {ep}: val_acc={overall:.3f}  "
              f"(avg acc on 0–4={np.mean([per_class[c] for c in range(5)]):.3f}, "
              f"on 5–9={np.mean([per_class[c] for c in range(5,10)]):.3f})")

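    # Final evaluation for the summary table. It can differ slightly from the last "Ep" line
    # because the shared transform `tf` also applies RandomHorizontalFlip to validation images.
    overall, per_class = eval_report(model, valloader)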
    return overall, per_class

# 5) Run: Unweighted vs Weighted (frozen backbone for speed)
acc_unw, per_unw = run_once("Train on 0–4 (Unweighted CE)", use_weights=False, epochs=2)
acc_wt,  per_wt  = run_once("Train on 0–4 (Weighted CE)",   use_weights=True,  epochs=2)

# 6) Quick summary table
import pandas as pd
def per_class_list(d): return [round(d[c]*100,2) for c in range(10)]
df = pd.DataFrame({
    "Setup": ["Unweighted", "Weighted"],
    "Overall Val Acc (%)": [round(acc_unw*100,2), round(acc_wt*100,2)],
    **{f"class_{c} (%)": [per_unw[c]*100, per_wt[c]*100] for c in range(10)}
})
print("\nSummary (accuracy %):")
print(df.to_string(index=False))


Train size (classes 0–4 only): 3000
Counts per class (0..9): [615 554 627 585 619   0   0   0   0   0]
Weights (sum=1): [0.  0.  0.  0.  0.  0.2 0.2 0.2 0.2 0.2]

=== Train on 0–4 (Unweighted CE) | weights=False | epochs=2 ===
Ep 1: val_acc=0.381  (avg acc on 0–4=0.814, on 5–9=0.000)
Ep 2: val_acc=0.389  (avg acc on 0–4=0.832, on 5–9=0.000)

=== Train on 0–4 (Weighted CE) | weights=True | epochs=2 ===
Ep 1: val_acc=0.370  (avg acc on 0–4=0.794, on 5–9=0.000)
Ep 2: val_acc=0.384  (avg acc on 0–4=0.822, on 5–9=0.000)

Summary (accuracy %):
     Setup  Overall Val Acc (%)  class_0 (%)  class_1 (%)  class_2 (%)  class_3 (%)  class_4 (%)  class_5 (%)  class_6 (%)  class_7 (%)  class_8 (%)  class_9 (%)
Unweighted                38.75        83.75    92.424242    76.543210    80.769231    80.281690          0.0          0.0          0.0          0.0          0.0
  Weighted                38.00        88.75    95.454545    55.555556    83.333333    84.507042          0.0          0.0          



## What the stretch code does

Goal: **simulate class imbalance** *more aggressively* and test whether **weighted loss** helps.

1. **Filter the training data to only classes 0–4**

   * We build a new `train_0_4` dataset that **removes** classes 5–9 from the **training** set.
   * Validation set **still has all 10 classes** (unchanged) → we evaluate generalization to the real distribution.

2. **Recompute class weights for this new (imbalanced) train set**

   * Count how many samples we have for each of the 10 classes **in the new train set**.
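   * For 5–9, counts are 0 (we never train on them), so the code substitutes a tiny epsilon to avoid division by zero.
   * Build a 10-length weight vector (inverse frequency), normalized to sum to 1. The epsilon makes the unseen classes absorb almost all of the weight, which is why the printed weights are ~0 for 0–4 and 0.2 for 5–9; training still depends only on the relative weights among 0–4.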
   * This lets us compare:

     * **Unweighted CE** (no class balancing)
     * **Weighted CE** (penalizes mistakes on rarer seen classes more)

3. **Train two short runs on the filtered train set**

   * Same model (ResNet18 head), frozen backbone for speed.
   * Run A: **Unweighted** cross-entropy.
   * Run B: **Weighted** cross-entropy (using the vector above).
   * We do not change the validation set.

4. **Evaluate on all 10 classes (overall + per-class)**

   * After each epoch, we compute:

     * **Overall val accuracy** (all classes together)
     * **Per-class accuracy** (so you can see 0–4 vs 5–9 separately)
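
The weight vector from step 2 can be reproduced by hand from the counts printed in the notebook output. This is a plain-Python sketch of the same arithmetic; the variable names are illustrative.

```python
# Per-class counts printed by the notebook for classes 0-4; epsilon stands in for unseen 5-9
counts = [615, 554, 627, 585, 619] + [1e-6] * 5

inv = [1.0 / c for c in counts]            # inverse frequency
total = sum(inv)
normalized = [w / total for w in inv]      # normalized to sum to 1
print([round(w, 4) for w in normalized])   # [0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.2, 0.2, 0.2, 0.2]

# Normalization rescales every weight by the same factor, so the ratios
# among the seen classes are exactly the plain inverse-frequency ratios.
seen = normalized[:5]
relative = [w / max(seen) for w in seen]
print([round(r, 3) for r in relative])
```

The largest relative weight goes to class 1 (554 samples) and the smallest to class 2 (627 samples), a spread of only about 12%, so the weighted and unweighted runs are expected to behave similarly.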

## What to expect (intuition)

* Because the model **never sees classes 5–9 during training**, it will perform **poorly on 5–9** no matter what.
* **Weighted loss** can only rebalance learning **among the seen classes (0–4)**. These are already close to balanced here (554 to 627 samples each), so expect a small effect: in the run above, average accuracy on 0–4 after 2 epochs was 0.832 unweighted vs 0.822 weighted.
* **Overall val accuracy** will not improve much (roughly half the val set is the unseen 5–9), so the **per-class breakdown** is the place to check whether weighting helped.

🚀 New Mini-Challenge (≤40 min)

Task:
Build a reusable fine-tuning script that supports the following via simple variables at the top of the notebook/script:

freeze_backbone (bool)

unfreeze_epoch (int or None)

lr_head (float)

lr_backbone (float)

Run 3 configs on a small CIFAR-10 subset (2–3 epochs each):

Feature Extract only (freeze_backbone=True, unfreeze_epoch=None)

Early Unfreeze (freeze_backbone=True, unfreeze_epoch=1, lr_backbone=1e-4)

Fine-Tune from start (freeze_backbone=False, unfreeze_epoch=None, lr_backbone=1e-4)

Acceptance Criteria:

One clean training function that reads the config and handles freeze/unfreeze automatically.

A results table with config, final val acc, time/epoch.

3–4 lines of analysis: Which setup balanced speed vs accuracy best? Would you use different LRs for head/backbone in production?

In [4]:
# ===== New Mini-Challenge runner (uses your existing loaders + get_model) =====
import time, torch, pandas as pd
import torch.nn as nn, torch.optim as optim

# If you already computed class weights earlier, set `class_weights = weights.to(device)`.
# Otherwise leave as None.
class_weights = None  # or: weights.to(device)

def split_params(model):
    """Return (backbone_params, head_params). Assumes model.fc is the head."""
    head = list(model.fc.parameters())
    bb   = [p for n,p in model.named_parameters() if not n.startswith("fc.")]
    return bb, head

@torch.inference_mode()
def eval_acc(model, loader):
    model.eval(); correct=0; total=0
    for X,y in loader:
        X,y = X.to(device), y.to(device)
        pred = model(X).argmax(1)
        correct += (pred==y).sum().item(); total += y.size(0)
    return correct/total

def run_config(config, epochs=3, verbose=True):
    """
    config:
      freeze_backbone: bool
      unfreeze_epoch : int | None  (1-based epoch to unfreeze)
      lr_head        : float
      lr_backbone    : float
    """
    m = get_model(freeze=config["freeze_backbone"]).to(device)
    bb_params, head_params = split_params(m)

    # Optimizer with 2 param groups (backbone may be frozen initially)
    def make_opt(backbone_lr, head_lr):
        groups = []
        if any(p.requires_grad for p in bb_params):
            groups.append({"params": [p for p in bb_params if p.requires_grad], "lr": backbone_lr})
        groups.append({"params": head_params, "lr": head_lr})
        return optim.Adam(groups)

    opt = make_opt(config["lr_backbone"], config["lr_head"])
    crit = nn.CrossEntropyLoss(weight=class_weights)

    per_epoch_times, val_hist = [], []
    for ep in range(1, epochs+1):
        t0 = time.time()
        m.train()
        for X,y in trainloader:
            X,y = X.to(device), y.to(device)
            opt.zero_grad(set_to_none=True)
            loss = crit(m(X), y)
            loss.backward()
            opt.step()

        # Optional early unfreeze
        if config["unfreeze_epoch"] is not None and ep == config["unfreeze_epoch"]:
            for p in m.parameters(): p.requires_grad = True
            # after unfreezing, rebuild optimizer with smaller backbone LR
            opt = make_opt(config["lr_backbone"], config["lr_head"])
            if verbose: print(f">>> Unfroze backbone at epoch {ep}")

        val_acc = eval_acc(m, valloader)
        val_hist.append(val_acc)
        per_epoch_times.append(time.time() - t0)
        if verbose:
            # show current LRs for each group
            lrs = [pg["lr"] for pg in opt.param_groups]
            print(f"Ep {ep:02d} | val_acc={val_acc:.3f} | time={per_epoch_times[-1]:.1f}s | LRs={lrs}")

    return {
        "final_val_acc": val_hist[-1],
        "time_per_epoch": sum(per_epoch_times)/len(per_epoch_times),
        "val_hist": val_hist,
    }

# ---- Define the 3 configs (edit here) ----
CONFIGS = [
    {
        "name": "Feature Extract only",
        "freeze_backbone": True,  "unfreeze_epoch": None,
        "lr_head": 1e-3,          "lr_backbone": 0.0,   # ignored while frozen
    },
    {
        "name": "Early Unfreeze (ep=1)",
        "freeze_backbone": True,  "unfreeze_epoch": 1,
        "lr_head": 1e-3,          "lr_backbone": 1e-4,  # small LR for backbone
    },
    {
        "name": "Fine-tune from start",
        "freeze_backbone": False, "unfreeze_epoch": None,
        "lr_head": 1e-3,          "lr_backbone": 1e-4,
    },
]

# ---- Run all and collect results ----
rows = []
for cfg in CONFIGS:
    print(f"\n=== {cfg['name']} ===")
    out = run_config(cfg, epochs=3, verbose=True)
    rows.append({
        "Setup": cfg["name"],
        "freeze_backbone": cfg["freeze_backbone"],
        "unfreeze_epoch": cfg["unfreeze_epoch"],
        "lr_head": cfg["lr_head"],
        "lr_backbone": cfg["lr_backbone"],
        "Final Val Acc (%)": round(100*out["final_val_acc"], 2),
        "Time / epoch (s)": round(out["time_per_epoch"], 1),
    })

res_df = pd.DataFrame(rows)
print("\nResults:")
print(res_df.to_string(index=False))



=== Feature Extract only ===
Ep 01 | val_acc=0.603 | time=135.3s | LRs=[0.001]
Ep 02 | val_acc=0.704 | time=142.9s | LRs=[0.001]
Ep 03 | val_acc=0.741 | time=144.3s | LRs=[0.001]

=== Early Unfreeze (ep=1) ===
>>> Unfroze backbone at epoch 1
Ep 01 | val_acc=0.654 | time=161.0s | LRs=[0.0001, 0.001]
Ep 02 | val_acc=0.850 | time=387.3s | LRs=[0.0001, 0.001]
Ep 03 | val_acc=0.856 | time=400.7s | LRs=[0.0001, 0.001]

=== Fine-tune from start ===
Ep 01 | val_acc=0.829 | time=356.3s | LRs=[0.0001, 0.001]
Ep 02 | val_acc=0.865 | time=487.0s | LRs=[0.0001, 0.001]
Ep 03 | val_acc=0.869 | time=356.9s | LRs=[0.0001, 0.001]

Results:
                Setup  freeze_backbone  unfreeze_epoch  lr_head  lr_backbone  Final Val Acc (%)  Time / epoch (s)
 Feature Extract only             True             NaN    0.001       0.0000              74.12             140.8
Early Unfreeze (ep=1)             True             1.0    0.001       0.0001              85.62             316.4
 Fine-tune from start      

Notes / Key Takeaways

Weighted loss is key when classes are imbalanced.

Start with frozen backbone → warm up the classifier.

Gradually unfreeze with lower LR for stability.

Save checkpoints on best val metrics.

Fine-tuning is about control: freeze/unfreeze schedule, LR, checkpoints.

Reflection

Why should LR usually be smaller when unfreezing pretrained layers?

Why is checkpointing critical in real training?

1) Why should LR usually be smaller when unfreezing pretrained layers?

Pretrained layers already contain useful generic features (edges, textures, shapes).

A high learning rate could overwrite these weights too aggressively, causing catastrophic forgetting of the knowledge from ImageNet.

A smaller LR lets the backbone adapt gently to the new dataset while keeping most of its useful features intact.

2) Why is checkpointing critical in real training?

Training can be long and unstable; later epochs may overfit or even diverge.

If a crash or interruption happens, checkpointing prevents losing hours of progress.

Most importantly, it ensures you can always reload the best-performing model on validation data, not just the final epoch.

This makes experiments reproducible, safe, and production-ready.