Week 8 · Day 4 — Transfer Learning: Feature Extract vs Fine-Tune
Why this matters

Training CNNs from scratch is expensive. With transfer learning, we reuse pretrained networks (e.g. ResNet18 on ImageNet). Sometimes we just extract features; other times we fine-tune the whole model. Knowing when to freeze vs unfreeze saves time and data.

Theory Essentials

Feature Extractor: freeze backbone conv layers, train only classifier head → fast, less data needed.

Fine-Tuning: unfreeze backbone too, update all weights → often more accurate with enough data and a suitable LR, but slower & risks overfitting.

ResNet18: common starting point; pretrained on ImageNet.

Trade-off: feature extraction = efficiency; fine-tuning = potential accuracy gains.

When: small dataset → feature extract; larger dataset/custom domain → fine-tune.

In [2]:
# Setup
import torch, torch.nn as nn, torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader, Subset

torch.manual_seed(42)
device = torch.device("cpu")  # CPU only

# ---------- Data (tiny CIFAR-10 subset for speed) ----------
SUB_TRAIN, SUB_TEST = 2000, 500
tf_train = transforms.Compose([
    transforms.Resize((224,224)),   # match the 224x224 ImageNet pretraining resolution
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
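    # NOTE: the ImageNet weights were trained with mean (0.485, 0.456, 0.406)
    # and std (0.229, 0.224, 0.225); the 0.5 values used here are a simplification.
    transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))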
])
tf_test = transforms.Compose([
    transforms.Resize((224,224)),
    transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))
])

train_full = datasets.CIFAR10("data", train=True, download=True, transform=tf_train)
test_full  = datasets.CIFAR10("data", train=False, download=True, transform=tf_test)
trainset = Subset(train_full, range(SUB_TRAIN))
testset  = Subset(test_full, range(SUB_TEST))

trainloader = DataLoader(trainset, batch_size=32, shuffle=True)
testloader  = DataLoader(testset,  batch_size=64)

# ---------- Model Builders ----------
def get_resnet18(feature_extract=True):
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
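    if feature_extract:
        # Freeze all pretrained weights; the new fc head below is created
        # afterwards, so it is trainable by default.
        for p in model.parameters():
            p.requires_grad = False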
    in_feats = model.fc.in_features
    model.fc = nn.Linear(in_feats, 10)   # CIFAR-10 has 10 classes
    return model

# ---------- Train/Eval ----------
def train_eval(model, epochs=1, lr=1e-3):
    model = model.to(device)
    crit = nn.CrossEntropyLoss()
    opt = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)

    for ep in range(epochs):
        model.train()
        for X,y in trainloader:
            X,y = X.to(device), y.to(device)
            opt.zero_grad()
            loss = crit(model(X), y)
            loss.backward(); opt.step()
        acc = evaluate(model, testloader)
        print(f"Epoch {ep+1}: val acc {acc:.3f}")
    return model

@torch.inference_mode()
def evaluate(model, loader):
    model.eval()
    correct,total=0,0
    for X,y in loader:
        X,y = X.to(device), y.to(device)
        preds = model(X).argmax(1)
        correct += (preds==y).sum().item()
        total += y.size(0)
    return correct/total

# ---------- Run ----------
print("\nFeature Extract mode:")
model_feat = get_resnet18(feature_extract=True)
train_eval(model_feat, epochs=1)

print("\nFull Fine-Tune mode:")
model_ft = get_resnet18(feature_extract=False)
train_eval(model_ft, epochs=1)



Feature Extract mode:
Epoch 1: val acc 0.570

Full Fine-Tune mode:
Epoch 1: val acc 0.542


ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

1) Core (10–15 min)

Task: Count how many parameters are trainable in feature extract vs fine-tune.

In [3]:
print("Trainable params (FE):", sum(p.numel() for p in model_feat.parameters() if p.requires_grad))
print("Trainable params (FT):", sum(p.numel() for p in model_ft.parameters() if p.requires_grad))

Trainable params (FE): 5130
Trainable params (FT): 11181642
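
The two counts above can be checked by hand. A minimal sketch, assuming the replaced head is a single `nn.Linear(512, 10)` (512 is ResNet18's `fc.in_features`; the variable names are illustrative):

```python
# Head: one fully connected layer mapping 512 features to 10 classes.
in_feats, num_classes = 512, 10
head_params = in_feats * num_classes + num_classes  # weights + biases

# Totals reported by the notebook output above.
trainable_fe = 5130
trainable_ft = 11181642

backbone_params = trainable_ft - head_params  # frozen in feature-extract mode
trainable_fraction = trainable_fe / trainable_ft

print(head_params)                  # 5130
print(backbone_params)              # 11176512
print(f"{trainable_fraction:.4%}")  # 0.0459%
```

So feature extraction updates fewer than 0.05% of the network's parameters.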


2) Practice (10–15 min)

Task: Train feature extractor for 3 epochs instead of 1. Did accuracy improve?

In [4]:
print("\nFeature Extract mode:")
model_feat = get_resnet18(feature_extract=True)
train_eval(model_feat, epochs=3)


Feature Extract mode:
Epoch 1: val acc 0.536
Epoch 2: val acc 0.678
Epoch 3: val acc 0.678



Yes. Accuracy rose from 0.536 after epoch 1 to 0.678 after epoch 2, then stayed at 0.678 in epoch 3.

3) Stretch (optional, 10–15 min)

Task: Try smaller LR (1e-4) in fine-tuning. Does it stabilize training?

In [5]:
print("\nFull Fine-Tune mode:")
model_ft = get_resnet18(feature_extract=False)
train_eval(model_ft, epochs=1, lr=1e-4)


Full Fine-Tune mode:
Epoch 1: val acc 0.760



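Accuracy improved substantially: 0.760 after one epoch at lr=1e-4, compared with 0.542 after one epoch at lr=1e-3 in the first run.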
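
A common variant of the small-LR idea, not used in this notebook, is to give the pretrained layers a smaller learning rate than the freshly initialised head by passing parameter groups to the optimizer. The sketch below uses a tiny hypothetical `backbone`/`head` pair as a stand-in for ResNet18 so it runs without downloading weights; the two learning rates are assumptions chosen to mirror the values tried above.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in for a pretrained network: a small "backbone" plus a "head".
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(8, 10)
model = nn.Sequential(backbone, head)

# One optimizer, two parameter groups with different learning rates.
opt = optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-4},  # gentle updates for pretrained layers
    {"params": head.parameters(), "lr": 1e-3},      # larger steps for the new head
])

group_lrs = [g["lr"] for g in opt.param_groups]
print(group_lrs)  # [0.0001, 0.001]
```

With the notebook's model, the same pattern would split `model.fc.parameters()` from the remaining parameters; whether that beats a single lr=1e-4 on this subset has not been measured here.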

Mini-Challenge (≤40 min, CPU-friendly)

Task:

Run both configs (feature extract vs fine-tune) for 3 epochs.

Record: training time & val accuracy.

Fill in a table
| Mode            | Trainable Params | Val Acc | Time/epoch |
| --------------- | ---------------- | ------- | ---------- |
| Feature Extract | …                | …       | …          |
| Fine-Tune       | …                | …       | …          |

Acceptance Criteria:

Completed table with numbers.

2–3 lines explaining: which was faster, which was more accurate, when you’d pick each.

In [6]:
import time



t0 = time.time()
print("\nFeature Extract mode:")
model_feat = get_resnet18(feature_extract=True)
train_eval(model_feat, epochs=3)
print("Time taken:", time.time() - t0)

t0 = time.time()
print("\nFull Fine-Tune mode:")
model_ft = get_resnet18(feature_extract=False)
train_eval(model_ft, epochs=3)
print("Time taken:", time.time() - t0)


Feature Extract mode:
Epoch 1: val acc 0.612
Epoch 2: val acc 0.680
Epoch 3: val acc 0.688
Time taken: 305.12708258628845

Full Fine-Tune mode:
Epoch 1: val acc 0.552
Epoch 2: val acc 0.606
Epoch 3: val acc 0.614
Time taken: 686.5507521629333


| Mode            | Trainable Params | Val Acc | Time (3 epochs) |
| --------------- | ---------------- | ------- | --------------- |
| Feature Extract | 5130             | 0.688   | 305s            |
| Fine-Tune       | 11181642         | 0.614   | 687s            |
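
The task asks for time per epoch, while the run above reports totals for 3 epochs. A minimal sketch of the conversion, using the printed totals (which also include model construction and the per-epoch evaluation pass, so the per-epoch figures are approximate):

```python
# Wall-clock totals printed by the run above, in seconds, for 3 epochs each.
epochs = 3
total_fe = 305.12708258628845
total_ft = 686.5507521629333

per_epoch_fe = total_fe / epochs
per_epoch_ft = total_ft / epochs
slowdown = total_ft / total_fe

print(f"Feature Extract: {per_epoch_fe:.1f} s/epoch")  # 101.7
print(f"Fine-Tune:       {per_epoch_ft:.1f} s/epoch")  # 228.9
print(f"Fine-tune is {slowdown:.2f}x slower")          # 2.25
```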


Notes / Key Takeaways

Feature Extract = fast, few params, works well on small datasets.

Fine-Tune = more compute, more flexible, but can overfit.

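Freezing the pretrained backbone is a sensible default when data is limited, though not a strict rule: in this notebook, fine-tuning with lr=1e-4 reached 0.760 after one epoch on the same 2,000-image subset.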

Small LR helps when unfreezing pretrained layers.

Transfer learning is often the quickest route to strong results in vision tasks.

Reflection

Why does feature extraction need fewer parameters to update?

In what scenarios would full fine-tuning clearly outperform?

1) Why does feature extraction need fewer parameters to update?

In feature extraction, we freeze the pretrained backbone (all convolutional layers). Only the final classifier head is trained.

That means about 11.2 million parameters stay fixed, and only the 5,130 in the last layer are updated.

This is faster, uses less memory, and needs less data, because most weights already capture useful features (edges, textures, shapes).
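
One caveat worth checking: `requires_grad = False` stops gradient updates, but it does not stop BatchNorm layers from updating their running statistics while the module is in `train()` mode. The sketch below illustrates this on a standalone `nn.BatchNorm2d` layer (a stand-in, not the notebook's ResNet18); whether this affects the accuracies reported above was not measured.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(4)
for p in bn.parameters():
    p.requires_grad = False  # "frozen" in the requires_grad sense

x = torch.randn(8, 4, 5, 5) + 3.0  # batch whose mean is far from the initial running_mean (0)

# Train mode: running statistics are still updated by a forward pass.
bn.train()
before = bn.running_mean.clone()
with torch.no_grad():
    bn(x)
changed_in_train = not torch.equal(before, bn.running_mean)

# Eval mode: running statistics are left untouched.
bn.eval()
before = bn.running_mean.clone()
with torch.no_grad():
    bn(x)
changed_in_eval = not torch.equal(before, bn.running_mean)

print(changed_in_train, changed_in_eval)  # True False
```

If fully fixed backbone behaviour is wanted, one option is to call `.eval()` on the frozen BatchNorm modules after `model.train()`.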

2) In what scenarios would full fine-tuning clearly outperform?

When the new dataset is very different from the original pretraining dataset (e.g. medical scans vs. ImageNet photos).

When you have a large labeled dataset, so you can safely update all parameters without overfitting.

When small improvements matter (e.g. production systems, competitions), because fine-tuning can squeeze out extra performance.