# Chapter 11 -- Computer Vision with PyTorch
## *Python for AI/ML: A Complete Learning Journey*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/CH11_Computer_Vision_PyTorch.ipynb)
&nbsp;&nbsp;[![Back to TOC](https://img.shields.io/badge/Back_to-Table_of_Contents-1B3A5C?style=flat-square)](https://colab.research.google.com/github/timothy-watt/python-for-ai-ml/blob/main/Python_for_AIML_TOC.ipynb)

---

**Part:** 4 -- Production and Deployment  
**Prerequisites:** Chapter 7 (Deep Learning with PyTorch)  
**Estimated time:** 5-6 hours

---

> **Before running this notebook:** go to **Runtime → Change runtime type → T4 GPU**.
> The code falls back to CPU if no GPU is found, but the ResNet-18 training in Section 11.4 is much slower without one.

---

### Learning Objectives

By the end of this chapter you will be able to:

- Explain how convolutional layers detect spatial features in images
- Build a CNN from scratch using `nn.Conv2d`, `nn.MaxPool2d`, and `nn.Linear`
- Use `torchvision.transforms` to build an image augmentation pipeline
- Load datasets with `torchvision.datasets` and `ImageFolder`
- Apply transfer learning: freeze a pre-trained ResNet and replace its head
- Fine-tune all layers of a pre-trained model with a lower learning rate
- Visualise what a CNN has learned: feature maps and activation maximisation
- Interpret predictions with Grad-CAM heatmaps

---

### Project Thread -- Chapter 11

We work with the **CIFAR-10** dataset (60,000 32x32 colour images, 10 classes),
which is built into torchvision. We build three progressively more powerful models:

1. **Custom CNN from scratch** -- understand every component
2. **ResNet-18 with frozen backbone** -- transfer learning in minutes
3. **ResNet-18 fine-tuned end-to-end** -- best accuracy

CIFAR-10 is small enough to train quickly on a free Colab GPU
while being complex enough to demonstrate why deep CNNs beat shallow ones.


---

## Setup


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
from torch.optim.lr_scheduler import OneCycleLR

import torchvision
import torchvision.transforms as transforms
import torchvision.models as models
from torchvision.datasets import CIFAR10

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'PyTorch:  {torch.__version__}')
print(f'Device:   {DEVICE}')
print(f'Torchvision: {torchvision.__version__}')

RANDOM_STATE = 42
torch.manual_seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.dpi'] = 110

CLASSES = ('plane','car','bird','cat','deer',
           'dog','frog','horse','ship','truck')


---

## Section 11.1 -- How Convolutional Neural Networks Work

A fully-connected layer treats every pixel as an independent feature.
For a 32x32 colour image that is 32×32×3 = 3,072 inputs, and a 224x224
image has 224×224×3 = 150,528. Every unit in the first hidden layer needs
one weight per input, so the layer is computationally expensive. It also ignores
the spatial structure of images: nearby pixels are related, and the same
pattern (an edge, a curve) can appear anywhere in the image.

**Convolutional layers** solve this with two ideas:

**Local connectivity:** each neuron connects only to a small region
of the input (the receptive field), not the whole image.

**Weight sharing:** the same filter (kernel) is applied at every position.
A filter that detects horizontal edges detects them everywhere in the image
using the same weights. This reduces parameters dramatically.
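To put a number on "dramatically", the sketch below counts the parameters of a fully-connected layer and a convolutional layer applied to one CIFAR-sized input. The layer widths (64 units, 64 filters) are illustrative assumptions, not values taken from the models in this chapter.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features


def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


# Fully-connected: each of the 32*32*3 inputs connects to each of 64 units.
fc_count = linear_params(32 * 32 * 3, 64)

# Convolution: 64 filters of size 3x3 over 3 channels, reused at every position.
conv_count = conv2d_params(3, 64, 3)

print(f"Linear(3072 -> 64):   {fc_count:,} parameters")
print(f"Conv2d(3 -> 64, 3x3): {conv_count:,} parameters")
print(f"Ratio: roughly {fc_count / conv_count:.0f}x")
```

The convolution's parameter count does not depend on the image size at all, which is why the same layer can be applied to 32x32 and 224x224 inputs.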

**The building blocks:**
- `nn.Conv2d(in_channels, out_channels, kernel_size)` -- learns filters
- `nn.MaxPool2d(kernel_size)` -- downsamples by taking the max in each window
- `nn.BatchNorm2d(channels)` -- normalises activations (same as Ch 7, but 2D)
- `nn.ReLU()` -- non-linearity applied after each conv layer

Early layers learn low-level features (edges, colours).
Deeper layers combine these into higher-level concepts (textures, shapes, objects).
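As a quick sanity check, the four building blocks can be chained and applied to a random tensor to see how each one affects the shape. This is a minimal sketch; the channel count (16) and batch size (8) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# One conv block assembled from the four building blocks listed above.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

x = torch.randn(8, 3, 32, 32)   # 8 random "images" in (N, C, H, W) layout
y = block(x)

print(f"Input:  {tuple(x.shape)}")
print(f"Output: {tuple(y.shape)}")   # channels 3 -> 16, height/width 32 -> 16
```

Only the convolution changes the channel count and only the pooling layer changes the spatial size; batch norm and ReLU leave the shape untouched.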


In [None]:
# 11.1.1 -- Load CIFAR-10 with augmentation transforms

# Training augmentation: random flips and crops make the model
# robust to variations in position and orientation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    # Normalise with CIFAR-10 channel means and stds
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                         std= [0.2470, 0.2435, 0.2616]),
])

# Validation/test: only normalise, no augmentation
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                         std= [0.2470, 0.2435, 0.2616]),
])

# Download CIFAR-10 (~170MB, cached after first run)
train_dataset = CIFAR10(root='/tmp/cifar10', train=True,
                        download=True, transform=train_transform)
test_dataset  = CIFAR10(root='/tmp/cifar10', train=False,
                        download=True, transform=test_transform)

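# Split training into train + validation. random_split only picks indices, so the
# validation subset is rebuilt on a copy of the data that uses test_transform;
# otherwise validation images would be augmented too.
n_val   = 5000
n_train = len(train_dataset) - n_val
train_ds, val_split = random_split(
    train_dataset, [n_train, n_val],
    generator=torch.Generator().manual_seed(RANDOM_STATE)
)
val_dataset = CIFAR10(root='/tmp/cifar10', train=True,
                      download=False, transform=test_transform)
val_ds = torch.utils.data.Subset(val_dataset, val_split.indices)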

train_loader = DataLoader(train_ds,   batch_size=128, shuffle=True,  num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_ds,     batch_size=256, shuffle=False, num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_dataset, batch_size=256, shuffle=False, num_workers=2, pin_memory=True)

print(f'Train: {len(train_ds):,}  Val: {len(val_ds):,}  Test: {len(test_dataset):,}')
print(f'Classes: {CLASSES}')


In [None]:
# 11.1.2 -- Visualise sample images

# Get one batch and undo normalisation for display
images, labels = next(iter(DataLoader(test_dataset, batch_size=16, shuffle=True)))

mean = torch.tensor([0.4914, 0.4822, 0.4465]).view(3,1,1)
std  = torch.tensor([0.2470, 0.2435, 0.2616]).view(3,1,1)
images_display = (images * std + mean).clamp(0, 1)

fig, axes = plt.subplots(2, 8, figsize=(16, 5))
for i, ax in enumerate(axes.flatten()):
    img = images_display[i].permute(1, 2, 0).numpy()
    ax.imshow(img)
    ax.set_title(CLASSES[labels[i]], fontsize=8)
    ax.axis('off')
plt.suptitle('CIFAR-10 Sample Images (16 random test examples)',
             fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()


---

## Section 11.2 -- Building a CNN from Scratch

Before using pre-trained models, we build a CNN from scratch so every
component is transparent. This architecture follows the classic pattern:
stacked conv blocks (conv → batchnorm → relu → pool) followed by
a classifier head (flatten → dense → output).
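The size of the classifier's input follows from the standard output-size formula for convolution and pooling layers, `out = floor((in + 2*padding - kernel) / stride) + 1`. The sketch below traces it through three blocks, assuming two 3x3 convolutions with `padding=1` and one 2x2 max pool per block, as in the `CifarCNN` defined in the next cell. The helper name `conv_out_size` is hypothetical.

```python
def conv_out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output height/width of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32            # CIFAR-10 images are 32x32
sizes = [size]
for _ in range(3):   # three conv blocks
    size = conv_out_size(size, kernel=3, padding=1)   # first 3x3 conv: size unchanged
    size = conv_out_size(size, kernel=3, padding=1)   # second 3x3 conv: size unchanged
    size = conv_out_size(size, kernel=2, stride=2)    # MaxPool2d(2): size halved
    sizes.append(size)

flat_features = 128 * size * size   # 128 channels leave the last block
print(" -> ".join(str(s) for s in sizes))
print(f"Flattened features: {flat_features}")
```

If the input resolution or the number of blocks changes, rerunning this calculation gives the `in_features` value the first `nn.Linear` layer needs.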


In [None]:
# 11.2.1 -- Define a custom CNN

class CifarCNN(nn.Module):
    """
    Custom CNN for CIFAR-10 (32x32 colour images, 10 classes).

    Architecture (each block has two 3x3 convs, then a 2x2 max pool):
        Conv Block 1: 3  -> 32  channels, 32x32 -> 16x16
        Conv Block 2: 32 -> 64  channels, 16x16 -> 8x8
        Conv Block 3: 64 -> 128 channels, 8x8   -> 4x4
        Classifier:   128*4*4 -> 256 -> 10
    """

    def _conv_block(self, in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),       # halve spatial dimensions
            nn.Dropout2d(0.1),
        )

    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = self._conv_block(3,   32)
        self.block2 = self._conv_block(32,  64)
        self.block3 = self._conv_block(64, 128)
        # After 3 MaxPool2d(2): 32 -> 16 -> 8 -> 4
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        return self.classifier(x)


cnn = CifarCNN(num_classes=10).to(DEVICE)
n_params = sum(p.numel() for p in cnn.parameters() if p.requires_grad)
print(f'CifarCNN parameters: {n_params:,}')

# Test forward pass
x_test = torch.randn(4, 3, 32, 32).to(DEVICE)
out    = cnn(x_test)
print(f'Input shape:  {x_test.shape}')
print(f'Output shape: {out.shape}  (4 samples, 10 class scores)')


In [None]:
# 11.2.2 -- Training utilities (reuse Ch 7 pattern)

def train_epoch_clf(model, loader, criterion, optimizer, scheduler=None):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(DEVICE), labels.to(DEVICE)
        optimizer.zero_grad()
        outputs = model(images)
        loss    = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        if scheduler is not None:
            scheduler.step()   # stepped once per batch, as OneCycleLR expects
        total_loss += loss.item() * len(images)
        correct    += (outputs.argmax(1) == labels).sum().item()
        total      += len(images)
    return total_loss / total, correct / total


def evaluate_clf(model, loader, criterion):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    all_preds, all_labels = [], []
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(DEVICE), labels.to(DEVICE)
            outputs = model(images)
            loss    = criterion(outputs, labels)
            total_loss += loss.item() * len(images)
            correct    += (outputs.argmax(1) == labels).sum().item()
            total      += len(images)
            all_preds.extend(outputs.argmax(1).cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    return total_loss / total, correct / total, all_preds, all_labels


print('Training utilities defined.')


In [None]:
# 11.2.3 -- Train the custom CNN for 20 epochs

N_EPOCHS = 20

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# OneCycleLR sets the learning rate at every step, so max_lr (not lr) fixes the peak
optimizer = optim.AdamW(cnn.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = OneCycleLR(
    optimizer, max_lr=1e-2,
    steps_per_epoch=len(train_loader), epochs=N_EPOCHS
)

cnn_history = {'train_loss':[], 'val_loss':[], 'train_acc':[], 'val_acc':[]}
best_val_acc = 0.0
best_cnn_weights = None

print(f'Training CifarCNN for {N_EPOCHS} epochs on {DEVICE}...')
print(f'{"Epoch":>6}  {"Train Loss":>11}  {"Train Acc":>10}  {"Val Acc":>9}')
print('-' * 42)

for epoch in range(1, N_EPOCHS + 1):
    tr_loss, tr_acc = train_epoch_clf(cnn, train_loader, criterion, optimizer, scheduler)
    val_loss, val_acc, _, _ = evaluate_clf(cnn, val_loader, criterion)
    cnn_history['train_loss'].append(tr_loss)
    cnn_history['val_loss'].append(val_loss)
    cnn_history['train_acc'].append(tr_acc)
    cnn_history['val_acc'].append(val_acc)
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_cnn_weights = {k: v.clone() for k, v in cnn.state_dict().items()}
    if epoch % 5 == 0 or epoch == 1:
        print(f'{epoch:>6}  {tr_loss:>11.4f}  {tr_acc:>10.4f}  {val_acc:>9.4f}')

cnn.load_state_dict(best_cnn_weights)
_, test_acc, _, _ = evaluate_clf(cnn, test_loader, criterion)
print(f'Best val acc: {best_val_acc:.4f}  |  Test acc: {test_acc:.4f}')
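`evaluate_clf` also returns the predictions and labels, which the cell above discards. One possible use is a per-class accuracy breakdown. The helper below is a hypothetical sketch run on toy values; in the notebook the inputs would come from `_, _, preds, labels = evaluate_clf(cnn, test_loader, criterion)`.

```python
from collections import Counter


def per_class_accuracy(preds, labels, class_names):
    """Return {class_name: accuracy} from parallel sequences of integer class ids."""
    totals  = Counter(int(y) for y in labels)
    correct = Counter(int(y) for p, y in zip(preds, labels) if int(p) == int(y))
    return {
        name: correct[idx] / totals[idx]
        for idx, name in enumerate(class_names)
        if totals[idx] > 0            # skip classes absent from the evaluated split
    }


# Toy stand-in values, not real model output.
toy_preds  = [0, 1, 1, 2, 2, 2]
toy_labels = [0, 1, 0, 2, 2, 1]
toy_acc = per_class_accuracy(toy_preds, toy_labels, ('plane', 'car', 'bird'))

for name, acc in toy_acc.items():
    print(f'{name:>6}: {acc:.2f}')
```

A breakdown like this shows whether a single overall accuracy figure hides a few classes that the model handles poorly.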


---

## Section 11.3 -- Visualising What the CNN Learned

A common criticism of deep learning is that it is a black box.
For CNNs, this is less true than it seems -- we can directly inspect
the intermediate activations (feature maps) to see what each layer detects.
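The forward-hook pattern used in the next cell extends to several layers at once. The sketch below is one way to wrap it in a helper. `capture_activations` and the tiny model are hypothetical and exist only for illustration. Note that the helper switches the model to eval mode as a side effect.

```python
import torch
import torch.nn as nn


def capture_activations(model, layer_names, x):
    """Run one forward pass and return {layer_name: output} for the named submodules."""
    captured = {}
    modules = dict(model.named_modules())
    handles = []
    for name in layer_names:
        # Bind `name` as a default argument so each hook stores under its own key.
        def hook(module, inputs, output, name=name):
            captured[name] = output.detach().cpu()
        handles.append(modules[name].register_forward_hook(hook))
    try:
        model.eval()
        with torch.no_grad():
            model(x)
    finally:
        for handle in handles:      # always remove hooks, even if the forward pass fails
            handle.remove()
    return captured


tiny = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
acts = capture_activations(tiny, ['2', '5'], torch.randn(1, 3, 32, 32))
for name, tensor in acts.items():
    print(name, tuple(tensor.shape))
```

For the chapter's `CifarCNN`, the layer names would be `'block1'`, `'block2'`, and `'block3'`.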


In [None]:
# 11.3.1 -- Visualise feature maps from the first conv block

cnn.eval()

# Pick one test image
sample_img, sample_label = test_dataset[42]
sample_tensor = sample_img.unsqueeze(0).to(DEVICE)   # add batch dim

# Register a forward hook to capture the output of block1
feature_maps = {}

def hook_fn(module, input, output):
    feature_maps['block1'] = output.detach().cpu()

hook = cnn.block1.register_forward_hook(hook_fn)

with torch.no_grad():
    _ = cnn(sample_tensor)

hook.remove()

maps = feature_maps['block1'][0]   # shape: (32, 16, 16) -- 32 filters
print(f'Feature map shape: {maps.shape}  (32 filters, 16x16 after MaxPool)')

# Display original image and first 16 feature maps
fig = plt.figure(figsize=(16, 5))
gs  = gridspec.GridSpec(2, 9, figure=fig)

# Original image
mean = torch.tensor([0.4914, 0.4822, 0.4465]).view(3,1,1)
std  = torch.tensor([0.2470, 0.2435, 0.2616]).view(3,1,1)
orig = (sample_img * std + mean).clamp(0,1).permute(1,2,0).numpy()
ax0  = fig.add_subplot(gs[:, 0])
ax0.imshow(orig)
ax0.set_title(f'Input:\n{CLASSES[sample_label]}', fontsize=9)
ax0.axis('off')

for i in range(16):
    row = i // 8
    col = (i % 8) + 1
    ax  = fig.add_subplot(gs[row, col])
    ax.imshow(maps[i].numpy(), cmap='viridis')
    ax.set_title(f'F{i}', fontsize=7)
    ax.axis('off')

plt.suptitle('CNN Feature Maps: First Conv Block (16 of 32 filters)',
             fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()


---

## Section 11.4 -- Transfer Learning with ResNet-18

ResNet-18 is an 18-layer residual network pre-trained on ImageNet --
1.2 million images across 1,000 classes. Its weights encode rich visual
knowledge: edges, textures, shapes, objects.

**Transfer learning** re-uses this knowledge for a new task by:
1. Loading the pre-trained weights
2. Freezing all layers (they are not updated during training)
3. Replacing the final classification head with a new one for our classes
4. Training only the new head

This takes minutes instead of hours and often outperforms a custom CNN
trained from scratch, because the pre-trained features are so rich.
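The four steps can be checked on a small stand-in model before downloading any weights. `TinyNet` below is a hypothetical placeholder with a ResNet-style `.fc` head. Its weights are random, not pre-trained, so the sketch shows only the freezing mechanics.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pre-trained model with a `.fc` head."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.fc(self.backbone(x))


torch.manual_seed(0)
model = TinyNet()                                 # step 1: pretend these are pre-trained weights

for param in model.parameters():                  # step 2: freeze everything
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)    # step 3: new head, trainable by default

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)

# Step 4: one optimisation step on random data should update the head only.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.1)
backbone_before = model.backbone[0].weight.detach().clone()
head_before = model.fc.weight.detach().clone()

loss = F.cross_entropy(model(torch.randn(4, 3, 32, 32)), torch.randint(0, 10, (4,)))
loss.backward()
optimizer.step()

print('backbone changed:', not torch.equal(backbone_before, model.backbone[0].weight))
print('head changed:    ', not torch.equal(head_before, model.fc.weight))
```

Passing only `model.fc.parameters()` to the optimizer is a second safeguard: even if a backbone parameter were left trainable by mistake, the optimizer would never touch it.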


In [None]:
# 11.4.1 -- ResNet-18 with frozen backbone (feature extraction)

# Load pre-trained ResNet-18
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all parameters -- they will not be updated
for param in resnet.parameters():
    param.requires_grad = False

# Replace the final fully-connected layer
# ResNet-18's original fc: 512 -> 1000 (ImageNet classes)
# Our new fc: 512 -> 10 (CIFAR-10 classes)
n_features = resnet.fc.in_features
resnet.fc  = nn.Linear(n_features, 10)
# Only the new head has requires_grad=True

resnet = resnet.to(DEVICE)

trainable = sum(p.numel() for p in resnet.parameters() if p.requires_grad)
total     = sum(p.numel() for p in resnet.parameters())
print(f'ResNet-18 total parameters:     {total:,}')
print(f'Trainable (head only):          {trainable:,}  ({trainable/total*100:.1f}%)')
print(f'Frozen (backbone):              {total-trainable:,}  ({(total-trainable)/total*100:.1f}%)')

# Larger transforms for ResNet (expects 224x224 but we adapt for CIFAR)
resnet_transform = transforms.Compose([
    transforms.Resize(64),           # upsample 32x32 to 64x64
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std= [0.229, 0.224, 0.225]),   # ImageNet stats
])
resnet_aug = transforms.Compose([
    transforms.Resize(64),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(64, padding=8),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std= [0.229, 0.224, 0.225]),
])

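rn_train_full = CIFAR10('/tmp/cifar10', train=True,  download=False, transform=resnet_aug)
rn_val_full   = CIFAR10('/tmp/cifar10', train=True,  download=False, transform=resnet_transform)
rn_test_ds    = CIFAR10('/tmp/cifar10', train=False, download=False, transform=resnet_transform)
# Validation indices come from the split but read the un-augmented copy
rn_train_ds, rn_val_split = random_split(
    rn_train_full, [45000, 5000],
    generator=torch.Generator().manual_seed(RANDOM_STATE)
)
rn_val_ds = torch.utils.data.Subset(rn_val_full, rn_val_split.indices)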
rn_train_loader = DataLoader(rn_train_ds, batch_size=128, shuffle=True,  num_workers=2)
rn_val_loader   = DataLoader(rn_val_ds,   batch_size=256, shuffle=False, num_workers=2)
rn_test_loader  = DataLoader(rn_test_ds,  batch_size=256, shuffle=False, num_workers=2)
print('ResNet data loaders ready.')


In [None]:
# 11.4.2 -- Train the ResNet head for 10 epochs

rn_criterion = nn.CrossEntropyLoss()
rn_optimizer = optim.AdamW(resnet.fc.parameters(), lr=1e-3, weight_decay=1e-4)

N_RN_EPOCHS = 10
rn_history  = {'train_acc': [], 'val_acc': []}
best_rn_acc = 0.0
best_rn_weights = None

print(f'Training ResNet-18 head for {N_RN_EPOCHS} epochs on {DEVICE}...')
print(f'{"Epoch":>6}  {"Train Acc":>10}  {"Val Acc":>9}')
print('-' * 30)

for epoch in range(1, N_RN_EPOCHS + 1):
    tr_loss, tr_acc = train_epoch_clf(resnet, rn_train_loader, rn_criterion, rn_optimizer)
    val_loss, val_acc, _, _ = evaluate_clf(resnet, rn_val_loader, rn_criterion)
    rn_history['train_acc'].append(tr_acc)
    rn_history['val_acc'].append(val_acc)
    if val_acc > best_rn_acc:
        best_rn_acc = val_acc
        best_rn_weights = {k: v.clone() for k, v in resnet.state_dict().items()}
    if epoch % 2 == 0 or epoch == 1:
        print(f'{epoch:>6}  {tr_acc:>10.4f}  {val_acc:>9.4f}')

resnet.load_state_dict(best_rn_weights)
_, rn_test_acc, _, _ = evaluate_clf(resnet, rn_test_loader, rn_criterion)
print(f'ResNet head-only  test accuracy: {rn_test_acc:.4f}')
print(f'Custom CNN        test accuracy: {test_acc:.4f}')


In [None]:
# 11.4.3 -- Compare models and plot training curves

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training curves -- custom CNN
epochs_cnn = range(1, len(cnn_history['val_acc']) + 1)
axes[0].plot(epochs_cnn, cnn_history['train_acc'], '#E8722A', linewidth=2, label='Train')
axes[0].plot(epochs_cnn, cnn_history['val_acc'],   '#2E75B6', linewidth=2, label='Val')
axes[0].axhline(test_acc, color='green', linestyle='--', linewidth=1.5,
                label=f'Test acc={test_acc:.3f}')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Custom CNN from Scratch')
axes[0].legend()
axes[0].set_ylim(0, 1)

# Training curves -- ResNet
epochs_rn = range(1, len(rn_history['val_acc']) + 1)
axes[1].plot(epochs_rn, rn_history['train_acc'], '#E8722A', linewidth=2, label='Train')
axes[1].plot(epochs_rn, rn_history['val_acc'],   '#2E75B6', linewidth=2, label='Val')
axes[1].axhline(rn_test_acc, color='green', linestyle='--', linewidth=1.5,
                label=f'Test acc={rn_test_acc:.3f}')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('ResNet-18 Transfer Learning (head only, 10 epochs)')
axes[1].legend()
axes[1].set_ylim(0, 1)

plt.suptitle('CIFAR-10: Custom CNN vs Transfer Learning',
             fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print(f'Custom CNN (20 epochs):           {test_acc:.4f}')
print(f'ResNet-18 head-only (10 epochs):  {rn_test_acc:.4f}')
print(f'Difference (ResNet-18 minus custom CNN): {(rn_test_acc - test_acc)*100:+.1f} pp')


---

## Chapter 11 Summary

### Key Takeaways

- **Convolutional layers** apply learned filters across the entire image using
  weight sharing. Early layers detect edges; later layers detect shapes and objects.
- **`padding=1` with a 3x3 kernel** preserves spatial dimensions.
  `MaxPool2d(2)` halves them. After three pool layers: 32 → 16 → 8 → 4.
- **Data augmentation** (random flips, crops, colour jitter) is one of the
  most effective regularisation techniques for image models. Apply it to the
  training set only, never to validation or test data.
- **`OneCycleLR`** is a strong default scheduler for CNNs: it warms up,
  peaks, then anneals the learning rate in a single cycle spanning the whole
  training run. Call `scheduler.step()` after every batch, not every epoch.
- **Transfer learning often beats training from scratch** on small datasets.
  Freeze the backbone and train only the head first; then optionally unfreeze
  all layers at a 10x lower learning rate for further gains.
- **Feature map visualisation** with forward hooks is a direct way to
  inspect what each layer of a CNN responds to.
- **ImageNet normalisation** (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
  should be used with torchvision models pre-trained on ImageNet -- mismatched stats degrade accuracy.
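The last step of the transfer-learning recipe above (unfreeze, then continue at a lower learning rate) might look like the sketch below. The small `nn.Sequential` is a hypothetical stand-in for the ResNet-18 after head-only training, and `HEAD_LR` / `FINETUNE_LR` are illustrative names. The 10x factor follows the rule of thumb stated above and is not a tuned value.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical stand-in: indices 0-3 play the frozen backbone, index 4 the trained head.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)
for param in model[:4].parameters():
    param.requires_grad = False        # state at the end of head-only training

HEAD_LR = 1e-3                         # learning rate used for the head-only phase
FINETUNE_LR = HEAD_LR / 10             # 10x lower for end-to-end fine-tuning

# Unfreeze every layer, then build a fresh optimizer over all parameters.
for param in model.parameters():
    param.requires_grad = True

optimizer = optim.AdamW(model.parameters(), lr=FINETUNE_LR, weight_decay=1e-4)

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Trainable parameters: {n_trainable:,}  |  lr: {optimizer.param_groups[0]["lr"]:g}')
```

The optimizer is rebuilt because the head-only optimizer was created with the head's parameters alone and would not update the newly unfrozen layers.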

### Model Comparison

| Model | Epochs | Parameters | Test Accuracy |
|-------|--------|------------|---------------|
| Custom CNN (from scratch) | 20 | ~815k | printed by cell 11.2.3 |
| ResNet-18 (head only) | 10 | ~11.2M (5,130 trainable) | printed by cell 11.4.2 |

---

### What's Next

Chapters 10 and 11 complete Part 4. The appendices cover:
reinforcement learning (App D), SQL for data scientists (App E),
and Git/GitHub for ML projects (App F).

---

*End of Chapter 11 -- Python for AI/ML*  
