# Dog Breed Classification - CNN From Scratch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/)
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/)

---

## Project Overview

In this notebook, we build a **Convolutional Neural Network (CNN) from scratch** using PyTorch to classify dog (and cat) breeds from the **Oxford-IIIT Pet Dataset** (37 categories, ~7,400 images). The goal is to deeply understand CNN mechanics — convolution, pooling, batch normalization, dropout — by designing every layer ourselves, without relying on pretrained weights.

**What you'll learn:**
- How convolutional neural networks process images
- Data augmentation and normalization techniques
- Handling class imbalance with weighted sampling
- Training, validation, and evaluation pipelines in PyTorch
- Interpreting metrics: accuracy, top-5 accuracy, confusion matrix, classification report

---

## Table of Contents

1. [Imports](#1-imports)
2. [Constants & Device Setup](#2-constants--device-setup)
3. [Data Transforms](#3-data-transforms)
4. [Dataset & DataLoader](#4-dataset--dataloader)
5. [Visualize Data](#5-visualize-data)
6. [Model Architecture](#6-model-architecture)
7. [Display Model Summary](#7-display-model-summary)
8. [Loss, Optimizer & Scheduler](#8-loss-optimizer--scheduler)
9. [Training & Validation Functions](#9-training--validation-functions)
10. [Training Loop](#10-training-loop)
11. [Test & Evaluation](#11-test--evaluation)
12. [Save Model](#12-save-model)
13. [Bonus: Stanford Dogs Extension](#13-bonus-stanford-dogs-extension)

In [None]:
# Environment setup — install required packages (silent for Colab/Kaggle)
!pip install torchinfo -q

---

## 1. Imports

### Why These Libraries?

| Library | Purpose |
|---|---|
| **torch** | Core deep learning framework — tensors, autograd, neural network modules |
| **torch.nn** | Pre-built neural network layers (Conv2d, Linear, BatchNorm, etc.) |
| **torch.nn.functional** | Stateless functions (activation functions, loss functions) |
| **torch.optim** | Optimization algorithms (SGD, Adam, learning rate schedulers) |
| **torchvision** | Computer vision utilities — datasets, transforms, pretrained models |
| **torchinfo** | Model summary — parameter counts, layer shapes, memory usage |
| **tqdm** | Progress bars for training loops |
| **sklearn.metrics** | Classification report, confusion matrix, top-k accuracy |
| **matplotlib/seaborn** | Visualization — training curves, confusion matrices, sample images |

**PyTorch Ecosystem Overview:**
```
PyTorch Ecosystem
├── torch           → Tensors, autograd, device management
│   ├── torch.nn    → Layer definitions (Conv2d, Linear, ...)
│   ├── torch.optim → Optimizers (Adam, SGD, schedulers)
│   └── torch.utils.data → Dataset, DataLoader, samplers
├── torchvision     → CV-specific tools
│   ├── datasets    → Built-in datasets (ImageNet, CIFAR, Oxford Pet)
│   ├── transforms  → Image preprocessing & augmentation
│   └── models      → Pretrained architectures
└── torchinfo       → Model introspection & summaries
```

In [None]:
# ============================================================
# 1. IMPORTS
# ============================================================

# --- PyTorch core ---
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# --- Torchvision ---
import torchvision
import torchvision.transforms as transforms

# --- Model summary ---
from torchinfo import summary

# --- Data utilities ---
from torch.utils.data import Dataset, DataLoader, random_split, WeightedRandomSampler, Subset

# --- Progress bars ---
from tqdm.notebook import tqdm

# --- Standard libraries ---
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# --- Metrics ---
from sklearn.metrics import classification_report, confusion_matrix, top_k_accuracy_score

print(f"PyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")

---

## 2. Constants & Device Setup

### Why Hyperparameters Matter

Hyperparameters are the "knobs" we tune to control how the model learns. Unlike model parameters (weights/biases) which are learned during training, hyperparameters are set **before** training begins.

| Hyperparameter | Our Value | Why? |
|---|---|---|
| `EPOCHS` | 25 | Upper bound on passes over the training set; early stopping may end training sooner |
| `LEARNING_RATE` | 1e-3 | Standard starting point for Adam optimizer |
| `BATCH_SIZE` | 32 | Balances memory usage and gradient noise |
| `IMG_SIZE` | 128 | Smaller than the typical 224 to keep compute and activation memory manageable; with global average pooling, the parameter count does not depend on image size |
| `SEED` | 42 | Fixed seed for reproducibility across runs |

### GPU vs CPU

Deep learning is dominated by large matrix multiplications. GPUs have thousands of cores optimized for parallel computation, which often makes training **10-100x faster** than on a CPU, depending on the model and hardware. We check which backends are available and build a `torch.device` for the best one:

```
Priority: CUDA (NVIDIA GPU) → MPS (Apple Silicon) → CPU
```

### Reproducibility

Setting random seeds makes runs repeatable. We seed NumPy and PyTorch (both CPU and CUDA) and put cuDNN in deterministic mode; some GPU operations can still be nondeterministic, so small run-to-run differences remain possible.

In [None]:
# ============================================================
# 2. CONSTANTS & DEVICE SETUP
# ============================================================

# --- Hyperparameters ---
EPOCHS = 25
LEARNING_RATE = 1e-3
BATCH_SIZE = 32
IMG_SIZE = 128
NUM_CLASSES = 37
SEED = 42
EARLY_STOPPING_PATIENCE = 5

# --- Reproducibility ---
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
np.random.seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# --- Device selection ---
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
else:
    DEVICE = torch.device("cpu")

print(f"Using device: {DEVICE}")
if DEVICE.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

---

## 3. Data Transforms

### Why Normalize?

Raw pixel values range from 0 to 255, and `ToTensor` rescales them to the range 0 to 1. Neural networks train faster and more stably when inputs are centered around 0 with roughly unit variance, so we then normalize each channel:

$$x_{\text{normalized}} = \frac{x - \mu}{\sigma}$$

where $\mu$ (mean) and $\sigma$ (std) are computed per channel (R, G, B). We use ImageNet statistics as a reasonable default even for custom CNNs:
- **Mean:** [0.485, 0.456, 0.406]
- **Std:** [0.229, 0.224, 0.225]
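
To make the formula concrete, here is a small plain-Python sketch that normalizes a single RGB pixel. It assumes the pixel is first scaled from 0-255 to 0-1 (the job `ToTensor` does in the pipeline); the helper names and the pixel value are made up for illustration.

```python
# Per-channel normalization of one RGB pixel (illustrative values only).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb_uint8, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale 0-255 values to 0-1, then apply (x - mean) / std per channel."""
    scaled = [v / 255.0 for v in rgb_uint8]
    return [(x - m) / s for x, m, s in zip(scaled, mean, std)]

def denormalize_pixel(normalized, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Invert the normalization: x = x_norm * std + mean (still on the 0-1 scale)."""
    return [x * s + m for x, m, s in zip(normalized, mean, std)]

pixel = (124, 116, 104)  # hypothetical pixel close to the ImageNet mean colour
normalized = normalize_pixel(pixel)
print([round(v, 3) for v in normalized])            # all three values land near 0
print([round(v, 3) for v in denormalize_pixel(normalized)])
```

A pixel near the dataset mean maps to values near zero, while pure white ends up a bit above 2 in every channel. The inverse function is the same arithmetic used later to display images.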

### Data Augmentation

Augmentation artificially increases training set diversity by applying random transformations. This acts as a **regularizer** — the model sees slightly different versions of each image every epoch, preventing it from memorizing specific pixel patterns.

| Augmentation | What It Does | Why It Helps |
|---|---|---|
| `RandomHorizontalFlip` | Mirrors image left-right with 50% probability | Dogs can face either direction |
| `RandomRotation(20)` | Rotates up to ±20 degrees | Photos aren't always perfectly level |
| `ColorJitter` | Randomly adjusts brightness, contrast, saturation, and hue | Handles varying lighting conditions |
| `RandomResizedCrop` | Crops a random portion and resizes | Forces model to recognize partial views |

**Important:** Augmentation is applied **only during training**. Validation and test data use deterministic transforms (resize + center crop) so evaluation is consistent.

```
Training Pipeline:              Validation/Test Pipeline:
┌───────────────────┐           ┌─────────────┐
│      Resize       │           │   Resize    │
│     (144x144)     │           │  (144x144)  │
├───────────────────┤           ├─────────────┤
│    RandomFlip     │           │ CenterCrop  │
├───────────────────┤           │  (128x128)  │
│     Rotation      │           ├─────────────┤
├───────────────────┤           │  ToTensor   │
│    ColorJitter    │           ├─────────────┤
├───────────────────┤           │  Normalize  │
│ RandomResizedCrop │           └─────────────┘
│     (128x128)     │
├───────────────────┤
│     ToTensor      │
├───────────────────┤
│     Normalize     │
└───────────────────┘
```

In [None]:
# ============================================================
# 3. DATA TRANSFORMS
# ============================================================

# ImageNet normalization statistics
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training transforms — with augmentation
train_transforms = transforms.Compose([
    transforms.Resize((int(IMG_SIZE * 1.125), int(IMG_SIZE * 1.125))),  # 144x144
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=20),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomResizedCrop(IMG_SIZE, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

# Validation/Test transforms — deterministic, no augmentation
val_test_transforms = transforms.Compose([
    transforms.Resize((int(IMG_SIZE * 1.125), int(IMG_SIZE * 1.125))),  # 144x144
    transforms.CenterCrop(IMG_SIZE),  # 128x128
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

print("Training transforms:")
print(train_transforms)
print("\nValidation/Test transforms:")
print(val_test_transforms)

---

## 4. Dataset & DataLoader

### Dataset vs DataLoader

| Concept | Role |
|---|---|
| **Dataset** | Stores samples and their labels. Defines `__getitem__` (get one sample) and `__len__` (total count). |
| **DataLoader** | Wraps a Dataset to provide batching, shuffling, parallel loading, and sampling. |

### Oxford-IIIT Pet Dataset

- **37 categories** of pet breeds (25 dog breeds, 12 cat breeds)
- **~7,400 images** total with roughly 200 images per class
- Comes with two splits: `trainval` (~3,680 images) and `test` (~3,669 images)

### Our Split Strategy

```
Oxford-IIIT Pet Dataset
├── trainval split (~3,680 images)
│   ├── 80% → Training set  (~2,944 images)
│   └── 20% → Validation set (~736 images)
└── test split (~3,669 images) → Test set (as-is)
```

This gives us approximately a **40/10/50** train/val/test ratio. The large test set provides a reliable estimate of real-world performance.
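
The split arithmetic is easy to check by hand. The sketch below takes the split sizes quoted above as exact counts (an assumption for illustration) and reproduces the subset sizes and the resulting ratio.

```python
# Reproduce the split arithmetic for the Oxford-IIIT Pet splits.
TRAINVAL_SIZE = 3680  # trainval split size quoted above, taken as exact
TEST_SIZE = 3669      # test split size quoted above, taken as exact

train_size = int(0.8 * TRAINVAL_SIZE)
val_size = TRAINVAL_SIZE - train_size  # remainder, so no image is dropped
total = TRAINVAL_SIZE + TEST_SIZE

for name, size in [("train", train_size), ("val", val_size), ("test", TEST_SIZE)]:
    print(f"{name:5s}: {size:5d} images ({100 * size / total:.1f}%)")
```

Computing the validation size as a remainder rather than as `int(0.2 * n)` guarantees the two subsets always add up to the full split, whatever its size.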

### The `TransformSubset` Problem

PyTorch's `random_split` returns `Subset` objects that read samples through the parent dataset, so both subsets would share whatever transform the parent applies. But we need **different transforms** for training (with augmentation) vs validation (without). We therefore load the parent with `transform=None` and let our `TransformSubset` wrapper apply the appropriate transform at access time.

### Class Imbalance & Weighted Sampling

If some breeds have more images than others, the model may become biased toward majority classes. `WeightedRandomSampler` assigns higher sampling probability to underrepresented classes, ensuring each class contributes equally to training:

$$w_i = \frac{1}{\text{count}(\text{class}_i)}$$
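
A quick way to see why inverse-frequency weights balance the classes is to compute the probability that a single weighted draw lands in each class. The sketch below uses a tiny, deliberately imbalanced label list that is purely hypothetical.

```python
from collections import Counter

# Hypothetical, deliberately imbalanced label list (3 classes).
labels = [0] * 6 + [1] * 3 + [2] * 1

class_counts = Counter(labels)
class_weights = {cls: 1.0 / count for cls, count in class_counts.items()}
sample_weights = [class_weights[y] for y in labels]

# Probability that one weighted draw lands in each class:
# (samples in class) * (weight per sample) / (sum of all weights).
total_weight = sum(sample_weights)
class_prob = {
    cls: class_counts[cls] * class_weights[cls] / total_weight
    for cls in class_counts
}
print(class_prob)  # every class ends up with probability 1/3
```

Only the relative sizes of the weights matter, so multiplying every weight by the same constant (for example the number of samples) leaves these probabilities unchanged. Because sampling is done with replacement, images from rare classes are repeated more often within an epoch.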

In [None]:
# ============================================================
# 4. DATASET & DATALOADER
# ============================================================

# --- TransformSubset wrapper ---
class TransformSubset(Dataset):
    """Wraps a Subset to apply custom transforms instead of the parent's."""
    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __getitem__(self, idx):
        image, label = self.subset[idx]
        if self.transform:
            image = self.transform(image)
        return image, label

    def __len__(self):
        return len(self.subset)


# --- Load Oxford-IIIT Pet dataset (no transform yet — applied via TransformSubset) ---
# Using target_types="category" for breed classification (labels 0-36)
raw_trainval = torchvision.datasets.OxfordIIITPet(
    root="./data",
    split="trainval",
    target_types="category",
    transform=None,   # Raw PIL images — transforms applied later
    download=True,
)

raw_test = torchvision.datasets.OxfordIIITPet(
    root="./data",
    split="test",
    target_types="category",
    transform=None,
    download=True,
)

# --- Split trainval into train (80%) and val (20%) ---
trainval_size = len(raw_trainval)
train_size = int(0.8 * trainval_size)
val_size = trainval_size - train_size

generator = torch.Generator().manual_seed(SEED)
train_subset, val_subset = random_split(raw_trainval, [train_size, val_size], generator=generator)

# --- Apply transforms via wrapper ---
train_dataset = TransformSubset(train_subset, transform=train_transforms)
val_dataset = TransformSubset(val_subset, transform=val_test_transforms)
test_dataset = TransformSubset(Subset(raw_test, range(len(raw_test))), transform=val_test_transforms)

print(f"Training set:   {len(train_dataset):,} images")
print(f"Validation set: {len(val_dataset):,} images")
print(f"Test set:       {len(test_dataset):,} images")
print(f"Total:          {len(train_dataset) + len(val_dataset) + len(test_dataset):,} images")
print(f"Number of classes: {NUM_CLASSES}")

In [None]:
# --- Compute class weights for WeightedRandomSampler ---
# Extract labels from the training subset
train_labels = [raw_trainval[idx][1] for idx in train_subset.indices]

class_counts = Counter(train_labels)
num_samples = len(train_labels)
class_weights = {cls: num_samples / count for cls, count in class_counts.items()}
sample_weights = [class_weights[label] for label in train_labels]
sample_weights = torch.tensor(sample_weights, dtype=torch.float64)

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)

# --- DataLoaders ---
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=sampler, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)

# --- Class names ---
class_names = raw_trainval.classes

print(f"\nNumber of batches — Train: {len(train_loader)}, Val: {len(val_loader)}, Test: {len(test_loader)}")
print(f"\nClass distribution (training set):")
for cls_idx in sorted(class_counts.keys())[:10]:
    print(f"  {class_names[cls_idx]:25s}: {class_counts[cls_idx]} images (weight: {class_weights[cls_idx]:.2f})")
print(f"  ... ({len(class_counts)} classes total)")

---

## 5. Visualize Data

### Why Visualize Before Modeling?

Visualization helps us:
1. **Verify data loading** — Are images loading correctly? Are labels right?
2. **Understand the data** — What do the images look like? How varied are they?
3. **Check augmentations** — Are transforms reasonable or too aggressive?
4. **Spot issues** — Corrupted images, mislabeled data, class imbalance

In [None]:
# ============================================================
# 5. VISUALIZE DATA
# ============================================================

def denormalize(tensor, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Reverse normalization for display."""
    mean = torch.tensor(mean).view(3, 1, 1)
    std = torch.tensor(std).view(3, 1, 1)
    return tensor * std + mean


# --- Display a grid of sample images ---
fig, axes = plt.subplots(4, 4, figsize=(14, 14))
fig.suptitle("Sample Images from Training Set", fontsize=16, fontweight="bold")

for i, ax in enumerate(axes.flat):
    image, label = train_dataset[i]
    image = denormalize(image)
    image = image.permute(1, 2, 0).clamp(0, 1).numpy()
    ax.imshow(image)
    ax.set_title(class_names[label], fontsize=10)
    ax.axis("off")

plt.tight_layout()
plt.show()

In [None]:
# --- Class distribution bar chart ---
fig, ax = plt.subplots(figsize=(16, 5))

sorted_classes = sorted(class_counts.items(), key=lambda x: x[1], reverse=True)
names = [class_names[c] for c, _ in sorted_classes]
counts = [cnt for _, cnt in sorted_classes]

bars = ax.bar(range(len(names)), counts, color="steelblue", edgecolor="white")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=90, fontsize=8)
ax.set_ylabel("Number of Images")
ax.set_title("Class Distribution (Training Set)", fontsize=14, fontweight="bold")
ax.axhline(y=np.mean(counts), color="red", linestyle="--", label=f"Mean: {np.mean(counts):.0f}")
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# --- Augmentation comparison: same image with different random transforms ---
fig, axes = plt.subplots(1, 6, figsize=(18, 3))
fig.suptitle("Same Image with Different Augmentations", fontsize=14, fontweight="bold")

# Get a raw image from the underlying dataset
raw_idx = train_subset.indices[0]
raw_image, label = raw_trainval[raw_idx]

# Show original
axes[0].imshow(raw_image)
axes[0].set_title("Original", fontsize=10)
axes[0].axis("off")

# Show 5 augmented versions
for i in range(1, 6):
    augmented = train_transforms(raw_image)
    augmented = denormalize(augmented).permute(1, 2, 0).clamp(0, 1).numpy()
    axes[i].imshow(augmented)
    axes[i].set_title(f"Augmented {i}", fontsize=10)
    axes[i].axis("off")

plt.tight_layout()
plt.show()

---

## 6. Model Architecture

### CNN Fundamentals

A **Convolutional Neural Network** processes images by learning spatial hierarchies of features — edges in early layers, textures in middle layers, and object parts in deep layers.

#### The Convolution Operation

A convolutional layer slides a small filter (kernel) across the input, computing dot products at each position to produce a **feature map**:

$$\text{Output}(i, j) = \sum_{m} \sum_{n} \text{Input}(i+m, j+n) \cdot \text{Kernel}(m, n) + \text{bias}$$

**Output size formula:**
$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$

where $W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride.
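
As a sanity check, the formula can be applied layer by layer. The sketch below (the helper name is ours) traces a 128x128 input through five blocks of a 3x3 convolution with padding 1 followed by 2x2 max pooling with stride 2, which is the block layout this notebook uses. Integer division stands in for rounding down when the stride does not divide evenly.

```python
def conv_output_size(w, k, p=0, s=1):
    """Spatial output size of a convolution or pooling layer."""
    return (w - k + 2 * p) // s + 1

size = 128  # input height/width used in this notebook
for block in range(1, 6):
    size = conv_output_size(size, k=3, p=1, s=1)  # 3x3 conv, padding 1: size unchanged
    size = conv_output_size(size, k=2, p=0, s=2)  # 2x2 max pool, stride 2: size halved
    print(f"after block {block}: {size}x{size}")
```

The padded 3x3 convolutions preserve the spatial size, so all of the downsampling comes from the pooling layers: 128, 64, 32, 16, 8, 4.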

#### Key Building Blocks

| Component | Purpose |
|---|---|
| **Conv2d** | Learns spatial features via learnable filters |
| **BatchNorm2d** | Normalizes activations → faster, more stable training |
| **ReLU** | Non-linear activation: $f(x) = \max(0, x)$ — introduces non-linearity |
| **MaxPool2d** | Downsamples by 2x → reduces spatial size, adds translation invariance |
| **AdaptiveAvgPool2d** | Pools to a fixed size regardless of input dimensions |
| **Dropout** | Randomly zeroes neurons during training → prevents co-adaptation |

### Our Architecture

```
Input: (B, 3, 128, 128)
│
├── Conv Block 1: 3→32    │ (B, 32, 64, 64)
│   Conv2d(3,32,3,pad=1) → BatchNorm → ReLU → MaxPool(2)
│
├── Conv Block 2: 32→64   │ (B, 64, 32, 32)
│   Conv2d(32,64,3,pad=1) → BatchNorm → ReLU → MaxPool(2)
│
├── Conv Block 3: 64→128  │ (B, 128, 16, 16)
│   Conv2d(64,128,3,pad=1) → BatchNorm → ReLU → MaxPool(2)
│
├── Conv Block 4: 128→256 │ (B, 256, 8, 8)
│   Conv2d(128,256,3,pad=1) → BatchNorm → ReLU → MaxPool(2)
│
├── Conv Block 5: 256→512 │ (B, 512, 4, 4)
│   Conv2d(256,512,3,pad=1) → BatchNorm → ReLU → MaxPool(2)
│
├── AdaptiveAvgPool2d(1)  │ (B, 512, 1, 1)
├── Flatten               │ (B, 512)
│
├── Classifier:
│   Linear(512, 256) → ReLU → Dropout(0.5)
│   Linear(256, 128) → ReLU → Dropout(0.3)
│   Linear(128, 37)  → (output logits)
│
Output: (B, 37) — raw class scores
```
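
It can be instructive to count the learnable parameters of this architecture by hand before asking PyTorch. The sketch below uses the standard per-layer formulas; the helper names are ours, and BatchNorm running statistics are left out because they are buffers rather than learnable parameters.

```python
def conv2d_params(c_in, c_out, k=3):
    """Weights (k*k*c_in per filter) plus one bias per output channel."""
    return (k * k * c_in + 1) * c_out

def batchnorm_params(c):
    """Learnable scale (gamma) and shift (beta) per channel."""
    return 2 * c

def linear_params(n_in, n_out):
    """Weight matrix plus bias vector."""
    return (n_in + 1) * n_out

channels = [3, 32, 64, 128, 256, 512]
feature_params = sum(
    conv2d_params(c_in, c_out) + batchnorm_params(c_out)
    for c_in, c_out in zip(channels, channels[1:])
)
classifier_params = (
    linear_params(512, 256) + linear_params(256, 128) + linear_params(128, 37)
)
expected_total = feature_params + classifier_params

print(f"features:   {feature_params:,}")
print(f"classifier: {classifier_params:,}")
print(f"total:      {expected_total:,}")
```

Most of the parameters sit in the last two convolutional blocks. Thanks to global average pooling, the classifier stays small and none of these counts depend on the input image size.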

In [None]:
# ============================================================
# 6. MODEL ARCHITECTURE
# ============================================================

class DogBreedCNN(nn.Module):
    """Custom CNN for pet breed classification.

    Architecture: 5 convolutional blocks + 3-layer classifier.
    Each conv block: Conv2d → BatchNorm2d → ReLU → MaxPool2d
    """

    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()

        # --- Feature Extractor (Convolutional Blocks) ---
        self.features = nn.Sequential(
            # Block 1: 3 → 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 2: 32 → 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 3: 64 → 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 4: 128 → 256 channels
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 5: 256 → 512 channels
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # --- Global Average Pooling ---
        self.pool = nn.AdaptiveAvgPool2d((1, 1))

        # --- Classifier ---
        self.classifier = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),

            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),

            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)     # (B, 512, 4, 4)
        x = self.pool(x)         # (B, 512, 1, 1)
        x = torch.flatten(x, 1)  # (B, 512)
        x = self.classifier(x)   # (B, num_classes)
        return x


# --- Instantiate model ---
model = DogBreedCNN(num_classes=NUM_CLASSES).to(DEVICE)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters:     {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

---

## 7. Display Model Summary

The `torchinfo` summary shows each layer's output shape, parameter count, and memory usage. This helps verify that our architecture is correct and understand where most parameters live.

In [None]:
# ============================================================
# 7. DISPLAY MODEL SUMMARY
# ============================================================

summary(model, input_size=(BATCH_SIZE, 3, IMG_SIZE, IMG_SIZE), col_names=["input_size", "output_size", "num_params"])

---

## 8. Loss, Optimizer & Scheduler

### Cross-Entropy Loss

For multi-class classification, **Cross-Entropy Loss** measures the distance between the predicted probability distribution and the true label:

$$\mathcal{L} = -\sum_{c=1}^{C} y_c \cdot \log(\hat{y}_c)$$

where $y_c$ is the one-hot encoded true label and $\hat{y}_c$ is the predicted probability for class $c$. PyTorch's `nn.CrossEntropyLoss` combines `LogSoftmax` + `NLLLoss` in one step for numerical stability.
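
The following plain-Python sketch mirrors that LogSoftmax + NLL combination for a single sample, using the log-sum-exp trick for stability. The function names and logits are illustrative, not PyTorch's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy(logits, target):
    """Cross-entropy for one sample; with a one-hot label only the target term survives."""
    return -log_softmax(logits)[target]

logits = [2.0, 0.5, -1.0]  # hypothetical raw scores for 3 classes
print(f"loss if class 0 is correct: {cross_entropy(logits, 0):.4f}")
print(f"loss if class 2 is correct: {cross_entropy(logits, 2):.4f}")

# An untrained 37-class model that outputs equal scores has loss ln(37).
print(f"uniform 37-class loss: {cross_entropy([0.0] * 37, 5):.4f}")
```

The last line gives a useful reference point for this notebook: a first-epoch loss near ln(37), about 3.61, means the model is doing no better than guessing uniformly.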

### Adam Optimizer

**Adam** (Adaptive Moment Estimation) combines the best of two worlds:
- **Momentum** (like SGD with momentum): Tracks exponential moving average of gradients
- **RMSProp**: Adapts learning rate per-parameter based on gradient magnitude

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t$$

We also add **weight decay** (L2 regularization) to penalize large weights and reduce overfitting.
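
To see the update rule in action, here is a minimal scalar sketch of one Adam step. It assumes the commonly used defaults (β1 = 0.9, β2 = 0.999, ε = 1e-8) and leaves out weight decay; the function name and the toy objective are ours.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = theta^2 (gradient 2 * theta), starting from theta = 1.0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
    print(f"step {t}: theta = {theta:.6f}")
```

On the very first step the bias-corrected ratio reduces to roughly the sign of the gradient, so the parameter moves by about `lr` no matter how large the gradient is. This per-parameter rescaling is what makes Adam less sensitive to gradient scale than plain SGD.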

### Learning Rate Scheduler

**ReduceLROnPlateau** monitors validation loss. If it stops improving for `patience` epochs, the learning rate is reduced by `factor`. This allows the model to take smaller steps as it approaches a minimum:

```
LR: 1e-3 ──plateau──→ 5e-4 ──plateau──→ 2.5e-4 ──plateau──→ ...
```
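
The plateau logic can be simulated without PyTorch. The sketch below is a simplified model of the scheduler: any strict decrease counts as an improvement, and the threshold, cooldown, and minimum-LR options are ignored. To our understanding the real scheduler also reduces the rate once the number of non-improving epochs exceeds `patience`, but treat that detail as an assumption and check the documentation for your version.

```python
def simulate_plateau_schedule(val_losses, lr=1e-3, factor=0.5, patience=3):
    """Return the learning rate in effect after each epoch (simplified model)."""
    best = float("inf")
    bad_epochs = 0
    lrs = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:   # more than `patience` epochs without improvement
            lr *= factor
            bad_epochs = 0
        lrs.append(lr)
    return lrs

# Hypothetical validation losses: three improving epochs, then a plateau.
losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.21, 1.22, 1.1]
print(simulate_plateau_schedule(losses))
```

In this run the rate is halved after the fourth consecutive epoch without a new best loss (epoch 7), and the counter restarts from zero afterwards.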

In [None]:
# ============================================================
# 8. LOSS, OPTIMIZER & SCHEDULER
# ============================================================

criterion = nn.CrossEntropyLoss()

optimizer = optim.Adam(
    model.parameters(),
    lr=LEARNING_RATE,
    weight_decay=1e-4,  # L2 regularization
)

scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="min",       # Monitor validation loss (lower is better)
    patience=3,       # Tolerate 3 epochs without improvement before reducing
    factor=0.5,       # Halve the learning rate
)

print(f"Loss function: {criterion}")
print(f"Optimizer: Adam (lr={LEARNING_RATE}, weight_decay=1e-4)")
print(f"Scheduler: ReduceLROnPlateau (patience=3, factor=0.5)")

---

## 9. Training & Validation Functions

### The Training Cycle

Each training step follows this cycle:

```
1. Forward Pass:   input → model → predictions
2. Compute Loss:   predictions vs true labels → scalar loss
3. Backward Pass:  loss.backward() → compute gradients (∂L/∂w for all weights)
4. Update Weights: optimizer.step() → w_new = w_old - lr * gradient (plain SGD form; Adam rescales each step)
5. Zero Gradients: optimizer.zero_grad() → reset for next batch
```

### What `model.train()` vs `model.eval()` Does

| Mode | Dropout | BatchNorm | Gradients |
|---|---|---|---|
| `model.train()` | Active (randomly zeros neurons) | Uses batch statistics | Computed |
| `model.eval()` | Disabled | Uses running statistics | Still tracked unless wrapped in `torch.no_grad()` |
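
The Dropout column can be illustrated with a few lines of plain Python. This is a sketch of "inverted" dropout, in which surviving activations are scaled by 1 / (1 - p) during training so that their expected value matches eval mode. The function is ours, not PyTorch's implementation.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations.

    In training mode each value is zeroed with probability p and the survivors
    are scaled by 1 / (1 - p); in eval mode the input passes through unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
print("train:", dropout(activations, p=0.5, training=True, rng=rng))
print("eval: ", dropout(activations, p=0.5, training=False))
```

Forgetting to call `model.eval()` before validation leaves this random masking switched on, which makes validation metrics noisier and typically worse than they should be.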

### Why Validate?

Validation measures how well the model generalizes to data it hasn't been trained on. By comparing training and validation metrics, we can detect:

| Scenario | Train Loss | Val Loss | Diagnosis |
|---|---|---|---|
| Both low, close together | ↓ | ↓ | Good fit |
| Train low, val high | ↓↓ | ↑ | **Overfitting** — model memorized training data |
| Both high | ↑ | ↑ | **Underfitting** — model too simple or needs more training |

In [None]:
# ============================================================
# 9. TRAINING & VALIDATION FUNCTIONS
# ============================================================

def train_one_epoch(model, loader, criterion, optimizer, device):
    """Train the model for one epoch.

    Args:
        model: The neural network
        loader: Training DataLoader
        criterion: Loss function
        optimizer: Optimizer
        device: torch.device

    Returns:
        avg_loss (float): Average loss over all batches
        accuracy (float): Training accuracy (0-100)
    """
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in tqdm(loader, desc="Training", leave=False):
        images, labels = images.to(device), labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Track metrics
        running_loss += loss.item() * images.size(0)
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    avg_loss = running_loss / total
    accuracy = 100.0 * correct / total
    return avg_loss, accuracy


@torch.no_grad()
def validate(model, loader, criterion, device):
    """Evaluate the model on validation/test data.

    Args:
        model: The neural network
        loader: Validation/Test DataLoader
        criterion: Loss function
        device: torch.device

    Returns:
        avg_loss (float): Average loss
        accuracy (float): Accuracy (0-100)
    """
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in tqdm(loader, desc="Validating", leave=False):
        images, labels = images.to(device), labels.to(device)

        outputs = model(images)
        loss = criterion(outputs, labels)

        running_loss += loss.item() * images.size(0)
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    avg_loss = running_loss / total
    accuracy = 100.0 * correct / total
    return avg_loss, accuracy

---

## 10. Training Loop

### Epoch-Level Training

One **epoch** = one complete pass through the entire training set. We train for multiple epochs because:
- The model needs repeated exposure to learn patterns
- Each epoch uses different augmentations (via random transforms)
- Gradients from a single pass aren't enough to converge

### Early Stopping

Early stopping prevents overfitting by halting training when validation loss stops improving. If the validation loss doesn't decrease for `patience` consecutive epochs, we stop and restore the best model weights.

```
Epoch:     1  2  3  4  5  6  7  8  9
Val Loss:  ↓  ↓  ↓  ↓  ↑  ↑  ↑  ↑  ↑  → STOP
                    ↑              ↑
               Best model   Patience exhausted (5)
```
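
The stopping rule itself is only a counter. The sketch below runs it over a hypothetical list of validation losses (four improving epochs, then steadily worse) and reports which epoch held the best model and when training would stop; the function name is ours.

```python
def run_early_stopping(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Epochs are 1-based. stop_epoch is None if patience never runs out.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            bad_epochs = 0          # improvement: reset the counter
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            return best_epoch, epoch
    return best_epoch, None

# Hypothetical losses: improve for 4 epochs, then get worse.
losses = [2.0, 1.6, 1.3, 1.1, 1.15, 1.2, 1.18, 1.25, 1.3, 1.4]
print(run_early_stopping(losses, patience=5))
```

With a patience of 5, training stops at the end of epoch 9, the fifth consecutive epoch without a new best loss, and the weights saved at epoch 4 are the ones to restore.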

In [None]:
# ============================================================
# 10. TRAINING LOOP
# ============================================================

# --- History tracking ---
history = {
    "train_loss": [],
    "val_loss": [],
    "train_acc": [],
    "val_acc": [],
    "lr": [],
}

# --- Early stopping variables ---
best_val_loss = float("inf")
best_model_state = None
patience_counter = 0

print(f"Starting training for {EPOCHS} epochs...")
print(f"{'='*70}")

total_start = time.time()

for epoch in range(1, EPOCHS + 1):
    epoch_start = time.time()

    # Train
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, DEVICE)

    # Validate
    val_loss, val_acc = validate(model, val_loader, criterion, DEVICE)

    # Scheduler step (monitors val_loss)
    scheduler.step(val_loss)
    current_lr = optimizer.param_groups[0]["lr"]

    # Record history
    history["train_loss"].append(train_loss)
    history["val_loss"].append(val_loss)
    history["train_acc"].append(train_acc)
    history["val_acc"].append(val_acc)
    history["lr"].append(current_lr)

    epoch_time = time.time() - epoch_start

    # Early stopping check
    if val_loss < best_val_loss:
        best_val_loss = val_loss
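        # Clone the tensors: state_dict() returns references to the live weights,
        # so a shallow .copy() would keep changing as training continues
        best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}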
        patience_counter = 0
        marker = " ★ Best"
    else:
        patience_counter += 1
        marker = ""

    print(
        f"Epoch [{epoch:02d}/{EPOCHS}] "
        f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | "
        f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}% | "
        f"LR: {current_lr:.6f} | Time: {epoch_time:.1f}s{marker}"
    )

    if patience_counter >= EARLY_STOPPING_PATIENCE:
        print(f"\nEarly stopping triggered after {epoch} epochs (patience={EARLY_STOPPING_PATIENCE})")
        break

# Restore best model
if best_model_state is not None:
    model.load_state_dict(best_model_state)
    print(f"\nRestored best model (val_loss={best_val_loss:.4f})")

total_time = time.time() - total_start
print(f"\nTotal training time: {total_time / 60:.1f} minutes")

---

## 11. Test & Evaluation

### Final Evaluation on Unseen Data

The test set is data the model has **never** seen during training or validation. This gives us an unbiased estimate of real-world performance.

### Metrics We Track

| Metric | What It Measures |
|---|---|
| **Top-1 Accuracy** | % of times the model's top prediction is correct |
| **Top-5 Accuracy** | % of times the correct class is in the model's top 5 predictions |
| **Precision** | Of all predicted as class X, how many are actually X? |
| **Recall** | Of all actual class X, how many did we correctly predict? |
| **F1-Score** | Harmonic mean of precision and recall |
| **Confusion Matrix** | NxN grid showing predicted vs actual for every class pair |
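
To make these definitions concrete, the sketch below computes top-k accuracy and per-class precision, recall, and F1 by hand on a tiny hypothetical 3-class example. The notebook itself relies on scikit-learn for these numbers; the helper functions here are only illustrative.

```python
def top_k_accuracy(y_true, scores, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for label, row in zip(y_true, scores):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(y_true)

def precision_recall_f1(y_true, y_pred, cls):
    """Per-class precision, recall and F1 (0.0 when a denominator is zero)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: 4 samples, one row of class scores per sample.
y_true = [0, 1, 2, 1]
scores = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.8, 0.1],
]
y_pred = [max(range(3), key=lambda c: row[c]) for row in scores]
print("top-1:", top_k_accuracy(y_true, scores, k=1))
print("top-2:", top_k_accuracy(y_true, scores, k=2))
print("class 1 (P, R, F1):", precision_recall_f1(y_true, y_pred, cls=1))
```

The second sample is misclassified at top-1 but its true class is the runner-up, so it counts as a hit at top-2. This is why top-5 accuracy is always at least as high as top-1 accuracy.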

In [None]:
# ============================================================
# 11. TEST & EVALUATION
# ============================================================

@torch.no_grad()
def test_model(model, loader, device):
    """Run inference on the test set and collect all predictions.

    Returns:
        all_labels: Ground truth labels
        all_preds: Predicted labels
        all_probs: Predicted probabilities (for top-k accuracy)
    """
    model.eval()
    all_labels = []
    all_preds = []
    all_probs = []

    for images, labels in tqdm(loader, desc="Testing"):
        images, labels = images.to(device), labels.to(device)

        outputs = model(images)
        probs = F.softmax(outputs, dim=1)
        _, predicted = outputs.max(1)

        all_labels.extend(labels.cpu().numpy())
        all_preds.extend(predicted.cpu().numpy())
        all_probs.extend(probs.cpu().numpy())

    return np.array(all_labels), np.array(all_preds), np.array(all_probs)


# --- Run test ---
print("Evaluating on test set...\n")
test_labels, test_preds, test_probs = test_model(model, test_loader, DEVICE)

# --- Accuracy ---
top1_acc = 100.0 * np.mean(test_labels == test_preds)
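# Pass `labels` explicitly so the score still works if a class is absent from the test labels
top5_acc = 100.0 * top_k_accuracy_score(
    test_labels, test_probs, k=5, labels=np.arange(test_probs.shape[1])
)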

print(f"Test Top-1 Accuracy: {top1_acc:.2f}%")
print(f"Test Top-5 Accuracy: {top5_acc:.2f}%")

In [None]:
# --- Classification Report ---
print("\nClassification Report:")
print("=" * 70)
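# Pass `labels` explicitly so rows stay aligned with class_names even if a class is missing
print(classification_report(
    test_labels, test_preds,
    labels=np.arange(len(class_names)),
    target_names=class_names,
    zero_division=0,
))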

In [None]:
# --- Confusion Matrix ---
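# Fix the label set so the matrix is always len(class_names) x len(class_names)
cm = confusion_matrix(test_labels, test_preds, labels=np.arange(len(class_names)))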

fig, ax = plt.subplots(figsize=(18, 16))
sns.heatmap(
    cm, annot=False, fmt="d", cmap="Blues",
    xticklabels=class_names, yticklabels=class_names,
    ax=ax,
)
ax.set_xlabel("Predicted", fontsize=12)
ax.set_ylabel("Actual", fontsize=12)
ax.set_title(f"Confusion Matrix (Test Set) — Accuracy: {top1_acc:.2f}%", fontsize=14, fontweight="bold")
plt.xticks(rotation=90, fontsize=7)
plt.yticks(rotation=0, fontsize=7)
plt.tight_layout()
plt.show()

In [None]:
# --- Training Curves ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

epochs_range = range(1, len(history["train_loss"]) + 1)

# Loss curves
ax1.plot(epochs_range, history["train_loss"], "b-o", label="Train Loss", markersize=4)
ax1.plot(epochs_range, history["val_loss"], "r-o", label="Val Loss", markersize=4)
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")
ax1.set_title("Loss Curves", fontsize=13, fontweight="bold")
ax1.legend()
ax1.grid(True, alpha=0.3)

# Accuracy curves
ax2.plot(epochs_range, history["train_acc"], "b-o", label="Train Acc", markersize=4)
ax2.plot(epochs_range, history["val_acc"], "r-o", label="Val Acc", markersize=4)
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Accuracy (%)")
ax2.set_title("Accuracy Curves", fontsize=13, fontweight="bold")
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.suptitle("Training History — CNN From Scratch", fontsize=15, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

---

## 12. Save Model

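We save the model's **state dict** (learned weights and biases) rather than pickling the entire model object. A pickled model is tied to the exact class definition and module layout that existed at save time, so it can break when the code is refactored. A state dict is just a mapping from parameter and buffer names to tensors, which makes it more portable. To use it, re-create the model from the `DogBreedCNN` class and call `load_state_dict`. The checkpoint also stores the metadata needed to rebuild the model and interpret its outputs (`num_classes`, `img_size`, `class_names`).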
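
Loading such a checkpoint later can be sketched as follows: the architecture is rebuilt first, then the saved weights are copied in. This is a minimal round-trip illustration, not the notebook's own loading code. `TinyNet` is a hypothetical stand-in for `DogBreedCNN`, the class names are placeholders, and an in-memory buffer replaces the `.pth` file so the example is self-contained.

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for DogBreedCNN, used only for this illustration."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(x)


# Save a checkpoint to an in-memory buffer (a file path works the same way).
original = TinyNet(num_classes=3)
buffer = io.BytesIO()
torch.save(
    {
        "model_state_dict": original.state_dict(),
        "num_classes": 3,
        "class_names": ["a", "b", "c"],  # placeholder names
    },
    buffer,
)
buffer.seek(0)

# Load: read the checkpoint, rebuild the architecture, then copy the weights in.
# weights_only=True restricts unpickling to tensors and plain Python containers.
checkpoint = torch.load(buffer, map_location="cpu", weights_only=True)
restored = TinyNet(num_classes=checkpoint["num_classes"])
restored.load_state_dict(checkpoint["model_state_dict"])
restored.eval()  # switch dropout / batch norm layers to inference behavior

# The restored model should reproduce the original model's outputs.
x = torch.randn(2, 8)
with torch.no_grad():
    same_outputs = torch.equal(original(x), restored(x))
print("Outputs match:", same_outputs)
```

Storing `num_classes` next to the weights means the loading code does not have to hard-code the output layer size.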

In [None]:
# ============================================================
# 12. SAVE MODEL
# ============================================================

save_path = "dog_breed_cnn_from_scratch.pth"

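torch.save({
    "model_state_dict": model.state_dict(),
    "num_classes": NUM_CLASSES,
    "img_size": IMG_SIZE,
    "class_names": class_names,
    # Cast the NumPy scalar to a plain float so the checkpoint also loads
    # with torch.load(..., weights_only=True)
    "test_accuracy": float(top1_acc),
    "history": history,
}, save_path)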

print(f"Model saved to '{save_path}'")
print(f"Test accuracy: {top1_acc:.2f}%")

---

## 13. Bonus: Stanford Dogs Extension

The **Stanford Dogs Dataset** is a more challenging benchmark with **120 dog breeds** and **20,580 images**. This section shows how to adapt our CNN pipeline for this larger dataset.

### Key Differences from Oxford-IIIT Pet

| Property | Oxford-IIIT Pet | Stanford Dogs |
|---|---|---|
| Classes | 37 (dogs + cats) | 120 (dogs only) |
| Images | ~7,400 | 20,580 |
| Difficulty | Moderate | Hard (fine-grained) |
| Source | torchvision built-in | Kaggle download |
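
One way to put the jump from 37 to 120 classes in perspective is the accuracy that blind guessing would reach. The sketch below assumes a guesser that picks k distinct classes uniformly at random; `chance_top_k` is a hypothetical helper introduced only for this illustration.

```python
def chance_top_k(num_classes: int, k: int) -> float:
    """Expected top-k accuracy (%) when k distinct classes are guessed uniformly at random."""
    return 100.0 * min(k, num_classes) / num_classes


for name, n in [("Oxford-IIIT Pet", 37), ("Stanford Dogs", 120)]:
    print(f"{name}: top-1 {chance_top_k(n, 1):.2f}% | top-5 {chance_top_k(n, 5):.2f}%")
```

Any accuracy reported on Stanford Dogs should be read against this much lower baseline, so the same raw number represents more learned signal than it would on Oxford-IIIT Pet.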

> **Note:** On Kaggle notebooks, the dataset can be attached directly and no download is needed. On Colab or a local machine, downloading it requires a Kaggle API key: upload your `kaggle.json` credentials first.

In [None]:
# ============================================================
# 13. BONUS: STANFORD DOGS EXTENSION
# ============================================================

# Set STANFORD_DOGS = True below to prepare training on Stanford Dogs (120 breeds).
# This requires either:
#   - Running on Kaggle (dataset available at /kaggle/input/stanford-dogs-dataset)
#   - Kaggle API credentials for download (uncomment the download commands below)

STANFORD_DOGS = False  # Set to True to enable

if STANFORD_DOGS:
    import os
    from torchvision.datasets import ImageFolder

    # --- Download dataset (Colab only — skip on Kaggle) ---
    # !pip install kaggle -q
    # !mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
    # !kaggle datasets download -d jessicali9530/stanford-dogs-dataset -p ./data/stanford_dogs --unzip

    # --- Configuration ---
    STANFORD_NUM_CLASSES = 120
    STANFORD_EPOCHS = 30

    # --- Paths (adjust based on your environment) ---
    # Kaggle:
    # TRAIN_DIR = "/kaggle/input/stanford-dogs-dataset/images/Images"
    # Colab/Local:
    TRAIN_DIR = "./data/stanford_dogs/images/Images"

    # --- Load with ImageFolder ---
    stanford_full = ImageFolder(root=TRAIN_DIR)
    print(f"Stanford Dogs: {len(stanford_full)} images, {len(stanford_full.classes)} classes")

    # --- Split: 80% train, 10% val, 10% test ---
    total = len(stanford_full)
    train_n = int(0.8 * total)
    val_n = int(0.1 * total)
    test_n = total - train_n - val_n

    generator = torch.Generator().manual_seed(SEED)
    sd_train, sd_val, sd_test = random_split(stanford_full, [train_n, val_n, test_n], generator=generator)

    # Apply transforms
    sd_train_dataset = TransformSubset(sd_train, transform=train_transforms)
    sd_val_dataset = TransformSubset(sd_val, transform=val_test_transforms)
    sd_test_dataset = TransformSubset(sd_test, transform=val_test_transforms)

    # DataLoaders
    sd_train_loader = DataLoader(sd_train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
    sd_val_loader = DataLoader(sd_val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
    sd_test_loader = DataLoader(sd_test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)

    print(f"Train: {len(sd_train_dataset)} | Val: {len(sd_val_dataset)} | Test: {len(sd_test_dataset)}")

    # --- Build model with 120 classes ---
    sd_model = DogBreedCNN(num_classes=STANFORD_NUM_CLASSES).to(DEVICE)
    sd_criterion = nn.CrossEntropyLoss()
    sd_optimizer = optim.Adam(sd_model.parameters(), lr=LEARNING_RATE, weight_decay=1e-4)
    sd_scheduler = optim.lr_scheduler.ReduceLROnPlateau(sd_optimizer, patience=3, factor=0.5)

    print(f"\nStanford Dogs model ready — {STANFORD_NUM_CLASSES} classes")
    print("Run the same training loop (Section 10) with sd_model, sd_train_loader, etc.")
else:
    print("Stanford Dogs extension is disabled. Set STANFORD_DOGS = True to enable.")