# Dog Breed Classification - Transfer Learning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/)
[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/)

---

## Project Overview

In this notebook, we leverage **Transfer Learning** — using models pretrained on ImageNet (1.2M images, 1000 classes) — to classify pet breeds from the **Oxford-IIIT Pet Dataset** (37 categories). We compare **four architectures**: ResNet-50, VGG-16, EfficientNet-B0, and MobileNet-V2, using a two-phase training strategy: frozen feature extraction followed by fine-tuning.

**What you'll learn:**
- Why transfer learning works and when to use it
- Feature extraction vs fine-tuning strategies
- How to adapt pretrained models for new tasks
- Comparing model architectures systematically
- ResNet skip connections, EfficientNet compound scaling, MobileNet depthwise separable convolutions

---

## Table of Contents

1. [Imports](#1-imports)
2. [Constants & Device Setup](#2-constants--device-setup)
3. [Data Transforms](#3-data-transforms)
4. [Dataset & DataLoader](#4-dataset--dataloader)
5. [Visualize Data](#5-visualize-data)
6. [Model Architectures](#6-model-architectures)
7. [Display Model Summary](#7-display-model-summary)
8. [Training Infrastructure](#8-training-infrastructure)
9. [Training & Validation Functions](#9-training--validation-functions)
10. [Train & Compare All Models](#10-train--compare-all-models)
11. [Comprehensive Evaluation](#11-comprehensive-evaluation)
12. [Save Best Model](#12-save-best-model)
13. [Bonus: Stanford Dogs Extension](#13-bonus-stanford-dogs-extension)

In [None]:
# Environment setup — install required packages (silent for Colab/Kaggle)
!pip install torchinfo -q

---

## 1. Imports

### Why These Libraries?

| Library | Purpose |
|---|---|
| **torch** | Core deep learning framework — tensors, autograd, neural network modules |
| **torch.nn** | Pre-built neural network layers (Conv2d, Linear, BatchNorm, etc.) |
| **torch.nn.functional** | Stateless functions (activation functions, loss functions) |
| **torch.optim** | Optimization algorithms (SGD, Adam, learning rate schedulers) |
| **torchvision** | Computer vision utilities — datasets, transforms, pretrained models |
| **torchvision.models** | Pretrained model zoo (ResNet, VGG, EfficientNet, MobileNet, etc.) |
| **torchinfo** | Model summary — parameter counts, layer shapes, memory usage |
| **tqdm** | Progress bars for training loops |
| **sklearn.metrics** | Classification report, confusion matrix, top-k accuracy |
| **matplotlib/seaborn** | Visualization — training curves, confusion matrices, comparison charts |

**Transfer Learning Ecosystem:**
```
torchvision.models
├── ResNet50        → Skip connections, 25.6M params
├── VGG16           → Deep sequential, 138M params
├── EfficientNet_B0 → Compound scaling, 5.3M params
└── MobileNet_V2    → Depthwise separable convs, 3.4M params
```

In [None]:
# ============================================================
# 1. IMPORTS
# ============================================================

# --- PyTorch core ---
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# --- Torchvision ---
import torchvision
import torchvision.transforms as transforms
from torchvision import models
from torchvision.models import (
    ResNet50_Weights,
    VGG16_Weights,
    EfficientNet_B0_Weights,
    MobileNet_V2_Weights,
)

# --- Model summary ---
from torchinfo import summary

# --- Data utilities ---
from torch.utils.data import Dataset, DataLoader, random_split, WeightedRandomSampler, Subset

# --- Progress bars ---
from tqdm.notebook import tqdm

# --- Standard libraries ---
import time
import copy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# --- Metrics ---
from sklearn.metrics import classification_report, confusion_matrix, top_k_accuracy_score

print(f"PyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")

---

## 2. Constants & Device Setup

### Key Differences from CNN-From-Scratch

| Parameter | CNN From Scratch | Transfer Learning | Why? |
|---|---|---|---|
| `IMG_SIZE` | 128 | **224** | Pretrained models were trained on 224×224 ImageNet images |
| `EPOCHS` | 25 | **15** | Pretrained features converge much faster |
| `LEARNING_RATE` | 1e-3 | **1e-4** | Smaller LR to avoid destroying pretrained weights |

### Reproducibility

We seed all random number generators to ensure consistent results across runs.
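
As a small illustration of why seeding matters, the sketch below uses only Python's standard `random` module (not the PyTorch generators seeded in this notebook) to show that the same seed reproduces the same shuffled train/validation split. The helper name `split_indices` is hypothetical.

```python
import random


def split_indices(n_items, train_fraction, seed):
    """Shuffle indices 0..n_items-1 with a seeded RNG and split them."""
    rng = random.Random(seed)  # private RNG, leaves global random state alone
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(train_fraction * n_items)
    return indices[:cut], indices[cut:]


train_a, val_a = split_indices(100, 0.8, seed=42)
train_b, val_b = split_indices(100, 0.8, seed=42)

# Same seed, same split; a different seed would generally give a different one.
print(train_a == train_b and val_a == val_b)
print(len(train_a), len(val_a))
```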

In [None]:
# ============================================================
# 2. CONSTANTS & DEVICE SETUP
# ============================================================

# --- Hyperparameters ---
EPOCHS = 15
FEATURE_EXTRACT_EPOCHS = 5    # Phase 1: frozen backbone
FINETUNE_EPOCHS = 10          # Phase 2: unfrozen backbone
LEARNING_RATE = 1e-4
FINETUNE_LR = 1e-5            # Lower LR for fine-tuning phase
BATCH_SIZE = 32
IMG_SIZE = 224                 # Standard for pretrained models
NUM_CLASSES = 37
SEED = 42
EARLY_STOPPING_PATIENCE = 5

# --- Model names to compare ---
MODEL_NAMES = ["resnet50", "vgg16", "efficientnet_b0", "mobilenet_v2"]

# --- Reproducibility ---
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
np.random.seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# --- Device selection ---
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
else:
    DEVICE = torch.device("cpu")

print(f"Using device: {DEVICE}")
if DEVICE.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # The device-properties attribute is `total_memory` (bytes)
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"\nModels to compare: {MODEL_NAMES}")

---

## 3. Data Transforms

### ImageNet Normalization

Pretrained models were trained with specific normalization statistics computed from the ImageNet dataset. We **must** use the same values:

$$x_{\text{normalized}} = \frac{x - \mu}{\sigma}$$

- **Mean:** [0.485, 0.456, 0.406] (per RGB channel)
- **Std:** [0.229, 0.224, 0.225]

Using different normalization shifts the input distribution away from what the pretrained filters expect, which degrades the extracted features and usually costs accuracy.
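
To make the formula concrete, here is a minimal plain-Python sketch that normalizes a single RGB pixel with the ImageNet statistics above and then reverses the operation. The helper names `normalize_pixel` and `denormalize_pixel` are hypothetical. In the notebook itself this work is done by `transforms.Normalize` on whole tensors.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (x - mean) / std per channel to an RGB pixel scaled to [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, mean, std)]


def denormalize_pixel(rgb_norm, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Invert the normalization: x = x_norm * std + mean."""
    return [x * s + m for x, m, s in zip(rgb_norm, mean, std)]


mid_gray = [0.5, 0.5, 0.5]
normalized = normalize_pixel(mid_gray)
restored = denormalize_pixel(normalized)

print([round(v, 4) for v in normalized])
print([round(v, 4) for v in restored])
```

A pixel equal to the channel means maps to zero in every channel, which is the centering that the pretrained weights were trained against.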

### Image Size: 224×224

All four pretrained models were trained on 224×224 images. While they can technically handle other sizes (due to adaptive pooling), using the original training size gives the best feature extraction.

In [None]:
# ============================================================
# 3. DATA TRANSFORMS
# ============================================================

# ImageNet normalization statistics (MUST match pretrained model training)
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training transforms — with augmentation
train_transforms = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomResizedCrop(IMG_SIZE, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

# Validation/Test transforms — deterministic
val_test_transforms = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(IMG_SIZE),  # 224x224
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

print("Training transforms:")
print(train_transforms)
print("\nValidation/Test transforms:")
print(val_test_transforms)

---

## 4. Dataset & DataLoader

### Split Strategy

Identical to Notebook 1 — we use the same data splits for a fair comparison:

```
Oxford-IIIT Pet Dataset
├── trainval split (~3,680 images)
│   ├── 80% → Training set  (~2,944 images)
│   └── 20% → Validation set (~736 images)
└── test split (~3,669 images) → Test set (as-is)
```
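
The split arithmetic above can be checked with a few lines of plain Python. This is only a sketch: it assumes the trainval and test splits contain 3,680 and 3,669 images (the approximate figures in the diagram) and uses the notebook's batch size of 32. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(total, train_fraction=0.8):
    """Mirror the notebook's int(0.8 * n) / remainder split."""
    train = int(train_fraction * total)
    return train, total - train


TRAINVAL_IMAGES = 3680  # assumed size of the trainval split
TEST_IMAGES = 3669      # assumed size of the test split
BATCH_SIZE = 32

train_size, val_size = split_sizes(TRAINVAL_IMAGES)
batches = {
    "train": math.ceil(train_size / BATCH_SIZE),
    "val": math.ceil(val_size / BATCH_SIZE),
    "test": math.ceil(TEST_IMAGES / BATCH_SIZE),
}

print(train_size, val_size)  # 2944 736
print(batches)               # {'train': 92, 'val': 23, 'test': 115}
```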

### TransformSubset Wrapper

Same pattern as Notebook 1 — `random_split` returns `Subset` objects that don't support per-subset transforms, so we wrap them.

In [None]:
# ============================================================
# 4. DATASET & DATALOADER
# ============================================================

# --- TransformSubset wrapper ---
class TransformSubset(Dataset):
    """Wraps a Subset to apply custom transforms instead of the parent's."""
    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __getitem__(self, idx):
        image, label = self.subset[idx]
        if self.transform:
            image = self.transform(image)
        return image, label

    def __len__(self):
        return len(self.subset)


# --- Load Oxford-IIIT Pet dataset ---
raw_trainval = torchvision.datasets.OxfordIIITPet(
    root="./data",
    split="trainval",
    target_types="category",
    transform=None,
    download=True,
)

raw_test = torchvision.datasets.OxfordIIITPet(
    root="./data",
    split="test",
    target_types="category",
    transform=None,
    download=True,
)

# --- Split trainval into train (80%) and val (20%) ---
trainval_size = len(raw_trainval)
train_size = int(0.8 * trainval_size)
val_size = trainval_size - train_size

generator = torch.Generator().manual_seed(SEED)
train_subset, val_subset = random_split(raw_trainval, [train_size, val_size], generator=generator)

# --- Apply transforms via wrapper ---
train_dataset = TransformSubset(train_subset, transform=train_transforms)
val_dataset = TransformSubset(val_subset, transform=val_test_transforms)
test_dataset = TransformSubset(Subset(raw_test, range(len(raw_test))), transform=val_test_transforms)

print(f"Training set:   {len(train_dataset):,} images")
print(f"Validation set: {len(val_dataset):,} images")
print(f"Test set:       {len(test_dataset):,} images")
print(f"Number of classes: {NUM_CLASSES}")

In [None]:
# --- Compute class weights for WeightedRandomSampler ---
train_labels = [raw_trainval[idx][1] for idx in train_subset.indices]

class_counts = Counter(train_labels)
num_samples = len(train_labels)
class_weights = {cls: num_samples / count for cls, count in class_counts.items()}
sample_weights = [class_weights[label] for label in train_labels]
sample_weights = torch.tensor(sample_weights, dtype=torch.float64)

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True,
)

# --- DataLoaders ---
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=sampler, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)

# --- Class names ---
class_names = raw_trainval.classes

print(f"\nNumber of batches — Train: {len(train_loader)}, Val: {len(val_loader)}, Test: {len(test_loader)}")
print(f"Class names ({len(class_names)} total): {class_names[:5]} ...")
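
The inverse-frequency weighting used for the sampler can be illustrated on toy data. The sketch below is plain Python with made-up labels, not the notebook's dataset. It shows that when each sample is weighted by `num_samples / class_count`, every class ends up with the same total sampling probability.

```python
from collections import Counter

# Toy, deliberately imbalanced labels (hypothetical data)
labels = ["beagle"] * 6 + ["pug"] * 3 + ["sphynx"] * 1

class_counts = Counter(labels)
num_samples = len(labels)
class_weights = {cls: num_samples / count for cls, count in class_counts.items()}
sample_weights = [class_weights[label] for label in labels]

# Probability of drawing each class = sum of its sample weights / total weight
total_weight = sum(sample_weights)
class_probability = {
    cls: sum(w for w, lab in zip(sample_weights, labels) if lab == cls) / total_weight
    for cls in class_counts
}

for cls, prob in class_probability.items():
    print(f"{cls}: {prob:.3f}")
```

Each class receives a probability of about one third here, so rare classes are drawn (with replacement) as often as common ones.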

---

## 5. Visualize Data

Quick visual check to verify data loading and transforms are working correctly.

In [None]:
# ============================================================
# 5. VISUALIZE DATA
# ============================================================

def denormalize(tensor, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Reverse normalization for display."""
    mean = torch.tensor(mean).view(3, 1, 1)
    std = torch.tensor(std).view(3, 1, 1)
    return tensor * std + mean


# --- Display a grid of sample images ---
fig, axes = plt.subplots(4, 4, figsize=(14, 14))
fig.suptitle("Sample Images from Training Set (224x224)", fontsize=16, fontweight="bold")

for i, ax in enumerate(axes.flat):
    image, label = train_dataset[i]
    image = denormalize(image)
    image = image.permute(1, 2, 0).clamp(0, 1).numpy()
    ax.imshow(image)
    ax.set_title(class_names[label], fontsize=10)
    ax.axis("off")

plt.tight_layout()
plt.show()

In [None]:
# --- Class distribution bar chart ---
fig, ax = plt.subplots(figsize=(16, 5))

sorted_classes = sorted(class_counts.items(), key=lambda x: x[1], reverse=True)
names = [class_names[c] for c, _ in sorted_classes]
counts = [cnt for _, cnt in sorted_classes]

ax.bar(range(len(names)), counts, color="steelblue", edgecolor="white")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=90, fontsize=8)
ax.set_ylabel("Number of Images")
ax.set_title("Class Distribution (Training Set)", fontsize=14, fontweight="bold")
ax.axhline(y=np.mean(counts), color="red", linestyle="--", label=f"Mean: {np.mean(counts):.0f}")
ax.legend()
plt.tight_layout()
plt.show()

---

## 6. Model Architectures

### What is Transfer Learning?

Transfer learning reuses a model trained on a **large source task** (ImageNet: 1.2M images, 1000 classes) for a **smaller target task** (our pet breeds: ~3,700 images, 37 classes). It works because:

1. **Early layers learn universal features** — edges, textures, colors that are useful for ANY image task
2. **Middle layers learn compositional features** — patterns like fur textures, eye shapes
3. **Final layers learn task-specific features** — only these need retraining

```
Pretrained Model (ImageNet)
┌─────────────────────────┐
│  Early Layers           │ ← Universal features (edges, textures)
│  (FROZEN — keep as-is)  │    These transfer well to any image task
├─────────────────────────┤
│  Middle Layers          │ ← Compositional features
│  (FREEZE or FINE-TUNE)  │    May need slight adjustment
├─────────────────────────┤
│  Final Classifier       │ ← Task-specific (1000 ImageNet classes)
│  (REPLACE with ours)    │    Replace with 37-class head
└─────────────────────────┘
```

### Feature Extraction vs Fine-Tuning

| Strategy | What Changes | When to Use |
|---|---|---|
| **Feature Extraction** | Only the new classifier head | Small dataset, similar to ImageNet |
| **Fine-Tuning** | Classifier + some backbone layers | Moderate dataset, somewhat different from ImageNet |
| **Full Fine-Tuning** | Everything | Large dataset, very different from ImageNet |

We use a **two-phase approach**: feature extraction first (to train the new head), then fine-tuning (to adapt the backbone).

### Our Four Models

#### ResNet-50 (2015)
- **Key innovation:** Skip connections (residual connections) that allow gradients to flow through shortcuts
- **Architecture:** 50 layers built from bottleneck residual blocks computing `x + F(x)`, where `F` is three conv layers (1×1, 3×3, 1×1)
- **Why it works:** Skip connections solve the degradation problem — deeper networks can be trained without vanishing gradients
- **Parameters:** 25.6M
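
A toy calculation can show why the identity shortcut helps gradients survive depth. The sketch below is not ResNet-50. It models each layer as a scalar function with an assumed constant derivative `F'(x) = 0.01` and multiplies the per-layer derivatives, as backpropagation would. `gradient_through_stack` is a hypothetical helper.

```python
def gradient_through_stack(local_grad, depth, residual):
    """Multiply per-layer derivatives for a stack of scalar layers.

    Plain layer:    y = F(x)      -> dy/dx = F'(x)
    Residual layer: y = x + F(x)  -> dy/dx = 1 + F'(x)
    """
    grad = 1.0
    for _ in range(depth):
        grad *= (1.0 + local_grad) if residual else local_grad
    return grad


LOCAL_GRAD = 0.01  # assumed small derivative F'(x) in every layer
DEPTH = 50

plain = gradient_through_stack(LOCAL_GRAD, DEPTH, residual=False)
skip = gradient_through_stack(LOCAL_GRAD, DEPTH, residual=True)

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

In the plain stack the gradient shrinks toward zero, while the `1 +` term from the shortcut keeps the residual stack's gradient at or above one in this setting.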

#### VGG-16 (2014)
- **Key innovation:** Uniform architecture — all 3×3 convolutions, simple and deep
- **Architecture:** 13 conv layers + 3 fully connected layers
- **Trade-off:** Very large (138M params), but simple to understand
- **Parameters:** 138M

#### EfficientNet-B0 (2019)
- **Key innovation:** Compound scaling — scales depth, width, and resolution together with a fixed ratio
- **Architecture:** Mobile inverted bottleneck blocks (MBConv) with squeeze-and-excitation
- **Why it works:** Achieves better accuracy with far fewer parameters than ResNet/VGG
- **Parameters:** 5.3M
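
Compound scaling can be sketched numerically. The EfficientNet paper scales depth, width, and resolution as `α^φ`, `β^φ`, and `γ^φ`, with `α = 1.2`, `β = 1.1`, `γ = 1.15` chosen so that `α · β² · γ² ≈ 2`. The snippet below only evaluates those formulas. The released B1 to B7 variants use rounded multipliers, so treat the numbers as illustrative. `compound_scale` is a hypothetical helper.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # coefficients reported in the EfficientNet paper


def compound_scale(phi):
    """Return (depth, width, resolution, approx FLOPs) multipliers for a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow roughly linearly with depth and quadratically with width and resolution
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Each unit increase in `φ` roughly doubles the compute budget, which is the fixed ratio the bullet above refers to.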

#### MobileNet-V2 (2018)
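- **Key innovation:** Inverted residual blocks with linear bottlenecks, built on depthwise separable convolutions (a standard convolution factorized into depthwise + pointwise, introduced in MobileNet-V1)
- **Architecture:** A stack of inverted residual blocks: expand with a 1×1 conv, filter with a 3×3 depthwise conv, then project with a linear 1×1 conv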
- **Why it works:** Designed for mobile/edge deployment — minimal compute with competitive accuracy
- **Parameters:** 3.4M
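
The saving from a depthwise separable convolution is easy to quantify. The sketch below counts weights only (biases and batch norm ignored) for one layer with assumed sizes of 128 input channels, 256 output channels, and a 3×3 kernel. Both function names are hypothetical.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 conv that mixes channels
    return depthwise + pointwise


C_IN, C_OUT, K = 128, 256, 3  # assumed layer sizes for illustration
standard = standard_conv_params(C_IN, C_OUT, K)
separable = depthwise_separable_params(C_IN, C_OUT, K)

print(standard)                         # 294912
print(separable)                        # 33920
print(f"{standard / separable:.1f}x fewer weights")
```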

### Modern Weights API

We use torchvision's modern weights API (`ResNet50_Weights.DEFAULT`, etc.) instead of the deprecated `pretrained=True`. This provides:
- Explicit weight versioning
- Associated preprocessing transforms
- Better reproducibility

In [None]:
# ============================================================
# 6. MODEL ARCHITECTURES
# ============================================================

def build_model(model_name, num_classes, freeze=True):
    """Build a pretrained model with a custom classifier head.

    Args:
        model_name: One of 'resnet50', 'vgg16', 'efficientnet_b0', 'mobilenet_v2'
        num_classes: Number of output classes
        freeze: If True, freeze the backbone (feature extraction mode)

    Returns:
        model: Modified pretrained model
    """
    if model_name == "resnet50":
        model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
        if freeze:
            for param in model.parameters():
                param.requires_grad = False
        # Replace final FC layer (originally 2048 → 1000)
        in_features = model.fc.in_features
        model.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(in_features, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    elif model_name == "vgg16":
        model = models.vgg16(weights=VGG16_Weights.DEFAULT)
        if freeze:
            for param in model.features.parameters():
                param.requires_grad = False
        # Replace classifier (originally 25088 → 4096 → 4096 → 1000)
        model.classifier = nn.Sequential(
            nn.Linear(25088, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    elif model_name == "efficientnet_b0":
        model = models.efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT)
        if freeze:
            for param in model.features.parameters():
                param.requires_grad = False
        # Replace classifier (originally 1280 → 1000)
        in_features = model.classifier[1].in_features
        model.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(in_features, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    elif model_name == "mobilenet_v2":
        model = models.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)
        if freeze:
            for param in model.features.parameters():
                param.requires_grad = False
        # Replace classifier (originally 1280 → 1000)
        in_features = model.classifier[1].in_features
        model.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(in_features, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    else:
        raise ValueError(f"Unknown model: {model_name}")

    return model


def unfreeze_last_n_layers(model, model_name, n=2):
    """Unfreeze the tail of the backbone for fine-tuning.

    Note: `n` is currently unused; the unfrozen span is hard-coded per model.
    """
    if model_name == "resnet50":
        # Unfreeze layer4 (the last residual stage)
        for param in model.layer4.parameters():
            param.requires_grad = True
    elif model_name == "vgg16":
        # Unfreeze the last 4 conv layers (weight + bias each = 8 tensors, i.e. features[21:])
        for param in list(model.features.parameters())[-8:]:
            param.requires_grad = True
    elif model_name == "efficientnet_b0":
        # Unfreeze the last 20 parameter tensors (final conv block plus the trailing MBConv layers)
        for param in list(model.features.parameters())[-20:]:
            param.requires_grad = True
    elif model_name == "mobilenet_v2":
        # Unfreeze the last 20 parameter tensors (final conv block plus the trailing inverted residual layers)
        for param in list(model.features.parameters())[-20:]:
            param.requires_grad = True


def count_params(model):
    """Count total and trainable parameters."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


# --- Preview all models ---
print(f"{'Model':<20} {'Total Params':>15} {'Trainable (frozen)':>20}")
print("=" * 60)
for name in MODEL_NAMES:
    m = build_model(name, NUM_CLASSES, freeze=True)
    total, trainable = count_params(m)
    print(f"{name:<20} {total:>15,} {trainable:>20,}")
    del m

---

## 7. Display Model Summary

Showing ResNet-50 as a representative example: the bottleneck blocks are clearly visible in the summary. The skip connections themselves are additions inside each block's `forward`, so they do not appear as separate layers.

In [None]:
# ============================================================
# 7. DISPLAY MODEL SUMMARY
# ============================================================

example_model = build_model("resnet50", NUM_CLASSES, freeze=True).to(DEVICE)
summary(example_model, input_size=(1, 3, IMG_SIZE, IMG_SIZE), col_names=["input_size", "output_size", "num_params"])
del example_model

---

## 8. Training Infrastructure

### Two-Phase Training Strategy

```
Phase 1: Feature Extraction (5 epochs)
┌────────────────────────────────────┐
│  Backbone: FROZEN (no gradients)   │
│  Classifier: TRAINING (lr=1e-4)    │
│  Goal: Train the new head quickly  │
└────────────────────────────────────┘
            ↓
Phase 2: Fine-Tuning (10 epochs)
┌────────────────────────────────────┐
│  Backbone: PARTIALLY UNFROZEN      │
│  Last few layers: lr=1e-5          │
│  Classifier: lr=1e-5               │
│  Goal: Adapt backbone features     │
└────────────────────────────────────┘
```

### Fine-Tuning Learning Rate

During fine-tuning, every trainable parameter (the unfrozen backbone layers and the classifier head) shares one optimizer and one learning rate, `FINETUNE_LR = 1e-5`, ten times lower than the `1e-4` used in Phase 1. The smaller steps keep the pretrained features from being destroyed while still allowing them to adapt:

- Backbone layers: `1e-5` (gentle updates to pretrained weights)
- Classifier head: `1e-5` (fine adjustments after Phase 1)
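
The notebook's `create_optimizer` applies a single learning rate to all trainable parameters. If separate rates for the backbone and the head are wanted, optimizer parameter groups are one way to get them. The sketch below is an assumption-laden illustration, not the notebook's method: the two-layer model, the `backbone`/`head` names, and both learning-rate values are made up for the example.

```python
import torch.nn as nn
import torch.optim as optim

# Tiny stand-in for a pretrained network (hypothetical module names)
model = nn.Sequential()
model.add_module("backbone", nn.Linear(8, 4))
model.add_module("head", nn.Linear(4, 3))

BACKBONE_LR = 1e-5  # assumed value: gentle updates to pretrained weights
HEAD_LR = 1e-4      # assumed value: faster updates for the new classifier

# Each dict is one parameter group with its own learning rate;
# options given outside the list (weight_decay) apply to every group.
optimizer = optim.Adam(
    [
        {"params": model.backbone.parameters(), "lr": BACKBONE_LR},
        {"params": model.head.parameters(), "lr": HEAD_LR},
    ],
    weight_decay=1e-4,
)

for group in optimizer.param_groups:
    print(f"lr={group['lr']:.0e}, tensors={len(group['params'])}")
```

With a real torchvision model, the groups would be built from the unfrozen backbone parameters and the replaced classifier, filtered on `requires_grad`.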

In [None]:
# ============================================================
# 8. TRAINING INFRASTRUCTURE
# ============================================================

# Loss function (shared across all models)
criterion = nn.CrossEntropyLoss()


def create_optimizer(model, lr):
    """Create Adam optimizer for trainable parameters only."""
    trainable_params = filter(lambda p: p.requires_grad, model.parameters())
    return optim.Adam(trainable_params, lr=lr, weight_decay=1e-4)


def create_scheduler(optimizer):
    """Create ReduceLROnPlateau scheduler."""
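    # `verbose` is deprecated in recent PyTorch releases, so it is omitted here;
    # read the current LR from optimizer.param_groups if you want to log it.
    return optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", patience=2, factor=0.5
    )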


print("Training infrastructure ready.")
print(f"Phase 1: Feature extraction — {FEATURE_EXTRACT_EPOCHS} epochs, LR={LEARNING_RATE}")
print(f"Phase 2: Fine-tuning — {FINETUNE_EPOCHS} epochs, LR={FINETUNE_LR}")

---

## 9. Training & Validation Functions

### Training Cycle

Same forward-backward-update cycle as Notebook 1:
```
Forward → Loss → Zero Grads → Backward → Update
```

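The key difference: during feature extraction the backbone is frozen, so gradients are computed only for the classifier head. The forward pass still runs through the whole network, but the backward pass and optimizer step are much cheaper, so each epoch is faster.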

In [None]:
# ============================================================
# 9. TRAINING & VALIDATION FUNCTIONS
# ============================================================

def train_one_epoch(model, loader, criterion, optimizer, device):
    """Train the model for one epoch."""
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in tqdm(loader, desc="Training", leave=False):
        images, labels = images.to(device), labels.to(device)

        outputs = model(images)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * images.size(0)
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    return running_loss / total, 100.0 * correct / total


@torch.no_grad()
def validate(model, loader, criterion, device):
    """Evaluate the model on validation/test data."""
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0

    for images, labels in tqdm(loader, desc="Validating", leave=False):
        images, labels = images.to(device), labels.to(device)

        outputs = model(images)
        loss = criterion(outputs, labels)

        running_loss += loss.item() * images.size(0)
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    return running_loss / total, 100.0 * correct / total


@torch.no_grad()
def test_model(model, loader, device):
    """Run inference and collect all predictions."""
    model.eval()
    all_labels = []
    all_preds = []
    all_probs = []

    for images, labels in tqdm(loader, desc="Testing"):
        images, labels = images.to(device), labels.to(device)

        outputs = model(images)
        probs = F.softmax(outputs, dim=1)
        _, predicted = outputs.max(1)

        all_labels.extend(labels.cpu().numpy())
        all_preds.extend(predicted.cpu().numpy())
        all_probs.extend(probs.cpu().numpy())

    return np.array(all_labels), np.array(all_preds), np.array(all_probs)


print("Training, validation, and test functions defined.")

---

## 10. Train & Compare All Models

### Comparison Methodology

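We train all four models on the **same data, same splits, and same training hyperparameters**. The architecture is the main variable, though it also determines the size of the classifier head and how much of the backbone is unfrozen in Phase 2, so the comparison is fair but not perfectly controlled.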

For each model:
1. **Phase 1 (Feature Extraction):** Freeze backbone, train classifier head for 5 epochs
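2. **Phase 2 (Fine-Tuning):** Unfreeze the last backbone layers, then train them together with the classifier head for up to 10 epochs at a lower LR (early stopping after 5 epochs without validation-loss improvement)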
3. **Evaluate** on the test set

In [None]:
# ============================================================
# 10. TRAIN & COMPARE ALL MODELS
# ============================================================

def train_model_two_phase(model_name, num_classes, train_loader, val_loader, device):
    """Train a model using two-phase transfer learning.

    Phase 1: Feature extraction (frozen backbone)
    Phase 2: Fine-tuning (partially unfrozen backbone)

    Returns:
        model: Trained model (best weights restored)
        history: Dict with training metrics per epoch
        total_time: Total training time in seconds
    """
    print(f"\n{'='*70}")
    print(f"Training: {model_name.upper()}")
    print(f"{'='*70}")

    history = {"train_loss": [], "val_loss": [], "train_acc": [], "val_acc": [], "phase": []}
    best_val_loss = float("inf")
    best_model_state = None
    total_start = time.time()

    # ---- PHASE 1: Feature Extraction ----
    print(f"\n--- Phase 1: Feature Extraction ({FEATURE_EXTRACT_EPOCHS} epochs) ---")
    model = build_model(model_name, num_classes, freeze=True).to(device)
    optimizer = create_optimizer(model, LEARNING_RATE)
    scheduler = create_scheduler(optimizer)

    total_p, trainable_p = count_params(model)
    print(f"Trainable params: {trainable_p:,} / {total_p:,} ({100*trainable_p/total_p:.1f}%)")

    for epoch in range(1, FEATURE_EXTRACT_EPOCHS + 1):
        epoch_start = time.time()
        train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc = validate(model, val_loader, criterion, device)
        scheduler.step(val_loss)

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(train_acc)
        history["val_acc"].append(val_acc)
        history["phase"].append(1)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_state = copy.deepcopy(model.state_dict())
            marker = " ★"
        else:
            marker = ""

        print(
            f"  [P1] Epoch {epoch}/{FEATURE_EXTRACT_EPOCHS} | "
            f"Train: {train_loss:.4f} / {train_acc:.1f}% | "
            f"Val: {val_loss:.4f} / {val_acc:.1f}% | "
            f"{time.time()-epoch_start:.1f}s{marker}"
        )

    # ---- PHASE 2: Fine-Tuning ----
    print(f"\n--- Phase 2: Fine-Tuning ({FINETUNE_EPOCHS} epochs) ---")
    unfreeze_last_n_layers(model, model_name)
    optimizer = create_optimizer(model, FINETUNE_LR)
    scheduler = create_scheduler(optimizer)

    total_p, trainable_p = count_params(model)
    print(f"Trainable params: {trainable_p:,} / {total_p:,} ({100*trainable_p/total_p:.1f}%)")

    patience_counter = 0

    for epoch in range(1, FINETUNE_EPOCHS + 1):
        epoch_start = time.time()
        train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc = validate(model, val_loader, criterion, device)
        scheduler.step(val_loss)

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(train_acc)
        history["val_acc"].append(val_acc)
        history["phase"].append(2)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_state = copy.deepcopy(model.state_dict())
            patience_counter = 0
            marker = " ★"
        else:
            patience_counter += 1
            marker = ""

        print(
            f"  [P2] Epoch {epoch}/{FINETUNE_EPOCHS} | "
            f"Train: {train_loss:.4f} / {train_acc:.1f}% | "
            f"Val: {val_loss:.4f} / {val_acc:.1f}% | "
            f"{time.time()-epoch_start:.1f}s{marker}"
        )

        if patience_counter >= EARLY_STOPPING_PATIENCE:
            print(f"  Early stopping at epoch {epoch}")
            break

    # Restore best model
    if best_model_state is not None:
        model.load_state_dict(best_model_state)

    total_time = time.time() - total_start
    print(f"\n  Total time: {total_time/60:.1f} min | Best val loss: {best_val_loss:.4f}")

    return model, history, total_time

In [None]:
# --- Train all models and collect results ---
all_results = {}

for model_name in MODEL_NAMES:
    # Train
    model, history, train_time = train_model_two_phase(
        model_name, NUM_CLASSES, train_loader, val_loader, DEVICE
    )

    # Test
    print(f"  Evaluating {model_name} on test set...")
    test_labels, test_preds, test_probs = test_model(model, test_loader, DEVICE)
    top1_acc = 100.0 * np.mean(test_labels == test_preds)
    top5_acc = 100.0 * top_k_accuracy_score(test_labels, test_probs, k=5)

    total_p, trainable_p = count_params(model)

    all_results[model_name] = {
        "model": model,
        "history": history,
        "train_time": train_time,
        "top1_acc": top1_acc,
        "top5_acc": top5_acc,
        "test_labels": test_labels,
        "test_preds": test_preds,
        "test_probs": test_probs,
        "total_params": total_p,
    }

    print(f"  ✓ {model_name}: Top-1={top1_acc:.2f}%, Top-5={top5_acc:.2f}%")

print(f"\n{'='*70}")
print("All models trained successfully!")
print(f"{'='*70}")

---

## 11. Comprehensive Evaluation

### Side-by-Side Comparison

We compare all four models on:
- **Test accuracy** (Top-1 and Top-5)
- **Training time**
- **Parameter count**
- **Training curves** (loss and accuracy over epochs)
- **Confusion matrix** (for the best model)
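
To clarify what Top-1 and Top-5 accuracy measure, here is a minimal plain-Python sketch of top-k accuracy on toy scores. The notebook itself uses scikit-learn's `top_k_accuracy_score`. The function `top_k_accuracy` below is a hypothetical stand-in and may treat ties differently.

```python
def top_k_accuracy(labels, scores, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for label, row in zip(labels, scores):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data: 4 samples, 3 classes (made-up scores)
labels = [0, 1, 2, 2]
scores = [
    [0.70, 0.20, 0.10],  # true class 0 ranked 1st
    [0.50, 0.30, 0.20],  # true class 1 ranked 2nd
    [0.40, 0.35, 0.25],  # true class 2 ranked 3rd
    [0.10, 0.30, 0.60],  # true class 2 ranked 1st
]

print(top_k_accuracy(labels, scores, k=1))  # 0.5
print(top_k_accuracy(labels, scores, k=2))  # 0.75
print(top_k_accuracy(labels, scores, k=3))  # 1.0
```

Top-5 accuracy is the same calculation with `k=5`, which is why it is always at least as high as Top-1.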

In [None]:
# ============================================================
# 11. COMPREHENSIVE EVALUATION
# ============================================================

# --- Comparison Table ---
comparison_data = []
for name in MODEL_NAMES:
    r = all_results[name]
    comparison_data.append({
        "Model": name,
        "Top-1 Acc (%)": f"{r['top1_acc']:.2f}",
        "Top-5 Acc (%)": f"{r['top5_acc']:.2f}",
        "Total Params": f"{r['total_params']:,}",
        "Train Time (min)": f"{r['train_time']/60:.1f}",
    })

comparison_df = pd.DataFrame(comparison_data)
print("\nModel Comparison Summary")
print("=" * 80)
print(comparison_df.to_string(index=False))

In [None]:
# --- Training curves for all models ---
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
colors = {"resnet50": "#2196F3", "vgg16": "#FF9800", "efficientnet_b0": "#4CAF50", "mobilenet_v2": "#E91E63"}

for name in MODEL_NAMES:
    h = all_results[name]["history"]
    epochs = range(1, len(h["train_loss"]) + 1)
    c = colors[name]

    # Train loss
    axes[0, 0].plot(epochs, h["train_loss"], "-o", color=c, label=name, markersize=3)
    # Val loss
    axes[0, 1].plot(epochs, h["val_loss"], "-o", color=c, label=name, markersize=3)
    # Train acc
    axes[1, 0].plot(epochs, h["train_acc"], "-o", color=c, label=name, markersize=3)
    # Val acc
    axes[1, 1].plot(epochs, h["val_acc"], "-o", color=c, label=name, markersize=3)

titles = ["Train Loss", "Validation Loss", "Train Accuracy (%)", "Validation Accuracy (%)"]
for ax, title in zip(axes.flat, titles):
    ax.set_title(title, fontsize=13, fontweight="bold")
    ax.set_xlabel("Epoch")
    ax.grid(True, alpha=0.3)
    # Draw the phase separator before building the legend so its label is included
    ax.axvline(x=FEATURE_EXTRACT_EPOCHS + 0.5, color="gray", linestyle="--", alpha=0.5, label="Phase boundary")
    ax.legend(fontsize=9)

plt.suptitle("Training History — All Models", fontsize=16, fontweight="bold")
plt.tight_layout()
plt.show()

In [None]:
# --- Bar chart: Test accuracy comparison ---
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Top-1 Accuracy
top1_scores = [all_results[n]["top1_acc"] for n in MODEL_NAMES]
bar_colors = [colors[n] for n in MODEL_NAMES]
bars1 = ax1.bar(MODEL_NAMES, top1_scores, color=bar_colors, edgecolor="white", linewidth=1.5)
ax1.set_ylabel("Accuracy (%)")
ax1.set_title("Top-1 Test Accuracy", fontsize=14, fontweight="bold")
for bar, score in zip(bars1, top1_scores):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3, f"{score:.1f}%",
             ha="center", va="bottom", fontweight="bold")
ax1.set_ylim(0, 100)

# Top-5 Accuracy
top5_scores = [all_results[n]["top5_acc"] for n in MODEL_NAMES]
bars2 = ax2.bar(MODEL_NAMES, top5_scores, color=bar_colors, edgecolor="white", linewidth=1.5)
ax2.set_ylabel("Accuracy (%)")
ax2.set_title("Top-5 Test Accuracy", fontsize=14, fontweight="bold")
for bar, score in zip(bars2, top5_scores):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3, f"{score:.1f}%",
             ha="center", va="bottom", fontweight="bold")
ax2.set_ylim(0, 100)

plt.suptitle("Model Comparison — Test Set", fontsize=15, fontweight="bold")
plt.tight_layout()
plt.show()

In [None]:
# --- Best model: selection + classification report ---
best_model_name = max(all_results, key=lambda n: all_results[n]["top1_acc"])
best = all_results[best_model_name]

print(f"Best model: {best_model_name.upper()} (Top-1: {best['top1_acc']:.2f}%, Top-5: {best['top5_acc']:.2f}%)")
print("\nClassification Report:")
print("=" * 70)
print(classification_report(best["test_labels"], best["test_preds"], target_names=class_names, zero_division=0))

In [None]:
# --- Confusion Matrix for best model ---
cm = confusion_matrix(best["test_labels"], best["test_preds"])

fig, ax = plt.subplots(figsize=(18, 16))
sns.heatmap(
    cm, annot=False, fmt="d", cmap="Blues",
    xticklabels=class_names, yticklabels=class_names, ax=ax,
)
ax.set_xlabel("Predicted", fontsize=12)
ax.set_ylabel("Actual", fontsize=12)
ax.set_title(
    f"Confusion Matrix — {best_model_name.upper()} (Acc: {best['top1_acc']:.2f}%)",
    fontsize=14, fontweight="bold"
)
plt.xticks(rotation=90, fontsize=7)
plt.yticks(rotation=0, fontsize=7)
plt.tight_layout()
plt.show()
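
With 37 classes and no cell annotations, the heatmap shows where errors cluster but not their exact counts. A minimal sketch of one way to list the most frequently confused pairs is given below. It assumes a confusion matrix laid out as in scikit-learn (rows are actual classes, columns are predicted classes); the helper `most_confused` and the toy matrix are hypothetical and not part of the notebook.

```python
import numpy as np

def most_confused(cm, class_names, top_n=3):
    """Return (actual, predicted, count) for the largest off-diagonal cells."""
    off_diag = np.array(cm, copy=True)
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    # Flat indices of all cells, sorted by count with the largest first
    order = np.argsort(off_diag, axis=None)[::-1][:top_n]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [
        (class_names[r], class_names[c], int(off_diag[r, c]))
        for r, c in zip(rows, cols)
        if off_diag[r, c] > 0  # skip cells with no errors
    ]

# Toy 3-class matrix (made-up counts): rows = actual, columns = predicted
toy_cm = np.array([
    [8, 2, 0],
    [5, 4, 1],
    [0, 0, 10],
])
toy_names = ["beagle", "basset_hound", "pug"]

pairs = most_confused(toy_cm, toy_names)
for actual, predicted, count in pairs:
    print(f"{actual} -> {predicted}: {count}")
```

In the notebook, the same helper could be called with `cm` and `class_names` from the cell above.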

---

## 12. Save Best Model

We save the best-performing model's `state_dict` together with metadata (class names, image size, and each model's test metrics) so it can be reloaded later without rerunning training.

In [None]:
# ============================================================
# 12. SAVE BEST MODEL
# ============================================================

save_path = f"dog_breed_transfer_{best_model_name}.pth"

torch.save({
    "model_name": best_model_name,
    "model_state_dict": best["model"].state_dict(),
    "num_classes": NUM_CLASSES,
    "img_size": IMG_SIZE,
    "class_names": class_names,
    # Cast NumPy scalars to built-in floats so the checkpoint holds only tensors and plain Python types
    "top1_accuracy": float(best["top1_acc"]),
    "top5_accuracy": float(best["top5_acc"]),
    "comparison_results": {
        name: {
            "top1_acc": float(all_results[name]["top1_acc"]),
            "top5_acc": float(all_results[name]["top5_acc"]),
            "train_time": all_results[name]["train_time"],
            "total_params": all_results[name]["total_params"],
        }
        for name in MODEL_NAMES
    },
}, save_path)

print(f"Best model ({best_model_name}) saved to '{save_path}'")
print(f"Top-1 Accuracy: {best['top1_acc']:.2f}%")
print(f"Top-5 Accuracy: {best['top5_acc']:.2f}%")
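
A checkpoint saved this way stores weights only, so the architecture has to be rebuilt before the weights can be restored. The sketch below shows that round trip on a toy model. `build_model` is a hypothetical stand-in for whatever code constructs the network in this notebook, and the checkpoint contents are made-up values that mirror the layout used above.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_model(num_classes):
    """Hypothetical stand-in for the notebook's model-building code."""
    return nn.Linear(8, num_classes)

demo_path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.pth")

# Save a toy checkpoint with the same kind of layout as the cell above
source_model = build_model(num_classes=3)
torch.save({
    "model_name": "demo",
    "model_state_dict": source_model.state_dict(),
    "num_classes": 3,
    "class_names": ["a", "b", "c"],
}, demo_path)

# Reload: rebuild the architecture first, then restore the weights
checkpoint = torch.load(demo_path, map_location="cpu")
restored = build_model(checkpoint["num_classes"])
restored.load_state_dict(checkpoint["model_state_dict"])
restored.eval()  # inference mode: disables dropout and uses BatchNorm running statistics

print(f"Restored '{checkpoint['model_name']}' with {len(checkpoint['class_names'])} classes")
```

Keeping `num_classes` and `class_names` inside the checkpoint means the classifier head can be rebuilt with the right size and predictions can be mapped back to breed names without access to the original dataset.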

---

## 13. Bonus: Stanford Dogs Extension

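The **Stanford Dogs Dataset** (120 breeds, 20,580 images) is a harder fine-grained classification task: it has more than three times as many classes as Oxford-IIIT Pet, and many breeds differ only in subtle details. Transfer learning is especially useful here because pretrained features already encode the textures and shapes that separate similar breeds, which a CNN trained from scratch on roughly 170 images per class would struggle to learn. Note that Stanford Dogs was built from ImageNet images, so an ImageNet-pretrained backbone may already have seen some of the test images, and accuracy measured this way can be optimistic.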

### Expected Performance Boost

| Dataset | CNN From Scratch (approx. accuracy) | Transfer Learning (approx. accuracy) |
|---|---|---|
| Oxford-IIIT Pet (37 classes) | ~30-40% | ~85-92% |
| Stanford Dogs (120 classes) | ~10-20% | ~70-80% |

> **Note:** This section requires a Kaggle API key or running on Kaggle where the dataset is available directly.

In [None]:
# ============================================================
# 13. BONUS: STANFORD DOGS EXTENSION
# ============================================================

# Set STANFORD_DOGS = True to train on Stanford Dogs (120 breeds).
# On Colab, also uncomment the download commands inside the block below.

STANFORD_DOGS = False  # Set to True to enable

if STANFORD_DOGS:
    import os
    from torchvision.datasets import ImageFolder

    # --- Download dataset (Colab only — skip on Kaggle) ---
    # !pip install kaggle -q
    # !mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
    # !kaggle datasets download -d jessicali9530/stanford-dogs-dataset -p ./data/stanford_dogs --unzip

    # --- Configuration ---
    STANFORD_NUM_CLASSES = 120

    # --- Paths ---
    # Kaggle: TRAIN_DIR = "/kaggle/input/stanford-dogs-dataset/images/Images"
    TRAIN_DIR = "./data/stanford_dogs/images/Images"

    # --- Load with ImageFolder ---
    stanford_full = ImageFolder(root=TRAIN_DIR)
    print(f"Stanford Dogs: {len(stanford_full)} images, {len(stanford_full.classes)} classes")

    # --- Split: 80% train, 10% val, 10% test ---
    total = len(stanford_full)
    train_n = int(0.8 * total)
    val_n = int(0.1 * total)
    test_n = total - train_n - val_n

    generator = torch.Generator().manual_seed(SEED)
    sd_train, sd_val, sd_test = random_split(stanford_full, [train_n, val_n, test_n], generator=generator)

    # Apply transforms
    sd_train_dataset = TransformSubset(sd_train, transform=train_transforms)
    sd_val_dataset = TransformSubset(sd_val, transform=val_test_transforms)
    sd_test_dataset = TransformSubset(sd_test, transform=val_test_transforms)

    # DataLoaders
    sd_train_loader = DataLoader(sd_train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
    sd_val_loader = DataLoader(sd_val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
    sd_test_loader = DataLoader(sd_test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)

    print(f"Train: {len(sd_train_dataset)} | Val: {len(sd_val_dataset)} | Test: {len(sd_test_dataset)}")

    # --- Train best architecture on Stanford Dogs ---
    print(f"\nTraining {best_model_name} on Stanford Dogs ({STANFORD_NUM_CLASSES} classes)...")
    sd_model, sd_history, sd_time = train_model_two_phase(
        best_model_name, STANFORD_NUM_CLASSES, sd_train_loader, sd_val_loader, DEVICE
    )

    # Evaluate
    sd_labels, sd_preds, sd_probs = test_model(sd_model, sd_test_loader, DEVICE)
    sd_top1 = 100.0 * np.mean(sd_labels == sd_preds)
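    # With 120 classes and a small test split, pass labels explicitly in case a breed is missing from it
    sd_top5 = 100.0 * top_k_accuracy_score(sd_labels, sd_probs, k=5, labels=np.arange(STANFORD_NUM_CLASSES))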
    print(f"\nStanford Dogs Results — Top-1: {sd_top1:.2f}%, Top-5: {sd_top5:.2f}%")
else:
    print("Stanford Dogs extension is disabled. Set STANFORD_DOGS = True to enable.")
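
The `random_split` call above shuffles all images together, so it does not guarantee that every breed is represented in the same proportion in each split. One possible alternative is a per-class (stratified) split, sketched below in plain NumPy. The helper `stratified_split` and the toy labels are hypothetical; the 80/10/10 fractions follow the cell above.

```python
import numpy as np

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split sample indices so each class is divided in roughly the given fractions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, val_idx, test_idx = [], [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class only
        idx = rng.permutation(np.flatnonzero(labels == cls)).tolist()
        n_train = int(fractions[0] * len(idx))
        n_val = int(fractions[1] * len(idx))
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:n_train + n_val])
        test_idx.extend(idx[n_train + n_val:])  # remainder goes to test
    return train_idx, val_idx, test_idx

# Toy labels: 3 classes with 20 samples each
demo_labels = np.repeat(np.arange(3), 20)
demo_train, demo_val, demo_test = stratified_split(demo_labels)
print(len(demo_train), len(demo_val), len(demo_test))
```

With an `ImageFolder` dataset, the label array would come from its `targets` attribute, and each index list could be wrapped in `torch.utils.data.Subset` before applying the transforms.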