# 4. Transfer Learning with ResNet18

## Overview

In this notebook, we'll use **transfer learning** to leverage a model pretrained on ImageNet.

### What is Transfer Learning?

Instead of training from scratch, we:
1. Start with a model already trained on a large dataset (ImageNet: 1.2M images, 1000 classes)
2. Replace the final classification layer for our 7 emotions
3. Fine-tune on our specific dataset

### Why Transfer Learning?

| Benefit | Explanation |
|---------|-------------|
| **Better features** | Pretrained models learned robust features from millions of images |
| **Faster training** | Only need to fine-tune, not learn from scratch |
| **Less data needed** | Features already generalize well |
| **Higher accuracy** | Usually outperforms training from scratch, especially when the target dataset is small |

### Our Strategy:

**Two-phase training:**
1. **Phase 1**: Freeze backbone, train only classifier (fast)
2. **Phase 2**: Unfreeze, fine-tune entire network with low learning rate

## Step 1: Import Libraries

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import pickle
import json
import time

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models

from PIL import Image
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
from sklearn.metrics import classification_report, confusion_matrix

torch.manual_seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Step 2: Configuration

In [None]:
# Configuration
IMG_SIZE = 224
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

BATCH_SIZE = 32
NUM_EPOCHS = 25
LEARNING_RATE = 0.001      # For phase 1 (classifier only)
FINE_TUNE_LR = 0.0001      # For phase 2 (full network)
PATIENCE = 5

EMOTION_CLASSES = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]
NUM_CLASSES = len(EMOTION_CLASSES)
IDX_TO_EMOTION = {i: e for i, e in enumerate(EMOTION_CLASSES)}

print("Configuration loaded!")

## Step 3: Load Data

In [None]:
# Load processed data
with open('processed_data.pkl', 'rb') as f:
    data = pickle.load(f)

train_df = data['train_df']
val_df = data['val_df']
test_df = data['test_df']

print(f"Training: {len(train_df)}, Validation: {len(val_df)}, Test: {len(test_df)}")

In [None]:
# Dataset and DataLoaders (same as before)
train_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
])

test_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
])

class FacialExpressionDataset(Dataset):
    def __init__(self, dataframe, transform=None):
        self.data = dataframe.reset_index(drop=True)
        self.transform = transform
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        image = Image.open(row['image_path']).convert('RGB')
        label = row['label']
        if self.transform:
            image = self.transform(image)
        return image, label

train_dataset = FacialExpressionDataset(train_df, transform=train_transform)
val_dataset = FacialExpressionDataset(val_df, transform=test_transform)
test_dataset = FacialExpressionDataset(test_df, transform=test_transform)

# Pinned memory only speeds up host-to-GPU copies; on CPU-only machines it just triggers a warning
PIN_MEMORY = torch.cuda.is_available()
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0, pin_memory=PIN_MEMORY)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=0, pin_memory=PIN_MEMORY)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=0, pin_memory=PIN_MEMORY)

print("Data loaders created!")
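
To make the `Normalize` step in the transforms concrete: `ToTensor()` scales pixel values to [0, 1], and `Normalize` then computes `(x - mean) / std` for each channel. The sketch below reproduces that arithmetic in plain Python for a single pixel, using the ImageNet statistics from Step 2. The helper name `normalize_pixel` is hypothetical and exists only for this illustration.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb_uint8):
    """Mimic ToTensor() followed by Normalize() for one RGB pixel."""
    scaled = [value / 255.0 for value in rgb_uint8]  # ToTensor: 0..255 -> 0..1
    return [(x - m) / s for x, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print("black:", [round(v, 3) for v in black])
print("white:", [round(v, 3) for v in white])
```

A black pixel maps to negative values and a white pixel to positive ones, so the network sees inputs on roughly the same scale as the images it was pretrained on.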

## Step 4: Understanding ResNet18

### What is ResNet?

**ResNet (Residual Network)** introduced skip connections that allow training very deep networks.

```
Standard block:    Input → Conv → Conv → Output
Residual block:    Input → Conv → Conv → Add → Output
                     └────────────────────↑
                       (skip connection)
```

### Why Skip Connections Help:
- **Gradient flow**: Gradients can flow directly through skip connections
- **Identity mapping**: If a layer isn't needed, it can learn to be identity
- **Deeper networks**: Enables training 100+ layer networks
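
As a rough illustration of the idea, here is a simplified residual block in PyTorch. It is a sketch, not torchvision's actual implementation: the class name `BasicResidualBlock` is made up for this example, and it assumes the input and output have the same number of channels and spatial size (torchvision's `BasicBlock` also handles strides and a downsampling projection).

```python
import torch
import torch.nn as nn


class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions plus an identity skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                                  # keep the input for the skip path
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)              # add the input back, then activate


block = BasicResidualBlock(64)
x = torch.randn(2, 64, 56, 56)
y = block(x)
print(y.shape)
```

Because the block only has to learn the difference between its input and the desired output, setting the convolution weights close to zero makes it behave almost like an identity mapping.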

### ResNet18 Architecture:
```
conv1 (7x7, 64 channels)
    ↓
maxpool
    ↓
layer1 (2 residual blocks, 64 channels)
    ↓
layer2 (2 residual blocks, 128 channels)
    ↓
layer3 (2 residual blocks, 256 channels)
    ↓
layer4 (2 residual blocks, 512 channels)
    ↓
avgpool (global average pooling)
    ↓
fc (512 → 1000 classes)  ← We replace this!
```
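
To see how a 224×224 input shrinks on its way through this architecture, we can apply the standard convolution output-size formula stage by stage. The kernel, stride, and padding values below are the ones torchvision's ResNet18 uses for `conv1`, `maxpool`, and the first convolution of `layer2` to `layer4`. The helper `conv_out` is a hypothetical name for this sketch.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


trace = {}
size = 224

size = conv_out(size, kernel=7, stride=2, padding=3)
trace["conv1"] = size

size = conv_out(size, kernel=3, stride=2, padding=1)
trace["maxpool"] = size

trace["layer1"] = size  # layer1 uses stride 1, so the size is unchanged

for name in ("layer2", "layer3", "layer4"):
    size = conv_out(size, kernel=3, stride=2, padding=1)  # first conv of each stage halves the size
    trace[name] = size

for name, s in trace.items():
    print(f"{name:8s} {s}x{s}")
```

The final 7×7 map with 512 channels is what global average pooling collapses into the 512-dimensional vector fed to `fc`.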

In [None]:
# Let's look at the pretrained ResNet18
resnet_demo = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

print("ResNet18 Architecture:")
print("=" * 60)

# Show each main component
for name, module in resnet_demo.named_children():
    if isinstance(module, nn.Sequential):
        print(f"{name}: {len(module)} blocks")
    else:
        print(f"{name}: {module.__class__.__name__}")

print(f"\nOriginal fc layer: {resnet_demo.fc}")
print(f"Input features to fc: {resnet_demo.fc.in_features}")
print(f"Output classes: {resnet_demo.fc.out_features}")

## Step 5: Create Transfer Learning Model

In [None]:
class TransferLearningModel(nn.Module):
    """
    Transfer learning model using ResNet18 pretrained on ImageNet.
    
    Strategy:
    1. Load pretrained ResNet18 weights
    2. Replace final fc layer with our classifier
    3. Optionally freeze backbone for initial training
    """
    
    def __init__(self, num_classes=NUM_CLASSES, freeze_backbone=True):
        super().__init__()
        
        # Load pretrained ResNet18
        # weights=IMAGENET1K_V1 loads ImageNet pretrained weights
        self.backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        
        # Freeze backbone if specified
        # This prevents updating pretrained weights initially
        if freeze_backbone:
            for param in self.backbone.parameters():
                param.requires_grad = False
        
        # Replace the final fully connected layer
        # Original: Linear(512, 1000) for ImageNet
        # New: Dropout + Linear(512, 7) for our emotions
        num_features = self.backbone.fc.in_features  # 512
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(num_features, num_classes)
        )
    
    def forward(self, x):
        return self.backbone(x)
    
    def unfreeze_backbone(self):
        """
        Unfreeze all backbone layers for fine-tuning.
        
        Call this after phase 1 training to enable
        updating all weights with a lower learning rate.
        """
        for param in self.backbone.parameters():
            param.requires_grad = True
        print("Backbone unfrozen - all layers now trainable")
    
    def get_feature_maps(self, x):
        """Get feature maps from last conv layer (for Grad-CAM)."""
        x = self.backbone.conv1(x)
        x = self.backbone.bn1(x)
        x = self.backbone.relu(x)
        x = self.backbone.maxpool(x)
        x = self.backbone.layer1(x)
        x = self.backbone.layer2(x)
        x = self.backbone.layer3(x)
        x = self.backbone.layer4(x)
        return x

print("TransferLearningModel defined!")

In [None]:
# Create model with frozen backbone
model = TransferLearningModel(num_classes=NUM_CLASSES, freeze_backbone=True)
model = model.to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_params = total_params - trainable_params

print("=" * 60)
print("MODEL SUMMARY (Phase 1: Backbone Frozen)")
print("=" * 60)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,} ({100*trainable_params/total_params:.1f}%)")
print(f"Frozen parameters: {frozen_params:,} ({100*frozen_params/total_params:.1f}%)")

# This shows that most parameters are frozen!
# Only the new fc layer is trainable
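
The counts printed above can be checked by hand. A `Linear(512, 7)` layer has one weight per input-output pair plus one bias per output, and `Dropout` adds no parameters. The sketch below works through that arithmetic. The figure of 11,689,512 total parameters for torchvision's ResNet18 with its original 1000-class head is quoted from memory, so treat the cell output above as the authoritative number.

```python
in_features = 512   # width of ResNet18's pooled feature vector
num_classes = 7     # emotion classes in this notebook

head_params = in_features * num_classes + num_classes   # weights + biases
imagenet_head_params = in_features * 1000 + 1000         # original fc layer

resnet18_total = 11_689_512                               # assumed: torchvision ResNet18, 1000 classes
backbone_params = resnet18_total - imagenet_head_params
total_params = backbone_params + head_params

print(f"New head:        {head_params:,}")
print(f"Frozen backbone: {backbone_params:,}")
print(f"Trainable share: {100 * head_params / total_params:.3f}%")
```

With the backbone frozen, well under 0.1% of the model is trainable, which is why the one-decimal percentage in the summary can round to 0.0%.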

## Step 6: Training Functions

In [None]:
def train_one_epoch(model, train_loader, criterion, optimizer, device):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    
    pbar = tqdm(train_loader, desc="Training", leave=False)
    for images, labels in pbar:
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item() * images.size(0)
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
        
        pbar.set_postfix({'loss': f'{loss.item():.4f}', 'acc': f'{100.*correct/total:.2f}%'})
    
    return running_loss / total, correct / total


def validate(model, val_loader, criterion, device):
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            running_loss += loss.item() * images.size(0)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    return running_loss / total, correct / total


class EarlyStopping:
    def __init__(self, patience=PATIENCE, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float('inf')
        self.should_stop = False
    
    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            return True
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True
            return False

print("Training functions defined!")
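
To see how `EarlyStopping` behaves, it helps to trace it on a short list of validation losses. The block below repeats the class from the cell above (with the default patience written out) so that it runs on its own. The loss values are invented for illustration.

```python
class EarlyStopping:
    """Copy of the notebook's EarlyStopping so this example is self-contained."""

    def __init__(self, patience=5, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float('inf')
        self.should_stop = False

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            return True
        self.counter += 1
        if self.counter >= self.patience:
            self.should_stop = True
        return False


stopper = EarlyStopping(patience=3, min_delta=0.001)
val_losses = [1.20, 1.10, 1.0995, 1.11, 1.12, 1.05]  # made-up values

log = []
for epoch, loss in enumerate(val_losses, start=1):
    improved = stopper(loss)
    log.append((epoch, improved, stopper.counter, stopper.should_stop))
    print(f"epoch {epoch}: loss={loss:.4f} improved={improved} counter={stopper.counter}")
    if stopper.should_stop:
        break
```

Epoch 3 improves on the best loss by only 0.0005, which is less than `min_delta`, so it counts as a bad epoch. After three bad epochs in a row, training stops at epoch 5 and the lower loss at epoch 6 is never seen. That is the trade-off a small patience value makes.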

## Step 7: Phase 1 - Train Classifier Only

In Phase 1:
- Backbone is **frozen** (pretrained weights preserved)
- Only train the new classification layer
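- Use the base learning rate (`LEARNING_RATE = 0.001`)
- This is fast because no gradients are computed for the backbone and only the new head's 3,591 parameters (512 × 7 weights + 7 biases) are updated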
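
One subtlety worth knowing about, shown here as an optional aside rather than something this notebook does: setting `requires_grad = False` stops gradient updates, but `model.train()` still lets BatchNorm layers update their running mean and variance. The sketch below demonstrates this on a tiny stand-in network (not ResNet18) and shows one possible way to keep those statistics fixed. The helper `freeze_batchnorm_stats` is a hypothetical name.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in for "frozen backbone + new head" (not ResNet18).
backbone = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 7))
model = nn.Sequential(backbone, head)

for p in backbone.parameters():
    p.requires_grad = False


def freeze_batchnorm_stats(module):
    """Put BatchNorm2d layers in eval mode so their running statistics stay fixed."""
    for m in module.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()


bn = backbone[1]
x = torch.randn(4, 3, 16, 16)

# Case 1: plain train() mode. Running stats move even though weights are frozen.
model.train()
before = bn.running_mean.clone()
model(x)
drifted = not torch.equal(before, bn.running_mean)

# Case 2: train() mode, then BatchNorm switched back to eval.
model.train()
freeze_batchnorm_stats(backbone)
before = bn.running_mean.clone()
model(x)
stable = torch.equal(before, bn.running_mean)

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
print(f"drifted={drifted}, stable={stable}")
```

Note that every call to `model.train()` puts BatchNorm back into training mode, so a helper like this would need to be called again at the start of each training epoch.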

In [None]:
# Phase 1 setup
criterion = nn.CrossEntropyLoss()
# Pass only the trainable (unfrozen) parameters to the optimizer
optimizer = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=LEARNING_RATE)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)
early_stopping = EarlyStopping(patience=3)  # Shorter patience for phase 1

phase1_epochs = NUM_EPOCHS // 2
history_phase1 = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

print("=" * 60)
print("PHASE 1: Training Classifier (Backbone Frozen)")
print("=" * 60)
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
print(f"Training for up to {phase1_epochs} epochs...")
print("-" * 60)

start_time = time.time()
best_model_state = None

for epoch in range(phase1_epochs):
    print(f"\nPhase 1 - Epoch {epoch + 1}/{phase1_epochs}")
    
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = validate(model, val_loader, criterion, device)
    
    scheduler.step(val_loss)
    
    history_phase1['train_loss'].append(train_loss)
    history_phase1['train_acc'].append(train_acc)
    history_phase1['val_loss'].append(val_loss)
    history_phase1['val_acc'].append(val_acc)
    
    print(f"  Train: loss={train_loss:.4f}, acc={train_acc:.4f}")
    print(f"  Val:   loss={val_loss:.4f}, acc={val_acc:.4f}")
    
    is_best = early_stopping(val_loss)
    if is_best:
        print("  ✓ Best model so far!")
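        # Clone each tensor: state_dict().copy() is shallow and would keep tracking the live weights
        best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}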
    
    if early_stopping.should_stop:
        print("\nEarly stopping phase 1")
        break

phase1_time = time.time() - start_time
print(f"\nPhase 1 completed in {phase1_time/60:.1f} minutes")
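
The `ReduceLROnPlateau` scheduler used in this loop lowers the learning rate when the validation loss stops improving. With `factor=0.5` and `patience=2`, the rate is halved once the loss has failed to improve for more than two epochs in a row. The sketch below feeds made-up validation losses to a scheduler configured the same way. The single dummy parameter only exists so the optimizer has something to hold.

```python
import torch
import torch.nn as nn
import torch.optim as optim

dummy_param = nn.Parameter(torch.zeros(1))
optimizer = optim.Adam([dummy_param], lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)

val_losses = [1.00, 0.90, 0.90, 0.90, 0.90]  # made-up values: improvement, then a plateau

lrs = []
for epoch, loss in enumerate(val_losses, start=1):
    scheduler.step(loss)                       # scheduler sees the validation loss
    lr = optimizer.param_groups[0]['lr']
    lrs.append(lr)
    print(f"epoch {epoch}: val_loss={loss:.2f} lr={lr}")
```

The rate stays at 0.001 through epoch 4 and drops to 0.0005 at epoch 5, the third consecutive epoch without improvement.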

## Step 8: Phase 2 - Fine-tune Entire Network

In Phase 2:
- **Unfreeze** the backbone
- Use a **lower learning rate** (10x smaller)
- Fine-tune all layers together
- The pretrained features get slightly adjusted for our task

In [None]:
# Load best model from phase 1
model.load_state_dict(best_model_state)

# Unfreeze backbone
model.unfreeze_backbone()

# Phase 2 setup with lower learning rate
optimizer = optim.Adam(model.parameters(), lr=FINE_TUNE_LR)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)
early_stopping = EarlyStopping(patience=PATIENCE)

phase2_epochs = NUM_EPOCHS - phase1_epochs  # remaining budget, so both phases sum to NUM_EPOCHS
history_phase2 = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("=" * 60)
print("PHASE 2: Fine-tuning Entire Network")
print("=" * 60)
print(f"Trainable parameters: {trainable_params:,}")
print(f"Learning rate: {FINE_TUNE_LR} (10x lower than phase 1)")
print(f"Training for up to {phase2_epochs} epochs...")
print("-" * 60)

start_time = time.time()
best_epoch = 0

for epoch in range(phase2_epochs):
    print(f"\nPhase 2 - Epoch {epoch + 1}/{phase2_epochs}")
    
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = validate(model, val_loader, criterion, device)
    
    scheduler.step(val_loss)
    
    history_phase2['train_loss'].append(train_loss)
    history_phase2['train_acc'].append(train_acc)
    history_phase2['val_loss'].append(val_loss)
    history_phase2['val_acc'].append(val_acc)
    
    print(f"  Train: loss={train_loss:.4f}, acc={train_acc:.4f}")
    print(f"  Val:   loss={val_loss:.4f}, acc={val_acc:.4f}")
    
    is_best = early_stopping(val_loss)
    if is_best:
        print("  ✓ New best model!")
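        # Clone each tensor: state_dict().copy() is shallow and would keep tracking the live weights
        best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}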
        best_epoch = epoch
    
    if early_stopping.should_stop:
        print("\nEarly stopping phase 2")
        break

phase2_time = time.time() - start_time
print(f"\nPhase 2 completed in {phase2_time/60:.1f} minutes")

# Load final best model
model.load_state_dict(best_model_state)
print(f"Loaded best model from phase 2 epoch {best_epoch + 1}")
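
A note on keeping the best weights in memory: the tensors returned by `state_dict()` share storage with the live model, so a shallow `dict.copy()` keeps changing as training continues. A true snapshot needs the tensors themselves copied, for example with `copy.deepcopy` or by cloning each tensor. The sketch below shows the difference on a small `nn.Linear` layer, with an in-place addition standing in for an optimizer step.

```python
import copy

import torch
import torch.nn as nn

layer = nn.Linear(2, 1)
original_weight = layer.weight.detach().clone()

shallow_snapshot = layer.state_dict().copy()          # new dict, same tensors
deep_snapshot = copy.deepcopy(layer.state_dict())     # new dict, new tensors

with torch.no_grad():
    layer.weight.add_(1.0)                            # stand-in for an optimizer update

shallow_changed = not torch.equal(shallow_snapshot['weight'], original_weight)
deep_unchanged = torch.equal(deep_snapshot['weight'], original_weight)
print(f"shallow snapshot changed: {shallow_changed}")
print(f"deep snapshot unchanged:  {deep_unchanged}")
```

With a shallow copy, "loading the best model" at the end would simply reload the weights from the last epoch.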

## Step 9: Visualize Training History

In [None]:
# Combine histories
all_train_loss = history_phase1['train_loss'] + history_phase2['train_loss']
all_val_loss = history_phase1['val_loss'] + history_phase2['val_loss']
all_train_acc = history_phase1['train_acc'] + history_phase2['train_acc']
all_val_acc = history_phase1['val_acc'] + history_phase2['val_acc']

phase1_end = len(history_phase1['train_loss'])
epochs = range(1, len(all_train_loss) + 1)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss plot
axes[0].plot(epochs, all_train_loss, 'b-', label='Training')
axes[0].plot(epochs, all_val_loss, 'r-', label='Validation')
axes[0].axvline(phase1_end, color='green', linestyle='--', label='Phase 1 → 2', alpha=0.7)
y_max = max(all_train_loss + all_val_loss)  # shade up to the highest point of either curve
axes[0].fill_between(range(1, phase1_end + 1), 0, y_max,
                     alpha=0.1, color='blue', label='Phase 1 (frozen)')
axes[0].fill_between(range(phase1_end, len(epochs) + 1), 0, y_max,
                     alpha=0.1, color='orange', label='Phase 2 (fine-tune)')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy plot
axes[1].plot(epochs, all_train_acc, 'b-', label='Training')
axes[1].plot(epochs, all_val_acc, 'r-', label='Validation')
axes[1].axvline(phase1_end, color='green', linestyle='--', label='Phase 1 → 2', alpha=0.7)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training and Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('Transfer Learning Training History (Two-Phase)', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

## Step 10: Final Evaluation

In [None]:
# Test evaluation
test_loss, test_acc = validate(model, test_loader, criterion, device)

print("=" * 60)
print("FINAL RESULTS (Transfer Learning)")
print("=" * 60)
print(f"\nTest Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_acc:.4f} ({test_acc*100:.2f}%)")

In [None]:
# Get predictions and detailed metrics
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = model(images)
        _, preds = outputs.max(1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.numpy())

all_preds = np.array(all_preds)
all_labels = np.array(all_labels)

print("\nClassification Report:")
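# Passing labels explicitly keeps the report aligned with EMOTION_CLASSES even if a class is missing from the test set
print(classification_report(all_labels, all_preds, labels=list(range(NUM_CLASSES)), target_names=EMOTION_CLASSES))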

In [None]:
# Confusion matrix
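# Fix the label order so the matrix is always NUM_CLASSES x NUM_CLASSES
cm = confusion_matrix(all_labels, all_preds, labels=list(range(NUM_CLASSES)))
# Row-normalize; np.maximum avoids division by zero for classes with no test samples
cm_normalized = cm.astype('float') / np.maximum(cm.sum(axis=1, keepdims=True), 1)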

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=EMOTION_CLASSES, yticklabels=EMOTION_CLASSES, ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('True')
axes[0].set_title('Confusion Matrix (Counts)')

sns.heatmap(cm_normalized, annot=True, fmt='.1%', cmap='Blues',
            xticklabels=EMOTION_CLASSES, yticklabels=EMOTION_CLASSES, ax=axes[1])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('True')
axes[1].set_title('Confusion Matrix (Normalized)')

plt.suptitle('Transfer Learning (ResNet18) Results', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
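
Row-normalizing a confusion matrix divides each row by the number of true samples of that class, so the diagonal of the normalized matrix is the per-class recall. The sketch below works through a small made-up 3-class example, including a class with no samples, to show one way of avoiding division by zero.

```python
import numpy as np

# Made-up counts: rows are true classes, columns are predictions.
cm = np.array([[8, 2, 0],
               [1, 6, 3],
               [0, 0, 0]])   # the last class has no test samples

row_sums = cm.sum(axis=1, keepdims=True)
cm_normalized = np.divide(
    cm, row_sums,
    out=np.zeros(cm.shape, dtype=float),   # rows with zero samples stay at 0
    where=row_sums != 0,
)
per_class_recall = np.diag(cm_normalized)

print(cm_normalized)
print("Recall per class:", per_class_recall)
```

Reading the normalized matrix row by row shows where each true class ends up: in this toy example, 30% of the second class is mistaken for the third.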

## Step 11: Save Model and Results

In [None]:
# Save model
torch.save({
    'model_state_dict': model.state_dict(),
    'test_acc': test_acc,
    'test_loss': test_loss
}, 'transfer_learning_best.pth')

print("Model saved!")

# Save results for comparison
transfer_results = {
    'model_name': 'Transfer Learning (ResNet18)',
    'test_accuracy': test_acc,
    'test_loss': test_loss,
    'phase1_time_minutes': phase1_time / 60,
    'phase2_time_minutes': phase2_time / 60,
    'confusion_matrix': cm.tolist()
}

with open('transfer_learning_results.pkl', 'wb') as f:
    pickle.dump(transfer_results, f)

print("Results saved!")
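
For later notebooks, a checkpoint saved in this format can be restored by rebuilding the same architecture and loading the stored `state_dict` into it. The sketch below shows that round trip with a small stand-in model and a temporary file. In this notebook, `build_model` (a hypothetical helper) would instead return `TransferLearningModel(...)`, and the metric values here are placeholders, not results.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    """Stand-in architecture; must match the one that produced the checkpoint."""
    return nn.Sequential(nn.Dropout(0.5), nn.Linear(512, 7))


trained = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")
    torch.save({
        'model_state_dict': trained.state_dict(),
        'test_acc': 0.0,    # placeholder value
        'test_loss': 0.0,   # placeholder value
    }, path)

    # map_location lets a checkpoint saved on GPU load on a CPU-only machine
    checkpoint = torch.load(path, map_location="cpu")

restored = build_model()
restored.load_state_dict(checkpoint['model_state_dict'])
restored.eval()  # disable dropout before running inference

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(), restored.state_dict().values())
)
print(f"Weights restored exactly: {weights_match}")
```

Calling `eval()` after loading matters here because the classifier head contains a `Dropout` layer, which would otherwise randomly zero features at inference time.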

## Summary

### What We Learned:

1. **Transfer Learning** leverages pretrained knowledge from ImageNet
2. **Two-phase training** is effective: freeze first, then fine-tune
3. **Lower learning rate** for fine-tuning prevents destroying pretrained features

### Results Comparison:

| Model | Test Accuracy |
|-------|---------------|
| HOG + SVM | ~XX% |
| Custom CNN | ~XX% |
| ResNet18 (Transfer) | ~XX% |

### Key Takeaways:

- **Pretrained models** often outperform training from scratch
- **Fine-tuning** adapts general features to your specific task
- **ImageNet features**, although learned from everyday objects and scenes, can transfer to facial expression recognition

### Next Steps:
- **Notebook 5**: Compare all models side by side
- **Notebook 6**: Use Grad-CAM to visualize what the models learn