# Module 4 - Exercise 2: Fine-Tuning

## Learning Objectives
- Understand the principles of transfer learning and fine-tuning
- Implement feature extraction with frozen layers
- Apply selective layer freezing and gradual unfreezing strategies
- Use discriminative learning rates for different layer groups
- Implement parameter-efficient fine-tuning methods (Adapters, LoRA)
- Work with Hugging Face models for practical fine-tuning tasks

## Test Framework Setup

In [None]:
# Clone the test repository
!git clone https://github.com/racousin/data_science_practice.git /tmp/tests 2>/dev/null || true

# Import required modules
import sys
sys.path.append('/tmp/tests/tests/python_deep_learning')

# Import the test utilities
from test_utils import NotebookTestRunner, create_inline_test
from module4.test_exercise2 import Exercise2Validator, EXERCISE2_SECTIONS

# Create test runner and validator
test_runner = NotebookTestRunner("module4", 2)
validator = Exercise2Validator()

## Environment Setup

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check CUDA availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Introduction to Fine-Tuning

Fine-tuning is a transfer learning technique where we take a pre-trained model and adapt it to a new, related task. This approach leverages the knowledge learned from a large dataset and applies it to a specific problem, often with much less data.

### Key Concepts:

1. **Transfer Learning**: Using knowledge from one task to improve performance on another
2. **Feature Extraction**: Using pre-trained layers as fixed feature extractors
3. **Fine-Tuning**: Updating pre-trained weights with a small learning rate
4. **Layer Freezing**: Keeping certain layers fixed during training
5. **Discriminative Learning Rates**: Using different learning rates for different layers

## Section 1: Feature Extraction Basics

In this section, we'll start with the simplest form of transfer learning: using a pre-trained model as a fixed feature extractor.
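
As a minimal sketch of this idea, the snippet below freezes a small stand-in backbone and trains only a new head. The names `backbone` and `head`, the layer sizes, and the random inputs are assumptions made for illustration; a real workflow would load genuinely pre-trained weights instead of random ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a pre-trained backbone (randomly initialised here).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((2, 2)),
    nn.Flatten(),
)
head = nn.Linear(8 * 2 * 2, 4)  # new task-specific classifier

# Freeze the backbone so its weights receive no gradients.
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()  # fixes dropout / batch-norm behaviour, if such layers exist

images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 4, (16,))

with torch.no_grad():  # no autograd graph is built for the backbone
    features = backbone(images)

logits = head(features)  # only the head takes part in backpropagation
loss = F.cross_entropy(logits, labels)
loss.backward()

frozen = sum(p.numel() for p in backbone.parameters())
trainable = sum(p.numel() for p in head.parameters())
print(f"features: {tuple(features.shape)}, requires_grad={features.requires_grad}")
print(f"frozen params: {frozen}, trainable params: {trainable}")
```

Because the features are computed under `torch.no_grad()`, they could also be cached once and reused across epochs, which is one reason feature extraction is cheap.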

### Creating a Simple Pre-trained Model

Let's simulate a pre-trained model with frozen feature layers and a trainable classifier.

In [None]:
class SimpleCNN(nn.Module):
    """A simple CNN to demonstrate fine-tuning concepts"""
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        # Feature extraction layers (will be frozen)
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4))
        )
        
        # Classification layers (will be trainable)
        self.classifier = nn.Sequential(
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )
    
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

# TODO: Create a simple_pretrained_model instance with 10 classes
# Freeze the feature layers (set requires_grad=False for features parameters)
# Keep the classifier layers trainable
simple_pretrained_model = None

# Display model info
if simple_pretrained_model:
    total_params = sum(p.numel() for p in simple_pretrained_model.parameters())
    trainable_params = sum(p.numel() for p in simple_pretrained_model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    print(f"Frozen parameters: {total_params - trainable_params:,}")

### Setting up Feature Extraction

In [None]:
# TODO: Create a feature_extractor using only the features part of the model
# Set it to eval mode and ensure no gradients are computed
feature_extractor = None

# Create dummy data for testing
dummy_images = torch.randn(32, 3, 32, 32)

# TODO: Extract features from the dummy images using feature_extractor
# Store the result in extracted_features (should be a 2D tensor after flattening)
extracted_features = None

if extracted_features is not None:
    print(f"Extracted features shape: {extracted_features.shape}")
    print(f"Features require grad: {extracted_features.requires_grad}")

In [None]:
# Test Section 1: Feature Extraction Basics
section_tests = [(getattr(validator, name), desc) for name, desc in EXERCISE2_SECTIONS["Section 1: Feature Extraction Basics"]]
test_runner.test_section("Section 1: Feature Extraction Basics", validator, section_tests, locals())

## Section 2: Fine-Tuning Strategies

Now let's explore different strategies for fine-tuning models, including selective layer freezing and gradual unfreezing.
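
To make gradual unfreezing concrete, here is a rough sketch on a hypothetical three-block model. The block names (`early`, `late`, `head`), the epochs in `schedule`, and the helper `set_trainable` are all illustrative assumptions, not part of the exercise's expected answer.

```python
import torch.nn as nn

# Hypothetical three-block model; names and sizes are illustrative only.
model = nn.ModuleDict({
    "early": nn.Linear(8, 8),
    "late": nn.Linear(8, 8),
    "head": nn.Linear(8, 2),
})

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Toggle requires_grad for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable

# Start with everything frozen except the head.
set_trainable(model, False)
set_trainable(model["head"], True)

# Maps an epoch to the blocks that become trainable at its start.
schedule = {2: ["late"], 4: ["early"]}

history = []
for epoch in range(5):
    for name in schedule.get(epoch, []):
        set_trainable(model[name], True)
    # ... one epoch of training would run here ...
    trainable_blocks = [
        name for name, block in model.items()
        if all(p.requires_grad for p in block.parameters())
    ]
    history.append(trainable_blocks)
    print(f"epoch {epoch}: trainable blocks = {trainable_blocks}")
```

Unfreezing from the output side inward reflects the common assumption that later layers are more task-specific than earlier ones. If the optimizer was built only from the initially trainable parameters, newly unfrozen ones would also need to be registered, for example with `optimizer.add_param_group`.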

### Creating a Fine-Tuned Model

In [None]:
# TODO: Create a fine_tuned_model based on SimpleCNN
# Freeze only the first two convolutional layers in features
# Keep the rest trainable
fine_tuned_model = None

if fine_tuned_model:
    # Count parameters by layer group
    frozen_count = 0
    trainable_count = 0
    
    for name, param in fine_tuned_model.named_parameters():
        if param.requires_grad:
            trainable_count += param.numel()
        else:
            frozen_count += param.numel()
        print(f"{name}: {'Trainable' if param.requires_grad else 'Frozen'} - {param.numel():,} params")
    
    print(f"\nTotal frozen: {frozen_count:,}")
    print(f"Total trainable: {trainable_count:,}")

### Implementing Layer Freezing Strategy

In [None]:
# TODO: Implement a function to freeze the first N layers of a model
def freeze_layers(model: nn.Module, num_layers_to_freeze: int):
    """
    Freeze the first N layers of a Sequential model.
    
    Args:
        model: The model to modify
        num_layers_to_freeze: Number of layers to freeze from the beginning
    """
    # TODO: Implement the freezing logic
    pass

# Test the function
test_model = nn.Sequential(
    nn.Linear(10, 20),
    nn.Linear(20, 30),
    nn.Linear(30, 10)
)

if freeze_layers:
    freeze_layers(test_model, 2)
    for i, layer in enumerate(test_model):
        grad_status = any(p.requires_grad for p in layer.parameters())
        print(f"Layer {i}: {'Trainable' if grad_status else 'Frozen'}")

### Gradual Unfreezing Schedule

In [None]:
# TODO: Create an unfreezing schedule
# Format: [(epoch, layers_to_unfreeze), ...]
# Example: At epoch 0, all layers frozen except classifier
#          At epoch 3, unfreeze last conv layer
#          At epoch 5, unfreeze all layers
unfreeze_schedule = None

if unfreeze_schedule:
    print("Unfreezing Schedule:")
    for epoch, action in unfreeze_schedule:
        print(f"  Epoch {epoch}: {action}")

In [None]:
# Test Section 2: Fine-Tuning Strategies
section_tests = [(getattr(validator, name), desc) for name, desc in EXERCISE2_SECTIONS["Section 2: Fine-Tuning Strategies"]]
test_runner.test_section("Section 2: Fine-Tuning Strategies", validator, section_tests, locals())

## Section 3: Advanced Techniques

Let's explore more sophisticated fine-tuning techniques including discriminative learning rates and parameter-efficient methods.
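
Discriminative learning rates rely on PyTorch's per-parameter-group options. The sketch below uses two hypothetical modules, `backbone` and `head`, with illustrative learning rates; it only shows the mechanism and is not tuned for any task.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical two-part model; sizes are illustrative only.
backbone = nn.Linear(16, 16)  # stands in for pre-trained layers
head = nn.Linear(16, 3)       # stands in for a freshly initialised classifier

# Each dict is one parameter group with its own learning rate.
param_groups = [
    {"params": backbone.parameters(), "lr": 1e-4},  # small steps for pre-trained weights
    {"params": head.parameters(), "lr": 1e-3},      # larger steps for new weights
]
optimizer = optim.Adam(param_groups)

group_sizes = []
for i, group in enumerate(optimizer.param_groups):
    n_params = sum(p.numel() for p in group["params"])
    group_sizes.append(n_params)
    print(f"group {i}: lr={group['lr']}, params={n_params}")
```

Any option not set in a group (for example `weight_decay`) falls back to the keyword default passed to the optimizer constructor.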

### Discriminative Learning Rates

In [None]:
# Create a model for demonstration
model_for_training = SimpleCNN(num_classes=5)

# TODO: Create parameter groups with different learning rates
# Group 1: Feature layers (lower learning rate, e.g., 1e-4)
# Group 2: Classifier layers (higher learning rate, e.g., 1e-3)
lr_groups = None

if lr_groups:
    # Create optimizer with parameter groups
    optimizer = optim.Adam(lr_groups)
    
    print("Learning rate groups:")
    for i, group in enumerate(optimizer.param_groups):
        print(f"  Group {i}: LR = {group['lr']}, Params = {sum(p.numel() for p in group['params']):,}")

### Measuring Fine-Tuning Performance

In [None]:
# Create synthetic data for fine-tuning demonstration
X_train = torch.randn(100, 3, 32, 32)
y_train = torch.randint(0, 5, (100,))
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# TODO: Calculate initial loss before fine-tuning
model_for_training.eval()
with torch.no_grad():
    # Calculate initial_loss on the first batch
    initial_loss = None

# TODO: Perform one epoch of fine-tuning and calculate final_loss
final_loss = None  # stays None until the training step below is implemented
if lr_groups:
    model_for_training.train()
    criterion = nn.CrossEntropyLoss()
    
    for batch_idx, (data, target) in enumerate(train_loader):
        # TODO: Implement training step
        pass
    
    # Calculate final loss
    model_for_training.eval()
    with torch.no_grad():
        # TODO: Calculate final_loss on the first batch
        final_loss = None

if initial_loss is not None and final_loss is not None:
    print(f"Initial loss: {initial_loss:.4f}")
    print(f"Final loss: {final_loss:.4f}")
    print(f"Improvement: {((initial_loss - final_loss) / initial_loss * 100):.1f}%")

### Adapter Modules

Adapters are small trainable modules inserted into a frozen pre-trained model, allowing parameter-efficient fine-tuning.
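
One possible shape for such a module is sketched below. The class name `BottleneckAdapter` is hypothetical, and zero-initialising the up-projection (so the adapter starts out as an identity mapping) is a design assumption of this sketch rather than a requirement of the exercise.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative adapter: down-project, apply a nonlinearity, up-project, add residual."""

    def __init__(self, dim: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)
        # Assumption: zero-init the up-projection so the adapter initially
        # leaves the frozen model's activations unchanged.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

adapter = BottleneckAdapter(dim=512, bottleneck=64)
x = torch.randn(2, 512)
out = adapter(x)

adapter_params = sum(p.numel() for p in adapter.parameters())
full_layer_params = 512 * 512 + 512  # a dense 512 -> 512 layer with bias
print(f"output shape: {tuple(out.shape)}")
print(f"adapter params: {adapter_params:,} vs dense layer: {full_layer_params:,}")
```

With these sizes the adapter has 66,112 parameters, about a quarter of the 262,656 in a dense 512-to-512 layer; smaller bottlenecks shrink it further.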

In [None]:
# TODO: Implement an Adapter module
class AdapterModule(nn.Module):
    """
    A simple adapter module that can be inserted into a pre-trained model.
    Uses a bottleneck architecture: down-projection -> nonlinearity -> up-projection
    """
    def __init__(self, input_dim: int, bottleneck_dim: int = None):
        super(AdapterModule, self).__init__()
        # TODO: Implement the adapter architecture
        # Default bottleneck_dim to input_dim // 4 if not specified
        # Create down_proj, up_proj layers and activation
        pass
    
    def forward(self, x):
        # TODO: Implement forward pass with residual connection
        # adapter_out = x + adapter_layers(x)
        pass

# Test the adapter (output stays None until forward() is implemented)
adapter = AdapterModule(512, bottleneck_dim=64)
test_input = torch.randn(1, 512)
output = adapter(test_input)
if output is not None:
    print(f"Adapter input shape: {test_input.shape}")
    print(f"Adapter output shape: {output.shape}")
    print(f"Adapter parameters: {sum(p.numel() for p in adapter.parameters()):,}")

### LoRA (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning method that adds trainable low-rank decomposition matrices to frozen weights.
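
The arithmetic behind LoRA can be checked with plain tensors before writing a full layer. In this sketch the dimensions `d`, `k`, `r` and the 0.01 init scale are illustrative assumptions; it shows that the low-rank path adds nothing when `B` is zero and that the update can be merged into the frozen weight afterwards.

```python
import torch

torch.manual_seed(0)
d, k, r = 128, 256, 8            # out_features, in_features, rank

W = torch.randn(d, k)            # frozen pre-trained weight
A = torch.randn(r, k) * 0.01     # trainable, small random init
B = torch.zeros(d, r)            # trainable, zero init so that BA = 0 at the start

x = torch.randn(4, k)
base_out = x @ W.T
initial_out = x @ W.T + (x @ A.T) @ B.T  # identical to base_out while B is zero

# Pretend training has moved B away from zero.
B = torch.randn(d, r) * 0.01
unmerged = x @ W.T + (x @ A.T) @ B.T     # two cheap matmuls on the side path
merged = x @ (W + B @ A).T               # same result with the update folded into W

full_params = d * k
lora_params = r * (d + k)
print(f"starts as identity update: {torch.allclose(base_out, initial_out)}")
print(f"merged == unmerged: {torch.allclose(merged, unmerged, atol=1e-3)}")
print(f"full: {full_params:,} params, LoRA: {lora_params:,} params")
```

Here the low-rank factors hold 3,072 trainable values against 32,768 in the full matrix, and merging means inference needs no extra layers.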

In [None]:
# TODO: Implement a LoRA Linear layer
class LoRALinear(nn.Module):
    """
    Linear layer with LoRA (Low-Rank Adaptation).
    W' = W + BA where B ∈ R^(d×r) and A ∈ R^(r×k), with r << min(d, k)
    """
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super(LoRALinear, self).__init__()
        # TODO: Initialize the frozen weight matrix and LoRA matrices
        # Create self.weight (frozen), self.lora_A, and self.lora_B
        # Initialize lora_A with normal distribution and lora_B with zeros
        pass
    
    def forward(self, x):
        # TODO: Implement forward pass
        # output = x @ W^T + x @ A^T @ B^T
        pass

# Test LoRA implementation (output stays None until forward() is implemented)
lora_layer = LoRALinear(256, 128, rank=8)
test_input = torch.randn(1, 256)
output = lora_layer(test_input)

if output is not None:
    total_params = 256 * 128
    lora_params = sum(p.numel() for p in [lora_layer.lora_A, lora_layer.lora_B])

    print(f"Original parameters: {total_params:,}")
    print(f"LoRA parameters: {lora_params:,}")
    print(f"Parameter reduction: {(1 - lora_params/total_params)*100:.1f}%")

In [None]:
# Test Section 3: Advanced Techniques
section_tests = [(getattr(validator, name), desc) for name, desc in EXERCISE2_SECTIONS["Section 3: Advanced Techniques"]]
test_runner.test_section("Section 3: Advanced Techniques", validator, section_tests, locals())

## Section 4: Hugging Face Integration

Hugging Face provides a vast ecosystem of pre-trained models. Let's explore how to work with them for fine-tuning.

**Note**: This exercise simulates the Hugging Face concepts, so the library itself does not need to be installed.

### Understanding Hugging Face Models

Hugging Face models typically consist of:
1. **Base Model**: Pre-trained transformer layers (BERT, GPT, etc.)
2. **Task-Specific Heads**: Classification, token classification, generation heads
3. **Tokenizers**: Convert text to model inputs
4. **Config**: Model architecture and training configuration
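
The split between base model and task head can be imitated in plain PyTorch. In the sketch below, `ToyEncoder` is a hypothetical stand-in and not a Hugging Face class, the random `input_ids` play the role of tokenizer output, and all sizes are deliberately tiny.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained base model (not a Hugging Face class)."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Returns one hidden vector per token: [batch, seq_len, hidden]
        return self.encoder(self.embed(input_ids))

base = ToyEncoder()
head = nn.Linear(32, 3)  # task-specific head for a 3-class problem

# Stand-in for tokenizer output: integer ids of shape [batch, seq_len].
input_ids = torch.randint(0, 100, (4, 16))

hidden_states = base(input_ids)
first_token = hidden_states[:, 0]  # position where BERT-style models place [CLS]
logits = head(first_token)

print(f"hidden states: {tuple(hidden_states.shape)}")
print(f"logits: {tuple(logits.shape)}")
```

Swapping the head while keeping the base is what lets one pre-trained model serve several downstream tasks.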

In [None]:
# TODO: Choose a small Hugging Face model name for fine-tuning
# Examples: 'distilbert-base-uncased', 'prajjwal1/bert-tiny', 'distilgpt2'
hf_model_name = None

print(f"Selected model: {hf_model_name}")

# Simulate model architecture info
if hf_model_name:
    if 'bert' in hf_model_name.lower():
        print("\nModel Architecture:")
        print("  - Type: Encoder-only transformer")
        print("  - Use cases: Classification, NER, Question Answering")
        print("  - Hidden size: 768 (BERT-base, DistilBERT) or 128 (BERT-tiny)")
        print("  - Layers: 12 (BERT-base), 6 (DistilBERT), or 2 (BERT-tiny)")
    elif 'gpt' in hf_model_name.lower():
        print("\nModel Architecture:")
        print("  - Type: Decoder-only transformer")
        print("  - Use cases: Text generation, completion")
        print("  - Hidden size: 768")
        print("  - Layers: 12 (base) or 6 (distilled)")

### Tokenizer Configuration

In [None]:
# TODO: Set tokenizer parameters
max_length = None  # Maximum sequence length (32-512)

# Simulate tokenization
sample_text = "Fine-tuning allows us to adapt pre-trained models to specific tasks."

if max_length:
    print("Tokenizer configuration:")
    print(f"  Max length: {max_length}")
    print("  Padding: 'max_length'")
    print("  Truncation: True")
    print("\nSample tokenization:")
    print(f"  Input: '{sample_text[:50]}...'")
    print(f"  Output shape: [batch_size, {max_length}]")

### Custom Classification Head

In [None]:
# TODO: Create a custom classification head for a Hugging Face model
class ClassificationHead(nn.Module):
    def __init__(self, hidden_size: int = 768, num_classes: int = 3, dropout_prob: float = 0.1):
        super(ClassificationHead, self).__init__()
        # TODO: Implement classification head
        # Should include: dropout, dense layer(s), and output projection
        pass
    
    def forward(self, hidden_states):
        # TODO: Implement forward pass
        # hidden_states is the [CLS] (first-token) representation,
        # already selected by the caller: shape [batch_size, hidden_size]
        pass

# TODO: Create classification_head instance
classification_head = None

if classification_head:
    # Test with dummy hidden states
    dummy_hidden = torch.randn(4, 768)  # [batch_size, hidden_size]
    logits = classification_head(dummy_hidden)
    print(f"Classification head output shape: {logits.shape}")
    print(f"Number of parameters: {sum(p.numel() for p in classification_head.parameters()):,}")

### Fine-Tuning Configuration

In [None]:
# TODO: Create a configuration dictionary for fine-tuning
fine_tuning_config = None

# Should include:
# - learning_rate: Small learning rate (e.g., 2e-5 to 5e-5)
# - batch_size: Appropriate batch size (e.g., 16 or 32)
# - num_epochs: Few epochs (e.g., 3-5)
# - warmup_steps: Number of warmup steps (e.g., 100-500)
# - weight_decay: Regularization (e.g., 0.01)
# - gradient_accumulation_steps: For large batches (e.g., 1-4)

if fine_tuning_config:
    print("Fine-tuning configuration:")
    for key, value in fine_tuning_config.items():
        print(f"  {key}: {value}")
    
    # Calculate effective batch size
    effective_batch = fine_tuning_config['batch_size'] * fine_tuning_config.get('gradient_accumulation_steps', 1)
    print(f"\nEffective batch size: {effective_batch}")

### Best Practices for Fine-Tuning Hugging Face Models

1. **Start with Feature Extraction**: Freeze the base model initially
2. **Use Small Learning Rates**: Pre-trained weights are already close to a good solution, and large updates can overwrite them
3. **Monitor for Overfitting**: Use validation data and early stopping
4. **Layer-wise Learning Rates**: Lower rates for early layers
5. **Warmup Period**: Gradually increase learning rate at start
6. **Mixed Precision Training**: Use fp16 or bf16 to cut memory use and speed up training on supported hardware
7. **Gradient Checkpointing**: Save memory for large models
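
The warmup item above can be sketched with PyTorch's `LambdaLR`. The step counts, the linear decay after warmup, and the tiny placeholder model are assumptions chosen for illustration.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(4, 2)  # placeholder model
optimizer = optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

warmup_steps, total_steps = 10, 100  # illustrative values

def lr_lambda(step: int) -> float:
    """Multiplier on the base LR: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

lrs = []
for step in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])
    # A real loop would compute a loss and call backward() before this point;
    # without gradients, optimizer.step() leaves the weights untouched.
    optimizer.step()
    scheduler.step()

print(f"lr at step 0: {lrs[0]:.2e}")
print(f"lr at step {warmup_steps}: {lrs[warmup_steps]:.2e}")
print(f"lr at last step: {lrs[-1]:.2e}")
```

The scheduler is stepped once per optimizer step rather than once per epoch, since warmup is usually measured in steps.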

In [None]:
# Test Section 4: Hugging Face Integration
section_tests = [(getattr(validator, name), desc) for name, desc in EXERCISE2_SECTIONS["Section 4: Hugging Face Integration"]]
test_runner.test_section("Section 4: Hugging Face Integration", validator, section_tests, locals())

## Summary and Practical Applications

### When to Use Each Technique:

1. **Feature Extraction**: 
   - Very small datasets (as a rough guide, < 1,000 samples)
   - Limited computational resources
   - Similar domain to pre-training data

2. **Full Fine-Tuning**:
   - Larger datasets (as a rough guide, > 10,000 samples)
   - Sufficient computational resources
   - Domain shift from pre-training data

3. **Gradual Unfreezing**:
   - Medium-sized datasets
   - Prevent catastrophic forgetting
   - Stable training progression

4. **Parameter-Efficient Methods (Adapters, LoRA)**:
   - Multiple downstream tasks
   - Limited storage for model weights
   - Need to preserve original model

5. **Discriminative Learning Rates**:
   - Often beneficial when fine-tuning
   - Especially important for deep networks
   - Helps preserve pre-trained features

### Common Pitfalls to Avoid:

- Using a learning rate that is too high → Catastrophic forgetting
- Training for too many epochs → Overfitting
- Not using validation data → Poor generalization
- Ignoring class imbalance → Biased predictions
- Not monitoring training metrics → Missing problems early
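
Two of these pitfalls, training for too long and skipping validation, are commonly handled with early stopping. The sketch below uses made-up validation losses purely to show the bookkeeping; the `patience` value is an arbitrary assumption.

```python
import math

# Made-up validation losses, one per epoch, for illustration only.
val_losses = [0.90, 0.70, 0.62, 0.60, 0.61, 0.63, 0.66]
patience = 2  # epochs to wait without improvement before stopping

best_loss = math.inf
best_epoch = -1
epochs_without_improvement = 0
stopped_epoch = None

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        # New best: remember it (a real loop would also save a checkpoint here).
        best_loss, best_epoch = val_loss, epoch
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            stopped_epoch = epoch
            break

print(f"best epoch: {best_epoch}, best loss: {best_loss}")
print(f"stopped at epoch: {stopped_epoch}")
```

In practice the weights from `best_epoch` would be restored after stopping, so the final model is the one that generalised best on the validation set.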

In [None]:
# Display final summary of all tests
test_runner.final_summary()

## Congratulations!

You've completed the Fine-Tuning exercise! You've learned:

✅ How to implement feature extraction with frozen layers  
✅ Strategies for selective layer freezing and gradual unfreezing  
✅ Using discriminative learning rates for different layer groups  
✅ Parameter-efficient fine-tuning with Adapters and LoRA  
✅ Working with Hugging Face models and configurations  

These techniques are essential for modern deep learning applications where pre-trained models are adapted to specific tasks with limited data and resources.