# Module 4 - Exercise 3: Performance Optimization Techniques

## Learning Objectives
- Understand and apply model compilation with torch.compile for faster inference
- Implement model pruning to reduce model size and improve efficiency
- Use mixed precision training to accelerate training and reduce memory usage
- Compare different optimization techniques and their trade-offs
- Combine multiple optimization strategies for maximum performance gains

In [None]:
# Clone the test repository
!git clone https://github.com/racousin/data_science_practice.git /tmp/tests 2>/dev/null || true

# Import required modules
import sys
sys.path.append('/tmp/tests/tests/python_deep_learning')

# Import the improved test utilities
from test_utils import NotebookTestRunner, create_inline_test
from module4.test_exercise3 import Exercise3Validator, EXERCISE3_SECTIONS

# Create test runner and validator
test_runner = NotebookTestRunner("module4", 3)
validator = Exercise3Validator()

## Environment Setup

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import torch.nn.utils.prune as prune
import numpy as np
import time
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check CUDA availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

## Section 1: Model Compilation with torch.compile

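PyTorch 2.0 introduced `torch.compile`, which captures a model's computation graph and generates optimized kernels for it. This can speed up both inference and training, but the first calls to a compiled model include the compilation cost, so always warm up before timing.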
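A minimal sketch of the measurement pattern used in this section, run on a toy MLP rather than `SimpleCNN`. The helper name `time_inference` and the toy model are illustrative assumptions and are not part of the exercise's test suite. The `torch.compile` path is wrapped in `try`/`except` because compilation can fail on setups without a supported backend.

```python
import time

import torch
import torch.nn as nn


def time_inference(model, inputs, n_iters=100, n_warmup=10):
    """Return total seconds for n_iters forward passes (hypothetical helper)."""
    model.eval()
    with torch.no_grad():
        # Warm-up: for a compiled model this is where compilation happens
        for _ in range(n_warmup):
            model(inputs)
        # CUDA kernels run asynchronously, so synchronize before reading the clock
        if inputs.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(inputs)
        if inputs.is_cuda:
            torch.cuda.synchronize()
    return time.perf_counter() - start


toy = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(32, 64)

eager_seconds = time_inference(toy, x, n_iters=20, n_warmup=2)
print(f"Eager: {eager_seconds:.4f} s")

try:
    compiled_toy = torch.compile(toy)  # lazy: real work happens on the first call
    compiled_seconds = time_inference(compiled_toy, x, n_iters=20, n_warmup=2)
    print(f"Compiled: {compiled_seconds:.4f} s")
except Exception as exc:
    compiled_seconds = None
    print(f"torch.compile unavailable here: {type(exc).__name__}")
```

On a model this small the compiled version may not be faster at all; the point is the warm-up and synchronization pattern, not the numbers.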

In [None]:
# TODO: Define a simple CNN model for MNIST-like data
# The model should have:
# - conv1: Conv2d layer (1 input channel, 32 output channels, kernel size 3)
# - conv2: Conv2d layer (32 input channels, 64 output channels, kernel size 3)
# - fc1: Linear layer (appropriate input size, 128 output features)
# - fc2: Linear layer (128 input features, 10 output classes)
# Use ReLU activations and max pooling

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        # TODO: Define layers
        self.conv1 = None
        self.conv2 = None
        self.fc1 = None
        self.fc2 = None
        self.pool = nn.MaxPool2d(2, 2)
    
    def forward(self, x):
        # TODO: Implement forward pass
        # Input shape: (batch_size, 1, 28, 28)
        # Apply conv1 -> relu -> pool -> conv2 -> relu -> pool
        # Then flatten and apply fc1 -> relu -> fc2
        return x

# Test the model
test_model = SimpleCNN()
test_input = torch.randn(8, 1, 28, 28)
test_output = test_model(test_input)
print(f"Model output shape: {test_output.shape}")

In [None]:
# Create a model instance and prepare test data
model = SimpleCNN().to(device)
model.eval()

# Create synthetic test data
test_data = torch.randn(100, 1, 28, 28).to(device)

# TODO: Measure baseline inference time (without compilation)
# Run inference 100 times and measure the total time
baseline_time = None

# Warm up
with torch.no_grad():
    for _ in range(10):
        _ = model(test_data)

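# TODO: Measure baseline time
# Hint: Use time.time() and run model(test_data) 100 times inside torch.no_grad()
# On GPU, call torch.cuda.synchronize() before reading the clock, because CUDA kernels run asynchronously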

print(f"Baseline inference time: {baseline_time:.4f} seconds" if baseline_time else "Not measured")

In [None]:
# TODO: Create a compiled version of the model using torch.compile
# Note: torch.compile is available in PyTorch 2.0+
# If not available, create a copy of the original model

compiled_model = None

try:
    # TODO: Use torch.compile on the model
    # Hint: compiled_model = torch.compile(model)
    pass
except Exception:
    # If torch.compile is not available, use the original model
    print("torch.compile not available, using original model")
    compiled_model = model

if compiled_model:
    print("Compiled model created successfully")

In [None]:
# TODO: Measure compiled model inference time
compiled_time = None

if compiled_model:
    # Warm up the compiled model
    with torch.no_grad():
        for _ in range(10):
            _ = compiled_model(test_data)
    
    # TODO: Measure compiled inference time (100 iterations)
    # Similar to baseline measurement

print(f"Compiled inference time: {compiled_time:.4f} seconds" if compiled_time else "Not measured")

# TODO: Calculate speedup from compilation
compile_speedup = None
if baseline_time and compiled_time:
    # TODO: Calculate speedup = baseline_time / compiled_time
    pass

if compile_speedup:
    print(f"Compilation speedup: {compile_speedup:.2f}x")

In [None]:
# Test Section 1: Model Compilation with torch.compile
section_tests = [(getattr(validator, name), desc) for name, desc in EXERCISE3_SECTIONS["Section 1: Model Compilation with torch.compile"]]
test_runner.test_section("Section 1: Model Compilation with torch.compile", validator, section_tests, locals())

## Section 2: Model Pruning

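Pruning removes weights that contribute little to the model's output. The unstructured pruning used in this section sets individual weights to zero: the parameter count and the dense tensor size stay the same, so size and speed gains only materialize with sparse storage or sparsity-aware kernels.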
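As a sketch of what `torch.nn.utils.prune` does to a single layer, the snippet below prunes a standalone `nn.Linear` (an illustrative toy layer, not the exercise model) and inspects it before and after the pruning is made permanent.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(10, 20)  # 200 weights

# Zero out the 30% of weights with the smallest absolute value
prune.l1_unstructured(layer, name="weight", amount=0.3)

# While pruning is active, the layer stores the original weights plus a mask
has_orig_before = hasattr(layer, "weight_orig")
has_mask_before = hasattr(layer, "weight_mask")

# Make pruning permanent: the mask is folded into a plain `weight` parameter
prune.remove(layer, "weight")
has_orig_after = hasattr(layer, "weight_orig")

zero_fraction = (layer.weight == 0).float().mean().item()
n_weights = layer.weight.numel()

print(f"weight_orig before/after remove: {has_orig_before}/{has_orig_after}")
print(f"Zero fraction: {zero_fraction:.2f}, weights stored: {n_weights}")
```

The number of stored weights does not change; only their values do.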

In [None]:
# TODO: Implement a function to apply unstructured pruning to a model
def apply_pruning(model, amount=0.3):
    """
    Apply L1 unstructured pruning to all Linear and Conv2d layers in the model.
    
    Args:
        model: The model to prune
        amount: Fraction of connections to prune (0.3 = 30%)
    
    Returns:
        The pruned model
    """
    # TODO: Iterate through all modules in the model
    # For each Linear and Conv2d layer, apply L1 unstructured pruning
    # Hint: Use prune.l1_unstructured(module, name='weight', amount=amount)
    
    # TODO: Make pruning permanent
    # Hint: Use prune.remove(module, 'weight') for each pruned module
    
    return model

# Test the pruning function on a deep copy (nn.Module has no .clone() method)
import copy

test_model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 10)
)
pruned_test = apply_pruning(copy.deepcopy(test_model), 0.3)
print("Pruning function implemented")

In [None]:
# TODO: Create a fresh model and apply pruning
original_model = SimpleCNN().to(device)
original_model.eval()

# TODO: Apply pruning with 30% sparsity
pruned_model = None
# Hint: import copy; pruned_model = apply_pruning(copy.deepcopy(original_model), amount=0.3)

if pruned_model:
    print("Model pruned successfully")
    
    # Test that pruned model still works
    with torch.no_grad():
        test_output = pruned_model(test_data[:1])
        print(f"Pruned model output shape: {test_output.shape}")

In [None]:
# TODO: Calculate the sparsity of the pruned model
def calculate_sparsity(model):
    """
    Calculate the percentage of zero weights in the model.
    """
    total_params = 0
    zero_params = 0
    
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            total_params += module.weight.numel()
            zero_params += (module.weight == 0).sum().item()
    
    if total_params > 0:
        return (zero_params / total_params) * 100
    return 0

# TODO: Calculate sparsity of the pruned model
sparsity = None
if pruned_model:
    # TODO: Use calculate_sparsity function
    pass

if sparsity is not None:
    print(f"Model sparsity: {sparsity:.2f}%")
    
    # Compare parameter counts (unstructured pruning zeros weights but keeps tensor shapes, so the counts match)
    original_params = sum(p.numel() for p in original_model.parameters())
    pruned_params = sum(p.numel() for p in pruned_model.parameters())
    print(f"Original parameters: {original_params:,}")
    print(f"Pruned parameters: {pruned_params:,}")

In [None]:
# TODO: Measure pruned model inference time
pruned_time = None

if pruned_model:
    # Warm up
    with torch.no_grad():
        for _ in range(10):
            _ = pruned_model(test_data)
    
    # TODO: Measure pruned model inference time (100 iterations)
    
print(f"Pruned inference time: {pruned_time:.4f} seconds" if pruned_time else "Not measured")

if baseline_time and pruned_time:
    pruning_speedup = baseline_time / pruned_time
    print(f"Pruning speedup: {pruning_speedup:.2f}x")

In [None]:
# Test Section 2: Model Pruning
section_tests = [(getattr(validator, name), desc) for name, desc in EXERCISE3_SECTIONS["Section 2: Model Pruning"]]
test_runner.test_section("Section 2: Model Pruning", validator, section_tests, locals())

## Section 3: Mixed Precision Training

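Mixed precision training runs most forward and backward computation in float16 while keeping the model weights and numerically sensitive operations in float32. Loss scaling keeps small gradients from underflowing in float16, so training speeds up and uses less memory while accuracy is largely preserved.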
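A minimal sketch of one AMP training step on a toy linear classifier (the model, data, and variable names are illustrative assumptions). AMP is only enabled when CUDA is present; on CPU the same code runs as a plain FP32 step because both `autocast` and `GradScaler` are disabled.

```python
import math

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"
amp_dtype = torch.float16 if use_amp else torch.bfloat16

torch.manual_seed(0)
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
# With enabled=False the scaler becomes a pass-through
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 16, device=device)
y = torch.randint(0, 4, (8,), device=device)

optimizer.zero_grad()
# Forward pass and loss run under autocast; backward runs outside it
with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
    loss = criterion(model(x), y)

scaler.scale(loss).backward()  # scale the loss so small gradients survive in float16
scaler.step(optimizer)         # unscales gradients; skips the step if they contain inf/NaN
scaler.update()                # adjusts the scale factor for the next iteration

loss_value = loss.item()
print(f"AMP enabled: {use_amp}, loss: {loss_value:.4f}")
```

In a real training loop the scaler would normally be created once and reused across epochs so that its scale factor can adapt over time.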

In [None]:
# Create training data
X_train = torch.randn(1000, 1, 28, 28).to(device)
y_train = torch.randint(0, 10, (1000,)).to(device)
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

def train_epoch(model, loader, optimizer, criterion, use_amp=False):
    """
    Train for one epoch with optional automatic mixed precision.
    """
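    model.train()
    # Enable AMP only when CUDA is present; the scaler is recreated on every call,
    # which is acceptable for this short exercise
    use_amp = use_amp and torch.cuda.is_available()
    scaler = torch.cuda.amp.GradScaler() if use_amp else None
    
    for data, target in loader:
        optimizer.zero_grad()
        
        if use_amp: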
            with torch.cuda.amp.autocast():
                output = model(data)
                loss = criterion(output, target)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

In [None]:
# TODO: Train with mixed precision (if GPU is available)
mixed_precision_time = None

if torch.cuda.is_available():
    mp_model = SimpleCNN().to(device)
    mp_optimizer = optim.Adam(mp_model.parameters())
    criterion = nn.CrossEntropyLoss()
    
    # TODO: Measure training time with mixed precision for 3 epochs
    # Hint: Use train_epoch with use_amp=True
    
    print(f"Mixed precision training time: {mixed_precision_time:.4f} seconds" if mixed_precision_time else "Not measured")
else:
    print("CUDA not available - skipping mixed precision training")
    mixed_precision_time = None

In [None]:
# TODO: Train with standard FP32 precision
fp32_model = SimpleCNN().to(device)
fp32_optimizer = optim.Adam(fp32_model.parameters())
criterion = nn.CrossEntropyLoss()

# TODO: Measure training time with FP32 for 3 epochs
fp32_time = None
# Hint: Use train_epoch with use_amp=False

print(f"FP32 training time: {fp32_time:.4f} seconds" if fp32_time else "Not measured")

# Compare if both times are available
if mixed_precision_time and fp32_time:
    mp_speedup = fp32_time / mixed_precision_time
    print(f"Mixed precision speedup: {mp_speedup:.2f}x")

In [None]:
# TODO: Compare memory usage between FP32 and FP16
memory_comparison = {}

# FP32 tensor memory
fp32_tensor = torch.randn(1000, 1000, dtype=torch.float32)
memory_comparison['fp32'] = fp32_tensor.element_size() * fp32_tensor.numel() / (1024 * 1024)  # MB

# TODO: Calculate FP16 tensor memory
# Create same size tensor with dtype=torch.float16
if torch.cuda.is_available():
    # TODO: Create FP16 tensor and calculate memory
    memory_comparison['fp16'] = None
else:
    memory_comparison['fp16'] = None

print("Memory usage comparison:")
print(f"FP32 tensor (1000x1000): {memory_comparison['fp32']:.2f} MB")
if memory_comparison['fp16'] is not None:
    print(f"FP16 tensor (1000x1000): {memory_comparison['fp16']:.2f} MB")
    print(f"Memory reduction: {(1 - memory_comparison['fp16']/memory_comparison['fp32'])*100:.1f}%")
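The memory arithmetic in this cell is bytes per element times element count. A short sketch, using a hypothetical helper `tensor_mb` and `torch.empty` so that no random-number kernel is needed for float16 on CPU:

```python
import torch

shape = (1000, 1000)


def tensor_mb(dtype):
    """Size in MiB of a dense tensor of `shape` with the given dtype (hypothetical helper)."""
    t = torch.empty(shape, dtype=dtype)
    return t.element_size() * t.numel() / (1024 * 1024)


fp32_mb = tensor_mb(torch.float32)  # 4 bytes per element
fp16_mb = tensor_mb(torch.float16)  # 2 bytes per element
reduction = 1 - fp16_mb / fp32_mb

print(f"FP32: {fp32_mb:.2f} MB, FP16: {fp16_mb:.2f} MB, reduction: {reduction * 100:.1f}%")
```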

In [None]:
# Test Section 3: Mixed Precision Training
section_tests = [(getattr(validator, name), desc) for name, desc in EXERCISE3_SECTIONS["Section 3: Mixed Precision Training"]]
test_runner.test_section("Section 3: Mixed Precision Training", validator, section_tests, locals())

## Section 4: Optimization Comparison

Compare all optimization techniques to understand their relative benefits.
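One way to turn measured times into speedups is sketched below. The helper `summarize_speedups` is a hypothetical name, and the times in `example_times` are made-up placeholders, not measurements.

```python
def summarize_speedups(times, baseline_key="baseline"):
    """Return {technique: baseline_time / technique_time}, skipping missing entries."""
    valid = {name: t for name, t in times.items() if t is not None}
    if baseline_key not in valid:
        return {}
    base = valid[baseline_key]
    return {name: base / t for name, t in valid.items() if name != baseline_key}


# Placeholder values for illustration only
example_times = {
    "baseline": 2.0,
    "compiled": 1.0,
    "pruned": 1.6,
    "mixed_precision": None,  # e.g. no GPU available
}

speedups = summarize_speedups(example_times)
for name, factor in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>10}: {factor:.2f}x")
```

Note that in this exercise the mixed-precision figure is a training time over 3 epochs, while the other entries are inference times, so its ratio to the baseline is not a like-for-like speedup.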

In [None]:
# TODO: Create a summary of all optimization techniques
optimization_summary = {}

# TODO: Add all measured times to the summary
# optimization_summary['baseline'] = baseline_time
# optimization_summary['compiled'] = compiled_time
# optimization_summary['pruned'] = pruned_time
# optimization_summary['mixed_precision'] = mixed_precision_time  # May be None if no GPU; this is a training time, not an inference time

# Visualize results
if optimization_summary:
    import matplotlib.pyplot as plt
    
    # Filter out None values
    valid_results = {k: v for k, v in optimization_summary.items() if v is not None}
    
    if valid_results:
        plt.figure(figsize=(10, 6))
        techniques = list(valid_results.keys())
        times = list(valid_results.values())
        
        bars = plt.bar(techniques, times, color=['blue', 'green', 'orange', 'red'][:len(techniques)])
        plt.xlabel('Optimization Technique')
        plt.ylabel('Time (seconds)')
        plt.title('Performance Comparison of Optimization Techniques')
        
        # Add value labels on bars
        # Use `elapsed` as the loop variable so the `time` module is not shadowed
        for bar, elapsed in zip(bars, times):
            plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                    f'{elapsed:.3f}s', ha='center')
        
        plt.grid(True, alpha=0.3)
        plt.show()
        
        # Print speedups
        if 'baseline' in valid_results:
            baseline = valid_results['baseline']
            print("\nSpeedup compared to baseline:")
            for technique, elapsed in valid_results.items():
                if technique != 'baseline':
                    speedup = baseline / elapsed
                    print(f"  {technique}: {speedup:.2f}x")

In [None]:
# Test Section 4: Optimization Comparison
section_tests = [(getattr(validator, name), desc) for name, desc in EXERCISE3_SECTIONS["Section 4: Optimization Comparison"]]
test_runner.test_section("Section 4: Optimization Comparison", validator, section_tests, locals())

## Section 5: Combined Optimizations

Combine pruning and compilation in a single model, then measure whether the gains actually stack. Speedups from separate techniques do not necessarily multiply.
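A sketch of the prune-then-compile order on a toy MLP. The function name `prune_then_compile` and its `example_input` parameter are illustrative assumptions. Pruning is made permanent before compilation so that the compiled model sees plain weight tensors, and the function falls back to the pruned eager model if compilation fails.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def prune_then_compile(model, example_input, amount=0.3):
    """Return (optimized_model, was_compiled); the input model is left untouched."""
    model = copy.deepcopy(model)
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # fold the mask into the weights
    model.eval()
    try:
        candidate = torch.compile(model)
        with torch.no_grad():
            candidate(example_input)  # first call triggers compilation
        return candidate, True
    except Exception:
        return model, False  # fall back to the pruned eager model


torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(4, 32)

optimized, was_compiled = prune_then_compile(toy, x)
with torch.no_grad():
    out = optimized(x)

linears = [m for m in optimized.modules() if isinstance(m, nn.Linear)]
zeros = sum((m.weight == 0).sum().item() for m in linears)
total = sum(m.weight.numel() for m in linears)
sparsity = zeros / total

print(f"Compiled: {was_compiled}, output shape: {tuple(out.shape)}, sparsity: {sparsity:.3f}")
```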

In [None]:
# TODO: Create a model with combined optimizations
# Apply both pruning and compilation

combined_model = None

# TODO: Start with a fresh model
# 1. Create SimpleCNN model
# 2. Apply pruning with 30% sparsity
# 3. Apply torch.compile if available

if combined_model:
    print("Combined optimizations model created")
    
    # Test the model
    with torch.no_grad():
        test_output = combined_model(test_data[:1])
        print(f"Combined model output shape: {test_output.shape}")

In [None]:
# TODO: Measure combined model performance
combined_time = None

if combined_model:
    # Warm up
    with torch.no_grad():
        for _ in range(10):
            _ = combined_model(test_data)
    
    # TODO: Measure inference time (100 iterations)

print(f"Combined optimization time: {combined_time:.4f} seconds" if combined_time else "Not measured")

# TODO: Calculate total speedup
total_speedup = None
if baseline_time and combined_time:
    # TODO: Calculate total_speedup = baseline_time / combined_time
    pass

if total_speedup:
    print(f"Total speedup with combined optimizations: {total_speedup:.2f}x")
    print(f"That's {(total_speedup - 1) * 100:.1f}% faster than baseline!")

In [None]:
# Test Section 5: Combined Optimizations
section_tests = [(getattr(validator, name), desc) for name, desc in EXERCISE3_SECTIONS["Section 5: Combined Optimizations"]]
test_runner.test_section("Section 5: Combined Optimizations", validator, section_tests, locals())

In [None]:
# Display final summary of all tests
test_runner.final_summary()

## Summary

In this exercise, you've learned and practiced:

1. **Model Compilation (torch.compile)**:
   - How to use PyTorch 2.0's compilation feature
   - Understanding the performance benefits of graph optimization
   - Trade-offs between compilation time and inference speedup

2. **Model Pruning**:
   - Implementing L1 unstructured pruning and making it permanent
   - Calculating model sparsity
   - Understanding the impact on model size and inference speed

3. **Mixed Precision Training**:
   - Using automatic mixed precision (AMP) for faster training
   - Understanding FP16 vs FP32 trade-offs
   - Memory savings from reduced precision

4. **Optimization Comparison**:
   - Evaluating different optimization techniques
   - Understanding which optimizations work best for different scenarios
   - Making informed decisions about optimization strategies

5. **Combined Optimizations**:
   - Stacking multiple optimization techniques
   - Understanding cumulative performance gains
   - Practical considerations for production deployment

These optimization techniques are essential for:
- Deploying models on resource-constrained devices
- Reducing inference latency in production
- Lowering computational costs
- Enabling real-time applications
- Scaling model serving infrastructure