# 06 - Debugging and Profiling PyTorch Code

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Identify common bugs** - Shape mismatches, NaN gradients, memory leaks, device mismatches
2. **Use debugging tools** - `torch.autograd.detect_anomaly`, gradient hooks, assertions
3. **Profile performance** - PyTorch Profiler for CPU/GPU/memory bottlenecks
4. **Visualize training** - TensorBoard integration for loss curves and model inspection
5. **Debug memory issues** - Track GPU memory, find leaks, optimize usage

---

## Setup

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torch.profiler import profile, record_function, ProfilerActivity
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import matplotlib.pyplot as plt
import traceback
import warnings
import gc
import os
from typing import Dict, List, Optional, Tuple

# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

torch.manual_seed(42)

---

## 1. Common PyTorch Bugs and How to Fix Them

### 1.1 Shape Mismatches

Shape mismatches are the most common category of bug in PyTorch code. The cells below show typical failures and their fixes.

In [None]:
# BUG 1: Incorrect input shape to linear layer

class BuggyModel1(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, 3)  # Output: (B, 32, H-2, W-2)
        self.fc = nn.Linear(32, 10)  # Expects flattened input!
    
    def forward(self, x):
        x = F.relu(self.conv(x))  # (B, 32, 26, 26) for 28x28 input
        x = self.fc(x)  # BUG: x is (B, 32, 26, 26), not (B, 32)
        return x

# This will fail
try:
    model = BuggyModel1()
    x = torch.randn(4, 1, 28, 28)
    output = model(x)
except RuntimeError as e:
    print(f"Error: {e}")
    print("\nShape mismatch: Conv output wasn't flattened before Linear layer")

In [None]:
# FIXED: Properly flatten before linear layer

class FixedModel1(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 32, 3)  # Output: (B, 32, 26, 26)
        self.fc = nn.Linear(32 * 26 * 26, 10)  # Correct input size
    
    def forward(self, x):
        x = F.relu(self.conv(x))
        x = x.flatten(1)  # (B, 32*26*26)
        x = self.fc(x)
        return x

model = FixedModel1()
x = torch.randn(4, 1, 28, 28)
output = model(x)
print(f"Output shape: {output.shape}")  # (4, 10)

In [None]:
# BUG 2: Forgetting batch dimension

model = nn.Linear(10, 5)

# Wrong: passing 1D tensor instead of 2D
try:
    x = torch.randn(10)  # Missing batch dimension
    output = model(x)
    print(f"Accidentally worked! Output shape: {output.shape}")
    # This works but might cause issues later with BatchNorm, etc.
except Exception as e:
    print(f"Error: {e}")

# Correct: always use batch dimension
x = torch.randn(1, 10)  # Batch size of 1
output = model(x)
print(f"Correct output shape: {output.shape}")

In [None]:
# BUG 3: Channel order mismatch (NCHW vs NHWC)

# PyTorch uses NCHW (Batch, Channels, Height, Width)
# Some other frameworks use NHWC

# Wrong: NumPy image in HWC format
numpy_image = np.random.rand(28, 28, 3)  # HWC
try:
    x = torch.from_numpy(numpy_image).float()
    conv = nn.Conv2d(3, 16, 3)
    output = conv(x.unsqueeze(0))  # Adds batch but wrong channel position!
except RuntimeError as e:
    print(f"Error: {e}")

# Correct: Convert HWC to CHW
x = torch.from_numpy(numpy_image).float()
x = x.permute(2, 0, 1)  # HWC -> CHW
x = x.unsqueeze(0)  # Add batch: CHW -> NCHW
print(f"Correct shape: {x.shape}")  # (1, 3, 28, 28)
output = conv(x)
print(f"Conv output shape: {output.shape}")
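
The `26` in `32 * 26 * 26` above follows from the output-size formula given in the `nn.Conv2d` documentation. A minimal sketch of that arithmetic, where `conv2d_out_size` is a hypothetical helper (not part of PyTorch) that handles one spatial dimension at a time:

```python
def conv2d_out_size(size: int, kernel: int, stride: int = 1,
                    padding: int = 0, dilation: int = 1) -> int:
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# 28x28 input, 3x3 kernel, default stride and padding -> 26x26 feature map
side = conv2d_out_size(28, kernel=3)
flat_features = 32 * side * side  # in_features for the Linear layer
print(side, flat_features)        # prints: 26 21632
```

Computing the size this way (or running one dummy batch through the convolutional part and reading `.shape`) avoids hard-coding a number that silently becomes wrong when the input resolution changes.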

### 1.2 Device Mismatches

In [None]:
# BUG 4: Mixing CPU and GPU tensors

if torch.cuda.is_available():
    model = nn.Linear(10, 5).cuda()
    x_cpu = torch.randn(4, 10)  # On CPU
    
    try:
        output = model(x_cpu)  # Model on GPU, data on CPU
    except RuntimeError as e:
        print(f"Error: {e}")
    
    # Fix: Move data to same device as model
    x_gpu = x_cpu.to(model.weight.device)
    output = model(x_gpu)
    print(f"Success! Output device: {output.device}")
else:
    print("GPU not available - skipping device mismatch demo")

In [None]:
# Helper function to ensure all data on correct device

def to_device(data, device):
    """Recursively move data to device"""
    if isinstance(data, torch.Tensor):
        return data.to(device)
    elif isinstance(data, (list, tuple)):
        return type(data)(to_device(d, device) for d in data)
    elif isinstance(data, dict):
        return {k: to_device(v, device) for k, v in data.items()}
    else:
        return data

# Usage in training loop
def train_step(model, batch, device):
    batch = to_device(batch, device)
    x, y = batch
    return model(x)

print("Always use a consistent device handling pattern!")

### 1.3 NaN and Inf Values

In [None]:
# BUG 5: Numerical instability causing NaN

def unstable_softmax(x):
    """Numerically unstable softmax"""
    return torch.exp(x) / torch.exp(x).sum(dim=-1, keepdim=True)

def stable_softmax(x):
    """Numerically stable softmax"""
    x_max = x.max(dim=-1, keepdim=True).values
    exp_x = torch.exp(x - x_max)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

# Test with large values
x = torch.tensor([1000.0, 1001.0, 1002.0])

print(f"Unstable softmax: {unstable_softmax(x)}")  # NaN!
print(f"Stable softmax: {stable_softmax(x)}")  # Works
print(f"PyTorch softmax: {F.softmax(x, dim=-1)}")  # PyTorch handles this

In [None]:
# BUG 6: Log of zero

def buggy_cross_entropy(pred, target):
    """Can produce inf (or NaN, when target is also 0) if pred contains 0"""
    return -torch.sum(target * torch.log(pred))

def safe_cross_entropy(pred, target, eps=1e-8):
    """Clamp to avoid log(0)"""
    return -torch.sum(target * torch.log(pred.clamp(min=eps)))

# Test
pred = torch.tensor([0.0, 0.5, 0.5])  # Contains 0!
target = torch.tensor([1.0, 0.0, 0.0])

print(f"Buggy CE: {buggy_cross_entropy(pred, target)}")  # inf, because log(0) = -inf
print(f"Safe CE: {safe_cross_entropy(pred, target)}")  # Large but finite

In [None]:
# Detecting NaN/Inf in tensors

def check_tensor(tensor, name="tensor"):
    """Check tensor for NaN and Inf values"""
    has_nan = torch.isnan(tensor).any().item()
    has_inf = torch.isinf(tensor).any().item()
    
    if has_nan:
        print(f"WARNING: {name} contains NaN!")
        print(f"  NaN count: {torch.isnan(tensor).sum().item()}")
    if has_inf:
        print(f"WARNING: {name} contains Inf!")
        print(f"  Inf count: {torch.isinf(tensor).sum().item()}")
    
    if not has_nan and not has_inf:
        print(f"{name}: OK (min={tensor.min():.4f}, max={tensor.max():.4f})")
    
    return not (has_nan or has_inf)

# Test
good_tensor = torch.randn(100)
bad_tensor = torch.tensor([1.0, float('nan'), float('inf'), -float('inf')])

check_tensor(good_tensor, "good_tensor")
check_tensor(bad_tensor, "bad_tensor")
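
Clamping avoids `log(0)`, but it also zeroes the gradient wherever the clamp is active. When raw logits are available, a common alternative is to let PyTorch combine log-softmax and the negative log-likelihood in one numerically stable call. A minimal sketch with made-up logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])  # large values that overflow exp()
target = torch.tensor([0])                         # class index (int64)

# Naive route: softmax by hand, then log. exp(1000) overflows to inf, so this is NaN.
naive_probs = torch.exp(logits) / torch.exp(logits).sum(dim=-1, keepdim=True)
naive_loss = -torch.log(naive_probs[0, target[0]])

# Stable route: cross_entropy applies log_softmax internally
stable_loss = F.cross_entropy(logits, target)

print(f"naive: {naive_loss.item()}, stable: {stable_loss.item():.4f}")
```

The design point is to keep the model's output as logits and leave the normalization to the loss function, rather than applying softmax in the model and taking the log afterwards.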

### 1.4 Memory Leaks

In [None]:
# BUG 7: Accumulating tensors in a list (keeps computation graph!)

def buggy_training_loop(model, data_loader, epochs=3):
    """Memory leak: storing tensors with gradients"""
    losses = []  # Stores tensors with computation graphs!
    
    for epoch in range(epochs):
        for x, y in data_loader:
            output = model(x)
            loss = F.mse_loss(output, y)
            losses.append(loss)  # BUG: loss has grad_fn attached!
    
    return losses  # All computation graphs are retained in memory

def fixed_training_loop(model, data_loader, epochs=3):
    """Fixed: detach or use .item() for scalars"""
    losses = []
    
    for epoch in range(epochs):
        for x, y in data_loader:
            output = model(x)
            loss = F.mse_loss(output, y)
            losses.append(loss.item())  # .item() extracts Python float
            # Or: losses.append(loss.detach())  # Remove from graph
    
    return losses

print("Always use .item() when storing loss values!")
print("Or use .detach() if you need to keep the tensor.")

In [None]:
# BUG 8: Not using torch.no_grad() during evaluation

def buggy_evaluate(model, data_loader):
    """Builds computation graphs during evaluation - wastes memory!"""
    total_loss = 0
    for x, y in data_loader:
        output = model(x)  # Builds graph even though we don't need gradients
        loss = F.mse_loss(output, y)
        total_loss += loss.item()
    return total_loss

@torch.no_grad()  # Decorator form
def fixed_evaluate(model, data_loader):
    """No computation graphs built - saves memory"""
    total_loss = 0
    for x, y in data_loader:
        output = model(x)
        loss = F.mse_loss(output, y)
        total_loss += loss.item()
    return total_loss

# Alternative: context manager
def fixed_evaluate_v2(model, data_loader):
    total_loss = 0
    with torch.no_grad():
        for x, y in data_loader:
            output = model(x)
            loss = F.mse_loss(output, y)
            total_loss += loss.item()
    return total_loss

print("Always use torch.no_grad() or torch.inference_mode() during evaluation!")
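
Disabling gradient tracking does not switch layers such as dropout or batch normalization to evaluation behavior; that is what `model.eval()` is for. A minimal sketch combining both, where `evaluate` is an illustrative helper and the tiny model and data are made up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def evaluate(model: nn.Module, batches) -> float:
    """Mean loss over batches with gradients off and eval-mode layers."""
    was_training = model.training
    model.eval()                      # dropout off, batch norm uses running stats
    total, count = 0.0, 0
    try:
        with torch.inference_mode():  # no autograd graph is recorded
            for x, y in batches:
                total += F.mse_loss(model(x), y).item()
                count += 1
    finally:
        model.train(was_training)     # restore the caller's mode
    return total / max(count, 1)

model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(0.5), nn.Linear(8, 1))
batches = [(torch.randn(16, 4), torch.randn(16, 1)) for _ in range(3)]
print(f"mean eval loss: {evaluate(model, batches):.4f}")
```

Restoring the previous mode in `finally` keeps a mid-training validation pass from leaving the model in eval mode by accident.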

---

## 2. Debugging Tools

### 2.1 torch.autograd.detect_anomaly

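`torch.autograd.detect_anomaly` makes the backward pass raise an error as soon as a function produces NaN gradients, and it reports the traceback of the forward operation that created that function. It slows execution noticeably, so enable it only while debugging.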

In [None]:
# Create a model that produces NaN during training

class ProblematicModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)
    
    def forward(self, x):
        x = self.linear(x)
        # Intentionally cause NaN: sqrt of negative number
        x = torch.sqrt(x)  # NaN for negative values!
        return x

# Without anomaly detection
model = ProblematicModel()
x = torch.randn(4, 10)
output = model(x)
print(f"Output has NaN: {torch.isnan(output).any().item()}")

try:
    loss = output.sum()
    loss.backward()  # Might silently produce NaN gradients
    print(f"Gradient has NaN: {torch.isnan(model.linear.weight.grad).any().item()}")
except Exception as e:
    print(f"Error: {e}")

In [None]:
# With anomaly detection - get detailed error location

torch.autograd.set_detect_anomaly(True)

model = ProblematicModel()
x = torch.randn(4, 10)

try:
    output = model(x)
    loss = output.sum()
    loss.backward()
except RuntimeError as e:
    print("Caught anomaly!")
    print(f"Error: {str(e)[:200]}...")

torch.autograd.set_detect_anomaly(False)  # Disable (has performance cost)

In [None]:
# Using as context manager (recommended)

model = ProblematicModel()
x = torch.randn(4, 10)

with torch.autograd.detect_anomaly():
    try:
        output = model(x)
        loss = output.sum()
        loss.backward()
    except RuntimeError as e:
        print("Anomaly detected in context manager!")
        # The error message includes the forward pass location
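
Anomaly detection reports problems during the backward pass. To catch a NaN as soon as it appears in the forward pass, one option is a forward hook on every submodule. The sketch below is illustrative: `Sqrt` and `add_nan_hooks` are hypothetical names, and the weights are set by hand so that the failure is deterministic.

```python
import torch
import torch.nn as nn

class Sqrt(nn.Module):
    """Toy layer that yields NaN for negative inputs."""
    def forward(self, x):
        return torch.sqrt(x)

def add_nan_hooks(model: nn.Module):
    """Raise as soon as any submodule outputs a non-finite tensor."""
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(
                    f"Non-finite output in module '{name}' ({type(mod).__name__})"
                )
        handles.append(module.register_forward_hook(hook))
    return handles

model = nn.Sequential(nn.Linear(3, 3), Sqrt())
with torch.no_grad():
    model[0].weight.fill_(-1.0)  # force negative activations
    model[0].bias.zero_()

handles = add_nan_hooks(model)
try:
    model(torch.ones(2, 3))
    message = "no problem found"
except RuntimeError as e:
    message = str(e)
finally:
    for h in handles:
        h.remove()               # hooks add overhead; remove them after debugging
print(message)
```

The hook only inspects single-tensor outputs; modules that return tuples or dicts would need extra handling.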

### 2.2 Gradient Hooks for Debugging

In [None]:
class GradientDebugger:
    """
    Utility class to monitor gradients during training.
    Helps identify vanishing/exploding gradients and NaN values.
    """
    
    def __init__(self, model: nn.Module):
        self.model = model
        self.gradient_stats: Dict[str, Dict] = {}
        self.hooks = []
    
    def register_hooks(self):
        """Register backward hooks on all parameters"""
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                hook = param.register_hook(
                    lambda grad, n=name: self._gradient_hook(n, grad)
                )
                self.hooks.append(hook)
        print(f"Registered hooks on {len(self.hooks)} parameters")
    
    def _gradient_hook(self, name: str, grad: torch.Tensor):
        """Called during backward pass for each parameter"""
        with torch.no_grad():
            self.gradient_stats[name] = {
                'mean': grad.mean().item(),
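                # std() is undefined (NaN) for single-element tensors such as a 1-unit bias
                'std': grad.std().item() if grad.numel() > 1 else 0.0,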
                'min': grad.min().item(),
                'max': grad.max().item(),
                'norm': grad.norm().item(),
                'has_nan': torch.isnan(grad).any().item(),
                'has_inf': torch.isinf(grad).any().item(),
            }
    
    def print_stats(self):
        """Print gradient statistics"""
        print("\nGradient Statistics:")
        print("-" * 80)
        for name, stats in self.gradient_stats.items():
            status = "OK"
            if stats['has_nan']:
                status = "NaN!"
            elif stats['has_inf']:
                status = "Inf!"
            elif stats['norm'] < 1e-7:
                status = "Vanishing?"
            elif stats['norm'] > 1e3:
                status = "Exploding?"
            
            print(f"{name:40} | norm: {stats['norm']:10.4f} | {status}")
    
    def remove_hooks(self):
        """Remove all registered hooks"""
        for hook in self.hooks:
            hook.remove()
        self.hooks.clear()


# Demo
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)

debugger = GradientDebugger(model)
debugger.register_hooks()

# Forward and backward pass
x = torch.randn(32, 10)
y = torch.randn(32, 1)
output = model(x)
loss = F.mse_loss(output, y)
loss.backward()

debugger.print_stats()
debugger.remove_hooks()

### 2.3 Shape Assertions and Debugging

In [None]:
def assert_shape(tensor: torch.Tensor, expected_shape: tuple, name: str = "tensor"):
    """
    Assert tensor has expected shape.
    Use -1 for dimensions that can be any size.
    """
    actual = tensor.shape
    
    if len(actual) != len(expected_shape):
        raise AssertionError(
            f"{name}: Expected {len(expected_shape)} dims, got {len(actual)}. "
            f"Shape: {tuple(actual)}"
        )
    
    for i, (a, e) in enumerate(zip(actual, expected_shape)):
        if e != -1 and a != e:
            raise AssertionError(
                f"{name}: Dimension {i} mismatch. "
                f"Expected {expected_shape}, got {tuple(actual)}"
            )


class DebuggedModel(nn.Module):
    """Model with shape assertions for debugging"""
    
    def __init__(self, input_dim=10, hidden_dim=50, output_dim=5):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
    
    def forward(self, x):
        # Assert input shape (batch_size can be anything, hence -1)
        assert_shape(x, (-1, self.input_dim), "input")
        
        x = F.relu(self.fc1(x))
        assert_shape(x, (-1, self.hidden_dim), "after fc1")
        
        x = self.fc2(x)
        assert_shape(x, (-1, self.output_dim), "output")
        
        return x


# Test
model = DebuggedModel(input_dim=10, hidden_dim=50, output_dim=5)

# Correct input
x = torch.randn(32, 10)
output = model(x)
print(f"Success! Output shape: {output.shape}")

# Wrong input
try:
    x_wrong = torch.randn(32, 15)  # Wrong input dimension
    output = model(x_wrong)
except AssertionError as e:
    print(f"Caught error: {e}")

---

## 3. PyTorch Profiler

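The PyTorch Profiler records how much time and memory each operator uses on the CPU and GPU, so bottlenecks can be located by measurement rather than guesswork.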

### 3.1 Basic Profiling

In [None]:
# Create a model and data for profiling

class ModelForProfiling(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, 10)
    
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = x.flatten(1)
        x = self.fc(x)
        return x

model = ModelForProfiling().to(device)
x = torch.randn(32, 3, 64, 64, device=device)

In [None]:
# Basic profiling

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for _ in range(10):
        output = model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()

# Print results sorted by CPU time
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))

In [None]:
# Profile with custom labels using record_function

class LabeledModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, 10)
    
    def forward(self, x):
        with record_function("conv_block_1"):
            x = F.relu(self.conv1(x))
        
        with record_function("conv_block_2"):
            x = F.relu(self.conv2(x))
        
        with record_function("pooling_and_fc"):
            x = self.pool(x)
            x = x.flatten(1)
            x = self.fc(x)
        
        return x

labeled_model = LabeledModel().to(device)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        output = labeled_model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()

# Filter for our custom labels
print("\nCustom region timing:")
for event in prof.key_averages():
    if event.key in ["conv_block_1", "conv_block_2", "pooling_and_fc"]:
        print(f"{event.key}: {event.cpu_time_total / 1000:.2f}ms")
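
The `torch.cuda.synchronize()` calls in the loops above are there because CUDA kernels launch asynchronously: without a synchronization point, host-side timing measures launch overhead rather than execution time. The same applies to manual timing. A minimal sketch, where `time_forward` is a hypothetical helper and the small model is made up:

```python
import time
import torch
import torch.nn as nn

def time_forward(model: nn.Module, x: torch.Tensor,
                 warmup: int = 3, iters: int = 10) -> float:
    """Average forward time in milliseconds, synchronizing around the timed region."""
    use_cuda = x.is_cuda
    with torch.no_grad():
        for _ in range(warmup):        # exclude one-time setup costs
            model(x)
        if use_cuda:
            torch.cuda.synchronize()   # finish pending work before starting the clock
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()   # wait for the timed kernels to complete
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
x = torch.randn(64, 256, device=device)
print(f"{time_forward(model, x):.3f} ms per forward pass")
```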

### 3.2 Profiling Training Loop

In [None]:
# Create synthetic data
train_data = TensorDataset(
    torch.randn(1000, 3, 64, 64),
    torch.randint(0, 10, (1000,))
)
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)

model = ModelForProfiling().to(device)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

def train_step(model, batch, optimizer, criterion):
    x, y = batch
    x, y = x.to(device), y.to(device)
    
    with record_function("forward"):
        output = model(x)
        loss = criterion(output, y)
    
    with record_function("backward"):
        optimizer.zero_grad()
        loss.backward()
    
    with record_function("optimizer_step"):
        optimizer.step()
    
    return loss.item()

# Profile training
with profile(
    activities=activities,
    schedule=torch.profiler.schedule(
        wait=1,     # Stay idle for the first step (nothing recorded)
        warmup=1,   # Trace the next step but discard the results
        active=3,   # Record the following 3 steps
        repeat=1
    ),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for batch_idx, batch in enumerate(train_loader):
        if batch_idx >= 5:  # 1 wait + 1 warmup + 3 active steps
            break
        loss = train_step(model, batch, optimizer, criterion)
        prof.step()  # Signal profiler

print("\nTraining step breakdown:")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))

### 3.3 Memory Profiling

In [None]:
# Memory profiling

if torch.cuda.is_available():
    # Clear cache
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    
    print("GPU Memory Profiling:")
    print(f"Initial allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    
    # Create model and data
    model = ModelForProfiling().cuda()
    print(f"After model: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    
    x = torch.randn(64, 3, 64, 64, device='cuda')
    print(f"After input: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    
    # Forward pass
    output = model(x)
    print(f"After forward: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    
    # Backward pass
    loss = output.sum()
    loss.backward()
    print(f"After backward: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    
    print(f"\nPeak memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
else:
    print("GPU not available for memory profiling")

In [None]:
# Detailed memory breakdown

if torch.cuda.is_available():
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        profile_memory=True,
        record_shapes=True,
    ) as prof:
        model = ModelForProfiling().cuda()
        x = torch.randn(64, 3, 64, 64, device='cuda')
        output = model(x)
        loss = output.sum()
        loss.backward()
    
    # Sort by memory usage
    print("\nOperations by CUDA memory:")
    print(prof.key_averages().table(
        sort_by="self_cuda_memory_usage", 
        row_limit=15
    ))
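
The jump reported after moving a model to the GPU can be sanity-checked against the size of its parameters and buffers. A minimal sketch, where `param_bytes` is a hypothetical helper; the caching allocator rounds allocations up, so the measured figure is usually somewhat larger than this estimate:

```python
import torch
import torch.nn as nn

def param_bytes(model: nn.Module) -> int:
    """Bytes occupied by parameters and buffers (excludes gradients and activations)."""
    tensors = list(model.parameters()) + list(model.buffers())
    return sum(t.numel() * t.element_size() for t in tensors)

model = nn.Linear(10, 5)            # 10*5 weights + 5 biases = 55 float32 values
print(param_bytes(model), "bytes")  # 55 * 4 = 220
```

After a backward pass, each trainable parameter also holds a gradient of the same size, which is one reason the allocated figure grows again after `loss.backward()`.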

---

## 4. TensorBoard Integration

### 4.1 Basic TensorBoard Logging

In [None]:
# Create TensorBoard writer
log_dir = '../runs/debug_demo'
os.makedirs(log_dir, exist_ok=True)
writer = SummaryWriter(log_dir)

print(f"TensorBoard logs will be saved to: {log_dir}")
print("Run 'tensorboard --logdir=../runs' to view")

In [None]:
# Log scalars (loss, accuracy, learning rate)

# Simulate training
for epoch in range(100):
    # Fake metrics
    train_loss = 1.0 / (epoch + 1) + 0.1 * np.random.randn()
    val_loss = 1.2 / (epoch + 1) + 0.1 * np.random.randn()
    accuracy = 1 - 1.0 / (epoch + 2)
    lr = 0.001 * (0.95 ** epoch)
    
    # Log to TensorBoard
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/val', val_loss, epoch)
    writer.add_scalar('Accuracy/train', accuracy, epoch)
    writer.add_scalar('LearningRate', lr, epoch)

print("Logged 100 epochs of training metrics")

In [None]:
# Log histograms (weights, gradients, activations)

model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Log initial weights
for name, param in model.named_parameters():
    writer.add_histogram(f'weights/{name}', param, 0)

# Simulate training and log weight evolution
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):
    x = torch.randn(32, 10)
    y = torch.randn(32, 10)
    
    output = model(x)
    loss = F.mse_loss(output, y)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Log every 10 epochs
    if epoch % 10 == 0:
        for name, param in model.named_parameters():
            writer.add_histogram(f'weights/{name}', param, epoch)
            if param.grad is not None:
                writer.add_histogram(f'gradients/{name}', param.grad, epoch)

print("Logged weight and gradient histograms")

In [None]:
# Log model graph

model = ModelForProfiling()
x = torch.randn(1, 3, 64, 64)

writer.add_graph(model, x)
print("Logged model graph")

In [None]:
# Log images

import torchvision
from torchvision.datasets import MNIST
import torchvision.transforms as T

# Load some MNIST images
transform = T.Compose([T.ToTensor()])
mnist = MNIST('../data', train=True, download=True, transform=transform)

# Create a grid of images
images = torch.stack([mnist[i][0] for i in range(16)])
grid = torchvision.utils.make_grid(images, nrow=4, normalize=True)

writer.add_image('mnist_samples', grid, 0)
print("Logged image grid")

In [None]:
# Log embeddings for visualization

# Get embeddings from model
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 128),
    nn.ReLU(),
    nn.Linear(128, 32),  # 32-dim embedding
)

# Get embeddings for 500 images
n_samples = 500
images = torch.stack([mnist[i][0] for i in range(n_samples)])
labels = [mnist[i][1] for i in range(n_samples)]

with torch.no_grad():
    embeddings = model(images)

# Log embeddings
writer.add_embedding(
    embeddings,
    metadata=labels,
    label_img=images,
    global_step=0,
    tag='mnist_embeddings'
)

print("Logged embeddings")

In [None]:
# Close writer
writer.close()
print("\nTensorBoard logging complete!")
print("Run: tensorboard --logdir=../runs")

### 4.2 Comprehensive Training Logger

In [None]:
class TrainingLogger:
    """
    Comprehensive logger for training experiments.
    Logs metrics, gradients, learning rate, and more to TensorBoard.
    """
    
    def __init__(self, log_dir: str, model: nn.Module = None):
        self.writer = SummaryWriter(log_dir)
        self.model = model
        self.step = 0
        self.epoch = 0
    
    def log_scalars(self, metrics: Dict[str, float], prefix: str = ""):
        """Log multiple scalar values"""
        for name, value in metrics.items():
            tag = f"{prefix}/{name}" if prefix else name
            self.writer.add_scalar(tag, value, self.step)
    
    def log_gradients(self):
        """Log gradient statistics for all parameters"""
        if self.model is None:
            return
        
        total_norm = 0.0
        for name, param in self.model.named_parameters():
            if param.grad is not None:
                grad_norm = param.grad.norm().item()
                total_norm += grad_norm ** 2
                self.writer.add_scalar(f'gradients/norm/{name}', grad_norm, self.step)
        
        total_norm = total_norm ** 0.5
        self.writer.add_scalar('gradients/total_norm', total_norm, self.step)
    
    def log_weights(self):
        """Log weight histograms"""
        if self.model is None:
            return
        
        for name, param in self.model.named_parameters():
            self.writer.add_histogram(f'weights/{name}', param, self.step)
    
    def log_learning_rate(self, optimizer):
        """Log learning rate from optimizer"""
        for i, param_group in enumerate(optimizer.param_groups):
            self.writer.add_scalar(f'learning_rate/group_{i}', param_group['lr'], self.step)
    
    def log_images(self, tag: str, images: torch.Tensor, nrow: int = 8):
        """Log image grid"""
        grid = torchvision.utils.make_grid(images, nrow=nrow, normalize=True)
        self.writer.add_image(tag, grid, self.step)
    
    def step_batch(self):
        """Increment batch step counter"""
        self.step += 1
    
    def step_epoch(self):
        """Increment epoch counter and log weights"""
        self.epoch += 1
        self.log_weights()
    
    def close(self):
        self.writer.close()


print("TrainingLogger class defined!")
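
`log_gradients` sums the squared per-parameter norms and takes the square root, which gives the L2 norm of all gradients viewed as one long vector. A small check of that identity on a made-up model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
loss = F.mse_loss(model(torch.randn(8, 10)), torch.randn(8, 1))
loss.backward()

grads = [p.grad for p in model.parameters() if p.grad is not None]

# Accumulate squared per-parameter norms, as log_gradients does
total_sq = sum(g.norm().item() ** 2 for g in grads)
total_norm = total_sq ** 0.5

# Reference: norm of all gradients concatenated into one vector
flat_norm = torch.cat([g.flatten() for g in grads]).norm().item()

print(f"{total_norm:.6f} vs {flat_norm:.6f}")
```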

---

## 5. Memory Debugging

### 5.1 GPU Memory Tracking

In [None]:
class GPUMemoryTracker:
    """
    Track GPU memory usage during training.
    """
    
    def __init__(self):
        self.snapshots = []
    
    def snapshot(self, label: str = ""):
        """Take a memory snapshot"""
        if not torch.cuda.is_available():
            return
        
        self.snapshots.append({
            'label': label,
            'allocated': torch.cuda.memory_allocated() / 1e6,
            'reserved': torch.cuda.memory_reserved() / 1e6,
            'max_allocated': torch.cuda.max_memory_allocated() / 1e6,
        })
    
    def print_snapshots(self):
        """Print all snapshots"""
        print("\nGPU Memory Snapshots:")
        print("-" * 70)
        print(f"{'Label':<30} {'Allocated':>12} {'Reserved':>12} {'Peak':>12}")
        print("-" * 70)
        for snap in self.snapshots:
            print(f"{snap['label']:<30} {snap['allocated']:>10.1f}MB {snap['reserved']:>10.1f}MB {snap['max_allocated']:>10.1f}MB")
    
    def plot(self):
        """Plot memory usage over time"""
        if not self.snapshots:
            print("No snapshots to plot")
            return
        
        labels = [s['label'] for s in self.snapshots]
        allocated = [s['allocated'] for s in self.snapshots]
        reserved = [s['reserved'] for s in self.snapshots]
        
        plt.figure(figsize=(12, 5))
        x = range(len(labels))
        plt.bar(x, reserved, alpha=0.5, label='Reserved')
        plt.bar(x, allocated, alpha=0.8, label='Allocated')
        plt.xticks(x, labels, rotation=45, ha='right')
        plt.ylabel('Memory (MB)')
        plt.title('GPU Memory Usage')
        plt.legend()
        plt.tight_layout()
        plt.show()


# Demo
if torch.cuda.is_available():
    tracker = GPUMemoryTracker()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    
    tracker.snapshot("Initial")
    
    model = ModelForProfiling().cuda()
    tracker.snapshot("After model creation")
    
    x = torch.randn(128, 3, 64, 64, device='cuda')
    tracker.snapshot("After input tensor")
    
    output = model(x)
    tracker.snapshot("After forward pass")
    
    loss = output.sum()
    loss.backward()
    tracker.snapshot("After backward pass")
    
    del x, output, loss
    torch.cuda.empty_cache()
    tracker.snapshot("After cleanup")
    
    tracker.print_snapshots()
    tracker.plot()
else:
    print("GPU not available")

### 5.2 Finding Memory Leaks

In [None]:
def find_tensors_in_memory():
    """
    Find all tensors currently in memory.
    Useful for debugging memory leaks.
    """
    tensors = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                tensors.append({
                    'shape': tuple(obj.shape),
                    'dtype': obj.dtype,
                    'device': str(obj.device),
                    'size_mb': obj.element_size() * obj.nelement() / 1e6,
                    'requires_grad': obj.requires_grad,
                    'has_grad': obj.grad is not None,
                })
        except Exception:
            # Some objects raise on attribute access during inspection; skip them
            pass
    
    return tensors


def summarize_tensors(tensors):
    """Summarize tensor memory usage"""
    total_size = sum(t['size_mb'] for t in tensors)
    by_device = {}
    
    for t in tensors:
        device = t['device']
        if device not in by_device:
            by_device[device] = {'count': 0, 'size_mb': 0}
        by_device[device]['count'] += 1
        by_device[device]['size_mb'] += t['size_mb']
    
    print(f"\nTotal tensors: {len(tensors)}")
    print(f"Total size: {total_size:.2f} MB")
    print("\nBy device:")
    for device, stats in by_device.items():
        print(f"  {device}: {stats['count']} tensors, {stats['size_mb']:.2f} MB")


# Demo
print("Before creating tensors:")
tensors_before = find_tensors_in_memory()
summarize_tensors(tensors_before)

# Create some tensors
a = torch.randn(1000, 1000)
b = torch.randn(1000, 1000, requires_grad=True)
c = a @ b

print("\nAfter creating tensors:")
tensors_after = find_tensors_in_memory()
summarize_tensors(tensors_after)

# Clean up
del a, b, c

### 5.3 Gradient Checkpointing for Memory Efficiency

In [None]:
from torch.utils.checkpoint import checkpoint

class MemoryEfficientModel(nn.Module):
    """
    Model using gradient checkpointing to reduce memory usage.
    Trades compute for memory - recomputes activations during backward.
    """
    
    def __init__(self, use_checkpointing: bool = False):
        super().__init__()
        self.use_checkpointing = use_checkpointing
        
        # Create a stack of MLP blocks (similar to a transformer's feed-forward sublayer)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(256, 1024),
                nn.ReLU(),
                nn.Linear(1024, 256),
            )
            for _ in range(8)
        ])
        
        self.head = nn.Linear(256, 10)
    
    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpointing and self.training:
                # Recompute activations during backward
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        
        return self.head(x)


# Compare memory usage
if torch.cuda.is_available():
    for use_checkpoint in [False, True]:
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        
        model = MemoryEfficientModel(use_checkpointing=use_checkpoint).cuda()
        x = torch.randn(64, 256, device='cuda')
        
        output = model(x)
        loss = output.sum()
        loss.backward()
        
        peak_memory = torch.cuda.max_memory_allocated() / 1e6
        checkpoint_str = "with" if use_checkpoint else "without"
        print(f"Peak memory {checkpoint_str} checkpointing: {peak_memory:.1f} MB")
        
        del model, x, output, loss
else:
    print("GPU not available for checkpointing demo")
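
Checkpointing should change only memory use, not the computed values. As a sanity check, the CPU-only sketch below compares parameter gradients with and without `checkpoint` on one small block. The layer sizes and the `param_grads` helper are illustrative assumptions, not part of the model above.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Small stand-in block (sizes are arbitrary, chosen for illustration)
block = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
x = torch.randn(4, 16)


def param_grads(use_checkpoint: bool):
    """Run forward/backward once and return a copy of every parameter gradient."""
    block.zero_grad()
    if use_checkpoint:
        # Activations inside the block are recomputed during backward
        out = checkpoint(block, x, use_reentrant=False)
    else:
        out = block(x)
    out.sum().backward()
    return [p.grad.clone() for p in block.parameters()]


plain = param_grads(use_checkpoint=False)
checkpointed = param_grads(use_checkpoint=True)

grads_match = all(torch.allclose(a, b) for a, b in zip(plain, checkpointed))
print(f"Gradients match: {grads_match}")
```

If the gradients ever differ, a likely cause is a non-deterministic layer inside the checkpointed region, since the forward pass is executed a second time during backward.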

---

## Exercises

### Exercise 1: Debug a Broken Training Loop

Find and fix all bugs in the training loop below.

In [None]:
# Exercise 1: Fix the bugs in this training loop

def buggy_training():
    # Setup
    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    
    # Data
    X = torch.randn(100, 10)
    y = torch.randint(0, 2, (100,))
    
    losses = []
    
    for epoch in range(10):
        # BUG 1: Something wrong with the forward pass
        output = model(X.cuda())  # Hint: is model on CUDA?
        
        # BUG 2: Wrong loss function usage
        loss = F.cross_entropy(output, y.float())  # Hint: check y dtype
        
        # BUG 3: Missing something before backward
        loss.backward()
        
        optimizer.step()
        
        # BUG 4: Memory leak
        losses.append(loss)  # Hint: computation graph?
        
        print(f"Epoch {epoch}: Loss = {loss}")
    
    return losses

# YOUR TASK: Fix all bugs and make this run correctly
# Try running it first to see the errors!

# try:
#     losses = buggy_training()
# except Exception as e:
#     print(f"Error: {e}")

### Exercise 2: Profile and Optimize

Use the profiler to identify bottlenecks and suggest optimizations.

In [None]:
# Exercise 2: Profile this model and identify bottlenecks

class SlowModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1000, 1000)
        self.fc2 = nn.Linear(1000, 1000)
        self.fc3 = nn.Linear(1000, 10)
    
    def forward(self, x):
        # Inefficient: every activation makes a round trip through the CPU
        x = self.fc1(x)
        x = x.cpu()  # Unnecessary device transfer!
        x = F.relu(x)
        x = x.cuda() if torch.cuda.is_available() else x
        
        x = self.fc2(x)
        x = x.cpu()  # Another unnecessary transfer!
        x = F.relu(x)
        x = x.cuda() if torch.cuda.is_available() else x
        
        x = self.fc3(x)
        return x

# YOUR TASK:
# 1. Profile this model using PyTorch Profiler
# 2. Identify the bottlenecks
# 3. Create an optimized version

### Exercise 3: Custom Training Monitor

Implement a training monitor that logs to both console and TensorBoard.

In [None]:
# Exercise 3: Implement a comprehensive training monitor

class TrainingMonitor:
    """
    Monitor training progress with:
    - Loss tracking and early stopping detection
    - Gradient norm monitoring
    - NaN/Inf detection
    - TensorBoard logging
    - Console progress output
    """
    
    def __init__(self, model: nn.Module, patience: int = 5, log_dir: Optional[str] = None):
        # YOUR CODE HERE
        pass
    
    def on_batch_end(self, loss: torch.Tensor, batch_idx: int) -> bool:
        """Called after each training batch. Return False if training should abort (NaN/Inf loss)."""
        # YOUR CODE HERE
        # - Check for NaN/Inf in loss
        # - Log gradient norms
        # - Update running average
        pass
    
    def on_epoch_end(self, val_loss: float, epoch: int):
        """Called after each epoch"""
        # YOUR CODE HERE
        # - Check for early stopping
        # - Log to TensorBoard
        # - Print progress
        pass
    
    def should_stop(self) -> bool:
        """Returns True if training should stop early"""
        # YOUR CODE HERE
        pass

---

## Solutions

In [None]:
# Solution 1: Fixed training loop

def fixed_training():
    # Setup - keep everything on same device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = nn.Linear(10, 2).to(device)  # FIX 1: Move model to device
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    
    # Data
    X = torch.randn(100, 10).to(device)  # Move data to device
    y = torch.randint(0, 2, (100,)).to(device)  # FIX 2: Keep as long, move to device
    
    losses = []
    
    for epoch in range(10):
        output = model(X)
        
        # FIX 2: y should be Long tensor for cross_entropy
        loss = F.cross_entropy(output, y)
        
        # FIX 3: Zero gradients before backward!
        optimizer.zero_grad()
        loss.backward()
        
        optimizer.step()
        
        # FIX 4: Use .item() to avoid memory leak
        losses.append(loss.item())
        
        print(f"Epoch {epoch}: Loss = {loss.item():.4f}")
    
    return losses

losses = fixed_training()
print(f"\nFinal loss: {losses[-1]:.4f}")
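
To see why FIX 3 matters, the sketch below shows that `backward()` adds to existing `.grad` values instead of overwriting them. It reuses the exercise's `nn.Linear(10, 2)` shape; the `backward_once` helper and the use of a plain sum as the loss are assumptions made to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
X = torch.randn(100, 10)


def backward_once() -> None:
    """One forward/backward pass with a simple sum as a stand-in loss."""
    model(X).sum().backward()


backward_once()
grad_first = model.weight.grad.clone()

backward_once()  # no zero_grad(): the new gradient is added to the old one
grad_accumulated = model.weight.grad.clone()

model.zero_grad()  # reset, as optimizer.zero_grad() would
backward_once()
grad_after_reset = model.weight.grad.clone()

accumulated_is_double = torch.allclose(grad_accumulated, 2 * grad_first)
reset_restores_single = torch.allclose(grad_after_reset, grad_first)
print(f"Accumulated == 2x first: {accumulated_is_double}")
print(f"After zero_grad == first: {reset_restores_single}")
```

Without the reset, each optimizer step would use the sum of all gradients seen so far, which behaves like a steadily growing learning rate.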

In [None]:
# Solution 2: Optimized model

class OptimizedModel(nn.Module):
    """Optimized version without unnecessary device transfers"""
    
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1000, 1000)
        self.fc2 = nn.Linear(1000, 1000)
        self.fc3 = nn.Linear(1000, 10)
    
    def forward(self, x):
        # No device transfers - stay on same device throughout
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


# Compare performance
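if torch.cuda.is_available():
    import time

    def time_forward(model: nn.Module, x: torch.Tensor, n_iters: int = 100) -> float:
        """Time n_iters forward passes on the GPU, excluding warm-up iterations."""
        with torch.no_grad():  # inference only, no autograd bookkeeping
            # Warm-up so one-time CUDA initialization is not counted
            for _ in range(10):
                _ = model(x)
            torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(n_iters):
                _ = model(x)
            torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
        return time.perf_counter() - start

    x = torch.randn(64, 1000, device='cuda')

    # Slow model
    slow_model = SlowModel().cuda()
    slow_time = time_forward(slow_model, x)

    # Optimized model
    fast_model = OptimizedModel().cuda()
    fast_time = time_forward(fast_model, x)

    print(f"Slow model: {slow_time:.3f}s")
    print(f"Optimized model: {fast_time:.3f}s")
    print(f"Speedup: {slow_time / fast_time:.1f}x")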
else:
    print("GPU not available for benchmark")
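
Wall-clock timing shows that the optimized model is faster, but not why. A minimal profiling sketch follows; it uses a small stand-in `nn.Sequential` and a hypothetical `forward_pass` label so that it also runs on CPU-only machines. Profiling `SlowModel` on a GPU the same way should surface the transfers as rows such as `aten::to` and `aten::copy_`.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Stand-in model (an assumption for illustration; swap in SlowModel to profile it)
model = nn.Sequential(nn.Linear(1000, 1000), nn.ReLU(), nn.Linear(1000, 10))
x = torch.randn(64, 1000)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities) as prof:
    with record_function("forward_pass"):  # hypothetical region label
        with torch.no_grad():
            model(x)

events = prof.key_averages()
op_names = [evt.key for evt in events]

# Most expensive operators first
print(events.table(sort_by="cpu_time_total", row_limit=10))
```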

In [None]:
# Solution 3: Complete Training Monitor

class TrainingMonitor:
    """
    Comprehensive training monitor.
    """
    
    def __init__(self, model: nn.Module, patience: int = 5, log_dir: Optional[str] = None):
        self.model = model
        self.patience = patience
        self.best_val_loss = float('inf')
        self.patience_counter = 0
        self.running_loss = 0.0
        self.batch_count = 0
        self.global_step = 0
        
        if log_dir:
            self.writer = SummaryWriter(log_dir)
        else:
            self.writer = None
    
    def on_batch_end(self, loss: torch.Tensor, batch_idx: int) -> bool:
        """Called after each training batch. Returns False if the loss is NaN/Inf, True otherwise."""
        loss_value = loss.item()
        
        # Check for NaN/Inf
        if np.isnan(loss_value) or np.isinf(loss_value):
            print(f"WARNING: Loss is {loss_value} at batch {batch_idx}!")
            return False  # Signal to stop
        
        # Update running average
        self.running_loss += loss_value
        self.batch_count += 1
        self.global_step += 1
        
        # Log gradient norms
        total_norm = 0.0
        for param in self.model.parameters():
            if param.grad is not None:
                total_norm += param.grad.norm().item() ** 2
        total_norm = total_norm ** 0.5
        
        if total_norm > 100:
            print(f"WARNING: Large gradient norm ({total_norm:.1f}) at batch {batch_idx}")
        
        # TensorBoard logging
        if self.writer:
            self.writer.add_scalar('train/loss', loss_value, self.global_step)
            self.writer.add_scalar('train/grad_norm', total_norm, self.global_step)
        
        return True
    
    def on_epoch_end(self, val_loss: float, epoch: int):
        """Called after each epoch"""
        avg_train_loss = self.running_loss / max(self.batch_count, 1)
        
        # Print progress
        print(f"Epoch {epoch}: Train Loss = {avg_train_loss:.4f}, Val Loss = {val_loss:.4f}")
        
        # TensorBoard logging
        if self.writer:
            self.writer.add_scalar('epoch/train_loss', avg_train_loss, epoch)
            self.writer.add_scalar('epoch/val_loss', val_loss, epoch)
        
        # Early stopping check
        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
            self.patience_counter = 0
            print("  New best validation loss!")
        else:
            self.patience_counter += 1
            print(f"  No improvement ({self.patience_counter}/{self.patience})")
        
        # Reset running stats
        self.running_loss = 0.0
        self.batch_count = 0
    
    def should_stop(self) -> bool:
        """Returns True if training should stop early"""
        return self.patience_counter >= self.patience
    
    def close(self):
        if self.writer:
            self.writer.close()


# Test the monitor
model = nn.Linear(10, 2)
monitor = TrainingMonitor(model, patience=3)

# Simulate training
for epoch in range(10):
    # Simulate batches
    for batch_idx in range(5):
        fake_loss = torch.tensor(1.0 / (epoch + 1) + 0.1 * np.random.randn())
        monitor.on_batch_end(fake_loss, batch_idx)
    
    # Simulate validation
    # Validation loss plateaus from epoch 3 onward, so early stopping is exercised
    val_loss = max(1.0 / (epoch + 1), 0.25) + 0.15
    monitor.on_epoch_end(val_loss, epoch)
    
    if monitor.should_stop():
        print("\nEarly stopping triggered!")
        break

monitor.close()
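
The gradient norm computed by hand in `on_batch_end` can be cross-checked against `torch.nn.utils.clip_grad_norm_`, which returns the total norm of the gradients it was given. The sketch below assumes a tiny model and passes `max_norm=float("inf")` so that nothing is actually clipped.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)
model(torch.randn(8, 10)).sum().backward()

# Manual computation: L2 norm over all parameter gradients
manual_norm = sum(
    p.grad.norm().item() ** 2 for p in model.parameters() if p.grad is not None
) ** 0.5

# Built-in: returns the total norm; an infinite max_norm leaves gradients unchanged
builtin_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=float("inf")
).item()

norms_agree = math.isclose(manual_norm, builtin_norm, rel_tol=1e-4)
print(f"manual={manual_norm:.4f}, built-in={builtin_norm:.4f}, agree={norms_agree}")
```

In a real training loop, passing a finite `max_norm` to the same call would both log and clip in one step.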

---

## Summary

### Key Takeaways

1. **Common Bugs**:
   - Shape mismatches: Always check tensor dimensions
   - Device mismatches: Keep all tensors on the same device
   - NaN/Inf values: Guard `log`, `sqrt`, and division against zero or negative inputs, and clip gradients
   - Memory leaks: Use `.item()` for scalars, `torch.no_grad()` for eval

2. **Debugging Tools**:
   - `torch.autograd.detect_anomaly()` for gradient issues
   - Gradient hooks for inspecting gradients during the backward pass
   - Shape assertions for catching dimension errors early

3. **Profiling**:
   - PyTorch Profiler for CPU/GPU time and memory
   - `record_function()` for custom region labeling
   - Memory snapshots for tracking allocation

4. **TensorBoard**:
   - Log scalars, histograms, images, graphs
   - Track gradients and weights over time
   - Visualize embeddings

5. **Memory Management**:
   - Use `torch.cuda.memory_allocated()` to track usage
   - Gradient checkpointing trades compute for memory
   - `gc.collect()` and `torch.cuda.empty_cache()` for cleanup
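
The `.item()` advice from the first takeaway can be checked directly: a stored loss tensor keeps its `grad_fn`, and with it the whole computation graph, while a Python float or a detached tensor does not. The list names in this sketch are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss = model(torch.randn(8, 10)).pow(2).mean()

history_leaky = [loss]              # keeps loss.grad_fn, and with it the graph
history_float = [loss.item()]       # plain Python float, no graph attached
history_detached = [loss.detach()]  # tensor sharing the value, but without a graph

holds_graph = history_leaky[0].grad_fn is not None
detached_is_free = history_detached[0].grad_fn is None

print(f"Stored tensor holds a graph: {holds_graph}")
print(f"Detached tensor holds no graph: {detached_is_free}")
print(f"Type stored via .item(): {type(history_float[0]).__name__}")
```

Note that `.item()` forces a GPU synchronization, so `.detach()` can be the cheaper choice when values are logged only occasionally.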

### Quick Reference

```python
# Check for NaN/Inf
torch.isnan(tensor).any()
torch.isinf(tensor).any()

# Detect anomalies in backward
with torch.autograd.detect_anomaly():
    loss.backward()

# Profile code
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)
print(prof.key_averages().table())

# TensorBoard
writer = SummaryWriter('runs/exp')
writer.add_scalar('loss', value, step)

# GPU memory
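torch.cuda.memory_allocated()  # Bytes currently held by live tensors
torch.cuda.max_memory_allocated()  # Peak bytes since the last reset
torch.cuda.empty_cache()  # Return unused cached blocks to the driver (live tensors are not freed)
```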
```
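
As a small illustration of the `detect_anomaly` entry above, the sketch below builds a gradient that is NaN by construction: the backward of `sqrt` at zero divides by zero. The toy expression is an assumption chosen only to trigger the problem; the point is that plain `backward()` stays silent while anomaly mode raises at the offending operation.

```python
import torch

x = torch.tensor([0.0], requires_grad=True)

# Without anomaly detection, the NaN lands silently in x.grad
y = (torch.sqrt(x) * 0.0).sum()
y.backward()
silent_nan = torch.isnan(x.grad).any().item()
print(f"x.grad contains NaN: {silent_nan}")

# With anomaly detection, backward raises and names the failing function
x.grad = None
raised = False
try:
    with torch.autograd.detect_anomaly():
        y2 = (torch.sqrt(x) * 0.0).sum()
        y2.backward()
except RuntimeError as err:
    raised = True
    print(f"Caught: {err}")
```

Anomaly mode slows training noticeably, so it is best enabled only while hunting a specific NaN.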