# PyTorch Autograd: Automatic Differentiation

## What is Autograd?

"_Neural networks (NNs) are a collection of nested functions that are executed on some input data. These functions are defined by **parameters** (consisting of weights and biases), which in PyTorch are stored in tensors._"

> Reference: https://docs.pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

## The Two-Step Training Process

Training a NN happens in two steps:

### 1. Forward Propagation
In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess.

### 2. Backward Propagation (Backprop)
In backprop, the NN adjusts its parameters proportionate to the error in its guess. It does this by:
- Traversing backwards from the output
- Collecting the derivatives of the error with respect to the parameters (gradients)
- Optimizing the parameters using gradient descent
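
The two steps above can be sketched on a one-parameter model before moving to a real network. This is a minimal illustration, not part of the original notebook: the model `prediction = w * x`, the squared-error loss, and the learning rate `lr` are all assumptions chosen to keep the arithmetic easy to follow.

```python
import torch

# Hypothetical one-parameter "network": prediction = w * x
w = torch.tensor(1.0, requires_grad=True)
x, target = torch.tensor(2.0), torch.tensor(6.0)
lr = 0.1  # learning rate chosen arbitrarily for illustration

# Forward propagation: make a guess and measure its error
prediction = w * x
loss = (prediction - target) ** 2

# Backward propagation: collect d(loss)/dw by traversing the graph from the output
loss.backward()
grad_w = w.grad.item()  # 2 * (w*x - target) * x = 2 * (2 - 6) * 2 = -16

# Gradient descent: step against the gradient
with torch.no_grad():
    w -= lr * w.grad

new_loss = ((w * x - target) ** 2).item()
print(f"gradient = {grad_w}, updated w = {w.item():.2f}, loss {loss.item():.2f} -> {new_loss:.2f}")
```

The update is wrapped in `torch.no_grad()` so that the weight change itself is not recorded as part of the computation graph.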

---

## Basic Example: Training Loop with ResNet

In [90]:
import torch

# import pretrained model
from torchvision.models import resnet18, ResNet18_Weights

In [91]:
# =============================================================================
# STEP 1: Load a pretrained model and create dummy data
# =============================================================================

# Load ResNet18 with pretrained weights (trained on ImageNet)
model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Create random input data simulating a batch of images
# Shape: (batch_size, channels, height, width)
data = torch.rand(1, 3, 64, 64)  # 1 image, 3 color channels (RGB), 64x64 pixels

# Create random labels (ImageNet has 1000 classes)
# Shape: (batch_size, num_classes)
labels = torch.rand(1, 1000)  # 1 sample with 1000 class scores

# Inspect what we've created
print("Model parameters:", len(list(model.parameters())), "tensors")
print("Data shape:", data.shape, "->", data.ndim, "dimensions")
print("Labels shape:", labels.shape, "->", labels.ndim, "dimensions")

Model parameters: 62 tensors
Data shape: torch.Size([1, 3, 64, 64]) -> 4 dimensions
Labels shape: torch.Size([1, 1000]) -> 2 dimensions


In [92]:
# =============================================================================
# STEP 2: Forward Pass - Get model predictions
# =============================================================================

# Pass input through the model to get predictions
# This builds the computation graph that autograd will use for backprop
prediction = model(data)

print("Prediction shape:", prediction.shape)  # [1, 1000] - scores for each class
print("Prediction range:", f"[{prediction.min().item():.2f}, {prediction.max().item():.2f}]")

# Note: prediction has grad_fn because it's the result of differentiable operations
print("Has gradient function:", prediction.grad_fn is not None)

Prediction shape: torch.Size([1, 1000])
Prediction range: [-2.61, 2.59]
Has gradient function: True


In [93]:
# =============================================================================
# STEP 3: Compute the Loss
# =============================================================================

# Calculate loss (error) between predictions and actual labels
# Note: This is a simplified loss for demonstration. In practice, use:
# - nn.CrossEntropyLoss() for classification
# - nn.MSELoss() for regression
loss = (prediction - labels).sum()

print("Loss tensor:", loss)
print(f"Loss value: {loss.item():.4f}")

# Understanding grad_fn:
# The 'grad_fn=<SumBackward0>' shows this tensor was created by a sum operation.
# PyTorch tracks this computation history (the "computation graph") so it can
# automatically compute gradients during backpropagation.
# 
# When you call loss.backward(), PyTorch:
# 1. Starts from this loss tensor
# 2. Follows the grad_fn chain backwards through all operations
# 3. Computes gradients for all tensors with requires_grad=True

Loss tensor: tensor(-500.8209, grad_fn=<SumBackward0>)
Loss value: -500.8209
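
The comment in the cell above recommends `nn.CrossEntropyLoss()` for classification. A minimal sketch of what that looks like follows, assuming made-up logits for 3 classes and integer class-index targets (the names `logits` and `class_targets` are illustrative and do not appear in the notebook). It also compares the result with a hand-written negative log-likelihood to show what the loss computes.

```python
import torch
import torch.nn as nn

# Illustrative logits for a batch of 2 samples over 3 classes (made-up values)
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0]], requires_grad=True)
# Class-index targets: sample 0 belongs to class 0, sample 1 to class 2
class_targets = torch.tensor([0, 2])

criterion = nn.CrossEntropyLoss()  # expects raw logits, applies log-softmax internally
ce_loss = criterion(logits, class_targets)

# Manual equivalent: mean negative log-probability of the correct class
log_probs = torch.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(2), class_targets].mean()

ce_loss.backward()
print(f"CrossEntropyLoss = {ce_loss.item():.4f}, manual = {manual_loss.item():.4f}")
print(f"logits.grad shape: {tuple(logits.grad.shape)}")
```

Unlike the signed sum used for demonstration, this loss is always non-negative and shrinks as the correct class receives more probability.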


In [94]:
# =============================================================================
# STEP 4: Backward Pass - Compute Gradients
# =============================================================================

# This is where the magic happens!
# backward() computes the gradient of loss with respect to ALL model parameters
loss.backward()

# After this call, each parameter tensor in the model has a .grad attribute
# containing the gradient (partial derivative) of the loss with respect to that parameter

# Let's verify gradients were computed
# model.parameters() returns an iterator over all learnable parameters
# next() gets the first one (ResNet18's first conv layer: 64 filters, 3 channels, 7x7 kernel)
first_param = next(model.parameters())

print(f"First parameter shape: {first_param.shape}")  # torch.Size([64, 3, 7, 7])
print(f"Gradients computed: {first_param.grad is not None}")

# The gradient has the SAME shape as the parameter
# Each weight gets its own gradient value telling it how to change to reduce loss
print(f"Gradient shape: {first_param.grad.shape}")  # torch.Size([64, 3, 7, 7])
print(f"Gradient shape matches parameter: {first_param.grad.shape == first_param.shape}")

# Recap:
#   - After backward(), every parameter that requires gradients has a .grad attribute.
#   - .grad has the same shape as the parameter: one partial derivative per weight.
#   - Each entry is the local slope of the loss with respect to that weight;
#     the optimizer steps in the opposite direction to reduce the loss.

First parameter shape: torch.Size([64, 3, 7, 7])
Gradients computed: True
Gradient shape: torch.Size([64, 3, 7, 7])
Gradient shape matches parameter: True


In [95]:
# =============================================================================
# STEP 5: Create Optimizer
# =============================================================================

# The optimizer uses the computed gradients to update model parameters
# SGD = Stochastic Gradient Descent

# Key hyperparameters:
# - lr (learning rate): Step size for each update (smaller = more stable, slower)
# - momentum: Helps accelerate SGD and dampen oscillations

optim = torch.optim.SGD(params=model.parameters(), lr=1e-2, momentum=0.9)

print("Optimizer:", optim)
print("\nOptimizer tracks", len(optim.param_groups[0]['params']), "parameter tensors")

Optimizer: SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    fused: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)

Optimizer tracks 62 parameter tensors


In [96]:
# =============================================================================
# STEP 6: Update Weights (Gradient Descent)
# =============================================================================

# Get a parameter's value before update
# - model.parameters() returns a generator/iterator over all parameters
# - next() gets the first parameter from that iterator (conv1 weights)
# - detach(): Removes from computation graph (no grad_fn)
# - clone(): Creates a copy in new memory so we keep the old values
param_before = next(model.parameters()).detach().clone()

# optim.step() updates all parameters using the computed gradients
# For plain SGD: new_param = old_param - lr * gradient
# With momentum, a running velocity replaces the raw gradient (on the first step the two coincide)
optim.step()

# Verify parameters changed
param_after = next(model.parameters()).detach()

# Check whether any value in the tensor changed
print(f"Parameters updated: {not torch.equal(param_before, param_after)}")

# Show the actual difference - small changes are expected with lr=0.01
diff = (param_after - param_before).abs()
print(f"Max change: {diff.max().item():.6f}")
print(f"Mean change: {diff.mean().item():.6f}")
print(f"Num values changed: {(diff > 0).sum().item()} / {diff.numel()}")

# IMPORTANT: In a real training loop, you should also call:
# optim.zero_grad()  # Clear old gradients before next backward pass
# This is covered in detail later in this notebook

Parameters updated: True
Max change: 0.000100
Mean change: 0.000008
Num values changed: 8228 / 9408
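
To make the role of `momentum` concrete, the sketch below reproduces a few SGD-with-momentum steps by hand and compares them with `torch.optim.SGD`. It assumes the default settings (no dampening, no Nesterov, no weight decay), a toy loss `p^2`, and illustrative hyperparameters; `p_manual` and `velocity` are names introduced here only for the comparison.

```python
import torch

lr, momentum = 0.1, 0.9  # illustrative values, not the ones used above

# Parameter updated by torch.optim.SGD
p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([p], lr=lr, momentum=momentum)

# The same parameter updated by hand
p_manual = 1.0
velocity = None

for step in range(3):
    # Loss = p^2, so the gradient is 2p
    opt.zero_grad()
    (p ** 2).sum().backward()
    opt.step()

    grad = 2 * p_manual
    # Velocity starts as the first gradient, then decays by `momentum` each step
    velocity = grad if velocity is None else momentum * velocity + grad
    p_manual -= lr * velocity

    print(f"step {step + 1}: optimizer p = {p.item():.4f}, manual p = {p_manual:.4f}")
```

On the first step the velocity equals the gradient, so the update matches plain SGD; from the second step onward, earlier gradients keep contributing.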


---

## Understanding Gradient Computation

Let's see exactly how autograd computes gradients using a simple mathematical example.

> Reference: https://docs.pytorch.org/docs/stable/notes/autograd.html

### Mathematical Example

We'll compute gradients for: **Q = 3a^3 - b^2**

Using calculus:
- dQ/da = 9a^2
- dQ/db = -2b
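
Before involving autograd, the two derivatives above can be sanity-checked numerically. The sketch below uses a central finite difference in plain Python; the helper `central_difference`, the step size `h`, and the evaluation point are assumptions made for this illustration.

```python
def Q(a, b):
    """Scalar version of Q = 3a^3 - b^2."""
    return 3 * a**3 - b**2

def central_difference(f, x, h=1e-5):
    """Numerically approximate df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

a0, b0 = 2.0, 6.0  # arbitrary evaluation point

numeric_da = central_difference(lambda a: Q(a, b0), a0)
numeric_db = central_difference(lambda b: Q(a0, b), b0)

print(f"dQ/da: numeric = {numeric_da:.4f}, analytic 9a^2 = {9 * a0**2:.4f}")
print(f"dQ/db: numeric = {numeric_db:.4f}, analytic -2b  = {-2 * b0:.4f}")
```

Finite differences only approximate the derivative and need one function evaluation pair per input, which is why autograd's exact, single-pass gradients are preferred for training.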

In [97]:
# Create tensors with requires_grad=True to track computations
# These are our "learnable parameters"

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

print("Input tensors:")
print(f"a = {a.detach().tolist()}")  # [2.0, 3.0]
print(f"b = {b.detach().tolist()}")  # [6.0, 4.0]

Input tensors:
a = [2.0, 3.0]
b = [6.0, 4.0]


In [98]:
# Compute Q = 3*a^3 - b^2
# PyTorch builds a computation graph as we perform operations

Q = 3*a**3 - b**2

# Q[0] = 3*(2^3) - (6^2) = -12
# Q[1] = 3*(3^3) - (4^2) = 65

print("Q =", Q)
print(f"Q.shape = {Q.shape}")  # torch.Size([2])
print(f"Q.grad_fn = {Q.grad_fn}")  # Shows the operation that created Q

Q = tensor([-12.,  65.], grad_fn=<SubBackward0>)
Q.shape = torch.Size([2])
Q.grad_fn = <SubBackward0 object at 0x70d4daba6e00>


In [99]:
# Call backward to compute gradients
# Since Q is a vector (not a scalar), we need to reduce it first
# Using Q.sum() effectively means we want dQ_total/da and dQ_total/db
# where Q_total = Q[0] + Q[1]

Q.sum().backward()

# Alternative: pass a gradient vector to backward()
# external_grad = torch.tensor([1., 1.])
# Q.backward(gradient=external_grad)  # Same result as Q.sum().backward()

In [100]:
# Verify the gradients are mathematically correct
# dQ/da = 9*a^2, dQ/db = -2*b

print("Verifying gradients:")
print("-" * 40)

# For a = [2, 3]: dQ/da = 9*a^2 = [9*4, 9*9] = [36, 81]
expected_a_grad = 9 * a**2
print(f"Expected dQ/da = 9*a^2 = {expected_a_grad.data.tolist()}")
print(f"Computed a.grad       = {a.grad.tolist()}")
print(f"Match: {torch.allclose(a.grad, expected_a_grad.detach())}")

print()

# For b = [6, 4]: dQ/db = -2*b = [-12, -8]
expected_b_grad = -2 * b
print(f"Expected dQ/db = -2*b = {expected_b_grad.data.tolist()}")
print(f"Computed b.grad       = {b.grad.tolist()}")
print(f"Match: {torch.allclose(b.grad, expected_b_grad.detach())}")

Verifying gradients:
----------------------------------------
Expected dQ/da = 9*a^2 = [36.0, 81.0]
Computed a.grad       = [36.0, 81.0]
Match: True

Expected dQ/db = -2*b = [-12.0, -8.0]
Computed b.grad       = [-12.0, -8.0]
Match: True
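
The `gradient` argument of `backward()` mentioned earlier is the vector in a vector-Jacobian product. The sketch below, which redefines `a` and `b` so that it runs on its own, checks that a vector of ones reproduces `Q.sum().backward()` and shows how non-uniform weights scale each element's contribution. The weights `[1.0, 0.5]` are arbitrary.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

# Option 1: reduce to a scalar, then call backward()
Q = 3 * a**3 - b**2
Q.sum().backward()
grad_from_sum = a.grad.clone()

# Option 2: pass a vector of ones as the upstream gradient
a.grad = None  # discard the previous result so it does not accumulate
b.grad = None
Q = 3 * a**3 - b**2  # rebuild the graph, since the first backward() freed it
Q.backward(gradient=torch.ones_like(Q))
grad_from_ones = a.grad.clone()

# Non-uniform weights scale each element's contribution to the result
a.grad = None
Q = 3 * a**3 - b**2
Q.backward(gradient=torch.tensor([1.0, 0.5]))
grad_weighted = a.grad.clone()

print(grad_from_sum, grad_from_ones, grad_weighted)
```

With weights `[1.0, 0.5]`, the second entry of `a.grad` is half of `9 * 3^2 = 81`.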


---

## Gradient Propagation Rules

### requires_grad Propagation

The output tensor of an operation will require gradients if **any** input tensor has `requires_grad=True`.

In [101]:
# Demonstrating requires_grad propagation

x = torch.rand(5, 5)                           # No gradients
y = torch.rand(5, 5)                           # No gradients
z = torch.rand((5, 5), requires_grad=True)     # Requires gradients

# x + y: Neither input requires gradients -> output doesn't require gradients
a = x + y
print(f"a = x + y")
print(f"  x.requires_grad = {x.requires_grad}")
print(f"  y.requires_grad = {y.requires_grad}")
print(f"  a.requires_grad = {a.requires_grad}")  # False

print()

# x + z: One input requires gradients -> output requires gradients
b = x + z
print(f"b = x + z")
print(f"  x.requires_grad = {x.requires_grad}")
print(f"  z.requires_grad = {z.requires_grad}")
print(f"  b.requires_grad = {b.requires_grad}")  # True
print(f"  b.grad_fn = {b.grad_fn}")

a = x + y
  x.requires_grad = False
  y.requires_grad = False
  a.requires_grad = False

b = x + z
  x.requires_grad = False
  z.requires_grad = True
  b.requires_grad = True
  b.grad_fn = <AddBackward0 object at 0x70d4daba7370>


---

## Freezing Parameters (Transfer Learning)

In transfer learning / fine-tuning, we often want to:
1. **Freeze** most of the pretrained model (no gradient updates)
2. **Train only** the final classification layer(s)

This saves computation and prevents overfitting when you have limited data.

In [102]:
# Load a fresh pretrained model
model = resnet18(weights=ResNet18_Weights.DEFAULT)

# numel() returns the total number of elements in a tensor.
# Each parameter p is a tensor with a shape (e.g., a weight matrix of shape (512, 256) has 131,072 elements). 
# Using numel() counts all scalar values, giving you the actual parameter count.
# Without it, you'd just count the number of parameter tensors, not the total trainable parameters.

# Check initial state - all parameters require gradients
trainable_before = 0
for p in model.parameters():
    if p.requires_grad:
        trainable_before += p.numel()
print(f"Trainable parameters before freezing: {trainable_before:,}")

# Freeze ALL parameters in the network
for param in model.parameters():
    param.requires_grad = False

trainable_after = 0
for p in model.parameters():
    if p.requires_grad:
        trainable_after += p.numel()
print(f"Trainable parameters after freezing: {trainable_after:,}")

frozen_params = [p for p in model.parameters() if not p.requires_grad]
print(f"Total frozen parameters: {len(frozen_params)}")

# Show the frozen parameters by name
print("\nSample frozen parameters:")
for name, param in list(model.named_parameters()):
    print(f"  {name}: requires_grad={param.requires_grad}")

Trainable parameters before freezing: 11,689,512
Trainable parameters after freezing: 0
Total frozen parameters: 62

Sample frozen parameters:
  conv1.weight: requires_grad=False
  bn1.weight: requires_grad=False
  bn1.bias: requires_grad=False
  layer1.0.conv1.weight: requires_grad=False
  layer1.0.bn1.weight: requires_grad=False
  layer1.0.bn1.bias: requires_grad=False
  layer1.0.conv2.weight: requires_grad=False
  layer1.0.bn2.weight: requires_grad=False
  layer1.0.bn2.bias: requires_grad=False
  layer1.1.conv1.weight: requires_grad=False
  layer1.1.bn1.weight: requires_grad=False
  layer1.1.bn1.bias: requires_grad=False
  layer1.1.conv2.weight: requires_grad=False
  layer1.1.bn2.weight: requires_grad=False
  layer1.1.bn2.bias: requires_grad=False
  layer2.0.conv1.weight: requires_grad=False
  layer2.0.bn1.weight: requires_grad=False
  layer2.0.bn1.bias: requires_grad=False
  layer2.0.conv2.weight: requires_grad=False
  layer2.0.bn2.weight: requires_grad=False
  layer2.0.bn2.bias: r

In [103]:
import torch.nn as nn
from torch import optim

# Replace the classifier (the last layer, model.fc, whose weight and bias are currently frozen) with a new one for our task
# Original: 512 features -> 1000 ImageNet classes
# New: 512 features -> 10 classes (for our custom dataset)

print(f"Original classifier: {model.fc}")

model.fc = nn.Linear(512, 10)  # New layer - requires_grad=True by default!

print(f"New classifier: {model.fc}")

# Count trainable parameters - only the new fc layer
trainable_final = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTrainable parameters after replacing fc: {trainable_final:,}")
print(f"(That's {model.fc.weight.numel()} weights + {model.fc.bias.numel()} biases)")

Original classifier: Linear(in_features=512, out_features=1000, bias=True)
New classifier: Linear(in_features=512, out_features=10, bias=True)

Trainable parameters after replacing fc: 5,130
(That's 5120 weights + 10 biases)


In [104]:
# Create optimizer - even though we pass all parameters,
# only those with requires_grad=True will be updated

optimizer = optim.SGD(params=model.parameters(), lr=1e-2, momentum=0.9)

# Let's verify: check which parameters will actually be updated
trainable_params = [p for p in model.parameters() if p.requires_grad]
frozen_params = [p for p in model.parameters() if not p.requires_grad]

print(f"Parameters passed to optimizer: {len(list(model.parameters()))}")
print(f"Parameters that will be updated: {len(trainable_params)}")
print(f"Parameters that are frozen: {len(frozen_params)}")

# Better practice: only pass trainable parameters to optimizer
# This is more efficient and clearer:
# optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)

print("\nTrainable parameter names:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  {name}: shape={list(param.shape)}")

Parameters passed to optimizer: 62
Parameters that will be updated: 2
Parameters that are frozen: 60

Trainable parameter names:
  fc.weight: shape=[10, 512]
  fc.bias: shape=[10]
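
The claim that frozen parameters are left alone even when they are passed to the optimizer can be checked directly. The sketch below uses a small hypothetical stand-in (`backbone` and `head` are illustrative names, not ResNet layers) so that it runs quickly without downloading weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained backbone plus a new head
backbone = nn.Linear(4, 8)
head = nn.Linear(8, 2)
tiny_model = nn.Sequential(backbone, nn.ReLU(), head)

# Freeze the backbone only
for param in backbone.parameters():
    param.requires_grad = False

backbone_before = backbone.weight.detach().clone()
head_before = head.weight.detach().clone()

# Passing every parameter is allowed; frozen ones never receive a .grad
opt = torch.optim.SGD(tiny_model.parameters(), lr=0.1, momentum=0.9)

loss = tiny_model(torch.randn(16, 4)).pow(2).mean()
loss.backward()
opt.step()

print("Backbone grad is None:", backbone.weight.grad is None)
print("Backbone unchanged:", torch.equal(backbone_before, backbone.weight))
print("Head changed:", not torch.equal(head_before, head.weight))
```

The optimizer skips any parameter whose `.grad` is `None`, which is why the frozen weights stay untouched.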


## Disabling Gradient Tracking with torch.no_grad()

When performing inference (predictions), we don't need to compute gradients. This saves memory and speeds up computation. Use `torch.no_grad()` or `torch.inference_mode()` for this purpose.

In [105]:
# Demonstrating torch.no_grad() context manager

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Normal operation - gradients are tracked
y = x * 2
print(f"With gradient tracking:")
print(f"  y = {y}")
print(f"  y.requires_grad = {y.requires_grad}")
print(f"  y.grad_fn = {y.grad_fn}")

With gradient tracking:
  y = tensor([2., 4., 6.], grad_fn=<MulBackward0>)
  y.requires_grad = True
  y.grad_fn = <MulBackward0 object at 0x70d4dabf7c40>


In [106]:
# Inside no_grad() context - no gradient tracking
with torch.no_grad():
    z = x * 2
    print(f"\nWithout gradient tracking (torch.no_grad()):")
    print(f"  z = {z}")
    print(f"  z.requires_grad = {z.requires_grad}")
    print(f"  z.grad_fn = {z.grad_fn}")


Without gradient tracking (torch.no_grad()):
  z = tensor([2., 4., 6.])
  z.requires_grad = False
  z.grad_fn = None


In [107]:
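# inference_mode() goes further than no_grad(): it also skips autograd bookkeeping such as
# version counters, so it is usually at least as fast (recommended for pure inference)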
with torch.inference_mode():
    w = x * 2
    print(f"\nUsing inference_mode():")
    print(f"  w = {w}")
    print(f"  w.requires_grad = {w.requires_grad}")


Using inference_mode():
  w = tensor([2., 4., 6.])
  w.requires_grad = False


In [108]:
# Common use case: evaluating model during training
model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()  # Set model to evaluation mode (affects dropout, batchnorm, etc.)

with torch.no_grad():
    # No gradients computed - faster and uses less memory
    test_input = torch.rand(1, 3, 64, 64)
    output = model(test_input)
    print(f"\nInference output shape: {output.shape}")
    print(f"Output requires_grad: {output.requires_grad}")


Inference output shape: torch.Size([1, 1000])
Output requires_grad: False
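
One practical difference between the two context managers is what can be done with their results afterwards. The sketch below, offered as an illustration rather than a complete comparison, uses `Tensor.is_inference()` to show that only `inference_mode()` marks its outputs as inference tensors, and that a `no_grad()` result can still be used as a constant in a later tracked computation.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

with torch.no_grad():
    z = x * 2

with torch.inference_mode():
    w = x * 2

# Both results are cut off from the graph...
print(f"z.requires_grad = {z.requires_grad}, w.requires_grad = {w.requires_grad}")

# ...but only the inference_mode() result is flagged as an inference tensor
print(f"z.is_inference() = {z.is_inference()}")
print(f"w.is_inference() = {w.is_inference()}")

# A no_grad() result can still take part in later autograd computations
(z * x).sum().backward()
print(f"x.grad = {x.grad}")  # d/dx (z * x) with z held constant is z
```

Inference tensors are restricted in computations that autograd records, so `no_grad()` is the safer choice when the result may feed back into training code.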


## The Importance of zero_grad()

**CRITICAL**: PyTorch accumulates gradients by default! If you don't call `optimizer.zero_grad()` before each backward pass, gradients from previous iterations will add up, leading to incorrect updates.

In [109]:
# Demonstrating gradient accumulation problem

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# First backward pass
y1 = (x ** 2).sum()
y1.backward()
print(f"After first backward: x.grad = {x.grad}")  # Should be [2, 4, 6]

After first backward: x.grad = tensor([2., 4., 6.])


In [110]:
# Second backward pass WITHOUT zeroing gradients
y2 = (x ** 2).sum()
y2.backward()
print(f"After second backward (no zero_grad): x.grad = {x.grad}")  # [4, 8, 12] - DOUBLED!

After second backward (no zero_grad): x.grad = tensor([ 4.,  8., 12.])


In [111]:
# Third backward pass WITH zeroing gradients
x.grad.zero_()  # Zero the gradient manually; optimizer.zero_grad() does this for every parameter it manages
y3 = (x ** 2).sum()
y3.backward()
print(f"After third backward (with zero_grad): x.grad = {x.grad}")  # [2, 4, 6] - Correct!

After third backward (with zero_grad): x.grad = tensor([2., 4., 6.])


### Complete Training Loop Pattern

Here's the standard pattern you should follow for training neural networks:

In [112]:
# Complete training loop example with a simple model

# Create a simple linear model
simple_model = nn.Linear(10, 2)
criterion = nn.MSELoss()  # Use proper loss function, not just sum of differences
optimizer = optim.SGD(simple_model.parameters(), lr=0.01)

# Generate some dummy data
inputs = torch.randn(5, 10)   # 5 samples, 10 features each
targets = torch.randn(5, 2)   # 5 samples, 2 outputs each

print("Training loop for 3 epochs:\n")

# Always call optimizer.zero_grad() BEFORE loss.backward()

for epoch in range(3):
    # ============================================
    # STANDARD TRAINING LOOP - MEMORIZE THIS ORDER
    # ============================================
    
    # Step 1: Zero the gradients from previous iteration
    optimizer.zero_grad()
    
    # Step 2: Forward pass - compute predictions
    outputs = simple_model(inputs)
    
    # Step 3: Compute the loss
    loss = criterion(outputs, targets)
    
    # Step 4: Backward pass - compute gradients
    loss.backward()
    
    # Step 5: Update weights using computed gradients
    optimizer.step()
    
    print(f"Epoch {epoch+1}: Loss = {loss.item():.4f}")

print("\nKey takeaway: Always call optimizer.zero_grad() BEFORE loss.backward()")

Training loop for 3 epochs:

Epoch 1: Loss = 0.7815
Epoch 2: Loss = 0.7499
Epoch 3: Loss = 0.7202

Key takeaway: Always call optimizer.zero_grad() BEFORE loss.backward()


## Breaking the Computation Graph with detach()

The `detach()` method creates a new tensor that shares the same data but is detached from the computation graph. This is useful when you want to:
- Use a tensor's value without tracking gradients through it
- Prevent gradients from flowing back through certain parts of your network

In [113]:
# Demonstrating detach()

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2

# y is connected to x in the computation graph
print(f"y = {y}")
print(f"y.grad_fn = {y.grad_fn}")
print(f"y.requires_grad = {y.requires_grad}")

y = tensor([2., 4., 6.], grad_fn=<MulBackward0>)
y.grad_fn = <MulBackward0 object at 0x70d4db82cc10>
y.requires_grad = True


In [114]:
# detach() creates a new tensor with same values but no gradient connection
y_detached = y.detach()
print(f"\ny_detached = {y_detached}")
print(f"y_detached.grad_fn = {y_detached.grad_fn}")  # None - no computation history
print(f"y_detached.requires_grad = {y_detached.requires_grad}")


y_detached = tensor([2., 4., 6.])
y_detached.grad_fn = None
y_detached.requires_grad = False


In [115]:
# Use case: computing loss for logging without affecting gradients
z = y_detached * 3  # This operation won't contribute to gradients through x
print(f"\nz (from detached) = {z}")
print(f"z.requires_grad = {z.requires_grad}")


z (from detached) = tensor([ 6., 12., 18.])
z.requires_grad = False


In [116]:
# Important: detach() shares memory with original tensor
y_detached[0] = 100
print(f"\nAfter modifying y_detached[0] = 100:")
print(f"y = {y}")  # y is also modified because they share memory!
print(f"y_detached = {y_detached}")


After modifying y_detached[0] = 100:
y = tensor([100.,   4.,   6.], grad_fn=<MulBackward0>)
y_detached = tensor([100.,   4.,   6.])


In [117]:
# If you need a completely independent copy, use .detach().clone()
x2 = torch.tensor([1.0, 2.0], requires_grad=True)
y2 = x2 * 2
y2_copy = y2.detach().clone()  # Independent copy
y2_copy[0] = 999

print(f"\nWith .detach().clone():")
print(f"y2 = {y2}")  # Unchanged
print(f"y2_copy = {y2_copy}")


With .detach().clone():
y2 = tensor([2., 4.], grad_fn=<MulBackward0>)
y2_copy = tensor([999.,   4.])
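
The second use case listed at the start of this section, preventing gradients from flowing through part of a computation, can be sketched as follows. The example is illustrative: it multiplies `x` by itself once with both factors tracked and once with one factor detached, so the detached factor behaves like a constant.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Both factors tracked: d/dx (x * x) = 2x
(x * x).sum().backward()
grad_full = x.grad.clone()

x.grad = None  # reset before the second experiment

# One factor detached: it is treated as a constant c, so d/dx (x * c) = c
(x * x.detach()).sum().backward()
grad_blocked = x.grad.clone()

print(f"Both paths tracked: {grad_full}")
print(f"One path detached:  {grad_blocked}")
```

The forward values are identical in both cases; only the gradient changes, because one of the two paths back to `x` has been cut.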


## Gradient Accumulation for Large Batch Training

When you have limited GPU memory but want to train with a large effective batch size, you can accumulate gradients over multiple mini-batches before updating weights. This leverages PyTorch's default gradient accumulation behavior.

In [118]:
# Gradient accumulation example
# Goal: Train with effective batch size of 8, but only 2 samples fit in memory at a time

model_ga = nn.Linear(10, 2)
criterion = nn.MSELoss()
optimizer = optim.SGD(model_ga.parameters(), lr=0.01)

# Simulating 8 samples split into 4 mini-batches of 2
all_inputs = torch.randn(8, 10)
all_targets = torch.randn(8, 2)

accumulation_steps = 4  # Number of mini-batches to accumulate before updating
batch_size = 2

print("Gradient Accumulation Training:\n")

# Zero gradients once at the beginning
optimizer.zero_grad()

for i in range(accumulation_steps):
    # Get mini-batch
    start_idx = i * batch_size
    end_idx = start_idx + batch_size
    inputs = all_inputs[start_idx:end_idx]
    targets = all_targets[start_idx:end_idx]
    
    # Forward pass
    outputs = model_ga(inputs)
    
    # Scale loss to account for accumulation (average over all mini-batches)
    loss = criterion(outputs, targets) / accumulation_steps
    
    # Backward pass - gradients are ACCUMULATED (not replaced)
    loss.backward()
    
    print(f"Mini-batch {i+1}: Loss = {loss.item() * accumulation_steps:.4f}")

# Update weights after accumulating all gradients
optimizer.step()

# Zero gradients for next iteration
optimizer.zero_grad()

print("\nWeights updated with accumulated gradients from all 4 mini-batches")

Gradient Accumulation Training:

Mini-batch 1: Loss = 1.6563
Mini-batch 2: Loss = 0.8023
Mini-batch 3: Loss = 4.7670
Mini-batch 4: Loss = 1.3919

Weights updated with accumulated gradients from all 4 mini-batches
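
The reason for dividing each mini-batch loss by `accumulation_steps` is that the accumulated gradient then matches the gradient of one large batch. The sketch below checks this under the same assumptions as the example above (equal-sized mini-batches and a mean-reduced loss); `model_check`, `check_inputs`, and `check_targets` are names introduced only for this comparison.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model_check = nn.Linear(10, 2)   # hypothetical model used only for this check
mse = nn.MSELoss()
check_inputs = torch.randn(8, 10)
check_targets = torch.randn(8, 2)
accumulation_steps, batch_size = 4, 2

# Reference: one backward pass over the full batch of 8
model_check.zero_grad()
mse(model_check(check_inputs), check_targets).backward()
full_batch_grad = model_check.weight.grad.clone()

# Accumulated: 4 backward passes over mini-batches of 2, each loss scaled by 1/4
model_check.zero_grad()
for i in range(accumulation_steps):
    batch = slice(i * batch_size, (i + 1) * batch_size)
    loss = mse(model_check(check_inputs[batch]), check_targets[batch])
    (loss / accumulation_steps).backward()
accumulated_grad = model_check.weight.grad.clone()

print("Max difference:", (full_batch_grad - accumulated_grad).abs().max().item())
```

If the mini-batches had different sizes, a plain division by `accumulation_steps` would no longer reproduce the full-batch mean exactly.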


## retain_graph: Multiple Backward Passes

By default, PyTorch frees the computation graph after `.backward()` to save memory. If you need to call backward multiple times on the same graph, use `retain_graph=True`.

In [119]:
# Demonstrating retain_graph

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x ** 2
z = y.sum()

# First backward - need retain_graph if we want to call backward again
z.backward(retain_graph=True)
print(f"After first backward: x.grad = {x.grad}")

# Without retain_graph=True, this would cause an error:
# "RuntimeError: Trying to backward through the graph a second time"

After first backward: x.grad = tensor([2., 4.])


In [120]:
# Second backward on the same graph
z.backward(retain_graph=True)
print(f"After second backward: x.grad = {x.grad}")  # Gradients accumulated

After second backward: x.grad = tensor([4., 8.])


In [121]:
# Third backward - if we don't need the graph anymore, don't retain it
x.grad.zero_()  # Reset gradients first
z.backward()  # retain_graph=False by default
print(f"After third backward (reset first): x.grad = {x.grad}")

# Now the graph is freed, calling backward again would error:
try:
    z.backward()
except RuntimeError as e:
    print(f"Error: {e}")

After third backward (reset first): x.grad = tensor([2., 4.])
Error: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
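
A common situation where `retain_graph=True` comes up is two losses that share part of a graph. The sketch below is illustrative (`loss_a` and `loss_b` are made-up losses) and also shows an alternative that often avoids the flag entirely: summing the losses and calling `backward()` once.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
shared = x ** 2              # intermediate result used by both losses
loss_a = shared.sum()        # d(loss_a)/dx = 2x
loss_b = (3 * shared).sum()  # d(loss_b)/dx = 6x

# Two backward passes through the shared part: the first must retain the graph
loss_a.backward(retain_graph=True)
loss_b.backward()
grad_two_passes = x.grad.clone()

# Alternative: combine the losses and backpropagate once, no retain_graph needed
x.grad = None
shared = x ** 2
(shared.sum() + (3 * shared).sum()).backward()
grad_single_pass = x.grad.clone()

print(grad_two_passes, grad_single_pass)
```

Both approaches give `8x`, but the single combined pass lets PyTorch free the graph immediately and traverses the shared part only once.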


## Higher-Order Gradients (Gradients of Gradients)

Using `create_graph=True` preserves the computation graph of the gradients themselves, allowing you to compute higher-order derivatives. This is useful for:
- Meta-learning (MAML)
- Some regularization techniques (e.g., gradient penalty in WGAN-GP)
- Hessian computation

In [122]:
# Computing second-order derivatives

x = torch.tensor([2.0], requires_grad=True)

# Function: f(x) = x^3
# First derivative: f'(x) = 3x^2
# Second derivative: f''(x) = 6x

y = x ** 3

# Compute first derivative with create_graph=True to track the gradient computation
first_derivative = torch.autograd.grad(outputs=y, inputs=x, create_graph=True)[0]
print(f"f(x) = x^3 at x=2: {y.item()}")
print(f"f'(x) = 3x^2 at x=2: {first_derivative.item()}")  # Should be 3*4 = 12

f(x) = x^3 at x=2: 8.0
f'(x) = 3x^2 at x=2: 12.0


In [123]:
# Compute second derivative (gradient of the gradient)
second_derivative = torch.autograd.grad(first_derivative, x)[0]
print(f"f''(x) = 6x at x=2: {second_derivative.item()}")  # Should be 6*2 = 12

f''(x) = 6x at x=2: 12.0
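
For the Hessian use case listed above, applying `torch.autograd.grad` entry by entry becomes tedious for vector inputs. One option is the helper `torch.autograd.functional.hessian`, sketched below on an assumed toy function `f(x) = sum(x_i^3)`, whose Hessian is known analytically to be `diag(6x)`.

```python
import torch
from torch.autograd.functional import hessian

def f(x):
    # Scalar-valued function of a vector: f(x) = sum(x_i^3)
    return (x ** 3).sum()

point = torch.tensor([1.0, 2.0])
H = hessian(f, point)

# Analytically, d^2 f / dx_i dx_j = 6 * x_i when i == j and 0 otherwise
expected = torch.diag(6 * point)
print(H)
print("Matches diag(6x):", torch.allclose(H, expected))
```

The Hessian has one entry per pair of inputs, so this approach is only practical for small inputs.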


In [124]:
# Practical example: Gradient penalty (used in WGAN-GP)

# Simple network
net = nn.Linear(5, 1)
x_input = torch.randn(1, 5, requires_grad=True)

# Forward pass
output = net(x_input)

# Compute gradients of output w.r.t. input (with create_graph to allow further differentiation)
gradients = torch.autograd.grad(
    outputs=output,
    inputs=x_input,
    create_graph=True,  # Needed to compute gradient of the gradient
    retain_graph=True
)[0]

# Gradient penalty: (||gradient|| - 1)^2
gradient_penalty = ((gradients.norm(2) - 1) ** 2)
print(f"Gradient norm: {gradients.norm(2).item():.4f}")
print(f"Gradient penalty: {gradient_penalty.item():.4f}")

Gradient norm: 0.6957
Gradient penalty: 0.0926


In [125]:
# Now we can backprop through the penalty
gradient_penalty.backward()
print(f"Network weight gradients computed: {net.weight.grad is not None}")

Network weight gradients computed: True


## Common Pitfalls and Gotchas

### 1. In-place Operations Break Gradients

In-place operations (operations that modify tensors in place, denoted by `_` suffix) can break the computation graph.

In [126]:
# In-place operations can cause problems

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2

# RISKY: In-place modification of y (which is part of the computation graph)
# y.add_(1)  # Can raise a RuntimeError in backward() if autograd needs y's original values to compute gradients

# GOOD: Out-of-place operation instead
y = y + 1  # Creates a new tensor

z = y.sum()
z.backward()
print(f"x.grad = {x.grad}")  # Works fine

# List of common in-place operations to avoid on tensors in computation graph:
# add_(), sub_(), mul_(), div_(), zero_(), fill_(), etc.

print("\nIn-place operations are fine on leaf tensors that don't need gradients:")
a = torch.tensor([1.0, 2.0])
a.add_(10)  # This is fine - 'a' doesn't require gradients
print(f"a = {a}")

x.grad = tensor([2., 2.])

In-place operations are fine on leaf tensors that don't need gradients:
a = tensor([11., 12.])
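
Whether an in-place operation actually fails depends on whether autograd saved the overwritten values for the backward pass. The sketch below picks `exp()` as an assumed example because its derivative is its own output, so the output must be kept intact; the error is caught so that the snippet runs to completion.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# exp() saves its own output for the backward pass, since d/dx exp(x) = exp(x)
y = x.exp()
y.add_(1)  # overwrites the values that backward() needs

error_message = None
try:
    y.sum().backward()
except RuntimeError as e:
    error_message = str(e)

print("Raised RuntimeError:", error_message is not None)
```

Replacing `y.add_(1)` with `y = y + 1` leaves the saved output untouched and lets `backward()` succeed.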


### 2. Leaf Tensors and Non-Leaf Tensors

Only leaf tensors with `requires_grad=True` (tensors created directly, not as the result of an operation) have their `.grad` populated by default.

In [127]:
# Understanding leaf vs non-leaf tensors

# Leaf tensor: created directly
x = torch.tensor([1.0, 2.0], requires_grad=True)
print(f"x is leaf: {x.is_leaf}")  # True

x is leaf: True


In [128]:
# Non-leaf tensor: result of an operation
y = x * 2
print(f"y is leaf: {y.is_leaf}")  # False

y is leaf: False


In [129]:
z = y.sum()
z.backward()

print(f"\nx.grad = {x.grad}")  # Has gradient - it's a leaf
print(f"y.grad = {y.grad}")  # None by default - non-leaf tensors don't retain gradients


x.grad = tensor([2., 2.])
y.grad = None




In [130]:
# If you need gradients for non-leaf tensors, use retain_grad()
x2 = torch.tensor([1.0, 2.0], requires_grad=True)
y2 = x2 * 2
y2.retain_grad()  # Tell PyTorch to keep gradient for this non-leaf tensor

z2 = y2.sum()
z2.backward()

print(f"\nWith retain_grad():")
print(f"x2.grad = {x2.grad}")
print(f"y2.grad = {y2.grad}")  # Now it has a gradient!


With retain_grad():
x2.grad = tensor([2., 2.])
y2.grad = tensor([1., 1.])


### 3. Converting Tensors with Gradients to NumPy

You cannot directly convert a tensor that requires gradients to NumPy. You must detach it first.

In [131]:
import numpy as np

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# BAD: This will raise an error
# x_numpy = x.numpy()  # RuntimeError: Can't call numpy() on Tensor that requires grad

# GOOD: Detach first, then convert
x_numpy = x.detach().numpy()
print(f"Converted to numpy: {x_numpy}")

# For GPU tensors, also move the data to the CPU before converting
# x_numpy = x.detach().cpu().numpy()

Converted to numpy: [1. 2. 3.]
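
As with `detach()` on its own, the NumPy array obtained this way shares memory with the original CPU tensor. The sketch below illustrates the consequence and uses NumPy's `.copy()` when an independent array is wanted; `shared_view` and `independent_copy` are illustrative names.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

shared_view = x.detach().numpy()              # shares memory with x (CPU tensors)
independent_copy = x.detach().numpy().copy()  # owns its own memory

shared_view[0] = 10.0  # also changes x

print(f"x after editing the shared array: {x}")
print(f"independent copy: {independent_copy}")
```

Edits made through the shared array bypass autograd entirely, so copying is the safer default when the array will be modified.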


---

## Summary: Key Autograd Concepts

| Concept | Description | When to Use |
|---------|-------------|-------------|
| `requires_grad=True` | Enable gradient tracking | Learnable parameters |
| `backward()` | Compute gradients | After computing loss |
| `zero_grad()` | Reset gradients | Before each backward pass |
| `torch.no_grad()` | Disable gradient tracking | Inference/evaluation |
| `detach()` | Break computation graph | Get tensor value without gradients |
| `retain_graph=True` | Keep graph after backward | Multiple backward passes |
| `create_graph=True` | Track gradient computation | Higher-order derivatives |

### Training Loop Checklist

```python
optimizer.zero_grad()    # 1. Clear old gradients
output = model(input)    # 2. Forward pass
loss = criterion(output, target)  # 3. Compute loss
loss.backward()          # 4. Compute gradients
optimizer.step()         # 5. Update weights
```