# Module 1.2: Autograd Internals

Autograd is PyTorch's automatic differentiation engine. Understanding how it works under the hood is essential for:
- Debugging gradient issues
- Implementing custom operations
- Optimizing memory usage during training
- Understanding why certain code patterns work (or don't)

## Learning Objectives
- Understand dynamic computation graphs and how they differ from static graphs
- Master `requires_grad`, `grad_fn`, and the backward pass mechanics
- Distinguish leaf tensors from intermediate tensors
- Control gradient computation with `torch.no_grad()` and `detach()`
- Implement custom autograd functions
- Handle gradient accumulation correctly

---

## Setup

In [1]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

print(f"PyTorch version: {torch.__version__}")

PyTorch version: 2.9.1+cu128


---
## 1. The Computation Graph

When you perform operations on tensors with `requires_grad=True`, PyTorch builds a **computation graph** that tracks how the output was computed. This graph is then used to compute gradients via backpropagation.

### 1.1 Dynamic vs Static Graphs

| Feature | PyTorch (Dynamic) | TensorFlow 1.x (Static) |
|---------|-------------------|-------------------------|
| Graph creation | Every forward pass | Once, before training |
| Control flow | Native Python if/for | Special graph ops |
| Debugging | Standard Python debugger | Harder, separate execution |
| Flexibility | Different graph each iteration | Same graph always |

In [2]:
# Dynamic graph: different computation path each time
def dynamic_example(x, use_square=True):
    if use_square:  # Native Python control flow!
        return x ** 2
    else:
        return x ** 3

x = torch.tensor([2.0], requires_grad=True)

# Different graphs created each call
y1 = dynamic_example(x, use_square=True)
y1.backward()
print(f"d(x^2)/dx at x=2: {x.grad.item()}")  # 2*x = 4

x.grad.zero_()  # Reset gradient

y2 = dynamic_example(x, use_square=False)
y2.backward()
print(f"d(x^3)/dx at x=2: {x.grad.item()}")  # 3*x^2 = 12

d(x^2)/dx at x=2: 4.0
d(x^3)/dx at x=2: 12.0


### 1.2 Visualizing the Graph

In [3]:
# Every tensor knows how it was created via grad_fn
a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)

c = a * b          # Multiplication
d = c + a          # Addition
e = d.sum()        # Sum (to get scalar for backward)

print(f"a.grad_fn: {a.grad_fn}")  # None - it's a leaf
print(f"c.grad_fn: {c.grad_fn}")  # MulBackward
print(f"d.grad_fn: {d.grad_fn}")  # AddBackward
print(f"e.grad_fn: {e.grad_fn}")  # SumBackward

a.grad_fn: None
c.grad_fn: <MulBackward0 object at 0x7e8b17c0eec0>
d.grad_fn: <AddBackward0 object at 0x7e8b17c0eec0>
e.grad_fn: <SumBackward0 object at 0x7e8b17c0eec0>


In [4]:
# We can traverse the graph backwards via grad_fn.next_functions
def print_graph(fn, indent=0):
    """Recursively print the autograd graph, starting from a grad_fn node."""
    if fn is None:  # Input that does not require grad
        return
    # Leaf tensors that require grad show up as AccumulateGrad nodes
    print(f"{'  ' * indent}{type(fn).__name__}")
    for child, _ in fn.next_functions:
        print_graph(child, indent + 1)

# Usage: print_graph(e.grad_fn). Below, the same graph is drawn by hand.

print("Computation graph from e:")
print(f"e = {e.grad_fn}")
print(f"  └─ d = {d.grad_fn}")
print(f"      ├─ c = {c.grad_fn}")
print(f"      │   ├─ a (leaf)")
print(f"      │   └─ b (leaf)")
print(f"      └─ a (leaf)")

Computation graph from e:
e = <SumBackward0 object at 0x7e89c7370cd0>
  └─ d = <AddBackward0 object at 0x7e89c7370cd0>
      ├─ c = <MulBackward0 object at 0x7e89c7370cd0>
      │   ├─ a (leaf)
      │   └─ b (leaf)
      └─ a (leaf)


---
## 2. requires_grad and Gradient Flow

`requires_grad` is the per-tensor flag that tells autograd whether to record the operations that tensor takes part in.

In [5]:
# Default: requires_grad=False
t1 = torch.tensor([1.0, 2.0, 3.0])
print(f"Default requires_grad: {t1.requires_grad}")

# Explicitly enable
t2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
print(f"Explicit requires_grad: {t2.requires_grad}")

# Enable in-place (note the underscore)
t1.requires_grad_(True)
print(f"After requires_grad_(): {t1.requires_grad}")

Default requires_grad: False
Explicit requires_grad: True
After requires_grad_(): True


In [6]:
# Gradient flow rules:
# If ANY input requires_grad, output requires_grad

a = torch.tensor([1.0], requires_grad=True)
b = torch.tensor([2.0], requires_grad=False)

c = a + b
print(f"a (grad=True) + b (grad=False) -> c.requires_grad: {c.requires_grad}")

# All inputs no grad -> output no grad
d = torch.tensor([1.0])
e = torch.tensor([2.0])
f = d + e
print(f"d (no grad) + e (no grad) -> f.requires_grad: {f.requires_grad}")

a (grad=True) + b (grad=False) -> c.requires_grad: True
d (no grad) + e (no grad) -> f.requires_grad: False


### 2.1 Leaf Tensors vs Intermediate Tensors

In [7]:
# Leaf tensors: created directly by the user, not from operations
x = torch.tensor([1.0, 2.0], requires_grad=True)
print(f"x is leaf: {x.is_leaf}")  # True

# Intermediate tensors: results of operations
y = x * 2
print(f"y is leaf: {y.is_leaf}")  # False

z = y.sum()
print(f"z is leaf: {z.is_leaf}")  # False

x is leaf: True
y is leaf: False
z is leaf: False


In [11]:
# Important: Only leaf tensors with requires_grad=True have .grad populated after backward()
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
z = y.sum()

z.backward()

print(f"x.grad (leaf): {x.grad}")      # Populated
print(f"y.grad (non-leaf): {y.grad}")  # None! Gradients not retained by default

x.grad (leaf): tensor([2., 2.])
y.grad (non-leaf): None


In [19]:
# To retain gradients for non-leaf tensors, use retain_grad()
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
y.retain_grad()  # Tell PyTorch to keep y's gradient
z = y.sum()

z.backward()

print(f"x.grad: {x.grad}")
print(f"y.grad (retained): {y.grad}")  # Now populated!

x.grad: tensor([2., 2.])
y.grad (retained): tensor([1., 1.])


---
## 3. The Backward Pass

When you call `.backward()`, PyTorch traverses the computation graph in reverse, computing gradients using the chain rule.

### 3.1 Basic Backward

In [22]:
# Simple example: y = x^2, dy/dx = 2x
x = torch.tensor([3.0], requires_grad=True)
y = x ** 2

y.backward()  # Computes dy/dx

print(f"x = {x.item()}")
print(f"y = x^2 = {y.item()}")
print(f"dy/dx = 2x = {x.grad.item()}")

x = 3.0
y = x^2 = 9.0
dy/dx = 2x = 6.0


In [23]:
# Chain rule in action: y = sin(x^2)
# dy/dx = cos(x^2) * 2x

x = torch.tensor([2.0], requires_grad=True)
y = torch.sin(x ** 2)

y.backward()

# Manual calculation
manual_grad = torch.cos(x ** 2) * 2 * x

print(f"Autograd: {x.grad.item():.6f}")
print(f"Manual:   {manual_grad.item():.6f}")

Autograd: -2.614574
Manual:   -2.614574


### 3.2 Backward with Non-Scalar Outputs

In [24]:
# backward() with no arguments only works on scalar (single-element) outputs
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # Non-scalar output

try:
    y.backward()
except RuntimeError as e:
    print(f"Error: {e}")

Error: grad can be implicitly created only for scalar outputs


In [25]:
# Solution 1: Reduce to scalar
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
loss = y.sum()  # or y.mean()
loss.backward()
print(f"Gradient via sum: {x.grad}")

Gradient via sum: tensor([2., 4., 6.])


In [26]:
# Solution 2: Provide gradient argument (vector-Jacobian product)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# gradient argument = "upstream gradients" (dL/dy for some loss L)
# This computes dL/dx = dL/dy * dy/dx
upstream = torch.tensor([1.0, 1.0, 1.0])  # Equivalent to .sum().backward()
y.backward(gradient=upstream)
print(f"Gradient with upstream [1,1,1]: {x.grad}")

# Different upstream gradients
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
upstream = torch.tensor([1.0, 0.5, 0.0])  # Weight different elements
y.backward(gradient=upstream)
print(f"Gradient with upstream [1,0.5,0]: {x.grad}")

Gradient with upstream [1,1,1]: tensor([2., 4., 6.])
Gradient with upstream [1,0.5,0]: tensor([2., 2., 0.])
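
To make the role of the `gradient` argument concrete, the sketch below builds the full Jacobian for the same function and multiplies it by the upstream vector by hand. It assumes `torch.autograd.functional.jacobian` is acceptable for a tiny input (it is far too expensive for real models); `square` and `v` are illustrative names, not part of the notebook.

```python
import torch
from torch.autograd.functional import jacobian

def square(t):
    return t ** 2

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 0.0])   # upstream gradient dL/dy

# Path 1: let autograd contract v with the Jacobian
square(x).backward(gradient=v)

# Path 2: materialize the 3x3 Jacobian J[i, j] = dy_i/dx_j, then compute v @ J
J = jacobian(square, x.detach())
manual = v @ J

print(J)                                  # diagonal matrix with entries 2 * x
print(torch.allclose(x.grad, manual))     # both paths agree
```

Autograd never forms `J` explicitly; it only needs the product, which is why `backward` scales to large models.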


### 3.3 Multiple Backward Passes

In [27]:
# By default, the graph's saved intermediate values are freed after backward()
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
y.backward()

try:
    y.backward()  # Graph already freed!
except RuntimeError as e:
    print(f"Error: {e}")

Error: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.


In [28]:
# Use retain_graph=True to keep the graph
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

y.backward(retain_graph=True)  # Keep the graph
print(f"First backward: {x.grad}")

# Note: gradients ACCUMULATE!
y.backward(retain_graph=True)
print(f"Second backward (accumulated): {x.grad}")

x.grad.zero_()  # Reset
y.backward()
print(f"After zero_(): {x.grad}")

First backward: tensor([4.])
Second backward (accumulated): tensor([8.])
After zero_(): tensor([4.])
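
When the goal is to read a gradient rather than store it, `torch.autograd.grad` may be a better fit than repeated `backward()` calls: it returns the gradients and leaves `.grad` untouched, so nothing accumulates. A minimal sketch on the same function:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

# Returns a tuple with one gradient per input; x.grad is not written
(g1,) = torch.autograd.grad(y, x, retain_graph=True)
(g2,) = torch.autograd.grad(y, x)

print(g1, g2)   # the same value twice, nothing accumulates
print(x.grad)   # still None
```

The first call still needs `retain_graph=True`, because the graph's saved values are freed by `autograd.grad` in the same way as by `backward()`.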


---
## 4. Controlling Gradient Computation

There are several ways to stop gradient tracking, each with different use cases.

### 4.1 torch.no_grad()

In [29]:
# torch.no_grad() temporarily disables gradient tracking
x = torch.tensor([1.0], requires_grad=True)

# Normal operation - gradient tracked
y = x * 2
print(f"Normal: y.requires_grad = {y.requires_grad}")

# Inside no_grad context - no tracking
with torch.no_grad():
    z = x * 2
    print(f"In no_grad: z.requires_grad = {z.requires_grad}")

# Back to normal
w = x * 2
print(f"After no_grad: w.requires_grad = {w.requires_grad}")

Normal: y.requires_grad = True
In no_grad: z.requires_grad = False
After no_grad: w.requires_grad = True


In [30]:
# Common use case: Inference/evaluation
model = nn.Linear(10, 5)
x = torch.randn(3, 10)

# Training mode - need gradients
model.train()
y_train = model(x)
print(f"Training: y.requires_grad = {y_train.requires_grad}")

# Inference mode - no gradients needed (faster, less memory)
model.eval()
with torch.no_grad():
    y_eval = model(x)
    print(f"Inference: y.requires_grad = {y_eval.requires_grad}")

Training: y.requires_grad = True
Inference: y.requires_grad = False


### 4.2 torch.inference_mode()

In [31]:
# inference_mode is like no_grad but faster (PyTorch 1.9+)
# It provides additional optimizations by guaranteeing no gradient computation

x = torch.tensor([1.0], requires_grad=True)

with torch.inference_mode():
    y = x * 2
    print(f"In inference_mode: y.requires_grad = {y.requires_grad}")
    
# Prefer inference_mode() over no_grad() for pure inference, where outputs never re-enter autograd

In inference_mode: y.requires_grad = False
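
The extra speed of `inference_mode()` comes with a restriction worth seeing once: tensors created inside it are "inference tensors" and cannot later be saved for a backward pass, whereas `no_grad()` results can. The sketch below contrasts the two; the variable names are illustrative.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

with torch.no_grad():
    a = torch.ones(1) * 3        # ordinary tensor, just untracked
with torch.inference_mode():
    b = torch.ones(1) * 3        # inference tensor

print(a.is_inference(), b.is_inference())

# A no_grad result can feed a later tracked computation
(a * w).sum().backward()
print(w.grad)

# An inference tensor cannot be saved for backward
try:
    (b * w).sum().backward()
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {str(err)[:60]}...")
```

If a value computed during inference has to flow into training code afterwards, `no_grad()` (or a `clone()` of the inference tensor) is the safer choice.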


### 4.3 detach()

In [32]:
# detach() creates a tensor that shares data but doesn't track gradients
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

y_detached = y.detach()

print(f"y.requires_grad: {y.requires_grad}")
print(f"y_detached.requires_grad: {y_detached.requires_grad}")
print(f"Share memory: {y.data_ptr() == y_detached.data_ptr()}")

y.requires_grad: True
y_detached.requires_grad: False
Share memory: True


In [33]:
# Use case: Breaking the computation graph
# Example: Training with a frozen encoder

class FrozenEncoderModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(10, 20)  # Pretrained, frozen
        self.decoder = nn.Linear(20, 5)   # Trainable
    
    def forward(self, x):
        # Detach encoder output - gradients won't flow to encoder
        encoded = self.encoder(x).detach()
        return self.decoder(encoded)

model = FrozenEncoderModel()
x = torch.randn(3, 10)
y = model(x)
loss = y.sum()
loss.backward()

print(f"Encoder grad: {model.encoder.weight.grad}")  # None - detached
print(f"Decoder grad exists: {model.decoder.weight.grad is not None}")  # True

Encoder grad: None
Decoder grad exists: True
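
`detach()` is one way to freeze the encoder. A common alternative, sketched here under the assumption that the whole encoder should stay fixed, is to switch off `requires_grad` on its parameters and pass only the trainable ones to the optimizer. The two standalone `nn.Linear` modules are stand-ins for the model above.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(10, 20)   # stands in for a pretrained module
decoder = nn.Linear(20, 5)

# Freeze by flagging the parameters instead of detaching activations
for p in encoder.parameters():
    p.requires_grad_(False)

# Hand only the trainable parameters to the optimizer
trainable = [p for p in decoder.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

out = decoder(encoder(torch.randn(3, 10)))
out.sum().backward()

print(encoder.weight.grad)               # None: nothing was tracked for the encoder
print(decoder.weight.grad is not None)   # True
```

With this approach autograd records no graph for the encoder at all, while `detach()` records it and then cuts it off at the encoder output.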


In [34]:
# detach() vs clone().detach()
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2

# detach() shares memory - modifying one affects the other
y_detach = y.detach()

# clone().detach() creates independent copy
y_clone_detach = y.clone().detach()

print(f"detach shares memory: {y.data_ptr() == y_detach.data_ptr()}")
print(f"clone().detach() shares memory: {y.data_ptr() == y_clone_detach.data_ptr()}")

detach shares memory: True
clone().detach() shares memory: False
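
The pointer comparison above shows that the storage is shared; the practical consequence is that an in-place write through the detached tensor changes the original as well. A short sketch with illustrative names:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2                      # tensor([2., 4.])

view = y.detach()              # shares storage with y
copy = y.clone().detach()      # owns its storage

view[0] = 100.0                # visible through y
copy[1] = -1.0                 # not visible through y

print(y.tolist())
print(copy.tolist())
```

Autograd tracks such in-place edits with a version counter, so modifying a detached alias of a tensor that was saved for backward can make a later `backward()` raise. When in doubt, `clone().detach()` avoids the issue.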


---
## 5. Gradient Accumulation

PyTorch accumulates gradients by default. This is a feature, not a bug!

In [35]:
# Gradients accumulate across backward calls
x = torch.tensor([1.0], requires_grad=True)

for i in range(3):
    y = x ** 2
    y.backward()
    print(f"After backward {i+1}: x.grad = {x.grad}")

# Without zeroing, gradients add up!

After backward 1: x.grad = tensor([2.])
After backward 2: x.grad = tensor([4.])
After backward 3: x.grad = tensor([6.])


In [36]:
# Standard training loop pattern: zero gradients before backward
x = torch.tensor([1.0], requires_grad=True)

for i in range(3):
    if x.grad is not None:
        x.grad.zero_()  # Reset gradient
    
    y = x ** 2
    y.backward()
    print(f"Iteration {i+1}: x.grad = {x.grad}")

Iteration 1: x.grad = tensor([2.])
Iteration 2: x.grad = tensor([2.])
Iteration 3: x.grad = tensor([2.])


In [37]:
# For optimizers, use optimizer.zero_grad()
model = nn.Linear(10, 5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for i in range(3):
    optimizer.zero_grad()  # Reset all parameter gradients
    
    x = torch.randn(3, 10)
    y = model(x)
    loss = y.sum()
    loss.backward()
    
    optimizer.step()  # Update parameters
    
    print(f"Iteration {i+1}: weight grad norm = {model.weight.grad.norm():.4f}")

Iteration 1: weight grad norm = 10.9329
Iteration 2: weight grad norm = 7.4428
Iteration 3: weight grad norm = 9.8701
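
`optimizer.zero_grad()` can either fill the gradients with zeros or drop them entirely, controlled by its `set_to_none` argument. The sketch passes the argument explicitly so the behavior does not depend on the installed version's default; this difference is why manual loops check `x.grad is not None` before calling `zero_()`.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model(torch.randn(3, 4)).sum().backward()
optimizer.zero_grad(set_to_none=False)   # keep the tensors, fill with zeros
zeroed = model.weight.grad.clone()

model(torch.randn(3, 4)).sum().backward()
optimizer.zero_grad(set_to_none=True)    # drop the tensors entirely

print(zeroed.abs().sum().item())   # 0.0
print(model.weight.grad)           # None
```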


### 5.1 When Gradient Accumulation is Useful

In [38]:
# Use case: Simulating larger batch sizes with limited memory
model = nn.Linear(100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# We want effective batch size of 32, but can only fit 8 in memory
accumulation_steps = 4
micro_batch_size = 8

optimizer.zero_grad()

for i in range(accumulation_steps):
    x = torch.randn(micro_batch_size, 100)
    y = model(x)
    loss = y.sum() / accumulation_steps  # Scale loss
    loss.backward()  # Gradients accumulate

optimizer.step()  # Single update with accumulated gradients

print(f"Effective batch size: {micro_batch_size * accumulation_steps}")

Effective batch size: 32
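
The division by `accumulation_steps` is what makes the accumulated gradient match a single large batch. The check below is a sketch that assumes a mean-reduced loss (the cell above uses a sum, for which the division only rescales the result); the squared-output loss is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(100, 10)
data = torch.randn(32, 100)
accumulation_steps = 4

# Reference: one backward over the full batch of 32
model.zero_grad()
model(data).pow(2).mean().backward()
full_grad = model.weight.grad.clone()

# Accumulated: four micro-batches of 8, each loss divided by the step count
model.zero_grad()
for micro_batch in data.chunk(accumulation_steps):
    loss = model(micro_batch).pow(2).mean() / accumulation_steps
    loss.backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-5))
```

The equivalence holds here because the model has no batch-dependent layers; with layers such as batch normalization, micro-batches and a full batch would generally not produce identical gradients.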


---
## 6. Custom Autograd Functions

You can define custom operations with custom forward and backward passes using `torch.autograd.Function`.

In [39]:
# Example: Custom ReLU implementation
class MyReLU(torch.autograd.Function):
    
    @staticmethod
    def forward(ctx, x):
        """
        Forward pass.
        ctx is a context object to save tensors for backward.
        """
        ctx.save_for_backward(x)  # Save input for backward
        return x.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        Backward pass.
        grad_output is the gradient of the loss w.r.t. the output.
        Returns gradient w.r.t. input.
        """
        x, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0  # Gradient is 0 where x < 0
        return grad_input

# Use it
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0], requires_grad=True)
y = MyReLU.apply(x)  # Use .apply(), not direct call
y.sum().backward()

print(f"Input:    {x.tolist()}")
print(f"Output:   {y.tolist()}")
print(f"Gradient: {x.grad.tolist()}")

Input:    [-2.0, -1.0, 0.0, 1.0, 2.0]
Output:   [0.0, 0.0, 0.0, 1.0, 2.0]
Gradient: [0.0, 0.0, 1.0, 1.0, 1.0]
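
Note the gradient of 1 at `x = 0` in the output above: ReLU is not differentiable there, and the mask `x < 0` picks one valid subgradient. The built-in `torch.relu` picks a different one. The sketch repeats the custom rule so it runs on its own and compares the two at the kink.

```python
import torch

class MyReLU(torch.autograd.Function):   # same backward rule as the cell above
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0
        return grad_input

x1 = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
x2 = x1.detach().clone().requires_grad_(True)

MyReLU.apply(x1).sum().backward()
torch.relu(x2).sum().backward()

print(x1.grad)   # custom rule: gradient 1 at x == 0
print(x2.grad)   # built-in rule: gradient 0 at x == 0
```

Either choice is mathematically defensible; changing the mask to `x <= 0` would make the custom function match the built-in.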


In [40]:
# More complex example: Straight-through estimator (STE)
# Used for quantization - forward uses discrete values, backward pretends it's identity

class StraightThroughEstimator(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Forward: round to nearest integer
        return torch.round(x)
    
    @staticmethod
    def backward(ctx, grad_output):
        # Backward: pretend rounding didn't happen (identity gradient)
        return grad_output

x = torch.tensor([0.3, 0.7, 1.4, 2.6], requires_grad=True)
y = StraightThroughEstimator.apply(x)
loss = (y - 2).pow(2).sum()
loss.backward()

print(f"Input:    {x.tolist()}")
print(f"Rounded:  {y.tolist()}")
print(f"Gradient: {x.grad.tolist()}")  # Gradients flow through!

Input:    [0.30000001192092896, 0.699999988079071, 1.399999976158142, 2.5999999046325684]
Rounded:  [0.0, 1.0, 1.0, 3.0]
Gradient: [-4.0, -2.0, -2.0, 2.0]
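
The same straight-through behavior is often written without a custom `Function`, using `detach()` to hide the rounding correction from autograd. This is a sketch of that alternative idiom, not a replacement for the class above:

```python
import torch

x = torch.tensor([0.3, 0.7, 1.4, 2.6], requires_grad=True)

# Forward value equals round(x); the detached correction contributes no gradient,
# so dy/dx is the identity
y = x + (torch.round(x) - x).detach()
loss = (y - 2).pow(2).sum()
loss.backward()

print(y.detach())
print(x.grad)
```

The forward value can differ from `torch.round(x)` by floating-point rounding in the addition, which is one reason to keep the explicit `Function` when exact integers matter.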


In [41]:
# Example with multiple inputs and outputs
class WeightedSum(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y, weight):
        ctx.save_for_backward(x, y)
        ctx.weight = weight  # Save non-tensor data
        return weight * x + (1 - weight) * y
    
    @staticmethod
    def backward(ctx, grad_output):
        x, y = ctx.saved_tensors
        weight = ctx.weight
        
        # Return gradient for each input
        grad_x = grad_output * weight
        grad_y = grad_output * (1 - weight)
        grad_weight = None  # weight doesn't require grad (it's a float)
        
        return grad_x, grad_y, grad_weight

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.tensor([3.0, 4.0], requires_grad=True)
z = WeightedSum.apply(x, y, 0.7)
z.sum().backward()

print(f"x.grad: {x.grad}")  # Should be 0.7
print(f"y.grad: {y.grad}")  # Should be 0.3

x.grad: tensor([0.7000, 0.7000])
y.grad: tensor([0.3000, 0.3000])


### 6.1 Gradient Checking

In [42]:
# Verify your custom gradients with numerical gradients
from torch.autograd import gradcheck

# Test MyReLU
# gradcheck requires double precision for numerical stability
x = torch.randn(5, dtype=torch.double, requires_grad=True)

# gradcheck compares analytical gradient with numerical finite difference
result = gradcheck(MyReLU.apply, x, eps=1e-6, atol=1e-4, rtol=1e-3)
print(f"MyReLU gradient check passed: {result}")

MyReLU gradient check passed: True
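
A passing check is more convincing after seeing one fail. The sketch below defines a deliberately wrong function (`BuggySquare` is hypothetical, with the factor 2 missing from its backward) and asks `gradcheck` to return `False` instead of raising.

```python
import torch
from torch.autograd import gradcheck

class BuggySquare(torch.autograd.Function):   # hypothetical, deliberately wrong
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * x   # bug: should be 2 * x * grad_output

x = torch.tensor([0.5, -1.0, 1.5, 2.0, -2.5], dtype=torch.double, requires_grad=True)
ok = gradcheck(BuggySquare.apply, (x,), eps=1e-6, atol=1e-4, raise_exception=False)
print(f"BuggySquare gradient check passed: {ok}")
```

Leaving `raise_exception` at its default of `True` instead raises an error that reports the numerical and analytical Jacobians, which is usually more helpful while debugging.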


---
## 7. Debugging Gradients

### 7.1 Detecting Anomalies

In [43]:
# Enable anomaly detection to find the op that produced a NaN gradient
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([0.0], requires_grad=True)

try:
    y = torch.sqrt(x) * 0  # Forward is fine: sqrt(0) * 0 = 0
    y.backward()           # Backward computes 0 / (2 * sqrt(0)) = 0 / 0 = NaN
except RuntimeError as e:
    print(f"Caught anomaly: {str(e)[:100]}...")

torch.autograd.set_detect_anomaly(False)  # Disable (it's slow)

# Note: torch.log(0) alone would not be flagged. Its gradient is 1/0 = inf,
# and anomaly mode only checks backward results for NaN.

### 7.2 Using Hooks

In [44]:
# Register hooks to inspect gradients during backward
def print_grad_hook(name):
    def hook(grad):
        print(f"{name} grad: {grad}")
        return grad  # Can modify gradient here
    return hook

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
z = y * 3

# Register hooks
x.register_hook(print_grad_hook("x"))
y.register_hook(print_grad_hook("y"))

print("Backward pass:")
z.backward()

Backward pass:
y grad: tensor([3.])
x grad: tensor([12.])
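
Hooks stay attached until they are removed, which is easy to forget in a notebook where cells are re-run. `register_hook` returns a handle for this purpose; the sketch below records gradients into a list (`seen` is an illustrative name) and then detaches the hook.

```python
import torch

seen = []
x = torch.tensor([2.0], requires_grad=True)

# The lambda returns None, so the gradient passes through unchanged
handle = x.register_hook(lambda grad: seen.append(grad.clone()))

(x ** 2).backward()   # hook fires
handle.remove()
(x ** 2).backward()   # hook no longer fires

print(len(seen))      # 1
print(x.grad)         # the two backward passes still accumulate
```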


In [45]:
# Use hooks to clip gradients
def clip_grad_hook(max_norm):
    def hook(grad):
        norm = grad.norm()
        if norm > max_norm:
            return grad * max_norm / norm
        return grad
    return hook

x = torch.tensor([10.0], requires_grad=True)
x.register_hook(clip_grad_hook(max_norm=1.0))

y = x ** 2  # Gradient would be 20, but we clip to 1
y.backward()

print(f"Clipped gradient: {x.grad}")  # 1.0, not 20.0

Clipped gradient: tensor([1.])
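
The hook above clips one tensor at a time. For a whole model, the usual tool is `torch.nn.utils.clip_grad_norm_`, which rescales all parameter gradients together after `backward()`. A sketch with an artificially inflated loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 5)
(model(torch.randn(3, 10)).sum() * 100).backward()   # inflate the gradients

# Returns the total norm measured before clipping
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the combined norm over all parameters after clipping
clipped = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"before: {total_norm:.2f}, after: {clipped:.2f}")
```

Unlike a per-tensor hook, this preserves the relative scale between parameters, because one common factor is applied to every gradient.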


---
## Exercises

### Exercise 1: Manual Gradient Computation

Compute the gradient of $f(x) = \frac{1}{1 + e^{-x}}$ (sigmoid) manually and verify with autograd.

In [46]:
def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

def sigmoid_derivative(x):
    """
    Compute d/dx sigmoid(x).
    Hint: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    """
    # YOUR CODE HERE
    pass

# Test
x = torch.tensor([0.0, 1.0, -1.0, 2.0], requires_grad=True)

# Autograd
y = sigmoid(x)
y.sum().backward()
autograd_grad = x.grad.clone()

# Manual
# manual_grad = sigmoid_derivative(x.detach())

# print(f"Autograd: {autograd_grad}")
# print(f"Manual:   {manual_grad}")
# print(f"Match: {torch.allclose(autograd_grad, manual_grad)}")

### Exercise 2: Implement Custom Softmax Backward

Implement a custom autograd function for softmax with its correct backward pass.

In [47]:
class MySoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        """
        Compute softmax along the last dimension.
        softmax(x)_i = exp(x_i) / sum_j(exp(x_j))
        """
        # For numerical stability, subtract max
        x_max = x.max(dim=-1, keepdim=True).values
        exp_x = torch.exp(x - x_max)
        softmax = exp_x / exp_x.sum(dim=-1, keepdim=True)
        ctx.save_for_backward(softmax)
        return softmax
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        Compute gradient of softmax.
        Let s = softmax(x), then:
        ds_i/dx_j = s_i * (delta_ij - s_j)
        where delta_ij is 1 if i==j, 0 otherwise.
        
        For the chain rule with upstream gradient g:
        dx_j = sum_i(g_i * ds_i/dx_j) = sum_i(g_i * s_i * (delta_ij - s_j))
             = g_j * s_j - s_j * sum_i(g_i * s_i)
             = s_j * (g_j - sum_i(g_i * s_i))
        """
        # YOUR CODE HERE
        pass

# Test
x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)

# Verify with gradcheck
# result = gradcheck(MySoftmax.apply, x, eps=1e-6, atol=1e-4, rtol=1e-3)
# print(f"Gradient check passed: {result}")

### Exercise 3: Build a Mini Autograd System

Implement a simplified autograd system from scratch to understand how it works.

In [48]:
class Value:
    """A simple scalar value with automatic differentiation."""
    
    def __init__(self, data, children=(), op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = set(children)
        self._op = op
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        
        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            # YOUR CODE HERE
            pass
        out._backward = _backward
        
        return out
    
    def __pow__(self, n):
        out = Value(self.data ** n, (self,), f'**{n}')
        
        def _backward():
            # d(x^n)/dx = n * x^(n-1)
            # YOUR CODE HERE
            pass
        out._backward = _backward
        
        return out
    
    def backward(self):
        """Topological sort and backprop."""
        topo = []
        visited = set()
        
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        
        build_topo(self)
        
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Test when implemented:
# x = Value(2.0)
# y = Value(3.0)
# z = x * y + x ** 2
# z.backward()
# print(f"x: {x}")  # grad should be y + 2x = 3 + 4 = 7
# print(f"y: {y}")  # grad should be x = 2
# print(f"z: {z}")

---
## Solutions

In [49]:
# Exercise 1 Solution
def sigmoid_derivative_solution(x):
    s = sigmoid(x)
    return s * (1 - s)

x = torch.tensor([0.0, 1.0, -1.0, 2.0], requires_grad=True)
y = sigmoid(x)
y.sum().backward()

autograd_grad = x.grad.clone()
manual_grad = sigmoid_derivative_solution(x.detach())

print("Exercise 1 Solution:")
print(f"Autograd: {autograd_grad}")
print(f"Manual:   {manual_grad}")
print(f"Match: {torch.allclose(autograd_grad, manual_grad)}")

Exercise 1 Solution:
Autograd: tensor([0.2500, 0.1966, 0.1966, 0.1050])
Manual:   tensor([0.2500, 0.1966, 0.1966, 0.1050])
Match: True


In [50]:
# Exercise 2 Solution
class MySoftmaxSolution(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        x_max = x.max(dim=-1, keepdim=True).values
        exp_x = torch.exp(x - x_max)
        softmax = exp_x / exp_x.sum(dim=-1, keepdim=True)
        ctx.save_for_backward(softmax)
        return softmax
    
    @staticmethod
    def backward(ctx, grad_output):
        softmax, = ctx.saved_tensors
        # dx_j = s_j * (g_j - sum_i(g_i * s_i))
        dot_product = (grad_output * softmax).sum(dim=-1, keepdim=True)
        return softmax * (grad_output - dot_product)

x = torch.randn(3, 4, dtype=torch.double, requires_grad=True)
result = gradcheck(MySoftmaxSolution.apply, x, eps=1e-6, atol=1e-4, rtol=1e-3)
print(f"\nExercise 2 Solution:")
print(f"Gradient check passed: {result}")


Exercise 2 Solution:
Gradient check passed: True


In [51]:
# Exercise 3 Solution
class ValueSolution:
    def __init__(self, data, children=(), op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = set(children)
        self._op = op
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        other = other if isinstance(other, ValueSolution) else ValueSolution(other)
        out = ValueSolution(self.data + other.data, (self, other), '+')
        
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, ValueSolution) else ValueSolution(other)
        out = ValueSolution(self.data * other.data, (self, other), '*')
        
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    
    def __pow__(self, n):
        out = ValueSolution(self.data ** n, (self,), f'**{n}')
        
        def _backward():
            self.grad += n * (self.data ** (n - 1)) * out.grad
        out._backward = _backward
        return out
    
    def backward(self):
        topo = []
        visited = set()
        
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        
        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

print("\nExercise 3 Solution:")
x = ValueSolution(2.0)
y = ValueSolution(3.0)
z = x * y + x ** 2  # z = xy + x^2, dz/dx = y + 2x = 7, dz/dy = x = 2
z.backward()

print(f"x: {x}")  # grad = 7
print(f"y: {y}")  # grad = 2
print(f"z: {z}")


Exercise 3 Solution:
x: Value(data=2.0000, grad=7.0000)
y: Value(data=3.0000, grad=2.0000)
z: Value(data=10.0000, grad=1.0000)


---
## Summary

Key takeaways from this notebook:

1. **Dynamic Graphs**: PyTorch builds computation graphs on-the-fly, enabling native Python control flow
2. **requires_grad**: The switch that enables gradient tracking
3. **Leaf vs Intermediate**: Only leaf tensors with `requires_grad=True` keep `.grad` by default; call `retain_grad()` to keep it on intermediates
4. **backward()**: Computes gradients by traversing the graph in reverse
5. **Gradient Control**: Use `no_grad()`, `inference_mode()`, and `detach()` appropriately
6. **Gradient Accumulation**: Gradients add up; zero them before each optimization step
7. **Custom Functions**: Use `torch.autograd.Function` for custom forward/backward passes

---
*Next: Module 1.3 - nn.Module Architecture*