# PyTorch Autograd: Automatic Differentiation Fundamentals

This notebook demonstrates PyTorch's **Autograd** system, the automatic differentiation engine that powers backpropagation-based learning in neural networks. It is written for learners moving from TensorFlow who want to understand PyTorch's approach to gradient computation.

## Learning Objectives
- Understand what Autograd does and why it's powerful
- Learn gradient tracking with `requires_grad=True`
- Master computation graph creation and traversal
- Implement gradient computation in training loops
- Control gradient tracking with context managers and methods
- Compare TensorFlow vs PyTorch gradient computation approaches

## What Makes Autograd Powerful?
- **Dynamic Computation Graphs**: Built at runtime, perfect for dynamic models
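- **Chain Rule Automation**: Applies the chain rule through every recorded operation automatically
- **Memory Efficient**: Saves only the intermediate tensors the backward pass needs, and frees them after `.backward()` unless `retain_graph=True` is passed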
- **Flexible**: Works with control flow, loops, and conditional operations

**TensorFlow vs PyTorch**: TensorFlow records operations only inside an explicit `tf.GradientTape` context, whereas PyTorch records every operation on a tensor that has `requires_grad=True`, so no separate recording scope is needed.
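
To make the "dynamic graph" point concrete, here is a minimal sketch in which ordinary Python control flow decides which operations get recorded. The function `piecewise` and its inputs are hypothetical and chosen only for illustration.

```python
import torch

def piecewise(x):
    # Plain Python branching: only the branch that runs is recorded
    if x.item() > 0:
        y = x ** 2
    else:
        y = -x
    # A plain Python loop: each multiplication becomes a graph node
    for _ in range(3):
        y = y * 2
    return y

x_pos = torch.tensor(3.0, requires_grad=True)
piecewise(x_pos).backward()   # graph for 8 * x**2, derivative 16 * x

x_neg = torch.tensor(-3.0, requires_grad=True)
piecewise(x_neg).backward()   # graph for -8 * x, derivative -8

print(x_pos.grad.item(), x_neg.grad.item())
```

The graph is rebuilt on every call, so the two inputs produce two different graphs from the same function.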

---

## 1. Environment Setup and Runtime Detection

The setup cell detects the runtime (local, Colab, or Kaggle), installs the required packages, and imports PyTorch:

In [10]:
# Environment Detection and Setup
import sys
import subprocess
import os
import time

# Detect runtime environment
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules
IS_LOCAL = not (IS_COLAB or IS_KAGGLE)

print(f"Environment: Local={IS_LOCAL}, Colab={IS_COLAB}, Kaggle={IS_KAGGLE}")

# Install packages (pip output is shown on Colab/Kaggle and captured locally)
packages = ["torch", "matplotlib", "numpy", "tensorboard"]
for pkg in packages:
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-q", pkg],
        capture_output=IS_LOCAL,
    )
    # Report success only when pip exits cleanly
    print(f"{'✓' if result.returncode == 0 else '✗'} {pkg}")

# Import PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.tensorboard import SummaryWriter

print(f"\n✅ PyTorch {torch.__version__} ready!")
print(f"CUDA available: {torch.cuda.is_available()}")

Environment: Local=False, Colab=True, Kaggle=False
✓ torch
✓ matplotlib
✓ numpy
✓ tensorboard

✅ PyTorch 2.8.0+cu126 ready!
CUDA available: False


## 2. Basic Autograd: Australian Tourism Revenue

Let's understand autograd with a Sydney tourism revenue example.

**TensorFlow Equivalent**: `tf.GradientTape()` for gradient computation.

In [11]:
# Basic autograd demonstration
print("🇦🇺 Sydney Tourism Revenue Analysis with Autograd\n")

# Create tensor with gradient tracking
sydney_revenue = torch.tensor([150.0], requires_grad=True)
print(f"Initial revenue: ${sydney_revenue.item():.0f}k AUD")
print(f"Requires grad: {sydney_revenue.requires_grad}")

# Build computation graph
seasonal_boost = sydney_revenue * 1.2  # +20% summer boost
weekend_bonus = seasonal_boost + 50.0  # +$50k weekend
after_tax = weekend_bonus * 0.9       # -10% tax

print(f"\nRevenue calculation:")
print(f"Base: ${sydney_revenue.item():.0f}k")
print(f"After boost: ${seasonal_boost.item():.0f}k")
print(f"After bonus: ${weekend_bonus.item():.0f}k")
print(f"After tax: ${after_tax.item():.0f}k")

# Check computation graph
print(f"\nComputation graph:")
print(f"seasonal_boost.grad_fn: {seasonal_boost.grad_fn}")
print(f"weekend_bonus.grad_fn: {weekend_bonus.grad_fn}")
print(f"after_tax.grad_fn: {after_tax.grad_fn}")

# Compute gradients
after_tax.backward()
gradient = sydney_revenue.grad.item()

print(f"\nGradient: {gradient:.2f}")
print(f"Interpretation: $1k increase in base revenue → ${gradient:.2f}k increase in final revenue")

# Manual verification: 1.2 * 0.9 = 1.08
expected = 1.2 * 0.9
print(f"Expected: {expected:.2f} ✓")

🇦🇺 Sydney Tourism Revenue Analysis with Autograd

Initial revenue: $150k AUD
Requires grad: True

Revenue calculation:
Base: $150k
After boost: $180k
After bonus: $230k
After tax: $207k

Computation graph:
seasonal_boost.grad_fn: <MulBackward0 object at 0x7ee731e5f730>
weekend_bonus.grad_fn: <AddBackward0 object at 0x7ee731e5f730>
after_tax.grad_fn: <MulBackward0 object at 0x7ee731e5f730>

Gradient: 1.08
Interpretation: $1k increase in base revenue → $1.08k increase in final revenue
Expected: 1.08 ✓
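
The `grad_fn` objects printed above are nodes of the backward graph, and each node links to its inputs through `next_functions`. The sketch below walks that chain for the same revenue formula. The helper `walk_graph` and the variable names are hypothetical, not part of PyTorch.

```python
import torch

def walk_graph(fn, depth=0, names=None):
    """Collect backward-node names depth-first, starting from a grad_fn."""
    if names is None:
        names = []
    if fn is None:          # inputs that do not require grad appear as None
        return names
    names.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk_graph(next_fn, depth + 1, names)
    return names

base_revenue = torch.tensor([150.0], requires_grad=True)
final_revenue = (base_revenue * 1.2 + 50.0) * 0.9

graph_nodes = walk_graph(final_revenue.grad_fn)
print("\n".join(graph_nodes))

final_revenue.backward()
print(f"d(final)/d(base) = {base_revenue.grad.item():.2f}")
```

The walk starts at the last multiplication and ends at an `AccumulateGrad` node, which is where the gradient is written into the leaf tensor's `.grad`.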


## 3. Autograd in Training: Australian City Classifier

See how autograd fits into a single training step on synthetic data:

In [12]:
# Australian city classification with autograd
print("🏙️ Australian City Classification Training\n")

cities = ["Sydney", "Melbourne", "Brisbane", "Perth", "Adelaide", "Darwin", "Hobart", "Canberra"]
print(f"Cities: {cities}\n")

# Simple model
class CityClassifier(nn.Module):
    def __init__(self, input_size=10, num_cities=8):
        super().__init__()
        self.fc1 = nn.Linear(input_size, 16)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(16, num_cities)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

model = CityClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print(f"Model: {model}")
print(f"Parameters: {sum(p.numel() for p in model.parameters())}")

# Training step demonstration
print(f"\n🚂 Training Step Analysis:")

# Generate synthetic data
X = torch.randn(32, 10)
y = torch.randint(0, 8, (32,))

# 1. Zero gradients
print("1. Zeroing gradients...")
optimizer.zero_grad()

# 2. Forward pass
print("2. Forward pass...")
outputs = model(X)
loss = criterion(outputs, y)
print(f"   Loss: {loss.item():.4f}")
print(f"   Loss requires_grad: {loss.requires_grad}")

# 3. Backward pass
print("3. Backward pass...")
loss.backward()

# Check gradients
print("   Gradients computed:")
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        print(f"     {name}: {grad_norm:.6f}")

# 4. Update parameters
print("4. Updating parameters...")
optimizer.step()
print("   ✅ Parameters updated!")

print(f"\n📊 TensorFlow vs PyTorch Training:")
print(f"   TensorFlow: with tf.GradientTape() as tape:")
print(f"               gradients = tape.gradient(loss, variables)")
print(f"   PyTorch:    loss.backward(); optimizer.step()")

🏙️ Australian City Classification Training

Cities: ['Sydney', 'Melbourne', 'Brisbane', 'Perth', 'Adelaide', 'Darwin', 'Hobart', 'Canberra']

Model: CityClassifier(
  (fc1): Linear(in_features=10, out_features=16, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=16, out_features=8, bias=True)
)
Parameters: 312

🚂 Training Step Analysis:
1. Zeroing gradients...
2. Forward pass...
   Loss: 2.0795
   Loss requires_grad: True
3. Backward pass...
   Gradients computed:
     fc1.weight: 0.258289
     fc1.bias: 0.079799
     fc2.weight: 0.306681
     fc2.bias: 0.205234
4. Updating parameters...
   ✅ Parameters updated!

📊 TensorFlow vs PyTorch Training:
   TensorFlow: with tf.GradientTape() as tape:
               gradients = tape.gradient(loss, variables)
   PyTorch:    loss.backward(); optimizer.step()
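
The four steps above can be wrapped in a small function and repeated. This is a minimal sketch, assuming a fixed synthetic batch and a stand-in model of the same shape as `CityClassifier`. The helper `train_step`, the `demo_` names, and the learning rate are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

def train_step(model, criterion, optimizer, inputs, targets):
    """Run one zero_grad / forward / backward / step cycle and return the loss."""
    optimizer.zero_grad()                     # 1. clear old gradients
    loss = criterion(model(inputs), targets)  # 2. forward pass
    loss.backward()                           # 3. autograd fills every .grad
    optimizer.step()                          # 4. apply the update
    return loss.item()

# Stand-in with the same layer sizes as the classifier above
demo_model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 8))
demo_criterion = nn.CrossEntropyLoss()
demo_optimizer = optim.Adam(demo_model.parameters(), lr=0.01)

demo_X = torch.randn(32, 10)
demo_y = torch.randint(0, 8, (32,))

losses = [
    train_step(demo_model, demo_criterion, demo_optimizer, demo_X, demo_y)
    for _ in range(50)
]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Because the batch is fixed and the labels are random, the falling loss only shows the model memorising this batch. It says nothing about generalisation.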


## 4. Controlling Gradient Tracking

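Gradient tracking can be switched off for a block of code with `torch.no_grad()` or for a single tensor with `.detach()`. The most reliable benefit at inference time is lower memory use, because no intermediate results are saved for a backward pass. The timing comparison in the cell below is a single run of a very small model, so it is dominated by measurement noise; in the recorded output the `no_grad` run is actually the slower one. In the TensorFlow comparison, `tf.stop_gradient()` is the counterpart of `.detach()`, while `@tf.function` compiles a function into a graph and does not itself disable gradient tracking.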

In [13]:
# Gradient control methods
print("🎛️ Controlling Gradient Tracking\n")

# Method 1: torch.no_grad()
x = torch.tensor([2.0], requires_grad=True)
print(f"x requires_grad: {x.requires_grad}")

y1 = x ** 2
print(f"y1 = x² requires_grad: {y1.requires_grad}")

with torch.no_grad():
    y2 = x ** 2
    print(f"y2 = x² (no_grad) requires_grad: {y2.requires_grad}")

# Method 2: .detach()
print(f"\nUsing .detach():")
z = x ** 2
z_detached = z.detach()
print(f"z requires_grad: {z.requires_grad}")
print(f"z_detached requires_grad: {z_detached.requires_grad}")

# Practical example: Model evaluation
print(f"\n🔍 Practical Example: Inference")
test_input = torch.randn(5, 10)

model.eval()
print("Inference with gradients (slower):")
start = time.time()
with_grad_out = model(test_input)
with_grad_time = time.time() - start
print(f"   Time: {with_grad_time:.6f}s")

print("Inference without gradients (faster):")
start = time.time()
with torch.no_grad():
    no_grad_out = model(test_input)
no_grad_time = time.time() - start
print(f"   Time: {no_grad_time:.6f}s")
print(f"   Speedup: {with_grad_time/no_grad_time:.2f}x")

print(f"\n📊 TensorFlow vs PyTorch Gradient Control:")
print(f"   TensorFlow: @tf.function, tf.stop_gradient()")
print(f"   PyTorch:    torch.no_grad(), .detach()")

🎛️ Controlling Gradient Tracking

x requires_grad: True
y1 = x² requires_grad: True
y2 = x² (no_grad) requires_grad: False

Using .detach():
z requires_grad: True
z_detached requires_grad: False

🔍 Practical Example: Inference
Inference with gradients (slower):
   Time: 0.000455s
Inference without gradients (faster):
   Time: 0.000809s
   Speedup: 0.56x

📊 TensorFlow vs PyTorch Gradient Control:
   TensorFlow: @tf.function, tf.stop_gradient()
   PyTorch:    torch.no_grad(), .detach()
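
A single timed call on a model this small mostly measures noise. The sketch below averages many calls after a warm-up instead. The helper `mean_forward_time`, the repeat counts, and the `bench_` names are hypothetical, and the numbers will vary by machine, so no expected timings are given.

```python
import time
import torch
import torch.nn as nn

def mean_forward_time(model, inputs, track_grad, repeats=200, warmup=20):
    """Average seconds per forward pass, with or without gradient tracking."""
    context = torch.enable_grad if track_grad else torch.no_grad
    with context():
        for _ in range(warmup):          # warm-up calls are not timed
            model(inputs)
        start = time.perf_counter()
        for _ in range(repeats):
            model(inputs)
        return (time.perf_counter() - start) / repeats

bench_model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 8)).eval()
bench_input = torch.randn(5, 10)

t_grad = mean_forward_time(bench_model, bench_input, track_grad=True)
t_no_grad = mean_forward_time(bench_model, bench_input, track_grad=False)
print(f"with grad:    {t_grad * 1e6:.1f} µs per call")
print(f"without grad: {t_no_grad * 1e6:.1f} µs per call")

# The outputs match; only the autograd bookkeeping differs
tracked_out = bench_model(bench_input)
with torch.no_grad():
    untracked_out = bench_model(bench_input)
print(tracked_out.requires_grad, untracked_out.requires_grad)
```

On a model this size the two averages may still be close. The difference that always holds is that `untracked_out` carries no graph, so nothing is kept alive for a backward pass.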


## 5. Why optimizer.zero_grad() is Critical

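Each `backward()` call adds to the existing `.grad` values instead of overwriting them, so gradients must be cleared between optimizer steps. The two runs below are not a like-for-like comparison: the first starts from a random weight and calls `optimizer.step()` once after all three samples, while the second starts from weight 1.0 and steps after every sample. Compare the per-sample gradients rather than the final weights.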

In [14]:
# Demonstrate gradient accumulation problem
print("📚 Gradient Accumulation: Why zero_grad() Matters\n")

# Simple linear model
simple_model = nn.Linear(1, 1)
data_x = torch.tensor([[1.0], [2.0], [3.0]])
data_y = torch.tensor([[2.0], [4.0], [6.0]])

criterion = nn.MSELoss()
optimizer = optim.SGD(simple_model.parameters(), lr=0.01)

print("Initial weight:", simple_model.weight.item())

# Training WITHOUT zero_grad() - WRONG!
print(f"\n❌ Training WITHOUT zero_grad():")
for i, (x, y) in enumerate(zip(data_x, data_y)):
    # NO zero_grad() call!
    pred = simple_model(x)
    loss = criterion(pred, y)
    loss.backward()

    grad = simple_model.weight.grad.item()
    print(f"Sample {i+1}: gradient = {grad:.4f} (accumulating!)")

optimizer.step()
print(f"Weight after wrong training: {simple_model.weight.item():.4f}")

# Reset parameters in place; no_grad() keeps the reset out of the autograd graph
with torch.no_grad():
    simple_model.weight.fill_(1.0)
    simple_model.bias.fill_(0.0)

# Training WITH zero_grad() - CORRECT!
print(f"\n✅ Training WITH zero_grad():")
for i, (x, y) in enumerate(zip(data_x, data_y)):
    optimizer.zero_grad()  # Clear gradients!
    pred = simple_model(x)
    loss = criterion(pred, y)
    loss.backward()

    grad = simple_model.weight.grad.item()
    print(f"Sample {i+1}: gradient = {grad:.4f} (fresh each time)")

    optimizer.step()

print(f"Weight after correct training: {simple_model.weight.item():.4f}")

print(f"\n🎯 Key Takeaway:")
print(f"   ALWAYS call optimizer.zero_grad() before loss.backward()")
print(f"   PyTorch accumulates gradients for flexibility")
print(f"   But this interferes with normal training if not cleared")

📚 Gradient Accumulation: Why zero_grad() Matters

Initial weight: -0.303381085395813

❌ Training WITHOUT zero_grad():
Sample 1: gradient = -5.7233 (accumulating!)
Sample 2: gradient = -26.3835 (accumulating!)
Sample 3: gradient = -71.1940 (accumulating!)
Weight after wrong training: 0.4086

✅ Training WITH zero_grad():
Sample 1: gradient = -2.0000 (fresh each time)
Sample 2: gradient = -7.7600 (fresh each time)
Sample 3: gradient = -15.8904 (fresh each time)
Weight after correct training: 1.2565

🎯 Key Takeaway:
   ALWAYS call optimizer.zero_grad() before loss.backward()
   PyTorch accumulates gradients for flexibility
   But this interferes with normal training if not cleared
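
Accumulation is also useful on purpose. The sketch below shows that summing the gradients of scaled micro-batch losses reproduces the full-batch gradient, which is the idea behind gradient accumulation when a full batch does not fit in memory. The `acc_` and `micro_` names and the one-sample micro-batches are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
acc_model = nn.Linear(1, 1)
mse = nn.MSELoss()
xs = torch.tensor([[1.0], [2.0], [3.0]])
ys = torch.tensor([[2.0], [4.0], [6.0]])

# Reference: one backward pass over the full batch (mean of 3 losses)
acc_model.zero_grad()
mse(acc_model(xs), ys).backward()
full_batch_grad = acc_model.weight.grad.clone()

# Accumulate over three micro-batches of one sample each,
# scaling each loss by 1/3 so the sum matches the full-batch mean
acc_model.zero_grad()
for x_i, y_i in zip(xs, ys):
    micro_loss = mse(acc_model(x_i), y_i) / len(xs)
    micro_loss.backward()          # adds into .grad, no zero_grad in between
accumulated_grad = acc_model.weight.grad.clone()

print(full_batch_grad, accumulated_grad)
```

In this pattern `zero_grad()` is called once per optimizer step, not once per `backward()` call.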


## 6. TensorBoard Gradient Monitoring

Monitor gradients with TensorBoard for debugging:

In [15]:
# TensorBoard gradient monitoring
print("📊 TensorBoard Gradient Monitoring\n")

# Setup logging
if IS_COLAB:
    log_dir = "/content/tensorboard_logs/autograd_demo"
else:
    log_dir = "./tensorboard_logs/autograd_demo"

os.makedirs(log_dir, exist_ok=True)
writer = SummaryWriter(log_dir)

# Create fresh model for monitoring
monitor_model = CityClassifier()
monitor_optimizer = optim.Adam(monitor_model.parameters(), lr=0.001)
monitor_criterion = nn.CrossEntropyLoss()

print(f"Logging to: {log_dir}")

# Train with gradient logging
for step in range(10):
    # Generate data
    batch_x = torch.randn(16, 10)
    batch_y = torch.randint(0, 8, (16,))

    # Training step
    monitor_optimizer.zero_grad()
    outputs = monitor_model(batch_x)
    loss = monitor_criterion(outputs, batch_y)
    loss.backward()

    # Log gradients
    for name, param in monitor_model.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            writer.add_scalar(f"Gradients/{name}_norm", grad_norm, step)
            writer.add_histogram(f"Gradients/{name}", param.grad, step)

    writer.add_scalar("Loss", loss.item(), step)
    monitor_optimizer.step()

    if step % 5 == 0:
        print(f"Step {step}: Loss = {loss.item():.4f}")

writer.close()
print(f"\n✅ Gradient monitoring complete!")
print(f"📊 View in TensorBoard:")
if IS_COLAB:
    print(f"   %load_ext tensorboard")
    print(f"   %tensorboard --logdir {log_dir}")
else:
    print(f"   tensorboard --logdir {log_dir}")
    print(f"   Open http://localhost:6006")

📊 TensorBoard Gradient Monitoring

Logging to: /content/tensorboard_logs/autograd_demo
Step 0: Loss = 2.1367
Step 5: Loss = 2.1358

✅ Gradient monitoring complete!
📊 View in TensorBoard:
   %load_ext tensorboard
   %tensorboard --logdir /content/tensorboard_logs/autograd_demo
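
The per-parameter norms logged above can be combined into one global gradient norm, which is often easier to watch as a single curve. This is a minimal sketch. The helper `global_grad_norm` and the `probe_` names are hypothetical, and the stand-in model only mirrors the layer sizes of `CityClassifier`.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
probe_model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 8))
probe_loss = nn.CrossEntropyLoss()(
    probe_model(torch.randn(16, 10)), torch.randint(0, 8, (16,))
)
probe_loss.backward()

def global_grad_norm(parameters):
    """L2 norm of all gradients treated as one flattened vector."""
    grads = [p.grad for p in parameters if p.grad is not None]
    return torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()

# Per-parameter norms, as logged to TensorBoard in the cell above
per_param_norms = {
    name: p.grad.norm().item() for name, p in probe_model.named_parameters()
}
total_norm = global_grad_norm(probe_model.parameters())

for name, value in per_param_norms.items():
    print(f"{name}: {value:.6f}")
print(f"global: {total_norm:.6f}")
```

The global norm equals the square root of the sum of the squared per-parameter norms, so either view can be derived from the other.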


## 7. Summary: PyTorch Autograd Mastery

🎓 **Congratulations!** You've mastered PyTorch Autograd fundamentals!

In [16]:
# Summary
print("🎓 PyTorch Autograd Mastery Summary\n")

print("✅ Concepts Mastered:")
concepts = [
    "Dynamic computation graphs with requires_grad=True",
    "Gradient computation using .backward()",
    "Autograd in training loops with zero_grad()",
    "Gradient control with torch.no_grad() and .detach()",
    "Understanding gradient accumulation",
    "TensorBoard gradient monitoring",
    "TensorFlow vs PyTorch comparisons"
]

for i, concept in enumerate(concepts, 1):
    print(f"  {i}. {concept}")

print(f"\n🌏 Australian Examples:")
examples = [
    "Sydney tourism revenue optimization",
    "Australian city classification",
    "Real-world gradient computation scenarios"
]

for i, example in enumerate(examples, 1):
    print(f"  {i}. {example}")

print(f"\n🚀 Next Steps:")
next_steps = [
    "🧠 Neural Network architectures with nn.Module",
    "📚 Data loading with DataLoader and Dataset",
    "🏋️ Advanced training techniques",
    "🤗 Hugging Face transformers integration",
    "⚡ Performance optimization"
]

for i, step in enumerate(next_steps, 1):
    print(f"  {i}. {step}")

print(f"\n🎯 Key Autograd Rules:")
rules = [
    "Always call optimizer.zero_grad() before loss.backward()",
    "Use torch.no_grad() for inference to save memory",
    "Monitor gradient norms to detect training issues",
    "Use .detach() when you need values without gradients"
]

for i, rule in enumerate(rules, 1):
    print(f"  {i}. {rule}")

print(f"\n🔥 PyTorch Advantages:")
advantages = [
    "Intuitive gradient computation",
    "Dynamic graphs for flexible models",
    "Easier debugging with immediate execution",
    "Strong research ecosystem"
]

for i, advantage in enumerate(advantages, 1):
    print(f"  {i}. {advantage}")

print(f"\n🏆 You're ready for advanced PyTorch development!")
print(f"Welcome to the PyTorch community! 🔥")

🎓 PyTorch Autograd Mastery Summary

✅ Concepts Mastered:
  1. Dynamic computation graphs with requires_grad=True
  2. Gradient computation using .backward()
  3. Autograd in training loops with zero_grad()
  4. Gradient control with torch.no_grad() and .detach()
  5. Understanding gradient accumulation
  6. TensorBoard gradient monitoring
  7. TensorFlow vs PyTorch comparisons

🌏 Australian Examples:
  1. Sydney tourism revenue optimization
  2. Australian city classification
  3. Real-world gradient computation scenarios

🚀 Next Steps:
  1. 🧠 Neural Network architectures with nn.Module
  2. 📚 Data loading with DataLoader and Dataset
  3. 🏋️ Advanced training techniques
  4. 🤗 Hugging Face transformers integration
  5. ⚡ Performance optimization

🎯 Key Autograd Rules:
  1. Always call optimizer.zero_grad() before loss.backward()
  2. Use torch.no_grad() for inference to save memory
  3. Monitor gradient norms to detect training issues
  4. Use .detach() when you need values without gradients