# Calculus for Deep Learning: The Mathematics of Learning

## 🎯 Introduction

Welcome to the mathematical engine that makes neural networks learn! This notebook will demystify how calculus powers every aspect of deep learning, from simple gradient descent to complex backpropagation through transformer architectures. Understanding calculus is essential for truly mastering how neural networks optimize themselves.

### 🧠 What You'll Master

This comprehensive guide covers:
- **Derivatives and gradients**: How neural networks know which direction to improve
- **Chain rule mastery**: The mathematical foundation of backpropagation
- **Partial derivatives**: Understanding how multi-variable functions change
- **Optimization theory**: Why gradient descent works and when it fails
- **Computational graphs**: How automatic differentiation computes gradients

### 🎓 Prerequisites

- Basic understanding of functions and limits
- Familiarity with the PyTorch autograd system
- Elementary knowledge of matrix operations
- High school algebra and basic function concepts

### 🚀 Why Calculus is the Heart of Deep Learning

Calculus enables neural network learning because:
- **Optimization**: Finding minima in high-dimensional loss landscapes
- **Backpropagation**: Efficiently computing gradients through complex networks
- **Learning rates**: Understanding how fast to update parameters
- **Convergence**: Knowing when and why training succeeds or fails
- **Architecture design**: Understanding gradient flow through different layers

---

## 📚 Table of Contents

1. **[Derivatives and Gradients](#derivatives-and-gradients)** - The direction of improvement
2. **[Chain Rule and Backpropagation](#chain-rule-and-backpropagation)** - How gradients flow backward
3. **[Partial Derivatives in Action](#partial-derivatives-in-action)** - Multi-variable optimization
4. **[Computational Graphs](#computational-graphs)** - Automatic differentiation explained
5. **[Optimization Landscapes](#optimization-landscapes)** - Understanding loss surfaces

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt

## Derivatives and Gradients

### 📈 The Mathematics of Direction

Derivatives tell us how a function's output changes as its input changes. In deep learning, this tells us which direction to adjust each parameter to reduce the loss. The gradient is the multivariable extension of the derivative, and it drives nearly all neural network optimization.
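
A derivative can also be approximated numerically, which gives a quick sanity check for any gradient computed by a framework. The sketch below is framework-free Python; `numerical_derivative` and the step size `h` are illustrative choices, not part of PyTorch. It applies a central difference to $f(x) = x^2 + 3x + 1$ at $x = 2$, where the analytical derivative is $2x + 3 = 7$.

```python
def f(x):
    """Example function f(x) = x^2 + 3x + 1."""
    return x**2 + 3*x + 1

def numerical_derivative(func, x, h=1e-5):
    """Central-difference estimate of func'(x); h is an assumed step size."""
    return (func(x + h) - func(x - h)) / (2 * h)

approx = numerical_derivative(f, 2.0)
exact = 2 * 2.0 + 3  # analytical derivative f'(x) = 2x + 3 at x = 2
print(f"numerical: {approx:.6f}, analytical: {exact:.6f}")
```

A smaller `h` reduces truncation error but amplifies floating-point rounding, so a value around `1e-5` is a reasonable compromise in double precision.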

In [None]:
# =============================================================================
# DERIVATIVES AND GRADIENTS: THE MATHEMATICS OF LEARNING
# =============================================================================

print("📈 Derivatives in Deep Learning Context")
print("=" * 50)

# Show how derivatives guide neural network optimization
print("🎯 What Derivatives Tell Us in Neural Networks:")
print("• How loss changes with respect to each parameter")
print("• Which direction to move parameters to reduce loss")  
print("• How sensitive the output is to input changes")
print("• Whether we're near a minimum or still need to optimize")

# Demonstrate basic derivative concepts with PyTorch
x = torch.tensor(2.0, requires_grad=True)

# Simple function: f(x) = x² + 3x + 1
# Derivative: f'(x) = 2x + 3
f = x**2 + 3*x + 1
print(f"\nFunction: f(x) = x² + 3x + 1")
print(f"At x = {x.item()}")
print(f"f({x.item()}) = {f.item()}")

# Compute derivative using autograd
f.backward()
print(f"f'({x.item()}) = {x.grad.item()}")
print(f"Analytical: f'(2) = 2(2) + 3 = {2*2 + 3}")

print(f"\n🧮 Gradient Descent in Action")
print("=" * 50)

# Demonstrate how gradients guide optimization
# Goal: minimize f(x) = (x - 5)² using gradient descent

x = torch.tensor(0.0, requires_grad=True)  # Start at x = 0
learning_rate = 0.1
target = 5.0

print(f"Minimizing f(x) = (x - {target})² starting from x = {x.item()}")
print(f"True minimum is at x = {target}")

print("\nStep | x value | f(x) value | Gradient | Next x")
print("-----|---------|------------|----------|----------")

for step in range(10):
    # Compute function value
    f = (x - target)**2
    
    # Compute gradient
    if x.grad is not None:
        x.grad.zero_()  # Clear previous gradients
    f.backward()
    
    current_x = x.item()
    current_f = f.item()
    current_grad = x.grad.item()
    
    print(f"{step:4d} | {current_x:7.3f} | {current_f:10.3f} | {current_grad:8.3f} | ", end="")
    
    # Gradient descent update: x = x - learning_rate * gradient
    with torch.no_grad():
        x -= learning_rate * x.grad
    
    print(f"{x.item():7.3f}")
    
    # Stop if very close to minimum
    if abs(x.item() - target) < 0.001:
        print(f"Converged to target {target}!")
        break

print(f"\n💡 Key Insights About Gradients")
print("=" * 50)
print("1. Direction: the gradient points toward steepest increase")
print("2. Magnitude: a larger gradient means a steeper slope, so a larger step at a fixed learning rate")
print("3. Zero gradient: indicates a critical point (minimum, maximum, or saddle)")
print("4. Opposite direction: we move against the gradient to minimize")
print("5. Learning rate: controls how large each step is")

print(f"\n🔍 Multivariable Functions and Partial Derivatives")
print("=" * 50)

# Neural networks have many parameters, so we need partial derivatives
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

# Function of two variables: f(x,y) = x²y + xy² + 2
# ∂f/∂x = 2xy + y²
# ∂f/∂y = x² + 2xy
f = x**2 * y + x * y**2 + 2

print(f"Function: f(x,y) = x²y + xy² + 2")
print(f"At point ({x.item()}, {y.item()})")
print(f"f({x.item()}, {y.item()}) = {f.item()}")

f.backward()

print(f"\nPartial derivatives:")
print(f"∂f/∂x = {x.grad.item():.3f}")
print(f"∂f/∂y = {y.grad.item():.3f}")

# Verify analytical computation
analytical_dx = 2*x.item()*y.item() + y.item()**2
analytical_dy = x.item()**2 + 2*x.item()*y.item()
print(f"\nAnalytical verification:")
print(f"∂f/∂x = 2xy + y² = 2({x.item()})({y.item()}) + ({y.item()})² = {analytical_dx}")
print(f"∂f/∂y = x² + 2xy = ({x.item()})² + 2({x.item()})({y.item()}) = {analytical_dy}")

print(f"\n🎯 The Gradient Vector")
print("=" * 30)

gradient = torch.tensor([x.grad.item(), y.grad.item()])
print(f"Gradient vector: ∇f = {gradient}")
print(f"Gradient magnitude: ||∇f|| = {torch.norm(gradient).item():.3f}")

# The gradient points in the direction of steepest increase
print(f"\nGradient interpretation:")
print(f"• Direction of steepest increase: {gradient/torch.norm(gradient)}")
print(f"• To minimize, move in direction: {-gradient/torch.norm(gradient)}")

print(f"\n🧠 Neural Network Parameter Update")
print("=" * 50)

# Simulate a simple neural network parameter update
print("In a neural network, each parameter gets updated based on its gradient:")

# Simulate some network parameters and their gradients
params = {
    'weight_1': torch.tensor(0.5, requires_grad=True),
    'weight_2': torch.tensor(-0.3, requires_grad=True), 
    'bias': torch.tensor(0.1, requires_grad=True)
}

# Simulate a loss computation
loss = params['weight_1']**2 + params['weight_2']**2 + params['bias']**2
print(f"Simulated loss: {loss.item():.4f}")

# Compute gradients
loss.backward()

# Show gradient-based updates
learning_rate = 0.1
print(f"\nParameter updates (learning_rate = {learning_rate}):")
print("Parameter | Current | Gradient | New Value")
print("----------|---------|----------|----------")

for name, param in params.items():
    current_val = param.item()
    grad_val = param.grad.item()
    new_val = current_val - learning_rate * grad_val
    print(f"{name:9} | {current_val:7.3f} | {grad_val:8.3f} | {new_val:7.3f}")

print(f"\n✨ The Magic of Automatic Differentiation")
print("=" * 50)
print("PyTorch automatically computes gradients for any function built from differentiable tensor operations!")
print("• Forward pass: Compute function values")
print("• Backward pass: Compute gradients using chain rule")
print("• No manual derivative calculations needed")
print("• Works with arbitrarily complex neural network architectures")
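
The update rule `x = x - learning_rate * gradient` from the cell above is sensitive to the learning rate. As a minimal sketch in plain Python (the helper `run_gradient_descent` and the three rates are illustrative assumptions), the loop below minimizes the same $f(x) = (x - 5)^2$. For this function each step multiplies the distance to the minimum by $(1 - 2\eta)$, so the iteration converges only when $0 < \eta < 1$.

```python
def run_gradient_descent(lr, start=0.0, target=5.0, steps=20):
    """Minimize f(x) = (x - target)^2 with a fixed learning rate."""
    x = start
    for _ in range(steps):
        grad = 2 * (x - target)  # f'(x)
        x -= lr * grad           # gradient descent update
    return x

# Small, ideal, and too-large learning rates for this particular function
results = {lr: run_gradient_descent(lr) for lr in (0.1, 0.5, 1.1)}

for lr, final_x in results.items():
    print(f"lr={lr}: final x = {final_x:.4f}, distance to minimum = {abs(final_x - 5.0):.4f}")
```

With `lr=0.1` the iterate approaches 5 gradually, with `lr=0.5` it reaches 5 in a single step, and with `lr=1.1` the distance grows on every step because $|1 - 2\eta| > 1$.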

## Partial Derivatives

**Formula:** $\frac{\partial f}{\partial x_i}$

The partial derivative of $f$ with respect to $x_i$ measures how $f$ changes when only $x_i$ varies and all other variables are held constant.

In [None]:
# Multi-variable function
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x**2 * y + x * y**2  # f(x,y) = x²y + xy²

z.backward()
print(f"f({x.item()}, {y.item()}) = {z.item()}")
print(f"∂f/∂x = {x.grad.item()}")  # Should be 2xy + y²
print(f"∂f/∂y = {y.grad.item()}")  # Should be x² + 2xy

# Neural network layer with multiple parameters
batch_size, input_dim, output_dim = 4, 3, 2
X = torch.randn(batch_size, input_dim)
W = torch.randn(output_dim, input_dim, requires_grad=True)
b = torch.randn(output_dim, requires_grad=True)

# Forward pass
Y = X @ W.T + b
loss = Y.sum()
loss.backward()

print(f"\nWeight gradients shape: {W.grad.shape}")
print(f"Bias gradients shape: {b.grad.shape}")
print(f"Each gradient shows how loss changes w.r.t. that parameter")

# Examine specific parameter gradients
print(f"∂loss/∂W[0,0] = {W.grad[0,0].item():.3f}")
print(f"∂loss/∂b[0] = {b.grad[0].item():.3f}")
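
For the layer in the cell above with `loss = Y.sum()`, the partial derivatives have a simple closed form: $\partial \text{loss} / \partial W_{ij} = \sum_n X_{nj}$ and $\partial \text{loss} / \partial b_i = N$, where $N$ is the batch size. The plain-Python sketch below uses small hand-picked values (assumptions, not the random tensors from the cell) and checks one entry with a finite difference.

```python
# Illustrative values (assumptions): batch of 2 inputs, 3 features, 2 outputs
X = [[1.0, 2.0, 3.0],
     [0.5, -1.0, 2.0]]
W = [[0.1, 0.2, 0.3],
     [-0.4, 0.5, 0.6]]
b = [0.0, 1.0]

def loss_fn(W, b):
    """loss = sum over the batch and outputs of (X @ W.T + b)."""
    total = 0.0
    for row in X:
        for i in range(len(W)):
            total += sum(w * x for w, x in zip(W[i], row)) + b[i]
    return total

# Closed form: dloss/dW[i][j] = sum_n X[n][j], dloss/db[i] = batch size
column_sums = [sum(row[j] for row in X) for j in range(len(X[0]))]
grad_W = [column_sums[:] for _ in W]
grad_b = [float(len(X)) for _ in b]

# Finite-difference check of a single entry, W[0][1]
h = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[0][1] += h
W_minus[0][1] -= h
fd_estimate = (loss_fn(W_plus, b) - loss_fn(W_minus, b)) / (2 * h)

print("grad_W:", grad_W)
print("grad_b:", grad_b)
print(f"finite difference for W[0][1]: {fd_estimate:.6f}")
```

Every row of `grad_W` is identical because each output enters the sum with weight 1; a different loss would weight the rows differently.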

## Chain Rule

**Formula:** $\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$

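The chain rule expresses the derivative of a composition as the product of the derivatives of its parts. Backpropagation applies it layer by layer, from the loss back to each parameter.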

In [None]:
# Manual chain rule demonstration
x = torch.tensor(2.0, requires_grad=True)

# Composition: f(g(h(x))) where h(x)=x², g(u)=u+1, f(v)=v³
h = x**2        # h(x) = x²
g = h + 1       # g(h) = h + 1  
f = g**3        # f(g) = g³

f.backward()
print(f"Input: {x.item()}")
print(f"h(x) = x² = {h.item()}")
print(f"g(h) = h + 1 = {g.item()}")
print(f"f(g) = g³ = {f.item()}")
print(f"df/dx via autograd (chain rule applied automatically): {x.grad.item()}")

# Manual verification: 
# df/dx = df/dg * dg/dh * dh/dx = 3g² * 1 * 2x = 3(x²+1)² * 2x
manual = 3 * (x.item()**2 + 1)**2 * 2 * x.item()
print(f"Manual calculation: {manual}")

# Neural network chain rule
class SimpleNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(2, 3)
        self.layer2 = torch.nn.Linear(3, 1)
        
    def forward(self, x):
        h1 = torch.relu(self.layer1(x))  # First composition
        h2 = self.layer2(h1)             # Second composition
        return h2

net = SimpleNet()
x_input = torch.randn(1, 2)
target = torch.randn(1, 1)

output = net(x_input)
loss = torch.nn.functional.mse_loss(output, target)
loss.backward()

print(f"\nNetwork output: {output.item():.3f}")
print(f"Loss: {loss.item():.3f}")
print(f"Layer 1 weight gradients: {net.layer1.weight.grad[0][:2]}")
print(f"Layer 2 weight gradients: {net.layer2.weight.grad[0][:2]}")
print("Gradients computed via automatic chain rule application")
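
To make "automatic chain rule application" concrete, here is a minimal reverse-mode autodiff sketch in plain Python. The `Value` class is a hypothetical teaching helper, not PyTorch's implementation, and it supports only `+`, `*`, and constant powers. Each operation records its inputs and local derivatives, and `backward` walks the graph in reverse, multiplying and accumulating them. It reproduces $f = (x^2 + 1)^3$ at $x = 2$ from the cell above, where $df/dx = 3(x^2 + 1)^2 \cdot 2x = 300$.

```python
class Value:
    """A scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # Value nodes this one depends on
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, exponent):
        # exponent is a plain number, so only self receives a gradient
        local = exponent * self.data ** (exponent - 1)
        return Value(self.data ** exponent, (self,), (local,))

    def backward(self):
        # Topological order ensures a node's gradient is complete before it is propagated
        order, visited = [], set()

        def visit(node):
            if node not in visited:
                visited.add(node)
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(self)/d(self)
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad  # chain rule, accumulated over paths


x = Value(2.0)
h = x ** 2   # h(x) = x^2
g = h + 1    # g(h) = h + 1
f = g ** 3   # f(g) = g^3
f.backward()
print(f"f = {f.data}, df/dx = {x.grad}")
```

Gradients are accumulated with `+=` so that a variable used along several paths receives the sum of all contributions, which is also why the cells in this notebook zero out `.grad` before each new backward pass.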

## Gradient

**Formula:** $\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]$

The gradient collects all partial derivatives of $f$ into one vector, which points in the direction of steepest increase.
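
The "steepest increase" claim can be checked numerically. The rate of change of $f$ along a unit vector $u$ is the directional derivative $\nabla f \cdot u$, and it is largest when $u$ is aligned with the gradient, where it equals $\|\nabla f\|$. The plain-Python sketch below (names and the 1-degree sampling are illustrative assumptions) scans directions for $f(x, y) = x^2 + y^2 - 2x - 4y + 5$ at the origin.

```python
import math

def grad_f(x, y):
    """Analytical gradient of f(x, y) = x^2 + y^2 - 2x - 4y + 5."""
    return (2 * x - 2, 2 * y - 4)

gx, gy = grad_f(0.0, 0.0)  # gradient at the origin: (-2, -4)
grad_norm = math.hypot(gx, gy)

# Directional derivative along the unit vector at angle theta is grad . u
best_angle, best_rate = None, -math.inf
for degrees in range(360):
    theta = math.radians(degrees)
    rate = gx * math.cos(theta) + gy * math.sin(theta)
    if rate > best_rate:
        best_angle, best_rate = degrees, rate

gradient_angle = math.degrees(math.atan2(gy, gx)) % 360

print(f"steepest sampled direction: {best_angle} degrees, rate {best_rate:.4f}")
print(f"gradient direction: {gradient_angle:.2f} degrees, norm {grad_norm:.4f}")
```

The best sampled direction sits within half a degree of the gradient's own angle, and its rate matches the gradient norm up to the sampling resolution.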

In [None]:
# Gradient descent on a 2D function (the path is recorded for optional plotting)
import matplotlib.pyplot as plt
import numpy as np

# Function: f(x,y) = x² + y² - 2x - 4y + 5 (has minimum at (1,2))
def f(x, y):
    return x**2 + y**2 - 2*x - 4*y + 5

# Analytical gradient, kept for reference (autograd is used below): ∇f = [2x-2, 2y-4]
def gradient(x, y):
    return torch.tensor([2*x - 2, 2*y - 4])

# Starting point
position = torch.tensor([0.0, 0.0], requires_grad=True)
learning_rate = 0.1
path = [position.detach().clone()]

print("Gradient descent optimization:")
for step in range(10):
    # Compute function value and gradient
    x, y = position
    loss = f(x, y)
    
    # Clear previous gradients
    if position.grad is not None:
        position.grad.zero_()
    
    loss.backward()
    
    print(f"Step {step}: pos=({x:.2f}, {y:.2f}), f={loss:.3f}, grad=({position.grad[0]:.2f}, {position.grad[1]:.2f})")
    
    # Update position (gradient descent step)
    with torch.no_grad():
        position -= learning_rate * position.grad
    
    path.append(position.detach().clone())
    
    # Stop if gradient is small
    if torch.norm(position.grad) < 0.01:
        break

print(f"\nFinal position: ({position[0]:.3f}, {position[1]:.3f})")
print(f"Theoretical minimum: (1.000, 2.000)")

# Gradient-based feature importance
model = torch.nn.Linear(5, 1)
input_data = torch.randn(1, 5, requires_grad=True)
target = torch.randn(1, 1)

output = model(input_data)
loss = torch.nn.functional.mse_loss(output, target)
loss.backward()

feature_importance = torch.abs(input_data.grad).squeeze()
print(f"\nFeature importance (|gradient|): {feature_importance}")
print(f"Most important feature: {feature_importance.argmax().item()}")

## Hessian

**Formula:** $\mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$

The Hessian is the matrix of second partial derivatives of $f$. It describes the local curvature of the function.

In [None]:
# Computing Hessian for simple function
def quadratic_loss(x):
    return 0.5 * (x[0]**2 + 2*x[1]**2 + x[0]*x[1])

x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = quadratic_loss(x)

# Compute gradients
grad = torch.autograd.grad(loss, x, create_graph=True)[0]

# Compute Hessian (second derivatives)
hessian = torch.zeros(2, 2)
for i in range(2):
    grad2 = torch.autograd.grad(grad[i], x, retain_graph=True)[0]
    hessian[i] = grad2

print(f"Loss: {loss.item():.3f}")
print(f"Gradient: {grad}")
print(f"Hessian:\n{hessian}")

# Condition number analysis (the Hessian is symmetric, so eigvalsh applies)
eigenvals = torch.linalg.eigvalsh(hessian)
condition_number = (eigenvals.abs().max() / eigenvals.abs().min()).item()
print(f"Condition number: {condition_number:.2f}")
print(f"Well-conditioned: {condition_number < 100}")  # 100 is an illustrative threshold
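
For the quadratic in the cell above, the Hessian can be written down by hand: $f(x) = \tfrac{1}{2}(x_0^2 + 2x_1^2 + x_0 x_1)$ gives $\mathbf{H} = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 2 \end{bmatrix}$ at every point. The plain-Python sketch below (variable names are illustrative) computes its eigenvalues in closed form and then takes a single Newton step $x - \mathbf{H}^{-1} \nabla f$, shown here only as one example of how curvature information can be used. Because the function is exactly quadratic, that one step lands on the minimum at the origin.

```python
import math

# Hessian of f(x) = 0.5 * (x0^2 + 2*x1^2 + x0*x1), constant for a quadratic
H = [[1.0, 0.5],
     [0.5, 2.0]]

def grad(x):
    """Analytical gradient of f at x."""
    return [x[0] + 0.5 * x[1], 0.5 * x[0] + 2.0 * x[1]]

# Eigenvalues of a symmetric 2x2 matrix from its trace and determinant
trace = H[0][0] + H[1][1]
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
root = math.sqrt(trace**2 - 4 * det)
eig_min, eig_max = (trace - root) / 2, (trace + root) / 2
condition_number = eig_max / eig_min

# Newton step: solve H @ step = grad(x) with Cramer's rule, then x_new = x - step
x = [1.0, 2.0]
g = grad(x)
step = [(g[0] * H[1][1] - H[0][1] * g[1]) / det,
        (H[0][0] * g[1] - H[1][0] * g[0]) / det]
x_new = [x[0] - step[0], x[1] - step[1]]

print(f"eigenvalues: {eig_min:.4f}, {eig_max:.4f}")
print(f"condition number: {condition_number:.4f}")
print(f"after one Newton step: {x_new}")
```

A condition number close to 1 means the curvature is similar in all directions, which is the situation where plain gradient descent with a single learning rate behaves best.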

## Jacobian

**Formula:** $\mathbf{J}_{ij} = \frac{\partial f_i}{\partial x_j}$

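The Jacobian is the matrix of first partial derivatives of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, with one row per output and one column per input.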

In [None]:
# Vector-valued function example
def vector_function(x):
    return torch.stack([
        x[0]**2 + x[1],
        x[0] * x[1],
        torch.sin(x[0]) + torch.cos(x[1])
    ])

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = vector_function(x)

# Compute Jacobian
jacobian = torch.zeros(3, 2)
for i in range(3):
    if x.grad is not None:
        x.grad.zero_()
    y[i].backward(retain_graph=True)
    jacobian[i] = x.grad.clone()

print(f"Input: {x}")
print(f"Output: {y}")
print(f"Jacobian:\n{jacobian}")

# Neural network layer Jacobian
layer = torch.nn.Linear(3, 2)
x_batch = torch.randn(1, 3, requires_grad=True)
y_batch = layer(x_batch)

# Jacobian for neural network layer
jac = torch.autograd.functional.jacobian(layer, x_batch)
print(f"NN Jacobian shape: {jac.shape}")  # (batch, output_dim, batch, input_dim)
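
Backpropagation rarely builds a full Jacobian. A call to `backward` with an upstream vector $v$ computes the vector-Jacobian product $v^\top \mathbf{J}$ directly. The plain-Python sketch below (helper names are illustrative) writes out the analytical Jacobian of `vector_function` from the cell above at $x = (1, 2)$ and shows that $v^\top \mathbf{J}$ with $v = (1, 1, 1)$ equals the gradient of the summed outputs, checked with finite differences.

```python
import math

def vector_function(x):
    """Same mapping as the cell above: R^2 -> R^3."""
    return [x[0]**2 + x[1],
            x[0] * x[1],
            math.sin(x[0]) + math.cos(x[1])]

def jacobian(x):
    """Analytical Jacobian: rows are outputs, columns are inputs."""
    return [[2 * x[0], 1.0],
            [x[1], x[0]],
            [math.cos(x[0]), -math.sin(x[1])]]

x = [1.0, 2.0]
v = [1.0, 1.0, 1.0]  # upstream gradient of sum(outputs)
J = jacobian(x)

# Vector-Jacobian product: (v^T J)[j] = sum_i v[i] * J[i][j]
vjp = [sum(v[i] * J[i][j] for i in range(3)) for j in range(2)]

# Finite-difference gradient of sum(vector_function(x)) for comparison
h = 1e-6
fd = []
for j in range(2):
    x_plus, x_minus = x[:], x[:]
    x_plus[j] += h
    x_minus[j] -= h
    fd.append((sum(vector_function(x_plus)) - sum(vector_function(x_minus))) / (2 * h))

print("v^T J:", [round(val, 6) for val in vjp])
print("finite differences:", [round(val, 6) for val in fd])
```

Computing $v^\top \mathbf{J}$ costs about one backward pass regardless of the number of outputs, whereas the row-by-row loop in the cell above needs one backward pass per output.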