# Backpropagation from Scratch with PyTorch-like API

This notebook implements backpropagation from first principles with an API similar to PyTorch's autograd.

**Learning Goals:**
1. Understand how automatic differentiation works
2. See the connection between math and code
3. Build intuition for gradient flow
4. Appreciate what PyTorch does under the hood

**Key Concepts:**
- Computational graph
- Chain rule
- Gradient accumulation
- Parameter updates

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple

# Set random seed
np.random.seed(42)

print("Starting backpropagation tutorial...")

## Part 1: Building Our Own Tensor with Autograd

We'll create a `Tensor` class that:
- Stores data (like PyTorch tensors)
- Tracks gradients (`.grad` attribute)
- Records operations for backprop (computational graph)
- Has a `.backward()` method to compute gradients

In [None]:
class Tensor:
    """
    A tensor that supports automatic differentiation.
    Similar to PyTorch's torch.Tensor with requires_grad=True
    """
    
    def __init__(self, data, requires_grad=False, _children=()):
        """
        Args:
            data: The actual data (numpy array or scalar)
            requires_grad: Whether to track operations for gradient computation
            _children: Tensors this one was computed from (its inputs in the computational graph)
        """
        self.data = np.array(data, dtype=np.float32)
        self.requires_grad = requires_grad
        
        # Gradient will be accumulated here
        self.grad = None
        
        # For building computational graph
        self._backward = lambda: None  # Function to compute gradients
        self._prev = set(_children)     # Input nodes this tensor was computed from
    
    def __repr__(self):
        return f"Tensor(data={self.data}, grad={self.grad})"
    
    # ============================================
    # Mathematical Operations (Forward Pass)
    # ============================================
    
    def __matmul__(self, other):
        """
        Matrix multiplication: A @ B
        
        Forward: out = A @ B
        Backward: 
            dL/dA = dL/dout @ B.T
            dL/dB = A.T @ dL/dout
        """
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(
            self.data @ other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other)
        )
        
        def _backward():
            if self.requires_grad:
                # Chain rule: dL/dA = dL/dout @ B.T
                grad = out.grad @ other.data.T
                if self.grad is None:
                    self.grad = grad
                else:
                    self.grad += grad
            
            if other.requires_grad:
                # Chain rule: dL/dB = A.T @ dL/dout
                grad = self.data.T @ out.grad
                if other.grad is None:
                    other.grad = grad
                else:
                    other.grad += grad
        
        out._backward = _backward
        return out
    
    def __add__(self, other):
        """
        Addition: A + B
        
        Forward: out = A + B
        Backward: dL/dA = dL/dout, dL/dB = dL/dout
        """
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(
            self.data + other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other)
        )
        
        def _backward():
            if self.requires_grad:
                # Gradient passes through unchanged
                grad = out.grad
                # Handle broadcasting (sum over broadcasted dimensions)
                if self.data.shape != grad.shape:
                    # Sum over dimensions that were broadcasted
                    ndims_added = grad.ndim - self.data.ndim
                    for i in range(ndims_added):
                        grad = grad.sum(axis=0)
                    for i, (dim_orig, dim_grad) in enumerate(zip(self.data.shape, grad.shape)):
                        if dim_orig == 1 and dim_grad > 1:
                            grad = grad.sum(axis=i, keepdims=True)
                
                if self.grad is None:
                    # Copy: grad may be the same array object as out.grad, and
                    # later in-place accumulation must not modify out.grad
                    self.grad = grad.copy()
                else:
                    self.grad += grad
            
            if other.requires_grad:
                grad = out.grad
                # Handle broadcasting
                if other.data.shape != grad.shape:
                    ndims_added = grad.ndim - other.data.ndim
                    for i in range(ndims_added):
                        grad = grad.sum(axis=0)
                    for i, (dim_orig, dim_grad) in enumerate(zip(other.data.shape, grad.shape)):
                        if dim_orig == 1 and dim_grad > 1:
                            grad = grad.sum(axis=i, keepdims=True)
                
                if other.grad is None:
                    # Copy for the same aliasing reason as above
                    other.grad = grad.copy()
                else:
                    other.grad += grad
        
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        """
        Element-wise multiplication: A * B
        
        Forward: out = A * B
        Backward: dL/dA = dL/dout * B, dL/dB = dL/dout * A
        """
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(
            self.data * other.data,
            requires_grad=self.requires_grad or other.requires_grad,
            _children=(self, other)
        )
        
        def _backward():
            if self.requires_grad:
                grad = out.grad * other.data
                if self.grad is None:
                    self.grad = grad
                else:
                    self.grad += grad
            
            if other.requires_grad:
                grad = out.grad * self.data
                if other.grad is None:
                    other.grad = grad
                else:
                    other.grad += grad
        
        out._backward = _backward
        return out
    
    def __sub__(self, other):
        """Subtraction: A - B = A + (-B)"""
        return self + (other * -1)
    
    def __pow__(self, power):
        """
        Power: A ** n
        
        Forward: out = A^n
        Backward: dL/dA = dL/dout * n * A^(n-1)
        """
        out = Tensor(
            self.data ** power,
            requires_grad=self.requires_grad,
            _children=(self,)
        )
        
        def _backward():
            if self.requires_grad:
                # Power rule
                grad = out.grad * power * (self.data ** (power - 1))
                if self.grad is None:
                    self.grad = grad
                else:
                    self.grad += grad
        
        out._backward = _backward
        return out
    
    def sum(self, axis=None, keepdims=False):
        """
        Sum elements
        
        Forward: out = sum(A)
        Backward: dL/dA = dL/dout (broadcast to original shape)
        """
        out = Tensor(
            self.data.sum(axis=axis, keepdims=keepdims),
            requires_grad=self.requires_grad,
            _children=(self,)
        )
        
        def _backward():
            if self.requires_grad:
                # Gradient broadcasts back to original shape
                grad = out.grad
                if axis is not None and not keepdims:
                    grad = np.expand_dims(grad, axis=axis)
                grad = np.broadcast_to(grad, self.data.shape)
                
                if self.grad is None:
                    self.grad = grad.copy()
                else:
                    self.grad += grad
        
        out._backward = _backward
        return out
    
    def mean(self):
        """
        Mean of all elements
        
        Forward: out = mean(A) = sum(A) / n
        Backward: dL/dA = dL/dout / n
        """
        n = self.data.size
        return self.sum() * (1.0 / n)
    
    # ============================================
    # Activation Functions
    # ============================================
    
    def sigmoid(self):
        """
        Sigmoid activation: σ(x) = 1 / (1 + e^(-x))
        
        Forward: out = σ(A)
        Backward: dL/dA = dL/dout * σ(A) * (1 - σ(A))
        """
        out = Tensor(
            1 / (1 + np.exp(-self.data)),
            requires_grad=self.requires_grad,
            _children=(self,)
        )
        
        def _backward():
            if self.requires_grad:
                # Derivative: σ'(x) = σ(x) * (1 - σ(x))
                sigmoid_grad = out.data * (1 - out.data)
                grad = out.grad * sigmoid_grad
                
                if self.grad is None:
                    self.grad = grad
                else:
                    self.grad += grad
        
        out._backward = _backward
        return out
    
    def relu(self):
        """
        ReLU activation: max(0, x)
        
        Forward: out = max(0, A)
        Backward: dL/dA = dL/dout * (A > 0)
        """
        out = Tensor(
            np.maximum(0, self.data),
            requires_grad=self.requires_grad,
            _children=(self,)
        )
        
        def _backward():
            if self.requires_grad:
                # Gradient is 1 where input > 0, else 0
                grad = out.grad * (self.data > 0)
                
                if self.grad is None:
                    self.grad = grad
                else:
                    self.grad += grad
        
        out._backward = _backward
        return out
    
    # ============================================
    # Backpropagation
    # ============================================
    
    def backward(self):
        """
        Compute gradients via backpropagation.
        Similar to PyTorch's loss.backward()
        
        This builds a topological ordering of the computational graph
        and calls _backward() on each node in reverse order.
        """
        # Build topological order (inputs first, output last, as in the forward pass)
        topo = []
        visited = set()
        
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        
        build_topo(self)
        
        # Initialize gradient of output (this tensor) to 1
        self.grad = np.ones_like(self.data)
        
        # Backpropagate through graph in reverse topological order
        for node in reversed(topo):
            node._backward()
    
    def zero_grad(self):
        """Reset gradients to None (like PyTorch's optimizer.zero_grad())"""
        self.grad = None


print("✓ Tensor class with autograd created!")
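
To see the graph-walking logic of `backward()` in isolation, here is a minimal scalar-only sketch. The `Value` class and its attribute names are hypothetical and exist only for this illustration; it assumes plain Python floats instead of NumPy arrays and supports just `+` and `*`.

```python
class Value:
    """Scalar node for illustration only (hypothetical, not the Tensor class above)."""

    def __init__(self, data, inputs=()):
        self.data = float(data)
        self.grad = 0.0
        self._inputs = tuple(inputs)
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(out)/d(self) = other and d(out)/d(other) = self
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data

        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()

        def build(node):
            if node not in visited:
                visited.add(node)
                for inp in node._inputs:
                    build(inp)
                topo.append(node)  # appended only after all of its inputs

        build(self)
        self.grad = 1.0
        # Output first, inputs last: each node's grad is complete before it is used
        for node in reversed(topo):
            node._backward()


x = Value(3.0)
y = x * x + x      # x appears three times in the graph
y.backward()
print(x.grad)      # dy/dx = 2x + 1 = 7.0
```

Because `x` feeds several nodes, its gradient is the sum of all incoming contributions, which is why each `_backward` uses `+=` rather than plain assignment.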

## Part 2: Test Basic Operations

Let's verify our implementation works correctly.

In [None]:
# Test 1: Simple scalar operations
print("Test 1: Scalar Operations")
print("=" * 50)

# Create tensors
a = Tensor(2.0, requires_grad=True)
b = Tensor(3.0, requires_grad=True)

# Forward pass: c = a * b + a
c = a * b + a
print(f"a = {a.data}, b = {b.data}")
print(f"c = a * b + a = {c.data}")

# Backward pass
c.backward()
print(f"\nGradients:")
print(f"dc/da = {a.grad} (expected: b + 1 = 4.0)")
print(f"dc/db = {b.grad} (expected: a = 2.0)")
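
The expected values in the printout follow from the chain rule. As a worked derivation for $c = a \cdot b + a$ with $a = 2$ and $b = 3$:

$$
\frac{\partial c}{\partial a} = \frac{\partial (a b)}{\partial a} + \frac{\partial a}{\partial a} = b + 1 = 4,
\qquad
\frac{\partial c}{\partial b} = \frac{\partial (a b)}{\partial b} = a = 2
$$

Note that $a$ feeds two nodes (the product and the sum), so its gradient is the sum of two contributions. This is gradient accumulation at work.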

In [None]:
# Test 2: Matrix multiplication
print("\nTest 2: Matrix Multiplication")
print("=" * 50)

# Create matrices
X = Tensor([[1., 2.], [3., 4.]], requires_grad=True)
W = Tensor([[0.5, 0.5], [0.5, 0.5]], requires_grad=True)

# Forward: Y = X @ W
Y = X @ W
print(f"X shape: {X.data.shape}")
print(f"W shape: {W.data.shape}")
print(f"Y = X @ W:\n{Y.data}")

# Loss = mean(Y)
loss = Y.mean()
print(f"\nLoss (mean of Y): {loss.data}")

# Backward
loss.backward()
print(f"\nGradient of X:\n{X.grad}")
print(f"\nGradient of W:\n{W.grad}")
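
The gradients for this test can be cross-checked by hand with the formulas from the `__matmul__` docstring. A minimal NumPy-only sketch, assuming the same `X` and `W` values as Test 2 (the `_np` variable names are introduced here for illustration):

```python
import numpy as np

X_np = np.array([[1., 2.], [3., 4.]])
W_np = np.array([[0.5, 0.5], [0.5, 0.5]])

Y_np = X_np @ W_np
# loss = mean(Y), so every element of Y receives gradient 1 / Y.size
dY = np.full_like(Y_np, 1.0 / Y_np.size)

dX = dY @ W_np.T   # dL/dX = dL/dY @ W.T
dW = X_np.T @ dY   # dL/dW = X.T @ dL/dY

print(dX)  # every entry is 0.25
print(dW)  # rows: [1.0, 1.0] and [1.5, 1.5]
```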

In [None]:
# Test 3: Sigmoid activation
print("\nTest 3: Sigmoid Activation")
print("=" * 50)

x = Tensor([0., 1., -1., 2.], requires_grad=True)
y = x.sigmoid()

print(f"x = {x.data}")
print(f"sigmoid(x) = {y.data}")

# Compute gradient
loss = y.sum()
loss.backward()

print(f"\nGradient of x (autograd): {x.grad}")
print(f"Expected, sigmoid(x) * (1 - sigmoid(x)): {y.data * (1 - y.data)}")
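
An independent way to check the sigmoid gradient is a central finite difference, which needs no autograd at all. A small sketch using only the standard library; the helper names `sigmoid` and `numeric_grad` and the step size `eps` are choices made for this example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numeric_grad(f, x, eps=1e-5):
    """Central difference approximation of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in [0.0, 1.0, -1.0, 2.0]:
    analytic = sigmoid(x) * (1 - sigmoid(x))
    numeric = numeric_grad(sigmoid, x)
    print(f"x={x:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The two columns should agree to several decimal places. A larger mismatch usually points to an error in the backward formula.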

## Part 3: Build a Neural Network Class

Now let's create a simple neural network using our Tensor class.

In [None]:
class NeuralNetwork:
    """
    A simple feedforward neural network.
    Similar to PyTorch's nn.Module
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        """
        Initialize network parameters.
        
        Args:
            input_size: Number of input features
            hidden_size: Number of hidden neurons
            output_size: Number of output neurons
        """
        # Xavier-style initialization (fan-in variant) for better training
        # Scale by sqrt(1/input_size) to keep activation variance stable
        
        self.W1 = Tensor(
            np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size),
            requires_grad=True
        )
        self.b1 = Tensor(np.zeros((1, hidden_size)), requires_grad=True)
        
        self.W2 = Tensor(
            np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size),
            requires_grad=True
        )
        self.b2 = Tensor(np.zeros((1, output_size)), requires_grad=True)
        
        print(f"Network architecture: {input_size} -> {hidden_size} -> {output_size}")
        print(f"Total parameters: {self.count_parameters()}")
    
    def forward(self, X):
        """
        Forward pass through the network.
        
        Args:
            X: Input tensor
            
        Returns:
            Output tensor
        """
        # Hidden layer: h = sigmoid(X @ W1 + b1)
        self.z1 = X @ self.W1 + self.b1
        self.h = self.z1.sigmoid()
        
        # Output layer: y = sigmoid(h @ W2 + b2)
        self.z2 = self.h @ self.W2 + self.b2
        self.y = self.z2.sigmoid()
        
        return self.y
    
    def parameters(self):
        """Return list of all parameters (like PyTorch's model.parameters())"""
        return [self.W1, self.b1, self.W2, self.b2]
    
    def count_parameters(self):
        """Count total number of parameters"""
        return sum(p.data.size for p in self.parameters())
    
    def zero_grad(self):
        """Zero out all gradients (like PyTorch's optimizer.zero_grad())"""
        for p in self.parameters():
            p.zero_grad()


print("✓ Neural Network class created!")
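
In `forward`, the bias `b1` has shape `(1, hidden_size)` and is broadcast across the batch dimension when added to `X @ W1`. During backpropagation the bias gradient therefore has to be summed over the batch axis, which is what the broadcasting branch in `__add__` handles. A NumPy-only sketch of that reduction, with made-up sizes and an all-ones upstream gradient assumed for illustration:

```python
import numpy as np

batch_size, hidden_size = 4, 3            # illustrative sizes
z = np.zeros((batch_size, hidden_size))   # stands in for X @ W1
b = np.zeros((1, hidden_size))            # bias, broadcast over the batch

out = z + b                               # shape (4, 3)
upstream = np.ones_like(out)              # pretend dL/dout is all ones

# Each bias entry contributed to every row of `out`, so its gradient
# is the sum of the upstream gradient over the batch axis.
db = upstream.sum(axis=0, keepdims=True)

print(out.shape, db.shape)  # (4, 3) (1, 3)
print(db)                   # [[4. 4. 4.]]
```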

## Part 4: Create Dataset (XOR Problem)

XOR is a classic problem that is not linearly separable, so a network needs a non-linear hidden layer to solve it.
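
To make the need for non-linearity concrete, the sketch below searches a coarse grid of weights for a single linear threshold unit and reports the best XOR accuracy it finds. The grid range, step size, and the helper `linear_accuracy` are arbitrary choices for this illustration; a grid search is a demonstration, not a proof.

```python
from itertools import product

# XOR truth table: ((x1, x2), label)
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_accuracy(w1, w2, b):
    """Number of XOR points a single linear threshold unit gets right."""
    correct = 0
    for (x1, x2), label in xor_data:
        pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
        correct += (pred == label)
    return correct

# Coarse grid over weights and bias in [-2, 2] with step 0.25
grid = [i / 4 for i in range(-8, 9)]
best = max(linear_accuracy(w1, w2, b) for w1, w2, b in product(grid, repeat=3))

print(f"Best linear classifier on this grid: {best}/4 correct")
```

No setting on the grid classifies all four points, which matches the fact that no straight line separates the two XOR classes.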

In [None]:
# Create XOR dataset
X_data = np.array([
    [0., 0.],
    [0., 1.],
    [1., 0.],
    [1., 1.]
])

y_data = np.array([
    [0.],  # 0 XOR 0 = 0
    [1.],  # 0 XOR 1 = 1
    [1.],  # 1 XOR 0 = 1
    [0.]   # 1 XOR 1 = 0
])

# Convert to our Tensor class
X = Tensor(X_data)
y = Tensor(y_data)

print("XOR Dataset:")
print("=" * 50)
for i in range(len(X_data)):
    print(f"Input: {X_data[i]} -> Output: {y_data[i][0]}")

## Part 5: Training Loop

Now we'll train our network using gradient descent.

In [None]:
# Initialize network
np.random.seed(42)
model = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)

# Training hyperparameters
learning_rate = 1.0
n_epochs = 2000

# Track loss history
loss_history = []

print("\nTraining started...")
print("=" * 50)

for epoch in range(n_epochs):
    # Forward pass
    y_pred = model.forward(X)
    
    # Compute loss: MSE = mean((y_pred - y)^2)
    loss = ((y_pred - y) ** 2).mean()
    loss_history.append(loss.data.item())
    
    # Zero gradients from previous iteration
    model.zero_grad()
    
    # Backward pass (compute gradients)
    loss.backward()
    
    # Update parameters using gradient descent
    # θ_new = θ_old - learning_rate * gradient
    for param in model.parameters():
        param.data -= learning_rate * param.grad
    
    # Print progress
    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.data.item():.6f}")

print("\n✓ Training completed!")
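
The update rule in the loop is easiest to see on a one-parameter problem. A minimal sketch, assuming a toy objective $f(\theta) = (\theta - 3)^2$ that is not part of the notebook; the function names and hyperparameters are illustrative:

```python
def f(theta):
    return (theta - 3.0) ** 2

def grad_f(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
learning_rate = 0.1

for step in range(100):
    # theta_new = theta_old - learning_rate * gradient
    theta -= learning_rate * grad_f(theta)

print(f"theta = {theta:.6f}")  # approaches the minimizer at 3
```

Each step shrinks the distance to the minimum by a constant factor here. On the XOR network the loss surface is not a simple bowl, so progress is less regular.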

## Part 6: Evaluate Results

In [None]:
# Final predictions
y_pred_final = model.forward(X)

print("Final Results:")
print("=" * 50)
for i in range(len(X_data)):
    pred = y_pred_final.data[i, 0]
    true = y_data[i, 0]
    correct = "✓" if round(pred) == true else "✗"
    print(f"Input: {X_data[i]} | Predicted: {pred:.4f} | True: {true:.0f} | {correct}")

# Calculate accuracy
predictions_rounded = np.round(y_pred_final.data)
accuracy = (predictions_rounded == y_data).mean()
print(f"\nAccuracy: {accuracy * 100:.1f}%")
print(f"Final loss: {loss_history[-1]:.6f}")

## Part 7: Visualize Training

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Loss curve
axes[0].plot(loss_history, linewidth=2, color='#2E86AB')
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss (MSE)', fontsize=12)
axes[0].set_title('Training Loss Over Time', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].set_yscale('log')

# Plot 2: Decision boundary
# Create mesh grid
x_min, x_max = -0.5, 1.5
y_min, y_max = -0.5, 1.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

# Predict for each point in grid
grid_points = Tensor(np.c_[xx.ravel(), yy.ravel()])
grid_pred = model.forward(grid_points)
grid_pred = grid_pred.data.reshape(xx.shape)

# Plot decision boundary
contour = axes[1].contourf(xx, yy, grid_pred, levels=20, cmap='RdYlBu_r', alpha=0.8)
plt.colorbar(contour, ax=axes[1], label='Predicted Value')

# Plot training points
colors = ['#A23B72' if label == 0 else '#F18F01' for label in y_data[:, 0]]
axes[1].scatter(X_data[:, 0], X_data[:, 1], c=colors, s=200, 
                edgecolors='black', linewidth=2, zorder=10)
axes[1].set_xlabel('Input 1', fontsize=12)
axes[1].set_ylabel('Input 2', fontsize=12)
axes[1].set_title('Decision Boundary (XOR)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Visualizations complete!")

## Part 8: Compare with PyTorch

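Let's train an equivalent PyTorch model on the same data and compare the loss curves. The two models start from different random weights (NumPy and PyTorch use separate random number generators, and `nn.Linear` has its own default initialization), so expect similar convergence behavior rather than identical numbers.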

In [None]:
import torch
import torch.nn as nn

# Create PyTorch version
class PyTorchNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        x = self.sigmoid(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x

# Seed PyTorch's RNG for reproducibility (weights still differ from the NumPy model)
torch.manual_seed(42)
pytorch_model = PyTorchNet(2, 4, 1)

# Convert data to PyTorch tensors
X_torch = torch.FloatTensor(X_data)
y_torch = torch.FloatTensor(y_data)

# Training
optimizer = torch.optim.SGD(pytorch_model.parameters(), lr=1.0)
criterion = nn.MSELoss()

pytorch_loss_history = []

print("Training PyTorch model...")
print("=" * 50)

for epoch in range(2000):
    # Forward pass
    y_pred_torch = pytorch_model(X_torch)
    loss_torch = criterion(y_pred_torch, y_torch)
    pytorch_loss_history.append(loss_torch.item())
    
    # Backward pass
    optimizer.zero_grad()
    loss_torch.backward()
    optimizer.step()
    
    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch + 1}/2000, Loss: {loss_torch.item():.6f}")

print("\n✓ PyTorch training complete!")

In [None]:
# Compare results
print("\nComparison: Our Implementation vs PyTorch")
print("=" * 50)
print(f"Our final loss: {loss_history[-1]:.6f}")
print(f"PyTorch final loss: {pytorch_loss_history[-1]:.6f}")
print(f"Difference: {abs(loss_history[-1] - pytorch_loss_history[-1]):.6f}")

# Plot comparison
plt.figure(figsize=(10, 5))
plt.plot(loss_history, label='Our Implementation', linewidth=2, alpha=0.8)
plt.plot(pytorch_loss_history, label='PyTorch Autograd', 
         linewidth=2, linestyle='--', alpha=0.8)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Our Implementation vs PyTorch', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.tight_layout()
plt.show()

print("\n✓ Comparison complete! Both loss curves should follow a similar trend.")

## Key Takeaways

### What We Built
1. **Tensor class with autograd**: Automatically tracks operations and computes gradients
2. **Computational graph**: Built dynamically during forward pass
3. **Backpropagation**: Chain rule applied recursively through the graph
4. **Neural network**: Using our custom tensors with PyTorch-like API

### Core Concepts
1. **Forward Pass**: Data flows through operations, building a graph
2. **Backward Pass**: Gradients flow backwards using chain rule
3. **Gradient Accumulation**: Gradients add up for nodes used multiple times
4. **Parameter Updates**: Move opposite to gradient direction
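
When gradients are accumulated in place with NumPy's `+=`, it matters whether two nodes share the same array object. A small sketch of the pitfall and of the `.copy()` remedy, using standalone arrays rather than the `Tensor` class (variable names are illustrative):

```python
import numpy as np

upstream = np.ones(3)          # stands in for out.grad

# Pitfall: storing the upstream gradient without copying
grad_a = upstream              # same array object, not a copy
grad_a += 2.0                  # in-place accumulation
print(upstream)                # [3. 3. 3.]  (upstream was modified too)

# Remedy: copy on first assignment, then accumulate in place
upstream = np.ones(3)
grad_a = upstream.copy()
grad_a += 2.0
print(upstream)                # [1. 1. 1.]
print(grad_a)                  # [3. 3. 3.]
```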

### API Similarities with PyTorch
- `Tensor(data, requires_grad=True)` ↔ `torch.tensor(data, requires_grad=True)`
- `.backward()` ↔ `.backward()`
- `.zero_grad()` ↔ `.zero_grad()`
- Operators: `@`, `+`, `*` use the same syntax (broadcasting support here is more limited than PyTorch's)
- Activations: `.sigmoid()`, `.relu()`

### Why This Matters
- **PyTorch does this automatically**: Our manual implementation shows what's happening
- **Debugging intuition**: Understanding autograd helps debug gradient issues
- **Foundation for deep learning**: All modern frameworks use similar principles
- **Extensibility**: You can create custom operations by defining forward + backward
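
As an illustration of the forward-plus-backward pattern, here is a sketch of how a `tanh` operation could be specified. `tanh` is not part of the `Tensor` class in this notebook, and the function names `tanh_forward` and `tanh_backward` are hypothetical; the sketch works on plain floats and checks the backward rule against a finite difference.

```python
import math

def tanh_forward(x):
    """Forward: out = tanh(x)."""
    return math.tanh(x)

def tanh_backward(out, upstream_grad):
    """Backward: dL/dx = dL/dout * (1 - tanh(x)^2), reusing the forward output."""
    return upstream_grad * (1.0 - out ** 2)

x = 0.5
out = tanh_forward(x)
analytic = tanh_backward(out, upstream_grad=1.0)

# Sanity check against a central finite difference
eps = 1e-5
numeric = (tanh_forward(x + eps) - tanh_forward(x - eps)) / (2 * eps)

print(f"analytic={analytic:.6f}, numeric={numeric:.6f}")
```

Wrapping these two functions in a `Tensor` method would follow the same shape as `sigmoid()`: compute the output, then attach a `_backward` closure that accumulates into `self.grad`.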

This is the foundation of deep learning - just scaled to larger networks! 🎓