# Backpropagation Neural Network from Scratch

This notebook implements a simple neural network with manual backpropagation to understand the core concepts without unnecessary complexity.

**Key Concepts:**
1. Forward pass: Computing predictions
2. Loss calculation: Measuring prediction error
3. Backward pass: Computing gradients using chain rule
4. Weight updates: Using gradient descent

We'll use PyTorch tensors but implement backprop manually to see what's happening.

In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("PyTorch version:", torch.__version__)

## STEP 1: Create a Simple Dataset

We'll use a simple classification problem: **XOR gate**

XOR is not linearly separable: no single straight line can split the inputs that map to 1 from those that map to 0, so the network needs a hidden layer.

- **Input:** Two binary values
- **Output:** 1 if inputs differ, 0 if they're the same
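
To make the "not linearly separable" claim concrete, here is a minimal sketch that fits the best possible linear model (weights plus a bias, no hidden layer) to XOR by least squares. The names `A`, `targets`, `w`, and `linear_pred` are illustrative and are not used elsewhere in the notebook.

```python
import torch

# XOR inputs with an extra column of ones so the linear model has a bias term
A = torch.tensor([
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
targets = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Best linear fit in the least-squares sense: minimize ||A @ w - targets||^2
w = torch.linalg.lstsq(A, targets).solution

linear_pred = A @ w
print("Weights and bias:", w.squeeze())
print("Linear predictions:", linear_pred.squeeze())
```

The best linear fit predicts 0.5 for every input, which is no better than guessing. This is why the network needs a hidden layer with a nonlinearity.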

In [None]:
# XOR dataset
X = torch.tensor([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [1.0, 1.0]
], dtype=torch.float32)

y = torch.tensor([
    [0.0],  # 0 XOR 0 = 0
    [1.0],  # 0 XOR 1 = 1
    [1.0],  # 1 XOR 0 = 1
    [0.0]   # 1 XOR 1 = 0
], dtype=torch.float32)

print("Input shape:", X.shape)  # (4, 2) - 4 samples, 2 features
print("Output shape:", y.shape)  # (4, 1) - 4 samples, 1 output
print("\nDataset:")
for i in range(len(X)):
    print(f"Input: {X[i].numpy()} -> Output: {y[i].item()}")

## STEP 2: Initialize Network Parameters

**Network architecture:**
- Input layer: 2 neurons (for 2 input features)
- Hidden layer: 3 neurons (arbitrary choice, enough for XOR)
- Output layer: 1 neuron (for binary classification)

**Parameters:**
- `W1`: Weights from input to hidden (2 x 3)
- `b1`: Biases for hidden layer (3)
- `W2`: Weights from hidden to output (3 x 1)
- `b2`: Bias for output layer (1)

In [None]:
# Network dimensions
input_size = 2
hidden_size = 3
output_size = 1

# Initialize weights with small random values (standard normal scaled by 0.5)
# Note: this is a simple fixed-scale heuristic, not Xavier initialization
W1 = torch.randn(input_size, hidden_size) * 0.5
b1 = torch.zeros(hidden_size)

W2 = torch.randn(hidden_size, output_size) * 0.5
b2 = torch.zeros(output_size)

print("W1 shape (input to hidden):", W1.shape)
print("b1 shape:", b1.shape)
print("W2 shape (hidden to output):", W2.shape)
print("b2 shape:", b2.shape)
print(f"\nTotal parameters: {W1.numel() + b1.numel() + W2.numel() + b2.numel()}")
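
For comparison, Xavier (Glorot) initialization picks the weight scale from the layer sizes instead of using a fixed constant. The sketch below assumes the normal-distribution variant with gain 1, where std = sqrt(2 / (fan_in + fan_out)). The helper `xavier_normal` is hypothetical and written only for this illustration.

```python
import math
import torch

torch.manual_seed(0)

def xavier_normal(fan_in, fan_out):
    """Hypothetical helper: Glorot-normal weights of shape (fan_in, fan_out)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return torch.randn(fan_in, fan_out) * std, std

W1_xavier, std1 = xavier_normal(2, 3)   # input -> hidden
W2_xavier, std2 = xavier_normal(3, 1)   # hidden -> output

print(f"std for W1: {std1:.4f}, std for W2: {std2:.4f}")
```

For a network this small, either scale is likely to train, so the choice matters more as layers get wider and deeper.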

## STEP 3: Define Activation Functions

**Sigmoid:** σ(x) = 1 / (1 + e^(-x))
- Squashes values to range (0, 1)
- Used in hidden and output layers for this binary classification

**Derivative:** σ'(x) = σ(x) * (1 - σ(x))
- Needed for backpropagation
- Computed from the sigmoid output (saves computation)
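
One way to gain confidence in the derivative formula is to compare it against a numerical estimate. The sketch below uses a central finite difference in float64. The names `sigmoid_fn`, `eps`, `numeric`, and `analytic` are illustrative only.

```python
import torch

def sigmoid_fn(x):
    """Same formula as the notebook's sigmoid, redefined so this cell runs alone."""
    return 1 / (1 + torch.exp(-x))

# float64 keeps the finite-difference estimate accurate
x = torch.linspace(-4.0, 4.0, 9, dtype=torch.float64)
eps = 1e-6

# Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
numeric = (sigmoid_fn(x + eps) - sigmoid_fn(x - eps)) / (2 * eps)

# Closed form computed from the sigmoid output
s = sigmoid_fn(x)
analytic = s * (1 - s)

max_error = (numeric - analytic).abs().max().item()
print(f"Max abs difference: {max_error:.2e}")
```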

In [None]:
def sigmoid(x):
    """Apply sigmoid activation function"""
    return 1 / (1 + torch.exp(-x))

def sigmoid_derivative(sigmoid_output):
    """
    Compute derivative of sigmoid from its output
    This is more efficient than recomputing sigmoid
    """
    return sigmoid_output * (1 - sigmoid_output)

# Test the functions
test_input = torch.tensor([-2.0, 0.0, 2.0])
test_output = sigmoid(test_input)
print("Sigmoid test:")
print(f"Input: {test_input.numpy()}")
print(f"Output: {test_output.numpy()}")
print(f"Derivative: {sigmoid_derivative(test_output).numpy()}")

## STEP 4: Forward Pass

Compute predictions by passing input through the network:

1. Hidden layer: **h = σ(X @ W1 + b1)**
2. Output layer: **y_pred = σ(h @ W2 + b2)**

Where:
- `@` is matrix multiplication
- `σ` is sigmoid activation
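
It can help to trace tensor shapes through the two layers. The sketch below uses random placeholder values (the `*_demo` names are illustrative, not the notebook's parameters) and only checks how the shapes combine.

```python
import torch

torch.manual_seed(0)

X_demo = torch.rand(4, 2)        # 4 samples, 2 features
W1_demo = torch.randn(2, 3)      # input -> hidden
b1_demo = torch.zeros(3)
W2_demo = torch.randn(3, 1)      # hidden -> output
b2_demo = torch.zeros(1)

# (4, 2) @ (2, 3) -> (4, 3); b1 has shape (3,) and broadcasts across the 4 rows
z1_demo = X_demo @ W1_demo + b1_demo
h_demo = torch.sigmoid(z1_demo)

# (4, 3) @ (3, 1) -> (4, 1); b2 has shape (1,) and broadcasts across the 4 rows
z2_demo = h_demo @ W2_demo + b2_demo
out_demo = torch.sigmoid(z2_demo)

print("hidden:", tuple(h_demo.shape), "output:", tuple(out_demo.shape))
```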

In [None]:
def forward_pass(X, W1, b1, W2, b2):
    """
    Forward propagation through the network
    
    Returns:
        y_pred: Final predictions
        h: Hidden layer activations (needed for backprop)
        z1: Pre-activation hidden values (needed for backprop)
        z2: Pre-activation output values (needed for backprop)
    """
    # Hidden layer
    z1 = X @ W1 + b1              # Linear combination
    h = sigmoid(z1)                # Apply activation
    
    # Output layer
    z2 = h @ W2 + b2              # Linear combination
    y_pred = sigmoid(z2)           # Apply activation
    
    return y_pred, h, z1, z2

# Test forward pass with initial weights
y_pred, h, z1, z2 = forward_pass(X, W1, b1, W2, b2)
print("Predictions with random initialization:")
print(y_pred.squeeze().detach().numpy())
print("\nTrue labels:")
print(y.squeeze().numpy())

## STEP 5: Loss Function

**Mean Squared Error (MSE):** Loss = mean((y_pred - y)²)

For binary classification, we could use Binary Cross-Entropy, but MSE is simpler to understand for learning backprop.

**Derivative:** dLoss/dy_pred = 2 * (y_pred - y) / n, where n is the number of predictions (4 for this dataset)
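
The MSE derivative can be checked against PyTorch's autograd on a few made-up predictions. The values in `y_hat` below are arbitrary illustrative numbers, not outputs of the network.

```python
import torch

y_true = torch.tensor([[0.0], [1.0], [1.0], [0.0]])
# Arbitrary example predictions, marked as a leaf so autograd stores .grad
y_hat = torch.tensor([[0.2], [0.7], [0.6], [0.4]], requires_grad=True)

# Autograd gradient of the mean squared error
mse = torch.mean((y_hat - y_true) ** 2)
mse.backward()

# Hand-derived gradient: 2 * (y_pred - y) / n
n = y_hat.numel()
manual_grad = 2 * (y_hat.detach() - y_true) / n

print("autograd:", y_hat.grad.squeeze())
print("manual:  ", manual_grad.squeeze())
```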

In [None]:
def compute_loss(y_pred, y):
    """Calculate Mean Squared Error loss"""
    return torch.mean((y_pred - y) ** 2)

# Test loss calculation
initial_loss = compute_loss(y_pred, y)
print(f"Initial loss: {initial_loss.item():.4f}")

## STEP 6: Backward Pass (Backpropagation)

**This is the heart of neural network training!**

We compute gradients using the chain rule, working backwards from loss to inputs.

**Chain rule reminder:** If z = f(g(x)), then dz/dx = (dz/dg) * (dg/dx)

**Gradients we need to compute:**
1. dLoss/dW2, dLoss/db2 (output layer weights and bias)
2. dLoss/dW1, dLoss/db1 (hidden layer weights and bias)

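**Working backwards, for each layer:**
- Start with the gradient of the loss with respect to the layer's output
- Multiply by the activation derivative to get the gradient with respect to the pre-activation
- Multiply by the layer's input (the previous layer's output) to get the weight gradient; sum over samples to get the bias gradient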
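
Written out for this two-layer network with MSE loss, the chain rule gives the expressions below. The symbols are notation introduced here for compactness and do not appear in the notebook's code: $\hat{y}$ stands for `y_pred`, $\delta_2$ and $\delta_1$ are the gradients with respect to `z2` and `z1`, $\odot$ is element-wise multiplication, and the sums run over the $n$ samples.

$$
\begin{aligned}
\delta_2 &= \frac{\partial L}{\partial z_2} = \frac{2}{n}\,(\hat{y} - y) \odot \hat{y} \odot (1 - \hat{y}) \\
\frac{\partial L}{\partial W_2} &= h^\top \delta_2, \qquad \frac{\partial L}{\partial b_2} = \sum_{i=1}^{n} \delta_2^{(i)} \\
\delta_1 &= \frac{\partial L}{\partial z_1} = \left(\delta_2\, W_2^\top\right) \odot h \odot (1 - h) \\
\frac{\partial L}{\partial W_1} &= X^\top \delta_1, \qquad \frac{\partial L}{\partial b_1} = \sum_{i=1}^{n} \delta_1^{(i)}
\end{aligned}
$$

Each $\delta$ is reused for both the weight and the bias gradient of its layer, and $\delta_2$ is reused again to compute $\delta_1$. That reuse is what makes backpropagation efficient.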

In [None]:
def backward_pass(X, y, y_pred, h, z1, z2, W2):
    """
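    Compute all gradients using backpropagation

    z1 and z2 are accepted for completeness but not used here: the sigmoid
    derivatives are computed from the activations h and y_pred instead.

    Returns gradients for W1, b1, W2, b2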
    """
    n_samples = X.shape[0]
    
    # === OUTPUT LAYER GRADIENTS ===
    
    # 1. Gradient of loss with respect to predictions
    # dLoss/dy_pred = 2 * (y_pred - y) / n
    dLoss_dy_pred = 2 * (y_pred - y) / n_samples
    
    # 2. Gradient through sigmoid activation at output
    # dLoss/dz2 = dLoss/dy_pred * dy_pred/dz2
    # dy_pred/dz2 = sigmoid'(z2) = y_pred * (1 - y_pred)
    dLoss_dz2 = dLoss_dy_pred * sigmoid_derivative(y_pred)
    
    # 3. Gradient with respect to W2
    # z2 = h @ W2 + b2, so dz2/dW2 = h
    # dLoss/dW2 = h^T @ dLoss_dz2
    dLoss_dW2 = h.T @ dLoss_dz2
    
    # 4. Gradient with respect to b2
    # dLoss/db2 = sum of dLoss_dz2 across samples
    dLoss_db2 = torch.sum(dLoss_dz2, dim=0)
    
    # === HIDDEN LAYER GRADIENTS ===
    
    # 5. Gradient flowing back to hidden layer
    # dLoss/dh = dLoss/dz2 @ W2^T (chain rule)
    dLoss_dh = dLoss_dz2 @ W2.T
    
    # 6. Gradient through sigmoid activation at hidden layer
    # dLoss/dz1 = dLoss/dh * dh/dz1
    # dh/dz1 = sigmoid'(z1) = h * (1 - h)
    dLoss_dz1 = dLoss_dh * sigmoid_derivative(h)
    
    # 7. Gradient with respect to W1
    # z1 = X @ W1 + b1, so dz1/dW1 = X
    # dLoss/dW1 = X^T @ dLoss_dz1
    dLoss_dW1 = X.T @ dLoss_dz1
    
    # 8. Gradient with respect to b1
    # dLoss/db1 = sum of dLoss_dz1 across samples
    dLoss_db1 = torch.sum(dLoss_dz1, dim=0)
    
    return dLoss_dW1, dLoss_db1, dLoss_dW2, dLoss_db2

# Test backward pass
dW1, db1, dW2, db2 = backward_pass(X, y, y_pred, h, z1, z2, W2)
print("Gradient shapes:")
print(f"dW1: {dW1.shape}, db1: {db1.shape}")
print(f"dW2: {dW2.shape}, db2: {db2.shape}")
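
Matching shapes do not prove the gradients are correct. A stricter check is to compare hand-derived gradients against autograd on the same parameters. The sketch below repeats the gradient formulas in compact form so it runs on its own. All `*_chk` names are illustrative and separate from the notebook's variables.

```python
import torch

torch.manual_seed(0)

X_chk = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_chk = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Leaf tensors, so autograd stores .grad on them
W1_chk = (torch.randn(2, 3) * 0.5).requires_grad_()
b1_chk = torch.zeros(3, requires_grad=True)
W2_chk = (torch.randn(3, 1) * 0.5).requires_grad_()
b2_chk = torch.zeros(1, requires_grad=True)

# Forward pass and autograd gradients
h_chk = torch.sigmoid(X_chk @ W1_chk + b1_chk)
p_chk = torch.sigmoid(h_chk @ W2_chk + b2_chk)
loss_chk = torch.mean((p_chk - y_chk) ** 2)
loss_chk.backward()

# Hand-derived gradients (same formulas as the backward pass)
with torch.no_grad():
    d_z2 = 2 * (p_chk - y_chk) / X_chk.shape[0] * p_chk * (1 - p_chk)
    d_z1 = (d_z2 @ W2_chk.T) * h_chk * (1 - h_chk)
    manual = {
        "W1": X_chk.T @ d_z1,
        "b1": d_z1.sum(dim=0),
        "W2": h_chk.T @ d_z2,
        "b2": d_z2.sum(dim=0),
    }

autograd = {"W1": W1_chk.grad, "b1": b1_chk.grad,
            "W2": W2_chk.grad, "b2": b2_chk.grad}

# Largest absolute disagreement per parameter
max_diffs = {name: (manual[name] - autograd[name]).abs().max().item()
             for name in manual}
for name, diff in max_diffs.items():
    print(f"{name}: max abs difference = {diff:.2e}")
```

Differences at the level of float32 rounding indicate the manual derivation agrees with autograd.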

## STEP 7: Gradient Descent Update

**Update rule:** θ_new = θ_old - learning_rate * gradient

Learning rate controls step size:
- **Too large:** May overshoot and diverge
- **Too small:** Slow convergence
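
The effect of the learning rate is easy to see on a one-dimensional toy problem. The sketch below minimizes f(x) = x², whose gradient is 2x, using plain Python. The function `run_gradient_descent` and the three learning rates are illustrative choices, not values from the notebook.

```python
def run_gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 (gradient 2*x) starting from x0; return the final x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x   # gradient descent update
    return x

results = {lr: run_gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}

for lr, x_final in results.items():
    print(f"lr={lr}: |x| after 50 steps = {abs(x_final):.3e}")
```

With `lr=0.001` the iterate barely moves from 1.0. With `lr=0.1` it shrinks close to the minimum at 0. With `lr=1.1` every step overshoots and the magnitude grows.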

In [None]:
def update_parameters(W1, b1, W2, b2, dW1, db1, dW2, db2, learning_rate):
    """
    Update all parameters using gradient descent
    """
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    
    return W1, b1, W2, b2

# Choose a learning rate (the actual updates happen in the training loop)
learning_rate = 0.5
print(f"Learning rate: {learning_rate}")
print("Each update moves the parameters in the direction opposite to their gradients")

## STEP 8: Training Loop

Combine all steps to train the network:
1. Forward pass (compute predictions)
2. Compute loss
3. Backward pass (compute gradients)
4. Update parameters

Repeat for many epochs until loss converges.

In [None]:
# Re-initialize parameters for clean training
torch.manual_seed(42)
W1 = torch.randn(input_size, hidden_size) * 0.5
b1 = torch.zeros(hidden_size)
W2 = torch.randn(hidden_size, output_size) * 0.5
b2 = torch.zeros(output_size)

# Training hyperparameters
learning_rate = 1.0  # Higher learning rate works for this simple problem
n_epochs = 5000

# Track loss history for visualization
loss_history = []

print("Training started...\n")

for epoch in range(n_epochs):
    # Forward pass
    y_pred, h, z1, z2 = forward_pass(X, W1, b1, W2, b2)
    
    # Compute loss
    loss = compute_loss(y_pred, y)
    loss_history.append(loss.item())
    
    # Backward pass
    dW1, db1, dW2, db2 = backward_pass(X, y, y_pred, h, z1, z2, W2)
    
    # Update parameters
    W1, b1, W2, b2 = update_parameters(W1, b1, W2, b2, dW1, db1, dW2, db2, learning_rate)
    
    # Print progress every 500 epochs
    if (epoch + 1) % 500 == 0:
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item():.6f}")

print("\nTraining completed!")

## STEP 9: Evaluate Final Results

Check how well the network learned XOR logic.

In [None]:
# Final predictions
y_pred_final, _, _, _ = forward_pass(X, W1, b1, W2, b2)

print("Final Results:")
print("=" * 50)
for i in range(len(X)):
    pred = y_pred_final[i].item()
    true = y[i].item()
    print(f"Input: {X[i].numpy()} | Predicted: {pred:.4f} | True: {true:.0f} | {'✓' if round(pred) == true else '✗'}")

# Calculate accuracy
predictions_rounded = torch.round(y_pred_final)
accuracy = (predictions_rounded == y).float().mean().item()
print(f"\nAccuracy: {accuracy * 100:.1f}%")
print(f"Final loss: {loss_history[-1]:.6f}")

## STEP 10: Visualize Training Progress

In [None]:
plt.figure(figsize=(12, 5))

# Plot 1: Loss over time
plt.subplot(1, 2, 1)
plt.plot(loss_history, linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.yscale('log')  # Log scale to see convergence better

# Plot 2: Decision boundary visualization
plt.subplot(1, 2, 2)

# Create a mesh grid to visualize decision boundary
x_min, x_max = -0.5, 1.5
y_min, y_max = -0.5, 1.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

# Compute predictions for each point in the grid
grid_points = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
grid_pred, _, _, _ = forward_pass(grid_points, W1, b1, W2, b2)
grid_pred = grid_pred.detach().numpy().reshape(xx.shape)

# Plot decision boundary
plt.contourf(xx, yy, grid_pred, levels=20, cmap='RdYlBu_r', alpha=0.8)
plt.colorbar(label='Predicted Value')

# Plot training points
colors = ['blue' if label == 0 else 'red' for label in y.squeeze().numpy()]
plt.scatter(X[:, 0], X[:, 1], c=colors, s=200, edgecolors='black', linewidth=2)
plt.xlabel('Input 1', fontsize=12)
plt.ylabel('Input 2', fontsize=12)
plt.title('Decision Boundary (XOR)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## BONUS: Compare with PyTorch's Autograd

Let's verify our implementation matches PyTorch's automatic differentiation.
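
One detail matters when creating parameters for autograd by hand: PyTorch only stores `.grad` on leaf tensors. Multiplying a `requires_grad=True` tensor by a constant produces a new non-leaf tensor, so the scaling has to happen before gradients are enabled. A minimal sketch, with illustrative names `scaled_after` and `scaled_before`:

```python
import torch

torch.manual_seed(0)

# Not a leaf: the multiplication creates a new tensor with a grad_fn
scaled_after = torch.randn(2, 3, requires_grad=True) * 0.5

# A leaf: scale first, then enable gradient tracking on the result
scaled_before = (torch.randn(2, 3) * 0.5).requires_grad_()

print("scaled_after is leaf: ", scaled_after.is_leaf)
print("scaled_before is leaf:", scaled_before.is_leaf)

# Gradients are stored on the leaf after backward()
scaled_before.sum().backward()
print("scaled_before.grad:\n", scaled_before.grad)
```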

In [None]:
# Create the same network using PyTorch's autograd
# Scale first, then call requires_grad_(): multiplying a requires_grad tensor
# by 0.5 would produce a non-leaf tensor whose .grad is never populated
torch.manual_seed(42)
W1_auto = (torch.randn(input_size, hidden_size) * 0.5).requires_grad_()
b1_auto = torch.zeros(hidden_size, requires_grad=True)
W2_auto = (torch.randn(hidden_size, output_size) * 0.5).requires_grad_()
b2_auto = torch.zeros(output_size, requires_grad=True)

# Training hyperparameters
learning_rate = 1.0
n_epochs = 5000
loss_history_auto = []

print("Training with PyTorch autograd...\n")

for epoch in range(n_epochs):
    # Forward pass (PyTorch tracks operations automatically)
    z1 = X @ W1_auto + b1_auto
    h = torch.sigmoid(z1)
    z2 = h @ W2_auto + b2_auto
    y_pred = torch.sigmoid(z2)
    
    # Compute loss
    loss = torch.mean((y_pred - y) ** 2)
    loss_history_auto.append(loss.item())
    
    # Backward pass (PyTorch computes gradients automatically)
    loss.backward()
    
    # Update parameters
    with torch.no_grad():
        W1_auto -= learning_rate * W1_auto.grad
        b1_auto -= learning_rate * b1_auto.grad
        W2_auto -= learning_rate * W2_auto.grad
        b2_auto -= learning_rate * b2_auto.grad
        
        # Zero gradients for next iteration
        W1_auto.grad.zero_()
        b1_auto.grad.zero_()
        W2_auto.grad.zero_()
        b2_auto.grad.zero_()
    
    if (epoch + 1) % 500 == 0:
        print(f"Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item():.6f}")

print("\nComparing both implementations:")
print("=" * 50)

# Compare final losses
print(f"Manual backprop final loss: {loss_history[-1]:.6f}")
print(f"PyTorch autograd final loss: {loss_history_auto[-1]:.6f}")
print(f"Difference: {abs(loss_history[-1] - loss_history_auto[-1]):.8f}")

In [None]:
# Plot comparison
plt.figure(figsize=(10, 5))
plt.plot(loss_history, label='Manual Backprop', linewidth=2)
plt.plot(loss_history_auto, label='PyTorch Autograd', linewidth=2, linestyle='--')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Manual Backprop vs PyTorch Autograd', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.tight_layout()
plt.show()

## KEY TAKEAWAYS

### 1. Forward Pass
Data flows through layers, each applying linear transform + activation

### 2. Loss Function
Measures how far predictions are from true values

### 3. Backpropagation
Chain rule applied backwards to compute gradients
- Start at loss
- Flow backwards through each layer
- Multiply gradients at each step

### 4. Gradient Descent
Move parameters opposite to gradient direction
- Reduces loss over time
- Learning rate controls step size

### 5. Why It Works
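- The gradient points in the direction of steepest increase of the loss
- Moving in the opposite direction decreases the loss, provided the step is small enough
- Repeated small steps usually reach a low-loss solution, although this is not guaranteed for non-convex losses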

### 6. PyTorch's Role
- **Manual:** We computed every gradient explicitly
- **Autograd:** PyTorch tracks operations and computes gradients automatically
- **Both give the same results, up to floating-point rounding!**

---

**This is the foundation of deep learning - just scaled up to larger networks!**

✓ Notebook complete! You now understand backpropagation from first principles.