# Multi-Layer Perceptrons: Deep Learning Foundations

## 🎯 Introduction

Welcome to the building blocks of deep learning! This notebook takes you from simple linear transformations to multi-layer networks that can learn nonlinear patterns. MLPs are the foundation that most deep learning architectures build on, so understanding them well is crucial for mastering modern AI.

### 🧠 What You'll Master

This comprehensive guide covers:
- **From linear to nonlinear**: How activation functions enable complex learning
- **Universal approximation**: Why MLPs can approximate any continuous function
- **Deep architectures**: Building networks with multiple hidden layers
- **Training dynamics**: Understanding how MLPs learn through backpropagation
- **Modern patterns**: Batch normalization, dropout, and residual connections

### 🎓 Prerequisites

- Solid understanding of PyTorch tensors and basic operations
- Familiarity with linear algebra (matrix multiplication, vectors)
- Basic calculus concepts (derivatives, chain rule)
- Understanding of gradient descent optimization

### 🚀 Why MLPs Matter

MLPs are fundamental because they:
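- **Support universal approximation**: With enough hidden units, they can approximate any continuous function on a bounded input domain
- **Form building blocks**: Most modern architectures contain MLP components (for example, the feed-forward blocks in Transformers)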
- **Enable nonlinear learning**: Activation functions make complex patterns possible
- **Scale effectively**: From tiny networks to massive language models
- **Transfer broadly**: Same principles work across vision, NLP, and beyond

---

## 📚 Table of Contents

1. **[From Linear to Nonlinear](#from-linear-to-nonlinear)** - The power of activation functions
2. **[Deep Architecture Design](#deep-architecture-design)** - Building effective multi-layer networks
3. **[Universal Approximation Demo](#universal-approximation-demo)** - Seeing MLPs learn complex functions
4. **[Modern MLP Patterns](#modern-mlp-patterns)** - Batch norm, dropout, and residuals
5. **[Training Dynamics](#training-dynamics)** - Understanding how MLPs learn

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

# Set seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## From Linear to Nonlinear

### ⚡ The Activation Function Revolution

The difference between a linear transformation and a neural network is the activation function. Without activations, even the deepest network collapses into a single affine transformation (one weight matrix plus one bias). Let's see how nonlinearity removes that limit!
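
To make the collapse concrete, here is the two-layer case written out. The symbols $W_1, W_2$ (weights), $b_1, b_2$ (biases), and $\sigma$ (activation) are notation chosen here for illustration:

$$
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = W x + b
$$

With an activation in between, the network computes $W_2\,\sigma(W_1 x + b_1) + b_2$, and the two matrices can no longer be merged because $\sigma$ is applied elementwise between them. The same argument extends layer by layer to any depth.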

In [None]:
# =============================================================================
# THE POWER OF NONLINEARITY
# =============================================================================

print("⚡ From Linear to Nonlinear Transformation")
print("=" * 50)

# Demonstrate why activation functions are crucial
def demonstrate_linearity_problem():
    """Show why pure linear layers can't learn complex patterns."""
    
    print("🔍 The Linear Limitation")
    print("-" * 30)
    
    # Create a "deep" network with NO activation functions
    class PureLinearNetwork(nn.Module):
        def __init__(self, input_dim, hidden_dim, output_dim, num_layers=3):
            super().__init__()
            
            # Multiple linear layers - but NO activations!
            layers = []
            layers.append(nn.Linear(input_dim, hidden_dim))
            
            for _ in range(num_layers - 2):
                layers.append(nn.Linear(hidden_dim, hidden_dim))
            
            layers.append(nn.Linear(hidden_dim, output_dim))
            
            self.network = nn.Sequential(*layers)
        
        def forward(self, x):
            return self.network(x)
    
    # Create pure linear network
    linear_net = PureLinearNetwork(2, 64, 1, num_layers=5)
    
    print(f"Created 'deep' network with {len(list(linear_net.parameters()))} parameter tensors")
    print(f"Layers: 2 → 64 → 64 → 64 → 64 → 1")
    
    # Test with simple input
    x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
    output = linear_net(x)
    
    print(f"Input: {x}")
    print(f"Output: {output.detach()}")
    
    # The crucial insight: stacked linear layers collapse into ONE affine map!
    print(f"\n💡 Key Insight:")
    print(f"No matter how many linear layers you stack,")
    print(f"the result is mathematically equivalent to a SINGLE affine transformation!")
    print(f"W5 @ W4 @ W3 @ W2 @ W1 = W_combined (the biases fold into one b_combined)")
    
    # Prove this by folding every layer into one weight matrix and one bias
    with torch.no_grad():
        W_combined = torch.eye(2)    # start from the identity map on the 2-D input
        b_combined = torch.zeros(2)
        for layer in linear_net.network:
            # layer(W x + b) = (W_l @ W) x + (W_l @ b + b_l)
            W_combined = layer.weight @ W_combined
            b_combined = layer.weight @ b_combined + layer.bias
        
        # After all five layers, W_combined is [1, 2] and b_combined is [1]
        single_matrix_output = x @ W_combined.T + b_combined
        
        print(f"\nProof - Single affine map gives same result:")
        print(f"Deep network output: {output.detach()}")
        print(f"Single matrix output: {single_matrix_output}")
        print(f"Difference: {torch.abs(output - single_matrix_output).max().item():.10f}")

demonstrate_linearity_problem()

print(f"\n✨ Enter Activation Functions")
print("=" * 50)

# Compare different activation functions
class ActivationComparison(nn.Module):
    """Network to compare different activation functions."""
    
    def __init__(self, input_dim, hidden_dim, output_dim, activation='relu'):
        super().__init__()
        
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, hidden_dim)
        self.linear3 = nn.Linear(hidden_dim, output_dim)
        
        # Different activation functions
        if activation == 'relu':
            self.activation = nn.ReLU()
        elif activation == 'tanh':
            self.activation = nn.Tanh()
        elif activation == 'sigmoid':
            self.activation = nn.Sigmoid()
        elif activation == 'gelu':
            self.activation = nn.GELU()
        else:
            raise ValueError(f"Unknown activation: {activation}")
            
        self.activation_name = activation
    
    def forward(self, x):
        # Now we have: Linear → Activation → Linear → Activation → Linear
        x = self.linear1(x)
        x = self.activation(x)  # This breaks the linearity!
        x = self.linear2(x)
        x = self.activation(x)  # Each activation enables new expressiveness
        x = self.linear3(x)
        return x

# Create networks with different activations
activations = ['relu', 'tanh', 'sigmoid', 'gelu']
networks = {}

for act in activations:
    networks[act] = ActivationComparison(2, 32, 1, activation=act)

print("Created networks with different activations:")
for name, net in networks.items():
    param_count = sum(p.numel() for p in net.parameters())
    print(f"  {name.upper():8}: {param_count:,} parameters")

# Test how different activations handle the same input
test_input = torch.tensor([[-2.0, -1.0], [0.0, 0.0], [1.0, 2.0]])
print(f"\nTesting with input: {test_input}")

print(f"\nActivation function responses:")
for name, net in networks.items():
    with torch.no_grad():
        output = net(test_input)
        print(f"{name.upper():8}: {output.squeeze().tolist()}")

print(f"\n🎯 Activation Function Properties")
print("=" * 50)

# Visualize activation function shapes
x_range = torch.linspace(-3, 3, 100)

activations_funcs = {
    'ReLU': nn.ReLU(),
    'Tanh': nn.Tanh(), 
    'Sigmoid': nn.Sigmoid(),
    'GELU': nn.GELU()
}

print("Activation function characteristics (range and zero-centering measured on x in [-3, 3]):")
print("Function | Range       | Zero-Centered | Differentiable | Modern Use")
print("---------|-------------|---------------|----------------|------------")

for name, func in activations_funcs.items():
    with torch.no_grad():
        y = func(x_range)
        range_str = f"({y.min().item():.1f}, {y.max().item():.1f})"
        zero_centered = "Yes" if y.mean().abs().item() < 0.1 else "No"
        
        if name == 'ReLU':
            diff = "At x≠0"
            use = "CNNs, MLPs"
        elif name == 'Tanh':
            diff = "Everywhere"
            use = "RNNs, Classic"
        elif name == 'Sigmoid':
            diff = "Everywhere" 
            use = "Output layer"
        else:  # GELU
            diff = "Everywhere"
            use = "Transformers"
            
        print(f"{name:8} | {range_str:11} | {zero_centered:13} | {diff:14} | {use}")

print(f"\n💡 Why Activations Enable Universal Approximation")
print("=" * 50)
print("1. **Break linearity**: Prevent network collapse to single matrix")
print("2. **Create boundaries**: ReLU creates piecewise linear regions") 
print("3. **Enable compositions**: Stack simple nonlinear transforms")
print("4. **Approximate continuous functions**: Universal approximation theorem (given enough hidden units)")
print("5. **Learn complex patterns**: XOR, circles, arbitrary decision boundaries")
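
The XOR claim above can be checked by hand. The sketch below uses hand-picked weights (illustrative values, not learned and not from this notebook) for a 2 → 2 → 1 network, and compares the output with and without the ReLU between the layers. The names `tiny_net`, `W1`, `b1`, and `w2` are assumptions introduced for this example.

```python
import torch

# Hand-picked weights (illustrative, not learned):
#   h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), y = h1 - 2 * h2
W1 = torch.tensor([[1.0, 1.0], [1.0, 1.0]])
b1 = torch.tensor([0.0, -1.0])
w2 = torch.tensor([1.0, -2.0])

def tiny_net(x, use_relu=True):
    """Two-layer network; the ReLU can be switched off to show the linear case."""
    hidden = x @ W1.T + b1
    if use_relu:
        hidden = torch.relu(hidden)
    return hidden @ w2

xor_inputs = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
with_relu = tiny_net(xor_inputs, use_relu=True)
without_relu = tiny_net(xor_inputs, use_relu=False)

print("With ReLU:   ", with_relu.tolist())     # [0.0, 1.0, 1.0, 0.0] -> XOR
print("Without ReLU:", without_relu.tolist())  # [2.0, 1.0, 1.0, 0.0] -> just 2 - (x1 + x2)
```

With the ReLU, the second hidden unit stays at zero until both inputs are on, which is what bends the output back down at (1, 1). Without it, the same weights reduce to the linear function 2 - (x1 + x2), and no choice of weights can produce XOR.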

## Training MLP on Synthetic Data

In [None]:
# Generate synthetic classification data
def generate_synthetic_data(n_samples=1000, n_features=20, n_classes=3):
    """Generate synthetic classification data"""
    torch.manual_seed(42)
    
    # Generate features
    X = torch.randn(n_samples, n_features)
    
    # Create non-linear decision boundary
    # Use a simple polynomial to generate labels
    weights = torch.randn(n_features, 1)
    linear_combo = X @ weights
    
    # Add some non-linearity
    scores = linear_combo.squeeze() + 0.5 * (X[:, 0] * X[:, 1]) + 0.3 * (X[:, 2] ** 2)
    
    # Convert to class labels: split scores into n_classes equally sized bins
    boundaries = torch.quantile(scores, torch.arange(1, n_classes) / n_classes)
    y = torch.zeros(n_samples, dtype=torch.long)
    for class_idx, boundary in enumerate(boundaries, start=1):
        y[scores > boundary] = class_idx
    
    return X, y

# Generate data
X, y = generate_synthetic_data(n_samples=2000, n_features=20, n_classes=3)
print(f"Data shape: {X.shape}, Labels shape: {y.shape}")
print(f"Class distribution: {torch.bincount(y)}")

# Split into train/val
train_size = int(0.8 * len(X))
X_train, X_val = X[:train_size], X[train_size:]
y_train, y_val = y[:train_size], y[train_size:]

print(f"Train size: {len(X_train)}, Val size: {len(X_val)}")

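# SimpleMLP is not defined earlier in this notebook. The class below is an
# assumed minimal definition (Linear → ReLU hidden layers, linear output)
# that matches the constructor arguments used here.
class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=2):
        super().__init__()
        layers = [nn.Linear(input_size, hidden_size), nn.ReLU()]
        for _ in range(num_layers - 2):
            layers += [nn.Linear(hidden_size, hidden_size), nn.ReLU()]
        layers.append(nn.Linear(hidden_size, output_size))  # raw logits for CrossEntropyLoss
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Create model
model = SimpleMLP(input_size=20, hidden_size=128, output_size=3, num_layers=4)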
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training function
def train_epoch(model, X, y, optimizer, criterion, batch_size=64):
    model.train()
    total_loss = 0
    correct = 0
    
    # Simple batching (normally you'd use DataLoader)
    for i in range(0, len(X), batch_size):
        batch_X = X[i:i+batch_size]
        batch_y = y[i:i+batch_size]
        
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
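        # Weight each batch loss by its size so a smaller final batch counts correctly
        total_loss += loss.item() * len(batch_X)
        
        # Calculate accuracy
        predicted = outputs.argmax(dim=1)
        correct += (predicted == batch_y).sum().item()
    
    # Per-sample averages over the whole epoch
    return total_loss / len(X), correct / len(X)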

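def evaluate(model, X, y, criterion=criterion):
    # The default is bound when the function is defined, so reassigning the
    # global `criterion` later in the notebook does not change evaluation.
    model.eval()
    with torch.no_grad():
        outputs = model(X)
        loss = criterion(outputs, y)
        predicted = outputs.argmax(dim=1)
        accuracy = (predicted == y).sum().item() / len(y)
    return loss.item(), accuracy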

# Training loop
train_losses, train_accs = [], []
val_losses, val_accs = [], []

for epoch in range(50):
    # Train
    train_loss, train_acc = train_epoch(model, X_train, y_train, optimizer, criterion)
    
    # Evaluate
    val_loss, val_acc = evaluate(model, X_val, y_val)
    
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    val_losses.append(val_loss)
    val_accs.append(val_acc)
    
    if epoch % 10 == 0:
        print(f"Epoch {epoch:2d}: Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}, "
              f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

print(f"\nFinal validation accuracy: {val_accs[-1]:.4f}")
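
The training function above slices batches by hand and visits them in the same order every epoch. As a rough sketch of the `DataLoader` alternative mentioned in its comment, the example below wraps two tensors in a `TensorDataset` and lets the loader handle batching and shuffling. The tensors `features` and `labels` are random stand-ins introduced for this example, not the notebook's data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 100 samples, 20 features, 3 classes (illustrative only)
features = torch.randn(100, 20)
labels = torch.randint(0, 3, (100,))

# shuffle=True reorders the samples at the start of every epoch
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

batch_sizes = []
for batch_X, batch_y in loader:
    # A real training step (forward, loss, backward, optimizer.step) would go here
    batch_sizes.append(len(batch_y))

print(f"Number of batches: {len(loader)}")  # 4
print(f"Batch sizes: {batch_sizes}")        # [32, 32, 32, 4]
```

The last batch holds only the 4 leftover samples, which is why per-epoch averages should be computed per sample rather than per batch. Passing `drop_last=True` is one way to discard the short batch instead.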

In [None]:
# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Loss curves
ax1.plot(train_losses, label='Train Loss', color='blue')
ax1.plot(val_losses, label='Val Loss', color='red')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss')
ax1.legend()
ax1.grid(True)

# Accuracy curves
ax2.plot(train_accs, label='Train Acc', color='blue')
ax2.plot(val_accs, label='Val Acc', color='red')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training and Validation Accuracy')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()

print("Training completed successfully!")

## Common Gotchas

In [None]:
print("=== COMMON MLP GOTCHAS ===")

# 1. Forgetting to set model.train() / model.eval()
print("\n1. Train/Eval Mode:")
model_with_dropout = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(0.5),  # 50% dropout
    nn.Linear(64, 1)
)

x = torch.randn(1, 10)

model_with_dropout.train()
out_train = model_with_dropout(x)

model_with_dropout.eval()
out_eval = model_with_dropout(x)

print(f"Output in train mode: {out_train.item():.4f}")
print(f"Output in eval mode: {out_eval.item():.4f}")
print("Notice how outputs can be different due to dropout!")

# 2. Wrong loss function for the task
print("\n2. Loss Function Selection:")
# For classification: CrossEntropyLoss
# For regression: MSELoss or L1Loss
# For binary classification: BCEWithLogitsLoss

logits = torch.randn(3, 5)  # 3 samples, 5 classes
targets = torch.tensor([0, 2, 4])  # Class indices

ce_loss = nn.CrossEntropyLoss()
print(f"Correct (CrossEntropy): {ce_loss(logits, targets).item():.4f}")

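# Wrong: using MSE for classification
mse_loss = nn.MSELoss()
# mse_loss(logits, targets) fails here: shapes [3, 5] and [3] do not broadcast.
# Even with matching shapes, MSE would treat class indices as numeric values.
print("MSE on class indices raises a shape error here and is the wrong objective for classification!")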

# 3. Gradient explosion/vanishing
print("\n3. Gradient Issues:")

# Very deep network without proper initialization
deep_model = nn.Sequential()
for i in range(20):  # 20 layers!
    deep_model.add_module(f'linear_{i}', nn.Linear(64, 64))
    deep_model.add_module(f'sigmoid_{i}', nn.Sigmoid())  # Sigmoid can cause vanishing gradients
deep_model.add_module('final', nn.Linear(64, 1))

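# Check gradient norms (separate names so the training data X, y and criterion above are not overwritten)
grad_x = torch.randn(10, 64)
grad_y = torch.randn(10, 1)
mse_criterion = nn.MSELoss()

output = deep_model(grad_x)
loss = mse_criterion(output, grad_y)
loss.backward()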

# Check gradients in first vs last layer
first_layer_grad = deep_model[0].weight.grad
last_layer_grad = deep_model[-1].weight.grad

if first_layer_grad is not None and last_layer_grad is not None:
    print(f"First layer gradient norm: {first_layer_grad.norm().item():.8f}")
    print(f"Last layer gradient norm: {last_layer_grad.norm().item():.8f}")
    print("Notice how gradients can vanish in deep networks with sigmoid!")

# 4. Learning rate too high/low
print("\n4. Learning Rate Effects:")
print("Too high: Loss explodes or oscillates")
print("Too low: Very slow convergence")
print("Rule of thumb: Start with 1e-3 for Adam, 1e-2 for SGD")

# 5. Batch size effects
print("\n5. Batch Size Considerations:")
print("Small batches: Noisy gradients, may help with generalization")
print("Large batches: Smoother gradients and better hardware utilization, but may generalize worse")
print("Typical range: 16-256 for most problems")

print("\n=== END OF GOTCHAS ===")
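
The learning-rate gotcha above is only stated, not shown. A minimal sketch, assuming the simplest possible loss f(w) = w² in place of a real network, makes the three regimes visible. The helper `gradient_descent` is a name introduced for this example and uses plain gradient descent rather than Adam.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final |w|."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - lr * grad   # each step multiplies w by (1 - 2 * lr)
    return abs(w)

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr:<6} final |w| = {gradient_descent(lr):.4g}")
```

Each update scales `w` by `(1 - 2 * lr)`. With `lr=0.001` the factor is 0.998, so after 20 steps `w` has barely moved; with `lr=0.1` it shrinks quickly toward the minimum at 0; with `lr=1.1` the factor is -1.2, so `w` flips sign and grows every step. Real networks have far more complicated loss surfaces, but the same qualitative behavior shows up in their training curves.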