# Activation Functions Visualized

This notebook explores activation functions and their impact on neural network training. We'll visualize the functions, their gradients, and demonstrate key problems like vanishing gradients and dead ReLU.

**Goal:** Build intuition for why activation function choice matters.

**Prerequisites:** [activation-functions.md](../neural-networks/activation-functions.md), [backpropagation.md](../neural-networks/backpropagation.md)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

## 1. The Activation Functions

Let's implement and visualize the most common activation functions.

In [None]:
# Activation functions

def sigmoid(x):
    """Logistic sigmoid: squashes to (0, 1)"""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def tanh(x):
    """Hyperbolic tangent: squashes to (-1, 1)"""
    return np.tanh(x)

def relu(x):
    """Rectified Linear Unit: max(0, x)"""
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: allows small negative slope"""
    return np.where(x > 0, x, alpha * x)

def gelu(x):
    """Gaussian Error Linear Unit (tanh approximation): a smooth, ReLU-like curve"""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    """Swish/SiLU: x * sigmoid(x)"""
    return x * sigmoid(x)
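
The `gelu` above uses the common tanh approximation rather than the exact definition x * Phi(x), where Phi is the standard normal CDF. As a rough sanity check, the sketch below compares the two forms on the plotting range. The names `gelu_exact` and `gelu_tanh` are introduced here for illustration only, and the exact form relies on `math.erf` from the standard library.

```python
import math
import numpy as np

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    erf = np.vectorize(math.erf)  # math.erf is scalar-only, so vectorize it
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))

def gelu_tanh(x):
    """Tanh approximation, same formula as the notebook's gelu()."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

xs = np.linspace(-4, 4, 400)
max_abs_error = np.max(np.abs(gelu_exact(xs) - gelu_tanh(xs)))
print(f"Max |exact - tanh approx| on [-4, 4]: {max_abs_error:.2e}")
```

If the printed difference is small relative to the plotted scale, the two curves are visually indistinguishable and the approximation is adequate for this notebook.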

In [None]:
# Derivatives (for backpropagation)

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def relu_derivative(x):
    # np.asarray lets this accept plain Python scalars as well as arrays
    return (np.asarray(x) > 0).astype(float)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1, alpha)

def gelu_derivative(x):
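    # Approximates d/dx [x * Phi(x)] = Phi(x) + x * phi(x), using the tanh
    # approximation for the CDF Phi and the exact Gaussian PDF phi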
    cdf = 0.5 * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
    pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    return cdf + x * pdf

def swish_derivative(x):
    s = sigmoid(x)
    return s + x * s * (1 - s)
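
Hand-written derivatives are easy to get subtly wrong, so a quick finite-difference check can be worthwhile before using them in backpropagation. The sketch below is self-contained: it redefines sigmoid, tanh, and swish with the same formulas as above (minus the clipping) and compares each analytic derivative with a central difference. The helper `numerical_derivative`, the step size, and the evaluation grid are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def swish(x):
    return x * sigmoid(x)

def swish_derivative(x):
    s = sigmoid(x)
    return s + x * s * (1 - s)

def numerical_derivative(f, x, h=1e-5):
    """Central difference: (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

xs = np.linspace(-4, 4, 81)
checks = {
    'sigmoid': (sigmoid, sigmoid_derivative),
    'tanh': (np.tanh, tanh_derivative),
    'swish': (swish, swish_derivative),
}

# Largest disagreement between numerical and analytic derivative per function
max_errors = {}
for name, (f, df) in checks.items():
    max_errors[name] = np.max(np.abs(numerical_derivative(f, xs) - df(xs)))
    print(f"{name:8s} max |numerical - analytic| = {max_errors[name]:.2e}")
```

ReLU and Leaky ReLU are left out on purpose: at the kink x = 0 a central difference averages the two one-sided slopes, so it disagrees with whichever convention the analytic derivative picks.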

In [None]:
# Visualize all activation functions
x = np.linspace(-4, 4, 400)

activations = [
    ('Sigmoid', sigmoid, sigmoid_derivative, 'C0'),
    ('Tanh', tanh, tanh_derivative, 'C1'),
    ('ReLU', relu, relu_derivative, 'C2'),
    ('Leaky ReLU', leaky_relu, leaky_relu_derivative, 'C3'),
    ('GELU', gelu, gelu_derivative, 'C4'),
    ('Swish', swish, swish_derivative, 'C5'),
]

fig, axes = plt.subplots(2, 3, figsize=(14, 8))

for ax, (name, func, deriv, color) in zip(axes.flat, activations):
    ax.plot(x, func(x), color=color, linewidth=2, label=f'{name}')
    ax.plot(x, deriv(x), color=color, linewidth=2, linestyle='--', alpha=0.6, label=f'{name} derivative')
    ax.axhline(y=0, color='gray', linewidth=0.5)
    ax.axvline(x=0, color='gray', linewidth=0.5)
    ax.set_xlim(-4, 4)
    ax.set_ylim(-1.5, 2.5)
    ax.set_title(name, fontsize=12)
    ax.legend(loc='upper left', fontsize=9)
    ax.set_xlabel('x')
    ax.set_ylabel('y')

plt.suptitle('Activation Functions and Their Derivatives', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

## 2. The Vanishing Gradient Problem

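Sigmoid and tanh saturate: their derivatives approach zero for large |x|, and sigmoid's derivative never exceeds 0.25. During backpropagation the gradient is multiplied by one such derivative per layer, so in deep networks the product can shrink toward zero.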

In [None]:
# Demonstrate vanishing gradients through multiple layers

def gradient_through_layers(activation_deriv, n_layers, input_value):
    """
    Simulate gradient flow through n layers.
    
    Gradient = derivative(layer_n) * derivative(layer_n-1) * ... * derivative(layer_1)
    """
    gradient = 1.0
    gradients = [gradient]
    
    # Simplification: every layer sees the same pre-activation value,
    # and the weight matrices are ignored
    for _ in range(n_layers):
        gradient *= activation_deriv(input_value)
        gradients.append(gradient)
    
    return gradients

# Compare gradient flow for different activations
n_layers = 20

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: gradients at x=0 (best case for sigmoid/tanh).
# ReLU's derivative is 0 here, so its curve drops off the log axis after the first point.
ax = axes[0]
for name, _, deriv, color in activations:
    grads = gradient_through_layers(deriv, n_layers, 0)
    ax.plot(grads, marker='o', markersize=3, color=color, label=name)

ax.set_xlabel('Layer depth')
ax.set_ylabel('Gradient magnitude')
ax.set_title('Gradient Flow (x = 0, best case for sigmoid/tanh)')
ax.legend()
ax.set_yscale('log')
ax.set_ylim(1e-10, 10)

# Right: gradients at x=2 (common case)
ax = axes[1]
for name, _, deriv, color in activations:
    grads = gradient_through_layers(deriv, n_layers, 2)
    ax.plot(grads, marker='o', markersize=3, color=color, label=name)

ax.set_xlabel('Layer depth')
ax.set_ylabel('Gradient magnitude')
ax.set_title('Gradient Flow (x = 2, common case)')
ax.legend()
ax.set_yscale('log')
ax.set_ylim(1e-10, 10)

plt.tight_layout()
plt.show()

print("After 20 layers with x=2:")
print(f"  Sigmoid gradient: {gradient_through_layers(sigmoid_derivative, 20, 2)[-1]:.2e}")
print(f"  Tanh gradient:    {gradient_through_layers(tanh_derivative, 20, 2)[-1]:.2e}")
print(f"  ReLU gradient:    {gradient_through_layers(relu_derivative, 20, 2)[-1]:.2e}")

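**Key insight:** In this simplified model, sigmoid gradients shrink exponentially with depth at any input, because the derivative is at most 0.25. Tanh gradients do the same once inputs move away from zero. ReLU passes gradients through unchanged for positive inputs, which makes very deep networks easier to train.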
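
The simulation above multiplies scalar derivatives only. In a real network each backward step also multiplies by the layer's weight matrix, so the initialization scale affects the result as well. The sketch below is a rough illustration under stated assumptions: a 20-layer fully connected stack of width 64 with random Gaussian weights, no biases, no training, and an arbitrary all-ones gradient fed in at the top. The function `backprop_norms` and its parameters are hypothetical names, not part of the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def backprop_norms(act, act_deriv, weight_scale, n_layers=20, width=64, seed=0):
    """Return the gradient norm after each backward step, output side first."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((32, width))  # batch of 32 random inputs
    weights, pre_acts = [], []
    for _ in range(n_layers):  # forward pass, caching W and z for each layer
        W = rng.standard_normal((width, width)) * weight_scale
        z = h @ W
        weights.append(W)
        pre_acts.append(z)
        h = act(z)
    grad = np.ones_like(h) / len(h)  # placeholder upstream gradient dL/da
    norms = []
    for W, z in zip(reversed(weights), reversed(pre_acts)):
        grad = (grad * act_deriv(z)) @ W.T  # chain rule: activation, then W
        norms.append(np.linalg.norm(grad))
    return norms

width = 64
sigmoid_norms = backprop_norms(sigmoid, sigmoid_derivative, np.sqrt(1 / width))
relu_norms = backprop_norms(relu, relu_derivative, np.sqrt(2 / width))
print(f"Sigmoid, gradient norm after 20 backward steps: {sigmoid_norms[-1]:.2e}")
print(f"ReLU,    gradient norm after 20 backward steps: {relu_norms[-1]:.2e}")
```

The weight scales mirror the ones used later in the notebook: 2/fan_in for ReLU and 1/fan_in for sigmoid. With those choices the weight matrices roughly preserve the gradient norm for ReLU, while the sigmoid derivative still removes a large factor at every layer.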

## 3. The Dead ReLU Problem

ReLU neurons can "die" during training: if a neuron's pre-activation is negative for every input, its output and gradient are both zero, so its incoming weights never update and it cannot recover.
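
To make the mechanism concrete, the sketch below looks at a single hidden neuron whose bias has been pushed far into the negative range, so its pre-activation is negative for every sample. The data, weights, bias value, and the all-ones upstream gradient are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 4))   # 256 samples, 4 features
w = rng.standard_normal(4) * 0.1    # small incoming weights
b = -10.0                           # hypothetical bias after a bad update

z = X @ w + b                       # pre-activation: negative for every sample
upstream = np.ones_like(z)          # stand-in for dL/da from the next layer

# Gradient w.r.t. the incoming weights: X^T (upstream * activation'(z))
relu_grad = X.T @ (upstream * (z > 0))
leaky_grad = X.T @ (upstream * np.where(z > 0, 1.0, 0.01))

print("All pre-activations negative:", bool(np.all(z < 0)))
print("ReLU weight gradient:      ", relu_grad)
print("Leaky ReLU weight gradient:", leaky_grad)
```

With ReLU the weight gradient is exactly zero, so no gradient step can move the neuron back into the active range. Leaky ReLU scales the gradient by 0.01 instead of removing it, which leaves a path to recovery.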

In [None]:
class SimpleNetwork:
    """Simple network to demonstrate dead ReLU."""
    
    def __init__(self, input_dim, hidden_dim, activation='relu'):
        self.W1 = np.random.randn(input_dim, hidden_dim) * 0.5
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, 1) * 0.5
        self.b2 = np.zeros(1)
        
        if activation == 'relu':
            self.activation = relu
            self.activation_deriv = relu_derivative
        elif activation == 'leaky_relu':
            self.activation = leaky_relu
            self.activation_deriv = leaky_relu_derivative
        else:
            raise ValueError(f"Unknown activation: {activation}")
    
    def forward(self, X):
        self.X = X
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.activation(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        return self.z2
    
    def backward(self, y):
        batch_size = len(y)
        dz2 = (self.z2 - y) / batch_size
        
        dW2 = self.a1.T @ dz2
        db2 = dz2.sum(axis=0)
        
        da1 = dz2 @ self.W2.T
        dz1 = da1 * self.activation_deriv(self.z1)
        
        dW1 = self.X.T @ dz1
        db1 = dz1.sum(axis=0)
        
        return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
    
    def update(self, grads, lr):
        self.W1 -= lr * grads['W1']
        self.b1 -= lr * grads['b1']
        self.W2 -= lr * grads['W2']
        self.b2 -= lr * grads['b2']
    
    def count_dead_neurons(self, X):
        """Count neurons that output zero for all inputs."""
        self.forward(X)
        # A neuron is "dead" if it's zero for all samples
        dead = (self.a1.max(axis=0) == 0)
        return dead.sum()

In [None]:
# Create a simple regression problem
np.random.seed(42)
X = np.random.randn(1000, 10)
y = (X[:, 0] * 2 + X[:, 1] - X[:, 2] * 0.5 + np.random.randn(1000) * 0.1).reshape(-1, 1)

# Train with ReLU - track dead neurons
model_relu = SimpleNetwork(10, 100, activation='relu')
dead_counts_relu = []

# Train with Leaky ReLU for comparison
model_leaky = SimpleNetwork(10, 100, activation='leaky_relu')
# Copy initial weights
model_leaky.W1 = model_relu.W1.copy()
model_leaky.b1 = model_relu.b1.copy()
model_leaky.W2 = model_relu.W2.copy()
model_leaky.b2 = model_relu.b2.copy()
dead_counts_leaky = []

# Use high learning rate to induce dead neurons
lr = 0.5
epochs = 200

for epoch in range(epochs):
    # ReLU
    pred = model_relu.forward(X)
    grads = model_relu.backward(y)
    model_relu.update(grads, lr)
    dead_counts_relu.append(model_relu.count_dead_neurons(X))
    
    # Leaky ReLU
    pred = model_leaky.forward(X)
    grads = model_leaky.backward(y)
    model_leaky.update(grads, lr)
    dead_counts_leaky.append(model_leaky.count_dead_neurons(X))

# Plot dead neuron counts
plt.figure(figsize=(10, 5))
plt.plot(dead_counts_relu, label='ReLU', color='C2')
plt.plot(dead_counts_leaky, label='Leaky ReLU', color='C3')
plt.xlabel('Epoch')
plt.ylabel('Number of Dead Neurons (out of 100)')
plt.title('Dead ReLU Problem: Neurons That Never Activate')
plt.legend()
plt.grid(True)
plt.show()

print(f"Final dead neurons - ReLU: {dead_counts_relu[-1]}, Leaky ReLU: {dead_counts_leaky[-1]}")

**Key insight:** High learning rates or poor initialization can push ReLU neurons into the negative regime permanently. Leaky ReLU prevents this by allowing small gradients even for negative inputs.

## 4. Activation Statistics During Training

Let's watch how activations evolve during training and how different functions affect the distribution.

In [None]:
# Generate synthetic classification data
from sklearn.datasets import make_classification

X_class, y_class = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, n_classes=2, random_state=42
)
y_class = y_class.reshape(-1, 1).astype(float)

def train_and_record_activations(X, y, activation_name, epochs=100):
    """Train a network and record activation statistics."""
    
    activation_funcs = {
        'sigmoid': (sigmoid, sigmoid_derivative),
        'tanh': (tanh, tanh_derivative),
        'relu': (relu, relu_derivative),
        'gelu': (gelu, gelu_derivative),
    }
    
    act_func, act_deriv = activation_funcs[activation_name]
    
    # Simple 2-layer network
    np.random.seed(42)
    W1 = np.random.randn(20, 50) * np.sqrt(2/20)
    b1 = np.zeros(50)
    W2 = np.random.randn(50, 1) * np.sqrt(2/50)
    b2 = np.zeros(1)
    
    activation_means = []
    activation_stds = []
    losses = []
    
    lr = 0.1
    
    for epoch in range(epochs):
        # Forward
        z1 = X @ W1 + b1
        a1 = act_func(z1)
        z2 = a1 @ W2 + b2
        pred = sigmoid(z2)  # Output sigmoid for classification
        
        # Record activation stats
        activation_means.append(a1.mean())
        activation_stds.append(a1.std())
        
        # Loss (binary cross-entropy)
        loss = -np.mean(y * np.log(pred + 1e-10) + (1-y) * np.log(1-pred + 1e-10))
        losses.append(loss)
        
        # Backward
        dz2 = (pred - y) / len(y)
        dW2 = a1.T @ dz2
        db2 = dz2.sum(axis=0)
        
        da1 = dz2 @ W2.T
        dz1 = da1 * act_deriv(z1)
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0)
        
        # Update
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    
    return {
        'means': activation_means,
        'stds': activation_stds,
        'losses': losses
    }

In [None]:
# Train with different activations
results = {}
for name in ['sigmoid', 'tanh', 'relu', 'gelu']:
    results[name] = train_and_record_activations(X_class, y_class, name)

# Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

colors = {'sigmoid': 'C0', 'tanh': 'C1', 'relu': 'C2', 'gelu': 'C4'}

# Loss curves
ax = axes[0]
for name, res in results.items():
    ax.plot(res['losses'], label=name, color=colors[name])
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.set_title('Training Loss')
ax.legend()

# Activation means
ax = axes[1]
for name, res in results.items():
    ax.plot(res['means'], label=name, color=colors[name])
ax.set_xlabel('Epoch')
ax.set_ylabel('Mean Activation')
ax.set_title('Activation Mean Over Training')
ax.legend()

# Activation stds
ax = axes[2]
for name, res in results.items():
    ax.plot(res['stds'], label=name, color=colors[name])
ax.set_xlabel('Epoch')
ax.set_ylabel('Std of Activations')
ax.set_title('Activation Spread Over Training')
ax.legend()

plt.tight_layout()
plt.show()

## 5. Comparing Network Training

Let's train networks with the same architecture but different activations on MNIST and compare convergence.

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load MNIST (smaller subset for speed)
print("Loading MNIST...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X_mnist, y_mnist = mnist.data[:10000] / 255.0, mnist.target[:10000].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X_mnist, y_mnist, test_size=2000, random_state=42)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")

In [None]:
def softmax(z):
    exp_z = np.exp(z - z.max(axis=1, keepdims=True))
    return exp_z / exp_z.sum(axis=1, keepdims=True)

class MNISTNetwork:
    """Network for comparing activations on MNIST."""
    
    def __init__(self, activation_name='relu', hidden_dim=128):
        self.activation_name = activation_name
        
        activations = {
            'sigmoid': (sigmoid, sigmoid_derivative),
            'tanh': (tanh, tanh_derivative),
            'relu': (relu, relu_derivative),
            'leaky_relu': (leaky_relu, leaky_relu_derivative),
            'gelu': (gelu, gelu_derivative),
        }
        self.act, self.act_deriv = activations[activation_name]
        
        # He initialization (variance 2/fan_in) for ReLU-like activations;
        # Xavier-style fan-in scaling (variance 1/fan_in) for sigmoid/tanh
        if activation_name in ['relu', 'leaky_relu', 'gelu']:
            scale = np.sqrt(2 / 784)
        else:
            scale = np.sqrt(1 / 784)
        
        np.random.seed(42)
        self.W1 = np.random.randn(784, hidden_dim) * scale
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, 10) * np.sqrt(2 / hidden_dim)
        self.b2 = np.zeros(10)
    
    def forward(self, X):
        self.X = X
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.act(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.probs = softmax(self.z2)
        return self.probs
    
    def backward(self, y):
        batch_size = len(y)
        
        dz2 = self.probs.copy()
        dz2[np.arange(batch_size), y] -= 1
        dz2 /= batch_size
        
        dW2 = self.a1.T @ dz2
        db2 = dz2.sum(axis=0)
        
        da1 = dz2 @ self.W2.T
        dz1 = da1 * self.act_deriv(self.z1)
        
        dW1 = self.X.T @ dz1
        db1 = dz1.sum(axis=0)
        
        return {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
    
    def update(self, grads, lr):
        self.W1 -= lr * grads['W1']
        self.b1 -= lr * grads['b1']
        self.W2 -= lr * grads['W2']
        self.b2 -= lr * grads['b2']
    
    def accuracy(self, X, y):
        probs = self.forward(X)
        preds = np.argmax(probs, axis=1)
        return (preds == y).mean()

In [None]:
def train_mnist(activation_name, epochs=20, batch_size=64, lr=0.1):
    """Train on MNIST and return history."""
    model = MNISTNetwork(activation_name)
    
    history = {'train_acc': [], 'test_acc': []}
    n_batches = len(X_train) // batch_size
    
    for epoch in range(epochs):
        # Shuffle
        indices = np.random.permutation(len(X_train))
        X_shuffled = X_train[indices]
        y_shuffled = y_train[indices]
        
        for i in range(n_batches):
            start = i * batch_size
            end = start + batch_size
            
            model.forward(X_shuffled[start:end])
            grads = model.backward(y_shuffled[start:end])
            model.update(grads, lr)
        
        train_acc = model.accuracy(X_train, y_train)
        test_acc = model.accuracy(X_test, y_test)
        history['train_acc'].append(train_acc)
        history['test_acc'].append(test_acc)
    
    return history

# Train with each activation
print("Training networks with different activations...")
histories = {}
for name in ['sigmoid', 'tanh', 'relu', 'leaky_relu', 'gelu']:
    print(f"  Training with {name}...")
    histories[name] = train_mnist(name)
print("Done!")

In [None]:
# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

colors = {'sigmoid': 'C0', 'tanh': 'C1', 'relu': 'C2', 'leaky_relu': 'C3', 'gelu': 'C4'}

for name, hist in histories.items():
    ax1.plot(hist['train_acc'], label=name, color=colors[name])
    ax2.plot(hist['test_acc'], label=name, color=colors[name])

ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.set_title('Training Accuracy')
ax1.legend()
ax1.set_ylim(0.5, 1.0)

ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Test Accuracy')
ax2.legend()
ax2.set_ylim(0.5, 1.0)

plt.tight_layout()
plt.show()

print("\nFinal test accuracies:")
for name, hist in sorted(histories.items(), key=lambda x: -x[1]['test_acc'][-1]):
    print(f"  {name:12s}: {hist['test_acc'][-1]:.3f}")

## 6. Deeper Networks: Where Activation Choice Really Matters

With shallow networks, all activations work reasonably well. Let's see what happens with deeper networks.

In [None]:
class DeepNetwork:
    """Deeper network to show activation function impact."""
    
    def __init__(self, activation_name='relu', n_hidden_layers=5, hidden_dim=64):
        self.n_layers = n_hidden_layers + 1  # +1 for output
        
        activations = {
            'sigmoid': (sigmoid, sigmoid_derivative),
            'tanh': (tanh, tanh_derivative),
            'relu': (relu, relu_derivative),
        }
        self.act, self.act_deriv = activations[activation_name]
        
        # Initialize layers
        np.random.seed(42)
        self.weights = []
        self.biases = []
        
        # Input layer
        scale = np.sqrt(2 / 784) if activation_name == 'relu' else np.sqrt(1 / 784)
        self.weights.append(np.random.randn(784, hidden_dim) * scale)
        self.biases.append(np.zeros(hidden_dim))
        
        # Hidden layers
        scale = np.sqrt(2 / hidden_dim) if activation_name == 'relu' else np.sqrt(1 / hidden_dim)
        for _ in range(n_hidden_layers - 1):
            self.weights.append(np.random.randn(hidden_dim, hidden_dim) * scale)
            self.biases.append(np.zeros(hidden_dim))
        
        # Output layer
        self.weights.append(np.random.randn(hidden_dim, 10) * np.sqrt(2 / hidden_dim))
        self.biases.append(np.zeros(10))
    
    def forward(self, X):
        self.activations = [X]
        self.pre_activations = []
        
        current = X
        for i, (W, b) in enumerate(zip(self.weights[:-1], self.biases[:-1])):
            z = current @ W + b
            self.pre_activations.append(z)
            current = self.act(z)
            self.activations.append(current)
        
        # Output layer (no activation before softmax)
        z = current @ self.weights[-1] + self.biases[-1]
        self.pre_activations.append(z)
        self.probs = softmax(z)
        return self.probs
    
    def backward(self, y):
        batch_size = len(y)
        grads = {'W': [], 'b': []}
        
        # Output layer
        dz = self.probs.copy()
        dz[np.arange(batch_size), y] -= 1
        dz /= batch_size
        
        for i in range(len(self.weights) - 1, -1, -1):
            grads['W'].insert(0, self.activations[i].T @ dz)
            grads['b'].insert(0, dz.sum(axis=0))
            
            if i > 0:
                da = dz @ self.weights[i].T
                dz = da * self.act_deriv(self.pre_activations[i-1])
        
        return grads
    
    def update(self, grads, lr):
        for i in range(len(self.weights)):
            self.weights[i] -= lr * grads['W'][i]
            self.biases[i] -= lr * grads['b'][i]
    
    def accuracy(self, X, y):
        probs = self.forward(X)
        preds = np.argmax(probs, axis=1)
        return (preds == y).mean()
    
    def get_gradient_norms(self, X, y):
        """Return gradient norms for each layer."""
        self.forward(X)
        grads = self.backward(y)
        return [np.linalg.norm(g) for g in grads['W']]

In [None]:
# Compare gradient flow in deep networks
print("Training deep networks (5 hidden layers)...\n")

deep_histories = {}
gradient_histories = {}

for name in ['sigmoid', 'tanh', 'relu']:
    print(f"Training with {name}...")
    model = DeepNetwork(name, n_hidden_layers=5, hidden_dim=64)
    
    history = {'test_acc': []}
    grad_norms = []
    
    epochs = 20
    batch_size = 64
    lr = 0.1 if name == 'relu' else 0.5  # Higher LR for sigmoid/tanh to partly offset their smaller gradients
    n_batches = len(X_train) // batch_size
    
    for epoch in range(epochs):
        indices = np.random.permutation(len(X_train))
        X_shuffled = X_train[indices]
        y_shuffled = y_train[indices]
        
        epoch_grads = []
        for i in range(n_batches):
            start = i * batch_size
            end = start + batch_size
            
            batch_X = X_shuffled[start:end]
            batch_y = y_shuffled[start:end]
            
            model.forward(batch_X)
            grads = model.backward(batch_y)
            model.update(grads, lr)
            
            # Record per-layer gradient norms once per epoch (first batch, after the update)
            if i == 0:
                epoch_grads.append(model.get_gradient_norms(batch_X, batch_y))
        
        grad_norms.append(np.mean(epoch_grads, axis=0))
        test_acc = model.accuracy(X_test, y_test)
        history['test_acc'].append(test_acc)
    
    deep_histories[name] = history
    gradient_histories[name] = grad_norms
    print(f"  Final accuracy: {history['test_acc'][-1]:.3f}")

print("\nDone!")

In [None]:
# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy
ax = axes[0]
for name, hist in deep_histories.items():
    ax.plot(hist['test_acc'], label=name, color=colors[name], linewidth=2)
ax.set_xlabel('Epoch')
ax.set_ylabel('Test Accuracy')
ax.set_title('Deep Network (5 Hidden Layers) - Test Accuracy')
ax.legend()
ax.set_ylim(0, 1)

# Gradient norms per layer (final epoch)
ax = axes[1]
for name, grads in gradient_histories.items():
    final_grads = grads[-1]  # Last epoch
    layers = range(1, len(final_grads) + 1)
    ax.plot(layers, final_grads, marker='o', label=name, color=colors[name], linewidth=2)

ax.set_xlabel('Layer')
ax.set_ylabel('Gradient Norm')
ax.set_title('Gradient Magnitude by Layer (Final Epoch)')
ax.legend()
ax.set_yscale('log')

plt.tight_layout()
plt.show()

## 7. Summary

| Activation | Pros | Cons | Best For |
|------------|------|------|----------|
| **Sigmoid** | Output in (0,1), smooth | Vanishing gradients, slow | Output layers (binary) |
| **Tanh** | Zero-centered, stronger gradients | Still vanishes for large inputs | RNNs (historically) |
| **ReLU** | Fast, no vanishing gradient for x>0 | Dead neurons, not zero-centered | Hidden layers (default) |
| **Leaky ReLU** | Nonzero gradient for x<0, so neurons keep learning | Extra hyperparameter (alpha), slight computational overhead | When dead ReLU is a problem |
| **GELU** | Smooth, strong empirical results in transformers | More computation | Transformers |

**Key takeaways:**

1. **Vanishing gradients** made sigmoid/tanh hard to train as hidden activations in deep networks
2. **ReLU** made deep networks much easier to train (e.g., AlexNet in 2012)
3. **Dead ReLU** can be mitigated with Leaky ReLU, a lower learning rate, or careful initialization
4. **GELU** is widely used in transformers (e.g., BERT, GPT-2)
5. Proper **initialization** matters as much as activation choice
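
Takeaway 5 can be illustrated without any training. The sketch below pushes random data through a 20-layer ReLU stack and compares the spread of the final activations under He scaling (variance 2/fan_in) and the smaller 1/fan_in scaling that the notebook uses for sigmoid and tanh. The function `final_activation_std`, the depth, and the width are assumptions chosen for illustration.

```python
import numpy as np

def final_activation_std(weight_scale, n_layers=20, width=256, seed=0):
    """Std of the last layer's ReLU activations for a random, untrained stack."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((512, width))  # batch of 512 random inputs
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_scale
        h = np.maximum(0, h @ W)           # linear layer followed by ReLU
    return h.std()

width = 256
he_std = final_activation_std(np.sqrt(2 / width))      # He scaling
fan_in_std = final_activation_std(np.sqrt(1 / width))  # 1/fan_in scaling

print(f"He scaling (2/fan_in): final activation std = {he_std:.3e}")
print(f"1/fan_in scaling:      final activation std = {fan_in_std:.3e}")
```

ReLU zeroes roughly half of its inputs, so the 1/fan_in scaling loses about half of the signal's second moment at every layer, while the factor of 2 in He scaling compensates for it. The same mismatch in the backward pass shrinks gradients, which is one reason activation and initialization are usually chosen together.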

**Next:** [04-optimization-landscape.ipynb](04-optimization-landscape.ipynb) visualizes how optimizers navigate loss surfaces.

## 8. Exercises

1. **Implement ELU:** Add Exponential Linear Unit and compare to ReLU/Leaky ReLU.

2. **Batch normalization:** Add batch norm after activations. Does it help sigmoid/tanh in deep networks?

3. **Initialization experiment:** Try Xavier vs He initialization with each activation. Which combinations work best?

4. **Learning rate sensitivity:** Which activations are most sensitive to learning rate choice?

In [None]:
# Exercise 1 starter: Implement ELU
def elu(x, alpha=1.0):
    """Exponential Linear Unit"""
    # Your implementation here
    pass

def elu_derivative(x, alpha=1.0):
    # Your implementation here
    pass