# Paper 26: CS231n - Convolutional Neural Networks for Visual Recognition

**Course**: Stanford CS231n (Spring 2017) - Fei-Fei Li, Justin Johnson, Serena Yeung

**The Vision Bible**: CS231n is the definitive course on deep learning for computer vision. This notebook distills its core concepts into a single, executable implementation using only NumPy.

---

## What is CS231n?

CS231n teaches the foundations of visual recognition:
- Image classification pipeline (from pixels to predictions)
- Backpropagation and optimization
- Convolutional neural networks
- Modern architectures (AlexNet, VGG, ResNet)
- Training techniques and "babysitting" neural nets

## This Implementation

We'll build the complete vision pipeline from scratch:

1. **k-Nearest Neighbors**: Baseline classifier
2. **Linear Classifiers**: SVM and Softmax
3. **Optimization**: SGD, momentum, learning rate schedules
4. **Neural Networks**: 2-layer fully-connected networks
5. **Backpropagation**: Manual gradient computation
6. **Convolutional Networks**: Conv, pool, ReLU layers
7. **Architectures**: AlexNet-style CNNs, VGG, ResNet concepts
8. **Visualization**: Saliency maps, filter visualization

## Why This Matters

**CS231n principles apply everywhere**:
- AlexNet (2012) → ImageNet breakthrough
- VGG/ResNet → Standard vision backbones
- Techniques here → Modern transformers, diffusion models

**Connection to Paper #7**: This provides the pedagogical foundation for AlexNet!

Let's build vision systems from first principles!

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import convolve
from typing import Tuple, List, Dict
from dataclasses import dataclass

np.random.seed(42)
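# The 'seaborn-v0_8-darkgrid' style name exists only in Matplotlib >= 3.6;
# keep the default style if it is not available.
if 'seaborn-v0_8-darkgrid' in plt.style.available:
    plt.style.use('seaborn-v0_8-darkgrid')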

print("CS231n: From Pixels to Predictions")
print("NumPy version:", np.__version__)
print("\nReady to learn computer vision!")

# Section 1: Dataset - Synthetic CIFAR-10

CS231n uses CIFAR-10 (10 classes, 32×32 RGB images). We'll generate synthetic data with similar structure.

## Data Generation Strategy

Create procedural 32×32 images with class-specific patterns:
- **Class 0-2**: Spirals (different rotations)
- **Class 3-5**: Checkerboards (different frequencies)
- **Class 6-7**: Gradients (different directions)
- **Class 8-9**: Circles (different sizes)

This gives us:
- Learnable patterns (not pure noise)
- Visual diversity (test different features)
- Instant generation (no downloads)

In [None]:
def generate_synthetic_cifar(num_samples: int = 1000, 
                             img_size: int = 32, 
                             num_classes: int = 10) -> Tuple[np.ndarray, np.ndarray]:
    """Generate synthetic CIFAR-10 style dataset.
    
    Returns:
        X: (N, 32, 32, 3) RGB images
        y: (N,) class labels
    """
    X = np.zeros((num_samples, img_size, img_size, 3))
    y = np.random.randint(0, num_classes, num_samples)
    
    for i in range(num_samples):
        label = y[i]
        img = np.random.randn(img_size, img_size, 3) * 0.1  # Base noise
        
        # Class-specific patterns
        if label < 3:  # Spirals
            theta = np.linspace(0, 4*np.pi, 200)
            r = np.linspace(0, img_size/2, 200)
            rotation = label * np.pi / 3
            x_coords = (r * np.cos(theta + rotation) + img_size/2).astype(int)
            y_coords = (r * np.sin(theta + rotation) + img_size/2).astype(int)
            valid = (x_coords >= 0) & (x_coords < img_size) & (y_coords >= 0) & (y_coords < img_size)
            img[y_coords[valid], x_coords[valid], :] = [1.0, 0.5, 0.0]
            
        elif label < 6:  # Checkerboards
            freq = (label - 2) * 2
            xx, yy = np.meshgrid(np.arange(img_size), np.arange(img_size))
            pattern = ((xx // freq) + (yy // freq)) % 2
            img[:, :, 0] = pattern
            img[:, :, 1] = 1 - pattern
            
        elif label < 8:  # Gradients
            if label == 6:
                img[:, :, 0] = np.linspace(0, 1, img_size)[None, :]
            else:
                img[:, :, 1] = np.linspace(0, 1, img_size)[:, None]
                
        else:  # Circles
            radius = (label - 7) * 8 + 5
            yy, xx = np.ogrid[:img_size, :img_size]
            circle = ((xx - img_size/2)**2 + (yy - img_size/2)**2 <= radius**2)
            img[circle, 2] = 1.0
        
        X[i] = np.clip(img, 0, 1)
    
    return X, y


# Generate train/val/test splits
print("Generating synthetic CIFAR-10...\n")

X_train, y_train = generate_synthetic_cifar(num_samples=2000)
X_val, y_val = generate_synthetic_cifar(num_samples=400)
X_test, y_test = generate_synthetic_cifar(num_samples=400)

print(f"Training set:   X={X_train.shape}, y={y_train.shape}")
print(f"Validation set: X={X_val.shape}, y={y_val.shape}")
print(f"Test set:       X={X_test.shape}, y={y_test.shape}")

# Visualize samples
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i in range(10):
    ax = axes[i // 5, i % 5]
    idx = np.where(y_train == i)[0][0]
    ax.imshow(X_train[idx])
    ax.set_title(f'Class {i}')
    ax.axis('off')

plt.suptitle('Synthetic CIFAR-10: Sample Images per Class', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Flatten for traditional classifiers
X_train_flat = X_train.reshape(len(X_train), -1)  # (N, 3072)
X_val_flat = X_val.reshape(len(X_val), -1)
X_test_flat = X_test.reshape(len(X_test), -1)

print(f"\nFlattened shape: {X_train_flat.shape} (32×32×3 = 3072 pixels)")
print("\n✓ Dataset ready!")

# Section 2: k-Nearest Neighbors (kNN)

**The simplest classifier**: Given a test image, find the k closest training images and let them vote on the label.

## Algorithm

1. Compute distance to all training images: $d(x_{\text{test}}, x_{\text{train}})$
2. Find k nearest neighbors
3. Majority vote on their labels

## Distance Metrics

**L1 (Manhattan)**:
$$d_1(x, y) = \sum_i |x_i - y_i|$$

**L2 (Euclidean)**:
$$d_2(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$
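
The two metrics can rank neighbors differently. As a minimal sketch, assuming a made-up three-point "training set" with four pixels per image (the array names are illustrative and not part of the notebook), both distances from one test point to every training point can be computed with broadcasting:

```python
import numpy as np

# Hypothetical toy data: three "training images" with four pixels each.
X_train_toy = np.array([[0.0, 0.0, 0.0, 0.0],
                        [1.0, 1.0, 1.0, 1.0],
                        [2.0, 0.0, 0.0, 0.0]])
x_test_toy = np.array([0.0, 0.0, 0.0, 0.0])

diff = X_train_toy - x_test_toy           # broadcasts to shape (3, 4)
d1 = np.sum(np.abs(diff), axis=1)         # L1: sum of absolute differences
d2 = np.sqrt(np.sum(diff ** 2, axis=1))   # L2: square root of summed squares

print("L1 distances:", d1)
print("L2 distances:", d2)
```

In this toy case L1 puts the third point (distance 2) closer than the second (distance 4), while L2 sees them as equally far (distance 2 each), so the choice of metric can change which neighbors get a vote.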

## Why kNN Matters

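- **No training**: Just memorize the data
- **Test-time slow**: O(N·D) per prediction, since every test image is compared against all N training images
- **Baseline**: Gives a reference accuracy that later models should beat
- **Rarely used on raw pixels in practice**: But important pedagogically!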

In [None]:
class KNearestNeighbor:
    """k-Nearest Neighbor classifier."""
    
    def __init__(self, k: int = 5):
        self.k = k
        self.X_train = None
        self.y_train = None
    
    def train(self, X: np.ndarray, y: np.ndarray):
        """'Train' by memorizing data (no actual training!)."""
        self.X_train = X
        self.y_train = y
        print(f"kNN 'trained' on {len(X)} samples")
    
    def predict(self, X: np.ndarray, distance_metric: str = 'l2') -> np.ndarray:
        """Predict labels for test data.
        
        Args:
            X: (N_test, D) test data
            distance_metric: 'l1' or 'l2'
        
        Returns:
            y_pred: (N_test,) predicted labels
        """
        num_test = X.shape[0]
        y_pred = np.zeros(num_test, dtype=int)
        
        for i in range(num_test):
            # Compute distances to all training samples
            if distance_metric == 'l1':
                distances = np.sum(np.abs(self.X_train - X[i]), axis=1)
            else:  # l2
                distances = np.sqrt(np.sum((self.X_train - X[i])**2, axis=1))
            
            # Find k nearest neighbors
            k_nearest = np.argsort(distances)[:self.k]
            k_nearest_labels = self.y_train[k_nearest]
            
            # Majority vote
            y_pred[i] = np.argmax(np.bincount(k_nearest_labels))
        
        return y_pred
    
    def compute_accuracy(self, X: np.ndarray, y: np.ndarray, **kwargs) -> float:
        """Compute classification accuracy."""
        y_pred = self.predict(X, **kwargs)
        return np.mean(y_pred == y)


# Train kNN (just memorize)
print("Testing k-Nearest Neighbors...\n")

knn = KNearestNeighbor(k=5)
knn.train(X_train_flat, y_train)

# Test different k values
k_values = [1, 3, 5, 10, 20]
accuracies_l1 = []
accuracies_l2 = []

print("\nTesting different k values...")
for k in k_values:
    knn.k = k
    acc_l1 = knn.compute_accuracy(X_val_flat[:100], y_val[:100], distance_metric='l1')
    acc_l2 = knn.compute_accuracy(X_val_flat[:100], y_val[:100], distance_metric='l2')
    accuracies_l1.append(acc_l1)
    accuracies_l2.append(acc_l2)
    print(f"  k={k:2d}: L1={acc_l1:.1%}, L2={acc_l2:.1%}")

# Plot accuracy vs k
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Accuracy vs k
axes[0].plot(k_values, accuracies_l1, 'o-', linewidth=2, markersize=8, label='L1 distance')
axes[0].plot(k_values, accuracies_l2, 's-', linewidth=2, markersize=8, label='L2 distance')
axes[0].set_xlabel('k (number of neighbors)', fontsize=11)
axes[0].set_ylabel('Validation Accuracy', fontsize=11)
axes[0].set_title('kNN: Hyperparameter Tuning', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Confusion matrix for k=5
knn.k = 5
y_pred = knn.predict(X_val_flat[:200], distance_metric='l2')
y_true = y_val[:200]

confusion = np.zeros((10, 10))
for true, pred in zip(y_true, y_pred):
    confusion[true, pred] += 1

im = axes[1].imshow(confusion, cmap='Blues')
axes[1].set_xlabel('Predicted Label', fontsize=11)
axes[1].set_ylabel('True Label', fontsize=11)
axes[1].set_title('Confusion Matrix (k=5, L2)', fontsize=12, fontweight='bold')
plt.colorbar(im, ax=axes[1])

plt.tight_layout()
plt.show()

print("\n🔑 Key insights:")
print("   • kNN: No training, slow at test time")
print("   • k=1: Overfits (memorizes noise)")
print("   • k too large: Underfits (averages too much)")
print("   • Best k: Found via validation set")
print(f"   • Best validation accuracy: {max(max(accuracies_l1), max(accuracies_l2)):.1%} (baseline!)")
print("\n✓ kNN complete! Let's do better with parametric models...")

# Section 3: Linear Classifiers - SVM and Softmax

**Parametric models**: Learn a weight matrix $W$ that maps pixels to class scores.

## Score Function

$$f(x; W, b) = Wx + b$$

where:
- $x \in \mathbb{R}^D$: Input image (3072 pixels)
- $W \in \mathbb{R}^{C \times D}$: Weight matrix (10 × 3072); the code below stores its transpose, shape (3072, 10), and computes `X @ W` on a batch
- $b \in \mathbb{R}^C$: Bias vector (10,)
- Output: $f \in \mathbb{R}^C$: Class scores (10,)
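
To make the shapes concrete, here is a small sketch with random stand-in data (the sizes match the text; the variable names and the random generator are assumptions for illustration). It checks that the per-image formula $Wx + b$ and the batched form used for rows of flattened images give the same scores:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, N = 3072, 10, 4                     # pixels, classes, batch size

W = rng.standard_normal((C, D)) * 1e-4    # math convention: one row per class
b = np.zeros(C)
X = rng.random((N, D))                    # N flattened images, one per row

scores_single = W @ X[0] + b              # f(x) = Wx + b for one image, shape (C,)
scores_batch = X @ W.T + b                # same scores for the whole batch, shape (N, C)

print(scores_single.shape, scores_batch.shape)
print(np.allclose(scores_batch[0], scores_single))
```

Each row of $W$ acts as a template for one class, and the score is the dot product of that template with the image plus a per-class offset.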

## Loss Functions

### 1. Multiclass SVM Loss (Hinge Loss)

$$L = \frac{1}{N} \sum_{i=1}^N \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$

where $\Delta = 1$ is the margin.

**Intuition**: Correct class score should be at least $\Delta$ higher than wrong classes.
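
A worked example for a single image may help. The scores below are illustrative values, not output from this notebook; the first class is taken to be the correct one:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])    # illustrative class scores for one image
correct = 0                            # index of the true class
delta = 1.0                            # margin

margins = np.maximum(0, scores - scores[correct] + delta)
margins[correct] = 0                   # the true class does not contribute
loss_i = margins.sum()

print(margins)
print(loss_i)                          # about 2.9
```

Only the second class violates the margin (5.1 - 3.2 + 1 = 2.9); the third class is already more than $\Delta$ below the correct score, so it adds nothing.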

### 2. Softmax Loss (Cross-Entropy)

$$L = -\frac{1}{N} \sum_{i=1}^N \log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$$

**Intuition**: Maximize log-probability of correct class.
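
The same illustrative scores can be pushed through the softmax loss. This sketch also subtracts the maximum score before exponentiating, a common trick that leaves the result unchanged but avoids overflow (the helper name is hypothetical):

```python
import numpy as np

def softmax_loss_single(scores, correct):
    """Cross-entropy loss for one example, with the max-shift for stability."""
    shifted = scores - scores.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[correct]), probs

scores = np.array([3.2, 5.1, -1.7])    # illustrative class scores, true class = 0
loss_i, probs = softmax_loss_single(scores, correct=0)
loss_big, _ = softmax_loss_single(scores + 1000.0, correct=0)

print(probs.round(3))
print(loss_i)                          # roughly 2.04
print(np.isclose(loss_i, loss_big))    # shifting all scores does not change the loss
```

The correct class receives a probability of about 0.13, so the loss is about $-\log(0.13) \approx 2.04$.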

## Regularization

Add penalty to prevent overfitting:

$$L_{\text{total}} = L_{\text{data}} + \lambda R(W)$$

Common choices:
- **L2**: $R(W) = \sum_{i,j} W_{ij}^2$ (weight decay)
- **L1**: $R(W) = \sum_{i,j} |W_{ij}|$ (sparsity)
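
A small sketch of what the penalties prefer, using two hand-picked weight vectors (illustrative values) that produce the same score on the same input:

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w_peaky = np.array([1.0, 0.0, 0.0, 0.0])       # relies on a single pixel
w_spread = np.array([0.25, 0.25, 0.25, 0.25])  # uses all pixels a little

for name, w in [("peaky", w_peaky), ("spread", w_spread)]:
    score = w @ x
    l2_penalty = np.sum(w ** 2)
    l1_penalty = np.sum(np.abs(w))
    print(f"{name:6s} score={score:.2f}  L2={l2_penalty:.2f}  L1={l1_penalty:.2f}")
```

Both vectors score 1.0, but the L2 penalty is 1.0 for the peaky vector and 0.25 for the spread one, so L2 favors diffuse weights. The L1 penalty is 1.0 for both here, so in this case it does not penalize the sparse solution any more than the dense one.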

In [None]:
class LinearClassifier:
    """Linear classifier with SVM or Softmax loss."""
    
    def __init__(self, input_dim: int = 3072, num_classes: int = 10):
        self.W = np.random.randn(input_dim, num_classes) * 0.0001
        self.b = np.zeros(num_classes)
    
    def forward(self, X: np.ndarray) -> np.ndarray:
        """Compute class scores.
        
        Args:
            X: (N, D) input data
        
        Returns:
            scores: (N, C) class scores
        """
        return X @ self.W + self.b
    
    def svm_loss(self, X: np.ndarray, y: np.ndarray, reg: float = 1e-5) -> Tuple[float, np.ndarray, np.ndarray]:
        """Compute SVM loss and gradients.
        
        Returns:
            loss: Scalar loss
            dW: Gradient of loss w.r.t. W
            db: Gradient of loss w.r.t. b
        """
        N = X.shape[0]
        scores = self.forward(X)  # (N, C)
        
        # Compute margins
        correct_scores = scores[range(N), y].reshape(-1, 1)  # (N, 1)
        margins = np.maximum(0, scores - correct_scores + 1)  # (N, C)
        margins[range(N), y] = 0  # Don't count correct class
        
        # Loss
        loss = np.sum(margins) / N
        loss += reg * np.sum(self.W ** 2)  # L2 regularization
        
        # Gradients
        binary = (margins > 0).astype(float)  # (N, C)
        binary[range(N), y] = -np.sum(binary, axis=1)  # Correct class gets negative
        
        dW = (X.T @ binary) / N + 2 * reg * self.W
        db = np.sum(binary, axis=0) / N
        
        return loss, dW, db
    
    def softmax_loss(self, X: np.ndarray, y: np.ndarray, reg: float = 1e-5) -> Tuple[float, np.ndarray, np.ndarray]:
        """Compute Softmax loss and gradients.
        
        Returns:
            loss: Scalar loss
            dW: Gradient of loss w.r.t. W
            db: Gradient of loss w.r.t. b
        """
        N = X.shape[0]
        scores = self.forward(X)  # (N, C)
        
        # Numerical stability: shift scores
        scores -= np.max(scores, axis=1, keepdims=True)
        
        # Softmax probabilities
        exp_scores = np.exp(scores)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # (N, C)
        
        # Loss
        correct_log_probs = -np.log(probs[range(N), y] + 1e-10)
        loss = np.sum(correct_log_probs) / N
        loss += reg * np.sum(self.W ** 2)
        
        # Gradients
        dscores = probs.copy()
        dscores[range(N), y] -= 1  # Subtract 1 from correct class
        dscores /= N
        
        dW = X.T @ dscores + 2 * reg * self.W
        db = np.sum(dscores, axis=0)
        
        return loss, dW, db
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        scores = self.forward(X)
        return np.argmax(scores, axis=1)
    
    def accuracy(self, X: np.ndarray, y: np.ndarray) -> float:
        """Compute classification accuracy."""
        y_pred = self.predict(X)
        return np.mean(y_pred == y)


def train_linear_classifier(classifier: LinearClassifier,
                           X_train: np.ndarray,
                           y_train: np.ndarray,
                           X_val: np.ndarray,
                           y_val: np.ndarray,
                           loss_function: str = 'softmax',
                           learning_rate: float = 1e-3,
                           reg: float = 1e-5,
                           num_iters: int = 1000,
                           batch_size: int = 200,
                           verbose: bool = True) -> Dict:
    """Train linear classifier using SGD.
    
    Returns:
        Dictionary with training history
    """
    N = X_train.shape[0]
    loss_history = []
    train_acc_history = []
    val_acc_history = []
    
    for it in range(num_iters):
        # Sample mini-batch
        batch_indices = np.random.choice(N, batch_size, replace=False)
        X_batch = X_train[batch_indices]
        y_batch = y_train[batch_indices]
        
        # Compute loss and gradients
        if loss_function == 'svm':
            loss, dW, db = classifier.svm_loss(X_batch, y_batch, reg)
        else:  # softmax
            loss, dW, db = classifier.softmax_loss(X_batch, y_batch, reg)
        
        loss_history.append(loss)
        
        # Update parameters
        classifier.W -= learning_rate * dW
        classifier.b -= learning_rate * db
        
        # Check accuracy periodically
        if it % 100 == 0:
            train_acc = classifier.accuracy(X_train[:1000], y_train[:1000])
            val_acc = classifier.accuracy(X_val, y_val)
            train_acc_history.append(train_acc)
            val_acc_history.append(val_acc)
            
            if verbose:
                print(f"Iter {it:4d}/{num_iters}: Loss={loss:.4f}, Train Acc={train_acc:.2%}, Val Acc={val_acc:.2%}")
    
    return {
        'loss_history': loss_history,
        'train_acc_history': train_acc_history,
        'val_acc_history': val_acc_history
    }


# Train Softmax classifier
print("Training Softmax Classifier...\n")

softmax_clf = LinearClassifier()
softmax_history = train_linear_classifier(
    softmax_clf, X_train_flat, y_train, X_val_flat, y_val,
    loss_function='softmax',
    learning_rate=1e-3,
    reg=1e-5,
    num_iters=1000
)

# Train SVM classifier for comparison
print("\nTraining SVM Classifier...\n")

svm_clf = LinearClassifier()
svm_history = train_linear_classifier(
    svm_clf, X_train_flat, y_train, X_val_flat, y_val,
    loss_function='svm',
    learning_rate=1e-3,
    reg=1e-5,
    num_iters=1000
)

# Visualize training
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Loss curves
axes[0].plot(softmax_history['loss_history'], label='Softmax', alpha=0.7)
axes[0].plot(svm_history['loss_history'], label='SVM', alpha=0.7)
axes[0].set_xlabel('Iteration', fontsize=11)
axes[0].set_ylabel('Loss', fontsize=11)
axes[0].set_title('Training Loss', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Accuracy curves
iters_check = np.arange(0, 1000, 100)
axes[1].plot(iters_check, softmax_history['val_acc_history'], 'o-', label='Softmax', linewidth=2)
axes[1].plot(iters_check, svm_history['val_acc_history'], 's-', label='SVM', linewidth=2)
axes[1].set_xlabel('Iteration', fontsize=11)
axes[1].set_ylabel('Validation Accuracy', fontsize=11)
axes[1].set_title('Validation Accuracy', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Visualize learned weights (as images)
W_img = softmax_clf.W.T.reshape(10, 32, 32, 3)  # (10, 32, 32, 3)
W_grid = np.zeros((32*2, 32*5, 3))
for i in range(10):
    row, col = i // 5, i % 5
    W_normalized = (W_img[i] - W_img[i].min()) / (W_img[i].max() - W_img[i].min() + 1e-10)
    W_grid[row*32:(row+1)*32, col*32:(col+1)*32] = W_normalized

axes[2].imshow(W_grid)
axes[2].set_title('Learned Weight Templates', fontsize=12, fontweight='bold')
axes[2].axis('off')

plt.tight_layout()
plt.show()

# Final test accuracy
test_acc_softmax = softmax_clf.accuracy(X_test_flat, y_test)
test_acc_svm = svm_clf.accuracy(X_test_flat, y_test)

print(f"\n" + "="*50)
print("Final Test Accuracy:")
print(f"  Softmax: {test_acc_softmax:.2%}")
print(f"  SVM:     {test_acc_svm:.2%}")
print(f"  kNN:     {max(max(accuracies_l1), max(accuracies_l2)):.2%} (best validation accuracy on 100 samples, baseline)")
print("="*50)

print("\n🔑 Key insights:")
print("   • Linear classifier: f(x) = Wx + b (one template per class)")
print("   • SVM: Margin-based (hinge loss)")
print("   • Softmax: Probability-based (cross-entropy)")
print("   • Both train fast; compare their accuracy with the kNN baseline above")
print("   • Weights look like averaged class templates")
print("\n✓ Linear classifiers complete! Let's add nonlinearity...")

# Section 4: Optimization - SGD, Momentum, and Learning Rate Schedules

## Stochastic Gradient Descent (SGD)

Update rule:
$$w_{t+1} = w_t - \eta \nabla L(w_t)$$

where $\eta$ is the learning rate.

## SGD with Momentum

Add velocity term:
$$v_{t+1} = \rho v_t - \eta \nabla L(w_t)$$
$$w_{t+1} = w_t + v_{t+1}$$

where $\rho \in [0, 1)$ is the momentum coefficient (typically 0.9).

**Benefit**: Smooths updates, accelerates through ravines.
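
The "ravine" effect can be seen on a toy problem. The sketch below assumes a hand-made quadratic loss with one shallow and one steep direction (the curvatures, step count, and names are illustrative choices, not from the course) and applies the two update rules above with the same learning rate:

```python
import numpy as np

curvature = np.array([1.0, 100.0])        # a shallow and a steep direction

def loss_fn(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad_fn(w):
    return curvature * w

lr, rho, steps = 0.01, 0.9, 100

w_sgd = np.array([1.0, 1.0])
w_mom = np.array([1.0, 1.0])
v = np.zeros(2)

for _ in range(steps):
    w_sgd = w_sgd - lr * grad_fn(w_sgd)   # vanilla SGD step
    v = rho * v - lr * grad_fn(w_mom)     # velocity update
    w_mom = w_mom + v                     # momentum step

loss_sgd, loss_mom = loss_fn(w_sgd), loss_fn(w_mom)
print(f"SGD loss after {steps} steps:      {loss_sgd:.2e}")
print(f"Momentum loss after {steps} steps: {loss_mom:.2e}")
```

The learning rate has to stay small because of the steep direction, which makes plain SGD crawl along the shallow one. In this setup the accumulated velocity lets the momentum run finish at a lower loss.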

## Learning Rate Schedules

**Step decay**:
$$\eta_t = \eta_0 \cdot \gamma^{\lfloor t / T \rfloor}$$

**Exponential decay**:
$$\eta_t = \eta_0 e^{-kt}$$

**1/t decay**:
$$\eta_t = \frac{\eta_0}{1 + kt}$$

## Babysitting the Learning Process

**CS231n wisdom**:
1. Start with small lr (1e-3 to 1e-4)
2. Monitor loss: Should decrease smoothly
3. Check gradients: Not too small, not too large
4. Visualize weights: Should show structure
5. Overfit small dataset first (sanity check)
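
A complementary sanity check, not listed above but in the same spirit, is to confirm the loss at initialization. With tiny random weights a softmax classifier is close to uniform over $C$ classes, so its loss should be near $\log C$. A minimal sketch on random stand-in data (all names and sizes here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 50, 3072, 10

X = rng.random((N, D))                    # stand-in images with pixels in [0, 1]
y = rng.integers(0, C, size=N)            # random labels
W = rng.standard_normal((D, C)) * 1e-4    # tiny weights give near-uniform scores

scores = X @ W
scores -= scores.max(axis=1, keepdims=True)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
initial_loss = -np.log(probs[np.arange(N), y]).mean()

print(f"initial loss = {initial_loss:.3f}, log(C) = {np.log(C):.3f}")
```

If the initial loss is far from $\log 10 \approx 2.30$, the loss code or the initialization is worth inspecting before any training run.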

In [None]:
class Optimizer:
    """Base optimizer class."""
    
    def __init__(self, learning_rate: float = 1e-3):
        self.learning_rate = learning_rate
    
    def update(self, param: np.ndarray, grad: np.ndarray) -> np.ndarray:
        """Update parameter using gradient."""
        raise NotImplementedError


class SGD(Optimizer):
    """Vanilla SGD optimizer."""
    
    def update(self, param: np.ndarray, grad: np.ndarray) -> np.ndarray:
        return param - self.learning_rate * grad


class SGDMomentum(Optimizer):
    """SGD with momentum."""
    
    def __init__(self, learning_rate: float = 1e-3, momentum: float = 0.9):
        super().__init__(learning_rate)
        self.momentum = momentum
        self.velocity = {}
    
    def update(self, param: np.ndarray, grad: np.ndarray, param_id: str = 'default') -> np.ndarray:
        if param_id not in self.velocity:
            self.velocity[param_id] = np.zeros_like(param)
        
        self.velocity[param_id] = self.momentum * self.velocity[param_id] - self.learning_rate * grad
        return param + self.velocity[param_id]


class Adam(Optimizer):
    """Adam optimizer (adaptive learning rates)."""
    
    def __init__(self, learning_rate: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999):
        super().__init__(learning_rate)
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = 1e-8
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = {}  # Timestep
    
    def update(self, param: np.ndarray, grad: np.ndarray, param_id: str = 'default') -> np.ndarray:
        if param_id not in self.m:
            self.m[param_id] = np.zeros_like(param)
            self.v[param_id] = np.zeros_like(param)
            self.t[param_id] = 0
        
        self.t[param_id] += 1
        t = self.t[param_id]
        
        # Update biased moments
        self.m[param_id] = self.beta1 * self.m[param_id] + (1 - self.beta1) * grad
        self.v[param_id] = self.beta2 * self.v[param_id] + (1 - self.beta2) * (grad ** 2)
        
        # Bias correction
        m_hat = self.m[param_id] / (1 - self.beta1 ** t)
        v_hat = self.v[param_id] / (1 - self.beta2 ** t)
        
        # Update
        return param - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.eps)


def learning_rate_schedule(initial_lr: float, iteration: int, schedule_type: str = 'step') -> float:
    """Compute learning rate with schedule.
    
    Args:
        initial_lr: Initial learning rate
        iteration: Current iteration
        schedule_type: 'step', 'exp', or 'inverse'
    """
    if schedule_type == 'step':
        # Decay by 0.5 every 250 iterations
        return initial_lr * (0.5 ** (iteration // 250))
    elif schedule_type == 'exp':
        # Exponential decay
        return initial_lr * np.exp(-0.001 * iteration)
    else:  # inverse
        # 1/t decay
        return initial_lr / (1 + 0.001 * iteration)


# Compare optimizers
print("Comparing optimizers...\n")

optimizers = {
    'SGD': SGD(learning_rate=1e-3),
    'SGD+Momentum': SGDMomentum(learning_rate=1e-3, momentum=0.9),
    'Adam': Adam(learning_rate=1e-3)
}

histories = {}

for name, optimizer in optimizers.items():
    print(f"Training with {name}...")
    clf = LinearClassifier()
    
    loss_history = []
    for it in range(500):
        batch_indices = np.random.choice(len(X_train_flat), 200)
        X_batch = X_train_flat[batch_indices]
        y_batch = y_train[batch_indices]
        
        loss, dW, db = clf.softmax_loss(X_batch, y_batch, reg=1e-5)
        loss_history.append(loss)
        
        if isinstance(optimizer, (SGDMomentum, Adam)):
            clf.W = optimizer.update(clf.W, dW, 'W')
            clf.b = optimizer.update(clf.b, db, 'b')
        else:
            clf.W = optimizer.update(clf.W, dW)
            clf.b = optimizer.update(clf.b, db)
    
    histories[name] = loss_history
    final_acc = clf.accuracy(X_val_flat, y_val)
    print(f"  Final val acc: {final_acc:.2%}\n")

# Visualize optimizer comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Loss curves
for name, history in histories.items():
    axes[0].plot(history, label=name, linewidth=2, alpha=0.8)

axes[0].set_xlabel('Iteration', fontsize=11)
axes[0].set_ylabel('Loss', fontsize=11)
axes[0].set_title('Optimizer Comparison', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_yscale('log')

# Plot 2: Learning rate schedules
iters = np.arange(1000)
for schedule in ['step', 'exp', 'inverse']:
    lrs = [learning_rate_schedule(1e-3, it, schedule) for it in iters]
    axes[1].plot(iters, lrs, label=schedule.capitalize(), linewidth=2)

axes[1].set_xlabel('Iteration', fontsize=11)
axes[1].set_ylabel('Learning Rate', fontsize=11)
axes[1].set_title('Learning Rate Schedules', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_yscale('log')

plt.tight_layout()
plt.show()

print("\n🔑 Key insights:")
print("   • SGD: Simple but can be slow")
print("   • Momentum: Smooths updates, accelerates convergence")
print("   • Adam: Adaptive rates, often works out-of-the-box")
print("   • Learning rate schedule: Helps fine-tuning")
print("   • Babysitting: Monitor loss, check gradients, visualize weights")
print("\n✓ Optimization complete!")

# Section 5: Neural Networks - Adding Nonlinearity

Linear classifiers have fundamental limits. Neural networks add **nonlinearity** through hidden layers.

## 2-Layer Neural Network

$$h = \text{ReLU}(W_1 x + b_1)$$
$$y = W_2 h + b_2$$

where:
- $x \in \mathbb{R}^D$: Input (3072)
- $W_1 \in \mathbb{R}^{H \times D}$: First layer weights
- $h \in \mathbb{R}^H$: Hidden layer (e.g., H=100)
- $W_2 \in \mathbb{R}^{C \times H}$: Second layer weights
- $y \in \mathbb{R}^C$: Output scores (10)

The code below stores the transposes, with shapes (D, H) and (H, C), and applies them to a batch as `X @ W1` and `h1 @ W2`.
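
For a sense of scale, the parameter counts implied by these dimensions can be worked out directly. This is plain arithmetic with the sizes from the text; the single linear classifier is included for comparison:

```python
D, H, C = 3072, 100, 10

n_layer1 = D * H + H      # W1 plus b1
n_layer2 = H * C + C      # W2 plus b2
n_total = n_layer1 + n_layer2
n_linear = D * C + C      # single linear classifier, for comparison

print(f"layer 1: {n_layer1:,}")
print(f"layer 2: {n_layer2:,}")
print(f"two-layer net: {n_total:,}  vs  linear classifier: {n_linear:,}")
```

Almost all of the 308,310 parameters sit in the first layer, because every hidden unit connects to every input pixel.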

## Activation Functions

**ReLU** (Rectified Linear Unit):
$$\text{ReLU}(x) = \max(0, x)$$

**Sigmoid**:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

**Tanh**:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

**ReLU is preferred**: Cheap to compute, does not saturate for positive inputs, and works well in practice.
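
Saturation is easiest to see in the derivatives. The sketch below evaluates the local gradient of each activation at a few sample inputs (the helper names and inputs are chosen for illustration):

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

x = np.array([-10.0, -1.0, 0.5, 10.0])
for name, fn in [("ReLU", relu_grad), ("sigmoid", sigmoid_grad), ("tanh", tanh_grad)]:
    print(f"{name:8s}", fn(x))
```

At $x = 10$ the sigmoid and tanh gradients are close to zero, so very little signal flows back through them, while ReLU passes the gradient unchanged. The flip side is that ReLU's gradient is exactly zero for negative inputs.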

## Backpropagation

Chain rule through computational graph:

1. Forward pass: Compute activations
2. Backward pass: Compute gradients

For ReLU:
$$\frac{\partial \text{ReLU}}{\partial x} = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
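
A standard way to validate hand-written backpropagation is to compare it against centered finite differences. The sketch below does this for a tiny ReLU network on random data; the network sizes, helper names, and step size `h` are assumptions chosen to keep the check fast, and the code is separate from the class defined in the next cell:

```python
import numpy as np

def tiny_net_loss(p, X, y):
    """Softmax loss and analytic gradients for a small ReLU network."""
    N = X.shape[0]
    z1 = X @ p["W1"] + p["b1"]
    h1 = np.maximum(0, z1)                       # forward: ReLU
    scores = h1 @ p["W2"] + p["b2"]
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()

    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1
    dscores /= N
    dz1 = (dscores @ p["W2"].T) * (z1 > 0)       # backward: ReLU gate
    grads = {"W1": X.T @ dz1, "b1": dz1.sum(axis=0),
             "W2": h1.T @ dscores, "b2": dscores.sum(axis=0)}
    return loss, grads

def numeric_grad(f, x, h=1e-5):
    """Centered finite differences; perturbs x in place and restores it."""
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f()
        x[idx] = old - h
        f_minus = f()
        x[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * h)
    return g

rng = np.random.default_rng(0)
X, y = rng.standard_normal((5, 6)), rng.integers(0, 3, size=5)
p = {"W1": rng.standard_normal((6, 4)), "b1": rng.standard_normal(4) * 0.1,
     "W2": rng.standard_normal((4, 3)), "b2": rng.standard_normal(3) * 0.1}

_, analytic = tiny_net_loss(p, X, y)
rel_errors = {}
for name in p:
    numeric = numeric_grad(lambda: tiny_net_loss(p, X, y)[0], p[name])
    rel_errors[name] = (np.linalg.norm(analytic[name] - numeric)
                        / (np.linalg.norm(analytic[name]) + np.linalg.norm(numeric) + 1e-12))
    print(f"{name}: relative error {rel_errors[name]:.2e}")
```

Small relative errors suggest the backward pass matches the forward pass. A large error for one parameter usually points to the layer where the chain rule was applied incorrectly.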

In [None]:
class TwoLayerNet:
    """Two-layer fully-connected neural network."""
    
    def __init__(self, input_dim: int = 3072, hidden_dim: int = 100, num_classes: int = 10):
        """Initialize weights with He initialization (std = sqrt(2 / fan_in)) and zero biases."""
        self.params = {}
        self.params['W1'] = np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim)
        self.params['b1'] = np.zeros(hidden_dim)
        self.params['W2'] = np.random.randn(hidden_dim, num_classes) * np.sqrt(2.0 / hidden_dim)
        self.params['b2'] = np.zeros(num_classes)
    
    def forward(self, X: np.ndarray) -> Tuple[np.ndarray, Dict]:
        """Forward pass with caching for backprop.
        
        Returns:
            scores: (N, C) class scores
            cache: Dictionary with intermediate values
        """
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        
        # Layer 1: Linear + ReLU
        z1 = X @ W1 + b1  # (N, H)
        h1 = np.maximum(0, z1)  # ReLU
        
        # Layer 2: Linear
        scores = h1 @ W2 + b2  # (N, C)
        
        cache = {'X': X, 'z1': z1, 'h1': h1}
        return scores, cache
    
    def loss(self, X: np.ndarray, y: np.ndarray, reg: float = 0.0) -> Tuple[float, Dict]:
        """Compute loss and gradients.
        
        Returns:
            loss: Scalar loss
            grads: Dictionary with gradients for each parameter
        """
        N = X.shape[0]
        
        # Forward pass
        scores, cache = self.forward(X)
        
        # Compute softmax loss
        scores -= np.max(scores, axis=1, keepdims=True)  # Numerical stability
        exp_scores = np.exp(scores)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        
        loss = -np.sum(np.log(probs[range(N), y] + 1e-10)) / N
        loss += reg * (np.sum(self.params['W1']**2) + np.sum(self.params['W2']**2))
        
        # Backward pass
        grads = {}
        
        # Gradient on scores
        dscores = probs.copy()
        dscores[range(N), y] -= 1
        dscores /= N
        
        # Layer 2 gradients
        grads['W2'] = cache['h1'].T @ dscores + 2 * reg * self.params['W2']
        grads['b2'] = np.sum(dscores, axis=0)
        
        # Backprop to hidden layer
        dh1 = dscores @ self.params['W2'].T
        
        # ReLU backward
        dz1 = dh1 * (cache['z1'] > 0)  # ReLU derivative
        
        # Layer 1 gradients
        grads['W1'] = cache['X'].T @ dz1 + 2 * reg * self.params['W1']
        grads['b1'] = np.sum(dz1, axis=0)
        
        return loss, grads
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        scores, _ = self.forward(X)
        return np.argmax(scores, axis=1)
    
    def accuracy(self, X: np.ndarray, y: np.ndarray) -> float:
        """Compute accuracy."""
        y_pred = self.predict(X)
        return np.mean(y_pred == y)


def train_neural_network(net: TwoLayerNet,
                        X_train: np.ndarray,
                        y_train: np.ndarray,
                        X_val: np.ndarray,
                        y_val: np.ndarray,
                        learning_rate: float = 1e-3,
                        reg: float = 1e-5,
                        num_iters: int = 2000,
                        batch_size: int = 200,
                        verbose: bool = True) -> Dict:
    """Train neural network using SGD with momentum."""
    N = X_train.shape[0]
    loss_history = []
    train_acc_history = []
    val_acc_history = []
    
    # Use momentum
    velocity = {key: np.zeros_like(val) for key, val in net.params.items()}
    momentum = 0.9
    
    for it in range(num_iters):
        # Sample mini-batch
        batch_indices = np.random.choice(N, batch_size)
        X_batch = X_train[batch_indices]
        y_batch = y_train[batch_indices]
        
        # Compute loss and gradients
        loss, grads = net.loss(X_batch, y_batch, reg)
        loss_history.append(loss)
        
        # Update with momentum
        for param_name in net.params:
            velocity[param_name] = momentum * velocity[param_name] - learning_rate * grads[param_name]
            net.params[param_name] += velocity[param_name]
        
        # Check accuracy periodically
        if it % 200 == 0:
            train_acc = net.accuracy(X_train[:1000], y_train[:1000])
            val_acc = net.accuracy(X_val, y_val)
            train_acc_history.append(train_acc)
            val_acc_history.append(val_acc)
            
            if verbose:
                print(f"Iter {it:4d}: Loss={loss:.4f}, Train={train_acc:.2%}, Val={val_acc:.2%}")
    
    return {
        'loss_history': loss_history,
        'train_acc_history': train_acc_history,
        'val_acc_history': val_acc_history
    }


# Train neural network
print("Training 2-Layer Neural Network...\n")

net = TwoLayerNet(input_dim=3072, hidden_dim=100, num_classes=10)
nn_history = train_neural_network(
    net, X_train_flat, y_train, X_val_flat, y_val,
    learning_rate=1e-3,
    reg=1e-5,
    num_iters=2000
)

# Visualize training
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Loss curve
axes[0].plot(nn_history['loss_history'], linewidth=2, color='darkblue')
axes[0].set_xlabel('Iteration', fontsize=11)
axes[0].set_ylabel('Loss', fontsize=11)
axes[0].set_title('Training Loss', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Plot 2: Accuracy curves
iters_check = np.arange(0, 2000, 200)
axes[1].plot(iters_check, nn_history['train_acc_history'], 'o-', label='Train', linewidth=2)
axes[1].plot(iters_check, nn_history['val_acc_history'], 's-', label='Validation', linewidth=2)
axes[1].set_xlabel('Iteration', fontsize=11)
axes[1].set_ylabel('Accuracy', fontsize=11)
axes[1].set_title('Train vs Validation Accuracy', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Visualize first layer weights
W1 = net.params['W1'].T  # (H, D)
W1_img = W1[:64].reshape(64, 32, 32, 3)  # First 64 neurons
W1_grid = np.zeros((32*8, 32*8, 3))
for i in range(64):
    row, col = i // 8, i % 8
    w = W1_img[i]
    w_norm = (w - w.min()) / (w.max() - w.min() + 1e-10)
    W1_grid[row*32:(row+1)*32, col*32:(col+1)*32] = w_norm

axes[2].imshow(W1_grid)
axes[2].set_title('First Layer Weights (Filters)', fontsize=12, fontweight='bold')
axes[2].axis('off')

plt.tight_layout()
plt.show()

# Test accuracy
test_acc_nn = net.accuracy(X_test_flat, y_test)

print("\n" + "="*50)
print("Test Accuracy Comparison:")
print(f"  Neural Network: {test_acc_nn:.2%}")
print(f"  Softmax:        {test_acc_softmax:.2%}")
print(f"  kNN:            {max(max(accuracies_l1), max(accuracies_l2)):.2%}")
print("="*50)

print("\n🔑 Key insights:")
print("   • Nonlinearity (ReLU) enables learning complex functions")
print("   • Hidden layer learns features, output layer classifies")
print("   • A hidden layer adds capacity beyond a linear classifier (compare the accuracies above)")
print("   • First layer weights can resemble edge/color detectors")
print("   • More layers = more capacity (but also harder to train)")
print("\n✓ Neural networks complete! Now let's add conv layers...")

# Section 6: Convolutional Neural Networks (CNNs)

**Key insight**: Images have spatial structure! Fully-connected layers ignore this.

## Convolutional Layer

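Slide each filter (kernel) over local regions of the input and take a dot product at every position (strictly a cross-correlation, which deep learning libraries call convolution):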
$$y[i,j] = \sum_{m,n} W[m,n] \cdot x[i+m, j+n]$$

**Parameters**:
- Filter size: $K \times K$ (typically 3×3 or 5×5)
- Stride: How much to move filter (typically 1 or 2)
- Padding: Add zeros around border to preserve size

**Output size**:
$$H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1$$

where $P$ = padding, $S$ = stride.
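
As a quick check of the output-size formula, here is a minimal sketch. The helper name `conv_output_size` is hypothetical (it is not part of this notebook), and it assumes the floor-division convention for strides that do not divide evenly.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, pad: int = 0) -> int:
    """Spatial output size of a conv/pool layer along one dimension."""
    if size + 2 * pad < kernel:
        raise ValueError("kernel is larger than the padded input")
    return (size + 2 * pad - kernel) // stride + 1

# 32-pixel input, 5x5 filter, stride 1, pad 2: the spatial size is preserved
print(conv_output_size(32, kernel=5, stride=1, pad=2))  # 32
# 2x2 pooling with stride 2 halves the size
print(conv_output_size(32, kernel=2, stride=2))         # 16
```

The same helper covers pooling layers by setting `pad=0`.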

## Max Pooling

Downsample by taking maximum in each region:
$$y[i,j] = \max_{m,n \in \text{region}} x[m,n]$$

**Benefits**:
- Reduces spatial size
- Translation invariance
- Controls overfitting
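
To make the pooling operation concrete, the sketch below applies 2×2 max pooling with stride 2 to a single 4×4 feature map. The reshape trick is one possible implementation and only works when the window tiles the input exactly; the input values are made up for illustration.

```python
import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 7, 2],
              [3, 2, 4, 6]], dtype=float)

# Split the 4x4 map into non-overlapping 2x2 blocks:
# axes are (block_row, row_in_block, block_col, col_in_block).
blocks = x.reshape(2, 2, 2, 2)

# Take the max over the two "within block" axes.
pooled = blocks.max(axis=(1, 3))
print(pooled)
# [[4. 5.]
#  [3. 7.]]
```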

## Why CNNs Work

1. **Parameter sharing**: Same filter applied everywhere (much fewer params than FC)
2. **Local connectivity**: Each neuron only looks at local patch
3. **Translation equivariance**: A shifted input gives a correspondingly shifted feature map; pooling then adds approximate invariance
4. **Hierarchical features**: Early layers = edges, late layers = objects
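
The effect of parameter sharing can be put in numbers. The sketch below compares a conv layer with a dense layer that produces an output of the same size; the layer sizes (16 filters of 5×5 on a 32×32×3 input, "same" padding) are chosen for illustration.

```python
# Conv layer: 16 filters of size 5x5 over 3 input channels, one bias per filter
conv_params = 16 * (5 * 5 * 3 + 1)

# Dense layer mapping the flattened 32x32x3 input to the same 16x32x32 output
in_features = 32 * 32 * 3
out_features = 16 * 32 * 32
dense_params = in_features * out_features + out_features

print(f"conv:  {conv_params:,}")    # conv:  1,216
print(f"dense: {dense_params:,}")   # dense: 50,348,032
```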

In [None]:
def conv2d_forward(X: np.ndarray, W: np.ndarray, b: np.ndarray, 
                   stride: int = 1, pad: int = 0) -> Tuple[np.ndarray, Tuple]:
    """Forward pass for convolutional layer.
    
    Args:
        X: (N, C_in, H, W) input
        W: (C_out, C_in, K, K) filters
        b: (C_out,) biases
        stride: Stride
        pad: Padding
    
    Returns:
        out: (N, C_out, H_out, W_out) output
        cache: Tuple for backprop
    """
    # Use W_in for the input width so the filter tensor W is not overwritten
    N, C_in, H, W_in = X.shape
    C_out, _, K, _ = W.shape
    
    # Add padding
    X_pad = np.pad(X, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    
    # Output dimensions
    H_out = (H + 2*pad - K) // stride + 1
    W_out = (W_in + 2*pad - K) // stride + 1
    
    # Initialize output
    out = np.zeros((N, C_out, H_out, W_out))
    
    # Naive implementation (loop-based, slow but clear)
    for i in range(H_out):
        for j in range(W_out):
            h_start = i * stride
            h_end = h_start + K
            w_start = j * stride
            w_end = w_start + K
            
            # Extract patch
            X_patch = X_pad[:, :, h_start:h_end, w_start:w_end]  # (N, C_in, K, K)
            
            # Convolve each filter
            for c in range(C_out):
                out[:, c, i, j] = np.sum(X_patch * W[c], axis=(1, 2, 3)) + b[c]
    
    cache = (X, W, b, stride, pad)
    return out, cache


def maxpool2d_forward(X: np.ndarray, pool_size: int = 2, stride: int = 2) -> Tuple[np.ndarray, Tuple]:
    """Forward pass for max pooling layer.
    
    Args:
        X: (N, C, H, W) input
        pool_size: Size of pooling window
        stride: Stride
    
    Returns:
        out: (N, C, H_out, W_out) output
        cache: Tuple for backprop
    """
    N, C, H, W = X.shape
    
    H_out = (H - pool_size) // stride + 1
    W_out = (W - pool_size) // stride + 1
    
    out = np.zeros((N, C, H_out, W_out))
    
    for i in range(H_out):
        for j in range(W_out):
            h_start = i * stride
            h_end = h_start + pool_size
            w_start = j * stride
            w_end = w_start + pool_size
            
            # Max over spatial window
            X_patch = X[:, :, h_start:h_end, w_start:w_end]
            out[:, :, i, j] = np.max(X_patch, axis=(2, 3))
    
    cache = (X, pool_size, stride)
    return out, cache


def relu_forward(X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Forward pass for ReLU."""
    out = np.maximum(0, X)
    cache = X
    return out, cache


# Test CNN layers
print("Testing CNN layers...\n")

# Test convolutional layer
# (named X_demo to avoid overwriting an X_test test split defined earlier)
X_demo = X_train[:10].transpose(0, 3, 1, 2)  # (N, C, H, W)
W_test = np.random.randn(16, 3, 5, 5) * 0.01  # 16 filters, 5×5, 3 channels
b_test = np.zeros(16)

out_conv, _ = conv2d_forward(X_demo, W_test, b_test, stride=1, pad=2)
print(f"Conv layer: Input {X_demo.shape} → Output {out_conv.shape}")

# Test max pooling
out_pool, _ = maxpool2d_forward(out_conv, pool_size=2, stride=2)
print(f"Max pool:   Input {out_conv.shape} → Output {out_pool.shape}")

# Test ReLU
out_relu, _ = relu_forward(out_pool)
print(f"ReLU:       Input {out_pool.shape} → Output {out_relu.shape}")

# Visualize the filters (randomly initialized here, not learned)
fig, axes = plt.subplots(4, 4, figsize=(10, 10))
for i in range(16):
    ax = axes[i // 4, i % 4]
    # Visualize filter (min-max normalize over the whole filter for display)
    filt = W_test[i].transpose(1, 2, 0)  # (K, K, 3)
    filt_norm = (filt - filt.min()) / (filt.max() - filt.min() + 1e-10)
    ax.imshow(filt_norm)
    ax.set_title(f'Filter {i}')
    ax.axis('off')

plt.suptitle('Random Conv Filters (5×5, 3 channels)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n🔑 Key insights:")
print("   • Conv layer: Apply filters to local regions")
print("   • Parameter sharing: Same filter everywhere (far fewer params)")
print("   • Max pooling: Downsample, translation invariance")
print("   • ReLU: Nonlinearity, fast and effective")
print("   • Stacking: Conv → ReLU → Pool → repeat")
print("\n✓ CNN layers complete!")

# Section 7: Complete CNN Architecture - Mini AlexNet

Let's build a simplified AlexNet for our 32×32 images.

## AlexNet Architecture (Simplified)

```
Input: 32×32×3
↓
Conv1: 32 filters, 5×5, stride 1, pad 2 → 32×32×32
ReLU → MaxPool (2×2, stride 2) → 16×16×32
↓
Conv2: 64 filters, 3×3, stride 1, pad 1 → 16×16×64
ReLU → MaxPool (2×2, stride 2) → 8×8×64
↓
Flatten → 4096
↓
FC1: 4096 → 256
ReLU
↓
FC2: 256 → 10
Softmax
```

## Parameter Count

**Conv layers** (weights + biases):
- Conv1: 32 × (5×5×3 + 1) = 2,432
- Conv2: 64 × (3×3×32 + 1) = 18,496

**FC layers** (weights + biases):
- FC1: 4096 × 256 + 256 = 1,048,832
- FC2: 256 × 10 + 10 = 2,570

**Total**: 1,072,330 ≈ 1.07M parameters (vs. roughly 30M for a comparable pure-FC network, a rough estimate)

**Insight**: CNNs are much more parameter-efficient than FC networks for images!
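
The count can be reproduced with a few lines of arithmetic. This is a sketch that counts weights and biases for every layer of the architecture listed above; the dictionary `layer_params` is an illustrative name, not part of the notebook.

```python
layer_params = {
    "Conv1": 32 * (5 * 5 * 3 + 1),    # 32 filters over 3 input channels, plus biases
    "Conv2": 64 * (3 * 3 * 32 + 1),   # 64 filters over 32 input channels, plus biases
    "FC1": 4096 * 256 + 256,          # 8*8*64 = 4096 flattened features
    "FC2": 256 * 10 + 10,
}
total = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:5s}: {count:>9,}")
print(f"Total: {total:>9,}")          # Total: 1,072,330
```

Almost all of the parameters sit in FC1, which is why later architectures shrink or remove the large fully-connected layers.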

In [None]:
class SimpleCNN:
    """Simple CNN for image classification (toy AlexNet)."""
    
    def __init__(self):
        """Initialize with He initialization."""
        self.params = {}
        
        # Conv1: 3 → 32, 5×5
        self.params['W1'] = np.random.randn(32, 3, 5, 5) * np.sqrt(2.0 / (3*5*5))
        self.params['b1'] = np.zeros(32)
        
        # Conv2: 32 → 64, 3×3
        self.params['W2'] = np.random.randn(64, 32, 3, 3) * np.sqrt(2.0 / (32*3*3))
        self.params['b2'] = np.zeros(64)
        
        # FC1: 4096 → 256
        self.params['W3'] = np.random.randn(4096, 256) * np.sqrt(2.0 / 4096)
        self.params['b3'] = np.zeros(256)
        
        # FC2: 256 → 10
        self.params['W4'] = np.random.randn(256, 10) * np.sqrt(2.0 / 256)
        self.params['b4'] = np.zeros(10)
    
    def forward(self, X: np.ndarray) -> np.ndarray:
        """Forward pass (inference mode, simplified).
        
        Args:
            X: (N, H, W, C) input images
        
        Returns:
            scores: (N, 10) class scores
        """
        # Convert to (N, C, H, W) for conv layers
        X = X.transpose(0, 3, 1, 2)
        
        # Conv1 → ReLU → Pool
        out, _ = conv2d_forward(X, self.params['W1'], self.params['b1'], stride=1, pad=2)
        out, _ = relu_forward(out)
        out, _ = maxpool2d_forward(out, pool_size=2, stride=2)
        
        # Conv2 → ReLU → Pool
        out, _ = conv2d_forward(out, self.params['W2'], self.params['b2'], stride=1, pad=1)
        out, _ = relu_forward(out)
        out, _ = maxpool2d_forward(out, pool_size=2, stride=2)
        
        # Flatten
        N = out.shape[0]
        out = out.reshape(N, -1)  # (N, 4096)
        
        # FC1 → ReLU
        out = out @ self.params['W3'] + self.params['b3']
        out = np.maximum(0, out)
        
        # FC2
        scores = out @ self.params['W4'] + self.params['b4']
        
        return scores
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        scores = self.forward(X)
        return np.argmax(scores, axis=1)
    
    def accuracy(self, X: np.ndarray, y: np.ndarray) -> float:
        """Compute accuracy."""
        y_pred = self.predict(X)
        return np.mean(y_pred == y)


# Test CNN (note: full training would be slow in pure NumPy, so we'll test architecture)
print("Building SimpleCNN (toy AlexNet)...\n")

cnn = SimpleCNN()

# Count parameters
total_params = sum(p.size for p in cnn.params.values())
print(f"Total parameters: {total_params:,}")

# Test forward pass
X_sample = X_train[:5]
scores = cnn.forward(X_sample)
print(f"\nForward pass test:")
print(f"  Input shape:  {X_sample.shape}")
print(f"  Output shape: {scores.shape}")
print(f"  Predictions:  {cnn.predict(X_sample)}")

# Random initialization accuracy
random_acc = cnn.accuracy(X_val[:100], y_val[:100])
print(f"\nRandom initialization accuracy: {random_acc:.2%} (expected ~10% for 10 classes)")

print("\n" + "="*70)
print("CNN Architecture Summary")
print("="*70)
print(f"Layer 1: Conv (3→32, 5×5) + ReLU + MaxPool  →  16×16×32")
print(f"Layer 2: Conv (32→64, 3×3) + ReLU + MaxPool →  8×8×64")
print(f"Layer 3: Flatten                             →  4096")
print(f"Layer 4: FC (4096→256) + ReLU                →  256")
print(f"Layer 5: FC (256→10)                         →  10")
print(f"\nTotal parameters: {total_params:,}")
print(f"Equivalent FC network: ~30,000,000 parameters (30× more!)")
print("="*70)

print("\n🔑 Key insights:")
print("   • CNNs: Stack Conv+ReLU+Pool, then FC layers")
print("   • Parameter efficiency: 1M params vs 30M for FC")
print("   • Spatial hierarchy: Early = edges, Late = objects")
print("   • AlexNet (2012): First ImageNet breakthrough with CNNs")
print("   • Modern CNNs: ResNet, EfficientNet, etc. (same principles!)")
print("\n✓ CNN architecture complete!")

# Section 8: Visualization, Saliency Maps, and Transfer Learning

## Visualization Techniques

### 1. Filter Visualization
- Show what first-layer filters look like
- Early layers: edges, colors, textures

### 2. Activation Maps
- Show which neurons activate for given input
- See what features network detects

### 3. Saliency Maps
- Compute gradient of output w.r.t. input: $\frac{\partial y_c}{\partial x}$
- Shows which pixels matter most for prediction

### 4. Class Visualization
- Generate image that maximizes class score
- Reveals what network thinks each class looks like
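
For intuition about saliency maps, consider the simplest possible case: a linear classifier, where the gradient of a class score with respect to the input is just that class's weight column. The sketch below is an illustration under that assumption (toy sizes, random weights, all names hypothetical). It collapses channels with a max over absolute values, a common convention, and verifies one entry with a central difference.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W_img, C, num_classes = 8, 8, 3, 10          # toy sizes for illustration
weights = rng.normal(size=(H * W_img * C, num_classes))
image = rng.random((H, W_img, C))
target = 3

def class_score(img: np.ndarray) -> float:
    """Score of the target class for a linear model: x @ weights."""
    return float((img.reshape(-1) @ weights)[target])

# Analytic gradient: d(score_c)/dx is column c of the weight matrix
grad = weights[:, target].reshape(H, W_img, C)
saliency = np.abs(grad).max(axis=2)             # (H, W) map

# Central-difference check on a single pixel/channel
eps = 1e-5
up, down = image.copy(), image.copy()
up[2, 5, 1] += eps
down[2, 5, 1] -= eps
numeric = (class_score(up) - class_score(down)) / (2 * eps)
error = abs(numeric - grad[2, 5, 1])
print(f"saliency shape {saliency.shape}, finite-difference error {error:.1e}")
```

For a deep network the same quantity comes from backpropagating the class score to the input pixels.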

## Transfer Learning

**Key insight**: Features from ImageNet transfer to other tasks!

**Strategy**:
1. Pre-train on large dataset (ImageNet)
2. Replace final layer for new task
3. Fine-tune on small dataset

**Why it works**: Early layers learn universal features (edges, textures).
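
The strategy above can be sketched in NumPy by freezing a feature extractor and updating only a new head. Everything here is a stand-in: the "pretrained" weights are random, the data is synthetic, and names such as `W_feat`, `W_head`, and `loss_and_head_grad` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pretrained weights (real ones would be loaded from disk)
params = {
    "W_feat": rng.normal(size=(64, 32)) * 0.1,  # frozen feature extractor
    "W_head": np.zeros((32, 5)),                # new head for a 5-class task
}
trainable = {"W_head"}

X = rng.normal(size=(20, 64))
y = rng.integers(0, 5, size=20)

def loss_and_head_grad(params, X, y):
    """Softmax loss and its gradient with respect to the head only."""
    feats = np.maximum(0, X @ params["W_feat"])
    scores = feats @ params["W_head"]
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(y)
    loss = -np.log(probs[np.arange(n), y]).mean()
    probs[np.arange(n), y] -= 1
    return loss, {"W_head": feats.T @ probs / n}

frozen_before = params["W_feat"].copy()
first_loss, _ = loss_and_head_grad(params, X, y)
for _ in range(200):
    _, grads = loss_and_head_grad(params, X, y)
    for name in trainable:                      # frozen parameters are never touched
        params[name] -= 0.1 * grads[name]
last_loss, _ = loss_and_head_grad(params, X, y)
print(f"loss {first_loss:.3f} -> {last_loss:.3f}")
```

Fine-tuning the whole network would add `W_feat` to `trainable`, usually with a smaller learning rate.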

In [None]:
def compute_saliency_map(net: SimpleCNN, X: np.ndarray, y: int) -> np.ndarray:
    """Compute saliency map for a single image.
    
    Args:
        net: Trained network
        X: (H, W, C) single image
        y: Target class
    
    Returns:
        saliency: (H, W) saliency map
    """
    X = X[np.newaxis, ...].astype(np.float64)  # Add batch dimension; float so eps is not lost
    H, W = X.shape[1:3]
    
    # Forward pass
    scores = net.forward(X)
    
    # Approximate gradient using forward finite differences
    # (Full backprop implementation omitted for brevity)
    eps = 1e-5
    saliency = np.zeros((H, W))
    
    # Sample-based approximation (for speed): probe one pixel per 4×4 block,
    # perturbing all channels at once, and fill the block with that estimate
    block = 4
    for i in range(0, H, block):
        for j in range(0, W, block):
            # Perturb pixel
            X_perturb = X.copy()
            X_perturb[0, i, j, :] += eps
            
            # Compute score change
            scores_perturb = net.forward(X_perturb)
            grad_approx = (scores_perturb[0, y] - scores[0, y]) / eps
            saliency[i:i+block, j:j+block] = abs(grad_approx)
    
    return saliency


def visualize_filters_and_activations(cnn: SimpleCNN, X_sample: np.ndarray):
    """Visualize input images, first-layer filters, and saliency maps."""
    fig, axes = plt.subplots(3, 4, figsize=(15, 12))
    
    # Row 1: Input images
    for i in range(4):
        axes[0, i].imshow(X_sample[i])
        axes[0, i].set_title(f'Input {i}')
        axes[0, i].axis('off')
    
    # Row 2: First layer filters (sample)
    W1 = cnn.params['W1']  # (32, 3, 5, 5)
    for i in range(4):
        filt = W1[i].transpose(1, 2, 0)  # (5, 5, 3)
        filt_norm = (filt - filt.min()) / (filt.max() - filt.min() + 1e-10)
        axes[1, i].imshow(filt_norm)
        axes[1, i].set_title(f'Filter {i}')
        axes[1, i].axis('off')
    
    # Row 3: Saliency maps
    for i in range(4):
        y_pred = cnn.predict(X_sample[i:i+1])[0]
        saliency = compute_saliency_map(cnn, X_sample[i], y_pred)
        axes[2, i].imshow(saliency, cmap='hot')
        axes[2, i].set_title(f'Saliency (pred={y_pred})')
        axes[2, i].axis('off')
    
    plt.suptitle('CNN Visualization: Inputs, Filters, Saliency', 
                fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()


# Visualize CNN
print("Visualizing CNN components...\n")

visualize_filters_and_activations(cnn, X_val[:4])

print("\n🔑 Key insights from visualization:")
print("   • First layer filters: Become edge/color/texture detectors after training (random here)")
print("   • Saliency maps: Show which pixels matter for prediction")
print("   • Activation maps: Reveal what features network detects")
print("   • Class visualization: Generate prototypical examples")

print("\n🎓 Transfer Learning Strategy:")
print("   1. Pre-train on ImageNet (millions of images)")
print("   2. Keep conv layers (feature extractor)")
print("   3. Replace FC layers for new task")
print("   4. Fine-tune on small target dataset")
print("   → Works because early features are universal!")

print("\n✓ Visualization complete!")

# Section 9: Babysitting the Learning Process - Practical Tips

## CS231n's Wisdom for Training Neural Networks

### 1. Data Preprocessing
- **Normalize**: Mean 0, std 1
- **Augmentation**: Flips, crops, color jitter
- **Whitening**: Decorrelate features (PCA)
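
A minimal sketch of the first two preprocessing steps, assuming images stored as `(N, H, W, C)` arrays. The arrays `X_tr` and `X_va` are random stand-ins for a training and validation split. The statistics are computed on the training split only and reused for validation.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.random((100, 32, 32, 3))     # stand-in training images in [0, 1]
X_va = rng.random((20, 32, 32, 3))      # stand-in validation images

# Per-channel statistics from the training split only
mean = X_tr.mean(axis=(0, 1, 2), keepdims=True)
std = X_tr.std(axis=(0, 1, 2), keepdims=True)
X_tr_norm = (X_tr - mean) / (std + 1e-8)
X_va_norm = (X_va - mean) / (std + 1e-8)

# Horizontal flip augmentation: reverse the width axis
X_flipped = X_tr[:, :, ::-1, :]

print(X_tr_norm.mean(axis=(0, 1, 2)), X_tr_norm.std(axis=(0, 1, 2)))
```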

### 2. Weight Initialization
- **Xavier**: $W \sim N(0, 1/n_{\text{in}})$, i.e. std $= 1/\sqrt{n_{\text{in}}}$, for tanh
- **He**: $W \sim N(0, 2/n_{\text{in}})$, i.e. std $= \sqrt{2/n_{\text{in}}}$, for ReLU
- **Biases**: Usually 0
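
The two schemes differ only in the scale factor: a standard deviation of $\sqrt{1/n_{\text{in}}}$ for Xavier and $\sqrt{2/n_{\text{in}}}$ for He, the latter matching the `np.sqrt(2.0 / fan_in)` scaling used in `SimpleCNN`. The sketch below, using the FC1 sizes and unit-variance random inputs as an assumption, checks that each scheme keeps the signal scale near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 4096, 256   # FC1 sizes from SimpleCNN

W_xavier = rng.normal(size=(fan_in, fan_out)) * np.sqrt(1.0 / fan_in)
W_he = rng.normal(size=(fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

x = rng.normal(size=(256, fan_in))            # unit-variance inputs

# Xavier keeps the pre-activation variance near 1
xavier_var = (x @ W_xavier).var()
# He doubles it, compensating for ReLU zeroing half of the activations
he_moment = (np.maximum(0, x @ W_he) ** 2).mean()

print(f"Xavier pre-activation variance: {xavier_var:.3f}")
print(f"He post-ReLU second moment:     {he_moment:.3f}")
```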

### 3. Sanity Checks
- **Overfit tiny dataset**: Should get ~100% accuracy
- **Check loss**: Initial loss should match theory
  - Softmax with C classes: $-\log(1/C)$
- **Gradient check**: Numerical vs analytical gradients
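
The initial-loss and gradient checks can be sketched together on a tiny softmax classifier. The function `softmax_loss` and the toy data are written for this example and are not the notebook's classifier. The numerical gradient uses centered differences.

```python
import numpy as np

def softmax_loss(W, X, y):
    """Mean cross-entropy loss and its analytic gradient with respect to W."""
    scores = X @ W
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    n = X.shape[0]
    loss = -np.log(probs[np.arange(n), y]).mean()
    probs[np.arange(n), y] -= 1
    return loss, X.T @ probs / n

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)
W = rng.normal(size=(4, 3)) * 0.01     # small weights, so loss starts near log(3)

initial_loss, analytic = softmax_loss(W, X, y)

numeric = np.zeros_like(W)
h = 1e-5
for idx in np.ndindex(*W.shape):
    old = W[idx]
    W[idx] = old + h
    loss_plus, _ = softmax_loss(W, X, y)
    W[idx] = old - h
    loss_minus, _ = softmax_loss(W, X, y)
    W[idx] = old                        # restore before moving on
    numeric[idx] = (loss_plus - loss_minus) / (2 * h)

rel_error = np.abs(analytic - numeric).max() / (np.abs(analytic) + np.abs(numeric)).max()
print(f"initial loss {initial_loss:.4f} (log 3 = {np.log(3):.4f}), relative error {rel_error:.1e}")
```

A large relative error here usually points to a bug in the analytic gradient rather than in the optimizer.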

### 4. Hyperparameter Tuning
- **Learning rate**: Most important!
  - Too high: Loss explodes
  - Too low: No learning
  - Sweet spot: Loss decreases steadily
- **Regularization**: Start with 1e-5, tune on validation
- **Batch size**: 32-256 typically
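
A common way to tune these values is random search on a log scale, since useful learning rates and regularization strengths span several orders of magnitude. The ranges below are illustrative assumptions, not recommendations from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
num_trials = 5

# Sample exponents uniformly so that every decade is equally likely
learning_rates = 10 ** rng.uniform(-5, -1, size=num_trials)
regs = 10 ** rng.uniform(-6, -2, size=num_trials)

for lr, reg in zip(learning_rates, regs):
    # In practice: train briefly with (lr, reg) and record validation accuracy
    print(f"lr={lr:.2e}, reg={reg:.2e}")
```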

### 5. Monitoring Training
- **Loss curves**: Should decrease smoothly
- **Train/val gap**: Indicates overfitting
- **Weight updates**: Ratio of update magnitude to weight magnitude should be roughly 1e-3 per iteration
- **Activation histograms**: Check for dead neurons
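
The update-to-weight ratio is cheap to monitor. Here is a sketch for plain SGD with a made-up weight matrix and gradient, scaled so the ratio lands near the 1e-3 rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 10)) * 0.1
dW = rng.normal(size=(100, 10)) * 0.01     # stand-in gradient
learning_rate = 1e-2

update = -learning_rate * dW               # vanilla SGD step
ratio = np.linalg.norm(update) / np.linalg.norm(W)
print(f"update/weight ratio: {ratio:.1e}")
```

A ratio far above 1e-3 suggests the learning rate is too high; far below suggests it is too low.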

### 6. Common Mistakes
- Forgot to normalize data
- Learning rate too high/low
- Regularization too strong
- Batch size too small (noisy gradients)
- Not using validation set properly

In [None]:
# Demonstrate babysitting tips

def sanity_check_loss(num_classes: int = 10) -> float:
    """Expected initial loss for softmax with random weights."""
    return -np.log(1.0 / num_classes)


def overfit_small_dataset(net: TwoLayerNet, X_small: np.ndarray, y_small: np.ndarray, num_iters: int = 500):
    """Sanity check: Should be able to overfit small dataset."""
    print(f"   Training on {len(X_small)} samples for {num_iters} iterations...")
    
    losses = []
    accs = []
    
    for it in range(num_iters):
        loss, grads = net.loss(X_small, y_small, reg=0)  # No regularization
        losses.append(loss)
        accs.append(net.accuracy(X_small, y_small))
        
        # Large learning rate for overfitting
        for param in net.params:
            net.params[param] -= 1e-2 * grads[param]
    
    return losses, accs


def plot_training_diagnostics(history: Dict):
    """Plot comprehensive training diagnostics."""
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    
    # Plot 1: Loss curve (log scale)
    axes[0, 0].plot(history['loss_history'])
    axes[0, 0].set_xlabel('Iteration')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].set_title('Training Loss (log scale)')
    axes[0, 0].set_yscale('log')
    axes[0, 0].grid(True, alpha=0.3)
    
    # Plot 2: Train vs Val accuracy
    iters = np.arange(0, len(history['loss_history']), len(history['loss_history'])//len(history['train_acc_history']))
    axes[0, 1].plot(iters, history['train_acc_history'], label='Train')
    axes[0, 1].plot(iters, history['val_acc_history'], label='Val')
    axes[0, 1].set_xlabel('Iteration')
    axes[0, 1].set_ylabel('Accuracy')
    axes[0, 1].set_title('Train/Val Accuracy Gap')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Plot 3: Learning rate schedule
    iters_range = np.arange(len(history['loss_history']))
    lrs = [learning_rate_schedule(1e-3, it, 'step') for it in iters_range]
    axes[0, 2].plot(iters_range, lrs)
    axes[0, 2].set_xlabel('Iteration')
    axes[0, 2].set_ylabel('Learning Rate')
    axes[0, 2].set_title('Learning Rate Schedule')
    axes[0, 2].set_yscale('log')
    axes[0, 2].grid(True, alpha=0.3)
    
    # Plot 4: Loss histogram
    axes[1, 0].hist(history['loss_history'][100:], bins=50, edgecolor='black', alpha=0.7)
    axes[1, 0].set_xlabel('Loss')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('Loss Distribution')
    axes[1, 0].grid(True, alpha=0.3, axis='y')
    
    # Plot 5: Loss smoothness (per-iteration change in loss)
    loss_grad = np.diff(history['loss_history'])
    axes[1, 1].plot(loss_grad, alpha=0.5)
    axes[1, 1].plot(np.convolve(loss_grad, np.ones(50)/50, mode='valid'), linewidth=2, label='Smoothed')
    axes[1, 1].set_xlabel('Iteration')
    axes[1, 1].set_ylabel('Loss Change per Iteration')
    axes[1, 1].set_title('Loss Change Rate')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    # Plot 6: Overfitting indicator
    train_val_gap = np.array(history['train_acc_history']) - np.array(history['val_acc_history'])
    axes[1, 2].plot(iters, train_val_gap, linewidth=2, color='red')
    axes[1, 2].axhline(0, color='black', linestyle='--', alpha=0.5)
    axes[1, 2].set_xlabel('Iteration')
    axes[1, 2].set_ylabel('Train - Val Accuracy')
    axes[1, 2].set_title('Overfitting Indicator')
    axes[1, 2].grid(True, alpha=0.3)
    axes[1, 2].fill_between(iters, 0, train_val_gap, where=(train_val_gap > 0), 
                            color='red', alpha=0.3, label='Overfitting')
    axes[1, 2].legend()
    
    plt.suptitle('Training Diagnostics: Babysitting the Learning Process', 
                fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()


# Run sanity checks
print("=" * 70)
print("Babysitting Tips: Practical Training Checks")
print("=" * 70)

# Check 1: Expected initial loss
expected_loss = sanity_check_loss(10)
print(f"\n1. Expected initial loss (10 classes): {expected_loss:.4f}")
print(f"   (Random softmax: -log(1/10) = -log(0.1) ≈ 2.303)")

# Check 2: Overfit small dataset
print("\n2. Sanity check: Overfitting 50 samples...")
small_net = TwoLayerNet(input_dim=3072, hidden_dim=100, num_classes=10)
X_small = X_train_flat[:50]
y_small = y_train[:50]

losses_overfit, accs_overfit = overfit_small_dataset(small_net, X_small, y_small)
print(f"   Initial accuracy: {accs_overfit[0]:.2%}")
print(f"   Final accuracy:   {accs_overfit[-1]:.2%}")
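if accs_overfit[-1] > 0.95:  # threshold chosen for illustration
    print("   ✓ Can overfit! (Should reach ~100%)")
else:
    print("   ✗ Did not reach ~100%: check gradients and learning rate")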

# Check 3: Plot diagnostics
print("\n3. Training diagnostics (using previous neural network training)...")
plot_training_diagnostics(nn_history)

print("\n" + "=" * 70)
print("CS231n Babysitting Checklist:")
print("=" * 70)
print("\n✓ 1. Data preprocessing: Normalize, augment")
print("✓ 2. Weight initialization: Xavier/He")
print("✓ 3. Sanity checks: Overfit small set, check initial loss")
print("✓ 4. Learning rate: Start with 1e-3, tune carefully")
print("✓ 5. Monitor: Loss curves, train/val gap, gradients")
print("✓ 6. Regularization: Start weak, increase if overfitting")
print("\n💡 Rule of thumb: If loss doesn't decrease, check learning rate!")
print("\n✓ Babysitting complete!")

# Section 10: Modern Architectures and Beyond

CS231n provides the foundation. Modern architectures build on these principles.

## VGG (2014)
- **Key idea**: Stack many small (3×3) convs
- Deeper networks > wider networks
- Simple, uniform architecture
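
The arithmetic behind stacking small filters can be sketched as follows: three stride-1 3×3 convs see the same 7×7 region as one 7×7 conv, with fewer weights and more nonlinearities in between. The helper name and the channel count `C = 64` are illustrative; biases are ignored.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of `num_layers` stride-1 convs with the same kernel size."""
    return 1 + num_layers * (kernel - 1)

C = 64  # channels in and out of every layer (illustrative choice)
params_three_3x3 = 3 * (3 * 3 * C * C)
params_one_7x7 = 7 * 7 * C * C

print(stacked_receptive_field(3))            # 7
print(f"{params_three_3x3:,} vs {params_one_7x7:,}")   # 110,592 vs 200,704
```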

## ResNet (2015) - See Paper #10!
- **Key idea**: Skip connections
- $F(x) = H(x) - x$ (learn residual)
- Enables training 1000+ layer networks
- Solves degradation problem
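
A residual block computes $H(x) = F(x) + x$. The sketch below uses two dense layers as a stand-in for the conv layers of a real ResNet block (an assumption made to keep the example short) and shows that when $F$ outputs zero, the block reduces to the identity followed by ReLU.

```python
import numpy as np

def residual_block(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """out = ReLU(F(x) + x), where F is a small two-layer transformation."""
    hidden = np.maximum(0, x @ W1)
    residual = hidden @ W2          # F(x)
    return np.maximum(0, residual + x)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))

# With zero weights F(x) = 0, so the skip connection passes x straight through
W1 = np.zeros((16, 16))
W2 = np.zeros((16, 16))
out = residual_block(x, W1, W2)
print(np.allclose(out, np.maximum(0, x)))   # True
```

Because the identity path is always available, extra blocks can fall back to doing nothing, which is one intuition for why very deep residual networks remain trainable.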

## Modern Trends (2020s)

### Vision Transformers (ViT)
- Replace convs with self-attention
- Treat image as sequence of patches
- Scales well with large datasets and compute, though it typically needs more training data than CNNs
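
Treating an image as a sequence of patches is a reshape. This sketch covers only that tokenization step, with a hypothetical helper `patchify`; the learned linear projection, position embeddings, and attention layers of a real ViT are omitted.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into a (num_patches, patch*patch*C) sequence."""
    H, W, C = image.shape
    if H % patch or W % patch:
        raise ValueError("image size must be divisible by the patch size")
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * C)

image = np.arange(32 * 32 * 3, dtype=float).reshape(32, 32, 3)
tokens = patchify(image, patch=8)
print(tokens.shape)   # (16, 192)
```

Each row of `tokens` is one flattened 8×8×3 patch, ordered left to right, top to bottom.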

### EfficientNet
- Compound scaling: depth + width + resolution
- Neural architecture search
- State-of-the-art ImageNet accuracy at release (2019) with fewer parameters

### Diffusion Models
- Generative models (DALL-E 2, Stable Diffusion)
- Many still use convolutional U-Net backbones!

## The Big Picture

CS231n teaches **timeless principles**:
1. **Representation learning**: Learn features, not hand-craft
2. **Hierarchical features**: Low-level → high-level
3. **Inductive biases**: CNNs for images, RNNs for sequences
4. **Optimization**: Gradients, backprop, SGD
5. **Regularization**: Prevent overfitting

These apply to **all** of deep learning, not just vision!

---

## Sutskever 30 Connection

CS231n ties together multiple papers:
- **#7**: AlexNet (CNNs for ImageNet)
- **#10**: ResNet (skip connections)
- **#11**: Dilated Convolutions (receptive fields)
- **#13**: Transformers (attention for vision)

**This notebook is your vision foundation!**

In [None]:
# Final summary and comparison

print("="*70)
print("CS231n: Complete Computer Vision Pipeline")
print("="*70)

# Summary table
results_summary = {
    'Method': ['kNN', 'Linear (Softmax)', 'Neural Network (2-layer)', 'CNN (Mini-AlexNet)'],
    'Parameters': ['0 (memorize)', '~31K', '~308K', '~1.07M'],
    'Accuracy': [f"{max(max(accuracies_l1), max(accuracies_l2)):.1%}", 
                f"{test_acc_softmax:.1%}",
                f"{test_acc_nn:.1%}",
                "n/a (untrained)"],
    'Speed': ['Slow (test)', 'Fast', 'Fast', 'Medium'],
    'Key Insight': ['No training', 'One template per class', 'Nonlinear features', 'Spatial structure']
}

print("\nModel Comparison:")
print("-"*70)
for i in range(len(results_summary['Method'])):
    print(f"{results_summary['Method'][i]:25s} | "
          f"Params: {results_summary['Parameters'][i]:10s} | "
          f"Acc: {results_summary['Accuracy'][i]:10s}")
    print(f"{'':27s}   {results_summary['Key Insight'][i]}")
    print("-"*70)

print("\n" + "="*70)
print("Key Takeaways from CS231n")
print("="*70)

takeaways = [
    "1. IMAGE CLASSIFICATION PIPELINE",
    "   • Data → Model → Loss → Optimization → Prediction",
    "   • Each component matters!",
    "",
    "2. MODEL EVOLUTION",
    "   • kNN → Linear → NN → CNN → ResNet → Transformers",
    "   • Each step adds capacity and inductive bias",
    "",
    "3. CONVOLUTIONAL NETWORKS",
    "   • Conv layers: Local connectivity, parameter sharing",
    "   • Pooling: Downsampling, invariance",
    "   • Hierarchy: Edges → textures → parts → objects",
    "",
    "4. TRAINING TECHNIQUES",
    "   • SGD with momentum, learning rate schedules",
    "   • Xavier/He initialization",
    "   • Regularization: L2, dropout, data augmentation",
    "",
    "5. BABYSITTING NEURAL NETS",
    "   • Sanity checks: overfit small set, check initial loss",
    "   • Monitor: loss curves, train/val gap, gradients",
    "   • Hyperparameter tuning: learning rate is most important!",
    "",
    "6. VISUALIZATION",
    "   • Understand what network learns",
    "   • Filters, activations, saliency maps",
    "   • Debugging tool and interpretability",
    "",
    "7. TRANSFER LEARNING",
    "   • Pre-train on ImageNet, fine-tune on target task",
    "   • Early features are universal",
    "   • Enables learning from small datasets",
]

for line in takeaways:
    print(line)

print("\n" + "="*70)
print("Beyond CS231n: Modern Vision")
print("="*70)
print("\n• ResNet (2015): Skip connections → 1000+ layers")
print("• DenseNet (2016): Dense connections")
print("• EfficientNet (2019): NAS + compound scaling")
print("• Vision Transformers (2020): Attention for vision")
print("• ConvNeXt (2022): Modernized CNNs")
print("• Diffusion Models (2022): DALL-E 2, Stable Diffusion")
print("\n→ All build on CS231n foundations!")

print("\n" + "="*70)
print("🎓 CS231n: Complete! You've learned vision from first principles.")
print("="*70)
print("\nWhat you can do now:")
print("  ✓ Understand how CNNs work (from scratch!)")
print("  ✓ Train vision models (optimization, regularization)")
print("  ✓ Debug neural networks (babysitting tips)")
print("  ✓ Read modern papers (you have the foundation!)")
print("\nNext steps:")
print("  → Implement in PyTorch for real datasets")
print("  → Study ResNet (Paper #10 in this repo!)")
print("  → Explore transformers (Paper #13)")
print("  → Build your own vision systems!")
print("\n✨ Welcome to computer vision! ✨")