Week 7 · Day 2 — From MLP to CNN
Why this matters

MLPs can classify images, but they flatten each image into a vector and so ignore its spatial structure. CNNs exploit locality (edges, textures) and scale much better to larger images, which is why they dominate vision tasks.

Theory Essentials

MLP (Multi-Layer Perceptron): fully connected layers, no spatial awareness.

CNN (Convolutional Neural Network): uses local filters and pooling → fewer params, better generalization.

Parameter efficiency: a CNN shares the same filter weights across every image position; an MLP learns a separate weight for every pixel-to-unit connection.

Overfitting risk: an MLP's large parameter count makes it easier to overfit.

Evaluation metric: accuracy on train vs validation.
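
To make the weight-sharing point concrete, the sketch below counts the parameters of a single first layer as the image grows. The helper names (`linear_params`, `conv2d_params`) are illustrative, and the 256-unit and 16-filter sizes are borrowed from the models defined later in these notes. Only standard Python is used.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

for side in (28, 64, 224):
    fc = linear_params(side * side, 256)   # one weight per pixel per hidden unit
    conv = conv2d_params(1, 16, 3)         # the same 3x3 filters are reused at every position
    print(f"{side}x{side}: linear={fc:>10,}  conv={conv}")
```

The convolution stays at 160 parameters whatever the image size, while the linear layer grows with the pixel count (200,960 parameters at 28×28).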

In [1]:
# Setup
import torch, torch.nn as nn, torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split

torch.manual_seed(42)

# ---------- Data ----------
transform = transforms.Compose([transforms.ToTensor()])
dataset = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)

train_data, val_data = random_split(dataset, [50000, 10000])
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)
val_loader   = DataLoader(val_data, batch_size=256)

# ---------- Models ----------
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(28*28, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )
    def forward(self,x):
        return self.layers(self.flatten(x))

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(32*7*7, 10)
    def forward(self,x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)

# ---------- Training Loop ----------
def train(model, epochs=3):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(epochs):
        model.train()
        for X,y in train_loader:
            X,y = X.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(X), y)
            loss.backward()
            optimizer.step()
    # Validation: run once, after the final epoch, with gradients disabled
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X,y in val_loader:
            X,y = X.to(device), y.to(device)
            preds = model(X).argmax(dim=1)
            correct += (preds==y).sum().item()
            total += y.size(0)
    return correct/total

mlp_acc = train(MLP())
cnn_acc = train(SimpleCNN())

print(f"MLP val accuracy: {mlp_acc:.3f}")
print(f"CNN val accuracy: {cnn_acc:.3f}")



MLP val accuracy: 0.873
CNN val accuracy: 0.875


1) Core (10–15 min)

Task: Compare parameter counts of MLP vs CNN.

In [2]:
print("MLP params:", sum(p.numel() for p in MLP().parameters()))
print("CNN params:", sum(p.numel() for p in SimpleCNN().parameters()))


MLP params: 235146
CNN params: 20490


The MLP uses far more parameters (235,146 vs 20,490, roughly 11×), yet its validation accuracy is no higher (0.873 vs 0.875).
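
As a cross-check on the printed totals, the counts can be reproduced by hand, layer by layer. This is a minimal sketch in plain Python; the helper names are hypothetical and the layer shapes are copied from the `MLP` and `SimpleCNN` definitions above.

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out            # weight matrix + bias vector

def conv2d_params(c_in, c_out, k):
    return (k * k * c_in + 1) * c_out      # one k x k x c_in filter + bias per output channel

mlp_layers = {
    "Linear(784, 256)": linear_params(28 * 28, 256),
    "Linear(256, 128)": linear_params(256, 128),
    "Linear(128, 10)":  linear_params(128, 10),
}
cnn_layers = {
    "Conv2d(1, 16, 3)":  conv2d_params(1, 16, 3),
    "Conv2d(16, 32, 3)": conv2d_params(16, 32, 3),
    "Linear(1568, 10)":  linear_params(32 * 7 * 7, 10),
}

for name, layers in (("MLP", mlp_layers), ("CNN", cnn_layers)):
    for layer, n in layers.items():
        print(f"{name:3s} {layer:18s} {n:>8,}")
    print(f"{name:3s} {'total':18s} {sum(layers.values()):>8,}")

mlp_total = sum(mlp_layers.values())
cnn_total = sum(cnn_layers.values())
```

The totals match the notebook output (235,146 and 20,490). Most of the MLP's parameters sit in its first layer (200,960), while the two conv layers together hold only 4,800; the rest of the CNN's parameters are in its final linear layer.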

2) Practice (10–15 min)

Task: Increase hidden size of MLP to 512. Does accuracy improve?

In [3]:
class BiggerMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )
    def forward(self,x):
        return self.layers(self.flatten(x))

acc_big = train(BiggerMLP())
print("Bigger MLP acc:", acc_big)


Bigger MLP acc: 0.8695


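Accuracy does not improve: 0.8695 vs 0.873 for the original MLP. The gap is small enough that it may be run-to-run noise, so the safest reading is that the extra width did not help after 3 epochs.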
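
One rough way to judge a gap this small is the binomial standard error of an accuracy measured on 10,000 validation images. The sketch below is only a yardstick under simplifying assumptions: it treats each prediction as an independent coin flip, and it ignores both seed-to-seed variation and the fact that the two models share one validation split. The function name is hypothetical.

```python
import math

def accuracy_std_error(acc, n_samples):
    """Binomial standard error of an accuracy measured on n_samples examples."""
    return math.sqrt(acc * (1.0 - acc) / n_samples)

n_val = 10_000                       # validation split size used above
acc_small, acc_big = 0.873, 0.8695   # values printed by the runs above

se = accuracy_std_error(acc_small, n_val)
gap = acc_small - acc_big
print(f"standard error ~ {se:.4f}, gap = {gap:.4f}, gap / SE = {gap / se:.2f}")
```

The standard error comes out at about 0.003, so the 0.0035 gap is roughly one standard error, which is too small to call from a single run.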

3) Stretch (optional, 10–15 min)

Task: Modify CNN to use three conv layers instead of two.

In [4]:
class DeeperCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64*3*3, 10)
    def forward(self,x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)

deep_acc = train(DeeperCNN())
print("Deeper CNN acc:", deep_acc)


Deeper CNN acc: 0.8694


Accuracy drops slightly: 0.8694 vs 0.875 for the two-layer CNN, so the third conv layer did not help after 3 epochs.
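
The `64*3*3` input size of the final linear layer follows from how each block changes the feature-map size. A minimal sketch of that arithmetic, assuming the settings used in `DeeperCNN` (3×3 convs with `padding=1`, 2×2 max pooling with stride equal to the window); the helper names are illustrative:

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output side length of a convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel):
    """Output side length of max pooling with stride == kernel and no padding."""
    return (size - kernel) // kernel + 1

side = 28
channels = [16, 32, 64]
for c in channels:
    side = conv_out(side, kernel=3, padding=1)  # 3x3 conv with padding=1 keeps the size
    side = pool_out(side, kernel=2)             # 2x2 pooling halves it, rounding down
    print(f"after block with {c} channels: {c} x {side} x {side}")

flat_features = channels[-1] * side * side
```

The side length goes 28 → 14 → 7 → 3 (the last pooling step rounds 3.5 down), giving 64 × 3 × 3 = 576 flattened features.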

Mini-Challenge (≤40 min)

Task:

Train MLP and CNN both for 5 epochs.

Record: param count, training time, val accuracy.

Make a comparison table (MLP vs CNN).

Write 3–4 lines: why does the CNN win despite having fewer parameters?

Acceptance Criteria:

Table includes params, time, acc.

Note mentions weight sharing and spatial locality.

In [5]:
# Setup
import time
import numpy as np, pandas as pd, matplotlib.pyplot as plt
np.random.seed(42)
plt.rcParams["figure.figsize"] = (6,4); plt.rcParams["axes.grid"] = True

import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

torch.manual_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ---------- Data ----------
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # helps both models
])
full = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)
train_data, val_data = random_split(full, [50_000, 10_000])
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_data,   batch_size=256, shuffle=False, num_workers=2, pin_memory=True)

# ---------- Models ----------
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 256), nn.ReLU(),
            nn.Linear(256, 128),   nn.ReLU(),
            nn.Linear(128, 10)
        )
    def forward(self,x): return self.net(x)

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 28->14
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 14->7
        )
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(32*7*7, 10))
    def forward(self,x): return self.classifier(self.features(x))

def count_params(m): return sum(p.numel() for p in m.parameters() if p.requires_grad)

# ---------- Train/Eval ----------
def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X,y in loader:
            X,y = X.to(device, non_blocking=True), y.to(device, non_blocking=True)
            pred = model(X).argmax(1)
            correct += (pred==y).sum().item()
            total   += y.size(0)
    return correct/total

def train_5_epochs(model):
    model = model.to(device)
    opt = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    crit = nn.CrossEntropyLoss()
    start = time.time()
    for ep in range(5):
        model.train()
        for X,y in train_loader:
            X,y = X.to(device, non_blocking=True), y.to(device, non_blocking=True)
            opt.zero_grad()
            loss = crit(model(X), y)
            loss.backward()
            opt.step()
    elapsed = time.time() - start
    acc = evaluate(model, val_loader)
    return acc, elapsed

# ---------- Run ----------
mlp      = MLP()
cnn      = SimpleCNN()
mlp_acc, mlp_time = train_5_epochs(mlp)
cnn_acc, cnn_time = train_5_epochs(cnn)

# ---------- Table ----------
rows = [
    ["MLP", count_params(mlp), f"{mlp_time:6.1f}s", f"{mlp_acc:.3f}"],
    ["CNN", count_params(cnn), f"{cnn_time:6.1f}s", f"{cnn_acc:.3f}"],
]
df = pd.DataFrame(rows, columns=["Model","Trainable Params","Training Time (5 ep)","Val Acc"])
print(df.to_string(index=False))




Model  Trainable Params Training Time (5 ep) Val Acc
  MLP            235146                92.0s   0.876
  CNN             20490               118.1s   0.900


Notes / Key Takeaways

MLPs flatten images → lose spatial info.

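CNNs exploit locality → fewer parameters and better accuracy (0.900 vs 0.876 after 5 epochs here), though training took longer (118.1 s vs 92.0 s).

Parameter efficiency tends to mean less overfitting risk: fewer weights leave less capacity to memorize the training set.

Pooling reduces spatial dimensions (28 → 14 → 7 here) while keeping the strongest activation in each window.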

CNNs are the default for vision tasks.

Reflection

Why does a CNN have fewer parameters than an MLP for the same input?

If MLPs can achieve high training accuracy, why do they generalize worse?

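The CNN outperforms the MLP because it uses convolutions and pooling to detect local image features (edges, textures) wherever they appear. Because the same filters are applied at every position, a shifted feature produces a correspondingly shifted response, and pooling adds some tolerance to small shifts. This weight sharing is also why the CNN needs far fewer parameters than the MLP, which learns a separate weight for every pixel and has no built-in notion of which pixels are neighbors. The MLP can fit the training data but does not generalize as well to unseen validation data.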
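
The position argument can be illustrated in one dimension with plain Python. This is a toy sketch with a hand-picked edge filter and signal, not the trained network: the same filter slides over a signal and over a copy shifted by one step, and the response moves with the pattern. The function names are hypothetical.

```python
def correlate1d(signal, kernel):
    """Valid-mode sliding dot product (what deep-learning libraries call convolution)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(values, window=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + window]) for i in range(0, len(values) - window + 1, window)]

edge_filter = [-1, 1]                      # responds to a step up in intensity
signal  = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]   # same pattern, moved one step right

out = correlate1d(signal, edge_filter)
out_shifted = correlate1d(shifted, edge_filter)

print(out)                                         # response to the original signal
print(out_shifted)                                 # same response, one step later
print(max_pool1d(out), max_pool1d(out_shifted))    # pooled responses
```

The filter output shifts by exactly one step along with the input. In this particular example the pooled outputs also match, because the shift stays inside a pooling window; a shift that crosses a window boundary would change them, so the tolerance pooling gives is only approximate.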