Week 10 · Day 2 — Regularization & Stability in RNNs
Why this matters

RNNs easily overfit and suffer from unstable gradients. Regularization keeps models generalizable, while tricks like gradient clipping and AMP (mixed precision) make training faster and safer.

Theory Essentials

Dropout: randomly zeroes units during training; for RNNs, variational dropout samples one mask per sequence and reuses it at every time step instead of drawing a new mask per step.

Weight decay (L2): penalizes large weights, preventing overfitting.

Gradient clipping: caps exploding gradients common in RNNs.

Label smoothing: softens target labels (e.g., 0.9 / 0.1 instead of 1 / 0).

AMP (automatic mixed precision): uses float16 where safe → faster training.
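To make the variational dropout idea concrete, here is a minimal sketch. `LockedDropout` is a hypothetical helper name, not a PyTorch built-in, and the sketch assumes inputs shaped (batch, seq_len, features).

```python
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Variational-style dropout: one mask per sequence, shared by all time steps."""
    def __init__(self, p=0.3):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (batch, seq_len, features)
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        # Sample the mask once per sequence, with a singleton time axis
        mask = torch.bernoulli(
            torch.full((x.size(0), 1, x.size(2)), keep, device=x.device, dtype=x.dtype)
        )
        # Broadcast over time and rescale so the expected activation is unchanged
        return x * mask / keep

torch.manual_seed(0)
drop = LockedDropout(p=0.5)
x = torch.ones(2, 4, 6)
out = drop(x)

# Each sequence has the same features zeroed at every time step
same_mask_over_time = bool((out == out[:, :1, :]).all())

drop.eval()
eval_is_identity = torch.equal(drop(x), x)
print(same_mask_over_time, eval_is_identity)
```

Such a layer could be applied to the embeddings before packing, in the same place a plain nn.Dropout would go.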

In [5]:
# Setup
import numpy as np, torch, torch.nn as nn, torch.optim as optim
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(42)

# BiLSTM with dropout
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=200, embed_dim=32, hidden_dim=64, num_classes=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
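        # Note: nn.LSTM applies `dropout` only between stacked layers, so with the
        # default num_layers=1 this argument is inactive (PyTorch emits a warning).
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True, dropout=dropout)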
        self.fc = nn.Linear(hidden_dim*2, num_classes)
    def forward(self, x, lengths):
        embeds = self.embedding(x)
        packed = nn.utils.rnn.pack_padded_sequence(
            embeds, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        h_cat = torch.cat((h_n[-2], h_n[-1]), dim=1)
        return self.fc(h_cat)

# Fake dataset (toy example: 6 token slots per sequence; `lengths` says how many are used)
batch_size, seq_len, vocab_size = 4, 6, 200
X = torch.randint(1, vocab_size, (batch_size, seq_len))
lengths = torch.tensor([6, 5, 4, 3])
y = torch.tensor([0, 1, 0, 1])

# Model, loss, optimizer
model = BiLSTMClassifier(vocab_size=vocab_size)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # weight decay

# Training loop with gradient clipping
for epoch in range(15):
    optimizer.zero_grad()
    logits = model(X, lengths)
    loss = criterion(logits, y)
    loss.backward()
    clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    preds = logits.argmax(1)
    acc = (preds == y).float().mean().item()
    print(f"Epoch {epoch+1} | Loss={loss.item():.4f} | Acc={acc:.2f}")


Epoch 1 | Loss=0.6669 | Acc=1.00
Epoch 2 | Loss=0.6427 | Acc=1.00
Epoch 3 | Loss=0.6191 | Acc=1.00
Epoch 4 | Loss=0.5959 | Acc=1.00
Epoch 5 | Loss=0.5730 | Acc=1.00
Epoch 6 | Loss=0.5503 | Acc=1.00
Epoch 7 | Loss=0.5277 | Acc=1.00
Epoch 8 | Loss=0.5053 | Acc=1.00
Epoch 9 | Loss=0.4831 | Acc=1.00
Epoch 10 | Loss=0.4609 | Acc=1.00
Epoch 11 | Loss=0.4390 | Acc=1.00
Epoch 12 | Loss=0.4173 | Acc=1.00
Epoch 13 | Loss=0.3960 | Acc=1.00
Epoch 14 | Loss=0.3752 | Acc=1.00
Epoch 15 | Loss=0.3551 | Acc=1.00


1) Core (10–15 min)

Task: Set dropout=0.0 in the model. Train again. Compare loss & accuracy.

In [6]:
model = BiLSTMClassifier(vocab_size=vocab_size, dropout=0.0)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # weight decay

for epoch in range(15):
    optimizer.zero_grad()
    logits = model(X, lengths)
    loss = criterion(logits, y)
    loss.backward()
    clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    preds = logits.argmax(1)
    acc = (preds == y).float().mean().item()
    print(f"Epoch {epoch+1} | Loss={loss.item():.4f} | Acc={acc:.2f}")

Epoch 1 | Loss=0.6999 | Acc=0.50
Epoch 2 | Loss=0.6709 | Acc=1.00
Epoch 3 | Loss=0.6432 | Acc=1.00
Epoch 4 | Loss=0.6167 | Acc=1.00
Epoch 5 | Loss=0.5911 | Acc=1.00
Epoch 6 | Loss=0.5664 | Acc=1.00
Epoch 7 | Loss=0.5424 | Acc=1.00
Epoch 8 | Loss=0.5190 | Acc=1.00
Epoch 9 | Loss=0.4960 | Acc=1.00
Epoch 10 | Loss=0.4735 | Acc=1.00
Epoch 11 | Loss=0.4514 | Acc=1.00
Epoch 12 | Loss=0.4298 | Acc=1.00
Epoch 13 | Loss=0.4086 | Acc=1.00
Epoch 14 | Loss=0.3879 | Acc=1.00
Epoch 15 | Loss=0.3679 | Acc=1.00


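No dropout ends with a slightly higher final loss (0.3679 vs 0.3551). Note that nn.LSTM only applies its dropout argument between stacked layers, so with a single layer it is inactive in both runs; the gap most likely comes from different random initialization, since the seed is not reset between cells.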
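The following sketch checks when nn.LSTM's dropout argument actually does something. It compares two training-mode forward passes on the same input; the tensor sizes and the random stand-in for embeddings are assumptions for illustration only.

```python
import warnings
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6, 32)  # (batch, seq_len, embed_dim), random stand-in for embeddings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning about num_layers=1
    single = nn.LSTM(32, 64, batch_first=True, bidirectional=True, dropout=0.5)
stacked = nn.LSTM(32, 64, num_layers=2, batch_first=True,
                  bidirectional=True, dropout=0.5)

single.train()
stacked.train()

# One layer: two training-mode passes match exactly, so no dropout is applied
single_identical = torch.equal(single(x)[0], single(x)[0])
# Two layers: dropout between the layers makes the passes differ
stacked_identical = torch.equal(stacked(x)[0], stacked(x)[0])
print(single_identical, stacked_identical)
```

To get dropout in a single-layer model, one option is a separate nn.Dropout on the embeddings or on the final hidden state, as the mini-challenge model further down does.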

2) Practice (10–15 min)

Task: Change clip_grad_norm_ max_norm from 1.0 → 0.1. Observe how loss changes.

In [7]:
model = BiLSTMClassifier(vocab_size=vocab_size)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # weight decay

# Training loop with gradient clipping
for epoch in range(15):
    optimizer.zero_grad()
    logits = model(X, lengths)
    loss = criterion(logits, y)
    loss.backward()
    clip_grad_norm_(model.parameters(), max_norm=0.1)  # gradient clipping
    optimizer.step()
    preds = logits.argmax(1)
    acc = (preds == y).float().mean().item()
    print(f"Epoch {epoch+1} | Loss={loss.item():.4f} | Acc={acc:.2f}")

Epoch 1 | Loss=0.6665 | Acc=1.00
Epoch 2 | Loss=0.6442 | Acc=1.00
Epoch 3 | Loss=0.6223 | Acc=1.00
Epoch 4 | Loss=0.6009 | Acc=1.00
Epoch 5 | Loss=0.5798 | Acc=1.00
Epoch 6 | Loss=0.5588 | Acc=1.00
Epoch 7 | Loss=0.5380 | Acc=1.00
Epoch 8 | Loss=0.5172 | Acc=1.00
Epoch 9 | Loss=0.4964 | Acc=1.00
Epoch 10 | Loss=0.4756 | Acc=1.00
Epoch 11 | Loss=0.4548 | Acc=1.00
Epoch 12 | Loss=0.4341 | Acc=1.00
Epoch 13 | Loss=0.4135 | Acc=1.00
Epoch 14 | Loss=0.3930 | Acc=1.00
Epoch 15 | Loss=0.3729 | Acc=1.00




3) Stretch (optional, 10–15 min)

Task: If you have a GPU, enable AMP for faster training.

I have no GPU, so I skipped this task.
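For reference, here is a hedged sketch of what an AMP training loop could look like. It was not run on a GPU for these notes. The small feed-forward model and random tensors are hypothetical stand-ins for BiLSTMClassifier and the toy batch, and on a CPU-only machine the sketch simply disables AMP and runs in float32.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # AMP is switched off on CPU in this sketch

torch.manual_seed(0)
# Hypothetical stand-in model and data; swap in BiLSTMClassifier, X, lengths, y
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)).to(device)
inputs = torch.randn(8, 16, device=device)
targets = torch.randint(0, 2, (8,), device=device)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Loss scaling guards float16 gradients against underflow
# (newer PyTorch releases also expose this class as torch.amp.GradScaler)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

losses = []
for epoch in range(5):
    optimizer.zero_grad()
    # Forward pass and loss run in mixed precision when AMP is enabled
    with torch.autocast(device_type=device.type, enabled=use_amp):
        logits = model(inputs)
        loss = criterion(logits, targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale first so clipping sees the true gradient norm
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    losses.append(loss.item())

print(losses)
```

The one ordering detail worth remembering is `scaler.unscale_(optimizer)` before clipping; otherwise the threshold would be compared against scaled gradients.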

Mini-Challenge (≤40 min)

Task: Train two models on the same dataset:

Baseline: no dropout, no weight decay, no clipping.

Regularized: dropout=0.3, weight_decay=1e-4, clip_grad_norm=1.0.

Acceptance Criteria:

Print validation accuracy or F1 for both.

Show an ablation table: baseline vs regularized.

In [9]:
from collections import Counter
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
import torch

# --- 1) Tiny toy dataset (POS=1, NEG=0) ---
data = [
    ("i love this movie so much", 1),
    ("absolutely fantastic acting and soundtrack", 1),
    ("what a great experience highly recommend", 1),
    ("this was surprisingly good and fun", 1),
    ("i really enjoyed the story and characters", 1),
    ("boring plot and weak performances", 0),
    ("i hated every minute of it", 0),
    ("terrible script worst film ever", 0),
    ("not good the ending was awful", 0),
    ("mediocre at best would not recommend", 0),
    ("wonderful visuals and touching moments", 1),
    ("bad pacing and confusing scenes", 0),
]

# Split roughly 80/20 (12 examples -> 9 train, 3 held out)
import numpy as np
rng = np.random.RandomState(0)
perm = rng.permutation(len(data))
data = [data[i] for i in perm]
split = int(0.8 * len(data))
train_data, test_data = data[:split], data[split:]

# --- 2) Vocab ---
PAD, UNK = "<PAD>", "<UNK>"

def tokenize(s): return s.lower().split()

counter = Counter()
for text, _ in train_data:
    counter.update(tokenize(text))

itos = [PAD, UNK] + [w for w,_ in counter.items()]
stoi = {w:i for i,w in enumerate(itos)}

def numericalize(text):
    return torch.tensor([stoi.get(tok, stoi[UNK]) for tok in tokenize(text)], dtype=torch.long)

# --- 3) Dataset + collate ---
class SentDataset(Dataset):
    def __init__(self, pairs):
        self.x = [numericalize(t) for t,_ in pairs]
        self.y = [torch.tensor(lbl) for _,lbl in pairs]
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i], self.y[i]

def collate_fn(batch):
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(seqs, batch_first=True, padding_value=stoi[PAD])
    labels = torch.stack(labels)
    return padded, lengths, labels

train_ds, test_ds = SentDataset(train_data), SentDataset(test_data)
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True, collate_fn=collate_fn)
test_loader  = DataLoader(test_ds, batch_size=4, shuffle=False, collate_fn=collate_fn)


In [20]:
# Setup
import torch, torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence
from sklearn.metrics import f1_score
import numpy as np

# --- Reuse data pipeline from earlier (assume train_loader/test_loader exist) ---
# If not, load the dataset code from the previous mini-challenge cell before running this.

# ----------------------------
# 1) BiLSTM with configurable dropout
# ----------------------------
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=32, num_classes=2, pad_idx=0, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, num_classes)

    def forward(self, padded, lengths):
        emb = self.dropout(self.embedding(padded))
        packed = pack_padded_sequence(emb, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        h_cat = torch.cat((h_n[-2], h_n[-1]), dim=1)
        return self.fc(h_cat)

# ----------------------------
# 2) Train + evaluate function
# ----------------------------
def train_and_eval(dropout=0.0, weight_decay=0.0, clip=None, epochs=10):
    model = BiLSTMClassifier(vocab_size=len(itos), embed_dim=32, hidden_dim=32,
                             num_classes=2, pad_idx=stoi["<PAD>"], dropout=dropout)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=weight_decay)

    for ep in range(epochs):
        model.train()
        for padded, lengths, labels in train_loader:
            logits = model(padded, lengths)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            if clip:
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)
            optimizer.step()

    # Eval
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for padded, lengths, labels in test_loader:
            logits = model(padded, lengths)
            preds = logits.argmax(1)
            y_true.extend(labels.tolist())
            y_pred.extend(preds.tolist())
    acc = np.mean(np.array(y_true) == np.array(y_pred))
    f1 = f1_score(y_true, y_pred)
    return acc, f1

# ----------------------------
# 3) Run baseline vs regularized
# ----------------------------
baseline_acc, baseline_f1 = train_and_eval(dropout=0.0, weight_decay=0.0, clip=None)
reg_acc, reg_f1 = train_and_eval(dropout=0.3, weight_decay=1e-4, clip=1.0)

# ----------------------------
# 4) Show ablation table
# ----------------------------
import pandas as pd
table = pd.DataFrame({
    "Model": ["Baseline", "Regularized"],
    "Val Acc": [baseline_acc, reg_acc],
    "Val F1": [baseline_f1, reg_f1]
})
print(table)


         Model   Val Acc  Val F1
0     Baseline  0.333333     0.0
1  Regularized  0.666667     0.8


The dataset is too small (9 training and 3 validation examples), so results still vary a lot from run to run.
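One way to get a steadier comparison from such a small dataset is to repeat each configuration over several seeds and report the mean and spread. This is a sketch only: `summarize_over_seeds` is a hypothetical helper, and `fake_run` is a random stand-in for `train_and_eval`, so the numbers it prints are meaningless.

```python
import statistics
import torch

def summarize_over_seeds(run_fn, seeds=(0, 1, 2, 3, 4)):
    """Call run_fn() once per seed; return (mean, stdev) for each metric it returns."""
    results = []
    for seed in seeds:
        torch.manual_seed(seed)  # reseed PyTorch's global RNG before each run
        results.append(run_fn())
    columns = list(zip(*results))  # one tuple per metric
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

# Hypothetical stand-in for train_and_eval: returns a random (acc, f1) pair
def fake_run():
    acc, f1 = torch.rand(2).tolist()
    return acc, f1

(acc_mean, acc_std), (f1_mean, f1_std) = summarize_over_seeds(fake_run)
print(f"acc={acc_mean:.2f} +/- {acc_std:.2f} | f1={f1_mean:.2f} +/- {f1_std:.2f}")
```

In the notebook, the call might look like `summarize_over_seeds(lambda: train_and_eval(dropout=0.3, weight_decay=1e-4, clip=1.0))`, run once per configuration.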

Notes / Key Takeaways

Dropout in an RNN model can improve generalization, but nn.LSTM's dropout argument only acts between stacked layers (num_layers > 1).

Weight decay adds an L2 penalty to weights.

Gradient clipping prevents exploding gradients.

Label smoothing avoids overconfidence.

AMP speeds training on GPUs.
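The label-smoothing takeaway can be checked numerically. The sketch below builds the smoothed targets by hand and compares the result with `nn.CrossEntropyLoss(label_smoothing=...)`; the logits are made-up values. With two classes and a smoothing factor of 0.1, the targets come out as roughly 0.95 / 0.05, because the smoothing mass is spread over all classes, including the correct one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # made-up logits for two samples
y = torch.tensor([0, 1])
eps, num_classes = 0.1, 2

# Smoothed targets: (1 - eps) * one_hot + eps / num_classes
smooth = F.one_hot(y, num_classes).float() * (1 - eps) + eps / num_classes

# Cross-entropy against the soft targets, averaged over the batch
manual = -(smooth * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
builtin = nn.CrossEntropyLoss(label_smoothing=eps)(logits, y)

print(smooth[0].tolist(), torch.allclose(manual, builtin))
```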

Reflection

Why does gradient clipping matter more in RNNs than in CNNs?

How do dropout and weight decay differ in how they regularize?

1. Why does gradient clipping matter more in RNNs than in CNNs?

RNNs process sequences step by step.

During backpropagation, gradients are multiplied through many time steps → they can explode (grow uncontrollably) or vanish (shrink to near zero).

Exploding gradients are especially common in RNNs → training becomes unstable.

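Gradient clipping caps the gradient norm (e.g., max_norm=1.0) by rescaling the gradient whenever it exceeds the threshold, which prevents a single huge update from destabilizing training.

In CNNs, gradients pass through a fixed number of layers with different weights, rather than through the same recurrent weight matrix at every time step, so exploding gradients are typically much less severe.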
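A small sketch of what `clip_grad_norm_` does to the gradients. The linear layer and the deliberately inflated loss are arbitrary choices, used only to force a large gradient.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)
layer = nn.Linear(8, 4)

# Deliberately large loss so the gradient norm far exceeds the threshold
loss = (layer(torch.randn(16, 8)) * 100).pow(2).mean()
loss.backward()

def total_grad_norm(params):
    # L2 norm over all parameter gradients, treated as one long vector
    return torch.sqrt(sum(p.grad.pow(2).sum() for p in params))

# clip_grad_norm_ rescales the gradients in place and returns the pre-clipping norm
norm_before = clip_grad_norm_(layer.parameters(), max_norm=1.0)
norm_after = total_grad_norm(layer.parameters())
print(f"before={norm_before.item():.1f} | after={norm_after.item():.4f}")
```

All gradients are scaled by the same factor, so the update direction is kept and only its length shrinks. Logging the returned pre-clipping norm during training is a cheap way to see how often clipping is actually triggered.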

2. How do dropout and weight decay differ in how they regularize?

Dropout:

Randomly “drops” neurons during training (sets their output to 0).

Forces the network to not rely on any single neuron.

Acts like training many smaller networks and averaging them.

Reduces co-adaptation of features → better generalization.
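The dropout behaviour described above can be observed directly on a vector of ones. This is a sketch, with the vector size chosen only to make the drop rate easy to estimate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.3)
x = torch.ones(10_000)

drop.train()
train_out = drop(x)
zero_fraction = (train_out == 0).float().mean().item()  # close to p
survivor_value = train_out.max().item()                 # survivors are scaled by 1 / (1 - p)

drop.eval()
eval_out = drop(x)  # dropout is the identity at inference time

print(f"zeroed={zero_fraction:.3f} | survivor={survivor_value:.4f}")
```

The rescaling of the surviving units keeps the expected activation the same in training and evaluation, which is why calling `model.eval()` before validation matters.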

Weight Decay (L2 regularization):

Adds a penalty to large weights (loss += λ * ||weights||²).

Encourages the network to keep weights small and simple.

Prevents overfitting by controlling model complexity.
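To connect the penalty formula to the optimizer argument, the sketch below compares one plain SGD step with `weight_decay=lam` against one step on a loss with an explicit penalty. The toy objective and constants are arbitrary. In PyTorch's SGD, `weight_decay=lam` adds `lam * w` to the gradient, which matches a penalty of `(lam / 2) * ||w||²`.

```python
import torch

lam, lr = 0.1, 0.5
w0 = torch.tensor([1.0, -2.0, 3.0])

def data_loss(w):
    return (w - 1.0).pow(2).sum()  # arbitrary toy objective

# (a) Weight decay handled by the optimizer
w_a = w0.clone().requires_grad_(True)
opt_a = torch.optim.SGD([w_a], lr=lr, weight_decay=lam)
data_loss(w_a).backward()
opt_a.step()

# (b) Explicit L2 penalty (lam / 2) * ||w||^2 added to the loss
w_b = w0.clone().requires_grad_(True)
opt_b = torch.optim.SGD([w_b], lr=lr)
(data_loss(w_b) + 0.5 * lam * w_b.pow(2).sum()).backward()
opt_b.step()

print(torch.allclose(w_a, w_b))
```

With Adam, as used in these notes, the added `lam * w` term passes through the adaptive scaling of the gradient, so the effect is not exactly the same as for SGD. `torch.optim.AdamW` applies the decay separately from the gradient update.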