[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-05/exercise-01.ipynb)

# 🧪 IMDB Sentiment Classification: Vanilla RNN

**Goal:** Build and train a vanilla RNN for sentiment classification to experience how RNNs process sequences, compress information into hidden states, and reveal their fundamental limitations.

## 1️⃣ Install + Imports

In [15]:
%pip install datasets -q

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from datasets import load_dataset
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence
from collections import Counter
import time
import re

def basic_english_tokenizer(text):
    """Simple tokenizer that splits on whitespace and converts to lowercase."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return text.split()
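
To see what this tokenizer does to a review, here is a quick check on a made-up sentence (the sample string is an illustration, not taken from IMDB). The function is repeated so the snippet runs on its own.

```python
import re

def basic_english_tokenizer(text):
    """Lowercase, replace non-alphanumeric characters with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return text.split()

sample = "This movie wasn't good... it was GREAT!"  # made-up example text
tokens = basic_english_tokenizer(sample)
print(tokens)
# ['this', 'movie', 'wasn', 't', 'good', 'it', 'was', 'great']
```

Because punctuation becomes a space, contractions such as "wasn't" split into two tokens. IMDB reviews also contain HTML line breaks (`<br />`), which this tokenizer turns into a `br` token.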

## 2️⃣ Load IMDB Dataset

In [16]:
# Load IMDB dataset from Hugging Face datasets
dataset = load_dataset("imdb")

train_data = dataset['train']
test_data = dataset['test']

print("Train samples:", len(train_data))
print("Test samples:", len(test_data))
print("\nSample review:")
print("Text:", train_data[0]['text'][:100] + "...")
print("Label:", train_data[0]['label'], "(0=neg, 1=pos)")

Train samples: 25000
Test samples: 25000

Sample review:
Text: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it w...
Label: 0 (0=neg, 1=pos)


IMDB contains:

- 25,000 training reviews
- 25,000 test reviews
- Binary sentiment: pos / neg

## 3️⃣ Tokenization

In [17]:
# Tokenizer is defined above in imports
tokenizer = basic_english_tokenizer

## 4️⃣ Build Vocabulary

We keep only the most frequent tokens (20,000 entries, including `<pad>` and `<unk>`) to keep training manageable. Every other token maps to `<unk>`.

In [18]:
counter = Counter()

for example in train_data:
    text = example['text']
    tokens = tokenizer(text)
    counter.update(tokens)

vocab_size = 20000
most_common = counter.most_common(vocab_size - 2)

vocab = {word: idx+2 for idx, (word, _) in enumerate(most_common)}
vocab["<pad>"] = 0
vocab["<unk>"] = 1

## 5️⃣ Numericalize Data

In [19]:
def encode(text):
    tokens = tokenizer(text)
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]
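
The vocabulary and encoding steps can be tried end to end on a tiny corpus. This is a sketch under stated assumptions: `toy_corpus`, `toy_vocab` and `toy_encode` are hypothetical names that mirror the cells above, and plain `str.split` stands in for the tokenizer.

```python
from collections import Counter

# Toy corpus standing in for the IMDB training reviews (illustrative only)
toy_corpus = ["a great great movie", "a terrible movie", "great acting"]

counter = Counter()
for text in toy_corpus:
    counter.update(text.split())

toy_vocab_size = 5  # 2 special tokens + the 3 most frequent words
most_common = counter.most_common(toy_vocab_size - 2)

# Same layout as the notebook: real words start at index 2
toy_vocab = {word: idx + 2 for idx, (word, _) in enumerate(most_common)}
toy_vocab["<pad>"] = 0
toy_vocab["<unk>"] = 1

def toy_encode(text):
    """Map each token to its index, falling back to <unk> for unseen words."""
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in text.split()]

encoded = toy_encode("a great boring movie")
print(toy_vocab)
print(encoded)  # "boring" is not in the vocabulary, so it becomes 1 (<unk>)
```

Words that fall outside the kept set ("terrible" and "acting" here) are indistinguishable after encoding, which is the price of capping the vocabulary.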

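## 6️⃣ Collate Function (Padding)

Reviews differ in length, so each batch is padded to the length of its longest review before being stacked into a single tensor. The true lengths are returned as well, so the RNN can skip the padding.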

In [20]:
def collate_batch(batch):
    texts, labels = [], []

    for example in batch:
        text = example['text']
        label = example['label']  # Already 0 or 1
        encoded = torch.tensor(encode(text), dtype=torch.long)
        texts.append(encoded)
        labels.append(label)

    # Get lengths before padding
    lengths = torch.tensor([len(text) for text in texts], dtype=torch.long)

    texts = pad_sequence(texts, batch_first=True)
    labels = torch.tensor(labels)

    return texts, lengths, labels
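
A small illustration of what the collate step produces, using three made-up encoded sequences rather than real reviews:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three made-up encoded reviews of different lengths
seqs = [torch.tensor([5, 8, 2]), torch.tensor([7]), torch.tensor([4, 9, 3, 6, 2])]
lengths = torch.tensor([len(s) for s in seqs], dtype=torch.long)

# pad_sequence fills with 0 by default, which matches the <pad> index
padded = pad_sequence(seqs, batch_first=True)

print(padded.shape)  # torch.Size([3, 5])
print(padded)
print(lengths)       # tensor([3, 1, 5])
```

The batch is padded only to the longest sequence it contains, so with shuffled batches the padded width varies from batch to batch.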

## 7️⃣ DataLoaders

In [21]:
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(test_data, batch_size=64, collate_fn=collate_batch)

## 8️⃣ Define Vanilla RNN Model

In [22]:
class VanillaRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x, lengths):
        x = self.embedding(x)

        # Pack padded sequence
        packed = pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)

        out, hidden = self.rnn(packed)

        # Use last hidden state
        final_hidden = hidden.squeeze(0)
        return self.fc(final_hidden)
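
Why pack at all? The sketch below, on random stand-in data with small hypothetical sizes, compares the final hidden state with and without packing. Without packing, the RNN keeps updating its state on padding positions. With packing, the returned state is the one after each sequence's last real token.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)

x = torch.randn(2, 5, 4)         # batch of 2, padded to length 5
lengths = torch.tensor([5, 2])   # the second sequence has only 2 real steps
x[1, 2:] = 0.0                   # zero out its padded positions

# Without packing: h_n is the state after all 5 steps, padding included
out_padded, h_padded = rnn(x)

# With packing: h_n is the state after each sequence's last real step
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, h_packed = rnn(packed)

# Output at the last real step (index 1) of the second sequence
last_real = out_padded[1, int(lengths[1]) - 1]

packed_matches = torch.allclose(h_packed[0, 1], last_real, atol=1e-6)
padded_matches = torch.allclose(h_padded[0, 1], last_real, atol=1e-6)
print("packed h_n equals state at last real token:", packed_matches)
print("unpacked h_n equals state at last real token:", padded_matches)
```

The first check should come out true and the second false: even on all-zero padding inputs, the biases and the recurrent weights keep changing the hidden state.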

## 9️⃣ Initialize

In [23]:

device = "cuda" if torch.cuda.is_available() else "cpu"

model = VanillaRNN(vocab_size, embed_dim=100, hidden_dim=128).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
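
The model outputs raw logits, and `nn.BCEWithLogitsLoss` applies the sigmoid internally. A minimal check on made-up logits shows it matches an explicit sigmoid followed by `nn.BCELoss`, and that thresholding the probability at 0.5 is the same as thresholding the logit at 0:

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])  # made-up raw model outputs
labels = torch.tensor([1.0, 0.0, 1.0])   # BCE targets must be float

loss_fused = nn.BCEWithLogitsLoss()(logits, labels)
loss_manual = nn.BCELoss()(torch.sigmoid(logits), labels)
print(loss_fused.item(), loss_manual.item())

# Two equivalent ways to turn logits into class predictions
preds_from_probs = (torch.sigmoid(logits) > 0.5).long()
preds_from_logits = (logits > 0).long()
print(preds_from_probs, preds_from_logits)
```

The fused version is generally preferred because it is more numerically stable for large-magnitude logits.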

## 🔟 Training Loop

In [24]:
def train_epoch():
    model.train()
    total_loss = 0

    for texts, lengths, labels in train_loader:
        texts, lengths, labels = texts.to(device), lengths.to(device), labels.to(device).float()

        outputs = model(texts, lengths).view(-1)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(train_loader)

## 1️⃣1️⃣ Evaluation

In [25]:
def evaluate():
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for texts, lengths, labels in test_loader:
            texts, lengths, labels = texts.to(device), lengths.to(device), labels.to(device)
            outputs = torch.sigmoid(model(texts, lengths).view(-1))
            preds = (outputs > 0.5).long()

            correct += (preds == labels).sum().item()
            total += labels.size(0)

    return correct / total

## 1️⃣2️⃣ Train

In [26]:

for epoch in range(5):
    start = time.time()

    loss = train_epoch()
    acc = evaluate()

    print(f"Epoch {epoch+1} | Loss: {loss:.4f} | Test Acc: {acc:.4f} | Time: {time.time()-start:.2f}s")

Epoch 1 | Loss: 0.6697 | Test Acc: 0.6164 | Time: 30.18s
Epoch 2 | Loss: 0.6171 | Test Acc: 0.6070 | Time: 28.35s
Epoch 3 | Loss: 0.5827 | Test Acc: 0.5927 | Time: 28.15s
Epoch 4 | Loss: 0.5783 | Test Acc: 0.7030 | Time: 29.10s
Epoch 5 | Loss: 0.5113 | Test Acc: 0.7282 | Time: 28.50s


In [27]:

for epoch in range(20):
    start = time.time()

    loss = train_epoch()
    acc = evaluate()

    print(f"Epoch {epoch+1} | Loss: {loss:.4f} | Test Acc: {acc:.4f} | Time: {time.time()-start:.2f}s")

Epoch 1 | Loss: 0.4793 | Test Acc: 0.7396 | Time: 26.75s
Epoch 2 | Loss: 0.4260 | Test Acc: 0.7644 | Time: 27.72s
Epoch 3 | Loss: 0.4200 | Test Acc: 0.6746 | Time: 27.02s
Epoch 4 | Loss: 0.3954 | Test Acc: 0.7620 | Time: 26.99s
Epoch 5 | Loss: 0.3341 | Test Acc: 0.7595 | Time: 27.59s
Epoch 6 | Loss: 0.3125 | Test Acc: 0.7578 | Time: 28.00s
Epoch 7 | Loss: 0.2738 | Test Acc: 0.7716 | Time: 27.21s
Epoch 8 | Loss: 0.2448 | Test Acc: 0.7557 | Time: 26.70s
Epoch 9 | Loss: 0.2184 | Test Acc: 0.7758 | Time: 27.32s
Epoch 10 | Loss: 0.1984 | Test Acc: 0.7727 | Time: 27.05s
Epoch 11 | Loss: 0.1676 | Test Acc: 0.7815 | Time: 26.80s
Epoch 12 | Loss: 0.3751 | Test Acc: 0.6184 | Time: 27.11s
Epoch 13 | Loss: 0.4855 | Test Acc: 0.6952 | Time: 27.90s
Epoch 14 | Loss: 0.4532 | Test Acc: 0.6524 | Time: 27.27s
Epoch 15 | Loss: 0.3630 | Test Acc: 0.7384 | Time: 26.56s
Epoch 16 | Loss: 0.3009 | Test Acc: 0.7330 | Time: 26.77s
Epoch 17 | Loss: 0.2622 | Test Acc: 0.7276 | Time: 27.44s
Epoch 18 | Loss: 0.2682

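You should see:

- Accuracy around 80–85% in a good run. The run logged above peaks lower, at about 78% (0.7815).
- Instability: at epoch 12 of the second loop, the loss jumps from 0.1676 to 0.3751 and test accuracy falls to 0.6184.
- Decent, but not state-of-the-art.

And that's intentional.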

## ✅ What This Exercise Teaches

1️⃣ **Order matters**

Unlike a bag-of-words model, the RNN reads tokens in sequence, so word order can change the prediction.

2️⃣ **Hidden state compresses entire review**

Final decision comes from a single vector.

3️⃣ **Same weights reused per timestep**

True recurrence.
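
The weight reuse can be made concrete by unrolling an `nn.RNN` layer by hand. The sketch below uses small hypothetical sizes and a random input, applies the layer's own parameters in a Python loop, and checks the result against PyTorch's output. It assumes the default `tanh` nonlinearity and a zero initial state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)  # tanh by default
x = torch.randn(1, 6, 4)  # one made-up sequence of 6 steps

# The layer's parameters: input-to-hidden and hidden-to-hidden
W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0

h = torch.zeros(1, 3)  # initial hidden state
for t in range(x.size(1)):
    # The same four tensors are applied at every timestep
    h = torch.tanh(x[:, t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)

_, h_n = rnn(x)
manual_matches = torch.allclose(h, h_n[0], atol=1e-5)
print("manual unroll matches nn.RNN:", manual_matches)
```

The loop also makes the serial dependency visible: step `t` cannot start until step `t - 1` has produced `h`.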

## ⚠️ Shortfalls (Make These Visible)

### ❌ 1. Long Reviews Hurt

**Goal:** Show that longer sequences slow training and hurt accuracy because information must pass through every timestep sequentially.

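The notebook above does not truncate reviews, so to run this experiment, cap each review at a maximum number of tokens and then raise the cap.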

You'll see:

- Training slows
- Accuracy plateaus

Because:

Information must travel through every timestep.
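
One way to set up this experiment is a collate step that truncates before padding. This is a sketch, not part of the notebook: `collate_with_max_len` and `max_len` are hypothetical, and the inputs are made-up index lists rather than encoded reviews.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_with_max_len(encoded_batch, max_len):
    """Truncate each encoded review to max_len tokens, then pad the batch.

    Hypothetical helper: the notebook's collate_batch keeps every token.
    """
    texts = [torch.tensor(ids[:max_len], dtype=torch.long) for ids in encoded_batch]
    lengths = torch.tensor([len(t) for t in texts], dtype=torch.long)
    return pad_sequence(texts, batch_first=True), lengths

# Made-up encoded reviews with 3, 10 and 400 tokens
batch = [list(range(2, 5)), list(range(2, 12)), list(range(2, 402))]

for max_len in (50, 200, 500):
    padded, lengths = collate_with_max_len(batch, max_len)
    print(max_len, tuple(padded.shape), lengths.tolist())
# 50 (3, 50) [3, 10, 50]
# 200 (3, 200) [3, 10, 200]
# 500 (3, 400) [3, 10, 400]
```

Training once per cap and comparing epoch time and test accuracy would then show how the model responds to longer inputs. Note that `ids[:max_len]` keeps the start of each review, which is itself a design choice.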

### ❌ 2. Vanishing Gradients

**Goal:** Measure gradient norms to show they shrink as sequences get longer, making it hard to learn from early tokens.

Check the gradient norm of the embedding layer:

# Run one training step to get gradients
texts, lengths, labels = next(iter(train_loader))
texts, lengths, labels = texts.to(device), lengths.to(device), labels.to(device).float()

outputs = model(texts, lengths).view(-1)
loss = criterion(outputs, labels)

optimizer.zero_grad()
loss.backward()

for name, param in model.named_parameters():
    if "embedding" in name and param.grad is not None:
        print("Embedding grad norm:", param.grad.norm().item())

Increase sequence length → gradient shrinks.
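
The embedding gradient sums contributions from every position, so a more direct probe is the gradient that reaches the first timestep. The sketch below swaps in that measurement. It uses an untrained `nn.RNN` with small hypothetical sizes and random inputs standing in for embeddings, so the exact numbers will differ from a trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

grad_norms = {}
for seq_len in (5, 20, 50):
    # Random stand-in for embedded tokens; requires_grad lets us inspect input gradients
    x = torch.randn(1, seq_len, 8, requires_grad=True)
    _, h_n = rnn(x)
    h_n.sum().backward()  # a scalar that depends only on the final hidden state
    # How strongly does the first token influence the final state?
    grad_norms[seq_len] = x.grad[0, 0].norm().item()
    print(f"len={seq_len:3d} | grad norm at first timestep: {grad_norms[seq_len]:.3e}")
```

With PyTorch's default initialization, the norm should drop by orders of magnitude as the sequence grows, which is the vanishing-gradient effect in miniature.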

### ❌ 3. Serial Computation

**Goal:** Demonstrate that training time scales linearly with sequence length because RNNs cannot parallelize across time steps, leaving GPUs underutilized.

Time per epoch scales roughly linearly with sequence length.

You cannot parallelize across time.

GPU underutilized.
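
A rough way to see the scaling is to time the forward pass alone at several sequence lengths. This is a CPU-only sketch on random data with the notebook's layer sizes. The timings are noisy and machine-dependent, and on a GPU you would also need `torch.cuda.synchronize()` before reading the clock.

```python
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=100, hidden_size=128, batch_first=True)

timings = {}
with torch.no_grad():
    for seq_len in (50, 100, 200, 400):
        x = torch.randn(64, seq_len, 100)  # random stand-in for a batch of embeddings
        rnn(x)  # warm-up run, not timed
        start = time.perf_counter()
        for _ in range(3):
            rnn(x)
        timings[seq_len] = (time.perf_counter() - start) / 3
        print(f"len={seq_len:4d} | forward time: {timings[seq_len] * 1000:.1f} ms")
```

If the time roughly doubles each time the length doubles, that is the sequential dependency at work: each step waits for the previous hidden state.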

### ❌ 4. Fixed Memory Bottleneck

**Goal:** Show that all sequence information must be compressed into a fixed-size hidden vector, creating a memory bottleneck for long sequences.

All review meaning compressed into:

```python
hidden_dim = 128
```

Long review.
Single 128-dim vector.

Compression pressure.
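
The bottleneck is easy to verify: whatever the input length, the summary handed to the classifier has the same shape. A minimal sketch with random inputs standing in for embedded reviews:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, hidden_dim = 100, 128  # same sizes as the model above
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)

shapes = {}
with torch.no_grad():
    for seq_len in (10, 100, 1000):
        x = torch.randn(1, seq_len, embed_dim)  # random stand-in for an embedded review
        _, h_n = rnn(x)
        shapes[seq_len] = tuple(h_n.squeeze(0).shape)
        print(f"{seq_len:5d} tokens -> summary vector of shape {shapes[seq_len]}")
```

A 10-token review and a 1000-token review both end up as one 128-dimensional vector per example.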

## 🧠 Why This Is Perfect Before LSTM

Students now feel:

- Memory compression
- Gradient fragility
- Sequential bottleneck

So when you introduce LSTM gates later, they solve a problem students have already experienced.

Not abstractly.

Mechanically.