[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shang-vikas/series1-coding-exercises/blob/main/exercises/blog-05/exercise-01.ipynb)

# 🧪 IMDB Sentiment Classification: Vanilla RNN

## 1️⃣ Install + Imports

In [1]:
%pip install datasets -q

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from datasets import load_dataset
from torch.nn.utils.rnn import pad_sequence
from collections import Counter
import time
import re

def basic_english_tokenizer(text):
    """Lowercase the text, replace every non-alphanumeric character with a space, and split on whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return text.split()
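
To see what this tokenizer does to a typical review fragment, here is a small sketch. The sample string is made up for illustration; the function body mirrors the one above. Note that contractions are split and that HTML line breaks, which appear in the raw IMDB text, leave a stray `br` token.

```python
import re

def basic_english_tokenizer(text):
    """Lowercase, replace non-alphanumeric characters with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return text.split()

# Hypothetical sample, not taken from the dataset.
sample = "I didn't like it.<br />Not at all!"
tokens = basic_english_tokenizer(sample)
print(tokens)
# ['i', 'didn', 't', 'like', 'it', 'br', 'not', 'at', 'all']
```

If the `br` tokens turn out to matter, stripping `<br />` before tokenizing would be one option; the notebook does not do this.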

## 2️⃣ Load IMDB Dataset

In [2]:
# Load IMDB dataset from Hugging Face datasets
dataset = load_dataset("imdb")

train_data = dataset['train']
test_data = dataset['test']

print("Train samples:", len(train_data))
print("Test samples:", len(test_data))
print("\nSample review:")
print("Text:", train_data[0]['text'][:100] + "...")
print("Label:", train_data[0]['label'], "(0=neg, 1=pos)")

Train samples: 25000
Test samples: 25000

Sample review:
Text: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it w...
Label: 0 (0=neg, 1=pos)


IMDB contains:

- 25,000 training reviews
- 25,000 test reviews
- Binary sentiment: pos / neg

## 3️⃣ Tokenization

In [3]:
# Tokenizer is defined above in imports
tokenizer = basic_english_tokenizer

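## 4️⃣ Build Vocabulary

We cap the vocabulary at 20,000 entries (the 19,998 most frequent training tokens plus `<pad>` and `<unk>`) to keep the embedding table and training manageable.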

In [4]:
counter = Counter()

for example in train_data:
    text = example['text']
    tokens = tokenizer(text)
    counter.update(tokens)

vocab_size = 20000
most_common = counter.most_common(vocab_size - 2)

vocab = {word: idx+2 for idx, (word, _) in enumerate(most_common)}
vocab["<pad>"] = 0
vocab["<unk>"] = 1
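
The same construction on a three-sentence toy corpus makes the index layout easier to see. This is only an illustration: the corpus and the tiny `vocab_size` are invented, and whitespace splitting stands in for the tokenizer.

```python
from collections import Counter

# Toy corpus, made up for illustration.
toy_corpus = ["good good movie", "bad movie", "good plot"]

counter = Counter()
for text in toy_corpus:
    counter.update(text.split())

vocab_size = 4  # 2 special tokens + the 2 most frequent words
most_common = counter.most_common(vocab_size - 2)

# Real words start at index 2; 0 and 1 are reserved.
vocab = {word: idx + 2 for idx, (word, _) in enumerate(most_common)}
vocab["<pad>"] = 0
vocab["<unk>"] = 1

print(vocab)
# {'good': 2, 'movie': 3, '<pad>': 0, '<unk>': 1}
```

Words outside the cap ("bad" and "plot" here) get no entry of their own and will later map to `<unk>`.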

## 5️⃣ Numericalize Data

In [5]:
def encode(text):
    tokens = tokenizer(text)
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]
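
Because `encode` sends every out-of-vocabulary token to `<unk>`, it can be useful to measure how much text the cap discards. The helper below is a hypothetical sketch (`unk_rate` is not part of the notebook) and uses whitespace splitting by default as a stand-in tokenizer.

```python
def unk_rate(texts, vocab, tokenizer=str.split):
    """Hypothetical helper: fraction of tokens that fall outside `vocab`."""
    total = 0
    unknown = 0
    for text in texts:
        for token in tokenizer(text):
            total += 1
            unknown += token not in vocab
    return unknown / total if total else 0.0

# Toy vocabulary and texts, made up for illustration.
toy_vocab = {"<pad>": 0, "<unk>": 1, "good": 2, "movie": 3}
rate = unk_rate(["good movie", "terrible plot good"], toy_vocab)
print(rate)  # 2 of 5 tokens are unknown -> 0.4
```

In the notebook this could be called with the real `vocab` and `basic_english_tokenizer` on a sample of `train_data` texts.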

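## 6️⃣ Collate Function (Padding)

Every sequence in a batch must have the same length, so shorter reviews are right-padded with `<pad>` (index 0) up to the longest review in the batch.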

In [6]:
def collate_batch(batch):
    texts, labels = [], []

    for example in batch:
        text = example['text']
        label = example['label']  # Already 0 or 1
        encoded = torch.tensor(encode(text), dtype=torch.long)  # explicit dtype: an empty review would otherwise become a float tensor
        texts.append(encoded)
        labels.append(label)

    texts = pad_sequence(texts, batch_first=True)
    labels = torch.tensor(labels)

    return texts, labels
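
A minimal sketch of what `pad_sequence` produces, using three invented index sequences. It also shows how the true lengths can be recovered afterwards, under the assumption that index 0 is only ever used for padding (which holds for the vocabulary built above).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three made-up encoded reviews of different lengths.
seqs = [torch.tensor([5, 8, 2]), torch.tensor([7]), torch.tensor([4, 9])]

padded = pad_sequence(seqs, batch_first=True)  # pads on the right with 0 by default
lengths = (padded != 0).sum(dim=1)             # number of real tokens per row

print(padded)
# tensor([[5, 8, 2],
#         [7, 0, 0],
#         [4, 9, 0]])
print(lengths)
# tensor([3, 1, 2])
```

The batch width is set by the longest review in that batch, so a single very long review makes every other row mostly padding.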

## 7️⃣ DataLoaders

In [7]:
train_loader = DataLoader(train_data, batch_size=64, shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(test_data, batch_size=64, collate_fn=collate_batch)

## 8️⃣ Define Vanilla RNN Model

In [8]:
class VanillaRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = self.embedding(x)
        out, hidden = self.rnn(x)

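        # `hidden` is the state after the last timestep of the padded batch,
        # so for shorter reviews it comes after a run of <pad> positions.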
        final_hidden = hidden.squeeze(0)
        return self.fc(final_hidden)
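
In the model above, the RNN keeps stepping through padded positions, so the state used for classification is not the state at the last real token. One possible variant, sketched below, picks the output at each review's last non-padded step instead. `LastTokenRNN` is a hypothetical name, and the length computation assumes index 0 appears only as padding. Whether this changes accuracy on IMDB is not something this notebook has measured.

```python
import torch
import torch.nn as nn

class LastTokenRNN(nn.Module):
    """Hypothetical variant of VanillaRNN that ignores padded timesteps."""

    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # Real tokens per row; clamp so an all-padding row still indexes step 0.
        lengths = (x != 0).sum(dim=1).clamp(min=1)
        out, _ = self.rnn(self.embedding(x))        # (batch, seq_len, hidden_dim)
        rows = torch.arange(x.size(0), device=x.device)
        last_real = out[rows, lengths - 1]          # (batch, hidden_dim)
        return self.fc(last_real)                   # (batch, 1)

# Extra padding on the right should not change the logit.
torch.manual_seed(0)
demo = LastTokenRNN(vocab_size=10, embed_dim=4, hidden_dim=6).eval()
with torch.no_grad():
    unpadded = demo(torch.tensor([[3, 4, 5]]))
    padded = demo(torch.tensor([[3, 4, 5, 0, 0]]))
print(torch.allclose(unpadded, padded, atol=1e-5))
```

`torch.nn.utils.rnn.pack_padded_sequence` is another way to get the same effect; the indexing approach is used here only because it needs no change to `collate_batch`.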

## 9️⃣ Initialize

In [9]:

device = "cuda" if torch.cuda.is_available() else "cpu"

model = VanillaRNN(vocab_size, embed_dim=100, hidden_dim=128).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

## 🔟 Training Loop

In [10]:
def train_epoch():
    model.train()
    total_loss = 0

    for texts, labels in train_loader:
        texts, labels = texts.to(device), labels.to(device).float()

        outputs = model(texts).squeeze(1)  # (batch, 1) -> (batch,); safe even for a batch of one
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(train_loader)

## 1️⃣1️⃣ Evaluation

In [11]:
def evaluate():
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for texts, labels in test_loader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = torch.sigmoid(model(texts).squeeze(1))
            preds = (outputs > 0.5).long()

            correct += (preds == labels).sum().item()
            total += labels.size(0)

    return correct / total

## 1️⃣2️⃣ Train

In [12]:
for epoch in range(5):
    start = time.time()

    loss = train_epoch()
    acc = evaluate()

    print(f"Epoch {epoch+1} | Loss: {loss:.4f} | Test Acc: {acc:.4f} | Time: {time.time()-start:.2f}s")

Epoch 1 | Loss: 0.6960 | Test Acc: 0.5058 | Time: 12.31s
Epoch 2 | Loss: 0.6948 | Test Acc: 0.5059 | Time: 11.70s
Epoch 3 | Loss: 0.6960 | Test Acc: 0.5003 | Time: 12.39s
Epoch 4 | Loss: 0.6953 | Test Acc: 0.5001 | Time: 11.50s
Epoch 5 | Loss: 0.6940 | Test Acc: 0.4983 | Time: 11.62s


In [None]:
for epoch in range(50):
    start = time.time()

    loss = train_epoch()
    acc = evaluate()

    print(f"Epoch {epoch+1} | Loss: {loss:.4f} | Test Acc: {acc:.4f} | Time: {time.time()-start:.2f}s")


Epoch 1 | Loss: 0.6947 | Test Acc: 0.4951 | Time: 11.64s
Epoch 2 | Loss: 0.6952 | Test Acc: 0.4940 | Time: 11.47s
Epoch 3 | Loss: 0.6950 | Test Acc: 0.4939 | Time: 11.53s
Epoch 4 | Loss: 0.6948 | Test Acc: 0.4940 | Time: 12.02s
Epoch 5 | Loss: 0.6946 | Test Acc: 0.4982 | Time: 11.82s
Epoch 6 | Loss: 0.6946 | Test Acc: 0.4944 | Time: 11.35s


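The target for this exercise:

- Accuracy around 80–85%.
- Good.
- But not state-of-the-art.

And that's intentional.

The runs logged above do not reach that target: test accuracy stays near 0.50 and the loss near 0.69, which is chance level for a balanced binary task. Two things in the cells above plausibly contribute: reviews are never truncated, and `pad_sequence` pads on the right, so the final hidden state of a shorter review is computed after a run of `<pad>` steps.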
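
A quick reference point for reading the loss column in the training logs: a classifier that outputs a logit of 0 for every review (probability 0.5) has a binary cross-entropy of ln 2, about 0.693, whatever the labels are. The labels below are made up; only the constant-logit behaviour matters.

```python
import math
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

labels = torch.tensor([0.0, 1.0, 1.0, 0.0])  # arbitrary labels for illustration
logits = torch.zeros(4)                      # sigmoid(0) = 0.5 for every example

chance_loss = criterion(logits, labels).item()
print(round(chance_loss, 4), round(math.log(2), 4))
# 0.6931 0.6931
```

A training loss that hovers around this value means the model has not yet learned anything beyond guessing.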

## ✅ What This Exercise Teaches

1️⃣ **Order matters**

Unlike a bag-of-words model, the RNN reads tokens in sequence, so word order can change the prediction.

2️⃣ **Hidden state compresses the entire review**

The final decision comes from a single vector.

3️⃣ **Same weights reused at every timestep**

True recurrence.
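
Weight reuse can be checked directly by unrolling the recurrence by hand with the parameters of an `nn.RNN` layer and comparing against the layer's own result. The sketch below uses small, arbitrary sizes; the update rule is the standard tanh recurrence that `nn.RNN` uses by default.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=3, hidden_size=4, batch_first=True)
x = torch.randn(1, 5, 3)   # (batch, seq_len, input_size)

h = torch.zeros(1, 4)      # initial hidden state
with torch.no_grad():
    for t in range(x.size(1)):
        # The same four parameter tensors are applied at every timestep.
        h = torch.tanh(
            x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )
    _, h_ref = rnn(x)      # h_ref: (num_layers, batch, hidden_size)

matches = torch.allclose(h, h_ref[0], atol=1e-5)
print(matches)
```

The loop body never changes with `t`, which is also why the computation cannot skip ahead: step `t` needs the `h` produced by step `t - 1`.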

## ⚠️ Shortfalls (Make These Visible)

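### ❌ 1. Long Reviews Hurt

Feed the model longer reviews, for example by capping the encoded length and then raising the cap (the cells above apply no cap, so every review is used at full length).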

You'll see:

- Training slows
- Accuracy plateaus

Because:

Information must travel through every timestep.
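
To run this experiment, review length has to be a knob. A minimal sketch of a truncating encoder follows; `encode_truncated` and `max_len` are hypothetical names, the default of 200 is arbitrary, and lowercased whitespace splitting stands in for the notebook's tokenizer.

```python
def encode_truncated(text, vocab, max_len=200):
    """Hypothetical variant of `encode` that keeps at most `max_len` tokens."""
    tokens = text.lower().split()  # stand-in for basic_english_tokenizer
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return ids[:max_len]           # keep the start of the review, drop the rest

# Toy vocabulary, made up for illustration.
toy_vocab = {"<pad>": 0, "<unk>": 1, "good": 2}
print(encode_truncated("good bad good good", toy_vocab, max_len=3))
# [2, 1, 2]
```

Keeping the first `max_len` tokens is one choice among several; keeping the last tokens instead would put the end of the review closest to the final hidden state.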

### ❌ 2. Vanishing Gradients

Check the gradient norm of the embedding layer:

In [None]:
# Run one training step to get gradients
texts, labels = next(iter(train_loader))
texts, labels = texts.to(device), labels.to(device).float()

outputs = model(texts).squeeze(1)
loss = criterion(outputs, labels)

optimizer.zero_grad()
loss.backward()

for name, param in model.named_parameters():
    if "embedding" in name and param.grad is not None:
        print("Embedding grad norm:", param.grad.norm().item())

Increase sequence length → gradient shrinks.
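
The embedding gradient norm mixes contributions from every timestep, so a more targeted probe can help. The sketch below measures how much gradient from the final hidden state reaches the first input step, for a freshly initialised `nn.RNN` and random inputs. The function name, sizes, and sequence lengths are assumptions for illustration, and the numbers will differ from those of the trained model.

```python
import torch
import torch.nn as nn

def first_step_grad_norm(seq_len, embed_dim=16, hidden_dim=32, seed=0):
    """Norm of d(sum of final hidden state) / d(input at timestep 0)."""
    torch.manual_seed(seed)
    rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
    x = torch.randn(1, seq_len, embed_dim, requires_grad=True)
    _, hidden = rnn(x)
    hidden.sum().backward()           # backpropagate through all timesteps
    return x.grad[0, 0].norm().item() # gradient that reached the first step

for n in (5, 20, 80):
    print(n, first_step_grad_norm(n))
```

If the printed norm drops sharply as `n` grows, the earliest tokens are contributing almost nothing to the update.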

### ❌ 3. Serial Computation

Time per epoch scales roughly linearly with sequence length.

You cannot parallelize across time.

GPU underutilized.
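
A rough way to observe the scaling is to time a forward pass at several sequence lengths. This is a CPU-only sketch with random inputs; `forward_time` is a hypothetical helper, and the sizes mirror the notebook's `embed_dim=100`, `hidden_dim=128`, and batch size of 64. On a GPU, `torch.cuda.synchronize()` would be needed around the timed region for the numbers to mean anything.

```python
import time
import torch
import torch.nn as nn

def forward_time(seq_len, batch_size=64, embed_dim=100, hidden_dim=128, repeats=3):
    """Average CPU seconds for one forward pass over `seq_len` timesteps."""
    rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
    x = torch.randn(batch_size, seq_len, embed_dim)
    with torch.no_grad():
        rnn(x)  # warm-up call, not timed
        start = time.perf_counter()
        for _ in range(repeats):
            rnn(x)
    return (time.perf_counter() - start) / repeats

for n in (50, 100, 200):
    print(n, f"{forward_time(n):.4f}s")
```

Timings are noisy on shared hardware, so treat the trend across lengths as the signal rather than any single number.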

### ❌ 4. Fixed Memory Bottleneck

All review meaning compressed into:

```python
hidden_dim = 128
```

Long review.
Single 128-dim vector.

Compression pressure.
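
The bottleneck is visible in the tensor shapes alone. In the sketch below, a 20-step input and a 2,000-step input (both random, sized to match the notebook's `embed_dim=100` and `hidden_dim=128`) end up in a final state of exactly the same size.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=100, hidden_size=128, batch_first=True)

short_review = torch.randn(1, 20, 100)    # 20 timesteps
long_review = torch.randn(1, 2000, 100)   # 2,000 timesteps

with torch.no_grad():
    _, h_short = rnn(short_review)
    _, h_long = rnn(long_review)

print(h_short.shape, h_long.shape)
# torch.Size([1, 1, 128]) torch.Size([1, 1, 128])
```

A hundred times more input, the same 128 numbers to hold it.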

## 🧠 Why This Is Perfect Before LSTM

Students now feel:

- Memory compression
- Gradient fragility
- Sequential bottleneck

So when you introduce LSTM gates later, it solves a problem they already experienced.

Not abstractly.

Mechanically.