# Module 09: Text Classification with RNNs

**End-to-End Sentiment Analysis Pipeline**

---

## 1. Objectives

- ✅ Build a complete text classification pipeline
- ✅ Handle variable-length sequences correctly
- ✅ Train a sentiment classifier (a toy sample here; IMDB is the full-scale dataset)
- ✅ Evaluate with proper metrics

## 2. Prerequisites

- [Module 08: Bidirectional & Deep RNNs](../08_bidirectional_deep_rnns/08_bidirectional_deep_rnns.ipynb)

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from collections import Counter
import numpy as np
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cpu


## 3. Dataset & Vocabulary

In [2]:
# Sample data (in practice, load a real corpus, e.g. with HuggingFace datasets)
train_texts = [
    "This movie was absolutely fantastic and amazing",
    "Terrible film waste of time and money",
    "I loved every moment of this masterpiece",
    "Awful acting and horrible plot",
    "Best movie I have seen in years",
    "Boring and predictable would not recommend",
    "Brilliant cinematography and great performances",
    "Complete disaster avoid at all costs"
]
train_labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

# Build vocabulary
class Vocabulary:
    def __init__(self, min_freq=1):
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        self.min_freq = min_freq

    def build(self, texts):
        word_counts = Counter()
        for text in texts:
            word_counts.update(text.lower().split())

        for word, count in word_counts.items():
            if count >= self.min_freq:
                idx = len(self.word2idx)
                self.word2idx[word] = idx
                self.idx2word[idx] = word

        print(f"Vocabulary size: {len(self.word2idx)}")

    def encode(self, text):
        return [self.word2idx.get(w, 1) for w in text.lower().split()]

    def __len__(self):
        return len(self.word2idx)

vocab = Vocabulary()
vocab.build(train_texts)

Vocabulary size: 44
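
The `encode` method falls back to index 1 for any word the vocabulary never saw. The sketch below reproduces that behaviour with a plain dictionary instead of the `Vocabulary` class; the two-sentence corpus is made up for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical); stands in for the training texts.
corpus = ["great movie", "terrible movie"]

# Reserve index 0 for padding and 1 for unknown words, as in the Vocabulary class.
word2idx = {"<PAD>": 0, "<UNK>": 1}
for word in Counter(" ".join(corpus).lower().split()):
    word2idx[word] = len(word2idx)

def encode(text):
    """Map each whitespace token to its index, falling back to <UNK>."""
    return [word2idx.get(w, word2idx["<UNK>"]) for w in text.lower().split()]

print(encode("Great movie"))      # [2, 3]
print(encode("Wonderful movie"))  # [1, 3] -> "wonderful" is out of vocabulary
```

With a real corpus, raising `min_freq` sends rare words down the same `<UNK>` path, which keeps the embedding table small.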


In [3]:
# Dataset class
class TextDataset(Dataset):
    def __init__(self, texts, labels, vocab):
        self.texts = [torch.tensor(vocab.encode(t)) for t in texts]
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

# Collate function for variable lengths
def collate_fn(batch):
    texts, labels = zip(*batch)
    lengths = torch.tensor([len(t) for t in texts])
    texts_padded = pad_sequence(texts, batch_first=True, padding_value=0)
    return texts_padded, torch.stack(list(labels)), lengths

# Create DataLoader
dataset = TextDataset(train_texts, train_labels, vocab)
loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)

# Test
batch = next(iter(loader))
print(f"Texts shape: {batch[0].shape}")
print(f"Labels: {batch[1]}")
print(f"Lengths: {batch[2]}")

Texts shape: torch.Size([4, 7])
Labels: tensor([1, 1, 0, 1])
Lengths: tensor([5, 7, 7, 7])
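
To see what the collate function produces, here is a minimal sketch on three hand-made index tensors (the token indices are arbitrary). It also shows how a boolean mask can be recovered from the stored lengths; the mask is an addition for illustration and is not used elsewhere in this notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three already-encoded "sentences" of different lengths (indices are made up).
seqs = [torch.tensor([5, 6, 7]), torch.tensor([8]), torch.tensor([9, 10])]

# Record true lengths before padding; they cannot be recovered reliably afterwards.
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True, padding_value=0)

print(padded.tolist())   # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(lengths.tolist())  # [3, 1, 2]

# True where a position holds a real token, False where it is padding.
mask = torch.arange(padded.size(1)).unsqueeze(0) < lengths.unsqueeze(1)
print(mask.tolist())
```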


## 4. Model Architecture

In [4]:
class SentimentClassifier(nn.Module):
    """BiLSTM for sentiment classification."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes,
                 num_layers=2, dropout=0.3, bidirectional=True):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim, num_layers,
            batch_first=True, dropout=dropout if num_layers > 1 else 0,
            bidirectional=bidirectional
        )

        lstm_output_dim = hidden_dim * 2 if bidirectional else hidden_dim
        self.fc = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(lstm_output_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, x, lengths):
        # x: (batch, seq), lengths: (batch,)
        embedded = self.embedding(x)  # (batch, seq, embed)

        # Pack for efficiency with variable lengths
        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        packed_out, (h_n, _) = self.lstm(packed)

        # Use final hidden states from both directions
        # h_n: (num_layers * num_directions, batch, hidden)
        if self.lstm.bidirectional:
            h_fwd = h_n[-2]  # Last layer forward
            h_bwd = h_n[-1]  # Last layer backward
            h = torch.cat([h_fwd, h_bwd], dim=-1)
        else:
            h = h_n[-1]

        return self.fc(h)

# Create model
model = SentimentClassifier(
    vocab_size=len(vocab),
    embed_dim=100,
    hidden_dim=128,
    num_classes=2
).to(device)

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Parameters: 668,338
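
Packing matters for correctness as well as speed: without it, the final hidden state is computed after the LSTM has also stepped through the padding. The sketch below, using a small throwaway LSTM with arbitrary sizes, compares three runs on the same sequence: unpadded, padded and packed, and padded without packing. It also reshapes `h_n` to make the layer and direction axes explicit, assuming the `(num_layers, num_directions, batch, hidden)` layout described in the PyTorch LSTM documentation.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, num_layers=2,
               batch_first=True, bidirectional=True)
lstm.eval()

# One real sequence of 2 steps, then the same sequence padded to 5 steps.
real = torch.randn(1, 2, 4)
padded = torch.cat([real, torch.zeros(1, 3, 4)], dim=1)
lengths = torch.tensor([2])

with torch.no_grad():
    _, (h_ref, _) = lstm(real)        # reference: no padding at all
    packed = pack_padded_sequence(padded, lengths, batch_first=True,
                                  enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)   # padded steps are skipped
    _, (h_naive, _) = lstm(padded)    # padded steps are processed

# Split the first axis into (layer, direction) and take the last layer.
h_last = h_packed.reshape(2, 2, 1, 3)[-1]  # (direction, batch, hidden)

print(torch.allclose(h_packed, h_ref, atol=1e-5))  # packed run matches the reference
print(torch.allclose(h_naive, h_ref, atol=1e-5))   # unpacked run does not
```

Here `h_last[0]` and `h_last[1]` are the same tensors the classifier selects with `h_n[-2]` and `h_n[-1]`.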


## 5. Training Loop

In [5]:
def train_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss, correct, total = 0, 0, 0

    for texts, labels, lengths in loader:
        texts, labels = texts.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(texts, lengths)
        loss = criterion(outputs, labels)
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        total_loss += loss.item()
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)

    return total_loss / len(loader), correct / total

def evaluate(model, loader, criterion):
    model.eval()
    total_loss, correct, total = 0, 0, 0

    with torch.no_grad():
        for texts, labels, lengths in loader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = model(texts, lengths)
            loss = criterion(outputs, labels)

            total_loss += loss.item()
            preds = outputs.argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)

    return total_loss / len(loader), correct / total

# Training
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    train_loss, train_acc = train_epoch(model, loader, optimizer, criterion)
    print(f"Epoch {epoch+1}: Loss={train_loss:.4f}, Acc={train_acc:.2%}")

Epoch 1: Loss=0.6957, Acc=37.50%
Epoch 2: Loss=0.6763, Acc=87.50%
Epoch 3: Loss=0.6494, Acc=100.00%
Epoch 4: Loss=0.6261, Acc=100.00%
Epoch 5: Loss=0.5783, Acc=100.00%
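
The loop above reports only accuracy, and only on the training data. On imbalanced data, precision, recall and F1 are usually more informative. The helper below is a hypothetical sketch (`binary_metrics` is not part of this notebook) that computes them for the positive class from prediction and label tensors such as those collected inside `evaluate`; the example tensors are made up.

```python
import torch

def binary_metrics(preds, labels):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = ((preds == 1) & (labels == 1)).sum().item()
    fp = ((preds == 1) & (labels == 0)).sum().item()
    fn = ((preds == 0) & (labels == 1)).sum().item()
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up predictions and labels, for illustration only.
preds = torch.tensor([1, 1, 0, 0, 1])
labels = torch.tensor([1, 0, 0, 1, 1])
precision, recall, f1 = binary_metrics(preds, labels)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.67 R=0.67 F1=0.67
```

These numbers are only meaningful on a held-out split; the eight training sentences used here are too few to set one aside.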


## 6. Inference

In [6]:
def predict(model, vocab, text):
    model.eval()
    tokens = torch.tensor([vocab.encode(text)]).to(device)
    lengths = torch.tensor([tokens.size(1)])

    with torch.no_grad():
        output = model(tokens, lengths)
        prob = torch.softmax(output, dim=1)
        pred = output.argmax(dim=1).item()

    return 'Positive' if pred == 1 else 'Negative', prob[0][pred].item()

# Test predictions
test_texts = [
    "This is absolutely wonderful",
    "Terrible waste of time",
    "Not bad but could be better"
]

for text in test_texts:
    label, conf = predict(model, vocab, text)
    print(f"{text}")
    print(f"  → {label} ({conf:.2%})\n")

This is absolutely wonderful
  → Positive (58.54%)

Terrible waste of time
  → Negative (51.85%)

Not bad but could be better
  → Positive (54.23%)



## 7. 🔥 Real-World Usage

### Model Selection (2024)

| Data Size | Compute | Model |
|-----------|---------|-------|
| < 1K | Low | TF-IDF + LogReg |
| 1K-10K | Low | BiLSTM |
| 1K-10K | Medium | DistilBERT |
| > 10K | Any | BERT/RoBERTa |
| Any | High + API | GPT-4 (zero-shot) |

### Production Tips

- Always start with **TF-IDF baseline**
- Use **gradient clipping** for RNNs
- **Pack sequences** for efficiency
- Consider **BERT** for best accuracy
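
A TF-IDF baseline takes only a few lines, which is why it makes a good first step. The sketch below assumes scikit-learn is installed (it is not used elsewhere in this notebook) and fits on four of the sample sentences from Section 3; the query sentence is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "This movie was absolutely fantastic and amazing",
    "Terrible film waste of time and money",
    "I loved every moment of this masterpiece",
    "Awful acting and horrible plot",
]
labels = [1, 0, 1, 0]  # 1=positive, 0=negative

# Vectorizer and classifier chained so raw strings can be passed directly.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)

# predict_proba returns one column per class, ordered as in baseline.classes_.
probs = baseline.predict_proba(["fantastic masterpiece"])[0]
print(f"P(negative)={probs[0]:.2f}, P(positive)={probs[1]:.2f}")
```

If a neural model cannot clearly beat this baseline on held-out data, the extra training cost is hard to justify.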

## 8. Interview Questions

**Q1: How do you handle variable-length sequences?**
<details><summary>Answer</summary>

- Pad the sequences in each batch to a common length
- Use `pack_padded_sequence` so the RNN skips the padded positions
- Keep the original lengths alongside the padded batch; packing needs them
</details>

**Q2: Why use gradient clipping for RNNs?**
<details><summary>Answer</summary>

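Gradients in RNNs are multiplied through many time steps and can grow very large (exploding gradients), which destabilizes training. Clipping rescales the gradients whenever their total norm exceeds a threshold (typically 1.0-5.0), so no single update is too large.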
</details>
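
The effect of clipping can be checked directly. In this minimal sketch, a small linear layer stands in for the RNN and the loss is scaled up artificially to force large gradients. `clip_grad_norm_` returns the total norm measured before clipping, and the norm recomputed afterwards sits at the threshold.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)  # stand-in for a recurrent model

# An artificially large loss produces large gradients.
loss = (layer(torch.randn(8, 4)) * 1000).pow(2).mean()
loss.backward()

# Rescales gradients in place; the return value is the pre-clipping norm.
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)

# Recompute the total gradient norm across all parameters after clipping.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(f"before: {norm_before.item():.1f}, after: {norm_after.item():.4f}")
```

Only the magnitude changes; the direction of the update is preserved because every gradient is scaled by the same factor.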

## 9. Summary

- **Pipeline**: Vocab → Dataset → DataLoader → Model → Train
- **Handle variable lengths**: `pack_padded_sequence`
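- **Gradient clipping**: Guards RNN training against exploding gradients
- **BiLSTM**: Reads the sequence in both directions; a solid choice for classification, where the whole input is available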
- **In practice**: Consider BERT for best results

## 10. References

- [PyTorch Packing](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html)
- [IMDB Dataset](https://ai.stanford.edu/~amaas/data/sentiment/)

---
**Next:** [Module 10: Named Entity Recognition](../10_ner/10_ner.ipynb)