Week 10 · Day 1 — Packed Sequences, Masks & Bidirectionality
Why this matters

Text comes in variable lengths. To train efficiently, we must handle padding and sequence masks correctly. Today you’ll also learn bidirectional RNNs (BiLSTM), which look at both past and future context — often boosting accuracy in NLP.

Theory Essentials

Sequences in NLP are padded to equal length for batching.

Packed sequences in PyTorch let RNNs skip padded tokens.

batch_first=True arranges data as [batch, seq, features].

Masks mark valid tokens vs. padding.

Bidirectional RNNs process forward and backward → concat hidden states.

BiLSTM captures both past and future context for classification.



### 1. **Packed Sequences (PyTorch)**

* Normally, if you batch `["hi", "hello there"]`, you pad the shorter sentence up to the longest length → `["hi <PAD>", "hello there"]`.
* Problem: the RNN wastes time looping through `<PAD>` tokens.
* **Packed sequences** tell PyTorch: *“these tokens after position X are just padding — skip them.”*
* You use:

  ```python
  from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
  ```
* This lets the RNN only process real tokens, which is faster and avoids learning noise.
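
To make the packing idea concrete, here is a minimal round-trip sketch. The toy tensors (`seqs`, feature size 2) are made up for illustration; only `pad_sequence`, `pack_padded_sequence`, and `pad_packed_sequence` come from PyTorch.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Two toy "sentences" as feature vectors: lengths 3 and 1, feature size 2
seqs = [torch.ones(3, 2), torch.ones(1, 2)]
lengths = torch.tensor([3, 1])

padded = pad_sequence(seqs, batch_first=True)   # [2, 3, 2], zeros as padding
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

# packed.data holds only the real time steps: 3 + 1 = 4 rows
print(packed.data.shape)     # torch.Size([4, 2])
# batch_sizes: how many sequences are still "alive" at each time step
print(packed.batch_sizes)    # tensor([2, 1, 1])

# Unpacking restores the padded layout and returns the original lengths
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
print(unpacked.shape, out_lengths)
```

The padded positions never reach the RNN: it only iterates over the 4 real rows in `packed.data`.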

---

### 2. **`batch_first=True`**

* By default, RNN input is `[seq_len, batch_size, features]`.
* That’s unintuitive.
* If you set `batch_first=True`, it becomes `[batch_size, seq_len, features]`, which matches how we usually think about data (`batch → sequence → features`).
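
A small shape check, sketched with arbitrary sizes chosen for illustration, shows the two layouts side by side. Note that `batch_first` only affects the input and output tensors, not the hidden state.

```python
import torch
import torch.nn as nn

batch_size, seq_len, features, hidden = 3, 4, 5, 6

# Default layout: [seq_len, batch_size, features]
rnn_default = nn.LSTM(features, hidden)
out_default, _ = rnn_default(torch.randn(seq_len, batch_size, features))
print(out_default.shape)   # torch.Size([4, 3, 6])

# batch_first=True: [batch_size, seq_len, features]
rnn_bf = nn.LSTM(features, hidden, batch_first=True)
out_bf, (h_n, c_n) = rnn_bf(torch.randn(batch_size, seq_len, features))
print(out_bf.shape)        # torch.Size([3, 4, 6])

# h_n and c_n stay [num_layers * num_directions, batch_size, hidden]
print(h_n.shape)           # torch.Size([1, 3, 6])
```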

---

### 3. **Masks**

* A *mask* is a boolean tensor that marks **valid tokens** vs **padding**.
* Example:

  ```
  sentence = ["I love AI", "AI rocks <PAD>"]
  mask = [[1,1,1], [1,1,0]]
  ```
* Masks are useful in:

  * Attention (don’t attend to `<PAD>`).
  * Metrics (ignore padding when computing loss/accuracy).
  * RNNs (sometimes used to zero out hidden states of padding tokens).
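
As one possible use of a mask, the sketch below builds it from sequence lengths and uses it for a padding-aware mean over time. The `feats` tensor is a hypothetical stand-in for per-token features.

```python
import torch

lengths = torch.tensor([3, 2])            # real tokens per sentence
max_len = int(lengths.max())

# mask[i, t] is True where position t is a real token of sentence i
mask = torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)   # [B, T]
print(mask)

# Hypothetical per-token features, shape [B, T, F]
feats = torch.arange(12, dtype=torch.float32).reshape(2, 3, 2)

# Masked mean over time: zero out padding, then divide by the true length
summed = (feats * mask.unsqueeze(-1)).sum(dim=1)    # [B, F]
mean = summed / lengths.unsqueeze(1)                # [B, F]
print(mean)   # tensor([[2., 3.], [7., 8.]])
```

A plain `feats.mean(dim=1)` would have averaged the padded position into the second sentence; the mask keeps it out.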

---

### 4. **Bidirectionality**

* A normal RNN/LSTM processes **left → right** only.
* But in many NLP tasks, context *after* a word is also useful.

  * Example: *“I went to the bank to deposit…”* vs *“…to sit on the bank of the river.”*
* A **Bidirectional RNN** runs two RNNs:

  * One forward (left → right).
  * One backward (right → left).
* It concatenates their outputs:
  `[forward_hidden, backward_hidden]`.
* This doubles the size of the output features (`2 × hidden_size`), but captures **past + future context**.
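
The layout of a bidirectional output can be inspected directly. This is a minimal sketch with arbitrary sizes and no padding; the names `fwd` and `bwd` are just illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = 4
lstm = nn.LSTM(input_size=3, hidden_size=H, batch_first=True, bidirectional=True)

x = torch.randn(2, 5, 3)               # [batch, seq, features], no padding
out, (h_n, c_n) = lstm(x)

print(out.shape)    # torch.Size([2, 5, 8])  -> last dim is 2 * H
print(h_n.shape)    # torch.Size([2, 2, 4])  -> [directions, batch, H]

fwd, bwd = out[..., :H], out[..., H:]  # forward half, backward half

# The forward direction finishes at the last time step,
# the backward direction finishes at the first one.
print(torch.allclose(fwd[:, -1], h_n[0]))   # True
print(torch.allclose(bwd[:, 0], h_n[1]))    # True
```

This is why a classifier typically concatenates `h_n[-2]` and `h_n[-1]` rather than taking `out[:, -1]`: the last time step only holds the backward direction's first step.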

---

### 5. **BiLSTM in Practice**

* Standard LSTM: the state at token *t* only reflects tokens up to and including *t*.
* BiLSTM: the representation at token *t* reflects tokens both before and after it.
* Common in text classification, tagging (POS, NER), sentiment analysis.

---

👉 So the **workflow** is:

1. Pad sequences to batch.
2. Pack them (`pack_padded_sequence`) so the RNN skips padding.
3. Use `batch_first=True` for clean shape.
4. Optionally use a **mask** to track valid tokens.
5. Feed into **BiLSTM** → get richer sequence representations.

In [1]:
# Setup
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from pathlib import Path
np.random.seed(42)
plt.rcParams["figure.figsize"] = (6,4)
plt.rcParams["axes.grid"] = True

import torch, torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Example toy batch (variable-length sentences, tokenized as ints)
batch = [
    torch.tensor([2, 5, 7, 9]),       # len 4
    torch.tensor([3, 8, 6]),          # len 3
    torch.tensor([4, 1])              # len 2
]

lengths = torch.tensor([len(x) for x in batch])   # [4,3,2]
padded = pad_sequence(batch, batch_first=True)    # pad with 0
print("Padded batch:\n", padded)

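# Manual packing demo: sort by length (descending), then pack.
# Note: `packed` here holds raw token IDs and is not used below;
# the model packs the embeddings itself with enforce_sorted=False.
lengths_sorted, perm_idx = lengths.sort(descending=True)
batch_sorted = padded[perm_idx]
packed = pack_padded_sequence(batch_sorted, lengths_sorted.cpu(), batch_first=True, enforce_sorted=True)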

# Define BiLSTM
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=20, embed_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, num_classes)  # *2 for bidirection
    def forward(self, x, lengths):
        embeds = self.embedding(x)
        packed = pack_padded_sequence(embeds, lengths.cpu(), batch_first=True, enforce_sorted=False)
        packed_out, (h_n, c_n) = self.lstm(packed)
        # concat last hidden states (fwd + bwd)
        h_cat = torch.cat((h_n[-2,:,:], h_n[-1,:,:]), dim=1)
        return self.fc(h_cat)

# Demo forward pass
model = BiLSTMClassifier()
out = model(padded, lengths)
print("Output logits:", out)


Padded batch:
 tensor([[2, 5, 7, 9],
        [3, 8, 6, 0],
        [4, 1, 0, 0]])
Output logits: tensor([[0.1494, 0.1817],
        [0.0884, 0.1801],
        [0.0744, 0.0794]], grad_fn=<AddmmBackward0>)



### **1. Padded batch**

```
tensor([[2, 5, 7, 9],
        [3, 8, 6, 0],
        [4, 1, 0, 0]])
```

* You had 3 sentences of different lengths (`[2,5,7,9]`, `[3,8,6]`, `[4,1]`).
* `pad_sequence` made them the same length (4) by padding with `0`.

  * Sentence 1: no padding.
  * Sentence 2: one zero.
  * Sentence 3: two zeros.
* This is what lets you batch them together.

---

### **2. Output logits**

```
tensor([[0.1494, 0.1817],
        [0.0884, 0.1801],
        [0.0744, 0.0794]], grad_fn=<AddmmBackward0>)
```

* Shape: `[batch_size, num_classes]` → here: `3 × 2`.
* Each row is the model’s **raw scores (logits)** for a sequence in the batch.

  * Example: First sentence → `[0.1494, 0.1817]`.
* These are not yet probabilities; you’d typically apply `softmax` to turn them into probabilities.
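
For illustration, the logits printed above can be turned into probabilities like this (the values are copied from the output; a freshly initialized model will give different numbers):

```python
import torch

# Logits copied from the output above
logits = torch.tensor([[0.1494, 0.1817],
                       [0.0884, 0.1801],
                       [0.0744, 0.0794]])

probs = torch.softmax(logits, dim=1)   # normalize across the class dimension
print(probs)
print(probs.sum(dim=1))                # each row sums to 1
print(probs.argmax(dim=1))             # tensor([1, 1, 1]): predicted class per sequence
```

Since the model is untrained, the probabilities sit close to 0.5 and the predictions carry no meaning yet.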

---

### **3. What happened inside**

* **Embedding layer** converted token IDs (2, 5, 7, 9, etc.) into dense vectors.
* **BiLSTM** processed sequences forward + backward, skipping padding with `pack_padded_sequence`.
* At the end, you took the **final hidden states** (from forward + backward), concatenated them (`h_cat`), and passed them through a linear layer (`fc`).
* That produced the logits above.

---

👉 In short:

* **Top block** = your padded input sequences.
* **Bottom block** = the classifier’s predictions (logits) for each sequence.



1) Core (10–15 min)

Task: Change hidden_dim from 32 → 8. Run a forward pass and check the parameter count.

In [8]:
model = BiLSTMClassifier(hidden_dim=8)
out = model(padded, lengths)
sum(p.numel() for p in model.parameters())



2018
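
The count of 2018 can be reproduced by hand. The sketch below assumes the default `vocab_size=20`, `embed_dim=16`, `num_classes=2` from the class definition and PyTorch's LSTM parameter layout (four gates, two bias vectors per direction).

```python
vocab_size, embed_dim, hidden_dim, num_classes = 20, 16, 8, 2

embedding = vocab_size * embed_dim                      # 20 * 16 = 320

# One LSTM direction: 4 gates, each with input weights and recurrent weights,
# plus two bias vectors (PyTorch keeps b_ih and b_hh separately)
per_direction = 4 * hidden_dim * (embed_dim + hidden_dim) + 2 * 4 * hidden_dim   # 832
lstm = 2 * per_direction                                # two directions: 1664

fc = (2 * hidden_dim) * num_classes + num_classes       # weights + bias: 34

total = embedding + lstm + fc
print(total)   # 2018
```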

2) Practice (10–15 min)

Task: Add nn.Dropout(0.3) after embeddings. Run forward pass and compare outputs.

In [9]:
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=20, embed_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.dropout = nn.Dropout(0.3)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, num_classes)
    def forward(self, x, lengths):
        embeds = self.dropout(self.embedding(x))
        packed = pack_padded_sequence(embeds, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        h_cat = torch.cat((h_n[-2], h_n[-1]), dim=1)
        return self.fc(h_cat)

model = BiLSTMClassifier()
print(model(padded, lengths))


tensor([[ 0.0536, -0.0072],
        [ 0.0894, -0.0778],
        [-0.0188,  0.0971]], grad_fn=<AddmmBackward0>)


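The output changes for two reasons: calling `BiLSTMClassifier()` again creates a fresh model with new random weights, and dropout is active because a new module starts in training mode, so on average 30% of the embedding values are zeroed on each forward pass. Call `model.eval()` to get deterministic outputs.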
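
To see the effect of dropout in isolation, here is a minimal sketch on a tensor of ones (a stand-in for the embeddings, chosen so the scaling is easy to read):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(0.3)
x = torch.ones(2, 8)

drop.train()                      # training mode: random elements are zeroed
a, b = drop(x), drop(x)
print(a)                          # kept values are scaled by 1 / (1 - 0.3)
print(torch.equal(a, b))          # usually False: a fresh random mask per call

drop.eval()                       # eval mode: dropout is the identity
print(torch.equal(drop(x), x))    # True
```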

3) Stretch (optional, 10–15 min)

Task: Add a mask (0 for padding) and print token-level embeddings only for real tokens.

In [10]:
mask = (padded != 0).int()
print("Mask:\n", mask)

Mask:
 tensor([[1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0]], dtype=torch.int32)
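
The cell above only prints the mask. One way to finish the task, sketched here with a standalone `nn.Embedding` (same sizes as the classifier's embedding layer, but separately initialized), is to index the embeddings with a boolean mask:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
padded = torch.tensor([[2, 5, 7, 9],
                       [3, 8, 6, 0],
                       [4, 1, 0, 0]])
mask = padded != 0                                  # boolean mask, [B, T]

embedding = nn.Embedding(20, 16, padding_idx=0)
embeds = embedding(padded)                          # [B, T, E] = [3, 4, 16]

# Boolean indexing keeps only positions where mask is True
real_embeds = embeds[mask]                          # [num_real_tokens, E]
print(real_embeds.shape)                            # torch.Size([9, 16])

# Per-sentence view: slice each row by its true length
for i, n in enumerate(mask.sum(dim=1).tolist()):
    print(f"sentence {i}: {n} tokens ->", embeds[i, :n].shape)
```

Boolean indexing flattens the batch, so use the per-sentence slices when you need to know which sentence each embedding came from.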


Mini-Challenge (≤40 min)

Task: Train a small BiLSTM sentiment classifier on a toy dataset (e.g., 6–10 sentences labeled pos/neg).

Build vocab + pad batches.

Use bidirectional=True.

Evaluate on a held-out mini test set.

Acceptance Criteria:

Code runs end-to-end (train + eval).

Model predicts correctly at least 70% on test set.

Prints confusion matrix.

In [18]:
# Setup
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from pathlib import Path
np.random.seed(42)
plt.rcParams["figure.figsize"] = (6,4)
plt.rcParams["axes.grid"] = True

import torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence
from collections import Counter
from sklearn.metrics import confusion_matrix, accuracy_score

device = torch.device("cpu")

# ----------------------------
# 1) Toy dataset (POS=1, NEG=0)
# ----------------------------
data = [
    ("i love this movie so much", 1),
    ("absolutely fantastic acting and soundtrack", 1),
    ("what a great experience highly recommend", 1),
    ("this was surprisingly good and fun", 1),
    ("i really enjoyed the story and characters", 1),
    ("boring plot and weak performances", 0),
    ("i hated every minute of it", 0),
    ("terrible script worst film ever", 0),
    ("not good the ending was awful", 0),
    ("mediocre at best would not recommend", 0),
    ("wonderful visuals and touching moments", 1),
    ("bad pacing and confusing scenes", 0),
]

# simple train/test split (80/20)
rng = np.random.RandomState(0)
perm = rng.permutation(len(data))
data = [data[i] for i in perm]
split = int(0.8 * len(data))
train_data, test_data = data[:split], data[split:]

# ----------------------------
# 2) Vocab + numericalization
# ----------------------------
PAD, UNK = "<PAD>", "<UNK>"

def tokenize(s): 
    return s.lower().strip().split()

counter = Counter()
for text, _ in train_data:
    counter.update(tokenize(text))

itos = [PAD, UNK] + [w for w, c in counter.items() if c >= 1]
stoi = {w:i for i, w in enumerate(itos)}

def numericalize(text):
    return torch.tensor([stoi.get(tok, stoi[UNK]) for tok in tokenize(text)], dtype=torch.long)

# ----------------------------
# 3) Dataset + collate (pad + lengths)
# ----------------------------
class SentDataset(Dataset):
    def __init__(self, pairs): 
        self.x = [numericalize(t) for t,_ in pairs]
        self.y = [torch.tensor(lbl, dtype=torch.long) for _,lbl in pairs]
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i], self.y[i]

def collate_fn(batch):
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs], dtype=torch.long)
    padded = pad_sequence(seqs, batch_first=True, padding_value=stoi[PAD])
    labels = torch.stack(labels)
    return padded.to(device), lengths.to(device), labels.to(device)

train_ds, test_ds = SentDataset(train_data), SentDataset(test_data)
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True, collate_fn=collate_fn)
test_loader  = DataLoader(test_ds,  batch_size=4, shuffle=False, collate_fn=collate_fn)

# ----------------------------
# 4) BiLSTM model
# ----------------------------
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=32, num_classes=2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, num_classes)

    def forward(self, padded, lengths):
        emb = self.embedding(padded)                                 # [B, T, E]
        packed = pack_padded_sequence(emb, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)                              # h_n: [num_dirs*layers, B, H]
        h_cat = torch.cat((h_n[-2], h_n[-1]), dim=1)                 # [B, 2H]  (fwd + bwd)
        return self.fc(h_cat)                                        # [B, C]

vocab_size = len(itos)
model = BiLSTMClassifier(vocab_size=vocab_size, embed_dim=32, hidden_dim=32, num_classes=2, pad_idx=stoi[PAD]).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# ----------------------------
# 5) Train
# ----------------------------
def train_epoch():
    model.train()
    total_loss = 0.0
    for padded, lengths, labels in train_loader:
        logits = model(padded, lengths)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
    return total_loss / len(train_ds)

def evaluate(loader):
    model.eval()
    all_y, all_p = [], []
    with torch.no_grad():
        for padded, lengths, labels in loader:
            logits = model(padded, lengths)
            preds = logits.argmax(dim=1)
            all_y.extend(labels.cpu().tolist())
            all_p.extend(preds.cpu().tolist())
    acc = accuracy_score(all_y, all_p)
    cm  = confusion_matrix(all_y, all_p, labels=[0,1])
    return acc, cm, (all_y, all_p)

epochs = 5
for ep in range(1, epochs+1):
    tr_loss = train_epoch()
    acc, _, _ = evaluate(test_loader)
    if ep % 5 == 0 or ep == 1:
        print(f"Epoch {ep:02d} | train_loss={tr_loss:.4f} | test_acc={acc*100:.1f}%")

# ----------------------------
# 6) Final evaluation
# ----------------------------
test_acc, cm, (y_true, y_pred) = evaluate(test_loader)
print("\nFinal Test Accuracy: {:.1f}%".format(test_acc*100))
print("Confusion Matrix (rows=true, cols=pred) [NEG, POS]:\n", cm)


Epoch 01 | train_loss=0.6876 | test_acc=66.7%
Epoch 05 | train_loss=0.5875 | test_acc=66.7%

Final Test Accuracy: 66.7%
Confusion Matrix (rows=true, cols=pred) [NEG, POS]:
 [[0 1]
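 [0 2]]

Note: after 5 epochs the model predicts POS for all three test sentences, so the 66.7% accuracy falls short of the 70% acceptance criterion. With only three test sentences, meeting it requires classifying all three correctly.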


Notes / Key Takeaways

Use pad_sequence + pack_padded_sequence to skip padding.

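Either sort the batch by length (descending) before packing, or set enforce_sorted=False.

BiLSTMs double the output feature size (concat fwd+bwd), so the next layer takes hidden_dim*2 inputs.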

Masks prevent padded tokens from affecting training.

Dropout + weight decay help regularize.

Packed sequences = efficiency + correctness.

Reflection

Why does bidirectionality improve NLP tasks?

What happens if you forget to pack padded sequences?

1. Why does bidirectionality improve NLP tasks?

A normal LSTM reads text left → right, so when it sees a word, it only knows what came before it.

But meaning often depends on what comes after.

Example: “I went to the bank … to deposit money” vs “… to sit on the river bank”.

A BiLSTM runs two LSTMs:

One forward (past → present).

One backward (future → present).

Their hidden states are concatenated, so at each word the model has context from both directions.

This usually boosts accuracy in classification, tagging, and sentiment analysis.

2. What happens if you forget to pack padded sequences?

If you feed padded batches directly into the LSTM:

The model will process <PAD> tokens as if they were real words.

That adds noise: the network wastes capacity trying to learn from meaningless padding.

It can also bias hidden states: short sequences padded to a long batch length all end with the same run of <PAD> tokens, so their final states end up looking “similar.”

With pack_padded_sequence, the LSTM skips over padding, so hidden states reflect only real tokens.

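Forgetting to pack can slow training and hurt accuracy. For the classifier above in particular, h_n would be read at the last padded position rather than at the last real token, and the backward direction would start from <PAD>.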
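
The difference can be checked numerically. The sketch below uses a small, hypothetical embedding + BiLSTM pair (sizes chosen arbitrarily) and compares the final hidden state of a short sentence in three settings: run alone, run in a packed batch, and run in a padded batch without packing.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
emb = nn.Embedding(10, 4, padding_idx=0)
lstm = nn.LSTM(4, 3, batch_first=True, bidirectional=True)

padded = torch.tensor([[2, 5, 7, 9],
                       [4, 1, 0, 0]])       # second row has two <PAD> tokens
lengths = torch.tensor([4, 2])

with torch.no_grad():
    x = emb(padded)
    # Reference: run the short sentence alone, with no padding at all
    _, (h_alone, _) = lstm(x[1:2, :2])
    # Packed: the LSTM stops at each sequence's true length
    packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)
    # Not packed: the LSTM also steps through the <PAD> positions
    _, (h_padded, _) = lstm(x)

print(torch.allclose(h_packed[:, 1], h_alone[:, 0], atol=1e-5))   # True
print(torch.allclose(h_padded[:, 1], h_alone[:, 0], atol=1e-5))   # False
```

Even though `padding_idx=0` makes the `<PAD>` embedding a zero vector, the LSTM's biases and recurrent weights still update the state at those steps, which is why the unpacked result drifts.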