Week 9 · Day 1 — Text to Tensors: Tokenization, Padding, Masking
Why this matters

Neural networks need numeric tensors, not raw text. Tokenization, vocabulary building, and padding are the first steps to make text data usable for RNNs or Transformers.

Theory Essentials

Tokenization: split sentences into words/subwords.

Vocabulary: map tokens → unique integer IDs.

Numericalization: replace tokens with IDs.

Padding: sequences in a batch must be the same length → add special <PAD> tokens.

Masking: mark real tokens vs padding so the model ignores the filler.

Collate function: prepares padded batches for the DataLoader.


* **Tokenization:** `"I love pizza"` → `["I", "love", "pizza"]` (or even subwords like `"pi"`, `"zza"`).
* **Vocabulary:** Build a dictionary mapping each token to an integer: `{"I":1, "love":2, "pizza":3, "<PAD>":0, ...}`.
* **Numericalization:** Convert tokens to IDs → `[1, 2, 3]`.
* **Padding:** Sentences differ in length, so pad shorter ones with `<PAD>` until they match the longest.

  * Ex: `"I love pizza"` padded to length 5 → `[1, 2, 3, 0, 0]`.
* **Masking:** Create a binary mask that says which positions are “real words” vs padding.

  * Ex: `[1, 1, 1, 0, 0]`.
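* **Collate function:** In PyTorch, this is the batch-assembly function you pass to the `DataLoader`. Padding is not automatic: you write the function so that it pads the sequences in a batch to the same length and returns the padded IDs together with the true lengths (or an attention mask).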
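
The bullets above can be traced end to end in plain Python. This is a minimal sketch that reuses the toy vocabulary from the Vocabulary bullet; the target length of 5 is an assumption chosen to match the padding example, not something the pipeline fixes.

```python
# Toy walk-through: tokens -> IDs -> padded row + mask.
PAD_ID = 0
vocab = {"<PAD>": PAD_ID, "I": 1, "love": 2, "pizza": 3}

tokens = "I love pizza".split()                      # tokenization
ids = [vocab[t] for t in tokens]                     # numericalization
target_len = 5                                       # assumed batch length
padded = ids + [PAD_ID] * (target_len - len(ids))    # padding
mask = [int(i != PAD_ID) for i in padded]            # 1 = real token, 0 = padding

print(tokens)  # ['I', 'love', 'pizza']
print(padded)  # [1, 2, 3, 0, 0]
print(mask)    # [1, 1, 1, 0, 0]
```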

---

👉 So the visual picture is:
**Raw text → tokens → IDs → padded matrix + mask.**


In [3]:
# Setup
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from pathlib import Path
np.random.seed(42)
plt.rcParams["figure.figsize"] = (6,4)
plt.rcParams["axes.grid"] = True

import torch
from torch.utils.data import Dataset, DataLoader
from collections import Counter

# Example corpus
texts = ["I love deep learning",
         "Deep learning loves PyTorch",
         "PyTorch is powerful for text"]

labels = [1,1,0]  # pretend sentiment labels

# 1) Tokenize
def tokenize(text):
    return text.lower().split()

tokenized = [tokenize(t) for t in texts]

# 2) Build vocab
counter = Counter(token for sent in tokenized for token in sent)
vocab = {word: i+2 for i, (word,_) in enumerate(counter.most_common())}
vocab["<PAD>"] = 0
vocab["<UNK>"] = 1

# 3) Numericalize
def numericalize(tokens):
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

numericalized = [numericalize(sent) for sent in tokenized]

# 4) Dataset
class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = [numericalize(tokenize(t)) for t in texts]
        self.labels = labels
    def __len__(self): return len(self.texts)
    def __getitem__(self, idx): 
        return self.texts[idx], self.labels[idx]

dataset = TextDataset(texts, labels)

# 5) Collate with padding
def collate_fn(batch):
    sequences, labels = zip(*batch)
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [vocab["<PAD>"]] * (max_len - len(seq)) for seq in sequences]
    return torch.tensor(padded), torch.tensor(lengths), torch.tensor(labels)

loader = DataLoader(dataset, batch_size=2, collate_fn=collate_fn, shuffle=True)

# Inspect a batch
for batch in loader:
    tokens, lengths, labels = batch
    print("Tokens:\n", tokens)
    print("Lengths:", lengths)
    print("Labels:", labels)
    break


Tokens:
 tensor([[ 4,  8,  9, 10, 11],
        [ 5,  6,  2,  3,  0]])
Lengths: tensor([5, 4])
Labels: tensor([0, 1])



### Flow of your code

**1. Raw text input**

```python
"I love deep learning"
```

**2. Tokenization**
Break the text into tokens (words/subwords); here the tokenizer lowercases and splits on whitespace.
✅ Tokens ≠ letters. They’re units defined by the tokenizer.

```python
["i", "love", "deep", "learning"]
```

**3. Vocabulary lookup**
Each token is mapped to an **integer ID** from the vocab.

```python
[5, 6, 2, 3]
```

**4. Dataset**
Now each sentence is a list of numbers.
Labels are kept alongside.

```
([5, 6, 2, 3], label=1)
```

**5. Collate (when batching)**
Sentences in a batch have **different lengths**, so you:

* Find the longest sentence in the batch.
* Add `<PAD>` tokens (ID=0) to shorter ones.

Example with batch size=2:

```
Sentence A: [5, 6, 2, 3]  
Sentence B: [7, 8] → pad to [7, 8, 0, 0]  
```

**6. Outputs from DataLoader**
Now you have three things:

* `tokens` → tensor with padded IDs, shape `[batch_size, max_len]`.
* `lengths` → true lengths before padding.
* `labels` → target labels.

This is what a model will use as input.
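
The `lengths` tensor is enough to rebuild a padding mask when a model needs one. The sketch below copies `tokens` and `lengths` from the sample batch printed earlier and assumes `<PAD>` has ID 0; the names `mask_from_lengths` and `mask_from_ids` are illustrative.

```python
import torch

PAD_ID = 0
tokens = torch.tensor([[4, 8, 9, 10, 11],
                       [5, 6, 2, 3, 0]])
lengths = torch.tensor([5, 4])

# Position index of every column: [0, 1, ..., max_len - 1]
positions = torch.arange(tokens.size(1))

# True where the position is smaller than the true length of that row
mask_from_lengths = positions.unsqueeze(0) < lengths.unsqueeze(1)   # shape [B, L]

# Equivalent here, because ID 0 is used only for padding
mask_from_ids = tokens != PAD_ID

print(mask_from_lengths.int())   # rows: [1, 1, 1, 1, 1] and [1, 1, 1, 1, 0]
```

Both routes give the same mask as long as the pad ID never appears as a real token.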

---

### Why do this?

Neural networks can’t read `"love"` directly — they only understand numbers.

* **Tokenization** = break text into pieces.
* **Vocabulary** = give each piece a unique number.
* **Padding** = align lengths so we can stack sentences into a matrix.
* **Masking (tomorrow)** = tell the model which parts are real vs just padding.

---

👉 **Big picture:**
Your code is an **input pipeline**:
**Text → tokens → IDs → padded batch tensor**
ready to be fed into RNNs or Transformers.


1) Core (10–15 min)
Task: Add a new sentence "I love PyTorch" and verify it is correctly tokenized, numericalized, and padded in a batch.

In [13]:
texts.append("I love PyTorch")
labels = [1,1,0,1]
dataset = TextDataset(texts, labels)
loader = DataLoader(dataset, batch_size=4, collate_fn=collate_fn)
for tokens,lengths,labels in loader:
    print(tokens, lengths, labels)
    break


tensor([[ 5,  6,  2,  3,  0],
        [ 2,  3,  7,  4,  0],
        [ 4,  8,  9, 10, 11],
        [ 5,  6,  4,  0,  0]]) tensor([4, 4, 5, 3]) tensor([1, 1, 0, 1])
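
One way to verify a batch like the one above is to map the IDs back to tokens. The sketch below rebuilds the same vocabulary from the original three sentences; `decode` and `id_to_token` are hypothetical helpers, not part of the notebook's code.

```python
from collections import Counter

texts = ["I love deep learning",
         "Deep learning loves PyTorch",
         "PyTorch is powerful for text"]

# Same vocabulary construction as in the setup cell
tokenized = [t.lower().split() for t in texts]
counter = Counter(tok for sent in tokenized for tok in sent)
vocab = {"<PAD>": 0, "<UNK>": 1}
vocab.update({word: i + 2 for i, (word, _) in enumerate(counter.most_common())})

# Inverse mapping: ID -> token
id_to_token = {idx: tok for tok, idx in vocab.items()}

def decode(ids, skip_pad=True):
    """Hypothetical helper: map IDs back to tokens, optionally dropping <PAD>."""
    pad_id = vocab["<PAD>"]
    return [id_to_token[i] for i in ids if not (skip_pad and i == pad_id)]

print(decode([5, 6, 4, 0, 0]))   # ['i', 'love', 'pytorch']
```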


In [16]:
def collate_fixed(batch):
    sequences, labels = zip(*batch)
    # Pad every sequence to the longest sequence in the whole dataset
    padded = [seq + [vocab["<PAD>"]] * (dataset_max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return torch.tensor(padded), torch.tensor(lengths), torch.tensor(labels)

# "I love PyTorch" was already appended in the previous cell; appending it again
# would leave 5 texts but only 4 labels.
labels = [1,1,0,1]
dataset = TextDataset(texts, labels)
dataset_max_len = max(len(seq) for seq,_ in dataset)  # computed on the current dataset
loader = DataLoader(dataset, batch_size=2, collate_fn=collate_fixed)
for tokens,lengths,labels in loader:
    print(tokens, lengths, labels)
    break


tensor([[5, 6, 2, 3, 0],
        [2, 3, 7, 4, 0]]) tensor([4, 4]) tensor([1, 1])
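
Padding to the dataset-wide maximum gives every batch the same shape, at the cost of extra `<PAD>` positions. The sketch below counts them for both strategies, using the sequence lengths printed in the Core task; the two helper functions are illustrative only.

```python
lengths = [4, 4, 5, 3]        # sequence lengths from the four-sentence corpus
batch_size = 2
dataset_max_len = max(lengths)

def pad_count_dynamic(lengths, batch_size):
    """<PAD> positions when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

def pad_count_fixed(lengths, max_len):
    """<PAD> positions when every sequence is padded to one fixed length."""
    return sum(max_len - n for n in lengths)

print(pad_count_dynamic(lengths, batch_size))     # 2
print(pad_count_fixed(lengths, dataset_max_len))  # 4
```

On a corpus this small the difference is negligible; it grows when a few sequences are much longer than the rest.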


3) Stretch (optional, 10–15 min)
Task: Add an <UNK> test case by inserting a word not in vocab (e.g., "transformers").

In [17]:
print(numericalize(tokenize("I love transformers")))
# Should show ... 1 for <UNK>


[5, 6, 1]


Mini-Challenge (≤40 min)

Build a reusable pipeline:

Input: list of sentences + labels.

Output: DataLoader yielding (tokens, lengths, labels) with correct padding.

Acceptance Criteria: Your pipeline works on any small corpus, handles <UNK> tokens, and batch shapes are consistent.

In [18]:
# --- Reusable Text → Tensors pipeline (tokens, lengths, labels) ---

# 0) Config
PAD, UNK = "<PAD>", "<UNK>"

# 1) Tokenizer (swap this if you want subwords/chars later)
def basic_tokenize(text: str):
    return text.lower().split()

# 2) Vocab builder
from collections import Counter

class Vocab:
    def __init__(self, counter, min_freq=1, specials=(PAD, UNK)):
        self.itos = []
        self.stoi = {}

        # specials first (fixed ids)
        for sp in specials:
            self._add(sp)

        # add tokens by frequency
        for tok, freq in counter.most_common():
            if freq >= min_freq and tok not in specials:
                self._add(tok)

        self.pad_id = self.stoi[PAD]
        self.unk_id = self.stoi[UNK]

    def _add(self, tok):
        if tok not in self.stoi:
            self.stoi[tok] = len(self.itos)
            self.itos.append(tok)

    def numericalize(self, tokens):
        u = self.unk_id
        return [self.stoi.get(t, u) for t in tokens]

# 3) Dataset
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer=basic_tokenize, vocab=None, min_freq=1):
        # tokenize
        tokenized = [tokenizer(t) for t in texts]

        # build vocab if not provided
        if vocab is None:
            counter = Counter(tok for sent in tokenized for tok in sent)
            self.vocab = Vocab(counter, min_freq=min_freq)
        else:
            self.vocab = vocab

        # numericalize
        self.seqs = [torch.tensor(self.vocab.numericalize(s), dtype=torch.long)
                     for s in tokenized]
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self): return len(self.seqs)
    def __getitem__(self, i): return self.seqs[i], self.labels[i]

# 4) Collate: pad within-batch, return (tokens, lengths, labels)
def collate_pad(batch, pad_id):
    seqs, labels = zip(*batch)                 # tuples of tensors
    lengths = torch.tensor([len(s) for s in seqs], dtype=torch.long)
    tokens  = pad_sequence(seqs, batch_first=True, padding_value=pad_id)
    labels  = torch.stack(labels)
    return tokens, lengths, labels

# 5) Convenience factory
from functools import partial

def make_text_loader(texts, labels, batch_size=4, tokenizer=basic_tokenize,
                     min_freq=1, shuffle=True):
    ds = TextDataset(texts, labels, tokenizer=tokenizer, vocab=None, min_freq=min_freq)
    # partial (unlike a lambda) can be pickled, so this also works with num_workers > 0
    loader = DataLoader(ds, batch_size=batch_size, shuffle=shuffle,
                        collate_fn=partial(collate_pad, pad_id=ds.vocab.pad_id))
    return loader, ds.vocab

# --- Example usage (works with any small corpus) ---
texts = ["I love deep learning",
         "Deep learning loves PyTorch",
         "PyTorch is powerful for text"]

labels = [1, 1, 0]

loader, vocab = make_text_loader(texts, labels, batch_size=2)

for tokens, lengths, labels in loader:
    print("tokens:\n", tokens)      # [B, L] padded with PAD id
    print("lengths:", lengths)      # original lengths
    print("labels:", labels)        # your targets
    print("PAD id:", vocab.pad_id, "| UNK id:", vocab.unk_id)
    break


tokens:
 tensor([[ 4,  8,  9, 10, 11],
        [ 2,  3,  7,  4,  0]])
lengths: tensor([5, 4])
labels: tensor([0, 1])
PAD id: 0 | UNK id: 1


Notes / Key Takeaways

Text must be converted into integers before models can use it.

Padding aligns sequence lengths within a batch; masking keeps the padding from influencing the model.

<PAD> and <UNK> tokens are essential in a word-level pipeline like this one.

Collate functions give flexibility in batch preparation.

The pipeline you wrote today will be reused in RNN/LSTM training.

Sequence length tracking is critical for packed sequences (coming next).

This setup is the foundation for embeddings.
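
As a preview of where `tokens` and `lengths` go next, here is a minimal sketch, assuming a vocabulary of 12 IDs with `<PAD>` at index 0 as in today's output. The embedding size of 8 is arbitrary, and the tensors are copied from the sample batch rather than produced by a trained model.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

tokens = torch.tensor([[4, 8, 9, 10, 11],
                       [5, 6, 2, 3, 0]])
lengths = torch.tensor([5, 4])

# padding_idx keeps the <PAD> row at zero and excludes it from gradient updates
emb = nn.Embedding(num_embeddings=12, embedding_dim=8, padding_idx=0)
embedded = emb(tokens)                      # shape [2, 5, 8]

# Packing uses the true lengths so an RNN would skip the padded positions
packed = pack_padded_sequence(embedded, lengths,
                              batch_first=True, enforce_sorted=False)

print(embedded.shape)      # torch.Size([2, 5, 8])
print(packed.data.shape)   # torch.Size([9, 8]), i.e. 5 + 4 real tokens
```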

Reflection

Can I explain why padding is needed in minibatches?

How would the model behave if we did not include <UNK>?

Can I explain why padding is needed in minibatches?
Yes. Sentences have variable lengths, but a minibatch tensor must be rectangular: every row of the [batch_size, seq_len] matrix needs the same length. Padding with <PAD> makes shorter sentences match the longest one in the batch. Without padding (or an alternative such as batching only sentences of equal length), you’d have to process one sentence at a time, which is inefficient.
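
A quick way to see the shape constraint, sketched with two ID lists of different lengths (the IDs follow today's vocabulary, with 0 assumed to be `<PAD>`):

```python
import torch

ragged = [[5, 6, 2, 3], [5, 6, 4]]   # lengths 4 and 3

stack_failed = False
try:
    torch.tensor(ragged)             # rows of unequal length cannot form a matrix
except ValueError as err:
    stack_failed = True
    print("cannot build tensor:", err)

# After padding to the longest row, the same data stacks cleanly
max_len = max(len(seq) for seq in ragged)
padded = [seq + [0] * (max_len - len(seq)) for seq in ragged]
batch = torch.tensor(padded)
print(batch.shape)                   # torch.Size([2, 4])
```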

How would the model behave if we did not include <UNK>?
If <UNK> (unknown token) isn’t included, any word that was not seen when the vocabulary was built has no ID, so the lookup fails (a KeyError with a plain dict). With <UNK>, all unseen words map to one placeholder ID, so the model can still run; it simply treats those words as “unknown” instead of crashing.
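
The difference can be sketched with a plain dict. The vocabulary below is a cut-down assumption containing only the entries needed for the Stretch sentence.

```python
vocab = {"<PAD>": 0, "<UNK>": 1, "i": 5, "love": 6}
tokens = "I love transformers".lower().split()

# Strict lookup: fails on the first out-of-vocabulary token
failed_token = None
try:
    ids_strict = [vocab[t] for t in tokens]
except KeyError as err:
    failed_token = err.args[0]
    print("no <UNK> fallback, lookup failed for:", failed_token)

# Lookup with fallback: unseen tokens map to the <UNK> ID
ids_with_unk = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
print(ids_with_unk)   # [5, 6, 1]
```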