# Tokenization

Tokenizers are the bridge between human text and AI models. Since neural networks only understand numbers, we need to convert text into numerical representations. This process involves splitting text into smaller pieces (tokens), converting tokens to numbers (IDs), and adding special markers for the model.

In practice, the process has four steps:

1. **Split** text into smaller units (tokens)
2. **Map** tokens to unique IDs from a vocabulary
3. **Add** special tokens that give the model instructions
4. **Create** attention masks to handle batches efficiently
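
The four steps above can be sketched in plain Python. This is only a toy illustration, not how any real tokenizer is implemented: the vocabulary `TOY_VOCAB`, its IDs, the `toy_encode` helper, and the fixed `max_length` are all invented for this example.

```python
import re

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "hello": 4, "world": 5, ",": 6, "!": 7}

def toy_encode(text, max_length=8):
    # 1. Split: lowercase the text, then separate words from punctuation.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # 2. Map: look up each token, falling back to [UNK] when it is missing.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # 3. Add: wrap the sequence in special start and end markers.
    ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    # 4. Create: pad to max_length and build the mask (1 = real, 0 = padding).
    n_pad = max(0, max_length - len(ids))
    mask = [1] * len(ids) + [0] * n_pad
    ids = ids + [TOY_VOCAB["[PAD]"]] * n_pad
    return tokens, ids, mask

tokens, ids, mask = toy_encode("Hello, world!")
print(tokens)  # ['hello', ',', 'world', '!']
print(ids)     # [2, 4, 6, 5, 7, 3, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 1, 0, 0]
```

Real tokenizers differ in every step (subword splitting, much larger vocabularies, model-specific special tokens), but the overall shape of the pipeline is the same.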

# Setting Up Our Tokenizers

First, let's load a few different tokenizers from the transformers library. Different models use different tokenization strategies, and it's important to see how they compare. We'll look at four of the most common ones: **BERT**, **RoBERTa**, **GPT-2**, and **T5**.

In [None]:
from transformers import AutoTokenizer
from google.colab import output

# Load tokenizers for different models
tokenizers = {
    "BERT": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "GPT-2": AutoTokenizer.from_pretrained("gpt2"),
    "RoBERTa": AutoTokenizer.from_pretrained("roberta-base"),
    "T5": AutoTokenizer.from_pretrained("t5-small")
}

output.clear() # Clears out all of those fun progress bars.

print("✅ Loaded tokenizers for comparison:")
for name, tokenizer in tokenizers.items():
    print(f"  - {name}: Vocabulary size = {tokenizer.vocab_size:,}")

## Vocabulary

Each tokenizer has a fixed vocabulary. BERT's `bert-base-uncased` tokenizer, for instance, knows about 30,000 unique tokens (30,522), while GPT-2 knows about 50,000 (50,257). A word that is not in the vocabulary is broken into smaller pieces that are, or is mapped to an unknown token if no such split exists.

# Basic Tokenization

The most basic step is splitting a piece of text into a list of tokens. For BERT, common words like "hello" become a single token, but less common words like "tokenization" are often broken into smaller subwords (`token` and `##ization`).

The `##` prefix in BERT's tokenizer signifies that the token is a continuation of the previous one. Let's see this in action.

A few notes on the setup: `AutoTokenizer` automatically loads the right tokenizer type for the model name you pass it, and `"bert-base-uncased"` refers to the BERT tokenizer that lowercases its input. Each model has its own tokenizer, and they are not interchangeable.
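
To make the `##` continuation convention described above more concrete, here is a minimal sketch of a greedy longest-match-first split, which is the general idea behind WordPiece-style tokenizers. The `wordpiece_split` helper and the tiny `toy_vocab` are hypothetical and written for this example; the real BERT tokenizer also handles normalization, punctuation, and length limits that are left out here.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, invented for illustration.
toy_vocab = {"hello", "token", "##ization", "##s"}

print(wordpiece_split("hello", toy_vocab))         # ['hello']
print(wordpiece_split("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_split("xyz", toy_vocab))           # ['[UNK]']
```

Because the match is greedy, the split depends entirely on what the vocabulary contains, which is one reason the same word can be split differently by different tokenizers.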

In [None]:
text = "Hello, world! Tokenization is the process of converting text into tokens." # @param {"type":"string","placeholder":"Sample Text"}
tokenizer_id = "BERT" # @param ["BERT","GPT-2","T5","RoBERTa"]
tokenizer = tokenizers[tokenizer_id]
tokens = tokenizer.tokenize(text)

print(tokens)

**Try this**: Modify the text variable in the cell above and rerun it. Try using a sentence with more complex or unusual words to see how the tokenizers handle it. Also, try out some different tokenizers and compare the results.

# From Tokens to IDs: Preparing for the Model

Now that we have tokens, we need to convert them into numbers that the model can actually use. Each token in a tokenizer's vocabulary has a unique ID.

The `.encode()` method handles both tokenizing and converting to IDs in one step. It also adds the special tokens that the model needs to understand the sequence's structure.

In [None]:
# Let's use the BERT tokenizer as our main example
tokenizer = tokenizers["BERT"]
text = "Hello, world!"

# The encode method adds special tokens and converts to IDs
token_ids = tokenizer.encode(text)

print(f"Original Text: {text}")
print(f"Token IDs: {token_ids}")

Each token has a unique ID: "hello" → 7592, "world" → 2088. Numbers are what the model actually processes.

# Decoding Back to Text and Special Tokens

Decoding reverses the process (IDs → tokens → text) and is useful for understanding model outputs.

BERT adds special tokens: **[CLS]** (101) for start of sequence used for classification, **[SEP]** (102) as separator between sentences, and **[PAD]** (0) for padding during batch processing.

Example: "Hello" becomes [101, 7592, 102] → [CLS] hello [SEP]

In [None]:
decoded = tokenizer.decode(token_ids)

print("Decoded:", decoded)

Special tokens are crucial for model performance:

- `[CLS]`: Marks the beginning of input (used for classification)
- `[SEP]`: Separates different segments (e.g., question from context)
- `[PAD]`: Fills shorter sequences to match batch length
- `[UNK]`: Replaces unknown words not in vocabulary
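
The reverse mapping and the role of special tokens can be illustrated with a small pure-Python sketch. Everything here is a toy: `TOY_VOCAB`, `toy_decode`, and the IDs for `[UNK]`, `token`, and `##ization` are invented for the example, while the IDs for `[PAD]`, `[CLS]`, `[SEP]`, `hello`, and `world` follow the values quoted in this section. The `skip_special_tokens` flag is loosely modeled on the option of the same name in the transformers `decode` method.

```python
# Toy vocabulary: some IDs mirror the values quoted above, the rest are made up.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 101, "[SEP]": 102,
             "hello": 7592, "world": 2088, "token": 3, "##ization": 4}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}
SPECIAL = {"[PAD]", "[UNK]", "[CLS]", "[SEP]"}

def toy_decode(ids, skip_special_tokens=False):
    # IDs -> tokens
    tokens = [ID_TO_TOKEN.get(i, "[UNK]") for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL]
    # tokens -> text
    text = ""
    for tok in tokens:
        if tok.startswith("##"):
            text += tok[2:]      # glue continuation pieces onto the previous token
        elif text:
            text += " " + tok
        else:
            text = tok
    return text

ids = [101, 7592, 2088, 3, 4, 102, 0, 0]
print(toy_decode(ids))                            # [CLS] hello world tokenization [SEP] [PAD] [PAD]
print(toy_decode(ids, skip_special_tokens=True))  # hello world tokenization
```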

# Batch Processing

Models are most efficient when they process multiple texts at once (a *batch*). However, a batch of texts will likely have different lengths. To solve this, we pad the shorter sequences with the `[PAD]` token until they are all as long as the longest sequence.

But if we do this, how does the model know to ignore the padding? That's where the attention mask comes in.

# Attention Masks and Batch Processing

**Attention masks** tell the model which tokens to pay attention to: 1 means real token (pay attention) and 0 means padding token (ignore).

For **batch processing**, all sequences must be the same length. For example, with the lowercasing BERT tokenizer, "Hello" becomes [CLS] hello [SEP] [PAD] [PAD], "Hello world" becomes [CLS] hello world [SEP] [PAD], and "Hello world today" becomes [CLS] hello world today [SEP].

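```python
# Example batch with sequences of different lengths
batch_texts = ["Hello", "Hello world", "Hello world today"]

batch_encoding = tokenizer(
    batch_texts,
    padding=True,        # pad to the longest sequence in the batch
    truncation=True,     # cut sequences longer than the max length
    return_tensors="pt"  # return PyTorch tensors instead of lists
)
```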

`padding=True` adds [PAD] tokens to make all sequences the same length, `truncation=True` cuts sequences longer than max length, and `return_tensors="pt"` returns PyTorch tensors (not lists).
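
What padding and truncation do to a batch can be sketched without any library. The `pad_and_truncate` helper below is hypothetical and deliberately simplified: it truncates by slicing, whereas a real tokenizer takes care to keep the closing `[SEP]` token. The ID `9999` for "today" is made up; the other IDs follow the values quoted earlier in this section.

```python
PAD_ID = 0  # matches the [PAD] ID quoted for BERT in this section

def pad_and_truncate(batch_ids, max_length=None):
    """Pad every sequence to a common length and return IDs plus attention masks."""
    if max_length is not None:
        # Simplified truncation: a plain slice, which may drop the final [SEP].
        batch_ids = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in batch_ids)  # the longest sequence sets the length
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 7592, 102],              # [CLS] hello [SEP]
    [101, 7592, 2088, 102],        # [CLS] hello world [SEP]
    [101, 7592, 2088, 9999, 102],  # [CLS] hello world today [SEP] (9999 is invented)
]
padded = pad_and_truncate(batch)
for row, mask_row in zip(padded["input_ids"], padded["attention_mask"]):
    print(row, mask_row)
```

Every row in `padded["input_ids"]` now has the same length, so the batch could be stacked into a single rectangular tensor.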

Let's say you have a magic flashlight and a sentence like this:

> The cat sat on the mat.

Your magic flashlight should only shine on **important words**. You don't want to waste light on words that aren't real or are just there to fill space.

Sometimes, you’re given a sentence like:

> The cat sat

But to make it the same size as other sentences, your grown-up teacher adds some blank spaces like this:

> The cat sat `[PAD]` `[PAD]` `[PAD]`

You only want to look at the real words, not the `[PAD]` ones.

An **attention mask** is like a little map that says:

> "Hey brain! Only pay attention to the **real** words. Ignore the fake ones!"

So you get a mask like this:

```
[1, 1, 1, 0, 0, 0]
```

Each number is like a light switch:

- `1` = "Look at this word!"
- `0` = "Skip this one. It's just padding."


## Why Do We Need It?

Because the model is trying to learn what words mean and how they connect—but it would get confused if it starts thinking `[PAD]` means something. Attention masks help it **focus** only on the real stuff.

So the model says: “Got it! I'll only pay attention to the first 3 words.”
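
One common way a mask takes effect inside a model is by pushing the attention scores of padded positions to negative infinity before the softmax, so that they receive zero weight. The sketch below assumes that mechanism; the `masked_softmax` helper and the raw scores are made up for illustration, and it assumes the mask contains at least one `1`.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, giving zero weight wherever mask is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0, 3.0, 3.0]  # made-up raw attention scores
mask = [1, 1, 1, 0, 0, 0]                # the mask from the example above
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

Even though the padded positions have the highest raw scores in this example, their weights come out as exactly zero, and the three real tokens share all of the attention.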

In [None]:
texts = [
    "The cat sat on the mat.",
    "The cat sat.",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(texts, padding=True, return_tensors="pt")

print(inputs["input_ids"])
print(inputs["attention_mask"])

Tokenization is the foundation of all NLP tasks. Every string of text we processed in our previous sessions went through this process automatically. Understanding tokenization helps you debug issues with model inputs, optimize for your specific use case, understand model limitations, and build more sophisticated applications.