---
title: "Tokenization: Unboxing How LLMs Read Text"
jupyter: applsoftcomp
execute:
    enabled: true
    cache: true
---

![](https://curator-production.s3.us.cloud-object-storage.appdomain.cloud/uploads/course-v1:IBMSkillsNetwork+GPXX0A7BEN+v1.jpg)

**Spoiler:** LLMs don't read words—they read compressed fragments optimized for a probability engine.

## The Mechanism (Why Subwords, Not Words)

You might assume that an LLM reads text the way you do: word by word, with each word treated as an atomic unit. It doesn't. The model operates on **tokens**: subword chunks that can be full words ("the"), word parts ("ingham"), or single characters ("B"). This choice is not arbitrary; it is a compression strategy that trades vocabulary size against sequence length.

If we used whole words, the vocabulary would balloon to millions of entries. Each entry requires a row in the embedding matrix, so memory scales linearly with vocabulary size. A 1-million-word vocabulary with 2048-dimensional embeddings would need about 8 GB in 32-bit floats (1,000,000 × 2,048 × 4 bytes) just for the lookup table. Subword tokenization sidesteps this problem by building the vocabulary from frequently occurring fragments. With roughly 50,000 subwords, the model can reconstruct both common words (stored as single tokens) and rare words (assembled from parts). The system trades a small increase in sequence length for a massive reduction in memory and computational overhead.
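
To make the arithmetic concrete, here is a minimal sketch of the lookup-table estimate. It assumes 32-bit floats (4 bytes per parameter); the function name and the two vocabulary sizes are illustrative, not taken from any particular model.

```python
def embedding_table_bytes(vocab_size: int, dim: int, bytes_per_param: int = 4) -> int:
    """Memory for a vocab_size x dim embedding table (assumes 4-byte floats by default)."""
    return vocab_size * dim * bytes_per_param


word_level = embedding_table_bytes(1_000_000, 2048)  # hypothetical whole-word vocabulary
subword = embedding_table_bytes(50_000, 2048)        # hypothetical subword vocabulary

print(f"Word-level table: {word_level / 1e9:.2f} GB")  # 8.19 GB
print(f"Subword table:    {subword / 1e9:.2f} GB")     # 0.41 GB
print(f"Ratio: {word_level / subword:.0f}x")           # 20x
```

Storing the table in 16-bit floats would halve both numbers, but the 20x gap between the two vocabularies would stay the same.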

This also explains why LLMs sometimes fail on seemingly trivial tasks like counting letters. The word "strawberry" might tokenize as `["straw", "berry"]`, so the model receives two token IDs and never sees the individual "r" characters as separate units. It's not stupidity; it's a compression artifact.
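
Here is a toy illustration of the gap between what you see and what the model sees. The split and the IDs below are made up for the example; a real tokenizer may segment the word differently.

```python
word = "strawberry"

# Hypothetical subword split and made-up IDs, for illustration only
tokens = ["straw", "berry"]
toy_ids = {"straw": 101, "berry": 102}

# What a character-level reader can do trivially
print(word.count("r"))  # 3

# What the model actually receives: two opaque integers
model_input = [toy_ids[t] for t in tokens]
print(model_input)  # [101, 102]
```

Nothing in `[101, 102]` says how many times "r" occurs. The model can only answer if it has learned that fact about each token from its training data.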

## The Application (How Tokenization Works in Practice)

Let's unbox an actual tokenizer from Hugging Face and trace the pipeline from raw text to embeddings. We'll use **Phi-1.5**, a compact model from Microsoft. For tokenization experiments, we only need the tokenizer—no need to load the full multi-gigabyte model.

In [None]:
from transformers import AutoTokenizer

# Hugging Face repo ID for Phi-1.5 (note the underscore in the ID)
model_name = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Let's inspect the tokenizer's constraints.

In [None]:
#| code-fold: true
print(f"Vocabulary size: {tokenizer.vocab_size:,} tokens")
print(f"Max sequence length: {tokenizer.model_max_length} tokens")

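This tokenizer's base vocabulary holds 50,257 tokens, and it reports a maximum sequence length of 2,048 tokens. The tokenizer does not enforce this limit on its own: unless you request truncation, it will encode a longer input and only warn you. The limit matters because the model was trained with a 2,048-token context window, so anything beyond it lies outside what the model was trained to handle and should be truncated or split before it reaches the model.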
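
If you need to process a document longer than the limit, one simple option is to split its token IDs into consecutive windows. This is a minimal sketch with stand-in IDs; the function name is hypothetical, and real pipelines often use overlapping windows instead.

```python
def split_into_windows(token_ids, max_len=2048):
    """Split a list of token IDs into consecutive chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


fake_ids = list(range(5000))  # stand-in for the IDs of a long document
windows = split_into_windows(fake_ids)

print([len(w) for w in windows])  # [2048, 2048, 904]
```

Each window fits the context limit, at the cost that tokens in one window cannot attend to tokens in another.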

### Text to Tokens

Tokenization splits text into the subword fragments the model actually processes. Watch what happens when we tokenize a university name.

In [None]:
text = "Binghamton University."

tokens = tokenizer.tokenize(text)

In [None]:
#| code-fold: true
print(f"Tokens: {tokens}")

The rare word "Binghamton" fractures into `['B', 'ingham', 'ton']`. The common word "University" survives intact (with a leading space marker). The tokenizer learned these splits from frequency statistics during training. High-frequency words get dedicated tokens; rare words get decomposed into reusable parts.
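
GPT-style tokenizers like this one are built with byte-pair encoding (BPE): start from single characters and repeatedly merge the most frequent adjacent pair. The sketch below shows that loop on a made-up four-word corpus. It is a simplified illustration (characters instead of bytes, no pre-tokenization), and all names and counts are hypothetical.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merges; return them and the segmented corpus."""
    word_freqs = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


# Made-up word counts: "the" is common, "binghamton" is rare
toy_corpus = {"the": 50, "then": 10, "this": 5, "binghamton": 1}
merges, segmented = train_bpe(toy_corpus, num_merges=3)

print(merges)  # [('t', 'h'), ('th', 'e'), ('the', 'n')]
for symbols in segmented:
    print(symbols)
```

After three merges, "the" and "then" are single symbols while "binghamton" is still ten separate characters. A limited merge budget gets spent on whatever is frequent.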

::: {.column-margin}
The `Ġ` character (U+0120) is a GPT-style tokenizer convention for encoding spaces. When you see `ĠUniversity`, it means "University" preceded by a space. This preserves word boundaries while allowing subword splits.
:::
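
Where does `Ġ` come from? GPT-2-style byte-level tokenizers remap every byte to a printable character so that spaces and control bytes survive inside token strings. The sketch below rebuilds that table following the GPT-2 convention as we understand it; the function name is ours, so treat it as an illustration rather than the library's code.

```python
def byte_to_char_table():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    # Every other byte is shifted to code points starting at 256
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


table = byte_to_char_table()
space_char = table[ord(" ")]

print(space_char, hex(ord(space_char)))  # Ġ 0x120
print(table[ord("A")])                   # A
```

The space byte (32) is the 33rd non-printable byte, so it lands on code point 256 + 32 = 288, which is U+0120.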

Let's test a few more examples to see the pattern.

In [None]:
#| code-fold: true

texts = [
    "Bearcats",
    "New York",
]

print("Word tokenization examples:\n")
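for text in texts:
    tokens = tokenizer.tokenize(text)
    print(f"{text:10s} → {tokens}")

"Bearcats" splits into pieces because it is rare in the training corpus. "New York" comes out as two tokens, one per word, and each word stays intact: GPT-2-style tokenizers split text at whitespace before applying subword merges, so a token never spans two words. The tokenizer's behavior is a direct reflection of its training corpus.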

::: {.column-margin}
Check out [OpenAI's tokenizer](https://platform.openai.com/tokenizer) to see how different models slice the same text differently.
:::
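
Once a tokenizer has learned its merges, splitting a new word amounts to replaying them in order. Here is a minimal sketch with a hypothetical merge list (not Phi-1.5's real merges) showing why a familiar word collapses into one token while unfamiliar ones stay fragmented.

```python
def apply_merges(word, merges):
    """Tokenize `word` by replaying learned merges in the order they were learned."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # fuse the pair into a single symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merges, listed in the order a trainer might have learned them
toy_merges = [("t", "h"), ("th", "e"), ("e", "r"), ("o", "n")]

print(apply_merges("the", toy_merges))    # ['the']
print(apply_merges("other", toy_merges))  # ['o', 'the', 'r']
print(apply_merges("tone", toy_merges))   # ['t', 'on', 'e']
```

Two tokenizers with different merge lists will cut the same text differently, which is why the same sentence can cost a different number of tokens from one model to the next.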

### Tokens to Token IDs

Tokens are still strings. The model needs integers. Each token maps to a unique ID in the vocabulary dictionary.

In [None]:
#| code-fold: true

text = "Binghamton University"

# Get token IDs
token_ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.tokenize(text)

print("Token → Token ID mapping:\n")
for token, token_id in zip(tokens, token_ids):
    print(f"{token:10s} → {token_id:6d}")

Each token receives a unique integer ID. The vocabulary is a dictionary: `{token_string: integer_id}`. Let's peek inside.

In [None]:
# Get the full vocabulary
vocab = tokenizer.get_vocab()

# Sample some tokens
sample_tokens = list(vocab.items())[:5]
# Use `idx` rather than `id` to avoid shadowing Python's built-in id()
for token, idx in sample_tokens:
    print(f"  {idx:6d}: '{token}'")
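
The same dictionary works in reverse for decoding. A toy sketch with a four-entry vocabulary: the tokens mirror the example above, but the IDs are made up and the helper names are ours.

```python
# Toy vocabulary with made-up IDs (real IDs differ)
toy_vocab = {"B": 0, "ingham": 1, "ton": 2, "ĠUniversity": 3}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}


def encode(tokens):
    """Look up each token string in the vocabulary."""
    return [toy_vocab[t] for t in tokens]


def decode(ids):
    """Map IDs back to token strings, join them, and restore spaces."""
    return "".join(id_to_token[i] for i in ids).replace("Ġ", " ")


ids = encode(["B", "ingham", "ton", "ĠUniversity"])
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # Binghamton University
```

Because each token maps to exactly one ID and back, the round trip is lossless for any text the vocabulary can represent.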

Most LLMs reserve special tokens for sequence boundaries or control signals. Phi-1.5 uses `<|endoftext|>` as a separator during training. Let's verify.

In [None]:
token_id = [50256]
token = tokenizer.convert_ids_to_tokens(token_id)[0]
print(f"Token ID: {token_id} → Token: {token}")

ID 50256 is not universal: it comes from the GPT-2-style vocabulary that Phi-1.5's tokenizer uses. Other models use different conventions (e.g., BERT uses `[SEP]` and `[CLS]`). Always check your tokenizer's special tokens before preprocessing data.
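
To see what "separator" means in practice, here is a sketch of one common way to pack several documents into a single training stream. The packing scheme and the document IDs are assumptions for illustration; this is not a description of Phi-1.5's actual data pipeline.

```python
EOS_ID = 50256  # ID of <|endoftext|> in the tokenizer inspected above


def pack_documents(docs, eos_id=EOS_ID):
    """Concatenate tokenized documents, ending each one with the separator ID."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    return stream


docs = [[11, 22, 33], [44, 55]]  # made-up token IDs for two short documents
packed = pack_documents(docs)
print(packed)  # [11, 22, 33, 50256, 44, 55, 50256]
```

If you preprocess data with the wrong separator ID, the model sees an ordinary token where it expects a document boundary.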

### Token IDs to Embeddings

![](https://res.cloudinary.com/lesswrong-2-0/image/upload/v1676309872/mirroredImages/pHPmMGEMYefk9jLeh/wegwbgiqyhig42gidlsg.png)

Now we need the full model to access the embedding layer—the matrix that converts token IDs into dense vectors.

In [None]:
from transformers import AutoModelForCausalLM
import torch

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name)

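# Retrieve the input embedding layer via the generic accessor
# (for this model it returns the same module as model.model.embed_tokens)
embedding_layer = model.get_input_embeddings()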

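The embedding layer is a simple lookup table: a 51,200 × 2,048 matrix where each row is the embedding for one token ID. The matrix has more rows than the 50,257 tokens the tokenizer reported; the model's configuration allocates extra rows that no token in that base vocabulary maps to. Let's examine the first few entries.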

In [None]:
#| code-fold: true
print(embedding_layer.weight[:5, :10])

These numbers are learned parameters, optimized during training to capture semantic relationships. Token IDs are discrete symbols; embeddings are continuous coordinates in a 2048-dimensional space. This is what the transformer layers operate on.
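
"Lookup table" is meant literally: embedding a token ID is just selecting a row. The sketch below uses a tiny random table in plain Python (the sizes and values are made up) and shows that row selection gives the same result as multiplying a one-hot vector by the matrix.

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 6, 4  # toy sizes; the real table is far larger

# One row of DIM numbers per token ID (random stand-ins for learned weights)
table = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]


def embed(token_ids):
    """Embedding lookup: pick the row for each token ID."""
    return [table[i] for i in token_ids]


def one_hot_times_table(token_id):
    """The same operation written as a one-hot vector times the matrix."""
    one_hot = [1.0 if r == token_id else 0.0 for r in range(VOCAB_SIZE)]
    return [sum(one_hot[r] * table[r][c] for r in range(VOCAB_SIZE)) for c in range(DIM)]


vectors = embed([3, 0, 3])
print(vectors[0] == vectors[2])               # True: same ID, same vector
print(one_hot_times_table(3) == vectors[0])   # True: lookup equals one-hot matmul
```

The lookup form is what libraries use in practice, since it avoids multiplying by a vector that is almost entirely zeros.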

## The Bigger Picture

You've now traced the full pipeline: raw text fractures into subword tokens, tokens map to integer IDs, and IDs retrieve vector embeddings from a learned matrix. This tokenization step is foundational—without it, the model cannot begin processing language. The transformer layers come next, using attention mechanisms to extract patterns from these embeddings.

Remember three constraints. First, LLMs operate on subwords, not words, because vocabulary size is a memory bottleneck. Second, tokenization is learned from data, not hand-designed, meaning different models will split text differently. Third, compression has side effects—tasks like character counting fail because the model never sees individual characters as atomic units.

With this machinery exposed, we're ready to examine the transformer itself—the architecture that processes these embeddings and enables LLMs to predict the next token.

---

**Next**: [Transformers: The Architecture Behind the Magic →](transformers.qmd)