# Tokenization

Models do not process raw text directly. Instead, they receive data represented in numerical form.

To input text into a model, we must first convert it into numerical sequences. This process, called tokenization, is a critical step in any NLP pipeline.

## Character-level tokenization

An easy option is to assign each letter a unique numerical ID, in what is called character-level tokenization:

![Character-level tokenization](https://media.vigliensoni.com/clips/CART498/genaibook-2-1.png)
(Image taken from Sanseviero et al. 2025. *Hands-On Generative AI with Transformers and Diffusion Models*)
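
As a minimal sketch of the idea, the snippet below builds a character-level vocabulary from a short string and uses it to encode and decode text. The example text and the helper names `encode` and `decode` are illustrative assumptions, not taken from the book.

```python
# Toy character-level tokenizer: one integer ID per distinct character.
text = "hello world"

# Build the vocabulary from the characters that occur in the text.
# Sorting makes the ID assignment deterministic.
vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
id_to_char = {idx: ch for ch, idx in vocab.items()}

def encode(s):
    """Map each character to its integer ID."""
    return [vocab[ch] for ch in s]

def decode(ids):
    """Map IDs back to characters and join them into a string."""
    return "".join(id_to_char[i] for i in ids)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```

Note that the vocabulary stays tiny (8 entries here), which is the main appeal of this scheme.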

However, this approach is generally a bad idea:

- Loss of meaningful context: Letters on their own carry very little semantic meaning. Words, phrases, and subwords are much more meaningful units for understanding and generating language.

- Increased sequence length: Representing text at the character level results in much longer input sequences compared to word or subword tokenization.

  - For instance, a sentence like “The quick brown fox jumps over the lazy dog” requires 43 tokens at the character level (including spaces), while a subword tokenizer might use only 9 or 10 tokens. Longer sequences increase computational overhead and slow down processing.

- Difficulty capturing dependencies: Neural networks struggle to model dependencies over long sequences effectively. Tokenizing at the letter level exacerbates this issue because it increases the distance between related pieces of information.

- Poor generalization: While character-level models might learn patterns like spelling, they struggle to generalize linguistic structures, grammar, and semantics effectively compared to word- or subword-based models.
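
The sequence-length point can be checked with a few lines of Python. This is only a rough sketch: splitting on whitespace is used here as a stand-in for a word or subword tokenizer, which is an assumption made for illustration.

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Character-level: every character, including spaces, becomes one token.
char_tokens = list(sentence)

# Rough proxy for word-level tokenization: split on whitespace.
word_tokens = sentence.split()

print(len(char_tokens))  # 43
print(len(word_tokens))  # 9
```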

## Word-level tokenization

This is the “classic” approach to tokenization. It lets you map words directly to IDs without anything fancy. It has the advantage of being simple to use and understand, but it requires an extremely large vocabulary for good coverage.


![Word-level tokenization](https://media.vigliensoni.com/clips/CART498/genaibook-2-2.png)
(Image taken from Sanseviero et al. 2025. *Hands-On Generative AI with Transformers and Diffusion Models*)



Word-level tokenization, while intuitive, has several disadvantages that make it a suboptimal choice for many natural language processing tasks:

- Vocabulary explosion: Languages have a vast number of words, including inflected forms, compounds, slang, and typos. Creating a vocabulary with all possible words leads to an enormous size, making models harder to train and requiring significant memory.

- Handling rare words: Words that appear infrequently or are unseen during training (e.g., names, technical terms, or typos) result in out-of-vocabulary issues.

- Poor handling of morphology: Word-level tokenization cannot effectively capture the shared structure of morphologically similar words (e.g., “run,” “running,” “runner”), leading to inefficiencies in learning and generalization.

- Inefficiency for multilingual models: Multilingual tasks are particularly problematic with word-level tokenization, as it requires separate vocabularies for each language.

- Sensitivity to spelling variations: Spelling variations, contractions, and formatting issues (e.g., “color” vs. “colour,” “don't” vs. “do not”) increase the number of unique tokens, which adds complexity without adding meaningful differences.
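
Two of these problems, out-of-vocabulary words and unshared morphology, show up even in a toy example. The sketch below assumes a made-up training sentence and a hypothetical `<unk>` placeholder token for unknown words.

```python
# Hypothetical training text used to build the word-level vocabulary.
training_text = "the runner likes running and the dog likes to run"

# Reserve ID 0 for words that were never seen during training.
UNK = "<unk>"
vocab = {UNK: 0}
for word in training_text.split():
    if word not in vocab:
        vocab[word] = len(vocab)

def encode(text):
    """Map each whitespace-separated word to its ID, or to the <unk> ID."""
    return [vocab.get(word, vocab[UNK]) for word in text.lower().split()]

# Morphologically related words get unrelated IDs.
print(vocab["run"], vocab["running"], vocab["runner"])  # 8 4 2

# "runs" was never seen, so it collapses to the unknown token (ID 0).
print(encode("the dog runs"))  # [1, 6, 0]
```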

## Tokenization strategies


Modern tokenization strategies strike a balance: they split text into subwords that preserve both structure and meaning, while still handling unknown words and variations of the same word effectively.

![Subword tokenization](https://media.vigliensoni.com/clips/CART498/genaibook-2-3.png)
(Image taken from Sanseviero et al. 2025. *Hands-On Generative AI with Transformers and Diffusion Models*)

Characters often found together, such as frequent words, can be assigned a single token representing the entire word or group. Longer or more complex words, including those with many inflections, may be split into multiple tokens, with each token typically representing a meaningful part of the word.

**There is no universally optimal tokenizer; each language model uses its own**. Tokenizers differ in vocabulary size (the number of distinct tokens they support) and in their tokenization strategies. For instance, the GPT-2 tokenizer averages 1.3 tokens per word.
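
One common way to learn which characters are "often found together" is byte-pair encoding (BPE), which repeatedly merges the most frequent adjacent pair of symbols. The text above does not name a specific algorithm, so treat the following as an illustrative sketch of that one technique; the toy corpus and its word frequencies are invented.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency. Each word starts as single characters.
corpus = {"storm": 5, "stormy": 2, "story": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```

After four merges on this toy corpus, the frequent word "storm" has become a single symbol, while "stormy" is represented as "storm" plus "y".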

---

# Tokenizer playground

Play with different tokenizers in [The Tokenizer Playground](https://huggingface.co/spaces/Xenova/the-tokenizer-playground).

Use the following paragraph from Georges Perec's novel *A Void* (original French edition, *La Disparition*, 1969).

“Incurably insomniac, Anton Vowl turns on a light. According to his watch it’s only 12.20. With a loud and languorous sigh Vowl sits up, stuffs a pillow at his back, draws his quilt up around his chin, picks up his whodunit, and idly scans a paragraph or two; but, judging its plot impossibly difficult to follow in his condition, its vocabulary too whimsically multisyllabic for comfort, throws it away in disgust.”

- How many words are in the paragraph?
- How many tokens does each tokenizer return?
- Are there any apparent trends in the tokenizers' behaviour?

Model | Number of tokens | Tokens per word
--- | --- | ---
Words | 71 | 1
GPT-4 | 130 | 1.8
Text DaVinci | 107 | 1.5
GPT-3 | 107 |
Grok-1 | 102 |
Claude | 105 |
Mistral V3 | 118 |
Mistral V1 | 118 |
Gemma | 105 |
Llama 3 | 103 |
Llama | 128 |
Cohere | 106 |
T5 | 124 |
Bert | 117 |
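
The last column is the token count divided by the word count. A short sketch of that arithmetic, using the counts from the table above (the function name `tokens_per_word` is just an illustrative choice):

```python
WORD_COUNT = 71  # number of words in the paragraph, from the table

# Token counts copied from the table above.
token_counts = {
    "GPT-4": 130, "Text DaVinci": 107, "GPT-3": 107, "Grok-1": 102,
    "Claude": 105, "Mistral V3": 118, "Mistral V1": 118, "Gemma": 105,
    "Llama 3": 103, "Llama": 128, "Cohere": 106, "T5": 124, "Bert": 117,
}

def tokens_per_word(num_tokens, num_words=WORD_COUNT):
    """Average number of tokens per word, rounded to one decimal place."""
    return round(num_tokens / num_words, 1)

ratios = {model: tokens_per_word(n) for model, n in token_counts.items()}
for model, ratio in ratios.items():
    print(f"{model}: {ratio}")
```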

# Tokenization in language models

In [None]:
# Tokenization

from transformers import AutoTokenizer
# To use tokenizers, we import them from the transformers library

# There are many available, use the ID of the model you want to use
# Qwen "Qwen/Qwen2-0.5B"
# GPT-2 "openai-community/gpt2"
# SmolLM "HuggingFaceTB/SmolLM-135M"

prompt = "It was a dark and stormy"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
input_ids = tokenizer(prompt).input_ids
input_ids


[2132, 572, 264, 6319, 323, 13458, 88]

In [None]:
# Print each token ID next to the text it decodes to

for t in input_ids:
  print(t, "\t:", tokenizer.decode(t))

2132 	: It
572 	:  was
264 	:  a
6319 	:  dark
323 	:  and
13458 	:  storm
88 	: y


The tokenizer splits the input string into a series of tokens, assigning a unique ID to each.

While most words are represented by a single token, the Qwen and GPT-2 tokenizers split “stormy” into two tokens: one for “storm” (including the preceding space) and another for the suffix “y”.

This approach helps the model understand that “stormy” is related to “storm” and that the suffix “y” often turns nouns into adjectives.

---

# Tokenization activity

Write a script to print the first 50 tokens of the GPT-2 model's vocabulary.

What are the first 10 two-letter tokens?

What are the first 3-, 4-, and 5-letter tokens?


In [None]:
# prompt: Write a script to print the first 50 tokens of the GPT-2 model's vocabulary.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Print the first 50 tokens
for i in range(50):
  print(f"{i}: {tokenizer.decode(i)}")

0: !
1: "
2: #
3: $
4: %
5: &
6: '
7: (
8: )
9: *
10: +
11: ,
12: -
13: .
14: /
15: 0
16: 1
17: 2
18: 3
19: 4
20: 5
21: 6
22: 7
23: 8
24: 9
25: :
26: ;
27: <
28: =
29: >
30: ?
31: @
32: A
33: B
34: C
35: D
36: E
37: F
38: G
39: H
40: I
41: J
42: K
43: L
44: M
45: N
46: O
47: P
48: Q
49: R
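
The two follow-up questions can be approached with a small helper that scans the vocabulary in ID order and keeps tokens of a given length. The sketch below runs on a made-up miniature vocabulary; `first_tokens_of_length` and `toy_vocab` are hypothetical names. Counting only purely alphabetic tokens (so that a token with a leading space is skipped) is one possible reading of "n-letter token", not the only one.

```python
def first_tokens_of_length(tokens, length, limit=10):
    """Return up to `limit` (id, token) pairs whose token consists of
    exactly `length` alphabetic characters, scanning in ID order."""
    matches = []
    for token_id, token in enumerate(tokens):
        if len(token) == length and token.isalpha():
            matches.append((token_id, token))
            if len(matches) == limit:
                break
    return matches

# Hypothetical miniature vocabulary, listed in ID order.
toy_vocab = ["!", "a", "b", "th", " t", "he", "in", "the", "ing", "tion"]

print(first_tokens_of_length(toy_vocab, 2))           # [(3, 'th'), (5, 'he'), (6, 'in')]
print(first_tokens_of_length(toy_vocab, 3, limit=1))  # [(7, 'the')]
```

With the GPT-2 tokenizer loaded as in the cell above, the list of decoded tokens could be built with `[tokenizer.decode(i) for i in range(len(tokenizer))]` and passed to the same helper.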


---

# Activity on predicting probabilities for a next token



The above models are autoregressive models, meaning they were trained to predict the next token in a sequence, given the preceding tokens.

- Write a script that returns the 10 most probable continuation words for the phrase “It was a dark and stormy” with their corresponding probability or percentage.

- Compare the results for the `Qwen/Qwen2-0.5B` and `gpt2` models.

## Ideas to try

- **Change a few words**. Try changing the adjectives (e.g., “dark” and “stormy”) in the input string and find out how the model’s predictions change. Is the predicted word still “night”? How do the probabilities change?

- **Change the input string**. Try different input strings and analyze how the model’s predictions change. Do you agree with the model’s predictions?

- **Grammar**. What happens if you provide a string that is not a grammatically correct sequence? How does the model handle it? Look at the probabilities of the top predictions.
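
Before working with a real model, it may help to see the core computation in isolation: a model outputs one raw score (logit) per vocabulary entry, softmax turns those scores into probabilities, and the top-k entries are the most likely continuations. The candidate tokens and logit values below are invented for illustration and do not come from any model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, labels, k):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical candidate tokens and logits (not real model output).
labels = [" night", " day", " evening", " sea"]
logits = [4.0, 2.0, 1.0, 0.5]

probs = softmax(logits)
for token, p in top_k(probs, labels, 3):
    print(f"{token!r}: {p:.2%}")
```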


In [None]:
# prompt: Write a script that returns the 10 most probable continuation words for the phrase “It was a dark and stormy” with their corresponding probability or percentage.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def predict_next_tokens(prompt, model_name, num_tokens=10):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits

    next_token_logits = logits[0, -1, :]
    probabilities = torch.softmax(next_token_logits, dim=-1)

    top_probabilities, top_indices = torch.topk(probabilities, num_tokens)

    predicted_tokens = []
    for i in range(num_tokens):
        token_id = top_indices[i].item()
        token_text = tokenizer.decode(token_id)
        predicted_tokens.append((token_text, top_probabilities[i].item()))

    return predicted_tokens

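# Note: this prompt differs from the activity phrase above, and its trailing
# space is tokenized as well, which affects the predictions.
prompt = "it was the best of times, it was the "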
models = ["Qwen/Qwen2-0.5B", "gpt2"]
for model in models:
    print(f"Predictions using model {model}:")
    predictions = predict_next_tokens(prompt, model)
    for token, prob in predictions:
        print(f"{token}: {prob:.4f}")
    print("---")

Predictions using model Qwen/Qwen2-0.5B:
2: 0.4857
1: 0.1894
4: 0.0439
7: 0.0435
3: 0.0420
9: 0.0362
5: 0.0247
6: 0.0156
 best: 0.0140
0: 0.0128
---
Predictions using model gpt2:
vern: 0.3724
 : 0.1917
ills: 0.0395
iced: 0.0230
________: 0.0204
urn: 0.0199
____: 0.0174
ers: 0.0166
_____: 0.0141
icky: 0.0128
---


In [None]:
# prompt: Using the GPT-2 language model, you will replace the last word of each line from The Snow Man with the word that has the seventh-highest probability according to the model’s predictions.
# here it is:
# "The Snow Man
# by Wallace Stevens (1879-1955)
# One must have a mind of winter
# To regard the frost and the boughs
# Of the pine-trees crusted with snow;
# And have been cold a long time
# To behold the junipers shagged with ice,
# The spruces rough in the distant glitter
# Of the January sun; and not to think
# Of any misery in the sound of the wind,
# In the sound of a few leaves,
# Which is the sound of the land
# Full of the same wind
# That is blowing in the same bare place
# For the listener, who listens in the snow,
# And, nothing himself, beholds
# Nothing that is not there and the nothing that is.
# "

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def predict_next_tokens(prompt, model_name, num_tokens=10):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits

    next_token_logits = logits[0, -1, :]
    probabilities = torch.softmax(next_token_logits, dim=-1)

    top_probabilities, top_indices = torch.topk(probabilities, num_tokens)

    predicted_tokens = []
    for i in range(num_tokens):
        token_id = top_indices[i].item()
        token_text = tokenizer.decode(token_id)
        predicted_tokens.append((token_text, top_probabilities[i].item()))

    return predicted_tokens

poem = """
One must have a mind of winter
To regard the frost and the boughs
Of the pine-trees crusted with snow;
And have been cold a long time
To behold the junipers shagged with ice,
The spruces rough in the distant glitter
Of the January sun; and not to think
Of any misery in the sound of the wind,
In the sound of a few leaves,
Which is the sound of the land
Full of the same wind
That is blowing in the same bare place
For the listener, who listens in the snow,
And, nothing himself, beholds
Nothing that is not there and the nothing that is.
"""

model_name = "gpt2" # You can change this to another GPT-2 variant if you want

for line in poem.strip().split('\n'):
  if not line.strip():
    continue
  words = line.split()
  if not words:
    continue

  prompt = " ".join(words[:-1])
  predictions = predict_next_tokens(prompt, model_name, num_tokens=10)

  # Replace with the 7th most probable word (index 6)
  if len(predictions) >= 7:
    new_word = predictions[6][0]
    new_line = prompt + " " + new_word
    print(new_line)
  else:
    print(line) # Print the original line if there are not enough predictions

One must have a mind of  her
To regard the frost and the  death
Of the pine-trees crusted with  oil
And have been cold a long  way
To behold the junipers shagged with  white
The spruces rough in the distant  horizon
Of the January sun; and not to  have
Of any misery in the sound of the  sound
In the sound of a few  shots
Which is the sound of the  voice
Full of the same  day
That is blowing in the same bare  air
For the listener, who listens in the  morning
And, nothing himself,  I
Nothing that is not there and the nothing that  isn


In [None]:
# Variation on the previous cell: replace the last word of each line of
# The Snow Man with the 10th most probable token instead of the 7th.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def predict_next_tokens(prompt, model_name, num_tokens=10):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits

    next_token_logits = logits[0, -1, :]
    probabilities = torch.softmax(next_token_logits, dim=-1)

    top_probabilities, top_indices = torch.topk(probabilities, num_tokens)

    predicted_tokens = []
    for i in range(num_tokens):
        token_id = top_indices[i].item()
        token_text = tokenizer.decode(token_id)
        predicted_tokens.append((token_text, top_probabilities[i].item()))

    return predicted_tokens

poem = """
One must have a mind of winter
To regard the frost and the boughs
Of the pine-trees crusted with snow;
And have been cold a long time
To behold the junipers shagged with ice,
The spruces rough in the distant glitter
Of the January sun; and not to think
Of any misery in the sound of the wind,
In the sound of a few leaves,
Which is the sound of the land
Full of the same wind
That is blowing in the same bare place
For the listener, who listens in the snow,
And, nothing himself, beholds
Nothing that is not there and the nothing that is.
"""

model_name = "gpt2" # You can change this to another GPT-2 variant if you want

for line in poem.strip().split('\n'):
  if not line.strip():
    continue
  words = line.split()
  if not words:
    continue

  prompt = " ".join(words[:-1])
  predictions = predict_next_tokens(prompt, model_name, num_tokens=10)

  # Replace with the 10th most probable word (index 9)
  if len(predictions) >= 10:
    new_word = predictions[9][0]
    new_line = prompt + " " + new_word
    print(new_line)
  else:
    print(line) # Print the original line if there are not enough predictions

One must have a mind of  one
To regard the frost and the  storm
Of the pine-trees crusted with  white
And have been cold a long  and
To behold the junipers shagged with  tw
The spruces rough in the distant  distance
Of the January sun; and not to  my
Of any misery in the sound of the  earth
In the sound of a few  loud
Which is the sound of the  music
Full of the same  time
That is blowing in the same bare  bones
For the listener, who listens in the  studio
And, nothing himself,  though
Nothing that is not there and the nothing that  comes


In [None]:


from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def predict_next_tokens(prompt, model_name, num_tokens=10):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits

    next_token_logits = logits[0, -1, :]
    probabilities = torch.softmax(next_token_logits, dim=-1)

    top_probabilities, top_indices = torch.topk(probabilities, num_tokens)

    predicted_tokens = []
    for i in range(num_tokens):
        token_id = top_indices[i].item()
        token_text = tokenizer.decode(token_id)
        predicted_tokens.append((token_text, top_probabilities[i].item()))

    return predicted_tokens

poem = """
One must have a mind of winter
To regard the frost and the boughs
Of the pine-trees crusted with snow;
And have been cold a long time
To behold the junipers shagged with ice,
The spruces rough in the distant glitter
Of the January sun; and not to think
Of any misery in the sound of the wind,
In the sound of a few leaves,
Which is the sound of the land
Full of the same wind
That is blowing in the same bare place
For the listener, who listens in the snow,
And, nothing himself, beholds
Nothing that is not there and the nothing that is.
"""

model_name = "gpt2" # You can change this to another GPT-2 variant if you want

for line in poem.strip().split('\n'):
  if not line.strip():
    continue
  words = line.split()
  if not words:
    continue

  prompt = " ".join(words[:-1])
  predictions = predict_next_tokens(prompt, model_name, num_tokens=40)

  # Replace with the 40th most probable word (index 39)
  if len(predictions) >= 40:
    new_word = predictions[39][0]
    new_line = prompt + " " + new_word
    print(new_line)
  else:
    print(line) # Print the original line if there are not enough predictions

One must have a mind of  consistency
To regard the frost and the  effect
Of the pine-trees crusted with  seeds
And have been cold a long  Time
To behold the junipers shagged with  thick
The spruces rough in the distant  years
Of the January sun; and not to  him
Of any misery in the sound of the  old
In the sound of a few  notes
Which is the sound of the  ball
Full of the same  order
That is blowing in the same bare  hole
For the listener, who listens in the  normal
And, nothing himself,  other
Nothing that is not there and the nothing that  must


In [None]:
# Variation on the previous cells: replace the last word of each line of
# The Snow Man with the 69th most probable token.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def predict_next_tokens(prompt, model_name, num_tokens=10):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits

    next_token_logits = logits[0, -1, :]
    probabilities = torch.softmax(next_token_logits, dim=-1)

    top_probabilities, top_indices = torch.topk(probabilities, num_tokens)

    predicted_tokens = []
    for i in range(num_tokens):
        token_id = top_indices[i].item()
        token_text = tokenizer.decode(token_id)
        predicted_tokens.append((token_text, top_probabilities[i].item()))

    return predicted_tokens

poem = """
One must have a mind of winter
To regard the frost and the boughs
Of the pine-trees crusted with snow;
And have been cold a long time
To behold the junipers shagged with ice,
The spruces rough in the distant glitter
Of the January sun; and not to think
Of any misery in the sound of the wind,
In the sound of a few leaves,
Which is the sound of the land
Full of the same wind
That is blowing in the same bare place
For the listener, who listens in the snow,
And, nothing himself, beholds
Nothing that is not there and the nothing that is.
"""

model_name = "gpt2" # You can change this to another GPT-2 variant if you want

for line in poem.strip().split('\n'):
  if not line.strip():
    continue
  words = line.split()
  if not words:
    continue

  prompt = " ".join(words[:-1])
  predictions = predict_next_tokens(prompt, model_name, num_tokens=69)

  # Replace with the 69th most probable word (index 68)
  if len(predictions) >= 69:
    new_word = predictions[68][0]
    new_line = prompt + " " + new_word
    print(new_line)
  else:
    print(line) # Print the original line if there are not enough predictions

One must have a mind of  at
To regard the frost and the  p
Of the pine-trees crusted with  golden
And have been cold a long  for
To behold the junipers shagged with  wings
The spruces rough in the distant  wilderness
Of the January sun; and not to  her
Of any misery in the sound of the  morning
In the sound of a few  bell
Which is the sound of the  radio
Full of the same  season
That is blowing in the same bare  voice
For the listener, who listens in the  home
And, nothing himself,  anyway
Nothing that is not there and the nothing that  flows
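
The cells above differ only in which rank they pick. One possible refactoring, sketched below, makes the rank a parameter and passes the prediction function in as an argument so the logic can be tried without downloading a model. `replace_last_word` and `fake_predict` are hypothetical names, and `fake_predict` returns invented candidates rather than real model output.

```python
def replace_last_word(text, rank, predict_fn):
    """Replace the last word of each line with the prediction ranked
    `rank` (1-based). `predict_fn(prompt, n)` must return a list of
    (token_text, probability) pairs sorted from most to least probable."""
    new_lines = []
    for line in text.strip().split("\n"):
        words = line.split()
        if len(words) < 2:
            new_lines.append(line)  # nothing to use as a prompt
            continue
        prompt = " ".join(words[:-1])
        predictions = predict_fn(prompt, rank)
        if len(predictions) >= rank:
            # strip() drops the leading space many tokens carry
            new_lines.append(prompt + " " + predictions[rank - 1][0].strip())
        else:
            new_lines.append(line)  # keep the original line as a fallback
    return new_lines

def fake_predict(prompt, num_tokens):
    """Hypothetical stand-in for a language model: fixed candidates."""
    candidates = [" snow", " ice", " wind", " sun"]
    return [(tok, 1.0 / (i + 2)) for i, tok in enumerate(candidates[:num_tokens])]

poem = "One must have a mind of winter\nTo regard the frost and the boughs"
for line in replace_last_word(poem, rank=3, predict_fn=fake_predict):
    print(line)
```

To use the real model, the `predict_next_tokens` function defined in the cells above could be wrapped as `lambda p, n: predict_next_tokens(p, "gpt2", num_tokens=n)`. Loading the tokenizer and model once, outside the per-line call, would avoid reloading them for every line.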
