# Demo: Tokenization

*Tokenization* is the first step in processing text where the text is broken down into smaller units, such as words, subwords, or characters. These units are then mapped to numerical representations that a language model can process.

## UTF-8 Encoding

First, the text is represented as a sequence of bytes using UTF-8.  
UTF-8 is a variable-length encoding for Unicode that uses 1 to 4 bytes per character. It is backward-compatible with ASCII and can represent every Unicode code point, which covers the characters of essentially all written languages as well as symbols and emojis. This ensures that text from any language can be consistently encoded and decoded.
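
As a rough sketch of the 1-to-4-byte rule, the number of bytes follows from the range the code point falls in. The helper `utf8_length` is a hypothetical name introduced here for illustration; it is checked against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a Unicode code point."""
    if code_point < 0x80:       # up to 7 bits: ASCII, stored as a single byte
        return 1
    if code_point < 0x800:      # up to 11 bits
        return 2
    if code_point < 0x10000:    # up to 16 bits
        return 3
    return 4                    # up to 21 bits (maximum is U+10FFFF)

for ch in ["A", "é", "你", "😊"]:
    # Compare the range-based rule with what Python's encoder produces
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")

# ASCII text is byte-for-byte identical in ASCII and UTF-8
assert "Hello".encode("ascii") == "Hello".encode("utf-8")
```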

We can visualize the UTF-8 encoding of text in Python as follows:

In [3]:
# Example: UTF-8 encoding of different characters
examples = ['A', 'é', '你', '😊']

for char in examples:
    utf8_bytes = char.encode('utf-8')
    hex_repr = ' '.join(f'{byte:02X}' for byte in utf8_bytes)
    print(f"Character: {char}\tUnicode: U+{ord(char):04X}\tUTF-8 Bytes: {hex_repr}")

Character: A	Unicode: U+0041	UTF-8 Bytes: 41
Character: é	Unicode: U+00E9	UTF-8 Bytes: C3 A9
Character: 你	Unicode: U+4F60	UTF-8 Bytes: E4 BD A0
Character: 😊	Unicode: U+1F60A	UTF-8 Bytes: F0 9F 98 8A


In [2]:
# Example of a string
my_string = "Hello, World! 你好世界"

# Encode the string using UTF-8
encoded_string = my_string.encode('utf-8')

# Print the original string and the encoded bytes
print("Original string:", my_string)
print("Encoded (UTF-8):", encoded_string)

# Decode the bytes back to a string
decoded_string = encoded_string.decode('utf-8')

# Print the decoded string
print("Decoded string:", decoded_string)

Original string: Hello, World! 你好世界
Encoded (UTF-8): b'Hello, World! \xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c'
Decoded string: Hello, World! 你好世界


The code below walks through a UTF-8 byte sequence and recovers the original characters. The leading bits of the first byte of each character indicate how many bytes that character occupies.

In [None]:
# Loop over the bytes in the encoded string
i = 0
while i < len(encoded_string):
    # Check if the current byte is the start of a multi-byte character
    if encoded_string[i] & 0b10000000 == 0:  # ASCII character (0xxxxxxx)
        char = encoded_string[i:i+1].decode('utf-8')
        hex_bytes = encoded_string[i:i+1].hex()
        print(f"Character: {char}, Bytes (hex): {hex_bytes}")
        i += 1
    elif encoded_string[i] & 0b11100000 == 0b11000000:  # 2-byte character (110xxxxx)
        char = encoded_string[i:i+2].decode('utf-8')
        hex_bytes = encoded_string[i:i+2].hex()
        print(f"Character: {char}, Bytes (hex): {hex_bytes}")
        i += 2
    elif encoded_string[i] & 0b11110000 == 0b11100000:  # 3-byte character (1110xxxx)
        char = encoded_string[i:i+3].decode('utf-8')
        hex_bytes = encoded_string[i:i+3].hex()
        print(f"Character: {char}, Bytes (hex): {hex_bytes}")
        i += 3
    elif encoded_string[i] & 0b11111000 == 0b11110000:  # 4-byte character (11110xxx)
        char = encoded_string[i:i+4].decode('utf-8')
        hex_bytes = encoded_string[i:i+4].hex()
        print(f"Character: {char}, Bytes (hex): {hex_bytes}")
        i += 4
    else:
        # Handle potential errors or unexpected bytes
        print(f"Skipping unexpected byte at position {i}: {encoded_string[i]:02x}")
        i += 1

Character: H, Bytes (hex): 48
Character: e, Bytes (hex): 65
Character: l, Bytes (hex): 6c
Character: l, Bytes (hex): 6c
Character: o, Bytes (hex): 6f
Character: ,, Bytes (hex): 2c
Character:  , Bytes (hex): 20
Character: W, Bytes (hex): 57
Character: o, Bytes (hex): 6f
Character: r, Bytes (hex): 72
Character: l, Bytes (hex): 6c
Character: d, Bytes (hex): 64
Character: !, Bytes (hex): 21
Character:  , Bytes (hex): 20
Character: 你, Bytes (hex): e4bda0
Character: 好, Bytes (hex): e5a5bd
Character: 世, Bytes (hex): e4b896
Character: 界, Bytes (hex): e7958c
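
The loop above delegates the final conversion to `bytes.decode`. As a minimal sketch of what happens underneath, the payload bits of the leading byte and of each continuation byte (`10xxxxxx`) can be stitched together by hand. The function name `decode_code_point` is an assumption introduced here, and the sketch skips the validation (overlong forms, surrogates, truncated input) that a real decoder performs.

```python
def decode_code_point(data: bytes) -> int:
    """Decode one UTF-8 encoded character (1 to 4 bytes) into its code point."""
    first = data[0]
    if first & 0b10000000 == 0:             # 0xxxxxxx: 7 payload bits
        return first
    if first & 0b11100000 == 0b11000000:    # 110xxxxx: 5 payload bits
        code_point, n_bytes = first & 0b00011111, 2
    elif first & 0b11110000 == 0b11100000:  # 1110xxxx: 4 payload bits
        code_point, n_bytes = first & 0b00001111, 3
    elif first & 0b11111000 == 0b11110000:  # 11110xxx: 3 payload bits
        code_point, n_bytes = first & 0b00000111, 4
    else:
        raise ValueError(f"invalid leading byte: {first:#04x}")
    for byte in data[1:n_bytes]:
        # Each continuation byte (10xxxxxx) contributes 6 more bits
        code_point = (code_point << 6) | (byte & 0b00111111)
    return code_point

cp = decode_code_point(bytes.fromhex("e4bda0"))
print(f"U+{cp:04X}", chr(cp))  # U+4F60 你
```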


In [None]:
# Example with emojis
emoji_string = "Hello 😊世界!"

# Encode the string using UTF-8
encoded_emoji_string = emoji_string.encode('utf-8')

# Print the original string and the encoded bytes
print("Original string:", emoji_string)
print("Encoded (UTF-8):", encoded_emoji_string)

# Decode the bytes back to a string
decoded_emoji_string = encoded_emoji_string.decode('utf-8')

# Print the decoded string
print("Decoded string:", decoded_emoji_string)

Original string: Hello 😊世界!
Encoded (UTF-8): b'Hello \xf0\x9f\x98\x8a\xe4\xb8\x96\xe7\x95\x8c!'
Decoded string: Hello 😊世界!
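
One consequence worth noting is that the length of a string in characters generally differs from its length in bytes. A small sketch using the same example string (the variable names `n_chars` and `n_bytes` are introduced here for illustration):

```python
emoji_string = "Hello 😊世界!"
encoded = emoji_string.encode("utf-8")

n_chars = len(emoji_string)  # number of Unicode code points
n_bytes = len(encoded)       # number of UTF-8 bytes
print(f"{n_chars} characters, {n_bytes} bytes")  # 10 characters, 17 bytes

# Bytes used by each individual character
for ch in emoji_string:
    print(repr(ch), len(ch.encode("utf-8")))
```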


## Using Pre-Trained Tokenizers

Tokenizers are an essential first step in any modern NLP model.  
They convert raw text into structured tokens that can be mapped to numerical IDs, enabling models to process language efficiently.  
Training tokenizers from scratch is difficult and requires large corpora and careful vocabulary design.  

Fortunately, the Hugging Face platform provides several excellent **pre-trained tokenizers** that are aligned with popular models.

To illustrate the idea of tokenization, we will use the tokenizer for **`gpt2`**, a transformer-based language model trained on web text.  
The `gpt2` tokenizer uses **Byte Pair Encoding (BPE)**, a subword tokenization method that balances vocabulary size and generalization.  
It can handle rare words, emojis, and informal text by breaking them into known subword units.
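
To make the idea of BPE concrete, here is a minimal toy sketch of the merge loop: start from individual characters, repeatedly find the most frequent adjacent pair, and merge it into a new symbol. The corpus, the function names, and the number of merges are made up for illustration. GPT-2's actual tokenizer works on bytes and uses a much larger learned merge table, neither of which is reproduced here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, with each word split into characters
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print("Learned merges:", merges)
print("Segmented words:", list(words))
```

On this toy corpus the frequent suffix `est` ends up as a single symbol after the first two merges, which is the sense in which BPE builds subword units out of common character sequences.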

We'll now load the tokenizer and apply it to a sample sentence.

First we install the `transformers` package from Hugging Face.

In [1]:
%pip install transformers



Next, we download the pre-trained `gpt2` tokenizer.

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")


Now let's **tokenize** a simple sentence.

We define the example text and then use the loaded tokenizer to encode it.



In [7]:
#example_text = "Hello 😊世界!"
example_text = "Hello world!"
encoded_text = tokenizer.encode(example_text)

print(encoded_text)

[15496, 995, 0]


The output is a list of numbers -- one for each token.  Each number represents a token ID.  We can print the token ID and corresponding text.  

In [8]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text)

for token_id, token in zip(encoded_text, tokens):
    print(f"Token ID: {token_id}, Token: {token}")

Token ID: 15496, Token: Hello
Token ID: 995, Token: Ġworld
Token ID: 0, Token: !


Notice the unusual character at the start of `Ġworld`.  It marks a leading space, so the token stands for `space + 'world'`.  Tokenizers do not remove the spaces -- they keep them as part of the tokens, since spaces may carry important information.  In the BPE training data, the space and `world` appeared together often enough that the combination got its own token!  In fact, `world` (no space) and ` world` (with a space) are different tokens, as can be seen here:

In [14]:
text = ['world', ' world', 'World']
for t in text:
  enc = tokenizer.encode(t)
  token_id = enc[0]
  token = tokenizer.convert_ids_to_tokens(token_id)
  print('text: \'%s\' ID: %s token: %s' % (t, enc, token))

text: 'world' ID: [6894] token: world
text: ' world' ID: [995] token: Ġworld
text: 'World' ID: [10603] token: World
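
The `Ġ` itself comes from the way GPT-2's byte-level BPE displays raw bytes: printable bytes are shown as themselves, and the remaining bytes (including the space, `0x20`) are shifted to code points starting at U+0100 so that every byte has a visible stand-in. The sketch below re-implements that mapping, following the `bytes_to_unicode` helper in the original GPT-2 code. Treat it as an illustrative reconstruction rather than the library's exact source.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already display well are mapped to themselves
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (control characters, space, ...) get code points from U+0100 on
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping

byte_to_char = bytes_to_unicode()
print(repr(byte_to_char[ord(" ")]))  # 'Ġ'
visible = "".join(byte_to_char[b] for b in " world".encode("utf-8"))
print(visible)                       # Ġworld
```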


To emphasize this last point, consider encoding some Python code.  Observe that decoding the token IDs reconstructs the text exactly, including the spaces and line breaks.  This property is very important for parsing software, for example.

In [15]:
code_snippet = """def foo(a, b):
    c = a + b % add the numbers
    return c
"""

# Encode the code
encoded = tokenizer.encode(code_snippet)
print("Encoded token IDs:", encoded)

# Decode back to text
decoded = tokenizer.decode(encoded)
print("\nDecoded text:\n")
print(decoded)


Encoded token IDs: [4299, 22944, 7, 64, 11, 275, 2599, 198, 220, 220, 220, 269, 796, 257, 1343, 275, 4064, 751, 262, 3146, 198, 220, 220, 220, 1441, 269, 198]

Decoded text:

def foo(a, b):
    c = a + b % add the numbers
    return c
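
To see why keeping whitespace inside tokens matters, compare two toy tokenizers on a similar snippet. Neither is GPT-2's tokenizer; both are hypothetical stand-ins introduced for illustration. The first discards whitespace with `str.split`. The second attaches a leading space to the token that follows it and keeps other whitespace runs as tokens, so concatenating the tokens reproduces the input exactly.

```python
import re

code_snippet = "def foo(a, b):\n    c = a + b\n    return c\n"

# Tokenizer 1: split on whitespace (spaces and newlines are discarded)
lossy_tokens = code_snippet.split()
lossy_text = " ".join(lossy_tokens)

# Tokenizer 2: an optional leading space plus a word or a punctuation run,
# otherwise a run of whitespace kept as its own token
pattern = re.compile(r" ?\w+| ?[^\w\s]+|\s+")
lossless_tokens = pattern.findall(code_snippet)
lossless_text = "".join(lossless_tokens)

print(lossy_tokens)
print(lossless_tokens)
print("split() round-trips:", lossy_text == code_snippet)     # False
print("regex round-trips:  ", lossless_text == code_snippet)  # True
```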

