# Tokenization
By Youssef Al Hariri

This notebook introduces tokenization, a core step in text processing and LLM modelling.
It uses the **tiktoken** library *(a fast, open-source BPE tokenizer by OpenAI)* to tokenize text.

This notebook draws on several resources:

1) https://platform.openai.com/tokenizer
2) https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
3) https://udlbook.github.io/udlbook/ 
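
Since tiktoken is a BPE (byte pair encoding) tokenizer, it may help to see the core idea in miniature. The sketch below is a toy, character-level illustration and not tiktoken's actual implementation (which works on bytes with a pre-trained vocabulary). The function names `most_frequent_pair`, `merge_pair`, and `train_toy_bpe` are hypothetical and introduced only for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` merges from whitespace-separated words (toy assumption)."""
    words = dict(Counter(tuple(word) for word in text.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = train_toy_bpe("sea sea sea see see see", num_merges=3)
print(merges)  # the first learned merge is ('s', 'e'), the most frequent pair
print(words)
```

After three merges, both `sea` and `see` have become single symbols. Frequent words end up as one token, while rare strings stay split into several smaller pieces.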

In [9]:
!pip install tiktoken


In [10]:
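import tiktoken

# Load an encoding by name and check that encode/decode round-trips.
enc = tiktoken.get_encoding("o200k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# Alternatively, look up the encoding by model name ("gpt-4o" uses o200k_base).
enc = tiktoken.encoding_for_model("gpt-4o")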

In [11]:
def print_colorful_tokens(tokens):
    """
    Display tokens with alternating background colors.

    Args:
        tokens: Iterable of token values. Items may be:
            - str: token text
            - bytes or bytearray: will be decoded with UTF-8 (errors='replace')
            - int: token id (will be converted to str)
    Behavior:
        Wraps each token in a colored <span> (4 pastel colors cycled) and attempts
        to render the result as HTML in a Jupyter environment. If HTML display is
        unavailable, falls back to printing the plain tokens.
    """
    spans = []
    colors = ["#ffe6e6", "#e6f2ff", "#e6ffe6", "#fff2e6"]  # pastel red, blue, green, orange
    
    for i, t in enumerate(tokens):
        if isinstance(t, (bytes, bytearray)):
            t = t.decode("utf-8", errors="replace")
        
        if str(t) == '\n':
            spans.append('<br>')
            continue
        
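        # Escape '&' first so the entities produced for '<' and '>' are not escaped again.
        safe_t = str(t).replace('&', '&amp;').replace(' ', '_').replace('<', '&lt;').replace('>', '&gt;')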
        color = colors[i % len(colors)]
        spans.append(f'<span style="background:{color}; padding:2px 6px; margin:2px; border-radius:4px; display:inline-block">{safe_t}</span>')

    html = '<div style="font-family:monospace; line-height:1.6">' + ' '.join(spans) + '</div>'

    try:
        from IPython.display import display, HTML
        display(HTML(html))
    
    except Exception:
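        # Plain-text fallback: tokens may be ints or bytes, so convert each one to str first.
        print(' '.join(
            t.decode("utf-8", errors="replace") if isinstance(t, (bytes, bytearray)) else str(t)
            for t in tokens
        ))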
        
def tokenize_string(text: str) -> tuple[list[int], list[bytes]]:
    """
    Encode text using tiktoken and display tokens with alternating background colors.

    Args:
        text (str): The input string to be tokenized.

    Returns:
        tuple[list[int], list[bytes]]:
            - encoded_tokens: list of token ids (ints) produced by enc.encode(text).
            - tokens: list of the raw bytes of each token (each element is bytes).

    Side effects:
        - Prints a short summary showing the original text, encoded token ids,
          and decoded token strings.
        - Attempts to render the tokens as colored HTML spans in a Jupyter
          environment (falls back to plain printing if HTML display isn't available).

    Notes:
        - Relies on a global `enc` tiktoken tokenizer defined earlier in the notebook.
        - Decoding of single-token bytes uses enc.decode_single_token_bytes(...).
    """
    encoded_tokens = enc.encode(text)
    tokens = [enc.decode_single_token_bytes(token) for token in encoded_tokens]

    # Print the original text, then render the token ids and the token bytes
    print(f"Original Text: {text}\n")

    print("\nEncoded Tokens:")

    print_colorful_tokens(encoded_tokens)

    print("\nDecoded Tokens:")
    print_colorful_tokens(tokens)

    print("Notice that the spaces are replaced with underscores in the visualization above for clarity.\nFor example, the token ( to) is printed as (_to).\nNewline characters are replaced with a printed new line, however their token ids are preserved.\n")
    return encoded_tokens, tokens


In [12]:
# text = "hello world aaaaaaaaaaaa"  # shorter alternative input, not used below
text = """A sailor went to sea sea sea
to see what he could see see see
but all that he could see see see
was the bottom of the deep blue sea sea sea"""

encoded_tokens, tokens = tokenize_string(text)

Original Text: A sailor went to sea sea sea
to see what he could see see see
but all that he could see see see
was the bottom of the deep blue sea sea sea


Encoded Tokens:



Decoded Tokens:


Notice that the spaces are replaced with underscores in the visualization above for clarity.
For example, the token ( to) is printed as (_to).
Newline characters are replaced with a printed new line, however their token ids are preserved.



In [13]:
text = "https://www.udst.edu.qa/admissions/why-udst"

encoded_tokens, tokens = tokenize_string(text)


Original Text: https://www.udst.edu.qa/admissions/why-udst


Encoded Tokens:



Decoded Tokens:


Notice that the spaces are replaced with underscores in the visualization above for clarity.
For example, the token ( to) is printed as (_to).
Newline characters are replaced with a printed new line, however their token ids are preserved.



In [18]:
text = "@Hassani truly loves #AIresearch, but 🤖 & 🧠 make him go 'hmm...?'"

encoded_tokens, tokens = tokenize_string(text)


Original Text: @Hassani truly loves #AIresearch, but 🤖 & 🧠 make him go 'hmm...?'


Encoded Tokens:



Decoded Tokens:


Notice that the spaces are replaced with underscores in the visualization above for clarity.
For example, the token ( to) is printed as (_to).
Newline characters are replaced with a printed new line, however their token ids are preserved.
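
The display helper decodes each token's bytes with `errors='replace'` because a byte-level token can hold only part of a multi-byte UTF-8 character, which is most likely for emoji and other rare characters. The sketch below illustrates the effect in plain Python. The split point is a hypothetical one chosen for illustration; where tiktoken actually places token boundaries depends on its vocabulary.

```python
robot = "🤖"
raw = robot.encode("utf-8")  # 4 bytes: f0 9f a4 96

# Hypothetical token boundary in the middle of the character (an assumption,
# not necessarily where tiktoken splits this emoji).
pieces = [raw[:2], raw[2:]]

# Decoding each piece on its own cannot recover the character, so Python
# substitutes U+FFFD (the replacement character) for the invalid bytes.
decoded_separately = [p.decode("utf-8", errors="replace") for p in pieces]

# Joining the bytes before decoding restores the original text.
decoded_together = b"".join(pieces).decode("utf-8")

print(raw.hex(" "))
print(decoded_separately)
print(decoded_together)
```

This is why a token in the visualization can appear as a replacement character even though decoding the full token sequence at once gives back the original string.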

