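## Why does character-level tokenization fail?

Splitting a sentence into characters instead of words produces a much longer sequence (50 tokens instead of 12 in the cell below), and a single character carries little meaning on its own.
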
In [44]:
sentence = "Today, I want to start my day with a cup of coffee"

# Split on whitespace to get individual words
words = sentence.split()

# Print the list of word tokens
print("Word tokens:", words)

# Print the total number of words
print("Number of words:", len(words))


sentence = "Today, I want to start my day with a cup of coffee"

# Convert the sentence into a list of individual characters
characters = list(sentence)

# Print the list of character tokens
print("Character tokens:", characters)

# Print the total number of characters
print("Number of characters:", len(characters))



Word tokens: ['Today,', 'I', 'want', 'to', 'start', 'my', 'day', 'with', 'a', 'cup', 'of', 'coffee']
Number of words: 12
Character tokens: ['T', 'o', 'd', 'a', 'y', ',', ' ', 'I', ' ', 'w', 'a', 'n', 't', ' ', 't', 'o', ' ', 's', 't', 'a', 'r', 't', ' ', 'm', 'y', ' ', 'd', 'a', 'y', ' ', 'w', 'i', 't', 'h', ' ', 'a', ' ', 'c', 'u', 'p', ' ', 'o', 'f', ' ', 'c', 'o', 'f', 'f', 'e', 'e']
Number of characters: 50
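
To put a number on the difference, here is a small sketch that compares the two sequence lengths for the same sentence. The variable names are illustrative and not part of the notebook.

```python
sentence = "Today, I want to start my day with a cup of coffee"

word_tokens = sentence.split()
char_tokens = list(sentence)

# The same text needs several times more tokens at the character level,
# so a model with a fixed context length sees far less text per input.
length_ratio = len(char_tokens) / len(word_tokens)

print(f"{len(word_tokens)} word tokens vs {len(char_tokens)} character tokens")
print(f"character sequence is {length_ratio:.2f}x longer")
```

Subword methods such as BPE aim for a middle ground between these two extremes.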


## Implementing Byte Pair Encoding (BPE) from scratch

🦇🦇🦇

### Step 1: Take raw text and tokenize into characters

In [45]:
# Two sample texts; only `text` is tokenized in the cells below.
text1 = """The Dark Knight Rises is a superhero movie released in 2012. It is the final part of Christopher Nolan’s Dark Knight trilogy, following Batman Begins and The Dark Knight. The film stars Christian Bale as Bruce Wayne/Batman, who has been retired as Batman for eight years after the events of the previous movie.

The main villain in the movie is Bane, played by Tom Hardy. Bane is a powerful and intelligent terrorist who threatens Gotham City with destruction. He forces Bruce Wayne to come out of retirement and become Batman again. Anne Hathaway plays Selina Kyle, also known as Catwoman, a skilled thief with her own agenda.

The movie is about Bruce Wayne’s struggle to overcome his physical and emotional challenges to save Gotham. It also shows themes of hope, sacrifice, and resilience. The film has many exciting action scenes, such as a plane hijack and a massive battle in Gotham.

In the end, Batman saves the city and inspires the people of Gotham. However, he is believed to have sacrificed his life. The movie ends with a twist, suggesting that Bruce Wayne is alive and has moved on to live a quiet life.

The Dark Knight Rises was a big success and is loved by many fans for its epic story, strong characters, and thrilling action.\
"""

text = """
Casandra Brené Brown (born November 18, 1965) is an American academic and podcaster who is the Huffington Foundation's Brené Brown Endowed Chair at the University of Houston's Graduate College of Social Work and a visiting professor in management at the McCombs School of Business in the University of Texas at Austin. Brown is known for her work on shame, vulnerability, and leadership, and for her widely viewed 2010 TEDx talk.[2] She has written six number-one New York Times bestselling books and hosted two podcasts on Spotify.[3]

She appears in the 2019 documentary Brené Brown: The Call to Courage on Netflix. In 2022, HBO Max released a documentary series based on her book Atlas of the Heart.

Early life and education
Brown was born on November 18, 1965,[4] in San Antonio, Texas, where her parents, Charles Arthur Brown and Casandra Deanne Rogers,[4] had her baptized in the Episcopal Church. She is the eldest of four children.[5] Her family then moved to New Orleans, Louisiana.[6]

Brown completed a Bachelor of Social Work degree at the University of Texas at Austin in 1995, a Master of Social Work degree in 1996,[7] and a Doctor of Philosophy degree in social work at the University of Houston Graduate School of Social Work in 2002.[8]

Career
Research and teaching
Brown has studied the topics of courage, vulnerability, shame, empathy, and leadership, which she has used to look at human connection and how it works.[9] She has spent her research career as a professor at her alma mater, the University of Houston's Graduate College of Social Work.[10]

Public speaking
Brown's TEDx talk from Houston in 2010, "The Power of Vulnerability", is one of the five most viewed TED talks. Its popularity shifted her work from relative obscurity in academia into the mainstream spotlight.[11][12][13][14] The talk "summarizes a decade of Brown's research on shame, framing her weightiest discoveries in self-deprecating and personal terms."[14] Reggie Ugwu for The New York Times said that this event gave the world "a new star of social psychology."[14] She went on to follow this popular TED talk with another titled "Listening to Shame" in 2012. In the second talk she talks about how her life has changed since the first talk and explains the connection between shame and vulnerability, building on the thesis of her first TED talk.[15]

She also has a less well-known talk from 2010 given at TEDxKC titled "The Price of Invulnerability." In it she explains that when numbing hard and difficult feelings, essentially feeling vulnerable, we also numb positive emotions, like joy.[16] This led to the creation of her filmed lecture, Brené Brown: The Call to Courage, which debuted on Netflix in 2019.[17] USA Today called it "a mix of a motivational speech and stand-up comedy special."[17] Brown discusses how and why to choose courage over comfort, equating being brave to being vulnerable. According to her research, doing this opens people to love, joy, and belonging by allowing them to better know themselves and more deeply connect with other people.[18]

Brown regularly works as a public speaker at private events and businesses, such as at Alain de Botton's School of Life[13] and at Google and Disney.[14]
"""

# For non-ASCII characters and more general-purpose code, use UTF-8 bytes instead:
# tokens = text.encode("utf-8")  # raw bytes
# tokens = list(map(int, tokens))  # convert to a list of integers in range 0..255 for convenience

# For the sake of simplicity, we use one integer per character (its Unicode code point).
# Note: ord() is not limited to ASCII; the 'é' in this text maps to 233.
tokens = [ord(ch) for ch in text]

In [46]:
ids = list(tokens)  # copy so we don't destroy the original list


In [47]:
print(ids)

[10, 67, 97, 115, 97, 110, 100, 114, 97, 32, 66, 114, 101, 110, 233, 32, 66, 114, 111, 119, 110, 32, 40, 98, 111, 114, 110, 32, 78, 111, 118, 101, 109, 98, 101, 114, 32, 49, 56, 44, 32, 49, 57, 54, 53, 41, 32, 105, 115, 32, 97, 110, 32, 65, 109, 101, 114, 105, 99, 97, 110, 32, 97, 99, 97, 100, 101, 109, 105, 99, 32, 97, 110, 100, 32, 112, 111, 100, 99, 97, 115, 116, 101, 114, 32, 119, 104, 111, 32, 105, 115, 32, 116, 104, 101, 32, 72, 117, 102, 102, 105, 110, 103, 116, 111, 110, 32, 70, 111, 117, 110, 100, 97, 116, 105, 111, 110, 39, 115, 32, 66, 114, 101, 110, 233, 32, 66, 114, 111, 119, 110, 32, 69, 110, 100, 111, 119, 101, 100, 32, 67, 104, 97, 105, 114, 32, 97, 116, 32, 116, 104, 101, 32, 85, 110, 105, 118, 101, 114, 115, 105, 116, 121, 32, 111, 102, 32, 72, 111, 117, 115, 116, 111, 110, 39, 115, 32, 71, 114, 97, 100, 117, 97, 116, 101, 32, 67, 111, 108, 108, 101, 103, 101, 32, 111, 102, 32, 83, 111, 99, 105, 97, 108, 32, 87, 111, 114, 107, 32, 97, 110, 100, 32, 97, 32, 118, 105, 1
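
The difference between code points and UTF-8 bytes matters as soon as the text contains non-ASCII characters. A minimal sketch, using a short made-up sample string rather than the notebook's text:

```python
sample = "Brené Brown’s talk"

# Unicode code points: one integer per character, not limited to 0..127.
code_points = [ord(ch) for ch in sample]

# UTF-8 bytes: every value fits in 0..255, but a non-ASCII character
# expands to more than one byte.
utf8_bytes = list(sample.encode("utf-8"))

print("largest code point:", max(code_points))
print("largest byte value:", max(utf8_bytes))
print("characters:", len(code_points), "bytes:", len(utf8_bytes))

# The byte sequence round-trips back to the original string.
assert bytes(utf8_bytes).decode("utf-8") == sample
```

With bytes, the base vocabulary is always exactly 256 entries, which is why byte-level BPE usually numbers new tokens from 256 upward.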

### Step 2: Write a function to count the frequency of the adjacent pairs of characters

In [48]:
# 1) Count all adjacent pairs in our current sequence 'ids'.

def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(ids)
print(stats)


{(10, 67): 2, (67, 97): 5, (97, 115): 21, (115, 97): 3, (97, 110): 35, (110, 100): 28, (100, 114): 3, (114, 97): 18, (97, 32): 16, (32, 66): 15, (66, 114): 17, (114, 101): 23, (101, 110): 21, (110, 233): 4, (233, 32): 4, (114, 111): 18, (111, 119): 23, (119, 110): 15, (110, 32): 56, (32, 40): 1, (40, 98): 1, (98, 111): 5, (111, 114): 25, (114, 110): 2, (32, 78): 7, (78, 111): 2, (111, 118): 6, (118, 101): 20, (101, 109): 9, (109, 98): 6, (98, 101): 9, (101, 114): 49, (114, 32): 37, (32, 49): 6, (49, 56): 3, (56, 44): 2, (44, 32): 31, (49, 57): 6, (57, 54): 3, (54, 53): 2, (53, 41): 1, (41, 32): 1, (32, 105): 23, (105, 115): 17, (115, 32): 51, (32, 97): 57, (32, 65): 8, (65, 109): 1, (109, 101): 13, (114, 105): 9, (105, 99): 9, (99, 97): 10, (97, 99): 4, (97, 100): 9, (100, 101): 15, (109, 105): 5, (99, 32): 3, (100, 32): 46, (32, 112): 14, (112, 111): 7, (111, 100): 3, (100, 99): 2, (115, 116): 21, (116, 101): 16, (32, 119): 21, (119, 104): 6, (104, 111): 10, (111, 32): 17, (32, 116): 
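
The same pair counts can also be produced with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's implementation; `get_stats_counter` is a hypothetical name.

```python
from collections import Counter

def get_stats_counter(ids):
    """Count adjacent pairs; gives the same counts as the dict-based get_stats."""
    return Counter(zip(ids, ids[1:]))

example_ids = [1, 2, 3, 1, 2]
pair_counts = get_stats_counter(example_ids)

# most_common() lists pairs from most to least frequent.
print(pair_counts.most_common(2))
```

Because `Counter` is a `dict` subclass, `max(pair_counts, key=pair_counts.get)` works on it unchanged.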

### Step 3: Select the pair with the highest frequency

In [49]:
# 2) Select the pair with the highest frequency.
pair = max(stats, key=stats.get)
print(pair)


(101, 32)
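
When two pairs are tied for the highest count, `max` returns whichever was inserted into the dictionary first, which is the pair that appears earliest in the sequence. A small sketch with made-up counts:

```python
toy_stats = {(1, 2): 3, (2, 3): 5, (3, 4): 5, (4, 5): 1}

# max() keeps the first key that reaches the largest count.
best_pair = max(toy_stats, key=toy_stats.get)

# Sorting by count shows the runners-up, which is handy when inspecting merges.
top_pairs = sorted(toy_stats.items(), key=lambda kv: kv[1], reverse=True)[:3]

print(best_pair)
print(top_pairs)
```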


### Step 4: Define the new token's ID (ID of the merged token is added to vocabulary)

In [50]:
# 3) Define the new token's ID as 128 + i (the first ID after the 128 ASCII codes).
i = 0
idx = 128 + i
print(idx)

128
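
Starting at 128 is safe only while no character in the text has a code point in the range used for new tokens. Since this text contains 'é' (code point 233), a long enough run of merges could reuse an ID that already stands for a character. One possible way to avoid that, sketched here as an assumption rather than the notebook's approach, is to start numbering above the largest ID already present:

```python
sample_text = "Brené Brown"
sample_ids = [ord(ch) for ch in sample_text]

# 'é' is 233, so not every ID in 128..255 is free for new tokens.
largest_existing_id = max(sample_ids)

# Conservative choice: start new token IDs above every existing ID,
# and never below 128 so pure-ASCII text behaves as before.
first_new_id = max(128, largest_existing_id + 1)

print(largest_existing_id, first_new_id)
```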


In [51]:
# For readability, decode the original token IDs (pair[0], pair[1]) into characters
# just for a nice printout. (Assumes both IDs are character code points, not merged tokens.)
char_pair = (chr(pair[0]), chr(pair[1]))
print(char_pair)

('e', ' ')


### Step 5: Show which pair we are merging

In [52]:
# Show which pair we are merging.
print(f"merging {pair} ({char_pair[0]}{char_pair[1]}) into a new token {idx}")

merging (101, 32) (e ) into a new token 128


### Step 6: Perform the merge: replace all occurrences of the most frequent pair with the new token ID

In [53]:
# 4) Perform the merge, replacing all occurrences of 'pair' with 'idx'.
def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

ids = merge(ids, pair, idx)
print(ids)

[10, 67, 97, 115, 97, 110, 100, 114, 97, 32, 66, 114, 101, 110, 233, 32, 66, 114, 111, 119, 110, 32, 40, 98, 111, 114, 110, 32, 78, 111, 118, 101, 109, 98, 101, 114, 32, 49, 56, 44, 32, 49, 57, 54, 53, 41, 32, 105, 115, 32, 97, 110, 32, 65, 109, 101, 114, 105, 99, 97, 110, 32, 97, 99, 97, 100, 101, 109, 105, 99, 32, 97, 110, 100, 32, 112, 111, 100, 99, 97, 115, 116, 101, 114, 32, 119, 104, 111, 32, 105, 115, 32, 116, 104, 128, 72, 117, 102, 102, 105, 110, 103, 116, 111, 110, 32, 70, 111, 117, 110, 100, 97, 116, 105, 111, 110, 39, 115, 32, 66, 114, 101, 110, 233, 32, 66, 114, 111, 119, 110, 32, 69, 110, 100, 111, 119, 101, 100, 32, 67, 104, 97, 105, 114, 32, 97, 116, 32, 116, 104, 128, 85, 110, 105, 118, 101, 114, 115, 105, 116, 121, 32, 111, 102, 32, 72, 111, 117, 115, 116, 111, 110, 39, 115, 32, 71, 114, 97, 100, 117, 97, 116, 128, 67, 111, 108, 108, 101, 103, 128, 111, 102, 32, 83, 111, 99, 105, 97, 108, 32, 87, 111, 114, 107, 32, 97, 110, 100, 32, 97, 32, 118, 105, 115, 105, 116, 10
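
The full output is hard to read, so here is the same `merge` function applied to two tiny hand-made lists. The inputs and the ID 99 are made up for illustration.

```python
def merge(ids, pair, idx):
    """Replace every non-overlapping occurrence of pair in ids with idx."""
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

# Both occurrences of (6, 7) are replaced; the lone 6 is kept.
print(merge([5, 6, 6, 7, 9, 1, 6, 7], (6, 7), 99))  # [5, 6, 99, 9, 1, 99]

# Overlapping matches are resolved left to right: the first two 7s merge,
# and the third 7 has no partner left.
print(merge([7, 7, 7], (7, 7), 99))  # [99, 7]
```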

### Step 7: Write all functions together and define number of iterations to run.

Here we choose how many merges to perform. Each merge adds one token to the vocabulary, so 20 merges grow it from 128 to 148.

In [54]:
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

# ---
vocab_size = 148  # the desired final vocabulary size
num_merges = vocab_size - 128
ids = list(tokens)  # copy so we don't destroy the original list

merges = {}  # (int, int) -> int
vocab = {}   # merged token ID -> the string it stands for (used for display only)
for i in range(num_merges):
    # 1) Count all adjacent pairs in our current sequence 'ids'.
    stats = get_stats(ids)
    # 2) Select the pair with the highest frequency.
    pair = max(stats, key=stats.get)
    # 3) Define the new token's ID.
    idx = 128 + i
    # Decode the pair for display. Merged tokens are looked up in 'vocab',
    # because chr() on a merged ID would print an unrelated control character.
    char_pair = tuple(vocab.get(t, chr(t)) for t in pair)
    print(f"merging {pair} ({char_pair[0]}{char_pair[1]}) into a new token {idx}")
    # 4) Perform the merge and record it.
    ids = merge(ids, pair, idx)
    merges[pair] = idx
    vocab[idx] = char_pair[0] + char_pair[1]

merging (101, 32) (e ) into a new token 128
merging (110, 32) (n ) into a new token 129
merging (115, 32) (s ) into a new token 130
merging (101, 114) (er) into a new token 131
merging (100, 32) (d ) into a new token 132
merging (116, 104) (th) into a new token 133
merging (97, 110) (an) into a new token 134
merging (44, 32) (, ) into a new token 135
merging (116, 32) (t ) into a new token 136
merging (97, 108) (al) into a new token 137
merging (105, 110) (in) into a new token 138
merging (111, 102) (of) into a new token 139
merging (111, 114) (or) into a new token 140
merging (131, 32) (er ) into a new token 141
merging (139, 32) (of ) into a new token 142
merging (111, 119) (ow) into a new token 143
merging (134, 132) (and ) into a new token 144
merging (114, 101) (re) into a new token 145
merging (116, 111) (to) into a new token 146
merging (133, 128) (the ) into a new token 147
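
A learned merges table is enough to tokenize new text and to map IDs back to strings. The following is a minimal sketch under two assumptions: merged IDs never collide with character code points, and merges are replayed in the order they were learned. The toy merges table and the helper names `build_vocab`, `encode`, and `decode` are illustrative and not part of the notebook.

```python
def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def build_vocab(merges):
    """Map each merged ID to its string. Relies on dict insertion order,
    so a merged token is always defined before another merge uses it."""
    vocab = {}
    for (left, right), idx in merges.items():
        vocab[idx] = vocab.get(left, chr(left)) + vocab.get(right, chr(right))
    return vocab

def encode(text, merges):
    """Turn text into token IDs by replaying the merges in learned order."""
    ids = [ord(ch) for ch in text]
    for pair, idx in merges.items():
        ids = merge(ids, pair, idx)
    return ids

def decode(ids, vocab):
    """Turn token IDs back into text; unmerged IDs are plain code points."""
    return "".join(vocab.get(i, chr(i)) for i in ids)

# Toy table: 'th' -> 128, 'e ' -> 129, then 'th' + 'e ' -> 130.
toy_merges = {(116, 104): 128, (101, 32): 129, (128, 129): 130}
toy_vocab = build_vocab(toy_merges)

encoded = encode("the theme ", toy_merges)
print(encoded)                     # [130, 128, 101, 109, 129]
print(decode(encoded, toy_vocab))  # the theme
```

The round trip `decode(encode(text))` returning the original text is a useful sanity check for any tokenizer.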


In [59]:
print(tokens)
print(ids)
print("tokens length:", len(tokens))
print("ids length:", len(ids))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")

[10, 67, 97, 115, 97, 110, 100, 114, 97, 32, 66, 114, 101, 110, 233, 32, 66, 114, 111, 119, 110, 32, 40, 98, 111, 114, 110, 32, 78, 111, 118, 101, 109, 98, 101, 114, 32, 49, 56, 44, 32, 49, 57, 54, 53, 41, 32, 105, 115, 32, 97, 110, 32, 65, 109, 101, 114, 105, 99, 97, 110, 32, 97, 99, 97, 100, 101, 109, 105, 99, 32, 97, 110, 100, 32, 112, 111, 100, 99, 97, 115, 116, 101, 114, 32, 119, 104, 111, 32, 105, 115, 32, 116, 104, 101, 32, 72, 117, 102, 102, 105, 110, 103, 116, 111, 110, 32, 70, 111, 117, 110, 100, 97, 116, 105, 111, 110, 39, 115, 32, 66, 114, 101, 110, 233, 32, 66, 114, 111, 119, 110, 32, 69, 110, 100, 111, 119, 101, 100, 32, 67, 104, 97, 105, 114, 32, 97, 116, 32, 116, 104, 101, 32, 85, 110, 105, 118, 101, 114, 115, 105, 116, 121, 32, 111, 102, 32, 72, 111, 117, 115, 116, 111, 110, 39, 115, 32, 71, 114, 97, 100, 117, 97, 116, 101, 32, 67, 111, 108, 108, 101, 103, 101, 32, 111, 102, 32, 83, 111, 99, 105, 97, 108, 32, 87, 111, 114, 107, 32, 97, 110, 100, 32, 97, 32, 118, 105, 1

### In-class activity: Take any text of your choice and implement the BPE algorithm. Report the compression ratio achieved.
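
One possible starting point for the activity is to wrap the training loop in a function that also returns what is needed to compute the compression ratio. This is a sketch, not a reference solution: `train_bpe` is a hypothetical helper, the sample text is made up, and two choices are assumptions of this sketch (new IDs start above every code point in the text, and training stops early once the best pair occurs only once).

```python
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def train_bpe(text, num_merges):
    """Run up to num_merges BPE merges on text."""
    tokens = [ord(ch) for ch in text]
    ids = list(tokens)  # copy so the original list is kept for comparison
    # Assumption: start new IDs above ASCII and above every code point present.
    next_id = max(128, max(tokens, default=0) + 1)
    merges = {}
    for _ in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break  # fewer than two tokens left
        pair = max(stats, key=stats.get)
        if stats[pair] < 2:
            break  # a pair seen once would add a vocabulary entry to save one token
        ids = merge(ids, pair, next_id)
        merges[pair] = next_id
        next_id += 1
    return tokens, ids, merges

sample = "the cat sat on the mat. the cat ate the rat. " * 4
tokens, ids, merges = train_bpe(sample, num_merges=10)

print("tokens length:", len(tokens))
print("ids length:", len(ids))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
```

Trying different values of `num_merges` on the same text shows how the ratio grows as the vocabulary gets larger.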

## Using the tiktoken library

In [56]:
! pip install tiktoken

import tiktoken

# Text to encode and decode
text = "The lion roams in the jungle"

# ─────────────────────────────────────────────────────────────────────────
# 1. GPT-2 Encoding/Decoding
#    Using the "gpt2" encoding
# ─────────────────────────────────────────────────────────────────────────
tokenizer_gpt2 = tiktoken.get_encoding("gpt2")

# Encode: text -> list of token IDs
token_ids_gpt2 = tokenizer_gpt2.encode(text)

# Decode: list of token IDs -> original text (just to verify correctness)
decoded_text_gpt2 = tokenizer_gpt2.decode(token_ids_gpt2)

# We can also get each token string by decoding the IDs one by one
tokens_gpt2 = [tokenizer_gpt2.decode([tid]) for tid in token_ids_gpt2]

print("=== GPT-2 Encoding ===")
print("Original Text: ", text)
print("Token IDs:     ", token_ids_gpt2)
print("Tokens:        ", tokens_gpt2)
print("Decoded Text:  ", decoded_text_gpt2)
print()


=== GPT-2 Encoding ===
Original Text:  The lion roams in the jungle
Token IDs:      [464, 18744, 686, 4105, 287, 262, 20712]
Tokens:         ['The', ' lion', ' ro', 'ams', ' in', ' the', ' jungle']
Decoded Text:   The lion roams in the jungle






In [57]:
# ─────────────────────────────────────────────────────────────────────────
# 2. GPT-3.5 Encoding
#    Using the encoding_for_model("gpt-3.5-turbo")
# ─────────────────────────────────────────────────────────────────────────
tokenizer_gpt35 = tiktoken.encoding_for_model("gpt-3.5-turbo")

token_ids_gpt35 = tokenizer_gpt35.encode(text)
decoded_text_gpt35 = tokenizer_gpt35.decode(token_ids_gpt35)
tokens_gpt35 = [tokenizer_gpt35.decode([tid]) for tid in token_ids_gpt35]

print("=== GPT-3.5 Encoding ===")
print("Original Text: ", text)
print("Token IDs:     ", token_ids_gpt35)
print("Tokens:        ", tokens_gpt35)
print("Decoded Text:  ", decoded_text_gpt35)
print()


=== GPT-3.5 Encoding ===
Original Text:  The lion roams in the jungle
Token IDs:      [791, 40132, 938, 4214, 304, 279, 45520]
Tokens:         ['The', ' lion', ' ro', 'ams', ' in', ' the', ' jungle']
Decoded Text:   The lion roams in the jungle



In [58]:
# ─────────────────────────────────────────────────────────────────────────
# 3. GPT-4 Encoding
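#    Using the encoding_for_model("gpt-4"). Like "gpt-3.5-turbo", this resolves to
#    the cl100k_base encoding, so the token IDs match the GPT-3.5 cell.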
# ─────────────────────────────────────────────────────────────────────────
tokenizer_gpt4 = tiktoken.encoding_for_model("gpt-4")

token_ids_gpt4 = tokenizer_gpt4.encode(text)
decoded_text_gpt4 = tokenizer_gpt4.decode(token_ids_gpt4)
tokens_gpt4 = [tokenizer_gpt4.decode([tid]) for tid in token_ids_gpt4]

print("=== GPT-4 Encoding ===")
print("Original Text: ", text)
print("Token IDs:     ", token_ids_gpt4)
print("Tokens:        ", tokens_gpt4)
print("Decoded Text:  ", decoded_text_gpt4)

=== GPT-4 Encoding ===
Original Text:  The lion roams in the jungle
Token IDs:      [791, 40132, 938, 4214, 304, 279, 45520]
Tokens:         ['The', ' lion', ' ro', 'ams', ' in', ' the', ' jungle']
Decoded Text:   The lion roams in the jungle
