## Reading in a short story as a text sample into Python

## Step 1: Creating Tokens

Our goal is to tokenize documents into individual words and special
characters that we can then turn into embeddings for LLM training.
Note that it's common to process millions of articles and hundreds of thousands of
books -- many gigabytes of text -- when working with LLMs.


Using some simple example text, we can use the re.split command with the following
syntax to split a text on whitespace characters:</div>

In [1]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


The result is a list of individual words and whitespace characters. Note that the punctuation is still attached to the words.

Let's modify the regular expression so that it splits on whitespace (\s) as well as on commas and periods
([,.]):

In [2]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


<div class="alert alert-block alert-warning">

A small remaining issue is that the list still includes whitespace characters and empty strings. Optionally, we
can safely remove these redundant entries as follows:</div>

In [3]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


<div class="alert alert-block alert-warning">

REMOVING WHITESPACES OR NOT


When developing a simple tokenizer, whether we should encode whitespaces as
separate characters or just remove them depends on our application and its
requirements. Removing whitespaces reduces the memory and computing
requirements. However, keeping whitespaces can be useful if we train models that
are sensitive to the exact structure of the text (for example, Python code, which is
sensitive to indentation and spacing). Here, we remove whitespaces for simplicity
and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme
that includes whitespaces.

</div>
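To make the trade-off concrete, here is a minimal sketch that tokenizes a two-line Python snippet both ways. The snippet and the variable names (`code`, `kept`, `stripped`) are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical input: indentation carries meaning in Python code
code = "if x:\n    y = 1"

raw = re.split(r'(\s)', code)

# Variant 1: keep whitespace tokens, drop only the empty strings
kept = [t for t in raw if t != ""]

# Variant 2: drop whitespace tokens entirely (as done in this section)
stripped = [t for t in raw if t.strip()]

print(kept)      # whitespace tokens preserved, including the newline and indentation
print(stripped)  # ['if', 'x:', 'y', '=', '1']

# Only the whitespace-preserving variant can reproduce the input exactly
print("".join(kept) == code)  # True
```

The second variant yields fewer tokens, but the indentation can no longer be recovered from it.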

## Step 2: Creating Token IDs

In the previous section, we tokenized a document. Let's now create a list of all unique tokens and sort
them alphabetically to determine the vocabulary size:

In [None]:
all_words = sorted(set(result))
print("Unique Words: ", all_words)
vocab_size = len(all_words)

print(vocab_size)

Unique Words:  [',', '.', 'Hello', 'This', 'a', 'is', 'test', 'world']
8


<div class="alert alert-block alert-success">

After determining that the vocabulary size is 8 via the above code, we create the
vocabulary and print its entries (at most the first 51) for illustration purposes:

</div>

In [None]:
vocab = {token:integer for integer,token in enumerate(all_words)}


In [None]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

(',', 0)
('.', 1)
('Hello', 2)
('This', 3)
('a', 4)
('is', 5)
('test', 6)
('world', 7)


<div class="alert alert-block alert-info">
As we can see, based on the output above, the dictionary contains individual tokens
associated with unique integer labels.
</div>

<div class="alert alert-block alert-info">
    
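With the vocabulary in place, a simple tokenizer class with encode and decode methods can be built in five steps:

Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods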
    
Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Replace spaces before the specified punctuation

</div>
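The five steps above can be sketched as a small class. This is a minimal sketch, not a definitive implementation: the class name `SimpleTokenizerV1` and the attribute names `str_to_int` and `int_to_str` are assumptions, and the regular expression only handles commas and periods, as in the examples so far.

```python
import re


class SimpleTokenizerV1:  # name is an assumption for illustration
    def __init__(self, vocab):
        self.str_to_int = vocab                             # Step 1
        self.int_to_str = {i: s for s, i in vocab.items()}  # Step 2

    def encode(self, text):                                 # Step 3
        tokens = re.split(r'([,.]|\s)', text)
        tokens = [t for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):                                  # Step 4
        text = " ".join(self.int_to_str[i] for i in ids)
        # Step 5: remove the space inserted before commas and periods
        return re.sub(r'\s+([,.])', r'\1', text)


# Rebuild the 8-token vocabulary from the sample sentence
sample = "Hello, world. This, is a test."
tokens = [t for t in re.split(r'([,.]|\s)', sample) if t.strip()]
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

tokenizer = SimpleTokenizerV1(vocab)
ids = tokenizer.encode(sample)
print(ids)                    # [2, 0, 7, 1, 3, 0, 5, 4, 6, 1]
print(tokenizer.decode(ids))  # Hello, world. This, is a test.
```

Because `encode` uses a plain dictionary lookup, any token missing from the vocabulary raises a `KeyError`.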



<div class="alert alert-block alert-success">

So far, so good. Now try to tokenize text = "Hello, do you like tea?". With the vocabulary built above, this raises an error (a KeyError for a plain dictionary lookup) because words such as 'do' are not contained in the vocabulary.

</div>

### ADDING SPECIAL CONTEXT TOKENS

In particular, we will modify the vocabulary and the tokenizer we implemented in the
previous section. The new version, SimpleTokenizerV2, supports two new tokens, <|unk|> and
<|endoftext|>.

all_tokens.extend(["<|endoftext|>", "<|unk|>"])
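A minimal sketch of how the extended vocabulary and a tokenizer with unknown-word handling might fit together is shown below. It assumes that `all_tokens` is the sorted list of unique tokens from the sample sentence; the method bodies are illustrative and not taken from the original notebook.

```python
import re

sample = "Hello, world. This, is a test."
tokens = [t for t in re.split(r'([,.]|\s)', sample) if t.strip()]

# Assumption: all_tokens is the sorted list of unique tokens
all_tokens = sorted(set(tokens))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: idx for idx, token in enumerate(all_tokens)}


class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        tokens = [t for t in re.split(r'([,.]|\s)', text) if t.strip()]
        # Map every out-of-vocabulary token to <|unk|>
        tokens = [t if t in self.str_to_int else "<|unk|>" for t in tokens]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.])', r'\1', text)


tokenizer = SimpleTokenizerV2(vocab)

# Two unrelated texts, separated by <|endoftext|>
text = " <|endoftext|> ".join(("Hello, do you like tea?", "This is a test."))
ids = tokenizer.encode(text)
print(ids)                    # [2, 0, 9, 9, 9, 9, 8, 3, 5, 4, 6, 1]
print(tokenizer.decode(ids))
```

The decoded text shows `<|unk|>` in place of every word that is missing from the vocabulary, so the round trip is lossy for unknown words.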

Depending on the LLM, some researchers also consider additional special tokens such
as the following:

[BOS] (beginning of sequence): This token marks the start of a text. It
signifies to the LLM where a piece of content begins.

[EOS] (end of sequence): This token is positioned at the end of a text,
and is especially useful when concatenating multiple unrelated texts,
similar to <|endoftext|>. For instance, when combining two different
Wikipedia articles or books, the [EOS] token indicates where one article
ends and the next one begins.

[PAD] (padding): When training LLMs with batch sizes larger than one,
the batch might contain texts of varying lengths. To ensure all texts have
the same length, the shorter texts are extended or "padded" using the
[PAD] token, up to the length of the longest text in the batch.
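As a rough illustration of padding, the sketch below extends a batch of token-ID lists to a common length. The token IDs and the choice of `PAD_ID = 0` are hypothetical; real tokenizers define their own padding ID.

```python
PAD_ID = 0  # hypothetical ID reserved for the [PAD] token

# Hypothetical batch of token-ID sequences with different lengths
batch = [[5, 7, 2], [4, 1], [9, 3, 8, 6]]

max_len = max(len(seq) for seq in batch)

# Append PAD_ID until every sequence has the length of the longest one
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

# A mask marks real tokens (1) versus padding (0)
mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

print(padded)  # [[5, 7, 2, 0], [4, 1, 0, 0], [9, 3, 8, 6]]
print(mask)    # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```

The mask is commonly passed along with the batch so that the padded positions can be ignored during training.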


<div class="alert alert-block alert-warning">

Note that the tokenizer used for GPT models does not need any of the tokens mentioned
above; for simplicity, it only uses an <|endoftext|> token.

GPT models also do not use an <|unk|> token for out-of-vocabulary words. Instead, they use a
byte pair encoding tokenizer, which breaks words down into subword units.

</div>

### BYTE PAIR ENCODING (BPE)


<div class="alert alert-block alert-success">

We implemented a simple tokenization scheme in the previous sections for illustration
purposes.

This section covers a more sophisticated tokenization scheme based on a concept
called byte pair encoding (BPE).

The BPE tokenizer covered in this section was used to train
LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

</div>

<div class="alert alert-block alert-success">

Three types of tokenizers:

1. Word-based
    * (-) cannot handle out-of-vocabulary (OOV) words
    * (-) treats related words, such as singular and plural forms, as unrelated tokens
2. Subword-based: breaks words into smaller, meaningful subword units
3. Character-based: splits text into individual characters, giving a small vocabulary (about 256 for English)
    * (+) memory efficient
    * (-) loses word-level semantic information

</div>

## What is BPE?

BPE iteratively merges the most frequent pair of characters or character sequences to create a vocabulary of subword units.

## Step-by-Step Example

Let's say we have a small corpus:

### **Initial Corpus**
```
low low low low low
lower lower lower
newest newest newest newest newest newest
widest widest widest
```

### **Step 1: Start with Characters**

Split everything into individual characters (with end-of-word marker `</w>`):

```
l o w </w>         (appears 5 times)
l o w e r </w>     (appears 3 times)  
n e w e s t </w>   (appears 6 times)
w i d e s t </w>   (appears 3 times)
```

**Initial vocabulary:** 
```
{l, o, w, e, r, n, i, d, s, t, </w>}  # 11 tokens
```

### **Step 2: Find Most Frequent Pair**

Count all adjacent character pairs:

```
Pair        Frequency
(e, s)      6+3 = 9     ← Most frequent!
(s, t)      6+3 = 9     ← Tie!
(t, </w>)   6+3 = 9     ← Tie!
(w, e)      3+6 = 9     ← Tie!
(l, o)      5+3 = 8
(o, w)      5+3 = 8
(n, e)      6
(e, w)      6
(w, </w>)   5
...
```

Let's merge `(e, s)` first.

### **Step 3: Merge Most Frequent Pair**

Replace all `e s` with `es`:

```
l o w </w>
l o w e r </w>
n e w es t </w>      ← merged!
w i d es t </w>      ← merged!
```

**Vocabulary now:**
```
{l, o, w, e, r, n, i, d, s, t, </w>, es}  # 12 tokens
```

### **Step 4: Repeat**

Find next most frequent pair:

```
Pair           Frequency
(es, t)        6+3 = 9    ← Most frequent!
(t, </w>)      6+3 = 9    ← Tie!
(l, o)         5+3 = 8
(o, w)         5+3 = 8
...
```

Merge `(es, t)` → `est`:

```
l o w </w>
l o w e r </w>
n e w est </w>       ← merged!
w i d est </w>       ← merged!
```

**Vocabulary now:**
```
{l, o, w, e, r, n, i, d, s, t, </w>, es, est}  # 13 tokens
```

### **Step 5: Continue Merging**

Keep going until you reach the desired vocabulary size:

**Iteration 3:** Merge `(est, </w>)` → `est</w>`
```
l o w </w>
l o w e r </w>
n e w est</w>
w i d est</w>
```

**Iteration 4:** Merge `(l, o)` → `lo`
```
lo w </w>
lo w e r </w>
n e w est</w>
w i d est</w>
```

**Iteration 5:** Merge `(lo, w)` → `low`
```
low </w>
low e r </w>
n e w est</w>
w i d est</w>
```

And so on...
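The merge loop above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the helper names `get_pair_counts` and `merge_pair` are made up for illustration, and ties are broken by picking the lexicographically smallest pair, which happens to reproduce the five merges of the walkthrough. Real BPE implementations may break ties differently.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Corpus from the walkthrough: word -> number of occurrences
corpus = {"low": 5, "lower": 3, "newest": 6, "widest": 3}
word_freqs = {tuple(w) + ("</w>",): f for w, f in corpus.items()}

merges = []
for _ in range(5):
    counts = get_pair_counts(word_freqs)
    # Highest count first; ties go to the lexicographically smallest pair
    best = min(counts, key=lambda p: (-counts[p], p))
    merges.append(best)
    word_freqs = merge_pair(best, word_freqs)

print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
for symbols, freq in word_freqs.items():
    print(" ".join(symbols), freq)
```

Increasing the number of loop iterations continues the merging, and the vocabulary grows by one token per merge.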

## Final Result

After many iterations, you might end up with the following (the exact result depends on how ties are broken and on when merging stops):

**Final Vocabulary:**
```
{
  # Characters
  l, o, w, e, r, n, i, d, s, t, </w>,

  # Subwords
  es, est, est</w>, lo, low, low</w>,
  ne, new, newest</w>, er, er</w>, lower</w>,
  wi, wid, widest</w>
}
```

## How to Tokenize New Words

Now you can tokenize new words by applying the learned merges in the order in which they were learned:

**Example: "lowest"** (unseen word!)

```
Start:  l o w e s t </w>
Step 1: l o w es t </w>          (merge e+s)
Step 2: l o w est </w>           (merge es+t)
Step 3: l o w est</w>            (merge est+</w>)
Step 4: lo w est</w>             (merge l+o)
Step 5: low est</w>              (merge lo+w)
Final:  [low, est</w>]
```

Even though "lowest" wasn't in training, BPE can tokenize it using learned subwords!
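The tokenization of an unseen word can be sketched as follows. The function name `apply_merges` is an assumption, and the merge list is hard-coded from the first five merges of the walkthrough. Merges are applied in the order they were learned.

```python
def apply_merges(word, merges):
    """Tokenize a word by applying learned BPE merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# First five merges from the walkthrough, in learned order
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]

tokens = apply_merges("lowest", merges)
print(tokens)  # ['low', 'est</w>']
```

A word that shares no learned pairs with the corpus simply stays split into individual characters.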

## BPE vs FastText vs Word2Vec

| Method | Unit | Example: "running" |
|--------|------|-------------------|
| **Word2Vec** | Whole words | `["running"]` |
| **FastText** | Character n-grams | `["<ru", "run", "unn", "nni", "nin", "ing", "ng>"]` |
| **BPE** | Learned subwords | `["runn", "ing"]` |
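For comparison, the FastText row can be reproduced with a short sketch. The function name `char_ngrams` is hypothetical, and a fixed n-gram length of 3 is assumed to match the table. FastText itself typically combines several n-gram lengths.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in < and > markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("running"))
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']
```

Unlike BPE, these n-grams come from a fixed sliding window and are not learned from corpus statistics.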

## Real-World BPE Example (GPT)

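GPT-2 and GPT-3 use BPE with a vocabulary of roughly 50,000 tokens (50,257 for GPT-2). The splits below are illustrative only. The actual output depends on the merges the tokenizer learned:

```python
# Illustrative splits only, not actual tokenizer output
# "unhappiness"   -> ["un", "happiness"]
# "running"       -> ["run", "ning"]
# "COVID-19"      -> ["COVID", "-", "19"]
# "hyperglycemia" -> ["hyper", "gly", "cemia"]
```
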
## Why BPE is Important

**Advantages:**
- ✓ Handles rare/unseen words (unlike Word2Vec)
- ✓ Fixed vocabulary size
- ✓ Data-driven (learns from corpus)
- ✓ Language-agnostic

**Used in:**
- GPT (OpenAI)
- BERT variants such as RoBERTa (the original BERT uses WordPiece, a related subword method)
- Most modern LLMs
- Machine translation

<div class="alert alert-block alert-warning">

Since implementing BPE can be relatively complicated, we will use an existing Python
open-source library called tiktoken (https://github.com/openai/tiktoken).

This library implements
the BPE algorithm very efficiently based on source code in Rust.
</div>