# Small Language Workshop Part 1

# Tokenization from Scratch: Building a Byte Pair Encoder (BPE)

In this notebook, we will implement a **byte-level tokenizer** similar to the one used in GPT-style models.

By the end, you will:
- Understand why tokenization is necessary
- Implement Byte Pair Encoding from scratch
- Train a tokenizer on real Unicode text
- Encode and decode text losslessly
- Understand why tokenization matters for Small Language Models

## 1. Why Tokenization?

Neural networks operate on **numbers**, but language is **text**.

Key challenges:
- Variable-length sequences
- Unicode & emojis
- Finite vocabulary
- Efficiency (context length matters!)

💡 **Key idea:** Tokenization is a *compression problem*.

<img src='Neural_Network.png'>

## 2. Text → Bytes

Instead of starting from characters or words, GPT-style tokenizers start from the raw UTF-8 **bytes** of the text.

Why?
- Every string can be represented as bytes
- No unknown tokens
- Unicode-safe

In [None]:
text = "hello 😄 students"
encoded = text.encode("utf-8")

print("Original text:", text)
print("UTF-8 Encoding:", encoded)  # bytes object (non-ASCII bytes are shown as \x.. hex escapes)
print("List Encoded UTF-8:", list(encoded))  # the same bytes as decimal integers (0-255)

Original text: hello 😄 students
UTF-8 Encoding: b'hello \xf0\x9f\x98\x84 students'
List Encoded UTF-8: [104, 101, 108, 108, 111, 32, 240, 159, 152, 132, 32, 115, 116, 117, 100, 101, 110, 116, 115]


🧠 **Think**
- Why does the emoji produce multiple numbers?
- Why might this be better than character-level tokenization?
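To make the first question concrete, here is a small sketch that reports how many UTF-8 bytes individual characters occupy. The helper name `utf8_byte_lengths` is hypothetical and not part of the notebook. ASCII letters take one byte, while the emoji takes four, which is why it shows up as four numbers in the list above.

```python
def utf8_byte_lengths(s):
    """Map each character of s to the number of bytes in its UTF-8 encoding."""
    return {ch: len(ch.encode("utf-8")) for ch in s}

# "a", "é", "€" and the grinning emoji, written as escapes so the
# example does not depend on the encoding of this file
sample = "a\u00e9\u20ac\U0001F604"
lengths = utf8_byte_lengths(sample)

for ch, n in lengths.items():
    print(ascii(ch), n)

print("characters:", len(sample), "bytes:", len(sample.encode("utf-8")))
```

A character-level vocabulary would need an entry for every character it might ever see; a byte-level vocabulary needs only 256 entries to cover all of them.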

## 3. Initial Vocabulary

We start with a vocabulary of **256 tokens**, one for each possible byte.

In [None]:
vocab = {i: bytes([i]) for i in range(256)}

# Sanity check
print(vocab[97], vocab[97].decode("utf-8"))  # 'a'

# Look up bytes taken from the earlier string "hello 😄 students"
print(vocab[104], bytes([encoded[0]]))  # It should match
print(vocab[240], bytes([encoded[6]]))  # It should match

b'a' a
b'h' b'h'
b'\xf0' b'\xf0'


## 4. Counting Adjacent Pairs

Byte Pair Encoding works by repeatedly merging the **most frequent adjacent pair**.

In [None]:
def get_pair(ids):
    # Count how many times each adjacent pair (ids[i], ids[i+1]) appears
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1

    return counts

In [None]:
ids = [1, 2, 3, 1, 2]
print(get_pair(ids))

# (1, 1) and (2, 2) do not appear because those values never sit next to each other in ids

{(1, 2): 2, (2, 3): 1, (3, 1): 1}
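As an alternative sketch of the same counting step, the standard library's `collections.Counter` can do the bookkeeping and also pick the most frequent pair. The name `count_pairs` is an assumption for this example, not a function from the notebook.

```python
from collections import Counter

def count_pairs(ids):
    """Count adjacent pairs (ids[i], ids[i+1]) with a Counter."""
    return Counter(zip(ids, ids[1:]))

counts = count_pairs([1, 2, 3, 1, 2])
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)

# Equal neighbours are counted like any other pair,
# and overlapping occurrences each count once
print(count_pairs([1, 1, 1]))
```

The second call shows that a pair of equal values is counted whenever the two values are adjacent.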


## 5. Merging Pairs

When we merge a pair `(a, b)` into a new token `k`,
we replace all occurrences of `(a, b)` with `k`.

<img src = 'BPE_Algorithm.png'>

In [None]:
def merge(ids, pair, index):
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i+1]) == pair:
            new_ids.append(index)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

In [None]:
merge([1, 2, 3, 1, 2], (1, 2), 256)

[256, 3, 256]

What happened to sequence length? Why is this useful?
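The sketch below restates the merge step under a hypothetical name, `merge_pair`, so it runs on its own, and measures the length before and after. It also shows one edge case worth knowing: overlapping occurrences are consumed greedily from the left.

```python
def merge_pair(ids, pair, new_id):
    """Replace each left-to-right, non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

before = [1, 2, 3, 1, 2]
after = merge_pair(before, (1, 2), 256)
print(len(before), "->", len(after))

# Three equal values contain two overlapping pairs, but only one merge happens
print(merge_pair([1, 1, 1], (1, 1), 256))
```

Each merge shortens the sequence by one for every occurrence replaced, so the same text fits into fewer tokens.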

## 6. Training the Tokenizer

We will now repeatedly:
1. Count all adjacent pairs
2. Select the most frequent one
3. Merge it into a new token

In [None]:
class BasicTokenizer:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.vocabulary = {i: bytes([i]) for i in range(256)}
        self.merges = {}

    def train(self, text, verbose=False):
        assert self.vocab_size > 256

        ids = list(text.encode("utf-8"))
        initial_len = len(ids)

        for i in range(self.vocab_size - 256):
            pairs = get_pair(ids)
            pair = max(pairs, key=pairs.get)
            new_id = 256 + i

            ids = merge(ids, pair, new_id)
            self.merges[pair] = new_id
            self.vocabulary[new_id] = (
                self.vocabulary[pair[0]] + self.vocabulary[pair[1]]
            )

        if verbose:
            print("Compression ratio:", initial_len / len(ids))
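To see what training actually learns, here is a self-contained functional sketch of the same loop run on a tiny string. The names `count_pairs`, `merge_pair` and `train_bpe` are assumptions for this example. Unlike the class above, the sketch stops early when no pairs are left, since `max` on an empty dictionary would raise a `ValueError` for very short texts.

```python
def count_pairs(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge_pair(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Return (final ids, merges, vocab) after at most num_merges merges."""
    ids = list(text.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}
    merges = {}
    for step in range(num_merges):
        counts = count_pairs(ids)
        if not counts:  # fewer than two tokens left, nothing to merge
            break
        pair = max(counts, key=counts.get)  # ties go to the first pair seen
        new_id = 256 + step
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
    return ids, merges, vocab

ids, merges, vocab = train_bpe("aaabdaaabac", 3)
for pair, new_id in merges.items():
    print(pair, "->", new_id, vocab[new_id])
print(ids)
```

Later merges build on earlier ones: the third token is made from the second, which is made from the first.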

## 7. Encoding New Text

At inference time, we **must apply merges in the same order they were learned**.

In [None]:
def encode(self, text):
    ids = list(text.encode("utf-8"))

    while len(ids) > 1:
        pairs = get_pair(ids)
        pair = min(pairs, key=lambda p: self.merges.get(p, float("inf")))
        if pair not in self.merges:
            break
        ids = merge(ids, pair, self.merges[pair])

    return ids
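The `min(...)` call is what enforces the learned order: merges learned earlier have smaller ids, so the pair with the smallest id is the earliest merge that still applies. Below is a standalone sketch using a hand-written `merges` table; the table and the function names are assumptions for illustration.

```python
def count_pairs(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge_pair(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def encode_with_merges(text, merges):
    ids = list(text.encode("utf-8"))
    while len(ids) > 1:
        counts = count_pairs(ids)
        # Pairs without a learned merge rank as infinity, so they are never chosen first
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        ids = merge_pair(ids, pair, merges[pair])
    return ids

# Hypothetical merge table: "aa" -> 256, then (256, "a") -> 257, then (257, "b") -> 258
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}

print(encode_with_merges("aaab", merges))
print(encode_with_merges("aab", merges))
print(encode_with_merges("ab", merges))
```

Token 258 can only be produced after 256 and 257 exist, which is why applying merges out of order would give a different, longer encoding.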

## 8. Decoding Tokens Back to Text

Tokenization must be reversible.

In [None]:
def decode(self, ids):
    byte_string = b"".join(self.vocabulary[i] for i in ids)
    return byte_string.decode("utf-8")
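One caveat: reversibility holds for a complete encoding, but an arbitrary slice of token ids may end in the middle of a multi-byte character. The sketch below uses raw byte ids (0 to 255) only, with a hypothetical helper `decode_bytes`. Strict decoding raises an error on a truncated emoji, while `errors="replace"` substitutes the Unicode replacement character.

```python
# The four byte-level token ids of the grinning emoji
emoji_bytes = list("\U0001F604".encode("utf-8"))

def decode_bytes(ids, errors="strict"):
    """Decode a list of raw byte ids (0-255) as UTF-8."""
    return bytes(ids).decode("utf-8", errors=errors)

print(ascii(decode_bytes(emoji_bytes)))

try:
    decode_bytes(emoji_bytes[:2])  # only half of the character
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

lenient = decode_bytes(emoji_bytes[:2], errors="replace")
print(strict_failed, ascii(lenient))
```

Whether to pass `errors="replace"` in `decode` is a design choice: it avoids exceptions when decoding partial sequences, at the cost of hiding invalid output.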

## 9. Let's Practice!

In [None]:
class BasicTokenizer:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.vocabulary = {i: bytes([i]) for i in range(256)}
        self.merges = {}

    def train(self, text, verbose=False):
        assert self.vocab_size > 256

        ids = list(text.encode("utf-8"))
        initial_len = len(ids)

        for i in range(self.vocab_size - 256):
            pairs = get_pair(ids)
            pair = max(pairs, key=pairs.get)
            new_id = 256 + i

            ids = merge(ids, pair, new_id)
            self.merges[pair] = new_id
            self.vocabulary[new_id] = (
                self.vocabulary[pair[0]] + self.vocabulary[pair[1]]
            )
            #print(sorted( [(v, k) for k,v in pairs.items()], reverse = True) [:10])


        if verbose:
            print("Compression ratio:", initial_len / len(ids))

    def encode(self, text):
        ids = list(text.encode("utf-8"))

        while len(ids) > 1:
            pairs = get_pair(ids)
            pair = min(pairs, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break
            ids = merge(ids, pair, self.merges[pair])

        return ids

    def decode(self, ids):
        byte_string = b"".join(self.vocabulary[i] for i in ids)
        return byte_string.decode("utf-8")

In [None]:
tokenizer = BasicTokenizer(266)
print("The text to encode is the following:", text)
tokenizer.train(text, verbose=True)

encoded = tokenizer.encode("hello 😄 students")
decoded = tokenizer.decode(encoded)

print(encoded)
print(decoded)

The text to encode is the following: hello 😄 students
Compression ratio: 2.111111111111111
[265, 115, 116, 117, 100, 101, 110, 116, 115]
hello 😄 students


In [None]:
tokenizer = BasicTokenizer(266)
textTest = "hello everyone"
print("The text to encode is the following:", textTest)
tokenizer.train(textTest, verbose=True)

encoded = tokenizer.encode(textTest)
decoded = tokenizer.decode(encoded)

print(encoded)
print(decoded)

The text to encode is the following: hello everyone
Compression ratio: 3.5
[265, 111, 110, 101]
hello everyone


## 10. Now, your turn!

Try the following:

1. Change `vocab_size` and measure compression
2. Train on:
   - English text
   - Code
   - Emojis
3. Print the first 20 learned merges
4. Compare with character-level tokenization

✍️ Write short answers below each experiment.

In [None]:
tokenizer = BasicTokenizer(280)
textTest = "✍️ Write short answers below each experiment."
print("The text to encode is the following:", textTest)
tokenizer.train(textTest, verbose=True)

encoded = tokenizer.encode(textTest)
decoded = tokenizer.decode(encoded)

print(encoded)
print(decoded)

The text to encode is the following: ✍️ Write short answers below each experiment.
Compression ratio: 2.130434782608696
[279, 114, 115, 32, 98, 101, 108, 111, 119, 257, 97, 99, 104, 257, 120, 112, 101, 256, 109, 101, 110, 116, 46]
✍️ Write short answers below each experiment.


## 11. Why This Matters for Language Models

- Tokens → embeddings: each token id selects one row of the embedding table
- Fewer tokens → longer context: more text fits into a fixed number of positions
- Tokenization is the **first inductive bias** of a Transformer
- Good tokenization matters more when the model is small.

This tokenizer can now be plugged directly into a GPT training loop.
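The "fewer tokens, longer context" point is simple arithmetic, sketched below. The context length of 1024 tokens is a hypothetical figure chosen for illustration, and the function name is an assumption; the compression ratio is the bytes-per-token value that `train(..., verbose=True)` prints.

```python
def approx_bytes_in_context(context_tokens, compression_ratio):
    """Rough number of UTF-8 bytes of text that fit into a context window."""
    return int(context_tokens * compression_ratio)

# ratio 1.0 corresponds to raw bytes with no merges at all
for ratio in (1.0, 2.0, 3.5):
    print(ratio, approx_bytes_in_context(1024, ratio))
```

The ratios measured in this notebook come from training and encoding the same short string, so they overstate what a tokenizer would achieve on unseen text.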

<a href="https://tiktokenizer.vercel.app/" >
    <img src="Tiktokenizer.png" >
</a>

# Regex Tokenizer

### 1. Why RegexTokenizer Exists

Your current `BasicTokenizer` trains BPE on the raw byte stream:

```
text → bytes → BPE merges
```

That works, but it has drawbacks:

- BPE may merge across semantic boundaries

- Punctuation, whitespace, and numbers get mixed

- Training is slower and noisier

- Tokens can become syntactically awkward

RegexTokenizer fixes this by adding structure before BPE.

### 2. Core Idea (One Sentence)

RegexTokenizer first splits text into meaningful chunks using regex, then applies BPE inside each chunk independently.

This is the same two-stage approach (regex split first, BPE second) used by GPT-2-style tokenizers.

### 3. High-Level Pipeline

```
Text
 ↓
Regex split (words, numbers, punctuation, spaces)
 ↓
Each chunk → UTF-8 bytes
 ↓
Byte Pair Encoding (BPE)
 ↓
Final token IDs
```

So instead of training BPE on everything, we train it on pre-segmented text units.
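The first two stages of the pipeline can be sketched with the standard library alone. This is a simplified, ASCII-only stand-in: the stdlib `re` module does not support `\p{L}` or `\p{N}`, so `SIMPLE_SPLIT` below is an assumption made for illustration and not the pattern used in this notebook, which relies on the third-party `regex` package.

```python
import re  # standard library; no \p{L} support, hence the ASCII-only classes

SIMPLE_SPLIT = re.compile(
    r"'(?:[sdmt]|ll|ve|re)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"
)

def split_to_byte_chunks(text):
    """Split text into chunks, then turn each chunk into a list of UTF-8 byte values."""
    chunks = SIMPLE_SPLIT.findall(text)
    return chunks, [list(chunk.encode("utf-8")) for chunk in chunks]

chunks, byte_chunks = split_to_byte_chunks("Hello there! I'm 26.")
print(chunks)
print(byte_chunks[0])

# The split loses nothing: joining the chunks restores the input
print("".join(chunks) == "Hello there! I'm 26.")
```

BPE then runs on each inner list separately, so a merge can never span two chunks.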

Let's decode what the split pattern does.

**What the Regex Captures**

| Pattern    | Meaning |
| -------- | ------- |
| `\p{L}+`  | Letters (words)    |
| `\p{N}{1,2}` | Numbers, at most two digits per chunk     |
| `[^\s\p{L}\p{N}]+`    | Punctuation / symbols    |
| `\s+`    | Whitespace    |
| `'s`, `'t`, etc.    | English contractions   |

In [None]:
import regex as re

# NOTE: this split pattern deviates from GPT-4's in that it uses \p{N}{1,2} instead of \p{N}{1,3},
# to avoid "wasting" too many tokens on numbers at smaller vocab sizes.
# This choice has not been validated. TODO.
pattern = re.compile(
    r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,2}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
    re.UNICODE,
)

In [None]:
text = "Hello 😄 students"
print(text)
print("-"*50)
print("Splitting Regex Pattern Capture:\n", re.findall(pattern, text))

Hello 😄 students
--------------------------------------------------
Splitting Regex Pattern Capture:
 ['Hello', ' 😄', ' students']


#### ! Important: Spaces are kept, not discarded.

In [None]:
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."
print(text)
print("-"*50)
re.findall(pattern, text)

Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception.
--------------------------------------------------


['Ｕｎｉｃｏｄｅ',
 '!',
 ' 🅤🅝🅘🅒🅞🅓🅔‽',
 ' 🇺\u200c🇳\u200c🇮\u200c🇨\u200c🇴\u200c🇩\u200c🇪!',
 ' 😄',
 ' The',
 ' very',
 ' name',
 ' strikes',
 ' fear',
 ' and',
 ' awe',
 ' into',
 ' the',
 ' hearts',
 ' of',
 ' programmers',
 ' worldwide',
 '.',
 ' We',
 ' all',
 ' know',
 ' we',
 ' ought',
 ' to',
 ' “',
 'support',
 ' Unicode',
 '”',
 ' in',
 ' our',
 ' software',
 ' (',
 'whatever',
 ' that',
 ' means',
 '—like',
 ' using',
 ' wchar',
 '_t',
 ' for',
 ' all',
 ' the',
 ' strings',
 ',',
 ' right',
 '?).',
 ' But',
 ' Unicode',
 ' can',
 ' be',
 ' abstruse',
 ',',
 ' and',
 ' diving',
 ' into',
 ' the',
 ' thousand',
 '-page',
 ' Unicode',
 ' Standard',
 ' plus',
 ' its',
 ' dozens',
 ' of',
 ' supplementary',
 ' annexes',
 ',',
 ' reports',
 ',',
 ' and',
 ' notes',
 ' can',
 ' be',
 ' more',
 ' than',
 ' a',
 ' little',
 ' intimidating',
 '.',
 ' I',
 ' don',
 '’t',
 ' blame',
 ' programmers',
 ' for',
 ' still',
 ' finding

### 3. Why This Matters Before BPE

**Without regex (your `BasicTokenizer`)**

BPE might merge:

```
"o " + "t" → "o t"
```

This merge crosses a word boundary, so the resulting token carries little meaning.

**With RegexTokenizer**

BPE operates inside units like:

- "hello"

- " there"

- "!"

This ensures:

- Cleaner merges

- Faster convergence

- More interpretable tokens
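The effect on pair statistics can be checked directly. In this sketch, `count_pairs` is a hypothetical helper and the `chunks` list is assumed to be the result of a regex split. The pair made of the byte for "o" followed by a space exists in the raw byte stream, but it never appears once pairs are counted per chunk.

```python
def count_pairs(ids, counts=None):
    """Count adjacent pairs, optionally accumulating into an existing dict."""
    counts = {} if counts is None else counts
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

text = "go to"
chunks = ["go", " to"]  # assumed output of the regex split

# Pairs over the whole byte stream
raw_counts = count_pairs(list(text.encode("utf-8")))

# Pairs counted chunk by chunk, accumulated in one dictionary
chunk_counts = {}
for chunk in chunks:
    count_pairs(list(chunk.encode("utf-8")), chunk_counts)

boundary_pair = (ord("o"), ord(" "))
print(boundary_pair in raw_counts, boundary_pair in chunk_counts)
```

Because the boundary pair is never counted, it can never be selected as a merge.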


In [None]:
sample = "Hello there! I'm 26. 😄\nNew line."
chunks = re.findall(pattern, sample)
chunks

['Hello',
 ' there',
 '!',
 ' I',
 "'m",
 ' ',
 '26',
 '.',
 ' 😄\n',
 'New',
 ' line',
 '.']

In [None]:
vocabulary = {i: bytes([i]) for i in range(256)}
print(vocabulary[97], vocabulary[97].decode("utf-8"))  # 'a'
print(list("😄".encode("utf-8")))

b'a' a
[240, 159, 152, 132]


### BPE primitives: count adjacent pairs + merge

BPE repeatedly merges the most frequent adjacent pair into a new token id.


In [None]:
def get_pair(ids, counts):
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, index):
    '''
    Iterate over ids and replace every occurrence of pair
    with index, then return the new list.

    Example:
        ids = [1, 2, 3, 4, 1, 2]
        merge(ids, (1, 2), 257)  # -> [257, 3, 4, 257]
    '''

    new_ids = []
    i = 0
    while i < len(ids):
        if i <len(ids) - 1 and  (ids[i], ids[i+1]) == pair:
            new_ids.append(index)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

### 4. RegexTokenizer overview

Training:
1. Split text into regex chunks
2. Convert each chunk to UTF-8 bytes (list of ints)
3. Count pairs across **all chunks**
4. Merge the most frequent pair across **all chunks**
5. Repeat until reaching vocab_size

Encoding:
1. Split input text into regex chunks
2. Encode each chunk with learned merges
3. Concatenate token ids

Decoding:
- Map ids → bytes → UTF-8 string
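Training steps 3 and 4 can be sketched as a single function that counts pairs over all chunks and then applies the winning merge to every chunk. The names `count_pairs`, `merge_pair` and `train_step` are assumptions for this example, and the three chunks are made up.

```python
def count_pairs(ids, counts):
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge_pair(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_step(chunks, new_id):
    """One BPE step: shared pair counts, then the same merge in every chunk."""
    counts = {}
    for ids in chunks:
        count_pairs(ids, counts)
    if not counts:  # every chunk has a single token left
        return chunks, None
    pair = max(counts, key=counts.get)  # ties go to the first pair seen
    return [merge_pair(ids, pair, new_id) for ids in chunks], pair

chunks = [list(s.encode("utf-8")) for s in [" the", " then", " a"]]
chunks, pair = train_step(chunks, 256)
print(pair)
print(chunks)
```

The counts are shared across chunks, so a pair that is frequent overall wins even if it occurs only once per chunk.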

In [None]:
import pickle

In [None]:

class regexTokenizer:

    def __init__(self, vocab_size, pattern):

        self.vocab_size = vocab_size
        self.vocabulary = {i : bytes([i]) for i in range(256)}
        self.merges = {}
        self.pattern = re.compile(pattern)

    def train(self, text, verbose = False):
        # 1. Split the text into regex chunks and encode each chunk as UTF-8 bytes
        # 2. Repeat self.vocab_size - 256 times:
        #    - count all adjacent pairs across every chunk
        #    - choose the pair with the highest frequency
        #    - merge that pair into a new token id in every chunk
        #    - record it: self.vocabulary[new_id] = bytes of the merged token
        #                 self.merges[pair] = new_id

        assert self.vocab_size > 256
        number_merges = self.vocab_size - 256

        text_chunks = re.findall(self.pattern, text)
        encoded_chunks = [list(text_chunk.encode('utf-8')) for text_chunk in text_chunks]

        length_initial = sum([len(encoded_chunk) for encoded_chunk in encoded_chunks])

        for i in range(number_merges):
            pairs = {}
            for encoded_chunk in encoded_chunks:
                pairs = get_pair(encoded_chunk, pairs)

            pair = max(pairs, key = pairs.get)
            index = 256 + i
            encoded_chunks = [merge(encoded_chunk,pair,index) for encoded_chunk in encoded_chunks]
            self.merges[pair] = index
            self.vocabulary[index] = self.vocabulary[pair[0]] + self.vocabulary[pair[1]]
            #print(sorted( [(v, k) for k,v in pairs.items()], reverse = True) [:10])

        if verbose:
            length_final = sum([len(encoded_chunk) for encoded_chunk in encoded_chunks])
            compression = length_initial/length_final
            print(length_initial, length_final)
            print(compression)

    def encode(self, text):

        text_chunks = re.findall(self.pattern, text)
        encoded_text = []

        for text_chunk in text_chunks:
            encoded_chunk = self.encode_chunk(text_chunk)
            encoded_text.extend(encoded_chunk)
        return encoded_text

    def encode_chunk(self, text):
        '''
        self.merges is important here

        we get text, and then we convert that text to byte strings, then to integers
        and then we iterate over the text until all pairs of
        merges that are possible under the trained tokenizer
        have been completed

        '''

        ids = list(text.encode('utf-8'))

        for pair, index in self.merges.items():
            ids = merge(ids, pair, index)

        return ids

    def decode(self, ids):
        '''
        Decode a list of token ids back to text:
        1. look up the byte string of each id in the vocabulary
        2. concatenate the byte strings
        3. decode the result as UTF-8 and return it
        '''

        byte_strings = b''.join([bytes(self.vocabulary[i]) for i in ids])
        decoded_text =  byte_strings.decode('utf-8')
        return decoded_text

    def save(self, path):
        with open(path, "wb") as file:
            pickle.dump(
                {
                    "merges": self.merges,
                    "vocabulary": self.vocabulary,
                    "pattern": self.pattern
                },
                file
            )

    @classmethod
    def load(cls, path):
        with open(path, "rb") as file:
            data = pickle.load(file)

        # Rebuild from the saved state instead of relying on a global `pattern`
        tokenizer = cls(len(data["vocabulary"]), data["pattern"])
        tokenizer.merges = data["merges"]
        tokenizer.vocabulary = data["vocabulary"]
        return tokenizer


In [None]:
tokenizer = regexTokenizer(300, pattern)

In [None]:
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."
tokenizer.train(text, True)

616 383
1.608355091383812


# Nanochat Tokenizer

We move from educational tokenizers to the **final tokenizer**
used in NanoChat.

You will learn:
- How GPT-style tokenizers handle *chat*
- What special tokens are and why they matter
- How conversations are rendered into `(input_ids, loss_mask)`
- How this enables supervised fine-tuning (SFT)

In [None]:
# First we need the tokenizer files (not the full model weights), e.g. from https://huggingface.co/karpathy/nanochat-d34/tree/main
# Put tokenizer.pkl in the ~/.cache/nanochat/tokenizer directory

In [None]:
!git clone https://github.com/karpathy/nanochat.git


Cloning into 'nanochat'...
remote: Enumerating objects: 989, done.
remote: Counting objects: 100% (154/154), done.
remote: Compressing objects: 100% (110/110), done.
remote: Total 989 (delta 94), reused 47 (delta 44), pack-reused 835 (from 3)
Receiving objects: 100% (989/989), 1.25 MiB | 14.56 MiB/s, done.
Resolving deltas: 100% (600/600), done.
/content/nanochat


In [1]:
%cd nanochat

/content/nanochat


In [2]:
# Before installing, edit pyproject.toml: remove the ipykernel dependency and add:
#[tool.setuptools.packages.find]
#where = ["."]
#include = ["nanochat*"]
#exclude = ["class*", "dev*"]
!pip install -e .

Obtaining file:///content/nanochat
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: nanochat
  Building editable for nanochat (pyproject.toml) ... done
  Created wheel for nanochat: filename=nanochat-0.1.0-0.editable-py3-none-any.whl size=10029 sha256=07fea20e7851d151abdb06f09e0a7046ec22993d4ac6a3982867cbed9c91ac8a
  Stored in directory: /tmp/pip-ephem-wheel-cache-2vvx32o4/wheels/59/44/68/4f0e259f1e3efb353b7dc9ec0502623edda1ea438a24e9f48f
Successfully built nanochat
Installing collected packages: nanochat
  Attempting uninstall: nanochat
    Found existing installation: nanochat 0.1.0
    Uninstalling nanochat-0.1.0:
      Successfully uninstalled nanochat-0.1.0
Successfully installed nanochat-0.1.0


In [4]:
from nanochat.tokenizer import get_tokenizer

In [6]:
%cd /root

/root


In [12]:
%mkdir /root/.cache/nanochat/tokenizer/

mkdir: cannot create directory ‘/root/.cache/nanochat/tokenizer/’: File exists


In [14]:
%cd /root/.cache/nanochat/tokenizer

/root/.cache/nanochat/tokenizer


In [16]:
from huggingface_hub import hf_hub_download
# Download the specific file and get its local file path
tokenizer_pkl_path = hf_hub_download(repo_id="karpathy/nanochat-d34", filename="tokenizer.pkl")
tokenizer_pt_path = hf_hub_download(repo_id="karpathy/nanochat-d34", filename="token_bytes.pt")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer.pkl:   0%|          | 0.00/846k [00:00<?, ?B/s]

token_bytes.pt:   0%|          | 0.00/264k [00:00<?, ?B/s]

In [17]:
tokenizer_pkl_path

'/root/.cache/huggingface/hub/models--karpathy--nanochat-d34/snapshots/c48357d43863a3a6cdc5f5db5b4ec5964e4192d6/tokenizer.pkl'

In [18]:
%cp '/root/.cache/huggingface/hub/models--karpathy--nanochat-d34/snapshots/c48357d43863a3a6cdc5f5db5b4ec5964e4192d6/tokenizer.pkl' '/root/.cache/nanochat/tokenizer/tokenizer.pkl'

In [20]:
%cd /content/nanochat/
%ls

/content/nanochat
dev/           nanochat/           README.md        scripts/     tests/
LICENSE        nanochat.egg-info/  run1000.sh       speedrun.sh  uv.lock
miniseries.sh  pyproject.toml      scaling_laws.sh  tasks/


In [19]:
tokenizer = get_tokenizer()
print(type(tokenizer))

<class 'nanochat.tokenizer.RustBPETokenizer'>


### 1. Special Tokens

Chat models need *control tokens* to:
- separate user vs assistant
- delimit messages
- support tools (python blocks, outputs)

In [21]:
special_tokens = tokenizer.get_special_tokens()
special_tokens

{'<|assistant_end|>',
 '<|assistant_start|>',
 '<|bos|>',
 '<|output_end|>',
 '<|output_start|>',
 '<|python_end|>',
 '<|python_start|>',
 '<|user_end|>',
 '<|user_start|>'}

In [22]:
# Let's encode these special tokens
for tok in special_tokens:
    print(f"{tok:25s} → id {tokenizer.encode_special(tok)}")

<|output_start|>          → id 65534
<|assistant_end|>         → id 65531
<|python_start|>          → id 65532
<|python_end|>            → id 65533
<|user_end|>              → id 65529
<|bos|>                   → id 65527
<|assistant_start|>       → id 65530
<|user_start|>            → id 65528
<|output_end|>            → id 65535
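The printed ids suggest that the nine special tokens occupy the last nine slots of the vocabulary. The sketch below reproduces that layout under two assumptions that are inferred from the output above rather than taken from nanochat's code: the total vocabulary size is 65536, and the tokens are appended in the order listed. The function name `assign_special_ids` is hypothetical.

```python
# Order inferred from the ids printed above (assumption, not nanochat source)
SPECIAL_TOKENS = [
    "<|bos|>",
    "<|user_start|>",
    "<|user_end|>",
    "<|assistant_start|>",
    "<|assistant_end|>",
    "<|python_start|>",
    "<|python_end|>",
    "<|output_start|>",
    "<|output_end|>",
]

def assign_special_ids(total_vocab_size, special_tokens):
    """Give special tokens the last ids of the vocabulary, in list order."""
    first_id = total_vocab_size - len(special_tokens)
    return {tok: first_id + i for i, tok in enumerate(special_tokens)}

special_ids = assign_special_ids(65536, SPECIAL_TOKENS)
for tok, tok_id in special_ids.items():
    print(f"{tok:25s} {tok_id}")
```

Keeping special tokens outside the range of learned merges means ordinary text can never be encoded into a control token by accident.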


In [23]:
# Let's decode them again to check that everything round-trips
encoded_special_tokens = [ tokenizer.encode_special(tok) for tok in list(special_tokens) ]
for token_id in encoded_special_tokens:
    print(f"id {token_id} → {tokenizer.decode([token_id])}")

id 65534 → <|output_start|>
id 65531 → <|assistant_end|>
id 65532 → <|python_start|>
id 65533 → <|python_end|>
id 65529 → <|user_end|>
id 65527 → <|bos|>
id 65530 → <|assistant_start|>
id 65528 → <|user_start|>
id 65535 → <|output_end|>


In [24]:
# Let's try the text we already worked with
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception."
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

ids, decoded

([12167,
  181,
  239,
  189,
  142,
  239,
  189,
  137,
  239,
  189,
  131,
  239,
  189,
  143,
  239,
  189,
  132,
  239,
  189,
  133,
  33,
  20524,
  133,
  164,
  14899,
  133,
  157,
  14899,
  133,
  152,
  14899,
  133,
  146,
  14899,
  133,
  158,
  14899,
  133,
  147,
  14899,
  133,
  148,
  308,
  189,
  20524,
  135,
  186,
  308,
  140,
  50936,
  179,
  308,
  140,
  50936,
  174,
  308,
  140,
  50936,
  168,
  308,
  140,
  50936,
  180,
  308,
  140,
  50936,
  169,
  308,
  140,
  50936,
  170,
  33,
  46824,
  132,
  361,
  907,
  1588,
  13591,
  3615,
  288,
  24500,
  636,
  261,
  12164,
  281,
  20942,
  5425,
  46,
  1006,
  500,
  675,
  384,
  11814,
  287,
  549,
  46955,
  38226,
  507,
  283,
  659,
  3076,
  372,
  56571,
  332,
  1452,
  36656,
  1034,
  270,
  4210,
  31930,
  327,
  500,
  261,
  12736,
  44,
  1037,
  42544,
  1208,
  38226,
  400,
  311,
  445,
  9129,
  312,
  44,
  288,
  17719,
  636,
  261,
  6557,
  18645,
  38226,
  930

In [25]:
conversation = {
    "messages": [
        {"role": "user", "content": "What is a transformer?"},
        {"role": "assistant", "content": "A transformer is a neural network based on attention."}
    ]
}

In [26]:
ids, loss_mask = tokenizer.render_conversation(conversation)

print("Number of tokens:", len(ids))
print("Loss tokens:", sum(loss_mask))

Number of tokens: 20
Loss tokens: 11


In [27]:
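def printUserAssistantType(mask):
    # mask == 0 means "not trained on": user text plus control tokens such as
    # <|bos|> and <|assistant_start|>. "User" is shorthand for all of these.
    if mask == 0:
        return "User"
    else:
        return "Assistant"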

In [28]:
decoded_tokens = [tokenizer.decode([i]) for i in ids]

for t, m in zip(decoded_tokens, loss_mask):
    print(f"{repr(t):20s}  mask={m} {printUserAssistantType(m)}")

'<|bos|>'             mask=0 User
'<|user_start|>'      mask=0 User
'What'                mask=0 User
' is'                 mask=0 User
' a'                  mask=0 User
' transformer'        mask=0 User
'?'                   mask=0 User
'<|user_end|>'        mask=0 User
'<|assistant_start|>'  mask=0 User
'A'                   mask=1 Assistant
' transformer'        mask=1 Assistant
' is'                 mask=1 Assistant
' a'                  mask=1 Assistant
' neural'             mask=1 Assistant
' network'            mask=1 Assistant
' based'              mask=1 Assistant
' on'                 mask=1 Assistant
' attention'          mask=1 Assistant
'.'                   mask=1 Assistant
'<|assistant_end|>'   mask=1 Assistant
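The pattern in this output can be reproduced with a toy renderer. This is a sketch of the masking rule as it appears above, not nanochat's implementation: `render_toy` is a hypothetical function, and it "tokenizes" by splitting on whitespace, so its token counts differ from the real tokenizer's.

```python
def render_toy(conversation):
    """Return (tokens, loss_mask) using string tokens and whitespace splitting."""
    tokens, mask = ["<|bos|>"], [0]
    for message in conversation["messages"]:
        role = message["role"]
        words = message["content"].split()  # toy stand-in for BPE
        supervised = 1 if role == "assistant" else 0

        tokens.append(f"<|{role}_start|>")
        mask.append(0)  # start markers are never supervised
        tokens.extend(words)
        mask.extend([supervised] * len(words))
        tokens.append(f"<|{role}_end|>")
        mask.append(supervised)  # the model must learn when to stop
    return tokens, mask

conversation = {
    "messages": [
        {"role": "user", "content": "What is a transformer?"},
        {"role": "assistant", "content": "A transformer is a neural network based on attention."},
    ]
}

tokens, mask = render_toy(conversation)
for t, m in zip(tokens, mask):
    print(f"{t:22s} mask={m}")
```

As in the real output, `<|assistant_start|>` is given as context with mask 0, while `<|assistant_end|>` is supervised with mask 1.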


### 2. Why Masking?

Without masking:
- the model would also be trained to predict user prompts
- learning can become less stable
- chat behavior can degrade

Masking restricts the loss to assistant tokens, so training optimizes:
P(assistant | user)
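In terms of the loss, masking means averaging per-token losses over the supervised positions only. Here is a minimal pure-Python sketch with made-up loss values. The function name is an assumption, and the one-position shift between inputs and targets that a real training loop applies is left out.

```python
def masked_mean_loss(token_losses, loss_mask):
    """Average the per-token losses where loss_mask is 1; ignore the rest."""
    total = sum(loss * m for loss, m in zip(token_losses, loss_mask))
    count = sum(loss_mask)
    return total / count if count else 0.0

# Hypothetical per-token losses: two user tokens, then two assistant tokens
token_losses = [2.0, 1.0, 0.5, 0.5]
loss_mask = [0, 0, 1, 1]

print(masked_mean_loss(token_losses, loss_mask))
print(sum(token_losses) / len(token_losses))  # unmasked average, for comparison
```

The user tokens still influence the result, because the assistant tokens are predicted with the user tokens as context; they just do not contribute loss terms of their own.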

### 3. Exercise (15 minutes)
1. Create a 3-turn conversation
2. Render it
3. Count how many tokens are supervised
4. Inspect where assistant supervision starts

In [None]:
## TO DO

## Next Step

We now have:
- Token IDs
- Loss masks

Next:
➡ Let's understand GPT