<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width=35% align=right>

# Building a Large Language Model from Scratch — A Step-by-Step Guide Using Python and PyTorch
## Appendix — Tokenization Methods: Concepts, Trade-Offs, and Hands-On Examples
**© Dr. Yves J. Hilpisch**<br>AI-Powered by GPT-5.

## How to Use This Notebook

- Experiment with multiple tokenization strategies using a shared mini-corpus.
- Quantify how vocabulary size influences sequence length and downstream costs.
- Export tokenizers in formats that can plug into the training notebooks later.

### Roadmap

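You will start with character- and byte-level baselines, move on to word-level whitespace splitting, then train subword tokenizers (BPE with `tokenizers` and with SentencePiece), and finish by comparing the token counts each approach produces for the same sentence.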

### Study Tips

Keep a scratchpad of tokenization artifacts (vocabularies, merges, stats). The qualitative differences are easier to grasp when you can inspect them side by side.

In [None]:
# Install optional packages if missing (Colab-friendly)
import os
import subprocess
import sys

try:
    import tokenizers  # type: ignore
    import sentencepiece  # type: ignore
except Exception:
    if os.environ.get('COLAB_RELEASE_TAG'):
        subprocess.run(
            [
                sys.executable,
                '-m',
                'pip',
                'install',
                '-q',
                'tokenizers',
                'sentencepiece',
            ],
            check=True,
        )
        import tokenizers  # type: ignore  # noqa: F401
        import sentencepiece  # type: ignore  # noqa: F401
    else:
        print(
            'tokenizers/sentencepiece missing (skipped install outside Colab). '
            'Some cells will be illustrative only.'
        )
import matplotlib.pyplot as plt
try:
    from IPython import get_ipython  # type: ignore
    ip = get_ipython()
    if ip is not None:
        ip.run_line_magic('config', "InlineBackend.figure_format = 'svg'")
except Exception:
    pass
plt.style.use('seaborn-v0_8')


## Character and Byte Level
Simple, robust baselines: every Unicode code point (or byte) is one token, so nothing is ever out of vocabulary, at the price of long sequences.

In [None]:
text = 'The model dreams in tokens.'
chars = [ord(c) for c in text]  # one integer ID per Unicode code point
recovered = ''.join(chr(i) for i in chars)  # decoding is the exact inverse
chars[:10], recovered[:10]
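
The cell above works on Unicode code points. The byte-level variant can be sketched as follows, assuming UTF-8 as the encoding (an assumption; the notebook does not fix one). The string `sample` is made up for illustration and is written with escapes so that the two accented letters are unambiguous.

```python
# Byte-level tokenization sketch (assumes UTF-8; `sample` is an illustrative string)
sample = 'na\u00efve caf\u00e9'  # "naive cafe" with two accented letters

# Code-point view: one ID per character; IDs can be any Unicode code point
codepoint_ids = [ord(c) for c in sample]

# Byte view: one ID per UTF-8 byte, so every ID fits a fixed vocabulary of 256
byte_ids = list(sample.encode('utf-8'))

# Decoding the bytes restores the original text exactly
roundtrip = bytes(byte_ids).decode('utf-8')

print(len(codepoint_ids), len(byte_ids), roundtrip == sample)
```

The byte sequence is longer than the code-point sequence because each accented letter takes two bytes in UTF-8. In exchange, the vocabulary is capped at 256 entries regardless of language.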

## Word / Whitespace Split
Compact sequences, but the approach is language-dependent (it assumes whitespace-delimited words) and prone to out-of-vocabulary (OOV) words.

In [None]:
import re

def words(s):
    # Lowercase and keep only runs of word characters; punctuation is dropped
    return re.findall(r'\b\w+\b', s.lower())

vocab = {}

def encode_words(s):
    ids = []
    for w in words(s):
        # Assign the next free ID the first time a word is seen
        if w not in vocab:
            vocab[w] = len(vocab)
        ids.append(vocab[w])
    return ids

text = 'The model dreams in tokens. The model learns.'
encode_words(text), vocab
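
The encoder above grows `vocab` whenever it meets a new word, so it never actually hits an out-of-vocabulary case. Here is a minimal sketch of what happens once the vocabulary is frozen, assuming a reserved `<unk>` entry with ID 0. The names `split_words`, `build_vocab`, and `encode_frozen` are illustrative and not part of the notebook.

```python
import re

def split_words(s):
    # Same rule as the word tokenizer in this section: lowercase word-character runs
    return re.findall(r'\b\w+\b', s.lower())

def build_vocab(corpus):
    # Reserve ID 0 for unknown words, then number words in order of first appearance
    vocab = {'<unk>': 0}
    for w in split_words(corpus):
        vocab.setdefault(w, len(vocab))
    return vocab

def encode_frozen(s, vocab):
    # Unseen words all collapse to the single <unk> ID, so their identity is lost
    return [vocab.get(w, vocab['<unk>']) for w in split_words(s)]

frozen_vocab = build_vocab('The model dreams in tokens. The model learns.')
print(encode_frozen('The model models tokens', frozen_vocab))  # [1, 2, 0, 5]
```

The word `models` maps to ID 0 even though it differs from the known word `model` by a single letter. Subword tokenizers are designed to close this gap.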


## Subword (BPE) with tokenizers
Train a tiny BPE tokenizer on a miniature corpus.

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
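texts = ['The model dreams in tokens.', 'The model learns.']
tok = Tokenizer(BPE(unk_token='<unk>'))
tok.pre_tokenizer = Whitespace()  # split into words and punctuation before merging
# vocab_size is an upper bound; training stops early once no pairs are left to merge
trainer = BpeTrainer(vocab_size=200, special_tokens=['<unk>', '<pad>'])
tok.train_from_iterator(texts, trainer)
# 'models' never occurs in the corpus, so it has to be assembled from smaller pieces
enc = tok.encode('The model models tokens')
enc.tokens, enc.ids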
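
To make the training step less of a black box, here is a minimal pure-Python sketch of the core BPE loop: count adjacent symbol pairs, merge the most frequent one, and repeat. It is a simplified illustration, not the `tokenizers` implementation. It skips pre-tokenization details and special tokens, and the function names, the tie-breaking rule, and the choice of 10 merges are all assumptions.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace each left-to-right occurrence of `pair` with the concatenated symbol
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus_words, num_merges):
    # Represent each distinct word as a tuple of symbols, weighted by its frequency
    word_freqs = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically smallest (assumption)
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            new_freqs[merge_pair(symbols, best)] += freq
        word_freqs = new_freqs
    return merges

def apply_bpe(word, merges):
    # Replay the learned merges in training order on a new word
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

demo_words = 'the model dreams in tokens the model learns'.split()
demo_merges = learn_bpe(demo_words, num_merges=10)
print(demo_merges[:3])
print(apply_bpe('models', demo_merges))  # ['model', 's']
```

Because `model` appears twice in the toy corpus, its characters are merged into a single symbol within the first few rounds. The unseen word `models` then splits into a known stem plus a leftover `s` instead of becoming an unknown token.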


## SentencePiece (BPE)
Train a tiny SentencePiece model and tokenize a sentence.

In [None]:
# Train a tiny SentencePiece model (skips when package is unavailable)
try:
    import sentencepiece as spm
except Exception as exc:
    print('sentencepiece not installed; skipping SentencePiece demo.', exc)
else:
    with open('spm_corpus.txt', 'w', encoding='utf-8') as f:
        f.write('The model dreams in tokens.\nThe model learns.\n')

    spm.SentencePieceTrainer.Train(
        input='spm_corpus.txt',
        model_prefix='spm_demo',
        model_type='bpe',
        vocab_size=40,
        pad_id=0,
        unk_id=1,
        bos_id=-1,
        eos_id=-1,
    )
    sp = spm.SentencePieceProcessor(model_file='spm_demo.model')
    ids = sp.encode('The model models tokens', out_type=int)
    print(sp.id_to_piece(ids))
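
The printed pieces use the `▁` character (U+2581) where the original text had a space, since SentencePiece treats whitespace as an ordinary symbol. The sketch below shows how that marker lets pieces be joined back into text. The `pieces` list is written by hand for illustration and is not actual model output, and `pieces_to_text` is a hypothetical helper, not a SentencePiece function.

```python
# Hypothetical pieces, written by hand for illustration (not actual model output)
pieces = ['\u2581The', '\u2581model', '\u2581model', 's', '\u2581tokens']

def pieces_to_text(pieces):
    # Concatenate the pieces, turn each U+2581 marker back into a space,
    # and drop the leading space that comes from the marker on the first piece
    return ''.join(pieces).replace('\u2581', ' ').lstrip(' ')

print(pieces_to_text(pieces))  # The model models tokens
```

In practice, `sp.decode(ids)` performs this step; the helper only makes the convention visible.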



## Quick Visualization: Token Lengths
Plot how many tokens the same sentence yields under character, word, and BPE tokenization.

In [None]:
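sent = 'The model models tokens'
char_n = len(sent)  # one token per character
word_n = len(sent.split())  # one token per whitespace-separated word
bpe_n = len(tok.encode(sent).ids)  # uses the BPE tokenizer trained earlier
plt.bar(['char', 'word', 'bpe'], [char_n, word_n, bpe_n],
        color=['#DCE6F8', '#CFE2FF', '#B5D0F5'])
plt.ylabel('tokens')
plt.title('Tokens per sentence by tokenizer')
plt.show()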
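
The bar chart shows sequence length only. To see the other side of the trade-off, the rough sketch below reports vocabulary size next to total token count for the two dependency-free schemes. The helper name `token_stats` and the reuse of the two-sentence mini-corpus are illustrative assumptions.

```python
corpus = ['The model dreams in tokens.', 'The model learns.']

def token_stats(tokenize, lines):
    # Vocabulary size = distinct tokens; total_tokens = summed sequence length
    tokens = [t for line in lines for t in tokenize(line)]
    return {'vocab_size': len(set(tokens)), 'total_tokens': len(tokens)}

schemes = {
    'char': list,       # one token per character
    'word': str.split,  # one token per whitespace-separated chunk
}
for name, tokenize in schemes.items():
    print(name, token_stats(tokenize, corpus))
# char {'vocab_size': 16, 'total_tokens': 44}
# word {'vocab_size': 6, 'total_tokens': 8}
```

On this toy corpus the word vocabulary is still small, but it grows with every new word in a larger corpus, while the character vocabulary stays nearly constant and the sequences stay long.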


## Exercises

- Swap in a domain-specific corpus and measure how token length distributions shift.
- Implement a simple normalization preprocessor and quantify its effect on vocabulary size.
- Compare BPE and unigram tokenizers on the same dataset; document pros, cons, and when you would choose each.
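
A possible starting point for the normalization exercise is sketched below. The normalization recipe (NFKC, lowercasing, whitespace collapsing), the helper names, and the three sample strings are all assumptions chosen for illustration; the third string starts with full-width letters, written as escapes.

```python
import re
import unicodedata

def normalize(s):
    # NFKC folds compatibility forms (e.g. full-width letters), then lowercase
    # and collapse whitespace runs; this recipe is one choice among many
    s = unicodedata.normalize('NFKC', s).lower()
    return re.sub(r'\s+', ' ', s).strip()

def count_vocab(lines):
    # Count distinct whitespace-separated tokens
    return len({w for line in lines for w in line.split()})

raw = ['The Model dreams', 'the  model DREAMS', '\uff34\uff48\uff45 model learns']
print(count_vocab(raw), count_vocab([normalize(s) for s in raw]))  # 8 4
```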
