
# LLM Tokenization & Dataloader Experiments

This notebook contains four hands-on experiments that mirror the ideas from *LLMs From Scratch* (ch. 2):

1. **Regex drills:** tweak the punctuation class; run on custom text and measure token count and vocabulary size.  
2. **Break the toy tokenizer:** show OOV behavior and add a minimal `[UNK]` path to `SimpleTokenizerV1`.  
3. **Swap tokenizers:** rebuild the dataloader using **tiktoken (GPT‑2)** when available and compare to the regex tokenizer.  
4. **Stride experiment:** fix `max_length`, vary stride ∈ {1, 8, 64}, plot how many samples we get and inspect overlap.

Each section includes:
- An intuitive description
- Executable code
- A short **Conclusion** and **Vocabulary learned**


In [None]:

# Core imports
import re
from collections import Counter
import math
import random
import numpy as np
import matplotlib.pyplot as plt

# Try to import tiktoken; fall back gracefully if unavailable
try:
    import tiktoken
    TIKTOKEN_AVAILABLE = True
except Exception as e:
    TIKTOKEN_AVAILABLE = False
    TIKTOKEN_IMPORT_ERROR = e

print("tiktoken available:", TIKTOKEN_AVAILABLE)
if not TIKTOKEN_AVAILABLE:
    print("Note: tiktoken not available in this environment. We'll provide a lightweight fallback tokenizer so the notebook still runs.")



## 1) Regex drills — tweak the punctuation class
**Goal:** See how tokenizer design choices change token count and vocabulary size.

**What you do:** Adjust a punctuation class in a regex-based tokenizer and re-run on your own paragraph.


In [None]:

# ---- Editable input text (write your own paragraph here) ----
text = (
    "I’m building a tiny tokenizer, testing dashes—both en– and em—plus quotes, “smart quotes”, and ellipses... "
    "Also: URLs like https://example.com, emails (me@example.com), and numbers 3.14159!"
)

# Baseline punctuation class (feel free to edit)
# The idea: tokens are either single punctuation marks, "words" (\w+: letters/digits/underscore),
# or, via the \S catch-all, any other single non-space character.
punctuation_class = r"[.,;:!?\-—–()\[\]{}\"'…]"  # tweak me!
token_pattern = rf"{punctuation_class}|\w+|\S"

tokens = re.findall(token_pattern, text)
vocab = sorted(set(tokens))
print(f"Token count: {len(tokens)}")
print(f"Vocab size: {len(vocab)}")
print("First 30 tokens:", tokens[:30])
print("Vocabulary sample:", vocab[:30])



**Conclusion:**  
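Changing the punctuation class alters how characters like dashes (–, —), quotes (“ ”), and ellipses (…) are split, which changes both **token count** (sequence length) and **vocabulary size** (unique tokens).  
How large the effect is depends on the rest of the pattern: with a catch-all such as `\S` at the end, a character left out of the class still comes out as a single token, so the bigger shifts come from multi-character alternatives (for example, treating `...` as one token) or from letting words absorb apostrophes and hyphens.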

**Vocabulary learned:** *tokenization, regex, vocabulary, token length vs. granularity trade‑off*.
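
To make the trade-off concrete, here is a minimal sketch that tokenizes one short sentence with two patterns. Both patterns and the names `baseline_pattern` and `variant_pattern` are illustrative assumptions, not the patterns used in the cell above: the baseline splits every punctuation mark into its own token, while the variant keeps `...` together and lets words absorb apostrophes.

```python
import re

text = "Wait... don't stop!"

# Baseline: single punctuation marks, then words, then a catch-all.
baseline_pattern = r"[.,;:!?'-]|\w+|\S"
# Variant (hypothetical tweak): "..." is one token and apostrophes stay inside words.
variant_pattern = r"\.{3}|[\w']+|[.,;:!?-]|\S"

baseline_tokens = re.findall(baseline_pattern, text)
variant_tokens = re.findall(variant_pattern, text)

for name, toks in [("baseline", baseline_tokens), ("variant", variant_tokens)]:
    print(f"{name}: count={len(toks)}, vocab={len(set(toks))}, tokens={toks}")
```

The variant yields fewer, longer tokens. On a real corpus that usually means shorter sequences but a larger vocabulary, because `don't` and `dont` become separate entries.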



## 2) Break the toy tokenizer → add a minimal `[UNK]`
**Goal:** See why word-level tokenizers struggle with out-of-vocabulary (OOV) tokens and how `[UNK]` patches the error (but loses information).

**What you do:** Build a tiny word-level tokenizer from a corpus, then try to encode an unseen word. Add `[UNK]` handling and compare.


In [None]:

# Build a tiny vocab from a small corpus
corpus = "the quick brown fox jumps over the lazy dog. the dog sleeps."
word_tokens = re.findall(r"\w+|[^\w\s]", corpus.lower())
vocab = sorted(set(word_tokens))
stoi = {tok: i for i, tok in enumerate(vocab)}
itos = {i: tok for tok, i in stoi.items()}

def encode_strict(text):
    toks = re.findall(r"\w+|[^\w\s]", text.lower())
    return [stoi[t] for t in toks]  # KeyError if unseen

def encode_with_unk(text, unk_token="[UNK]"):
    # ensure UNK in vocab
    if unk_token not in stoi:
        idx = len(stoi)
        stoi[unk_token] = idx
        itos[idx] = unk_token
    toks = re.findall(r"\w+|[^\w\s]", text.lower())
    ids = [stoi[t] if t in stoi else stoi[unk_token] for t in toks]
    return ids

# Try an unseen word
test_text = "the fox plays jazz"  # 'plays' and 'jazz' are unseen
print("Vocab:", vocab)
print("\nAttempting strict encode (will error if OOV):")
try:
    print(encode_strict(test_text))
except Exception as e:
    print("Strict encode error:", repr(e))

print("\nEncoding with [UNK]:")
ids = encode_with_unk(test_text)
print(ids)
print("Decoded (approx):", [itos[i] for i in ids])



**Conclusion:**  
A strict word-level tokenizer throws errors on **OOV** tokens. Adding `[UNK]` prevents crashes but conflates many unknowns into the **same** token, losing detail. This motivates **subword** tokenizers (like BPE/byte-level) that decompose unfamiliar words into known pieces without `[UNK]`.

**Vocabulary learned:** *OOV (out‑of‑vocabulary), `[UNK]`, word‑level vs. subword tokenization*.
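
The information loss is easy to see in a small sketch. It rebuilds a toy vocabulary (with `[UNK]` reserved up front, an assumption made here for brevity) and encodes two different sentences; the helper names `encode` and `decode` are illustrative. The last lines use plain UTF-8 bytes as a stand-in for byte-level schemes. This is not BPE, only a demonstration that a byte alphabet covers any string without `[UNK]`.

```python
import re

corpus = "the quick brown fox jumps over the lazy dog."
vocab = sorted(set(re.findall(r"\w+|[^\w\s]", corpus.lower()))) + ["[UNK]"]
stoi = {tok: i for i, tok in enumerate(vocab)}
itos = {i: tok for tok, i in stoi.items()}

def encode(text):
    # Unknown tokens all map to the single [UNK] id.
    toks = re.findall(r"\w+|[^\w\s]", text.lower())
    return [stoi.get(t, stoi["[UNK]"]) for t in toks]

def decode(ids):
    return " ".join(itos[i] for i in ids)

a = encode("the fox plays jazz")
b = encode("the fox eats cake")
print("same ids:", a == b)      # two different sentences become indistinguishable
print("decoded:", decode(a))

# Byte-level ids (0-255) can represent any string and decode losslessly.
byte_ids = list("jazz".encode("utf-8"))
print(byte_ids, "->", bytes(byte_ids).decode("utf-8"))
```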



## 3) Swap tokenizers — regex vs. `tiktoken` (GPT‑2) for a dataloader
**Goal:** Compare how many tokens and training samples the same text yields with a basic regex tokenizer vs. GPT‑2’s tokenizer.

**What you do:**  
- Implement a simple sliding-window dataset.  
- Tokenize the same text with the two tokenizers.  
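- Build sliding-window samples and compare the total token count and the number of samples (every window has exactly `max_length` tokens, so the average sequence length is the same for both).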


In [None]:

# Sample text (feel free to paste a paragraph or short story excerpt)
sample_text = (
    "Alice was beginning to get very tired of sitting by her sister on the bank, "
    "and of having nothing to do: once or twice she had peeped into the book her sister was reading, "
    "but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice "
    "'without pictures or conversations?'"
)

def regex_tokenize(text):
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return tokens

def tokenize_tiktoken(text):
    if not TIKTOKEN_AVAILABLE:
        # Fallback: one token per character, so the rest of the notebook still runs
        return list(text)
    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode(text)
    # Return "tokens" as string forms of ids for consistent interface
    return [str(i) for i in ids]

def sliding_windows(tokens, max_length=32, stride=16):
    # Build (x, y) pairs where y is x shifted right by one token (next-token targets).
    # Token strings are first mapped to local integer ids (a fresh vocabulary per call),
    # and the function returns a list of (x, y) tuples of id lists.
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    ids = [vocab[t] for t in tokens]
    samples = []
    for start in range(0, max(0, len(ids)-max_length), stride):
        x = ids[start:start+max_length]
        y = ids[start+1:start+max_length+1]
        if len(y) == len(x):
            samples.append((x, y))
    return samples

# Run both tokenizers
toks_regex = regex_tokenize(sample_text)
toks_tk = tokenize_tiktoken(sample_text)

# Build datasets
max_length = 32
stride = 16
ds_regex = sliding_windows(toks_regex, max_length=max_length, stride=stride)
ds_tk = sliding_windows(toks_tk, max_length=max_length, stride=stride)

# Compare
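def describe(ds, name, n_tokens):
    # Every window holds exactly max_length ids, so tokens and samples are the informative numbers.
    lens = [len(x) for (x, y) in ds]
    print(f"{name}: tokens={n_tokens}, samples={len(ds)}, avg_seq_len={np.mean(lens) if lens else 0:.2f}")
describe(ds_regex, "Regex", len(toks_regex))
describe(ds_tk, "tiktoken (or fallback)", len(toks_tk))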

# Simple bar plot of sample counts
counts = [len(ds_regex), len(ds_tk)]
labels = ["Regex", "tiktoken/fallback"]
plt.figure(figsize=(5,3))
plt.bar(labels, counts)
plt.title("Number of samples by tokenizer")
plt.ylabel("samples")
plt.show()



**Conclusion:**  
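For a fixed `max_length` and `stride`, the number of windows grows with the total token count, so finer-grained tokenization means more samples. GPT‑2’s BPE tokenizer produces far fewer tokens than a character-level split (which is what the fallback uses). Compared with the word-level regex, it typically lands in the same range or slightly higher, because rare words are split into several subword pieces. Every window has exactly `max_length` tokens, so the average sequence length is identical across tokenizers. Real outcomes depend on your text and `max_length`/`stride`.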

**Vocabulary learned:** *sequence length, dataloader, context window, tokenizer choice impacts batching*.
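
The link between token count and sample count can be written down directly. The sketch below assumes the same windowing rule as the loop `range(0, n_tokens - max_length, stride)`; the helper `num_windows` is a hypothetical name, and the word-level vs. character-level comparison is used so the example runs without `tiktoken`.

```python
import math
import re

def num_windows(n_tokens, max_length, stride):
    """Number of start positions in range(0, n_tokens - max_length, stride)."""
    usable = n_tokens - max_length
    return math.ceil(usable / stride) if usable > 0 else 0

text = "Alice was beginning to get very tired of sitting by her sister on the bank"
word_tokens = re.findall(r"\w+|[^\w\s]", text)
char_tokens = list(text)

for name, toks in [("word-level", word_tokens), ("char-level", char_tokens)]:
    n = num_windows(len(toks), max_length=8, stride=4)
    print(f"{name}: tokens={len(toks)}, samples={n}")
```

The same text produces many more windows at character level simply because it contains more tokens; each window then covers a much shorter stretch of text.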



## 4) Stride experiment — fix `max_length`, vary stride ∈ {1, 8, 64}
**Goal:** See how stride changes the number of samples and how much consecutive windows overlap.

**What you do:** For each stride, count samples and visualize the first two windows.


In [None]:

# Reuse regex tokenizer for a longer synthetic text
long_text = " ".join(["This is a tiny sliding window experiment with overlapping sequences."] * 20)
tokens = re.findall(r"\w+|[^\w\s]", long_text)

def count_samples(tokens, max_length, stride):
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    ids = [vocab[t] for t in tokens]
    n = 0
    starts = []
    for start in range(0, max(0, len(ids)-max_length), stride):
        if start+max_length < len(ids):
            n += 1
            starts.append(start)
    return n, ids, starts

max_length = 24
strides = [1, 8, 64]
results = {}
for s in strides:
    n, ids, starts = count_samples(tokens, max_length, s)
    results[s] = (n, ids, starts)
    print(f"stride={s}: samples={n}")

# Plot sample counts
plt.figure(figsize=(6,3))
plt.bar([str(s) for s in strides], [results[s][0] for s in strides])
plt.title("Samples vs. stride (fixed max_length)")
plt.xlabel("stride")
plt.ylabel("samples")
plt.show()

# Visualize first 2 windows for each stride
def show_windows(ids, starts, max_length, title):
    plt.figure(figsize=(10, 1.8))
    # draw tokens as ticks
    for i in range(len(ids)):
        plt.plot([i, i], [0.1, 0.2])
    if len(starts) >= 2:
        a, b = starts[0], starts[1]
    elif len(starts) == 1:
        a, b = starts[0], starts[0]
    else:
        a, b = 0, 0
    # Draw two rectangles for window 1 & 2
    plt.gca().add_patch(plt.Rectangle((a, 0.3), max_length, 0.4, fill=True, alpha=0.3))
    plt.gca().add_patch(plt.Rectangle((b, 0.3), max_length, 0.4, fill=True, alpha=0.3))
    plt.title(title)
    plt.yticks([])
    plt.xlabel("token index")
    plt.show()

for s in strides:
    n, ids, starts = results[s]
    show_windows(ids, starts, max_length, f"First two windows (stride={s})")



**Conclusion:**  
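- **Small stride** → heavily overlapping windows → more training samples, but consecutive windows are highly correlated (they share `max_length - stride` tokens).  
- **Large stride** → fewer windows with less overlap → fewer samples but more diversity between them. Once `stride >= max_length` there is no overlap at all, and with `stride > max_length` (here 64 vs. 24) the tokens between windows never appear as inputs.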

**Vocabulary learned:** *stride, overlap, correlation between batches, data augmentation via overlap*.
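
The overlap can also be computed without plotting. This is a small sketch of the arithmetic for two consecutive input windows; `window_overlap` is a hypothetical helper name, and the values match the `max_length = 24` setting used in this section.

```python
def window_overlap(max_length, stride):
    """Return (shared, skipped) token counts for two consecutive input windows."""
    shared = max(0, max_length - stride)    # tokens present in both windows
    skipped = max(0, stride - max_length)   # tokens between windows that neither covers
    return shared, skipped

for stride in (1, 8, 64):
    shared, skipped = window_overlap(max_length=24, stride=stride)
    print(f"stride={stride}: shared={shared}, skipped={skipped}")
```

Setting `stride == max_length` is the boundary case: windows tile the input tokens with no overlap and no gaps.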
