# Session 1.5C: Tokenization - How LLMs Read Text

**Requires:** `pip install tiktoken tokenizers` (a few MB, no model weights)  
**No API key. No GPU.** Internet is needed only for the install and the first load of each tokenizer vocabulary, which is then cached locally.  
**Focus:** Understand exactly how text becomes numbers before any embedding happens.

---

## What You Will Experience

```
Part 1 → Build a tokenizer from scratch (pure Python, zero libraries)
           See the exact problem it solves
Part 2 → Character tokenization: the naive approach and why it fails
Part 3 → Word tokenization: better, but still breaks on banking jargon
Part 4 → BPE (Byte Pair Encoding): train it yourself on a banking corpus
           Watch it merge frequent character pairs into subwords
Part 5 → tiktoken: the actual tokenizer library behind GPT-3.5 and GPT-4
           Inspect real token IDs for banking text
Part 6 → Token counting: why it matters for cost, context, and chunking
Part 7 → Tokenization surprises: what breaks and why
Part 8 → Banking-specific analysis: how AML/KYC/SWIFT are tokenized
```

## Why Tokenization Matters Before Embeddings

```
Your text:   "AML compliance requires KYC verification"
                      ↓  tokenizer
Token IDs:   [25300, 22460, 7612, 74, 42, 1242, 12, ...]
                      ↓  embedding lookup
Vectors:     [[0.12, -0.34, ...], [0.88, 0.01, ...], ...]
                      ↓  transformer
Output:      Model understands your text

Tokenization is STEP ZERO. Every LLM does it first.
If you don't understand tokens, you don't understand:
  → Why a short acronym like 'KYC' costs fewer tokens than 'anti-money-laundering'
  → Why your 10-page document uses 8,000 tokens
  → Why context windows are measured in tokens, not words
  → Why some languages are more expensive to process than others
```
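
The pipeline above can be sketched in a few lines of plain Python. The vocabulary, the IDs, and the 3-dimensional vectors below are made up for illustration (real models learn tables with tens of thousands of rows); the only point is that the embedding step is a table lookup keyed by token ID.

```python
# Toy version of: text -> token IDs -> embedding vectors.
# TOY_VOCAB and TOY_EMBEDDINGS are hypothetical, hand-written values.
TOY_VOCAB = {"aml": 0, "compliance": 1, "requires": 2, "kyc": 3, "verification": 4}

# One small vector per token ID (real models learn these during training).
TOY_EMBEDDINGS = [
    [0.12, -0.34, 0.50],
    [0.88, 0.01, -0.20],
    [0.05, 0.40, 0.10],
    [-0.30, 0.22, 0.70],
    [0.61, -0.15, 0.09],
]

def toy_encode(text: str) -> list:
    """Whitespace tokenizer: map each lowercase word to its integer ID."""
    return [TOY_VOCAB[word] for word in text.lower().split()]

def toy_embed(token_ids: list) -> list:
    """Embedding lookup: the token ID is simply a row index into the table."""
    return [TOY_EMBEDDINGS[i] for i in token_ids]

toy_ids = toy_encode("AML compliance requires KYC verification")
toy_vectors = toy_embed(toy_ids)
print(toy_ids)          # [0, 1, 2, 3, 4]
print(toy_vectors[0])   # [0.12, -0.34, 0.5]
```

A word missing from `TOY_VOCAB` raises a `KeyError` here, which is exactly the out-of-vocabulary problem the later parts deal with.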

---
## Setup

In [None]:
# tiktoken   - BPE tokenizer library from OpenAI (used by GPT-3.5/4), small install
# tokenizers - Hugging Face tokenizer library (used later to load BERT's WordPiece tokenizer)
!pip install -q tiktoken tokenizers

In [None]:
import re
import math
import collections
import tiktoken

# Load the cl100k_base encoding (GPT-3.5-turbo, GPT-4, text-embedding-ada-002).
# tiktoken downloads the vocabulary file on first use and caches it locally.
enc = tiktoken.get_encoding("cl100k_base")

print("✅ Ready: no API key needed; the vocabulary is cached locally after the first load")
print()
print(f"Encoding loaded:   cl100k_base")
print(f"Vocabulary size:   {enc.n_vocab:,} tokens")
print(f"Used by:           GPT-3.5-turbo, GPT-4, text-embedding-ada-002")
print(f"Other families:    Claude, Gemini, etc. use their own subword tokenizers and vocabularies")
print()
print("Quick test:")
test = "AML compliance requires KYC verification."
ids  = enc.encode(test)
print(f"  Text:    '{test}'")
print(f"  Tokens:  {ids}")
print(f"  Count:   {len(ids)} tokens for {len(test)} characters")

---
## Part 1: Build a Tokenizer From Scratch - The Problem It Solves

Before using tiktoken, let's understand WHY tokenization exists at all.  
A neural network needs **numbers** as input, so text must be converted first.

In [None]:
# The fundamental problem: LLMs need integer IDs, not strings
print("=== The Fundamental Problem ===")
print()
print("A transformer model works on vectors of numbers.")
print("Text is a sequence of characters. How do we bridge that gap?")
print()
print("Option 1: Assign each unique word an ID")
print("  'the' → 1, 'bank' → 2, 'compliance' → 3 ...")
print("  Problem: English has 170,000+ words → massive vocabulary")
print("  Problem: 'compliance' and 'compliant' are different IDs, so nothing is shared")
print("  Problem: new words like 'DeFi', 'stablecoin' → unknown token")
print()
print("Option 2: Use characters only")
print("  'A','M','L' → 65, 77, 76")
print("  Problem: sequences become very long (1 word = 5-12 chars)")
print("  Problem: model must learn to spell before it learns meaning")
print()
print("Option 3: Subwords (the winner)")
print("  'compliance' → ['comp', 'liance']  → [1842, 7712]")
print("  'compliant'  → ['comp', 'liant']   → [1842, 7634]")
print("  Both share the 'comp' token, so meaning is shared!")
print("  New words built from known subwords → no unknown tokens")
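
To make Option 3 concrete, here is a minimal sketch that splits words by greedy longest-match against a tiny subword vocabulary. This is not BPE (that comes in Part 4), and `TOY_SUBWORDS` with its IDs is a hand-written assumption; it only shows how two related words can share a piece and how an unseen word falls back to single characters.

```python
# Greedy longest-match segmentation over a tiny, hand-written subword vocabulary.
# TOY_SUBWORDS and its IDs are hypothetical; real vocabularies are learned from data.
TOY_SUBWORDS = {"comp": 0, "liance": 1, "liant": 2, "stable": 3, "coin": 4}
# Single-character fallback so any lowercase word can still be encoded.
for offset, ch in enumerate("abcdefghijklmnopqrstuvwxyz"):
    TOY_SUBWORDS[ch] = 100 + offset

def segment(word: str) -> list:
    """Repeatedly take the longest vocabulary entry that matches at the current position."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in TOY_SUBWORDS:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Only reached for characters outside a-z in this sketch
            raise ValueError(f"cannot segment {word!r}")
    return pieces

for w in ["compliance", "compliant", "stablecoin", "defi"]:
    parts = segment(w)
    print(w, parts, [TOY_SUBWORDS[p] for p in parts])
```

'compliance' and 'compliant' both begin with the piece 'comp' (ID 0), while 'defi' is spelled out letter by letter instead of becoming an unknown token.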

---
## Part 2: Character Tokenization - The Naive Approach

In [None]:
# Build a character-level tokenizer from scratch: pure Python, zero libraries

class CharTokenizer:
    """Simplest possible tokenizer: each character gets a unique integer ID."""

    def __init__(self, corpus: str):
        # Build vocabulary from all unique characters in corpus
        chars = sorted(set(corpus))
        self.char_to_id = {c: i for i, c in enumerate(chars)}
        self.id_to_char = {i: c for c, i in self.char_to_id.items()}
        self.vocab_size  = len(chars)

    def encode(self, text: str) -> list:
        return [self.char_to_id.get(c, -1) for c in text]

    def decode(self, ids: list) -> str:
        return "".join(self.id_to_char.get(i, "?") for i in ids)

    def show_vocab(self, n=40):
        items = list(self.char_to_id.items())[:n]
        for ch, idx in items:
            # repr() makes whitespace characters such as ' ' and '\n' visible
            print(f"  {ch!r:<6} → {idx}")


# Build corpus
CORPUS = """
aml compliance team monitors suspicious transactions daily
kyc procedures require customer identification and verification
bsa requires banks to file suspicious activity reports
fraud detection models flag anomalous transaction patterns
capital adequacy ratio measures bank financial strength
mortgage loan approval depends on credit score and income
wire transfer sends funds between banks via swift network
"""

char_tok = CharTokenizer(CORPUS)

print("=== Character Tokenizer ===")
print(f"Corpus chars: {len(CORPUS)}")
print(f"Vocab size:   {char_tok.vocab_size} unique characters")
print()
print("Vocabulary:")
char_tok.show_vocab()
print()

# Encode a sentence
sentence = "aml compliance"
ids      = char_tok.encode(sentence)
decoded  = char_tok.decode(ids)

print(f"Text:    '{sentence}'")
print(f"IDs:     {ids}")
print(f"Decoded: '{decoded}'")
print()
print(f"📌 {len(sentence)} characters → {len(ids)} tokens (1:1 ratio)")
print(f"   Longer sequences, smaller vocab, but the model must learn word meaning")
print(f"   from scratch using individual letter patterns.")

In [None]:
# Show why character tokenization makes sequences very long
banking_texts = [
    "aml",
    "anti-money-laundering",
    "kyc compliance verification",
    "suspicious transaction monitoring report",
]

print("=== Sequence Length with Character Tokenization ===")
print(f"{'Text':<42} {'Chars':>6}  {'Tokens':>7}  {'Ratio'}")
print("-" * 65)
for text in banking_texts:
    ids = char_tok.encode(text)
    # -1 marks characters missing from the corpus vocabulary (e.g. '-')
    unknown = sum(1 for i in ids if i < 0)
    words = max(len(text.split()), 1)
    note = f"  ({unknown} unknown)" if unknown else ""
    print(f"{text:<42} {len(text):>6}  {len(ids):>7}  {len(ids) / words:.1f} tokens/word{note}")

print()
print("📌 A 1,000-word document → ~6,000 character tokens.")
print("   Self-attention cost grows quadratically: 6,000² = 36M token pairs")
print("   vs word tokens: 1,000² = 1M token pairs.")
print("   This is why character tokenization is impractical for long documents.")
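
The arithmetic behind that comparison can be checked directly. The sketch below counts only the token pairs scored in one full self-attention pass (n × n) and assumes an average of 6 characters per word including the space; real cost also depends on layers, heads, and hidden size.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in one full self-attention pass: n * n."""
    return num_tokens * num_tokens

words_in_doc = 1_000
chars_per_word = 6          # assumed average, including the trailing space

word_level = attention_pairs(words_in_doc)
char_level = attention_pairs(words_in_doc * chars_per_word)

print(f"word tokens: {word_level:,} pairs")          # 1,000,000
print(f"char tokens: {char_level:,} pairs")          # 36,000,000
print(f"ratio:       {char_level // word_level}x")   # 36x
```

A 6x longer sequence costs 36x more pairs, which is why sequence length matters far more than vocabulary size here.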

---
## Part 3: Word Tokenization - Better, But Still Breaks

In [None]:
class WordTokenizer:
    """Split on whitespace/punctuation, assign integer IDs."""

    UNK = "<UNK>"   # unknown word token
    PAD = "<PAD>"   # padding token

    def __init__(self, corpus: str):
        words = re.findall(r"[a-z0-9]+", corpus.lower())
        freq  = collections.Counter(words)
        # Sort by frequency (most common first)
        vocab = [self.PAD, self.UNK] + [w for w, _ in freq.most_common()]
        self.word_to_id = {w: i for i, w in enumerate(vocab)}
        self.id_to_word = {i: w for w, i in self.word_to_id.items()}
        self.vocab_size  = len(vocab)

    def encode(self, text: str) -> list:
        words = re.findall(r"[a-z0-9]+", text.lower())
        return [self.word_to_id.get(w, self.word_to_id[self.UNK]) for w in words]

    def decode(self, ids: list) -> str:
        return " ".join(self.id_to_word.get(i, self.UNK) for i in ids)


word_tok = WordTokenizer(CORPUS)

print("=== Word Tokenizer ===")
print(f"Vocab size: {word_tok.vocab_size} unique words")
print()

# Show top 20 vocab entries
print("Most common words (first 20 vocab entries):")
for w, i in list(word_tok.word_to_id.items())[2:22]:
    print(f"  {i:>4}  '{w}'")
print()

# Encode known vs unknown words
sentences = [
    "aml compliance",
    "suspicious transaction",
    "fatf grey list",          # 'fatf', 'grey', 'list' NOT in training corpus
    "stablecoin cryptocurrency",  # also not in corpus
]

print("Encoding results:")
for s in sentences:
    ids = word_tok.encode(s)
    decoded = word_tok.decode(ids)
    has_unk = 1 in ids  # ID 1 = <UNK>
    flag = "⚠ UNKNOWN WORDS" if has_unk else "✓"
    print(f"  Input:   '{s}'")
    print(f"  IDs:     {ids}")
    print(f"  Decoded: '{decoded}'  {flag}")
    print()

In [None]:
# The word tokenizer OOV (out-of-vocabulary) problem in banking
print("=== The OOV Problem in Banking ===")
print()
print("Words that appear in banking regulations but may not be in a generic vocab:")
print()

oov_words = [
    "FATF", "MiCA", "DORA", "pgvector", "stablecoin",
    "cryptocurrency", "DeFi", "VASP", "CCAR", "DFAST",
    "Sarbanes-Oxley", "RegTech", "FinCEN", "FinTech",
]

for word in oov_words:
    ids = word_tok.encode(word)
    has_unk = 1 in ids
    status = "⚠ OOV → <UNK>" if has_unk else "✓ in vocab"
    print(f"  {word:<20} {status}")

print()
print("📌 With word tokenization: any term not seen during vocab building = <UNK>.")
print("   All <UNK> tokens look identical to the model, so it cannot tell 'FATF' from 'DeFi'.")
print("   This is the core problem BPE solves: no out-of-vocabulary words.")
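
One way to put a number on the OOV problem is to measure what fraction of a text's words are missing from a vocabulary. The helper below is a small sketch; `oov_rate` and `toy_vocab` are illustrative names, and the word-splitting regex mirrors the one used by `WordTokenizer`.

```python
import re

def oov_rate(text: str, vocab: set) -> float:
    """Fraction of words in `text` that are missing from `vocab`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    missing = sum(1 for w in words if w not in vocab)
    return missing / len(words)

# Hypothetical mini vocabulary, for illustration only.
toy_vocab = {"aml", "compliance", "suspicious", "transaction", "bank"}

print(oov_rate("AML compliance", toy_vocab))               # 0.0
print(oov_rate("FATF grey list compliance", toy_vocab))    # 0.75
```

With a word-level tokenizer, every one of those missing words would collapse into the same `<UNK>` ID.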

---
## Part 4: BPE (Byte Pair Encoding) - Train It Yourself

BPE starts with characters and iteratively merges the most frequent adjacent pair.  
The result: common words become single tokens, rare words split into known subwords.  
**No OOV problem**: any word can fall back to its individual characters (or bytes, in byte-level BPE) if needed.

In [None]:
# Implement BPE from scratch using only the Python standard library
# Same core idea as the byte-level BPE behind tiktoken, heavily simplified

def tokenize_corpus(corpus: str) -> list:
    """Split corpus into words, represent each as list of characters + end-of-word marker."""
    words = re.findall(r"[a-z]+", corpus.lower())
    return [tuple(list(w) + ["</w>"]) for w in words]

def get_pair_counts(tokenized: list) -> dict:
    """Count frequency of every adjacent pair across all words."""
    pairs = collections.defaultdict(int)
    for word in tokenized:
        for i in range(len(word) - 1):
            pairs[(word[i], word[i + 1])] += 1
    return pairs

def merge_pair(tokenized: list, pair: tuple) -> list:
    """Merge all occurrences of `pair` into a single token."""
    merged_token = pair[0] + pair[1]
    new_tokenized = []
    for word in tokenized:
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == pair[0] and word[i + 1] == pair[1]:
                new_word.append(merged_token)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_tokenized.append(tuple(new_word))
    return new_tokenized

def train_bpe(corpus: str, num_merges: int = 30, verbose: bool = True):
    """Train BPE on corpus, return tokenized words and merge history."""
    tokenized = tokenize_corpus(corpus)
    merges    = []

    if verbose:
        print(f"Initial tokenization (first 5 words):")
        # Show unique words only
        unique = list(dict.fromkeys(tokenized))[:5]
        for w in unique:
            print(f"  {' '.join(w)}")
        print()

    for step in range(num_merges):
        pairs = get_pair_counts(tokenized)
        if not pairs:
            break
        best_pair  = max(pairs, key=pairs.get)
        best_count = pairs[best_pair]
        tokenized  = merge_pair(tokenized, best_pair)
        merges.append(best_pair)

        if verbose:
            merged = best_pair[0] + best_pair[1]
            print(f"  Step {step+1:>2}: merge {best_pair[0]!r:>8} + {best_pair[1]!r:<10} → {merged!r:<14} (freq={best_count})")

    return tokenized, merges

print("=== BPE Training on Banking Corpus ===")
print()
tokenized, merges = train_bpe(CORPUS, num_merges=25)

In [None]:
# Show the final tokenization of key banking words
print("=== Final BPE Tokenization of Banking Terms ===")
print("(After 25 merge operations)")
print()

# Rebuild unique word → token mapping from our trained tokenizer
word_to_tokens = {}
raw_words = re.findall(r"[a-z]+", CORPUS.lower())
for word, tok in zip(raw_words, tokenized):
    if word not in word_to_tokens:
        word_to_tokens[word] = list(tok)

print(f"{'Word':<22} {'Tokens':<40} Count")
print("-" * 70)
for word, tokens in sorted(word_to_tokens.items(), key=lambda x: len(x[1])):
    tokens_display = " | ".join(t.replace("</w>", "⏎") for t in tokens)
    print(f"{word:<22} {tokens_display:<40} {len(tokens)} token(s)")

print()
print("📌 Short or frequent words tend to collapse into 1-2 tokens (fully merged)")
print("   Longer domain words like 'compliance', 'suspicious' stay split into subwords")
print("   Subwords are shared across related words:")
print("   'transaction' and 'transactions' typically reuse the same leading tokens")

In [None]:
# Trace the BPE merges for one word: how 'compliance' was assembled
def trace_word_bpe(word: str, num_steps: int = 20):
    """Show step-by-step how BPE merges characters into subwords for one word."""
    tokens = tuple(list(word) + ["</w>"])
    print(f"Tracing BPE merges for: '{word}'")
    print(f"  Start: {' | '.join(tokens)}")
    print()

    for step, (a, b) in enumerate(merges[:num_steps]):
        merged = a + b
        new_tokens = []
        i = 0
        changed = False
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                new_tokens.append(merged)
                i += 2
                changed = True
            else:
                new_tokens.append(tokens[i])
                i += 1
        if changed:
            tokens = tuple(new_tokens)
            print(f"  Step {step+1:>2} (merge {a!r}+{b!r}): {' | '.join(tokens)}")

    print(f"\n  Final: {len(tokens)} token(s): {list(tokens)}")

trace_word_bpe("compliance")
print()
trace_word_bpe("suspicious")
print()
trace_word_bpe("transaction")
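
The same replay logic also works on words that never appeared in the training corpus, which is what removes the OOV problem. The sketch below is a non-printing variant of the trace above. `TOY_MERGES` is a hand-written, hypothetical merge list (not the one learned from `CORPUS`) so the block runs on its own.

```python
def apply_merges(word: str, merge_rules: list) -> list:
    """Encode one word by replaying merge rules in the order they were learned."""
    tokens = list(word) + ["</w>"]
    for left, right in merge_rules:
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merge list, written by hand for this sketch.
TOY_MERGES = [("c", "o"), ("co", "m"), ("com", "p"), ("l", "i"), ("a", "n")]

print(apply_merges("compliance", TOY_MERGES))  # ['comp', 'li', 'an', 'c', 'e', '</w>']
print(apply_merges("compliant", TOY_MERGES))   # ['comp', 'li', 'an', 't', '</w>']
print(apply_merges("fatf", TOY_MERGES))        # ['f', 'a', 't', 'f', '</w>']
```

'fatf' matches no merge rule, so it stays as individual characters: longer than ideal, but never `<UNK>`. Passing the learned `merges` list instead of `TOY_MERGES` would encode new words with the tokenizer trained above.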

---
## Part 5: tiktoken - The Real Thing

`tiktoken` applies the same BPE idea at the byte level, with roughly 100,000 vocabulary entries  
learned from a massive text corpus. Encoding runs locally; the vocabulary file is downloaded once on first use and then cached.
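
One difference from the character-level BPE in Part 4 is that tiktoken works on UTF-8 bytes rather than characters. The short sketch below uses only built-in string methods to show why that guarantees coverage: every string reduces to values in the range 0-255, so a vocabulary that contains all 256 single bytes can always encode it. The sample strings are arbitrary.

```python
# Every string is a sequence of UTF-8 bytes, and each byte is one of 256 values.
# A byte-level BPE vocabulary starts from those 256 values, so nothing is ever "unknown".
samples = ["AML", "café", "€500"]

for s in samples:
    raw = list(s.encode("utf-8"))
    print(f"{s!r:<8} {len(s)} chars -> {len(raw)} bytes: {raw}")

# Decoding the bytes restores the original text exactly.
round_trip = bytes(list("€500".encode("utf-8"))).decode("utf-8")
print(round_trip == "€500")   # True
```

Non-ASCII characters take more than one byte each, which is one reason some scripts and symbols tend to cost more tokens than plain English text.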

In [None]:
# Explore the tiktoken vocabulary
print("=== tiktoken: cl100k_base Encoding ===")
print(f"Vocabulary size: {enc.n_vocab:,} tokens")
print()

# Encode and decode banking terms to see real token IDs
banking_terms = [
    "AML",
    "KYC",
    "BSA",
    "SWIFT",
    "compliance",
    "anti-money-laundering",
    "suspicious activity report",
    "know your customer",
    "capital adequacy ratio",
    "Basel III",
    "cryptocurrency",
    "stablecoin",
    "FATF",
    "MiCA",
    "mortgage",
    "overdraft",
]

print(f"{'Term':<28} {'Token IDs':<35} {'Count':>6}  Tokens (decoded)")
print("-" * 100)
for term in banking_terms:
    ids     = enc.encode(term)
    decoded = [enc.decode([i]) for i in ids]
    ids_str = str(ids)
    dec_str = " | ".join(repr(t) for t in decoded)
    print(f"{term:<28} {ids_str:<35} {len(ids):>6}  {dec_str}")

In [None]:
# Visual tokenization: each token shown as a labeled block with its ID underneath
def visualize_tokens(text: str, enc=enc):
    """Show each token as a labeled block."""
    ids     = enc.encode(text)
    decoded = [enc.decode([i]) for i in ids]

    print(f"Text: '{text}'")
    print(f"Tokens: {len(ids)}")
    print()

    # Print each token with its ID beneath it
    token_width = 12
    header_row  = ""
    id_row      = ""
    sep_row     = ""

    for token, tid in zip(decoded, ids):
        # Escape whitespace for display
        display = token.replace(" ", "·").replace("\n", "↵")
        display = display[:token_width - 2]  # truncate if too long
        header_row += f"[{display:<{token_width - 2}}]"
        id_row     += f" {str(tid):<{token_width - 1}}"
        sep_row    += "-" * token_width

    print("  Tokens: " + header_row)
    print("  IDs:    " + id_row)
    print()

banking_sentences = [
    "AML compliance requires KYC verification.",
    "The bank approved the mortgage application.",
    "Suspicious transaction flagged by anti-money-laundering system.",
    "Basel III capital adequacy ratio must exceed 8%.",
    "SWIFT wire transfer to correspondent bank.",
]

print("=== Token-by-Token Visualization ===")
print()
for s in banking_sentences:
    visualize_tokens(s)
    print("-" * 60)
    print()

---
## Part 5B: Three Tokenizers, Same Text - Tokenizers Are NOT Interchangeable

Every model family ships its own tokenizer.  
**The same sentence produces different token IDs and even different token boundaries** across models.  
This matters because: token count affects cost, context fit, and chunking strategy.

| Tokenizer | Algorithm | Vocab size | Used by |
|-----------|-----------|-----------|---------|
| `cl100k_base` | BPE | 100,277 | GPT-3.5, GPT-4, text-embedding-ada-002 |
| `p50k_base` | BPE | 50,281 | GPT-3 family (text-davinci-002/003), Codex |
| `bert-base-uncased` | WordPiece | 30,522 | BERT, many HuggingFace models |

In [None]:
# Load three tokenizers: no API key, no model weights
# tiktoken downloads each encoding file on first use and caches it locally
# The BERT tokenizer file (tokenizer.json) is fetched from the Hugging Face Hub on first use

import tiktoken
from tokenizers import Tokenizer as HFTokenizer

# --- Tokenizer 1: cl100k_base (GPT-3.5 / GPT-4) --- same encoding as `enc`
enc_cl100k = tiktoken.get_encoding("cl100k_base")

# --- Tokenizer 2: p50k_base (GPT-3 / Codex) ---
enc_p50k = tiktoken.get_encoding("p50k_base")

# --- Tokenizer 3: BERT WordPiece (bert-base-uncased) ---
# Loaded through the Hugging Face tokenizers library; only the tokenizer file is downloaded
bert_tok = HFTokenizer.from_pretrained("bert-base-uncased")

print("=== Three Tokenizers Loaded ===")
print()
print(f"  cl100k_base (GPT-4)      vocab: {enc_cl100k.n_vocab:>8,}  algorithm: BPE")
print(f"  p50k_base   (GPT-3)      vocab: {enc_p50k.n_vocab:>8,}  algorithm: BPE")
print(f"  bert-base-uncased        vocab: {bert_tok.get_vocab_size():>8,}  algorithm: WordPiece")
print()
print("📌 Larger vocab → tends to give longer subwords → fewer tokens per sentence")
print("   Smaller vocab → tends to give shorter subwords → more tokens per sentence")
print("   WordPiece uses '##' prefix for continuation pieces (different from BPE)")

In [None]:
# Helper: encode with all three tokenizers and show results side by side

def encode_all(text: str):
    """Return (ids, decoded_pieces) for each tokenizer."""
    # cl100k
    ids_cl   = enc_cl100k.encode(text)
    dec_cl   = [enc_cl100k.decode([i]) for i in ids_cl]
    # p50k
    ids_p50  = enc_p50k.encode(text)
    dec_p50  = [enc_p50k.decode([i]) for i in ids_p50]
    # BERT WordPiece (uncased, so text is lowercased automatically)
    out_bert = bert_tok.encode(text)
    # strip [CLS] and [SEP] which BERT adds automatically
    ids_bert = out_bert.ids[1:-1]
    dec_bert = out_bert.tokens[1:-1]
    return {
        "cl100k (GPT-4)":  (ids_cl,  dec_cl),
        "p50k   (GPT-3)":  (ids_p50, dec_p50),
        "BERT WordPiece":   (ids_bert, dec_bert),
    }

def compare_tokenizers(text: str):
    """Print a side-by-side token comparison for all three tokenizers."""
    results = encode_all(text)

    print(f"Text: '{text}'")
    print()
    for name, (ids, pieces) in results.items():
        # Build visual token blocks
        blocks = " | ".join(repr(p) for p in pieces)
        print(f"  {name}  ({len(ids)} tokens)")
        print(f"    Pieces: {blocks}")
        print(f"    IDs:    {ids}")
        print()

# Run on a core set of banking sentences
COMPARE_SENTENCES = [
    "AML compliance requires KYC verification.",
    "anti-money laundering",
    "Suspicious activity report filed with FinCEN.",
    "Basel III capital adequacy ratio.",
    "SWIFT wire transfer to correspondent bank.",
    "stablecoin cryptocurrency VASP FATF MiCA",
    "mortgage overdraft creditworthiness",
]

print("=== Side-by-Side: Same Text, Three Tokenizers ===")
print("=" * 70)
print()
for s in COMPARE_SENTENCES:
    compare_tokenizers(s)
    print("-" * 70)
    print()

In [None]:
# Token COUNT comparison: the most practical difference
# Same text can use significantly more or fewer tokens depending on the tokenizer

print("=== Token Count Comparison: All Three Tokenizers ===")
print()

BANKING_TEXTS = {
    "AML":                              "AML",
    "KYC":                              "KYC",
    "SWIFT":                            "SWIFT",
    "compliance":                       "compliance",
    "anti-money-laundering":            "anti-money-laundering",
    "suspicious activity report":       "suspicious activity report",
    "capital adequacy ratio":           "capital adequacy ratio",
    "stablecoin cryptocurrency VASP":   "stablecoin cryptocurrency VASP",
    "AML alert (short paragraph)": (
        "Transaction flagged: Customer ID 4892 initiated wire transfer of $48,500 "
        "to correspondent bank in a FATF high-risk jurisdiction. AML analyst review required."
    ),
    "Basel III excerpt": (
        "Under Basel III, banks must maintain a minimum Common Equity Tier 1 (CET1) "
        "capital ratio of 4.5% and a Total Capital ratio of 8% of risk-weighted assets."
    ),
}

print(f"{'Text':<35} {'cl100k':>8} {'p50k':>8} {'BERT':>8}  Winner (fewest tokens)")
print("-" * 75)

for label, text in BANKING_TEXTS.items():
    ids_cl   = enc_cl100k.encode(text)
    ids_p50  = enc_p50k.encode(text)
    ids_bert = bert_tok.encode(text).ids[1:-1]  # strip CLS/SEP

    counts = {
        "cl100k": len(ids_cl),
        "p50k":   len(ids_p50),
        "BERT":   len(ids_bert),
    }
    winner = min(counts, key=counts.get)
    winner_str = f"← {winner}"
    print(f"{label[:34]:<35} {counts['cl100k']:>8} {counts['p50k']:>8} {counts['BERT']:>8}  {winner_str}")

print()
print("📌 cl100k (100K vocab) typically wins: larger vocab → longer subwords → fewer tokens.")
print("   BERT (30K vocab) often uses the most tokens for technical/domain jargon.")
print("   p50k sits in between: same BPE algorithm as cl100k but a smaller vocabulary.")
print()
print("   WHY THIS MATTERS:")
print("   → If you chunk a document at 512 tokens for BERT, it may only be ~350 words.")
print("   → The same chunk for GPT-4 (cl100k) could be ~450 words.")
print("   → Always count tokens with THE SAME tokenizer as your target model.")
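
The last point can be made concrete with a chunker that takes the token counter as a parameter. This is a minimal sketch: `chunk_by_token_budget` and the two stand-in counters are hypothetical, and the counters only imitate a coarse and a fine tokenizer so the block runs without any library.

```python
from typing import Callable, List

def chunk_by_token_budget(words: List[str],
                          count_tokens: Callable[[str], int],
                          max_tokens: int) -> List[str]:
    """Greedily pack whole words into chunks that stay within max_tokens.

    A single word that exceeds the budget still becomes its own chunk.
    """
    chunks, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Stand-in counters (assumptions): one token per word vs. one token per 4 characters.
def count_per_word(text: str) -> int:
    return len(text.split())

def count_per_4_chars(text: str) -> int:
    return -(-len(text) // 4)   # ceiling division

sample = "suspicious activity report filed with the financial intelligence unit".split()
coarse_chunks = chunk_by_token_budget(sample, count_per_word, max_tokens=4)
fine_chunks = chunk_by_token_budget(sample, count_per_4_chars, max_tokens=4)
print(len(coarse_chunks), coarse_chunks)
print(len(fine_chunks), fine_chunks)
```

The same text and the same budget give different chunk boundaries under the two counters. In the notebook, passing `lambda t: len(enc_cl100k.encode(t))` as `count_tokens` would chunk against the real cl100k tokenizer.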

In [None]:
# WordPiece vs BPE: the algorithmic difference made visible
# BPE:       merges the most frequent adjacent symbol pair across the whole corpus
# WordPiece: merges the pair that most increases the likelihood of the training data
# Result:    the two can place boundaries differently; with its smaller vocab, BERT
#            often splits more, and it marks continuations with "##"

print("=== BPE vs WordPiece: The Algorithm Difference ===" )
print()
print("WordPiece '##' convention:")
print("  'compliance' → 'com' + '##pliance'   (illustrative split)")
print("  The '##' means: this piece CONTINUES the previous word (no space before it)")
print("  GPT-style BPE uses no such marker: a token that starts a word carries the leading space")
print()

demo_words = [
    "compliance",
    "compliant",
    "anti-money-laundering",
    "cryptocurrency",
    "creditworthiness",
    "SWIFT",
    "FinCEN",
    "stablecoin",
    "overcollateralized",
    "recapitalization",
]

print(f"{'Word':<22}  {'BPE cl100k pieces':<35}  {'BPE p50k pieces':<35}  {'WordPiece BERT pieces'}")
print("-" * 130)

for word in demo_words:
    pieces_cl   = [enc_cl100k.decode([i]) for i in enc_cl100k.encode(word)]
    pieces_p50  = [enc_p50k.decode([i])   for i in enc_p50k.encode(word)]
    out_bert    = bert_tok.encode(word)
    pieces_bert = out_bert.tokens[1:-1]  # strip [CLS]/[SEP]

    cl_str   = " | ".join(repr(p) for p in pieces_cl)
    p50_str  = " | ".join(repr(p) for p in pieces_p50)
    bert_str = " | ".join(repr(p) for p in pieces_bert)

    print(f"{word:<22}  {cl_str:<35}  {p50_str:<35}  {bert_str}")

print()
print("Key observations:")
print("  1. BPE (cl100k/p50k): common words often become ONE token.")
print("     WordPiece (BERT): more likely to split even common words.")
print("  2. Acronyms like 'SWIFT' or 'FinCEN' may be single token in BPE")
print("     but split into characters/subwords in BERT's smaller vocab.")
print("  3. Long compound words ('overcollateralized') are usually split, but DIFFERENTLY.")
print("  4. The '##' in BERT output is NOT present in BPE; it's a WordPiece-specific marker.")
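
The two marker conventions show up most clearly when pieces are glued back into text. The sketch below uses hand-written piece lists (assumptions, not real tokenizer output) and ignores details such as punctuation spacing that real decoders handle.

```python
def join_wordpiece(pieces: list) -> str:
    """WordPiece style: '##' marks a continuation; other pieces start a new word."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]
        elif text:
            text += " " + piece
        else:
            text = piece
    return text

def join_gpt_bpe(pieces: list) -> str:
    """GPT-style BPE: the leading space is stored inside the token, so just concatenate."""
    return "".join(pieces)

# Hypothetical piece lists, written by hand to show the two conventions.
print(join_wordpiece(["com", "##pliance", "team"]))   # compliance team
print(join_gpt_bpe(["comp", "liance", " team"]))      # compliance team
```

In WordPiece the word boundary is implied by the absence of '##'; in GPT-style BPE it is carried by the space at the start of the next token.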

In [None]:
# ASCII visual: token boundaries as fence posts
# Makes it instantly clear where each tokenizer "cuts" the word

def fence_display(text: str):
    """
    Show token boundaries as fence posts '|' in the original text.
    Each tokenizer shown on its own line under the text.
    """
    # Get pieces from each tokenizer
    pieces_cl   = [enc_cl100k.decode([i]) for i in enc_cl100k.encode(text)]
    pieces_p50  = [enc_p50k.decode([i])   for i in enc_p50k.encode(text)]
    out_bert    = bert_tok.encode(text)
    pieces_bert = [t.replace("##", "") for t in out_bert.tokens[1:-1]]

    def build_fence(pieces):
        """Join pieces with | separator, mark token boundaries."""
        return "|".join(p.replace(" ", "·") for p in pieces)

    print(f"  Text:              {text}")
    print(f"  cl100k (GPT-4) :  {build_fence(pieces_cl)}")
    print(f"  p50k   (GPT-3) :  {build_fence(pieces_p50)}")
    print(f"  BERT WordPiece :  {build_fence(pieces_bert)}  (## stripped, lowercase)")
    print(f"  Token counts   :  cl100k={len(pieces_cl)}  p50k={len(pieces_p50)}  BERT={len(pieces_bert)}")

print("=== Fence Display: Where Each Tokenizer Cuts the Text ===")
print("(| = token boundary,  · = space)")
print()

fence_words = [
    "AML compliance",
    "anti-money-laundering",
    "suspicious transaction",
    "cryptocurrency stablecoin",
    "overcollateralized mortgage",
    "KYC EDD PEP FATF VASP",
]

for w in fence_words:
    fence_display(w)
    print()

print("📌 Where the boundary '|' lands changes per tokenizer.")
print("   GPT-4 (cl100k) tends to cut less: bigger pieces, fewer tokens.")
print("   BERT tends to cut more: smaller pieces, more tokens.")
print("   The SAME word can land on different sides of a chunk boundary")
print("   depending on which tokenizer you use, which affects RAG retrieval.")

---
## Part 6: Token Counting - Why It Matters for Cost, Context, and Chunking

In [None]:
# Token counting is the single most practical skill for LLM users
# Every API charges per token. Every model has a context limit in tokens.

def token_stats(text: str, enc=enc) -> dict:
    ids   = enc.encode(text)
    words = len(text.split())
    chars = len(text)
    return {
        "tokens": len(ids),
        "words":  words,
        "chars":  chars,
        "tok_per_word": len(ids) / max(words, 1),
        "tok_per_char": len(ids) / max(chars, 1),
    }

# Realistic banking document samples
SAMPLES = {
    "Short AML alert": """
        Transaction flagged: Customer ID 4892 initiated wire transfer of $48,500
        to correspondent bank in jurisdiction with elevated FATF risk rating.
        AML analyst review required within 24 hours per BSA policy.
    """,

    "KYC onboarding paragraph": """
        As part of our Know Your Customer (KYC) onboarding process, all new
        corporate clients are required to submit: Certificate of Incorporation,
        beneficial ownership declaration (UBO >25%), source of funds documentation,
        and a completed Customer Due Diligence (CDD) questionnaire. Enhanced Due
        Diligence (EDD) applies to Politically Exposed Persons (PEPs) and customers
        in FATF grey-list jurisdictions.
    """,

    "Basel III excerpt": """
        Under Basel III, banks must maintain a minimum Common Equity Tier 1 (CET1)
        capital ratio of 4.5%, a Tier 1 capital ratio of 6%, and a Total Capital
        ratio of 8%. In addition, a Capital Conservation Buffer (CCB) of 2.5%
        of risk-weighted assets (RWA) must be maintained, bringing the effective
        minimum CET1 ratio to 7%. Countercyclical capital buffers of up to 2.5%
        may be imposed by national regulators during periods of excess credit growth.
    """,

    "Mortgage product description": """
        Our 30-year fixed-rate mortgage product offers competitive rates starting
        at 6.75% APR for borrowers with FICO scores above 740. Loan-to-value (LTV)
        ratios up to 80% are available without private mortgage insurance (PMI).
        Debt-to-income (DTI) ratios must not exceed 43%. Origination fees are 1%
        of the principal amount. Applications require W-2s for two years, pay stubs,
        bank statements, and a property appraisal.
    """,
}

print("=== Token Counts for Realistic Banking Text ===")
print(f"{'Sample':<30} {'Tokens':>7} {'Words':>7} {'Chars':>7} {'Tok/Word':>10} {'Tok/Char':>10}")
print("-" * 80)

for name, text in SAMPLES.items():
    stats = token_stats(text.strip())
    print(f"{name:<30} {stats['tokens']:>7} {stats['words']:>7} {stats['chars']:>7} "
          f"{stats['tok_per_word']:>10.2f} {stats['tok_per_char']:>10.3f}")

print()
print("📌 Rule of thumb: ~1.3 tokens per word for English banking text")
print("   (Technical jargon, acronyms, and numbers can push this higher)")
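
When no tokenizer is at hand, the rule of thumb can be turned into a quick estimator. This is a rough sketch: `estimate_tokens` is a hypothetical helper, the 1.3 default is the approximate ratio quoted above, and the result is only a planning figure, not a substitute for counting with the real tokenizer.

```python
import math

def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from a whitespace word count (rounded up)."""
    return math.ceil(len(text.split()) * tokens_per_word)

note = "AML analyst review required within 24 hours per BSA policy."
print(len(note.split()), "words ->", estimate_tokens(note), "tokens (estimate)")
```

Text dense with acronyms, amounts, and percentages usually lands above the estimate, so leave headroom when budgeting.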

In [None]:
# Context window and cost calculator
print("=== Context Window Reality Check ===")
print()

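# Snapshot values for illustration; limits differ between versions of the same model
# (e.g. 32,768 is the GPT-4 32k variant). Check current provider documentation.
CONTEXT_WINDOWS = {
    "GPT-3.5-turbo":    4_096,
    "GPT-4":           32_768,
    "GPT-4 Turbo":    128_000,
    "Claude Sonnet":  200_000,
    "Claude Opus":    200_000,
    "Gemini 1.5 Pro": 1_000_000,
}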

# Estimate tokens for common banking document types
DOCUMENT_SIZES = {
    "1 AML alert (2 paragraphs)": 250,
    "KYC questionnaire (1 page)": 500,
    "Loan application (4 pages)": 2_000,
    "Basel III reg excerpt (10 pages)": 5_000,
    "Annual report (80 pages)": 40_000,
    "Full AML policy manual (200 pages)": 100_000,
    "Basel III full text (600 pages)": 300_000,
}

print(f"{'Model':<22} {'Context (tokens)':>18}")
print("-" * 42)
for model_name, ctx in CONTEXT_WINDOWS.items():
    print(f"{model_name:<22} {ctx:>18,}")

print()
print(f"{'Document':<42} {'Est. tokens':>13}  Fits in GPT-4?  Fits in Claude?")
print("-" * 85)
for doc, tokens in DOCUMENT_SIZES.items():
    fits_gpt4   = "✓" if tokens < 32_768  else "✗"
    fits_claude = "✓" if tokens < 200_000 else "✗"
    print(f"{doc:<42} {tokens:>13,}  {fits_gpt4:<14}  {fits_claude}")

print()
print("📌 This is why RAG (Session 3) exists.")
print("   A 600-page regulation exceeds most context windows, and even a manual that fits")
print("   is slow and costly to resend on every query.")
print("   RAG retrieves only the relevant 500-token chunks → fits easily.")
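
A fit check should also leave room for the model's answer, since the context window covers prompt and output together. The helper below is a small sketch; `fits_in_context` is a hypothetical name and the 1,000-token reply budget is an assumption to tune per use case.

```python
def fits_in_context(prompt_tokens: int, context_window: int,
                    reserved_for_output: int = 1_000) -> bool:
    """True if the prompt still leaves `reserved_for_output` tokens for the reply."""
    return prompt_tokens + reserved_for_output <= context_window

# Token figures taken from the tables above; the reply budget is an assumption.
print(fits_in_context(40_000, 32_768))     # False
print(fits_in_context(40_000, 200_000))    # True
print(fits_in_context(199_500, 200_000))   # False: no room left for the answer
```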

In [None]:
# Cost calculator using approximate list prices
print("=== Token Cost Calculator ===")
print("(Approximate pricing as of 2025; verify current rates)")
print()

# Cost per 1M tokens in USD
PRICING = {
    "GPT-4o (input)": 2.50,
    "GPT-4o (output)": 10.00,
    "Claude Sonnet 3.5 (input)": 3.00,
    "Claude Sonnet 3.5 (output)": 15.00,
    "GPT-3.5-turbo (input)": 0.50,
    "text-embedding-ada-002": 0.10,
}

print(f"{'Model + direction':<35} {'$/1M tokens':>12}")
print("-" * 50)
for name, price in PRICING.items():
    print(f"{name:<35} ${price:>11.2f}")

print()
print("=== Cost Scenarios ===")

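# Each scenario: (description, total tokens, USD per 1M tokens).
# Simplification: one flat rate per scenario; output tokens are not priced separately.
scenarios = [
    ("Embed 10,000 AML alerts (250 tok each)",   10_000 * 250,   0.10),
    ("Summarize 1,000 KYC files (500 tok each)",  1_000 * 500,   3.00),
    ("Chat: 1M daily messages (500 tok in/out)",  1_000_000 * 500, 3.00),
    ("Process Basel III full text once",          300_000,        3.00),
]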

print(f"{'Scenario':<47} {'Tokens':>12} {'Est. cost':>12}")
print("-" * 74)
for desc, tokens, price_per_M in scenarios:
    cost = tokens / 1_000_000 * price_per_M
    # Width 12 keeps nine-digit token counts (e.g. 500,000,000) aligned
    print(f"{desc:<47} {tokens:>12,} ${cost:>11.4f}")

print()
print("📌 Token efficiency directly impacts operational cost.")
print("   Shorter prompts, better chunking, and caching all reduce spend.")
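
Input and output tokens are billed at different rates, so a chat workload is better estimated with both. The sketch below is one way to do that, using the Claude Sonnet figures from the pricing table above (3.00 in, 15.00 out per 1M tokens); `estimate_cost` is a hypothetical helper and the traffic volumes are illustrative.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_m, output_price_per_m):
    """Estimated USD cost when input and output are priced separately."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# 1M messages per day, 500 tokens in and 500 tokens out per message.
messages = 1_000_000
daily = estimate_cost(messages * 500, messages * 500, 3.00, 15.00)
print(f"Estimated daily cost: ${daily:,.2f}")
```

With these assumed volumes, most of the estimate comes from output tokens, which is why capping response length is often the first cost lever to try.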

---
## Part 7: Tokenization Surprises - What Breaks and Why

The tokenizer has quirks that matter for production banking systems.

In [None]:
# Surprise 1: Numbers are tokenized in unexpected ways
print("=== Surprise 1: Numbers ===")
print()

number_examples = [
    "1000",
    "10000",
    "100000",
    "1,000",
    "10,000",
    "$10,000",
    "$10,000.00",
    "48500",
    "8.5%",
    "6.75% APR",
    "Basel III",
    "FICO 740",
]

for text in number_examples:
    ids     = enc.encode(text)
    decoded = [enc.decode([i]) for i in ids]
    parts   = " | ".join(repr(t) for t in decoded)
    print(f"  {text:<18} → {len(ids)} token(s):  {parts}")

print()
print("📌 Numbers are not tokenized as numeric values; they are split like any other string.")
print("   '10,000' may split differently from '10000' from '$10,000'.")
print("   LLMs can struggle with arithmetic because of this tokenization.")
print("   For regulatory thresholds and amounts, consider normalizing format first.")
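
One way to act on the normalization advice is to rewrite amounts into a single canonical form before tokenizing or embedding. The sketch below is a minimal example, assuming a canonical form of plain digits with no thousands separators and no all-zero decimals; the function name `normalize_amounts` and that choice of canonical form are illustrative, not a standard.

```python
import re

# Optional "$", then digits with well-formed thousands separators
# (or plain digits), then an optional decimal part.
_AMOUNT = re.compile(r"(\$?)(\d{1,3}(?:,\d{3})+(?!\d)|\d+)(\.\d+)?")


def _canonical(match):
    currency, whole, frac = match.group(1), match.group(2), match.group(3) or ""
    whole = whole.replace(",", "")   # "10,000" -> "10000"
    if frac.strip("0") == ".":       # ".00" carries no information, drop it
        frac = ""
    return f"{currency}{whole}{frac}"


def normalize_amounts(text):
    """Rewrite every amount in `text` to the assumed canonical form."""
    return _AMOUNT.sub(_canonical, text)


for raw in ["$10,000.00", "10,000", "10000", "8.5%", "6.75% APR"]:
    print(f"{raw:<12} -> {normalize_amounts(raw)}")
```

Running the normalized strings through `enc.encode` afterwards shows whether the different spellings now share one token sequence.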

In [None]:
# Surprise 2: Capitalization matters
print("=== Surprise 2: Capitalization Changes Token IDs ===")
print()

cap_pairs = [
    ("compliance", "Compliance"),
    ("aml",        "AML"),
    ("kyc",        "KYC"),
    ("swift",      "SWIFT"),
    ("bank",       "Bank"),
    ("mortgage",   "Mortgage"),
    ("fraud",      "FRAUD"),
]

print(f"{'Lower':<16} {'Upper':<16} {'Lower IDs':<25} {'Upper IDs':<25} Same?")
print("-" * 90)
for lower, upper in cap_pairs:
    ids_l = enc.encode(lower)
    ids_u = enc.encode(upper)
    same  = "✓" if ids_l == ids_u else "✗ DIFFERENT"
    print(f"{lower:<16} {upper:<16} {str(ids_l):<25} {str(ids_u):<25} {same}")

print()
print("📌 'aml' and 'AML' are DIFFERENT tokens.")
print("   This matters for consistency in prompts and document processing.")
print("   Consider lowercasing or normalizing banking acronyms before embedding.")
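
A small normalization pass can make acronym casing consistent before text is embedded. The sketch below assumes the upper-case spelling is the canonical one and uses a short, illustrative acronym list; `canonicalize_acronyms` is a hypothetical helper.

```python
import re

# Illustrative subset; extend with the acronyms your documents use.
CANONICAL_ACRONYMS = {
    "aml": "AML",
    "kyc": "KYC",
    "sar": "SAR",
    "fatf": "FATF",
    "swift": "SWIFT",
}

# Whole-word, case-insensitive match on any listed acronym.
_ACRONYM = re.compile(
    r"\b(" + "|".join(CANONICAL_ACRONYMS) + r")\b", re.IGNORECASE
)


def canonicalize_acronyms(text):
    """Replace any casing of a known acronym with its canonical spelling."""
    return _ACRONYM.sub(lambda m: CANONICAL_ACRONYMS[m.group(0).lower()], text)


print(canonicalize_acronyms("The aml team ran Kyc checks via Swift."))
```

A plain lookup like this cannot tell the acronym SWIFT from the adjective "swift", so ambiguous entries may need to be left out or handled with context.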

In [None]:
# Surprise 3: Leading spaces create different tokens
print("=== Surprise 3: Leading Space Changes the Token ===")
print()

space_examples = [
    ("compliance",  " compliance"),
    ("fraud",       " fraud"),
    ("bank",        " bank"),
    ("transaction", " transaction"),
]

for without, with_space in space_examples:
    id_without = enc.encode(without)
    id_with    = enc.encode(with_space)
    same = id_without == id_with
    print(f"  '{without}' → {id_without}")
    print(f"  '{with_space}' → {id_with}")
    print(f"  Same token? {'Yes' if same else 'No - leading space creates a different token'}")
    print()

print("📌 In BPE, a word at the start of a sentence (no preceding space)")
print("   gets a different token ID than the same word mid-sentence (with space).")
print("   tiktoken handles this internally, but it explains why position matters.")
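
The leading-space effect comes from the step before any merges happen: GPT-style tokenizers first split text with a regular expression that attaches a space to the word that follows it. The sketch below uses a deliberately simplified, ASCII-only stand-in for such a pattern (an assumption for illustration, not the pattern tiktoken ships) to show the pieces BPE would then work on.

```python
import re

# Simplified stand-in for a GPT-style pre-tokenization pattern:
# an optional leading space is kept with the word, number, or
# punctuation run that follows it.
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pretokenize(text):
    """Split text into the pieces that BPE merges would be applied to."""
    return PRETOKEN.findall(text)


print(pretokenize("fraud and fraud"))
print(pretokenize("bank"), pretokenize(" bank"))
```

The first `fraud` and the second ` fraud` are different strings at this stage, so they can end up with different token IDs.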

In [None]:
# Surprise 4: Languages are not equal - non-English text uses more tokens
print("=== Surprise 4: Language Efficiency ===")
print("(Same meaning, different languages, different token counts)")
print()

multilingual = [
    ("English",    "Anti-money laundering compliance requires customer verification."),
    ("Spanish",    "El cumplimiento contra el lavado de dinero requiere verificación del cliente."),
    ("French",     "La conformité en matière de lutte contre le blanchiment d'argent."),
    ("German",     "Geldwäschebekämpfung erfordert Kundenidentifikation und Überprüfung."),
    ("Arabic",     "الامتثال لمكافحة غسيل الأموال يتطلب التحقق من هوية العميل."),
    ("Chinese",    "反洗钱合规要求客户身份验证和尽职调查程序。"),
    ("Japanese",   "マネーロンダリング防止コンプライアンスには顧客確認が必要です。"),
]

print(f"{'Language':<12} {'Tokens':>8} {'Chars':>8}  {'Tok/Char':>10}  Sentence")
print("-" * 100)
for lang, text in multilingual:
    ids = enc.encode(text)
    print(f"{lang:<12} {len(ids):>8} {len(text):>8}  {len(ids)/len(text):>10.3f}  {text[:55]}")

print()
print("📌 Asian languages often need 2-4x more tokens than English for equivalent content.")
print("   This means higher cost AND shorter effective context for multilingual banking.")
print("   Global banks with multilingual compliance documents should account for this.")
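
Part of the language gap is visible before any tokenizer runs. Byte-level BPE starts from UTF-8 bytes, and non-Latin scripts need more bytes per character. The sketch below only measures that starting point; the sample words are illustrative, and the final token count also depends on how many merges the tokenizer learned for each script.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes used per character."""
    return len(text.encode("utf-8")) / len(text)


# Illustrative words for "compliance" in several scripts.
samples = {
    "English":  "compliance",
    "Arabic":   "الامتثال",
    "Chinese":  "合规",
    "Japanese": "コンプライアンス",
}

for lang, word in samples.items():
    n_bytes = len(word.encode("utf-8"))
    print(f"{lang:<10} chars={len(word):>2}  bytes={n_bytes:>2}  "
          f"bytes/char={utf8_bytes_per_char(word):.1f}")
```

Comparing these byte counts with `len(enc.encode(word))` shows how much of the starting cost the learned merges recover for each script.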

---
## Part 8: Banking-Specific Tokenization Analysis

In [None]:
# Deep dive: how are banking acronyms and terms tokenized?
# This directly affects how well the model understands them

print("=== Banking Acronyms: How the Model Sees Them ===")
print()

BANKING_VOCAB = {
    "AML":  "Anti-Money Laundering",
    "KYC":  "Know Your Customer",
    "BSA":  "Bank Secrecy Act",
    "SAR":  "Suspicious Activity Report",
    "CTR":  "Currency Transaction Report",
    "PEP":  "Politically Exposed Person",
    "CDD":  "Customer Due Diligence",
    "EDD":  "Enhanced Due Diligence",
    "SWIFT": "Society for Worldwide Interbank Financial Telecommunication",
    "FATF": "Financial Action Task Force",
    "OFAC": "Office of Foreign Assets Control",
    "CET1": "Common Equity Tier 1",
    "RWA":  "Risk-Weighted Assets",
    "LTV":  "Loan-to-Value",
    "DTI":  "Debt-to-Income",
    "FICO": "Fair Isaac Corporation score",
    "VASP": "Virtual Asset Service Provider",
    "MiCA": "Markets in Crypto-Assets",
    "DORA": "Digital Operational Resilience Act",
}

print(f"{'Acronym':<8} {'Token IDs':<25} {'Tok count':>9}  {'Decoded tokens':<30} Full form")
print("-" * 110)
for acronym, full_form in BANKING_VOCAB.items():
    ids     = enc.encode(acronym)
    decoded = [enc.decode([i]) for i in ids]
    dec_str = " | ".join(repr(t) for t in decoded)
    print(f"{acronym:<8} {str(ids):<25} {len(ids):>9}  {dec_str:<30} {full_form[:45]}")

In [None]:
# Practical insight: acronym vs full form - which is better in prompts?
print("=== Acronym vs Full Form: Token Efficiency ===")
print()

comparisons = [
    ("AML",    "anti-money laundering"),
    ("KYC",    "know your customer"),
    ("SAR",    "suspicious activity report"),
    ("SWIFT",  "Society for Worldwide Interbank Financial Telecommunication"),
    ("Basel III", "Basel Three capital framework"),
]

print(f"{'Acronym':<12} {'Tokens':>8}  {'Full form':<55} {'Tokens':>8}  Savings")
print("-" * 100)
for short, long in comparisons:
    t_short = len(enc.encode(short))
    t_long  = len(enc.encode(long))
    savings = t_long - t_short
    print(f"{short:<12} {t_short:>8}  {long:<55} {t_long:>8}  -{savings} tokens")

print()
print("📌 Use acronyms in prompts to save tokens, but DEFINE them first.")
print("   'AML (anti-money laundering)' costs a few more tokens once,")
print("   but then 'AML' alone is cheaper throughout the rest of the prompt.")
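
The define-once trade-off is simple arithmetic, and a rough sketch makes the break-even point visible. The token counts below (1 for the acronym, 5 for the full form, 2 for the parentheses) are assumptions for illustration; measure the real values with `enc.encode` for your own terms.

```python
def tokens_full_form(uses, long_tokens):
    """Total tokens if the full form is written out every time."""
    return uses * long_tokens


def tokens_defined_acronym(uses, short_tokens, long_tokens, paren_overhead=2):
    """Total tokens if the first mention is 'AML (full form)' and every
    later mention is the acronym alone. paren_overhead is an assumed
    cost for the two parentheses."""
    first_mention = short_tokens + long_tokens + paren_overhead
    return first_mention + (uses - 1) * short_tokens


for uses in (1, 2, 5, 10):
    full = tokens_full_form(uses, long_tokens=5)
    defined = tokens_defined_acronym(uses, short_tokens=1, long_tokens=5)
    print(f"uses={uses:>2}  full form={full:>3}  defined acronym={defined:>3}")
```

Under these assumed counts the definition pays for itself from the second mention onward.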

In [None]:
# Full pipeline: text → tokens → token IDs → (conceptually) → embeddings
print("=== Full Pipeline: Text → Tokens → IDs → Embeddings (conceptual) ===")
print()

sample = "The AML team flagged a suspicious wire transfer for SAR filing."
ids    = enc.encode(sample)
decoded = [enc.decode([i]) for i in ids]

print(f"Step 0 - Raw text:")
print(f"  '{sample}'")
print()

print(f"Step 1 - Tokenize:")
for i, (token, tid) in enumerate(zip(decoded, ids)):
    bar = "█" * min(tid // 5000, 20)  # rough visual of ID magnitude
    print(f"  [{i:>2}] id={tid:>6}  token={token!r:<18}  {bar}")
print()

print(f"Step 2 - Each ID is looked up in an embedding table:")
print(f"  Vocab size: {enc.n_vocab:,} rows")
print(f"  Embedding dim: model-specific (768 is used here for illustration, as in BERT-base)")
print(f"  Each row: a 768-dimensional vector learned during model training")
print()
print(f"  id {ids[0]} → embedding table row {ids[0]} → [0.12, -0.34, 0.88, ...]  (768 numbers)")
print(f"  id {ids[1]} → embedding table row {ids[1]} → [-0.05, 0.72, 0.11, ...]  (768 numbers)")
print(f"  ... and so on for all {len(ids)} tokens")
print()

print(f"Step 3 - Transformer processes the sequence of {len(ids)} vectors")
print(f"  Self-attention: each token 'looks at' all other tokens")
print(f"  Output: {len(ids)} context-aware vectors (one per token)")
print()
print(f"Step 4 - Pool or use the final representations for your task")
print(f"  Classification: use the [CLS] token (BERT-style encoders)")
print(f"  Sentence embedding: mean-pool all token vectors  ← this is Notebook A")
print(f"  Generation: predict next token ID from the last vector")
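
Steps 2 and 4 can be made concrete with a toy example. The sketch below builds a small random embedding table, looks up a few token IDs, and mean-pools the result; the table size, dimension, and IDs are all made-up values, and the transformer of Step 3 is skipped entirely.

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 1_000, 8  # tiny illustrative sizes, not a real model's

# One row of DIM floats per token ID (random here, learned in a real model).
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Step 2: look up one vector per token ID."""
    return [embedding_table[tid] for tid in token_ids]


def mean_pool(vectors):
    """Step 4: average the token vectors into one sentence vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


token_ids = [17, 256, 999]  # hypothetical IDs
vectors = embed(token_ids)
sentence_vector = mean_pool(vectors)
print(f"{len(vectors)} token vectors of dim {DIM} -> 1 vector of dim {len(sentence_vector)}")
```

In a real encoder the pooled vectors would be the context-aware outputs of the transformer rather than raw table rows.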

---
## Hands-On Exercise: Analyze Your Own Text

In [None]:
# Paste any banking text here and analyze it
YOUR_TEXT = """
TODO: Paste your own banking text here.
Examples:
- A paragraph from your internal AML policy
- A loan product description
- A regulatory requirement from Basel III or MiFID
- A customer email about a suspicious transaction
"""

if "TODO" not in YOUR_TEXT:
    text    = YOUR_TEXT.strip()  # strip once so every metric uses the same string
    ids     = enc.encode(text)
    decoded = [enc.decode([i]) for i in ids]
    words   = len(text.split())

    print(f"=== Your Text Analysis ===")
    print(f"Characters: {len(text)}")
    print(f"Words:      {words}")
    print(f"Tokens:     {len(ids)}")
    print(f"Tok/word:   {len(ids)/max(words,1):.2f}")
    print()
    print("Token breakdown:")
    for i, (token, tid) in enumerate(zip(decoded, ids)):
        print(f"  [{i:>3}] {tid:>7}  {token!r}")
    print()
    print("Questions:")
    print("  1. Which terms split into multiple tokens?")
    print("  2. Which acronyms are a single token?")
    print("  3. How many tokens would this use in a 128K context window?")
else:
    print("Replace YOUR_TEXT above with your own banking text to analyze it.")
    print("Then re-run this cell.")
    print()
    # Demonstrate with a sample instead
    demo = "The SAR filing deadline is within 30 days of detecting suspicious activity."
    ids = enc.encode(demo)
    decoded = [enc.decode([i]) for i in ids]
    print(f"Demo: '{demo}'")
    print(f"Tokens ({len(ids)}):")
    for i, (token, tid) in enumerate(zip(decoded, ids)):
        print(f"  [{i:>2}] {tid:>7}  {token!r}")

In [None]:
# Group exercise: Token budget for a RAG prompt
# In Session 3 you will build RAG; prompts look like this

print("=== RAG Prompt Token Budget Exercise ===")
print()

SYSTEM_PROMPT = """You are a banking compliance assistant. Answer questions using only 
the provided context. If the answer is not in the context, say so."""

RETRIEVED_CHUNK = """[Context from AML Policy, Section 4.2]
Wire transfers exceeding $10,000 USD must be reported to FinCEN via a Currency 
Transaction Report (CTR) within 15 calendar days. Structuring transactions to 
avoid the $10,000 threshold is a federal crime under 31 U.S.C. 5324. All wire 
transfers to or from FATF high-risk jurisdictions require Enhanced Due Diligence 
and senior management approval."""

USER_QUERY = "What is the reporting threshold for wire transfers and what form is used?"

EXPECTED_ANSWER = """Based on the policy, wire transfers exceeding $10,000 USD must 
be reported using a Currency Transaction Report (CTR) within 15 calendar days."""

parts = {
    "System prompt":     SYSTEM_PROMPT,
    "Retrieved context": RETRIEVED_CHUNK,
    "User query":        USER_QUERY,
    "Expected answer":   EXPECTED_ANSWER,
}

total = 0
print(f"{'Part':<22} {'Tokens':>8}")
print("-" * 32)
for part_name, text in parts.items():
    t = len(enc.encode(text))
    total += t
    print(f"{part_name:<22} {t:>8}")
print("-" * 32)
print(f"{'TOTAL':<22} {total:>8}")
print()
print(f"Remaining in 128K context window: {128_000 - total:,} tokens")
print(f"= room for ~{(128_000 - total) // 500} more retrieved chunks (avg 500 tok each)")
print()
print("📌 In Session 3, you will build this pipeline end-to-end.")
print("   Token counting determines how many chunks you can include,")
print("   which directly affects answer quality and completeness.")
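
The budget exercise above can be turned into a selection rule: keep adding retrieved chunks, most relevant first, until the next one would no longer fit. The sketch below assumes the chunks are already ranked by relevance and that their token counts are known; `select_chunks` and the numbers in the example are hypothetical.

```python
def select_chunks(chunk_token_counts, context_window,
                  fixed_tokens, reserved_output):
    """Greedily pick ranked chunks that fit the remaining token budget.

    fixed_tokens    = system prompt + user query
    reserved_output = room kept free for the model's answer
    Returns (indices of selected chunks, tokens used by them).
    """
    budget = context_window - fixed_tokens - reserved_output
    selected, used = [], 0
    for index, count in enumerate(chunk_token_counts):
        if used + count > budget:
            break  # stop at the first chunk that does not fit
        selected.append(index)
        used += count
    return selected, used


# Illustrative numbers: 2,000-token window, 600 fixed, 500 reserved.
picked, used = select_chunks([500, 400, 300], 2_000, 600, 500)
print(f"Selected chunks {picked} using {used} tokens")
```

Stopping at the first chunk that does not fit preserves the relevance order; skipping it and trying smaller chunks further down the list is an alternative that trades ranking fidelity for fuller use of the window.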

---
## Summary: What You Experienced

| Part | Concept | Key Takeaway |
|------|---------|-------------|
| 1. The problem | Why tokenization exists | LLMs need integers, not strings |
| 2. Character | Naïve approach | 1 char = 1 token → too long, no meaning sharing |
| 3. Word | Better, but broken | OOV problem: 'FATF' and 'MiCA' become `<UNK>` |
| 4. BPE (yours) | You trained it | Merges frequent pairs → subwords, no OOV |
| 5. tiktoken | The real thing | 100K vocab, banking terms get 1-3 tokens |
| 6. Counting | Cost & context | ~1.3 tok/word; 200-page manual > most context windows |
| 7. Surprises | What breaks | Numbers, capitalization, spaces, language cost |
| 8. Banking | Domain analysis | AML=1 tok, FATF=2, anti-money-laundering=5 |

### How This Connects to the Other Notebooks

```
Session 1.5C (this notebook) - TOKENIZATION
  Text → token IDs  (tiktoken, BPE)
         ↓
Session 1.5B - EMBEDDINGS (Word2Vec, static)
  Token IDs → lookup in embedding table → 50-dim vectors
  One vector per word, trained by YOU on banking corpus
         ↓
Session 1.5A - CONTEXTUAL EMBEDDINGS (sentence-transformers)
  Full sentence → transformer → 384-dim vector
  'bank' gets different vector per sentence
         ↓
Session 3   - RAG
  Documents → chunk → embed → store in vector DB
  Query → embed → find similar chunks → LLM answers
```

### Practical Rules for Banking LLM Systems

1. **Count tokens before sending**: avoid context overflow on long documents  
2. **Use acronyms in prompts**: AML (3 chars, 1 token), not anti-money-laundering (5 tokens)  
3. **Normalize numbers**: decide on $10,000 vs 10000 vs 10k before embedding  
4. **Multilingual = more tokens**: Arabic/Chinese compliance text costs 2-4x more  
5. **Chunk by token count**: word or character counts give inaccurate chunk sizes  
6. **RAG is the answer to context limits**: retrieve what is relevant instead of fitting everything in
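
Rule 5 can be sketched as a sliding window over token IDs. The example below works on a plain list of integers so it runs without a tokenizer; the function name `chunk_token_ids` and the chunk size and overlap values are illustrative assumptions.

```python
def chunk_token_ids(token_ids, chunk_size, overlap=0):
    """Split a list of token IDs into windows of at most chunk_size,
    with `overlap` tokens shared between consecutive windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


fake_ids = list(range(10))  # stand-in for the output of enc.encode(text)
for chunk in chunk_token_ids(fake_ids, chunk_size=4, overlap=1):
    print(chunk)
```

With tiktoken, the input would come from `enc.encode(text)` and each chunk could be turned back into text with `enc.decode(chunk)`. A cut at an arbitrary token boundary can land in the middle of a multi-byte character, so decoded chunk edges are worth checking for non-English text.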