# Chapter 1.1 — Tokenization: Teaching Machines to Read

Companion article: https://medium.com/@vadidsadikshaikh/chapter-1-1-tokenization-teaching-machines-to-read-a82b5f260a3e 
Reference: Sebastian Raschka, *Build a Large Language Model (From Scratch)*  
Purpose: Load the GPT-2 tokenizer from OpenAI’s `tiktoken` library and use it to convert raw text → token IDs (and back).

In [None]:
import tiktoken

# Load the GPT-2 BPE encoding (tiktoken's 'r50k_base', used by the original GPT-3 models, shares this vocabulary)
tokenizer = tiktoken.get_encoding('gpt2')

print('✅ Tokenizer loaded successfully')
print('Vocabulary size:', tokenizer.n_vocab)  # 50257 for the GPT-2 encoding

In [None]:
text = 'Large language models read the world as numbers.'
encoded = tokenizer.encode(text)

print('Original Text:', text)
print('Encoded Token IDs:', encoded)
print('Number of tokens:', len(encoded))

In [None]:
decoded = tokenizer.decode(encoded)
print('Decoded Text:', decoded)

In [None]:
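# Decode each ID on its own to see which text fragment it maps to.
# repr() makes leading spaces visible (GPT-2 tokens often start with a space).
print(f"{'Token ID':>10} | {'Token String':<12}")
print('-' * 25)
for token_id in encoded:
    token_str = tokenizer.decode([token_id])
    print(f"{token_id:>10} | {token_str!r:<12}")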

### Notes:
- This tokenizer uses GPT-2’s Byte Pair Encoding (BPE) algorithm via `tiktoken`.
- The token IDs are the **numerical form** of the text: the model only ever sees these integers, never the raw characters.
- These token IDs will be used in the **Embeddings** step (next chapter).
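- Keep this tokenizer fixed throughout your model’s lifecycle (training to deployment): a different tokenizer maps the same text to different IDs, so the learned embeddings would no longer line up with the input.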
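
To make the BPE note above more concrete, here is a toy sketch of the core idea: repeatedly merge the most frequent adjacent pair of symbols. It is a simplified illustration, not GPT-2's actual tokenizer. The real one works on bytes, applies merges learned from a large corpus, and pre-splits text with a regular expression. The function names (`most_frequent_pair`, `merge_pair`, `toy_bpe`) and the sample text are assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` greedy merges."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


tokens, merges = toy_bpe("low lower lowest", num_merges=3)
print("Learned merges:", merges)
print("Tokens:", tokens)
```

On this input the first two merges assemble the shared stem `low`, which shows how frequent character sequences end up as single tokens while rarer endings stay split into smaller pieces.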