# 1. Tokenization & Embeddings

**Converting text into numerical representations**

Here's the thing about neural networks: they only understand numbers.

Words, sentences, paragraphs? Meaningless. But vectors of floating-point numbers? Now we're talking.

So the first step in any language model is converting text into numbers the model can actually process. This happens in two stages: **tokenization** (breaking text into tokens) and **embedding** (converting those tokens into vectors).

## Our Vocabulary

For this project, we're using a tiny vocabulary with just 6 tokens:

| Token ID | Token | Description |
|----------|-------|-------------|
| 0 | `<PAD>` | Padding token (fills shorter sequences so a batch has a uniform length) |
| 1 | `<BOS>` | Beginning of sequence marker |
| 2 | `<EOS>` | End of sequence marker |
| 3 | `I` | Content word |
| 4 | `like` | Content word |
| 5 | `transformers` | Content word |

In real language models, vocabularies typically contain tens of thousands of tokens or more, usually subwords rather than whole words. GPT-3, for instance, uses a vocabulary of 50,257 tokens. Ours has 6, which keeps the calculations manageable while still demonstrating all the key concepts.

In [None]:
import random

# Set seed for reproducibility
random.seed(42)

# Model hyperparameters
VOCAB_SIZE = 6
D_MODEL = 16
MAX_SEQ_LEN = 5

# Token names for pretty printing
TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

## Input Text

We'll train our model on a single sentence:

```
"I like transformers"
```

## Tokenization Process

To convert our text into token IDs, we:

1. Add a `<BOS>` (beginning of sequence) token at the start
2. Convert each word to its token ID
3. Add an `<EOS>` (end of sequence) token at the end

**Result:**

```
Text:      <BOS>  I     like  transformers  <EOS>
Token IDs: [1,    3,    4,    5,            2]
```

Our sequence has **length 5**.
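
The three steps above can be sketched as a small function. This is a minimal illustration, assuming whitespace splitting is enough for our one-sentence corpus; `TOKEN_TO_ID` and `tokenize` are hypothetical names that don't appear elsewhere in this notebook, and real tokenizers typically use subword schemes instead.

```python
# Hypothetical helper: a whitespace tokenizer for our 6-token vocabulary.
TOKEN_NAMES = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]

# Reverse lookup: token string -> token ID
TOKEN_TO_ID = {name: idx for idx, name in enumerate(TOKEN_NAMES)}

def tokenize(text):
    """Split on whitespace, map each word to its ID, and wrap in <BOS>/<EOS>."""
    # Raises KeyError for any word that is not in the vocabulary
    word_ids = [TOKEN_TO_ID[word] for word in text.split()]
    return [TOKEN_TO_ID["<BOS>"]] + word_ids + [TOKEN_TO_ID["<EOS>"]]

print(tokenize("I like transformers"))  # [1, 3, 4, 5, 2]
```

Note that this sketch has no fallback for unknown words; with a 6-token vocabulary, anything outside our one sentence simply fails.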

In [None]:
# Our input sequence
tokens = [1, 3, 4, 5, 2]  # <BOS>, I, like, transformers, <EOS>
seq_len = len(tokens)

print(f"Input sequence: {tokens}")
print(f"As text: {' '.join(TOKEN_NAMES[t] for t in tokens)}")
print(f"Sequence length: {seq_len}")

## Language Modeling Task

In a decoder-only transformer (like GPT), the game is simple: predict the next token.

At each position, we look at the current token and everything before it, then try to guess what comes next. We use a **causal mask** to enforce this "no peeking at the future" rule:

| Position | Input tokens | Predict |
|----------|--------------|--------|
| 0 | `<BOS>` | `I` (token 3) |
| 1 | `<BOS> I` | `like` (token 4) |
| 2 | `<BOS> I like` | `transformers` (token 5) |
| 3 | `<BOS> I like transformers` | `<EOS>` (token 2) |

Position 4 (`<EOS>`) has no target: it marks the end of the sequence, so there is nothing after it to predict.
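
The table above can be reproduced by shifting the sequence by one position. The sketch below is illustrative only: `inputs`, `targets`, and `causal_mask` are names assumed here, and the mask uses 1 for "may attend" and 0 for "blocked", which is one common convention among several.

```python
tokens = [1, 3, 4, 5, 2]  # <BOS>, I, like, transformers, <EOS>

# Next-token prediction: the target at position i is the token at position i + 1.
inputs = tokens[:-1]   # positions 0..3 each make a prediction
targets = tokens[1:]   # what each of those positions should predict

for pos, target in enumerate(targets):
    context = tokens[:pos + 1]  # the current token and everything before it
    print(f"Position {pos}: sees {context} -> predict {target}")

# Causal mask (assumed convention: 1 = may attend, 0 = blocked).
# Row i marks which positions j are visible from position i (only j <= i).
n = len(tokens)
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
for row in causal_mask:
    print(row)
```

The mask is lower-triangular: row 0 can see only itself, while the last row can see the whole sequence.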

## Embeddings: Tokens to Vectors

Okay, so we have token IDs. But remember, the model needs actual vectors—continuous representations it can do math with.

We use two types of embeddings, token and position, and add them together to create these vectors.

### 1. Token Embeddings

Each token gets its own embedding vector. With `d_model = 16`, that means each of our 6 tokens maps to a 16-dimensional vector.

We have a **token embedding matrix** of shape `[vocab_size, d_model]` = `[6, 16]`:

```
E_token = [
  [e₀,₀,  e₀,₁,  ..., e₀,₁₅],   ← embedding for token 0 (<PAD>)
  [e₁,₀,  e₁,₁,  ..., e₁,₁₅],   ← embedding for token 1 (<BOS>)
  [e₂,₀,  e₂,₁,  ..., e₂,₁₅],   ← embedding for token 2 (<EOS>)
  [e₃,₀,  e₃,₁,  ..., e₃,₁₅],   ← embedding for token 3 (I)
  [e₄,₀,  e₄,₁,  ..., e₄,₁₅],   ← embedding for token 4 (like)
  [e₅,₀,  e₅,₁,  ..., e₅,₁₅],   ← embedding for token 5 (transformers)
]
```

To get the embedding for token ID `i`, we simply look up row `i` of this matrix.
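
A row lookup is equivalent to multiplying a one-hot vector by the matrix, which is why an embedding table can be thought of as a linear layer without a bias. Here is a minimal sketch of that equivalence, using a made-up 3-by-2 matrix `toy_E` (not the notebook's `E_token`) and hypothetical helpers `one_hot` and `vec_mat`:

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def vec_mat(vec, mat):
    """Multiply a length-n row vector by an n x d matrix, giving a length-d vector."""
    n_rows, n_cols = len(mat), len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(n_rows)) for c in range(n_cols)]

# Made-up embedding table: 3 tokens, 2 dimensions
toy_E = [
    [0.0, 0.1],
    [1.0, 1.1],
    [2.0, 2.1],
]

token_id = 2
looked_up = toy_E[token_id]                                   # direct row lookup
via_matmul = vec_mat(one_hot(token_id, len(toy_E)), toy_E)    # one-hot times matrix

print(looked_up)                 # [2.0, 2.1]
print(via_matmul)                # [2.0, 2.1]
print(looked_up == via_matmul)   # True
```

The lookup is preferred in practice because it skips all the multiplications by zero.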

In [None]:
def random_vector(size, scale=0.1):
    """Generate a random vector with values ~ N(0, scale^2)"""
    return [random.gauss(0, scale) for _ in range(size)]

def format_vector(vec, decimals=4):
    """Format vector as string with specified decimal places"""
    return "[" + ", ".join([f"{v:7.{decimals}f}" for v in vec]) + "]"

# Initialize token embedding matrix
E_token = [random_vector(D_MODEL) for _ in range(VOCAB_SIZE)]

print("Token Embedding Matrix E_token")
print(f"Shape: [{VOCAB_SIZE}, {D_MODEL}]")
print()
for i, row in enumerate(E_token):
    print(f"  Token {i} ({TOKEN_NAMES[i]:12s}): {format_vector(row)}")

### 2. Position Embeddings

Here's a weird thing about transformers: they have no built-in sense of order.

Unlike RNNs (which process sequences left-to-right) or CNNs (which have spatial structure), the attention mechanism treats the input like a bag of tokens. It has no idea that "I" comes before "like" unless we explicitly tell it.

So we need position embeddings. We use **learned position embeddings**, the approach GPT-2 takes. There are other options, such as fixed sinusoidal encodings, but a learned table is the simplest to implement and to follow by hand.
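
To make the ordering problem concrete, the sketch below embeds "I like" and "like I" with and without position embeddings. The 2-dimensional values in `toy_token` and `toy_pos` are made up for illustration, and `embed` is a hypothetical helper. Without positions, both orderings yield the same collection of vectors; with positions, they don't.

```python
# Made-up 2-dimensional embeddings for tokens 3 ("I") and 4 ("like")
toy_token = {3: [1.0, 0.0], 4: [0.0, 1.0]}
# Made-up position embeddings for positions 0 and 1
toy_pos = [[0.5, 0.5], [-0.5, 0.25]]

def embed(token_ids, use_positions):
    """Return one vector per position, optionally adding the position embedding."""
    rows = []
    for pos, token_id in enumerate(token_ids):
        vec = list(toy_token[token_id])
        if use_positions:
            vec = [t + p for t, p in zip(vec, toy_pos[pos])]
        rows.append(vec)
    return rows

i_like = [3, 4]
like_i = [4, 3]

# Sorting ignores order, so this compares the two inputs as "bags" of vectors.
same_without = sorted(embed(i_like, False)) == sorted(embed(like_i, False))
same_with = sorted(embed(i_like, True)) == sorted(embed(like_i, True))

print(f"Same bag of vectors without positions: {same_without}")  # True
print(f"Same bag of vectors with positions:    {same_with}")     # False
```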

We have a **position embedding matrix** of shape `[max_seq_len, d_model]` = `[5, 16]`:

In [None]:
# Initialize position embedding matrix
E_pos = [random_vector(D_MODEL) for _ in range(MAX_SEQ_LEN)]

print("Position Embedding Matrix E_pos")
print(f"Shape: [{MAX_SEQ_LEN}, {D_MODEL}]")
print()
for i, row in enumerate(E_pos):
    print(f"  Position {i}: {format_vector(row)}")

### 3. Combined Embeddings

For each token at position `i`, we add the token embedding and position embedding:

$$\text{embedding}[i] = \text{token\_embedding}[\text{token\_id}[i]] + \text{position\_embedding}[i]$$

For our sequence `[1, 3, 4, 5, 2]`:

```
Position 0: embedding[0] = E_token[1] + E_pos[0]  (for <BOS>)
Position 1: embedding[1] = E_token[3] + E_pos[1]  (for I)
Position 2: embedding[2] = E_token[4] + E_pos[2]  (for like)
Position 3: embedding[3] = E_token[5] + E_pos[3]  (for transformers)
Position 4: embedding[4] = E_token[2] + E_pos[4]  (for <EOS>)
```
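
The lookup-and-add pattern above can be wrapped in a function that also guards against two easy mistakes: a token ID outside the vocabulary and a sequence longer than the position table. This is a sketch with a hypothetical name, `embed_sequence`, and tiny made-up tables rather than the notebook's `E_token` and `E_pos`.

```python
def embed_sequence(token_ids, token_table, pos_table):
    """Return one combined vector per position: token_table[id] + pos_table[position]."""
    if len(token_ids) > len(pos_table):
        raise ValueError(
            f"sequence length {len(token_ids)} exceeds max_seq_len {len(pos_table)}"
        )
    combined = []
    for pos, token_id in enumerate(token_ids):
        if not 0 <= token_id < len(token_table):
            raise ValueError(f"token ID {token_id} is outside the vocabulary")
        # Element-wise sum of the token row and the position row
        combined.append([t + p for t, p in zip(token_table[token_id], pos_table[pos])])
    return combined

# Made-up 2-dimensional tables: 3 tokens, 2 positions
toy_token_table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
toy_pos_table = [[0.5, 0.5], [0.25, -0.25]]

print(embed_sequence([2, 1], toy_token_table, toy_pos_table))
# [[3.5, 4.5], [1.25, 1.75]]
```

With learned position embeddings, the table size is a hard limit: a sequence longer than `max_seq_len` has no row to look up, which is why the length check comes first.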

In [None]:
def add_vectors(v1, v2):
    """Element-wise addition of two vectors"""
    return [a + b for a, b in zip(v1, v2)]

# Look up token embeddings for our sequence
token_embeddings = [E_token[token_id] for token_id in tokens]

# Add position embeddings to get combined embeddings
X = [add_vectors(token_embeddings[i], E_pos[i]) for i in range(seq_len)]

print("Computing combined embeddings...")
print("X[i] = E_token[token_id[i]] + E_pos[i]")
print()

for i in range(seq_len):
    token_name = TOKEN_NAMES[tokens[i]]
    print(f"Position {i} ('{token_name}'):")
    print(f"  Token embedding:    {format_vector(token_embeddings[i])}")
    print(f"  Position embedding: {format_vector(E_pos[i])}")
    print(f"  Combined X[{i}]:       {format_vector(X[i])}")
    print()

### Final Result: Matrix X

The result is a matrix $X$ of shape `[seq_len, d_model]` = `[5, 16]`, where each row is the 16-dimensional embedding for one token in our sequence.

**This matrix X is the input to our transformer block.**

In [None]:
print("="*80)
print("FINAL COMBINED EMBEDDINGS MATRIX X")
print("="*80)
print(f"Shape: [{seq_len}, {D_MODEL}] (seq_len, d_model)")
print()
print("X =")
for i, row in enumerate(X):
    token_name = TOKEN_NAMES[tokens[i]]
    print(f"  {format_vector(row)}  # pos {i}: {token_name}")

## What's Next

These embeddings will flow through our transformer block, where the real magic happens:
1. Query, Key, Value projections (splitting into attention heads)
2. Self-attention scores (figuring out which tokens should attend to which)
3. Multi-head attention (combining information from multiple attention heads)
4. Feed-forward transformations (processing the attended representations)

Let's move on to the QKV projections.

In [None]:
# Bundle variables for use in subsequent notebooks.
# Note: this dict only lives in memory for the current session;
# in a real setting, you'd save it to disk or pass it along.
embedding_data = {
    'X': X,
    'E_token': E_token,
    'E_pos': E_pos,
    'tokens': tokens,
    'TOKEN_NAMES': TOKEN_NAMES,
    'VOCAB_SIZE': VOCAB_SIZE,
    'D_MODEL': D_MODEL,
    'MAX_SEQ_LEN': MAX_SEQ_LEN
}
print("Embedding data stored for next notebook.")
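
One way to actually hand these values to a later notebook is to serialize them to disk. The sketch below is an assumption, not part of the original project: it uses JSON from the standard library, a hypothetical file name `embedding_data.json`, and small stand-in values so that it runs on its own.

```python
import json
import os
import tempfile

# Stand-in data with the same structure as embedding_data (values are made up)
example_data = {
    "X": [[0.1, -0.2], [0.3, 0.4]],
    "tokens": [1, 3],
    "D_MODEL": 2,
}

# Hypothetical file name; written to the temp directory to avoid clutter
path = os.path.join(tempfile.gettempdir(), "embedding_data.json")

# Save the dictionary as JSON
with open(path, "w") as f:
    json.dump(example_data, f)

# Load it back, as a later notebook might
with open(path) as f:
    restored = json.load(f)

print(restored == example_data)  # True
```

JSON works here because the data is just nested lists of floats, ints, and strings; other formats would do equally well.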