# 1. Token Embeddings & Positional Encoding

**Converting text to vectors and adding position information**

## Token Embeddings

**What are tokens?** Before we can process text with a neural network, we need to break it into pieces called tokens. A token might be a word ("hello"), a subword ("ing"), or even a single character.
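
As a rough illustration of the idea, the sketch below splits a sentence into word-level and character-level tokens and assigns integer IDs. The vocabulary here is a toy built on the spot and is purely hypothetical; real tokenizers learn subword vocabularies from large corpora.

```python
# Toy illustration only: real tokenizers learn subword vocabularies (e.g. BPE).
text = "the cat ate the mouse"

# Word-level tokens: split on whitespace
word_tokens = text.split()

# Character-level tokens: every character (including spaces) is a token
char_tokens = list(text)

# Hypothetical vocabulary: each distinct word gets an integer ID
vocab = {word: idx for idx, word in enumerate(sorted(set(word_tokens)))}
token_ids = [vocab[word] for word in word_tokens]

print(word_tokens)
print(vocab)
print(token_ids)
```

Note that both occurrences of "the" map to the same ID: the ID identifies the token, not its position in the sentence.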

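**Why do we need embeddings?** Token IDs are arbitrary integers: nothing about the number 142 says whether that token is related to token 143 or to token 900. To let the network work with meaning, we map each ID to a vector, and training shapes those vectors so that they can capture semantic relationships between tokens.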

**What is an embedding?** An embedding is a learned vector representation for each token. Instead of representing "cat" as ID 142, we represent it as a dense vector like [0.2, -0.5, 0.8, ...] with `d_model` dimensions.

In [None]:
import torch
import torch.nn as nn
import math

class TokenEmbedding(nn.Module):
    """Convert token indices to dense vectors."""
    
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model
    
    def forward(self, x):
        # x: (batch, seq_len) - token indices
        # returns: (batch, seq_len, d_model) - embeddings
        return self.embedding(x)

# Example
vocab_size = 1000
d_model = 64
embed = TokenEmbedding(vocab_size, d_model)

# Fake tokens
tokens = torch.tensor([[5, 142, 89, 256]])  # (batch=1, seq_len=4)
embeddings = embed(tokens)

print(f"Input tokens: {tokens.shape}")
print(f"Output embeddings: {embeddings.shape}")
print(f"\nFirst token embedding (first 8 dims):")
print(f"  {embeddings[0, 0, :8]}")
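
To make the lookup in `TokenEmbedding` more concrete, here is a small sketch (with made-up sizes) suggesting that `nn.Embedding` behaves like a table of shape `(vocab_size, d_model)`: calling it, indexing its weight matrix, and multiplying one-hot vectors by the weight matrix should all give the same result.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 10, 4  # tiny sizes chosen for illustration
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([3, 7, 3])

# Path 1: the usual lookup
looked_up = embedding(token_ids)

# Path 2: index rows of the weight matrix directly
indexed = embedding.weight[token_ids]

# Path 3: one-hot vectors multiplied by the weight matrix
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
projected = one_hot @ embedding.weight

print(torch.allclose(looked_up, indexed))
print(torch.allclose(looked_up, projected))
```

The lookup form is what gets used in practice because it avoids materializing the one-hot vectors.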

## Why We Need Positional Information

Consider "The cat ate the mouse" and "The mouse ate the cat": the same words, but opposite meanings. Order matters.

Self-attention processes all tokens in parallel and, on its own, treats the input as an unordered set: shuffle the tokens and the outputs are shuffled in the same way, with no other change. The model therefore needs explicit information about where each token appears.
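
A minimal sketch of this order-blindness, assuming a single attention head with identity projections (a simplification for brevity; learned per-token projections would not change the conclusion). The `self_attention` helper is hypothetical and exists only for this demonstration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def self_attention(x):
    """Single-head self-attention with identity projections and no positional info."""
    d_k = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ x

x = torch.randn(5, 8)                 # (seq_len=5, d_model=8)
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of the tokens

out = self_attention(x)
out_perm = self_attention(x[perm])

# Reordering the input only reorders the output rows in the same way
print(torch.allclose(out[perm], out_perm, atol=1e-5))
```

Each token receives the same output vector regardless of where it sits in the sequence, which is why position has to be injected separately.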

Three common approaches are covered below:

## Approach 1: ALiBi (Attention with Linear Biases)

ALiBi is the simplest of the three and works well in practice. Instead of modifying the embeddings, it adds a distance-based penalty directly to the attention scores before the softmax:

```
attention_score[i,j] = (q_i · k_j) / √d_k - slope × |i - j|
```

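The further apart two positions are, the larger the penalty and the lower the attention weight after the softmax. Each head uses a different slope, so some heads focus on nearby tokens while others can still attend across long distances.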

In [None]:
class ALiBiPositionalBias(nn.Module):
    """ALiBi: Attention with Linear Biases."""
    
    def __init__(self, num_heads):
        super().__init__()
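        # Geometric sequence of slopes, one per head: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
        # (this closed form matches the ALiBi paper when num_heads is a power of two)
        # Head 0 = strongest penalty (local), last head = weakest (global)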
        slopes = torch.tensor([
            2 ** (-8 * (i + 1) / num_heads) 
            for i in range(num_heads)
        ])
        self.register_buffer('slopes', slopes)
    
    def forward(self, seq_len):
        # Create position indices on the same device as the slopes buffer
        positions = torch.arange(seq_len, device=self.slopes.device)
        
        # Compute pairwise distances: |i - j|
        distances = torch.abs(positions.unsqueeze(0) - positions.unsqueeze(1))
        
        # Apply slope to get biases: -slope × distance
        # Shape: (num_heads, seq_len, seq_len)
        biases = -self.slopes.view(-1, 1, 1) * distances.unsqueeze(0).float()
        
        return biases

# Example
alibi = ALiBiPositionalBias(num_heads=4)
biases = alibi(seq_len=8)

print(f"ALiBi biases shape: {biases.shape}")
print(f"\nHead 0 biases (strong local focus):")
print(biases[0, :4, :4].numpy().round(2))
print(f"\nHead 3 biases (weak penalty, global focus):")
print(biases[3, :4, :4].numpy().round(2))
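
The biases only matter once they are added to the attention scores. The following sketch, using random queries and keys and the same slope schedule as `ALiBiPositionalBias`, shows one plausible way to wire them in before the softmax. The shapes and variable names are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_heads, seq_len, d_k = 4, 6, 16  # illustrative sizes

# Same slope schedule as the class above: 2^(-8 * (h + 1) / num_heads)
slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])

positions = torch.arange(seq_len)
distances = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs().float()
biases = -slopes.view(-1, 1, 1) * distances         # (num_heads, seq_len, seq_len)

q = torch.randn(1, num_heads, seq_len, d_k)
k = torch.randn(1, num_heads, seq_len, d_k)

scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (1, num_heads, seq_len, seq_len)
scores = scores + biases.unsqueeze(0)                # bias is added before the softmax
weights = F.softmax(scores, dim=-1)

print(weights.shape)
print(weights.sum(dim=-1)[0, 0])  # each row of attention weights sums to 1
```

In a causal decoder a mask would also hide future positions, so only the penalties for `j <= i` would take effect.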

## Approach 2: Learned Positional Embeddings

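Used in GPT-2 and BERT. Each position index gets its own learnable vector, which is added to the token embedding at that position. Because the table has a fixed number of rows (`max_seq_len`), the model cannot represent positions beyond that length.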

In [None]:
class LearnedPositionalEncoding(nn.Module):
    """Learned positional embeddings (GPT-2 style)."""
    
    def __init__(self, d_model, max_seq_len=5000):
        super().__init__()
        self.pos_embedding = nn.Embedding(max_seq_len, d_model)
    
    def forward(self, x):
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        
        # Create position indices: [0, 1, 2, ..., seq_len-1]
        positions = torch.arange(seq_len, device=x.device)
        
        # Look up position embeddings: (seq_len, d_model),
        # then ADD to the input (broadcast over the batch dimension)
        pos_emb = self.pos_embedding(positions)
        return x + pos_emb

# Example
pos_enc = LearnedPositionalEncoding(d_model=64)
x = torch.randn(2, 10, 64)  # (batch=2, seq_len=10, d_model=64)
x_with_pos = pos_enc(x)

print(f"Input: {x.shape}")
print(f"With position encoding: {x_with_pos.shape}")
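
One practical consequence of a learned table is a hard length limit. The sketch below, with a deliberately small `max_seq_len`, illustrates what happens when a position outside the table is requested. The exact exception type is treated as an assumption here, so both likely types are caught.

```python
import torch
import torch.nn as nn

max_seq_len, d_model = 16, 8  # tiny table for illustration
pos_embedding = nn.Embedding(max_seq_len, d_model)

# Positions inside the table work as expected
ok = pos_embedding(torch.arange(max_seq_len))
print(ok.shape)

# Position 16 has no row in a 16-entry table (valid indices are 0..15)
try:
    pos_embedding(torch.tensor([max_seq_len]))
    lookup_failed = False
except (IndexError, RuntimeError) as err:
    lookup_failed = True
    print(f"Lookup failed: {type(err).__name__}")
```

This is why models with learned positions are usually run with inputs truncated to the trained context length.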

## Approach 3: RoPE (Rotary Position Embeddings)

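Used in LLaMA and Mistral. Instead of adding position information to the embeddings, RoPE *rotates* each pair of dimensions in the query and key vectors by an angle proportional to the token's position. The dot product between a rotated query and a rotated key then depends only on their relative offset.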

In [None]:
class RotaryPositionalEmbedding(nn.Module):
    """RoPE: Rotary Position Embeddings."""
    
    def __init__(self, d_model, max_seq_len=5000, base=10000):
        super().__init__()
        # Compute inverse frequencies
        inv_freq = 1.0 / (base ** (torch.arange(0, d_model, 2).float() / d_model))
        self.register_buffer('inv_freq', inv_freq)
        
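        # Precompute cos/sin for all positions: (max_seq_len, d_model // 2)
        # Positions are cast to float so both torch.outer inputs share a dtype
        positions = torch.arange(max_seq_len).float()
        freqs = torch.outer(positions, inv_freq)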
        self.register_buffer('cos_cached', freqs.cos())
        self.register_buffer('sin_cached', freqs.sin())
    
    def forward(self, x, seq_len):
        # x is not used here; only seq_len selects rows of the cache
        # returns cos, sin: each (seq_len, d_model // 2)
        cos = self.cos_cached[:seq_len]
        sin = self.sin_cached[:seq_len]
        return cos, sin

def apply_rotary_emb(x, cos, sin):
    """Apply rotation to x using precomputed cos/sin."""
    # x: (..., seq_len, d); cos, sin: (seq_len, d // 2)
    # Split into pairs (even index, odd index) and rotate each pair
    x1, x2 = x[..., ::2], x[..., 1::2]
    x_rotated = torch.stack([
        x1 * cos - x2 * sin,
        x1 * sin + x2 * cos
    ], dim=-1).flatten(-2)
    return x_rotated

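# Example
rope = RotaryPositionalEmbedding(d_model=64)
q = torch.randn(2, 10, 64)  # (batch=2, seq_len=10, d_model=64)
cos, sin = rope(q, seq_len=q.size(1))
q_rotated = apply_rotary_emb(q, cos, sin)

print(f"Cos cache shape: {cos.shape}")
print(f"Sin cache shape: {sin.shape}")
print(f"Rotated queries: {q_rotated.shape}")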
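
The property that makes RoPE useful is that attention scores depend on the distance between positions rather than on the positions themselves. The sketch below checks this numerically for one query/key pair. The `rotate` helper is a hypothetical stand-in written for this example (it uses float64 to keep rounding noise out of the comparison) and is not the cached implementation above.

```python
import torch

torch.manual_seed(0)
d, base = 8, 10000
inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float64) / d))

def rotate(x, pos):
    """Rotate consecutive pairs of x by the angles pos * inv_freq."""
    angles = pos * inv_freq                   # (d // 2,)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

q = torch.randn(d, dtype=torch.float64)
k = torch.randn(d, dtype=torch.float64)

# Same offset (3) at two different absolute positions
score_a = rotate(q, 5.0) @ rotate(k, 2.0)
score_b = rotate(q, 105.0) @ rotate(k, 102.0)

print(torch.allclose(score_a, score_b))
```

Because each pair is only rotated, the vector norms are unchanged; only the angle between query and key carries the position signal.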

## Comparison

| Method | Learned Parameters | Position Type | Length Extrapolation | Used In |
|--------|--------------------|---------------|----------------------|---------|
| **ALiBi** | 0 | Relative | Strong | BLOOM, MPT |
| **RoPE** | 0 | Relative | Moderate; usually extended with frequency scaling or interpolation | LLaMA, Mistral |
| **Learned** | `max_seq_len × d_model` | Absolute | Limited (capped at `max_seq_len`) | GPT-2, BERT |

For modern models, ALiBi and RoPE are generally preferred because they:
- Add no learned parameters
- Handle sequences longer than those seen in training better than a fixed learned table
- Encode relative position (the distance between tokens matters, not their absolute index)
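
As a rough check on the first point, the sketch below counts trainable parameters for a learned position table and for ALiBi-style slopes stored as a buffer. `SlopeHolder` is a hypothetical helper defined only for this comparison, and the sizes mirror the defaults used earlier in this section.

```python
import torch
import torch.nn as nn

d_model, max_seq_len, num_heads = 64, 5000, 4

# Learned positions: one trainable row per position
learned = nn.Embedding(max_seq_len, d_model)
learned_params = sum(p.numel() for p in learned.parameters())

# ALiBi-style slopes stored as a buffer: saved with the model, but not trained
class SlopeHolder(nn.Module):  # hypothetical helper for this illustration
    def __init__(self, num_heads):
        super().__init__()
        slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
        self.register_buffer("slopes", slopes)

fixed_params = sum(p.numel() for p in SlopeHolder(num_heads).parameters())

print(learned_params)  # 5000 * 64 = 320000
print(fixed_params)    # 0
```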

## Next: Attention

Now that we can convert tokens to embeddings and encode position, we're ready to implement the core innovation of transformers: the attention mechanism.