# Building a Transformer from Scratch

**An educational journey through the architecture that powers modern AI**

## What is a Transformer?

A transformer is a neural network architecture introduced in the landmark paper *"Attention Is All You Need"* (Vaswani et al., 2017). It reshaped the field of artificial intelligence and is now the foundation of virtually all modern large language models, including GPT, BERT, Claude, and many others.

**What makes transformers special?** Earlier approaches to language modeling relied on recurrent neural networks (RNNs), which process text one token at a time, like reading a sentence from left to right. Transformers instead use a mechanism called **attention**, which lets every token look at every other token in the sequence at once *while still capturing their relationships*. Because the whole sequence is processed in parallel during training, transformers train much faster and capture long-range dependencies in text more effectively.
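To give a first feel for what "attention" computes, here is a minimal sketch of scaled dot-product attention on a toy tensor. It is an illustration only: the function name `scaled_dot_product_attention` and the toy shapes are our own choices, the input is used directly as queries, keys, and values (no learned projections), and there is no causal mask yet.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Return softmax(q @ k^T / sqrt(d_k)) @ v and the attention weights."""
    d_k = q.size(-1)
    # Similarity of every query with every key: (batch, seq, seq)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Each row becomes a probability distribution over positions
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the value vectors: (batch, seq, d_k)
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(1, 5, 8)  # toy input: 1 sequence, 5 tokens, 8 features each
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape)      # torch.Size([1, 5, 8])
print(weights.shape)  # torch.Size([1, 5, 5])
```

All five tokens are handled in a single matrix multiplication, which is where the parallelism comes from. Each row of `weights` sums to 1 and says how much one token draws on every other token.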

## Learning Path

In this section, we'll build a complete transformer in PyTorch:

1. **Token Embeddings** — Convert text to vectors and add position information
2. **Attention** — Learn how tokens attend to each other using Query, Key, Value
3. **Multi-Head Attention** — Run parallel attention heads to capture different relationships
4. **Feed-Forward Networks** — Process attended information through position-wise MLPs
5. **Transformer Block** — Combine attention, FFN, layer norm, and residual connections
6. **Complete Model** — Stack blocks and add embedding/output layers
7. **Training** — Use gradient accumulation and validation for stable training
8. **KV-Cache** — Optimize inference speed by caching key-value pairs
9. **Interpretability** — Analyze attention patterns and understand what the model learns

```python
# Check PyTorch is available
import torch
import torch.nn as nn

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("MPS (Apple Silicon) available")
```

```
PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
CUDA device: Radeon RX 7900 XTX
```


## Model Architecture Overview

A decoder-only transformer (like GPT) consists of:

```
Input tokens
    ↓
[Token Embedding + Positional Encoding]
    ↓
┌─────────────────────────────────┐
│      Transformer Block × N      │  ← Repeated N times
│  ┌───────────────────────────┐  │
│  │ Multi-Head Self-Attention │  │
│  │ + Residual + LayerNorm    │  │
│  └───────────────────────────┘  │
│  ┌───────────────────────────┐  │
│  │ Feed-Forward Network      │  │
│  │ + Residual + LayerNorm    │  │
│  └───────────────────────────┘  │
└─────────────────────────────────┘
    ↓
[Linear → Vocabulary]
    ↓
Output probabilities
```
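To make the diagram concrete, here is a rough sketch of the same data flow in PyTorch. It is not the implementation we build later: it leans on the built-in `nn.MultiheadAttention` instead of hand-written attention, applies LayerNorm after each residual addition (one reading of the diagram), and assumes learned positional embeddings and a GELU activation. The class names `SketchBlock` and `SketchDecoder` are made up for this illustration.

```python
import torch
import torch.nn as nn


class SketchBlock(nn.Module):
    """One block: self-attention and FFN, each followed by residual + LayerNorm."""

    def __init__(self, d_model, n_heads, d_ff, dropout):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),  # assumption: activation choice
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        seq_len = x.size(1)
        # True above the diagonal means "may not attend to future positions"
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.norm1(x + self.drop(attn_out))     # attention + residual + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))  # FFN + residual + LayerNorm
        return x


class SketchDecoder(nn.Module):
    """Embeddings, N stacked blocks, then a linear projection to the vocabulary."""

    def __init__(self, vocab_size=10000, d_model=256, n_heads=4, n_layers=4,
                 d_ff=1024, max_seq_len=512, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)  # assumption: learned positions
        self.blocks = nn.ModuleList(
            [SketchBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)]
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)  # logits; a softmax turns them into probabilities


torch.manual_seed(0)
model = SketchDecoder().eval()             # eval() disables dropout
tokens = torch.randint(0, 10000, (2, 16))  # 2 sequences of 16 token ids
logits = model(tokens)
probs = torch.softmax(logits, dim=-1)
print(logits.shape)  # torch.Size([2, 16, 10000])
```

The model returns one score per vocabulary entry for every position. The "Output probabilities" at the bottom of the diagram are the softmax of those scores.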

## Key Hyperparameters

| Parameter | Typical Value | Description |
|-----------|--------------|-------------|
| `d_model` | 512-4096 | Embedding dimension |
| `n_heads` | 8-32 | Number of attention heads |
| `n_layers` | 6-96 | Number of transformer blocks |
| `d_ff` | 4 × d_model | Feed-forward hidden dimension |
| `vocab_size` | 32K-100K | Size of token vocabulary |
| `max_seq_len` | 512-128K | Maximum sequence length |
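These values are not fully independent. In standard multi-head attention, `d_model` is split evenly across the heads, so it has to be divisible by `n_heads`. The small helper below (a hypothetical name, `describe_config`, written just for this illustration) derives the per-head dimension and applies the `4 × d_model` rule of thumb for `d_ff` when no value is given.

```python
def describe_config(d_model, n_heads, d_ff=None):
    """Derive dependent sizes from the main hyperparameters (illustrative helper)."""
    if d_model % n_heads != 0:
        raise ValueError(
            f"d_model ({d_model}) must be divisible by n_heads ({n_heads})"
        )
    return {
        "d_model": d_model,
        "n_heads": n_heads,
        "head_dim": d_model // n_heads,  # features handled by each head
        "d_ff": d_ff if d_ff is not None else 4 * d_model,
    }

summary = describe_config(d_model=512, n_heads=8)
print(summary)
# {'d_model': 512, 'n_heads': 8, 'head_dim': 64, 'd_ff': 2048}
```

A combination such as `d_model=512, n_heads=6` raises a `ValueError`, which is cheaper to catch here than as a shape error in the middle of a forward pass.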

```python
# Example configuration for a small educational model
config = {
    'd_model': 256,      # Embedding dimension
    'n_heads': 4,        # Number of attention heads
    'n_layers': 4,       # Number of transformer blocks
    'd_ff': 1024,        # Feed-forward dimension (4 × d_model)
    'vocab_size': 10000, # Vocabulary size
    'max_seq_len': 512,  # Maximum sequence length
    'dropout': 0.1,      # Dropout rate
}

print("Model Configuration:")
for k, v in config.items():
    print(f"  {k}: {v}")

# Rough parameter estimate: weight matrices only. It ignores biases, LayerNorm
# parameters, and positional embeddings, and does not count a separate output
# projection.
embed_params = config['vocab_size'] * config['d_model']  # Token embeddings
attn_params = config['n_layers'] * 4 * config['d_model'] ** 2  # Q, K, V, O projections
ffn_params = config['n_layers'] * 2 * config['d_model'] * config['d_ff']  # Up and down projections
total_params = embed_params + attn_params + ffn_params

print(f"\nEstimated parameters: {total_params:,} ({total_params/1e6:.1f}M)")
```

```
Model Configuration:
  d_model: 256
  n_heads: 4
  n_layers: 4
  d_ff: 1024
  vocab_size: 10000
  max_seq_len: 512
  dropout: 0.1

Estimated parameters: 5,705,728 (5.7M)
```
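The estimate above counts only the large weight matrices. As a sanity check, the sketch below adds the smaller terms it leaves out. It assumes bias terms on every linear layer, two LayerNorms per block, and learned positional embeddings. Those are assumptions about a typical GPT-style layout, not a description of the model built later in this series, and the helper name `count_parameters` is made up for this example.

```python
def count_parameters(cfg, tie_output=True):
    """Break a decoder-only transformer's parameter count into named parts."""
    d, d_ff = cfg["d_model"], cfg["d_ff"]
    per_layer = {
        "attn_weights": 4 * d * d,    # Q, K, V, O projection matrices
        "attn_biases": 4 * d,         # one bias vector per projection
        "ffn_weights": 2 * d * d_ff,  # up and down projections
        "ffn_biases": d_ff + d,
        "layer_norms": 2 * 2 * d,     # two LayerNorms, each with scale and shift
    }
    counts = {name: cfg["n_layers"] * n for name, n in per_layer.items()}
    counts["token_embeddings"] = cfg["vocab_size"] * d
    counts["position_embeddings"] = cfg["max_seq_len"] * d  # assumption: learned
    # A tied output layer reuses the token embedding matrix and adds nothing
    counts["output_head"] = 0 if tie_output else cfg["vocab_size"] * d + cfg["vocab_size"]
    return counts

cfg = {"d_model": 256, "n_layers": 4, "d_ff": 1024,
       "vocab_size": 10000, "max_seq_len": 512}
counts = count_parameters(cfg)
total = sum(counts.values())
for name, n in counts.items():
    print(f"  {name}: {n:,}")
print(f"Total (tied output): {total:,}")  # 5,850,112
```

Under these assumptions the extra terms add roughly 2.5% to the 5.7M estimate, so the weights-only shortcut is a reasonable first approximation. An untied output projection changes the picture much more, because it adds another `vocab_size × d_model` matrix.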


## Let's Build!

In the following notebooks, we'll implement each component step by step, with executable code you can run and modify.

Ready? Let's start with embeddings.