# Understanding Gradients

**A complete forward and backward pass through a transformer, calculated by hand**

## Introduction

Ever wonder how transformers **actually** work under the hood? I mean really work, at the level of matrices and gradients and actual numbers?

You can read about attention mechanisms and backpropagation in a textbook. You can use PyTorch and watch the loss go down. But there's something different about seeing **every single calculation** laid out in front of you: watching how a $5 \times 16$ matrix of token embeddings (one row per token) multiplies with a $16 \times 16$ query weight matrix, seeing exactly how the chain rule propagates gradients through layer normalization, understanding why AdamW needs bias correction terms.

**This project calculates a complete training step through a transformer, by hand.**

We're going to take the sentence "I like transformers" (3 tokens, plus BOS and EOS markers for 5 total) through a tiny GPT-style model:
- **Forward pass:** From raw text → embeddings → attention → feed-forward → loss (7 notebooks)
- **Backward pass:** Computing gradients for every single parameter via backpropagation (5 notebooks)
- **Optimization:** Applying AdamW updates with momentum and bias correction (1 notebook)
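
To make the "5 total" concrete, here is a minimal sketch of how the sentence could be turned into token IDs. The token spellings (`<PAD>`, `<BOS>`, `<EOS>`), the ID ordering, and the `encode` helper are assumptions made for illustration, and the notebooks may assign IDs differently.

```python
# Hypothetical vocabulary: the spellings and the ordering (and therefore
# every ID) are assumptions, not necessarily what the notebooks use.
VOCAB = ["<PAD>", "<BOS>", "<EOS>", "I", "like", "transformers"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def encode(sentence):
    """Split on whitespace and wrap the word IDs in BOS/EOS markers."""
    words = sentence.split()
    return (
        [TOKEN_TO_ID["<BOS>"]]
        + [TOKEN_TO_ID[word] for word in words]
        + [TOKEN_TO_ID["<EOS>"]]
    )


token_ids = encode("I like transformers")
print(token_ids)       # [1, 3, 4, 5, 2]
print(len(token_ids))  # 5
```

Three word tokens plus the two markers give the sequence length of 5 used throughout.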

Every matrix multiplication is shown step-by-step. Every gradient derivation is complete. Every dimension is tracked. Nothing is hidden behind library abstractions or handwaved as "trivial."

By the end, you'll have a deep, visceral understanding of transformer mathematics—the kind that only comes from doing the calculations yourself.

## What We'll Calculate

### Forward Pass (7 notebooks)
**Tokenization → Embeddings → QKV Projections → Attention Scores → Multi-Head Attention → Feed-Forward Network → Layer Normalization → Cross-Entropy Loss**

Watch the input flow through each layer with complete matrix operations. See how attention weights emerge from scaled dot products and how GELU activation transforms hidden states.
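
As a rough preview of those two operations, the sketch below computes causal attention weights and GELU in plain Python. The toy `Q` and `K` values are made up, the causal mask is assumed because the model is GPT-style, and the exact (erf-based) GELU is an assumption, since the notebooks may use the tanh approximation instead.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(Q, K):
    """softmax(Q K^T / sqrt(d_k)) with a causal mask, one row per query."""
    d_k = len(Q[0])
    weights = []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                # Future positions are masked out before the softmax.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights.append(softmax(scores))
    return weights


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Made-up 3-token example with d_k = 2 (not the project's actual values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(Q, K)
for row in weights:
    print([round(w, 3) for w in row])
print(round(gelu(1.0), 4))
```

Each row of `weights` sums to 1, and the first token can only attend to itself, so its row is `[1.0, 0.0, 0.0]`.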

### Backward Pass (5 notebooks)
**Loss Gradients → Output Layer → FFN & LayerNorm → Attention → Embeddings**

Trace gradients backward through the network using the chain rule. Derive Jacobian matrices for softmax and layer normalization. Compute gradients for every weight, bias, and embedding.
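
One way to sanity-check a hand-derived Jacobian is to compare it with finite differences. The sketch below does this for softmax, whose Jacobian is $J_{ij} = p_i(\delta_{ij} - p_j)$. The logits are made up and the helper names are hypothetical, not taken from the notebooks.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    peak = max(z)
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    n = len(p)
    return [
        [p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
        for i in range(n)
    ]


def numeric_jacobian(f, z, eps=1e-6):
    """Central differences: J[i][j] ~ (f(z + eps*e_j)[i] - f(z - eps*e_j)[i]) / (2*eps)."""
    n = len(z)
    m = len(f(z))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(z)
        plus[j] += eps
        minus = list(z)
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


z = [1.0, 2.0, 0.5]  # made-up logits
analytic = softmax_jacobian(z)
numeric = numeric_jacobian(softmax, z)
max_error = max(
    abs(a - b)
    for row_a, row_b in zip(analytic, numeric)
    for a, b in zip(row_a, row_b)
)
print(max_error)  # should be tiny if the analytic formula is right
```

The same finite-difference check works for any hand-derived gradient, including the layer normalization Jacobian.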

### Optimization (1 notebook)
**AdamW Weight Updates with Momentum & Bias Correction**

Apply the complete AdamW optimizer algorithm with first and second moment estimates, bias correction terms, and weight decay. See how each parameter moves toward better values.
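
For a single scalar parameter, one AdamW step can be sketched as follows. The hyperparameter defaults are common choices, not necessarily the ones used in the notebook, and `adamw_step` is a hypothetical helper written for illustration.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter at step t (t starts at 1)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: m and v start at zero, so early estimates are too small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay, applied directly to the parameter.
    param = param - lr * weight_decay * param
    # Adam step using the bias-corrected moments.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Made-up starting values for the first step.
new_param, m, v = adamw_step(param=0.5, grad=0.2, m=0.0, v=0.0, t=1)
print(new_param, m, v)
```

At `t=1` the raw first moment is `m = 0.02`, ten times smaller than the gradient of `0.2`. Dividing by `1 - beta1 ** t = 0.1` restores it to `0.2`, which is the job of the bias correction terms.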

## Architecture

We're using a **GPT-style decoder-only transformer**—the same architecture family as ChatGPT, Claude, and Llama, just scaled down to be humanly tractable:

| Component | Value | Why This Size? |
|-----------|-------|----------------|
| **d_model** | 16 | Small enough to write out full matrices, large enough to be realistic |
| **num_heads** | 2 | Multiple heads to show how multi-head attention combines (d_k = d_v = 8) |
| **d_ff** | 64 | Standard 4× expansion in feed-forward layer |
| **vocab_size** | 6 | Our tiny vocabulary: PAD, BOS, EOS, I, like, transformers |
| **num_layers** | 1 | One complete transformer block (you can extrapolate to N layers) |
| **max_len** | 5 | Length of our sequence with BOS and EOS tokens |

**Total parameters: roughly 3,500** (versus 175 billion for GPT-3)

The math is identical whether you have 16 dimensions or 4096. We're just keeping things small enough that you can actually see what's happening in every matrix multiplication, understand every gradient, and verify every calculation.
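
A back-of-the-envelope parameter count can be sketched from the table. The layout below is an assumption: learned positional embeddings, bias-free Q/K/V/output projections, an FFN with biases, two layer norms, and an untied, bias-free output projection. Under a different layout (tied embeddings, extra biases, a single layer norm) the total shifts, so treat the result as a ballpark rather than the project's official count.

```python
# Sizes taken from the architecture table.
d_model, d_ff, vocab_size, max_len = 16, 64, 6, 5

# Every layout choice below is an assumption about the model.
counts = {
    "token_embeddings": vocab_size * d_model,       # 6 x 16
    "position_embeddings": max_len * d_model,       # 5 x 16, assumed learned
    "attention_qkvo": 4 * d_model * d_model,        # W_Q, W_K, W_V, W_O, no biases
    "ffn": d_model * d_ff + d_ff + d_ff * d_model + d_model,  # two layers with biases
    "layer_norms": 2 * 2 * d_model,                 # two norms, scale and shift each
    "output_projection": d_model * vocab_size,      # 16 x 6, untied, no bias
}

total = sum(counts.values())
for name, count in counts.items():
    print(f"{name:>20}: {count}")
print(f"{'total':>20}: {total}")
```

The attention and feed-forward weights dominate the total, and the same is true of full-size models.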

## Who Is This For?

This project is ideal if you:

- **Want to deeply understand transformers** beyond high-level explanations
- **Are learning ML/DL** and want to see the mathematics, not just the PyTorch code
- **Already use transformers** but feel shaky on the mathematical foundations
- **Are implementing novel architectures** and need to understand the gradient flows
- **Learn best by example**—seeing actual numbers instead of abstract notation
- **Have taken calculus** (chain rule, partial derivatives) but want to see it applied

You *don't* need a PhD. You *do* need patience and a willingness to follow matrix calculations step by step.

## Why Calculate By Hand?

**Understanding vs. Using**: You can drive a car without knowing how the engine works. But if you want to *design* cars, you need to understand combustion, torque, and thermodynamics. Same with transformers.

**Debugging intuition**: When your transformer isn't training properly, understanding what's happening in each gradient helps you diagnose whether it's vanishing gradients, attention collapse, or something else.

**No magic**: Libraries like PyTorch make training easy but hide the details. This project reveals what `loss.backward()` actually does—all 5 notebooks of chain rule applications.

**Deep learning**: The best way to learn is to do. We're doing every calculation, so you'll learn it deeply.

## Let's Go

Every notebook builds on the previous one, so following in order is recommended for your first read-through. All calculations are executable—you can run each cell and verify the results yourself.

Ready? Let's start with tokenization and embeddings.