# Replicating GPT-2 (124M) From Scratch

This notebook implements GPT-2 (124M parameters) from scratch in PyTorch, following Andrej Karpathy's approach.

**Reference**: [Let's reproduce GPT-2 (124M)](https://www.youtube.com/watch?v=l8pRSuU81PU)

## Overview

We'll work through the following steps:
1. Environment setup and dependencies
2. Model configuration
3. Multi-head self-attention
4. MLP feedforward blocks
5. Transformer blocks
6. Complete GPT-2 model
7. Weight loading from Hugging Face
8. Text generation with sampling
9. Training from scratch
10. Performance optimizations

Let's get started!

## 1. Environment Setup and Dependencies

First, let's install and import all required libraries.

In [3]:
# Install required packages (comment out if already installed)
!pip install torch transformers tiktoken numpy

Collecting transformers
  Downloading transformers-4.57.1-py3-none-any.whl.metadata (43 kB)
Collecting tiktoken
  Downloading tiktoken-0.12.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.7 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.35.3-py3-none-any.whl.metadata (14 kB)
Collecting regex!=2019.12.17 (from transformers)
  Downloading regex-2025.10.22-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (40 kB)

In [4]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import numpy as np
import tiktoken
from transformers import GPT2LMHeadModel
import math
from dataclasses import dataclass
import time
import os

# Set device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Set random seed for reproducibility
torch.manual_seed(1337)
if torch.cuda.is_available():
    torch.cuda.manual_seed(1337)


Using device: cpu


## 2. Model Configuration Class

Define the GPT-2 124M configuration with all hyperparameters.

In [5]:
@dataclass
class GPTConfig:
    """Configuration class for GPT-2 124M model"""
    block_size: int = 1024  # Maximum sequence length
    vocab_size: int = 50257  # GPT-2 vocabulary size (50,000 BPE merges + 256 byte tokens + 1 <|endoftext|> token)
    n_layer: int = 12  # Number of transformer blocks
    n_head: int = 12  # Number of attention heads
    n_embd: int = 768  # Embedding dimension
    dropout: float = 0.0  # Dropout probability (0.1 for training, 0.0 for inference)
    bias: bool = True  # Use bias in linear layers and LayerNorm

config = GPTConfig()
print(f"GPT-2 124M Configuration:")
print(f"  Layers: {config.n_layer}")
print(f"  Hidden size: {config.n_embd}")
print(f"  Attention heads: {config.n_head}")
print(f"  Vocab size: {config.vocab_size}")
print(f"  Max sequence length: {config.block_size}")

GPT-2 124M Configuration:
  Layers: 12
  Hidden size: 768
  Attention heads: 12
  Vocab size: 50257
  Max sequence length: 1024


## 3. Tokenizer Setup with Tiktoken

Initialize the GPT-2 tokenizer using tiktoken.

In [6]:
# Initialize GPT-2 tokenizer
enc = tiktoken.get_encoding("gpt2")

def encode(text):
    """Encode text to token indices"""
    return enc.encode(text, allowed_special={"<|endoftext|>"})

def decode(tokens):
    """Decode token indices to text"""
    return enc.decode(tokens)

# Test the tokenizer
test_text = "Hello, I'm a language model"
tokens = encode(test_text)
decoded = decode(tokens)
print(f"Original: {test_text}")
print(f"Tokens: {tokens}")
print(f"Decoded: {decoded}")
print(f"Number of tokens: {len(tokens)}")

Original: Hello, I'm a language model
Tokens: [15496, 11, 314, 1101, 257, 3303, 2746]
Decoded: Hello, I'm a language model
Number of tokens: 7


## 4. Multi-Head Self-Attention Implementation

Implement the core attention mechanism with causal masking.

In [7]:
class CausalSelfAttention(nn.Module):
    """
    Multi-head causal self-attention module.
    Implements the attention mechanism from "Attention is All You Need"
    with causal masking for autoregressive generation.
    """
    
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        
        # Key, Query, Value projections for all heads (in batch)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        
        # Regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        
        # Causal mask: each position may attend only to itself and earlier positions
        # Registered as a buffer (not a parameter); the name 'bias' follows the Hugging Face GPT-2 naming
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                            .view(1, 1, config.block_size, config.block_size))
    
    def forward(self, x):
        B, T, C = x.size()  # Batch size, sequence length, embedding dimensionality (n_embd)
        
        # Calculate Q, K, V for all heads in batch
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        
        # Reshape for multi-head attention
        # (B, T, C) -> (B, T, n_head, head_size) -> (B, n_head, T, head_size)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        
        # Causal self-attention: (B, n_head, T, head_size) @ (B, n_head, head_size, T) -> (B, n_head, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        
        # Apply causal mask
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        
        # Apply attention to values: (B, n_head, T, T) @ (B, n_head, T, head_size) -> (B, n_head, T, head_size)
        y = att @ v
        
        # Reassemble all head outputs side by side
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        
        # Output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

# Test the attention module
test_attn = CausalSelfAttention(config).to(device)
test_input = torch.randn(2, 10, config.n_embd).to(device)  # Batch=2, Seq=10
test_output = test_attn(test_input)
print(f"Attention input shape: {test_input.shape}")
print(f"Attention output shape: {test_output.shape}")
print(f"Number of parameters: {sum(p.numel() for p in test_attn.parameters()):,}")

Attention input shape: torch.Size([2, 10, 768])
Attention output shape: torch.Size([2, 10, 768])
Number of parameters: 2,362,368
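
To make the masking step concrete, here is a minimal pure-Python sketch of a causally masked softmax on a tiny 3x3 score matrix. The function name `causal_softmax` and the example scores are made up for illustration; the module above does the same thing with `masked_fill` and `F.softmax` on batched tensors.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with future positions masked."""
    T = len(scores)
    out = []
    for i in range(T):
        # Position i may attend to positions 0..i only; later ones get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        m = max(masked)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Hypothetical attention scores for a sequence of length 3
scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -1.0, 0.0]]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but every entry above the diagonal is exactly zero, so a token never receives information from tokens that come after it.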


## 5. MLP Block Implementation

Create the feedforward network with GELU activation.

In [8]:
class MLP(nn.Module):
    """
    Multi-Layer Perceptron (feedforward network).
    Standard two-layer MLP with GELU activation.
    """
    
    def __init__(self, config):
        super().__init__()
        # First linear layer expands dimension by 4x
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        
        # GELU activation (Gaussian Error Linear Unit)
        # GPT-2 uses the approximate version
        self.gelu = nn.GELU(approximate='tanh')
        
        # Second linear layer projects back to embedding dimension
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(config.dropout)
    
    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

# Test the MLP module
test_mlp = MLP(config).to(device)
test_input = torch.randn(2, 10, config.n_embd).to(device)
test_output = test_mlp(test_input)
print(f"MLP input shape: {test_input.shape}")
print(f"MLP output shape: {test_output.shape}")
print(f"Number of parameters: {sum(p.numel() for p in test_mlp.parameters()):,}")

MLP input shape: torch.Size([2, 10, 768])
MLP output shape: torch.Size([2, 10, 768])
Number of parameters: 4,722,432
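
The `approximate='tanh'` option selects the tanh-based approximation of GELU rather than the exact erf-based form. As a rough check of how close the two are, the sketch below compares them in plain Python; the helper names `gelu_exact` and `gelu_tanh` are illustrative and not part of PyTorch.

```python
import math

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Compare the two on a grid over [-4, 4]
xs = [i / 10 for i in range(-40, 41)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-4, 4]: {max_gap:.2e}")
```

The gap is small, but using the tanh form matters when loading pretrained weights, since those weights were trained with that activation.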


## 6. Transformer Block Implementation

Combine attention and MLP with pre-LayerNorm and residual connections.

In [9]:
class Block(nn.Module):
    """
    Transformer block with pre-LayerNorm architecture.
    Contains self-attention and MLP with residual connections.
    """
    
    def __init__(self, config):
        super().__init__()
        # LayerNorm before attention
        self.ln_1 = nn.LayerNorm(config.n_embd, bias=config.bias)
        
        # Multi-head self-attention
        self.attn = CausalSelfAttention(config)
        
        # LayerNorm before MLP
        self.ln_2 = nn.LayerNorm(config.n_embd, bias=config.bias)
        
        # MLP feedforward
        self.mlp = MLP(config)
    
    def forward(self, x):
        # Pre-LayerNorm architecture with residual connections
        x = x + self.attn(self.ln_1(x))  # Attention block with residual
        x = x + self.mlp(self.ln_2(x))   # MLP block with residual
        return x

# Test the transformer block
test_block = Block(config).to(device)
test_input = torch.randn(2, 10, config.n_embd).to(device)
test_output = test_block(test_input)
print(f"Block input shape: {test_input.shape}")
print(f"Block output shape: {test_output.shape}")
print(f"Number of parameters: {sum(p.numel() for p in test_block.parameters()):,}")

Block input shape: torch.Size([2, 10, 768])
Block output shape: torch.Size([2, 10, 768])
Number of parameters: 7,087,872
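
A small sketch may help show what "pre-LayerNorm" means in practice. It uses plain Python lists and a simplified `layer_norm` without the learnable gain and bias; the helper names and the zero-output sublayer are assumptions made for illustration only.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """Pre-LN ordering used by the Block above: x + sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_residual(x, sublayer):
    """Post-LN ordering, shown for contrast: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def zero_sublayer(v):
    """Stand-in for attention or the MLP that contributes nothing."""
    return [0.0] * len(v)

x = [1.0, 2.0, 4.0]
pre_out = pre_ln_residual(x, zero_sublayer)    # the input passes through unchanged
post_out = post_ln_residual(x, zero_sublayer)  # the input is renormalized
print(pre_out)
print([round(v, 3) for v in post_out])
```

With pre-LN the residual path carries `x` through untouched, so the stack of blocks has a clean identity route from the embeddings to `ln_f`.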


## 7. Complete GPT-2 Model Class

Assemble all components into the full GPT-2 architecture.

In [10]:
class GPT(nn.Module):
    """
    GPT-2 Language Model.
    Full implementation matching Hugging Face's architecture.
    """
    
    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config
        
        # Model architecture
        self.transformer = nn.ModuleDict(dict(
            # Token embeddings
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            
            # Positional embeddings
            wpe = nn.Embedding(config.block_size, config.n_embd),
            
            # Dropout on embeddings
            drop = nn.Dropout(config.dropout),
            
            # Stack of transformer blocks
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            
            # Final layer normalization
            ln_f = nn.LayerNorm(config.n_embd, bias=config.bias),
        ))
        
        # Language model head (no bias, tied with token embeddings in original GPT-2)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        
        # Weight tying: share weights between token embeddings and output layer
        # This reduces parameters and can improve performance
        self.transformer.wte.weight = self.lm_head.weight
        
        # Initialize weights
        self.apply(self._init_weights)
        
        # Apply special scaled init to residual projections (GPT-2 paper)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
        
        # Report number of parameters
        print(f"Number of parameters: {self.get_num_params()/1e6:.2f}M")
    
    def get_num_params(self, non_embedding=True):
        """
        Return the number of parameters in the model.
        For the non-embedding count, only the position embeddings are subtracted.
        The token embeddings are tied to lm_head, so they stay in the count.
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.transformer.wpe.weight.numel()
        return n_params
    
    def _init_weights(self, module):
        """Initialize weights following GPT-2 initialization scheme"""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        """
        Forward pass of the model.
        
        Args:
            idx: Token indices of shape (B, T)
            targets: Optional target indices for computing loss (B, T)
        
        Returns:
            logits: Output logits of shape (B, T, vocab_size)
            loss: Cross-entropy loss if targets provided, else None
        """
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        
        # Generate position indices [0, 1, 2, ..., t-1]
        pos = torch.arange(0, t, dtype=torch.long, device=device)  # shape (t)
        
        # Token embeddings + positional embeddings
        tok_emb = self.transformer.wte(idx)  # (B, T, n_embd)
        pos_emb = self.transformer.wpe(pos)  # (T, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        
        # Pass through all transformer blocks
        for block in self.transformer.h:
            x = block(x)
        
        # Final layer norm
        x = self.transformer.ln_f(x)
        
        # Language model head to get logits
        logits = self.lm_head(x)  # (B, T, vocab_size)
        
        # Calculate loss if targets provided
        loss = None
        if targets is not None:
            # Flatten logits and targets for cross-entropy
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        
        return logits, loss

# Create the model
model = GPT(config)
model.to(device)
print(f"\nModel created successfully!")
print(f"Total parameters: {model.get_num_params(non_embedding=False):,}")
print(f"Non-embedding parameters: {model.get_num_params(non_embedding=True):,}")

Number of parameters: 123.65M

Model created successfully!
Total parameters: 124,439,808
Non-embedding parameters: 123,653,376
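
The parameter counts printed in this notebook can be checked with plain arithmetic from the configuration values. The sketch below is standalone and uses only the numbers in `GPTConfig`; the variable names are chosen for illustration.

```python
n_layer, n_embd, vocab_size, block_size = 12, 768, 50257, 1024

# Attention: fused QKV projection plus output projection (weights + biases)
attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
# MLP: 4x expansion and projection back (weights + biases)
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
# LayerNorm: one gain and one bias vector
layer_norm = 2 * n_embd
block = attn + mlp + 2 * layer_norm

wte = vocab_size * n_embd  # token embeddings, shared with lm_head (counted once)
wpe = block_size * n_embd  # position embeddings
total = wte + wpe + n_layer * block + layer_norm  # final term is ln_f

print(f"attention: {attn:,}")
print(f"mlp:       {mlp:,}")
print(f"block:     {block:,}")
print(f"total:     {total:,}")
print(f"total without position embeddings: {total - wpe:,}")
```

The last figure, 123,653,376, is what `get_num_params()` returns with its default `non_embedding=True`; including the position embeddings gives 124,439,808, which is where the "124M" name comes from.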


## 8. Weight Loading from Pretrained Model

Load pretrained GPT-2 124M weights from Hugging Face into our custom model.

In [11]:
def load_pretrained_weights(model, model_type='gpt2'):
    """
    Load pretrained weights from Hugging Face GPT-2 into our custom model.
    Handles weight transpositions and name mappings.
    """
    print(f"Loading pretrained weights from '{model_type}'...")
    
    # Load Hugging Face model
    model_hf = GPT2LMHeadModel.from_pretrained(model_type)
    sd_hf = model_hf.state_dict()
    
    # Get our model's state dict
    sd = model.state_dict()
    
    # Copy weights, handling transpositions
    sd_keys = sd.keys()
    sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # Ignore attention bias buffer
    
    # Map Hugging Face keys to our keys
    # HF uses 'transformer.h.0.attn.c_attn.weight' format
    # We use the same, so most keys match directly
    
    transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
    
    for k in sd_keys:
        # Check if this key needs transposition
        # Conv1D layers in HF are stored transposed
        if any(k.endswith(w) for w in transposed):
            # Need to transpose
            assert sd_hf[k].shape[::-1] == sd[k].shape
            with torch.no_grad():
                sd[k].copy_(sd_hf[k].t())
        else:
            # Direct copy
            assert sd_hf[k].shape == sd[k].shape
            with torch.no_grad():
                sd[k].copy_(sd_hf[k])
    
    print("Weights loaded successfully!")
    return model

# Load pretrained weights
model = load_pretrained_weights(model, 'gpt2')
model.eval()  # Set to evaluation mode
print("\nModel ready for inference!")

Loading pretrained weights from 'gpt2'...
Weights loaded successfully!

Model ready for inference!
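
The transposition is needed because the Hugging Face GPT-2 checkpoint uses `Conv1D` layers, which store weights as `(in_features, out_features)` and compute `x @ W`, whereas `nn.Linear` stores `(out_features, in_features)` and computes `x @ W.T`. The toy example below shows the equivalence with nested lists; the `matmul` and `transpose` helpers and the numbers are illustrative only.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a matrix given as nested lists."""
    return [list(col) for col in zip(*m)]

x = [[1.0, 2.0]]                       # one input row with in_features = 2
w_conv1d = [[1.0, 0.0, 2.0],           # Conv1D layout: (in_features, out_features)
            [0.0, 3.0, 1.0]]
w_linear = transpose(w_conv1d)         # nn.Linear layout: (out_features, in_features)

y_conv1d = matmul(x, w_conv1d)             # Conv1D computes x @ W
y_linear = matmul(x, transpose(w_linear))  # nn.Linear computes x @ W.T
print(y_conv1d, y_linear)
```

Copying the checkpoint tensor without `.t()` would fail the shape assertion for the non-square matrices and silently produce wrong outputs for the square `attn.c_proj.weight`.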


## 9. Text Generation with Sampling

Implement autoregressive text generation with top-k sampling.

In [12]:
@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
    """
    Generate new tokens autoregressively.
    
    Args:
        model: The GPT model
        idx: Starting token indices (B, T)
        max_new_tokens: Number of tokens to generate
        temperature: Sampling temperature (higher = more random)
        top_k: If set, only sample from top k most likely tokens
    
    Returns:
        Generated token indices (B, T + max_new_tokens)
    """
    for _ in range(max_new_tokens):
        # Crop context to the model's own block size if needed
        block_size = model.config.block_size
        idx_cond = idx if idx.size(1) <= block_size else idx[:, -block_size:]
        
        # Forward pass
        logits, _ = model(idx_cond)
        
        # Get logits for last token
        logits = logits[:, -1, :] / temperature
        
        # Optionally apply top-k filtering
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('Inf')
        
        # Apply softmax to get probabilities
        probs = F.softmax(logits, dim=-1)
        
        # Sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1)
        
        # Append sampled token
        idx = torch.cat((idx, idx_next), dim=1)
    
    return idx

# Test generation
prompt = "Hello, I'm a language model,"
print(f"Prompt: {prompt}\n")

# Encode prompt
tokens = encode(prompt)
tokens = torch.tensor(tokens, dtype=torch.long, device=device).unsqueeze(0)

# Generate
generated_tokens = generate(model, tokens, max_new_tokens=50, temperature=0.8, top_k=50)

# Decode
generated_text = decode(generated_tokens[0].tolist())
print(f"Generated text:\n{generated_text}")

Prompt: Hello, I'm a language model,

Generated text:
Hello, I'm a language model, and it is not easy to get things right.

I also have a problem with doing many of their projects on the server. The client would have to wait for all the code to load before it could connect. It might be because I had
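
To see what temperature and top-k do to a distribution, here is a single sampling step in plain Python. It is a sketch under simplified assumptions: one unbatched list of logits, a hypothetical helper `sample_top_k`, and `random.choices` standing in for `torch.multinomial`.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Return (sampled index, probabilities) after temperature and top-k filtering."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep the k largest logits; everything below the k-th largest becomes -inf.
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= kth else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs

rng = random.Random(0)  # seeded for repeatability
logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up logits over a 5-token vocabulary
token, probs = sample_top_k(logits, temperature=0.8, top_k=2, rng=rng)
print(token, [round(p, 3) for p in probs])
```

Only the two highest-scoring tokens keep non-zero probability, and a temperature below 1 sharpens the split between them.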


## 10. Dataset Preparation

Prepare training data. We'll use a simple text file for demonstration.

In [13]:
import urllib.request

# Download tiny_shakespeare dataset for demonstration
def download_dataset():
    """Download tiny_shakespeare dataset"""
    url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    filename = 'input.txt'
    
    if not os.path.exists(filename):
        print("Downloading dataset...")
        urllib.request.urlretrieve(url, filename)
        print("Download complete!")
    else:
        print("Dataset already exists.")
    
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    
    return text

# Load and tokenize data
text = download_dataset()
print(f"Dataset size: {len(text):,} characters")

# Tokenize entire dataset
tokens = encode(text)
tokens = torch.tensor(tokens, dtype=torch.long)
print(f"Number of tokens: {len(tokens):,}")

# Split into train and validation
n = int(0.9 * len(tokens))
train_data = tokens[:n]
val_data = tokens[n:]
print(f"Train tokens: {len(train_data):,}")
print(f"Val tokens: {len(val_data):,}")

Downloading dataset...
Download complete!
Dataset size: 1,115,394 characters
Number of tokens: 338,025
Train tokens: 304,222
Val tokens: 33,803


## 11. DataLoader Implementation

Create efficient batched data loading.

In [14]:
def get_batch(split, batch_size=4, block_size=128):
    """
    Generate a batch of data.
    
    Args:
        split: 'train' or 'val'
        batch_size: Number of sequences per batch
        block_size: Length of each sequence
    
    Returns:
        x: Input sequences (batch_size, block_size)
        y: Target sequences (batch_size, block_size)
    """
    data = train_data if split == 'train' else val_data
    
    # Randomly select starting indices
    ix = torch.randint(len(data) - block_size, (batch_size,))
    
    # Extract sequences
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    
    # Move to device
    x, y = x.to(device), y.to(device)
    
    return x, y

# Test the dataloader
batch_size = 4
block_size = 128
xb, yb = get_batch('train', batch_size, block_size)
print(f"Input batch shape: {xb.shape}")
print(f"Target batch shape: {yb.shape}")
print(f"\nFirst sequence (first 10 tokens):")
print(f"Input: {xb[0, :10].tolist()}")
print(f"Target: {yb[0, :10].tolist()}")

Input batch shape: torch.Size([4, 128])
Target batch shape: torch.Size([4, 128])

First sequence (first 10 tokens):
Input: [286, 1918, 11, 198, 464, 10647, 286, 13795, 348, 2419]
Target: [1918, 11, 198, 464, 10647, 286, 13795, 348, 2419, 314]
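
The target sequence is simply the input shifted one position to the left, so every position in a block yields its own training example. A tiny sketch with made-up token ids:

```python
data = list(range(100, 110))  # stand-in for a tokenized corpus
block_size = 4
i = 3                         # a starting index, as drawn by torch.randint above

x = data[i:i + block_size]          # inputs
y = data[i + 1:i + block_size + 1]  # targets: the same window shifted by one

for t in range(block_size):
    print(f"context {x[:t + 1]} -> target {y[t]}")
```

This is why `get_batch` draws indices from `len(data) - block_size`: the target window needs one extra token past the end of the input window.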


## 12. Loss Estimation

Create a function to estimate loss on train and validation sets.

In [15]:
@torch.no_grad()
def estimate_loss(model, eval_iters=200, batch_size=4, block_size=128):
    """
    Estimate average loss on train and val sets.
    
    Args:
        model: The model to evaluate
        eval_iters: Number of batches to average over
        batch_size: Batch size
        block_size: Sequence length
    
    Returns:
        Dictionary with 'train' and 'val' losses
    """
    out = {}
    was_training = model.training  # Remember the mode so it can be restored afterwards
    model.eval()
    
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split, batch_size, block_size)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    
    model.train(was_training)
    return out

# Test loss estimation
losses = estimate_loss(model, eval_iters=10, batch_size=4, block_size=128)
print(f"Train loss: {losses['train']:.4f}")
print(f"Val loss: {losses['val']:.4f}")

Train loss: 4.7200
Val loss: 4.6274
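
Two reference points help when reading these numbers. The loss is a cross-entropy in nats, so a model that spreads probability uniformly over the vocabulary scores `ln(vocab_size)`, and `exp(loss)` gives the perplexity. The short calculation below uses the vocabulary size from `GPTConfig` and the train loss printed above.

```python
import math

vocab_size = 50257
uniform_loss = math.log(vocab_size)  # loss of a uniform guess over all tokens

train_loss = 4.7200                  # pretrained model, value printed above
perplexity = math.exp(train_loss)

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"perplexity at loss {train_loss}: {perplexity:.1f}")
```

A freshly initialized model should start close to the uniform-guess loss; a value far above it usually points to an initialization problem.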


## 13. Optimizer and Learning Rate Scheduler

Set up AdamW optimizer with learning rate scheduling.

In [17]:
def configure_optimizers(model, weight_decay, learning_rate, betas, device_type):
    """
    Configure AdamW with weight decay applied only to the weights of Linear layers.
    Biases, LayerNorm parameters, and embeddings (including the tied lm_head/wte
    matrix) are excluded from weight decay, a common convention for GPT-style models.
    """
    # Separate parameters into decay and no_decay groups
    decay = set()
    no_decay = set()
    whitelist_weight_modules = (torch.nn.Linear, )
    blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
    
    for mn, m in model.named_modules():
        for pn, p in m.named_parameters():
            fpn = f'{mn}.{pn}' if mn else pn  # full param name
            
            if pn.endswith('bias'):
                no_decay.add(fpn)
            elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                decay.add(fpn)
            elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                no_decay.add(fpn)
    
    # Validate that we considered every parameter
    # Note: Due to weight tying (lm_head.weight = transformer.wte.weight),
    # we need to only include parameters that actually exist in param_dict
    param_dict = {pn: p for pn, p in model.named_parameters()}
    
    # Filter out parameter names that don't exist (due to weight tying)
    decay = decay & param_dict.keys()
    no_decay = no_decay & param_dict.keys()
    
    inter_params = decay & no_decay
    union_params = decay | no_decay
    assert len(inter_params) == 0, f"parameters {inter_params} made it into both decay/no_decay sets!"
    assert len(param_dict.keys() - union_params) == 0, f"parameters {param_dict.keys() - union_params} were not separated into either decay/no_decay set!"
    
    # Create optimizer groups
    optim_groups = [
        {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": weight_decay},
        {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
    ]
    
    # Use fused AdamW if available (faster on CUDA)
    use_fused = (device_type == 'cuda') and ('fused' in torch.optim.AdamW.__init__.__code__.co_varnames)
    print(f"Using fused AdamW: {use_fused}")
    
    optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, fused=use_fused)
    return optimizer

def get_lr(it, warmup_iters, lr_decay_iters, learning_rate, min_lr):
    """
    Learning rate schedule with warmup and cosine decay.
    """
    # 1) Linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) If it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) In between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

# Training hyperparameters (reduced for demonstration)
learning_rate = 6e-4
weight_decay = 1e-1
betas = (0.9, 0.95)
warmup_iters = 100
lr_decay_iters = 5000
min_lr = 6e-5

# Create optimizer
device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
optimizer = configure_optimizers(model, weight_decay, learning_rate, betas, device_type)
print(f"Optimizer configured with learning rate: {learning_rate}")

Using fused AdamW: False
Optimizer configured with learning rate: 0.0006
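
To get a feel for the schedule, the sketch below re-implements the same warmup and cosine-decay formula as a standalone function (named `lr_at` here for illustration, with the hyperparameters above as defaults) and evaluates it at a few steps.

```python
import math

def lr_at(it, warmup_iters=100, lr_decay_iters=5000, learning_rate=6e-4, min_lr=6e-5):
    """Linear warmup, then cosine decay to min_lr, then constant min_lr."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

for step in (0, 50, 100, 2550, 5000, 6000):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Note that step 0 gets a learning rate of exactly zero under this formula, so the first optimizer step does not change the weights.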


## 14. Training Loop with Mixed Precision

Implement the main training loop with gradient accumulation and mixed precision.

In [19]:
# Training configuration
max_iters = 1000  # Reduced for demonstration
eval_interval = 100
eval_iters = 20
batch_size = 4  # Reduced for demonstration
block_size = 128  # Reduced for demonstration
gradient_accumulation_steps = 1

# Mixed precision training
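use_amp = device_type == 'cuda'  # Use automatic mixed precision on CUDA
# torch.amp.GradScaler replaces the deprecated torch.cuda.amp.GradScaler in recent PyTorch releases;
# on older releases, fall back to torch.cuda.amp.GradScaler(enabled=use_amp)
scaler = torch.amp.GradScaler('cuda', enabled=use_amp)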

print(f"Training configuration:")
print(f"  Max iterations: {max_iters}")
print(f"  Batch size: {batch_size}")
print(f"  Block size: {block_size}")
print(f"  Gradient accumulation steps: {gradient_accumulation_steps}")
print(f"  Mixed precision: {use_amp}")
print(f"  Device: {device}")
print()

# Note: For demonstration, we'll train a fresh model from scratch
# If you want to finetune the pretrained model, skip the reinitialization
print("Creating fresh model for training from scratch...")
train_config = GPTConfig()
train_config.dropout = 0.1  # Add dropout for training
train_model = GPT(train_config)
train_model.to(device)

# Recreate optimizer for the new model
optimizer = configure_optimizers(train_model, weight_decay, learning_rate, betas, device_type)


Training configuration:
  Max iterations: 1000
  Batch size: 4
  Block size: 128
  Gradient accumulation steps: 1
  Mixed precision: False
  Device: cpu

Creating fresh model for training from scratch...
Number of parameters: 123.65M
Using fused AdamW: False
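
Gradient accumulation relies on a simple identity: if each micro-batch loss is divided by the number of accumulation steps, the summed gradients equal the gradient of one large batch. The sketch below checks this on a one-parameter least-squares problem in plain Python; the data and the `grad_mse` helper are made up for illustration.

```python
def grad_mse(w, batch):
    """Gradient with respect to w of mean((w*x - y)^2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

samples = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0),
           (5.0, 9.5), (6.0, 12.5), (7.0, 14.0), (8.0, 15.5)]
w = 0.5
accum_steps = 4
micro = len(samples) // accum_steps  # 2 samples per micro-batch

# Gradient of the full batch in one go
full_grad = grad_mse(w, samples)

# Same gradient built up from equally sized micro-batches
accum_grad = 0.0
for k in range(accum_steps):
    micro_batch = samples[k * micro:(k + 1) * micro]
    # Dividing by accum_steps mirrors `loss = loss / gradient_accumulation_steps`
    accum_grad += grad_mse(w, micro_batch) / accum_steps

print(full_grad, accum_grad)
```

The identity holds exactly only when the micro-batches have equal size, which is the case in the training loop here.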


In [21]:
def train(model, optimizer, max_iters, eval_interval=100):
    """
    Main training loop.
    """
    model.train()
    
    for iter in range(max_iters):
        # Determine learning rate for this iteration
        lr = get_lr(iter, warmup_iters, lr_decay_iters, learning_rate, min_lr)
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
        
        # Evaluate at intervals
        if iter % eval_interval == 0 or iter == max_iters - 1:
            losses = estimate_loss(model, eval_iters, batch_size, block_size)
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}, lr {lr:.6f}")
        
        # Training step
        for micro_step in range(gradient_accumulation_steps):
            # Get batch
            X, Y = get_batch('train', batch_size, block_size)
            
            # Forward pass with mixed precision (a no-op when use_amp is False)
            with torch.autocast(device_type=device_type, enabled=use_amp):
                logits, loss = model(X, Y)
                loss = loss / gradient_accumulation_steps  # Scale loss for gradient accumulation
            
            # Backward pass
            scaler.scale(loss).backward()
        
        # Clip gradients
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        # Optimizer step
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
    
    return model

# Run training. This will take a while! For quick testing, set max_iters to a small number like 100,
# or comment out the train() call below to skip training.
print("Starting training...")
print("Note: Training from scratch takes significant time and compute.")
print("For demonstration, we're using reduced settings.")
print("For full GPT-2 reproduction, you'd need multiple GPUs and days of training.")
print()

train_model = train(train_model, optimizer, max_iters, eval_interval)

print("Training finished.")

Starting training...
Note: Training from scratch takes significant time and compute.
For demonstration, we're using reduced settings.
For full GPT-2 reproduction, you'd need multiple GPUs and days of training.

step 0: train loss 10.9815, val loss 10.9748, lr 0.000000

## 15. Model Checkpointing

Save and load model checkpoints.

In [None]:
from dataclasses import asdict

def save_checkpoint(model, optimizer, iter_num, loss, filepath='checkpoint.pt'):
    """Save model checkpoint"""
    checkpoint = {
        'iter': iter_num,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
        # Store the config as a plain dict so the file also loads when
        # torch.load defaults to weights_only=True (newer PyTorch releases)
        'config': asdict(model.config),
    }
    torch.save(checkpoint, filepath)
    print(f"Checkpoint saved to {filepath}")

def load_checkpoint(filepath='checkpoint.pt'):
    """Load model checkpoint"""
    checkpoint = torch.load(filepath, map_location=device)
    
    # Recreate model from the saved configuration
    config = GPTConfig(**checkpoint['config'])
    model = GPT(config)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    
    # Recreate optimizer
    optimizer = configure_optimizers(model, weight_decay, learning_rate, betas, device_type)
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    
    iter_num = checkpoint['iter']
    loss = checkpoint['loss']
    
    print(f"Checkpoint loaded from {filepath}")
    print(f"Resuming from iteration {iter_num} with loss {loss:.4f}")
    
    return model, optimizer, iter_num, loss

# Example usage (uncomment to use):
# save_checkpoint(model, optimizer, 0, 0.0, 'gpt2_checkpoint.pt')
# loaded_model, loaded_optimizer, iter_num, loss = load_checkpoint('gpt2_checkpoint.pt')

print("Checkpoint functions defined.")
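
The checkpoint above stores `model.config` as a pickled object, so loading it depends on the config class being importable, and recent PyTorch releases reject arbitrary pickled classes unless `torch.load` is called with `weights_only=False`. One possible alternative is to store the config as a plain dict and rebuild it on load. The sketch below illustrates the round trip with the standard library only; `DemoConfig`, `pack_config`, and `unpack_config` are hypothetical names, and the field list is an assumption about what the notebook's config contains.

```python
from dataclasses import dataclass, asdict
import pickle

@dataclass
class DemoConfig:
    """Hypothetical stand-in for the notebook's model config."""
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

def pack_config(config):
    """Convert a dataclass config into a dict of built-in types."""
    return asdict(config)

def unpack_config(config_dict, config_cls=DemoConfig):
    """Rebuild the config object from the stored dict."""
    return config_cls(**config_dict)

# Round trip: the serialized payload contains only dicts, strings, and ints.
original = DemoConfig(n_layer=6)
payload = pickle.dumps({'config': pack_config(original)})
restored = unpack_config(pickle.loads(payload)['config'])
print(restored == original)
```

In the notebook, the same idea would amount to saving `asdict(model.config)` under the `'config'` key and constructing the config from that dict before calling `GPT(config)`.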

## 16. Performance Optimization with torch.compile

Use `torch.compile` (PyTorch 2.0+) to speed up the model, and benchmark it against the uncompiled version.

In [None]:
# torch.compile is available in PyTorch 2.0+
import sys
import time

# Check PyTorch version
pytorch_version = torch.__version__
print(f"PyTorch version: {pytorch_version}")

# Compile only if torch.compile exists (PyTorch 2.0+) and we are running on CUDA.
# hasattr avoids comparing version strings, which would sort '10.0' before '2.0'.
if sys.version_info >= (3, 8) and hasattr(torch, 'compile') and device_type == 'cuda':
    print("Compiling model with torch.compile()...")
    print("Note: First run will be slow due to compilation.")

    # Wrap the model; the actual compilation happens on the first call.
    # The speedup depends on the GPU and the model, so it is measured below.
    compiled_model = torch.compile(model)
    print("Model wrapped with torch.compile.")

    # Benchmark
    print("\nBenchmarking compiled vs non-compiled model...")

    # Warm up (the first compiled call triggers compilation)
    X, Y = get_batch('train', batch_size=4, block_size=128)
    _ = model(X, Y)
    _ = compiled_model(X, Y)
    torch.cuda.synchronize()

    # Time non-compiled; synchronize so queued GPU work is included in the timing
    start = time.perf_counter()
    for _ in range(10):
        X, Y = get_batch('train', batch_size=4, block_size=128)
        logits, loss = model(X, Y)
    torch.cuda.synchronize()
    non_compiled_time = time.perf_counter() - start

    # Time compiled
    start = time.perf_counter()
    for _ in range(10):
        X, Y = get_batch('train', batch_size=4, block_size=128)
        logits, loss = compiled_model(X, Y)
    torch.cuda.synchronize()
    compiled_time = time.perf_counter() - start

    print(f"Non-compiled: {non_compiled_time:.3f}s")
    print(f"Compiled: {compiled_time:.3f}s")
    print(f"Speedup: {non_compiled_time/compiled_time:.2f}x")
else:
    print("torch.compile not available or not using CUDA. Skipping compilation.")
    print("To use torch.compile, you need:")
    print("  - PyTorch 2.0 or later")
    print("  - Python 3.8 or later")
    print("  - CUDA GPU (for best performance)")
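
CUDA kernels are launched asynchronously, so a timer that stops before the GPU has finished can under-report the real cost, and a single timed run is noisy. The helper below is a minimal sketch of a more careful measurement: it discards warm-up calls, optionally waits for the device after each call, and reports the median. The names `benchmark`, `sync`, and `dummy_step` are hypothetical and not part of the notebook; the workload here is a CPU stand-in so the example runs anywhere.

```python
import time
import statistics

def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the median wall-clock seconds per call of fn().

    sync is an optional zero-argument callable that blocks until queued
    device work has finished (for example torch.cuda.synchronize).
    """
    def run_once():
        fn()
        if sync is not None:
            sync()  # wait for queued GPU work before the timer stops

    # Warm-up calls absorb one-time costs such as compilation and caching.
    for _ in range(warmup):
        run_once()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# CPU stand-in for a model forward pass.
def dummy_step():
    sum(i * i for i in range(10_000))

median_s = benchmark(dummy_step)
print(f"median: {median_s * 1e3:.3f} ms per call")
```

In the notebook, one might call it as `benchmark(lambda: compiled_model(X, Y), sync=torch.cuda.synchronize)` with a batch fetched once beforehand, so data loading is excluded from the timing.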

## 17. Generate Samples from Trained Model

Generate multiple text samples with different prompts.

In [None]:
def generate_samples(model, prompts, max_new_tokens=100, temperature=0.8, top_k=50):
    """Generate text samples for multiple prompts"""
    model.eval()
    
    for i, prompt in enumerate(prompts):
        print(f"\n{'='*80}")
        print(f"Sample {i+1}")
        print(f"{'='*80}")
        print(f"Prompt: {prompt}")
        print(f"{'-'*80}")
        
        # Encode prompt
        tokens = encode(prompt)
        tokens = torch.tensor(tokens, dtype=torch.long, device=device).unsqueeze(0)
        
        # Generate
        with torch.no_grad():
            generated = generate(model, tokens, max_new_tokens, temperature, top_k)
        
        # Decode
        text = decode(generated[0].tolist())
        print(text)
        print()

# Sample prompts to test the model
prompts = [
    "Hello, I'm a language model,",
    "Once upon a time,",
    "The meaning of life is",
    "In the field of artificial intelligence,",
    "ROMEO:",
]

print("Generating samples from pretrained GPT-2 model...")
generate_samples(model, prompts, max_new_tokens=80, temperature=0.8, top_k=50)
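
To see what the `temperature` and `top_k` arguments do to the next-token distribution, here is a small pure-Python illustration. It is not the notebook's `generate` function, only a sketch of the two operations on a toy list of logits; `top_k_probs`, `sample_token`, and `toy_logits` are hypothetical names.

```python
import math
import random

def top_k_probs(logits, temperature=1.0, top_k=None):
    """Return sampling probabilities after temperature scaling and top-k filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    if top_k is not None and top_k < len(scaled):
        # Keep the k largest logits; ties at the cutoff are all kept.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float('-inf') for x in scaled]
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = top_k_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = top_k_probs(toy_logits, temperature=0.8, top_k=2)
print([round(p, 3) for p in probs])
print(sample_token(toy_logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

With `top_k=2`, only the two highest-scoring tokens keep non-zero probability, so the sampled index is always one of those two.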

## 18. Model Evaluation and Metrics

Compute the average validation loss and the corresponding perplexity.

In [None]:
@torch.no_grad()
def evaluate_model(model, num_batches=100):
    """
    Estimate the validation loss and perplexity over num_batches batches.
    Returns (avg_loss, perplexity).
    """
    was_training = model.training  # remember the mode so it can be restored
    model.eval()

    total_loss = 0.0
    total_tokens = 0

    for _ in range(num_batches):
        X, Y = get_batch('val', batch_size=4, block_size=128)
        logits, loss = model(X, Y)

        # loss is a per-token mean, so weight it by the number of target tokens
        total_loss += loss.item() * Y.numel()
        total_tokens += Y.numel()

    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)

    print(f"\n{'='*60}")
    print("Model Evaluation Results")
    print(f"{'='*60}")
    print(f"Average loss: {avg_loss:.4f}")
    print(f"Perplexity: {perplexity:.2f}")
    print(f"Evaluated on {total_tokens:,} tokens")
    print(f"{'='*60}")

    model.train(was_training)  # restore the original train/eval mode
    return avg_loss, perplexity

# Evaluate the model
print("Evaluating pretrained GPT-2 model on validation set...")
loss, perplexity = evaluate_model(model, num_batches=50)
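
As a worked example of the arithmetic in `evaluate_model`, the sketch below computes a token-weighted average loss and its perplexity from made-up batch losses. It also shows a useful reference point: a model that guesses uniformly over GPT-2's 50,257-token vocabulary has loss ln(50257), about 10.82, which is close to the step-0 training loss of roughly 10.98 printed earlier. The function name `token_weighted_perplexity` and the sample numbers are hypothetical.

```python
import math

def token_weighted_perplexity(batch_losses, batch_token_counts):
    """Combine per-batch mean losses into one average loss and perplexity."""
    total_loss = sum(l * n for l, n in zip(batch_losses, batch_token_counts))
    total_tokens = sum(batch_token_counts)
    avg_loss = total_loss / total_tokens
    return avg_loss, math.exp(avg_loss)

# Reference point: uniform guessing over the GPT-2 vocabulary.
vocab_size = 50257
uniform_loss = math.log(vocab_size)
_, uniform_ppl = token_weighted_perplexity([uniform_loss], [512])
print(f"uniform-guess loss: {uniform_loss:.4f}, perplexity: {uniform_ppl:.0f}")

# Two made-up batches of different sizes: the larger batch dominates the average.
avg_loss, ppl = token_weighted_perplexity([3.0, 4.0], [300, 100])
print(f"avg loss: {avg_loss:.4f}, perplexity: {ppl:.2f}")
```

Perplexity equal to the vocabulary size means the model is no better than chance, so a trained model should land far below 50,257.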

## 19. Summary and Next Steps

Congratulations! You've built GPT-2 from scratch.

### What We've Implemented:

✅ **Architecture**:
- Multi-head self-attention with causal masking
- MLP feedforward blocks with GELU activation
- Pre-LayerNorm transformer blocks
- Token and positional embeddings
- Full GPT-2 124M architecture

✅ **Training Infrastructure**:
- Efficient data loading
- AdamW optimizer with learning rate scheduling
- Mixed precision training
- Gradient accumulation
- Checkpointing

✅ **Inference**:
- Autoregressive text generation
- Top-k sampling
- Temperature scaling
- Weight loading from Hugging Face

✅ **Optimization**:
- torch.compile support
- Mixed precision (AMP)
- Efficient batching

### Next Steps:

1. **Train from Scratch**: Run the training loop on a larger dataset (OpenWebText, The Pile, etc.)

2. **Scale Up**: 
   - Increase model size (GPT-2 medium/large/XL)
   - Use larger batch sizes
   - Train for more iterations

3. **Advanced Features**:
   - Implement Flash Attention for faster training
   - Add distributed training (DDP, FSDP)
   - Implement gradient checkpointing for memory efficiency

4. **Fine-tuning**:
   - Fine-tune on domain-specific data
   - Implement instruction tuning
   - Add RLHF (Reinforcement Learning from Human Feedback)

5. **Evaluation**:
   - Test on standard benchmarks (HellaSwag, MMLU, etc.)
   - Compare with OpenAI's GPT-2
   - Analyze generated samples
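
To get a feel for what "Scale Up" means in parameter terms, the sketch below counts parameters for the four published GPT-2 sizes from their layer count and embedding width. It assumes the standard GPT-2 layout (biases on every linear layer, a 4x MLP expansion, learned positional embeddings, and an output head tied to the token embedding); `gpt2_param_count` is a hypothetical helper, not a function from the notebook.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2 style model with a tied output head."""
    wte = vocab_size * n_embd                      # token embeddings
    wpe = block_size * n_embd                      # positional embeddings
    layer_norm = 2 * n_embd                        # weight and bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) \
         + (n_embd * n_embd + n_embd)              # qkv projection + output projection
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) \
        + (4 * n_embd * n_embd + n_embd)           # up projection + down projection
    block = 2 * layer_norm + attn + mlp
    # Final LayerNorm is added once; the tied lm_head adds no new parameters.
    return wte + wpe + n_layer * block + layer_norm

sizes = {
    'gpt2 (124M)': dict(n_layer=12, n_embd=768),
    'gpt2-medium': dict(n_layer=24, n_embd=1024),
    'gpt2-large': dict(n_layer=36, n_embd=1280),
    'gpt2-xl': dict(n_layer=48, n_embd=1600),
}
for name, cfg in sizes.items():
    print(f"{name}: {gpt2_param_count(**cfg) / 1e6:.1f}M parameters")
```

The number of attention heads does not appear in the formula because splitting the embedding across heads does not change the size of the projection matrices.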

### Resources:

- **Andrej Karpathy's Video**: [Let's reproduce GPT-2 (124M)](https://www.youtube.com/watch?v=l8pRSuU81PU)
- **nanoGPT Repository**: https://github.com/karpathy/nanoGPT
- **build-nanogpt Repository**: https://github.com/karpathy/build-nanogpt
- **GPT-2 Paper**: [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- **Attention Paper**: [Attention is All You Need](https://arxiv.org/abs/1706.03762)

Happy training! 🚀