# 🦙 Build and Train a LLaMA Model from Scratch

You've successfully built and trained a LLaMA model from scratch!

## What You Accomplished:

✅ **Built a complete transformer model** with a modern architecture
- RoPE for position encoding
- RMSNorm for stabilization  
- Grouped Query Attention for efficiency
- SwiGLU activation function

✅ **Trained the model** on text data
- Character-level tokenization
- AdamW optimizer with cosine scheduling
- Proper gradient clipping

✅ **Generated text** with different sampling strategies
- Temperature control
- Top-k and top-p sampling

✅ **Evaluated and saved** your model
- Perplexity metrics
- Checkpoint system

## What's Next?

### To Improve Your Model:
1. **More Training Data:** Use larger datasets (books, Wikipedia, etc.)
2. **Longer Training:** Train for more epochs
3. **Bigger Model:** Increase `d_model`, `n_layers`, etc.
4. **Better Tokenization:** Use BPE or SentencePiece instead of characters
5. **Fine-tuning:** Train on specific tasks or domains

### Advanced Topics to Explore:
- **Multi-GPU Training:** Distribute training across GPUs
- **Mixed Precision:** Use FP16 or BF16 for faster training and lower memory use
- **LoRA:** Efficient fine-tuning technique
- **RLHF:** Reinforcement Learning from Human Feedback
- **Prompt Engineering:** Optimize prompts for better outputs

### Real LLaMA Models:
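This demo used small sizes for learning. The original LLaMA release came in four sizes:
- LLaMA 7B: 7 billion parameters
- LLaMA 13B: 13 billion parameters
- LLaMA 33B: 33 billion parameters
- LLaMA 65B: 65 billion parameters (LLaMA 2 later raised the largest size to 70B)

Your model is far smaller, which keeps training fast. The exact parameter count is printed as `Model parameters` in Step 12.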

## Resources:
- [LLaMA Paper](https://arxiv.org/abs/2302.13971)
- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- [PyTorch Documentation](https://pytorch.org/docs/)

---

**Great job! You now understand how modern large language models work! 🚀**

In [None]:
# Save checkpoint
def save_checkpoint(model, tokenizer, config, filepath):
    """Save model checkpoint."""
    checkpoint = {
        'model_state_dict': model.state_dict(),
        'config': config,
        'vocab_size': tokenizer.vocab_size,
        'char_to_idx': tokenizer.char_to_idx,
        'idx_to_char': tokenizer.idx_to_char,
        'chars': tokenizer.chars
    }
    torch.save(checkpoint, filepath)
    print(f"✓ Model saved to: {filepath}")

# Load checkpoint
def load_checkpoint(filepath, device='cpu'):
    """Load model checkpoint."""
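    # weights_only=False because the checkpoint stores the config object next to
    # the weights (assumed here to be a custom class); only load files you trust.
    checkpoint = torch.load(filepath, map_location=device, weights_only=False)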
    
    # Recreate tokenizer
    tokenizer = CharTokenizer("")
    tokenizer.chars = checkpoint['chars']
    tokenizer.vocab_size = checkpoint['vocab_size']
    tokenizer.char_to_idx = checkpoint['char_to_idx']
    tokenizer.idx_to_char = checkpoint['idx_to_char']
    
    # Recreate model
    config = checkpoint['config']
    model = LLaMA(config).to(device)
    model.load_state_dict(checkpoint['model_state_dict'])
    
    print(f"✓ Model loaded from: {filepath}")
    return model, tokenizer, config

# Save the trained model
print("SAVING MODEL")
print("=" * 60)
save_checkpoint(model, tokenizer, config, 'llama_checkpoint.pt')
print("\n✓ You can now load this model anytime!")

# Example: Load the model
print("\nEXAMPLE: Loading saved model")
print("-" * 60)
loaded_model, loaded_tokenizer, loaded_config = load_checkpoint('llama_checkpoint.pt', device)
print("✓ Model loaded successfully!")

## Step 15: Save and Load Model

Save your trained model so you can use it later!

In [None]:
# Calculate perplexity
def calculate_perplexity(model, dataloader, device):
    """Calculate perplexity on a dataset."""
    model.eval()
    total_loss = 0
    total_tokens = 0
    criterion = nn.CrossEntropyLoss(reduction='sum')
    
    with torch.no_grad():
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            
            logits = model(x)
            loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
            
            total_loss += loss.item()
            total_tokens += y.numel()
    
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    
    return perplexity, avg_loss

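# Evaluate on training data (there is no held-out set here, so this measures fit, not generalization).
# For a character-level model, uniform guessing already scores a perplexity equal to
# the vocabulary size, so treat the thresholds below as rough guides.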
print("EVALUATING MODEL")
print("=" * 60)

perplexity, avg_loss = calculate_perplexity(model, train_loader, device)

print(f"Average Loss: {avg_loss:.4f}")
print(f"Perplexity:   {perplexity:.2f}")
print("\nInterpretation:")
if perplexity < 10:
    print("  🌟 Excellent! Model learned the patterns well.")
elif perplexity < 50:
    print("  ✅ Good! Model has reasonable understanding.")
elif perplexity < 100:
    print("  ⚠️  Fair. More training could help.")
else:
    print("  ❌ Poor. Model needs more training or better data.")

print("\n✓ Evaluation complete!")

---

# Part 8: Evaluation and Saving

## Step 14: Evaluate Model Performance

Let's measure how well our model learned!

**Perplexity:** A common metric for language models
- Lower perplexity = better model
- Measures how "surprised" the model is by the text
- Perfect model would have perplexity = 1

**Formula:** Perplexity = exp(average loss)
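
To make the formula concrete, here is a minimal sketch in plain Python (no PyTorch) that computes perplexity from the probabilities a model assigned to the correct characters. The vocabulary size of 40 and the probability values are made-up numbers for illustration, not results from this notebook.

```python
import math

def perplexity(true_token_probs):
    """Perplexity from the probabilities assigned to the correct tokens."""
    # Average negative log-likelihood (the cross-entropy loss, in nats)
    avg_loss = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_loss)

vocab_size = 40  # hypothetical character vocabulary size

# Uniform guessing: every correct character gets probability 1 / vocab_size
uniform = perplexity([1 / vocab_size] * 100)

# A confident model: every correct character gets probability 0.9
confident = perplexity([0.9] * 100)

print(f"Uniform guessing: {uniform:.2f}")
print(f"Confident model:  {confident:.2f}")
```

Uniform guessing gives a perplexity equal to the vocabulary size, which is a useful baseline: a character-level model is only learning something once its perplexity drops well below the number of characters.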

In [None]:
# Test text generation with different settings
print("=" * 60)
print("TEXT GENERATION EXAMPLES")
print("=" * 60 + "\n")

prompt = "Machine learning"

# Example 1: Low temperature (conservative)
print("1. LOW TEMPERATURE (0.5) - More predictable:")
print("-" * 60)
output1 = generate(model, tokenizer, prompt, max_length=150, 
                   temperature=0.5, device=device)
print(output1)
print("\n")

# Example 2: Medium temperature
print("2. MEDIUM TEMPERATURE (1.0) - Balanced:")
print("-" * 60)
output2 = generate(model, tokenizer, prompt, max_length=150,
                   temperature=1.0, device=device)
print(output2)
print("\n")

# Example 3: High temperature (creative)
print("3. HIGH TEMPERATURE (1.5) - More creative:")
print("-" * 60)
output3 = generate(model, tokenizer, prompt, max_length=150,
                   temperature=1.5, device=device)
print(output3)
print("\n")

# Example 4: With top-k sampling
print("4. TOP-K SAMPLING (k=10) - Limits choices:")
print("-" * 60)
output4 = generate(model, tokenizer, prompt, max_length=150,
                   temperature=1.0, top_k=10, device=device)
print(output4)
print("\n")

print("=" * 60)
print("✓ Text generation working! Try different prompts and settings.")
print("=" * 60)

In [None]:
# Text generation function
def generate(model, tokenizer, prompt, max_length=100, temperature=1.0, 
             top_k=None, top_p=None, device='cpu'):
    """
    Generate text using the trained model.
    
    Args:
        model: Trained LLaMA model
        tokenizer: Character tokenizer
        prompt: Starting text
        max_length: Maximum characters to generate
        temperature: Sampling temperature (higher = more random)
        top_k: Keep only top k tokens (optional)
        top_p: Nucleus sampling threshold (optional)
        device: CPU or GPU
    
    Returns:
        Generated text string
    """
    model.eval()
    
    # Encode prompt
    tokens = torch.tensor(tokenizer.encode(prompt), dtype=torch.long).unsqueeze(0).to(device)
    
    # Generate tokens one by one
    generated = []
    
    with torch.no_grad():
        for _ in range(max_length):
            # Get predictions
            logits = model(tokens)
            
            # Get logits for last position
            next_token_logits = logits[0, -1, :] / temperature
            
            # Apply top-k filtering
            if top_k is not None:
                k = min(top_k, next_token_logits.size(-1))  # Guard against top_k > vocab size
                indices_to_remove = next_token_logits < torch.topk(next_token_logits, k)[0][..., -1, None]
                next_token_logits[indices_to_remove] = float('-inf')
            
            # Apply top-p (nucleus) filtering
            if top_p is not None:
                sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
                
                # Remove tokens with cumulative probability above threshold
                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0
                
                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                next_token_logits[indices_to_remove] = float('-inf')
            
            # Sample next token
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Append to sequence
            tokens = torch.cat([tokens, next_token.unsqueeze(0)], dim=1)
            generated.append(next_token.item())
            
            # Keep only last max_seq_len tokens
            if tokens.size(1) > model.config.max_seq_len:
                tokens = tokens[:, -model.config.max_seq_len:]
    
    # Decode and return
    return prompt + tokenizer.decode(generated)

---

# Part 7: Generate Text

## Step 13: Text Generation

Now the fun part - making the model generate text!

**How Generation Works:**
1. Give model a starting text (prompt)
2. Model predicts probability of each next character
3. Sample a character based on probabilities
4. Add character to sequence
5. Repeat!

**Sampling Strategies:**
- **Temperature:** Controls randomness (low = safe, high = creative)
- **Top-k:** Only consider k most likely options
- **Top-p (Nucleus):** Keep the smallest set of most likely options whose cumulative probability reaches p

**Higher temperature = more creative (but less coherent)**
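
The three strategies above can be sketched on a toy distribution. This is an illustrative plain-Python version rather than the PyTorch code used in this notebook; the four logits and the helper names (`softmax`, `apply_temperature`, `top_k_filter`, `top_p_filter`) are assumptions made for the example.

```python
import math

def softmax(logits):
    """Turn logits into probabilities (exp(-inf) contributes exactly 0)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Divide logits by the temperature: <1 sharpens, >1 flattens."""
    return [x / temperature for x in logits]

def top_k_filter(logits, k):
    """Keep the k largest logits (ties at the cutoff are kept too); mask the rest."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def top_p_filter(logits, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for 4 characters

cold = softmax(apply_temperature(logits, 0.5))   # low temperature
hot = softmax(apply_temperature(logits, 1.5))    # high temperature
top2 = softmax(top_k_filter(logits, 2))          # only the 2 best remain
nucleus = softmax(top_p_filter(logits, 0.8))     # smallest set covering 80%

for name, dist in [("T=0.5", cold), ("T=1.5", hot), ("top-k=2", top2), ("top-p=0.8", nucleus)]:
    print(f"{name:10s}", [round(q, 3) for q in dist])
```

Low temperature concentrates probability on the top character, high temperature spreads it out, and both filters zero out the unlikely tail before sampling.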

In [None]:
# Run training
print("\n" + "=" * 60)
print("STARTING TRAINING")
print("=" * 60 + "\n")

# Initialize training
model, optimizer, scheduler, criterion, start_time, global_step = train(
    model, train_loader, num_epochs=5, device=device, learning_rate=3e-4
)

# Run training loop
model = train_loop(
    model, train_loader, num_epochs=5,
    optimizer=optimizer, scheduler=scheduler, criterion=criterion,
    device=device, start_time=start_time, global_step=global_step
)

print("✓ Training complete! Model is ready to generate text.")

In [None]:
# Prepare training data
print("PREPARING TRAINING DATA")
print("=" * 60)

# Sample training text (you can replace with your own!)
training_text = """
The quick brown fox jumps over the lazy dog. 
Machine learning is the study of computer algorithms that improve automatically through experience.
Deep learning is part of a broader family of machine learning methods based on artificial neural networks.
A transformer is a deep learning model that adopts the mechanism of self-attention.
LLaMA is a large language model developed by Meta AI.
""" * 20  # Repeat to have more training data

print(f"Training text length: {len(training_text):,} characters\n")

# Create tokenizer
tokenizer = CharTokenizer(training_text)
print(f"Vocabulary: {tokenizer.vocab_size} unique characters")
print(f"Characters: {''.join(tokenizer.chars[:30])}...")

# Update config with actual vocab size
config.vocab_size = tokenizer.vocab_size

# Recreate model with correct vocab size
model = LLaMA(config).to(device)
n_params = sum(p.numel() for p in model.parameters())
print(f"\nModel parameters: {n_params:,}")

# Create dataset and dataloader
dataset = TextDataset(training_text, tokenizer, seq_len=config.max_seq_len)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

print(f"\nDataset: {len(dataset):,} training samples")
print(f"Batches: {len(train_loader)} per epoch")
print(f"Batch size: 32")
print(f"Sequence length: {config.max_seq_len}")
print("\n✓ Data ready for training!")

## Step 12: Load Real Training Data and Train

Now let's prepare a small sample dataset and train the model!

For this example, we'll use a sample text. In practice, you'd use much larger datasets like books or Wikipedia articles.

In [None]:
# Training function - Part 2: Main loop
def train_loop(model, train_loader, num_epochs, optimizer, scheduler, criterion, device, start_time, global_step):
    """Main training loop."""
    
    for epoch in range(num_epochs):
        epoch_loss = 0
        
        for batch_idx, (x, y) in enumerate(train_loader):
            # Move to device
            x, y = x.to(device), y.to(device)
            
            # Forward pass
            logits = model(x)
            
            # Calculate loss (reshape for CrossEntropyLoss)
            loss = criterion(
                logits.view(-1, logits.size(-1)),
                y.view(-1)
            )
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            
            # Gradient clipping (prevents exploding gradients)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            
            # Update weights
            optimizer.step()
            scheduler.step()
            
            # Track metrics
            epoch_loss += loss.item()
            global_step += 1
            
            # Print progress
            if (batch_idx + 1) % 10 == 0:
                avg_loss = epoch_loss / (batch_idx + 1)
                lr = scheduler.get_last_lr()[0]
                print(f"Epoch {epoch+1}/{num_epochs} | "
                      f"Batch {batch_idx+1}/{len(train_loader)} | "
                      f"Loss: {avg_loss:.4f} | LR: {lr:.6f}")
        
        # Epoch summary
        avg_epoch_loss = epoch_loss / len(train_loader)
        elapsed = time.time() - start_time
        print(f"\n{'='*60}")
        print(f"Epoch {epoch+1} Complete | Avg Loss: {avg_epoch_loss:.4f} | Time: {elapsed:.1f}s")
        print(f"{'='*60}\n")
    
    total_time = time.time() - start_time
    print(f"\n{'='*60}")
    print(f"TRAINING COMPLETE!")
    print(f"Total time: {total_time:.1f}s ({total_time/60:.1f} minutes)")
    print(f"Final loss: {avg_epoch_loss:.4f}")
    print(f"{'='*60}\n")
    
    return model

In [None]:
# Training function - Part 1: Setup
def train(model, train_loader, num_epochs, device, learning_rate=3e-4):
    """
    Train the LLaMA model.
    
    Args:
        model: LLaMA model to train
        train_loader: DataLoader with training data
        num_epochs: Number of times to go through dataset
        device: CPU or GPU
        learning_rate: Starting learning rate
    """
    model.train()
    
    # Setup optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.1)
    
    # Setup learning rate scheduler
    total_steps = len(train_loader) * num_epochs
    warmup_steps = min(100, total_steps // 10)
    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    
    # Loss function
    criterion = nn.CrossEntropyLoss()
    
    print("TRAINING STARTED")
    print("=" * 60)
    print(f"Total epochs: {num_epochs}")
    print(f"Steps per epoch: {len(train_loader)}")
    print(f"Total steps: {total_steps}")
    print(f"Warmup steps: {warmup_steps}")
    print("=" * 60 + "\n")
    
    start_time = time.time()
    global_step = 0
    
    return model, optimizer, scheduler, criterion, start_time, global_step

In [None]:
# Learning rate scheduler
def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
    """
    Creates a learning rate scheduler with warmup and cosine decay.
    
    Warmup: Gradually increase LR (prevents instability at start)
    Cosine: Smooth decrease (helps fine-tune at end)
    """
    def lr_lambda(current_step):
        # Warmup phase
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        
        # Cosine decay phase
        progress = float(current_step - num_warmup_steps) / float(
            max(1, num_training_steps - num_warmup_steps)
        )
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
    
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Visualize learning rate schedule
print("EXAMPLE: Learning Rate Schedule")
print("=" * 60)

# Create dummy optimizer
dummy_model = nn.Linear(10, 10)
dummy_optimizer = torch.optim.AdamW(dummy_model.parameters(), lr=1e-3)
scheduler = get_cosine_schedule_with_warmup(dummy_optimizer, 100, 1000)

# Sample learning rates
lrs = []
for step in range(1000):
    lrs.append(scheduler.get_last_lr()[0])
    dummy_optimizer.step()  # No gradients, so nothing changes; avoids PyTorch's step-order warning
    scheduler.step()

print(f"Learning rate schedule (1000 steps, 100 warmup):")
print(f"  Step 0:    {lrs[0]:.6f} (starting)")
print(f"  Step 100:  {lrs[100]:.6f} (after warmup)")
print(f"  Step 500:  {lrs[500]:.6f} (middle)")
print(f"  Step 999:  {lrs[999]:.6f} (end)")
print("\n✓ Learning rate starts low, peaks, then gradually decreases!")

---

# Part 6: Training the Model

## Step 11: Training Setup

Now we train the model to predict next characters!

**Training Process:**
1. Show model some text
2. Model tries to predict next character
3. Compare prediction to actual next character
4. Adjust model weights to improve
5. Repeat thousands of times!

**Key Components:**
- **Optimizer:** AdamW (updates model weights intelligently)
- **Learning Rate Scheduler:** Warms up linearly from zero to the peak rate, then decays along a cosine curve
- **Loss Function:** CrossEntropyLoss (measures prediction error; lower is better)

**Analogy:** Like learning to play piano - practice makes perfect!
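
As a rough illustration of step 3 in the training process, the sketch below computes the cross-entropy loss for a single next-character prediction in plain Python. The four-character vocabulary and both probability vectors are invented for the example; `nn.CrossEntropyLoss` performs the same calculation on logits, averaged over a batch.

```python
import math

def cross_entropy(probs, target_index):
    """Loss for one prediction: negative log of the probability of the correct token."""
    return -math.log(probs[target_index])

# Hypothetical 4-character vocabulary; the correct next character is index 2
target = 2
unsure = [0.25, 0.25, 0.25, 0.25]  # untrained model: uniform guess
better = [0.05, 0.05, 0.85, 0.05]  # after training: most mass on the right answer

loss_unsure = cross_entropy(unsure, target)
loss_better = cross_entropy(better, target)

print(f"Unsure model loss: {loss_unsure:.3f}")
print(f"Better model loss: {loss_better:.3f}")
```

The loss falls as the model puts more probability on the correct character, and that is the signal the optimizer follows when it adjusts the weights.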

In [None]:
# Training Dataset
class TextDataset(Dataset):
    """Dataset for language modeling."""
    
    def __init__(self, text, tokenizer, seq_len):
        self.tokenizer = tokenizer
        self.seq_len = seq_len
        
        # Encode entire text
        self.tokens = torch.tensor(tokenizer.encode(text), dtype=torch.long)
    
    def __len__(self):
        return len(self.tokens) - self.seq_len
    
    def __getitem__(self, idx):
        """
        Get a training pair.
        
        Returns:
            x: input sequence
            y: target sequence (shifted by 1)
        """
        x = self.tokens[idx:idx + self.seq_len]
        y = self.tokens[idx + 1:idx + self.seq_len + 1]
        return x, y

# Create dataset
print("\nEXAMPLE: Text Dataset")
print("=" * 60)

dataset = TextDataset(sample_text, tokenizer, seq_len=10)
print(f"Dataset size: {len(dataset)} samples")
print(f"Sequence length: 10 characters")

# Show example
x, y = dataset[0]
print(f"\nExample training pair:")
print(f"  Input:  {x.tolist()}")
print(f"  Target: {y.tolist()}")
print(f"\n  Input text:  '{tokenizer.decode(x.tolist())}'")
print(f"  Target text: '{tokenizer.decode(y.tolist())}'")
print("\n  Notice: Target is shifted by 1 position!")
print("  Model learns to predict the next character.")
print("\n✓ Dataset ready for training!")

In [None]:
# Character-level Tokenizer
class CharTokenizer:
    """Simple character-level tokenizer."""
    
    def __init__(self, text):
        # Find all unique characters
        self.chars = sorted(list(set(text)))
        self.vocab_size = len(self.chars)
        
        # Create mappings
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}
    
    def encode(self, text):
        """Convert text to token indices."""
        return [self.char_to_idx[ch] for ch in text]
    
    def decode(self, indices):
        """Convert token indices back to text."""
        return ''.join([self.idx_to_char[i] for i in indices])

# Example: Create tokenizer with sample text
sample_text = "Hello World! This is a LLaMA model."
tokenizer = CharTokenizer(sample_text)

print("EXAMPLE: Character Tokenizer")
print("=" * 60)
print(f"Sample text: '{sample_text}'")
print(f"\nVocabulary size: {tokenizer.vocab_size} unique characters")
print(f"Characters: {tokenizer.chars[:20]}...")

# Encode example
test_text = "Hello"
encoded = tokenizer.encode(test_text)
decoded = tokenizer.decode(encoded)

print(f"\nEncoding test:")
print(f"  Original: '{test_text}'")
print(f"  Encoded:  {encoded}")
print(f"  Decoded:  '{decoded}'")
print(f"  Match: {test_text == decoded} ✓")
print("\n✓ Tokenizer working!")

---

# Part 5: Prepare Training Data

## Step 10: Tokenizer and Dataset

Before training, we need to prepare our text data!

**What's a Tokenizer?** Converts text ↔ numbers
- "Hello" → [34, 56, 67, 67, 78]
- Model only understands numbers!

**For this demo:** We'll use character-level tokenization (each character = one token)
- Simple and fast for learning
- Real LLaMA uses more sophisticated tokenization (BPE)

**Dataset:** Creates training pairs
- Input: "Hello worl"
- Target: "ello world" (shifted by 1)
- Model learns to predict next character!
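
The shifted input/target pair can be reproduced in a few lines of plain Python. This is a standalone sketch of the idea, not the notebook's `CharTokenizer` or `TextDataset`; the variable names are chosen only for this example.

```python
text = "Hello world"

# Build the vocabulary from the unique characters, as a character-level tokenizer would
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

def encode(s):
    """Text -> list of token indices."""
    return [char_to_idx[ch] for ch in s]

def decode(ids):
    """List of token indices -> text."""
    return "".join(idx_to_char[i] for i in ids)

tokens = encode(text)

seq_len = 10
x = tokens[0:seq_len]      # input window
y = tokens[1:seq_len + 1]  # the same window shifted one position to the right

print(f"Input:  {decode(x)!r}")
print(f"Target: {decode(y)!r}")
```

At every position `i`, the target `y[i]` is the character that follows `x[i]`, so a single window supplies `seq_len` next-character predictions at once.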

In [None]:
# Create and test the complete model
print("CREATING COMPLETE LLaMA MODEL")
print("=" * 60)

model = LLaMA(config).to(device)

# Count parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {n_params:,}")

# Test forward pass
test_tokens = torch.randint(0, config.vocab_size, (2, 10)).to(device)
print(f"\nInput tokens: {test_tokens.shape}")
print(f"  (2 sentences, 10 tokens)")

mask = torch.tril(torch.ones(10, 10)).unsqueeze(0).unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(test_tokens, mask)

print(f"\nOutput logits: {logits.shape}")
print(f"  (2 sentences, 10 positions, {config.vocab_size} possible next tokens)")

print(f"\nModel Architecture:")
print(f"  • Embedding layer: {config.vocab_size:,} tokens")
print(f"  • {config.n_layers} transformer blocks")
print(f"  • {config.n_heads} attention heads per block")
print(f"  • {config.d_model} model dimension")
print(f"  • {config.d_ff:,} FFN dimension")
print("\n✓ Complete LLaMA model created successfully!")

In [None]:
# Complete LLaMA Model
class LLaMA(nn.Module):
    """Complete LLaMA language model."""
    
    def __init__(self, config):
        super().__init__()
        self.config = config
        
        # Token embedding layer
        self.tok_embeddings = nn.Embedding(config.vocab_size, config.d_model)
        
        # Stack of transformer blocks
        self.layers = nn.ModuleList([
            LLaMABlock(config) for _ in range(config.n_layers)
        ])
        
        # Final normalization
        self.norm = RMSNorm(config.d_model, config.rms_norm_eps)
        
        # Output projection (predict next token)
        self.output = nn.Linear(config.d_model, config.vocab_size, bias=False)
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """Initialize weights with small random values."""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, tokens, mask=None):
        """
        Forward pass through the model.
        
        Inputs:
            tokens: (batch, seq_len) - Token indices
            mask: Optional attention mask
        
        Returns:
            logits: (batch, seq_len, vocab_size) - Predictions
        """
        # Embed tokens
        x = self.tok_embeddings(tokens)
        
        # Process through all layers
        for layer in self.layers:
            x = layer(x, mask)
        
        # Final normalization and projection
        x = self.norm(x)
        logits = self.output(x)
        
        return logits

---

# Part 4: Complete LLaMA Model

## Step 9: Build the Full Model

Time to assemble everything! The complete LLaMA model has:

**Architecture:**
1. **Token Embedding** → Convert token IDs to vectors
2. **N Transformer Blocks** → Process and understand (we use 6)
3. **Final Norm** → Stabilize output
4. **Output Head** → Predict the next token

**Information Flow:**
```
Text → Embedding → Block₁ → Block₂ → ... → Block₆ → Norm → Prediction
```

Think of it like an assembly line - each block refines the understanding!

In [None]:
# Complete Transformer Block
class LLaMABlock(nn.Module):
    """Single transformer block with attention and feed-forward."""
    
    def __init__(self, config):
        super().__init__()
        # Attention components
        self.attention = GroupedQueryAttention(config)
        self.attention_norm = RMSNorm(config.d_model, config.rms_norm_eps)
        
        # Feed-forward components
        self.feed_forward = SwiGLU(config)
        self.ffn_norm = RMSNorm(config.d_model, config.rms_norm_eps)
    
    def forward(self, x, mask=None):
        """
        Process through attention and feed-forward.
        
        Uses pre-normalization and residual connections.
        """
        # Attention block with residual
        h = x + self.attention(self.attention_norm(x), mask)
        
        # Feed-forward block with residual
        out = h + self.feed_forward(self.ffn_norm(h))
        
        return out

# Test transformer block
print("EXAMPLE: Transformer Block")
print("=" * 60)

block = LLaMABlock(config).to(device)
test_input = torch.randn(2, 10, config.d_model).to(device)
mask = torch.tril(torch.ones(10, 10)).unsqueeze(0).unsqueeze(0).to(device)

print(f"Input:  {test_input.shape}")
output = block(test_input, mask)
print(f"Output: {output.shape}")

print(f"\nComponents in block:")
print(f"  ✓ Attention with {config.n_heads} heads")
print(f"  ✓ SwiGLU feed-forward")  
print(f"  ✓ RMSNorm (2x)")
print(f"  ✓ Residual connections (2x)")
print("\n✓ Complete transformer block working!")

## Step 8: Complete Transformer Block

Now we combine everything into a single **transformer block**!

**A Block Contains:**
1. **Attention Layer** → Looks at relationships between words
2. **Feed-Forward Layer** → Processes each word independently  
3. **Normalization** → Keeps numbers stable (before each layer)
4. **Residual Connections** → Helps information flow (adds input to output)

**Residual Connections:** Like a highway bypass - if the layer doesn't help, just skip it!

**Formula:** 
- x = x + Attention(Norm(x))
- x = x + FFN(Norm(x))

LLaMA stacks multiple blocks (we'll use 6) to build the complete model!
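
The "highway bypass" idea can be checked with a toy pre-norm residual block in plain Python. This sketch works on a single small vector and uses stand-in sublayers (`zero_sublayer`, `halve`) that are hypothetical, chosen only to show the behavior; the real block uses attention and SwiGLU in their place.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1 (no learnable weight here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def residual_block(x, sublayer):
    """Pre-norm residual: x + sublayer(norm(x))."""
    update = sublayer(rms_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

def zero_sublayer(h):
    """A sublayer that contributes nothing."""
    return [0.0 for _ in h]

def halve(h):
    """A stand-in sublayer that returns half of its (normalized) input."""
    return [0.5 * v for v in h]

x = [3.0, -4.0]
skipped = residual_block(x, zero_sublayer)  # sublayer adds nothing
updated = residual_block(x, halve)          # sublayer nudges the input

print("Skipped:", skipped)
print("Updated:", [round(v, 4) for v in updated])
```

When the sublayer outputs zeros the block returns its input unchanged, so a layer only has to learn a useful correction rather than reproduce the whole signal.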

In [None]:
# SwiGLU Feed-Forward Network
class SwiGLU(nn.Module):
    """Gated feed-forward network with SwiGLU activation."""
    
    def __init__(self, config):
        super().__init__()
        hidden_dim = config.d_ff
        
        # Three transformations
        self.w1 = nn.Linear(config.d_model, hidden_dim, bias=False)  # Gate
        self.w2 = nn.Linear(hidden_dim, config.d_model, bias=False)  # Output
        self.w3 = nn.Linear(config.d_model, hidden_dim, bias=False)  # Value
        
        self.dropout = nn.Dropout(config.dropout)
    
    def forward(self, x):
        """
        Apply SwiGLU transformation.
        
        Formula: w2(silu(w1(x)) * w3(x))
        where silu(x) = x * sigmoid(x)
        """
        # Gate path (with SiLU activation)
        gate = F.silu(self.w1(x))
        
        # Value path (no activation)
        value = self.w3(x)
        
        # Combine with gating
        hidden = gate * value
        
        # Project back to original dimension
        return self.dropout(self.w2(hidden))

# Example usage
print("EXAMPLE: SwiGLU Activation")
print("=" * 60)

swiglu = SwiGLU(config).to(device)
test_input = torch.randn(2, 10, config.d_model).to(device)

print(f"Input:  {test_input.shape}")
print(f"        (2 sentences, 10 words, {config.d_model} dims)")

output = swiglu(test_input)

print(f"\nOutput: {output.shape}")
print(f"        (same shape, but transformed)")
print(f"\nSample input values:  {test_input[0, 0, :5]}")
print(f"Sample output values: {output[0, 0, :5]}")
print("\n✓ SwiGLU adds complex non-linear transformations!")

## Step 7: SwiGLU Activation Function

**What's an Activation Function?** It adds "non-linearity" so models can learn complex patterns.

**Why SwiGLU?** Gated feed-forward layers like SwiGLU have been reported to give better transformer quality than plain ReLU or GELU layers, and LLaMA adopts it.

**The Magic:** Uses a "gate" mechanism
- One path processes the information
- Another path decides what to let through
- Like a bouncer at a club - decides what gets in!

**Technical:** SwiGLU = Swish(xW) ⊗ (xV)
- ⊗ means element-wise multiplication
- W and V are learnable transformations
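
A small numeric sketch of the gate, in plain Python: it applies Swish (also called SiLU) to made-up gate pre-activations and multiplies element-wise by made-up values. The inputs stand in for `xW` and `xV`; the final output projection is left out.

```python
import math

def silu(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu(gate_pre, value):
    """Element-wise gate: silu(xW) * (xV)."""
    return [silu(g) * v for g, v in zip(gate_pre, value)]

gate_pre = [-4.0, 0.0, 4.0]  # hypothetical xW: strongly closed, neutral, strongly open
value = [1.0, 1.0, 1.0]      # hypothetical xV

out = swiglu(gate_pre, value)

for g, o in zip(gate_pre, out):
    print(f"gate input {g:+.1f} -> output {o:+.4f}")
```

A strongly negative gate input lets almost nothing through, zero blocks the value entirely, and a strongly positive gate input passes the value scaled by roughly the gate input itself.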

In [None]:
# Test Grouped Query Attention
print("EXAMPLE: Testing Grouped Query Attention")
print("=" * 60)

gqa = GroupedQueryAttention(config).to(device)
test_x = torch.randn(2, 10, config.d_model).to(device)

print(f"Input: {test_x.shape}")
print(f"       (2 sentences, 10 words, {config.d_model} dimensions)")

# Create causal mask (prevent looking at future words)
causal_mask = torch.tril(torch.ones(10, 10)).unsqueeze(0).unsqueeze(0).to(device)
print(f"\nCausal mask: {causal_mask.shape}")
print(f"First 5x5 of mask:\n{causal_mask[0, 0, :5, :5].int()}")
print("(1 = can attend, 0 = cannot attend)")

output = gqa(test_x, mask=causal_mask)

print(f"\nOutput: {output.shape}")
print(f"\nK/V Head Reduction:")
print(f"  Query heads:     {gqa.n_heads}")
print(f"  Key/Value heads: {gqa.n_kv_heads}")
print(f"  Fewer K/V heads: {(1 - gqa.n_kv_heads/gqa.n_heads)*100:.0f}%")
print("\n✓ Grouped Query Attention working!")

In [None]:
# Grouped Query Attention - Part 1: Setup
class GroupedQueryAttention(nn.Module):
    """Memory-efficient attention with shared K/V heads."""
    
    def __init__(self, config):
        super().__init__()
        self.n_heads = config.n_heads        # Query heads (8)
        self.n_kv_heads = config.n_kv_heads  # K/V heads (4)
        self.head_dim = config.head_dim
        self.d_model = config.d_model
        
        # How many Q heads per K/V head?
        self.n_rep = self.n_heads // self.n_kv_heads
        
        # Projection layers (note: K/V have fewer heads!)
        self.wq = nn.Linear(self.d_model, self.n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(self.d_model, self.n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(self.d_model, self.n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(self.n_heads * self.head_dim, self.d_model, bias=False)
        
        self.dropout = nn.Dropout(config.dropout)
        
        # Store RoPE frequencies
        self.register_buffer(
            "rope_freqs",
            precompute_rope_freqs(self.head_dim, config.max_seq_len, config.rope_theta)
        )
    
    def forward(self, x, mask=None):
        """Apply grouped query attention."""
        batch_size, seq_len, _ = x.shape
        
        # Project to Q, K, V
        q = self.wq(x).view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(batch_size, seq_len, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(batch_size, seq_len, self.n_kv_heads, self.head_dim).transpose(1, 2)
        
        # Apply RoPE
        q = apply_rope(q, self.rope_freqs)
        k = apply_rope(k, self.rope_freqs)
        
        # Repeat K/V to match Q heads
        if self.n_rep > 1:
            k = k.repeat_interleave(self.n_rep, dim=1)
            v = v.repeat_interleave(self.n_rep, dim=1)
        
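        # Calculate attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is None:
            # Default to a causal mask: training and generation call the model
            # without one, and positions must not attend to future tokens
            mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))
        scores = scores.masked_fill(mask == 0, float('-inf'))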
        
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        
        # Apply to values and combine heads
        output = torch.matmul(attn_weights, v)
        output = output.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
        
        return self.wo(output)

## Step 6: Grouped Query Attention (GQA)

**The Big Idea:** Attention helps models focus on important words, but uses lots of memory!

**The Problem:** Standard multi-head attention computes its own Query, Key, and Value projection for EACH head.
- 8 heads = 24 separate projections = lots of memory!

**The Smart Solution (GQA):** Share Key and Value heads across multiple Query heads.
- 8 Query heads, but only 4 Key/Value heads
- Each K/V pair serves 2 Q heads
- **Result:** The K/V projections (and any K/V cache kept during generation) are 50% smaller!

**Real-world analogy:**
- Old way: 8 students, each with their own textbook ($$$)
- New way: 8 students sharing 4 textbooks (saves money, same learning!)
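
To make the sharing concrete, here is a small plain-Python sketch (no PyTorch needed) that maps each query head to the K/V head it reads from and counts the projection weights. The sizes `d_model=256`, `n_heads=8`, and `n_kv_heads=4` come from this notebook's config; the helper names `kv_head_for` and `attention_weight_counts` are made up for illustration and are not part of the model code.

```python
def kv_head_for(q_head, n_heads, n_kv_heads):
    """Return the index of the K/V head shared by query head `q_head`."""
    n_rep = n_heads // n_kv_heads  # query heads per K/V head
    return q_head // n_rep


def attention_weight_counts(d_model, n_heads, n_kv_heads):
    """Count weights in the four bias-free projections (wq, wk, wv, wo)."""
    head_dim = d_model // n_heads
    wq = d_model * n_heads * head_dim
    wk = d_model * n_kv_heads * head_dim
    wv = d_model * n_kv_heads * head_dim
    wo = n_heads * head_dim * d_model
    return {"wq": wq, "wk": wk, "wv": wv, "wo": wo}


# Which K/V head does each of the 8 query heads use?
mapping = [kv_head_for(h, n_heads=8, n_kv_heads=4) for h in range(8)]
print("Query head -> K/V head:", mapping)

mha = attention_weight_counts(256, n_heads=8, n_kv_heads=8)  # standard multi-head
gqa = attention_weight_counts(256, n_heads=8, n_kv_heads=4)  # grouped query

kv_saving = 1 - (gqa["wk"] + gqa["wv"]) / (mha["wk"] + mha["wv"])
total_saving = 1 - sum(gqa.values()) / sum(mha.values())
print(f"K/V weights saved:           {kv_saving:.0%}")
print(f"All attention weights saved: {total_saving:.0%}")
```

The halving applies to the Key and Value side only, which is also what shrinks the K/V cache during generation. The Query and output projections keep their full size, so the attention block as a whole gets smaller by a quarter in this configuration.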

In [None]:
# RMSNorm implementation
class RMSNorm(nn.Module):
    """Root Mean Square Normalization."""
    
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps  # Prevent division by zero
        self.weight = nn.Parameter(torch.ones(dim))  # Learnable scale
    
    def forward(self, x):
        """
        Normalize input.
        
        Input:  x with shape (..., dim)
        Output: Normalized x (same shape)
        """
        # Calculate RMS: sqrt(mean(x²) + eps)
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        # Normalize and scale
        return self.weight * (x / rms)

# Example usage
print("EXAMPLE: Using RMSNorm")
print("=" * 60)

rms_norm = RMSNorm(dim=8)
test_input = torch.randn(2, 5, 8)  # 2 sentences, 5 words, 8 dims

print("BEFORE RMSNorm:")
print(f"  Shape: {test_input.shape}")
print(f"  First word values: {test_input[0, 0, :4]}")
print(f"  Mean: {test_input[0, 0].mean():.4f}")
print(f"  Std:  {test_input[0, 0].std():.4f}")

output = rms_norm(test_input)

print("\nAFTER RMSNorm:")
print(f"  Shape: {output.shape}")
print(f"  First word values: {output[0, 0, :4]}")
print(f"  RMS (should be ~1.0): {torch.sqrt((output[0, 0]**2).mean()):.4f}")
print("\n✓ Numbers are now balanced and stable!")

## Step 5: RMSNorm - Root Mean Square Normalization

**The Problem:** During training, numbers can become too big or too small, causing issues.

**The Solution:** RMSNorm rescales all numbers to a reasonable range.

**Simple analogy:** Like adjusting audio volume - too loud is distorted, too quiet can't be heard. RMSNorm keeps the "volume" just right!

**How it works:**
1. Calculate the root mean square (RMS) of the values in each vector
2. Divide every value by that RMS, so the vector's RMS becomes about 1
3. Multiply by a learnable per-dimension scale factor

**Why RMSNorm?** Faster than LayerNorm (skips mean-centering step)
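
A tiny worked example can make the difference from LayerNorm visible. The sketch below uses plain Python lists instead of tensors so every step of the arithmetic is easy to follow; `rms_norm` and `layer_norm` are illustrative helpers written for this comparison, not the classes the model uses, and the LayerNorm version omits its usual learnable scale and bias.

```python
import math


def rms_norm(values, weight=None, eps=1e-6):
    """RMSNorm over a flat list: x / sqrt(mean(x^2) + eps), then scale."""
    if weight is None:
        weight = [1.0] * len(values)  # same starting value as the learnable scale
    mean_sq = sum(v * v for v in values) / len(values)
    rms = math.sqrt(mean_sq + eps)
    return [w * v / rms for w, v in zip(weight, values)]


def layer_norm(values, eps=1e-6):
    """LayerNorm without scale/bias: subtract the mean, divide by the std."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


x = [3.0, 4.0]
print("RMSNorm:  ", [round(v, 4) for v in rms_norm(x)])
print("LayerNorm:", [round(v, 4) for v in layer_norm(x)])
```

RMSNorm only rescales, so the 3:4 ratio of the input survives. LayerNorm first shifts the values so they are centered on zero, which is the extra step RMSNorm skips.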

In [None]:
# Apply RoPE rotations to data
def apply_rope(x, freqs):
    """
    Apply rotations to word representations.
    
    Inputs:
        x: Tensor (batch, n_heads, seq_len, head_dim)
        freqs: Precomputed rotation values
    
    Returns:
        Rotated tensor (same shape as input)
    """
    batch, n_heads, seq_len, head_dim = x.shape
    
    # Separate even/odd dimensions
    x_reshaped = x.reshape(batch, n_heads, seq_len, head_dim // 2, 2)
    x_even = x_reshaped[..., 0]  # Elements at indices 0, 2, 4, ...
    x_odd = x_reshaped[..., 1]   # Elements at indices 1, 3, 5, ...
    
    # Get cos/sin values
    cos = freqs[:seq_len, :, 0].unsqueeze(0).unsqueeze(0)
    sin = freqs[:seq_len, :, 1].unsqueeze(0).unsqueeze(0)
    
    # Apply 2D rotation
    rotated_even = x_even * cos - x_odd * sin
    rotated_odd = x_even * sin + x_odd * cos
    
    # Combine back
    rotated = torch.stack([rotated_even, rotated_odd], dim=-1)
    return rotated.reshape(batch, n_heads, seq_len, head_dim)

# Test RoPE
print("\nEXAMPLE: Applying RoPE")
print("=" * 60)
test_data = torch.randn(2, 4, 10, 32)
print(f"Input:  {test_data.shape}")
print(f"        (2 sentences, 4 heads, 10 words, 32 dimensions)")

rotated = apply_rope(test_data, rope_freqs)
print(f"\nOutput: {rotated.shape}")
print(f"        (same shape, but values rotated based on position)")
print(f"\nChanged: {not torch.equal(test_data, rotated)}")
print("\n✓ RoPE applied! Model now knows word positions.")

In [None]:
# Create RoPE rotation frequencies
def precompute_rope_freqs(head_dim, max_seq_len, theta=10000.0):
    """
    Precompute rotation frequencies for each position.
    
    Inputs:
        head_dim: Size of each attention head (e.g., 32)
        max_seq_len: Max sequence length (e.g., 128)
        theta: Rotation speed control (default: 10000)
    
    Returns:
        Tensor with cos/sin values for each position
    """
    # Calculate frequencies
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    
    # Position indices: 0, 1, 2, ..., max_seq_len-1
    positions = torch.arange(max_seq_len).float()
    
    # Outer product: position × frequency
    freqs = torch.outer(positions, freqs)
    
    # Convert to cos and sin (for rotation)
    freqs_cos = torch.cos(freqs)
    freqs_sin = torch.sin(freqs)
    
    return torch.stack([freqs_cos, freqs_sin], dim=-1)

# Example usage
print("EXAMPLE: Creating RoPE Frequencies")
print("=" * 60)
rope_freqs = precompute_rope_freqs(head_dim=32, max_seq_len=10)
print(f"Input:  head_dim=32, max_seq_len=10")
print(f"Output: shape {rope_freqs.shape}")
print(f"        (10 positions, 16 freq pairs, 2 values [cos,sin])")
print(f"\nPosition 0 (first 3 frequency pairs):")
print(rope_freqs[0, :3])
print(f"\nPosition 5 (first 3 frequency pairs):")
print(rope_freqs[5, :3])
print("\n✓ Each position has unique rotation values!")

---

# Part 3: Build Model Components

Now we'll build each piece of the LLaMA model, one at a time, with examples!

## Step 4: RoPE - Rotary Position Embeddings

**The Problem:** Attention by itself is order-blind: without position information, it treats a sentence as an unordered set of words.
- "Dog bites man" ≠ "Man bites dog"

**The Solution:** RoPE gives each position a unique "rotation" so the model knows word order.

**How it works:**
1. Each position (0, 1, 2, ...) gets a rotation angle
2. We apply these rotations to the word representations
3. The model can now tell which words come before/after others

**Why RoPE is better:** After rotation, the attention score between two words depends on how far apart they are rather than on their absolute positions, and RoPE adds no learned parameters.
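
The sketch below checks one consequence of these rotations in plain Python: when both the query and the key are rotated by their own position, their dot product changes only with the distance between the two positions. It works on a single 2D pair with one frequency to keep the math readable; `rotate_pair` and `rope_score` are hypothetical helper names, and `freq=1.0` is an arbitrary choice for the demo.

```python
import math


def rotate_pair(x_even, x_odd, position, freq):
    """Rotate one (even, odd) pair by the angle position * freq."""
    angle = position * freq
    cos, sin = math.cos(angle), math.sin(angle)
    return (x_even * cos - x_odd * sin, x_even * sin + x_odd * cos)


def rope_score(q, k, q_pos, k_pos, freq=1.0):
    """Dot product of a query and key after rotating each to its position."""
    q_rot = rotate_pair(q[0], q[1], q_pos, freq)
    k_rot = rotate_pair(k[0], k[1], k_pos, freq)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]


q, k = (1.0, 2.0), (0.5, -1.5)

# Same offset (query is 3 positions after the key), different absolute positions
score_near = rope_score(q, k, q_pos=5, k_pos=2)
score_far = rope_score(q, k, q_pos=105, k_pos=102)
# A different offset (query is 1 position after the key)
score_other = rope_score(q, k, q_pos=5, k_pos=4)

print(f"offset 3 at positions (5, 2):     {score_near:.6f}")
print(f"offset 3 at positions (105, 102): {score_far:.6f}")
print(f"offset 1 at positions (5, 4):     {score_other:.6f}")
```

The first two scores agree up to floating-point rounding, while the third differs. The full implementation applies the same rotation to every even/odd pair of a head, each pair with its own frequency.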

In [None]:
# Define model configuration
@dataclass
class LLaMAConfig:
    """All settings for our LLaMA model."""
    
    # Model architecture
    vocab_size: int = 512       # Number of unique characters (updated later)
    d_model: int = 256          # Size of word representations
    n_layers: int = 6           # Number of transformer blocks
    n_heads: int = 8            # Attention heads for Queries
    n_kv_heads: int = 4         # Attention heads for Keys/Values (saves memory!)
    d_ff: int = 1024            # Feed-forward network size
    
    # Training settings
    max_seq_len: int = 128      # Maximum text length
    dropout: float = 0.1        # Regularization (prevents overfitting)
    
    # Technical parameters
    rope_theta: float = 10000.0       # RoPE rotation parameter
    rms_norm_eps: float = 1e-6        # Numerical stability constant
    
    def __post_init__(self):
        """Validate settings."""
        assert self.d_model % self.n_heads == 0
        assert self.n_heads % self.n_kv_heads == 0
        self.head_dim = self.d_model // self.n_heads

# Create configuration
config = LLaMAConfig()

# Display settings
print("MODEL CONFIGURATION")
print("=" * 60)
print(f"Vocabulary size:      {config.vocab_size:>6,} characters")
print(f"Model dimension:      {config.d_model:>6}")
print(f"Number of layers:     {config.n_layers:>6}")
print(f"Attention heads (Q):  {config.n_heads:>6}")
print(f"Attention heads (KV): {config.n_kv_heads:>6}")
print(f"Head dimension:       {config.head_dim:>6}")
print(f"FFN dimension:        {config.d_ff:>6,}")
print(f"Max sequence length:  {config.max_seq_len:>6}")
print("=" * 60)
print("\n💡 These are small values for fast training!")
print("   Real LLaMA uses much bigger numbers.\n")

---

# Part 2: Model Configuration

## Step 3: Define Model Settings

Before building, we need to decide the model's "size" and settings.

**Think of it like building a house:**
- How many floors? (layers)
- How big are the rooms? (dimensions)
- How many windows? (attention heads)

We'll use small numbers so training is fast on your computer!
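
To get a feel for what "small" means, the following sketch estimates the parameter count these settings imply. It rests on assumptions about parts of the model not defined in this section: a SwiGLU feed-forward block with three bias-free matrices, two RMSNorm layers per block plus one final norm, and an output projection that does not share weights with the embedding table. `estimate_params` is a hypothetical helper, and the real count may differ if the model is built otherwise.

```python
def estimate_params(vocab_size=512, d_model=256, n_layers=6,
                    n_heads=8, n_kv_heads=4, d_ff=1024):
    """Rough parameter count for a LLaMA-style model (assumptions in comments)."""
    head_dim = d_model // n_heads

    embedding = vocab_size * d_model
    # Attention: wq and wo are full size; wk and wv shrink with n_kv_heads
    attention = (2 * d_model * n_heads * head_dim
                 + 2 * d_model * n_kv_heads * head_dim)
    # Assumption: SwiGLU feed-forward with three bias-free matrices
    feed_forward = 3 * d_model * d_ff
    # Assumption: two RMSNorm layers per block, each with d_model weights
    norms = 2 * d_model
    per_layer = attention + feed_forward + norms

    final_norm = d_model
    # Assumption: output projection is NOT tied to the embedding table
    output_head = d_model * vocab_size

    return embedding + n_layers * per_layer + final_norm + output_head


total = estimate_params()
print(f"Estimated parameters: {total:,}")
print(f"LLaMA 7B is roughly {7_000_000_000 / total:,.0f}x larger")
```

Under these assumptions most of the weights sit in the feed-forward matrices, so `d_ff` and `n_layers` are the settings that move the total the most.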

In [None]:
# Setup device and set random seed for reproducibility
torch.manual_seed(42)

# Detect available hardware
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("🍎 Using: Apple Silicon GPU (MPS)")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"🎮 Using: NVIDIA GPU - {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("💻 Using: CPU (slower, but will work)")

print(f"\nDevice: {device}")

# Test the device with a simple tensor
test = torch.ones(3, 3).to(device)
print(f"\nTest tensor created on {device}:")
print(test)
print("\n✓ Device is working correctly!")

## Step 2: Setup Computing Device

Training speed depends heavily on the hardware the model runs on:
- **CPU** → Your computer's main processor (slower)
- **NVIDIA GPU (CUDA)** → Graphics card (much faster!)
- **Apple Silicon (MPS)** → M-series chips such as M1/M2/M3 (also fast!)

Let's detect what you have and use the best option.

In [None]:
# Import all required libraries
import math
import time
import sys
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from dataclasses import dataclass
from typing import Optional, Tuple

# Verify imports worked
print("=" * 60)
print("✓ ALL IMPORTS SUCCESSFUL!")
print("=" * 60)
print(f"PyTorch version: {torch.__version__}")
print(f"Python version:  {sys.version.split()[0]}")
print("=" * 60)

---

# Part 1: Setup and Imports

## Step 1: Import Required Libraries

First, we need to load all the Python tools (libraries) we'll use.

**What each library does:**
- `torch` → Main deep learning framework
- `torch.nn` → Neural network building blocks
- `torch.nn.functional` → Stateless functions such as softmax and activations
- `math` → Basic math functions
- `time` → Track training duration
- `sys` → Report the Python version
- `Dataset, DataLoader` → Handle training data
- `dataclass` → Easy configuration setup
- `typing` → Type hints for clarity
