# Exercise 1: Training a GPT-style Transformer

In this exercise, you'll use the Hugging Face Transformers library to configure, train, and evaluate a GPT-style language model on the TinyStories dataset.

**Learning Objectives:**
- Understand GPT model configuration parameters
- Use the Transformers library for language modeling
- Configure model architecture (attention heads, layers, embedding size)
- Train a language model for next-token prediction
- Generate text using different decoding strategies
- Evaluate model performance with perplexity

**Key Concepts:**
- Model configuration vs model weights
- Attention heads and their relationship to model dimension
- Training arguments and hyperparameters
- Decoding strategies (greedy, sampling, beam search)

## Part 1: Setup and Load Data

In [None]:
# Install required packages
!pip install torch datasets transformers matplotlib tqdm accelerate -q

In [None]:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    GPT2Config,
    GPT2LMHeadModel,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments
)
import math
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
# Load TinyStories dataset
print("Loading TinyStories dataset...")
dataset = load_dataset("roneneldan/TinyStories", split="train[:10000]")
print(f"Loaded {len(dataset)} stories")

# Use GPT-2 tokenizer (or you can load your custom BPE tokenizer from Exercise 0)
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

print(f"Tokenizer vocabulary size: {len(tokenizer)}")
print(f"Sample tokens: {list(tokenizer.get_vocab().keys())[:10]}")

## Part 2: Understanding Model Configuration

Before creating a model, you need to understand the key configuration parameters:

**Critical Parameters:**
- `vocab_size`: Size of the tokenizer vocabulary
- `n_positions`: Maximum sequence length the model can handle
- `n_embd`: Embedding dimension (hidden size)
- `n_layer`: Number of transformer blocks
- `n_head`: Number of attention heads
- `n_inner`: Dimension of feed-forward inner layer (typically 4 × n_embd)

**Important Constraint:** `n_embd` must be divisible by `n_head`
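
The divisibility constraint can be checked before building a config. The sketch below uses a hypothetical helper, `check_head_config`, which is not part of the Transformers API. The values are illustrative only and differ from the ones in the questions that follow.

```python
def check_head_config(n_embd: int, n_head: int) -> int:
    """Return the per-head dimension, or raise if the split is uneven.

    Hypothetical helper for illustration; not part of the Transformers API.
    """
    if n_embd % n_head != 0:
        raise ValueError(f"n_embd={n_embd} is not divisible by n_head={n_head}")
    # Each head works on an equal-sized slice of the embedding
    return n_embd // n_head


# Illustrative values only
for n_embd, n_head in [(384, 6), (512, 16), (384, 5)]:
    try:
        head_dim = check_head_config(n_embd, n_head)
        print(f"n_embd={n_embd}, n_head={n_head} -> head_dim={head_dim}")
    except ValueError as err:
        print(f"Invalid: {err}")
```

Running a check like this before filling in `GPT2Config` catches an uneven split early, rather than when the model is created.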

### Question 1: Understanding Attention Heads

**Before continuing, answer these questions:**

1. **If `n_embd = 256` and `n_head = 8`, what is the dimension of each attention head?**
   - YOUR ANSWER HERE:
   - Formula: head_dim = n_embd / n_head

2. **Why must `n_embd` be divisible by `n_head`?**
   - YOUR ANSWER HERE:

3. **What happens if you increase the number of heads while keeping `n_embd` constant?**
   - YOUR ANSWER HERE:

## Part 3: Configure Your GPT Model

Now configure your model architecture. You'll need to fill in the parameters.

In [None]:
# TODO: Configure your GPT model
# Fill in the parameters below based on your understanding

config = GPT2Config(
    vocab_size=len(tokenizer),              # Fixed: must match tokenizer
    n_positions=128,                         # Maximum sequence length
    
    # TODO: Fill in these parameters
    n_embd=None,        # Embedding dimension (try: 256, 384, 512)
    n_layer=None,       # Number of transformer blocks (try: 4, 6, 8)
    n_head=None,        # Number of attention heads (try: 4, 8, 16)
                        # REMEMBER: n_embd must be divisible by n_head!
    
    n_inner=None,       # Feed-forward dimension (typically 4 × n_embd)
    
    # Additional parameters (already set)
    resid_pdrop=0.1,    # Residual dropout
    embd_pdrop=0.1,     # Embedding dropout  
    attn_pdrop=0.1,     # Attention dropout
    activation_function="gelu_new"
)

print("Model Configuration:")
print(f"  Vocabulary size: {config.vocab_size}")
print(f"  Embedding dimension: {config.n_embd}")
print(f"  Number of layers: {config.n_layer}")
print(f"  Number of heads: {config.n_head}")
print(f"  Head dimension: {config.n_embd // config.n_head if config.n_head else 'N/A'}")
print(f"  Feed-forward dimension: {config.n_inner}")
print(f"  Max sequence length: {config.n_positions}")

### Question 2: Model Size Trade-offs

**Answer these questions about your configuration:**

1. **What is the relationship between `n_embd`, `n_layer`, `n_head` and model capacity?**
   - YOUR ANSWER HERE:

2. **Why is `n_inner` typically set to 4 × `n_embd`?**
   - YOUR ANSWER HERE:

3. **What are the trade-offs between a small model (e.g., n_embd=128, n_layer=4) vs a large model (e.g., n_embd=512, n_layer=12)?**
   - YOUR ANSWER HERE:
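
To put numbers on the size trade-offs, the sketch below estimates the parameter count of a GPT-2 style model from its configuration. `estimate_gpt2_params` is a hypothetical helper. It assumes the standard GPT-2 layout: tied input/output embeddings, biases on all linear layers, two layer norms per block plus a final one. Compare its output with the count printed in Part 4 rather than treating it as authoritative.

```python
def estimate_gpt2_params(vocab_size, n_positions, n_embd, n_layer, n_inner=None):
    """Rough parameter count for a GPT-2 style model (hypothetical helper)."""
    if n_inner is None:
        n_inner = 4 * n_embd  # common default for the feed-forward width

    # Token and position embeddings (output head assumed tied to token embeddings)
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Attention: fused Q/K/V projection plus output projection, with biases
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to n_inner, then project back, with biases
    mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)
    # Two layer norms per block, each with a weight and a bias vector
    layer_norms = 2 * (2 * n_embd)
    final_norm = 2 * n_embd

    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm


# The small and large examples from the question above, with this notebook's
# GPT-2 vocabulary (50,257 tokens) and n_positions=128
for n_embd, n_layer in [(128, 4), (512, 12)]:
    total = estimate_gpt2_params(50257, 128, n_embd, n_layer)
    print(f"n_embd={n_embd}, n_layer={n_layer}: ~{total:,} parameters")
```

`n_head` does not appear in the estimate: under these assumptions, the number of heads changes how the embedding is split, not how many weights there are. For small `n_embd`, the token embedding table accounts for most of the total.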

## Part 4: Create the Model

Now instantiate the model with your configuration.

In [None]:
# Create model from config
model = GPT2LMHeadModel(config)
model = model.to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\nModel created successfully!")
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"Model size: ~{total_params * 4 / 1024**2:.2f} MB (fp32)")

# Show model architecture
print(f"\nModel architecture:")
print(model)

## Part 5: Data Preparation

Prepare the dataset for causal language modeling (predicting next tokens).
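
The sketch below shows what "predicting next tokens" means for a single sequence, using plain Python lists. The token IDs are made up and `next_token_pairs` is a hypothetical helper. It assumes the usual convention for causal LMs in Transformers: `labels` starts as a copy of `input_ids`, padded positions are marked `-100`, and the one-position shift is applied when the loss is computed.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy ignores by default


def next_token_pairs(input_ids, labels):
    """Pair each context prefix with the token it is trained to predict."""
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]  # position t predicts the token at t + 1
        if target == IGNORE_INDEX:
            continue  # padded positions contribute nothing to the loss
        pairs.append((input_ids[: t + 1], target))
    return pairs


# Made-up token IDs; the last position stands in for padding
input_ids = [11, 22, 33, 44, 0]
labels = [11, 22, 33, 44, IGNORE_INDEX]

for context, target in next_token_pairs(input_ids, labels):
    print(f"context={context} -> target={target}")
```

A sequence of length N yields at most N - 1 training targets, because the first token is never predicted from anything.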

In [None]:
# Tokenize the entire dataset
def tokenize_function(examples):
    """Tokenize text examples."""
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=128,
        padding=False,
        return_attention_mask=False
    )

print("Tokenizing dataset...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
    desc="Tokenizing"
)

print(f"Tokenized {len(tokenized_dataset)} examples")

# Show example
example = tokenized_dataset[0]
print(f"\nExample tokenized sequence:")
print(f"  Length: {len(example['input_ids'])}")
print(f"  First 20 tokens: {example['input_ids'][:20]}")
print(f"  Decoded: {tokenizer.decode(example['input_ids'][:50])}...")

In [None]:
# Create data collator for language modeling
# With mlm=False, labels are a copy of input_ids (padding positions set to -100);
# the model shifts them by one position internally when computing the loss
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # False = Causal LM (GPT), True = Masked LM (BERT)
)

print("Data collator created for causal language modeling")

### Question 3: Causal vs Masked Language Modeling

**Answer these questions:**

1. **What is the difference between causal LM (GPT) and masked LM (BERT)?**
   - YOUR ANSWER HERE:

2. **Why do we set `mlm=False` for GPT-style models?**
   - YOUR ANSWER HERE:

3. **How does the DataCollator create labels for next-token prediction?**
   - Hint: Compare input_ids and labels
   - YOUR ANSWER HERE:

## Part 6: Configure Training

Set up training hyperparameters.

In [None]:
# TODO: Configure training arguments
# Fill in key hyperparameters

training_args = TrainingArguments(
    output_dir="./gpt_tinystories",
    
    # TODO: Fill in these training hyperparameters
    num_train_epochs=None,           # Number of epochs (try: 3, 5)
    per_device_train_batch_size=None, # Batch size (try: 8, 16, 32)
    learning_rate=None,               # Learning rate (try: 5e-4, 3e-4, 1e-4)
    
    # Optimization settings
    warmup_steps=500,
    weight_decay=0.01,
    
    # Logging and saving
    logging_steps=100,
    save_steps=1000,
    save_total_limit=2,
    
    # Evaluation
    eval_strategy="no",              # Named `evaluation_strategy` in older transformers releases
    
    # Misc
    report_to="none",
    remove_unused_columns=False,
)

print("Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Warmup steps: {training_args.warmup_steps}")

### Question 4: Training Hyperparameters

**Answer these questions:**

1. **What is the relationship between batch size and training speed/memory usage?**
   - YOUR ANSWER HERE:

2. **What is the purpose of warmup steps in learning rate scheduling?**
   - YOUR ANSWER HERE:
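
To see how `warmup_steps` interacts with batch size and epoch count, the sketch below computes the number of optimizer steps and the learning rate at a few points. It assumes a linear warmup followed by linear decay to zero, a single device, and no gradient accumulation. `linear_warmup_decay` is a hypothetical helper, and the batch size, epoch count, and peak learning rate are example values, not recommendations.

```python
import math


def linear_warmup_decay(step, warmup_steps, total_steps, peak_lr):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# 10,000 examples as loaded in Part 1; the other values are assumptions
num_examples, batch_size, epochs = 10_000, 16, 3
warmup_steps, peak_lr = 500, 5e-4

total_steps = math.ceil(num_examples / batch_size) * epochs
print(f"Total optimizer steps: {total_steps}")

for step in (0, 250, 500, 1000, total_steps):
    lr = linear_warmup_decay(step, warmup_steps, total_steps, peak_lr)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

With a larger batch size or fewer epochs, the total step count shrinks while `warmup_steps=500` stays fixed, so warmup takes up a larger share of training. Check this before settling on your hyperparameters.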

## Part 7: Train the Model

Use the Hugging Face Trainer to train your model.

In [None]:
# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

print("Trainer created successfully!")
print(f"\nStarting training...")

In [None]:
# Train the model
train_result = trainer.train()

print("\nTraining completed!")
print(f"Training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"Average training loss: {train_result.metrics['train_loss']:.4f}")
print(f"Samples per second: {train_result.metrics['train_samples_per_second']:.2f}")

In [None]:
# Plot training loss
log_history = trainer.state.log_history
losses = [log['loss'] for log in log_history if 'loss' in log]
steps = [log['step'] for log in log_history if 'loss' in log]

plt.figure(figsize=(10, 5))
plt.plot(steps, losses)
plt.xlabel('Training Steps')
plt.ylabel('Loss')
plt.title('Training Loss Over Time')
plt.grid(alpha=0.3)
plt.show()

print(f"\nLoss decreased from {losses[0]:.4f} to {losses[-1]:.4f}")

## Part 8: Text Generation with Different Strategies

Now use your trained model to generate text using different decoding strategies.

### Strategy 1: Greedy Decoding

Always select the most likely next token.

In [None]:
def generate_text(prompt, strategy="greedy", max_length=100, **kwargs):
    """Generate text with different decoding strategies (max_length includes the prompt tokens)."""
    model.eval()
    
    # Encode prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
    
    # Generate based on strategy
    with torch.no_grad():
        if strategy == "greedy":
            output = model.generate(
                input_ids,
                max_length=max_length,
                pad_token_id=tokenizer.eos_token_id,
                do_sample=False  # Greedy decoding
            )
        elif strategy == "sampling":
            output = model.generate(
                input_ids,
                max_length=max_length,
                pad_token_id=tokenizer.eos_token_id,
                do_sample=True,
                temperature=kwargs.get('temperature', 1.0),
                top_k=kwargs.get('top_k', 50),
                top_p=kwargs.get('top_p', 0.95)
            )
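        elif strategy == "beam_search":
            output = model.generate(
                input_ids,
                max_length=max_length,
                pad_token_id=tokenizer.eos_token_id,
                num_beams=kwargs.get('num_beams', 5),
                early_stopping=True
            )
        else:
            # Fail early instead of hitting an undefined `output` below
            raise ValueError(f"Unknown strategy: {strategy!r}")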
    
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Test greedy decoding
prompts = [
    "Once upon a time",
    "There was a little girl",
    "In a magical forest"
]

print("Greedy Decoding (deterministic)")
print("=" * 70)
for prompt in prompts:
    generated = generate_text(prompt, strategy="greedy", max_length=80)
    print(f"\nPrompt: {prompt}")
    print(f"Generated: {generated}")
    print("-" * 70)

### Strategy 2: Sampling with Temperature

Sample from the probability distribution over next tokens.
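
The sketch below shows, on a handful of made-up logits, how temperature reshapes the next-token distribution and how top-k and top-p filtering restrict it before sampling. The helpers `softmax_with_temperature` and `top_k_top_p_filter` are hypothetical simplifications written in plain Python; they are not the implementation used by `model.generate`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Keep the top_k most likely tokens, then the smallest set reaching top_p mass."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    k_mass = sum(probs[i] for i in order)  # renormalize after the top-k cut
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
for temperature in (0.5, 1.0, 1.2):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: {[round(p, 3) for p in probs]}")

filtered = top_k_top_p_filter(softmax_with_temperature(logits), top_k=3, top_p=0.9)
print(f"Tokens kept after filtering: {sorted(filtered)}")
```

Lower temperatures concentrate probability on the highest-scoring token, and higher temperatures flatten the distribution. Filtering then removes the low-probability tail so that sampling cannot pick from it.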

In [None]:
# TODO: Experiment with different temperatures
temperatures = [0.5, 0.8, 1.2]  # Add more if you want

prompt = "Once upon a time"

print("Sampling with Different Temperatures")
print("=" * 70)
print(f"Prompt: {prompt}\n")

for temp in temperatures:
    print(f"Temperature = {temp}:")
    for i in range(2):  # Generate 2 samples per temperature
        generated = generate_text(
            prompt,
            strategy="sampling",
            max_length=80,
            temperature=temp,
            top_k=50,
            top_p=0.95
        )
        print(f"  Sample {i+1}: {generated}")
    print()

### Strategy 3: Beam Search

Keep track of multiple candidate sequences.
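
The toy example below shows why tracking several candidates can beat greedy decoding. `TOY_MODEL` is a hypothetical lookup table standing in for a language model: the next-token probabilities depend only on the previous token. The `beam_search` function is a simplified sketch with no length penalty or other refinements that a real `generate` call may apply.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "end": 0.25},
    "a": {"bird": 0.9, "end": 0.1},
    "cat": {"end": 1.0},
    "dog": {"end": 1.0},
    "bird": {"end": 1.0},
}


def beam_search(num_beams, max_steps=3):
    """Return the surviving (log_prob, tokens) candidates, best first."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "end":
                candidates.append((score, tokens))  # finished beams carry over
                continue
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the num_beams highest-scoring sequences
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams


for num_beams in (1, 2):
    score, tokens = beam_search(num_beams)[0]
    print(f"num_beams={num_beams}: {' '.join(tokens)} (prob={math.exp(score):.2f})")
```

With `num_beams=1` the search reduces to greedy decoding and commits to "the" because it is the most likely first token. With two beams, the "a" branch survives long enough for its high-probability continuation to win overall.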

In [None]:
# Test beam search
print("Beam Search (num_beams=5)")
print("=" * 70)

for prompt in prompts:
    generated = generate_text(
        prompt,
        strategy="beam_search",
        max_length=80,
        num_beams=5
    )
    print(f"\nPrompt: {prompt}")
    print(f"Generated: {generated}")
    print("-" * 70)

### Question 5: Generation Strategies

**Based on your experiments, answer these questions:**

1. **What is the difference between greedy decoding and sampling?**
   - YOUR ANSWER HERE:

2. **How does temperature affect the diversity and quality of generated text?**
   - Low temperature (e.g., 0.5): YOUR ANSWER HERE
   - High temperature (e.g., 1.2): YOUR ANSWER HERE

3. **What is beam search and how does it differ from greedy decoding?**
   - YOUR ANSWER HERE:

4. **What are `top_k` and `top_p` sampling? When would you use them?**
   - YOUR ANSWER HERE:

## Part 9: Model Evaluation with Perplexity

Compute perplexity to evaluate model performance.
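
As a small worked example, the sketch below computes perplexity directly from the probabilities a model assigns to the correct tokens. `perplexity_from_probs` is a hypothetical helper, and the probabilities and vocabulary size are made up for illustration.

```python
import math


def perplexity_from_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


vocab_size = 50  # hypothetical vocabulary size

# A model that guesses uniformly at random over the vocabulary
uniform = [1 / vocab_size] * 10
# A model that gives the correct token 90% probability every time
confident = [0.9] * 10

print(f"Uniform guesser:  {perplexity_from_probs(uniform):.2f}")
print(f"Confident model:  {perplexity_from_probs(confident):.2f}")
```

The uniform guesser's perplexity equals the vocabulary size, which gives a useful upper reference point when reading the value your own model produces.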

In [None]:
def compute_perplexity(model, dataset, batch_size=16, max_samples=1000):
    """Compute perplexity on a dataset."""
    model.eval()
    
    # Create dataloader
    dataloader = DataLoader(
        dataset.select(range(min(max_samples, len(dataset)))),
        batch_size=batch_size,
        collate_fn=data_collator
    )
    
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Computing perplexity"):
            # Move to device
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)
            
            # Forward pass
            outputs = model(input_ids, labels=labels)
            loss = outputs.loss
            
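            # Count scored tokens: the model shifts labels by one position, so the
            # first label in each sequence is never a target (padding is already -100)
            num_tokens = (labels[:, 1:] != -100).sum().item()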
            
            total_loss += loss.item() * num_tokens
            total_tokens += num_tokens
    
    # Compute perplexity
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    
    return perplexity, avg_loss

# Compute perplexity (on the training data here, so this measures fit, not generalization)
perplexity, avg_loss = compute_perplexity(model, tokenized_dataset, batch_size=16)

print(f"\nModel Evaluation:")
print(f"  Average Loss: {avg_loss:.4f}")
print(f"  Perplexity: {perplexity:.2f}")

### Question 6: Understanding Perplexity

**Answer these questions:**

1. **What is perplexity and what does it measure?**
   - YOUR ANSWER HERE:

2. **What does a lower perplexity indicate?**
   - YOUR ANSWER HERE:

3. **How is perplexity calculated from cross-entropy loss?**
   - Formula: YOUR ANSWER HERE
   - Interpretation: YOUR ANSWER HERE

## Part 10: Save Your Model

Save the trained model for later use.

In [None]:
# Save model and tokenizer
save_path = "./my_gpt_model"

model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model saved to '{save_path}'")
print(f"\nYou can load it later with:")
print(f"  model = GPT2LMHeadModel.from_pretrained('{save_path}')")
print(f"  tokenizer = AutoTokenizer.from_pretrained('{save_path}')")

In [None]:
# Test loading the model
loaded_model = GPT2LMHeadModel.from_pretrained(save_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(save_path)

print("Model loaded successfully!")
print(f"Parameters: {sum(p.numel() for p in loaded_model.parameters()):,}")