# Training GPT-2 From Scratch: A Step-by-Step Guide

This notebook provides a comprehensive guide to training GPT-2 from scratch using the OpenWebText dataset.

## 📚 What You'll Learn

- How to load and stream large datasets efficiently with DeepLake
- How to configure the GPT-2 model architecture
- How to set up the training pipeline with the Hugging Face Trainer
- How to monitor training with Weights & Biases
- How to perform inference with your trained model

## 🔧 Requirements

- **GPU**: 8x NVIDIA A100 (40GB each) recommended for full training
- **Time**: ~40-45 hours for full training on the complete dataset
- **For Testing**: Can run on single GPU with reduced dataset/model size

## 📖 Table of Contents

1. [Setting Up Working Environment](#1-setting-up-working-environment)
2. [Load Dataset from Deep Lake](#2-load-dataset-from-deep-lake)
3. [Loading the Model & Tokenizer](#3-loading-the-model--tokenizer)
4. [Training the Model](#4-training-the-model)
5. [Inference](#5-inference)

---

**Credits**: This tutorial is based on the excellent article by [Youssef Hosni](https://youssef-hosni.medium.com/)

## 1. Setting Up Working Environment

First, we'll install all the necessary packages:

- **transformers**: For working with transformer-based models like GPT-2
- **deeplake**: For managing and streaming large datasets
- **wandb**: For experiment tracking and visualization
- **accelerate**: For device placement, mixed precision, and multi-GPU training (the Trainer relies on it)

In [None]:
# Install required packages
!pip install -q transformers==4.32.0 deeplake==3.6.19 wandb==0.15.8 accelerate==0.22.0

In [None]:
# Import necessary libraries
import deeplake
import torch
from transformers import (
    AutoTokenizer,
    AutoConfig,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    pipeline
)
import wandb

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")

### Login to Weights & Biases

Weights & Biases (W&B) will help us track our training progress in real-time.

**Note**: You'll need to create a free account at [wandb.ai](https://wandb.ai) and get your API key.

In [None]:
# Login to Weights & Biases
# You can skip this if you don't want to use W&B
!wandb login

## 2. Load Dataset from Deep Lake

We'll use the **OpenWebText** dataset, an open-source recreation of OpenAI's WebText corpus built from the text of web pages linked in Reddit posts with at least three upvotes. This dataset is well suited to pretraining a general-purpose language model.

### Why DeepLake?

DeepLake allows us to **stream** the dataset batch by batch, which means:
- ✅ No need to download the entire dataset or load it into memory
- ✅ Memory use scales with the batch size, not the dataset size
- ✅ Training can begin before the full dataset has been fetched

### Dataset Structure

The dataset contains two tensors:
- **text**: The raw textual content
- **tokens**: Pre-tokenized version (not used here; we tokenize the raw text ourselves)

In [None]:
# Load the OpenWebText dataset from ActiveLoop
ds = deeplake.load('hub://activeloop/openwebtext-train')
ds_val = deeplake.load('hub://activeloop/openwebtext-val')

print("\n=== Training Dataset ===")
print(ds)
print(f"\nDataset size: {len(ds):,} samples")

print("\n=== Validation Dataset ===")
print(ds_val)
print(f"\nDataset size: {len(ds_val):,} samples")

In [None]:
# Let's examine a sample from the dataset
print("=== Sample Text from Dataset ===")
print(ds[0].text.text())
print("\n" + "="*50 + "\n")
print(ds[1].text.text())

### Load and Configure the Tokenizer

We'll use the GPT-2 tokenizer from Hugging Face. 

**Important**: GPT-2 doesn't have a padding token by default, so we set it to the EOS (end-of-sequence) token, `<|endoftext|>`.

In [None]:
# Load the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Set the padding token to the EOS token
tokenizer.pad_token = tokenizer.eos_token

print(f"Vocabulary size: {len(tokenizer):,}")
print(f"EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
print(f"PAD token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"BOS token: {tokenizer.bos_token}")

In [None]:
# Test the tokenizer
test_text = "Hello, how are you doing today?"
tokens = tokenizer(test_text, return_tensors="pt")

print(f"Original text: {test_text}")
print(f"\nTokenized IDs: {tokens['input_ids'][0].tolist()}")
print(f"\nDecoded back: {tokenizer.decode(tokens['input_ids'][0])}")

### Create DataLoaders with Tokenization

We'll create a transformation function that:
1. Tokenizes the text
2. Truncates to max_length (512 tokens)
3. Pads sequences to the same length
4. Creates input_ids and labels (both are the same; the model shifts the labels by one position internally when computing the loss)
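
To make step 4 concrete, here is a minimal pure-Python sketch of how a causal language model pairs each position with the next token as its target. The helper `shift_for_causal_lm` and the toy token IDs are illustrative assumptions, not part of this notebook. The value -100 is the index that PyTorch's cross-entropy loss ignores by default, and it is commonly used to keep padded positions out of the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def shift_for_causal_lm(input_ids, labels):
    """Pair each position's input with the label of the *next* position.

    Mirrors the shift a causal LM head applies internally: the prediction
    made at position i is scored against the token at position i + 1.
    Pairs whose target is IGNORE_INDEX are dropped, i.e. they add no loss.
    """
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[i], target))
    return pairs

# Toy sequence: three real tokens followed by two padding tokens (ID 0 here).
input_ids = [11, 22, 33, 0, 0]

# Labels copied from the inputs: padding is scored like any other token.
unmasked = shift_for_causal_lm(input_ids, labels=list(input_ids))

# Labels with padded positions replaced by IGNORE_INDEX: padding is skipped.
masked_labels = [11, 22, 33, IGNORE_INDEX, IGNORE_INDEX]
masked = shift_for_causal_lm(input_ids, labels=masked_labels)

print(unmasked)  # [(11, 22), (22, 33), (33, 0), (0, 0)]
print(masked)    # [(11, 22), (22, 33)]
```

When labels are a plain copy of the inputs, padded positions count toward the loss. Masking them is optional and requires knowing which positions are padding (for example from the tokenizer's attention mask), since the padding token here is the same as EOS.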

In [None]:
# Define transform to tokenize texts on the fly
def get_tokens_transform(tokenizer):
    """
    Creates a transformation function for tokenizing text samples.
    
    Args:
        tokenizer: The tokenizer to use
        
    Returns:
        A function that tokenizes input samples
    """
    def tokens_transform(sample_in):
        # Tokenize the text
        tokenized_text = tokenizer(
            sample_in["text"],
            truncation=True,
            max_length=512,  # Maximum sequence length
            padding='max_length',  # Pad to max_length
            return_tensors="pt"
        )
        
        # Extract the input_ids
        tokenized_text = tokenized_text["input_ids"][0]
        
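        # Return both input_ids and labels.
        # For causal language modeling the labels equal the inputs; the model
        # shifts them by one position internally when computing the loss.
        # Note: padded positions are not masked here, so they also count toward the loss.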
        return {
            "input_ids": tokenized_text,
            "labels": tokenized_text
        }
    
    return tokens_transform

In [None]:
# Create data loaders
# Note: Adjust batch size based on your GPU memory
# For A100 40GB: batch_size=32 works well
# For smaller GPUs: reduce to 8 or 4

BATCH_SIZE = 32  # Adjust this based on your GPU

print(f"Creating dataloaders with batch size: {BATCH_SIZE}")

ds_train_loader = ds.dataloader()\
    .batch(BATCH_SIZE)\
    .transform(get_tokens_transform(tokenizer))\
    .pytorch()

ds_eval_loader = ds_val.dataloader()\
    .batch(BATCH_SIZE)\
    .transform(get_tokens_transform(tokenizer))\
    .pytorch()

print("✅ DataLoaders created successfully!")

In [None]:
# Test the dataloader by fetching one batch
print("=== Testing DataLoader ===")
sample_batch = next(iter(ds_train_loader))

print(f"Batch keys: {sample_batch.keys()}")
print(f"Input IDs shape: {sample_batch['input_ids'].shape}")
print(f"Labels shape: {sample_batch['labels'].shape}")
print(f"\nFirst sample in batch:")
print(tokenizer.decode(sample_batch['input_ids'][0]))

## 3. Loading the Model & Tokenizer

We'll use the GPT-2 architecture from Hugging Face. This allows us to:
- ✅ Use a well-tested, proven architecture
- ✅ Easily scale the model by adjusting hyperparameters
- ✅ Customize the model size based on available resources

### Key Hyperparameters

- **n_layer**: Number of transformer decoder blocks
- **n_embd**: Embedding dimension (hidden size)
- **n_head**: Number of attention heads
- **n_positions**: Maximum sequence length (context window); the stock `gpt2` config file also stores this value under the older name `n_ctx`
- **vocab_size**: Size of the vocabulary
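
As a rough guide to how these hyperparameters determine model size, the sketch below counts GPT-2 parameters from the layer shapes. It assumes the standard GPT-2 block layout (a fused query/key/value projection, an MLP four times as wide as the embedding, two LayerNorms per block) and an output head that shares weights with the token embedding. The function `gpt2_param_count` is a hypothetical helper written for illustration; it is not a Transformers API.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate GPT-2 parameter count; the LM head shares the token embedding."""
    d = n_embd
    token_emb = vocab_size * d
    pos_emb = n_positions * d

    attn = (d * 3 * d + 3 * d) + (d * d + d)     # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4d, project back to d
    layer_norms = 2 * (2 * d)                    # two LayerNorms, each weight + bias
    block = attn + mlp + layer_norms

    final_ln = 2 * d
    return token_emb + pos_emb + n_layer * block + final_ln

# Default GPT-2 configuration (vocab 50,257, context 1,024, width 768, 12 layers).
small = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)
print(f"{small:,} parameters ({small / 1e6:.1f}M)")  # 124,439,808 parameters (124.4M)
```

Note that `n_head` does not appear in the formula: it changes how the embedding dimension is split across attention heads, not the number of weights.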

In [None]:
# Load the default GPT-2 configuration
config = AutoConfig.from_pretrained("gpt2")

print("=== Default GPT-2 Configuration ===")
print(config)
print("\n=== Key Parameters ===")
print(f"Number of layers: {config.n_layer}")
print(f"Embedding dimension: {config.n_embd}")
print(f"Number of attention heads: {config.n_head}")
print(f"Context length: {config.n_positions}")
print(f"Vocabulary size: {config.vocab_size}")

In [None]:
# Initialize model with default config (124M parameters)
model = GPT2LMHeadModel(config)

# Count parameters
model_size = sum(t.numel() for t in model.parameters())
print(f"\n🎯 GPT-2 (default) size: {model_size/1e6:.1f}M parameters")

# Print model architecture summary
print("\n=== Model Architecture ===")
print(model)

### Scaling Up the Model (Optional)

If you have more resources, you can create a larger model. Here's an example of creating a ~1B parameter model.

**Warning**: This requires significantly more GPU memory and training time!

In [None]:
# Example: Create a larger GPT-2 model (1B parameters)
# Uncomment if you want to train a larger model

# config_1b = AutoConfig.from_pretrained("gpt2")
# config_1b.n_layer = 32
# config_1b.n_embd = 1600
# config_1b.n_positions = 512
# config_1b.n_ctx = 512
# config_1b.n_head = 32

# model_1b = GPT2LMHeadModel(config_1b)
# model_size_1b = sum(t.numel() for t in model_1b.parameters())
# print(f"GPT2-1B size: {model_size_1b/1e6:.1f}M parameters")

print("Note: For this tutorial, we'll continue with the 124M parameter model.")
print("You can uncomment the code above to train a larger model if you have the resources.")

## 4. Training the Model

Now we'll set up the training loop using the Hugging Face Trainer class.

### Training Arguments Explained

- **output_dir**: Where to save checkpoints
- **num_train_epochs**: Number of training epochs (2 for full dataset)
- **per_device_train_batch_size**: Batch size per GPU (set to 1 since we batch in dataloader)
- **gradient_accumulation_steps**: Accumulate gradients over multiple steps
- **learning_rate**: Initial learning rate
- **weight_decay**: Weight decay coefficient for the AdamW optimizer (a regularizer related to L2)
- **warmup_steps**: Learning rate warmup steps
- **lr_scheduler_type**: Learning rate schedule (cosine decay)
- **bf16/fp16**: Mixed precision training (faster, less memory)
- **eval_steps/save_steps**: How often to evaluate and save
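
To show what warmup and a cosine schedule do together, here is a small pure-Python sketch of a linear warmup followed by a half-cosine decay to zero. It is written to follow the general shape of this kind of schedule; the function `cosine_lr` is a hypothetical helper, and `TOTAL_STEPS` is a made-up value, since the real total depends on dataset size and batch size.

```python
import math

def cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then half-cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

BASE_LR = 5e-4        # learning_rate used in this notebook
WARMUP_STEPS = 100    # warmup_steps used in this notebook
TOTAL_STEPS = 10_000  # hypothetical; depends on dataset and batch size

for step in (0, 50, 100, 5_050, 10_000):
    lr = cosine_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

The rate climbs linearly to its peak at the end of warmup, passes through half the peak midway through the decay phase, and reaches zero at the final step.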

### Resource Considerations

**For full training (~45 hours on 8x A100):**
- Use the settings below
- Batch size: 32 in dataloader, 1 per device

**For testing/small-scale training:**
- Reduce num_train_epochs to 1
- Use a smaller batch size
- Limit training to first N samples
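
For a sense of scale, the arithmetic below estimates how many sequences and tokens go into one optimizer step. It assumes that each GPU processes its own dataloader batch, that every sequence is padded to 512 tokens, and that gradient accumulation is 1. Whether batches are actually distributed this way depends on how the run is launched, so treat the numbers as an estimate. The helper `tokens_per_optimizer_step` is hypothetical.

```python
def tokens_per_optimizer_step(batch_size, seq_len, num_gpus, grad_accum_steps=1):
    """Return (sequences, tokens) consumed per optimizer step."""
    sequences = batch_size * num_gpus * grad_accum_steps
    return sequences, sequences * seq_len

# Full-scale setup: dataloader batch of 32 on each of 8 GPUs.
sequences, tokens = tokens_per_optimizer_step(batch_size=32, seq_len=512, num_gpus=8)
print(f"{sequences} sequences, {tokens:,} tokens per step")  # 256 sequences, 131,072 tokens per step

# A single-GPU test run with a smaller batch, for comparison.
print(tokens_per_optimizer_step(batch_size=8, seq_len=512, num_gpus=1))  # (8, 4096)
```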

In [None]:
# Define training arguments
args = TrainingArguments(
    output_dir="GPT2-training-scratch-openwebtext",
    
    # Evaluation and saving
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    save_steps=500,
    
    # Training duration
    num_train_epochs=2,
    
    # Batch sizes (set to 1 since we batch in dataloader)
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    
    # Optimization
    learning_rate=5e-4,
    weight_decay=0.1,
    warmup_steps=100,
    lr_scheduler_type="cosine",
    
    # Mixed precision (use bf16 for A100, fp16 for older GPUs)
    bf16=True,  # Set to False if not supported, use fp16=True instead
    
    # Logging
    logging_steps=1,
    logging_dir="./logs",
    
    # Distributed training
    ddp_find_unused_parameters=False,
    
    # Weights & Biases
    run_name="GPT2-scratch-openwebtext",
    report_to="wandb",  # Set to "none" if you don't want to use W&B
    
    # Additional settings
    save_total_limit=3,  # Keep only last 3 checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

print("✅ Training arguments configured!")
print(f"\nOutput directory: {args.output_dir}")
print(f"Number of epochs: {args.num_train_epochs}")
print(f"Learning rate: {args.learning_rate}")
print(f"Mixed precision: bf16={args.bf16}, fp16={args.fp16}")

### Custom Trainer Class

We need to create a custom Trainer class to use our DeepLake dataloaders.

This class overrides the `get_train_dataloader` and `get_eval_dataloader` methods to return our custom dataloaders.

In [None]:
# Custom Trainer class for DeepLake dataloaders
class TrainerWithDataLoaders(Trainer):
    """
    Custom Trainer that uses DeepLake dataloaders.
    
    This is necessary because we're using DeepLake's dataloader
    instead of the default PyTorch DataLoader.
    """
    def __init__(self, *args, train_dataloader=None, eval_dataloader=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.train_dataloader_custom = train_dataloader
        self.eval_dataloader_custom = eval_dataloader

    def get_train_dataloader(self):
        """Return the training dataloader."""
        return self.train_dataloader_custom

    def get_eval_dataloader(self, eval_dataset=None):
        """Return the evaluation dataloader."""
        return self.eval_dataloader_custom

print("✅ Custom Trainer class defined!")

In [None]:
# Initialize the Trainer
trainer = TrainerWithDataLoaders(
    model=model,
    args=args,
    train_dataloader=ds_train_loader,
    eval_dataloader=ds_eval_loader,
)

print("✅ Trainer initialized!")
print("\nReady to start training...")

### Start Training!

**⚠️ Important Notes:**

1. **Full training takes ~45 hours on 8x A100 GPUs**
2. **For testing**, you may want to:
   - Reduce `num_train_epochs` to 1 or less
   - Limit the dataset to first N samples
   - Use smaller batch sizes
3. **Monitor training** on Weights & Biases dashboard
4. **Checkpoints** are saved every 500 steps in the output directory

**To resume training** from a checkpoint:
```python
trainer.train(resume_from_checkpoint="path/to/checkpoint")
```

In [None]:
# Start training!
# This will take a long time on the full dataset

print("🚀 Starting training...")
print("This will take approximately 45 hours on 8x A100 GPUs for the full dataset.")
print("\nCheckpoints will be saved to:", args.output_dir)
print("\nYou can monitor progress at: https://wandb.ai")
print("\n" + "="*70)

# Uncomment the line below to start training
# trainer.train()

print("\n⚠️ Training is commented out by default.")
print("Uncomment 'trainer.train()' above to start actual training.")
print("\nFor testing, consider:")
print("  - Reducing num_train_epochs to 1")
print("  - Using a subset of the data")
print("  - Adjusting batch size based on your GPU")

### Training Progress Visualization

During training, you can monitor:
- **Training loss**: Should trend downward; expect step-to-step noise, since `logging_steps=1` logs every step
- **Evaluation loss**: Measures generalization
- **Learning rate**: Follows cosine schedule
- **GPU utilization**: Should be near 100%
- **Throughput**: Samples/second

All these metrics are available in your W&B dashboard!
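
Evaluation loss is often easier to interpret as perplexity, which is the exponential of the mean cross-entropy loss (measured in nats). A minimal sketch follows; the helper `perplexity` is hypothetical, and the loss values in the loop are made-up examples, not results from this notebook.

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean token-level cross-entropy loss (nats) to perplexity."""
    return math.exp(mean_cross_entropy)

# A model that guesses uniformly over the GPT-2 vocabulary has loss
# ln(50257), so its perplexity equals the vocabulary size.
uniform_loss = math.log(50257)
print(f"uniform guess: loss={uniform_loss:.2f}, perplexity={perplexity(uniform_loss):.0f}")

# Hypothetical loss values, for illustration only.
for loss in (6.0, 4.0, 3.0):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.1f}")
```

A freshly initialized model should start near the uniform-guess loss, which gives a quick sanity check on the first logged values.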

In [None]:
# After training completes, save the final model
# Uncomment when training is done

# trainer.save_model("./GPT2-scratch-openwebtext-final")
# print("✅ Final model saved!")

## 5. Inference

Now let's test our trained model by generating text!

We'll use the Hugging Face `pipeline` API, which makes text generation simple and flexible.

### Generation Parameters

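- **max_length**: Maximum total length in tokens, prompt included
- **min_length**: Minimum total length in tokens, prompt included
- **temperature**: Scales the logits before sampling; values below 1 make output more focused, values above 1 make it more random, and 1 leaves the distribution unchanged (must be greater than 0)
- **top_k**: Sample only from the K most probable tokens
- **top_p**: Nucleus sampling (sample only from the smallest set of tokens whose cumulative probability reaches p)
- **do_sample**: Whether to sample from the distribution (`True`) or use greedy decoding (`False`); temperature, top_k, and top_p only take effect when sampling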
- **num_return_sequences**: Number of different outputs to generate
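
To illustrate how `temperature`, `top_k`, and `top_p` reshape the next-token distribution before sampling, here is a simplified pure-Python sketch over a toy four-token vocabulary. The function `filter_distribution` and the logit values are hypothetical; the library's own implementation operates on tensors and may differ in details such as tie handling.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before the softmax (<1 sharpens, >1 flattens).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_mass = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / kept_mass) for tok, prob in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}

toy_logits = {"the": 2.0, "a": 1.0, "cat": 0.0, "zebra": -2.0}

print(filter_distribution(toy_logits))                   # all four tokens survive
print(filter_distribution(toy_logits, top_k=2))          # only "the" and "a" survive
print(filter_distribution(toy_logits, top_p=0.85))       # nucleus of "the" and "a"
print(filter_distribution(toy_logits, temperature=0.3))  # mass concentrates on "the"
```

Sampling then draws one token from the returned distribution. With greedy decoding, the most probable token is always chosen and these filters have no effect.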

In [None]:
# Load the trained model for inference
# Change the path to your checkpoint directory

MODEL_PATH = "./GPT2-scratch-openwebtext-final"  # or "./GPT2-training-scratch-openwebtext/checkpoint-XXXX"

# Check if model exists
import os
if os.path.exists(MODEL_PATH):
    print(f"✅ Loading model from: {MODEL_PATH}")
    
    # Create text generation pipeline
    pipe = pipeline(
        "text-generation",
        model=MODEL_PATH,
        tokenizer=tokenizer,
        device="cuda:0" if torch.cuda.is_available() else "cpu"
    )
    
    print("✅ Pipeline created successfully!")
else:
    print(f"⚠️ Model not found at: {MODEL_PATH}")
    print("\nYou need to:")
    print("  1. Train the model first")
    print("  2. Or download a pretrained checkpoint")
    print("  3. Update MODEL_PATH to the correct location")

### Generate Text

Let's generate some text completions with different prompts!

In [None]:
# Example 1: Simple text completion
if 'pipe' in globals():
    prompt = "The house prices dropped down"
    
    print(f"Prompt: {prompt}")
    print("="*70)
    
    completion = pipe(
        prompt,
        max_length=100,
        num_return_sequences=1,
        temperature=0.8,
        do_sample=True,
        top_k=50,
        top_p=0.95,
    )
    
    print(completion[0]['generated_text'])
else:
    print("⚠️ Model not loaded. Train the model first!")

In [None]:
# Example 2: Generate multiple completions
if 'pipe' in globals():
    prompt = "In the year 2030, artificial intelligence"
    
    print(f"Prompt: {prompt}")
    print("="*70)
    
    completions = pipe(
        prompt,
        max_length=80,
        num_return_sequences=3,  # Generate 3 different completions
        temperature=0.9,
        do_sample=True,
    )
    
    for i, comp in enumerate(completions, 1):
        print(f"\n--- Completion {i} ---")
        print(comp['generated_text'])
        print()

In [None]:
# Example 3: More creative generation (higher temperature)
if 'pipe' in globals():
    prompt = "Once upon a time in a distant galaxy"
    
    print(f"Prompt: {prompt}")
    print("="*70)
    
    completion = pipe(
        prompt,
        max_length=120,
        num_return_sequences=1,
        temperature=1.2,  # Higher temperature = more creative/random
        do_sample=True,
        top_k=50,
    )
    
    print(completion[0]['generated_text'])

In [None]:
# Example 4: More deterministic generation (lower temperature)
if 'pipe' in globals():
    prompt = "The key to success in machine learning is"
    
    print(f"Prompt: {prompt}")
    print("="*70)
    
    completion = pipe(
        prompt,
        max_length=100,
        num_return_sequences=1,
        temperature=0.3,  # Lower temperature = more focused/deterministic
        do_sample=True,
        top_k=50,
    )
    
    print(completion[0]['generated_text'])

### Interactive Text Generation

Try your own prompts!

In [None]:
# Interactive generation - try your own prompts!
if 'pipe' in globals():
    # Modify this prompt
    your_prompt = "The future of technology is"
    
    print(f"Your Prompt: {your_prompt}")
    print("="*70)
    
    completion = pipe(
        your_prompt,
        max_length=100,
        num_return_sequences=1,
        temperature=0.8,
        do_sample=True,
    )
    
    print(completion[0]['generated_text'])
else:
    print("⚠️ Please train the model first to generate text!")

## 🎉 Congratulations!

You've successfully:
- ✅ Loaded and streamed a large dataset with DeepLake
- ✅ Configured GPT-2 architecture
- ✅ Set up a complete training pipeline
- ✅ Monitored training with Weights & Biases
- ✅ Generated text with your trained model

## 🚀 Next Steps

1. **Experiment with hyperparameters**: Try different learning rates, batch sizes, model sizes
2. **Use a different dataset**: Train on domain-specific data (medical, legal, code, etc.)
3. **Scale up**: Train a larger model (1B+ parameters)
4. **Fine-tune**: Use your pre-trained model as a starting point for specific tasks
5. **Explore PEFT**: Learn parameter-efficient fine-tuning (LoRA, QLoRA)

## 📚 Resources

- [Hugging Face Transformers Docs](https://huggingface.co/docs/transformers)
- [DeepLake Documentation](https://docs.activeloop.ai/)
- [Weights & Biases Guides](https://docs.wandb.ai/)
- [GPT-2 Paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

## 💬 Questions?

- Check the [README](README.md) in this directory
- Review the [SETUP.md](../SETUP.md) guide
- Open an issue on GitHub

---

**Happy Training! 🎓✨**