# Project Overview

**Architecture, components, and how everything fits together**

## The Complete Pipeline

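This section covers the full post-training pipeline: supervised fine-tuning, reward modeling, and then either RLHF (with PPO) or DPO. Note that DPO does not use the trained reward model; it optimizes the policy directly on the same preference data.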

```
┌─────────────────────────────────────────────────────────────────────┐
│                     SUPERVISED FINE-TUNING                          │
├─────────────────────────────────────────────────────────────────────┤
│ • Instruction formatting (chat templates)                           │
│ • Loss masking (only train on responses)                            │
│ • LoRA for efficient training                                       │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      REWARD MODELING                                │
├─────────────────────────────────────────────────────────────────────┤
│ • Preference data format                                            │
│ • Bradley-Terry ranking loss                                        │
│ • Evaluation metrics                                                │
└─────────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
┌─────────────────────────┐     ┌─────────────────────────┐
│         RLHF            │     │          DPO            │
├─────────────────────────┤     ├─────────────────────────┤
│ • PPO algorithm         │     │ • Direct optimization   │
│ • KL divergence         │     │ • No reward model       │
│ • Value network         │     │ • Simpler training      │
│ • Reference model       │     │ • Reference model       │
└─────────────────────────┘     └─────────────────────────┘
```

## Module Organization

The codebase has one module per pipeline stage, plus a module of shared utilities:

| Module | Purpose | Key Classes and Functions |
|--------|---------|---------------|
| `sft/` | Supervised Fine-Tuning | `SFTTrainer`, `format_instruction` |
| `reward/` | Reward Model Training | `RewardModel`, `RewardModelTrainer` |
| `rlhf/` | RLHF with PPO | `PPOTrainer`, `ValueNetwork`, `RolloutBuffer` |
| `dpo/` | Direct Preference Optimization | `DPOTrainer`, `compute_dpo_loss` |
| `utils/` | Shared Utilities | `load_model_and_tokenizer`, `setup_device` |

## Data Formats

Each stage uses a specific data format:

### SFT Data
```json
{
    "instruction": "What is the capital of France?",
    "response": "The capital of France is Paris."
}
```
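
To make the link between this record format and loss masking concrete, here is a minimal sketch of how one SFT record could become `input_ids` and `labels`. The prompt template, the `build_prompt` and `build_example` helpers, and the character-level `toy_tokenize` are all illustrative assumptions, not the project's `format_instruction` or its tokenizer. The one fixed convention is `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_prompt(instruction: str) -> str:
    """Hypothetical prompt template; the project's own template may differ."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def toy_tokenize(text: str) -> list[int]:
    """Stand-in tokenizer (one id per character) so the sketch runs without a model."""
    return [ord(ch) for ch in text]


def build_example(record: dict) -> dict:
    """Turn one SFT record into input ids plus labels with the prompt masked out."""
    prompt_ids = toy_tokenize(build_prompt(record["instruction"]))
    response_ids = toy_tokenize(record["response"])
    input_ids = prompt_ids + response_ids
    # Prompt positions get IGNORE_INDEX, so only response tokens contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}


example = build_example({
    "instruction": "What is the capital of France?",
    "response": "The capital of France is Paris.",
})
n_masked = sum(label == IGNORE_INDEX for label in example["labels"])
n_trained = len(example["labels"]) - n_masked
print(f"{n_masked} prompt tokens masked, {n_trained} response tokens trained on")
```

In a real run the model's own tokenizer would replace `toy_tokenize`; the masking logic stays the same.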

### Preference Data (Reward Model & DPO)
```json
{
    "prompt": "Explain quantum computing simply.",
    "chosen": "Imagine a coin spinning in the air...",
    "rejected": "Quantum computers use qubits which leverage..."
}
```
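
The Bradley-Terry ranking loss used for reward modeling can be sketched on plain floats. For one preference pair it is `-log(sigmoid(r_chosen - r_rejected))`, where the two rewards are the scalar scores the reward model assigns to the chosen and rejected responses. The function names and the example scores below are made up for illustration; pairwise accuracy is shown as one plausible evaluation metric, not necessarily the one this project reports.

```python
import math


def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(chosen - rejected)), computed in a numerically stable way."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(reward_pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs where the chosen response scores higher than the rejected one."""
    correct = sum(chosen > rejected for chosen, rejected in reward_pairs)
    return correct / len(reward_pairs)


# Hypothetical (chosen, rejected) reward scores for three preference pairs.
scores = [(1.2, -0.3), (0.1, 0.4), (2.0, 0.5)]
mean_loss = sum(bradley_terry_loss(c, r) for c, r in scores) / len(scores)
print(f"mean loss: {mean_loss:.4f}, pairwise accuracy: {pairwise_accuracy(scores):.2f}")
```

The loss shrinks as the chosen response's score pulls ahead of the rejected one, and equals `log(2)` when the model cannot tell them apart.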

### Prompt Data (RLHF)
```json
{
    "prompt": "Write a haiku about programming."
}
```

```python
# Example: sample records in each data format

# SFT data example
sft_data = [
    {"instruction": "What is Python?", "response": "Python is a programming language."},
    {"instruction": "Translate 'hello' to French", "response": "'Hello' in French is 'Bonjour'."}
]

# Preference data example
preference_data = [
    {
        "prompt": "Explain AI briefly.",
        "chosen": "AI is technology that enables machines to simulate human intelligence.",
        "rejected": "AI."
    }
]

print("SFT Data Format:")
print(f"  Keys: {list(sft_data[0].keys())}")
print()
print("Preference Data Format:")
print(f"  Keys: {list(preference_data[0].keys())}")
```

```
SFT Data Format:
  Keys: ['instruction', 'response']

Preference Data Format:
  Keys: ['prompt', 'chosen', 'rejected']
```


## Training Progression

The typical training flow:

| Step | Input Model | Output Model | Training Time* |
|------|-------------|--------------|----------------|
| 1. SFT | Base (GPT-2) | SFT Model | ~1 hour |
| 2. Reward | SFT Model | Reward Model | ~30 min |
| 3a. RLHF | SFT Model + Reward Model | RLHF Model | ~2 hours |
| 3b. DPO | SFT Model | DPO Model | ~1 hour |

*Approximate times for a GPT-2-scale model on a single GPU.

## Key Hyperparameters

Each stage has critical hyperparameters:

### SFT
- **Learning rate:** 2e-4 (a common choice with LoRA; full fine-tuning usually calls for a lower rate)
- **Batch size:** 4-8
- **Epochs:** 3-5

### Reward Model
- **Learning rate:** 1e-5 (lower than SFT)
- **Batch size:** 4 preference pairs (each pair holds a chosen and a rejected sequence, so 8 sequences per batch)
- **Epochs:** 1 (avoid overfitting)

### RLHF
- **Learning rate:** 1e-6 (very low)
- **KL coefficient:** 0.1
- **PPO epochs:** 4
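
To show where the KL coefficient enters, here is a minimal sketch of one common formulation: each generated token is penalized by `kl_coef * (log π(token) - log π_ref(token))`, and the reward model's score is added on the final token. The clipped PPO objective is included for a single token as well. This is an assumption about how the penalty is applied, not a description of this project's `PPOTrainer`; the function names and sample log-probabilities are hypothetical.

```python
import math


def kl_penalized_rewards(reward_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Per-token rewards: a KL penalty on every token, plus the reward model score at the end."""
    rewards = [-kl_coef * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += reward_score
    return rewards


def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_ratio=0.2):
    """PPO surrogate for one token: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical log-probabilities for a two-token response.
policy_lp = [-1.0, -2.0]
ref_lp = [-1.0, -2.5]
rewards = kl_penalized_rewards(1.0, policy_lp, ref_lp, kl_coef=0.1)
print("per-token rewards:", rewards)

# The probability ratio is 1.5 here, so a positive advantage is capped at 1 + clip_ratio.
print("clipped objective:", ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0))
```

A larger `kl_coef` keeps the policy closer to the reference model at the cost of slower reward improvement.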

### DPO  
- **Learning rate:** 1e-6
- **Beta (β):** 0.1
- **Epochs:** 1-3
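
The role of β is easiest to see in the DPO loss itself. The sketch below computes it for a single preference pair from four sequence log-probabilities (policy and reference, each on the chosen and rejected response). It is written on plain floats for clarity; `dpo_loss_sketch` is a hypothetical name, and the project's `compute_dpo_loss` may have a different signature and operate on batched tensors.

```python
import math


def dpo_loss_sketch(policy_chosen_lp, policy_rejected_lp,
                    ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)).
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy equals the reference, both log-ratios are zero and the loss is log(2).
start_loss = dpo_loss_sketch(-10.0, -12.0, -10.0, -12.0)

# Hypothetical values after training: chosen became more likely, rejected less likely.
later_loss = dpo_loss_sketch(-8.0, -15.0, -10.0, -12.0)
print(f"loss at initialization: {start_loss:.4f}, after improvement: {later_loss:.4f}")
```

β scales how strongly deviations from the reference model move the loss: with a small β such as 0.1, the policy must shift its log-ratios substantially before the loss changes much.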

```python
# Configuration examples for each stage

sft_config = {
    "learning_rate": 2e-4,
    "batch_size": 4,
    "num_epochs": 3,
    "max_length": 512,
    "warmup_steps": 100,
}

reward_config = {
    "learning_rate": 1e-5,
    "batch_size": 4,
    "num_epochs": 1,
    "max_length": 512,
}

ppo_config = {
    "learning_rate": 1e-6,
    "batch_size": 4,
    "ppo_epochs": 4,
    "kl_coef": 0.1,
    "clip_ratio": 0.2,
}

dpo_config = {
    "learning_rate": 1e-6,
    "batch_size": 4,
    "num_epochs": 1,
    "beta": 0.1,
}

print("Configurations loaded for all training stages!")
```

```
Configurations loaded for all training stages!
```


## Memory Considerations

Post-training can be memory-intensive because some stages hold several models in memory at once:

| Stage | Models in Memory | Memory Factor |
|-------|-----------------|---------------|
| SFT | 1 model | 1x |
| Reward | 1 model | 1x (but 2 sequences per sample) |
| RLHF | 4 models (policy, value, reward, reference) | 4x |
| DPO | 2 models (policy, reference) | 2x |
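
As a rough worked example, the model counts in the table can be turned into a weights-only memory estimate. The sketch assumes GPT-2 small at roughly 124 million parameters and counts only the weights; gradients, optimizer state, and activations add more on top, mainly for the models being trained. The names below are illustrative.

```python
# Assumption: GPT-2 small has roughly 124 million parameters.
GPT2_SMALL_PARAMS = 124_000_000
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}
MODELS_IN_MEMORY = {"SFT": 1, "Reward": 1, "RLHF": 4, "DPO": 2}


def weights_gib(stage: str, dtype: str = "fp32", n_params: int = GPT2_SMALL_PARAMS) -> float:
    """Memory needed for model weights alone, in GiB (no gradients or optimizer state)."""
    total_bytes = MODELS_IN_MEMORY[stage] * n_params * BYTES_PER_PARAM[dtype]
    return total_bytes / 2**30


for stage in MODELS_IN_MEMORY:
    fp32 = weights_gib(stage, "fp32")
    bf16 = weights_gib(stage, "bf16")
    print(f"{stage:<6} fp32: {fp32:.2f} GiB   bf16: {bf16:.2f} GiB")
```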

**Solutions:**
- LoRA for efficient fine-tuning
- Gradient checkpointing
- Mixed precision (fp16/bf16)
- Gradient accumulation
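
Two of these techniques reduce to simple arithmetic, sketched below. LoRA freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors with `rank * (d_in + d_out)` parameters in total; gradient accumulation multiplies the per-step batch size by the number of accumulation steps. The dimensions are those of GPT-2 small's fused attention projection (768 to 2304), while the rank of 8 and the 8 accumulation steps are arbitrary illustrative choices.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two factors, A (rank x d_in) and B (d_out x rank), instead of the full matrix."""
    return rank * (d_in + d_out)


def effective_batch_size(micro_batch_size: int, accumulation_steps: int) -> int:
    """Gradients are summed over several small batches before each optimizer step."""
    return micro_batch_size * accumulation_steps


d_in, d_out = 768, 2304  # GPT-2 small hidden size and fused query/key/value projection width
full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank=8)
print(f"full weight: {full:,} params, LoRA (rank 8): {lora:,} params ({lora / full:.1%})")
print(f"effective batch size: {effective_batch_size(4, 8)}")
```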

## Next Steps

Now that you have an overview of the complete pipeline, let's dive into the first stage: Supervised Fine-Tuning (SFT).