# Introduction to RLHF

**Reinforcement Learning from Human Feedback**

## What is RLHF?

**Reinforcement Learning from Human Feedback (RLHF)** is a widely used post-training technique for aligning language models with human preferences. While supervised fine-tuning (SFT) teaches models to follow instructions, RLHF teaches them to **optimize for what humans actually prefer**.

RLHF played a central role in making GPT-4, Claude, and other modern assistants practical. Without a preference-optimization step like RLHF, these models would follow instructions but lack a learned sense of quality, safety, and helpfulness.

## The Fundamental Problem

After SFT, models can follow instructions, but they don't know:

- Which response is **better** when multiple valid options exist
- How to balance **helpfulness** vs **harmlessness**
- When to be verbose vs concise
- How to handle **ambiguous or harmful** requests

Consider:
```
Instruction: Write a story about AI.

Response A: "Once upon a time there was an AI. It was very smart. The end."

Response B: "In the year 2157, a breakthrough artificial intelligence named Echo
awakened in the depths of a research facility..."
```

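Both responses follow the instruction, but most readers would prefer B. SFT alone has no signal for this: it imitates reference responses and never sees a comparison between a better and a worse answer.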

## The Complete RLHF Pipeline

```
┌─────────────────────────────────────────────────────────────────┐
│                       STAGE 1: SFT                               │
│  Base Model + Instruction Data → SFT Model                      │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  STAGE 2: Reward Model Training                  │
│  SFT Model + Preference Data → Reward Model                     │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    STAGE 3: RL Fine-Tuning                       │
│  Policy + Reward Model + Reference → Aligned Model              │
│  Algorithm: Proximal Policy Optimization (PPO)                  │
└─────────────────────────────────────────────────────────────────┘
```
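Stage 2 is commonly trained with a pairwise (Bradley-Terry style) loss that pushes the score of the preferred response above the score of the rejected one. The snippet below is a minimal sketch of that loss only; the function name `pairwise_reward_loss` and the example scores are illustrative assumptions, not part of this notebook's pipeline.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected))."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Hypothetical scalar scores a reward model might assign to two preference pairs.
# In the first pair the chosen response already scores higher; in the second
# the ranking is wrong, so that pair contributes most of the loss.
chosen = torch.tensor([2.0, 0.5])
rejected = torch.tensor([0.0, 1.5])
loss = pairwise_reward_loss(chosen, rejected)
print(f"pairwise loss: {loss.item():.4f}")
```

When both responses receive the same score the loss equals log 2, and it shrinks toward zero as the margin between chosen and rejected scores grows.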

## The Actor-Critic Architecture

RLHF with PPO uses **four models**:

| Model | Role | Trainable? |
|-------|------|------------|
| **Policy Model** | Generates responses | Yes |
| **Value Network** | Estimates expected reward | Yes |
| **Reward Model** | Scores responses | No (frozen) |
| **Reference Model** | Prevents drift | No (frozen) |

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Simplified view of the four models
class RLHFSetup:
    """Setup for RLHF training with PPO."""

    def __init__(self, model_name="gpt2"):
        # 1. Policy Model (trainable) - generates responses
        self.policy_model = AutoModelForCausalLM.from_pretrained(model_name)

        # 2. Reference Model (frozen) - prevents drift
        self.reference_model = AutoModelForCausalLM.from_pretrained(model_name)
        for param in self.reference_model.parameters():
            param.requires_grad = False

        # 3. Reward Model (frozen) - scores responses
        # (In practice, loaded from a trained checkpoint)
        self.reward_model = None  # Would be loaded separately

        # 4. Value Network (trainable) - estimates returns
        # (Often shares base with policy model)
        self.value_network = None  # Would be created separately

        print(f"RLHF Setup initialized with {model_name}")
        print("  Policy model: trainable")
        print("  Reference model: frozen")
        print("  Reward model: frozen (loaded separately)")
        print("  Value network: trainable (created separately)")

setup = RLHFSetup("gpt2")
```

```
RLHF Setup initialized with gpt2
  Policy model: trainable
  Reference model: frozen
  Reward model: frozen (loaded separately)
  Value network: trainable (created separately)
```
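The value network is left as `None` in the setup. One common way to build it is a small linear head on top of the transformer's hidden states, producing one scalar per token. The sketch below assumes that design; `ValueHead` is a hypothetical class, and random tensors stand in for real hidden states so the example runs without downloading a model.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Maps each token's hidden state to a scalar value estimate."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_size) -> (batch, seq_len)
        return self.proj(hidden_states).squeeze(-1)

hidden_size = 768  # hidden size of the "gpt2" (small) checkpoint
value_head = ValueHead(hidden_size)

# Stand-in for the hidden states a transformer would produce
fake_hidden = torch.randn(2, 5, hidden_size)
values = value_head(fake_hidden)
print(values.shape)  # one value estimate per token position
```

Sharing the transformer body with the policy saves memory, at the cost of coupling the policy and value gradients; a fully separate value model avoids that coupling but roughly doubles the trainable parameters.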


## The PPO Training Loop

Each training iteration has two phases:

### Phase 1: Rollout Generation
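```python
# Simplified pseudocode: each model call stands in for tokenization
# plus a forward pass.

# 1. Sample prompts from dataset
prompts = ["Explain quantum computing", ...]

# 2. Generate responses with policy
responses = policy_model.generate(prompts)

# Pair each prompt with its own response (prompts + responses would
# only concatenate the two lists)
sequences = [p + r for p, r in zip(prompts, responses)]

# Record the policy's log probabilities at rollout time;
# the PPO update compares against these as old_logprobs
old_logprobs = policy_model(sequences)

# 3. Score with reward model
rewards = reward_model(sequences)

# 4. Get value estimates
values = value_network(sequences)

# 5. Get reference log probabilities (for KL penalty)
ref_logprobs = reference_model(sequences)
```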
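The log probabilities collected during rollout are per-token quantities read off the model's logits. A minimal sketch of that step follows, assuming a causal language model whose logits at position t predict the token at position t+1. `token_logprobs` is a hypothetical helper, and random logits stand in for real model output.

```python
import torch
import torch.nn.functional as F

def token_logprobs(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Log-probability of each token in `tokens` under the matching logits.

    logits: (batch, seq_len, vocab), where logits[:, t] scores tokens[:, t]
    tokens: (batch, seq_len) integer token ids
    """
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(2, 6, vocab_size)            # stand-in for model logits
input_ids = torch.randint(0, vocab_size, (2, 6))  # stand-in for token ids

# Shift by one: logits at position t predict the token at position t + 1
logprobs = token_logprobs(logits[:, :-1], input_ids[:, 1:])
print(logprobs.shape)
```

The same helper can be applied to the policy and to the reference model, which gives the two sets of log probabilities that the KL penalty compares.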

### Phase 2: PPO Update
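```python
# 6. Compute advantages
advantages = compute_gae(rewards, values)

# 7. Multiple PPO epochs on this data
for epoch in range(ppo_epochs):
    # Recompute log probabilities of the rollout tokens under the current
    # policy (get_logprobs is a hypothetical helper); old_logprobs are the
    # values recorded when the rollouts were generated
    policy_logprobs = get_logprobs(policy_model, prompts, responses)
    loss = compute_ppo_loss(policy_logprobs, old_logprobs, advantages)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()
```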
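The `compute_gae` call above is not defined in this notebook. A minimal sketch of Generalized Advantage Estimation for a single trajectory is shown here in plain Python. The defaults `gamma=1.0` and `lam=0.95` are assumptions, as is the convention that the value after the final step is zero.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: per-step rewards, length T
    values:  per-step value estimates, length T
    The value after the last step is assumed to be 0 (episode ends).
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Reward arrives only at the final token, as with a sequence-level score
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
print(compute_gae(rewards, values, lam=1.0))  # full return minus value
print(compute_gae(rewards, values, lam=0.0))  # one-step TD errors only
```

With `lam=1.0` every step's advantage is the full return minus its value estimate (low bias, high variance); with `lam=0.0` only the one-step TD error remains (higher bias, lower variance). Values in between trade the two off.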

## Why PPO?

**Proximal Policy Optimization (PPO)** is chosen because:

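- **Stable**: Clipping the policy ratio limits how far a single update can move the policy
- **Sample efficient**: Each batch of rollouts is reused for several update epochs
- **Simple**: Easier to implement than TRPO
- **Proven**: Used in published RLHF pipelines, including the InstructGPT work behind ChatGPT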
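To make the stability point concrete, here is a scalar sketch of PPO's clipped surrogate objective. The ratio is the probability of an action under the current policy divided by its probability under the rollout policy, i.e. `exp(policy_logprob - old_logprob)`. The function name and `clip_eps=0.2` are illustrative assumptions.

```python
def clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped terms
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, policy already moved far toward it: gain is capped
print(clipped_objective(ratio=1.5, advantage=1.0))
# Bad action, policy already moved far away from it: gain is capped
print(clipped_objective(ratio=0.5, advantage=-1.0))
# Bad action that became MORE likely: no clipping, full penalty applies
print(clipped_objective(ratio=1.5, advantage=-1.0))
```

Once the ratio leaves the interval `[1 - clip_eps, 1 + clip_eps]` in the direction that would improve the objective, the value stops changing with the ratio, so the gradient vanishes and further movement is not rewarded. Moves in the harmful direction are never clipped.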

## The KL Penalty

A critical component is preventing the policy from drifting too far from the reference:

$$\text{reward}_{\text{total}} = r(x, y) - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$$

**Why?** Without a KL penalty, the policy:
- Might exploit weaknesses in the reward model
- Could forget how to generate coherent text
- Might collapse to repetitive high-reward outputs
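One common way to apply the formula above is per token: every generated token is penalized by the gap between the policy's and the reference's log probability, and the reward model's score is added at the final token. The sketch below assumes that scheme; `kl_shaped_rewards` and `beta=0.1` are illustrative, and the numbers are made up.

```python
def kl_shaped_rewards(score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards: KL penalty on every token, score on the last one.

    The sum of (policy_logprob - ref_logprob) over the sampled tokens is a
    single-sample estimate of KL(policy || reference) for this response.
    """
    rewards = [-beta * (lp - ref_lp)
               for lp, ref_lp in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += score  # sequence-level reward model score
    return rewards

policy_logprobs = [-1.0, -0.5, -2.0]  # hypothetical per-token values
ref_logprobs = [-1.0, -1.5, -1.0]
rewards = kl_shaped_rewards(1.0, policy_logprobs, ref_logprobs)
print(rewards)
```

The second token, which the policy favors much more than the reference does, receives a negative reward; tokens where the two models agree are not penalized.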

## RLHF vs DPO

| Aspect | RLHF (PPO) | DPO |
|--------|------------|-----|
| **Stages** | 3 (SFT → RM → PPO) | 2 (SFT → DPO) |
| **Models** | 4 (policy, value, reward, reference) | 2 (policy, reference) |
| **Complexity** | High | Low |
| **Flexibility** | High (can change reward) | Lower |
| **Stability** | Moderate | High |
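DPO needs only two models because it replaces the reward model and the RL loop with a single loss computed directly on preference pairs. A minimal sketch of that loss is below, taking summed sequence log probabilities as inputs. The argument names and `beta=0.1` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on sequence-level log probabilities of preference pairs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Hypothetical log probabilities for one preference pair: the policy has
# raised the chosen response and lowered the rejected one
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-4.0]),
    policy_rejected_logps=torch.tensor([-6.0]),
    ref_chosen_logps=torch.tensor([-5.0]),
    ref_rejected_logps=torch.tensor([-5.0]),
)
print(f"DPO loss: {loss.item():.4f}")
```

When the policy is identical to the reference the loss is log 2; it decreases as the policy shifts probability toward chosen responses and away from rejected ones, with `beta` playing a role similar to the KL coefficient in PPO-based RLHF.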

## Next Steps

In the following notebooks, we'll dive deep into:

1. **PPO Algorithm** — The clipped objective and implementation
2. **KL Penalty** — Why it's critical for stability
3. **Training Dynamics** — GAE, rollouts, and the complete loop
4. **Reference Models** — Creating and managing frozen references