# Introduction to Direct Preference Optimization (DPO)

**A simpler alternative to RLHF**

## What is DPO?

**Direct Preference Optimization (DPO)** is a simpler approach to aligning language models with human preferences. Unlike RLHF, DPO:

- **Skips the reward model**: Directly optimizes on preferences
- **No RL**: Uses a supervised classification loss
- **Fewer models**: Only policy and reference needed
- **More stable in practice**: No rollouts or value-function fitting, though the policy can still overfit the preference data

## DPO vs RLHF

| Aspect | RLHF | DPO |
|--------|------|-----|
| **Pipeline** | SFT → RM → PPO | SFT → DPO |
| **Models** | 4 (policy, value, reward, ref) | 2 (policy, ref) |
| **Complexity** | High | Low |
| **Training** | Reinforcement learning | Supervised learning |
| **Stability** | Moderate | High |
| **Memory** | 4x model size | 2x model size |
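As a rough back-of-the-envelope check on the Memory row, the sketch below counts weight memory only. The 7B parameter count, bf16 storage (2 bytes per parameter), and the helper name `weights_gb` are illustrative assumptions. Gradients, optimizer state, and activations are ignored, and in practice the reward and value models in RLHF may be smaller than the policy.

```python
def weights_gb(num_params: int, num_copies: int, bytes_per_param: int = 2) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes) for several same-sized models."""
    return num_params * bytes_per_param * num_copies / 1e9

params = 7_000_000_000  # assumed 7B-parameter model (illustrative)

rlhf_gb = weights_gb(params, num_copies=4)  # policy, value, reward, reference
dpo_gb = weights_gb(params, num_copies=2)   # policy, reference

print(f"RLHF weights: {rlhf_gb:.0f} GB, DPO weights: {dpo_gb:.0f} GB")
```

Only the policy (and, in RLHF, the value model) is trained, so the real gap depends on how optimizer state and activations are handled.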

## The Key Insight

DPO starts from the observation that the optimal policy under the KL-regularized RLHF objective has a closed-form solution:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)$$

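Here $Z(x) = \sum_{y} \pi_{\text{ref}}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)$ is a normalizing constant that depends only on the prompt. Rearranging, we can express the reward in terms of the policy: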

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

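Because $Z(x)$ depends only on the prompt, it drops out whenever two responses to the same prompt are compared. This means we can **directly optimize the policy** on preference pairs without explicitly learning a reward model.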
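As a worked step, assume preferences follow the Bradley-Terry model (the preference model used in the DPO paper), where the probability that $y_w$ is preferred over $y_l$ depends only on the reward difference:

$$p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

Substituting the reward expression for both responses, the $\beta \log Z(x)$ term appears once with each sign and cancels:

$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}$$

Replacing $\pi^*$ with a trainable policy $\pi_\theta$ and maximizing the log-likelihood of the observed preferences gives a loss that involves only policy and reference log-probabilities.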

## The DPO Loss

The DPO objective is:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$$

where:
- $y_w$ = chosen (winning) response
- $y_l$ = rejected (losing) response
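- $\beta$ = coefficient controlling how strongly the policy is kept close to the reference (larger $\beta$ means a stronger constraint)
- $\sigma$ = logistic sigmoid function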
- $\pi_{\text{ref}}$ = frozen reference model
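To get a feel for the loss on a single pair, the sketch below evaluates $-\log \sigma(\beta \cdot \text{margin})$ in plain Python, where `margin` stands for the difference of the two log-ratios inside the sigmoid. The helper name `dpo_pair_loss` and the sample values are illustrative assumptions, not part of any library.

```python
import math

def dpo_pair_loss(margin: float, beta: float) -> float:
    """-log sigmoid(beta * margin), computed as a numerically stable softplus."""
    z = -beta * margin
    # softplus(z) = log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

for beta in (0.1, 0.5):
    for margin in (-5.0, 0.0, 5.0):
        loss = dpo_pair_loss(margin, beta)
        print(f"beta={beta}, margin={margin:+.1f} -> loss={loss:.4f}")
```

When the policy equals the reference, the margin is 0 and the loss is $\ln 2 \approx 0.6931$ for any $\beta$, which is a useful sanity check at the start of training. A larger $\beta$ makes the loss react more sharply to the same margin.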

```python
import torch
import torch.nn.functional as F

def compute_dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float = 0.1
) -> torch.Tensor:
    """
    Compute DPO loss.

    Each *_logps argument holds one summed log-probability per response,
    shape (batch,).

    Args:
        policy_chosen_logps: Log probs of chosen responses under the policy
        policy_rejected_logps: Log probs of rejected responses under the policy
        reference_chosen_logps: Log probs of chosen responses under the reference
        reference_rejected_logps: Log probs of rejected responses under the reference
        beta: Scaling coefficient applied to the log-ratio difference

    Returns:
        Scalar DPO loss, averaged over the batch
    """
    # Log ratios between policy and reference for each response
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    # DPO loss: -log sigmoid(beta * (chosen_logratio - rejected_logratio))
    logits = beta * (chosen_logratios - rejected_logratios)
    loss = -F.logsigmoid(logits).mean()

    return loss

# Example: a batch of 4 preference pairs
policy_chosen = torch.tensor([-50.0, -45.0, -55.0, -48.0])
policy_rejected = torch.tensor([-52.0, -48.0, -58.0, -46.0])
ref_chosen = torch.tensor([-51.0, -46.0, -56.0, -49.0])
ref_rejected = torch.tensor([-51.0, -46.0, -56.0, -49.0])

loss = compute_dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1)
print(f"DPO Loss: {loss.item():.4f}")
```

```
DPO Loss: 0.6262
```
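`compute_dpo_loss` takes one log-probability per response as input. A minimal sketch of how such values could be obtained from model logits is shown below. The helper name `sequence_logps` and the mask convention are assumptions for illustration, and the logits are assumed to be already aligned with the labels (with a causal LM you would first shift them by one position).

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """
    Sum token log-probs over the response tokens of each sequence.

    logits: (batch, seq_len, vocab), aligned so logits[:, t] scores labels[:, t]
    labels: (batch, seq_len) target token ids
    mask:   (batch, seq_len) 1.0 for response tokens, 0.0 for prompt or padding
    """
    token_logps = F.log_softmax(logits, dim=-1)
    # Pick the log-prob assigned to each target token
    picked = token_logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (picked * mask).sum(dim=-1)

# Tiny check with uniform logits: every token has probability 1 / vocab
batch, seq_len, vocab = 2, 5, 10
logits = torch.zeros(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))
mask = torch.tensor([[0.0, 0.0, 1.0, 1.0, 1.0],
                     [0.0, 1.0, 1.0, 0.0, 0.0]])

logps = sequence_logps(logits, labels, mask)
print(logps)  # 3 and 2 response tokens, each contributing log(1/10)
```

Summing (rather than averaging) over tokens matches the sequence-level probability $\pi(y|x)$ that appears in the loss.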


## How DPO Works

```
┌──────────────────────────────────────────────────────────────┐
│                         DPO Training                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Input: (prompt, chosen, rejected) preference pairs          │
│                                                              │
│  ┌─────────────┐    ┌─────────────┐                         │
│  │   Policy    │    │  Reference  │                         │
│  │  (trainable)│    │   (frozen)  │                         │
│  └──────┬──────┘    └──────┬──────┘                         │
│         │                  │                                 │
│    Log probs           Log probs                            │
│         │                  │                                 │
│         └────────┬─────────┘                                 │
│                  │                                           │
│           Compute log ratios                                 │
│                  │                                           │
│           DPO Loss                                           │
│                  │                                           │
│           Update Policy                                      │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
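The flow in the diagram can be sketched end to end on a toy problem. Everything here is an illustrative assumption: the "policy" is just a table of logits over 5 candidate responses for each of 3 prompts, and the names `toy_logps`, `policy_logits`, and `reference_logits` are hypothetical. A real setup would use language models and per-token log-probs instead.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy setup: log pi(y|x) = log_softmax(logits[x])[y]
num_prompts, num_responses = 3, 5
policy_logits = torch.zeros(num_prompts, num_responses, requires_grad=True)  # trainable
reference_logits = torch.zeros(num_prompts, num_responses)                   # frozen

# Preference pairs as (prompt index, chosen index, rejected index)
prompts = torch.tensor([0, 1, 2])
chosen = torch.tensor([1, 0, 4])
rejected = torch.tensor([3, 2, 0])

def toy_logps(logits, prompt_idx, response_idx):
    """Log-prob of each selected response under a table of logits."""
    return F.log_softmax(logits, dim=-1)[prompt_idx, response_idx]

optimizer = torch.optim.SGD([policy_logits], lr=1.0)
beta = 0.1
losses = []

for step in range(50):
    # Reference log probs: no gradients flow into the frozen model
    with torch.no_grad():
        ref_chosen_logps = toy_logps(reference_logits, prompts, chosen)
        ref_rejected_logps = toy_logps(reference_logits, prompts, rejected)

    # Policy log probs
    pol_chosen_logps = toy_logps(policy_logits, prompts, chosen)
    pol_rejected_logps = toy_logps(policy_logits, prompts, rejected)

    # Log ratios -> DPO loss -> policy update
    margin = (pol_chosen_logps - ref_chosen_logps) - (pol_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(beta * margin).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The first loss equals $\ln 2 \approx 0.6931$ because the policy starts identical to the reference. It then decreases as the policy shifts probability toward the chosen responses, while the reference stays unchanged.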

## Implicit Reward Model

DPO implicitly defines a reward model:

$$r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$

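The reward is the log-ratio between policy and reference, scaled by β. It is defined only up to a prompt-dependent constant ($\beta \log Z(x)$), which does not affect comparisons between responses to the same prompt. No separate reward model is needed.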

```python
def compute_implicit_reward(
    policy_logps: torch.Tensor,
    reference_logps: torch.Tensor,
    beta: float = 0.1
) -> torch.Tensor:
    """
    Compute the implicit reward under DPO.
    """
    return beta * (policy_logps - reference_logps)

# Example
implicit_reward = compute_implicit_reward(policy_chosen, ref_chosen, beta=0.1)
print(f"Implicit rewards for chosen responses: {implicit_reward.tolist()}")
```

```
Implicit rewards for chosen responses: [0.10000000149011612, 0.10000000149011612, 0.10000000149011612, 0.10000000149011612]
```
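One common use of implicit rewards is monitoring: comparing the reward of the chosen and rejected response in each pair shows how often the policy already ranks them correctly. The sketch below is self-contained and reuses the example values from the DPO loss cell. The metric names `reward_margins` and `reward_accuracy` are illustrative choices, not a fixed API.

```python
import torch

beta = 0.1

# Same example log probs as in the DPO loss example
policy_chosen = torch.tensor([-50.0, -45.0, -55.0, -48.0])
policy_rejected = torch.tensor([-52.0, -48.0, -58.0, -46.0])
ref_chosen = torch.tensor([-51.0, -46.0, -56.0, -49.0])
ref_rejected = torch.tensor([-51.0, -46.0, -56.0, -49.0])

# Implicit rewards for both responses in each pair
chosen_rewards = beta * (policy_chosen - ref_chosen)
rejected_rewards = beta * (policy_rejected - ref_rejected)

# Margin > 0 means the implicit reward prefers the chosen response
reward_margins = chosen_rewards - rejected_rewards
reward_accuracy = (chosen_rewards > rejected_rewards).float().mean()

print(f"Mean reward margin: {reward_margins.mean().item():.4f}")
print(f"Reward accuracy: {reward_accuracy.item():.2f}")
```

Here three of the four pairs have a positive margin. The fourth pair, where the policy raised the rejected response's log-prob relative to the reference, is ranked incorrectly.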


## When to Use DPO vs RLHF

**Use DPO when:**
- You want simpler training
- Memory is constrained
- Stability is a priority
- You have good preference data

**Use RLHF when:**
- You need to iterate on the reward model
- You have many prompts without preference labels and can afford on-policy rollouts
- Maximum flexibility is needed

## Next Steps

In the following notebooks, we'll cover:

1. **DPO vs RLHF** — Detailed comparison
2. **DPO Loss** — Deep dive into the math
3. **DPO Training** — Complete implementation