```{contents}
```

## Direct Preference Optimization (DPO)

DPO is a **post-training alignment method** that teaches an LLM to prefer *better* responses over *worse* ones **without training a reward model and without using RL**.

It directly learns from **preference pairs**:

```
Given two answers:
A = human-preferred ("chosen")
B = human-rejected ("rejected")

→ Make the model more likely to generate A than B
```

DPO is simpler and cheaper to run than PPO-based RLHF and tends to be more stable to train. The original DPO paper reports results comparable to or better than PPO on the tasks it evaluates.

---

### Why DPO Was Invented

RLHF (Reinforcement Learning from Human Feedback) has **three costly steps**:

1. **Train a Reward Model (RM)** from preference data
2. **Run PPO reinforcement learning** using the RM
3. **Stabilize** training with a KL penalty that keeps the policy close to the SFT model

Problems:

* RM training requires a separate neural network
* PPO (policy gradient RL) is unstable and expensive
* Hard to tune
* More compute, more code, more engineering

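### DPO removes most of this machinery:

* ✔ No separate reward model
* ✔ No reinforcement learning loop
* ✔ No PPO
* ✔ No policy-gradient sampling during training
* ✔ No separately tuned KL penalty (the constraint is built into the loss through β)

It is a **supervised-style classification loss** over preference pairs, evaluated against a frozen reference model.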

---

### What DPO Does in Simple Words

DPO teaches the model:

> “For this prompt, make the chosen answer more likely than the rejected answer.”

Meaning:

* Increase log-probability of the chosen output
* Decrease log-probability of the rejected one
* Keep the model close to the original model (to avoid drift)
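
To make "log-probability of an output" concrete: for an autoregressive model, the log-probability of a whole response is the sum of the log-probabilities of its tokens. The sketch below uses made-up per-token probabilities (they are illustrative assumptions, not real model outputs) to show the quantity DPO tries to raise for the chosen answer and lower for the rejected one.

```python
import math

def sequence_logprob(token_probs):
    """Sum the log-probabilities of each response token.

    token_probs[i] is the probability the model assigns to token i,
    given the prompt and all earlier response tokens.
    """
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities (illustrative only).
chosen_token_probs = [0.9, 0.8, 0.95]
rejected_token_probs = [0.5, 0.4, 0.3]

chosen_logp = sequence_logprob(chosen_token_probs)
rejected_logp = sequence_logprob(rejected_token_probs)

print(f"chosen log-prob:   {chosen_logp:.3f}")
print(f"rejected log-prob: {rejected_logp:.3f}")
```

In a real setup these numbers come from a forward pass of the model over the prompt plus response, with the prompt tokens masked out of the sum.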

---

### The Core DPO Loss (Intuition, Not Math)

For each sample:

```
User Prompt: "Explain gravity."

Chosen Answer:   accurate answer
Rejected Answer: poor, harmful, or wrong answer
```

The DPO objective pushes the LLM to:

* **Increase** likelihood(chosen | prompt)
* **Decrease** likelihood(rejected | prompt)
* Stay close to the base model

The key intuition:

> The model learns human preferences directly from the difference between chosen and rejected responses.

No reward model needed.

---

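### The Actual DPO Formula

The core term is the gap between the two log-probabilities:

$$
\log \pi(chosen) - \log \pi(rejected)
$$

Maximizing this gap alone would let the model drift arbitrarily far from the reference model π₀ (usually the SFT checkpoint). DPO therefore measures each log-probability *relative to π₀* and passes the difference through a logistic loss. With x the prompt, y_w the chosen response, and y_l the rejected response:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi(y_w \mid x)}{\pi_0(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_0(y_l \mid x)}\right)\right]
$$

Here σ is the sigmoid function and β sets how strongly π is held close to π₀. There is no separate KL term: the log-ratios against π₀ play that role, which is what keeps training stable.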
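
For a single preference pair, the DPO loss can be sketched in a few lines of plain Python. The function name, argument names, and the log-probability values are illustrative assumptions; the formula is the standard one: the negative log-sigmoid of β times the difference between the chosen and rejected log-ratios against the reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple."""
    # Log-ratios of the trained policy against the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# At the start of training the policy equals the reference, so the loss is log(2).
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Hypothetical later state: chosen became more likely, rejected less likely.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"{loss_at_start:.3f}")  # about 0.693
print(f"{loss_after:.3f}")     # about 0.474
```

Note that only the log-ratios matter: raising the chosen log-probability by the same amount in both the policy and the reference leaves the loss unchanged.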

---

### Why DPO Is Often Preferred Over RLHF

#### ✔ **1. No Reward Model Needed**

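A learned reward model is an imperfect proxy for human preference: it can overfit its training comparisons, and a policy optimized hard against it can exploit its errors (reward hacking).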

#### ✔ **2. No PPO (No RL Loop)**

Training becomes:

* Simple
* Stable
* Fast

#### ✔ **3. Direct Optimization**

You directly optimize preference data instead of relying on an auxiliary reward signal.

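#### ✔ **4. Competitive Alignment Quality**

In the experiments reported in the original DPO paper, DPO:

* matched or exceeded PPO-based RLHF on sentiment control, summarization, and single-turn dialogue
* trained stably with little hyperparameter tuning

How well this carries over to safety and to other tasks depends on the quality and coverage of the preference data.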

#### ✔ **5. Lower Cost**

You skip:

* RM training
* RL optimization
* long PPO rollout steps

DPO = **SFT-style training on preference pairs**.

---

### What Data DPO Uses

DPO uses *preference datasets*:

```
prompt
chosen_answer
rejected_answer
```

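Public examples include Anthropic's HH-RLHF dataset and the Stack Exchange preference data used for StackLLaMA. OpenAI's InstructGPT was trained on comparison data of the same shape, but that dataset is not public.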
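
Before training, it helps to check that every record really has this three-field shape. The following is a minimal sketch of such a check; the function name `validate_preference_record` and the specific rules (non-empty strings, chosen different from rejected) are assumptions made for illustration, not part of any library.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_preference_record(record):
    """Return a list of problems found in one preference record (empty if OK)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical answers carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems

good = {
    "prompt": "What is 2 + 2?",
    "chosen": "The answer is 4.",
    "rejected": "It depends on your perspective.",
}
bad = {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "4"}

print(validate_preference_record(good))
print(validate_preference_record(bad))
```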

---

### Simple Analogy

* **SFT**: “Here is a correct answer. Learn to copy it.”
* **DPO**: “Here are two answers. Learn to prefer A over B.”

This makes DPO a **behavior optimizer**, not a task teacher.

---

### Where DPO Fits in the LLM Pipeline

```
1. Pretraining  
   (next-token prediction)

2. SFT  
   (instruction-following)

3. Preference Data  
   (chosen vs rejected answers)

4. DPO  
   (optimize model to prefer chosen answers)
```

DPO replaces the reward-model and PPO stages of RLHF.

---

### When To Use DPO

Use DPO when:

* You have human preference data
* You want ChatGPT-like alignment
* You want to avoid reward models
* You want stability and simplicity
* You want lower cost alignment

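Zephyr is a well-known example of an open model aligned with DPO, and many later open models use DPO or one of its variants in place of, or alongside, PPO-based RLHF. Not every "Instruct" model does: some are trained with SFT only, and others still use PPO.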

---

### An Intuitive Example

Prompt:

```
What is 2 + 2?
```

Chosen:

```
The answer is 4.
```

Rejected:

```
It depends on your perspective.
```

DPO tells the model:

* Increase probability of “The answer is 4.”
* Decrease probability of “It depends…”
* Don’t change the rest of the model too much

Over many examples, the model tends to become more:

* helpful
* accurate
* safe
* aligned with human preference
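
One way to read this example numerically: DPO is derived from the Bradley-Terry preference model, under which the model's implied probability that the chosen answer beats the rejected one is the sigmoid of β times the gap between their log-ratios against the reference model. The snapshot values below are hypothetical and only illustrate the direction of change during training.

```python
import math

def preference_probability(chosen_logratio, rejected_logratio, beta=0.1):
    """Implied P(chosen preferred over rejected) under the Bradley-Terry model.

    Each log-ratio is log pi(answer | prompt) - log pi_ref(answer | prompt).
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical (chosen_logratio, rejected_logratio) at three points in training.
snapshots = [(0.0, 0.0), (1.0, -2.0), (3.0, -7.0)]
probs = [preference_probability(c, r) for c, r in snapshots]

for (c, r), p in zip(snapshots, probs):
    print(f"chosen {c:+.1f}, rejected {r:+.1f} -> P(chosen wins) = {p:.3f}")
```

Before any training the policy equals the reference, both log-ratios are zero, and the implied preference is exactly 0.5.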

---

**One-Sentence Summary**

**Direct Preference Optimization (DPO) trains an LLM to prefer human-preferred responses over rejected ones using a direct supervised objective, eliminating the need for reward models and reinforcement learning while producing high-quality aligned models.**


---

#### 1. Install Dependencies

```bash
pip install transformers datasets trl peft accelerate bitsandbytes
```

---

#### 2. Create a Toy Preference Dataset

```python
from datasets import Dataset

data = [
    {
        "prompt": "Explain gravity.",
        "chosen": "Gravity is a force that pulls objects together.",
        "rejected": "Gravity is magic glue that makes things stick."
    },
    {
        "prompt": "What is photosynthesis?",
        "chosen": "Photosynthesis is how plants convert sunlight into energy.",
        "rejected": "Photosynthesis means plants like the sun."
    }
]

dataset = Dataset.from_list(data)
dataset
```

This dataset has:

* **prompt**
* **chosen response**
* **rejected response**

Exactly what DPO needs.

---

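#### 3. Load a Small Base Model in 4-bit (QLoRA friendly)

We use a small model so the example fits on a single modest GPU. Note that 4-bit loading through `bitsandbytes` generally requires a CUDA GPU.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

MODEL = "facebook/opt-350m"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# Make sure a pad token exists so batches can be padded.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 4-bit NF4 quantization settings, as used in QLoRA-style training.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
```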

---

#### 4. Apply LoRA Adapters (Efficient Training)

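```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Freeze the quantized base weights and apply PEFT's stability tweaks
# for training on top of a k-bit model.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap the model so only the LoRA adapter weights are trainable.
model = get_peft_model(model, lora_config)
```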
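
To see why this is cheap, it helps to count the adapter weights. LoRA adds two small matrices per adapted weight: one of shape rank × d_in and one of shape d_out × rank. The sketch below does that arithmetic; the hidden size (1024) and layer count (24) are assumed values for `facebook/opt-350m` and should be checked against the model config.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one weight matrix of shape (d_out, d_in)."""
    # Matrix A is rank x d_in, matrix B is d_out x rank.
    return rank * (d_in + d_out)

HIDDEN_SIZE = 1024      # assumed hidden size of facebook/opt-350m
NUM_LAYERS = 24         # assumed number of decoder layers
TARGETS_PER_LAYER = 2   # q_proj and v_proj
RANK = 16

per_matrix = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total = per_matrix * TARGETS_PER_LAYER * NUM_LAYERS

print(per_matrix)  # 32768
print(total)       # 1572864 under the assumptions above
```

On the real model, `model.print_trainable_parameters()` (a PEFT method) reports the actual count, which is the number to trust.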

---

#### 5. Setup DPOTrainer

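```python
from trl import DPOConfig, DPOTrainer

# This targets recent TRL releases, where DPO-specific options live in DPOConfig.
# Older releases instead took `beta`, `max_length`, `max_prompt_length`, and
# `tokenizer=` directly in DPOTrainer, next to transformers.TrainingArguments.
training_args = DPOConfig(
    output_dir="./dpo_out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=1,
    fp16=True,
    beta=0.1,         # strength of the pull toward the reference model
    max_length=256,   # max tokens for prompt + response
)

# With a PEFT model and no ref_model, TRL uses the base model with the
# adapter disabled as the frozen reference.
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
```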
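
The role of `beta` is easier to see from the gradient than from the loss. For loss = -log σ(β · gap), where gap is the chosen log-ratio minus the rejected log-ratio, the size of the gradient with respect to the gap is β · σ(-β · gap). The sketch below evaluates that expression for a few values; the helper names and the chosen gaps are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_weight(logratio_gap, beta):
    """Magnitude of d(loss)/d(gap) for loss = -log(sigmoid(beta * gap))."""
    return beta * sigmoid(-beta * logratio_gap)

for beta in (0.01, 0.1, 0.5):
    weights = [gradient_weight(gap, beta) for gap in (0.0, 5.0, 20.0)]
    print(beta, [f"{w:.5f}" for w in weights])
```

With a large `beta` the weight collapses toward zero once the gap is modest, so the policy stops moving while still close to the reference. With a small `beta` the push persists over much larger gaps, which allows more drift.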

---

#### 6. Train the Model Using DPO

```python
trainer.train()
```

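For each batch, `DPOTrainer` does the following automatically:

* Computes the log-probability of the chosen and rejected responses under the model being trained (π)
* Computes the same two log-probabilities under the frozen reference model (π₀)
* Minimizes the DPO loss built from the two log-ratios:
  $$
  -\log \sigma\left(\beta \left[\log \frac{\pi(chosen)}{\pi_0(chosen)} - \log \frac{\pi(rejected)}{\pi_0(rejected)}\right]\right)
  $$

There is no explicit KL term; β controls how far π may move from π₀.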
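
While training, it is common to monitor the *implicit rewards*, defined as β times the log-ratio between the policy and the reference model, together with the fraction of pairs where the chosen reward exceeds the rejected one. TRL logs metrics of this kind (for example `rewards/chosen`, `rewards/rejected`, `rewards/margins`, `rewards/accuracies`). The sketch below computes them by hand on a hypothetical batch of log-probabilities.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit DPO reward: beta * (log pi - log pi_ref)."""
    return beta * (policy_logp - ref_logp)

# Hypothetical batch of sequence log-probabilities:
# (policy_chosen, ref_chosen, policy_rejected, ref_rejected)
batch = [
    (-10.0, -12.0, -18.0, -15.0),
    (-20.0, -19.0, -25.0, -22.0),
    (-8.0, -8.0, -9.0, -10.0),
]

chosen_rewards = [implicit_reward(pc, rc) for pc, rc, _, _ in batch]
rejected_rewards = [implicit_reward(pr, rr) for _, _, pr, rr in batch]
margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]

# Fraction of pairs where the chosen answer gets the higher implicit reward.
accuracy = sum(m > 0 for m in margins) / len(margins)

print([round(m, 3) for m in margins])
print(round(accuracy, 3))
```

A rising accuracy and margin indicate the model is separating chosen from rejected answers; on a tiny dataset they can saturate quickly without implying real generalization.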

---

#### 7. Save the Tuned Model

```python
trainer.model.save_pretrained("dpo_lora_model")
tokenizer.save_pretrained("dpo_lora_model")
```

---

#### 8. Test the DPO-Tuned Model

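```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# The saved directory holds only the LoRA adapter. AutoPeftModelForCausalLM
# loads the base model named in the adapter config and attaches the adapter.
model = AutoPeftModelForCausalLM.from_pretrained("dpo_lora_model", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("dpo_lora_model")

prompt = "Explain gravity."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```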

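With only two toy examples, the change in behavior will be negligible. Trained on a real preference dataset, you would expect the model to:

* Prefer accurate, safe, helpful responses
* Avoid the "rejected" style responses
* Produce more aligned behavior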

---


