
## Preference Modeling

**Preference Modeling** is the process of teaching an LLM to prefer *better* responses over *worse* ones by learning from **human or AI-generated comparisons**.

It answers the question:

> “Given two possible answers from the model, which one is better according to humans?”

This step transforms an SFT-trained instruction model into a **helpful, safe, aligned assistant**.

---

### Why Preference Modeling Is Needed

After **Supervised Fine-Tuning (SFT)**, a model knows *how* to follow instructions, but it has problems:

* May produce unsafe or harmful outputs
* May hallucinate
* May be overly verbose
* May misunderstand user intent
* May behave inconsistently
* May produce several plausible answers with no way to tell which one is best

SFT teaches *how to respond*,
Preference Modeling teaches *how to choose the best response*.

---

### Preference Modeling in the LLM Training Pipeline

Full LLM training pipeline:

```
1. Pretraining
   (predict next token, learn knowledge)

2. SFT (Instruction Tuning)
   (learn to follow instructions)

3. Preference Modeling
   (learn which outputs humans prefer)

4. RLHF or DPO
   (optimize behavior using preferences)
```

Preference modeling is step **3**, between SFT and RLHF/DPO.

---

### How Preference Modeling Works (Intuition)

#### 1) Collect **pairs** of model responses

For each instruction, the model generates 2 or more answers.

Example:

```
Prompt: Explain gravity.

Response A: "Gravity is a force that pulls objects together."
Response B: "Gravity is magic glue that makes stuff stick."
```

#### 2) A human (or strong AI model) marks:

* which answer is preferred
* and why

Example label:

```
Chosen: A
Rejected: B
```

This is called **preference data**.
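
When annotators rank more than two responses to the same prompt, a common approach is to expand the ranking into chosen/rejected pairs. The sketch below assumes the ranking is given best-first; the helper name `ranking_to_pairs` and the middle response are hypothetical and exist only for illustration.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (chosen, rejected) records."""
    pairs = []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs

# Hypothetical annotation: three answers ranked best first.
ranked = [
    "Gravity is a force that pulls objects together.",
    "Gravity makes things fall.",
    "Gravity is magic glue that makes stuff stick.",
]
pairs = ranking_to_pairs("Explain gravity.", ranked)
print(len(pairs))  # 3 pairs from 3 ranked responses
for p in pairs:
    print(p["chosen"], ">", p["rejected"])
```

Each record uses the same `prompt` / `chosen` / `rejected` keys as the toy dataset in the demo, so the pairs could be fed to the same training loop.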

#### 3) Train a **Reward Model (RM)**

The reward model learns to assign a score:

$$
\mathrm{RM}(\text{prompt}, \text{response}) \rightarrow \text{quality score}
$$

Higher score = better according to human preference.

#### 4) Optimize the LLM with RLHF or DPO

* RLHF uses the reward model's score as the reward signal and typically optimizes the LLM with PPO
* DPO optimizes the preference ordering directly on the chosen/rejected pairs
* DPO needs no separate reward model, but it still needs preference data
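
To make the DPO bullet concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probability of each response under the policy being trained and under a frozen reference model (typically the SFT model). The function name `dpo_loss`, the numeric log-probabilities, and `beta=0.1` are illustrative assumptions, not values from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin));
    # adequate for the moderate margins in this toy example.
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: loss falls below log(2).
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

The loss has the same pairwise form as the reward-model loss, with the scaled log-ratio between policy and reference playing the role of the reward.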

---

### What the Reward Model Learns

The Reward Model learns human preferences about:

* helpfulness
* harmlessness
* truthfulness
* politeness
* formatting (JSON, code blocks, lists)
* style and conciseness
* avoidance of harmful content
* refusal behaviors

It becomes a **learned proxy for human preferences**.

---

### Why Preference Modeling Works

Pretrained and SFT-tuned LLMs often produce *several* plausible answers, but humans tend to prefer:

* clearer writing
* simpler explanations
* safer answers
* more accurate steps
* shorter or more detailed answers depending on intent

Preference modeling explicitly teaches the model to:

> “Choose the answer that humans like more.”

---

### Example: What Preference Modeling Fixes

#### Without preference modeling:

```
User: How do I make a bomb?
Model: Here is how to make...
```

#### With preference modeling:

```
User: How do I make a bomb?
Model: I cannot assist with dangerous actions...
```

Human annotators mark these safety-focused replies as *preferred*.

---

### Types of Preference Modeling

#### **RM-based Preferences (RLHF pipeline)**

Steps:

* Train a Reward Model
* Use PPO to optimize LLM based on reward

Used by OpenAI (InstructGPT → GPT-3.5 → GPT-4).

---

#### **Direct Preference Optimization (DPO)**

Newer method:

* Skips training a reward model
* Uses preference pairs directly
* Cheaper and simpler to run, often with comparable quality
* Widely used for open-source models

---

#### **Reinforcement Learning from AI Feedback (RLAIF)**

Preferences generated by AI, not humans.

Used by:

* Anthropic Claude models
* Some LLaMA models
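
The labeling step can be sketched as follows: an AI judge compares two responses and its verdict becomes a chosen/rejected record. The `judge` callable here is a hypothetical stand-in for a call to a strong LLM, and `toy_judge` uses a deliberately trivial rule so the example runs on its own.

```python
def label_with_ai_judge(prompt, response_a, response_b, judge):
    """Build a preference record from an AI judge's verdict ('A' or 'B')."""
    verdict = judge(prompt, response_a, response_b)
    if verdict not in ("A", "B"):
        raise ValueError(f"judge must return 'A' or 'B', got {verdict!r}")
    if verdict == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def toy_judge(prompt, response_a, response_b):
    # Stand-in for an LLM call; this keyword rule is purely illustrative.
    return "B" if "magic" in response_a.lower() else "A"

record = label_with_ai_judge(
    "Explain gravity.",
    "Gravity is a force that pulls objects together.",
    "Gravity is magic glue that makes stuff stick.",
    judge=toy_judge,
)
print(record["chosen"])
```

The output has the same shape as human-labeled preference data, so the downstream training step does not need to change.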

---

### Benefits of Preference Modeling

| Benefit            | Why It Matters                                     |
| ------------------ | -------------------------------------------------- |
| Better alignment   | Helps model behave according to human expectations |
| Better safety      | Avoids harmful, toxic, illegal, biased outputs     |
| Better reasoning   | Models prefer answers with good logic              |
| Better formatting  | JSON, code, stepwise solutions                     |
| Better helpfulness | Creates ChatGPT-like behavior                      |

Preference modeling tunes *behavior*, not *knowledge*.

---

### Summary Diagram

```
     Human-Labeled Comparisons
       ↓         ↓
  (Answer A)   (Answer B)
       ↘       ↙
         Preference Label
       ("A is better than B")
                 ↓
       Train Reward Model
                 ↓
     Optimize LLM using RM
 (RLHF, PPO) or (DPO directly)
                 ↓
     Aligned, safe, helpful model
```

---

**One-Sentence Summary**

**Preference modeling teaches an LLM to produce answers humans prefer by training it on response comparisons, forming the foundation of RLHF, DPO, and modern aligned assistant models.**

| Step                    | What It Teaches                  | Data Type                    | Model Learns                  |
| ----------------------- | -------------------------------- | ---------------------------- | ----------------------------- |
| **SFT**                 | *How to respond to instructions* | Instruction → Response pairs | Task behavior                 |
| **Preference Modeling** | *Which response is better*       | Chosen vs Rejected responses | Human preferences & alignment |

---

| Feature               | SFT                    | Preference Modeling               |
| --------------------- | ---------------------- | --------------------------------- |
| Data Needed           | Instruction → Response | Chosen vs Rejected                |
| Teaches               | Skill & behavior       | Human preferences                 |
| Fixes                 | Task execution         | Safety, style, reasoning quality  |
| Requires ground truth | Yes                    | No                                |
| Used in               | LLaMA-Instruct         | RLHF, DPO, Claude-style alignment |
| Stage                 | Early                  | After SFT                         |
| Alignment             | Weak                   | Strong                            |
| Cost                  | Medium                 | High (human labels)               |


### Demo

1. Create **two responses** for each prompt
2. Label one as **chosen** and the other as **rejected**
3. Build a **Reward Model (RM)** that predicts a score for each response
4. Train the RM so that:
   $$
   R(\text{chosen}) > R(\text{rejected})
   $$
5. Test the reward model

This is the reward-modeling step that precedes PPO in an RLHF pipeline. DPO skips this step and trains on the chosen/rejected pairs directly.

We use a **tiny DistilBERT** model for demonstration.

---

#### Install Dependencies

```bash
pip install transformers datasets accelerate torch
```

---

#### Create a Toy Preference Dataset

```python
from datasets import Dataset

data = [
    {
        "prompt": "Explain gravity.",
        "chosen": "Gravity is a force that pulls objects toward each other.",
        "rejected": "Gravity is when stuff goes down for no reason."
    },
    {
        "prompt": "What is photosynthesis?",
        "chosen": "Plants convert sunlight into energy using chlorophyll.",
        "rejected": "Photosynthesis means plants like the sun."
    }
]

dataset = Dataset.from_list(data)
print(dataset)  # shows the column names and number of rows
```

---

#### Build a Tiny Reward Model

We use DistilBERT with a **single scalar head** that outputs a *reward score*.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class RewardModel(nn.Module):
    def __init__(self, model_name="distilbert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state[:, 0]  # CLS token
        reward = self.reward_head(last_hidden)
        return reward
```

---

#### Tokenizer + Model

```python
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = RewardModel()
```

---

#### Preference Loss Function

We use a **pairwise Bradley-Terry loss**, standard in preference modeling.

$$
\mathcal{L} = -\log(\sigma(R_c - R_r))
$$

Where:

* $R_c$ = reward of chosen answer
* $R_r$ = reward of rejected answer

```python
import torch.nn.functional as F

def preference_loss(chosen_reward, rejected_reward):
    # logsigmoid is numerically more stable than log(sigmoid(x)) for large |x|
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```
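
To see how the loss behaves, it helps to plug in a few numbers. The sketch below uses only the standard library and made-up reward values; under the Bradley-Terry model, $\sigma(R_c - R_r)$ is the predicted probability that the chosen answer wins. The helper names are illustrative.

```python
import math

def preference_probability(chosen_reward, rejected_reward):
    """Bradley-Terry probability that the chosen answer beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(chosen_reward - rejected_reward)))

def pair_loss(chosen_reward, rejected_reward):
    """Negative log-likelihood of the observed preference."""
    return -math.log(preference_probability(chosen_reward, rejected_reward))

# Made-up reward values: correct ranking, tie, and wrong ranking.
for r_c, r_r in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    p = preference_probability(r_c, r_r)
    print(f"R_c={r_c:+.1f}  R_r={r_r:+.1f}  P(chosen wins)={p:.3f}  loss={pair_loss(r_c, r_r):.3f}")
```

A correct ranking with a margin of 2 gives a loss of about 0.127, a tie gives $\log 2 \approx 0.693$, and a wrong ranking with the same margin gives about 2.127. The loss depends only on the difference $R_c - R_r$, so the absolute scale of the rewards is not pinned down.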

---

#### Train the Reward Model

```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(5):
    total_loss = 0

    for row in dataset:
        prompt = row["prompt"]

        # Tokenize chosen + rejected
        chosen_text = prompt + " " + row["chosen"]
        rejected_text = prompt + " " + row["rejected"]

        c = tokenizer(chosen_text, return_tensors="pt", padding=True, truncation=True)
        r = tokenizer(rejected_text, return_tensors="pt", padding=True, truncation=True)

        chosen_reward = model(**c)
        rejected_reward = model(**r)

        loss = preference_loss(chosen_reward, rejected_reward)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1} - Loss: {total_loss:.4f}")
```

This trains the RM to give “good” answers a higher score.

---

#### Test the Reward Model

```python
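def score(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    model.eval()  # make sure dropout is disabled
    with torch.no_grad():  # no gradients needed for scoring
        return model(**inputs).item()

prompt = "Explain gravity."
good = "Gravity is a force that pulls objects together."
bad = "Gravity is magic glue."

# Join prompt and response with a space, matching the training format
print("Good answer score:", score(prompt + " " + good))
print("Bad answer score:", score(prompt + " " + bad))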
```

Expected output:

```
Good answer score: higher number
Bad answer score: lower number
```

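If the good answer scores higher, the reward model has picked up the preference pattern in the training pairs. With only two training examples this is a sanity check, not evidence that the model generalizes.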
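
A more systematic check is pairwise accuracy: the fraction of held-out pairs where the chosen response receives the higher score. The sketch below is a minimal version under stated assumptions: `held_out` is hypothetical data, and `length_score` is a stand-in scorer (it just rewards longer text) so the example runs without the trained model. To evaluate the reward model, pass the `score` function from the test above instead.

```python
def pairwise_accuracy(pairs, score_fn):
    """Fraction of pairs where score_fn ranks the chosen response higher."""
    correct = 0
    for row in pairs:
        chosen_score = score_fn(row["prompt"] + " " + row["chosen"])
        rejected_score = score_fn(row["prompt"] + " " + row["rejected"])
        correct += chosen_score > rejected_score
    return correct / len(pairs)

# Hypothetical stand-in scorer: longer text gets a higher score.
def length_score(text):
    return float(len(text))

# Hypothetical held-out pairs, in the same format as the toy dataset.
held_out = [
    {"prompt": "Explain gravity.",
     "chosen": "Gravity is a force that pulls objects toward each other.",
     "rejected": "Gravity is magic glue."},
    {"prompt": "What is photosynthesis?",
     "chosen": "Plants make food from light.",
     "rejected": "Photosynthesis means that plants really, really like the sun a lot."},
]
print(pairwise_accuracy(held_out, length_score))  # 0.5
```

The length-based scorer gets only the first pair right, which also illustrates a known pitfall: a reward model that has merely learned "longer is better" will look good on some pairs and fail on others.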

---