# Preference Datasets

**Understanding preference data format, sources, and quality considerations**

## What is Preference Data?

**Preference data** consists of comparisons between responses, not absolute quality scores. Each sample contains:

- **prompt**: The input instruction or question
- **chosen**: The preferred response (higher quality)
- **rejected**: The non-preferred response (lower quality)

**Key insight:** Humans are much better at comparing responses than assigning absolute scores. "Which is better?" is easier than "Rate this 1-10."

## The Required Format

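Every reward model training example needs three pieces of information, shown here as separate fields. Some datasets, including HH-RLHF below, instead embed the prompt inside the `chosen` and `rejected` transcripts: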

```json
{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris. It's known for art and culture.",
  "rejected": "paris"
}
```

| Field | Type | Description |
|-------|------|-------------|
| `prompt` | string | The instruction or question |
| `chosen` | string | The response humans preferred |
| `rejected` | string | The response humans did not prefer |
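
A quick schema check can catch malformed records before training. The sketch below assumes records are plain Python dicts with the three fields from the table; `validate_record` is a hypothetical helper written for illustration, not part of any library.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def validate_record(record):
    """Return a list of problems found in one preference record (empty if valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str):
            problems.append(f"'{field}' is missing or not a string")
        elif not value.strip():
            problems.append(f"'{field}' is empty")
    # A pair whose two responses are identical carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("'chosen' and 'rejected' are identical")
    return problems

good = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "paris",
}
bad = {"prompt": "What is the capital of France?", "chosen": "Paris", "rejected": "Paris"}

print(validate_record(good))
print(validate_record(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to summarize what is wrong across a whole dataset.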

```python
from datasets import load_dataset

# Load Anthropic's HH-RLHF dataset
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

print(f"Dataset size: {len(dataset)} examples")
print(f"Columns: {dataset.column_names}")
print()

# Show example
example = dataset[0]
print("Example preference pair:")
print(f"Chosen: {example['chosen'][:200]}...")
print(f"\nRejected: {example['rejected'][:200]}...")
```

```text
Dataset size: 160800 examples
Columns: ['chosen', 'rejected']

Example preference pair:
Chosen: 

H: What are some cuss words in english?

A: Here’s an incomplete list.

Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cu...

Rejected: 

H: What are some cuss words in english?

A: Here’s an incomplete list.

Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cu...
```

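In HH-RLHF each field holds a full transcript rather than a single response. Both transcripts start with the same conversation and typically differ only in the final assistant turn, which is why the first 200 characters printed above are identical. The sketch below recovers the three-field format by splitting on the last `Assistant:` marker. It assumes the `\n\nHuman:` / `\n\nAssistant:` markers seen in the printed example; `split_transcript` and `to_prompt_format` are hypothetical helper names.

```python
ASSISTANT_MARKER = "\n\nAssistant:"

def split_transcript(transcript):
    """Split a transcript into (prompt, final_response) at the last assistant turn."""
    idx = transcript.rfind(ASSISTANT_MARKER)
    if idx == -1:
        raise ValueError("no assistant turn found")
    cut = idx + len(ASSISTANT_MARKER)
    return transcript[:cut], transcript[cut:].strip()

def to_prompt_format(pair):
    """Convert a {'chosen', 'rejected'} transcript pair into prompt/chosen/rejected."""
    prompt_c, chosen = split_transcript(pair["chosen"])
    prompt_r, rejected = split_transcript(pair["rejected"])
    # Pairs that do not share a prompt are surfaced rather than silently converted.
    if prompt_c != prompt_r:
        raise ValueError("chosen and rejected do not share a prompt")
    return {"prompt": prompt_c, "chosen": chosen, "rejected": rejected}

# Toy pair in the same transcript style (not taken from the dataset).
toy = {
    "chosen": "\n\nHuman: What is the capital of France?\n\nAssistant: The capital of France is Paris.",
    "rejected": "\n\nHuman: What is the capital of France?\n\nAssistant: paris",
}
record = to_prompt_format(toy)
print(record)
```
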

## Popular Preference Datasets

### 1. Anthropic HH-RLHF
- **Size:** ~161K training pairs
- **Focus:** Helpfulness and Harmlessness
- **Quality:** High (trained annotators)

### 2. Stanford SHP
- **Size:** ~385K preference pairs
- **Source:** Reddit votes
- **Focus:** Factual helpfulness

### 3. OpenAssistant
- **Size:** ~161K messages with rankings
- **Source:** Community-contributed
- **Format:** Multi-turn with rankings
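
Ranked data can be reduced to the pairwise format used throughout this section: every higher-ranked response is "chosen" over every lower-ranked one. The sketch below assumes the responses to one prompt are already available as a list ordered best to worst. That input shape is an assumption for illustration; the real OpenAssistant export uses its own message-tree schema, which would need flattening first. `ranking_to_pairs` is a hypothetical helper.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand responses ranked best-to-worst into pairwise preference records."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked response.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]

pairs = ranking_to_pairs("Explain recursion.", ["answer A", "answer B", "answer C"])
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

A ranking of n responses yields n(n-1)/2 pairs, so heavily ranked prompts can dominate the dataset unless pairs are subsampled.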

## Dataset Quality Considerations

### What Makes Good Preference Data?

1. **Clear Distinction** — Chosen should be noticeably better than rejected
2. **Diverse Criteria** — Cover helpfulness, safety, accuracy, style
3. **Consistent Guidelines** — Annotators follow same criteria
4. **Representative Distribution** — Match your deployment domain

```python
# Analyze dataset statistics
import numpy as np

def analyze_preference_dataset(dataset, num_samples=1000):
    """Compute statistics on preference dataset."""
    chosen_lengths = []
    rejected_lengths = []

    for i, item in enumerate(dataset):
        if i >= num_samples:
            break
        chosen_lengths.append(len(item['chosen'].split()))
        rejected_lengths.append(len(item['rejected'].split()))

    print("Length Statistics (word count):")
    print(f"  Chosen - Mean: {np.mean(chosen_lengths):.1f}, Median: {np.median(chosen_lengths):.1f}")
    print(f"  Rejected - Mean: {np.mean(rejected_lengths):.1f}, Median: {np.median(rejected_lengths):.1f}")

    # Check for length bias
    chosen_longer = sum(1 for c, r in zip(chosen_lengths, rejected_lengths) if c > r)
    print("\nLength bias check:")
    print(f"  Chosen is longer: {100 * chosen_longer / len(chosen_lengths):.1f}% of the time")

    return chosen_lengths, rejected_lengths

chosen_lengths, rejected_lengths = analyze_preference_dataset(dataset)
```

```text
Length Statistics (word count):
  Chosen - Mean: 113.0, Median: 90.0
  Rejected - Mean: 122.2, Median: 98.0

Length bias check:
  Chosen is longer: 41.2% of the time
```


## Common Data Quality Issues

### Near-Duplicate Pairs
**Problem:** Chosen and rejected are nearly identical.

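### Ordering Bias
**Problem:** Annotators tend to prefer whichever response is shown first.

### Length Bias
**Problem:** Annotators systematically prefer longer responses, so the reward model learns to reward length rather than quality.

### Annotation Fatigue
**Problem:** Annotation quality degrades over long labeling sessions.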
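
Ordering bias is usually handled when the data is collected: shuffle which response appears first, then check afterwards how often the first slot was picked. The sketch below illustrates both steps with hypothetical helpers and a hypothetical log format (a list of `"first"` / `"second"` picks); real annotation tools will record this differently.

```python
import random

def make_annotation_task(prompt, response_a, response_b, rng):
    """Randomly assign two responses to display slots and record the mapping."""
    swapped = rng.random() < 0.5
    first, second = (response_b, response_a) if swapped else (response_a, response_b)
    return {"prompt": prompt, "first": first, "second": second, "swapped": swapped}

def first_position_rate(picked_positions):
    """Fraction of annotations in which the first-shown response was picked."""
    return sum(1 for p in picked_positions if p == "first") / len(picked_positions)

rng = random.Random(0)  # seeded so the assignment is reproducible
task = make_annotation_task("What is 2 + 2?", "4", "Four, I think.", rng)
print(task)

# Hypothetical annotation log: which slot each annotator picked.
rate = first_position_rate(["first", "second", "first", "first"])
print(f"First response picked {100 * rate:.0f}% of the time")
```

With randomized display order, a first-position rate far from 50% on a large sample suggests that position, not quality, is influencing the labels.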

```python
from difflib import SequenceMatcher

import numpy as np  # already imported above; repeated so this cell runs on its own

def check_similarity(dataset, num_samples=100):
    """Check for near-duplicate preference pairs."""
    similarities = []

    for i, item in enumerate(dataset):
        if i >= num_samples:
            break
        # ratio() is 1.0 for identical strings and 0.0 for strings with nothing in common
        sim = SequenceMatcher(None, item['chosen'], item['rejected']).ratio()
        similarities.append(sim)

    high_sim = sum(1 for s in similarities if s > 0.9)

    print(f"Similarity analysis (first {num_samples} samples):")
    print(f"  Mean similarity: {np.mean(similarities):.3f}")
    print(f"  High similarity (>0.9): {high_sim} pairs ({100*high_sim/len(similarities):.1f}%)")

    if high_sim > 0:
        print("  ⚠️ Warning: Some pairs are very similar - may need filtering")

check_similarity(dataset)
```

```text
Similarity analysis (first 100 samples):
  Mean similarity: 0.685
  High similarity (>0.9): 15 pairs (15.0%)
```
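
Once near-duplicates are detected, the natural next step is to filter them out. The sketch below applies the same `SequenceMatcher` ratio and 0.9 threshold to a list of dicts; `filter_near_duplicates` is a hypothetical helper. For HH-RLHF, whole-transcript similarity is likely inflated by the shared conversation prefix, so comparing only the final responses may give a more meaningful score.

```python
from difflib import SequenceMatcher

def filter_near_duplicates(pairs, threshold=0.9):
    """Split pairs into (kept, dropped) by chosen/rejected string similarity."""
    kept, dropped = [], []
    for pair in pairs:
        sim = SequenceMatcher(None, pair["chosen"], pair["rejected"]).ratio()
        # Pairs above the threshold are too similar to carry a clear preference.
        (dropped if sim > threshold else kept).append(pair)
    return kept, dropped

toy_pairs = [
    {"chosen": "The capital of France is Paris.",
     "rejected": "The capital of France is Paris!"},
    {"chosen": "The capital of France is Paris.",
     "rejected": "I don't know."},
]
kept, dropped = filter_near_duplicates(toy_pairs)
print(f"Kept {len(kept)} pair(s), dropped {len(dropped)} pair(s)")
```

Returning the dropped pairs as well makes it possible to inspect them by hand before deciding whether the threshold is too aggressive.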


## PyTorch Dataset for Reward Model Training

```python
import torch
from torch.utils.data import Dataset

class RewardModelDataset(Dataset):
    """Dataset for reward model training."""

    def __init__(self, dataset, tokenizer, max_length=512):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]

        # Tokenize chosen response
        chosen_tokens = self.tokenizer(
            item['chosen'],
            max_length=self.max_length,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )

        # Tokenize rejected response
        rejected_tokens = self.tokenizer(
            item['rejected'],
            max_length=self.max_length,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )

        return {
            'chosen_input_ids': chosen_tokens['input_ids'].squeeze(0),
            'chosen_attention_mask': chosen_tokens['attention_mask'].squeeze(0),
            'rejected_input_ids': rejected_tokens['input_ids'].squeeze(0),
            'rejected_attention_mask': rejected_tokens['attention_mask'].squeeze(0),
        }

# Example usage
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

reward_dataset = RewardModelDataset(dataset, tokenizer, max_length=256)

sample = reward_dataset[0]
print("Dataset sample keys:", list(sample.keys()))
print(f"Chosen shape: {sample['chosen_input_ids'].shape}")
print(f"Rejected shape: {sample['rejected_input_ids'].shape}")
```

```text
Dataset sample keys: ['chosen_input_ids', 'chosen_attention_mask', 'rejected_input_ids', 'rejected_attention_mask']
Chosen shape: torch.Size([256])
Rejected shape: torch.Size([256])
```
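
One caveat with `truncation=True`: tokenizers truncate from the right by default, so a transcript longer than `max_length` loses its ending. In HH-RLHF-style data the ending is the final assistant turn, the only part where chosen and rejected differ. Hugging Face tokenizers expose a `truncation_side` setting for this. The sketch below shows the intended effect on a plain list of token ids; `truncate_keep_tail` is a hypothetical helper for illustration.

```python
def truncate_keep_tail(token_ids, max_length):
    """Keep the last max_length tokens so the final response survives truncation."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Negative slicing returns the whole list when it is already short enough.
    return token_ids[-max_length:]

token_ids = list(range(10))  # stand-in for a tokenized transcript
tail = truncate_keep_tail(token_ids, 4)
print(tail)
```

Whether to keep the head or the tail depends on the data: tail-keeping suits transcripts where the prompt comes first, at the cost of dropping early conversation context.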


## Next Steps

Now that we understand preference data, let's learn how to train reward models on it.