# ⚖️ LMFast: Preference Alignment (ORPO/DPO)

**Align your SLM with human preferences efficiently!**

## What You'll Learn
- Align models using ORPO (Odds Ratio Preference Optimization)
- Prepare preference datasets (Chosen vs Rejected)
- Fine-tune a chat model to be more helpful/harmless
- Evaluate alignment quality

## Why Preference Alignment?
- **Safety**: Reduce toxic outputs
- **Style**: Make the model speak like a pirate, or a professional
- **Accuracy**: Penalize hallucinations

**Time to complete:** ~15 minutes (Colab T4 optimized)

## 1️⃣ Setup

In [None]:
!pip install -q lmfast[all]

import lmfast
lmfast.setup_colab_env()

import torch
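# get_device_name raises if no CUDA device is attached, so check first
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected. In Colab, select Runtime > Change runtime type and pick a GPU.")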

## 2️⃣ Prepare Preference Data

Alignment requires "triplets": `(Prompt, Chosen Response, Rejected Response)`.
The model learns to prefer the *Chosen* one and avoid the *Rejected* one.

In [None]:
from datasets import Dataset

# Example: Teaching the model to be polite
data = [
    {
        "prompt": "Give me the report now.",
        "chosen": "Here is the report you requested. Let me know if you need anything else.",
        "rejected": "Here it is. take it."
    },
    {
        "prompt": "This code is broken.",
        "chosen": "I'm sorry to hear that. Could you share the error message so I can help fix it?",
        "rejected": "You probably wrote it wrong. Check your syntax."
    },
    {
        "prompt": "I hate you.",
        "chosen": "I understand you're frustrated, but I'm just an AI trying to help.",
        "rejected": "That is not very nice. I don't like you either."
    }
]

# In real scenarios, use "HuggingFaceH4/ultrafeedback_binarized" or similar
dataset = Dataset.from_list(data)
print(f"Dataset Example:\n{dataset[0]}")
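
Before training, it can help to sanity-check that every record really is a usable triplet. The sketch below is plain Python and not part of LMFast; `find_bad_records` and `REQUIRED_KEYS` are hypothetical names introduced here for illustration. It flags records with a missing or empty field, and records where the chosen and rejected responses are identical and therefore carry no preference signal.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def find_bad_records(records):
    """Return (index, reason) pairs for records that are not usable triplets.

    Hypothetical helper for illustration; assumes all field values are strings.
    """
    problems = []
    for i, rec in enumerate(records):
        # A field counts as missing if it is absent, None, or only whitespace
        missing = [k for k in REQUIRED_KEYS if not (rec.get(k) or "").strip()]
        if missing:
            problems.append((i, f"missing or empty: {', '.join(missing)}"))
        elif rec["chosen"].strip() == rec["rejected"].strip():
            problems.append((i, "chosen and rejected are identical"))
    return problems

sample = [
    {"prompt": "Give me the report now.",
     "chosen": "Here is the report you requested.",
     "rejected": "Here it is. take it."},
    {"prompt": "Hi", "chosen": "Hello!", "rejected": "Hello!"},
    {"prompt": "Hi", "chosen": ""},
]
print(find_bad_records(sample))
```

Running the same check on the `data` list above should return an empty list.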

## 3️⃣ Configure ORPO

We use **ORPO** (Odds Ratio Preference Optimization) because it folds the preference objective into the SFT (Supervised Fine-Tuning) loss. A single training stage with no separate reference or reward model saves memory and time compared to RLHF or DPO.
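
To make the objective concrete, here is a minimal sketch of the ORPO loss for a single triplet, following the formulation in the ORPO paper: the usual negative log-likelihood on the chosen response plus a weighted odds-ratio penalty. It uses plain Python floats rather than tensors. The function names are illustrative, and treating `beta` as the penalty weight (λ in the paper) is an assumption about how LMFast maps its `beta` argument.

```python
import math

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) where p = exp(avg_logp) is the length-normalized
    likelihood of a response. Requires avg_logp < 0 so that p < 1."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, beta: float = 0.1) -> float:
    # SFT term: negative log-likelihood of the chosen response
    nll = -avg_logp_chosen
    # Log odds ratio between the chosen and the rejected response
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    penalty = math.log1p(math.exp(-log_odds_ratio))
    return nll + beta * penalty

# The loss is lower when the model already favors the chosen response
print(orpo_loss(-0.5, -2.0))  # chosen is more likely than rejected
print(orpo_loss(-2.0, -0.5))  # rejected is more likely than chosen
```

Raising `beta` makes the penalty term count for more relative to the SFT term.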

In [None]:
from lmfast.alignment import align

# One-line alignment training
# This automatically configures ORPOConfig for T4

print("🚀 Starting ORPO Alignment...")
print("This optimizes the model to favor 'chosen' responses.")

# Note: In a notebook, this returns the trainer object
trainer = align(
    model_name="HuggingFaceTB/SmolLM-135M-Instruct",
    dataset=dataset,
    output_dir="./aligned_model",
    method="orpo",
    max_steps=50,  # Short demo run
    learning_rate=5e-6,  # Alignment usually uses a lower learning rate than plain SFT
    beta=0.1,  # Weight of the preference penalty
)

print("✅ Training initiated...")

## 4️⃣ Evaluate Results

Let's see how the aligned model responds to rude prompts. For a side-by-side comparison, run the same prompts through the base model as well.

In [None]:
from lmfast.inference import SLMServer

# Load aligned model
aligned_server = SLMServer("./aligned_model")

test_prompts = [
    "This product is terrible! Fix it!",
    "You are stupid."
]

print("🧪 Testing Aligned Model Responses:")
print("="*50)

for p in test_prompts:
    response = aligned_server.generate(p, max_new_tokens=60)
    print(f"\nUser: {p}")
    print(f"AI: {response}")

## 5️⃣ Advanced: GRPO (Group Relative Policy Optimization)

For reasoning tasks (like math), GRPO is often a better fit. It samples a group of completions for each prompt, scores each one with a reward function, and reinforces the completions that score above the group average.

*Note: GRPO requires a reward function or ground-truth verifier.*

```python
# Conceptual Example for GRPO
from lmfast.alignment import align

def reward_func(completions, **kwargs):
    # Return 1.0 if answer is correct, 0.0 otherwise
    return [1.0 if "42" in c else 0.0 for c in completions]

trainer = align(
    model_name="HuggingFaceTB/SmolLM-135M",
    dataset=math_dataset,  # placeholder: a dataset of math prompts, not defined in this notebook
    method="grpo",
    reward_function=reward_func
)
```
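
The "relative to the group average" step can be sketched in a few lines of plain Python. This is an illustration of the idea rather than LMFast's implementation: `group_relative_advantages` is a hypothetical name, and normalizing by the population standard deviation with a small `eps` is one common choice assumed here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the other completions sampled
    for the same prompt: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four completions for one prompt, scored by a 0/1 reward function
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions with a positive advantage are reinforced and those with a negative advantage are discouraged. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.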

## 🎉 Summary

You've learned how to:
- ✅ Structure preference datasets
- ✅ Run ORPO alignment with one line of code
- ✅ Understand the difference between ORPO and GRPO

### Tips
- **Data Quality**: Alignment is very sensitive to data quality. Make sure each 'chosen' response is genuinely better than its 'rejected' counterpart.
- **Beta Parameter**: Controls how strongly the preference term weighs against the SFT loss. 0.1 is a reasonable starting point.

### Next Steps
- `13_guardrails.ipynb`: Add hard constraints to your aligned model.