# RLHF for LLMs: DPO and GRPO on Qwen2.5-0.5B

This notebook demonstrates how to improve a small open LLM using **Reinforcement Learning from Human Feedback (RLHF)**.  
We focus on two efficient preference-optimization techniques:
- **DPO (Direct Preference Optimization)**
- **GRPO (Group Relative Policy Optimization)**

These methods align model responses with human-like preferences, building on top of the SFT model from the previous notebook.


## Objectives

- Load a small quantized LLM (Qwen2.5-0.5B)
- Prepare preference data with "chosen" and "rejected" answers
- Fine-tune with **Direct Preference Optimization (DPO)**
- Optionally test **GRPO** for reward-based learning
- Compare model generations before and after alignment


## Setup

Run the cells below if you are on Colab or a fresh environment; skip them if the dependencies are already installed.


In [None]:
# clone course repo (needed because we use its DPOTrainer)
!git clone https://github.com/BounharAbdelaziz/RLHF.git

In [None]:
!pip install -q -r RLHF/requirements.txt

In [None]:
!pip install -q transformers accelerate peft bitsandbytes datasets trl

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import DPOTrainer, DPOConfig
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)

# optional tracking
USE_WANDB = False
if USE_WANDB:
    import wandb
    os.environ["WANDB_PROJECT"] = "RLHF"
    wandb.login()

# dataset / model config
DATASET_PATH = "AIffl/french_orca_dpo_pairs"
LIMIT = 2_000
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
SEED = 1998

MAX_PROMPT_LEN = 1024
MAX_LENGTH = MAX_PROMPT_LEN + 512
RUN_NAME = f"DPO-french-orca-{MODEL_NAME.split('/')[-1]}"

## (Optional) Experiment tracking

You can log metrics to Weights & Biases (W&B) if you have an account.  
This is optional; the notebook runs without it.


## From SFT to RLHF

Supervised Fine-Tuning (SFT) teaches a model to imitate examples, but it doesn't ensure the responses are *preferred*.  
Reinforcement Learning from Human Feedback (RLHF) introduces a **preference dataset**, where each sample includes:
- a **prompt**
- a **chosen** (preferred) answer
- a **rejected** (less preferred) answer

The model learns to score the *chosen* higher than the *rejected*.  
We use the **DPO** method to do this efficiently without a reward model or full RL.
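
To make "score the chosen higher than the rejected" concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have the summed log-probabilities of each answer under the trainable policy and under the frozen reference model; the function name and the example numbers are illustrative, not taken from the notebook or from `trl`.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is 0 and the loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy raised the chosen answer and lowered the rejected one: the loss drops
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss only depends on how far the policy has moved away from the reference on each answer, which is why no separate reward model is needed; `beta` controls how strongly that movement is rewarded.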


## Load the base SFT model

We use the 4-bit quantized **Qwen/Qwen2.5-0.5B-Instruct** model so that DPO training stays feasible on a single GPU.  
The tokenizer will also be used to apply the chat template during data preparation.
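
As a rough illustration of why 4-bit loading helps, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count is a rounded assumption, and the estimate ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 0.5e9  # assumption: rounded parameter count for a "0.5B" model

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```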


In [None]:
# Quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # match torch_dtype and fp16=True used below
)

# Load the model to finetune
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
# to avoid warning
model.config.use_cache = False
# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

## Data Preparation

### 💬 Chat templates & converting data to `messages`

Modern instruction-tuned models (including **Qwen2.5-0.5B-Instruct**) expect inputs in a **chat format** and rely on a **tokenizer chat template** to turn structured messages into the exact token sequence the model was trained on. In practice, you should **not** hand-craft special tokens; instead, pass a list of `{role, content}` messages to the tokenizer and let `apply_chat_template(...)` do the right thing.

#### Why a chat template?
- Ensures your prompts match the **pretraining/finetuning format** (system/user/assistant turns, BOS/EOS, separators).
- Minimizes prompt drift across libraries and models.
- Makes it easy to add **system instructions** (e.g., "You are a helpful assistant that answers in French.").
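
To show what a chat template does with a list of messages, here is a simplified sketch of ChatML-style rendering, the format family Qwen2.5 uses. `render_chatml` is a hypothetical helper written for illustration only; the real template shipped with the tokenizer handles more cases (for example a default system prompt), so keep using `tokenizer.apply_chat_template(...)` in practice.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render {role, content} messages as a ChatML-style string (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

example = [
    {"role": "system", "content": "You are a helpful assistant that answers in French."},
    {"role": "user", "content": "Bonjour !"},
]
print(render_chatml(example))
```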

#### Message structure
Each example becomes an ordered list of chat turns:
```python
messages = [
  {"role": "system", "content": "Tu es un assistant utile. Réponds en français."},
  {"role": "user", "content": "Explique la différence entre LoRA et le fine-tuning complet."},
  {"role": "assistant", "content": "LoRA adapte un petit sous-espace de poids, alors que..."}
]
```


In [None]:
def preprocess_for_dpo(example):
    # build chat-like prompt
    messages = []
    if example.get("system") and example["system"].strip():
        messages.append({"role": "system", "content": example["system"]})
    messages.append({"role": "user", "content": example["question"]})

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    return {
        "prompt": prompt,
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

dataset = load_dataset(DATASET_PATH, split="train").shuffle(seed=SEED).select(range(LIMIT))
original_columns = dataset.column_names

dpo_dataset = dataset.map(
    preprocess_for_dpo,
    remove_columns=original_columns,
)

def filter_length(example):
    prompt_len = len(tokenizer(example["prompt"]).input_ids)
    chosen_len = len(tokenizer(example["chosen"]).input_ids)
    rejected_len = len(tokenizer(example["rejected"]).input_ids)
    return prompt_len + max(chosen_len, rejected_len) < MAX_LENGTH

dpo_dataset = dpo_dataset.filter(filter_length)
print(dpo_dataset[0])


## Model Training

Training will mirror `01_instruction_finetuning_qwen.ipynb`, but we'll switch from SFT to **off-policy DPO** using the `trl` library. Concretely, we'll instantiate a **policy model** (trainable) and a **reference model** (frozen) and optimize with the DPO objective so the policy prefers **chosen** over **rejected** responses for the same prompt.

### What we'll use
- **TRL**: `DPOConfig`, `DPOTrainer`
- **PEFT**: LoRA adapters on top of the base **Qwen2.5-0.5B-Instruct**
- **Quantization**: 4-bit (QLoRA-style) to fit on small GPUs
- **Logging**: optional W&B for metrics, configs, and artifacts (disabled in this notebook via `report_to="none"`)

### Expected dataset columns
- `prompt` (or `messages`): the shared context (system+user turns)
- `chosen`: assistant reply preferred by annotators
- `rejected`: less-preferred reply
> In this notebook the chat template is applied during preprocessing (`preprocess_for_dpo`), so `prompt` is already a formatted string, while `chosen` and `rejected` stay as plain assistant text.
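
Before training, it can help to sanity-check a few records against the columns listed above. The helper below is a small sketch for that; `check_preference_record` and the sample record are hypothetical and not part of `trl` or the course repo.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def check_preference_record(record: dict) -> list[str]:
    """Return a list of problems found in one preference record (empty if it looks fine)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("'chosen' and 'rejected' are identical")
    return problems

sample = {
    "prompt": "Quelle est la capitale de la France ?",
    "chosen": "La capitale de la France est Paris.",
    "rejected": "Je ne sais pas.",
}
print(check_preference_record(sample))
```

A pair with identical answers carries no preference signal, which is why the sketch flags it even though the columns are formally valid.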

### Minimal training flow
1. Load tokenizer with the **chat template** and enable 4-bit loading of the base model.
2. Wrap the model with **LoRA** (target attention/MLP modules).
3. Build a `datasets.Dataset` that yields `(prompt/messages, chosen, rejected)`.
4. Define `DPOConfig` (batch size, lr, epochs, `beta`, logging/saving/eval cadence).
5. Create `DPOTrainer(model=..., args=..., processing_class=tokenizer, train_dataset=...)`; with a LoRA-wrapped model, no separate `ref_model` has to be passed.
6. Call `trainer.train()`; optional `trainer.evaluate()` and `trainer.save_model()`.
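
To give a feel for step 2, the sketch below counts how many trainable parameters LoRA adds for a given rank. The layer shapes are assumptions about Qwen2.5-0.5B written down for illustration; check them against `model.config` and compare the total with `model.print_trainable_parameters()` before relying on them.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

# Assumed Qwen2.5-0.5B shapes (hypothetical until verified with model.config)
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 896, 4864, 128, 24
R = 32  # LoRA rank used in this notebook

shapes = {  # module name -> (d_in, d_out)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in shapes.values())
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {per_layer * NUM_LAYERS:,}")
```

The count grows linearly with the rank, so halving `r` halves the number of trainable parameters.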


In [None]:
# LoRA configuration - targeting the correct modules for Qwen2.5
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ], # Target all attention and MLP projection layers
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to the model
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# Training configuration
training_args = DPOConfig(
    beta=0.1,  # DPO temperature parameter
    learning_rate=5e-6,
    max_prompt_length=MAX_PROMPT_LEN,
    max_length=MAX_LENGTH,
    per_device_train_batch_size=1,  # Reduced for memory
    gradient_accumulation_steps=4,  # Increased to maintain effective batch size of 4 (1*4)
    num_train_epochs=1,
    max_grad_norm=1.0,
    logging_steps=1,
    save_steps=100,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",  # More memory efficient
    warmup_ratio=0.03, # 3% of the steps will be just a warmup
    save_strategy="steps",
    output_dir="./dpo_model",
    report_to="none",
    run_name=RUN_NAME,
    remove_unused_columns=False,
    dataloader_pin_memory=False,
    fp16=True,  # Enable mixed precision
)

# Initialize the trainer. The model is already wrapped with LoRA above, so we do not
# pass peft_config again; with no ref_model, TRL uses the model with its adapters
# disabled as the frozen reference.
trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dpo_dataset,
)

# Print a sample to verify preprocessing
print("Sample from dataset:")
print(f"Prompt: {dpo_dataset[0]['prompt']}")
print(f"Chosen: {dpo_dataset[0]['chosen']}")
print(f"Rejected: {dpo_dataset[0]['rejected']}")

# Train
trainer.train()

In [None]:
# merge LoRA adapters with the base model
save_path = "dpo_model/final_merged_dpo_model"

model = model.merge_and_unload()
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

## Model Testing


In [None]:
def generate(model, tokenizer, prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

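# build the test prompt with the chat template, as in preprocessing
test_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Donne-moi 3 conseils pour sécuriser une API."}],
    tokenize=False,
    add_generation_prompt=True,
)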
print(generate(model, tokenizer, test_prompt))


# Part II – GRPO

In this section we show how the same model can be optimized with a reinforcement-style method, **GRPO**.  
This is optional; DPO is often enough for small alignment tasks.
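
The core idea behind GRPO is that several completions are sampled for the same prompt and each one is scored relative to its own group, so no separate value model is needed. The sketch below shows that normalization step in isolation; the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration, and library implementations may differ in these details.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every completion got the same reward
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored with rewards on a 0-3 scale
group_rewards = [0.0, 1.0, 3.0, 0.0]
for reward, adv in zip(group_rewards, group_relative_advantages(group_rewards)):
    print(f"reward={reward:.1f} -> advantage={adv:+.2f}")
```

Completions that beat their group average get a positive advantage and are reinforced; if all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal.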


In [None]:
# Reuse the model and tokenizer loaded in the DPO section
grpo_model = model
grpo_tokenizer = tokenizer

# We will build a very small example dataset for GRPO right after


In [None]:
# Load a tiny math dataset for the GRPO demo
from datasets import load_dataset

grpo_ds = load_dataset("openai/gsm8k", 'main', split="train[:200]")

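def build_prompt(example):
    # Chat-template prompt that asks for the <answer>NUMBER</answer> format the rewards check
    content = ("Solve the following math problem step by step, then end with "
               f"<answer>NUMBER</answer> on the last line.\n{example['question']}")
    return grpo_tokenizer.apply_chat_template(
        [{"role": "user", "content": content}], tokenize=False, add_generation_prompt=True)

grpo_ds = grpo_ds.map(lambda ex: {"prompt": build_prompt(ex)})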
print(grpo_ds[0])

In [None]:
grpo_ds[0]

## Reward Function Design

We'll use **two simple rewards** during GRPO rollouts:

1. **Format reward**: checks that the **last non-empty line** is exactly in the form  
   `<answer>NUMBER</answer>`  
   - Score: **1** if correct format, **0** otherwise.

2. **Correctness reward**: checks whether the extracted number matches the gold answer.  
   - Score: **2** if correct, **0** otherwise.

Total reward per sample ∈ {0, 1, 2, 3}.
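
The scoring scheme above can be tried on plain strings before wiring it into a trainer. The sketch below is a simplified, self-contained variant of the two rewards (the function names and sample completions are made up for illustration); it checks the last non-empty line directly instead of using a single multi-line regex.

```python
import re

ANSWER_TAG = re.compile(r"<answer>(\d+)</answer>")

def format_reward(completion: str) -> float:
    """1.0 if the last non-empty line is exactly <answer>NUMBER</answer>, else 0.0."""
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return 1.0 if lines and ANSWER_TAG.fullmatch(lines[-1]) else 0.0

def correctness_reward(completion: str, gold: str) -> float:
    """2.0 if the first <answer> tag holds the gold number, else 0.0."""
    match = ANSWER_TAG.search(completion)
    return 2.0 if match and match.group(1) == gold else 0.0

def total_reward(completion: str, gold: str) -> float:
    return format_reward(completion) + correctness_reward(completion, gold)

samples = [
    "12 + 35 = 47\n<answer>47</answer>",   # right format, right number
    "<answer>47</answer> is the result",   # right number, wrong format
    "12 + 35 = 46\n<answer>46</answer>",   # right format, wrong number
    "The result is 47.",                   # no answer tag at all
]
for text in samples:
    print(total_reward(text, gold="47"), repr(text))
```

Keeping the two rewards separate gives the model partial credit for learning the output format even before its arithmetic is reliable.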



In [None]:
import re

def extract_xml_answer(text: str) -> str:
    match = re.search(r"<answer>(\d+)</answer>", text)
    if match:
        answer = match.group(1)
    else:
        answer = "" # Return empty string if not found
    return answer

def format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has the correct format."""
    pattern = r"^(?:[^\r\n]*\r?\n)+<answer>\d+</answer>\r?\n?$"
    responses = completions
    matches = [bool(re.match(pattern, r)) for r in responses]
    return [1.0 if match else 0.0 for match in matches]

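def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Reward function that checks if the answer is correct."""
    responses = completions
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # GSM8K gold answers end with "#### <number>"; keep only that number (without commas)
    golds = [a.split("####")[-1].strip().replace(",", "") for a in answer]
    return [2.0 if r == g else 0.0 for r, g in zip(extracted_responses, golds)]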

## Model Training

In [None]:
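# Optional GRPO skeleton (commented to keep notebook runnable)
# from trl import GRPOTrainer, GRPOConfig
#
# # GRPOConfig extends TrainingArguments, so a single config object is enough
# grpo_args = GRPOConfig(
#     output_dir="./grpo_model",
#     per_device_train_batch_size=4,  # keep the batch size a multiple of num_generations
#     gradient_accumulation_steps=1,
#     num_generations=4,              # completions sampled per prompt (the "group")
#     max_completion_length=256,
#     max_steps=200,
#     logging_steps=10,
#     report_to="none",
# )
#
# grpo_trainer = GRPOTrainer(
#     model=grpo_model,
#     args=grpo_args,
#     processing_class=grpo_tokenizer,
#     reward_funcs=[format_reward_func, correctness_reward_func],
#     train_dataset=grpo_ds,
#     peft_config=peft_config,  # the merged model needs fresh LoRA adapters to train
# )
# grpo_trainer.train()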

## Evaluation

Quick qualitative check on a few prompts (same model as above).


In [None]:
def generate(model, tokenizer, prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

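test_questions = [
    "Solve 12 + 35. Show the result only.",
    "Explique le principe du RLHF en 5 phrases.",
]

for q in test_questions:
    # format each question with the chat template instead of hand-written role tags
    p = tokenizer.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    print("=" * 60)
    print("PROMPT:", q)
    print(generate(model, tokenizer, p))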

In [None]:
# Optional: interactive demo (commented out to keep notebook runnable)
# !pip install gradio -q
# import gradio as gr
#
# def dpo_chat(prompt):
#     text = tokenizer.apply_chat_template(
#         [{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True
#     )
#     inputs = tokenizer(text, return_tensors="pt").to(model.device)
#     with torch.no_grad():
#         outputs = model.generate(**inputs, max_new_tokens=200)
#     return tokenizer.decode(outputs[0], skip_special_tokens=True)
#
# demo = gr.Interface(
#     fn=dpo_chat,
#     inputs=gr.Textbox(label="Your prompt"),
#     outputs=gr.Textbox(label="Model answer"),
#     title="Qwen2.5-0.5B – aligned demo",
# )
# demo.launch()

## Summary

- we loaded a 4-bit Qwen2.5-0.5B-Instruct model
- we prepared a French DPO dataset (prompt, chosen, rejected)
- we ran DPO with LoRA adapters and no external logger
- we optionally showed how to structure a GRPO experiment

This notebook is meant to be lightweight and easy to run on Colab / single GPU.

## Reproducibility vs. demo quality

The goal of this notebook is to provide a **fully runnable** DPO/GRPO example on top of a quantized Qwen2.5-0.5B model.
The original lab also included a Gradio-based chat UI, used slightly different prompting, and ran with more GPU memory,
which produced cleaner generations.

In lightweight environments (Colab, 4-bit, small context) you may observe:
- verbose or generic answers,
- occasional language mixing,
- sensitivity to the prompt template.

For reference, the README includes screenshots of the Gradio demo in the "good" environment.
