# 09. LoRA for Linear Regression and GRPO on Verifiable Sorting

This tutorial extends the previous notebooks by combining **Low-Rank Adapters (LoRA)** with a lightweight reinforcement learning pipeline on verifiable tasks. We begin with a concrete linear regression example that highlights the memory advantages of LoRA on a single layer. We then move to a reinforcement learning with verifiable rewards (RLVR) setting where we adapt a Qwen 2.5 7B model with **Group Relative Policy Optimization (GRPO)** to sort lists of numbers using the [PEFT](https://huggingface.co/docs/peft/index) library while explicitly modelling `<think>` reasoning tokens and structured answers.

For background, I recommend watching the [Stanford CME295 lecture on LoRA](https://youtu.be/VlA_jt_3Qc4?t=5858).


## Roadmap

1. Refresh the intuition for LoRA and quantify how low-rank adapters reduce the memory footprint of a single linear layer.
2. Implement the adapter for a synthetic multi-target linear regression problem and compare full fine-tuning vs. LoRA.
3. Build a verifiable sorting reward, warm-start the policy with a small cold-start dataset of `<think>` exemplars, run a short supervised fine-tuning (SFT) stage, and drive a GRPO loop with PEFT to adapt Qwen 2.5 7B.


>
> 💡 **Dependencies**
>
> If you are running in a clean environment, you may need to install a few extra packages such as `transformers`, `datasets`, `peft`, `trl`, and `accelerate`.


In [3]:
%%capture
%pip install -q torch transformers datasets accelerate peft trl evaluate openai pandas plotly rich nest_asyncio tqdm

## 1. Revisiting LoRA on a Single Linear Layer

LoRA decomposes the weight update of a frozen matrix $W$ into a product $BA$ where $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{m \times r}$. Instead of storing gradients and optimizer states for the full $m \times d$ matrix, we only update the low-rank factors. The effective weight during adaptation is

$$W_{\text{eff}} = W + \frac{\alpha}{r} BA,$$

where $\alpha$ rescales the update. For wide layers (large $m$ and $d$) and small rank $r$, this reduces the number of trainable parameters and the accompanying optimizer state by orders of magnitude.
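To make the savings concrete, the sketch below counts trainable parameters for a full update versus the two LoRA factors. The helper name `lora_param_counts` and the 4096 × 4096 layer are illustrative assumptions; the 512 → 256 layer with rank 8 is the one used in the regression example of this notebook.

```python
def lora_param_counts(out_features: int, in_features: int, rank: int) -> tuple:
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    full = out_features * in_features                 # every entry of the m x d matrix W
    lora = rank * in_features + out_features * rank   # A is r x d, B is m x r
    return full, lora

# Layer used in the regression example of this notebook.
small_full, small_lora = lora_param_counts(out_features=256, in_features=512, rank=8)
# Hypothetical wide layer, to show how the gap grows with m and d.
wide_full, wide_lora = lora_param_counts(out_features=4096, in_features=4096, rank=8)

print(small_full, small_lora, small_full / small_lora)
print(wide_full, wide_lora, wide_full / wide_lora)
```

For the small layer the reduction is roughly 21×, and for the hypothetical wide layer it is 256×. Optimizer buffers that are kept per trainable parameter (such as Adam's two moment estimates) shrink by the same factor.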


### 1.1 Synthetic regression setup

We construct a multi-target linear regression task with a 512 → 256 linear layer. The ground-truth weight matrix is the sum of a frozen base matrix and a low-rank update, mirroring the scenario where LoRA is expected to shine.


In [1]:
import math
import random
from dataclasses import dataclass
from contextlib import nullcontext
from typing import Dict, Iterable, List, Sequence, Tuple

import pandas as pd
import plotly.graph_objects as go
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [5]:
in_features = 512
out_features = 256
lora_rank = 8
num_samples = 4096
batch_size = 256
noise_std = 0.05

# Construct a frozen base weight and a low-rank update that represents the target task.
base_weight = torch.randn(out_features, in_features).to(DEVICE)
adapter_A_true = torch.randn(lora_rank, in_features).to(DEVICE)
adapter_B_true = torch.randn(out_features, lora_rank).to(DEVICE)
delta_weight = adapter_B_true @ adapter_A_true
target_weight = base_weight + delta_weight

features = torch.randn(num_samples, in_features).to(DEVICE)
targets = features @ target_weight.T + noise_std * torch.randn(num_samples, out_features).to(DEVICE)
targets = targets.to(DEVICE)

dataset = TensorDataset(features, targets)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

### 1.2 Implementing a LoRA-augmented linear layer

The class below mirrors the adapter structure used in larger language models. Only the low-rank matrices `A` and `B` are trainable; the base weight stays frozen.


In [3]:
class LoRALinear(nn.Module):
    def __init__(self, base_weight: torch.Tensor, rank: int, alpha: float = 1.0, bias: bool = False):
        super().__init__()
        out_features, in_features = base_weight.shape
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.alpha = alpha
        # Frozen base weight
        self.weight = nn.Parameter(base_weight.clone())
        self.weight.requires_grad = False
        # Trainable low-rank factors
        self.A = nn.Parameter(torch.zeros(rank, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        nn.init.zeros_(self.B)
        self.scaling = alpha / max(rank, 1)
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter('bias', None)

    def effective_weight(self) -> torch.Tensor:
        return self.weight + (self.B @ self.A) * self.scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.effective_weight(), self.bias)

Let's see the training code now. 

In [4]:
def count_trainable_parameters(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

def train_linear_module(module: nn.Module, loader: DataLoader, steps: int, lr: float) -> List[float]:
    module.to(DEVICE)
    module.train()
    optimizer = torch.optim.Adam([p for p in module.parameters() if p.requires_grad], lr=lr)
    history: List[float] = []
    iterator = iter(loader)
    for step in range(steps):
        try:
            batch = next(iterator)
        except StopIteration:
            iterator = iter(loader)
            batch = next(iterator)
        x, y = (tensor.to(DEVICE) for tensor in batch)
        optimizer.zero_grad()
        preds = module(x)
        loss = F.mse_loss(preds, y)
        loss.backward()
        optimizer.step()
        history.append(loss.item())
    return history

def evaluate_mse(module: nn.Module, features: torch.Tensor, targets: torch.Tensor) -> float:
    module.eval()
    with torch.no_grad():
        preds = module(features.to(DEVICE))
        loss = F.mse_loss(preds, targets.to(DEVICE))  # keep targets on the same device as preds
    return float(loss)

def relative_weight_error(module: nn.Module, target: torch.Tensor) -> float:
    if isinstance(module, LoRALinear):
        weight = module.effective_weight().detach()
    else:
        weight = module.weight.detach()
    return float(torch.norm(weight - target) / torch.norm(target))

We now compare a dense linear layer, initialised from the same base weight, with the `LoRALinear` layer written above.

In [5]:
full_linear = nn.Linear(in_features, out_features, bias=False)
full_linear.weight.data.copy_(base_weight.clone())

lora_linear = LoRALinear(base_weight=base_weight, rank=lora_rank, alpha=lora_rank)

full_history = train_linear_module(full_linear, loader, steps=2000, lr=1e-3)
lora_history = train_linear_module(lora_linear, loader, steps=2000, lr=5e-3)

full_mse = evaluate_mse(full_linear, features, targets)
lora_mse = evaluate_mse(lora_linear, features, targets)

full_error = relative_weight_error(full_linear, target_weight)
lora_error = relative_weight_error(lora_linear, target_weight)

In [6]:
fig = go.Figure()
fig.add_trace(go.Scatter(y=full_history, name="Full fine-tuning"))
fig.add_trace(go.Scatter(y=lora_history, name="LoRA (rank=8)"))
fig.update_layout(title="Training loss comparison", xaxis_title="Step", yaxis_title="MSE loss")
fig.show()

def format_mb(params: int, dtype=torch.float32) -> float:
    bytes_per_param = torch.finfo(dtype).bits // 8
    return params * bytes_per_param / (1024 ** 2)

results_table = pd.DataFrame([
    {
        "Model": "Full fine-tuning",
        "Trainable params": count_trainable_parameters(full_linear),
        "Approx optimizer state (MB)": format_mb(count_trainable_parameters(full_linear) * 2),
        "Final MSE": full_mse,
        "Relative weight error": full_error,
    },
    {
        "Model": "LoRA (rank=8)",
        "Trainable params": count_trainable_parameters(lora_linear),
        "Approx optimizer state (MB)": format_mb(count_trainable_parameters(lora_linear) * 2),
        "Final MSE": lora_mse,
        "Relative weight error": lora_error,
    },
])
results_table

| Model | Trainable params | Approx optimizer state (MB) | Final MSE | Relative weight error |
|---|---|---|---|---|
| Full fine-tuning | 131072 | 1.0 | 1635.135498 | 0.651816 |
| LoRA (rank=8) | 6144 | 0.046875 | 0.00248 | 6.3e-05 |


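In this run LoRA ends up far ahead of full fine-tuning: after 2000 steps at a learning rate of 1e-3, full fine-tuning is still far from the target (MSE ≈ 1635, relative weight error ≈ 0.65), whereas LoRA recovers the target weights almost exactly (MSE ≈ 0.0025, which is the noise floor `noise_std**2`). The setup favours LoRA by construction, because the true update has exactly rank 8. LoRA achieves this while updating only 6,144 parameters instead of 131,072, and the optimizer state memory shrinks proportionally, which is critical when the base layer contains millions of parameters.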


## 2. RLVR with GRPO on a Sorting Task

We now move from a single layer to a causal language model. The goal is to sort a list of numbers, a domain where the reward can be **verified automatically**. We will:

* Load a small cold-start dataset of prompts and structured `<think>` answers.
* Run a lightweight supervised warm-start so the model learns to emit the reasoning and output tags.
* Apply a LoRA adapter to Qwen 2.5 7B with the [PEFT](https://huggingface.co/docs/peft/index) library.
* Implement a GRPO-style policy gradient loop that samples multiple completions per prompt and shapes the reward with verifiable checks.

RLVR is typically applied to math and coding tasks where the result can be checked against a ground truth, for example with proof verifiers or test suites. We use Qwen 2.5 because Qwen3 is already trained with `<think>` and `</think>` tokens and Qwen1 is not optimised for inference (no GQA).
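As a minimal illustration of what "verifiable" means for this task, the sketch below checks a candidate answer without any learned judge. The function name `is_verified_sort` is a hypothetical helper, not part of the notebook's reward code.

```python
from collections import Counter

def is_verified_sort(original, candidate):
    """True only if candidate is a sorted permutation of original."""
    same_items = Counter(original) == Counter(candidate)               # nothing added or dropped
    in_order = all(a <= b for a, b in zip(candidate, candidate[1:]))   # non-decreasing order
    return same_items and in_order

print(is_verified_sort([3, 1, 2], [1, 2, 3]))
print(is_verified_sort([3, 1, 2], [1, 3, 2]))
print(is_verified_sort([3, 1, 2], [1, 2]))
```

A binary check like this is the simplest possible reward. The shaped reward used in this notebook gives partial credit instead, so the policy receives a signal before it can sort perfectly.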

For a more in-depth discussion, please watch [Stanford CS336 - RL Lecture](https://www.youtube.com/watch?v=JdGFdViaOJk&list=PLoROMvodv4rOY23Y0BoGoBGgQ1zmU_MT_&index=17).


### 2.1 Cold-start data and prompt construction

The helper below generates a synthetic dataset of random number lists with model-written rationales, caches it as a JSONL file (also easy to host on the Hugging Face Hub), and reloads that file on later runs, so that the policy has a warm start before RL. Every prompt enforces the same structure: think inside `<think>...</think>`, then give the sorted list and nothing else. We use the `Claude Sonnet 4.5` model (called through OpenRouter) to generate the thinking tokens, as it provides precise reasoning while maintaining brevity.
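Before training on generated data it can help to confirm that each response really has the expected shape. The check below is a small sketch under that assumption; `RESPONSE_SHAPE` and `has_expected_shape` are hypothetical names introduced here for illustration.

```python
import re

# One <think>...</think> block followed only by a bracketed list.
RESPONSE_SHAPE = re.compile(r"<think>[\s\S]*?</think>\s*\[[^\[\]]*\]")

def has_expected_shape(response: str) -> bool:
    """Check that a response is '<think>rationale</think>[a, b, c]' and nothing else."""
    return RESPONSE_SHAPE.fullmatch(response.strip()) is not None

print(has_expected_shape("<think>compare neighbours</think>[-1.5, 0.2, 3.0]"))
print(has_expected_shape("[-1.5, 0.2, 3.0]"))
print(has_expected_shape("<think>ok</think> The answer is [1.0]"))
```

This only validates the format. Whether the list is actually sorted is a separate question, handled by the reward function.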


In [2]:
from datasets import Dataset, load_dataset
from typing import Any, Dict, List, Optional
from rich.progress import (
    Progress,
    SpinnerColumn,
    BarColumn,
    TextColumn,
    TimeElapsedColumn,
)
import statistics
import random
import asyncio
import json
import re
import os

import nest_asyncio
nest_asyncio.apply()

STRUCTURED_INSTRUCTIONS = (
    "First think between <think> and </think> tags and then provide a response as a sorted list and nothing else. No tools."
)

def render_numbers(numbers):
    return ', '.join(str(n) for n in numbers)

def build_prompt(numbers):
    return f"Sort the numbers [{render_numbers(numbers)}]. {STRUCTURED_INSTRUCTIONS}"

async def build_response(numbers, client):
    sorted_numbers = sorted(numbers)
    response = await client.chat.completions.create(
        model="anthropic/claude-sonnet-4.5",
        messages=[
            {"role": "user", "content": f"Briefly think step by step and sort this list by hand: {numbers}."}
        ],
    )
    return f"<think>{response.choices[0].message.content}</think>[{render_numbers(sorted_numbers)}]"

async def _gen_one(i: int, seed: int, client) -> Dict[str, Any]:
    rng = random.Random(seed + i)
    length = rng.randint(10, 50)
    numbers = [rng.uniform(-20, 30) for _ in range(length)]

    prompt = build_prompt(numbers)
    response = await build_response(numbers, client)

    return {
        "prompt": prompt,
        "response": response,
        "numbers": numbers,
        "rationale": response.split("</think>")[0].replace("<think>", "").strip(),
    }


async def generate_synthetic_sorting_async(num_examples: int = 128, client=None, seed: int = 0, concurrency: int = 50) -> Dataset:
    sem = asyncio.Semaphore(concurrency)
    results: List[Optional[Dict[str, Any]]] = [None] * num_examples

    async def guarded(i: int):
        async with sem:
            out = await _gen_one(i, seed, client)
            results[i] = out

    tasks = [asyncio.create_task(guarded(i)) for i in range(num_examples)]

    with Progress(SpinnerColumn(), TextColumn("[bold]Generating synthetic sorting examples[/]"), BarColumn(), TextColumn("{task.completed}/{task.total}"), TimeElapsedColumn()) as progress:
        task_id = progress.add_task("gen", total=num_examples)
        for fut in asyncio.as_completed(tasks):
            await fut
            progress.update(task_id, advance=1)

    return Dataset.from_list([r for r in results if r is not None])

def generate_synthetic_sorting(num_examples: int = 128, client=None, seed: int = 0, concurrency: int = 50) -> Dataset:
    return asyncio.run(generate_synthetic_sorting_async(num_examples=num_examples, client=client, seed=seed, concurrency=concurrency))

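if not os.path.exists('data/sorting_synthetic_training.jsonl'):
    from openai import AsyncOpenAI
    # Read the OpenRouter key from the environment instead of hard-coding a secret.
    # The variable name OPENROUTER_API_KEY is our choice; export it before running.
    client = AsyncOpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.environ["OPENROUTER_API_KEY"])
    os.makedirs('data', exist_ok=True)
    # NOTE: both splits use seed=42 and _gen_one seeds each example with seed + i,
    # so the 10 test lists equal the first 10 training lists. Pass a different
    # seed for the test split if you need a held-out set.
    test_ds = generate_synthetic_sorting(10, client, seed=42)
    test_ds.to_json('data/sorting_synthetic_test.jsonl')
    training_ds = generate_synthetic_sorting(200, client, seed=42)
    training_ds.to_json('data/sorting_synthetic_training.jsonl')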
else:
    training_ds = load_dataset('json', data_files='data/sorting_synthetic_training.jsonl', split='train')
    test_ds = load_dataset('json', data_files='data/sorting_synthetic_test.jsonl', split='train')
    
print(json.dumps(training_ds[0], indent=2))

{
  "prompt": "Sort the numbers [-14.433446591715981, 17.077524987991644, -7.7554073098261895, -13.023103573742805, -14.875241191424625, 17.03338723338379, 7.268326687417488, 9.524625622451982, -18.41086602591082, -15.315238006920378, -8.366955330463021, 10.100936452499017, 8.06225314693065, 15.800980646120173, 15.066248679511794, 0.9759910480829355, 2.460452314192679, -6.090464588466864, 23.4650160396467, 17.940368356488364, -12.017034181155495, 1.1307199076751289, -6.1064329164179085, -9.23431189462056, 18.174706450326198, -14.889486174007565, -1.003634968133131, -2.0510309757685796, -2.8022138760514466, -6.773956638899346, -17.82747785384514, 2.971243983592128, -13.758691857339532, 26.114768601407995, -16.05999009607709, -5.341085929645356, 11.431989974722168, 24.272587396840343, -1.9182486916511756, -10.385570487167005, -16.522242558812955, 13.063165928385509, 18.653417039434594, 29.261076033037888, 22.765886050757345, 23.324183337763486, -0.9936887480416914, 2.6705151184217755, 21

### 2.2 Supervised warm-start with structured reasoning tokens

Before optimising with RL we align the policy to the desired format by running a brief supervised fine-tuning pass on the cold-start data. We mask the prompt tokens so that only the completion (the `<think>` rationale plus the answer) contributes to the loss, so that the model reliably emits the control tokens.

* `AutoTokenizer.from_pretrained`: loads the pre-trained tokenizer for `Qwen/Qwen2.5-7B-Instruct` from the Hugging Face Hub. It knows how to break text into tokens that the model understands.
* `tokenizer.add_special_tokens`: registers custom tokens such as `<think>` and `<output>`, which can help the model learn the structured reasoning format we are aiming for in the sorting task. Note that the code cell below does not call it.
* `tokenizer.pad_token` and `tokenizer.padding_side`: batched inputs must have the same length, so shorter sequences are filled with a padding token (the code falls back to the EOS token when no padding token is defined). Setting `padding_side='left'` adds the padding at the beginning of sequences.
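The effect of the padding side is easy to see on toy token ids. The sketch below is a plain-Python illustration, not the tokenizer's implementation; `pad_batch` and the ids are made up for the example.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to a common length and build matching attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        zeros = [0] * len(filler)          # padded positions are masked out
        ones = [1] * len(seq)              # real tokens are attended to
        if side == "left":
            input_ids.append(filler + list(seq))
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(list(seq) + filler)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask

left_ids, left_mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
right_ids, right_mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="right")
print(left_ids, left_mask)
print(right_ids, right_mask)
```

With left padding every sequence ends on a real token, so a decoder-only model that generates in a batch continues directly from the prompt rather than from padding.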

In [3]:
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = 'Qwen/Qwen2.5-7B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
policy_model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code=True).to(DEVICE)

def test_model(model, prompt=test_ds[0]['prompt']):
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    outputs = model.generate(**model_inputs, max_new_tokens=8192)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
    return decoded

base_response = test_model(policy_model)
print(base_response)


<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Sort the numbers [-14.433446591715981, 17.077524987991644, -7.7554073098261895, -13.023103573742805, -14.875241191424625, 17.03338723338379, 7.268326687417488, 9.524625622451982, -18.41086602591082, -15.315238006920378, -8.366955330463021, 10.100936452499017, 8.06225314693065, 15.800980646120173, 15.066248679511794, 0.9759910480829355, 2.460452314192679, -6.090464588466864, 23.4650160396467, 17.940368356488364, -12.017034181155495, 1.1307199076751289, -6.1064329164179085, -9.23431189462056, 18.174706450326198, -14.889486174007565, -1.003634968133131, -2.0510309757685796, -2.8022138760514466, -6.773956638899346, -17.82747785384514, 2.971243983592128, -13.758691857339532, 26.114768601407995, -16.05999009607709, -5.341085929645356, 11.431989974722168, 24.272587396840343, -1.9182486916511756, -10.385570487167005, -16.522242558812955, 13.063165928385509, 18.653417039434594, 29.

Clearly, the response is incorrect, does not use the thinking tokens correctly, and does not follow the required structure.

Hence, we perform some simple SFT.
* `prepare_sft_example`: takes each example from our dataset (prompt and response), concatenates them, and tokenizes the result. It also creates the labels for supervised training. The key part is setting `labels[idx] = -100` for the prompt tokens, so the loss is computed only on the response and not on the prompt. This is standard for causal language model fine-tuning.
* `training_ds.map`: applies `prepare_sft_example` to every entry in `training_ds` (a `datasets.Dataset`), tokenizing the entire dataset.
* `DataCollatorForSeq2Seq`: batches a list of tokenized examples, padding the inputs to a common length and padding the labels with `-100` so that padded positions are ignored by the loss.
* `TrainingArguments`: defines the hyperparameters and configuration of the run, such as batch size, learning rate, number of training steps, logging frequency, and whether to use bf16 mixed precision for faster training on GPUs.
* `Trainer`: the main class from `transformers` that orchestrates training. You pass it the model, the training arguments, the prepared dataset, and the data collator.
* `sft_trainer.train()`: kicks off supervised fine-tuning, where `policy_model` learns from `sft_dataset` according to `sft_args`.
* `policy_model.eval()`: after training (or after loading an existing checkpoint), the model is set to evaluation mode, which turns off things like dropout for consistent predictions.
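The prompt masking described in the first bullet can be shown on a toy sequence. This is a sketch with made-up token ids; `mask_prompt_labels` is a hypothetical helper that mirrors the idea, not the notebook's `prepare_sft_example`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels and hide the first prompt_len positions from the loss."""
    labels = list(input_ids)
    for idx in range(min(prompt_len, len(labels))):
        labels[idx] = IGNORE_INDEX
    return labels

toy_ids = [11, 12, 13, 21, 22]   # made-up ids: 3 prompt tokens, then 2 response tokens
labels = mask_prompt_labels(toy_ids, prompt_len=3)
print(labels)
```

Only the last two positions keep real labels, so gradients come from the response tokens alone.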

In [11]:
from transformers import DataCollatorForSeq2Seq
from os import path
import gc

SFT_MAX_LENGTH = 8192

def prepare_sft_example(example):
    prompt = example["prompt"].strip()
    response = example["response"].strip()
    text = f"{prompt}{response}"
    tokenized = tokenizer(text, truncation=True, max_length=SFT_MAX_LENGTH)
    prompt_ids = tokenizer(prompt, add_special_tokens=False, truncation=True, max_length=SFT_MAX_LENGTH)["input_ids"]
    labels = list(tokenized["input_ids"])  # independent copy so input_ids stay untouched
    prompt_len = min(len(prompt_ids), len(labels))
    for idx in range(prompt_len):
        labels[idx] = -100
    tokenized["labels"] = labels
    return tokenized

sft_dataset = training_ds.map(
    prepare_sft_example,
    remove_columns=training_ds.column_names,
)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

sft_args = TrainingArguments(
    output_dir="checkpoints/qwen2_sorting_sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    max_steps=50,
    logging_steps=10,
    save_only_model=True,
    bf16=torch.cuda.is_bf16_supported(),
    gradient_checkpointing=True,
    report_to="none",
)

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=True,
    label_pad_token_id=-100,
    pad_to_multiple_of=8 if (sft_args.fp16 or sft_args.bf16) else None,  # the run uses bf16, not fp16
)

sft_trainer = Trainer(
    model=policy_model,
    args=sft_args,
    train_dataset=sft_dataset,
    data_collator=data_collator,
)

if path.exists("checkpoints/qwen2_sorting_sft/checkpoint-50/config.json"):
    print("Loading from existing checkpoint...")
    policy_model = AutoModelForCausalLM.from_pretrained("checkpoints/qwen2_sorting_sft/checkpoint-50/").to(DEVICE)
else:
    sft_trainer.train()

policy_model.eval()

torch.cuda.empty_cache(); gc.collect()

response = test_model(policy_model)
print(response)

Loading from existing checkpoint...



<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Sort the numbers [-14.433446591715981, 17.077524987991644, -7.7554073098261895, -13.023103573742805, -14.875241191424625, 17.03338723338379, 7.268326687417488, 9.524625622451982, -18.41086602591082, -15.315238006920378, -8.366955330463021, 10.100936452499017, 8.06225314693065, 15.800980646120173, 15.066248679511794, 0.9759910480829355, 2.460452314192679, -6.090464588466864, 23.4650160396467, 17.940368356488364, -12.017034181155495, 1.1307199076751289, -6.1064329164179085, -9.23431189462056, 18.174706450326198, -14.889486174007565, -1.003634968133131, -2.0510309757685796, -2.8022138760514466, -6.773956638899346, -17.82747785384514, 2.971243983592128, -13.758691857339532, 26.114768601407995, -16.05999009607709, -5.341085929645356, 11.431989974722168, 24.272587396840343, -1.9182486916511756, -10.385570487167005, -16.522242558812955, 13.063165928385509, 18.653417039434594, 29.

### 2.3 Reward shaping with verifiable checks

Sorting is verifiable because we can deterministically extract the numbers from the prompt and the model output. The reward below blends several signals:

* **Exact match** — full credit when the completion equals the sorted list.
* **Monotonicity** — partial credit if the answer is sorted but numbers differ.
* **Prefix accuracy** — rewards early correct numbers to stabilise learning.
* **Coverage** — encourages the model to reuse the original numbers.
* **Format compliance** — bonus for emitting the `<think>` block.
* **Length penalty** — discourages hallucinating or dropping numbers.

The function also returns diagnostic components so we can reason about learning progress.
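To see how the partial-credit signals combine, here is a small worked example on a hand-written prediction. It is a simplified sketch: it uses the same weights as the notebook's `sorting_reward` for the four list-based components, but leaves out the format bonus and the length penalty, and `reward_components` is a hypothetical helper.

```python
from collections import Counter

WEIGHTS = {"exact": 0.55, "monotonic": 0.2, "prefix": 0.1, "coverage": 0.1}

def reward_components(target, predicted):
    """Compute the list-based reward signals for one prediction."""
    target_sorted = sorted(target)
    n = max(len(target_sorted), 1)
    # Multiset intersection: how many target numbers reappear in the prediction.
    shared = sum((Counter(target_sorted) & Counter(predicted)).values())
    # Count leading positions that already match the sorted target.
    prefix_hits = 0
    for t, p in zip(target_sorted, predicted):
        if t != p:
            break
        prefix_hits += 1
    return {
        "exact": 1.0 if predicted == target_sorted else 0.0,
        "monotonic": 1.0 if predicted == sorted(predicted) else 0.0,
        "prefix": prefix_hits / n,
        "coverage": shared / n,
    }

# The prediction is in order but replaces 3.0 with a second 4.0.
parts = reward_components(target=[3.0, 1.0, 2.0, 4.0], predicted=[1.0, 2.0, 4.0, 4.0])
score = sum(WEIGHTS[name] * value for name, value in parts.items())
print(parts, round(score, 3))
```

The prediction earns the full monotonicity credit, half of the prefix credit, and three quarters of the coverage credit, but no exact-match credit, which keeps the total well below that of a correct answer.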


In [16]:
from collections import Counter

THINK_PATTERN = re.compile(r"<think>([\s\S]*?)</think>", re.IGNORECASE)
OUTPUT_PATTERN = re.compile(r"</think>([\s\S]*?)(?:<\|im_end\|>|$)", re.IGNORECASE)

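def extract_numbers(text: str) -> List[float]:
    # Optional sign, then either a decimal ("-3.5", ".5") or a plain integer ("42").
    # Grouping the alternatives keeps the sign for integers as well as decimals.
    numbers = re.findall(r'[-+]?(?:\d*\.\d+|\d+)', text)
    return [float(i) for i in numbers]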

def sorting_reward(prompt: str, completion: str, test: bool = False) -> Tuple[float, Dict[str, float]]:
    target_numbers = extract_numbers(prompt)
    target_sorted = sorted(target_numbers)
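    # `completion` still contains the prompt, whose instructions mention "</think>",
    # so match 0 comes from the prompt and match 1 is the model's answer.
    output_matches = OUTPUT_PATTERN.findall(completion)
    output_section = output_matches[1].strip() if len(output_matches) > 1 else ""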
    if not output_section:
        return -1.0, {"exact": 0.0, "monotonic": 0.0, "prefix": 0.0, "coverage": 0.0, "format": 0.0}

    predicted_numbers = extract_numbers(output_section)
    if not predicted_numbers:
        return -1.0, {"exact": 0.0, "monotonic": 0.0, "prefix": 0.0, "coverage": 0.0, "format": 0.0}

    # Same indexing as for the output: match 0 is the "<think> and </think>" text in the prompt.
    think_matches = THINK_PATTERN.findall(completion)
    think_section = think_matches[1].strip() if len(think_matches) > 1 else ""
    format_score = 1.0 if think_section and '[' in output_section and ']' in output_section else 0.0

    length_penalty = -0.05 * abs(len(predicted_numbers) - len(target_sorted))

    target_counter = Counter(target_sorted)
    predicted_counter = Counter(predicted_numbers)
    coverage = sum((target_counter & predicted_counter).values()) / max(len(target_sorted), 1)

    monotonic = 1.0 if predicted_numbers == sorted(predicted_numbers) else 0.0

    prefix = 0.0
    for t, p in zip(target_sorted, predicted_numbers):
        if t == p:
            prefix += 1
        else:
            break
    prefix = prefix / max(len(target_sorted), 1)

    exact = 1.0 if predicted_numbers == target_sorted else 0.0

    reward = (
        0.55 * exact
        + 0.2 * monotonic
        + 0.1 * prefix
        + 0.1 * coverage
        + 0.05 * format_score
        + (length_penalty if not test else 0.0)
    )
    reward = float(max(-1.0, min(reward, 1.0)))
    return reward, {
        "exact": exact,
        "monotonic": monotonic,
        "prefix": prefix,
        "coverage": coverage,
        "format": format_score,
    }

example_prompt = test_ds[0]['prompt']
base_reward = sorting_reward(example_prompt, base_response, test=True)
sft_reward = sorting_reward(example_prompt, response, test=True)

print(f"Base model reward: {base_reward}")
print(f"SFT model reward: {sft_reward}")

Base model reward: (0.338, {'exact': 0.0, 'monotonic': 1.0, 'prefix': 0.0, 'coverage': 0.88, 'format': 1.0})
SFT model reward: (0.34, {'exact': 0.0, 'monotonic': 1.0, 'prefix': 0.0, 'coverage': 0.9, 'format': 1.0})


Let's see what the average reward is for the pre-SFT and post-SFT models on the test dataset.

In [17]:
from tqdm import tqdm 

def evaluate_model_on_dataset(model, dataset: Dataset) -> float:
    rewards = []
    for example in tqdm(dataset, desc="Evaluating model", total=len(dataset)):
        prompt = example['prompt']
        response = test_model(model, prompt=prompt)
        reward, _ = sorting_reward(prompt, response, test=True)
        rewards.append(reward)
    avg_reward = statistics.mean(rewards)
    return avg_reward


reference_model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code=True)
reference_model.to(DEVICE)
reference_model.eval()
for param in reference_model.parameters():
    param.requires_grad = False

base_eval_results = evaluate_model_on_dataset(reference_model, test_ds.select(range(10)))
print("Base model average reward:", base_eval_results)

sft_eval_results = evaluate_model_on_dataset(policy_model, test_ds.select(range(10)))
print("SFT model average reward:", sft_eval_results)


Evaluating model: 100%|██████████| 10/10 [01:47<00:00, 10.75s/it]


Base model average reward: 0.46372391921538264


Evaluating model: 100%|██████████| 10/10 [03:06<00:00, 18.65s/it]

SFT model average reward: 0.5281940379403794





### 2.4 Qwen 2.5 7B with PEFT LoRA

We attach a LoRA adapter to Qwen 2.5 7B so that only low-rank updates to the attention and MLP projections are trained while the base weights stay frozen. The adapter has to model the `<think>` reasoning format, and the frozen reference model provides the KL anchor in the RL loss.

The `LoraConfig` (low-rank adaptation configuration) sets up how `policy_model` is fine-tuned efficiently. LoRA works by injecting small, trainable matrices into the model's layers instead of fine-tuning all of the original parameters, which significantly reduces the number of parameters that need to be updated.

* `r=64`: the rank of the update matrices. A higher `r` makes the LoRA layers more expressive (closer to full fine-tuning) but also adds trainable parameters.
* `lora_alpha=128`: a scaling factor for the LoRA updates. It is typically set to `2r` (as here) or to `r` itself. A larger `lora_alpha` gives more weight to the LoRA-adapted features.
* `target_modules`: specifies which layers of the base model (Qwen 2.5 7B in our case) receive LoRA adapters. Here these are the attention projections (query, key, value, output) and the feed-forward projections (gate, up, down).
* `lora_dropout=0.05`: applies dropout in the LoRA layers during training, which helps prevent overfitting.
* `bias='none'`: no bias terms are trained with LoRA. You can also choose to train all bias terms (`'all'`) or only those in the LoRA layers (`'lora_only'`). `'none'` is a common default.
* `task_type='CAUSAL_LM'`: tells the PEFT library that you are working with a causal language model (a GPT-style model that generates text one token at a time), so that it applies LoRA correctly for this type of architecture.


In [6]:
from copy import deepcopy 
torch.cuda.empty_cache(); gc.collect()

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=[
        "self_attn.q_proj",
        "self_attn.k_proj",
        "self_attn.v_proj",
        "self_attn.o_proj",
        "mlp.gate_proj",
        "mlp.up_proj",
        "mlp.down_proj",
    ],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)

get_peft_model(policy_model, lora_config).print_trainable_parameters()



trainable params: 161,480,704 || all params: 7,777,097,216 || trainable%: 2.0764
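The trainable-parameter count printed above can be reproduced by hand, since each adapted projection contributes `r * (in_features + out_features)` parameters. The dimensions below are assumptions taken from the published Qwen2.5-7B configuration (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of size 128); check them against `policy_model.config` before relying on them.

```python
# Assumed Qwen2.5-7B dimensions; verify against policy_model.config.
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 4 * 128      # grouped-query attention: 4 key/value heads x head dim 128
NUM_LAYERS = 28
RANK = 64             # r from the LoraConfig in this notebook

# (in_features, out_features) of every projection listed in target_modules.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# A is r x in_features and B is out_features x r, for each projection in each layer.
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in projections.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")
```

Under these assumptions the total matches the `trainable params` figure reported by `print_trainable_parameters()`, and the three MLP projections account for three quarters of it.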


### 2.5 GRPO Utilities

Group Relative Policy Optimization (GRPO) is a reinforcement learning technique designed to make fine-tuning large language models (LLMs) more stable and efficient, especially when multiple responses per prompt are available. Traditional methods like PPO (Proximal Policy Optimization) score each sample with a scalar reward and usually rely on a learned value model to turn that reward into an advantage. In LLM fine-tuning, however, what matters is often relative: which response to a prompt is better, not by how much. GRPO was introduced in the DeepSeekMath paper and popularised by the [DeepSeek-R1 Technical Report](https://arxiv.org/pdf/2501.12948); here we use the Dr. GRPO (GRPO Done Right) variant introduced by [Liu et al. (2025)](https://arxiv.org/pdf/2503.20783?).

![image.png](attachment:image.png)

GRPO leverages this **relative preference** more effectively by comparing samples **within a group** of responses to the same prompt. Instead of updating the policy using individual sample rewards, GRPO:

1. Groups responses by the same prompt.
2. Computes **relative advantages** within each group:
   $$
   A_i = \text{reward}_i - \text{mean(rewards in group)}
   $$
3. Uses these relative advantages to update the model using a PPO-style objective:
   $$
   \mathcal{L}_{\text{GRPO}} = \mathbb{E}_i \Big[\min\big(r_i A_i,\ \text{clip}(r_i, 1 - \epsilon, 1 + \epsilon)\, A_i\big)\Big]
   $$
   where $r_i = \frac{\pi_\theta(a_i \mid s_i)}{\pi_{\theta_{\text{old}}}(a_i \mid s_i)}$ is the **likelihood ratio** between the current policy and the policy that sampled the completion. When a KL penalty is used, it is measured against the frozen reference model $\pi_{\text{ref}}$.

This ensures the model learns to *prefer relatively better responses* without depending on absolute reward scaling.
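
The two steps above can be sketched in a few lines of plain Python. This is a toy illustration and not the TRL implementation: the rewards and likelihood ratios are made-up numbers, and the function names `group_advantages` and `clipped_surrogate` are hypothetical.

```python
from statistics import mean

def group_advantages(rewards):
    """Centre rewards within one group (all completions of the same prompt)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the group."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(1.0 + eps, r))
        terms.append(min(r * a, clipped_r * a))
    return mean(terms)

rewards = [1.0, 0.0, 0.5, 0.5]   # made-up rewards for four completions of one prompt
ratios = [1.5, 0.5, 1.0, 1.0]    # made-up likelihood ratios r_i

advantages = group_advantages(rewards)
objective = clipped_surrogate(ratios, advantages)

print(advantages)            # [0.5, -0.5, 0.0, 0.0]
print(round(objective, 4))   # 0.05
```

For the first completion the ratio 1.5 is clipped to 1.2, so its contribution is capped at 0.6 instead of 0.75. For the second, the minimum selects the clipped value -0.4 rather than -0.25, so the objective never looks better than the clipped estimate allows.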


Key Benefits:

* **Stable training:** By normalizing rewards within groups, it mitigates outlier effects.
* **More sample-efficient:** Every group yields multiple gradient signals.
* **Alignment-friendly:** Works well with human or model preference data (as in RLHF or DPO setups).


GRPO needs log-probabilities for each sampled completion under the policy and, when the KL penalty is active, under the frozen reference model. In this notebook TRL's `GRPOTrainer` computes them internally, so the cell below only adapts `sorting_reward` to the trainer's reward-function interface and wraps each prompt in the conversational (chat message) format.


In [10]:
from datasets import load_dataset, Dataset

# GRPOTrainer passes lists of conversational prompts and completions.
def grpo_reward_fn(prompts, completions, **kwargs):
    """Return one scalar reward per (prompt, completion) pair."""
    rewards = []
    for p, c in zip(prompts, completions):
        prompt_text = p[0].get('content')
        completion_text = c[0].get('content')
        # Rebuild the transcript passed to sorting_reward, with end-of-turn tokens appended.
        transcript = prompt_text + '<|im_end|>' + completion_text + '<|im_end|>'
        r, _ = sorting_reward(prompt_text, transcript)
        rewards.append(r)
    return rewards

# Keep only 'prompt', rename → 'content', and add 'role' = 'user'
train_prompts = training_ds.map(
    lambda ex: {"prompt": [{"role": "user", "content": ex["prompt"]}]},
    remove_columns=training_ds.column_names
)

test_prompts = test_ds.map(
    lambda ex: {"prompt": [{"role": "user", "content": ex["prompt"]}]},
    remove_columns=test_ds.column_names
)

print(train_prompts[0])

{'prompt': [{'content': 'Sort the numbers [-14.433446591715981, 17.077524987991644, -7.7554073098261895, -13.023103573742805, -14.875241191424625, 17.03338723338379, 7.268326687417488, 9.524625622451982, -18.41086602591082, -15.315238006920378, -8.366955330463021, 10.100936452499017, 8.06225314693065, 15.800980646120173, 15.066248679511794, 0.9759910480829355, 2.460452314192679, -6.090464588466864, 23.4650160396467, 17.940368356488364, -12.017034181155495, 1.1307199076751289, -6.1064329164179085, -9.23431189462056, 18.174706450326198, -14.889486174007565, -1.003634968133131, -2.0510309757685796, -2.8022138760514466, -6.773956638899346, -17.82747785384514, 2.971243983592128, -13.758691857339532, 26.114768601407995, -16.05999009607709, -5.341085929645356, 11.431989974722168, 24.272587396840343, -1.9182486916511756, -10.385570487167005, -16.522242558812955, 13.063165928385509, 18.653417039434594, 29.261076033037888, 22.765886050757345, 23.324183337763486, -0.9936887480416914, 2.6705151184
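
To make the reward-function interface concrete, here is a minimal, self-contained sketch of how a function with the same signature receives conversational prompts and completions. `toy_sorting_reward` is a hypothetical stand-in for the notebook's `sorting_reward` (it only checks exact equality of the last bracketed list), and the prompts and completions are invented for illustration.

```python
import ast
import re

def last_list(text):
    """Parse the last flat bracketed list in `text`, or return None."""
    matches = re.findall(r"\[[^\[\]]*\]", text)
    if not matches:
        return None
    try:
        return list(ast.literal_eval(matches[-1]))
    except (ValueError, SyntaxError):
        return None

def toy_sorting_reward(prompt_text, completion_text):
    """Hypothetical reward: 1.0 if the answer equals the sorted prompt list, else 0.0."""
    numbers = last_list(prompt_text)
    answer = last_list(completion_text)
    if numbers is None or answer is None:
        return 0.0
    return 1.0 if answer == sorted(numbers) else 0.0

def toy_reward_fn(prompts, completions, **kwargs):
    # Each prompt and each completion is a list of {"role", "content"} messages.
    return [
        toy_sorting_reward(p[0]["content"], c[0]["content"])
        for p, c in zip(prompts, completions)
    ]

prompts = [[{"role": "user", "content": "Sort the numbers [3, 1, 2]"}]] * 2
completions = [
    [{"role": "assistant", "content": "<think>1 < 2 < 3</think> [1, 2, 3]"}],
    [{"role": "assistant", "content": "[3, 2, 1]"}],
]
print(toy_reward_fn(prompts, completions))  # [1.0, 0.0]
```

The returned list has one entry per completion, in the same order as the inputs, which is the shape the trainer needs to form group-relative advantages.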

### 2.6 GRPO Config

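Important knobs:
 - `num_generations`: number of completions sampled per prompt, used to compute the group-relative baseline.
 - `scale_rewards`: Dr. GRPO sets this to `False`, so advantages are not divided by the group standard deviation.
 - `beta`: KL weight that keeps outputs close to the reference model (0 disables the penalty).
 - `max_prompt_length` / `max_completion_length`: make sure the model can read the full prompt and output the full list.
 - `stop`: stopping at `]` would avoid rambling after the sorted list; the config below does not set a stop sequence, so completions are bounded by `max_completion_length`.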

In [16]:
from trl import GRPOTrainer, GRPOConfig

training_args = GRPOConfig(
    # — standard Trainer args —
    output_dir="logs/grpo_sorting",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    logging_steps=1,
    log_completions=True,
    num_completions_to_print=1,
    save_steps=200,
    num_train_epochs=2,           # or use max_steps
    seed=42,
    bf16=True,                    # if supported; otherwise set False

    # — GRPO data/generation knobs (doc names) —
    max_prompt_length=2048,
    num_generations=4,            # “G” completions per prompt
    temperature=0.7,
    max_completion_length=8192,
    repetition_penalty=1.0,       # keep default unless you need it

    # You can choose one of these: generation_batch_size OR steps_per_generation
    # generation_batch_size: if None, it’s derived from effective train batch. 
    generation_batch_size=None,

    # — GRPO objective/regularization (doc names) —
    beta=0.0,                     # KL weight; 0.0 is the doc default for GRPO. 
    num_iterations=1,             # μ in the paper (updates per generation)
    epsilon=0.2,                  # PPO clip range
    importance_sampling_level="token",

    # >>> Dr. GRPO setting <<<
    scale_rewards=False,          # disable std scaling to avoid difficulty bias. 

    # Loss variant: "dr_grpo" is used here because it avoids length bias.
    loss_type="dr_grpo",

    # Training stability (recommended in docs)
    mask_truncated_completions=True,  # don’t penalize truncated samples. 

    # Optional reference-model syncing (off by default)
    sync_ref_model=False,
)
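
As a quick sanity check on the batch settings above, the arithmetic below works out how many completions and distinct prompts contribute to one optimizer step. It assumes a single training process and that one generation round covers exactly one optimizer step (our reading of the default when `generation_batch_size` is left as `None`). Both are assumptions rather than facts stated in this notebook.

```python
# Batch arithmetic for the GRPOConfig above (single-process assumption).
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_generations = 4
num_processes = 1  # assumption: one GPU / one process

completions_per_step = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)
# Groups must be complete: the completion count must be divisible by the group size.
assert completions_per_step % num_generations == 0

prompts_per_step = completions_per_step // num_generations
print(completions_per_step, prompts_per_step)  # 16 4
```

Four prompt groups per step is consistent with `frac_reward_zero_std` in the training log moving in increments of 0.25: each group whose completions all receive the same reward contributes one quarter.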

### 2.7 GRPO Training

In [None]:
import torch
from transformers import TrainerCallback

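class RewardAndSamplePrinter(TrainerCallback):
    """Print every reward-related metric each time the trainer logs.

    `tokenizer`, `dataset`, `sample_every`, and `sample_idx` are stored for
    optional sample printing but are not used by `on_log` below.
    """

    def __init__(self, tokenizer, dataset, sample_every=10, sample_idx=0):
        self.tokenizer = tokenizer
        self.dataset = dataset
        self.sample_every = sample_every
        self.sample_idx = sample_idx

    def on_log(self, args, state, control, logs=None, model=None, **kwargs):
        if not logs:
            return

        step = state.global_step
        # Keep only numeric metrics whose name mentions "reward".
        reward_keys = [
            k for k, v in logs.items()
            if "reward" in k.lower() and isinstance(v, (int, float))
        ]
        if reward_keys:
            reward_str = " | ".join(f"{k}: {logs[k]:.4f}" for k in reward_keys)
            print(f"[Step {step}] {reward_str}")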

printer_callback = RewardAndSamplePrinter(
    tokenizer=tokenizer,
    dataset=train_prompts,    # dataset
    sample_every=10,          # print every 10 steps
    sample_idx=0,             # or random index if you prefer
)

trainer = GRPOTrainer(
    model=policy_model,
    args=training_args,
    train_dataset=train_prompts,   # dataset 
    eval_dataset=test_prompts,     # optional
    reward_funcs=grpo_reward_fn,   # our sorting reward adapter
)

trainer.add_callback(printer_callback)

trainer.train()
trainer.save_model("checkpoints/qwen2_sorting_grpo")

Step,Training Loss
1,-0.0
2,-0.0048
3,0.0021
4,-0.0022
5,0.016
6,-0.0046
7,0.0008
8,0.0123
9,0.0014
10,-0.0026


[Step 1] rewards/grpo_reward_fn/mean: 0.2410 | rewards/grpo_reward_fn/std: 0.0949 | reward: 0.2410 | reward_std: 0.0378 | frac_reward_zero_std: 0.2500


[Step 2] rewards/grpo_reward_fn/mean: 0.5129 | rewards/grpo_reward_fn/std: 0.3467 | reward: 0.5129 | reward_std: 0.3014 | frac_reward_zero_std: 0.0000


[Step 3] rewards/grpo_reward_fn/mean: 0.5074 | rewards/grpo_reward_fn/std: 0.4426 | reward: 0.5074 | reward_std: 0.1999 | frac_reward_zero_std: 0.2500


[Step 4] rewards/grpo_reward_fn/mean: 0.3430 | rewards/grpo_reward_fn/std: 0.2733 | reward: 0.3430 | reward_std: 0.2317 | frac_reward_zero_std: 0.0000


[Step 5] rewards/grpo_reward_fn/mean: -0.0290 | rewards/grpo_reward_fn/std: 0.6132 | reward: -0.0290 | reward_std: 0.4321 | frac_reward_zero_std: 0.0000


[Step 6] rewards/grpo_reward_fn/mean: 0.4768 | rewards/grpo_reward_fn/std: 0.3768 | reward: 0.4768 | reward_std: 0.1384 | frac_reward_zero_std: 0.2500


[Step 7] rewards/grpo_reward_fn/mean: 0.2272 | rewards/grpo_reward_fn/std: 0.0863 | reward: 0.2272 | reward_std: 0.0705 | frac_reward_zero_std: 0.0000


[Step 8] rewards/grpo_reward_fn/mean: 0.0917 | rewards/grpo_reward_fn/std: 0.4348 | reward: 0.0917 | reward_std: 0.3444 | frac_reward_zero_std: 0.0000


[Step 9] rewards/grpo_reward_fn/mean: 0.7330 | rewards/grpo_reward_fn/std: 0.4246 | reward: 0.7330 | reward_std: 0.1423 | frac_reward_zero_std: 0.5000


[Step 10] rewards/grpo_reward_fn/mean: 0.5281 | rewards/grpo_reward_fn/std: 0.3830 | reward: 0.5281 | reward_std: 0.1214 | frac_reward_zero_std: 0.2500


[Step 11] rewards/grpo_reward_fn/mean: 0.3028 | rewards/grpo_reward_fn/std: 0.4280 | reward: 0.3028 | reward_std: 0.3009 | frac_reward_zero_std: 0.0000


[Step 12] rewards/grpo_reward_fn/mean: 0.3582 | rewards/grpo_reward_fn/std: 0.3978 | reward: 0.3582 | reward_std: 0.2087 | frac_reward_zero_std: 0.2500


[Step 13] rewards/grpo_reward_fn/mean: 0.3944 | rewards/grpo_reward_fn/std: 0.3379 | reward: 0.3944 | reward_std: 0.1923 | frac_reward_zero_std: 0.0000


[Step 14] rewards/grpo_reward_fn/mean: 0.5575 | rewards/grpo_reward_fn/std: 0.3552 | reward: 0.5575 | reward_std: 0.1893 | frac_reward_zero_std: 0.2500


[Step 15] rewards/grpo_reward_fn/mean: 0.5230 | rewards/grpo_reward_fn/std: 0.6515 | reward: 0.5230 | reward_std: 0.4700 | frac_reward_zero_std: 0.2500


[Step 16] rewards/grpo_reward_fn/mean: 0.3262 | rewards/grpo_reward_fn/std: 0.5773 | reward: 0.3262 | reward_std: 0.3194 | frac_reward_zero_std: 0.2500


[Step 17] rewards/grpo_reward_fn/mean: 0.3974 | rewards/grpo_reward_fn/std: 0.3779 | reward: 0.3974 | reward_std: 0.3174 | frac_reward_zero_std: 0.0000


[Step 18] rewards/grpo_reward_fn/mean: 0.4315 | rewards/grpo_reward_fn/std: 0.3475 | reward: 0.4315 | reward_std: 0.0582 | frac_reward_zero_std: 0.2500


[Step 19] rewards/grpo_reward_fn/mean: 0.3634 | rewards/grpo_reward_fn/std: 0.1879 | reward: 0.3634 | reward_std: 0.1450 | frac_reward_zero_std: 0.2500


[Step 20] rewards/grpo_reward_fn/mean: 0.4443 | rewards/grpo_reward_fn/std: 0.4051 | reward: 0.4443 | reward_std: 0.1440 | frac_reward_zero_std: 0.5000


[Step 21] rewards/grpo_reward_fn/mean: 0.4004 | rewards/grpo_reward_fn/std: 0.3788 | reward: 0.4004 | reward_std: 0.0607 | frac_reward_zero_std: 0.2500


[Step 22] rewards/grpo_reward_fn/mean: 0.4148 | rewards/grpo_reward_fn/std: 0.3022 | reward: 0.4148 | reward_std: 0.1177 | frac_reward_zero_std: 0.0000


[Step 23] rewards/grpo_reward_fn/mean: 0.6761 | rewards/grpo_reward_fn/std: 0.3847 | reward: 0.6761 | reward_std: 0.2177 | frac_reward_zero_std: 0.2500


[Step 24] rewards/grpo_reward_fn/mean: 0.5355 | rewards/grpo_reward_fn/std: 0.3292 | reward: 0.5355 | reward_std: 0.2318 | frac_reward_zero_std: 0.0000


[Step 25] rewards/grpo_reward_fn/mean: 0.5025 | rewards/grpo_reward_fn/std: 0.4471 | reward: 0.5025 | reward_std: 0.3364 | frac_reward_zero_std: 0.0000


[Step 26] rewards/grpo_reward_fn/mean: 0.4456 | rewards/grpo_reward_fn/std: 0.3345 | reward: 0.4456 | reward_std: 0.0342 | frac_reward_zero_std: 0.2500


[Step 27] rewards/grpo_reward_fn/mean: 0.3556 | rewards/grpo_reward_fn/std: 0.2720 | reward: 0.3556 | reward_std: 0.1443 | frac_reward_zero_std: 0.0000


[Step 28] rewards/grpo_reward_fn/mean: 0.6580 | rewards/grpo_reward_fn/std: 0.3539 | reward: 0.6580 | reward_std: 0.0180 | frac_reward_zero_std: 0.5000


[Step 29] rewards/grpo_reward_fn/mean: 0.2188 | rewards/grpo_reward_fn/std: 0.3939 | reward: 0.2188 | reward_std: 0.2369 | frac_reward_zero_std: 0.0000


[Step 30] rewards/grpo_reward_fn/mean: 0.3839 | rewards/grpo_reward_fn/std: 0.2455 | reward: 0.3839 | reward_std: 0.1254 | frac_reward_zero_std: 0.0000


[Step 31] rewards/grpo_reward_fn/mean: 0.4657 | rewards/grpo_reward_fn/std: 0.3250 | reward: 0.4657 | reward_std: 0.0390 | frac_reward_zero_std: 0.2500


[Step 32] rewards/grpo_reward_fn/mean: 0.4674 | rewards/grpo_reward_fn/std: 0.3213 | reward: 0.4674 | reward_std: 0.2096 | frac_reward_zero_std: 0.0000


[Step 33] rewards/grpo_reward_fn/mean: 0.6993 | rewards/grpo_reward_fn/std: 0.4108 | reward: 0.6993 | reward_std: 0.1228 | frac_reward_zero_std: 0.5000


[Step 34] rewards/grpo_reward_fn/mean: 0.3540 | rewards/grpo_reward_fn/std: 0.4934 | reward: 0.3540 | reward_std: 0.3908 | frac_reward_zero_std: 0.0000


[Step 35] rewards/grpo_reward_fn/mean: 0.5574 | rewards/grpo_reward_fn/std: 0.5566 | reward: 0.5574 | reward_std: 0.3207 | frac_reward_zero_std: 0.2500


[Step 36] rewards/grpo_reward_fn/mean: 0.6813 | rewards/grpo_reward_fn/std: 0.3850 | reward: 0.6813 | reward_std: 0.2960 | frac_reward_zero_std: 0.2500


[Step 37] rewards/grpo_reward_fn/mean: 0.7473 | rewards/grpo_reward_fn/std: 0.3988 | reward: 0.7473 | reward_std: 0.1352 | frac_reward_zero_std: 0.5000


[Step 38] rewards/grpo_reward_fn/mean: 0.7368 | rewards/grpo_reward_fn/std: 0.3521 | reward: 0.7368 | reward_std: 0.1767 | frac_reward_zero_std: 0.5000


[Step 39] rewards/grpo_reward_fn/mean: 0.3717 | rewards/grpo_reward_fn/std: 0.5582 | reward: 0.3717 | reward_std: 0.2758 | frac_reward_zero_std: 0.5000


[Step 40] rewards/grpo_reward_fn/mean: 0.7500 | rewards/grpo_reward_fn/std: 0.3336 | reward: 0.7500 | reward_std: 0.1016 | frac_reward_zero_std: 0.5000


[Step 41] rewards/grpo_reward_fn/mean: 0.3629 | rewards/grpo_reward_fn/std: 0.3283 | reward: 0.3629 | reward_std: 0.1426 | frac_reward_zero_std: 0.0000


[Step 42] rewards/grpo_reward_fn/mean: 0.5095 | rewards/grpo_reward_fn/std: 0.3542 | reward: 0.5095 | reward_std: 0.1297 | frac_reward_zero_std: 0.2500


[Step 43] rewards/grpo_reward_fn/mean: 1.0000 | rewards/grpo_reward_fn/std: 0.0000 | reward: 1.0000 | reward_std: 0.0000 | frac_reward_zero_std: 1.0000


[Step 44] rewards/grpo_reward_fn/mean: 0.6434 | rewards/grpo_reward_fn/std: 0.3796 | reward: 0.6434 | reward_std: 0.0382 | frac_reward_zero_std: 0.5000


[Step 45] rewards/grpo_reward_fn/mean: 0.5745 | rewards/grpo_reward_fn/std: 0.5429 | reward: 0.5745 | reward_std: 0.3610 | frac_reward_zero_std: 0.2500


[Step 46] rewards/grpo_reward_fn/mean: 0.4553 | rewards/grpo_reward_fn/std: 0.2749 | reward: 0.4553 | reward_std: 0.1963 | frac_reward_zero_std: 0.2500


[Step 47] rewards/grpo_reward_fn/mean: 0.7299 | rewards/grpo_reward_fn/std: 0.3611 | reward: 0.7299 | reward_std: 0.1171 | frac_reward_zero_std: 0.5000


[Step 48] rewards/grpo_reward_fn/mean: 0.6482 | rewards/grpo_reward_fn/std: 0.3660 | reward: 0.6482 | reward_std: 0.0290 | frac_reward_zero_std: 0.5000


[Step 49] rewards/grpo_reward_fn/mean: 0.4571 | rewards/grpo_reward_fn/std: 0.3943 | reward: 0.4571 | reward_std: 0.1460 | frac_reward_zero_std: 0.2500


[Step 50] rewards/grpo_reward_fn/mean: 0.6503 | rewards/grpo_reward_fn/std: 0.3641 | reward: 0.6503 | reward_std: 0.2088 | frac_reward_zero_std: 0.2500


[Step 51] rewards/grpo_reward_fn/mean: 0.6936 | rewards/grpo_reward_fn/std: 0.3616 | reward: 0.6936 | reward_std: 0.2678 | frac_reward_zero_std: 0.0000


[Step 52] rewards/grpo_reward_fn/mean: 0.4384 | rewards/grpo_reward_fn/std: 0.4812 | reward: 0.4384 | reward_std: 0.1867 | frac_reward_zero_std: 0.5000


[Step 53] rewards/grpo_reward_fn/mean: 0.5991 | rewards/grpo_reward_fn/std: 0.3751 | reward: 0.5991 | reward_std: 0.2140 | frac_reward_zero_std: 0.2500


[Step 54] rewards/grpo_reward_fn/mean: 0.6542 | rewards/grpo_reward_fn/std: 0.4137 | reward: 0.6542 | reward_std: 0.0925 | frac_reward_zero_std: 0.5000


[Step 55] rewards/grpo_reward_fn/mean: 0.6411 | rewards/grpo_reward_fn/std: 0.3731 | reward: 0.6411 | reward_std: 0.1938 | frac_reward_zero_std: 0.2500


[Step 56] rewards/grpo_reward_fn/mean: 0.4704 | rewards/grpo_reward_fn/std: 0.3852 | reward: 0.4704 | reward_std: 0.1457 | frac_reward_zero_std: 0.2500


[Step 57] rewards/grpo_reward_fn/mean: 0.5024 | rewards/grpo_reward_fn/std: 0.3006 | reward: 0.5024 | reward_std: 0.2840 | frac_reward_zero_std: 0.0000


[Step 58] rewards/grpo_reward_fn/mean: 0.5724 | rewards/grpo_reward_fn/std: 0.3505 | reward: 0.5724 | reward_std: 0.3820 | frac_reward_zero_std: 0.0000


[Step 59] rewards/grpo_reward_fn/mean: 0.6746 | rewards/grpo_reward_fn/std: 0.3364 | reward: 0.6746 | reward_std: 0.2609 | frac_reward_zero_std: 0.0000


[Step 60] rewards/grpo_reward_fn/mean: 0.6915 | rewards/grpo_reward_fn/std: 0.3615 | reward: 0.6915 | reward_std: 0.0942 | frac_reward_zero_std: 0.5000


[Step 61] rewards/grpo_reward_fn/mean: 0.7118 | rewards/grpo_reward_fn/std: 0.3881 | reward: 0.7118 | reward_std: 0.1288 | frac_reward_zero_std: 0.5000


[Step 62] rewards/grpo_reward_fn/mean: 0.6672 | rewards/grpo_reward_fn/std: 0.3994 | reward: 0.6672 | reward_std: 0.1079 | frac_reward_zero_std: 0.5000


[Step 63] rewards/grpo_reward_fn/mean: 0.7385 | rewards/grpo_reward_fn/std: 0.4071 | reward: 0.7385 | reward_std: 0.1396 | frac_reward_zero_std: 0.5000


[Step 64] rewards/grpo_reward_fn/mean: 0.6324 | rewards/grpo_reward_fn/std: 0.3828 | reward: 0.6324 | reward_std: 0.2120 | frac_reward_zero_std: 0.2500


[Step 65] rewards/grpo_reward_fn/mean: 0.3130 | rewards/grpo_reward_fn/std: 0.3646 | reward: 0.3130 | reward_std: 0.2612 | frac_reward_zero_std: 0.0000


[Step 66] rewards/grpo_reward_fn/mean: 1.0000 | rewards/grpo_reward_fn/std: 0.0000 | reward: 1.0000 | reward_std: 0.0000 | frac_reward_zero_std: 1.0000


[Step 67] rewards/grpo_reward_fn/mean: 0.8519 | rewards/grpo_reward_fn/std: 0.3235 | reward: 0.8519 | reward_std: 0.1038 | frac_reward_zero_std: 0.7500


[Step 68] rewards/grpo_reward_fn/mean: 0.4590 | rewards/grpo_reward_fn/std: 0.2707 | reward: 0.4590 | reward_std: 0.1873 | frac_reward_zero_std: 0.0000


[Step 69] rewards/grpo_reward_fn/mean: 0.5327 | rewards/grpo_reward_fn/std: 0.3924 | reward: 0.5327 | reward_std: 0.1630 | frac_reward_zero_std: 0.2500


[Step 70] rewards/grpo_reward_fn/mean: 0.6226 | rewards/grpo_reward_fn/std: 0.3466 | reward: 0.6226 | reward_std: 0.1903 | frac_reward_zero_std: 0.2500


[Step 71] rewards/grpo_reward_fn/mean: 0.5351 | rewards/grpo_reward_fn/std: 0.4474 | reward: 0.5351 | reward_std: 0.1674 | frac_reward_zero_std: 0.2500


[Step 72] rewards/grpo_reward_fn/mean: 0.3177 | rewards/grpo_reward_fn/std: 0.4548 | reward: 0.3177 | reward_std: 0.2716 | frac_reward_zero_std: 0.0000


[Step 73] rewards/grpo_reward_fn/mean: 0.6568 | rewards/grpo_reward_fn/std: 0.3560 | reward: 0.6568 | reward_std: 0.2617 | frac_reward_zero_std: 0.0000


[Step 74] rewards/grpo_reward_fn/mean: 0.5295 | rewards/grpo_reward_fn/std: 0.3842 | reward: 0.5295 | reward_std: 0.1369 | frac_reward_zero_std: 0.2500


[Step 75] rewards/grpo_reward_fn/mean: 0.7840 | rewards/grpo_reward_fn/std: 0.3309 | reward: 0.7840 | reward_std: 0.1857 | frac_reward_zero_std: 0.5000


[Step 76] rewards/grpo_reward_fn/mean: 0.5209 | rewards/grpo_reward_fn/std: 0.3895 | reward: 0.5209 | reward_std: 0.2155 | frac_reward_zero_std: 0.2500


[Step 77] rewards/grpo_reward_fn/mean: 0.6813 | rewards/grpo_reward_fn/std: 0.3761 | reward: 0.6813 | reward_std: 0.0998 | frac_reward_zero_std: 0.5000


[Step 78] rewards/grpo_reward_fn/mean: 0.6121 | rewards/grpo_reward_fn/std: 0.3540 | reward: 0.6121 | reward_std: 0.1944 | frac_reward_zero_std: 0.2500


[Step 79] rewards/grpo_reward_fn/mean: 0.6884 | rewards/grpo_reward_fn/std: 0.3697 | reward: 0.6884 | reward_std: 0.2773 | frac_reward_zero_std: 0.2500


[Step 80] rewards/grpo_reward_fn/mean: 0.3128 | rewards/grpo_reward_fn/std: 0.5094 | reward: 0.3128 | reward_std: 0.3732 | frac_reward_zero_std: 0.0000


[Step 81] rewards/grpo_reward_fn/mean: 0.6206 | rewards/grpo_reward_fn/std: 0.4020 | reward: 0.6206 | reward_std: 0.2062 | frac_reward_zero_std: 0.2500


[Step 82] rewards/grpo_reward_fn/mean: 0.3104 | rewards/grpo_reward_fn/std: 0.1944 | reward: 0.3104 | reward_std: 0.1310 | frac_reward_zero_std: 0.0000


[Step 83] rewards/grpo_reward_fn/mean: 0.7295 | rewards/grpo_reward_fn/std: 0.5581 | reward: 0.7295 | reward_std: 0.1554 | frac_reward_zero_std: 0.7500


[Step 84] rewards/grpo_reward_fn/mean: 0.7999 | rewards/grpo_reward_fn/std: 0.3616 | reward: 0.7999 | reward_std: 0.0287 | frac_reward_zero_std: 0.7500


[Step 85] rewards/grpo_reward_fn/mean: 0.6965 | rewards/grpo_reward_fn/std: 0.4116 | reward: 0.6965 | reward_std: 0.2235 | frac_reward_zero_std: 0.2500


[Step 86] rewards/grpo_reward_fn/mean: 0.4152 | rewards/grpo_reward_fn/std: 0.3649 | reward: 0.4152 | reward_std: 0.0964 | frac_reward_zero_std: 0.2500


[Step 87] rewards/grpo_reward_fn/mean: 0.4854 | rewards/grpo_reward_fn/std: 0.3760 | reward: 0.4854 | reward_std: 0.1333 | frac_reward_zero_std: 0.2500


[Step 88] rewards/grpo_reward_fn/mean: 0.5370 | rewards/grpo_reward_fn/std: 0.3759 | reward: 0.5370 | reward_std: 0.1049 | frac_reward_zero_std: 0.5000


[Step 89] rewards/grpo_reward_fn/mean: 0.9044 | rewards/grpo_reward_fn/std: 0.2613 | reward: 0.9044 | reward_std: 0.1911 | frac_reward_zero_std: 0.5000


[Step 90] rewards/grpo_reward_fn/mean: 0.6381 | rewards/grpo_reward_fn/std: 0.3758 | reward: 0.6381 | reward_std: 0.0281 | frac_reward_zero_std: 0.5000


[Step 91] rewards/grpo_reward_fn/mean: 0.4539 | rewards/grpo_reward_fn/std: 0.3318 | reward: 0.4539 | reward_std: 0.0432 | frac_reward_zero_std: 0.2500


[Step 92] rewards/grpo_reward_fn/mean: 0.4855 | rewards/grpo_reward_fn/std: 0.3103 | reward: 0.4855 | reward_std: 0.1940 | frac_reward_zero_std: 0.0000


[Step 93] rewards/grpo_reward_fn/mean: 0.6429 | rewards/grpo_reward_fn/std: 0.6645 | reward: 0.6429 | reward_std: 0.1023 | frac_reward_zero_std: 0.7500


[Step 94] rewards/grpo_reward_fn/mean: 0.6736 | rewards/grpo_reward_fn/std: 0.3390 | reward: 0.6736 | reward_std: 0.0063 | frac_reward_zero_std: 0.7500


[Step 95] rewards/grpo_reward_fn/mean: 0.4591 | rewards/grpo_reward_fn/std: 0.3317 | reward: 0.4591 | reward_std: 0.0466 | frac_reward_zero_std: 0.2500


[Step 96] rewards/grpo_reward_fn/mean: 0.8087 | rewards/grpo_reward_fn/std: 0.3518 | reward: 0.8087 | reward_std: 0.0457 | frac_reward_zero_std: 0.7500


[Step 97] rewards/grpo_reward_fn/mean: 0.7797 | rewards/grpo_reward_fn/std: 0.3387 | reward: 0.7797 | reward_std: 0.2561 | frac_reward_zero_std: 0.2500


[Step 98] rewards/grpo_reward_fn/mean: 0.7100 | rewards/grpo_reward_fn/std: 0.3933 | reward: 0.7100 | reward_std: 0.1991 | frac_reward_zero_std: 0.5000


[Step 99] rewards/grpo_reward_fn/mean: 0.5630 | rewards/grpo_reward_fn/std: 0.3547 | reward: 0.5630 | reward_std: 0.1936 | frac_reward_zero_std: 0.2500


[Step 100] rewards/grpo_reward_fn/mean: 0.6825 | rewards/grpo_reward_fn/std: 0.3746 | reward: 0.6825 | reward_std: 0.1071 | frac_reward_zero_std: 0.5000


The performance curve below shows that the mean reward trends upward with training steps, although individual steps remain noisy.

![image.png](attachment:image.png)
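
To put a number on the trend, the sketch below averages the aggregate `reward` values over the first and last ten logged steps. The values are copied from the log above; `parse_reward` is a hypothetical helper showing how they could be extracted from the printed lines.

```python
import re
from statistics import mean

def parse_reward(line):
    """Extract the aggregate `reward:` value from one printed log line."""
    match = re.search(r"\| reward: (-?\d+\.\d+)", line)
    return float(match.group(1)) if match else None

sample = (
    "[Step 5] rewards/grpo_reward_fn/mean: -0.0290 | rewards/grpo_reward_fn/std: 0.6132 "
    "| reward: -0.0290 | reward_std: 0.4321 | frac_reward_zero_std: 0.0000"
)
print(parse_reward(sample))  # -0.029

# Aggregate rewards for steps 1-10 and 91-100, copied from the training log.
first_10 = [0.2410, 0.5129, 0.5074, 0.3430, -0.0290, 0.4768, 0.2272, 0.0917, 0.7330, 0.5281]
last_10 = [0.4539, 0.4855, 0.6429, 0.6736, 0.4591, 0.8087, 0.7797, 0.7100, 0.5630, 0.6825]

print(round(mean(first_10), 3), round(mean(last_10), 3))  # 0.363 0.626
```

Averaging over windows like this smooths out the per-step noise, which is large here because each step only covers a handful of prompts.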

### 2.8 Inspecting the tuned model

After RLVR training, we reload the saved adapter and measure the average reward on the test set.


In [20]:
from peft import PeftModel

policy_model = PeftModel.from_pretrained(policy_model, "checkpoints/qwen2_sorting_grpo")
policy_model = policy_model.merge_and_unload()  # optional: merges LoRA weights into base
policy_model.eval()

rlvr_response = test_model(policy_model, prompt=test_ds[0]['prompt'])
print(rlvr_response)

rlvr_reward = sorting_reward(test_ds[0]['prompt'], rlvr_response, test=True)
print('RLVR reward:', rlvr_reward)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Sort the numbers [-14.433446591715981, 17.077524987991644, -7.7554073098261895, -13.023103573742805, -14.875241191424625, 17.03338723338379, 7.268326687417488, 9.524625622451982, -18.41086602591082, -15.315238006920378, -8.366955330463021, 10.100936452499017, 8.06225314693065, 15.800980646120173, 15.066248679511794, 0.9759910480829355, 2.460452314192679, -6.090464588466864, 23.4650160396467, 17.940368356488364, -12.017034181155495, 1.1307199076751289, -6.1064329164179085, -9.23431189462056, 18.174706450326198, -14.889486174007565, -1.003634968133131, -2.0510309757685796, -2.8022138760514466, -6.773956638899346, -17.82747785384514, 2.971243983592128, -13.758691857339532, 26.114768601407995, -16.05999009607709, -5.341085929645356, 11.431989974722168, 24.272587396840343, -1.9182486916511756, -10.385570487167005, -16.522242558812955, 13.063165928385509, 18.653417039434594, 29.

In [22]:
rlvr_eval_results = evaluate_model_on_dataset(policy_model, test_ds.select(range(10)))
print("RLVR model average reward:", rlvr_eval_results)

Evaluating model: 100%|██████████| 10/10 [05:12<00:00, 31.27s/it]

RLVR model average reward: 0.7761401626016261





The test reward increases from 46.3% to 52.8% with SFT, and then further to 77.6% after RL tuning. Note that the RLVR figure above is computed on only the first 10 test prompts.


## Takeaways

* LoRA adapters collapse the fine-tuning footprint of a dense linear layer while maintaining accuracy when the required update is approximately low rank.
* Supervised warm-starting with structured `<think>...</think>` exemplars teaches the LoRA adapter to emit both reasoning tokens and the final sorted answer before RL.
* GRPO-style updates combined with PEFT adapters on Qwen 2.5 7B provide a practical recipe for reinforcement learning on consumer hardware.
