<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/PiScorer_as_GRPO_Reward_Function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

Are you constantly relying on LLM-as-a-judge to evaluate your model’s performance?

Have you ever wanted to assess your model at every training checkpoint but hesitated because LLM-as-a-judge is too slow and expensive?

**Now you can — with [Pi-Scorer](https://build.withpi.ai).**

[Pi-Scorer](https://build.withpi.ai) offers an alternative to LLM-as-a-judge with several advantages:

* Significantly faster

* Highly consistent — always returns the same score for the same inputs

* Eliminates the need for prompt tuning or adjustments

In this Colab, we integrate [Pi-Scorer](https://build.withpi.ai) as the reward function within the [Unsloth](https://unsloth.ai/) GRPO training loop, based on the [Unsloth Qwen2.5_(3B)-GRPO.ipynb colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(3B)-GRPO.ipynb) notebook.

### Installation

In [1]:
from google.colab import userdata
import os

# Get PI API key: https://build.withpi.ai/account/keys
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

In [2]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install or uv pip install
    !pip install unsloth vllm
else:
    pass # For Colab / Kaggle, we need extra instructions hidden below \/

In [3]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
!pip install --upgrade -qqq uv
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install!
    !pip install unsloth vllm
else:
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except: is_t4 = False
    get_vllm, get_triton = ("vllm==0.10.1", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    !uv pip install -qqq --upgrade \
        unsloth {get_vllm} {get_numpy} torchvision bitsandbytes xformers transformers
    !uv pip install -qqq {get_triton}

### Load Unsloth Model

Load `Qwen2.5 3B Instruct` with Unsloth and attach LoRA adapters:

In [4]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 08-23 19:37:54 [__init__.py:241] Automatically detected platform cuda.
ERROR 08-23 19:37:56 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Patching vLLM v1 graph capture
Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.8.9: Fast Qwen2 patching. Transformers: 4.55.4. vLLM: 0.10.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

INFO 08-23 19:38:32 [cuda.py:384] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-23 19:38:32 [cuda.py:433] Using XFormers backend.
INFO 08-23 19:38:32 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 08-23 19:38:32 [model_runner.py:1080] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...
INFO 08-23 19:38:33 [bitsandbytes_loader.py:742] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 08-23 19:38:34 [weight_utils.py:296] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

INFO 08-23 19:38:56 [weight_utils.py:312] Time spent downloading weights for unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit: 21.899478 seconds
INFO 08-23 19:38:56 [weight_utils.py:349] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]



INFO 08-23 19:38:59 [punica_selector.py:19] Using PunicaWrapperGPU.
INFO 08-23 19:39:00 [model_runner.py:1112] Model loading took 2.4392 GiB and 25.987565 seconds
INFO 08-23 19:39:13 [worker.py:295] Memory profiling takes 12.24 seconds
INFO 08-23 19:39:13 [worker.py:295] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.50) = 7.30GiB
INFO 08-23 19:39:13 [worker.py:295] model weights take 2.44GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 3.79GiB.
INFO 08-23 19:39:13 [executor_base.py:114] # cuda blocks: 6894, # CPU blocks: 0
INFO 08-23 19:39:13 [executor_base.py:119] Maximum concurrency for 1024 tokens per request: 107.72x
INFO 08-23 19:39:13 [vllm_utils.py:671] Unsloth: Running patched vLLM v0 `capture_model`.
INFO 08-23 19:39:13 [model_runner.py:1383] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run t

Capturing CUDA graph shapes:   0%|          | 0/27 [00:00<?, ?it/s]

INFO 08-23 19:39:43 [model_runner.py:1535] Graph capturing finished in 30 secs, took 0.56 GiB
INFO 08-23 19:39:43 [vllm_utils.py:678] Unsloth: Patched vLLM v0 graph capture finished in 30 secs.
INFO 08-23 19:39:45 [llm_engine.py:417] init engine (profile, create kv cache, warmup model) took 44.45 seconds
INFO 08-23 19:39:45 [llm.py:298] Supported_tasks: ['generate']
Unsloth: Just some info: will skip parsing ['q_norm', 'k_norm', 'pre_feedforward_layernorm', 'post_feedforward_layernorm']
Unsloth 2025.8.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
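
The training banner later in this notebook reports `Trainable parameters = 119,734,272`. That figure can be reproduced from the LoRA settings above. The sketch below is a back-of-the-envelope check, not part of the original notebook: the layer dimensions are assumptions taken from the published Qwen2.5-3B configuration, and `lora_param_count` is a hypothetical helper.

```python
# Assumed Qwen2.5-3B dimensions (from the model's published config, not from this notebook).
HIDDEN = 2048          # hidden_size
INTERMEDIATE = 11008   # intermediate_size (MLP width)
KV_DIM = 2 * 128       # num_key_value_heads * head_dim (grouped-query attention)
NUM_LAYERS = 36        # matches "patched 36 layers" in the log above
LORA_RANK = 64         # lora_rank from the cell above

# (in_features, out_features) for each projection listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(shapes, rank, num_layers):
    """LoRA adds two matrices per adapted weight: A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(TARGET_SHAPES, LORA_RANK, NUM_LAYERS)
print(f"{trainable:,}")  # 119,734,272
```

Under these assumptions, halving `lora_rank` would halve the number of trainable parameters, which is one way to reduce memory use.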


### Data Preparation and Pi Reward Functions

In [5]:
from datasets import load_dataset, Dataset
import requests

# Load and prep dataset
SYSTEM_PROMPT = """
Generate a short TLDR of a subreddit post without any surrounding text. Here are some requirements for the TLDR:
1. Make sure that the TLDR is short and concise.
2. Make sure that the TLDR states the important points of the post.
3. Make sure that the TLDR makes sense on its own.
"""

dataset = load_dataset("trl-lib/tldr", split="train")
dataset = dataset.remove_columns(["completion"])
dataset = dataset.rename_column("prompt", "post")
dataset = dataset.select(range(500))
dataset = dataset.map(
    lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["post"]},
        ]
    }
)
print(dataset[0])


# Pi constants
PI_API_URL = "https://api.withpi.ai/v1/scoring_system/score"
HEADERS = {
    "Content-Type": "application/json",
    "x-api-key": os.environ.get("WITHPI_API_KEY"),
}

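# Pi util functions
def get_pi_score(llm_input: str, llm_output: str, question: str) -> float:
    """Score one (input, output) pair against a single question with Pi-Scorer."""
    payload = {
        "llm_input": llm_input,
        "llm_output": llm_output,
        "scoring_spec": [{"question": question}]
    }
    # Can add retry if needed.
    response = requests.post(PI_API_URL, headers=HEADERS, json=payload, timeout=30)
    response.raise_for_status()  # Surface HTTP errors instead of a confusing KeyError below
    return response.json()["total_score"]

def score_tldrs(prompts, completions, question: str) -> list[float]:
    # The last message of each prompt is the user turn, i.e. the Reddit post.
    posts = [prompt[-1]["content"] for prompt in prompts]
    # Each completion is a one-message chat; take the assistant's text.
    tldrs = [completion[0]["content"] for completion in completions]
    return [get_pi_score(post, tldr, question) for post, tldr in zip(posts, tldrs)]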

# Reward functions
def pi_concise(prompts, completions, **kwargs) -> list[float]:
    return score_tldrs(prompts, completions, "Is the TLDR concise and to the point?")

def pi_coverage(prompts, completions, **kwargs) -> list[float]:
    return score_tldrs(prompts, completions, "Does the TLDR state the important points of the post?")

def pi_standalone(prompts, completions, **kwargs) -> list[float]:
    return score_tldrs(prompts, completions, "Does the TLDR make sense on its own without needing to refer to the original post?")

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/110M [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/6.11M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/6.21M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/116722 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/6447 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6553 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

{'post': "SUBREDDIT: r/relationships\n\nTITLE: I (f/22) have to figure out if I want to still know these girls or not and would hate to sound insulting\n\nPOST: Not sure if this belongs here but it's worth a try. \n\nBackstory:\nWhen I (f/22) went through my first real breakup 2 years ago because he needed space after a year of dating roand  it effected me more than I thought. It was a horrible time in my life due to living with my mother and finally having the chance to cut her out of my life. I can admit because of it was an emotional wreck and this guy was stable and didn't know how to deal with me. We ended by him avoiding for a month or so after going to a festival with my friends. When I think back I wish he just ended. So after he ended it added my depression I suffered but my friends helped me through it and I got rid of everything from him along with cutting contact. \n\nNow: Its been almost 3 years now and I've gotten better after counselling and mild anti depressants. My mot

<a name="Train"></a>
### Train the model

Now set up the GRPO trainer and its configuration:

In [6]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    importance_sampling_level = "sequence",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 1024,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 40,
    save_steps = 10,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8
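
With the batch size raised to `num_generations = 8`, each step scores eight completions per reward function, and `score_tldrs` sends those requests one after another. Three reward functions therefore mean 24 sequential HTTP calls per step. Because the calls are independent, they could be overlapped with a thread pool. The sketch below assumes the Pi API tolerates this much concurrency, which is not verified here; `score_many` and `fake_scorer` are hypothetical names, and the stand-in scorer exists only so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def score_many(score_fn, posts, tldrs, question, max_workers=8):
    """Score (post, tldr) pairs concurrently; results keep the input order."""
    def score_pair(pair):
        post, tldr = pair
        return score_fn(post, tldr, question)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs.
        return list(pool.map(score_pair, zip(posts, tldrs)))

# Stand-in scorer: shorter TLDRs get higher scores.
def fake_scorer(post, tldr, question):
    return 1.0 / (1 + len(tldr.split()))

demo_scores = score_many(
    fake_scorer, ["post 1", "post 2"], ["one two three", "one"], "Is the TLDR concise?"
)
print(demo_scores)  # [0.25, 0.5]
```

In the notebook, `score_fn` would be `get_pi_score`, and `score_tldrs` would return `score_many(get_pi_score, posts, tldrs, question)`.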


In [7]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        pi_concise,
        pi_coverage,
        pi_standalone,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 1 | Total steps = 40
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 119,734,272 of 3,205,672,960 (3.74% trained)


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,kl,entropy,rewards / pi_concise / mean,rewards / pi_concise / std,rewards / pi_coverage / mean,rewards / pi_coverage / std,rewards / pi_standalone / mean,rewards / pi_standalone / std
1,-0.0,2.488187,0.813543,50.125,42.0,60.0,0.0,50.125,42.0,60.0,0.0,0,0.888325,0.273519,0.833,0.239226,0.766863,0.310045
2,-0.0,2.766125,0.136929,62.375,51.0,79.0,0.0,62.375,51.0,79.0,0.0,No Log,0.973637,0.033344,0.916013,0.040863,0.876475,0.072486
3,-0.0,2.875012,0.080066,59.875,42.0,93.0,0.0,59.875,42.0,93.0,0.0,No Log,0.981462,0.028749,0.959975,0.016946,0.933575,0.053109
4,0.0,2.6858,0.196729,62.0,49.0,77.0,0.0,62.0,49.0,77.0,2.3e-05,No Log,0.98145,0.025555,0.876462,0.04243,0.827887,0.172301
5,0.0,2.722662,0.188195,48.75,38.0,58.0,0.0,48.75,38.0,58.0,3.1e-05,No Log,0.9922,0.005515,0.823225,0.120988,0.907237,0.06931
6,-0.0,2.794462,0.074326,65.0,52.0,78.0,0.0,65.0,52.0,78.0,6.3e-05,No Log,0.9883,0.007221,0.9424,0.033924,0.863762,0.040078
7,0.0,2.807637,0.09432,46.5,35.0,61.0,0.0,46.5,35.0,61.0,0.000419,No Log,0.98585,0.012868,0.95265,0.021857,0.869138,0.080979
8,0.0,2.704125,0.139267,39.875,35.0,47.0,0.0,39.875,35.0,47.0,0.001148,No Log,0.983912,0.009191,0.907238,0.055797,0.812975,0.092068
9,0.0,2.732438,0.20064,60.5,56.0,66.0,0.0,60.5,56.0,66.0,0.000895,No Log,0.959488,0.049651,0.903813,0.060749,0.869138,0.10281
10,0.0,2.654425,0.479984,73.0,60.0,100.0,0.0,73.0,60.0,100.0,0.001256,No Log,0.902125,0.208562,0.930163,0.039285,0.822137,0.237463


Unsloth: Will smartly offload gradients to save VRAM!


TrainOutput(global_step=40, training_loss=1.1735589837602589e-05, metrics={'train_runtime': 1027.6253, 'train_samples_per_second': 0.311, 'train_steps_per_second': 0.039, 'total_flos': 0.0, 'train_loss': 1.1735589837602589e-05})
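
The `reward` column in the log above is the sum of the three per-function means, for example 0.888325 + 0.833 + 0.766863 ≈ 2.488 at step 1. GRPO then compares each completion with the other completions generated for the same prompt. The sketch below illustrates that group-relative normalization in plain Python. The per-completion scores are made up, `group_advantages` is a hypothetical helper, and TRL's exact computation (epsilon, optional scaling, weighting) may differ by version.

```python
import statistics

def group_advantages(rewards_per_func, eps=1e-4):
    """rewards_per_func: one list of per-completion scores per reward function."""
    # Sum the reward functions for each completion (equal weights).
    totals = [sum(scores) for scores in zip(*rewards_per_func)]
    mean = statistics.fmean(totals)
    std = statistics.stdev(totals) if len(totals) > 1 else 0.0
    # Advantage: how far each completion is above or below its group's mean.
    advantages = [(total - mean) / (std + eps) for total in totals]
    return totals, advantages

# Sanity check against step 1 of the training log.
step1_reward = 0.888325 + 0.833 + 0.766863

# Made-up scores for a group of four completions of the same prompt.
concise = [0.9, 0.8, 1.0, 0.7]
coverage = [0.8, 0.9, 0.9, 0.6]
standalone = [0.7, 0.8, 0.9, 0.6]
totals, advantages = group_advantages([concise, coverage, standalone])

for total, advantage in zip(totals, advantages):
    print(f"total={total:.2f} advantage={advantage:+.2f}")
```

Completions that score above the group mean get a positive advantage and are reinforced, while those below it are discouraged. This is why a group whose completions all score alike (a small `reward_std`) provides little training signal.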

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the base model without the GRPO-trained LoRA:

In [8]:
post = """SUBREDDIT: r/personalfinance

TITLE: Prioritize student debt or saving for down payment?

POST: I have $25k in student debt. One private loan at 9.5% (highest priority obviously) and nine others federal between 3.4% and 6.8%. Minimum payment per month total is $301.16. Over the next 9 months, I will pay off $11k of these, which will get rid of everything above 5% interest and will drop the total minimum payment to $150.

At the end of the 9 months, our savings will be around $35k. At that time my husband will need to purchase a car so some of that will be his down payment. So more realistically $25-30k.

Sometime in the future, between a year to two years from now, my husband and I may be moving. Typical single family homes in this area go for around $300k.

At the end of the 9 months, should I continue to focus on paying down student debt (which will be a balance of $14k by then) or growing our savings/down payment? I have $5200/mo to somehow split between debt and down payment and I'm not sure how best to allocate it.

TL;DR:
"""

text = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": post},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.4,
    top_p=0.95,
    max_tokens=1024,
)
output = (
    model.fast_generate(
        [text],
        sampling_params=sampling_params,
        lora_request=None,
    )[0]
    .outputs[0]
    .text
)

output

'The poster has $25k in student debt with a total monthly minimum payment of $301.16. They plan to pay off $11k over 9 months, dropping the total minimum to $150. By then, they expect to have around $25-30k in savings. They need to decide between continuing to pay down student debt or saving for a down payment and potential home purchase. They have $5200/month to allocate between these goals.'

Now we save the trained LoRA, load it, and run the same prompt again:

In [9]:
model.save_lora("grpo_saved_lora")

output = (
    model.fast_generate(
        [text],
        sampling_params=sampling_params,
        lora_request=model.load_lora("grpo_saved_lora"),
    )[0]
    .outputs[0]
    .text
)

output

'The poster has $25k in student debt with a total monthly minimum payment of $301.16. They plan to pay off $11k over 9 months, reducing the interest rate to 5% or below, and lowering the minimum payment to $150. By then, they expect to have around $35k in savings. They need to decide between continuing to pay down student debt or saving for a down payment on a house worth around $300k. They have $520mla/month to allocate between these goals.'