<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/PiScorer_as_GRPO_Reward_Function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Are you constantly relying on LLM-as-a-judge to evaluate your model’s performance?

Have you ever wanted to assess your model at every training checkpoint but hesitated because LLM-as-a-judge is too slow and expensive?

**Now you can — with [Pi-Scorer](https://build.withpi.ai).**

[Pi-Scorer](https://build.withpi.ai) offers an alternative to LLM-as-a-judge with several advantages:

* Significantly faster

* Highly consistent — always returns the same score for the same inputs

* Eliminates the need for prompt tuning or adjustments

In this colab, we use Pi-Scorer as the reward function in the [Unsloth](https://unsloth.ai/) GRPO training loop.

### Installation

In [6]:
from google.colab import userdata
import os

# Get PI API key: https://build.withpi.ai/account/keys
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Use --no-deps ONLY in Colab! Elsewhere, use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [2]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the steps below ONLY in Colab! Elsewhere, use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

### Unsloth

Load `Qwen2.5-3B-Instruct` and set the LoRA parameters.

In [None]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)
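
As a rough sanity check on the LoRA setup above, the number of trainable parameters can be estimated by hand: each adapted projection adds `r * (in_features + out_features)` weights. The sketch below assumes the layer dimensions of Qwen2.5-3B (hidden size 2048, MLP size 11008, 36 layers, 2 key/value heads of dimension 128); those numbers come from the public model config, not from this notebook, so treat them as assumptions. `lora_param_count` is a hypothetical helper, not part of Unsloth.

```python
# Assumed Qwen2.5-3B dimensions (from the public model config, not this notebook).
HIDDEN = 2048          # model hidden size
INTERMEDIATE = 11008   # MLP intermediate size
NUM_LAYERS = 36        # number of decoder layers
KV_DIM = 2 * 128       # 2 key/value heads x 128 head dimension

# (in_features, out_features) of every targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Count LoRA weights: A is (rank x in) and B is (out x rank) per projection."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


print(lora_param_count(64))  # 119734272
```

Under these assumptions the estimate agrees with the `Trainable parameters = 119,734,272` figure that Unsloth prints in the training log further down. Halving `lora_rank` halves this count, which is one lever to pull if memory is tight.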

### Data Preparation and Reward Functions

In [None]:
import os  # Needed to read WITHPI_API_KEY below
from datasets import load_dataset, Dataset
import requests

# Load and prep dataset
SYSTEM_PROMPT = """
Generate a short TLDR of a subreddit post without any surrounding text. Here are the requirements for the TLDR:
1. Make sure that the TLDR is short and concise.
2. Make sure that the TLDR states the important points of the post.
3. Make sure that the TLDR makes sense on its own.
"""

dataset = load_dataset("trl-lib/tldr", split="train")
dataset = dataset.remove_columns(["completion"])
dataset = dataset.rename_column("prompt", "post")
dataset = dataset.select(range(500))
dataset = dataset.map(
    lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["post"]},
        ]
    }
)
print(dataset[0])


# Pi constants
PI_API_URL = "https://api.withpi.ai/v1/scoring_system/score"
HEADERS = {
    "Content-Type": "application/json",
    "x-api-key": os.environ.get("WITHPI_API_KEY"),
}

# Pi util functions
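def get_pi_score(llm_input: str, llm_output: str, question: str) -> float:
    """Score one (input, output) pair against a single Pi-Scorer question."""
    payload = {
        "llm_input": llm_input,
        "llm_output": llm_output,
        "scoring_spec": [{"question": question}],
    }
    # Can add retry if needed.
    # The 30-second timeout is an arbitrary choice so a stalled request cannot hang training.
    response = requests.post(PI_API_URL, headers=HEADERS, json=payload, timeout=30)
    response.raise_for_status()  # Surface HTTP errors instead of a KeyError below
    return response.json()["total_score"]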

def score_tldrs(prompts, completions, question: str) -> list[float]:
    posts = [prompt[-1]["content"] for prompt in prompts]
    tldrs = [completion[0]["content"] for completion in completions]
    return [get_pi_score(post, tldr, question) for post, tldr in zip(posts, tldrs)]

# Reward functions
def pi_concise(prompts, completions, **kwargs) -> list[float]:
    return score_tldrs(prompts, completions, "Is the TLDR concise and to the point?")

def pi_coverage(prompts, completions, **kwargs) -> list[float]:
    return score_tldrs(prompts, completions, "Does the TLDR state the important points of the post?")

def pi_standalone(prompts, completions, **kwargs) -> list[float]:
    return score_tldrs(prompts, completions, "Does the TLDR make sense on its own without needing to refer to the original post?")
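
The comment in `get_pi_score` mentions that a retry can be added. Since every training step makes one HTTP call per completion per reward function, a single transient network error would otherwise abort the run. One possible shape for such a retry is sketched below; `with_retries` is a hypothetical helper, and the attempt count and backoff delays are illustrative assumptions rather than values recommended by Pi or Unsloth.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], max_attempts: int = 3, base_delay: float = 0.5) -> T:
    """Run `call`, retrying with exponential backoff whenever it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller see the last error
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")


# Demo with a stand-in that fails twice before succeeding (no network needed).
attempts = {"count": 0}


def flaky_score() -> float:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return 0.75


score = with_retries(flaky_score, max_attempts=3, base_delay=0.0)
print(score, attempts["count"])
```

In the notebook this could be used as `with_retries(lambda: get_pi_score(post, tldr, question))` inside `score_tldrs`. Catching every `Exception` keeps the sketch short; narrowing it to `requests.RequestException` would avoid retrying on programming errors.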

<a name="Train"></a>
### Train the model

Now set up the GRPO trainer and its configuration.

In [40]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 1024,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 50,
    save_steps = 50,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8
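
To see why `num_generations` matters, it helps to recall the core idea of GRPO: the completions sampled for one prompt form a group, and each completion is judged relative to that group rather than against a learned value function. The sketch below illustrates the group normalization in plain Python. It is not TRL's implementation: the choice of standard-deviation estimator and the `eps` value are assumptions, and the example rewards are made up for illustration.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # Assumption: population std; libraries may differ
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical total rewards for the 8 completions sampled for one prompt.
example_rewards = [2.9, 2.7, 2.8, 2.1, 2.95, 2.6, 2.85, 2.4]
advantages = group_relative_advantages(example_rewards)

for reward, advantage in zip(example_rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.2f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean are discouraged. If all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal, which is one reason a reward that separates good from bad completions is useful here.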


In [41]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        pi_concise,
        pi_coverage,
        pi_standalone,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 1 | Total steps = 50
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 119,734,272/3,000,000,000 (3.99% trained)


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / pi_concise | rewards / pi_coverage | rewards / pi_standalone |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.0002 | 0.096135 | 0.052832 | 87.375 | 0.004562 | 0.010714 | 0.002734 | 0.082687 |
| 2 | 0.0 | 2.764648 | 0.15091 | 64.5 | 0.000927 | 0.930176 | 0.910156 | 0.924316 |
| 3 | 0.0001 | 2.88916 | 0.12056 | 54.25 | 0.002759 | 0.979004 | 0.936523 | 0.973633 |
| 4 | 0.0001 | 2.523438 | 0.276907 | 69.5 | 0.001406 | 0.863281 | 0.848145 | 0.812012 |
| 5 | 0.0001 | 2.580078 | 0.395169 | 56.125 | 0.002676 | 0.859375 | 0.859375 | 0.861328 |
| 6 | 0.0002 | 2.964355 | 0.034966 | 67.375 | 0.004323 | 0.992676 | 0.979492 | 0.992188 |
| 7 | 0.0003 | 2.706543 | 0.264024 | 51.75 | 0.00768 | 0.92041 | 0.927246 | 0.858887 |
| 8 | 0.0003 | 2.858398 | 0.148991 | 53.875 | 0.006939 | 0.967285 | 0.938965 | 0.952148 |
| 9 | 0.0002 | 2.786621 | 0.165504 | 66.25 | 0.005457 | 0.924316 | 0.921387 | 0.940918 |
| 10 | 0.0004 | 2.723145 | 0.226307 | 84.375 | 0.010696 | 0.918457 | 0.9375 | 0.867188 |


TrainOutput(global_step=50, training_loss=0.0010171514411922544, metrics={'train_runtime': 896.6434, 'train_samples_per_second': 0.446, 'train_steps_per_second': 0.056, 'total_flos': 0.0, 'train_loss': 0.0010171514411922544})
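
When reading the training log, note that the `reward` column appears to be the plain sum of the three per-function columns, so each Pi question contributes equally in this run. The short check below uses three rows copied from the log above; the `1e-5` tolerance is an assumption that allows for rounding in the displayed values.

```python
# (reward, pi_concise, pi_coverage, pi_standalone) copied from the training log.
LOGGED_STEPS = {
    1: (0.096135, 0.010714, 0.002734, 0.082687),
    2: (2.764648, 0.930176, 0.910156, 0.924316),
    6: (2.964355, 0.992676, 0.979492, 0.992188),
}


def reward_gap(total: float, *parts: float) -> float:
    """Absolute difference between the logged total and the sum of its parts."""
    return abs(total - sum(parts))


gaps = {step: reward_gap(*row) for step, row in LOGGED_STEPS.items()}
for step, gap in gaps.items():
    print(f"step {step}: |reward - sum of parts| = {gap:.1e}")
```

Because the logged per-question scores all fall between 0 and 1, a total near 3 means the sampled TLDRs score well on all three questions at once.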

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the base model without any GRPO training by passing `lora_request=None`:

In [None]:
post = """SUBREDDIT: r/relationships

TITLE: Me [19 F] with my friend [19 M], not sure if I may have messed things up already.

POST: Hello hello everybody. I hope this isn't too trivial of a question to ask on here, but I've been feeling a bit out of my depth when it comes to this situation (I've had only one relationship before, and for many reasons, it was out of the ordinary).

Okay! So, a couple of weeks ago, I started talking to this guy on Facebook, through a student group that we were both part of. I thought he was sort of cute, so I sent him a PM just to talk, etc, etc. We're both transfer students at the same school, so I knew that we could eventually meet in person once we both moved on-campus. So, we did, and we hung out maybe twice, just as friends.

Okay. So, everything is going pretty well. We talk over Facebook and Snapchat, whatever. So, Saturday night, I was just hanging out with people and kind of being bored, when I got a Snapchat from him asking what I was doing. I asked if he wanted to hang out, so we did.

We ended up smoking pot (the first time for me, ever), and sort of just wandering around. Eventually we ended up back at his dorm room, where high me decided to just go for it, and I came on to him pretty strongly. It worked out for me (luckily, otherwise things would have been really super awkward), and we ended up messing around but not having sex.

Yesterday, however, I ended up going to hang out with him again, and this time we did sleep together. Afterward, we kind of discussed what we were going to do, and he just said that he wanted to "play it by ear" and not slap any labels on anything. I'm wondering if this means that he wants a fwb-type situation, or if he might actually be interested in me. The way I've been acting is extremely out of character for me, and I am not interested in having a fuck buddy. I like him, and I would be very interested in maybe seeing where things go, but I'm worried that I may have ruined my chances of a relationship by sleeping with him already.

TL;DR:
"""

text = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": post},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.4,
    top_p=0.95,
    max_tokens=1024,
)
output = (
    model.fast_generate(
        [text],
        sampling_params=sampling_params,
        lora_request=None,
    )[0]
    .outputs[0]
    .text
)

output