<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/GRPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# Reinforcement Learning GRPO

This notebook is the companion to the RL playground.

Description: Train models to more deeply learn patterns from your data.

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime.  You won't need GPUs for this notebook.

You'll need a WITHPI_API_KEY from https://play.withpi.ai. Add it to your notebook secrets (the key symbol on the left). The generation cells below also read a FIREWORKS_API_KEY secret, so add that one as well.

Run the cell below to install packages and load the SDK.

In [None]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

client = PiClient()

# Load a scoring spec and dataset

We have a pre-existing scoring spec and a dataset you can play with.


In [None]:
# @title Load Scoring Spec
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

tldr_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/scoring_specs/tldr.json"
)

display_scoring_spec(tldr_scoring_spec)

In [None]:
# @title Load dataset
from datasets import load_dataset

tldr_dataset = load_dataset("withpi/tldr", split="train").select(range(100))

print(tldr_dataset)

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 100
})


# Let's inspect the model quality before training

Here we score the responses generated by the base model to get a sense of its quality on this task.

In [None]:
# @title Define a TLDR generator
import litellm
import asyncio


async def generate_tldrs(
    reddit_posts, system_prompt, model_id, api_base, api_key, concurrency_limit=5
):
    """Generate TLDR for all REDDIT posts with TaskGroup and rate limiting"""
    # Create a semaphore to limit concurrent API calls
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def generate_single_tldr(reddit_post, index):
        """Process a single REDDIT post generation with rate limiting"""
        async with semaphore:
            try:
                response = await litellm.acompletion(
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": reddit_post},
                    ],
                    model=model_id,
                    api_base=api_base,
                    api_key=api_key,
                    temperature=0.2,
                )
                generated_tldr = response.choices[0].message.content
                print(
                    "Generated a tldr for post #{}: {}".format(index, reddit_post[:40])
                )
                return generated_tldr
            except Exception as e:
                print(f"Error generating tldr for post #{index}: {e}")
                return f"Error: {str(e)}"

    generated_tldrs = []

    # Using TaskGroup for cleaner task management
    async with asyncio.TaskGroup() as tg:
        tasks = [
            tg.create_task(generate_single_tldr(reddit_post, i + 1))
            for i, reddit_post in enumerate(reddit_posts)
        ]

    # Collect results in the same order as the input posts
    for task in tasks:
        generated_tldrs.append(task.result())

    print("Done generating TLDRs!!")
    return generated_tldrs
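
The generator above combines two ideas: a semaphore that caps the number of in-flight API calls, and result collection in input order. Note that `asyncio.TaskGroup` requires Python 3.11 or newer. The sketch below isolates the same pattern with `asyncio.gather`, which also works on older versions. `run_limited`, `demo`, and `fake_call` are hypothetical names used only for illustration; `fake_call` stands in for the real `litellm.acompletion` request.

```python
import asyncio


async def run_limited(items, worker, concurrency_limit=5):
    """Run worker(item) for every item, at most concurrency_limit at a time."""
    semaphore = asyncio.Semaphore(concurrency_limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in input order, regardless of completion order.
    return await asyncio.gather(*(guarded(item) for item in items))


async def demo(n_items=12, concurrency_limit=3):
    """Hypothetical stand-in for the API call that records peak concurrency."""
    active = 0
    peak = 0

    async def fake_call(i):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # simulate network latency
        active -= 1
        return i * 2

    results = await run_limited(range(n_items), fake_call, concurrency_limit)
    return results, peak
```

In a notebook cell, `results, peak = await demo()` runs the sketch; `peak` should never exceed `concurrency_limit`.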

In [None]:
# @title Generate TLDRs

loop = asyncio.get_running_loop()
generated_tldrs = await loop.create_task(
    generate_tldrs(
        tldr_dataset["prompt"],
        model_id="fireworks_ai/llama-v3p2-3b-instruct",
        api_key=userdata.get("FIREWORKS_API_KEY"),
        api_base=None,
        system_prompt=tldr_scoring_spec.description,
    )
)

Generated a tldr for post #3: SUBREDDIT: r/relationships

TITLE: Me [1
Generated a tldr for post #2: SUBREDDIT: r/loseit

TITLE: SV & NSV! Ke
Generated a tldr for post #5: SUBREDDIT: r/relationships

TITLE: My[25
Generated a tldr for post #1: SUBREDDIT: r/relationships

TITLE: I (f/
Generated a tldr for post #4: SUBREDDIT: r/personalfinance

TITLE: Pri
Generated a tldr for post #7: SUBREDDIT: r/relationships

TITLE: Is it
Generated a tldr for post #6: SUBREDDIT: r/relationships

TITLE: Me 28
Generated a tldr for post #9: SUBREDDIT: r/relationships

TITLE: Advic
Generated a tldr for post #8: SUBREDDIT: r/relationships

TITLE: I (27
Generated a tldr for post #10: SUBREDDIT: r/relationships

TITLE: Me [2
Generated a tldr for post #13: SUBREDDIT: r/relationships

TITLE: Me [ 
Generated a tldr for post #12: SUBREDDIT: r/relationship_advice

TITLE:
Generated a tldr for post #11: SUBREDDIT: r/offmychest

TITLE: I'm just
Generated a tldr for post #14: SUBREDDIT: r/relationships

TITLE: Me [2
G
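
Because `generate_single_tldr` returns a string beginning with `Error: ` when a request fails, failed generations would otherwise be scored as if they were real summaries. A minimal sketch for separating them before scoring follows. `split_failures` is a hypothetical helper, the sample strings are made up for illustration, and the prefix check would misclassify a genuine summary that happens to start with `Error: `.

```python
ERROR_PREFIX = "Error: "  # matches the string returned by the except branch above


def split_failures(outputs):
    """Return (ok_indices, failed_indices) for a list of generated strings."""
    ok, failed = [], []
    for i, text in enumerate(outputs):
        (failed if text.startswith(ERROR_PREFIX) else ok).append(i)
    return ok, failed


# Made-up sample outputs, for illustration only.
sample = ["TL;DR: short summary", "Error: rate limit exceeded", "TL;DR: another"]
ok, failed = split_failures(sample)
print(ok, failed)
```

Applied to this notebook, `split_failures(generated_tldrs)` would give the indices that are safe to pass to the scoring cell.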

In [None]:
# @title Let's Score
from tqdm import tqdm
import pandas as pd
import asyncio


scores = []
for i in tqdm(range(len(tldr_dataset))):
    scores.append(
        client.scoring_system.score(
            scoring_spec=tldr_scoring_spec,
            llm_input=tldr_dataset["prompt"][i],
            llm_output=generated_tldrs[i],
        )
    )

df = pd.DataFrame(
    {
        "prompt": tldr_dataset["prompt"],
        "generated_tldr": generated_tldrs,
        "score": [score.total_score for score in scores],
    }
)

display(df)

print(df["score"].describe())

100%|██████████| 100/100 [00:38<00:00,  2.62it/s]


Unnamed: 0,prompt,generated_tldr,score
0,SUBREDDIT: r/relationships\n\nTITLE: I (f/22) ...,TL;DR: I'm considering cutting contact with tw...,0.543889
1,SUBREDDIT: r/loseit\n\nTITLE: SV & NSV! Keepin...,"TL;DR: After a 4-week plateau, the author lost...",0.980721
2,SUBREDDIT: r/relationships\n\nTITLE: Me [19F] ...,TL;DR: I'm considering sharing my body image i...,0.307570
3,SUBREDDIT: r/personalfinance\n\nTITLE: Priorit...,Here is a possible TLDR:\n\nI have $25k in stu...,0.505557
4,SUBREDDIT: r/relationships\n\nTITLE: My[25m] g...,Here is a possible TLDR:\n\nWoman only shows a...,0.193111
...,...,...,...
95,SUBREDDIT: r/relationships\n\nTITLE: My [30 F]...,TL;DR: Considering moving in together in a hom...,0.484841
96,SUBREDDIT: r/relationships\n\nTITLE: Me[19M] p...,TL;DR: After being hospitalized with viral men...,0.979461
97,SUBREDDIT: r/relationships\n\nTITLE: Am I bein...,TL;DR: Boyfriend prioritizes bringing distant ...,0.842945
98,SUBREDDIT: r/relationships\n\nTITLE: My boyfri...,TL;DR: Boyfriend of 5 months is extremely jeal...,0.619342


count    100.000000
mean       0.644201
std        0.175192
min        0.155466
25%        0.559885
50%        0.617892
75%        0.677419
max        0.991457
Name: score, dtype: float64


## Let's do GRPO

The GRPO job internally performs a 90/10 train-test split, which is why the loader is not splitting the input data.
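
For a sense of what a 90/10 split means for the 100 examples loaded earlier, here is a minimal sketch. How the service actually splits the data (shuffling, seed, rounding) is not stated in this notebook, so `train_eval_split` and its parameters are assumptions for illustration only.

```python
import random


def train_eval_split(examples, eval_fraction=0.1, seed=0):
    """Shuffle a copy of examples and hold out eval_fraction of them."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


train, held_out = train_eval_split(range(100))
print(len(train), len(held_out))  # 90 10
```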

This process takes a while. Please be patient while a cloud GPU is acquired, fine-tuning is performed, and a result is returned.
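
As background, GRPO (Group Relative Policy Optimization) samples several completions per prompt, scores each one, and uses each completion's reward relative to the rest of its group as the learning signal. The sketch below shows only that normalization step, under the assumption that the reward is a scalar in [0, 1] such as the Pi score. It is not the service's implementation: `group_relative_advantages` is a hypothetical helper, and real trainers differ in details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled completions within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four completions sampled from the same prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.6, 0.9])
print([round(a, 2) for a in advantages])
```

Completions that score above their group's mean get a positive advantage and are reinforced; those below the mean get a negative one.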

In [None]:
# @title Start the GRPO training process
status = client.training.grpo.start_job(
    scoring_spec=tldr_scoring_spec,
    examples=[{"llm_input": row["prompt"]} for row in tldr_dataset],
    base_rl_model="LLAMA_3.2_3B",
    system_prompt=tldr_scoring_spec.description,
    lora_config={"lora_rank": "R_64"},
    learning_rate=5e-6,
    num_train_epochs=10,
)
print(status)

GrpoStartJobResponse(detailed_status=['LAUNCHING'], job_id='rl_grpo_jobs:09ac227c912792130a876568ea872593308c0d4b3d7c896ec7991f041cbeedd8:07bc8b1b-a964-44e1-95f9-f96628df117d', state='QUEUED', trained_models=[])


In [None]:
# @title Let's monitor the progress
from withpi_utils.colab import stream_training_response

response = stream_training_response(
    status.job_id,
    client.training.grpo,
    additional_columns={
        "Train_Pi_Reward": "rewards/pi_reward_func",
        "Train_Std_Reward": "reward_std",
        "Eval_Pi_Reward": "eval_rewards/pi_reward_func",
        "Eval_Std_Reward": "eval_reward_std",
        "Train_KL": "kl",
        "Eval_KL": "eval_kl",
        "Train_Completion_Length": "completion_length",
        "Eval_Completion_Length": "eval_completion_length",
    },
)

print("Result training state: {}".format(response.state))
if response.state == "ERROR":
    print("The job failed due to:\n{}".format("\n".join(response.detailed_status[-5:])))
elif response.state == "DONE":
    print(
        "GRPO model = {}".format(response.trained_models[0].model_dump_json(indent=2))
    )

Training Status for rl_grpo_jobs:09ac227c912792130a876568ea872593308c0d4b3d7c896ec7991f041cbeedd8:07bc8b1b-a964-44e1-95f9-f96628df117d


Unnamed: 0,Step,Epoch,Learning_Rate,Training_Loss,Eval_Loss,Train_Pi_Reward,Train_Std_Reward,Eval_Pi_Reward,Eval_Std_Reward,Train_KL,Eval_KL,Train_Completion_Length,Eval_Completion_Length
0,0,0.0,X,X,0.05051,X,X,0.595011,0.182379,0.0,X,X,59.783335
1,22,0.488889,0.000005,0.0337,X,0.532612,0.156018,X,X,0.001157,X,61.090911,X
2,44,0.977778,0.000005,0.0427,0.040747,X,X,0.592577,0.192582,X,0.025859,X,57.983335
3,66,1.466667,0.000005,0.0168,X,0.553309,0.146358,X,X,0.029389,X,60.867426,X
4,88,1.955556,0.000005,0.0391,0.013838,X,X,0.607953,0.1384,X,0.045408,X,61.783335
5,110,2.444444,0.000005,0.029,X,0.571314,0.14691,X,X,0.050367,X,58.219698,X
6,132,2.933333,0.000004,0.0563,0.053078,X,X,0.575955,0.146897,X,0.149936,X,58.550002
7,154,3.422222,0.000004,0.0579,X,0.56949,0.168502,X,X,0.124799,X,58.107957,X
8,176,3.911111,0.000004,0.0337,0.045376,X,X,0.648078,0.161442,X,0.1474,X,53.941668
9,198,4.4,0.000003,0.0543,X,0.582327,0.155222,X,X,0.117989,X,58.159092,X


Result training state: DONE
GRPO model = {
  "contract_score": 0.6480780363082885,
  "epoch": 3.911111111111111,
  "eval_loss": 0.045375920832157135,
  "serving_id": 0,
  "serving_state": "LOADING",
  "step": 176
}


# Evaluate the GRPO Trained Model

In [None]:
# @title Prepare the evaluation REDDIT posts
from datasets import load_dataset
from google.colab import userdata
import asyncio
import time

ds = load_dataset("withpi/tldr", split="train").select(range(1000, 1050))

reddit_posts = ds["prompt"]

loop = asyncio.get_running_loop()
generated_tldrs = await loop.create_task(
    generate_tldrs(
        reddit_posts,
        model_id="fireworks_ai/llama-v3p2-3b-instruct",
        api_key=userdata.get("FIREWORKS_API_KEY"),
        api_base=None,
        system_prompt=tldr_scoring_spec.description,
    )
)

GRPO_JOB_ID = "rl_grpo_jobs:09ac227c912792130a876568ea872593308c0d4b3d7c896ec7991f041cbeedd8:07bc8b1b-a964-44e1-95f9-f96628df117d"

# Generate the TLDRs using the GRPO-trained Llama 3B model
client.training.grpo.load(GRPO_JOB_ID)

# Wait for the model to be loaded
while not (
    client.training.grpo.retrieve(GRPO_JOB_ID).trained_models[0].serving_state
    == "SERVING"
):
    time.sleep(3)

new_generated_tldrs = await loop.create_task(
    generate_tldrs(
        reddit_posts,
        model_id="fireworks_ai/0",
        api_base=f"https://api.withpi.ai/v1/training/grpo/{GRPO_JOB_ID}",
        api_key=os.environ["WITHPI_API_KEY"],
        system_prompt=tldr_scoring_spec.description,
    )
)

Generated a tldr for post #4: SUBREDDIT: r/AskReddit

TITLE: Redditors
Generated a tldr for post #1: SUBREDDIT: r/BreakUps

TITLE: Ex Girlfri
Generated a tldr for post #2: SUBREDDIT: r/relationships

TITLE: Me [2
Generated a tldr for post #3: SUBREDDIT: r/offmychest

TITLE: Very rec
Generated a tldr for post #5: SUBREDDIT: r/relationship_advice

TITLE:
Generated a tldr for post #6: SUBREDDIT: r/tifu

TITLE: TIFU by watchi
Generated a tldr for post #7: SUBREDDIT: r/relationships

TITLE: I [19
Generated a tldr for post #9: SUBREDDIT: r/dating_advice

TITLE: Male,
Generated a tldr for post #8: SUBREDDIT: r/relationships

TITLE: Me [2
Generated a tldr for post #11: SUBREDDIT: r/tifu

TITLE: TIFU by not op
Generated a tldr for post #10: SUBREDDIT: r/relationships

TITLE: My (1
Generated a tldr for post #12: SUBREDDIT: r/legaladvice

TITLE: Father 
Generated a tldr for post #14: SUBREDDIT: r/AskReddit

TITLE: Wouldn't 
Generated a tldr for post #13: SUBREDDIT: r/running

TITLE: Night runni
G
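
The wait loop in the cell above polls until the model reports `SERVING` and never gives up, so a model that fails to load would block the notebook indefinitely. A minimal sketch of the same polling with a timeout follows. `wait_until` is a hypothetical helper, and `fake_is_serving` is a stand-in for the real serving-state check.

```python
import time


def wait_until(predicate, timeout_s=600.0, interval_s=3.0):
    """Poll predicate() until it returns True; raise TimeoutError after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        time.sleep(interval_s)


# Hypothetical stand-in for the serving-state check: becomes True on the third poll.
calls = {"n": 0}


def fake_is_serving():
    calls["n"] += 1
    return calls["n"] >= 3


wait_until(fake_is_serving, timeout_s=5.0, interval_s=0.01)
print(calls["n"])  # 3
```

In this notebook the predicate would be `lambda: client.training.grpo.retrieve(GRPO_JOB_ID).trained_models[0].serving_state == "SERVING"`, the same check the cell above uses.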

In [None]:
# @title Compare the GRPO fine-tuned TLDRs against previous ones using the Pi scoring system
from tqdm import tqdm
import pandas as pd

scores = []
generated_scores = []
new_generated_scores = []
for reddit_post, tldr, new_tldr in tqdm(
    zip(reddit_posts, generated_tldrs, new_generated_tldrs)
):
    generated_score = client.scoring_system.score(
        llm_input=reddit_post, llm_output=tldr, scoring_spec=tldr_scoring_spec
    )
    new_generated_score = client.scoring_system.score(
        llm_input=reddit_post, llm_output=new_tldr, scoring_spec=tldr_scoring_spec
    )
    generated_scores.append(generated_score)
    new_generated_scores.append(new_generated_score)
    score = {
        "reddit post": reddit_post,
        "base model": generated_score.total_score,
        "grpo model": new_generated_score.total_score,
    }
    scores.append(score)

df = pd.DataFrame(scores)

print(df[["base model", "grpo model"]].describe())
display(df)

50it [00:36,  1.38it/s]

       base model  grpo model
count   50.000000   50.000000
mean     0.624360    0.784496
std      0.184762    0.194033
min      0.106283    0.235975
25%      0.556426    0.651611
50%      0.634898    0.854347
75%      0.673781    0.954832
max      0.968813    0.995338





Unnamed: 0,reddit post,base model,grpo model
0,SUBREDDIT: r/BreakUps\n\nTITLE: Ex Girlfriend ...,0.618813,0.676285
1,SUBREDDIT: r/relationships\n\nTITLE: Me [24F] ...,0.659161,0.978579
2,SUBREDDIT: r/offmychest\n\nTITLE: Very recentl...,0.534463,0.507693
3,"SUBREDDIT: r/AskReddit\n\nTITLE: Redditors, I'...",0.659444,0.854738
4,SUBREDDIT: r/relationship_advice\n\nTITLE: Sho...,0.554133,0.677079
5,SUBREDDIT: r/tifu\n\nTITLE: TIFU by watching a...,0.947429,0.974105
6,SUBREDDIT: r/relationships\n\nTITLE: I [19M] d...,0.374042,0.629952
7,SUBREDDIT: r/relationships\n\nTITLE: Me [21 M]...,0.636265,0.635534
8,"SUBREDDIT: r/dating_advice\n\nTITLE: Male, 25 ...",0.648614,0.988193
9,SUBREDDIT: r/relationships\n\nTITLE: My (18/M)...,0.399779,0.741205
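
Summary statistics hide how often the fine-tuned model actually wins on an individual post. As a small worked example, the sketch below computes the per-post difference and the win count for the ten rows displayed above. The scores are copied from that table; the other forty evaluation rows are not shown, so these numbers describe only the visible sample.

```python
# (base model, grpo model) scores copied from the ten rows displayed above.
rows = [
    (0.618813, 0.676285),
    (0.659161, 0.978579),
    (0.534463, 0.507693),
    (0.659444, 0.854738),
    (0.554133, 0.677079),
    (0.947429, 0.974105),
    (0.374042, 0.629952),
    (0.636265, 0.635534),
    (0.648614, 0.988193),
    (0.399779, 0.741205),
]

deltas = [grpo - base for base, grpo in rows]
wins = sum(d > 0 for d in deltas)
mean_delta = sum(deltas) / len(deltas)

print(f"GRPO scored higher on {wins} of {len(rows)} posts")
print(f"mean improvement: {mean_delta:+.3f}")
```

On the full DataFrame from the cell above, `(df["grpo model"] > df["base model"]).sum()` gives the same win count across all 50 posts.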


In [None]:
# @title Manually inspect the newly generated TLDRs against the previous ones, with scores
from withpi_utils.colab import pretty_print_responses


def pretty_print_tldr(i):
    """Show the base and GRPO TLDRs for post i side by side, with their scores."""
    pretty_print_responses(
        response1=generated_tldrs[i],
        response2=new_generated_tldrs[i],
        header="##### " + reddit_posts[i],
        left_label="Base Model",
        right_label="GRPO Model",
        scores_left=generated_scores[i],
        scores_right=new_generated_scores[i],
    )


# Inspect the first two evaluation posts
for i in range(2):
    pretty_print_tldr(i)
    print("\n\n")

0,1,2
Length,,0.0
,Length Compliance,0.0
Structure,,0.828
,Length Compliance,1.0
,Conciseness,0.773
,No Redundancy,0.809
,No Repetition,0.777
,No Incomplete Sentences,0.781
Content Accuracy,,0.954
,Important Points,0.77

0,1,2
Length,,0.0
,Length Compliance,0.0
Structure,,1.0
,Length Compliance,1.0
,Conciseness,1.0
,No Redundancy,1.0
,No Repetition,1.0
,No Incomplete Sentences,1.0
Content Accuracy,,1.0
,Important Points,1.0







0,1,2
Length,,0.0
,Length Compliance,0.0
Structure,,0.946
,Length Compliance,1.0
,Conciseness,0.789
,No Redundancy,1.0
,No Repetition,0.941
,No Incomplete Sentences,1.0
Content Accuracy,,0.977
,Important Points,0.883

0,1,2
Length,,1.0
,Length Compliance,1.0
Structure,,0.957
,Length Compliance,1.0
,Conciseness,0.785
,No Redundancy,1.0
,No Repetition,1.0
,No Incomplete Sentences,1.0
Content Accuracy,,0.949
,Important Points,0.746





