<a href="https://colab.research.google.com/github/shivvor2/RL-PEFT-a-small-reasoner/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h3 align="center"></h3>

<h1 align="center">Qwen 0.5b on GRPO</h1>

---

<h1 align="center">Training a small math reasoner with RL</h1>

Original notebook by [will brown,](https://x.com/willccbb), unfortunately, I can't find the X/Twitter release post anymore.

On top of the original notebook, we have implemented:
1. Evaluation code (to evaluate performance of the finetuned model vs the original model)
2. LoRA finetuning (instead of full finetuning) of the model (in progress)

Here is the release message for the original notebook

> This notebook is an alternate version of the [GRPO demo](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) by [will brown,](https://x.com/willccbb) training llama-1b on the gsm8k math dataset.

> We've only implemented a series of changes to make the code more workable on Colab:
* Replacement of llama-1b with Qwen-0.5b
* Generation with vllm, which yields a significant speed-up. Qwen small size makes it possible to run vllm on the same gpu as the one being used for GRPO.
* Dropping flash-attn (recurrent bug with modeling qwen, not clear why)

## Setting up the environment.

First we install vllm. Notice that you'll have to restart the session afterwards.

In [1]:
!pip install vllm

Collecting vllm
  Downloading vllm-0.8.5.post1-cp38-abi3-manylinux1_x86_64.whl.metadata (14 kB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting fastapi>=0.115.0 (from fastapi[standard]>=0.115.0->vllm)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.1.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting lm-format-enforcer<0.11,>=0.10.11 (from vllm)
  Downloading lm_format_enforcer-0.10.11-py3-none-any.whl.metadata (17 kB)
Collecting llguidance<0.8.0,>=0.7.9 (from vllm)
  Downloading llguidance-0.7.19-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting outlines==0.1.11 (from vllm)
  Downloading

Then we install trl and datasets. It has to be in this order for some reason (bug on trl if you do vllm afterwards)

In [1]:
!pip install trl datasets peft

Collecting trl
  Downloading trl-0.17.0-py3-none-any.whl.metadata (12 kB)
Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading trl-0.17.0-py3-none-any.whl (348 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m348.0/348.0 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading datasets-3.6.0-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.5/491.5 kB[0m [31m34.7 MB/s[0

(Optional) We mount google drive for persistant storage.

Change the root storage path if other forms of persistant storage is used

In [2]:
from google.colab import drive
import os
drive.mount('/content/drive')

base_path = "/content/drive/MyDrive/ML_Experiments/qwen2.5_0.5B_GRPO_LoRA"
os.makedirs(os.path.dirname(base_path), exist_ok=True)

Mounted at /content/drive


## Defining the RL rewards

Now we have everything ready to set up our RL training set and reward policy.

First we set the general prompt structure (with the reasoning tags).

In [3]:
import re
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer

# Load and prep dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

INFO 05-08 06:41:15 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-08 06:41:15 [__init__.py:239] Automatically detected platform cuda.


Now we import the gsm8k dataset and restructure it to fit into a conversational prompt format:

In [4]:
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

We move on now to the reward functions. The most important one is the "correctness" function which acts as a verifier (comparison of model completions vs. answer). The three others are formatting functions.

In [5]:
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

And here are some additional helper functions to help find the latest checkpoint

In [6]:
import os
import re
from google.colab import files
import shutil

def get_latest_checkpoint(base_dir: str):
    """Find the latest checkpoint in the given directory."""

    # Check existance for base directory
    if not os.path.exists(base_dir):
        print(f"Warning: Directory {base_dir} does not exist")
        return None

    # Look for checkpoint directories
    checkpoint_dirs = [d for d in os.listdir(base_dir) if d.startswith('checkpoint-')]

    if not checkpoint_dirs:
        return None

    # Extract checkpoint numbers and find the highest
    checkpoint_nums = [int(re.search(r'checkpoint-(\d+)', d).group(1)) for d in checkpoint_dirs]
    latest_checkpoint_num = max(checkpoint_nums)
    latest_checkpoint = f"checkpoint-{latest_checkpoint_num}"

    return os.path.join(base_dir, latest_checkpoint)

## Full finetuning and evaluation

### Training loop

(Optional) Resume training from checkpoint

We now set the training arguments:

In [None]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

output_dir=os.path.join(base_path, "outputs/Qwen-0.5B-GRPO")
run_name="Qwen-0.5B-GRPO-gsm8k"

training_args = GRPOConfig(
    output_dir=output_dir,
    run_name=run_name,
    learning_rate=5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type='cosine',
    logging_steps=1,
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
    save_steps=100,
    max_grad_norm=0.1,
    log_on_each_node=False,
    use_vllm=True,
    vllm_gpu_memory_utilization=.3,
    vllm_device="cuda:0",
    report_to="none" #I'm disabling Wandb.
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=None
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

And launch the actual training:

In [None]:
# Obtain checkpoint (to resume training)
checkpoint_path = get_latest_checkpoint(output_dir)
# checkpoint_path = None # Uncomment this if we want to restart training

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func],
    args=training_args,
    train_dataset=dataset,
)

if checkpoint_path is None: # No checkpoint
    trainer.train()
else:
    trainer.train(resume_from_checkpoint=checkpoint_path) # resume training



INFO 02-06 08:38:24 config.py:526] This model supports multiple tasks: {'reward', 'embed', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-06 08:38:24 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, n

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-06 08:38:26 model_runner.py:1116] Loading model weights took 0.9279 GB
INFO 02-06 08:38:28 worker.py:266] Memory profiling takes 1.12 seconds
INFO 02-06 08:38:28 worker.py:266] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.30) = 11.87GiB
INFO 02-06 08:38:28 worker.py:266] model weights take 0.93GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.51GiB; the rest of the memory reserved for KV Cache is 9.34GiB.
INFO 02-06 08:38:28 executor_base.py:108] # CUDA blocks: 50982, # CPU blocks: 21845
INFO 02-06 08:38:28 executor_base.py:113] Maximum concurrency for 32768 tokens per request: 24.89x
INFO 02-06 08:38:31 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_u

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:32<00:00,  1.07it/s]

INFO 02-06 08:39:04 model_runner.py:1563] Graph capturing finished in 33 secs, took 0.16 GiB
INFO 02-06 08:39:04 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 37.73 seconds





-------------------- Question:
Gerald wants to buy a meat pie that costs 2 pfennigs. Gerald has 54 farthings, and there are 6 farthings to a pfennig. How many pfennigs will Gerald have left after buying the pie? 
Answer:
7 
Response:
To solve this problem, we first need to understand how far possessions are worth in pfennigs and how far they are worth in farthings. We know that 5 farthings are equivalent to 1 pfennig. Therefore, in pfennigs, 54 farthings would be equivalent to:

\( 54 \text{ farthings} \times \frac{1 \text{ pfennig}}{5 \text{ farthings}} = 10.8 \text{ pfennigs} \)

Since Gerald has 10.8 pfennigs, after purchasing the meat pie, he will have:

\( 10.8 \text{ pfennigs} - 2 \text{ pfennigs} = 8.8 \text{ pfennigs} \)

So, Gerald will have 8.8 pfennigs left. 
Extracted:
To solve this problem, we first need to understand how far possessions are worth in pfennigs and how far they are worth in farthings. We know that 5 farthings are equivalent to 1 pfennig. Therefore, in pfenni

Step,Training Loss
1,0.0
2,0.0
3,0.0
4,0.0
5,0.0
6,0.0
7,0.0
8,0.0
9,0.0
10,0.0


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
</answer>
 
Extracted:
3.6
-------------------- Question:
Celina enjoys hiking in the mountains. Due to a lack of proper equipment, she decided to order a new pair of boots, a flashlight, and a special hoodie. The hoodie cost $80 and the flashlight was only 20% of that price. The pair of boots cost was $110, but due to a special promotion, Celina was able to buy them 10% cheaper. How much did Celina spend in total on the above-mentioned equipment? 
Answer:
195 
Response:
<reasoning>
The flashlight cost 20% of the hoodie's price, which is 0.20 * $80 = $16. Therefore, the boots cost $110 - $80 = $30. As a discount, the hoodie bought for $80 * 10/100 = $8. Thus, the total cost of the kit is $30 + $8 + $110 = $148.
</reasoning>
<answer>
148
</answer>
 
Extracted:
148
-------------------- Question:
Russel and Jen went to the circus. Jen played a shooting game twice, while Russel rode the carousel three times. If the shooting g

TrainOutput(global_step=1868, training_loss=0.006375221701942574, metrics={'train_runtime': 7258.7949, 'train_samples_per_second': 1.03, 'train_steps_per_second': 0.257, 'total_flos': 0.0, 'train_loss': 0.006375221701942574})

### Evaluating the trained model

In [None]:
from vllm import SamplingParams, LLM

In [None]:
test_data = get_gsm8k_questions(split="test")

Map:   0%|          | 0/1319 [00:00<?, ? examples/s]

In [None]:
# 2. Load Trained Model & Tokenizer
model_path = get_latest_checkpoint(output_dir)
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
# 3. Format Prompts Using Chat Template
test_prompts = []
for example in test_data:
    formatted_prompt = tokenizer.apply_chat_template(
        example["prompt"],
        tokenize=False,
        add_generation_prompt=True
    )
    test_prompts.append(formatted_prompt)

In [None]:
# 4. Set Up vLLM for Batch Inference
llm = LLM(
    model=model_path,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.3,
    trust_remote_code=True
)

# 5. Configure Sampling Parameters
sampling_params = SamplingParams(
    temperature=0.0,    # Greedy decoding for evaluation
    max_tokens=200,     # Same as training's max_completion_length
    stop=["<|im_end|>"] # Qwen's stop token
)

INFO 02-06 11:22:22 config.py:526] This model supports multiple tasks: {'reward', 'embed', 'score', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 02-06 11:22:22 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='outputs/Qwen-0.5B-GRPO/checkpoint-1868', speculative_config=None, tokenizer='outputs/Qwen-0.5B-GRPO/checkpoint-1868', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=outputs/

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-06 11:22:45 model_runner.py:1116] Loading model weights took 0.9234 GB
INFO 02-06 11:23:06 worker.py:266] Memory profiling takes 10.39 seconds
INFO 02-06 11:23:06 worker.py:266] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.30) = 11.87GiB
INFO 02-06 11:23:06 worker.py:266] model weights take 0.92GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 1.43GiB; the rest of the memory reserved for KV Cache is 9.51GiB.
INFO 02-06 11:23:16 executor_base.py:108] # CUDA blocks: 51942, # CPU blocks: 21845
INFO 02-06 11:23:16 executor_base.py:113] Maximum concurrency for 32768 tokens per request: 25.36x
INFO 02-06 11:23:19 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_

Capturing CUDA graph shapes: 100%|██████████| 35/35 [12:10<00:00, 20.86s/it]

INFO 02-06 11:35:29 model_runner.py:1563] Graph capturing finished in 730 secs, took 0.14 GiB
INFO 02-06 11:35:29 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 764.24 seconds





In [None]:
# 6. Generate Responses
outputs = llm.generate(test_prompts, sampling_params)

Processed prompts: 100%|██████████| 1319/1319 [00:26<00:00, 49.35it/s, est. speed input: 4753.12 toks/s, output: 6055.99 toks/s] 


In [None]:
# 7. Extract Answers
def extract_xml_answer(text: str) -> str:
    if "<answer>" in text and "</answer>" in text:
        return text.split("<answer>")[1].split("</answer>")[0].strip()
    return ""

pred_answers = [extract_xml_answer(output.outputs[0].text) for output in outputs]
true_answers = [example["answer"] for example in test_data]

In [None]:
# 8. Calculate Accuracy
accuracy = sum(1 for p, t in zip(pred_answers, true_answers) if p == t) / len(true_answers)
print(f"GSM8K Test Accuracy: {accuracy * 100:.2f}%")

GSM8K Test Accuracy: 46.17%


In [None]:
# (Optional) 9. Log the results
results_path = os.path.join(base_path, "/grpo_lora_results.txt")
os.makedirs(os.path.dirname(results_path), exist_ok=True)

with open(results_path, "a") as f:
    f.write(f"Baseline (full finetuning): {accuracy:.2f}%\n")

## LoRA finetuning and evaluation

### Training loop



We first setup the PEFT (LoRA) configuration

In [7]:
from peft import LoraConfig

rank = 16

peft_config = LoraConfig(
    r=rank,                     # the rank of the loRA matrices
    lora_alpha=2*rank,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # Attention layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ]
)

and setup the trainer like how it was previously (without VLLM as it does not support LoRA)

In [8]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

output_dir = os.path.join(base_path, f"outputs/Qwen-0.5B-GRPO-LoRA-r{rank}")
run_name = f"Qwen-0.5B-GRPO-LoRA-r{rank}-gsm8k"

training_args = GRPOConfig(
    output_dir=output_dir,
    run_name=run_name,
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    logging_steps=1,
    bf16=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4, # Changed from 4 to 16 because otherwise the training would not start
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
    save_steps=100,
    max_grad_norm=0.1,
    log_on_each_node=False,
    use_vllm=False,        # Use the PEFT model directly instead of vLLM engine
    report_to="none",
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=None
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

Now, we launch the actural training

In [None]:
# Load checkpoint if it exists
checkpoint_path = get_latest_checkpoint(output_dir)
# checkpoint_path = None # Uncomment this if we want to restart training

# Initialize GRPOTrainer with PEFT enabled
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func
    ],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config  # <-- Enables PEFT fine-tuning
)

# Start the training
if checkpoint_path is None: # No checkpoint
    trainer.train()
else:
    trainer.train(resume_from_checkpoint=checkpoint_path) # resume training

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


-------------------- Question:
Every day Janet spends 8 minutes looking for her keys and another 3 minutes complaining after she finds them. If Janet stops losing her keys, how many minutes will she save every week? 
Answer:
77 
Response:
<reasoning>
Janet spends 8 minutes looking for her keys and 3 minutes complaining after finding them, which totals 8 + 3 = 11 minutes per day. To calculate how many minutes she will save every week, we multiply the daily saving by the number of days in a week, 7. So, 11 * 7 = 77 minutes.
</reasoning>
<answer>
77
</answer>
 
Extracted:
77


Step,Training Loss
1301,0.0977
1302,-0.0055
1303,-0.0084
1304,0.072
1305,0.0372
1306,0.0878
1307,0.0701
1308,0.0257
1309,0.005
1310,0.0449


-------------------- Question:
At school today, Charlize was 20 minutes late. Four of her classmates were each ten minutes late than she was. What's the total time for which the five students were late? 
Answer:
140 
Response:
<reasoning>
Charlotte was 20 minutes late, and her four classmates were each ten minutes late than she was. So, her classmates were 4*10 = 40 minutes late. The total time for which the five students were late is 20 + 40 = 60 minutes.
</reasoning>
<answer>
60
</answer>
 
Extracted:
60
-------------------- Question:
A restaurant is counting their sales for the day. They sold 10 meals at $8 each, 5 meals at $10 each, and 20 meals at $4 each. In dollars, how much money did the restaurant make throughout the day? 
Answer:
210 
Response:
<reasoning>
The revenue from the $8 meals is 10 meals * $8 = $80. The revenue from the $10 meals is 5 meals * $10 = $50. The revenue from the $4 meals is 20 meals * $4 = $80. Therefore, the total revenue made is $80 + $50 + $80 = $210.

### Evaluation with the PEFT model

We merge the trained LoRA adapter to our base model in order to evaluate using VLLM for better speed, as evaluation using `transformers` take over an hour

The merged model should behave almost the same as the unmerged PEFT model (up to floating point rounding in matrix additions), so for the purpose of evaluation, the merged model should have the same performance as the unmerged model

We load the model from the latest checkpoint and merge it (in memory)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
import torch

base_model = 'Qwen/Qwen2.5-0.5B-Instruct'
checkpoint_path = get_latest_checkpoint(output_dir)  # LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_model)
# If you saved tokenizer with PEFT, replace base_model with checkpoint_path above

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map=None
).to("cuda")

# Apply the LoRA adapter
model = PeftModel.from_pretrained(model, checkpoint_path, is_trainable = False).to("cuda")
# model.eval()

# Merging the model
merged_model = model.merge_and_unload()

Now we merge and save the model to a temp directory, and then load the model with vLLM (which does not support reading from memory)

In [None]:
import tempfile

merged_dir = os.path.join(base_path, f"merged/Qwen-0.5B-GRPO-LoRA-r{rank}")
os.makedirs(os.path.dirname(merged_dir), exist_ok=True)

# Save merged model and tokenizer to the temp dir
merged_model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)

And perform the evaluation on the gsm8k test

(Restart the kernel before running the following code, as there will be errors when loading a model into `vllm` after it is loaded with `torch` or `transformers`, the following code is self-contained)

Remount persistant storage if needed

In [None]:
base_path = "/content/drive/MyDrive/ML_Experiments/qwen2.5_0.5B_GRPO_LoRA"

rank = 16 # (Use the same rank you performred the test with)

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')

os.makedirs(os.path.dirname(base_path), exist_ok=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The following code is identical to the full finetuning version (and changed to be self-contained)

~~See, I know the whole reloading the kernel thing is awkward, but this project is done in a notebook and I don't want to deal with any subprocess shinenigans~~

In [None]:
from vllm import SamplingParams, LLM
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset, Dataset

base_model = 'Qwen/Qwen2.5-0.5B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(base_model) # The tokenizer is not updated during the PEFT finetuning process

# 1. Load Test Data (if not already done)
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def get_gsm8k_questions(split="test"):
    data = load_dataset('openai/gsm8k', 'main')[split]
    data = data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    })
    return data

test_data = get_gsm8k_questions(split="test")

# 2. Format Prompts Using Chat Template
test_prompts = []
for example in test_data:
    formatted_prompt = tokenizer.apply_chat_template(
        example["prompt"],
        tokenize=False,
        add_generation_prompt=True
    )
    test_prompts.append(formatted_prompt)

# 3. Load the model into vLLM
llm = LLM(
model=os.path.join(base_path, f"merged/Qwen-0.5B-GRPO-LoRA-r{rank}"),  # Point to merged model path
tensor_parallel_size=1,
gpu_memory_utilization=0.3,
trust_remote_code=True
)

# 4. Configure Sampling Parameters
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=200,
    stop=["<|im_end|>"] # Qwen's stop token
)

# 5. Generate Responses
outputs = llm.generate(test_prompts, sampling_params)

# 6. Extract Answers
def extract_xml_answer(text: str) -> str:
    if "<answer>" in text and "</answer>" in text:
        return text.split("<answer>")[1].split("</answer>")[0].strip()
    return ""

pred_answers = [extract_xml_answer(output.outputs[0].text) for output in outputs]
true_answers = [example["answer"] for example in test_data]

# 7. Calculate Accuracy
accuracy = sum(p == t for p, t in zip(pred_answers, true_answers)) / len(true_answers)
print(f"GSM8K Test Accuracy: {accuracy * 100:.2f}%")

INFO 05-07 13:10:02 [__init__.py:239] Automatically detected platform cuda.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


INFO 05-07 13:10:24 [config.py:717] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
INFO 05-07 13:10:24 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-07 13:10:26 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='/content/drive/MyDrive/ML_Experiments/qwen2.5_0.5B_GRPO_LoRA/merged/Qwen-0.5B-GRPO-LoRA-r8', speculative_config=None, tokenizer='/content/drive/MyDrive/ML_Experiments/qwen2.5_0.5B_GRPO_LoRA/merged/Qwen-0.5B-GRPO-LoRA-r8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guide

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 05-07 13:10:28 [loader.py:458] Loading weights took 0.77 seconds
INFO 05-07 13:10:28 [gpu_model_runner.py:1347] Model loading took 0.9269 GiB and 0.972753 seconds
INFO 05-07 13:10:37 [backends.py:420] Using cache directory: /root/.cache/vllm/torch_compile_cache/043c115269/rank_0_0 for vLLM's torch.compile
INFO 05-07 13:10:37 [backends.py:430] Dynamo bytecode transform time: 9.02 s
INFO 05-07 13:10:43 [backends.py:118] Directly load the compiled graph(s) for shape None from the cache, took 4.507 s
INFO 05-07 13:10:44 [monitor.py:33] torch.compile takes 9.02 s in total
INFO 05-07 13:10:45 [kv_cache_utils.py:634] GPU KV cache size: 353,056 tokens
INFO 05-07 13:10:45 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 10.77x
INFO 05-07 13:11:17 [gpu_model_runner.py:1686] Graph capturing finished in 33 secs, took 0.38 GiB
INFO 05-07 13:11:17 [core.py:159] init engine (profile, create kv cache, warmup model) took 49.02 seconds
INFO 05-07 13:11:17 [core_client.py:4

Processed prompts:   0%|          | 0/1319 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s…

GSM8K Test Accuracy: 38.13%


In [None]:
pred_answers

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',


In [None]:
outputs

[RequestOutput(request_id=0, prompt="<|im_start|>system\n\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n<|im_end|>\n<|im_start|>user\nJanet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?<|im_end|>\n<|im_start|>assistant\n", prompt_token_ids=[151644, 8948, 271, 65354, 304, 279, 2701, 3561, 510, 27, 19895, 287, 397, 9338, 522, 19895, 287, 397, 27, 9217, 397, 9338, 522, 9217, 397, 151645, 198, 151644, 872, 198, 18315, 295, 748, 77778, 10962, 220, 16, 21, 18805, 817, 1899, 13, 2932, 49677, 2326, 369, 17496, 1449, 6556, 323, 293, 2050, 54304, 1330, 369, 1059, 4780, 1449, 1899, 448, 3040, 13, 2932, 30778, 279, 26313, 518, 279, 20336, 6, 3081, 7298, 369, 400, 17, 817, 7722, 35985, 18636, 13, 2585, 1753, 304, 11192, 1558, 1

And we log the results (optional)

In [None]:
results_path = os.path.join(base_path, "/grpo_lora_results.txt")
os.makedirs(os.path.dirname(results_path), exist_ok=True)

with open(results_path, "a") as f:
    f.write(f"Rank {rank}: {accuracy:.2f}%\n")

### Delete the model to perform another round of training

In [None]:
# After downloading the checkpoint
del model
torch.cuda.empty_cache()