# (d) Reinforcement Learning — GRPO (Reasoning)
**Created:** 2025-11-10 02:42 UTC

This notebook follows the **GRPO** (Group Relative Policy Optimization) recipe to nudge a model toward reasoning‑style outputs.
We use a tiny arithmetic dataset and **SmolLM2‑135M** for speed. For larger models (e.g., Llama‑3.1 8B), see Unsloth's GRPO tutorial.

In [1]:
!pip -q install --upgrade pip
!pip -q install "transformers>=4.44.2" "datasets>=2.19.0" "accelerate>=0.33.0" "trl>=0.9.6" "unsloth>=2024.11.0"

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m20.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.12.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
pylibcudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
bigframes 2.12.0 requires rich<14,>=12.4.4, but you have rich 14.2.0 which is incompatible.
libcugraph-cu12 25.6.0 requires libraft-cu12==25.6.*, but you have libraft-cu12 25.2.0 which is incompatible.
torchaudio 2.6.0+cu124 requires torch==2.6.0, but you have torch 2.8.0 which is incompatible.
cudf-polars-cu

In [2]:
import torch, random
random.seed(0)
print("CUDA:", torch.cuda.is_available())

CUDA: True


## Tiny math reasoning dataset
We ask the model to think step‑by‑step using a deliberate reasoning format.

In [3]:
from datasets import Dataset
samples = []
for a in range(11, 16):
    for b in range(3, 6):
        ans = a*b
        prompt = f"Solve: {a} * {b}. Respond with your reasoning and final answer as 'Answer: <num>'."
        reason = f"Multiply {a} by {b}. {a}*{b}={ans}. Answer: {ans}"
        samples.append({"prompt": prompt, "answer": reason, "answer_only": f"Answer: {ans}"})
ds = Dataset.from_list(samples)
ds

Dataset({
    features: ['prompt', 'answer', 'answer_only'],
    num_rows: 15
})

## Policy + reference + reward
We craft a simple reward: +1 if model outputs the correct final answer token, else 0. This is intentionally minimal for speed.

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOTrainer, GRPOConfig

base = "HuggingFaceTB/SmolLM2-135M"
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
policy = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32)
ref = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32)

def reward_fn(samples, **kwargs):
    # samples: list of strings (model outputs). Reward 1.0 if contains correct 'Answer: X' substring.
    rewards = []
    for s, ex in zip(samples, kwargs.get("inputs", [])):
        target = ex["answer_only"]
        rewards.append(1.0 if target in s else 0.0)
    return rewards

grpo_args = GRPOConfig(
    output_dir="/kaggle/working/smollm2_grpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=5e-6,
    max_steps=80,
    bf16=torch.cuda.is_available(),
    logging_steps=5,
)
trainer = GRPOTrainer(
    model=policy,
    ref_model=ref,
    args=grpo_args,
    reward_funcs=[reward_fn],
    tokenizer=tokenizer,
    max_prompt_length=128,
    max_completion_length=96,
    train_dataset=ds,
)
trainer.train()
trainer.save_model("/kaggle/working/smollm2_grpo")
tokenizer.save_pretrained("/kaggle/working/smollm2_grpo")

2025-11-10 04:12:08.627726: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1762747928.835687      48 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1762747928.891434      48 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

Skipping import of cpp extensions due to incompatible torch version 2.8.0+cu128 for torchao version 0.14.1             Please see https://github.com/pytorch/ao/issues/2919 for more info
    Found GPU0 Tesla P100-PCIE-16GB which is of cuda capability 6.0.
    Minimum and Maximum cuda capability supported by this version of PyTorch is
    (7.0) - (12.0)
    
    Please install PyTorch with a following CUDA
    configurations:  12.6 following instructions at
    https://pytorch.org/get-started/locally/
    
Tesla P100-PCIE-16GB with CUDA capability sm_60 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_70 sm_75 sm_80 sm_86 sm_90 sm_100 sm_120.
If you want to use the Tesla P100-PCIE-16GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/



tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/831 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors:   0%|          | 0.00/269M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

ValueError: generation_batch_size (4) must be divisible by num_generations (8).

In [5]:
test = "Solve: 13 * 4. Respond with your reasoning and final answer as 'Answer: <num>'."
inputs = tokenizer(test, return_tensors="pt").to(policy.device)
with torch.no_grad():
    out = policy.generate(**inputs, max_new_tokens=96, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Solve: 13 * 4. Respond with your reasoning and final answer as 'Answer: <num>'.

15. I am 22 years old. In 2010, I started working as a software engineer at a major company. In 2013, I was promoted to Software Engineer and now I am a software developer at a small company. In the last 5 years, I have been working with a new startup that is trying to develop a new product called "The Greatest New Product". I know most of my teammates are from the
