<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="https://mng.bz/lZ5B">Build a Reasoning Model (From Scratch)</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/reasoning-from-scratch">https://github.com/rasbt/reasoning-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="https://mng.bz/lZ5B"><img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Chapter 6: Training Reasoning Models with Reinforcement Learning

Packages that are being used in this notebook:

In [1]:
from importlib.metadata import version

used_libraries = [
    "reasoning_from_scratch",
    "torch",
    "tokenizers"  # Used by reasoning_from_scratch
]

for lib in used_libraries:
    print(f"{lib} version: {version(lib)}")

reasoning_from_scratch version: 0.1.13
torch version: 2.9.1
tokenizers version: 0.22.2


<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F01_raschka.webp" width=600>

&nbsp;
## 6.1 Introduction to reinforcement learning for LLMs

- Inference-time scaling improves reasoning by using more compute per generated answer
- Training-time scaling improves reasoning by using additional compute during training, which is the focus of this chapter

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F02_raschka.webp" width=600>

- Inference-time scaling and training-time scaling can (and often should) be combined, for example, by applying inference-time techniques after RL-based reasoning training
- In practice, RL for LLMs is applied as a post-training stage on top of a pre-trained model or following instruction fine-tuning

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F03_raschka.webp" width=600>

- Pre-training builds general knowledge via next-token prediction, whereas RL refines model behavior by optimizing sequence-level objectives such as answer correctness or preferences
- RL for LLMs includes reasoning training and preference tuning, but reasoning-focused RL can also be applied directly to a pre-trained base model, as shown by DeepSeek-R1
- Training reasoning directly on the base model produces a weaker but still capable model; its advantage is that it offers a simpler setting for understanding what the reasoning stage contributes

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F04_raschka.webp" width=600>

&nbsp;
### 6.1.1 The original reinforcement learning pipeline with human feedback (RLHF)

- RLHF was popularized for LLMs by the InstructGPT work in 2022 and uses human preference labels to train LLMs (this was a key step in turning GPT-3 into the original ChatGPT)
- Unlike pre-training and supervised fine-tuning, which optimize next-token prediction, RLHF optimizes models based on human preference labels of the model responses

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F05_raschka.webp" width=600>

&nbsp;
### 6.1.2 From human feedback to verifiable rewards (RLVR)

- RLHF requires training a separate reward model, which is often a large and expensive LLM
- RLVR replaces the learned reward model with automatically verifiable, deterministic rewards

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F06_raschka.webp" width=600>

- The popularity of RLVR was driven in large part by the success of DeepSeek-R1 in 2025, which demonstrated strong reasoning performance without relying on human preference data or a learned reward model
- DeepSeek-R1 trained reasoning behavior using automatically verifiable rewards, such as correctness checks for math problems and code compilation or execution for programming tasks
- While this book focuses on math-based verification, the underlying idea is similar to code verification: rewards are computed automatically using binary success signals
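
To make the analogy to code verification concrete, the following is a minimal sketch of a binary, automatically verifiable reward for a programming task. The function name `code_reward_sketch`, its arguments, and the toy test cases are hypothetical and not part of the book's code; a real setup would run candidate programs in a sandbox instead of calling `exec` directly.

```python
def code_reward_sketch(candidate_source, func_name, test_cases):
    """Return 1.0 if the candidate passes all test cases, else 0.0."""
    namespace = {}
    try:
        # Toy illustration only: never exec untrusted code outside a sandbox
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failure
        return 0.0
    return 1.0


tests = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

print(code_reward_sketch(good, "add", tests))    # 1.0
print(code_reward_sketch(bad, "add", tests))     # 0.0
print(code_reward_sketch(broken, "add", tests))  # 0.0
```

As with the math rewards used in this chapter, the signal is binary and requires neither human labels nor a learned reward model.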

&nbsp;
## 6.2 Reinforcement learning with verifiable rewards walkthrough using GRPO

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F07_raschka.webp" width=600>

- Now, after introducing the big picture and seeing how RL fits into the development cycle of LLMs, we will implement RLVR to train a reasoning model (similar to DeepSeek-R1-Zero but on a much smaller scale, as a comparable run would cost multiple hundreds of thousands of dollars in GPU costs)
- RL for LLMs uses a so-called policy gradient algorithm to update the LLM we want to train (which is called the "policy" in RL contexts)
- A popular policy gradient algorithm for RLHF is proximal policy optimization (PPO); we could use the same algorithm in RLVR
- However, the DeepSeek team used a simpler algorithm when they trained the DeepSeek-R1 reasoning models, namely group relative policy optimization (GRPO) (first used in DeepSeekMath)
- GRPO is more resource-friendly, because in PPO we have another LLM compute the value function; in GRPO, we don't need that, as it derives its learning signal from relative comparisons within a group of sampled responses
- Interested readers can find a more detailed side-by-side comparison between PPO and GRPO in my article [The State of Reinforcement Learning for LLM Reasoning](https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training)
- In this chapter, we implement RLVR using GRPO
- The next chapter then introduces further improvements to GRPO that increase training stability and the resulting modeling performance

### 6.2.1 High-level GRPO intuition via a chef analogy

- Since GRPO can look complicated at first glance, I wanted to start this section with a general big-picture overview using a "chef & cooking" analogy to introduce the terminology and provide some intuition

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F08_raschka.webp" width=600>

- In most RL-for-LLMs contexts, rollout and completion are terms that are used interchangeably

### 6.2.2 The high-level GRPO procedure

- The technical roadmap for implementing GRPO in the following sections:

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F09_raschka.webp" width=600>

&nbsp;
## 6.3 Loading a pre-trained model

- The code in this section is identical to the code used in previous chapters

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F10_raschka.webp" width=600>

In [2]:
import torch

from reasoning_from_scratch.ch02 import get_device
from reasoning_from_scratch.ch03 import (
     load_model_and_tokenizer
)

device = get_device()
device = torch.device("cpu")  # Overrides the detected device; everything below runs on the CPU

model, tokenizer = load_model_and_tokenizer(
    which_model="base",
    device=device,
    use_compile=False
)

Using Apple Silicon GPU (MPS)
qwen3-0.6B-base.pth: 100% (1433 MiB / 1433 MiB)


In [3]:
from reasoning_from_scratch.ch03 import render_prompt
from reasoning_from_scratch.ch04 import (
    generate_text_stream_concat_flex,
    generate_text_top_p_stream_cache
)

raw_prompt = (
    "Half the value of $3x-9$ is $x+37$. "
    "What is the value of $x$?"
)
prompt = render_prompt(raw_prompt)

torch.manual_seed(0)
response = generate_text_stream_concat_flex(
    model, tokenizer, prompt, device,
    max_new_tokens=2048, verbose=True,
    generate_func=generate_text_top_p_stream_cache,
    temperature=0.9,
    top_p=0.9
)

 \boxed{58}

&nbsp;
## 6.4 Loading a MATH training subset

- We use a non-overlapping training subset derived from the original MATH dataset that explicitly excludes the MATH-500 examples used for model evaluation in the previous chapters (for more information about how the dataset was prepared, please see https://github.com/rasbt/math_full_minus_math500)

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F11_raschka.webp" width=600>

- The following `load_math_train` function is similar to the [load_math500_test](https://github.com/rasbt/reasoning-from-scratch/blob/main/reasoning_from_scratch/ch03.py#L422) function in chapter 3, except that we specify a different file path

In [4]:
import json
import requests
from pathlib import Path

def load_math_train(local_path="math_train.json", save_copy=True):
    local_path = Path(local_path)
    url = (
        "https://raw.githubusercontent.com/rasbt/"
        "math_full_minus_math500/refs/heads/main/"
        "math_full_minus_math500.json"
    )

    if local_path.exists():
        with local_path.open("r", encoding="utf-8") as f:
            data = json.load(f)
    else:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        data = r.json()

        if save_copy:  # Saves a local copy
            with local_path.open("w", encoding="utf-8") as f:
                json.dump(data, f, indent=2)

    return data

In [5]:
math_train = load_math_train()

print("Dataset size:", len(math_train))

Dataset size: 12000


In [6]:
from pprint import pprint

pprint(math_train[4])

{'answer': '6',
 'level': 'Level 3',
 'problem': 'Sam is hired for a 20-day period. On days that he works, he earns '
            '$\\$$60. For each day that he does not work, $\\$$30 is '
            'subtracted from his earnings. At the end of the 20-day period, he '
            'received $\\$$660. How many days did he not work?',
 'solution': 'Call $x$ the number of days Sam works and $y$ the number of days '
             'he does not. We can set up the following system of equations to '
             'represent the given information: \\begin{align*}\n'
             'x+y &= 20 \\\\\n'
             '60x - 30y &= 660 \\\\\n'
             '\\end{align*} The first equation represents the total number of '
             'days Sam works, and the second equation represents his total '
             'profit. Solving for $x$ in the first equation yields $x = 20 - '
             'y$. Substituting into the second equation gives $60(20-y) - 30y '
             '= 660$. Canceling a factor of $10$ an

- Note that we only need the `"answer"` and `"problem"` fields
- In theory, it may be tempting to use the `"solution"`, but here we want to let the model explore solutions freely (instead of learning a specific solution and style)
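
As a small illustration of the point above, the sketch below keeps only the two fields that RLVR needs and drops the reference solution. The helper name `keep_problem_and_answer` and the toy record are made up for this example; the notebook itself simply indexes `example["problem"]` and `example["answer"]` later on.

```python
def keep_problem_and_answer(records):
    # Keep only the two fields needed for RLVR; the reference
    # "solution" text is intentionally dropped
    return [
        {"problem": rec["problem"], "answer": rec["answer"]}
        for rec in records
    ]


# Made-up record that mimics the structure of a MATH entry
toy_records = [
    {
        "problem": "What is 2 + 3?",
        "answer": "5",
        "level": "Level 1",
        "solution": "Adding 2 and 3 gives 5.",
    }
]

pairs = keep_problem_and_answer(toy_records)
print(pairs)
```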

&nbsp;
## 6.5 Sampling rollouts

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F12_raschka.webp" width=600>

- Rollout is RL jargon for generated response

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F13_raschka.webp" width=400>

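- We need `@torch.no_grad()` because we don't want to build a computation graph and backpropagate through the sampling step; `@torch.inference_mode()` is too restrictive here, because tensors created in inference mode cannot be used later in computations tracked by autograd, and it results in the following error: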

> RuntimeError: Inference tensors cannot be saved for backward. Please do not use Tensors created in inference mode in computation tracked by autograd. To work around this, you can make a clone to get a normal tensor and use it in autograd, or use `torch.no_grad()` instead of `torch.inference_mode()`.

In [7]:
from reasoning_from_scratch.qwen3 import KVCache
from reasoning_from_scratch.ch04 import top_p_filter


@torch.no_grad()
def sample_response(
    model,
    tokenizer,
    prompt,
    device,
    max_new_tokens=512,
    temperature=0.8,
    top_p=0.9,
):
    input_ids = torch.tensor(
        tokenizer.encode(prompt),
        device=device
        )

    cache = KVCache(n_layers=model.cfg["n_layers"])
    model.reset_kv_cache()
    logits = model(input_ids.unsqueeze(0), cache=cache)[:, -1]

    generated = []
    for _ in range(max_new_tokens):
        if temperature and temperature != 1.0:
            logits = logits / temperature

        probas = torch.softmax(logits, dim=-1)
        probas = top_p_filter(probas, top_p)
        next_token = torch.multinomial(
            probas.cpu(), num_samples=1
        ).to(device)

        if (
            tokenizer.eos_token_id is not None
            and next_token.item() == tokenizer.eos_token_id
        ):
            break
        generated.append(next_token.item())
        logits = model(next_token, cache=cache)[:, -1]

    full_token_ids = torch.cat(
        [input_ids,
         torch.tensor(generated, device=device, dtype=input_ids.dtype),]
    )
    return full_token_ids, input_ids.numel(), tokenizer.decode(generated)

- There is nothing new here
- The code above is simply a leaner version of what we have been developing previously; it combines the [generate_text_basic_cache](https://github.com/rasbt/reasoning-from-scratch/blob/main/reasoning_from_scratch/ch02.py#L57) function from chapter 2 with temperature and top-p sampling from chapter 4 directly
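
For readers who want to see the sampling step in isolation, here is a plain-Python sketch of temperature scaling followed by top-p filtering for a single token. The names `top_p_filter_sketch` and `sample_token_sketch` are hypothetical stand-ins; the library's `top_p_filter` operates on tensors and may differ in details such as how the token that crosses the threshold is handled.

```python
import math
import random


def top_p_filter_sketch(probs, top_p):
    # Visit token indices from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # Assumption: keep the token that crosses top_p
            break
    # Zero out everything else and renormalize the kept probabilities
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def sample_token_sketch(logits, temperature=0.8, top_p=0.9, rng=random):
    if temperature and temperature != 1.0:
        logits = [z / temperature for z in logits]
    # Numerically stable softmax
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    filtered = top_p_filter_sketch(probs, top_p)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


filtered = top_p_filter_sketch([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(filtered)  # The least likely token gets probability 0.0

rng = random.Random(0)
print(sample_token_sketch([2.0, 1.0, 0.5, -1.0], rng=rng))
```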

In [8]:
torch.manual_seed(0)

raw_prompt = (
    "Half the value of $3x-9$ is $x+37$. "
    "What is the value of $x$?"
)
prompt = render_prompt(raw_prompt)

token_ids, prompt_len, answer_text = sample_response(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            device=device,
            max_new_tokens=512,
            temperature=0.9,
            top_p=0.9,
        )

print(answer_text)

 \boxed{58}


In [9]:
torch.manual_seed(5)

token_ids, prompt_len, answer_text = sample_response(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            device=device,
            max_new_tokens=512,
            temperature=0.9,
            top_p=0.9,
        )

print(answer_text)

 Let's solve the problem step by step.

**Given:**
\[
\frac{1}{2} \times (3x - 9) = x + 37
\]

**Step 1: Multiply both sides by 2 to eliminate the fraction.**
\[
3x - 9 = 2(x + 37)
\]

**Step 2: Distribute the 2 on the right side.**
\[
3x - 9 = 2x + 74
\]

**Step 3: Subtract \(2x\) from both sides to get the \(x\) terms on one side.**
\[
3x - 2x - 9 = 74
\]
\[
x - 9 = 74
\]

**Step 4: Add 9 to both sides to solve for \(x\).**
\[
x = 74 + 9
\]
\[
x = 83
\]

**Final Answer:**
\[
\boxed{83}
\]


- In practice, we would call `sample_response` multiple times to generate multiple rollouts
- To keep the GRPO walkthrough simple and aligned with figure 6.13, we instead assume the model generated the following four responses:

In [10]:
rollouts = [
    r"\boxed{83}",
    r"The correct answer is \boxed{83}",
    r"The final answer is 83",
    r"We get \boxed{38}",
]

&nbsp;
## 6.6 Calculating rewards

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F14_raschka.webp" width=400>

- The rewards are simply correctness rewards, similar to chapter 3
- However, there is also an implicit format reward: the reward of 1.0 is only given if the final answer is in the `\boxed{}` format (enforced via `fallback=None`)

In [11]:
from reasoning_from_scratch.ch03 import (
    extract_final_candidate, grade_answer
)

def reward_rlvr(answer_text, ground_truth):
    extracted = extract_final_candidate(
        answer_text, fallback=None  # Require \boxed{}
    )
    if not extracted:
        return 0.0
    correct = grade_answer(extracted, ground_truth)
    return float(correct)

In [12]:
rollouts = [
    r"\boxed{83}",
    r"The correct answer is \boxed{83}",
    r"The final answer is 83",
    r"We get \boxed{38}",
]
rollout_rewards = []

for answer in rollouts:
    reward = reward_rlvr(answer_text=answer, ground_truth="83")
    print(f"Answer: {answer!r}")
    print(f"Reward: {reward}\n")
    rollout_rewards.append(reward)

Answer: '\\boxed{83}'
Reward: 1.0

Answer: 'The correct answer is \\boxed{83}'
Reward: 1.0

Answer: 'The final answer is 83'
Reward: 0.0

Answer: 'We get \\boxed{38}'
Reward: 0.0



- Note: The DeepSeek-R1 team tried to use process reward models to score intermediate solution steps when training the model
- However, these attempts were unsuccessful, and the researchers concluded that it is better to only train on the final answer correctness rewards without intermediate rewards
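
Returning to the implicit format reward in `reward_rlvr`, the following self-contained sketch shows the idea without the library helpers. It is a simplified stand-in, not the book's implementation: `extract_boxed_sketch` uses a regular expression that does not handle nested braces, and the comparison is plain string equality rather than the normalization performed by `grade_answer`.

```python
import re


def extract_boxed_sketch(text):
    # Find all \boxed{...} spans without nested braces; use the last one
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def reward_sketch(answer_text, ground_truth):
    extracted = extract_boxed_sketch(answer_text)
    if not extracted:
        return 0.0  # No boxed answer means no reward (format requirement)
    return float(extracted == ground_truth.strip())


toy_rollouts = [
    r"\boxed{83}",
    r"The correct answer is \boxed{83}",
    r"The final answer is 83",
    r"We get \boxed{38}",
]
toy_rewards = [reward_sketch(r, "83") for r in toy_rollouts]
print(toy_rewards)  # [1.0, 1.0, 0.0, 0.0]
```

The third rollout contains the right number but receives no reward because it is not boxed, which is exactly the implicit format reward.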

&nbsp;
## 6.7 Preparing learning signals from rollouts via advantages

- The "GR" (group relative) in GRPO refers to the fact that GRPO generates multiple answers (rollouts) per prompt, and compares them relative to each other to construct a learning signal

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F15_raschka.webp" width=400>

- The formula is quite simple:

$$\text{advantages}_i = \frac{r_i - \mu_r}{\sigma_r + \epsilon}$$

- Here, $r_i$ denotes the reward of the $i$-th rollout, $\mu_r$ is the mean reward across the group of rollouts, $\sigma_r$ is the corresponding standard deviation, and $\epsilon$ is a small constant added for numerical stability to avoid zero-division errors

In [13]:
rewards = torch.tensor(rollout_rewards, device=device)
print(rewards)

tensor([1., 1., 0., 0.])


In [14]:
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

print(advantages)

tensor([ 0.8659,  0.8659, -0.8659, -0.8659])


- Note that if all rewards in a group are identical, for example, all 0 or all 1, then $r_i - \mu_r = 0$ for all $i$ rollouts
- This means the model is not updated if all answers are correct or all answers are incorrect
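
The zero-signal case can be checked with a few lines of plain Python. This is a sketch with a hypothetical helper name, `group_advantages_sketch`; it uses `statistics.stdev`, which computes the sample standard deviation (dividing by $n - 1$) and therefore matches the default behavior of `rewards.std()` in PyTorch.

```python
import statistics


def group_advantages_sketch(rewards, eps=1e-4):
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # Sample std (n - 1), needs >= 2 rewards
    return [(r - mean) / (std + eps) for r in rewards]


mixed = group_advantages_sketch([1.0, 1.0, 0.0, 0.0])
print([round(a, 4) for a in mixed])  # [0.8659, 0.8659, -0.8659, -0.8659]

all_correct = group_advantages_sketch([1.0, 1.0, 1.0, 1.0])
all_wrong = group_advantages_sketch([0.0, 0.0, 0.0, 0.0])
print(all_correct, all_wrong)  # All advantages are 0.0
```

Without the small `eps` in the denominator, the all-identical case would divide zero by zero.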

&nbsp;
## 6.8 Scoring rollouts with sequence log-probabilities

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F16_raschka.webp" width=400>

- In the previous chapter, we implemented an `avg_logprob_answer` function that calculates the average per-token log-probability of the answer tokens
- These averaged log-probabilities are also often referred to as token-level log-probabilities, and are commonly used for scoring LLM answers
- This averaging is preferred for scoring as it provides length-normalization
- Mathematically, this can be expressed as $\frac{1}{T} \sum_{t=1}^{T} \log p_W(y_t \mid y_{<t}, x)$
- Here, $y_1, ..., y_T$ denote the tokens in the generated response of length $T$, $y_{<t}$ represents all previously generated tokens, $x$ is the input prompt, and $W$ denotes the model's weight parameters
- This expression is mathematically identical to the one used in the previous chapter; we simply switch from $x$ to $y$ to distinguish the generated output tokens from the input prompt for clarity
- For reference, the function is copied below

In [15]:
## Chapter 5

@torch.inference_mode()
def avg_logprob_answer(model, tokenizer, prompt, answer, device="cpu"):

    # Encode prompt and answer tokens separately to get the prompt length later
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer)
    full_ids = torch.tensor(prompt_ids + answer_ids, device=device)

    # Same as in calc_next_token_logprobas before
    logits = model(full_ids.unsqueeze(0)).squeeze(0)
    logprobs = torch.log_softmax(logits, dim=-1)

    # Index range for positions corresponding to answer tokens
    start = len(prompt_ids) - 1
    end = full_ids.shape[0] - 1

    # Same as before, except for using start and end
    t_idx = torch.arange(start, end, device=device)
    next_tokens = full_ids[start + 1 : end + 1]
    next_token_logps = logprobs[t_idx, next_tokens]

    # Average over the answer token scores
    return torch.mean(next_token_logps).item()

In [16]:
avg_logprob_val = avg_logprob_answer(
                   model, tokenizer, 
                   prompt=prompt,
                   answer=answer_text,
                   device=device) 
print(avg_logprob_val)

-0.0517578125


- However, GRPO uses sequence-level log-probabilities, not length-normalized token-level averages like above
- Token-level averages are useful for scoring, since they make outputs of different lengths comparable
- In GRPO, each rollout receives one reward and one advantage for the entire sequence, and to scale the gradient correctly, log-probabilities must reflect the likelihood of the full sequence, which is obtained by summing token-level log-probabilities
- Otherwise, averaging log-probabilities would implicitly rescale the learning signal by sequence length and distort policy updates, especially for longer rollouts
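
To make the rescaling explicit, we can compare the gradients of the two per-rollout objectives. Here, $A_i$ denotes the advantage of the $i$-th rollout and $T_i$ its length (this notation is introduced only for this comparison). With summed log-probabilities:

$$\nabla_W \left[ A_i \sum_{t=1}^{T_i} \log p_W(y_t \mid y_{<t}, x) \right] = A_i \sum_{t=1}^{T_i} \nabla_W \log p_W(y_t \mid y_{<t}, x)$$

With averaged log-probabilities:

$$\nabla_W \left[ \frac{A_i}{T_i} \sum_{t=1}^{T_i} \log p_W(y_t \mid y_{<t}, x) \right] = \frac{A_i}{T_i} \sum_{t=1}^{T_i} \nabla_W \log p_W(y_t \mid y_{<t}, x)$$

With averaging, each token of the rollout contributes with weight $A_i / T_i$ instead of $A_i$. For example, a 200-token rollout would receive a per-token weight that is 20 times smaller than that of a 10-token rollout with the same advantage.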

- We can convert it into a sequence-level log-probability by dropping the averaging, and replacing the `torch.mean(next_token_logps)` with a `torch.sum(next_token_logps)`
- Retrospectively, we could also multiply the averaged result by the number of answer tokens to obtain the unaveraged value

In [17]:
sequence_logprob_val = avg_logprob_val * (len(tokenizer.encode(answer_text)))
print(sequence_logprob_val)

-11.490234375


- These sequence-level log-probabilities scale roughly linearly with the sequence length $T$
- This means longer answers tend to receive more negative log-probabilities
- In turn, for two equally good answers, this favors the shorter one (which is also cheaper to generate)
- In other words, summed log-probabilities encourage the model to stop earlier
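
The length effect is easy to see on made-up numbers. The sketch below assumes, purely for illustration, that every token has the same log-probability of -0.05; the helper name `sequence_scores_sketch` is hypothetical.

```python
def sequence_scores_sketch(token_logps):
    total = sum(token_logps)             # Sequence-level log-probability
    average = total / len(token_logps)   # Length-normalized score
    return total, average


# Made-up per-token log-probabilities with identical per-token quality
short_answer = [-0.05] * 10
long_answer = [-0.05] * 200

short_sum, short_avg = sequence_scores_sketch(short_answer)
long_sum, long_avg = sequence_scores_sketch(long_answer)

print(f"short: sum={short_sum:.2f}, avg={short_avg:.2f}")
print(f"long:  sum={long_sum:.2f}, avg={long_avg:.2f}")
```

The averages are the same for both answers, whereas the sum for the longer answer is 20 times more negative.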

- So, as mentioned above, we can replace `torch.mean` by `torch.sum` in the function above
- However, since we ran the function in inference mode in the previous chapter, using the `@torch.inference_mode()` decorator, we have to redefine it anyway, as we want PyTorch to track and compute gradients
- Also, since the `sample_response` function from section 6.5 already returns the `token_ids` and `prompt_len`, we can drop the encoding and `full_ids` computation from `avg_logprob_answer` to simplify things

In [18]:
def sequence_logprob_draft(model, token_ids, prompt_len):
    logits = model(token_ids.unsqueeze(0)).squeeze(0).float()
    logprobs = torch.log_softmax(logits, dim=-1)

    # Positions whose next-token probabilities we want
    # These correspond to predicting token_ids[t + 1] from position t
    start = prompt_len - 1
    end = token_ids.shape[0] - 1

    t_idx = torch.arange(start, end, device=token_ids.device)
    next_tokens = token_ids[start + 1 : end + 1]
    next_token_logps = logprobs[t_idx, next_tokens]

    # Sum log-probabilities over the answer tokens
    return torch.sum(next_token_logps)

print(sequence_logprob_draft(model, token_ids, prompt_len))

tensor(-11.5178, grad_fn=<SumBackward0>)


- Note that we don't call `.item()` on `torch.sum(next_token_logps)` so that PyTorch returns a tensor (rather than a Python float), which is important for the gradient calculation
- As we can see, the resulting value (-11.5178) is almost identical to the one we got previously when rescaling the `avg_logprob_val` by the number of answer tokens (-11.4902); the minor difference can be attributed to floating-point rounding behavior

- Below, we will rewrite the function using `torch.gather`, which is a bit more idiomatic in PyTorch and is a bit better optimized for GPUs
- However, both functions are mathematically equivalent

In [19]:
def sequence_logprob(model, token_ids, prompt_len):
    logits = model(token_ids.unsqueeze(0)).squeeze(0).float()
    logprobs = torch.log_softmax(logits, dim=-1)
    selected = logprobs[:-1].gather(
        1, token_ids[1:].unsqueeze(-1)
    ).squeeze(-1)
    return torch.sum(selected[prompt_len - 1:])

print(sequence_logprob(model, token_ids, prompt_len))

tensor(-11.5178, grad_fn=<SumBackward0>)
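
The index shift shared by both functions (position `t` predicts `token_ids[t + 1]`, so the first answer token is scored at position `prompt_len - 1`) can be verified on a toy example in plain Python. The token IDs and the probability table below are made up, and both helper names are hypothetical; nested lists stand in for the `logprobs` tensor.

```python
import math


def draft_style_sketch(logprobs, token_ids, prompt_len):
    # Mirrors the explicit start/end indexing of the draft function
    start = prompt_len - 1
    end = len(token_ids) - 1
    return sum(logprobs[t][token_ids[t + 1]] for t in range(start, end))


def gather_style_sketch(logprobs, token_ids, prompt_len):
    # Mirrors the gather version: score every next token, then slice
    selected = [
        logprobs[t][token_ids[t + 1]] for t in range(len(token_ids) - 1)
    ]
    return sum(selected[prompt_len - 1:])


# Toy setup: vocabulary of 4 tokens, 3 prompt tokens, 2 answer tokens
toy_token_ids = [2, 0, 3, 1, 1]
toy_prompt_len = 3
row = [math.log(p) for p in (0.1, 0.2, 0.3, 0.4)]
toy_logprobs = [row for _ in toy_token_ids]  # Same distribution at each position

draft_val = draft_style_sketch(toy_logprobs, toy_token_ids, toy_prompt_len)
gather_val = gather_style_sketch(toy_logprobs, toy_token_ids, toy_prompt_len)
print(draft_val, gather_val)
```

Both variants sum exactly two terms, one per answer token, and return the same value.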


In [20]:
rollouts = [
    r"\boxed{83}",
    r"The correct answer is \boxed{83}",
    r"The final answer is 83",
    r"We get \boxed{38}",
]

rollout_logps = []

for text in rollouts:
    token_ids = tokenizer.encode(prompt + " " + text)
    logprob = sequence_logprob(
        model=model,
        token_ids=torch.tensor(token_ids, device=device),
        prompt_len=prompt_len,
    )

    print(f"Answer:  {text}")
    print(f"Logprob: {logprob.item():.4f}\n")

    rollout_logps.append(logprob)

Answer:  \boxed{83}
Logprob: -7.9243

Answer:  The correct answer is \boxed{83}
Logprob: -20.1546

Answer:  The final answer is 83
Logprob: -16.6130

Answer:  We get \boxed{38}
Logprob: -23.3677



- The trend here is that shorter and more concise answers receive higher (less negative) sequence-level log-probabilities
- And the lowest score is assigned to the only answer containing an incorrect value (38 instead of 83)
- Overall, summed log-probabilities favor concise and correct outputs

&nbsp;
## 6.9 From advantages to policy updates via the GRPO loss

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F17_raschka.webp" width=400>

In [21]:
logps = torch.stack(rollout_logps)
print(logps)

tensor([ -7.9243, -20.1546, -16.6130, -23.3677], grad_fn=<StackBackward0>)


In [22]:
pg_loss = -(advantages.detach() * logps).mean()
print(pg_loss)

tensor(-2.5764, grad_fn=<NegBackward0>)


- We need the `.detach()` because we want to treat the `advantages` as fixed learning signals; this way, we ensure that we only backprop through the logprobs
- We need the negative sign because PyTorch optimizers minimize by default, and here we want to maximize the advantage-weighted log-probabilities

- In mathematical notation, we can write the policy gradient loss as follows:

$$\mathcal{L}_{\mathrm{PG}}
= -\frac{1}{N} \sum_{i=1}^{N} A_i \sum_{t=1}^{T_i} \log p_W\!\left( y_t^{(i)} \mid y_{<t}^{(i)}, x^{(i)} \right)$$

- $N$ denotes the number of rollouts in the batch
- $y_1^{(i)}, ..., y_{T_i}^{(i)}$ are the tokens of the $i$-th generated response of length $T_i$
- $y_{<t}^{(i)}$ represents all previously generated tokens in that response
- $x^{(i)}$ is the corresponding input prompt for the $i$-th rollout
- $p_W$ denotes the model's policy, that is, the probability distribution over next tokens parameterized by the weights $W$
- $A_i$ is the advantage assigned to the full $i$-th rollout
- The inner sum computes the sequence-level log-probability of a rollout
- The outer average computes advantage-weighted log-probabilities across rollouts
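
As a sanity check of the formula, the loss value from this section can be reproduced in plain Python by plugging in the rounded advantages and sequence log-probabilities printed earlier. The function name `pg_loss_sketch` is hypothetical, and no gradients are involved here; this only verifies the arithmetic.

```python
def pg_loss_sketch(advantages, seq_logps):
    # Negative mean of advantage-weighted sequence log-probabilities
    n = len(advantages)
    return -sum(a * lp for a, lp in zip(advantages, seq_logps)) / n


# Rounded values copied from the outputs above
adv_vals = [0.8659, 0.8659, -0.8659, -0.8659]
logp_vals = [-7.9243, -20.1546, -16.6130, -23.3677]

loss_val = pg_loss_sketch(adv_vals, logp_vals)
print(round(loss_val, 4))  # -2.5764
```

Minimizing this loss raises the log-probabilities of rollouts with positive advantages and lowers those of rollouts with negative advantages.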

&nbsp;
## 6.10 Putting everything together in a GRPO step

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F18_raschka.webp" width=400>

In [23]:
def compute_grpo_loss(
    model,
    tokenizer,
    example,
    device,
    num_rollouts=2,
    max_new_tokens=256,
    temperature=0.8,
    top_p=0.9,
):
    assert num_rollouts >= 2
    roll_logps, roll_rewards, samples = [], [], []
    prompt = render_prompt(example["problem"])

    was_training = model.training
    model.eval()

    for _ in range(num_rollouts):
        # Stage 1: generate rollouts
        token_ids, prompt_len, text = sample_response(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            device=device,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
        )
        # Stage 2: compute rewards
        reward = reward_rlvr(text, example["answer"])
        
        # Stage 4: compute logprobs
        logp = sequence_logprob(model, token_ids, prompt_len)

        roll_logps.append(logp)
        roll_rewards.append(reward)
        samples.append(
            {
                "text": text,
                "reward": reward,
                "gen_len": token_ids.numel() - prompt_len,
            }
        )

    if was_training:
        model.train()

    # Stage 2: collect all rewards
    rewards = torch.tensor(roll_rewards, device=device)

    # Stage 3: compute advantages
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

    # Stage 4: collect all logprobs
    logps = torch.stack(roll_logps)

    # Stage 5: compute policy gradient loss
    pg_loss = -(advantages.detach() * logps).mean()
    loss = pg_loss  # In the next chapter we add a KL term here

    return {
        "loss": loss.item(),
        "pg_loss": pg_loss.item(),
        "rewards": roll_rewards,
        "advantages": advantages.detach().cpu().tolist(),
        "samples": samples,
        "loss_tensor": loss,
    }

- The stages in the code comments map to the stages in the GRPO figure
- Note that inside the loop, stage 1 is followed by stages 2 and 4 (instead of 2 and 3) in the code comments, since this results in a simpler code implementation (so that we don't have to implement multiple for-loops); stage 3 is computed after the loop, once all rewards have been collected

In [24]:
torch.manual_seed(123)

stats = compute_grpo_loss(
    model=model,
    tokenizer=tokenizer,
    example=math_train[4],
    device=device,
    num_rollouts=2,
    max_new_tokens=256,
    temperature=0.8,
    top_p=0.9
)

pprint(stats)

{'advantages': [0.0, 0.0],
 'loss': -0.0,
 'loss_tensor': tensor(-0., grad_fn=<NegBackward0>),
 'pg_loss': -0.0,
 'rewards': [0.0, 0.0],
 'samples': [{'gen_len': 3, 'reward': 0.0, 'text': ' 14'},
             {'gen_len': 256,
              'reward': 0.0,
              'text': ' 4\n'
                      'To solve this problem, we can set up an equation based '
                      'on the given information.\n'
                      '\n'
                      'Let \\( x \\) be the number of days Sam worked and \\( '
                      'y \\) be the number of days he did not work. We know '
                      'that:\n'
                      '\n'
                      '1. \\( x + y = 20 \\) (since he worked for 20 days)\n'
                      '2. For each day he works, he earns $60, and for each '
                      'day he does not work, $30 is subtracted from his '
                      'earnings.\n'
                      '3. At the end of the 20-day period, he received $66

&nbsp;
## 6.11 Implementing the GRPO training loop

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F19_raschka.webp" width=600>

- We skip batching due to the already expensive resource requirements

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F20_raschka.webp" width=400>

In [25]:
import time

def train_rlvr_grpo(
    model,
    tokenizer,
    math_data,
    device,
    steps=None,
    num_rollouts=2,
    max_new_tokens=256,
    temperature=0.8,
    top_p=0.9,
    lr=1e-5,
    checkpoint_every=50,
    checkpoint_dir=".",
    csv_log_path=None,
):
    if steps is None:
        steps = len(math_data)

    # Stage 1: initialize optimizer
    # (the model was already initialized outside the function)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    current_step = 0
    if csv_log_path is None:
        timestamp = time.strftime("%Y%m%d_%H%M%S")
        csv_log_path = f"train_rlvr_grpo_metrics_{timestamp}.csv"
    csv_log_path = Path(csv_log_path)

    try:
        # Stage 2: Iterate over training steps
        for step in range(steps):

            # Stage 3: Reset loss gradient
            # (it's best practice to do this at the beginning of each step)
            optimizer.zero_grad()

            current_step = step + 1
            example = math_data[step % len(math_data)]

            # Stage 4: calculate GRPO loss
            stats = compute_grpo_loss(
                model=model,
                tokenizer=tokenizer,
                example=example,
                device=device,
                num_rollouts=num_rollouts,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                top_p=top_p,
            )

            # Stage 5: Backward pass to calculate loss gradients
            stats["loss_tensor"].backward()

            # Clip large gradients to improve training stability
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Stage 6: Update model weights using loss gradients
            optimizer.step()

            # Stage 7: Collect rewards, response lengths, and losses
            reward_avg = torch.tensor(stats["rewards"]).mean().item()
            step_tokens = sum(
                sample["gen_len"] for sample in stats["samples"]
            )
            avg_response_len = (
                step_tokens / len(stats["samples"]) 
                if stats["samples"] else 0.0
            )
            append_csv_metrics(
                csv_log_path, current_step, steps, stats["loss"],
                reward_avg, avg_response_len,
            )

            # Print step metrics
            print(
                f"[Step {current_step}/{steps}] "
                f"loss={stats['loss']:.4f} "
                f"reward_avg={reward_avg:.3f} "
                f"avg_resp_len={avg_response_len:.1f}"
            )

            # Sample outputs (every 10 steps) to check if model
            # generates coherent text
            if current_step % 10 == 0:
                print(f"[Step {current_step}] sample outputs")
                for i, sample in enumerate(stats["samples"][:3]):
                    text = sample["text"].replace("\n", "\\n")
                    print(
                        f"  {i+1}) reward={sample['reward']:.3f} "
                        f"len={sample['gen_len']}: {text}"
                    )
                print()

            # Stage 8: Save model checkpoint
            if checkpoint_every and current_step % checkpoint_every == 0:
                ckpt_path = save_checkpoint(
                    model=model,
                    checkpoint_dir=checkpoint_dir,
                    step=current_step,
                )
                print(f"Saved checkpoint to {ckpt_path}")

    # Save a model checkpoint if we interrupt the training early
    except KeyboardInterrupt:
        ckpt_path = save_checkpoint(
            model=model,
            checkpoint_dir=checkpoint_dir,
            step=max(1, current_step),
            suffix="interrupt",
        )
        print(f"\nKeyboardInterrupt. Saved checkpoint to {ckpt_path}")
        return model

    return model


def save_checkpoint(model, checkpoint_dir, step, suffix=""):
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    suffix = f"-{suffix}" if suffix else ""
    ckpt_path = (
        checkpoint_dir /
        f"qwen3-0.6B-rlvr-grpo-step{step:05d}{suffix}.pth"
    )
    torch.save(model.state_dict(), ckpt_path)
    return ckpt_path


def append_csv_metrics(
    csv_log_path,
    step_idx,
    total_steps,
    loss,
    reward_avg,
    avg_response_len,
):
    if not csv_log_path.exists():
        csv_log_path.write_text(
            "step,total_steps,loss,reward_avg,avg_response_len\n",
            encoding="utf-8",
        )
    with csv_log_path.open("a", encoding="utf-8") as f:
        f.write(
            f"{step_idx},{total_steps},{loss:.6f},{reward_avg:.6f},"
            f"{avg_response_len:.6f}\n"
        )

- Everything except for stage 4, the GRPO loss calculation, is part of the standard training loop when training deep neural networks (including LLMs)
- The `append_csv_metrics` function appends each step's metrics to a CSV file for record keeping (and to visualize the results in chapter 7)
- For a general introduction to training neural networks in PyTorch, please see sections 3-8 in my [PyTorch in One Hour: From Tensors to Training Neural Networks on Multiple GPUs](https://sebastianraschka.com/teaching/pytorch-1h/) article
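
- As a minimal sketch of how the logged metrics could be inspected later, the code below reads a CSV file with the same header that `append_csv_metrics` writes and smooths the per-step average reward
- The helper names `load_metrics` and `moving_average` are hypothetical (they are not part of `reasoning_from_scratch`), and the demo rows are copied from the first five steps of the training output shown later in this section

```python
import csv
import tempfile
from pathlib import Path


def load_metrics(csv_path):
    # Hypothetical helper: read the metrics CSV into a list of dicts
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return [
            {
                "step": int(row["step"]),
                "loss": float(row["loss"]),
                "reward_avg": float(row["reward_avg"]),
                "avg_response_len": float(row["avg_response_len"]),
            }
            for row in csv.DictReader(f)
        ]


def moving_average(values, window=3):
    # Trailing moving average; the first entries use a shorter window
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Demo file in the same format as the training log
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_metrics.csv"
    demo_path.write_text(
        "step,total_steps,loss,reward_avg,avg_response_len\n"
        "1,50,-0.000000,0.000000,5.500000\n"
        "2,50,-0.000000,0.000000,6.800000\n"
        "3,50,0.359200,0.250000,7.800000\n"
        "4,50,2.740100,0.250000,56.500000\n"
        "5,50,3.321400,0.500000,251.200000\n",
        encoding="utf-8",
    )
    metrics = load_metrics(demo_path)

rewards = [m["reward_avg"] for m in metrics]
smoothed = moving_average(rewards, window=3)
for m, s in zip(metrics, smoothed):
    print(f"step={m['step']} reward_avg={m['reward_avg']:.3f} smoothed={s:.3f}")
```

- Because single-step rewards are averaged over only a handful of rollouts, they fluctuate a lot, so a smoothed curve is usually easier to read than the raw values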

In [26]:
device = get_device()
model.to(device)

torch.manual_seed(1)

train_rlvr_grpo(
    model=model,
    tokenizer=tokenizer,
    math_data=math_train,
    device=device,
    steps=50,
    num_rollouts=4,
    max_new_tokens=512,
    temperature=0.8,
    top_p=0.9,
    lr=1e-5,
    checkpoint_every=5,
    checkpoint_dir=".",
    csv_log_path="train_rlvr_grpo_metrics.csv",
)

Using Apple Silicon GPU (MPS)
[Step 1/50] loss=-0.0000 reward_avg=0.000 avg_resp_len=5.5
[Step 2/50] loss=-0.0000 reward_avg=0.000 avg_resp_len=6.8
[Step 3/50] loss=0.3592 reward_avg=0.250 avg_resp_len=7.8
[Step 4/50] loss=2.7401 reward_avg=0.250 avg_resp_len=56.5
[Step 5/50] loss=3.3214 reward_avg=0.500 avg_resp_len=251.2
Saved checkpoint to qwen3-0.6B-rlvr-grpo-step00005.pth
[Step 6/50] loss=-0.0000 reward_avg=0.000 avg_resp_len=14.2

KeyboardInterrupt. Saved checkpoint to qwen3-0.6B-rlvr-grpo-step00007-interrupt.pth


Qwen3Model(
  (tok_emb): Embedding(151936, 1024)
  (trf_blocks): ModuleList(
    (0-27): 28 x TransformerBlock(
      (att): GroupedQueryAttention(
        (W_query): Linear(in_features=1024, out_features=2048, bias=False)
        (W_key): Linear(in_features=1024, out_features=1024, bias=False)
        (W_value): Linear(in_features=1024, out_features=1024, bias=False)
        (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
        (q_norm): RMSNorm()
        (k_norm): RMSNorm()
      )
      (ff): FeedForward(
        (fc1): Linear(in_features=1024, out_features=3072, bias=False)
        (fc2): Linear(in_features=1024, out_features=3072, bias=False)
        (fc3): Linear(in_features=3072, out_features=1024, bias=False)
      )
      (norm1): RMSNorm()
      (norm2): RMSNorm()
    )
  )
  (final_norm): RMSNorm()
  (out_head): Linear(in_features=1024, out_features=151936, bias=False)
)

- If you run into memory-related issues when running the code above, you can lower the number of rollouts (e.g., `num_rollouts=2`) and the maximum number of tokens per rollout (e.g., `max_new_tokens=128`)
- However, getting a reasonably good model requires at least `num_rollouts=8` and `max_new_tokens=512`
- If you can't run it on your available hardware, no worries, the next section shows how to download a pre-trained checkpoint

- Note that either way, the code will likely run very slowly, because GRPO is a resource-intensive procedure
- You can interrupt the run anytime, and it will save the latest model checkpoint in the directory passed as `checkpoint_dir` (the current directory in the call above)
- If you are interested in using cloud GPUs, please see the [GPU Cloud Resources](../../ch02/02_setup-tips/gpu-instructions.md) document for recommendations
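
- After several regular saves and possibly an interrupted run, a directory can contain multiple checkpoint files; the sketch below picks the one with the highest step number
- It assumes the file naming used in `save_checkpoint` (`...-step00005.pth`, optionally followed by a suffix such as `-interrupt`); the helper name `find_latest_checkpoint` is hypothetical and not part of `reasoning_from_scratch`

```python
import re
import tempfile
from pathlib import Path

# Matches names such as "qwen3-0.6B-rlvr-grpo-step00005.pth" and
# "qwen3-0.6B-rlvr-grpo-step00007-interrupt.pth"
CKPT_PATTERN = re.compile(r"-step(\d+)(-[\w-]+)?\.pth$")


def find_latest_checkpoint(checkpoint_dir):
    # Hypothetical helper: return the checkpoint path with the highest
    # step number, or None if no matching file exists
    best_step, best_path = -1, None
    for path in Path(checkpoint_dir).glob("*.pth"):
        match = CKPT_PATTERN.search(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo with empty placeholder files named like the ones in the output above
with tempfile.TemporaryDirectory() as tmp_dir:
    empty_result = find_latest_checkpoint(tmp_dir)
    for name in [
        "qwen3-0.6B-rlvr-grpo-step00005.pth",
        "qwen3-0.6B-rlvr-grpo-step00007-interrupt.pth",
    ]:
        (Path(tmp_dir) / name).touch()
    latest_name = find_latest_checkpoint(tmp_dir).name

print(latest_name)
```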

- Note that this code does not support batched training
- This is a deliberate choice to keep the code simpler and more readable, and because sampling multiple (potentially long) rollouts can already be very resource-intensive
- However, if you have access to multiple GPUs, you can use the optional, faster version of this code with batch and multi-GPU support, which can be found in the supplementary materials at [../02_rlvr_grpo_scripts_intro](../02_rlvr_grpo_scripts_intro)

&nbsp;
## 6.12 Loading and evaluating saved model checkpoints

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch06/CH06_F21_raschka.webp" width=600>

- The saved checkpoints can be loaded using `model.load_state_dict(torch.load(model_path))`, as explained in chapter 2, where `model_path` references the checkpoint `.pth` file
- These checkpoint files are also compatible with the model evaluation utilities from chapter 3
- For your convenience, you can use the evaluation scripts provided in chapter 3's bonus materials:

```bash
uv run ../../ch03/02_math500-verifier-scripts/evaluate_math500.py \
--dataset_size 500 \
--which_model base \
--checkpoint_path checkpoints/qwen3-0.6B-rlvr-grpo-step00050.pth
```
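
- The `load_state_dict` route mentioned above can be sketched as a save-and-reload round trip
- To keep the example small and self-contained, a tiny `torch.nn.Linear` layer stands in for the Qwen3 model (an assumption for illustration only); in the notebook, `model` would be the initialized `Qwen3Model` and `model_path` one of the saved `.pth` files

```python
import tempfile
from pathlib import Path

import torch

torch.manual_seed(0)

# Tiny stand-in module (assumption); replace with the Qwen3 model in practice
model = torch.nn.Linear(4, 2, bias=False)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo-step00001.pth"

    # Same saving call as in save_checkpoint
    torch.save(model.state_dict(), model_path)

    # A freshly initialized module with the same architecture
    restored = torch.nn.Linear(4, 2, bias=False)

    # map_location="cpu" loads the tensors onto the CPU regardless of the
    # device they were saved from; weights_only=True restricts loading to
    # tensors and other plain data
    state_dict = torch.load(model_path, map_location="cpu", weights_only=True)
    restored.load_state_dict(state_dict)

weights_match = torch.equal(model.weight, restored.weight)
print(weights_match)
```

- Loading onto the CPU first and then calling `.to(device)` is a common way to avoid device mismatches when a checkpoint was saved on different hardware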
    

- If you prefer not to run the GRPO training on your computer because it takes too long, you can also download the checkpoints that I uploaded to [rasbt/qwen3-from-scratch-grpo-checkpoints/tree/main/grpo_original_no_kl](https://huggingface.co/rasbt/qwen3-from-scratch-grpo-checkpoints/tree/main/grpo_original_no_kl) (click on the checkpoint file you want to download and then click the [download](https://huggingface.co/rasbt/qwen3-from-scratch-grpo-checkpoints/resolve/main/grpo_original_no_kl/qwen3-0.6B-rlvr-grpo-step00050.pth?download=true) button)
- For your convenience, you can also download the checkpoint directly here using Python:

In [27]:
from reasoning_from_scratch.qwen3 import download_qwen3_grpo_checkpoints

download_qwen3_grpo_checkpoints(grpo_type="no_kl", step="00050")

qwen3-0.6B-rlvr-grpo-step00050.pth: 100% (1433 MiB / 1433 MiB)


| Row  | Method                                 | Step | Max tokens | Num rollouts | MATH-500 Acc | Avg # of tokens |
| ---- | -------------------------------------- | ---- | ---------- | ------------ | ------------ | --------------- |
| 1    | Base (chapter 3)                       | -    |            |              | 15.2%        | 78.85           |
| 2    | Reasoning (chapter 3)                  | -    |            |              | 48.2%        | 1369.79         |
| 3    | GRPO original but no KL (this chapter) | 50   | 512        | 8            | 47.4%        | 586.11          |

- Based on the table above, we see that after only 50 steps, the trained model (row 3), which is initialized from the base model (row 1), is almost as good as the original reasoning variant (row 2)
- Note that training for longer may not improve the model and could even make it worse, as GRPO can be relatively unstable; the next chapter introduces additional tricks to improve the GRPO algorithm
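
- To make the comparison in the table more concrete, the short calculation below converts the accuracies into numbers of solved problems and compares the average response lengths
- It assumes that all 500 MATH-500 problems were used (as in the `--dataset_size 500` command above); the accuracy and token values are copied from the table

```python
dataset_size = 500  # assumption: full MATH-500 evaluation

# Values copied from the table above
results = {
    "base": {"acc": 0.152, "avg_tokens": 78.85},
    "reasoning": {"acc": 0.482, "avg_tokens": 1369.79},
    "grpo_no_kl": {"acc": 0.474, "avg_tokens": 586.11},
}

# Number of correctly solved problems per model
num_correct = {
    name: round(r["acc"] * dataset_size) for name, r in results.items()
}

gap_problems = num_correct["reasoning"] - num_correct["grpo_no_kl"]
gain_problems = num_correct["grpo_no_kl"] - num_correct["base"]
token_ratio = (
    results["grpo_no_kl"]["avg_tokens"] / results["reasoning"]["avg_tokens"]
)

print(f"Solved problems: {num_correct}")
print(f"Gap to reasoning model: {gap_problems} problems")
print(f"Gain over base model: {gain_problems} problems")
print(f"Token usage relative to reasoning model: {token_ratio:.1%}")
```

- Under this assumption, the GRPO-trained model solves 4 fewer problems than the reasoning variant while generating less than half as many tokens per answer on average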

&nbsp;
## 6.13 Summary

- No code in this section