<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="https://mng.bz/lZ5B">Build a Reasoning Model (From Scratch)</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/reasoning-from-scratch">https://github.com/rasbt/reasoning-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="https://mng.bz/lZ5B"><img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 6: Exercise Solutions

Packages that are being used in this notebook:

In [None]:
from importlib.metadata import version

used_libraries = [
    "reasoning_from_scratch",
    "torch",
    "tokenizers"  # Used by reasoning_from_scratch
]

for lib in used_libraries:
    print(f"{lib} version: {version(lib)}")

&nbsp;
## Exercise 6.1: Adding format-aware reward shaping

- We can assign a partial reward (score 0.5) when the response contains no "\boxed{}" answer but the fallback extraction still finds the correct answer; for this, we use the `fallback="number_then_full"` setting we coded in chapter 3:

In [None]:
from reasoning_from_scratch.ch03 import (
    extract_final_candidate, grade_answer
)

def partial_reward_rlvr(answer_text, ground_truth):
    """Return 1.0 for a correct boxed answer, 0.5 for a correct
    unboxed answer, and 0.0 otherwise."""

    # 1) Try to extract a boxed answer
    boxed = extract_final_candidate(
        answer_text, fallback=None
    )
    if boxed:
        correct = grade_answer(boxed, ground_truth)
        return 1.0 if correct else 0.0

    # 2) If no boxed answer is found, look for number
    unboxed = extract_final_candidate(
        answer_text, fallback="number_then_full"
    )
    if unboxed:
        correct = grade_answer(unboxed, ground_truth)
        return 0.5 if correct else 0.0

    return 0.0
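
- To illustrate the three reward tiers without the chapter 3 helpers, here is a minimal, self-contained sketch; the `toy_*` functions are hypothetical stand-ins (not part of `reasoning_from_scratch`) that use simple regular expressions and exact string comparison instead of the more robust extraction and grading in `extract_final_candidate` and `grade_answer`

```python
import re


def toy_extract_boxed(text):
    # Hypothetical stand-in: content of the last \boxed{...}, or None.
    # Simplification: nested braces are not handled.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def toy_extract_last_number(text):
    # Hypothetical stand-in: last integer or decimal in the text, or None.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


def toy_partial_reward(answer_text, ground_truth):
    # 1) A boxed answer is graded strictly: full reward or nothing
    boxed = toy_extract_boxed(answer_text)
    if boxed is not None:
        return 1.0 if boxed == ground_truth else 0.0

    # 2) Without a boxed answer, a correct number earns partial reward
    number = toy_extract_last_number(answer_text)
    if number is not None:
        return 0.5 if number == ground_truth else 0.0

    return 0.0


examples = [
    r"So the result is \boxed{42}",  # correct and boxed
    "So the result is 42",           # correct but not boxed
    r"So the result is \boxed{41}",  # boxed but wrong
    "I am not sure",                 # no answer at all
]
rewards = [toy_partial_reward(text, "42") for text in examples]
print(rewards)  # [1.0, 0.5, 0.0, 0.0]
```

- Note that, as in `partial_reward_rlvr`, a wrong boxed answer receives 0.0 directly and does not fall through to the number-based fallback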

- When plugged into the chapter 6 code and trained under the same settings, the partial-reward variant achieves lower accuracy (37.8%) than the standard GRPO setup (47.4%), despite using a similar number of tokens on average (550.33 versus 586.11):

| # | Method                                   | Step | Max tokens | Num rollouts | Accuracy | Average tokens |
|---|------------------------------------------|------|------------|--------------|----------|----------------|
| 1 | GRPO (chapter 6)                         | 50   | 512        | 8            | 47.4%    | 586.11         |
| 2 | GRPO partial rewards (exercise 6.1)      | 50   | 512        | 8            | 37.8%    | 550.33         |

&nbsp;
## Exercise 6.2: Zero-advantage cases

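- If the rewards are all equal (for instance, they are all 0 or all 1), the advantages will all be 0, because subtracting the mean removes the shared reward value and leaves only zeros. The standard deviation is also 0 in this case, so the small constant `1e-4` in the denominator is what prevents a division by zero. We can demonstrate this below: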

In [3]:
import torch

rollout_rewards = [0., 0., 0., 0.]
rewards = torch.tensor(rollout_rewards)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

print(advantages)

tensor([0., 0., 0., 0.])


In [4]:
rollout_rewards = [1., 1., 1., 1.]
rewards = torch.tensor(rollout_rewards)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

print(advantages)

tensor([0., 0., 0., 0.])
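
- For contrast, the sketch below applies the same formula to a group with mixed rewards, where the advantages are no longer zero. It uses only the Python standard library so that it runs without PyTorch; the helper name `group_advantages` is introduced here for illustration and is not part of the chapter code

```python
from statistics import mean, stdev


def group_advantages(rollout_rewards, eps=1e-4):
    # Same formula as in the cells above: (r - mean) / (std + eps).
    # statistics.stdev computes the sample standard deviation
    # (dividing by N - 1), like the default of torch.Tensor.std().
    mu = mean(rollout_rewards)
    sigma = stdev(rollout_rewards)
    return [(r - mu) / (sigma + eps) for r in rollout_rewards]


# All rewards equal: every advantage is zero
print(group_advantages([0.0, 0.0, 0.0, 0.0]))

# One correct rollout among four: the advantages now differ
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])
print([round(a, 3) for a in mixed])
```

- In the mixed case, the single correct rollout receives a positive advantage (about 1.5) and the three incorrect rollouts receive a negative one (about -0.5), which is the relative signal that is missing when all rewards are equal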


- Now, if all advantages are 0, the policy-gradient loss will be zero as well, because the loss multiplies the advantages by the log probabilities, and multiplying by zero eliminates every rollout's contribution:

```python
pg_loss = -(advantages.detach() * logps).mean()
```

- As a result, the policy gradient is zero and the model parameters are not updated for that prompt
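
- The step from zero advantages to a zero gradient can be written out explicitly. As a sketch, assume that `logps` holds one log-probability $\log \pi_\theta(o_i)$ per rollout $o_i$ in a group of $G$ rollouts, and that the advantages $A_i$ are constants with respect to the parameters $\theta$ (which is what `.detach()` ensures); these symbols are introduced here for illustration only

$$
\nabla_\theta \mathcal{L}
= -\frac{1}{G} \sum_{i=1}^{G} A_i \, \nabla_\theta \log \pi_\theta(o_i)
$$

- If $A_i = 0$ for every $i$, each term in the sum vanishes, so $\nabla_\theta \mathcal{L} = 0$ regardless of the values of the log probabilities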

- This behavior is intentional; if all rollouts for a prompt are equally bad or equally good, there is no relative signal to tell the model which behavior to reinforce or suppress
- Intuitively, if the model answers the prompt correctly in every rollout, there is no need to update it
- Conversely, if the model answers the prompt incorrectly in every rollout, we don't want to update the model in a way that reinforces this behavior