<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="https://mng.bz/lZ5B">Build a Reasoning Model (From Scratch)</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/reasoning-from-scratch">https://github.com/rasbt/reasoning-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="https://mng.bz/lZ5B"><img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 3: Exercise Solutions

Packages that are being used in this notebook:

In [1]:
from importlib.metadata import version

used_libraries = [
    "reasoning_from_scratch",
    "torch",
    "tokenizers"  # Used by reasoning_from_scratch
]

for lib in used_libraries:
    print(f"{lib} version: {version(lib)}")

reasoning_from_scratch version: 0.1.4
torch version: 2.7.1
tokenizers version: 0.21.4


&nbsp;
## Exercise 3.1: Adding more test cases

- There is an endless number of different test cases we may add
- Below is a selection of some interesting ones

In [2]:
from reasoning_from_scratch.ch03 import (
    run_demos_table
)

more_tests = [
    # Different bracket types
    ("check_16", "[1, 2]", "(1, 2)", True),

    # Scientific notation
    ("check_17", "1e-3", "0.001", True),

    # Algebraic simplification with caret exponent
    ("check_18", "(-3)^2", "9", True),

    # Unicode minus (U+2212) vs ASCII hyphen-minus
    ("check_19", "−1", "-1", True),
]

run_demos_table(more_tests)

Test     | Expect | Got   | Status
check_16 | True   | True  | PASS  
check_17 | True   | True  | PASS  
check_18 | True   | True  | PASS  
check_19 | True   | False | FAIL  

Passed 3/4


- As we can see, all tests pass except `check_19`, which replaces the ASCII hyphen-minus with the Unicode minus sign (U+2212); the two characters look nearly identical to the human eye but are different code points
- We could fix this test case by adding one of the following two equivalent lines to the body of the `normalize_text` function (before its final `return`)

```python
text = text.replace("−", "-")
# or
text = text.replace("\u2212", "-")
```
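
- The same idea can be extended to other look-alike characters; the following is a minimal, self-contained sketch, where `normalize_minus` and the `DASH_LOOKALIKES` table are hypothetical names (not part of `reasoning_from_scratch`) and the choice of characters is an assumption:

```python
# Hypothetical helper, not part of reasoning_from_scratch: map a few
# Unicode look-alikes of the minus sign to the ASCII hyphen-minus.
# Which characters to include is an assumption; extend as needed.
DASH_LOOKALIKES = {
    "\u2212": "-",  # MINUS SIGN
    "\u2010": "-",  # HYPHEN
    "\ufe63": "-",  # SMALL HYPHEN-MINUS
    "\uff0d": "-",  # FULLWIDTH HYPHEN-MINUS
}
_DASH_TABLE = str.maketrans(DASH_LOOKALIKES)


def normalize_minus(text):
    """Replace minus-sign look-alikes with the ASCII hyphen-minus."""
    return text.translate(_DASH_TABLE)


print(normalize_minus("\u22121") == "-1")  # Unicode minus becomes ASCII
print(normalize_minus("-1") == "-1")       # ASCII input is left unchanged
```

- A translation table handles all listed characters in a single pass, which may be more convenient than chaining several `replace` calls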

- At first glance, another interesting test is the following one:

In [3]:
extra_tests_1 = [
    ("check_20", "Text around answer 3.", "3", True)
]

run_demos_table(extra_tests_1)

Test     | Expect | Got   | Status
check_20 | True   | False | FAIL  

Passed 0/1


- While it may seem that our code cannot handle such text-containing cases, this is actually a poorly designed test
- In practice, the `run_demos_table` function is intended specifically to test the `grade_answer` function; nothing more, nothing less
- The `grade_answer` function would never receive the entire answer in this form, since the answer would have been extracted from the text before being passed to it

- That is, to test an answer embedded in text, we first extract the final answer with `extract_final_candidate` and pass the result to the test:

In [4]:
from reasoning_from_scratch.ch03 import (
    extract_final_candidate
)


extra_tests_2 = [
    ("check_20",
     extract_final_candidate("Text around answer 3."),
     "3", True)
]
run_demos_table(extra_tests_2)

Test     | Expect | Got  | Status
check_20 | True   | True | PASS  

Passed 1/1
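
- The separation between extraction and grading can also be illustrated without the library; the sketch below uses two deliberately simplified, hypothetical stand-ins (`toy_extract` and `toy_grade`), which are assumptions for illustration and much less thorough than the chapter's `extract_final_candidate` and `grade_answer`:

```python
import re

# Hypothetical, simplified stand-ins for the two stages. They only
# illustrate the extract-then-grade order and are not the chapter's code.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def toy_extract(text):
    """Return the last number found in the text, or "" if there is none."""
    matches = NUMBER_RE.findall(text)
    return matches[-1] if matches else ""


def toy_grade(candidate, ground_truth):
    """Compare two already-extracted answers as stripped strings."""
    return candidate.strip() == ground_truth.strip()


raw = "Text around answer 3."
print(toy_grade(raw, "3"))               # False: grading the raw text
print(toy_grade(toy_extract(raw), "3"))  # True: extract first, then grade
```

- The grader only has to compare clean candidates, so tests for it should be fed extracted answers rather than full responses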


&nbsp;
## Exercise 3.2: Calculating the average response length

- Option A: We could modify the `evaluate_math500_stream` function by adding the following lines:

```python
# ...
# below `num_correct = 0`
total_len = 0

# ...
# inside for i, row in enumerate(math_data, start=1):
# anywhere below `gen_text = ...`
total_len += len(tokenizer.encode(gen_text))

# ...
# anywhere at the bottom before the return statement
avg_len = total_len / num_examples
print(f"Average length: {avg_len:.2f} tokens")
```

- Option B: Alternatively, we can calculate the response lengths from the `.jsonl` files that were created when we ran the `evaluate_math500_stream` function in the main chapter
- First, we load the `.jsonl` file as follows:

In [5]:
import json
from pathlib import Path

WHICH_MODEL = "base"

# You may need to adjust this path:
local_path = Path(f"math500_{WHICH_MODEL}-mps.jsonl")
if not local_path.exists():
    raise FileNotFoundError(
        f"{local_path} not found. Run ch03_main.ipynb to create it."
    )

results = []
with open(local_path, "r") as f:
    for line in f:
        if line.strip():
            results.append(json.loads(line))

print("Number of entries:", len(results))

Number of entries: 10
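
- If we want to load several result files (for example, one per model or device), the loading loop above could be wrapped in a small function; the sketch below is one possible version, where `load_jsonl` is a hypothetical helper name and the error handling is an added assumption:

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # Tolerate empty lines, as in the loop above
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line to make truncated files easy to spot
                raise ValueError(
                    f"{path}:{line_no}: invalid JSON ({err})"
                ) from err
    return records
```

- With this helper, the cell above would reduce to `results = load_jsonl(local_path)`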


- Note that each entry has multiple keys; however, we are only interested in the `"generated_text"` key, which contains the model's full answer:

In [6]:
print(results[0].keys())

dict_keys(['index', 'problem', 'gtruth_answer', 'generated_text', 'extracted', 'correct'])


- Next, we load the tokenizer that matches the model, so that we can count the tokens in each generated answer:

In [7]:
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Tokenizer
)

if WHICH_MODEL == "base":

    download_qwen3_small(
        kind="base", tokenizer_only=True, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-base.json"
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)

elif WHICH_MODEL == "reasoning":

    download_qwen3_small(
        kind="reasoning", tokenizer_only=True, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-reasoning.json"
    tokenizer = Qwen3Tokenizer(
        tokenizer_file_path=tokenizer_path,
        apply_chat_template=True,
        add_generation_prompt=True,
        add_thinking=True,
    )

✓ qwen3/tokenizer-base.json already up-to-date


- Then, we can calculate the average length as follows, which is similar to how we could have modified the `evaluate_math500_stream` function:

In [8]:
total_len = 0

for item in results:
    num_tokens = len(tokenizer.encode(item["generated_text"]))
    total_len += num_tokens

avg_len = total_len / len(results)
print(f"Average length: {avg_len:.2f} tokens")

Average length: 98.00 tokens


| Mode      | Device  | Average length (tokens) | MATH-500 size  |
|-----------|---------|-------------------------|----------------|
| Base      | CPU     | 97.30                   | 10             |
| Base      | MPS     | 98.00                   | 10             |
| Reasoning | CPU     | 891.80                  | 10             |
| Reasoning | MPS     | 1159.30                 | 10             |
|           |         |                         |                |
| Base      | CUDA    | 96.74                   | 500            |
| Reasoning | CUDA    | 1361.21                 | 500            |


- As expected, the reasoning model writes much longer responses than the base model in every setting
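
- To quantify the difference, we can divide the reasoning model's average length by the base model's average length for each setting; the short sketch below copies the numbers from the table above (the dictionary layout and variable names are only for illustration):

```python
# Average response lengths (in tokens), copied from the table above
avg_lengths = {
    "CPU (10 examples)": {"base": 97.3, "reasoning": 891.80},
    "MPS (10 examples)": {"base": 98.0, "reasoning": 1159.30},
    "CUDA (500 examples)": {"base": 96.74, "reasoning": 1361.21},
}

# Ratio of reasoning-model length to base-model length per setting
ratios = {
    setting: vals["reasoning"] / vals["base"]
    for setting, vals in avg_lengths.items()
}

for setting, ratio in ratios.items():
    print(f"{setting}: {ratio:.1f}x as long")
```

- In these runs, the reasoning model's responses are roughly 9 to 14 times as long as the base model's responses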

&nbsp;
## Exercise 3.3: Extending or changing the evaluation dataset

- To evaluate the model on a larger dataset, we can simply change `math_data[:10]` to a larger slice, such as `math_data[:100]`, or pass the full `math_data` list (500 examples) in the call below:

```python
num_correct, num_examples, acc = evaluate_math500_stream(
    model, tokenizer, device, 
    math_data=math_data[:10],
    max_new_tokens=2048,
    verbose=False
)
```

- The table below shows the accuracy values for different dataset sizes (since the MATH-500 test set is already shuffled, no additional shuffling was applied)

| Mode      | Device  | Accuracy | MATH-500 size  |
|-----------|---------|----------|----------------|
| Base      | CUDA    | 30.0%    | 10             |
| Base      | CUDA    | 34.0%    | 50             |
| Base      | CUDA    | 27.0%    | 100            |
| Base      | CUDA    | 31.0%    | 200            |
| Base      | CUDA    | 15.3%    | 500            |
|           |         |          |                |
| Reasoning | CUDA    | 90.0%    | 10             |
| Reasoning | CUDA    | 58.0%    | 50             |
| Reasoning | CUDA    | 58.0%    | 100            |
| Reasoning | CUDA    | 56.0%    | 200            |
| Reasoning | CUDA    | 50.8%    | 500            |


- As the results above show, the first 10 examples are not very representative of the performance on all 500 MATH-500 examples; for instance, the reasoning model scores 90.0% on the first 10 examples but only 50.8% on the full set
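
- One way to see why small subsets are noisy is the binomial standard error of an accuracy estimate, `sqrt(acc * (1 - acc) / n)`; the sketch below is an added illustration (not from the chapter) and assumes that the examples behave like independent draws, which is only a rough approximation for very small `n`:

```python
import math


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy estimated from n examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


# Reasoning-model accuracies from the table above
for acc, n in [(0.90, 10), (0.508, 500)]:
    se = accuracy_std_error(acc, n)
    print(f"n={n:>3}: accuracy={acc:.1%}, standard error ~ {se:.1%}")
```

- Under these assumptions, the uncertainty is about 9.5 percentage points for 10 examples but only about 2.2 percentage points for 500 examples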

- In addition, we can create an entirely new dataset in a similar style to MATH-500
- For example, this book's GitHub repository includes a 50-example dataset in MATH-500 style at https://github.com/rasbt/reasoning-from-scratch/tree/main/ch03/01_main-chapter-code; we can use it in the main chapter by changing the filename from `math500_test.json` to `math_new50_exercise.json`
- The performance of the base and reasoning models is as follows:
    - base: 36.0% (18/50)
    - reasoning: 80.0% (40/50)
- From this, we can conclude that while the original MATH-500 test dataset may have been included in Qwen3's training data, the models perform at least as well on the new math questions as on MATH-500 (keeping in mind that the two datasets may differ in difficulty), which suggests that they are not suffering from extensive overfitting to the original MATH-500 data

&nbsp;
## Exercise 3.4: Experimenting with different prompt templates 

- We could use an alternative prompt, similar to the one suggested in the chapter, that uses "Problem" instead of "Question":

```python
def render_prompt(prompt):
    template = (
        "You are a helpful math assistant.\n"
        "Solve the problem and write the final result on a new line as:\n"
        "\\boxed{ANSWER}\n\n"
        f"Problem:\n{prompt}\n\nAnswer:"
    )
    return template
```

- Using this prompt improves the accuracy of the base model on the 500 examples from 15.3% to 31.2%
- In contrast, it slightly reduces the accuracy of the reasoning model, from 50.8% to 50.0%
- From these observations, we may conclude that the base model is much more sensitive to the prompt format than the reasoning model (likely due to memorizing some prompt-formatted MATH-500 examples from the training set); the reasoning model seems largely unaffected
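
- To try further variations, the section label could be made a parameter; the sketch below is a hypothetical variant (`render_prompt_with_label` is not part of the book's code) that changes only the label and keeps the instruction sentence fixed, so the "Question" version is not necessarily identical to the chapter's original template:

```python
def render_prompt_with_label(prompt, label="Problem"):
    """Hypothetical variant of render_prompt with a configurable label."""
    return (
        "You are a helpful math assistant.\n"
        "Solve the problem and write the final result on a new line as:\n"
        "\\boxed{ANSWER}\n\n"
        f"{label}:\n{prompt}\n\nAnswer:"
    )


# Render the same problem with two different labels for comparison
for label in ("Problem", "Question"):
    print(render_prompt_with_label("What is 1 + 1?", label=label))
    print("-" * 40)
```

- Each variant would then be evaluated with `evaluate_math500_stream` on the same examples, so that any accuracy difference can be attributed to the template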