<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="https://mng.bz/lZ5B">Build a Reasoning Model (From Scratch)</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/reasoning-from-scratch">https://github.com/rasbt/reasoning-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="https://mng.bz/lZ5B"><img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 3: Evaluating Reasoning Models

In [1]:
from importlib.metadata import version

used_libraries = [
    "reasoning_from_scratch",
    "torch",
    "sympy",
    "tokenizers"  # Used by reasoning_from_scratch
]

for lib in used_libraries:
    print(f"{lib} version: {version(lib)}")

reasoning_from_scratch version: 0.1.4
torch version: 2.7.1
sympy version: 1.14.0
tokenizers version: 0.21.4


<br>

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F01_raschka.webp?1" width="500px">

&nbsp;
## 3.1 Building a math verifier

- No code in this section

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F02_raschka.webp?1" width="500px">

<br>
<br>
<br>

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F03_raschka.webp?1" width="500px">

&nbsp;
## 3.2 Loading a pre-trained model to generate text

- In this section, we load the model (recap of chapter 2) that we want to evaluate
- Note that we use the base model here; once you have completed this chapter, you can rerun the notebook after changing `WHICH_MODEL = "base"` to `WHICH_MODEL = "reasoning"` to evaluate an already trained reasoning model

In [2]:
from pathlib import Path
import torch

from reasoning_from_scratch.ch02 import (
    get_device
)
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Tokenizer,
    Qwen3Model,
    QWEN_CONFIG_06_B
)

device = get_device()

# If you have compatibility issues, try to
# uncomment the line below and rerun the notebook
# device = "cpu"

WHICH_MODEL = "base"

if WHICH_MODEL == "base":

    download_qwen3_small(
        kind="base", tokenizer_only=False, out_dir="qwen3"
    )

    tokenizer_path = Path("qwen3") / "tokenizer-base.json"
    model_path = Path("qwen3") / "qwen3-0.6B-base.pth"
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)

elif WHICH_MODEL == "reasoning":

    download_qwen3_small(
        kind="reasoning", tokenizer_only=False, out_dir="qwen3"
    )

    tokenizer_path = Path("qwen3") / "tokenizer-reasoning.json"
    model_path = Path("qwen3") / "qwen3-0.6B-reasoning.pth"
    tokenizer = Qwen3Tokenizer(
        tokenizer_file_path=tokenizer_path,
        apply_chat_template=True,
        add_generation_prompt=True,
        add_thinking=True,
    )

else:
    raise ValueError(f"Invalid choice: WHICH_MODEL={WHICH_MODEL}")


model = Qwen3Model(QWEN_CONFIG_06_B)
model.load_state_dict(torch.load(model_path))

model.to(device)


USE_COMPILE = False  # Set to True to enable compilation
if USE_COMPILE:
    torch._dynamo.config.allow_unspec_int_on_nn_module = True
    model = torch.compile(model)

Using Apple Silicon GPU (MPS)
✓ qwen3/qwen3-0.6B-base.pth already up-to-date
✓ qwen3/tokenizer-base.json already up-to-date


- Instead of the `generate_text_basic_stream` function introduced in chapter 2, we use the slightly modified `generate_text_basic_stream_cache` version (from [exercise 2.2](../../ch02/01_main-chapter-code/ch02_exercise-solutions.ipynb)) as it prints the tokens as soon as they are generated, which is useful for debugging (it makes clear that the LLM is not stuck while generating the response)

In [3]:
from reasoning_from_scratch.ch02_ex import (
    generate_text_basic_stream_cache
)

prompt = (
    r"If $a+b=3$ and $ab=\tfrac{13}{6}$, "
    r"what is the value of $a^2+b^2$?"
)

# Similar to chapter 2 exercise solution:
input_token_ids_tensor = torch.tensor(
    tokenizer.encode(prompt),
    device=device
    ).unsqueeze(0)

all_token_ids = []
for token in generate_text_basic_stream_cache(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=2048,
    eos_token_id=tokenizer.eos_token_id
):
    token_id = token.squeeze(0)
    decoded_id = tokenizer.decode(token_id.tolist())
    print(
        decoded_id,
        end="",
        flush=True
    )
    all_token_ids.append(token_id)

all_tokens = tokenizer.decode(all_token_ids)

 To find the value of \( a^2 + b^2 \) given that \( a + b = 3 \) and \( ab = \frac{13}{6} \), we can use the following algebraic identity:

\[
a^2 + b^2 = (a + b)^2 - 2ab
\]

**Step 1:** Substitute the given values into the equation.

\[
a^2 + b^2 = (3)^2 - 2 \left( \frac{13}{6} \right)
\]

**Step 2:** Calculate \( (3)^2 \).

\[
(3)^2 = 9
\]

**Step 3:** Calculate \( 2 \times \frac{13}{6} \).

\[
2 \times \frac{13}{6} = \frac{26}{6} = \frac{13}{3}
\]

**Step 4:** Subtract the second result from the first.

\[
a^2 + b^2 = 9 - \frac{13}{3}
\]

**Step 5:** Convert 9 to a fraction with a denominator of 3 to perform the subtraction.

\[
9 = \frac{27}{3}
\]

\[
a^2 + b^2 = \frac{27}{3} - \frac{13}{3} = \frac{14}{3}
\]

**Final Answer:**

\[
\boxed{\dfrac{14}{3}}
\]

- If you are unfamiliar with LaTeX syntax, the response above can be very hard to read
- You can use the `Latex` class to render the LaTeX syntax to improve readability, as shown below

In [4]:
from IPython.display import Latex, display

display(Latex(all_tokens))

<IPython.core.display.Latex object>

- If you only want to render specific math expressions, you can also use the `Math` class:

In [5]:
from IPython.display import Math

display(Math(r"\dfrac{14}{3}"))

<IPython.core.display.Math object>

&nbsp;
## 3.3 Implementing a wrapper for easier text generation

- Above, we loaded the pre-trained LLM and set up the text generation functionality (as illustrated in the figure below), which are the first two steps of the evaluation process covered in the remainder of this chapter

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F05_raschka.webp?1" width="500px">

- For additional convenience, we create a wrapper function for the text generation function so that we only have to pass in the model, tokenizer, and prompt, along with some additional settings

In [6]:
def generate_text_stream_concat(
    model, tokenizer, prompt, device, max_new_tokens,
    verbose=False,
):
    input_ids = torch.tensor(
        tokenizer.encode(prompt), device=device
        ).unsqueeze(0)

    generated_ids = []
    for token in generate_text_basic_stream_cache(
        model=model,
        token_ids=input_ids,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    ):
        next_token_id = token.squeeze(0)
        generated_ids.append(next_token_id.item())

        if verbose:
            print(
                tokenizer.decode(next_token_id.tolist()),
                end="",
                flush=True
            )
    return tokenizer.decode(generated_ids)


skip_portion = False

if not skip_portion:
    generated_text = generate_text_stream_concat(
        model, tokenizer, prompt, device,
        max_new_tokens=2048,
        verbose=True
    )

 To find the value of \( a^2 + b^2 \) given that \( a + b = 3 \) and \( ab = \frac{13}{6} \), we can use the following algebraic identity:

\[
a^2 + b^2 = (a + b)^2 - 2ab
\]

**Step 1:** Substitute the given values into the equation.

\[
a^2 + b^2 = (3)^2 - 2 \left( \frac{13}{6} \right)
\]

**Step 2:** Calculate \( (3)^2 \).

\[
(3)^2 = 9
\]

**Step 3:** Calculate \( 2 \times \frac{13}{6} \).

\[
2 \times \frac{13}{6} = \frac{26}{6} = \frac{13}{3}
\]

**Step 4:** Subtract the second result from the first.

\[
a^2 + b^2 = 9 - \frac{13}{3}
\]

**Step 5:** Convert 9 to a fraction with a denominator of 3 to perform the subtraction.

\[
9 = \frac{27}{3}
\]

\[
a^2 + b^2 = \frac{27}{3} - \frac{13}{3} = \frac{14}{3}
\]

**Final Answer:**

\[
\boxed{\dfrac{14}{3}}
\]

&nbsp;
## 3.4 Extracting the final answer box

- In this section, we extract the answer box (step 3); in the next section, we will take the extracted answer and normalize it (step 4)

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F06_raschka.webp?1" width="500px">

In [7]:
model_answer = (
r"""... some explanation...
**Final Answer:**

\[
\boxed{\dfrac{14}{3}}
\]
""")

In [8]:
def get_last_boxed(text):
    # Find the last occurrence of "\boxed"
    boxed_start_idx = text.rfind(r"\boxed")
    if boxed_start_idx == -1:
        return None

    # Get position after "\boxed"
    current_idx = boxed_start_idx + len(r"\boxed")

    # Skip any whitespace after "\boxed"
    while current_idx < len(text) and text[current_idx].isspace():
        current_idx += 1

    # Expect an opening brace "{"
    if current_idx >= len(text) or text[current_idx] != "{":
        return None

    # Parse the braces with nesting
    current_idx += 1
    brace_depth = 1
    content_start_idx = current_idx

    while current_idx < len(text) and brace_depth > 0:
        char = text[current_idx]
        if char == "{":
            brace_depth += 1
        elif char == "}":
            brace_depth -= 1
        current_idx += 1

    # Account for unbalanced braces
    if brace_depth != 0:
        return None

    # Extract content inside the outermost braces
    return text[content_start_idx:current_idx-1]

In [9]:
extracted_answer = get_last_boxed(model_answer)
print(extracted_answer)

\dfrac{14}{3}
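
- To see why `get_last_boxed` tracks the brace depth instead of relying on a simple regular expression, the sketch below compares both approaches on a nested expression. The helper names `boxed_via_regex` and `boxed_via_depth` are hypothetical and only meant for illustration; `boxed_via_depth` is a condensed restatement of the brace-matching idea (without the whitespace handling), not a replacement for the function above.

```python
import re


def boxed_via_regex(text):
    # Hypothetical naive approach: the lazy match stops at the first "}"
    matches = re.findall(r"\\boxed\{(.*?)\}", text)
    return matches[-1] if matches else None


def boxed_via_depth(text):
    # Condensed brace-depth idea (assumes no whitespace after "\boxed")
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    idx = start + len(r"\boxed{")
    content_start, depth = idx, 1
    while idx < len(text) and depth > 0:
        if text[idx] == "{":
            depth += 1
        elif text[idx] == "}":
            depth -= 1
        idx += 1
    # Return None if the braces never balanced
    return text[content_start:idx - 1] if depth == 0 else None


sample = r"Final: \boxed{\dfrac{14}{3}}"
print(boxed_via_regex(sample))  # \dfrac{14
print(boxed_via_depth(sample))  # \dfrac{14}{3}
```

- The regex variant cuts the answer off at the first closing brace, whereas counting the depth returns the full content of the outermost braces.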


In [10]:
import re

RE_NUMBER = re.compile(
    r"-?(?:\d+/\d+|\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
)

def extract_final_candidate(text, fallback="number_then_full"):
    # Default return value if nothing matches
    result = ""

    if text:
        # Prefer the last boxed expression if present
        boxed = get_last_boxed(text.strip())
        if boxed:
            result = boxed.strip().strip("$ ")

        # If no boxed expression, try fallback
        elif fallback in ("number_then_full", "number_only"):
            m = RE_NUMBER.findall(text)
            if m:
                # Use last number
                result = m[-1]
            elif fallback == "number_then_full":
                # Else return full text if no number found
                result = text
    return result

- The `fallback` setting controls what is returned if no boxed content is found:
    - "number_then_full": pick the last simple number, else the whole text
    - "number_only": pick the last simple number, else return an empty string `""`
    - "none": extract only boxed content, else return empty string `""`

In [11]:
print(extract_final_candidate(model_answer))

\dfrac{14}{3}


In [12]:
print(extract_final_candidate(r"\boxed{ 14/3. }"))

14/3.


In [13]:
print(extract_final_candidate("abc < > 14/3 abc"))

14/3


In [14]:
print(extract_final_candidate("Text without numbers"))

Text without numbers
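
- The number fallback relies on `RE_NUMBER`, so it can help to look at what the pattern matches in isolation. The following sketch copies the pattern from above and applies it to a few made-up strings (the sample inputs are illustrative assumptions, not model outputs).

```python
import re

# Same pattern as RE_NUMBER above: optional sign, then a fraction,
# or an integer/decimal with optional scientific notation
RE_NUMBER = re.compile(
    r"-?(?:\d+/\d+|\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
)

samples = [
    "so x = -3.5e2 and y = 14/3",
    "the total is 1,234",
    "no digits here",
]

for s in samples:
    # findall returns every match; the fallback uses the last one
    print(repr(s), "->", RE_NUMBER.findall(s))
```

- Note that a number with a thousands separator is split into two matches (`'1'` and `'234'`), so the fallback would pick `234` in this case; the comma is only removed later during normalization, which is one reason why a boxed answer is the more reliable signal.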


&nbsp;
## 3.5 Normalizing the extracted answer

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F07_raschka.webp?1" width="500px">

- In the previous section, we extracted the answer (step 3); now, we normalize it (step 4 in the previous figure)

In [15]:
LATEX_FIXES = [  # LaTeX formatting to be replaced
    (r"\\left\s*", ""),
    (r"\\right\s*", ""),
    (r"\\,|\\!|\\;|\\:", ""),
    (r"\\cdot", "*"),
    (r"\u00B7|\u00D7", "*"),
    (r"\\\^\\circ", ""),
    (r"\\dfrac", r"\\frac"),
    (r"\\tfrac", r"\\frac"),
    (r"°", ""),
]

RE_SPECIAL = re.compile(r"<\|[^>]+?\|>")  # strip chat special tokens like <|assistant|>

def normalize_text(text):
    if not text:
        return ""
    text = RE_SPECIAL.sub("", text).strip()

    # Remove angle-degree markers
    text = re.sub(r"\^\s*\{\s*\\circ\s*\}", "", text)   # ^{\circ}
    text = re.sub(r"\^\s*\\circ", "", text)             # ^\circ
    text = text.replace("°", "")                        # Unicode degree

    # unwrap \text{...} if the whole string is wrapped
    match = re.match(r"^\\text\{(?P<x>.+?)\}$", text)
    if match:
        text = match.group("x")

    # strip inline/display math wrappers \( \) \[ \]
    text = re.sub(r"\\\(|\\\)|\\\[|\\\]", "", text)

    # light LaTeX canonicalization
    for pat, rep in LATEX_FIXES:
        text = re.sub(pat, rep, text)

    # numbers/roots
    text = text.replace("\\%", "%").replace("$", "").replace("%", "")
    text = re.sub(
        r"\\sqrt\s*\{([^}]*)\}",
        lambda match: f"sqrt({match.group(1)})",
        text,
    )
    text = re.sub(
        r"\\sqrt\s+([^\\\s{}]+)",
        lambda match: f"sqrt({match.group(1)})",
        text,
    )

    # fractions
    text = re.sub(
        r"\\frac\s*\{([^{}]+)\}\s*\{([^{}]+)\}",
        lambda match: f"({match.group(1)})/({match.group(2)})",
        text,
    )
    text = re.sub(
        r"\\frac\s+([^\s{}]+)\s+([^\s{}]+)",
        lambda match: f"({match.group(1)})/({match.group(2)})",
        text,
    )

    # exponent and mixed numbers
    text = text.replace("^", "**")
    text = re.sub(
        r"(?<=\d)\s+(\d+/\d+)",
        lambda match: "+" + match.group(1),
        text,
    )

    # 1,234 -> 1234
    text = re.sub(
        r"(?<=\d),(?=\d\d\d(\D|$))",
        "",
        text,
    )

    return text.replace("{", "").replace("}", "").strip().lower()

In [16]:
print(normalize_text(extract_final_candidate(model_answer)))

(14)/(3)


In [17]:
print(normalize_text(r"$\dfrac{14}{3.}$"))

(14)/(3.)


In [18]:
print(normalize_text(r"\text{\[\frac{14}{3}\]}"))

(14)/(3)


In [19]:
print(normalize_text("4/3"))

4/3
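
- Two of the less obvious steps in `normalize_text` are the mixed-number and thousands-separator substitutions. The sketch below applies just these two regular expressions (copied from the function above) to a few illustrative inputs; the helper name `mixed_and_thousands` is hypothetical.

```python
import re


def mixed_and_thousands(text):
    # Mixed numbers: whitespace between a digit and a fraction becomes "+"
    text = re.sub(
        r"(?<=\d)\s+(\d+/\d+)",
        lambda match: "+" + match.group(1),
        text,
    )
    # Thousands separators: drop a comma between digits if it is followed
    # by exactly three digits and then a non-digit or the end of the string
    text = re.sub(r"(?<=\d),(?=\d\d\d(\D|$))", "", text)
    return text


for s in ["2 1/3", "1,234", "1,234,567", "(1,2)", "1,2345"]:
    print(repr(s), "->", repr(mixed_and_thousands(s)))
```

- The first three inputs become `2+1/3`, `1234`, and `1234567`, while `(1,2)` and `1,2345` are left unchanged. One caveat of this heuristic is that a coordinate pair such as `(1,234)` would also be collapsed to `(1234)`.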


&nbsp;
## 3.6 Verifying mathematical equivalence

- In this section, we implement the basic functionality to check if the extracted answer (generated by the model) is equivalent to the correct answer (ground truth) provided in the dataset; this is step 5
- In the next section (step 6), we make this process a bit more robust to grade the answer correctness

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F08_raschka.webp?1" width="500px">

In [20]:
from sympy.parsing import sympy_parser as spp
from sympy.core.sympify import SympifyError

def sympy_parser(expr):
    try:
        return spp.parse_expr(
            expr,
            transformations=(
                # Standard transformations like handling parentheses
                *spp.standard_transformations,

                # Allow omitted multiplication symbols (e.g., "2x" -> "2*x")
                spp.implicit_multiplication_application,
            ),

            # Evaluate during parsing so simple constants simplify (e.g., 2+3 -> 5)
            evaluate=True,
        )
    except (SympifyError, SyntaxError, TypeError, IndexError):
        return None

- Note that this may appear to be an excessive amount of error handling, but these are all errors that I encountered when evaluating the model on all 500 MATH-500 problems, as the model does not always generate perfectly formatted outputs

In [21]:
print(sympy_parser(normalize_text(
    extract_final_candidate(model_answer)
)))

14/3


In [22]:
print(sympy_parser("28/6"))

14/3


In [23]:
from sympy import simplify

def equality_check(expr_gtruth, expr_pred):
    # First, check if the two expressions are exactly the same string
    if expr_gtruth == expr_pred:
        return True

    # Parse both expressions into SymPy objects (returns None if parsing fails)
    gtruth, pred = sympy_parser(expr_gtruth), sympy_parser(expr_pred)

    # If both expressions were parsed successfully, try symbolic comparison
    if gtruth is not None and pred is not None:
        try:
            # If the difference is 0, they are equivalent
            return simplify(gtruth - pred) == 0
        except (SympifyError, TypeError):
            pass

    return False

In [24]:
print(equality_check(
    normalize_text("13/4."),
    normalize_text(r"(13)/(4)")
))

True


In [25]:
print(equality_check(
    normalize_text("0.5"),
    normalize_text(r"(1)/(2)")
))

True


In [26]:
print(equality_check(
    normalize_text("14/3"),
    normalize_text("15/3")
))

False


In [27]:
print(equality_check(
    normalize_text("(14/3, 2/3)"),
    normalize_text("(14/3, 4/6)")
))

False
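
- The idea behind `equality_check`, comparing values rather than strings, can be illustrated without SymPy for plain rational numbers. The sketch below uses Python's built-in `fractions.Fraction` as a simplified stand-in (an assumption for illustration; it handles only rational numbers, not symbolic expressions), together with the hypothetical helper `rational_equal`.

```python
from fractions import Fraction


def rational_equal(a, b):
    # Stand-in for the symbolic check: parse both strings as exact
    # rationals and test whether their difference is zero
    try:
        return Fraction(a) - Fraction(b) == 0
    except (ValueError, ZeroDivisionError):
        return False  # Unparseable input counts as "not equal"


print(rational_equal("28/6", "14/3"))
print(rational_equal("0.5", "1/2"))
print(rational_equal("0.3333333333", "1/3"))
print(rational_equal("(14/3, 2/3)", "(14/3, 4/6)"))
```

- The first two comparisons succeed, the truncated decimal is (correctly) not treated as equal to 1/3, and the tuple-like string fails because it cannot be interpreted as a single value; handling such multi-part answers requires splitting them first.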


&nbsp;
## 3.7 Grading answers

In [28]:
def split_into_parts(text):
    result = [text]

    if text:
        # Check if text looks like a tuple or list, e.g. "(a, b)" or "[a, b]"
        if (
            len(text) >= 2
            and text[0] in "([" and text[-1] in ")]"
            and "," in text[1:-1]
        ):
            # Split on commas inside brackets and strip whitespace
            items = [p.strip() for p in text[1:-1].split(",")]
            if all(items):
                result = items
    else:
        # If text is empty, return an empty list
        result = []

    return result

In [29]:
split_into_parts(normalize_text(r"(14/3, 2/3)"))

['14/3', '2/3']

In [30]:
def grade_answer(pred_text, gt_text):
    result = False  # Default outcome if checks fail

    # Only continue if neither input is None
    if pred_text is not None and gt_text is not None:
        gt_parts = split_into_parts(
            normalize_text(gt_text)
        )  # Break ground truth into comparable parts

        pred_parts = split_into_parts(
            normalize_text(pred_text)
        )  # Break prediction into comparable parts

        # Ensure both sides have same number of valid parts
        if (gt_parts and pred_parts
           and len(gt_parts) == len(pred_parts)):
            result = all(
                equality_check(gt, pred)
                for gt, pred in zip(gt_parts, pred_parts)
            )  # Check each part for mathematical equivalence

    return result  # True only if all checks passed

In [31]:
grade_answer("14/3", r"\frac{14}{3}")

True

In [32]:
grade_answer(r"(14/3, 2/3)", "(14/3, 4/6)")

True
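
- Grading tuple-like answers boils down to splitting both strings into parts and requiring every pair of parts to match. As a rough, SymPy-free sketch of that idea, the snippet below compares the parts with `fractions.Fraction`; the helper `parts_equal` is hypothetical, covers only rational entries, and omits the normalization and error handling of `grade_answer`.

```python
from fractions import Fraction


def parts_equal(pred, gtruth):
    # Strip the outer brackets and split on commas
    pred_parts = [p.strip() for p in pred.strip("()[]").split(",")]
    gt_parts = [p.strip() for p in gtruth.strip("()[]").split(",")]

    # A different number of parts can never match
    if len(pred_parts) != len(gt_parts):
        return False

    # Order matters: parts are compared position by position
    # (no error handling: non-rational parts raise a ValueError)
    return all(
        Fraction(p) == Fraction(g)
        for p, g in zip(pred_parts, gt_parts)
    )


print(parts_equal("(14/3, 2/3)", "(14/3, 4/6)"))
print(parts_equal("(2, 1)", "(1, 2)"))
print(parts_equal("(1, 2, 3)", "(1, 2)"))
```

- Only the first pair matches; swapping the order or changing the number of parts results in a mismatch.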

In [33]:
# Define test cases: (name, prediction, ground truth, expected result)
tests = [
        ("check_1", "3/4", r"\frac{3}{4}", True),
        ("check_2", "(3)/(4)", r"3/4", True),
        ("check_3", r"\frac{\sqrt{8}}{2}", "sqrt(2)", True),
        ("check_4", r"\( \frac{1}{2} + \frac{1}{6} \)", "2/3", True),
        ("check_5", "(1, 2)", r"(1,2)", True),
        ("check_6", "(2, 1)", "(1, 2)", False),
        ("check_7", "(1, 2, 3)", "(1, 2)", False),
        ("check_8", "0.5", "1/2", True),
        ("check_9", "0.3333333333", "1/3", False),
        ("check_10", "1,234/2", "617", True),
        ("check_11", r"\text{2/3}", "2/3", True),
        ("check_12", "50%", "1/2", False),
        ("check_13", r"2\cdot 3/4", "3/2", True),
        ("check_14", r"90^\circ", "90", True),
        ("check_15", r"\left(\frac{3}{4}\right)", "3/4", True),
    ]


def run_demos_table(tests):
    header = ("Test", "Expect", "Got", "Status")
    rows = []
    for name, pred, gtruth, expect in tests:
        got = grade_answer(pred, gtruth)  # Run equality check
        status = "PASS" if got == expect else "FAIL"
        rows.append((name, str(expect), str(got), status))

    data = [header] + rows
    
    # Compute max width for each column to align table nicely
    col_widths = [
        max(len(row[i]) for row in data)
        for i in range(len(header))
    ]

    # Print table row by row
    for row in data:
        line = " | ".join(
            row[i].ljust(col_widths[i])
            for i in range(len(header))
        )
        print(line)

    # Print summary of passed tests
    passed = sum(r[3] == "PASS" for r in rows)
    print(f"\nPassed {passed}/{len(rows)}")

In [34]:
run_demos_table(tests)

Test     | Expect | Got   | Status
check_1  | True   | True  | PASS  
check_2  | True   | True  | PASS  
check_3  | True   | True  | PASS  
check_4  | True   | True  | PASS  
check_5  | True   | True  | PASS  
check_6  | False  | False | PASS  
check_7  | False  | False | PASS  
check_8  | True   | True  | PASS  
check_9  | False  | False | PASS  
check_10 | True   | True  | PASS  
check_11 | True   | True  | PASS  
check_12 | False  | False | PASS  
check_13 | True   | True  | PASS  
check_14 | True   | True  | PASS  
check_15 | True   | True  | PASS  

Passed 15/15


&nbsp;
## 3.8 Loading the evaluation dataset

- The previous section implemented the basic evaluation pipeline
- In this section, we load the dataset (step 7) to which we will apply this pipeline in order to evaluate the model (step 8, next section).

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F09_raschka.webp?1" width="500px">

- The dataset was downloaded and prepared via the following code from the [HuggingFaceH4/MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) repository, which requires the [`datasets`](https://huggingface.co/docs/datasets/en/index) package dependency (you don't need to execute this; it's only included for reference):

```python
from datasets import load_dataset
import json

dset = load_dataset("HuggingFaceH4/MATH-500", split="test")

math_data = dset.to_list()
with open("math500_test.json", "w", encoding="utf-8") as f:
    json.dump(math_data, f, ensure_ascii=False, indent=2)
```

In [35]:
import json
import requests

local_path = Path("math500_test.json")
url = (
    "https://raw.githubusercontent.com/rasbt/reasoning-from-scratch/"
    "main/ch03/01_main-chapter-code/math500_test.json"
)

if local_path.exists():
    with local_path.open("r", encoding="utf-8") as f:
        math_data = json.load(f)
else:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    math_data = r.json()

print("Number of entries:", len(math_data))

Number of entries: 500


In [36]:
from pprint import pprint
pprint(math_data[0])

{'answer': '\\left( 3, \\frac{\\pi}{2} \\right)',
 'level': 2,
 'problem': 'Convert the point $(0,3)$ in rectangular coordinates to polar '
            'coordinates.  Enter your answer in the form $(r,\\theta),$ where '
            '$r > 0$ and $0 \\le \\theta < 2 \\pi.$',
 'solution': 'We have that $r = \\sqrt{0^2 + 3^2} = 3.$  Also, if we draw the '
             'line connecting the origin and $(0,3),$ this line makes an angle '
             'of $\\frac{\\pi}{2}$ with the positive $x$-axis.\n'
             '\n'
             '[asy]\n'
             'unitsize(0.8 cm);\n'
             '\n'
             'draw((-0.5,0)--(3.5,0));\n'
             'draw((0,-0.5)--(0,3.5));\n'
             'draw(arc((0,0),3,0,90),red,Arrow(6));\n'
             '\n'
             'dot((0,3), red);\n'
             'label("$(0,3)$", (0,3), W);\n'
             'dot((3,0), red);\n'
             '[/asy]\n'
             '\n'
             'Therefore, the polar coordinates are $\\boxed{\\left( 3, '
             '\\frac

&nbsp;
## 3.9 Evaluating the model

- In the previous section, we loaded the dataset; now we can apply the evaluation pipeline to evaluate the model on this dataset (step 8)

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F10_raschka.webp?1" width="500px">

In [37]:
def render_prompt(prompt):
    template = (
        "You are a helpful math assistant.\n"
        "Answer the question and write the final result on a new line as:\n"
        "\\boxed{ANSWER}\n\n"
        f"Question:\n{prompt}\n\nAnswer:"
    )
    return template

In [38]:
prompt = (  # Same prompt we used at the beginning of the chapter
    r"If $a+b=3$ and $ab=\tfrac{13}{6}$, "
    r"what is the value of $a^2+b^2$?"
)
prompt_fmt = render_prompt(prompt)
print(prompt_fmt)

You are a helpful math assistant.
Answer the question and write the final result on a new line as:
\boxed{ANSWER}

Question:
If $a+b=3$ and $ab=\tfrac{13}{6}$, what is the value of $a^2+b^2$?

Answer:


In [39]:
generated_text = generate_text_stream_concat(
    model, tokenizer, prompt_fmt, device,
    max_new_tokens=2048,
    verbose=True
)

 \boxed{10}

In [40]:
# Below is an alternative prompt template
# which swaps "Question" with "Problem"

"""
def render_prompt(prompt):
    template = (
        "You are a helpful math assistant.\n"
        "Solve the problem and write the final result on a new line as:\n"
        "\\boxed{ANSWER}\n\n"
        f"Problem:\n{prompt}\n\nAnswer:"
    )
    return template
"""

# This can noticeably affect the MATH-500 results:
# Base model on mps: improves accuracy 20% -> 40%
# Reasoning model on mps: worsens accuracy 90% -> 60%

'\ndef render_prompt(prompt):\n    template = (\n        "You are a helpful math assistant.\n"\n        "Solve the problem and write the final result on a new line as:\n"\n        "\\boxed{ANSWER}\n\n"\n        f"Problem:\n{prompt}\n\nAnswer:"\n    )\n    return template\n'

In [41]:
# Alternatively, we may use no prompt template

"""
def render_prompt(prompt):
    return prompt
"""

# This can noticeably affect the MATH-500 results:
# Base model on mps: improves accuracy 20% -> 70%
# Reasoning model on mps: worsens accuracy 90% -> 50%

'\ndef render_prompt(prompt):\n    return prompt\n'

In [42]:
def mini_eval_demo(model, tokenizer, device):
    ex = {  # Test example with "problem" and "answer" fields
        "problem": "Compute 1/2 + 1/6.",
        "answer": "2/3"
    }
    prompt = render_prompt(ex["problem"])     # 1. Apply prompt template
    gen_text = generate_text_stream_concat(   # 2. Generate response
        model, tokenizer, prompt, device,
        max_new_tokens=64,
    )
    pred_answer = extract_final_candidate(gen_text)  # 3. Extract answer
    is_correct = grade_answer(                       # 4. Normalize and grade answer
        pred_answer, ex["answer"]
    )
    print(f"Device: {device}")
    print(f"Prediction: {pred_answer}")
    print(f"Ground truth: {ex['answer']}")
    print(f"Correct: {is_correct}")

In [43]:
mini_eval_demo(model, tokenizer, device)

Device: mps
Prediction: 1/3
Ground truth: 2/3
Correct: False


In [44]:
import time


def evaluate_math500_stream(
    model,
    tokenizer,
    device,
    math_data,
    out_path=None,
    max_new_tokens=512,
    verbose=False,
):

    if out_path is None:
        dev_name = str(device).replace(":", "-")  # Make filename compatible with Windows
        out_path = Path(f"math500_{WHICH_MODEL}-{dev_name}.jsonl")

    num_examples = len(math_data)
    num_correct = 0
    print(f"MATH-500: 0/{num_examples}", end="\r", flush=True)

    start_time = time.time()
    
    with open(out_path, "w", encoding="utf-8") as f:  # Save results for inspection
        for i, row in enumerate(math_data, start=1):
            prompt = render_prompt(row["problem"])    # 1. Apply prompt template
            gen_text = generate_text_stream_concat(   # 2. Generate response
                model, tokenizer, prompt, device,
                max_new_tokens=max_new_tokens,
                verbose=verbose,
            )

            extracted = extract_final_candidate(  # 3. Extract answer
                gen_text
            )
            is_correct = grade_answer(            # 4. Normalize and grade answer
                extracted, row["answer"]
            )
            num_correct += int(is_correct)

            record = {  # Record to be saved for inspection
                "index": i,
                "problem": row["problem"],
                "gtruth_answer": row["answer"],
                "generated_text": gen_text,
                "extracted": extracted,
                "correct": bool(is_correct),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

            if verbose:  # Print responses during the generation process
                print(
                    f"\n\n{'='*50}\nMATH-500: {i}/{num_examples}\n"
                    f"{'='*50}\nExtracted: {extracted}\n"
                    f"Expected:  {row['answer']}\n"
                    f"Correct so far: {num_correct}\n{'-'*50}"
                )
            else:
                print(
                    f"MATH-500: {i}/{num_examples}",
                    end="\r", flush=True
                )

    # Print summary information
    seconds_elapsed = time.time() - start_time
    acc = num_correct / num_examples if num_examples else 0.0
    print(f"\nAccuracy: {acc*100:.1f}% ({num_correct}/{num_examples})")
    print(f"Total time: {seconds_elapsed/60:.1f} min")
    print(f"Logs written to: {out_path}")
    return num_correct, num_examples, acc

- We only evaluate on 10 examples for demo purposes (to keep the runtime reasonable)

In [45]:
print("Model:", WHICH_MODEL)
print("Device:", device)
num_correct, num_examples, acc = evaluate_math500_stream(
    model, tokenizer, device, 
    math_data=math_data[:10],
    max_new_tokens=2048,
    verbose=False
)

Model: base
Device: mps
MATH-500: 10/10
Accuracy: 20.0% (2/10)
Total time: 0.7 min
Logs written to: math500_base-mps.jsonl
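
- Because each evaluated example is written as one JSON object per line, the log file can be reloaded later to recompute the accuracy or inspect failures. The following sketch assumes the record keys written by `evaluate_math500_stream` above; the file name `demo_log.jsonl` and the two records are made up so that the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path

# Made-up records that mimic the structure of the saved log entries
demo_records = [
    {"index": 1, "gtruth_answer": "2/3", "extracted": "2/3", "correct": True},
    {"index": 2, "gtruth_answer": "14/3", "extracted": "10", "correct": False},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = Path(tmp_dir) / "demo_log.jsonl"  # Hypothetical file name

    # Write one JSON object per line, as in the evaluation loop
    with log_path.open("w", encoding="utf-8") as f:
        for record in demo_records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Read the log back, skipping blank lines
    with log_path.open("r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]

num_correct = sum(r["correct"] for r in records)
failed = [r["index"] for r in records if not r["correct"]]
print(f"Accuracy: {num_correct / len(records) * 100:.1f}%")
print("Failed indices:", failed)
```

- To inspect a real run, `log_path` would point to the file reported under "Logs written to" instead of the temporary demo file.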


| Mode      | Device | Accuracy | MATH-500 size | Time                  |
|-----------|--------|----------|---------------|-----------------------|
| Base      | CPU    | 30%      | 10            | 0.7 min (Mac Mini M4) |
| Base      | MPS    | 20%      | 10            | 0.4 min (Mac Mini M4) |
| Base      | CUDA   | 30%      | 10            | 0.2 min (DGX Spark)   |
| Base      | XPU    | 30%      | 10            | 1.2 min (Intel)       |
| Reasoning | CPU    | 90%      | 10            | 9.5 min (Mac Mini M4) |
| Reasoning | MPS    | 80%      | 10            | 3.8 min (Mac Mini M4) |
| Reasoning | CUDA   | 90%      | 10            | 3.7 min (DGX Spark)   |
| Reasoning | XPU    | 70%      | 10            | 8.5 min (Intel)       |


| Mode      | Device | Accuracy | MATH-500 size    | Time                   |
|-----------|--------|----------|------------------|------------------------|
| Base      | CUDA   | 15.6%    | 500              | 10.0 min (DGX Spark)   |
| Reasoning | CUDA   | 50.8%    | 500              | 182.2 min (DGX Spark)  |

- For reference, the tables above list the accuracy values and runtimes for the different devices and both model variants
- Note that "CUDA" here refers to an NVIDIA GPU; MPS refers to an Apple Silicon M4 chip
- The reasoning model is much slower because it produces much longer responses
- While Qwen3-Base is a pre-trained base model and the Qwen3 developers recommend using it without a chat template, changing `tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)` to `tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path, apply_chat_template=True)` boosts the MATH-500 performance substantially (80%); note that it is not clear whether the MATH-500 test set was part of the training data; in the age of LLMs, we can assume that any data available on the internet has been part of the training data (also see the discussion [here](https://github.com/rasbt/LLMs-from-scratch/pull/828#issuecomment-3324829736))
- The run for the 500 MATH-500 examples corresponds to changing the code here in the `evaluate_math500_stream` function call from `math_data=math_data[:10],` to `math_data=math_data,`
- The bonus materials contain a script to run the evaluation in batched mode for higher throughput (see [../02_math500-verifier-scripts/README.md](../02_math500-verifier-scripts/README.md)); on an H100, with a batch size of 128, the base model can be evaluated in 3.3 min, and the reasoning model can be evaluated in 14.6 min

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/ch03/CH03_F11_raschka.webp?1" width="500px">

- For convenience, you can use the [../02_math500-verifier-scripts/evaluate_math500.py](../02_math500-verifier-scripts/evaluate_math500.py) script, which runs the MATH-500 evaluation code as a standalone script from the command line (see the [../02_math500-verifier-scripts/README.md](../02_math500-verifier-scripts/README.md) for more usage information)
- The [../02_math500-verifier-scripts/evaluate_math500_batched.py](../02_math500-verifier-scripts/evaluate_math500_batched.py) script runs the code in this chapter in batched mode
  - This means it processes multiple examples per forward pass to accelerate the evaluation while requiring more RAM
  - With a batch size of 128, this reduces the runtime of the base model, when evaluating all 500 samples, from 13.3 min to 3.3 min on an H100 GPU
  - Similarly, it reduces the runtime of the reasoning model from 185.4 min to 14.6 min for the 500 examples in the dataset
  - Note that the H100 is used as an example, and the script is compatible with other GPUs (or CPUs) as well
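
- The runtimes quoted above translate into the following speedup factors; the snippet below is just the arithmetic on the numbers reported in this section.

```python
# Sequential vs. batched runtimes (in minutes) on an H100,
# taken from the bullet points above
runtimes = {
    "base": (13.3, 3.3),
    "reasoning": (185.4, 14.6),
}

speedups = {}
for name, (sequential_min, batched_min) in runtimes.items():
    # Speedup = sequential runtime divided by batched runtime
    speedups[name] = sequential_min / batched_min
    print(f"{name}: {speedups[name]:.1f}x faster in batched mode")
```

- This works out to roughly a 4x speedup for the base model and roughly a 12.7x speedup for the reasoning model.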

&nbsp;
## Summary

- No code in this section