<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="https://mng.bz/lZ5B">Build a Reasoning Model (From Scratch)</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/reasoning-from-scratch">https://github.com/rasbt/reasoning-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="https://mng.bz/lZ5B"><img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Appendix F: Common Approaches to LLM Evaluation

In [1]:
from importlib.metadata import version

used_libraries = [
    "reasoning_from_scratch",
    "torch",
    "tokenizers"  # Used by reasoning_from_scratch
]

for lib in used_libraries:
    print(f"{lib} version: {version(lib)}")

reasoning_from_scratch version: 0.1.0
torch version: 2.7.1
tokenizers version: 0.21.2


&nbsp;
## F.1 Understanding the main evaluation methods for LLMs

- No code in this section

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/appendix-f/Appendix_F_F01_raschka.webp" width="500px">

&nbsp;
### F.2 Evaluating answer-choice accuracy

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/appendix-f/Appendix_F_F02_raschka.webp" width="500px">

- Note that this figure depicts a simplified version of a multiple-choice-based evaluation (like MMLU), where we check the generated output letter against the correct answer letter
- In practice, variants of this include log-probability scoring, where instead of checking only the final letter, we compute how likely the model considers each candidate answer
- For reasoning models, this can also involve computing the likelihood the model assigns to the correct answer when that answer is provided as input
- In either case, the evaluation still checks whether the model selects one of the pre-defined answers
- (Output probability scores are discussed in more detail in chapter 4, where we improve the text generation function)
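
- To make the log-probability variant above more concrete, the following is a minimal sketch (not the book's implementation) that scores each answer letter by its log-probability for the next token and picks the most likely one
- The logits and the letter-to-token-ID mapping are made-up toy values; in practice, they would come from the model's output at the last prompt position and from the tokenizer

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax over a list of raw scores
    max_logit = max(logits)
    log_sum = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    return [x - log_sum for x in logits]


def score_choices(next_token_logits, letter_token_ids):
    # Look up the log-probability of each candidate answer letter
    log_probs = log_softmax(next_token_logits)
    scores = {
        letter: log_probs[token_id]
        for letter, token_id in letter_token_ids.items()
    }
    best_letter = max(scores, key=scores.get)
    return best_letter, scores


# Toy values (assumptions for illustration only)
toy_logits = [0.1, 2.0, 0.5, 1.2, -0.3, 0.0]
toy_letter_ids = {"A": 1, "B": 2, "C": 3, "D": 4}

best, scores = score_choices(toy_logits, toy_letter_ids)
print("Predicted letter:", best)
for letter, log_prob in scores.items():
    print(f"{letter}: {log_prob:.3f}")
```

- Compared to checking the generated letter, this approach always yields one of the pre-defined answers, even if the model would otherwise generate unrelated text first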

&nbsp;
#### F.2.1 Loading the model

In [2]:
from pathlib import Path
import torch

from reasoning_from_scratch.ch02 import (
    get_device
)
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Tokenizer,
    Qwen3Model,
    QWEN_CONFIG_06_B
)

device = get_device()
torch.set_float32_matmul_precision("high")

# If you have compatibility issues, try to
# uncomment the line below and rerun the notebook
# device = "cpu"

WHICH_MODEL = "base"

if WHICH_MODEL == "base":

    download_qwen3_small(
        kind="base", tokenizer_only=False, out_dir="qwen3"
    )

    tokenizer_path = Path("qwen3") / "tokenizer-base.json"
    model_path = Path("qwen3") / "qwen3-0.6B-base.pth"
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)

elif WHICH_MODEL == "reasoning":

    download_qwen3_small(
        kind="reasoning", tokenizer_only=False, out_dir="qwen3"
    )

    tokenizer_path = Path("qwen3") / "tokenizer-reasoning.json"
    model_path = Path("qwen3") / "qwen3-0.6B-reasoning.pth"
    tokenizer = Qwen3Tokenizer(
        tokenizer_file_path=tokenizer_path,
        apply_chat_template=True,
        add_generation_prompt=True,
        add_thinking=True,
    )

else:
    raise ValueError(f"Invalid choice: WHICH_MODEL={WHICH_MODEL}")


model = Qwen3Model(QWEN_CONFIG_06_B)
model.load_state_dict(torch.load(model_path))

model.to(device)


USE_COMPILE = False  # Set to True to enable compilation
if USE_COMPILE:
    torch._dynamo.config.allow_unspec_int_on_nn_module = True
    model = torch.compile(model)

Using Apple Silicon GPU (MPS)
✓ qwen3/qwen3-0.6B-base.pth already up-to-date
✓ qwen3/tokenizer-base.json already up-to-date


&nbsp;
#### F.2.2 Checking the generated answer letter

In [3]:
example = {
    "question": (
        "How many ways are there to put 4 distinguishable"
        " balls into 2 indistinguishable boxes?"
    ),
    "choices": ["7", "11", "16", "8"],
    "answer": "D",
}

def format_prompt(example):
    return (
        f"{example['question']}\n"
        f"A. {example['choices'][0]}\n"
        f"B. {example['choices'][1]}\n"
        f"C. {example['choices'][2]}\n"
        f"D. {example['choices'][3]}\n"
        "Answer: "  # trailing space encourages a single-letter next token
    )

prompt = format_prompt(example)
print(prompt)

How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes?
A. 7
B. 11
C. 16
D. 8
Answer: 


---


- You can load examples from the MMLU dataset directly via the `datasets` library (which can be installed via `pip install datasets` or `uv add datasets`):

```python
from datasets import load_dataset

dataset = load_dataset("cais/mmlu", "high_school_mathematics")

# Inspect the first example from the test set:
example = dataset["test"][0]
print(example)
```

- Above, we used the `"high_school_mathematics"` subset; to get a list of the other subsets, use the following code:


```python
from datasets import get_dataset_config_names

subsets = get_dataset_config_names("cais/mmlu")
print(subsets)
```
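
- Note that records loaded this way may not match the `example` dictionary used above exactly; the sketch below assumes that each MMLU record stores the correct answer as an integer index (0 to 3) rather than a letter, and converts it accordingly
- The `record` below is hand-written for illustration, and `mmlu_record_to_example` is a hypothetical helper, not part of the `datasets` library

```python
def mmlu_record_to_example(record):
    # Assumption: record["answer"] is an integer index into record["choices"]
    letters = "ABCD"
    return {
        "question": record["question"],
        "choices": list(record["choices"]),
        "answer": letters[record["answer"]],
    }


# Hand-written record mimicking the assumed dataset layout
record = {
    "question": (
        "How many ways are there to put 4 distinguishable"
        " balls into 2 indistinguishable boxes?"
    ),
    "subject": "high_school_mathematics",
    "choices": ["7", "11", "16", "8"],
    "answer": 3,
}

converted = mmlu_record_to_example(record)
print(converted["answer"])
```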

---

In [4]:
prompt_ids = tokenizer.encode(prompt)
prompt_fmt = torch.tensor(prompt_ids, device=device).unsqueeze(0)

- We generate a few tokens and extract the first instance of letter A/B/C/D the model prints:

In [5]:
from reasoning_from_scratch.ch02_ex import generate_text_basic_stream_cache


def predict_choice(
    model, tokenizer, prompt_fmt, max_new_tokens=8
):
    pred = None
    for t in generate_text_basic_stream_cache(
        model=model,
        token_ids=prompt_fmt,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    ):
        answer = tokenizer.decode(t.squeeze(0).tolist())
        for letter in answer:
            letter = letter.upper()
            if letter in "ABCD":
                pred = letter
                break
        if pred:  # stop as soon as a letter appears
            break
    return pred

In [6]:
pred1 = predict_choice(model, tokenizer, prompt_fmt)

print(
    f"Generated letter: {pred1}\n"
    f"Correct? {pred1 == example['answer']}"
)

Generated letter: C
Correct? False
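
- A single example says little about a model; a benchmark score is the fraction of correct letters over many examples
- The following minimal sketch computes this accuracy for any prediction function; `always_c` is a hypothetical stand-in for the model, and the toy examples are made up

```python
def answer_choice_accuracy(examples, predict_fn):
    # Fraction of examples where the predicted letter matches the answer
    if not examples:
        raise ValueError("examples must not be empty")
    num_correct = sum(
        predict_fn(ex) == ex["answer"] for ex in examples
    )
    return num_correct / len(examples)


toy_examples = [
    {"question": "Q1", "choices": ["1", "2", "3", "4"], "answer": "C"},
    {"question": "Q2", "choices": ["1", "2", "3", "4"], "answer": "D"},
    {"question": "Q3", "choices": ["1", "2", "3", "4"], "answer": "C"},
    {"question": "Q4", "choices": ["1", "2", "3", "4"], "answer": "A"},
]


def always_c(example):
    # Stand-in predictor that ignores the question
    return "C"


acc = answer_choice_accuracy(toy_examples, always_c)
print(f"Accuracy: {acc:.2f}")
```

- In this notebook, `predict_fn` could wrap `format_prompt`, `tokenizer.encode`, and `predict_choice` from the cells above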


&nbsp;
### F.3 Using verifiers to check answers

- No code in this section (see chapter 3)

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/appendix-f/Appendix_F_F03_raschka.webp" width="500px">

&nbsp;
### F.4 Comparing models using preferences and leaderboards

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/appendix-f/Appendix_F_F04_raschka.webp" width="500px">

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/appendix-f/Appendix_F_F05_raschka.webp" width="500px">

- Elo rating ("algorithm of 400") inspired by chess rankings: https://en.wikipedia.org/wiki/Performance_rating_(chess)
- Note that LM Arena switched to a statistical Bradley-Terry model that provides scores on an Elo-like scale; however, the same concept of pairwise ranking still applies

In [7]:
# Pairwise "arena votes" where the first model is the winner and
# the second model is the loser
votes = [
    ("GPT-5", "Claude-3"),  # First match-up: GPT-5 was preferred over Claude-3
    ("GPT-5", "Llama-4"),
    ("Claude-3", "Llama-3"),
    ("Llama-4", "Llama-3"),
    ("Claude-3", "Llama-3"),
    ("GPT-5", "Llama-3"),
]

In [8]:
def elo_ratings(vote_pairs, k_factor=32, initial_rating=1000):
    # Initialize all models with the same base rating
    ratings = {
        model: initial_rating
        for pair in vote_pairs
        for model in pair
    }

    # Update ratings after each match
    for winner, loser in vote_pairs:
        rating_winner, rating_loser = ratings[winner], ratings[loser]

        # Expected score for the current winner given the ratings
        expected_winner = 1.0 / (
            1.0 + 10 ** ((rating_loser - rating_winner) / 400.0)
        )

        # k_factor determines sensitivity of rating updates
        ratings[winner] = (
            rating_winner + k_factor * (1 - expected_winner)
        )
        ratings[loser] = (
            rating_loser + k_factor * (0 - (1 - expected_winner))
        )

    return ratings

In [9]:
ratings = elo_ratings(votes, k_factor=32, initial_rating=1000)

for model in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{model:8s} : {ratings[model]:.1f}")

GPT-5    : 1043.7
Claude-3 : 1015.2
Llama-4  : 1000.7
Llama-3  : 940.4
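
- One property of sequential Elo updates is that the final ratings depend on the order in which the votes are processed, which is a plausible motivation for fitting a statistical model over all votes at once
- The sketch below re-implements the Elo update compactly (so that the block is self-contained) and compares the ratings for the original and the reversed vote order

```python
def elo_ratings(vote_pairs, k_factor=32, initial_rating=1000):
    # Same update rule as above, written compactly
    ratings = {m: initial_rating for pair in vote_pairs for m in pair}
    for winner, loser in vote_pairs:
        expected_winner = 1.0 / (
            1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0)
        )
        delta = k_factor * (1 - expected_winner)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


votes = [
    ("GPT-5", "Claude-3"),
    ("GPT-5", "Llama-4"),
    ("Claude-3", "Llama-3"),
    ("Llama-4", "Llama-3"),
    ("Claude-3", "Llama-3"),
    ("GPT-5", "Llama-3"),
]

forward = elo_ratings(votes)
backward = elo_ratings(votes[::-1])  # Same votes, reversed order

for model in sorted(forward, key=forward.get, reverse=True):
    print(f"{model:8s} : {forward[model]:.1f} vs {backward[model]:.1f}")
```

- The sum of all ratings stays constant because each update adds to the winner exactly what it subtracts from the loser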


- The expected winner score is calculated as follows:

$$\text{expected\_winner} \;=\; \frac{1}{1 + 10^{\tfrac{\text{rating\_loser} - \text{rating\_winner}}{400}}}
$$

- Intuition:
    - If rating_winner >> rating_loser:
       - exponent → very negative
       - denominator ≈ 1
       - expected_winner ≈ 1 (almost certain win)
    - If rating_winner << rating_loser:
       - exponent → very positive
       - denominator → very large
       - expected_winner ≈ 0 (almost certain loss)
    - If rating_winner == rating_loser:
       - exponent = 0
       - denominator = 2
       - expected_winner = 0.5 (even match)
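
- The three cases above can be checked numerically with a small sketch; `expected_score` is a helper name introduced here for illustration

```python
def expected_score(rating_a, rating_b):
    # Expected score of player A against player B
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


for gap in (400, 0, -400):
    p = expected_score(1000 + gap, 1000)
    print(f"Rating gap {gap:+5d} -> expected score {p:.3f}")
```

- A 400-point advantage corresponds to an expected score of about 0.909, an even match to 0.5, and a 400-point disadvantage to about 0.091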

&nbsp;
### F.5 Judging responses with other LLMs

<img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/appendix-f/Appendix_F_F06_raschka.webp" width="500px">

- In this section, we automate the response evaluation of the finetuned LLM using another, larger LLM
- In particular, we use an instruction-finetuned 20-billion-parameter gpt-oss model by OpenAI that can be run locally via ollama ([https://ollama.com](https://ollama.com))

- Ollama is an application to run LLMs efficiently
- It is a wrapper around llama.cpp ([https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)), which implements LLMs in pure C/C++ to maximize efficiency
- Note that it is a tool for using LLMs to generate text (inference), not for training or finetuning LLMs
- Before running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the "Download" button and downloading the ollama application for your operating system)

- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command line usage, say "yes"
- Linux users can use the installation command provided on the ollama website
- There are 3 ways we can run ollama on our computer:

**1. `ollama serve`**

- This runs the ollama backend as a server, usually on `http://localhost:11434`. It doesn't load a model until we call it through the API. This is what we want if we want to use ollama through Python.

**2. `ollama run gpt-oss:20b`**

- This is a convenience wrapper. If the server is not already running, it will start it, then download the model (the first time), and drop us into an interactive terminal where we can chat with the model. Behind the scenes, it uses the same server API.

**3. Ollama desktop app**

- This runs the same backend automatically and provides a GUI on top of it (as shown in the figure above).
It also applies defaults (system prompt, temperature, stop sequences), which can explain why answers look different from raw API usage.

---

**Note**:

- When running `ollama serve` in the terminal, as described above, you may encounter an error message saying `Error: listen tcp 127.0.0.1:11434: bind: address already in use`
- If that's the case, try using the command `OLLAMA_HOST=127.0.0.1:11435 ollama serve` (and if this address is also in use, try incrementing the port number by one until you find an address that is not in use)
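
- If you change the port this way, the URL used for API requests from Python has to change as well; the following is a minimal sketch of a helper that builds the chat endpoint URL from a host string or from the `OLLAMA_HOST` environment variable
- `ollama_chat_url` is a hypothetical helper name, and the sketch assumes the value has the form `host:port`, as in the command above

```python
import os


def ollama_chat_url(host=None):
    # Fall back to OLLAMA_HOST, then to the default address
    host = host or os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
    if not host.startswith(("http://", "https://")):
        host = f"http://{host}"
    return f"{host.rstrip('/')}/api/chat"


print(ollama_chat_url("127.0.0.1:11435"))
```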

---

- For example, to give ollama a try, we can use `ollama run gpt-oss:20b` to run the 20-billion-parameter gpt-oss model. The model
  (about 13 GB) will be automatically downloaded the first time you run
  this command. (Alternatively, you can use it in the desktop app, similar to the previous figure.)
  
```bash
ollama run gpt-oss:20b
```


- The output looks as follows:

```
$ ollama run gpt-oss:20b
pulling manifest 
pulling b112e727c6f1: 100% ▕█████████████████████████████████▏  13 GB                         
pulling fa6710a93d78: 100% ▕█████████████████████████████████▏ 7.2 KB                         
pulling f60356777647: 100% ▕█████████████████████████████████▏  11 KB                         
pulling d8ba2f9a17b3: 100% ▕█████████████████████████████████▏   18 B                         
pulling 55c108d8e936: 100% ▕█████████████████████████████████▏  489 B                         
verifying sha256 digest 
writing manifest 
removing unused layers 
success
```

- For more information on gpt-oss, please see my in-depth article, [From GPT-2 to gpt-oss: Analyzing the Architectural Advances](https://magazine.sebastianraschka.com/p/from-gpt-2-to-gpt-oss-analyzing-the) 
- Using ollama with the `gpt-oss:20b` model (a 20B-parameter model) requires approximately 13 GB of RAM; if your machine does not support this, you can try a smaller model, such as the 4B-parameter `qwen3:4b` model, which only requires approximately 4 GB of RAM
- Alternatively, you can also use the larger 120-billion-parameter gpt-oss model (`gpt-oss:120b`) or even the 235-billion-parameter Qwen3 model (`qwen3:235b`), if your machine supports it
- After the download has been completed, you will see a command line prompt that allows you to chat with the model
- Try a prompt like "What is 1+2?", which should return an output similar to the following

```
>>> What is 1+2?
Thinking...
User asks: "What is 1+2?" This is simple: answer 3. Provide explanation? Possibly ask for simple 
arithmetic. Provide answer: 3.
...done thinking.

1 + 2 = **3**
```

- You can end this session using the input `/bye`

- The following code checks whether ollama is running before we proceed to use it for evaluating model responses

In [10]:
import psutil

def check_if_running(process_name):
    running = False
    for proc in psutil.process_iter(["name"]):
        # The name can be None if access to the process is denied
        if process_name in (proc.info["name"] or ""):
            running = True
            break
    return running

ollama_running = check_if_running("ollama")

if not ollama_running:
    raise RuntimeError("Ollama not running. Launch ollama before proceeding.")
print("Ollama running:", check_if_running("ollama"))

Ollama running: True
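
- A running process does not guarantee that the server accepts requests on the expected address; as a complementary check, the following minimal sketch sends a plain HTTP request to the server
- It assumes that the server's base URL answers with HTTP status 200 when the server is up; `ollama_server_reachable` is a hypothetical helper name

```python
import urllib.request


def ollama_server_reachable(base_url="http://localhost:11434", timeout=2.0):
    # Returns False on connection errors, timeouts, and HTTP errors
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False


print("Ollama server reachable:", ollama_server_reachable())
```

- If you started the server on a different port, pass the matching address as `base_url`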


- An alternative to the `ollama run` command we used earlier is to interact with the model through its REST API in Python, using the function defined below
- Before you run the next cells in this notebook, make sure that ollama is still running (the previous code cell should print `"Ollama running: True"`)
- Next, run the following code cells to define the function and query the model

In [11]:
import json
import urllib.request


def query_model(
    prompt,
    model="gpt-oss:20b",
    # If you used OLLAMA_HOST=127.0.0.1:11435 ollama serve
    # update the address from 11434 to 11435
    url="http://localhost:11434/api/chat"
):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {     # Settings below are required for deterministic responses
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data

In [12]:
ollama_model = "gpt-oss:20b"
result = query_model("What is 1+2?", ollama_model)
print(result)

3


- Now, using the `query_model` function we defined above, we can evaluate the responses of our own model

In [16]:
def rubric_prompt(instruction, reference_answer, model_answer):
    rubric = (
        "You are a fair judge assistant. You will be given an instruction, "
        "a reference answer, and a candidate answer to evaluate, according "
        "to the following rubric:\n\n"
        "1: The response fails to address the instruction, providing "
        "irrelevant, incorrect, or excessively verbose content.\n"
        "2: The response partially addresses the instruction but contains "
        "major errors, omissions, or irrelevant details.\n"
        "3: The response addresses the instruction to some degree but is "
        "incomplete, partially correct, or unclear in places.\n"
        "4: The response mostly adheres to the instruction, with only minor "
        "errors, omissions, or lack of clarity.\n"
        "5: The response fully adheres to the instruction, providing a "
        "clear, accurate, and relevant answer in a concise and efficient "
        "manner.\n\n"
        "Now here is the instruction, the reference answer, and the "
        "response.\n"
    )

    prompt = (
        f"{rubric}\n"
        f"Instruction:\n{instruction}\n\n"
        f"Reference Answer:\n{reference_answer}\n\n"
        f"Answer:\n{model_answer}\n\n"
        f"Evaluation: "
    )
    return prompt

- The `model_answer` could be the answer produced by our own model; here, we hardcode a possible model answer for simplicity

In [17]:
rendered_prompt = rubric_prompt(
    instruction=(
        "If all birds can fly, and a penguin is a bird, "
        "can a penguin fly?"
    ),
    reference_answer=(
        "Yes, according to the premise that all birds can fly, "
        "a penguin can fly."
    ),
    model_answer=(
        "Yes – under those premises a penguin would be able to fly."
    )
)
print(rendered_prompt)

You are a fair judge assistant. You will be given an instruction, a reference answer, and a candidate answer to evaluate, according to the following rubric:

1: The response fails to address the instruction, providing irrelevant, incorrect, or excessively verbose content.
2: The response partially addresses the instruction but contains major errors, omissions, or irrelevant details.
3: The response addresses the instruction to some degree but is incomplete, partially correct, or unclear in places.
4: The response mostly adheres to the instruction, with only minor errors, omissions, or lack of clarity.
5: The response fully adheres to the instruction, providing a clear, accurate, and relevant answer in a concise and efficient manner.

Now here is the instruction, the reference answer, and the response.

Instruction:
If all birds can fly, and a penguin is a bird, can a penguin fly?

Reference Answer:
Yes, according to the premise that all birds can fly, a penguin can fly.

Answer:
Yes – under those premises a penguin would be able to fly.

Evaluation: 

In [18]:
result = query_model(rendered_prompt, ollama_model)
print(result)

**Score: 5**

The candidate answer directly addresses the question, correctly applies the given premises, and concisely states that a penguin would be able to fly. It is accurate, relevant, and clear.
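
- To aggregate judge scores over many responses, the numeric score has to be extracted from the free-form judge output; the following is a minimal sketch that assumes the judge writes something like `Score: 5`, as in the response above
- The output format of the judge is not guaranteed, so `extract_score` (a hypothetical helper) returns `None` when no valid score is found

```python
import re


def extract_score(judge_response, min_score=1, max_score=5):
    # Find the first number following the word "score"
    match = re.search(
        r"score\s*[:=]?\s*\**\s*(\d+)",
        judge_response,
        flags=re.IGNORECASE,
    )
    if match is None:
        return None
    score = int(match.group(1))
    # Reject numbers outside the rubric range
    return score if min_score <= score <= max_score else None


sample_response = (
    "**Score: 5**\n\n"
    "The candidate answer directly addresses the question."
)
print(extract_score(sample_response))
```

- Responses for which the function returns `None` can be inspected manually or sent to the judge again with a stricter format instruction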
