<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="https://mng.bz/lZ5B">Build a Reasoning Model (From Scratch)</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/reasoning-from-scratch">https://github.com/rasbt/reasoning-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="https://mng.bz/lZ5B"><img src="https://sebastianraschka.com/images/reasoning-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Chapter 5: Exercise Solutions

Packages that are being used in this notebook:

```python
from importlib.metadata import version

used_libraries = [
    "reasoning_from_scratch",
    "torch",
    "tokenizers"  # Used by reasoning_from_scratch
]

for lib in used_libraries:
    print(f"{lib} version: {version(lib)}")
```

```
reasoning_from_scratch version: 0.1.13
torch version: 2.10.0
tokenizers version: 0.22.2
```


&nbsp;
## Exercise 5.1: Using the heuristic scorer as a tie-breaker in self-consistency

- There are many ways to implement this
- Perhaps the easiest way is to handle it outside the self-consistency function and work with the returned dictionary, similar to exercise 4.4, where we added the tie-breaking directly to the `evaluate_math500_stream` function
- The relevant lines are shown below

```python
# ...
from pathlib import Path
import time

from reasoning_from_scratch.ch05 import heuristic_score


def evaluate_math500_stream(
    model,
    tokenizer,
    device,
    math_data,
    out_path=None,
    max_new_tokens=2048,
    verbose=False,
    prompt_suffix="",
    temperature=1.0,
    top_p=1.0,
    seed=None,
    num_samples=10,
):
    if out_path is None:
        dev_name = str(device).replace(":", "-")
        out_path = Path(f"math500-{dev_name}.jsonl")

    num_examples = len(math_data)
    num_correct = 0
    start_time = time.time()

    with open(out_path, "w", encoding="utf-8") as f:
        for i, row in enumerate(math_data, start=1):
            prompt = render_prompt(row["problem"]) + prompt_suffix

            results = self_consistency_vote(
                model=model,
                tokenizer=tokenizer,
                prompt=prompt,
                device=device,
                num_samples=num_samples,
                temperature=temperature,
                top_p=top_p,
                max_new_tokens=max_new_tokens,
                show_progress=False,
                show_long_answer=False,
                seed=seed,
            )

            # Majority vote winner available
            if results["final_answer"] is not None:
                extracted = results["final_answer"]

            ### NEW: Break tie with heuristic_score
            else:
                best = None
                best_score = float("-inf")
            
                for cand in results["majority_winners"]:
                    scores = [
                        heuristic_score(results["full_answers"][idx], prompt=prompt)
                        for idx in results["groups"][cand]
                    ]
            
                    score = max(scores)
            
                    if score > best_score:
                        best_score = score
                        best = cand
            
                extracted = best

            # ...

    # ...
    return num_correct, num_examples, acc
```
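
- The tie-breaking branch can also be tried in isolation on a hand-made results dictionary; the sketch below is a minimal illustration rather than the repository's code, where `break_tie` and `toy_score` are hypothetical names and `toy_score` (which simply prefers answers containing `\boxed`) only stands in for `heuristic_score`

```python
def break_tie(results, score_fn):
    """Return the tied candidate whose best-scoring sample scores highest."""
    best, best_score = None, float("-inf")
    for cand in results["majority_winners"]:
        # Score every sample that produced this candidate answer
        scores = [
            score_fn(results["full_answers"][idx])
            for idx in results["groups"][cand]
        ]
        cand_score = max(scores)
        if cand_score > best_score:
            best, best_score = cand, cand_score
    return best


def toy_score(answer):
    # Hypothetical stand-in for heuristic_score: reward a boxed final answer
    return 1.0 if "\\boxed" in answer else 0.0


# Hand-made example with a 2-vs-2 tie between "4" and "5"
toy_results = {
    "full_answers": [
        "So the answer is 4",
        "Thus \\boxed{5}",
        "I get 4",
        "Probably 5",
    ],
    "groups": {"4": [0, 2], "5": [1, 3]},
    "majority_winners": ["4", "5"],
    "final_answer": None,
}

winner = break_tie(toy_results, toy_score)
print(winner)
```

- Because the comparison uses a strict `>`, the first candidate in `majority_winners` is kept if all candidates receive the same score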

- The improvements over the baseline in chapter 3 and self-consistency from chapter 4 are shown below

|   | Method                                   | Model | Accuracy | Time      |
|---|------------------------------------------|-------|----------|-----------|
| 1 | Chapter 4 baseline with CoT prompting    | Base  | 33.4%    | 129.2 min |
| 2 | Self-consistency (n=3) + majority vote   | Base  | 43.2%    | 328.2 min |
| 3 | Self-consistency (n=3) + heuristic       | Base  | 43.4%    | 326.5 min |
| 4 | Self-consistency (n=3) + avg. logprob    | Base  | 44.8%    | 327.7 min |

- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a "cuda" GPU (DGX Spark)

- For your convenience, you can run the [self_consistency_scorer_math500.py](../02_math500-more-inference-scaling-scripts/self_consistency_scorer_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)

- However, note that, as discussed in [#159](https://github.com/rasbt/reasoning-from-scratch/issues/159), we decide the majority winner based on the heuristic score but only consider the first instance in each majority pair
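
- To put the small accuracy differences in the results table of this exercise into perspective, the following arithmetic converts percentage-point gaps into problem counts on the 500-problem test set; the numbers are copied from that table, and the variable and function names are only illustrative

```python
num_examples = 500

# Accuracy (%) and runtime (min) copied from the results table
acc = {"baseline": 33.4, "majority": 43.2, "heuristic": 43.4, "logprob": 44.8}
minutes = {"baseline": 129.2, "majority": 328.2}


def extra_correct(method, reference):
    # Percentage-point gap converted into a number of problems
    return round((acc[method] - acc[reference]) / 100 * num_examples)


print(extra_correct("heuristic", "majority"))  # heuristic tie-break vs. plain vote
print(extra_correct("logprob", "majority"))    # logprob tie-break vs. plain vote
print(round(minutes["majority"] / minutes["baseline"], 2))  # runtime ratio
```

- In other words, the heuristic tie-breaker corresponds to one additional correct problem and the logprob tie-breaker to eight, at roughly 2.5 times the baseline runtime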

&nbsp;
## Exercise 5.2: Using the heuristic scorer in a Best-of-N setup

- Best-of-N is similar to self-consistency in that we generate multiple answers
- However, instead of selecting the final answer based on majority vote, we score all answers using a scoring function (like `heuristic_score`) and return the highest-scoring answer
- There are several ways to implement this behavior, but the easiest one is arguably to use the existing self-consistency function from chapter 4 as a template and swap in the `heuristic_score` as shown below

```python
# ...

from reasoning_from_scratch.ch05 import (
    heuristic_score
)

def self_consistency_vote(
    model,
    tokenizer,
    prompt,
    device,
    num_samples=10,
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=2048,
    show_progress=True,
    show_long_answer=False,
    seed=None,
):
    full_answers, short_answers = [], []
    counts = Counter()
    groups = {}
    majority_winners, final_answer = [], None
    best_score, best_idx = float("-inf"), None

    for i in range(num_samples):
        if seed is not None:
            torch.manual_seed(seed + i + 1)

        answer = generate_text_stream_concat_flex(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            device=device,
            max_new_tokens=max_new_tokens,
            verbose=show_long_answer,
            generate_func=generate_text_top_p_stream_cache,
            temperature=temperature,
            top_p=top_p,
        )

        short = extract_final_candidate(answer, fallback="number_then_full")
        full_answers.append(answer)
        short_answers.append(short)
        counts[short] += 1

        if short in groups:
            groups[short].append(i)
        else:
            groups[short] = [i]

        score = heuristic_score(answer, prompt=prompt)

        if score > best_score:
            best_score, best_idx = score, i

        if show_progress:
            print(f"[Sample {i+1}/{num_samples}] → {short!r}")

    if best_idx is not None:
        final_answer = short_answers[best_idx]
        majority_winners = [final_answer]

    return {
        "full_answers": full_answers,
        "short_answers": short_answers,
        "counts": dict(counts),
        "groups": groups,
        "majority_winners": majority_winners,
        "final_answer": final_answer,
    }

```
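
- Stripped of the generation code, Best-of-N reduces to an argmax over scores; the following sketch contrasts it with majority voting on a few hand-written samples, where `toy_score` is a hypothetical stand-in for `heuristic_score` and the `(short_answer, full_answer)` pairs are made up for illustration

```python
from collections import Counter


def toy_score(answer):
    # Hypothetical scorer: reward a boxed answer, lightly penalize length
    return (1.0 if "\\boxed" in answer else 0.0) - 0.001 * len(answer)


def best_of_n(samples, score_fn):
    # Pick the sample with the highest score, ignoring how often
    # each short answer occurs
    best_idx = max(range(len(samples)), key=lambda i: score_fn(samples[i][1]))
    return samples[best_idx][0]


def majority_vote(samples):
    # Pick the most frequent short answer
    counts = Counter(short for short, _ in samples)
    return counts.most_common(1)[0][0]


samples = [
    ("7", "I think it is 7"),
    ("7", "The result should be 7"),
    ("9", "Therefore, \\boxed{9}"),
]

print(majority_vote(samples))
print(best_of_n(samples, toy_score))
```

- In this toy case, the two strategies disagree: the vote returns the more frequent answer, whereas Best-of-N returns whichever single sample the scorer prefers, so its quality depends entirely on the scorer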

- The results are shown below

|   | Method                                   | Model | Accuracy | Time      |
|---|------------------------------------------|-------|----------|-----------|
| 1 | Baseline with chain-of-thought prompting | Base  | 33.4%    | 129.2 min |
| 2 | Best-of-N (n=3) + heuristic              | Base  | 40.6%    | 327.7 min |
| 3 | Best-of-N (n=3) + avg. logprob           | Base  | 43.2%    | 330.2 min |

- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a "cuda" GPU (DGX Spark)

- For your convenience, you can run the [best_of_n_math500.py](../02_math500-more-inference-scaling-scripts/best_of_n_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)

&nbsp;
## Exercise 5.3: Using the logprob scorer as a tie-breaker in self-consistency

- The code is similar to exercise 5.1, except that we swap `heuristic_score` with `avg_logprob_answer`

```python
# ...
# from reasoning_from_scratch.ch05 import heuristic_score
from reasoning_from_scratch.ch05 import avg_logprob_answer


def evaluate_math500_stream(
    model,
    tokenizer,
    device,
    math_data,
    out_path=None,
    max_new_tokens=2048,
    verbose=False,
    prompt_suffix="",
    temperature=1.0,
    top_p=1.0,
    seed=None,
    num_samples=10,
):
    if out_path is None:
        dev_name = str(device).replace(":", "-")
        out_path = Path(f"math500-{dev_name}.jsonl")

    num_examples = len(math_data)
    num_correct = 0
    start_time = time.time()

    with open(out_path, "w", encoding="utf-8") as f:
        for i, row in enumerate(math_data, start=1):
            prompt = render_prompt(row["problem"]) + prompt_suffix

            results = self_consistency_vote(
                model=model,
                tokenizer=tokenizer,
                prompt=prompt,
                device=device,
                num_samples=num_samples,
                temperature=temperature,
                top_p=top_p,
                max_new_tokens=max_new_tokens,
                show_progress=False,
                show_long_answer=False,
                seed=seed,
            )

            # Majority vote winner available
            if results["final_answer"] is not None:
                extracted = results["final_answer"]

            ### NEW: Break tie with avg_logprob_answer
            else:
                best = None
                best_score = float("-inf")
            
                # Consider all members of each majority group
                for cand in results["majority_winners"]:
                    scores = []
            
                    for idx in results["groups"][cand]:
                        candidate_full = results["full_answers"][idx]
            
                        score = avg_logprob_answer(
                            model=model,
                            tokenizer=tokenizer,
                            prompt=prompt,
                            answer=candidate_full,
                            device=device,
                        )
                        scores.append(score)
            
                    cand_score = max(scores)
            
                    if cand_score > best_score:
                        best_score = cand_score
                        best = cand
            
                extracted = best
            # ...

    # ...
    return num_correct, num_examples, acc
```
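
- As a reminder of why an average log-probability is a usable score, the sketch below computes it from per-token probabilities in plain Python; it is an illustration only and not the implementation of `avg_logprob_answer`, which obtains the token probabilities from the model, and the probability lists here are invented

```python
import math


def avg_logprob(token_probs):
    # Mean of the per-token log-probabilities; averaging (instead of summing)
    # keeps longer answers from being penalized just for their length
    return sum(math.log(p) for p in token_probs) / len(token_probs)


confident = [0.9, 0.8, 0.95, 0.9]  # model was fairly sure of each token
hesitant = [0.9, 0.2, 0.5]         # shorter, but with uncertain tokens

print(round(avg_logprob(confident), 3))
print(round(avg_logprob(hesitant), 3))
```

- The values are always at most 0, and a value closer to 0 indicates that the model assigned higher probability to the tokens of that answer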

- The improvements over the baseline in chapter 3 and self-consistency from chapter 4 are shown below

|   | Method                                   | Model | Accuracy | Time      |
|---|------------------------------------------|-------|----------|-----------|
| 1 | Baseline with chain-of-thought prompting | Base  | 33.4%    | 129.2 min |
| 2 | Self-consistency (n=3) + majority vote   | Base  | 43.2%    | 328.2 min |
| 3 | Self-consistency (n=3) + heuristic       | Base  | 43.4%    | 326.5 min |
| 4 | Self-consistency (n=3) + avg. logprob    | Base  | 44.8%    | 327.7 min |

- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a "cuda" GPU (DGX Spark)

- For your convenience, you can run the [self_consistency_scorer_math500.py](../02_math500-more-inference-scaling-scripts/self_consistency_scorer_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)

&nbsp;
## Exercise 5.4: Using the logprob scorer in a Best-of-N setup

- To implement Best-of-N with a logprob scorer, we can use the code from exercise 5.2 and swap `heuristic_score` with `avg_logprob_answer`:

```python

from reasoning_from_scratch.ch05 import (
    avg_logprob_answer
)


def self_consistency_vote(
    model,
    tokenizer,
    prompt,
    device,
    num_samples=10,
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=2048,
    show_progress=True,
    show_long_answer=False,
    seed=None,
):
    full_answers, short_answers = [], []
    counts = Counter()
    groups = {}
    majority_winners, final_answer = [], None
    best_score, best_idx = float("-inf"), None

    for i in range(num_samples):
        if seed is not None:
            torch.manual_seed(seed + i + 1)

        answer = generate_text_stream_concat_flex(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            device=device,
            max_new_tokens=max_new_tokens,
            verbose=show_long_answer,
            generate_func=generate_text_top_p_stream_cache,
            temperature=temperature,
            top_p=top_p,
        )

        short = extract_final_candidate(answer, fallback="number_then_full")
        full_answers.append(answer)
        short_answers.append(short)
        counts[short] += 1

        if short in groups:
            groups[short].append(i)
        else:
            groups[short] = [i]

        # Score every sample (outside the if/else above)
        score = avg_logprob_answer(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            answer=answer,
            device=device,
        )

        if score > best_score:
            best_score, best_idx = score, i

        if show_progress:
            print(f"[Sample {i+1}/{num_samples}] → {short!r}")

    if best_idx is not None:
        final_answer = short_answers[best_idx]
        majority_winners = [final_answer]

    return {
        "full_answers": full_answers,
        "short_answers": short_answers,
        "counts": dict(counts),
        "groups": groups,
        "majority_winners": majority_winners,
        "final_answer": final_answer,
    }
```
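
- Because exercises 5.2 and 5.4 differ only in the scoring call, the scorer could also be passed in as an argument; the sketch below illustrates that idea with hypothetical names (`pick_best`, `toy_heuristic`, `toy_model_score`) and a dictionary of made-up scores in place of a real model, using `functools.partial` to bind the extra arguments, similar to the `partial` wrapper shown in exercise 5.5

```python
from functools import partial


def toy_heuristic(answer, prompt=None):
    # Hypothetical text-only scorer
    return 1.0 if "\\boxed" in answer else 0.0


def toy_model_score(answer, prompt, model):
    # Hypothetical model-based scorer; `model` is just a dict lookup here
    return model.get(answer, float("-inf"))


def pick_best(answers, prompt, score_fn):
    # Score each answer exactly once, then return the argmax
    scores = [score_fn(answer, prompt=prompt) for answer in answers]
    best_idx = max(range(len(answers)), key=scores.__getitem__)
    return answers[best_idx], scores


fake_model = {"x = 2": -0.4, "\\boxed{3}": -1.2}  # made-up avg. logprobs
model_score = partial(toy_model_score, model=fake_model)

answers = ["x = 2", "\\boxed{3}"]
print(pick_best(answers, "Solve for x.", toy_heuristic)[0])
print(pick_best(answers, "Solve for x.", model_score)[0])
```

- With this design, both scorers share the call shape `score_fn(answer, prompt=...)`, so the selection loop itself does not need to change when the scorer is swapped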

- The results are shown below

| # | Method                                   | Model | Accuracy | Time      |
|---|------------------------------------------|-------|----------|-----------|
| 1 | Baseline with chain-of-thought prompting | Base  | 33.4%    | 129.2 min |
| 2 | Best-of-N (n=3) + heuristic              | Base  | 40.6%    | 327.7 min |
| 3 | Best-of-N (n=3) + avg. logprob           | Base  | 43.2%    | 330.2 min |

- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a "cuda" GPU (DGX Spark)

- For your convenience, you can run the [best_of_n_math500.py](../02_math500-more-inference-scaling-scripts/best_of_n_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)

&nbsp;
## Exercise 5.5: Using the heuristic score for self-refinement

- Using `heuristic_score` is even simpler than using the logprob score because it does not need the model, tokenizer, or device bound via `partial`; all we need to do is change the following code:

```python
from functools import partial

avg_logprob_score = partial(
    avg_logprob_answer,
    model=model,
    tokenizer=tokenizer,
    device=device
)


torch.manual_seed(0)

results_logprob = self_refinement_loop(
    model=model,
    tokenizer=tokenizer,
    raw_prompt=raw_prompt,
    device=device,
    iterations=2,
    max_response_tokens=2048,
    max_critique_tokens=256,
    score_fn=avg_logprob_score,
    verbose=True,
    temperature=0.7,
    top_p=0.9,
)
```

- The updated code is:

```python
from reasoning_from_scratch.ch05 import heuristic_score

torch.manual_seed(0)

results_heuristic = self_refinement_loop(
    model=model,
    tokenizer=tokenizer,
    raw_prompt=raw_prompt,
    device=device,
    iterations=2,
    max_response_tokens=2048,
    max_critique_tokens=256,
    score_fn=heuristic_score,  # NEW
    verbose=True,
    temperature=0.7,
    top_p=0.9,
)
```
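
- To illustrate what role a `score_fn` can play in refinement, the sketch below keeps a revised answer only if it scores higher than the current one; note that this acceptance rule is an assumption made for illustration and not necessarily how `self_refinement_loop` works internally, and `refine_with_score`, `toy_score`, and the canned revisions are hypothetical

```python
def refine_with_score(initial, revise_fn, score_fn, iterations=2):
    # ASSUMPTION: a revision replaces the current answer only if it scores higher
    current, current_score = initial, score_fn(initial)
    history = [(current, current_score)]
    for _ in range(iterations):
        candidate = revise_fn(current)
        cand_score = score_fn(candidate)
        history.append((candidate, cand_score))
        if cand_score > current_score:
            current, current_score = candidate, cand_score
    return current, history


def toy_score(answer, prompt=None):
    # Hypothetical stand-in for heuristic_score: reward a boxed final answer
    return 1.0 if "\\boxed" in answer else 0.0


# Canned "revisions" in place of model-generated ones
revisions = iter(["Therefore \\boxed{12}", "Hmm, maybe 13"])

final, history = refine_with_score(
    initial="The answer is 12",
    revise_fn=lambda current: next(revisions),
    score_fn=toy_score,
)
print(final)
```

- In this toy run, the first revision is accepted because it scores higher, and the second one is rejected because it scores lower than the current answer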

- The results, using the heuristic scorer, are shown in rows 4, 5, and 10:

|    | Method                 | Scoring       | Iterations | Model      | Accuracy | Time      |
|----|------------------------|---------------|------------|------------|----------|-----------|
| 1  | Baseline (chapter 3)   | -             | -          | Base       | 15.2%    | 10.1 min  |
| 2  | Self-refinement        | None          | 1          | Base       | 25.0%    | 84.8 min  |
| 3  | Self-refinement        | None          | 2          | Base       | 22.0%    | 165.4 min |
| 4  | Self-refinement        | Heuristic     | 1          | Base       | 21.6%    | 84.7 min  |
| 5  | Self-refinement        | Heuristic     | 2          | Base       | 20.8%    | 151.4 min |
| 6  | Self-refinement        | Avg. logprob  | 1          | Base       | 21.4%    | 85.3 min  |
| 7  | Self-refinement        | Avg. logprob  | 2          | Base       | 22.0%    | 165.3 min |
|    |                        |               |            |            |          |           |
| 8  | Baseline (chapter 3)   | -             | -          | Reasoning  | 48.2%    | 182.1 min |
| 9  | Self-refinement        | None          | 1          | Reasoning  | 56.6%    | 498.8 min |
| 10 | Self-refinement        | Heuristic     | 1          | Reasoning  | 57.8%    | 498.6 min |
| 11 | Self-refinement        | Avg. logprob  | 1          | Reasoning  | 48.4%    | 499.7 min |

- The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a "cuda" GPU (DGX Spark)
- For your convenience, you can run the [self_refinement_math500.py](../02_math500-more-inference-scaling-scripts/self_refinement_math500.py) script located in [../02_math500-more-inference-scaling-scripts](../02_math500-more-inference-scaling-scripts)