# Section 1: Baseline

In [2]:
from collections import Counter
import json

"""
Evaluation complete on 5000 examples.
Average format_reward: 0.1698
Average answer_reward: 0.0276
Average reward: 0.0276
"""

RESULT_PATH = "../../results/baseline_results.jsonl"

We classify each result by whether it received a format reward and whether it received an answer reward.

In [3]:
results = []
with open(RESULT_PATH, "r") as f:
    for line in f:
        item = json.loads(line)
        format_reward = item.get("format_reward")
        answer_reward = item.get("answer_reward")
        reward = item.get("reward")

        if format_reward and answer_reward:
            item["class"] = "format_answer"
        elif format_reward:
            item["class"] = "format_noanswer"
        elif answer_reward:
            item["class"] = "noformat_answer"
        else:
            item["class"] = "noformat_noanswer"

        results.append(item)

In [4]:
classes = Counter([item["class"] for item in results])
print(classes)

Counter({'noformat_noanswer': 4151, 'format_noanswer': 711, 'format_answer': 138})
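As a sanity check, the class counts can be converted back into rates and compared with the averages reported by the evaluation run (format_reward 0.1698, answer_reward 0.0276). The snippet below is a minimal sketch that hard-codes the counts printed above rather than re-reading the results file; the names `format_rate` and `answer_rate` are introduced here for illustration only.

```python
from collections import Counter

# Counts copied from the Counter output above (not re-read from disk).
classes = Counter(
    {"noformat_noanswer": 4151, "format_noanswer": 711, "format_answer": 138}
)
total = sum(classes.values())

# A result earns the format reward in both "format_*" classes.
format_rate = (classes["format_answer"] + classes["format_noanswer"]) / total
# A result earns the answer reward in both "*_answer" classes;
# "noformat_answer" never occurs here, so Counter returns 0 for it.
answer_rate = (classes["format_answer"] + classes["noformat_answer"]) / total

print(f"total={total} format_rate={format_rate:.4f} answer_rate={answer_rate:.4f}")
```

Both rates agree with the reported averages: 849 of 5000 outputs earn the format reward and 138 of 5000 earn the answer reward.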


## Observe at least 10 cases where the format reward is 0. Is the issue with the base model's output or with the parser?

In [5]:
counter = 0
for item in results:
    if item["class"] == "noformat_noanswer":
        answer_tag = item["generated_text"].split('<answer>')
        if len(answer_tag) == 1:
            text = "[No answer tag]"
        else:
            text = answer_tag[1]
        print(f"Output: {text}; Ground truth: {item['ground_truth']}\n")
        counter += 1
    if counter >= 10:
        break

Output:  46 </answer>; Ground truth: 420

Output: [No answer tag]; Ground truth: \dfrac{1}{9}

Output: [No answer tag]; Ground truth: 3400

Output: 7200 seconds - 930 seconds = 6270 seconds = 1 hr 41 min 10 seconds
12 noon + (1 hr 41 min 10 seconds) = 1:41:10 p.m. You are correct! The exact time they will arrive at their destination is 1:41:10 p.m. Here's the step-by-step reasoning process:

1. First, we need to calculate the difference between 12 noon and 2:30 PM in seconds. There are 60 minutes in an hour and 60 seconds in a minute, so there are \(12 \times 3600 = 43200\) seconds in 2 hours (which is 2 PM). Since 2:30 PM is 30 minutes after 2 PM, which is \(30 \times 60 = 1800\) seconds, the total time between 12 Noon and 2:30 PM is \(43200 + 1800 = 45000\) seconds.
2. Next, we need to subtract this time from 7200 seconds to find out how much time is left until they arrive at their destination. So, we subtract 7200 seconds from 45000 seconds. (Using the concept of clock arithmetic, w

After inspecting 10 cases, I believe the issue lies mostly with the model's output: several generations contain no answer tag at all, and others open an answer tag but ramble on without closing it. However, some outputs look correctly formatted (with a wrong answer) and still received a format reward of 0, which deserves a closer look.
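One way to make this inspection less anecdotal is to bucket every output by which answer tags it contains. The following is a rough sketch, not the grader's actual logic: `tag_profile` is a hypothetical helper, the sample strings are made-up stand-ins for `item["generated_text"]`, and the real parser may check more than these two tags.

```python
from collections import Counter


def tag_profile(text: str) -> str:
    """Label an output by which answer tags it contains (illustrative helper)."""
    has_open = "<answer>" in text
    has_close = "</answer>" in text
    if has_open and has_close:
        return "open_and_close"
    if has_open:
        return "open_only"
    if has_close:
        return "close_only"
    return "no_tags"


# Toy strings standing in for item["generated_text"]; not real model outputs.
samples = [
    "some reasoning <answer> 46 </answer>",
    "some reasoning with no tags at all",
    "some reasoning <answer> 7200 seconds - 930 seconds = ...",
]
profile = Counter(tag_profile(s) for s in samples)
print(profile)
```

On the real data, the same counter could be built with `Counter(tag_profile(item["generated_text"]) for item in results if item["class"] == "noformat_noanswer")`. A large `open_and_close` bucket among zero-format results would point toward the parser's stricter requirements rather than the model's output.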

## Observe at least 10 cases where the format reward is 1 but the answer reward is 0. Is the issue with the base model's output or with the parser?

In [6]:
counter = 0
for item in results:
    if item["class"] == "format_noanswer":
        answer_tag = item["generated_text"].split('<answer>')
        if len(answer_tag) == 1:
            text = "[No answer tag]"
        else:
            text = answer_tag[1]
        print(f"Output: {text}; Ground truth: {item['ground_truth']}\n")
        counter += 1
    if counter >= 10:
        break

Output:  $200~units^2$ </answer>; Ground truth: 100\text{ square units}

Output: $6290115$ </answer>; Ground truth: 6290000

Output:  <hr>
Since Ed has a 90% average on his five tests, the total score of his five tests is $0.9 \times 500 = 450$.
Let the missing score be $x$. Then, the total score of the other four tests is $450 - x$.
Since the scores of the last two tests differ by three points, let the missing score be $x$. Then, the second-to-last test score is $x + 3$, and the last test score is $x - 3$.
The sum of the scores of the first three tests is $87 + 85 + 87 = 259$.
Therefore, the equation is $259 + x + (x+3) + (x-3) = 450$.
Simplifying:
$3x = 198$
$x = 66$
</answer>; Ground truth: 97

Output:  $3t$ </answer>; Ground truth: 4t

Output:  $\frac{5}{7}$ </answer>; Ground truth: \frac{5}{8}

Output: Therefore, the sum of all integers $n$ such that $\dfrac{12}{n}$ is also an integer is $\boxed{28}$. </answer>; Ground truth: 0

Output:  3 </answer>; Ground truth: 14

Output:  To 

After inspecting 10 cases where the format reward is 1 but the answer reward is 0, it is no longer clear to me why some of the earlier outputs received a format reward of 0, since many of them also contained answer tags. Among these 10 cases, I see only one where I believe the answer reward should have been 1.

## How well does the Qwen 2.5 Math 1.5B zero-shot baseline perform on MATH?

Not well. The average reward is 0.0276, which means only 138 of the 5000 problems were both correctly formatted and correctly solved.

# Question Only Prompt

In [7]:
QO_RESULT_PATH = "../../results/baseline_results_question_only.jsonl"

results = []
with open(QO_RESULT_PATH, "r") as f:
    for line in f:
        item = json.loads(line)
        format_reward = item.get("format_reward")
        answer_reward = item.get("answer_reward")
        reward = item.get("reward")

        if format_reward and answer_reward:
            item["class"] = "format_answer"
        elif format_reward:
            item["class"] = "format_noanswer"
        elif answer_reward:
            item["class"] = "noformat_answer"
        else:
            item["class"] = "noformat_noanswer"

        results.append(item)

classes = Counter([item["class"] for item in results])
print(classes)

Counter({'noformat_noanswer': 5000})
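The load-and-classify loop is identical for both result files, so it could be factored into a helper. The sketch below assumes the same JSONL fields used above (`format_reward`, `answer_reward`); `classify` and `load_and_classify` are names introduced here for illustration, and the demo rows are made up so the block runs without the real result files.

```python
import json
import os
import tempfile
from collections import Counter


def classify(item: dict) -> str:
    """Return the same four class labels used in the cells above."""
    fmt = "format" if item.get("format_reward") else "noformat"
    ans = "answer" if item.get("answer_reward") else "noanswer"
    return f"{fmt}_{ans}"


def load_and_classify(path: str) -> list:
    """Read a JSONL results file and attach a "class" label to each row."""
    results = []
    with open(path, "r") as f:
        for line in f:
            if not line.strip():  # tolerate blank lines
                continue
            item = json.loads(line)
            item["class"] = classify(item)
            results.append(item)
    return results


# Demo on made-up rows written to a temporary file (not real results).
rows = [
    {"format_reward": 1.0, "answer_reward": 1.0},
    {"format_reward": 1.0, "answer_reward": 0.0},
    {"format_reward": 0.0, "answer_reward": 0.0},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for row in rows:
        tmp.write(json.dumps(row) + "\n")

demo_counts = Counter(item["class"] for item in load_and_classify(tmp.name))
os.remove(tmp.name)
print(demo_counts)
```

With such a helper, each section reduces to `Counter(item["class"] for item in load_and_classify(RESULT_PATH))`, with `QO_RESULT_PATH` substituted for the question-only run.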
