# Model Evaluation Demo

This notebook shows how to evaluate a model checkpoint programmatically using the `load_config` and `run_evaluation` functions from `evaluate.py`.

In [1]:
import pandas as pd
from evaluate import run_evaluation, load_config


In [8]:
# Load configuration
config_path = "debug_config.yaml"
config = load_config(config_path)

# You can modify config programmatically here if needed
print("Model:", config['model']['name_or_path'])
print("Dataset:", config['data']['dataset_name'])

Model: /home/rlvr_experiments/Qwen2-0.5B-GRPO-General/checkpoint-300
Dataset: openai/gsm8k
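
The comment in the cell above notes that the config can be modified before evaluation. A minimal sketch of one way to do that, assuming `load_config` returns a plain nested dict (as the `config['model']['name_or_path']` lookups suggest). The `with_overrides` helper and the override path are hypothetical and not part of `evaluate.py`:

```python
import copy

def with_overrides(config, overrides):
    """Return a deep copy of `config` with nested keys from `overrides` applied."""
    merged = copy.deepcopy(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys in the nested section are preserved.
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged

# Stand-in for the dict returned by load_config (only the two keys printed above).
base_config = {
    "model": {"name_or_path": "/home/rlvr_experiments/Qwen2-0.5B-GRPO-General/checkpoint-300"},
    "data": {"dataset_name": "openai/gsm8k"},
}

# Hypothetical override: point at a different checkpoint directory.
debug_config = with_overrides(base_config, {"model": {"name_or_path": "path/to/other-checkpoint"}})
print("Overridden:", debug_config["model"]["name_or_path"])
print("Original:  ", base_config["model"]["name_or_path"])  # left untouched
```

Working on a copy keeps the loaded config intact, so the same notebook session can compare several variants without reloading the YAML file.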


In [11]:
# Run evaluation on a small subset (here, 5 samples) as a quick smoke test
metrics, samples = run_evaluation(config, max_samples=5)

print("Metrics:", metrics)

Loading dataset: openai/gsm8k
Loading model: /home/rlvr_experiments/Qwen2-0.5B-GRPO-General/checkpoint-300
Starting evaluation...


100%|██████████| 5/5 [00:17<00:00,  3.42s/it]

Final Results: Accuracy: 0.4000, Format: 1.0000
Metrics: {'accuracy': 0.4, 'format_compliance': 1.0, 'total_samples': 5}
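
The scoring logic inside `evaluate.py` is not shown in this notebook. As an illustration only, the sketch below shows one plausible way the two metrics could be computed, assuming completions follow the `<think>...</think><answer>...</answer>` layout and GSM8K ground truths end with `#### <number>`. The function names and the normalization rules (dropping `$` and thousands separators) are assumptions, not the actual implementation:

```python
import re

# Assumed patterns; the real evaluate.py may use different ones.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FORMAT_RE = re.compile(r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def normalize_number(text):
    """Strip '$', commas, and whitespace; return a float, or None if not numeric."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

def format_score(completion):
    """1.0 if the completion is exactly one think block followed by one answer block."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0

def accuracy_score(completion, ground_truth):
    """1.0 if the number in <answer> equals the number after '####' in the ground truth."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    predicted = normalize_number(match.group(1))
    expected = normalize_number(ground_truth.split("####")[-1])
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if predicted == expected else 0.0

completion = "<think> 16 - 7 = 9, and 9 * $2 = $18. </think><answer> $18</answer>"
ground_truth = "She makes 9 * 2 = $18 every day.\n#### 18"
print(format_score(completion), accuracy_score(completion, ground_truth))
```

Comparing normalized numbers rather than raw strings is what would let an answer such as `$18` count as matching a ground truth of `18`.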





In [12]:
# Display samples in a DataFrame
df = pd.DataFrame(samples)
pd.set_option('display.max_colwidth', None)
display(df[['completion', 'ground_truth', 'accuracy_score', 'format_score']])

index,completion,ground_truth,accuracy_score,format_score
0,"<think> Janet's ducks lay 16 eggs per day, so they lay a total of 16 * 1 = 16 eggs per day. She eats three eggs for breakfast every morning, so she eats 3 eggs per day. She bakes muffins for her friends with four eggs each day, so she bakes 4 * 1 = 4 eggs per day. So far, she has eaten or baked 3 + 4 = 7 eggs per day. Therefore, she has 16 - 7 = 9 eggs left to sell at the farmers' market. She sells each fresh duck egg at $2, so she makes 9 * $2 = $18 from selling the eggs at the farmers' market each day. </think><answer> $18</answer>",Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.\n#### 18,1.0,1.0
1,"<think> A robe requires 2 bolts of blue fiber. Half of that amount is 2/2 = 1 bolt of white fiber. So, the total number of bolts needed is 2 + 1 = 3 bolts. </think><answer>3</answer>",It takes 2/2=<<2/2=1>>1 bolt of white fiber\nSo the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n#### 3,1.0,1.0
2,"<think> The value of the house increased by 150% of its original price of $80,000, so the increase is 150/100 * $80,000 = $120,000. The total cost of the repairs was $50,000. So the total cost of buying and fixing the house is $80,000 + $50,000 = $130,000. Josh made a profit of $130,000 - $120,000 = $10,000. </think><answer> $10,000</answer>","The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000",0.0,1.0
3,"<think> James runs 3 sprints 3 times a week, so he runs 3 x 60 = 180 meters each week. Therefore, he runs a total of 180 x 2 = 360 meters per week. </think><answer>360</answer>",He sprints 3*3=<<3*3=9>>9 times\nSo he runs 9*60=<<9*60=540>>540 meters\n#### 540,0.0,1.0
4,"<think> Wendi gives her chickens 15 cups in the morning and 25 cups in the afternoon, so she gives a total of 15 + 25 = 40 cups of feed per day. Since her flock is 20 chickens, she needs to give her chickens 40 * 20 = 800 cups of feed in the final meal of the day. </think><answer>800</answer>","If each chicken eats 3 cups of feed per day, then for 20 chickens they would need 3*20=<<3*20=60>>60 cups of feed per day.\nIf she feeds the flock 15 cups of feed in the morning, and 25 cups in the afternoon, then the final meal would require 60-15-25=<<60-15-25=20>>20 cups of chicken feed.\n#### 20",0.0,1.0
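
The per-sample scores in the table above are consistent with the aggregate metrics reported earlier if those metrics are simple means, which is an assumption about `run_evaluation` rather than something this notebook confirms. A minimal sketch that recomputes them and lists the incorrect rows, assuming `samples` is a list of dicts with the keys shown (the `summarize` helper is hypothetical):

```python
from statistics import fmean

# Per-sample scores copied from the table above (rows 0-4).
samples = [
    {"accuracy_score": 1.0, "format_score": 1.0},
    {"accuracy_score": 1.0, "format_score": 1.0},
    {"accuracy_score": 0.0, "format_score": 1.0},
    {"accuracy_score": 0.0, "format_score": 1.0},
    {"accuracy_score": 0.0, "format_score": 1.0},
]

def summarize(samples):
    """Average the per-sample scores into aggregate metrics."""
    return {
        "accuracy": fmean(s["accuracy_score"] for s in samples),
        "format_compliance": fmean(s["format_score"] for s in samples),
        "total_samples": len(samples),
    }

summary = summarize(samples)
print(summary)

# Indices of samples whose final answer did not match the ground truth.
failed_indices = [i for i, s in enumerate(samples) if s["accuracy_score"] == 0.0]
print("Incorrect samples:", failed_indices)
```

All three failures here have a format score of 1.0, so they are reasoning errors rather than formatting errors, which makes them good candidates for the single-sample inspection that follows.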


In [14]:
# Deep dive into a single sample
sample_idx = 0
print("Prompt:\n", df.iloc[sample_idx]['prompt'])
print("\nModel Completion:\n", df.iloc[sample_idx]['completion'])
print("\nGround Truth:\n", df.iloc[sample_idx]['ground_truth'])

Prompt:
 <|im_start|>system
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>
<|im_end|>
<|im_start|>user
Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?<|im_end|>
<|im_start|>assistant


Model Completion:
 <think> Janet's ducks lay 16 eggs per day, so they lay a total of 16 * 1 = 16 eggs per day. She eats three eggs for breakfast every morning, so she eats 3 eggs per day. She bakes muffins for her friends with four