# GRPO Reward Ablation: Binary vs Tiered

## Hypothesis
The closeness and format bonuses in the tiered reward may encourage reward hacking:
the model can optimize for easy partial credit instead of correctness.

## Setup
- Model: Qwen2.5-0.5B-Instruct + LoRA
- Dataset: GSM8K (SFT warm-start on the same 1,024-example train subset; GRPO prompts drawn from the full train split)
- Steps: 200
- Only change: reward function (binary 1.0/0.0)

## Baseline (tiered reward)

| Condition | Reward |
|---|---|
| Correct answer + correct format | 8.0 |
| Correct answer + no format | 3.2 |
| Wrong answer + correct format | 1.6 + 1.2 × closeness |
| Wrong answer + no format | 1.2 × closeness |
| No number extracted | 0.0 |

Here `closeness = max(0, 1 - |answer - gold| / max(|gold|, 1))` ranges from 0 (the error is at least as large as `max(|gold|, 1)`) toward 1 as the answer approaches the gold value, so a wrong answer earns partial credit in proportion to how near it is.
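As a quick worked example of the formula above, the sketch below evaluates `closeness` for a few predictions against a gold answer of 72. The helper name `closeness` and the sample predictions are illustrative and not part of the training code.

```python
def closeness(answer: int, gold: int) -> float:
    """1.0 for an exact match, falling to 0.0 once the error reaches max(|gold|, 1)."""
    return max(0.0, 1 - abs(answer - gold) / max(abs(gold), 1))

gold = 72
for answer in (72, 70, 99, 150):
    c = closeness(answer, gold)
    # 1.2 * closeness is the partial-credit term in the two "wrong answer" rows
    print(f"answer={answer:>3}  closeness={c:.3f}  partial credit={1.2 * c:.3f}")
```

A wrong answer of 99 without the expected format earns 1.2 × 0.625 = 0.75, which matches the value printed by the tiered reward's quick test later in the notebook.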

- Pre-GRPO accuracy: 25.15%
- Post-GRPO accuracy: 33.18%
- Mean reward: 3.58 → 5.05

## Results (binary reward)
- Pre-GRPO accuracy: 25.76% (170/660)
- Post-GRPO accuracy: 35.91% (237/660)
- Mean reward: 0.23 → 0.43 at step 200 (peak 0.512 at step 110)
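To gauge whether the gap between 33.18% (tiered) and 35.91% (binary) exceeds sampling noise, a rough check is to compare it with the binomial standard error on the 660-example test split. This is only a sketch under simplifying assumptions: the two runs are treated as independent and unpaired, and the count of 219 correct for the tiered run is inferred from 33.18% of 660 rather than read from a log.

```python
import math

def accuracy_se(correct: int, total: int) -> float:
    """Binomial standard error of an accuracy estimate."""
    p = correct / total
    return math.sqrt(p * (1 - p) / total)

n = 660
tiered_correct = 219  # assumption: inferred from 33.18% of 660
binary_correct = 237  # from the "237/660 = 35.91%" eval output

diff = (binary_correct - tiered_correct) / n
# Unpaired comparison: variances of the two estimates add
se_diff = math.sqrt(accuracy_se(tiered_correct, n) ** 2 + accuracy_se(binary_correct, n) ** 2)
z = diff / se_diff

print(f"difference = {diff:.4f}, standard error = {se_diff:.4f}, z = {z:.2f}")
```

Under these assumptions the gap is roughly one standard error, so a single run per condition is weak evidence in either direction; repeating each condition over several seeds would give a firmer comparison.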

In [None]:
!pip install -qU datasets trl wandb peft accelerate

In [None]:
import torch
import regex as re
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainerCallback
from peft import LoraConfig, get_peft_model

## 1. Load Model + LoRA

In [None]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
max_seq_length = 512

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.gradient_checkpointing_enable()

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

`torch_dtype` is deprecated! Use `dtype` instead!


In [None]:
model.print_trainable_parameters()

trainable params: 8,798,208 || all params: 502,830,976 || trainable%: 1.7497


## 2. SFT Warm-Start

Fine-tune on a subset of GSM8K gold chain-of-thought answers so the model
learns the reasoning format before GRPO.

In [None]:
def clean_gold_answer(answer_text):
    """Strip <<...>> annotations and reformat with 'The answer is: N.' ending."""
    parts = answer_text.split("####")
    reasoning = parts[0].strip()
    final_num = parts[1].strip() if len(parts) > 1 else ""
    reasoning = re.sub(r'<<.*?>>', '', reasoning)
    return f"{reasoning}\nThe answer is: {final_num}."

# Test it
test_ans = 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'
print(clean_gold_answer(test_ans))

Natalia sold 48/2 = 24 clips in May.
Natalia sold 48+24 = 72 clips altogether in April and May.
The answer is: 72.


In [None]:
SYSTEM_PROMPT = (
    "You are a helpful math assistant. Solve the problem step by step, "
    "then give your final answer as a single number on the last line in "
    """this exact format:\n\n        The answer is: {number}.\n        """
)

sft_size = 1024
sft_subset = load_dataset("openai/gsm8k", "main")["train"].shuffle(seed=42).select(range(sft_size))

def make_sft_example(ex):
    cleaned = clean_gold_answer(ex["answer"])
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": ex["question"]},
        {"role": "assistant", "content": cleaned},
    ]
    return {"messages": messages}

sft_ds = sft_subset.map(make_sft_example, remove_columns=sft_subset.column_names)
sft_ds

Dataset({
    features: ['messages'],
    num_rows: 1024
})

In [None]:
sft_config = SFTConfig(
    output_dir="sft_qwen_gsm8k",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,  # higher LR for LoRA
    logging_steps=10,
    save_steps=100,
    report_to='none',
    max_length=256,

    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    fp16=False,
    dataset_text_field=None,
)

sft_trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=sft_ds,
    processing_class=tokenizer,
)

sft_trainer.train()

warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


| Step | Training Loss |
|---|---|
| 10 | 1.355938 |
| 20 | 0.422635 |
| 30 | 0.338875 |
| 40 | 0.292715 |
| 50 | 0.273123 |
| 60 | 0.276681 |
| 70 | 0.264854 |
| 80 | 0.245806 |
| 90 | 0.240012 |


TrainOutput(global_step=96, training_loss=0.4022846619288127, metrics={'train_runtime': 178.8023, 'train_samples_per_second': 17.181, 'train_steps_per_second': 0.537, 'total_flos': 1700948749240320.0, 'train_loss': 0.4022846619288127})

## Save SFT checkpoint & merge LoRA for GRPO

In [None]:
# Save merged model (LoRA weights folded into base) for GRPO
merged_model = model.merge_and_unload()
merged_model.save_pretrained("sft_qwen_merged")
tokenizer.save_pretrained("sft_qwen_merged")

del merged_model, sft_config, sft_trainer
torch.cuda.empty_cache()

## 3. Quick SFT sanity check

In [None]:
def format_prompt(question):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Quick test
model.eval()
prompt = format_prompt("If a bag has 5 red and 3 blue marbles, how many marbles are there in total?")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))

system
You are a helpful math assistant. Solve the problem step by step, then give your final answer as a single number on the last line in this exact format:

        The answer is: {number}.
        
user
If a bag has 5 red and 3 blue marbles, how many marbles are there in total?
assistant
There are 8 marbles because 5 + 3 = 8
The answer is: 8.


## 4. Dataset Setup for GRPO

In [None]:
ds = load_dataset("openai/gsm8k", "main")

def extract_gold(ex):
    return {"gold": ex["answer"].split("####")[-1].strip()}

ds = ds.map(extract_gold, remove_columns=["answer"])
ds = ds.rename_column("question", "prompt")

# Split test into eval + final test
split = ds["test"].train_test_split(test_size=0.5, seed=42)
ds["eval"] = split["train"]
ds["test"] = split["test"]

ds_rl = ds.map(lambda x: {"prompt": format_prompt(x["prompt"])})
ds_rl

DatasetDict({
    train: Dataset({
        features: ['prompt', 'gold'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['prompt', 'gold'],
        num_rows: 660
    })
    eval: Dataset({
        features: ['prompt', 'gold'],
        num_rows: 659
    })
})

## 5. Reward Function

In [None]:
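def extract_answer(response):
    """Return (answer, has_format) parsed from a model response.

    Prefers the last "The answer is ..." match (has_format=True); otherwise falls
    back to the last number in the text (has_format=False). Values are truncated
    to int, and (None, False) is returned when no number is found.
    """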
    try:
        matches = re.findall(r'The answer is[:\s]*(\-?\d[\d,]*\.?\d*)', response, re.IGNORECASE)
        if matches:
            return int(float(matches[-1].replace(",", ""))), True
        nums = re.findall(r'\-?\d[\d,]*\.?\d*', response)
        if nums:
            return int(float(nums[-1].replace(",", ""))), False
    except Exception as e:
        print(f"[extract_answer error] response={response[:80]!r} err={e}")
    return None, False

assert extract_answer("The answer is: 72.") == (72, True)
assert extract_answer("ans : 42") == (42, False)
assert extract_answer("no numbers here") == (None, False)
print("✅ extract_answer tests passed")

✅ extract_answer tests passed
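The asserts above cover the basic cases. The sketch below probes a few edge cases that matter for the reward: decimal answers are truncated, the format check is case-insensitive and does not require the colon or trailing period, and a currency symbol defeats the format match. It re-declares the extractor with the standard-library `re` module (the notebook imports `regex as re`) and drops the try/except so it runs on its own; the sample strings are illustrative.

```python
import re

def extract_answer(response):
    """Same patterns as the notebook's extractor, without the error handling."""
    matches = re.findall(r'The answer is[:\s]*(\-?\d[\d,]*\.?\d*)', response, re.IGNORECASE)
    if matches:
        return int(float(matches[-1].replace(",", ""))), True
    nums = re.findall(r'\-?\d[\d,]*\.?\d*', response)
    if nums:
        return int(float(nums[-1].replace(",", ""))), False
    return None, False

cases = [
    "The answer is: 72.5.",   # decimal is truncated to 72
    "the answer is 72",       # lowercase, no colon or period: still counts as formatted
    "The answer is: $72",     # '$' blocks the format match; falls back to the last number
    "The total is 1,200",     # thousands separator is stripped in the fallback path
]
for text in cases:
    print(f"{text!r:28} -> {extract_answer(text)}")
```

Because the format flag only requires the phrase "The answer is" followed by a number, the tiered reward's format bonus is fairly easy to earn.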


In [None]:
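# Tiered (baseline) reward, shown for reference; the binary reward_fn in the next cell overrides it.
def reward_fn(completions, **kwargs):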
    golds = kwargs["gold"]
    rewards = []
    for comp, gold in zip(completions, golds):
        try:
            ans, has_fmt = extract_answer(comp)
            gold_int = int(float(gold.replace(",", "")))
            if ans == gold_int and has_fmt:
                rewards.append(8.0)
            elif ans == gold_int and not has_fmt:
                rewards.append(3.2)
            elif ans != gold_int and has_fmt:
                closeness = max(0, 1 - abs(ans - gold_int) / max(abs(gold_int), 1))
                rewards.append(1.6 + 1.2 * closeness)
            elif ans is not None:
                closeness = max(0, 1 - abs(ans - gold_int) / max(abs(gold_int), 1))
                rewards.append(1.2 * closeness)
            else:
                rewards.append(0.0)
        except Exception as e:
            print(f"[reward_fn error] gold={gold!r} comp={comp[:80]!r} err={e}")
            rewards.append(0.0)
    return rewards

# Quick test
print(reward_fn(
    ["The answer is: 72.", "the answer is 72", "Answer is 99.", "wrong"],
    gold=["72", "72", "72", "72"]
))

[8.0, 8.0, 0.75, 0.0]


In [None]:
# Binary reward used in this ablation. Redefining reward_fn replaces the tiered version above.
def reward_fn(completions, **kwargs):
    golds = kwargs["gold"]
    rewards = []
    for comp, gold in zip(completions, golds):
        try:
            ans, _ = extract_answer(comp)
            gold_int = int(float(gold.replace(",", "")))
            rewards.append(1.0 if ans == gold_int else 0.0)
        except Exception as e:
            print(f"[reward_fn error] gold={gold!r} comp={comp[:80]!r} err={e}")
            rewards.append(0.0)
    return rewards
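To see how the two reward schemes shape the learning signal, the sketch below applies group-relative normalization, as commonly described for GRPO, to one group of 8 completions in which every answer is wrong. It is an illustration only: the helper `group_advantages`, the `eps` value, the use of the population standard deviation, and the sample reward values are assumptions, and TRL's implementation may differ in these details.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary reward: every completion in the group is wrong, so all rewards are 0.0
binary_group = [0.0] * 8

# Tiered reward for an equally wrong group (hypothetical values):
# formatted near-miss (1.6 + 1.2 * 0.9), formatted far-miss (1.6 + 1.2 * 0.1), no number (0.0)
tiered_group = [2.68, 1.72, 1.72, 1.72, 0.0, 0.0, 1.72, 2.68]

binary_adv = group_advantages(binary_group)
tiered_adv = group_advantages(tiered_group)

print("binary:", [round(a, 2) for a in binary_adv])
print("tiered:", [round(a, 2) for a in tiered_adv])
```

With the binary reward, an all-wrong group yields zero advantages and therefore no policy-gradient signal. With the tiered reward, the same group still pushes the policy toward formatted near-misses, which is the partial-credit pressure the hypothesis is concerned with.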


## 6. Eval Function

In [None]:
from IPython.display import display, Markdown
from tqdm import tqdm

def perf_check(model, tokenizer):
    model.eval()
    tokenizer.padding_side = "left"

    correct, total = 0, 0
    results = []
    batch_size = 64
    table_rows = []

    test_data = list(ds_rl["test"])
    for i in tqdm(range(0, len(test_data), batch_size), desc="Evaluating"):
        batch = test_data[i:i+batch_size]
        prompts = [ex["prompt"] for ex in batch]
        golds = [int(float(ex["gold"].replace(",", ""))) for ex in batch]

        inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
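            # No explicit sampling arguments: generate() falls back to the model's
            # generation_config, so results can vary between runs if that config samples.
            out = model.generate(**inputs, max_new_tokens=256)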

        for j, (ids, gold_int) in enumerate(zip(out, golds)):
            response = tokenizer.decode(ids[inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
            ans, has_fmt = extract_answer(response)

            is_correct = (ans == gold_int)
            correct += int(is_correct)
            total += 1

            results.append({
                "gold": gold_int,
                "predicted": ans,
                "correct": is_correct,
                "response": response[:200],
            })
            table_rows.append(f"| {total} | {gold_int} | {ans} | {'✅' if is_correct else '❌'} | {response[:80].replace(chr(10), ' ')} |")

    print(f"\nFinal Accuracy: {correct}/{total} = {correct/total:.2%}")

    model.train()
    tokenizer.padding_side = "right"

    return table_rows

## Pre-GRPO baseline accuracy

In [None]:
_ = perf_check(model, tokenizer)

Evaluating: 100%|██████████| 11/11 [01:47<00:00,  9.74s/it]


Final Accuracy: 170/660 = 25.76%





## 7. Reload merged SFT model with fresh LoRA for GRPO

In [None]:
# Load the merged SFT model, then apply fresh LoRA for GRPO
del model
torch.cuda.empty_cache()

tokenizer = AutoTokenizer.from_pretrained("sft_qwen_merged")
model = AutoModelForCausalLM.from_pretrained(
    "sft_qwen_merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.gradient_checkpointing_enable()

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

trainable params: 8,798,208 || all params: 502,830,976 || trainable%: 1.7497


## 8. GRPO Training

In [None]:
import wandb

wandb.login()
wandb.init(project="grpo-gsm8k", name="grpo-qwen2.5-0.5B-lora")


In [None]:
class VibecheckCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            reward = logs.get('reward')
            completions = logs.get('completions')
            if reward is not None:
                print(f"Step {state.global_step} | reward: {reward:.3f}")
            if completions and len(completions) > 0:
                print(f"Step {state.global_step} | Sample:\n{completions[0][:200]}...")


config = GRPOConfig(
    output_dir="grpo_qwen_gsm8k",
    num_generations=8,
    max_completion_length=256,
    #num_train_epochs=1,
    max_steps=200,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,  # lower than the SFT LR (2e-4)
    logging_steps=10,
    report_to='wandb',
    beta=0.04,
    lr_scheduler_type="cosine",
    warmup_steps=20,
    weight_decay=0.01,
    eval_steps=50,  # no effect here: no eval_dataset or eval_strategy is set
    save_steps=100,
    save_total_limit=2,
    bf16=True,
    fp16=False,
)


trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_fn,
    args=config,
    train_dataset=ds_rl["train"],
    processing_class=tokenizer,
    callbacks=[VibecheckCallback()],
)

trainer.train()

Passing `generation_config` together with generation-related arguments=({'disable_compile'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.


| Step | Training Loss |
|---|---|
| 10 | 0.029969 |
| 20 | 0.013656 |
| 30 | 0.03396 |
| 40 | 0.024576 |
| 50 | 0.022531 |
| 60 | -0.021941 |
| 70 | -0.002235 |
| 80 | -0.012913 |
| 90 | 0.011064 |
| 100 | -0.007362 |


Step 10 | reward: 0.231
Step 20 | reward: 0.319
Step 30 | reward: 0.356
Step 40 | reward: 0.350
Step 50 | reward: 0.244
Step 60 | reward: 0.263
Step 70 | reward: 0.438
Step 80 | reward: 0.400
Step 90 | reward: 0.406
Step 100 | reward: 0.406
Step 110 | reward: 0.512
Step 120 | reward: 0.325
Step 130 | reward: 0.350
Step 140 | reward: 0.463
Step 150 | reward: 0.456
Step 160 | reward: 0.381
Step 170 | reward: 0.375
Step 180 | reward: 0.394
Step 190 | reward: 0.431
Step 200 | reward: 0.431


TrainOutput(global_step=200, training_loss=-0.0012255232757888734, metrics={'train_runtime': 2857.8148, 'train_samples_per_second': 1.12, 'train_steps_per_second': 0.07, 'total_flos': 0.0, 'train_loss': -0.0012255232757888734})
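The per-step rewards logged above are noisy, so a single endpoint or the peak can overstate the trend. The sketch below averages the first and last five logged values to summarize the change more robustly; the window size of five is an arbitrary choice, and the numbers are copied from the callback output.

```python
from statistics import mean

# Mean reward logged every 10 steps (steps 10 through 200)
rewards = [
    0.231, 0.319, 0.356, 0.350, 0.244, 0.263, 0.438, 0.400, 0.406, 0.406,
    0.512, 0.325, 0.350, 0.463, 0.456, 0.381, 0.375, 0.394, 0.431, 0.431,
]

window = 5  # arbitrary smoothing window
early = mean(rewards[:window])
late = mean(rewards[-window:])
peak_step = (rewards.index(max(rewards)) + 1) * 10

print(f"first {window} logs: {early:.3f}")
print(f"last {window} logs:  {late:.3f}")
print(f"peak {max(rewards):.3f} at step {peak_step}")
```

Averaged this way, the reward rises from about 0.30 to about 0.40, while the 0.512 peak at step 110 is an outlier rather than the final level.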

## 9. Post-GRPO Eval

In [None]:
_ = perf_check(trainer.model, tokenizer)

Evaluating: 100%|██████████| 11/11 [03:36<00:00, 19.65s/it]


Final Accuracy: 237/660 = 35.91%





## 10. Test inference

In [None]:
def infer(model, tokenizer, prompt):
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=.9,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True
        )
    response = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    return response

# Test on a sample
[infer(trainer.model, tokenizer, ds_rl['eval'][0]['prompt']) for i in range(5)], ds_rl['eval'][0]['gold']

## 11. Push to Hub

In [None]:
#merged_model = model.merge_and_unload()
#merged_model.save_pretrained("grpo_qwen_merged")
#tokenizer.save_pretrained("grpo_qwen_merged")

# Push merged model
#from huggingface_hub import HfApi
#api = HfApi()
#api.create_repo("tripathysagar/qwen2.5-0.5B-grpo-gsm8k", exist_ok=True)
#api.upload_folder(
#    folder_path="grpo_qwen_merged",
#    repo_id="tripathysagar/qwen2.5-0.5B-grpo-gsm8k",
#)