qtris123/Counting-Math-Problems

 
 

GRPO for Countdown Math Problems

This project implements the Group Relative Policy Optimization (GRPO) algorithm to fine-tune a language model on the "Countdown" math task. The goal is to train an LLM to generate correct mathematical equations that reach a target number using a given set of integers.
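To make the task concrete, the following is a minimal sketch of how a completion could be checked against a Countdown problem. It is not taken from starter.py. The function names, the `<answer>...</answer>` tag format, the rule that every given integer must be used exactly once, and the binary 0/1 reward are all assumptions made for illustration; the assignment PDF may specify something different.

```python
import ast
import operator
import re
from collections import Counter

# Arithmetic operators the sketch accepts (assumption: + - * / only).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def extract_answer(completion):
    """Return the text inside the last <answer>...</answer> pair, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def uses_numbers_exactly_once(equation, numbers):
    """Check that the integers in the equation match the given multiset."""
    found = [int(tok) for tok in re.findall(r"\d+", equation)]
    return Counter(found) == Counter(numbers)


def safe_eval(equation):
    """Evaluate plain arithmetic by walking the AST; return None on failure."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        raise ValueError("unsupported expression")

    try:
        return walk(ast.parse(equation.strip(), mode="eval"))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None


def countdown_reward(completion, numbers, target):
    """Hypothetical binary reward: 1.0 for a valid equation that hits the target."""
    equation = extract_answer(completion)
    if equation is None or not uses_numbers_exactly_once(equation, numbers):
        return 0.0
    value = safe_eval(equation)
    if value is None:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```

Walking the AST instead of calling `eval` keeps model output from executing arbitrary Python, which matters when the strings being scored are generated text.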

Setup

pip install --upgrade uv
uv venv
source .venv/bin/activate
uv pip install vllm==0.7.2 triton==3.1.0 datasets transformers==4.51.3 tensorboard torch gpustat python-dotenv
uv pip install flash-attn==2.7.4.post1 --no-build-isolation

Assignment Structure

Instruction: gstar_assignment2.pdf

Main File: starter.py

Problem 1

Part 1: GRPO Pipeline Implementation

  • Task 1: Implement reward helper functions (_extract_answer, _validate_numbers, _evaluate_equation).
  • Task 2: Implement the main reward_fn.
  • Task 3: Implement compute_group_normalized_advantages.
  • Task 5: Implement masked mean for GRPO and masked_mean_drgrpo for Dr. GRPO.
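The idea behind the advantage and masked-mean tasks above can be sketched with plain Python lists, which keeps the arithmetic easy to follow; a real training loop would operate on torch tensors. This is an illustration under assumptions, not the assignment's reference solution. The signatures, the use of the population standard deviation, the `eps` value, and the helper name `masked_sum_over_constant` are hypothetical. The Dr. GRPO variant is shown as commonly described: no division by the group standard deviation, and a fixed constant in place of the per-sequence token count.

```python
from statistics import fmean, pstdev


def group_normalized_advantages(rewards, group_size, eps=1e-6, normalize_by_std=True):
    """Subtract each group's mean reward; optionally divide by its std.

    `rewards` is a flat list in which consecutive blocks of `group_size`
    entries belong to the same prompt (assumed layout).
    """
    if len(rewards) % group_size != 0:
        raise ValueError("len(rewards) must be a multiple of group_size")
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = fmean(group)
        # normalize_by_std=False leaves advantages as centered rewards.
        scale = pstdev(group) + eps if normalize_by_std else 1.0
        advantages.extend((r - mean) / scale for r in group)
    return advantages


def masked_mean(values, mask):
    """Average only the entries whose mask is 1 (per-sequence token mean)."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


def masked_sum_over_constant(values, mask, constant):
    """Sum the masked entries and divide by a fixed constant, not the token count."""
    return sum(v for v, m in zip(values, mask) if m) / constant


# Two prompts with four sampled completions each.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
advantages = group_normalized_advantages(rewards, group_size=4)
```

In the second group every completion earns the same reward, so all four advantages are zero and that prompt contributes no learning signal. Dividing by a fixed constant means the normalizer no longer depends on how many tokens a response contains, which is the length-related difference the second helper is meant to show.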

Part 2: Main Experiments and Report (please refer to Section 1.6 of gstar_assignment2.pdf for the main report and experiments)

Problem 2: (Optional) Open-ended Investigation. Please refer to Section 2 of gstar_assignment2.pdf.
