
A Generative Foundation Reward Model (GRAM)

This repository contains the code and released models for our paper GRAM: A Generative Foundation Reward Model for Reward Generalization 📝. We propose a more effective approach to reward model training by combining both labeled and unlabeled data. Our method introduces a generative reward model that first learns from a large corpus of unlabeled data and is then fine-tuned with supervised data. Please find all the released model checkpoints at this link 🤗. To develop a reward model tailored to a specific task or domain, we recommend fine-tuning the released GRAM model using task-specific preference data. This strategy mitigates the dependence on large-scale human annotations while maintaining strong performance on the target task.

🆕 Changelog

  • [2025/6/17] We trained GRAM using Qwen3 and achieved a score of 71.4 on JudgeBench with Qwen3-14B, significantly outperforming two strong open-source baselines: Llama-3.1-Nemotron-70B-Reward and Skywork-Reward-Gemma-2-27B-v0.2.
  • [2025/6/1] We performed additional data cleaning, such as the removal of overly long or corrupted samples, to help GRAM achieve better performance. The processed dataset is available at this link.
  • [2025/5/1] Our paper has been accepted by ICML 2025!

Released Models

Check out our GRAM series below. The models were first pre-trained on the dataset available here, and then fine-tuned on the dataset available here.

  • We evaluate our reward model on JudgeBench, a benchmark for evaluating LLM-as-a-Judge (i.e., generative reward model) applications, and present the results as follows:
| Model | Param. | Chat | Code | Math | Safety | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GRAM-Qwen3-14B-RewardBench | 14B | 63.0 | 64.3 | 89.3 | 69.1 | 71.4 |
| GRAM-LLaMA3.2-3B-RewardBench | 3B | 59.7 | 64.3 | 84.0 | 71.4 | 69.9 |
| GRAM-Qwen3-8B-RewardBench | 8B | 62.3 | 64.3 | 80.4 | 64.3 | 67.8 |
| nvidia/Llama-3.1-Nemotron-70B-Reward | 70B | 62.3 | 72.5 | 76.8 | 57.1 | 67.2 |
| GRAM-Qwen3-4B-RewardBench | 4B | 59.7 | 59.2 | 80.4 | 64.3 | 65.9 |
| GRAM-Qwen3-1.7B-RewardBench | 1.7B | 60.4 | 65.3 | 78.6 | 57.1 | 65.4 |
| Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | 27B | 59.7 | 66.3 | 83.9 | 50.0 | 65.0 |
| Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | 8B | 59.1 | 64.3 | 76.8 | 50.0 | 62.6 |
| internlm/internlm2-20b-reward | 20B | 62.3 | 69.4 | 66.1 | 50.0 | 62.0 |
  • We also evaluate our reward model on the recently introduced RM-Bench, a challenging benchmark for reward models, and present the results as follows:
| Model | Param. | Chat | Code | Math | Safety | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| nvidia/Llama-3.1-Nemotron-70B-Reward | 70B | 70.7 | 57.4 | 64.3 | 90.3 | 70.7 |
| Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | 27B | 71.8 | 56.6 | 59.2 | 94.3 | 70.5 |
| GRAM-Qwen3-14B-RewardBench | 14B | 67.4 | 55.2 | 62.8 | 94.3 | 69.9 |
| nvidia/Nemotron-340B-Reward | 340B | 71.2 | 59.4 | 59.8 | 87.5 | 69.5 |
| GRAM-Qwen3-8B-RewardBench | 8B | 63.5 | 53.9 | 62.9 | 92.8 | 68.3 |
| internlm/internlm2-20b-reward | 20B | 63.1 | 56.7 | 66.8 | 86.5 | 68.3 |
| GRAM-Qwen3-4B-RewardBench | 4B | 61.1 | 54.7 | 61.6 | 92.9 | 67.6 |
| GRAM-Qwen3-1.7B-RewardBench | 1.7B | 59.6 | 53.6 | 59.6 | 91.8 | 66.2 |
| GRAM-LLaMA3.2-3B-RewardBench | 3B | 56.8 | 50.0 | 56.3 | 88.7 | 63.0 |

Installation Guide

The code of this repo is modified from hiyouga/LLaMA-Factory. If you encounter installation issues (e.g., related to PyTorch or CUDA), we recommend first checking the LLaMA-Factory issues for potential solutions. If the problem persists, please feel free to submit an issue in this repository.

git clone --depth 1 https://gitee.com/wangclnlp/gram
cd gram
pip install -e ".[torch,metrics]" --no-build-isolation

Preparing Models and Datasets

Datasets

Pre-Training

Each item in the dataset for GRAM pre-training should include the following keys:

  • instruction: the user prompt, in the following template:
    [User Question]
    {your prompt here}
    
  • input: additional input for the above prompt; leave it empty if there is none.
  • output: the two responses, in the following template:
    [The Start of Assistant A's Answer]
    {answer of assistant A}
    [The End of Assistant A's Answer]
    
    [The Start of Assistant B's Answer]
    {answer of assistant B}
    [The End of Assistant B's Answer]
    

An example in JSON format:

[
  {
    "instruction": "[User Question]\nCan dogs get covid?\n\n",
    "input": "",
    "output": "[The Start of Assistant A's Answer]\nYes, indeed. ... [The End of Assistant A's Answer]\n\n[The Start of Assistant B's Answer]\nMany of the symptoms are similar, including fever, coughing, loss of smell, etc. ...\n[The End of Assistant B's Answer]"
  },
  ...
]
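
If you are converting raw preference data into this format, a small helper along the following lines may be useful. This is only a sketch: the function name and output path are hypothetical, and the templates are exactly the ones shown above.

import json

def build_pretraining_item(question, answer_a, answer_b):
    """Format a (question, answer A, answer B) triple into a GRAM pre-training item."""
    return {
        "instruction": f"[User Question]\n{question}\n\n",
        "input": "",
        "output": (
            f"[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n\n"
            f"[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]"
        ),
    }

# Example: write a small pre-training file (the path is a placeholder).
items = [build_pretraining_item(
    "Can dogs get covid?",
    "Yes, indeed. ...",
    "Many of the symptoms are similar, including fever, coughing, loss of smell, etc. ...",
)]
with open("data/gram_pretraining_example.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)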

Fine-Tuning

Each item in the dataset for GRAM fine-tuning should include the following keys:

  • instruction: the user prompt together with the two corresponding responses, in the following template:
    Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better.
    Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible.
    Please directly output your final verdict by strictly following this format: "A" if assistant A is better, "B" if assistant B is better.
    
    [User Question]
    {your prompt here}
    
    [The Start of Assistant A's Answer]
    {answer of assistant A}
    [The End of Assistant A's Answer]
    
    [The Start of Assistant B's Answer]
    {answer of assistant B}
    [The End of Assistant B's Answer]
    
    #Preferred:
    
  • input: leave it empty.
  • output: the correct option, "A" or "B".

An example in JSON format:

[
  {
    "instruction": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. ... [User Question]... [The Start of Assistant A's Answer] ... [The Start of Assistant B's Answer] ...",
    "input": "",
    "output": "B"
  },
  ...
]
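
Similarly, a labeled preference pair can be wrapped in the judge template programmatically. The helper below is hypothetical and shortens the judge instruction with an ellipsis; in practice, use the full instruction text shown above.

# Shortened for brevity; paste the full judge instruction from the template above.
JUDGE_INSTRUCTION = (
    "Please act as an impartial judge and evaluate the quality of the responses "
    "provided by two AI assistants to the user question displayed below. ...\n"
    'Please directly output your final verdict by strictly following this format: '
    '"A" if assistant A is better, "B" if assistant B is better.\n\n'
)

def build_finetuning_item(question, answer_a, answer_b, preferred):
    """Wrap a labeled preference pair in the judge template; `preferred` is "A" or "B"."""
    assert preferred in ("A", "B")
    prompt = (
        JUDGE_INSTRUCTION
        + f"[User Question]\n{question}\n\n"
        + f"[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n\n"
        + f"[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]\n\n"
        + "#Preferred:\n"
    )
    return {"instruction": prompt, "input": "", "output": preferred}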

Note

After pre-processing the datasets as above, do not forget to register them in data/dataset_info.json. Example:

{
  ...,
  "dataset_name": {
    "file_name": "/path/to/your/dataset"
  },
  ...
}
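
If you prefer to register datasets programmatically, a snippet like the following appends the entry; the dataset name and file path here are placeholders.

import json

DATASET_INFO = "data/dataset_info.json"
dataset_name = "my_gram_preference_data"   # referenced later by the training YAML
file_path = "/path/to/your/dataset"

with open(DATASET_INFO, encoding="utf-8") as f:
    info = json.load(f)
info[dataset_name] = {"file_name": file_path}
with open(DATASET_INFO, "w", encoding="utf-8") as f:
    json.dump(info, f, ensure_ascii=False, indent=2)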

Training Scripts

Pre-Training

llamafactory-cli train examples/train_full/qwen3_pre_training_rm.yaml

Fine-Tuning

llamafactory-cli train examples/train_full/qwen3_fine_tuning_rm.yaml

At this stage, the released models can be directly fine-tuned on your own labeled, task- or domain-specific preference data to obtain a reward model that is well-adapted to the target application.

Evaluation

The evaluation scripts are in the subdirectory evaluation/:

cd evaluation/
ckpt_path=/path/to/your/model
  • Evaluation with RewardBench

    python gram_eval.py -i allenai_reward_bench/filtered.json -m $ckpt_path -o $ckpt_path/reward-bench.res
    echo -e "RewardBench Evaluation Summary:\n"
    python get_reward_bench_score.py $ckpt_path/reward-bench.res
  • Evaluation with JudgeBench

    python gram_eval.py -i scalerlab_judgebench/gpt.json -m $ckpt_path -o $ckpt_path/judge-bench.res
    echo -e "JudgeBench Evaluation Summary:\n"
    python get_judgebench_score.py $ckpt_path/judge-bench.res
  • Evaluation with RM-Bench

    python gram_eval.py -i thu_keg_rm_bench/total_dataset.json -m $ckpt_path -o $ckpt_path/rm-bench.res
    echo -e "RM-Bench Evaluation Summary:\n"
    python thu_keg_rm_bench/compute_accuracy.py $ckpt_path/rm-bench.res

Using GRAM in RLHF

Computing Rewards for a Pair of Samples

def compute_pair_rewards(response_a, response_b):
    # compute rewards for response_a and response_b as in the demo `evaluation/gram_demo.py`
    ...
    return reward_response_a, reward_response_b
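
The actual scoring code is provided in evaluation/gram_demo.py. Purely as an illustration of how a generative judge yields pairwise scores, the sketch below assumes a transformers-compatible GRAM checkpoint and compares the model's next-token probabilities for the "A" and "B" verdicts. make_pair_reward_fn is a hypothetical helper that binds the user question, so the returned function matches the two-argument interface used in the rest of this section.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/path/to/your/model"   # e.g., a released GRAM checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

# Shortened here; use the full judge instruction from the Fine-Tuning section above.
JUDGE_TEMPLATE = (
    "Please act as an impartial judge ...\n\n"
    "[User Question]\n{question}\n\n"
    "[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n\n"
    "[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]\n\n"
    "#Preferred:\n"
)

def make_pair_reward_fn(question):
    """Return a two-argument `compute_pair_rewards` bound to a given user question."""
    @torch.no_grad()
    def compute_pair_rewards(response_a, response_b):
        prompt = JUDGE_TEMPLATE.format(question=question, answer_a=response_a, answer_b=response_b)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        next_token_logits = model(**inputs).logits[0, -1]
        id_a = tokenizer.encode("A", add_special_tokens=False)[0]
        id_b = tokenizer.encode("B", add_special_tokens=False)[0]
        probs = torch.softmax(next_token_logits[[id_a, id_b]], dim=-1)
        return probs[0].item(), probs[1].item()
    return compute_pair_rewards

# Usage: bind the scorer to the current query, then compare two responses.
compute_pair_rewards = make_pair_reward_fn("Can dogs get covid?")
reward_a, reward_b = compute_pair_rewards("Yes, indeed. ...", "Many of the symptoms are similar ...")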

PPO

When applying GRAM to PPO training, we first generate a reference response and then use GRAM to compute a reward score that quantifies how much better the sampled response is than the reference. This score serves as the reward signal during PPO training. Using the difference between the sampled and reference responses as the reward has been shown to be effective in prior work. Additionally, inspired by ReMax, we can use greedy search to construct the reference response. The detailed procedure is sketched below:

def ppo():
    # Initialize the dataset and models
    dataset = ...
    ref_model, policy_model, reward_model, value_model = ...
    for query in dataset:
        # Sample from the policy model: generate a greedy-search reference first,
        # then sample as usual with top-p/top-k
        response_greedy_search = greedy_search(policy_model, query)
        response_normal = generate(policy_model, query, top_p=..., top_k=...)
        # GRAM scores how much better the sampled response is than the greedy reference
        _, reward_response_normal = compute_pair_rewards(response_greedy_search, response_normal)
        # Compute logits from ref_model, values from value_model, and update with the PPO loss
        ...

List-wise Response Ranking

A common use case for list-wise response ranking is best-of-n sampling, where the goal is to select the single best response from a list. This can be accomplished with GRAM using a linear search, as illustrated below. To support parallel computation and improve efficiency, we also incorporate optimization strategies such as divide-and-conquer (a sketch of this variant follows the example).

def list_wise_response_ranking():
    # Init dataset and model
    ...
    # Generate from model
    responses = [response0, response1, response2, ...]
    # Compute pairwise rewards and keep the response with the highest score
    best_response = response0
    for response in responses[1:]:
        score_a, score_b = compute_pair_rewards(best_response, response)
        if score_a < score_b:
           best_response = response

    return best_response
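
The divide-and-conquer strategy mentioned above can be sketched as a knockout tournament: the comparisons within a round are independent of one another, so they can be batched or run in parallel. The sketch below builds on the two-argument compute_pair_rewards placeholder and is only an illustrative variant, not necessarily the exact strategy used in this repository.

def tournament_response_ranking(responses):
    # Divide and conquer: pair up candidates, keep the winner of each comparison,
    # and repeat until a single response remains.
    candidates = list(responses)
    while len(candidates) > 1:
        next_round = []
        # The comparisons in this loop are independent and can run in parallel.
        for i in range(0, len(candidates) - 1, 2):
            score_a, score_b = compute_pair_rewards(candidates[i], candidates[i + 1])
            next_round.append(candidates[i] if score_a >= score_b else candidates[i + 1])
        if len(candidates) % 2 == 1:
            # An unpaired candidate advances to the next round automatically.
            next_round.append(candidates[-1])
        candidates = next_round
    return candidates[0]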

Citation

@misc{wang2025gram,
      title={GRAM: A Generative Foundation Reward Model for Reward Generalization}, 
      author={Chenglong Wang and Yang Gan and Yifu Huo and Yongyu Mu and Qiaozhi He and Murun Yang and Bei Li and Tong Xiao and Chunliang Zhang and Tongran Liu and Jingbo Zhu},
      year={2025},
      eprint={2506.14175},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.14175}, 
}

Acknowledgement

Our code builds on the excellent codebase provided by hiyouga/LLaMA-Factory 🌹🌹🌹.

We would like to thank Hang Zhou for his help in open-sourcing the GRAM model series.

We thank the contributions of the following papers:

[1] Lambert, Nathan, et al. "RewardBench: Evaluating Reward Models for Language Modeling." arXiv preprint arXiv:2403.13787 (2024).
[2] Liu, Yantao, et al. "RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style." arXiv preprint arXiv:2410.16184 (2024).
[3] Tan, Sijun, et al. "JudgeBench: A Benchmark for Evaluating LLM-Based Judges." arXiv preprint arXiv:2410.12784 (2024).
[4] Grattafiori, Aaron, et al. "The Llama 3 Herd of Models." arXiv preprint arXiv:2407.21783 (2024).
[5] Yang, An, et al. "Qwen3 Technical Report." arXiv preprint arXiv:2505.09388 (2025).
