This repository contains the code and released models for our paper GRAM: A Generative Foundation Reward Model for Reward Generalization 📝. We propose a more effective approach to reward model training by combining both labeled and unlabeled data. Our method introduces a generative reward model that first learns from a large corpus of unlabeled data and is then fine-tuned with supervised data. Please find all the released model checkpoints at this link 🤗. To develop a reward model tailored to a specific task or domain, we recommend fine-tuning the released GRAM model using task-specific preference data. This strategy mitigates the dependence on large-scale human annotations while maintaining strong performance on the target task.
- [2025/6/17] We trained GRAM using Qwen3 and achieved a score of 71.4 on JudgeBench with Qwen3-14B, significantly outperforming two strong open-source baselines: Llama-3.1-Nemotron-70B-Reward and Skywork-Reward-Gemma-2-27B-v0.2.
- [2025/6/1] We performed additional data cleaning, such as the removal of overly long or corrupted samples, to help GRAM achieve better performance. The processed dataset is available at this link.
- [2025/5/1] Our paper has been accepted by ICML 2025!
Check out our GRAM series below. The models were first pre-trained on the dataset available here, and then fine-tuned on the dataset available here.
- We evaluate our reward models on JudgeBench, a benchmark for evaluating LLM-as-a-judge (i.e., generative reward model) applications. The results are as follows:
Model | Param. | Chat | Code | Math | Safety | Avg. |
---|---|---|---|---|---|---|
GRAM-Qwen3-14B-RewardBench | 14B | 63.0 | 64.3 | 89.3 | 69.1 | 71.4 |
GRAM-LLaMA3.2-3B-RewardBench | 3B | 59.7 | 64.3 | 84.0 | 71.4 | 69.9 |
GRAM-Qwen3-8B-RewardBench | 8B | 62.3 | 64.3 | 80.4 | 64.3 | 67.8 |
nvidia/Llama-3.1-Nemotron-70B-Reward | 70B | 62.3 | 72.5 | 76.8 | 57.1 | 67.2 |
GRAM-Qwen3-4B-RewardBench | 4B | 59.7 | 59.2 | 80.4 | 64.3 | 65.9 |
GRAM-Qwen3-1.7B-RewardBench | 1.7B | 60.4 | 65.3 | 78.6 | 57.1 | 65.4 |
Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | 27B | 59.7 | 66.3 | 83.9 | 50.0 | 65.0 |
Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | 8B | 59.1 | 64.3 | 76.8 | 50.0 | 62.6 |
internlm/internlm2-20b-reward | 20B | 62.3 | 69.4 | 66.1 | 50.0 | 62.0 |
- We also evaluate our reward model on the recently introduced RM-Bench, a challenging benchmark for reward models, and present the results as follows:
Model | Param. | Chat | Code | Math | Safety | Avg. |
---|---|---|---|---|---|---|
nvidia/Llama-3.1-Nemotron-70B-Reward | 70B | 70.7 | 57.4 | 64.3 | 90.3 | 70.7 |
Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | 27B | 71.8 | 56.6 | 59.2 | 94.3 | 70.5 |
GRAM-Qwen3-14B-RewardBench | 14B | 67.4 | 55.2 | 62.8 | 94.3 | 69.9 |
nvidia/Nemotron-340B-Reward | 340B | 71.2 | 59.4 | 59.8 | 87.5 | 69.5 |
GRAM-Qwen3-8B-RewardBench | 8B | 63.5 | 53.9 | 62.9 | 92.8 | 68.3 |
internlm/internlm2-20b-reward | 20B | 63.1 | 56.7 | 66.8 | 86.5 | 68.3 |
GRAM-Qwen3-4B-RewardBench | 4B | 61.1 | 54.7 | 61.6 | 92.9 | 67.6 |
GRAM-Qwen3-1.7B-RewardBench | 1.7B | 59.6 | 53.6 | 59.6 | 91.8 | 66.2 |
GRAM-LLaMA3.2-3B-RewardBench | 3B | 56.8 | 50.0 | 56.3 | 88.7 | 63.0 |
The code of this repo is modified from hiyouga/LLaMA-Factory. If you encounter installation issues (e.g., related to PyTorch or CUDA), we recommend first checking the LLaMA-Factory issues for potential solutions. If the problem persists, please feel free to submit an issue in this repository.
git clone --depth 1 https://gitee.com/wangclnlp/gram
cd gram
pip install -e ".[torch,metrics]" --no-build-isolation
Each item of the dataset for GRAM pre-training should include the following keys:

- `instruction`: any prompt in the following template: `[User Question] {your prompt here}`
- `input`: the input for the above prompt; leave it empty if there is none.
- `output`: two responses in the following template: `[The Start of Assistant A's Answer] {answer of assistant A} [The End of Assistant A's Answer] [The Start of Assistant B's Answer] {answer of assistant B} [The End of Assistant B's Answer]`
An example in JSON format:
[
{
"instruction": "[User Question]\nCan dogs get covid?\n\n",
"input": "",
"output": "[The Start of Assistant A's Answer]\nYes, indeed. ... [The End of Assistant A's Answer]\n\n[The Start of Assistant B's Answer]\nMany of the symptoms are similar, including fever, coughing, loss of smell, etc. ...\n[The End of Assistant B's Answer]"
},
...
]
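For convenience, such items can also be assembled programmatically. The helper below is a minimal sketch that fills the templates shown above; the function name and signature are ours and are not part of the repository.

```python
def build_pretraining_item(prompt: str, answer_a: str, answer_b: str) -> dict:
    """Assemble one GRAM pre-training example using the templates above."""
    return {
        "instruction": f"[User Question]\n{prompt}\n\n",
        "input": "",
        "output": (
            f"[The Start of Assistant A's Answer]\n{answer_a}\n"
            f"[The End of Assistant A's Answer]\n\n"
            f"[The Start of Assistant B's Answer]\n{answer_b}\n"
            f"[The End of Assistant B's Answer]"
        ),
    }
```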
Each item of the dataset for GRAM fine-tuning should include the following keys:

- `instruction`: a prompt together with its two corresponding responses in the following template:

  Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Please directly output your final verdict by strictly following this format: "A" if assistant A is better, "B" if assistant B is better.

  [User Question]
  {your prompt here}

  [The Start of Assistant A's Answer]
  {answer of assistant A}
  [The End of Assistant A's Answer]

  [The Start of Assistant B's Answer]
  {answer of assistant B}
  [The End of Assistant B's Answer]

  #Preferred:

- `input`: leave it empty.
- `output`: the correct option, "A" or "B".
An example in JSON format:
[
{
"instruction": "Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. ... [User Question]... [The Start of Assistant A's Answer] ... [The Start of Assistant B's Answer] ...",
"input": "",
"output": "B"
},
...
]
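A fine-tuning item can likewise be assembled with a small helper. The sketch below is illustrative only: `JUDGE_TEMPLATE` is assumed to hold the full judge prompt shown above (ending with `#Preferred:`) with `{question}`, `{answer_a}`, and `{answer_b}` placeholders, and the function name is ours.

```python
# Assumed to hold the full judge prompt shown above (ending with "#Preferred:"),
# written with {question}, {answer_a}, and {answer_b} placeholders.
JUDGE_TEMPLATE = "Please act as an impartial judge ... #Preferred:"

def build_finetuning_item(question: str, answer_a: str, answer_b: str, preferred: str) -> dict:
    """Assemble one GRAM fine-tuning example; `preferred` must be "A" or "B"."""
    assert preferred in ("A", "B")
    return {
        "instruction": JUDGE_TEMPLATE.format(
            question=question, answer_a=answer_a, answer_b=answer_b
        ),
        "input": "",
        "output": preferred,
    }
```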
After pre-processing the datasets as above, DO NOT forget to register the dataset in `data/dataset_info.json`. Example:

{
  ...,
  "dataset_name": {
    "file_name": "/path/to/your/dataset"
  },
  ...
}
Run GRAM pre-training and then fine-tuning with the provided example configs:

llamafactory-cli train examples/train_full/qwen3_pre_training_rm.yaml
llamafactory-cli train examples/train_full/qwen3_fine_tuning_rm.yaml
At this stage, the released models can be directly fine-tuned on your own labeled, task- or domain-specific preference data to obtain a reward model that is well-adapted to the target application.
The evaluation scripts are in the `evaluation/` subdirectory:
cd evaluation/
ckpt_path=/path/to/your/model
- Evaluation with RewardBench:

  python gram_eval.py -i allenai_reward_bench/filtered.json -m $ckpt_path -o $ckpt_path/reward-bench.res
  echo -e "RewardBench Evaluation Summary:\n"
  python get_reward_bench_score.py $ckpt_path/reward-bench.res

- Evaluation with JudgeBench:

  python gram_eval.py -i scalerlab_judgebench/gpt.json -m $ckpt_path -o $ckpt_path/judge-bench.res
  echo -e "JudgeBench Evaluation Summary:\n"
  python get_judgebench_score.py $ckpt_path/judge-bench.res

- Evaluation with RM-Bench:

  python gram_eval.py -i thu_keg_rm_bench/total_dataset.json -m $ckpt_path -o $ckpt_path/reward-bench.res
  echo -e "RM-Bench Evaluation Summary:\n"
  python thu_keg_rm_bench/compute_accuracy.py $ckpt_path/reward-bench.res
def compute_pair_rewards(response_a, response_b):
    # Compute rewards for response_a and response_b as in the demo `evaluation/gram_demo.py`
    ...
    return reward_response_a, reward_response_b
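The working implementation of this helper is in `evaluation/gram_demo.py`. For orientation only, here is a rough, hypothetical sketch of what such a pairwise scorer could look like, assuming GRAM is loaded as a Hugging Face causal LM, prompted with the fine-tuning judge template shown earlier, and that the probabilities of the "A"/"B" verdict tokens are used as soft scores. The checkpoint path, `JUDGE_TEMPLATE`, `question`, and the function's internals are placeholders, not the repository's API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_path = "/path/to/your/model"  # a released GRAM checkpoint

tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
model = AutoModelForCausalLM.from_pretrained(
    ckpt_path, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder: the full judge prompt from the fine-tuning data section above
# (ending with "#Preferred:"), with {question}, {answer_a}, {answer_b} placeholders.
JUDGE_TEMPLATE = "..."
question = "Can dogs get covid?"  # the user prompt that produced the two responses

def compute_pair_rewards(response_a, response_b):
    """Score a response pair with GRAM (rough sketch; see evaluation/gram_demo.py)."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, answer_a=response_a, answer_b=response_b
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Soft scores from the "A"/"B" verdict tokens; the exact token ids may need
    # adjustment (e.g., a leading space) depending on the tokenizer.
    id_a = tokenizer.convert_tokens_to_ids("A")
    id_b = tokenizer.convert_tokens_to_ids("B")
    probs = torch.softmax(
        torch.stack([next_token_logits[id_a], next_token_logits[id_b]]), dim=-1
    )
    return probs[0].item(), probs[1].item()
```

In practice, one would typically also score the pair with the answer order swapped and average the two verdicts to reduce position bias; refer to the demo for the exact procedure.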
When applying GRAM to PPO training, we first generate a reference response and then use GRAM to compute a reward score that quantifies how much better the sampled response is than the reference. This score serves as the reward signal during PPO training. The basic idea of using the difference between the sampled and reference responses as the reward has been shown to be effective in prior work. Additionally, inspired by ReMax, we can construct the reference response with greedy search. The detailed procedure is sketched below:
def ppo():
    # Init dataset and model
    ...
    ref_model, policy_model, reward_model, value_model = ...

    # Sample from policy model: generate with greedy search first, then sample as normal with top-p/top-k
    response_greedy_search = greedy_search(policy_model, query)
    response_normal = generate(policy_model, query, top_p=..., top_k=...)
    _, reward_response_normal = compute_pair_rewards(response_greedy_search, response_normal)

    # Compute logits from ref_model, values from value_model and update with PPO loss
    ...
A common use case for list-wise response ranking is best-of-n sampling, where the goal is to select the single best response from a list. This can be done with GRAM using a simple linear search, as illustrated below. To support parallel computation and improve efficiency, we also incorporate optimization strategies such as divide-and-conquer; a sketch of this variant is given after the code below.
def list_wise_response_ranking():
    # Init dataset and model
    ...
    # Generate from model
    responses = [response0, response1, response2, ...]
    # Compute rewards and choose the one with the highest score
    best_response = response0
    for response in responses[1:]:
        score_a, score_b = compute_pair_rewards(best_response, response)
        if score_a < score_b:
            best_response = response
    return best_response
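The linear search above is sequential. As a hedged illustration of the divide-and-conquer idea mentioned above (a sketch of ours, not the repository's implementation), the candidates can instead be reduced in knockout rounds; the pairwise comparisons within each round are independent of one another, so they can be batched or run in parallel.

```python
def tournament_response_ranking(responses):
    """Select the best response via pairwise knockout rounds (divide-and-conquer).

    Comparisons within a round are independent, so they can be batched or run
    in parallel; the number of rounds grows only logarithmically with the
    number of candidates.
    """
    candidates = list(responses)
    while len(candidates) > 1:
        next_round = []
        # Pair up the current candidates; each comparison is independent.
        for i in range(0, len(candidates) - 1, 2):
            score_a, score_b = compute_pair_rewards(candidates[i], candidates[i + 1])
            next_round.append(candidates[i] if score_a >= score_b else candidates[i + 1])
        # An unpaired candidate (odd count) advances automatically.
        if len(candidates) % 2 == 1:
            next_round.append(candidates[-1])
        candidates = next_round
    return candidates[0]
```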
@misc{wang2025gram,
title={GRAM: A Generative Foundation Reward Model for Reward Generalization},
author={Chenglong Wang and Yang Gan and Yifu Huo and Yongyu Mu and Qiaozhi He and Murun Yang and Bei Li and Tong Xiao and Chunliang Zhang and Tongran Liu and Jingbo Zhu},
year={2025},
eprint={2506.14175},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.14175},
}
Our code builds on the excellent codebase of hiyouga/LLaMA-Factory 🌹🌹🌹.
We would like to thank Hang Zhou for his help in open-sourcing the GRAM model series.
We also thank the authors of the following papers for their contributions:

[1] Lambert, Nathan, et al. "RewardBench: Evaluating reward models for language modeling." arXiv preprint arXiv:2403.13787 (2024).

[2] Liu, Yantao, et al. "RM-Bench: Benchmarking reward models of language models with subtlety and style." arXiv preprint arXiv:2410.16184 (2024).

[3] Tan, Sijun, et al. "JudgeBench: A benchmark for evaluating LLM-based judges." arXiv preprint arXiv:2410.12784 (2024).

[4] Grattafiori, Aaron, et al. "The Llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).

[5] Yang, An, et al. "Qwen3 technical report." arXiv preprint arXiv:2505.09388 (2025).