📖 Overview · 🧩 Method · 📊 Main Results · 🚀 Getting Started · 🤝 Acknowledgements · 📧 Contact · 🔗 Citation
Context-level reweighting has emerged as a central algorithmic lever in Reinforcement Learning with Verifiable Rewards (RLVR) for improving the reasoning capability of large language models, yet the principle determining what constitutes an optimal weighting remains poorly understood. CurveRL addresses this gap from two angles:
- A unified optimality framework. We cast prompt reweighting as context distribution control and formulate the optimal weight as a functional derivative of a utility functional defined in the pass-rate function space. This subsumes existing pointwise schemes — REINFORCE, GRPO, MaxRL — as special cases.
- A distribution-aware instantiation. Pointwise weights are determined solely by the absolute value of the pass rate $\hat{p}$, and so suffer from weight collapse: in the early stage most prompts have $\hat{p} \approx 0$ and in the late stage most prompts have $\hat{p} \approx 1$, yielding nearly indistinguishable weights. CurveRL replaces this with a quantile coordinate transform, in which the weight depends not on the absolute value of $\hat{p}$ but on its rank and density in the evolving pass-rate distribution.
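The weight collapse described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the Beta parameters standing in for "early" and "late" training are hypothetical, and the batch size and rollout count are borrowed from the experimental setup (256 prompts, 8 rollouts each).

```python
import numpy as np


def simulate_pass_rates(n_prompts, n_rollouts, a, b, seed=0):
    """Draw true pass rates from Beta(a, b), then estimate each from binary rollouts."""
    rng = np.random.default_rng(seed)
    true_p = rng.beta(a, b, size=n_prompts)
    successes = rng.binomial(n_rollouts, true_p)
    return successes / n_rollouts


def modal_share(values):
    """Fraction of entries that equal the most common value."""
    _, counts = np.unique(values, return_counts=True)
    return counts.max() / values.size


# Hypothetical stand-ins: mass near 0 (early stage) and near 1 (late stage).
early = simulate_pass_rates(256, 8, a=0.3, b=5.0)
late = simulate_pass_rates(256, 8, a=5.0, b=0.3)

print(f"early stage: share of prompts at the modal p_hat = {modal_share(early):.2f}")
print(f"late stage:  share of prompts at the modal p_hat = {modal_share(late):.2f}")
```

Any rule of the form w(p̂) assigns one and the same weight to every prompt in that modal group, which is the collapse a pointwise scheme cannot avoid.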
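CurveRL applies a quantile coordinate transform through a reference CDF $F_{\mathrm{ref}}$ with density $f_{\mathrm{ref}}$, giving the utility

$$U(p) = \mathbb{E}_{x}\left[\psi\big(F_{\mathrm{ref}}(p(x))\big)\right]$$

with the log distortion $\psi(u) = \log u$, corresponding to a risk-seeking preference that emphasizes hard prompts. The induced gradient yields the CurveRL weight

$$w(\hat{p}) = \psi'\big(F_{\mathrm{ref}}(\hat{p})\big)\, f_{\mathrm{ref}}(\hat{p}) = \frac{f_{\mathrm{ref}}(\hat{p})}{F_{\mathrm{ref}}(\hat{p})}$$

which has the form of a reverse hazard rate: $1 / F_{\mathrm{ref}}(\hat{p})$ emphasizes the lower-quantile prompts, while $f_{\mathrm{ref}}(\hat{p})$ makes the allocation data-driven by tracking pass-rate regions that are actually populated under the current policy.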
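The reverse hazard rate form, f_ref(p̂) / F_ref(p̂), can be sketched on the discrete support that N rollouts produce. This is an illustration only: the text does not say how F_ref and f_ref are estimated, so the smoothed histogram below, the function name `reverse_hazard_weights`, and the `alpha` smoothing parameter are all assumptions rather than the repository's implementation.

```python
import numpy as np


def reverse_hazard_weights(pool_successes, n_rollouts, alpha=1.0):
    """Return w[k] = f_ref(k/N) / F_ref(k/N) for k = 0..N.

    pool_successes: integer success counts in 0..N for the prompts that define
    the reference distribution. alpha is additive smoothing (an assumption)
    that keeps every support point at nonzero mass.
    """
    counts = np.bincount(np.asarray(pool_successes), minlength=n_rollouts + 1)
    pmf = (counts + alpha) / (counts.sum() + alpha * (n_rollouts + 1))
    cdf = np.cumsum(pmf)
    return pmf / cdf


# Hypothetical reference pool in which most prompts are still unsolved.
pool = np.array([0] * 180 + [1] * 40 + [2] * 20 + [4] * 10 + [8] * 6)
weights = reverse_hazard_weights(pool, n_rollouts=8)

# Look up the weight for each prompt in a batch by its success count.
batch_successes = np.array([0, 1, 2, 8])
print(weights[batch_successes])
```

With a uniform pool and no smoothing, this discrete form reduces to w[k] = 1 / (k + 1), so the weight falls as the pass rate rises; a non-uniform pool shifts weight toward the pass-rate values that are actually populated.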
We train Qwen3-1.7B-Base and Qwen3-4B-Base on POLARIS-53K (≈53K math reasoning prompts) under the verl framework. All methods share the same training loop and differ only in the prompt-weighting rule. We use batch size |B| = 256, N = 8 rollouts per prompt, and t_0 = 10.
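To make "differ only in the prompt-weighting rule" concrete, the sketch below shows one way a per-prompt weight can enter the advantage computation. It assumes the weight multiplies the mean-centered binary reward, which the text does not spell out. The helper names are hypothetical, and only the REINFORCE (constant weight) and GRPO (inverse reward standard deviation) rules are shown, since those follow from their standard definitions.

```python
import numpy as np


def weighted_advantages(rewards, weight_fn):
    """rewards: (B, N) binary array. Returns (B, N) advantages w(p_hat) * (r - p_hat)."""
    rewards = np.asarray(rewards, dtype=float)
    p_hat = rewards.mean(axis=1, keepdims=True)  # per-prompt pass rate
    return weight_fn(p_hat) * (rewards - p_hat)


def reinforce_weight(p_hat):
    """Constant weight: plain mean-baseline REINFORCE."""
    return np.ones_like(p_hat)


def grpo_weight(p_hat, eps=1e-6):
    """GRPO divides by the per-prompt reward std, which is sqrt(p(1-p)) for binary rewards."""
    return 1.0 / (np.sqrt(p_hat * (1.0 - p_hat)) + eps)


# Random binary rewards with the batch shape used in the experiments.
rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=(256, 8))
adv = weighted_advantages(rewards, grpo_weight)
print(adv.shape)  # (256, 8)
```

Prompts whose rollouts all fail or all succeed get zero advantage under any finite weight, because every centered reward in that row is zero.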
CurveRL is compared against GRPO and MaxRL on five benchmarks in the main paper — AIME 2025, BeyondAIME, HMMT 02/25, HMMT 02/26, MATH-500 — and three more in the appendix (BRUMO 2025, HMMT 11/25, Minerva).
Figure 1 — pass@k scaling on five representative benchmarks. CurveRL outperforms GRPO and MaxRL across the full range of k on both model sizes, and exceeds the pretrained base model on most panels.
Key findings.
- Improved Pareto frontier of pass@1 and pass@k. CurveRL attains the highest pass@64 on every benchmark across both model sizes, improving the average pass@64 by +5.9% on Qwen3-1.7B-Base and +9.7% on Qwen3-4B-Base over MaxRL — the strongest baseline — without sacrificing average pass@1.
- Wider gap at larger k. Against MaxRL, CurveRL's advantage grows with both model scale and k, reaching roughly +7.3% on HMMT 02/26 at k = 1024 on Qwen3-1.7B-Base and +26.8% on HMMT 02/25 at k = 512 on Qwen3-4B-Base, evidence that distribution-aware reweighting effectively broadens the search over reasoning trajectories.
- No pass@k degradation. While GRPO and MaxRL exhibit varying degrees of pass@k degradation relative to the pretrained base model, CurveRL exceeds the base model in 9 out of 10 panels above.
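For readers reproducing the pass@k numbers, the widely used unbiased estimator computes pass@k from n samples per problem, c of which are correct, as 1 - C(n-c, k) / C(n, k). The text does not state which estimator the paper uses, so treat the sketch below as an assumption; the function name and the example counts are illustrative.

```python
import numpy as np


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    # C(n-c, k) / C(n, k) written as a product to avoid large binomial coefficients.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Hypothetical per-problem correct counts out of n = 64 samples each.
correct_counts = [0, 3, 64]
scores = [pass_at_k(64, c, 8) for c in correct_counts]
print(np.mean(scores))
```

The benchmark-level score is the mean of the per-problem estimates, and n must be at least as large as the largest k reported.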
The paper experiments use Python 3.10, CUDA 12.8, and 8× NVIDIA B200 GPUs. Other recent NVIDIA GPUs (e.g., Ampere or Hopper) should also work.
```bash
conda create -n curverl python=3.10 -y
conda activate curverl
pip install -r requirements.txt
pip install --no-deps -e .
```

The paper trains on POLARIS-53K and evaluates on eight math reasoning benchmarks. Build the parquet files once:
```bash
# Training data
python examples/curverl_data_preprocess/polaris.py --local_dir data/polaris

# Benchmarks reported in the paper
python examples/curverl_data_preprocess/aime25.py --local_dir data/aime25
python examples/curverl_data_preprocess/beyondaime.py --local_dir data/beyondaime
python examples/curverl_data_preprocess/hmmt2502.py --local_dir data/hmmt2502
python examples/curverl_data_preprocess/hmmt2602.py --local_dir data/hmmt2602
python examples/curverl_data_preprocess/math_500.py --local_dir data/math_500
python examples/curverl_data_preprocess/brumo25.py --local_dir data/brumo25
python examples/curverl_data_preprocess/hmmt2511.py --local_dir data/hmmt2511
python examples/curverl_data_preprocess/minerva.py --local_dir data/minerva
```

The launcher reads from environment variables and dispatches a single verl.trainer.main_ppo job.
```bash
# CurveRL (paper default: t_0 = 10)
ADVANTAGE_ESTIMATOR=curverl \
CURVERL_POOL_NUM=10 \
MODEL_PATH=Qwen/Qwen3-1.7B-Base \
bash qwen3_experiments/run_qwen3_training.sh
```

For Qwen3-4B-Base, set MODEL_PATH=Qwen/Qwen3-4B-Base.
```bash
# GRPO
ADVANTAGE_ESTIMATOR=grpo MODEL_PATH=Qwen/Qwen3-1.7B-Base \
bash qwen3_experiments/run_qwen3_training.sh

# MaxRL
ADVANTAGE_ESTIMATOR=maxrl MODEL_PATH=Qwen/Qwen3-1.7B-Base \
bash qwen3_experiments/run_qwen3_training.sh
```

The same launcher runs in val-only mode through a thin wrapper:
```bash
EVAL_DATASET=aime25 \
CKPT_PATH=/path/to/local/hf_model \
bash qwen3_experiments/run_qwen3_eva.sh
```

This work builds on top of verl (Ray + Hydra + FSDP + vLLM) and the maxrl baseline. We thank both projects for their open-source contributions.
For questions about the code, feel free to reach out:
- Yizhou Zhao: yzzhao@sas.upenn.edu
- Ke Sun: kesun6@upenn.edu
If you find CurveRL useful, please cite:
```bibtex
@article{curverl2026,
  title   = {CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning},
  author  = {Sun, Ke and Zhao, Yizhou and Xin, Jiayi and Long, Qi and Su, Weijie},
  journal = {arXiv preprint},
  year    = {2026},
}
```