zhyzmath/CurveRL

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

   

📖 Overview  ·  🧩 Method  ·  📊 Main Results  ·  🚀 Getting Started
🤝 Acknowledgements  ·  📧 Contact  ·  🔗 Citation

📖 Overview

Context-level reweighting has emerged as a central algorithmic lever in Reinforcement Learning with Verifiable Rewards (RLVR) for improving the reasoning capability of large language models, yet the principle determining what constitutes an optimal weighting remains poorly understood. CurveRL addresses this gap from two angles:

  • A unified optimality framework. We cast prompt reweighting as context distribution control and formulate the optimal weight as a functional derivative of a utility functional defined in the pass-rate function space. This subsumes existing pointwise schemes — REINFORCE, GRPO, MaxRL — as special cases.
  • A distribution-aware instantiation. Pointwise weights are determined solely by the absolute value of the pass rate $\hat{p}$, and so suffer from a weight collapse: in the early stage most prompts have $\hat{p} \approx 0$ and in the late stage most prompts have $\hat{p} \approx 1$, yielding nearly indistinguishable weights. CurveRL replaces this with a quantile coordinate transform, in which the weight depends not on the absolute value of $\hat{p}$ but on its rank and density in the evolving pass-rate distribution.

🧩 Method

Distribution-aware utility in quantile space

CurveRL applies a quantile coordinate transform through a reference CDF $F_{\mathrm{ref}}$ with density $f_{\mathrm{ref}}$, giving the utility

$$\mathcal{U}_\theta(F_{\mathrm{ref}}) = \mathbb{E}_{x \sim d_0}\left[\psi\left(F_{\mathrm{ref}}(p_\theta(x))\right)\right]$$

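with the log distortion $\psi(u) = \log u$, corresponding to a risk-seeking preference that emphasizes hard prompts. Holding $F_{\mathrm{ref}}$ fixed, the chain rule gives $\frac{d}{dp}\,\psi(F_{\mathrm{ref}}(p)) = \psi'(F_{\mathrm{ref}}(p))\, f_{\mathrm{ref}}(p)$ with $\psi'(u) = 1/u$, so the induced gradient yields the CurveRL weight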

$$w(\hat{p}) = \frac{f_{\mathrm{ref}}(\hat{p})}{F_{\mathrm{ref}}(\hat{p})}$$

which has the form of a reverse hazard rate: $1 / F_{\mathrm{ref}}(\hat{p})$ emphasizes the lower-quantile prompts, while $f_{\mathrm{ref}}(\hat{p})$ makes the allocation data-driven by tracking pass-rate regions that are actually populated under the current policy.
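The reverse hazard rate can be illustrated on the discrete pass-rate grid $\{0, 1/N, \dots, 1\}$. The sketch below is not the repository's implementation: it assumes $F_{\mathrm{ref}}$ and $f_{\mathrm{ref}}$ are replaced by the empirical CDF and probability mass function of a pool of recent pass rates, and the function name `reverse_hazard_weights` and the `reference_pool` argument are hypothetical.

```python
import numpy as np


def reverse_hazard_weights(pass_rates, reference_pool, n_rollouts=8):
    """Discrete stand-in for f_ref(p) / F_ref(p) on the grid {0, 1/N, ..., 1}.

    Assumption: f_ref is approximated by the empirical pmf of the pool and
    F_ref by its empirical CDF. The actual estimator may differ.
    """
    pool = np.asarray(reference_pool, dtype=float)
    # Convert pass rates to integer success counts in 0..N.
    pool_counts = np.rint(pool * n_rollouts).astype(int)
    # Empirical pmf over the N + 1 attainable pass-rate values.
    pmf = np.bincount(pool_counts, minlength=n_rollouts + 1) / pool_counts.size
    # Empirical CDF: mass at or below each grid value.
    cdf = np.cumsum(pmf)
    # Leave the weight at zero where the pool has no mass at or below the value.
    grid_weights = np.divide(pmf, cdf, out=np.zeros_like(pmf), where=cdf > 0)
    counts = np.rint(np.asarray(pass_rates, dtype=float) * n_rollouts).astype(int)
    return grid_weights[counts]


# Hypothetical early-training pool: most prompts are unsolved.
pool = [0.0, 0.0, 0.0, 0.125, 0.5, 1.0]
weights = reverse_hazard_weights([0.0, 0.125, 0.5, 1.0], pool)
print(weights)
```

On this toy pool the weights come out at roughly 1.0, 0.25, 0.2, and 0.17: prompts in the lower quantiles receive more weight, and a pass-rate value that no prompt in the pool attains receives none.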

📊 Main Results

We train Qwen3-1.7B-Base and Qwen3-4B-Base on POLARIS-53K (≈53K math reasoning prompts) under the verl framework. All methods share the same training loop and differ only in the prompt-weighting rule. We use batch size $|B| = 256$, $N = 8$ rollouts per prompt, and $t_0 = 10$.

CurveRL is compared against GRPO and MaxRL on five benchmarks in the main paper — AIME 2025, BeyondAIME, HMMT 02/25, HMMT 02/26, MATH-500 — and three more in the appendix (BRUMO 2025, HMMT 11/25, Minerva).

Figure 1 — pass@k scaling on five representative benchmarks. CurveRL outperforms GRPO and MaxRL across the full range of k on both model sizes, and exceeds the pretrained base model on most panels.

Key findings.

  • Improved Pareto frontier of pass@1 and pass@k. CurveRL attains the highest pass@64 on every benchmark across both model sizes, improving the average pass@64 by +5.9% on Qwen3-1.7B-Base and +9.7% on Qwen3-4B-Base over MaxRL — the strongest baseline — without sacrificing average pass@1.
  • Wider gap at larger k. Against MaxRL, CurveRL's advantage grows with both model scale and k, reaching roughly +7.3% on HMMT 02/26 at k = 1024 on Qwen3-1.7B-Base and +26.8% on HMMT 02/25 at k = 512 on Qwen3-4B-Base — evidence that distribution-aware reweighting effectively broadens the search over reasoning trajectories.
  • Almost no pass@k degradation. While GRPO and MaxRL exhibit varying degrees of pass@k degradation relative to the pretrained base model, CurveRL exceeds the base model in 9 of the 10 panels in Figure 1.
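This README does not say how pass@k is estimated. A common choice, assumed here purely for illustration, is the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ computed from $n$ samples of which $c$ are correct. The helper name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (assumed estimator)."""
    if not 0 <= c <= n:
        raise ValueError("c must lie between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and n")
    # With fewer than k incorrect samples, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Otherwise subtract the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 2 correct answers out of 8 samples.
print(pass_at_k(8, 2, 1))
print(pass_at_k(8, 2, 4))
```

The estimate is non-decreasing in $k$ for fixed $n$ and $c$, which is why curves such as those in Figure 1 rise as $k$ grows.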

🚀 Getting Started

1. Installation

The paper experiments use Python 3.10, CUDA 12.8, and 8× NVIDIA B200 GPUs. Other recent CUDA-capable GPUs (e.g., Ampere or Hopper) should also work.

conda create -n curverl python=3.10 -y
conda activate curverl

pip install -r requirements.txt

pip install --no-deps -e .

2. Data preparation

The paper trains on POLARIS-53K and evaluates on eight math reasoning benchmarks. Build the parquet files once:

# Training data
python examples/curverl_data_preprocess/polaris.py     --local_dir data/polaris

# Benchmarks reported in the paper
python examples/curverl_data_preprocess/aime25.py      --local_dir data/aime25
python examples/curverl_data_preprocess/beyondaime.py  --local_dir data/beyondaime
python examples/curverl_data_preprocess/hmmt2502.py    --local_dir data/hmmt2502
python examples/curverl_data_preprocess/hmmt2602.py    --local_dir data/hmmt2602
python examples/curverl_data_preprocess/math_500.py    --local_dir data/math_500
python examples/curverl_data_preprocess/brumo25.py     --local_dir data/brumo25
python examples/curverl_data_preprocess/hmmt2511.py    --local_dir data/hmmt2511
python examples/curverl_data_preprocess/minerva.py     --local_dir data/minerva

3. Training

The launcher reads from environment variables and dispatches a single verl.trainer.main_ppo job.

# CurveRL (paper default: t_0 = 10)
ADVANTAGE_ESTIMATOR=curverl \
CURVERL_POOL_NUM=10 \
MODEL_PATH=Qwen/Qwen3-1.7B-Base \
bash qwen3_experiments/run_qwen3_training.sh

For Qwen3-4B-Base, set MODEL_PATH=Qwen/Qwen3-4B-Base.

4. Baselines

# GRPO
ADVANTAGE_ESTIMATOR=grpo  MODEL_PATH=Qwen/Qwen3-1.7B-Base \
  bash qwen3_experiments/run_qwen3_training.sh

# MaxRL
ADVANTAGE_ESTIMATOR=maxrl MODEL_PATH=Qwen/Qwen3-1.7B-Base \
  bash qwen3_experiments/run_qwen3_training.sh
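
To queue the three methods from one place, a small driver can set the same environment variables as the commands above. This is a sketch under assumptions, not part of the repository: the helpers `build_env` and `launch` are hypothetical, and only the variable names and launcher path are taken from this README. It defaults to a dry run that prints each command.

```python
import os
import subprocess

LAUNCHER = "qwen3_experiments/run_qwen3_training.sh"


def build_env(estimator, model_path, pool_num=None):
    """Return the environment overrides for one training run."""
    env = {"ADVANTAGE_ESTIMATOR": estimator, "MODEL_PATH": model_path}
    if pool_num is not None:
        # Only the CurveRL command above sets this variable.
        env["CURVERL_POOL_NUM"] = str(pool_num)
    return env


def launch(env_overrides, dry_run=True):
    """Print the equivalent shell command; run the launcher unless dry_run is set."""
    settings = " ".join(f"{key}={value}" for key, value in env_overrides.items())
    print(settings, "bash", LAUNCHER)
    if not dry_run:
        subprocess.run(
            ["bash", LAUNCHER],
            env={**os.environ, **env_overrides},
            check=True,
        )


RUNS = [
    build_env("curverl", "Qwen/Qwen3-1.7B-Base", pool_num=10),
    build_env("grpo", "Qwen/Qwen3-1.7B-Base"),
    build_env("maxrl", "Qwen/Qwen3-1.7B-Base"),
]

if __name__ == "__main__":
    for run in RUNS:
        launch(run)  # pass dry_run=False to start training
```

Runs execute one after another, since `subprocess.run` blocks until the launcher exits.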

5. Evaluation

The same launcher runs in val-only mode through a thin wrapper:

EVAL_DATASET=aime25 \
CKPT_PATH=/path/to/local/hf_model \
bash qwen3_experiments/run_qwen3_eva.sh

🤝 Acknowledgements

This work builds on top of verl (Ray + Hydra + FSDP + vLLM) and the maxrl baseline. We thank both projects for their open-source contributions.

📧 Contact

For questions about the code, feel free to reach out:

🔗 Citation

If you find CurveRL useful, please cite:

@article{curverl2026,
  title   = {CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning},
  author  = {Sun, Ke and Zhao, Yizhou and Xin, Jiayi and Long, Qi and Su, Weijie},
  journal = {arXiv preprint},
  year    = {2026},
}
