CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

📖 Overview · 🧩 Method · 📊 Main Results · 🚀 Getting Started
🤝 Acknowledgements · 📧 Contact · 🔗 Citation

📖 Overview

Context-level reweighting has emerged as a central algorithmic lever in Reinforcement Learning with Verifiable Rewards (RLVR) for improving the reasoning capability of large language models, yet the principle that determines an optimal weighting remains poorly understood. CurveRL addresses this gap from two angles:

A unified optimality framework. We cast prompt reweighting as context distribution control and formulate the optimal weight as a functional derivative of a utility functional defined in the pass-rate function space. This subsumes existing pointwise schemes — REINFORCE, GRPO, MaxRL — as special cases.
A distribution-aware instantiation. Pointwise weights are determined solely by the absolute value of the pass rate $\hat{p}$, and so suffer from a weight collapse: in the early stage most prompts have $\hat{p} \approx 0$ and in the late stage most prompts have $\hat{p} \approx 1$, yielding nearly indistinguishable weights. CurveRL replaces this with a quantile coordinate transform, in which the weight depends not on the absolute value of $\hat{p}$ but on its rank and density in the evolving pass-rate distribution.
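
The collapse can be seen on a toy batch. The sketch below is illustrative only: the late-stage pass rates are made up, `empirical_cdf` is a hypothetical helper, and the `1 / p` rule stands in for a generic pointwise weight rather than any specific method from the paper. It compares the spread of a pointwise weight with the spread of the quantile coordinate on the same batch.

```python
import numpy as np

def empirical_cdf(pass_rates):
    """Per-prompt quantile coordinate: fraction of prompts with pass rate <= this one."""
    p = np.asarray(pass_rates, dtype=float)
    # Pairwise comparison of every prompt against every other (fine for small batches).
    return (p[None, :] <= p[:, None]).mean(axis=1)

# Hypothetical late-stage batch (N = 8 rollouts): most prompts are already solved.
late_batch = [1.0] * 6 + [0.875] * 3 + [0.75]

# Illustrative pointwise rule that depends only on the absolute pass rate.
pointwise = 1.0 / np.asarray(late_batch)

# Quantile coordinate that depends on rank within the batch.
quantile = empirical_cdf(late_batch)

print("pointwise spread:", pointwise.max() / pointwise.min())
print("quantile spread: ", quantile.max() / quantile.min())
```

On this batch the pointwise values differ by a factor of about 1.3, while the quantile coordinates range from 0.1 to 1.0, so prompts that look nearly identical in absolute pass rate are clearly separated by rank.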

🧩 Method

Distribution-aware utility in quantile space

CurveRL applies a quantile coordinate transform through a reference CDF $F_{\mathrm{ref}}$ with density $f_{\mathrm{ref}}$, giving the utility

$$\mathcal{U}_\theta(F_{\mathrm{ref}}) = \mathbb{E}_{x \sim d_0}\left[\psi\left(F_{\mathrm{ref}}(p_\theta(x))\right)\right]$$

with the log distortion $\psi(u) = \log u$, corresponding to a risk-seeking preference that emphasizes hard prompts. The induced gradient yields the CurveRL weight

$$w(\hat{p}) = \frac{f_{\mathrm{ref}}(\hat{p})}{F_{\mathrm{ref}}(\hat{p})}$$

which has the form of a reverse hazard rate: $1 / F_{\mathrm{ref}}(\hat{p})$ emphasizes the lower-quantile prompts, while $f_{\mathrm{ref}}(\hat{p})$ makes the allocation data-driven by tracking pass-rate regions that are actually populated under the current policy.
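
A minimal sketch of this weight, under stated assumptions: the reference distribution is taken to be the empirical distribution of a pool of recently observed pass rates, discretized on the grid $\{0, 1/N, \dots, 1\}$ that $N$ rollouts per prompt produce. The function name `curverl_weights` and the pooling choice are hypothetical; the repository's implementation may estimate $F_{\mathrm{ref}}$ and $f_{\mathrm{ref}}$ differently (for example with smoothing).

```python
import numpy as np

def curverl_weights(pass_rates, ref_pool, n_rollouts=8):
    """Reverse-hazard-rate weights f_ref / F_ref on the grid {0, 1/N, ..., 1}.

    Assumes every pass rate lies in [0, 1] and ref_pool is non-empty.
    """
    # Convert pass rates to integer success counts 0..N.
    ref_counts = np.rint(np.asarray(ref_pool) * n_rollouts).astype(int)
    hist = np.bincount(ref_counts, minlength=n_rollouts + 1)
    f_ref = hist / hist.sum()   # empirical mass at each grid point
    F_ref = np.cumsum(f_ref)    # empirical CDF on the grid
    # Grid points below the lowest populated bin have F_ref = 0; give them weight 0.
    w_grid = np.divide(f_ref, F_ref, out=np.zeros_like(f_ref), where=F_ref > 0)
    idx = np.rint(np.asarray(pass_rates) * n_rollouts).astype(int)
    return w_grid[idx]

# Hypothetical reference pool and batch.
ref_pool = [0.0] * 4 + [0.125] * 2 + [0.5] * 2 + [1.0] * 2
weights = curverl_weights([0.125, 0.5, 1.0], ref_pool)
print(weights)
```

With this discretization the weight equals the empirical mass at $\hat{p}$ divided by the mass at or below it, so it lies in $[0, 1]$ and decreases as populated bins move up the distribution.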

📊 Main Results

We train Qwen3-1.7B-Base and Qwen3-4B-Base on POLARIS-53K (≈53K math reasoning prompts) under the verl framework. All methods share the same training loop and differ only in the prompt-weighting rule. We use batch size |B| = 256, N = 8 rollouts per prompt, and t_0 = 10.

CurveRL is compared against GRPO and MaxRL on five benchmarks in the main paper — AIME 2025, BeyondAIME, HMMT 02/25, HMMT 02/26, MATH-500 — and three more in the appendix (BRUMO 2025, HMMT 11/25, Minerva).

Figure 1 — pass@k scaling on five representative benchmarks. CurveRL outperforms GRPO and MaxRL across the full range of k on both model sizes, and exceeds the pretrained base model on most panels.

Key findings.

Improved Pareto frontier of pass@1 and pass@k. CurveRL attains the highest pass@64 on every benchmark across both model sizes, improving the average pass@64 by +5.9% on Qwen3-1.7B-Base and +9.7% on Qwen3-4B-Base over MaxRL — the strongest baseline — without sacrificing average pass@1.
Wider gap at larger k. Against MaxRL, CurveRL's advantage grows with both model scale and k, reaching roughly +7.3% on HMMT 02/26 at k = 1024 on Qwen3-1.7B-Base and +26.8% on HMMT 02/25 at k = 512 on Qwen3-4B-Base — evidence that distribution-aware reweighting effectively broadens the search over reasoning trajectories.
No pass@k degradation. While GRPO and MaxRL exhibit varying degrees of pass@k degradation relative to the pretrained base model, CurveRL exceeds the base model in 9 out of 10 panels above.
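
For readers reproducing the pass@k numbers, the sketch below implements the commonly used unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, computed from $n$ samples of which $c$ are correct. This README does not state which estimator the paper uses, so treat this as an assumption; the function name `pass_at_k` is hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples, 2 correct.
print(pass_at_k(8, 2, 1), pass_at_k(8, 2, 4))
```

Per-prompt estimates are then averaged over the benchmark to obtain the reported pass@k.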

🚀 Getting Started

1. Installation

The paper experiments use Python 3.10, CUDA 12.8, and 8× NVIDIA B200 GPUs. Other recent CUDA versions and Ampere or Hopper GPUs should also work.

conda create -n curverl python=3.10 -y
conda activate curverl

pip install -r requirements.txt

pip install --no-deps -e .

2. Data preparation

The paper trains on POLARIS-53K and evaluates on eight math reasoning benchmarks. Build the parquet files once:

# Training data
python examples/curverl_data_preprocess/polaris.py     --local_dir data/polaris

# Benchmarks reported in the paper
python examples/curverl_data_preprocess/aime25.py      --local_dir data/aime25
python examples/curverl_data_preprocess/beyondaime.py  --local_dir data/beyondaime
python examples/curverl_data_preprocess/hmmt2502.py    --local_dir data/hmmt2502
python examples/curverl_data_preprocess/hmmt2602.py    --local_dir data/hmmt2602
python examples/curverl_data_preprocess/math_500.py    --local_dir data/math_500
python examples/curverl_data_preprocess/brumo25.py     --local_dir data/brumo25
python examples/curverl_data_preprocess/hmmt2511.py    --local_dir data/hmmt2511
python examples/curverl_data_preprocess/minerva.py     --local_dir data/minerva

3. Training

The launcher reads from environment variables and dispatches a single verl.trainer.main_ppo job.

# CurveRL (paper default: t_0 = 10)
ADVANTAGE_ESTIMATOR=curverl \
CURVERL_POOL_NUM=10 \
MODEL_PATH=Qwen/Qwen3-1.7B-Base \
bash qwen3_experiments/run_qwen3_training.sh

For Qwen3-4B-Base, set MODEL_PATH=Qwen/Qwen3-4B-Base.

4. Baselines

# GRPO
ADVANTAGE_ESTIMATOR=grpo  MODEL_PATH=Qwen/Qwen3-1.7B-Base \
  bash qwen3_experiments/run_qwen3_training.sh

# MaxRL
ADVANTAGE_ESTIMATOR=maxrl MODEL_PATH=Qwen/Qwen3-1.7B-Base \
  bash qwen3_experiments/run_qwen3_training.sh

5. Evaluation

The same launcher runs in val-only mode through a thin wrapper:

EVAL_DATASET=aime25 \
CKPT_PATH=/path/to/local/hf_model \
bash qwen3_experiments/run_qwen3_eva.sh
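
To evaluate one checkpoint on all eight benchmarks, the wrapper can be driven from a small Python script. This is a sketch under assumptions: only `EVAL_DATASET=aime25` is confirmed above, and the other dataset names are assumed to match the data directory names from step 2. The helpers `eval_env` and `run_all` are hypothetical and not part of the repository.

```python
import os
import subprocess

# Assumed EVAL_DATASET values, taken from the data directory names in step 2.
BENCHMARKS = ["aime25", "beyondaime", "hmmt2502", "hmmt2602",
              "math_500", "brumo25", "hmmt2511", "minerva"]

def eval_env(dataset, ckpt_path):
    """Copy the current environment and set the two variables the wrapper reads."""
    env = dict(os.environ)
    env["EVAL_DATASET"] = dataset
    env["CKPT_PATH"] = ckpt_path
    return env

def run_all(ckpt_path, dry_run=True):
    """Run (or, with dry_run=True, only print) one evaluation job per benchmark."""
    cmd = ["bash", "qwen3_experiments/run_qwen3_eva.sh"]
    for name in BENCHMARKS:
        if dry_run:
            print(f"EVAL_DATASET={name} CKPT_PATH={ckpt_path} {' '.join(cmd)}")
        else:
            subprocess.run(cmd, env=eval_env(name, ckpt_path), check=True)
```

Calling `run_all("/path/to/local/hf_model")` prints the commands without running them; pass `dry_run=False` from the repository root to launch the jobs sequentially.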

🤝 Acknowledgements

This work builds on top of verl (Ray + Hydra + FSDP + vLLM) and the maxrl baseline. We thank both projects for their open-source contributions.

📧 Contact

For questions about the code, feel free to reach out:

Yizhou Zhao: yzzhao@sas.upenn.edu
Ke Sun: kesun6@upenn.edu

🔗 Citation

If you find CurveRL useful, please cite:

@article{curverl2026,
  title   = {CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning},
  author  = {Sun, Ke and Zhao, Yizhou and Xin, Jiayi and Long, Qi and Su, Weijie},
  journal = {arXiv preprint},
  year    = {2026},
}
