Reinforcement learning from verifiable rewards (RLVR) has shown strong potential for improving LLM reasoning, but most existing methods rely on sparse outcome-level feedback. This becomes problematic in long-horizon agentic reasoning: a trajectory can fail after many correct intermediate decisions, or succeed despite flawed ones, making credit assignment difficult.
Verifiable Process Rewards (VPR) studies a class of densely-verifiable agentic reasoning problems where intermediate actions can be checked by symbolic or algorithmic oracles. Instead of only rewarding final success, VPR converts these verifiers into dense turn-level rewards for reinforcement learning.
The core idea is simple:
- Use objective task structure to verify each intermediate action.
- Provide local reward signals during a trajectory, not only at the end.
- Optimize LLM agents with turn-level process supervision grounded in reliable verifiers.
VPR focuses on agentic reasoning settings where each action can be checked against a task-specific verifier:
r_t = V(s_t, a_t)
Here, V is an oracle verifier that determines whether the current action is valid, useful, or optimal under the task structure. This turns sparse trajectory-level feedback into dense turn-level supervision.
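To make the interface concrete, here is a minimal Python sketch of a rollout loop that attaches a verifier reward to every turn. The `Verifier` protocol and the `env.reset()`/`env.step()` interface are illustrative assumptions for this sketch, not the repo's actual API:

```python
from typing import Protocol

class Verifier(Protocol):
    """Oracle V(s_t, a_t): scores a single action taken in a given state."""
    def __call__(self, state, action) -> float: ...

def rollout_with_process_rewards(env, policy, verify: Verifier, max_turns: int = 64):
    """Collect one trajectory, assigning the dense reward r_t = V(s_t, a_t)
    at every turn instead of a single outcome reward at the end."""
    state = env.reset()
    turns = []
    for _ in range(max_turns):
        action = policy(state)
        r = verify(state, action)          # dense, turn-level verifier reward
        turns.append((state, action, r))
        state, done = env.step(action)     # assumed (next_state, done) interface
        if done:
            break
    return turns
```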
We instantiate VPR in three representative settings:
- Search-based VPR for dynamic deduction. In Tic-Tac-Toe, Monte Carlo Tree Search labels strategically optimal moves, rewarding actions with the best lookahead value.
- Constraint-based VPR for logical reasoning. In Sudoku, a constraint oracle checks whether a filled digit is consistent with the unique solution (see the sketch after this list).
- Posterior-based VPR for probabilistic inference. In Minesweeper, posterior mine probabilities verify safe reveals and certain mine flags under partial observability.
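Of the three, the constraint-based case is the simplest to illustrate. Below is a minimal sketch of a Sudoku-style oracle, assuming the oracle has access to the puzzle's unique solution grid; the function name and signature are ours, for illustration only:

```python
def sudoku_constraint_reward(solution, row, col, digit):
    """Constraint-oracle reward for filling `digit` at (row, col):
    +1 if it matches the unique solution, -1 otherwise.
    `solution` is the solved 9x9 grid as a list of lists."""
    return 1.0 if solution[row][col] == digit else -1.0
```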
During training, VPR uses a turn-level GRPO-style objective. For each environment instance, multiple trajectories are sampled, verifier rewards are normalized at each turn across the group, and the resulting turn-level advantages are used in a clipped policy optimization objective. This lets correct intermediate decisions receive positive learning signal even when the full trajectory later fails, and invalid decisions receive negative signal even when the trajectory succeeds by chance.
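The group-normalization step can be sketched in a few lines. The code below assumes the verifier rewards for a group of sampled trajectories are stored as an equal-length array (a simplification; variable-length trajectories would need masking), and the names are ours rather than ROLL's:

```python
import numpy as np

def turn_level_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style turn-level advantages.

    rewards: shape (group_size, num_turns), where rewards[i, t] is the
    verifier reward of trajectory i at turn t. Each turn's rewards are
    normalized across the group, so a move's advantage depends on how it
    compares with the group at that turn, not on the final outcome.
    """
    mean = rewards.mean(axis=0, keepdims=True)  # per-turn mean over the group
    std = rewards.std(axis=0, keepdims=True)    # per-turn std over the group
    return (rewards - mean) / (std + eps)
```

Because normalization happens per turn, a correct early move keeps a positive advantage even if the trajectory later fails, which is exactly the credit-assignment behavior described above.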
The paper also provides a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment: VPR behaves like an on-policy filtered imitation update over oracle-valid actions, verifier-induced bias scales with verifier error, and dense step-level signal can avoid the horizon dilution suffered by sparse outcome rewards.
We evaluate VPR on three densely-verifiable training environments:
- Tic-Tac-Toe for dynamic deduction and strategic lookahead.
- Sudoku for symbolic constraint satisfaction.
- Minesweeper for probabilistic inference under partial observability.
The experimental results show that VPR consistently outperforms outcome-level reward and Monte Carlo process-reward baselines in these environments. Models trained with VPR also transfer to broader general reasoning benchmarks and agentic tasks such as ALFWorld and WebShop, suggesting that verifiable process supervision can teach reasoning skills beyond the synthetic training games.
The main caveat is oracle quality: dense feedback helps only when the verifier is sufficiently reliable. Weak or misleading oracles can degrade both in-domain learning and out-of-domain generalization.
VPR is an early fork of the ROLL framework. Please follow the official ROLL setup guide for environment and backend compatibility.
Use the provided agentic RL scripts to run VPR-style training in the three environments.
```bash
# Tic-Tac-Toe
bash examples/tictactoe/run_agentic_pipeline_tictactoe.sh

# Sudoku
bash examples/sudoku/run_agentic_pipeline_sudoku.sh

# Minesweeper
bash examples/minesweeper/run_agentic_pipeline_minesweeper.sh
```

Training progress can be monitored with TensorBoard, for example:
```bash
tensorboard --logdir=runs/tictactoe_pipeline/
```

After training, convert the saved Megatron checkpoint into a Hugging Face-compatible model directory for evaluation or release.
First, edit the paths in `model_convert.sh`:

```bash
RUN_PATH=/path/to/runs/tictactoe_pipeline/YYYYMMDD-HHMMSS
CKPT=checkpoint-100
OUTPUT_PATH=/path/to/output/hf_model
```

Then run:

```bash
bash model_convert.sh
```

The script gathers the tensor-parallel checkpoint shards, copies the base model config, runs `mcore_adapter/tools/convert.py`, and writes the converted checkpoint to `OUTPUT_PATH`.

To evaluate the converted model in the three environments, use the agentic rollout scripts:
```bash
# Tic-Tac-Toe
bash examples/tictactoe/run_agentic_rollout_tictactoe.sh

# Sudoku
bash examples/sudoku/run_agentic_rollout_sudoku.sh

# Minesweeper
bash examples/minesweeper/run_agentic_rollout_minesweeper.sh
```

If you find this work useful, please cite:
```bibtex
@misc{yuan2026verifiable,
  title={Verifiable Process Rewards for Agentic Reasoning},
  author={Huining Yuan and Zelai Xu and Huaijie Wang and Xiangmin Yi and Jiaxuan Gao and Xiao-Ping Zhang and Yu Wang and Chao Yu and Yi Wu},
  year={2026},
  eprint={2605.10325},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.10325},
}
```

