
Verifiable Process Rewards for Agentic Reasoning


🌐 Project Page | 📝 Paper | 🤗 Models


Overview

Reinforcement learning from verifiable rewards (RLVR) has shown strong potential for improving LLM reasoning, but most existing methods rely on sparse outcome-level feedback. This becomes problematic in long-horizon agentic reasoning: a trajectory can fail after many correct intermediate decisions, or succeed despite flawed ones, making credit assignment difficult.

Verifiable Process Rewards (VPR) studies a class of densely-verifiable agentic reasoning problems where intermediate actions can be checked by symbolic or algorithmic oracles. Instead of only rewarding final success, VPR converts these verifiers into dense turn-level rewards for reinforcement learning.

The core idea is simple:

  • Use objective task structure to verify each intermediate action.
  • Provide local reward signals during a trajectory, not only at the end.
  • Optimize LLM agents with turn-level process supervision grounded in reliable verifiers.
Overview of VPR compared with outcome reward and rollout-based process reward

Method

VPR focuses on agentic reasoning settings where each action can be checked against a task-specific verifier:

r_t = V(s_t, a_t)

Here, V is an oracle verifier that determines whether the current action is valid, useful, or optimal under the task structure. This turns sparse trajectory-level feedback into dense turn-level supervision.
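The per-turn reward above can be sketched as a thin interface: a verifier callable that scores a single (state, action) pair, mapped over a trajectory. This is a minimal illustration, not the repo's actual API; the `Verifier` protocol and `dense_rewards` helper are hypothetical names.

```python
from typing import Protocol, Sequence


class Verifier(Protocol):
    """Oracle verifier V(s, a): scores one action against the task structure."""

    def __call__(self, state, action) -> float: ...


def dense_rewards(states: Sequence, actions: Sequence, verifier: Verifier) -> list[float]:
    """Turn-level rewards r_t = V(s_t, a_t) for a single trajectory."""
    return [verifier(s, a) for s, a in zip(states, actions)]
```

With any concrete oracle plugged in, every turn of a trajectory gets its own scalar reward instead of a single trajectory-level score.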

We instantiate VPR in three representative settings:

  • Search-based VPR for dynamic deduction. In Tic-Tac-Toe, Monte Carlo Tree Search labels strategically optimal moves, rewarding actions with the best lookahead value.
  • Constraint-based VPR for logical reasoning. In Sudoku, a constraint oracle checks whether a filled digit is consistent with the unique solution.
  • Posterior-based VPR for probabilistic inference. In Minesweeper, posterior mine probabilities verify safe reveals and certain mine flags under partial observability.
Three VPR instantiations in Tic-Tac-Toe, Sudoku, and Minesweeper
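Of the three oracles, the constraint-based one is the simplest to picture. A minimal sketch, assuming the unique solution grid has been precomputed (the function name and +1/-1 reward scale are illustrative, not the repo's implementation):

```python
def sudoku_verifier(solution: list[list[int]], row: int, col: int, digit: int) -> float:
    """Constraint oracle for Sudoku: +1 if the filled digit matches the
    precomputed unique solution at (row, col), otherwise -1."""
    return 1.0 if solution[row][col] == digit else -1.0
```

The search-based (MCTS lookahead values) and posterior-based (mine probabilities) verifiers follow the same shape: each returns a scalar judgment for one action, so all three drop into the same turn-level reward pipeline.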

During training, VPR uses a turn-level GRPO-style objective. For each environment instance, multiple trajectories are sampled, verifier rewards are normalized at each turn across the group, and the resulting turn-level advantages are used in a clipped policy optimization objective. This lets correct intermediate decisions receive positive learning signal even when the full trajectory later fails, and invalid decisions receive negative signal even when the trajectory succeeds by chance.
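The per-turn group normalization described above can be sketched as follows. This is a simplified illustration of the GRPO-style advantage computation, assuming a rectangular `[group_size, num_turns]` reward array (the actual training code handles variable-length trajectories and the clipped policy loss on top of this):

```python
import numpy as np


def turn_level_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style turn-level advantages.

    rewards: shape [G, T] -- verifier rewards for G sampled trajectories
    of the same environment instance, at each of T turns.

    At every turn, normalize across the group so that a correct action
    gets positive advantage even if its trajectory later fails.
    """
    mean = rewards.mean(axis=0, keepdims=True)   # per-turn group mean
    std = rewards.std(axis=0, keepdims=True)     # per-turn group std
    return (rewards - mean) / (std + eps)
```

Because normalization happens within each turn's group rather than over whole-trajectory returns, the signal at turn t does not get diluted by what happens at later turns.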

The paper also provides a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment: VPR behaves like an on-policy filtered imitation update over oracle-valid actions, verifier-induced bias scales with verifier error, and dense step-level signal can avoid the horizon dilution suffered by sparse outcome rewards.

Experiments

We evaluate VPR on three densely-verifiable training environments:

  • Tic-Tac-Toe for dynamic deduction and strategic lookahead.
  • Sudoku for symbolic constraint satisfaction.
  • Minesweeper for probabilistic inference under partial observability.

The experimental results show that VPR consistently outperforms outcome-level reward and Monte Carlo process-reward baselines in these environments. Models trained with VPR also transfer to broader general reasoning benchmarks and agentic tasks such as ALFWorld and WebShop, suggesting that verifiable process supervision can teach reasoning skills beyond the synthetic training games.

The main caveat is oracle quality: dense feedback helps only when the verifier is sufficiently reliable. Weak or misleading oracles can degrade both in-domain learning and out-of-domain generalization.

Evaluation curves for VPR training

Installation

VPR is an early fork of the ROLL framework. Please follow the official ROLL setup guide for environment and backend compatibility:

ROLL Docs - Getting Started

Training

Use the provided agentic RL scripts to run VPR-style training in the three environments.

# Tic-Tac-Toe
bash examples/tictactoe/run_agentic_pipeline_tictactoe.sh

# Sudoku
bash examples/sudoku/run_agentic_pipeline_sudoku.sh

# Minesweeper
bash examples/minesweeper/run_agentic_pipeline_minesweeper.sh

Training progress can be monitored with TensorBoard, for example:

tensorboard --logdir=runs/tictactoe_pipeline/

Model Conversion

After training, convert the saved Megatron checkpoint into a Hugging Face compatible model directory for evaluation or release.

First edit the paths in model_convert.sh:

RUN_PATH=/path/to/runs/tictactoe_pipeline/YYYYMMDD-HHMMSS
CKPT=checkpoint-100
OUTPUT_PATH=/path/to/output/hf_model

Then run:

bash model_convert.sh

The script gathers tensor-parallel checkpoint shards, copies the base model config, runs mcore_adapter/tools/convert.py, and writes the converted checkpoint to OUTPUT_PATH.

Evaluation

Run the agentic rollout scripts to evaluate trained models in each environment:

# Tic-Tac-Toe
bash examples/tictactoe/run_agentic_rollout_tictactoe.sh

# Sudoku
bash examples/sudoku/run_agentic_rollout_sudoku.sh

# Minesweeper
bash examples/minesweeper/run_agentic_rollout_minesweeper.sh

Citation

If you find this work useful, please cite:

@misc{yuan2026verifiable,
      title={Verifiable Process Rewards for Agentic Reasoning}, 
      author={Huining Yuan and Zelai Xu and Huaijie Wang and Xiangmin Yi and Jiaxuan Gao and Xiao-Ping Zhang and Yu Wang and Chao Yu and Yi Wu},
      year={2026},
      eprint={2605.10325},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2605.10325}, 
}
