Reinforcement learning from verifiable rewards (RLVR) has shown strong potential for improving LLM reasoning, but most existing methods rely on sparse outcome-level feedback. This becomes problematic in long-horizon agentic reasoning: a trajectory can fail after many correct intermediate decisions, or succeed despite flawed ones, making credit assignment difficult.
Verifiable Process Rewards (VPR) studies a class of densely-verifiable agentic reasoning problems where intermediate actions can be checked by symbolic or algorithmic oracles. Instead of only rewarding final success, VPR converts these verifiers into dense turn-level rewards for reinforcement learning.
The core idea is simple:
- Use objective task structure to verify each intermediate action.
- Provide local reward signals during a trajectory, not only at the end.
- Optimize LLM agents with turn-level process supervision grounded in reliable verifiers.
VPR focuses on agentic reasoning settings where each action can be checked against a task-specific verifier:
r_t = V(s_t, a_t)
Here, V is an oracle verifier that determines whether the current action is valid, useful, or optimal under the task structure. This turns sparse trajectory-level feedback into dense turn-level supervision.
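To make the interface concrete, here is a minimal Python sketch of a rollout loop that attaches a verifier reward to every turn. The `Verifier` protocol and the `env.reset()`/`env.step()` interface are illustrative assumptions for this sketch, not the repo's actual API:

```python
from typing import Protocol

class Verifier(Protocol):
    """Oracle V(s_t, a_t): scores a single action taken in a given state."""
    def __call__(self, state, action) -> float: ...

def rollout_with_process_rewards(env, policy, verify: Verifier, max_turns: int = 64):
    """Collect one trajectory, assigning the dense reward r_t = V(s_t, a_t)
    at every turn instead of a single outcome reward at the end."""
    state = env.reset()
    turns = []
    for _ in range(max_turns):
        action = policy(state)
        r = verify(state, action)          # dense, turn-level verifier reward
        turns.append((state, action, r))
        state, done = env.step(action)     # assumed (next_state, done) interface
        if done:
            break
    return turns
```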
We instantiate VPR in three representative settings:
- Search-based VPR for dynamic deduction. In Tic-Tac-Toe, Monte Carlo Tree Search labels strategically optimal moves, rewarding actions with the best lookahead value.
- Constraint-based VPR for logical reasoning. In Sudoku, a constraint oracle checks whether a filled digit is consistent with the unique solution (see the sketch after this list).
- Posterior-based VPR for probabilistic inference. In Minesweeper, posterior mine probabilities verify safe reveals and certain mine flags under partial observability.
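Of the three, the constraint-based case is the simplest to illustrate. Below is a minimal sketch of a Sudoku-style oracle, assuming the oracle has access to the puzzle's unique solution grid; the function name and signature are ours, for illustration only:

```python
def sudoku_constraint_reward(solution, row, col, digit):
    """Constraint-oracle reward for filling `digit` at (row, col):
    +1 if it matches the unique solution, -1 otherwise.
    `solution` is the solved 9x9 grid as a list of lists."""
    return 1.0 if solution[row][col] == digit else -1.0
```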
During training, VPR uses a turn-level GRPO-style objective. For each environment instance, multiple trajectories are sampled, verifier rewards are normalized at each turn across the group, and the resulting turn-level advantages are used in a clipped policy optimization objective. This lets correct intermediate decisions receive positive learning signal even when the full trajectory later fails, and invalid decisions receive negative signal even when the trajectory succeeds by chance.
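The group-normalization step can be sketched in a few lines. The code below assumes the verifier rewards for a group of sampled trajectories are stored as an equal-length array (a simplification; variable-length trajectories would need masking), and the names are ours rather than ROLL's:

```python
import numpy as np

def turn_level_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style turn-level advantages.

    rewards: shape (group_size, num_turns), where rewards[i, t] is the
    verifier reward of trajectory i at turn t. Each turn's rewards are
    normalized across the group, so a move's advantage depends on how it
    compares with the group at that turn, not on the final outcome.
    """
    mean = rewards.mean(axis=0, keepdims=True)  # per-turn mean over the group
    std = rewards.std(axis=0, keepdims=True)    # per-turn std over the group
    return (rewards - mean) / (std + eps)
```

Because normalization happens per turn, a correct early move keeps a positive advantage even if the trajectory later fails, which is exactly the credit-assignment behavior described above.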
The paper also provides a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment: VPR behaves like an on-policy filtered imitation update over oracle-valid actions, verifier-induced bias scales with verifier error, and dense step-level signal can avoid the horizon dilution suffered by sparse outcome rewards.
We evaluate VPR on three densely-verifiable training environments:
- Tic-Tac-Toe for dynamic deduction and strategic lookahead.
- Sudoku for symbolic constraint satisfaction.
- Minesweeper for probabilistic inference under partial observability.
The experimental results show that VPR consistently outperforms outcome-level reward and Monte Carlo process-reward baselines in these environments. Models trained with VPR also transfer to broader general reasoning benchmarks and agentic tasks such as ALFWorld and WebShop, suggesting that verifiable process supervision can teach reasoning skills beyond the synthetic training games.
The main caveat is oracle quality: dense feedback helps only when the verifier is sufficiently reliable. Weak or misleading oracles can degrade both in-domain learning and out-of-domain generalization.
VPR is an early fork of the ROLL framework. Please follow the official ROLL setup guide for environment and backend compatibility.
Use the provided agentic RL scripts to run VPR-style training in the three environments.
```bash
# Tic-Tac-Toe
bash examples/tictactoe/run_agentic_pipeline_tictactoe.sh

# Sudoku
bash examples/sudoku/run_agentic_pipeline_sudoku.sh

# Minesweeper
bash examples/minesweeper/run_agentic_pipeline_minesweeper.sh
```

Training progress can be monitored with TensorBoard, for example:
```bash
tensorboard --logdir=runs/tictactoe_pipeline/
```

After training, convert the saved Megatron checkpoint into a Hugging Face-compatible model directory for evaluation or release.
First, edit the paths in `model_convert.sh`:

```bash
RUN_PATH=/path/to/runs/tictactoe_pipeline/YYYYMMDD-HHMMSS
CKPT=checkpoint-100
OUTPUT_PATH=/path/to/output/hf_model
```

Then run:

```bash
bash model_convert.sh
```

The script gathers the tensor-parallel checkpoint shards, copies the base model config, runs `mcore_adapter/tools/convert.py`, and writes the converted checkpoint to `OUTPUT_PATH`.

To evaluate the converted model in the three environments, use the agentic rollout scripts:
```bash
# Tic-Tac-Toe
bash examples/tictactoe/run_agentic_rollout_tictactoe.sh

# Sudoku
bash examples/sudoku/run_agentic_rollout_sudoku.sh

# Minesweeper
bash examples/minesweeper/run_agentic_rollout_minesweeper.sh
```

If you find this work useful, please cite:
```bibtex
@misc{yuan2026verifiable,
  title={Verifiable Process Rewards for Agentic Reasoning},
  author={Huining Yuan and Zelai Xu and Huaijie Wang and Xiangmin Yi and Jiaxuan Gao and Xiao-Ping Zhang and Yu Wang and Chao Yu and Yi Wu},
  year={2026},
  eprint={2605.10325},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.10325},
}
```

