
VGPO: Visually-Guided Policy Optimization for Multimodal Reasoning


🔥 News

  • 🔥 [2026.04]: 🎉🎉🎉 Our paper has been accepted by ACL 2026 (Main Conference).

📖 Overview of VGPO

Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to signal dilution — generic text tokens receive the same reinforcement as critical visually-grounded reasoning steps. Meanwhile, temporal visual forgetting causes attention to visual inputs to progressively decay as reasoning chains extend.
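To make the signal-dilution point concrete, here is a minimal sketch (not the repository's code) of how a GRPO-style RLVR baseline computes one scalar advantage per rollout and broadcasts it uniformly to every token, so generic connectives and visually-grounded steps are reinforced identically:

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style group-normalized advantages: one scalar per rollout."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Two rollouts in a group: every token of a rollout receives the same
# advantage, regardless of whether it is a filler word or a visually-grounded
# reasoning step -- this is the dilution VGPO targets.
adv = grpo_advantages([1.0, 0.0])           # verifiable rewards for 2 rollouts
token_adv = [np.full(5, a) for a in adv]    # broadcast to 5 tokens each
```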

VGPO addresses these issues through three key mechanisms:

  • Visual Attention Compensation (VAC): Uses the inherent hidden-state similarity between generated tokens and image tokens as a Visual Focus Score to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
  • Intra-Trajectory Re-weighting: At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually-grounded tokens.
  • Inter-Trajectory Re-weighting: At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.
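The two re-weighting mechanisms can be sketched as follows. This is an illustrative approximation, not the paper's exact Eq.7–11: the similarity measure, normalization, and accumulation statistic (here cosine similarity, mean-normalization, and a per-rollout mean) are assumptions, and all function names are hypothetical.

```python
import numpy as np

def visual_focus_scores(token_h, image_h):
    """Hypothetical Visual Focus Score: each generated token scored by its
    maximum cosine similarity to the image-token hidden states."""
    t = token_h / np.linalg.norm(token_h, axis=-1, keepdims=True)
    v = image_h / np.linalg.norm(image_h, axis=-1, keepdims=True)
    return (t @ v.T).max(axis=-1)                 # shape: (num_tokens,)

def intra_reweight(adv, focus, eps=1e-8):
    """Token level: scale each token's advantage by its mean-normalized
    focus score, amplifying visually-grounded tokens."""
    return adv * (focus / (focus.mean() + eps))

def inter_weights(group_focus, eps=1e-8):
    """Trajectory level: weight rollouts by accumulated visual focus
    relative to the group mean, favoring sustained visual grounding."""
    phi = np.array([f.mean() for f in group_focus])
    return phi / (phi.mean() + eps)

rng = np.random.default_rng(0)
tok_h, img_h = rng.normal(size=(6, 16)), rng.normal(size=(10, 16))
psi = visual_focus_scores(tok_h, img_h)
weighted = intra_reweight(np.full(6, 0.5), psi)   # uniform advantage of 0.5
```

The mean-normalization keeps the average token advantage unchanged, so re-weighting redistributes credit within a trajectory rather than changing its overall reward scale.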

VGPO Pilot Experiment

Analysis of the inference behavior of multimodal reasoning trajectories (based on Qwen2.5-VL-7B).

VGPO Framework

Overview of the Visually-Guided Policy Optimization framework.

🔧 Getting Started

conda create -n vgpo python=3.10 -y
conda activate vgpo

git clone https://github.com/wzb-bupt/VGPO.git
cd VGPO

pip install -e .

🚀 Training

bash train_vgpo_7B.sh

The training script exposes the core VGPO hyperparameters listed below; all other hyperparameters use sensible defaults in the code. See verl/trainer/config.py for the full configuration.

# ── Dual-Grained Advantage Re-weighting ──
use_intra_trajectory_reweighting=true   # Eq.7-8: token-level reweighting by visual focus score ψ_{i,t}
use_inter_trajectory_reweighting=true   # Eq.9-11: trajectory-level reweighting by visual accumulation ϕ_i

# ── Visual Attention Compensation ──
use_visual_compensation=true            # Eq.5: progressive compensation β·(t/T)
use_gated_visual_compensation=true      # Eq.6: visual gate G_i(ρ) for late-stage tokens
gated_visual_compensation_start_ratio=0.5  # γ in Eq.6: tail ratio threshold
visual_compensation_strength=0.3        # β in Eq.5: compensation intensity
visual_attention_threshold=0.2          # κ in Eq.6: top-κ% gate activation threshold
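How these flags interact can be sketched in a few lines. This is a hypothetical reading of the config, not the repository's implementation: it applies the progressive schedule β·(t/T) only to late-stage tokens (position ratio ≥ γ) whose focus score falls in the top-κ fraction, with the default values above.

```python
import numpy as np

def visual_compensation(focus, beta=0.3, gamma=0.5, kappa=0.2):
    """Hypothetical sketch of gated progressive visual compensation:
    beta * (t/T) is applied only to late-stage tokens (t/T >= gamma)
    whose visual focus score is in the top-kappa fraction."""
    T = len(focus)
    t = np.arange(1, T + 1)
    schedule = beta * (t / T)                  # grows with reasoning depth,
                                               # countering visual forgetting
    tail = (t / T) >= gamma                    # late-stage region of the chain
    thresh = np.quantile(focus, 1.0 - kappa)   # top-kappa activation threshold
    gate = tail & (focus >= thresh)
    return np.where(gate, schedule, 0.0)
```

With the defaults, early tokens and low-focus tokens receive no compensation, while strongly visually-grounded tokens near the end of the chain receive the largest boost.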

📊 Evaluation

We follow the evaluation scripts of Look-Back. All results are reported as average accuracy with an inference temperature of 0.0.

Supported Training Datasets

| Split | Dataset | Link |
| --- | --- | --- |
| Train | ViRL39K | PAPOGalaxy/PAPO_ViRL39K_train |
| Val | MMK12 | PAPOGalaxy/PAPO_MMK12_test |

Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
| --- | --- |
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperClevr Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |

✍️ Citation

If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work 📝:

@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning}, 
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}

Acknowledgements

Our codebase is built upon EasyR1, VPPO-RL, PAPO, and Look-Back. We thank the authors for their excellent work.
