
VGPO: Visually-Guided Policy Optimization for Multimodal Reasoning


🔥 News

  • 🔥 [2026.04]: 🎉🎉🎉 Our paper has been accepted by ACL 2026 (Main Conference).

📖 Overview of VGPO

Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to signal dilution — generic text tokens receive the same reinforcement as critical visually-grounded reasoning steps. Meanwhile, temporal visual forgetting causes attention to visual inputs to progressively decay as reasoning chains extend.
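To make the signal-dilution point concrete, here is a minimal sketch (not the repository's code) of how a GRPO-style RLVR baseline computes one scalar advantage per rollout and broadcasts it uniformly to every token, so generic connectives and visually-grounded steps are reinforced identically:

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style group-normalized advantages: one scalar per rollout."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Two rollouts in a group: every token of a rollout receives the same
# advantage, regardless of whether it is a filler word or a visually-grounded
# reasoning step -- this is the dilution VGPO targets.
adv = grpo_advantages([1.0, 0.0])           # verifiable rewards for 2 rollouts
token_adv = [np.full(5, a) for a in adv]    # broadcast to 5 tokens each
```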

VGPO addresses these issues through three key mechanisms:

  • Visual Attention Compensation (VAC): Uses the inherent hidden-state similarity between generated tokens and image tokens as a Visual Focus Score to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
  • Intra-Trajectory Re-weighting: At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually-grounded tokens.
  • Inter-Trajectory Re-weighting: At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.
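The two re-weighting mechanisms can be sketched as follows. This is an illustrative approximation, not the paper's exact Eq.7–11: the similarity measure, normalization, and accumulation statistic (here cosine similarity, mean-normalization, and a per-rollout mean) are assumptions, and all function names are hypothetical.

```python
import numpy as np

def visual_focus_scores(token_h, image_h):
    """Hypothetical Visual Focus Score: each generated token scored by its
    maximum cosine similarity to the image-token hidden states."""
    t = token_h / np.linalg.norm(token_h, axis=-1, keepdims=True)
    v = image_h / np.linalg.norm(image_h, axis=-1, keepdims=True)
    return (t @ v.T).max(axis=-1)                 # shape: (num_tokens,)

def intra_reweight(adv, focus, eps=1e-8):
    """Token level: scale each token's advantage by its mean-normalized
    focus score, amplifying visually-grounded tokens."""
    return adv * (focus / (focus.mean() + eps))

def inter_weights(group_focus, eps=1e-8):
    """Trajectory level: weight rollouts by accumulated visual focus
    relative to the group mean, favoring sustained visual grounding."""
    phi = np.array([f.mean() for f in group_focus])
    return phi / (phi.mean() + eps)

rng = np.random.default_rng(0)
tok_h, img_h = rng.normal(size=(6, 16)), rng.normal(size=(10, 16))
psi = visual_focus_scores(tok_h, img_h)
weighted = intra_reweight(np.full(6, 0.5), psi)   # uniform advantage of 0.5
```

The mean-normalization keeps the average token advantage unchanged, so re-weighting redistributes credit within a trajectory rather than changing its overall reward scale.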

VGPO Pilot Experiment

Analysis of the inference behavior of multimodal reasoning trajectories (based on Qwen2.5-VL-7B).

VGPO Framework

Overview of the Visually-Guided Policy Optimization framework.

🔧 Getting Started

conda create -n vgpo python=3.10 -y
conda activate vgpo

git clone https://github.com/wzb-bupt/VGPO.git
cd VGPO

pip install -e .

🚀 Training

bash train_vgpo_7B.sh

The training script exposes the core VGPO hyperparameters listed below; all other hyperparameters use sensible defaults in the code. See verl/trainer/config.py for the full configuration.

# ── Dual-Grained Advantage Re-weighting ──
use_intra_trajectory_reweighting=true   # Eq.7-8: token-level reweighting by visual focus score ψ_{i,t}
use_inter_trajectory_reweighting=true   # Eq.9-11: trajectory-level reweighting by visual accumulation ϕ_i

# ── Visual Attention Compensation ──
use_visual_compensation=true            # Eq.5: progressive compensation β·(t/T)
use_gated_visual_compensation=true      # Eq.6: visual gate G_i(ρ) for late-stage tokens
gated_visual_compensation_start_ratio=0.5  # γ in Eq.6: tail ratio threshold
visual_compensation_strength=0.3        # β in Eq.5: compensation intensity
visual_attention_threshold=0.2          # κ in Eq.6: top-κ% gate activation threshold
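How these flags interact can be sketched in a few lines. This is a hypothetical reading of the config, not the repository's implementation: it applies the progressive schedule β·(t/T) only to late-stage tokens (position ratio ≥ γ) whose focus score falls in the top-κ fraction, with the default values above.

```python
import numpy as np

def visual_compensation(focus, beta=0.3, gamma=0.5, kappa=0.2):
    """Hypothetical sketch of gated progressive visual compensation:
    beta * (t/T) is applied only to late-stage tokens (t/T >= gamma)
    whose visual focus score is in the top-kappa fraction."""
    T = len(focus)
    t = np.arange(1, T + 1)
    schedule = beta * (t / T)                  # grows with reasoning depth,
                                               # countering visual forgetting
    tail = (t / T) >= gamma                    # late-stage region of the chain
    thresh = np.quantile(focus, 1.0 - kappa)   # top-kappa activation threshold
    gate = tail & (focus >= thresh)
    return np.where(gate, schedule, 0.0)
```

With the defaults, early tokens and low-focus tokens receive no compensation, while strongly visually-grounded tokens near the end of the chain receive the largest boost.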

📊 Evaluation

We follow the evaluation scripts of Look-Back. All results are reported as average accuracy with an inference temperature of 0.0.

Supported Training Datasets

| Split | Dataset | Link |
| --- | --- | --- |
| Train | ViRL39K | PAPOGalaxy/PAPO_ViRL39K_train |
| Val | MMK12 | PAPOGalaxy/PAPO_MMK12_test |

Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
| --- | --- |
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperClevr Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |

✍️ Citation

If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work 📝:

@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning}, 
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}

Acknowledgements

Our codebase is built upon EasyR1, VPPO-RL, PAPO, and Look-Back. We thank the authors for their excellent work.
