📄 Paper | 🤗 CodeV-SFT | 🤗 CodeV-RL | 📊 CodeV-RL-Data
CodeV is a code-based visual agent that achieves faithful visual reasoning by generating and executing Python code with visual tools. Unlike traditional vision-language models that may achieve high accuracy while exhibiting unfaithful reasoning (invoking tools on irrelevant regions or ignoring outputs), CodeV ensures that intermediate visual tool outputs actually contain queried evidence.
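To make this concrete, here is a minimal, hypothetical sketch (not taken from the CodeV codebase) of the kind of Python a code-based visual agent might emit: it crops a candidate region so the resulting patch can be inspected as evidence in the next reasoning step.

```python
# Hypothetical illustration of the kind of code a code-based visual agent might emit:
# crop a candidate region, then return it for inspection in the next reasoning step.
from PIL import Image

def crop_region(image_path: str, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop a region (left, upper, right, lower) from the input image."""
    image = Image.open(image_path)
    return image.crop(box)

# The agent would choose the box based on its reasoning, e.g. around a suspected object.
patch = crop_region("input.jpg", (120, 80, 360, 240))
patch.save("patch.jpg")  # the saved patch becomes a new visual observation
```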
This repository provides the Tool-Aware Policy Optimization (TAPO) training code and evaluation framework for CodeV.
CodeV framework: Code-based visual reasoning with executable tools
- 🎯 Faithful Visual Reasoning: Ensures intermediate tool outputs contain actual evidence, not just correct final answers
- 🔧 Tool-Aware Policy Optimization (TAPO): Novel RL framework with dense rewards based on visual tool inputs/outputs
- 📝 Code-Based Visual Agent: Represents visual tools as executable Python code for verifiable reasoning
- 🚀 Strong Performance: Competitive or superior accuracy with substantially higher faithful tool-use rates
- TAPO training framework augmenting GRPO with dense visual tool rewards (a toy sketch follows this list)
- Safe and powerful Python sandbox for executing visual tool code
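As a rough illustration of the dense-reward idea, the sketch below blends an answer-correctness term with a term scored on intermediate tool calls. This is an assumption-laden toy example; the exact reward terms and weights used by TAPO are defined in the paper and in the code under rl/.

```python
# Illustrative sketch only: combine an outcome reward with a dense tool reward,
# following TAPO's idea of scoring intermediate tool inputs/outputs. The weight
# and scoring functions here are assumptions, not the paper's definitions.
def tapo_style_reward(answer_correct: bool,
                      tool_scores: list[float],
                      tool_weight: float = 0.5) -> float:
    """Blend a final-answer reward with a dense reward over tool calls."""
    outcome = 1.0 if answer_correct else 0.0
    # Each tool call is scored (e.g., by a judge) on whether its output actually
    # contains the queried evidence; average the scores over the trajectory.
    tool_term = sum(tool_scores) / len(tool_scores) if tool_scores else 0.0
    return outcome + tool_weight * tool_term
```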
See rl/README.md for detailed capabilities and training instructions.
- Comprehensive evaluation framework based on VLMEvalKit
- Evaluation protocols with tool use, including a Python sandbox and a crop API
For supervised fine-tuning (SFT), please use LLaMA-Factory or ms-swift, since VeRL does not currently support multimodal SFT. For the SFT training data, please refer to Thyme.
Performance on visual search and reasoning benchmarks
CodeV achieves:
- ✅ Competitive or superior accuracy on visual search and reasoning benchmarks (VStarBench, HRBench, MathVista)
- ✅ Substantially increased faithful tool-use rates compared to baselines
Faithful tool-use rates: CodeV vs. baselines (Thyme, Pixel-Reasoner, etc.)
Key Finding: Explicitly supervising intermediate tool behavior (via TAPO) is crucial for building trustworthy agentic visual reasoning systems.
TAPO training dynamics: response length/tool calls (left) and reward progression (right)
See paper for complete results and analysis
# 1. Install package
cd rl && pip install -e . --no-deps
# 2. Install required dependencies
pip install -r requirements_codev.txt
# 3. Prepare data
hf download RenlyH/CodeV-RL-Data --local-dir data/
python scripts/extract_images_from_parquet.py \
--parquet_path data/codev_rl_data.parquet \
--image_save_path data/images
# 4. Setup LLM judge for reward model (on separate node with REWARD_NODE_IP)
vllm serve Qwen/Qwen2.5-VL-32B-Instruct --port 8000
# 5. Configure judge endpoint
export LLM_AS_A_JUDGE_BASE="http://REWARD_NODE_IP:8000/v1"
# 6. Login to Weights & Biases
wandb login
# 7. Start training
bash examples/agent/codev.sh

See rl/README.md for hardware requirements and detailed instructions.
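The judge in steps 4-5 is an OpenAI-compatible vLLM server, so it can be smoke-tested with a standard client before launching training. A minimal check (the prompt and parameters below are placeholders, not the judge prompt used during TAPO training):

```python
# Quick sanity check of the LLM-as-a-judge endpoint; the prompt is a placeholder.
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["LLM_AS_A_JUDGE_BASE"], api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    messages=[{"role": "user", "content": "Reply with OK if you are reachable."}],
    max_tokens=8,
)
print(response.choices[0].message.content)
```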
Note: Install VLMEvalKit in a separate conda environment. Avoid sharing the same environment with RL training.
# 1. Create conda environment
conda create -n codev-eval python=3.10 -y
conda activate codev-eval
# 2. Install VLMEvalKit
cd VLMEvalKit && pip install -e .
# 3. Serve the model with vLLM (online serving recommended)
vllm serve RenlyH/CodeV-RL --port 8000
# 4. Run evaluation on the served model
python run.py --config codev_eval.yaml

Supported Benchmarks: VStarBench, HRBench, MathVista, and many more via VLMEvalKit integration.
Supported eval protocols: Python sandbox, crop API.
Recommendation: Use vLLM online serving (v0.10.0+) for testing models. This provides better performance and easier integration with the evaluation framework.
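Because the model is served through vLLM's OpenAI-compatible API, it can also be queried directly outside VLMEvalKit, which is handy for spot-checking outputs before a full run. A minimal sketch (the image URL and question are placeholders):

```python
# Spot-check the served CodeV-RL model outside the evaluation harness.
# The image URL and question below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="RenlyH/CodeV-RL",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "What is written on the small sign in the background?"},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```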
CodeV/
├── rl/ # TAPO training code
│ ├── verl/ # RL framework
│ ├── scripts/ # Training scripts
│ ├── examples/ # Example configurations
│ └── README.md # Detailed RL capabilities
├── VLMEvalKit/ # Evaluation framework
│ ├── vlmeval/ # VLMEvalKit integration
│ ├── scripts/ # Evaluation helper scripts
│ └── run.py # Main evaluation script
├── assets/ # Figures and visualizations
└── CLAUDE.md # Guidance for Claude Code
If you find CodeV useful for your research, please cite:
@article{hou2025codev,
title={CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization},
author={Hou, Xinhai and Xu, Shaoyuan and Biyani, Manan and Li, Mayan and Liu, Jia and Hollon, Todd C and Wang, Bryan},
journal={arXiv preprint arXiv:2511.19661},
year={2025}
}

- Pixel-Reasoner: Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
- OpenThinkImg: OPENTHINKIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
- DeepEyes: DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning
- REVPT: Reinforced Visual Perception with Tools
- Thyme: Thyme: Think Beyond Images
This project is licensed under the Apache 2.0 License.
For questions and issues, please open an issue on GitHub or contact xinhaih@umich.edu.