
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

  🌐 Website   |   📑 Paper   |   🤗 ToolCUA-8B   |   📄 Cases

[Figure: ToolCUA overview]

📢 Updates

  • 2026-05-12: 🎉 Thrilled to release ToolCUA with the ToolCUA-8B model, evaluation code, and OSWorld-MCP benchmark results.

TODO

  • ToolCUA Model Released
  • Data Pipeline: GUI-Tool interleaved trajectory scaling pipeline
  • Training Infra: Asynchronous training-rollout decoupled agentic RL in sandbox

🌟 Introduction

ToolCUA is an end-to-end Computer Use Agent (CUA) designed for optimal GUI-Tool path orchestration. Modern CUAs can act through both atomic GUI actions, such as clicking, typing, and scrolling, and high-level tool calls, such as API-based file or application operations. However, simply exposing a model to both action spaces does not make it a reliable desktop agent: the model must learn when to continue with GUI actions, when to invoke tools, and when to switch back.

ToolCUA addresses this challenge with a staged training pipeline. We first scale interleaved GUI-Tool trajectories from existing GUI-only data through trajectory-aware tool synthesis. Then, we use Tool-Bootstrapped GUI RFT to acquire tool-calling knowledge and calibrate critical switching decisions. Finally, we optimize the agent with Online Agentic RL in a GUI-Tool environment using a Tool-Efficient Path Reward, encouraging appropriate tool use and shorter execution paths.
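
To make the reward idea concrete, here is a purely illustrative sketch of a tool-efficient path reward: task success dominates, every extra step is penalized, and tool calls made on steps where a tool is actually beneficial earn a small bonus. The function shape and the weights step_penalty and tool_bonus are assumptions for illustration, not the paper's exact formulation.

# Purely illustrative sketch of a tool-efficient path reward (NOT the paper's
# exact formulation): reward task success, penalize long execution paths, and
# reward tool calls taken on steps where a tool is actually beneficial.
def path_reward(success: bool,
                num_steps: int,
                num_beneficial_tool_calls: int,
                step_penalty: float = 0.01,           # assumed weight
                tool_bonus: float = 0.05) -> float:   # assumed weight
    reward = 1.0 if success else 0.0                   # task completion dominates
    reward -= step_penalty * num_steps                 # shorter paths are preferred
    reward += tool_bonus * num_beneficial_tool_calls   # encourage useful tool calls
    return reward

# Example: a successful 15-step trajectory that made 2 beneficial tool calls.
print(path_reward(success=True, num_steps=15, num_beneficial_tool_calls=2))  # 0.95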

🔍 Path Selection Confusion Under Hybrid Actions

Giving agents both GUI actions and tool calls does not automatically make them better. In our diagnostic study, hybrid actions introduce a clear path selection confusion problem: some models stay GUI-centric and almost never invoke tools, while stronger models may overuse tools, shorten trajectories, and still lose task success. The bottleneck is therefore not tool availability itself, but whether the agent can choose the right GUI-Tool execution path at each state.

[Figure: GUI-Tool path selection confusion]

🧠 Method Overview

ToolCUA learns GUI-Tool orchestration through three tightly connected stages: (1) scalable interleaved GUI-Tool trajectory construction from existing GUI corpora, (2) Tool-Bootstrapped GUI RFT for tool knowledge and local switching calibration, and (3) Online Agentic RL with Tool-Efficient Path Reward for trajectory-level optimization.

[Figure: ToolCUA method overview]

Installation & Download

First, install the required dependencies (including transformers):

pip install -r requirement.txt

Download the model weights from Hugging Face:

from huggingface_hub import snapshot_download

# Download the full ToolCUA-8B checkpoint into the local ToolCUA-8B directory.
snapshot_download(
    repo_id="mPLUG/ToolCUA-8B",
    local_dir="ToolCUA-8B",
    local_dir_use_symlinks=False,  # store real files instead of symlinks to the cache
)
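
After the download finishes, you can run a quick local smoke test. The snippet below is a minimal sketch that loads the checkpoint through the generic transformers auto classes; the exact class a Qwen3-VL-based checkpoint resolves to may differ across transformers versions, so treat this as an assumption rather than the project's official loading code.

from transformers import AutoProcessor, AutoModelForImageTextToText

# Minimal loading sketch (assumption: the checkpoint resolves through the
# generic image-text-to-text auto classes). "ToolCUA-8B" is the local_dir
# used in the snapshot_download call above.
model_path = "ToolCUA-8B"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",   # pick an appropriate dtype automatically
    device_map="auto",    # place the weights on available GPUs
    trust_remote_code=True,
)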

🚀 vLLM Serve

We recommend using vLLM for production deployment. This requires vllm>=0.12.0 and serving with --trust-remote-code.

# 8B (single GPU)

MAX_IMAGE=${MAX_IMAGE:-5}
IMAGE_LIMIT_ARGS='{"image": '"$MAX_IMAGE"'}'

PIXEL_ARGS='{"size": {"longest_edge": 3072000, "shortest_edge": 65536}}' # 3000*32*32

vllm serve xPLUG/ToolCUA-8B \
    --trust-remote-code \
    --max-model-len 32768 \
    --mm-processor-kwargs "$PIXEL_ARGS" \
    --limit-mm-per-prompt "$IMAGE_LIMIT_ARGS" \
    --tensor-parallel-size 1 \
    --allowed-local-media-path '/' \
    --port 4243 \
    --gpu-memory-utilization 0.85 \
    --mm-processor-cache-gb 0 \
    --no-enable-prefix-caching \
    --enforce-eager \
    --max-logprobs 50

ToolCUA-8B is based on Qwen3-VL-8B-Instruct, so you can follow that model's serving setup for any details not covered here.
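
Once the server is up, it exposes an OpenAI-compatible API on the chosen port. The snippet below is a minimal sketch of sending one screenshot plus an instruction with the openai Python client; the screenshot path, the instruction, and the plain user-message format are placeholders, not ToolCUA's actual agent prompt.

import base64
from openai import OpenAI

# Minimal sketch of querying the vLLM OpenAI-compatible endpoint started above.
# The model name must match the one passed to `vllm serve`; the screenshot path
# and the instruction are placeholders, not ToolCUA's actual agent prompt.
client = OpenAI(base_url="http://localhost:4243/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="xPLUG/ToolCUA-8B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "text",
             "text": "Open the Downloads folder and rename report.pdf to final.pdf."},
        ],
    }],
    temperature=0.0,
    max_tokens=512,
)
print(response.choices[0].message.content)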

🖥️ Evaluation

You should first set up the complete evaluation environment of OSWorld (for pure GUI settings) or OSWorld-MCP (for GUI-Tool settings).

There are some key files that you need to place at the corresponding locations in the OSWorld / OSWorld-MCP evaluation directory:

  • eval_data with tool_beneficial label: ./eval/evaluaton_data
  • desktop_env: ./eval/desktop_env.py
  • agents_implementation: ./eval/qwen3vl_toolcua_aget_mcp.py
  • eval_main: ./eval/run_multienv_qwen3vl_toolcua_mcp_eval.py

Commands for running average@k evaluation and computing the results:

bash ./eval/passk_run_new.sh

python ./eval/pass_k_results.py --root_path ${RESULT_DIR} --trials 0 1 2
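
If you want to aggregate the trials yourself, average@k is simply the mean success rate over the k independent trials. The sketch below assumes a hypothetical result layout of ${RESULT_DIR}/trial_<i>/.../result.json with a boolean "success" field; ./eval/pass_k_results.py remains the authoritative script.

import json
from pathlib import Path

# Illustrative average@k aggregation (sketch). The assumed layout
# <root>/trial_<i>/**/result.json with a boolean "success" field is
# hypothetical; use ./eval/pass_k_results.py for the official numbers.
def average_at_k(root_path: str, trials: list[int]) -> float:
    per_trial_rates = []
    for t in trials:
        results = list(Path(root_path, f"trial_{t}").rglob("result.json"))
        successes = sum(json.loads(p.read_text()).get("success", False) for p in results)
        per_trial_rates.append(successes / max(len(results), 1))
    # average@k = mean success rate across the k independent trials
    return sum(per_trial_rates) / len(per_trial_rates)

print(average_at_k("results", trials=[0, 1, 2]))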

📊 Performance

Results are reported on the feasible tasks of OSWorld-MCP. We list the Overall metrics from the main paper table: Accuracy, Tool Invocation Rate (TIR), and Average Completion Steps (ACS).

Agent Model             Accuracy ↑   TIR ↑   ACS ↓
Gemini-2.5-Pro               20.22   17.22   29.97
OpenAI o3                    20.62   18.22   31.87
Seed1.5-VL                   34.53   26.83   20.69
Claude-4-Sonnet              43.54   35.74   19.76
Gemini-3.1-Pro               41.14   34.23   25.40
Claude-4-5-Sonnet            48.35   40.24   19.07
Qwen3-VL-235B-A22B           38.14   28.63   17.95
Qwen3.5-397B-A17B            40.84   11.71   21.86
UI-Tars-1.5-7B               12.31    4.50   37.11
EvoCUA-8B                    35.74   13.81   26.77
EvoCUA-32B                   40.54   22.52   26.16
GUI-Owl-1.5-8B               43.84   36.04   21.19
GUI-Owl-1.5-32B              48.05   41.14   24.19
Qwen3-VL-8B-Instruct         28.23    8.41   19.34
ToolCUA-8B                   46.85   24.32   14.93

Compared with the Qwen3-VL-8B-Instruct baseline, ToolCUA-8B improves Accuracy by +18.62, improves TIR by +15.91, and reduces ACS by 4.41 steps.

[Figure: ToolCUA results across applications]

Acknowledgements

We thank Zhaoqing Zhu, Junyang Wang, Jitong Liao, and Haowei Liu for their support with the training infrastructure, sandbox construction, and evaluation.

Our work is motivated by OpenCUA, ScaleCUA, AutoGLM, CUA-skill, EvoCUA, and the Mobile-Agent series. We thank the authors for their wonderful work.

Citation

If you use ToolCUA in your research or project, please cite our work:

@article{hu2026toolcua,
  title={ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents},
  author={Hu, Xuhao and Zhang, Xi and Xu, Haiyang and Qiao, Kyle and Yang, Jingyi and Huang, Xuanjing and Shao, Jing and Yan, Ming and Ye, Jieping},
  journal={arXiv preprint arXiv:2605.12481},
  year={2026}
}
