GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents


Qianhui Wu*1  Kanzhi Cheng*2  Rui Yang*3  Chaoyun Zhang1  Jianwei Yang1  Huiqiang Jiang1
Jian Mu2  Baolin Peng1  Bo Qiao1  Reuben Tan1  Si Qin1  Lars Liden1
Qingwei Lin1  Huan Zhang3  Tong Zhang3  Jianbing Zhang2  Dongmei Zhang1  Jianfeng Gao1

1 Microsoft Research  2 Nanjing University  3 University of Illinois Urbana-Champaign
* Equal Contribution     Leadership

Figure 1. Left: Model performance vs. training data scale on the ScreenSpot-Pro benchmark. Higher and further left is better; larger points indicate models with more parameters. For a fair comparison, only GUI-Actor models built on Qwen2-VL are shown. With Qwen2.5-VL as the backbone, GUI-Actor-3B/7B reaches scores of up to 42.2/44.6 (without the Verifier). Right: Illustration of action attention. GUI-Actor grounds target elements by attending to the most relevant visual regions.

✨ Highlights

🤔 We identify several limitations of coordinate-generation-based methods for GUI grounding (i.e., outputting screen positions as text tokens, x=…, y=…), including (1) weak spatial-semantic alignment, (2) ambiguous supervision signals, and (3) a granularity mismatch between the vision and action spaces.

💡 We rethink how humans interact with digital interfaces: humans do NOT calculate precise screen coordinates before acting; they perceive the target element and interact with it directly.

🚀 We propose GUI-Actor, a VLM enhanced with an action head, to mitigate the above limitations. The attention-based action head not only enables GUI-Actor to perform coordinate-free GUI grounding that more closely aligns with human behavior, but can also generate multiple candidate regions in a single forward pass, offering flexibility to downstream modules such as search strategies.
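As a rough illustration of the idea (not the official implementation), the sketch below shows how an attention-based head can turn the hidden state of a dedicated action token into a distribution over visual patch tokens; the names ActionHead, actor_state, and patch_states are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Hedged sketch: attention of an action token over visual patch tokens."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)  # query from the action token
        self.k_proj = nn.Linear(hidden_dim, hidden_dim)  # keys from visual patch tokens

    def forward(self, actor_state: torch.Tensor, patch_states: torch.Tensor) -> torch.Tensor:
        # actor_state: (B, D); patch_states: (B, N, D)
        q = self.q_proj(actor_state).unsqueeze(1)           # (B, 1, D)
        k = self.k_proj(patch_states)                       # (B, N, D)
        scores = (q * k).sum(dim=-1) / k.shape[-1] ** 0.5   # scaled dot product -> (B, N)
        return scores.softmax(dim=-1)                       # attention map over patches

head = ActionHead(hidden_dim=1536)
attn = head(torch.randn(1, 1536), torch.randn(1, 1024, 1536))  # e.g., a 32x32 patch grid
candidates = attn.topk(k=3, dim=-1).indices  # several candidate regions, one forward pass
```

Mapping a chosen patch index back to its grid cell yields an action region directly, with no coordinate tokens to decode.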

We design a grounding verifier to evaluate the candidate regions proposed from the action attention map and select the most plausible one. We show that this verifier can be easily integrated with other grounding methods for a further performance boost.
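To make the verifier's role concrete, here is a minimal, hypothetical selection loop; score is a stand-in for a call to the verifier model (itself not shown), and all names are illustrative.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]  # normalized (x, y) screen position of a candidate region

def select_action(candidates: Sequence[Point], score: Callable[[Point], float]) -> Point:
    """Return the candidate the verifier scores as most plausible."""
    return max(candidates, key=score)

# Usage: candidates come from the top-k patches of the action attention map,
# and `score` would wrap a verifier call on (screenshot, instruction, candidate).
best = select_action([(0.12, 0.40), (0.55, 0.41), (0.90, 0.05)],
                     score=lambda p: 1.0 - abs(p[0] - 0.5))  # dummy scorer for the demo
```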

🎯 GUI-Actor achieves state-of-the-art performance on multiple GUI action grounding benchmarks with the same Qwen2-VL backbone, demonstrating its effectiveness and generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones.

📑 Todos

We will be releasing the following:

  • Model training and evaluation based on Qwen2-VL
  • Model checkpoint
  • Code for grounding verifier
  • Support for Qwen2.5-VL
  • Processed training data
  • Demo

📊 Main Results

Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Huggingface.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|---|---|---|---|---|
| 72B models: | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| 7B models: | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| 2B models: | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|---|---|---|---|
| 7B models: | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| 3B models: | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |

⛑️ Installation

  1. Clone this repo to your local machine:

```bash
git clone https://github.com/microsoft/GUI-Actor.git
cd GUI-Actor
```

  2. Create a conda environment and install the dependencies:

```bash
conda create -n gui_actor python=3.10
conda activate gui_actor
conda install pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
pip install -e .
```

💽 Data Preparation

  1. Download the processed data from here (coming soon).
  2. Modify the paths in the data_config.yaml file to point to the downloaded data (a hypothetical layout is sketched below).
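The exact schema of data_config.yaml is defined by this repo; purely as a hypothetical illustration (field names assumed, not taken from the repo), the file might look like:

```yaml
# Hypothetical layout for illustration only; follow the schema of the
# data_config.yaml shipped with this repo.
datasets:
  - name: grounding_train                       # placeholder dataset name
    annotation_path: /path/to/downloaded/annotations.json
    image_dir: /path/to/downloaded/images
```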

🏗️ Model Training

  1. Warmup stage:

```bash
bash scripts/warmup.sh
```

  2. Full-parameter training stage:

```bash
bash scripts/train.sh
```

🏁 Evaluation on GUI Grounding Benchmarks

For evaluation on ScreenSpot and ScreenSpot-v2, you can directly run the evaluation scripts, e.g., python eval/screenSpot.py or python eval/screenSpot_v2.py.

For evaluation on ScreenSpot-Pro, you first need to download the data from here, then run the following command:

```bash
python eval/screenSpot_pro.py --save_path <path_to_save_results> --data_path <path_to_data_dir>
```
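For reference, ScreenSpot-style grounding benchmarks generally count a prediction as correct when the predicted click point falls inside the ground-truth element's bounding box. Below is a minimal sketch of that metric, not this repo's exact evaluation code; all names are illustrative.

```python
# Minimal sketch of the standard point-in-box grounding metric.
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]              # predicted click point (x, y)

def point_in_box(point: Point, box: Box) -> bool:
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(preds: Iterable[Point], boxes: Iterable[Box]) -> float:
    pairs = list(zip(preds, boxes))
    return sum(point_in_box(p, b) for p, b in pairs) / len(pairs)

print(grounding_accuracy([(0.5, 0.5)], [(0.4, 0.4, 0.6, 0.6)]))  # 1.0
```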

👍 Acknowledgements

This project is built upon prior open-source projects. Thanks for their great work!

We also thank the authors of related projects for their insightful work, as well as for providing datasets and engaging in valuable discussions.

📝 Citation

If you find this work useful in your research, please consider citing:

```bibtex
@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={},
}
```
