We propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks.
Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on AndroidControl. Furthermore, UI-R1-3B delivers performance competitive with larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples.
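For intuition, the sketch below shows the kind of rule-based reward such training builds on: a format reward for the thinking/answer template plus action-type and coordinate rewards. The template, weights, and parsing here are illustrative assumptions, not the repository's exact implementation.

```python
# Illustrative rule-based reward in the spirit of UI-R1; the tag template,
# weights, and parsing are assumptions, not the repo's exact code.
import re

def rule_based_reward(completion: str, gt_action: str, gt_bbox: list[float]) -> float:
    reward = 0.0
    # Format reward: the rollout must follow the <think>...</think><answer>...</answer> template.
    if re.fullmatch(r"<think>.*?</think>\s*<answer>.*?</answer>\s*", completion, re.DOTALL):
        reward += 0.5
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return reward
    answer = match.group(1)
    # Action-type reward: the predicted action type must match the ground truth.
    if gt_action in answer:
        reward += 0.5
    # Coordinate reward: a predicted click counts as correct if it lands inside the gt bbox.
    coords = re.search(r"\((\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?)\)", answer)
    if gt_action == "click" and coords:
        x, y = float(coords.group(1)), float(coords.group(2))
        x1, y1, x2, y2 = gt_bbox
        if x1 <= x <= x2 and y1 <= y <= y2:
            reward += 1.0
    return reward
```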

Grounding Leaderboard: UI-I2E-Bench
Model | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Average |
---|---|---|---|---|
UI-TARS-1.5-7B | 88.1 | 73.2 | 42.2 | 67.8 |
Uground-V1-72B | 89.7 | 76.3 | 34.3 | 66.8 |
UI-TARS-72B | 88.4 | 73.7 | 38.1 | 66.7 |
UI-R1-E-3B | 89.2 | 69.1 | 33.5 | 63.9 |
Uground-V1-7B | 87.1 | 70.3 | 31.1 | 62.8 |
InfiGUI-R1 | 87.5 | 69.7 | 29.6 | 62.3 |
UI-TARS-7B | 89.5 | 61.4 | 35.7 | 62.2 |
Qwen2.5-VL-72B | 87.1 | 51.4 | 43.6 | 60.7 |
UI-I2E-VLM-7B | 82.5 | 69.5 | 23.6 | 58.5 |
UI-TARS-2B | 82.3 | 62.0 | 27.7 | 57.3 |
Qwen2.5-VL-7B | 84.7 | 53.8 | 29.0 | 55.8 |
OmniParser-V2 | 72.0 | 54.8 | 39.6 | 55.5 |
Uground-V1-2B | 78.8 | 57.4 | 26.6 | 54.3 |
OS-Atlas-7B | 82.5 | 58.6 | 18.9 | 53.3 |
UI-R1-3B | 83.3 | 58.5 | 17.8 | 53.2 |
UGround-7B | 74.1 | 54.2 | 16.5 | 48.3 |
UI-I2E-VLM-4B | 70.4 | 53.4 | 12.2 | 45.3 |
OmniParser | 73.9 | 53.1 | 8.3 | 45.1 |
ShowUI-2B | 76.8 | 41.5 | 7.7 | 42.0 |
Qwen2.5-VL-3B | 55.5 | 41.7 | 23.9 | 41.3 |
Aguvis-7B | 84.4 | 53.2 | 22.9 | 40.4 |
OS-Atlas-4B | 70.1 | 44.3 | 3.7 | 39.4 |
Qwen2-VL-7B | 42.6 | 48.7 | 1.6 | 31.0 |
Seeclick | 55.8 | 26.4 | 1.1 | 27.8 |
InternVL2-4B | 4.2 | 0.9 | 0.3 | 1.8 |
Thinking is not needed for GUI grounding.
Inspired by concurrent work on efficient large reasoning models (LRMs), we achieve efficient reasoning via RFT training. UI-R1-E-3B is trained in two stages:
- DAST (Difficulty-Adaptive Slow-Thinking): add a difficulty-adaptive length reward that gradually shifts reasoning from slow to fast (see the sketch below).
- No-thinking: train the model to answer directly, without outputting a reasoning process.

Note: UI-R1-3B (v2) and UI-R1-E-3B are both trained on a larger dataset (2K grounding samples from GUI-R1-3K) than UI-R1-3B (v1).
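For illustration, here is a minimal sketch of what such a difficulty-adaptive length reward can look like; the budget rule, the constants, and the way difficulty is estimated are assumptions for exposition, not the released training code.

```python
# Illustrative difficulty-adaptive length reward (DAST-style); the budget rule
# and constants below are assumptions, not the released training code.
def dast_length_reward(num_reasoning_tokens: int, difficulty: float,
                       max_budget: int = 256, alpha: float = 0.5) -> float:
    """Penalize reasoning tokens beyond a budget that grows with difficulty.

    `difficulty` in [0, 1] could be estimated, e.g., as the fraction of
    rollouts in a GRPO group that miss the target.
    """
    budget = difficulty * max_budget                  # easy samples earn a small token budget
    overshoot = max(0.0, num_reasoning_tokens - budget)
    return -alpha * overshoot / max_budget            # added to the task reward
```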
ScreenSpotV2 | inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T | Web-I | Avg↑ / Len↓ |
---|---|---|---|---|---|---|---|---|
OS-ATLAS-7B | w/o thinking | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 / - |
UI-TARS-7B | w/o thinking | 95.2 | 79.1 | 90.7 | 68.6 | 90.6 | 78.3 | 84.7 / - |
UI-R1-3B (v1) | w/ thinking | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 / 67 |
GUI-R1-3B | w/ thinking | 97.6 | 78.2 | 94.3 | 64.3 | 91.0 | 72.4 | 85.0 / 80 |
UI-R1-3B (v2) | w/ thinking | 97.6 | 79.6 | 92.3 | 67.9 | 88.9 | 77.8 | 85.8 / 60 |
UI-R1-E-3B | w/o thinking | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 / 28 |
ScreenSpot-Pro | inference mode | Average Length↓ | Average Accuracy↑ |
---|---|---|---|
UGround-7B | w/o thinking | - | 16.5 |
OS-ATLAS-7B | w/o thinking | - | 18.9 |
UI-R1-3B (v1) | w/ thinking | 102 | 17.8 |
GUI-R1-3B | w/ thinking | 114 | 26.6 |
UI-R1-3B (v2) | w/ thinking | 129 | 29.8 |
UI-R1-E-3B | w/o thinking | 28 | 33.5 |
- Our UI-R1-E-3B achieves SOTA performance with the fewest answer tokens among open-source 3B/7B methods, suggesting that GUI grounding needs no explicit reasoning.
- The trend may be the opposite at the 7B scale.
- The trend may also be the opposite for planning, where we predict a "fast grounding, slow planning" pattern.
- The checkpoints of UI-R1-E-3B will be released soon.
- The updated paper will come soon.
- The efficient training code will come soon (in src/script/train_e.sh).
```bash
conda create -n ui-r1 python=3.10
conda activate ui-r1
bash setup.sh
```
Our mobile training data is a subset of AndroidControl and ScreenSpot.
You can prepare your own training or inference data in the same format:
```
images/
├── image1.png
└── image2.png
```

test.json:

```json
[
  {
    "img_filename": "image1.png",
    "bbox": [825, 72, 1673, 149],
    "instruction": "search bar"
  },
  {
    "img_filename": "image2.png",
    "bbox": [123, 732, 334, 812],
    "instruction": "check weather"
  }
]
```
where `bbox = [x1, y1, x2, y2]` gives the coordinates of the top-left and bottom-right corners of the ground-truth bounding box.
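For illustration, a small sketch (not the repository's evaluation script) of how test.json and this bbox convention fit together; `predict_click` is a hypothetical stand-in for a model call that returns pixel coordinates.

```python
# Hypothetical accuracy computation over test.json; a click is counted as
# correct if it lands inside the ground-truth box [x1, y1, x2, y2].
import json
from typing import Callable

def evaluate(test_json: str,
             predict_click: Callable[[str, str], tuple[float, float]]) -> float:
    with open(test_json) as f:
        samples = json.load(f)
    correct = 0
    for s in samples:
        x, y = predict_click(s["img_filename"], s["instruction"])
        x1, y1, x2, y2 = s["bbox"]  # top-left and bottom-right corners
        correct += int(x1 <= x <= x2 and y1 <= y <= y2)
    return correct / len(samples)
```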
We provide an example here.
```bash
cd evaluation/
bash test.sh
```
Please set MODEL_PATH, IMG_PATH, and TEST_JSON to your actual checkpoint and data paths.
```bash
cd src/script/
bash train.sh
# efficient training
bash train_e.sh
```
- 2025-05-14: We update the paper with UI-R1-E-3B.
- 2025-05-12: We release the checkpoints of the UI-R1-E-3B model.
- 2025-05-12: We fix the scale bug when batch_size > 1.
- 2025-05-11: We release the efficient training code of the UI-R1-E-3B model.
- 2025-04-02: We release the datasets of the UI-R1-3B (v1) model.
- 2025-03-30: We release the checkpoints of the UI-R1-3B (v1) model.
- 2025-03-30: We release the UI-R1 repository.
- 2025-03-27: We release our paper.
If you find this project useful, please consider citing us.
```bibtex
@article{lu2025ui,
  title={UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning},
  author={Lu, Zhengxi and Chai, Yuxiang and Guo, Yaxuan and Yin, Xi and Liu, Liang and Wang, Hao and Xiong, Guanjing and Li, Hongsheng},
  journal={arXiv preprint arXiv:2503.21620},
  year={2025}
}
```
We sincerely thank the projects R1-V, Open-R1, Open-r1-multimodal, and VLM-R1 for providing their open-source resources.