🌐 FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, Sources, and Operating Systems
🎉 FedGUI has been accepted to the Findings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026).
FedGUI is the first comprehensive benchmark designed for developing and evaluating federated GUI agents across diverse platforms, including Mobile, Desktop, and Web. It addresses the privacy and scalability challenges of traditional centralized training by leveraging Federated Learning (FL) to train generalized agents on heterogeneous, decentralized data.
- **Platform Diversity**: Supports over 900 mobile apps, 40+ desktop applications, and 200+ websites.
- **Comprehensive Heterogeneity**: Systematically models four types of real-world heterogeneity: Cross-Platform, Cross-Device, Cross-OS, and Cross-Source.
- **Unified Action Space**: Standardizes interactions across all platforms into 17 discrete action types, including basic actions (e.g., CLICK, TYPE) and platform-specific custom actions.
- **Extensive Model & Algorithm Support**: Integrates 7 FL algorithms (e.g., FedAvg, FedYogi, FedAdam) and supports 20+ base VLMs such as Qwen3-VL, InternVL2, and Gemma-3.
- **Open-Source Datasets**: We release the constructed datasets under the `datasets/` directory for easy access and reproduction.
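As a small illustration of the unified action space, the sketch below parses an action string in the step-level response format shown later in this README (e.g., `CLICK <point>[[100, 200]]</point>`) into an action type and optional coordinates. The helper is hypothetical and not part of the FedGUI codebase:

```python
import re

# Hypothetical parser for a unified-action string such as
# "CLICK <point>[[100, 200]]</point>". The first token is the action
# type; coordinate-based actions carry a <point>[[x, y]]</point> payload.
def parse_action(action: str):
    action = action.strip()
    action_type = action.split()[0] if action else ""
    match = re.search(r"<point>\[\[(\d+),\s*(\d+)\]\]</point>", action)
    coords = (int(match.group(1)), int(match.group(2))) if match else None
    return action_type, coords

parse_action("CLICK <point>[[100, 200]]</point>")
# → ("CLICK", (100, 200))
```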
FedGUI/
├── README.md
├── fedgui.png
├── datasets/ # Open-source datasets & prompt injection
│ ├── FedGUI-Full/
│ │ ├── Full_IID.json
│ │ ├── Full_Non-Uniform.json
│ │ └── ...
│ ├── FedGUI-OS/
│ │ ├── OS_IID.json
│ │ └── ...
│ └── prompt.py
├── data_process/ # Data preprocessing pipeline
│ ├── action_normalize.py
│ ├── gen_message_VLM.py
│ └── single_dataset_level/
│ ├── 0_dump_AC.py
│ ├── 1_gen_jsonl.py
│ └── ...
├── scripts/
│ ├── train/
│ │ └── run_fedavg.sh
│ └── evaluation/
│ └── eval_fed.sh
├── swift/ # ms-swift framework (integrated)
├── requirements/
├── setup.py
└── requirements.txt
Ensure you have Python ≥ 3.8 and CUDA installed. FedGUI is built upon the ms-swift framework.
git clone https://anonymous.4open.science/r/FedGUI-1B15/
cd FedGUI
pip install -e .[all]

FedGUI utilizes 9 curated datasets derived from 6 major sources:
- **Mobile**: AndroidControl (AC), AitW, GUI Odyssey (GO)
- **Web**: Mind2Web (M2W), GUIAct-Web (GA-W), OmniAct-Web
- **Desktop**: AgentSynth (AS), OmniAct-Mac/Windows
The data_process/single_dataset_level/ directory contains scripts for processing each dataset individually. These scripts handle data extraction, normalization, and conversion to the unified format required by FedGUI.
cd data_process/single_dataset_level
python 0_dump_AC.py
python 1_gen_jsonl.py --data_dir ./data/processed_android_control

After processing individual datasets, use gen_message_VLM.py to aggregate multiple datasets and convert episode-level data into step-level format with VLM-compatible prompts.
Usage Example:
cd data_process
python gen_message_VLM.py

Configuration:
Edit the configuration section in gen_message_VLM.py:
DISTRIBUTION_MODE = "iid"
NUM_CLIENTS = 9
OUTPUT_FILE = "./output/converted_data.jsonl"
DATASET_CONFIGS = [
{
"path": "./datasets/GUI_Odyssey/train_600.jsonl",
"sample_count": 600,
"name": "GUI_Odyssey",
},
{
"path": "./datasets/GUIAct_Web/train_600.jsonl",
"sample_count": 600,
"name": "GUIAct_Web",
},
{
"path": "./datasets/Mind2Web/train_600.jsonl",
"sample_count": 600,
"name": "Mind2Web",
}
]

Each step contains:
{
"images": "/path/to/screenshot.png",
"query": "Task instruction with history...",
"response": "Actions:\nCLICK <point>[[100, 200]]</point>",
"client_id": 0
}

We release our datasets under the datasets/ directory, containing FedGUI-Full and FedGUI-OS. You can use prompt.py to inject prompts into the dataset samples for consistent training and evaluation.
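The `client_id` field in each step record encodes the federated partition. As a minimal sketch, an IID split could assign it via a shuffled round-robin over clients; this is one plausible realization of an "iid" DISTRIBUTION_MODE, not gen_message_VLM.py's actual logic:

```python
import random

# Hypothetical IID partitioner: shuffle step-level records, then assign
# client_id round-robin across num_clients so every client receives an
# (approximately) equal, randomly mixed share of the data.
def partition_iid(records, num_clients, seed=0):
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    for i, rec in enumerate(shuffled):
        rec["client_id"] = i % num_clients
    return shuffled

steps = [{"query": f"task {i}"} for i in range(6)]
parts = partition_iid(steps, num_clients=3)
# each of client_id 0, 1, 2 receives exactly two records
```

A non-IID mode would replace the round-robin with a skewed assignment (e.g., sampling client proportions from a Dirichlet distribution), concentrating certain sources or platforms on certain clients.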
FedGUI supports 7 representative Federated Learning (FL) algorithms (e.g., FedAvg, FedYogi, FedAdam) and is compatible with 20+ base vision-language models (VLMs), including Qwen3-VL, InternVL2, and Gemma-3, enabling flexible and parameter-efficient adaptation across heterogeneous clients.

To reduce communication and computation overhead, FedGUI adopts LoRA (Low-Rank Adaptation): only lightweight adapter parameters are exchanged between the server and clients, making large-scale VLM training feasible even on a single RTX 4090 GPU.
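As a rough illustration of FedAvg-style aggregation over LoRA adapters, the sketch below averages per-client adapter tensors weighted by client sample counts (plain Python lists stand in for tensors). This is a simplified view of the idea, not FedGUI's actual implementation:

```python
# Hypothetical FedAvg over LoRA adapter parameters: each client uploads
# its adapter tensors; the server returns the sample-count-weighted
# average, which becomes the next round's global adapter.
def fedavg(client_adapters, client_sizes):
    total = sum(client_sizes)
    agg = {}
    for key in client_adapters[0]:
        agg[key] = [
            sum(a[key][i] * n for a, n in zip(client_adapters, client_sizes)) / total
            for i in range(len(client_adapters[0][key]))
        ]
    return agg

clients = [{"lora_A": [1.0, 2.0]}, {"lora_A": [3.0, 4.0]}]
fedavg(clients, [1, 1])
# → {"lora_A": [2.0, 3.0]}
```

Adaptive server optimizers such as FedYogi and FedAdam replace this plain weighted average with a server-side update that maintains momentum and second-moment statistics over the aggregated deltas.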
A typical training command is shown below:
bash scripts/train/run_fedavg.sh <GPU_ID> 10 3 qwen2-vl-7b /path/to/model FedGUI-Full

FedGUI evaluates GUI agent performance using three action-level metrics:
- **Action Type Accuracy (Type)**: Measures whether the predicted interaction intent matches the ground-truth action type, based on the first token of the generated action.
- **Grounding Accuracy (Ground)**: Evaluates spatial correctness for coordinate-based actions (e.g., CLICK, DOUBLE_CLICK). A prediction is considered correct if the Euclidean distance between predicted and ground-truth coordinates is within 14% of the screen diagonal, ensuring robustness across different screen sizes.
- **Success Rate (SR)**: Reflects end-to-end execution accuracy, requiring both correct action type and parameters. For text-based actions, semantic correctness is measured using a Similarity Score (token-level F1 + character-level overlap), with a success threshold of 0.5.
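The grounding criterion above can be sketched directly: a coordinate prediction is correct when its Euclidean distance to the ground truth is within 14% of the screen diagonal. The token-level F1 below is a simplified stand-in for the Similarity Score (which also mixes in character-level overlap); both function names are illustrative, not FedGUI's evaluation code:

```python
import math

# Grounding check: correct if the prediction lands within 14% of the
# screen diagonal from the ground-truth point (screen-size robust).
def grounding_correct(pred, gt, width, height, ratio=0.14):
    diagonal = math.hypot(width, height)
    dist = math.hypot(pred[0] - gt[0], pred[1] - gt[1])
    return dist <= ratio * diagonal

# Simplified token-level F1 for text-based actions; a Similarity Score
# of at least 0.5 would count the action as successful.
def token_f1(pred_text, gt_text):
    p, g = pred_text.split(), gt_text.split()
    common = len(set(p) & set(g))
    if not p or not g or common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

grounding_correct((100, 200), (120, 210), 1080, 1920)
# → True (distance ≈ 22.4 px, well under 14% of the diagonal)
```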
bash scripts/evaluation/eval_fed.sh <GPU_ID> <DATASET_NAME> <MODEL_TYPE> <CHECKPOINT_PATH> <ROUND_NUM>