AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

We are still organizing and cleaning up the code for AgentHijack. If you have any questions about the code, feel free to open an issue. If you find our work interesting, please star ⭐ the project. Thanks 💕.

📢 Updates

2026-05-17: We release the code of AgentHijack. Check it out!
2026-05-01: AgentHijack is accepted to ICML 2026!

💾 Installation

This repository is built on OSWorld; refer to its installation instructions for setup. We recommend running experiments with VMware or Docker, as these are the configurations we have verified.

🧪 Experiments

Open-Source and Closed-Source Multimodal Large Language Models

To run the baseline agent used in our paper, execute the following command (GPT-4o under the pop_ups corruption is used as an example):

python run.py --path_to_vm vmware_vm_data/Ubuntu0/Ubuntu0.vmx --headless --observation_type screenshot --model openai/chatgpt-4o-latest --noise_type pop_ups --result_dir ./results
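To evaluate several corruptions in a row, the command above can be wrapped in a small driver. The following is a minimal sketch, not part of the repository: the helper names `build_command` and `run_sweep` are hypothetical, the list of corruptions contains only the two names mentioned in this README, and every flag is copied from the command above. Whether run.py keeps results for different corruptions apart inside one result directory is not stated here, so check that before sharing `./results` across runs.

```python
import shlex
import subprocess
import sys

# Corruption names that appear in this README; other values are not listed here.
NOISE_TYPES = ["pop_ups", "network_error"]


def build_command(noise_type, result_dir="./results",
                  vm_path="vmware_vm_data/Ubuntu0/Ubuntu0.vmx",
                  model="openai/chatgpt-4o-latest"):
    """Return the run.py argument list for one corruption type."""
    return [
        sys.executable, "run.py",
        "--path_to_vm", vm_path,
        "--headless",
        "--observation_type", "screenshot",
        "--model", model,
        "--noise_type", noise_type,
        "--result_dir", result_dir,
    ]


def run_sweep(noise_types=NOISE_TYPES, dry_run=True):
    """Run (or only print, when dry_run is True) one evaluation per corruption."""
    for noise in noise_types:
        cmd = build_command(noise)
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sweep as soon as one run fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_sweep(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to review them before starting a long evaluation.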

The results, which include screenshots, actions and summaries of the agent's task completion, will be saved in the ./results directory in this case. You can then run the following command to obtain the result:

python show_result.py
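If you want to post-process scores yourself, something like the sketch below may help. It assumes the result layout of upstream OSWorld, where each task directory contains a `result.txt` holding a single numeric score; this README does not confirm that AgentHijack keeps that layout, and the function names `collect_scores` and `success_rate` are hypothetical. It is not a replacement for show_result.py.

```python
from pathlib import Path


def collect_scores(result_dir):
    """Read every result.txt under result_dir and return the scores as floats."""
    scores = []
    for path in sorted(Path(result_dir).rglob("result.txt")):
        text = path.read_text().strip()
        try:
            scores.append(float(text))
        except ValueError:
            # Skip files that do not hold a single number.
            continue
    return scores


def success_rate(scores):
    """Mean score, or None when no results were found."""
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    scores = collect_scores("./results")
    print(f"tasks: {len(scores)}, success rate: {success_rate(scores)}")
```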

For convenience, we use OpenRouter to access the APIs of different LLMs. Write your api_key into mm_agents/agent.py, or change the client there to another interface.

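from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the standard client works.
Client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="",  # put your api_key here
)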
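To avoid committing a key to the repository, one option is to read it from the environment instead of hard-coding it. This is a sketch under assumptions: the variable name `OPENROUTER_API_KEY` and the helpers `openrouter_client_kwargs` and `make_client` are hypothetical and do not exist in mm_agents/agent.py.

```python
import os

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def openrouter_client_kwargs(env_var="OPENROUTER_API_KEY"):
    """Build keyword arguments for OpenAI(...) from an environment variable."""
    api_key = os.environ.get(env_var, "")
    if not api_key:
        raise RuntimeError(f"Set {env_var} before running the agent.")
    return {"base_url": OPENROUTER_BASE_URL, "api_key": api_key}


def make_client():
    """Create the client; requires the `openai` package to be installed."""
    # Imported lazily so the helper above has no third-party dependency.
    from openai import OpenAI
    return OpenAI(**openrouter_client_kwargs())
```

Failing early with a clear message is usually easier to debug than an authentication error raised in the middle of an evaluation run.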

State-of-the-Art GUI Agents

We provide the deployment code in /vllm_server; deploy the corresponding agent before running experiments. For the UI-TARS-7B-DPO and UI-TARS-1.5-7B models, we recommend using 1×A100 GPU. For UI-TARS-72B-DPO, 4×A100 GPUs are needed for inference.

nohup bash vllm_server/ui-tars-1.5-7b.sh > server.log &
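Because the server is started in the background, model loading may still be in progress when the command returns. A small readiness check such as the sketch below can help. It assumes the script launches vLLM's OpenAI-compatible server without an API key on the default port 8000, which this README does not state; adjust the URL to whatever the deployment script configures. The function name `wait_for_server` is hypothetical.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8000/v1/models", timeout=600.0, interval=5.0):
    """Poll url until it answers with HTTP 200 or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not reachable yet; retry after a short pause.
        time.sleep(interval)
    return False


if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server did not come up in time")
```

If the check times out, server.log is the first place to look.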

After successful deployment, run the following command to obtain the result:

python run_uitars.py --path_to_vm vmware_vm_data/Ubuntu0/Ubuntu0.vmx --headless --observation_type screenshot --model ui-tars --noise_type pop_ups --result_dir ./results

Note that the "network_error" corruption currently cannot run in the Docker environment, so we recommend using VMware for network_error evaluation. You can also use run_multienv_uitars.py for parallel execution:

python run_multienv_uitars.py --path_to_vm "" --headless --observation_type screenshot --model ui-tars --noise_type pop_ups --num_envs 4 --result_dir ./results

AgentHijack Agent

Download AgentHijack-Agent from Hugging Face, then deploy it to run the evaluation experiments.

nohup bash vllm_server/agenthijack-agent.sh > server.log &

python run_agenthijack_agent.py --path_to_vm vmware_vm_data/Ubuntu0/Ubuntu0.vmx --headless --observation_type screenshot --model ui-tars --noise_type pop_ups --result_dir ./results

⚙️ Corruption Setups

To support flexible setups for different corruptions, we provide configurable parameters in the YAML file /vllm_server/default.yaml. Please refer to our paper for detailed explanations of these parameters.

📄 Citation

If you find this environment useful, please consider citing our work:

@inproceedings{sun2026agenthijack,
  title     = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions},
  author    = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=0H5Im3Xvuf}
}

❤️ Acknowledgement

Parts of the code are borrowed from OSWorld and PopupAttack. We thank the authors for their wonderful work.

