Paper collection for LLM GUI autonomous agent

Introduction

Papers

  1. World of Bits: An Open-Domain Platform for Web-Based Agents. ICML 2017

    Tianlin (Tim) Shi, Andrej Karpathy, Linxi (Jim) Fan, Jonathan Hernandez, Percy Liang [pdf], 2017

  2. Rico: A Mobile App Dataset for Building Data-Driven Design Applications. UIST 2017

    Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, Ranjitha Kumar [pdf], 2017

  3. Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration. ICLR 2018

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, Percy Liang [pdf], 2018.2

  4. Mapping Natural Language Instructions to Mobile UI Action Sequences. ACL 2020

    Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, Jason Baldridge [pdf], 2020.5

  5. AndroidEnv: A Reinforcement Learning Platform for Android. ViGIL at NAACL 2021

    Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, Doina Precup [pdf], 2021.5

  6. Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments. ViGIL at NAACL 2021

    Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer [pdf], 2021.4

  7. A data-driven approach for learning to control computers. PMLR

    Peter C Humphreys, David Raposo, Toby Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, Timothy Lillicrap [pdf], 2022.2

  8. META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI. Arxiv

    Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, Kai Yu [pdf], 2022.5

  9. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. Arxiv

    Shunyu Yao, Howard Chen, John Yang, Karthik Narasimhan [pdf], 2022.7

  10. Enabling Conversational Interaction with Mobile UI using Large Language Models. CHI 2023

    Bryan Wang, Gang Li, Yang Li [pdf], 2022.9

  11. UGIF: UI Grounded Instruction Following. Arxiv

    Sagar Gubbi Venkatesh, Partha Talukdar, Srini Narayanan [pdf], 2022.11

  12. Multimodal Web Navigation with Instruction-Finetuned Foundation Models. ICLR 2023 Workshop ME-FoMo

    Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, Izzeddin Gur [pdf], 2023.5

  13. Hierarchical Prompting Assists Large Language Model on Web Navigation. ACL 2023 NLRSE workshop

    Abishek Sridhar, Robert Lo, Frank F. Xu, Hao Zhu, Shuyan Zhou [pdf], 2023.5

  14. From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces. Arxiv

    Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, Kristina Toutanova [pdf], 2023.6

  15. Mind2Web: Towards a Generalist Agent for the Web. Arxiv

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, Yu Su [pdf], 2023.6

  16. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. Arxiv

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, Aleksandra Faust [pdf], 2023.7

  17. WebArena: A Realistic Web Environment for Building Autonomous Agents. Arxiv

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig [pdf], 2023.7

  18. Empowering LLM to use Smartphone for Intelligent Task Automation. Arxiv

    Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu [pdf], 2023.8

  19. Android in the Wild: A Large-Scale Dataset for Android Device Control. Arxiv

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, Timothy Lillicrap [pdf], 2023.7

  20. An Empirical Study & Evaluation of Modern CAPTCHAs. Arxiv

    Andrew Searles, Yoshimichi Nakatsuka, Ercan Ozturk, Andrew Paverd, Gene Tsudik, Ai Enkoji [pdf], 2023.7

  21. LASER: LLM Agent with State-Space Exploration for Web Navigation. Arxiv

    Kaixin Ma, Hongming Zhang, Hongwei Wang, Xiaoman Pan, Dong Yu [pdf], 2023.9

  22. You Only Look at Screens: Multimodal Chain-of-Action Agents. Arxiv

    Zhuosheng Zhang, Aston Zhang [pdf], 2023.9

  23. HeaP: Hierarchical Policies for Web Actions using LLMs. Arxiv

    Paloma Sodhi, S.R.K. Branavan, Ryan McDonald [pdf], 2023.10

  24. The Unsolved Challenges of LLMs as Generalist Web Agents: A Case Study. Arxiv

    Rim Assouel, Tom Marty, Massimo Caccia, Issam H. Laradji, Alexandre Drouin, Sai Rajeswar, Hector Palacios, Quentin Cappart, David Vazquez, Nicolas Chapados, Maxime Gasse, Alexandre Lacoste [pdf], 2023.12

  25. "What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces. Arxiv

    Faria Huq, Jeffrey P. Bigham, Nikolas Martelaro [pdf], 2023.12

  26. ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation. Arxiv

    Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou [pdf], 2023.12

  27. GPT-4V(ision) is a Generalist Web Agent, if Grounded. Arxiv

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su [pdf], 2024.1

  28. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. Arxiv

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu [pdf], 2024.1

  29. ScreenAgent: A Vision Language Model-driven Computer Control Agent. Arxiv

    Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang [pdf], 2024.2

  30. OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web. Arxiv

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov [pdf], 2024.2

  31. WebLINX: Real-World Website Navigation with Multi-Turn Dialogue. Arxiv

    Xing Han Lù, Zdeněk Kasner, Siva Reddy [pdf], 2024.2

  32. Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study. Arxiv

    Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu [pdf], 2024.3

  33. AgentStudio: A Toolkit for Building General Virtual Agents. Arxiv

    Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan [pdf], 2024.3

  34. VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? Arxiv

    Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue [pdf], 2024.4

  35. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. Arxiv

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu [pdf], 2024.4

  36. WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents. Arxiv

    Michael Lutz, Arth Bohra, Manvel Saroyan, Artem Harutyunyan, Giovanni Campagna [pdf], 2024.4

  37. Autonomous Evaluation and Refinement of Digital Agents. Arxiv

    Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr [pdf], 2024.4

  38. MMInA: Benchmarking Multihop Multimodal Internet Agents. Arxiv

    Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu [pdf], 2024.4

  39. SteP: Stacked LLM Policies for Web Actions. Arxiv

    Paloma Sodhi, S.R.K. Branavan, Yoav Artzi, Ryan McDonald [pdf], 2024.4