Language models (LMs) are powerful, yet they are designed mostly for text-generation tasks. Tools substantially enhance their performance on tasks that require complex skills.
Based on our recent survey about LM-used tools, "What Are Tools Anyway? A Survey from the Language Model Perspective", we provide a structured list of literature relevant to tool-augmented LMs.
- Tool basics ($\S2$)
- Tool use paradigm ($\S3$)
- Scenarios ($\S4$)
- Advanced methods ($\S5$)
- Evaluation ($\S6$)
If you find our paper or code useful, please cite the paper:
@article{wang2022what,
title={What Are Tools Anyway? A Survey from the Language Model Perspective},
author={Zhiruo Wang and Zhoujun Cheng and Hao Zhu and Daniel Fried and Graham Neubig},
journal={arXiv preprint arXiv:2403.15452},
year={2024}
}
-
Definition and discussion of animal-used tools
Animal tool behavior: the use and manufacture of tools by animals Shumaker, Robert W., Kristina R. Walkup, and Benjamin B. Beck. 2011 [Book]
-
Early discussions on LM-used tools
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs Qin, Yujia, et al. 2023.07 [Paper]
-
A survey on augmented LMs, including tool augmentation
Augmented Language Models: a Survey Mialon, Grégoire, et al. 2023.02 [Paper]
-
Definition of agents
Artificial Intelligence: A Modern Approach Russell, Stuart J., and Peter Norvig. 2016 [Book]
-
Survey about agents that perceive and act in the environment
The Rise and Potential of Large Language Model Based Agents: A Survey Xi, Zhiheng, et al. 2023.09 [Preprint]
-
Survey about the cognitive architectures for language agents
Cognitive Architectures for Language Agents Sumers, Theodore R., et al. 2023.09 [Paper]
-
Early works that set up the commonly used tooling paradigm
Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]
-
Provide in-context examples of tool use for visual programming problems
Visual Programming: Compositional visual reasoning without training Gupta, Tanmay, and Aniruddha Kembhavi. 2023 [Paper]
-
Tool learning via in-context examples on reasoning problems involving text or multi-modal inputs
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models Lu, Pan, et al. 2024 [Paper]
-
Tool use based on in-context learning for reasoning problems in BigBench and MMLU
ART: Automatic multi-step reasoning and tool-use for large language models Paranjape, Bhargavi, et al. 2023.03 [Preprint]
-
Providing tool documentation for in-context tool learning
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models Hsieh, Cheng-Yu, et al. 2023.08 [Preprint]
-
Training on human annotated examples of (NL input, tool-using solution output) pairs
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs Li, Minghao, et al. 2023.12 [Paper]
Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems Kadlčík, Marek, et al. 2023 [Paper]
-
Training on model-synthesized examples
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases Tang, Qiaoyu, et al. 2023.06 [Preprint]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs Qin, Yujia, et al. 2023.07 [Paper]
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use Huang, Yue, et al. 2023.10 [Paper]
Making Language Models Better Tool Learners with Execution Feedback Qiao, Shuofei, et al. 2023.05 [Preprint]
LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error Wang, Boshi, et al. 2024.03 [Preprint]
-
Self-training with bootstrapped examples
Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]
-
Collect data from structured knowledge sources, e.g., databases, knowledge graphs, etc.
LaMDA: Language Models for Dialog Applications Thoppilan, Romal, et al. 2022.01 [Paper]
TALM: Tool Augmented Language Models Parisi, Aaron, Yao Zhao, and Noah Fiedel. 2022.05 [Preprint]
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings Hao, Shibo, et al. 2024 [Paper]
ToolQA: A Dataset for LLM Question Answering with External Tools Zhuang, Yuchen, et al. 2024 [Paper]
Middleware for LLMs: Tools are Instrumental for Language Agents in Complex Environments Gu, Yu, et al. 2024 [Paper]
GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information Jin, Qiao, et al. 2024 [Paper]
-
Search for information on the web
Internet-augmented language models through few-shot prompting for open-domain question answering Lazaridou, Angeliki, et al. 2022.03 [Paper]
Internet-Augmented Dialogue Generation Komeili, Mojtaba, Kurt Shuster, and Jason Weston. 2022 [Paper]
-
Viewing retrieval models as tools under the retrieval-augmented generation context
Retrieval-based Language Models and Applications Asai, Akari, et al. 2023 [Tutorial]
Augmented Language Models: a Survey Mialon, Grégoire, et al. 2023.02 [Paper]
-
Using a calculator for math calculations
Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]
Calc-X and Calcformers: Empowering Arithmetical Chain-of-Thought through Interaction with Symbolic Systems Kadlčík, Marek, et al. 2023 [Paper]
-
Using programs or a Python interpreter to perform more complex operations
Pal: Program-aided language models Gao, Luyu, et al. 2023 [Paper]
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks Chen, Wenhu, et al. 2022.11 [Paper]
Mint: Evaluating llms in multi-turn interaction with tools and language feedback Wang, Xingyao, et al. 2023.09 [Paper]
MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning Das, Debrup, et al. 2024 [Paper]
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving Gou, Zhibin, et al. 2023.09 [Paper]
-
Tools for more advanced business activities, e.g., in finance, medicine, and education
On the Tool Manipulation Capability of Open-source Large Language Models Xu, Qiantong, et al. 2023.05 [Paper]
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases Tang, Qiaoyu, et al. 2023.06 [Preprint]
Mint: Evaluating llms in multi-turn interaction with tools and language feedback Wang, Xingyao, et al. 2023.09 [Paper]
AgentMD: Empowering Language Agents for Risk Prediction with Large-Scale Clinical Tool Learning Jin, Qiao, et al. 2024.02 [Paper]
-
Access real-time or real-world information such as weather, location, etc.
On the Tool Manipulation Capability of Open-source Large Language Models Xu, Qiantong, et al. 2023.05 [Paper]
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases Tang, Qiaoyu, et al. 2023.06 [Preprint]
-
Managing personal events such as calendar or emails
Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]
-
Tools in embodied environments, e.g., the Minecraft world
Voyager: An Open-Ended Embodied Agent with Large Language Models Wang, Guanzhi, et al. 2023.05 [Paper]
-
Tools interacting with the physical world
ProgPrompt: Generating Situated Robot Task Plans using Large Language Models Singh, Ishika, et al. 2023 [Paper]
Alfred: A benchmark for interpreting grounded instructions for everyday tasks Shridhar, Mohit, et al. 2020 [Paper]
Autonomous chemical research with large language models Boiko, Daniil A., et al. 2023 [Paper]
-
Tools providing access to information in non-textual modalities
Vipergpt: Visual inference via python execution for reasoning Surís, Dídac, Sachit Menon, and Carl Vondrick. 2023 [Paper]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action Yang, Zhengyuan, et al. 2023.03 [Preprint]
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Gao, Difei, et al. 2023.06 [Preprint]
-
Tools that can answer questions about data in other modalities
Visual Programming: Compositional visual reasoning without training Gupta, Tanmay, and Aniruddha Kembhavi. 2023 [Paper]
-
Text-generation models that can perform specific tasks, e.g., question answering, machine translation
Toolformer: Language Models Can Teach Themselves to Use Tools Schick, Timo, et al. 2024 [Paper]
ART: Automatic multi-step reasoning and tool-use for large language models Paranjape, Bhargavi, et al. 2023.03 [Preprint]
-
Integration of available models on Huggingface, TorchHub, TensorHub, etc.
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Shen, Yongliang, et al. 2024 [Paper]
Gorilla: Large language model connected with massive apis Patil, Shishir G., et al. 2023.05 [Paper]
Taskbench: Benchmarking large language models for task automation Shen, Yongliang, et al. 2023.11 [Paper]
-
Train retrievers that map natural language instructions to tool documentation
DocPrompting: Generating Code by Retrieving the Docs Zhou, Shuyan, et al. 2022.07 [Paper]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs Qin, Yujia, et al. 2023.07 [Paper]
-
Ask LMs to write hypothetical tool descriptions and search for relevant tools
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets Yuan, Lifan, et al. 2023.09 [Paper]
-
Complex tool usage, e.g., parallel calls
Function Calling and Other API Updates Eleti, Atty, et al. 2023.06 [Blog]
An LLM Compiler for Parallel Function Calling Kim, Sehoon, et al. 2023.12 [Paper]
-
Domain-specific logical forms to query structured data
Semantic parsing on freebase from question-answer pairs Berant, Jonathan, et al. 2013 [Paper]
Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task Yu, Tao, et al. 2018.09 [Paper]
Break It Down: A Question Understanding Benchmark Wolfson, Tomer, et al. 2020 [Paper]
-
Domain-specific actions for agentic tasks such as web navigation
Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration Liu, Evan Zheran, et al. 2018.02 [Paper]
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents Yao, Shunyu, et al. 2022.07 [Paper]
Webarena: A realistic web environment for building autonomous agents Zhou, Shuyan, et al. 2023.07 [Paper]
-
Using external Python libraries as tools
ToolCoder: Teach Code Generation Models to use API search tools Zhang, Kechi, et al. 2023.05 [Paper]
-
Using expert-designed functions as tools to answer questions about images
Visual Programming: Compositional visual reasoning without training Gupta, Tanmay, and Aniruddha Kembhavi. 2023 [Paper]
Vipergpt: Visual inference via python execution for reasoning Surís, Dídac, Sachit Menon, and Carl Vondrick. 2023 [Paper]
-
Using GPT as a tool to query external Wikipedia knowledge for table-based question answering
Binding Language Models in Symbolic Languages Cheng, Zhoujun, et al. 2022.10 [Paper]
-
Incorporate a QA API and operation APIs to assist table-based question answering
API-Assisted Code Generation for Question Answering on Varied Table Structures Cao, Yihan, et al. 2023.12 [Paper]
-
Approaches to abstract libraries for domain-specific logical forms from a large corpus
DreamCoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning Ellis, Kevin, et al. 2020.06 [Paper]
Leveraging Language to Learn Program Abstractions and Search Heuristics Wong, Catherine, et al. 2021 [Paper]
Top-Down Synthesis for Library Learning Bowers, Matthew, et al. 2023 [Paper]
LILO: Learning Interpretable Libraries by Compressing and Documenting Code Grand, Gabriel, et al. 2023.10 [Paper]
-
Make and learn skills (JavaScript programs) in the embodied Minecraft world
Voyager: An Open-Ended Embodied Agent with Large Language Models Wang, Guanzhi, et al. 2023.05 [Paper]
-
Leverage LMs as tool makers on BigBench tasks
Large Language Models as Tool Makers Cai, Tianle, et al. 2023.05 [Preprint]
-
Create tools for math and table QA tasks by example-wise tool making
CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation Qian, Cheng, et al. 2023.05 [Paper]
-
Make tools via heuristic-based training and tool deduplication
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets Yuan, Lifan, et al. 2023.09 [Paper]
-
Learning tools by refactoring a small number of programs
ReGAL: Refactoring Programs to Discover Generalizable Abstractions Stengel-Eskin, Elias, Archiki Prasad, and Mohit Bansal. 2024.01 [Preprint]
-
A training-free approach to make tools via execution consistency
🎁 TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks Wang, Zhiruo, Daniel Fried, and Graham Neubig. 2024.01 [Preprint]
-
Datasets that require reasoning over texts
Measuring Mathematical Problem Solving With the MATH Dataset Hendrycks, Dan, et al. 2021.03 [Paper]
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models Srivastava, Aarohi, et al. 2022.06 [Paper]
-
Datasets that require reasoning over structured data, e.g., tables
Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning Lu, Pan, et al. 2022.09 [Paper]
Compositional Semantic Parsing on Semi-Structured Tables Pasupat, Panupong, and Percy Liang. 2015 [Paper]
HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation Cheng, Zhoujun, et al. 2022 [Paper]
-
Datasets that require reasoning over other modalities, e.g., images and image pairs
Gqa: A new dataset for real-world visual reasoning and compositional question answering Hudson, Drew A., and Christopher D. Manning. 2019.02 [Paper]
A Corpus for Reasoning about Natural Language Grounded in Photographs Suhr, Alane, et al. 2019 [Paper]
-
Example datasets that require a retriever model (tool) to solve
Natural Questions: A Benchmark for Question Answering Research Kwiatkowski, Tom, et al. 2019 [Paper]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension Joshi, Mandar, et al. 2017 [Paper]
-
Collect APIs from RapidAPI and use models to synthesize examples for evaluation
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs Qin, Yujia, et al. 2023.07 [Paper]
-
Collect APIs from PublicAPIs and use models to synthesize examples
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases Tang, Qiaoyu, et al. 2023.06 [Preprint]
-
Collect APIs from PublicAPIs and manually annotate examples for evaluation
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs Li, Minghao, et al. 2023.12 [Paper]
-
Collect APIs from the OpenAI plugin list and use models to synthesize examples
MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use Huang, Yue, et al. 2023.10 [Paper]
-
Collect neural model tools from the Huggingface hub, TorchHub, and TensorHub
Gorilla: Large language model connected with massive apis Patil, Shishir G., et al. 2023.05 [Paper]
-
Collect neural model tools from Huggingface
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face Shen, Yongliang, et al. 2024 [Paper]
-
Collect tools from Huggingface and PublicAPIs
Taskbench: Benchmarking large language models for task automation Shen, Yongliang, et al. 2023.11 [Paper]
-
Collect action sequences on real-world macOS/iPadOS/iOS
ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents Shen, Haiyang, et al. 2024.07 [Paper]