Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems (🔥🔥🔥 ICML 2025 Spotlight - Top 2.6%)


🚀 This work has been covered by three well-known media outlets.

Important

If you find this project helpful, please consider giving us a ⭐️! It motivates us to keep improving.

🧐 Overview


This repository provides the implementation of the ICML 2025 spotlight paper "Which Agent Causes Task Failures and When?", which introduces the task of automated failure attribution in LLM-based multi-agent systems. Given a failed task, the goal of failure attribution is to automatically identify the agent and the step responsible for the failure.

Automated failure attribution offers several key advantages:

  • Reduces manual debugging effort: Automates the labor-intensive process of inspecting failure logs and tracing errors.
  • Accelerates system development: Speeds up the iteration cycle by quickly identifying faulty agents and critical mistakes.
  • Enables intermediate feedback for agent self-improvement: Pinpointing decisive errors provides actionable signals for agentic systems' self-correction or can serve as rewards in reinforcement learning.

🔧 Who&When: the first benchmark for automated failure attribution in multi-agent systems (MAS).

  • 184 annotated failure tasks collected from algorithm-generated and hand-crafted agentic systems.
  • Fine-grained annotations for each failure, including:
    • The failure-responsible agent (who failed),
    • The decisive error step (when the critical error occurred),
    • A natural language explanation of the failure.

The dataset covers a wide range of realistic multi-agent scenarios based on queries from GAIA and AssistantBench. It serves as a foundational resource for developing and evaluating methods that automatically pinpoint the causes of failures in complex agentic systems. The failure logs were annotated following a detailed annotation guide; more details can be found in the paper.

Important

Check out the dataset on Hugging Face 🤗.
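
For reference, an individual annotated failure log can be inspected directly from the dataset directories. The snippet below is only a minimal sketch, not the repository's loading code: the file name and the annotation field names (mistake_agent, mistake_step, mistake_reason) are assumptions for illustration and should be checked against the dataset card.

import json
from pathlib import Path

# Minimal sketch for inspecting one annotated failure log.
# The file name and field names below are illustrative assumptions;
# check the Hugging Face dataset card for the exact schema.
log_path = Path("../Who&When/Algorithm-Generated") / "1.json"  # assumed file name
with open(log_path) as f:
    record = json.load(f)

print(record.get("mistake_agent"))   # assumed field: failure-responsible agent (who)
print(record.get("mistake_step"))    # assumed field: decisive error step (when)
print(record.get("mistake_reason"))  # assumed field: natural-language explanation (why)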

💡 Evaluations

Requirements

To install requirements:

pip install -r requirements.txt

Inference

Before running inference, make sure the automated failure attribution method is specified via the --method argument described below.

  • Models: the following models are supported (model name : command-line argument):

    • GPT-4o : --model gpt-4o
    • GPT-4 : --model gpt4
    • GPT-4o-mini : --model gpt4o-mini
    • Llama-3.1-8B-Instruct : --model llama-8b
    • Llama-3.1-70B-Instruct : --model llama-70b
    • Qwen2.5-7B-Instruct : --model qwen-7b
    • Qwen2.5-72B-Instruct : --model qwen-72b

Run

python inference.py --method #METHOD --model #MODEL --is_handcrafted #DATA --directory_path #PATH

where:

  • --method specifies the failure attribution method (a simplified sketch of the three strategies follows the example below):

    • all_at_once : All-at-Once judging
    • step_by_step : Step-by-Step judging
    • binary_search : Binary Search judging
  • --is_handcrafted specifies the dataset type:

    • True : Use hand-crafted agentic systems
    • False : Use algorithm-generated agentic systems
  • --directory_path specifies the path to the dataset:

    • ../Who&When/Hand-Crafted : Path to hand-crafted systems
    • ../Who&When/Algorithm-Generated : Path to algorithm-generated systems

Example:

python inference.py --method step_by_step --model gpt-4o --is_handcrafted False --directory_path ../Who\&When/Algorithm-Generated
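
For intuition, the following is a simplified, self-contained sketch of how the three judging strategies could query an LLM judge over a failure log. It is not the repository's implementation: the ask_llm callable, the prompt wording, and the answer parsing are placeholders for illustration.

# Simplified illustration of the three judging strategies.
# ask_llm is a placeholder for a chat-completion call; prompts and parsing are illustrative only.

def all_at_once(ask_llm, log):
    # Show the full failure log once and ask directly for the responsible agent and step.
    return ask_llm(
        "Here is the complete log of a failed multi-agent task:\n"
        f"{log}\n"
        "Which agent made the decisive error, and at which step?"
    )

def step_by_step(ask_llm, steps):
    # Reveal the log one step at a time and stop at the first step judged decisive.
    context = ""
    for i, step in enumerate(steps):
        context += f"\nStep {i}: {step}"
        verdict = ask_llm(
            f"Conversation so far:{context}\n"
            f"Is step {i} the decisive error causing the task to fail? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            return i
    return len(steps) - 1  # fall back to the last step if no step is flagged

def binary_search(ask_llm, steps):
    # Repeatedly ask which half of the remaining range contains the decisive error,
    # halving the range until a single step remains.
    lo, hi = 0, len(steps) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        answer = ask_llm(
            f"Steps {lo}-{hi} of a failed multi-agent log:\n"
            + "\n".join(f"Step {j}: {s}" for j, s in enumerate(steps[lo:hi + 1], start=lo))
            + f"\nIs the decisive error in steps {lo}-{mid} (answer 'first') "
            + f"or in steps {mid + 1}-{hi} (answer 'second')?"
        )
        if answer.strip().lower().startswith("first"):
            hi = mid
        else:
            lo = mid + 1
    return lo

In this sketch, all-at-once uses a single call over the entire log, step-by-step uses up to one call per step (with shorter context per call), and binary search uses roughly log2(n) calls per failure log.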

Evaluation

After inference, you can evaluate the results. By default, they are stored in the outputs directory.

Example:

python evaluate.py --data_path ../Who\&When/Algorithm-Generated --eval_file  outputs/step_by_step_gpt-4o_alg_generated.txt
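
Evaluation compares the predicted responsible agent and decisive step against the Who&When annotations. The snippet below is a minimal sketch of how agent-level and step-level accuracy could be computed; the dictionary-based prediction format and field names are assumptions for illustration, not the schema written by inference.py or expected by evaluate.py.

# Minimal sketch of agent-level and step-level accuracy.
# The "agent"/"step" keys are an illustrative format, not the repository's actual schema.
def attribution_accuracy(predictions, ground_truths):
    agent_hits = step_hits = 0
    for pred, gt in zip(predictions, ground_truths):
        agent_hits += pred["agent"] == gt["agent"]  # who: correct agent identified
        step_hits += pred["step"] == gt["step"]     # when: correct decisive step identified
    n = len(ground_truths)
    return {"agent_accuracy": agent_hits / n, "step_accuracy": step_hits / n}

# Toy usage with hypothetical agent names:
preds = [{"agent": "WebSurfer", "step": 3}, {"agent": "Orchestrator", "step": 7}]
golds = [{"agent": "WebSurfer", "step": 4}, {"agent": "Orchestrator", "step": 7}]
print(attribution_accuracy(preds, golds))  # {'agent_accuracy': 1.0, 'step_accuracy': 0.5}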

🧪 Experimental Results

(Result figures: main experiment, ablation study, and model analysis.)

More results can be found in the paper.

📖 Reference

Important

If you find it useful, please consider citing our work:

@article{zhang2025agent,
  title={Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems},
  author={Zhang, Shaokun and Yin, Ming and Zhang, Jieyu and Liu, Jiale and Han, Zhiguang and Zhang, Jingyang and Li, Beibin and Wang, Chi and Wang, Huazheng and Chen, Yiran and others},
  journal={arXiv preprint arXiv:2505.00212},
  year={2025}
}