Tool Failure Crashes Entire ADK Multi-Agent Workflow #795

nikolaidk · 2025-05-15T09:47:14Z

nikolaidk
May 15, 2025

Maintainer's comment: we'd like to seek opinions on this topic from the community.

Check out #795 (comment) for the poll and share your opinion.

Original content

MCP Tool Failure Crashes Entire ADK Multi-Agent Workflow

When an MCP tool fails during execution (not connection), it propagates as an unhandled exception that crashes the entire ADK agent workflow, stopping all subsequent agents in a SequentialAgent pipeline.

Environment
ADK Version: Latest (using google.adk.agents, google.adk.tools.mcp_tool)
Python Version: 3.12
MCP Library Version: Latest compatible with current ADK implementation
Operating System: Linux 5.15

Problem Description
While ADK provides good error handling for MCP server connection failures, runtime MCP tool failures (like "Resource not found") propagate as unhandled McpError exceptions that crash the entire multi-agent workflow.

Expected Behavior
Individual MCP tool failures should not crash the entire agent workflow
Agents should be able to handle tool failures gracefully and continue execution
Sequential agents should continue to subsequent agents even if one tool fails
The framework should provide built-in resilience mechanisms for MCP tool failures

Actual Behavior
Single MCP tool failure crashes the entire SequentialAgent workflow
No opportunity for graceful degradation or alternative approaches
Complete loss of partial results from successful agents
Workflow stops executing without running subsequent agents

Steps to Reproduce
1. Create a multi-agent workflow using SequentialAgent
2. Include an MCP tool that may fail (e.g., GitHub file access with invalid path)
3. Configure the agent to use the MCP tool
4. Run the workflow with inputs that will cause the MCP tool to fail

Minimal Reproducible Example
python
import asyncio
from google.adk.agents import SequentialAgent, LlmAgent
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters

async def create_failing_workflow():
    # Setup GitHub MCP tools
    git_tools, git_exit_stack = await MCPToolset.from_server(
        connection_params=StdioServerParameters(
            command='npx',
            args=["-y", "@modelcontextprotocol/server-github"],
            env={"GITHUB_PERSONAL_ACCESS_TOKEN": "your_token"}
        )
    )

    # Create agent that will use failing MCP tool
    failing_agent = LlmAgent(
        name="FailingAgent",
        model="gemini-2.5-pro-preview-05-06",
        instruction="Try to access a non-existent file from the repository",
        tools=git_tools
    )

    # Create workflow with subsequent agents
    # (other_agent1 and other_agent2 are assumed to be defined elsewhere)
    workflow = SequentialAgent(
        name="TestWorkflow",
        sub_agents=[failing_agent, other_agent1, other_agent2]
    )

    return workflow, git_exit_stack

# Run with input that causes MCP tool to fail
# Result: Entire workflow crashes, other_agent1 and other_agent2 never execute

Error Log
mcp.shared.exceptions.McpError: Not Found: Resource not found: Not Found
  File "/home/nmr/.venv/lib/python3.12/site-packages/google/adk/tools/mcp_tool/mcp_tool.py", line 126, in run_async
    raise e
  File "/home/nmr/.venv/lib/python3.12/site-packages/google/adk/tools/mcp_tool/mcp_tool.py", line 122, in run_async
    response = await self.mcp_session.call_tool(self.name, arguments=args)
  File "/home/nmr/.venv/lib/python3.12/site-packages/mcp/client/session.py", line 265, in call_tool
    return await self.send_request(
  File "/home/nmr/.venv/lib/python3.12/site-packages/mcp/shared/session.py", line 273, in send_request
    raise McpError(response_or_error.error)
Current Workarounds
Agent Instruction Level: Explicitly instruct agents to handle tool failures
Wrapper Functions: Create wrapper tools with try/except logic
Alternative Agent Patterns: Use custom agents instead of SequentialAgent
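
The wrapper-function workaround can be sketched roughly as follows. This is a framework-agnostic illustration rather than ADK code: `safe_tool`, `read_file`, and the shape of the error dict are hypothetical names chosen for the example, and it assumes the tool is an async callable.

```python
import asyncio
import functools


def safe_tool(func):
    """Wrap an async tool so exceptions become an error dict instead of propagating."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except Exception as exc:  # broad on purpose: any tool failure is contained
            return {
                "status": "error",
                "tool_name": func.__name__,
                "error_type": type(exc).__name__,
            }

    return wrapper


@safe_tool
async def read_file(path: str) -> dict:
    # Hypothetical tool: stands in for an MCP call that fails at runtime.
    if not path.startswith("/repo/"):
        raise FileNotFoundError(f"Resource not found: {path}")
    return {"status": "ok", "path": path}


if __name__ == "__main__":
    print(asyncio.run(read_file("/missing.txt")))
    print(asyncio.run(read_file("/repo/README.md")))
```

The exception text is deliberately left out of the returned dict, since it may contain paths or other details that should not reach the model; only the exception type name is reported.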
Suggested Solutions

1. Framework-Level Error Handling
Add built-in error handling in MCPTool.run_async():

python
async def run_async(self, args, tool_context):
    try:
        response = await self.mcp_session.call_tool(self.name, arguments=args)
        return response
    except McpError as e:
        # Convert to tool result with error information
        return {
            "error": True,
            "error_type": "mcp_tool_failure",
            "error_message": str(e),
            "tool_name": self.name,
            "suggestions": ["Try alternative tools", "Check connectivity"]
        }

2. SequentialAgent Resilience
Modify SequentialAgent to continue execution despite sub-agent failures:

python
# Add option for fault-tolerant execution
workflow = SequentialAgent(
    name="FaultTolerantWorkflow",
    sub_agents=[agent1, agent2, agent3],
    continue_on_failure=True,  # New parameter
    collect_partial_results=True  # New parameter
)
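
How a continue-on-failure mode might behave can be illustrated without ADK at all. The sketch below is plain asyncio; `run_sequential`, `StepResult`, and the step functions are hypothetical names, and it does not reflect how SequentialAgent is actually implemented.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional


@dataclass
class StepResult:
    name: str
    output: Any = None
    error: Optional[str] = None


async def run_sequential(
    steps: list[tuple[str, Callable[[], Awaitable[Any]]]],
    continue_on_failure: bool = False,
) -> list[StepResult]:
    """Run steps in order; optionally record a failure and keep going."""
    results: list[StepResult] = []
    for name, step in steps:
        try:
            results.append(StepResult(name, output=await step()))
        except Exception as exc:
            if not continue_on_failure:
                raise  # strict mode: the first failure aborts the pipeline
            results.append(StepResult(name, error=type(exc).__name__))
    return results


async def failing_step():
    # Hypothetical step that fails the way a broken tool call would.
    raise RuntimeError("Resource not found")


async def ok_step():
    return "done"


if __name__ == "__main__":
    steps = [("failing", failing_step), ("next", ok_step)]
    for result in asyncio.run(run_sequential(steps, continue_on_failure=True)):
        print(result)
```

Returning one result per step, failed or not, is what would make partial results collectable: the caller sees which steps produced output and which recorded an error.
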
3. Circuit Breaker Pattern
Implement circuit breaker functionality for MCP tools to prevent cascading failures.
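
A circuit breaker in its simplest form might look like the sketch below. It is synchronous for brevity and independent of ADK and MCP; `CircuitBreaker`, `CircuitOpenError`, `always_fails`, and the default thresholds are all hypothetical choices for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure re-opens the circuit.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result


def always_fails():
    # Hypothetical stand-in for a tool call against an unavailable server.
    raise ConnectionError("server unavailable")
```

After `max_failures` consecutive failures the breaker rejects calls immediately instead of contacting the failing server, which is what limits cascading failures; the injectable `clock` exists only to make the timing testable.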

Impact
Severity: High - Crashes entire workflows
Frequency: Common when using external MCP servers
Workaround Complexity: Medium - Requires manual error handling

Additional Context
This issue significantly impacts the reliability of production ADK systems using MCP tools. The current behavior makes it difficult to build robust multi-agent systems that can gracefully handle partial failures.

Related Issues
[Link to any related issues if they exist]

Feature Request
Consider adding:

Built-in error handling options for MCP tools
Fault-tolerant execution modes for multi-agent workflows
Circuit breaker patterns for external tool integrations
Better error propagation and handling documentation

Labels: bug, enhancement, mcp-tools, multi-agent, error-handling

seanzhou1023 · 2025-05-19T21:25:47Z

seanzhou1023
May 19, 2025
Collaborator

I think that's not specific to MCP tools. Any tool failure should not crash the agent. @Jacksunwei could you please take a look at the robustness of agent workflows?

0 replies

Jacksunwei · 2025-05-20T04:42:26Z

Jacksunwei
May 20, 2025
Maintainer

Right now, this is expected behavior. We were thinking about catching all errors and submitting them to the LLM so that it can retry.

However, that approach may unexpectedly leak sensitive data to the LLM.

Hence, the solution we ended up with is to recommend that tool authors handle the error and return a sanitized, meaningful error object like the one below, so that the LLM can retry without sensitive data being leaked to it.

{
  "error": "transient error",
  "detail": "Server transient error, retry with the same parameter"
}
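
One way a tool author might produce such an object is to map known exception types to fixed, pre-written messages, so that raw exception text never reaches the model. The sketch below is only an illustration: `sanitize_error`, `lookup_record`, and the mapping table are hypothetical, and the `error`/`detail` keys simply mirror the example above.

```python
# Fixed, pre-written messages keyed by exception type. Nothing from the
# exception itself (message, traceback, arguments) is copied into the result.
_SAFE_ERRORS = {
    TimeoutError: ("transient error", "Server transient error, retry with the same parameter"),
    FileNotFoundError: ("not found", "The requested resource does not exist"),
    PermissionError: ("permission denied", "The tool is not allowed to access this resource"),
}


def sanitize_error(exc: Exception) -> dict:
    """Return an error dict that is safe to hand to the LLM."""
    for exc_type, (error, detail) in _SAFE_ERRORS.items():
        if isinstance(exc, exc_type):
            return {"error": error, "detail": detail}
    # Unknown failures get a generic message rather than str(exc).
    return {"error": "internal error", "detail": "The tool failed; do not retry"}


def lookup_record(record_id: str) -> dict:
    # Hypothetical tool body: the raised message contains an internal path
    # that should not be shown to the model.
    try:
        raise FileNotFoundError(f"/srv/private/tenant-42/{record_id}.json")
    except Exception as exc:
        return sanitize_error(exc)
```

Here `lookup_record("abc")` reports only that the resource was not found; the internal path in the exception message is dropped.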

In the meantime, I'm also interested in what people think about this topic. I'm converting this to a discussion so more people can comment.

4 replies

Danau5tin May 21, 2025

Hey @Jacksunwei thank you for this.
Are you able to clarify how to handle the McpError, convert it to a dict, and continue the agent's trajectory, please?

For example, where below would I be able to intercept the error and do the above?

from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types

session_service = InMemorySessionService()
runner = Runner(
    agent=agent,
    app_name=app_name,
    session_service=session_service,
)
session_service.create_session(
    app_name=app_name, user_id=user_id, session_id=session_id
)

content = types.Content(role="user", parts=[types.Part(text=prompt)])
new_events = []
async for event in runner.run_async(
    user_id=user_id, session_id=session_id, new_message=content,
):
    new_events.append(event)
Thank you in advance.

Danau5tin May 21, 2025

Ah I found a way.
I used AfterToolCallback like so:

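from typing import Optional

from google.adk.agents import LlmAgent
from google.adk.tools.base_tool import BaseTool
from google.adk.tools.tool_context import ToolContext
from mcp.types import CallToolResult, TextContent

async def test_after_callback(
    tool: BaseTool,
    tool_response: CallToolResult,
    tool_context: ToolContext,
    args: dict,
) -> Optional[dict]:
    # Turn an MCP error result into a plain dict the LLM can read
    if tool_response.isError:
        content: TextContent = tool_response.content[0]
        return {"status": "error", "msg": content.text}

    # Returning None keeps the original tool response
    return None

LlmAgent(
    model=model,
    name=agent_name,
    tools=tools,
    after_tool_callback=test_after_callback,
)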

Hope this helps others!

lsinghkochava Jun 11, 2025

Surprisingly, this doesn't work anymore: the agent just halts if an error happens, and the after_tool_callback doesn't get called. @Danau5tin was this working on a specific version of the library? I'm on v1.2.1.

Danau5tin Jun 12, 2025

Hi @lsinghkochava, Here is my pip list output:

google-adk                         0.5.0
google-api-core                    2.24.2
google-api-python-client           2.169.0

Jacksunwei · 2025-05-20T04:54:47Z

Jacksunwei
May 20, 2025
Maintainer

In general, which option do you think is best and would like to adopt?

1. Tool author's responsibility to handle errors

The tool author handles all errors within the tool and provides a meaningful error dict for the LLM to proceed.

Pros

Safest bet for data. Agent or tool authors have full control of their data.

Cons

Requires the most work from agent or tool authors compared with the options below.

2. ADK handles basic errors and leaves the rest to the tool author

ADK currently handles only one basic error: a missing parameter. All other errors are deferred to the tool author, as in No.1.

Pros

Still safer than No.3, but parameter names may themselves be sensitive end-user data or leak user integration details in some cases.

Cons

Still requires a significant amount of work from agent or tool authors to handle the majority of errors.

3. Catch all errors, submit them to the LLM automatically, and let the LLM reason and decide the next step

Pros

Easiest to use.
Requires the least amount of work from agent or tool authors.

Cons

Significant risk of leaking sensitive data and server implementation details to the LLM.

Feel free to share your opinions on the 3 options, or suggest other ideas.

1 reply

Danau5tin May 21, 2025

I believe a potential approach is that by default the ADK should send the error back to the LLM for processing.

However, if desired, clients can intercept the McpError and decide whether to:

Edit the payload and provide it to the LLM as part of the same trajectory
Stop execution of this trajectory (e.g., because the returned error is fatal)

What do you think of this?
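
A rough sketch of what this proposal could look like, independent of ADK: by default the error text goes back to the LLM, while an optional interceptor may rewrite the payload or stop the run. All names here (`call_tool`, `on_error`, `StopTrajectory`, `redact_or_stop`, and the two demo tools) are hypothetical and are not part of any existing API.

```python
from typing import Callable, Optional


class StopTrajectory(Exception):
    """Raised by an interceptor to end the current run (hypothetical)."""


# An interceptor receives the exception and returns the payload for the LLM,
# or raises StopTrajectory if the error is considered fatal.
ErrorInterceptor = Callable[[Exception], dict]


def call_tool(tool: Callable[..., dict], args: dict,
              on_error: Optional[ErrorInterceptor] = None) -> dict:
    try:
        return tool(**args)
    except Exception as exc:
        if on_error is None:
            # Default: hand the raw error text to the LLM.
            return {"error": str(exc)}
        return on_error(exc)


def redact_or_stop(exc: Exception) -> dict:
    if isinstance(exc, PermissionError):
        raise StopTrajectory("fatal tool error") from exc
    # Non-fatal: replace the raw message with an edited payload.
    return {"error": "tool failed", "retryable": True}


def get_file(path: str) -> dict:
    # Hypothetical tool that always fails, for demonstration.
    raise FileNotFoundError(f"no such file: {path}")


def get_secret(path: str) -> dict:
    # Hypothetical tool whose failure should be treated as fatal.
    raise PermissionError("access denied")
```

With no interceptor, `call_tool(get_file, {"path": "a.txt"})` returns the raw message as an error dict; with `on_error=redact_or_stop`, the same failure yields the edited payload, and the `get_secret` failure ends the run by raising `StopTrajectory`.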

Tavernari · 2025-06-11T17:04:49Z

Tavernari
Jun 11, 2025

Regarding tools: if a tool determines that it's safe not to crash, it can handle the error gracefully by returning a JSON object or an error message with an appropriate indicator.

However, in my case, I'm testing the agent flow using smaller models, and they often attempt to call incorrect tools, which ends up "crashing" the flow.
In my opinion, it should simply return something like "invalid tool" or "tool not found". Instead, it's currently breaking the flow entirely—and it's incredibly frustrating.
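
The behavior described here, returning "tool not found" instead of raising, could be sketched as below. This is a framework-agnostic illustration, not ADK's dispatch code; `dispatch_tool_call`, `get_weather`, and `TOOLS` are hypothetical names.

```python
from typing import Callable


def dispatch_tool_call(tools: dict[str, Callable[..., dict]], name: str, args: dict) -> dict:
    """Look up a tool by name; report an unknown name as a result, not an exception."""
    tool = tools.get(name)
    if tool is None:
        return {
            "error": "tool not found",
            "requested": name,
            "available_tools": sorted(tools),
        }
    return tool(**args)


def get_weather(city: str) -> dict:
    # Hypothetical tool used only for the demonstration.
    return {"city": city, "forecast": "sunny"}


TOOLS = {"get_weather": get_weather}
```

Listing the available tool names in the error gives a smaller model a chance to correct a misspelled or hallucinated name on its next turn.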

0 replies

lsinghkochava · 2025-06-12T10:06:55Z

lsinghkochava
Jun 12, 2025

I tried @Danau5tin's fix, but unfortunately it didn’t work for me. The after_tool_callback doesn’t seem to catch any exceptions thrown by the MCP tools. While it makes sense for tools to return structured error responses, we can't always rely on that behavior. It would be really helpful if we could still catch tool errors in a callback and decide whether to continue or stop the workflow execution based on the context.

1 reply

lsinghkochava Jun 13, 2025

Ok I hadn't realised that we're able to intercept the tool call with before_tool_callback and can add our error handling logic there.

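async def before_tool_call(tool: BaseTool, args: dict, tool_context: ToolContext):
    # Run the tool here so its exceptions can be caught; returning a dict
    # from before_tool_callback is used as the tool result.
    try:
        return await tool.run_async(args=args, tool_context=tool_context)
    except ToolCallError as e:  # ToolCallError is not defined in this snippet; use the exception type your tools raise
        return {"error": str(e)}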
