A systems-level evaluation framework for comparing how different agent communication protocols behave under real-world constraints: latency, failures, coordination overhead, and observability.
Goal: Determine when to use tool-based agents (MCP) and when to use multi-agent communication (A2A).
Agent systems are often evaluated on "happy path" scenarios. In production:
- Tools time out
- Agents crash
- Failures cascade
- Coordination becomes complex
- Debugging is hard
This project deliberately injects failures to compare how different protocols handle real-world friction.
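As a rough illustration of what failure injection can look like, the sketch below wraps a callable and makes a configurable fraction of calls fail. The names `FailureInjector`, `InjectedTimeout`, and `fetch_user` are hypothetical and are not taken from the project's `src/failure/` module; the 30% rate mirrors the rate mentioned for resilience testing.

```python
import random
from typing import Callable, List, Optional


class InjectedTimeout(Exception):
    """Raised by the wrapper to simulate a tool or agent timing out."""


class FailureInjector:
    """Wraps a callable and fails a configurable fraction of its calls."""

    def __init__(self, failure_rate: float, seed: Optional[int] = None) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0.0 and 1.0")
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so runs are reproducible
        self.injected = 0  # number of failures injected so far

    def wrap(self, fn: Callable[..., object]) -> Callable[..., object]:
        def wrapped(*args, **kwargs):
            # Decide before calling the real function whether this call fails.
            if self._rng.random() < self.failure_rate:
                self.injected += 1
                raise InjectedTimeout("injected failure")
            return fn(*args, **kwargs)

        return wrapped


def fetch_user(user_id: str) -> dict:
    # Hypothetical mock tool standing in for a real external call.
    return {"user_id": user_id, "name": "example"}


injector = FailureInjector(failure_rate=0.3, seed=42)
flaky_fetch = injector.wrap(fetch_user)

outcomes: List[str] = []
for _ in range(1000):
    try:
        flaky_fetch("u-1")
        outcomes.append("ok")
    except InjectedTimeout:
        outcomes.append("failed")
```

Keeping the injector outside the tool itself means the same tool can run with or without failures, which is what makes a baseline-versus-injected comparison possible.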
- Agent executes structured tool calls
- Deterministic, schema-validated
- Strength: Control, predictability
- Weakness: Limited flexibility
- Agents communicate with each other
- Task decomposition across multiple agents
- Strength: Flexibility, scalability
- Weakness: Coordination complexity, harder to debug
- Agents use tools + communicate with peers
- Best of both worlds (or worst?)
Use Case: User Data Enrichment Pipeline
Agent → Fetch User Data → Call External API → Aggregate Result
Focus: Tool latency, schema validation, retry behavior
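The retry behavior this scenario focuses on could be sketched as follows. This is a minimal sketch rather than the project's implementation: `ToolTimeout`, `call_with_retry`, and `flaky_external_api` are hypothetical names, and the delay schedule (`base_delay * 2**attempt`) is one common form of exponential backoff. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class ToolTimeout(Exception):
    """Hypothetical error type for a tool call that timed out."""


def call_with_retry(
    tool: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `tool`, retrying on ToolTimeout with delays of base_delay * 2**attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return tool()
        except ToolTimeout:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Usage: a mock external API that times out twice, then succeeds.
delays: List[float] = []
calls = {"count": 0}


def flaky_external_api() -> dict:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ToolTimeout("simulated timeout")
    return {"status": "ok"}


# Record the requested delays instead of sleeping.
result = call_with_retry(flaky_external_api, sleep=delays.append)
```

Here the call succeeds on the third attempt, and `delays` records the two backoff intervals that would have been slept.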
Use Case: Incident Investigation
Planner → Log Analyzer → Metrics Analyzer → Summary Agent
Focus: Message passing overhead, coordination delays, state consistency
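To make the message-passing overhead concrete, the sketch below chains the agents from the pipeline above and records every hop. It assumes a simple in-memory registry; `Registry`, `Message`, and the handler functions are hypothetical and do not reflect the project's actual `agent_registry.py` interface, and the findings strings are placeholder data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    sender: str
    recipient: str
    payload: dict


@dataclass
class Registry:
    """Hypothetical in-memory registry mapping agent names to handlers."""

    handlers: Dict[str, Callable[[dict], dict]] = field(default_factory=dict)
    log: List[Message] = field(default_factory=list)

    def register(self, name: str, handler: Callable[[dict], dict]) -> None:
        self.handlers[name] = handler

    def send(self, sender: str, recipient: str, payload: dict) -> dict:
        # Record every hop so coordination cost can be counted afterwards.
        self.log.append(Message(sender, recipient, payload))
        return self.handlers[recipient](payload)


def log_analyzer(state: dict) -> dict:
    return {**state, "log_findings": "error spike"}


def metrics_analyzer(state: dict) -> dict:
    return {**state, "metric_findings": "p99 latency up"}


def summary_agent(state: dict) -> dict:
    summary = f"{state['log_findings']}; {state['metric_findings']}"
    return {**state, "summary": summary}


registry = Registry()
registry.register("log_analyzer", log_analyzer)
registry.register("metrics_analyzer", metrics_analyzer)
registry.register("summary_agent", summary_agent)


def run_investigation(incident_id: str) -> dict:
    """Pass accumulated state down the chain, one message per hop."""
    state = {"incident_id": incident_id}
    sender = "planner"
    for agent in ("log_analyzer", "metrics_analyzer", "summary_agent"):
        state = registry.send(sender, agent, state)
        sender = agent
    return state


report = run_investigation("inc-1")
message_count = len(registry.log)
```

Because each agent receives the full accumulated state, a crash at any hop stops the chain, which is the cascading behavior this scenario is meant to expose.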
Use Case: Security Anomaly Detection with Partial Failure Tolerance
Log Agent (0.0-1.0) ─┐
Metrics Agent (0.0-1.0) ├→ Parallel Execution → Aggregator → Decision
Behavior Agent (0.0-1.0) ┘
Focus: Partial failures, graceful degradation, decision under uncertainty
Test Results:
| Failure Scenario | Decision | Confidence |
|---|---|---|
| 3.1 No failures | HIGH_RISK | 100% |
| 3.2 One agent timeout | HIGH_RISK | 66% |
| 3.3 One agent exception | HIGH_RISK | 66% |
| 3.4 Two agents fail | MEDIUM_RISK | 33% |
Key Finding: System continues producing decisions even with 2/3 agents failing.
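One way to get results shaped like the table above is sketched below: run the agents concurrently, drop any that time out or raise, and derive confidence from the fraction that reported. The decision thresholds (0.7, 0.4) and the rule that caps the decision at `MEDIUM_RISK` when fewer than half the agents report are assumptions chosen for illustration, not the project's actual aggregation logic, and the agent functions are hypothetical mocks.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

AgentFn = Callable[[], Awaitable[float]]


async def healthy_agent() -> float:
    return 0.9  # hypothetical risk score in the 0.0-1.0 range


async def slow_agent() -> float:
    await asyncio.sleep(1.0)  # longer than the timeout used below
    return 0.9


async def crashing_agent() -> float:
    raise RuntimeError("simulated agent exception")


async def run_agents(agents: Dict[str, AgentFn], timeout: float) -> Dict[str, float]:
    """Run all agents concurrently and keep only the scores that arrived."""
    results = await asyncio.gather(
        *(asyncio.wait_for(fn(), timeout) for fn in agents.values()),
        return_exceptions=True,  # a failing agent must not cancel its peers
    )
    return {
        name: res
        for name, res in zip(agents, results)
        if not isinstance(res, BaseException)
    }


def aggregate(scores: Dict[str, float], total_agents: int) -> Tuple[str, int]:
    """Return (decision, confidence %) from whichever scores survived."""
    if not scores:
        return "NO_DECISION", 0
    confidence = 100 * len(scores) // total_agents
    mean = sum(scores.values()) / len(scores)
    if mean >= 0.7:
        decision = "HIGH_RISK"
    elif mean >= 0.4:
        decision = "MEDIUM_RISK"
    else:
        decision = "LOW_RISK"
    # Assumed rule: with fewer than half the agents reporting, cap at MEDIUM_RISK.
    if confidence < 50 and decision == "HIGH_RISK":
        decision = "MEDIUM_RISK"
    return decision, confidence


def evaluate(agents: Dict[str, AgentFn], timeout: float = 0.05) -> Tuple[str, int]:
    scores = asyncio.run(run_agents(agents, timeout))
    return aggregate(scores, len(agents))


one_timeout = evaluate(
    {"log": healthy_agent, "metrics": slow_agent, "behavior": healthy_agent}
)
```

The key design choice is `return_exceptions=True`: failures come back as values to be filtered rather than as exceptions that abort the whole batch.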
- Performance: Latency (p50, p95, p99), throughput
- Reliability: Success rate, retry count, failure propagation depth
- Coordination Cost: Number of messages, execution steps
- Observability: Traceability, decision clarity, debugging complexity
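For reference, the performance and reliability metrics above can be computed from raw samples along these lines. This is a sketch using the nearest-rank percentile definition; the project's reporter may use a different method, and `percentile` and `summarize` are hypothetical names.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float], successes: Sequence[bool]) -> Dict[str, float]:
    """Collapse per-request samples into the headline numbers."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "success_rate": sum(successes) / len(successes),
    }


# Usage with synthetic data: latencies of 1..100 ms, 3 failed requests.
latencies = [float(i) for i in range(1, 101)]
outcomes = [True] * 97 + [False] * 3
summary = summarize(latencies, outcomes)
```

Reporting p95 and p99 alongside the median matters here because injected timeouts and retries mostly show up in the tail, not in the average.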
Client Request
↓
API Gateway (validates, assigns request_id)
↓
Agent Runtime (orchestration engine)
├→ Planner (decide next action)
├→ Protocol Layer (MCP/A2A adapter)
├→ Agent Registry (manage agents)
└→ Tool Services (execute tools)
↓
Observability Layer (structured logging)
↓
Failure Injection Layer (simulate real-world issues)
- Protocol-agnostic runtime - Both MCP and A2A run through the same engine
- Execution as state machine - Clear, testable flow
- Typed data contracts - Pydantic models for all data
- Failure-first design - Expect failures, handle them explicitly
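The first two principles can be illustrated with a small plan-execute-update loop in which the engine never inspects which protocol it is driving. This is a sketch under assumed interfaces: `Action`, `Context`, `run`, and the adapter and planner signatures are hypothetical, and plain dataclasses stand in for the project's Pydantic models to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    protocol: str  # "mcp" or "a2a" in this sketch
    target: str    # tool or agent name
    payload: dict


@dataclass
class Context:
    request_id: str
    state: dict = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


Adapter = Callable[[Action, Context], dict]
Planner = Callable[[Context], Optional[Action]]


def run(ctx: Context, planner: Planner, adapters: Dict[str, Adapter], max_steps: int = 10) -> Context:
    """PLAN -> EXECUTE -> UPDATE until the planner returns None (done)."""
    for step in range(max_steps):
        action = planner(ctx)
        if action is None:
            ctx.trace.append(f"{ctx.request_id} step={step} done")
            return ctx
        # The engine only looks up an adapter by name; it has no protocol logic.
        result = adapters[action.protocol](action, ctx)
        ctx.state.update(result)
        ctx.trace.append(f"{ctx.request_id} step={step} {action.protocol}:{action.target}")
    raise RuntimeError("max_steps exceeded without the planner finishing")


def rule_based_planner(ctx: Context) -> Optional[Action]:
    # Deterministic rules: the next action depends only on what is in state.
    if "user" not in ctx.state:
        return Action("mcp", "fetch_user", {"user_id": "u-1"})
    if "analysis" not in ctx.state:
        return Action("a2a", "log_analyzer", {"user": ctx.state["user"]})
    return None


def mcp_adapter(action: Action, ctx: Context) -> dict:
    return {"user": {"id": action.payload["user_id"]}}  # mock tool result


def a2a_adapter(action: Action, ctx: Context) -> dict:
    return {"analysis": f"analyzed {action.payload['user']['id']}"}  # mock agent reply


ctx = run(Context("req-1"), rule_based_planner, {"mcp": mcp_adapter, "a2a": a2a_adapter})
```

Because every step appends to `ctx.trace` with the same `request_id`, the trace doubles as the test oracle: a scenario passes when the recorded transitions match the expected sequence.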
- Python 3.9+
- pip / venv
# Clone or set up the repo
cd agentic-protocols
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run all scenarios (1, 2, 3) with full trace
python src/main.py
# Run performance benchmarks (5 iterations each)
python src/main.py --benchmark
# Run full scenarios + benchmarks
python src/main.py --full
# Run resilience testing (baseline + 30% failure injection)
python src/benchmark/resilience_benchmarks.py
Quick Performance Summary:
MCP (Scenario 1):
- Avg latency: 0.14 ms
- Success rate: 100.0 %
A2A (Scenario 2):
- Avg latency: 0.03 ms
- Success rate: 100.0 %
Distributed (Scenario 3):
- Avg latency: 0.22 ms
- Success rate: 100.0 %
📊 See BENCHMARKING_GUIDE.md for:
- Complete baseline and failure injection results
- Performance vs. resilience trade-offs
- Detailed recommendations by use case
📖 See BENCHMARK_FORMAT_GUIDE.md for:
- How to interpret benchmark output
- Custom benchmark configurations
- Integration into your own reports
agentic-protocols/
├── docs/
│ ├── project-plan.md # Project roadmap & milestones
│ ├── architecture.md # High-level design (HLD)
│ ├── lld-agent-runtime.md # Low-level design (LLD)
│ ├── scenarios.md # Test scenarios & failure strategies
│ └── IMPLEMENTATION_SUMMARY.md # Complete implementation status
│
├── src/
│ ├── runtime/
│ │ ├── context.py # ExecutionContext (state + trace)
│ │ └── engine.py # ExecutionEngine (main loop)
│ │
│ ├── actions/
│ │ ├── base.py # Action abstraction
│ │ ├── tool_action.py # Execute tools (MCP)
│ │ ├── agent_action.py # Agent calls (A2A)
│ │ ├── parallel_action.py # Parallel execution (Scenario 3)
│ │ ├── noop_action.py # End signal
│ │ └── result.py # Standardized results
│ │
│ ├── planner/
│ │ ├── simple_planner.py # Scenario 1: Tool execution
│ │ ├── multi_agent_planner.py # Scenario 2: Agent coordination
│ │ └── distributed_planner.py # Scenario 3: Parallel + aggregation
│ │
│ ├── protocols/
│ │ ├── mcp_adapter.py # Tool execution adapter
│ │ ├── a2a_adapter.py # Agent-to-agent adapter
│ │ ├── parallel_adapter.py # Parallel execution adapter
│ │ ├── resolver.py # Protocol router with metrics
│ │ ├── observable_wrapper.py # Observability wrapper
│ │ ├── failure_wrapper.py # Failure injection wrapper
│ │ └── metrics_wrapper.py # Metrics collection wrapper
│ │
│ ├── agents/
│ │ ├── base_agent.py # Agent interface
│ │ ├── log_agent.py # Log analysis agent
│ │ ├── metrics_agent.py # Metrics analysis agent
│ │ ├── behavior_agent.py # Behavior detection agent
│ │ └── aggregator_agent.py # Decision aggregation agent
│ │
│ ├── registry/
│ │ └── agent_registry.py # Central agent registry
│ │
│ ├── failure/
│ │ ├── injector.py # Failure injection engine
│ │ ├── policy.py # Failure policies
│ │ └── types.py # Failure type enums
│ │
│ ├── metrics/
│ │ ├── models.py # ActionMetric, ExecutionMetrics
│ │ ├── collector.py # Passive metrics collection
│ │ └── reporter.py # Metrics aggregation & reporting
│ │
│ ├── benchmark/
│ │ └── runner.py # Multi-iteration benchmarking
│ │
│ ├── observability/
│ │ ├── logger.py # Structured logging
│ │ └── tracer.py # Trace correlation
│ │
│ └── main.py # Entry point for all scenarios
│
├── requirements.txt # Python dependencies
├── README.md # This file
└── .gitignore
- ✅ Execution engine with state management
- ✅ Typed data contracts (Pydantic)
- ✅ Protocol adapter abstraction
- ✅ Structured observability with trace correlation
- ✅ Tool execution (3 failure modes: timeout, exception, success)
- ✅ Retry logic with exponential backoff
- ✅ End-to-end testing with 3 test cases
- ✅ Baseline metrics collection
- ✅ Multi-agent framework with agent registry
- ✅ LogAnalyzerAgent, MetricsAgent examples
- ✅ Agent-to-agent message passing
- ✅ 3 test cases with failure injection
- ✅ Parallel action execution with ParallelAdapter
- ✅ Graceful degradation under partial failures
- ✅ BehaviorAgent, AggregatorAgent (decision logic)
- ✅ 4 test cases covering all failure combinations
- ✅ Passive metrics collection (MetricsCollector)
- ✅ Metrics wrapper composition in protocol stack
- ✅ Performance reporter with scenario comparison
- ✅ Benchmark runner (configurable iterations)
- ✅ Latency, success rate, retry tracking
- ✅ Error handling across all protocols
- ✅ Comprehensive trace logs
- ✅ Benchmark CLI: `--benchmark`, `--full`
- ✅ Documentation & implementation summary
- Validates architecture before adding complexity
- MCP is simpler - easier to debug and prove the system works
- Faster iteration - mock tools are trivial to implement
- No ML/LLM overhead - focuses on system design, not AI
- Deterministic - easier to test and reason about
- Extensible - can replace with LLM-based planner later
- Type safety - catches data shape errors early
- Serialization - easy to log and trace
- Validation - schema contracts enforced
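The three benefits of typed contracts listed last can be seen in a small example. The project uses Pydantic; to keep this sketch free of third-party dependencies it substitutes a stdlib dataclass with hand-written checks, so it only approximates what a Pydantic model provides. `ToolResult` and its fields are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolResult:
    """Hypothetical result contract; field names are illustrative."""

    tool: str
    success: bool
    latency_ms: float

    def __post_init__(self) -> None:
        # Hand-rolled checks standing in for Pydantic's validation.
        if not isinstance(self.tool, str) or not self.tool:
            raise TypeError("tool must be a non-empty string")
        if not isinstance(self.success, bool):
            raise TypeError("success must be a bool")
        if isinstance(self.latency_ms, bool) or not isinstance(self.latency_ms, (int, float)):
            raise TypeError("latency_ms must be a number")
        if self.latency_ms < 0:
            raise ValueError("latency_ms must be non-negative")


# Type safety and validation: a well-formed result constructs cleanly.
ok = ToolResult(tool="fetch_user", success=True, latency_ms=0.14)

# Serialization: the same object becomes a structured log line.
log_line = json.dumps(asdict(ok), sort_keys=True)

# A malformed result is rejected at construction, not deep inside the engine.
try:
    ToolResult(tool="fetch_user", success="yes", latency_ms=0.14)
    rejected = False
except TypeError:
    rejected = True
```

Rejecting bad shapes at the boundary keeps injected failures distinguishable from accidental data bugs when reading traces.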
This is not just another agent framework because:
- Evaluates failures, not just success - Most frameworks ignore degradation
- Protocol comparison - Systematic comparison across communication models
- Production constraints - Focuses on latency, coordination, observability
- Systems thinking - Treats agents as a distributed systems problem
- Actionable insights - Decision framework: Use MCP when X, A2A when Y
- Project Plan - Objectives, scope, timeline
- Architecture - System design & components
- LLD: Agent Runtime - Detailed runtime design
- Scenarios & Failures - Test plans & evaluation strategy
Current focus: Scenario-driven development
- Implement Scenario 1 end-to-end ✅
- Add failure injection to Scenario 1
- Implement Scenario 2 (A2A)
- Compare across scenarios
Open source - educational and research purposes.
- When should MCP be used vs. A2A vs. Distributed?
  - MCP: Simple, fast (0.06ms), low overhead, fails linearly
  - A2A: Flexible, lowest measured latency (0.02ms), agent composition
  - Distributed: Complex, highest latency (0.37ms), graceful degradation
- What breaks first in each model?
  - MCP/A2A: A single point of failure stops everything
  - Distributed: 2 of 3 agents can fail and the system continues
- How does failure propagate differently?
  - MCP: Cascading (one tool failure = whole chain fails)
  - A2A: Cascading (one agent crash = orchestration fails)
  - Distributed: Isolated (failing agents tracked, others continue)
- Which approach is easier to debug?
  - MCP: Simplest (fewest hops, clear trace)
  - A2A: Moderate (agent registry complexity)
  - Distributed: Hardest (parallel execution, aggregation logic)
- What is the coordination cost?
  - MCP: 0ms (direct tool calls)
  - A2A: ~2x overhead from agent lookup + delegation
  - Distributed: ~6x overhead from parallel + aggregation
- How does latency affect each protocol?
  - All three operate in the sub-millisecond range
  - Distributed is acceptable for decision systems where latency is not critical
  - MCP is preferred for latency-sensitive operations
| Metric | MCP | A2A | Distributed |
|---|---|---|---|
| Avg Latency | 0.06ms | 0.02ms | 0.37ms |
| Latency Ratio | 1x | 0.33x | 6x |
| Success Rate | 100% | 100% | 100% |
| Partial Failure Tolerance | ❌ | ❌ | ✅ |
| Max Agents | 1 | N | 3+ |
| Complexity | Low | Medium | High |
Key Insight: "Agent systems don't fail linearly—they fail probabilistically and collectively. Distributed systems accept higher latency for resilience."