Agentic Protocols: Production-Inspired Evaluation System

A systems-level evaluation framework for comparing how different agent communication protocols behave under real-world constraints: latency, failures, coordination overhead, and observability.

Goal: Determine when to use tool-based agents (MCP) and when to use multi-agent communication (A2A).

🎯 The Problem

Agent systems are often evaluated on "happy path" scenarios. In production:

Tools timeout
Agents crash
Failures cascade
Coordination becomes complex
Debugging is hard

This project deliberately injects failures to compare how different protocols handle real-world friction.

🔬 What We're Comparing

Model Context Protocol (MCP) - Tool-Oriented

Agent executes structured tool calls
Deterministic, schema-validated
Strength: Control, predictability
Weakness: Limited flexibility
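
To make "schema-validated" concrete, here is a minimal sketch of checking a structured tool call against a declared schema. The project uses Pydantic for its contracts; this sketch uses only the standard library to stay dependency-free, and `ToolCall`, `TOOL_SCHEMAS`, `validation_errors`, and the `fetch_user` tool are hypothetical names, not the project's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical schema registry: tool name -> required argument names and types.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "fetch_user": {"user_id": int},
}


@dataclass(frozen=True)
class ToolCall:
    """A structured tool call: a tool name plus keyword arguments."""
    tool: str
    args: Dict[str, Any] = field(default_factory=dict)


def validation_errors(call: ToolCall) -> List[str]:
    """Return a list of schema violations; an empty list means the call is valid."""
    schema = TOOL_SCHEMAS.get(call.tool)
    if schema is None:
        return [f"unknown tool: {call.tool}"]
    errors: List[str] = []
    for name in sorted(set(schema) - set(call.args)):
        errors.append(f"missing argument: {name}")
    for name in sorted(set(call.args) - set(schema)):
        errors.append(f"unexpected argument: {name}")
    for name, expected in schema.items():
        if name in call.args and not isinstance(call.args[name], expected):
            errors.append(f"{name} must be {expected.__name__}")
    return errors
```

Rejecting a malformed call before it reaches the tool is what gives the tool-oriented model its predictability: the failure is reported at the boundary rather than somewhere inside the tool.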

Agent-to-Agent (A2A) - Multi-Agent

Agents communicate with each other
Task decomposition across multiple agents
Strength: Flexibility, scalability
Weakness: Coordination complexity, harder to debug

Hybrid

Agents use tools + communicate with peers
Best of both worlds (or worst?)

📊 Three Test Scenarios

Scenario 1: Tool-Oriented Execution (MCP Strength)

Use Case: User Data Enrichment Pipeline

Agent → Fetch User Data → Call External API → Aggregate Result

Focus: Tool latency, schema validation, retry behavior
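
The retry behavior this scenario exercises can be sketched as follows. This is an illustration under assumptions, not the project's implementation: `call_with_retry`, its parameters, and the choice to retry only on `TimeoutError` are hypothetical.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def call_with_retry(
    tool: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, int]:
    """Call `tool`, retrying on TimeoutError with exponentially growing delays.

    Returns (result, retries_used). The last failure is re-raised unchanged.
    """
    for attempt in range(max_attempts):
        try:
            return tool(), attempt
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * (2 ** attempt))  # delays grow 1x, 2x, 4x, ...
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function testable: a test can record the requested delays instead of actually waiting.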

Scenario 2: Multi-Agent Collaboration (A2A Strength)

Use Case: Incident Investigation

Planner → Log Analyzer → Metrics Analyzer → Summary Agent

Focus: Message passing overhead, coordination delays, state consistency
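
A rough sketch of the sequential hand-off in this scenario, with each hop recorded so message count can be measured. The agent functions and `run_pipeline` are hypothetical stand-ins; the project's agents and message format may look quite different.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, object]
Agent = Callable[[Message], Message]


def run_pipeline(agents: List[Tuple[str, Agent]], request: Message) -> Tuple[Message, List[str]]:
    """Pass a message through each agent in order, recording every hop."""
    hops: List[str] = []
    message = request
    for name, agent in agents:
        hops.append(name)         # one message delivered to this agent
        message = agent(message)  # an exception here aborts the rest of the chain
    return message, hops


# Hypothetical stand-ins for the Scenario 2 agents.
def planner(msg: Message) -> Message:
    return {**msg, "plan": ["logs", "metrics"]}


def log_analyzer(msg: Message) -> Message:
    return {**msg, "log_findings": "error spike"}


def metrics_analyzer(msg: Message) -> Message:
    return {**msg, "metric_findings": "cpu saturation"}


def summary_agent(msg: Message) -> Message:
    return {"summary": f"{msg['log_findings']} + {msg['metric_findings']}"}
```

Because every agent depends on the output of the one before it, a crash at any hop leaves the later agents without input, which is why failures in a chain like this cascade.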

Scenario 3: Distributed Decision System (Differentiator) ✅ COMPLETE

Use Case: Security Anomaly Detection with Partial Failure Tolerance

Log Agent (0.0-1.0) ─┐
Metrics Agent (0.0-1.0) ├→ Parallel Execution → Aggregator → Decision
Behavior Agent (0.0-1.0) ┘

Focus: Partial failures, graceful degradation, decision under uncertainty

Test Results:

Failure Scenario	Decision	Confidence
3.1 No failures	HIGH_RISK	100%
3.2 One agent timeout	HIGH_RISK	66%
3.3 One agent exception	HIGH_RISK	66%
3.4 Two agents fail	MEDIUM_RISK	33%

Key Finding: The system keeps producing decisions even when 2 of 3 agents fail, at reduced confidence (33%) and with the verdict downgraded from HIGH_RISK to MEDIUM_RISK.
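
The degradation pattern in the results table can be sketched with `asyncio`. Everything here is an assumption made for illustration: the thresholds, the `NO_DECISION` and `LOW_RISK` labels, and the rule that confidence is the share of agents that answered were chosen only so the sketch mirrors the table, and the project's aggregator may work differently.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple

# Hypothetical thresholds, chosen only so the sketch mirrors the table above.
RISK_THRESHOLD = 0.7     # mean score at or above this counts as risky
MIN_CONFIDENCE_PCT = 50  # below this, a risky verdict is downgraded

AgentFn = Callable[[], Awaitable[float]]


async def run_agent(agent: AgentFn, timeout: float) -> Optional[float]:
    """Return the agent's score, or None if it times out or raises."""
    try:
        return await asyncio.wait_for(agent(), timeout)
    except Exception:  # timeouts and agent errors both count as "no answer"
        return None


async def decide(agents: Dict[str, AgentFn], timeout: float = 0.05) -> Tuple[str, int]:
    """Run all agents concurrently and aggregate whatever scores come back."""
    if not agents:
        raise ValueError("at least one agent is required")
    scores = await asyncio.gather(*(run_agent(a, timeout) for a in agents.values()))
    answered = [s for s in scores if s is not None]
    confidence = len(answered) * 100 // len(agents)  # share of agents that answered
    if not answered:
        return "NO_DECISION", confidence
    mean = sum(answered) / len(answered)
    if mean < RISK_THRESHOLD:
        return "LOW_RISK", confidence
    if confidence < MIN_CONFIDENCE_PCT:
        return "MEDIUM_RISK", confidence
    return "HIGH_RISK", confidence


async def healthy() -> float:
    return 0.9


async def slow() -> float:
    await asyncio.sleep(1.0)  # longer than the timeout, so this agent is dropped
    return 0.9


async def broken() -> float:
    raise RuntimeError("agent crashed")
```

With these assumed rules, `asyncio.run(decide({"log": healthy, "metrics": slow, "behavior": broken}))` yields `("MEDIUM_RISK", 33)`: one surviving agent is enough to produce a verdict, but not enough to keep it at HIGH_RISK.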

📈 Key Metrics

Performance: Latency (p50, p95, p99), throughput
Reliability: Success rate, retry count, failure propagation depth
Coordination Cost: Number of messages, execution steps
Observability: Traceability, decision clarity, debugging complexity
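
For the latency percentiles, one simple option is the nearest-rank method, sketched below. The function names are hypothetical and the project's reporter may use a different percentile definition (for example, interpolation).

```python
import math
from typing import Dict, List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms: List[float]) -> Dict[str, float]:
    """Summarize a list of latencies (in milliseconds) as p50, p95, and p99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

One caveat worth keeping in mind: with only 5 samples, nearest-rank p95 and p99 both equal the maximum, so tail percentiles only become informative with many more iterations.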

🏗️ Architecture

Core Components

Client Request
    ↓
API Gateway (validates, assigns request_id)
    ↓
Agent Runtime (orchestration engine)
    ├→ Planner (decide next action)
    ├→ Protocol Layer (MCP/A2A adapter)
    ├→ Agent Registry (manage agents)
    └→ Tool Services (execute tools)
    ↓
Observability Layer (structured logging)
    ↓
Failure Injection Layer (simulate real-world issues)
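
As an illustration of what a failure injection layer can look like, here is a minimal wrapper that makes a configurable share of calls fail. `FailureInjector` and its parameters are hypothetical, and raising `TimeoutError` is just one possible failure mode.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class FailureInjector:
    """Wraps a callable so that a configurable share of calls fail."""

    def __init__(self, failure_rate: float, seed: Optional[int] = None) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so experiments are repeatable
        self.injected = 0                # number of failures injected so far

    def wrap(self, fn: Callable[..., T]) -> Callable[..., T]:
        def wrapped(*args, **kwargs):
            # random() is in [0, 1), so rate 0.0 never fails and 1.0 always fails.
            if self._rng.random() < self.failure_rate:
                self.injected += 1
                raise TimeoutError("injected failure")
            return fn(*args, **kwargs)
        return wrapped
```

A 30% injection run would then look like `FailureInjector(0.3, seed=42).wrap(some_tool)`, where `some_tool` is any callable. Keeping injection in a wrapper means the wrapped tool or agent needs no changes.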

Design Principles

Protocol-agnostic runtime - Both MCP and A2A run through the same engine
Execution as state machine - Clear, testable flow
Typed data contracts - Pydantic models for all data
Failure-first design - Expect failures, handle them explicitly
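
A minimal sketch of "execution as state machine" with a protocol-agnostic loop: the engine only asks the planner for the next action and records the outcome, without knowing whether the action is a tool call or an agent call. `State`, `Context`, and `run` are hypothetical names, and returning `None` stands in for an end signal.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class State(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Context:
    """Execution state plus an append-only trace of what happened."""
    request_id: str
    state: State = State.RUNNING
    trace: List[str] = field(default_factory=list)


Action = Callable[[Context], None]
Planner = Callable[[Context], Optional[Action]]  # None means "nothing left to do"


def run(planner: Planner, ctx: Context, max_steps: int = 10) -> Context:
    """Drive the plan/act loop until the planner stops, an action fails, or the budget runs out."""
    for _ in range(max_steps):
        action = planner(ctx)
        if action is None:
            ctx.state = State.SUCCEEDED
            return ctx
        try:
            action(ctx)
        except Exception as exc:  # failure-first: record the error and stop explicitly
            ctx.trace.append(f"error: {exc}")
            ctx.state = State.FAILED
            return ctx
    ctx.trace.append("error: step budget exhausted")
    ctx.state = State.FAILED
    return ctx
```

The step budget is a deliberate choice: a planner that never returns an end signal ends in an explicit FAILED state instead of looping forever.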

🚀 Getting Started

Prerequisites

Python 3.9+
pip / venv

Installation

# Clone or set up the repo
cd agentic-protocols

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Run Scenarios

# Run all scenarios (1, 2, 3) with full trace
python src/main.py

# Run performance benchmarks (5 iterations each)
python src/main.py --benchmark

# Run full scenarios + benchmarks
python src/main.py --full

# Run resilience testing (baseline + 30% failure injection)
python src/benchmark/resilience_benchmarks.py

Benchmark Results

Quick Performance Summary:

MCP (Scenario 1):
  - Avg latency: 0.14 ms
  - Success rate: 100.0 %

A2A (Scenario 2):
  - Avg latency: 0.03 ms
  - Success rate: 100.0 %

Distributed (Scenario 3):
  - Avg latency: 0.22 ms
  - Success rate: 100.0 %

📊 See BENCHMARKING_GUIDE.md for:

Complete baseline and failure injection results
Performance vs. resilience trade-offs
Detailed recommendations by use case

📖 See BENCHMARK_FORMAT_GUIDE.md for:

How to interpret benchmark output
Custom benchmark configurations
Integration into your own reports

📁 Project Structure

agentic-protocols/
├── docs/
│   ├── project-plan.md                # Project roadmap & milestones
│   ├── architecture.md                # High-level design (HLD)
│   ├── lld-agent-runtime.md           # Low-level design (LLD)
│   ├── scenarios.md                   # Test scenarios & failure strategies
│   └── IMPLEMENTATION_SUMMARY.md      # Complete implementation status
│
├── src/
│   ├── runtime/
│   │   ├── context.py                 # ExecutionContext (state + trace)
│   │   └── engine.py                  # ExecutionEngine (main loop)
│   │
│   ├── actions/
│   │   ├── base.py                    # Action abstraction
│   │   ├── tool_action.py             # Execute tools (MCP)
│   │   ├── agent_action.py            # Agent calls (A2A)
│   │   ├── parallel_action.py         # Parallel execution (Scenario 3)
│   │   ├── noop_action.py             # End signal
│   │   └── result.py                  # Standardized results
│   │
│   ├── planner/
│   │   ├── simple_planner.py          # Scenario 1: Tool execution
│   │   ├── multi_agent_planner.py     # Scenario 2: Agent coordination
│   │   └── distributed_planner.py     # Scenario 3: Parallel + aggregation
│   │
│   ├── protocols/
│   │   ├── mcp_adapter.py             # Tool execution adapter
│   │   ├── a2a_adapter.py             # Agent-to-agent adapter
│   │   ├── parallel_adapter.py        # Parallel execution adapter
│   │   ├── resolver.py                # Protocol router with metrics
│   │   ├── observable_wrapper.py      # Observability wrapper
│   │   ├── failure_wrapper.py         # Failure injection wrapper
│   │   └── metrics_wrapper.py         # Metrics collection wrapper
│   │
│   ├── agents/
│   │   ├── base_agent.py              # Agent interface
│   │   ├── log_agent.py               # Log analysis agent
│   │   ├── metrics_agent.py           # Metrics analysis agent
│   │   ├── behavior_agent.py          # Behavior detection agent
│   │   └── aggregator_agent.py        # Decision aggregation agent
│   │
│   ├── registry/
│   │   └── agent_registry.py          # Central agent registry
│   │
│   ├── failure/
│   │   ├── injector.py                # Failure injection engine
│   │   ├── policy.py                  # Failure policies
│   │   └── types.py                   # Failure type enums
│   │
│   ├── metrics/
│   │   ├── models.py                  # ActionMetric, ExecutionMetrics
│   │   ├── collector.py               # Passive metrics collection
│   │   └── reporter.py                # Metrics aggregation & reporting
│   │
│   ├── benchmark/
│   │   └── runner.py                  # Multi-iteration benchmarking
│   │
│   ├── observability/
│   │   ├── logger.py                  # Structured logging
│   │   └── tracer.py                  # Trace correlation
│   │
│   └── main.py                        # Entry point for all scenarios
│
├── requirements.txt                   # Python dependencies
├── README.md                          # This file
└── .gitignore

🔄 Implementation Status

✅ Phase 1: Core Infrastructure

✅ Execution engine with state management
✅ Typed data contracts (Pydantic)
✅ Protocol adapter abstraction
✅ Structured observability with trace correlation

✅ Phase 2: Scenario 1 - MCP (Tool-Based)

✅ Tool execution (3 failure modes: timeout, exception, success)
✅ Retry logic with exponential backoff
✅ End-to-end testing with 3 test cases
✅ Baseline metrics collection

✅ Phase 3: Scenario 2 - A2A (Agent Coordination)

✅ Multi-agent framework with agent registry
✅ LogAnalyzerAgent, MetricsAgent examples
✅ Agent-to-agent message passing
✅ 3 test cases with failure injection

✅ Phase 4: Scenario 3 - Distributed Decisions

✅ Parallel action execution with ParallelAdapter
✅ Graceful degradation under partial failures
✅ BehaviorAgent, AggregatorAgent (decision logic)
✅ 4 test cases covering all failure combinations

✅ Phase 5: Metrics & Benchmarking

✅ Passive metrics collection (MetricsCollector)
✅ Metrics wrapper composition in protocol stack
✅ Performance reporter with scenario comparison
✅ Benchmark runner (configurable iterations)
✅ Latency, success rate, retry tracking
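
Passive collection through a wrapper can be sketched roughly like this. `ActionRecord` and `Collector` are hypothetical names for illustration; they are not the project's `ActionMetric` or `MetricsCollector` classes, whose fields and methods are not shown here.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ActionRecord:
    name: str
    latency_ms: float
    success: bool


class Collector:
    """Records latency and outcome of wrapped calls without changing their behavior."""

    def __init__(self) -> None:
        self.records: List[ActionRecord] = []

    def wrap(self, name: str, fn: Callable) -> Callable:
        def wrapped(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self._record(name, start, success=False)
                raise  # passive: the original exception still reaches the caller
            self._record(name, start, success=True)
            return result
        return wrapped

    def _record(self, name: str, start: float, success: bool) -> None:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.records.append(ActionRecord(name, elapsed_ms, success))

    def success_rate(self) -> float:
        """Fraction of recorded calls that succeeded (0.0 when nothing was recorded)."""
        if not self.records:
            return 0.0
        return sum(r.success for r in self.records) / len(self.records)
```

Because the wrapper re-raises rather than swallowing errors, it can be stacked with other wrappers (such as failure injection or logging) in any order without altering control flow.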

✅ Phase 6: Production-Ready

✅ Error handling across all protocols
✅ Comprehensive trace logs
✅ Benchmark CLI: --benchmark, --full
✅ Documentation & implementation summary

🧪 Design Decisions

Why Start with Scenario 1?

Validates architecture before adding complexity
MCP is simpler - easier to debug and prove the system works
Faster iteration - mock tools are trivial to implement

Why Rule-Based Planner First?

No ML/LLM overhead - focuses on system design, not AI
Deterministic - easier to test and reason about
Extensible - can replace with LLM-based planner later

Why Pydantic?

Type safety - catches data shape errors early
Serialization - easy to log and trace
Validation - schema contracts enforced

🔍 Key Differentiators

This is not just another agent framework because:

Evaluates failures, not just success - Most frameworks ignore degradation
Protocol comparison - Systematic comparison across communication models
Production constraints - Focuses on latency, coordination, observability
Systems thinking - Treats agents as a distributed systems problem
Actionable insights - Decision framework: Use MCP when X, A2A when Y

📚 Documentation

Project Plan - Objectives, scope, timeline
Architecture - System design & components
LLD: Agent Runtime - Detailed runtime design
Scenarios & Failures - Test plans & evaluation strategy

🛠️ Contributing

Current focus: Scenario-driven development

Implement Scenario 1 end-to-end ✅
Add failure injection to Scenario 1
Implement Scenario 2 (A2A)
Compare across scenarios

📝 License

Open source, intended for educational and research purposes.

🎓 Questions This Project Answers

When should MCP be used vs A2A vs Distributed?
- MCP: Simple, fast (0.06ms), low overhead, fails linearly
- A2A: Flexible, lowest measured latency (0.02ms), agent composition
- Distributed: Complex, highest latency (0.37ms), graceful degradation
What breaks first in each model?
- MCP/A2A: Single point failure stops everything
- Distributed: 2 of 3 agents can fail, system continues
How does failure propagate differently?
- MCP: Cascading (one tool failure = whole chain fails)
- A2A: Cascading (one agent crash = orchestration fails)
- Distributed: Isolated (failing agents tracked, others continue)
Which approach is easier to debug?
- MCP: Simplest (fewest hops, clear trace)
- A2A: Moderate (agent registry complexity)
- Distributed: Hardest (parallel execution, aggregation logic)
What is the coordination cost?
- MCP: 0ms (direct tool calls)
- A2A: ~2x overhead from agent lookup + delegation
- Distributed: ~6x overhead from parallel + aggregation
How does latency affect each protocol?
- All three operate in sub-millisecond range
- Distributed acceptable for decision systems (latency not critical)
- MCP preferred for latency-sensitive operations

📊 Final Metrics

Metric	MCP	A2A	Distributed
Avg Latency	0.06ms	0.02ms	0.37ms
Latency Ratio	1x	0.33x	6x
Success Rate	100%	100%	100%
Partial Failure Tolerance	❌	❌	✅
Max Agents	1	N	3+
Complexity	Low	Medium	High
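
The Latency Ratio row is each average latency divided by the MCP average. A short check of that arithmetic, using only the figures from the table above:

```python
# Average latencies in milliseconds, copied from the table above.
avg_latency_ms = {"MCP": 0.06, "A2A": 0.02, "Distributed": 0.37}

baseline = avg_latency_ms["MCP"]
ratios = {name: round(value / baseline, 2) for name, value in avg_latency_ms.items()}
```

This gives 1.0 for MCP, 0.33 for A2A, and 6.17 for Distributed, so the table's "6x" is the Distributed ratio rounded down.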

Key Insight: "Agent systems don't fail linearly—they fail probabilistically and collectively. Distributed systems accept higher latency for resilience."
