simarbedi/agentic-protocols

Agentic Protocols: Production-Inspired Evaluation System

A systems-level evaluation framework for comparing how different agent communication protocols behave under real-world constraints: latency, failures, coordination overhead, and observability.

Goal: Answer one question: when should you use tool-based agents (MCP) and when multi-agent communication (A2A)?


🎯 The Problem

Agent systems are often evaluated on "happy path" scenarios. In production:

  • Tools time out
  • Agents crash
  • Failures cascade
  • Coordination becomes complex
  • Debugging is hard

This project deliberately injects failures to compare how different protocols handle real-world friction.
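As a rough illustration of what deliberate failure injection can look like, the sketch below wraps a callable so that a configurable share of calls fail. The names `FailureType`, `InjectedFailure`, `with_failure_injection`, and `fetch_user` are hypothetical and are not taken from this repository's `failure/` package; the 30% rate simply mirrors the rate mentioned for the resilience benchmark.

```python
import random
from enum import Enum
from typing import Callable


class FailureType(Enum):
    TIMEOUT = "timeout"
    EXCEPTION = "exception"


class InjectedFailure(Exception):
    """Raised instead of a real fault so injected failures are easy to tell apart."""

    def __init__(self, kind: FailureType):
        super().__init__(f"injected {kind.value}")
        self.kind = kind


def with_failure_injection(
    call: Callable[[], dict],
    failure_rate: float,
    rng: random.Random,
    kind: FailureType = FailureType.TIMEOUT,
) -> Callable[[], dict]:
    """Wrap `call` so that roughly `failure_rate` of invocations raise."""

    def wrapped() -> dict:
        if rng.random() < failure_rate:
            raise InjectedFailure(kind)
        return call()

    return wrapped


def fetch_user() -> dict:
    # Hypothetical mock tool.
    return {"user_id": 1, "name": "example"}


rng = random.Random(42)  # seeded so runs are repeatable
flaky_fetch = with_failure_injection(fetch_user, failure_rate=0.3, rng=rng)

outcomes = []
for _ in range(1000):
    try:
        flaky_fetch()
        outcomes.append("ok")
    except InjectedFailure:
        outcomes.append("failed")

observed_rate = outcomes.count("failed") / len(outcomes)
print(f"observed failure rate: {observed_rate:.2f}")
```

Passing the random generator in explicitly keeps a failure run reproducible, which matters when two protocols are compared under the same fault pattern.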


🔬 What We're Comparing

Model Context Protocol (MCP) - Tool-Oriented

  • Agent executes structured tool calls
  • Deterministic, schema-validated
  • Strength: Control, predictability
  • Weakness: Limited flexibility

Agent-to-Agent (A2A) - Multi-Agent

  • Agents communicate with each other
  • Task decomposition across multiple agents
  • Strength: Flexibility, scalability
  • Weakness: Coordination complexity, harder to debug

Hybrid

  • Agents use tools + communicate with peers
  • Best of both worlds (or worst?)

📊 Three Test Scenarios

Scenario 1: Tool-Oriented Execution (MCP Strength)

Use Case: User Data Enrichment Pipeline

Agent → Fetch User Data → Call External API → Aggregate Result

Focus: Tool latency, schema validation, retry behavior

Scenario 2: Multi-Agent Collaboration (A2A Strength)

Use Case: Incident Investigation

Planner → Log Analyzer → Metrics Analyzer → Summary Agent

Focus: Message passing overhead, coordination delays, state consistency
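The hand-off chain in this scenario can be sketched as a sequential pipeline that records one message per hop, which makes the message-passing overhead countable. This is a minimal sketch: `Message`, `Pipeline`, the handler functions, and the payload fields are illustrative assumptions, not the project's A2A adapter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    sender: str
    recipient: str
    payload: dict


@dataclass
class Pipeline:
    """Runs agents in sequence and records every hand-off as a message."""

    handlers: Dict[str, Callable[[dict], dict]]
    messages: List[Message] = field(default_factory=list)

    def run(self, order: List[str], payload: dict) -> dict:
        sender = "planner"
        for recipient in order:
            # Copy the payload so the recorded message is a snapshot.
            self.messages.append(Message(sender, recipient, dict(payload)))
            payload = self.handlers[recipient](payload)
            sender = recipient
        return payload


def log_analyzer(state: dict) -> dict:
    return {**state, "log_findings": "error spike in auth service"}


def metrics_analyzer(state: dict) -> dict:
    return {**state, "metric_findings": "p95 latency doubled"}


def summary_agent(state: dict) -> dict:
    summary = f"{state['log_findings']}; {state['metric_findings']}"
    return {**state, "summary": summary}


pipeline = Pipeline(
    handlers={
        "log_analyzer": log_analyzer,
        "metrics_analyzer": metrics_analyzer,
        "summary_agent": summary_agent,
    }
)
report = pipeline.run(
    ["log_analyzer", "metrics_analyzer", "summary_agent"],
    {"incident_id": "INC-1"},
)
print(report["summary"])
print(f"messages exchanged: {len(pipeline.messages)}")
```

Because each agent depends on the previous agent's output, an exception in any handler stops the whole chain, which is the cascading behavior this scenario is meant to expose.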

Scenario 3: Distributed Decision System (Differentiator) ✅ COMPLETE

Use Case: Security Anomaly Detection with Partial Failure Tolerance

Log Agent (0.0-1.0)      ─┐
Metrics Agent (0.0-1.0)  ─┼→ Parallel Execution → Aggregator → Decision
Behavior Agent (0.0-1.0) ─┘

Focus: Partial failures, graceful degradation, decision under uncertainty

Test Results:

| Failure Scenario        | Decision    | Confidence |
|-------------------------|-------------|------------|
| 3.1 No failures         | HIGH_RISK   | 100%       |
| 3.2 One agent timeout   | HIGH_RISK   | 66%        |
| 3.3 One agent exception | HIGH_RISK   | 66%        |
| 3.4 Two agents fail     | MEDIUM_RISK | 33%        |

Key Finding: System continues producing decisions even with 2/3 agents failing.
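One way to get this kind of graceful degradation is to run the agents concurrently, treat timeouts and exceptions as missing votes, and report confidence as the share of agents that answered. The sketch below assumes that approach. The agent factories, the score-discounting rule, and the 0.5 / 0.25 thresholds are hypothetical choices made for illustration and may differ from the project's `AggregatorAgent`.

```python
import asyncio
from typing import Awaitable, Callable, List

Agent = Callable[[], Awaitable[float]]


def healthy(score: float) -> Agent:
    async def run() -> float:
        return score
    return run


def hanging() -> Agent:
    async def run() -> float:
        await asyncio.sleep(10)  # far longer than the timeout below
        return 0.0
    return run


def crashing() -> Agent:
    async def run() -> float:
        raise RuntimeError("agent crashed")
    return run


async def decide(agents: List[Agent], timeout_s: float = 0.05) -> dict:
    # return_exceptions=True keeps one failure from aborting the others.
    results = await asyncio.gather(
        *(asyncio.wait_for(agent(), timeout=timeout_s) for agent in agents),
        return_exceptions=True,
    )
    scores = [r for r in results if not isinstance(r, BaseException)]
    failed = len(agents) - len(scores)
    confidence = len(scores) / len(agents)
    if not scores:
        return {"decision": "NO_DECISION", "confidence": 0.0, "failed": failed}
    # Assumed rule: discount the mean score by the share of agents that answered.
    weighted = (sum(scores) / len(scores)) * confidence
    if weighted >= 0.5:
        decision = "HIGH_RISK"
    elif weighted >= 0.25:
        decision = "MEDIUM_RISK"
    else:
        decision = "LOW_RISK"
    return {"decision": decision, "confidence": confidence, "failed": failed}


all_ok = asyncio.run(decide([healthy(0.9), healthy(0.9), healthy(0.9)]))
one_timeout = asyncio.run(decide([healthy(0.9), hanging(), healthy(0.9)]))
two_failed = asyncio.run(decide([healthy(0.9), hanging(), crashing()]))
print(all_ok, one_timeout, two_failed, sep="\n")
```

Recording the number of failed agents alongside the decision lets a caller tell a confident verdict apart from one produced by a single surviving agent.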


📈 Key Metrics

  • Performance: Latency (p50, p95, p99), throughput
  • Reliability: Success rate, retry count, failure propagation depth
  • Coordination Cost: Number of messages, execution steps
  • Observability: Traceability, decision clarity, debugging complexity
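For the performance metrics, latency percentiles can be computed with the standard library alone. The helper below is a minimal sketch; `latency_summary` is a hypothetical name, and the sample values are synthetic numbers chosen for readability, not measurements from this project.

```python
import statistics
from typing import Dict, List


def latency_summary(samples_ms: List[float]) -> Dict[str, float]:
    """Return p50, p95, and p99 for a list of latency samples in milliseconds."""
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def success_rate(outcomes: List[bool]) -> float:
    """Share of runs that succeeded, as a value between 0 and 1."""
    return sum(outcomes) / len(outcomes)


# Synthetic samples: 1.0 ms through 101.0 ms.
samples = [float(x) for x in range(1, 102)]
summary = latency_summary(samples)
print(summary)
print(success_rate([True, True, False, True]))
```

Reporting tail percentiles next to the average matters here because an injected timeout shifts p95 and p99 long before it moves the mean.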

🏗️ Architecture

Core Components

Client Request
    ↓
API Gateway (validates, assigns request_id)
    ↓
Agent Runtime (orchestration engine)
    ├→ Planner (decide next action)
    ├→ Protocol Layer (MCP/A2A adapter)
    ├→ Agent Registry (manage agents)
    └→ Tool Services (execute tools)
    ↓
Observability Layer (structured logging)
    ↓
Failure Injection Layer (simulate real-world issues)

Design Principles

  1. Protocol-agnostic runtime - Both MCP and A2A run through the same engine
  2. Execution as state machine - Clear, testable flow
  3. Typed data contracts - Pydantic models for all data
  4. Failure-first design - Expect failures, handle them explicitly
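The first two principles can be sketched as a small loop that dispatches each planned action to whichever adapter handles its kind, and stops on a no-op. This is a simplified illustration under assumed names (`Action`, `Context`, `ADAPTERS`, `run`); the real `ExecutionEngine` and adapters may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    kind: str  # "tool", "agent", or "noop"
    target: str = ""
    payload: dict = field(default_factory=dict)


@dataclass
class Context:
    request_id: str
    state: dict = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)


def mcp_adapter(action: Action, ctx: Context) -> dict:
    # Stand-in for a schema-validated tool call.
    return {"tool": action.target, "ok": True}


def a2a_adapter(action: Action, ctx: Context) -> dict:
    # Stand-in for delegating work to another agent.
    return {"agent": action.target, "ok": True}


# The engine only knows action kinds; the protocol lives behind the adapter.
ADAPTERS: Dict[str, Callable[[Action, Context], dict]] = {
    "tool": mcp_adapter,
    "agent": a2a_adapter,
}


def run(plan: List[Action], ctx: Context) -> Context:
    for action in plan:
        if action.kind == "noop":  # end signal
            ctx.trace.append("done")
            break
        result = ADAPTERS[action.kind](action, ctx)
        ctx.state[action.target] = result
        ctx.trace.append(f"{action.kind}:{action.target}")
    return ctx


plan = [
    Action("tool", "fetch_user"),
    Action("agent", "log_analyzer"),
    Action("noop"),
]
ctx = run(plan, Context(request_id="req-1"))
print(ctx.trace)
```

Keeping the dispatch table as the only protocol-aware piece means the same plan format, trace, and metrics apply to both MCP and A2A runs.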

🚀 Getting Started

Prerequisites

  • Python 3.9+
  • pip / venv

Installation

# Clone or setup the repo
cd agentic-protocols

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Run Scenarios

# Run all scenarios (1, 2, 3) with full trace
python src/main.py

# Run performance benchmarks (5 iterations each)
python src/main.py --benchmark

# Run full scenarios + benchmarks
python src/main.py --full

# Run resilience testing (baseline + 30% failure injection)
python src/benchmark/resilience_benchmarks.py

Benchmark Results

Quick Performance Summary:

MCP (Scenario 1):
  - Avg latency: 0.14 ms
  - Success rate: 100.0 %

A2A (Scenario 2):
  - Avg latency: 0.03 ms
  - Success rate: 100.0 %

Distributed (Scenario 3):
  - Avg latency: 0.22 ms
  - Success rate: 100.0 %

📊 See BENCHMARKING_GUIDE.md for:

  • Complete baseline and failure injection results
  • Performance vs. resilience trade-offs
  • Detailed recommendations by use case

📖 See BENCHMARK_FORMAT_GUIDE.md for:

  • How to interpret benchmark output
  • Custom benchmark configurations
  • Integration into your own reports

📁 Project Structure

agentic-protocols/
├── docs/
│   ├── project-plan.md                # Project roadmap & milestones
│   ├── architecture.md                # High-level design (HLD)
│   ├── lld-agent-runtime.md           # Low-level design (LLD)
│   ├── scenarios.md                   # Test scenarios & failure strategies
│   └── IMPLEMENTATION_SUMMARY.md      # Complete implementation status
│
├── src/
│   ├── runtime/
│   │   ├── context.py                 # ExecutionContext (state + trace)
│   │   └── engine.py                  # ExecutionEngine (main loop)
│   │
│   ├── actions/
│   │   ├── base.py                    # Action abstraction
│   │   ├── tool_action.py             # Execute tools (MCP)
│   │   ├── agent_action.py            # Agent calls (A2A)
│   │   ├── parallel_action.py         # Parallel execution (Scenario 3)
│   │   ├── noop_action.py             # End signal
│   │   └── result.py                  # Standardized results
│   │
│   ├── planner/
│   │   ├── simple_planner.py          # Scenario 1: Tool execution
│   │   ├── multi_agent_planner.py     # Scenario 2: Agent coordination
│   │   └── distributed_planner.py     # Scenario 3: Parallel + aggregation
│   │
│   ├── protocols/
│   │   ├── mcp_adapter.py             # Tool execution adapter
│   │   ├── a2a_adapter.py             # Agent-to-agent adapter
│   │   ├── parallel_adapter.py        # Parallel execution adapter
│   │   ├── resolver.py                # Protocol router with metrics
│   │   ├── observable_wrapper.py      # Observability wrapper
│   │   ├── failure_wrapper.py         # Failure injection wrapper
│   │   └── metrics_wrapper.py         # Metrics collection wrapper
│   │
│   ├── agents/
│   │   ├── base_agent.py              # Agent interface
│   │   ├── log_agent.py               # Log analysis agent
│   │   ├── metrics_agent.py           # Metrics analysis agent
│   │   ├── behavior_agent.py          # Behavior detection agent
│   │   └── aggregator_agent.py        # Decision aggregation agent
│   │
│   ├── registry/
│   │   └── agent_registry.py          # Central agent registry
│   │
│   ├── failure/
│   │   ├── injector.py                # Failure injection engine
│   │   ├── policy.py                  # Failure policies
│   │   └── types.py                   # Failure type enums
│   │
│   ├── metrics/
│   │   ├── models.py                  # ActionMetric, ExecutionMetrics
│   │   ├── collector.py               # Passive metrics collection
│   │   └── reporter.py                # Metrics aggregation & reporting
│   │
│   ├── benchmark/
│   │   └── runner.py                  # Multi-iteration benchmarking
│   │
│   ├── observability/
│   │   ├── logger.py                  # Structured logging
│   │   └── tracer.py                  # Trace correlation
│   │
│   └── main.py                        # Entry point for all scenarios
│
├── requirements.txt                   # Python dependencies
├── README.md                          # This file
└── .gitignore

🔄 Implementation Status

✅ Phase 1: Core Infrastructure

  • ✅ Execution engine with state management
  • ✅ Typed data contracts (Pydantic)
  • ✅ Protocol adapter abstraction
  • ✅ Structured observability with trace correlation
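Trace correlation generally comes down to stamping every log line with the same request identifier. The sketch below shows one way to do that with the standard `logging` module; `JsonFormatter`, the logger name, and the field names are assumptions for illustration, not the project's `observability/` code.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "event": record.getMessage(),
                # Fields passed via `extra=` become attributes on the record.
                "request_id": getattr(record, "request_id", None),
                "step": getattr(record, "step", None),
            }
        )


stream = io.StringIO()  # stand-in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("agent_runtime_sketch")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output through the root logger

request_id = str(uuid.uuid4())
for step in ("plan", "tool_call", "aggregate"):
    logger.info("step finished", extra={"request_id": request_id, "step": step})

events = [json.loads(line) for line in stream.getvalue().splitlines()]
print(events)
```

Filtering the resulting lines by `request_id` reconstructs the path of a single request, even when several agents log concurrently.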

✅ Phase 2: Scenario 1 - MCP (Tool-Based)

  • ✅ Tool execution (3 outcome modes: success, timeout, exception)
  • ✅ Retry logic with exponential backoff
  • ✅ End-to-end testing with 3 test cases
  • ✅ Baseline metrics collection
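Retry with exponential backoff can be sketched in a few lines. The helper name, the defaults (3 attempts, 10 ms base delay, factor 2), and the `flaky_tool` mock are assumptions for illustration; the project's actual retry policy is not shown in this README.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.01,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, multiplying the wait by `factor` each retry."""
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= factor
    raise ValueError("max_attempts must be at least 1")


attempts = {"count": 0}


def flaky_tool() -> str:
    # Hypothetical tool that fails twice, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("tool timed out")
    return "enriched-user-data"


waits: List[float] = []
result = retry_with_backoff(flaky_tool, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Injecting the `sleep` function keeps tests fast and makes the backoff schedule itself observable.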

✅ Phase 3: Scenario 2 - A2A (Agent Coordination)

  • ✅ Multi-agent framework with agent registry
  • ✅ LogAnalyzerAgent, MetricsAgent examples
  • ✅ Agent-to-agent message passing
  • ✅ 3 test cases with failure injection

✅ Phase 4: Scenario 3 - Distributed Decisions

  • ✅ Parallel action execution with ParallelAdapter
  • ✅ Graceful degradation under partial failures
  • ✅ BehaviorAgent, AggregatorAgent (decision logic)
  • ✅ 4 test cases covering all failure combinations

✅ Phase 5: Metrics & Benchmarking

  • ✅ Passive metrics collection (MetricsCollector)
  • ✅ Metrics wrapper composition in protocol stack
  • ✅ Performance reporter with scenario comparison
  • ✅ Benchmark runner (configurable iterations)
  • ✅ Latency, success rate, retry tracking
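Passive collection usually means wrapping a call so it is timed and its outcome recorded without changing its behavior. The sketch below shows that idea. The class names echo `ActionMetric` and `MetricsCollector` from this README, but their fields and methods here, and the `with_metrics` helper, are assumptions rather than the project's definitions.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ActionMetric:
    name: str
    latency_ms: float
    success: bool


class MetricsCollector:
    def __init__(self) -> None:
        self.records: List[ActionMetric] = []

    def record(self, metric: ActionMetric) -> None:
        self.records.append(metric)

    def success_rate(self) -> float:
        return sum(m.success for m in self.records) / len(self.records)

    def avg_latency_ms(self) -> float:
        return sum(m.latency_ms for m in self.records) / len(self.records)


def with_metrics(name: str, call: Callable, collector: MetricsCollector) -> Callable:
    """Time `call` and record the outcome; errors are re-raised unchanged."""

    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = call(*args, **kwargs)
        except Exception:
            elapsed_ms = (time.perf_counter() - start) * 1000
            collector.record(ActionMetric(name, elapsed_ms, False))
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        collector.record(ActionMetric(name, elapsed_ms, True))
        return result

    return wrapped


def lookup(user_id: int) -> dict:
    if user_id < 0:
        raise ValueError("unknown user")
    return {"user_id": user_id}


collector = MetricsCollector()
measured_lookup = with_metrics("lookup", lookup, collector)

measured_lookup(1)
try:
    measured_lookup(-1)
except ValueError:
    pass

print(f"success rate: {collector.success_rate():.0%}")
```

Because the wrapper has the same call signature as the function it wraps, it can be stacked with other wrappers such as failure injection or logging.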

✅ Phase 6: Production-Ready

  • ✅ Error handling across all protocols
  • ✅ Comprehensive trace logs
  • ✅ Benchmark CLI: --benchmark, --full
  • ✅ Documentation & implementation summary

🧪 Design Decisions

Why Start with Scenario 1?

  • Validates architecture before adding complexity
  • MCP is simpler - easier to debug and prove the system works
  • Faster iteration - mock tools are trivial to implement

Why Rule-Based Planner First?

  • No ML/LLM overhead - focuses on system design, not AI
  • Deterministic - easier to test and reason about
  • Extensible - can replace with LLM-based planner later

Why Pydantic?

  • Type safety - catches data shape errors early
  • Serialization - easy to log and trace
  • Validation - schema contracts enforced

🔍 Key Differentiators

This is not just another agent framework because:

  1. Evaluates failures, not just success - Most frameworks ignore degradation
  2. Protocol comparison - Systematic comparison across communication models
  3. Production constraints - Focuses on latency, coordination, observability
  4. Systems thinking - Treats agents as a distributed systems problem
  5. Actionable insights - Decision framework: Use MCP when X, A2A when Y

📚 Documentation


🛠️ Contributing

Current focus: Scenario-driven development

  1. Implement Scenario 1 end-to-end ✅
  2. Add failure injection to Scenario 1 ✅
  3. Implement Scenario 2 (A2A) ✅
  4. Compare across scenarios ✅

📝 License

Open source - educational and research purposes.


🎓 Questions This Project Answers

  1. When should MCP be used vs A2A vs Distributed?

    • MCP: Simple, fast (0.06ms), low overhead, fails linearly
    • A2A: Flexible, lowest measured latency (0.02ms), agent composition
    • Distributed: Complex, highest latency (0.37ms), graceful degradation
  2. What breaks first in each model?

    • MCP/A2A: Single point failure stops everything
    • Distributed: 2 of 3 agents can fail, system continues
  3. How does failure propagate differently?

    • MCP: Cascading (one tool failure = whole chain fails)
    • A2A: Cascading (one agent crash = orchestration fails)
    • Distributed: Isolated (failing agents tracked, others continue)
  4. Which approach is easier to debug?

    • MCP: Simplest (fewest hops, clear trace)
    • A2A: Moderate (agent registry complexity)
    • Distributed: Hardest (parallel execution, aggregation logic)
  5. What is the coordination cost?

    • MCP: 0ms (direct tool calls)
    • A2A: ~2x overhead from agent lookup + delegation
    • Distributed: ~6x overhead from parallel + aggregation
  6. How does latency affect each protocol?

    • All three operate in sub-millisecond range
    • Distributed acceptable for decision systems (latency not critical)
    • MCP preferred for latency-sensitive operations

📊 Final Metrics

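| Metric                    | MCP    | A2A    | Distributed |
|---------------------------|--------|--------|-------------|
| Avg Latency               | 0.06ms | 0.02ms | 0.37ms      |
| Latency Ratio             | 1x     | 0.33x  | 6x          |
| Success Rate              | 100%   | 100%   | 100%        |
| Partial Failure Tolerance | No     | No     | Yes         |
| Max Agents                | 1      | N      | 3+          |
| Complexity                | Low    | Medium | High        |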
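The latency ratios follow directly from the average latencies in this table, with MCP as the baseline. The short check below reproduces that arithmetic; the variable names are illustrative.

```python
# Average latencies in milliseconds, copied from the Final Metrics table.
latencies_ms = {"MCP": 0.06, "A2A": 0.02, "Distributed": 0.37}

baseline = latencies_ms["MCP"]
ratios = {name: round(value / baseline, 2) for name, value in latencies_ms.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

The Distributed figure comes out slightly above 6, which the table rounds to 6x.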

Key Insight: "Agent systems don't fail linearly—they fail probabilistically and collectively. Distributed systems accept higher latency for resilience."
