A dynamic finance agent benchmark system for the AgentBeats Competition. This project implements both the Green Agent (Evaluator) and the Purple Agent (Finance Analyst) using the A2A (Agent-to-Agent) protocol.
This codebase is designed to work with the AgentBeats platform. The Green Agent follows the official green-agent-template.
- Deterministic configs: use a fixed `sampling.seed` in eval configs and avoid ad-hoc overrides during runs.
- LLM-as-judge determinism: set `llm_eval.temperature: 0.0` and pin `llm_eval.model` (e.g., `gpt-4o-mini`) in configs such as `config/eval_all.yaml` and `config/eval_gdpval.yaml`.
- Crypto reproducibility: use anonymized hidden scenarios plus fixed seeds; for fully deterministic runs you can set `EVAL_SCENARIO_SEED`.
- Transparent evaluation logic: scoring lives in `src/evaluators/` and `src/cio_agent/unified_scoring.py`; crypto scoring in `src/cio_agent/crypto_benchmark.py`.
- Build & push Green/Purple images to GHCR.
- Register agents on AgentBeats and copy the agent IDs.
- Update the leaderboard `scenario.toml` (set `EVAL_CONFIG=config/eval_all.yaml` and agent IDs).
- Add secrets to the leaderboard repo (`OPENAI_API_KEY`, `EVAL_DATA_REPO`, `EVAL_DATA_PAT`, optional `HF_TOKEN`).
- Push `scenario.toml` → workflow runs → merge PR to publish results.
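A minimal sketch of the build-and-push step, using the Dockerfiles and GHCR image names documented later in this README (assumes you are already logged in via `docker login ghcr.io` with a token that has package write access):

```bash
# Build both agent images from the repo root
docker build -f Dockerfile.green  -t ghcr.io/yxc20089/agentbusters-green:latest .
docker build -f Dockerfile.purple -t ghcr.io/yxc20089/agentbusters-purple:latest .

# Push so the AgentBeats leaderboard can pull them
docker push ghcr.io/yxc20089/agentbusters-green:latest
docker push ghcr.io/yxc20089/agentbusters-purple:latest
```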
- CPU/RAM: 4 vCPU / 16 GB RAM recommended for multi‑dataset runs.
- Storage: local datasets + hidden crypto windows (size depends on your private repo).
- Network: required for HuggingFace datasets (BizFinBench/GDPVal) and LLM APIs.
- LLM: external API or local LLM; pin model + temperature for reproducibility.
- `scenario.toml` has real AgentBeats IDs (no placeholders)
- `config/eval_all.yaml` (or your target config) uses fixed seeds + `llm_eval.temperature: 0.0`
- Hidden crypto data is private and only mounted into Green (not visible to Purple)
- README + DEPLOYMENT docs match the exact run steps
- End-to-end dry run completed from a clean clone
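A few quick greps to spot-check these items from a clean clone (the patterns are illustrative; adjust them to your actual config and `scenario.toml` keys):

```bash
grep -n "seed:" config/eval_all.yaml           # fixed sampling seed present?
grep -n "temperature:" config/eval_all.yaml    # llm_eval.temperature pinned to 0.0?
grep -n "agent" scenario.toml                  # eyeball the IDs for leftover placeholders
```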
# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# .\.venv\Scripts\Activate.ps1 # Windows PowerShell
# Install dependencies
pip install -e ".[dev]"
# Start Green Agent (A2A server)
# For all datasets: use config/eval_all.yaml
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 --eval-config config/eval_all.yaml --store-predicted --predicted-max-chars 200
# Start Purple Agent
purple-agent serve --host 0.0.0.0 --port 9110 --card-url http://127.0.0.1:9110
# Start MCP server
python -m src.mcp_servers.sec_edgar --transport http --host 0.0.0.0 --port 8101
python -m src.mcp_servers.yahoo_finance --transport http --host 0.0.0.0 --port 8102
python -m src.mcp_servers.sandbox --transport http --host 0.0.0.0 --port 8103
# Run the evaluation
python scripts/run_a2a_eval.py --green-url http://127.0.0.1:9109 --purple-url http://127.0.0.1:9110 --num-tasks 25 -v -o results/eval_output.json
# gpt-4o total score: 42.44
# with debate
python scripts/run_a2a_eval.py --green-url http://127.0.0.1:9109 --purple-url http://127.0.0.1:9110 --num-tasks 25 --conduct-debate -v -o results/eval_output_debate.json
# gpt-4o total score: 43.65
- Python 3.13 (recommended for AgentBeats)
- uv (recommended) or pip
- vLLM, Ollama, or LM Studio (for local LLM deployment)
# Clone the repository
git clone https://github.com/yxc20089/AgentBusters.git
cd AgentBusters
# Option 1: Using uv (recommended)
uv sync
# Option 2: Using pip
pip install -e ".[dev]"
# Option 3: Create .env file from template
cp .env.example .env
# Edit .env with your API keys and configuration
The CIO-Agent FAB++ system evaluates AI agents on financial analysis tasks using a unified scoring system across three weighted sections:
| Section | Weight | Datasets | Skills Tested |
|---|---|---|---|
| Knowledge Retrieval | 30% | BizFinBench, Public CSV | Data extraction, financial facts |
| Analytical Reasoning | 35% | Synthetic Questions | Logic puzzles, multi-step calculations |
| Options Trading | 35% | Options Alpha | Derivatives, Greeks, strategies |
OverallScore = 0.30 × Knowledge + 0.35 × Analysis + 0.35 × Options
All section scores are normalized to 0-100 scale. Example: Knowledge (83.33) + Analysis (50.00) + Options (51.25) → Overall: 60.44/100
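A one-liner to reproduce the arithmetic in the example above:

```bash
python -c "print(round(0.30*83.33 + 0.35*50.00 + 0.35*51.25, 2))"  # 60.44
```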
- BizFinBench v2 (Knowledge): Event logic reasoning, quantitative computation
- Public CSV (Knowledge): Beat/miss analysis, market analysis from FAB benchmark
- Synthetic Questions (Analysis): 20 olympiad-style finance logic problems covering:
- Capital budgeting (NPV, IRR)
- Portfolio theory (beta, leverage)
- Fixed income (duration, immunization)
- Corporate finance (FCFF, M&M)
- Options & derivatives (put-call parity, swaps)
- Options Alpha (Options): Greeks analysis, strategy construction, P&L analysis
- Crypto Trading Scenarios (Optional): Multi-round trading evaluation on market states
- GDPVal (Optional): Open‑ended professional tasks scored by LLM‑as‑judge
- Unified Scoring: All evaluators normalized to 0-100, weighted by section
- MCP Servers: 6 servers for financial data, options pricing, and trading simulation
- Options Alpha Challenge: Black-Scholes pricing, Greeks analysis, multi-leg strategies
- Adversarial Debate: Optional counter-argument generation to test conviction
- Dynamic Weight Redistribution: When sections are disabled, weights redistribute proportionally
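For example, with the Options section disabled, the remaining weights renormalize proportionally (a quick sketch of the arithmetic; the actual logic lives in `src/cio_agent/unified_scoring.py`):

```bash
python -c "
w = {'knowledge': 0.30, 'analysis': 0.35}   # options (0.35) disabled
total = sum(w.values())
print({k: round(v / total, 3) for k, v in w.items()})
# -> {'knowledge': 0.462, 'analysis': 0.538}
"
```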
The repo includes an optional crypto trading benchmark that evaluates multi-round trading decisions on historical scenarios (baseline, noisy, adversarial, meta-consistency). Use `config/eval_crypto.yaml` to run it; see `docs/CRYPTO_BENCHMARK.md` for the data format and integration details.
Do not commit hidden seeds or evaluation data. Keep `~/.agentbusters/hidden_seeds.yaml` and `data/crypto/hidden/` private.
The crypto benchmark implements a Hidden Windows strategy to prevent overfitting:
| Protection Layer | Mechanism |
|---|---|
| Private Seed | Master seed stored in ~/.agentbusters/ (not in repo) |
| Dynamic Selection | Evaluation windows selected deterministically from seed |
| Anonymous IDs | Scenario IDs are SHA256 hashes (cannot reverse to timestamps) |
| Quarterly Rotation | Seeds refreshed periodically to prevent long-term optimization |
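The anonymization idea can be pictured like this (an illustrative sketch only, not the repo's actual derivation; the master seed value here is hypothetical, and the real one lives outside the repo in `~/.agentbusters/`):

```bash
python -c "
import hashlib
master_seed = 'PRIVATE_MASTER_SEED'     # hypothetical value; never committed
window_start = '2024-03-01T00:00:00Z'   # real window timestamps stay hidden
scenario_id = hashlib.sha256(f'{master_seed}:{window_start}'.encode()).hexdigest()
print(scenario_id[:16])                 # opaque ID; cannot be reversed to the timestamp
"
```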
For production deployment with PostgreSQL and hidden windows, see docs/DEPLOYMENT.md.
┌─────────────────────────────────────────────────────────────────────────────┐
│ AgentBusters System │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ A2A Protocol ┌───────────────┐ │
│ │ Green Agent │◄──────────────────────────────► │ Purple Agent │ │
│ │ (Evaluator) │ │ (Analyst) │ │
│ │ Port: 9109 │ │ Port: 9110 │ │
│ └───────┬─────────┘ └───────┬───────┘ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ 6 MCP Servers │ │ │
│ │ ├──────────────────────────────────────────────┤ │ │
│ │ │ ┌────────────┐ ┌────────────┐ ┌──────────┐ │ │ │
│ │ │ │ SEC EDGAR │ │ Yahoo │ │ Python │ │ │ │
│ │ │ │ :8101 │ │ Finance │ │ Sandbox │ │ │ │
│ │ │ │ │ │ :8102 │ │ :8103 │ │ │ │
│ │ │ └────────────┘ └────────────┘ └──────────┘ │ │ │
│ │ │ ┌────────────┐ ┌────────────┐ ┌──────────┐ │ │ │
│ │ │ │ Options │ │ Trading │ │ Risk │ │ │ │
│ │ │ │ Chain │ │ Sim │ │ Metrics │ │ │ │
│ │ │ │ :8104 │ │ :8105 │ │ :8106 │ │ │ │
│ │ │ └────────────┘ └────────────┘ └──────────┘ │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ │ │ │
│ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ SQLite │ │ SQLite │ │
│ │ tasks.db │ │ purple_tasks │ │
│ └───────────────┘ └───────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Use these exact commands to run the whole stack locally with openai/gpt-oss-20b. Each terminal runs one long-lived process; keep them open.
# (Optional) Terminal 1 — Local LLM with vLLM
# conda activate /chronos_data/conda_envs/py313
# Install vLLM
# pip install vllm
# IMPORTANT: Enable tool calling support with --enable-auto-tool-choice and --tool-call-parser
# Use the deployment script for easy setup:
python scripts/deploy_vllm.py --model qwen3-32b --dry-run # Preview command
python scripts/deploy_vllm.py --model qwen3-32b # Deploy
# Or manually start vLLM with tool calling enabled:
vllm serve Qwen/Qwen3-32B --port 8000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml
# Supported tool-call-parser values by model:
# Qwen3: qwen3_xml (recommended), qwen3_coder
# DeepSeek: deepseek_v3, deepseek_v31, deepseek_v32
# Llama 3.x: llama3_json
# Llama 4.x: llama4_json, llama4_pythonic
# Mistral: mistral
# Others: hermes (generic), internlm, jamba, granite
# For multi-GPU: add --tensor-parallel-size=2
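Once vLLM is up, confirm the OpenAI-compatible endpoint is serving your model (assumes the default port 8000):

```bash
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```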
You can also set `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` in `.env`.
# Terminal 2–4 — MCP Servers (OPTIONAL - can skip these!)
# Purple Agent can run MCP servers in-process (no external servers needed).
# Only start these if you want separate processes for debugging or multi-agent scenarios.
# Option A: Skip Terminals 2-4 entirely (recommended, uses in-process MCP)
# → Just comment out MCP_*_URL in .env
# Option B: Run external MCP servers (for debugging/multi-agent)
# Terminal 2 — SEC EDGAR MCP
# python -m src.mcp_servers.sec_edgar --transport http --host 0.0.0.0 --port 8101
# # Terminal 3 — Yahoo Finance MCP
# python -m src.mcp_servers.yahoo_finance --transport http --host 0.0.0.0 --port 8102
# # Terminal 4 — Sandbox MCP
# python -m src.mcp_servers.sandbox --transport http --host 0.0.0.0 --port 8103
# Terminal 5 — Purple Agent (Finance Analyst, A2A server for AgentBeats)
# Recommended: Production-grade A2A server with full LLM support
purple-agent serve --host 0.0.0.0 --port 9110 --card-url http://localhost:9110
# With custom database:
# purple-agent serve --host 0.0.0.0 --port 9110 --database-url sqlite+aiosqlite:///my_purple.db
# Alternatively: Simple test agent (minimal A2A + REST)
# python src/simple_purple_agent.py --host 0.0.0.0 --port 9110 --card-url http://localhost:9110
# Quick one-off analysis (no server needed)
# purple-agent analyze "Did NVIDIA beat or miss Q3 FY2026 expectations?" --ticker NVDA
# Terminal 6 — Green Agent (Evaluator, A2A server)
# No CLI wrapper for serve command—start the server directly
# If you use gpt-oss-20b, set
#```
# llm_eval:
# enabled: true
# model: "openai/gpt-oss-20b"
# temperature: 0.0
#```
# If you use a local dataset, update
# - type: crypto
# path: data/crypto/scenarios/sample_btc_window
# in `eval_all.yaml`
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 --eval-config config/eval_all.yaml
# With custom database:
# python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
# --eval-config config/eval_all.yaml --database-url sqlite+aiosqlite:///my_green.db
# Terminal 7 — Run Evaluation
python scripts/run_a2a_eval.py --green-url http://localhost:9109 --purple-url http://localhost:9110 --num-tasks 1 --timeout 300 -v
################################################################################
# 1. QUICK START - Most Common Commands
################################################################################
# Quick smoke checks (discovery/health)
curl http://localhost:9109/.well-known/agent.json # Green agent card
curl http://localhost:9110/health # Purple agent health
# Tests and end-to-end run
# Run all tests
python -m pytest tests/ -v
# Run A2A conformance tests
python -m pytest tests/test_a2a_green.py -v --agent-url http://localhost:9109
# Run A2A tests with synthetic questions (integration test)
python -m pytest tests/test_a2a_green.py::test_synthetic_questions_evaluation -v \
--agent-url http://localhost:9109 --purple-url http://localhost:9110
# Run synthetic question unit tests (no server required)
python -m pytest tests/test_synthetic.py -v
# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html
# Trigger a manual evaluation (Green → Purple via A2A)
# List available tasks
cio-agent list-tasks
# Run evaluation on a specific task
cio-agent evaluate --task-id FAB_001 --purple-endpoint http://localhost:9110
# Demo: NVIDIA Q3 FY2026 evaluation
python scripts/run_demo.py
# Optional: override Purple endpoint
# PURPLE_ENDPOINT=http://localhost:9110 python scripts/run_demo.py
# Purple Agent utilities
purple-agent info NVDA # Pulls quote/statistics/SEC snapshot via MCP
purple-agent card # Prints the Purple Agent Card JSON
# Green Evaluator power tools
cio-agent list-tasks # View all FAB++ templates
cio-agent lake-status # Check Financial Lake cache status
################################################################################
# 2. RUN EVALUATION (recommended workflow)
################################################################################
# Step 1: Start Green Agent A2A Server (choose one):
# RECOMMENDED: Multi-dataset config (production-ready)
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
--eval-config config/eval_quick.yaml # Quick test (10 examples)
# Or:
# --eval-config config/eval_full.yaml # Full evaluation (100+ examples)
# Step 2: Trigger evaluation
python scripts/run_a2a_eval.py --num-tasks 5 -v
# With custom options:
python scripts/run_a2a_eval.py \
--green-url http://localhost:9109 \
--purple-url http://localhost:9110 \
--num-tasks 100 \
--conduct-debate \
-o results/eval_output.json
################################################################################
# 3. EVALUATION RESULTS STORAGE
################################################################################
# Results are stored in TWO places:
# 1. SQLite Database (persistent, auto-created)
# File: tasks.db
# Contains: task status, context_id, artifacts (full evaluation results)
# Note: predicted/predicted_full are empty unless you start Green with --store-predicted
# 2. JSON file (optional, via -o flag)
python scripts/run_a2a_eval.py --num-tasks 10 -o results/eval_output.json
# View stored results from database:
sqlite3 tasks.db "SELECT artifacts FROM tasks ORDER BY id DESC LIMIT 1;" | python3 -m json.tool
# Query task status by ID:
curl -X POST http://localhost:9109/ -H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"tasks/get","id":"q1","params":{"id":"TASK_ID"}}'
# Reset database (clear all history):
rm tasks.db
# Result storage comparison:
# ┌─────────────────────────────┬──────────────────────────────────────────┐
# │ Method │ Results Storage │
# ├─────────────────────────────┼──────────────────────────────────────────┤
# │ evaluate-synthetic --output │ Saved to JSON file (persistent) │
# │ A2A Server │ SQLite Database (persistent in tasks.db) │
# └─────────────────────────────┴──────────────────────────────────────────┘
################################################################################
# 4. GENERATE SYNTHETIC DATA (optional)
################################################################################
# Financial Lake + Synthetic benchmark (requires ALPHAVANTAGE_API_KEY)
# Rate limiting: Free tier allows 5 calls/min, 25 calls/day
# Each ticker needs 5 API calls, so harvest 1 ticker at a time
cio-agent harvest --tickers NVDA # ~1.5 min per ticker
cio-agent harvest --tickers AAPL # Run after first completes
# Generate synthetic questions from Financial Lake data
cio-agent generate-synthetic -n 10 -o data/synthetic_questions/questions.json
cio-agent verify-questions data/synthetic_questions/questions.json -o /tmp/verify.json
# Troubleshooting: If cache files are empty, delete and re-harvest
# rm data/alphavantage_cache/AAPL_EARNINGS.json # Delete empty file
# cio-agent harvest --tickers AAPL --force # Force re-fetch
################################################################################
# 5. ALTERNATIVE: Local Testing (no A2A server needed)
################################################################################
# For quick local testing, use evaluate-synthetic (simpler, faster):
cio-agent evaluate-synthetic data/synthetic_questions/questions.json \
--purple-endpoint http://localhost:9110 \
--output data/synthetic_questions/results.json \
--limit 5 --no-debate
# This directly calls Purple Agent's /analyze endpoint, no A2A server needed
# Recommendation: Use this with --output for local dev
################################################################################
# 6. ARCHITECTURE: Local Dev vs AgentBeats Evaluation
################################################################################
# Option A: Local Testing (evaluate-synthetic uses HTTP REST, faster)
# ┌─────────────────────┐ HTTP POST /analyze ┌───────────┐
# │ cio-agent CLI │─────────────────────────►│ Purple │
# │ evaluate-synthetic│◄─────────────────────────│ Agent │
# └─────────────────────┘ └───────────┘
#
# Option B: AgentBeats Evaluation (uses full A2A Protocol)
# ┌─────────────────────────┐
# │ AgentBeats Platform │
# │ (or curl test request) │
# └───────────┬─────────────┘
# │ A2A JSON-RPC
# ▼
# ┌─────────────────────────────────────────────────────────────┐
# │ Green Agent A2A Server (:9109) │◄──┐
# │ --eval-config config/eval_*.yaml │ │
# │ (Loads datasets from config file) │ │
# └───────────────────────────┬─────────────────────────────────┘ │
# │ Evaluates Purple Agent │
# ▼ │
# ┌───────────────────┐ │
# │ Purple Agent │ ┌────────────┐
# │ (:9110) │ │ SQLite │
# └───────────────────┘ │ (tasks.db) │
# └────────────┘
# Quick testing recommendation:
# ┌────────────────────────────────────────────────────────────────────────────┐
# │ Method │ Use Case │ Protocol │ Speed │
# ├────────────────────────────────────────────────────────────────────────────┤
# │ cio-agent evaluate-synthetic│ Local dev testing │ HTTP REST │ Fast │
# │ A2A Server + run_a2a_eval │ AgentBeats official│ A2A JSON-RPC│ Full stack│
# └────────────────────────────────────────────────────────────────────────────┘
################################################################################
# 7. ADVANCED: Raw Curl Testing & Legacy Mode
################################################################################
# Test A2A evaluation with curl (alternative to run_a2a_eval.py script):
curl -X POST http://localhost:9109/ \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"method": "message/send",
"id": "test-'$(date +%s)'",
"params": {
"message": {
"messageId": "'$(uuidgen || cat /proc/sys/kernel/random/uuid)'",
"role": "user",
"parts": [{"type": "text", "text": "{\"participants\": {\"purple_agent\": \"http://localhost:9110\"}, \"config\": {\"num_tasks\": 1}}"}]
}
}
}'
# NOTE: A2A SDK tracks tasks by session context.
# Error "Task already in terminal state" means the task ID was reused.
# Solution: Use dynamic UUID (shown above), or rm tasks.db to reset.
# Legacy: Single dataset mode (use --eval-config instead):
# python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
# --dataset-type bizfinbench --dataset-path data/BizFinBench.v2 \
# --task-type event_logic_reasoning --limit 10
# Config file example (config/eval_full.yaml):
# ---
# name: "FAB++ Full Evaluation"
# datasets:
# - type: synthetic
# path: data/synthetic_questions/questions.json
# limit: 10
# - type: bizfinbench
# path: data/BizFinBench.v2
# task_types: [event_logic_reasoning, user_sentiment_analysis]
# languages: [en, cn]
# limit_per_task: 20
# - type: public_csv
# path: finance-agent/data/public.csv
# limit: 100
# sampling:
# strategy: stratified # Options: sequential, random, stratified, weighted
# total_limit: 100
# seed: 42
# llm_eval:
#   enabled: true
#   model: gpt-4o-mini
#   temperature: 0.0
# MCP helpers and CSV batch eval
# Note: start_mcp_servers.py uses stdio transport by default (for local dev or MCP Inspector testing)
# For network HTTP (used in Quick Start above), add --transport http: python -m src.mcp_servers.XXXX --transport http --host 0.0.0.0 --port PORT
python scripts/start_mcp_servers.py --server edgar # Stdio/SSE transport mode (dev only)
python scripts/test_mcp_live.py # Smoke test MCP servers
python -m scripts.run_csv_eval \
--dataset-path finance-agent/data/public.csv \
--purple-endpoint http://localhost:9110 \
--output /tmp/summary.json --no-debate --limit 5
# BizFinBench.v2 evaluation (29,578 Q&A pairs across 9 task types)
# English (8 tasks): anomaly_information_tracing, conterfactual, event_logic_reasoning,
# financial_data_description, financial_multi_turn_perception, financial_quantitative_computation,
# stock_price_predict, user_sentiment_analysis
# Chinese (9 tasks): all above + financial_report_analysis
python -m scripts.run_bizfin_eval \
--dataset-path data/BizFinBench.v2 \
--task-type event_logic_reasoning \
--language en \
--purple-endpoint http://localhost:9110 \
--output /tmp/bizfin_summary.json --limit 10
# List task types by language:
python -c "from cio_agent.local_datasets import BizFinBenchProvider; print(BizFinBenchProvider.list_task_types_by_language())"
# Dataset-specific evaluators (exact-match scoring by default, optional LLM grading):
# BizFinBench: numerical matching (+/-1% tolerance), sequence matching, classification
python -m scripts.run_bizfin_simple \
--dataset-path data/BizFinBench.v2 \
--task-type financial_quantitative_computation \
--language en \
--purple-endpoint http://localhost:9110 \
--output /tmp/bizfin_results.json --limit 5
# Optional: --eval-llm --eval-llm-model gpt-4o-mini
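# The +/-1% numerical tolerance above means a prediction passes when
# |pred - gold| <= 0.01 * |gold| (a sketch of the rule as stated; the repo's
# evaluator may treat zeros/signs differently):
python -c "pred, gold = 101.0, 100.0; print(abs(pred - gold) <= 0.01 * abs(gold))"  # True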
# public.csv: correctness/contradiction rubric evaluation
python -m scripts.run_csv_simple \
--dataset-path finance-agent/data/public.csv \
--purple-endpoint http://localhost:9110 \
--output /tmp/csv_results.json --limit 5
# Optional: --eval-llm --eval-llm-model gpt-4o-mini
# Alternative direct startup (stdio by default)
# Default: stdio transport (not accessible via HTTP). Add --transport http for network access.
python src/mcp_servers/sec_edgar.py # Stdio only
python src/mcp_servers/sec_edgar.py --transport http --port 8101 # HTTP on :8101
python src/mcp_servers/yahoo_finance.py --transport http --port 8102 # HTTP on :8102
python src/mcp_servers/sandbox.py --transport http --port 8103 # HTTP on :8103
# Purple Agent startup methods (all use HTTP/Uvicorn, differ in features):
python src/simple_purple_agent.py --host 0.0.0.0 --port 9110 # Minimal A2A + REST test agent
python src/purple_agent/server.py # Full A2A server (read .env for LLM config)
purple-agent serve --host 0.0.0.0 --port 9110                      # CLI wrapper for src/purple_agent/server.py
Tip: Using hosted APIs instead of local vLLM? You can skip Terminal 1 and just configure your `.env`:
# OpenAI (skip Terminal 1)
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-REDACTED
LLM_MODEL=gpt-4o
# Do not set OPENAI_API_BASE or OPENAI_BASE_URL when using OpenAI's hosted API
# Anthropic (skip Terminal 1)
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-REDACTED
LLM_MODEL=claude-3.5-sonnet
Tip: For vLLM-backed LLM calls, set these in `.env` (auto-loaded):
LLM_PROVIDER=openai
OPENAI_API_BASE=http://localhost:8000/v1
OPENAI_BASE_URL=http://localhost:8000/v1 # alias for OPENAI_API_BASE
OPENAI_API_KEY=dummy
LLM_MODEL=Qwen/Qwen3-32B              # or your deployed model
Note: vLLM must be started with tool calling enabled:
python scripts/deploy_vllm.py --model qwen3-32b   # Recommended
# Or manually: vllm serve Qwen/Qwen3-32B --enable-auto-tool-choice --tool-call-parser qwen3_xml
The system uses 6 MCP servers for financial data and options trading:
| Server | Port | Purpose |
|---|---|---|
| SEC EDGAR MCP | 8101 | SEC filings, XBRL data, temporal locking |
| Yahoo Finance MCP | 8102 | Market data, statistics, lookahead detection |
| Sandbox MCP | 8103 | Python code execution |
| Options Chain MCP | 8104 | Black-Scholes pricing, Greeks, IV surface |
| Trading Sim MCP | 8105 | Paper trading, slippage simulation, P&L |
| Risk Metrics MCP | 8106 | VaR, Sharpe/Sortino, stress testing |
Configure via environment variables or .env file:
# Core MCP Servers
MCP_EDGAR_URL=http://localhost:8101
MCP_YFINANCE_URL=http://localhost:8102
MCP_SANDBOX_URL=http://localhost:8103
# Options Alpha MCP Servers
MCP_OPTIONS_URL=http://localhost:8104
MCP_TRADING_URL=http://localhost:8105
MCP_RISK_URL=http://localhost:8106
Tip: If MCP URLs are unset, the Purple Agent falls back to in-process MCP servers.
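If you do run the servers externally, here is a quick connectivity probe (uses bash's `/dev/tcp`, so it assumes nothing about the servers' HTTP routes):

```bash
for port in 8101 8102 8103 8104 8105 8106; do
  (echo > /dev/tcp/localhost/"$port") 2>/dev/null \
    && echo "port $port: open" || echo "port $port: closed"
done
```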
| Image | URL |
|---|---|
| Green Agent | ghcr.io/yxc20089/agentbusters-green:latest |
| Purple Agent | ghcr.io/yxc20089/agentbusters-purple:latest |
# Pull and run Green Agent
docker pull ghcr.io/yxc20089/agentbusters-green:latest
docker run -p 9109:9109 ghcr.io/yxc20089/agentbusters-green:latest --host 0.0.0.0
# Pull and run Purple Agent
docker pull ghcr.io/yxc20089/agentbusters-purple:latest
docker run -p 9110:9110 -e OPENAI_API_KEY=sk-xxx ghcr.io/yxc20089/agentbusters-purple:latest
# Start all services (6 MCP servers + Green + Purple agents)
docker compose up -d
# Check status
docker compose ps
# View logs
docker compose logs -f green-agent purple-agent
# Build Green Agent
docker build -f Dockerfile.green -t cio-agent-green .
docker run -p 9109:9109 cio-agent-green --host 0.0.0.0
# Build Purple Agent
docker build -f Dockerfile.purple -t purple-agent .
docker run -p 9110:9110 -e OPENAI_API_KEY=sk-xxx purple-agent
# Green Agent
docker build -f Dockerfile -t cio-agent-green .
docker run -p 9109:9109 cio-agent-green
# Purple Agent
docker build -f Dockerfile.purple -t purple-agent .
docker run -p 9110:9110 purple-agent
# MCP Servers
docker build -f Dockerfile.mcp-edgar -t mcp-edgar .
docker run -p 8101:8000 mcp-edgar
docker build -f Dockerfile.mcp-yahoo -t mcp-yahoo .
docker run -p 8102:8000 mcp-yahoo
docker build -f Dockerfile.mcp-sandbox -t mcp-sandbox .
docker run -p 8103:8000 mcp-sandbox
Port mapping: Green Agent 9109, Purple Agent 9110, EDGAR 8101, YFinance 8102, Sandbox 8103.
# 1. Create .env from template
cp .env.example .env
# 2. Edit .env with your LLM configuration
For local vLLM (Qwen3-32B):
LLM_PROVIDER=openai
OPENAI_API_BASE=http://localhost:8000/v1
OPENAI_API_KEY=dummy
LLM_MODEL=Qwen/Qwen3-32B
⚠️ Important: When starting vLLM, you must enable tool calling: `vllm serve Qwen/Qwen3-32B --port 8000 --enable-auto-tool-choice --tool-call-parser qwen3_xml`
Or use the deployment script:
python scripts/deploy_vllm.py --model qwen3-32b
For OpenAI API:
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-REDACTED
LLM_MODEL=gpt-4o
For Anthropic API:
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-REDACTED
LLM_MODEL=claude-3.5-sonnet
MCP Servers (optional):
MCP_EDGAR_URL=http://localhost:8101
MCP_YFINANCE_URL=http://localhost:8102
MCP_SANDBOX_URL=http://localhost:8103
LLM grading for dataset evaluators (bizfinbench/public_csv):
EVAL_USE_LLM=true
EVAL_LLM_MODEL=gpt-4o-mini
EVAL_LLM_TEMPERATURE=0.0
Uses `OPENAI_API_KEY` plus `OPENAI_BASE_URL`/`OPENAI_API_BASE` (OpenAI-compatible), or `ANTHROPIC_API_KEY`.
CLI override example:
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
  --eval-llm --eval-llm-model gpt-4o-mini --eval-llm-temperature 0.0
Predicted output storage (recommended for memory control):
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
--eval-config config/eval_config.yaml \
  --store-predicted --predicted-max-chars 200
By default, predicted outputs are omitted from results (fields are empty).
Use --store-predicted to include them, and --no-truncate-predicted to keep full outputs.
Question text storage:
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
--eval-config config/eval_config.yaml \
  --store-question --no-truncate-question
By default, question text is truncated to 200 characters.
Use --store-question --no-truncate-question to keep full question text in results.
Expected answer storage:
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
--eval-config config/eval_config.yaml \
  --store-expected --no-truncate-expected
By default, expected answer text is truncated to 100 characters.
Use --store-expected --no-truncate-expected to keep full expected answers in results.
Logging control:
# Default: quiet mode - minimal output
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
--eval-config config/eval_config.yaml
# Verbose mode - show detailed startup info
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
--eval-config config/eval_config.yaml --verbose
# Debug mode - verbose output for troubleshooting
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
  --eval-config config/eval_config.yaml --debug
Default is quiet mode (minimal output). Use --verbose or -v for detailed startup info.
Use --debug for evaluation details (Greeks extraction, LLM calls, etc.).
Database configuration:
# Green Agent with custom database
python src/cio_agent/a2a_server.py --host 0.0.0.0 --port 9109 \
--eval-config config/eval_config.yaml \
--database-url "sqlite+aiosqlite:///green_results.db"
# Purple Agent with custom database
purple-agent serve --host 0.0.0.0 --port 9110 \
--database-url "sqlite+aiosqlite:///purple_tasks.db"The --database-url parameter overrides DATABASE_URL (Green) or PURPLE_DATABASE_URL (Purple) environment variables.
The agents automatically load `.env` on startup. Alternatively, you can use `export` commands instead of a `.env` file.
| Variable | Description | Example |
|---|---|---|
| `LLM_PROVIDER` | LLM provider | `openai`, `anthropic` |
| `LLM_MODEL` | Model name | `gpt-4o`, `claude-3.5-sonnet`, `openai/gpt-oss-20b` |
| `OPENAI_API_KEY` | OpenAI API key | `sk-...` |
| `OPENAI_API_BASE` | Custom API endpoint (for local vLLM) | `http://localhost:8000/v1` |
| `OPENAI_BASE_URL` | Alias for `OPENAI_API_BASE` | `http://localhost:8000/v1` |
| `ANTHROPIC_API_KEY` | Anthropic API key | `sk-ant-...` |
| `EVAL_USE_LLM` | Enable LLM grading for dataset evaluators | `true` |
| `EVAL_LLM_MODEL` | Model override for LLM grading | `gpt-4o-mini` |
| `EVAL_LLM_TEMPERATURE` | Temperature for LLM grading | `0.0` |
| `MCP_EDGAR_URL` | SEC EDGAR MCP server | `http://localhost:8101` |
| `MCP_YFINANCE_URL` | Yahoo Finance MCP server | `http://localhost:8102` |
| `MCP_SANDBOX_URL` | Sandbox MCP server | `http://localhost:8103` |
| `DATABASE_URL` | SQLite database URL (Green Agent) | `sqlite+aiosqlite:///tasks.db` |
| `PURPLE_DATABASE_URL` | SQLite database URL (Purple Agent) | `sqlite+aiosqlite:///purple_tasks.db` |
Note: Both agents support a `--database-url` CLI parameter, which overrides the corresponding environment variable.
The Green Agent uses SQLite for persistent task storage. The database file (tasks.db) is created automatically on first use.
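To inspect what was created (the `tasks` table is the one queried in the examples above; any other tables are SDK internals):

```bash
sqlite3 tasks.db ".tables"        # list tables
sqlite3 tasks.db ".schema tasks"  # show the tasks table schema
```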
Backup:
# Simple file copy (stop server first for consistency)
cp tasks.db tasks.db.backup
# Or with timestamp
cp tasks.db "tasks_$(date +%Y%m%d_%H%M%S).db"Reset database:
# Delete to start fresh (all task history will be lost)
rm tasks.db
Migrations: The A2A SDK handles schema internally. If you encounter schema errors after upgrading a2a-sdk, delete tasks.db to regenerate with the new schema.
Troubleshooting:
- "Database is locked" → Ensure only one server instance is running
- "Disk I/O error" → Check disk space and file permissions
AgentBusters/
├── src/
│ ├── cio_agent/ # Green Agent (Evaluator)
│ │ ├── a2a_server.py # A2A server entry point (AgentBeats)
│ │ ├── green_executor.py # A2A protocol executor
│ │ ├── green_agent.py # FAB++ evaluation logic
│ │ ├── messenger.py # A2A messaging utilities
│ │ ├── models.py # Core data models (18 TaskCategories)
│ │ ├── evaluator.py # Comprehensive evaluator
│ │ ├── debate.py # Adversarial debate manager
│ │ ├── task_generator.py # Dynamic task generation (18 templates)
│ │ └── cli.py # CLI interface
│ │
│ ├── purple_agent/ # Purple Agent (Finance Analyst)
│ │ ├── server.py # A2A FastAPI server
│ │ ├── executor.py # A2A executor (options support)
│ │ ├── mcp_toolkit.py # MCP client toolkit (21 methods)
│ │ └── cli.py # CLI interface
│ │
│ ├── mcp_servers/ # MCP servers (FastMCP)
│ │ ├── sec_edgar.py # SEC EDGAR server (:8101)
│ │ ├── yahoo_finance.py # Yahoo Finance server (:8102)
│ │ ├── sandbox.py # Python execution sandbox (:8103)
│ │ ├── options_chain.py # Black-Scholes pricing (:8104)
│ │ ├── trading_sim.py # Paper trading simulator (:8105)
│ │ └── risk_metrics.py # VaR, Sharpe, stress tests (:8106)
│ │
│ └── evaluators/ # Evaluation components
│ ├── macro.py # Macro thesis evaluator
│ ├── fundamental.py # Fundamental analysis evaluator
│ ├── execution.py # Execution quality evaluator
│ └── options.py # Options-specific evaluator (P&L, Greeks)
│
├── scripts/
│ ├── run_a2a_eval.py # A2A evaluation trigger
│ ├── run_options_demo.py # Options Alpha Challenge demo
│ └── run_csv_eval.py # CSV dataset evaluation
│
├── data/
│ ├── synthetic_questions/ # Generated evaluation tasks
│ ├── BizFinBench.v2/ # HiThink benchmark dataset
│ └── financial_lake/ # Cached financial data
│
├── docs/
│ └── ARCHITECTURE_OPTIONS.md # Options system design
│
├── Dockerfile.green # Green Agent container
├── Dockerfile.purple # Purple Agent container
├── docker-compose.yml # Full stack deployment
└── ABSTRACT.md # Competition abstract
The benchmark includes a comprehensive options trading evaluation:
- Iron Condor: Construct neutral strategies with defined risk
- Volatility Trading: IV rank/percentile analysis
- Greeks Hedging: Delta neutralization strategies
- Risk Management: VaR-based position sizing
| Dimension | Weight | Description |
|---|---|---|
| P&L Accuracy | 25% | Profit/loss calculations |
| Greeks Accuracy | 25% | Delta, gamma, theta, vega |
| Strategy Quality | 25% | Structure and rationale |
| Risk Management | 25% | Position sizing, hedging |
# Single task
python scripts/run_options_demo.py --task iron_condor --ticker SPY
# All task types
python scripts/run_options_demo.py --task all --ticker SPY
The evaluation uses the Alpha Score metric:
Alpha Score = (RoleScore × DebateMultiplier) / (ln(1 + Cost) × (1 + LookaheadPenalty))
Where:
- RoleScore: Weighted combination of Macro (30%), Fundamental (40%), Execution (30%)
- DebateMultiplier: 0.5x - 1.2x based on conviction in adversarial debate
- Cost: Total USD cost of LLM and tool calls
- LookaheadPenalty: Penalty for temporal violations (accessing future data)
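A worked example with hypothetical numbers (a sketch of the formula above, not the evaluator's actual output):

```bash
python -c "
import math
role  = 0.30*70 + 0.40*80 + 0.30*65                     # Macro/Fundamental/Execution -> 72.5
alpha = (role * 1.1) / (math.log(1 + 4.0) * (1 + 0.0))  # debate 1.1x, cost 4 USD, no lookahead penalty
print(round(alpha, 2))                                  # ~49.55
"
```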
# Run all tests (excluding integration tests that require external services)
python -m pytest tests/ -v -m "not integration"
# Run all tests including integration tests (requires Purple Agent at :9110)
python -m pytest tests/ -v
# Run specific test categories
python -m pytest tests/test_evaluators.py -v # Unit tests
python -m pytest tests/test_purple_agent.py -v # Purple Agent tests
python -m pytest tests/test_mcp_servers.py -v # MCP server tests
python -m pytest tests/ -v -m integration # Integration tests only
# Run A2A conformance tests (requires Green Agent at :9109)
python -m pytest tests/test_a2a_green.py -v --agent-url http://localhost:9109
# Run with coverage
python -m pytest tests/ --cov=src --cov-report=html -m "not integration"
Test Markers:
- `@pytest.mark.integration`: Tests requiring external services (Purple/Green agents)
- `@pytest.mark.asyncio`: Async tests
Green Agent:
| Endpoint | Method | Description |
|---|---|---|
| `/.well-known/agent.json` | GET | Agent Card (A2A discovery) |
| `/` | POST | A2A JSON-RPC endpoint |
Purple Agent:
| Endpoint | Method | Description |
|---|---|---|
| `/.well-known/agent.json` | GET | Agent Card (A2A discovery) |
| `/health` | GET | Health check |
| `/analyze` | POST | Direct analysis (non-A2A) |
| `/` | POST | A2A JSON-RPC endpoint |
This project is built for the AgentBeats Finance Track:
- Phase 1 (Jan 15, 2026): Green Agent submissions
- Phase 2 (Feb 2026): Purple Agent submissions
MIT License - see LICENSE for details.
- Fork the repository
- Create a feature branch
- Run tests: `python -m pytest tests/ -v`
- Submit a pull request
- AgentBeats Competition by Berkeley RDI
- A2A Protocol by Google
- FAB Benchmark for task templates
- green-agent-template for A2A implementation reference
