A self-hosted platform for evaluating, observing, and governing any multi-agent AI system in real time and offline. No external SaaS dependencies — everything runs locally via Docker.
Built on: Docker · ClickHouse · Jaeger · OpenTelemetry · FastAPI · Streamlit · Claude
- What This Platform Does
- Architecture Overview
- Quick Start
- Project Structure
- Infrastructure Stack
- Evaluation System
- Observability System
- Instrumentation SDK
- Eval UI
- AI Governance System
- Governance Enforcement
- EvalGov Intelligence Agent
- Multi-Agent Demo
- Running Offline Benchmarks
- Onboarding a New Agent System
- Evaluation Metrics Reference
- Configuration Reference
- Common Commands
- Captures every trace, log, and metric from any agent system via OpenTelemetry
- LLM spans follow OTel GenAI semantic conventions (`gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.request.model`)
- Stores everything in ClickHouse for fast analytical queries
- Jaeger for interactive distributed trace exploration
- Two modes: real-time (production monitoring) and offline (benchmark test harness)
- Three-tier evaluator cascade: deterministic → LLM-as-judge → multi-turn judge
- LLM judges: faithfulness, relevance, instruction following, handoff fidelity, hallucination, safety, QA correctness, custom rubric
- Multi-turn judges: context retention, topic adherence, response consistency, conversation relevancy
- OTel-computed per-trace metrics: task success, latency, cost, token count, tool success rate, handoff success rate, dead span rate, all written to `eval_scores` on every trace
- Threshold alerting: 38 of 68 metrics configurable; violations feed shows prompt, trace, agent, and judge reasoning
- Regression detection: every run compared against a pinned baseline
- 13 detection categories: Safety, Identity, Reliability, Behavior, Lifecycle, Regulatory, Supply Chain, Tool Whitelist, SLO, Incidents, Anomaly Detection, Compliance Scorecards, Risk Register
- Pre-execution enforcement: gate check API classifies every agent action by risk tier; high-risk actions paused for human approval (HITL)
- Circuit breaker: CLOSED/OPEN/HALF_OPEN state machine per agent
- Trust scoring: composite score per agent from identity, behavior, compliance, and network signals
- Content quality gates: post-hoc watcher fires every 30s; `flag → hold → block` decision ladder
- Conversational interface: Claude Sonnet agent with 44 real-time tools — answer any governance/eval/cost/trace question in plain English
- On-demand RCA: ask the agent why a circuit breaker opened, what caused a trust score drop, or what a safety event means — it correlates signals across governance, traces, and eval scores to give a specific root cause and recommended action
- Live system state: UI panel auto-refreshes every 60s showing active HITL requests, open incidents, circuit breaker states, and policy violations
- MCP server: expose all 44 tools to Claude Code and other AI agents
| Framework | Support |
|---|---|
| LangChain / LangGraph | ✅ Adapter included |
| AutoGen | ✅ Adapter included |
| CrewAI | ✅ Adapter included |
| Anthropic Claude SDK | ✅ Adapter included |
| OpenAI Agents SDK | ✅ Adapter included |
| OpenAI SDK (direct) | ✅ Auto-instrumented via OpenLLMetry |
| Google ADK | ✅ Works via manual span wrappers (see demo) |
| Custom Python agents | ✅ Manual span context managers |
┌─────────────────────────────────────────────────────────────────┐
│ Your Agent System (any SDK) │
│ pip install -e ./instrumentation[langchain] │
│ init_telemetry() + framework adapter or manual spans │
│ GovernedToolkit — gate check + HITL polling on every call │
└────────────────────────────┬────────────────────────────────────┘
│ OTLP gRPC :4317 / HTTP :4318
▼
┌──────────────────────────────┐
│ OTel Collector :4317/4318 │
│ Normalises + fans out │
└──────┬──────────┬────────────┘
│ │
┌──────▼──┐ ┌────▼────┐
│ClickHouse│ │ Jaeger │
│:8123/9000│ │:16686 │
│traces │ │trace UI │
│eval_scores│ └─────────┘
│gov_* │
└──────┬───┘
│
┌──────▼───────────┐ ┌──────────────────────┐
│ Eval Runner │ │ Governance Service │
│ :8000 │ │ :8002 │
│ Tiered evals │ │ Gate check API │
│ LLM judges │ │ Circuit breakers │
│ Regression │ │ Trust scores │
└──────────────────┘ │ 13 detection cats │
└──────────┬───────────┘
┌──────────────────────────────────┐│
│ EvalGov Agent :8003 ││
│ Chat agent · MCP server ◄┘
│ Live system state panel │
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ Eval UI :8501 Streamlit │
│ Eval · Traces · Benchmarks │
│ AI Governance (13 categories) │
│ Governance Enforcement │
│ EvalGov Agent (chat + system state) │
└──────────────────────────────────┘
Agent runs → OTel spans → Collector → ClickHouse (stored)
→ Jaeger (trace UI)
→ Eval Runner → scores → ClickHouse
→ gate check → Governance Service → HITL queue → agent waits
→ CB update
→ trust score update
ClickHouse + Governance Service → EvalGov Agent → live system state panel
- Docker Desktop with Compose v2
- Python 3.10+
- Anthropic API key (for LLM judge evaluators and EvalGov Agent)
git clone <your-repo>
cd opt-aieval
cp .env.example .env
# Set ANTHROPIC_API_KEY in .env
docker compose up -d
# Starts 7 services: ClickHouse, Jaeger, OTel Collector,
# Eval Runner, Governance Service, EvalGov Agent, Eval UI

| UI | URL |
|---|---|
| Eval UI (main dashboard) | http://localhost:8501 |
| Jaeger (trace viewer) | http://localhost:16686 |
| Governance API docs | http://localhost:8002/docs |
| EvalGov Agent API docs | http://localhost:8003/docs |
| ClickHouse console | http://localhost:8123/play |
# Terminal 1 — eval platform is already running from step 2 above (repo root)
# Terminal 2 — demo agent (from the demo directory)
cd examples/opt-demo
cp .env.example .env # set OPENAI_API_KEY or ANTHROPIC_API_KEY
pip install -r requirements.txt
uvicorn api:app --host 0.0.0.0 --port 8080 --workers 1
# Terminal 3 — chat UI (same demo directory)
cd examples/opt-demo
streamlit run chat_ui.py # → http://localhost:8502

opt-aieval/
├── README.md This file — platform overview
├── docker-compose.yml 7-service stack
├── .env.example Environment variable template
│
├── docs/ Reference documentation
│ ├── instrumentation-guide.md Full span schema, attribute reference, evaluator tables
│ ├── gov-instrument-guide.md Phase 2 gate decision flow, HITL patterns, CB guide
│ ├── eval-metrics-dashboard-guide.md Filter behaviour + Option B span-level fix guide
│ ├── evalgov-agent.md EvalGov Agent — tool reference, chat agent, MCP server
│ ├── governance-enforcement-architecture.md Three-layer enforcement design
│ └── hitl-and-gate-ui-reference.md HITL and gate check UI reference
│
├── infra/
│ ├── clickhouse/
│ │ ├── init.sql Database schema (eval + governance tables)
│ │ └── users.xml ClickHouse user config
│ └── otelcol/
│ └── config.yaml OTel Collector: receivers, processors, exporters
│
├── instrumentation/ Instrumentation SDK — install and use this
│ ├── README.md ← Start here to instrument your agent system
│ ├── pyproject.toml pip-installable package config
│ ├── telemetry.py OTel init — init_telemetry(), get_tracer(), get_run_id()
│ ├── spans.py Canonical span context managers
│ ├── governed_toolkit.py GovernedToolkit — pre-execution gate check wrapper
│ ├── gate_client.py Low-level gate check HTTP client
│ ├── normalizer.py SDK attribute → canonical schema mapping
│ └── adapters/
│ ├── langchain_adapter.py LangChainEvalAdapter
│ ├── autogen_adapter.py AutoGenEvalAdapter
│ ├── crewai_adapter.py CrewAIEvalAdapter
│ ├── claude_adapter.py ClaudeEvalAdapter
│ └── openai_agents_adapter.py OpenAIAgentsEvalAdapter
│
├── eval_runner/ Evaluation service (FastAPI :8000)
│ ├── main.py REST endpoints + OTLP ingestion + eval dedup
│ ├── config/evaluator_config.yaml Evaluator routing config + regression thresholds
│ ├── evaluators/
│ │ ├── deterministic/ format_check, tool_accuracy, step_efficiency, etc.
│ │ └── llm_judges/ faithfulness, relevance, hallucination, safety, etc.
│ ├── pipeline/
│ │ ├── eval_pipeline.py Eval orchestration + OTel-computed metrics
│ │ └── conversation_eval.py Multi-turn judge runner
│ └── db/ ClickHouse client + repository
│
├── eval_ui/ Streamlit evaluation UI (:8501)
│ ├── app.py Home page with live metrics
│ └── pages/
│ ├── 1_Eval_Testing.py Run management + baseline
│ ├── 2_Eval_Measurements.py Traces, Scores, Regression, Benchmarks, Conversations
│ ├── 10_AI_Governance.py 13 governance categories, policy engine, compliance
│ ├── 11_Governance_Enforcement.py Circuit breakers, HITL approvals, quality gates
│ └── 14_EvalGov_Agent.py Chat UI + live system state panel
│
├── governance_service/ Governance FastAPI service (:8002)
│ ├── main.py 90+ REST endpoints across all governance domains
│ ├── db.py ClickHouse client + all governance CRUD
│ ├── gate.py Pre-execution gate check (risk tier logic)
│ ├── enforcement_engine.py Trust score + rogue assessment computation
│ ├── quality_gate_checker.py Post-hoc quality watcher (30s loop)
│ ├── incident_manager.py Incident creation, dedup, lifecycle
│ └── anomaly_detector.py Statistical anomaly detection vs baselines
│
├── evalgov_agent/ EvalGov Intelligence Agent (:8003)
│ ├── agent.py Claude chat agent with 44-tool loop
│ ├── tools.py Tool implementations + Anthropic schemas
│ ├── mcp_server.py MCP SSE server exposing all 44 tools
│ └── db.py ClickHouse client for analytics queries
│
│
├── benchmarks/ Offline benchmark test harness
│ ├── runner.py CLI: run suites, compare baseline
│ └── suites/
│ ├── unit/cases.json
│ ├── integration/cases.json
│ └── collaboration/cases.json
│
└── examples/
├── opt-demo/ Primary demo — 4-agent system (orchestrator, searcher,
│ summarizer, translator) with full governance integration
├── langchain_agent_example.py LangChain agent with LangChainEvalAdapter
├── claude_agent_example.py Claude SDK agent with ClaudeEvalAdapter
└── custom_agent_example.py Custom Python agent — all span types demonstrated
All observability signals and evaluation scores land here.
OTel tables (auto-created by collector):
- `otel.otel_traces`: all spans
- `otel.otel_logs`: structured logs
- `otel.otel_metrics_*`: metrics
Eval tables (created by infra/clickhouse/init.sql):
- `otel.eval_runs`: groups of traces (one per benchmark run or tagged production period)
- `otel.eval_scores`: evaluator scores linked to `trace_id`; `eval_type` is one of `deterministic`, `llm_judge`, `multiturn_judge`, `otel_computed`
- `otel.benchmarks`: offline test cases with `dataset_version` for stable regression comparisons
- `otel.prompt_evals`: per-agent prompt/response pairs with token counts and score maps; deduplicated by `(trace_id, span_id)`
- `otel.alert_thresholds`: one row per metric; operator + threshold value + enabled flag
- `otel.human_reviews`: human feedback on eval scores
Direct query access:
-- Recent eval scores
SELECT trace_id, metric, score, reasoning
FROM otel.eval_scores
ORDER BY evaluated_at DESC LIMIT 20;
-- Average score per metric for a run
SELECT metric, avg(score)
FROM otel.eval_scores
WHERE run_id = 'your-run-id'
GROUP BY metric;

Receives OTLP from agents, normalises SDK-specific attributes, fans out to backends.
- `:4317`: OTLP gRPC
- `:4318`: OTLP HTTP
- `:8888`: Collector self-metrics
- `:13133`: Health check
Normalisation maps LangChain / AutoGen / CrewAI attribute names to the canonical schema before storage. Config: infra/otelcol/config.yaml.
Interactive distributed trace explorer at http://localhost:16686. Every agent execution shows as a full span tree with parent-child nesting.
Offline (test harness), run before deploying changes:
python benchmarks/runner.py --suite unit --name "my-run-v2"
python benchmarks/runner.py --suite unit --name "my-run-v2" --compare-baseline

Online (real-time), monitors production traffic automatically. Every `agent.task` span completion triggers the eval pipeline. LLM judges run on a 15% sample to control cost.
| Tier | Evaluators | Cost | When it runs |
|---|---|---|---|
| Deterministic | format_compliance, tool_accuracy, step_efficiency, tool_output_handling, tool_error_rate, timeout_rate | Free | Always, 100% |
| LLM Judge | faithfulness, relevance, instruction_following, handoff_fidelity, hallucination_score, safety_score, qa_correctness, custom_rubric | ~$0.001/trace | Offline: 100% · Online: 15% sample |
| Multi-turn | context_retention, topic_adherence, response_consistency, conversation_relevancy | ~$0.001/conversation | Per conversation_id after idle 60s |
| OTel-computed | task_success, agent_failure, timeout, task_latency_ms, cost_usd, token_count, context_window_utilization, tool_call_success_rate, handoff_success_rate, dead_span_rate | Free | Always, 100% |
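The online sampling rule for the LLM judge tier can be sketched in a few lines. This is a minimal illustration, not the Eval Runner's actual code: the function name `should_run_llm_judge` and the hash-based bucketing are assumptions, while `online_llm_judge_rate` and `always_judge_on_failure` mirror the keys in `eval_runner/config/evaluator_config.yaml`.

```python
import hashlib


def should_run_llm_judge(
    trace_id: str,
    task_failed: bool,
    online_llm_judge_rate: float = 0.15,
    always_judge_on_failure: bool = True,
) -> bool:
    """Decide whether an online trace is sent to the LLM judge tier.

    Hypothetical helper: the real service may sample differently.
    """
    # Failed tasks bypass sampling when the config flag is set.
    if task_failed and always_judge_on_failure:
        return True
    # Map the trace ID to a stable bucket in [0, 1) so the same trace
    # always receives the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < online_llm_judge_rate
```

Hashing the trace ID instead of drawing a random number keeps the decision stable if the same trace is processed twice.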
GET /health
GET /runs List all eval runs
POST /runs Create a new run
PUT /runs/{run_id}/baseline Set run as baseline
POST /runs/{run_id}/evaluate Trigger offline eval
GET /runs/{run_id}/scores Aggregated scores
GET /runs/{run_id}/regression Score deltas vs baseline
GET /traces/{trace_id}/scores All scores for a trace
Edit eval_runner/config/evaluator_config.yaml:
sampling:
  online_llm_judge_rate: 0.15 # 15% of traces get LLM judges
  always_judge_on_failure: true # always judge failed tasks
regression:
  alert_threshold: 0.05 # flag if metric drops >5%

Once `init_telemetry()` is called, OpenLLMetry auto-instrumentation captures every LLM call:
- Model name: `gen_ai.request.model`
- Token counts: `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens`
- Latency: span `Duration`
- Finish reason: `gen_ai.response.finish_reasons`
agent.task:
agent.id, agent.role, task.id, task.input, task.output,
task.status, run.id,
trace.source (production|benchmark|exploratory),
conversation.id
agent.tool_call:
agent.id, tool.name, tool.input (JSON), tool.output (JSON), tool.success, run.id
agent.handoff:
from.agent_id, to.agent_id, handoff.reason, handoff.context (JSON)
agent.memory:
agent.id, memory.op (read|write), memory.key, memory.scope
agent.decision:
agent.id, decision.input, decision.options, decision.chosen, decision.reason
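To make the `agent.task` schema above concrete, here is a small sketch that builds the attribute dictionary for one span. The class `AgentTaskAttributes` and its method are hypothetical and not part of the SDK; only the attribute keys and the three `trace.source` values come from the schema. Which fields are mandatory is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values for trace.source, as listed in the schema.
ALLOWED_TRACE_SOURCES = {"production", "benchmark", "exploratory"}


@dataclass
class AgentTaskAttributes:
    """Hypothetical container for agent.task span attributes."""

    agent_id: str
    agent_role: str
    task_id: str
    task_input: str
    run_id: str
    trace_source: str = "production"
    task_output: Optional[str] = None
    task_status: Optional[str] = None
    conversation_id: Optional[str] = None

    def to_span_attributes(self) -> dict:
        """Return a flat dict keyed by the canonical attribute names."""
        if self.trace_source not in ALLOWED_TRACE_SOURCES:
            raise ValueError(f"unknown trace.source: {self.trace_source!r}")
        attrs = {
            "agent.id": self.agent_id,
            "agent.role": self.agent_role,
            "task.id": self.task_id,
            "task.input": self.task_input,
            "task.output": self.task_output,
            "task.status": self.task_status,
            "run.id": self.run_id,
            "trace.source": self.trace_source,
            "conversation.id": self.conversation_id,
        }
        # Span attribute values cannot be None, so unset fields are dropped.
        return {key: value for key, value in attrs.items() if value is not None}
```

The resulting dictionary can be passed to whatever span API is in use; in practice the SDK's own span context managers already set these attributes.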
When a span with SpanName = 'agent.task' arrives:
- OTel Collector forwards span to Eval Runner (`/v1/traces`)
- Eval Runner assembles the full trace tree from ClickHouse
- Evaluator pipeline fires (deterministic → LLM judges per sampling config)
- Scores written to `otel.eval_scores`
The instrumentation/ directory is a pip-installable SDK. It provides init_telemetry(), framework adapters for all supported SDKs, canonical span context managers, and GovernedToolkit for pre-execution governance enforcement.
To instrument your agent system, start here:
instrumentation/README.md
That document covers installation, framework adapters, span context managers, governance enforcement integration, and how to expose a /chat endpoint for benchmark execution.
The detailed reference material lives in:
- `docs/instrumentation-guide.md`: full span schema, attribute reference, evaluator tables, multi-agent context patterns
- `docs/gov-instrument-guide.md`: gate decision flow, HITL patterns, circuit breaker integration
The Streamlit UI at http://localhost:8501 provides the main evaluation and governance interface.
| Page | Purpose |
|---|---|
| Home | Live metrics: total runs, traces evaluated, avg scores |
| 1 · Eval Testing | Create runs, set baseline, trigger offline eval |
| 2 · Eval Measurements | Traces, Scores, Regression, Benchmarks, Conversations, Eval Metrics (68 metrics / 9 categories), Prompt Analysis |
| 10 · AI Governance | 13 governance categories, policy engine, anomaly events, compliance reports |
| 11 · Governance Enforcement | Circuit breakers, trust scores, HITL approvals, quality gates, on-demand enforcement cycle |
| 14 · EvalGov Agent | Chat UI + live system state panel |
- Estimated cost computed from `gen_ai.usage.*` tokens × configurable rates from `gov_threshold_config`
- Source filter in Eval Measurements: narrows to `production`, `benchmark`, or `exploratory` traces
- Conversation filter: filter by partial or full `conversation.id` across Traces and Prompt Analysis
- Threshold alerting on 38 of 68 metrics; violations feed shows full prompt, trace, agent, and judge reasoning
- Regression comparison — select two runs for side-by-side metric deltas and radar chart
For filter behaviour details and known limitations, see docs/eval-metrics-dashboard-guide.md.
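The regression comparison reduces to per-metric deltas between a baseline run and a current run. The sketch below is illustrative only: `find_regressions` is a hypothetical name, and treating `alert_threshold` as an absolute drop on a 0 to 1 score (rather than a relative change) is an assumption.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    alert_threshold: float = 0.05,
) -> dict[str, float]:
    """Return metrics whose score dropped by more than alert_threshold.

    Assumes higher scores are better; metrics missing from the current
    run are skipped rather than flagged.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        if metric not in current:
            continue
        delta = current[metric] - base_score
        if delta < -alert_threshold:
            regressions[metric] = round(delta, 4)
    return regressions


baseline_run = {"faithfulness": 0.90, "relevance": 0.85}
current_run = {"faithfulness": 0.80, "relevance": 0.84}
print(find_regressions(baseline_run, current_run))
```

Metrics where lower is better, such as latency or cost, would need the sign of the comparison flipped.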
The governance service runs as a separate FastAPI service (:8002) and wraps every agent action with a policy and enforcement layer.
| Category | What it tracks |
|---|---|
| Safety | Prompt injection, jailbreak attempts, toxicity, bias |
| Identity | Session token TTL violations, over-provisioned tool access |
| Reliability | Error rates, SLO compliance, uptime |
| Behavior | Output consistency, persona adherence, out-of-distribution outputs |
| Lifecycle | Deployment mode (canary/stable), version change log |
| Regulatory | Compliance scorecards (GDPR/SOC2/HIPAA/ISO27001), risk register |
| Supply Chain | Model artifact hash verification, supply chain registry |
| Tool Whitelist | Allowed/denied tool list per agent |
| SLO Config | Error budget tracking, burn rate |
| Incidents | 8 trigger types, dedup logic, MTTD/MTTC/MTTR KPIs |
| Anomaly Detection | Z-score anomaly detection vs computed baselines |
| Compliance Scorecards | Automated scoring across regulatory frameworks |
| Risk Register | Risk items with owner, severity, mitigation, review date |
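As an illustration of the Anomaly Detection row, a z-score check compares a new observation against the mean and standard deviation of a baseline window. This is a generic sketch of the technique, not the logic in `anomaly_detector.py`; the function names and the threshold of 3.0 are assumptions.

```python
import math
import statistics


def z_score(value: float, baseline: list[float]) -> float:
    """Standard score of value relative to the baseline samples.

    Needs at least two baseline samples (statistics.stdev raises otherwise).
    """
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A flat baseline: any deviation at all is infinitely unusual.
        return 0.0 if value == mean else math.copysign(math.inf, value - mean)
    return (value - mean) / stdev


def is_anomalous(value: float, baseline: list[float], z_threshold: float = 3.0) -> bool:
    """Flag values more than z_threshold standard deviations from the mean."""
    return abs(z_score(value, baseline)) > z_threshold


latency_baseline_ms = [100, 102, 98, 101, 99]
print(is_anomalous(101, latency_baseline_ms))
print(is_anomalous(120, latency_baseline_ms))
```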
When a governed agent calls /gate/check before any action, the gate applies:
- Circuit breaker state check — OPEN → block immediately
- PII escalation — if payload contains PII, escalate risk tier
- Budget check — if token budget exhausted, escalate risk tier
Risk tier outcome:
- `low` → allow
- `medium` → allow (logged)
- `high` → HITL pause (agent polls until approved/rejected)
- `critical` → block
Pre-execution enforcement is enabled/disabled via enforcement.phase2_enabled in governance thresholds.
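The gate logic described above can be sketched as a small pure function. This is a simplified model, not the code in `gate.py`: the names `RiskTier`, `escalate`, and `gate_check`, the outcome strings, and the choice to escalate by exactly one tier per condition are all assumptions.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# Outcome per tier, following the list above (labels are hypothetical).
OUTCOMES = {
    RiskTier.LOW: "allow",
    RiskTier.MEDIUM: "allow_logged",
    RiskTier.HIGH: "hitl_pause",
    RiskTier.CRITICAL: "block",
}


def escalate(tier: RiskTier) -> RiskTier:
    """Raise the tier by one step, capped at CRITICAL (assumed step size)."""
    return RiskTier(min(tier + 1, RiskTier.CRITICAL))


def gate_check(
    base_tier: RiskTier,
    breaker_state: str,
    has_pii: bool,
    budget_exhausted: bool,
) -> str:
    """Apply the three checks in order and return the gate outcome."""
    # 1. An OPEN circuit breaker blocks immediately.
    if breaker_state == "OPEN":
        return "block"
    tier = base_tier
    # 2. PII in the payload escalates the risk tier.
    if has_pii:
        tier = escalate(tier)
    # 3. An exhausted token budget escalates the risk tier.
    if budget_exhausted:
        tier = escalate(tier)
    return OUTCOMES[tier]
```

Under these assumptions a `medium` action carrying PII becomes `high` and waits for human approval, while the same action with an exhausted budget as well becomes `critical` and is blocked.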
Each agent has a circuit breaker with states CLOSED → OPEN → HALF_OPEN → CLOSED:
- Auto-opens when failure threshold is breached (default: 5 failures)
- When OPEN, all gate check calls return a block response
- After `recovery_timeout_minutes`, transitions to HALF_OPEN
- Manual controls: reset (close), half-open (probe), quarantine (permanent open)
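The state machine above can be modelled in memory as follows. This is a sketch under stated assumptions rather than the governance service's implementation: the class and method names are hypothetical, failures are counted cumulatively until reset, the default `recovery_timeout_minutes` of 5 is a placeholder, and the HALF_OPEN behaviour (success closes, failure reopens) follows the usual circuit breaker pattern.

```python
import time


class CircuitBreaker:
    """Minimal CLOSED / OPEN / HALF_OPEN breaker for one agent."""

    def __init__(self, failure_threshold=5, recovery_timeout_minutes=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_minutes * 60
        self._clock = clock  # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None
        self.quarantined = False

    def current_state(self) -> str:
        """Return the state, moving OPEN to HALF_OPEN after the timeout."""
        if (self.state == "OPEN" and not self.quarantined
                and self._clock() - self.opened_at >= self.recovery_timeout_s):
            self.state = "HALF_OPEN"
        return self.state

    def record_failure(self) -> None:
        state = self.current_state()
        if state == "HALF_OPEN":
            self._open()  # the probe failed, so reopen
        elif state == "CLOSED":
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def record_success(self) -> None:
        if self.current_state() == "HALF_OPEN":
            self.reset()  # the probe succeeded, so close

    def reset(self) -> None:
        """Manual control: close the breaker and clear counters."""
        self.state, self.failures = "CLOSED", 0
        self.opened_at, self.quarantined = None, False

    def quarantine(self) -> None:
        """Manual control: open permanently until reset() is called."""
        self._open()
        self.quarantined = True

    def _open(self) -> None:
        self.state = "OPEN"
        self.opened_at = self._clock()
```

Passing a fake `clock` makes the timeout transition testable without sleeping.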
GET /governance/summary
POST /gate/check
GET /enforcement/circuit-breakers
POST /enforcement/circuit-breakers/{role}/reset
GET /enforcement/trust-scores
GET /hitl/queue
PUT /hitl/{request_id} Approve or reject
GET /incidents
GET /compliance/report
The Enforcement page (page 11) is the operational console for real-time agent control.
- Trust scores per agent (composite + component breakdown)
- Burn rates (SLO error budget velocity)
- Rogue assessment scores (frequency, entropy, capability, quarantine flag)
- Circuit breaker states with manual controls
- Pre-execution gate toggle and on-demand enforcement cycle
Post-hoc quality gate — watcher fires every 30 seconds, reads eval scores, compares against per-agent thresholds:
| Decision | Effect |
|---|---|
| `flag` | Log only |
| `hold` | Creates a HITL queue entry; next invocation is gated pending human review |
| `block` | Escalates gate tier to critical; all subsequent actions blocked |
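One way to picture the decision ladder is as three descending score thresholds per agent. The sketch below is an assumption about how such a ladder could be evaluated; the function name and the three threshold parameters are hypothetical and do not come from `quality_gate_checker.py`.

```python
from typing import Optional


def quality_gate_decision(
    score: float,
    flag_below: float,
    hold_below: float,
    block_below: float,
) -> Optional[str]:
    """Map an eval score to flag / hold / block, or None if it passes.

    Thresholds are hypothetical per-agent settings and must satisfy
    block_below <= hold_below <= flag_below.
    """
    if not block_below <= hold_below <= flag_below:
        raise ValueError("thresholds must be ordered block <= hold <= flag")
    # Check the most severe rung first so the lowest scores block.
    if score < block_below:
        return "block"
    if score < hold_below:
        return "hold"
    if score < flag_below:
        return "flag"
    return None
```

With thresholds of 0.8, 0.6, and 0.4, a score of 0.7 is only flagged, 0.5 creates a hold, and 0.3 blocks.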
Pending queue shows agent, action type, risk tier, escalation reasons, full context. Approve/Reject with reviewer notes. Both the Policy Engine tab (page 10) and Enforcement HITL tab (page 11) read/write the same otel.gov_hitl_queue table.
See docs/hitl-and-gate-ui-reference.md for full reference.
The EvalGov Agent (:8003) is the intelligence and interaction layer on top of the governance system.
Chat Agent — Claude Sonnet with 44 real-time tools:
"What's wrong right now?"
"Why did the circuit breaker open for agent searcher?"
"Show me what it cost to run yesterday"
"Approve the pending HITL for request abc123"
"Which agents have trust scores below 0.5?"
MCP — expose all tools to Claude Code and other AI agents:
claude mcp add evalgov --transport sse http://localhost:8003/mcp/sse

Two-panel layout: chat (left) + live system state (right). The right panel shows active HITL requests, open incidents, circuit breaker states, and policy violations, and auto-refreshes every 60s.
See docs/evalgov-agent.md for the complete tool reference and chat agent details.
The primary demo is examples/opt-demo — a 4-agent system with Google ADK + full governance integration.
| Agent | Role | Tool |
|---|---|---|
| Orchestrator | Classifies query intent, routes or responds directly | — |
| Searcher | Searches web, synthesises answer with citations | DuckDuckGo |
| Summarizer | Summarises text in exact word count | — |
| Translator | Translates to any target language | — |
# Terminal 1 — eval platform (repo root: opt-aieval/)
docker compose up -d
# Terminal 2 — demo agent (opt-aieval/examples/opt-demo/)
cd examples/opt-demo
cp .env.example .env # set OPENAI_API_KEY or ANTHROPIC_API_KEY
pip install -r requirements.txt
uvicorn api:app --host 0.0.0.0 --port 8080 --workers 1
# Terminal 3 — chat UI (opt-aieval/examples/opt-demo/)
cd examples/opt-demo
streamlit run chat_ui.py # → http://localhost:8502

Use `--workers 1`. Multiple workers use separate memory spaces and bypass the in-memory eval dedup dict, causing duplicate `prompt_eval` rows.
Every chat session is assigned a conversation_id (UUID) propagated to all agent spans — orchestrator and sub-agents. This enables the conversation browser in Eval UI, multi-turn LLM judges, and conversation-level filtering in Prompt Analysis.
| Event | Span | Where |
|---|---|---|
| Root task | `agent.task` (orchestrator) | `runner.py run_agent()` |
| Sub-agent call | `agent.task` (searcher / summarizer / translator) | `runner.py _call_agent()` |
| `conversation.id` | Set on all `agent.task` spans | root + all child spans |
| web_search | `agent.tool_call` | `_maybe_emit_tool_span()` |
| LLM calls | `openai.chat`, `call_llm` | Google ADK + LiteLLM auto-instrumentation |
# Run unit suite
python benchmarks/runner.py --suite unit --name "baseline-v1" --agent-version "v1.0"
# Run all suites
python benchmarks/runner.py --suite all --name "pre-deploy-v2"
# Compare against baseline
python benchmarks/runner.py --suite unit --name "post-change-v2" --compare-baseline

| Suite | Cases | Focus |
|---|---|---|
| `unit` | 5 | Single-turn: factual QA, format, instruction following |
| `integration` | 4 | Multi-step: research, tool use, reasoning chains |
| `collaboration` | 3 | Multi-agent: orchestration, parallel agents |
Edit benchmarks/suites/{suite}/cases.json or use the Eval UI (Benchmarks tab):
{
  "suite": "unit",
  "name": "My test case",
  "task_input": "The prompt to send to the agent",
  "expected_output": "Optional ground truth for exact-match eval",
  "rubric": {"criteria": ["criterion 1", "criterion 2"]},
  "difficulty": "medium"
}

Full instructions are in instrumentation/README.md. The short version:
pip install -e "./instrumentation[all]" # or pick your framework extra

from instrumentation import init_telemetry
init_telemetry(service_name="my-agent")

# LangChain
from instrumentation import LangChainEvalAdapter
adapter = LangChainEvalAdapter(agent_id="lc-1", agent_role="researcher")
agent.invoke(input, config={"callbacks": [adapter]})
# Or manually (any framework)
from instrumentation import agent_task
with agent_task(agent_id="a1", role="researcher", task_input=query) as task:
    result = my_agent.run(query)
    task.set_output(result)

Open http://localhost:16686, select your service.name, confirm agent.task spans appear with correct parent-child nesting.

from instrumentation import GovernedToolkit
toolkit = GovernedToolkit(agent_role="searcher")
governed_search = toolkit.wrap("web_search", web_search)

See docs/gov-instrument-guide.md for Phase 2 gate patterns and the full checklist.
| Metric | What it measures |
|---|---|
| `faithfulness` | Output grounded in source context, with no hallucinations |
| `relevance` | Output addresses the actual question |
| `instruction_following` | Output follows explicit prompt constraints |
| `hallucination_score` | Fabricated facts not in tool outputs or context |
| `safety_score` | 6-dimension safety: toxic, harmful, PII, bias, unsafe advice, role violation |
| `qa_correctness` | Answer correctness vs expected output (benchmark mode) |
| `custom_rubric` | Configurable rubric per benchmark case |
| `handoff_fidelity` | Context preserved across agent handoffs |
| Metric | What it measures |
|---|---|
| `context_retention` | Each turn correctly references prior conversation context |
| `topic_adherence` | Conversation stays on-topic across turns |
| `response_consistency` | No contradictions between turns |
| `conversation_relevancy` | Each response relevant to the original user intent |
| Metric | Source | Unit |
|---|---|---|
| `task_success` | Root `agent.task` StatusCode | 0/1 |
| `agent_failure` | Root `agent.task` StatusCode | 0/1 |
| `task_latency_ms` | Root `agent.task` Duration | ms |
| `cost_usd` | `gen_ai.usage.*` tokens × governance rates | USD |
| `token_count` | All LLM spans `gen_ai.usage.*` | tokens |
| `context_window_utilization` | Max `gen_ai.usage.input_tokens` / 128k | 0–1 |
| `tool_call_success_rate` | `agent.tool_call` StatusCode | 0–1 |
| `tool_calls_per_task` | Count of `agent.tool_call` spans | count |
| `handoff_success_rate` | `agent.handoff` StatusCode | 0–1 |
| `dead_span_rate` | Orphaned spans / total spans | 0–1 |
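Several of the OTel-computed metrics in the table are simple aggregations over the spans of one trace. The sketch below shows how a few of them could be derived; it is not the code in `eval_pipeline.py`. The flat span dictionaries (`span_id`, `parent_id`, `name`, `status`, `input_tokens`, `output_tokens`) are a hypothetical simplification, and treating a span as orphaned when its parent ID is absent from the trace is an assumption.

```python
CONTEXT_WINDOW_TOKENS = 128_000  # the 128k denominator from the table


def compute_trace_metrics(spans: list[dict]) -> dict:
    """Derive a few per-trace metrics from simplified span records."""
    span_ids = {s["span_id"] for s in spans}
    tool_calls = [s for s in spans if s["name"] == "agent.tool_call"]
    ok_tool_calls = [s for s in tool_calls if s["status"] == "OK"]
    input_tokens = [s.get("input_tokens", 0) for s in spans]
    output_tokens = [s.get("output_tokens", 0) for s in spans]
    # Orphaned: the span names a parent that is not part of this trace.
    orphaned = [
        s for s in spans
        if s.get("parent_id") is not None and s["parent_id"] not in span_ids
    ]
    return {
        "token_count": sum(input_tokens) + sum(output_tokens),
        "context_window_utilization": max(input_tokens, default=0) / CONTEXT_WINDOW_TOKENS,
        "tool_calls_per_task": len(tool_calls),
        # None when the trace made no tool calls, to avoid dividing by zero.
        "tool_call_success_rate": len(ok_tool_calls) / len(tool_calls) if tool_calls else None,
        "dead_span_rate": len(orphaned) / len(spans) if spans else 0.0,
    }
```

In the platform itself these values are computed from `otel.otel_traces` rows and written to `otel.eval_scores` with `eval_type` set to `otel_computed`.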
| Score | Meaning |
|---|---|
| 0.8 – 1.0 | Good — meets expectations |
| 0.5 – 0.8 | Marginal — review recommended |
| 0.0 – 0.5 | Poor — action needed |
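The score bands translate directly into a lookup helper. The function name `interpret_score` is hypothetical, and because the table's ranges share their endpoints, assigning a boundary value such as 0.8 to the higher band is an assumption.

```python
def interpret_score(score: float) -> str:
    """Map a 0 to 1 evaluator score to the band from the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    if score >= 0.8:
        return "good"      # meets expectations
    if score >= 0.5:
        return "marginal"  # review recommended
    return "poor"          # action needed
```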
68 metrics across 9 categories. Metrics marked "Needs instrumentation" require additional span attributes not yet emitted.
| Category | Total | Live |
|---|---|---|
| 1 · Task completion & accuracy | 10 | 8 |
| 2 · Efficiency & cost | 7 | 6 |
| 3 · Multi-agent coordination | 8 | 5 |
| 4 · Reasoning & decision quality | 7 | 2 |
| 5 · Tool & MCP integration | 10 | 6 |
| 6 · Reliability & safety | 8 | 6 |
| 7 · Memory & context management | 6 | 1 |
| 8 · Observability & tracing | 7 | 5 |
| 9 · Conversation quality | 4 | 4 |
| Total | 68 | 43 |
See docs/eval-metrics-dashboard-guide.md for filter behaviour, known limitations, and the span-level filter fix guide.
| Variable | Default | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | (none) | Required: LLM judges and EvalGov Agent |
| `OPENAI_API_KEY` | (none) | Required for demo agents using OpenAI |
| `JUDGE_MODEL` | `anthropic/claude-haiku-4-5-20251001` | Model used by LLM judges |
| `AGENT_MODEL` | `anthropic/claude-sonnet-4-6` | Model used by EvalGov chat agent |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTel Collector endpoint |
| `CLICKHOUSE_HOST` | `localhost` | ClickHouse host |
| `CLICKHOUSE_PORT` | `9000` | ClickHouse native port |
| `GOVERNANCE_SERVICE_URL` | `http://localhost:8002` | Governance service URL |
| `EVALGOV_AGENT_URL` | `http://localhost:8003` | EvalGov agent URL |
| `HITL_TIMEOUT_MINUTES` | `15` | Minutes before HITL timeout triggers a finding |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama endpoint (if using local models) |
| File | Purpose |
|---|---|
| `infra/otelcol/config.yaml` | OTel Collector pipelines, exporters, normalisation rules |
| `eval_runner/config/evaluator_config.yaml` | Which evaluators run, LLM sampling rate, regression threshold |
| `infra/clickhouse/init.sql` | Database schema: all tables and materialized views |
# Platform lifecycle
docker compose up -d # Start all 7 services
docker compose down # Stop all services
docker compose logs -f # Tail all service logs
docker compose ps # Show service health
# Rebuild after code changes (Python services only)
docker compose build eval-runner && docker compose up -d eval-runner
docker compose build evalgov-agent && docker compose up -d evalgov-agent
# ClickHouse SQL console
docker exec -it aieval-clickhouse clickhouse-client
# Benchmark runner
python benchmarks/runner.py --suite unit --name "my-run"
python benchmarks/runner.py --suite all --compare-baseline
# Connect Claude Code to EvalGov tools via MCP
claude mcp add evalgov --transport sse http://localhost:8003/mcp/sse