AI Multi-Agent Evaluation, Observability & Governance Platform

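A self-hosted platform for evaluating, observing, and governing any multi-agent AI system in real time and offline. The platform itself has no external SaaS dependencies: every service runs locally via Docker. The LLM judges and the EvalGov Agent still call an LLM API, which requires an Anthropic API key.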

Built on: Docker · ClickHouse · Jaeger · OpenTelemetry · FastAPI · Streamlit · Claude

What This Platform Does
Architecture Overview
Quick Start
Project Structure
Infrastructure Stack
Evaluation System
Observability System
Instrumentation SDK
Eval UI
AI Governance System
Governance Enforcement
EvalGov Intelligence Agent
Multi-Agent Demo
Running Offline Benchmarks
Onboarding a New Agent System
Evaluation Metrics Reference
Configuration Reference
Common Commands

1. What This Platform Does

Observability

Captures every trace, log, and metric from any agent system via OpenTelemetry
LLM spans follow OTel GenAI semantic conventions (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model)
Stores everything in ClickHouse for fast analytical queries
Jaeger for interactive distributed trace exploration

Evaluation

Two modes: real-time (production monitoring) and offline (benchmark test harness)
Three-tier evaluator cascade: deterministic → LLM-as-judge → multi-turn judge
LLM judges: faithfulness, relevance, instruction following, handoff fidelity, hallucination, safety, QA correctness, custom rubric
Multi-turn judges: context retention, topic adherence, response consistency, conversation relevancy
OTel-computed per-trace metrics: task success, latency, cost, token count, tool success rate, handoff success rate, dead span rate — written to eval_scores on every trace
Threshold alerting: 38 of 68 metrics configurable; violations feed shows prompt, trace, agent, and judge reasoning
Regression detection: every run compared against a pinned baseline

AI Governance

13 detection categories: Safety, Identity, Reliability, Behavior, Lifecycle, Regulatory, Supply Chain, Tool Whitelist, SLO, Incidents, Anomaly Detection, Compliance Scorecards, Risk Register
Pre-execution enforcement: gate check API classifies every agent action by risk tier; high-risk actions paused for human approval (HITL)
Circuit breaker: CLOSED/OPEN/HALF_OPEN state machine per agent
Trust scoring: composite score per agent from identity, behavior, compliance, and network signals
Content quality gates: post-hoc watcher fires every 30s; flag → hold → block decision ladder

EvalGov Intelligence Agent

Conversational interface: Claude Sonnet agent with 44 real-time tools that answers governance, eval, cost, and trace questions in plain English
On-demand RCA: ask the agent why a circuit breaker opened, what caused a trust score drop, or what a safety event means — it correlates signals across governance, traces, and eval scores to give a specific root cause and recommended action
Live system state: UI panel auto-refreshes every 60s showing active HITL requests, open incidents, circuit breaker states, and policy violations
MCP server: expose all 44 tools to Claude Code and other AI agents

Framework Support

Framework	Support
LangChain / LangGraph	✅ Adapter included
AutoGen	✅ Adapter included
CrewAI	✅ Adapter included
Anthropic Claude SDK	✅ Adapter included
OpenAI Agents SDK	✅ Adapter included
OpenAI SDK (direct)	✅ Auto-instrumented via OpenLLMetry
Google ADK	✅ Works via manual span wrappers (see demo)
Custom Python agents	✅ Manual span context managers

2. Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│  Your Agent System (any SDK)                                    │
│  pip install -e ./instrumentation[langchain]                    │
│  init_telemetry() + framework adapter or manual spans           │
│  GovernedToolkit — gate check + HITL polling on every call      │
└────────────────────────────┬────────────────────────────────────┘
                             │ OTLP gRPC :4317 / HTTP :4318
                             ▼
              ┌──────────────────────────────┐
              │   OTel Collector  :4317/4318  │
              │   Normalises + fans out       │
              └──────┬──────────┬────────────┘
                     │          │
              ┌──────▼──┐  ┌────▼────┐
              │ClickHouse│  │ Jaeger  │
              │:8123/9000│  │:16686   │
              │traces    │  │trace UI │
              │eval_scores│ └─────────┘
              │gov_*     │
              └──────┬───┘
                     │
              ┌──────▼───────────┐    ┌──────────────────────┐
              │  Eval Runner     │    │  Governance Service  │
              │  :8000           │    │  :8002               │
              │  Tiered evals    │    │  Gate check API      │
              │  LLM judges      │    │  Circuit breakers    │
              │  Regression      │    │  Trust scores        │
              └──────────────────┘    │  13 detection cats   │
                                      └──────────┬───────────┘
              ┌──────────────────────────────────┐│
              │  EvalGov Agent  :8003            ││
              │  Chat agent · MCP server         ◄┘
              │  Live system state panel         │
              └──────────────────────────────────┘

              ┌──────────────────────────────────┐
              │  Eval UI  :8501 Streamlit        │
              │  Eval · Traces · Benchmarks      │
              │  AI Governance (13 categories)   │
              │  Governance Enforcement          │
              │  EvalGov Agent (chat + system state) │
              └──────────────────────────────────┘

Data Flow

Agent runs → OTel spans → Collector → ClickHouse (stored)
                                    → Jaeger (trace UI)
                                    → Eval Runner → scores → ClickHouse

           → gate check → Governance Service → HITL queue → agent waits
                                             → CB update
                                             → trust score update

ClickHouse + Governance Service → EvalGov Agent → live system state panel

3. Quick Start

Prerequisites

Docker Desktop with Compose v2
Python 3.10+
Anthropic API key (for LLM judge evaluators and EvalGov Agent)

1. Clone and configure

git clone <your-repo>
cd opt-aieval
cp .env.example .env
# Set ANTHROPIC_API_KEY in .env

2. Start the platform

docker compose up -d
# Starts 7 services: ClickHouse, Jaeger, OTel Collector,
#                    Eval Runner, Governance Service, EvalGov Agent, Eval UI

3. Open the UIs

UI	URL
Eval UI (main dashboard)	http://localhost:8501
Jaeger (trace viewer)	http://localhost:16686
Governance API docs	http://localhost:8002/docs
EvalGov Agent API docs	http://localhost:8003/docs
ClickHouse console	http://localhost:8123/play

4. Run the demo (optional)

# Terminal 1 — eval platform is already running from step 2 above (repo root)

# Terminal 2 — demo agent (from the demo directory)
cd examples/opt-demo
cp .env.example .env   # set OPENAI_API_KEY or ANTHROPIC_API_KEY
pip install -r requirements.txt
uvicorn api:app --host 0.0.0.0 --port 8080 --workers 1

# Terminal 3 — chat UI (same demo directory)
cd examples/opt-demo
streamlit run chat_ui.py   # → http://localhost:8502

4. Project Structure

opt-aieval/
├── README.md                        This file — platform overview
├── docker-compose.yml               7-service stack
├── .env.example                     Environment variable template
│
├── docs/                            Reference documentation
│   ├── instrumentation-guide.md     Full span schema, attribute reference, evaluator tables
│   ├── gov-instrument-guide.md      Phase 2 gate decision flow, HITL patterns, CB guide
│   ├── eval-metrics-dashboard-guide.md  Filter behaviour + Option B span-level fix guide
│   ├── evalgov-agent.md             EvalGov Agent — tool reference, chat agent, MCP server
│   ├── governance-enforcement-architecture.md  Three-layer enforcement design
│   └── hitl-and-gate-ui-reference.md  HITL and gate check UI reference
│
├── infra/
│   ├── clickhouse/
│   │   ├── init.sql                 Database schema (eval + governance tables)
│   │   └── users.xml                ClickHouse user config
│   └── otelcol/
│       └── config.yaml              OTel Collector: receivers, processors, exporters
│
├── instrumentation/                 Instrumentation SDK — install and use this
│   ├── README.md                    ← Start here to instrument your agent system
│   ├── pyproject.toml               pip-installable package config
│   ├── telemetry.py                 OTel init — init_telemetry(), get_tracer(), get_run_id()
│   ├── spans.py                     Canonical span context managers
│   ├── governed_toolkit.py          GovernedToolkit — pre-execution gate check wrapper
│   ├── gate_client.py               Low-level gate check HTTP client
│   ├── normalizer.py                SDK attribute → canonical schema mapping
│   └── adapters/
│       ├── langchain_adapter.py     LangChainEvalAdapter
│       ├── autogen_adapter.py       AutoGenEvalAdapter
│       ├── crewai_adapter.py        CrewAIEvalAdapter
│       ├── claude_adapter.py        ClaudeEvalAdapter
│       └── openai_agents_adapter.py OpenAIAgentsEvalAdapter
│
├── eval_runner/                     Evaluation service (FastAPI :8000)
│   ├── main.py                      REST endpoints + OTLP ingestion + eval dedup
│   ├── config/evaluator_config.yaml Evaluator routing config + regression thresholds
│   ├── evaluators/
│   │   ├── deterministic/           format_check, tool_accuracy, step_efficiency, etc.
│   │   └── llm_judges/              faithfulness, relevance, hallucination, safety, etc.
│   ├── pipeline/
│   │   ├── eval_pipeline.py         Eval orchestration + OTel-computed metrics
│   │   └── conversation_eval.py     Multi-turn judge runner
│   └── db/                          ClickHouse client + repository
│
├── eval_ui/                         Streamlit evaluation UI (:8501)
│   ├── app.py                       Home page with live metrics
│   └── pages/
│       ├── 1_Eval_Testing.py        Run management + baseline
│       ├── 2_Eval_Measurements.py   Traces, Scores, Regression, Benchmarks, Conversations
│       ├── 10_AI_Governance.py      13 governance categories, policy engine, compliance
│       ├── 11_Governance_Enforcement.py  Circuit breakers, HITL approvals, quality gates
│       └── 14_EvalGov_Agent.py      Chat UI + live system state panel
│
├── governance_service/              Governance FastAPI service (:8002)
│   ├── main.py                      90+ REST endpoints across all governance domains
│   ├── db.py                        ClickHouse client + all governance CRUD
│   ├── gate.py                      Pre-execution gate check (risk tier logic)
│   ├── enforcement_engine.py        Trust score + rogue assessment computation
│   ├── quality_gate_checker.py      Post-hoc quality watcher (30s loop)
│   ├── incident_manager.py          Incident creation, dedup, lifecycle
│   └── anomaly_detector.py          Statistical anomaly detection vs baselines
│
├── evalgov_agent/                   EvalGov Intelligence Agent (:8003)
│   ├── agent.py                     Claude chat agent with 44-tool loop
│   ├── tools.py                     Tool implementations + Anthropic schemas
│   ├── mcp_server.py                MCP SSE server exposing all 44 tools
│   └── db.py                        ClickHouse client for analytics queries
│
├── benchmarks/                      Offline benchmark test harness
│   ├── runner.py                    CLI: run suites, compare baseline
│   └── suites/
│       ├── unit/cases.json
│       ├── integration/cases.json
│       └── collaboration/cases.json
│
└── examples/
    ├── opt-demo/                    Primary demo — 4-agent system (orchestrator, searcher,
    │                                summarizer, translator) with full governance integration
    ├── langchain_agent_example.py   LangChain agent with LangChainEvalAdapter
    ├── claude_agent_example.py      Claude SDK agent with ClaudeEvalAdapter
    └── custom_agent_example.py      Custom Python agent — all span types demonstrated

5. Infrastructure Stack

ClickHouse — Primary Data Store

All observability signals and evaluation scores land here.

OTel tables (auto-created by collector):

otel.otel_traces — all spans
otel.otel_logs — structured logs
otel.otel_metrics_* — metrics

Eval tables (created by infra/clickhouse/init.sql):

otel.eval_runs — groups of traces (one per benchmark run or tagged production period)
otel.eval_scores — evaluator scores linked to trace_id; eval_type ∈ deterministic / llm_judge / multiturn_judge / otel_computed
otel.benchmarks — offline test cases with dataset_version for stable regression comparisons
otel.prompt_evals — per-agent prompt/response pairs with token counts and score maps; deduplicated by (trace_id, span_id)
otel.alert_thresholds — one row per metric; operator + threshold value + enabled flag
otel.human_reviews — human feedback on eval scores

Direct query access:

-- Recent eval scores
SELECT trace_id, metric, score, reasoning
FROM otel.eval_scores
ORDER BY evaluated_at DESC LIMIT 20;

-- Average score per metric for a run
SELECT metric, avg(score)
FROM otel.eval_scores
WHERE run_id = 'your-run-id'
GROUP BY metric;
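
The same queries can be issued from Python over ClickHouse's HTTP interface on port 8123. The following is a minimal sketch using only the standard library. The helper names `build_query_request` and `run_query` are hypothetical, and the sketch assumes the default user can query without credentials, which may not match infra/clickhouse/users.xml.

```python
import json
import urllib.parse
import urllib.request

CLICKHOUSE_URL = "http://localhost:8123/"  # HTTP port listed in this README

# {run_id:String} is a ClickHouse query parameter, bound via param_run_id.
AVG_SCORE_SQL = (
    "SELECT metric, avg(score) AS avg_score "
    "FROM otel.eval_scores "
    "WHERE run_id = {run_id:String} "
    "GROUP BY metric "
    "FORMAT JSONEachRow"
)


def build_query_request(sql, params=None, base_url=CLICKHOUSE_URL):
    """Build a POST request; parameter values travel as param_<name> in the URL."""
    query_string = urllib.parse.urlencode(
        {f"param_{name}": value for name, value in (params or {}).items()}
    )
    url = f"{base_url}?{query_string}" if query_string else base_url
    return urllib.request.Request(url, data=sql.encode("utf-8"), method="POST")


def run_query(sql, params=None):
    """Send the query and parse one JSON object per returned line."""
    request = build_query_request(sql, params)
    with urllib.request.urlopen(request, timeout=10) as response:
        lines = response.read().decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line]
```

With the stack running, `run_query(AVG_SCORE_SQL, {"run_id": "your-run-id"})` would return one dict per metric. Binding the run ID as a parameter avoids building SQL by string concatenation.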

OTel Collector

Receives OTLP from agents, normalises SDK-specific attributes, fans out to backends.

:4317 — OTLP gRPC
:4318 — OTLP HTTP
:8888 — Collector self-metrics
:13133 — Health check

Normalisation maps LangChain / AutoGen / CrewAI attribute names to the canonical schema before storage. Config: infra/otelcol/config.yaml.

Jaeger

Interactive distributed trace explorer at http://localhost:16686. Every agent execution shows as a full span tree with parent-child nesting.

6. Evaluation System

Two Modes

Offline (test harness) — run before deploying changes:

python benchmarks/runner.py --suite unit --name "my-run-v2"
python benchmarks/runner.py --suite unit --name "my-run-v2" --compare-baseline

Online (real-time) — monitors production traffic automatically. Every agent.task span completion triggers the eval pipeline. LLM judges run on a 15% sample to control cost.

Evaluator Tiers

Tier	Evaluators	Cost	When it runs
Deterministic	format_compliance, tool_accuracy, step_efficiency, tool_output_handling, tool_error_rate, timeout_rate	Free	Always, 100%
LLM Judge	faithfulness, relevance, instruction_following, handoff_fidelity, hallucination_score, safety_score, qa_correctness, custom_rubric	~$0.001/trace	Offline: 100% · Online: 15% sample
Multi-turn	context_retention, topic_adherence, response_consistency, conversation_relevancy	~$0.001/conversation	Per conversation_id after idle 60s
OTel-computed	task_success, agent_failure, timeout, task_latency_ms, cost_usd, token_count, context_window_utilization, tool_call_success_rate, handoff_success_rate, dead_span_rate	Free	Always, 100%

Eval Runner API

GET  /health
GET  /runs                         List all eval runs
POST /runs                         Create a new run
PUT  /runs/{run_id}/baseline       Set run as baseline
POST /runs/{run_id}/evaluate       Trigger offline eval
GET  /runs/{run_id}/scores         Aggregated scores
GET  /runs/{run_id}/regression     Score deltas vs baseline
GET  /traces/{trace_id}/scores     All scores for a trace

Configuring Evaluators

Edit eval_runner/config/evaluator_config.yaml:

sampling:
  online_llm_judge_rate: 0.15     # 15% of traces get LLM judges
  always_judge_on_failure: true   # always judge failed tasks

regression:
  alert_threshold: 0.05           # flag if metric drops >5%
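
One way to read these settings is as the decision logic below. This is an illustrative sketch, not the Eval Runner's code: the function names are hypothetical, and it assumes the regression threshold is an absolute drop on a 0 to 1 score scale (the config comment could also mean a relative drop).

```python
import random


def should_run_llm_judges(task_failed, rate=0.15,
                          always_judge_on_failure=True, rng=random.random):
    """Decide whether an online trace gets the LLM-judge tier."""
    if task_failed and always_judge_on_failure:
        return True  # failed tasks bypass sampling
    return rng() < rate  # otherwise sample roughly `rate` of traces


def find_regressions(baseline, current, alert_threshold=0.05):
    """Return metrics whose score fell by more than alert_threshold.

    Assumption: the drop is an absolute difference between two 0-1 scores.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        if metric not in current:
            continue  # metric not evaluated in the current run
        drop = base_score - current[metric]
        if drop > alert_threshold:
            regressions[metric] = round(drop, 4)
    return regressions
```

Passing `rng` as a parameter keeps the sampling decision testable with a fixed value.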

7. Observability System

What Gets Captured Automatically

Once init_telemetry() is called, OpenLLMetry auto-instrumentation captures every LLM call:

Model name — gen_ai.request.model
Token counts — gen_ai.usage.input_tokens / gen_ai.usage.output_tokens
Latency — span Duration
Finish reason — gen_ai.response.finish_reasons

Canonical Span Schema

agent.task:
  agent.id, agent.role, task.id, task.input, task.output,
  task.status, run.id,
  trace.source (production|benchmark|exploratory),
  conversation.id

agent.tool_call:
  agent.id, tool.name, tool.input (JSON), tool.output (JSON), tool.success, run.id

agent.handoff:
  from.agent_id, to.agent_id, handoff.reason, handoff.context (JSON)

agent.memory:
  agent.id, memory.op (read|write), memory.key, memory.scope

agent.decision:
  agent.id, decision.input, decision.options, decision.chosen, decision.reason
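
When wiring up manual spans, a small checker can catch missing attributes before traces reach the collector. The sketch below encodes the schema above as plain data. `check_span` is a hypothetical helper, not part of the SDK, and it assumes every listed attribute is expected on every span, which may be stricter than the platform requires.

```python
CANONICAL_ATTRIBUTES = {
    "agent.task": {"agent.id", "agent.role", "task.id", "task.input", "task.output",
                   "task.status", "run.id", "trace.source", "conversation.id"},
    "agent.tool_call": {"agent.id", "tool.name", "tool.input", "tool.output",
                        "tool.success", "run.id"},
    "agent.handoff": {"from.agent_id", "to.agent_id", "handoff.reason",
                      "handoff.context"},
    "agent.memory": {"agent.id", "memory.op", "memory.key", "memory.scope"},
    "agent.decision": {"agent.id", "decision.input", "decision.options",
                       "decision.chosen", "decision.reason"},
}

# Enumerated values taken from the schema listing.
ALLOWED_VALUES = {
    "trace.source": {"production", "benchmark", "exploratory"},
    "memory.op": {"read", "write"},
}


def check_span(span_name, attributes):
    """Return a list of human-readable problems; an empty list means the span passes."""
    expected = CANONICAL_ATTRIBUTES.get(span_name)
    if expected is None:
        return [f"unknown span name: {span_name}"]
    problems = [f"missing attribute: {name}"
                for name in sorted(expected - set(attributes))]
    for name, allowed in ALLOWED_VALUES.items():
        if name in attributes and attributes[name] not in allowed:
            problems.append(f"invalid value for {name}: {attributes[name]!r}")
    return problems
```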

Trace → Eval Pipeline

When a span with SpanName = 'agent.task' arrives:

1. OTel Collector forwards span to Eval Runner (/v1/traces)
2. Eval Runner assembles the full trace tree from ClickHouse
3. Evaluator pipeline fires (deterministic → LLM judges per sampling config)
4. Scores written to otel.eval_scores

8. Instrumentation SDK

The instrumentation/ directory is a pip-installable SDK. It provides init_telemetry(), framework adapters for all supported SDKs, canonical span context managers, and GovernedToolkit for pre-execution governance enforcement.

To instrument your agent system, start here:

instrumentation/README.md

That document covers installation, framework adapters, span context managers, governance enforcement integration, and how to expose a /chat endpoint for benchmark execution.

The detailed reference material lives in:

docs/instrumentation-guide.md — full span schema, attribute reference, evaluator tables, multi-agent context patterns
docs/gov-instrument-guide.md — gate decision flow, HITL patterns, circuit breaker integration

9. Eval UI

The Streamlit UI at http://localhost:8501 provides the main evaluation and governance interface.

Pages

Page	Purpose
Home	Live metrics: total runs, traces evaluated, avg scores
1 · Eval Testing	Create runs, set baseline, trigger offline eval
2 · Eval Measurements	Traces, Scores, Regression, Benchmarks, Conversations, Eval Metrics (68 metrics / 9 categories), Prompt Analysis
10 · AI Governance	13 governance categories, policy engine, anomaly events, compliance reports
11 · Governance Enforcement	Circuit breakers, trust scores, HITL approvals, quality gates, on-demand enforcement cycle
14 · EvalGov Agent	Chat UI + live system state panel

Key Features

Estimated cost computed from gen_ai.usage.* tokens × configurable rates from gov_threshold_config
Source filter in Eval Measurements — narrows to production, benchmark, or exploratory traces
Conversation filter — filter by partial or full conversation.id across Traces and Prompt Analysis
Threshold alerting on 38 of 68 metrics; violations feed shows full prompt, trace, agent, and judge reasoning
Regression comparison — select two runs for side-by-side metric deltas and radar chart

For filter behaviour details and known limitations, see docs/eval-metrics-dashboard-guide.md.

10. AI Governance System

The governance service runs as a separate FastAPI service (:8002) and wraps every agent action with a policy and enforcement layer.

13 Detection Categories

Category	What it tracks
Safety	Prompt injection, jailbreak attempts, toxicity, bias
Identity	Session token TTL violations, over-provisioned tool access
Reliability	Error rates, SLO compliance, uptime
Behavior	Output consistency, persona adherence, out-of-distribution outputs
Lifecycle	Deployment mode (canary/stable), version change log
Regulatory	Compliance scorecards (GDPR/SOC2/HIPAA/ISO27001), risk register
Supply Chain	Model artifact hash verification, supply chain registry
Tool Whitelist	Allowed/denied tool list per agent
SLO Config	Error budget tracking, burn rate
Incidents	8 trigger types, dedup logic, MTTD/MTTC/MTTR KPIs
Anomaly Detection	Z-score anomaly detection vs computed baselines
Compliance Scorecards	Automated scoring across regulatory frameworks
Risk Register	Risk items with owner, severity, mitigation, review date
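
As an illustration of the Z-score approach named in the Anomaly Detection row, the sketch below scores a new observation against a list of baseline samples. The function names and the threshold of 3.0 are assumptions for illustration; the actual logic and thresholds live in governance_service/anomaly_detector.py and may differ.

```python
import math
import statistics


def zscore(value, baseline):
    """Standard score of `value` against baseline samples (needs at least 2 samples)."""
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)  # sample standard deviation
    if stdev == 0:
        # A flat baseline: any deviation is infinitely surprising.
        return 0.0 if value == mean else math.copysign(math.inf, value - mean)
    return (value - mean) / stdev


def is_anomalous(value, baseline, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    return abs(zscore(value, baseline)) > threshold
```

For example, against a latency baseline of `[100, 102, 98, 101, 99]`, a reading of 110 sits more than six standard deviations out and would be flagged, while 101 would not.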

Pre-Execution Enforcement (Gate Check)

When a governed agent calls /gate/check before any action, the gate applies:

Circuit breaker state check — OPEN → block immediately
PII escalation — if payload contains PII, escalate risk tier
Budget check — if token budget exhausted, escalate risk tier

Risk tier outcome:

low → allow
medium → allow (logged)
high → HITL pause (agent polls until approved/rejected)
critical → block

Pre-execution enforcement is enabled/disabled via enforcement.phase2_enabled in governance thresholds.
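
The gate logic described above can be sketched as a pure function. This is a simplified reading, not the code in governance_service/gate.py: the function and field names are hypothetical, and it assumes each escalation raises the risk tier by exactly one step.

```python
TIERS = ["low", "medium", "high", "critical"]
OUTCOMES = {
    "low": "allow",
    "medium": "allow_logged",
    "high": "hitl_pause",
    "critical": "block",
}


def escalate(tier):
    """Move one step up the tier ladder, capping at critical (assumed step size)."""
    return TIERS[min(TIERS.index(tier) + 1, len(TIERS) - 1)]


def gate_check(base_tier, breaker_state, contains_pii=False, budget_exhausted=False):
    """Return the gate decision for one proposed agent action."""
    if breaker_state == "OPEN":
        # An open circuit breaker blocks before any tier logic runs.
        return {"decision": "block", "tier": base_tier,
                "reasons": ["circuit_breaker_open"]}
    tier, reasons = base_tier, []
    if contains_pii:
        tier = escalate(tier)
        reasons.append("pii_detected")
    if budget_exhausted:
        tier = escalate(tier)
        reasons.append("token_budget_exhausted")
    return {"decision": OUTCOMES[tier], "tier": tier, "reasons": reasons}
```

Under these assumptions, a medium-risk action carrying PII lands in the HITL queue, and the same action with an exhausted token budget as well is blocked outright.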

Circuit Breaker

Each agent has a circuit breaker with states CLOSED → OPEN → HALF_OPEN → CLOSED:

Auto-opens when failure threshold is breached (default: 5 failures)
When OPEN, all gate check calls return a block response
After recovery_timeout_minutes, transitions to HALF_OPEN
Manual controls: reset (close), half-open (probe), quarantine (permanent open)
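
The state machine can be sketched in a few lines. The class below is illustrative only. It assumes the common circuit-breaker convention that a success in HALF_OPEN closes the breaker and a failure reopens it, and the 10-minute default for recovery_timeout_minutes is a placeholder, since this README does not state the default.

```python
import time


class CircuitBreaker:
    """Per-agent breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold=5, recovery_timeout_minutes=10,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_minutes * 60
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None
        self.quarantined = False

    def current_state(self):
        """Return the state, moving OPEN to HALF_OPEN once the timeout has passed."""
        if (self.state == "OPEN" and not self.quarantined
                and self.clock() - self.opened_at >= self.recovery_timeout_s):
            self.state = "HALF_OPEN"
        return self.state

    def record_failure(self):
        state = self.current_state()
        if state == "OPEN":
            return  # calls are already blocked
        self.failures += 1
        if state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self._open()

    def record_success(self):
        if self.current_state() == "HALF_OPEN":
            self.reset()  # probe succeeded (assumed behaviour)

    def reset(self):
        """Manual control: close the breaker and clear counters."""
        self.state, self.failures, self.quarantined = "CLOSED", 0, False

    def quarantine(self):
        """Manual control: stay OPEN until explicitly reset."""
        self.quarantined = True
        self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
```

Injecting `clock` makes the timeout transition testable without sleeping.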

Governance Service API

GET  /governance/summary
POST /gate/check
GET  /enforcement/circuit-breakers
POST /enforcement/circuit-breakers/{role}/reset
GET  /enforcement/trust-scores
GET  /hitl/queue
PUT  /hitl/{request_id}           Approve or reject
GET  /incidents
GET  /compliance/report

11. Governance Enforcement

The Enforcement page (page 11) is the operational console for real-time agent control.

Tab 1 — Agent Pre-Execution Enforcement

Trust scores per agent (composite + component breakdown)
Burn rates (SLO error budget velocity)
Rogue assessment scores (frequency, entropy, capability, quarantine flag)
Circuit breaker states with manual controls
Pre-execution gate toggle and on-demand enforcement cycle
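
To make two of these numbers concrete, the sketch below shows one plausible way to compute a composite trust score and an SLO burn rate. Both functions are assumptions for illustration: the equal weights are placeholders rather than the values used in governance_service/enforcement_engine.py, and the burn rate uses the common definition of observed error rate divided by the error rate the SLO allows.

```python
# Hypothetical equal weights over the four signals named in this README.
DEFAULT_WEIGHTS = {"identity": 0.25, "behavior": 0.25,
                   "compliance": 0.25, "network": 0.25}


def composite_trust_score(components, weights=DEFAULT_WEIGHTS):
    """Weighted mean of component scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    score = sum(weights[name] * components[name] for name in weights) / total_weight
    return max(0.0, min(1.0, score))  # clamp against out-of-range inputs


def burn_rate(errors, total, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (errors / total) / allowed_error_rate
```

With a 99% SLO, 2 errors in 100 requests gives a burn rate of about 2, meaning the budget is being consumed twice as fast as the SLO allows.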

Tab 2 — Content Quality Enforcement

Post-hoc quality gate — watcher fires every 30 seconds, reads eval scores, compares against per-agent thresholds:

Decision	Effect
`flag`	Log only
`hold`	Creates a HITL queue entry; next invocation is gated pending human review
`block`	Escalates gate tier to critical — all subsequent actions blocked
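
The decision ladder can be sketched as a threshold comparison. The function name and the three cut-off values below are hypothetical; in the platform the thresholds are configured per agent.

```python
def quality_gate_decision(score, flag_below=0.8, hold_below=0.5, block_below=0.3):
    """Map an eval score to flag / hold / block, most severe first.

    Threshold defaults are placeholders, not the platform's values.
    """
    if score < block_below:
        return "block"  # gate tier escalated to critical
    if score < hold_below:
        return "hold"   # HITL queue entry, next invocation gated
    if score < flag_below:
        return "flag"   # log only
    return "pass"       # hypothetical label for "no action"
```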

Tab 3 — HITL Approvals

Pending queue shows agent, action type, risk tier, escalation reasons, full context. Approve/Reject with reviewer notes. Both the Policy Engine tab (page 10) and Enforcement HITL tab (page 11) read/write the same otel.gov_hitl_queue table.

See docs/hitl-and-gate-ui-reference.md for full reference.

12. EvalGov Intelligence Agent

The EvalGov Agent (:8003) is the intelligence and interaction layer on top of the governance system.

Three Capabilities

Chat Agent — Claude Sonnet with 44 real-time tools:

"What's wrong right now?"
"Why did the circuit breaker open for agent searcher?"
"Show me what it cost to run yesterday"
"Approve the pending HITL for request abc123"
"Which agents have trust scores below 0.5?"

MCP — expose all tools to Claude Code and other AI agents:

claude mcp add evalgov --transport sse http://localhost:8003/mcp/sse

Chat UI (page 14)

Two-panel layout: chat (left) + live system state (right). The right panel shows active HITL requests, open incidents, circuit breaker states, and policy violations — auto-refreshes every 60s.

See docs/evalgov-agent.md for the complete tool reference and chat agent details.

13. Multi-Agent Demo

The primary demo is examples/opt-demo, a 4-agent system built with Google ADK and full governance integration.

Agents

Agent	Role	Tool
Orchestrator	Classifies query intent, routes or responds directly	—
Searcher	Searches web, synthesises answer with citations	DuckDuckGo
Summarizer	Summarises text in exact word count	—
Translator	Translates to any target language	—

Setup

# Terminal 1 — eval platform (repo root: opt-aieval/)
docker compose up -d

# Terminal 2 — demo agent (opt-aieval/examples/opt-demo/)
cd examples/opt-demo
cp .env.example .env   # set OPENAI_API_KEY or ANTHROPIC_API_KEY
pip install -r requirements.txt
uvicorn api:app --host 0.0.0.0 --port 8080 --workers 1

# Terminal 3 — chat UI (opt-aieval/examples/opt-demo/)
cd examples/opt-demo
streamlit run chat_ui.py   # → http://localhost:8502

Use --workers 1. Each uvicorn worker is a separate process with its own memory, so with multiple workers the in-memory eval dedup dict is not shared and duplicate otel.prompt_evals rows can be written.
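
The effect is easy to reproduce in miniature. The sketch below models each worker's dedup state as a separate object. `EvalDedup` is a hypothetical stand-in for the real in-memory structure, keyed by (trace_id, span_id) like the prompt_evals deduplication described earlier.

```python
class EvalDedup:
    """Process-local record of spans that have already been evaluated."""

    def __init__(self):
        self._seen = set()

    def first_time(self, trace_id, span_id):
        """Return True only the first time this process sees the span."""
        key = (trace_id, span_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


# Two workers means two independent dedup states.
worker_a = EvalDedup()
worker_b = EvalDedup()

first_in_a = worker_a.first_time("trace-1", "span-1")   # evaluated by worker A
repeat_in_a = worker_a.first_time("trace-1", "span-1")  # correctly skipped
first_in_b = worker_b.first_time("trace-1", "span-1")   # evaluated again by worker B
```

Worker B has no record of the span, so it evaluates it a second time and a duplicate row is written.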

Multi-Turn Conversation Tracking

Every chat session is assigned a conversation_id (UUID) propagated to all agent spans — orchestrator and sub-agents. This enables the conversation browser in Eval UI, multi-turn LLM judges, and conversation-level filtering in Prompt Analysis.

Instrumentation in the Demo

Event	Span	Where
Root task	`agent.task` (orchestrator)	`runner.py run_agent()`
Sub-agent call	`agent.task` (searcher / summarizer / translator)	`runner.py _call_agent()`
`conversation.id`	Set on all `agent.task` spans	root + all child spans
web_search	`agent.tool_call`	`_maybe_emit_tool_span()`
LLM calls	`openai.chat`, `call_llm`	Google ADK + LiteLLM auto-instrumentation

14. Running Offline Benchmarks

# Run unit suite
python benchmarks/runner.py --suite unit --name "baseline-v1" --agent-version "v1.0"

# Run all suites
python benchmarks/runner.py --suite all --name "pre-deploy-v2"

# Compare against baseline
python benchmarks/runner.py --suite unit --name "post-change-v2" --compare-baseline

Benchmark Suites

Suite	Cases	Focus
`unit`	5	Single-turn: factual QA, format, instruction following
`integration`	4	Multi-step: research, tool use, reasoning chains
`collaboration`	3	Multi-agent: orchestration, parallel agents

Adding Test Cases

Edit benchmarks/suites/{suite}/cases.json or use the Eval UI (Benchmarks tab):

{
  "suite": "unit",
  "name": "My test case",
  "task_input": "The prompt to send to the agent",
  "expected_output": "Optional ground truth for exact-match eval",
  "rubric": {"criteria": ["criterion 1", "criterion 2"]},
  "difficulty": "medium"
}
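
Before adding a case by hand, it can help to sanity-check its shape. The validator below is a sketch: `validate_case` is a hypothetical helper, and treating suite, name, and task_input as required is an assumption, since the benchmark runner's actual requirements are not documented here.

```python
KNOWN_SUITES = {"unit", "integration", "collaboration"}
REQUIRED_FIELDS = ("suite", "name", "task_input")  # assumed, not confirmed


def validate_case(case):
    """Return a list of problems with a test case dict; empty means it looks fine."""
    errors = [f"missing field: {field}"
              for field in REQUIRED_FIELDS if not case.get(field)]
    suite = case.get("suite")
    if suite and suite not in KNOWN_SUITES:
        errors.append(f"unknown suite: {suite}")
    rubric = case.get("rubric")
    if rubric is not None:
        criteria = rubric.get("criteria") if isinstance(rubric, dict) else None
        if not isinstance(criteria, list) or not criteria:
            errors.append("rubric.criteria must be a non-empty list")
    return errors
```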

15. Onboarding a New Agent System

Full instructions are in instrumentation/README.md. The short version:

1. Install the SDK

pip install -e "./instrumentation[all]"   # or pick your framework extra; quotes keep zsh from globbing the brackets

2. Initialize telemetry

from instrumentation import init_telemetry
init_telemetry(service_name="my-agent")

3. Attach an adapter or wrap spans manually

# LangChain
from instrumentation import LangChainEvalAdapter
adapter = LangChainEvalAdapter(agent_id="lc-1", agent_role="researcher")
agent.invoke(input, config={"callbacks": [adapter]})

# Or manually (any framework)
from instrumentation import agent_task
with agent_task(agent_id="a1", role="researcher", task_input=query) as task:
    result = my_agent.run(query)
    task.set_output(result)

4. Verify in Jaeger

Open http://localhost:16686, select your service.name, confirm agent.task spans appear with correct parent-child nesting.

5. Add governance enforcement (optional)

from instrumentation import GovernedToolkit
toolkit = GovernedToolkit(agent_role="searcher")
governed_search = toolkit.wrap("web_search", web_search)

See docs/gov-instrument-guide.md for Phase 2 gate patterns and the full checklist.

16. Evaluation Metrics Reference

LLM Judge Scores

Metric	What it measures
`faithfulness`	Output grounded in source context — no hallucinations
`relevance`	Output addresses the actual question
`instruction_following`	Output follows explicit prompt constraints
`hallucination_score`	Fabricated facts not in tool outputs or context
`safety_score`	6-dimension safety: toxic, harmful, PII, bias, unsafe advice, role violation
`qa_correctness`	Answer correctness vs expected output (benchmark mode)
`custom_rubric`	Configurable rubric per benchmark case
`handoff_fidelity`	Context preserved across agent handoffs

Multi-Turn Judges

Metric	What it measures
`context_retention`	Each turn correctly references prior conversation context
`topic_adherence`	Conversation stays on-topic across turns
`response_consistency`	No contradictions between turns
`conversation_relevancy`	Each response relevant to the original user intent

OTel-Computed Per-Trace Metrics

Metric	Source	Unit
`task_success`	Root `agent.task` StatusCode	0/1
`agent_failure`	Root `agent.task` StatusCode	0/1
`task_latency_ms`	Root `agent.task` Duration	ms
`cost_usd`	`gen_ai.usage.*` tokens × governance rates	USD
`token_count`	All LLM spans `gen_ai.usage.*`	tokens
`context_window_utilization`	Max `gen_ai.usage.input_tokens` / 128k	0–1
`tool_call_success_rate`	`agent.tool_call` StatusCode	0–1
`tool_calls_per_task`	Count of `agent.tool_call` spans	count
`handoff_success_rate`	`agent.handoff` StatusCode	0–1
`dead_span_rate`	Orphaned spans / total spans	0–1
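
Several of these metrics reduce to simple arithmetic over the spans of one trace. The sketch below works on a simplified span shape (span_id, parent_span_id, name, ok, input_tokens, output_tokens) that is hypothetical and does not mirror the otel.otel_traces columns. It also assumes an orphaned span is one whose parent is missing from the trace, and that rates are expressed per million tokens.

```python
def per_trace_metrics(spans, context_window=128_000,
                      input_rate_per_million=0.0, output_rate_per_million=0.0):
    """Compute a subset of the per-trace metrics from a flat list of span dicts."""
    span_ids = {s["span_id"] for s in spans}
    # Assumption: orphaned means the referenced parent is not in this trace.
    orphaned = [s for s in spans
                if s.get("parent_span_id") and s["parent_span_id"] not in span_ids]
    tool_calls = [s for s in spans if s["name"] == "agent.tool_call"]
    input_tokens = [s.get("input_tokens", 0) for s in spans]
    output_tokens = [s.get("output_tokens", 0) for s in spans]
    return {
        "tool_calls_per_task": len(tool_calls),
        # None when the trace has no tool calls, to avoid dividing by zero.
        "tool_call_success_rate": (
            sum(1 for s in tool_calls if s["ok"]) / len(tool_calls)
            if tool_calls else None
        ),
        "dead_span_rate": len(orphaned) / len(spans) if spans else 0.0,
        "token_count": sum(input_tokens) + sum(output_tokens),
        # Largest single prompt relative to the 128k window from the table.
        "context_window_utilization": max(input_tokens, default=0) / context_window,
        "cost_usd": (sum(input_tokens) * input_rate_per_million
                     + sum(output_tokens) * output_rate_per_million) / 1_000_000,
    }
```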

Score Interpretation

Score	Meaning
0.8 – 1.0	Good — meets expectations
0.5 – 0.8	Marginal — review recommended
0.0 – 0.5	Poor — action needed

Eval Metrics Dashboard (page 2 → Eval Metrics tab)

68 metrics across 9 categories. Metrics marked "Needs instrumentation" require additional span attributes not yet emitted.

Category	Total	Live
1 · Task completion & accuracy	10	8
2 · Efficiency & cost	7	6
3 · Multi-agent coordination	8	5
4 · Reasoning & decision quality	7	2
5 · Tool & MCP integration	10	6
6 · Reliability & safety	8	6
7 · Memory & context management	6	1
8 · Observability & tracing	7	5
9 · Conversation quality	4	4
Total	68	43

See docs/eval-metrics-dashboard-guide.md for filter behaviour, known limitations, and the span-level filter fix guide.

17. Configuration Reference

Environment Variables

Variable	Default	Description
`ANTHROPIC_API_KEY`	—	Required — LLM judges and EvalGov Agent
`OPENAI_API_KEY`	—	Required for demo agents using OpenAI
`JUDGE_MODEL`	`anthropic/claude-haiku-4-5-20251001`	Model used by LLM judges
`AGENT_MODEL`	`anthropic/claude-sonnet-4-6`	Model used by EvalGov chat agent
`OTEL_EXPORTER_OTLP_ENDPOINT`	`http://localhost:4317`	OTel Collector endpoint
`CLICKHOUSE_HOST`	`localhost`	ClickHouse host
`CLICKHOUSE_PORT`	`9000`	ClickHouse native port
`GOVERNANCE_SERVICE_URL`	`http://localhost:8002`	Governance service URL
`EVALGOV_AGENT_URL`	`http://localhost:8003`	EvalGov agent URL
`HITL_TIMEOUT_MINUTES`	`15`	Minutes before HITL timeout triggers a finding
`OLLAMA_BASE_URL`	`http://localhost:11434`	Ollama endpoint (if using local models)

Key Config Files

File	Purpose
`infra/otelcol/config.yaml`	OTel Collector pipelines, exporters, normalisation rules
`eval_runner/config/evaluator_config.yaml`	Which evaluators run, LLM sampling rate, regression threshold
`infra/clickhouse/init.sql`	Database schema — all tables and materialized views

18. Common Commands

# Platform lifecycle
docker compose up -d          # Start all 7 services
docker compose down           # Stop all services
docker compose logs -f        # Tail all service logs
docker compose ps             # Show service health

# Rebuild after code changes (Python services only)
docker compose build eval-runner && docker compose up -d eval-runner
docker compose build evalgov-agent && docker compose up -d evalgov-agent

# ClickHouse SQL console
docker exec -it aieval-clickhouse clickhouse-client

# Benchmark runner
python benchmarks/runner.py --suite unit --name "my-run"
python benchmarks/runner.py --suite all --compare-baseline

# Connect Claude Code to EvalGov tools via MCP
claude mcp add evalgov --transport sse http://localhost:8003/mcp/sse
