Build TMB evals system v2 — outcome-first multi-scorer (replaces L6 v1 brittle-strict-match) #110

@ZaxShen

Why this issue

L6 v1 (PR #109) uses strict trajectory matching: it asserts that the agent calls exactly N specific tools in one exact order. Web research (citations below) found that Anthropic explicitly warns against this pattern as too brittle, and that the industry has converged on better ones. We need v2 before scaling L6 to 12+ scaffolded flows.

User direction (2026-04-26): "L6 PR already merged. debug_tmb is not industry standards. We can abandon it and follow industry standards. Make our debug mode comprehensive and robust... It can save all of the testing process for all following works."

The problem with L6 v1 — per Anthropic

"There is a common instinct to check that agents followed very specific steps like a sequence of tool calls in the right order. We've found this approach too brittle and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn't anticipate."

"Grade what the agent produced, not the path it took."

Anthropic Engineering: Demystifying evals for AI agents

L6 v1's strict match against expected/<flow>.txt falls into this trap.

Industry standard (synthesized from 8 sources — see Citations)

Eval model: three tiers that run on every flow, plus an opt-in fourth:

| Tier | What | When |
|---|---|---|
| Outcome | Final state assertion: does the DB / filesystem / commits look right? | Primary (every flow) |
| Trajectory | Were the right tools called (subset/superset, NOT strict)? Were forbidden tools NOT called? | Secondary (every flow) |
| Cost / Latency | Token usage, P50/P99 latency, tracked vs baseline | Observability (every flow) |
| LLM-as-judge | Subjective quality (spec readability, tone, completeness) | Opt-in for prose-quality flows |

Proposed v2 architecture (Inspect AI primitives)

Inspect AI is the de facto industry standard — Anthropic, DeepMind, and other safety orgs use it (per Hamel Husain). It defines four primitives we should adopt:

  • Dataset: (input, target) pairs. For TMB: (prompt.txt, fixture-pre.sql, scorers/outcome.sql, scorers/tools-required.json), matching the per-flow file layout
  • Solver: produces output for input. For TMB: claude -p run with TMB_DEBUG_TRAJECTORY=1 (already built in L6 v1)
  • Scorers: grade the output (multiple per task). For TMB: 4 scorers per flow (outcome / trajectory / cost / optional LLM-judge)
  • Task: ties dataset + solver + scorers together
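As a rough illustration of how the four primitives relate, here is a minimal plain-Python sketch. It deliberately does not use Inspect AI: every name in it (`Sample`, `Score`, `Task`, `run_task`, `fake_solver`) is hypothetical and is not Inspect AI's actual API, and the solver is a stub rather than a real `claude -p` run.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Sample:
    """One dataset entry: the input prompt plus what the scorers compare against."""
    prompt: str
    target: dict = field(default_factory=dict)


@dataclass
class Score:
    scorer_name: str
    passed: bool
    explanation: str = ""


# A solver turns a sample into an output; a scorer grades that output.
Solver = Callable[[Sample], dict]
Scorer = Callable[[Sample, dict], Score]


@dataclass
class Task:
    """Ties a dataset, one solver, and several scorers together."""
    dataset: list[Sample]
    solver: Solver
    scorers: list[Scorer]


def run_task(task: Task) -> list[list[Score]]:
    """Run the solver once per sample and apply every scorer to its output."""
    results = []
    for sample in task.dataset:
        output = task.solver(sample)
        results.append([scorer(sample, output) for scorer in task.scorers])
    return results


def fake_solver(sample: Sample) -> dict:
    # Stub: a real solver would launch the agent and collect its trajectory.
    return {"tools_called": ["issue_create", "task_create_batch"], "issue_count": 1}


def outcome_scorer(sample: Sample, output: dict) -> Score:
    ok = output["issue_count"] == sample.target["issue_count"]
    return Score("outcome", ok, f"issue_count={output['issue_count']}")


def trajectory_scorer(sample: Sample, output: dict) -> Score:
    missing = set(sample.target["required_tools"]) - set(output["tools_called"])
    return Score("trajectory_superset", not missing, f"missing={sorted(missing)}")


task = Task(
    dataset=[Sample("add a cli todo", {"issue_count": 1, "required_tools": ["issue_create"]})],
    solver=fake_solver,
    scorers=[outcome_scorer, trajectory_scorer],
)
scores = run_task(task)
```

Keeping scorers as independent callables is what lets one flow report an outcome pass and a trajectory fail separately, instead of collapsing both into a single verdict.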

Concrete schema additions

-- Cost + latency tracking on existing trajectory rows
ALTER TABLE debug_trajectory ADD COLUMN tokens_in   INTEGER DEFAULT 0;
ALTER TABLE debug_trajectory ADD COLUMN tokens_out  INTEGER DEFAULT 0;
ALTER TABLE debug_trajectory ADD COLUMN latency_ms  INTEGER DEFAULT 0;

-- Per-scorer results, one row per (flow, scorer) per run
CREATE TABLE IF NOT EXISTS eval_results (
  id           INTEGER PRIMARY KEY AUTOINCREMENT,
  run_id       TEXT NOT NULL,            -- groups all scorers for one flow run
  flow_name    TEXT NOT NULL,
  scorer_name  TEXT NOT NULL,            -- 'outcome' | 'trajectory_subset' | 'cost' | 'llm_judge'
  pass         INTEGER NOT NULL,         -- 1 = pass, 0 = fail
  value        TEXT,                     -- numeric or categorical
  explanation  TEXT,                     -- why pass/fail
  metadata_json TEXT NOT NULL DEFAULT '{}',
  created_at   TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_eval_results_run ON eval_results(run_id, scorer_name);
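A minimal sketch of how a runner might write and read `eval_results` rows, using Python's standard `sqlite3` module against an in-memory database. The helpers `record_result` and `run_passed` are hypothetical, and the rule that a run passes only when every scorer row passes is an assumption (a soft cost scorer might be excluded in practice).

```python
import json
import sqlite3
import uuid

# Same columns as the proposed table; IF NOT EXISTS on the index is added here
# so the script can be re-run safely.
SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_results (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  run_id        TEXT NOT NULL,
  flow_name     TEXT NOT NULL,
  scorer_name   TEXT NOT NULL,
  pass          INTEGER NOT NULL,
  value         TEXT,
  explanation   TEXT,
  metadata_json TEXT NOT NULL DEFAULT '{}',
  created_at    TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_eval_results_run ON eval_results(run_id, scorer_name);
"""


def record_result(conn, run_id, flow_name, scorer_name, passed,
                  value=None, explanation="", metadata=None):
    """Insert one (flow, scorer) row; booleans are stored as 1/0."""
    conn.execute(
        "INSERT INTO eval_results "
        "(run_id, flow_name, scorer_name, pass, value, explanation, metadata_json) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, flow_name, scorer_name, int(passed),
         None if value is None else str(value),
         explanation, json.dumps(metadata or {})),
    )


def run_passed(conn, run_id):
    """Assumed rule: a run passes only if it has rows and every row passed."""
    total, passed = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(pass), 0) FROM eval_results WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    return total > 0 and total == passed


conn = sqlite3.connect(":memory:")  # in-memory DB for illustration only
conn.executescript(SCHEMA)
run_id = str(uuid.uuid4())
record_result(conn, run_id, "02-simple-task", "outcome", True, value=3)
record_result(conn, run_id, "02-simple-task", "trajectory_subset", False,
              explanation="missing tool: Task")
conn.commit()
```

Grouping rows by `run_id` keeps each scorer's verdict and explanation queryable on its own, which is the point of storing one row per (flow, scorer).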

Per-flow file layout (replaces L6 v1's expected/<name>.txt)

tests/dogfood/flows/02-simple-task/
├── prompt.txt                      # the user's ask
├── fixture-pre.sql                 # DB state before claude runs
├── scorers/
│   ├── outcome.sql                 # assertions on final DB state
│   ├── tools-required.json         # superset: these MUST be called (any order)
│   ├── tools-forbidden.json        # never-called list
│   ├── cost-budget.json            # max tokens + max latency_ms
│   └── llm-judge-rubric.md         # optional, opt-in per flow
└── README.md                       # what this flow tests

Trajectory match modes — adopt LangSmith's 4 modes

| Mode | Use case |
|---|---|
| subset | Agent calls at most the listed tools (efficiency check) |
| superset (default for us) | Agent calls at least the listed tools (minimum-required check) |
| unordered | Same tools, any order |
| strict | Same tools, same order. Only use for hard sequential constraints like "config_set must come after identity_set in onboarding" |
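The four modes can be sketched as a single comparison function. This is an illustrative reading of the table, not LangSmith's implementation: the function name `trajectory_matches` is hypothetical, and treating subset/superset as set comparisons that ignore repeat counts is an assumption.

```python
from collections import Counter


def trajectory_matches(actual, reference, mode="superset"):
    """Compare two lists of tool names under one of the four match modes."""
    actual_set, reference_set = set(actual), set(reference)
    if mode == "subset":      # agent used no tool outside the reference list
        return actual_set <= reference_set
    if mode == "superset":    # agent used every reference tool; extras allowed
        return actual_set >= reference_set
    if mode == "unordered":   # same calls with the same multiplicity, any order
        return Counter(actual) == Counter(reference)
    if mode == "strict":      # same calls in the same order
        return list(actual) == list(reference)
    raise ValueError(f"unknown match mode: {mode!r}")


# Example: the agent probes config first and appends a discussion row at the end.
actual = ["config_get", "identity_set", "config_set", "discussion_append"]
reference = ["identity_set", "config_set"]
superset_ok = trajectory_matches(actual, reference, "superset")
subset_ok = trajectory_matches(actual, reference, "subset")
```

With these inputs the superset check passes while subset, unordered, and strict all fail, which is why superset is the forgiving default and strict is reserved for real ordering constraints.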

Scorer-by-scorer specifications

1. Outcome scorer (deterministic, primary)

For each flow, assert post-state SQL queries return the expected counts/values. Example for 02-simple-task:

-- expected outcome
SELECT COUNT(*) FROM issues WHERE objective LIKE '%cli todo%';      -- expect 1
SELECT COUNT(*) FROM tasks WHERE issue_id = (SELECT MAX(id) FROM issues);  -- expect 1
SELECT COUNT(*) FROM ledger WHERE event_type = 'planning_complete';  -- expect ≥1

If all assertions pass → outcome scorer = pass.
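One way the outcome scorer could evaluate those assertions is sketched below. The `(sql, op, expected)` tuple format and the `score_outcome` helper are hypothetical, since the issue only gives the expectations as SQL comments; the toy tables stand in for the real TMB schema.

```python
import operator
import sqlite3

OPS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}


def score_outcome(conn, assertions):
    """Run each (sql, op, expected) assertion; pass only if all of them hold."""
    failures = []
    for sql, op, expected in assertions:
        actual = conn.execute(sql).fetchone()[0]
        if not OPS[op](actual, expected):
            failures.append(f"{sql!r}: got {actual}, expected {op} {expected}")
    return (not failures, failures)


# Toy post-state standing in for the real database after the agent has run.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE issues (id INTEGER PRIMARY KEY, objective TEXT);
CREATE TABLE tasks  (id INTEGER PRIMARY KEY, issue_id INTEGER);
CREATE TABLE ledger (id INTEGER PRIMARY KEY, event_type TEXT);
INSERT INTO issues (objective) VALUES ('build a cli todo app');
INSERT INTO tasks  (issue_id) VALUES (1);
INSERT INTO ledger (event_type) VALUES ('planning_complete');
""")

ASSERTIONS = [
    ("SELECT COUNT(*) FROM issues WHERE objective LIKE '%cli todo%'", "==", 1),
    ("SELECT COUNT(*) FROM tasks WHERE issue_id = (SELECT MAX(id) FROM issues)", "==", 1),
    ("SELECT COUNT(*) FROM ledger WHERE event_type = 'planning_complete'", ">=", 1),
]
passed, failures = score_outcome(conn, ASSERTIONS)
```

Returning the list of failed assertions alongside the verdict gives the `explanation` column something concrete to store when a flow fails.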

2. Trajectory subset/superset scorer (deterministic)

// tools-required.json (superset mode — these must appear)
[
  "mcp__plugin_tmb_trajectory-server__issue_create",
  "mcp__plugin_tmb_trajectory-server__task_create_batch",
  "Task"
]

// tools-forbidden.json (must NOT appear)
[
  "mcp__plugin_tmb_trajectory-server__validation_record",
  "mcp__plugin_tmb_trajectory-server__task_update_status with status='closed' before completed"
]

This is less brittle than strict matching: the scorer doesn't care if bro makes three extra discussion_append calls or reorders config probes.
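A minimal sketch of the required/forbidden check follows. The `score_trajectory` helper is hypothetical, and it matches on plain tool names only: an argument-level or ordering rule such as the `task_update_status` entry would also need the captured call arguments, which this sketch does not model.

```python
def score_trajectory(called, required, forbidden):
    """Pass when every required tool was called and no forbidden tool was."""
    called_set = set(called)
    missing = sorted(set(required) - called_set)
    violations = sorted(set(forbidden) & called_set)
    parts = []
    if missing:
        parts.append("missing required: " + ", ".join(missing))
    if violations:
        parts.append("forbidden called: " + ", ".join(violations))
    return (not missing and not violations, "; ".join(parts) or "ok")


PREFIX = "mcp__plugin_tmb_trajectory-server__"
required = [PREFIX + "issue_create", PREFIX + "task_create_batch", "Task"]
forbidden = [PREFIX + "validation_record"]

# Extra discussion_append calls and a different call order do not matter.
called = [
    PREFIX + "discussion_append",
    PREFIX + "issue_create",
    PREFIX + "task_create_batch",
    "Task",
    PREFIX + "discussion_append",
]
ok, why = score_trajectory(called, required, forbidden)
```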

3. Cost / latency scorer (observability)

// cost-budget.json — per-flow soft budget
{
  "max_tokens_total": 50000,
  "max_latency_ms_p99": 60000,
  "fail_above_max": false,           // soft signal — log but don't fail
  "warn_drift_pct": 25                // alert if >25% above 7-day baseline
}

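Tracks tokens and latency per flow run. With fail_above_max set to false the budget is a soft signal: overruns are logged but do not fail the flow. Setting it to true turns the maxima into hard caps. Drift beyond warn_drift_pct over the 7-day baseline raises a warning either way.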
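The budget semantics could be sketched as follows. The `score_cost` helper is hypothetical; it assumes the caller has already aggregated the P99 latency and the 7-day token baseline, and it applies the drift check to tokens only.

```python
def score_cost(tokens_total, latency_ms_p99, budget, baseline_tokens=None):
    """Return (passed, warnings) for one flow against its cost budget."""
    warnings = []
    over_budget = False
    if tokens_total > budget["max_tokens_total"]:
        over_budget = True
        warnings.append(f"tokens {tokens_total} > max {budget['max_tokens_total']}")
    if latency_ms_p99 > budget["max_latency_ms_p99"]:
        over_budget = True
        warnings.append(
            f"p99 latency {latency_ms_p99}ms > max {budget['max_latency_ms_p99']}ms")
    if baseline_tokens:
        drift_pct = 100.0 * (tokens_total - baseline_tokens) / baseline_tokens
        if drift_pct > budget["warn_drift_pct"]:
            warnings.append(f"tokens {drift_pct:.0f}% above baseline")
    # Only a hard budget (fail_above_max = true) turns an overrun into a failure.
    passed = not (over_budget and budget["fail_above_max"])
    return passed, warnings


BUDGET = {
    "max_tokens_total": 50000,
    "max_latency_ms_p99": 60000,
    "fail_above_max": False,
    "warn_drift_pct": 25,
}
passed, warnings = score_cost(60000, 1000, BUDGET, baseline_tokens=40000)
```

In this example the run exceeds the token maximum and drifts 50% above baseline, so it produces two warnings but still passes because the budget is soft.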

4. LLM-as-judge scorer (opt-in, paid)

For flows where prose quality matters (e.g., bro's response to the Human, spec readability):

<!-- llm-judge-rubric.md -->
Grade bro's final response on a 1-5 scale:
- 5: Concise, accurate, addresses all parts of the ask
- 1: Verbose, off-topic, or incomplete

Pass threshold: ≥4

Each eval costs one extra Claude call. Skip this scorer on regular CI runs to save cost and run it weekly instead.

pass^k consistency (per survey §3.3.1)

Run each flow N=3 times. Report:

  • pass@3 (loose: at least one succeeded)
  • pass^3 (strict: all three succeeded)

Strict consistency matters for production reliability.
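The two metrics reduce to `any` and `all` over the per-run verdicts. The sketch below also shows why they diverge: assuming independent runs with a per-run success rate p, pass@k is 1 - (1 - p)^k and pass^k is p^k. Function names are hypothetical.

```python
def pass_at_k(results):
    """Loose: at least one of the k runs passed."""
    return any(results)


def pass_hat_k(results):
    """Strict: all k runs passed (an empty result list counts as a failure)."""
    return len(results) > 0 and all(results)


def expected_rates(p, k):
    """Expected (pass@k, pass^k) for independent runs with success rate p."""
    return 1 - (1 - p) ** k, p ** k


runs = [True, False, True]            # three runs of one flow
loose, strict = pass_at_k(runs), pass_hat_k(runs)
at_k, hat_k = expected_rates(0.9, 3)  # a flow that passes 90% of the time
```

A flow that passes 90% of single runs almost always clears pass@3 but clears pass^3 only about 73% of the time, so the strict metric is the one that exposes flakiness.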

Migration from L6 v1

| L6 v1 | v2 equivalent | Status |
|---|---|---|
| debug_trajectory table | Same (primary capture) | Keep |
| MCP wrapper in index.ts | Extend with token + latency capture | Modify |
| scripts/hooks/debug-trajectory.sh | Same | Keep |
| `tests/dogfood/expected/<name>.txt` (strict) | `flows/<name>/scorers/*.{sql,json,md}` (multi) | Convert |
| tests/dogfood/run-l6.sh | Add scorer dispatch logic | Refactor |
| 4 wired flows | Each becomes a directory with multi-scorer config | Convert |
| 12 scaffolded flows | Author with v2 patterns from the start | Author |

Implementation phases

| Phase | Scope | Estimate | Blocks / dependencies |
|---|---|---|---|
| 1. v2 runner + outcome scorer | Refactor runner to dispatch to scorers; convert 1 flow as proof | 1 day | Phase 2-5 |
| 2. Trajectory subset/superset scorer | Replace strict assertions; convert remaining 3 wired flows | 1 day | Phase 4 |
| 3. Cost tracking | Schema columns + capture wiring + cost scorer | 1 day | independent |
| 4. Author 12 scaffolded flows | One PR per 3-flow batch, all with v2 patterns | 4 days | depends on Phase 2 |
| 5. LLM-as-judge + pass^k | Optional scorers + consistency runs | 2 days | independent |

Total: ~9 days of work, parallelizable. Phase 1+2+3 can land as one PR (~3 days).

Why this is a load-bearing investment

  • Doesn't fail on cosmetic trajectory differences (per Anthropic's warning)
  • Catches what matters (final state, not path)
  • Industry-standard structure — easier to onboard contributors, port to Inspect AI later
  • Self-monitoring — cost/latency drift alerts before they become problems
  • Makes #45 (Codebase memory: lazy bootstrap + per-session verify + task-closure updates) safer to ship; any future feature gets the eval safety net

User said it: "It can save all of the testing process for all following works." Investing here means every future PR is safer.

Citations (all from web research, not memory)

Acceptance criteria

  • Schema migration: debug_trajectory adds (tokens_in, tokens_out, latency_ms); new eval_results table
  • Runner refactor: tests/dogfood/run-l6.sh dispatches to per-flow scorer config
  • All 4 wired flows converted to v2 multi-scorer format (no strict-trajectory assertions remain)
  • At least 3 of the 12 scaffolded flows authored with full v2 (outcome + trajectory)
  • Cost scorer reports per-flow tokens + latency to eval_results table
  • CI workflow updated: still runs on tag/PR-label, but reports cost drift in PR comments
  • tests/README.md rewritten — L6 = "industry-standard agentic evals" (cite sources)
  • One example LLM-as-judge scorer (opt-in via per-flow config)
  • pass^k consistency report (k=3) on at least one flow

Out of scope (separate follow-ups if/when needed)

  • Full Inspect AI adoption (the framework itself) — could come later if we want their visualization tools
  • Replay-from-production (re-running captured production sessions on new model versions)
  • Adversarial prompt generation
  • Multi-platform eval matrix (macOS / Windows runners)

Labels

Doctrine, Feature, Priority: High, Tests
