Build TMB evals system v2 — outcome-first multi-scorer (replaces L6 v1 brittle-strict-match)

## Why this issue

L6 v1 (PR #109) uses **strict trajectory matching**: it asserts that the agent calls exactly N specific tools in one exact order. Web research (citations below) found that Anthropic explicitly warns against this pattern as too brittle, and that the industry has converged on better patterns. We need v2 before scaling L6 to 12+ scaffolded flows.

User direction (2026-04-26): *"L6 PR already merged. debug_tmb is not industry standards. We can abandon it and follow industry standards. Make our debug mode comprehensive and robust... It can save all of the testing process for all following works."*

## The problem with L6 v1 — per Anthropic

> "There is a common instinct to check that agents followed very specific steps like a sequence of tool calls in the right order. We've found this approach too rigid and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn't anticipate."

> "Grade what the agent produced, not the path it took."

— [Anthropic Engineering: Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)

L6 v1's `expected/<flow>.txt` strict match falls into this trap.

## Industry standard (synthesized from the sources under Citations)

Eval model: three tiers that apply to every flow, plus an opt-in fourth:

| Tier | What | When |
|---|---|---|
| **Outcome** | Final state assertion: does the DB / filesystem / commits look right? | Primary — every flow |
| **Trajectory** | Were the right tools called (subset/superset, NOT strict)? Were forbidden tools NOT called? | Secondary — every flow |
| **Cost / Latency** | Token usage, P50/P99 latency, tracked vs baseline | Observability — every flow |
| **LLM-as-judge** | Subjective quality (spec readability, tone, completeness) | Opt-in for prose-quality flows |

## Proposed v2 architecture (Inspect AI primitives)

[Inspect AI](https://inspect.aisi.org.uk/) is the de facto industry standard — Anthropic, DeepMind, and other safety orgs use it ([per Hamel Husain](https://hamel.dev/notes/llm/evals/inspect.html)). It defines four primitives we should adopt:

- **Dataset**: `(input, target)` pairs. For TMB: the input is `prompt.txt` plus `fixture-pre.sql`; the target is the expected final state and the expected tool lists under `scorers/`
- **Solver**: produces output for input. For TMB: `claude -p` run with `TMB_DEBUG_TRAJECTORY=1` (already built in L6 v1)
- **Scorers**: grade the output (multiple per task). For TMB: 4 scorers per flow (outcome / trajectory / cost / optional LLM-judge)
- **Task**: ties dataset + solver + scorers together
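
How the four primitives fit together can be sketched in plain Python without depending on Inspect AI itself. Every name below (`Sample`, `Score`, `Task`, `run_task`, `fake_solver`) is hypothetical and chosen for illustration; this is not Inspect AI's API, and the stand-in solver replaces the real `claude -p` run.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Sample:
    """One dataset entry: the input plus whatever the scorers compare against."""
    prompt: str
    target: dict = field(default_factory=dict)


@dataclass
class Score:
    scorer_name: str
    passed: bool
    explanation: str = ""


# A solver turns a sample into an output (for TMB, a captured run).
Solver = Callable[[Sample], dict]
# A scorer grades one output against its sample.
Scorer = Callable[[Sample, dict], Score]


@dataclass
class Task:
    """Ties a dataset, a solver, and several scorers together."""
    dataset: list[Sample]
    solver: Solver
    scorers: list[Scorer]


def run_task(task: Task) -> list[list[Score]]:
    """Run the solver on every sample and apply every scorer to its output."""
    results = []
    for sample in task.dataset:
        output = task.solver(sample)
        results.append([scorer(sample, output) for scorer in task.scorers])
    return results


# Stand-in solver and scorer so the sketch runs without claude.
def fake_solver(sample: Sample) -> dict:
    return {"tools": ["issue_create", "Task"]}


def required_tools_scorer(sample: Sample, output: dict) -> Score:
    missing = set(sample.target["tools_required"]) - set(output["tools"])
    return Score("trajectory_superset", not missing, f"missing={sorted(missing)}")


task = Task(
    dataset=[Sample("build a cli todo", {"tools_required": ["issue_create"]})],
    solver=fake_solver,
    scorers=[required_tools_scorer],
)
scores = run_task(task)
```

Keeping scorers as independent callables is what allows several of them to grade the same run, which is the property v2 relies on.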

### Concrete schema additions

```sql
-- Cost + latency tracking on existing trajectory rows
ALTER TABLE debug_trajectory ADD COLUMN tokens_in   INTEGER DEFAULT 0;
ALTER TABLE debug_trajectory ADD COLUMN tokens_out  INTEGER DEFAULT 0;
ALTER TABLE debug_trajectory ADD COLUMN latency_ms  INTEGER DEFAULT 0;

-- Per-scorer results, one row per (flow, scorer) per run
CREATE TABLE IF NOT EXISTS eval_results (
  id           INTEGER PRIMARY KEY AUTOINCREMENT,
  run_id       TEXT NOT NULL,            -- groups all scorers for one flow run
  flow_name    TEXT NOT NULL,
  scorer_name  TEXT NOT NULL,            -- 'outcome' | 'trajectory_subset' | 'cost' | 'llm_judge'
  pass         INTEGER NOT NULL,         -- 1 = pass, 0 = fail
  value        TEXT,                     -- numeric or categorical
  explanation  TEXT,                     -- why pass/fail
  metadata_json TEXT NOT NULL DEFAULT '{}',
  created_at   TEXT NOT NULL DEFAULT (datetime('now'))
);
-- IF NOT EXISTS keeps the index creation re-runnable, matching the table above
CREATE INDEX IF NOT EXISTS idx_eval_results_run ON eval_results(run_id, scorer_name);
```
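
As a rough sketch of how a runner might write to `eval_results`, the snippet below uses Python's built-in `sqlite3` against an in-memory database. The helper names `record_result` and `run_passed` are hypothetical, and the rule "a run passes only if no scorer failed" is an assumption, not something the schema dictates.

```python
import json
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_results (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  run_id        TEXT NOT NULL,
  flow_name     TEXT NOT NULL,
  scorer_name   TEXT NOT NULL,
  pass          INTEGER NOT NULL,
  value         TEXT,
  explanation   TEXT,
  metadata_json TEXT NOT NULL DEFAULT '{}',
  created_at    TEXT NOT NULL DEFAULT (datetime('now'))
);
"""


def record_result(conn, run_id, flow_name, scorer_name, passed,
                  value=None, explanation=None, metadata=None):
    """Insert one scorer verdict: one row per (run, scorer)."""
    conn.execute(
        "INSERT INTO eval_results "
        "(run_id, flow_name, scorer_name, pass, value, explanation, metadata_json) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, flow_name, scorer_name, int(passed),
         None if value is None else str(value),
         explanation, json.dumps(metadata or {})),
    )


def run_passed(conn, run_id):
    """Assumed aggregation rule: a run passes only if no recorded scorer failed."""
    (failures,) = conn.execute(
        "SELECT COUNT(*) FROM eval_results WHERE run_id = ? AND pass = 0",
        (run_id,),
    ).fetchone()
    return failures == 0


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
run_id = str(uuid.uuid4())  # groups all scorers for one flow run
record_result(conn, run_id, "02-simple-task", "outcome", True, value=3)
record_result(conn, run_id, "02-simple-task", "cost", False,
              value=61234, explanation="over token budget")
```

Because `value` is a `TEXT` column, numeric values are stored as strings here and need casting when queried for drift reports.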

### Per-flow file layout (replaces L6 v1's `expected/<name>.txt`)

```
tests/dogfood/flows/02-simple-task/
├── prompt.txt                      # the user's ask
├── fixture-pre.sql                 # DB state before claude runs
├── scorers/
│   ├── outcome.sql                 # assertions on final DB state
│   ├── tools-required.json         # superset: these MUST be called (any order)
│   ├── tools-forbidden.json        # never-called list
│   ├── cost-budget.json            # max tokens + max latency_ms
│   └── llm-judge-rubric.md         # optional, opt-in per flow
└── README.md                       # what this flow tests
```

### Trajectory match modes — adopt LangSmith's [4 modes](https://docs.langchain.com/langsmith/trajectory-evals)

| Mode | Use case |
|---|---|
| `subset` | Agent calls **at most** the listed tools (efficiency check) |
| `superset` (default for us) | Agent calls **at least** the listed tools (minimum-required check) |
| `unordered` | Same tools, any order |
| `strict` | Same tools, same order — **only use for hard sequential constraints** like "config_set must come after identity_set in onboarding" |
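
The four modes can be sketched as one comparison function. This is an illustration of the semantics in the table, not the LangSmith or agentevals implementation; the function name `trajectory_matches` is hypothetical, and it assumes trajectories are plain lists of tool names (arguments ignored, and call counts ignored for `subset` and `superset`).

```python
def trajectory_matches(actual, reference, mode="superset"):
    """Compare two lists of tool names under one of the four match modes."""
    if mode == "strict":
        # Same tools, same order.
        return list(actual) == list(reference)
    if mode == "unordered":
        # Same tools (including repeat counts), any order.
        return sorted(actual) == sorted(reference)
    if mode == "subset":
        # The agent called at most the listed tools.
        return set(actual) <= set(reference)
    if mode == "superset":
        # The agent called at least the listed tools.
        return set(actual) >= set(reference)
    raise ValueError(f"unknown mode: {mode}")


reference = ["identity_set", "config_set"]
extra_calls = ["identity_set", "discussion_append", "config_set"]

print(trajectory_matches(extra_calls, reference, "superset"))  # True
print(trajectory_matches(extra_calls, reference, "strict"))    # False
```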

### Scorer-by-scorer specifications

#### 1. Outcome scorer (deterministic, primary)

For each flow, assert post-state SQL queries return the expected counts/values. Example for `02-simple-task`:

```sql
-- expected outcome
SELECT COUNT(*) FROM issues WHERE objective LIKE '%cli todo%';      -- expect 1
SELECT COUNT(*) FROM tasks WHERE issue_id = (SELECT MAX(id) FROM issues);  -- expect 1
SELECT COUNT(*) FROM ledger WHERE event_type = 'planning_complete';  -- expect ≥1
```

If all assertions pass → outcome scorer = pass.
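
A minimal sketch of such a scorer, assuming each assertion is expressed as a `(sql, operator, expected)` triple. That triple format, the `score_outcome` name, and the three tiny stand-in tables are all assumptions for illustration; the real TMB schema has more columns than shown here.

```python
import operator
import sqlite3

OPS = {"==": operator.eq, ">=": operator.ge}

ASSERTIONS = [
    ("SELECT COUNT(*) FROM issues WHERE objective LIKE '%cli todo%'", "==", 1),
    ("SELECT COUNT(*) FROM tasks WHERE issue_id = (SELECT MAX(id) FROM issues)", "==", 1),
    ("SELECT COUNT(*) FROM ledger WHERE event_type = 'planning_complete'", ">=", 1),
]


def score_outcome(conn, assertions):
    """Return (passed, failures); passes only if every assertion holds."""
    failures = []
    for sql, op, expected in assertions:
        (actual,) = conn.execute(sql).fetchone()
        if not OPS[op](actual, expected):
            failures.append(f"{sql} -> {actual}, expected {op} {expected}")
    return (not failures, failures)


# Tiny stand-in for the post-run database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE issues (id INTEGER PRIMARY KEY, objective TEXT);
CREATE TABLE tasks  (id INTEGER PRIMARY KEY, issue_id INTEGER);
CREATE TABLE ledger (id INTEGER PRIMARY KEY, event_type TEXT);
INSERT INTO issues (objective) VALUES ('build a cli todo app');
INSERT INTO tasks  (issue_id) VALUES (1);
INSERT INTO ledger (event_type) VALUES ('planning_complete');
""")
passed, failures = score_outcome(conn, ASSERTIONS)
```

Returning the list of failed assertions alongside the verdict gives the `explanation` column something concrete to store.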

#### 2. Trajectory subset/superset scorer (deterministic)

`tools-required.json` (superset mode: these must appear, in any order):

```json
[
  "mcp__plugin_tmb_trajectory-server__issue_create",
  "mcp__plugin_tmb_trajectory-server__task_create_batch",
  "Task"
]
```

`tools-forbidden.json` (these must NOT appear):

```json
[
  "mcp__plugin_tmb_trajectory-server__validation_record"
]
```

A conditional rule such as "`mcp__plugin_tmb_trajectory-server__task_update_status` with `status='closed'` before completed" depends on arguments and ordering, so it cannot be a bare entry in a tool-name list; it needs an argument-aware or `strict`-mode check.

This is less brittle than strict matching: the scorer does not care if bro makes 3 extra `discussion_append` calls or reorders config probes.
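
The required/forbidden check could look roughly like the following. `score_trajectory` is a hypothetical name, and the fully prefixed `discussion_append` tool name is an assumption made by analogy with the other tool names.

```python
def score_trajectory(called, required, forbidden):
    """Pass when every required tool appears and no forbidden tool does."""
    called_set = set(called)
    missing = sorted(set(required) - called_set)
    violations = sorted(set(forbidden) & called_set)
    passed = not missing and not violations
    return passed, {"missing": missing, "forbidden_called": violations}


PREFIX = "mcp__plugin_tmb_trajectory-server__"
required = [PREFIX + "issue_create", PREFIX + "task_create_batch", "Task"]
forbidden = [PREFIX + "validation_record"]

# Extra discussion_append calls and a different order do not affect the verdict.
called = [
    PREFIX + "discussion_append",
    PREFIX + "issue_create",
    PREFIX + "discussion_append",
    "Task",
    PREFIX + "task_create_batch",
]
passed, detail = score_trajectory(called, required, forbidden)
```

The `detail` dict is meant to be serialized into `eval_results.metadata_json` so a failed run shows which tool was missing or forbidden.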

#### 3. Cost / latency scorer (observability)

```json
{
  "max_tokens_total": 50000,
  "max_latency_ms_p99": 60000,
  "fail_above_max": false,
  "warn_drift_pct": 25
}
```

`cost-budget.json` is a per-flow soft budget. The scorer tracks tokens and latency per flow run. With `fail_above_max: false`, exceeding a cap is logged but does not fail the flow; setting it to `true` turns the caps into hard limits. `warn_drift_pct` raises an alert when usage is more than 25% above the 7-day baseline.
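
A sketch of that logic, under two stated assumptions: a single run's latency is compared directly against `max_latency_ms_p99`, and the 7-day baseline is passed in as a precomputed token count. The function name `score_cost` and the warning strings are illustrative only.

```python
BUDGET = {
    "max_tokens_total": 50000,
    "max_latency_ms_p99": 60000,
    "fail_above_max": False,
    "warn_drift_pct": 25,
}


def score_cost(tokens_total, latency_ms, budget, baseline_tokens=None):
    """Return (passed, warnings) for one flow run against its budget."""
    warnings = []
    if tokens_total > budget["max_tokens_total"]:
        warnings.append(f"tokens {tokens_total} over cap {budget['max_tokens_total']}")
    if latency_ms > budget["max_latency_ms_p99"]:
        warnings.append(f"latency {latency_ms}ms over cap {budget['max_latency_ms_p99']}ms")
    over_cap = bool(warnings)

    # Drift is a warning only; it never fails the run in this sketch.
    if baseline_tokens:
        drift_pct = 100.0 * (tokens_total - baseline_tokens) / baseline_tokens
        if drift_pct > budget["warn_drift_pct"]:
            warnings.append(f"token drift {drift_pct:.0f}% above baseline")

    # Caps only fail the run when the budget opts into hard limits.
    passed = not (over_cap and budget["fail_above_max"])
    return passed, warnings


passed, warnings = score_cost(40000, 30000, BUDGET, baseline_tokens=30000)
```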

#### 4. LLM-as-judge scorer (opt-in, paid)

For flows where prose quality matters (e.g., bro's response to the Human, spec readability):

```markdown

Grade bro's final response on a 1-5 scale:
- 5: Concise, accurate, addresses all parts of the ask
- 1: Verbose, off-topic, or incomplete

Pass threshold: ≥4
```

Each eval costs one extra Claude call, so the token cost is small but not zero. Skip this scorer on regular CI runs to save cost and run it weekly instead.
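
One way to wire the rubric into a pass/fail verdict is sketched below. The judge is injected as a plain callable so the example runs without any API call; the `Score: N` reply convention, `parse_score`, and `score_with_judge` are assumptions for illustration, not an existing TMB or Claude interface.

```python
import re
from typing import Callable, Optional

PASS_THRESHOLD = 4  # from the rubric: pass at 4 or above


def parse_score(reply: str) -> Optional[int]:
    """Extract the first 'Score: N' (N in 1-5) from the judge's reply."""
    match = re.search(r"score\s*:\s*([1-5])\b", reply, re.IGNORECASE)
    return int(match.group(1)) if match else None


def score_with_judge(response: str, rubric: str, judge: Callable[[str], str]):
    """Ask the judge to grade a response; return (passed, explanation)."""
    prompt = f"{rubric}\n\nResponse to grade:\n{response}\n\nReply with 'Score: N'."
    score = parse_score(judge(prompt))
    if score is None:
        # Treat an unparseable reply as a failure rather than guessing.
        return False, "judge reply had no parseable score"
    return score >= PASS_THRESHOLD, f"score={score}"


# Stand-in judge; a real one would send the prompt to a model.
def fake_judge(prompt: str) -> str:
    return "Score: 4. Concise and accurate."


result = score_with_judge("Created 1 issue and 1 task.", "Grade on a 1-5 scale.", fake_judge)
```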

### pass^k consistency (per [survey](https://arxiv.org/html/2507.21504v1) §3.3.1)

Run each flow k=3 times. Report:
- pass@3 (loose: at least one succeeded)
- pass^3 (strict: all three succeeded)

pass^3 is the number that matters for production reliability: a flow that succeeds in only one run out of three still counts as a pass under pass@3.
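
The two metrics reduce to `any` and `all` over the per-run verdicts. The sketch below also shows the expected values when each run succeeds independently with probability `p`; the independence assumption and the function names are illustrative.

```python
def pass_at_k(results):
    """Loose: at least one of the k runs passed."""
    return any(results)


def pass_hat_k(results):
    """Strict: every one of the k runs passed."""
    return all(results)


def expected_pass_at_k(p, k):
    """Probability that at least one of k independent runs passes."""
    return 1 - (1 - p) ** k


def expected_pass_hat_k(p, k):
    """Probability that all k independent runs pass."""
    return p ** k


runs = [True, False, True]  # verdicts from three runs of one flow
print(pass_at_k(runs))   # True
print(pass_hat_k(runs))  # False
```

With a per-run success rate of 0.9 and k=3, the expected pass@3 is about 0.999 while the expected pass^3 is about 0.73, which is why the strict metric exposes flakiness that the loose one hides.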

## Migration from L6 v1

| L6 v1 | v2 equivalent | Status |
|---|---|---|
| `debug_trajectory` table | Same — primary capture | Keep |
| MCP wrapper in `index.ts` | Extend with token + latency capture | Modify |
| `scripts/hooks/debug-trajectory.sh` | Same | Keep |
| `tests/dogfood/expected/<name>.txt` (strict) | `flows/<name>/scorers/*.{sql,json,md}` (multi) | **Convert** |
| `tests/dogfood/run-l6.sh` | Add scorer dispatch logic | **Refactor** |
| 4 wired flows | Each becomes a directory with multi-scorer config | **Convert** |
| 12 scaffolded flows | Author with v2 patterns from the start | **Author** |

## Implementation phases

| Phase | Scope | Estimate | Dependencies |
|---|---|---|---|
| **1. v2 runner + outcome scorer** | Refactor runner to dispatch to scorers; convert 1 flow as proof | 1 day | Blocks Phases 2-5 |
| **2. Trajectory subset/superset scorer** | Replace strict assertions; convert remaining 3 wired flows | 1 day | Blocks Phase 4 |
| **3. Cost tracking** | Schema columns + capture wiring + cost scorer | 1 day | Independent |
| **4. Author 12 scaffolded flows** | One PR per 3-flow batch, all with v2 patterns | 4 days | Depends on Phase 2 |
| **5. LLM-as-judge + pass^k** | Optional scorers + consistency runs | 2 days | Independent |

Total: ~9 days of work, parallelizable. Phase 1+2+3 can land as one PR (~3 days).

## Why this is a load-bearing investment

- **No false failures** on cosmetic trajectory differences (per Anthropic's warning)
- **Catches what matters** (final state, not path)
- **Industry-standard structure** — easier to onboard contributors, port to Inspect AI later
- **Self-monitoring** — cost/latency drift alerts before they become problems
- **Makes #45 (codebase memory) safer to ship** — any future feature gets the eval safety net

User said it: *"It can save all of the testing process for all following works."* Investing here means every future PR is safer.

## Citations (all from web research, not memory)

- [Anthropic Engineering: Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) — outcome-first doctrine + warning against strict-step matching + pass@k vs pass^k
- [Anthropic: Writing tools for agents](https://www.anthropic.com/engineering/writing-tools-for-agents) — eval-driven tool design
- [LangChain Docs: Trajectory evaluations](https://docs.langchain.com/langsmith/trajectory-evals) — 4 trajectory match modes (strict / unordered / subset / superset)
- [LangChain agentevals (GitHub)](https://github.com/langchain-ai/agentevals) — `create_trajectory_match_evaluator` + `create_trajectory_llm_as_judge` reference implementations
- [Inspect AI (UK AISI)](https://inspect.aisi.org.uk/) — Dataset/Solver/Scorer/Task primitives; supports Claude Code as agent
- [Inspect AI Scorers documentation](https://inspect.aisi.org.uk/scorers.html) — built-in scorers (`includes`, `match`, `model_graded_qa`, etc.) + custom scorer pattern
- [Hamel Husain on Inspect AI](https://hamel.dev/notes/llm/evals/inspect.html) — "adopted by Anthropic, DeepMind, and Grok"
- [arxiv 2507.21504 — Evaluation and Benchmarking of LLM Agents: A Survey](https://arxiv.org/html/2507.21504v1) — taxonomy: behavior / capabilities / reliability / safety dimensions; pass@k + pass^k metrics
- [arxiv 2510.04550 — TRAJECT-Bench](https://arxiv.org/abs/2510.04550) — trajectory-aware tool-use benchmark with breadth/depth axes
- [AWS: Evaluating AI agents at Amazon](https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/) — production lessons: holistic multi-dim, app-specific metrics, HITL is essential, continuous monitoring
- [Google Cloud: A methodical approach to agent evaluation](https://cloud.google.com/blog/topics/developers-practitioners/a-methodical-approach-to-agent-evaluation) — three-layer eval architecture
- [orq.ai: Agent Evaluation in 2025 Complete Guide](https://orq.ai/blog/agent-evaluation) — glass-box trajectory + white-box single-step + outcome layering

## Acceptance criteria

- [ ] Schema migration: `debug_trajectory` adds (tokens_in, tokens_out, latency_ms); new `eval_results` table
- [ ] Runner refactor: `tests/dogfood/run-l6.sh` dispatches to per-flow scorer config
- [ ] All 4 wired flows converted to v2 multi-scorer format (no strict-trajectory assertions remain)
- [ ] At least 3 of the 12 scaffolded flows authored with full v2 (outcome + trajectory)
- [ ] Cost scorer reports per-flow tokens + latency to `eval_results` table
- [ ] CI workflow updated: still runs on tag/PR-label, but reports cost drift in PR comments
- [ ] `tests/README.md` rewritten — L6 = "industry-standard agentic evals" (cite sources)
- [ ] One example LLM-as-judge scorer (opt-in via per-flow config)
- [ ] pass^k (k=3) consistency report on at least one flow

## Out of scope (separate follow-ups if/when needed)

- Full Inspect AI adoption (the framework itself) — could come later if we want their visualization tools
- Replay-from-production (re-running captured production sessions on new model versions)
- Adversarial prompt generation
- Multi-platform eval matrix (macOS / Windows runners)
