L6 v1 (PR #109) uses strict trajectory matching — assert the agent calls exactly N specific tools in this exact order. Web research (citations below) found this is exactly what Anthropic explicitly warns against as too brittle, and the industry has converged on better patterns. We need v2 before scaling L6 to 12+ scaffolded flows.
User direction (2026-04-26): "L6 PR already merged. debug_tmb is not industry standards. We can abandon it and follow industry standards. Make our debug mode comprehensive and robust... It can save all of the testing process for all following works."
The problem with L6 v1 — per Anthropic
"There is a common instinct to check that agents followed very specific steps like a sequence of tool calls in the right order. We've found this approach too brittle and results in overly brittle tests, as agents regularly find valid approaches that eval designers didn't anticipate."
"Grade what the agent produced, not the path it took."
Inspect AI is the de facto industry standard — Anthropic, DeepMind, and other safety orgs use it (per Hamel Husain). It defines four primitives we should adopt:
Dataset: (input, target) pairs. For TMB: (prompt, fixture_pre.sql, expected_state.sql, expected_tools.json)
Solver: produces output for input. For TMB: claude -p run with TMB_DEBUG_TRAJECTORY=1 (already built in L6 v1)
Scorers: grade the output (multiple per task). For TMB: 4 scorers per flow (outcome / trajectory / cost / optional LLM-judge)
Task: ties dataset + solver + scorers together
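The four primitives above can be sketched in plain Python to show how they compose. This is a minimal illustration, not the Inspect AI API: `Sample`, `Score`, `Task`, `run_task`, and the stub solver are all hypothetical names, and the stub stands in for the real claude -p run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Sample:
    """One dataset entry: the prompt plus the targets the scorers compare against."""
    prompt: str
    target: Dict[str, object] = field(default_factory=dict)


@dataclass
class Score:
    scorer_name: str
    passed: bool
    explanation: str = ""


# A solver maps a sample to an output; a scorer grades that output.
Solver = Callable[[Sample], Dict[str, object]]
Scorer = Callable[[Sample, Dict[str, object]], Score]


@dataclass
class Task:
    """Ties dataset + solver + scorers together."""
    dataset: List[Sample]
    solver: Solver
    scorers: List[Scorer]


def run_task(task: Task) -> List[Score]:
    results = []
    for sample in task.dataset:
        output = task.solver(sample)
        for scorer in task.scorers:
            results.append(scorer(sample, output))
    return results


def stub_solver(sample: Sample) -> Dict[str, object]:
    # Hypothetical stand-in for the real agent run; returns a canned trajectory.
    return {"tools_called": ["issue_create", "task_create_batch"]}


def required_tools_scorer(sample: Sample, output: Dict[str, object]) -> Score:
    missing = set(sample.target["expected_tools"]) - set(output["tools_called"])
    return Score("trajectory", not missing, f"missing: {sorted(missing)}")


task = Task(
    dataset=[Sample("add a cli todo", {"expected_tools": ["issue_create"]})],
    solver=stub_solver,
    scorers=[required_tools_scorer],
)
scores = run_task(task)
```

Because scorers are independent callables, adding an outcome or cost scorer to a flow only means appending to `scorers`; the solver run is shared.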
Concrete schema additions
-- Cost + latency tracking on existing trajectory rows
ALTER TABLE debug_trajectory ADD COLUMN tokens_in INTEGER DEFAULT 0;
ALTER TABLE debug_trajectory ADD COLUMN tokens_out INTEGER DEFAULT 0;
ALTER TABLE debug_trajectory ADD COLUMN latency_ms INTEGER DEFAULT 0;

-- Per-scorer results, one row per (flow, scorer) per run
CREATE TABLE IF NOT EXISTS eval_results (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  run_id TEXT NOT NULL, -- groups all scorers for one flow run
  flow_name TEXT NOT NULL,
  scorer_name TEXT NOT NULL, -- 'outcome' | 'trajectory_subset' | 'cost' | 'llm_judge'
  pass INTEGER NOT NULL, -- 1 = pass, 0 = fail
  value TEXT, -- numeric or categorical
  explanation TEXT, -- why pass/fail
  metadata_json TEXT NOT NULL DEFAULT '{}',
  created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX idx_eval_results_run ON eval_results(run_id, scorer_name);
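One way to apply these schema additions is sketched below with Python's sqlite3 module. It assumes the database is SQLite (suggested by AUTOINCREMENT and datetime('now')). The stand-in `debug_trajectory` table, the helper names, and the `IF NOT EXISTS` on the index are assumptions made so the migration can be re-run safely; SQLite has no `ADD COLUMN IF NOT EXISTS`, so the sketch checks `PRAGMA table_info` first.

```python
import sqlite3
import uuid

EVAL_RESULTS_DDL = """
CREATE TABLE IF NOT EXISTS eval_results (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  run_id TEXT NOT NULL,
  flow_name TEXT NOT NULL,
  scorer_name TEXT NOT NULL,
  pass INTEGER NOT NULL,
  value TEXT,
  explanation TEXT,
  metadata_json TEXT NOT NULL DEFAULT '{}',
  created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_eval_results_run ON eval_results(run_id, scorer_name);
"""


def add_column_if_missing(conn, table, column, decl):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")


def migrate(conn):
    for column in ("tokens_in", "tokens_out", "latency_ms"):
        add_column_if_missing(conn, "debug_trajectory", column, "INTEGER DEFAULT 0")
    conn.executescript(EVAL_RESULTS_DDL)


def record_result(conn, run_id, flow_name, scorer_name, passed, value=None, explanation=None):
    conn.execute(
        "INSERT INTO eval_results (run_id, flow_name, scorer_name, pass, value, explanation) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (run_id, flow_name, scorer_name, int(passed), value, explanation),
    )


conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the existing L6 v1 table; real columns may differ.
conn.execute("CREATE TABLE debug_trajectory (id INTEGER PRIMARY KEY, tool_name TEXT)")
migrate(conn)
migrate(conn)  # running the migration twice is a no-op

run_id = str(uuid.uuid4())
record_result(conn, run_id, "02-simple-task", "outcome", True, "3/3", "all assertions held")
record_result(conn, run_id, "02-simple-task", "cost", False, "61000", "latency above budget")
failed = conn.execute(
    "SELECT scorer_name FROM eval_results WHERE run_id = ? AND pass = 0", (run_id,)
).fetchall()
columns = {row[1] for row in conn.execute("PRAGMA table_info(debug_trajectory)")}
```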
tests/dogfood/flows/02-simple-task/
├── prompt.txt # the user's ask
├── fixture-pre.sql # DB state before claude runs
├── scorers/
│ ├── outcome.sql # assertions on final DB state
│ ├── tools-required.json # superset: these MUST be called (any order)
│ ├── tools-forbidden.json # never-called list
│ ├── cost-budget.json # max tokens + max latency_ms
│ └── llm-judge-rubric.md # optional, opt-in per flow
└── README.md # what this flow tests
Trajectory match modes — adopt LangSmith's 4 modes
| Mode | Use case |
| --- | --- |
| subset | Agent calls at most the listed tools (efficiency check) |
| superset (default for us) | Agent calls at least the listed tools (minimum-required check) |
| unordered | Same tools, any order |
| strict | Same tools, same order; only use for hard sequential constraints like "config_set must come after identity_set in onboarding" |
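The four modes can be expressed as one small comparison function. This is a sketch under simplifying assumptions, not the LangSmith implementation: it compares tool names only, ignores arguments, and treats subset and superset as set comparisons that ignore repeat counts. The function name and the shortened tool names are hypothetical.

```python
from collections import Counter
from typing import List


def trajectory_matches(actual: List[str], reference: List[str], mode: str = "superset") -> bool:
    """Compare the tools an agent actually called against a reference list."""
    if mode == "strict":
        # Same tools, same order.
        return actual == reference
    if mode == "unordered":
        # Same tools with the same call counts, any order.
        return Counter(actual) == Counter(reference)
    if mode == "subset":
        # Agent called at most the listed tools.
        return set(actual) <= set(reference)
    if mode == "superset":
        # Agent called at least the listed tools.
        return set(actual) >= set(reference)
    raise ValueError(f"unknown match mode: {mode}")


actual = ["config_get", "issue_create", "discussion_append", "task_create_batch"]
reference = ["issue_create", "task_create_batch"]

superset_ok = trajectory_matches(actual, reference, "superset")
strict_ok = trajectory_matches(actual, reference, "strict")
```

With this trajectory, superset mode passes while strict mode fails, which is the brittleness the default is meant to avoid: the extra config probe and discussion row do not change the verdict.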
Scorer-by-scorer specifications
1. Outcome scorer (deterministic, primary)
For each flow, assert post-state SQL queries return the expected counts/values. Example for 02-simple-task:
-- expected outcome
SELECT COUNT(*) FROM issues WHERE objective LIKE '%cli todo%'; -- expect 1
SELECT COUNT(*) FROM tasks WHERE issue_id = (SELECT MAX(id) FROM issues); -- expect 1
SELECT COUNT(*) FROM ledger WHERE event_type = 'planning_complete'; -- expect ≥1
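A deterministic outcome scorer along these lines could pair each query with a comparison and an expected value, then run them against the post-run database. The sketch below assumes SQLite and uses hypothetical minimal table definitions that contain only the columns the three queries touch; the `(sql, compare, expected)` tuple format is also an assumption, since the text does not specify how outcome.sql encodes its expectations.

```python
import operator
import sqlite3

# (query, comparison, expected) triples mirroring the three assertions above.
ASSERTIONS = [
    ("SELECT COUNT(*) FROM issues WHERE objective LIKE '%cli todo%'", operator.eq, 1),
    ("SELECT COUNT(*) FROM tasks WHERE issue_id = (SELECT MAX(id) FROM issues)", operator.eq, 1),
    ("SELECT COUNT(*) FROM ledger WHERE event_type = 'planning_complete'", operator.ge, 1),
]

# Hypothetical minimal schema: only the columns the assertions read.
SCHEMA = """
CREATE TABLE issues (id INTEGER PRIMARY KEY, objective TEXT);
CREATE TABLE tasks (id INTEGER PRIMARY KEY, issue_id INTEGER);
CREATE TABLE ledger (id INTEGER PRIMARY KEY, event_type TEXT);
"""


def score_outcome(conn, assertions):
    """Return (passed, failures); passes only if every assertion holds."""
    failures = []
    for sql, compare, expected in assertions:
        actual = conn.execute(sql).fetchone()[0]
        if not compare(actual, expected):
            failures.append(f"{sql} -> {actual} (wanted {compare.__name__} {expected})")
    return (not failures, failures)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Simulated post-run state for the 02-simple-task flow.
conn.executescript("""
INSERT INTO issues (objective) VALUES ('build a cli todo app');
INSERT INTO tasks (issue_id) VALUES (1);
INSERT INTO ledger (event_type) VALUES ('planning_complete');
""")
passed, failures = score_outcome(conn, ASSERTIONS)
```

Collecting every failure rather than stopping at the first gives the `explanation` column something useful to store when a run fails.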
// tools-required.json (superset mode — these must appear)
[
"mcp__plugin_tmb_trajectory-server__issue_create",
"mcp__plugin_tmb_trajectory-server__task_create_batch",
"Task"
]
// tools-forbidden.json (must NOT appear)
[
"mcp__plugin_tmb_trajectory-server__validation_record",
"mcp__plugin_tmb_trajectory-server__task_update_status with status='closed' before completed"
]
Less brittle: doesn't care if bro calls 3 extra discussion_append rows or reorders config probes.
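A sketch of the required/forbidden check follows. It assumes the JSON files keep the `//` comment lines shown above, which standard JSON parsers reject, so the loader strips whole-line comments first. The function names are hypothetical, and the fully prefixed `discussion_append` tool name is a guess at the naming pattern.

```python
import json
from typing import Dict, List

PREFIX = "mcp__plugin_tmb_trajectory-server__"


def load_tool_list(text: str) -> List[str]:
    """Parse a JSON array after dropping whole-line // comments."""
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith("//")]
    return json.loads("\n".join(lines))


def score_tools(called: List[str], required: List[str], forbidden: List[str]) -> Dict[str, object]:
    """Pass when every required tool appears (any order) and no forbidden tool does."""
    missing = [t for t in required if t not in called]
    violations = sorted({t for t in called if t in forbidden})
    return {
        "pass": not missing and not violations,
        "missing": missing,
        "forbidden_called": violations,
    }


REQUIRED = load_tool_list(f"""// tools-required.json (superset mode)
[
  "{PREFIX}issue_create",
  "{PREFIX}task_create_batch",
  "Task"
]""")
FORBIDDEN = load_tool_list(f'["{PREFIX}validation_record"]')

called = [
    PREFIX + "issue_create",
    PREFIX + "discussion_append",  # extra call, tolerated in superset mode
    PREFIX + "task_create_batch",
    "Task",
]
result = score_tools(called, REQUIRED, FORBIDDEN)
```

Exact-name matching cannot express a conditional entry such as a status update that is forbidden only before completion; a rule like that would need a richer format than a list of tool names.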
3. Cost / latency scorer (observability)
// cost-budget.json — per-flow soft budget
{
"max_tokens_total": 50000,
"max_latency_ms_p99": 60000,
"fail_above_max": false, // soft signal: log but don't fail
"warn_drift_pct": 25 // alert if >25% above 7-day baseline
}
Tracks tokens and latency per flow run; fails only if hard cap exceeded; warns on drift.
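The budget logic might look like the sketch below. Several details are assumptions: the 7-day baseline is passed in by the caller rather than computed, drift is measured on tokens only, and the `max_latency_ms_p99` cap is applied to a single run's latency for simplicity. The function name is hypothetical.

```python
from typing import Dict, List, Optional

BUDGET = {
    "max_tokens_total": 50000,
    "max_latency_ms_p99": 60000,
    "fail_above_max": False,  # soft signal: log but don't fail
    "warn_drift_pct": 25,     # alert if >25% above baseline
}


def score_cost(
    tokens_total: int,
    latency_ms: int,
    budget: Dict[str, object],
    baseline_tokens: Optional[float] = None,
) -> Dict[str, object]:
    warnings: List[str] = []
    over_cap = False

    if tokens_total > budget["max_tokens_total"]:
        over_cap = True
        warnings.append(f"tokens {tokens_total} > cap {budget['max_tokens_total']}")
    if latency_ms > budget["max_latency_ms_p99"]:
        over_cap = True
        warnings.append(f"latency {latency_ms}ms > cap {budget['max_latency_ms_p99']}ms")

    # Drift check against a caller-supplied baseline (assumed 7-day average).
    if baseline_tokens:
        drift_pct = (tokens_total - baseline_tokens) / baseline_tokens * 100
        if drift_pct > budget["warn_drift_pct"]:
            warnings.append(f"tokens {drift_pct:.0f}% above baseline")

    # A cap breach fails the scorer only when fail_above_max is set.
    passed = not (over_cap and budget.get("fail_above_max", False))
    return {"pass": passed, "warnings": warnings}


soft = score_cost(40000, 30000, BUDGET, baseline_tokens=30000)
hard = score_cost(60000, 1000, {**BUDGET, "fail_above_max": True})
```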
4. LLM-as-judge scorer (opt-in, paid)
For flows where prose quality matters (e.g., bro's response to the Human, spec readability):
<!-- llm-judge-rubric.md -->
Grade bro's final response on a 1-5 scale:
- 5: Concise, accurate, addresses all parts of the ask
- 1: Verbose, off-topic, or incomplete
Pass threshold: ≥4
Costs a tiny amount of tokens per eval (one extra Claude call). Skip on CI runs to save cost; run weekly.
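The judge scorer can be kept testable by injecting the model call as a plain function. In this sketch the `judge` callable, the prompt wording, and the grade-parsing rule (first standalone digit from 1 to 5 in the reply) are all assumptions; a real run would replace `stub_judge` with the extra Claude call.

```python
import re
from typing import Callable, Dict

PASS_THRESHOLD = 4  # from the rubric: pass at 4 or above

RUBRIC = """Grade bro's final response on a 1-5 scale:
- 5: Concise, accurate, addresses all parts of the ask
- 1: Verbose, off-topic, or incomplete"""


def score_with_judge(response: str, rubric: str, judge: Callable[[str], str]) -> Dict[str, object]:
    prompt = f"{rubric}\n\nResponse to grade:\n{response}\n\nReply with the grade first."
    reply = judge(prompt)
    # Take the first standalone digit 1-5 in the reply as the grade.
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        return {"pass": False, "value": None, "explanation": "no 1-5 grade found in judge reply"}
    grade = int(match.group(1))
    return {"pass": grade >= PASS_THRESHOLD, "value": grade, "explanation": reply}


def stub_judge(prompt: str) -> str:
    # Hypothetical stand-in for the paid model call.
    return "4 - concise and accurate, minor omission"


verdict = score_with_judge("Created issue #12 with one task.", RUBRIC, stub_judge)
```

Treating an unparseable reply as a failure, rather than raising, keeps one malformed judge answer from aborting the whole flow run.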
Why this issue
The problem with L6 v1 — per Anthropic
— Anthropic Engineering: Demystifying evals for AI agents
L6 v1's expected/<flow>.txt strict match falls into this trap.
Industry standard (synthesized from 8 sources; see Citations)
Three-tier eval model:
Proposed v2 architecture (Inspect AI primitives)
Per-flow file layout (replaces L6 v1's expected/<name>.txt)
If all assertions pass → outcome scorer = pass.
2. Trajectory subset/superset scorer (deterministic)
pass^k consistency (per survey §3.3.1)
Run each flow N=3 times. Report:
Strict consistency matters for production reliability.
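The reported metrics are not listed in this text, so the sketch below assumes the usual pair: pass@k (at least one of the k runs passes) and pass^k (all k runs pass), plus the raw pass rate. The function and key names are hypothetical.

```python
from typing import Dict, List


def consistency_report(outcomes: List[bool]) -> Dict[str, object]:
    """Summarize k repeated runs of one flow (k = len(outcomes))."""
    if not outcomes:
        raise ValueError("need at least one run")
    k = len(outcomes)
    return {
        "runs": k,
        "pass_rate": sum(outcomes) / k,
        "pass_at_k": any(outcomes),   # lenient: some run succeeded
        "pass_hat_k": all(outcomes),  # strict: every run succeeded
    }


# Three runs of one flow where the last run failed.
report = consistency_report([True, True, False])
```

A flow that passes two runs out of three satisfies pass@k but not pass^k, which is the gap the strict metric is meant to expose.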
Migration from L6 v1
debug_trajectory table
index.ts
scripts/hooks/debug-trajectory.sh
tests/dogfood/expected/<name>.txt (strict) → flows/<name>/scorers/*.{sql,json,md} (multi)
tests/dogfood/run-l6.sh
Implementation phases
Total: ~9 days of work, parallelizable. Phase 1+2+3 can land as one PR (~3 days).
Why this is a load-bearing investment
User said it: "It can save all of the testing process for all following works." Investing here means every future PR is safer.
Citations (all from web research, not memory)
create_trajectory_match_evaluator + create_trajectory_llm_as_judge reference implementations
(includes, match, model_graded_qa, etc.) + custom scorer pattern
Acceptance criteria
debug_trajectory adds (tokens_in, tokens_out, latency_ms); new eval_results table
tests/dogfood/run-l6.sh dispatches to per-flow scorer config
eval_results table
tests/README.md rewritten: L6 = "industry-standard agentic evals" (cite sources)
Out of scope (separate follow-ups if/when needed)