🧪 feat(tests): L6 deterministic-trajectory dogfood + opt-in capture (#108) by ZaxShen · Pull Request #109 · trustmybot/plugin

ZaxShen · 2026-04-26T09:13:52Z

Closes #108.

Summary

Manual L5 dogfood was the release bottleneck. L6 automates it:
pre-seed DB → run real claude -p → assert MCP/tool sequence matches FLOWS.md.

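The trajectory capture is opt-in (TMB_DEBUG_TRAJECTORY=1). With the variable unset, nothing is written to the database, and the only cost to production users is the capture hook's immediate exit (~1ms per tool call).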

What landed

Schema (additive — schema_version stays at 1)

New debug_trajectory table (15th in the DB), populated only when TMB_DEBUG_TRAJECTORY=1 is set. Indexed on (session_id, step_n) for fast L6 reads.
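As a rough illustration of what an additive, indexed capture table could look like, here is a minimal sketch. It assumes the DB is SQLite. Only the table name and the indexed columns (session_id, step_n) come from this PR; the function name, the index name, and every other column are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: emits hypothetical DDL for the opt-in capture table.
# From the PR: table name debug_trajectory, index on (session_id, step_n).
# Assumed for illustration: all other columns, the index name, SQLite syntax.
l6_debug_trajectory_ddl() {
  cat <<'SQL'
CREATE TABLE IF NOT EXISTS debug_trajectory (
  id         INTEGER PRIMARY KEY,
  session_id TEXT    NOT NULL,
  step_n     INTEGER NOT NULL,
  source     TEXT    NOT NULL,  -- assumed: 'mcp' or 'hook'
  tool_name  TEXT    NOT NULL,  -- assumed column
  created_at TEXT    NOT NULL DEFAULT (datetime('now'))  -- assumed column
);
CREATE INDEX IF NOT EXISTS idx_debug_trajectory_session_step
  ON debug_trajectory (session_id, step_n);
SQL
}
```

Because both statements use IF NOT EXISTS, piping the output into a scratch database (for example `l6_debug_trajectory_ddl | sqlite3 scratch.db`) is repeatable and leaves existing tables alone, which is the property that lets schema_version stay at 1.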

Capture wiring

| Source | Where |
| --- | --- |
| MCP tool calls | `mcp/trajectory-server/src/index.ts` (wraps the `CallToolRequestSchema` handler) |
| Non-MCP tool calls (Bash, Read, Write, Edit, Task, Skill) | New PreToolUse hook `scripts/hooks/debug-trajectory.sh`, registered in `hooks/hooks.json` with `matcher: "*"` |

Both are fail-soft: any capture error is swallowed, so the actual tool call never breaks.
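The fail-soft behavior can be sketched in a few lines of shell. This is not the contents of `scripts/hooks/debug-trajectory.sh`; it is a minimal illustration of the pattern. TMB_DEBUG_TRAJECTORY is from this PR, while `l6_capture_tool_call`, `l6_write_row`, and `TMB_TRAJECTORY_LOG` are hypothetical names, and a flat log file stands in for the real table.

```shell
#!/usr/bin/env bash
# Sketch of a fail-soft capture step (hypothetical helper names throughout).

# Hypothetical writer: appends one line per tool call to a log file.
# It fails if TMB_TRAJECTORY_LOG is unset or the file is not writable.
l6_write_row() {
  printf '%s\n' "$1" >> "${TMB_TRAJECTORY_LOG:?log path not set}"
}

l6_capture_tool_call() {
  # Fast path: capture is off unless the variable is exactly "1".
  [ "${TMB_DEBUG_TRAJECTORY:-}" = "1" ] || return 0
  # Fail-soft: run the writer in a subshell and discard any failure,
  # so a broken capture path can never fail the surrounding tool call.
  ( l6_write_row "$1" ) 2>/dev/null || true
  return 0
}
```

The subshell matters here: an error inside the writer (including the `:?` expansion aborting) terminates only the subshell, and the caller always sees exit status 0.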

Test infrastructure

tests/dogfood/
├── run-l6.sh                    ← runner (env + tools check, dispatch)
├── lib/flow-helpers.sh          ← l6_setup_scratch_project, l6_seed_db,
│                                  l6_run_claude, l6_assert_trajectory
├── flows/                       ← 16 flow scripts
├── fixtures/                    ← 3 pre-seed SQL files
└── expected/                    ← expected-trajectory files

4 fully wired flows

| Flow | Asserts |
| --- | --- |
| `01-onboarding` | identity_get + config_get probes → identity_set + 3x config_set + ledger_log(tmb_onboarding_complete) |
| `02-simple-task` | bro detects code-touching ask → triage simple → issue_create + task_create_batch + Task spawn + planning_complete |
| `D-direct-mode` | ≤3-line typo → Edit + commit + direct_mode_used. Hard invariants: NO task_create_batch, NO Task spawn, exactly one direct_mode_used event |
| `95-anonymous-cold-restart` | Cold session with Anonymous identity must skip re-onboarding. Hard invariants: NO identity_set, NO config_set after the cold restart. Locks #95 regression. |
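The hard invariants in the table (a call must be absent, or an event must occur exactly once) could be checked with helpers along these lines. This is a sketch, not the `l6_assert_trajectory` implementation: the helper names are hypothetical, and it assumes the captured trajectory has already been dumped to a file with one tool or event name per line.

```shell
#!/usr/bin/env bash
# Sketch: invariant checks over a trajectory file (one name per line).
# All function names here are hypothetical.

# Count exact, whole-line occurrences of a name.
l6_count_calls() {
  # grep -c prints 0 but exits 1 when nothing matches; keep the count only.
  grep -c -x -F -- "$2" "$1" || true
}

# Fail if the name appears at all (e.g. NO task_create_batch).
l6_assert_absent() {
  if [ "$(l6_count_calls "$1" "$2")" -ne 0 ]; then
    echo "FAIL: forbidden call present: $2" >&2
    return 1
  fi
}

# Fail unless the name appears exactly N times (e.g. one direct_mode_used).
l6_assert_exactly() {
  local got
  got="$(l6_count_calls "$1" "$2")"
  if [ "$got" -ne "$3" ]; then
    echo "FAIL: expected $3 x $2, got $got" >&2
    return 1
  fi
}
```

Whole-line matching (`-x`) is deliberate: without it, a check for `Task` would also match `task_create_batch`-style names on case-insensitive setups or any name containing the substring.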

12 scaffolded flows (auto-skip until expected-trajectory authored)

03-difficult-task, 04-agent-creator, 05-skill-creation, 06-push-gate, 07-architecture-regen, 08-swe-retry, 09-roundtable, C-consultant, 32-team-config, 92-base-branch, 94-arch-bootstrap, 96-halt-on-error.

Each is ~30 lines of pattern-match copy/paste once the expected-trajectory file is authored.
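The auto-skip behavior for scaffolded flows might look like the guard below. It is a sketch under assumptions: the function name is hypothetical, and "authored" is taken to mean that the expected-trajectory file exists and is non-empty.

```shell
#!/usr/bin/env bash
# Sketch: guard a scaffolded flow until its expected file is authored.
# l6_expected_ready is a hypothetical name, not a helper from this PR.
l6_expected_ready() {
  if [ ! -s "$1" ]; then
    # Missing or empty file: report a skip and tell the caller to stop.
    echo "SKIP: no expected trajectory at $1"
    return 1
  fi
  return 0
}
```

A scaffolded flow script would then start with something like `l6_expected_ready "tests/dogfood/expected/03-difficult-task.txt" || exit 0`, so the runner treats an unauthored flow as skipped rather than failed.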

CI

.github/workflows/l6-dogfood.yml triggers on:

- Tag pushes (every release gets a green/red signal)
- PRs labeled L6 (opt-in for risky doctrine changes)
- Manual workflow_dispatch

Soft-fails when the CLAUDE_CODE_OAUTH_TOKEN secret is absent, so forks and external PRs don't go red. Uploads trajectory dumps as artifacts on failure for debugging.
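One way to express the soft-fail inside a workflow step is a small token gate. This is a sketch, not the contents of l6-dogfood.yml: the function name is hypothetical, and it assumes the secret is exposed to the step as an environment variable of the same name.

```shell
#!/usr/bin/env bash
# Sketch: skip (rather than fail) the L6 run when the secret is absent.
# l6_token_present is a hypothetical name.
l6_token_present() {
  if [ -z "${CLAUDE_CODE_OAUTH_TOKEN:-}" ]; then
    # ::notice:: is a GitHub Actions workflow command that surfaces
    # the message as an annotation on the run.
    echo "::notice::L6 skipped: CLAUDE_CODE_OAUTH_TOKEN is not set"
    return 1
  fi
  return 0
}
```

A step could then run `l6_token_present || exit 0` before `bash tests/dogfood/run-l6.sh`, which exits green with a visible notice on forks instead of failing the check.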

Stale doctrine cleanup (audit found 3 issues)

- skills/tmb_first-run-onboarding/SKILL.md: changed the stale event_type tmb_bootstrap_complete to tmb_onboarding_complete and dropped the "file copies" reference (swe + pr-reviewer ship globally now)
- skills/tmb_agent-creator/SKILL.md: dropped the tmb_bootstrap reference (the skill is gone in v0.3.0+)
- CLAUDE.md: removed the "tmb_bootstrap is being retired" sentence, since it is already retired

Unverified assumption (flagged for follow-up)

How AskUserQuestion behaves in `claude -p` (headless) mode is unknown. If the form auto-fails in headless mode, the 01-onboarding flow's trajectory will be shorter than expected. That would be a real signal to address: it could be a CC bug, or it could call for a doctrine adjustment for headless mode.

The L6 design pre-seeds DB to skip past forms for every other flow. Only 01-onboarding actually exercises the form path.

Test plan

- L1 lint passes (16 lints)
- L2 unit passes (245 + 2 new schema tests = 247)
- L3 hooks pass
- L4 workflow-sim passes
- Build green; dist/ committed up-to-date
- L0 install-smoke (CI Docker)
- L6 itself: runs in CI on this PR if labeled L6, or run locally with `bash tests/dogfood/run-l6.sh`

Risk profile

- Schema change is additive and opt-in: no existing table changes, and nothing is written unless TMB_DEBUG_TRAJECTORY=1.
- MCP server wrapper is null-safe and try-wrapped, so it never breaks the real tool call.
- New hook is a PreToolUse hook with matcher: "*". It fires on every tool call but exits 0 immediately when the env var is unset (~1ms overhead).
- All L1-L4 still pass.

Follow-ups (after merge)

- Wake-up validation: apply the L6 label to this PR (or trigger workflow_dispatch) so CI runs the 4 wired flows for the first time. If `claude -p` behavior surprises us, file a follow-up issue.
- Fill in the 12 scaffold expected-trajectories: each is a one-line trajectory dump after a single test run with TMB_DEBUG_TRAJECTORY=1.
- Decide whether L6 should run on every PR to dev. It is currently opt-in via the L6 label and could expand based on token-cost data after a few runs.

🤖 Generated with Claude Code

…108)

Manual L5 dogfood was the release bottleneck. L6 automates it: pre-seed DB → run real `claude -p` → assert MCP/tool sequence matches the expected from FLOWS.md.

## What's new

**Schema**: `debug_trajectory` table (15th table). Off by default, populated only when env `TMB_DEBUG_TRAJECTORY=1`. Zero overhead in production. Schema version stays at 1 (additive).

**Capture**:
- MCP server writes a row per MCP call when env is set (src/index.ts wrapper)
- New PreToolUse hook `scripts/hooks/debug-trajectory.sh` (matcher: "*") writes rows for non-MCP calls (Bash, Read, Write, Edit, Task, Skill)

**Test infra** at `tests/dogfood/`:
- `run-l6.sh` runner
- `lib/flow-helpers.sh` shared helpers
- 16 flow scripts in `flows/` (4 fully wired, 12 scaffolded)
- `fixtures/` pre-seed SQL (empty, onboarding-named, onboarding-anonymous)
- `expected/` expected-trajectory files

**4 wired flows**: 01-onboarding, 02-simple-task, D-direct-mode (with hard invariants on no-task-spawn + direct_mode_used event), 95-anonymous-cold-restart (with assertions that no re-onboarding writes happen; locks #95 regression).

**12 scaffolded flows** auto-skip until their expected-trajectory file is authored. Pattern is copy/paste; each follow-up is ~30 lines.

**CI** at `.github/workflows/l6-dogfood.yml`: triggers on tag pushes, PRs labeled `L6`, manual dispatch. Soft-fails when CLAUDE_CODE_OAUTH_TOKEN secret is absent (forks won't break). Uploads trajectory dumps on failure.

**Stale doctrine cleanup** (audit):
- Onboarding skill: fixed `tmb_bootstrap_complete` → `tmb_onboarding_complete`
- Agent-creator skill: dropped tmb_bootstrap ref (skill is gone)
- Plugin CLAUDE.md: removed retirement-in-progress note for tmb_bootstrap

## Unverified assumption (flagged in #108)

`claude -p` mode behavior with AskUserQuestion. If form auto-fails in headless mode, the onboarding flow trajectory is shorter than expected; that's a real signal to file as follow-up.

## Tests

2 new schema unit tests (table presence + columns + index). All L1-L4 green. L0 will run in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When authoring an expected-trajectory file for a new flow, the workflow is: run the flow once with TMB_DEBUG_TRAJECTORY=1, then read the debug_trajectory table. This script does step 2 cleanly: it finds the right DB (handles channel isolation) and prints in the L6-expected format, ready to paste into tests/dogfood/expected/<flow>.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
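The "read the table and print it" step might be sketched as a small filter. The real script and the L6-expected format are not shown in this PR, so everything here is an assumption: the function name is hypothetical, the input is taken to be `step_n|tool_name` rows (the sqlite3 CLI's default pipe-separated output), the `tool_name` column is assumed, and the output is assumed to be one name per line in step order.

```shell
#!/usr/bin/env bash
# Sketch: turn "step_n|tool_name" rows into an ordered one-name-per-line dump.
# Hypothetical name, hypothetical output format.
# Assumed usage, with an assumed tool_name column:
#   sqlite3 "$db" "SELECT step_n, tool_name FROM debug_trajectory
#                  WHERE session_id = '$sid';" | l6_format_trajectory
l6_format_trajectory() {
  # Sort numerically on the step column, then keep only the name column.
  sort -t '|' -k1,1n | cut -d '|' -f2
}
```

Sorting in the filter rather than relying on row order keeps the dump stable even if rows from the MCP wrapper and the hook were inserted out of step order.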

Replaces L6 v1's brittle strict-trajectory matching with the industry-standard multi-scorer pattern (Inspect AI / AgentEvals / Anthropic doctrine).

## Why

Anthropic explicitly warns against strict-step matching: "There is a common instinct to check that agents followed very specific steps... too brittle and results in overly brittle tests." "Grade what the agent produced, not the path it took." L6 v1 (PR #109) hit this trap. v2 fixes it.

## Schema additions (additive, schema_version=1 unchanged)

- debug_trajectory: +tokens_in, +tokens_out, +latency_ms columns
- eval_results: NEW table, one row per (flow, scorer) per run

## 4 scorer types

- outcome (primary, deterministic): SQL assertions on final DB state
- trajectory_required (secondary): superset, required tools called
- trajectory_forbidden (secondary): subset/safety, forbidden NOT called
- cost (observational): tokens + latency vs per-flow budget

## Per-flow layout (replaces expected/<name>.txt)

tests/dogfood/flows/<name>/
  README.md
  outcome.sql            ← primary scorer config
  tools-required.json    ← superset list
  tools-forbidden.json   ← safety list
  cost-budget.json       ← soft/hard budget
  run.sh                 ← invokes l6_score_flow

## Coverage

4 wired flows fully converted: 01-onboarding, 02-simple-task, D-direct-mode, 95-anonymous-cold-restart. 12 scaffold flows preserved; auto-skip until outcome.sql authored. Stale L6 v1 helpers removed (l6_assert_trajectory, expected/ dir).

## Tests

3 new schema tests (cost columns + eval_results structure). All L1-L4 green.

## Sources (all in PR body of #110)

Anthropic Engineering, LangSmith trajectory-evals docs, Inspect AI (UK AISI), LangChain agentevals, arxiv 2507.21504 survey, AWS prod agent eval lessons, Hamel Husain's Inspect endorsement.

Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
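The difference between the two secondary scorers can be made concrete with a short sketch. This is not `l6_score_flow`: the function names are hypothetical, and to stay dependency-free it reads plain one-name-per-line files rather than the tools-required.json and tools-forbidden.json files described above.

```shell
#!/usr/bin/env bash
# Sketch: superset ("required") and safety ("forbidden") trajectory scorers.
# Inputs are plain text files, one tool name per line (an assumption; the
# commit describes JSON config files). Function names are hypothetical.

# Pass if every required tool appears in the actual trajectory.
# Extra calls and ordering are ignored, unlike strict sequence matching.
l6_score_required() {
  local missing=0 tool
  while IFS= read -r tool || [ -n "$tool" ]; do
    [ -n "$tool" ] || continue
    if ! grep -q -x -F -- "$tool" "$2"; then
      echo "missing required call: $tool"
      missing=1
    fi
  done < "$1"
  return "$missing"
}

# Pass if no forbidden tool appears in the actual trajectory.
l6_score_forbidden() {
  local hit=0 tool
  while IFS= read -r tool || [ -n "$tool" ]; do
    [ -n "$tool" ] || continue
    if grep -q -x -F -- "$tool" "$2"; then
      echo "forbidden call present: $tool"
      hit=1
    fi
  done < "$1"
  return "$hit"
}
```

Both scorers tolerate reordering and additional calls, which is what makes them less brittle than comparing the full sequence line by line.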

ZaxShen and others added 2 commits April 26, 2026 02:12

ZaxShen merged commit b270fd4 into dev Apr 26, 2026
2 checks passed

This was referenced Apr 26, 2026

Build TMB evals system v2 — outcome-first multi-scorer (replaces L6 v1 brittle-strict-match) #110

Closed

🧪 feat(tests): L6 evals v2 — outcome-first multi-scorer (#110) #111

Merged

ZaxShen merged 2 commits into dev from feat/108-l6-deterministic-trajectory-tests
