🧪 feat(tests): L6 deterministic-trajectory dogfood + opt-in capture (#108) #109
Merged
…108)

Manual L5 dogfood was the release bottleneck. L6 automates it: pre-seed DB → run real `claude -p` → assert MCP/tool sequence matches the expected from FLOWS.md.

## What's new

**Schema**: `debug_trajectory` table (15th table). Off by default; populated only when env `TMB_DEBUG_TRAJECTORY=1`. Zero overhead in production. Schema version stays at 1 (additive).

**Capture**:
- MCP server writes a row per MCP call when env is set (src/index.ts wrapper)
- New PreToolUse hook `scripts/hooks/debug-trajectory.sh` (matcher: "*") writes rows for non-MCP calls (Bash, Read, Write, Edit, Task, Skill)

**Test infra** at `tests/dogfood/`:
- `run-l6.sh` runner
- `lib/flow-helpers.sh` shared helpers
- 16 flow scripts in `flows/` (4 fully wired, 12 scaffolded)
- `fixtures/` pre-seed SQL (empty, onboarding-named, onboarding-anonymous)
- `expected/` expected-trajectory files

**4 wired flows**: 01-onboarding, 02-simple-task, D-direct-mode (with hard invariants on no-task-spawn + direct_mode_used event), 95-anonymous-cold-restart (with assertions that no re-onboarding writes happen; locks #95 regression).

**12 scaffolded flows** auto-skip until their expected-trajectory file is authored. Pattern is copy/paste; each follow-up is ~30 lines.

**CI** at `.github/workflows/l6-dogfood.yml`: triggers on tag pushes, PRs labeled `L6`, manual dispatch. Soft-fails when CLAUDE_CODE_OAUTH_TOKEN secret is absent (forks won't break). Uploads trajectory dumps on failure.

**Stale doctrine cleanup** (audit):
- Onboarding skill: fixed `tmb_bootstrap_complete` → `tmb_onboarding_complete`
- Agent-creator skill: dropped tmb_bootstrap ref (skill is gone)
- Plugin CLAUDE.md: removed retirement-in-progress note for tmb_bootstrap

## Unverified assumption (flagged in #108)

`claude -p` mode behavior with AskUserQuestion. If form auto-fails in headless mode, the onboarding flow trajectory is shorter than expected; that's a real signal to file as follow-up.
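The "shorter than expected" signal falls out of a strict, position-by-position comparison of the captured tool sequence against the expected one. The following is a minimal TypeScript sketch of that comparison, assuming trajectories are plain arrays of tool names. The `diffTrajectory` function, the `TrajectoryDiff` shape, and the example tool sequence are hypothetical illustrations, not the helpers shipped in `lib/flow-helpers.sh`.

```typescript
// Hypothetical sketch: strict, position-by-position trajectory comparison.
// The tool sequence used below is an illustrative placeholder, not a real flow.

interface TrajectoryDiff {
  ok: boolean;
  // Index of the first mismatch, or -1 when the sequences are identical.
  firstDivergence: number;
  reason: string;
}

function diffTrajectory(expected: string[], actual: string[]): TrajectoryDiff {
  const shared = Math.min(expected.length, actual.length);
  // Walk the common prefix and stop at the first differing step.
  for (let i = 0; i < shared; i++) {
    if (expected[i] !== actual[i]) {
      return {
        ok: false,
        firstDivergence: i,
        reason: "step " + i + ": expected " + expected[i] + ", got " + actual[i],
      };
    }
  }
  // Same prefix but the run stopped early (e.g. a form auto-failed).
  if (actual.length < expected.length) {
    return {
      ok: false,
      firstDivergence: shared,
      reason: "trajectory ended early; missing " + expected[shared],
    };
  }
  // Same prefix but the run kept going past the expected end.
  if (actual.length > expected.length) {
    return {
      ok: false,
      firstDivergence: shared,
      reason: "unexpected extra step " + actual[shared],
    };
  }
  return { ok: true, firstDivergence: -1, reason: "match" };
}

const expectedSteps = ["AskUserQuestion", "Write", "Bash"];
const headlessSteps = ["AskUserQuestion"]; // run that stopped after the form
const headlessDiff = diffTrajectory(expectedSteps, headlessSteps);
```

Here `headlessDiff.ok` is `false` and `headlessDiff.firstDivergence` is `1`, which points at the first step the headless run never reached.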
## Tests

2 new schema unit tests (table presence + columns + index). All L1-L4 green. L0 will run in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When authoring an expected-trajectory file for a new flow, the workflow is: run the flow once with TMB_DEBUG_TRAJECTORY=1, then read the debug_trajectory table. This script does step 2 cleanly: it finds the right DB (handles channel isolation) and prints in the L6-expected format, ready to paste into tests/dogfood/expected/<flow>.txt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 26, 2026
Closed
ZaxShen added a commit that referenced this pull request Apr 26, 2026
Replaces L6 v1's brittle strict-trajectory matching with the industry-standard multi-scorer pattern (Inspect AI / AgentEvals / Anthropic doctrine).

## Why

Anthropic explicitly warns against strict-step matching:

"There is a common instinct to check that agents followed very specific steps... too brittle and results in overly brittle tests."

"Grade what the agent produced, not the path it took."

L6 v1 (PR #109) hit this trap. v2 fixes it.

## Schema additions (additive, schema_version=1 unchanged)

- debug_trajectory: +tokens_in, +tokens_out, +latency_ms columns
- eval_results: NEW table, one row per (flow, scorer) per run

## 4 scorer types

- outcome (primary, deterministic): SQL assertions on final DB state
- trajectory_required (secondary): superset, required tools called
- trajectory_forbidden (secondary): subset/safety, forbidden NOT called
- cost (observational): tokens + latency vs per-flow budget

## Per-flow layout (replaces expected/<name>.txt)

tests/dogfood/flows/<name>/
- README.md
- outcome.sql: primary scorer config
- tools-required.json: superset list
- tools-forbidden.json: safety list
- cost-budget.json: soft/hard budget
- run.sh: invokes l6_score_flow

## Coverage

4 wired flows fully converted: 01-onboarding, 02-simple-task, D-direct-mode, 95-anonymous-cold-restart. 12 scaffold flows preserved; auto-skip until outcome.sql authored. Stale L6 v1 helpers removed (l6_assert_trajectory, expected/ dir).

## Tests

3 new schema tests (cost columns + eval_results structure). All L1-L4 green.

## Sources (all in PR body of #110)

Anthropic Engineering, LangSmith trajectory-evals docs, Inspect AI (UK AISI), LangChain agentevals, arxiv 2507.21504 survey, AWS prod agent eval lessons, Hamel Husain's Inspect endorsement.

Co-authored-by: Zax Shen <ZaxShen@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
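To make the scorer semantics concrete, here is a rough TypeScript sketch of the two secondary scorers and the cost scorer. Everything in it is an assumption for illustration: the function names, the `ScoreResult` shape, the `warn`/`fail` mapping for soft and hard budgets, and the token-only budget (latency is left out). The actual implementation is invoked through `l6_score_flow` and reads the per-flow JSON files.

```typescript
// Hypothetical sketch of the secondary and observational scorers.
// Names, result shapes, and budget semantics are assumptions, not the repo's code.

type Verdict = "pass" | "warn" | "fail";

interface ScoreResult {
  scorer: string;
  verdict: Verdict;
  detail: string[];
}

// trajectory_required: every required tool must appear at least once.
// Order and extra calls are ignored, which is what makes it a superset check.
function scoreRequired(called: string[], required: string[]): ScoreResult {
  const missing = required.filter((tool) => called.indexOf(tool) === -1);
  return {
    scorer: "trajectory_required",
    verdict: missing.length === 0 ? "pass" : "fail",
    detail: missing,
  };
}

// trajectory_forbidden: no forbidden tool may appear anywhere in the run.
function scoreForbidden(called: string[], forbidden: string[]): ScoreResult {
  const hits = forbidden.filter((tool) => called.indexOf(tool) !== -1);
  return {
    scorer: "trajectory_forbidden",
    verdict: hits.length === 0 ? "pass" : "fail",
    detail: hits,
  };
}

interface CostBudget {
  softTokens: number;
  hardTokens: number;
}

// cost: compare total tokens with a soft and a hard budget.
// Assumption: over soft is a warning, over hard is a failure.
function scoreCost(tokensIn: number, tokensOut: number, budget: CostBudget): ScoreResult {
  const total = tokensIn + tokensOut;
  let verdict: Verdict = "pass";
  if (total > budget.hardTokens) {
    verdict = "fail";
  } else if (total > budget.softTokens) {
    verdict = "warn";
  }
  return { scorer: "cost", verdict: verdict, detail: ["total_tokens=" + total] };
}

const calledTools = ["Read", "Bash", "Read", "Edit"];
const requiredScore = scoreRequired(calledTools, ["Read", "Edit"]);
const forbiddenScore = scoreForbidden(calledTools, ["Task"]);
const costScore = scoreCost(1200, 300, { softTokens: 1000, hardTokens: 5000 });
```

With this input, the required and forbidden scorers pass even though the run contains an extra `Bash` call and a repeated `Read`, which a strict sequence match would have rejected. The cost scorer returns `warn` because 1500 tokens exceeds the soft budget but not the hard one.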
Closes #108.
## Summary

Manual L5 dogfood was the release bottleneck. L6 automates it: pre-seed DB → run real `claude -p` → assert MCP/tool sequence matches `FLOWS.md`.

The trajectory capture is opt-in (`TMB_DEBUG_TRAJECTORY=1`) so production users see zero overhead.

## What landed
### Schema (additive: schema_version stays at 1)

New `debug_trajectory` table (15th in the DB), populated only when env is set. Indexed on `(session_id, step_n)` for fast L6 reads.

### Capture wiring
- `mcp/trajectory-server/src/index.ts`: wraps `CallToolRequestSchema` handler
- `scripts/hooks/debug-trajectory.sh`, registered in `hooks/hooks.json` with `matcher: "*"`

Both fail-soft: any capture error is swallowed so the actual tool call never breaks.
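The opt-in, fail-soft behavior can be sketched as a small wrapper around a tool handler. This is only an illustration of the control flow under stated assumptions: the `withTrajectoryCapture` name, the row shape, and the injected `record` and `env` parameters are hypothetical, and the sketch is synchronous for brevity, whereas MCP request handlers are typically async.

```typescript
// Hypothetical sketch of opt-in, fail-soft capture around a tool handler.
// All names and shapes here are assumptions, not the code in src/index.ts.

interface ToolCall {
  name: string;
  args: unknown;
}

interface TrajectoryRow {
  stepN: number;
  toolName: string;
}

type CaptureEnv = { [key: string]: string | undefined };
type ToolHandler = (call: ToolCall) => string;
type RowRecorder = (row: TrajectoryRow) => void;

function withTrajectoryCapture(
  handler: ToolHandler,
  record: RowRecorder,
  env: CaptureEnv
): ToolHandler {
  let stepN = 0;
  return (call: ToolCall): string => {
    // Capture only when the opt-in flag is set; otherwise skip straight to the handler.
    if (env["TMB_DEBUG_TRAJECTORY"] === "1") {
      try {
        stepN += 1;
        record({ stepN: stepN, toolName: call.name });
      } catch (err) {
        // Fail-soft: a capture error is swallowed so the tool call still runs.
      }
    }
    return handler(call);
  };
}

const capturedRows: TrajectoryRow[] = [];
const echoHandler: ToolHandler = (call) => "ran " + call.name;
const pushRow: RowRecorder = (row) => {
  capturedRows.push(row);
};
const brokenRecorder: RowRecorder = () => {
  throw new Error("db locked");
};

const captureOn = withTrajectoryCapture(echoHandler, pushRow, { TMB_DEBUG_TRAJECTORY: "1" });
const captureOff = withTrajectoryCapture(echoHandler, pushRow, {});
const captureBroken = withTrajectoryCapture(echoHandler, brokenRecorder, { TMB_DEBUG_TRAJECTORY: "1" });
```

Passing `env` and `record` in as parameters keeps the sketch testable without a database. Calling `captureBroken` still returns the handler's result even though its recorder throws.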
### Test infrastructure

**4 fully wired flows**

- `01-onboarding`
- `02-simple-task`
- `D-direct-mode`
- `95-anonymous-cold-restart`

**12 scaffolded flows** (auto-skip until expected-trajectory authored)

`03-difficult-task`, `04-agent-creator`, `05-skill-creation`, `06-push-gate`, `07-architecture-regen`, `08-swe-retry`, `09-roundtable`, `C-consultant`, `32-team-config`, `92-base-branch`, `94-arch-bootstrap`, `96-halt-on-error`.

Each is ~30 lines of pattern-match copy/paste once the expected-trajectory file is authored.
### CI

`.github/workflows/l6-dogfood.yml` triggers:

- tag pushes
- PRs labeled `L6` (opt-in for risky doctrine changes)
- `workflow_dispatch`

Soft-fails when the `CLAUDE_CODE_OAUTH_TOKEN` secret is absent, so forks and external PRs don't turn red. Uploads trajectory dumps as artifacts on failure for debugging.

### Stale doctrine cleanup (audit found 3 issues)
- `skills/tmb_first-run-onboarding/SKILL.md`: fixed event_type from stale `tmb_bootstrap_complete` → `tmb_onboarding_complete` + dropped "file copies" reference (swe + pr-reviewer ship globally now)
- `skills/tmb_agent-creator/SKILL.md`: dropped `tmb_bootstrap` reference (skill is gone in v0.3.0+)
- `CLAUDE.md`: removed the "tmb_bootstrap is being retired" sentence (already retired)

## Unverified assumption (flagged for follow-up)
`claude -p` mode behavior with `AskUserQuestion` is unknown. If the form auto-fails in headless mode, the `01-onboarding` flow's trajectory will be shorter than expected, which is a real signal to address (could be a CC bug, could be a doctrine adjustment for headless mode).

The L6 design pre-seeds DB to skip past forms for every other flow. Only `01-onboarding` actually exercises the form path.

## Test plan
- … `L6`, OR run locally with `bash tests/dogfood/run-l6.sh`

## Risk profile
- `matcher: "*"` PreToolUse: fires on every tool call but exits 0 immediately when env unset (~1ms overhead).

## Follow-ups (after merge)
- … `L6` on this PR (or trigger workflow_dispatch): CI runs the 4 wired flows for the first time. If `claude -p` behavior surprises us, follow-up issue.
- … `TMB_DEBUG_TRAJECTORY=1`.
- … `dev`: currently opt-in via `L6` label. Could expand based on token cost data after a few runs.

🤖 Generated with Claude Code