Problem
Manual L5 dogfood is now the release bottleneck. Every v0.X.Y stable cut requires ~30-45 min of manual testing — install, walk 7-10 scenarios, eyeball trajectory, run release.sh. This stalls iteration speed and gates every release on human availability.
Existing L0-L4 cover MCP/hook correctness in isolation but don't exercise real Claude Code driving bro through real workflows. The gap is the single biggest reason L5 stays manual.
Proposal — L6 deterministic-trajectory tests
Core idea
Rather than mocking or trying to handle interactive AskUserQuestion, pre-seed the DB to put bro in a known state, then run claude -p "<prompt>" (non-interactive mode), then assert the resulting MCP/tool trajectory matches the expected sequence from FLOWS.md.
Each flow in FLOWS.md becomes a deterministic test:
| Test | Pre-seed | Run | Assert |
|---|---|---|---|
| Onboarding | Empty DB | claude -p "@bro hi" | Expected: identity_get, config_get, AskUserQuestion form rendered, identity_set, config_set x3, ledger_log(tmb_onboarding_complete) |
| 1st task (simple triage) | DB with completed onboarding | claude -p "@bro write a python cli todo" | Expected: tmb_project-prescan runs, tmb_lazy-regen-check runs, triage='simple', task_create_batch, Task(swe) spawn, ledger_log(planning_complete) |
| 2nd task (after 1st closed) | DB with 1st task closed | claude -p "@bro add a --limit flag" | Expected: same shape; previous-task context picked up via issue_list / task_get |
| Bro verification (V1/V2/V3) | DB with SWE-completed task | claude -p "@bro check task 3" | Expected: task_get, git diff, verification commands, ledger_log(bro_verification_pass), task_update_status(closed) |
| Direct mode | DB with onboarding done | claude -p "@bro fix typo in README line 3" | Expected: Edit, Bash(git commit), ledger_log(direct_mode_used). NO task_create_batch, NO Task spawn. |
| Anonymous cold-restart | DB with anonymous identity | claude -p "@bro hi" | Expected: identity_get returns row → bro skips onboarding → greets |
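A row of this table could be encoded as data along the following lines. This is a minimal sketch, not an existing TMB type: the ExpectedFlow shape, its field names, and the violations helper are all hypothetical, and the short call names are taken from the Direct mode row (real recorded names may carry a longer MCP prefix).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExpectedFlow:
    """Hypothetical encoding of one table row; not an existing TMB type."""
    name: str
    prompt: str
    required: tuple[str, ...]  # calls that must appear, in this order
    forbidden: frozenset[str] = field(default_factory=frozenset)


# Values copied from the "Direct mode" row of the table.
DIRECT_MODE = ExpectedFlow(
    name="direct-mode",
    prompt="@bro fix typo in README line 3",
    required=("Edit", "Bash", "ledger_log"),
    forbidden=frozenset({"task_create_batch", "Task"}),
)


def violations(flow: ExpectedFlow, observed: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the flow passed."""
    problems = [f"forbidden call: {name}" for name in observed if name in flow.forbidden]
    remaining = iter(observed)
    for name in flow.required:
        # `in` on an iterator consumes it up to the match, which enforces order.
        if name not in remaining:
            problems.append(f"missing or out of order: {name}")
            break
    return problems
```

Keeping "forbidden" separate from "required" lets a row such as Direct mode state its negative assertions (no task_create_batch, no Task spawn) explicitly instead of relying on an exact-length match.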
Why this works (non-determinism is bounded)
User insight: "tmb's interaction MUST be deterministic in numbers of tools usage, mcp, as well as the tool name and mcp name matching" — meaning while Claude's prose varies, the MCP/tool call sequence for a given (prompt, DB state) pair is doctrine-fixed.
If bro's actual trajectory deviates from FLOWS.md, one of three things is true:
- The doctrine is broken (real bug to fix)
- FLOWS.md is stale (doc to update)
- The skill/agent prompt drifted (regression to fix)
All three are valuable signals — exactly what L5 catches manually today.
Three pieces of infrastructure
1. Docker harness with real Claude Code
- Base images: ubuntu:24.04, node:22-slim (Debian-based), macos-latest (only on local; GH Actions doesn't support nested macOS), windows-server-2022 (later)
- Install Claude Code in each: npm install -g @anthropic-ai/claude-code (or whatever the install-time command is)
- Auth: User has Claude Code account; can provide CLAUDE_CODE_OAUTH_TOKEN via repo secret in .env form. Mount or pass to container.
- Install TMB plugin via the --plugin-dir mode pointing at the checked-out source (so test runs against the PR's code, not the marketplace tag)
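The harness invocation might be assembled roughly as follows. This is a sketch under assumptions: the image name, the /workspace paths, and the docker_argv helper are hypothetical, and it presumes the image already has Claude Code installed. Only the docker run flags (--rm, -e, -v, -w) and the claude flags named in this proposal (-p, --plugin-dir) are used.

```python
def docker_argv(image: str, repo: str, prompt: str) -> list[str]:
    """Build a `docker run` argv for one flow (hypothetical container layout)."""
    return [
        "docker", "run", "--rm",
        # Name only, no value: docker copies the variable from the caller's
        # environment, so the token does not appear in the argv itself.
        "-e", "CLAUDE_CODE_OAUTH_TOKEN",
        "-e", "TMB_DEBUG_TRAJECTORY=1",
        # Mount the checked-out source so the run exercises the PR's code.
        "-v", f"{repo}:/workspace/tmb",
        "-w", "/workspace/project",
        image,
        "claude", "--plugin-dir", "/workspace/tmb", "-p", prompt,
    ]
```

For example, docker_argv("tmb-l6:ubuntu", "/src/tmb", "@bro hi") returns a list that can be handed to subprocess.run without a shell, which avoids quoting problems in prompts.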
2. Debug trajectory table (gated by /debug_tmb or env var)
New table:
CREATE TABLE IF NOT EXISTS debug_trajectory (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id TEXT NOT NULL,
  step_n INTEGER NOT NULL,
  kind TEXT NOT NULL,       -- 'tool_use' | 'mcp_call' | 'agent_thinking' | 'response'
  agent TEXT,               -- 'bro' | 'swe' | 'pr-reviewer' | etc
  tool_or_mcp_name TEXT,    -- e.g. 'mcp__plugin_tmb_trajectory-server__identity_get' or 'Bash'
  args_json TEXT,           -- input args (truncated)
  result_json TEXT,         -- output summary (truncated)
  ts TEXT NOT NULL
);
- Populated only when the env var TMB_DEBUG_TRAJECTORY=1 is set (or via a /debug_tmb slash command for live debugging). Off by default, so production runs write nothing to this table.
- Populated by a thin wrapper around the MCP server's tool-call dispatcher (writes one row per tool call) plus a hook that records Bash/Read/Write/Edit calls.
- The L6 test runner reads from this table after claude -p exits to make assertions.
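The gating described above could look something like the sketch below. The record_step function, its signature, and the 2000-character truncation limit are assumptions for illustration; the proposal fixes only the table shape and the TMB_DEBUG_TRAJECTORY=1 switch.

```python
import json
import os
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS debug_trajectory (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,
    step_n INTEGER NOT NULL,
    kind TEXT NOT NULL,
    agent TEXT,
    tool_or_mcp_name TEXT,
    args_json TEXT,
    result_json TEXT,
    ts TEXT NOT NULL
)
"""

MAX_JSON_CHARS = 2000  # assumed truncation limit; the proposal does not fix one


def record_step(conn, session_id, kind, name, args, result, agent=None):
    """Append one row, or do nothing when TMB_DEBUG_TRAJECTORY is not '1'."""
    if os.environ.get("TMB_DEBUG_TRAJECTORY") != "1":
        return False  # off by default: no table, no writes
    conn.execute(SCHEMA)
    # step_n counts up per session, starting at 1.
    (next_step,) = conn.execute(
        "SELECT COALESCE(MAX(step_n), 0) + 1 FROM debug_trajectory WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO debug_trajectory "
        "(session_id, step_n, kind, agent, tool_or_mcp_name, args_json, result_json, ts) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (session_id, next_step, kind, agent, name,
         json.dumps(args)[:MAX_JSON_CHARS], json.dumps(result)[:MAX_JSON_CHARS],
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return True
```

Note that slicing the serialized JSON means a truncated args_json or result_json may not parse; that is acceptable here because the L6 assertions only read kind and tool_or_mcp_name.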
3. Test runner
tests/dogfood/run-l6.sh
flows/01-onboarding.test.sh
flows/02-first-task.test.sh
flows/03-second-task.test.sh
flows/04-direct-mode.test.sh
flows/05-bro-verification.test.sh
flows/06-anonymous-cold-restart.test.sh
flows/07-channel-isolation.test.sh
...
Each flow test:
- Spin up a Docker container with the TMB plugin installed
- Pre-seed .claude/<plugin>/trajectory.db with the required state
- Run claude -p "<prompt>" with TMB_DEBUG_TRAJECTORY=1
- Read the debug_trajectory table
- Compare the sequence of (kind, tool_or_mcp_name) against an expected JSON file
- Assert match (allowing prose variation in args, but tool sequence is checked)
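The read-and-compare steps above can be sketched as follows. The function names, the expected-file layout (a JSON array of [kind, name] pairs), and the choice to ignore 'agent_thinking' and 'response' rows are assumptions, not something the proposal specifies.

```python
import json
import sqlite3


def load_trajectory(conn: sqlite3.Connection, session_id: str) -> list[tuple[str, str]]:
    """Return (kind, tool_or_mcp_name) pairs for one session, in step order."""
    rows = conn.execute(
        "SELECT kind, tool_or_mcp_name FROM debug_trajectory "
        # Assumption: only call rows are asserted on; prose rows are skipped.
        "WHERE session_id = ? AND kind IN ('tool_use', 'mcp_call') "
        "ORDER BY step_n",
        (session_id,),
    )
    return [(kind, name) for kind, name in rows]


def first_mismatch(actual, expected_json: str):
    """Return (index, expected, actual) at the first divergence, or None on a match."""
    expected = [tuple(pair) for pair in json.loads(expected_json)]
    for i in range(max(len(actual), len(expected))):
        want = expected[i] if i < len(expected) else None
        got = actual[i] if i < len(actual) else None
        if want != got:
            return i, want, got
    return None
```

Reporting the first diverging step, rather than a bare pass/fail, makes it easier to tell which of the three failure causes (doctrine bug, stale FLOWS.md, prompt drift) is in play.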
Out of scope
- Code quality — already covered by L1-L4. L6 only checks workflow correctness.
- AskUserQuestion handling — pre-seed DB to skip the form entirely.
- Token cost optimization — initially run only on release-prep PRs (not every PR).
Cost considerations
- Each L6 invocation costs real Claude tokens (user's account). Estimate: 5-20K tokens per flow × 7 flows × 3 platforms = 105K-420K tokens per L6 run, so budget roughly 300K.
- Mitigation: run L6 only on release-prep/* branches and on tags, not every PR.
- Or: run a single Linux platform on every PR, full matrix only on release.
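The estimate can be checked with a few lines of arithmetic. The per-flow figures, flow count, and platform count come from the text above; the helper name is illustrative only.

```python
def tokens_per_run(per_flow: int, flows: int = 7, platforms: int = 3) -> int:
    """Total tokens for one L6 run at a given per-flow cost."""
    return per_flow * flows * platforms


low, high = tokens_per_run(5_000), tokens_per_run(20_000)
linux_only_high = tokens_per_run(20_000, platforms=1)
print(low, high, linux_only_high)  # 105000 420000 140000
```

Under these numbers, running a single Linux platform on every PR caps the per-run cost at about a third of the full matrix.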
Open questions
- Does claude -p mode allow the plugin's AskUserQuestion skill to render? If not, we need to either (a) pre-seed past every form, or (b) use --allowedTools to skip them.
- How to inject CLAUDE_CODE_OAUTH_TOKEN into Docker safely. Repo secret + GH Actions env var → container env var is standard. Local runs would source .env.
- Does the debug trajectory table belong in the main schema or a separate file? Probably main — it's just one extra table, gated by the env var.
Acceptance criteria
- debug_trajectory table added; populated only when TMB_DEBUG_TRAJECTORY=1
- tests/dogfood/run-l6.sh runner with at least 3 flow tests (onboarding, first-task, direct-mode)
- l6-dogfood.yml runs on release-prep/* and tag pushes; uses CLAUDE_CODE_OAUTH_TOKEN repo secret
- tests/README.md: when L6 runs, what it asserts, how to add a flow
Why this matters
Today's release sequence: dev → manual L5 dogfood (30-45 min) → release.sh. With L6: dev → push tag → CI runs L6 → release.sh auto. This removes the human-in-the-loop bottleneck for routine releases. Manual L5 stays as the safety net for major-version cuts and edge cases.
This is enabling tech for everything else: once L6 lands, every future release is faster and safer.