Automate L5 dogfood as L6: deterministic-trajectory tests in Docker with real Claude Code

## Problem

Manual L5 dogfood is now the release bottleneck. Every v0.X.Y stable cut requires ~30-45 min of manual testing — install, walk 7-10 scenarios, eyeball trajectory, run release.sh. This stalls iteration speed and gates every release on human availability.

Existing L0-L4 cover MCP/hook correctness in isolation but don't exercise **real Claude Code driving bro through real workflows**. The gap is the single biggest reason L5 stays manual.

## Proposal — L6 deterministic-trajectory tests

### Core idea

Rather than mocking or trying to handle interactive `AskUserQuestion`, **pre-seed the DB to put bro in a known state, then run `claude -p "<prompt>"` (non-interactive mode), then assert the resulting MCP/tool trajectory matches the expected sequence from `FLOWS.md`.**

Each flow in `FLOWS.md` becomes a deterministic test:

| Test | Pre-seed | Run | Assert |
|---|---|---|---|
| Onboarding | Empty DB | `claude -p "@bro hi"` | Expected: `identity_get`, `config_get`, AskUserQuestion form rendered, `identity_set`, `config_set` x3, `ledger_log(tmb_onboarding_complete)` |
| 1st task (simple triage) | DB with completed onboarding | `claude -p "@bro write a python cli todo"` | Expected: `tmb_project-prescan` runs, `tmb_lazy-regen-check` runs, triage='simple', `task_create_batch`, `Task(swe)` spawn, `ledger_log(planning_complete)` |
| 2nd task (after 1st closed) | DB with 1st task closed | `claude -p "@bro add a --limit flag"` | Expected: same shape; previous-task context picked up via `issue_list` / `task_get` |
| Bro verification (V1/V2/V3) | DB with SWE-completed task | `claude -p "@bro check task 3"` | Expected: `task_get`, `git diff`, verification commands, `ledger_log(bro_verification_pass)`, `task_update_status(closed)` |
| Direct mode | DB with onboarding done | `claude -p "@bro fix typo in README line 3"` | Expected: `Edit`, `Bash(git commit)`, `ledger_log(direct_mode_used)`. NO `task_create_batch`, NO Task spawn. |
| Anonymous cold-restart | DB with anonymous identity | `claude -p "@bro hi"` | Expected: `identity_get` returns row → bro skips onboarding → greets |
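The Direct mode row combines positive and negative expectations, which a plain sequence comparison does not capture on its own. A minimal sketch of such a check follows, assuming the trajectory has already been reduced to a list of tool names. The helper `check_calls` and the sample runs are hypothetical, and real rows would carry full names such as `mcp__plugin_tmb_trajectory-server__identity_get` rather than the short names used here.

```python
from typing import Iterable


def check_calls(actual: Iterable[str], required: list[str], forbidden: set[str]) -> list[str]:
    """Return human-readable violations; an empty list means the run passes."""
    actual = list(actual)
    problems = []
    # Required calls must appear in this order; unrelated calls may sit between them.
    pos = 0
    for name in required:
        try:
            pos = actual.index(name, pos) + 1
        except ValueError:
            problems.append(f"missing or out of order: {name}")
    # Forbidden calls must not appear anywhere in the run.
    for name in actual:
        if name in forbidden:
            problems.append(f"forbidden call seen: {name}")
    return problems


# Illustrative data only (not taken from a real run).
direct_mode_required = ["Edit", "Bash", "ledger_log"]
direct_mode_forbidden = {"task_create_batch", "Task"}

ok_run = ["identity_get", "Edit", "Bash", "ledger_log"]
bad_run = ["identity_get", "task_create_batch", "Edit", "Bash", "ledger_log"]

print(check_calls(ok_run, direct_mode_required, direct_mode_forbidden))
print(check_calls(bad_run, direct_mode_required, direct_mode_forbidden))
```

Keeping the forbidden set explicit means a Direct mode regression that quietly starts spawning tasks fails the test even when the required calls still appear.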

### Why this works (non-determinism is bounded)

User insight: *"tmb's interaction MUST be deterministic in numbers of tools usage, mcp, as well as the tool name and mcp name matching"* — meaning while Claude's prose varies, the MCP/tool call sequence for a given (prompt, DB state) pair is doctrine-fixed.

If bro's actual trajectory deviates from `FLOWS.md`, one of three things is true:
- The doctrine is broken (real bug to fix)
- `FLOWS.md` is stale (doc to update)
- The skill/agent prompt drifted (regression to fix)

All three are valuable signals — exactly what L5 catches manually today.

### Three pieces of infrastructure

#### 1. Docker harness with real Claude Code

- Base images: `ubuntu:24.04` and `node:22-slim` (Debian-based). `macos-latest` runs locally only (GH Actions doesn't support nested macOS); `windows-server-2022` comes later.
- Install Claude Code in each: `npm install -g @anthropic-ai/claude-code` (exact command to be confirmed against the current install instructions)
- **Auth**: The user has a Claude Code account and can provide `CLAUDE_CODE_OAUTH_TOKEN` as a repo secret in CI, or in a `.env` file for local runs. Mount it or pass it to the container as an env var.
- Install TMB plugin via the `--plugin-dir` mode pointing at the checked-out source (so test runs against the PR's code, not the marketplace tag)

#### 2. Debug trajectory table (gated by `/debug_tmb` or env var)

New table:

```sql
CREATE TABLE IF NOT EXISTS debug_trajectory (
  id           INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id   TEXT NOT NULL,
  step_n       INTEGER NOT NULL,
  kind         TEXT NOT NULL,            -- 'tool_use' | 'mcp_call' | 'agent_thinking' | 'response'
  agent        TEXT,                     -- 'bro' | 'swe' | 'pr-reviewer' | etc
  tool_or_mcp_name TEXT,                 -- e.g. 'mcp__plugin_tmb_trajectory-server__identity_get' or 'Bash'
  args_json    TEXT,                     -- input args (truncated)
  result_json  TEXT,                     -- output summary (truncated)
  ts           TEXT NOT NULL
);
```

- **Only populated when env `TMB_DEBUG_TRAJECTORY=1`** (or via a `/debug_tmb` slash command for live debugging). Off by default — zero overhead in production.
- Populated by a thin wrapper around the MCP server's tool-call dispatcher (writes one row per tool call) + a hook that records Bash/Read/Write/Edit calls.
- L6 test runner reads from this table after `claude -p` exits to make assertions.
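The gated wrapper described above can be sketched as follows. This is an illustration in Python, not the plugin's actual dispatcher: `record_calls`, `MAX_LEN`, and the `dispatch(name, args)` signature are assumptions, and only the table layout and the `TMB_DEBUG_TRAJECTORY` flag come from this issue.

```python
import json
import os
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS debug_trajectory (
  id           INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id   TEXT NOT NULL,
  step_n       INTEGER NOT NULL,
  kind         TEXT NOT NULL,
  agent        TEXT,
  tool_or_mcp_name TEXT,
  args_json    TEXT,
  result_json  TEXT,
  ts           TEXT NOT NULL
)
"""

MAX_LEN = 2000  # hypothetical truncation limit for args/result payloads


def record_calls(conn, session_id, agent, dispatch):
    """Wrap dispatch(name, args) so each call writes one row when the flag is set."""
    def wrapped(name, args):
        result = dispatch(name, args)
        # The flag is read per call, so the default (unset) path does no DB work.
        if os.environ.get("TMB_DEBUG_TRAJECTORY") != "1":
            return result
        step_n = conn.execute(
            "SELECT COALESCE(MAX(step_n), 0) + 1 FROM debug_trajectory WHERE session_id = ?",
            (session_id,),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO debug_trajectory "
            "(session_id, step_n, kind, agent, tool_or_mcp_name, args_json, result_json, ts) "
            "VALUES (?, ?, 'mcp_call', ?, ?, ?, ?, ?)",
            (
                session_id,
                step_n,
                agent,
                name,
                # Truncation can leave invalid JSON; these columns are for humans, not parsing.
                json.dumps(args)[:MAX_LEN],
                json.dumps(result)[:MAX_LEN],
                datetime.now(timezone.utc).isoformat(),
            ),
        )
        conn.commit()
        return result
    return wrapped
```

Deriving `step_n` from the table rather than from an in-memory counter keeps numbering consistent if more than one process writes to the same session, at the cost of one extra query per call.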

#### 3. Test runner

```text
tests/dogfood/run-l6.sh
  flows/01-onboarding.test.sh
  flows/02-first-task.test.sh
  flows/03-second-task.test.sh
  flows/04-direct-mode.test.sh
  flows/05-bro-verification.test.sh
  flows/06-anonymous-cold-restart.test.sh
  flows/07-channel-isolation.test.sh
  ...
```

Each flow test:
1. Spin up Docker container with TMB plugin installed
2. Pre-seed `.claude/<plugin>/trajectory.db` with required state
3. `claude -p "<prompt>"` with `TMB_DEBUG_TRAJECTORY=1`
4. Read `debug_trajectory` table
5. Compare the sequence of `(kind, tool_or_mcp_name)` against an expected JSON file
6. Assert the sequences match: args may vary with Claude's prose, but the tool and MCP names and their order must not
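Steps 4 to 6 could look roughly like the sketch below. The expected-file format (a JSON array of objects with `kind` and `name` keys) is an assumption, as are the helper names. Filtering to `tool_use` and `mcp_call` rows is also an assumption, made because `agent_thinking` and `response` rows are likely to vary between runs.

```python
import json
import sqlite3


def load_trajectory(conn, session_id):
    """Read the ordered (kind, tool_or_mcp_name) pairs for one session."""
    rows = conn.execute(
        "SELECT kind, tool_or_mcp_name FROM debug_trajectory "
        "WHERE session_id = ? AND kind IN ('tool_use', 'mcp_call') "
        "ORDER BY step_n",
        (session_id,),
    ).fetchall()
    return [tuple(r) for r in rows]


def parse_expected(text):
    """Parse the hypothetical expected-trajectory JSON into comparable pairs."""
    return [(step["kind"], step["name"]) for step in json.loads(text)]


def first_divergence(actual, expected):
    """Return None on an exact match, else (index, actual_item, expected_item)."""
    for i in range(max(len(actual), len(expected))):
        a = actual[i] if i < len(actual) else None
        e = expected[i] if i < len(expected) else None
        if a != e:
            return (i, a, e)
    return None
```

Reporting the first diverging step, rather than a bare pass/fail, makes it easier to tell whether the doctrine, `FLOWS.md`, or a prompt drifted.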

### Out of scope

- **Code quality** — already covered by L1-L4. L6 only checks workflow correctness.
- **AskUserQuestion handling** — pre-seed DB to skip the form entirely.
- **Token cost optimization** — initially run only on release-prep PRs (not every PR).

### Cost considerations

- Each L6 invocation costs real Claude tokens (user's account). Estimate: 5-20K tokens per flow × 7 flows × 3 platforms = 105K-420K tokens per L6 run, with ~300K as a working figure.
- Mitigation: run L6 only on `release-prep/*` branches and on tags, not every PR.
- Or: run a single Linux platform on every PR, full matrix only on release.
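The estimate above is simple multiplication. The short calculation below spells it out for both the full matrix and the single-platform option, using only the per-flow range and counts stated in this section. The function name is illustrative.

```python
def tokens_per_run(per_flow_low, per_flow_high, flows, platforms):
    """Return the (low, high) token bounds for one L6 run."""
    invocations = flows * platforms
    return per_flow_low * invocations, per_flow_high * invocations


full_low, full_high = tokens_per_run(5_000, 20_000, flows=7, platforms=3)
linux_low, linux_high = tokens_per_run(5_000, 20_000, flows=7, platforms=1)

print(full_low, full_high)    # 105000 420000
print(linux_low, linux_high)  # 35000 140000
```

Running only Linux on every PR cuts the per-run bounds to a third of the full matrix.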

### Open questions

1. Does `claude -p` mode allow the plugin's `AskUserQuestion` skill to render? If not, we need to either (a) pre-seed past every form, or (b) use `--allowedTools` to skip them.
2. How to inject `CLAUDE_CODE_OAUTH_TOKEN` into Docker safely. Repo secret + GH Actions env var → container env var is standard. Local runs would source `.env`.
3. Does the debug trajectory table belong in the main schema or a separate file? Probably main — it's just one extra table, gated by the env var.
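For question 2, one option is to pass the token by name only, so its value never appears in the command line or CI logs. `docker run -e VAR` with no value copies `VAR` from the caller's environment. The sketch below builds such a command. The image name, the `/plugin` mount path, and the helper name are hypothetical, while `--plugin-dir` and `claude -p` are used as described earlier in this issue.

```python
import shlex


def docker_run_argv(image, plugin_dir, prompt):
    """Build a docker command that forwards the OAuth token without embedding it."""
    return [
        "docker", "run", "--rm",
        # Name only: docker reads the value from the calling process's environment.
        "-e", "CLAUDE_CODE_OAUTH_TOKEN",
        "-e", "TMB_DEBUG_TRAJECTORY=1",
        # Mount the checked-out plugin source read-only (hypothetical path).
        "-v", f"{plugin_dir}:/plugin:ro",
        image,
        "claude", "--plugin-dir", "/plugin", "-p", prompt,
    ]


argv = docker_run_argv("tmb-l6:ubuntu", "/work/tmb", "@bro hi")
print(shlex.join(argv))  # safe to log: contains no secret value
```

In CI the secret would be exported into the job's environment before this command runs; locally, sourcing `.env` first has the same effect.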

### Acceptance criteria

- [ ] Schema: `debug_trajectory` table added; populated only when `TMB_DEBUG_TRAJECTORY=1`
- [ ] Wrapper around MCP dispatch writes one row per tool/mcp call when env is set
- [ ] `tests/dogfood/run-l6.sh` runner with at least 3 flow tests (onboarding, first-task, direct-mode)
- [ ] CI workflow `l6-dogfood.yml` runs on `release-prep/*` and tag pushes; uses `CLAUDE_CODE_OAUTH_TOKEN` repo secret
- [ ] Documented in `tests/README.md` — when L6 runs, what it asserts, how to add a flow
- [ ] At least one flow's expected-trajectory JSON committed and matched against actual

### Why this matters

Today's release sequence: `dev → manual L5 dogfood (30-45 min) → release.sh`. With L6: `dev → push tag → CI runs L6 → release.sh auto`. This removes the human-in-the-loop bottleneck for routine releases. Manual L5 stays as the safety net for major-version cuts and edge cases.

This is enabling tech for everything else — once L6 lands, every future release is faster and safer.
