🧪 feat(tests): L5+L6 combined Docker harness — replaces manual L5 (#112) by ZaxShen · Pull Request #113 · trustmybot/plugin

ZaxShen · 2026-04-26T17:44:36Z

Closes #112.

Summary

Replaces manual L5 dogfood for everything except UX-only verification. Builds a Docker image that:

- Simulates CC's marketplace install path (bun install --ignore-scripts, then stages the plugin at ~/.claude/plugins/cache/trustmybot/tmb/<version>/)
- Asserts install-correctness (dist/ and schema.sql present)
- Installs the Claude Code CLI
- Runs L6 deterministic-trajectory flows against the marketplace-installed plugin
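
The install-correctness step in the list above could look roughly like the sketch below. The helper name `assert_install_layout` and the idea that `schema.sql` may sit anywhere under the staged root are assumptions for illustration; only the two asserted artifacts and the cache path shape come from this PR.

```shell
#!/bin/sh
# Sketch only: assert_install_layout is a hypothetical helper, not the
# harness's actual code. It checks a staged plugin root for the two artifacts
# the PR names: a non-empty dist/ directory and a schema.sql file.
assert_install_layout() {
  root="$1"
  # dist/ must exist and contain at least one entry.
  if [ ! -d "$root/dist" ] || [ -z "$(ls -A "$root/dist")" ]; then
    echo "FAIL: dist/ missing or empty under $root" >&2
    return 1
  fi
  # Assumption: schema.sql may live anywhere under the root, so search for it.
  if [ -z "$(find "$root" -name schema.sql -type f | head -n 1)" ]; then
    echo "FAIL: schema.sql not found under $root" >&2
    return 1
  fi
  echo "OK: install layout looks complete under $root"
}
```

It would be called with the staged directory, for example `assert_install_layout "$HOME/.claude/plugins/cache/trustmybot/tmb/<version>"`, and a non-zero return is what should fail the build.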

Catches BOTH install-path bugs AND workflow doctrine bugs in ONE Docker build.

Per user direction

"You mentioned the L5 and L6 is token heavy. No problem, we only run the full test framework per release, not per PR."

"Automating 99% test cases across all layers by deterministic methods, leading to minimizing our manual tests and making the testing framework standard."

This PR is the load-bearing piece of that vision.

Files

- tests/docker/l5-l6-combined.Dockerfile: combined install + claude + L6 flows
- tests/docker/run-l5-l6-combined.sh: local convenience wrapper (BuildKit secret for token)
- .github/workflows/l5-l6-combined.yml: release-only CI (tag pushes + manual dispatch)

Token security

CLAUDE_CODE_OAUTH_TOKEN is passed as a Docker BuildKit secret (mounted at /run/secrets/cc_token), NOT baked into image layers. This is the standard Docker secret pattern.
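
As a rough illustration of the secret hand-off, the sketch below assembles the build command and prints it instead of running it. The secret id `cc_token` is inferred from the mount path above, and the image tag `tmb-l5-l6` and function name `print_build_cmd` are hypothetical; the real wrapper may differ.

```shell
#!/bin/sh
# Sketch only: prints the docker build command instead of running it.
# Assumptions: secret id "cc_token" (inferred from /run/secrets/cc_token) and
# the image tag "tmb-l5-l6" are illustrative, not taken from the wrapper.
print_build_cmd() {
  dockerfile="tests/docker/l5-l6-combined.Dockerfile"
  set -- docker build -f "$dockerfile" -t tmb-l5-l6
  if [ -n "${CLAUDE_CODE_OAUTH_TOKEN:-}" ]; then
    # BuildKit reads the value from the environment at build time; only the
    # variable name appears on the command line, never the token itself.
    set -- "$@" --secret id=cc_token,env=CLAUDE_CODE_OAUTH_TOKEN
  fi
  set -- "$@" .
  echo "$*"
}

print_build_cmd
```

Inside the Dockerfile, a step would then consume the secret with `RUN --mount=type=secret,id=cc_token ...`, which exposes `/run/secrets/cc_token` only for the duration of that step.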

When this runs

| Trigger | Behavior |
| --- | --- |
| Tag push (`v*`) | Full L0 + L6 against the as-shipped plugin |
| `workflow_dispatch` | Manual trigger for ad-hoc validation |
| PRs to `dev` / `main` | Does not run (too token-heavy). PR-time CI uses L0 + L6-light, which is already in place. |

When secret is absent

The workflow soft-fails: the L0 install piece still runs, and the L6 piece is skipped with a ::warning:: notice, so forks and external PRs don't go red.
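
A minimal sketch of that soft-fail guard, assuming the token reaches the step as the environment variable `CLAUDE_CODE_OAUTH_TOKEN`. The function name `run_l6_or_skip` is hypothetical, and the command passed to it stands in for the real L6 invocation.

```shell
#!/bin/sh
# Sketch only: run_l6_or_skip is a hypothetical step body. It skips L6 with a
# GitHub Actions warning annotation when the token is empty, and returns 0 so
# the job stays green.
run_l6_or_skip() {
  if [ -z "${CLAUDE_CODE_OAUTH_TOKEN:-}" ]; then
    echo "::warning::CLAUDE_CODE_OAUTH_TOKEN not set; skipping L6 flows"
    return 0
  fi
  # Token present: run whatever L6 command the caller supplied.
  "$@"
}
```

Returning 0 on the skip path is the design choice that keeps forks green; a stricter variant could fail on tag pushes, where the secret is expected to exist.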

Coverage matrix after this lands

| Bug class | Tested by |
| --- | --- |
| dist/ shipping | L0 (every PR) + L5+L6 (every release) |
| Native bindings (now N/A: node:sqlite) | L0 + L5+L6 |
| MCP server cold-spawn | L0 + L5+L6 |
| Hooks register correctly | L3 + L5+L6 |
| Bro doctrine (workflow correctness) | L4 (MCP-only) + L6 (real Claude on local source) + L5+L6 (real Claude on installed plugin) |
| Manual L5 dogfood | Reduced to UX-only items |

Test plan

- L1-L4 still pass (no regression)
- L0 install-smoke (CI)
- L5+L6 combined first run: will execute when this PR's tag is pushed (after merge to dev → main → tag), OR via manual workflow_dispatch from any branch
- Verify Linux marketplace cache path matches CC's expectation (if not, follow-up to fix the path arg)

Caveats

- The first CI run will reveal whether ~/.claude/plugins/cache/trustmybot/tmb/<version>/ is the exact layout CC expects on Linux. It might need adjustment.
- The L6 piece needs claude -p to work in headless Docker. This is the same unverified assumption as in #108 ("Automate L5 dogfood as L6: deterministic-trajectory tests in Docker with real Claude Code").

Next steps after merge

- Tag a v0.4.1.1 patch (or wait for v0.4.2) → CI auto-runs full L5+L6 on the tag
- If green: cut v0.4.1 stable with confidence; manual L5 no longer required
- Update tests/manual/scenarios.md to remove items now covered by automation

🤖 Generated with Claude Code

Replaces manual L5 dogfood with an automated Docker image that simulates CC's marketplace install path THEN runs L6 deterministic-trajectory flows against the marketplace-installed plugin.

Catches BOTH:
- Install-path bugs (L0's surface: dist/, native bindings, MCP cold spawn)
- Workflow doctrine bugs (L6's surface: does bro do the right thing?)

In ONE Docker build. Release-only (token-heavy: ~$1-3 per full run).

## What's new

- tests/docker/l5-l6-combined.Dockerfile
  - bun install --ignore-scripts (CC's actual install behavior)
  - Stages plugin at ~/.claude/plugins/cache/trustmybot/tmb/<version>/
  - Hard install assertions (dist/, schema.sql present)
  - npm install -g @anthropic-ai/claude-code
  - Runs tests/dogfood/run-l6.sh inside container with token from BuildKit secret
- tests/docker/run-l5-l6-combined.sh: local wrapper
  - With token: full L0 + L6 run
  - Without token: install-only check (L0 piece), L6 skipped cleanly
- .github/workflows/l5-l6-combined.yml: release-only CI
  - Triggers: tag pushes (every release), workflow_dispatch
  - Token passed as Docker BuildKit secret, NOT baked into image layers
  - Soft-fails if secret absent

## Why release-only

Per user direction: token-heavy tests run per-release, NOT per-PR. Cost is amortized across releases (one run per tag), trading per-run cost for elimination of the 30-45 min manual L5 dogfood per release.

## Caveats / next steps

- Verify the marketplace cache layout matches CC's actual Linux behavior (might differ slightly from macOS) on first CI run
- After this lands, manual L5 scenarios.md can be reduced to UX-only items

## Vision

User direction (2026-04-26): "automating 99% test cases across all layers by deterministic methods, leading to minimizing our manual tests and making the testing framework standard."

L0 + L1-L4 + L6 + L5+L6-combined = automated coverage of every regression class we've ever shipped. Manual L5 stays as a thin layer for genuine UX judgment that automation can't capture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

#113) Subagents (pr-reviewer, swe in worktree) inherit cwd=~ and lack TRAJECTORY_DB_PATH env, so walk-up fails to find the workspace DB. Writes a sentinel at ~/.claude/tmb-active-workspace at SessionStart pointing at the workspace path. tmb_db_path reads the sentinel as a priority resolver before walk-up. Layer 2 fix from #113's three-layer plan. Layer 1 (ToolSearch in spawn prompt) was invalidated empirically; Layer 3 (upstream filing) remains queued.
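
The resolution order described in that commit message (sentinel first, then walk-up) might be sketched as follows. `resolve_workspace` is a hypothetical stand-in for the logic inside `tmb_db_path`, and the `.tmb` marker directory used for the walk-up is an assumption made purely for illustration; only the sentinel path comes from the commit message.

```shell
#!/bin/sh
# Sketch only: resolve_workspace is a hypothetical stand-in for the resolver
# inside tmb_db_path. The ".tmb" marker directory used by the walk-up is an
# assumption; the sentinel path is the one named in the commit message.
resolve_workspace() {
  sentinel="$HOME/.claude/tmb-active-workspace"
  # Priority 1: sentinel written at SessionStart, if it names a real directory.
  if [ -f "$sentinel" ]; then
    ws=$(cat "$sentinel")
    if [ -d "$ws" ]; then
      echo "$ws"
      return 0
    fi
  fi
  # Priority 2: walk up from the current directory looking for the marker.
  dir=$PWD
  while [ -n "$dir" ] && [ "$dir" != "/" ]; do
    if [ -d "$dir/.tmb" ]; then
      echo "$dir"
      return 0
    fi
    dir=$(dirname "$dir")
  done
  # Nothing found: the caller decides whether that is a silent no-op.
  return 1
}
```

Because the sentinel is consulted first, a subagent started with its working directory at the home directory can still find the workspace even though walk-up alone would fail.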

🔥 fix(hooks): sentinel-file env propagation for subagent DB resolution (#113) See merge request trustmybot/plugin!33

#117) The test "silent no-op when workspace not detected" assumed walk-up failure as the only path to no-workspace, but #113's sentinel resolver (at ~/.claude/tmb-active-workspace) added a second path that the test didn't account for. Fix: isolate HOME to a tmpdir for this case.
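
One way to isolate HOME for such a test case is sketched below. `with_isolated_home` is a hypothetical helper, not the name used in the repository's test suite.

```shell
#!/bin/sh
# Sketch only: with_isolated_home is a hypothetical test helper. It runs the
# given command with HOME pointing at a fresh temporary directory, so a
# sentinel left in the real home directory cannot influence the result.
with_isolated_home() {
  fake_home=$(mktemp -d) || return 1
  # Subshell keeps the HOME override from leaking into the caller.
  (
    HOME="$fake_home"
    export HOME
    "$@"
  )
  status=$?
  rm -rf "$fake_home"
  return "$status"
}
```

A no-workspace test would wrap the code under test in this helper, for example `with_isolated_home sh -c 'test ! -e "$HOME/.claude/tmb-active-workspace"'`.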

…undation) New library at scripts/lib/sqlite3-fallback.sh exposes 6 wrapper functions covering the most-used MCP write tools (validation_record, task_update_status, discussion_append, ledger_log, issue_close, file_registry_update_summaries). Each wrapper validates role, writes via sqlite3 directly, and logs a synthetic ledger row (event_type=mcp_unavailable_fallback_invoked) for audit integrity. Formalizes the pattern bro has been using manually since #97/#113 subagent-MCP-availability issues surfaced. Doctrine shift from "writes stay blocked under fallback" to "writes via fallback are sanctioned with audit trail" — see #100 discussion id=53 for the architectural Q+A.

…#118 + #119) Two tightly-coupled fixes from 2026-04-29 L0-L5 verification:

- #118: scripts/hooks/write-active-workspace-sentinel.sh was committed at mode 100644 (sibling hooks are 100755). Production regression from #113. Chmod fix.
- #119: L0 install-smoke executable-check used (echo FAIL && exit 1) in a subshell, so build kept going past the FAIL. Replace with set -e + brace-group { ...; exit 1; } so exit propagates.

Without #119, #118 would have shipped silently (the build didn't fail when the bug existed). Pairing them keeps the verification chain honest going forward.
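
The #119 failure mode can be reproduced in isolation. The sketch below runs both forms in a child shell and reports the exit status; the path `/nonexistent/hook.sh` and the helper `status_of` are hypothetical, chosen only so the executable check always fails.

```shell
#!/bin/sh
# Sketch only: both snippets run in a child shell so the demo itself keeps
# going. /nonexistent/hook.sh is a hypothetical path that always fails test -x.

# Buggy form: "exit 1" only leaves the ( ... ) subshell; the next command
# still runs and the overall status ends up 0.
buggy='test -x /nonexistent/hook.sh || (echo FAIL && exit 1); echo "build continued"'

# Fixed form: set -e plus a brace group, so "exit 1" ends the shell itself.
fixed='set -e; test -x /nonexistent/hook.sh || { echo FAIL; exit 1; }; echo "build continued"'

# Print the exit status of running a snippet in a fresh shell.
status_of() {
  if sh -c "$1" >/dev/null; then echo 0; else echo "$?"; fi
}

echo "subshell form exits with: $(status_of "$buggy")"      # prints 0
echo "brace-group form exits with: $(status_of "$fixed")"   # prints 1
```

A status of 0 from the subshell form is exactly why the build kept going past the FAIL line.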

ZaxShen merged commit e9e1c12 into dev Apr 26, 2026
2 checks passed

ZaxShen deleted the feat/112-l5-l6-combined-docker branch April 26, 2026 18:17

ZaxShen self-assigned this Apr 26, 2026

ZaxShen added a commit that referenced this pull request May 20, 2026

Merge branch 'fix/113-mcp-env-sentinel' into 'dev'

0bd6eb7

