🧪 feat(tests): L5+L6 combined Docker harness — replaces manual L5 (#112) #113

Merged: ZaxShen merged 1 commit into dev from feat/112-l5-l6-combined-docker on Apr 26, 2026

@ZaxShen ZaxShen commented Apr 26, 2026

Closes #112.

Summary

Replaces manual L5 dogfood for everything except UX-only verification. Builds a Docker image that:

  1. Simulates CC's marketplace install path (bun install --ignore-scripts → stages at ~/.claude/plugins/cache/trustmybot/tmb/<version>/)
  2. Asserts install-correctness (dist/, schema.sql present)
  3. Installs Claude Code CLI
  4. Runs L6 deterministic-trajectory flows against the marketplace-installed plugin

Catches BOTH install-path bugs AND workflow doctrine bugs in ONE Docker build.
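
A minimal sketch of the install-correctness assertion from step 2, assuming the staged plugin directory is passed as an argument. The name assert_install is hypothetical, and searching the whole tree for schema.sql is an assumption, since this PR does not say where the file sits inside the plugin.

```shell
# Hypothetical helper: fail if the staged plugin lacks required artifacts.
assert_install() {
  plugin_dir="$1"
  missing=0
  # dist/ must exist as a directory in the staged plugin.
  if [ ! -d "$plugin_dir/dist" ]; then
    echo "FAIL: dist/ missing under $plugin_dir" >&2
    missing=1
  fi
  # schema.sql may live anywhere in the tree (its location is an assumption).
  if [ -z "$(find "$plugin_dir" -type f -name schema.sql 2>/dev/null | head -n 1)" ]; then
    echo "FAIL: schema.sql missing under $plugin_dir" >&2
    missing=1
  fi
  return "$missing"
}
```

Called as `assert_install "$HOME/.claude/plugins/cache/trustmybot/tmb/<version>"`, a non-zero return would fail the build step that runs it.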

Per user direction

"You mentioned the L5 and L6 is token heavy. No problem, we only run the full test framework per release, not per PR."

"Automating 99% test cases across all layers by deterministic methods, leading to minimizing our manual tests and making the testing framework standard."

This PR is the load-bearing piece of that vision.

Files

  • tests/docker/l5-l6-combined.Dockerfile — combined install + claude + L6 flows
  • tests/docker/run-l5-l6-combined.sh — local convenience wrapper (BuildKit secret for token)
  • .github/workflows/l5-l6-combined.yml — release-only CI (tag pushes + manual dispatch)

Token security

CLAUDE_CODE_OAUTH_TOKEN passed via Docker BuildKit secret (mounted at /run/secrets/cc_token), NOT baked into image layers. Standard Docker secret pattern.
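
A hedged sketch of the secret pattern, not the actual Dockerfile or wrapper. The commented docker invocation is an assumption about how the wrapper might call BuildKit, and with_cc_token is a hypothetical helper showing how a step could read /run/secrets/cc_token without exporting the token into the surrounding shell.

```shell
# Host side (shown for context, not executed here):
#   docker build --secret id=cc_token,env=CLAUDE_CODE_OAUTH_TOKEN \
#     -f tests/docker/l5-l6-combined.Dockerfile .
# Dockerfile side: RUN --mount=type=secret,id=cc_token <command>
# BuildKit then exposes the secret at /run/secrets/cc_token for that RUN only.

# Hypothetical helper: run a command with the token read from a secret file.
# The variable is set only in the environment of the command being run.
with_cc_token() {
  secret_file="$1"
  shift
  if [ ! -s "$secret_file" ]; then
    echo "no token at $secret_file" >&2
    return 1
  fi
  CLAUDE_CODE_OAUTH_TOKEN="$(cat "$secret_file")" "$@"
}
```

Inside the image this could look like `with_cc_token /run/secrets/cc_token sh tests/dogfood/run-l6.sh`. Because the secret mount exists only during that RUN step, the token does not end up in a layer.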

When this runs

| Trigger | Behavior |
|---|---|
| Tag push (v*) | Full L0 + L6 against the as-shipped plugin |
| workflow_dispatch | Manual trigger for ad-hoc validation |
| PRs to dev / main | Does not run (too token-heavy). PR-time uses L0+L6-light (already in CI). |

When secret is absent

Workflow soft-fails: the L0 install piece still runs, and the L6 piece is skipped with a ::warning:: notice, so forks and external PRs don't turn the check red.
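
The soft-fail behavior could be sketched like this. l6_gate is a hypothetical name and the warning text is illustrative; only the `::warning::` prefix (the GitHub Actions workflow command) and the secret name come from this PR.

```shell
# Hypothetical gate: run the given command only when the token is present.
# When it is absent, emit a GitHub Actions warning annotation and succeed,
# so the job stays green and only the L6 piece is skipped.
l6_gate() {
  if [ -z "${CLAUDE_CODE_OAUTH_TOKEN:-}" ]; then
    echo "::warning::CLAUDE_CODE_OAUTH_TOKEN is not set; skipping L6 flows"
    return 0
  fi
  "$@"
}
```

A workflow step might then run `l6_gate sh tests/docker/run-l5-l6-combined.sh`, leaving the install-only check to run unconditionally in an earlier step.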

Coverage matrix after this lands

| Bug class | Tested by |
|---|---|
| dist/ shipping | L0 (every PR) + L5+L6 (every release) |
| Native bindings (now N/A: node:sqlite) | L0 + L5+L6 |
| MCP server cold-spawn | L0 + L5+L6 |
| Hooks register correctly | L3 + L5+L6 |
| Bro doctrine (workflow correctness) | L4 (MCP-only) + L6 (real Claude on local source) + L5+L6 (real Claude on installed plugin) |
| Manual L5 dogfood | Reduced to UX-only items |

Test plan

  • L1-L4 still pass (no regression)
  • L0 install-smoke (CI)
  • L5+L6 combined first run — will execute when this PR's tag is pushed (after merge to dev → main → tag), OR via manual workflow_dispatch from any branch
  • Verify Linux marketplace cache path matches CC's expectation (if not, follow-up to fix the path arg)

Caveats

Next steps after merge

  1. Tag a v0.4.1.1 patch (or wait for v0.4.2) → CI auto-runs full L5+L6 on the tag
  2. If green: cut v0.4.1 stable with confidence; manual L5 no longer required
  3. Update tests/manual/scenarios.md to remove items now covered by automation

🤖 Generated with Claude Code

Replaces manual L5 dogfood with an automated Docker image that simulates
CC's marketplace install path THEN runs L6 deterministic-trajectory
flows against the marketplace-installed plugin.

Catches BOTH:
- Install-path bugs (L0's surface — dist/, native bindings, MCP cold spawn)
- Workflow doctrine bugs (L6's surface — does bro do the right thing?)

In ONE Docker build. Release-only (token-heavy: ~$1-3 per full run).

## What's new

- tests/docker/l5-l6-combined.Dockerfile
  - bun install --ignore-scripts (CC's actual install behavior)
  - Stages plugin at ~/.claude/plugins/cache/trustmybot/tmb/<version>/
  - Hard install assertions (dist/, schema.sql present)
  - npm install -g @anthropic-ai/claude-code
  - Runs tests/dogfood/run-l6.sh inside container with token from BuildKit secret

- tests/docker/run-l5-l6-combined.sh — local wrapper
  - With token: full L0 + L6 run
  - Without token: install-only check (L0 piece), L6 skipped cleanly

- .github/workflows/l5-l6-combined.yml — release-only CI
  - Triggers: tag pushes (every release), workflow_dispatch
  - Token passed as Docker BuildKit secret, NOT baked into image layers
  - Soft-fails if secret absent

## Why release-only

Per user direction: token-heavy tests run per-release, NOT per-PR. Cost is
amortized across releases (one run per tag), trading per-run cost for
elimination of the 30-45 min manual L5 dogfood per release.

## Caveats / next steps

- Verify the marketplace cache layout matches CC's actual Linux behavior
  (might differ slightly from macOS) on first CI run
- After this lands, manual L5 scenarios.md can be reduced to UX-only items

## Vision

User direction (2026-04-26): "automating 99% test cases across all layers
by deterministic methods, leading to minimizing our manual tests and
making the testing framework standard."

L0 + L1-L4 + L6 + L5+L6-combined = automated coverage of every regression
class we've ever shipped. Manual L5 stays as a thin layer for genuine UX
judgment that automation can't capture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ZaxShen ZaxShen merged commit e9e1c12 into dev Apr 26, 2026
2 checks passed
@ZaxShen ZaxShen deleted the feat/112-l5-l6-combined-docker branch April 26, 2026 18:17
@ZaxShen ZaxShen self-assigned this Apr 26, 2026
ZaxShen added a commit that referenced this pull request May 20, 2026
#113)

Subagents (pr-reviewer, swe in worktree) inherit cwd=~ and lack
TRAJECTORY_DB_PATH env, so walk-up fails to find the workspace DB.
Writes a sentinel at ~/.claude/tmb-active-workspace at SessionStart
pointing at the workspace path. tmb_db_path reads the sentinel as a
priority resolver before walk-up.

Layer 2 fix from #113's three-layer plan. Layer 1 (ToolSearch in
spawn prompt) was invalidated empirically; Layer 3 (upstream filing)
remains queued.
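
A sketch of the resolution order described above, under stated assumptions: resolve_db_path stands in for tmb_db_path, the DB file name trajectory.db is hypothetical, checking TRAJECTORY_DB_PATH first is an assumption, and the sentinel is assumed to hold a single line with the workspace path.

```shell
DB_NAME="${DB_NAME:-trajectory.db}"   # hypothetical file name

# Hypothetical stand-in for tmb_db_path: env var, then sentinel, then walk-up.
resolve_db_path() {
  # 1. An explicit env var wins when present (assumed ordering).
  if [ -n "${TRAJECTORY_DB_PATH:-}" ]; then
    printf '%s\n' "$TRAJECTORY_DB_PATH"
    return 0
  fi
  # 2. The sentinel written at SessionStart points at the workspace.
  sentinel="$HOME/.claude/tmb-active-workspace"
  if [ -r "$sentinel" ]; then
    ws="$(cat "$sentinel")"
    if [ -n "$ws" ] && [ -d "$ws" ]; then
      printf '%s\n' "${ws%/}/$DB_NAME"
      return 0
    fi
  fi
  # 3. Fall back to walking up from the current directory.
  dir="$PWD"
  while :; do
    if [ -e "${dir%/}/$DB_NAME" ]; then
      printf '%s\n' "${dir%/}/$DB_NAME"
      return 0
    fi
    if [ "$dir" = "/" ] || [ -z "$dir" ]; then
      break
    fi
    dir="$(dirname "$dir")"
  done
  return 1
}
```

With the sentinel in place, a subagent started with cwd=~ still resolves the workspace DB, because step 2 succeeds before the walk-up is attempted.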
ZaxShen added a commit that referenced this pull request May 20, 2026
🔥 fix(hooks): sentinel-file env propagation for subagent DB resolution (#113)

See merge request trustmybot/plugin!33
ZaxShen added a commit that referenced this pull request May 20, 2026
#117)

The test "silent no-op when workspace not detected" assumed walk-up
failure as the only path to no-workspace, but #113's sentinel resolver
(at ~/.claude/tmb-active-workspace) added a second path that the test
didn't account for. Fix: isolate HOME to a tmpdir for this case.
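
The HOME isolation described above could look roughly like this. with_isolated_home is a hypothetical helper, not the project's test code; it only illustrates why a throwaway HOME hides any real ~/.claude/tmb-active-workspace from the code under test.

```shell
# Hypothetical helper: run a command with HOME pointing at a fresh temp dir.
with_isolated_home() {
  tmp_home="$(mktemp -d)" || return 1
  rc=0
  # The subshell keeps the HOME override from leaking into the caller.
  ( HOME="$tmp_home"; export HOME; "$@" ) || rc=$?
  rm -rf "$tmp_home"
  return "$rc"
}
```

A test for the "silent no-op when workspace not detected" case would wrap the hook invocation in this helper, so neither the sentinel resolver nor walk-up can find a workspace.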
ZaxShen added a commit that referenced this pull request May 20, 2026
…undation)

New library at scripts/lib/sqlite3-fallback.sh exposes 6 wrapper
functions covering the most-used MCP write tools (validation_record,
task_update_status, discussion_append, ledger_log, issue_close,
file_registry_update_summaries). Each wrapper validates role, writes
via sqlite3 directly, and logs a synthetic ledger row
(event_type=mcp_unavailable_fallback_invoked) for audit integrity.

Formalizes the pattern bro has been using manually since #97/#113
subagent-MCP-availability issues surfaced. Doctrine shift from "writes
stay blocked under fallback" to "writes via fallback are sanctioned
with audit trail" — see #100 discussion id=53 for the architectural
Q+A.
ZaxShen added a commit that referenced this pull request May 20, 2026
…#118 + #119)

Two tightly-coupled fixes from 2026-04-29 L0-L5 verification:
- #118: scripts/hooks/write-active-workspace-sentinel.sh was committed
  at mode 100644 (sibling hooks are 100755). Production regression
  from #113. Chmod fix.
- #119: L0 install-smoke executable-check used (echo FAIL && exit 1)
  in a subshell, so build kept going past the FAIL. Replace with
  set -e + brace-group { ...; exit 1; } so exit propagates.

Without #119, #118 would have shipped silently (the build didn't fail
when the bug existed). Pairing them keeps the verification chain
honest going forward.
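
The #119 failure mode can be reproduced in isolation. This is a sketch, not the actual install-smoke step: check_buggy and check_fixed are hypothetical names, and each runs its snippet in a child sh so the difference in exit behavior is observable.

```shell
# Buggy form: parentheses start a subshell, so `exit 1` ends only that
# subshell. Without `set -e`, the script carries on to the next command.
check_buggy() {
  sh -c '
    [ -x "$1" ] || (echo "FAIL: $1 not executable" && exit 1)
    echo "build continues"
  ' _ "$1"
}

# Fixed form: braces group commands in the current shell, so `exit 1`
# terminates the script itself; `set -e` also stops on other failures.
check_fixed() {
  sh -c '
    set -e
    [ -x "$1" ] || { echo "FAIL: $1 not executable"; exit 1; }
    echo "build continues"
  ' _ "$1"
}
```

Given a file at mode 644, check_buggy prints the FAIL line, keeps going, and exits 0, while check_fixed stops at the FAIL line with exit status 1.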