feat: sandbox test harness, multi-agent expansion plan, and project roadmap #17
Conversation
Track A of multi-agent sandbox expansion complete:

- Split sandbox-test-harness.md into sandbox-architecture.md (shared two-layer design) and sandbox-claude-code.md (Claude Code-specific implementation)
- Created multi-agent-sandbox.md exec plan with 5 implementation tracks (doc restructure, fixtures, Layer 1 coverage, Docker expansion, per-agent docs)
- Created ROADMAP.md with Done/In Progress/Planned sections, agent support matrix, and Skill Quality Infrastructure features (badges, auto-evolve, marketplace integration, conflict detection)
- Updated design docs index with new file names
- Added docs/strategy/ to .gitignore (strategy docs kept locally as private)

Includes devcontainer setup, workflow improvements, README consolidation, and infrastructure updates for multi-agent support.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
📝 Walkthrough

Summary by CodeRabbit

Walkthrough

Adds a sandbox testing infrastructure (Layer-1 local + Layer-2 devcontainer/LLM), devcontainer and Docker images, fixture provisioning and orchestration scripts, init firewall, GitHub Actions pinning/permissions, numerous docs/Makefile/package edits, CODEOWNERS, and repo config updates across ~40 files.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Host as Host (make sandbox)
    participant Tmp as TempSandbox (Layer 1)
    participant Project as ProjectFiles
    participant CLI as selftune CLI
    participant Hooks as Hooks
    participant Results as results/ JSON
    Host->>Tmp: create temp HOME, set HOME env
    Host->>Project: copy fixtures (logs, transcripts, config)
    loop run CLI commands
        Host->>CLI: invoke commands (doctor, evals, dashboard, contribute, etc.)
        CLI->>Tmp: read/write under HOME
        CLI-->>Host: stdout/stderr + exit code
        Host->>Results: record per-command result
    end
    loop run hooks
        Host->>Hooks: invoke prompt-log, skill-eval, session-stop (pipe JSON)
        Hooks->>Tmp: append/modify JSONL logs
        Hooks-->>Host: exit + status
        Host->>Results: record hook result
    end
    Host->>Results: write timestamped report
    Host->>Tmp: cleanup (or keep)
```
```mermaid
sequenceDiagram
    participant Dev as Devcontainer Startup
    participant FW as iptables (init-firewall.sh)
    participant Docker as Docker Volumes
    participant Entrypoint as entrypoint.sh
    participant Fixtures as FixtureFiles (.claude/.selftune)
    participant Tests as run-with-llm.ts
    participant Claude as Claude Code CLI/API
    participant Results as results/ JSON
    Dev->>FW: apply iptables rules (DNS/SSH/HTTPS/HTTP)
    Dev->>Docker: mount volumes (sandbox HOME, results)
    Dev->>Entrypoint: exec provisioning script
    Entrypoint->>Fixtures: copy fixtures into HOME
    Entrypoint->>Tests: start LLM test runner
    Tests->>Claude: invoke claude -p / API calls
    Claude-->>Tests: responses
    Tests->>Results: validate & write final report
    Tests-->>Dev: exit status
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Sandbox fixtures are intentional test data that should be reviewed, not vendored/generated files to ignore. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 21
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.devcontainer/devcontainer.json:
- Line 10: Remove the unnecessary NET_RAW capability from the devcontainer run
arguments: edit the runArgs array to drop the "--cap-add=NET_RAW" entry so only
"--cap-add=NET_ADMIN" remains; this targets the runArgs declaration and aligns
with the firewall script (init-firewall.sh) which only requires NET_ADMIN and
does not use raw sockets or packet crafting.
In @.devcontainer/Dockerfile:
- Around line 19-23: SNIPPET is defined but never written into any shell
startup, so interactive shells don't use /commandhistory/.bash_history; update
the Dockerfile to write the SNIPPET into a startup file (e.g., create
/etc/profile.d/commandhistory.sh or append to the user's ~/.bashrc) after
creating /commandhistory and setting ownership, by echoing the export lines for
PROMPT_COMMAND and HISTFILE into that file and ensuring it has appropriate
permissions so the shell will source it for interactive sessions; reference the
SNIPPET variable, PROMPT_COMMAND, HISTFILE, /commandhistory/.bash_history and
$USERNAME when making the change so the Docker build persists shell history.
- Around line 33-34: Update both RUN lines that execute the Bun installer ("RUN
curl -fsSL https://bun.sh/install | bash" and the similar second occurrence) to
run the pipeline under a shell that has strict failure handling; prepend a
POSIX-safe strict mode (set -euo pipefail) so that a failed curl or a failing
pipeline will cause the Docker build to fail rather than silently continuing.
Apply this change to both RUN instructions that invoke the bun installer.
In @.devcontainer/init-firewall.sh:
- Line 11: The iptables rule added to the INPUT chain uses the output interface
flag (-o lo) which is invalid there; update the iptables invocation that
currently adds the loopback accept rule (the line running "iptables -A INPUT -o
lo -j ACCEPT") to use the input interface flag (-i lo) instead so it reads as an
INPUT-chain loopback accept; keep the same chain/action/position but replace -o
with -i to prevent the script from failing under set -e.
- Around line 7-22: Add explicit default-deny policies by setting "iptables -P
INPUT DROP", "iptables -P OUTPUT DROP", and "iptables -P FORWARD DROP" before
any -A (append) rules, then keep the existing allow rules (DNS, SSH, loopback,
ESTABLISHED,RELATED, HTTP/HTTPS) after those policies; also correct the loopback
input rule by changing the INPUT rule from using "-o lo" to "-i lo" (replace the
current "iptables -A INPUT -o lo -j ACCEPT" with "iptables -A INPUT -i lo -j
ACCEPT") so loopback traffic is properly allowed.
In `@docs/design-docs/sandbox-architecture.md`:
- Around line 41-47: Add a language tag to the fenced code block that shows the
test fixture tree (change ``` to ```text) to satisfy MD040, and insert a single
blank line after each of the headings in the same section (the headings
surrounding the fixtures block referenced near the tests/sandbox/fixtures tree)
to satisfy MD022/MD040; update the fenced block and add blank lines immediately
after those heading lines so lint no longer reports MD040/MD022.
In `@docs/design-docs/sandbox-claude-code.md`:
- Line 36: Update the docs entry that currently lists Layer 2 as
`tests/sandbox/claude-code/` to the actual location used by the Makefile:
`tests/sandbox/docker/docker-compose.yml`; change the referenced path in
docs/design-docs/sandbox-claude-code.md and any mention of Layer 2 so it matches
the Makefile targets `sandbox-llm` and `sandbox-shell` which point to
`tests/sandbox/docker/docker-compose.yml`.
In `@docs/exec-plans/active/multi-agent-sandbox.md`:
- Around line 1-3: The MD041/MD040 violations come from the verification HTML
comment placed before the H1 and an untyped code fence; move the verification
comment "<!-- Verified: 2026-03-02 -->" so it appears immediately below the H1
"# Execution Plan: Multi-Agent Sandbox Expansion", and update the directory-tree
code fence to include a language specifier (use "text") for the block that lists
tests/sandbox/; apply the same change pattern to the other similar blocks
referenced around the content (lines noted 26-43) so all first-line headings
precede comments and all fences use a language (e.g., ```text).
In `@docs/exec-plans/tech-debt-tracker.md`:
- Line 15: The TD-007 row in the debt tracker was edited but the "Updated" date
wasn't changed; update the "Updated" column for the TD-007 row (the row
mentioning deployProposal / deploy-proposal.ts and evolve.ts) to the current
audit date (e.g., 2026-03-02) so metadata remains consistent with the edit.
In `@docs/golden-principles.md`:
- Line 95: Remove the hardcoded line-number reference in the sentence and
instead point to the lint rule by name or exported identifier in
lint-architecture.ts that enforces the isolated export path (e.g., the
rule/check that enforces "contribute" not importing from hooks/ ingestors/
grading/ evolution/ monitoring). Update the docs/golden-principles.md text to
say something like "enforced by the isolated-export-path rule in
lint-architecture.ts" (or use the actual exported rule/constant/function name
from lint-architecture.ts) so the doc references the rule identifier rather than
specific line numbers.
In `@docs/product-specs/index.md`:
- Around line 10-11: The two table rows in docs/product-specs/index.md
referencing "[ICP & GTM Strategy](../strategy/icp-gtm-strategy.md)" and
"[OpenClaw Integration](../strategy/openclaw-integration.md)" point to a
gitignored/private docs/strategy directory and will 404 for others; either move
the source files into a tracked docs folder and update the links to their new
paths, or replace those two linked entries with plain text labels (e.g., "ICP &
GTM Strategy — draft" and "OpenClaw Integration — draft") until the publishable
docs exist so the index no longer links to private paths.
In `@Makefile`:
- Line 1: Add missing phony targets and minimal implementations for Makefile
compliance: update the .PHONY declaration to include "all" and "clean", add an
"all" target that depends on the existing "check" target (so running `make` runs
checks), and add a "clean" target that removes build artifacts (e.g.,
temporary/build output directories and generated files used by the repo) to
satisfy checkmake requirements; reference the .PHONY line and the "check" target
when making these changes.
In `@package.json`:
- Around line 53-54: Keep a single canonical script and make the other an alias:
remove the duplicate command by preserving "start" with the real command "bun
run cli/selftune/index.ts --help" and change "boot" to run that script (e.g.,
set "boot" to "npm run start" or "bun run npm: start") so "boot" is an alias to
"start" and future drift is avoided; update the package.json "scripts" entries
for "start" and "boot" accordingly.
In `@tests/sandbox/docker/Dockerfile`:
- Around line 6-7: The Dockerfile uses ARG CLAUDE_CODE_VERSION=latest which
makes builds non-reproducible; change the default to a pinned semantic version
(e.g., ARG CLAUDE_CODE_VERSION=1.2.3) and ensure any other occurrence (same ARG
at the other location referenced) is updated consistently so builds use the
fixed version unless overridden at build time; update documentation or build
scripts to pass the desired version via --build-arg CLAUDE_CODE_VERSION when
needed and add a test or note to verify the pinned value is used during sandbox
builds.
In `@tests/sandbox/docker/run-with-llm.ts`:
- Around line 143-152: The JSON.parse call for grading-result.json can throw an
opaque error if the file is malformed; wrap the read + JSON.parse (where
resultPath, readFileSync and JSON.parse are used in run-with-llm.ts) in a
try-catch, catch parse errors and throw a new Error that includes the resultPath
and the original error message (or attach it) so the failure clearly indicates
which file failed to parse and why; keep the existing existsSync check and only
proceed to parse inside the try block, rethrow or throw a new descriptive Error
on failure.
- Around line 67-76: The code currently awaits proc.exited before consuming the
process streams, which can deadlock; change the order in the function that
spawns Bun (using the proc variable) so you create promises for stdout and
stderr (e.g., const stdoutPromise = new Response(proc.stdout).text(); const
stderrPromise = new Response(proc.stderr).text();) before awaiting proc.exited,
then await proc.exited and finally await those promises (await stdoutPromise,
await stderrPromise) to return { exitCode, stdout, stderr }—update references to
proc, proc.stdout, proc.stderr, and proc.exited accordingly.
- Around line 97-105: The current code awaits proc.exited before consuming
proc.stdout/proc.stderr which can deadlock if the child blocks writing; change
the flow to start reading both streams into text Promises immediately (e.g.,
create stdoutPromise and stderrPromise from new Response(proc.stdout).text() and
new Response(proc.stderr).text()), then await Promise.race/Promise.all with
proc.exited and the timeout, and only after the exit resolve use the
already-started stdout/stderr promises (await stdoutPromise and stderrPromise)
to include output in the error message; ensure the timeout (timeout) is still
cleared when exit completes and that proc.kill() is called on timeout.
In `@tests/sandbox/provision-claude.sh`:
- Line 45: The cp of "${FIXTURES}/transcripts/*.jsonl" into
"${TARGET_HOME}/.claude/projects/default/" will fail if the glob matches
nothing; modify the script around that cp to guard the copy by first checking
for the existence of matching files (e.g., test for files matching
"${FIXTURES}/transcripts/"/*.jsonl or use shell option nullglob) and only run
the cp (or a safe loop to copy each file) when at least one transcript exists;
ensure the target directory "${TARGET_HOME}/.claude/projects/default/" is
created (mkdir -p) before copying.
In `@tests/sandbox/run-sandbox.ts`:
- Around line 221-229: The main function currently logs and calls setupSandbox()
but doesn't guarantee cleanup on exceptions; wrap the actionable portion of main
(everything after setupSandbox() and any early-exit logic) in a try/finally so
cleanupSandbox() in your existing cleanup block (see the cleanup logic around
the block referenced at lines 458-471) always runs; ensure any async calls
inside the try are awaited so the finally runs at the correct time, and rethrow
the error after cleanup if you want the caller to observe failures.
- Around line 53-93: The setupSandbox function duplicates provisioning steps
already implemented in provision-claude.sh (creating SANDBOX_SELFTUNE_DIR,
SANDBOX_PROJECTS_DIR, copying FIXTURES_DIR files into SANDBOX_CLAUDE_DIR and
SANDBOX_PROJECTS_DIR, and copying selftune/claude settings); replace the
duplicated logic by invoking the existing provision-claude.sh script (or
refactor the shell logic into a shared TypeScript helper) from setupSandbox so
there is a single source of truth for fixture paths and copying behavior; keep
references to the same SANDBOX_* and FIXTURES_DIR constants and ensure
setupSandbox either spawns the script (capturing errors) or calls the shared
helper function so tests and Docker provisioning use identical logic.
- Around line 210-215: The current check in fileHasNewContent (and similar
functions around lines 337-390) can miscount empty files because calling trim()
before split may yield a single empty string; change the logic to first read the
raw content, return false immediately if content === "" (or content.length ===
0), or split on "\n" without trimming and then filter out blank lines via
.filter(l => l.trim()) to compute lines.length; update fileHasNewContent to use
this pattern so first-write append checks correctly treat empty files as 0
lines.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (17)
The following files are excluded by `!**/fixtures/**`:

- tests/sandbox/fixtures/all_queries_log.jsonl
- tests/sandbox/fixtures/claude-settings.json
- tests/sandbox/fixtures/evolution_audit_log.jsonl
- tests/sandbox/fixtures/hook-payloads/post-tool-use.json
- tests/sandbox/fixtures/hook-payloads/prompt-submit.json
- tests/sandbox/fixtures/hook-payloads/session-stop.json
- tests/sandbox/fixtures/selftune-config.json
- tests/sandbox/fixtures/session_telemetry_log.jsonl
- tests/sandbox/fixtures/skill_usage_log.jsonl
- tests/sandbox/fixtures/skills/ai-image-generation/SKILL.md
- tests/sandbox/fixtures/skills/find-skills/SKILL.md
- tests/sandbox/fixtures/skills/frontend-design/SKILL.md
- tests/sandbox/fixtures/transcripts/session-001.jsonl
- tests/sandbox/fixtures/transcripts/session-002.jsonl
- tests/sandbox/fixtures/transcripts/session-003.jsonl
- tests/sandbox/fixtures/transcripts/session-004.jsonl
- tests/sandbox/fixtures/transcripts/session-005.jsonl
📒 Files selected for processing (35)
- .devcontainer/Dockerfile
- .devcontainer/devcontainer.json
- .devcontainer/init-firewall.sh
- .github/CODEOWNERS
- .github/workflows/auto-bump-cli-version.yml
- .github/workflows/ci.yml
- .github/workflows/codeql.yml
- .github/workflows/publish.yml
- .github/workflows/scorecard.yml
- .gitignore
- AGENTS.md
- ARCHITECTURE.md
- CHANGELOG.md
- CONTRIBUTING.md
- Makefile
- PRD.md
- README.md
- ROADMAP.md
- docs/design-docs/index.md
- docs/design-docs/sandbox-architecture.md
- docs/design-docs/sandbox-claude-code.md
- docs/exec-plans/active/multi-agent-sandbox.md
- docs/exec-plans/completed/.gitkeep
- docs/exec-plans/tech-debt-tracker.md
- docs/golden-principles.md
- docs/product-specs/index.md
- llms.txt
- package.json
- risk-policy.json
- tests/sandbox/docker/Dockerfile
- tests/sandbox/docker/docker-compose.yml
- tests/sandbox/docker/entrypoint.sh
- tests/sandbox/docker/run-with-llm.ts
- tests/sandbox/provision-claude.sh
- tests/sandbox/run-sandbox.ts
```sh
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -o lo -j ACCEPT

# Allow established connections
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow HTTPS (port 443) for Claude API, npm, GitHub
iptables -A OUTPUT -p tcp --dport 443 -j ACCEPT

# Allow HTTP (port 80) for package registries
iptables -A OUTPUT -p tcp --dport 80 -j ACCEPT
```
🧩 Analysis chain
🏁 Script executed:
```sh
cd .devcontainer && ls -la | grep init && cat -n init-firewall.sh
```

Repository: WellDunDun/selftune
Length of output: 999
🏁 Script executed:
```sh
# Verify no DROP policies exist in the file
rg -c 'iptables -P' .devcontainer/init-firewall.sh || echo "No -P policy rules found"
```

Repository: WellDunDun/selftune
Length of output: 87
Set default policies to DROP before appending allow rules.
The firewall config uses only ACCEPT rules with no default policies defined. Without explicit -P INPUT DROP, -P OUTPUT DROP, -P FORWARD DROP at the start, traffic not matching any rule still passes through default system policies (typically ACCEPT). This defeats the deny-by-default approach.
Also fix line 11: iptables -A INPUT -i lo -j ACCEPT (use -i for input interface, not -o).
Required changes

```diff
+iptables -P INPUT DROP
+iptables -P OUTPUT DROP
+iptables -P FORWARD DROP
+
 # Allow DNS, SSH, localhost
 iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
 iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
 iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
 iptables -A OUTPUT -o lo -j ACCEPT
-iptables -A INPUT -o lo -j ACCEPT
+iptables -A INPUT -i lo -j ACCEPT
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.devcontainer/init-firewall.sh around lines 7 - 22, Add explicit
default-deny policies by setting "iptables -P INPUT DROP", "iptables -P OUTPUT
DROP", and "iptables -P FORWARD DROP" before any -A (append) rules, then keep
the existing allow rules (DNS, SSH, loopback, ESTABLISHED,RELATED, HTTP/HTTPS)
after those policies; also correct the loopback input rule by changing the INPUT
rule from using "-o lo" to "-i lo" (replace the current "iptables -A INPUT -o lo
-j ACCEPT" with "iptables -A INPUT -i lo -j ACCEPT") so loopback traffic is
properly allowed.
tests/sandbox/provision-claude.sh (Outdated)
```sh
cp "${FIXTURES}/skills/ai-image-generation/SKILL.md" "${TARGET_HOME}/.claude/skills/ai-image-generation/SKILL.md"

# Session transcripts
cp "${FIXTURES}"/transcripts/*.jsonl "${TARGET_HOME}/.claude/projects/default/"
```
Guard transcript copy when the fixture glob is empty.
The current wildcard copy can fail hard when no .jsonl transcript files are present.
Suggested fix

```diff
 # Session transcripts
-cp "${FIXTURES}"/transcripts/*.jsonl "${TARGET_HOME}/.claude/projects/default/"
+shopt -s nullglob
+transcript_files=("${FIXTURES}"/transcripts/*.jsonl)
+if ((${#transcript_files[@]} > 0)); then
+  cp "${transcript_files[@]}" "${TARGET_HOME}/.claude/projects/default/"
+fi
+shopt -u nullglob
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```sh
# Session transcripts
shopt -s nullglob
transcript_files=("${FIXTURES}"/transcripts/*.jsonl)
if ((${#transcript_files[@]} > 0)); then
  cp "${transcript_files[@]}" "${TARGET_HOME}/.claude/projects/default/"
fi
shopt -u nullglob
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/sandbox/provision-claude.sh` at line 45, The cp of
"${FIXTURES}/transcripts/*.jsonl" into
"${TARGET_HOME}/.claude/projects/default/" will fail if the glob matches
nothing; modify the script around that cp to guard the copy by first checking
for the existence of matching files (e.g., test for files matching
"${FIXTURES}/transcripts/"/*.jsonl or use shell option nullglob) and only run
the cp (or a safe loop to copy each file) when at least one transcript exists;
ensure the target directory "${TARGET_HOME}/.claude/projects/default/" is
created (mkdir -p) before copying.
```ts
function setupSandbox(): void {
  // Create directory structure
  mkdirSync(SANDBOX_SELFTUNE_DIR, { recursive: true });
  mkdirSync(SANDBOX_PROJECTS_DIR, { recursive: true });

  // Copy JSONL log fixtures into ~/.claude/
  const logFiles = [
    "all_queries_log.jsonl",
    "skill_usage_log.jsonl",
    "session_telemetry_log.jsonl",
    "evolution_audit_log.jsonl",
  ];
  for (const file of logFiles) {
    const src = join(FIXTURES_DIR, file);
    if (existsSync(src)) {
      copyFileSync(src, join(SANDBOX_CLAUDE_DIR, file));
    }
  }

  // Copy transcripts into ~/.claude/projects/default/ as individual session files
  const transcriptsDir = join(FIXTURES_DIR, "transcripts");
  if (existsSync(transcriptsDir)) {
    for (const file of readdirSync(transcriptsDir)) {
      if (file.endsWith(".jsonl")) {
        copyFileSync(join(transcriptsDir, file), join(SANDBOX_PROJECTS_DIR, file));
      }
    }
  }

  // Copy selftune config
  copyFileSync(
    join(FIXTURES_DIR, "selftune-config.json"),
    join(SANDBOX_SELFTUNE_DIR, "config.json"),
  );

  // Copy Claude Code settings
  copyFileSync(
    join(FIXTURES_DIR, "claude-settings.json"),
    join(SANDBOX_CLAUDE_DIR, "settings.json"),
  );
}
```
🧹 Nitpick | 🔵 Trivial
Consolidate sandbox provisioning to a single source of truth.
This setup duplicates tests/sandbox/provision-claude.sh logic; fixture path changes can drift between the two flows. Reusing the provision script (or a shared TS helper) will keep Layer 1 and Docker setup consistent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/sandbox/run-sandbox.ts` around lines 53 - 93, The setupSandbox function
duplicates provisioning steps already implemented in provision-claude.sh
(creating SANDBOX_SELFTUNE_DIR, SANDBOX_PROJECTS_DIR, copying FIXTURES_DIR files
into SANDBOX_CLAUDE_DIR and SANDBOX_PROJECTS_DIR, and copying selftune/claude
settings); replace the duplicated logic by invoking the existing
provision-claude.sh script (or refactor the shell logic into a shared TypeScript
helper) from setupSandbox so there is a single source of truth for fixture paths
and copying behavior; keep references to the same SANDBOX_* and FIXTURES_DIR
constants and ensure setupSandbox either spawns the script (capturing errors) or
calls the shared helper function so tests and Docker provisioning use identical
logic.
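The shelling-out option above can be sketched as follows. This is a hedged illustration using Node's `child_process` (the repo's runner is Bun, whose spawn API differs), and the script path and function name are assumptions, not the project's actual code:

```typescript
import { spawnSync } from "node:child_process";
import { existsSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Illustrative sketch: delegate provisioning to one shell script so the
// fixture-copy logic has a single source of truth. Any non-zero exit from
// the script is surfaced as a descriptive error.
function setupSandboxViaScript(scriptPath: string, targetHome: string): void {
  const result = spawnSync("bash", [scriptPath, targetHome], { encoding: "utf8" });
  if (result.status !== 0) {
    throw new Error(
      `provision script failed (exit ${result.status}): ${result.stderr}`,
    );
  }
}

// Demo with a stand-in script that just creates the target directory.
const dir = mkdtempSync(join(tmpdir(), "sandbox-"));
const script = join(dir, "provision.sh");
writeFileSync(script, '#!/usr/bin/env bash\nmkdir -p "$1/.claude"\n');
setupSandboxViaScript(script, join(dir, "home"));
console.log(existsSync(join(dir, "home", ".claude"))); // true
```

The same shape works for the Docker entrypoint, which already invokes the shell script directly.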
Devcontainer security:

- Remove unnecessary NET_RAW capability from runArgs
- Fix iptables INPUT chain: -o lo → -i lo for loopback
- Add default DROP policies for INPUT/FORWARD/OUTPUT chains
- Add pipefail to curl|bash Bun installs
- Append SNIPPET to .bashrc for sandbox-agent shell

Code fixes:

- Fix stream deadlock: consume stdout/stderr concurrently via Promise.all
- Add try-catch around JSON.parse for malformed grading results
- Guard transcript glob copy with nullglob/array check
- Extract countLines helper, handle empty files correctly
- Wrap main test execution in try/finally for cleanup

Documentation:

- Fix path reference: tests/sandbox/claude-code/ → tests/sandbox/docker/
- Add language tags to bare fenced code blocks
- Move HTML comment below H1 in multi-agent-sandbox.md
- Update TD-007 date to 2026-03-02
- Remove hardcoded line numbers from golden-principles.md
- Mark gitignored strategy doc links as "local only"
- Deduplicate start/boot scripts in package.json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Conflicts resolved in ARCHITECTURE.md, CHANGELOG.md, README.md, docs/design-docs/index.md. Kept both sides: dev's v0.8 features (auto-activation, enforcement, memory, dashboard-server, agents) and HEAD's sandbox test harness + devcontainer content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 6
♻️ Duplicate comments (2)
tests/sandbox/run-sandbox.ts (1)
231-237: ⚠️ Potential issue | 🟠 Major

Cleanup is still not guaranteed on all failure paths.

`setupSandbox()` is outside the `try/finally`, and `process.exit(...)` is called inside the `try`. This can still leave `/tmp/selftune-sandbox-*` behind on early failures.

Safer control-flow pattern
```diff
 async function main(): Promise<void> {
@@
-  // Setup
-  setupSandbox();
-  console.log("Sandbox directories created and fixtures copied.\n");
-
-  const results: RawTestResult[] = [];
-
-  try {
+  const results: RawTestResult[] = [];
+  let exitCode = 1;
+  try {
+    setupSandbox();
+    console.log("Sandbox directories created and fixtures copied.\n");
@@
-    process.exit(passed === total ? 0 : 1);
+    exitCode = passed === total ? 0 : 1;
   } finally {
@@
   }
+  process.exit(exitCode);
 }
```

In Node.js/Bun, does calling process.exit() inside a try block reliably execute the corresponding finally block?

Also applies to: 449-461
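The decide-then-exit pattern can be shown with a minimal synchronous sketch — names here are illustrative, not the actual run-sandbox.ts code:

```typescript
// Sketch of the suggested control flow: cleanup lives in `finally`, so it
// runs whether the work returns normally or throws, and the exit code is
// decided before the process is allowed to terminate.
function runGuarded(work: () => number, cleanup: () => void): number {
  let exitCode = 1;
  try {
    exitCode = work();
  } finally {
    cleanup(); // runs on success and on throw
  }
  return exitCode;
}

let cleaned = false;
let observedError = "";
try {
  runGuarded(
    () => {
      throw new Error("setup failed");
    },
    () => {
      cleaned = true;
    },
  );
} catch (err) {
  observedError = (err as Error).message;
}
console.log(cleaned, observedError); // true setup failed
```

Rethrowing after cleanup (as here) keeps failures observable to the caller while still guaranteeing the temp sandbox is removed.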
docs/design-docs/sandbox-architecture.md (1)
61-71: ⚠️ Potential issue | 🟡 Minor

Add blank lines after subsection headings to satisfy MD022.
The past review flagged missing blank lines after the headings on lines 61, 64, and 69. This markdownlint violation (MD022) has not been resolved.
📝 Proposed fix to add blank lines
```diff
 ### 1. HOME Env Var Redirection
+
 All selftune paths go through `homedir()` in `constants.ts`. Setting `HOME=/tmp/sandbox-*` redirects everything without modifying production code.

 ### 2. Two-Layer Architecture
+
 - Layer 1 is free, fast (~400ms), and runs in CI
 - Layer 2 costs tokens and requires Docker, reserved for pre-release validation
 - Both share the same fixture data

 ### 3. Result Recording
+
 Every test run saves a JSON report to `tests/sandbox/results/` with command, exit code, stdout, stderr, duration, and pass/fail. This creates a historical record of sandbox health.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design-docs/sandbox-architecture.md` around lines 61 - 71: The MD022 violation is due to missing blank lines after the subsection headings "1. HOME Env Var Redirection", "2. Two-Layer Architecture", and "3. Result Recording" in sandbox-architecture.md; fix it by inserting a single blank line immediately after each of those heading lines so each heading is followed by a blank line before the paragraph or list that follows, ensuring the file satisfies markdownlint MD022.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.devcontainer/Dockerfile:
- Line 5: Replace the dynamic build ARG CLAUDE_CODE_VERSION=latest with a pinned
explicit version (for example set ARG CLAUDE_CODE_VERSION=1.0.100) so
devcontainer builds are reproducible; locate the CLAUDE_CODE_VERSION ARG
declaration and any later references (the second occurrence that injects or
installs the Claude Code CLI) and update them to use the chosen fixed version
string, and document the chosen version in a comment so future bumps are
intentional.
In @.devcontainer/init-firewall.sh:
- Around line 7-27: The firewall currently only configures IPv4 via iptables,
allowing IPv6 traffic to bypass restrictions; add equivalent IPv6 rules using
ip6tables to mirror every iptables rule (accept loopback, DNS tcp/udp 53, SSH
tcp 22, established/related states, HTTPS 443, HTTP 80) and then set default
policies to DROP (INPUT, FORWARD, OUTPUT) for ip6tables as well; ensure rule
ordering matches the existing iptables block and use the same match flags (e.g.,
-m state --state ESTABLISHED,RELATED) so both iptables and ip6tables enforce the
same policy.
In `@docs/design-docs/sandbox-architecture.md`:
- Around line 91-97: The table entries for per-agent design docs are
inconsistent: "Claude Code" uses a markdown link while "Codex" and "OpenCode"
are plain text; update the rows in the table so formatting is consistent —
either convert "sandbox-codex.md" and "sandbox-opencode.md" to markdown links
like [sandbox-codex.md](sandbox-codex.md) and
[sandbox-opencode.md](sandbox-opencode.md) or remove the link syntax for
"sandbox-claude-code.md" so all three are plain text; update the entries for the
Agent and Design Doc columns in the table (e.g., the "Claude Code", "Codex",
"OpenCode" and their associated design-doc strings) and keep the Status column
unchanged.
- Around line 33-35: The docs duplicate the cost statement across two bullet
points; remove the redundant line so the "Cost" section only states "Uses
existing agent subscription — no per-call API charges." and keep the "Mechanism"
section as "Per-agent Docker containers with the agent's CLI installed. Uses
existing agent subscription — no separate API key needed." Locate and edit the
two bullets under the same heading in sandbox-architecture.md (the "Mechanism"
and "Cost" list items) and delete the repeated subscription/charges sentence
from the Cost bullet.
In `@tests/sandbox/docker/run-with-llm.ts`:
- Around line 225-241: testWatch currently doesn't validate the watch command
result so failures can be missed; update the test in run-with-llm.ts (the block
using runSelftune and returned values) to assert the expected exitCode and
result shape: verify exitCode equals the expected value (e.g., 1 for regression
detection), check stdout/stderr contain the expected markers (e.g., a known
"regression" or "watch" message) and ensure the returned object includes both
stdout and exitCode in the expected types before returning; reference the
runSelftune invocation and the surrounding testWatch/runTest flow when adding
these assertions.
- Around line 157-170: The code parses JSON into a loose Record<string, unknown>
(variable result) but then accesses result.summary.passed/result.summary.total
and result.reason/result.proposal without proper narrowing; add a type guard
(e.g., isGradingResult(obj): obj is GradingResult) or inline typeof checks
before using each property—verify result.summary is an object and typeof
result.summary.pass_rate === 'number' and typeof result.summary.passed ===
'number' and typeof result.summary.total === 'number' before logging, and verify
typeof result.reason === 'string' (or typeof result.proposal === 'string')
before using nullish-coalescing; replace unsafe property accesses with guarded
accesses or cast only after the guard so TypeScript knows the types are safe.
---
Duplicate comments:
In `@docs/design-docs/sandbox-architecture.md`:
- Around line 61-71: The MD022 violation is due to missing blank lines after the
subsection headings "1. HOME Env Var Redirection", "2. Two-Layer Architecture",
and "3. Result Recording" in sandbox-architecture.md; fix it by inserting a
single blank line immediately after each of those heading lines so each heading
is followed by a blank line before the paragraph or list that follows, ensuring
the file satisfies markdownlint MD022.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
- .devcontainer/Dockerfile
- .devcontainer/devcontainer.json
- .devcontainer/init-firewall.sh
- docs/design-docs/sandbox-architecture.md
- docs/design-docs/sandbox-claude-code.md
- docs/exec-plans/active/multi-agent-sandbox.md
- docs/exec-plans/tech-debt-tracker.md
- docs/golden-principles.md
- docs/product-specs/index.md
- package.json
- tests/sandbox/docker/run-with-llm.ts
- tests/sandbox/provision-claude.sh
- tests/sandbox/run-sandbox.ts
ARG TZ
ENV TZ="$TZ"
ARG CLAUDE_CODE_VERSION=latest
🧩 Analysis chain
🏁 Script executed:
find . -name "Dockerfile" -path "*devcontainer*" | head -5

Repository: WellDunDun/selftune
Length of output: 89
🏁 Script executed:
cat -n .devcontainer/Dockerfile | head -60

Repository: WellDunDun/selftune
Length of output: 2243
🌐 Web query:
@anthropic-ai/claude-code npm package latest version stable
💡 Result:
The latest stable version of the @anthropic-ai/claude-code npm package is 1.0.100. [1]
Pin Claude Code CLI to an explicit version instead of latest.
Using CLAUDE_CODE_VERSION=latest makes devcontainer builds non-reproducible and can break unexpectedly when upstream publishes changes. Pin to a stable version (e.g., 1.0.100) and bump intentionally.
Affects lines 5 and 49.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.devcontainer/Dockerfile at line 5, Replace the dynamic build ARG
CLAUDE_CODE_VERSION=latest with a pinned explicit version (for example set ARG
CLAUDE_CODE_VERSION=1.0.100) so devcontainer builds are reproducible; locate the
CLAUDE_CODE_VERSION ARG declaration and any later references (the second
occurrence that injects or installs the Claude Code CLI) and update them to use
the chosen fixed version string, and document the chosen version in a comment so
future bumps are intentional.
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT

# Allow established connections
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow HTTPS (port 443) for Claude API, npm, GitHub
iptables -A OUTPUT -p tcp --dport 443 -j ACCEPT

# Allow HTTP (port 80) for package registries
iptables -A OUTPUT -p tcp --dport 80 -j ACCEPT

# Default deny all other traffic
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT DROP
Firewall policy is IPv4-only; add IPv6 controls to avoid bypass.
Current rules apply to iptables only. If IPv6 is available, outbound traffic can bypass these restrictions.
Suggested hardening
+# Mirror policy for IPv6 (or explicitly disable IPv6 in container)
+ip6tables -P INPUT DROP
+ip6tables -P FORWARD DROP
+ip6tables -P OUTPUT DROP
+
+ip6tables -A OUTPUT -p udp --dport 53 -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 53 -j ACCEPT
+ip6tables -A OUTPUT -o lo -j ACCEPT
+ip6tables -A INPUT -i lo -j ACCEPT
+ip6tables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
+ip6tables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 443 -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 80 -j ACCEPT

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT

# Allow established connections
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow HTTPS (port 443) for Claude API, npm, GitHub
iptables -A OUTPUT -p tcp --dport 443 -j ACCEPT

# Allow HTTP (port 80) for package registries
iptables -A OUTPUT -p tcp --dport 80 -j ACCEPT

# Default deny all other traffic
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT DROP

# Mirror policy for IPv6 (or explicitly disable IPv6 in container)
ip6tables -P INPUT DROP
ip6tables -P FORWARD DROP
ip6tables -P OUTPUT DROP

ip6tables -A OUTPUT -p udp --dport 53 -j ACCEPT
ip6tables -A OUTPUT -p tcp --dport 53 -j ACCEPT
ip6tables -A OUTPUT -o lo -j ACCEPT
ip6tables -A INPUT -i lo -j ACCEPT
ip6tables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
ip6tables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
ip6tables -A OUTPUT -p tcp --dport 443 -j ACCEPT
ip6tables -A OUTPUT -p tcp --dport 80 -j ACCEPT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.devcontainer/init-firewall.sh around lines 7 - 27, The firewall currently
only configures IPv4 via iptables, allowing IPv6 traffic to bypass restrictions;
add equivalent IPv6 rules using ip6tables to mirror every iptables rule (accept
loopback, DNS tcp/udp 53, SSH tcp 22, established/related states, HTTPS 443,
HTTP 80) and then set default policies to DROP (INPUT, FORWARD, OUTPUT) for
ip6tables as well; ensure rule ordering matches the existing iptables block and
use the same match flags (e.g., -m state --state ESTABLISHED,RELATED) so both
iptables and ip6tables enforce the same policy.
**Mechanism:** Per-agent Docker containers with the agent's CLI installed. Uses existing agent subscription — no separate API key needed.

**Cost:** Uses existing agent subscription — no per-call API charges.
🧹 Nitpick | 🔵 Trivial
Remove duplicate cost statement.
Lines 33 and 35 both state that the solution uses the existing agent subscription without additional charges. The repetition is unnecessary.
✂️ Proposed fix to remove duplication
**Mechanism:** Per-agent Docker containers with the agent's CLI installed. Uses existing agent subscription — no separate API key needed.
-
-**Cost:** Uses existing agent subscription — no per-call API charges.

📝 Committable suggestion
**Mechanism:** Per-agent Docker containers with the agent's CLI installed. Uses existing agent subscription — no separate API key needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design-docs/sandbox-architecture.md` around lines 33 - 35, The docs
duplicate the cost statement across two bullet points; remove the redundant line
so the "Cost" section only states "Uses existing agent subscription — no
per-call API charges." and keep the "Mechanism" section as "Per-agent Docker
containers with the agent's CLI installed. Uses existing agent subscription — no
separate API key needed." Locate and edit the two bullets under the same heading
in sandbox-architecture.md (the "Mechanism" and "Cost" list items) and delete
the repeated subscription/charges sentence from the Cost bullet.
## Per-Agent Design Docs

| Agent | Design Doc | Status |
|-------|-----------|--------|
| Claude Code | [sandbox-claude-code.md](sandbox-claude-code.md) | Implemented |
| Codex | sandbox-codex.md | Planned |
| OpenCode | sandbox-opencode.md | Planned |
🧹 Nitpick | 🔵 Trivial
Consider consistent formatting for planned design docs.
Line 95 uses a markdown link for the implemented Claude Code doc, while lines 96-97 show plain text for planned docs. For consistency, either use markdown links for all (even if they don't resolve yet) or use plain text uniformly with status indicating availability.
🔗 Option: Use consistent markdown links
| Agent | Design Doc | Status |
|-------|-----------|--------|
| Claude Code | [sandbox-claude-code.md](sandbox-claude-code.md) | Implemented |
-| Codex | sandbox-codex.md | Planned |
-| OpenCode | sandbox-opencode.md | Planned |
+| Codex | [sandbox-codex.md](sandbox-codex.md) | Planned |
+| OpenCode | [sandbox-opencode.md](sandbox-opencode.md) | Planned |

📝 Committable suggestion
## Per-Agent Design Docs

| Agent | Design Doc | Status |
|-------|-----------|--------|
| Claude Code | [sandbox-claude-code.md](sandbox-claude-code.md) | Implemented |
| Codex | [sandbox-codex.md](sandbox-codex.md) | Planned |
| OpenCode | [sandbox-opencode.md](sandbox-opencode.md) | Planned |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design-docs/sandbox-architecture.md` around lines 91 - 97, The table
entries for per-agent design docs are inconsistent: "Claude Code" uses a
markdown link while "Codex" and "OpenCode" are plain text; update the rows in
the table so formatting is consistent — either convert "sandbox-codex.md" and
"sandbox-opencode.md" to markdown links like
[sandbox-codex.md](sandbox-codex.md) and
[sandbox-opencode.md](sandbox-opencode.md) or remove the link syntax for
"sandbox-claude-code.md" so all three are plain text; update the entries for the
Agent and Design Doc columns in the table (e.g., the "Claude Code", "Codex",
"OpenCode" and their associated design-doc strings) and keep the Status column
unchanged.
let result: Record<string, unknown>;
try {
  result = JSON.parse(raw);
} catch {
  throw new Error(`grading-result.json is not valid JSON: ${raw.slice(0, 300)}`);
}

if (!result.summary || typeof result.summary.pass_rate !== "number") {
  throw new Error("Grading result missing summary.pass_rate");
}

console.log(
  `  [grade] ${result.summary.passed}/${result.summary.total} (${Math.round(result.summary.pass_rate * 100)}%)`,
);
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== TypeScript configuration ==="
fd '^tsconfig.*\.json$' --type f | head -5 | while read f; do
echo "--- $f ---"
cat "$f" | grep -E '"strict"|"noImplicitAny"|"strictNullChecks"|"noUncheckedIndexedAccess"' || true
done
echo
echo "=== Full context around lines 157-170 and 200-215 ==="
wc -l tests/sandbox/docker/run-with-llm.ts
echo
echo "=== Lines 150-220 (full context) ==="
sed -n '150,220p' tests/sandbox/docker/run-with-llm.ts

Repository: WellDunDun/selftune
Length of output: 2561
Remove unsafe property access on unknown types without proper narrowing.
Under strict TypeScript, accessing result.summary.passed and result.summary.total (line 165) without type guards violates type safety. The check !result.summary confirms existence but doesn't narrow the property types to non-unknown.
Line 208 similarly accesses result.reason (in the nullish coalescing) without establishing its type—the truthiness check !result.reason && !result.proposal doesn't narrow the type of .reason itself.
Type-guard or narrow each accessed property:
- `typeof result.summary?.passed === 'number'` before use
- `typeof result.reason === 'string'` before accessing
Same applies to line 200–215 region.
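A guard along the lines the review asks for can be sketched as follows. The `GradingResult` shape here is inferred from the properties named in the comment (`summary.pass_rate`, `summary.passed`, `summary.total`, `reason`, `proposal`) and is not taken from the project's actual type definitions:

```typescript
// Hypothetical result shape inferred from the review comment above —
// adjust field names to match the project's real grading output.
interface GradingSummary {
  pass_rate: number;
  passed: number;
  total: number;
}

interface GradingResult {
  summary: GradingSummary;
  reason?: string;
  proposal?: string;
}

// Type guard: narrows an unknown parsed value so later property access
// (result.summary.passed, result.reason, ...) is safe under strict mode.
function isGradingResult(obj: unknown): obj is GradingResult {
  if (typeof obj !== "object" || obj === null) return false;
  const rec = obj as Record<string, unknown>;
  if (typeof rec.summary !== "object" || rec.summary === null) return false;
  const s = rec.summary as Record<string, unknown>;
  return (
    typeof s.pass_rate === "number" &&
    typeof s.passed === "number" &&
    typeof s.total === "number" &&
    (rec.reason === undefined || typeof rec.reason === "string") &&
    (rec.proposal === undefined || typeof rec.proposal === "string")
  );
}

const parsed: unknown = JSON.parse(
  '{"summary":{"pass_rate":0.8,"passed":4,"total":5},"reason":"ok"}',
);
if (!isGradingResult(parsed)) {
  throw new Error("grading-result.json has unexpected shape");
}
// TypeScript now knows the types; no casts needed.
console.log(`${parsed.summary.passed}/${parsed.summary.total}`);
```

After the guard, the existing `console.log` and nullish-coalescing sites compile without casts because the narrowing carries through the rest of the block.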
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/sandbox/docker/run-with-llm.ts` around lines 157 - 170, The code parses
JSON into a loose Record<string, unknown> (variable result) but then accesses
result.summary.passed/result.summary.total and result.reason/result.proposal
without proper narrowing; add a type guard (e.g., isGradingResult(obj): obj is
GradingResult) or inline typeof checks before using each property—verify
result.summary is an object and typeof result.summary.pass_rate === 'number' and
typeof result.summary.passed === 'number' and typeof result.summary.total ===
'number' before logging, and verify typeof result.reason === 'string' (or typeof
result.proposal === 'string') before using nullish-coalescing; replace unsafe
property accesses with guarded accesses or cast only after the guard so
TypeScript knows the types are safe.
const { exitCode, stdout, stderr } = await runSelftune([
  "watch",
  "--skill",
  "find-skills",
  "--skill-path",
  skillPath,
  "--window",
  "20",
  "--threshold",
  "0.1",
]);

// Watch exits 1 on regression detection (expected for test data)
console.log(`  [watch] exit=${exitCode}`);
console.log(`  [watch] ${stdout.slice(0, 200)}`);
return { stdout: stdout.slice(0, 1000), exitCode };
}
watch test can pass on unexpected failures.
testWatch never asserts expected exitCode or result shape, so runTest marks it PASS as long as no exception is thrown. This can hide regressions in watch.
Tighten validation for watch expectations
async function testWatch(): Promise<unknown> {
@@
- // Watch exits 1 on regression detection (expected for test data)
- console.log(` [watch] exit=${exitCode}`);
- console.log(` [watch] ${stdout.slice(0, 200)}`);
- return { stdout: stdout.slice(0, 1000), exitCode };
+ // Watch exits 1 on regression detection for this fixture.
+ if (exitCode !== 1) {
+ throw new Error(`watch expected exit=1 for regression fixture, got ${exitCode}: ${stderr.slice(0, 300)}`);
+ }
+ let parsed: any;
+ try {
+ parsed = JSON.parse(stdout);
+ } catch {
+ throw new Error(`watch stdout is not valid JSON: ${stdout.slice(0, 300)}`);
+ }
+ if (parsed?.regression_detected !== true) {
+ throw new Error("watch output missing regression_detected=true");
+ }
+ console.log(` [watch] exit=${exitCode}`);
+ return { exitCode, regression_detected: parsed.regression_detected };
}

📝 Committable suggestion
const { exitCode, stdout, stderr } = await runSelftune([
  "watch",
  "--skill",
  "find-skills",
  "--skill-path",
  skillPath,
  "--window",
  "20",
  "--threshold",
  "0.1",
]);

// Watch exits 1 on regression detection for this fixture.
if (exitCode !== 1) {
  throw new Error(`watch expected exit=1 for regression fixture, got ${exitCode}: ${stderr.slice(0, 300)}`);
}
let parsed: any;
try {
  parsed = JSON.parse(stdout);
} catch {
  throw new Error(`watch stdout is not valid JSON: ${stdout.slice(0, 300)}`);
}
if (parsed?.regression_detected !== true) {
  throw new Error("watch output missing regression_detected=true");
}
console.log(`  [watch] exit=${exitCode}`);
return { exitCode, regression_detected: parsed.regression_detected };
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/sandbox/docker/run-with-llm.ts` around lines 225 - 241, testWatch
currently doesn't validate the watch command result so failures can be missed;
update the test in run-with-llm.ts (the block using runSelftune and returned
values) to assert the expected exitCode and result shape: verify exitCode equals
the expected value (e.g., 1 for regression detection), check stdout/stderr
contain the expected markers (e.g., a known "regression" or "watch" message) and
ensure the returned object includes both stdout and exitCode in the expected
types before returning; reference the runSelftune invocation and the surrounding
testWatch/runTest flow when adding these assertions.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
README.md (1)
246-261: ⚠️ Potential issue | 🟡 Minor

License section content is misplaced.
Line 246 creates a License heading, but the actual license link is at Line 261 after the Milestones section. Move [MIT](LICENSE) directly under ## License.

Proposed fix

 ## License
+[MIT](LICENSE)

 ## Milestones
@@
-| v0.8 | Auto-activation, evolution memory, specialized agents, enforcement guardrails, dashboard server | Done |
-
-[MIT](LICENSE)
+| v0.8 | Auto-activation, evolution memory, specialized agents, enforcement guardrails, dashboard server | Done |

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 246 - 261, The README has the "## License" heading separated from its actual content: move the existing license link "[MIT](LICENSE)" so it appears immediately under the "## License" heading (i.e., place "[MIT](LICENSE)" directly after the "## License" line) and ensure the "## Milestones" section follows after the license entry; update the block containing "## License", "[MIT](LICENSE)", and "## Milestones" so the heading and link are contiguous.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ARCHITECTURE.md`:
- Around line 48-50: Update the ARCHITECTURE.md section that describes
callViaAgent() spawning `claude -p` to explicitly restrict
`--dangerously-skip-permissions` to sandbox/devcontainer use only: add a single
hard guard sentence after the existing lines stating that
`--dangerously-skip-permissions` must never be used in regular/local/production
workflows and is only permitted in isolated devcontainer/sandbox testing,
referencing callViaAgent() and the flag name verbatim to make the constraint
unambiguous.
---
Outside diff comments:
In `@README.md`:
- Around line 246-261: The README has the "## License" heading separated from
its actual content: move the existing license link "[MIT](LICENSE)" so it
appears immediately under the "## License" heading (i.e., place "[MIT](LICENSE)"
directly after the "## License" line) and ensure the "## Milestones" section
follows after the license entry; update the block containing "## License",
"[MIT](LICENSE)", and "## Milestones" so the heading and link are contiguous.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
- ARCHITECTURE.md
- CHANGELOG.md
- README.md
- docs/design-docs/index.md
- docs/golden-principles.md
- package.json
│ LLM calls use callViaAgent() which spawns `claude -p` as a
│ subprocess. In devcontainer testing, this runs with
│ --dangerously-skip-permissions for unattended operation.
Constrain dangerous flag usage to sandbox-only context explicitly.
Lines 48-50 document --dangerously-skip-permissions but don’t explicitly forbid use outside isolated devcontainer/sandbox runs. Add a hard guard sentence to prevent accidental adoption in normal/local workflows.
Suggested doc hardening
│ LLM calls use callViaAgent() which spawns `claude -p` as a
│ subprocess. In devcontainer testing, this runs with
│ --dangerously-skip-permissions for unattended operation.
+│ This flag is sandbox/devcontainer-test only and must never be
+│ used in host/local developer workflows.

📝 Committable suggestion
│ LLM calls use callViaAgent() which spawns `claude -p` as a
│ subprocess. In devcontainer testing, this runs with
│ --dangerously-skip-permissions for unattended operation.
│ This flag is sandbox/devcontainer-test only and must never be
│ used in host/local developer workflows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ARCHITECTURE.md` around lines 48 - 50, Update the ARCHITECTURE.md section
that describes callViaAgent() spawning `claude -p` to explicitly restrict
`--dangerously-skip-permissions` to sandbox/devcontainer use only: add a single
hard guard sentence after the existing lines stating that
`--dangerously-skip-permissions` must never be used in regular/local/production
workflows and is only permitted in isolated devcontainer/sandbox testing,
referencing callViaAgent() and the flag name verbatim to make the constraint
unambiguous.
Merges dev commits (sandbox harness PR #17, PAI Infrastructure PR #15) into feature branch. Resolved conflicts by combining both branches' unique additions: dev's structural improvements (stream deadlock fix, nullglob guards, bash history, iptables corrections) with HEAD's OpenClaw integration (ingestor, cron, platform references). Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Bug fixes:
- Fix routing retries using wrong refiner in evolve-body.ts (#15)
- Fix baseline gate evaluating wrong skill body in evolve.ts (#19)
- Fix Gate 1 accepting routing tables with zero data rows (#20)
- Fix section heading prefix match allowing false positives (#13)

Security:
- Add ReDoS protection for regex assertions (length cap, try-catch, u flag) (#10)
- Harden YES/NO parsing to first-token extraction with punctuation stripping (#23)

Type safety & validation:
- Add type params to readJsonl calls in baseline.ts (#3)
- Validate LLM JSON shape before treating as SkillUnitTest[] (#4)
- Add JSON.parse validation in unit-test-cli.ts (#9)

Code quality:
- Replace console.warn with structured JSON logging (#5, #11)
- Sort readdirSync for deterministic traversal (#6)
- Tighten fuzzy match to prevent over-selection (#7)
- Rename evalFailures to positiveEvals for clarity (#8)
- Enforce timeout_ms via Promise.race in runUnitTest (#12)
- Enrich deploy audit entry with gates snapshot (#16)
- Add Pareto 5th dimension comment (#18)
- Extract composability CLI into composability-cli.ts (#21)
- Add missing eval files to lint-architecture.ts (#28)
- Reformat AssertionType union (#22), template literal (#14), imports (#17)
- Reformat baseline conditional (#2)

Documentation:
- Renumber scope expansion list (#24)
- Add text language to fenced code blocks (#25, #27, #30)
- Add rollback audit payload note (#26)
- Clarify --dir expects corpus root (#29)
- Make --tests optional in syntax (#31)

Tests:
- Remove unused afterEach import (#32)
- Replace ! assertion with ?. optional chaining (#33)
- Fix temp dir cleanup leak with afterAll (#34)
- Add Gate 2 status assertion (#40)
- Strengthen routing validation assertions (#41)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
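Among the items above, the timeout_ms enforcement via Promise.race (#12) is worth a sketch. This is an illustrative shape only, assuming a runUnitTest-style async workload; the wrapper name `withTimeout` is hypothetical, not selftune's actual helper:

```typescript
// Illustrative sketch: race the real work against a timer rejection.
// withTimeout and timeoutMs are assumed names, not the project's API.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`unit test timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    // Whichever promise settles first wins the race.
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so a finished test doesn't leave a pending handle.
    clearTimeout(timer);
  }
}

// Usage: a fast task resolves normally; a hung one rejects at the deadline.
async function demo(): Promise<void> {
  const fast = await withTimeout(Promise.resolve("ok"), 1000);
  console.log(fast);
  await withTimeout(new Promise<never>(() => {}), 50).catch((err: Error) =>
    console.log(err.message),
  );
}
demo();
```

One caveat: Promise.race only stops waiting for the work, it does not cancel it; enforcing a hard limit on a spawned subprocess additionally requires killing that process.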
## Walkthrough
This PR adds new CLI commands (auto-grade, quickstart, hook), implements expectation auto-derivation from SKILL.md, distinguishes actual skill invocations from passive SKILL.md reads, adds an UNGRADED skill status, migrates hook checks to Claude Code settings.json, and enhances evolve CLI pre-flight validation, verbose logging, and post-run/error messaging.
## Changes
|Cohort / File(s)|Summary|
|---|---|
|**CLI Routing & Init** <br> `cli/selftune/index.ts`, `cli/selftune/init.ts`|Adds `auto-grade`, `quickstart`, and `hook` command routing; `installAgentFiles()` to deploy bundled agent files; expands Claude Code hook detection (4 keys) and supports nested hook entries.|
|**Evolution CLI** <br> `cli/selftune/evolution/evolve.ts`|Pre-flight checks for `--skill-path`/SKILL.md and eval-set paths, verbose config dump, enhanced post-run deployment/diagnostic messaging, and improved top-level error/troubleshooting output.|
|**Automated Grading** <br> `cli/selftune/grading/auto-grade.ts` (new), `cli/selftune/grading/grade-session.ts`|New auto-grade CLI; adds `deriveExpectationsFromSkill()` (exported) to extract up to 5 expectations from SKILL.md and uses it when explicit evals are missing; session selection now prefers `skills_invoked`.|
|**Hook / Skill Eval** <br> `cli/selftune/hooks/skill-eval.ts`, `cli/selftune/observability.ts`|Adds `hasSkillToolInvocation()` and updates `processToolUse` to set `triggered` based on detection; observability now checks Claude Code `settings.json` (flat and nested formats) instead of repo .git/hooks.|
|**Telemetry & Transcript** <br> `cli/selftune/types.ts`, `cli/selftune/utils/transcript.ts`, `cli/selftune/ingestors/claude-replay.ts`|Adds `skills_invoked` to telemetry/transcript types and populates it from tool_use entries; ingestion prefers `skills_invoked` and sets `triggered` accordingly (fallback to previous behavior).|
|**Status / Badge** <br> `cli/selftune/status.ts`, `cli/selftune/badge/badge-data.ts`|Introduces new SkillStatus value `UNGRADED` and extends BadgeData.status to include `UNGRADED`; status logic and display updated to treat skills with records but no triggered (graded) sessions as UNGRADED.|
|**Quickstart Onboarding** <br> `cli/selftune/quickstart.ts` (new)|Adds three-step quickstart flow (init, replay, status) with `suggestSkillsToEvolve()` ranking ungraded/critical/warning skills.|
|**Eval set behavior** <br> `cli/selftune/eval/hooks-to-evals.ts`|`buildEvalSet` now filters positives/negatives to only include records where `triggered === true`.|
|**Templates & Docs** <br> `templates/*-settings.json`, `skill/Workflows/*.md`|Replaces hardcoded path-based hook invocations with `npx selftune hook <name>`; updates Evolve workflow docs to require AskUserQuestion for user inputs.|
|**Tests & Fixtures** <br> `tests/*` (multiple)|Adds/updates tests for derived expectations, skills_invoked precedence, triggered=true/false cases, observability hook formats, UNGRADED state, and eval positives filtering; updates SKILL.md fixtures.|
## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant User
participant CLI as auto-grade<br/>CLI
participant Telemetry as Telemetry<br/>Log
participant SkillMD as SKILL.md
participant PreGates as Pre-Gates
participant Agent as LLM<br/>Agent
participant Output as JSON<br/>Output
User->>CLI: auto-grade --skill X [--session-id Y]
CLI->>Telemetry: Load telemetry log
Telemetry-->>CLI: Sessions list
CLI->>CLI: Resolve session (by ID or latest using skills_invoked)
CLI->>SkillMD: Derive expectations (if needed)
SkillMD-->>CLI: Expectations (derived or none)
CLI->>PreGates: Run pre-gates on expectations
alt All resolved
PreGates-->>CLI: All expectations resolved
else Some unresolved
PreGates-->>CLI: Unresolved list
CLI->>Agent: gradeViaAgent(unresolved)
Agent-->>CLI: Graded expectations
end
CLI->>CLI: Compile results, metrics, summary
CLI->>Output: Write GradingResult JSON
Output-->>User: Summary + file path
```
```mermaid
sequenceDiagram
participant User
participant CLI as quickstart<br/>CLI
participant Config as Config<br/>Check
participant Replay as Replay<br/>Ingestor
participant Telemetry as Telemetry<br/>Log
participant Status as Status<br/>Engine
participant Skills as Skills<br/>Scorer
User->>CLI: quickstart
CLI->>Config: Check config/marker
alt Config missing
CLI->>CLI: runInit()
end
CLI->>Replay: Check replay marker
alt Marker missing
Replay->>CLI: Ingest transcripts (parseTranscript -> skills_invoked)
CLI->>Telemetry: Update logs/marker
end
CLI->>Status: Compute skill statuses (uses triggered & pass rates)
Status-->>CLI: Formatted status
CLI->>Skills: suggestSkillsToEvolve()
Skills-->>CLI: Top ranked skills
CLI->>User: Display suggestions
```
```mermaid
sequenceDiagram
participant Transcript as File
participant Parser as parseTranscript
participant Detector as Invocation<br/>Detector
participant Telemetry as Session<br/>Record
participant Ingestor as claude-replay
Parser->>Transcript: Read JSONL lines
loop each message
Parser->>Detector: Check tool_use entries
alt tool_use.toolName == "Skill"
Detector-->>Parser: Extract skill name
Parser->>Parser: Add to skills_invoked[]
else
Detector-->>Parser: ignore
end
end
Parser->>Telemetry: Emit TranscriptMetrics (skills_invoked)
Ingestor->>Telemetry: Create SkillUsageRecord (use skills_invoked if present)
```
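The detector loop above — scan each JSONL message, and record a skill name whenever a `tool_use` entry's tool is `"Skill"` — can be sketched as follows. The line shapes (`content` array, `toolName`, `input.name`) are assumptions for illustration, not the actual Claude Code transcript schema.

```typescript
// Hedged sketch of skills_invoked extraction from a JSONL transcript.
type ToolUse = { type: "tool_use"; toolName: string; input?: { name?: string } };
type TranscriptLine = { role: string; content?: ToolUse[] };

function extractSkillsInvoked(jsonl: string): string[] {
  const skills: string[] = [];
  for (const raw of jsonl.split("\n")) {
    if (!raw.trim()) continue; // skip blank lines
    const msg: TranscriptLine = JSON.parse(raw);
    for (const entry of msg.content ?? []) {
      // Only "Skill" tool uses count as skill invocations; others are ignored.
      if (entry.type === "tool_use" && entry.toolName === "Skill") {
        const name = entry.input?.name;
        if (name && !skills.includes(name)) skills.push(name);
      }
    }
  }
  return skills;
}

const transcript = [
  JSON.stringify({ role: "assistant", content: [{ type: "tool_use", toolName: "Skill", input: { name: "commit-helper" } }] }),
  JSON.stringify({ role: "assistant", content: [{ type: "tool_use", toolName: "Bash" }] }),
].join("\n");

console.log(extractSkillsInvoked(transcript));
// → [ 'commit-helper' ]
```

The resulting `skills_invoked` list is what the ingestor can then prefer over heuristic trigger detection when building a `SkillUsageRecord`.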
## Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
## Possibly related PRs
- **#23**: Overlaps evolve CLI changes in `cli/selftune/evolution/evolve.ts` (pre-flight, logging, messaging).
- **#13**: Related telemetry/ingestion changes around `skills_invoked` and session selection behavior.
- **#17**: Touches the same ingestor surface (`cli/selftune/ingestors/claude-replay.ts`) and skill-invocation/trigger semantics.
## Walkthrough

Adds a sandbox testing infrastructure with devcontainer and Docker Layer‑2 LLM tests, fixture provisioning and orchestration scripts, GitHub Actions pinning/permissions, firewall/init scripts, many docs/roadmap updates, CI/workflow changes, and small repo config edits across ~40 files.

## Changes

| Files | Change summary |
| --- | --- |
| `.devcontainer/Dockerfile`, `.devcontainer/devcontainer.json`, `.devcontainer/init-firewall.sh` | New devcontainer image, configuration, and init firewall script. |
| `tests/sandbox/docker/Dockerfile`, `tests/sandbox/docker/docker-compose.yml`, `tests/sandbox/docker/entrypoint.sh` | Docker Layer‑2 sandbox image, compose setup, and entrypoint provisioning. |
| `tests/sandbox/run-sandbox.ts`, `tests/sandbox/docker/run-with-llm.ts`, `tests/sandbox/provision-claude.sh` | Sandbox orchestration and provisioning of `.claude`/`.selftune` fixtures. High LOC and nontrivial result collection/validation logic. |
| `Makefile`, `package.json` | New entries (`start`, `boot`) pointing to the CLI help entry; `check` now depends on `sandbox`. |
| `.github/workflows/…` (`auto-bump-cli-version.yml`, `ci.yml`, `codeql.yml`, `publish.yml`, `scorecard.yml`) | Workflow action pinning and permissions updates. |
| `.github/CODEOWNERS`, `.gitignore`, `.coderabbit.yaml` | New CODEOWNERS; ignore `tests/sandbox/results/`, `.env*`, `docs/strategy/`; re-enabled fixture files for review by removing exclusion. |
| `README.md`, `CONTRIBUTING.md`, `AGENTS.md`, `ARCHITECTURE.md`, `PRD.md`, `CHANGELOG.md`, `ROADMAP.md` | Top-level docs and roadmap updates. |
| `docs/design-docs/*`, `docs/exec-plans/*` | Sandbox design-doc split and multi-agent sandbox exec plan. |
| `risk-policy.json`, `docs/golden-principles.md`, `docs/product-specs/index.md`, `llms.txt` | `contribute/` updates, product-spec links, and a new `llms.txt` reference doc. |

## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Host as Host<br/>(make sandbox)
participant Tmp as Temp Sandbox<br/>(Layer 1)
participant Project as Project<br/>Files
participant CLI as selftune<br/>CLI
participant Hooks as Hook<br/>Scripts
participant Results as results/<br/>JSON
Host->>Tmp: Create temp HOME
Host->>Project: Copy logs, transcripts, config
Host->>Tmp: Set HOME env
loop Run CLI commands
Host->>CLI: doctor, evals, status, last, dashboard, contribute
CLI->>Tmp: Read/write using HOME
CLI-->>Host: stdout/stderr + exit code
Host->>Results: Record result
end
loop Run Hooks
Host->>Hooks: prompt-log, skill-eval, session-stop (pipe JSON)
Hooks->>Tmp: Append JSONL
Hooks-->>Host: Exit + status
Host->>Results: Record result
end
Host->>Results: Write timestamped JSON report
Host->>Tmp: Cleanup (or --keep)
```
```mermaid
sequenceDiagram
participant Dev as Devcontainer<br/>Startup
participant FW as iptables<br/>Firewall
participant Entrypoint as entrypoint.sh<br/>+ provision
participant Fixtures as Fixture<br/>Files
participant Docker as Docker<br/>Volumes
participant Tests as run-with-llm.ts<br/>(LLM Tests)
participant Claude as Claude Code<br/>CLI/API
participant Results as results/<br/>JSON
Dev->>FW: Apply iptables rules (DNS/SSH/HTTPS/HTTP)
Dev->>Docker: Mount volumes
Dev->>Entrypoint: Execute provisioning
Entrypoint->>Fixtures: Copy into .claude/.selftune
Entrypoint->>Tests: Start LLM test runner
Tests->>Claude: Invoke claude -p / API calls
Claude->>Tests: Return responses
Tests->>Results: Validate & append results
Tests->>Results: Write final report
Tests-->>Dev: Exit status
```
## Estimated code review effort
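The Layer‑1 flow above boils down to pointing `HOME` at a throwaway directory so the command under test never touches the real home. A minimal sketch of that pattern — `selftune` itself is not invoked here; a plain `node -e` child process stands in for it:

```typescript
// Hedged sketch of a Layer‑1 temp-HOME sandbox run with provisioning and cleanup.
import { execFileSync } from "node:child_process";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const sandboxHome = fs.mkdtempSync(path.join(os.tmpdir(), "sandbox-home-"));
let verdict = "leaked";
try {
  // Provision fixtures into the fake HOME (here, just a stub config file).
  fs.mkdirSync(path.join(sandboxHome, ".selftune"), { recursive: true });
  fs.writeFileSync(path.join(sandboxHome, ".selftune", "config.json"), "{}");

  // The child process sees only the sandbox HOME.
  const out = execFileSync(process.execPath, ["-e", "console.log(process.env.HOME)"], {
    env: { ...process.env, HOME: sandboxHome },
    encoding: "utf8",
  }).trim();
  verdict = out === sandboxHome ? "isolated" : "leaked";
} finally {
  fs.rmSync(sandboxHome, { recursive: true, force: true }); // cleanup (cf. --keep)
}
console.log(verdict);
// → isolated
```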
🎯 4 (Complex) | ⏱️ ~45 minutes
## Possibly related PRs