
feat: sandbox test harness, multi-agent expansion plan, and project roadmap #17

Merged
WellDunDun merged 4 commits into dev from custom/prefix/router-1772442284029
Mar 2, 2026

Conversation

WellDunDun (Collaborator) commented Mar 2, 2026

Walkthrough

Adds a sandbox testing infrastructure with devcontainer and Docker Layer‑2 LLM tests, fixture provisioning and orchestration scripts, GitHub Actions pinning/permissions, firewall/init scripts, many docs/roadmap updates, CI/workflow changes, and small repo config edits across ~40 files.

Changes

Cohort / File(s) Summary
Devcontainer & Firewall
.devcontainer/Dockerfile, .devcontainer/devcontainer.json, .devcontainer/init-firewall.sh
New devcontainer image and config with Node 20 + Bun, installs Claude Code CLI, configures npm dirs, persists bash history, NET_ADMIN runArgs, and an iptables-based init firewall script.
Sandbox Docker Layer & Entrypoint
tests/sandbox/docker/Dockerfile, tests/sandbox/docker/docker-compose.yml, tests/sandbox/docker/entrypoint.sh
New sandbox Docker image and compose service that provisions persistent sandbox HOME, installs Bun/Claude CLI, prepares entrypoint to provision fixtures and run Layer‑2 tests.
Sandbox Orchestrators & Provisioning
tests/sandbox/run-sandbox.ts, tests/sandbox/docker/run-with-llm.ts, tests/sandbox/provision-claude.sh
Layer‑1 TypeScript runner (local CLI/hook tests), Layer‑2 devcontainer LLM test runner, and provisioning script to populate .claude/.selftune fixtures. High LOC and nontrivial result collection/validation logic.
Makefile & Scripts
Makefile, package.json
Added sandbox-related Make targets and two npm scripts (start, boot) pointing to the CLI help entry. check now depends on sandbox.
GitHub Workflows (pinning & perms)
.github/workflows/... (auto-bump-cli-version.yml, ci.yml, codeql.yml, publish.yml, scorecard.yml)
Pinned multiple actions to specific commit SHAs and introduced explicit workflow/job-level permissions (top-level {} / per-job contents/security permissions).
Repo Ownership & Ignore Rules
.github/CODEOWNERS, .gitignore, .coderabbit.yaml
Adds CODEOWNERS entries, new ignores for tests/sandbox/results/, .env*, docs/strategy/; re-enabled fixture files for review by removing exclusion.
Docs, Roadmap & Architecture
README.md, CONTRIBUTING.md, AGENTS.md, ARCHITECTURE.md, PRD.md, CHANGELOG.md, ROADMAP.md
Major documentation rewrites to describe sandbox, devcontainer testing, repositioned architecture and milestones (v0.x → M#), changelog entry for 0.2.0, and CONTRIBUTING sandbox procedures.
Design & Exec Plans
docs/design-docs/*, docs/exec-plans/*
Adds sandbox architecture and Claude Code design docs, multi-agent sandbox exec plan, and updates design-docs index and tech-debt tracker.
Policy & Product Spec Changes
risk-policy.json, docs/golden-principles.md, docs/product-specs/index.md, llms.txt
Adds docs-drift watch rule, a golden-principle import restriction for contribute/, product-spec links, and a new llms.txt reference doc.

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host<br/>(make sandbox)
    participant Tmp as Temp Sandbox<br/>(Layer 1)
    participant Project as Project<br/>Files
    participant CLI as selftune<br/>CLI
    participant Hooks as Hook<br/>Scripts
    participant Results as results/<br/>JSON

    Host->>Tmp: Create temp HOME
    Host->>Project: Copy logs, transcripts, config
    Host->>Tmp: Set HOME env

    loop Run CLI commands
        Host->>CLI: doctor, evals, status, last, dashboard, contribute
        CLI->>Tmp: Read/write using HOME
        CLI-->>Host: stdout/stderr + exit code
        Host->>Results: Record result
    end

    loop Run Hooks
        Host->>Hooks: prompt-log, skill-eval, session-stop (pipe JSON)
        Hooks->>Tmp: Append JSONL
        Hooks-->>Host: Exit + status
        Host->>Results: Record result
    end

    Host->>Results: Write timestamped JSON report
    Host->>Tmp: Cleanup (or --keep)
sequenceDiagram
    participant Dev as Devcontainer<br/>Startup
    participant FW as iptables<br/>Firewall
    participant Entrypoint as entrypoint.sh<br/>+ provision
    participant Fixtures as Fixture<br/>Files
    participant Docker as Docker<br/>Volumes
    participant Tests as run-with-llm.ts<br/>(LLM Tests)
    participant Claude as Claude Code<br/>CLI/API
    participant Results as results/<br/>JSON

    Dev->>FW: Apply iptables rules (DNS/SSH/HTTPS/HTTP)
    Dev->>Docker: Mount volumes
    Dev->>Entrypoint: Execute provisioning
    Entrypoint->>Fixtures: Copy into .claude/.selftune
    Entrypoint->>Tests: Start LLM test runner
    Tests->>Claude: Invoke `claude -p` / API calls
    Claude->>Tests: Return responses
    Tests->>Results: Validate & append results
    Tests->>Results: Write final report
    Tests-->>Dev: Exit status

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Track A of multi-agent sandbox expansion complete:
- Split sandbox-test-harness.md into sandbox-architecture.md (shared two-layer
  design) and sandbox-claude-code.md (Claude Code-specific implementation)
- Created multi-agent-sandbox.md exec plan with 5 implementation tracks (doc
  restructure, fixtures, Layer 1 coverage, Docker expansion, per-agent docs)
- Created ROADMAP.md with Done/In Progress/Planned sections, agent support
  matrix, and Skill Quality Infrastructure features (badges, auto-evolve,
  marketplace integration, conflict detection)
- Updated design docs index with new file names
- Added docs/strategy/ to .gitignore (strategy docs kept locally as private)

Includes devcontainer setup, workflow improvements, README consolidation, and
infrastructure updates for multi-agent support.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
coderabbitai bot commented Mar 2, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Sandbox test harness for end-to-end CLI testing (local and multi-agent Docker layers).
    • Added replay and contribute CLI capabilities for transcript backfill and anonymized exports.
    • New start script for CLI help.
  • Documentation

    • ROADMAP, design docs, execution plans, and contributing guide expanded with sandbox and devcontainer testing guidance.
    • README refocused to workflow-driven, testing-first narrative.
  • Infrastructure & Configuration

    • Devcontainer-based dev environment with firewall/init sequencing.
    • CI workflows and actions pinned to specific revisions; new Make targets for sandbox workflows.

Walkthrough

Adds a sandbox testing infrastructure (Layer‑1 local + Layer‑2 devcontainer/LLM), devcontainer and Docker images, fixture provisioning and orchestration scripts, init firewall, GitHub Actions pinning/permissions, numerous docs/Makefile/package edits, CODEOWNERS, and repo config updates across ~40 files.

Changes

Cohort / File(s) Summary
Devcontainer & Firewall
.devcontainer/Dockerfile, .devcontainer/devcontainer.json, .devcontainer/init-firewall.sh
New devcontainer image (node:20 + Bun), installs Claude CLI, configures npm/global dirs, persistent history/claude dirs, NET_ADMIN runtime, and adds iptables init script executed at container start.
Sandbox Docker Layer & Entrypoint
tests/sandbox/docker/Dockerfile, tests/sandbox/docker/docker-compose.yml, tests/sandbox/docker/entrypoint.sh
New sandbox Docker image and compose service: Bun + Claude CLI, persistent sandbox HOME volume, chown handling, fixture provisioning at startup, and container entrypoint that runs LLM test runner.
Sandbox Orchestrators & Provisioning
tests/sandbox/run-sandbox.ts, tests/sandbox/docker/run-with-llm.ts, tests/sandbox/provision-claude.sh
Adds Layer‑1 TypeScript sandbox runner (local CLI/hook tests), Layer‑2 TypeScript LLM test runner, and shell provisioning script to populate .claude/.selftune fixtures and transcripts; result collection and validation logic included.
Makefile & package.json
Makefile, package.json
Adds make targets: sandbox, sandbox-llm, sandbox-shell; check now depends on sandbox; new npm start script to show CLI help.
GitHub Workflows (pinning & perms)
.github/workflows/...
auto-bump-cli-version.yml, ci.yml, codeql.yml, publish.yml, scorecard.yml
Pins Actions to specific commit SHAs and introduces explicit top-level/per-job permissions (empty top-level perms + scoped per-job permissions).
Repo Ownership & Ignore Rules
.github/CODEOWNERS, .gitignore, .coderabbit.yaml
Adds CODEOWNERS entries, new ignore patterns (tests/sandbox/results/, docs/strategy/, .env*), and re-includes fixtures in review scope by removing an exclusion.
Docs, Roadmap & Architecture
README.md, CONTRIBUTING.md, AGENTS.md, ARCHITECTURE.md, PRD.md, CHANGELOG.md, ROADMAP.md
Large documentation rewrites: sandbox/devcontainer testing docs, milestone renaming (v0.x → M#), changelog 0.2.0, contributing/testing guidance, and roadmap updates.
Design & Exec Plans
docs/design-docs/*, docs/exec-plans/*
Adds sandbox-architecture and sandbox-claude-code design docs, multi-agent sandbox execution plan, and updates design-docs index and tech-debt tracker.
Policy & Product Spec Changes
risk-policy.json, docs/golden-principles.md, docs/product-specs/index.md, llms.txt
Adds docs-drift watch rule for cli/selftune/contribute/, an anti-pattern lint entry, product-spec rows, and a new llms reference file.

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host (make sandbox)
    participant Tmp as TempSandbox (Layer 1)
    participant Project as ProjectFiles
    participant CLI as selftune CLI
    participant Hooks as Hooks
    participant Results as results/ JSON

    Host->>Tmp: create temp HOME, set HOME env
    Host->>Project: copy fixtures (logs, transcripts, config)
    loop run CLI commands
        Host->>CLI: invoke commands (doctor, evals, dashboard, contribute, etc.)
        CLI->>Tmp: read/write under HOME
        CLI-->>Host: stdout/stderr + exit code
        Host->>Results: record per-command result
    end
    loop run hooks
        Host->>Hooks: invoke prompt-log, skill-eval, session-stop (pipe JSON)
        Hooks->>Tmp: append/modify JSONL logs
        Hooks-->>Host: exit + status
        Host->>Results: record hook result
    end
    Host->>Results: write timestamped report
    Host->>Tmp: cleanup (or keep)
sequenceDiagram
    participant Dev as Devcontainer Startup
    participant FW as iptables (init-firewall.sh)
    participant Docker as Docker Volumes
    participant Entrypoint as entrypoint.sh
    participant Fixtures as FixtureFiles (.claude/.selftune)
    participant Tests as run-with-llm.ts
    participant Claude as Claude Code CLI/API
    participant Results as results/ JSON

    Dev->>FW: apply iptables rules (DNS/SSH/HTTPS/HTTP)
    Dev->>Docker: mount volumes (sandbox HOME, results)
    Dev->>Entrypoint: exec provisioning script
    Entrypoint->>Fixtures: copy fixtures into HOME
    Entrypoint->>Tests: start LLM test runner
    Tests->>Claude: invoke `claude -p` / API calls
    Claude-->>Tests: responses
    Tests->>Results: validate & write final report
    Tests-->>Dev: exit status

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 69.23%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check — ✅ Passed: title follows the conventional commits format (feat:) and clearly describes the main changes: sandbox test harness, multi-agent expansion, and project roadmap.
  • Description check — ✅ Passed: description comprehensively explains changes across ~40 files with a well-organized table, sequence diagrams, and clear cohort groupings that directly relate to the changeset.


WellDunDun changed the title from "docs: sandbox design split, exec plan, and roadmap restructure" to "Sandbox test harness, multi-agent expansion plan, and project roadmap" on Mar 2, 2026
Sandbox fixtures are intentional test data that should be reviewed,
not vendored/generated files to ignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai bot left a comment

Actionable comments posted: 21

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.devcontainer/devcontainer.json:
- Line 10: Remove the unnecessary NET_RAW capability from the devcontainer run
arguments: edit the runArgs array to drop the "--cap-add=NET_RAW" entry so only
"--cap-add=NET_ADMIN" remains; this targets the runArgs declaration and aligns
with the firewall script (init-firewall.sh) which only requires NET_ADMIN and
does not use raw sockets or packet crafting.

In @.devcontainer/Dockerfile:
- Around line 19-23: SNIPPET is defined but never written into any shell
startup, so interactive shells don't use /commandhistory/.bash_history; update
the Dockerfile to write the SNIPPET into a startup file (e.g., create
/etc/profile.d/commandhistory.sh or append to the user's ~/.bashrc) after
creating /commandhistory and setting ownership, by echoing the export lines for
PROMPT_COMMAND and HISTFILE into that file and ensuring it has appropriate
permissions so the shell will source it for interactive sessions; reference the
SNIPPET variable, PROMPT_COMMAND, HISTFILE, /commandhistory/.bash_history and
$USERNAME when making the change so the Docker build persists shell history.
- Around line 33-34: Update both RUN lines that execute the Bun installer ("RUN
curl -fsSL https://bun.sh/install | bash" and the similar second occurrence) to
run the pipeline under a shell that has strict failure handling; prepend a
POSIX-safe strict mode (set -euo pipefail) so that a failed curl or a failing
pipeline will cause the Docker build to fail rather than silently continuing.
Apply this change to both RUN instructions that invoke the bun installer.
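The strict-mode advice above is easy to demonstrate. The sketch below is illustrative, not the repository's Dockerfile: it simulates a failing download with `false` to show why a plain `curl | bash` pipeline can mask a failure, and why `set -euo pipefail` (or Docker's `SHELL ["/bin/bash", "-o", "pipefail", "-c"]` instruction, which applies to every subsequent RUN) surfaces it.

```shell
# Without pipefail, the pipeline's exit status is the last command's,
# so a failing first command (simulating a failed curl) is silently ignored.
bash -c 'set -e; false | cat; echo "build step continues despite failure"'

# With strict mode, the failed pipeline aborts the script immediately.
bash -c 'set -euo pipefail; false | cat; echo "unreachable"' \
  || echo "pipeline failure detected"
```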

In @.devcontainer/init-firewall.sh:
- Line 11: The iptables rule added to the INPUT chain uses the output interface
flag (-o lo) which is invalid there; update the iptables invocation that
currently adds the loopback accept rule (the line running "iptables -A INPUT -o
lo -j ACCEPT") to use the input interface flag (-i lo) instead so it reads as an
INPUT-chain loopback accept; keep the same chain/action/position but replace -o
with -i to prevent the script from failing under set -e.
- Around line 7-22: Add explicit default-deny policies by setting "iptables -P
INPUT DROP", "iptables -P OUTPUT DROP", and "iptables -P FORWARD DROP" before
any -A (append) rules, then keep the existing allow rules (DNS, SSH, loopback,
ESTABLISHED,RELATED, HTTP/HTTPS) after those policies; also correct the loopback
input rule by changing the INPUT rule from using "-o lo" to "-i lo" (replace the
current "iptables -A INPUT -o lo -j ACCEPT" with "iptables -A INPUT -i lo -j
ACCEPT") so loopback traffic is properly allowed.

In `@docs/design-docs/sandbox-architecture.md`:
- Around line 41-47: Add a language tag to the fenced code block that shows the
test fixture tree (change ``` to ```text) to satisfy MD040, and insert a single
blank line after each of the headings in the same section (the headings
surrounding the fixtures block referenced near the tests/sandbox/fixtures tree)
to satisfy MD022/MD040; update the fenced block and add blank lines immediately
after those heading lines so lint no longer reports MD040/MD022.

In `@docs/design-docs/sandbox-claude-code.md`:
- Line 36: Update the docs entry that currently lists Layer 2 as
`tests/sandbox/claude-code/` to the actual location used by the Makefile:
`tests/sandbox/docker/docker-compose.yml`; change the referenced path in
docs/design-docs/sandbox-claude-code.md and any mention of Layer 2 so it matches
the Makefile targets `sandbox-llm` and `sandbox-shell` which point to
`tests/sandbox/docker/docker-compose.yml`.

In `@docs/exec-plans/active/multi-agent-sandbox.md`:
- Around line 1-3: The MD041/MD040 violations come from the verification HTML
comment placed before the H1 and an untyped code fence; move the verification
comment "<!-- Verified: 2026-03-02 -->" so it appears immediately below the H1
"# Execution Plan: Multi-Agent Sandbox Expansion", and update the directory-tree
code fence to include a language specifier (use "text") for the block that lists
tests/sandbox/; apply the same change pattern to the other similar blocks
referenced around the content (lines noted 26-43) so all first-line headings
precede comments and all fences use a language (e.g., ```text).

In `@docs/exec-plans/tech-debt-tracker.md`:
- Line 15: The TD-007 row in the debt tracker was edited but the "Updated" date
wasn't changed; update the "Updated" column for the TD-007 row (the row
mentioning deployProposal / deploy-proposal.ts and evolve.ts) to the current
audit date (e.g., 2026-03-02) so metadata remains consistent with the edit.

In `@docs/golden-principles.md`:
- Line 95: Remove the hardcoded line-number reference in the sentence and
instead point to the lint rule by name or exported identifier in
lint-architecture.ts that enforces the isolated export path (e.g., the
rule/check that enforces "contribute" not importing from hooks/ ingestors/
grading/ evolution/ monitoring). Update the docs/golden-principles.md text to
say something like "enforced by the isolated-export-path rule in
lint-architecture.ts" (or use the actual exported rule/constant/function name
from lint-architecture.ts) so the doc references the rule identifier rather than
specific line numbers.

In `@docs/product-specs/index.md`:
- Around line 10-11: The two table rows in docs/product-specs/index.md
referencing "[ICP & GTM Strategy](../strategy/icp-gtm-strategy.md)" and
"[OpenClaw Integration](../strategy/openclaw-integration.md)" point to a
gitignored/private docs/strategy directory and will 404 for others; either move
the source files into a tracked docs folder and update the links to their new
paths, or replace those two linked entries with plain text labels (e.g., "ICP &
GTM Strategy — draft" and "OpenClaw Integration — draft") until the publishable
docs exist so the index no longer links to private paths.

In `@Makefile`:
- Line 1: Add missing phony targets and minimal implementations for Makefile
compliance: update the .PHONY declaration to include "all" and "clean", add an
"all" target that depends on the existing "check" target (so running `make` runs
checks), and add a "clean" target that removes build artifacts (e.g.,
temporary/build output directories and generated files used by the repo) to
satisfy checkmake requirements; reference the .PHONY line and the "check" target
when making these changes.
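A sketch of what the requested targets could look like. The `check`, `sandbox`, `sandbox-llm`, and `sandbox-shell` target names come from this PR; the `clean` recipe path is inferred from the new results-directory ignore rule and is an assumption, not the repo's actual Makefile.

```make
.PHONY: all check clean sandbox sandbox-llm sandbox-shell

# Default goal: running plain `make` now runs the checks.
all: check

# Remove generated sandbox output (path assumed from the new ignore rules).
clean:
	rm -rf tests/sandbox/results/
```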

In `@package.json`:
- Around line 53-54: Keep a single canonical script and make the other an alias:
remove the duplicate command by preserving "start" with the real command "bun
run cli/selftune/index.ts --help" and change "boot" to run that script (e.g.,
set "boot" to "npm run start" or "bun run npm: start") so "boot" is an alias to
"start" and future drift is avoided; update the package.json "scripts" entries
for "start" and "boot" accordingly.
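One possible shape for that alias, sketched here rather than taken from the repo: `bun run start` resolves to the package's own `start` script, so `boot` stays a thin alias and the real command lives in one place.

```json
{
  "scripts": {
    "start": "bun run cli/selftune/index.ts --help",
    "boot": "bun run start"
  }
}
```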

In `@tests/sandbox/docker/Dockerfile`:
- Around line 6-7: The Dockerfile uses ARG CLAUDE_CODE_VERSION=latest which
makes builds non-reproducible; change the default to a pinned semantic version
(e.g., ARG CLAUDE_CODE_VERSION=1.2.3) and ensure any other occurrence (same ARG
at the other location referenced) is updated consistently so builds use the
fixed version unless overridden at build time; update documentation or build
scripts to pass the desired version via --build-arg CLAUDE_CODE_VERSION when
needed and add a test or note to verify the pinned value is used during sandbox
builds.

In `@tests/sandbox/docker/run-with-llm.ts`:
- Around line 143-152: The JSON.parse call for grading-result.json can throw an
opaque error if the file is malformed; wrap the read + JSON.parse (where
resultPath, readFileSync and JSON.parse are used in run-with-llm.ts) in a
try-catch, catch parse errors and throw a new Error that includes the resultPath
and the original error message (or attach it) so the failure clearly indicates
which file failed to parse and why; keep the existing existsSync check and only
proceed to parse inside the try block, rethrow or throw a new descriptive Error
on failure.
- Around line 67-76: The code currently awaits proc.exited before consuming the
process streams, which can deadlock; change the order in the function that
spawns Bun (using the proc variable) so you create promises for stdout and
stderr (e.g., const stdoutPromise = new Response(proc.stdout).text(); const
stderrPromise = new Response(proc.stderr).text();) before awaiting proc.exited,
then await proc.exited and finally await those promises (await stdoutPromise,
await stderrPromise) to return { exitCode, stdout, stderr }—update references to
proc, proc.stdout, proc.stderr, and proc.exited accordingly.
- Around line 97-105: The current code awaits proc.exited before consuming
proc.stdout/proc.stderr which can deadlock if the child blocks writing; change
the flow to start reading both streams into text Promises immediately (e.g.,
create stdoutPromise and stderrPromise from new Response(proc.stdout).text() and
new Response(proc.stderr).text()), then await Promise.race/Promise.all with
proc.exited and the timeout, and only after the exit resolve use the
already-started stdout/stderr promises (await stdoutPromise and stderrPromise)
to include output in the error message; ensure the timeout (timeout) is still
cleared when exit completes and that proc.kill() is called on timeout.

In `@tests/sandbox/provision-claude.sh`:
- Line 45: The cp of "${FIXTURES}/transcripts/*.jsonl" into
"${TARGET_HOME}/.claude/projects/default/" will fail if the glob matches
nothing; modify the script around that cp to guard the copy by first checking
for the existence of matching files (e.g., test for files matching
"${FIXTURES}/transcripts/"/*.jsonl or use shell option nullglob) and only run
the cp (or a safe loop to copy each file) when at least one transcript exists;
ensure the target directory "${TARGET_HOME}/.claude/projects/default/" is
created (mkdir -p) before copying.
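The nullglob behavior this comment relies on can be shown in isolation. A hedged sketch using a throwaway directory rather than the real fixture paths:

```shell
# An unmatched glob normally stays literal, so `cp` would receive the
# string "*.jsonl" itself and fail; with nullglob it expands to nothing,
# leaving an empty array that the guard can check.
tmpdir=$(mktemp -d)
shopt -s nullglob
transcripts=("$tmpdir"/*.jsonl)
echo "matched ${#transcripts[@]} transcript files"
shopt -u nullglob
rmdir "$tmpdir"
```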

In `@tests/sandbox/run-sandbox.ts`:
- Around line 221-229: The main function currently logs and calls setupSandbox()
but doesn't guarantee cleanup on exceptions; wrap the actionable portion of main
(everything after setupSandbox() and any early-exit logic) in a try/finally so
cleanupSandbox() in your existing cleanup block (see the cleanup logic around
the block referenced at lines 458-471) always runs; ensure any async calls
inside the try are awaited so the finally runs at the correct time, and rethrow
the error after cleanup if you want the caller to observe failures.
- Around line 53-93: The setupSandbox function duplicates provisioning steps
already implemented in provision-claude.sh (creating SANDBOX_SELFTUNE_DIR,
SANDBOX_PROJECTS_DIR, copying FIXTURES_DIR files into SANDBOX_CLAUDE_DIR and
SANDBOX_PROJECTS_DIR, and copying selftune/claude settings); replace the
duplicated logic by invoking the existing provision-claude.sh script (or
refactor the shell logic into a shared TypeScript helper) from setupSandbox so
there is a single source of truth for fixture paths and copying behavior; keep
references to the same SANDBOX_* and FIXTURES_DIR constants and ensure
setupSandbox either spawns the script (capturing errors) or calls the shared
helper function so tests and Docker provisioning use identical logic.
- Around line 210-215: The current check in fileHasNewContent (and similar
functions around lines 337-390) can miscount empty files because calling trim()
before split may yield a single empty string; change the logic to first read the
raw content, return false immediately if content === "" (or content.length ===
0), or split on "\n" without trimming and then filter out blank lines via
.filter(l => l.trim()) to compute lines.length; update fileHasNewContent to use
this pattern so first-write append checks correctly treat empty files as 0
lines.
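A minimal sketch of the counting pattern this comment suggests. The helper name is hypothetical; this is not the actual `fileHasNewContent` implementation.

```typescript
// Count only non-blank lines, so an empty file reports 0 lines instead of 1
// (splitting "" on "\n" yields [""], which naive counting treats as one line).
function countJsonlLines(content: string): number {
  if (content.length === 0) return 0;
  return content.split("\n").filter((line) => line.trim() !== "").length;
}

console.log(countJsonlLines(""));        // 0, not 1
console.log(countJsonlLines("a\nb\n"));  // 2: trailing newline ignored
console.log(countJsonlLines("\n  \n"));  // 0: blank lines ignored
```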

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between dfa8fb4 and f5dfa73.

⛔ Files ignored due to path filters (17)
  • tests/sandbox/fixtures/all_queries_log.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/claude-settings.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/evolution_audit_log.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/hook-payloads/post-tool-use.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/hook-payloads/prompt-submit.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/hook-payloads/session-stop.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/selftune-config.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/session_telemetry_log.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/skill_usage_log.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/skills/ai-image-generation/SKILL.md is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/skills/find-skills/SKILL.md is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/skills/frontend-design/SKILL.md is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-001.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-002.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-003.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-004.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-005.jsonl is excluded by !**/fixtures/**
📒 Files selected for processing (35)
  • .devcontainer/Dockerfile
  • .devcontainer/devcontainer.json
  • .devcontainer/init-firewall.sh
  • .github/CODEOWNERS
  • .github/workflows/auto-bump-cli-version.yml
  • .github/workflows/ci.yml
  • .github/workflows/codeql.yml
  • .github/workflows/publish.yml
  • .github/workflows/scorecard.yml
  • .gitignore
  • AGENTS.md
  • ARCHITECTURE.md
  • CHANGELOG.md
  • CONTRIBUTING.md
  • Makefile
  • PRD.md
  • README.md
  • ROADMAP.md
  • docs/design-docs/index.md
  • docs/design-docs/sandbox-architecture.md
  • docs/design-docs/sandbox-claude-code.md
  • docs/exec-plans/active/multi-agent-sandbox.md
  • docs/exec-plans/completed/.gitkeep
  • docs/exec-plans/tech-debt-tracker.md
  • docs/golden-principles.md
  • docs/product-specs/index.md
  • llms.txt
  • package.json
  • risk-policy.json
  • tests/sandbox/docker/Dockerfile
  • tests/sandbox/docker/docker-compose.yml
  • tests/sandbox/docker/entrypoint.sh
  • tests/sandbox/docker/run-with-llm.ts
  • tests/sandbox/provision-claude.sh
  • tests/sandbox/run-sandbox.ts

Comment on lines +7 to +22
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -o lo -j ACCEPT

# Allow established connections
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow HTTPS (port 443) for Claude API, npm, GitHub
iptables -A OUTPUT -p tcp --dport 443 -j ACCEPT

# Allow HTTP (port 80) for package registries
iptables -A OUTPUT -p tcp --dport 80 -j ACCEPT


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd .devcontainer && ls -la | grep init && cat -n init-firewall.sh

Repository: WellDunDun/selftune



🏁 Script executed:

# Verify no DROP policies exist in the file
rg -c 'iptables -P' .devcontainer/init-firewall.sh || echo "No -P policy rules found"



Set default policies to DROP before appending allow rules.

The firewall config uses only ACCEPT rules with no default policies defined. Without explicit -P INPUT DROP, -P OUTPUT DROP, -P FORWARD DROP at the start, traffic not matching any rule still passes through default system policies (typically ACCEPT). This defeats the deny-by-default approach.

Also fix line 11: iptables -A INPUT -i lo -j ACCEPT (use -i for input interface, not -o).

Required changes
+iptables -P INPUT DROP
+iptables -P OUTPUT DROP
+iptables -P FORWARD DROP
+
 # Allow DNS, SSH, localhost
 iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
 iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
 iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
 iptables -A OUTPUT -o lo -j ACCEPT
-iptables -A INPUT -o lo -j ACCEPT
+iptables -A INPUT -i lo -j ACCEPT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.devcontainer/init-firewall.sh around lines 7 - 22, Add explicit
default-deny policies by setting "iptables -P INPUT DROP", "iptables -P OUTPUT
DROP", and "iptables -P FORWARD DROP" before any -A (append) rules, then keep
the existing allow rules (DNS, SSH, loopback, ESTABLISHED,RELATED, HTTP/HTTPS)
after those policies; also correct the loopback input rule by changing the INPUT
rule from using "-o lo" to "-i lo" (replace the current "iptables -A INPUT -o lo
-j ACCEPT" with "iptables -A INPUT -i lo -j ACCEPT") so loopback traffic is
properly allowed.

cp "${FIXTURES}/skills/ai-image-generation/SKILL.md" "${TARGET_HOME}/.claude/skills/ai-image-generation/SKILL.md"

# Session transcripts
cp "${FIXTURES}"/transcripts/*.jsonl "${TARGET_HOME}/.claude/projects/default/"

⚠️ Potential issue | 🟡 Minor

Guard transcript copy when the fixture glob is empty.

The current wildcard copy can fail hard when no .jsonl transcript files are present.

Suggested fix
 # Session transcripts
-cp "${FIXTURES}"/transcripts/*.jsonl "${TARGET_HOME}/.claude/projects/default/"
+shopt -s nullglob
+transcript_files=("${FIXTURES}"/transcripts/*.jsonl)
+if ((${#transcript_files[@]} > 0)); then
+  cp "${transcript_files[@]}" "${TARGET_HOME}/.claude/projects/default/"
+fi
+shopt -u nullglob
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-cp "${FIXTURES}"/transcripts/*.jsonl "${TARGET_HOME}/.claude/projects/default/"
+# Session transcripts
+shopt -s nullglob
+transcript_files=("${FIXTURES}"/transcripts/*.jsonl)
+if ((${#transcript_files[@]} > 0)); then
+  cp "${transcript_files[@]}" "${TARGET_HOME}/.claude/projects/default/"
+fi
+shopt -u nullglob
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sandbox/provision-claude.sh` at line 45, The cp of
"${FIXTURES}/transcripts/*.jsonl" into
"${TARGET_HOME}/.claude/projects/default/" will fail if the glob matches
nothing; modify the script around that cp to guard the copy by first checking
for the existence of matching files (e.g., test for files matching
"${FIXTURES}/transcripts/"*.jsonl or use the shell option nullglob) and only run
the cp (or a safe loop to copy each file) when at least one transcript exists;
ensure the target directory "${TARGET_HOME}/.claude/projects/default/" is
created (mkdir -p) before copying.
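
The nullglob guard recommended above can be exercised end to end in isolation. The sketch below uses throwaway temp directories in place of the real `FIXTURES` and `TARGET_HOME` paths, so the directory names are illustrative only:

```shell
#!/usr/bin/env bash
# Standalone demo of the nullglob guard: with no matching .jsonl files,
# the array expands to empty and cp is skipped, instead of cp failing
# on a literal unexpanded glob.
set -euo pipefail

src="$(mktemp -d)"   # stands in for "${FIXTURES}/transcripts"
dst="$(mktemp -d)"   # stands in for "${TARGET_HOME}/.claude/projects/default"
mkdir -p "${dst}"

shopt -s nullglob
transcript_files=("${src}"/*.jsonl)
if ((${#transcript_files[@]} > 0)); then
  cp "${transcript_files[@]}" "${dst}/"
fi
shopt -u nullglob

echo "copied=${#transcript_files[@]}"   # prints copied=0 for the empty dir
```

With an empty source directory this prints `copied=0` and exits 0; the unguarded `cp "${src}"/*.jsonl "${dst}/"` would instead fail with "No such file or directory".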

Comment on lines +53 to +93
function setupSandbox(): void {
  // Create directory structure
  mkdirSync(SANDBOX_SELFTUNE_DIR, { recursive: true });
  mkdirSync(SANDBOX_PROJECTS_DIR, { recursive: true });

  // Copy JSONL log fixtures into ~/.claude/
  const logFiles = [
    "all_queries_log.jsonl",
    "skill_usage_log.jsonl",
    "session_telemetry_log.jsonl",
    "evolution_audit_log.jsonl",
  ];
  for (const file of logFiles) {
    const src = join(FIXTURES_DIR, file);
    if (existsSync(src)) {
      copyFileSync(src, join(SANDBOX_CLAUDE_DIR, file));
    }
  }

  // Copy transcripts into ~/.claude/projects/default/ as individual session files
  const transcriptsDir = join(FIXTURES_DIR, "transcripts");
  if (existsSync(transcriptsDir)) {
    for (const file of readdirSync(transcriptsDir)) {
      if (file.endsWith(".jsonl")) {
        copyFileSync(join(transcriptsDir, file), join(SANDBOX_PROJECTS_DIR, file));
      }
    }
  }

  // Copy selftune config
  copyFileSync(
    join(FIXTURES_DIR, "selftune-config.json"),
    join(SANDBOX_SELFTUNE_DIR, "config.json"),
  );

  // Copy Claude Code settings
  copyFileSync(
    join(FIXTURES_DIR, "claude-settings.json"),
    join(SANDBOX_CLAUDE_DIR, "settings.json"),
  );
}

🧹 Nitpick | 🔵 Trivial

Consolidate sandbox provisioning to a single source of truth.

This setup duplicates tests/sandbox/provision-claude.sh logic; fixture path changes can drift between the two flows. Reusing the provision script (or a shared TS helper) will keep Layer 1 and Docker setup consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sandbox/run-sandbox.ts` around lines 53 - 93, The setupSandbox function
duplicates provisioning steps already implemented in provision-claude.sh
(creating SANDBOX_SELFTUNE_DIR, SANDBOX_PROJECTS_DIR, copying FIXTURES_DIR files
into SANDBOX_CLAUDE_DIR and SANDBOX_PROJECTS_DIR, and copying selftune/claude
settings); replace the duplicated logic by invoking the existing
provision-claude.sh script (or refactor the shell logic into a shared TypeScript
helper) from setupSandbox so there is a single source of truth for fixture paths
and copying behavior; keep references to the same SANDBOX_* and FIXTURES_DIR
constants and ensure setupSandbox either spawns the script (capturing errors) or
calls the shared helper function so tests and Docker provisioning use identical
logic.

@WellDunDun WellDunDun changed the title Sandbox test harness, multi-agent expansion plan, and project roadmap feat: sandbox test harness, multi-agent expansion plan, and project roadmap Mar 2, 2026
WellDunDun and others added 2 commits March 2, 2026 13:11
Devcontainer security:
- Remove unnecessary NET_RAW capability from runArgs
- Fix iptables INPUT chain: -o lo → -i lo for loopback
- Add default DROP policies for INPUT/FORWARD/OUTPUT chains
- Add pipefail to curl|bash Bun installs
- Append SNIPPET to .bashrc for sandbox-agent shell

Code fixes:
- Fix stream deadlock: consume stdout/stderr concurrently via Promise.all
- Add try-catch around JSON.parse for malformed grading results
- Guard transcript glob copy with nullglob/array check
- Extract countLines helper, handle empty files correctly
- Wrap main test execution in try/finally for cleanup

Documentation:
- Fix path reference: tests/sandbox/claude-code/ → tests/sandbox/docker/
- Add language tags to bare fenced code blocks
- Move HTML comment below H1 in multi-agent-sandbox.md
- Update TD-007 date to 2026-03-02
- Remove hardcoded line numbers from golden-principles.md
- Mark gitignored strategy doc links as "local only"
- Deduplicate start/boot scripts in package.json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Conflicts resolved in ARCHITECTURE.md, CHANGELOG.md, README.md,
docs/design-docs/index.md. Kept both sides: dev's v0.8 features
(auto-activation, enforcement, memory, dashboard-server, agents)
and HEAD's sandbox test harness + devcontainer content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (2)
tests/sandbox/run-sandbox.ts (1)

231-237: ⚠️ Potential issue | 🟠 Major

Cleanup is still not guaranteed on all failure paths.

setupSandbox() is outside the try/finally, and process.exit(...) is called inside the try. This can still leave /tmp/selftune-sandbox-* behind on early failures.

Safer control-flow pattern
 async function main(): Promise<void> {
@@
-  // Setup
-  setupSandbox();
-  console.log("Sandbox directories created and fixtures copied.\n");
-
-  const results: RawTestResult[] = [];
-
-  try {
+  const results: RawTestResult[] = [];
+  let exitCode = 1;
+  try {
+    setupSandbox();
+    console.log("Sandbox directories created and fixtures copied.\n");
@@
-    process.exit(passed === total ? 0 : 1);
+    exitCode = passed === total ? 0 : 1;
   } finally {
@@
   }
+  process.exit(exitCode);
 }
Note: in Node.js/Bun, process.exit() terminates the process immediately without unwinding the stack, so an enclosing finally block does not run — which is why the exit call is deferred until after cleanup here.

Also applies to: 449-461
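
The safer control flow can be sketched end to end with throwaway stand-ins (runTests and cleanupSandbox are illustrative names, not this PR's actual functions):

```typescript
// Sketch of the suggested pattern: cleanup in the finally block always runs
// because process.exit is deferred until after main() returns.
let cleaned = false;

function cleanupSandbox(): void {
  cleaned = true; // stands in for removing /tmp/selftune-sandbox-*
}

function runTests(): boolean {
  return true; // stands in for the real Layer-1 test loop
}

function main(): number {
  let exitCode = 1;
  try {
    exitCode = runTests() ? 0 : 1;
  } finally {
    cleanupSandbox(); // runs on success, failure, and thrown errors alike
  }
  return exitCode;
}

const code = main();
console.log(`exitCode=${code} cleaned=${cleaned}`); // prints "exitCode=0 cleaned=true"
// A real runner would now call: process.exit(code)
```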

docs/design-docs/sandbox-architecture.md (1)

61-71: ⚠️ Potential issue | 🟡 Minor

Add blank lines after subsection headings to satisfy MD022.

The past review flagged missing blank lines after the headings on lines 61, 64, and 69. This markdownlint violation (MD022) has not been resolved.

📝 Proposed fix to add blank lines
 ### 1. HOME Env Var Redirection
+
 All selftune paths go through `homedir()` in `constants.ts`. Setting `HOME=/tmp/sandbox-*` redirects everything without modifying production code.
 
 ### 2. Two-Layer Architecture
+
 - Layer 1 is free, fast (~400ms), and runs in CI
 - Layer 2 costs tokens and requires Docker, reserved for pre-release validation
 - Both share the same fixture data
 
 ### 3. Result Recording
+
 Every test run saves a JSON report to `tests/sandbox/results/` with command, exit code, stdout, stderr, duration, and pass/fail. This creates a historical record of sandbox health.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design-docs/sandbox-architecture.md` around lines 61 - 71, The MD022
violation is due to missing blank lines after the subsection headings "1. HOME
Env Var Redirection", "2. Two-Layer Architecture", and "3. Result Recording" in
sandbox-architecture.md; fix it by inserting a single blank line immediately
after each of those heading lines so each heading is followed by a blank line
before the paragraph or list that follows, ensuring the file satisfies
markdownlint MD022.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.devcontainer/Dockerfile:
- Line 5: Replace the dynamic build ARG CLAUDE_CODE_VERSION=latest with a pinned
explicit version (for example set ARG CLAUDE_CODE_VERSION=1.0.100) so
devcontainer builds are reproducible; locate the CLAUDE_CODE_VERSION ARG
declaration and any later references (the second occurrence that injects or
installs the Claude Code CLI) and update them to use the chosen fixed version
string, and document the chosen version in a comment so future bumps are
intentional.

In @.devcontainer/init-firewall.sh:
- Around line 7-27: The firewall currently only configures IPv4 via iptables,
allowing IPv6 traffic to bypass restrictions; add equivalent IPv6 rules using
ip6tables to mirror every iptables rule (accept loopback, DNS tcp/udp 53, SSH
tcp 22, established/related states, HTTPS 443, HTTP 80) and then set default
policies to DROP (INPUT, FORWARD, OUTPUT) for ip6tables as well; ensure rule
ordering matches the existing iptables block and use the same match flags (e.g.,
-m state --state ESTABLISHED,RELATED) so both iptables and ip6tables enforce the
same policy.

In `@docs/design-docs/sandbox-architecture.md`:
- Around line 91-97: The table entries for per-agent design docs are
inconsistent: "Claude Code" uses a markdown link while "Codex" and "OpenCode"
are plain text; update the rows in the table so formatting is consistent —
either convert "sandbox-codex.md" and "sandbox-opencode.md" to markdown links
like [sandbox-codex.md](sandbox-codex.md) and
[sandbox-opencode.md](sandbox-opencode.md) or remove the link syntax for
"sandbox-claude-code.md" so all three are plain text; update the entries for the
Agent and Design Doc columns in the table (e.g., the "Claude Code", "Codex",
"OpenCode" and their associated design-doc strings) and keep the Status column
unchanged.
- Around line 33-35: The docs duplicate the cost statement across two bullet
points; remove the redundant line so the "Cost" section only states "Uses
existing agent subscription — no per-call API charges." and keep the "Mechanism"
section as "Per-agent Docker containers with the agent's CLI installed. Uses
existing agent subscription — no separate API key needed." Locate and edit the
two bullets under the same heading in sandbox-architecture.md (the "Mechanism"
and "Cost" list items) and delete the repeated subscription/charges sentence
from the Cost bullet.

In `@tests/sandbox/docker/run-with-llm.ts`:
- Around line 225-241: testWatch currently doesn't validate the watch command
result so failures can be missed; update the test in run-with-llm.ts (the block
using runSelftune and returned values) to assert the expected exitCode and
result shape: verify exitCode equals the expected value (e.g., 1 for regression
detection), check stdout/stderr contain the expected markers (e.g., a known
"regression" or "watch" message) and ensure the returned object includes both
stdout and exitCode in the expected types before returning; reference the
runSelftune invocation and the surrounding testWatch/runTest flow when adding
these assertions.
- Around line 157-170: The code parses JSON into a loose Record<string, unknown>
(variable result) but then accesses result.summary.passed/result.summary.total
and result.reason/result.proposal without proper narrowing; add a type guard
(e.g., isGradingResult(obj): obj is GradingResult) or inline typeof checks
before using each property—verify result.summary is an object and typeof
result.summary.pass_rate === 'number' and typeof result.summary.passed ===
'number' and typeof result.summary.total === 'number' before logging, and verify
typeof result.reason === 'string' (or typeof result.proposal === 'string')
before using nullish-coalescing; replace unsafe property accesses with guarded
accesses or cast only after the guard so TypeScript knows the types are safe.

---

Duplicate comments:
In `@docs/design-docs/sandbox-architecture.md`:
- Around line 61-71: The MD022 violation is due to missing blank lines after the
subsection headings "1. HOME Env Var Redirection", "2. Two-Layer Architecture",
and "3. Result Recording" in sandbox-architecture.md; fix it by inserting a
single blank line immediately after each of those heading lines so each heading
is followed by a blank line before the paragraph or list that follows, ensuring
the file satisfies markdownlint MD022.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8111121 and 8962186.

📒 Files selected for processing (13)
  • .devcontainer/Dockerfile
  • .devcontainer/devcontainer.json
  • .devcontainer/init-firewall.sh
  • docs/design-docs/sandbox-architecture.md
  • docs/design-docs/sandbox-claude-code.md
  • docs/exec-plans/active/multi-agent-sandbox.md
  • docs/exec-plans/tech-debt-tracker.md
  • docs/golden-principles.md
  • docs/product-specs/index.md
  • package.json
  • tests/sandbox/docker/run-with-llm.ts
  • tests/sandbox/provision-claude.sh
  • tests/sandbox/run-sandbox.ts


ARG TZ
ENV TZ="$TZ"
ARG CLAUDE_CODE_VERSION=latest

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "Dockerfile" -path "*devcontainer*" | head -5

Repository: WellDunDun/selftune

Length of output: 89


🏁 Script executed:

cat -n .devcontainer/Dockerfile | head -60

Repository: WellDunDun/selftune

Length of output: 2243


🌐 Web query:

@anthropic-ai/claude-code npm package latest version stable

💡 Result:

The latest stable version of the @anthropic-ai/claude-code npm package is 1.0.100. [1]


Pin Claude Code CLI to an explicit version instead of latest.

Using CLAUDE_CODE_VERSION=latest makes devcontainer builds non-reproducible and can break unexpectedly when upstream publishes changes. Pin to a stable version (e.g., 1.0.100) and bump intentionally.

Affects lines 5 and 49.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.devcontainer/Dockerfile at line 5, Replace the dynamic build ARG
CLAUDE_CODE_VERSION=latest with a pinned explicit version (for example set ARG
CLAUDE_CODE_VERSION=1.0.100) so devcontainer builds are reproducible; locate the
CLAUDE_CODE_VERSION ARG declaration and any later references (the second
occurrence that injects or installs the Claude Code CLI) and update them to use
the chosen fixed version string, and document the chosen version in a comment so
future bumps are intentional.
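
Applied to the Dockerfile, the pin is a two-line change. This is a sketch, not the file's exact contents; 1.0.100 is taken from the version check above:

```dockerfile
# Pinned intentionally for reproducible builds; bump via PR rather than
# floating on `latest`.
ARG CLAUDE_CODE_VERSION=1.0.100

# ...later, where the CLI is installed:
RUN npm install -g "@anthropic-ai/claude-code@${CLAUDE_CODE_VERSION}"
```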

Comment on lines +7 to +27
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT

# Allow established connections
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow HTTPS (port 443) for Claude API, npm, GitHub
iptables -A OUTPUT -p tcp --dport 443 -j ACCEPT

# Allow HTTP (port 80) for package registries
iptables -A OUTPUT -p tcp --dport 80 -j ACCEPT

# Default deny all other traffic
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT DROP


⚠️ Potential issue | 🟠 Major

Firewall policy is IPv4-only; add IPv6 controls to avoid bypass.

Current rules apply to iptables only. If IPv6 is available, outbound traffic can bypass these restrictions.

Suggested hardening
+# Mirror policy for IPv6 (or explicitly disable IPv6 in container)
+ip6tables -P INPUT DROP
+ip6tables -P FORWARD DROP
+ip6tables -P OUTPUT DROP
+
+ip6tables -A OUTPUT -p udp --dport 53 -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 53 -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 22 -j ACCEPT
+ip6tables -A OUTPUT -o lo -j ACCEPT
+ip6tables -A INPUT -i lo -j ACCEPT
+ip6tables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
+ip6tables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 443 -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 80 -j ACCEPT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.devcontainer/init-firewall.sh around lines 7 - 27, The firewall currently
only configures IPv4 via iptables, allowing IPv6 traffic to bypass restrictions;
add equivalent IPv6 rules using ip6tables to mirror every iptables rule (accept
loopback, DNS tcp/udp 53, SSH tcp 22, established/related states, HTTPS 443,
HTTP 80) and then set default policies to DROP (INPUT, FORWARD, OUTPUT) for
ip6tables as well; ensure rule ordering matches the existing iptables block and
use the same match flags (e.g., -m state --state ESTABLISHED,RELATED) so both
iptables and ip6tables enforce the same policy.

Comment on lines +33 to +35
**Mechanism:** Per-agent Docker containers with the agent's CLI installed. Uses existing agent subscription — no separate API key needed.

**Cost:** Uses existing agent subscription — no per-call API charges.

🧹 Nitpick | 🔵 Trivial

Remove duplicate cost statement.

Lines 33 and 35 both state that the solution uses the existing agent subscription without additional charges. The repetition is unnecessary.

✂️ Proposed fix to remove duplication
 **Mechanism:** Per-agent Docker containers with the agent's CLI installed. Uses existing agent subscription — no separate API key needed.
-
-**Cost:** Uses existing agent subscription — no per-call API charges.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design-docs/sandbox-architecture.md` around lines 33 - 35, The docs
duplicate the cost statement across two bullet points; remove the redundant line
so the "Cost" section only states "Uses existing agent subscription — no
per-call API charges." and keep the "Mechanism" section as "Per-agent Docker
containers with the agent's CLI installed. Uses existing agent subscription — no
separate API key needed." Locate and edit the two bullets under the same heading
in sandbox-architecture.md (the "Mechanism" and "Cost" list items) and delete
the repeated subscription/charges sentence from the Cost bullet.

Comment on lines +91 to +97
## Per-Agent Design Docs

| Agent | Design Doc | Status |
|-------|-----------|--------|
| Claude Code | [sandbox-claude-code.md](sandbox-claude-code.md) | Implemented |
| Codex | sandbox-codex.md | Planned |
| OpenCode | sandbox-opencode.md | Planned |

🧹 Nitpick | 🔵 Trivial

Consider consistent formatting for planned design docs.

Line 95 uses a markdown link for the implemented Claude Code doc, while lines 96-97 show plain text for planned docs. For consistency, either use markdown links for all (even if they don't resolve yet) or use plain text uniformly with status indicating availability.

🔗 Option: Use consistent markdown links
 | Agent | Design Doc | Status |
 |-------|-----------|--------|
 | Claude Code | [sandbox-claude-code.md](sandbox-claude-code.md) | Implemented |
-| Codex | sandbox-codex.md | Planned |
-| OpenCode | sandbox-opencode.md | Planned |
+| Codex | [sandbox-codex.md](sandbox-codex.md) | Planned |
+| OpenCode | [sandbox-opencode.md](sandbox-opencode.md) | Planned |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design-docs/sandbox-architecture.md` around lines 91 - 97, The table
entries for per-agent design docs are inconsistent: "Claude Code" uses a
markdown link while "Codex" and "OpenCode" are plain text; update the rows in
the table so formatting is consistent — either convert "sandbox-codex.md" and
"sandbox-opencode.md" to markdown links like
[sandbox-codex.md](sandbox-codex.md) and
[sandbox-opencode.md](sandbox-opencode.md) or remove the link syntax for
"sandbox-claude-code.md" so all three are plain text; update the entries for the
Agent and Design Doc columns in the table (e.g., the "Claude Code", "Codex",
"OpenCode" and their associated design-doc strings) and keep the Status column
unchanged.

Comment on lines +157 to +170
  let result: Record<string, unknown>;
  try {
    result = JSON.parse(raw);
  } catch {
    throw new Error(`grading-result.json is not valid JSON: ${raw.slice(0, 300)}`);
  }

  if (!result.summary || typeof result.summary.pass_rate !== "number") {
    throw new Error("Grading result missing summary.pass_rate");
  }

  console.log(
    `  [grade] ${result.summary.passed}/${result.summary.total} (${Math.round(result.summary.pass_rate * 100)}%)`,
  );

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== TypeScript configuration ==="
fd '^tsconfig.*\.json$' --type f | head -5 | while read f; do
  echo "--- $f ---"
  cat "$f" | grep -E '"strict"|"noImplicitAny"|"strictNullChecks"|"noUncheckedIndexedAccess"' || true
done

echo
echo "=== Full context around lines 157-170 and 200-215 ==="
wc -l tests/sandbox/docker/run-with-llm.ts

echo
echo "=== Lines 150-220 (full context) ==="
sed -n '150,220p' tests/sandbox/docker/run-with-llm.ts

Repository: WellDunDun/selftune

Length of output: 2561


Remove unsafe property access on unknown types without proper narrowing.

Under strict TypeScript, accessing result.summary.passed and result.summary.total (line 165) without type guards violates type safety. The check !result.summary confirms existence but doesn't narrow the property types to non-unknown.

Line 208 similarly accesses result.reason (in the nullish coalescing) without establishing its type—the truthiness check !result.reason && !result.proposal doesn't narrow the type of .reason itself.

Type-guard or narrow each accessed property:

  • typeof result.summary?.passed === 'number' before use
  • typeof result.reason === 'string' before accessing

Same applies to line 200–215 region.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sandbox/docker/run-with-llm.ts` around lines 157 - 170, The code parses
JSON into a loose Record<string, unknown> (variable result) but then accesses
result.summary.passed/result.summary.total and result.reason/result.proposal
without proper narrowing; add a type guard (e.g., isGradingResult(obj): obj is
GradingResult) or inline typeof checks before using each property—verify
result.summary is an object and typeof result.summary.pass_rate === 'number' and
typeof result.summary.passed === 'number' and typeof result.summary.total ===
'number' before logging, and verify typeof result.reason === 'string' (or typeof
result.proposal === 'string') before using nullish-coalescing; replace unsafe
property accesses with guarded accesses or cast only after the guard so
TypeScript knows the types are safe.
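
A type guard of the shape the prompt describes might look like the sketch below. The GradingResult interface is an assumption reconstructed from the fields the review mentions (summary.pass_rate/passed/total, reason, proposal), not the project's actual type:

```typescript
// Hypothetical types mirroring the fields named in the review.
interface GradingSummary {
  pass_rate: number;
  passed: number;
  total: number;
}
interface GradingResult {
  summary: GradingSummary;
  reason?: string;
  proposal?: string;
}

// User-defined type guard: after this returns true, TypeScript narrows
// the value so summary fields can be accessed without casts.
function isGradingResult(obj: unknown): obj is GradingResult {
  if (typeof obj !== "object" || obj === null) return false;
  const summary = (obj as { summary?: unknown }).summary;
  if (typeof summary !== "object" || summary === null) return false;
  const s = summary as Record<string, unknown>;
  return (
    typeof s.pass_rate === "number" &&
    typeof s.passed === "number" &&
    typeof s.total === "number"
  );
}

const parsed: unknown = JSON.parse(
  '{"summary":{"pass_rate":0.9,"passed":9,"total":10}}',
);
if (!isGradingResult(parsed)) {
  throw new Error("grading-result.json has unexpected shape");
}
console.log(`${parsed.summary.passed}/${parsed.summary.total}`); // prints "9/10"
```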

Comment on lines +225 to +241
  const { exitCode, stdout, stderr } = await runSelftune([
    "watch",
    "--skill",
    "find-skills",
    "--skill-path",
    skillPath,
    "--window",
    "20",
    "--threshold",
    "0.1",
  ]);

  // Watch exits 1 on regression detection (expected for test data)
  console.log(`  [watch] exit=${exitCode}`);
  console.log(`  [watch] ${stdout.slice(0, 200)}`);
  return { stdout: stdout.slice(0, 1000), exitCode };
}

⚠️ Potential issue | 🟠 Major

watch test can pass on unexpected failures.

testWatch never asserts expected exitCode or result shape, so runTest marks it PASS as long as no exception is thrown. This can hide regressions in watch.

Tighten validation for watch expectations
 async function testWatch(): Promise<unknown> {
@@
-  // Watch exits 1 on regression detection (expected for test data)
-  console.log(`  [watch] exit=${exitCode}`);
-  console.log(`  [watch] ${stdout.slice(0, 200)}`);
-  return { stdout: stdout.slice(0, 1000), exitCode };
+  // Watch exits 1 on regression detection for this fixture.
+  if (exitCode !== 1) {
+    throw new Error(`watch expected exit=1 for regression fixture, got ${exitCode}: ${stderr.slice(0, 300)}`);
+  }
+  let parsed: any;
+  try {
+    parsed = JSON.parse(stdout);
+  } catch {
+    throw new Error(`watch stdout is not valid JSON: ${stdout.slice(0, 300)}`);
+  }
+  if (parsed?.regression_detected !== true) {
+    throw new Error("watch output missing regression_detected=true");
+  }
+  console.log(`  [watch] exit=${exitCode}`);
+  return { exitCode, regression_detected: parsed.regression_detected };
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sandbox/docker/run-with-llm.ts` around lines 225 - 241, testWatch
currently doesn't validate the watch command result so failures can be missed;
update the test in run-with-llm.ts (the block using runSelftune and returned
values) to assert the expected exitCode and result shape: verify exitCode equals
the expected value (e.g., 1 for regression detection), check stdout/stderr
contain the expected markers (e.g., a known "regression" or "watch" message) and
ensure the returned object includes both stdout and exitCode in the expected
types before returning; reference the runSelftune invocation and the surrounding
testWatch/runTest flow when adding these assertions.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

246-261: ⚠️ Potential issue | 🟡 Minor

License section content is misplaced.

Line 246 creates a License heading, but the actual license link is at Line 261 after the Milestones section. Move [MIT](LICENSE) directly under ## License.

Proposed fix
 ## License
+[MIT](LICENSE)
 
 ## Milestones
@@
-| v0.8 | Auto-activation, evolution memory, specialized agents, enforcement guardrails, dashboard server | Done |
-
-[MIT](LICENSE)
+| v0.8 | Auto-activation, evolution memory, specialized agents, enforcement guardrails, dashboard server | Done |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 246 - 261, The README has the "## License" heading
separated from its actual content: move the existing license link
"[MIT](LICENSE)" so it appears immediately under the "## License" heading (i.e.,
place "[MIT](LICENSE)" directly after the "## License" line) and ensure the "##
Milestones" section follows after the license entry; update the block containing
"## License", "[MIT](LICENSE)", and "## Milestones" so the heading and link are
contiguous.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ARCHITECTURE.md`:
- Around line 48-50: Update the ARCHITECTURE.md section that describes
callViaAgent() spawning `claude -p` to explicitly restrict
`--dangerously-skip-permissions` to sandbox/devcontainer use only: add a single
hard guard sentence after the existing lines stating that
`--dangerously-skip-permissions` must never be used in regular/local/production
workflows and is only permitted in isolated devcontainer/sandbox testing,
referencing callViaAgent() and the flag name verbatim to make the constraint
unambiguous.

---

Outside diff comments:
In `@README.md`:
- Around line 246-261: The README has the "## License" heading separated from
its actual content: move the existing license link "[MIT](LICENSE)" so it
appears immediately under the "## License" heading (i.e., place "[MIT](LICENSE)"
directly after the "## License" line) and ensure the "## Milestones" section
follows after the license entry; update the block containing "## License",
"[MIT](LICENSE)", and "## Milestones" so the heading and link are contiguous.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8962186 and 4877194.

📒 Files selected for processing (6)
  • ARCHITECTURE.md
  • CHANGELOG.md
  • README.md
  • docs/design-docs/index.md
  • docs/golden-principles.md
  • package.json

Comment on lines +48 to +50
│ LLM calls use callViaAgent() which spawns `claude -p` as a
│ subprocess. In devcontainer testing, this runs with
│ --dangerously-skip-permissions for unattended operation.

⚠️ Potential issue | 🟡 Minor

Constrain dangerous flag usage to sandbox-only context explicitly.

Lines 48-50 document --dangerously-skip-permissions but don’t explicitly forbid use outside isolated devcontainer/sandbox runs. Add a hard guard sentence to prevent accidental adoption in normal/local workflows.

Suggested doc hardening
 │                         LLM calls use callViaAgent() which spawns `claude -p` as a
 │                         subprocess. In devcontainer testing, this runs with
 │                         --dangerously-skip-permissions for unattended operation.
+│                         This flag is sandbox/devcontainer-test only and must never be
+│                         used in host/local developer workflows.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
│ LLM calls use callViaAgent() which spawns `claude -p` as a
│ subprocess. In devcontainer testing, this runs with
│ --dangerously-skip-permissions for unattended operation.
│ LLM calls use callViaAgent() which spawns `claude -p` as a
│ subprocess. In devcontainer testing, this runs with
│ --dangerously-skip-permissions for unattended operation.
│ This flag is sandbox/devcontainer-test only and must never be
│ used in host/local developer workflows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ARCHITECTURE.md` around lines 48 - 50, Update the ARCHITECTURE.md section
that describes callViaAgent() spawning `claude -p` to explicitly restrict
`--dangerously-skip-permissions` to sandbox/devcontainer use only: add a single
hard guard sentence after the existing lines stating that
`--dangerously-skip-permissions` must never be used in regular/local/production
workflows and is only permitted in isolated devcontainer/sandbox testing,
referencing callViaAgent() and the flag name verbatim to make the constraint
unambiguous.
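Beyond the documentation sentence, the constraint can also be made mechanical with a small wrapper guard. This is a hypothetical sketch: `SELFTUNE_SANDBOX` and `CONTAINER_MARKER` are invented names for illustration, not part of the actual harness.

```shell
#!/usr/bin/env sh
# Hypothetical guard: only emit the dangerous flag when both an explicit
# sandbox marker variable and a container indicator file are present.
build_claude_flags() {
  if [ "${SELFTUNE_SANDBOX:-}" = "1" ] && [ -f "${CONTAINER_MARKER:-/.dockerenv}" ]; then
    printf '%s' "--dangerously-skip-permissions"
  fi
}

FLAGS="$(build_claude_flags)"
echo "claude flags: ${FLAGS:-<none>}"
# The caller would then run: claude -p "$PROMPT" $FLAGS
```

Requiring both conditions means that neither an accidentally exported variable on a developer host nor a container alone is enough to enable the flag.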

@WellDunDun WellDunDun merged commit b493c3c into dev Mar 2, 2026
5 checks passed
WellDunDun added a commit that referenced this pull request Mar 2, 2026
Merges dev commits (sandbox harness PR #17, PAI Infrastructure PR #15)
into feature branch. Resolved conflicts by combining both branches'
unique additions: dev's structural improvements (stream deadlock fix,
nullglob guards, bash history, iptables corrections) with HEAD's
OpenClaw integration (ingestor, cron, platform references).

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
WellDunDun added a commit that referenced this pull request Mar 5, 2026
Bug fixes:
- Fix routing retries using wrong refiner in evolve-body.ts (#15)
- Fix baseline gate evaluating wrong skill body in evolve.ts (#19)
- Fix Gate 1 accepting routing tables with zero data rows (#20)
- Fix section heading prefix match allowing false positives (#13)

Security:
- Add ReDoS protection for regex assertions (length cap, try-catch, u flag) (#10)
- Harden YES/NO parsing to first-token extraction with punctuation stripping (#23)

Type safety & validation:
- Add type params to readJsonl calls in baseline.ts (#3)
- Validate LLM JSON shape before treating as SkillUnitTest[] (#4)
- Add JSON.parse validation in unit-test-cli.ts (#9)

Code quality:
- Replace console.warn with structured JSON logging (#5, #11)
- Sort readdirSync for deterministic traversal (#6)
- Tighten fuzzy match to prevent over-selection (#7)
- Rename evalFailures to positiveEvals for clarity (#8)
- Enforce timeout_ms via Promise.race in runUnitTest (#12)
- Enrich deploy audit entry with gates snapshot (#16)
- Add Pareto 5th dimension comment (#18)
- Extract composability CLI into composability-cli.ts (#21)
- Add missing eval files to lint-architecture.ts (#28)
- Reformat AssertionType union (#22), template literal (#14), imports (#17)
- Reformat baseline conditional (#2)

Documentation:
- Renumber scope expansion list (#24)
- Add text language to fenced code blocks (#25, #27, #30)
- Add rollback audit payload note (#26)
- Clarify --dir expects corpus root (#29)
- Make --tests optional in syntax (#31)

Tests:
- Remove unused afterEach import (#32)
- Replace ! assertion with ?. optional chaining (#33)
- Fix temp dir cleanup leak with afterAll (#34)
- Add Gate 2 status assertion (#40)
- Strengthen routing validation assertions (#41)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
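As one concrete example from the list above, the YES/NO hardening (#23) amounts to first-token extraction with punctuation stripping. A hedged sketch of that pattern, not the repo's actual implementation:

```typescript
// Parse a model reply into a boolean verdict. Only the first whitespace-
// separated token counts, with punctuation stripped, so "Yes, because..."
// parses as YES while "The answer is yes" is treated as ambiguous.
function parseYesNo(raw: string): boolean | null {
  const first = raw.trim().split(/\s+/)[0] ?? "";
  const token = first.replace(/[^a-zA-Z]/g, "").toUpperCase();
  if (token === "YES") return true;
  if (token === "NO") return false;
  return null; // ambiguous reply; caller decides how to handle it
}
```

Restricting the match to the first token prevents a verdict buried later in a chatty reply from flipping the result.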
@WellDunDun WellDunDun deleted the custom/prefix/router-1772442284029 branch March 8, 2026 12:11
@coderabbitai coderabbitai bot mentioned this pull request Mar 9, 2026
WellDunDun added a commit that referenced this pull request Mar 9, 2026
## Walkthrough

This PR adds new CLI commands (auto-grade, quickstart, hook), implements expectation auto-derivation from SKILL.md, distinguishes actual skill invocations from passive SKILL.md reads, adds an UNGRADED skill status, migrates hook checks to Claude Code settings.json, and enhances evolve CLI pre-flight validation, verbose logging, and post-run/error messaging.

## Changes

|Cohort / File(s)|Summary|
|---|---|
|**CLI Routing & Init** <br> `cli/selftune/index.ts`, `cli/selftune/init.ts`|Adds `auto-grade`, `quickstart`, and `hook` command routing; `installAgentFiles()` to deploy bundled agent files; expands Claude Code hook detection (4 keys) and supports nested hook entries.|
|**Evolution CLI** <br> `cli/selftune/evolution/evolve.ts`|Pre-flight checks for `--skill-path`/SKILL.md and eval-set paths, verbose config dump, enhanced post-run deployment/diagnostic messaging, and improved top-level error/troubleshooting output.|
|**Automated Grading** <br> `cli/selftune/grading/auto-grade.ts` (new), `cli/selftune/grading/grade-session.ts`|New auto-grade CLI; adds `deriveExpectationsFromSkill()` (exported) to extract up to 5 expectations from SKILL.md and uses it when explicit evals are missing; session selection now prefers `skills_invoked`.|
|**Hook / Skill Eval** <br> `cli/selftune/hooks/skill-eval.ts`, `cli/selftune/observability.ts`|Adds `hasSkillToolInvocation()` and updates `processToolUse` to set `triggered` based on detection; observability now checks Claude Code `settings.json` (flat and nested formats) instead of repo .git/hooks.|
|**Telemetry & Transcript** <br> `cli/selftune/types.ts`, `cli/selftune/utils/transcript.ts`, `cli/selftune/ingestors/claude-replay.ts`|Adds `skills_invoked` to telemetry/transcript types and populates it from tool_use entries; ingestion prefers `skills_invoked` and sets `triggered` accordingly (fallback to previous behavior).|
|**Status / Badge** <br> `cli/selftune/status.ts`, `cli/selftune/badge/badge-data.ts`|Introduces new SkillStatus value `UNGRADED` and extends BadgeData.status to include `UNGRADED`; status logic and display updated to treat skills with records but no triggered (graded) sessions as UNGRADED.|
|**Quickstart Onboarding** <br> `cli/selftune/quickstart.ts` (new)|Adds three-step quickstart flow (init, replay, status) with `suggestSkillsToEvolve()` ranking ungraded/critical/warning skills.|
|**Eval set behavior** <br> `cli/selftune/eval/hooks-to-evals.ts`|`buildEvalSet` now filters positives/negatives to only include records where `triggered === true`.|
|**Templates & Docs** <br> `templates/*-settings.json`, `skill/Workflows/*.md`|Replaces hardcoded path-based hook invocations with `npx selftune hook <name>`; updates Evolve workflow docs to require AskUserQuestion for user inputs.|
|**Tests & Fixtures** <br> `tests/*` (multiple)|Adds/updates tests for derived expectations, skills_invoked precedence, triggered=true/false cases, observability hook formats, UNGRADED state, and eval positives filtering; updates SKILL.md fixtures.|
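
The `skills_invoked` extraction described above can be sketched roughly as follows; the types and field names are illustrative assumptions, not the repo's actual transcript schema.

```typescript
// Distinguish actual Skill tool invocations from passive SKILL.md reads:
// only tool_use entries whose tool is "Skill" contribute to skills_invoked.
interface TranscriptEntry {
  type: string;
  toolName?: string;
  input?: { skill?: string; file_path?: string };
}

function extractSkillsInvoked(entries: TranscriptEntry[]): string[] {
  const skills = new Set<string>();
  for (const e of entries) {
    if (e.type === "tool_use" && e.toolName === "Skill" && e.input?.skill) {
      skills.add(e.input.skill);
    }
    // A Read of .../SKILL.md deliberately does NOT count as an invocation.
  }
  return [...skills];
}
```

The ingestor would then set `triggered` from membership in this list, falling back to the previous heuristic only when the field is absent.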

## Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as auto-grade<br/>CLI
    participant Telemetry as Telemetry<br/>Log
    participant SkillMD as SKILL.md
    participant PreGates as Pre-Gates
    participant Agent as LLM<br/>Agent
    participant Output as JSON<br/>Output

    User->>CLI: auto-grade --skill X [--session-id Y]
    CLI->>Telemetry: Load telemetry log
    Telemetry-->>CLI: Sessions list
    CLI->>CLI: Resolve session (by ID or latest using skills_invoked)
    CLI->>SkillMD: Derive expectations (if needed)
    SkillMD-->>CLI: Expectations (derived or none)
    CLI->>PreGates: Run pre-gates on expectations
    alt All resolved
        PreGates-->>CLI: All expectations resolved
    else Some unresolved
        PreGates-->>CLI: Unresolved list
        CLI->>Agent: gradeViaAgent(unresolved)
        Agent-->>CLI: Graded expectations
    end
    CLI->>CLI: Compile results, metrics, summary
    CLI->>Output: Write GradingResult JSON
    Output-->>User: Summary + file path
```

```mermaid
sequenceDiagram
    participant User
    participant CLI as quickstart<br/>CLI
    participant Config as Config<br/>Check
    participant Replay as Replay<br/>Ingestor
    participant Telemetry as Telemetry<br/>Log
    participant Status as Status<br/>Engine
    participant Skills as Skills<br/>Scorer

    User->>CLI: quickstart
    CLI->>Config: Check config/marker
    alt Config missing
        CLI->>CLI: runInit()
    end
    CLI->>Replay: Check replay marker
    alt Marker missing
        Replay->>CLI: Ingest transcripts (parseTranscript -> skills_invoked)
        CLI->>Telemetry: Update logs/marker
    end
    CLI->>Status: Compute skill statuses (uses triggered & pass rates)
    Status-->>CLI: Formatted status
    CLI->>Skills: suggestSkillsToEvolve()
    Skills-->>CLI: Top ranked skills
    CLI->>User: Display suggestions
```

```mermaid
sequenceDiagram
    participant Transcript as File
    participant Parser as parseTranscript
    participant Detector as Invocation<br/>Detector
    participant Telemetry as Session<br/>Record
    participant Ingestor as claude-replay

    Parser->>Transcript: Read JSONL lines
    loop each message
        Parser->>Detector: Check tool_use entries
        alt tool_use.toolName == "Skill"
            Detector-->>Parser: Extract skill name
            Parser->>Parser: Add to skills_invoked[]
        else
            Detector-->>Parser: ignore
        end
    end
    Parser->>Telemetry: Emit TranscriptMetrics (skills_invoked)
    Ingestor->>Telemetry: Create SkillUsageRecord (use skills_invoked if present)
```

## Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

## Possibly related PRs

- **#23**: Overlaps evolve CLI changes in `cli/selftune/evolution/evolve.ts` (pre-flight, logging, messaging).
- **#13**: Related telemetry/ingestion changes around `skills_invoked` and session selection behavior.
- **#17**: Touches the same ingestor surface (`cli/selftune/ingestors/claude-replay.ts`) and skill-invocation/trigger semantics.