
feat: sandbox test harness, multi-agent expansion plan, and project roadmap #17

Merged
WellDunDun merged 4 commits into dev from custom/prefix/router-1772442284029
Mar 2, 2026

Conversation

WellDunDun (Collaborator) commented Mar 2, 2026

Walkthrough

Adds a sandbox testing infrastructure with devcontainer and Docker Layer‑2 LLM tests, fixture provisioning and orchestration scripts, GitHub Actions pinning/permissions, firewall/init scripts, many docs/roadmap updates, CI/workflow changes, and small repo config edits across ~40 files.

Changes

Cohort / File(s) Summary
Devcontainer & Firewall
.devcontainer/Dockerfile, .devcontainer/devcontainer.json, .devcontainer/init-firewall.sh
New devcontainer image and config with Node 20 + Bun, installs Claude Code CLI, configures npm dirs, persists bash history, NET_ADMIN runArgs, and an iptables-based init firewall script.
Sandbox Docker Layer & Entrypoint
tests/sandbox/docker/Dockerfile, tests/sandbox/docker/docker-compose.yml, tests/sandbox/docker/entrypoint.sh
New sandbox Docker image and compose service that provisions persistent sandbox HOME, installs Bun/Claude CLI, prepares entrypoint to provision fixtures and run Layer‑2 tests.
Sandbox Orchestrators & Provisioning
tests/sandbox/run-sandbox.ts, tests/sandbox/docker/run-with-llm.ts, tests/sandbox/provision-claude.sh
Layer‑1 TypeScript runner (local CLI/hook tests), Layer‑2 devcontainer LLM test runner, and provisioning script to populate .claude/.selftune fixtures. High LOC and nontrivial result collection/validation logic.
Makefile & Scripts
Makefile, package.json
Added sandbox-related Make targets and two npm scripts (start, boot) pointing to the CLI help entry. check now depends on sandbox.
GitHub Workflows (pinning & perms)
.github/workflows/... (auto-bump-cli-version.yml, ci.yml, codeql.yml, publish.yml, scorecard.yml)
Pinned multiple actions to specific commit SHAs and introduced explicit workflow/job-level permissions (top-level {} / per-job contents/security permissions).
Repo Ownership & Ignore Rules
.github/CODEOWNERS, .gitignore, .coderabbit.yaml
Adds CODEOWNERS entries, new ignores for tests/sandbox/results/, .env*, docs/strategy/; re-enabled fixture files for review by removing exclusion.
Docs, Roadmap & Architecture
README.md, CONTRIBUTING.md, AGENTS.md, ARCHITECTURE.md, PRD.md, CHANGELOG.md, ROADMAP.md
Major documentation rewrites to describe sandbox, devcontainer testing, repositioned architecture and milestones (v0.x → M#), changelog entry for 0.2.0, and CONTRIBUTING sandbox procedures.
Design & Exec Plans
docs/design-docs/*, docs/exec-plans/*
Adds sandbox architecture and Claude Code design docs, multi-agent sandbox exec plan, and updates design-docs index and tech-debt tracker.
Policy & Product Spec Changes
risk-policy.json, docs/golden-principles.md, docs/product-specs/index.md, llms.txt
Adds docs-drift watch rule, a golden-principle import restriction for contribute/, product-spec links, and a new llms.txt reference doc.

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host<br/>(make sandbox)
    participant Tmp as Temp Sandbox<br/>(Layer 1)
    participant Project as Project<br/>Files
    participant CLI as selftune<br/>CLI
    participant Hooks as Hook<br/>Scripts
    participant Results as results/<br/>JSON

    Host->>Tmp: Create temp HOME
    Host->>Project: Copy logs, transcripts, config
    Host->>Tmp: Set HOME env

    loop Run CLI commands
        Host->>CLI: doctor, evals, status, last, dashboard, contribute
        CLI->>Tmp: Read/write using HOME
        CLI-->>Host: stdout/stderr + exit code
        Host->>Results: Record result
    end

    loop Run Hooks
        Host->>Hooks: prompt-log, skill-eval, session-stop (pipe JSON)
        Hooks->>Tmp: Append JSONL
        Hooks-->>Host: Exit + status
        Host->>Results: Record result
    end

    Host->>Results: Write timestamped JSON report
    Host->>Tmp: Cleanup (or --keep)
sequenceDiagram
    participant Dev as Devcontainer<br/>Startup
    participant FW as iptables<br/>Firewall
    participant Entrypoint as entrypoint.sh<br/>+ provision
    participant Fixtures as Fixture<br/>Files
    participant Docker as Docker<br/>Volumes
    participant Tests as run-with-llm.ts<br/>(LLM Tests)
    participant Claude as Claude Code<br/>CLI/API
    participant Results as results/<br/>JSON

    Dev->>FW: Apply iptables rules (DNS/SSH/HTTPS/HTTP)
    Dev->>Docker: Mount volumes
    Dev->>Entrypoint: Execute provisioning
    Entrypoint->>Fixtures: Copy into .claude/.selftune
    Entrypoint->>Tests: Start LLM test runner
    Tests->>Claude: Invoke `claude -p` / API calls
    Claude->>Tests: Return responses
    Tests->>Results: Validate & append results
    Tests->>Results: Write final report
    Tests-->>Dev: Exit status

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Track A of multi-agent sandbox expansion complete:
- Split sandbox-test-harness.md into sandbox-architecture.md (shared two-layer
  design) and sandbox-claude-code.md (Claude Code-specific implementation)
- Created multi-agent-sandbox.md exec plan with 5 implementation tracks (doc
  restructure, fixtures, Layer 1 coverage, Docker expansion, per-agent docs)
- Created ROADMAP.md with Done/In Progress/Planned sections, agent support
  matrix, and Skill Quality Infrastructure features (badges, auto-evolve,
  marketplace integration, conflict detection)
- Updated design docs index with new file names
- Added docs/strategy/ to .gitignore (strategy docs kept locally as private)

Includes devcontainer setup, workflow improvements, README consolidation, and
infrastructure updates for multi-agent support.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
coderabbitai bot commented Mar 2, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Sandbox test harness for end-to-end CLI testing (local and multi-agent Docker layers).
    • Added replay and contribute CLI capabilities for transcript backfill and anonymized exports.
    • New start script for CLI help.
  • Documentation

    • ROADMAP, design docs, execution plans, and contributing guide expanded with sandbox and devcontainer testing guidance.
    • README refocused to workflow-driven, testing-first narrative.
  • Infrastructure & Configuration

    • Devcontainer-based dev environment with firewall/init sequencing.
    • CI workflows and actions pinned to specific revisions; new Make targets for sandbox workflows.

Walkthrough

Adds a sandbox testing infrastructure (Layer‑1 local + Layer‑2 devcontainer/LLM), devcontainer and Docker images, fixture provisioning and orchestration scripts, init firewall, GitHub Actions pinning/permissions, numerous docs/Makefile/package edits, CODEOWNERS, and repo config updates across ~40 files.

Changes

Cohort / File(s) Summary
Devcontainer & Firewall
.devcontainer/Dockerfile, .devcontainer/devcontainer.json, .devcontainer/init-firewall.sh
New devcontainer image (node:20 + Bun), installs Claude CLI, configures npm/global dirs, persistent history/claude dirs, NET_ADMIN runtime, and adds iptables init script executed at container start.
Sandbox Docker Layer & Entrypoint
tests/sandbox/docker/Dockerfile, tests/sandbox/docker/docker-compose.yml, tests/sandbox/docker/entrypoint.sh
New sandbox Docker image and compose service: Bun + Claude CLI, persistent sandbox HOME volume, chown handling, fixture provisioning at startup, and container entrypoint that runs LLM test runner.
Sandbox Orchestrators & Provisioning
tests/sandbox/run-sandbox.ts, tests/sandbox/docker/run-with-llm.ts, tests/sandbox/provision-claude.sh
Adds Layer‑1 TypeScript sandbox runner (local CLI/hook tests), Layer‑2 TypeScript LLM test runner, and shell provisioning script to populate .claude/.selftune fixtures and transcripts; result collection and validation logic included.
Makefile & package.json
Makefile, package.json
Adds make targets: sandbox, sandbox-llm, sandbox-shell; check now depends on sandbox; new npm start script to show CLI help.
GitHub Workflows (pinning & perms)
.github/workflows/...
auto-bump-cli-version.yml, ci.yml, codeql.yml, publish.yml, scorecard.yml
Pins Actions to specific commit SHAs and introduces explicit top-level/per-job permissions (empty top-level perms + scoped per-job permissions).
Repo Ownership & Ignore Rules
.github/CODEOWNERS, .gitignore, .coderabbit.yaml
Adds CODEOWNERS entries, new ignore patterns (tests/sandbox/results/, docs/strategy/, .env*), and re-includes fixtures in review scope by removing an exclusion.
Docs, Roadmap & Architecture
README.md, CONTRIBUTING.md, AGENTS.md, ARCHITECTURE.md, PRD.md, CHANGELOG.md, ROADMAP.md
Large documentation rewrites: sandbox/devcontainer testing docs, milestone renaming (v0.x → M#), changelog 0.2.0, contributing/testing guidance, and roadmap updates.
Design & Exec Plans
docs/design-docs/*, docs/exec-plans/*
Adds sandbox-architecture and sandbox-claude-code design docs, multi-agent sandbox execution plan, and updates design-docs index and tech-debt tracker.
Policy & Product Spec Changes
risk-policy.json, docs/golden-principles.md, docs/product-specs/index.md, llms.txt
Adds docs-drift watch rule for cli/selftune/contribute/, an anti-pattern lint entry, product-spec rows, and a new llms reference file.

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host (make sandbox)
    participant Tmp as TempSandbox (Layer 1)
    participant Project as ProjectFiles
    participant CLI as selftune CLI
    participant Hooks as Hooks
    participant Results as results/ JSON

    Host->>Tmp: create temp HOME, set HOME env
    Host->>Project: copy fixtures (logs, transcripts, config)
    loop run CLI commands
        Host->>CLI: invoke commands (doctor, evals, dashboard, contribute, etc.)
        CLI->>Tmp: read/write under HOME
        CLI-->>Host: stdout/stderr + exit code
        Host->>Results: record per-command result
    end
    loop run hooks
        Host->>Hooks: invoke prompt-log, skill-eval, session-stop (pipe JSON)
        Hooks->>Tmp: append/modify JSONL logs
        Hooks-->>Host: exit + status
        Host->>Results: record hook result
    end
    Host->>Results: write timestamped report
    Host->>Tmp: cleanup (or keep)
sequenceDiagram
    participant Dev as Devcontainer Startup
    participant FW as iptables (init-firewall.sh)
    participant Docker as Docker Volumes
    participant Entrypoint as entrypoint.sh
    participant Fixtures as FixtureFiles (.claude/.selftune)
    participant Tests as run-with-llm.ts
    participant Claude as Claude Code CLI/API
    participant Results as results/ JSON

    Dev->>FW: apply iptables rules (DNS/SSH/HTTPS/HTTP)
    Dev->>Docker: mount volumes (sandbox HOME, results)
    Dev->>Entrypoint: exec provisioning script
    Entrypoint->>Fixtures: copy fixtures into HOME
    Entrypoint->>Tests: start LLM test runner
    Tests->>Claude: invoke `claude -p` / API calls
    Claude-->>Tests: responses
    Tests->>Results: validate & write final report
    Tests-->>Dev: exit status

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 69.23%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check — ✅ Passed: title follows the conventional commits format (feat:) and clearly describes the main changes: sandbox test harness, multi-agent expansion, and project roadmap.
  • Description check — ✅ Passed: description comprehensively explains changes across ~40 files with a well-organized table, sequence diagrams, and clear cohort groupings that directly relate to the changeset.


WellDunDun changed the title from "docs: sandbox design split, exec plan, and roadmap restructure" to "Sandbox test harness, multi-agent expansion plan, and project roadmap" on Mar 2, 2026
Sandbox fixtures are intentional test data that should be reviewed,
not vendored/generated files to ignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
coderabbitai bot left a comment

Actionable comments posted: 21

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.devcontainer/devcontainer.json:
- Line 10: Remove the unnecessary NET_RAW capability from the devcontainer run
arguments: edit the runArgs array to drop the "--cap-add=NET_RAW" entry so only
"--cap-add=NET_ADMIN" remains; this targets the runArgs declaration and aligns
with the firewall script (init-firewall.sh) which only requires NET_ADMIN and
does not use raw sockets or packet crafting.

In @.devcontainer/Dockerfile:
- Around line 19-23: SNIPPET is defined but never written into any shell
startup, so interactive shells don't use /commandhistory/.bash_history; update
the Dockerfile to write the SNIPPET into a startup file (e.g., create
/etc/profile.d/commandhistory.sh or append to the user's ~/.bashrc) after
creating /commandhistory and setting ownership, by echoing the export lines for
PROMPT_COMMAND and HISTFILE into that file and ensuring it has appropriate
permissions so the shell will source it for interactive sessions; reference the
SNIPPET variable, PROMPT_COMMAND, HISTFILE, /commandhistory/.bash_history and
$USERNAME when making the change so the Docker build persists shell history.
- Around line 33-34: Update both RUN lines that execute the Bun installer ("RUN
curl -fsSL https://bun.sh/install | bash" and the similar second occurrence) to
run the pipeline under a shell that has strict failure handling; prepend a
POSIX-safe strict mode (set -euo pipefail) so that a failed curl or a failing
pipeline will cause the Docker build to fail rather than silently continuing.
Apply this change to both RUN instructions that invoke the bun installer.
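The strict-mode advice above is easy to demonstrate. The sketch below is illustrative, not the repository's Dockerfile: it simulates a failing download with `false` to show why a plain `curl | bash` pipeline can mask a failure, and why `set -euo pipefail` (or Docker's `SHELL ["/bin/bash", "-o", "pipefail", "-c"]` instruction, which applies to every subsequent RUN) surfaces it.

```shell
# Without pipefail, the pipeline's exit status is the last command's,
# so a failing first command (simulating a failed curl) is silently ignored.
bash -c 'set -e; false | cat; echo "build step continues despite failure"'

# With strict mode, the failed pipeline aborts the script immediately.
bash -c 'set -euo pipefail; false | cat; echo "unreachable"' \
  || echo "pipeline failure detected"
```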

In @.devcontainer/init-firewall.sh:
- Line 11: The iptables rule added to the INPUT chain uses the output interface
flag (-o lo) which is invalid there; update the iptables invocation that
currently adds the loopback accept rule (the line running "iptables -A INPUT -o
lo -j ACCEPT") to use the input interface flag (-i lo) instead so it reads as an
INPUT-chain loopback accept; keep the same chain/action/position but replace -o
with -i to prevent the script from failing under set -e.
- Around line 7-22: Add explicit default-deny policies by setting "iptables -P
INPUT DROP", "iptables -P OUTPUT DROP", and "iptables -P FORWARD DROP" before
any -A (append) rules, then keep the existing allow rules (DNS, SSH, loopback,
ESTABLISHED,RELATED, HTTP/HTTPS) after those policies; also correct the loopback
input rule by changing the INPUT rule from using "-o lo" to "-i lo" (replace the
current "iptables -A INPUT -o lo -j ACCEPT" with "iptables -A INPUT -i lo -j
ACCEPT") so loopback traffic is properly allowed.

In `@docs/design-docs/sandbox-architecture.md`:
- Around line 41-47: Add a language tag to the fenced code block that shows the
test fixture tree (change ``` to ```text) to satisfy MD040, and insert a single
blank line after each of the headings in the same section (the headings
surrounding the fixtures block referenced near the tests/sandbox/fixtures tree)
to satisfy MD022/MD040; update the fenced block and add blank lines immediately
after those heading lines so lint no longer reports MD040/MD022.

In `@docs/design-docs/sandbox-claude-code.md`:
- Line 36: Update the docs entry that currently lists Layer 2 as
`tests/sandbox/claude-code/` to the actual location used by the Makefile:
`tests/sandbox/docker/docker-compose.yml`; change the referenced path in
docs/design-docs/sandbox-claude-code.md and any mention of Layer 2 so it matches
the Makefile targets `sandbox-llm` and `sandbox-shell` which point to
`tests/sandbox/docker/docker-compose.yml`.

In `@docs/exec-plans/active/multi-agent-sandbox.md`:
- Around line 1-3: The MD041/MD040 violations come from the verification HTML
comment placed before the H1 and an untyped code fence; move the verification
comment "<!-- Verified: 2026-03-02 -->" so it appears immediately below the H1
"# Execution Plan: Multi-Agent Sandbox Expansion", and update the directory-tree
code fence to include a language specifier (use "text") for the block that lists
tests/sandbox/; apply the same change pattern to the other similar blocks
referenced around the content (lines noted 26-43) so all first-line headings
precede comments and all fences use a language (e.g., ```text).

In `@docs/exec-plans/tech-debt-tracker.md`:
- Line 15: The TD-007 row in the debt tracker was edited but the "Updated" date
wasn't changed; update the "Updated" column for the TD-007 row (the row
mentioning deployProposal / deploy-proposal.ts and evolve.ts) to the current
audit date (e.g., 2026-03-02) so metadata remains consistent with the edit.

In `@docs/golden-principles.md`:
- Line 95: Remove the hardcoded line-number reference in the sentence and
instead point to the lint rule by name or exported identifier in
lint-architecture.ts that enforces the isolated export path (e.g., the
rule/check that enforces "contribute" not importing from hooks/ ingestors/
grading/ evolution/ monitoring). Update the docs/golden-principles.md text to
say something like "enforced by the isolated-export-path rule in
lint-architecture.ts" (or use the actual exported rule/constant/function name
from lint-architecture.ts) so the doc references the rule identifier rather than
specific line numbers.

In `@docs/product-specs/index.md`:
- Around line 10-11: The two table rows in docs/product-specs/index.md
referencing "[ICP & GTM Strategy](../strategy/icp-gtm-strategy.md)" and
"[OpenClaw Integration](../strategy/openclaw-integration.md)" point to a
gitignored/private docs/strategy directory and will 404 for others; either move
the source files into a tracked docs folder and update the links to their new
paths, or replace those two linked entries with plain text labels (e.g., "ICP &
GTM Strategy — draft" and "OpenClaw Integration — draft") until the publishable
docs exist so the index no longer links to private paths.

In `@Makefile`:
- Line 1: Add missing phony targets and minimal implementations for Makefile
compliance: update the .PHONY declaration to include "all" and "clean", add an
"all" target that depends on the existing "check" target (so running `make` runs
checks), and add a "clean" target that removes build artifacts (e.g.,
temporary/build output directories and generated files used by the repo) to
satisfy checkmake requirements; reference the .PHONY line and the "check" target
when making these changes.
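A sketch of what the requested targets could look like. The `check`, `sandbox`, `sandbox-llm`, and `sandbox-shell` target names come from this PR; the `clean` recipe path is inferred from the new results-directory ignore rule and is an assumption, not the repo's actual Makefile.

```make
.PHONY: all check clean sandbox sandbox-llm sandbox-shell

# Default goal: running plain `make` now runs the checks.
all: check

# Remove generated sandbox output (path assumed from the new ignore rules).
clean:
	rm -rf tests/sandbox/results/
```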

In `@package.json`:
- Around line 53-54: Keep a single canonical script and make the other an alias:
remove the duplicate command by preserving "start" with the real command "bun
run cli/selftune/index.ts --help" and change "boot" to run that script (e.g.,
set "boot" to "npm run start" or "bun run npm: start") so "boot" is an alias to
"start" and future drift is avoided; update the package.json "scripts" entries
for "start" and "boot" accordingly.
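One possible shape for that alias, sketched here rather than taken from the repo: `bun run start` resolves to the package's own `start` script, so `boot` stays a thin alias and the real command lives in one place.

```json
{
  "scripts": {
    "start": "bun run cli/selftune/index.ts --help",
    "boot": "bun run start"
  }
}
```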

In `@tests/sandbox/docker/Dockerfile`:
- Around line 6-7: The Dockerfile uses ARG CLAUDE_CODE_VERSION=latest which
makes builds non-reproducible; change the default to a pinned semantic version
(e.g., ARG CLAUDE_CODE_VERSION=1.2.3) and ensure any other occurrence (same ARG
at the other location referenced) is updated consistently so builds use the
fixed version unless overridden at build time; update documentation or build
scripts to pass the desired version via --build-arg CLAUDE_CODE_VERSION when
needed and add a test or note to verify the pinned value is used during sandbox
builds.

In `@tests/sandbox/docker/run-with-llm.ts`:
- Around line 143-152: The JSON.parse call for grading-result.json can throw an
opaque error if the file is malformed; wrap the read + JSON.parse (where
resultPath, readFileSync and JSON.parse are used in run-with-llm.ts) in a
try-catch, catch parse errors and throw a new Error that includes the resultPath
and the original error message (or attach it) so the failure clearly indicates
which file failed to parse and why; keep the existing existsSync check and only
proceed to parse inside the try block, rethrow or throw a new descriptive Error
on failure.
- Around line 67-76: The code currently awaits proc.exited before consuming the
process streams, which can deadlock; change the order in the function that
spawns Bun (using the proc variable) so you create promises for stdout and
stderr (e.g., const stdoutPromise = new Response(proc.stdout).text(); const
stderrPromise = new Response(proc.stderr).text();) before awaiting proc.exited,
then await proc.exited and finally await those promises (await stdoutPromise,
await stderrPromise) to return { exitCode, stdout, stderr }—update references to
proc, proc.stdout, proc.stderr, and proc.exited accordingly.
- Around line 97-105: The current code awaits proc.exited before consuming
proc.stdout/proc.stderr which can deadlock if the child blocks writing; change
the flow to start reading both streams into text Promises immediately (e.g.,
create stdoutPromise and stderrPromise from new Response(proc.stdout).text() and
new Response(proc.stderr).text()), then await Promise.race/Promise.all with
proc.exited and the timeout, and only after the exit resolve use the
already-started stdout/stderr promises (await stdoutPromise and stderrPromise)
to include output in the error message; ensure the timeout (timeout) is still
cleared when exit completes and that proc.kill() is called on timeout.

In `@tests/sandbox/provision-claude.sh`:
- Line 45: The cp of "${FIXTURES}/transcripts/*.jsonl" into
"${TARGET_HOME}/.claude/projects/default/" will fail if the glob matches
nothing; modify the script around that cp to guard the copy by first checking
for the existence of matching files (e.g., test for files matching
"${FIXTURES}/transcripts/"/*.jsonl or use shell option nullglob) and only run
the cp (or a safe loop to copy each file) when at least one transcript exists;
ensure the target directory "${TARGET_HOME}/.claude/projects/default/" is
created (mkdir -p) before copying.
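The nullglob behavior this comment relies on can be shown in isolation. A hedged sketch using a throwaway directory rather than the real fixture paths:

```shell
# An unmatched glob normally stays literal, so `cp` would receive the
# string "*.jsonl" itself and fail; with nullglob it expands to nothing,
# leaving an empty array that the guard can check.
tmpdir=$(mktemp -d)
shopt -s nullglob
transcripts=("$tmpdir"/*.jsonl)
echo "matched ${#transcripts[@]} transcript files"
shopt -u nullglob
rmdir "$tmpdir"
```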

In `@tests/sandbox/run-sandbox.ts`:
- Around line 221-229: The main function currently logs and calls setupSandbox()
but doesn't guarantee cleanup on exceptions; wrap the actionable portion of main
(everything after setupSandbox() and any early-exit logic) in a try/finally so
cleanupSandbox() in your existing cleanup block (see the cleanup logic around
the block referenced at lines 458-471) always runs; ensure any async calls
inside the try are awaited so the finally runs at the correct time, and rethrow
the error after cleanup if you want the caller to observe failures.
- Around line 53-93: The setupSandbox function duplicates provisioning steps
already implemented in provision-claude.sh (creating SANDBOX_SELFTUNE_DIR,
SANDBOX_PROJECTS_DIR, copying FIXTURES_DIR files into SANDBOX_CLAUDE_DIR and
SANDBOX_PROJECTS_DIR, and copying selftune/claude settings); replace the
duplicated logic by invoking the existing provision-claude.sh script (or
refactor the shell logic into a shared TypeScript helper) from setupSandbox so
there is a single source of truth for fixture paths and copying behavior; keep
references to the same SANDBOX_* and FIXTURES_DIR constants and ensure
setupSandbox either spawns the script (capturing errors) or calls the shared
helper function so tests and Docker provisioning use identical logic.
- Around line 210-215: The current check in fileHasNewContent (and similar
functions around lines 337-390) can miscount empty files because calling trim()
before split may yield a single empty string; change the logic to first read the
raw content, return false immediately if content === "" (or content.length ===
0), or split on "\n" without trimming and then filter out blank lines via
.filter(l => l.trim()) to compute lines.length; update fileHasNewContent to use
this pattern so first-write append checks correctly treat empty files as 0
lines.
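A minimal sketch of the counting pattern this comment suggests. The helper name is hypothetical; this is not the actual `fileHasNewContent` implementation.

```typescript
// Count only non-blank lines, so an empty file reports 0 lines instead of 1
// (splitting "" on "\n" yields [""], which naive counting treats as one line).
function countJsonlLines(content: string): number {
  if (content.length === 0) return 0;
  return content.split("\n").filter((line) => line.trim() !== "").length;
}

console.log(countJsonlLines(""));        // 0, not 1
console.log(countJsonlLines("a\nb\n"));  // 2: trailing newline ignored
console.log(countJsonlLines("\n  \n"));  // 0: blank lines ignored
```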

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between dfa8fb4 and f5dfa73.

⛔ Files ignored due to path filters (17)
  • tests/sandbox/fixtures/all_queries_log.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/claude-settings.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/evolution_audit_log.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/hook-payloads/post-tool-use.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/hook-payloads/prompt-submit.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/hook-payloads/session-stop.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/selftune-config.json is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/session_telemetry_log.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/skill_usage_log.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/skills/ai-image-generation/SKILL.md is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/skills/find-skills/SKILL.md is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/skills/frontend-design/SKILL.md is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-001.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-002.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-003.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-004.jsonl is excluded by !**/fixtures/**
  • tests/sandbox/fixtures/transcripts/session-005.jsonl is excluded by !**/fixtures/**
📒 Files selected for processing (35)
  • .devcontainer/Dockerfile
  • .devcontainer/devcontainer.json
  • .devcontainer/init-firewall.sh
  • .github/CODEOWNERS
  • .github/workflows/auto-bump-cli-version.yml
  • .github/workflows/ci.yml
  • .github/workflows/codeql.yml
  • .github/workflows/publish.yml
  • .github/workflows/scorecard.yml
  • .gitignore
  • AGENTS.md
  • ARCHITECTURE.md
  • CHANGELOG.md
  • CONTRIBUTING.md
  • Makefile
  • PRD.md
  • README.md
  • ROADMAP.md
  • docs/design-docs/index.md
  • docs/design-docs/sandbox-architecture.md
  • docs/design-docs/sandbox-claude-code.md
  • docs/exec-plans/active/multi-agent-sandbox.md
  • docs/exec-plans/completed/.gitkeep
  • docs/exec-plans/tech-debt-tracker.md
  • docs/golden-principles.md
  • docs/product-specs/index.md
  • llms.txt
  • package.json
  • risk-policy.json
  • tests/sandbox/docker/Dockerfile
  • tests/sandbox/docker/docker-compose.yml
  • tests/sandbox/docker/entrypoint.sh
  • tests/sandbox/docker/run-with-llm.ts
  • tests/sandbox/provision-claude.sh
  • tests/sandbox/run-sandbox.ts

Comment on lines +7 to +22
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -o lo -j ACCEPT

# Allow established connections
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow HTTPS (port 443) for Claude API, npm, GitHub
iptables -A OUTPUT -p tcp --dport 443 -j ACCEPT

# Allow HTTP (port 80) for package registries
iptables -A OUTPUT -p tcp --dport 80 -j ACCEPT


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd .devcontainer && ls -la | grep init && cat -n init-firewall.sh

Repository: WellDunDun/selftune



🏁 Script executed:

# Verify no DROP policies exist in the file
rg -c 'iptables -P' .devcontainer/init-firewall.sh || echo "No -P policy rules found"



Set default policies to DROP before appending allow rules.

The firewall config uses only ACCEPT rules with no default policies defined. Without explicit -P INPUT DROP, -P OUTPUT DROP, -P FORWARD DROP at the start, traffic not matching any rule still passes through default system policies (typically ACCEPT). This defeats the deny-by-default approach.

Also fix line 11: iptables -A INPUT -i lo -j ACCEPT (use -i for input interface, not -o).

Required changes
+iptables -P INPUT DROP
+iptables -P OUTPUT DROP
+iptables -P FORWARD DROP
+
 # Allow DNS, SSH, localhost
 iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
 iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
 iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
 iptables -A OUTPUT -o lo -j ACCEPT
-iptables -A INPUT -o lo -j ACCEPT
+iptables -A INPUT -i lo -j ACCEPT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.devcontainer/init-firewall.sh around lines 7 - 22, Add explicit
default-deny policies by setting "iptables -P INPUT DROP", "iptables -P OUTPUT
DROP", and "iptables -P FORWARD DROP" before any -A (append) rules, then keep
the existing allow rules (DNS, SSH, loopback, ESTABLISHED,RELATED, HTTP/HTTPS)
after those policies; also correct the loopback input rule by changing the INPUT
rule from using "-o lo" to "-i lo" (replace the current "iptables -A INPUT -o lo
-j ACCEPT" with "iptables -A INPUT -i lo -j ACCEPT") so loopback traffic is
properly allowed.

cp "${FIXTURES}/skills/ai-image-generation/SKILL.md" "${TARGET_HOME}/.claude/skills/ai-image-generation/SKILL.md"

# Session transcripts
cp "${FIXTURES}"/transcripts/*.jsonl "${TARGET_HOME}/.claude/projects/default/"

⚠️ Potential issue | 🟡 Minor

Guard transcript copy when the fixture glob is empty.

The current wildcard copy can fail hard when no .jsonl transcript files are present.

Suggested fix
 # Session transcripts
-cp "${FIXTURES}"/transcripts/*.jsonl "${TARGET_HOME}/.claude/projects/default/"
+shopt -s nullglob
+transcript_files=("${FIXTURES}"/transcripts/*.jsonl)
+if ((${#transcript_files[@]} > 0)); then
+  cp "${transcript_files[@]}" "${TARGET_HOME}/.claude/projects/default/"
+fi
+shopt -u nullglob
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-cp "${FIXTURES}"/transcripts/*.jsonl "${TARGET_HOME}/.claude/projects/default/"
+# Session transcripts
+shopt -s nullglob
+transcript_files=("${FIXTURES}"/transcripts/*.jsonl)
+if ((${#transcript_files[@]} > 0)); then
+  cp "${transcript_files[@]}" "${TARGET_HOME}/.claude/projects/default/"
+fi
+shopt -u nullglob
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sandbox/provision-claude.sh` at line 45, The cp of
"${FIXTURES}/transcripts/*.jsonl" into
"${TARGET_HOME}/.claude/projects/default/" will fail if the glob matches
nothing; modify the script around that cp to guard the copy by first checking
for the existence of matching files (e.g., test for files matching
"${FIXTURES}/transcripts/"*.jsonl or use the shell option nullglob) and only run
the cp (or a safe loop to copy each file) when at least one transcript exists;
ensure the target directory "${TARGET_HOME}/.claude/projects/default/" is
created (mkdir -p) before copying.
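
The nullglob guard recommended above can be exercised end to end in isolation. The sketch below uses throwaway temp directories in place of the real `FIXTURES` and `TARGET_HOME` paths, so the directory names are illustrative only:

```shell
#!/usr/bin/env bash
# Standalone demo of the nullglob guard: with no matching .jsonl files,
# the array expands to empty and cp is skipped, instead of cp failing
# on a literal unexpanded glob.
set -euo pipefail

src="$(mktemp -d)"   # stands in for "${FIXTURES}/transcripts"
dst="$(mktemp -d)"   # stands in for "${TARGET_HOME}/.claude/projects/default"
mkdir -p "${dst}"

shopt -s nullglob
transcript_files=("${src}"/*.jsonl)
if ((${#transcript_files[@]} > 0)); then
  cp "${transcript_files[@]}" "${dst}/"
fi
shopt -u nullglob

echo "copied=${#transcript_files[@]}"   # prints copied=0 for the empty dir
```

With an empty source directory this prints `copied=0` and exits 0; the unguarded `cp "${src}"/*.jsonl "${dst}/"` would instead fail with "No such file or directory".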

Comment on lines +53 to +93
function setupSandbox(): void {
  // Create directory structure
  mkdirSync(SANDBOX_SELFTUNE_DIR, { recursive: true });
  mkdirSync(SANDBOX_PROJECTS_DIR, { recursive: true });

  // Copy JSONL log fixtures into ~/.claude/
  const logFiles = [
    "all_queries_log.jsonl",
    "skill_usage_log.jsonl",
    "session_telemetry_log.jsonl",
    "evolution_audit_log.jsonl",
  ];
  for (const file of logFiles) {
    const src = join(FIXTURES_DIR, file);
    if (existsSync(src)) {
      copyFileSync(src, join(SANDBOX_CLAUDE_DIR, file));
    }
  }

  // Copy transcripts into ~/.claude/projects/default/ as individual session files
  const transcriptsDir = join(FIXTURES_DIR, "transcripts");
  if (existsSync(transcriptsDir)) {
    for (const file of readdirSync(transcriptsDir)) {
      if (file.endsWith(".jsonl")) {
        copyFileSync(join(transcriptsDir, file), join(SANDBOX_PROJECTS_DIR, file));
      }
    }
  }

  // Copy selftune config
  copyFileSync(
    join(FIXTURES_DIR, "selftune-config.json"),
    join(SANDBOX_SELFTUNE_DIR, "config.json"),
  );

  // Copy Claude Code settings
  copyFileSync(
    join(FIXTURES_DIR, "claude-settings.json"),
    join(SANDBOX_CLAUDE_DIR, "settings.json"),
  );
}

🧹 Nitpick | 🔵 Trivial

Consolidate sandbox provisioning to a single source of truth.

This setup duplicates tests/sandbox/provision-claude.sh logic; fixture path changes can drift between the two flows. Reusing the provision script (or a shared TS helper) will keep Layer 1 and Docker setup consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sandbox/run-sandbox.ts` around lines 53 - 93, The setupSandbox function
duplicates provisioning steps already implemented in provision-claude.sh
(creating SANDBOX_SELFTUNE_DIR, SANDBOX_PROJECTS_DIR, copying FIXTURES_DIR files
into SANDBOX_CLAUDE_DIR and SANDBOX_PROJECTS_DIR, and copying selftune/claude
settings); replace the duplicated logic by invoking the existing
provision-claude.sh script (or refactor the shell logic into a shared TypeScript
helper) from setupSandbox so there is a single source of truth for fixture paths
and copying behavior; keep references to the same SANDBOX_* and FIXTURES_DIR
constants and ensure setupSandbox either spawns the script (capturing errors) or
calls the shared helper function so tests and Docker provisioning use identical
logic.

@WellDunDun WellDunDun changed the title Sandbox test harness, multi-agent expansion plan, and project roadmap feat: sandbox test harness, multi-agent expansion plan, and project roadmap Mar 2, 2026
WellDunDun and others added 2 commits March 2, 2026 13:11
Devcontainer security:
- Remove unnecessary NET_RAW capability from runArgs
- Fix iptables INPUT chain: -o lo → -i lo for loopback
- Add default DROP policies for INPUT/FORWARD/OUTPUT chains
- Add pipefail to curl|bash Bun installs
- Append SNIPPET to .bashrc for sandbox-agent shell

Code fixes:
- Fix stream deadlock: consume stdout/stderr concurrently via Promise.all
- Add try-catch around JSON.parse for malformed grading results
- Guard transcript glob copy with nullglob/array check
- Extract countLines helper, handle empty files correctly
- Wrap main test execution in try/finally for cleanup

Documentation:
- Fix path reference: tests/sandbox/claude-code/ → tests/sandbox/docker/
- Add language tags to bare fenced code blocks
- Move HTML comment below H1 in multi-agent-sandbox.md
- Update TD-007 date to 2026-03-02
- Remove hardcoded line numbers from golden-principles.md
- Mark gitignored strategy doc links as "local only"
- Deduplicate start/boot scripts in package.json

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Conflicts resolved in ARCHITECTURE.md, CHANGELOG.md, README.md,
docs/design-docs/index.md. Kept both sides: dev's v0.8 features
(auto-activation, enforcement, memory, dashboard-server, agents)
and HEAD's sandbox test harness + devcontainer content.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (2)
tests/sandbox/run-sandbox.ts (1)

231-237: ⚠️ Potential issue | 🟠 Major

Cleanup is still not guaranteed on all failure paths.

setupSandbox() is outside the try/finally, and process.exit(...) is called inside the try. This can still leave /tmp/selftune-sandbox-* behind on early failures.

Safer control-flow pattern
 async function main(): Promise<void> {
@@
-  // Setup
-  setupSandbox();
-  console.log("Sandbox directories created and fixtures copied.\n");
-
-  const results: RawTestResult[] = [];
-
-  try {
+  const results: RawTestResult[] = [];
+  let exitCode = 1;
+  try {
+    setupSandbox();
+    console.log("Sandbox directories created and fixtures copied.\n");
@@
-    process.exit(passed === total ? 0 : 1);
+    exitCode = passed === total ? 0 : 1;
   } finally {
@@
   }
+  process.exit(exitCode);
 }
Note: in Node.js/Bun, process.exit() terminates the process immediately without unwinding the stack, so an enclosing finally block does not run — which is why the exit call is deferred until after cleanup here.

Also applies to: 449-461
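
The safer control flow can be sketched end to end with throwaway stand-ins (runTests and cleanupSandbox are illustrative names, not this PR's actual functions):

```typescript
// Sketch of the suggested pattern: cleanup in the finally block always runs
// because process.exit is deferred until after main() returns.
let cleaned = false;

function cleanupSandbox(): void {
  cleaned = true; // stands in for removing /tmp/selftune-sandbox-*
}

function runTests(): boolean {
  return true; // stands in for the real Layer-1 test loop
}

function main(): number {
  let exitCode = 1;
  try {
    exitCode = runTests() ? 0 : 1;
  } finally {
    cleanupSandbox(); // runs on success, failure, and thrown errors alike
  }
  return exitCode;
}

const code = main();
console.log(`exitCode=${code} cleaned=${cleaned}`); // prints "exitCode=0 cleaned=true"
// A real runner would now call: process.exit(code)
```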

docs/design-docs/sandbox-architecture.md (1)

61-71: ⚠️ Potential issue | 🟡 Minor

Add blank lines after subsection headings to satisfy MD022.

The past review flagged missing blank lines after the headings on lines 61, 64, and 69. This markdownlint violation (MD022) has not been resolved.

📝 Proposed fix to add blank lines
 ### 1. HOME Env Var Redirection
+
 All selftune paths go through `homedir()` in `constants.ts`. Setting `HOME=/tmp/sandbox-*` redirects everything without modifying production code.
 
 ### 2. Two-Layer Architecture
+
 - Layer 1 is free, fast (~400ms), and runs in CI
 - Layer 2 costs tokens and requires Docker, reserved for pre-release validation
 - Both share the same fixture data
 
 ### 3. Result Recording
+
 Every test run saves a JSON report to `tests/sandbox/results/` with command, exit code, stdout, stderr, duration, and pass/fail. This creates a historical record of sandbox health.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design-docs/sandbox-architecture.md` around lines 61 - 71, The MD022
violation is due to missing blank lines after the subsection headings "1. HOME
Env Var Redirection", "2. Two-Layer Architecture", and "3. Result Recording" in
sandbox-architecture.md; fix it by inserting a single blank line immediately
after each of those heading lines so each heading is followed by a blank line
before the paragraph or list that follows, ensuring the file satisfies
markdownlint MD022.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.devcontainer/Dockerfile:
- Line 5: Replace the dynamic build ARG CLAUDE_CODE_VERSION=latest with a pinned
explicit version (for example set ARG CLAUDE_CODE_VERSION=1.0.100) so
devcontainer builds are reproducible; locate the CLAUDE_CODE_VERSION ARG
declaration and any later references (the second occurrence that injects or
installs the Claude Code CLI) and update them to use the chosen fixed version
string, and document the chosen version in a comment so future bumps are
intentional.

In @.devcontainer/init-firewall.sh:
- Around line 7-27: The firewall currently only configures IPv4 via iptables,
allowing IPv6 traffic to bypass restrictions; add equivalent IPv6 rules using
ip6tables to mirror every iptables rule (accept loopback, DNS tcp/udp 53, SSH
tcp 22, established/related states, HTTPS 443, HTTP 80) and then set default
policies to DROP (INPUT, FORWARD, OUTPUT) for ip6tables as well; ensure rule
ordering matches the existing iptables block and use the same match flags (e.g.,
-m state --state ESTABLISHED,RELATED) so both iptables and ip6tables enforce the
same policy.

In `@docs/design-docs/sandbox-architecture.md`:
- Around line 91-97: The table entries for per-agent design docs are
inconsistent: "Claude Code" uses a markdown link while "Codex" and "OpenCode"
are plain text; update the rows in the table so formatting is consistent —
either convert "sandbox-codex.md" and "sandbox-opencode.md" to markdown links
like [sandbox-codex.md](sandbox-codex.md) and
[sandbox-opencode.md](sandbox-opencode.md) or remove the link syntax for
"sandbox-claude-code.md" so all three are plain text; update the entries for the
Agent and Design Doc columns in the table (e.g., the "Claude Code", "Codex",
"OpenCode" and their associated design-doc strings) and keep the Status column
unchanged.
- Around line 33-35: The docs duplicate the cost statement across two bullet
points; remove the redundant line so the "Cost" section only states "Uses
existing agent subscription — no per-call API charges." and keep the "Mechanism"
section as "Per-agent Docker containers with the agent's CLI installed. Uses
existing agent subscription — no separate API key needed." Locate and edit the
two bullets under the same heading in sandbox-architecture.md (the "Mechanism"
and "Cost" list items) and delete the repeated subscription/charges sentence
from the Cost bullet.

In `@tests/sandbox/docker/run-with-llm.ts`:
- Around line 225-241: testWatch currently doesn't validate the watch command
result so failures can be missed; update the test in run-with-llm.ts (the block
using runSelftune and returned values) to assert the expected exitCode and
result shape: verify exitCode equals the expected value (e.g., 1 for regression
detection), check stdout/stderr contain the expected markers (e.g., a known
"regression" or "watch" message) and ensure the returned object includes both
stdout and exitCode in the expected types before returning; reference the
runSelftune invocation and the surrounding testWatch/runTest flow when adding
these assertions.
- Around line 157-170: The code parses JSON into a loose Record<string, unknown>
(variable result) but then accesses result.summary.passed/result.summary.total
and result.reason/result.proposal without proper narrowing; add a type guard
(e.g., isGradingResult(obj): obj is GradingResult) or inline typeof checks
before using each property—verify result.summary is an object and typeof
result.summary.pass_rate === 'number' and typeof result.summary.passed ===
'number' and typeof result.summary.total === 'number' before logging, and verify
typeof result.reason === 'string' (or typeof result.proposal === 'string')
before using nullish-coalescing; replace unsafe property accesses with guarded
accesses or cast only after the guard so TypeScript knows the types are safe.

---

Duplicate comments:
In `@docs/design-docs/sandbox-architecture.md`:
- Around line 61-71: The MD022 violation is due to missing blank lines after the
subsection headings "1. HOME Env Var Redirection", "2. Two-Layer Architecture",
and "3. Result Recording" in sandbox-architecture.md; fix it by inserting a
single blank line immediately after each of those heading lines so each heading
is followed by a blank line before the paragraph or list that follows, ensuring
the file satisfies markdownlint MD022.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8111121 and 8962186.

📒 Files selected for processing (13)
  • .devcontainer/Dockerfile
  • .devcontainer/devcontainer.json
  • .devcontainer/init-firewall.sh
  • docs/design-docs/sandbox-architecture.md
  • docs/design-docs/sandbox-claude-code.md
  • docs/exec-plans/active/multi-agent-sandbox.md
  • docs/exec-plans/tech-debt-tracker.md
  • docs/golden-principles.md
  • docs/product-specs/index.md
  • package.json
  • tests/sandbox/docker/run-with-llm.ts
  • tests/sandbox/provision-claude.sh
  • tests/sandbox/run-sandbox.ts


ARG TZ
ENV TZ="$TZ"
ARG CLAUDE_CODE_VERSION=latest

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "Dockerfile" -path "*devcontainer*" | head -5

Repository: WellDunDun/selftune

Length of output: 89


🏁 Script executed:

cat -n .devcontainer/Dockerfile | head -60

Repository: WellDunDun/selftune

Length of output: 2243


🌐 Web query:

@anthropic-ai/claude-code npm package latest version stable

💡 Result:

The latest stable version of the @anthropic-ai/claude-code npm package is 1.0.100. [1]


Pin Claude Code CLI to an explicit version instead of latest.

Using CLAUDE_CODE_VERSION=latest makes devcontainer builds non-reproducible and can break unexpectedly when upstream publishes changes. Pin to a stable version (e.g., 1.0.100) and bump intentionally.

Affects lines 5 and 49.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.devcontainer/Dockerfile at line 5, Replace the dynamic build ARG
CLAUDE_CODE_VERSION=latest with a pinned explicit version (for example set ARG
CLAUDE_CODE_VERSION=1.0.100) so devcontainer builds are reproducible; locate the
CLAUDE_CODE_VERSION ARG declaration and any later references (the second
occurrence that injects or installs the Claude Code CLI) and update them to use
the chosen fixed version string, and document the chosen version in a comment so
future bumps are intentional.
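
Applied to the Dockerfile, the pin is a two-line change. This is a sketch, not the file's exact contents; 1.0.100 is taken from the version check above:

```dockerfile
# Pinned intentionally for reproducible builds; bump via PR rather than
# floating on `latest`.
ARG CLAUDE_CODE_VERSION=1.0.100

# ...later, where the CLI is installed:
RUN npm install -g "@anthropic-ai/claude-code@${CLAUDE_CODE_VERSION}"
```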

Comment on lines +7 to +27
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT

# Allow established connections
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow HTTPS (port 443) for Claude API, npm, GitHub
iptables -A OUTPUT -p tcp --dport 443 -j ACCEPT

# Allow HTTP (port 80) for package registries
iptables -A OUTPUT -p tcp --dport 80 -j ACCEPT

# Default deny all other traffic
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT DROP


⚠️ Potential issue | 🟠 Major

Firewall policy is IPv4-only; add IPv6 controls to avoid bypass.

Current rules apply to iptables only. If IPv6 is available, outbound traffic can bypass these restrictions.

Suggested hardening
+# Mirror policy for IPv6 (or explicitly disable IPv6 in container)
+ip6tables -P INPUT DROP
+ip6tables -P FORWARD DROP
+ip6tables -P OUTPUT DROP
+
+ip6tables -A OUTPUT -p udp --dport 53 -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 53 -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 22 -j ACCEPT
+ip6tables -A OUTPUT -o lo -j ACCEPT
+ip6tables -A INPUT -i lo -j ACCEPT
+ip6tables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
+ip6tables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 443 -j ACCEPT
+ip6tables -A OUTPUT -p tcp --dport 80 -j ACCEPT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.devcontainer/init-firewall.sh around lines 7 - 27, The firewall currently
only configures IPv4 via iptables, allowing IPv6 traffic to bypass restrictions;
add equivalent IPv6 rules using ip6tables to mirror every iptables rule (accept
loopback, DNS tcp/udp 53, SSH tcp 22, established/related states, HTTPS 443,
HTTP 80) and then set default policies to DROP (INPUT, FORWARD, OUTPUT) for
ip6tables as well; ensure rule ordering matches the existing iptables block and
use the same match flags (e.g., -m state --state ESTABLISHED,RELATED) so both
iptables and ip6tables enforce the same policy.

Comment on lines +33 to +35
**Mechanism:** Per-agent Docker containers with the agent's CLI installed. Uses existing agent subscription — no separate API key needed.

**Cost:** Uses existing agent subscription — no per-call API charges.

🧹 Nitpick | 🔵 Trivial

Remove duplicate cost statement.

Lines 33 and 35 both state that the solution uses the existing agent subscription without additional charges. The repetition is unnecessary.

✂️ Proposed fix to remove duplication
 **Mechanism:** Per-agent Docker containers with the agent's CLI installed. Uses existing agent subscription — no separate API key needed.
-
-**Cost:** Uses existing agent subscription — no per-call API charges.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design-docs/sandbox-architecture.md` around lines 33 - 35, The docs
duplicate the cost statement across two bullet points; remove the redundant line
so the "Cost" section only states "Uses existing agent subscription — no
per-call API charges." and keep the "Mechanism" section as "Per-agent Docker
containers with the agent's CLI installed. Uses existing agent subscription — no
separate API key needed." Locate and edit the two bullets under the same heading
in sandbox-architecture.md (the "Mechanism" and "Cost" list items) and delete
the repeated subscription/charges sentence from the Cost bullet.

Comment on lines +91 to +97
## Per-Agent Design Docs

| Agent | Design Doc | Status |
|-------|-----------|--------|
| Claude Code | [sandbox-claude-code.md](sandbox-claude-code.md) | Implemented |
| Codex | sandbox-codex.md | Planned |
| OpenCode | sandbox-opencode.md | Planned |

🧹 Nitpick | 🔵 Trivial

Consider consistent formatting for planned design docs.

Line 95 uses a markdown link for the implemented Claude Code doc, while lines 96-97 show plain text for planned docs. For consistency, either use markdown links for all (even if they don't resolve yet) or use plain text uniformly with status indicating availability.

🔗 Option: Use consistent markdown links
 | Agent | Design Doc | Status |
 |-------|-----------|--------|
 | Claude Code | [sandbox-claude-code.md](sandbox-claude-code.md) | Implemented |
-| Codex | sandbox-codex.md | Planned |
-| OpenCode | sandbox-opencode.md | Planned |
+| Codex | [sandbox-codex.md](sandbox-codex.md) | Planned |
+| OpenCode | [sandbox-opencode.md](sandbox-opencode.md) | Planned |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/design-docs/sandbox-architecture.md` around lines 91 - 97, The table
entries for per-agent design docs are inconsistent: "Claude Code" uses a
markdown link while "Codex" and "OpenCode" are plain text; update the rows in
the table so formatting is consistent — either convert "sandbox-codex.md" and
"sandbox-opencode.md" to markdown links like
[sandbox-codex.md](sandbox-codex.md) and
[sandbox-opencode.md](sandbox-opencode.md) or remove the link syntax for
"sandbox-claude-code.md" so all three are plain text; update the entries for the
Agent and Design Doc columns in the table (e.g., the "Claude Code", "Codex",
"OpenCode" and their associated design-doc strings) and keep the Status column
unchanged.

Comment on lines +157 to +170
  let result: Record<string, unknown>;
  try {
    result = JSON.parse(raw);
  } catch {
    throw new Error(`grading-result.json is not valid JSON: ${raw.slice(0, 300)}`);
  }

  if (!result.summary || typeof result.summary.pass_rate !== "number") {
    throw new Error("Grading result missing summary.pass_rate");
  }

  console.log(
    `  [grade] ${result.summary.passed}/${result.summary.total} (${Math.round(result.summary.pass_rate * 100)}%)`,
  );

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== TypeScript configuration ==="
fd '^tsconfig.*\.json$' --type f | head -5 | while read f; do
  echo "--- $f ---"
  cat "$f" | grep -E '"strict"|"noImplicitAny"|"strictNullChecks"|"noUncheckedIndexedAccess"' || true
done

echo
echo "=== Full context around lines 157-170 and 200-215 ==="
wc -l tests/sandbox/docker/run-with-llm.ts

echo
echo "=== Lines 150-220 (full context) ==="
sed -n '150,220p' tests/sandbox/docker/run-with-llm.ts

Repository: WellDunDun/selftune

Length of output: 2561


Remove unsafe property access on unknown types without proper narrowing.

Under strict TypeScript, accessing result.summary.passed and result.summary.total (line 165) without type guards violates type safety. The check !result.summary confirms existence but doesn't narrow the property types to non-unknown.

Line 208 similarly accesses result.reason (in the nullish coalescing) without establishing its type—the truthiness check !result.reason && !result.proposal doesn't narrow the type of .reason itself.

Type-guard or narrow each accessed property:

  • typeof result.summary?.passed === 'number' before use
  • typeof result.reason === 'string' before accessing

Same applies to line 200–215 region.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sandbox/docker/run-with-llm.ts` around lines 157 - 170, The code parses
JSON into a loose Record<string, unknown> (variable result) but then accesses
result.summary.passed/result.summary.total and result.reason/result.proposal
without proper narrowing; add a type guard (e.g., isGradingResult(obj): obj is
GradingResult) or inline typeof checks before using each property—verify
result.summary is an object and typeof result.summary.pass_rate === 'number' and
typeof result.summary.passed === 'number' and typeof result.summary.total ===
'number' before logging, and verify typeof result.reason === 'string' (or typeof
result.proposal === 'string') before using nullish-coalescing; replace unsafe
property accesses with guarded accesses or cast only after the guard so
TypeScript knows the types are safe.
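
A type guard of the shape the prompt describes might look like the sketch below. The GradingResult interface is an assumption reconstructed from the fields the review mentions (summary.pass_rate/passed/total, reason, proposal), not the project's actual type:

```typescript
// Hypothetical types mirroring the fields named in the review.
interface GradingSummary {
  pass_rate: number;
  passed: number;
  total: number;
}
interface GradingResult {
  summary: GradingSummary;
  reason?: string;
  proposal?: string;
}

// User-defined type guard: after this returns true, TypeScript narrows
// the value so summary fields can be accessed without casts.
function isGradingResult(obj: unknown): obj is GradingResult {
  if (typeof obj !== "object" || obj === null) return false;
  const summary = (obj as { summary?: unknown }).summary;
  if (typeof summary !== "object" || summary === null) return false;
  const s = summary as Record<string, unknown>;
  return (
    typeof s.pass_rate === "number" &&
    typeof s.passed === "number" &&
    typeof s.total === "number"
  );
}

const parsed: unknown = JSON.parse(
  '{"summary":{"pass_rate":0.9,"passed":9,"total":10}}',
);
if (!isGradingResult(parsed)) {
  throw new Error("grading-result.json has unexpected shape");
}
console.log(`${parsed.summary.passed}/${parsed.summary.total}`); // prints "9/10"
```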

Comment on lines +225 to +241
  const { exitCode, stdout, stderr } = await runSelftune([
    "watch",
    "--skill",
    "find-skills",
    "--skill-path",
    skillPath,
    "--window",
    "20",
    "--threshold",
    "0.1",
  ]);

  // Watch exits 1 on regression detection (expected for test data)
  console.log(`  [watch] exit=${exitCode}`);
  console.log(`  [watch] ${stdout.slice(0, 200)}`);
  return { stdout: stdout.slice(0, 1000), exitCode };
}

⚠️ Potential issue | 🟠 Major

watch test can pass on unexpected failures.

testWatch never asserts expected exitCode or result shape, so runTest marks it PASS as long as no exception is thrown. This can hide regressions in watch.

Tighten validation for watch expectations
 async function testWatch(): Promise<unknown> {
@@
-  // Watch exits 1 on regression detection (expected for test data)
-  console.log(`  [watch] exit=${exitCode}`);
-  console.log(`  [watch] ${stdout.slice(0, 200)}`);
-  return { stdout: stdout.slice(0, 1000), exitCode };
+  // Watch exits 1 on regression detection for this fixture.
+  if (exitCode !== 1) {
+    throw new Error(`watch expected exit=1 for regression fixture, got ${exitCode}: ${stderr.slice(0, 300)}`);
+  }
+  let parsed: any;
+  try {
+    parsed = JSON.parse(stdout);
+  } catch {
+    throw new Error(`watch stdout is not valid JSON: ${stdout.slice(0, 300)}`);
+  }
+  if (parsed?.regression_detected !== true) {
+    throw new Error("watch output missing regression_detected=true");
+  }
+  console.log(`  [watch] exit=${exitCode}`);
+  return { exitCode, regression_detected: parsed.regression_detected };
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sandbox/docker/run-with-llm.ts` around lines 225 - 241, testWatch
currently doesn't validate the watch command result so failures can be missed;
update the test in run-with-llm.ts (the block using runSelftune and returned
values) to assert the expected exitCode and result shape: verify exitCode equals
the expected value (e.g., 1 for regression detection), check stdout/stderr
contain the expected markers (e.g., a known "regression" or "watch" message) and
ensure the returned object includes both stdout and exitCode in the expected
types before returning; reference the runSelftune invocation and the surrounding
testWatch/runTest flow when adding these assertions.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
README.md (1)

246-261: ⚠️ Potential issue | 🟡 Minor

License section content is misplaced.

Line 246 creates a License heading, but the actual license link is at Line 261 after the Milestones section. Move [MIT](LICENSE) directly under ## License.

Proposed fix
 ## License
+[MIT](LICENSE)
 
 ## Milestones
@@
-| v0.8 | Auto-activation, evolution memory, specialized agents, enforcement guardrails, dashboard server | Done |
-
-[MIT](LICENSE)
+| v0.8 | Auto-activation, evolution memory, specialized agents, enforcement guardrails, dashboard server | Done |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 246 - 261, The README has the "## License" heading
separated from its actual content: move the existing license link
"[MIT](LICENSE)" so it appears immediately under the "## License" heading (i.e.,
place "[MIT](LICENSE)" directly after the "## License" line) and ensure the "##
Milestones" section follows after the license entry; update the block containing
"## License", "[MIT](LICENSE)", and "## Milestones" so the heading and link are
contiguous.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ARCHITECTURE.md`:
- Around line 48-50: Update the ARCHITECTURE.md section that describes
callViaAgent() spawning `claude -p` to explicitly restrict
`--dangerously-skip-permissions` to sandbox/devcontainer use only: add a single
hard guard sentence after the existing lines stating that
`--dangerously-skip-permissions` must never be used in regular/local/production
workflows and is only permitted in isolated devcontainer/sandbox testing,
referencing callViaAgent() and the flag name verbatim to make the constraint
unambiguous.

---

Outside diff comments:
In `@README.md`:
- Around line 246-261: The README has the "## License" heading separated from
its actual content: move the existing license link "[MIT](LICENSE)" so it
appears immediately under the "## License" heading (i.e., place "[MIT](LICENSE)"
directly after the "## License" line) and ensure the "## Milestones" section
follows after the license entry; update the block containing "## License",
"[MIT](LICENSE)", and "## Milestones" so the heading and link are contiguous.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8962186 and 4877194.

📒 Files selected for processing (6)
  • ARCHITECTURE.md
  • CHANGELOG.md
  • README.md
  • docs/design-docs/index.md
  • docs/golden-principles.md
  • package.json

Comment on lines +48 to +50
│ LLM calls use callViaAgent() which spawns `claude -p` as a
│ subprocess. In devcontainer testing, this runs with
│ --dangerously-skip-permissions for unattended operation.

⚠️ Potential issue | 🟡 Minor

Constrain dangerous flag usage to sandbox-only context explicitly.

Lines 48-50 document --dangerously-skip-permissions but don’t explicitly forbid use outside isolated devcontainer/sandbox runs. Add a hard guard sentence to prevent accidental adoption in normal/local workflows.

Suggested doc hardening
 │                         LLM calls use callViaAgent() which spawns `claude -p` as a
 │                         subprocess. In devcontainer testing, this runs with
 │                         --dangerously-skip-permissions for unattended operation.
+│                         This flag is sandbox/devcontainer-test only and must never be
+│                         used in host/local developer workflows.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
│ LLM calls use callViaAgent() which spawns `claude -p` as a
│ subprocess. In devcontainer testing, this runs with
│ --dangerously-skip-permissions for unattended operation.
│ LLM calls use callViaAgent() which spawns `claude -p` as a
│ subprocess. In devcontainer testing, this runs with
│ --dangerously-skip-permissions for unattended operation.
│ This flag is sandbox/devcontainer-test only and must never be
│ used in host/local developer workflows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ARCHITECTURE.md` around lines 48 - 50, Update the ARCHITECTURE.md section
that describes callViaAgent() spawning `claude -p` to explicitly restrict
`--dangerously-skip-permissions` to sandbox/devcontainer use only: add a single
hard guard sentence after the existing lines stating that
`--dangerously-skip-permissions` must never be used in regular/local/production
workflows and is only permitted in isolated devcontainer/sandbox testing,
referencing callViaAgent() and the flag name verbatim to make the constraint
unambiguous.
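Beyond the documentation sentence, the constraint can also be made mechanical with a small wrapper guard. This is a hypothetical sketch: `SELFTUNE_SANDBOX` and `CONTAINER_MARKER` are invented names for illustration, not part of the actual harness.

```shell
#!/usr/bin/env sh
# Hypothetical guard: only emit the dangerous flag when both an explicit
# sandbox marker variable and a container indicator file are present.
build_claude_flags() {
  if [ "${SELFTUNE_SANDBOX:-}" = "1" ] && [ -f "${CONTAINER_MARKER:-/.dockerenv}" ]; then
    printf '%s' "--dangerously-skip-permissions"
  fi
}

FLAGS="$(build_claude_flags)"
echo "claude flags: ${FLAGS:-<none>}"
# The caller would then run: claude -p "$PROMPT" $FLAGS
```

Requiring both conditions means that neither an accidentally exported variable on a developer host nor a container alone is enough to enable the flag.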

@WellDunDun WellDunDun merged commit b493c3c into dev Mar 2, 2026
5 checks passed
WellDunDun added a commit that referenced this pull request Mar 2, 2026
Merges dev commits (sandbox harness PR #17, PAI Infrastructure PR #15)
into feature branch. Resolved conflicts by combining both branches'
unique additions: dev's structural improvements (stream deadlock fix,
nullglob guards, bash history, iptables corrections) with HEAD's
OpenClaw integration (ingestor, cron, platform references).

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
WellDunDun added a commit that referenced this pull request Mar 5, 2026
Bug fixes:
- Fix routing retries using wrong refiner in evolve-body.ts (#15)
- Fix baseline gate evaluating wrong skill body in evolve.ts (#19)
- Fix Gate 1 accepting routing tables with zero data rows (#20)
- Fix section heading prefix match allowing false positives (#13)

Security:
- Add ReDoS protection for regex assertions (length cap, try-catch, u flag) (#10)
- Harden YES/NO parsing to first-token extraction with punctuation stripping (#23)

Type safety & validation:
- Add type params to readJsonl calls in baseline.ts (#3)
- Validate LLM JSON shape before treating as SkillUnitTest[] (#4)
- Add JSON.parse validation in unit-test-cli.ts (#9)

Code quality:
- Replace console.warn with structured JSON logging (#5, #11)
- Sort readdirSync for deterministic traversal (#6)
- Tighten fuzzy match to prevent over-selection (#7)
- Rename evalFailures to positiveEvals for clarity (#8)
- Enforce timeout_ms via Promise.race in runUnitTest (#12)
- Enrich deploy audit entry with gates snapshot (#16)
- Add Pareto 5th dimension comment (#18)
- Extract composability CLI into composability-cli.ts (#21)
- Add missing eval files to lint-architecture.ts (#28)
- Reformat AssertionType union (#22), template literal (#14), imports (#17)
- Reformat baseline conditional (#2)

Documentation:
- Renumber scope expansion list (#24)
- Add text language to fenced code blocks (#25, #27, #30)
- Add rollback audit payload note (#26)
- Clarify --dir expects corpus root (#29)
- Make --tests optional in syntax (#31)

Tests:
- Remove unused afterEach import (#32)
- Replace ! assertion with ?. optional chaining (#33)
- Fix temp dir cleanup leak with afterAll (#34)
- Add Gate 2 status assertion (#40)
- Strengthen routing validation assertions (#41)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
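As one concrete example from the list above, the YES/NO hardening (#23) amounts to first-token extraction with punctuation stripping. A hedged sketch of that pattern, not the repo's actual implementation:

```typescript
// Parse a model reply into a boolean verdict. Only the first whitespace-
// separated token counts, with punctuation stripped, so "Yes, because..."
// parses as YES while "The answer is yes" is treated as ambiguous.
function parseYesNo(raw: string): boolean | null {
  const first = raw.trim().split(/\s+/)[0] ?? "";
  const token = first.replace(/[^a-zA-Z]/g, "").toUpperCase();
  if (token === "YES") return true;
  if (token === "NO") return false;
  return null; // ambiguous reply; caller decides how to handle it
}
```

Restricting the match to the first token prevents a verdict buried later in a chatty reply from flipping the result.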
@WellDunDun WellDunDun deleted the custom/prefix/router-1772442284029 branch March 8, 2026 12:11
@coderabbitai coderabbitai bot mentioned this pull request Mar 9, 2026
WellDunDun added a commit that referenced this pull request Mar 9, 2026
## Walkthrough

This PR adds new CLI commands (auto-grade, quickstart, hook), implements expectation auto-derivation from SKILL.md, distinguishes actual skill invocations from passive SKILL.md reads, adds an UNGRADED skill status, migrates hook checks to Claude Code settings.json, and enhances evolve CLI pre-flight validation, verbose logging, and post-run/error messaging.

## Changes

|Cohort / File(s)|Summary|
|---|---|
|**CLI Routing & Init** <br> `cli/selftune/index.ts`, `cli/selftune/init.ts`|Adds `auto-grade`, `quickstart`, and `hook` command routing; `installAgentFiles()` to deploy bundled agent files; expands Claude Code hook detection (4 keys) and supports nested hook entries.|
|**Evolution CLI** <br> `cli/selftune/evolution/evolve.ts`|Pre-flight checks for `--skill-path`/SKILL.md and eval-set paths, verbose config dump, enhanced post-run deployment/diagnostic messaging, and improved top-level error/troubleshooting output.|
|**Automated Grading** <br> `cli/selftune/grading/auto-grade.ts` (new), `cli/selftune/grading/grade-session.ts`|New auto-grade CLI; adds `deriveExpectationsFromSkill()` (exported) to extract up to 5 expectations from SKILL.md and uses it when explicit evals are missing; session selection now prefers `skills_invoked`.|
|**Hook / Skill Eval** <br> `cli/selftune/hooks/skill-eval.ts`, `cli/selftune/observability.ts`|Adds `hasSkillToolInvocation()` and updates `processToolUse` to set `triggered` based on detection; observability now checks Claude Code `settings.json` (flat and nested formats) instead of repo .git/hooks.|
|**Telemetry & Transcript** <br> `cli/selftune/types.ts`, `cli/selftune/utils/transcript.ts`, `cli/selftune/ingestors/claude-replay.ts`|Adds `skills_invoked` to telemetry/transcript types and populates it from tool_use entries; ingestion prefers `skills_invoked` and sets `triggered` accordingly (fallback to previous behavior).|
|**Status / Badge** <br> `cli/selftune/status.ts`, `cli/selftune/badge/badge-data.ts`|Introduces new SkillStatus value `UNGRADED` and extends BadgeData.status to include `UNGRADED`; status logic and display updated to treat skills with records but no triggered (graded) sessions as UNGRADED.|
|**Quickstart Onboarding** <br> `cli/selftune/quickstart.ts` (new)|Adds three-step quickstart flow (init, replay, status) with `suggestSkillsToEvolve()` ranking ungraded/critical/warning skills.|
|**Eval set behavior** <br> `cli/selftune/eval/hooks-to-evals.ts`|`buildEvalSet` now filters positives/negatives to only include records where `triggered === true`.|
|**Templates & Docs** <br> `templates/*-settings.json`, `skill/Workflows/*.md`|Replaces hardcoded path-based hook invocations with `npx selftune hook <name>`; updates Evolve workflow docs to require AskUserQuestion for user inputs.|
|**Tests & Fixtures** <br> `tests/*` (multiple)|Adds/updates tests for derived expectations, skills_invoked precedence, triggered=true/false cases, observability hook formats, UNGRADED state, and eval positives filtering; updates SKILL.md fixtures.|
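
The `skills_invoked` extraction described above can be sketched roughly as follows; the types and field names are illustrative assumptions, not the repo's actual transcript schema.

```typescript
// Distinguish actual Skill tool invocations from passive SKILL.md reads:
// only tool_use entries whose tool is "Skill" contribute to skills_invoked.
interface TranscriptEntry {
  type: string;
  toolName?: string;
  input?: { skill?: string; file_path?: string };
}

function extractSkillsInvoked(entries: TranscriptEntry[]): string[] {
  const skills = new Set<string>();
  for (const e of entries) {
    if (e.type === "tool_use" && e.toolName === "Skill" && e.input?.skill) {
      skills.add(e.input.skill);
    }
    // A Read of .../SKILL.md deliberately does NOT count as an invocation.
  }
  return [...skills];
}
```

The ingestor would then set `triggered` from membership in this list, falling back to the previous heuristic only when the field is absent.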

## Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as auto-grade<br/>CLI
    participant Telemetry as Telemetry<br/>Log
    participant SkillMD as SKILL.md
    participant PreGates as Pre-Gates
    participant Agent as LLM<br/>Agent
    participant Output as JSON<br/>Output

    User->>CLI: auto-grade --skill X [--session-id Y]
    CLI->>Telemetry: Load telemetry log
    Telemetry-->>CLI: Sessions list
    CLI->>CLI: Resolve session (by ID or latest using skills_invoked)
    CLI->>SkillMD: Derive expectations (if needed)
    SkillMD-->>CLI: Expectations (derived or none)
    CLI->>PreGates: Run pre-gates on expectations
    alt All resolved
        PreGates-->>CLI: All expectations resolved
    else Some unresolved
        PreGates-->>CLI: Unresolved list
        CLI->>Agent: gradeViaAgent(unresolved)
        Agent-->>CLI: Graded expectations
    end
    CLI->>CLI: Compile results, metrics, summary
    CLI->>Output: Write GradingResult JSON
    Output-->>User: Summary + file path
```

```mermaid
sequenceDiagram
    participant User
    participant CLI as quickstart<br/>CLI
    participant Config as Config<br/>Check
    participant Replay as Replay<br/>Ingestor
    participant Telemetry as Telemetry<br/>Log
    participant Status as Status<br/>Engine
    participant Skills as Skills<br/>Scorer

    User->>CLI: quickstart
    CLI->>Config: Check config/marker
    alt Config missing
        CLI->>CLI: runInit()
    end
    CLI->>Replay: Check replay marker
    alt Marker missing
        Replay->>CLI: Ingest transcripts (parseTranscript -> skills_invoked)
        CLI->>Telemetry: Update logs/marker
    end
    CLI->>Status: Compute skill statuses (uses triggered & pass rates)
    Status-->>CLI: Formatted status
    CLI->>Skills: suggestSkillsToEvolve()
    Skills-->>CLI: Top ranked skills
    CLI->>User: Display suggestions
```

```mermaid
sequenceDiagram
    participant Transcript as File
    participant Parser as parseTranscript
    participant Detector as Invocation<br/>Detector
    participant Telemetry as Session<br/>Record
    participant Ingestor as claude-replay

    Parser->>Transcript: Read JSONL lines
    loop each message
        Parser->>Detector: Check tool_use entries
        alt tool_use.toolName == "Skill"
            Detector-->>Parser: Extract skill name
            Parser->>Parser: Add to skills_invoked[]
        else
            Detector-->>Parser: ignore
        end
    end
    Parser->>Telemetry: Emit TranscriptMetrics (skills_invoked)
    Ingestor->>Telemetry: Create SkillUsageRecord (use skills_invoked if present)
```

## Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

## Possibly related PRs

- **#23**: Overlaps evolve CLI changes in `cli/selftune/evolution/evolve.ts` (pre-flight, logging, messaging).
- **#13**: Related telemetry/ingestion changes around `skills_invoked` and session selection behavior.
- **#17**: Touches the same ingestor surface (`cli/selftune/ingestors/claude-replay.ts`) and skill-invocation/trigger semantics.