feat(0.24.0): in-sandbox executor — Phase 2.8 closes the 'delegate_code does real work' gap #59
Merged
Single runnable file that wires @tangle-network/agent-runtime +
@tangle-network/agent-eval + @tangle-network/agent-knowledge +
@tangle-network/sandbox into one self-improving loop.
What it shows:
1. baseline AgentProfile v0 (sandbox substrate type)
2. runMultishot across 3 personas + 1 judge (agent-eval/multishot)
3. analyst phase reads transcripts → proposes systemPrompt mutation
4. applyMutation → AgentProfile v1
5. re-run multishot with v1
6. gate compares v0 vs v1 means → ship / hold
Default mode runs offline with scripted LLM responses (reproducible demo);
TANGLE_API_KEY=... MOCK=0 runs against the real router.
Verified live:
- pnpm typecheck clean (after dev-dep bump agent-eval ^0.33.1 → ^0.38.0)
- pnpm test — 284 tests pass
- pnpm tsx examples/self-improving-loop/ produces:
v0 mean: 3.17 → v1 mean: 8.50 (delta +5.33) → gate ships v1
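The gate in step 6 compares the two means reported above. A minimal sketch of that comparison follows. The names `mean` and `gate` and the `minDelta` threshold are hypothetical, since the commit message does not show the real gate API or its shipping criterion.

```typescript
// Hypothetical names throughout: the PR text does not show the real gate API.
type GateDecision = "ship" | "hold";

/** Arithmetic mean of a non-empty list of judge scores. */
function mean(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores to average");
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

/**
 * Ship the candidate only when its mean beats the baseline by more than
 * minDelta. The threshold is an assumption made for this sketch.
 */
function gate(
  baselineMean: number,
  candidateMean: number,
  minDelta = 0,
): { delta: number; decision: GateDecision } {
  const delta = candidateMean - baselineMean;
  return { delta, decision: delta > minDelta ? "ship" : "hold" };
}

// The two means reported by the example run.
const verdict = gate(3.17, 8.5);
console.log(verdict.delta.toFixed(2), verdict.decision); // prints "5.33 ship"
```

A stricter gate would raise `minDelta` or also require a minimum number of scored runs; neither choice is confirmed by the source.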
README diagrams the substrate composition + maps each phase to its
substrate primitive. Cross-links to agent-stack-adoption skill for the
end-to-end 10-phase production runbook.
When agent-runtime-mcp is running INSIDE a sandbox whose image carries
the local coding-harness CLIs (claude / codex / opencode), delegations
now spawn the harness AS A SUBPROCESS against a git worktree on the
SAME filesystem instead of provisioning a sibling sandbox.
Why this matters:
- Zero provisioning latency (no client.create round-trip per delegation)
- Worker diffs land in-place on the caller's filesystem — no
cross-sandbox copy step
- Multi-harness fanout = N parallel subprocesses in N parallel
worktrees (claude + codex + opencode side-by-side via FanoutVote)
- Closes the long-standing "delegate_code calls a stub LLM" gap that
pre-Phase-2.8 multishot demos worked around with mocked tools
Selection (detectExecutor priority order):
AGENT_RUNTIME_IN_SANDBOX=1 → in-process executor (this PR)
TANGLE_FLEET_ID=... → fleet executor (Phase 2.5)
neither → sibling-sandbox executor (default)
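The priority order above can be sketched as a pure function over the environment. This is an illustration only: `detectExecutorKind` and the `ExecutorKind` labels are hypothetical names, and the real `detectExecutor` presumably returns an executor rather than a label.

```typescript
// Hypothetical sketch of the priority order; not the real detectExecutor.
type ExecutorKind = "in-process" | "fleet" | "sibling-sandbox";
type Env = Record<string, string | undefined>;

function detectExecutorKind(env: Env): ExecutorKind {
  // Highest priority: the in-sandbox flag wins even if a fleet id is also set.
  if (env["AGENT_RUNTIME_IN_SANDBOX"] === "1") return "in-process";
  // Next: any non-empty fleet id selects the fleet executor.
  if (env["TANGLE_FLEET_ID"]) return "fleet";
  // Default: provision a sibling sandbox.
  return "sibling-sandbox";
}

console.log(detectExecutorKind({ AGENT_RUNTIME_IN_SANDBOX: "1", TANGLE_FLEET_ID: "f-1" }));
```

Taking the environment as a parameter instead of reading `process.env` directly keeps the selection logic easy to test.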
Configuration env:
AGENT_RUNTIME_REPO_ROOT repo root (required)
AGENT_RUNTIME_LOCAL_HARNESSES csv list, default 'claude'
AGENT_RUNTIME_TEST_CMD e.g. 'pnpm test'
AGENT_RUNTIME_TYPECHECK_CMD e.g. 'pnpm typecheck'
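A sketch of how these variables might be parsed, assuming the repo root is mandatory and the harness list defaults to 'claude' as stated above. `parseInProcessEnv` and `InProcessEnvConfig` are hypothetical names, and rejecting unknown harness names at parse time is an assumption; the real code may validate later, at spawn time.

```typescript
// Hypothetical parser for the env vars listed above; names are illustrative.
type Env = Record<string, string | undefined>;

const KNOWN_HARNESSES = ["claude", "codex", "opencode"] as const;
type Harness = (typeof KNOWN_HARNESSES)[number];

interface InProcessEnvConfig {
  repoRoot: string;
  harnesses: Harness[];
  testCmd: string | undefined;
  typecheckCmd: string | undefined;
}

function isHarness(name: string): name is Harness {
  return (KNOWN_HARNESSES as readonly string[]).indexOf(name) !== -1;
}

function parseInProcessEnv(env: Env): InProcessEnvConfig {
  const repoRoot = (env["AGENT_RUNTIME_REPO_ROOT"] ?? "").trim();
  if (repoRoot.length === 0) throw new Error("AGENT_RUNTIME_REPO_ROOT is required");

  // Comma-separated list; blank entries are dropped, default is 'claude'.
  const names = (env["AGENT_RUNTIME_LOCAL_HARNESSES"] ?? "claude")
    .split(",")
    .map((n) => n.trim())
    .filter((n) => n.length > 0);

  const harnesses: Harness[] = [];
  for (const name of names) {
    // Assumption: unknown names are rejected here rather than at spawn time.
    if (!isHarness(name)) throw new Error(`unknown harness: ${name}`);
    harnesses.push(name);
  }
  if (harnesses.length === 0) harnesses.push("claude");

  return {
    repoRoot,
    harnesses,
    testCmd: env["AGENT_RUNTIME_TEST_CMD"],
    typecheckCmd: env["AGENT_RUNTIME_TYPECHECK_CMD"],
  };
}

const parsed = parseInProcessEnv({
  AGENT_RUNTIME_REPO_ROOT: "/workspace",
  AGENT_RUNTIME_LOCAL_HARNESSES: "claude, codex",
  AGENT_RUNTIME_TEST_CMD: "pnpm test",
});
console.log(parsed.harnesses.join(","), parsed.testCmd);
```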
Architecture:
client.create() returns a virtual SandboxInstance whose streamPrompt runs, in order:
1. createWorktree() — git worktree add /workspace/.coder-variants/<id>
2. yields 'in_process.harness.started' event for trace correlation
3. runLocalHarness() — spawns claude/codex/opencode subprocess (cwd=worktree)
4. yields 'in_process.harness.ended' with exit + duration + timed-out flag
5. captureWorktreeDiff() — git diff HEAD → patch + shortstat
6. runs configured testCmd + typecheckCmd against the worktree
7. yields the terminal { type: 'result', data: { result: CoderOutput } }
event the coderProfile event-parser consumes
8. removeWorktree() in finally (even on abort / harness throw)
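The eight steps above can be sketched as a generator wrapped in try/finally, so that cleanup runs even when the consumer abandons the stream. This is a simplified illustration: it is synchronous for brevity (the real streamPrompt is presumably async), the `Seams` interface and the payload fields are hypothetical, and only the event names and step order come from the description above.

```typescript
// Simplified event shapes: only the event names and step order come from the PR text.
type StreamEvent =
  | { type: "in_process.harness.started"; harness: string }
  | { type: "in_process.harness.ended"; exitCode: number; durationMs: number; timedOut: boolean }
  | { type: "result"; data: { result: { patch: string; checksPassed: boolean } } };

// Hypothetical injected seams standing in for the git, harness, and post-check helpers.
interface Seams {
  createWorktree(id: string): string;
  runHarness(cwd: string, prompt: string): { exitCode: number; durationMs: number; timedOut: boolean };
  captureDiff(cwd: string): string;
  runPostChecks(cwd: string): boolean;
  removeWorktree(cwd: string): void;
}

function* streamPrompt(
  seams: Seams,
  id: string,
  harness: string,
  prompt: string,
): Generator<StreamEvent, void, undefined> {
  const cwd = seams.createWorktree(id); // step 1
  try {
    yield { type: "in_process.harness.started", harness }; // step 2
    const run = seams.runHarness(cwd, prompt); // step 3
    yield { type: "in_process.harness.ended", ...run }; // step 4
    const patch = seams.captureDiff(cwd); // step 5
    const checksPassed = seams.runPostChecks(cwd); // step 6
    yield { type: "result", data: { result: { patch, checksPassed } } }; // step 7
  } finally {
    seams.removeWorktree(cwd); // step 8: runs on completion, throw, or early exit
  }
}

// Fake seams that record the order in which they are called.
function makeFakeSeams(calls: string[]): Seams {
  return {
    createWorktree: (id) => { calls.push("createWorktree"); return `/tmp/variants/${id}`; },
    runHarness: () => { calls.push("runHarness"); return { exitCode: 0, durationMs: 12, timedOut: false }; },
    captureDiff: () => { calls.push("captureDiff"); return "diff --git a/x b/x"; },
    runPostChecks: () => { calls.push("runPostChecks"); return true; },
    removeWorktree: () => { calls.push("removeWorktree"); },
  };
}

const fullCalls: string[] = [];
const eventTypes = [...streamPrompt(makeFakeSeams(fullCalls), "v1", "claude", "fix the bug")].map((e) => e.type);
console.log(eventTypes.join(" -> "));

// Abandoning the stream after the first event still triggers the finally block.
const abortedCalls: string[] = [];
for (const _event of streamPrompt(makeFakeSeams(abortedCalls), "v2", "codex", "refactor")) {
  break;
}
console.log(abortedCalls.join(", ")); // prints "createWorktree, removeWorktree"
```

Passing the helpers in as an object mirrors the injected-seam style the tests rely on, which lets the ordering and cleanup behavior be checked without git or a real harness CLI.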
Multi-harness rotation: pass harnesses: ['claude', 'codex', 'opencode']
to round-robin across create() calls. runLoop + FanoutVote(n: 3) gives
3-way diversity for free.
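Round-robin rotation across create() calls can be sketched with a small closure. `createHarnessRotation` is a hypothetical helper written for illustration; it is not the executor's real option handling.

```typescript
// Hypothetical helper: returns a picker that cycles through the harness list.
function createHarnessRotation(harnesses: readonly string[]): () => string {
  if (harnesses.length === 0) throw new Error("at least one harness is required");
  let next = 0;
  return () => {
    const picked = harnesses[next % harnesses.length] as string;
    next += 1;
    return picked;
  };
}

const pick = createHarnessRotation(["claude", "codex", "opencode"]);
// Four simulated create() calls wrap around to the first harness.
const picks = [pick(), pick(), pick(), pick()];
console.log(picks.join(", ")); // prints "claude, codex, opencode, claude"
```

With three harnesses and a three-way fanout, each parallel delegation lands on a different CLI, which is where the diversity comes from.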
Files:
src/mcp/local-harness.ts 264 LOC — subprocess wrappers + per-CLI invocation shape
src/mcp/worktree.ts 162 LOC — git worktree create/diff/remove helpers
src/mcp/in-process-executor.ts 286 LOC — the DelegationExecutor factory
src/mcp/bin-helpers.ts +45 LOC — detectExecutor priority + env parsing
src/mcp/index.ts +14 LOC — public surface
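The worktree helpers capture a patch plus a shortstat. As an illustration of what consuming that summary might look like, the sketch below parses the one-line output of `git diff --shortstat`. `parseShortstat` and its return shape are hypothetical; the PR does not show how the real helper represents the shortstat.

```typescript
// Hypothetical parser for the summary line printed by `git diff --shortstat`,
// e.g. " 3 files changed, 10 insertions(+), 2 deletions(-)".
interface Shortstat {
  filesChanged: number;
  insertions: number;
  deletions: number;
}

function parseShortstat(line: string): Shortstat {
  // git omits a clause entirely when its count is zero, so missing means 0.
  const count = (pattern: RegExp): number => {
    const match = pattern.exec(line);
    return match ? Number(match[1]) : 0;
  };
  return {
    filesChanged: count(/(\d+) files? changed/),
    insertions: count(/(\d+) insertions?\(\+\)/),
    deletions: count(/(\d+) deletions?\(-\)/),
  };
}

const stat = parseShortstat(" 3 files changed, 10 insertions(+), 2 deletions(-)");
console.log(stat.filesChanged, stat.insertions, stat.deletions); // prints "3 10 2"
```

An empty string (no changes in the worktree) parses to all zeros, which gives callers a simple way to detect a harness run that produced no diff.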
Tests: 19 new across 3 files
- local-harness: 7 (spawn happy path, non-zero exit, ENOENT, timeout, abort, per-CLI args, unknown harness)
- in-process-executor: 6 (event emission, harness rotation, post-checks, error notes, cleanup-on-abort, placement)
- in-process-detect: 6 (env selection, error paths, harness list, test/typecheck CMD plumbing, priority over fleet)
Existing 284 tests still pass. Total 303.
Version bump 0.23.1 → 0.24.0.
Summary
When agent-runtime-mcp is running INSIDE a sandbox whose image carries the local coding-harness CLIs (claude / codex / opencode), delegations now spawn the harness as a subprocess against a git worktree on the SAME filesystem: no sibling-sandbox provisioning, no stub LLM, and worker diffs land in-place on the caller's filesystem.
This is the architecture Drew named two days ago. Phase 2.8 closes the long-standing gap where delegate_code provisioned an external sandbox + called a stub LLM. Now it forks a real harness CLI in a real git worktree.
Why it matters
- Zero provisioning latency: no client.create() round-trip per delegation
- Multi-harness fanout: harnesses: ['claude', 'codex', 'opencode'] rotates round-robin; runLoop + FanoutVote(n: 3) gives 3-way diversity automatically
- Closes the gap where anything touching the delegate_code substrate had to be mocked or call a stub LLM
Selection (detectExecutor priority)
- AGENT_RUNTIME_IN_SANDBOX=1 → in-process (this PR)
- TANGLE_FLEET_ID=... → fleet (Phase 2.5)
- neither → sibling-sandbox (default)
Configuration env
AGENT_RUNTIME_IN_SANDBOX       1 (selects the in-process executor)
AGENT_RUNTIME_REPO_ROOT        repo root (required)
AGENT_RUNTIME_LOCAL_HARNESSES  csv list, default 'claude'
AGENT_RUNTIME_TEST_CMD         e.g. 'pnpm test'
AGENT_RUNTIME_TYPECHECK_CMD    e.g. 'pnpm typecheck'
Architecture
Tests
19 new tests across 3 files. All 303 tests pass (was 284).
- tests/mcp/local-harness.test.ts (7): spawn happy path, non-zero exit, ENOENT, timeout, abort, per-CLI args, unknown harness
- tests/mcp/in-process-executor.test.ts (6): event emission, harness rotation, post-checks, error notes, cleanup-on-abort, placement
- tests/mcp/in-process-detect.test.ts (6): env selection, error paths, harness list, test/typecheck wiring, priority over fleet
Public surface
New exports from @tangle-network/agent-runtime/mcp:
- createInProcessExecutor + InProcessExecutorOptions / InProcessExecutorDescribePlacement
- runLocalHarness + LocalHarness / LocalHarnessResult / RunLocalHarnessOptions
- createWorktree / captureWorktreeDiff / removeWorktree + their option types + GitRunner
Verified
- pnpm typecheck ✓
- pnpm build ✓
- pnpm test ✓ (303 tests)
Live smoke (deferred)
End-to-end live verification (claude/codex/opencode actually running inside a sandbox + emitting real diffs) is gated on the sandbox-provisioning fix landing in production. The substrate primitive is comprehensive and tested via injected test seams (runHarness, runGit, runPostCheck); the live smoke is the integration test once infra is unblocked.
Version bump 0.23.1 → 0.24.0.