feat(0.24.0): in-sandbox executor — Phase 2.8 closes the 'delegate_code does real work' gap by drewstone · Pull Request #59 · tangle-network/agent-runtime

drewstone · 2026-05-25T11:12:01Z

Summary

When agent-runtime-mcp is running INSIDE a sandbox whose image carries the local coding-harness CLIs (claude / codex / opencode), delegations now spawn the harness as a subprocess against a git worktree on the SAME filesystem — no sibling-sandbox provisioning, no stub LLM, worker diffs land in-place on the caller's filesystem.

This is the architecture Drew named two days ago. Phase 2.8 closes the long-standing gap where delegate_code provisioned an external sandbox + called a stub LLM. Now it forks a real harness CLI in a real git worktree.

Why it matters

Zero provisioning latency — no client.create() round-trip per delegation
Worker diffs in-place — no cross-sandbox copy step
Multi-harness fanout for free — harnesses: ['claude', 'codex', 'opencode'] rotates round-robin; runLoop + FanoutVote(n: 3) gives 3-way diversity automatically
Closes the multishot stub gap — pre-2.8 the delegate_code substrate had to be mocked or call a stub LLM
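
As a rough illustration of the round-robin rotation mentioned above, the sketch below hands out one harness per call and wraps around at the end of the list. `createHarnessRotation` and the `Harness` type are hypothetical names chosen for this example; the PR does not show the executor's actual rotation code.

```typescript
// Hypothetical helper: illustrates round-robin selection only.
// The names here are assumptions, not the package's exported API.
type Harness = 'claude' | 'codex' | 'opencode';

function createHarnessRotation(harnesses: readonly Harness[]): () => Harness {
  if (harnesses.length === 0) {
    throw new Error('at least one harness is required');
  }
  let next = 0;
  return () => {
    // Wrap around once every configured harness has been handed out.
    const harness = harnesses[next % harnesses.length];
    next += 1;
    return harness;
  };
}

// Each simulated create() call takes the next harness in the list.
const pickHarness = createHarnessRotation(['claude', 'codex', 'opencode']);
const firstFour = [pickHarness(), pickHarness(), pickHarness(), pickHarness()];
```

Under this sketch, three consecutive create() calls (as a 3-way fanout would issue) each land on a different harness, which is where the diversity would come from.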

Selection (detectExecutor priority)

AGENT_RUNTIME_IN_SANDBOX=1 → in-process (this PR)
TANGLE_FLEET_ID=... → fleet (Phase 2.5)
neither → sibling-sandbox (default)
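
The priority order above can be read as a simple first-match chain. The following is a minimal sketch of that ordering, assuming the environment is passed in as a plain object; `selectExecutorKind` is a hypothetical name, and the real detectExecutor in src/mcp/bin-helpers.ts may differ in shape and return type.

```typescript
// Hypothetical sketch of the selection order described in this PR.
// It returns a label rather than a real executor instance.
type ExecutorKind = 'in-process' | 'fleet' | 'sibling-sandbox';
type Env = Record<string, string | undefined>;

function selectExecutorKind(env: Env): ExecutorKind {
  // 1. In-sandbox flag wins, even when a fleet id is also present.
  if (env.AGENT_RUNTIME_IN_SANDBOX === '1') return 'in-process';
  // 2. Otherwise a non-empty fleet id selects the fleet executor.
  if (env.TANGLE_FLEET_ID) return 'fleet';
  // 3. With neither set, fall back to the sibling-sandbox default.
  return 'sibling-sandbox';
}
```

Taking the environment as a parameter instead of reading a global keeps the selection logic easy to exercise in unit tests.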

Configuration env

| Env | Default | Purpose |
| --- | --- | --- |
| `AGENT_RUNTIME_IN_SANDBOX` | unset | Selects in-process when `1` |
| `AGENT_RUNTIME_REPO_ROOT` | (required) | Workspace root for worktrees |
| `AGENT_RUNTIME_LOCAL_HARNESSES` | `claude` | CSV; round-robin across create() |
| `AGENT_RUNTIME_TEST_CMD` | unset | e.g. `pnpm test` |
| `AGENT_RUNTIME_TYPECHECK_CMD` | unset | e.g. `pnpm typecheck` |
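
To make the table concrete, here is one way the variables could be parsed into a config object. This is a sketch under stated assumptions: `parseInSandboxConfig` and `InSandboxConfig` are hypothetical names, and rejecting unknown harness names at parse time is a choice made for this example (the PR only says the tests cover error paths and the harness list).

```typescript
// Hypothetical parser for the env table above; not the package's real code.
type Env = Record<string, string | undefined>;
const KNOWN_HARNESSES = ['claude', 'codex', 'opencode'] as const;
type Harness = (typeof KNOWN_HARNESSES)[number];

interface InSandboxConfig {
  repoRoot: string;
  harnesses: Harness[];
  testCmd?: string;
  typecheckCmd?: string;
}

function isHarness(value: string): value is Harness {
  return (KNOWN_HARNESSES as readonly string[]).indexOf(value) !== -1;
}

function parseInSandboxConfig(env: Env): InSandboxConfig {
  const repoRoot = env.AGENT_RUNTIME_REPO_ROOT?.trim();
  if (!repoRoot) {
    throw new Error('AGENT_RUNTIME_REPO_ROOT is required for the in-process executor');
  }
  // CSV list; falls back to the documented default of a single 'claude' entry.
  const names = (env.AGENT_RUNTIME_LOCAL_HARNESSES ?? 'claude')
    .split(',')
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  const harnesses: Harness[] = [];
  for (const name of names) {
    // Assumption: unknown names are rejected here rather than at spawn time.
    if (!isHarness(name)) throw new Error(`unknown harness: ${name}`);
    harnesses.push(name);
  }
  if (harnesses.length === 0) harnesses.push('claude');
  return {
    repoRoot,
    harnesses,
    // Empty strings are treated the same as unset.
    testCmd: env.AGENT_RUNTIME_TEST_CMD || undefined,
    typecheckCmd: env.AGENT_RUNTIME_TYPECHECK_CMD || undefined,
  };
}
```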

Architecture

client.create() → returns a virtual SandboxInstance whose streamPrompt:
  1. createWorktree()                       — git worktree add /workspace/.coder-variants/<id>
  2. yield 'in_process.harness.started'      — for trace correlation
  3. runLocalHarness()                      — spawn claude/codex/opencode (cwd=worktree)
  4. yield 'in_process.harness.ended'        — exit + duration + timed-out
  5. captureWorktreeDiff()                  — git diff HEAD → patch + shortstat
  6. testCmd / typecheckCmd against the worktree
  7. yield { type: 'result', data: { result: CoderOutput } }  ← coderProfile consumes this
  8. removeWorktree() in finally (even on abort / harness throw)
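
The eight steps above map naturally onto an async generator with a try/finally block. The sketch below is illustrative only: the `Seams` interface, `HarnessRun`, and the fields of the result payload are assumptions made for this example, since the PR does not show the real CoderOutput shape or the seam signatures. Only the three event type strings are taken from the description.

```typescript
// Hypothetical shapes; the real seam signatures and CoderOutput are not shown in this PR.
interface HarnessRun { exitCode: number | null; durationMs: number; timedOut: boolean }
interface Seams {
  createWorktree(id: string): Promise<string>; // resolves to the worktree path
  runHarness(cwd: string, prompt: string): Promise<HarnessRun>;
  captureDiff(cwd: string): Promise<{ patch: string; shortstat: string }>;
  runPostCheck(cmd: string, cwd: string): Promise<boolean>;
  removeWorktree(path: string): Promise<void>;
}
interface StreamEvent { type: string; data: unknown }
interface PromptOptions { id: string; prompt: string; testCmd?: string; typecheckCmd?: string }

async function* streamPromptSketch(seams: Seams, opts: PromptOptions): AsyncGenerator<StreamEvent> {
  // Step 1: the worktree is created before the try block, so a failed
  // creation does not trigger a removal of something that never existed.
  const worktree = await seams.createWorktree(opts.id);
  try {
    yield { type: 'in_process.harness.started', data: { worktree } }; // step 2
    const run = await seams.runHarness(worktree, opts.prompt); // step 3
    yield { type: 'in_process.harness.ended', data: run }; // step 4
    const diff = await seams.captureDiff(worktree); // step 5
    // Step 6: post-checks run only when a command is configured.
    const testsPassed = opts.testCmd ? await seams.runPostCheck(opts.testCmd, worktree) : undefined;
    const typecheckPassed = opts.typecheckCmd
      ? await seams.runPostCheck(opts.typecheckCmd, worktree)
      : undefined;
    // Step 7: terminal event; the fields inside `result` are placeholders.
    yield { type: 'result', data: { result: { ...diff, testsPassed, typecheckPassed } } };
  } finally {
    // Step 8: runs on normal completion, on a thrown error, and when the
    // consumer stops iterating early (for example on abort).
    await seams.removeWorktree(worktree);
  }
}
```

Putting cleanup in `finally` is what lets an early `return()` on the generator (a consumer breaking out of its `for await` loop) still remove the worktree.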

Tests

19 new tests across 3 files. All 303 tests pass (was 284).

tests/mcp/local-harness.test.ts (7) — spawn happy path, non-zero exit, ENOENT, timeout, abort, per-CLI args, unknown harness
tests/mcp/in-process-executor.test.ts (6) — event emission, harness rotation, post-checks, error notes, cleanup-on-abort, placement
tests/mcp/in-process-detect.test.ts (6) — env selection, error paths, harness list, test/typecheck wiring, priority over fleet

Public surface

New exports from @tangle-network/agent-runtime/mcp:

createInProcessExecutor + InProcessExecutorOptions / InProcessExecutorDescribePlacement
runLocalHarness + LocalHarness / LocalHarnessResult / RunLocalHarnessOptions
createWorktree / captureWorktreeDiff / removeWorktree + their option types + GitRunner

Verified

pnpm typecheck ✓
pnpm build ✓
pnpm test ✓ — 303 tests

Live smoke (deferred)

End-to-end live verification (claude/codex/opencode actually running inside a sandbox + emitting real diffs) is gated on the sandbox-provisioning fix landing in production. The substrate primitive is comprehensive + tested via injected test seams (runHarness, runGit, runPostCheck); the live smoke is the integration test once infra is unblocked.
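
To show what testing through an injected seam can look like, the sketch below routes every git invocation through a runner function so a test can substitute a recording fake. The `GitRunner` signature, the `...Sketch` function names, and the use of `--force` on removal are assumptions for illustration; the PR exports a `GitRunner` type but does not show its definition. The `.coder-variants/<id>` path and the `git diff HEAD` plus shortstat pairing come from the Architecture section.

```typescript
// Hypothetical seam: args are passed to `git`, cwd is the working directory.
type GitRunner = (args: string[], cwd: string) => Promise<string>;

async function createWorktreeSketch(git: GitRunner, repoRoot: string, id: string): Promise<string> {
  const path = `${repoRoot}/.coder-variants/${id}`;
  await git(['worktree', 'add', path], repoRoot);
  return path;
}

async function captureWorktreeDiffSketch(
  git: GitRunner,
  worktree: string,
): Promise<{ patch: string; shortstat: string }> {
  // Two invocations: the full patch, then the one-line summary.
  const patch = await git(['diff', 'HEAD'], worktree);
  const shortstat = (await git(['diff', 'HEAD', '--shortstat'], worktree)).trim();
  return { patch, shortstat };
}

async function removeWorktreeSketch(git: GitRunner, repoRoot: string, worktree: string): Promise<void> {
  // Assumption: --force, so a worktree with uncommitted changes is still removed.
  await git(['worktree', 'remove', '--force', worktree], repoRoot);
}

// A recording fake: no git binary is needed to exercise the helpers.
function createRecordingGit(log: string[]): GitRunner {
  return async (args, cwd) => {
    log.push(`${cwd} $ git ${args.join(' ')}`);
    return args.indexOf('--shortstat') !== -1
      ? ' 1 file changed, 2 insertions(+)\n'
      : 'diff --git a/x b/x\n';
  };
}
```

A production runner would wrap a real subprocess call behind the same signature, so the helpers themselves stay unchanged between tests and live use.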

Version bump 0.23.1 → 0.24.0.

Single runnable file that wires @tangle-network/agent-runtime + @tangle-network/agent-eval + @tangle-network/agent-knowledge + @tangle-network/sandbox into one self-improving loop.

What it shows:
  1. baseline AgentProfile v0 (sandbox substrate type)
  2. runMultishot across 3 personas + 1 judge (agent-eval/multishot)
  3. analyst phase reads transcripts → proposes systemPrompt mutation
  4. applyMutation → AgentProfile v1
  5. re-run multishot with v1
  6. gate compares v0 vs v1 means → ship / hold

Default mode runs offline with scripted LLM responses (reproducible demo); TANGLE_API_KEY=... MOCK=0 runs against the real router.

Verified live:
  - pnpm typecheck clean (after dev-dep bump agent-eval ^0.33.1 → ^0.38.0)
  - pnpm test: 284 tests pass
  - pnpm tsx examples/self-improving-loop/ produces: v0 mean: 3.17 → v1 mean: 8.50 (delta +5.33) → gate ships v1

README diagrams the substrate composition + maps each phase to its substrate primitive. Cross-links to agent-stack-adoption skill for the end-to-end 10-phase production runbook.
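
The gate step in this commit message compares the v0 and v1 mean scores and decides ship or hold. A minimal sketch of that comparison follows; `gate`, `mean`, and the `minDelta` threshold are hypothetical, since the commit message does not state what margin the example's gate requires.

```typescript
// Hypothetical gate: the threshold and names are assumptions, not the example's real code.
function mean(scores: readonly number[]): number {
  if (scores.length === 0) throw new Error('cannot average an empty score list');
  return scores.reduce((sum, score) => sum + score, 0) / scores.length;
}

interface GateDecision { decision: 'ship' | 'hold'; delta: number }

function gate(v0Scores: readonly number[], v1Scores: readonly number[], minDelta = 0): GateDecision {
  // Positive delta means the mutated profile scored higher on average.
  const delta = mean(v1Scores) - mean(v0Scores);
  return { decision: delta > minDelta ? 'ship' : 'hold', delta };
}

// Illustrative scores only; these are not the values from the demo run.
const improved = gate([2, 4], [6, 8]);
const unchanged = gate([5], [5]);
```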

When agent-runtime-mcp is running INSIDE a sandbox whose image carries the local coding-harness CLIs (claude / codex / opencode), delegations now spawn the harness AS A SUBPROCESS against a git worktree on the SAME filesystem instead of provisioning a sibling sandbox.

Why this matters:
  - Zero provisioning latency (no client.create round-trip per delegation)
  - Worker diffs land in-place on the caller's filesystem, with no cross-sandbox copy step
  - Multi-harness fanout = N parallel subprocesses in N parallel worktrees (claude + codex + opencode side-by-side via FanoutVote)
  - Closes the long-standing "delegate_code calls a stub LLM" gap that pre-Phase-2.8 multishot demos worked around with mocked tools

Selection (detectExecutor priority order):
  AGENT_RUNTIME_IN_SANDBOX=1 → in-process executor (this PR)
  TANGLE_FLEET_ID=... → fleet executor (Phase 2.5)
  neither → sibling-sandbox executor (default)

Configuration env:
  AGENT_RUNTIME_REPO_ROOT          repo root (required)
  AGENT_RUNTIME_LOCAL_HARNESSES    csv list, default 'claude'
  AGENT_RUNTIME_TEST_CMD           e.g. 'pnpm test'
  AGENT_RUNTIME_TYPECHECK_CMD      e.g. 'pnpm typecheck'

Architecture:
  client.create() → returns a virtual SandboxInstance whose streamPrompt:
    1. createWorktree(): git worktree add /workspace/.coder-variants/<id>
    2. yields 'in_process.harness.started' event for trace correlation
    3. runLocalHarness(): spawns claude/codex/opencode subprocess (cwd=worktree)
    4. yields 'in_process.harness.ended' with exit + duration + timed-out flag
    5. captureWorktreeDiff(): git diff HEAD → patch + shortstat
    6. runs configured testCmd + typecheckCmd against the worktree
    7. yields the terminal { type: 'result', data: { result: CoderOutput } } event the coderProfile event-parser consumes
    8. removeWorktree() in finally (even on abort / harness throw)

Multi-harness rotation: pass harnesses: ['claude', 'codex', 'opencode'] to round-robin across create() calls. runLoop + FanoutVote(n: 3) gives 3-way diversity for free.

Files:
  src/mcp/local-harness.ts         264 LOC: subprocess wrappers + per-CLI invocation shape
  src/mcp/worktree.ts              162 LOC: git worktree create/diff/remove helpers
  src/mcp/in-process-executor.ts   286 LOC: the DelegationExecutor factory
  src/mcp/bin-helpers.ts           +45 LOC: detectExecutor priority + env parsing
  src/mcp/index.ts                 +14 LOC: public surface

Tests: 19 new across 3 files
  - local-harness: 7 (spawn happy path, non-zero exit, ENOENT, timeout, abort, per-CLI args, unknown harness)
  - in-process-executor: 6 (event emission, harness rotation, post-checks, error notes, cleanup-on-abort, placement)
  - in-process-detect: 6 (env selection, error paths, harness list, test/typecheck CMD plumbing, priority over fleet)

Existing 284 tests still pass. Total 303.

Version bump 0.23.1 → 0.24.0.

drewstone added 2 commits May 25, 2026 04:48

drewstone merged commit 9917144 into main May 25, 2026
1 check failed

drewstone merged 2 commits into main from feat/in-process-executor-phase-2-8
