
feat(0.24.0): in-sandbox executor — Phase 2.8 closes the 'delegate_code does real work' gap #59

Merged
drewstone merged 2 commits into main from feat/in-process-executor-phase-2-8 on May 25, 2026

@drewstone

Summary

When agent-runtime-mcp is running INSIDE a sandbox whose image carries the local coding-harness CLIs (claude / codex / opencode), delegations now spawn the harness as a subprocess against a git worktree on the SAME filesystem — no sibling-sandbox provisioning, no stub LLM, worker diffs land in-place on the caller's filesystem.

This is the architecture Drew named two days ago. Phase 2.8 closes the long-standing gap where delegate_code provisioned an external sandbox + called a stub LLM. Now it forks a real harness CLI in a real git worktree.

Why it matters

  • Zero provisioning latency — no client.create() round-trip per delegation
  • Worker diffs in-place — no cross-sandbox copy step
  • Multi-harness fanout for free: passing harnesses: ['claude', 'codex', 'opencode'] rotates round-robin across create() calls; runLoop + FanoutVote(n: 3) then gives 3-way diversity automatically
  • Closes the multishot stub gap — pre-2.8 the delegate_code substrate had to be mocked or call a stub LLM

Selection (detectExecutor priority)

  1. AGENT_RUNTIME_IN_SANDBOX=1 → in-process (this PR)
  2. TANGLE_FLEET_ID=... → fleet (Phase 2.5)
  3. neither → sibling-sandbox (default)
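
The priority order above can be sketched as a small pure function. This is an illustration only, not the detectExecutor code in src/mcp/bin-helpers.ts: the names `pickExecutorKind`, `ExecutorKind`, and `SelectionEnv` are hypothetical, and treating an empty TANGLE_FLEET_ID as unset is an assumption.

```typescript
// Hypothetical sketch of the selection priority; names and shapes are assumptions.
type ExecutorKind = 'in-process' | 'fleet' | 'sibling-sandbox';

interface SelectionEnv {
  AGENT_RUNTIME_IN_SANDBOX?: string;
  TANGLE_FLEET_ID?: string;
}

function pickExecutorKind(env: SelectionEnv): ExecutorKind {
  // 1. The in-sandbox flag wins, even when a fleet ID is also present.
  if (env.AGENT_RUNTIME_IN_SANDBOX === '1') return 'in-process';
  // 2. A non-empty fleet ID selects the fleet executor (assumption: empty means unset).
  if (env.TANGLE_FLEET_ID) return 'fleet';
  // 3. Neither variable is set: fall back to the sibling-sandbox default.
  return 'sibling-sandbox';
}
```

Keeping the decision in a function that takes the environment as an argument, rather than reading a global, is one way to make the "priority over fleet" case easy to unit-test.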

Configuration env

| Env | Default | Purpose |
| --- | --- | --- |
| AGENT_RUNTIME_IN_SANDBOX | unset | Selects in-process when 1 |
| AGENT_RUNTIME_REPO_ROOT | (required) | Workspace root for worktrees |
| AGENT_RUNTIME_LOCAL_HARNESSES | claude | CSV; round-robin across create() |
| AGENT_RUNTIME_TEST_CMD | unset | e.g. pnpm test |
| AGENT_RUNTIME_TYPECHECK_CMD | unset | e.g. pnpm typecheck |
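
As a rough illustration of how these variables might be turned into options, the sketch below parses them from a plain object. The function name `parseInProcessEnv`, the `InProcessEnvConfig` shape, and the specific error messages are hypothetical; rejecting unknown harness names and falling back to claude on an empty list are assumptions, not confirmed behaviour of this PR.

```typescript
// Hypothetical sketch of env parsing for the in-process executor.
type LocalHarnessName = 'claude' | 'codex' | 'opencode';
const KNOWN_HARNESSES: readonly LocalHarnessName[] = ['claude', 'codex', 'opencode'];

interface InProcessEnvConfig {
  repoRoot: string;
  harnesses: LocalHarnessName[];
  testCmd: string | undefined;
  typecheckCmd: string | undefined;
}

function parseInProcessEnv(env: Record<string, string | undefined>): InProcessEnvConfig {
  const repoRoot = env['AGENT_RUNTIME_REPO_ROOT'];
  if (!repoRoot) {
    throw new Error('AGENT_RUNTIME_REPO_ROOT is required for the in-process executor');
  }

  // CSV list; the default is the single harness 'claude'.
  const raw = env['AGENT_RUNTIME_LOCAL_HARNESSES'] ?? 'claude';
  const names = raw.split(',').map((s) => s.trim()).filter((s) => s.length > 0);

  const harnesses = names.map((name) => {
    const known = KNOWN_HARNESSES.find((k) => k === name);
    // Assumption: unknown names are rejected rather than silently skipped.
    if (!known) throw new Error(`unknown harness: ${name}`);
    return known;
  });

  return {
    repoRoot,
    // Assumption: an empty list falls back to the documented default.
    harnesses: harnesses.length > 0 ? harnesses : ['claude'],
    testCmd: env['AGENT_RUNTIME_TEST_CMD'],
    typecheckCmd: env['AGENT_RUNTIME_TYPECHECK_CMD'],
  };
}
```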

Architecture

client.create() → returns a virtual SandboxInstance whose streamPrompt:
  1. createWorktree()                       — git worktree add /workspace/.coder-variants/<id>
  2. yield 'in_process.harness.started'      — for trace correlation
  3. runLocalHarness()                      — spawn claude/codex/opencode (cwd=worktree)
  4. yield 'in_process.harness.ended'        — exit + duration + timed-out
  5. captureWorktreeDiff()                  — git diff HEAD → patch + shortstat
  6. testCmd / typecheckCmd against the worktree
  7. yield { type: 'result', data: { result: CoderOutput } }  ← coderProfile consumes this
  8. removeWorktree() in finally (even on abort / harness throw)
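
The eight steps above map naturally onto an async generator with a try/finally block. The sketch below is a simplified illustration under stated assumptions: `streamPromptSketch`, `FlowSeams`, `FlowEvent`, and the fields of the result payload are hypothetical stand-ins, and the real SandboxInstance and CoderOutput types are not reproduced here.

```typescript
// Hypothetical sketch of the streamPrompt flow; all shapes are illustrative.
interface HarnessExit { exitCode: number | null; durationMs: number; timedOut: boolean }
interface DiffCapture { patch: string; shortstat: string }

type FlowEvent =
  | { type: 'in_process.harness.started'; data: { harness: string; worktree: string } }
  | { type: 'in_process.harness.ended'; data: HarnessExit }
  | { type: 'result'; data: { result: DiffCapture & { checks: Record<string, boolean> } } };

// Injected dependencies, so the flow can run without git or a real CLI.
interface FlowSeams {
  createWorktree(id: string): Promise<string>;
  runHarness(harness: string, cwd: string, prompt: string): Promise<HarnessExit>;
  captureDiff(cwd: string): Promise<DiffCapture>;
  runPostCheck(cmd: string, cwd: string): Promise<boolean>;
  removeWorktree(path: string): Promise<void>;
}

interface FlowOptions {
  id: string;
  harness: string;
  prompt: string;
  testCmd?: string;
  typecheckCmd?: string;
}

async function* streamPromptSketch(seams: FlowSeams, opts: FlowOptions): AsyncGenerator<FlowEvent> {
  const worktree = await seams.createWorktree(opts.id); // step 1
  try {
    yield { type: 'in_process.harness.started', data: { harness: opts.harness, worktree } }; // step 2
    const exit = await seams.runHarness(opts.harness, worktree, opts.prompt); // step 3
    yield { type: 'in_process.harness.ended', data: exit }; // step 4
    const diff = await seams.captureDiff(worktree); // step 5

    // Step 6: only the configured post-checks run.
    const checks: Record<string, boolean> = {};
    if (opts.testCmd) checks['test'] = await seams.runPostCheck(opts.testCmd, worktree);
    if (opts.typecheckCmd) checks['typecheck'] = await seams.runPostCheck(opts.typecheckCmd, worktree);

    yield { type: 'result', data: { result: { ...diff, checks } } }; // step 7
  } finally {
    // Step 8: runs on normal completion, on a harness throw, and when the
    // consumer stops iterating early (which calls the generator's return()).
    await seams.removeWorktree(worktree);
  }
}
```

Placing the cleanup in `finally` is what makes it run when the consumer aborts mid-stream, since breaking out of a `for await` loop triggers the generator's `return()`.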

Tests

19 new tests across 3 files. All 303 tests pass (was 284).

  • tests/mcp/local-harness.test.ts (7) — spawn happy path, non-zero exit, ENOENT, timeout, abort, per-CLI args, unknown harness
  • tests/mcp/in-process-executor.test.ts (6) — event emission, harness rotation, post-checks, error notes, cleanup-on-abort, placement
  • tests/mcp/in-process-detect.test.ts (6) — env selection, error paths, harness list, test/typecheck wiring, priority over fleet

Public surface

New exports from @tangle-network/agent-runtime/mcp:

  • createInProcessExecutor + InProcessExecutorOptions / InProcessExecutorDescribePlacement
  • runLocalHarness + LocalHarness / LocalHarnessResult / RunLocalHarnessOptions
  • createWorktree / captureWorktreeDiff / removeWorktree + their option types + GitRunner

Verified

  • pnpm typecheck
  • pnpm build
  • pnpm test ✓ — 303 tests

Live smoke (deferred)

End-to-end live verification (claude/codex/opencode actually running inside a sandbox + emitting real diffs) is gated on the sandbox-provisioning fix landing in production. The substrate primitive is comprehensive + tested via injected test seams (runHarness, runGit, runPostCheck); the live smoke is the integration test once infra is unblocked.
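
To illustrate what testing through an injected seam can look like, the sketch below passes a git runner into worktree helpers so a test can record commands instead of invoking git. Everything here is a hypothetical stand-in: the `*Sketch` names, the runner signature, and the `--force` flag on removal are assumptions, and the exported createWorktree / captureWorktreeDiff / removeWorktree may differ.

```typescript
// Hypothetical seam: a function that runs `git <args>` in `cwd` and returns stdout.
type GitRunnerSketch = (args: string[], cwd: string) => Promise<string>;

async function createWorktreeSketch(runGit: GitRunnerSketch, repoRoot: string, id: string): Promise<string> {
  // Assumption: worktrees live under <repoRoot>/.coder-variants/<id>.
  const path = `${repoRoot}/.coder-variants/${id}`;
  await runGit(['worktree', 'add', path], repoRoot);
  return path;
}

async function captureWorktreeDiffSketch(runGit: GitRunnerSketch, worktree: string) {
  const patch = await runGit(['diff', 'HEAD'], worktree);
  const shortstat = (await runGit(['diff', '--shortstat', 'HEAD'], worktree)).trim();
  return { patch, shortstat };
}

async function removeWorktreeSketch(runGit: GitRunnerSketch, repoRoot: string, worktree: string): Promise<void> {
  // Assumption: --force, so a worktree with uncommitted edits is still removed.
  await runGit(['worktree', 'remove', '--force', worktree], repoRoot);
}

// Test double: records each command and replies from a canned table.
function recordingGit(replies: Record<string, string>): { runGit: GitRunnerSketch; calls: string[] } {
  const calls: string[] = [];
  const runGit: GitRunnerSketch = async (args) => {
    const key = args.join(' ');
    calls.push(key);
    return replies[key] ?? '';
  };
  return { runGit, calls };
}
```

A test built on `recordingGit` can assert on the exact command sequence without touching a repository, which is the kind of coverage available before the live smoke is unblocked.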

Version bump 0.23.1 → 0.24.0.

drewstone added 2 commits May 25, 2026 04:48
Single runnable file that wires @tangle-network/agent-runtime +
@tangle-network/agent-eval + @tangle-network/agent-knowledge +
@tangle-network/sandbox into one self-improving loop.

What it shows:
1. baseline AgentProfile v0 (sandbox substrate type)
2. runMultishot across 3 personas + 1 judge (agent-eval/multishot)
3. analyst phase reads transcripts → proposes systemPrompt mutation
4. applyMutation → AgentProfile v1
5. re-run multishot with v1
6. gate compares v0 vs v1 means → ship / hold
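
The gate in step 6 can be pictured as a comparison of two means. The sketch below is a minimal illustration, not the example's actual code: `gateSketch`, `GateDecision`, and the `minDelta` threshold parameter are hypothetical.

```typescript
// Hypothetical gate: ship the candidate only if its mean beats the baseline by more than minDelta.
interface GateDecision {
  decision: 'ship' | 'hold';
  baselineMean: number;
  candidateMean: number;
  delta: number;
}

function mean(scores: number[]): number {
  if (scores.length === 0) throw new Error('cannot take the mean of an empty score list');
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

function gateSketch(baseline: number[], candidate: number[], minDelta = 0): GateDecision {
  const baselineMean = mean(baseline);
  const candidateMean = mean(candidate);
  const delta = candidateMean - baselineMean;
  // A tie (delta equal to minDelta) holds; only a strict improvement ships.
  return { decision: delta > minDelta ? 'ship' : 'hold', baselineMean, candidateMean, delta };
}
```

Under this rule, a candidate that merely ties the baseline is held, so a mutation has to show a measurable gain before it replaces v0.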

Default mode runs offline with scripted LLM responses (reproducible demo);
TANGLE_API_KEY=... MOCK=0 runs against the real router.

Verified live:
- pnpm typecheck clean (after dev-dep bump agent-eval ^0.33.1 → ^0.38.0)
- pnpm test — 284 tests pass
- pnpm tsx examples/self-improving-loop/ produces:
    v0 mean: 3.17 → v1 mean: 8.50 (delta +5.33) → gate ships v1

README diagrams the substrate composition + maps each phase to its
substrate primitive. Cross-links to agent-stack-adoption skill for the
end-to-end 10-phase production runbook.

When agent-runtime-mcp is running INSIDE a sandbox whose image carries
the local coding-harness CLIs (claude / codex / opencode), delegations
now spawn the harness AS A SUBPROCESS against a git worktree on the
SAME filesystem instead of provisioning a sibling sandbox.

Why this matters:
  - Zero provisioning latency (no client.create round-trip per delegation)
  - Worker diffs land in-place on the caller's filesystem — no
    cross-sandbox copy step
  - Multi-harness fanout = N parallel subprocesses in N parallel
    worktrees (claude + codex + opencode side-by-side via FanoutVote)
  - Closes the long-standing "delegate_code calls a stub LLM" gap that
    pre-Phase-2.8 multishot demos worked around with mocked tools

Selection (detectExecutor priority order):
  AGENT_RUNTIME_IN_SANDBOX=1 → in-process executor (this PR)
  TANGLE_FLEET_ID=...        → fleet executor (Phase 2.5)
  neither                    → sibling-sandbox executor (default)

Configuration env:
  AGENT_RUNTIME_REPO_ROOT          repo root (required)
  AGENT_RUNTIME_LOCAL_HARNESSES    csv list, default 'claude'
  AGENT_RUNTIME_TEST_CMD           e.g. 'pnpm test'
  AGENT_RUNTIME_TYPECHECK_CMD      e.g. 'pnpm typecheck'

Architecture:
  client.create() → returns a virtual SandboxInstance whose streamPrompt:
    1. createWorktree() — git worktree add /workspace/.coder-variants/<id>
    2. yields 'in_process.harness.started' event for trace correlation
    3. runLocalHarness() — spawns claude/codex/opencode subprocess (cwd=worktree)
    4. yields 'in_process.harness.ended' with exit + duration + timed-out flag
    5. captureWorktreeDiff() — git diff HEAD → patch + shortstat
    6. runs configured testCmd + typecheckCmd against the worktree
    7. yields the terminal { type: 'result', data: { result: CoderOutput } }
       event the coderProfile event-parser consumes
    8. removeWorktree() in finally (even on abort / harness throw)

Multi-harness rotation: pass harnesses: ['claude', 'codex', 'opencode']
to round-robin across create() calls. runLoop + FanoutVote(n: 3) gives
3-way diversity for free.
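
Round-robin rotation across create() calls can be sketched with a counter held in a closure. This is an illustration only; `createHarnessRotation` is a hypothetical helper, not an export of this package.

```typescript
// Hypothetical helper: returns a function that yields the next harness on each call.
function createHarnessRotation<T>(harnesses: readonly T[]): () => T {
  if (harnesses.length === 0) throw new Error('at least one harness is required');
  let next = 0;
  return () => {
    // The modulo wraps the counter back to the first harness after the last one.
    const harness = harnesses[next % harnesses.length] as T;
    next += 1;
    return harness;
  };
}

// Usage: three consecutive calls cover each harness once, the fourth wraps around.
const nextHarness = createHarnessRotation(['claude', 'codex', 'opencode'] as const);
const picked = [nextHarness(), nextHarness(), nextHarness(), nextHarness()];
// picked is ['claude', 'codex', 'opencode', 'claude']
```

With three harnesses configured, three successive create() calls would each get a different CLI, which is the property a 3-way fanout relies on for diversity.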

Files:
  src/mcp/local-harness.ts        264 LOC — subprocess wrappers + per-CLI invocation shape
  src/mcp/worktree.ts             162 LOC — git worktree create/diff/remove helpers
  src/mcp/in-process-executor.ts  286 LOC — the DelegationExecutor factory
  src/mcp/bin-helpers.ts          +45 LOC — detectExecutor priority + env parsing
  src/mcp/index.ts                +14 LOC — public surface

Tests: 19 new across 3 files
  - local-harness: 7 (spawn happy path, non-zero exit, ENOENT, timeout, abort, per-CLI args, unknown harness)
  - in-process-executor: 6 (event emission, harness rotation, post-checks, error notes, cleanup-on-abort, placement)
  - in-process-detect: 6 (env selection, error paths, harness list, test/typecheck CMD plumbing, priority over fleet)
  Existing 284 tests still pass. Total 303.

Version bump 0.23.1 → 0.24.0.
drewstone merged commit 9917144 into main on May 25, 2026
1 check failed