feat(0.20.0): MCP delegation tools — delegate_code, delegate_research, delegate_feedback #45

Merged
tangletools merged 1 commit into main from feat/mcp-delegation-tools
May 24, 2026

@tangletools
Contributor

Summary

Phase 1.5 of the driven-loop substrate. Ships a stdio MCP server in
@tangle-network/agent-runtime/mcp plus an agent-runtime-mcp bin so
sandbox coding-harness agents (claude-code, codex, opencode) can delegate
long-running coder / researcher loops to other sandboxes managed by us.

| Tool | Kind | Use |
|---|---|---|
| delegate_code | async | Code-modification task; returns a taskId, then poll delegation_status for the patch |
| delegate_research | async | Source-grounded research task; returns a taskId, then poll for items + citations |
| delegate_feedback | sync | Append agent/user/judge rating against a delegation, artifact, or outcome |
| delegation_status | sync | Snapshot of state machine (pending → running → completed / failed / cancelled) |
| delegation_history | sync | Newest-first read of past delegations, filterable by namespace / profile / since |

Async semantics

agent → delegate_code(goal, repoRoot)        → { taskId, estimatedDurationMs }
agent → delegation_status(taskId)            → { status: 'running', progress }
... (minutes pass)
agent → delegation_status(taskId)            → { status: 'completed', result: { profile: 'coder', output } }
agent → delegate_feedback(refersTo, rating)  → { recorded: true, id }
  • Idempotent: duplicate identical input → same taskId (canonical-form hash).
  • Cancellable: queue.cancel(taskId) aborts the in-flight signal.
  • Queue state is held in memory only; Phase 2 moves it to sqlite. The README documents this limitation.
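
The idempotency bullet above can be illustrated with a minimal sketch. It assumes the queue keys each submission by a hash of its canonical (key-sorted) JSON form and hands back the stored taskId on a repeat. The names `canonicalize`, `fnv1a`, and `IdempotentSubmitter` are hypothetical, and the 32-bit FNV-1a hash only keeps the example dependency-free; a real queue would more likely use a cryptographic digest.

```typescript
// Hypothetical sketch: none of these names come from the agent-runtime source.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted so key order never affects the result.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`).join(",")}}`;
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

class IdempotentSubmitter {
  private readonly taskIdByKey = new Map<string, string>();
  private nextId = 1;

  // Returns the existing taskId when the same tool + args were submitted before.
  submit(tool: string, args: Json): string {
    const key = fnv1a(canonicalize({ tool, args }));
    const existing = this.taskIdByKey.get(key);
    if (existing !== undefined) return existing;
    const taskId = `dlg-${this.nextId++}`;
    this.taskIdByKey.set(key, taskId);
    return taskId;
  }
}
```

Because the key is computed from the canonical form, a retry that happens to serialize its arguments in a different key order still maps to the same task.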

Tool descriptions (the agent-facing UX)

delegate_code

Delegate a coding task to specialist coder agents that produce a validated patch.

Use when: you need code written, fixed, refactored, or extended to satisfy a
user goal that touches a real repository. The coder runs in an isolated
sandbox, opens a fresh branch, keeps the diff minimal, runs the supplied
test + typecheck commands, and emits a unified-diff patch.

Returns immediately with a taskId. Poll delegation_status to retrieve the
patch + validator verdict (typically minutes-to-hours, longer for large
changes). Identical inputs return the same taskId — safe to retry.

When variants > 1, multiple coder harnesses (claude-code, codex, opencode)
attempt the task in parallel and the highest-scoring patch wins (smallest
passing diff). Use variants for high-stakes changes; single variant for
routine ones.

Capability scope: the coder cannot modify paths outside repoRoot and cannot
touch paths in config.forbiddenPaths. The validator hard-fails on a
forbidden-path violation, diff above config.maxDiffLines, test failure, or
typecheck failure — none of those make it past the gate.
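
As a rough illustration of the gate described above, the sketch below collects every hard-fail reason for a patch. The `GateConfig` and `PatchReport` shapes and the helper names are assumptions made for this example; only `repoRoot`, `forbiddenPaths`, and `maxDiffLines` are named in the description, and the real validator may be structured differently.

```typescript
// Hypothetical shapes for illustration; the real CoderOutput and config types may differ.
interface GateConfig {
  repoRoot: string;
  forbiddenPaths: string[]; // assumed absolute, normalized POSIX paths
  maxDiffLines: number;
}

interface PatchReport {
  changedPaths: string[]; // assumed absolute, normalized POSIX paths
  diffLines: number;
  testsPassed: boolean;
  typecheckPassed: boolean;
}

// True when `path` is `dir` itself or lies somewhere beneath it.
function isWithin(path: string, dir: string): boolean {
  const prefix = dir.endsWith("/") ? dir : `${dir}/`;
  return path === dir || path.startsWith(prefix);
}

// Collect every hard-fail reason; an empty list means the patch passes the gate.
function gateFailures(report: PatchReport, config: GateConfig): string[] {
  const failures: string[] = [];
  for (const path of report.changedPaths) {
    if (!isWithin(path, config.repoRoot)) {
      failures.push(`outside repoRoot: ${path}`);
    } else if (config.forbiddenPaths.some((forbidden) => isWithin(path, forbidden))) {
      failures.push(`forbidden path: ${path}`);
    }
  }
  if (report.diffLines > config.maxDiffLines) {
    failures.push(`diff too large: ${report.diffLines} > ${config.maxDiffLines}`);
  }
  if (!report.testsPassed) failures.push("tests failed");
  if (!report.typecheckPassed) failures.push("typecheck failed");
  return failures;
}
```

Returning all reasons at once, rather than stopping at the first, gives the delegating agent a complete picture when it polls for the verdict.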

delegate_research

Delegate a research question to specialist researcher agents that produce
source-grounded, evidence-bearing knowledge items.

Use when: you need to answer a factual question with external evidence —
audience research, competitive intelligence, recency-bound web searches,
corpus / docs lookups. The researcher emits items[] with provenance, a
citations[] index, and proposedWrites[] you decide whether to persist.

Returns immediately with a taskId. Poll delegation_status to retrieve the
items + verdict. Identical inputs return the same taskId — safe to retry.

When variants > 1, multiple researcher harnesses run in parallel and the
highest-scoring valid output wins (citation density × source diversity ×
recency match × gap coverage). Use variants when answers might disagree.
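
One way to read the scoring rule above is as a plain product of four factors, with invalid outputs excluded before ranking. The sketch below assumes each factor is already normalized to [0, 1]; the `ResearchCandidate` shape and both function names are hypothetical, and the actual scorer may weight or normalize differently.

```typescript
// Hypothetical candidate shape; factor names mirror the description above.
interface ResearchCandidate {
  harness: string;
  valid: boolean; // whether the validator accepted this output
  citationDensity: number; // each factor is assumed to lie in [0, 1]
  sourceDiversity: number;
  recencyMatch: number;
  gapCoverage: number;
}

// Multiplicative score: a zero in any factor zeroes the whole score.
function researchScore(c: ResearchCandidate): number {
  return c.citationDensity * c.sourceDiversity * c.recencyMatch * c.gapCoverage;
}

// Pick the highest-scoring valid candidate, or undefined if none is valid.
function pickWinner(candidates: ResearchCandidate[]): ResearchCandidate | undefined {
  let best: ResearchCandidate | undefined;
  let bestScore = -1;
  for (const candidate of candidates) {
    if (!candidate.valid) continue;
    const score = researchScore(candidate);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

A product rather than a sum means a variant cannot compensate for having no citations by scoring well on recency alone.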

Multi-tenant isolation: every item carries namespace. The validator
hard-fails when any item is scoped outside namespace. Never pass another
tenant's namespace.
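
The isolation rule above amounts to a single check over the returned items. This is a minimal sketch with a hypothetical `KnowledgeItem` shape and function name; the real validator presumably reports the failure through its verdict rather than by throwing.

```typescript
// Hypothetical item shape: only the namespace field matters for this check.
interface KnowledgeItem {
  id: string;
  namespace: string;
}

// Throws if any item is scoped to a namespace other than the requested one.
function assertNamespaceIsolation(items: KnowledgeItem[], namespace: string): void {
  const stray = items.filter((item) => item.namespace !== namespace);
  if (stray.length > 0) {
    const ids = stray.map((item) => item.id).join(", ");
    throw new Error(`items outside namespace "${namespace}": ${ids}`);
  }
}
```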

delegate_feedback

Record feedback on a delegation, artifact, or outcome. Synchronous — the
event is durably stored when this call returns.

Use when: you (the agent), the user, or a downstream judge has formed an
opinion about a piece of work and wants it persisted for calibration,
pricing, or future routing. Every call is a new event — multiple ratings
on the same target are expected and never deduped.

refersTo.kind:

  • "delegation" — ref is a taskId returned by delegate_code/delegate_research
  • "artifact" — ref is a URI/path/git-sha — anything you can dereference
  • "outcome" — ref is a free-form description of a downstream result

by: "agent" | "user" | "downstream-judge"

When ref names a known taskId, the rating is also attached to the
delegation record so delegation_history surfaces it inline.
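
The append-only behaviour and the cross-reference to known delegations could look roughly like the sketch below. `FeedbackLog` and its methods are hypothetical names, the `rating` fields follow the smoke transcript (score, label, notes), and requiring `kind === "delegation"` before attaching is an assumption of this example.

```typescript
type RefKind = "delegation" | "artifact" | "outcome";
type FeedbackSource = "agent" | "user" | "downstream-judge";

interface FeedbackInput {
  refersTo: { kind: RefKind; ref: string };
  rating: { score: number; label?: string; notes?: string };
  by: FeedbackSource;
}

interface FeedbackEvent extends FeedbackInput {
  id: string;
  capturedAt: string;
}

// Hypothetical append-only log: every record() call creates a new event.
class FeedbackLog {
  private readonly events: FeedbackEvent[] = [];
  private readonly eventsByTask = new Map<string, FeedbackEvent[]>();
  private readonly knownTaskIds: Set<string>;

  constructor(knownTaskIds: Set<string>) {
    this.knownTaskIds = knownTaskIds;
  }

  record(input: FeedbackInput): { recorded: true; id: string } {
    const event: FeedbackEvent = {
      ...input,
      id: `fbk-${this.events.length + 1}`,
      capturedAt: new Date().toISOString(),
    };
    this.events.push(event); // never deduped, never overwritten
    const { kind, ref } = input.refersTo;
    if (kind === "delegation" && this.knownTaskIds.has(ref)) {
      // Also attach to the delegation so a history read can show it inline.
      const attached = this.eventsByTask.get(ref) ?? [];
      attached.push(event);
      this.eventsByTask.set(ref, attached);
    }
    return { recorded: true, id: event.id };
  }

  forTask(taskId: string): readonly FeedbackEvent[] {
    return this.eventsByTask.get(taskId) ?? [];
  }

  get size(): number {
    return this.events.length;
  }
}
```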

delegation_status

Poll the status of an async delegation. Returns the current state
(pending | running | completed | failed | cancelled), optional progress,
and the final result when status === "completed".

Use when: you previously called delegate_code or delegate_research and
need to know whether the work is done. A reasonable rhythm is to call
this every minute or two while waiting; do not busy-poll.

For a completed coder task, result.output is a CoderOutput with branch,
patch, test/typecheck results, and diff stats. For a completed research
task, result.output is the items + citations + proposedWrites bundle.

Throws NotFoundError when taskId is unknown — never silently returns
pending for a typo.
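
A caller-side polling loop consistent with the guidance above might look like this sketch. `pollUntilSettled` and the `StatusFn` signature are hypothetical; the one-minute default interval follows the "every minute or two" advice, and the one-hour timeout is an arbitrary choice for the example.

```typescript
type DelegationState = "pending" | "running" | "completed" | "failed" | "cancelled";

interface StatusSnapshot {
  taskId: string;
  status: DelegationState;
  result?: unknown;
}

// Hypothetical wrapper around a delegation_status call.
type StatusFn = (taskId: string) => Promise<StatusSnapshot>;

const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

// Poll at a fixed interval until the task leaves pending/running.
async function pollUntilSettled(
  getStatus: StatusFn,
  taskId: string,
  intervalMs = 60_000,
  timeoutMs = 3_600_000,
): Promise<StatusSnapshot> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    // An error from getStatus (for example an unknown taskId) propagates to the caller.
    const snapshot = await getStatus(taskId);
    if (snapshot.status !== "pending" && snapshot.status !== "running") return snapshot;
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`timed out waiting for ${taskId}`);
    }
    await sleep(intervalMs);
  }
}
```

Letting the status error propagate, instead of retrying on it, matches the stated intent that a mistyped taskId should fail loudly rather than look like a pending task.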

delegation_history

Read past delegations newest-first. Each entry carries the original
arguments, current status, cost, and any feedback attached via
delegate_feedback.

Use when: you want to introspect prior decisions — "have I asked this
question before?", "did the last patch land?", "what's the historical
success rate of coder delegations on this repo?". Feed the results back
into your own routing and calibration.

Filters: namespace (multi-tenant scope), profile ("coder" | "researcher"),
since (ISO date — only delegations started at-or-after). limit defaults
to 50, capped at 500.
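
The filter and limit rules above can be sketched as a pure function over an in-memory list. The `DelegationRecord` shape is trimmed to the fields the filters need and `queryHistory` is a hypothetical name; clamping an out-of-range limit (rather than rejecting it) is an assumption of this example.

```typescript
type Profile = "coder" | "researcher";

interface DelegationRecord {
  taskId: string;
  profile: Profile;
  namespace: string;
  startedAt: string; // ISO timestamp
}

interface HistoryQuery {
  namespace?: string;
  profile?: Profile;
  since?: string; // ISO date; keeps delegations started at-or-after
  limit?: number;
}

const DEFAULT_LIMIT = 50;
const MAX_LIMIT = 500;

// Filter, order newest-first, then apply the clamped limit.
function queryHistory(records: DelegationRecord[], query: HistoryQuery = {}): DelegationRecord[] {
  const sinceMs = query.since === undefined ? -Infinity : Date.parse(query.since);
  const limit = Math.max(0, Math.min(query.limit ?? DEFAULT_LIMIT, MAX_LIMIT));
  return records
    .filter((r) => query.namespace === undefined || r.namespace === query.namespace)
    .filter((r) => query.profile === undefined || r.profile === query.profile)
    .filter((r) => Date.parse(r.startedAt) >= sinceMs)
    .sort((a, b) => Date.parse(b.startedAt) - Date.parse(a.startedAt))
    .slice(0, limit);
}
```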

Layering

agent-runtime/mcp                         ← NEW. server + 5 tools + queue + feedback store
  ↓ delegates wire to
agent-runtime/loops + agent-runtime/profiles  (coder)
agent-knowledge/profiles                       (researcher — injected; optional peer)

agent-runtime cannot depend on agent-knowledge (that would create a dependency cycle). The bin
lazy-imports the researcher delegate from agent-knowledge when present;
the surface is silently omitted otherwise. Custom integrations wire their
own researcherDelegate via createMcpServer({ researcherDelegate }).
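
The lazy-import pattern described above can be sketched with a dynamic `import()` wrapped in a try/catch. `loadOptionalPeer` and `buildToolList` are hypothetical helpers, and the reading that the omitted surface is the delegate_research tool is an assumption; the bin's actual wiring is not shown in this PR description.

```typescript
// Resolve an optional peer at runtime; undefined means "not installed, skip the feature".
async function loadOptionalPeer(specifier: string): Promise<unknown> {
  try {
    return await import(specifier);
  } catch {
    // This sketch treats every load failure as "absent". A stricter version
    // would rethrow anything other than a module-not-found error.
    return undefined;
  }
}

// Hypothetical: register the research tool only when the peer package resolves.
async function buildToolList(peerSpecifier: string): Promise<string[]> {
  const tools = ["delegate_code", "delegate_feedback", "delegation_status", "delegation_history"];
  const peer = await loadOptionalPeer(peerSpecifier);
  if (peer !== undefined) tools.push("delegate_research");
  return tools;
}
```

Keeping the import dynamic means the package never needs a static dependency edge on its optional peer, which is what avoids the cycle.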

Sandbox SDK fleet-API findings

SandboxFleetClient exists (client.fleets.create({...})) and exposes
dispatchPrompt, dispatchExec, etc. for coordinated multi-machine
work. The MCP layer does not call it directly — runLoop already
parallelizes agentRuns through bounded Promise.all against the
underlying LoopSandboxClient, and fleet semantics (shared workspace,
dispatch traces, intelligence reports) are orthogonal to the
fire-and-poll task model the MCP server presents. We retain
MCP_MAX_CONCURRENT_SANDBOXES (default 4) for the kernel cap. Phase 2
can plumb fleet dispatch into a fleet-backed delegate if the workload
demands it.
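
For readers unfamiliar with the phrase "bounded Promise.all", the sketch below shows one common way to cap concurrency: start a fixed number of workers that pull from a shared index and await them together. This is not the runLoop implementation; `mapWithLimit` is a hypothetical helper, and the limit would correspond to something like MCP_MAX_CONCURRENT_SANDBOXES.

```typescript
// Run `task` over `items` with at most `limit` calls in flight; results keep input order.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index] as T, index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

If any task rejects, `Promise.all` rejects with that error while the other workers finish their current item, so callers that need partial results would have to catch inside `task`.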

Test results

Test Files  22 passed (22)
     Tests  215 passed (215)
  • 154 existing tests unchanged
  • 61 new tests in tests/mcp/* covering: queue lifecycle, idempotency,
    cancel + abort propagation, validation errors (Type/RangeError),
    namespace isolation, feedback append-only semantics + cross-reference
    to history, status NotFoundError, history filters + ordering + limit,
    full JSON-RPC roundtrip end-to-end (both server.handle() and stdio
    transport), parse-error handling, tool-descriptor self-tests.

Typecheck clean. pnpm build clean. biome check src tests clean.

Smoke transcript (in-process transport)

Driven through dist/mcp/index.js with stub delegates. The wire shape
matches what the bin emits.

=== smoke: initialize ===
{ protocolVersion: "2024-11-05", capabilities: { tools: {} }, serverInfo: { name: "agent-runtime-mcp", version: "0.20.0" } }

=== smoke: tools/list ===
registered 5 tools: delegate_code, delegate_research, delegate_feedback, delegation_status, delegation_history

=== smoke: delegate_research(question: "what content engages cpg-founder ICP on Twitter?", namespace: "test", variants: 2) ===
taskId: dlg-mpk2afeu-5nvjjd6r

=== smoke: poll status ===
{ taskId, profile: "researcher", status: "completed",
  result: { profile: "researcher", output: { items: [...1...], citations: [...1...], proposedWrites: [] } },
  startedAt, completedAt }

=== smoke: delegate_feedback(refersTo: {kind:'delegation', ref:taskId}, rating: {score:0.85, label:'good', notes:'great source diversity'}, by:'agent', namespace:'test') ===
{ recorded: true, id: "fbk-mpk2afew-hvifjvff" }

=== smoke: delegation_history({ namespace: "test" }) ===
{
  delegations: [
    {
      taskId: "dlg-mpk2afeu-5nvjjd6r",
      profile: "researcher",
      args: { question: "...", namespace: "test", variants: 2 },
      status: "completed",
      namespace: "test",
      feedback: [{ id: "fbk-...", score: 0.85, by: "agent", notes: "great source diversity", label: "good", capturedAt }]
    }
  ]
}
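
For context on the wire shape behind the transcript above, MCP tool invocations are JSON-RPC 2.0 requests with method `tools/call`, and the stdio transport sends one JSON message per line. The sketch below builds such a request; `toolCallRequest` and `frame` are hypothetical helper names, not exports of this package.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number | string;
  method: string;
  params?: unknown;
}

// Build a tools/call request for a named tool with its arguments.
function toolCallRequest(id: number, name: string, args: Record<string, unknown>): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Frame a message for a newline-delimited stdio stream.
// JSON.stringify escapes newlines inside strings, so the frame is always a single line.
function frame(message: JsonRpcRequest): string {
  return `${JSON.stringify(message)}\n`;
}
```

A client would write `frame(toolCallRequest(1, "delegation_history", { namespace: "test" }))` to the server's stdin and read the matching response line from stdout.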

Real-credential smoke (against TCloud-routed sandboxes) is the next step
before tagging the release; this PR ships the substrate.

Out of scope (Phase 2 follow-ups)

  • Persistent task state (sqlite) — README documents the in-memory limitation
  • Webhook callbacks — MCP polling is the contract for v1
  • delegate_evaluation — separate future tool
  • Fleet-backed delegate that uses client.fleets.dispatchPrompt for
    cross-machine coordinated runs

Files

  • src/mcp/{server,task-queue,feedback-store,delegates,types,index,bin}.ts
  • src/mcp/tools/{delegate-code,delegate-research,delegate-feedback,delegation-status,delegation-history}.ts
  • tests/mcp/*.test.ts (8 files)
  • package.json — version 0.20.0, new sub-export, new bin, optional agent-knowledge peer
  • tsup.config.ts: mcp/index + mcp/bin entries
  • README.md — Delegation tools (MCP) section

…, delegate_feedback

New sub-export `@tangle-network/agent-runtime/mcp` and `agent-runtime-mcp`
bin. Five tools exposed over stdio JSON-RPC (MCP 2024-11-05):

- delegate_code        async, idempotent — runs coderProfile / multi-harness fanout
- delegate_research    async, idempotent — runs an injected researcher delegate
- delegate_feedback    sync, append-only — every rating is its own event
- delegation_status    sync poll — state machine + progress + final result
- delegation_history   sync read — newest-first, filterable, feedback inline

State lives in an in-memory DelegationTaskQueue (Phase 2 → sqlite). The
server is topology-free; consumers wire coder + researcher delegates at
construction. The bin auto-wires the default coder against the real
Sandbox client and lazy-imports a researcher delegate when
@tangle-network/agent-knowledge is installed as an optional peer.

61 new tests cover validation, idempotency, lifecycle, cancellation,
namespace isolation, feedback cross-reference, and a full JSON-RPC
end-to-end through both in-process and stdio transports.
@tangletools merged commit 9c82adb into main on May 24, 2026
1 check failed
@tangletools deleted the feat/mcp-delegation-tools branch on May 24, 2026 17:49