Home

agent-orchestrator

Multi-agent workflows that survive 2am: durable state, deterministic replay, hard budgets, and full tracing.

Built by Sarma Linux. MIT licence.

What this is

Most agent frameworks are demos with delusions of grandeur. They fall over the moment a tool times out or a model hallucinates a parameter. This orchestrator is built around the assumption that everything will fail, repeatedly, and that you need to debug what your agents did six hours after the fact.

Workflows are typed graphs. State is durable in Postgres. Every step writes a trace event that can be replayed deterministically, and every run, node, LLM call, and tool call is wrapped in an OpenTelemetry span. Agents have explicit token, tool-call, and wall-clock budgets, all enforced. It supports supervisor, swarm, and pipeline patterns, and ships with a Next.js inspector that draws the agent graph and walks runs step by step.

Who this is for

Teams shipping multi-step agent pipelines that must survive process crashes.
Engineers who have been burned by black-box orchestrators at 2am.
Anyone who needs audit trails, OpenTelemetry traces, and deterministic replay for LLM workflows.

How it fits together

The system is a small pnpm monorepo with two apps. apps/api is the Fastify control plane and the executor. apps/inspector is a Next.js UI that reads run state and draws the graph. State lives in Postgres through Drizzle ORM, and runs are distributed through a Redis BullMQ queue.

graph TD
  C[Client] --> API[Fastify API]
  API --> Q[Redis BullMQ queue]
  Q --> W[Run worker]
  W --> EX[Graph executor]
  EX --> AG[Agent dispatch]
  AG --> LLM[LLM adapter]
  AG --> R[Tool registry]
  R --> MCP[MCP tools]
  EX --> B[RunBudget]
  EX --> S[(Postgres state)]
  EX --> OT[OpenTelemetry spans]
  S --> I[Inspector UI]

  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class LLM,OT ext

A run moves through a clear lifecycle. The API accepts a start request naming a registered graph and an input payload, writes a runs row, and enqueues the run. With Redis configured, the job goes to BullMQ and a worker picks it up with a retry policy applied; without Redis, the run executes in-process. The executor walks the graph from its entry nodes, dispatches each node to its agent kind, records a trace event, and writes a checkpoint. Budgets are charged on every LLM and tool call. When a token, tool-call, or wall-clock limit is breached, the executor aborts mid-run through an AbortSignal and the run lands in the budget_exceeded state. The full data model, including the runs, traces, and checkpoints tables, is on the Architecture page.
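
The budget behaviour in the lifecycle above can be sketched roughly as follows. This is an illustration under assumptions, not the project's RunBudget: the names SketchBudget, Limits, charge, and runSteps are invented for this example, and the real executor may charge and abort differently.

```typescript
// Hypothetical sketch: a run-wide budget that aborts through an AbortSignal.
// SketchBudget, Limits, charge(), and runSteps() are illustrative names only.
type Limits = { tokens: number; tools: number; wallClockSec: number };

class SketchBudget {
  private tokensUsed = 0;
  private toolsUsed = 0;
  private readonly limits: Limits;
  private readonly now: () => number;
  private readonly startedAt: number;
  private readonly controller = new AbortController();

  // `now` is injectable so the wall-clock limit can be exercised without waiting.
  constructor(limits: Limits, now: () => number = () => Date.now()) {
    this.limits = limits;
    this.now = now;
    this.startedAt = now();
  }

  // Handed to LLM and tool calls so in-flight work can observe the abort.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Record usage after an LLM or tool call and abort on the first breach.
  charge(usage: { tokens?: number; tools?: number }): void {
    this.tokensUsed += usage.tokens ?? 0;
    this.toolsUsed += usage.tools ?? 0;
    const elapsedSec = (this.now() - this.startedAt) / 1000;
    if (
      this.tokensUsed > this.limits.tokens ||
      this.toolsUsed > this.limits.tools ||
      elapsedSec > this.limits.wallClockSec
    ) {
      this.controller.abort();
    }
  }
}

// Walk a list of per-step token costs, stopping once the signal has fired.
function runSteps(budget: SketchBudget, stepTokenCosts: number[]): string {
  for (const cost of stepTokenCosts) {
    if (budget.signal.aborted) return "budget_exceeded";
    budget.charge({ tokens: cost });
  }
  return budget.signal.aborted ? "budget_exceeded" : "completed";
}
```

Because the limits are checked on every charge, a breach is noticed at the next LLM or tool call rather than at the end of a node, which is what makes a mid-run abort possible.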

Why these choices

Postgres plus Drizzle is the state store because it survives process crashes and gives SQL access to inspect runs after the fact. SQLite would be simpler, but multi-instance deployments need a shared database. An in-memory store is provided for tests. Redis with BullMQ handles queueing so that workers can be restarted and scaled horizontally, and it provides retry policies out of the box. Trace events are written on every step because debugging an agent failure hours later requires the full sequence of decisions, and logs alone are not enough. Budgets are hard limits enforced through an AbortSignal because soft limits get ignored in practice. OpenTelemetry is the tracing layer because it is the standard and lets you ship spans to whatever backend you already run.

Real-world examples

These are taken from apps/api/examples, which the API registers at start-up.

Research swarm

A supervisor plans, a pipeline gathers with web_search, a swarm analyses in parallel, and a pipeline summarises. A per-tool cap stops a runaway search loop without throttling the whole run.

import { graph } from '../../src/graph/definition.js'

export const research = graph('research-swarm')
  .node('plan',      { agent: 'supervisor', llm: 'sarmalink' })
  .node('search',    { agent: 'pipeline',   tools: ['web_search'] })
  .node('analyse',   { agent: 'swarm',      llm: 'sarmalink', concurrency: 3 })
  .node('summarise', { agent: 'pipeline',   llm: 'sarmalink' })
  .edge('plan', 'search')
  .edge('search', 'analyse')
  .edge('analyse', 'summarise')
  .budget({ tokens: 50000, tools: 100, wallClockSec: 300, perTool: { web_search: 20 } })

Trigger it over HTTP and watch it in the inspector:

curl -X POST http://localhost:4000/runs \
  -H "Content-Type: application/json" \
  -d '{"graph":"research-swarm","input":{"topic":"mediasoup vs LiveKit in 2026"}}'
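
The perTool cap in the research-swarm budget can be pictured as a second counter layered on top of the run-wide tool limit. The sketch below is an assumption about the general shape, not the project's code: ToolCounter and tryCall are hypothetical names, and it only reports whether a call fits within the caps. What the real executor does when a cap is breached is not modelled here.

```typescript
// Hypothetical sketch of per-tool caps layered on a run-wide tool-call limit.
// ToolCounter and tryCall() are illustrative names, not the project's API.
type ToolLimits = { tools: number; perTool?: Record<string, number> };

class ToolCounter {
  private total = 0;
  private readonly byTool = new Map<string, number>();
  private readonly limits: ToolLimits;

  constructor(limits: ToolLimits) {
    this.limits = limits;
  }

  // Returns true and records the call if it fits both caps, otherwise false.
  tryCall(tool: string): boolean {
    const used = this.byTool.get(tool) ?? 0;
    const cap = this.limits.perTool?.[tool];
    if (cap !== undefined && used >= cap) return false; // this tool is exhausted
    if (this.total >= this.limits.tools) return false; // the whole run is exhausted
    this.byTool.set(tool, used + 1);
    this.total += 1;
    return true;
  }
}
```

With the limits from the example, `{ tools: 100, perTool: { web_search: 20 } }`, the twenty-first web_search call is refused while other tools can still draw on the remaining run-wide allowance.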

Conditional routing

The triage example branches on run context. Each conditional edge takes a when predicate as its third argument, and the executor follows an edge only if that predicate returns a truthy value.

import { graph } from '../../src/graph/definition.js'

export const triage = graph('triage')
  .node('classify', { agent: 'supervisor', llm: 'sarmalink' })
  .node('refund',   { agent: 'pipeline',   tools: ['stripe_refund'] })
  .node('escalate', { agent: 'pipeline' })
  .edge('classify', 'refund',   (ctx) => (ctx.input as any)?.intent === 'refund')
  .edge('classify', 'escalate', (ctx) => (ctx.input as any)?.intent !== 'refund')
  .budget({ tokens: 8000, tools: 10, wallClockSec: 60, perTool: { stripe_refund: 1 } })
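
How an executor might choose which edges to follow can be sketched in a few lines. The types Ctx and Edge and the functions nextNodes and intentOf are assumptions made for this example. In particular, treating an edge without a predicate as always followed is a guess about the semantics, not something the example above confirms.

```typescript
// Hypothetical sketch: select the outgoing edges whose predicate is truthy.
// Ctx, Edge, nextNodes(), and intentOf() are illustrative names only.
type Ctx = { input: unknown };
type Edge = { from: string; to: string; when?: (ctx: Ctx) => unknown };

function nextNodes(edges: Edge[], current: string, ctx: Ctx): string[] {
  return edges
    .filter((e) => e.from === current)
    // Assumption: an edge with no predicate is unconditional.
    .filter((e) => (e.when ? Boolean(e.when(ctx)) : true))
    .map((e) => e.to);
}

// Read `intent` from the input without resorting to `any`.
const intentOf = (ctx: Ctx): unknown =>
  typeof ctx.input === "object" && ctx.input !== null
    ? (ctx.input as Record<string, unknown>)["intent"]
    : undefined;

// The two triage edges, expressed as plain data for this sketch.
const triageEdges: Edge[] = [
  { from: "classify", to: "refund", when: (ctx) => intentOf(ctx) === "refund" },
  { from: "classify", to: "escalate", when: (ctx) => intentOf(ctx) !== "refund" },
];
```

Because the two predicates are exact complements, every input takes exactly one branch, including inputs with a missing or malformed intent, which fall through to escalate.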

Replay for debugging

Replay reconstructs a run from a checkpointed step. Change a tool implementation, replay from the step before it ran, and confirm the new behaviour without re-running the whole graph.

curl -X POST http://localhost:4000/runs/<run-id>/replay \
  -H "Content-Type: application/json" \
  -d '{"fromStep": 2}'
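
The idea behind replay can be sketched as follows. The sketch assumes that a checkpoint stores the state after each step and that fromStep is the zero-based index of the first step to re-execute. Both are assumptions about the semantics, and State, Step, Checkpoint, execute, and replay are hypothetical names rather than the project's API.

```typescript
// Hypothetical sketch of replaying a run from a checkpointed step.
// State, Step, Checkpoint, execute(), and replay() are illustrative names only.
type State = Record<string, unknown>;
type Step = { name: string; run: (state: State) => State };
type Checkpoint = { step: number; state: State };
type RunResult = { state: State; checkpoints: Checkpoint[] };

// Run steps from `startAt`, saving the state after each one.
function execute(
  steps: Step[],
  input: State,
  startAt = 0,
  earlier: Checkpoint[] = [],
): RunResult {
  let state = input;
  const checkpoints = earlier.slice();
  steps.slice(startAt).forEach((step, offset) => {
    state = step.run(state);
    checkpoints.push({ step: startAt + offset, state });
  });
  return { state, checkpoints };
}

// Re-execute from `fromStep`, starting from the state saved just before it.
function replay(
  steps: Step[],
  input: State,
  checkpoints: Checkpoint[],
  fromStep: number,
): RunResult {
  const kept = checkpoints.filter((c) => c.step < fromStep);
  const prior = kept.find((c) => c.step === fromStep - 1);
  const start = fromStep === 0 ? input : prior?.state;
  if (start === undefined) {
    throw new Error(`no checkpoint before step ${fromStep}`);
  }
  return execute(steps, start, fromStep, kept);
}
```

Steps before fromStep are never re-run; their saved state is reused, so a changed tool implementation can be checked without repeating earlier LLM calls.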

Troubleshooting

pnpm migrate fails to connect. The migration step needs Postgres running and reachable. Start the services first with docker compose up -d postgres redis, confirm Postgres is listening on :5432, and check that DATABASE_URL in your .env matches the compose configuration.

Runs stay in pending and never progress. With REDIS_URL set, runs are processed by the BullMQ worker. If nothing advances, Redis is usually not reachable on :6379 or no worker is running. Confirm the queue container is up and that pnpm dev started both the API and the worker. To run without Redis, leave REDIS_URL unset and runs execute in-process.

A run ends in budget_exceeded sooner than expected. Budgets are totals across the whole run, not per node. A swarm node with concurrency: 3 multiplies token consumption across its branches. Raise the relevant limit in .budget(...), set a perTool cap if one tool is the culprit, or reduce concurrency.

Tool calls throw a validation error. Arguments are parsed against the tool's Zod schema before the handler runs. The error names the failing field. Align the arguments the agent produces with the schema, or relax the schema if the field is genuinely optional.

The inspector shows no runs. The inspector reads the API on :4000. Confirm the API is running, that NEXT_PUBLIC_API_URL points at it, and that at least one run has been started.

I have no Postgres, Redis, or LLM key. Leave DATABASE_URL, REDIS_URL, and SARMALINK_API_KEY unset. The API uses an in-memory store, runs in-process, and the LLM adapter returns deterministic offline output. This is the same path the test suite uses.
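
The fallback rules above amount to a small decision table keyed on three environment variables. The sketch below shows one way to express it. The function resolveMode and the Mode type are hypothetical, and the real API may decide this differently or in several places.

```typescript
// Hypothetical sketch: pick backends from the environment, as described above.
// resolveMode() and Mode are illustrative names, not the project's API.
type Env = Record<string, string | undefined>;
type Mode = {
  store: "postgres" | "memory";
  queue: "bullmq" | "in-process";
  llm: "sarmalink" | "offline";
};

function resolveMode(env: Env): Mode {
  return {
    // An unset or empty variable selects the dependency-free fallback.
    store: env["DATABASE_URL"] ? "postgres" : "memory",
    queue: env["REDIS_URL"] ? "bullmq" : "in-process",
    llm: env["SARMALINK_API_KEY"] ? "sarmalink" : "offline",
  };
}
```

Each variable is independent in this sketch, so, for example, setting only DATABASE_URL gives durable state with in-process execution and offline LLM output.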

pnpm install warns about ignored build scripts. This is expected for native dependencies such as esbuild and sharp under pnpm. The build and tests do not require those scripts. Run pnpm approve-builds only if you need the native binaries.

Wiki pages

Architecture. Component diagram, run lifecycle, database schema, and the reasoning behind each piece.
Quick-Start. Clone, configure, run your first workflow, and replay it.
Graph-DSL. Node options, agent kinds, conditional edges, and budgets in detail.
Budgets-and-Tracing. How budgets are enforced, per-tool caps, MCP tools, and OpenTelemetry export.
Roadmap. What is shipped and what is next.

Repository

github.com/sarmakska/agent-orchestrator
