# Graph Runtime The **graph runtime** (`src/graph/`) is TinyAgents' durable, typed workflow engine — a LangGraph-style superstep executor where nodes read a committed state snapshot, return partial updates, and the runtime merges, routes, checkpoints, and (when needed) pauses for human input. It is the second of the five surfaces in the [Architecture](Architecture.md) and the layer that makes recursion *durable*: a graph node can embed another compiled graph, and a model standing inside a graph can author one. See [Recursion and RLM](Recursion-and-RLM.md) for the full recursive picture. The graph runtime owns **structure and durability**. It does not call models or tools itself — nodes call into the [Harness](Harness.md) when they need a model, a tool loop, or a sub-agent. This separation keeps execution deterministic, provider-neutral, and testable. ## Mental model A graph is a set of named **nodes** wired by **edges**, executed in **supersteps** over a typed **State**. Two virtual nodes bracket every run: - `START` (`"__start__"`) — the synthetic entry; `set_entry(node)` is sugar for `add_edge(START, node)`. - `END` (`"__end__"`) — the synthetic terminal; `set_finish(node)` is sugar for `add_edge(node, END)`. ```mermaid flowchart TD Start((START)) --> A[agent] A -->|needs_tool| T[tool] A -->|done| End((END)) T --> A ``` Each superstep takes the **active node set**, runs every active node against the *same committed snapshot*, folds their results through a **reducer** at the step boundary, persists a **checkpoint**, then selects the next active set. The loop stops when the active set empties, all branches reach `END`, an **interrupt** pauses the run, or the **recursion limit** is hit. ## Building a graph: `GraphBuilder` `GraphBuilder` (`graph::builder`) is generic over two types: - `State` — the committed graph state (must be `Clone + Send + Sync + 'static`). - `Update` — the partial-update type each node returns, merged into `State` by a [`StateReducer`](#typed-state-reducers-and-channels). For the common "each node returns the whole next state" case, use `GraphBuilder::::overwrite()`, which installs the `OverwriteStateReducer`: ```rust let graph = GraphBuilder::::overwrite() .add_node("agent", agent_node) .add_node("tool", tool_node) .set_entry("agent") .add_conditional_edges( "agent", |s: &AgentState| if s.needs_tool { "tool".into() } else { "done".into() }, [("tool", "tool"), ("done", END)], ) .add_edge("tool", "agent") .compile()?; ``` Key builder methods: | Method | Effect | | --- | --- | | `add_node(id, handler)` | Register an async node returning `Result>`. | | `add_edge(from, to)` | Static, unconditional edge. | | `set_entry(node)` / `set_finish(node)` | `START → node` / `node → END`. | | `add_conditional_edges(from, router, routes)` | Router closure → label table. | | `mark_command_routing(node)` | Node routes only via `Command.goto`. | | `set_reducer(r)` | Install a custom `StateReducer`. | | `with_recursion_limit(n)` | Cap supersteps (default **50**). | | `with_parallel(true)` | Run multi-node supersteps concurrently (see [fan-out](#parallel-fan-out)). | | `with_graph_id(id)` | Override the generated graph id. | | `compile()` | Validate topology and freeze into a `CompiledGraph`. | `compile()` enforces the invariants: an entry must exist, `START` cannot route straight to `END` and cannot be an edge target, `END` cannot be an edge source, a node cannot mix static and conditional edges, and a `mark_command_routing` node cannot also declare static/conditional edges. Failures return `TinyAgentsError::Validation`, `MissingStart`, `MissingNode`, etc. ## Nodes and `NodeContext` A node handler has the shape `Fn(State, NodeContext) -> impl Future>>`. It receives a **clone** of the committed state and a per-task `NodeContext`: - `node_id`, `run_id`, optional `thread_id`, and the 1-based `step`. - `resume: Option` — `None` on a normal run; set to the resume payload when an interrupted node is re-run (see [interrupts](#interrupts-and-resume)). - `fork: Option` — the branch identity when this node runs as one fork of a concurrent superstep (`None` in sequential mode or single-node steps). Nodes never mutate shared state directly; they return a `NodeResult`, and the runtime owns all state changes through the reducer. ## Node results: updates, commands, interrupts `NodeResult` (`graph::command`) is one of three outcomes: - `NodeResult::Update(update)` — a partial update merged through the reducer. - `NodeResult::Command(Command)` — combine an optional `update` with explicit routing and/or a resume value. - `NodeResult::Interrupt(Interrupt)` — pause for human-in-the-loop input. A `Command` has three orthogonal fields: ```rust pub struct Command { pub update: Option, // applied before routing pub goto: Vec, // overrides static/conditional edges pub resume: Option, // paired with an interrupt resume } ``` `goto` is the mechanism for **dynamic routing** and for **`Send`-style fan-out**: returning multiple targets schedules them all in the next superstep. Pair `goto` with `mark_command_routing(node)` so the compiler knows the node owns its own routing. ## Routing precedence At each step boundary the executor resolves the next targets for every completed node in this order (`route` in `graph::compiled`): 1. **Command `goto`** — explicit targets win. 2. **Static edge** — the single `add_edge` target. 3. **Conditional edge** — the router closure returns a label, resolved against the route table; an unknown label is `TinyAgentsError::MissingRoute`. 4. **Sink** — no routing configured: that branch ends. Targets equal to `END` are dropped from the next active set; when the set empties the run completes. ## Typed state, reducers, and channels Reducers (`graph::reducer`) decide how writes merge at the step boundary. Two traits exist: - `StateReducer` — merges a partial `Update` into the whole `State`. This is the contract the executor uses. - `Reducer` — merges two values of the same channel type (channel-style state, one merge policy per key). Built-in markers: `OverwriteStateReducer` / `OverwriteReducer` (last write wins), `AppendReducer` (concatenate vectors), `SetUnionReducer` (union, dedup, first-seen order), `MinReducer` / `MaxReducer`, and the closure-backed `ClosureStateReducer` / `ClosureReducer` for custom merges. Because the executor folds branch updates **in deterministic active-set index order**, a reducer is also the *fan-in / join* for parallel supersteps — the merged state is reproducible regardless of which branch finishes first. ## The superstep executor (`graph::compiled`) `compile()` produces a `CompiledGraph` — immutable, cheap to clone (heavy fields are `Arc`-shared), and safe to run concurrently. Three entry points run it: - `run(state)` — run to completion or to an interrupt, with **no thread** (no checkpoints are persisted even if a checkpointer is attached). - `run_with_thread(thread_id, state)` — run under a thread id, persisting a checkpoint at every superstep boundary when a checkpointer is configured. - `resume(thread_id, command)` — resume an interrupted run from its latest checkpoint. Each run returns a `GraphExecution`: | Field | Meaning | | --- | --- | | `state` | Final committed state. | | `visited` | Ordered list of executed nodes (may repeat across steps). | | `steps` | Number of supersteps executed. | | `interrupts` | Interrupts that paused the run (`is_interrupted()` is `!interrupts.is_empty()`). | | `status` | A compact `GraphRunStatus` snapshot. | | `checkpoint_id` | Latest persisted checkpoint id, if any. | Inside a step the executor: schedules the active nodes, runs each handler against its own state clone, folds results (`fold_result`) into updates / `goto` map / the lowest-index interrupt, applies the reducer at the boundary, persists a checkpoint, then selects the next active set. Exceeding the recursion limit is a deterministic `TinyAgentsError::RecursionLimit`. ### Parallel fan-out By default a step runs its active nodes **sequentially**, short-circuiting on the first error or interrupt. Compile with `with_parallel(true)` and a step with more than one active node runs every branch concurrently via `futures::future::join_all`. Each branch executes on its own cloned snapshot and a distinct `ForkId { branch_index, node }`. All branches are driven to completion *before* the boundary; results are then folded in active-set index order, so: - the reducer resolves conflicting writes deterministically (lower index first); - the lowest-index branch that errors or interrupts is the step's terminal outcome — an error aborts the run, an interrupt persists a checkpoint whose pending nodes are that branch and every later active node; - because branches never share mutable state, concurrency is data-race free. ## Run status snapshots (`graph::status`) `GraphRunStatus` is a compact, observable summary at an execution boundary — not a checkpoint. It answers "is this run active?", "which node is executing?", "which interrupt is waiting?" without deserializing full state. Crucially for recursion it carries `root_run_id`, `parent_run_id`, and `checkpoint_namespace`, so nested subgraph/sub-agent runs roll up into a single observable run tree. ## Checkpoints, durability, and time travel (`graph::checkpoint`) Checkpoints are graph-runtime persistence, separate from harness memory. They are written **at superstep boundaries only — never mid-node** — because re-running a node from its start is far easier to reason about than suspending an async Rust stack, and it matches interrupt/resume exactly. A `Checkpoint` records the `thread_id` lineage, this `checkpoint_id` and its `parent_checkpoint_id`, the `namespace` (for nested subgraphs), the committed `state`, the `next_nodes` to run on resume, the `completed_tasks`, any `pending_writes`, pending `interrupts`, and free-form `metadata` (`source`, `step`). `CheckpointMetadata` is the lightweight listing form that avoids deserializing full state. The `Checkpointer` trait (`put` / `get` / `list`) abstracts storage; `InMemoryCheckpointer` ships for tests and single-process runs. Attach one with `with_checkpointer(...)`. Because every boundary is a parent-linked checkpoint, the lineage is a tree you can list, inspect, and resume from any point — this is the foundation for **time travel**: replay or fork a run from an earlier checkpoint by resuming the desired `thread_id` / checkpoint. ## Interrupts and resume An `Interrupt { id, node, payload }` is a human-in-the-loop pause. Interrupts **require a checkpointer**. When a node returns `NodeResult::Interrupt`, the executor applies the updates collected so far, persists a checkpoint whose `next_nodes` are the interrupted node plus every later active node, and returns control with `GraphExecution::is_interrupted() == true`. To continue, call `resume(thread_id, Command { resume: Some(value), .. })`. The executor loads the latest checkpoint, re-runs the pending nodes with `NodeContext::resume` set to your value, and proceeds. Resume without a checkpointer, or with no pending nodes, returns `TinyAgentsError::Resume`. ## Subgraphs: graph-level recursion (`graph::subgraph`) A `CompiledGraph` can be embedded as a node in a parent graph — this is the **graph-level recursion mechanism**, the structural counterpart to the harness' sub-agents. Two adapters wrap a child graph into a node handler: - `shared_subgraph_node(child)` — parent and child share the same `State`/`Update` channel. The child runs over the parent's state; its final state becomes the parent update. - `adapter_subgraph_node(child, to_child, from_child)` — parent and child use *different* state shapes. `to_child` projects parent state into the child input; `from_child` folds the child's final state back into a parent update. Both adapters append the embedding node id to the child's checkpoint **namespace**, so parent and child checkpoint ids never collide — nested runs stay independently inspectable. Combined with `with_recursion_limit`, subgraphs let a graph run a graph (and, transitively, a model author a graph that runs inside the graph it is executing in). The module spec (`docs/modules/graph/subagents-recursion.md`) extends this with a graph-level `SubAgentNode` / `RecursionPolicy` design — depth tracking, `root_run_id` / `parent_run_id` preservation, and child usage/cost rollups — for embedding a *harness agent* as a graph node. See [Recursion and RLM](Recursion-and-RLM.md). ## Streaming and events (`graph::stream`) Attach a `GraphEventSink` with `with_event_sink(...)` to observe execution. The executor emits low-level `GraphEvent`s: `StepStarted` / `StepCompleted`, `TaskScheduled`, `NodeStarted` / `NodeCompleted` / `NodeFailed`, `StateUpdated`, `RouteSelected`, `CheckpointSaved`, and `InterruptEmitted`. `NoopSink` discards them; `CollectingSink` buffers them for assertions in tests. `StreamMode` mirrors LangGraph's high-level projections — `Values`, `Updates`, `Messages`, `Debug`, `Interrupts`, `Custom` — as a selection enum over the same event stream. ## Topology export and visualization (`graph::export`) A `GraphTopology` is a serializable, behavior-free description of a graph's structure: `graph_id`, `entry`, `recursion_limit`, `parallel`, `nodes` (`NodeInfo`), `edges` (`EdgeInfo`), `conditional_edges` (`ConditionalEdgeInfo` / `RouteInfo`), `finish_nodes`, and `channels` (`ChannelInfo`). All collections are sorted, so exports are deterministic regardless of `HashMap` order. Extract one with `CompiledGraph::topology()` or `GraphBuilder::topology()`, or from a `.rag` `Blueprint` via `blueprint_to_topology`. Render it with: - `to_json` / `from_json` — round-trippable JSON for snapshots and tooling. - `to_mermaid` — a Mermaid diagram for docs and UIs. - `blueprint_to_json` / `blueprint_to_mermaid` — straight from a blueprint. This makes a graph — whether hand-written or agent-authored — inspectable before it ever runs. ## The harness boundary ```mermaid flowchart LR GraphNode[Graph Node] --> Harness[AgentHarness] Harness --> Model[ChatModel] Harness --> Tool[ToolRegistry] Harness --> Events[EventSink] ``` A node may call an `AgentHarness`, but the graph does not know or care whether the harness uses OpenAI, Anthropic, Ollama, a mock model, a tool loop, or a sub-agent. Nondeterministic work stays inside nodes; routing, durability, and topology stay explicit and inspectable. ## When to use a graph Reach for the graph runtime when a workflow needs explicit state transitions, branch routing, guarded loops, checkpointing/resume, inspectable topology, deterministic tests around model calls, sub-agent orchestration, or human review points. For a single model call, use the [Harness](Harness.md) directly. Both `.rag` and `.ragsh` (see [Recursion and RLM](Recursion-and-RLM.md)) lower into exactly these graph types — a model can build the same workflow you can. ## See also - [Architecture](Architecture.md) — the five surfaces and how they compose. - [Registry](Registry.md) — names that `.rag`/`.ragsh` bind, including graphs. - [Recursion and RLM](Recursion-and-RLM.md) — subgraphs, sub-agents, and self-authoring as one recursive model. - Module spec: `docs/modules/graph/README.md` and its sub-pages.