# Harness

The **harness** (`src/harness/`) is the orchestration layer around language-model
calls — and, in TinyAgents' recursive-language-model (RLM) framing, it is the
*substrate that makes recursion observable*. Every model call, tool call,
sub-agent run, and graph node ultimately bottoms out in a harness agent loop, so
the harness is where parent/child run identity, usage roll-ups, depth limits, and
event streams are tracked. When an agent calls another agent (a model calling a
model), it is one harness invoking another harness one level deeper in the
recursion tree.

This page is a developer deep-dive. Each section is grounded in real module and
type names under `src/harness/` and links to the matching design note under
[`docs/modules/harness/`](../docs/modules/harness/README.md).

> **Background on the RLM lineage** — see the
> [Recursive Language Models](https://alexzhang13.github.io/blog/2025/rlm/) blog
> and paper (Zhang, Kraska, Khattab, MIT CSAIL, 2025,
> [arXiv:2512.24601](https://arxiv.org/abs/2512.24601)). TinyAgents is
> *inspired by and architected around* that execution model — sub-model /
> sub-agent / sub-graph calls as functions, persistent session values, depth
> tracking, and trajectory logging — not a reimplementation of the paper.

## Module map

| Module (`src/harness/…`) | Role | Design note |
| --- | --- | --- |
| `agent_loop` | Default model→tool→model loop | — |
| `runtime` | `AgentHarness` facade + `RunPolicy` | — |
| `context` | `RunConfig`, `RunContext`, depth/limit tracking | [context.md](../docs/modules/harness/context.md) |
| `model` | Provider-neutral `ChatModel`, requests, responses, streams | [model.md](../docs/modules/harness/model.md) |
| `providers` | Feature-gated provider adapters (`MockModel`, `OpenAiModel`) | — |
| `tool` | Typed tool trait, schemas, registry | [tool.md](../docs/modules/harness/tool.md) |
| `middleware` | before/after hooks around agent, model, tool | [middleware.md](../docs/modules/harness/middleware.md) |
| `structured` | Typed/JSON-schema response extraction | [structured-output.md](../docs/modules/harness/structured-output.md) |
| `stream` / model streaming | Token & event streaming | [streaming.md](../docs/modules/harness/streaming.md) |
| `usage` | Token accounting | [usage.md](../docs/modules/harness/usage.md) |
| `cost` | Pricing + cost roll-ups | [cost.md](../docs/modules/harness/cost.md) |
| `limits` / `retry` | Caps, timeouts, backoff, fallback, rate limit | [limits-retry.md](../docs/modules/harness/limits-retry.md) |
| `cache` | Local response cache + prompt-cache layout | [cache.md](../docs/modules/harness/cache.md) |
| `memory` | Short-term thread memory / chat history | — |
| `embeddings` | Embedding models, vector stores, retrievers | [embeddings.md](../docs/modules/harness/embeddings.md) |
| `store` | Pluggable persistence backends | [store.md](../docs/modules/harness/store.md) |
| `events` | Typed event stream + run status | [observability.md](../docs/modules/harness/observability.md) |
| `subagent` | Agents-as-tools, reusable sessions | [subagent-steering.md](../docs/modules/harness/subagent-steering.md) |
| `steering` | Typed runtime control of running agents | [subagent-steering.md](../docs/modules/harness/subagent-steering.md) |
| `summarization` | Context-window-aware compaction | [summarization.md](../docs/modules/harness/summarization.md) |
| `cancel` | Cooperative `CancellationToken` | — |
| `testkit` | Fakes, recorders, trajectory asserts | [testkit.md](../docs/modules/harness/testkit.md) |

The harness deliberately does **not** depend on the graph module: you can call a
model or run a tool loop without constructing a graph. The graph runtime depends
on harness traits, not the other way around.

## The agent loop (`agent_loop`)

The default loop is implemented as inherent methods on
`AgentHarness<State, Ctx>` (`src/harness/agent_loop/mod.rs`). The canonical entry
points are:

- `invoke(state, ctx_data, config, input)` — full control over run identity and
  context data.
- `invoke_default(state, input)` — convenience wrapper that builds a default
  `RunConfig`.
- `invoke_in_context(state, ctx, input)` — run *inside an existing*
  `RunContext`, which is how nested/child runs inherit parent identity.
- `invoke_streaming*` variants — same loop, emitting `ModelStreamItem`s.
- `*_with_status` variants — return the `AgentRun` plus a `HarnessRunStatus`.

The lifecycle (model→tool→model) is:

```text
input messages
  -> build RunContext, emit RunStarted
  -> before_agent middleware
  -> loop:
       enforce model-call cap + wall-clock deadline (fail-closed)
       build ModelRequest (messages + tool schemas + default response format)
       before_model middleware; emit ModelStarted
       resolve + invoke model (with retry + fallback)
       after_model middleware; emit ModelCompleted; fold Usage into AgentRun
       append assistant message
       if tool calls -> enforce tool cap, before_tool, run tools,
                        after_tool, append tool results, continue
       else -> extract structured output (if configured) and break
  -> after_agent middleware; emit RunCompleted
```

On error the loop emits `RunFailed`, fans the error through `on_error`
middleware, and returns it. The loop is intentionally **sleep-free**: retry
backoff durations are *computed* via `RetryPolicy::backoff_for_attempt`, but the
loop never sleeps on them, which keeps tests deterministic.

The accumulated result is `AgentRun` (`harness::middleware::AgentRun`), holding
the final messages, folded `Usage`, and any extracted structured response.
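
The lifecycle above can be sketched without the crate. The following is a minimal, synchronous illustration, assuming simplified stand-in types (`Message`, `Caps`, `LoopError`, `run_loop`, `scripted_model`, and `add_tool` are all hypothetical names, not the TinyAgents API). It shows only the fail-closed caps and the model→tool→model cycle; middleware, events, retry, and structured output are left out.

```rust
// Hypothetical, simplified types: these are NOT the TinyAgents API.
#[derive(Debug, Clone, PartialEq)]
enum Message {
    User(String),
    Assistant(String),
    ToolCall { name: String, arguments: String },
    ToolResult(String),
}

#[derive(Debug, PartialEq)]
enum LoopError {
    ModelCallLimit,
    ToolCallLimit,
}

struct Caps {
    max_model_calls: usize,
    max_tool_calls: usize,
}

/// Runs model -> tool -> model until the model answers without a tool call.
/// `model` maps the transcript to the next assistant turn; `tool` runs one call.
fn run_loop(
    input: &str,
    caps: &Caps,
    model: impl Fn(&[Message]) -> Message,
    tool: impl Fn(&str, &str) -> String,
) -> Result<Vec<Message>, LoopError> {
    let mut messages = vec![Message::User(input.to_string())];
    let (mut model_calls, mut tool_calls) = (0usize, 0usize);
    loop {
        // Fail closed: check the cap before spending another model call.
        if model_calls >= caps.max_model_calls {
            return Err(LoopError::ModelCallLimit);
        }
        model_calls += 1;
        let reply = model(&messages);
        messages.push(reply.clone());
        match reply {
            Message::ToolCall { name, arguments } => {
                if tool_calls >= caps.max_tool_calls {
                    return Err(LoopError::ToolCallLimit);
                }
                tool_calls += 1;
                messages.push(Message::ToolResult(tool(&name, &arguments)));
            }
            // Any reply that is not a tool call ends the run.
            _ => return Ok(messages),
        }
    }
}

/// Scripted model: request the `add` tool once, then answer with its result.
fn scripted_model(messages: &[Message]) -> Message {
    match messages.last() {
        Some(Message::ToolResult(out)) => Message::Assistant(format!("The sum is {out}")),
        _ => Message::ToolCall { name: "add".into(), arguments: "2,3".into() },
    }
}

/// Toy tool: sums comma-separated integers, ignoring anything unparsable.
fn add_tool(_name: &str, arguments: &str) -> String {
    let sum: i64 = arguments.split(',').filter_map(|n| n.trim().parse::<i64>().ok()).sum();
    sum.to_string()
}
```

With `max_model_calls: 4`, `run_loop("add 2 and 3", &caps, scripted_model, add_tool)` yields a four-message transcript (user, tool call, tool result, assistant). With `max_model_calls: 1` the second model call is refused and the run fails with `LoopError::ModelCallLimit`.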

## Provider-neutral model calls (`model`, `providers`)

All model invocation goes through one trait (`harness::model::ChatModel<State>`)
with `invoke` (unary) and `stream` (incremental) methods. A call is described by
`ModelRequest` (built fluently with `ModelRequest::new(messages).with_tools(…)
.with_tool_choice(…).with_response_format(…).with_model_hint(…)
.with_required_capabilities(…)`) and answered by `ModelResponse`, which carries
the assistant message, `Usage`, finish reason, and a `ResolvedModel` recording
*which* concrete provider/model actually served the call.

Model selection is explicit and reusable. `ModelRegistry` resolves a call through
`ModelSelection` / `ModelHint` against a `CapabilitySet` (see `ModelProfile`,
`Modalities`, `ModelStatus`), and records the choice as a `ResolvedModel` with a
`ModelResolutionSource`. This lets a parent and its sub-agents reason about model
identity consistently across the recursion tree.

`providers` holds the adapters:

- `MockModel` and the other `providers` test models (`echo`, `constant`, scripted
  responses), which make up the default **offline** build.
- `OpenAiModel` (behind the `openai` Cargo feature). Despite the name it speaks
  the OpenAI Chat Completions wire format to *many* hosts via `ProviderSpec` /
  `ProviderKind`: constructors include `deepseek`, `anthropic`, `groq`, `xai`,
  `openrouter`, `together`, `mistral`, `ollama`, and `compatible(...)` for any
  OpenAI-compatible endpoint, plus `from_env` / `from_spec_env`.

## Typed tools (`tool`)

Tools implement `harness::tool::Tool<State>` with `name`, `description`,
`schema() -> ToolSchema`, and an async `call(state, ctx, call) -> Result<ToolResult>`.
`ToolCall` carries an id, name, and JSON `arguments`; `ToolResult` is built with
`ToolResult::text(...)` / `ToolResult::error(...)` and exposes `is_error()`.
`ToolRegistry` keys tools by name and is consulted by the loop when an assistant
message requests a tool. `ToolDelta` carries incremental tool progress for
streaming.

Crucially, a **whole agent can be a tool** — see [Sub-agents](#sub-agents-agents-as-tools-subagent).
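
To make the name-keyed dispatch concrete, here is a minimal synchronous sketch. It assumes simplified stand-ins: the real `Tool` trait is async and generic over state, and `Uppercase`, `register`, and `dispatch` are hypothetical names introduced only for this illustration.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins; the real `Tool` trait is async and generic over state.
struct ToolCall {
    id: String,
    name: String,
    arguments: String,
}

#[derive(Debug, PartialEq)]
enum ToolResult {
    Text(String),
    Error(String),
}

impl ToolResult {
    fn is_error(&self) -> bool {
        matches!(self, ToolResult::Error(_))
    }
}

trait Tool {
    fn name(&self) -> &str;
    fn description(&self) -> &str;
    fn call(&self, call: &ToolCall) -> ToolResult;
}

struct Uppercase;

impl Tool for Uppercase {
    fn name(&self) -> &str {
        "uppercase"
    }
    fn description(&self) -> &str {
        "Upper-cases the argument string."
    }
    fn call(&self, call: &ToolCall) -> ToolResult {
        ToolResult::Text(call.arguments.to_uppercase())
    }
}

#[derive(Default)]
struct ToolRegistry {
    tools: HashMap<String, Box<dyn Tool>>,
}

impl ToolRegistry {
    fn register(&mut self, tool: Box<dyn Tool>) {
        self.tools.insert(tool.name().to_string(), tool);
    }

    /// Looks the tool up by name. An unknown name becomes an error result
    /// rather than a panic, so the caller can report it back to the model.
    fn dispatch(&self, call: &ToolCall) -> ToolResult {
        match self.tools.get(&call.name) {
            Some(tool) => tool.call(call),
            None => ToolResult::Error(format!("unknown tool `{}` (call {})", call.name, call.id)),
        }
    }
}
```

Returning an error value for an unknown tool, instead of failing the run, is a design choice of this sketch; whether the real registry behaves the same way is not stated here.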

## Middleware hooks (`middleware`)

`harness::middleware::Middleware<State, Ctx>` exposes the cross-cutting hooks the
loop fans out to: `before_agent` / `after_agent`, `before_model` /
`on_model_delta` / `after_model`, `before_tool` / `on_tool_delta` /
`after_tool`, and `on_error`. They are ordered through a `MiddlewareStack`
(`before_*` in registration order, `after_*` in reverse). `HookCounts` records
how often each hook fired. Built-in middleware lives here too:
`LoggingMiddleware`, `MessageTrimMiddleware`, `ContextCompressionMiddleware`,
`PromptCacheGuardMiddleware`, and `UsageAccountingMiddleware`.
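
The ordering rule (`before_*` in registration order, `after_*` in reverse) is the usual "onion" layering. A minimal sketch, assuming a hypothetical synchronous `Middleware` trait with only two hooks and a log vector in place of real side effects:

```rust
// Hypothetical sketch of hook ordering; not the real `MiddlewareStack`.
trait Middleware {
    fn name(&self) -> &'static str;
    fn before_model(&self, log: &mut Vec<String>) {
        log.push(format!("before:{}", self.name()));
    }
    fn after_model(&self, log: &mut Vec<String>) {
        log.push(format!("after:{}", self.name()));
    }
}

struct Named(&'static str);

impl Middleware for Named {
    fn name(&self) -> &'static str {
        self.0
    }
}

#[derive(Default)]
struct MiddlewareStack {
    layers: Vec<Box<dyn Middleware>>,
}

impl MiddlewareStack {
    fn push(mut self, layer: impl Middleware + 'static) -> Self {
        self.layers.push(Box::new(layer));
        self
    }

    /// Wraps one model call: `before_*` in registration order, `after_*` reversed.
    fn around_model(&self, log: &mut Vec<String>) {
        for layer in &self.layers {
            layer.before_model(log);
        }
        log.push("model".to_string());
        for layer in self.layers.iter().rev() {
            layer.after_model(log);
        }
    }
}
```

Registering `Named("logging")` and then `Named("usage")` produces the log `before:logging`, `before:usage`, `model`, `after:usage`, `after:logging`: the first-registered layer is outermost.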

## Structured output (`structured`)

`StructuredExtractor` extracts a typed value from the final `ModelResponse`
according to a `StructuredStrategy` (`StructuredStrategy::for_profile(...)`
chooses a provider-appropriate strategy). `response_format_for_strategy(...)`
maps a strategy onto a `ResponseFormat` (`Text`, `json_schema`, provider-native,
or a tool-call strategy). The parsed value lands in `StructuredOutput`, which
exposes `as_value()` and `parse::<T>()`. Set
`RunPolicy::default_response_format` to attach a schema to every model request in
a run; the loop runs extraction automatically before completing.

## Streaming (`stream`, model streaming)

Two layers cooperate. At the provider edge, `ChatModel::stream` yields a
`ModelStream` of `ModelStreamItem`s; `StreamAccumulator` (or the
`collect_model_stream` helper) folds them back into a `ModelResponse`. At the
harness edge, `harness::stream` exposes a `StreamSink` over `StreamChunk`s with
selectable `StreamMode`s (messages, tools, updates, events, final). The
`invoke_streaming*` agent-loop methods surface model deltas as the loop runs.
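
Folding a stream back into a single response is a plain reduction over the items. The sketch below assumes a hypothetical three-variant `StreamItem` and a flat `Response` struct; the real `ModelStreamItem` and `ModelResponse` carry more than this, and `collect_stream` is an illustrative name, not the crate's helper.

```rust
// Hypothetical stream items; the real `ModelStreamItem` carries more variants.
#[derive(Debug, Clone, PartialEq)]
enum StreamItem {
    TextDelta(String),
    Usage { input_tokens: u64, output_tokens: u64 },
    Finished,
}

#[derive(Debug, Default, PartialEq)]
struct Response {
    text: String,
    input_tokens: u64,
    output_tokens: u64,
    finished: bool,
}

#[derive(Default)]
struct Accumulator {
    response: Response,
}

impl Accumulator {
    /// Applies one incremental item to the partially built response.
    fn push(&mut self, item: StreamItem) {
        match item {
            StreamItem::TextDelta(delta) => self.response.text.push_str(&delta),
            StreamItem::Usage { input_tokens, output_tokens } => {
                self.response.input_tokens += input_tokens;
                self.response.output_tokens += output_tokens;
            }
            StreamItem::Finished => self.response.finished = true,
        }
    }

    fn finish(self) -> Response {
        self.response
    }
}

/// Convenience wrapper: drain a whole stream through the accumulator.
fn collect_stream(items: impl IntoIterator<Item = StreamItem>) -> Response {
    let mut acc = Accumulator::default();
    for item in items {
        acc.push(item);
    }
    acc.finish()
}
```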

## Usage and cost (`usage`, `cost`)

`Usage` records `input_tokens` / `output_tokens` (plus cached-token bookkeeping)
and exposes `effective_total()`. The loop folds each call's `Usage` into the `AgentRun`,
and `UsageTotals` aggregates across calls. `cost::estimate_cost(pricing, usage)`
turns a `ModelPricing` + `Usage` into `CostTotals`. **This is where recursion
becomes measurable:** because a child run executes through the parent's
`RunContext`/event sink, child usage and cost roll up into the parent's totals,
so an orchestrator can see the full cost of the subtree it spawned.
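
The roll-up can be pictured as a fold over the recursion tree. This is a minimal sketch under stated assumptions: `RunNode`, `fold`, `subtree_usage`, and `estimate_cost_usd` are hypothetical, the `Usage` here has only two fields, and prices are assumed to be quoted in USD per million tokens. The real `estimate_cost` returns `CostTotals` rather than a bare number.

```rust
// Hypothetical shapes: field names and pricing units are assumptions.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
}

impl Usage {
    fn fold(&mut self, other: Usage) {
        self.input_tokens += other.input_tokens;
        self.output_tokens += other.output_tokens;
    }
}

/// Prices in USD per one million tokens (an assumed unit).
struct ModelPricing {
    input_per_million: f64,
    output_per_million: f64,
}

fn estimate_cost_usd(pricing: &ModelPricing, usage: &Usage) -> f64 {
    (usage.input_tokens as f64 * pricing.input_per_million
        + usage.output_tokens as f64 * pricing.output_per_million)
        / 1_000_000.0
}

/// One run in the recursion tree: its own usage plus the runs it spawned.
struct RunNode {
    own: Usage,
    children: Vec<RunNode>,
}

impl RunNode {
    /// Usage of this run and every descendant, i.e. what the parent sees.
    fn subtree_usage(&self) -> Usage {
        let mut total = self.own;
        for child in &self.children {
            total.fold(child.subtree_usage());
        }
        total
    }
}
```

The sketch prices the whole subtree with one `ModelPricing`. If children run on different models, each node would need its own pricing before the totals are summed.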

## Limits, retry, fallback, rate limiting (`limits`, `retry`)

`RunLimits` (built via `with_max_model_calls`, `with_max_tool_calls`,
`with_max_wall_clock_ms`, `with_max_retries_per_call`, `with_max_concurrency`,
and **`with_max_depth`**) is enforced fail-closed by a `LimitTracker` and by the
loop itself. Reaching a cap returns `TinyAgentsError::LimitExceeded`; exceeding
the wall-clock deadline returns `TinyAgentsError::Timeout`.

`max_depth` is the **recursion limit**: it bounds how deep nested sub-agents (or
subgraphs) may recurse. An invocation whose child depth would exceed the cap
fails with `TinyAgentsError::SubAgentDepth`.

`retry` provides `RetryPolicy` (attempts, backoff, multiplier, jitter), the
`is_retryable(err)` classifier (only network/timeout/rate-limit/5xx by default),
`FallbackPolicy` (an ordered model fallback chain — `next_after(current)`), and a
token-bucket `RateLimiter`.
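
A sleep-free backoff calculation and a fallback chain can be sketched as pure functions. Everything below is an assumption made for illustration: the `CallError` enum, the `RetryPolicy` field names, the 1-based attempt numbering, and the absence of jitter are not taken from the crate, and the token-bucket limiter is omitted.

```rust
use std::time::Duration;

// Hypothetical shapes; field names and the error enum are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CallError {
    Network,
    Timeout,
    RateLimited,
    Server(u16),
    InvalidRequest,
}

/// Only transient failures are worth retrying.
fn is_retryable(err: CallError) -> bool {
    match err {
        CallError::Network | CallError::Timeout | CallError::RateLimited => true,
        CallError::Server(status) => (500..600).contains(&status),
        CallError::InvalidRequest => false,
    }
}

struct RetryPolicy {
    max_attempts: u32,
    base: Duration,
    multiplier: u32,
    max_backoff: Duration,
}

impl RetryPolicy {
    /// Delay before retry number `attempt` (1-based), or `None` once the
    /// attempt budget is spent. Pure: nothing sleeps here. Jitter is omitted.
    fn backoff_for_attempt(&self, attempt: u32) -> Option<Duration> {
        if attempt == 0 || attempt >= self.max_attempts {
            return None;
        }
        let factor = self.multiplier.saturating_pow(attempt - 1);
        Some(self.base.saturating_mul(factor).min(self.max_backoff))
    }
}

struct FallbackPolicy {
    chain: Vec<String>,
}

impl FallbackPolicy {
    /// The model to try after `current`, if the chain has one.
    fn next_after(&self, current: &str) -> Option<&str> {
        let idx = self.chain.iter().position(|m| m.as_str() == current)?;
        self.chain.get(idx + 1).map(String::as_str)
    }
}
```

Because `backoff_for_attempt` only returns a `Duration`, a test can assert on the schedule directly, and a production caller decides whether and how to wait.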

## Cache (`cache`)

`cache` separates two distinct ideas:

- **Local response cache** — `ResponseCache` (with `InMemoryResponseCache`),
  keyed by `cache_key(request)`. Attach it with
  `AgentHarness::with_response_cache`; the loop then checks it before each
  provider call and stores successful responses, emitting `cache.hit` /
  `cache.miss` events. Because the cache is owned by the harness, a repeated
  identical request — even across separate runs — can be served from an earlier
  result.
- **Provider prompt/KV-cache layout** — `PromptCacheLayout` (and
  `CacheLayoutEvent`) make the stable prompt prefix explicit so middleware that
  edits model-visible prompt segments can be detected
  (`is_prefix_stable_against`). `thread_id` should be propagated through parent
  agents, sub-agents, subgraphs, and nested harness calls so provider caches see
  one stable logical conversation. When a backend requires explicit cache
  attribution (Fireworks-style prefix caching, for example), provider adapters
  may map that stable identity into the required cache/user headers.

`CachePolicy` gates both behaviors.
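
Both ideas fit in a few lines of standard-library Rust. The sketch below is illustrative only: `Request`, `get_or_call`, the `hits` / `misses` counters, and `prefix_is_stable` are hypothetical, and hashing with `DefaultHasher` is an assumption about how a key could be derived, not a statement of how `cache_key` works.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical request shape; the real `ModelRequest` has more fields.
#[derive(Hash)]
struct Request {
    model: String,
    messages: Vec<String>,
}

/// Identical requests hash to the same key within one process.
fn cache_key(request: &Request) -> u64 {
    let mut hasher = DefaultHasher::new();
    request.hash(&mut hasher);
    hasher.finish()
}

#[derive(Default)]
struct InMemoryResponseCache {
    entries: HashMap<u64, String>,
    hits: u32,
    misses: u32,
}

impl InMemoryResponseCache {
    /// Checks the cache first and only calls the provider on a miss.
    fn get_or_call(&mut self, request: &Request, provider: impl FnOnce(&Request) -> String) -> String {
        let key = cache_key(request);
        if let Some(hit) = self.entries.get(&key).cloned() {
            self.hits += 1;
            return hit;
        }
        self.misses += 1;
        let response = provider(request);
        self.entries.insert(key, response.clone());
        response
    }
}

/// True when the first `prefix_len` prompt segments are unchanged.
fn prefix_is_stable(before: &[String], after: &[String], prefix_len: usize) -> bool {
    before.len() >= prefix_len
        && after.len() >= prefix_len
        && before[..prefix_len] == after[..prefix_len]
}
```

A second `get_or_call` with an identical request never reaches the provider closure. `prefix_is_stable` shows the check a guard can run after middleware has edited the prompt: if the stable prefix changed, a provider-side prefix cache would likely miss.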

## Memory, embeddings, retrieval (`memory`, `embeddings`)

`memory` owns conversation continuity: the `ChatHistory` trait
(`InMemoryChatHistory`, `StoreChatHistory<S>`) and `ShortTermMemory<H>`, which is
loaded before a loop and saved after, optionally trimmed. `MemoryScope`
distinguishes short- vs long-term data.

`embeddings` provides retrieval-augmented context: the `EmbeddingModel` trait
(`MockEmbeddingModel`; `OpenAiEmbeddingModel` behind the feature), a `VectorStore`
trait (`InMemoryVectorStore`), `cosine_similarity`, and a `Retriever` whose
`index(docs)` / `retrieve(query, top_k) -> Vec<ScoredDoc>` close the loop. In the
RLM framing, retrieval is how an agent pulls *snippets* of a large external
environment into context instead of stuffing the whole thing into one window.
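
The retrieval step reduces to scoring stored vectors against a query and keeping the best `top_k`. A minimal sketch, assuming documents are already embedded and identified by their index; this `ScoredDoc` and `InMemoryVectorStore` are simplified stand-ins rather than the crate's types.

```rust
/// Cosine similarity of two vectors; zero-length vectors score 0.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

// Hypothetical stand-ins: documents are identified by their index.
#[derive(Debug, Clone, PartialEq)]
struct ScoredDoc {
    id: usize,
    score: f32,
}

struct InMemoryVectorStore {
    vectors: Vec<Vec<f32>>,
}

impl InMemoryVectorStore {
    /// Scores every stored vector against the query and keeps the best `top_k`.
    fn retrieve(&self, query: &[f32], top_k: usize) -> Vec<ScoredDoc> {
        let mut scored: Vec<ScoredDoc> = self
            .vectors
            .iter()
            .enumerate()
            .map(|(id, v)| ScoredDoc { id, score: cosine_similarity(query, v) })
            .collect();
        // Highest score first; `total_cmp` gives a total order over floats.
        scored.sort_by(|a, b| b.score.total_cmp(&a.score));
        scored.truncate(top_k);
        scored
    }
}
```

A linear scan like this is fine for small in-memory stores. Larger corpora usually call for an approximate nearest-neighbour index behind the same interface.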

## Sub-agents (agents as tools) (`subagent`)

This is the core recursive surface. A `SubAgent<State, Ctx>` wraps an
`Arc<AgentHarness>` plus a stable `name`, `description`, and optional
`system_prompt`. Invoking it always produces a **child run one level deeper** in
the recursion tree:

- `SubAgent::invoke` / `invoke_with_events` run a fresh child loop.
- `SubAgent::invoke_in_parent(state, ctx_data, parent, input)` threads the *live*
  parent `RunContext`, so the child derives its depth from the parent and shares
  its event sink: the child runs at `parent.depth() + 1`, and its events (and
  usage) surface on the parent's stream. It emits `SubAgentStarted` /
  `SubAgentCompleted` around the child loop, making the recursion observable.
- `SubAgentTool` adapts a sub-agent into a `Tool`, so a parent model can call an
  entire agent the same way it calls any tool — **a model calling a model**. The
  child input is read from the `SUBAGENT_INPUT_FIELD` (`"input"`) argument. Depth
  is fixed at construction (`with_parent_depth`) because the `Tool` trait gives
  `call` no live parent context.
- When a sub-agent invoked through `SubAgentTool` hits a child run limit
  (`max_model_calls`, `max_tool_calls`, wall-clock timeout, or depth), the tool
  returns an error `ToolResult` telling the parent orchestrator the delegated
  agent hit its limit. The parent run can continue and decide whether to narrow,
  split, retry, or report the task instead of mistaking the child limit for a
  completed answer.
- `SubAgentSession` keeps the *same* sub-agent alive across turns, accumulating a
  transcript (post-completion reuse, e.g. human-in-the-loop). Each reuse emits
  `SubAgentReused`.

Depth is bounded by `RunLimits::max_depth`; overrunning it yields
`TinyAgentsError::SubAgentDepth`. The graph analogue is `graph::subgraph`
(`SubAgentNode`), where a node embeds another compiled graph with the same depth
tracking.
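
Depth tracking through a shared event sink can be illustrated with a toy recursion. The sketch assumes the root run sits at depth 0 and that the depth check happens in the parent before the child starts; `Agent`, `Events`, `RunError`, and `chain` are hypothetical names, and the event labels are plain strings rather than `AgentEvent` values.

```rust
// Hypothetical model of depth tracking; not the real `SubAgent` / `RunContext`.
#[derive(Debug, PartialEq)]
enum RunError {
    SubAgentDepth { requested: usize, max_depth: usize },
}

/// Shared sink: every event is tagged with the depth it happened at.
type Events = Vec<(usize, String)>;

struct Agent {
    name: &'static str,
    child: Option<Box<Agent>>,
}

impl Agent {
    /// Runs this agent at `depth`; assumes the root run is depth 0.
    fn invoke(&self, depth: usize, max_depth: usize, events: &mut Events) -> Result<String, RunError> {
        events.push((depth, format!("RunStarted:{}", self.name)));
        let answer = match &self.child {
            Some(child) => {
                // Refuse before spawning if the child would sit past the cap.
                let child_depth = depth + 1;
                if child_depth > max_depth {
                    return Err(RunError::SubAgentDepth { requested: child_depth, max_depth });
                }
                events.push((depth, format!("SubAgentStarted:{}", child.name)));
                let inner = child.invoke(child_depth, max_depth, events)?;
                events.push((depth, format!("SubAgentCompleted:{}", child.name)));
                format!("{}({})", self.name, inner)
            }
            None => self.name.to_string(),
        };
        events.push((depth, format!("RunCompleted:{}", self.name)));
        Ok(answer)
    }
}

/// Builds a linear delegation chain, e.g. root -> worker -> leaf.
fn chain(names: &[&'static str]) -> Option<Box<Agent>> {
    let (first, rest) = names.split_first()?;
    Some(Box::new(Agent { name: *first, child: chain(rest) }))
}
```

For the chain `root -> worker -> leaf`, a cap of 2 lets the whole tree run and leaves ten depth-tagged events in the shared sink. With a cap of 1, `worker` is refused when it tries to start `leaf` at depth 2.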

## Steering (`steering`)

Steering is typed runtime control of an *already-running* agent (distinct from
sub-agent reuse, which acts between runs). An orchestrator — a parent agent, a
human UI, a graph supervisor, or a test — holds a cloneable `SteeringHandle`,
calls `send(command)`, and the loop `drain`s pending commands at a **safe
checkpoint** (before each model call). Commands are `SteeringCommand`: `Pause`,
`Resume`, `Cancel`, `InjectMessage`, `Redirect { instruction }`, and
`SetMetadata`. Each is gated by a `SteeringPolicy` allowlist
(`SteeringPolicy::new` permits nothing by default; `allow_all` for tests), and a
disallowed command is rejected with `TinyAgentsError::Steering`. Applying a batch
yields a `SteeringOutcome` (`Continue` / `Pause` / `Cancel`). Commands are
`Serialize`/`Deserialize`, so steering can be logged, transported, and replayed.
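
The send / drain / apply cycle can be sketched with a shared queue. The code below is a simplified illustration: it models only four commands, uses a hypothetical `CommandKind` enum for the allowlist, reports a rejected command as a `String` error, and `apply_batch` is an invented helper name. In this sketch a later `Resume` in the same batch overrides an earlier `Pause`, which is an assumption about batch semantics.

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::{Arc, Mutex};

// Hypothetical, reduced command set; the real enum has more variants.
#[derive(Debug, Clone, PartialEq)]
enum SteeringCommand {
    Pause,
    Resume,
    Cancel,
    InjectMessage(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum CommandKind {
    Pause,
    Resume,
    Cancel,
    InjectMessage,
}

impl SteeringCommand {
    fn kind(&self) -> CommandKind {
        match self {
            SteeringCommand::Pause => CommandKind::Pause,
            SteeringCommand::Resume => CommandKind::Resume,
            SteeringCommand::Cancel => CommandKind::Cancel,
            SteeringCommand::InjectMessage(_) => CommandKind::InjectMessage,
        }
    }
}

#[derive(Debug, PartialEq)]
enum SteeringOutcome {
    Continue,
    Pause,
    Cancel,
}

/// Allowlist: the default policy permits nothing.
#[derive(Default)]
struct SteeringPolicy {
    allowed: HashSet<CommandKind>,
}

impl SteeringPolicy {
    fn allow(mut self, kind: CommandKind) -> Self {
        self.allowed.insert(kind);
        self
    }
}

/// Cloneable handle: orchestrators `send`, the loop `drain`s at a checkpoint.
#[derive(Clone, Default)]
struct SteeringHandle {
    queue: Arc<Mutex<VecDeque<SteeringCommand>>>,
}

impl SteeringHandle {
    fn send(&self, command: SteeringCommand) {
        self.queue.lock().unwrap().push_back(command);
    }

    fn drain(&self) -> Vec<SteeringCommand> {
        let mut queue = self.queue.lock().unwrap();
        let drained: Vec<SteeringCommand> = queue.drain(..).collect();
        drained
    }
}

/// Applies one drained batch; a disallowed command rejects the batch.
fn apply_batch(
    policy: &SteeringPolicy,
    commands: Vec<SteeringCommand>,
    transcript: &mut Vec<String>,
) -> Result<SteeringOutcome, String> {
    let mut outcome = SteeringOutcome::Continue;
    for command in commands {
        if !policy.allowed.contains(&command.kind()) {
            return Err(format!("steering command {:?} is not allowed", command.kind()));
        }
        match command {
            SteeringCommand::Pause => outcome = SteeringOutcome::Pause,
            SteeringCommand::Resume => outcome = SteeringOutcome::Continue,
            SteeringCommand::Cancel => return Ok(SteeringOutcome::Cancel),
            SteeringCommand::InjectMessage(text) => transcript.push(text),
        }
    }
    Ok(outcome)
}
```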

## Summarization (context-window-aware) (`summarization`)

`summarization` keeps the working transcript inside the model's context window.
`estimate_tokens(text)` and `TokenEstimate` provide a cheap token-count estimate;
`trim_messages(messages, strategy)` applies a `TrimStrategy`; and
`SummarizationPolicy` decides *when* to compact. A policy can be derived from the
model's own profile — `SummarizationPolicy::from_profile(profile, threshold)` and
`with_context_window(max_input_tokens)` — so the trigger budget scales with the
target model. `should_summarize(messages)` checks the trigger, `plan(messages)`
splits the transcript into a "summarize these / keep these" partition, the
`Summarizer` trait (with `ConcatSummarizer`) produces the summary, and `SummaryRecord` /
`CompressionProvenance` record what was compressed so the compaction is
auditable. This matters for recursion: deep sub-agent transcripts stay bounded
instead of blowing the parent's budget.
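
The trigger-and-partition logic can be sketched with plain strings as messages. All details here are assumptions for illustration: the four-characters-per-token heuristic, the `keep_last` field, the `trigger_budget` helper, and `concat_summary` are hypothetical and may differ from the crate's `estimate_tokens`, `SummarizationPolicy`, and `ConcatSummarizer`.

```rust
/// Rough heuristic (an assumption): about four characters per token, rounded up.
fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

// Hypothetical policy shape; `keep_last` is an invented knob for this sketch.
struct SummarizationPolicy {
    max_input_tokens: usize,
    threshold: f64,
    keep_last: usize,
}

impl SummarizationPolicy {
    /// Token count above which compaction is triggered.
    fn trigger_budget(&self) -> usize {
        (self.max_input_tokens as f64 * self.threshold) as usize
    }

    fn should_summarize(&self, messages: &[String]) -> bool {
        let total: usize = messages.iter().map(|m| estimate_tokens(m)).sum();
        total > self.trigger_budget()
    }

    /// Splits into (summarize these, keep these), keeping the newest messages.
    fn plan<'a>(&self, messages: &'a [String]) -> (&'a [String], &'a [String]) {
        let split = messages.len().saturating_sub(self.keep_last);
        messages.split_at(split)
    }
}

/// Stand-in summarizer: concatenates instead of calling a model.
fn concat_summary(messages: &[String]) -> String {
    format!("[summary of {} messages] {}", messages.len(), messages.join(" | "))
}
```

Because the budget is a fraction of `max_input_tokens`, the same policy code triggers earlier for a small-window model and later for a large-window one.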

## Events and run status (`events`)

`events` is the observability spine. `AgentEvent` enumerates lifecycle
boundaries (run, model, tool, middleware, sub-agent, retry, custom). Events flow
through an `EventSink` to `EventListener`s (e.g. `RecordingListener`), are
journaled in an `EventJournal` (with `replay_from(offset)`), and the run's
overall state is `HarnessRunStatus`. Because child runs share the parent's
`EventSink`, a single stream shows the *entire recursion tree* — parent and child
model/tool/sub-agent events interleaved with correct `depth` annotations.
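
An append-only journal with offset-based replay is small enough to sketch directly. This assumes offsets are zero-based indices into the journal and reduces each event to a depth plus a label; the real `AgentEvent` is an enum with richer payloads, and `append` is a hypothetical method name.

```rust
// Hypothetical event shape: just a depth and a label.
#[derive(Debug, Clone, PartialEq)]
struct AgentEvent {
    depth: usize,
    kind: &'static str,
}

#[derive(Default)]
struct EventJournal {
    entries: Vec<AgentEvent>,
}

impl EventJournal {
    /// Appends an event and returns the offset it was stored at.
    fn append(&mut self, event: AgentEvent) -> usize {
        self.entries.push(event);
        self.entries.len() - 1
    }

    /// Everything recorded at or after `offset`; empty if `offset` is past the end.
    fn replay_from(&self, offset: usize) -> &[AgentEvent] {
        self.entries.get(offset..).unwrap_or(&[])
    }
}
```

A listener that reconnects after missing part of a run can ask for `replay_from(last_seen + 1)` and catch up, with parent and child events arriving in their original interleaved order.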

## Testkit (`testkit`)

`harness::testkit` makes the loop deterministically testable: `ScriptedModel` /
`StreamingMock` / `SlowModel` fake providers, `FakeTool`, `DeterministicClock`,
`DeterministicIds`, `EventRecorder`, and a `Trajectory` assertion helper
(`from_events(...).assert_tool_called(...)`, `assert_model_called_times(n)`,
`assert_order(&[...])`, `assert_completed()`). Trajectory assertions are how you
verify the *shape* of a recursive run — that the orchestrator called the model,
then a sub-agent tool, then the model again, and completed.
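
The idea behind an order assertion is a subsequence check over recorded steps. The sketch below uses a hypothetical `Step` enum and boolean-returning helpers (`model_calls`, `tool_called`, `contains_in_order`) rather than the testkit's panicking `assert_*` methods, so it shows the matching logic only.

```rust
// Hypothetical recorded steps; the real helper is built from `AgentEvent`s.
#[derive(Debug, Clone, PartialEq)]
enum Step {
    ModelCalled,
    ToolCalled(String),
    Completed,
}

struct Trajectory {
    steps: Vec<Step>,
}

impl Trajectory {
    fn from_steps(steps: Vec<Step>) -> Self {
        Trajectory { steps }
    }

    fn model_calls(&self) -> usize {
        self.steps.iter().filter(|s| **s == Step::ModelCalled).count()
    }

    fn tool_called(&self, name: &str) -> bool {
        self.steps.iter().any(|s| matches!(s, Step::ToolCalled(n) if n == name))
    }

    /// True when `expected` occurs as an in-order (not necessarily
    /// contiguous) subsequence of the recorded steps.
    fn contains_in_order(&self, expected: &[Step]) -> bool {
        let mut remaining = expected.iter();
        let mut next = remaining.next();
        for step in &self.steps {
            if Some(step) == next {
                next = remaining.next();
            }
        }
        next.is_none()
    }
}
```

Matching a subsequence instead of the exact list keeps such tests stable when extra events (retries, cache hits) are interleaved with the steps under test.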

## See also

- [Graph Runtime](Graph-Runtime.md) — durable typed state graphs and subgraphs.
- [Providers](Providers.md) — configuring hosted providers behind `openai`.
- [Examples](Examples.md) — annotated, runnable catalog.
- [`docs/modules/harness/README.md`](../docs/modules/harness/README.md) — the
  full module specification.