Harness
The harness (src/harness/) is the orchestration layer around language-model
calls — and, in TinyAgents' recursive-language-model (RLM) framing, it is the
substrate that makes recursion observable. Every model call, tool call,
sub-agent run, and graph node ultimately bottoms out in a harness agent loop, so
the harness is where parent/child run identity, usage roll-ups, depth limits, and
event streams are tracked. When an agent calls another agent (a model calling a
model), it is one harness invoking another harness one level deeper in the
recursion tree.
This page is a developer deep-dive. Each section is grounded in real module and
type names under src/harness/ and links to the matching design note under
docs/modules/harness/.
Background on the RLM lineage — see the Recursive Language Models blog and paper (Zhang, Kraska, Khattab, MIT CSAIL, 2025, arXiv:2512.24601). TinyAgents is inspired by and architected around that execution model — sub-model / sub-agent / sub-graph calls as functions, persistent session values, depth tracking, and trajectory logging — not a reimplementation of the paper.
| Module (src/harness/…) | Role | Design note |
|---|---|---|
| agent_loop | Default model→tool→model loop | — |
| runtime | AgentHarness facade + RunPolicy | — |
| context | RunConfig, RunContext, depth/limit tracking | context.md |
| message | Typed Message / ContentBlock chat shapes | — |
| ids | Typed RunId/ThreadId/CallId/… identifiers | — |
| prompt | Prompt templates + cache-segmented PromptBuilder | prompt.md |
| model | Provider-neutral ChatModel, requests, responses, streams | model.md |
| providers | Feature-gated provider adapters (MockModel, OpenAiModel) | — |
| tool | Typed tool trait, schemas, registry | tool.md |
| middleware | before/after hooks around agent, model, tool | middleware.md |
| structured | Typed/JSON-schema response extraction | structured-output.md |
| stream / model streaming | Token & event streaming | streaming.md |
| usage | Token accounting | usage.md |
| cost | Pricing + cost roll-ups | cost.md |
| limits / retry | Caps, timeouts, backoff, fallback, rate limit | limits-retry.md |
| cache | Local response cache + prompt-cache layout | cache.md |
| memory | Short-term thread memory / chat history | — |
| embeddings | Embedding models, vector stores, retrievers | embeddings.md |
| store | Pluggable persistence backends | store.md |
| events | Typed event stream + run status | observability.md |
| observability | Durable journals, status stores, sinks, latency metrics | observability.md |
| subagent | Agents-as-tools, reusable sessions | subagent-steering.md |
| steering | Typed runtime control of running agents | subagent-steering.md |
| summarization | Context-window-aware compaction | summarization.md |
| cancel | Cooperative CancellationToken | — |
| testkit | Fakes, recorders, trajectory asserts | testkit.md |
The harness deliberately does not depend on the graph module: you can call a model or run a tool loop without constructing a graph. The graph runtime depends on harness traits, not the other way around.
The default loop is implemented as inherent methods on
AgentHarness<State, Ctx> (src/harness/agent_loop/mod.rs). The canonical entry
points are:
- invoke(state, ctx_data, config, input): full control over run identity and context data.
- invoke_default(state, input): convenience wrapper that builds a default RunConfig.
- invoke_in_context(state, ctx, input): run inside an existing RunContext, which is how nested/child runs inherit parent identity.
- invoke_streaming* variants: same loop, emitting ModelStreamItems.
- *_with_status variants: return the AgentRun plus a HarnessRunStatus.
The lifecycle (model→tool→model) is:

    input messages
      -> build RunContext, emit RunStarted
      -> before_agent middleware
      -> loop:
           enforce model-call cap + wall-clock deadline (fail-closed)
           build ModelRequest (messages + tool schemas + default response format)
           before_model middleware; emit ModelStarted
           resolve + invoke model (with retry + fallback)
           after_model middleware; emit ModelCompleted; fold Usage into AgentRun
           append assistant message
           if tool calls -> enforce tool cap, before_tool, run tools,
                            after_tool, append tool results, continue
           else          -> extract structured output (if configured) and break
      -> after_agent middleware; emit RunCompleted

On error the loop emits RunFailed, fans the error through on_error
middleware, and returns it. The loop is intentionally sleep-free: retry
backoff durations are computed via RetryPolicy::backoff_for_attempt but the
loop itself does not block, keeping tests deterministic.
The accumulated result is AgentRun (harness::middleware::AgentRun), holding
the final messages, folded Usage, and any extracted structured response.
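To make the control flow above concrete, here is a minimal, synchronous sketch of a model→tool→model loop with fail-closed call caps. Everything in it (`Msg`, `LoopError`, `run_loop`, `scripted_model`) is a hypothetical stand-in written for illustration; it is not the crate's `AgentHarness` API, and it leaves out middleware, events, retries, and structured output.

```rust
// Hypothetical stand-ins; none of these are the crate's real types.
#[derive(Debug, Clone, PartialEq)]
enum Msg {
    User(String),
    Assistant { text: String, tool_call: Option<String> },
    Tool(String),
}

#[derive(Debug, PartialEq)]
enum LoopError {
    ModelCallLimit,
    ToolCallLimit,
}

// Runs model -> tool -> model until the model stops requesting tools.
// `model` maps the transcript to an assistant turn; `tool` maps a tool
// name to its textual result.
fn run_loop(
    input: &str,
    max_model_calls: usize,
    max_tool_calls: usize,
    model: impl Fn(&[Msg]) -> Msg,
    tool: impl Fn(&str) -> String,
) -> Result<Vec<Msg>, LoopError> {
    let mut messages = vec![Msg::User(input.to_string())];
    let mut model_calls: usize = 0;
    let mut tool_calls: usize = 0;
    loop {
        // Fail closed: check the cap before spending another model call.
        if model_calls >= max_model_calls {
            return Err(LoopError::ModelCallLimit);
        }
        model_calls += 1;
        let reply = model(&messages);
        messages.push(reply.clone());
        match reply {
            Msg::Assistant { tool_call: Some(name), .. } => {
                if tool_calls >= max_tool_calls {
                    return Err(LoopError::ToolCallLimit);
                }
                tool_calls += 1;
                messages.push(Msg::Tool(tool(&name)));
            }
            // No tool requested: the run is complete.
            _ => return Ok(messages),
        }
    }
}

// A scripted model: request the `clock` tool once, then answer.
fn scripted_model(messages: &[Msg]) -> Msg {
    let saw_tool = messages.iter().any(|m| matches!(m, Msg::Tool(_)));
    if saw_tool {
        Msg::Assistant { text: "done".to_string(), tool_call: None }
    } else {
        Msg::Assistant { text: String::new(), tool_call: Some("clock".to_string()) }
    }
}
```

With generous caps the scripted run yields four messages (user, assistant tool request, tool result, final assistant). With `max_model_calls` set to 1 it fails with `LoopError::ModelCallLimit` instead of returning a partial answer.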
All model invocation goes through one trait (harness::model::ChatModel<State>)
with invoke (unary) and stream (incremental) methods. A call is described by
ModelRequest (built fluently with ModelRequest::new(messages).with_tools(…).with_tool_choice(…).with_response_format(…).with_model_hint(…).with_required_capabilities(…)) and answered by ModelResponse, which carries
the assistant message, Usage, finish reason, and a ResolvedModel recording
which concrete provider/model actually served the call.
Model selection is explicit and reusable. ModelRegistry resolves a call through
ModelSelection / ModelHint against a CapabilitySet (see ModelProfile,
Modalities, ModelStatus), and records the choice as a ResolvedModel with a
ModelResolutionSource. This lets a parent and its sub-agents reason about model
identity consistently across the recursion tree.
providers holds the adapters:
- MockModel / the providers test models (echo, constant, scripted responses): the default offline build.
- OpenAiModel (behind the openai Cargo feature). Despite the name it speaks the OpenAI Chat Completions wire format to many hosts via ProviderSpec / ProviderKind: constructors include deepseek, anthropic, groq, xai, openrouter, together, mistral, ollama, and compatible(...) for any OpenAI-compatible endpoint, plus from_env / from_spec_env.
Tools implement harness::tool::Tool<State> with name, description,
schema() -> ToolSchema, and an async call(state, ctx, call) -> Result<ToolResult>.
ToolCall carries an id, name, and JSON arguments; ToolResult is built with
ToolResult::text(...) / ToolResult::error(...) and exposes is_error().
ToolRegistry keys tools by name and is consulted by the loop when an assistant
message requests a tool. ToolDelta carries incremental tool progress for
streaming.
Crucially, a whole agent can be a tool — see Sub-agents.
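The name-keyed dispatch described above can be sketched as follows. This is a simplified, synchronous illustration: `SketchTool`, `SketchResult`, `SketchRegistry`, and `Upper` are hypothetical names, the real `Tool` trait is async and also takes state and context, and turning an unknown tool name into an error result is an assumption of this sketch rather than documented crate behaviour.

```rust
use std::collections::HashMap;

// Hypothetical, synchronous simplification of a named tool.
trait SketchTool {
    fn name(&self) -> &str;
    fn call(&self, arguments: &str) -> SketchResult;
}

#[derive(Debug, PartialEq)]
enum SketchResult {
    Text(String),
    Error(String),
}

impl SketchResult {
    fn is_error(&self) -> bool {
        matches!(self, SketchResult::Error(_))
    }
}

// A trivial tool that upper-cases its argument string.
struct Upper;

impl SketchTool for Upper {
    fn name(&self) -> &str {
        "upper"
    }

    fn call(&self, arguments: &str) -> SketchResult {
        SketchResult::Text(arguments.to_uppercase())
    }
}

#[derive(Default)]
struct SketchRegistry {
    tools: HashMap<String, Box<dyn SketchTool>>,
}

impl SketchRegistry {
    // Tools are keyed by the name they report.
    fn register(&mut self, tool: Box<dyn SketchTool>) {
        self.tools.insert(tool.name().to_string(), tool);
    }

    // Assumption: an unknown name becomes an error result the model can
    // read, rather than a panic inside the loop.
    fn dispatch(&self, name: &str, arguments: &str) -> SketchResult {
        match self.tools.get(name) {
            Some(tool) => tool.call(arguments),
            None => SketchResult::Error(format!("unknown tool: {}", name)),
        }
    }
}
```

Because dispatch goes through a trait object, anything that can answer `call` can be registered, which is the property that lets a whole agent sit behind the same interface.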
harness::middleware::Middleware<State, Ctx> exposes the cross-cutting hooks the
loop fans out to: before_agent / after_agent, before_model /
on_model_delta / after_model, before_tool / on_tool_delta /
after_tool, and on_error. They are ordered through a MiddlewareStack
(before_* in registration order, after_* in reverse). HookCounts records
how often each hook fired. Built-in middleware lives here too:
LoggingMiddleware, MessageTrimMiddleware, ContextCompressionMiddleware,
PromptCacheGuardMiddleware, and UsageAccountingMiddleware.
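The ordering rule (before_* in registration order, after_* in reverse) gives the usual "onion" nesting. A minimal sketch, using a hypothetical `Hook` trait and `Stack` type rather than the crate's `Middleware` / `MiddlewareStack`:

```rust
// Hypothetical hook trait: each middleware appends to a shared log.
trait Hook {
    fn before(&self, log: &mut Vec<String>);
    fn after(&self, log: &mut Vec<String>);
}

struct Named(&'static str);

impl Hook for Named {
    fn before(&self, log: &mut Vec<String>) {
        log.push(format!("before:{}", self.0));
    }

    fn after(&self, log: &mut Vec<String>) {
        log.push(format!("after:{}", self.0));
    }
}

struct Stack {
    hooks: Vec<Box<dyn Hook>>,
}

impl Stack {
    // Runs every `before` hook in registration order, the wrapped call,
    // then every `after` hook in reverse registration order.
    fn run(&self, log: &mut Vec<String>) {
        for hook in &self.hooks {
            hook.before(log);
        }
        log.push("call".to_string());
        for hook in self.hooks.iter().rev() {
            hook.after(log);
        }
    }
}

// Registers `a` then `b` and returns the order in which hooks fired.
fn demo_order() -> Vec<String> {
    let stack = Stack {
        hooks: vec![Box::new(Named("a")), Box::new(Named("b"))],
    };
    let mut log = Vec::new();
    stack.run(&mut log);
    log
}
```

The first-registered middleware therefore wraps all the others: it sees the request first and the result last.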
StructuredExtractor extracts a typed value from the final ModelResponse
according to a StructuredStrategy (StructuredStrategy::for_profile(...)
chooses a provider-appropriate strategy). response_format_for_strategy(...)
maps a strategy onto a ResponseFormat (Text, json_schema, provider-native,
or a tool-call strategy). The parsed value lands in StructuredOutput, which
exposes as_value() and parse::<T>(). Set
RunPolicy::default_response_format to attach a schema to every model request in
a run; the loop runs extraction automatically before completing.
Two layers cooperate. At the provider edge, ChatModel::stream yields a
ModelStream of ModelStreamItems; StreamAccumulator (or the
collect_model_stream helper) folds them back into a ModelResponse. At the
harness edge, harness::stream exposes a StreamSink over StreamChunks with
selectable StreamModes (messages, tools, updates, events, final). The
invoke_streaming* agent-loop methods surface model deltas as the loop runs.
Usage records input_tokens / output_tokens (plus cached-token bookkeeping)
and effective_total(). The loop folds each call's Usage into the AgentRun,
and UsageTotals aggregates across calls. cost::estimate_cost(pricing, usage)
turns a ModelPricing + Usage into CostTotals. This is where recursion
becomes measurable: because a child run executes through the parent's
RunContext/event sink, child usage and cost roll up into the parent's totals,
so an orchestrator can see the full cost of the subtree it spawned.
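The roll-up can be illustrated with a small recursive fold over a run tree. The types below (`SketchUsage`, `SketchPricing`, `RunNode`) and the integer micro-dollar pricing unit are assumptions made for the example; the crate's `Usage`, `ModelPricing`, and `CostTotals` may be shaped differently.

```rust
// Hypothetical token counters for one run.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct SketchUsage {
    input_tokens: u64,
    output_tokens: u64,
}

impl SketchUsage {
    fn fold(&mut self, other: SketchUsage) {
        self.input_tokens += other.input_tokens;
        self.output_tokens += other.output_tokens;
    }
}

// Assumed unit: micro-dollars per 1,000 tokens, kept as integers so the
// arithmetic is exact.
struct SketchPricing {
    input_per_1k: u64,
    output_per_1k: u64,
}

fn estimate_micro_usd(pricing: &SketchPricing, usage: &SketchUsage) -> u64 {
    (usage.input_tokens * pricing.input_per_1k
        + usage.output_tokens * pricing.output_per_1k)
        / 1000
}

// A run plus the child runs it spawned.
struct RunNode {
    own: SketchUsage,
    children: Vec<RunNode>,
}

// Total usage of a run and every run beneath it in the recursion tree.
fn subtree_usage(node: &RunNode) -> SketchUsage {
    let mut total = node.own;
    for child in &node.children {
        total.fold(subtree_usage(child));
    }
    total
}
```

For a parent that used 100/20 tokens, with a child at 400/80 and a grandchild at 500/100, the subtree total is 1000 input and 200 output tokens. At 3000 and 15000 micro-dollars per 1,000 tokens that comes to 6000 micro-dollars.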
RunLimits (built via with_max_model_calls, with_max_tool_calls,
with_max_wall_clock_ms, with_max_retries_per_call, with_max_concurrency,
and with_max_depth) is enforced fail-closed by a LimitTracker and by the
loop itself. Reaching a cap returns TinyAgentsError::LimitExceeded; the
deadline returns TinyAgentsError::Timeout.
max_depth is the recursion limit: it bounds how deep nested sub-agents (or
subgraphs) may recurse. An invocation whose child depth would exceed the cap
fails with TinyAgentsError::SubAgentDepth.
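A fail-closed tracker checks the cap before it records the work, so a run can never exceed its budget by one. The sketch below uses hypothetical types (`SketchLimits`, `SketchTracker`, `SketchLimitError`) and assumes that each child run starts with fresh call counters; whether the crate shares or resets counters across depth levels is not stated here.

```rust
#[derive(Debug, Clone, Copy)]
struct SketchLimits {
    max_model_calls: u32,
    max_depth: u32,
}

#[derive(Debug, PartialEq)]
enum SketchLimitError {
    ModelCalls,
    Depth,
}

#[derive(Debug)]
struct SketchTracker {
    limits: SketchLimits,
    model_calls: u32,
    depth: u32,
}

impl SketchTracker {
    fn root(limits: SketchLimits) -> Self {
        SketchTracker { limits, model_calls: 0, depth: 0 }
    }

    // Refuses the call once the cap is reached; counts it otherwise.
    fn record_model_call(&mut self) -> Result<(), SketchLimitError> {
        if self.model_calls >= self.limits.max_model_calls {
            return Err(SketchLimitError::ModelCalls);
        }
        self.model_calls += 1;
        Ok(())
    }

    // Creates the tracker for a child run one level deeper, or fails if
    // the child's depth would exceed `max_depth`.
    fn child(&self) -> Result<SketchTracker, SketchLimitError> {
        let depth = self.depth + 1;
        if depth > self.limits.max_depth {
            return Err(SketchLimitError::Depth);
        }
        // Assumption: children start with their own call counters.
        Ok(SketchTracker { limits: self.limits, model_calls: 0, depth })
    }
}
```

With `max_depth` set to 2, a root run can spawn a child and a grandchild, but the grandchild's attempt to recurse again is rejected.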
retry provides RetryPolicy (attempts, backoff, multiplier, jitter), the
is_retryable(err) classifier (only network/timeout/rate-limit/5xx by default),
FallbackPolicy (an ordered model fallback chain — next_after(current)), and a
token-bucket RateLimiter.
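Computing a backoff without sleeping, and walking an ordered fallback chain, might look like the sketch below. The struct fields, the 1-based attempt convention, the delay cap, and both function signatures are assumptions for illustration; jitter is omitted so the result is deterministic.

```rust
use std::time::Duration;

// Hypothetical retry policy; field names are illustrative.
struct SketchRetry {
    max_attempts: u32,
    base: Duration,
    multiplier: u32,
    cap: Duration,
}

impl SketchRetry {
    // Delay to wait before retry number `attempt` (1-based), or `None`
    // when the attempt budget is spent. Only computes; never sleeps.
    fn delay_for(&self, attempt: u32) -> Option<Duration> {
        if attempt == 0 || attempt > self.max_attempts {
            return None;
        }
        let factor = self.multiplier.saturating_pow(attempt - 1);
        Some(self.base.saturating_mul(factor).min(self.cap))
    }
}

// Next model in an ordered fallback chain, if there is one.
fn next_after<'a>(chain: &[&'a str], current: &str) -> Option<&'a str> {
    let index = chain.iter().position(|model| *model == current)?;
    chain.get(index + 1).copied()
}
```

Keeping the delay a pure function of the attempt number is what lets a caller (or a test clock) decide whether to actually wait.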
cache separates two distinct ideas:
- Local response cache: ResponseCache (with InMemoryResponseCache), keyed by cache_key(request). Attach it with AgentHarness::with_response_cache; the loop then checks it before each provider call and stores successful responses, emitting cache.hit / cache.miss events. Because the cache is owned by the harness, a repeated identical request, even across separate runs, can be served from an earlier result.
- Provider prompt/KV-cache layout: PromptCacheLayout (and CacheLayoutEvent) make the stable prompt prefix explicit so middleware that edits model-visible prompt segments can be detected (is_prefix_stable_against).
CachePolicy gates both behaviors.
thread_id should be propagated through parent agents, sub-agents, subgraphs, and nested harness calls so provider caches see one stable logical conversation. Provider adapters may map that stable identity into required cache/user headers when a backend, such as Fireworks-style prefix caching, requires explicit cache attribution.
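The local response cache amounts to "hash the request, look it up, call the provider only on a miss". A minimal sketch with hypothetical types (`SketchRequest`, `SketchCache`); the key derivation here is an assumption and says nothing about how the crate's `cache_key` is computed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical request shape: just enough fields to hash.
#[derive(Hash)]
struct SketchRequest {
    model: String,
    messages: Vec<String>,
}

// Note: `DefaultHasher` output is not guaranteed stable across Rust
// releases, so a persistent cache would need a stable hash instead.
fn cache_key(request: &SketchRequest) -> u64 {
    let mut hasher = DefaultHasher::new();
    request.hash(&mut hasher);
    hasher.finish()
}

#[derive(Default)]
struct SketchCache {
    entries: HashMap<u64, String>,
    hits: u32,
    misses: u32,
}

impl SketchCache {
    // Returns the cached response for an identical request, or invokes
    // `call` once and stores its result.
    fn get_or_call(
        &mut self,
        request: &SketchRequest,
        call: impl FnOnce() -> String,
    ) -> String {
        let key = cache_key(request);
        if let Some(found) = self.entries.get(&key) {
            self.hits += 1;
            return found.clone();
        }
        self.misses += 1;
        let response = call();
        self.entries.insert(key, response.clone());
        response
    }
}
```

Because the cache value outlives any single run, a second identical request is answered from the stored entry and the provider closure is never invoked.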
memory owns conversation continuity: the ChatHistory trait
(InMemoryChatHistory, StoreChatHistory<S>) and ShortTermMemory<H>, which is
loaded before a loop and saved after, optionally trimmed. MemoryScope
distinguishes short- vs long-term data.
embeddings provides retrieval-augmented context: the EmbeddingModel trait
(MockEmbeddingModel; OpenAiEmbeddingModel behind the feature), a VectorStore
trait (InMemoryVectorStore), cosine_similarity, and a Retriever whose
index(docs) / retrieve(query, top_k) -> Vec<ScoredDoc> close the loop. In the
RLM framing, retrieval is how an agent pulls snippets of a large external
environment into context instead of stuffing the whole thing into one window.
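Retrieval by cosine similarity reduces to scoring every stored vector against the query and keeping the best top_k. The sketch below assumes plain `f32` slices and a hypothetical `retrieve` free function over `(id, vector)` pairs; the crate's `Retriever`, `VectorStore`, and `ScoredDoc` signatures may differ.

```rust
// Cosine of the angle between two vectors; 0.0 if either has zero length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

// Scores every document against the query and returns the `top_k` best
// as (id, score) pairs, highest score first.
fn retrieve<'a>(
    docs: &'a [(&'a str, Vec<f32>)],
    query: &[f32],
    top_k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .map(|(id, vector)| (*id, cosine_similarity(vector, query)))
        .collect();
    scored.sort_by(|left, right| right.1.total_cmp(&left.1));
    scored.truncate(top_k);
    scored
}
```

A linear scan like this is fine for small in-memory stores; larger corpora usually call for an approximate nearest-neighbour index.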
This is the core recursive surface. A SubAgent<State, Ctx> wraps an
Arc<AgentHarness> plus a stable name, description, and optional
system_prompt. Invoking it always produces a child run one level deeper in
the recursion tree:
- SubAgent::invoke / invoke_with_events run a fresh child loop.
- SubAgent::invoke_in_parent(state, ctx_data, parent, input) threads the live parent RunContext, so the child inherits the parent's depth and event sink: the child runs at parent.depth() + 1 and its events (and usage) surface on the parent's stream. It emits SubAgentStarted / SubAgentCompleted around the child loop, making the recursion observable.
- SubAgentTool adapts a sub-agent into a Tool, so a parent model can call an entire agent the same way it calls any tool (a model calling a model). The child input is read from the SUBAGENT_INPUT_FIELD ("input") argument. Depth is fixed at construction (with_parent_depth) because the Tool trait gives call no live parent context.
- When a sub-agent invoked through SubAgentTool hits a child run limit (max_model_calls, max_tool_calls, wall-clock timeout, or depth), the tool returns an error ToolResult telling the parent orchestrator the delegated agent hit its limit. The parent run can continue and decide whether to narrow, split, retry, or report the task instead of mistaking the child limit for a completed answer.
- SubAgentSession keeps the same sub-agent alive across turns, accumulating a transcript (post-completion reuse, e.g. human-in-the-loop). Each reuse emits SubAgentReused.
Depth is bounded by RunLimits::max_depth; overrunning it yields
TinyAgentsError::SubAgentDepth. The graph analogue is graph::subgraph
(SubAgentNode), where a node embeds another compiled graph with the same depth
tracking.
Steering is typed runtime control of an already-running agent (distinct from
sub-agent reuse, which acts between runs). An orchestrator — a parent agent, a
human UI, a graph supervisor, or a test — holds a cloneable SteeringHandle,
calls send(command), and the loop drains pending commands at a safe
checkpoint (before each model call). Commands are SteeringCommand: Pause,
Resume, Cancel, InjectMessage, Redirect { instruction }, and
SetMetadata. Each command reports its SteeringCommandKind (via kind()),
which a SteeringPolicy allowlist checks
(SteeringPolicy::new permits nothing by default; allow_all for tests), and a
disallowed command is rejected with TinyAgentsError::Steering. Applying a batch
yields a SteeringOutcome (Continue / Pause / Cancel). Commands are
Serialize/Deserialize, so steering can be logged, transported, and replayed.
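The handle / checkpoint / allowlist flow can be sketched with a standard-library channel. All names here (`Cmd`, `CmdKind`, `Outcome`, `drain`) are hypothetical, only three commands are modelled, and the rule that Cancel takes precedence over Pause within one batch is an assumption of the sketch.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Receiver};

#[derive(Debug, Clone, PartialEq)]
enum Cmd {
    Pause,
    Cancel,
    Inject(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum CmdKind {
    Pause,
    Cancel,
    Inject,
}

impl Cmd {
    fn kind(&self) -> CmdKind {
        match self {
            Cmd::Pause => CmdKind::Pause,
            Cmd::Cancel => CmdKind::Cancel,
            Cmd::Inject(_) => CmdKind::Inject,
        }
    }
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Continue,
    Pause,
    Cancel,
}

// Drains pending commands at a checkpoint. A command whose kind is not
// on the allowlist is rejected with an error.
fn drain(
    rx: &Receiver<Cmd>,
    allowed: &HashSet<CmdKind>,
    transcript: &mut Vec<String>,
) -> Result<Outcome, String> {
    let mut outcome = Outcome::Continue;
    for cmd in rx.try_iter() {
        if !allowed.contains(&cmd.kind()) {
            return Err(format!("steering command not allowed: {:?}", cmd.kind()));
        }
        match cmd {
            Cmd::Inject(text) => transcript.push(text),
            Cmd::Pause => {
                // Assumption: an earlier Cancel is not downgraded.
                if outcome != Outcome::Cancel {
                    outcome = Outcome::Pause;
                }
            }
            Cmd::Cancel => outcome = Outcome::Cancel,
        }
    }
    Ok(outcome)
}

// Sends two allowed commands, drains, then sends a disallowed one.
fn demo_steering() -> (Result<Outcome, String>, Result<Outcome, String>, Vec<String>) {
    let (tx, rx) = channel();
    let allowed: HashSet<CmdKind> =
        [CmdKind::Inject, CmdKind::Pause].iter().copied().collect();
    let mut transcript = Vec::new();
    tx.send(Cmd::Inject("focus on tests".to_string())).unwrap();
    tx.send(Cmd::Pause).unwrap();
    let first = drain(&rx, &allowed, &mut transcript);
    tx.send(Cmd::Cancel).unwrap();
    let second = drain(&rx, &allowed, &mut transcript);
    (first, second, transcript)
}
```

Draining only at a checkpoint means a command never interrupts a model call halfway; it takes effect at the next safe boundary.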
summarization keeps the working transcript inside the model's context window.
estimate_tokens(text) and TokenEstimate provide a cheap budget;
trim_messages(messages, strategy) applies a TrimStrategy; and
SummarizationPolicy decides when to compact. A policy can be derived from the
model's own profile — SummarizationPolicy::from_profile(profile, threshold) and
with_context_window(max_input_tokens) — so the trigger budget scales with the
target model. should_summarize(messages) and plan(messages) split the
transcript into a "summarize these / keep these" partition, the Summarizer
trait (with ConcatSummarizer) produces the summary, and SummaryRecord /
CompressionProvenance record what was compressed so the compaction is
auditable. This matters for recursion: deep sub-agent transcripts stay bounded
instead of blowing the parent's budget.
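The "summarize these / keep these" partition can be sketched as a token estimate plus a split point. The four-characters-per-token heuristic, the `SketchPolicy` fields, and `concat_summary` are assumptions for illustration, not the crate's estimator or its `ConcatSummarizer`.

```rust
// Rough heuristic: about four characters per token, rounded up.
// An assumption for this sketch, not the crate's estimator.
fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

// Hypothetical policy: compact once the transcript exceeds
// `trigger_tokens`, always keeping the newest `keep_recent` messages.
struct SketchPolicy {
    trigger_tokens: usize,
    keep_recent: usize,
}

impl SketchPolicy {
    fn should_summarize(&self, messages: &[String]) -> bool {
        let total: usize = messages.iter().map(|m| estimate_tokens(m)).sum();
        total > self.trigger_tokens
    }

    // Splits the transcript into (to_summarize, to_keep). When no
    // compaction is needed, the first slice is empty.
    fn plan<'a>(&self, messages: &'a [String]) -> (&'a [String], &'a [String]) {
        if !self.should_summarize(messages) {
            return (&messages[..0], messages);
        }
        let split = messages.len().saturating_sub(self.keep_recent);
        messages.split_at(split)
    }
}

// Naive summarizer: joins the old messages behind a marker.
fn concat_summary(old: &[String]) -> String {
    format!("[summary of {} messages] {}", old.len(), old.join(" | "))
}
```

Keeping the most recent messages verbatim preserves the immediate context the model is acting on, while older turns collapse into one summary message.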
events is the observability spine. AgentEvent enumerates lifecycle
boundaries — run, model (incl. ModelDelta), tool, middleware, sub-agent,
cache hit/miss, retry/rate-limit/fallback, route selection, usage/cost,
limit-reached (LimitKind), memory, and compression. Events flow
through an EventSink to EventListeners (e.g. RecordingListener), are
journaled in an EventJournal (with replay_from(offset)), and the run's
overall state is HarnessRunStatus. Because child runs share the parent's
EventSink, a single stream shows the entire recursion tree — parent and child
model/tool/sub-agent events interleaved with correct depth annotations.
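A shared, append-only journal with offset-based replay is enough to show how one stream can carry a whole recursion tree. The event shape (`kind` plus `depth`), the journal type, and the depth values attached to each event are hypothetical; the crate's `AgentEvent` and `EventJournal` carry far more detail.

```rust
// Hypothetical event: a label plus the depth of the run that emitted it.
#[derive(Debug, Clone, PartialEq)]
struct SketchEvent {
    kind: &'static str,
    depth: u32,
}

#[derive(Default)]
struct SketchJournal {
    events: Vec<SketchEvent>,
}

impl SketchJournal {
    // Appends an event and returns its offset in the journal.
    fn append(&mut self, kind: &'static str, depth: u32) -> usize {
        self.events.push(SketchEvent { kind, depth });
        self.events.len() - 1
    }

    // Every event at or after `offset`; empty if the offset is past the end.
    fn replay_from(&self, offset: usize) -> &[SketchEvent] {
        self.events.get(offset..).unwrap_or(&[])
    }
}

// A parent run (depth 0) that delegates once to a child run (depth 1),
// with both writing to the same journal.
fn record_nested_run() -> SketchJournal {
    let mut journal = SketchJournal::default();
    journal.append("RunStarted", 0);
    journal.append("SubAgentStarted", 0);
    journal.append("RunStarted", 1);
    journal.append("RunCompleted", 1);
    journal.append("SubAgentCompleted", 0);
    journal.append("RunCompleted", 0);
    journal
}
```

A listener that reconnects after seeing offset 1 can call `replay_from(2)` and pick up exactly where the child run began.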
While events is the in-memory spine, harness::observability is the
durable side that a journal or external trace needs without a live broadcast.
AgentObservation wraps each AgentEvent with the correlation metadata a trace
backend needs — event_id, run_id, parent_run_id / root_run_id lineage,
the stream offset, and a ts_ms wall-clock timestamp — so the whole recursion
tree can be reconstructed off-process. Persistence is pluggable: the
HarnessEventJournal trait (InMemoryEventJournal, StoreEventJournal<A> over
an AppendStore) durably records observations, and the HarnessStatusStore
trait (InMemoryStatusStore) snapshots HarnessRunStatus. Sinks compose over an
EventSink: FanOutSink (multiplex), RedactingSink (scrub sensitive fields),
JournalSink (write through to a journal), and JsonlSink (append JSONL).
AgentLatencyMetrics::from_record(...) summarizes a run's timing —
run_elapsed_ms, per-call AgentCallLatency lists, and total_model_ms /
max_model_ms / total_tool_ms / max_tool_ms roll-ups — turning the event
record into measurable latency across the subtree.
harness::message owns the chat shapes the loop carries: Message
(SystemMessage, UserMessage, AssistantMessage, ToolMessage), the
ContentBlock enum (with ImageRef for multimodal content), and MessageDelta
for streaming. harness::ids defines the newtype identifiers that thread
parent/child lineage through every layer — RunId, ThreadId, CallId,
EventId, ComponentId, GraphId, NodeId, TaskId, SessionId, CellId,
CheckpointId, and InterruptId — plus the ExecutionStatus and HarnessPhase
enums. Stable typed ids are what let usage, events, and caches all agree on
which run in the recursion tree they describe.
harness::prompt builds model-visible prompts deterministically. PromptTemplate
renders {{var}} substitutions (render, render_system, render_user,
render_assistant) with a TemplateRole, and MessagesTemplate renders an
ordered multi-message conversation. PromptBuilder assembles a request out of
named, ordered segments — push_system, push_tools_segment,
push_instructions, push_history, and push_volatile — then build(tail)
emits a ModelRequest. Because the stable segments come before the volatile
ones, fingerprint() identifies the cacheable prefix, which is exactly what the
PromptCacheLayout machinery in cache needs to keep a
provider's prompt/KV cache stable across turns.
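Two ideas from this section, `{{var}}` substitution and a fingerprint that covers only the stable prefix, can be sketched together. `render` and `SketchPrompt` are hypothetical simplifications; the builder here models only a system segment and a volatile segment, and the hash choice is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Replaces each `{{name}}` with its value; unknown placeholders are
// left in the output untouched.
fn render(template: &str, vars: &[(&str, &str)]) -> String {
    let mut out = template.to_string();
    for (name, value) in vars {
        let placeholder = format!("{{{{{}}}}}", name);
        out = out.replace(&placeholder, value);
    }
    out
}

// Hypothetical builder that keeps stable and volatile segments apart.
#[derive(Default)]
struct SketchPrompt {
    stable: Vec<String>,
    volatile: Vec<String>,
}

impl SketchPrompt {
    fn push_system(mut self, text: &str) -> Self {
        self.stable.push(text.to_string());
        self
    }

    fn push_volatile(mut self, text: &str) -> Self {
        self.volatile.push(text.to_string());
        self
    }

    // Hashes only the stable segments, so volatile edits do not change it.
    fn fingerprint(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.stable.hash(&mut hasher);
        hasher.finish()
    }

    // Emits stable segments first, then volatile ones, then the tail.
    fn build(&self, tail: &str) -> Vec<String> {
        let mut out = self.stable.clone();
        out.extend(self.volatile.iter().cloned());
        out.push(tail.to_string());
        out
    }
}
```

Two prompts that share a system segment but differ in their volatile segment produce the same fingerprint, which is the signal that a provider-side prefix cache can still be reused.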
harness::testkit makes the loop deterministically testable: ScriptedModel /
StreamingMock / SlowModel fake providers, FakeTool, DeterministicClock,
DeterministicIds, EventRecorder, and a Trajectory assertion helper
(from_events(...).assert_tool_called(...), assert_model_called_times(n),
assert_order(&[...]), assert_completed()). Trajectory assertions are how you
verify the shape of a recursive run — that the orchestrator called the model,
then a sub-agent tool, then the model again, and completed.
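An ordered-subsequence check is the core of a trajectory assertion. The sketch below uses a hypothetical `Step` enum and boolean-returning helpers instead of the testkit's panicking `assert_*` methods, so the logic can be read on its own.

```rust
// Hypothetical, coarse-grained view of what happened during a run.
#[derive(Debug, Clone, PartialEq)]
enum Step {
    Model,
    Tool(String),
    Completed,
}

struct SketchTrajectory {
    steps: Vec<Step>,
}

impl SketchTrajectory {
    fn model_calls(&self) -> usize {
        self.steps.iter().filter(|s| matches!(s, Step::Model)).count()
    }

    fn tool_called(&self, name: &str) -> bool {
        self.steps
            .iter()
            .any(|s| matches!(s, Step::Tool(n) if n.as_str() == name))
    }

    // True when `expected` occurs as an ordered (not necessarily
    // contiguous) subsequence of the recorded steps.
    fn in_order(&self, expected: &[Step]) -> bool {
        let mut remaining = self.steps.iter();
        expected.iter().all(|want| remaining.any(|got| got == want))
    }

    fn completed(&self) -> bool {
        self.steps.last() == Some(&Step::Completed)
    }
}

// An orchestrator that calls the model, delegates to a sub-agent tool,
// calls the model again, and completes.
fn orchestrator_run() -> SketchTrajectory {
    SketchTrajectory {
        steps: vec![
            Step::Model,
            Step::Tool("researcher".to_string()),
            Step::Model,
            Step::Completed,
        ],
    }
}
```

Matching on a subsequence rather than an exact list keeps tests stable when unrelated events are added between the steps that matter.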
- Graph Runtime: durable typed state graphs and subgraphs.
- Providers: configuring hosted providers behind openai.
- Examples: annotated, runnable catalog.
- docs/modules/harness/README.md: the full module specification.