Harness
The harness (src/harness/) is the orchestration layer around language-model
calls — and, in TinyAgents' recursive-language-model (RLM) framing, it is the
substrate that makes recursion observable. Every model call, tool call,
sub-agent run, and graph node ultimately bottoms out in a harness agent loop, so
the harness is where parent/child run identity, usage roll-ups, depth limits, and
event streams are tracked. When an agent calls another agent (a model calling a
model), it is one harness invoking another harness one level deeper in the
recursion tree.
This page is a developer deep-dive. Each section is grounded in real module and
type names under src/harness/ and links to the matching design note under
docs/modules/harness/.
For background on the RLM lineage, see the Recursive Language Models blog and paper (Zhang, Kraska, Khattab, MIT CSAIL, 2025, arXiv:2512.24601). TinyAgents is inspired by and architected around that execution model (sub-model / sub-agent / sub-graph calls as functions, persistent session values, depth tracking, and trajectory logging); it is not a reimplementation of the paper.

| Module (`src/harness/…`) | Role | Design note |
|---|---|---|
| `agent_loop` | Default model→tool→model loop | - |
| `runtime` | `AgentHarness` facade + `RunPolicy` | - |
| `context` | `RunConfig`, `RunContext`, depth/limit tracking | context.md |
| `model` | Provider-neutral `ChatModel`, requests, responses, streams | model.md |
| `providers` | Feature-gated provider adapters (`MockModel`, `OpenAiModel`) | - |
| `tool` | Typed tool trait, schemas, registry | tool.md |
| `middleware` | before/after hooks around agent, model, tool | middleware.md |
| `structured` | Typed/JSON-schema response extraction | structured-output.md |
| `stream` / model streaming | Token & event streaming | streaming.md |
| `usage` | Token accounting | usage.md |
| `cost` | Pricing + cost roll-ups | cost.md |
| `limits` / `retry` | Caps, timeouts, backoff, fallback, rate limit | limits-retry.md |
| `cache` | Local response cache + prompt-cache layout | cache.md |
| `memory` | Short-term thread memory / chat history | - |
| `embeddings` | Embedding models, vector stores, retrievers | embeddings.md |
| `store` | Pluggable persistence backends | store.md |
| `events` | Typed event stream + run status | observability.md |
| `subagent` | Agents-as-tools, reusable sessions | subagent-steering.md |
| `steering` | Typed runtime control of running agents | subagent-steering.md |
| `summarization` | Context-window-aware compaction | summarization.md |
| `cancel` | Cooperative `CancellationToken` | - |
| `testkit` | Fakes, recorders, trajectory asserts | testkit.md |

The harness deliberately does not depend on the graph module: you can call a model or run a tool loop without constructing a graph. The graph runtime depends on harness traits, not the other way around.
The default loop is implemented as inherent methods on
AgentHarness<State, Ctx> (src/harness/agent_loop/mod.rs). The canonical entry
points are:

- `invoke(state, ctx_data, config, input)`: full control over run identity and context data.
- `invoke_default(state, input)`: convenience wrapper that builds a default `RunConfig`.
- `invoke_in_context(state, ctx, input)`: runs inside an existing `RunContext`, which is how nested/child runs inherit parent identity.
- `invoke_streaming*` variants: same loop, emitting `ModelStreamItem`s.
- `*_with_status` variants: return the `AgentRun` plus a `HarnessRunStatus`.

The lifecycle (model→tool→model) is:

    input messages
      -> build RunContext, emit RunStarted
      -> before_agent middleware
      -> loop:
           enforce model-call cap + wall-clock deadline (fail-closed)
           build ModelRequest (messages + tool schemas + default response format)
           before_model middleware; emit ModelStarted
           resolve + invoke model (with retry + fallback)
           after_model middleware; emit ModelCompleted; fold Usage into AgentRun
           append assistant message
           if tool calls -> enforce tool cap, before_tool, run tools,
                            after_tool, append tool results, continue
           else          -> extract structured output (if configured) and break
      -> after_agent middleware; emit RunCompleted

On error the loop emits RunFailed, fans the error through on_error
middleware, and returns it. The loop is intentionally sleep-free: retry
backoff durations are computed via RetryPolicy::backoff_for_attempt but the
loop itself does not block, keeping tests deterministic.
The accumulated result is AgentRun (harness::middleware::AgentRun), holding
the final messages, folded Usage, and any extracted structured response.
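
The model→tool→model lifecycle can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the TinyAgents implementation: `Reply`, `Caps`, `LoopError`, and `run_loop` are hypothetical names, the model and tool are plain closures, and middleware, events, and retries are left out.

```rust
/// What the (hypothetical) model returns on each turn.
#[derive(Debug, Clone, PartialEq)]
pub enum Reply {
    /// The model asks for a tool to be run with the given argument.
    ToolCall { name: String, arg: String },
    /// The model produced a final answer.
    Final(String),
}

#[derive(Debug, PartialEq)]
pub enum LoopError {
    ModelCallCap,
    ToolCallCap,
}

/// Hypothetical stand-in for the per-run call caps.
pub struct Caps {
    pub max_model_calls: usize,
    pub max_tool_calls: usize,
}

/// Drives model -> tool -> model until the model stops requesting tools.
/// Each cap is checked before the corresponding call, so the loop fails closed.
pub fn run_loop(
    mut model: impl FnMut(&[String]) -> Reply,
    mut tool: impl FnMut(&str, &str) -> String,
    caps: &Caps,
    input: &str,
) -> Result<Vec<String>, LoopError> {
    let mut messages = vec![format!("user: {input}")];
    let (mut model_calls, mut tool_calls) = (0, 0);
    loop {
        if model_calls >= caps.max_model_calls {
            return Err(LoopError::ModelCallCap);
        }
        model_calls += 1;
        match model(&messages) {
            Reply::ToolCall { name, arg } => {
                messages.push(format!("assistant: call {name}({arg})"));
                if tool_calls >= caps.max_tool_calls {
                    return Err(LoopError::ToolCallCap);
                }
                tool_calls += 1;
                let result = tool(&name, &arg);
                messages.push(format!("tool: {result}"));
            }
            Reply::Final(text) => {
                messages.push(format!("assistant: {text}"));
                return Ok(messages);
            }
        }
    }
}
```

A model closure that requests one tool call and then answers yields a four-message transcript (user, assistant tool request, tool result, final assistant message); a model that never stops requesting tools ends in `LoopError::ToolCallCap`.
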
All model invocation goes through one trait (harness::model::ChatModel<State>)
with invoke (unary) and stream (incremental) methods. A call is described by
ModelRequest (built fluently with ModelRequest::new(messages).with_tools(…).with_tool_choice(…).with_response_format(…).with_model_hint(…).with_required_capabilities(…)) and answered by ModelResponse, which carries
the assistant message, Usage, finish reason, and a ResolvedModel recording
which concrete provider/model actually served the call.
Model selection is explicit and reusable. ModelRegistry resolves a call through
ModelSelection / ModelHint against a CapabilitySet (see ModelProfile,
Modalities, ModelStatus), and records the choice as a ResolvedModel with a
ModelResolutionSource. This lets a parent and its sub-agents reason about model
identity consistently across the recursion tree.
providers holds the adapters:

- `MockModel` / the `providers` test models (`echo`, `constant`, scripted responses): the default offline build.
- `OpenAiModel` (behind the `openai` Cargo feature). Despite the name, it speaks the OpenAI Chat Completions wire format to many hosts via `ProviderSpec` / `ProviderKind`: constructors include `deepseek`, `anthropic`, `groq`, `xai`, `openrouter`, `together`, `mistral`, `ollama`, and `compatible(...)` for any OpenAI-compatible endpoint, plus `from_env` / `from_spec_env`.

Tools implement harness::tool::Tool<State> with name, description,
schema() -> ToolSchema, and an async call(state, ctx, call) -> Result<ToolResult>.
ToolCall carries an id, name, and JSON arguments; ToolResult is built with
ToolResult::text(...) / ToolResult::error(...) and exposes is_error().
ToolRegistry keys tools by name and is consulted by the loop when an assistant
message requests a tool. ToolDelta carries incremental tool progress for
streaming.
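
To make the name-keyed dispatch concrete, here is a simplified sketch. It is an assumption-laden stand-in rather than the crate's API: the real trait is async and generic over `State`, while this version is synchronous and takes a string argument, and `Upper`, `dispatch`, and `as_str` are hypothetical.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a tool result: text plus an error flag.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolResult {
    text: String,
    is_error: bool,
}

impl ToolResult {
    pub fn text(text: impl Into<String>) -> Self {
        Self { text: text.into(), is_error: false }
    }
    pub fn error(text: impl Into<String>) -> Self {
        Self { text: text.into(), is_error: true }
    }
    pub fn is_error(&self) -> bool {
        self.is_error
    }
    pub fn as_str(&self) -> &str {
        &self.text
    }
}

/// Synchronous, string-argument simplification of a tool trait (assumption).
pub trait Tool {
    fn name(&self) -> &str;
    fn call(&self, args: &str) -> ToolResult;
}

/// Hypothetical example tool that upper-cases its argument.
pub struct Upper;

impl Tool for Upper {
    fn name(&self) -> &str {
        "upper"
    }
    fn call(&self, args: &str) -> ToolResult {
        ToolResult::text(args.to_uppercase())
    }
}

#[derive(Default)]
pub struct ToolRegistry {
    tools: HashMap<String, Box<dyn Tool>>,
}

impl ToolRegistry {
    /// Registers a tool under its own name; a later tool with the same name replaces it.
    pub fn register(&mut self, tool: Box<dyn Tool>) {
        self.tools.insert(tool.name().to_string(), tool);
    }

    /// Looks the tool up by name. An unknown name becomes an error result
    /// instead of a panic, so the caller can surface it to the model.
    pub fn dispatch(&self, name: &str, args: &str) -> ToolResult {
        match self.tools.get(name) {
            Some(tool) => tool.call(args),
            None => ToolResult::error(format!("unknown tool: {name}")),
        }
    }
}
```

Returning an error-flagged result for an unknown tool is a design choice of this sketch; the source does not say how the real registry reports a missing tool.
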
Crucially, a whole agent can be a tool — see Sub-agents.
harness::middleware::Middleware<State, Ctx> exposes the cross-cutting hooks the
loop fans out to: before_agent / after_agent, before_model /
on_model_delta / after_model, before_tool / on_tool_delta /
after_tool, and on_error. They are ordered through a MiddlewareStack
(before_* in registration order, after_* in reverse). HookCounts records
how often each hook fired. Built-in middleware lives here too:
LoggingMiddleware, MessageTrimMiddleware, ContextCompressionMiddleware,
PromptCacheGuardMiddleware, and UsageAccountingMiddleware.
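
The ordering rule (before hooks in registration order, after hooks in reverse) behaves like nested function calls. The sketch below illustrates only that rule; the two-hook trait, `Named`, `push`, and `around_model` are hypothetical simplifications and not the crate's `Middleware` or `MiddlewareStack` signatures.

```rust
/// Hypothetical two-hook middleware; each hook appends to a shared log.
pub trait Middleware {
    fn name(&self) -> &str;
    fn before_model(&self, log: &mut Vec<String>) {
        log.push(format!("before:{}", self.name()));
    }
    fn after_model(&self, log: &mut Vec<String>) {
        log.push(format!("after:{}", self.name()));
    }
}

/// A middleware that does nothing except identify itself in the log.
pub struct Named(pub &'static str);

impl Middleware for Named {
    fn name(&self) -> &str {
        self.0
    }
}

#[derive(Default)]
pub struct MiddlewareStack {
    layers: Vec<Box<dyn Middleware>>,
}

impl MiddlewareStack {
    /// Adds a layer; registration order is the order of `push` calls.
    pub fn push(mut self, layer: impl Middleware + 'static) -> Self {
        self.layers.push(Box::new(layer));
        self
    }

    /// Wraps one model call: before hooks run first-to-last,
    /// after hooks run last-to-first.
    pub fn around_model(&self, log: &mut Vec<String>, call: impl FnOnce(&mut Vec<String>)) {
        for layer in &self.layers {
            layer.before_model(log);
        }
        call(&mut *log);
        for layer in self.layers.iter().rev() {
            layer.after_model(log);
        }
    }
}
```

With layers `a` and `b` registered in that order, the log reads `before:a`, `before:b`, the model call, `after:b`, `after:a`, so the first-registered layer is the outermost wrapper.
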
StructuredExtractor extracts a typed value from the final ModelResponse
according to a StructuredStrategy (StructuredStrategy::for_profile(...)
chooses a provider-appropriate strategy). response_format_for_strategy(...)
maps a strategy onto a ResponseFormat (Text, json_schema, provider-native,
or a tool-call strategy). The parsed value lands in StructuredOutput, which
exposes as_value() and parse::<T>(). Set
RunPolicy::default_response_format to attach a schema to every model request in
a run; the loop runs extraction automatically before completing.
Two layers cooperate. At the provider edge, ChatModel::stream yields a
ModelStream of ModelStreamItems; StreamAccumulator (or the
collect_model_stream helper) folds them back into a ModelResponse. At the
harness edge, harness::stream exposes a StreamSink over StreamChunks with
selectable StreamModes (messages, tools, updates, events, final). The
invoke_streaming* agent-loop methods surface model deltas as the loop runs.
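
Folding stream items back into a single response is a plain accumulation. The following sketch shows the idea with hypothetical types (`StreamItem`, `Response`, `Accumulator`, `collect`); the actual `ModelStreamItem` variants and `StreamAccumulator` methods are not shown in this page and may differ.

```rust
/// Hypothetical stream items a provider might yield.
#[derive(Debug, Clone, PartialEq)]
pub enum StreamItem {
    TextDelta(String),
    Usage { input_tokens: u64, output_tokens: u64 },
    Finished,
}

/// Hypothetical folded response.
#[derive(Debug, Default, PartialEq)]
pub struct Response {
    pub text: String,
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub finished: bool,
}

#[derive(Default)]
pub struct Accumulator {
    response: Response,
}

impl Accumulator {
    /// Applies one item: text deltas are concatenated, usage is summed.
    pub fn push(&mut self, item: StreamItem) {
        match item {
            StreamItem::TextDelta(t) => self.response.text.push_str(&t),
            StreamItem::Usage { input_tokens, output_tokens } => {
                self.response.input_tokens += input_tokens;
                self.response.output_tokens += output_tokens;
            }
            StreamItem::Finished => self.response.finished = true,
        }
    }

    pub fn finish(self) -> Response {
        self.response
    }
}

/// Convenience helper: fold a whole sequence of items into one response.
pub fn collect(items: impl IntoIterator<Item = StreamItem>) -> Response {
    let mut acc = Accumulator::default();
    for item in items {
        acc.push(item);
    }
    acc.finish()
}
```

Because the accumulator is driven one item at a time, the same type can serve both a streaming caller (forward each delta, then call `finish`) and a unary caller (call `collect`).
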
Usage records input_tokens / output_tokens (plus cached-token bookkeeping)
and effective_total(). The loop folds each call's Usage into the AgentRun,
and UsageTotals aggregates across calls. cost::estimate_cost(pricing, usage)
turns a ModelPricing + Usage into CostTotals. This is where recursion
becomes measurable: because a child run executes through the parent's
RunContext/event sink, child usage and cost roll up into the parent's totals,
so an orchestrator can see the full cost of the subtree it spawned.
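
The roll-up can be pictured as a recursive sum over the run tree followed by a pricing step. This sketch assumes a per-million-token price and returns a bare `f64`; `RunNode`, `Pricing`, `subtree_usage`, and `estimate_cost_usd` are hypothetical, and the real `Usage`, `ModelPricing`, and `CostTotals` carry more fields (for example cached tokens).

```rust
/// Simplified usage record (the real one also tracks cached tokens).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct Usage {
    pub input_tokens: u64,
    pub output_tokens: u64,
}

impl Usage {
    /// Folds another usage record into this one.
    pub fn add(&mut self, other: Usage) {
        self.input_tokens += other.input_tokens;
        self.output_tokens += other.output_tokens;
    }
}

/// Hypothetical pricing in USD per million tokens.
pub struct Pricing {
    pub input_per_mtok: f64,
    pub output_per_mtok: f64,
}

/// Converts token counts into an estimated cost in USD.
pub fn estimate_cost_usd(pricing: &Pricing, usage: Usage) -> f64 {
    (usage.input_tokens as f64 * pricing.input_per_mtok
        + usage.output_tokens as f64 * pricing.output_per_mtok)
        / 1_000_000.0
}

/// A run with its own usage plus any child runs it spawned.
pub struct RunNode {
    pub own: Usage,
    pub children: Vec<RunNode>,
}

impl RunNode {
    /// Recursively folds the whole subtree's usage into one total.
    pub fn subtree_usage(&self) -> Usage {
        let mut total = self.own;
        for child in &self.children {
            total.add(child.subtree_usage());
        }
        total
    }
}
```

Pricing the subtree total once assumes every run in the tree used the same model; with mixed models, cost would have to be estimated per node and then summed.
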
RunLimits (built via with_max_model_calls, with_max_tool_calls,
with_max_wall_clock_ms, with_max_retries_per_call, with_max_concurrency,
and with_max_depth) is enforced fail-closed by a LimitTracker and by the
loop itself. Reaching a cap returns TinyAgentsError::LimitExceeded; the
deadline returns TinyAgentsError::Timeout.
max_depth is the recursion limit: it bounds how deep nested sub-agents (or
subgraphs) may recurse. An invocation whose child depth would exceed the cap
fails with TinyAgentsError::SubAgentDepth.
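
Fail-closed enforcement means the check happens before the work, and a reached cap is an error rather than a silent stop. A minimal sketch, assuming elapsed time is passed in by the caller so the logic stays deterministic; `Limits`, `Tracker`, `LimitError`, and the method names are hypothetical and only mirror the error split described above (cap versus deadline).

```rust
#[derive(Debug, PartialEq)]
pub enum LimitError {
    /// A counted cap was reached; the payload names the cap.
    LimitExceeded(&'static str),
    /// The wall-clock deadline passed.
    Timeout,
}

/// Hypothetical subset of the run limits.
pub struct Limits {
    pub max_model_calls: u32,
    pub max_tool_calls: u32,
    pub max_wall_clock_ms: u64,
}

pub struct Tracker {
    limits: Limits,
    model_calls: u32,
    tool_calls: u32,
}

impl Tracker {
    pub fn new(limits: Limits) -> Self {
        Self { limits, model_calls: 0, tool_calls: 0 }
    }

    /// Reserves one model call. The deadline is checked first, then the cap;
    /// the counter only advances when both checks pass.
    pub fn begin_model_call(&mut self, elapsed_ms: u64) -> Result<(), LimitError> {
        if elapsed_ms >= self.limits.max_wall_clock_ms {
            return Err(LimitError::Timeout);
        }
        if self.model_calls >= self.limits.max_model_calls {
            return Err(LimitError::LimitExceeded("max_model_calls"));
        }
        self.model_calls += 1;
        Ok(())
    }

    /// Reserves one tool call, refusing once the cap is reached.
    pub fn begin_tool_call(&mut self) -> Result<(), LimitError> {
        if self.tool_calls >= self.limits.max_tool_calls {
            return Err(LimitError::LimitExceeded("max_tool_calls"));
        }
        self.tool_calls += 1;
        Ok(())
    }
}
```

Passing `elapsed_ms` in, instead of reading a clock inside the tracker, is a choice made here to keep the example testable without sleeping.
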
retry provides RetryPolicy (attempts, backoff, multiplier, jitter), the
is_retryable(err) classifier (only network/timeout/rate-limit/5xx by default),
FallbackPolicy (an ordered model fallback chain — next_after(current)), and a
token-bucket RateLimiter.
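
Computing a backoff without sleeping, and walking an ordered fallback chain, can both be expressed as pure functions. The sketch below assumes an integer multiplier, a cap on the delay, and no jitter; the field names, `FallbackChain`, and the exact signatures are assumptions, not the crate's definitions.

```rust
use std::time::Duration;

/// Hypothetical retry policy: exponential backoff with a ceiling, no jitter.
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base: Duration,
    pub multiplier: u32,
    pub max_backoff: Duration,
}

impl RetryPolicy {
    /// Whether another attempt is allowed after `attempt` attempts have failed.
    pub fn should_retry(&self, attempt: u32) -> bool {
        attempt < self.max_attempts
    }

    /// Delay before retry number `attempt` (1-based). Pure computation:
    /// the caller decides whether to actually wait.
    pub fn backoff_for_attempt(&self, attempt: u32) -> Duration {
        let exponent = attempt.saturating_sub(1);
        let factor = self.multiplier.saturating_pow(exponent);
        self.base.saturating_mul(factor).min(self.max_backoff)
    }
}

/// Ordered list of model names to try, most preferred first.
pub struct FallbackChain(pub Vec<&'static str>);

impl FallbackChain {
    /// The model to try after `current`, or `None` if `current` is last
    /// or is not part of the chain.
    pub fn next_after(&self, current: &str) -> Option<&'static str> {
        let idx = self.0.iter().position(|m| *m == current)?;
        self.0.get(idx + 1).copied()
    }
}
```

With a 100 ms base, multiplier 2, and a 350 ms ceiling, the delays for attempts 1 through 3 are 100 ms, 200 ms, and 350 ms. Saturating arithmetic keeps a large attempt number from overflowing.
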
cache separates two distinct ideas:

- Local response cache: `ResponseCache` (with `InMemoryResponseCache`), keyed by `cache_key(request)`. Attach it with `AgentHarness::with_response_cache`; the loop then checks it before each provider call and stores successful responses, emitting `cache.hit` / `cache.miss` events. Because the cache is owned by the harness, a repeated identical request, even across separate runs, can be served from an earlier result.
- Provider prompt/KV-cache layout: `PromptCacheLayout` (and `CacheLayoutEvent`) make the stable prompt prefix explicit so middleware that edits model-visible prompt segments can be detected (`is_prefix_stable_against`). `CachePolicy` gates both behaviors.

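
The check-then-store behavior of a local response cache can be sketched as follows. Everything here is illustrative: `Request`, `InMemoryCache`, and `get_or_call` are hypothetical, responses are plain strings, and the key uses `DefaultHasher`, whose output is not guaranteed stable across Rust releases, so a persistent cache would need a stable hash instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical request: the fields that determine the response.
#[derive(Hash)]
pub struct Request {
    pub model: String,
    pub messages: Vec<String>,
}

/// Derives a cache key from the request contents (in-process use only).
pub fn cache_key(req: &Request) -> u64 {
    let mut hasher = DefaultHasher::new();
    req.hash(&mut hasher);
    hasher.finish()
}

#[derive(Default)]
pub struct InMemoryCache {
    entries: HashMap<u64, String>,
    pub hits: u32,
    pub misses: u32,
}

impl InMemoryCache {
    /// Returns the cached response if present; otherwise calls the provider
    /// and stores the response only when the call succeeds.
    pub fn get_or_call(
        &mut self,
        req: &Request,
        provider: impl FnOnce(&Request) -> Result<String, String>,
    ) -> Result<String, String> {
        let key = cache_key(req);
        if let Some(hit) = self.entries.get(&key) {
            self.hits += 1;
            return Ok(hit.clone());
        }
        self.misses += 1;
        let response = provider(req)?;
        self.entries.insert(key, response.clone());
        Ok(response)
    }
}
```

Sending the same request twice calls the provider once: the first call is a miss that populates the cache, the second is a hit that returns the stored response.
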
memory owns conversation continuity: the ChatHistory trait
(InMemoryChatHistory, StoreChatHistory<S>) and ShortTermMemory<H>, which is
loaded before a loop and saved after, optionally trimmed. MemoryScope
distinguishes short- vs long-term data.
embeddings provides retrieval-augmented context: the EmbeddingModel trait
(MockEmbeddingModel; OpenAiEmbeddingModel behind the feature), a VectorStore
trait (InMemoryVectorStore), cosine_similarity, and a Retriever whose
index(docs) / retrieve(query, top_k) -> Vec<ScoredDoc> close the loop. In the
RLM framing, retrieval is how an agent pulls snippets of a large external
environment into context instead of stuffing the whole thing into one window.
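
Retrieval reduces to scoring every indexed vector against the query and keeping the best `top_k`. A brute-force sketch, assuming embeddings are already computed and stored as `Vec<f32>`; the `retrieve` signature and the `ScoredDoc` fields here are assumptions, and the mismatched-length and zero-vector conventions are choices made for this example.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 for mismatched lengths or a zero vector (a convention chosen here).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// A document id (its position in the index) with its similarity score.
#[derive(Debug, Clone, PartialEq)]
pub struct ScoredDoc {
    pub id: usize,
    pub score: f32,
}

/// Scores every indexed vector against the query and returns the
/// `top_k` best matches, highest score first.
pub fn retrieve(index: &[Vec<f32>], query: &[f32], top_k: usize) -> Vec<ScoredDoc> {
    let mut scored: Vec<ScoredDoc> = index
        .iter()
        .enumerate()
        .map(|(id, vector)| ScoredDoc { id, score: cosine_similarity(vector, query) })
        .collect();
    scored.sort_by(|a, b| b.score.total_cmp(&a.score));
    scored.truncate(top_k);
    scored
}
```

A linear scan like this is fine for small in-memory indexes; larger corpora usually call for an approximate nearest-neighbor structure behind the same interface.
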
This is the core recursive surface. A SubAgent<State, Ctx> wraps an
Arc<AgentHarness> plus a stable name, description, and optional
system_prompt. Invoking it always produces a child run one level deeper in
the recursion tree:

- `SubAgent::invoke` / `invoke_with_events` run a fresh child loop.
- `SubAgent::invoke_in_parent(state, ctx_data, parent, input)` threads the live parent `RunContext`, so the child inherits the parent's depth and event sink: the child runs at `parent.depth() + 1` and its events (and usage) surface on the parent's stream. It emits `SubAgentStarted` / `SubAgentCompleted` around the child loop, making the recursion observable.
- `SubAgentTool` adapts a sub-agent into a `Tool`, so a parent model can call an entire agent the same way it calls any tool (a model calling a model). The child input is read from the `SUBAGENT_INPUT_FIELD` (`"input"`) argument. Depth is fixed at construction (`with_parent_depth`) because the `Tool` trait gives `call` no live parent context.
- `SubAgentSession` keeps the same sub-agent alive across turns, accumulating a transcript (post-completion reuse, e.g. human-in-the-loop). Each reuse emits `SubAgentReused`.

Depth is bounded by RunLimits::max_depth; overrunning it yields
SubAgentError::SubAgentDepth. The graph analogue is graph::subgraph
(SubAgentNode), where a node embeds another compiled graph with the same depth
tracking.
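
The combination of depth inheritance, a shared event sink, and a depth cap can be sketched with a small recursive structure. This is an illustration under assumptions: `Ctx`, the two-variant `Event`, and this `SubAgent` (which simply invokes its children) are hypothetical and far simpler than the real types.

```rust
/// Hypothetical events emitted around each child run.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    SubAgentStarted { name: String, depth: u32 },
    SubAgentCompleted { name: String, depth: u32 },
}

/// Hypothetical run context: current depth, the cap, and a shared sink.
pub struct Ctx<'a> {
    pub depth: u32,
    pub max_depth: u32,
    pub sink: &'a mut Vec<Event>,
}

/// A sub-agent that, for illustration, just invokes its own children.
pub struct SubAgent {
    pub name: String,
    pub children: Vec<SubAgent>,
}

impl SubAgent {
    /// Runs this agent as a child of `parent`: one level deeper, same sink.
    /// Fails before emitting anything if the child depth would exceed the cap.
    pub fn invoke_in_parent(&self, parent: &mut Ctx<'_>) -> Result<(), String> {
        let depth = parent.depth + 1;
        if depth > parent.max_depth {
            return Err(format!(
                "{}: depth {} exceeds max {}",
                self.name, depth, parent.max_depth
            ));
        }
        parent.sink.push(Event::SubAgentStarted { name: self.name.clone(), depth });
        // The child context reborrows the parent's sink, so every event in
        // the subtree lands on the same stream.
        let mut child_ctx = Ctx {
            depth,
            max_depth: parent.max_depth,
            sink: &mut *parent.sink,
        };
        for child in &self.children {
            child.invoke_in_parent(&mut child_ctx)?;
        }
        parent.sink.push(Event::SubAgentCompleted { name: self.name.clone(), depth });
        Ok(())
    }
}
```

Running a two-level tree from a root context at depth 0 with `max_depth` 2 produces four events on one sink, nested like matching brackets; lowering `max_depth` to 1 makes the inner invocation fail.
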
Steering is typed runtime control of an already-running agent (distinct from
sub-agent reuse, which acts between runs). An orchestrator — a parent agent, a
human UI, a graph supervisor, or a test — holds a cloneable SteeringHandle,
calls send(command), and the loop drains pending commands at a safe
checkpoint (before each model call). Commands are SteeringCommand: Pause,
Resume, Cancel, InjectMessage, Redirect { instruction }, and
SetMetadata. Each is gated by a SteeringPolicy allowlist
(SteeringPolicy::new permits nothing by default; allow_all for tests), and a
disallowed command is rejected with TinyAgentsError::Steering. Applying a batch
yields a SteeringOutcome (Continue / Pause / Cancel). Commands are
Serialize/Deserialize, so steering can be logged, transported, and replayed.
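
A deny-by-default allowlist plus a drain-at-checkpoint step might look like the sketch below. The names (`Command`, `Kind`, `Policy`, `Outcome`, `drain`) are hypothetical, only four of the commands are modeled, and rejecting the whole batch on the first disallowed command is an assumption of this example.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical subset of steering commands.
#[derive(Debug, Clone, PartialEq)]
pub enum Command {
    Pause,
    Resume,
    Cancel,
    InjectMessage(String),
}

/// Payload-free command kind, used as the allowlist key.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Kind {
    Pause,
    Resume,
    Cancel,
    InjectMessage,
}

impl Command {
    pub fn kind(&self) -> Kind {
        match self {
            Command::Pause => Kind::Pause,
            Command::Resume => Kind::Resume,
            Command::Cancel => Kind::Cancel,
            Command::InjectMessage(_) => Kind::InjectMessage,
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Continue,
    Pause,
    Cancel,
}

/// Allowlist policy: a new policy permits nothing.
#[derive(Default)]
pub struct Policy {
    allowed: HashSet<Kind>,
}

impl Policy {
    pub fn new() -> Self {
        Self::default()
    }
    pub fn allow(mut self, kind: Kind) -> Self {
        self.allowed.insert(kind);
        self
    }
}

/// Drains pending commands at a checkpoint (for example, before a model call).
/// A disallowed command rejects the batch; `Cancel` short-circuits.
pub fn drain(
    queue: &mut VecDeque<Command>,
    policy: &Policy,
    messages: &mut Vec<String>,
) -> Result<Outcome, String> {
    let mut outcome = Outcome::Continue;
    while let Some(cmd) = queue.pop_front() {
        if !policy.allowed.contains(&cmd.kind()) {
            return Err(format!("command not allowed: {:?}", cmd.kind()));
        }
        match cmd {
            Command::Pause => outcome = Outcome::Pause,
            Command::Resume => outcome = Outcome::Continue,
            Command::Cancel => return Ok(Outcome::Cancel),
            Command::InjectMessage(text) => messages.push(text),
        }
    }
    Ok(outcome)
}
```

Draining only at a checkpoint means a command never interrupts a model or tool call midway; it takes effect at the next safe boundary.
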
summarization keeps the working transcript inside the model's context window.
estimate_tokens(text) and TokenEstimate provide a cheap budget;
trim_messages(messages, strategy) applies a TrimStrategy; and
SummarizationPolicy decides when to compact. A policy can be derived from the
model's own profile — SummarizationPolicy::from_profile(profile, threshold) and
with_context_window(max_input_tokens) — so the trigger budget scales with the
target model. should_summarize(messages) and plan(messages) split the
transcript into a "summarize these / keep these" partition, the Summarizer
trait (with ConcatSummarizer) produces the summary, and SummaryRecord /
CompressionProvenance record what was compressed so the compaction is
auditable. This matters for recursion: deep sub-agent transcripts stay bounded
instead of blowing the parent's budget.
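
The "summarize these / keep these" partition can be sketched with a token estimate and a split point. The four-characters-per-token heuristic, the `keep_last` field, and the `Policy`, `Plan`, and `concat_summary` names are assumptions for illustration, not the crate's estimator or types.

```rust
/// Rough heuristic: about four characters per token, rounded up (assumption).
pub fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

/// Hypothetical policy: compact once the transcript exceeds `trigger_tokens`,
/// always keeping the most recent `keep_last` messages verbatim.
pub struct Policy {
    pub trigger_tokens: usize,
    pub keep_last: usize,
}

/// The two halves of a compaction plan, borrowed from the transcript.
pub struct Plan<'a> {
    pub summarize: &'a [String],
    pub keep: &'a [String],
}

impl Policy {
    pub fn should_summarize(&self, messages: &[String]) -> bool {
        let total: usize = messages.iter().map(|m| estimate_tokens(m)).sum();
        total > self.trigger_tokens
    }

    /// Splits the transcript into an older part to summarize and a recent
    /// part to keep. Returns `None` when no compaction is needed or possible.
    pub fn plan<'a>(&self, messages: &'a [String]) -> Option<Plan<'a>> {
        if !self.should_summarize(messages) || messages.len() <= self.keep_last {
            return None;
        }
        let split = messages.len() - self.keep_last;
        let (summarize, keep) = messages.split_at(split);
        Some(Plan { summarize, keep })
    }
}

/// Trivial summarizer: joins the messages and records how many were folded in.
pub fn concat_summary(messages: &[String]) -> String {
    format!("[summary of {} messages] {}", messages.len(), messages.join(" | "))
}
```

Four 40-character messages estimate to 40 tokens; with a 25-token trigger and `keep_last` set to 2, the plan summarizes the first two messages and keeps the last two.
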
events is the observability spine. AgentEvent enumerates lifecycle
boundaries (run, model, tool, middleware, sub-agent, retry, custom). Events flow
through an EventSink to EventListeners (e.g. RecordingListener), are
journaled in an EventJournal (with replay_from(offset)), and the run's
overall state is HarnessRunStatus. Because child runs share the parent's
EventSink, a single stream shows the entire recursion tree — parent and child
model/tool/sub-agent events interleaved with correct depth annotations.
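
An append-only journal with offset-based replay, plus depth-indented rendering, is enough to see the recursion tree in one stream. In this sketch `Event`, `Journal`, `append`, and `render` are hypothetical; only the idea of `replay_from(offset)` comes from the text above, and its real signature may differ.

```rust
/// Hypothetical event: a kind label plus the depth of the run that emitted it.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub depth: u32,
    pub kind: String,
}

/// Append-only journal; an event's offset is its index.
#[derive(Default)]
pub struct Journal {
    entries: Vec<Event>,
}

impl Journal {
    /// Appends an event and returns its offset.
    pub fn append(&mut self, event: Event) -> usize {
        self.entries.push(event);
        self.entries.len() - 1
    }

    /// Returns every event at or after `offset`; an offset past the end
    /// yields an empty slice instead of panicking.
    pub fn replay_from(&self, offset: usize) -> &[Event] {
        self.entries.get(offset..).unwrap_or(&[])
    }
}

/// Renders events with two spaces of indentation per depth level.
pub fn render(events: &[Event]) -> Vec<String> {
    events
        .iter()
        .map(|e| format!("{}{}", "  ".repeat(e.depth as usize), e.kind))
        .collect()
}
```

A consumer that remembers the last offset it processed can reconnect and call `replay_from` with that offset to catch up without re-reading the whole run.
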
harness::testkit makes the loop deterministically testable: ScriptedModel /
StreamingMock / SlowModel fake providers, FakeTool, DeterministicClock,
DeterministicIds, EventRecorder, and a Trajectory assertion helper
(from_events(...).assert_tool_called(...), assert_model_called_times(n),
assert_order(&[...]), assert_completed()). Trajectory assertions are how you
verify the shape of a recursive run — that the orchestrator called the model,
then a sub-agent tool, then the model again, and completed.
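
An order assertion over recorded events is essentially a subsequence check. The sketch below returns booleans instead of panicking so it can be composed freely; `Trajectory` here is a hypothetical string-based stand-in, and `count` and `in_order` are not the testkit's method names.

```rust
/// Hypothetical trajectory: the ordered list of step labels from one run.
pub struct Trajectory {
    steps: Vec<String>,
}

impl Trajectory {
    pub fn from_events<I, S>(events: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        Self { steps: events.into_iter().map(Into::into).collect() }
    }

    /// How many times a step with this exact label occurred.
    pub fn count(&self, step: &str) -> usize {
        self.steps.iter().filter(|s| s.as_str() == step).count()
    }

    /// True if `expected` occurs as an ordered subsequence of the steps;
    /// unrelated steps may appear in between.
    pub fn in_order(&self, expected: &[&str]) -> bool {
        let mut remaining = self.steps.iter();
        expected
            .iter()
            .all(|want| remaining.any(|got| got.as_str() == *want))
    }
}
```

The single shared iterator is what enforces order: each expected step must be found after the position where the previous one matched.
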

- Graph Runtime: durable typed state graphs and subgraphs.
- Providers: configuring hosted providers behind `openai`.
- Examples: annotated, runnable catalog.
- `docs/modules/harness/README.md`: the full module specification.