fix(agent): guard agent prompts against model max-context limit (#2074)#2100
Conversation
…humansai#2074) Estimate conversation tokens before LLM dispatch using known per-model context windows, trim oldest non-system history when over budget, and log warnings with original vs trimmed counts. Co-authored-by: Cursor <cursoragent@cursor.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds model-context-window discovery, token estimation, and pre-dispatch budgeting: resolves per-model context windows, estimates tokens, trims oldest non-system messages (preserving system messages) in conversation history and provider-bound chat before each LLM request, and surfaces trimming outcomes via logs. ChangesToken Budgeting and Context Window Integration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/inference/provider/router.rs (1)
63-69:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Route.context_windowis discarded during router construction.At Line 63 onward,
RouterProvideronly stores(provider_index, model), so the newcontext_windowmetadata is dropped immediately. That prevents downstream budget enforcement from using the configured routed-model window.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/inference/provider/router.rs` around lines 63 - 69, The router is dropping Route.context_window because resolved_routes is built as HashMap<String, (usize, String)> and only stores (provider_index, model); update the resolved_routes shape and the RouterProvider construction to preserve the context window (e.g., include a third element or a small struct/tuple containing the context_window), propagate Route.context_window from routes.into_iter() (using name_to_index to map provider_name) into that stored value, and adjust any uses of RouterProvider/resolved_routes to read the context window for downstream budget enforcement (referencing resolved_routes, Route.context_window, name_to_index, and RouterProvider).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent/harness/session/turn.rs`:
- Around line 427-440: The pre-dispatch trimming is using
context_window_for_model(&effective_model) (and the same pattern around the
later block at 528-541) which uses the alias/tier’s window instead of the actual
routed provider model’s window; change the call sites to compute the context
window from the resolved provider model ID used for the provider call in this
iteration (the same resolved model/route value used when making the provider
request) and pass that into trim_conversation_history_to_budget instead of
&effective_model so trimming reflects the real upstream model limit.
In `@src/openhuman/agent/harness/token_budget.rs`:
- Around line 111-143: The current logic in trim_messages_to_budget (the block
using system_messages and other_messages, is_system, estimate, and max_tokens)
rebuilds messages as system_messages followed by other_messages which breaks
original ordering when system messages appear after user/other messages; instead
identify non-system message indices in the original messages vector, compute how
many oldest non-system indices must be removed to meet max_tokens (using
estimate and keeping all system messages), then reconstruct *messages by
iterating the original messages in order and skipping only the selected
non-system indices (removing oldest-first) so relative order of all remaining
messages is preserved; update any counters like removed accordingly.
---
Outside diff comments:
In `@src/openhuman/inference/provider/router.rs`:
- Around line 63-69: The router is dropping Route.context_window because
resolved_routes is built as HashMap<String, (usize, String)> and only stores
(provider_index, model); update the resolved_routes shape and the RouterProvider
construction to preserve the context window (e.g., include a third element or a
small struct/tuple containing the context_window), propagate
Route.context_window from routes.into_iter() (using name_to_index to map
provider_name) into that stored value, and adjust any uses of
RouterProvider/resolved_routes to read the context window for downstream budget
enforcement (referencing resolved_routes, Route.context_window, name_to_index,
and RouterProvider).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7a6f7f6d-758a-44a2-8ea3-7b6460aae33f
📒 Files selected for processing (9)
src/openhuman/agent/harness/mod.rssrc/openhuman/agent/harness/session/turn.rssrc/openhuman/agent/harness/token_budget.rssrc/openhuman/agent/harness/tool_loop.rssrc/openhuman/inference/mod.rssrc/openhuman/inference/model_context.rssrc/openhuman/inference/provider/ops.rssrc/openhuman/inference/provider/router.rssrc/openhuman/inference/provider/router_test.rs
| if let Some(context_window) = context_window_for_model(&effective_model) { | ||
| let budget_outcome = | ||
| trim_conversation_history_to_budget(&mut self.history, context_window); | ||
| if budget_outcome.trimmed { | ||
| log::warn!( | ||
| "[agent_loop] pre-dispatch history trimmed model={} context_window={} original_tokens={} final_tokens={} messages_removed={}", | ||
| effective_model, | ||
| context_window, | ||
| budget_outcome.original_tokens, | ||
| budget_outcome.final_tokens, | ||
| budget_outcome.messages_removed | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
Budgeting uses alias model window, not the actually routed model window.
Both trims use context_window_for_model(&effective_model). When effective_model is a tier alias (e.g., reasoning-v1) but routing resolves to a different provider model, the budget can be wrong and still exceed the real upstream limit.
The pre-dispatch budget should use the resolved route/model context window (or resolved model id) used by the provider call in the same iteration.
Also applies to: 528-541
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/agent/harness/session/turn.rs` around lines 427 - 440, The
pre-dispatch trimming is using context_window_for_model(&effective_model) (and
the same pattern around the later block at 528-541) which uses the alias/tier’s
window instead of the actual routed provider model’s window; change the call
sites to compute the context window from the resolved provider model ID used for
the provider call in this iteration (the same resolved model/route value used
when making the provider request) and pass that into
trim_conversation_history_to_budget instead of &effective_model so trimming
reflects the real upstream model limit.
| let mut system_messages: Vec<T> = Vec::new(); | ||
| let mut other_messages: Vec<T> = Vec::new(); | ||
| for msg in messages.drain(..) { | ||
| if is_system(&msg) { | ||
| system_messages.push(msg); | ||
| } else { | ||
| other_messages.push(msg); | ||
| } | ||
| } | ||
|
|
||
| let mut removed = 0usize; | ||
| while !other_messages.is_empty() { | ||
| let total: usize = system_messages | ||
| .iter() | ||
| .map(&estimate) | ||
| .chain(other_messages.iter().map(&estimate)) | ||
| .sum(); | ||
| if total <= max_tokens { | ||
| break; | ||
| } | ||
| other_messages.remove(0); | ||
| removed += 1; | ||
| } | ||
|
|
||
| let final_tokens: usize = system_messages | ||
| .iter() | ||
| .map(&estimate) | ||
| .chain(other_messages.iter().map(&estimate)) | ||
| .sum(); | ||
|
|
||
| *messages = system_messages; | ||
| messages.extend(other_messages); | ||
|
|
There was a problem hiding this comment.
Preserve original message order while trimming.
trim_messages_to_budget currently rebuilds the vector as system_messages + other_messages, which reorders history if any system message appears after non-system messages. That can change prompt semantics unexpectedly. Keep original relative order and only drop selected non-system indices oldest-first.
Suggested fix
- let mut system_messages: Vec<T> = Vec::new();
- let mut other_messages: Vec<T> = Vec::new();
- for msg in messages.drain(..) {
- if is_system(&msg) {
- system_messages.push(msg);
- } else {
- other_messages.push(msg);
- }
- }
+ let mut kept: Vec<T> = messages.drain(..).collect();
+ let mut removable_positions: Vec<usize> = kept
+ .iter()
+ .enumerate()
+ .filter_map(|(idx, msg)| (!is_system(msg)).then_some(idx))
+ .collect();
- let mut removed = 0usize;
- while !other_messages.is_empty() {
- let total: usize = system_messages
- .iter()
- .map(&estimate)
- .chain(other_messages.iter().map(&estimate))
- .sum();
+ let mut removed = 0usize;
+ while !removable_positions.is_empty() {
+ let total: usize = kept.iter().map(&estimate).sum();
if total <= max_tokens {
break;
}
- other_messages.remove(0);
+ let remove_at = removable_positions.remove(0) - removed;
+ kept.remove(remove_at);
removed += 1;
}
- let final_tokens: usize = system_messages
- .iter()
- .map(&estimate)
- .chain(other_messages.iter().map(&estimate))
- .sum();
-
- *messages = system_messages;
- messages.extend(other_messages);
+ let final_tokens: usize = kept.iter().map(&estimate).sum();
+ *messages = kept;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/agent/harness/token_budget.rs` around lines 111 - 143, The
current logic in trim_messages_to_budget (the block using system_messages and
other_messages, is_system, estimate, and max_tokens) rebuilds messages as
system_messages followed by other_messages which breaks original ordering when
system messages appear after user/other messages; instead identify non-system
message indices in the original messages vector, compute how many oldest
non-system indices must be removed to meet max_tokens (using estimate and
keeping all system messages), then reconstruct *messages by iterating the
original messages in order and skipping only the selected non-system indices
(removing oldest-first) so relative order of all remaining messages is
preserved; update any counters like removed accordingly.
graycyrus
left a comment
There was a problem hiding this comment.
Review — graycyrus
Solid fix for a real production issue (160 Sentry events from context-window overflows). The approach is sound: look up per-model context windows, estimate tokens, trim oldest non-system messages before dispatch, and log when trimming occurs. Tests cover the key paths, CI is green, and the ContextGuard integration in tool_loop.rs is a nice touch.
CodeRabbit already flagged the alias-vs-routed-model mismatch in turn.rs and the message reordering in trim_messages_to_budget — both valid, won't repeat here.
Change summary
| File | Change | Description |
|---|---|---|
agent/harness/mod.rs |
module registration | Adds mod token_budget |
agent/harness/session/turn.rs |
feature | Pre-dispatch trimming of conversation history + provider messages |
agent/harness/token_budget.rs |
new module | Generic trim-to-budget with ~4 chars/token estimation, 5 unit tests |
agent/harness/tool_loop.rs |
feature | Pre-dispatch trimming + ContextGuard seeded with known window |
inference/mod.rs |
wiring | Exports model_context module |
inference/model_context.rs |
new module | Model-to-context-window lookup table + tier alias resolution, 4 unit tests |
inference/provider/ops.rs |
wiring | Populates Route::context_window from context_window_for_model |
inference/provider/router.rs |
schema | Adds context_window: Option<u64> to Route |
inference/provider/router_test.rs |
test fix | Adds context_window: None to test fixture |
Additional finding
Trim + warn log pattern duplication (minor): The same lookup → trim → warn-log block is repeated three times across turn.rs (lines 427-438 and 528-541) and tool_loop.rs (lines 243-264). Consider extracting a small helper (e.g. try_trim_to_budget(messages, model, label)) to keep the log format and reserve logic in one place — right now a change to the log fields or reserve policy requires updating three call sites.
| ("gpt-4-turbo", 128_000), | ||
| ("gpt-4", 128_000), | ||
| ("gpt-3.5", 16_385), | ||
| ("o1", 200_000), |
There was a problem hiding this comment.
[major] "o1" and "o3" are dangerously broad for a contains() match.
Any model ID that happens to include the substring o1 or o3 will get a 200K context window — e.g. a local model named "solo1-7b", "proto3-chat", or a provider-prefixed path like "ollama/mistral-for-o1-benchmark". Since this sets the budget for trimming, a false positive means either:
- Under-trimming (real window < 200K) → the exact 400 errors this PR fixes
- Over-trimming (real window > 200K) → unnecessary context loss
Suggestion: use word-boundary-aware matching or exact prefixes:
// Option A: require the pattern to be a full segment
("o1", 200_000), // match "o1", "o1-mini", "o1-preview", NOT "proto3"
("o3", 200_000), // match "o3", "o3-mini", NOT "solo3"// In the matching logic:
let segments: Vec<&str> = lower.split(&['/', '-', '_', ':'][..]).collect();
for (pattern, window) in MODEL_CONTEXT_PATTERNS {
if segments.iter().any(|s| s == pattern) || lower.contains(pattern) {
// ...
}
}Or simply use longer patterns: "o1-", "o1" (exact), "/o1" depending on how model IDs are formatted in practice.
trim_messages_to_budget previously rebuilt the vector as system_messages ++ other_messages, which reordered history whenever a system message appeared after non-system messages — changing prompt semantics unexpectedly. Walk the original vector instead and drop only the oldest non-system indices, preserving relative order of every retained message. Adds a regression test for a late-positioned system message. Addresses CodeRabbit review on token_budget.rs:143.
…ives The "o1" and "o3" substring patterns matched any model id containing those two characters — e.g. "solo1-7b", "proto3-chat", or "ollama/octo3thing" would inherit OpenAI's 200K window and either under-trim (real window < 200K, exact 400-error class this PR is meant to fix) or over-trim. Mark these patterns as segment-matched so they only trigger when "o1"/"o3" appears as a full `-`/`_`/`/`/`:`/`.` delimited segment. Adds tests for legitimate ids (`o1`, `o1-mini`, `o3-mini`, `openai/o1-preview`) and false-positive guards (`solo1-7b`, `proto3-chat`, `octo3thing`). Addresses graycyrus review on model_context.rs:32.
| "[agent_loop] pre-dispatch token budget ok" | ||
| ); | ||
| } | ||
| } |
There was a problem hiding this comment.
Noted — deferring to a follow-up.
Agreed the trim+warn block is duplicated across three call sites. The cleanest extraction needs to handle both ChatMessage and ConversationMessage (different trim functions), plus the ContextGuard seeding in tool_loop, so it's a small refactor rather than a one-liner. Keeping this PR scoped to the production fix; will follow up with a try_trim_to_budget(label, model, …) helper in a separate change.
| self.history.len() | ||
| ); | ||
|
|
||
| if let Some(context_window) = context_window_for_model(&effective_model) { |
There was a problem hiding this comment.
Deferred — needs router plumbing.
This is a valid concern, but fixing it cleanly requires plumbing the resolved-route context window from RouterProvider out to the harness (the harness only sees the alias model id; route resolution happens inside Provider::chat). Route::context_window is now populated, but no public API exposes it to the caller.
For the OpenHuman backend path (no model_routes), context_window_for_model(alias) already returns the correct tier window (200K for reasoning-v1 / agentic-v1 / coding-v1) via tier_context_window, so this PR's behavior is correct there.
The remaining gap — user-configured model_routes pointing a tier at a smaller-context custom-provider model (e.g. reasoning-v1 → gpt-3.5-turbo) — would trim too late. Worth a follow-up that exposes the resolved context window through Provider (e.g. fn resolve_context_window(&self, model: &str) -> Option<u64>) and threads it into the harness. Tracking separately to keep this PR scoped to the 200K-haiku Sentry fix.
…humansai#2074) (tinyhumansai#2100) Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
…humansai#2074) (tinyhumansai#2100) Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
Summary
inference::model_contextto resolve known per-model context windows (OpenHuman tier aliases, Copilot Haiku 200K, common Claude/GPT patterns).agent::harness::token_budgetto estimate prompt tokens (~4 chars/token) and trim oldest non-system history before LLM dispatch.run_tool_call_loopand the mainAgentsession turn path.Route::context_windowwhen building model routes from config.Problem
The agent harness built conversation context (system prompt + history + tool results) without checking total token count against the target model's context window before dispatching to the LLM. Long conversations or large tool outputs exceeded limits (e.g.
github_copilot/claude-haiku-4.5with a 200K cap), causing upstream400 BadRequesterrors (~160 Sentry events on Windows).Solution
Before each provider
chat()call, look up the model's context window viacontext_window_for_model, estimate total prompt tokens, and drop the oldest non-system messages until the payload fits (reserving ~10% / 8K tokens for completion and tool-schema overhead). When trimming occurs, emit alog::warn!with model id, context window, original vs final token estimates, and messages removed.ContextGuardis seeded with the known window when available so post-response utilization checks remain meaningful.Submission Checklist
.github/workflows/coverage.yml(Vitest + cargo-llvm-cov merged viadiff-cover). Local fullpnpm test:coverage/pnpm test:rustnot run in this session; new Rust modules include dedicated unit tests for pass-through, over-limit truncation, and model lookup edge cases. CI is the authoritative gate.## Related— N/A: no matrix feature IDsdocs/RELEASE-MANUAL-SMOKE.md) — N/ACloses #NNNin the## RelatedsectionImpact
ContextGuard/ autocompaction behavior is unchanged.Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
cursor/a01-2074-agent-token-budget168f0cd5976111b3fb5d71b951918bc56bad98feValidation Run
pnpm --filter openhuman-app format:checkpnpm typecheckcargo test --lib token_budget::(5 passed),cargo test --lib model_context::(4 passed)cargo fmt --manifest-path Cargo.toml --all --check; pre-pushpnpm rust:check(Tauri shell) passedValidation Blocked
command:pnpm test:coverageandpnpm test:rust(full diff-cover gate)error:Not run locally in this session — CI will enforce ≥80% on changed linesimpact:Merge gate depends on GitHub Actions coverage workflow; new Rust modules include dedicated unit tests for pass-through, over-limit truncation, and model lookup edge casesBehavior Changes
Parity Contract
ContextGuardstill updated from provider usage; trimming only runs whencontext_window_for_modelreturnsSomeDuplicate / Superseded PR Handling
Made with Cursor
Summary by CodeRabbit