Summary
Conversation managers currently rely solely on message count (window_size) to decide when to reduce context. This is a coarse heuristic — a conversation with 10 messages containing large tool results can exceed a model's context window, while 100 short messages may fit comfortably.
Related Issues
This proposal addresses or partially addresses several existing feature requests:
- A shared estimate_tokens() utility and pluggable TokenCounter type for conversation managers
- Proactive reduction via max_context_tokens and a BeforeModelCallEvent hook, preventing ContextWindowOverflowException before it occurs
- per_turn, compactable_after_messages, and hook-based token budget checks enable within-cycle context management
- If model.context_limit is added, it could auto-configure max_context_tokens
- The hook calls apply_management(), which may call reduce_context(), but does not fire a dedicated event for it
Problem
- No token-budget awareness: SlidingWindowConversationManager only checks len(messages) > window_size. There's no way to set a token budget and have context reduction trigger based on estimated token usage.
- No proactive reduction for summarizing manager: SummarizingConversationManager only summarizes reactively after a ContextWindowOverflowException, which means the agent has already failed a model call.
- No micro-compaction: Stale tool results from early in the conversation consume token budget long after they're relevant, but there's no mechanism to replace them with stubs while preserving the toolUse/toolResult pair structure.
- No token estimation utility: There's no shared utility for estimating token counts across conversation managers.
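The micro-compaction idea can be sketched as follows. The message shape assumes Bedrock-style toolUse/toolResult content blocks; the helper name and stub text are hypothetical, not the proposal's actual implementation:

```python
# Hypothetical micro-compaction sketch: stale toolResult payloads are replaced
# with a short stub, but the toolUse/toolResult pair structure the model
# expects is preserved (toolUseId and status stay intact).
STUB = [{"text": "[tool result elided to save context]"}]

def compact_stale_tool_results(messages, keep_last=4):
    """Stub out toolResult payloads in all but the last keep_last messages."""
    compacted = []
    for i, msg in enumerate(messages):
        if i < len(messages) - keep_last:
            new_blocks = []
            for block in msg.get("content", []):
                if "toolResult" in block:
                    result = dict(block["toolResult"])
                    result["content"] = STUB  # keep toolUseId/status, drop payload
                    new_blocks.append({"toolResult": result})
                else:
                    new_blocks.append(block)
            msg = {**msg, "content": new_blocks}
        compacted.append(msg)
    return compacted
```

Recent messages are left untouched, so in-flight tool interactions keep their full results.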
Proposed Solution
New: _token_utils.py
- estimate_tokens(messages) — chars/4 heuristic covering all ContentBlock types (text, toolResult, toolUse, image, document, video, reasoningContent, cachePoint, guardContent, citationsContent)
- TokenCounter type alias for custom token counting functions
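A minimal sketch of what the proposed utility might look like, assuming Bedrock-style message dicts. The chars/4 constant and the names estimate_tokens / TokenCounter come from the proposal; the body is an illustrative guess at one reasonable implementation:

```python
import json
from collections.abc import Callable

# Pluggable counter type: any callable from a message list to a token count.
TokenCounter = Callable[[list], int]

def estimate_tokens(messages: list) -> int:
    """Rough token estimate: total serialized characters divided by 4."""
    total_chars = 0
    for message in messages:
        for block in message.get("content", []):
            if "text" in block:
                total_chars += len(block["text"])
            else:
                # toolUse, toolResult, image, document, video, reasoningContent,
                # cachePoint, guardContent, citationsContent: fall back to the
                # size of the serialized block
                total_chars += len(json.dumps(block, default=str))
    return total_chars // 4
```

Callers who need accuracy (e.g. a real tokenizer) can supply their own TokenCounter instead.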
SlidingWindowConversationManager enhancements
- max_context_tokens: int | None — optional token budget, checked alongside window_size
- token_counter: TokenCounter | None — pluggable token counting function
- compactable_after_messages: int | None — micro-compaction of stale tool results
- Proactive token-budget enforcement via BeforeModelCallEvent hook
- _last_compacted_index tracking to avoid re-scanning already-compacted messages
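The combined trigger can be illustrated with a standalone sketch; should_reduce and the inline estimator are hypothetical helpers, not the manager's actual methods:

```python
# Reduce when either the message-count window or the optional token budget is
# exceeded. The inline estimator is a stand-in chars/4 heuristic over text
# blocks only.
def estimate_tokens(messages) -> int:
    chars = sum(len(b.get("text", "")) for m in messages for b in m.get("content", []))
    return chars // 4

def should_reduce(messages, window_size, max_context_tokens=None, token_counter=None):
    if len(messages) > window_size:
        return True  # existing behavior: message-count window
    if max_context_tokens is not None:
        counter = token_counter or estimate_tokens  # pluggable TokenCounter
        return counter(messages) > max_context_tokens
    return False
```

When max_context_tokens is None the behavior degrades to today's pure message-count check.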
SummarizingConversationManager enhancements
- max_context_tokens: int | None — optional token budget
- proactive_threshold: float — fraction of budget at which proactive summarization triggers
- token_counter: TokenCounter | None — pluggable token counting function
- Proactive summarization via BeforeModelCallEvent hook (only registered when max_context_tokens is set)
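The proactive trigger reduces to a threshold comparison. A sketch, where should_summarize_proactively is a hypothetical helper and the 0.8 default is an assumption, not a value from the proposal:

```python
# Summarize once estimated usage crosses a fraction of the token budget,
# before the model call fails with a context overflow.
def should_summarize_proactively(estimated_tokens, max_context_tokens, proactive_threshold=0.8):
    if max_context_tokens is None:
        return False  # the hook is only registered when a budget is set
    return estimated_tokens >= proactive_threshold * max_context_tokens
```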
Design decisions
- Always uses the heuristic estimator, never the model-reported latest_context_size (stale after reduction → over-reduction spirals)
- Hook calls apply_management() (not reduce_context() directly) to ensure micro-compaction runs first
- _model_call_count only increments when per_turn is enabled (preserves existing semantics)
- Summarizing manager's apply_management is a no-op to prevent double-summarization (hook + finally block)
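The ordering guarantee behind the second decision can be shown with a toy manager; class and method names here are simplified stand-ins, not the SDK's actual hook registry API:

```python
# Toy illustration: the BeforeModelCallEvent hook goes through
# apply_management(), so micro-compaction always runs before reduction.
class SketchManager:
    def __init__(self):
        self.calls = []

    def _compact_stale_tool_results(self, messages):
        self.calls.append("compact")  # micro-compaction first

    def reduce_context(self, messages):
        self.calls.append("reduce")  # window/token reduction second

    def apply_management(self, messages):
        self._compact_stale_tool_results(messages)
        self.reduce_context(messages)

    def on_before_model_call(self, messages):
        # hook calls apply_management(), not reduce_context() directly
        self.apply_management(messages)
```

Calling reduce_context() directly from the hook would skip compaction and could summarize content that a cheap stub replacement would have reclaimed.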
Test Plan
- New tests in test_token_aware_context_management.py
- Lint clean (ruff check), type clean (mypy)