
feat: Token-aware context management for conversation managers #2146

@srbhsrkr

Summary

Conversation managers currently rely solely on message count (window_size) to decide when to reduce context. This is a coarse heuristic — a conversation with 10 messages containing large tool results can exceed a model's context window, while 100 short messages may fit comfortably.

Related Issues

This proposal addresses, or partially addresses, several existing feature requests.

Problem

  1. No token-budget awareness: SlidingWindowConversationManager only checks len(messages) > window_size. There's no way to set a token budget and have context reduction trigger based on estimated token usage.
  2. No proactive reduction for summarizing manager: SummarizingConversationManager only summarizes reactively after a ContextWindowOverflowException, which means the agent has already failed a model call.
  3. No micro-compaction: Stale tool results from early in the conversation consume token budget long after they're relevant, but there's no mechanism to replace them with stubs while preserving the toolUse/toolResult pair structure (see the sketch after this list).
  4. No token estimation utility: There's no shared utility for estimating token counts across conversation managers.
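To make item 3 concrete, here is a minimal, hypothetical sketch of stub replacement; compact_tool_result and the stub text are illustrative, not part of the actual patch. The key property is that the toolResult block (and its toolUseId) survives, so the paired toolUse message stays structurally valid:

```python
def compact_tool_result(message: dict) -> dict:
    """Replace a stale toolResult's bulky content with a short stub."""
    compacted = []
    for block in message.get("content", []):
        if "toolResult" in block:
            stub = dict(block["toolResult"])  # keeps toolUseId, status, etc.
            stub["content"] = [{"text": "[stale tool result elided]"}]
            compacted.append({"toolResult": stub})
        else:
            compacted.append(block)  # text, toolUse, etc. pass through untouched
    return {**message, "content": compacted}
```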

Proposed Solution

New: _token_utils.py

  • estimate_tokens(messages) — chars/4 heuristic covering all ContentBlock types (text, toolResult, toolUse, image, document, video, reasoningContent, cachePoint, guardContent, citationsContent); a sketch follows this list
  • TokenCounter type alias for custom token counting functions
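A minimal sketch of what the heuristic could look like; the chars/4 ratio is from this proposal, while JSON-serializing each block is just one uniform way to cover every ContentBlock type (a real implementation might weigh images and documents differently):

```python
import json
from typing import Callable

# Proposed alias: any callable mapping a message list to a token count.
TokenCounter = Callable[[list], int]

_CHARS_PER_TOKEN = 4  # the chars/4 heuristic


def estimate_tokens(messages: list) -> int:
    """Estimate token usage as total serialized characters / 4."""
    total_chars = 0
    for message in messages:
        for block in message.get("content", []):
            # json.dumps covers every block type (text, toolUse, toolResult,
            # image, document, video, reasoningContent, ...) without
            # per-type handling; raw bytes fall back to str().
            total_chars += len(json.dumps(block, default=str))
    return total_chars // _CHARS_PER_TOKEN
```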

SlidingWindowConversationManager enhancements

  • max_context_tokens: int | None — optional token budget, checked alongside window_size
  • token_counter: TokenCounter | None — pluggable token counting function
  • compactable_after_messages: int | None — enables micro-compaction of tool results older than this many messages (see the usage sketch after this list)
  • Proactive token-budget enforcement via BeforeModelCallEvent hook
  • _last_compacted_index tracking to avoid re-scanning already-compacted messages
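Proposed constructor usage could look like this; max_context_tokens, token_counter, and compactable_after_messages are the new parameters from this issue and do not exist in the released SDK:

```python
from strands import Agent
from strands.agent.conversation_manager import SlidingWindowConversationManager

manager = SlidingWindowConversationManager(
    window_size=40,                 # existing: message-count cap
    max_context_tokens=120_000,     # proposed: token budget, checked alongside window_size
    compactable_after_messages=20,  # proposed: stub out tool results older than this
    # token_counter=my_tokenizer,   # proposed: plug in an exact counter
)

agent = Agent(conversation_manager=manager)
```

With both limits configured, reduction triggers on whichever bound is exceeded first.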

SummarizingConversationManager enhancements

  • max_context_tokens: int | None — optional token budget (usage sketch after this list)
  • proactive_threshold: float — fraction of budget at which proactive summarization triggers
  • token_counter: TokenCounter | None — pluggable token counting function
  • Proactive summarization via BeforeModelCallEvent hook (only registered when max_context_tokens is set)
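Likewise for the summarizing manager; summary_ratio already exists, while max_context_tokens and proactive_threshold are the proposed additions. With a 100k budget and a 0.8 threshold, summarization would fire once the estimate crosses 80k tokens, before the model call rather than after an overflow:

```python
from strands.agent.conversation_manager import SummarizingConversationManager

manager = SummarizingConversationManager(
    summary_ratio=0.3,           # existing: fraction of older messages to summarize
    max_context_tokens=100_000,  # proposed: optional token budget
    proactive_threshold=0.8,     # proposed: summarize at 80% of the budget
)
```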

Design decisions

  • Always uses the heuristic estimator, never the model-reported latest_context_size, which goes stale after a reduction and can trigger over-reduction spirals
  • The hook calls apply_management() rather than reduce_context() directly, so micro-compaction runs first (see the sketch after this list)
  • _model_call_count only increments when per_turn is enabled, preserving the existing semantics
  • The summarizing manager's apply_management is a no-op to prevent double summarization (once via the hook, once via the finally block)
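To tie these together, a rough sketch of the hook wiring under the SDK's HookProvider/HookRegistry pattern; the subclass, the _enforce_budget callback, and the registration details are illustrative assumptions, not the actual patch:

```python
from strands.hooks import BeforeModelCallEvent, HookRegistry
from strands.agent.conversation_manager import SlidingWindowConversationManager

from _token_utils import estimate_tokens  # the heuristic sketched above


class TokenAwareSlidingWindow(SlidingWindowConversationManager):
    """Illustrative subclass; the proposal modifies the manager itself."""

    def __init__(self, max_context_tokens: int, **kwargs):
        super().__init__(**kwargs)
        self.max_context_tokens = max_context_tokens

    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeModelCallEvent, self._enforce_budget)

    def _enforce_budget(self, event: BeforeModelCallEvent) -> None:
        # Always the heuristic estimator, never the model-reported
        # latest_context_size, which is stale right after a reduction.
        if estimate_tokens(event.agent.messages) > self.max_context_tokens:
            # apply_management, not reduce_context, so micro-compaction
            # of stale tool results runs before the window is trimmed.
            self.apply_management(event.agent)
```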

Test Plan

  • 55 new tests in test_token_aware_context_management.py
  • All 73 existing conversation manager tests pass (no regressions)
  • Lint clean (ruff check), type clean (mypy)
