fix(llm): include system prompt tokens in memory compressor budget #381

0xhis wants to merge 2 commits into `usestrix:main`
Conversation
Greptile Summary

This PR fixes a budget-accounting gap in the memory compressor: the system prompt and agent-identity messages were not included in the token total, so compression could be triggered too late (or not at all) on long scans with large system prompts.

Changes:
- Added a `reserved_tokens` parameter to `compress_history()` so non-history prompt tokens count toward the budget.
- Computed `reserved_tokens` from the system prompt and agent identity messages in `_prepare_messages()` and passed it to the compressor.
Confidence Score: 5/5
Important Files Changed
Prompt To Fix All With AI

This is a comment left during a code review.
Path: strix/llm/llm.py
Line: 13
Comment:
**Importing a private helper across module boundaries**
`_get_message_tokens` is prefixed with `_`, which conventionally signals it is an internal implementation detail of `memory_compressor.py`. Importing it directly into `llm.py` couples the two modules to an internal symbol that could be freely refactored or removed without a public-API change notice.
Consider either:
1. Renaming it to `get_message_tokens` (dropping the leading `_`) to make the public-export intent explicit, or
2. Exposing a small helper on `MemoryCompressor` itself (e.g. `MemoryCompressor.count_tokens(messages)`) so the compressor owns all token-counting logic.
```suggestion
from strix.llm.memory_compressor import MemoryCompressor, get_message_tokens
```
How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "fix(llm): include sy..."
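Option 2 from the review could be sketched roughly as follows. This is a hypothetical illustration, not the repository's actual code: the `count_tokens` name and signature are assumptions, and a crude stand-in tokenizer replaces the real `_get_message_tokens`.

```python
def _get_message_tokens(msg: dict, model_name: str) -> int:
    # Crude stand-in tokenizer (~4 characters per token), not the real helper.
    return max(1, len(str(msg.get("content", ""))) // 4)


class MemoryCompressor:
    def count_tokens(self, messages: list[dict], model_name: str = "model") -> int:
        # Public entry point: callers count tokens through the compressor,
        # and _get_message_tokens stays an internal detail of this module.
        return sum(_get_message_tokens(m, model_name) for m in messages)
```

`llm.py` would then call something like `compressor.count_tokens(framing_msgs)` instead of importing the private symbol across module boundaries.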
strix/llm/llm.py (outdated)

```diff
 from strix.config import Config
 from strix.llm.config import LLMConfig
-from strix.llm.memory_compressor import MemoryCompressor
+from strix.llm.memory_compressor import MemoryCompressor, _get_message_tokens
```
Importing a private helper across module boundaries
_get_message_tokens is prefixed with _, which conventionally signals it is an internal implementation detail of memory_compressor.py. Importing it directly into llm.py couples the two modules to an internal symbol that could be freely refactored or removed without a public-API change notice.
Consider either:
- Renaming it to
get_message_tokens(dropping the leading_) to make the public-export intent explicit, or - Exposing a small helper on
MemoryCompressoritself (e.g.MemoryCompressor.count_tokens(messages)) so the compressor owns all token-counting logic.
| from strix.llm.memory_compressor import MemoryCompressor, _get_message_tokens | |
| from strix.llm.memory_compressor import MemoryCompressor, get_message_tokens |
Pull request overview
This PR fixes token budgeting in the LLM memory compressor so that system prompt and agent identity “framing” tokens are accounted for when deciding whether (and how much) to compress conversation history, preventing unexpected truncation behavior on long runs with large system prompts.
Changes:
- Added a `reserved_tokens` parameter to `MemoryCompressor.compress_history()` to incorporate non-history prompt tokens into the budget calculation.
- Calculated `reserved_tokens` in `LLM._prepare_messages()` from the system prompt + agent identity messages and passed it into the compressor.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `strix/llm/memory_compressor.py` | Extends history compression budgeting with a `reserved_tokens` parameter and uses it in token calculations. |
| `strix/llm/llm.py` | Computes reserved/system framing tokens during message preparation and passes them to the memory compressor. |
Comments suppressed due to low confidence (1)
strix/llm/memory_compressor.py:216
`compress_history()` can still return a history that exceeds the token budget when `reserved_tokens` (or the non-compressible system/recent messages) already consume most or all of the limit. Consider explicitly handling `reserved_tokens >= MAX_TOTAL_TOKENS * 0.9` (and/or re-checking after summarization) by dropping more of the history (e.g., reduce recent messages, return only system messages, or return an empty history) so the final prompt stays within limits.
```python
total_tokens = reserved_tokens + sum(
    _get_message_tokens(msg, model_name) for msg in system_msgs + regular_msgs
)
if total_tokens <= MAX_TOTAL_TOKENS * 0.9:
    return messages
```
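A guard along the lines this comment suggests might look like the sketch below. Everything here is an assumption for illustration: the `MAX_TOTAL_TOKENS` value, the message shapes, the fallback policy (keep only system messages), and the stand-in tokenizer.

```python
MAX_TOTAL_TOKENS = 1000  # assumed value; the real constant lives in memory_compressor.py


def _get_message_tokens(msg: dict, model_name: str) -> int:
    # Crude stand-in tokenizer: ~4 characters per token.
    return max(1, len(str(msg.get("content", ""))) // 4)


def compress_history(messages: list[dict], model_name: str = "model",
                     reserved_tokens: int = 0) -> list[dict]:
    budget = int(MAX_TOTAL_TOKENS * 0.9)
    # Guard: framing tokens alone exhaust the budget, so no amount of
    # history compression can help -- keep only system messages.
    if reserved_tokens >= budget:
        return [m for m in messages if m.get("role") == "system"]
    total = reserved_tokens + sum(_get_message_tokens(m, model_name) for m in messages)
    if total <= budget:
        return messages
    # Re-check after trimming: drop oldest non-system messages until it fits.
    system = [m for m in messages if m.get("role") == "system"]
    history = [m for m in messages if m.get("role") != "system"]
    while history and total > budget:
        total -= _get_message_tokens(history.pop(0), model_name)
    return system + history
```

The real implementation would summarize rather than drop messages, but the shape of the guard is the same: check `reserved_tokens` against the budget before doing any per-message work.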
```python
        Args:
            messages: Conversation history messages to compress.
            reserved_tokens: Tokens already reserved for system prompt and
                other framing messages outside the conversation history.
                Subtracted from the budget before checking limits.
```
The docstring says `reserved_tokens` is "Subtracted from the budget before checking limits", but the implementation adds it into `total_tokens` and compares against the fixed budget. Either update the wording to match the behavior (reserved tokens are included in the total prompt token count) or change the logic to compute an `available_budget = MAX_TOTAL_TOKENS * 0.9 - reserved_tokens` and compare history tokens against that.
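The two formulations are arithmetically equivalent (`reserved + history <= budget` exactly when `history <= budget - reserved`), so this is a documentation-accuracy choice rather than a behavior change. A minimal sketch of both variants, with an assumed `MAX_TOTAL_TOKENS`:

```python
MAX_TOTAL_TOKENS = 1000  # assumed value for illustration


def within_budget_add(history_tokens: int, reserved_tokens: int) -> bool:
    # Current behavior: reserved tokens are added into the total.
    return reserved_tokens + history_tokens <= MAX_TOTAL_TOKENS * 0.9


def within_budget_subtract(history_tokens: int, reserved_tokens: int) -> bool:
    # Docstring's wording: reserved tokens are subtracted from the budget.
    available_budget = MAX_TOTAL_TOKENS * 0.9 - reserved_tokens
    return history_tokens <= available_budget
```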
strix/llm/llm.py (outdated)

```diff
 from strix.config import Config
 from strix.llm.config import LLMConfig
-from strix.llm.memory_compressor import MemoryCompressor
+from strix.llm.memory_compressor import MemoryCompressor, _get_message_tokens
```
`llm.py` imports `_get_message_tokens` from `memory_compressor.py`, but the leading underscore indicates this is intended as a private helper. To reduce coupling, consider exposing a public token-counting helper (e.g., `get_message_tokens`) or adding a `MemoryCompressor.get_message_tokens()` method, and have `LLM` call that instead of importing a private symbol.

```suggestion
from strix.llm.memory_compressor import MemoryCompressor
```
Force-pushed from `2502a59` to `22d978e`
The memory compressor was not accounting for system prompt and agent identity tokens when calculating the conversation budget. This caused premature history truncation on long scans with large system prompts. Adds a `reserved_tokens` parameter to `compress_history()` that subtracts already-accounted tokens from the budget before applying limits.
Force-pushed from `22d978e` to `63035db`
Summary
The memory compressor was not accounting for system prompt and agent identity tokens when calculating the conversation budget. This caused premature history truncation on long scans with large system prompts.
Changes

- Added a `reserved_tokens` parameter to `compress_history()` that subtracts already-accounted tokens from the budget before applying limits
- Computed `reserved_tokens` from system prompt and agent identity messages in `_prepare_messages()`

Files Changed

- `strix/llm/llm.py` (+7/−2)
- `strix/llm/memory_compressor.py` (+8/−1)

Split from #328.
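The caller-side half of the change described above might look roughly like this sketch. The function name, arguments, and message shapes are assumptions (the real method is `LLM._prepare_messages()` in `strix/llm/llm.py`), and both helpers are stand-ins rather than the repository's implementations.

```python
def _get_message_tokens(msg: dict, model_name: str) -> int:
    # Crude stand-in tokenizer: ~4 characters per token.
    return max(1, len(str(msg.get("content", ""))) // 4)


def compress_history(messages, model_name, reserved_tokens=0):
    # Stand-in for MemoryCompressor.compress_history(); passes through here.
    return messages


def prepare_messages(system_prompt: str, identity_msgs: list[dict],
                     history: list[dict], model_name: str = "model") -> list[dict]:
    framing = [{"role": "system", "content": system_prompt}, *identity_msgs]
    # The fix: count the framing tokens and hand them to the compressor, so
    # the budget reflects the whole prompt, not just the conversation history.
    reserved_tokens = sum(_get_message_tokens(m, model_name) for m in framing)
    compressed = compress_history(history, model_name, reserved_tokens=reserved_tokens)
    return framing + compressed
```

Without the `reserved_tokens` hand-off, a large system prompt silently eats into the same budget the compressor believes is fully available to history, which is the truncation bug this PR addresses.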