v0.6.66b478
·
238 commits
to main
since this release
Lower RAM use during chat
The chat worker no longer over-allocates its context cache. With the new defaults, a chat-with-RAG session on a 4B model lives in roughly a third of the memory it used to, and long conversations stop creeping toward the system memory ceiling.
Before / after
Same model, same hardware, same indexed corpus, 10 turns of chat-with-RAG. Memory measured on the chat worker process:
| After turn 1 | After turn 10 | |
|---|---|---|
| Before | 2579 MB | 2519 MB |
| After | 905 MB | 874 MB |
| Reduction | 65% smaller | 65% smaller |
What you'll notice
- TUI, native model. Chat worker starts at roughly a third of the prior footprint. Long conversations stay in the same memory band instead of climbing, because lilbee slides the oldest user/assistant turns out of the prompt once they would push past the model's window.
- Ollama and frontier models. Unchanged. Lilbee passes through to those backends; they keep using their own context and memory settings.
- RAM-constrained hosts. The picker scales down automatically when the host can't back the working window, so smaller machines just get smaller workers, no OOMs.
Power-user knobs
Still tunable via LILBEE_* env vars or /settings:
LILBEE_CHAT_N_CTX_TARGET— working context the picker aims for (default 8K).LILBEE_NUM_CTX_MAX— explicit ceiling. Empty by default so the model's own training window is the cap; set to clamp below it on smaller hosts.LILBEE_KV_CACHE_TYPE—q8_0(new default),f16,q4_0, orf32.LILBEE_NUM_CTX— pin an exact context size that wins over the picker.