Skip to content

v0.6.66b478

Choose a tag to compare

@github-actions github-actions released this 19 May 08:03
· 238 commits to main since this release
24a895f

Lower RAM use during chat

The chat worker no longer over-allocates its context cache. With the new defaults, a chat-with-RAG session on a 4B model lives in roughly a third of the memory it used to, and long conversations stop creeping toward the system memory ceiling.

Before / after

Same model, same hardware, same indexed corpus, 10 turns of chat-with-RAG. Memory measured on the chat worker process:

After turn 1 After turn 10
Before 2579 MB 2519 MB
After 905 MB 874 MB
Reduction 65% smaller 65% smaller

What you'll notice

  • TUI, native model. Chat worker starts at roughly a third of the prior footprint. Long conversations stay in the same memory band instead of climbing, because lilbee slides the oldest user/assistant turns out of the prompt once they would push past the model's window.
  • Ollama and frontier models. Unchanged. Lilbee passes through to those backends; they keep using their own context and memory settings.
  • RAM-constrained hosts. The picker scales down automatically when the host can't back the working window, so smaller machines just get smaller workers, no OOMs.

Power-user knobs

Still tunable via LILBEE_* env vars or /settings:

  • LILBEE_CHAT_N_CTX_TARGET — working context the picker aims for (default 8K).
  • LILBEE_NUM_CTX_MAX — explicit ceiling. Empty by default so the model's own training window is the cap; set to clamp below it on smaller hosts.
  • LILBEE_KV_CACHE_TYPEq8_0 (new default), f16, q4_0, or f32.
  • LILBEE_NUM_CTX — pin an exact context size that wins over the picker.