Release v0.6.66b478 · tobocop2/lilbee

Lower RAM use during chat

The chat worker no longer over-allocates its context cache. With the new defaults, a chat-with-RAG session on a 4B model lives in roughly a third of the memory it used to, and long conversations stop creeping toward the system memory ceiling.

Before / after

Same model, same hardware, same indexed corpus, 10 turns of chat-with-RAG. Memory measured on the chat worker process:

	After turn 1	After turn 10
Before	2579 MB	2519 MB
After	905 MB	874 MB
Reduction	65% smaller	65% smaller

What you'll notice

TUI, native model. Chat worker starts at roughly a third of the prior footprint. Long conversations stay in the same memory band instead of climbing, because lilbee slides the oldest user/assistant turns out of the prompt once they would push past the model's window.
Ollama and frontier models. Unchanged. Lilbee passes through to those backends; they keep using their own context and memory settings.
RAM-constrained hosts. The picker scales down automatically when the host can't back the working window, so smaller machines just get smaller workers, no OOMs.

Power-user knobs

Still tunable via LILBEE_* env vars or /settings:

LILBEE_CHAT_N_CTX_TARGET — working context the picker aims for (default 8K).
LILBEE_NUM_CTX_MAX — explicit ceiling. Empty by default so the model's own training window is the cap; set to clamp below it on smaller hosts.
LILBEE_KV_CACHE_TYPE — q8_0 (new default), f16, q4_0, or f32.
LILBEE_NUM_CTX — pin an exact context size that wins over the picker.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v0.6.66b478

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Lower RAM use during chat

Before / after

What you'll notice

Power-user knobs

Uh oh!