OK-DMD is a mathematically grounded, query-agnostic KV-cache compression and eviction framework for Long-Context Large Language Models (LLMs). By treating the sequence of attention Key vectors as a physical trajectory in a semantic space, OK-DMD tracks and preserves the stable mathematical "attractors" of the text, preventing catastrophic context forgetting during long generations.
Traditional KV-cache eviction policies (like H2O or local windowing) treat attention keys as static, isolated vectors. They evaluate importance based on rolling attention scores.
However, attention scores are volatile and subject to attention drift: a token ignored now might be vital 100 steps later when the conversation shifts back.
OK-DMD shifts this paradigm: Think of the text flow as a river. Every new word is a drop of water following a current. Instead of analyzing individual drops, OK-DMD estimates the underlying current (the semantic attractor) using Koopman Operator Theory.
We assume that the transition from one Key state (
By calculating the persistent eigenvalues of
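The idea of fitting a linear operator to consecutive Key states can be illustrated with a batch Dynamic Mode Decomposition fit. This is a minimal sketch, not the package's implementation: the function name `fit_koopman`, the `ridge` parameter, and the synthetic 2-D trajectory are all assumptions made for illustration.

```python
import numpy as np

def fit_koopman(keys: np.ndarray, ridge: float = 1e-6) -> np.ndarray:
    """Fit K so that keys[t + 1] ~= K @ keys[t] (ridge-regularised least squares).

    keys has shape (T, d): one Key vector per time step.
    """
    X = keys[:-1].T                      # (d, T-1) current states as columns
    Y = keys[1:].T                       # (d, T-1) next states as columns
    d = X.shape[0]
    G = X @ X.T + ridge * np.eye(d)      # regularised Gram matrix (symmetric)
    A = Y @ X.T                          # cross-covariance between next and current
    # K = A @ inv(G); since G is symmetric, solve G @ K.T = A.T instead of inverting.
    return np.linalg.solve(G, A.T).T

# Hypothetical toy trajectory: a slowly decaying rotation in 2-D.
theta = 0.1
K_true = 0.99 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
states = [np.array([1.0, 0.0])]
for _ in range(200):
    states.append(K_true @ states[-1])
keys = np.stack(states)

K_hat = fit_koopman(keys)
eigvals = np.linalg.eigvals(K_hat)
print("eigenvalue magnitudes:", np.abs(eigvals))
```

Eigenvalues with magnitude close to 1 correspond to modes that persist along the trajectory, while eigenvalues with small magnitude correspond to modes that die out quickly.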
Standard eviction heuristics (like H2O) suffer from catastrophic forgetting during domain shifts.
If your LLM is processing a long document about Astronomy, and the user suddenly inserts a transient Lasagna Recipe, H2O's local attention focus shifts entirely to cooking terms. To fit the strict memory budget, H2O purges the older Astronomy keys. When the conversation returns to Astronomy, the model's memory of the primary domain is gone, leading to hallucinations.
Because OK-DMD tracks the spectral footprint (the persistent eigenvalues) of the global trajectory, the "Astronomy" subspace remains stored in the background of the Koopman operator. Transient noise is allowed to pass through and get evicted, while the primary domain is shielded.
You can install the package locally in editable mode:
git clone https://github.com/unkcenter/ok-dmd.git
cd ok-dmd
pip install -e .
To reproduce our multi-domain stress test (Astronomy
python benchmarks/run_stress_test.py
Under an aggressive 92.2% compression budget (restricting the KV-cache of Qwen-2.5-0.5B from 817 down to just 64 active tokens), OK-DMD scores slightly higher than the attention-accumulating H2O baseline in mean attention coherence:
------------------------------------------------------------
FINAL PERFORMANCE REPORT: DOMAIN SHIFT STRESS TEST
Total Sequence Length: 817 tokens
KV-Cache Budget Limit: 64 tokens (Aggressive Compression)
------------------------------------------------------------
-> Mean Attention Coherence (H2O): 0.997211
-> Mean Attention Coherence (OK-DMD): 0.997972
------------------------------------------------------------
[SUCCESS] OK-DMD outperformed H2O! The dynamic attractor preserved the old Astronomy context.
OK-DMD runs on three integrated mathematical components:
To update the Koopman matrix
- $\lambda = 0.995$ acts as an exponential forgetting factor.
- $\sigma I$ is a Tikhonov diagonal loading step that guarantees numerical stability under FP16 or BF16 GPU precisions.
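The roles of the forgetting factor and the diagonal loading term can be sketched with a streaming least-squares update. The class `OnlineKoopman`, its two accumulators, and the default `sigma=1e-3` are assumptions made for illustration; the project's actual update rule may differ, and this sketch runs in float64 rather than FP16 or BF16.

```python
import numpy as np

class OnlineKoopman:
    """Streaming estimate of K with exponential forgetting (illustrative sketch)."""

    def __init__(self, dim: int, lam: float = 0.995, sigma: float = 1e-3):
        self.lam = lam                    # forgetting factor: older pairs fade by lam per step
        self.sigma = sigma                # Tikhonov diagonal loading strength
        self.G = np.zeros((dim, dim))     # weighted sum of k_prev k_prev^T
        self.A = np.zeros((dim, dim))     # weighted sum of k_next k_prev^T

    def update(self, k_prev: np.ndarray, k_next: np.ndarray) -> None:
        # Decay the accumulated statistics, then add the newest transition.
        self.G = self.lam * self.G + np.outer(k_prev, k_prev)
        self.A = self.lam * self.A + np.outer(k_next, k_prev)

    def operator(self) -> np.ndarray:
        # K = A (G + sigma I)^{-1}; the loading keeps the solve well conditioned
        # even when G is rank deficient (for example, early in the sequence).
        d = self.G.shape[0]
        G_reg = self.G + self.sigma * np.eye(d)
        return np.linalg.solve(G_reg, self.A.T).T

# Hypothetical toy stream: a pure rotation in 2-D.
theta = 0.05
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
model = OnlineKoopman(dim=2)
k = np.array([1.0, 0.0])
for _ in range(500):
    k_next = R @ k
    model.update(k, k_next)
    k = k_next
K_est = model.operator()
```

With $\lambda = 0.995$, a transition observed 200 steps ago still carries a weight of roughly $0.995^{200} \approx 0.37$, so the estimate adapts gradually rather than discarding older context outright.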
Instead of performing complex eigendecompositions, we extract the orthonormal basis
The survival score
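One way to obtain an orthonormal basis without a full eigendecomposition is QR-based subspace iteration, and one plausible survival score is the fraction of a key's energy that lies inside that basis. Both choices are assumptions made for this sketch, as are the function names `dominant_subspace`, `survival_scores`, and `select_tokens`; the exact definitions used by OK-DMD may differ.

```python
import numpy as np

def dominant_subspace(K: np.ndarray, rank: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """Orthonormal basis of the dominant invariant subspace of K via subspace iteration."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((K.shape[0], rank)))
    for _ in range(iters):
        # Apply K, then re-orthonormalise the columns with a reduced QR.
        Q, _ = np.linalg.qr(K @ Q)
    return Q                              # shape (d, rank), orthonormal columns

def survival_scores(keys: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Fraction of each key's squared norm captured by the subspace spanned by Q."""
    proj = keys @ Q                       # (T, rank) coordinates in the basis
    energy = np.sum(proj ** 2, axis=1)
    total = np.sum(keys ** 2, axis=1) + 1e-12   # avoid division by zero
    return energy / total

def select_tokens(scores: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` highest-scoring tokens, in original order."""
    if budget <= 0:
        return np.array([], dtype=int)
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)

# Hypothetical operator with two persistent modes and two fast-decaying ones.
K = np.diag([0.99, 0.98, 0.10, 0.05])
Q = dominant_subspace(K, rank=2)
keys = np.array([[1.0,  0.5,  0.0,   0.0],    # aligned with persistent modes
                 [0.0,  0.0,  1.0,  -1.0],    # aligned with transient modes
                 [0.3, -0.8,  0.01,  0.0],    # mostly persistent
                 [0.0,  0.02, 0.7,   0.4]])   # mostly transient
scores = survival_scores(keys, Q)
kept = select_tokens(scores, budget=2)
```

In this toy setup the keys aligned with the persistent modes receive scores near 1 and survive eviction, while keys aligned with the fast-decaying modes score near 0.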
Maintaining the state matrices
By analyzing the memory trade-offs, we determine the exact sequence length
- 50% Eviction: OK-DMD is memory-profitable for any sequence longer than 256 tokens.
- 80% Eviction: OK-DMD is memory-profitable for any sequence longer than 160 tokens.
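The two thresholds above are consistent with a simple break-even rule: evicting a fraction $e$ of $N$ tokens saves $eN$ tokens' worth of KV memory, so a fixed overhead pays off once $N$ exceeds the overhead divided by $e$. The overhead of 128 token-equivalents used below is inferred from the listed figures ($0.5 \times 256 = 0.8 \times 160 = 128$), not a number stated by the project, and the function name is hypothetical.

```python
import math

def break_even_tokens(overhead_tokens: float, eviction_ratio: float) -> float:
    """Sequence length above which eviction savings exceed a fixed overhead.

    overhead_tokens: extra state, expressed in tokens' worth of KV memory.
    eviction_ratio:  fraction of the cache that is evicted (0 < ratio <= 1).
    """
    if not 0.0 < eviction_ratio <= 1.0:
        raise ValueError("eviction_ratio must be in (0, 1]")
    return overhead_tokens / eviction_ratio

# Assumption: overhead inferred from the two thresholds listed above.
OVERHEAD_TOKENS = 128

for ratio in (0.5, 0.8):
    n_star = break_even_tokens(OVERHEAD_TOKENS, ratio)
    print(f"{ratio:.0%} eviction: profitable beyond ~{round(n_star)} tokens")
```

Under the same assumed overhead, a milder 25% eviction ratio would only break even beyond roughly 512 tokens, which illustrates why the method favors long sequences and aggressive budgets.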
As a mathematically rigorous project, we explicitly outline the core limitation of OK-DMD:
- The Needle in a Haystack (NIAH) Constraint: OK-DMD is designed to extract global semantic trends. If a completely out-of-context semantic anomaly (a needle) is mentioned briefly and never recalled, OK-DMD will mathematically classify it as a transient perturbation (noise) and evict it under tight budgets.
- The Production Solution: In enterprise RAG architectures, OK-DMD should not compress the cache step-by-step during prompt prefilling. The prompt and the final instruction should be parsed in parallel, and a Prompt-Aware filter (like SnapKV) should lock the needle tokens. OK-DMD is then deployed during the long answer generation phase (Decode) to prevent semantic drift.
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
Developed with ☕ by UnK Center Inc.
