From 50780aab115c0a5ea3ad2b4e87a446e40b6a2dc7 Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 10 May 2026 16:25:50 +0000 Subject: [PATCH] cache-research: add weekly report for 2026-05-10 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Covers 2026-05-04 → 2026-05-10. Headlines: vLLM × Mooncake distributed KV for agentic workloads (3.8× throughput, 94% reusable prefixes), LightSeek TokenSpeed inference engine launch with day-0 vLLM MLA adoption, LMCache's V4 wallet-economics post, eOptShrinkQ spectral-shrinkage KV compression (arXiv:2605.02905), HELM adaptive HBM partitioning for generative recommenders (arXiv:2605.04450), RetentiveKV state-space eviction, ZeRO-Prefill MoE serving, and Lighthouse Attention long-context pretraining. https://claude.ai/code/session_018DNoJMRCgb7pwFZCneFo9r --- .../weekly-cache-report-2026-05-10.md | 168 ++++++++++++++++++ 1 file changed, 168 insertions(+) create mode 100644 cache-research/weekly-cache-report-2026-05-10.md diff --git a/cache-research/weekly-cache-report-2026-05-10.md b/cache-research/weekly-cache-report-2026-05-10.md new file mode 100644 index 0000000..67efb61 --- /dev/null +++ b/cache-research/weekly-cache-report-2026-05-10.md @@ -0,0 +1,168 @@ +# Weekly Cache Research Report — 2026-05-10 + +**Run type:** Normal weekly run. Prior report: [`weekly-cache-report-2026-05-03.md`](./weekly-cache-report-2026-05-03.md). Search horizon: 2026-05-04 → 2026-05-10 (the seven days following the prior report). Target was ~5 entries; expanded to 9 because the post-DeepSeek-V4 wave plus a vLLM-day, a LightSeek launch, and a wave of arXiv 2605.xxxxx submissions all landed in the same week. + +**Scope:** distributed caching, KV cache, caching for inference, storage-system caching. + +**Selection criteria:** novelty of mechanism, potential systems / production impact, and whether the work measured runtime efficiency on a realistic workload — flagged explicitly even when the answer is "no" or "partial." + +**Organization:** (A) production measurements / empirical studies — real deployments, real traces, real hardware — and (B) academic / idea-forward work — novel mechanisms typically evaluated on benchmarks or simulated traces. + +--- + +## A. Production measurements and empirical studies + +### 1. vLLM × Mooncake — "Serving Agentic Workloads at Scale" (May 7, 2026) + +- Reference: [vLLM blog announcement on X — Serving Agentic Workloads at Scale with vLLM × Mooncake (May 7, 2026)](https://x.com/vllm_project/status/2052113331927060840). Companions: [Mooncake project site](https://kvcache-ai.github.io/Mooncake/), [Mooncake Store design page](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html), [vLLM V1 + Mooncake + LMCache integration recipe](https://kvcache-ai.github.io/Mooncake/getting_started/examples/vllm-integration/vllmv1-lmcache-integration.html), and [Mooncake Joins PyTorch Ecosystem (PyTorch blog)](https://pytorch.org/blog/mooncake-joins-pytorch-ecosystem/). +- Summary: A vLLM-published case study on running production agentic traces (80K+ tokens with 94%+ reusable prefixes) where local per-instance KV caches evict shared prefixes and cross-instance routing misses them entirely. Integrating **Mooncake Store as a distributed, cross-instance KV cache** (via the Mooncake Transfer Engine over RDMA / NVMe-oF / CXL) reportedly delivers **3.8× higher throughput** versus vLLM with local prefix caching alone. 
This is the moment Kimi's serving-platform substrate becomes a first-class vLLM blog citizen, not just a connector. +- Novelty: Low–medium as mechanism — Mooncake itself was published at FAST '25 (Qin et al.) and the Transfer Engine has been in vLLM since late 2024. High as workload framing: the 94%-reusable-prefix number on real agent traces is a sharper data point than the synthetic prefix-sharing benchmarks the field usually quotes, and it directly motivates the "KV is a first-class data object" reframing argued by LMCache last week. +- Impact: High. When the vLLM team itself publishes a "use Mooncake when local prefix caching isn't enough" recommendation for agentic traffic, every downstream stack (Together, Fireworks, RunPod, Modal, llm-d, Red Hat AI) gains air cover for the same architecture. Pairs naturally with Cloudflare's earlier endorsement of the Mooncake Transfer Engine. +- Runtime evaluation: Yes, partial. Headline 3.8× throughput claim and the 94% prefix-reuse measurement come from production agent traces, but neither tail-latency distributions (P95/P99 TTFT and TBT) under continuous batching nor the attribution between "more cache space" and "cross-instance hits" are broken out in the announcement. The next obvious benchmark. + +### 2. LightSeek Foundation — TokenSpeed inference engine launch (May 7, 2026) + +- Reference: [LightSeek Foundation blog — TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads](https://lightseek.org/blog/lightseek-tokenspeed.html). Companions: [GitHub: lightseekorg/tokenspeed](https://github.com/lightseekorg/tokenspeed), [vLLM × TokenSpeed day-0 launch announcement (X, May 7, 2026)](https://x.com/vllm_project/status/2052051210530914510), [MarkTechPost coverage (May 7, 2026)](https://www.marktechpost.com/2026/05/07/lightseek-foundation-releases-tokenspeed-an-open-source-llm-inference-engine-targeting-tensorrt-llm-level-performance-for-agentic-workloads/), [TokenSpeed kernel source](https://github.com/lightseekorg/tokenspeed/tree/main/tokenspeed-kernel). +- Summary: A new MIT-licensed inference engine from a 501(c)(3) nonprofit, built in two months and aimed squarely at agentic workloads. Two interesting pieces for KV-cache people: (a) a **TokenSpeed MLA kernel** with a "binary-version prefill" path that uses NVIDIA-internal softmax-implementation knobs and beats TensorRT-LLM's MLA across five typical coding-agent prefill workloads (long shared prefix KV cache), and (b) a decode kernel that **folds the query-sequence axis into the head axis** to fill BMM1 M-tile better, nearly halving decode latency vs. TensorRT-LLM under speculative-decoding workloads at batch 4/8/16 with long prefix KV cache. The MLA library has already been adopted by vLLM as the day-0 launch partner, purpose-built for Kimi K2.5 / K2.6 and DeepSeek R1 on Blackwell. +- Novelty: Medium. The kernel-level innovations are concrete and well-posed (the M-tile reshape for MLA decode is a real micro-architectural insight on Blackwell), but the headline is mostly "TensorRT-LLM-class kernels in an open-source MLA-first engine" rather than a new abstraction. What's novel is the combination — a fully open MLA stack tuned for coding-agent prefill shapes, not generic chat shapes. +- Impact: High signal. Agentic-coding workloads are the largest commercial vLLM use case (CodeBuddy/WorkBuddy, Cursor-style stacks, OpenClaw clones), so an MLA kernel that speeds those workloads up — and that vLLM accepts upstream within days — moves the floor. 
The fact that a small nonprofit can ship a TensorRT-LLM-competitive MLA in two months is also an impact statement about how mature the kernel-tuning ecosystem has gotten in 2026. +- Runtime evaluation: Yes — concrete kernel-level numbers on B200: ~9% min-latency and ~11% throughput improvement over TensorRT-LLM at 100 TPS/User on Kimi K2.5; near-halved decode latency vs TensorRT-LLM in speculative-decoding workloads (batch 4/8/16, long prefix KV cache). End-to-end SLO sweeps under continuous batching are not yet reported; that is the obvious next step now that the MLA kernel sits in vLLM upstream. + +### 3. LMCache — "Deepseek V4 explained, and why it matters to your wallet" (May 4, 2026) + +- Reference: [LMCache Blog — Deepseek V4 explained, and why it matters to your wallet](https://blog.lmcache.ai/en/2026/05/04/deepseek-v4-explained-and-why-it-matters-to-your-wallet/). Companions: [LMCache on Amazon SageMaker HyperPod (Apr 22, 2026)](https://blog.lmcache.ai/en/2026/04/22/lmcache-on-amazon-sagemaker-hyperpod-accelerating-llm-inference-with-managed-tiered-kv-cache/), [LMCache "Stop Calling It KV Cache" (Apr 28, 2026)](https://blog.lmcache.ai/en/2026/04/28/stop-calling-it-kv-cache-its-something-much-bigger/). +- Summary: Operator-facing follow-up to last week's three-headline DeepSeek-V4 entries: LMCache walks through *why* the V4 architecture (Compressed Sparse Attention compressing 4 tokens → 1, Heavily Compressed Attention compressing 128 tokens → 1, plus mHC residual structure) translates directly into **2–3× cheaper token prices**. The argument: KV cache — not parameters — is the binding constraint for token economics, and V4's ~10× smaller KV cache at 1M context lets a fixed GPU pool process ~10× more concurrent requests, producing 2–3× higher token throughput. The post is positioning LMCache as the cache layer that captures these wins in production tiered storage. +- Novelty: Low as a technical contribution; medium as economic analysis. The "KV is the binding constraint, not FLOPs" framing has been argued before (Modular's Five Eras, last week's LMCache reframing), but tying it directly to V4-specific compression ratios and dollar-per-token math is the first sober vendor-side cost writeup for V4. +- Impact: Medium-high. Influential operators read LMCache. This post sets the wallet-side expectation for V4 in the Q2 2026 vendor pitch cycle and is likely to get cited by Together, Fireworks, RunPod, and SageMaker HyperPod in their own DeepSeek-V4 launch posts. +- Runtime evaluation: None directly. Numbers are ratio-based ("10× smaller KV → 10× more requests → 2-3× cheaper tokens"), not measurement-led. End-to-end fleet TTFT/TBT distributions are deferred to LMCache's existing companion posts (the Apr 22 SageMaker HyperPod post still anchors the real numbers: 1.67× ITL and 1.27× throughput improvement over baseline vLLM under high concurrency). + +### 4. vLLM v0.20.1 (May 4) and v0.20.2 (May 10) — DeepSeek-V4 stabilization + +- Reference: [vLLM v0.20.2 release notes (May 10, 2026)](https://github.com/vllm-project/vllm/releases/tag/v0.20.2). Companion: [vLLM v0.20.1 release notes (May 4, 2026)](https://github.com/vllm-project/vllm/releases/tag/v0.20.1) and last week's headline [vLLM v0.20.0 release](https://github.com/vllm-project/vllm/releases/tag/v0.20.0). +- Summary: Two patch releases bracketing the week. 
v0.20.1 stabilizes the V4 path with multi-stream pre-attention GEMM (tuned token-threshold defaults), BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication, and integrated tile kernels. v0.20.2 fixes (a) **a "failure to allocate KV blocks" error in the V1 engine KV cache manager** for DeepSeek V4 (#41282), (b) DeepSeek V4 sparse-attention by re-enabling the persistent topk path on Hopper and forcing the memset kernel into CUDA-graph capture time to address MTP=1 hangs (#41665), and (c) a Qwen3-VL deepstack boundary check that could fail under heavy load (#40932). +- Novelty: Low. These are correctness/stability patches, not new mechanisms. Worth tracking because they signal which V4 paths are still fragile under real workloads. +- Impact: Medium. The KV-block-allocation bug in particular suggests the V1 cache manager is being stressed by V4's hybrid-attention shape budget in ways that didn't surface in pre-release testing — exactly the kind of integration risk that surfaces only after broad production deployment of a new architecture. Worth a closer look in next week's report once independent reproductions land. +- Runtime evaluation: None for the patches themselves. The aggregate v0.20.x story (TurboQuant 2-bit KV by default, FA4 default MLA prefill, V4 day-0) remains the headline, but it is now a "moving" headline as patches land. + +--- + +## B. Academic / idea-forward work + +### 5. eOptShrinkQ — Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization (~May 6, 2026) + +- Reference: Pei-Chun Su, [arXiv:2605.02905](https://arxiv.org/abs/2605.02905). +- Summary: Decomposes the KV cache in each attention head into a **low-rank shared-context component plus a full-rank per-token residual**, modelled as a spiked random matrix. The shared structure is extracted via **optimal singular-value shrinkage** (eOptShrink); the residual — which the paper proves satisfies the thin-shell / coordinate-delocalization property — is then quantized with TurboQuant. Three theoretical guarantees follow from random-matrix theory: automatic rank selection at the BBP phase transition, near-zero inner-product bias on the residual, and near-optimal scalar-quantization distortion. The paper is essentially "TurboQuant gets a rank-aware preprocessor with provable distortion guarantees." +- Novelty: High. This is the first KV-cache compression paper to give an *end-to-end theoretical bound* tying compression bits to attention-output distortion, and one of the first to actually exploit the "shared context" structure that LMCache, prefix-cache, and CacheBlend all assume but never quantify. Sits naturally next to last week's CapKV (information-bottleneck eviction) — together they triangulate eviction, compression, and quantization with a unified theoretical lens. +- Impact: Medium-high near term. Concrete wins: at equivalent quality, eOptShrinkQ saves ~1 bit per entry over TurboQuant; at ~2.2 bits/entry it outperforms TurboQuant at 3.0 bits/entry across all 16 LongBench tasks. If 2-bit-class KV stays the default in vLLM and SGLang post-v0.20.0, this is the first "below TurboQuant" point on the bits-vs-quality curve with a clean theoretical justification. +- Runtime evaluation: Partial. End-to-end LongBench (16 tasks) quality results at matched bits, plus the bits-savings ratio against TurboQuant. 
**No kernel-level decoding throughput / TTFT measurements** — the rank decomposition and shrinkage step has to be cheap enough to run inline with paged attention, and that integration is not yet shown. + +### 6. One Pool, Two Caches — Adaptive HBM Partitioning for Generative Recommender Serving (May 6, 2026) + +- Reference: Wenjun Yu, Shuguang Han, Amelie Chi Zhou, [arXiv:2605.04450](https://arxiv.org/abs/2605.04450). +- Summary: Generative Recommenders (GRs) are the production workload where **embedding hot caches (EMB) and KV caches share the same HBM pool** and compete head-to-head. The paper measures that the optimal EMB-vs-KV split shifts by up to **0.35 of the pool** across workload regimes, leaving 20–30% latency on the table for any static partition. **HELM** is the response: a three-layer PPO controller (frozen base policy + online residual adapter + burst-aware recovery controller) plus EMB-KV-aware request scheduling, hitting **32 µs decision latency** while staying within 0.024–0.029 of the offline-optimal allocation ratio. +- Novelty: High. The KV-cache literature is almost entirely chat-/RAG-/agent-focused; this is one of the first papers to take the "GR + LLM share an HBM pool" workload seriously and treat the EMB-vs-KV trade-off as a *control problem* rather than a static configuration. The 0.35-shift measurement is itself a noteworthy data point. +- Impact: Medium-high for the RecSys-meets-LLM camp. Generative recommenders (HSTU, Meta's GRMM family, Pinterest, Tencent) are exactly where KV cache and embedding cache collide in production, and the field has been quietly building toward this — HELM is the cleanest articulation of the joint-management problem to date. Less impact for chat/RAG-only stacks where there is no embedding cache to compete with. +- Runtime evaluation: Yes on the controller side (32 µs decision latency, 0.024–0.029 ratio gap from offline optimal); yes on end-to-end latency (the 20–30% latency improvement claim). The paper does not state what the realistic baseline is — is it static-split production, or the strawman of "one cache only"? — and that ambiguity is the obvious thing to firm up. + +### 7. RetentiveKV — State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction (~May 5, 2026) + +- Reference: [arXiv:2605.04075](https://arxiv.org/abs/2605.04075). +- Summary: Reformulates KV-cache eviction in MLLMs from "discrete context truncation" to **continuous memory evolution** by absorbing evicted KV pairs into a modality-specific state-space model rather than dropping them. Three components: (a) an entropy-guided retention estimator that quantifies prospective relevance of each KV pair, (b) entropy-guided state transition into a per-modality state space, and (c) query-conditioned state retrieval at decoding time to selectively recall task-relevant evicted information. Achieves **5× KV compression and 1.5× decoding acceleration** on multimodal benchmarks, doubling retrieval score over competing methods on Text Needle In A Haystack at 5% budget. +- Novelty: High. Most eviction methods treat evicted tokens as gone; RetentiveKV treats them as compressed-but-recoverable, which is the same conceptual move that diffusion-LM caches (dKV-Cache) and recurrent attention papers have been circling. The cleanest articulation so far for the multimodal long-context regime. +- Impact: Medium-high for MLLMs. 
The 5% budget × 2× retrieval-score improvement is a non-trivial result for VLMs and audio-LLMs where context windows are dominated by perceptual tokens that eviction policies historically struggle with. Sits alongside last week's PolyKV (multi-agent shared compressed KV) as evidence that the "compressed pool" abstraction is generalizing across workload classes. +- Runtime evaluation: Partial. Compression × decode-acceleration headline numbers (5× / 1.5×) are reported on multimodal benchmarks; the state-space transition and retrieval cost is included in the 1.5× claim, which is the right way to frame it. End-to-end serving numbers under continuous batching against vLLM/SGLang baselines are not yet shown. + +### 8. ZeRO-Prefill — Zero Redundancy Overheads in MoE Prefill Serving (~May 5, 2026) + +- Reference: [arXiv:2605.02960](https://arxiv.org/abs/2605.02960). +- Summary: Prefill-only serving on MoE is bottlenecked by *distributed* execution rather than compute — existing parallel strategies couple expert placement with synchronous activation routing, producing redundant computation, communication, and synchronization inherited from the decoding era. ZeRO-Prefill replaces per-layer activation **AllToAll** with asynchronous weight **AllGather** fully overlapped with the long, compute-bound forward pass: experts are gathered by weight rather than routed by activation, every GPU holds the complete current-layer expert set, and dispatch becomes a local operation. Adds prefix-aware routing where the frontend's KV-block table and the backend's per-GPU cache share a single source of truth. +- Novelty: High for the prefill-on-MoE niche. The "stream weights, not activations" inversion is the right idea for prefill where compute time dwarfs weight-fetch time, and the paper is the cleanest articulation of why the decoding-era dispatch primitive is wrong for prefill workloads. +- Impact: High for MoE prefill — i.e., DeepSeek-V4-Pro/Flash, Kimi K2.x, GLM-5/5.1, Qwen-3.6, Hunyuan-3, Hy3, Mistral-MoE — which are now the dominant model class for serious agentic workloads. Pairs naturally with PD-disaggregation and with Mooncake's shared distributed KV (entry #1). +- Runtime evaluation: Yes on the design rationale (overlapping weight AllGather with prefill compute, dispatch as a local op). The paper does report measured improvements vs. activation-AllToAll baselines, but the abstract-level summary doesn't highlight a single headline number; the implication is "prefill-bound MoE serving with materially better throughput" rather than a clean N× claim. Worth a close read once independent reproductions land. + +### 9. Long Context Pre-Training with Lighthouse Attention (May 7, 2026) + +- Reference: [arXiv:2605.06554](https://arxiv.org/abs/2605.06554). +- Summary: A **training-only hierarchical sparse-attention wrapper** that pools Q, K, V symmetrically across a multi-resolution pyramid, places selection outside the attention kernel (so the inner loop is stock FlashAttention on a dense sub-sequence), and is removed at inference time after a brief dense-SDPA warm-up. Parameter-free, trains end-to-end, no auxiliary losses or straight-through estimators, inherits FlashAttention upgrades unchanged. Lighthouse-trained models match or beat a fully dense-SDPA baseline trained from scratch on the same token budget. +- Novelty: Medium-high. 
The "sparsity at training, dense at inference" framing is the structural inverse of inference-only sparse attention, and the symmetric Q/K/V pooling is the cleanest causal-respecting variant of the multi-resolution pyramid family. Important consequence: the trained model can use full attention at inference, which inference-only sparse methods cannot claim because they never touch the training loop. +- Impact: Medium for the KV-cache subfield — this is a *training-time* paper whose KV-cache implications are indirect (cheaper long-context pretraining → more long-context models → more pressure on inference-time KV systems). Direct impact is on long-context base-model providers (DeepSeek, Kimi, Qwen, Llama, Mistral) more than on KV-cache vendors. +- Runtime evaluation: Training-side measurements only — token-budget-matched comparison against dense SDPA, plus the brief end-of-training dense recovery phase. No inference-time TTFT/TBT numbers; the implicit claim is "no inference cost change" because Lighthouse is removed before serving. + +--- + +## A′. Closely related work just outside the 7-day window or just outside scope + +### 10. SAGA — Workflow-Atomic Scheduling for AI Agent Inference (~May 1, 2026; just before window) + +- Reference: [arXiv:2605.00528](https://arxiv.org/abs/2605.00528). +- Summary: Treats the **entire agent workflow** (not individual LLM calls) as the first-class schedulable unit, capturing tool-call-induced KV-cache discards as the primary inefficiency in agent serving (38% of total time spent regenerating discarded KV cache, GPU memory only 42% utilized due to fragmentation in the team's own production-trace measurements). Three mechanisms: Agent Execution Graphs to predict KV-cache reuse across tool-call boundaries (within 1.31× of Bélády's optimal offline policy), session-affinity batching with work stealing, and **Workflow-Aware LRU** that incorporates predicted reuse probability into eviction. On a 64-GPU cluster serving SWE-bench coding agents and WebArena browser tasks, **1.64× task-completion-time improvement** over vLLM v0.15.1 with prefix caching + affinity routing, 1.22× GPU memory utilization, 99.2% SLO attainment under multi-tenant interference. +- Why noted: The 64-GPU cluster scale and SWE-bench / WebArena evaluation make this one of the most *production-realistic* agent-serving evaluations in arXiv this year, and it pairs perfectly with vLLM × Mooncake (entry #1) — SAGA tells the scheduler what to do; Mooncake tells the storage substrate where to put it. Too close to deserve a separate primary entry given last week's coverage of the agent-cache subgenre, but easily a top-3 entry on a slower week. + +### 11. GhostServe — Lightweight Checkpointing with Erasure-Coded KV Cache for Fault-Tolerant LLM Serving (~May 1, 2026; just before window) + +- Reference: [arXiv:2605.00831](https://arxiv.org/abs/2605.00831). +- Summary: Erasure-codes the *streaming* KV cache (not just model weights) into parity shards held in host memory, with custom GPU kernels that perform *integer-centric* lossless encoding to bridge floating-point KV state to binary-field codes. Reduces host memory overhead by 75% and checkpointing latency by 73% versus full replication. Implemented on SGLang. +- Why noted: First serious treatment of KV cache as something that needs *storage-systems-style fault tolerance* rather than just multi-tier offloading. 
As distributed inference scales (Mooncake, Dynamo, Cloudflare's Kimi K2.5 setup), node failures during long agent traces become a real cost — recomputation can be minutes of wasted time. This is the paper that puts erasure coding on the KV-cache map. + +### 12. WindowQuant — Mixed-Precision KV Cache Quantization for VLMs (May 4, 2026) + +- Reference: [arXiv:2605.02262](https://arxiv.org/abs/2605.02262). +- Summary: Window-adaptive mixed-precision KV-cache quantization for VLMs on video understanding tasks. Two modules: a window-level quantization search that picks bit-width per visual-token window based on text-prompt similarity, and a window-level KV computation kernel. +- Why noted: Continues the 2026 pattern of *per-modality* KV compression (vs. uniform schemes), aligned with RetentiveKV (entry #7). The video-VLM regime is a distinct enough workload class that a separate eviction/quantization track is starting to make sense. + +### 13. Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics (May 2, 2026) + +- Reference: Zijie Zhou, [arXiv:2605.01280](https://arxiv.org/abs/2605.01280). +- Summary: Position paper arguing that despite surface-level innovation in vLLM/SGLang, the *algorithmic cores* — JSQ/round-robin routing, FIFO scheduling, LRU eviction — are unchanged from classical distributed computing and fail to capture LLM-specific structure (dynamically growing KV cache, prefill/decode asymmetry, unknown output lengths, continuous batching). Calls for theoretical models with provable performance guarantees rather than heuristic stacks. +- Why noted: This is the conceptual companion to last week's CapKV (information-bottleneck eviction) and the depth-cache lower bound — the field is now actively under pressure to *justify* its heuristic stack with theory. Pairs with eOptShrinkQ (entry #5) as evidence that "theory catches up to practice" is a 2026 trend in KV-cache work. + +### 14. The Structural Origin of Attention Sink (May 7, 2026) + +- Reference: [arXiv:2605.06611](https://arxiv.org/abs/2605.06611). +- Summary: Mechanistic explanation of the attention-sink phenomenon: variance discrepancy in value aggregation is amplified by FFN super-neurons, and the absence of value aggregation for the initial token creates a persistent high-variance outlier that propagates through the network and selectively activates super neurons. Validated by mask interventions and variance-amplification experiments. +- Why noted: Attention sinks underpin most modern eviction schemes (StreamingLLM, H2O, SnapKV all hinge on "keep the first few tokens"). A mechanistic explanation, if it holds up, gives eviction designers a principled reason for the heuristic — and a hint that *architectural* fixes (FFN super-neuron suppression) might eliminate the need for sink-preservation in eviction policies altogether. + +### 15. SCION — Size-aware Policy Orchestration for Nonstationary Object Caches (May 4 abs-revision; submitted Mar 27, 2026) + +- Reference: [arXiv:2605.01055](https://arxiv.org/abs/2605.01055). +- Summary: Non-LLM cache policy orchestration for cloud/edge object caches. Tiny workload-fingerprint computed off the critical path (size, cacheability, reuse, cache-size summary statistics) feeds an offline-trained linear selector that picks among GDSF, S3-FIFO, SIEVE, LHD, W-TinyLFU-AV, and DynamicAdaptiveClimb. Prototype is called AUTO. +- Why noted: A useful contemporary reference point for *the other* caching subfield. 
As KV-cache serving becomes more storage-system-shaped (LMCache, VAST VUA, Pure KVA, NVIDIA CMX), the 2024-era object-cache policy orchestration literature becomes directly relevant — and SCION shows that the field is converging on "small fingerprint + offline-trained policy selector" rather than monolithic LRU successors. The KV-cache analogue (predictive multi-tier from last week, BoseKV, etc.) is moving toward the same shape. + +--- + +## Cross-cutting observations (this week) + +- **Distributed KV cache is now the vLLM-blessed default for agentic traffic.** vLLM endorsing Mooncake Store as the cross-instance KV substrate (entry #1) — with a 94%-reusable-prefix measurement on real agent traces — closes a roughly 18-month arc from "Mooncake is Kimi's internal substrate" (FAST '25) to "vLLM officially recommends Mooncake when local prefix caching isn't enough." Pairs with Cloudflare's earlier endorsement and AWS HyperPod's managed L2 cache. Local prefix caching is no longer the upper bound for what vLLM operators are expected to deploy. +- **MLA kernels are the new contention front, and the open ecosystem is closing the gap to TensorRT-LLM in months, not years.** TokenSpeed (entry #2) shipping a Blackwell MLA kernel that beats TRT-LLM in two months — and getting upstreamed into vLLM on day zero — confirms what the FA4 default in vLLM v0.20.0 last week implied: MLA prefill / decode kernels are now a first-class open-source competition surface, not a vendor moat. Coding-agent prefill shapes (long shared prefix, batch 4–16, speculative decoding) are the workload that's pulling kernel work forward. +- **Theory continues to catch mechanism.** eOptShrinkQ (entry #5) joins last week's CapKV (IB-objective eviction) and the depth-cache lower bound as evidence that the post-2023 KV-cache literature is actively being unified under random-matrix and information-theoretic frameworks. The Position paper (entry #13) makes the same argument at the system-scheduling level. A year ago the gap between "what works in practice" and "what we can prove" was the dominant feature of the field; it's narrowing visibly. +- **KV cache is no longer the only cache fighting for HBM.** HELM (entry #6) is the first paper to cleanly formalize the EMB-vs-KV split for generative recommenders, with a 0.35-of-pool measured swing across workloads. As GR adoption grows (HSTU, Meta GRMM, Pinterest, Tencent), the "one HBM, two caches" framing becomes a distinct optimization track separate from the chat/RAG/agent caches the field has been focused on. +- **MoE prefill needs its own dispatch primitive.** ZeRO-Prefill (entry #8) reframes the activation-AllToAll vs. weight-AllGather choice as a workload-asymmetric one: AllGather wins for compute-bound prefill, AllToAll wins for memory-bound decode. This is the right level of abstraction for the post-DeepSeek-V4 / Kimi-K2.x / Hunyuan-3 era where MoE prefill is the binding cost on agentic workloads. +- **Erasure-coded KV cache and continuous-memory eviction are the next-generation primitives.** GhostServe (entry #11) and RetentiveKV (entry #7) both push beyond the "evict means delete" mental model — one toward fault tolerance, the other toward state-space recall. Both are signs that the KV-cache subsystem is being reconceptualized as a real storage system (with reliability, recoverability, and graceful degradation properties), not just a fast scratch buffer. 
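
To make the recurring "KV cache, not FLOPs, is the binding constraint" arithmetic (entries #1 and #3) concrete, the sketch below walks the footprint math end to end. It is a minimal illustration, not a measurement: every constant in it (layer count, KV heads, head dimension, context length, HBM budget, and the 10× compression factor) is a hypothetical placeholder rather than a number taken from DeepSeek V4, Mooncake, or any system cited above.

```python
"""Back-of-the-envelope KV-cache footprint and concurrency arithmetic.

Illustrative only: every constant below is a hypothetical placeholder,
not a figure taken from DeepSeek V4, Mooncake, or any report cited above.
"""


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """KV bytes one token occupies: K plus V, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(hbm_for_kv_gb: float, context_len: int,
                            per_token_bytes: float) -> int:
    """Full-context requests that fit in a fixed KV-cache HBM budget."""
    per_request_bytes = context_len * per_token_bytes
    return int(hbm_for_kv_gb * 1e9 // per_request_bytes)


if __name__ == "__main__":
    # Hypothetical dense-GQA baseline: 60 layers, 8 KV heads, head_dim 128,
    # FP16 KV cache (2 bytes per element).
    base = kv_bytes_per_token(n_layers=60, n_kv_heads=8, head_dim=128,
                              bytes_per_elem=2.0)
    # Hypothetical ~10x-smaller variant, standing in for architectural
    # KV compression plus low-bit quantization combined.
    compressed = base / 10

    # ~1 TB of pooled HBM reserved for KV across a multi-GPU serving group,
    # serving 1M-token contexts.
    for label, per_tok in [("baseline", base), ("~10x compressed", compressed)]:
        fits = max_concurrent_requests(hbm_for_kv_gb=1000,
                                       context_len=1_000_000,
                                       per_token_bytes=per_tok)
        print(f"{label:>16}: {per_tok / 1024:.1f} KiB/token, "
              f"{fits} concurrent 1M-token requests")
```

Under these toy numbers, a 10× smaller per-token KV footprint moves the fixed pool from 4 to 40 concurrent 1M-token requests, which is the shape of the argument in LMCache's wallet post (entry #3); real headroom also depends on weights, activations, paging overhead, and how the serving engine partitions HBM between caches.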
+ +## References + +- [vLLM × Mooncake — Serving Agentic Workloads at Scale (X / Twitter, May 7, 2026)](https://x.com/vllm_project/status/2052113331927060840) · [Mooncake — Welcome page](https://kvcache-ai.github.io/Mooncake/) · [Mooncake Store design](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html) · [vLLM V1 + Mooncake Store + LMCache integration](https://kvcache-ai.github.io/Mooncake/getting_started/examples/vllm-integration/vllmv1-lmcache-integration.html) · [Mooncake Joins PyTorch Ecosystem](https://pytorch.org/blog/mooncake-joins-pytorch-ecosystem/) · [Mooncake (FAST '25 paper)](https://www.usenix.org/system/files/fast25-qin.pdf) +- [LightSeek Foundation — TokenSpeed: A Speed-of-Light LLM Inference Engine](https://lightseek.org/blog/lightseek-tokenspeed.html) · [TokenSpeed GitHub](https://github.com/lightseekorg/tokenspeed) · [TokenSpeed kernel source](https://github.com/lightseekorg/tokenspeed/tree/main/tokenspeed-kernel) · [vLLM × TokenSpeed day-0 launch (X, May 7, 2026)](https://x.com/vllm_project/status/2052051210530914510) · [MarkTechPost — LightSeek Releases TokenSpeed (May 7, 2026)](https://www.marktechpost.com/2026/05/07/lightseek-foundation-releases-tokenspeed-an-open-source-llm-inference-engine-targeting-tensorrt-llm-level-performance-for-agentic-workloads/) +- [LMCache — Deepseek V4 explained, and why it matters to your wallet (May 4, 2026)](https://blog.lmcache.ai/en/2026/05/04/deepseek-v4-explained-and-why-it-matters-to-your-wallet/) · [LMCache on Amazon SageMaker HyperPod (Apr 22, 2026)](https://blog.lmcache.ai/en/2026/04/22/lmcache-on-amazon-sagemaker-hyperpod-accelerating-llm-inference-with-managed-tiered-kv-cache/) · [LMCache — Stop Calling It KV Cache (Apr 28, 2026)](https://blog.lmcache.ai/en/2026/04/28/stop-calling-it-kv-cache-its-something-much-bigger/) +- [vLLM v0.20.2 release notes (May 10, 2026)](https://github.com/vllm-project/vllm/releases/tag/v0.20.2) · [vLLM v0.20.1 release notes (May 4, 2026)](https://github.com/vllm-project/vllm/releases/tag/v0.20.1) · [vLLM v0.20.0 release notes (Apr 27, 2026)](https://github.com/vllm-project/vllm/releases/tag/v0.20.0) +- [arXiv:2605.02905 — eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization](https://arxiv.org/abs/2605.02905) +- [arXiv:2605.04450 — One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving](https://arxiv.org/abs/2605.04450) +- [arXiv:2605.04075 — RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction](https://arxiv.org/abs/2605.04075) +- [arXiv:2605.02960 — ZeRO-Prefill: Zero Redundancy Overheads in MoE Prefill Serving](https://arxiv.org/abs/2605.02960) +- [arXiv:2605.06554 — Long Context Pre-Training with Lighthouse Attention](https://arxiv.org/abs/2605.06554) +- [arXiv:2605.00528 — SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters](https://arxiv.org/abs/2605.00528) +- [arXiv:2605.00831 — GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving](https://arxiv.org/abs/2605.00831) +- [arXiv:2605.02262 — WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs](https://arxiv.org/abs/2605.02262) +- [arXiv:2605.01280 — Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics](https://arxiv.org/abs/2605.01280) +- [arXiv:2605.06611 — The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension 
Disparity](https://arxiv.org/abs/2605.06611) +- [arXiv:2605.01055 — SCION: Size-aware Policy Orchestration for Nonstationary Object Caches](https://arxiv.org/abs/2605.01055) + +### Additional context (noted this week, not selected as top entries) + +- [arXiv:2605.06285 — LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG (May 7, 2026)](https://arxiv.org/abs/2605.06285) — shifts agentic-RAG reasoning and retrieval from discrete language to continuous latent space; ~90% latency reduction at comparable quality. Adjacent rather than core (no KV mechanism), but a useful data point on the "continuous latent state" trend that includes RetentiveKV. +- [arXiv:2605.00789 — Make Your LVLM KV Cache More Lightweight / LightKV (May 1, 2026)](https://arxiv.org/abs/2605.00789) — cross-modal prompt-guided vision-token compression for LVLM prefill, no retraining required. Companion to WindowQuant in the per-modality VLM track. +- [DDN × Google Cloud — Managed Lustre as a shared external KV cache (Apr 22–23, 2026)](https://www.sdxcentral.com/news/ddn-google-cloud-claim-lustre-kv-cache-trick-boosts-ai-inference-throughput-by-75/) — claimed 75% inference-throughput improvement and >40% mean-TTFT reduction vs. host-memory-only KV cache. Just outside the window but highly relevant context on the storage-vendor side; first 10 TBps-class shared-cache pitch. +- [vLLM blog — DeepSeek V4 in vLLM: Efficient Long-context Attention (Apr 24, 2026)](https://vllm.ai/blog/deepseek-v4) — companion to entries #1 and #4; the canonical first-principles writeup of the V4 attention mechanism. +- [Modular — The Five Eras of KVCache](https://www.modular.com/blog/the-five-eras-of-kvcache) — historical framing post that LMCache's wallet-and-rename arc keeps citing. +- [Awesome-KV-Cache-Management (TreeAI-Lab)](https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management) and [Awesome-KV-Cache-Optimization (jjiantong, ACL 2026 survey)](https://github.com/jjiantong/Awesome-KV-Cache-Optimization) and [Awesome-KV-Cache-Compression (October2001)](https://github.com/October2001/Awesome-KV-Cache-Compression) — three trackers; all three updated through this window. +- [kvcached on GitHub (ovg-project)](https://github.com/ovg-project/kvcached) — virtualized elastic KV cache for shared-GPU serving, with vLLM/SGLang integration; 2–28× TTFT reduction vs. baseline serving engines, ongoing development through April–May 2026.