A KVPress-based Implementation for Efficient Long-Context Prefilling
DASH (Delta Attention Selective Halting) is a training-free, inference-time token halting method built on top of NVIDIA’s KVPress framework.
It targets the prefill stage of long-context and multimodal inference—where the majority of computation is incurred—and reduces FLOPs by selectively halting tokens whose representations have already stabilized.
The core intuition of DASH is that, as depth increases, many tokens gradually converge to a semantic fixed point: their representations receive only negligible updates and no longer meaningfully participate in global information aggregation. Continuing to process such tokens yields diminishing returns while incurring substantial computational cost.
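As illustrated in the figure above, DASH monitors layer-local attention updates during prefill. At a designated start layer $l_s$, it scores each prompt token by its delta-attention score $\Delta_t^{(l_s)}$ and keeps the top $(1-\rho)T$ tokens as the active set.
Once the pruning gate is activated at $l_s$, the same active set is reused for all deeper layers, and halted tokens skip both self-attention and FFN.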
This design makes DASH:
- Training-free and easy to integrate
- Compatible with FlashAttention / SDPA, as it does not require attention matrices
- Applicable to both text-only and vision–language models using a unified halting criterion
Overall, DASH provides a simple yet effective mechanism to reduce prefill computation while preserving model accuracy and generation behavior.
This implementation uses the same environment as KVPress.
Please refer to the official KVPress repository for full dependency and system requirements:
git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
pip install -e .
Copy the DASH implementation file into KVPress:
cp DASH/presses/delta_press.py kvpress/kvpress/presses/delta_press.py
This implementation corresponds to Δattn-guided token halting as described in the DASH paper.
Add the import:
from kvpress.presses.delta_press import DeltaPress
Register it in `__all__`:
__all__ = [
    # other registered methods
    "DeltaPress",
]
Add the import:
from kvpress import (
    # other registered methods
    DeltaPress,
)
Register DASH in `PRESS_REGISTRY`:
PRESS_REGISTRY = {
    # other methods
    "delta_press": DeltaPress(),
}
`delta_press_l_nr.py` implements a layer-level delta (Δblock / Δlayer) variant.
Unlike the main DASH implementation, it uses full-layer deltas instead of attention-branch deltas as the halting signal; this variant is included solely for ablation and reference purposes.
In addition to the original attention hooks, an extra pair of FFN hooks is required to measure full-layer deltas.
As a result, the original KVPress base class must be replaced:
cp presses/base_press.py kvpress/kvpress/presses/base_press.py
DASH uses the standard KVPress evaluation entry:
kvpress/evaluation/evaluate.sh
dataset="longbench"
data_dirs=(qasper_e multifieldqa_en_e hotpotqa_e 2wikimqa_e gov_report_e multi_news_e trec_e triviaqa_e samsum_e passage_count_e passage_retrieval_en_e repobench-p_e lcc_e)
press="delta_press"
compression_ratios=0.667
- `compression_ratios` corresponds to the pruning ratio ρ in the DASH paper
- It denotes the fraction of eligible prompt tokens removed at the decision layer
- Token halting is applied only during prefill
- Decoding behavior is kept identical across all methods
With default internal settings:
l_s = floor(0.4 * L)
Halting applies to layers [l_s, L).
The overall effective token reduction is approximately:
(1 - start_ratio) * compression_ratios
≈ 0.6 * 0.667 ≈ 40%
This matches the operating points reported in the paper.
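
The arithmetic above can be checked with a minimal sketch. The function names and the example depth of 32 layers are hypothetical and do not come from the DASH code. The second estimate, which weights by the actual number of halting layers after the floor, is an assumption about how the approximation could be refined.

```python
import math


def start_layer(num_layers: int, start_ratio: float = 0.4) -> int:
    """Index l_s of the first halting layer: floor(start_ratio * L)."""
    return math.floor(start_ratio * num_layers)


def approx_token_reduction(compression_ratio: float, start_ratio: float = 0.4) -> float:
    """Approximation used in the text: (1 - start_ratio) * compression_ratios."""
    return (1.0 - start_ratio) * compression_ratio


def layerwise_token_reduction(
    num_layers: int, compression_ratio: float, start_ratio: float = 0.4
) -> float:
    """Same estimate, but using the integer layer count in [l_s, L).

    Assumption: every layer from l_s onward processes only the active set.
    """
    l_s = start_layer(num_layers, start_ratio)
    halting_layers = num_layers - l_s
    return halting_layers / num_layers * compression_ratio


if __name__ == "__main__":
    L = 32  # hypothetical model depth, for illustration only
    print("start layer l_s:", start_layer(L))
    print("approximate reduction:", approx_token_reduction(0.667))
    print("layer-weighted reduction:", layerwise_token_reduction(L, 0.667))
```

Because `l_s` is rounded down, the layer-weighted estimate can be slightly higher than the roughly 40% given by the closed-form approximation.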
DASH uses the attention-branch pre-residual update as the halting signal.
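Concretely, for each token $t$ at layer $l$, the score $\Delta_t^{(l)}$ measures the size of the attention sublayer's output before it is added to the residual stream; see the DASH paper for the exact definition.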
This signal has the following properties:
- It captures layer-local representation updates without requiring attention weights
- No attention matrix is materialized
- Fully compatible with FlashAttention / SDPA
At the start layer $l_s$:
- Compute the token-wise delta-attention score $\Delta_t^{(l_s)}$
- Select the top $(1-\rho)T$ tokens as the active set
- Reuse this active set for all deeper layers (single-shot halting schedule)
Halted tokens skip both self-attention and FFN.
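
The single-shot selection described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `delta_press.py`: the function names are hypothetical, the score is assumed to be the L2 norm of each token's attention-branch output, and the rounding of $(1-\rho)T$ is a guess.

```python
import math
from typing import List, Sequence


def delta_scores(attn_updates: Sequence[Sequence[float]]) -> List[float]:
    """Per-token score: L2 norm of the attention-branch output (assumed metric)."""
    return [math.sqrt(sum(x * x for x in vec)) for vec in attn_updates]


def select_active_tokens(scores: Sequence[float], compression_ratio: float) -> List[int]:
    """Keep the top (1 - rho) * T tokens by score, returned in original order."""
    num_tokens = len(scores)
    # Rounding and the minimum of one kept token are assumptions.
    num_keep = max(1, round((1.0 - compression_ratio) * num_tokens))
    ranked = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    # Toy attention-branch outputs for six prompt tokens at the start layer.
    updates = [[0.0, 0.1], [3.0, 4.0], [0.2, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
    scores = delta_scores(updates)
    active = select_active_tokens(scores, compression_ratio=0.5)
    print(active)  # [1, 3, 5]
    # The same index list would be reused unchanged at every deeper layer.
```

The active indices are returned in their original order so that the positions of the surviving tokens are preserved when the reduced sequence is passed to deeper layers.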
- Token halting applies only to prompt tokens
- Prefill FLOPs are reduced significantly
- No token halting is applied during decoding
- Generation behavior is preserved, while KV cache usage is reduced due to prefill-time halting
The figure below summarizes the end-to-end (E2E) latency–accuracy trade-off of DASH compared with representative prefill-time token compression baselines. Each point corresponds to a specific compression setting, plotting task score against total inference time.
Overall, DASH consistently lies on the Pareto frontier, achieving substantial latency reduction with minimal accuracy degradation. In particular, DASH attains notable speedups over strong baselines (e.g., SnapKV (pruned) and FastV) while preserving comparable or higher task performance, and significantly outperforms prompt-level compression methods such as LLMLingua2 under the same latency budgets.
These results indicate that Δattn-guided single-shot halting is an effective strategy for reducing prefill computation without sacrificing downstream task quality. For detailed experimental settings, additional benchmarks, and quantitative analyses, please refer to the paper.
- This implementation targets prefill-time acceleration only
- No retraining or finetuning is required
- Token selection is single-shot and static
- Adaptive schedules and learned policies are not explored here
For full details, please refer to the DASH paper.

