llama.cpp fork with TurboQuant quantization (turbo2/3/4) and TriAttention GPU-accelerated KV cache pruning. 75 tok/s on Qwen3-8B / RTX 3080.
Updated Apr 9, 2026 - C++
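The description mentions KV-cache pruning. The actual TriAttention algorithm is not documented here, so purely as a hedged illustration of the general idea, a minimal score-based pruning pass over a KV cache might look like the sketch below; `KVEntry`, `prune_kv_cache`, and the accumulated-attention `score` field are hypothetical names invented for this example.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Hypothetical sketch, NOT the TriAttention algorithm: keep only the
// `keep` highest-scoring cache entries, where `score` stands in for some
// importance measure such as accumulated attention mass per position.
struct KVEntry {
    std::vector<float> key;
    std::vector<float> value;
    float score;  // importance of this cached position
};

std::vector<KVEntry> prune_kv_cache(std::vector<KVEntry> cache, std::size_t keep) {
    if (cache.size() <= keep) return cache;
    // Partial sort: the `keep` highest-scoring entries end up in front.
    std::nth_element(cache.begin(), cache.begin() + keep, cache.end(),
                     [](const KVEntry& a, const KVEntry& b) {
                         return a.score > b.score;
                     });
    cache.resize(keep);  // drop the low-importance tail
    return cache;
}
```

A real implementation would additionally have to keep the retained entries in their original sequence order and update any position indices used by the attention kernel; this sketch only shows the selection step.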