Experimental fork of llama.cpp via spiritbuun/buun-llama-cpp with TurboQuant, TriAttention, PlanarQuant, and IsoQuant KV cache compression.
# Build for V100 (SM70)
cd build && CUDACXX=/usr/local/cuda/bin/nvcc cmake .. \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=70 \
-DGGML_CUDA_FORCE_MMQ=OFF \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_FORCE_CUBLAS=ON \
-DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Run server with Turbo2 KV compression
./build/bin/llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 8192 --port 8080
# Run server with TriAttention + Turbo2 (maximum KV compression)
TRIA_STATS_PATH=stats/qwen35_9b.bin TRIA_BUDGET_TOKENS=2048 \
./build/bin/llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 32768 --port 8080
# Benchmark
./build/bin/llama-bench -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -p 512,2048 -n 128,512
# CLI
./build/bin/llama-cli -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 8192 -p "Hello world"Hardware: Tesla V100-SXM2-32GB | CUDA: 12.6.85 | Build: SM70 | Model: Qwen3.5-9B Q4_K_M (5.28 GiB)
| KV Type | K | V | tg128 | tg512 | pp512 | pp2048 | vs f16 | KV Compress | Best For |
|---|---|---|---|---|---|---|---|---|---|
| f16/f16 | f16 | f16 | 93.15 | 93.08 | 2723 | 2707 | — | 1.0× | Quality baseline |
| turbo2/turbo2 | turbo2 | turbo2 | 87.54 | 87.31 | 2669 | 2643 | -6% | 2.3× | Best speed/compression |
| turbo3/turbo3 | turbo3 | turbo3 | 85.28 | 85.05 | 2664 | 2626 | -8% | 3.3× | Good compression, fast |
| planar3 K/f16 V | planar3 | f16 | 85.30 | 84.68 | 2340 | 1906 | -8% | 1.8× | Zero PPL loss, K-only |
| iso3 K/f16 V | iso3 | f16 | 80.09 | 79.86 | 1765 | 1122 | -14% | 1.8× | K-only quality |
| f16 K/planar3 V | f16 | planar3 | 85.01 | 81.93 | 2162 | 1389 | -9% | 1.8× | V-only compression |
| f16 K/iso3 V | f16 | iso3 | 84.88 | 81.90 | 2156 | 1381 | -9% | 1.8× | V-only quality |
| planar3/planar3 | planar3 | planar3 | 78.88 | 75.75 | 1980 | 1212 | -15% | 9.8× | Max compression |
| iso3/iso3 | iso3 | iso3 | 75.13 | 72.59 | 1799 | 1039 | -19% | 9.8× | Best PPL/compression |
| planar4/planar4 | planar4 | planar4 | 75.20 ± 0.21 | 72.28 ± 0.27 | 1799.95 ± 2.63 | 1050.46 ± 1.29 | -21% | 4.0× | Newer, 2D Givens+4b |
| iso4/iso4 | iso4 | iso4 | 73.59 ± 0.21 | 70.19 ± 0.32 | 1655.38 ± 6.06 | 915.08 ± 0.07 | -24% | 4.0× | Newer, 4D quaternion+4b |
| planar4 K/f16 V | planar4 | f16 | 81.03 ± 0.06 | 80.54 ± 0.30 | 2246.71 ± 4.55 | 1776.88 ± 0.87 | -13% | 1.8× | Zero PPL loss, K-only, newer |
| planar4/turbo2 | planar4 | turbo2 | 81.91 | 81.49 | 2201 | 1735 | -12% | 5.3× | Higher compression than turbo2 |
| planar3/turbo3 | planar3 | turbo3 | 81.04 | 80.94 | 2306 | 1879 | -13% | 5.3× | Higher compression than turbo3 |
| iso4/turbo2 | iso4 | turbo2 | 79.84 | 78.42 | 1953 | 1411 | -16% | 5.3× | Higher compression than turbo2, lower speed |
All numbers are tokens/sec ± stddev (see TEST_RESULTS_V100.md for full stats).
- Turbo2 is the fastest compressed type on V100: 87.5 t/s with 2.3× KV compression
- Planar3 K + f16 V matches turbo3 speed (85.3 t/s) but with zero PPL loss — ideal for quality-critical workloads
- Iso3 symmetric has the best quality per bit (PPL +4.2% vs turbo3's +6.6%) at 9.8× compression
- On V100, planar3/iso3 are slower than turbo3 because they use the native VEC kernel (dequantize-in-loop) while turbo3 dequants to f16 + MMA tensor cores
- Asymmetric KV (planar4/turbo2, planar3/turbo3, iso4/turbo2) now supported with flash attention on GPU. Use
-ctk planar4 -ctv turbo2for zero PPL loss in K-only quantization. - planar4/turbo2 and planar3/turbo3 achieve 5.3× KV compression with 12–13% decode speed loss and 19–20% prefill loss vs f16. iso4/turbo2 gives similar compression with ~16% decode and ~36% prefill loss.
| Config | 512 tok gen | 2048 tok gen | KV Size | TriA Events |
|---|---|---|---|---|
| Baseline f16 | 88.4 t/s | 88.1 t/s | 256 MiB | 0 |
| FA+Turbo2 | 82.7 t/s | 81.0 t/s | 40 MiB | 0 |
| FA+Turbo2+TriA (budget=2048) | 82.0 t/s | 69.3 t/s | 40 MiB | 72 |
| FA+Turbo3_TCQ+TriA (budget=2048) | 77.6 t/s | 65.7 t/s | 56 MiB | 75 |
| FA+planar4/turbo2+TriA (budget=2048) | 80.3 t/s | 69.3 t/s | 48 MiB | - |
TriAttention adds ~14% overhead (CPU-side scoring). Physical KV compaction pending — when active, combined TriA+Turbo2 yields ~37× total KV compression.
Walsh-Hadamard Transform + Lloyd-Max scalar quantization + QJL correction. Implemented as turbo2, turbo3, turbo4 types with optional TCQ codebooks.
# CLI
llama-cli -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2
# Server
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3 --port 8080
# With TCQ codebooks
TURBO_TCQ_CB2=codebooks/2bit/tcq_2bit_100iter_s99.bin \
TURBO_TCQ_CB=codebooks/3bit/cb_50iter_finetuned.bin \
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo3_tcq -ctv turbo3_tcq --port 8080
# Benchmark all TurboQuant variants
llama-bench -m model.gguf -ngl 99 -fa 1 \
-ctk f16,turbo2,turbo3,turbo2_tcq,turbo3_tcq \
-ctv f16,turbo2,turbo3,turbo2_tcq,turbo3_tcq \
-p 512,2048 -n 128,512| Type | Bits | Rotation | QJL | TCQ | Decode Speed |
|---|---|---|---|---|---|
| turbo2 | 2-bit | WHT 128×128 | No | No | 87 t/s |
| turbo3 | 3-bit | WHT 128×128 | Yes | No | 85 t/s |
| turbo2_tcq | 2-bit+TCQ | WHT 128×128 | No | Yes | 86 t/s |
| turbo3_tcq | 3-bit+TCQ | WHT 128×128 | Yes | Yes | 82 t/s |
Trigonometric frequency-based KV entry scoring + eviction. Reduces number of KV entries (complementary to TurboQuant's per-element compression). Scoring triggers every 128 tokens during decode.
# Step 1: Generate calibration stats (offline, one-time)
python3 triattention_calibrate.py \
--model Qwen/Qwen3.5-9B \
--input wikitext.txt \
--output stats/qwen35_9b.bin
# Step 2: Run with TriAttention
TRIA_STATS_PATH=stats/qwen35_9b.bin \
TRIA_BUDGET_TOKENS=2048 \
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 32768 --port 8080
# Alternative: budget as percentage of context
TRIA_STATS_PATH=stats/qwen35_9b.bin \
TRIA_BUDGET_PCT=50 \
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 32768 --port 8080| Env Var | Default | Description |
|---|---|---|
TRIA_STATS_PATH |
(required) | Path to calibration .bin file |
TRIA_BUDGET_TOKENS |
— | Absolute KV budget in tokens |
TRIA_BUDGET_PCT |
50 | KV budget as % of context |
TRIA_RUNTIME_DIVIDE_LENGTH |
128 | Scoring trigger interval |
TRIA_RUNTIME_WINDOW_SIZE |
128 | Recent tokens always kept |
TRIA_RAMP_START_PCT |
10 | Exponential ramp start % |
Block-diagonal rotation KV cache quantization — replaces TurboQuant's full d×d WHT with 2D Givens (PlanarQuant) or 4D quaternion (IsoQuant) rotations. Fewer parameters, faster prefill, better PPL than TurboQuant at same compression.
# PlanarQuant 3-bit (best speed)
llama-server -m model.gguf -ngl 99 -fa 1 -ctk planar3 -ctv planar3 --port 8080
# IsoQuant 3-bit (best quality)
llama-server -m model.gguf -ngl 99 -fa 1 -ctk iso3 -ctv iso3 --port 8080
# K-only compression (zero PPL loss)
llama-server -m model.gguf -ngl 99 -fa 1 -ctk planar3 -ctv f16 --port 8080
# Mixed: PlanarQuant K + TurboQuant V
llama-server -m model.gguf -ngl 99 -fa 1 -ctk planar3 -ctv turbo3 --port 8080| Type | Rotation | Group Size | FMAs (d=128) | Params | Status |
|---|---|---|---|---|---|
| TurboQuant | WHT butterfly | 128 | 16,384 | 16,384 | ✅ V100 tested |
| PlanarQuant | 2D Givens | 2 | 256 | 128 | ✅ V100 tested |
| IsoQuant | 4D quaternion | 4 | 512 | 128 | ✅ V100 tested |
| RotorQuant (Cl(3,0)) | 3D rotor sandwich | 3 | ~2,400 | 372 | Research only (Triton) |
V100 tested: Planar3 K+f16 V (85.3 t/s), planar3 symmetric (78.9 t/s), iso3 symmetric (75.1 t/s).
On V100, planar3/iso3 use native VEC kernel (dequant-in-loop) which is slower than turbo3's dequant-to-f16+MMA path. On RTX 5090 (Blackwell), planar3 (119 t/s) and iso3 (118 t/s) beat turbo3 (93 t/s) by 27-28%.
TriAttention (entry eviction) and TurboQuant/PlanarQuant/IsoQuant (per-element compression) are orthogonal and stackable:
┌──────────────────────────────────────────────┐
│ TriAttention: score + evict low-value │
│ KV entries → 10-16× entry reduction │
├──────────────────────────────────────────────┤
│ PlanarQuant/IsoQuant/TurboQuant: compress │
│ each remaining entry → 2-3× per-element │
├──────────────────────────────────────────────┤
│ Total: 20-37× KV compression vs FP16 │
│ (when TriAttention compaction is active) │
└──────────────────────────────────────────────┘
# TriAttention + PlanarQuant 3-bit at 32K context
TRIA_STATS_PATH=stats/qwen35_9b.bin TRIA_BUDGET_TOKENS=2048 \
llama-server -m model.gguf -ngl 99 -fa 1 \
-ctk planar3 -ctv planar3 -c 32768 --port 8080mkdir build && cd build
CUDACXX=/usr/local/cuda/bin/nvcc cmake .. \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=70 \
-DGGML_CUDA_FORCE_MMQ=OFF \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_FORCE_CUBLAS=ON \
-DCMAKE_BUILD_TYPE=Release
make -j$(nproc)CUDACXX=/usr/local/cuda/bin/nvcc cmake .. \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=80 \ # or 90 for H100
-DGGML_CUDA_FORCE_MMQ=ON \
-DCMAKE_BUILD_TYPE=Release
make -j$(nproc)cmake .. \
-DGGML_METAL=ON \
-DGGML_METAL_EMBED_LIBRARY=ON \
-DCMAKE_BUILD_TYPE=Release
make -j$(nproc)Status:
The DFlash block-diffusion drafter was tested with speculative-simple but the V100's SM70 lacks tensor core throughput for efficient draft/verify. Accept rate was only 8.95%. DFlash is designed for H100/A100 or Apple M-series hardware.
| Path | Description |
|---|---|
ggml/src/ggml-cuda/triattention-cuda.cu |
TriAttention CUDA scoring kernels |
ggml/src/ggml-cuda/cpy-planar-iso.cu |
PlanarQuant/IsoQuant CUDA copy+dequant kernels |
ggml/src/ggml-planar-quant.c |
PlanarQuant 3-bit CPU quantization |
ggml/src/ggml-iso-quant.c |
IsoQuant 3-bit CPU quantization |
ggml/src/ggml-planar4-quant.c |
PlanarQuant 4-bit CPU quantization |
ggml/src/ggml-iso4-quant.c |
IsoQuant 4-bit CPU quantization |
src/triattention.c/h |
TriAttention core scoring math |
src/triattention-runtime.c/h |
TriAttention runtime (scoring/eviction/budget) |
src/triattention-bridge.cpp |
C++ bridge for KV cache access |
src/triattention-backend.c/h |
CUDA backend adapter |
stats/ |
TriAttention calibration stats |
codebooks/ |
TCQ codebooks for TurboQuant |
- Repo: https://github.com/xeonai44/xllama.cpp
- Upstream: https://github.com/spiritbuun/buun-llama-cpp
- TurboQuant Paper: https://arxiv.org/abs/2504.19874 (ICLR 2026)
- TriAttention Paper: https://arxiv.org/abs/2503.20181 (WeianMao et al.)
- RotorQuant Paper: https://www.scrya.com/rotorquant.pdf (Pope, March 2026)
- PlanarQuant/IsoQuant: https://github.com/ParaMind2025/isoquant
- Calibration Tool:
triattention_calibrate.pyin repo root
Built and tested on Tesla V100-SXM2-32GB. All benchmarks use Qwen3.5-9B Q4_K_M unless noted.
