xeonai44/xllama.cpp

xllama.cpp — Optimized LLM Inference for V100

Experimental fork of llama.cpp via spiritbuun/buun-llama-cpp with TurboQuant, TriAttention, PlanarQuant, and IsoQuant KV cache compression.

Quick Start

# Build for V100 (SM70)
mkdir -p build && cd build
CUDACXX=/usr/local/cuda/bin/nvcc cmake .. \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=70 \
  -DGGML_CUDA_FORCE_MMQ=OFF \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
cd ..   # the commands below are run from the repository root

# Run server with Turbo2 KV compression
./build/bin/llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 8192 --port 8080

# Run server with TriAttention + Turbo2 (maximum KV compression)
TRIA_STATS_PATH=stats/qwen35_9b.bin TRIA_BUDGET_TOKENS=2048 \
  ./build/bin/llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 32768 --port 8080

# Benchmark
./build/bin/llama-bench -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -p 512,2048 -n 128,512

# CLI
./build/bin/llama-cli -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 8192 -p "Hello world"

V100 Benchmark Summary — All KV Cache Types

Hardware: Tesla V100-SXM2-32GB | CUDA: 12.6.85 | Build: SM70 | Model: Qwen3.5-9B Q4_K_M (5.28 GiB)

Full Results (llama-bench, 3 repeats, b512)

| KV Type | K | V | tg128 | tg512 | pp512 | pp2048 | vs f16 | KV Compress | Best For |
|---|---|---|---|---|---|---|---|---|---|
| f16/f16 | f16 | f16 | 93.15 | 93.08 | 2723 | 2707 | | 1.0× | Quality baseline |
| turbo2/turbo2 | turbo2 | turbo2 | 87.54 | 87.31 | 2669 | 2643 | -6% | 2.3× | Best speed/compression |
| turbo3/turbo3 | turbo3 | turbo3 | 85.28 | 85.05 | 2664 | 2626 | -8% | 3.3× | Good compression, fast |
| planar3 K/f16 V | planar3 | f16 | 85.30 | 84.68 | 2340 | 1906 | -8% | 1.8× | Zero PPL loss, K-only |
| iso3 K/f16 V | iso3 | f16 | 80.09 | 79.86 | 1765 | 1122 | -14% | 1.8× | K-only quality |
| f16 K/planar3 V | f16 | planar3 | 85.01 | 81.93 | 2162 | 1389 | -9% | 1.8× | V-only compression |
| f16 K/iso3 V | f16 | iso3 | 84.88 | 81.90 | 2156 | 1381 | -9% | 1.8× | V-only quality |
| planar3/planar3 | planar3 | planar3 | 78.88 | 75.75 | 1980 | 1212 | -15% | 9.8× | Max compression |
| iso3/iso3 | iso3 | iso3 | 75.13 | 72.59 | 1799 | 1039 | -19% | 9.8× | Best PPL/compression |
| planar4/planar4 | planar4 | planar4 | 75.20 ± 0.21 | 72.28 ± 0.27 | 1799.95 ± 2.63 | 1050.46 ± 1.29 | -21% | 4.0× | Newer, 2D Givens+4b |
| iso4/iso4 | iso4 | iso4 | 73.59 ± 0.21 | 70.19 ± 0.32 | 1655.38 ± 6.06 | 915.08 ± 0.07 | -24% | 4.0× | Newer, 4D quaternion+4b |
| planar4 K/f16 V | planar4 | f16 | 81.03 ± 0.06 | 80.54 ± 0.30 | 2246.71 ± 4.55 | 1776.88 ± 0.87 | -13% | 1.8× | Zero PPL loss, K-only, newer |
| planar4/turbo2 | planar4 | turbo2 | 81.91 | 81.49 | 2201 | 1735 | -12% | 5.3× | Higher compression than turbo2 |
| planar3/turbo3 | planar3 | turbo3 | 81.04 | 80.94 | 2306 | 1879 | -13% | 5.3× | Higher compression than turbo3 |
| iso4/turbo2 | iso4 | turbo2 | 79.84 | 78.42 | 1953 | 1411 | -16% | 5.3× | Higher compression than turbo2, lower speed |

The tg and pp columns are in tokens/sec, with ± stddev shown where it was recorded (see TEST_RESULTS_V100.md for full stats).

Key Takeaways

  • Turbo2 is the fastest compressed type on V100: 87.5 t/s with 2.3× KV compression
  • Planar3 K + f16 V matches turbo3 speed (85.3 t/s) but with zero PPL loss — ideal for quality-critical workloads
  • Iso3 symmetric has the best quality per bit (PPL +4.2% vs turbo3's +6.6%) at 9.8× compression
  • On V100, planar3/iso3 are slower than turbo3 because they use the native VEC kernel (dequantize-in-loop), while turbo3 dequantizes to f16 and then runs on the MMA tensor-core path
  • Asymmetric KV (planar4/turbo2, planar3/turbo3, iso4/turbo2) is now supported with flash attention on GPU. Use -ctk planar4 -ctv turbo2 for the asymmetric path; the zero-PPL-loss K-only configuration in the table is -ctk planar4 -ctv f16.
  • planar4/turbo2 and planar3/turbo3 achieve 5.3× KV compression with 12–13% decode speed loss and 19–20% prefill loss vs f16. iso4/turbo2 gives similar compression with ~16% decode and ~36% prefill loss.
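
The percentage figures quoted above can be rechecked from the raw throughput numbers. The following is a minimal sketch; `pct_delta` is a hypothetical helper name (not part of this repository), and the inputs are the tg128 values copied from the results table.

```shell
#!/bin/sh
# pct_delta BASE VALUE
# Prints the signed percent change of VALUE relative to BASE, one decimal place.
# Hypothetical helper for illustration; not shipped with this repository.
pct_delta() {
  awk -v base="$1" -v val="$2" 'BEGIN { printf "%.1f\n", (val - base) / base * 100 }'
}

# tg128 decode throughput from the table above (tokens/sec).
pct_delta 93.15 87.54   # turbo2/turbo2 vs f16/f16 -> -6.0
pct_delta 93.15 85.28   # turbo3/turbo3 vs f16/f16 -> -8.4
```

The same helper works for the prefill columns, for example `pct_delta 2723 2201` for planar4/turbo2 at pp512.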

TriAttention + TurboQuant (llama-server, 8K context)

| Config | 512 tok gen | 2048 tok gen | KV Size | TriA Events |
|---|---|---|---|---|
| Baseline f16 | 88.4 t/s | 88.1 t/s | 256 MiB | 0 |
| FA+Turbo2 | 82.7 t/s | 81.0 t/s | 40 MiB | 0 |
| FA+Turbo2+TriA (budget=2048) | 82.0 t/s | 69.3 t/s | 40 MiB | 72 |
| FA+Turbo3_TCQ+TriA (budget=2048) | 77.6 t/s | 65.7 t/s | 56 MiB | 75 |
| FA+planar4/turbo2+TriA (budget=2048) | 80.3 t/s | 69.3 t/s | 48 MiB | - |

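TriAttention adds ~14% overhead at 2048-token generation (81.0 → 69.3 t/s with Turbo2), which comes from CPU-side scoring; at 512 tokens the cost is under 1% (82.7 → 82.0 t/s). Physical KV compaction is still pending, so the KV size does not shrink yet when TriA is enabled. Once compaction is active, combined TriA+Turbo2 is expected to yield ~37× total KV compression.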

TurboQuant (Google ICLR 2026)

Walsh-Hadamard Transform + Lloyd-Max scalar quantization + QJL correction. Implemented as turbo2, turbo3, turbo4 types with optional TCQ codebooks.

# CLI
llama-cli -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2

# Server
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3 --port 8080

# With TCQ codebooks
TURBO_TCQ_CB2=codebooks/2bit/tcq_2bit_100iter_s99.bin \
TURBO_TCQ_CB=codebooks/3bit/cb_50iter_finetuned.bin \
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo3_tcq -ctv turbo3_tcq --port 8080

# Benchmark all TurboQuant variants
llama-bench -m model.gguf -ngl 99 -fa 1 \
  -ctk f16,turbo2,turbo3,turbo2_tcq,turbo3_tcq \
  -ctv f16,turbo2,turbo3,turbo2_tcq,turbo3_tcq \
  -p 512,2048 -n 128,512

| Type | Bits | Rotation | QJL | TCQ | Decode Speed |
|---|---|---|---|---|---|
| turbo2 | 2-bit | WHT 128×128 | No | No | 87 t/s |
| turbo3 | 3-bit | WHT 128×128 | Yes | No | 85 t/s |
| turbo2_tcq | 2-bit+TCQ | WHT 128×128 | No | Yes | 86 t/s |
| turbo3_tcq | 3-bit+TCQ | WHT 128×128 | Yes | Yes | 82 t/s |

TriAttention (WeianMao/MIT)

Trigonometric frequency-based KV entry scoring and eviction. It reduces the number of KV entries, which is complementary to TurboQuant's per-element compression. Scoring triggers every 128 tokens during decode.

# Step 1: Generate calibration stats (offline, one-time)
python3 triattention_calibrate.py \
  --model Qwen/Qwen3.5-9B \
  --input wikitext.txt \
  --output stats/qwen35_9b.bin

# Step 2: Run with TriAttention
TRIA_STATS_PATH=stats/qwen35_9b.bin \
TRIA_BUDGET_TOKENS=2048 \
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 32768 --port 8080

# Alternative: budget as percentage of context
TRIA_STATS_PATH=stats/qwen35_9b.bin \
TRIA_BUDGET_PCT=50 \
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 32768 --port 8080

| Env Var | Default | Description |
|---|---|---|
| TRIA_STATS_PATH | (required) | Path to calibration .bin file |
| TRIA_BUDGET_TOKENS | | Absolute KV budget in tokens |
| TRIA_BUDGET_PCT | 50 | KV budget as % of context |
| TRIA_RUNTIME_DIVIDE_LENGTH | 128 | Scoring trigger interval |
| TRIA_RUNTIME_WINDOW_SIZE | 128 | Recent tokens always kept |
| TRIA_RAMP_START_PCT | 10 | Exponential ramp start % |
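
To see what a percentage budget means in tokens, and to catch a missing stats file before launching the server, a small wrapper can help. This is a sketch only: `tria_budget_tokens` and `tria_check_stats` are hypothetical helper names, the integer rounding is an assumption, and this README does not say which variable wins when both TRIA_BUDGET_TOKENS and TRIA_BUDGET_PCT are set.

```shell
#!/bin/sh
# tria_budget_tokens CTX PCT
# Token budget implied by PCT percent of a CTX-token context.
# Assumption: the runtime rounds down; the real rounding rule is not documented here.
tria_budget_tokens() {
  echo $(( $1 * $2 / 100 ))
}

# tria_check_stats
# Succeeds only if TRIA_STATS_PATH is set and points at a readable file.
tria_check_stats() {
  if [ -z "${TRIA_STATS_PATH:-}" ]; then
    echo "TRIA_STATS_PATH is not set" >&2
    return 1
  fi
  if [ ! -r "$TRIA_STATS_PATH" ]; then
    echo "cannot read stats file: $TRIA_STATS_PATH" >&2
    return 1
  fi
}

tria_budget_tokens 32768 50   # -> 16384
```

With `-c 32768`, the default TRIA_BUDGET_PCT=50 therefore corresponds to a 16384-token budget under this assumption, eight times larger than the TRIA_BUDGET_TOKENS=2048 used in the examples above.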

PlanarQuant & IsoQuant (RotorQuant/ParaMind2025)

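Block-diagonal rotation KV cache quantization that replaces TurboQuant's full d×d WHT with 2D Givens (PlanarQuant) or 4D quaternion (IsoQuant) rotations. The design goals are fewer parameters, faster prefill, and better PPL than TurboQuant at the same compression. On V100 the measured prefill is slower than turbo3; the kernel note below the table explains why.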

# PlanarQuant 3-bit (best speed)
llama-server -m model.gguf -ngl 99 -fa 1 -ctk planar3 -ctv planar3 --port 8080

# IsoQuant 3-bit (best quality)
llama-server -m model.gguf -ngl 99 -fa 1 -ctk iso3 -ctv iso3 --port 8080

# K-only compression (zero PPL loss)
llama-server -m model.gguf -ngl 99 -fa 1 -ctk planar3 -ctv f16 --port 8080

# Mixed: PlanarQuant K + TurboQuant V
llama-server -m model.gguf -ngl 99 -fa 1 -ctk planar3 -ctv turbo3 --port 8080

| Type | Rotation | Group Size | FMAs (d=128) | Params | Status |
|---|---|---|---|---|---|
| TurboQuant | WHT butterfly | 128 | 16,384 | 16,384 | ✅ V100 tested |
| PlanarQuant | 2D Givens | 2 | 256 | 128 | ✅ V100 tested |
| IsoQuant | 4D quaternion | 4 | 512 | 128 | ✅ V100 tested |
| RotorQuant (Cl(3,0)) | 3D rotor sandwich | 3 | ~2,400 | 372 | Research only (Triton) |
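
The FMA and parameter counts for the first three rows can be reproduced with a simple cost model. The sketch below assumes each group of size g is rotated with a dense g×g matrix-vector product (g² multiply-adds per group), and that a Givens rotation stores 2 values per group while a quaternion stores 4. These are illustrative assumptions that happen to match the table, not a description of the actual kernels. `rotation_fmas` and `rotation_params` are hypothetical helper names. The RotorQuant row does not fit this model and is left out.

```shell
#!/bin/sh
# rotation_fmas D G
# Multiply-adds to rotate a D-dim vector as D/G independent groups,
# each applied as a dense GxG matrix-vector product (assumed cost model).
rotation_fmas() {
  echo $(( ($1 / $2) * $2 * $2 ))
}

# rotation_params D G P
# Stored rotation parameters: D/G groups times P values per group.
rotation_params() {
  echo $(( ($1 / $2) * $3 ))
}

rotation_fmas 128 2      # PlanarQuant, 2D Givens   -> 256
rotation_fmas 128 4      # IsoQuant, 4D quaternion  -> 512
rotation_fmas 128 128    # full 128x128 rotation    -> 16384
rotation_params 128 2 2  # 64 groups x 2 values     -> 128
rotation_params 128 4 4  # 32 groups x 4 values     -> 128
```

Under this model the cost is d × g, so shrinking the group size from 128 to 2 cuts the rotation work by 64×.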

V100 tested: Planar3 K+f16 V (85.3 t/s), planar3 symmetric (78.9 t/s), iso3 symmetric (75.1 t/s).

On V100, planar3/iso3 use native VEC kernel (dequant-in-loop) which is slower than turbo3's dequant-to-f16+MMA path. On RTX 5090 (Blackwell), planar3 (119 t/s) and iso3 (118 t/s) beat turbo3 (93 t/s) by 27-28%.


Combined Compression Stack

TriAttention (entry eviction) and TurboQuant/PlanarQuant/IsoQuant (per-element compression) are orthogonal and stackable:

┌──────────────────────────────────────────────┐
│  TriAttention: score + evict low-value       │
│  KV entries → 10-16× entry reduction         │
├──────────────────────────────────────────────┤
│  PlanarQuant/IsoQuant/TurboQuant: compress   │
│  each remaining entry → 2-3× per-element     │
├──────────────────────────────────────────────┤
│  Total: 20-37× KV compression vs FP16       │
│  (when TriAttention compaction is active)    │
└──────────────────────────────────────────────┘
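
The stacking arithmetic in the diagram is a product of the two factors. As a rough sketch, assume the entry reduction equals the context length divided by the TriAttention token budget. That is a simplification introduced here for illustration; `kv_total_compression` is a hypothetical helper name.

```shell
#!/bin/sh
# kv_total_compression CTX BUDGET PER_ELEMENT
# Combined KV compression: (CTX / BUDGET) entry reduction times the
# per-element compression factor. Assumes eviction keeps exactly BUDGET entries.
kv_total_compression() {
  awk -v ctx="$1" -v budget="$2" -v q="$3" 'BEGIN { printf "%.1f\n", ctx / budget * q }'
}

# 32K context, 2048-token budget, turbo2 at 2.3x per element.
kv_total_compression 32768 2048 2.3   # -> 36.8
```

This gives 16× entry reduction times 2.3×, which is where the upper figure of roughly 37× comes from. The lower figure of 20× corresponds to 10× entry reduction at 2× per element.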

Example: Maximum Compression

# TriAttention + PlanarQuant 3-bit at 32K context
TRIA_STATS_PATH=stats/qwen35_9b.bin TRIA_BUDGET_TOKENS=2048 \
  llama-server -m model.gguf -ngl 99 -fa 1 \
  -ctk planar3 -ctv planar3 -c 32768 --port 8080

Build Configuration

NVIDIA V100 (SM70)

mkdir build && cd build
CUDACXX=/usr/local/cuda/bin/nvcc cmake .. \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=70 \
  -DGGML_CUDA_FORCE_MMQ=OFF \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

NVIDIA A100/H100 (SM80/SM90)

# Use CMAKE_CUDA_ARCHITECTURES=90 for H100
CUDACXX=/usr/local/cuda/bin/nvcc cmake .. \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=80 \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
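
When the same build script is used on several machines, the architecture value can be picked from the GPU name. The following sketch covers only the three GPUs named in this section; `cuda_arch_for` is a hypothetical helper, not part of the repository.

```shell
#!/bin/sh
# cuda_arch_for GPU_NAME
# Maps a GPU name to the CMAKE_CUDA_ARCHITECTURES value used in this README.
# Only V100, A100 and H100 are covered; anything else is reported as unknown.
cuda_arch_for() {
  case "$1" in
    *V100*) echo 70 ;;
    *A100*) echo 80 ;;
    *H100*) echo 90 ;;
    *) echo "unknown GPU: $1" >&2; return 1 ;;
  esac
}

cuda_arch_for "Tesla V100-SXM2-32GB"   # -> 70

# Possible usage on a machine with the NVIDIA driver installed:
#   gpu=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
#   cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="$(cuda_arch_for "$gpu")"
```

Note that the V100 build above also sets GGML_CUDA_FORCE_MMQ=OFF, GGML_CUDA_F16=ON and GGML_CUDA_FORCE_CUBLAS=ON, so the architecture number is not the only difference between the two configurations.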

Apple Silicon (Metal)

cmake .. \
  -DGGML_METAL=ON \
  -DGGML_METAL_EMBED_LIBRARY=ON \
  -DCMAKE_BUILD_TYPE=Release
make -j$(sysctl -n hw.ncpu)   # macOS has no nproc by default

DFlash Speculative Decoding

Status: ⚠️ Runs on V100 but gives no speedup (0.14× throughput, about 7× slower than baseline).

The DFlash block-diffusion drafter was tested with speculative-simple, but the V100 (SM70) lacks the tensor core throughput for efficient draft/verify, and the accept rate was only 8.95%. DFlash is designed for H100/A100 or Apple M-series hardware.


Project Structure

| Path | Description |
|---|---|
| ggml/src/ggml-cuda/triattention-cuda.cu | TriAttention CUDA scoring kernels |
| ggml/src/ggml-cuda/cpy-planar-iso.cu | PlanarQuant/IsoQuant CUDA copy+dequant kernels |
| ggml/src/ggml-planar-quant.c | PlanarQuant 3-bit CPU quantization |
| ggml/src/ggml-iso-quant.c | IsoQuant 3-bit CPU quantization |
| ggml/src/ggml-planar4-quant.c | PlanarQuant 4-bit CPU quantization |
| ggml/src/ggml-iso4-quant.c | IsoQuant 4-bit CPU quantization |
| src/triattention.c/h | TriAttention core scoring math |
| src/triattention-runtime.c/h | TriAttention runtime (scoring/eviction/budget) |
| src/triattention-bridge.cpp | C++ bridge for KV cache access |
| src/triattention-backend.c/h | CUDA backend adapter |
| stats/ | TriAttention calibration stats |
| codebooks/ | TCQ codebooks for TurboQuant |


Built and tested on Tesla V100-SXM2-32GB. All benchmarks use Qwen3.5-9B Q4_K_M unless noted.
