xllama.cpp — Optimized LLM Inference for V100

Experimental fork of llama.cpp via spiritbuun/buun-llama-cpp with TurboQuant, TriAttention, PlanarQuant, and IsoQuant KV cache compression.

Quick Start

# Build for V100 (SM70)
cd build && CUDACXX=/usr/local/cuda/bin/nvcc cmake .. \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=70 \
  -DGGML_CUDA_FORCE_MMQ=OFF \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Run server with Turbo2 KV compression
./build/bin/llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 8192 --port 8080

# Run server with TriAttention + Turbo2 (maximum KV compression)
TRIA_STATS_PATH=stats/qwen35_9b.bin TRIA_BUDGET_TOKENS=2048 \
  ./build/bin/llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 32768 --port 8080

# Benchmark
./build/bin/llama-bench -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -p 512,2048 -n 128,512

# CLI
./build/bin/llama-cli -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 8192 -p "Hello world"

V100 Benchmark Summary — All KV Cache Types

Hardware: Tesla V100-SXM2-32GB | CUDA: 12.6.85 | Build: SM70 | Model: Qwen3.5-9B Q4_K_M (5.28 GiB)

Full Results (llama-bench, 3 repeats, b512)

KV Type	K	V	tg128	tg512	pp512	pp2048	vs f16	KV Compress	Best For
f16/f16	f16	f16	93.15	93.08	2723	2707	—	1.0×	Quality baseline
turbo2/turbo2	turbo2	turbo2	87.54	87.31	2669	2643	-6%	2.3×	Best speed/compression
turbo3/turbo3	turbo3	turbo3	85.28	85.05	2664	2626	-8%	3.3×	Good compression, fast
planar3 K/f16 V	planar3	f16	85.30	84.68	2340	1906	-8%	1.8×	Zero PPL loss, K-only
iso3 K/f16 V	iso3	f16	80.09	79.86	1765	1122	-14%	1.8×	K-only quality
f16 K/planar3 V	f16	planar3	85.01	81.93	2162	1389	-9%	1.8×	V-only compression
f16 K/iso3 V	f16	iso3	84.88	81.90	2156	1381	-9%	1.8×	V-only quality
planar3/planar3	planar3	planar3	78.88	75.75	1980	1212	-15%	9.8×	Max compression
iso3/iso3	iso3	iso3	75.13	72.59	1799	1039	-19%	9.8×	Best PPL/compression
planar4/planar4	planar4	planar4	75.20 ± 0.21	72.28 ± 0.27	1799.95 ± 2.63	1050.46 ± 1.29	-21%	4.0×	Newer, 2D Givens+4b
iso4/iso4	iso4	iso4	73.59 ± 0.21	70.19 ± 0.32	1655.38 ± 6.06	915.08 ± 0.07	-24%	4.0×	Newer, 4D quaternion+4b
planar4 K/f16 V	planar4	f16	81.03 ± 0.06	80.54 ± 0.30	2246.71 ± 4.55	1776.88 ± 0.87	-13%	1.8×	Zero PPL loss, K-only, newer
planar4/turbo2	planar4	turbo2	81.91	81.49	2201	1735	-12%	5.3×	Higher compression than turbo2
planar3/turbo3	planar3	turbo3	81.04	80.94	2306	1879	-13%	5.3×	Higher compression than turbo3
iso4/turbo2	iso4	turbo2	79.84	78.42	1953	1411	-16%	5.3×	Higher compression than turbo2, lower speed

All numbers are tokens/sec ± stddev (see TEST_RESULTS_V100.md for full stats).

Key Takeaways

Turbo2 is the fastest compressed type on V100: 87.5 t/s with 2.3× KV compression
Planar3 K + f16 V matches turbo3 speed (85.3 t/s) but with zero PPL loss — ideal for quality-critical workloads
Iso3 symmetric has the best quality per bit (PPL +4.2% vs turbo3's +6.6%) at 9.8× compression
On V100, planar3/iso3 are slower than turbo3 because they use the native VEC kernel (dequantize-in-loop) while turbo3 dequants to f16 + MMA tensor cores
Asymmetric KV (planar4/turbo2, planar3/turbo3, iso4/turbo2) now supported with flash attention on GPU. Use -ctk planar4 -ctv turbo2 for zero PPL loss in K-only quantization.
planar4/turbo2 and planar3/turbo3 achieve 5.3× KV compression with 12–13% decode speed loss and 19–20% prefill loss vs f16. iso4/turbo2 gives similar compression with ~16% decode and ~36% prefill loss.

TriAttention + TurboQuant (llama-server, 8K context)

Config	512 tok gen	2048 tok gen	KV Size	TriA Events
Baseline f16	88.4 t/s	88.1 t/s	256 MiB	0
FA+Turbo2	82.7 t/s	81.0 t/s	40 MiB	0
FA+Turbo2+TriA (budget=2048)	82.0 t/s	69.3 t/s	40 MiB	72
FA+Turbo3_TCQ+TriA (budget=2048)	77.6 t/s	65.7 t/s	56 MiB	75
FA+planar4/turbo2+TriA (budget=2048)	80.3 t/s	69.3 t/s	48 MiB	-

TriAttention adds ~14% overhead (CPU-side scoring). Physical KV compaction pending — when active, combined TriA+Turbo2 yields ~37× total KV compression.

TurboQuant (Google ICLR 2026)

Walsh-Hadamard Transform + Lloyd-Max scalar quantization + QJL correction. Implemented as turbo2, turbo3, turbo4 types with optional TCQ codebooks.

# CLI
llama-cli -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2

# Server
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3 --port 8080

# With TCQ codebooks
TURBO_TCQ_CB2=codebooks/2bit/tcq_2bit_100iter_s99.bin \
TURBO_TCQ_CB=codebooks/3bit/cb_50iter_finetuned.bin \
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo3_tcq -ctv turbo3_tcq --port 8080

# Benchmark all TurboQuant variants
llama-bench -m model.gguf -ngl 99 -fa 1 \
  -ctk f16,turbo2,turbo3,turbo2_tcq,turbo3_tcq \
  -ctv f16,turbo2,turbo3,turbo2_tcq,turbo3_tcq \
  -p 512,2048 -n 128,512

Type	Bits	Rotation	QJL	TCQ	Decode Speed
turbo2	2-bit	WHT 128×128	No	No	87 t/s
turbo3	3-bit	WHT 128×128	Yes	No	85 t/s
turbo2_tcq	2-bit+TCQ	WHT 128×128	No	Yes	86 t/s
turbo3_tcq	3-bit+TCQ	WHT 128×128	Yes	Yes	82 t/s

TriAttention (WeianMao/MIT)

Trigonometric frequency-based KV entry scoring + eviction. Reduces number of KV entries (complementary to TurboQuant's per-element compression). Scoring triggers every 128 tokens during decode.

# Step 1: Generate calibration stats (offline, one-time)
python3 triattention_calibrate.py \
  --model Qwen/Qwen3.5-9B \
  --input wikitext.txt \
  --output stats/qwen35_9b.bin

# Step 2: Run with TriAttention
TRIA_STATS_PATH=stats/qwen35_9b.bin \
TRIA_BUDGET_TOKENS=2048 \
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 32768 --port 8080

# Alternative: budget as percentage of context
TRIA_STATS_PATH=stats/qwen35_9b.bin \
TRIA_BUDGET_PCT=50 \
llama-server -m model.gguf -ngl 99 -fa 1 -ctk turbo2 -ctv turbo2 -c 32768 --port 8080

Env Var	Default	Description
`TRIA_STATS_PATH`	(required)	Path to calibration `.bin` file
`TRIA_BUDGET_TOKENS`	—	Absolute KV budget in tokens
`TRIA_BUDGET_PCT`	50	KV budget as % of context
`TRIA_RUNTIME_DIVIDE_LENGTH`	128	Scoring trigger interval
`TRIA_RUNTIME_WINDOW_SIZE`	128	Recent tokens always kept
`TRIA_RAMP_START_PCT`	10	Exponential ramp start %

PlanarQuant & IsoQuant (RotorQuant/ParaMind2025)

Block-diagonal rotation KV cache quantization — replaces TurboQuant's full d×d WHT with 2D Givens (PlanarQuant) or 4D quaternion (IsoQuant) rotations. Fewer parameters, faster prefill, better PPL than TurboQuant at same compression.

# PlanarQuant 3-bit (best speed)
llama-server -m model.gguf -ngl 99 -fa 1 -ctk planar3 -ctv planar3 --port 8080

# IsoQuant 3-bit (best quality)
llama-server -m model.gguf -ngl 99 -fa 1 -ctk iso3 -ctv iso3 --port 8080

# K-only compression (zero PPL loss)
llama-server -m model.gguf -ngl 99 -fa 1 -ctk planar3 -ctv f16 --port 8080

# Mixed: PlanarQuant K + TurboQuant V
llama-server -m model.gguf -ngl 99 -fa 1 -ctk planar3 -ctv turbo3 --port 8080

Type	Rotation	Group Size	FMAs (d=128)	Params	Status
TurboQuant	WHT butterfly	128	16,384	16,384	✅ V100 tested
PlanarQuant	2D Givens	2	256	128	✅ V100 tested
IsoQuant	4D quaternion	4	512	128	✅ V100 tested
RotorQuant (Cl(3,0))	3D rotor sandwich	3	~2,400	372	Research only (Triton)

V100 tested: Planar3 K+f16 V (85.3 t/s), planar3 symmetric (78.9 t/s), iso3 symmetric (75.1 t/s).

On V100, planar3/iso3 use native VEC kernel (dequant-in-loop) which is slower than turbo3's dequant-to-f16+MMA path. On RTX 5090 (Blackwell), planar3 (119 t/s) and iso3 (118 t/s) beat turbo3 (93 t/s) by 27-28%.

Combined Compression Stack

TriAttention (entry eviction) and TurboQuant/PlanarQuant/IsoQuant (per-element compression) are orthogonal and stackable:

┌──────────────────────────────────────────────┐
│  TriAttention: score + evict low-value       │
│  KV entries → 10-16× entry reduction         │
├──────────────────────────────────────────────┤
│  PlanarQuant/IsoQuant/TurboQuant: compress   │
│  each remaining entry → 2-3× per-element     │
├──────────────────────────────────────────────┤
│  Total: 20-37× KV compression vs FP16       │
│  (when TriAttention compaction is active)    │
└──────────────────────────────────────────────┘

Example: Maximum Compression

# TriAttention + PlanarQuant 3-bit at 32K context
TRIA_STATS_PATH=stats/qwen35_9b.bin TRIA_BUDGET_TOKENS=2048 \
  llama-server -m model.gguf -ngl 99 -fa 1 \
  -ctk planar3 -ctv planar3 -c 32768 --port 8080

Build Configuration

NVIDIA V100 (SM70)

mkdir build && cd build
CUDACXX=/usr/local/cuda/bin/nvcc cmake .. \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=70 \
  -DGGML_CUDA_FORCE_MMQ=OFF \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_FORCE_CUBLAS=ON \
  -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

NVIDIA A100/H100 (SM80/SM90)

CUDACXX=/usr/local/cuda/bin/nvcc cmake .. \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=80 \   # or 90 for H100
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

Apple Silicon (Metal)

cmake .. \
  -DGGML_METAL=ON \
  -DGGML_METAL_EMBED_LIBRARY=ON \
  -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

DFlash Speculative Decoding

Status: ⚠️ Works but no speedup on V100 (0.14× — 7× slower than baseline).

The DFlash block-diffusion drafter was tested with speculative-simple but the V100's SM70 lacks tensor core throughput for efficient draft/verify. Accept rate was only 8.95%. DFlash is designed for H100/A100 or Apple M-series hardware.

Project Structure

Path	Description
`ggml/src/ggml-cuda/triattention-cuda.cu`	TriAttention CUDA scoring kernels
`ggml/src/ggml-cuda/cpy-planar-iso.cu`	PlanarQuant/IsoQuant CUDA copy+dequant kernels
`ggml/src/ggml-planar-quant.c`	PlanarQuant 3-bit CPU quantization
`ggml/src/ggml-iso-quant.c`	IsoQuant 3-bit CPU quantization
`ggml/src/ggml-planar4-quant.c`	PlanarQuant 4-bit CPU quantization
`ggml/src/ggml-iso4-quant.c`	IsoQuant 4-bit CPU quantization
`src/triattention.c/h`	TriAttention core scoring math
`src/triattention-runtime.c/h`	TriAttention runtime (scoring/eviction/budget)
`src/triattention-bridge.cpp`	C++ bridge for KV cache access
`src/triattention-backend.c/h`	CUDA backend adapter
`stats/`	TriAttention calibration stats
`codebooks/`	TCQ codebooks for TurboQuant

Links

Repo: https://github.com/xeonai44/xllama.cpp
Upstream: https://github.com/spiritbuun/buun-llama-cpp
TurboQuant Paper: https://arxiv.org/abs/2504.19874 (ICLR 2026)
TriAttention Paper: https://arxiv.org/abs/2503.20181 (WeianMao et al.)
RotorQuant Paper: https://www.scrya.com/rotorquant.pdf (Pope, March 2026)
PlanarQuant/IsoQuant: https://github.com/ParaMind2025/isoquant
Calibration Tool: triattention_calibrate.py in repo root

Built and tested on Tesla V100-SXM2-32GB. All benchmarks use Qwen3.5-9B Q4_K_M unless noted.

Name		Name	Last commit message	Last commit date
Latest commit History 8,926 Commits
.devops		.devops
.gemini		.gemini
.github		.github
benches		benches
ci		ci
cmake		cmake
codebooks		codebooks
common		common
docs		docs
examples		examples
ggml		ggml
gguf-py		gguf-py
grammars		grammars
include		include
licenses		licenses
media		media
models		models
pocs		pocs
requirements		requirements
scripts		scripts
src		src
tests		tests
tools		tools
vendor		vendor
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.dockerignore		.dockerignore
.ecrc		.ecrc
.editorconfig		.editorconfig
.flake8		.flake8
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
AUTHORS		AUTHORS
BENCHMARK_ASYMMETRIC_TRIANG_2026-04-27.md		BENCHMARK_ASYMMETRIC_TRIANG_2026-04-27.md
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
CODEOWNERS		CODEOWNERS
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
TESTS.md		TESTS.md
TEST_RESULTS.md		TEST_RESULTS.md
TEST_RESULTS_V100.md		TEST_RESULTS_V100.md
V100_BENCHMARK_SUMMARY.md		V100_BENCHMARK_SUMMARY.md
apply_turbo_cuda_v2.py		apply_turbo_cuda_v2.py
build-xcframework.sh		build-xcframework.sh
buunslamma.png		buunslamma.png
convert_hf_to_gguf.py		convert_hf_to_gguf.py
convert_hf_to_gguf_update.py		convert_hf_to_gguf_update.py
convert_llama_ggml_to_gguf.py		convert_llama_ggml_to_gguf.py
convert_lora_to_gguf.py		convert_lora_to_gguf.py
flake.lock		flake.lock
flake.nix		flake.nix
mypy.ini		mypy.ini
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt
servers		servers
ty.toml		ty.toml
xllama_banner_orange_v2.png		xllama_banner_orange_v2.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

xllama.cpp — Optimized LLM Inference for V100

Quick Start

V100 Benchmark Summary — All KV Cache Types

Full Results (llama-bench, 3 repeats, b512)

Key Takeaways

TriAttention + TurboQuant (llama-server, 8K context)

TurboQuant (Google ICLR 2026)

TriAttention (WeianMao/MIT)

PlanarQuant & IsoQuant (RotorQuant/ParaMind2025)

Combined Compression Stack

Example: Maximum Compression

Build Configuration

NVIDIA V100 (SM70)

NVIDIA A100/H100 (SM80/SM90)

Apple Silicon (Metal)

DFlash Speculative Decoding

Project Structure

Links

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

xllama.cpp — Optimized LLM Inference for V100

Quick Start

V100 Benchmark Summary — All KV Cache Types

Full Results (llama-bench, 3 repeats, b512)

Key Takeaways

TriAttention + TurboQuant (llama-server, 8K context)

TurboQuant (Google ICLR 2026)

TriAttention (WeianMao/MIT)

PlanarQuant & IsoQuant (RotorQuant/ParaMind2025)

Combined Compression Stack

Example: Maximum Compression

Build Configuration

NVIDIA V100 (SM70)

NVIDIA A100/H100 (SM80/SM90)

Apple Silicon (Metal)

DFlash Speculative Decoding

Project Structure

Links

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages