# Multi-TurboQuant

**Unified KV cache compression toolkit for LLM inference.**
10 methods. 16 presets. GPU-validated. One API.



## What Is This

A Python toolkit that compresses the KV cache in large language models. The KV cache is the #1 memory bottleneck during inference — a 32B model at 32K context uses 8+ GB just for the cache. This library gives you 10 different ways to compress it, all under one API.
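The 8+ GB figure follows from simple per-token arithmetic. A back-of-envelope sketch (the layer/head counts below are typical for a 32B-class GQA model, not taken from any specific checkpoint):

```python
def kv_cache_gib(context_len, layers=64, kv_heads=8, head_dim=128, bytes_per_elt=2):
    """fp16 KV cache size: one K and one V vector per layer, per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elt  # K + V
    return context_len * per_token / 2**30

print(f"{kv_cache_gib(32768):.1f} GiB at 32K context")  # 8.0 GiB at 32K context
```

At 256 KiB of fp16 KV per token, every doubling of context or agent count doubles this number, which is why compression ratios of 4-7x matter so much.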

Install it, pick a preset, and get the exact launch command for llama.cpp or vLLM with optimal compression. Or use it directly in your own inference code.

```bash
git clone https://github.com/rookiemann/multi-turboquant
cd multi-turboquant
pip install -e .
python run_ui.py
```

Four lines. Opens a browser dashboard. See your GPUs, benchmark methods, plan deployments, generate commands.

## Methods

| Method | Family | Transform | Bits | Compression | Calibration | Speed Impact |
|---|---|---|---|---|---|---|
| `turbo2` | TurboQuant | Walsh-Hadamard 128-d | 2.25 | 7.1x | Required | -3% |
| `turbo3` | TurboQuant | Walsh-Hadamard 128-d | 3.25 | 4.9x | Required | -5% |
| `turbo4` | TurboQuant | Walsh-Hadamard 128-d | 4.25 | 3.8x | Required | -4% |
| `turbo2_tcq` | TCQ | WHT + Viterbi trellis | 2.25 | 7.1x | Required | -3% |
| `turbo3_tcq` | TCQ | WHT + Viterbi trellis | 3.25 | 4.9x | Required | -5% |
| `iso3` | IsoQuant | Quaternion 4D rotation | 3.25 | 4.9x | No | ~0% |
| `iso4` | IsoQuant | Quaternion 4D rotation | 4.25 | 3.8x | No | ~0% |
| `planar3` | PlanarQuant | Givens 2D rotation | 3.25 | 4.9x | No | -1% |
| `planar4` | PlanarQuant | Givens 2D rotation | 4.25 | 3.8x | No | ~0% |
| `triattention` | TriAttention | DFT token eviction | 16 | 10-16x | Required | Varies |

Combined mode (unique to this repo): Token eviction + quantization together. Evict unimportant tokens, compress the survivors. ~80x total KV reduction.
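The ~80x figure is multiplicative: eviction and quantization compound. With illustrative numbers (TriAttention at the 16x end of its range, a turbo3-class quantizer on what survives):

```python
eviction_ratio = 16   # TriAttention keeps roughly 1 token in 16 (upper end of 10-16x)
quant_ratio = 4.9     # turbo3_tcq applied to the surviving tokens
print(f"~{eviction_ratio * quant_ratio:.0f}x total KV reduction")  # ~78x total KV reduction
```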

All 10 methods run on GPU through our code. No upstream forks needed.
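To make the table concrete, here is a minimal NumPy sketch of the rotate-then-quantize idea behind the TurboQuant family: an orthonormal Walsh-Hadamard rotation spreads outlier energy across the 128-d head dimension before uniform quantization, and is undone exactly on decode. This illustrates the principle only; it is not the library's actual kernel.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quant_roundtrip(x, bits):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
k = rng.standard_normal(128)
k[5] = 12.0                                       # outlier channel, typical of KV activations

H = hadamard(128)
naive = quant_roundtrip(k, bits=4)                # outlier dominates the quantization scale
rotated = H.T @ quant_roundtrip(H @ k, bits=4)    # rotate, quantize, rotate back

print(f"naive cos={cos(k, naive):.3f}  rotated cos={cos(k, rotated):.3f}")
```

The rotated path reconstructs with visibly higher cosine similarity at the same bit budget, which is the whole point of the transform column above.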

## GPU-Validated Results

Every method was tested on an RTX 3090 with real CUDA tensors running our code:

| Method | Cosine Similarity | Compression | GPU Verified |
|---|---|---|---|
| `turbo2` | 0.9420 | 5.8x | ✓ |
| `turbo3` | 0.9817 | 4.0x | ✓ |
| `turbo4` | 0.9947 | 3.2x | ✓ |
| `turbo3_tcq` | 0.9817 | 4.0x | ✓ |
| `iso3` | 0.9783 | 4.7x | ✓ |
| `iso4` | 0.9951 | 3.7x | ✓ |
| `planar3` | 0.9783 | 4.7x | ✓ |
| `planar4` | 0.9952 | 3.7x | ✓ |
| TriAttn + `iso3` | 0.9782 | 9.5x | ✓ |

## Tests

77 automated tests: 68 CPU + 9 GPU.

| Suite | Tests | What It Proves |
|---|---|---|
| `test_methods.py` | 37 | All 10 methods encode/decode, config, presets, integration |
| `test_integration.py` | 31 | Vectorized kernels, paged KV cache, dispatch, TriAttention composition |
| `test_gpu.py` | 9 | Real GPU inference, calibration generation, hardware detection |
```bash
pytest tests/                              # all 77 tests
pytest tests/ --ignore=tests/test_gpu.py   # CPU only (68 tests)
```

## Quick Start

### Pick a preset

```python
from multi_turboquant import get_preset

config = get_preset("balanced")       # turbo3_tcq symmetric, 5x
config = get_preset("k_only_iso")     # ISO3 K-only, zero speed cost, no calibration
config = get_preset("extreme")        # TriAttention + turbo3_tcq, ~80x
config = get_preset("agents_8x16k")   # 8 agents at 16K context
```

### Generate a llama.cpp command

```python
from multi_turboquant.integration import get_llamacpp_command

cmd = get_llamacpp_command(
    config,
    model_path="/opt/models/model.gguf",
    port=8080,
    tensor_split="24,12",    # dual GPU
    parallel_slots=8,        # 8 concurrent agents
)
# llama-server --model ... --cache-type-k turbo3_tcq --cache-type-v turbo3_tcq
#   -fa on -c 131072 --tensor-split 24,12 --parallel 8
```

### Plan a multi-agent deployment

```python
from multi_turboquant import plan_agents

result = plan_agents(
    gpus=[{"name": "RTX 3090", "vram_gb": 24}, {"name": "RTX 3060", "vram_gb": 12}],
    model_params_b=32,
    model_quant="Q4_K_M",
    desired_agents=8,
    desired_context=16384,
)
result.print_report()
# Preset: turbo4 | 8 agents at 16K | KV: 8.5 GB | Headroom: 9 GB
```

### Compress tensors directly

```python
import torch
from multi_turboquant import compress, decompress, CacheConfig, CacheMethod

config = CacheConfig(k_method=CacheMethod.ISO3, v_method=CacheMethod.FP16)
keys = torch.randn(32, 8, 128, device="cuda")
compressed = compress(keys, config, which="k")
reconstructed = decompress(compressed)
# cosine similarity > 0.97
```
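The `cosine similarity > 0.97` claim above can be checked with a small helper. A generic sketch (plain NumPy so it runs anywhere; with the library you would compare `keys` against `reconstructed` the same way):

```python
import numpy as np

def mean_cosine(original, reconstructed):
    """Mean per-vector cosine similarity along the last (head_dim) axis."""
    a = np.asarray(original, dtype=np.float64).reshape(-1, original.shape[-1])
    b = np.asarray(reconstructed, dtype=np.float64).reshape(-1, reconstructed.shape[-1])
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float((num / den).mean())

x = np.random.default_rng(0).standard_normal((32, 8, 128))
print(round(mean_cosine(x, x), 6))  # 1.0 for a perfect reconstruction
print(mean_cosine(x, x + 0.05 * np.random.default_rng(1).standard_normal(x.shape)))
```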

### Detect hardware

```python
from multi_turboquant.hardware import detect_platform
from multi_turboquant.compatibility import check_config, get_recommended_config

platform = detect_platform()
print(platform.summary())
# NVIDIA: all 10 methods | AMD: iso/planar only | Mac: iso/planar only

config = get_recommended_config(platform)
issues = check_config(config, platform)
```

## Presets

| Preset | Config | Use Case |
|---|---|---|
| `k_only_iso` | K=iso3, V=f16 | Zero speed cost, no calibration |
| `balanced` | turbo3_tcq symmetric | Best quality at 5x |
| `speed` | turbo3 symmetric | Fastest on Ampere |
| `quality` | turbo4 symmetric | Near-lossless 3.8x |
| `max_compression` | turbo2_tcq symmetric | Maximum 7x |
| `extreme` | turbo3_tcq + TriAttention | ~80x total reduction |
| `agents_8x16k` | turbo4 symmetric | 8 agents at 16K context |
| `agents_4x8k_70b` | turbo4 symmetric | 4 agents on 70B model |
| `no_calibration_symmetric` | iso3 symmetric | No setup needed |

Full list: 16 presets.

## Capacity Planner

```bash
python scripts/plan_and_launch.py --model 32 --agents 8 --context 16384 --gpus 24 12
```

Works with any number of GPUs. Auto-detects NVIDIA, AMD, Apple Silicon. Generates the exact launch command with tensor-split and parallel flags.
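Under the hood, the planning problem is simple arithmetic: compressed KV for all agents must fit in whatever VRAM remains after the model weights. A back-of-envelope sketch (the 19 GiB weight figure and 256 KiB/token fp16 KV rate are illustrative for a 32B Q4_K_M model, not the planner's internal constants):

```python
def plan_fits(vram_gb, weights_gib, agents, context, kv_gib_per_token, compression):
    """Rough feasibility check for a multi-agent deployment."""
    kv_total = agents * context * kv_gib_per_token / compression
    headroom = sum(vram_gb) - weights_gib - kv_total
    return kv_total, headroom

per_token = 256 / 2**20  # GiB: fp16 KV for a 64-layer, 8-KV-head, 128-dim model
kv, free = plan_fits([24, 12], weights_gib=19, agents=8, context=16384,
                     kv_gib_per_token=per_token, compression=3.8)  # turbo4-class ratio
print(f"KV: {kv:.1f} GiB | Headroom: {free:.1f} GiB")  # KV: 8.4 GiB | Headroom: 8.6 GiB
```

These rough numbers line up with the `plan_agents` report in the Quick Start; the real planner additionally accounts for per-GPU splits and runtime overhead.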

## Calibration

TurboQuant/TCQ methods need a one-time calibration from the model's safetensors weights:

```bash
mtq-calibrate /path/to/model-safetensors --recipe turbo3
# Generates turboquant_kv.json (~200 KB, ~30 seconds)
```

IsoQuant and PlanarQuant need no calibration; they just work.

## Platform Support

| Platform | Methods Available | Engine |
|---|---|---|
| Linux + NVIDIA | All 10 | llama.cpp + vLLM |
| Windows + NVIDIA | All 10 | llama.cpp + vLLM |
| Linux + AMD (ROCm) | iso/planar (4) | llama.cpp |
| macOS + Apple Silicon | iso/planar (4) | llama.cpp (Metal) |
| Any (CPU) | All 10 | Library only |

## Web Dashboard

```bash
python run_ui.py
```

Browser-based UI for exploring methods, running benchmarks, planning deployments, and generating commands. No dependencies beyond the library itself.

## Architecture

```text
multi_turboquant/
  config.py              CacheConfig, CacheMethod, 12 cache types
  registry.py            Method registration and discovery
  presets.py             16 named presets + auto-recommend
  planner.py             Multi-agent capacity planning, any GPU count
  hardware.py            GPU auto-detection (NVIDIA, AMD, Metal)
  compatibility.py       Method/platform compatibility checks
  methods/               5 method families, all with encode/decode
  kernels/triton/        Attention backend, vectorized encode, dispatch
  calibration/           Weight-norm analysis, frequency stats, auto-calibrate
  integration/           llama.cpp flags, vLLM patch, bridge adapter
  benchmark/             Head-to-head comparison, perplexity, VRAM profiling
```

## Documentation

Full manual with 23 chapters: [docs/manual.md](docs/manual.md)

## Attribution

This project reimplements algorithms from published research. All original repos are MIT or Apache-2.0 licensed:

| Contribution | Source |
|---|---|
| Walsh-Hadamard KV compression | TheTom/llama-cpp-turboquant |
| Trellis Coded Quantization | spiritbuun/buun-llama-cpp |
| IsoQuant / PlanarQuant | scrya-com/rotorquant (ParaMind2025) |
| CUDA + Metal kernels | johndpope/llama-cpp-turboquant |
| TriAttention token eviction | WeianMao/triattention |

We reimplemented the algorithms in Python. Credit goes to these authors for the mathematical ideas.

## License

MIT
