Releases · quantumaikr/quant.cpp
v0.2.0
Full Changelog: v0.1.0...v0.2.0
v0.1.0 — Multi-Architecture Engine with KV Cache Compression
TurboQuant.cpp v0.1.0 — First Release
Multi-architecture LLM inference engine in pure C with KV cache compression.
Highlights
- 3 models supported: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- 3.8x KV cache compression — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
- llama.cpp parity: 51 tok/s single-threaded vs llama.cpp's 50.7 tok/s
- Multi-shard safetensors: loads sharded models (Gemma 4B = 2 shards)
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect
- TQM format: pre-quantized mmap binary, instant loading (see the loader sketch after this list)
- Zero dependencies: libc only, ~1MB binary
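
The instant-loading claim for the TQM format comes down to mmap: map the pre-quantized file read-only and let the kernel page tensor data in on demand. Below is a minimal sketch of that idea; the `tqm_header` layout, magic value, and function name are illustrative assumptions, not the actual on-disk TQM format.

```c
/* Minimal sketch of mmap-based model loading (the idea behind "instant
 * loading"). Header layout and names are assumptions for illustration. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {            /* hypothetical header layout */
    uint32_t magic;         /* e.g. 'T','Q','M','1' */
    uint32_t n_tensors;
    uint64_t data_offset;   /* byte offset of the first tensor blob */
} tqm_header;

static const void *tqm_open(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }

    /* The kernel pages tensor data in lazily, so "loading" is near-instant
     * and the read-only mapping can be shared across processes. */
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                         /* mapping stays valid after close */
    if (base == MAP_FAILED) { perror("mmap"); return NULL; }

    *size_out = (size_t)st.st_size;
    return base;
}

/* Usage: size_t sz; const tqm_header *h = (const tqm_header *)tqm_open("model.tqm", &sz); */
```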
Supported Models
| Model | Speed (Q4, 6 threads) | Quality check |
|---|---|---|
| Gemma 3 4B | 5.2 tok/s | "capital of France" → "Paris" |
| Qwen3.5-0.8B | 82 tok/s | 0.999 cosine similarity vs PyTorch |
| Gemma 3 270M | 176 tok/s | per-layer exact match |
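
The 0.999 figure above refers to cosine similarity against PyTorch reference outputs. A minimal sketch of that kind of parity check is shown below; the function name and threshold are illustrative, not the repository's actual test harness.

```c
/* Cosine similarity between two logit vectors, the metric behind the
 * "0.999 cosine vs PyTorch" check. Names and threshold are illustrative. */
#include <math.h>
#include <stddef.h>

static double cosine_similarity(const float *a, const float *b, size_t n) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < n; ++i) {
        dot += (double)a[i] * b[i];
        na  += (double)a[i] * a[i];
        nb  += (double)b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb) + 1e-12);  /* epsilon guards zero vectors */
}

/* Usage: assert(cosine_similarity(c_logits, torch_logits, vocab_size) > 0.999); */
```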
KV Cache Memory Savings
Gemma 3 4B at 32K context:
| Engine | KV cache size |
|---|---|
| llama.cpp (FP16 KV) | 4,352 MB |
| TurboQuant (Q4 KV) | 1,156 MB (3.8x compression) |
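
These totals are consistent with a straightforward size calculation. The sketch below assumes Gemma 3 4B uses 34 layers, 4 KV heads, and head dim 256, and that Q4 KV spends 4.25 bits per value (4-bit values in blocks of 64 with an fp16 scale); those parameters are assumptions for illustration, only the MB totals come from the release.

```c
/* Back-of-the-envelope check of the KV cache numbers above. Model geometry
 * and Q4 block layout are assumptions made for illustration. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t layers = 34, kv_heads = 4, head_dim = 256, ctx = 32768;
    const uint64_t elems = 2ULL * layers * kv_heads * head_dim * ctx;   /* K + V */

    double fp16_mb = (double)elems * 2.0 / (1024.0 * 1024.0);           /* 16 bits/value   */
    double q4_mb   = (double)elems * (4.25 / 8.0) / (1024.0 * 1024.0);  /* 4 bits + scale  */

    printf("FP16 KV: %.0f MB, Q4 KV: %.0f MB, ratio %.1fx\n",
           fp16_mb, q4_mb, fp16_mb / q4_mb);   /* -> 4352 MB, 1156 MB, 3.8x */
    return 0;
}
```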
Quick Start
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
What's Inside
- 9,000+ lines of pure C — complete inference engine
- 8 quantization types, including Uniform, Mixed, PolarQuant, QJL, and TurboQuant
- Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
- Q4 weight quantization with NEON 2-row batching + thread pool
- Integer Q4×Q8 attention via ARM vdotq_s32 (see the dot-product sketch after this list)
- 20 test suites, 70+ tests
- Python bindings (ctypes), llama.cpp/vLLM integration stubs
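
The integer attention path relies on the ARMv8.2 dotprod extension. Below is a minimal sketch of a Q4×Q8 block dot product built on vdotq_s32; the block structs, the 32-element group size, and the offset-by-8 nibble convention are assumptions for illustration, not the repository's actual layout. Compile with `-march=armv8.2-a+dotprod`.

```c
/* Sketch of an integer Q4xQ8 dot product using vdotq_s32. Block layout is a
 * hypothetical example: low nibbles hold elements 0..15, high nibbles 16..31,
 * values stored offset by 8 so they unpack to signed int8. */
#include <arm_neon.h>
#include <stdint.h>

#define QK 32  /* elements per quantization block (assumed) */

typedef struct { float scale; uint8_t nibbles[QK / 2]; } block_q4;  /* hypothetical */
typedef struct { float scale; int8_t  vals[QK];        } block_q8;  /* hypothetical */

/* Dot product of n elements stored as n/QK blocks of Q4 weights and Q8 activations. */
static float dot_q4_q8(const block_q4 *x, const block_q8 *y, int n) {
    float sum = 0.0f;
    for (int b = 0; b < n / QK; ++b) {
        /* Unpack 32 packed 4-bit values into two int8x16 vectors. */
        uint8x16_t packed = vld1q_u8(x[b].nibbles);
        int8x16_t lo = vsubq_s8(vreinterpretq_s8_u8(vandq_u8(packed, vdupq_n_u8(0x0F))),
                                vdupq_n_s8(8));
        int8x16_t hi = vsubq_s8(vreinterpretq_s8_u8(vshrq_n_u8(packed, 4)),
                                vdupq_n_s8(8));

        int8x16_t y0 = vld1q_s8(y[b].vals);       /* activations 0..15  */
        int8x16_t y1 = vld1q_s8(y[b].vals + 16);  /* activations 16..31 */

        /* Each vdotq_s32 folds 16 int8 products into four int32 lanes. */
        int32x4_t acc = vdupq_n_s32(0);
        acc = vdotq_s32(acc, lo, y0);
        acc = vdotq_s32(acc, hi, y1);

        sum += x[b].scale * y[b].scale * (float)vaddvq_s32(acc);
    }
    return sum;
}
```

The work stays in integer registers until the single per-block scale, which is what makes the dotprod path cheap relative to widening everything to float first.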
References
- TurboQuant (ICLR 2026) — KV cache compression
- QJL (AAAI 2025) — 1-bit quantized JL transform
- PolarQuant (AISTATS 2026) — Polar coordinate quantization