Releases · quantumaikr/quant.cpp
v0.2.0
Full Changelog: v0.1.0...v0.2.0
v0.1.0 — Multi-Architecture Engine with KV Cache Compression
TurboQuant.cpp v0.1.0 — First Release
Multi-architecture LLM inference engine in pure C with KV cache compression.
Highlights
- 3 models supported: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- 3.8x KV cache compression — at 32K context: 1.2 GB vs llama.cpp's 4.4 GB
- llama.cpp parity: 51 tok/s single-threaded vs llama.cpp's 50.7 tok/s
- Multi-shard safetensors: loads sharded models (Gemma 4B = 2 shards)
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece auto-detect
- TQM format: pre-quantized mmap binary, instant loading (see the loader sketch after this list)
- Zero dependencies: libc only, ~1MB binary
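
The instant-loading claim for the TQM format comes down to mmap: map the pre-quantized file read-only and let the kernel page tensor data in on demand. Below is a minimal sketch of that idea; the `tqm_header` layout, magic value, and function name are illustrative assumptions, not the actual on-disk TQM format.

```c
/* Minimal sketch of mmap-based model loading (the idea behind "instant
 * loading"). Header layout and names are assumptions for illustration. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {            /* hypothetical header layout */
    uint32_t magic;         /* e.g. 'T','Q','M','1' */
    uint32_t n_tensors;
    uint64_t data_offset;   /* byte offset of the first tensor blob */
} tqm_header;

static const void *tqm_open(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }

    /* The kernel pages tensor data in lazily, so "loading" is near-instant
     * and the read-only mapping can be shared across processes. */
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                         /* mapping stays valid after close */
    if (base == MAP_FAILED) { perror("mmap"); return NULL; }

    *size_out = (size_t)st.st_size;
    return base;
}

/* Usage: size_t sz; const tqm_header *h = (const tqm_header *)tqm_open("model.tqm", &sz); */
```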
Supported Models
| Model | Speed (Q4, 6 threads) | Quality check |
|---|---|---|
| Gemma 3 4B | 5.2 tok/s | "capital of France" → "Paris" |
| Qwen3.5-0.8B | 82 tok/s | 0.999 cosine similarity vs PyTorch |
| Gemma 3 270M | 176 tok/s | per-layer exact match |
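
The 0.999 figure above refers to cosine similarity against PyTorch reference outputs. A minimal sketch of that kind of parity check is shown below; the function name and threshold are illustrative, not the repository's actual test harness.

```c
/* Cosine similarity between two logit vectors, the metric behind the
 * "0.999 cosine vs PyTorch" check. Names and threshold are illustrative. */
#include <math.h>
#include <stddef.h>

static double cosine_similarity(const float *a, const float *b, size_t n) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < n; ++i) {
        dot += (double)a[i] * b[i];
        na  += (double)a[i] * a[i];
        nb  += (double)b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb) + 1e-12);  /* epsilon guards zero vectors */
}

/* Usage: assert(cosine_similarity(c_logits, torch_logits, vocab_size) > 0.999); */
```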
KV Cache Memory Savings
Gemma 3 4B at 32K context:
| Engine | KV cache size |
|---|---|
| llama.cpp (FP16 KV) | 4,352 MB |
| TurboQuant (Q4 KV) | 1,156 MB (3.8x compression) |
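
These totals are consistent with a straightforward size calculation. The sketch below assumes Gemma 3 4B uses 34 layers, 4 KV heads, and head dim 256, and that Q4 KV spends 4.25 bits per value (4-bit values in blocks of 64 with an fp16 scale); those parameters are assumptions for illustration, only the MB totals come from the release.

```c
/* Back-of-the-envelope check of the KV cache numbers above. Model geometry
 * and Q4 block layout are assumptions made for illustration. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t layers = 34, kv_heads = 4, head_dim = 256, ctx = 32768;
    const uint64_t elems = 2ULL * layers * kv_heads * head_dim * ctx;   /* K + V */

    double fp16_mb = (double)elems * 2.0 / (1024.0 * 1024.0);           /* 16 bits/value   */
    double q4_mb   = (double)elems * (4.25 / 8.0) / (1024.0 * 1024.0);  /* 4 bits + scale  */

    printf("FP16 KV: %.0f MB, Q4 KV: %.0f MB, ratio %.1fx\n",
           fp16_mb, q4_mb, fp16_mb / q4_mb);   /* -> 4352 MB, 1156 MB, 3.8x */
    return 0;
}
```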
Quick Start
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
What's Inside
- 9,000+ lines of pure C — complete inference engine
- 8 quantization types, including Uniform, Mixed, PolarQuant, QJL, and TurboQuant
- Architecture dispatch: Qwen3.5 (DeltaNet + Attention) + Gemma 3 (Sliding Window + GQA)
- Q4 weight quantization with NEON 2-row batching + thread pool
- Integer Q4×Q8 attention via ARM vdotq_s32 (see the dot-product sketch after this list)
- 20 test suites, 70+ tests
- Python bindings (ctypes), llama.cpp/vLLM integration stubs
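
The integer attention path relies on the ARMv8.2 dotprod extension. Below is a minimal sketch of a Q4×Q8 block dot product built on vdotq_s32; the block structs, the 32-element group size, and the offset-by-8 nibble convention are assumptions for illustration, not the repository's actual layout. Compile with `-march=armv8.2-a+dotprod`.

```c
/* Sketch of an integer Q4xQ8 dot product using vdotq_s32. Block layout is a
 * hypothetical example: low nibbles hold elements 0..15, high nibbles 16..31,
 * values stored offset by 8 so they unpack to signed int8. */
#include <arm_neon.h>
#include <stdint.h>

#define QK 32  /* elements per quantization block (assumed) */

typedef struct { float scale; uint8_t nibbles[QK / 2]; } block_q4;  /* hypothetical */
typedef struct { float scale; int8_t  vals[QK];        } block_q8;  /* hypothetical */

/* Dot product of n elements stored as n/QK blocks of Q4 weights and Q8 activations. */
static float dot_q4_q8(const block_q4 *x, const block_q8 *y, int n) {
    float sum = 0.0f;
    for (int b = 0; b < n / QK; ++b) {
        /* Unpack 32 packed 4-bit values into two int8x16 vectors. */
        uint8x16_t packed = vld1q_u8(x[b].nibbles);
        int8x16_t lo = vsubq_s8(vreinterpretq_s8_u8(vandq_u8(packed, vdupq_n_u8(0x0F))),
                                vdupq_n_s8(8));
        int8x16_t hi = vsubq_s8(vreinterpretq_s8_u8(vshrq_n_u8(packed, 4)),
                                vdupq_n_s8(8));

        int8x16_t y0 = vld1q_s8(y[b].vals);       /* activations 0..15  */
        int8x16_t y1 = vld1q_s8(y[b].vals + 16);  /* activations 16..31 */

        /* Each vdotq_s32 folds 16 int8 products into four int32 lanes. */
        int32x4_t acc = vdupq_n_s32(0);
        acc = vdotq_s32(acc, lo, y0);
        acc = vdotq_s32(acc, hi, y1);

        sum += x[b].scale * y[b].scale * (float)vaddvq_s32(acc);
    }
    return sum;
}
```

The work stays in integer registers until the single per-block scale, which is what makes the dotprod path cheap relative to widening everything to float first.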
References
- TurboQuant (ICLR 2026) — KV cache compression
- QJL (AAAI 2025) — 1-bit quantized JL transform
- PolarQuant (AISTATS 2026) — Polar coordinate quantization