quantumaikr/quant.cpp

quant.cpp

Embeddable LLM inference in pure C.

33K LOC. No external libraries. Read it in an afternoon.

What quant.cpp does

~4x longer context on the same hardware. KV cache compression reduces per-token memory by 3.8x, extending context proportionally.

| Hardware | Model | FP16 KV | 4-bit K + Q4 V | Gain |
|---|---|---|---|---|
| 8GB Laptop | Llama 8B (Q4) | ~16K tokens | ~61K tokens | 3.8x |
| 16GB Mac Air | SmolLM2 1.7B | ~78K tokens | ~298K tokens | 3.8x |
| 24GB RTX 3090 | Llama 8B (Q4) | ~147K tokens | ~559K tokens | 3.8x |

Estimates based on KV memory reduction. Actual context depends on available memory after model weights.
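
The arithmetic behind these estimates can be sketched in a few lines of C. This is an illustration, not code from the repository: the function names are hypothetical, the model shape (32 layers, 8 KV heads, head dimension 128) assumes a Llama-3-8B-style configuration, and the 2 GiB KV budget is an assumed figure for what remains after model weights on an 8GB machine.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of FP16 KV cache per token: one key and one value vector per layer,
 * each n_kv_heads * head_dim elements at 2 bytes apiece.
 * (Hypothetical helper, not part of quant.cpp.) */
static uint64_t kv_bytes_per_token_fp16(uint32_t n_layers, uint32_t n_kv_heads,
                                        uint32_t head_dim)
{
    return 2ull * n_layers * n_kv_heads * head_dim * 2ull;
}

/* Tokens that fit in budget_bytes when the cache is `ratio` times smaller
 * than FP16 (ratio = 1.0 means an uncompressed FP16 cache). */
static uint64_t max_context_tokens(uint64_t budget_bytes,
                                   uint64_t fp16_bytes_per_token, double ratio)
{
    assert(fp16_bytes_per_token > 0 && ratio > 0.0);
    return (uint64_t)((double)budget_bytes * ratio / (double)fp16_bytes_per_token);
}
```

Under these assumptions, `kv_bytes_per_token_fp16(32, 8, 128)` is 128 KiB per token, so a 2 GiB budget holds 16,384 FP16 tokens, in line with the ~16K in the first row. Passing `ratio = 3.8` scales that count by the compression factor.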

./quant model.gguf -p "hello"

Why quant.cpp

| | quant.cpp | llama.cpp |
|---|---|---|
| Code | 33K LOC, pure C | 250K+ LOC, C++ |
| Design | Read, modify, embed | Feature-complete |
| Dependencies | libc + pthreads only | ggml framework |
| KV compression | PPL -3.2% (better than FP32) | PPL +10.6% |

quant.cpp is not a fork. It's a standalone engine built from scratch for one goal: LLM inference you can understand, customize, and ship inside your own product.

  • Read — 33K lines. The full forward pass fits in one file. You can trace every computation.
  • Modify — Pure C11, modular. Add your own quantization type, swap the attention kernel, change the sampling strategy.
  • Embed — No frameworks, no package managers. Copy the source into your project. Compiles on any platform with a C compiler.

Quick Start

git clone https://github.com/quantumaikr/quant.cpp && cd quant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run inference with a GGUF model
./build/quant model.gguf -p "hello"

# KV compression: 4-bit keys + Q4 values (3.8x, recommended)
./build/quant model.gguf -p "hello" -k uniform_4b -v q4

# Delta compression: 3-bit keys + Q4 values (4.3x, best compression)
./build/quant model.gguf -p "hello" -k uniform_3b -v q4 --delta

# Measure perplexity
./build/quant model.gguf --ppl input.txt -k uniform_4b -v q4

KV Cache Compression

Modes

| Config | Compression | PPL vs FP32 | When to use |
|---|---|---|---|
| delta + 3b K + Q4 V | ~4.3x | -3.2% | Maximum context length |
| delta + 4b K + Q4 V | ~3.8x | -12.2% | Maximum quality |
| uniform 4b K + Q4 V | 3.8x | -7.8% | Simple, no delta overhead |
| uniform 4b K + FP16 V | 1.6x | +0.0% | Lossless baseline |
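
For readers new to the "uniform 4b" label, here is a minimal sketch of uniform min-max quantization to 4 bits. It is not quant.cpp's actual layout: the block size of 32, the FP32 `min`/`scale` fields, and the names `block_u4`, `quantize_u4`, and `dequantize_u4` are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 32                /* hypothetical block size */

typedef struct {
    float   min;                /* smallest value in the block */
    float   scale;              /* step size: (max - min) / 15 */
    uint8_t q[BLOCK / 2];       /* two 4-bit codes per byte */
} block_u4;

/* Map each value to one of 16 evenly spaced levels between min and max. */
static void quantize_u4(const float *x, block_u4 *out)
{
    assert(x != NULL && out != NULL);
    float lo = x[0], hi = x[0];
    for (size_t i = 1; i < BLOCK; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    out->min = lo;
    out->scale = (hi - lo) / 15.0f;
    float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
    for (size_t i = 0; i < BLOCK; i += 2) {
        /* Inputs are >= lo, so adding 0.5 and truncating rounds to nearest. */
        int a = (int)((x[i]     - lo) * inv + 0.5f);
        int b = (int)((x[i + 1] - lo) * inv + 0.5f);
        if (a > 15) a = 15;
        if (b > 15) b = 15;
        out->q[i / 2] = (uint8_t)(a | (b << 4));
    }
}

/* Reconstruct approximate values; the error is at most half a step. */
static void dequantize_u4(const block_u4 *in, float *y)
{
    assert(in != NULL && y != NULL);
    for (size_t i = 0; i < BLOCK; i += 2) {
        y[i]     = in->min + in->scale * (float)(in->q[i / 2] & 0x0F);
        y[i + 1] = in->min + in->scale * (float)(in->q[i / 2] >> 4);
    }
}
```

The per-block metadata is why real compression ratios fall short of the nominal 16/4 = 4x: the scale and offset have to be stored alongside the codes.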

Delta compression

Standard KV caching stores each key vector as-is. Delta mode stores key[t] - reconstruct(key[t-1]) — like video P-frames.

Adjacent keys differ by ~30% of their absolute range, so the delta spans a much smaller range than the key itself and 3-bit quantization preserves quality. On SmolLM2 1.7B, 3-bit keys without delta give PPL +62% vs FP32; with delta, -3.2%.

Every 64 tokens, an FP32 I-frame is stored to prevent drift.
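
The scheme described above can be sketched as a closed-loop encoder: each delta is taken against the previously reconstructed key rather than the original, so quantization error does not accumulate between I-frames. This is a simplified illustration and not the repository's implementation. `kv_frame`, `delta_encode`, `delta_decode`, and the tiny `DIM` are hypothetical, and the 3-bit codes are stored one per byte instead of bit-packed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIM        8    /* hypothetical key dimension, kept tiny for illustration */
#define IFRAME_GAP 64   /* full-precision key every 64 tokens */
#define LEVELS     8    /* 3 bits -> 8 quantization levels */

typedef struct {
    int     is_iframe;  /* 1: `full` holds the key; 0: `code` holds the delta */
    float   full[DIM];  /* FP32 key (I-frames only) */
    float   min, scale; /* range of the quantized delta (P-frames only) */
    uint8_t code[DIM];  /* one 3-bit code per byte; a real layout would pack them */
} kv_frame;

/* Encode key[t]. `recon` holds reconstruct(key[t-1]) on entry and
 * reconstruct(key[t]) on exit, mirroring what the decoder computes. */
static void delta_encode(const float *key, size_t t, float *recon, kv_frame *f)
{
    assert(key != NULL && recon != NULL && f != NULL);
    if (t % IFRAME_GAP == 0) {
        f->is_iframe = 1;
        memcpy(f->full, key, sizeof f->full);
        memcpy(recon, key, sizeof f->full);
        return;
    }
    float d[DIM], lo, hi;
    for (size_t i = 0; i < DIM; i++) d[i] = key[i] - recon[i];
    lo = hi = d[0];
    for (size_t i = 1; i < DIM; i++) {
        if (d[i] < lo) lo = d[i];
        if (d[i] > hi) hi = d[i];
    }
    f->is_iframe = 0;
    f->min = lo;
    f->scale = (hi - lo) / (float)(LEVELS - 1);
    float inv = f->scale > 0.0f ? 1.0f / f->scale : 0.0f;
    for (size_t i = 0; i < DIM; i++) {
        int q = (int)((d[i] - lo) * inv + 0.5f);   /* d[i] >= lo, so this rounds */
        if (q > LEVELS - 1) q = LEVELS - 1;
        f->code[i] = (uint8_t)q;
        recon[i] += lo + f->scale * (float)q;      /* what the decoder will see */
    }
}

/* Decode a frame given reconstruct(key[t-1]) in `recon`; updates it in place. */
static void delta_decode(const kv_frame *f, float *recon)
{
    assert(f != NULL && recon != NULL);
    if (f->is_iframe) {
        memcpy(recon, f->full, sizeof f->full);
        return;
    }
    for (size_t i = 0; i < DIM; i++)
        recon[i] += f->min + f->scale * (float)f->code[i];
}
```

In this sketch the per-token reconstruction error stays within half a quantization step of the delta's range, and each I-frame resets it to zero.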

Verified PPL (SmolLM2 1.7B, 999 tokens)

| Config | PPL | vs FP32 |
|---|---|---|
| FP32 baseline | 14.58 | -- |
| delta + 4b K + Q4 V | 12.80 | -12.2% |
| delta + 3b K + Q4 V | 14.11 | -3.2% |
| uniform 4b K + Q4 V | 13.44 | -7.8% |
| uniform 3b (no delta) | 23.62 | +62% |

Cross-model: SmolLM2 1.7B (-1.6%), Qwen3.5 0.8B (+0.9%), Qwen3.5 4B (+0.6%).
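
The "vs FP32" column in the SmolLM2 table is the relative change in perplexity against the FP32 baseline. A small helper makes the convention explicit; `ppl_change_pct` is a hypothetical name used only for this illustration.

```c
#include <assert.h>

/* Percentage change of a perplexity relative to a baseline.
 * Negative means the compressed cache scored a lower (better) perplexity. */
static double ppl_change_pct(double ppl, double baseline)
{
    assert(baseline > 0.0);
    return (ppl - baseline) / baseline * 100.0;
}
```

For example, `ppl_change_pct(12.80, 14.58)` evaluates to roughly -12.2, matching the delta + 4b K + Q4 V row.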


Supported Models

| Model | Architecture | Params | Status |
|---|---|---|---|
| SmolLM2-1.7B | Llama | 1.7B | PPL verified |
| Qwen3.5-0.8B | Qwen3.5 (DeltaNet) | 752M | PPL verified |
| Qwen3.5-4B | Qwen3.5 (DeltaNet) | 4B | PPL verified |
| Qwen3.5-35B-A3B | Qwen2-MoE | 35B (3B active) | Working |
| Gemma 3 270M | Gemma 3 | 270M | Working |
| Gemma 4 E2B | Gemma 4 | 2B | WIP |

Architectures: Llama/Qwen3.5 (shared path), Gemma 3/4 (sliding + full attention), Qwen2-MoE.

Models are loaded from GGUF files, the same format llama.cpp uses. The model's architecture must be one of those listed above.


Backends

| Backend | Platform | Status |
|---|---|---|
| NEON | ARM CPU | Production |
| AVX2 | x86 CPU | Production |
| Metal | Apple Silicon | Verified |
| CUDA | NVIDIA GPU | Compiles |
| Vulkan | Cross-platform | Compiles |

FAQ

How is this different from llama.cpp?

llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and embed in your own C/C++ project. On KV compression specifically, measured on SmolLM2 1.7B: llama.cpp Q4_0 gives PPL +10.6%, while quant.cpp's uniform 4-bit keys give +0.0% with FP16 values and -7.8% with Q4 values.

Can I embed this in my app?

Yes. quant.cpp is pure C11 with no third-party dependencies. Copy the source files, link against libc/libm, and call tq_load_model() / tq_generate(). It works on Linux, macOS, Windows, iOS, Android, and WASM. The thread pool is global but mutex-protected.

What about sub-3-bit quantization?

Tested extensively: 2-bit delta, sub-block scaling, multi-hash, error feedback, NF2, online SVD. None reached acceptable quality. The barrier: per-step cosine 0.997 compounds to 0.885 after 200 steps. 3-bit + delta is the practical minimum.


References

  • TurboQuant (ICLR 2026) — KV cache compression theory
  • QJL (AAAI 2025) — Quantized JL transform
  • PolarQuant (AISTATS 2026) — Polar coordinate quantization
