Embeddable LLM inference in pure C.
33K LOC. No external libraries. Read it in an afternoon.
~4x longer context on the same hardware. KV cache compression reduces per-token memory by 3.8x, extending context proportionally.
| Hardware | Model | FP16 KV | 4-bit K + Q4 V | Gain |
|---|---|---|---|---|
| 8GB Laptop | Llama 8B (Q4) | ~16K tokens | ~61K tokens | 3.8x |
| 16GB Mac Air | SmolLM2 1.7B | ~78K tokens | ~298K tokens | 3.8x |
| 24GB RTX 3090 | Llama 8B (Q4) | ~147K tokens | ~559K tokens | 3.8x |
Estimates based on KV memory reduction. Actual context depends on available memory after model weights.
./quant model.gguf -p "hello"| quant.cpp | llama.cpp | |
|---|---|---|
| Code | 33K LOC, pure C | 250K+ LOC, C++ |
| Design | Read, modify, embed | Feature-complete |
| Dependencies | libc + pthreads only | ggml framework |
| KV compression | PPL -3.2% (better than FP32) | PPL +10.6% |
quant.cpp is not a fork. It's a standalone engine built from scratch for one goal: LLM inference you can understand, customize, and ship inside your own product.
- Read — 33K lines. The full forward pass fits in one file. You can trace every computation.
- Modify — Pure C11, modular. Add your own quantization type, swap the attention kernel, change the sampling strategy.
- Embed — No frameworks, no package managers. Copy the source into your project. Compiles on any platform with a C compiler.
git clone https://github.com/quantumaikr/quant.cpp && cd quant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Run inference with a GGUF model
./build/quant model.gguf -p "hello"
# KV compression: 4-bit keys + Q4 values (3.8x, recommended)
./build/quant model.gguf -p "hello" -k uniform_4b -v q4
# Delta compression: 3-bit keys + Q4 values (4.3x, best compression)
./build/quant model.gguf -p "hello" -k uniform_3b -v q4 --delta
# Measure perplexity
./build/quant model.gguf --ppl input.txt -k uniform_4b -v q4| Config | Compression | PPL vs FP32 | When to use |
|---|---|---|---|
| delta + 3b K + Q4 V | ~4.3x | -3.2% | Maximum context length |
| delta + 4b K + Q4 V | ~3.8x | -12.2% | Maximum quality |
| uniform 4b K + Q4 V | 3.8x | -7.8% | Simple, no delta overhead |
| uniform 4b K + FP16 V | 1.6x | +0.0% | Lossless baseline |
Standard KV caching stores each key vector as-is. Delta mode stores key[t] - reconstruct(key[t-1]) — like video P-frames.
Adjacent keys differ by ~30% of their absolute range. This smaller range means 3-bit quantization preserves full quality. Without delta, 3-bit gives PPL +62%. With delta: -3.2%.
Every 64 tokens, an FP32 I-frame is stored to prevent drift.
| Config | PPL | vs FP32 |
|---|---|---|
| FP32 baseline | 14.58 | -- |
| delta + 4b K + Q4 V | 12.80 | -12.2% |
| delta + 3b K + Q4 V | 14.11 | -3.2% |
| uniform 4b K + Q4 V | 13.44 | -7.8% |
| uniform 3b (no delta) | 23.62 | +62% |
Cross-model: SmolLM2 1.7B (-1.6%), Qwen3.5 0.8B (+0.9%), Qwen3.5 4B (+0.6%).
| Model | Architecture | Params | Status |
|---|---|---|---|
| SmolLM2-1.7B | Llama | 1.7B | PPL verified |
| Qwen3.5-0.8B | Qwen3.5 (DeltaNet) | 752M | PPL verified |
| Qwen3.5-4B | Qwen3.5 (DeltaNet) | 4B | PPL verified |
| Qwen3.5-35B-A3B | Qwen2-MoE | 35B (3B active) | Working |
| Gemma 3 270M | Gemma 3 | 270M | Working |
| Gemma 4 E2B | Gemma 4 | 2B | WIP |
Architectures: Llama/Qwen3.5 (shared path), Gemma 3/4 (sliding + full attention), Qwen2-MoE.
GGUF format. Load any llama.cpp-compatible model file.
| Backend | Platform | Status |
|---|---|---|
| NEON | ARM CPU | Production |
| AVX2 | x86 CPU | Production |
| Metal | Apple Silicon | Verified |
| CUDA | NVIDIA GPU | Compiles |
| Vulkan | Cross-platform | Compiles |
How is this different from llama.cpp?
llama.cpp is a full-featured inference framework (250K+ LOC). quant.cpp is a minimal engine (33K LOC) you can read, modify, and embed in your own C/C++ project. On KV compression specifically: llama.cpp Q4_0 gives PPL +10.6% on SmolLM2 1.7B; quant.cpp gives +0.0%.
Can I embed this in my app?
Yes. Pure C11, zero dependencies. Copy the source files, link against libc/libm, and call tq_load_model() / tq_generate(). Works on Linux, macOS, Windows, iOS, Android, and WASM. Thread pool is global but mutex-protected.
What about sub-3-bit quantization?
Tested extensively: 2-bit delta, sub-block scaling, multi-hash, error feedback, NF2, online SVD. None reached acceptable quality. The barrier: per-step cosine 0.997 compounds to 0.885 after 200 steps. 3-bit + delta is the practical minimum.
- TurboQuant (ICLR 2026) — KV cache compression theory
- QJL (AAAI 2025) — Quantized JL transform
- PolarQuant (AISTATS 2026) — Polar coordinate quantization
