Tracking the next round of TurboQuant work after #14 was resolved by Variant F (commit ac3c46a).
Current state (post-Variant F):
| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | 32 | 13.56 | — |
| turbo_kv_4b ⭐ | 4 | 14.28 | +5.3% |
| uniform_4b | 4 | 14.41 | +6.3% |
| turbo_kv_3b | 3 | 15.39 | +13.5% |
We beat both our previous baseline and llama.cpp q4_0 KV at 4-bit. The remaining gap to FP32 (5.3%) and to the Google paper's near-zero claim (~0%) is real and could be closed with the following work.
Open follow-ups
1. Per-channel outlier handling (high impact)
The paper allocates ~25% of channels (32 out of 128) to a higher bit width for outliers. Our turbo_kv_4b uses uniform allocation. This is the most likely structural cause of the remaining gap.
Sketch:
- Compute per-channel max-abs across a calibration corpus (one-time, per layer)
- Identify the top-K outlier channels
- Store outlier indices + their values at FP16 (or higher-bit codebook) in the block header
- Quantize remaining channels at 3-bit / 4-bit
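The calibration/selection steps above could start from something like this (a minimal sketch; `tkv_find_outlier_channels` and the 128-channel cap are assumptions, not existing code):

```c
#include <math.h>
#include <stddef.h>

/* Sketch: given a calibration slab of shape [n_rows][n_chan], compute
 * per-channel max-abs and mark the top_k largest channels in out_mask
 * (1 = outlier channel to keep at higher precision). */
static void tkv_find_outlier_channels(const float *x, size_t n_rows,
                                      size_t n_chan, size_t top_k,
                                      unsigned char *out_mask) {
    float maxabs[128] = {0};            /* assumes n_chan <= 128 (head dim) */
    for (size_t r = 0; r < n_rows; r++)
        for (size_t c = 0; c < n_chan; c++) {
            float a = fabsf(x[r * n_chan + c]);
            if (a > maxabs[c]) maxabs[c] = a;
        }
    for (size_t c = 0; c < n_chan; c++) out_mask[c] = 0;
    for (size_t k = 0; k < top_k; k++) { /* simple selection pass, O(K*C) */
        size_t best = 0;
        float bestv = -1.0f;
        for (size_t c = 0; c < n_chan; c++)
            if (!out_mask[c] && maxabs[c] > bestv) { bestv = maxabs[c]; best = c; }
        out_mask[best] = 1;
    }
}
```

Since this runs once per layer over a calibration corpus, the O(K·C) selection is fine; no heap or sort needed.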
Storage cost: 32 outliers × (2-byte FP16 value + byte-aligned 7-bit index) = 96 bytes — too big for our 72-byte block. Either:
(a) larger block size (256 bytes? trade off compression ratio for quality)
(b) per-layer outlier mask shared across blocks (much cheaper)
(c) start with K=8 outliers per block (24 bytes added — might fit)
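For option (c), a possible block layout (purely illustrative; `tkv_block_4b_o8` and its field names are not a shipped ABI):

```c
#include <stdint.h>

/* Sketch of option (c): extend the 72-byte turbo_kv_4b block with K=8
 * per-block outliers stored as FP16 bit patterns + 1-byte channel index.
 * 8 + 16 + 8 + 64 = 96 bytes total: the original 72 plus 24 extra. */
#define TKV_BK        128
#define TKV_OUTLIER_K 8

typedef struct {
    uint8_t  header[8];                  /* existing 8-byte block header   */
    uint16_t outlier_val[TKV_OUTLIER_K]; /* FP16 bit patterns, 16 bytes    */
    uint8_t  outlier_idx[TKV_OUTLIER_K]; /* channel indices,    8 bytes    */
    uint8_t  q[TKV_BK / 2];              /* 128 x 4-bit indices, 64 bytes  */
} tkv_block_4b_o8;
```

Every field is naturally aligned, so there is no hidden padding and `sizeof` comes out to exactly 96 bytes.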
2. Paper-faithful Llama 3.1 8B + LongBench-E reproduction
The paper reports on Llama 3.1 8B + LongBench-E, not WikiText PPL. We need a paper-faithful run on that setup before our numbers are directly comparable.
3. 5-bit codebook variant
Layout for 5-bit per element at TQ_BK=128: 128 × 5 / 8 = 80 bytes for indices + 8-byte header = 88 bytes per block. Larger than the 72-byte 4-bit block, but it covers the ~5 bits/elem point the paper also tests. Would need to:
- Compute Lloyd-Max-Gaussian centroids for 32 levels
- Add them to the `tq_codebook.c` lookup table
- Add `TQ_TYPE_TURBO_KV_5B` enum + register in `tq_traits.c`
- Pack/unpack 5-bit indices
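The pack/unpack step could look like this (a minimal sketch; `tq_pack_5b`/`tq_unpack_5b` are hypothetical names, and the LSB-first bit order is an assumption):

```c
#include <stdint.h>
#include <stddef.h>

/* 5-bit index packing for TQ_BK = 128: 128 x 5 bits = 640 bits = 80 bytes.
 * Indices are written LSB-first into a little-endian bit stream. */
static void tq_pack_5b(const uint8_t *idx, size_t n, uint8_t *out) {
    size_t bit = 0;
    for (size_t i = 0; i < (n * 5 + 7) / 8; i++) out[i] = 0;
    for (size_t i = 0; i < n; i++, bit += 5) {
        uint32_t v = (uint32_t)(idx[i] & 0x1F) << (bit & 7);
        out[bit >> 3] |= (uint8_t)v;
        if (v >> 8)                       /* value spills into next byte */
            out[(bit >> 3) + 1] |= (uint8_t)(v >> 8);
    }
}

static uint8_t tq_unpack_5b(const uint8_t *in, size_t i) {
    size_t bit = i * 5;
    unsigned sh = (unsigned)(bit & 7);
    uint16_t w = in[bit >> 3];
    if (sh > 3)                           /* 5 bits cross a byte boundary */
        w |= (uint16_t)in[(bit >> 3) + 1] << 8;
    return (uint8_t)((w >> sh) & 0x1F);
}
```

The boundary guards matter: the last index ends exactly at bit 640, so an unconditional read/write of the "next" byte would step one past the 80-byte buffer.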
4. Per-head rotation seeds
Currently all keys use `TKV_DEFAULT_SEED`. Per-head or per-layer seeds may help decorrelation in models where certain heads have correlated channels.
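One cheap way to get distinct per-(layer, head) seeds without new header bytes, sketched with a splitmix64-style mixer (the seed constant and `tkv_head_seed` are illustrative, not the real `TKV_DEFAULT_SEED`):

```c
#include <stdint.h>

#define TKV_DEFAULT_SEED 0x9E3779B97F4A7C15ULL  /* illustrative value */

/* splitmix64 finalizer: bijective, so distinct inputs give distinct seeds */
static uint64_t tkv_mix64(uint64_t z) {
    z ^= z >> 30; z *= 0xBF58476D1CE4E5B9ULL;
    z ^= z >> 27; z *= 0x94D049BB133111EBULL;
    z ^= z >> 31;
    return z;
}

/* Derive a rotation seed per (layer, head); nearby pairs land far apart. */
static uint64_t tkv_head_seed(uint32_t layer, uint32_t head) {
    return tkv_mix64(TKV_DEFAULT_SEED ^ (((uint64_t)layer << 32) | head));
}
```

Because the derivation is deterministic from (layer, head), nothing needs to be stored per block — decode recomputes the same seed.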
5. Regression test pinning quality
Add a slow integration test that fails CI if `turbo_kv_4b` PPL on Llama 3.2 3B exceeds 14.5. This guards against future regressions in the Karpathy-loop optimization.
Out of scope (won't fix)
- ❌ QJL stage revival — ablation showed it contributes ~0; reinvesting bytes in larger codebook is empirically better in our regime
- ❌ Multi-stage rotation — Walsh-Hadamard one-pass is fast and good enough
- ❌ Per-block adaptive bit allocation — not enough header space without breaking ABI
Resources