Tracking the next round of TurboQuant work after #14 was resolved by Variant F (commit ac3c46a).
Current state (post-Variant F):
| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | 32 | 13.56 | — |
| turbo_kv_4b ⭐ | 4 | 14.28 | +5.3% |
| uniform_4b | 4 | 14.41 | +6.3% |
| turbo_kv_3b | 3 | 15.39 | +13.5% |
We beat both our previous baseline and llama.cpp q4_0 KV at 4-bit. The remaining gap to FP32 (5.3%) and to the Google paper's near-zero claim (~0%) is real and could be closed with the following work.
Open follow-ups
1. Per-channel outlier handling (high impact)
The paper allocates ~25% of channels (32 out of 128) to a higher bit width for outliers. Our turbo_kv_4b uses uniform allocation. This is the most likely structural cause of the remaining gap.
Sketch:
- Compute per-channel max-abs across a calibration corpus (one-time, per layer)
- Identify the top-K outlier channels
- Store outlier indices + their values at FP16 (or higher-bit codebook) in the block header
- Quantize remaining channels at 3-bit / 4-bit
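The calibration/selection steps above could start from something like this (a minimal sketch; `tkv_find_outlier_channels` and the 128-channel cap are assumptions, not existing code):

```c
#include <math.h>
#include <stddef.h>

/* Sketch: given a calibration slab of shape [n_rows][n_chan], compute
 * per-channel max-abs and mark the top_k largest channels in out_mask
 * (1 = outlier channel to keep at higher precision). */
static void tkv_find_outlier_channels(const float *x, size_t n_rows,
                                      size_t n_chan, size_t top_k,
                                      unsigned char *out_mask) {
    float maxabs[128] = {0};            /* assumes n_chan <= 128 (head dim) */
    for (size_t r = 0; r < n_rows; r++)
        for (size_t c = 0; c < n_chan; c++) {
            float a = fabsf(x[r * n_chan + c]);
            if (a > maxabs[c]) maxabs[c] = a;
        }
    for (size_t c = 0; c < n_chan; c++) out_mask[c] = 0;
    for (size_t k = 0; k < top_k; k++) { /* simple selection pass, O(K*C) */
        size_t best = 0;
        float bestv = -1.0f;
        for (size_t c = 0; c < n_chan; c++)
            if (!out_mask[c] && maxabs[c] > bestv) { bestv = maxabs[c]; best = c; }
        out_mask[best] = 1;
    }
}
```

Since this runs once per layer over a calibration corpus, the O(K·C) selection is fine; no heap or sort needed.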
Storage cost: 32 outliers × (2-byte FP16 value + byte-aligned 7-bit index) = 96 bytes — too big for our 72-byte block. Either:
(a) larger block size (256 bytes? trade off compression ratio for quality)
(b) per-layer outlier mask shared across blocks (much cheaper)
(c) start with K=8 outliers per block (24 bytes added — might fit)
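For option (c), a possible block layout (purely illustrative; `tkv_block_4b_o8` and its field names are not a shipped ABI):

```c
#include <stdint.h>

/* Sketch of option (c): extend the 72-byte turbo_kv_4b block with K=8
 * per-block outliers stored as FP16 bit patterns + 1-byte channel index.
 * 8 + 16 + 8 + 64 = 96 bytes total: the original 72 plus 24 extra. */
#define TKV_BK        128
#define TKV_OUTLIER_K 8

typedef struct {
    uint8_t  header[8];                  /* existing 8-byte block header   */
    uint16_t outlier_val[TKV_OUTLIER_K]; /* FP16 bit patterns, 16 bytes    */
    uint8_t  outlier_idx[TKV_OUTLIER_K]; /* channel indices,    8 bytes    */
    uint8_t  q[TKV_BK / 2];              /* 128 x 4-bit indices, 64 bytes  */
} tkv_block_4b_o8;
```

Every field is naturally aligned, so there is no hidden padding and `sizeof` comes out to exactly 96 bytes.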
2. Paper-faithful Llama 3.1 8B + LongBench-E reproduction
The paper reports on Llama 3.1 8B + LongBench-E, not WikiText PPL. We need a paper-faithful run on that setup before our numbers are directly comparable.
3. 5-bit codebook variant
Layout for 5-bit per element at TQ_BK=128: 128 × 5 / 8 = 80 bytes for indices + 8-byte header = 88 bytes per block. Larger than the 72-byte 4-bit block, but it covers the ~5 bits/elem point the paper also tests. Would need to:
- Compute Lloyd-Max-Gaussian centroids for 32 levels
- Add them to the `tq_codebook.c` lookup table
- Add `TQ_TYPE_TURBO_KV_5B` enum + register in `tq_traits.c`
- Pack/unpack 5-bit indices
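The pack/unpack step could look like this (a minimal sketch; `tq_pack_5b`/`tq_unpack_5b` are hypothetical names, and the LSB-first bit order is an assumption):

```c
#include <stdint.h>
#include <stddef.h>

/* 5-bit index packing for TQ_BK = 128: 128 x 5 bits = 640 bits = 80 bytes.
 * Indices are written LSB-first into a little-endian bit stream. */
static void tq_pack_5b(const uint8_t *idx, size_t n, uint8_t *out) {
    size_t bit = 0;
    for (size_t i = 0; i < (n * 5 + 7) / 8; i++) out[i] = 0;
    for (size_t i = 0; i < n; i++, bit += 5) {
        uint32_t v = (uint32_t)(idx[i] & 0x1F) << (bit & 7);
        out[bit >> 3] |= (uint8_t)v;
        if (v >> 8)                       /* value spills into next byte */
            out[(bit >> 3) + 1] |= (uint8_t)(v >> 8);
    }
}

static uint8_t tq_unpack_5b(const uint8_t *in, size_t i) {
    size_t bit = i * 5;
    unsigned sh = (unsigned)(bit & 7);
    uint16_t w = in[bit >> 3];
    if (sh > 3)                           /* 5 bits cross a byte boundary */
        w |= (uint16_t)in[(bit >> 3) + 1] << 8;
    return (uint8_t)((w >> sh) & 0x1F);
}
```

The boundary guards matter: the last index ends exactly at bit 640, so an unconditional read/write of the "next" byte would step one past the 80-byte buffer.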
4. Per-head rotation seeds
Currently all keys use `TKV_DEFAULT_SEED`. Per-head or per-layer seeds may help decorrelation in models where certain heads have correlated channels.
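One cheap way to get distinct per-(layer, head) seeds without new header bytes, sketched with a splitmix64-style mixer (the seed constant and `tkv_head_seed` are illustrative, not the real `TKV_DEFAULT_SEED`):

```c
#include <stdint.h>

#define TKV_DEFAULT_SEED 0x9E3779B97F4A7C15ULL  /* illustrative value */

/* splitmix64 finalizer: bijective, so distinct inputs give distinct seeds */
static uint64_t tkv_mix64(uint64_t z) {
    z ^= z >> 30; z *= 0xBF58476D1CE4E5B9ULL;
    z ^= z >> 27; z *= 0x94D049BB133111EBULL;
    z ^= z >> 31;
    return z;
}

/* Derive a rotation seed per (layer, head); nearby pairs land far apart. */
static uint64_t tkv_head_seed(uint32_t layer, uint32_t head) {
    return tkv_mix64(TKV_DEFAULT_SEED ^ (((uint64_t)layer << 32) | head));
}
```

Because the derivation is deterministic from (layer, head), nothing needs to be stored per block — decode recomputes the same seed.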
5. Regression test pinning quality
Add a slow integration test that fails CI if `turbo_kv_4b` PPL on Llama 3.2 3B exceeds 14.5. This guards against future regressions in the Karpathy-loop optimization.
Out of scope (won't fix)
- ❌ QJL stage revival — ablation showed it contributes ~0; reinvesting bytes in larger codebook is empirically better in our regime
- ❌ Multi-stage rotation — Walsh-Hadamard one-pass is fast and good enough
- ❌ Per-block adaptive bit allocation — not enough header space without breaking ABI
Resources