
TurboQuant follow-ups: per-channel outliers, Llama 3.1 8B reproduction, 5-bit codebook #15

@unamedkr

Description


Tracking the next round of TurboQuant work after #14 was resolved by Variant F (commit ac3c46a).

Current state (post-Variant F):

| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 | 32 | 13.56 | — |
| turbo_kv_4b | 4 | 14.28 | +5.3% |
| uniform_4b | 4 | 14.41 | +6.3% |
| turbo_kv_3b | 3 | 15.39 | +13.5% |

We beat both our previous baseline and llama.cpp q4_0 KV at 4-bit. The remaining gap to FP32 (5.3%) and to the Google paper's near-zero claim (~0%) is real and could be closed with the following work.

Open follow-ups

1. Per-channel outlier handling (high impact)

The paper allocates ~25% of channels (32 out of 128) to a higher bit width for outliers. Our turbo_kv_4b uses uniform allocation. This is the most likely structural cause of the remaining gap.

Sketch:

  • Compute per-channel max-abs across a calibration corpus (one-time, per layer)
  • Identify the top-K outlier channels
  • Store outlier indices + their values at FP16 (or higher-bit codebook) in the block header
  • Quantize remaining channels at 3-bit / 4-bit

Storage cost: 32 outliers × (2-byte FP16 value + 7-bit index rounded up to a byte) = 96 bytes — too big for our 72-byte block. Options:
(a) larger block size (256 bytes? trades compression ratio for quality)
(b) a per-layer outlier mask shared across blocks (much cheaper)
(c) start with K=8 outliers per block (8 × 3 = 24 bytes added — might fit)
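The calibration steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: `channel_max_abs`, `top_k_channels`, `N_CHANNELS`, and `TOP_K` are all hypothetical names, and K=8 matches option (c).

```c
#include <assert.h>
#include <math.h>
#include <string.h>

#define N_CHANNELS 128  /* head dim, per the paper's 32-of-128 split */
#define TOP_K 8         /* option (c): start with 8 outliers */

/* Per-channel max-abs over a calibration buffer of n_rows x N_CHANNELS. */
static void channel_max_abs(const float *x, int n_rows, float *out) {
    for (int c = 0; c < N_CHANNELS; c++) out[c] = 0.0f;
    for (int r = 0; r < n_rows; r++)
        for (int c = 0; c < N_CHANNELS; c++) {
            float a = fabsf(x[r * N_CHANNELS + c]);
            if (a > out[c]) out[c] = a;
        }
}

/* Pick the TOP_K channels with the largest max-abs (simple selection pass;
 * fine for a one-time, per-layer calibration step). */
static void top_k_channels(const float *max_abs, int *idx_out) {
    float m[N_CHANNELS];
    memcpy(m, max_abs, sizeof(m));
    for (int k = 0; k < TOP_K; k++) {
        int best = 0;
        for (int c = 1; c < N_CHANNELS; c++)
            if (m[c] > m[best]) best = c;
        idx_out[k] = best;
        m[best] = -1.0f;  /* exclude this channel from further picks */
    }
}
```

The selected indices would then go into the block header (or a per-layer mask, per option (b)), with those channels stored at FP16 and the rest quantized as today.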

2. Paper-faithful Llama 3.1 8B + LongBench-E reproduction

The paper reports on Llama 3.1 8B + LongBench-E, not WikiText PPL. We need:

  • Download the Llama 3.1 8B GGUF
  • Find or build a LongBench-E runner harness
  • Run baseline, turbo_kv_4b, and (once outlier handling lands) the outlier variant; compare against paper Table 1

3. 5-bit codebook variant

Layout for 5-bit per element at TQ_BK=128: 128 × 5 / 8 = 80 bytes of indices + 8-byte header = 88 bytes per block (5.5 effective bits/elem). Larger than the 72-byte 4-bit block, but it covers the ~5 bpc point the paper also tests. Would need to:

  • Compute Lloyd-Max-Gaussian centroids for 32 levels
  • Add to tq_codebook.c lookup table
  • Add TQ_TYPE_TURBO_KV_5B enum + register in tq_traits.c
  • Pack/unpack 5-bit indices
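A sketch of the pack/unpack step, assuming a little-endian bitstream over the 80-byte index area (`pack_5bit`/`unpack_5bit` are hypothetical names; TQ_BK=128 is from the layout above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TQ_BK 128
#define IDX_BYTES (TQ_BK * 5 / 8)  /* 80 bytes of packed indices */

/* Pack TQ_BK 5-bit indices (values 0..31) into a little-endian bitstream.
 * 128 * 5 = 640 bits fills the 80 bytes exactly, so no tail handling. */
static void pack_5bit(const uint8_t *idx, uint8_t *out) {
    memset(out, 0, IDX_BYTES);
    for (int i = 0; i < TQ_BK; i++) {
        int bit = i * 5;
        int byte = bit >> 3, shift = bit & 7;
        uint16_t v = (uint16_t)(idx[i] & 0x1F) << shift;
        out[byte] |= (uint8_t)(v & 0xFF);
        if (shift > 3)                 /* 5 bits spill into the next byte */
            out[byte + 1] |= (uint8_t)(v >> 8);
    }
}

/* Read back the i-th 5-bit index. */
static uint8_t unpack_5bit(const uint8_t *in, int i) {
    int bit = i * 5;
    int byte = bit >> 3, shift = bit & 7;
    uint16_t v = in[byte];
    if (shift > 3)
        v |= (uint16_t)in[byte + 1] << 8;
    return (uint8_t)((v >> shift) & 0x1F);
}
```

A SIMD dequant path would likely want a different layout (e.g. plane-split bits, as llama.cpp's q5_0 does with a separate high-bit plane), but the scalar bitstream is the simplest correctness baseline.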

4. Per-head rotation seeds

Currently all keys use `TKV_DEFAULT_SEED`. Per-head or per-layer seeds may help decorrelation in models where certain heads have correlated channels.
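One cheap way to derive decorrelated per-head seeds without storing anything new: mix (layer, head) into the base seed with a splitmix64-style finalizer. This is a sketch under assumptions — `head_seed` and `mix64` are hypothetical names, and the base seed is passed in rather than hard-coding `TKV_DEFAULT_SEED`'s value:

```c
#include <assert.h>
#include <stdint.h>

/* splitmix64-style finalizer: cheap, well-distributed 64-bit mix. */
static uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Derive a per-(layer, head) rotation seed from the base seed.
 * Deterministic, so both quantize and dequantize sides agree
 * without writing anything extra into the block format. */
static uint64_t head_seed(uint64_t base_seed, uint32_t layer, uint32_t head) {
    return mix64(base_seed ^ ((uint64_t)layer << 32) ^ head);
}
```

Per-layer-only seeding is the same call with `head = 0`, which keeps the ablation simple.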

5. Regression test pinning quality

Add a slow integration test that fails CI if `turbo_kv_4b` PPL on Llama 3.2 3B exceeds 14.5. This guards against future regressions in the Karpathy-loop optimization.

Out of scope (won't fix)

  • ❌ QJL stage revival — ablation showed it contributes ~0; reinvesting bytes in larger codebook is empirically better in our regime
  • ❌ Multi-stage rotation — Walsh-Hadamard one-pass is fast and good enough
  • ❌ Per-block adaptive bit allocation — not enough header space without breaking ABI
