v3.3.0

@wpferrell wpferrell released this 18 May 20:14
· 55 commits to main since this release

v3.3.0 ships KV cache compression infrastructure and switches the repo to Elastic License 2.0.

The codec is correct and lossless. Live-inference integration is NOT shipped: decoding the cache for a single attention pass at seq=2000 takes ~30 s on CPU, which is too slow for live token generation. The codec ships as an opt-in API for KV-at-rest use (snapshot/restore) and as infrastructure for future GPU-AC-kernel work.

Added

  • bigsmall.codecs.kv_cache.compress_kv_entry(keys, values) -> bytes and decompress_kv_entry(bytes, device) -> (k, v). The round-trip is bit-identical and uses the existing bf16_se_ac codec.
  • bigsmall.kv_cache_manager.CompressedKVCache — drop-in storage class with set/get/memory_usage/compression_ratio/clear.
  • 5 new tests, 96 passed / 2 skipped total (up from 91).
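
For readers who want a feel for the KV-at-rest pattern, here is a minimal sketch of a store with the same five method names. It is a hypothetical stand-in, not bigsmall's CompressedKVCache: the class name AtRestKVStore, the use of zlib in place of bf16_se_ac, the raw-bytes inputs, and the meaning of compression_ratio (raw size divided by stored size) are all assumptions made for illustration.

```python
import zlib


class AtRestKVStore:
    """Hypothetical stand-in for a compressed KV store; not bigsmall's class."""

    def __init__(self):
        # layer index -> (compressed keys, compressed values, raw byte count)
        self._entries = {}

    def set(self, layer, keys, values):
        # zlib stands in for the real codec; any lossless codec fits here.
        raw = len(keys) + len(values)
        self._entries[layer] = (zlib.compress(keys), zlib.compress(values), raw)

    def get(self, layer):
        k_blob, v_blob, _ = self._entries[layer]
        return zlib.decompress(k_blob), zlib.decompress(v_blob)

    def memory_usage(self):
        # Bytes currently held in compressed form.
        return sum(len(k) + len(v) for k, v, _ in self._entries.values())

    def compression_ratio(self):
        # Assumed definition: raw bytes / stored bytes (1.0 when empty).
        stored = self.memory_usage()
        raw = sum(r for _, _, r in self._entries.values())
        return raw / stored if stored else 1.0

    def clear(self):
        self._entries.clear()


# Snapshot one layer, restore it, and check the round-trip is bit-identical.
store = AtRestKVStore()
keys = bytes(range(256)) * 64
values = bytes(reversed(range(256))) * 64
store.set(0, keys, values)
restored_k, restored_v = store.get(0)
assert (restored_k, restored_v) == (keys, values)
```

The equality check at the end is the property that matters for a lossless codec: whatever goes in must come back byte for byte.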

Empirical findings (Phi-3.5-mini, 4 long prompts)

  • K compresses to 67.90% of raw BF16 (H(K) = 10.62 bits/el).
  • V compresses to 67.64% of raw BF16 (H(V) = 10.58 bits/el).
  • Full-model KV at seq=2000: 786.4 MB raw → 515.1 MB compressed, 271.3 MB saved (1.53x reduction).
  • Encode throughput: ~46 MB/s. Decode: ~26 MB/s. Per-attention-pass decode at seq=2000: ~30 s.
  • The spec's "3-4x KV cache size reduction" target was NOT MET: that figure is a lossy-compression target. The lossless ceiling on KV cache is similar to the ceiling for weights (~1.47x).
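
The figures above can be cross-checked with a few lines of arithmetic. This is a sketch under two assumptions that these notes do not state: that Phi-3.5-mini has 32 layers with a 3072-wide K and V per token (no grouped-query attention), and that the MB/s throughput is measured against raw BF16 bytes.

```python
# Assumed model shape (not stated in these notes).
LAYERS = 32
KV_WIDTH = 3072      # assumed: KV heads x head dimension
SEQ_LEN = 2000
BF16_BYTES = 2
BF16_BITS = 16

# Raw KV size: one K and one V tensor per layer.
raw_mb = 2 * LAYERS * SEQ_LEN * KV_WIDTH * BF16_BYTES / 1e6

# Reported compressed size and the savings it implies.
compressed_mb = 515.1
saved_mb = raw_mb - compressed_mb
full_model_reduction = raw_mb / compressed_mb

# Entropy lower bound on compressed size, as a fraction of raw BF16.
bound_k = 10.62 / BF16_BITS
bound_v = 10.58 / BF16_BITS

# Per-tensor reduction implied by the measured K ratio (67.90% of raw).
per_tensor_reduction = 1 / 0.6790

# Time to code the whole cache at the reported throughputs
# (assumes MB/s refers to raw bytes).
encode_seconds = raw_mb / 46.0
decode_seconds = raw_mb / 26.0

print(f"raw: {raw_mb:.1f} MB, saved: {saved_mb:.1f} MB")
print(f"full-model reduction: {full_model_reduction:.2f}x")
print(f"entropy bound K: {bound_k:.1%}, V: {bound_v:.1%}")
print(f"per-tensor reduction from K: {per_tensor_reduction:.2f}x")
print(f"encode: {encode_seconds:.1f} s, decode: {decode_seconds:.1f} s")
```

Under these assumptions the raw size, savings, and ~30 s decode time reproduce the reported numbers, and the measured K and V ratios sit about 1.5 percentage points above their entropy bounds.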

Licensing

  • Repository license changed from Apache 2.0 to Elastic License 2.0. See LICENSE and LICENSING.md.

Compatibility

  • All existing tests pass. Default behaviour is unchanged: KV compression is opt-in.

Install: pip install bigsmall==3.3.0