v3.3.0
v3.3.0 ships KV cache compression infrastructure and switches the repo to Elastic License 2.0.
Codec is correct and lossless. Live-inference integration is NOT shipped — per-attention-pass decode at seq=2000 is ~30s on CPU, too slow for live token generation. Shipped as opt-in API for KV-at-rest use (snapshot/restore) and as infrastructure for future GPU-AC-kernel work.
Added
- bigsmall.codecs.kv_cache.compress_kv_entry(keys, values) -> bytes and decompress_kv_entry(bytes, device) -> (k, v). Bit-identical round-trip via the existing bf16_se_ac codec.
- bigsmall.kv_cache_manager.CompressedKVCache: drop-in storage class with set / get / memory_usage / compression_ratio / clear.
- 5 new tests, 96 passed / 2 skipped total (up from 91).
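To illustrate the KV-at-rest pattern (store compressed, restore bit-identically), here is a minimal stand-in sketch. It is not bigsmall code: it uses zlib from the Python standard library in place of the bf16_se_ac codec, works on raw bytes rather than tensors, and the class name ToyCompressedCache is hypothetical. Only the method names mirror the list above. The argument shapes (a layer index plus key and value byte strings) are assumptions, since the real signatures are not given in these notes.

```python
import zlib


class ToyCompressedCache:
    """Stand-in store: zlib-compressed blobs keyed by layer index (assumed layout)."""

    def __init__(self):
        self._blobs = {}  # layer -> compressed payload
        self._raw = {}    # layer -> raw size in bytes

    def set(self, layer, keys, values):
        # Length-prefix the key block so get() can split the payload again.
        payload = len(keys).to_bytes(8, "little") + keys + values
        self._blobs[layer] = zlib.compress(payload)
        self._raw[layer] = len(keys) + len(values)

    def get(self, layer):
        payload = zlib.decompress(self._blobs[layer])
        n = int.from_bytes(payload[:8], "little")
        return payload[8:8 + n], payload[8 + n:]

    def memory_usage(self):
        # Bytes currently held in compressed form.
        return sum(len(blob) for blob in self._blobs.values())

    def compression_ratio(self):
        # Raw size divided by stored size; 1.0 for an empty cache.
        stored = self.memory_usage()
        return sum(self._raw.values()) / stored if stored else 1.0

    def clear(self):
        self._blobs.clear()
        self._raw.clear()


# Snapshot one layer, then restore it and check the round trip is exact.
cache = ToyCompressedCache()
k = bytes(range(256)) * 64
v = bytes(reversed(range(256))) * 64
cache.set(0, k, v)
restored_k, restored_v = cache.get(0)
assert (restored_k, restored_v) == (k, v)
```

The exact-equality check at the end is the same contract the release describes as a bit-identical round-trip. The ratio this toy achieves on its synthetic input says nothing about the real codec.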
Empirical findings (Phi-3.5-mini, 4 long prompts)
- K compresses to 67.90% of raw BF16 (H(K) = 10.62 bits/el).
- V compresses to 67.64% of raw BF16 (H(V) = 10.58 bits/el).
- Full-model KV at seq=2000: 786.4 MB raw → 515.1 MB compressed, 271.3 MB saved (1.53x reduction).
- Encode throughput: ~46 MB/s. Decode: ~26 MB/s. Per-attention-pass decode at seq=2000: ~30 s.
- Spec's "3-4x KV cache size reduction" target was NOT MET — that's a lossy compression target. The lossless ceiling on KV cache is similar to weights (~1.47x).
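The figures in this list are tied together by simple arithmetic, which the sketch below reproduces. It uses only numbers quoted above. The helper names are illustrative and not part of bigsmall, and two readings are assumptions: that H(K) is a per-element entropy over 16-bit values, and that the decode throughput is measured against raw (uncompressed) size.

```python
# Sanity-check the reported numbers; every input is taken from the list above.
# Helper names are illustrative, not part of the bigsmall API.

BF16_BITS = 16  # bits per raw BF16 element


def entropy_bound_ratio(entropy_bits):
    """Best-case lossless reduction factor for a given per-element entropy."""
    return BF16_BITS / entropy_bits


def reduction_from_fraction(compressed_fraction):
    """Convert 'compresses to X of raw' into an N-x reduction factor."""
    return 1.0 / compressed_fraction


raw_mb, compressed_mb = 786.4, 515.1
decode_mb_per_s = 26.0

k_bound = entropy_bound_ratio(10.62)          # implied by H(K)
k_measured = reduction_from_fraction(0.6790)  # K at 67.90% of raw
v_measured = reduction_from_fraction(0.6764)  # V at 67.64% of raw
full_model = raw_mb / compressed_mb
saved_mb = raw_mb - compressed_mb
# Assumes the whole raw cache is decoded once per attention pass.
decode_seconds = raw_mb / decode_mb_per_s

print(f"K entropy bound:  {k_bound:.2f}x")
print(f"K measured:       {k_measured:.2f}x")
print(f"V measured:       {v_measured:.2f}x")
print(f"Full-model KV:    {full_model:.2f}x, {saved_mb:.1f} MB saved")
print(f"Decode estimate:  {decode_seconds:.1f} s")
```

Under these assumptions the per-tensor ratios come out at roughly 1.47x to 1.48x, slightly below the bound implied by the entropy figure, and the decode estimate lands close to the reported ~30 s.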
Licensing
- Repository license changed from Apache 2.0 to Elastic License 2.0. See LICENSE and LICENSING.md.
Compatibility
- All existing tests pass. Default behaviour unchanged — KV compression is opt-in.
Install: pip install bigsmall==3.3.0