v3.4.0

wpferrell released this 18 May 20:56 · 54 commits to main since this release

v3.4.0 adds bf16_se_rans, a new BF16 codec that uses constriction.stream.stack.AnsCoder (range Asymmetric Numeral Systems, rANS) in place of the range-coding RangeEncoder used by bf16_se_ac. The algorithm is otherwise the same, the compression ratio matches to within 0.0015 pp, and the bytestream is GPU-portable.
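For readers unfamiliar with rANS, here is a minimal toy sketch of the core state update and its stack (last-in, first-out) behaviour. It is not constriction's implementation and not the bf16_se_rans bytestream: it keeps the whole message in one unbounded Python integer instead of streaming fixed-size words, and the symbol frequencies are made up for illustration.

```python
# Toy rANS with an unbounded integer state (illustrative only).
FREQS = {"a": 5, "b": 2, "c": 1}   # hypothetical symbol frequencies
TOTAL = sum(FREQS.values())        # M = 8

# Cumulative frequency: start of each symbol's slot range within [0, TOTAL).
CUMUL = {}
_acc = 0
for _sym, _freq in FREQS.items():
    CUMUL[_sym] = _acc
    _acc += _freq


def rans_encode(symbols, state=0):
    """Push symbols onto the state. Pushing in reverse order means
    decoding later pops them in forward order."""
    for s in reversed(symbols):
        f, c = FREQS[s], CUMUL[s]
        state = (state // f) * TOTAL + c + (state % f)
    return state


def rans_decode(state, count):
    """Pop `count` symbols off the state (the last one pushed comes out first)."""
    out = []
    for _ in range(count):
        slot = state % TOTAL
        s = next(sym for sym in FREQS
                 if CUMUL[sym] <= slot < CUMUL[sym] + FREQS[sym])
        state = FREQS[s] * (state // TOTAL) + slot - CUMUL[s]
        out.append(s)
    return out, state


message = list("abacabaa")
code = rans_encode(message)
decoded, final_state = rans_decode(code, len(message))
```

Decoding exactly inverts each encoding step, so the decoder ends on the initial state. A range coder, by contrast, behaves like a queue (first in, first out).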

Honest performance on Phi-3.5-mini shard 1 (128 BF16 tensors, 4.97 GB raw)

Codec          Encode      Decode      Ratio
bf16_se_ac     46.0 MB/s   25.9 MB/s   65.71%
bf16_se_rans   45.0 MB/s   27.0 MB/s   65.70%
Speedup        0.98x       1.04x       -0.0015 pp

The spec predicted a 10-50x speedup. Measured on real model data: decode is ~4% faster and encode is ~2% slower. The algorithmic AC-vs-ANS difference (~1.18x on a single large stream) is washed out by constriction's per-call Python↔Rust FFI overhead, because bf16 coding runs one coder per exponent bucket (~80 coders per tensor). The codec ships anyway because (a) it is lossless and correct, (b) its bytestream is GPU-portable, and (c) it provides infrastructure for a future fast codec.
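To make the per-exponent-bucket structure concrete, here is a minimal sketch of splitting BF16 values into sign, exponent, and mantissa fields and grouping them by exponent. The field layout (1 sign bit, 8 exponent bits, 7 mantissa bits) is standard BF16. Everything else is an assumption for illustration: the helper names are hypothetical, the float32-to-BF16 conversion simply truncates, and the real bigsmall bucket layout is not described in these notes.

```python
import struct
from collections import defaultdict


def float_to_bf16_bits(x):
    """Keep the top 16 bits of the float32 pattern (truncation, no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def split_bf16(bits16):
    """Return the (sign, exponent, mantissa) fields of a BF16 bit pattern."""
    sign = bits16 >> 15
    exponent = (bits16 >> 7) & 0xFF
    mantissa = bits16 & 0x7F
    return sign, exponent, mantissa


def bucket_by_exponent(values):
    """Group (sign, mantissa) pairs by exponent; in a per-bucket scheme,
    each key would get its own entropy coder."""
    buckets = defaultdict(list)
    for v in values:
        sign, exp, man = split_bf16(float_to_bf16_bits(v))
        buckets[exp].append((sign, man))
    return dict(buckets)


# Hypothetical weights, chosen so several share an exponent.
weights = [0.5, -0.75, 0.015625, 1.0, -0.0078125, 0.6]
buckets = bucket_by_exponent(weights)
```

With one coder per distinct exponent, a tensor spanning ~80 exponents means ~80 separate coder invocations, which is where a fixed per-call cost adds up.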

Added

  • bigsmall.codecs.bf16_rans — new module.
  • bf16_se_rans is registered in codec_registry and is the new default for BF16 tensors (placed first, with a small tie-break tolerance of ≤0.01% of raw size).
  • KV cache format v2 uses rANS for new blobs (v1 blobs from v3.3.0 still decode).
  • 6 new tests, 102 passed / 2 skipped total (up from 96).
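One plausible reading of "placed first, with a small tie-break tolerance" is sketched below: the earliest codec in registry order wins unless another codec beats it by more than 0.01% of the raw size. This is an interpretation, not the actual codec_registry logic. The function name, the candidate list shape, and the sizes are all hypothetical.

```python
def pick_codec(candidates, raw_size, tolerance=0.0001):
    """candidates: list of (name, compressed_size) in registry order.

    Returns the first codec whose size is within tolerance * raw_size
    of the smallest candidate, so earlier entries win near-ties.
    """
    slack = tolerance * raw_size
    smallest = min(size for _, size in candidates)
    for name, size in candidates:
        if size - smallest <= slack:
            return name


# Hypothetical sizes in bytes for a 1,000,000-byte tensor (slack = 100 bytes).
near_tie = pick_codec(
    [("bf16_se_rans", 657_050), ("bf16_se_ac", 657_000)], raw_size=1_000_000)
clear_win = pick_codec(
    [("bf16_se_rans", 657_050), ("bf16_se_ac", 656_000)], raw_size=1_000_000)
```

In the first call the 50-byte gap is inside the tolerance, so the first-listed codec is chosen. In the second the gap is 1,050 bytes, so the smaller output wins.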

Compatibility

  • All existing .bs files (3.0.0-3.3.0) decode bit-identically.
  • bf16_se_rans files require bigsmall ≥ 3.4.0.
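A reader-side guard for the version requirement above could look like the following sketch. The MIN_VERSION table and both helper names are hypothetical and are not part of the bigsmall API. Only the 3.4.0 floor for bf16_se_rans comes from these notes, and codecs missing from the table are treated as always decodable purely for simplicity.

```python
# Hypothetical table: minimum bigsmall version able to decode each codec.
MIN_VERSION = {"bf16_se_rans": (3, 4, 0)}


def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of ints so comparison is numeric, not lexical."""
    return tuple(int(part) for part in text.split("."))


def can_decode(codec, installed_version):
    """True if the installed version meets the codec's minimum version."""
    required = MIN_VERSION.get(codec, (0, 0, 0))
    return parse_version(installed_version) >= required


old_reader = can_decode("bf16_se_rans", "3.3.0")
new_reader = can_decode("bf16_se_rans", "3.4.0")
```

Comparing integer tuples avoids the string-comparison trap where "3.10.0" sorts before "3.4.0".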

What did NOT pan out (documented honestly)

  • KV cache live inference: still ~29s/attention-pass at seq=2000 (target was <1s).
  • Streaming inference: still ~300s/token (weight-decode bottleneck unchanged).
  • The spec's 10-50x speedup: not achievable in constriction's Python-FFI layer.
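A back-of-the-envelope Amdahl's-law estimate shows why the FFI layer caps the gain. It assumes the ~1.18x single-stream figure applies to the entropy-coding portion of decode and that everything else (FFI calls, Python overhead) is unchanged between the two codecs. That is a simplifying model, not a profile of the real code path.

```python
def coder_time_fraction(overall_speedup, coder_speedup):
    """Solve 1 / ((1 - p) + p / coder_speedup) = overall_speedup for p,
    the fraction of baseline time spent in the part that got faster."""
    return (1 - 1 / overall_speedup) / (1 - 1 / coder_speedup)


def overall_speedup(p, coder_speedup):
    """Amdahl's law: only fraction p of the work is accelerated."""
    return 1 / ((1 - p) + p / coder_speedup)


observed = 27.0 / 25.9          # decode throughput ratio from the table
p = coder_time_fraction(observed, 1.18)
ceiling = 1 / (1 - p)           # limit as the coder becomes infinitely fast
```

Under these assumptions p comes out at roughly 0.27, and even an infinitely fast entropy coder would reach only about 1.36x overall, far short of 10-50x.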

Install: pip install bigsmall==3.4.0