v3.4.0
v3.4.0 adds bf16_se_rans, a new BF16 codec that uses constriction.stream.stack.AnsCoder (rANS, the range variant of Asymmetric Numeral Systems) in place of the range-coding RangeEncoder used by bf16_se_ac. The surrounding algorithm is unchanged, the compression ratio matches within 0.0015 pp, and the bytestream is GPU-portable.
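For readers unfamiliar with rANS, here is a toy sketch of the state update that a stack-style ANS coder is built on. It is not the bigsmall or constriction implementation: it uses Python big integers instead of a renormalized fixed-width state, and the function names and frequency table are made up for illustration. It does show why the coder is a "stack": symbols are pushed in reverse so that they pop off in forward order.

```python
from bisect import bisect_right


def build_starts(freqs):
    """Return (cumulative start of each symbol, total frequency)."""
    starts, total = [], 0
    for f in freqs:
        starts.append(total)
        total += f
    return starts, total


def rans_encode(symbols, freqs, initial_state=1):
    """Push symbols onto an integer state (toy version, no renormalization)."""
    starts, total = build_starts(freqs)
    state = initial_state
    # Encode in reverse so the decoder pops symbols in their original order.
    for s in reversed(symbols):
        f = freqs[s]  # must be > 0 for every symbol that occurs
        state = (state // f) * total + starts[s] + (state % f)
    return state


def rans_decode(state, count, freqs):
    """Pop `count` symbols off the state; returns (symbols, remaining state)."""
    starts, total = build_starts(freqs)
    out = []
    for _ in range(count):
        slot = state % total
        s = bisect_right(starts, slot) - 1  # symbol whose range contains slot
        state = freqs[s] * (state // total) + slot - starts[s]
        out.append(s)
    return out, state


# Hypothetical 3-symbol alphabet with integer frequencies 3, 1, 4.
message = [0, 2, 1, 1, 0, 2, 2, 2]
freqs = [3, 1, 4]
state = rans_encode(message, freqs)
decoded, leftover = rans_decode(state, len(message), freqs)
assert decoded == message and leftover == 1
```

A production coder keeps the state in a machine word and spills bits to a buffer as it grows; the update rule per symbol is the same.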
Honest performance on Phi-3.5-mini shard 1 (128 BF16 tensors, 4.97 GB raw)
| Codec | Encode | Decode | Ratio |
|---|---|---|---|
| bf16_se_ac | 46.0 MB/s | 25.9 MB/s | 65.71% |
| bf16_se_rans | 45.0 MB/s | 27.0 MB/s | 65.70% |
| Speedup | 0.98x | 1.04x | -0.0015 pp |
The spec predicted a 10-50x speedup. The measured result on real model data is a ~4% decode speedup and a ~2% encode slowdown. The algorithmic AC-vs-ANS difference (~1.18x on a single big stream) is washed out by constriction's per-call Python↔Rust FFI overhead, because bf16 coding runs one coder per exponent bucket (~80 coders per tensor). The codec ships anyway because (a) it is lossless and correct, (b) its bytestream is GPU-portable, and (c) it is infrastructure for a future fast codec.
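The dilution described above can be sanity-checked with a back-of-envelope model. The sketch below assumes decode time splits into a fixed per-call share (identical for both coders) and a coding share that gets the full algorithmic speedup. This split is a modeling assumption, not a profiled measurement, and the function names are illustrative.

```python
def effective_speedup(overhead_fraction, algorithmic_speedup):
    """Overall speedup when only the non-overhead share of time gets faster.

    overhead_fraction: share of baseline time spent in fixed per-call cost,
    assumed to be the same for both coders.
    """
    coding_fraction = 1.0 - overhead_fraction
    new_time = overhead_fraction + coding_fraction / algorithmic_speedup
    return 1.0 / new_time


def implied_overhead_fraction(observed_speedup, algorithmic_speedup):
    """Invert effective_speedup: which overhead share explains the observation?"""
    return (1.0 / observed_speedup - 1.0 / algorithmic_speedup) / (
        1.0 - 1.0 / algorithmic_speedup
    )


# Inputs taken from the release notes: 1.04x observed decode, ~1.18x algorithmic.
share = implied_overhead_fraction(1.04, 1.18)
print(f"implied fixed-overhead share of baseline decode time: {share:.0%}")
```

Under this simple model, roughly three quarters of the baseline decode time would have to be cost that the coder swap does not touch, which is consistent with FFI overhead dominating.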
Added
- bigsmall.codecs.bf16_rans: new module.
- bf16_se_rans registered in codec_registry, new default for BF16 tensors (placed first; small tie-break tolerance ≤0.01% of raw).
- KV cache format v2 uses rANS for new blobs (v1 blobs from v3.3.0 still decode).
- 6 new tests, 102 passed / 2 skipped total (up from 96).
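One plausible reading of "placed first; small tie-break tolerance" is sketched below: candidates are tried in registry order, and the earliest one wins unless a later codec beats it by more than 0.01% of the raw size. The function name, argument shapes, and sizes are hypothetical and do not reflect the actual codec_registry API.

```python
def pick_codec(encoded_sizes, raw_size, tolerance=1e-4):
    """Pick a codec name from (name, encoded_size) pairs in priority order.

    The first codec whose size is within `tolerance * raw_size` bytes of the
    smallest result wins, so an earlier codec is preferred on near-ties.
    """
    if not encoded_sizes:
        raise ValueError("no candidate codecs")
    best = min(size for _, size in encoded_sizes)
    slack = tolerance * raw_size  # 0.01% of raw by default
    for name, size in encoded_sizes:
        if size <= best + slack:
            return name
    raise AssertionError("unreachable: the best candidate always qualifies")


# Hypothetical sizes in bytes for a 1,000,000-byte tensor.
raw = 1_000_000
near_tie = [("bf16_se_rans", 657_050), ("bf16_se_ac", 657_000)]
clear_loss = [("bf16_se_rans", 658_000), ("bf16_se_ac", 657_000)]
print(pick_codec(near_tie, raw))    # rANS is 50 bytes worse, inside the 100-byte slack
print(pick_codec(clear_loss, raw))  # rANS is 1000 bytes worse, outside the slack
```

The tolerance keeps the default stable when the two coders land within rounding distance of each other, as they do in the benchmark table.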
Compatibility
- All existing .bs files (3.0.0-3.3.0) decode bit-identically.
- bf16_se_rans files require bigsmall ≥ 3.4.0.
What did NOT pan out (documented honestly)
- KV cache live inference: still ~29s/attention-pass at seq=2000 (target was <1s).
- Streaming inference: still ~300s/token (weight-decode bottleneck unchanged).
- The spec's 10-50x speedup: not achievable in constriction's Python-FFI layer.
Install: pip install bigsmall==3.4.0