Release v3.4.0 · wpferrell/Bigsmall

v3.4.0 adds bf16_se_rans, a new BF16 codec that uses constriction.stream.stack.AnsCoder (range Asymmetric Numeral Systems, rANS) in place of the range-coding RangeEncoder used by bf16_se_ac. The rest of the algorithm is unchanged, the compression ratio is the same (within 0.0015 pp), and the bytestream is GPU-portable.
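For readers unfamiliar with rANS, here is a toy sketch of the core state update. It is not the bigsmall or constriction implementation: it keeps the whole coder state in one arbitrary-precision Python integer rather than streaming out fixed-width words, and the function names and frequency table are made up for illustration. It does show the stack-like (last in, first out) behaviour behind the `stream.stack` module name: symbols are pushed in reverse so that they pop off in their original order.

```python
from bisect import bisect_right
from itertools import accumulate


def build_tables(freqs):
    """Return (start offset of each symbol's slot range, total frequency)."""
    starts = [0] + list(accumulate(freqs))[:-1]
    return starts, sum(freqs)


def rans_encode(symbols, freqs):
    """Push symbols onto a single big-integer rANS state (toy: no renormalization)."""
    starts, total = build_tables(freqs)
    state = 0
    # Encode in reverse so decoding pops symbols in forward order.
    for s in reversed(symbols):
        f = freqs[s]  # must be > 0 for every symbol that is actually encoded
        state = (state // f) * total + starts[s] + (state % f)
    return state


def rans_decode(state, count, freqs):
    """Pop `count` symbols off the state; exact inverse of rans_encode."""
    starts, total = build_tables(freqs)
    out = []
    for _ in range(count):
        slot = state % total
        s = bisect_right(starts, slot) - 1  # symbol whose slot range contains `slot`
        state = freqs[s] * (state // total) + slot - starts[s]
        out.append(s)
    return out


# Hypothetical 3-symbol alphabet with integer frequencies 4, 2, 1.
message = [0, 1, 0, 2, 0, 0, 1]
freqs = [4, 2, 1]
state = rans_encode(message, freqs)
assert rans_decode(state, len(message), freqs) == message
```

Production coders bound the state to a machine word and emit the overflow as a stream of words, which is what gives a fixed-layout bytestream; the big-integer form above leaves that step out.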

Honest performance on Phi-3.5-mini shard 1 (128 BF16 tensors, 4.97 GB raw)

Codec	Encode	Decode	Ratio
bf16_se_ac	46.0 MB/s	25.9 MB/s	65.71%
bf16_se_rans	45.0 MB/s	27.0 MB/s	65.70%
rANS vs. AC	0.98x	1.04x	-0.0015 pp

The spec predicted a 10-50x speedup. The measured result is a ~4% decode speedup on real model data. The algorithmic AC-vs-ANS difference (~1.18x on a single big stream) is washed out by constriction's per-call Python↔Rust FFI overhead, because bf16 coding uses one coder per exponent bucket (~80 coders per tensor). The codec ships anyway because (a) it is lossless and correct, (b) its bytestream is GPU-portable, and (c) it provides infrastructure for a future fast codec.
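A rough way to see how overhead can swallow the algorithmic gain is to solve Amdahl's law for the fraction of decode time that actually benefits. This is a back-of-the-envelope sketch, not a profile. It assumes the ~1.18x figure applies only to the entropy-coding portion and that everything else (FFI calls, per-bucket setup) costs the same with either coder.

```python
def coding_fraction(core_speedup, observed_speedup):
    """Amdahl's law solved for f, the share of baseline time that got faster.

    observed = 1 / ((1 - f) + f / core)  =>  f = (1 - 1/observed) / (1 - 1/core)
    """
    return (1 - 1 / observed_speedup) / (1 - 1 / core_speedup)


# Decode throughputs from the table above (MB/s) and the single-stream figure.
observed = 27.0 / 25.9
fraction = coding_fraction(core_speedup=1.18, observed_speedup=observed)

# Upper bound on end-to-end speedup if the coder itself became infinitely fast
# while the assumed fixed overhead stayed the same.
ceiling = 1 / (1 - fraction)

print(f"share of decode time spent coding: {fraction:.2f}")
print(f"best-case end-to-end speedup:      {ceiling:.2f}x")
```

Under these assumptions, only about a quarter of decode time is spent in the coder itself, so even a much faster coder behind the same call pattern would stay far below the predicted 10-50x.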

Added

bigsmall.codecs.bf16_rans — new module.
bf16_se_rans is registered in codec_registry and is the new default for BF16 tensors (placed first, with a small tie-break tolerance of ≤0.01% of raw size).
KV cache format v2 uses rANS for new blobs (v1 blobs from v3.3.0 still decode).
6 new tests, 102 passed / 2 skipped total (up from 96).
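One plausible reading of "placed first" with a tie-break tolerance is sketched below: the first-listed codec wins unless another codec's output is smaller by more than the tolerance, measured as a fraction of the raw size. The function name `pick_codec`, its signature, and the byte counts are hypothetical. This is not the actual codec_registry logic.

```python
def pick_codec(sizes, raw_size, tolerance=1e-4):
    """Choose a codec from `sizes` (name -> compressed bytes, in priority order).

    The first entry is preferred. It loses only if the smallest output beats
    it by more than `tolerance` * raw_size bytes (1e-4 corresponds to 0.01%).
    """
    names = list(sizes)               # dicts preserve insertion order
    preferred = names[0]
    best = min(names, key=sizes.get)  # on exact ties, min() keeps the earliest
    if sizes[preferred] - sizes[best] > tolerance * raw_size:
        return best
    return preferred


# Hypothetical sizes for a 1,000,000-byte tensor: tolerance is 100 bytes.
raw = 1_000_000
print(pick_codec({"bf16_se_rans": 657_050, "bf16_se_ac": 657_000}, raw))
print(pick_codec({"bf16_se_rans": 657_500, "bf16_se_ac": 657_000}, raw))
```

With a rule like this, the 0.0015 pp ratio gap reported above falls inside the 0.01% tolerance, so bf16_se_rans would still be selected on that shard.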

Compatibility

All existing .bs files (3.0.0-3.3.0) decode bit-identically.
bf16_se_rans files require bigsmall ≥ 3.4.0.

What did NOT pan out (documented honestly)

KV cache live inference: still ~29s/attention-pass at seq=2000 (target was <1s).
Streaming inference: still ~300s/token (weight-decode bottleneck unchanged).
The spec's 10-50x speedup: not achievable in constriction's Python-FFI layer.

Install: pip install bigsmall==3.4.0
