Release v3.5.0 · wpferrell/Bigsmall

v3.5.0 ships bf16_se_tans, a Numba-JIT-compiled rANS codec that decodes about 2.3x faster than the constriction baseline.

The spec proposed Cython, but no C compiler (MSVC, MinGW, or gcc) is on PATH on this Windows machine, so I used Numba, which is already a dependency. It serves the same goal of eliminating Python↔Rust FFI overhead and needs no build step.

Measurements on Phi-3.5-mini shard 1 (128 BF16 tensors, 4.97 GB)

Codec	Encode	Decode	Ratio (compressed / original)	Decode speedup vs bf16_se_ac
bf16_se_ac (3.3.0)	48.0 MB/s	26.5 MB/s	65.71%	1.00x
bf16_se_rans (3.4.0)	45.0 MB/s	27.0 MB/s	65.70%	1.04x
bf16_se_tans (3.5.0)	51.9 MB/s	61.0 MB/s	65.80%	2.30x

Size cost: +0.095 pp in compression ratio, within the spec's 0.1 pp gate. Lossless; round-trips verified by MD5.
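An MD5-verified round trip is a short check to write. A minimal sketch, assuming only that a codec exposes an encode and a decode callable on bytes; zlib stands in for a bigsmall codec, and the helper names `md5_hex` and `verify_lossless` are hypothetical. The returned ratio follows the assumed convention of the table's Ratio column (compressed size over original, lower is better).

```python
import hashlib
import zlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest, used only as a fingerprint for equality checks."""
    return hashlib.md5(data).hexdigest()


def verify_lossless(encode, decode, payload: bytes) -> float:
    """Round-trip `payload` through a codec and confirm the MD5 digests match.

    Returns the compressed size as a percentage of the original
    (lower is better).
    """
    blob = encode(payload)
    restored = decode(blob)
    if md5_hex(restored) != md5_hex(payload):
        raise ValueError("round-trip mismatch: codec is not lossless")
    return 100.0 * len(blob) / len(payload)


# zlib is a stand-in; pass the real codec's encode/decode callables instead.
ratio = verify_lossless(zlib.compress, zlib.decompress, bytes(range(256)) * 1000)
print(f"lossless, ratio {ratio:.2f}%")
```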

Added

bigsmall.codecs.numba_rans — @njit(cache=True) rANS primitives.
bigsmall.codecs.bf16_tans — BF16 codec built on numba_rans.
New codec name bf16_se_tans registered.
compress(..., prefer_speed=True) opt-in flag.
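For readers new to rANS, here is a minimal byte-oriented rANS encoder and decoder whose hot loops are compiled with Numba's `@njit`. It follows the widely used ryg_rans byte-wise construction (32-bit state, 12-bit quantized frequencies) and codes a plain byte stream, not BF16 tensors; it is not the bigsmall.codecs.numba_rans code, and `build_tables`, `rans_encode`, and `rans_decode` are hypothetical names. The release uses `@njit(cache=True)` to persist compiled code between runs; the sketch omits `cache` because Numba can only cache functions defined in a source file. It falls back to plain Python if Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # compile the hot loops to machine code when available
except ImportError:  # plain-Python fallback: same results, much slower
    def njit(fn):
        return fn

SCALE_BITS = 12  # quantized symbol frequencies sum to M = 2**12
M = 1 << SCALE_BITS
RANS_L = 1 << 23  # state stays in [RANS_L, 256 * RANS_L) between symbols


def build_tables(data):
    """Quantize byte counts so frequencies sum to M and every present byte gets >= 1."""
    counts = np.bincount(data, minlength=256).astype(np.int64)
    present = counts > 0
    n_present = int(present.sum())
    freq = np.zeros(256, dtype=np.int64)
    freq[present] = 1 + counts[present] * (M - n_present) // counts.sum()
    freq[np.argmax(freq)] += M - freq.sum()  # rounding remainder, never negative
    cum = np.concatenate((np.zeros(1, dtype=np.int64), np.cumsum(freq)[:-1]))
    slot_to_sym = np.repeat(np.arange(256, dtype=np.int64), freq)  # length M
    return freq, cum, slot_to_sym


@njit
def rans_encode(data, freq, cum):
    """Encode symbols in reverse so the decoder can emit them in forward order."""
    out = np.empty(2 * data.shape[0] + 8, dtype=np.uint8)  # <= 2 bytes/symbol + flush
    pos = 0
    x = RANS_L
    for i in range(data.shape[0] - 1, -1, -1):
        f = freq[data[i]]
        x_max = ((RANS_L >> SCALE_BITS) << 8) * f
        while x >= x_max:  # renormalize: shift out low bytes until x fits
            out[pos] = x & 0xFF
            pos += 1
            x >>= 8
        x = ((x // f) << SCALE_BITS) + (x % f) + cum[data[i]]
    for _ in range(4):  # flush the final 32-bit state, low byte first
        out[pos] = x & 0xFF
        pos += 1
        x >>= 8
    return out[:pos].copy()


@njit
def rans_decode(stream, n, freq, cum, slot_to_sym):
    """Decode n symbols, consuming the encoder's bytes back to front."""
    out = np.empty(n, dtype=np.uint8)
    pos = stream.shape[0]
    x = np.int64(0)
    for _ in range(4):  # reload the flushed state, high byte first
        pos -= 1
        x = (x << 8) | stream[pos]
    for i in range(n):
        slot = x & (M - 1)
        s = slot_to_sym[slot]
        x = freq[s] * (x >> SCALE_BITS) + slot - cum[s]
        while x < RANS_L:  # renormalize: pull bytes back in
            pos -= 1
            x = (x << 8) | stream[pos]
        out[i] = s
    return out


rng = np.random.default_rng(0)
data = np.minimum(rng.geometric(0.2, size=20_000), 255).astype(np.uint8)
freq, cum, slot_to_sym = build_tables(data)
stream = rans_encode(data, freq, cum)
restored = rans_decode(stream, data.size, freq, cum, slot_to_sym)
print(stream.size, "bytes for", data.size, "symbols; lossless:", np.array_equal(restored, data))
```

Keeping the state in signed 64-bit integers throughout is deliberate: Numba promotes mixed `uint64`/`int64` arithmetic to float64, which would silently corrupt the state.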

Tests

6 new tests in tests/test_tans.py. 108 passed, 2 skipped (up from 102 passed).

Compatibility

Default compress() behavior unchanged.
All existing .bs files (3.0.0-3.4.0) decode bit-identically.
bf16_se_tans files require bigsmall ≥ 3.5.0.
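The last rule follows from how a reader dispatches on a file's codec: a release that predates bf16_se_tans has no decoder for it. A minimal sketch of such a version gate, assuming the codec name is recorded per file; `MIN_DECODER_VERSION`, `parse_version`, and `check_decodable` are hypothetical and not bigsmall's actual mechanism. Versions are compared as integer tuples because plain string comparison would rank "3.10.0" below "3.5.0".

```python
def parse_version(version: str) -> tuple:
    """'3.10.2' -> (3, 10, 2); handles plain X.Y.Z only, no pre-release tags."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical table: codec name -> oldest bigsmall release able to decode it.
# Only the bf16_se_tans entry is stated in these release notes.
MIN_DECODER_VERSION = {"bf16_se_tans": "3.5.0"}


def check_decodable(codec: str, installed: str) -> None:
    """Raise if the installed bigsmall predates the codec recorded in a file."""
    required = MIN_DECODER_VERSION.get(codec)
    if required is not None and parse_version(installed) < parse_version(required):
        raise RuntimeError(f"{codec} requires bigsmall >= {required}, found {installed}")


check_decodable("bf16_se_tans", "3.10.0")  # passes: (3, 10, 0) > (3, 5, 0)
```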

What did NOT pan out (honest)

Spec target: 5-10x decode speedup. Actual: 2.3x. The per-bucket orchestration (~80 buckets per tensor) still runs in Python rather than inside the Numba JIT boundary; folding all buckets into a single @njit function is multi-day work.
Streaming inference > 1 token/sec: ~130 s/token now (was 300 s in 3.4.0). The 2.3x speedup is real, but reaching 1 token/sec still needs roughly another 130x, so "live" is about two orders of magnitude away.
KV cache < 100 ms/pass: ~13 s per pass at seq=2000 (down from 30 s). Real progress, but still about 130x above the target, so not "live" territory yet.
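The unfinished step amounts to moving the bucket loop itself inside the compiled region. In this sketch the per-bucket kernel is a trivial stand-in (delta decoding by running sum), not bigsmall's rANS decoder, and packing a tensor's buckets into one flat array with an `offsets` boundary array is an assumed layout; all function names are hypothetical. Both versions compute the same result: the first enters the JIT once per bucket from a Python loop, the second enters it once per tensor.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # plain-Python fallback keeps the sketch runnable
    def njit(fn):
        return fn


@njit
def decode_bucket(src, dst):
    """Stand-in per-bucket kernel: delta decoding via a running sum."""
    acc = 0
    for i in range(src.shape[0]):
        acc += src[i]
        dst[i] = acc


def decode_tensor_per_bucket(src, offsets):
    """Current shape: Python loops over buckets, one JIT call per bucket."""
    dst = np.empty_like(src)
    for b in range(offsets.shape[0] - 1):
        lo, hi = offsets[b], offsets[b + 1]
        decode_bucket(src[lo:hi], dst[lo:hi])  # writes through the view into dst
    return dst


@njit
def decode_tensor_fused(src, offsets):
    """Fused shape: the bucket loop runs inside the JIT, one call per tensor."""
    dst = np.empty_like(src)
    for b in range(offsets.shape[0] - 1):
        lo, hi = offsets[b], offsets[b + 1]
        decode_bucket(src[lo:hi], dst[lo:hi])
    return dst


rng = np.random.default_rng(1)
src = rng.integers(-5, 5, size=8_000).astype(np.int64)
offsets = np.linspace(0, src.size, 81).astype(np.int64)  # 80 equal buckets
print(np.array_equal(decode_tensor_per_bucket(src, offsets), decode_tensor_fused(src, offsets)))
```

With ~80 buckets per tensor, this is the difference between about 80 Python-to-JIT dispatches per tensor and one.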

Install: pip install bigsmall==3.5.0
