
v3.5.0


@wpferrell wpferrell released this 18 May 21:23
· 53 commits to main since this release

v3.5.0 ships bf16_se_tans, a Numba-JIT-compiled rANS codec that decodes 2.3x faster than the constriction baseline (61.0 MB/s vs 26.5 MB/s for bf16_se_ac) at a size cost of under 0.1pp.

The spec proposed Cython, but this Windows box has no MSVC/MinGW/gcc in PATH, so I used Numba (already in deps). It serves the same goal of eliminating Python↔Rust FFI overhead and needs no build step.

Measurements on Phi-3.5-mini shard 1 (128 BF16 tensors, 4.97 GB)

| Codec | Encode | Decode | Ratio | Decode vs AC |
|---|---|---|---|---|
| bf16_se_ac (3.3.0) | 48.0 MB/s | 26.5 MB/s | 65.71% | 1.00x |
| bf16_se_rans (3.4.0) | 45.0 MB/s | 27.0 MB/s | 65.70% | 1.04x |
| bf16_se_tans (3.5.0) | 51.9 MB/s | 61.0 MB/s | 65.80% | 2.30x |

Size cost: +0.095pp (within the spec's 0.1pp gate). Lossless, md5-verified.
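The headline figures can be re-derived from the table. The sketch below uses the rounded table values, so the size cost comes out as 0.09pp rather than the unrounded 0.095pp quoted above; the helper names are illustrative and not part of bigsmall.

```python
# Rounded figures copied from the measurement table (baseline and new codec only).
decode_mbps = {"bf16_se_ac": 26.5, "bf16_se_tans": 61.0}
ratio_pct = {"bf16_se_ac": 65.71, "bf16_se_tans": 65.80}


def decode_speedup(codec, baseline="bf16_se_ac"):
    """Decode throughput of `codec` relative to `baseline`."""
    return decode_mbps[codec] / decode_mbps[baseline]


def size_cost_pp(codec, baseline="bf16_se_ac"):
    """Difference in compressed-size ratio, in percentage points."""
    return ratio_pct[codec] - ratio_pct[baseline]


print(f"{decode_speedup('bf16_se_tans'):.2f}x")   # 2.30x
print(f"{size_cost_pp('bf16_se_tans'):+.2f}pp")   # +0.09pp
```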

Added

  • bigsmall.codecs.numba_rans: @njit(cache=True) rANS primitives.
  • bigsmall.codecs.bf16_tans: BF16 codec built on numba_rans.
  • New codec name bf16_se_tans registered.
  • compress(..., prefer_speed=True) opt-in flag.
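For readers unfamiliar with the rANS primitives listed above, here is a minimal pure-Python sketch of the core encode and decode steps. It is not the code in bigsmall.codecs.numba_rans: it assumes an unbounded integer state with no renormalization or byte output, which a real fixed-width implementation would need, and every function name here is hypothetical.

```python
from bisect import bisect_right


def build_tables(freqs):
    """Return (cumulative start of each symbol, total frequency)."""
    starts, total = [], 0
    for f in freqs:
        starts.append(total)
        total += f
    return starts, total


def rans_encode(symbols, freqs, x=1):
    """Fold `symbols` into a single integer state (every used symbol needs freq > 0)."""
    starts, total = build_tables(freqs)
    # rANS is last-in-first-out, so encode in reverse to decode forwards.
    for s in reversed(symbols):
        f = freqs[s]
        x = (x // f) * total + starts[s] + (x % f)
    return x


def rans_decode(x, count, freqs):
    """Pop `count` symbols from state `x`; returns (symbols, remaining state)."""
    starts, total = build_tables(freqs)
    out = []
    for _ in range(count):
        slot = x % total
        s = bisect_right(starts, slot) - 1  # symbol whose range contains slot
        x = freqs[s] * (x // total) + slot - starts[s]
        out.append(s)
    return out, x


message = [0, 1, 0, 2, 0, 0, 1]
freqs = [4, 2, 2]  # symbol 0 is most likely, so it costs the fewest bits
state = rans_encode(message, freqs)
decoded, final_state = rans_decode(state, len(message), freqs)
assert decoded == message and final_state == 1
```

Decoding exactly inverts each encode step, so the state returns to its initial value of 1 once all symbols are popped.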

Tests

  • 6 new tests in tests/test_tans.py. 108 passed / 2 skipped (up from 102).

Compatibility

  • Default compress() behavior unchanged.
  • All existing .bs files (3.0.0-3.4.0) decode bit-identically.
  • bf16_se_tans files require bigsmall ≥ 3.5.0.

What did NOT pan out (honest)

  • Spec target: 5-10x decode. Actual: 2.3x. Per-bucket Python orchestration (~80 buckets/tensor) was not moved inside the Numba JIT boundary; folding all buckets into a single @njit function is multi-day work.
  • Target: streaming inference at > 1 token/sec. Actual: ~130s/token (was 300s in 3.4.0). The 2.3x speedup is real but not the order of magnitude needed for "live".
  • Target: KV cache < 100ms/pass. Actual: ~13s at seq=2000 (down from 30s). Real progress, but not "live" territory yet.
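The per-bucket orchestration issue from the first item can be pictured with a toy example. This is only a sketch of the general pattern, not the bigsmall code: decode_bucket is a hypothetical stand-in (a byte sum rather than rANS decoding), and the flat buffer plus offsets layout is an assumption. If Numba is not installed, the fallback decorator runs the same code as plain Python.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    def njit(fn):
        return fn


@njit
def decode_bucket(buf, start, stop):
    # Hypothetical stand-in for per-bucket decoding: sum the bucket's bytes.
    acc = 0
    for i in range(start, stop):
        acc += int(buf[i])
    return acc


def decode_per_bucket(buf, offsets):
    # Python-side loop: one Python-to-JIT call per bucket.
    out = np.empty(len(offsets) - 1, dtype=np.int64)
    for b in range(len(offsets) - 1):
        out[b] = decode_bucket(buf, offsets[b], offsets[b + 1])
    return out


@njit
def decode_all_buckets(buf, offsets):
    # Folded version: the bucket loop itself runs inside the JIT boundary,
    # so Python makes a single call per tensor.
    out = np.empty(len(offsets) - 1, dtype=np.int64)
    for b in range(len(offsets) - 1):
        out[b] = decode_bucket(buf, offsets[b], offsets[b + 1])
    return out


buf = np.arange(10, dtype=np.uint8)
offsets = np.array([0, 3, 7, 10], dtype=np.int64)  # three buckets
assert np.array_equal(decode_per_bucket(buf, offsets), decode_all_buckets(buf, offsets))
```

Both versions return the same result. The difference is how many times control crosses from Python into compiled code, which is the overhead the item above says remains to be removed.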

Install: pip install bigsmall==3.5.0