v3.5.0
v3.5.0 ships `bf16_se_tans`, a Numba-JIT-compiled rANS codec that decodes about 2.3x faster than the constriction baseline (61.0 MB/s vs 26.5 MB/s; see the measurements below).
The spec proposed Cython; on this Windows box no MSVC/MinGW/gcc is in PATH, so I used Numba (already in deps) — same goal of eliminating Python↔Rust FFI overhead, no build step required.
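For readers unfamiliar with rANS, here is a minimal pure-Python sketch of a byte-renormalized rANS round trip. It is not the `bigsmall.codecs.numba_rans` code: the function names, the 12-bit frequency scale, and the 2^23 state bound are illustrative assumptions that follow the commonly used byte-wise rANS layout.

```python
from itertools import accumulate

SCALE_BITS = 12                 # assumed: symbol frequencies sum to 2**12
M = 1 << SCALE_BITS
RANS_L = 1 << 23                # assumed lower bound of the normalized state


def build_tables(freqs):
    """Return (cumulative starts, slot -> symbol lookup) for the model."""
    assert sum(freqs) == M and all(f > 0 for f in freqs)
    starts = [0] + list(accumulate(freqs))[:-1]
    slot2sym = [s for s, f in enumerate(freqs) for _ in range(f)]
    return starts, slot2sym


def rans_encode(symbols, freqs, starts):
    out = bytearray()
    x = RANS_L
    for s in reversed(symbols):          # rANS encodes in reverse order
        f, c = freqs[s], starts[s]
        x_max = ((RANS_L >> SCALE_BITS) << 8) * f
        while x >= x_max:                # renormalize: emit low bytes
            out.append(x & 0xFF)
            x >>= 8
        x = ((x // f) << SCALE_BITS) + (x % f) + c
    for _ in range(4):                   # flush the final state, low byte first
        out.append(x & 0xFF)
        x >>= 8
    out.reverse()                        # so the decoder reads front to back
    return bytes(out)


def rans_decode(data, n, freqs, starts, slot2sym):
    x = int.from_bytes(data[:4], "big")  # state was flushed low byte first, then reversed
    pos = 4
    out = []
    for _ in range(n):
        slot = x & (M - 1)
        s = slot2sym[slot]
        x = freqs[s] * (x >> SCALE_BITS) + slot - starts[s]
        while x < RANS_L:                # renormalize: pull bytes back in
            x = (x << 8) | data[pos]
            pos += 1
        out.append(s)
    return out


freqs = [2048, 1024, 1024]               # toy 3-symbol model
starts, slot2sym = build_tables(freqs)
symbols = [0, 0, 1, 2, 0, 1, 0, 0, 2, 1] * 50
blob = rans_encode(symbols, freqs, starts)
decoded = rans_decode(blob, len(symbols), freqs, starts, slot2sym)
```

The decoder needs the symbol count and the same frequency table as the encoder; how the real codec stores those is not covered here.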
Measurements on Phi-3.5-mini shard 1 (128 BF16 tensors, 4.97 GB)
| Codec | Encode | Decode | Ratio | Decode vs AC |
|---|---|---|---|---|
| bf16_se_ac (3.3.0) | 48.0 MB/s | 26.5 MB/s | 65.71% | 1.00x |
| bf16_se_rans (3.4.0) | 45.0 MB/s | 27.0 MB/s | 65.70% | 1.04x |
| bf16_se_tans (3.5.0) | 51.9 MB/s | 61.0 MB/s | 65.80% | 2.30x |
Size cost: +0.095pp (within spec's 0.1pp gate). Lossless md5-verified.
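A check of this kind can be sketched as a hash comparison plus a ratio gate. In the sketch below, `zlib` stands in for the real codec, and `roundtrip_report` and `within_gate` are hypothetical helpers, not part of bigsmall. It also assumes "Ratio" means compressed size as a percentage of original size.

```python
import hashlib
import zlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def roundtrip_report(raw: bytes, encode, decode) -> dict:
    """Encode, decode, and compare digests of the original and restored bytes."""
    blob = encode(raw)
    restored = decode(blob)
    return {
        "lossless": md5_hex(restored) == md5_hex(raw),
        "ratio_pct": 100.0 * len(blob) / len(raw),   # assumed ratio definition
    }


def within_gate(candidate_pct: float, baseline_pct: float, gate_pp: float = 0.1) -> bool:
    """True if the candidate's ratio is at most gate_pp percentage points worse."""
    return (candidate_pct - baseline_pct) <= gate_pp


raw = bytes(range(256)) * 64
report = roundtrip_report(raw, zlib.compress, zlib.decompress)  # zlib is a stand-in codec
```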
Added
- `bigsmall.codecs.numba_rans`: `@njit(cache=True)` rANS primitives.
- `bigsmall.codecs.bf16_tans`: BF16 codec built on `numba_rans`.
- New codec name `bf16_se_tans` registered.
- `compress(..., prefer_speed=True)` opt-in flag.
Tests
- 6 new tests in `tests/test_tans.py`. 108 passed / 2 skipped (up from 102).
Compatibility
- Default `compress()` behavior unchanged.
- All existing .bs files (3.0.0-3.4.0) decode bit-identically.
- `bf16_se_tans` files require bigsmall ≥ 3.5.0.
What did NOT pan out (honest)
- Spec target: 5-10x decode. Actual: 2.3x. Per-bucket Python orchestration (~80 buckets/tensor) wasn't moved inside the Numba JIT boundary (multi-day work to fold all buckets into a single `@njit` function).
- Streaming inference > 1 token/sec: ~130s/token now (was 300s in 3.4.0). 2.3x speedup is real but not the order-of-magnitude needed for "live".
- KV cache < 100ms/pass: ~13s at seq=2000 (down from 30s). Real progress, not "live" territory yet.
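The JIT-boundary point in the first item can be illustrated with a toy sketch. It does not reproduce the codec's bucket logic: `process_bucket` is a hypothetical stand-in that just sums integers, and the `offsets` layout is an assumption. The two drivers differ only in where the bucket loop runs: in Python, crossing into compiled code once per bucket, or inside a single `@njit` function, crossing once per tensor. If Numba is not installed, the sketch falls back to plain Python so it still runs.

```python
import numpy as np

try:
    from numba import njit
except ImportError:              # fallback: no JIT, same results
    def njit(func):
        return func


@njit
def process_bucket(values, start, stop):
    # Hypothetical stand-in for per-bucket decode work.
    total = 0
    for i in range(start, stop):
        total += values[i]
    return total


def decode_per_bucket(values, offsets):
    # Python drives the loop: one Python -> JIT call per bucket.
    out = np.empty(len(offsets) - 1, dtype=np.int64)
    for b in range(len(offsets) - 1):
        out[b] = process_bucket(values, offsets[b], offsets[b + 1])
    return out


@njit
def decode_all_buckets(values, offsets):
    # Folded variant: the bucket loop itself runs inside compiled code.
    out = np.empty(len(offsets) - 1, dtype=np.int64)
    for b in range(len(offsets) - 1):
        out[b] = process_bucket(values, offsets[b], offsets[b + 1])
    return out


values = np.arange(800, dtype=np.int64)
offsets = np.arange(0, 801, 10, dtype=np.int64)   # 80 buckets of 10 values each
per_bucket = decode_per_bucket(values, offsets)
folded = decode_all_buckets(values, offsets)
```

Both drivers return the same array. Any speed difference depends on how much Python-side work each bucket carries in the real codec, which this sketch does not measure.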
Install: pip install bigsmall==3.5.0