v3.6.0
v3.6.0 ships `bf16_se_single_kernel`: the entire BF16 tensor encode and decode paths are each collapsed into a single Numba `@njit` function. This eliminates per-bucket Python boundary crossings and replaces numpy `argsort` with an O(n) counting sort.
Largest single-session speedup of the v3.x speed arc.
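To illustrate why a counting sort can stand in for `argsort` here, the following is a minimal sketch, not the shipped kernel. It assumes the sort keys are small non-negative integers (for example bucket ids); the names `counting_sort_permutation`, `keys`, and `num_buckets` are hypothetical. It uses plain lists and loops so it runs without dependencies; the same loop structure is the kind of code Numba's `@njit` is designed to compile.

```python
def counting_sort_permutation(keys, num_buckets):
    """Return (order, offsets) for small integer keys in O(n + num_buckets).

    order   : indices of `keys` grouped by key, stable within each key
              (same result as a stable argsort).
    offsets : offsets[b] .. offsets[b + 1] is the slice of `order`
              that belongs to bucket b.
    """
    # Pass 1: histogram of keys.
    counts = [0] * num_buckets
    for k in keys:
        counts[k] += 1

    # Exclusive prefix sum gives the start of each bucket.
    offsets = [0] * (num_buckets + 1)
    for b in range(num_buckets):
        offsets[b + 1] = offsets[b] + counts[b]

    # Pass 2: scatter each index into the next free slot of its bucket.
    cursor = offsets[:num_buckets]  # slicing copies, so offsets stays intact
    order = [0] * len(keys)
    for i, k in enumerate(keys):
        order[cursor[k]] = i
        cursor[k] += 1
    return order, offsets


keys = [3, 1, 3, 0, 1, 3]
order, offsets = counting_sort_permutation(keys, num_buckets=4)
print(order)    # [3, 1, 4, 0, 2, 5]
print(offsets)  # [0, 1, 3, 3, 6]
```

Two linear passes replace the O(n log n) comparison sort, and the `offsets` array gives per-bucket slices directly, which is what a per-bucket encoder needs.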
Phi-3.5-mini shard 1 (128 BF16 tensors, 4.97 GB)
| Codec | Encode | Decode | Decode vs AC |
|---|---|---|---|
| bf16_se_ac (3.3.0) | 43.4 MB/s | 25.7 MB/s | 1.00x |
| bf16_se_rans (3.4.0) | 45.0 MB/s | 27.0 MB/s | 1.04x |
| bf16_se_tans (3.5.0) | 48.4 MB/s | 58.4 MB/s | 2.27x |
| bf16_se_single_kernel (3.6.0) | 98.6 MB/s | 117.5 MB/s | 4.57x |
Added
- `bigsmall.codecs.single_kernel`: one `@njit` function each for encode and decode, covering the full pipeline (sign/exp/mantissa split, SE freq + rANS encode, O(n) counting sort, per-bucket mantissa freq + rANS encode, blob assembly). Zero Python orchestration between phases.
- New codec name `bf16_se_single_kernel` registered.
- `compress(prefer_speed=True)` now picks `bf16_se_single_kernel` over `bf16_se_tans` when within +0.6% size tolerance.
- 6 new tests. 114 passed / 2 skipped (up from 108).
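The first pipeline stage (sign/exp/mantissa split) can be sketched as below. The BF16 layout itself is standard: 1 sign bit, 8 exponent bits, 7 mantissa bits. Everything else is an assumption for illustration: the helper names are hypothetical, and treating "SE" as the sign and exponent packed into one 9-bit symbol is a guess at the codec's convention, not something the release notes confirm.

```python
import struct


def float_to_bf16_bits(x):
    """BF16 bit pattern obtained by truncating a float32 to its top 16 bits."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def split_bf16(bits16):
    """Split a 16-bit BF16 pattern into (sign, exponent, mantissa)."""
    sign = bits16 >> 15              # 1 bit
    exponent = (bits16 >> 7) & 0xFF  # 8 bits
    mantissa = bits16 & 0x7F         # 7 bits
    return sign, exponent, mantissa


def se_symbol(bits16):
    """Assumed 'SE' symbol: sign and exponent as one 9-bit value (0..511)."""
    return bits16 >> 7


def join_bf16(se, mantissa):
    """Inverse of the split: rebuild the 16-bit pattern."""
    return (se << 7) | mantissa


print(split_bf16(float_to_bf16_bits(1.0)))   # (0, 127, 0)
print(split_bf16(float_to_bf16_bits(-2.5)))  # (1, 128, 32)
```

Under this assumption the SE symbol would serve as the bucket id, and the 7-bit mantissas would be entropy-coded per bucket. The split is lossless, so `join_bf16(se_symbol(b), b & 0x7F)` returns `b` for every 16-bit pattern.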
Compatibility
- All existing .bs files (3.0.0-3.5.0) decode bit-identically.
- `bf16_se_single_kernel` files require bigsmall >= 3.6.0.
- Default `compress()` behavior unchanged.
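The opt-in selection rule for `prefer_speed=True` might look roughly like the sketch below. This is not bigsmall's actual code: `pick_codec` and its parameters are hypothetical, and it assumes the +0.6% tolerance is measured relative to the `bf16_se_tans` output size, which the notes do not spell out.

```python
def pick_codec(sizes, fast="bf16_se_single_kernel",
               baseline="bf16_se_tans", tolerance=0.006):
    """Choose the fast codec if it costs at most `tolerance` extra size.

    sizes: dict mapping codec name to compressed size in bytes.
    Assumption: the tolerance is relative to the baseline codec's size.
    """
    if sizes[fast] <= sizes[baseline] * (1.0 + tolerance):
        return fast
    return baseline


# Fast codec is 0.5% larger: within tolerance, so it is chosen.
print(pick_codec({"bf16_se_single_kernel": 1005, "bf16_se_tans": 1000}))
# Fast codec is 0.7% larger: outside tolerance, so the baseline is kept.
print(pick_codec({"bf16_se_single_kernel": 1007, "bf16_se_tans": 1000}))
```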
What did NOT pan out (honest)
- Spec gate of <0.2pp size cost: actual +0.45pp on Phi shard 1 (per-bucket rANS framing + slightly less-efficient Numba quantisation). Out-of-spec by ~2.25x but trade-off documented and opt-in.
- Streaming inference >1 token/sec: still ~130 s/token. Weight-decode speedup is real but streaming is bottlenecked by HF model setup + per-layer transfers, not entropy decoding.
- KV cache <100ms/pass: ~14 s at seq=2000 (down from 30s baseline). Real progress, not "live".
Install: pip install bigsmall==3.6.0