Release v3.6.0 · wpferrell/Bigsmall

v3.6.0 ships bf16_se_single_kernel: the entire BF16 tensor encode and decode path is collapsed into one Numba @njit function per direction. This eliminates per-bucket Python boundary crossings and replaces numpy argsort with an O(n) counting sort.
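To illustrate why a counting sort can stand in for argsort here, the following is a minimal pure-Python sketch, not the code in bigsmall.codecs.single_kernel. It assumes the sort keys are small non-negative integer bucket ids; the function name counting_argsort and the num_buckets parameter are hypothetical. It produces the same permutation as a stable argsort in two linear passes.

```python
def counting_argsort(keys, num_buckets):
    """Return a stable permutation that sorts `keys`.

    Assumes every key is an int in [0, num_buckets).
    Runs in O(n + num_buckets) time, with no comparisons.
    """
    # Pass 1: histogram of how many elements fall in each bucket.
    counts = [0] * num_buckets
    for k in keys:
        counts[k] += 1

    # Exclusive prefix sum: first output slot of each bucket.
    offsets = [0] * num_buckets
    total = 0
    for b in range(num_buckets):
        offsets[b] = total
        total += counts[b]

    # Pass 2: scatter each index into its bucket, preserving input order.
    order = [0] * len(keys)
    for i, k in enumerate(keys):
        order[offsets[k]] = i
        offsets[k] += 1
    return order


keys = [3, 1, 3, 0, 1, 2]
order = counting_argsort(keys, 4)
# Same result as a stable comparison-based argsort.
assert order == sorted(range(len(keys)), key=keys.__getitem__)
```

Because both passes are simple loops over integer arrays, this shape of algorithm is the kind that compiles well inside a single @njit function, whereas argsort is a comparison sort with O(n log n) cost.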

Largest single-session speedup of the v3.x speed arc: roughly 2x over bf16_se_tans (3.5.0) on both encode and decode.

Phi-3.5-mini shard 1 (128 BF16 tensors, 4.97 GB)

Codec	Encode	Decode	Decode vs AC
bf16_se_ac (3.3.0)	43.4 MB/s	25.7 MB/s	1.00x
bf16_se_rans (3.4.0)	45.0 MB/s	27.0 MB/s	1.04x
bf16_se_tans (3.5.0)	48.4 MB/s	58.4 MB/s	2.27x
bf16_se_single_kernel (3.6.0)	98.6 MB/s	117.5 MB/s	4.57x

Added

bigsmall.codecs.single_kernel — one @njit function each for encode/decode covering the full pipeline (sign/exp/mantissa split, SE freq + rANS encode, O(n) counting-sort, per-bucket mantissa freq + rANS encode, blob assembly). Zero Python orchestration between phases.
New codec name bf16_se_single_kernel registered.
compress(prefer_speed=True) now picks bf16_se_single_kernel over bf16_se_tans when within +0.6% size tolerance.
6 new tests: 114 passed / 2 skipped (up from 108 passed).
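The sign/exp/mantissa split named in the pipeline above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: it assumes the standard BF16 layout (1 sign bit, 8 exponent bits, 7 mantissa bits) and assumes that "SE" means the sign and exponent are grouped into one 9-bit symbol. The helper names are hypothetical, and the float32 conversion truncates rather than rounds to keep the example short.

```python
import struct


def float_to_bf16_bits(x):
    """Truncate a float32 to its upper 16 bits, giving a BF16 bit pattern."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def split_bf16(word):
    """Split a 16-bit BF16 word into (sign+exponent, mantissa)."""
    se = word >> 7           # top 9 bits: 1 sign bit + 8 exponent bits
    mantissa = word & 0x7F   # low 7 bits
    return se, mantissa


def join_bf16(se, mantissa):
    """Inverse of split_bf16: rebuild the 16-bit word."""
    return (se << 7) | mantissa


word = float_to_bf16_bits(-2.5)   # 0xC020
se, mantissa = split_bf16(word)   # sign=1, exponent=128, mantissa=32
assert join_bf16(se, mantissa) == word
```

Under this assumption the SE stream has at most 512 distinct symbols, which makes it usable directly as a bucket id for a counting sort, and the 7-bit mantissas can then be entropy-coded per bucket.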

Compatibility

All existing .bs files (3.0.0-3.5.0) decode bit-identically.
bf16_se_single_kernel files require bigsmall >= 3.6.0.
Default compress() behavior unchanged.
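The prefer_speed selection rule listed under Added can be sketched as a simple size check. This is a hypothetical illustration, not the library's implementation: the function name choose_fast_codec is invented here, and it assumes the +0.6% tolerance is measured relative to the bf16_se_tans output size.

```python
def choose_fast_codec(tans_bytes, single_kernel_bytes, tolerance=0.006):
    """Pick the codec name for the speed-preferring path.

    Assumption: the single-kernel codec wins whenever its output is at
    most `tolerance` (0.6%) larger than the tANS output.
    """
    limit = tans_bytes * (1.0 + tolerance)
    if single_kernel_bytes <= limit:
        return "bf16_se_single_kernel"
    return "bf16_se_tans"


# 0.4% larger: inside the tolerance, so the faster codec is chosen.
assert choose_fast_codec(1000, 1004) == "bf16_se_single_kernel"
# 1.0% larger: outside the tolerance, so tANS is kept.
assert choose_fast_codec(1000, 1010) == "bf16_se_tans"
```

Since this check only applies when prefer_speed=True is passed, a plain compress() call never reaches it, which is consistent with the default behavior being unchanged.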

What did NOT pan out (honest)

Spec gate of <0.2pp size cost: actual +0.45pp on Phi shard 1 (per-bucket rANS framing + slightly less-efficient Numba quantisation). Out of spec by ~2.25x, but the trade-off is documented and the codec is opt-in.
Streaming inference >1 token/sec: still ~130 s/token. Weight-decode speedup is real but streaming is bottlenecked by HF model setup + per-layer transfers, not entropy decoding.
KV cache <100ms/pass: ~14 s at seq=2000 (down from 30s baseline). Real progress, not "live".

Install: pip install bigsmall==3.6.0
