v3.6.0

@wpferrell wpferrell released this 18 May 23:12
· 52 commits to main since this release

v3.6.0 ships bf16_se_single_kernel: the entire BF16 tensor encode path and the entire decode path are each collapsed into a single Numba @njit function. This eliminates the per-bucket Python boundary crossings and replaces numpy argsort with an O(n) counting sort.

Largest single-session speedup of the v3.x speed arc.
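
As a rough illustration of the argsort replacement, here is a minimal pure-Python sketch of a stable counting-sort "argsort" over small integer bucket keys. It is not the bigsmall kernel: the function name counting_argsort, the list-based types, and the idea that keys are bucket indices in range(num_buckets) are assumptions made for this example. The shipped code runs inside a Numba @njit function.

```python
def counting_argsort(keys, num_buckets):
    """Stable argsort of integer keys in range(num_buckets), O(n + num_buckets).

    Hypothetical helper for illustration only; not part of the bigsmall API.
    Returns (order, starts): order lists element indices grouped by key, and
    order[starts[b]:starts[b + 1]] holds the indices that fall in bucket b.
    """
    # Pass 1: histogram, shifted by one slot so the prefix sum yields offsets.
    counts = [0] * (num_buckets + 1)
    for k in keys:
        counts[k + 1] += 1

    # Prefix sum: counts[b] becomes the start offset of bucket b.
    for b in range(num_buckets):
        counts[b + 1] += counts[b]

    starts = list(counts)            # bucket boundaries, length num_buckets + 1
    cursor = counts[:num_buckets]    # next free slot per bucket

    # Pass 2: scatter each index into its bucket, preserving input order.
    order = [0] * len(keys)
    for i, k in enumerate(keys):
        order[cursor[k]] = i
        cursor[k] += 1
    return order, starts


keys = [2, 0, 1, 2, 0]
order, starts = counting_argsort(keys, num_buckets=3)
print(order)   # [1, 4, 2, 0, 3]
print(starts)  # [0, 2, 3, 5]
```

Two linear passes replace the O(n log n) comparison sort, and the bucket boundaries come out of the same prefix sum, so no separate grouping step is needed.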

Phi-3.5-mini shard 1 (128 BF16 tensors, 4.97 GB)

Codec                            Encode      Decode       Decode speedup vs bf16_se_ac
bf16_se_ac (3.3.0)               43.4 MB/s   25.7 MB/s    1.00x
bf16_se_rans (3.4.0)             45.0 MB/s   27.0 MB/s    1.04x
bf16_se_tans (3.5.0)             48.4 MB/s   58.4 MB/s    2.27x
bf16_se_single_kernel (3.6.0)    98.6 MB/s   117.5 MB/s   4.57x

Added

  • bigsmall.codecs.single_kernel — one @njit function each for encode/decode covering the full pipeline (sign/exp/mantissa split, SE freq + rANS encode, O(n) counting-sort, per-bucket mantissa freq + rANS encode, blob assembly). Zero Python orchestration between phases.
  • New codec name bf16_se_single_kernel registered.
  • compress(prefer_speed=True) now picks bf16_se_single_kernel over bf16_se_tans when within +0.6% size tolerance.
  • 6 new tests. 114 passed / 2 skipped (up from 108).
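
The first phase listed above, the sign/exp/mantissa split, can be sketched as follows. This is an illustrative pure-Python version, not the Numba kernel. It assumes "SE" means the 9-bit sign+exponent field and that the mantissa is the low 7 bits, which follows the standard BF16 layout, though the notes do not spell out the codec's exact symbol definition. The helper names are hypothetical.

```python
import struct


def float_to_bf16_bits(x):
    """BF16 bit pattern of x by truncating float32 to its upper 16 bits."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16


def split_bf16(word):
    """Split a 16-bit BF16 word into (sign_exp, mantissa).

    sign_exp: 9 bits (1 sign bit + 8 exponent bits)
    mantissa: 7 bits
    """
    return word >> 7, word & 0x7F


def merge_bf16(sign_exp, mantissa):
    """Inverse of split_bf16: reassemble the 16-bit word."""
    return (sign_exp << 7) | mantissa


word = float_to_bf16_bits(-1.5)
se, m = split_bf16(word)
print(hex(word), se, m)  # 0xbfc0 383 64
assert merge_bf16(se, m) == word
```

Because the split is a pure bit partition, reassembly is lossless. A bit-identical decode depends on this property, along with lossless entropy coding of the two streams.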

Compatibility

  • All existing .bs files (3.0.0-3.5.0) decode bit-identically.
  • bf16_se_single_kernel files require bigsmall >= 3.6.0.
  • Default compress() behavior unchanged.
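
One way to picture how the opt-in prefer_speed choice can coexist with an unchanged default is the sketch below. It is an assumption-heavy illustration: pick_codec is a hypothetical function, the 0.6% tolerance is interpreted here as relative to the bf16_se_tans output size, and the notes do not say which codec the default path selects, so the sketch simply falls back to bf16_se_tans.

```python
def pick_codec(sizes, prefer_speed=False, tolerance=0.006):
    """Choose between two codecs from trial-compressed sizes (in bytes).

    Hypothetical helper, not the bigsmall API. `sizes` maps codec name to
    compressed size. The fast codec is chosen only when the caller opts in
    AND its output is within `tolerance` (relative) of the baseline size.
    """
    fast = "bf16_se_single_kernel"
    baseline = "bf16_se_tans"
    if prefer_speed and sizes[fast] <= sizes[baseline] * (1 + tolerance):
        return fast
    return baseline  # assumed fallback; the real default is not stated


sizes = {"bf16_se_tans": 1000, "bf16_se_single_kernel": 1004}
print(pick_codec(sizes))                     # bf16_se_tans
print(pick_codec(sizes, prefer_speed=True))  # bf16_se_single_kernel
```

With this shape, callers who never pass prefer_speed=True see no change in output, which matches the compatibility statement above.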

What did NOT pan out (honest)

  • Size-cost spec gate of <0.2 pp: missed. The actual cost is +0.45 pp on Phi shard 1 (per-bucket rANS framing plus slightly less efficient Numba quantisation), about 2.25x the budget. The trade-off is documented and opt-in.
  • Streaming inference target of >1 token/sec: missed, still ~130 s/token. The weight-decode speedup is real, but streaming is bottlenecked by HF model setup and per-layer transfers, not by entropy decoding.
  • KV cache target of <100 ms/pass: missed, ~14 s at seq=2000 (down from a 30 s baseline). Real progress, but not "live".

Install: pip install bigsmall==3.6.0