v3.2.0

@wpferrell wpferrell released this 18 May 20:00
· 56 commits to main since this release

v3.2.0 ships GPU-decode infrastructure for BigSmall: a new parallel-stream codec, a working Triton GPU kernel, and a streaming inference wrapper.

Functionality is correct end-to-end, but performance is not yet user-ready: this release lays infrastructure rather than delivering speed. The GPU kernel decodes ~2x faster than the CPU rANS path (2.4 MB/s vs 1.2 MB/s) but is still ~7x slower than the existing bf16_se_ac codec. Streaming inference on Phi-3.5-mini produces correct output at 0.63 GB peak VRAM (a 12x reduction from 7.64 GB for standard BF16) but takes ~300 s per token. Closing the throughput gap will take multi-week GPU-kernel optimisation work, which is on the V4+ roadmap.

Added

  • bf16_parallel_v1 codec — N interleaved slices, rANS-encoded with SHARED probability tables. GPU-portable bitstream format.
  • Triton GPU kernel (bigsmall/kernels/ac_triton.py) — one program per stream, decodes SE on GPU.
  • Auto-fallback kernel wrapper — picks CUDA-C ext > Triton > CPU automatically. BIGSMALL_FORCE_CPU=1 to disable.
  • BigSmallStreamingModel.from_pretrained() — load model layer-by-layer, peak VRAM = non-layer + activations + one layer.
  • compress(..., gpu_optimised=True) flag — opt-in parallel codec with +1% size tolerance.
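
Two of the ideas above, interleaved slices and the backend priority order, can be illustrated with a small standalone sketch. It does not use the bigsmall API: the function names `split_interleaved`, `merge_interleaved`, and `pick_backend` are hypothetical, round-robin interleaving at byte granularity is an assumption (the real codec may slice at a different granularity), and reading `BIGSMALL_FORCE_CPU=1` as "always pick the CPU path" is an interpretation of the note above.

```python
import os

def split_interleaved(data: bytes, n_streams: int) -> list[bytes]:
    """Round-robin split: slice i receives bytes i, i + n, i + 2n, ..."""
    return [data[i::n_streams] for i in range(n_streams)]

def merge_interleaved(slices: list[bytes], total_len: int) -> bytes:
    """Inverse of split_interleaved: write each slice back to its stride."""
    out = bytearray(total_len)
    n = len(slices)
    for i, s in enumerate(slices):
        out[i::n] = s  # extended-slice assignment; lengths match by construction
    return bytes(out)

def pick_backend(have_cuda_ext: bool, have_triton: bool, env=None) -> str:
    """Priority order from the release notes: CUDA-C ext > Triton > CPU.

    Assumption: BIGSMALL_FORCE_CPU=1 overrides detection and selects CPU.
    """
    env = os.environ if env is None else env
    if env.get("BIGSMALL_FORCE_CPU") == "1":
        return "cpu"
    if have_cuda_ext:
        return "cuda_ext"
    if have_triton:
        return "triton"
    return "cpu"

if __name__ == "__main__":
    payload = bytes(range(10))
    parts = split_interleaved(payload, 3)
    # Each slice could be entropy-coded and decoded independently,
    # which is what makes one-program-per-stream GPU decoding possible.
    assert merge_interleaved(parts, len(payload)) == payload
    print(pick_backend(have_cuda_ext=False, have_triton=True, env={}))
```

Because each slice depends only on its own bytes, the decoder can process all N slices in parallel and reassemble them with a strided write.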

Tests

  • 14 new tests across tests/test_bf16_parallel.py, tests/test_gpu_kernel.py, tests/test_streaming_inference.py. 91 passed, 2 skipped (up from 77).

Empirical findings (Phi-3.5-mini, real model)

  • Ratio cost of bf16_parallel_v1 at N=128: +0.07 to +0.34 pp on large tensors (within the spec's +1% tolerance gate).
  • GPU SE-decode throughput: 2.4 MB/s (CPU rANS: 1.2 MB/s; CPU constriction: 17 MB/s).
  • Streaming inference: prompt "The capital of France is" → "The capital of France is Paris." Peak VRAM 0.63 GB vs 7.64 GB standard (12x reduction ✓). Throughput is 300 s/token, so the target of 50% of standard throughput is NOT met; multi-week kernel optimisation is needed.
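
The headline ratios quoted in these notes follow directly from the figures above. The short check below recomputes them; the variable names are illustrative, and treating the "CPU constriction" figure as the bf16_se_ac baseline is an assumption based on the ~7x gap it reproduces.

```python
# Figures copied from the findings above.
gpu_se_decode_mb_s = 2.4
cpu_rans_mb_s = 1.2
cpu_constriction_mb_s = 17.0   # assumed to be the existing bf16_se_ac path
streaming_peak_vram_gb = 0.63
standard_peak_vram_gb = 7.64

gpu_vs_cpu_rans = gpu_se_decode_mb_s / cpu_rans_mb_s
constriction_vs_gpu = cpu_constriction_mb_s / gpu_se_decode_mb_s
vram_reduction = standard_peak_vram_gb / streaming_peak_vram_gb

print(f"GPU vs CPU rANS:        {gpu_vs_cpu_rans:.1f}x faster")      # 2.0x
print(f"CPU constriction vs GPU: {constriction_vs_gpu:.1f}x faster")  # 7.1x
print(f"Peak VRAM reduction:     {vram_reduction:.1f}x")              # 12.1x
```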

Compatibility

  • All existing tests pass. Files written by 3.1.0 are read identically by 3.2.0.
  • bf16_parallel_v1 files require bigsmall >= 3.2.0.
  • Default behaviour unchanged — gpu_optimised=False is the default.
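
For tooling that has to decide whether an environment can read bf16_parallel_v1 files, a version gate along these lines may be enough. This is a minimal sketch: `parse_version` and `can_read_parallel_v1` are hypothetical helpers, not part of bigsmall, and the parser only handles plain `X.Y.Z` strings (pre-release suffixes would need a proper parser such as `packaging.version.Version`).

```python
MIN_PARALLEL_VERSION = (3, 2, 0)  # from the compatibility note above

def parse_version(v: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as '3.2.0' into a tuple of ints."""
    return tuple(int(part) for part in v.split("."))

def can_read_parallel_v1(installed: str) -> bool:
    """True if the installed version meets the >= 3.2.0 requirement."""
    return parse_version(installed) >= MIN_PARALLEL_VERSION

if __name__ == "__main__":
    for candidate in ("3.1.0", "3.2.0", "3.10.0"):
        print(candidate, can_read_parallel_v1(candidate))
```

In practice the installed version string could come from `importlib.metadata.version("bigsmall")`. Comparing integer tuples rather than strings is what keeps 3.10.0 ordered after 3.2.0.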

Known limitations (deliberate, documented)

  • Mantissa decode still runs on the CPU in the GPU kernel path (a straightforward port, targeted for v3.3.0).
  • GPU kernel uses BLOCK_SIZE=1; warp-cooperative decode is the next optimisation.
  • Streaming inference is correct but slow — VRAM target met, tokens/sec target not.

Install: pip install bigsmall==3.2.0