v3.2.0
v3.2.0 ships GPU-decode infrastructure for BigSmall: a new parallel-stream codec, a working Triton GPU kernel, and a streaming inference wrapper.
Functionality is correct end-to-end; performance is honest infrastructure, not yet user-ready. The GPU kernel decodes ~2x faster than the CPU rANS path but is still ~7x slower than the existing bf16_se_ac codec. Streaming inference on Phi-3.5-mini produces correct output at 0.63 GB peak VRAM (12x reduction from 7.6 GB BF16) but takes ~300s per token. Closing the throughput gap is multi-week GPU-kernel optimisation work, on the V4+ roadmap.
Added
bf16_parallel_v1codec — N interleaved slices, rANS-encoded with SHARED probability tables. GPU-portable bitstream format.- Triton GPU kernel (
bigsmall/kernels/ac_triton.py) — one program per stream, decodes SE on GPU. - Auto-fallback kernel wrapper — picks CUDA-C ext > Triton > CPU automatically.
BIGSMALL_FORCE_CPU=1to disable. BigSmallStreamingModel.from_pretrained()— load model layer-by-layer, peak VRAM = non-layer + activations + one layer.compress(..., gpu_optimised=True)flag — opt-in parallel codec with +1% size tolerance.
Tests
- 14 new tests across
tests/test_bf16_parallel.py,tests/test_gpu_kernel.py,tests/test_streaming_inference.py. 91 passed, 2 skipped (up from 77).
Empirical findings (Phi-3.5-mini, real model)
- Ratio cost of
bf16_parallel_v1at N=128: +0.07-0.34pp on big tensors (within spec's +1% tolerance gate). - GPU SE-decode throughput: 2.4 MB/s (CPU rANS: 1.2 MB/s; CPU constriction: 17 MB/s).
- Streaming inference: prompt "The capital of France is" → "The capital of France is Paris." Peak VRAM 0.63 GB vs 7.64 GB standard (12x reduction ✓). 300s/token (50%-of-standard throughput target NOT met; multi-week kernel optimisation needed).
Compatibility
- All existing tests pass. Files written by 3.1.0 read identically by 3.2.0.
bf16_parallel_v1files require bigsmall >= 3.2.0.- Default behaviour unchanged —
gpu_optimised=Falseis the default.
Known limitations (deliberate, documented)
- Mantissa decode still on CPU in GPU kernel path (straightforward port for v3.3.0).
- GPU kernel uses BLOCK_SIZE=1; warp-cooperative decode is the next optimisation.
- Streaming inference is correct but slow — VRAM target met, tokens/sec target not.
Install: pip install bigsmall==3.2.0