v3.9.0

@wpferrell wpferrell released this 19 May 01:43
· 51 commits to main since this release

v3.9.0 ships streaming-compression infrastructure plus several ergonomics improvements.

Added

  • bigsmall.compress_streaming(src, dst) — encodes one tensor at a time via safetensors lazy loading. Output bit-identical to compress() on models without tied weights (md5-verified). Trade-offs: no cross-tensor tied-weight dedup (most modern LLMs don't tie anyway), serial encode (no worker pool).
  • bigsmall.compress_from_hub(repo_id, output_dir) — downloads each shard via huggingface_hub and runs compress_streaming. Peak RAM stays at one tensor regardless of model size.
  • bigsmall.decompress_layers(bs_path, layer_indices, …) — decompress only the requested transformer layers. Useful for partial fine-tuning, layer analysis, early-exit inference.
  • BigSmallStreamingModel(prefetch=N) — optional async prefetch worker (lazy init). Default disabled.
  • Better error messages in from_pretrained(): missing path / missing config.json now surface actionable suggestions.
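The one-tensor-at-a-time idea behind compress_streaming can be sketched with the standard library alone. This is a minimal illustration, not bigsmall's implementation: it relies only on the public safetensors file layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes addressed by `data_offsets`), and it uses zlib as a stand-in codec. The helper names `write_safetensors`, `iter_tensors`, and `compress_one_at_a_time`, and the output record layout, are hypothetical.

```python
import io
import json
import struct
import zlib


def write_safetensors(tensors):
    """Build a minimal safetensors-style blob: u64 header length, JSON header, raw bytes."""
    header, payload, offset = {}, b"", 0
    for name, raw in tensors.items():
        header[name] = {
            "dtype": "U8",
            "shape": [len(raw)],
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    head = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(head)) + head + payload


def iter_tensors(fh):
    """Yield (name, raw_bytes) one tensor at a time without reading the whole file."""
    (header_len,) = struct.unpack("<Q", fh.read(8))
    header = json.loads(fh.read(header_len))
    data_start = 8 + header_len  # tensor offsets are relative to this position
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry, not a tensor
            continue
        begin, end = meta["data_offsets"]
        fh.seek(data_start + begin)
        yield name, fh.read(end - begin)


def compress_one_at_a_time(src, dst):
    """Encode tensors serially. zlib is a stand-in here, NOT bigsmall's codec."""
    sizes = {}
    for name, raw in iter_tensors(src):
        blob = zlib.compress(raw)
        name_bytes = name.encode("utf-8")
        # Hypothetical record layout: name length, blob length, name, blob.
        dst.write(struct.pack("<HQ", len(name_bytes), len(blob)))
        dst.write(name_bytes)
        dst.write(blob)
        sizes[name] = (len(raw), len(blob))
    return sizes


src = io.BytesIO(write_safetensors({"a": bytes(1000), "b": b"xyz" * 100}))
dst = io.BytesIO()
sizes = compress_one_at_a_time(src, dst)
```

Because each tensor is read, encoded, and written before the next one is touched, only one tensor's bytes are live at a time. The same structure also shows why cross-tensor tied-weight dedup is lost: the loop never holds two tensors together to compare them.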

Memory measurement on Phi-3.5-mini shard 1 (4.97 GB raw)

| Path | RSS growth | Python heap peak |
| --- | --- | --- |
| compress(workers=1) | 11.5 GB | 11.79 GB |
| compress_streaming | 8.19 GB | 3.37 GB |
| Reduction | 1.41x | 3.50x |

Extrapolating to a 70B model, that's roughly the difference between "needs 140 GB RAM" and "needs ~5 GB RAM".
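A "Python heap peak" figure of this kind can be reproduced with the standard-library tracemalloc module. The sketch below is an assumption about methodology, not the script used for the numbers above. The helper `heap_peak_bytes` and the two toy workloads are hypothetical. Measuring RSS growth would need a separate process-level reading (for example via psutil), which is not shown.

```python
import tracemalloc


def heap_peak_bytes(fn, *args, **kwargs):
    """Run fn and return (result, peak bytes traced on the Python heap during the call)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def load_all(n_chunks, chunk_size):
    """Toy stand-in for a whole-shard load: every chunk is alive at once."""
    chunks = [bytes(chunk_size) for _ in range(n_chunks)]
    return sum(len(c) for c in chunks)


def one_at_a_time(n_chunks, chunk_size):
    """Toy stand-in for streaming: each chunk is dropped before the next is kept."""
    total = 0
    for _ in range(n_chunks):
        chunk = bytes(chunk_size)
        total += len(chunk)
    return total


MIB = 1024 * 1024
total_all, peak_all = heap_peak_bytes(load_all, 8, MIB)
total_stream, peak_stream = heap_peak_bytes(one_at_a_time, 8, MIB)
print(f"load_all peak:      {peak_all / MIB:.1f} MiB")
print(f"one_at_a_time peak: {peak_stream / MIB:.1f} MiB")
```

Both workloads process the same number of bytes, but the streaming variant's peak is bounded by a couple of chunks rather than by the total.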

Tests

  • 5 new tests in tests/test_opt_step6.py. 124 passed / 2 skipped (up from 119).

What did NOT pan out

  • Async prefetch doesn't unlock streaming inference throughput: with decode at ~3s/layer and GPU forward at ~85ms total/token, the decoder is the critical path. Prefetch is shipped as infrastructure; real unlock needs the GPU AC kernel (v3.2.0 Triton roadmap).
  • compress_from_hub streams from the local HF cache, not the CDN itself. Truly buffered streaming from the CDN (zero local disk) would need re-implementing safetensors lazy loading on HTTP range requests — multi-day project, deferred.
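To make the deferred CDN-streaming item concrete: lazy loading over HTTP would have to turn each tensor's `data_offsets` into an absolute byte range in the remote file. A minimal sketch of that offset arithmetic follows, assuming only the public safetensors layout. `parse_header` and `tensor_range` are hypothetical helpers, not part of bigsmall, safetensors, or huggingface_hub, and no network request is made here.

```python
import json
import struct


def parse_header(prefix):
    """Parse the safetensors header from the leading bytes of a file.

    Returns (header_len, header_dict). `prefix` must hold at least
    8 + header_len bytes, e.g. fetched with an initial range request.
    """
    (header_len,) = struct.unpack("<Q", prefix[:8])
    if len(prefix) < 8 + header_len:
        raise ValueError("prefix does not contain the full header")
    header = json.loads(prefix[8:8 + header_len])
    return header_len, header


def tensor_range(header_len, meta):
    """Return the HTTP Range header value covering one tensor, or None if it is empty."""
    begin, end = meta["data_offsets"]
    if end <= begin:
        return None
    start = 8 + header_len + begin   # offsets are relative to the end of the header
    stop = 8 + header_len + end - 1  # HTTP byte ranges are inclusive on both ends
    return f"bytes={start}-{stop}"


# Build a tiny in-memory file to exercise the arithmetic.
demo_header = {
    "w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]},
    "b": {"dtype": "F32", "shape": [1], "data_offsets": [8, 12]},
}
head = json.dumps(demo_header).encode("utf-8")
blob = struct.pack("<Q", len(head)) + head + bytes(12)

header_len, header = parse_header(blob)
ranges = {name: tensor_range(header_len, meta) for name, meta in header.items()}
```

Each resulting value could be sent as the `Range` header of a request (for instance with `urllib.request.Request(url, headers={"Range": value})`). The remaining work the note alludes to, such as retries, authentication, and buffering, is outside this sketch.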

Install: pip install bigsmall==3.9.0