Release v3.9.0 · wpferrell/Bigsmall

v3.9.0 ships streaming-compression infrastructure plus several ergonomics improvements.

bigsmall.compress_streaming(src, dst) — encodes one tensor at a time via safetensors lazy loading. Output bit-identical to compress() on models without tied weights (md5-verified). Trade-offs: no cross-tensor tied-weight dedup (most modern LLMs don't tie anyway), serial encode (no worker pool).
bigsmall.compress_from_hub(repo_id, output_dir) — downloads each shard via huggingface_hub and runs compress_streaming. Peak RAM stays at one tensor regardless of model size.
bigsmall.decompress_layers(bs_path, layer_indices, …) — decompress only the requested transformer layers. Useful for partial fine-tuning, layer analysis, early-exit inference.
BigSmallStreamingModel(prefetch=N) — optional async prefetch worker (lazy init). Default disabled.
Better error messages in from_pretrained(): missing path / missing config.json now surface actionable suggestions.
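
The "md5-verified" claim for compress_streaming can be reproduced with nothing but the standard library. The following is a minimal sketch, not the project's own verification script: the helper names md5_of_file and outputs_match are hypothetical, and it assumes each compression call leaves a single output file on disk, which these notes do not state.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(path_a, path_b):
    """True when both files have the same size and the same MD5 digest."""
    a, b = Path(path_a), Path(path_b)
    if a.stat().st_size != b.stat().st_size:
        return False  # cheap early exit before hashing anything
    return md5_of_file(a) == md5_of_file(b)
```

With one archive produced by compress() and one by compress_streaming(), a call such as outputs_match("full.bs", "streamed.bs") would confirm bit-identical output; both file names here are placeholders.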

For a 70B model, holding one tensor in memory at a time instead of the whole checkpoint is the difference between "needs 140 GB RAM" and "needs ~5 GB RAM".
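
The 140 GB figure is consistent with 70 billion parameters at 2 bytes each (16-bit weights). The sketch below works through that arithmetic and shows how a one-tensor-at-a-time peak might be estimated. The tensor shapes are hypothetical stand-ins for a large LLM, not taken from any specific checkpoint, and the sketch ignores encoder working buffers, so its result is a lower bound rather than the ~5 GB quoted above.

```python
import math

GB = 10**9  # decimal gigabytes, matching the "140 GB" figure


def whole_model_gb(n_params, bytes_per_param=2):
    """RAM needed to hold every weight at once (16-bit by default)."""
    return n_params * bytes_per_param / GB


def largest_tensor_gb(shapes, bytes_per_param=2):
    """RAM needed to hold only the single largest tensor."""
    biggest = max(math.prod(shape) for shape in shapes)
    return biggest * bytes_per_param / GB


# Hypothetical shapes: an embedding matrix, an MLP projection, an attention projection.
example_shapes = [(128256, 8192), (8192, 28672), (8192, 8192)]

full = whole_model_gb(70 * 10**9)            # all weights resident
peak = largest_tensor_gb(example_shapes)     # one tensor resident
print(f"whole model: {full:.0f} GB, largest single tensor: {peak:.2f} GB")
```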

5 new tests in tests/test_opt_step6.py. 124 passed / 2 skipped (up from 119).

Async prefetch doesn't unlock streaming-inference throughput: with decode at ~3 s/layer and GPU forward at ~85 ms total per token, the decoder is the critical path. Prefetch ships as infrastructure; the real unlock needs the GPU AC kernel (v3.2.0 Triton roadmap).
compress_from_hub streams from the local HF cache, not from the CDN itself. Truly buffered streaming from the CDN (zero local disk) would need safetensors lazy loading re-implemented on top of HTTP range requests, a multi-day project that is deferred.
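
The prefetch note above can be checked with a simple two-stage pipeline model. This is a rough sketch under stated assumptions: an 80-layer model (the layer count is hypothetical, not from these notes), one prefetch worker that decodes the next layer while the GPU runs the current one, and the ~85 ms forward time split evenly across layers.

```python
def sequential_seconds(n_layers, decode_s, forward_s):
    """No prefetch: each layer is decoded, then run, one after another."""
    return n_layers * (decode_s + forward_s)


def prefetched_seconds(n_layers, decode_s, forward_s):
    """One prefetch worker: decoding layer i+1 overlaps the forward of layer i.

    Classic two-stage pipeline: the first decode and the last forward are
    exposed, and every step in between costs the slower of the two stages.
    """
    return decode_s + (n_layers - 1) * max(decode_s, forward_s) + forward_s


n_layers = 80                 # hypothetical layer count
decode_s = 3.0                # ~3 s/layer decode, from the notes
forward_s = 0.085 / n_layers  # ~85 ms total per token, split evenly

no_prefetch = sequential_seconds(n_layers, decode_s, forward_s)
with_prefetch = prefetched_seconds(n_layers, decode_s, forward_s)
saved = no_prefetch - with_prefetch
print(f"{no_prefetch:.3f} s vs {with_prefetch:.3f} s per token, saved {saved:.3f} s")
```

Under these assumptions prefetch hides only the GPU forward time, well under a tenth of a second out of roughly 240 seconds per token, which is why a faster decoder matters more than overlap.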

Install: pip install bigsmall==3.9.0
