v3.9.0
v3.9.0 ships streaming-compression infrastructure plus several ergonomics improvements.
Added
bigsmall.compress_streaming(src, dst)— encodes one tensor at a time via safetensors lazy loading. Output bit-identical tocompress()on models without tied weights (md5-verified). Trade-offs: no cross-tensor tied-weight dedup (most modern LLMs don't tie anyway), serial encode (no worker pool).bigsmall.compress_from_hub(repo_id, output_dir)— downloads each shard via huggingface_hub and runscompress_streaming. Peak RAM stays at one tensor regardless of model size.bigsmall.decompress_layers(bs_path, layer_indices, …)— decompress only the requested transformer layers. Useful for partial fine-tuning, layer analysis, early-exit inference.BigSmallStreamingModel(prefetch=N)— optional async prefetch worker (lazy init). Default disabled.- Better error messages in
from_pretrained(): missing path / missing config.json now surface actionable suggestions.
Memory measurement on Phi-3.5-mini shard 1 (4.97 GB raw)
| Path | RSS growth | Python heap peak |
|---|---|---|
compress(workers=1) |
11.5 GB | 11.79 GB |
compress_streaming |
8.19 GB | 3.37 GB |
| Reduction | 1.41x | 3.50x |
For a 70B model that's the difference between "needs 140 GB RAM" and "needs ~5 GB RAM".
Tests
- 5 new tests in
tests/test_opt_step6.py. 124 passed / 2 skipped (up from 119).
What did NOT pan out
- Async prefetch doesn't unlock streaming inference throughput: with decode at ~3s/layer and GPU forward at ~85ms total/token, the decoder is the critical path. Prefetch is shipped as infrastructure; real unlock needs the GPU AC kernel (v3.2.0 Triton roadmap).
compress_from_hubstreams from the local HF cache, not the CDN itself. Truly buffered streaming from the CDN (zero local disk) would need re-implementing safetensors lazy loading on HTTP range requests — multi-day project, deferred.
Install: pip install bigsmall==3.9.0