v3.7.0
v3.7.0 unlocks parallel tensor encoding on Windows. The historical hard-coded `workers=1` default was overly conservative: Windows spawn-context multiprocessing works correctly and produces bit-identical output.
Speedup on Phi-3.5-mini partial shard (876 MB raw, 20 BF16 tensors)
| Workers | Wall time | Speedup |
|---|---|---|
| 1 | 115.19 s | 1.00x |
| 2 | 79.33 s | 1.45x |
| 4 | 63.30 s | 1.82x |
| 8 | 68.79 s | 1.67x (past optimal) |
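
The speedup column is the 1-worker wall time divided by each run's wall time. A minimal sketch that recomputes it from the table and also derives per-worker efficiency (speedup divided by worker count), which is an added illustrative metric and not part of the release notes:

```python
# Wall times in seconds, copied from the table above.
wall_times = {1: 115.19, 2: 79.33, 4: 63.30, 8: 68.79}

baseline = wall_times[1]

# Map worker count -> (speedup, per-worker efficiency).
rows = {w: (baseline / t, baseline / t / w) for w, t in wall_times.items()}

for workers, (speedup, efficiency) in rows.items():
    print(f"{workers} workers: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")

# Worker count with the highest measured speedup.
best = max(rows, key=lambda w: rows[w][0])
```

On these numbers `best` is 4, matching the "past optimal" note on the 8-worker row.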
Outputs are md5-identical across all worker counts.
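
One way to repeat the md5 check locally is to hash the output of each run and compare the digests. This is a sketch only: the helper names and output file names are hypothetical and are not part of bigsmall.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(paths):
    """True if every file in `paths` has the same md5 digest."""
    return len({md5_of_file(p) for p in paths}) <= 1
```

For example, `outputs_identical(["out_w1.bs", "out_w4.bs", "out_w8.bs"])` would compare three runs written to those (hypothetical) paths.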
Added
- Default `workers = min(cpu_count, 8)` on all platforms (was 1 on Windows). Override via the `BIGSMALL_WORKERS` env var still works.
- `encoder._safe_workers()` caps worker count by available RAM (psutil) and tensor count. Always returns ≥ 1.
- Explicit `mp_context = spawn` on `ProcessPoolExecutor` for cross-platform consistency. Same fix applied to `compress_delta()`.
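
The ideas in this list can be sketched with the standard library alone. This is not the bigsmall implementation: `pick_workers`, `encode_one`, and `encode_all` are hypothetical stand-ins, the per-worker memory estimate is an assumed input (the real code reportedly queries psutil), and `zlib` substitutes for the actual tensor encoder.

```python
import multiprocessing as mp
import os
import zlib
from concurrent.futures import ProcessPoolExecutor


def pick_workers(requested, n_tensors, avail_bytes, per_worker_bytes):
    """Cap a requested worker count by RAM and tensor count; never below 1."""
    by_ram = avail_bytes // max(per_worker_bytes, 1)
    return max(1, min(requested, by_ram, n_tensors))


def default_request(cap=8):
    """Read BIGSMALL_WORKERS if set, else fall back to min(cpu_count, cap)."""
    env = os.environ.get("BIGSMALL_WORKERS")
    return int(env) if env else min(os.cpu_count() or 1, cap)


def encode_one(blob):
    # Placeholder for the real per-tensor encoder (assumption).
    return zlib.compress(blob, 6)


def encode_all(blobs, workers):
    """Encode blobs in input order; workers <= 1 takes the serial, no-pool path."""
    if workers <= 1:
        return [encode_one(b) for b in blobs]
    ctx = mp.get_context("spawn")  # same start method on every platform
    with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
        # map() yields results in submission order, so the output does not
        # depend on which worker finishes first.
        return list(pool.map(encode_one, blobs))
```

With the spawn start method, the parallel path should be called from code guarded by `if __name__ == "__main__":`, because each worker re-imports the main module.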
Tests
- 5 new tests in `tests/test_multiprocessing.py`. 119 passed / 2 skipped (up from 114).
Compatibility
- Output is deterministic across worker counts: every existing .bs file is reproducible at any `workers` setting.
- `BIGSMALL_WORKERS=1` still selects the serial (no-pool) path.
What did NOT pan out
- Spec target >4x: actual 1.82x. Process-spawn and pickle overhead cap the Windows spawn path at about this speedup on this workload. Pushing further needs Numba-warm workers or a thread-pool variant.
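
For context on the thread-pool idea, here is a minimal sketch of what such a variant could look like. It is an illustration under assumptions, not a planned design: `encode_one` is a hypothetical stand-in that uses `zlib`, and threads only help when the per-tensor encoder spends its time in code that releases the GIL, which has not been established for the bigsmall encoder.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def encode_one(blob):
    # Placeholder encoder (assumption); the real encoder may hold the GIL.
    return zlib.compress(blob, 6)


def encode_all_threads(blobs, workers):
    """Encode blobs in input order on a thread pool: no spawn, no pickling."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_one, blobs))
```

Threads share the parent's memory, so this shape avoids the process start-up and argument pickling costs named above.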
Install: pip install bigsmall==3.7.0