v3.7.0

@wpferrell wpferrell released this 18 May 23:53
· 51 commits to main since this release

v3.7.0 unlocks parallel tensor encoding on Windows. The historical hard-coded workers=1 default was overly conservative — Windows spawn-context multiprocessing works correctly and produces bit-identical output.

Speedup on Phi-3.5-mini partial shard (876 MB raw, 20 BF16 tensors)

Workers   Wall time   Speedup
1         115.19 s    1.00x
2          79.33 s    1.45x
4          63.30 s    1.82x
8          68.79 s    1.67x (past optimal)

Outputs are md5-identical across all worker counts.
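
The speedup column is the 1-worker wall time divided by the wall time at each worker count. The sketch below recomputes it from the published timings and also derives a per-worker efficiency figure. The efficiency figure is not part of the release notes and is included only to show how far the scaling is from linear. The names wall_times, speedup and efficiency are illustrative.

```python
# Wall times copied from the table above (seconds), keyed by worker count.
wall_times = {1: 115.19, 2: 79.33, 4: 63.30, 8: 68.79}


def speedup(times, workers):
    """Serial wall time divided by the wall time at the given worker count."""
    return times[1] / times[workers]


def efficiency(times, workers):
    """Speedup per worker; 1.0 would mean perfect linear scaling."""
    return speedup(times, workers) / workers


for n in sorted(wall_times):
    print(f"{n} workers: {speedup(wall_times, n):.2f}x speedup, "
          f"{efficiency(wall_times, n):.0%} efficiency")

# The fastest configuration is the worker count with the smallest wall time.
best = min(wall_times, key=wall_times.get)
print(f"fastest at {best} workers")
```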

Added

  • Default workers = min(cpu_count, 8) on all platforms (was 1 on Windows). Override via BIGSMALL_WORKERS env var still works.
  • encoder._safe_workers() — caps worker count by available RAM (psutil) and tensor count. Always returns ≥ 1.
  • Explicit mp_context = spawn on ProcessPoolExecutor for cross-platform consistency. Same fix applied to compress_delta().
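
The three items above can be sketched roughly as follows. This is not the bigsmall source: safe_workers is a hypothetical stand-in for encoder._safe_workers(), whose real signature and memory heuristic are not given here, and bytes_per_worker is an assumed parameter. Whether BIGSMALL_WORKERS is itself clamped to 8 is also an assumption (this sketch does not clamp it).

```python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor

MAX_DEFAULT_WORKERS = 8  # cap stated in the release notes


def default_workers():
    """min(cpu_count, 8), unless BIGSMALL_WORKERS overrides it."""
    override = os.environ.get("BIGSMALL_WORKERS")
    if override is not None:
        # Assumption: the override is a plain integer and is not clamped to 8.
        return max(1, int(override))
    return min(os.cpu_count() or 1, MAX_DEFAULT_WORKERS)


def safe_workers(requested, n_tensors, available_bytes, bytes_per_worker):
    """Hypothetical cap by available RAM and tensor count; always >= 1."""
    by_ram = available_bytes // max(1, bytes_per_worker)
    return max(1, min(requested, n_tensors, by_ram))


def make_pool(workers):
    """Process pool pinned to the spawn start method on every platform."""
    ctx = multiprocessing.get_context("spawn")
    return ProcessPoolExecutor(max_workers=workers, mp_context=ctx)
```

In a real caller, available_bytes could come from psutil.virtual_memory().available, which is the psutil call for free-to-use memory. Under the spawn start method the pool must be created from code reachable only behind an `if __name__ == "__main__":` guard, because each worker re-imports the main module.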

Tests

  • 5 new tests in tests/test_multiprocessing.py. 119 passed / 2 skipped (up from 114).

Compatibility

  • Output is deterministic across worker counts — every existing .bs file is reproducible at any workers setting.
  • BIGSMALL_WORKERS=1 still selects the serial (no-pool) path.
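
One way such a guarantee can hold is sketched below, under two assumptions: each tensor is encoded by a pure function, and results are collected in input order. encode_one is a placeholder rather than the real encoder, and encode_all and md5_of are illustrative names, not bigsmall APIs.

```python
import hashlib
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def encode_one(raw):
    """Placeholder for the real per-tensor encoder; must be a pure function."""
    return bytes(reversed(raw))


def encode_all(tensors, workers):
    """Encode every tensor and return the results in input order."""
    if workers <= 1:
        # Serial (no-pool) path, as selected by BIGSMALL_WORKERS=1.
        return [encode_one(t) for t in tensors]
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
        # map() yields results in submission order, not completion order,
        # so the concatenated output does not depend on scheduling.
        return list(pool.map(encode_one, tensors))


def md5_of(chunks):
    """MD5 hex digest of the chunks concatenated in order."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()
```

Comparing md5_of(encode_all(tensors, 1)) with md5_of(encode_all(tensors, 4)) from inside an `if __name__ == "__main__":` block is then a direct check that the serial and pooled paths agree.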

What did NOT pan out

  • The spec target was a >4x speedup; the measured best is 1.82x. Process-spawn and pickle overhead cap the speedup under Windows spawn for this workload. Pushing further would need Numba-warm workers or a thread-pool variant.

Install: pip install bigsmall==3.7.0