v0.6.1 — parallel fan-out + daemon RAM collapse
Performance release. Verified end-to-end on the EnterpriseRAG-Bench v2 fixture (850K genes / 100 shards, HELIX_SHARD_WORKERS=8): resident RAM ~120 GB → 7.1 GB peak, per-query 125s → 57.6s median, 0 daemon deaths, ranked output byte-identical to the serial path.
Highlights
- perf(memory) #173 — share ONE BGE-M3 model process-wide (
get_shared_codec) instead of ~100 per-shard copies; the dominant 100-shard RAM driver. Default on;HELIX_SHARE_DENSE_CODEC=0reverts. Plus opt-in fp16 dense matrix (HELIX_DENSE_MATRIX_DTYPE=float16), SQLitecache_sizecaps, explicitmmap_size=0. - perf(retrieval) #172 — concurrent shard fan-out via ThreadPoolExecutor (BLAS-pinned), gated by
HELIX_SHARD_WORKERS(default 1 = serial). Byte-identical ranked output. Plus a WAL-checkpoint fix on shard close.
Operator notes
- Both default-safe — 0.6.0 behavior is unchanged until you set
HELIX_SHARD_WORKERS>1. - Enable the two together: A1 (model singleton) is the prerequisite that lets the fan-out reach full speedup on a memory-bound rig (without it, per-shard model duplication thrashes the pagefile and caps the win at ~1.5x).
Full notes in CHANGELOG.md.