Skip to content

v0.6.1 — parallel fan-out + daemon RAM collapse

Choose a tag to compare

@mbachaud mbachaud released this 30 May 08:39
· 23 commits to master since this release
b7af5b0

Performance release. Verified end-to-end on the EnterpriseRAG-Bench v2 fixture (850K genes / 100 shards, HELIX_SHARD_WORKERS=8): resident RAM ~120 GB → 7.1 GB peak, per-query 125s → 57.6s median, 0 daemon deaths, ranked output byte-identical to the serial path.

Highlights

  • perf(memory) #173 — share ONE BGE-M3 model process-wide (get_shared_codec) instead of ~100 per-shard copies; the dominant 100-shard RAM driver. Default on; HELIX_SHARE_DENSE_CODEC=0 reverts. Plus opt-in fp16 dense matrix (HELIX_DENSE_MATRIX_DTYPE=float16), SQLite cache_size caps, explicit mmap_size=0.
  • perf(retrieval) #172 — concurrent shard fan-out via ThreadPoolExecutor (BLAS-pinned), gated by HELIX_SHARD_WORKERS (default 1 = serial). Byte-identical ranked output. Plus a WAL-checkpoint fix on shard close.

Operator notes

  • Both default-safe — 0.6.0 behavior is unchanged until you set HELIX_SHARD_WORKERS>1.
  • Enable the two together: A1 (model singleton) is the prerequisite that lets the fan-out reach full speedup on a memory-bound rig (without it, per-shard model duplication thrashes the pagefile and caps the win at ~1.5x).

Full notes in CHANGELOG.md.