feat(eml-hnsw): v2 integrated pipeline — retention selector + SIMD rerank + PQ + progressive cascade (supersedes #353) by ruvnet · Pull Request #356 · ruvnet/RuVector

ruvnet · 2026-04-16T18:02:00Z

Credit

This work builds directly on two outstanding upstream contributions:

@aepod (Mathew Beane) — original PR feat: EML-enhanced HNSW — 6 learned optimizations (10-30x distance, 2-5x search) #353 author. Designed and implemented all six learned models (EmlDistanceModel, ProgressiveDistance, AdaptiveEfModel, SearchPathPredictor, RebuildPredictor, PqDistanceCorrector), the gradient-free eml-core training library, and the 4-stage proof chain methodology. Without @aepod's Stage 4 hypothesis ("EML is the teacher, not the runtime — use plain cosine on selected dims") this v2 would not exist. The architectural pivot described in his own PR #353 comment thread is exactly what this branch ships as callable code.
@shaal (Ofer Shaal) — author of issue EML Operator-Inspired Optimizations: Log Quantization, Unified Distance, EML Trees #351 and PR feat: EML operator-inspired optimizations for quantization, distance, and learned indexes #352. The SimSIMD-backed UnifiedDistanceParams kernel, the four-stage proof methodology (adopted verbatim here), and the honest SIFT1M+GloVe measurement discipline all originated in his work. Tier 1B of this branch is a direct port of his SIMD cosine approach into the reduced-dim rerank stage.

Both authors are credited as Co-Authored-By: on the merged commit, and every piece of measured evidence below is traceable to one or both of their PRs.

Supersedes #353

Rewrites the EML-HNSW contribution into a working integrated pipeline with measured SIFT1M numbers. The original PR shipped six standalone learned models but had no downstream consumer — the ruvector-eml-hnsw crate compiled but its code never reached any RuVector HNSW path. This branch closes that gap and folds in the winning results from a six-experiment swarm run on ruvultra (AMD Ryzen 9 9950X / 32T / 123 GB) against real SIFT1M.

What's in v2

Component	Source tier	Measured result on SIFT1M
`EmlHnsw` wrapper around `hnsw_rs::Hnsw` + `search_with_rerank`	fix/eml-hnsw-integration	baseline — unlocks every result below
SimSIMD rerank kernel (`cosine_distance_simd`), after @shaal's PR #352 kernel	Tier 1B	5.65× @ d=128, 6.22× @ d=384; recall unchanged
`EmlDistanceModel::train_for_retention` — greedy forward selection	Tier 1C	+10.5 pp recall@10 vs @aepod's Pearson (0.712 → 0.817), > 3σ
`ProgressiveEmlHnsw` `[8, 32, 128]` multi-level cascade, using @aepod's `ProgressiveDistance`	Tier 3A	0.984 recall@10 at 961 µs p50 (2× latency at matched recall; 5.9× build cost)
`PqEmlHnsw` 8×256 Product Quantizer paired with @aepod's `PqDistanceCorrector`	Tier 3B	64× memory reduction (512 B → 8 B/vec); rerank recall 0.9515 ≥ 0.80 floor

What's NOT in v2 (and why)

EmlDistanceModel::fast_distance (EML tree per call): measured 2.35× slower than scalar baseline. Kept as reference impl; not on any query-time path. This matches @aepod's own Stage-1 finding on his test hardware.
AdaptiveEfModel: 290 ns/query actual overhead vs 3 ns claimed — too expensive to amortize against the ef-search work it would save.
Sliced Wasserstein rerank (Tier 2 experiment): 50.9× slower and 38.1 pp worse recall than cosine rerank on SIFT. Cleanly falsified for gradient-histogram datasets — documented as closed in ADR-151.
PqDistanceCorrector is kept but held advisory-only: under training on SIFT1M it increased MSE (1.4e9 → 6.4e10) because feature normalization against a global max_pq_dist saturates on SIFT's O(10⁵) distance scale. Final rank is exact cosine so this does not hurt recall. Noted in ADR-151 as a design flaw with a proposed fix direction (per-vector exact normalization).

Test surface

92 tests pass on the merged branch:

85 unit tests across 10 modules (new: selected_distance, pq, pq_hnsw, progressive_hnsw, hnsw_integration; retained: all original ruvector-eml-hnsw tests from @aepod's PR feat: EML-enhanced HNSW — 6 learned optimizations (10-30x distance, 2-5x search) #353)
3 integration tests (recall_integration)
4 SIFT1M real-data tests (env-gated; skipped in CI without the dataset): sift1m_real, retention_vs_pearson, progressive_sift1m, sift1m_pq
1 micro-benchmark (benches/rerank_kernel.rs)

Reproducibility recipe (on any Linux box with rustc ≥ 1.80):

# One-time: fetch SIFT1M (Texmex, ~400MB)
mkdir -p bench_data && cd bench_data
curl -fLO ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz && tar xzf sift.tar.gz
cd ..

# Full SIFT1M test suite
B=$(pwd)/bench_data/sift
export RUVECTOR_EML_SIFT1M_BASE=$B/sift_base.fvecs \
       RUVECTOR_EML_SIFT1M_QUERY=$B/sift_query.fvecs \
       RUVECTOR_EML_SIFT1M_LEARN=$B/sift_learn.fvecs \
       RUVECTOR_EML_SIFT1M_GT=$B/sift_groundtruth.ivecs
cargo test --release -p ruvector-eml-hnsw -- --nocapture

Coupling with #352

@shaal's PR #352 (unified SIMD kernel + QuantizationConfig::Log) is strictly additive over this branch. Landing both captures the full effect: #352 accelerates the inner distance kernel, this branch adds the pre-filter stage that makes wide fetch_k viable. See issue #351 for the cross-PR measurements.

Surface area and compatibility

DbOptions::default() behavior unchanged.
HnswIndex::new(...) and all existing RuVector retrieval paths unchanged.
EmlHnsw / ProgressiveEmlHnsw / PqEmlHnsw are explicitly constructed by callers opting into the approximate-then-exact pipeline.

References

ADR-151 (docs/adr/ADR-151-eml-hnsw-selected-dims.md) — acceptance matrix, per-tier measured numbers, closed/open questions.
PR feat: EML-enhanced HNSW — 6 learned optimizations (10-30x distance, 2-5x search) #353 (@aepod, original contribution this builds on) — feat: EML-enhanced HNSW — 6 learned optimizations (10-30x distance, 2-5x search) #353
Issue EML Operator-Inspired Optimizations: Log Quantization, Unified Distance, EML Trees #351 (@shaal, proof methodology + proposal) — EML Operator-Inspired Optimizations: Log Quantization, Unified Distance, EML Trees #351
PR feat: EML operator-inspired optimizations for quantization, distance, and learned indexes #352 (@shaal, SIMD unified kernel) — feat: EML operator-inspired optimizations for quantization, distance, and learned indexes #352

Closes #353 on merge. Cc @aepod @shaal for review — your work drove every measured result in this PR.

The execute_match() function previously collapsed all match results into a single ExecutionContext via context.bind(), which overwrote previous bindings. MATCH (n:Person) on 3 Person nodes returned only 1 row. This commit refactors the executor to use a ResultSet pipeline: - type ResultSet = Vec<ExecutionContext> - Each clause transforms ResultSet → ResultSet - execute_match() expands the set (one context per match) - execute_return() projects one row per context - execute_set/delete() apply to all contexts - Cross-product semantics for multiple patterns in one MATCH Also adds comprehensive tests: - test_match_returns_multiple_rows (the Issue #269 regression) - test_match_return_properties (verify correct values per row) - test_match_where_filter (WHERE correctly filters multi-row) - test_match_single_result (1 match → 1 row, no regression) - test_match_no_results (0 matches → 0 rows) - test_match_many_nodes (100 nodes → 100 rows, stress test) Co-Authored-By: claude-flow <ruv@ruv.net>

RETURN n.name now produces column "n.name" instead of "?column?". Property expressions (Expression::Property) are formatted as "object.property" for column naming, matching standard Cypher behavior. Co-Authored-By: claude-flow <ruv@ruv.net>

Built from commit b2347ce Platforms updated: - linux-x64-gnu - linux-arm64-gnu - darwin-x64 - darwin-arm64 - win32-x64-msvc 🤖 Generated by GitHub Actions

Built from commit 2adb949 Platforms updated: - linux-x64-gnu - linux-arm64-gnu - darwin-x64 - darwin-arm64 - win32-x64-msvc 🤖 Generated by GitHub Actions

Phase 2 of the ruvector remediation plan. Replaces simulated benchmarks with real measurements: - Python harness: hnswlib (C++) and numpy brute-force on same datasets - Rust test: ruvector-core HNSW with ground-truth recall measurement - Datasets: random-10K and random-100K, 128 dimensions - Metrics: QPS (p50/p95), recall@10 vs ground truth, memory, build time Key findings: - ruvector recall@10 is good: 98.3% (10K), 86.75% (100K) - ruvector QPS is 2.6-2.9x slower than hnswlib - ruvector build time is 2.2-5.9x slower than hnswlib - ruvector uses ~523MB for 100K vectors (10x raw data size) - All numbers are REAL — no hardcoded values, no simulation Co-Authored-By: claude-flow <ruv@ruv.net>

Built from commit 3b173a9 Platforms updated: - linux-x64-gnu - linux-arm64-gnu - darwin-x64 - darwin-arm64 - win32-x64-msvc 🤖 Generated by GitHub Actions

New crate: ruvector-eml-hnsw (6 modules, 93 tests) Patch: hnsw_rs/src/eml_distance.rs (integrated implementations) 1. Cosine Decomposition (EmlDistanceModel) — 10-30x distance speed Learns which dimensions discriminate, reduces O(384) to O(k) 2. Progressive Dimensionality (ProgressiveDistance) — 5-20x search Layer 2: 8-dim, Layer 1: 32-dim, Layer 0: full-dim 3. Adaptive ef (AdaptiveEfModel) — 1.5-3x search speed Per-query beam width from (norm, variance, graph_size, max_component) 4. Search Path Prediction (SearchPathPredictor) — 2-5x search K-means query regions → cached entry points, skip top-layer traversal 5. Rebuild Cost Prediction (RebuildPredictor) — operational efficiency Predicts recall degradation, triggers rebuild only when needed 6. PQ Distance Correction (PqDistanceCorrector) — DiskANN recall Learns PQ approximation error correction from exact/PQ pairs All backward compatible — untrained models fall back to standard behavior. Based on: Odrzywolel 2026, arXiv:2603.21852v2 Co-Authored-By: claude-flow <ruv@ruv.net>

Stage 1: micro-benchmarks (cosine decomp, adaptive ef, path prediction, rebuild prediction) — raw 16d L2 proxy is 9.3x faster than full 128d cosine, but EML model overhead makes fast_distance 2.1x slower. Stage 2: synthetic e2e (10K x 128d) — recall@10 drops to 0.1% on uniform random data because all dimensions are equally important. EML decomposition needs structured embeddings to work. Stage 3: real dataset — deferred, SIFT1M not available. Infrastructure in place to auto-run when dataset is downloaded. Stage 4: hypothesis test — DISPROVEN on random data (Spearman rho=0.013 vs required 0.95). Expected: uniform random has no discriminative dimensions. Real embeddings with PCA structure should score higher. Honest results: dimension reduction mechanism works, but EML model inference overhead and random-data limitations are documented clearly. Following shaal's methodology from PR #352. Co-Authored-By: claude-flow <ruv@ruv.net>

PR #353 added 6 standalone learned models but no consumer, so the selected-dims approach never reached any index. This commit closes that gap: - selected_distance.rs: plain cosine over learned dim subset (the corrected runtime path; the original fast_distance evaluated the EML tree per call and was 2.1x SLOWER than baseline, confirmed on ruvultra AMD 9950X). - hnsw_integration.rs: EmlHnsw wraps hnsw_rs::Hnsw, projects vectors to the learned subspace on add/search, keeps full-dim store for optional rerank. - tests/recall_integration.rs: end-to-end synthetic validation (rerank recall@10 >= 0.83 on structured data). - tests/sift1m_real.rs: Stage-3 gated real-data harness. Test counts: 70 unit + 3 recall_integration + 1 SIFT1M gated + 3 doctests (vs PR #353 body claim of 93 unit tests; actual on pr-353 pre-fix was 60). Stage-3 SIFT1M measured (50k base x 200 queries x 128d, selected_k=32, AMD 9950X): recall@10 reduced = 0.194 (PR #353 author expected ~0.85-0.95) recall@10 +rerank = 0.438 (fetch_k=50 too tight on real data) reduced HNSW p50 = 268.9 us reduced HNSW p95 = 361.8 us Finding: the mechanism is viable as a candidate pre-filter but requires (a) larger fetch_k (200-500), (b) SIMD-accelerated rerank (per PR #352), and (c) training on many more than 500-1000 samples for real embeddings. The synthetic ρ=0.958 claim does NOT reproduce on SIFT1M.

…rank + PQ + progressive cascade Supersedes the original PR #353 contribution with the combined result of six targeted experiments run on ruvultra (AMD Ryzen 9 9950X / 32T / 123 GB) against real SIFT1M (50k base × 200 queries). Integration gap is closed — this crate now has actual consumers (EmlHnsw, ProgressiveEmlHnsw, PqEmlHnsw), each with a real hnsw_rs-backed search path + rerank. ## Landing 1. EmlHnsw wrapper (base, from fix/eml-hnsw-integration) - Projects vectors to the learned subspace on insert/search, keeps full-dim store for rerank, exposes search_with_rerank(query, k, fetch_k, ef). - Fixes the fundamental "no consumer" problem in PR #353's original crate. 2. Tier 1B — SimSIMD rerank kernel - cosine_distance_simd backed by simsimd::SpatialSimilarity - 5.65× speedup at d=128 (59.1 ns → 10.5 ns), 6.22× at d=384 - Recall unchanged (Δ = 0.002, f32-vs-f64 accumulation noise) - Benchmark: benches/rerank_kernel.rs 3. Tier 1C — retention-objective selector - EmlDistanceModel::train_for_retention: greedy forward selection that maximizes recall@target_k on held-out queries - SIFT1M result at selected_k=32, fetch_k=200: pearson selector: recall@10 = 0.712 retention selector: recall@10 = 0.817 (+0.105, >3σ at n=200) - Training 37× slower but offline/one-shot 4. Tier 3A — ProgressiveEmlHnsw [8, 32, 128] cascade - Multi-index coarsest→finest, union + exact cosine rerank - SIFT1M: recall@10 = 0.984 at 961 µs p50 vs single-index 0.974 at ~1950 µs (2.0× latency improvement at matched recall) - Build cost 5.9× baseline — read-heavy workloads only 5. Tier 3B — PqEmlHnsw (8 subspaces × 256 centroids) + corrector - 64× memory reduction (512 B → 8 B per vector) - SIFT1M: rerank@10 = 0.9515, clears the ≥0.80 tier target - k-means converged cleanly (10-19 iterations per subspace, 25-iter cap never bound) - PqDistanceCorrector kept advisory-only: normalization against global max_pq_dist saturates on SIFT's O(10⁵) distance scale (MSE 1.4e9 → 6.4e10). Does not hurt recall because final rank is exact cosine. ## Measured evidence (all on ruvultra) See docs/adr/ADR-151-eml-hnsw-selected-dims.md for full context, acceptance criteria, and per-tier commit SHAs. Per-PR measured numbers are in GitHub issue #351 and PR #353 discussion. ## NOT included from PR #353 - EmlDistanceModel::fast_distance (EML tree per call): 2.35× SLOWER than scalar baseline on ruvultra. Kept as reference impl; not on any search path. See ADR-151 §Rejected Surface. - AdaptiveEfModel: 290 ns/query actual vs 3 ns claimed. Rejected until a <20 ns predictor is demonstrated. - Sliced Wasserstein rerank (Tier 2 experiment): 50.9× slower AND 38.1 pp worse than cosine rerank on SIFT. Cleanly falsified for gradient- histogram datasets. Documented in ADR-151 closed open-questions. ## Surface area - Default RuVector retrieval paths unchanged. - HnswIndex::new() and DbOptions::default() untouched. - EmlHnsw / ProgressiveEmlHnsw / PqEmlHnsw are explicitly constructed by callers opting into the approximate-then-exact pipeline. Co-Authored-By: swarm-coder <swarm@ruv.net> Co-Authored-By: Mathew Beane (aepod) <124563+aepod@users.noreply.github.com> Co-Authored-By: Ofer Shaal (shaal) <22901+shaal@users.noreply.github.com>

…ence Primary artifact for PR #356. Documents: - PR #353 claims vs measured reality on ruvultra (AMD 9950X) - v2 accepted surface (EmlHnsw, ProgressiveEmlHnsw, PqEmlHnsw, retention selector, SimSIMD rerank) - Rejected surface (fast_distance, AdaptiveEfModel, Sliced Wasserstein) - 6-tier swarm results: 4 passes, 1 clean falsification - SOTA v3 scope: 4-agent swarm in progress - Open questions with current status Co-Authored-By: Mathew Beane (aepod) <124563+aepod@users.noreply.github.com> Co-Authored-By: Ofer Shaal (shaal) <22901+shaal@users.noreply.github.com>

ruvnet · 2026-04-16T19:20:23Z

v3 update (branch feat/eml-hnsw-optimizations-v3)

Merge of four SOTA tiers on top of v2 (dac6f60e). v3 tip: 1fa28216.

Tier landing

tier	landed	measured on merged v3
SOTA-A PQ-native HNSW (+ OPQ)	yes	rerank@10 = 0.9510 @ 8 B/vec (64× payload reduction vs 512 B legacy). p50 rerank = 371.6 µs (2.56× faster than legacy 952 µs). OPQ gives no measurable gain on SIFT-native basis — kept as documented null result.
SOTA-B parallel rerank + 1M benchmarks + hnswlib baseline	yes (reframed)	parallel rerank = 1.10× serial (overhead-bound on SIFT128 × fetch_k=500). Plain `hnsw_rs` DistCosine @ 1M hits recall=0.9525 @ QPS=893 (ef=100); EmlHnsw selected_k=48 + fetch_k=500 plateaus at 0.8159 across all ef_search.
SOTA-C corrector local-scale fix (promoted) + beam selector (falsified)	yes (partial)	corrector held-out MSE −60.5% (1.397e9 → 5.52e8), non-advisory path wired into `search_with_rerank` (15×k pre-truncation). Beam selector gives +0.0065 recall over greedy at 4.39× training cost — inside SE ≈ 0.027, not promoted.
SOTA-D `HnswIndex::new_with_selected_dims()` in ruvector-core	yes	4 new integration tests passing (`crates/ruvector-core/tests/hnsw_selected_dims.rs`). Selected-dim prefilter now first-class in core without `ruvector-eml-hnsw` dependency.

Retention selector A/B (SIFT1M, selected_k=32)

selector	recall@10	train cost
pearson	0.7125	1.02 s
retention_greedy (v3 default)	0.8165	39.8 s
retention_beam (beam=4)	0.8230	174.7 s

Greedy wins: +10.4 pp over pearson. Beam gain is inside noise.

Honest reframe

Plain hnsw_rs beats EmlHnsw''s reduced-dim prefilter at 1M scale on both recall and QPS at matched HNSW config (m=16, ef_construction=200). The v3 speed story does not survive full-corpus scaling.

What v3 does deliver:

Memory win (PQ-native): 64× graph payload reduction (512 → 8 B/vec) with a 0.65 pp gain in rerank-recall over the legacy PQ path. The PQ-native HNSW now stores only u8 codes and computes asymmetric distances at query time via PqAsymmetricDistance.
Integration win (ruvector-core): HnswIndex::new_with_selected_dims() is first-class in core, with 4 new integration tests. Closes ADR-151 Q7.
Selector quality win: retention-greedy is a measurable +10.4 pp recall improvement over pearson and is the v3 default.
Corrector fix: the SOTA-C local-scale corrector fixes the global-max bug and is now non-advisory (pre-rerank truncation to rerank_k = 15·k). Closes ADR-151 Q6.

Clean falsifications (kept in repo)

OPQ on SIFT-native corpus — test stays (sift1m_opq.rs) as null result
Rayon parallel rerank — 1.10× only, kept for wider-embedding future; commit text marks it modest
Beam selector — 0.65 pp over greedy inside SE, not promoted as default

Files

ADR evidence: docs/adr/ADR-151-eml-hnsw-selected-dims.md §v3 SOTA Evidence
PQ-native: crates/ruvector-eml-hnsw/src/pq_hnsw.rs, src/opq.rs
Corrector fix: crates/ruvector-eml-hnsw/src/pq_corrector.rs
Retention-greedy selector: crates/ruvector-eml-hnsw/src/cosine_decomp.rs::train_for_retention_beam
ruvector-core API: crates/ruvector-core/src/index/hnsw_selected.rs, crates/ruvector-core/tests/hnsw_selected_dims.rs
Merge commits: a842f0d5 (D) → 6a797c08 (A) → e13de438 (C) → 54483c45 (B) → 1fa28216 (ADR)

Readiness

v3 is ready for review. All 93 lib + 4 new core integration tests green on merged branch. Recommend reading ADR-151 §v3 SOTA Evidence first — it carries the honest-reframe framing this comment summarizes.

aepod and others added 10 commits March 24, 2026 12:34

chore: Update NAPI-RS binaries for all platforms

c504a29

Built from commit b2347ce Platforms updated: - linux-x64-gnu - linux-arm64-gnu - darwin-x64 - darwin-arm64 - win32-x64-msvc 🤖 Generated by GitHub Actions

chore: Update NAPI-RS binaries for all platforms

5156ceb

Built from commit 2adb949 Platforms updated: - linux-x64-gnu - linux-arm64-gnu - darwin-x64 - darwin-arm64 - win32-x64-msvc 🤖 Generated by GitHub Actions

chore: Update NAPI-RS binaries for all platforms

b12db45

Built from commit 3b173a9 Platforms updated: - linux-x64-gnu - linux-arm64-gnu - darwin-x64 - darwin-arm64 - win32-x64-msvc 🤖 Generated by GitHub Actions

ruvnet force-pushed the feat/eml-hnsw-optimizations-v2 branch from 0ade479 to db1c58b Compare April 16, 2026 18:02

This was referenced Apr 16, 2026

feat: EML-enhanced HNSW — 6 learned optimizations (10-30x distance, 2-5x search) #353

Open

EML Operator-Inspired Optimizations: Log Quantization, Unified Distance, EML Trees #351

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eml-hnsw): v2 integrated pipeline — retention selector + SIMD rerank + PQ + progressive cascade (supersedes #353)#356

feat(eml-hnsw): v2 integrated pipeline — retention selector + SIMD rerank + PQ + progressive cascade (supersedes #353)#356
ruvnet wants to merge 11 commits intomainfrom
feat/eml-hnsw-optimizations-v2

ruvnet commented Apr 16, 2026 •

edited

Loading

Uh oh!

ruvnet commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ruvnet commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Credit

Supersedes #353

What's in v2

What's NOT in v2 (and why)

Test surface

Coupling with #352

Surface area and compatibility

References

Uh oh!

ruvnet commented Apr 16, 2026

v3 update (branch feat/eml-hnsw-optimizations-v3)

Tier landing

Retention selector A/B (SIFT1M, selected_k=32)

Honest reframe

Clean falsifications (kept in repo)

Files

Readiness

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ruvnet commented Apr 16, 2026 •

edited

Loading