
feat(eml-hnsw): v2 integrated pipeline — retention selector + SIMD rerank + PQ + progressive cascade (supersedes #353) #356

Open: ruvnet wants to merge 11 commits into main from feat/eml-hnsw-optimizations-v2

@ruvnet (Owner) commented Apr 16, 2026

Credit

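This work builds directly on two outstanding upstream contributions: @aepod's PR #353 (the original EML-HNSW learned models) and @shaal's PR #352 (unified SIMD kernel + QuantizationConfig::Log).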

Both authors are credited as Co-Authored-By: on the merged commit, and every piece of measured evidence below is traceable to one or both of their PRs.

Supersedes #353

Rewrites the EML-HNSW contribution into a working integrated pipeline with measured SIFT1M numbers. The original PR shipped six standalone learned models but had no downstream consumer — the ruvector-eml-hnsw crate compiled but its code never reached any RuVector HNSW path. This branch closes that gap and folds in the winning results from a six-experiment swarm run on ruvultra (AMD Ryzen 9 9950X / 32T / 123 GB) against real SIFT1M.

What's in v2

| Component | Source tier | Measured result on SIFT1M |
|---|---|---|
| EmlHnsw wrapper around hnsw_rs::Hnsw + search_with_rerank | fix/eml-hnsw-integration | baseline; unlocks every result below |
| SimSIMD rerank kernel (cosine_distance_simd), after @shaal's PR #352 kernel | Tier 1B | 5.65× @ d=128, 6.22× @ d=384; recall unchanged |
| EmlDistanceModel::train_for_retention (greedy forward selection) | Tier 1C | +10.5 pp recall@10 vs @aepod's Pearson selector (0.712 → 0.817), > 3σ |
| ProgressiveEmlHnsw [8, 32, 128] multi-level cascade, using @aepod's ProgressiveDistance | Tier 3A | 0.984 recall@10 at 961 µs p50 (2.0× lower latency than the single index at matched recall; 5.9× build cost) |
| PqEmlHnsw 8×256 Product Quantizer paired with @aepod's PqDistanceCorrector | Tier 3B | 64× memory reduction (512 B → 8 B/vec); rerank recall 0.9515 ≥ 0.80 floor |
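
For readers new to the approximate-then-exact pattern behind search_with_rerank, here is a minimal sketch. It is not the crate's implementation: `search_with_rerank_sketch` is a hypothetical name, the pre-filter is a brute-force scan rather than an hnsw_rs graph search, and the scalar cosine stands in for cosine_distance_simd. It only illustrates why `fetch_k` has to be wide enough for the true neighbours to survive the reduced-dimension stage.

```rust
/// Cosine distance restricted to `dims`, the cheap pre-filter metric.
fn selected_cosine_distance(a: &[f32], b: &[f32], dims: &[usize]) -> f32 {
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for &d in dims {
        dot += a[d] * b[d];
        na += a[d] * a[d];
        nb += b[d] * b[d];
    }
    if na == 0.0 || nb == 0.0 {
        return 1.0; // treat a zero projection as orthogonal
    }
    1.0 - dot / (na.sqrt() * nb.sqrt())
}

/// Exact cosine distance (1 - cosine similarity) over every dimension.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dims: Vec<usize> = (0..a.len().min(b.len())).collect();
    selected_cosine_distance(a, b, &dims)
}

/// Hypothetical stand-in for the two-stage search (brute force, no HNSW).
fn search_with_rerank_sketch(
    base: &[Vec<f32>],
    query: &[f32],
    dims: &[usize],
    k: usize,
    fetch_k: usize,
) -> Vec<usize> {
    // Stage 1: rank everything by the reduced-dim distance, keep fetch_k.
    let mut candidates: Vec<(usize, f32)> = base
        .iter()
        .enumerate()
        .map(|(i, v)| (i, selected_cosine_distance(v, query, dims)))
        .collect();
    candidates.sort_by(|a, b| a.1.total_cmp(&b.1));
    candidates.truncate(fetch_k);

    // Stage 2: rerank the survivors with the exact full-dimension distance.
    let mut reranked: Vec<(usize, f32)> = candidates
        .into_iter()
        .map(|(i, _)| (i, cosine_distance(&base[i], query)))
        .collect();
    reranked.sort_by(|a, b| a.1.total_cmp(&b.1));
    reranked.into_iter().take(k).map(|(i, _)| i).collect()
}
```

With a tight `fetch_k`, a true neighbour that looks mediocre in the reduced subspace never reaches stage 2, which is the failure mode the rerank stage cannot repair on its own.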

What's NOT in v2 (and why)

  • EmlDistanceModel::fast_distance (EML tree per call): measured 2.35× slower than scalar baseline. Kept as reference impl; not on any query-time path. This matches @aepod's own Stage-1 finding on his test hardware.
  • AdaptiveEfModel: 290 ns/query actual overhead vs 3 ns claimed — too expensive to amortize against the ef-search work it would save.
  • Sliced Wasserstein rerank (Tier 2 experiment): 50.9× slower and 38.1 pp worse recall than cosine rerank on SIFT. Cleanly falsified for gradient-histogram datasets — documented as closed in ADR-151.
  • PqDistanceCorrector is kept but held advisory-only: under training on SIFT1M it increased MSE (1.4e9 → 6.4e10) because feature normalization against a global max_pq_dist saturates on SIFT's O(10⁵) distance scale. Final rank is exact cosine so this does not hurt recall. Noted in ADR-151 as a design flaw with a proposed fix direction (per-vector exact normalization).

Test surface

92 tests pass on the merged branch:

  • 85 unit tests across 10 modules (new: selected_distance, pq, pq_hnsw, progressive_hnsw, hnsw_integration; retained: all original ruvector-eml-hnsw tests from @aepod's PR #353)
  • 3 integration tests (recall_integration)
  • 4 SIFT1M real-data tests (env-gated; skipped in CI without the dataset): sift1m_real, retention_vs_pearson, progressive_sift1m, sift1m_pq
  • 1 micro-benchmark (benches/rerank_kernel.rs)
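
The recall@10 figures quoted throughout are set-overlap measurements against ground truth. As a reference point, a minimal sketch of that metric follows; `recall_at_k` is a hypothetical helper written for this note and is not claimed to match the test harness in the crate.

```rust
use std::collections::HashSet;

/// Mean fraction of the true top-k neighbours that appear in the returned top-k.
/// `results[i]` and `ground_truth[i]` are neighbour ids for query i, best first.
fn recall_at_k(results: &[Vec<usize>], ground_truth: &[Vec<usize>], k: usize) -> f64 {
    assert_eq!(results.len(), ground_truth.len());
    let mut hits = 0usize;
    let mut total = 0usize;
    for (res, gt) in results.iter().zip(ground_truth) {
        let truth: HashSet<usize> = gt.iter().take(k).copied().collect();
        // Assumes each result list contains no duplicate ids.
        hits += res.iter().take(k).filter(|id| truth.contains(*id)).count();
        total += truth.len();
    }
    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64
    }
}
```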

Reproducibility recipe (on any Linux box with rustc ≥ 1.80):

# One-time: fetch SIFT1M (Texmex, ~400MB)
mkdir -p bench_data && cd bench_data
curl -fLO ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz && tar xzf sift.tar.gz
cd ..

# Full SIFT1M test suite
B=$(pwd)/bench_data/sift
export RUVECTOR_EML_SIFT1M_BASE=$B/sift_base.fvecs \
       RUVECTOR_EML_SIFT1M_QUERY=$B/sift_query.fvecs \
       RUVECTOR_EML_SIFT1M_LEARN=$B/sift_learn.fvecs \
       RUVECTOR_EML_SIFT1M_GT=$B/sift_groundtruth.ivecs
cargo test --release -p ruvector-eml-hnsw -- --nocapture
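
The environment variables above point at TEXMEX `.fvecs` files, in which every vector is stored as a little-endian 32-bit dimension count followed by that many little-endian f32 values. A minimal parser sketch is shown below for anyone inspecting the data by hand; `parse_fvecs` is a hypothetical name and is not claimed to be the loader the gated tests use.

```rust
/// Parse the raw bytes of a `.fvecs` file into one Vec<f32> per vector.
/// Layout per record: i32 dimension (little-endian), then `dim` f32 values.
fn parse_fvecs(bytes: &[u8]) -> Result<Vec<Vec<f32>>, String> {
    let mut out = Vec::new();
    let mut pos = 0usize;
    while pos < bytes.len() {
        if pos + 4 > bytes.len() {
            return Err("truncated dimension header".to_string());
        }
        let dim = i32::from_le_bytes([bytes[pos], bytes[pos + 1], bytes[pos + 2], bytes[pos + 3]]);
        if dim <= 0 {
            return Err(format!("invalid dimension {}", dim));
        }
        let dim = dim as usize;
        pos += 4;
        let end = pos + 4 * dim;
        if end > bytes.len() {
            return Err("truncated vector payload".to_string());
        }
        let vector: Vec<f32> = bytes[pos..end]
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        out.push(vector);
        pos = end;
    }
    Ok(out)
}
```

The bytes can come from `std::fs::read(path)`. The ground-truth `.ivecs` file uses the same record layout with i32 payloads instead of f32.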

Coupling with #352

@shaal's PR #352 (unified SIMD kernel + QuantizationConfig::Log) is strictly additive over this branch. Landing both captures the full effect: #352 accelerates the inner distance kernel, this branch adds the pre-filter stage that makes wide fetch_k viable. See issue #351 for the cross-PR measurements.

Surface area and compatibility

  • DbOptions::default() behavior unchanged.
  • HnswIndex::new(...) and all existing RuVector retrieval paths unchanged.
  • EmlHnsw / ProgressiveEmlHnsw / PqEmlHnsw are explicitly constructed by callers opting into the approximate-then-exact pipeline.

References

Closes #353 on merge. Cc @aepod @shaal for review — your work drove every measured result in this PR.

aepod and others added 10 commits March 24, 2026 12:34
The execute_match() function previously collapsed all match results into
a single ExecutionContext via context.bind(), which overwrote previous
bindings. MATCH (n:Person) on 3 Person nodes returned only 1 row.

This commit refactors the executor to use a ResultSet pipeline:
- type ResultSet = Vec<ExecutionContext>
- Each clause transforms ResultSet → ResultSet
- execute_match() expands the set (one context per match)
- execute_return() projects one row per context
- execute_set/delete() apply to all contexts
- Cross-product semantics for multiple patterns in one MATCH
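
The ResultSet pipeline described above can be pictured with a small sketch. Everything here is a simplification assumed for illustration: `ExecutionContext` is reduced to a map from variable name to node id, and the function signatures are hypothetical rather than the executor's real ones.

```rust
use std::collections::BTreeMap;

// Hypothetical simplification: a context binds variable names to node ids.
type ExecutionContext = BTreeMap<String, u64>;
type ResultSet = Vec<ExecutionContext>;

/// Expand the set: one output context per (input context, matching node) pair.
/// Chaining two calls yields cross-product semantics for multiple patterns.
fn execute_match(input: &[ExecutionContext], var: &str, matches: &[u64]) -> ResultSet {
    let mut out = Vec::with_capacity(input.len() * matches.len());
    for ctx in input {
        for &node in matches {
            let mut next = ctx.clone(); // clone instead of overwriting a shared binding
            next.insert(var.to_string(), node);
            out.push(next);
        }
    }
    out
}

/// Project one row per context.
fn execute_return(input: &[ExecutionContext], var: &str) -> Vec<Option<u64>> {
    input.iter().map(|ctx| ctx.get(var).copied()).collect()
}
```

Because each match clones its parent context, three matching nodes give three rows instead of one row holding the last binding.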

Also adds comprehensive tests:
- test_match_returns_multiple_rows (the Issue #269 regression)
- test_match_return_properties (verify correct values per row)
- test_match_where_filter (WHERE correctly filters multi-row)
- test_match_single_result (1 match → 1 row, no regression)
- test_match_no_results (0 matches → 0 rows)
- test_match_many_nodes (100 nodes → 100 rows, stress test)

Co-Authored-By: claude-flow <ruv@ruv.net>
RETURN n.name now produces column "n.name" instead of "?column?".
Property expressions (Expression::Property) are formatted as
"object.property" for column naming, matching standard Cypher behavior.

Co-Authored-By: claude-flow <ruv@ruv.net>
  Built from commit b2347ce

  Platforms updated:
  - linux-x64-gnu
  - linux-arm64-gnu
  - darwin-x64
  - darwin-arm64
  - win32-x64-msvc

  🤖 Generated by GitHub Actions
  Built from commit 2adb949

  Platforms updated:
  - linux-x64-gnu
  - linux-arm64-gnu
  - darwin-x64
  - darwin-arm64
  - win32-x64-msvc

  🤖 Generated by GitHub Actions
Phase 2 of the ruvector remediation plan. Replaces simulated benchmarks
with real measurements:

- Python harness: hnswlib (C++) and numpy brute-force on same datasets
- Rust test: ruvector-core HNSW with ground-truth recall measurement
- Datasets: random-10K and random-100K, 128 dimensions
- Metrics: QPS (p50/p95), recall@10 vs ground truth, memory, build time

Key findings:
- ruvector recall@10 is good: 98.3% (10K), 86.75% (100K)
- ruvector QPS is 2.6-2.9x slower than hnswlib
- ruvector build time is 2.2-5.9x slower than hnswlib
- ruvector uses ~523MB for 100K vectors (10x raw data size)
- All numbers are REAL — no hardcoded values, no simulation

Co-Authored-By: claude-flow <ruv@ruv.net>
  Built from commit 3b173a9

  Platforms updated:
  - linux-x64-gnu
  - linux-arm64-gnu
  - darwin-x64
  - darwin-arm64
  - win32-x64-msvc

  🤖 Generated by GitHub Actions
New crate: ruvector-eml-hnsw (6 modules, 93 tests)
Patch: hnsw_rs/src/eml_distance.rs (integrated implementations)

1. Cosine Decomposition (EmlDistanceModel) — 10-30x distance speed
   Learns which dimensions discriminate, reduces O(384) to O(k)

2. Progressive Dimensionality (ProgressiveDistance) — 5-20x search
   Layer 2: 8-dim, Layer 1: 32-dim, Layer 0: full-dim

3. Adaptive ef (AdaptiveEfModel) — 1.5-3x search speed
   Per-query beam width from (norm, variance, graph_size, max_component)

4. Search Path Prediction (SearchPathPredictor) — 2-5x search
   K-means query regions → cached entry points, skip top-layer traversal

5. Rebuild Cost Prediction (RebuildPredictor) — operational efficiency
   Predicts recall degradation, triggers rebuild only when needed

6. PQ Distance Correction (PqDistanceCorrector) — DiskANN recall
   Learns PQ approximation error correction from exact/PQ pairs

All backward compatible — untrained models fall back to standard behavior.
Based on: Odrzywolel 2026, arXiv:2603.21852v2

Co-Authored-By: claude-flow <ruv@ruv.net>
Stage 1: micro-benchmarks (cosine decomp, adaptive ef, path prediction,
rebuild prediction) — raw 16d L2 proxy is 9.3x faster than full 128d
cosine, but EML model overhead makes fast_distance 2.1x slower.

Stage 2: synthetic e2e (10K x 128d) — recall@10 drops to 0.1% on
uniform random data because all dimensions are equally important.
EML decomposition needs structured embeddings to work.

Stage 3: real dataset — deferred, SIFT1M not available. Infrastructure
in place to auto-run when dataset is downloaded.

Stage 4: hypothesis test — DISPROVEN on random data (Spearman rho=0.013
vs required 0.95). Expected: uniform random has no discriminative
dimensions. Real embeddings with PCA structure should score higher.

Honest results: dimension reduction mechanism works, but EML model
inference overhead and random-data limitations are documented clearly.
Following shaal's methodology from PR #352.

Co-Authored-By: claude-flow <ruv@ruv.net>
PR #353 added 6 standalone learned models but no consumer, so the selected-dims
approach never reached any index. This commit closes that gap:

- selected_distance.rs: plain cosine over learned dim subset (the corrected
  runtime path; the original fast_distance evaluated the EML tree per call and
  was 2.1x SLOWER than baseline, confirmed on ruvultra AMD 9950X).
- hnsw_integration.rs: EmlHnsw wraps hnsw_rs::Hnsw, projects vectors to the
  learned subspace on add/search, keeps full-dim store for optional rerank.
- tests/recall_integration.rs: end-to-end synthetic validation
  (rerank recall@10 >= 0.83 on structured data).
- tests/sift1m_real.rs: Stage-3 gated real-data harness.

Test counts: 70 unit + 3 recall_integration + 1 SIFT1M gated + 3 doctests
(vs PR #353 body claim of 93 unit tests; actual on pr-353 pre-fix was 60).

Stage-3 SIFT1M measured (50k base x 200 queries x 128d, selected_k=32, AMD 9950X):
  recall@10 reduced = 0.194    (PR #353 author expected ~0.85-0.95)
  recall@10 +rerank = 0.438    (fetch_k=50 too tight on real data)
  reduced HNSW p50  = 268.9 us
  reduced HNSW p95  = 361.8 us

Finding: the mechanism is viable as a candidate pre-filter but requires
(a) larger fetch_k (200-500), (b) SIMD-accelerated rerank (per PR #352), and
(c) training on many more than 500-1000 samples for real embeddings.
The synthetic ρ=0.958 claim does NOT reproduce on SIFT1M.
…rank + PQ + progressive cascade

Supersedes the original PR #353 contribution with the combined result of six
targeted experiments run on ruvultra (AMD Ryzen 9 9950X / 32T / 123 GB) against
real SIFT1M (50k base × 200 queries). Integration gap is closed — this crate now
has actual consumers (EmlHnsw, ProgressiveEmlHnsw, PqEmlHnsw), each with a
real hnsw_rs-backed search path + rerank.

## Landing

1. EmlHnsw wrapper (base, from fix/eml-hnsw-integration)
   - Projects vectors to the learned subspace on insert/search, keeps full-dim
     store for rerank, exposes search_with_rerank(query, k, fetch_k, ef).
   - Fixes the fundamental "no consumer" problem in PR #353's original crate.

2. Tier 1B — SimSIMD rerank kernel
   - cosine_distance_simd backed by simsimd::SpatialSimilarity
   - 5.65× speedup at d=128 (59.1 ns → 10.5 ns), 6.22× at d=384
   - Recall unchanged (Δ = 0.002, f32-vs-f64 accumulation noise)
   - Benchmark: benches/rerank_kernel.rs

3. Tier 1C — retention-objective selector
   - EmlDistanceModel::train_for_retention: greedy forward selection that
     maximizes recall@target_k on held-out queries
   - SIFT1M result at selected_k=32, fetch_k=200:
       pearson   selector: recall@10 = 0.712
       retention selector: recall@10 = 0.817   (+0.105, >3σ at n=200)
   - Training 37× slower but offline/one-shot

4. Tier 3A — ProgressiveEmlHnsw [8, 32, 128] cascade
   - Multi-index coarsest→finest, union + exact cosine rerank
   - SIFT1M: recall@10 = 0.984 at 961 µs p50 vs single-index 0.974 at ~1950 µs
     (2.0× latency improvement at matched recall)
   - Build cost 5.9× baseline — read-heavy workloads only

5. Tier 3B — PqEmlHnsw (8 subspaces × 256 centroids) + corrector
   - 64× memory reduction (512 B → 8 B per vector)
   - SIFT1M: rerank@10 = 0.9515, clears the ≥0.80 tier target
   - k-means converged cleanly (10-19 iterations per subspace, 25-iter cap never bound)
   - PqDistanceCorrector kept advisory-only: normalization against global
     max_pq_dist saturates on SIFT's O(10⁵) distance scale (MSE 1.4e9 → 6.4e10).
     Does not hurt recall because final rank is exact cosine.
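
A sketch of the "multi-index coarsest→finest, union + exact cosine rerank" idea from item 4, under stated assumptions: each level is a brute-force scan over a prefix of a ranked dimension list rather than a separate hnsw_rs index, and `cascade_search`, `ranked_dims` and `per_level` are hypothetical names introduced here.

```rust
use std::collections::BTreeSet;

/// Cosine distance restricted to `dims`.
fn cosine_distance_on(a: &[f32], b: &[f32], dims: &[usize]) -> f32 {
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for &d in dims {
        dot += a[d] * b[d];
        na += a[d] * a[d];
        nb += b[d] * b[d];
    }
    if na == 0.0 || nb == 0.0 {
        return 1.0; // treat a zero projection as orthogonal
    }
    1.0 - dot / (na.sqrt() * nb.sqrt())
}

/// Query every level (e.g. 8, 32, 128 dims), union the candidates, rerank exactly.
fn cascade_search(
    base: &[Vec<f32>],
    query: &[f32],
    ranked_dims: &[usize], // dimensions ordered from most to least useful
    levels: &[usize],      // number of leading dims used at each level
    per_level: usize,      // candidates taken from each level
    k: usize,
) -> Vec<usize> {
    let mut candidates: BTreeSet<usize> = BTreeSet::new();
    for &n in levels {
        let dims = &ranked_dims[..n.min(ranked_dims.len())];
        let mut ids: Vec<usize> = (0..base.len()).collect();
        ids.sort_by(|&i, &j| {
            cosine_distance_on(&base[i], query, dims)
                .total_cmp(&cosine_distance_on(&base[j], query, dims))
        });
        candidates.extend(ids.into_iter().take(per_level));
    }
    // Exact rerank over the union, using every dimension.
    let all: Vec<usize> = (0..query.len()).collect();
    let mut out: Vec<usize> = candidates.into_iter().collect();
    out.sort_by(|&i, &j| {
        cosine_distance_on(&base[i], query, &all)
            .total_cmp(&cosine_distance_on(&base[j], query, &all))
    });
    out.truncate(k);
    out
}
```

The union lets a neighbour missed by the coarse levels still enter through a finer one, at the price of building and querying several indices, which is consistent with the higher build cost reported above.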

## Measured evidence (all on ruvultra)

See docs/adr/ADR-151-eml-hnsw-selected-dims.md for full context, acceptance
criteria, and per-tier commit SHAs. Per-PR measured numbers are in
GitHub issue #351 and PR #353 discussion.

## NOT included from PR #353

- EmlDistanceModel::fast_distance (EML tree per call): 2.35× SLOWER than
  scalar baseline on ruvultra. Kept as reference impl; not on any search
  path. See ADR-151 §Rejected Surface.
- AdaptiveEfModel: 290 ns/query actual vs 3 ns claimed. Rejected until a
  <20 ns predictor is demonstrated.
- Sliced Wasserstein rerank (Tier 2 experiment): 50.9× slower AND 38.1 pp
  worse than cosine rerank on SIFT. Cleanly falsified for gradient-
  histogram datasets. Documented in ADR-151 closed open-questions.

## Surface area

- Default RuVector retrieval paths unchanged.
- HnswIndex::new() and DbOptions::default() untouched.
- EmlHnsw / ProgressiveEmlHnsw / PqEmlHnsw are explicitly constructed by
  callers opting into the approximate-then-exact pipeline.

Co-Authored-By: swarm-coder <swarm@ruv.net>

Co-Authored-By: Mathew Beane (aepod) <124563+aepod@users.noreply.github.com>
Co-Authored-By: Ofer Shaal (shaal) <22901+shaal@users.noreply.github.com>
…ence

Primary artifact for PR #356. Documents:
- PR #353 claims vs measured reality on ruvultra (AMD 9950X)
- v2 accepted surface (EmlHnsw, ProgressiveEmlHnsw, PqEmlHnsw, retention selector, SimSIMD rerank)
- Rejected surface (fast_distance, AdaptiveEfModel, Sliced Wasserstein)
- 6-tier swarm results: 4 passes, 1 clean falsification
- SOTA v3 scope: 4-agent swarm in progress
- Open questions with current status

Co-Authored-By: Mathew Beane (aepod) <124563+aepod@users.noreply.github.com>
Co-Authored-By: Ofer Shaal (shaal) <22901+shaal@users.noreply.github.com>
@ruvnet (Owner, Author) commented Apr 16, 2026

v3 update (branch feat/eml-hnsw-optimizations-v3)

Merge of four SOTA tiers on top of v2 (dac6f60e). v3 tip: 1fa28216.

Tier landing

| tier | landed | measured on merged v3 |
|---|---|---|
| SOTA-A PQ-native HNSW (+ OPQ) | yes | rerank@10 = 0.9510 @ 8 B/vec (64× payload reduction vs 512 B legacy). p50 rerank = 371.6 µs (2.56× faster than legacy 952 µs). OPQ gives no measurable gain on SIFT-native basis; kept as documented null result. |
| SOTA-B parallel rerank + 1M benchmarks + hnswlib baseline | yes (reframed) | parallel rerank = 1.10× serial (overhead-bound on SIFT128 × fetch_k=500). Plain hnsw_rs DistCosine @ 1M hits recall=0.9525 @ QPS=893 (ef=100); EmlHnsw selected_k=48 + fetch_k=500 plateaus at 0.8159 across all ef_search. |
| SOTA-C corrector local-scale fix (promoted) + beam selector (falsified) | yes (partial) | corrector held-out MSE −60.5% (1.397e9 → 5.52e8), non-advisory path wired into search_with_rerank (15×k pre-truncation). Beam selector gives +0.0065 recall over greedy at 4.39× training cost, inside SE ≈ 0.027, not promoted. |
| SOTA-D HnswIndex::new_with_selected_dims() in ruvector-core | yes | 4 new integration tests passing (crates/ruvector-core/tests/hnsw_selected_dims.rs). Selected-dim prefilter now first-class in core without ruvector-eml-hnsw dependency. |

Retention selector A/B (SIFT1M, selected_k=32)

| selector | recall@10 | train cost |
|---|---|---|
| pearson | 0.7125 | 1.02 s |
| retention_greedy (v3 default) | 0.8165 | 39.8 s |
| retention_beam (beam=4) | 0.8230 | 174.7 s |

Greedy wins: +10.4 pp over pearson. Beam gain is inside noise.
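
To make "greedy forward selection" concrete, here is a minimal sketch of a retention-objective selector. It is an illustration under assumptions, not the crate's train_for_retention: it uses squared L2 on the selected dimensions as the reduced metric, brute-force ranking instead of an index, and scores on the training queries rather than held-out ones. `greedy_select` and its helpers are hypothetical names.

```rust
/// Squared L2 distance restricted to `dims` (assumed stand-in for the reduced metric).
fn dist_on(a: &[f32], b: &[f32], dims: &[usize]) -> f32 {
    dims.iter().map(|&d| (a[d] - b[d]) * (a[d] - b[d])).sum()
}

/// Ids of the `k` nearest base vectors to `q` under the reduced metric.
fn top_k(base: &[Vec<f32>], q: &[f32], dims: &[usize], k: usize) -> Vec<usize> {
    let mut ids: Vec<usize> = (0..base.len()).collect();
    ids.sort_by(|&i, &j| dist_on(&base[i], q, dims).total_cmp(&dist_on(&base[j], q, dims)));
    ids.truncate(k);
    ids
}

/// Fraction of true neighbours that survive in the reduced-dim top-`fetch_k`.
fn retention(
    base: &[Vec<f32>],
    queries: &[Vec<f32>],
    truth: &[Vec<usize>],
    dims: &[usize],
    fetch_k: usize,
) -> f64 {
    let (mut hits, mut total) = (0usize, 0usize);
    for (q, gt) in queries.iter().zip(truth) {
        let cand = top_k(base, q, dims, fetch_k);
        hits += gt.iter().filter(|id| cand.contains(*id)).count();
        total += gt.len();
    }
    hits as f64 / total.max(1) as f64
}

/// Add, one at a time, the dimension that most improves retention.
fn greedy_select(
    base: &[Vec<f32>],
    queries: &[Vec<f32>],
    k: usize,
    fetch_k: usize,
    selected_k: usize,
) -> Vec<usize> {
    let dim = base[0].len();
    let all: Vec<usize> = (0..dim).collect();
    // Exact neighbours under the full metric serve as the retention target.
    let truth: Vec<Vec<usize>> = queries.iter().map(|q| top_k(base, q, &all, k)).collect();
    let mut chosen: Vec<usize> = Vec::new();
    while chosen.len() < selected_k.min(dim) {
        let mut best: Option<(usize, f64)> = None;
        for d in 0..dim {
            if chosen.contains(&d) {
                continue;
            }
            chosen.push(d);
            let score = retention(base, queries, &truth, &chosen, fetch_k);
            chosen.pop();
            if best.map_or(true, |(_, s)| score > s) {
                best = Some((d, score));
            }
        }
        match best {
            Some((d, _)) => chosen.push(d),
            None => break,
        }
    }
    chosen
}
```

Each round re-evaluates retention once per remaining dimension, which is why a selector of this shape trains far more slowly than a single-pass correlation ranking. A beam variant would keep several partial subsets alive per round instead of one.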

Honest reframe

Plain hnsw_rs beats EmlHnsw's reduced-dim prefilter at 1M scale on both recall and QPS at matched HNSW config (m=16, ef_construction=200). The v3 speed story does not survive full-corpus scaling.

What v3 does deliver:

  1. Memory win (PQ-native): 64× graph payload reduction (512 → 8 B/vec) with a 0.65 pp gain in rerank-recall over the legacy PQ path. The PQ-native HNSW now stores only u8 codes and computes asymmetric distances at query time via PqAsymmetricDistance.
  2. Integration win (ruvector-core): HnswIndex::new_with_selected_dims() is first-class in core, with 4 new integration tests. Closes ADR-151 Q7.
  3. Selector quality win: retention-greedy is a measurable +10.4 pp recall improvement over pearson and is the v3 default.
  4. Corrector fix: the SOTA-C local-scale corrector fixes the global-max bug and is now non-advisory (pre-rerank truncation to rerank_k = 15·k). Closes ADR-151 Q6.
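
The 64× figure in item 1 follows from the storage arithmetic: a 128-dimension f32 vector takes 128 × 4 B = 512 B, while 8 one-byte codes take 8 B. The sketch below shows the general shape of product-quantizer encoding and asymmetric distance computation. It is an assumption-laden illustration: it uses squared L2 per subspace, and the `ProductQuantizer` struct and `asymmetric_distance` function are hypothetical, not the crate's PqAsymmetricDistance.

```rust
/// Squared L2 distance between two equally sized slices.
fn sq_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Hypothetical product quantizer: one codebook (at most 256 centroids) per subspace.
struct ProductQuantizer {
    sub_dim: usize,
    codebooks: Vec<Vec<Vec<f32>>>, // [subspace][centroid][sub_dim]
}

impl ProductQuantizer {
    /// Replace each sub-vector by the index of its nearest centroid (one u8 each).
    fn encode(&self, v: &[f32]) -> Vec<u8> {
        self.codebooks
            .iter()
            .enumerate()
            .map(|(s, book)| {
                debug_assert!(book.len() <= 256);
                let sub = &v[s * self.sub_dim..(s + 1) * self.sub_dim];
                let mut best = 0usize;
                let mut best_d = f32::INFINITY;
                for (c, centroid) in book.iter().enumerate() {
                    let d = sq_l2(sub, centroid);
                    if d < best_d {
                        best_d = d;
                        best = c;
                    }
                }
                best as u8
            })
            .collect()
    }

    /// Per-query table of distances from each query sub-vector to every centroid.
    fn distance_table(&self, q: &[f32]) -> Vec<Vec<f32>> {
        self.codebooks
            .iter()
            .enumerate()
            .map(|(s, book)| {
                let sub = &q[s * self.sub_dim..(s + 1) * self.sub_dim];
                book.iter().map(|c| sq_l2(sub, c)).collect()
            })
            .collect()
    }
}

/// Asymmetric distance: exact query vs quantized vector, via table lookups only.
fn asymmetric_distance(table: &[Vec<f32>], code: &[u8]) -> f32 {
    code.iter().enumerate().map(|(s, &c)| table[s][c as usize]).sum()
}
```

The table is built once per query, so each stored vector costs one lookup per subspace. The query itself is never quantized, which is what "asymmetric" refers to.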

Clean falsifications (kept in repo)

  • OPQ on SIFT-native corpus — test stays (sift1m_opq.rs) as null result
  • Rayon parallel rerank — 1.10× only, kept for wider-embedding future; commit text marks it modest
  • Beam selector — 0.65 pp over greedy inside SE, not promoted as default
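
The "inside SE" judgement can be checked with a back-of-envelope calculation. The sketch below assumes the quoted SE ≈ 0.027 is the binomial approximation sqrt(p(1−p)/n) over n = 200 queries; that assumption reproduces the quoted value but is not confirmed by the PR, and the function names are hypothetical.

```rust
/// Binomial standard error of a recall estimate over `n_queries` queries.
/// Assumption: each query is treated as an independent trial.
fn recall_standard_error(recall: f64, n_queries: usize) -> f64 {
    (recall * (1.0 - recall) / n_queries as f64).sqrt()
}

/// Recall gain expressed in units of the candidate's standard error.
fn gain_in_se(baseline: f64, candidate: f64, n_queries: usize) -> f64 {
    (candidate - baseline) / recall_standard_error(candidate, n_queries)
}
```

Under this approximation, the beam selector's gain over greedy (0.8165 → 0.8230) is well under one standard error, while the greedy gain over pearson (0.7125 → 0.8165) is more than three.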

Files

  • ADR evidence: docs/adr/ADR-151-eml-hnsw-selected-dims.md §v3 SOTA Evidence
  • PQ-native: crates/ruvector-eml-hnsw/src/pq_hnsw.rs, src/opq.rs
  • Corrector fix: crates/ruvector-eml-hnsw/src/pq_corrector.rs
  • Retention-greedy selector: crates/ruvector-eml-hnsw/src/cosine_decomp.rs::train_for_retention_beam
  • ruvector-core API: crates/ruvector-core/src/index/hnsw_selected.rs, crates/ruvector-core/tests/hnsw_selected_dims.rs
  • Merge commits: a842f0d5 (D) → 6a797c08 (A) → e13de438 (C) → 54483c45 (B) → 1fa28216 (ADR)

Readiness

v3 is ready for review. All 93 lib + 4 new core integration tests green on merged branch. Recommend reading ADR-151 §v3 SOTA Evidence first — it carries the honest-reframe framing this comment summarizes.
