📋 Status: Ready for maintainer review (2026-04-14)
Proposal fully implemented, benchmarked on real public ANN datasets, and committed to branch issue/eml on fork shaal/ruvector. PR #352 is open but this issue contains everything needed to evaluate the work end-to-end. All four proof reports live in bench_results/ on the branch.
One-line ask: merge the branch (or cherry-pick the relevant commits) if the SIFT1M real-data win and the honest failure-mode documentation meet the bar; reject with specific concerns if not.
TL;DR
Four optimizations inspired by Odrzywołek's "All elementary functions from a single operator" paper, validated through a rigorous four-stage proof chain that includes one self-disproof and one hypothesis-test failure — honestly reported.
Headline real-data result (the single most important number)
On SIFT1M (100K subset, the canonical ANN benchmark, 3 seeds): +14.0% QPS at ef=64, +0.75% Recall@1, −3.3% build time. Recall preserved or slightly improved across k=1/10/100. Full table below.
Honest caveat
On GloVe-100d: −10.4% QPS at ef=256 (recall preserved). The non-power-of-two 100D dimension causes SimSIMD's scalar tail-handling to execute per distance call. We tried zero-padding as a fix (v4) and it made the regression worse due to per-call padding overhead. The correct fix (pad-at-insert) is scoped as follow-up work; the API hook for it is already shipped.
Four-stage proof chain
1. v1 (bench_results/eml_proof_2026-04-14.md) — initial synthetic benchmarks; self-disproof of the hand-rolled branch-free kernel.
2. v2 (bench_results/eml_proof_2026-04-14_v2.md) — port compute() to SimSIMD, re-run synthetic benchmarks.
3. v3 (bench_results/eml_proof_2026-04-14_v3.md) — real-dataset ANN proof on SIFT1M and GloVe-100d.
4. v4 (bench_results/eml_proof_2026-04-14_v4.md) — zero-padding hypothesis test on GloVe (disproof).
The Proposal
Four optimizations derived from the proof that eml(x, y) = exp(x) − ln(y), together with the constant 1, is functionally complete for all elementary mathematical functions — analogous to NAND gates in Boolean logic. The paper's identities we use:
- eml(x, 1) = exp(x) — exponential
- eml(0, y) = 1 − ln(y) — shifted logarithm
- eml(ln(a), exp(b)) = a − b — subtraction
Optimization 1 — LogQuantized (Logarithmic Scalar Quantization)
Problem. ScalarQuantized maps values linearly to [0, 255]. Real embedding distributions (transformer outputs, ReLU activations, frequency features) are non-uniform — values cluster near zero with heavy tails. Linear mapping wastes bits on the sparse tail.
Solution. Apply ln(x − min + 1) before uniform quantization and exp(q) − 1 + min to reconstruct. Same 4× compression as scalar, finer granularity where values are dense.
Implementation. src/quantization.rs — same u8 storage as ScalarQuantized, same SIMD-accelerated distance path.
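The transform pair can be sketched as follows. This is illustrative standalone code under assumed conventions (function names are hypothetical, not the actual LogQuantized API):

```rust
// Logarithmic scalar quantization sketch: ln(x − min + 1) before uniform
// u8 quantization, exp(q) − 1 + min to reconstruct.
fn log_quantize(values: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Guard against a degenerate all-equal input (log_max would be 0).
    let log_max = (max - min + 1.0).ln().max(f32::EPSILON);
    let codes = values
        .iter()
        // Dense values near `min` get a disproportionate share of the
        // 256 codes; the sparse tail is compressed.
        .map(|&x| (((x - min + 1.0).ln() / log_max) * 255.0).round() as u8)
        .collect();
    (codes, min, log_max)
}

fn log_dequantize(codes: &[u8], min: f32, log_max: f32) -> Vec<f32> {
    codes
        .iter()
        .map(|&q| ((q as f32 / 255.0) * log_max).exp() - 1.0 + min)
        .collect()
}

fn main() {
    // Heavy-tailed toy input: clustered near zero, one large outlier.
    let vals = vec![0.0_f32, 0.1, 0.2, 0.5, 1.0, 4.0];
    let (codes, min, log_max) = log_quantize(&vals);
    let recon = log_dequantize(&codes, min, log_max);
    for (a, b) in vals.iter().zip(&recon) {
        assert!((a - b).abs() < 0.05, "reconstruction error too large");
    }
    println!("round-trip ok for {} values", vals.len());
}
```

Same u8 storage cost as linear scalar quantization; only the code-to-value mapping changes.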
Optimization 2 — UnifiedDistanceParams (Branch-free Distance Kernel)
Problem. distance() dispatches via match metric { … } on every call. Inside HNSW search, this match executes thousands of times per query.
Solution. Pre-resolve the metric once into boolean flags (normalize, use_abs_diff, use_sq_diff, apply_sqrt, negate). The inner loop then evaluates a predictable if/else if cascade on those flags, which branch-predicts perfectly for a given index.
Implementation (src/advanced/eml.rs). After v1's disproof, compute() now dispatches directly to simsimd::SpatialSimilarity::{cosine, sqeuclidean, dot} — same hand-tuned AVX2/NEON kernels as the baseline.
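The flag-resolution idea, in a simplified scalar form. The field names follow this issue, but the reference loop below is illustrative — the shipped `compute()` dispatches to SimSIMD kernels instead:

```rust
// Metric pre-resolved once into flags; no `match metric` in the hot loop.
struct UnifiedDistanceParams {
    use_abs_diff: bool, // Manhattan
    use_sq_diff: bool,  // Euclidean
    apply_sqrt: bool,
    negate: bool,       // dot product scored as a distance
}

impl UnifiedDistanceParams {
    fn euclidean() -> Self {
        Self { use_abs_diff: false, use_sq_diff: true, apply_sqrt: true, negate: false }
    }
    fn manhattan() -> Self {
        Self { use_abs_diff: true, use_sq_diff: false, apply_sqrt: false, negate: false }
    }
    fn dot_product() -> Self {
        Self { use_abs_diff: false, use_sq_diff: false, apply_sqrt: false, negate: true }
    }

    // The flags are fixed for the lifetime of an index, so these branches
    // are perfectly predictable inside HNSW search.
    fn compute(&self, a: &[f32], b: &[f32]) -> f32 {
        let mut acc = 0.0_f32;
        for (x, y) in a.iter().zip(b) {
            let d = x - y;
            if self.use_sq_diff {
                acc += d * d;
            } else if self.use_abs_diff {
                acc += d.abs();
            } else {
                acc += x * y;
            }
        }
        let mut out = if self.apply_sqrt { acc.sqrt() } else { acc };
        if self.negate {
            out = -out;
        }
        out
    }
}

fn main() {
    let (a, b) = ([0.0_f32, 3.0], [4.0_f32, 0.0]);
    assert!((UnifiedDistanceParams::euclidean().compute(&a, &b) - 5.0).abs() < 1e-6);
    assert!((UnifiedDistanceParams::manhattan().compute(&a, &b) - 7.0).abs() < 1e-6);
    assert!((UnifiedDistanceParams::dot_product().compute(&[1.0, 2.0], &[3.0, 4.0]) + 11.0).abs() < 1e-6);
    println!("flag-resolved kernel matches the closed-form distances");
}
```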
Optimization 3 — EmlTree / EmlModel (Learned-Index CDF Approximator)
Problem. RecursiveModelIndex uses linear models for CDF approximation; real distributions are rarely linear.
Solution. Replace linear model with a trainable binary tree of eml() nodes. The paper's functional-completeness proof guarantees any elementary CDF can be represented.
Evidence. On y = exp(x), a depth-1 EML tree converges in 12 iterations to a test-MSE that rounds to 0.000000 at six decimal places, vs the linear model's 0.194 — roughly a 2-billion-fold reduction.
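Why a single node can drive the error to zero on y = exp(x): with its second input pinned to the constant 1, the node reduces to the exponential identity from the paper. A hypothetical depth-1 model (the parameter layout is illustrative, not the shipped EmlTree):

```rust
fn eml(x: f64, y: f64) -> f64 {
    x.exp() - y.ln()
}

// Depth-1 "tree": one eml node with affine inputs,
// model(x) = eml(w1*x + b1, w2*x + b2).
fn model(x: f64, (w1, b1, w2, b2): (f64, f64, f64, f64)) -> f64 {
    eml(w1 * x + b1, w2 * x + b2)
}

fn main() {
    // With the second argument held at the constant 1, the node is
    // model(x) = eml(x, 1) = exp(x) exactly — hence test-MSE 0.
    let p = (1.0, 0.0, 0.0, 1.0);
    let mse: f64 = (0..100)
        .map(|i| {
            let x = -2.0 + 4.0 * (i as f64) / 99.0;
            let err = model(x, p) - x.exp();
            err * err
        })
        .sum::<f64>()
        / 100.0;
    // No linear model a*x + c can match this: exp(x) is strictly convex,
    // so any line has irreducible error on the interval.
    assert!(mse < 1e-20);
    println!("depth-1 eml tree MSE on exp(x): {mse:e}");
}
```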
Optimization 4 — EmlScoreFusion (Non-linear Hybrid Search Scoring)
Problem. Linear fusion α·vector + (1−α)·bm25 can't capture non-linear interactions.
Solution. eml(a·vector + b, c·bm25 + d) — captures the real-world pattern that vector similarity has exponential impact while keyword matching has logarithmic impact. Measured 3.16× asymmetry in tests/eml_proof.rs (PROOF 5).
Overhead. 2.5 ns/pair absolute (4.3× slower than linear). Negligible for top-K hybrid scoring.
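A sketch of the fusion shape. The parameter values below are illustrative placeholders, not the tuned ones behind the 3.16× figure, and the sign convention is simplified:

```rust
fn eml(x: f64, y: f64) -> f64 {
    x.exp() - y.ln()
}

// Hypothetical fused score: eml(a·vector + b, c·bm25 + d).
fn fused(vector_sim: f64, bm25: f64) -> f64 {
    let (a, b, c, d) = (1.0, 0.0, 1.0, 1.0); // illustrative, untuned
    eml(a * vector_sim + b, c * bm25 + d)
}

fn main() {
    // Exponential sensitivity to vector similarity...
    let dv = (fused(1.0, 5.0) - fused(0.9, 5.0)).abs();
    // ...versus logarithmic sensitivity to the BM25 term.
    let dk = (fused(0.9, 6.0) - fused(0.9, 5.0)).abs();
    // The asymmetry linear fusion cannot express: an equal-sized input
    // change moves the score more through the vector channel.
    assert!(dv > dk);
    println!("Δvector = {dv:.3}, Δbm25 = {dk:.3}");
}
```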
Evidence Detail
Micro-evidence (tests/eml_proof.rs — 8 integration tests, all passing)
| Claim | Measurement |
|---|---|
| LogQuantized MSE (half-normal / ReLU) | 20.9% lower than Scalar |
| LogQuantized MSE (exponential) | 52.5% lower than Scalar |
| LogQuantized MSE (log-normal) | 5.1% lower than Scalar |
| Unified distance correctness | max diff < 3.34e-6 across 1,710 vector pairs |
| EML tree on exp(x) | ~2 billion× lower test-MSE than linear model |
| EML score fusion asymmetry | 3.16× vector/keyword ratio |
| Zero-padding semantic neutrality | proven for Euclidean / Cosine / DotProduct / Manhattan |
End-to-End real-data evidence (SIFT1M — the headline win)
Dataset: SIFT1M from Texmex corpus, 100K base × 500 queries × 128D, Euclidean.
HNSW: M=16, ef_construction=200. Three seeds [42, 1337, 2024].
| Metric | Baseline (Scalar+Dispatch) | LogQuantized (Log+Unified SIMD) | Delta |
|---|---|---|---|
| QPS ef=64 | 2,946 ± 300 | 3,358 ± 262 | +14.0% ⭐ |
| QPS ef=256 | 1,604 ± 63 | 1,653 ± 94 | +3.0% |
| Recall@1 (ef=256) | 0.9807 ± 0.0057 | 0.9880 ± 0.0016 | +0.75% ⭐ |
| Recall@10 | 0.9843 | 0.9853 | +0.1% (~same) |
| Recall@100 | 0.9785 | 0.9785 | ~0% |
| Build time | 39.0 s | 37.7 s | −3.3% |
| p50 ef=64 | 337 µs | 298 µs | −11.5% (better) |
| p95 ef=64 | 480 µs | 417 µs | −13.1% (better) |
| p99 ef=64 | 538 µs | 504 µs | −6.3% (better) |
| p99.9 ef=64 | 658 µs ± 147 | 882 µs ± 551 | +34.0% (high stddev) |
| p50 ef=256 | 618 µs | 591 µs | −4.4% |
| p95/p99 ef=256 | — | — | ≈baseline ±10% |
| p99.9 ef=256 | 1,190 µs | 1,628 µs | +36.9% (tail noise) |
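For reference, the Texmex .fvecs container holding these vectors is simple: each vector is a little-endian i32 dimension followed by that many little-endian f32 components. A minimal standalone parser sketch (illustrative; the branch ships its own loader in benches/eml_end_to_end.rs):

```rust
// Parse an in-memory .fvecs buffer. Assumes well-formed input:
// [i32 dim][dim × f32] repeated, all little-endian.
fn parse_fvecs(bytes: &[u8]) -> Vec<Vec<f32>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i + 4 <= bytes.len() {
        let d = i32::from_le_bytes(bytes[i..i + 4].try_into().unwrap()) as usize;
        i += 4;
        let mut v = Vec::with_capacity(d);
        for _ in 0..d {
            v.push(f32::from_le_bytes(bytes[i..i + 4].try_into().unwrap()));
            i += 4;
        }
        out.push(v);
    }
    out
}

fn main() {
    // Encode [1.0, 2.0] and [3.0] in .fvecs layout, then parse back.
    let mut bytes = Vec::new();
    for v in [vec![1.0_f32, 2.0], vec![3.0_f32]] {
        bytes.extend_from_slice(&(v.len() as i32).to_le_bytes());
        for x in &v {
            bytes.extend_from_slice(&x.to_le_bytes());
        }
    }
    let parsed = parse_fvecs(&bytes);
    assert_eq!(parsed, vec![vec![1.0, 2.0], vec![3.0]]);
    println!("parsed {} vectors", parsed.len());
}
```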
End-to-End real-data evidence (GloVe-100d — the honest regression)
Dataset: GloVe-100d from Stanford NLP, 100K base × 500 queries × 100D, Cosine.
HNSW: identical config.
| Metric | Baseline | Log+Unified | Delta |
|---|---|---|---|
| QPS ef=64 | 2,483 | 2,429 | −2.2% (~same) |
| QPS ef=256 | 1,198 | 1,073 | −10.4% WORSE ⚠️ |
| Recall@1 | 0.9740 | 0.9713 | −0.27% (noise) |
| Recall@10 | 0.9698 | 0.9705 | +0.1% |
| Recall@100 | 0.9374 | 0.9375 | ~0% |
Root cause: 100D is not a multiple of 8, so SimSIMD runs a scalar tail-loop on every distance call. Cosine normalization also dispatches to the same simsimd::cosine() function in both baseline and unified paths, so the theoretical branch-free advantage has no room to manifest.
v4 padding hypothesis test (honest disproof)
We added pad_to_power_of_two: bool to UnifiedDistanceParams and HnswIndex::new_unified_padded(), zero-padding 100D → 104D per-call via thread-local scratch. Re-ran the same GloVe benchmark:
| GloVe Δ vs baseline | v3 unpadded | v4 padded | Interpretation |
|---|---|---|---|
| QPS ef=64 | −2.2% | −24.6% WORSE | padding hurt |
| QPS ef=256 | −10.4% | −23.4% WORSE | padding hurt |
| Build time | +3.3% | +31.7% WORSE | padding hurt |
Why: per-call padding path = RefCell::borrow_mut() × 2 + clear() × 2 + extend_from_slice(100) × 2 (400-byte memcpy each) + resize(104) × 2 + SimSIMD call. That's ~60–100 cycles overhead per distance call vs ~5–10 cycles saved on the SimSIMD tail. Net loss. HNSW does thousands of distance calls per query, so the loss amplifies.
Correct fix (scoped as follow-up, not in this PR): pad vectors once at add() / search() time, store padded in HNSW, let compute_raw() see pre-aligned slices with zero per-call overhead. The pad_to_power_of_two API hook is shipped so this follow-up lands without a public API change. A TODO(future) comment near UnifiedDistanceParams::with_padding() records this.
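The two strategies can be contrasted in a few lines of illustrative standalone code (function names are hypothetical, not the branch's API):

```rust
use std::cell::RefCell;

thread_local! {
    // Per-call scratch buffer — the approach v4 tested, and that lost.
    static PAD: RefCell<Vec<f32>> = RefCell::new(Vec::new());
}

// Per-call padding: borrow + clear + copy + resize on EVERY distance call.
fn pad_per_call(a: &[f32], target: usize, kernel: impl Fn(&[f32]) -> f32) -> f32 {
    PAD.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear();
        buf.extend_from_slice(a);
        buf.resize(target, 0.0);
        kernel(&buf)
    })
}

// Pad-at-insert: pay the cost once when the vector enters the index.
fn pad_at_insert(mut v: Vec<f32>, target: usize) -> Vec<f32> {
    v.resize(target, 0.0);
    v
}

fn main() {
    // Stand-in "kernel": sum of squares over the padded slice.
    let kernel = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>();
    let a = vec![1.0_f32; 100]; // 100D, like GloVe
    let per_call = pad_per_call(&a, 104, kernel);
    let at_insert = kernel(&pad_at_insert(a.clone(), 104));
    // Zero-padding is semantically neutral: both give the same answer...
    assert_eq!(per_call, at_insert);
    // ...but per-call repeats the buffer work on every one of the thousands
    // of distance calls per query; pad-at-insert pays once per vector.
    println!("identical result: {per_call}");
}
```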
API Surface (all opt-in; defaults unchanged)
```rust
// New variant in existing enum — backward compatible
pub enum QuantizationConfig {
    None, Scalar, Product { ... }, Binary,
    Log, // ← NEW
}

// New constructors alongside existing HnswIndex::new()
impl HnswIndex {
    pub fn new(...)                // UNCHANGED — default
    pub fn new_unified(...)        // ← NEW: branch-free kernel
    pub fn new_unified_padded(...) // ← NEW: + zero-padding (experimental, see v4)
}

// New type in advanced::eml
pub struct UnifiedDistanceParams {
    pub negate: bool,
    pub normalize: bool,
    pub use_abs_diff: bool,
    pub use_sq_diff: bool,
    pub apply_sqrt: bool,
    pub pad_to_power_of_two: bool, // ← NEW (default false)
}

impl UnifiedDistanceParams {
    pub fn euclidean() -> Self { ... }
    pub fn cosine() -> Self { ... }
    pub fn dot_product() -> Self { ... }
    pub fn manhattan() -> Self { ... }
    pub fn from_metric(DistanceMetric) -> Self { ... }
    pub fn with_padding(self, bool) -> Self { ... } // ← NEW
    pub fn compute(&self, &[f32], &[f32]) -> f32 { ... }
    pub fn batch_compute(...) -> Vec<f32> { ... }
    pub fn batch_compute_parallel(...) -> Vec<f32> { ... } // feature = "parallel"
}

// Plus: EmlTree, EmlModel, EmlScoreFusion, TrainConfig, eml(), eml_safe()
// all under `ruvector_core::advanced::eml`, opt-in usage.
```
Zero breaking changes. Default VectorDB::new(DbOptions::default()) still produces ScalarQuantized + match-dispatched distance + standard HnswIndex::new(). Users opt into each EML feature explicitly.
Test & Validation Coverage
| Suite | Count | Result |
|---|---|---|
| EML unit tests (advanced::eml::tests) | 27 | ✅ all pass (21 original + 6 new padding correctness) |
| Quantization tests (incl. LogQuantized) | 18 | ✅ all pass |
| HNSW tests | 7 | ✅ all pass |
| Integration proofs (tests/eml_proof.rs) | 8 | ✅ all pass |
| Total across all affected modules | 60+ | ✅ all pass |
Micro-benchmarks: cargo bench -p ruvector-core --bench eml_bench — 6 Criterion groups.
End-to-end: cargo bench -p ruvector-core --bench eml_end_to_end — fast Criterion subset runs in ~1 min; full real-dataset proof in ~15 min.
Reproducibility (maintainer can re-run from scratch)
```sh
# Check out the branch
git fetch https://github.com/shaal/ruvector.git issue/eml
git checkout -b review/eml-issue-351 FETCH_HEAD

# Pre-cache datasets (one-time, ~1GB; gitignored)
mkdir -p bench_data && cd bench_data
curl -fLO ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz && tar xzf sift.tar.gz
curl -fLO https://nlp.stanford.edu/data/glove.6B.zip && unzip -o glove.6B.zip glove.6B.100d.txt
cd ..

# Run correctness tests (< 1 min)
cargo test -p ruvector-core --lib --release -- advanced::eml:: quantization:: index::hnsw::
cargo test -p ruvector-core --test eml_proof --release -- --nocapture

# Micro-benchmarks (~5 min)
cargo bench -p ruvector-core --bench eml_bench

# End-to-end real-dataset proof (v3 table, ~15 min at 100K)
EML_FULL_PROOF=1 EML_REAL_DATASETS=1 EML_PROOF_N=100000 EML_PROOF_Q=500 \
  cargo bench -p ruvector-core --bench eml_end_to_end -- eml_e2e_full_proof

# v4 padding test on GloVe only (~5 min)
EML_FULL_PROOF=1 EML_REAL_DATASETS=1 EML_SYNTHETIC_DATASETS=0 EML_SKIP_SIFT1M=1 \
  EML_PAD_UNIFIED=1 EML_PROOF_N=100000 EML_PROOF_Q=500 \
  cargo bench -p ruvector-core --bench eml_end_to_end -- eml_e2e_full_proof
```
Each benchmark prints a Markdown table + inline CSV to stderr, identical format to the committed proof reports.
Files Changed (on branch issue/eml)
| File | Role |
|---|---|
| src/advanced/eml.rs | Core EML operator, EmlTree, training, SIMD UnifiedDistanceParams with padding, EmlScoreFusion |
| src/quantization.rs | LogQuantized type with SIMD distance |
| src/types.rs | QuantizationConfig::Log variant + doc comment with power-of-two caveat + all proof refs |
| src/index/hnsw.rs | HnswDistanceFn enum, DistanceStrategy, new_unified{_padded}() |
| benches/eml_bench.rs | 6 Criterion micro-benchmark groups |
| benches/eml_end_to_end.rs | End-to-end ANN proof: .fvecs + GloVe loaders, Dataset abstraction, env-gated proof runner |
| tests/eml_proof.rs | 8 integration proof tests |
| docs/adr/ADR-033-eml-operator-optimizations.md | Architecture decision record |
| docs/adr/DDD-eml-bounded-context.md | Domain-driven design context |
| docs/benchmarks/BENCHMARKING_GUIDE.md | New section + headline numbers + v1–v4 references |
| bench_results/eml_proof_2026-04-14{,_v2,_v3,_v4}.md | Four proof reports (evidence chain) |
| .gitignore | /bench_data/ excluded |
Commit history (4 commits on the branch, ready for cherry-pick or squash-merge)
```
9bc8d165 docs: align types.rs + BENCHMARKING_GUIDE.md with v4 finding + TODO
9cdc6907 feat: real-dataset ANN proof (SIFT1M + GloVe) + dimension padding option
ab18edf9 feat: end-to-end ANN proof for EML optimizations + SIMD port
5a4b5f8e feat: add EML operator-inspired optimizations for quantization, distance, and learned indexes
```
Known Follow-up Work (explicitly NOT in this PR)
- Pad-at-insert HNSW variant. Closes the GloVe regression if the hypothesis holds at the insert layer instead of the per-distance-call layer. The API hook (pad_to_power_of_two flag + new_unified_padded()) is already shipped, so this lands without a public API change. Estimated scope: change HnswIndex::add_batch() and search_with_ef() to pad once; ~100 LOC.
- Full 1M SIFT1M run. Current validation uses the 100K subset for a ~15-min run; 1M would take ~3 hours but confirm the win scales.
- ANN-Benchmarks protocol on Deep1M / MS MARCO / Fashion-MNIST. Broader corpus coverage before declaring the wins universally production-ready.
- Faiss / pgvector comparative benchmarks. Cross-library validation.
None of these are blocking for the current PR — they're future work the proof chain identifies, not gaps in the current evidence.
What to Review
For the maintainer, the minimum viable review path is:
- Read the TL;DR above — is the SIFT1M headline win credible and valuable?
- Open the v3 report — scan the Markdown table for SIFT1M; verify the reproducibility instructions make sense.
- Open the v4 report — verify the honest regression disclosure for GloVe is adequate and the pad-at-insert follow-up plan is sound.
- Skim types.rs around line 105 — verify the QuantizationConfig::Log doc carries the caveat clearly.
- Verify defaults are unchanged — DbOptions::default() still produces ScalarQuantized; no existing user gets silently opted into new behavior.
If those five checks are satisfied, this can merge. If any concerns surface, happy to address them directly in comments or spin up additional benchmarks.
References