v3.10.27: RRF ablation harness + HONEST NEGATIVE RESULT (default RRF degrades nDCG@10)
What ships
Honest negative result. The textbook "lowest-regret" first move (BM25+dense RRF, k=60) degrades nDCG@10 on both NFCorpus and SciFact because our multi-field BM25 is materially weaker than Lucene's. We ship the ablation harness and the finding anyway.
Acceptance test outcome
"RRF improves or preserves nDCG@10 on both NFCorpus and SciFact, bootstrap CI does not undermine the claim, defaults fixed before viewing test result." FAILS.
| Config (BOTH datasets, fixed defaults BEFORE viewing) | NFCorpus nDCG@10 | SciFact nDCG@10 | Mean nDCG@10 |
|---|---|---|---|
| dense alone (BGE-base) | 0.352 | 0.626 | 0.489 |
| RRF k=60 equal (textbook default) | 0.328 | 0.569 | 0.449 |
| RRF k=30 equal (best ablation) | 0.335 | 0.582 | 0.459 |
| RRF k=60 dense=1.2, bm25=0.8 | 0.334 | 0.577 | 0.456 |
| RRF k=60 dense=0.8, bm25=1.2 | 0.323 | 0.558 | 0.441 |
Every RRF variant underperforms dense-alone on the 2-dataset mean, by 0.030 to 0.048 nDCG@10 (0.040 for the textbook default).
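The acceptance criterion leans on a bootstrap CI, so here is a minimal sketch of one way to compute it: a paired percentile bootstrap over per-query nDCG@10 differences (fused minus dense). The function names, the 1000 resamples, the seed, and the toy deltas are all assumptions for illustration; this is not the code in run-beir-rrf-ablation.mjs.

```javascript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Paired percentile bootstrap: resample queries with replacement and
// collect the mean per-query difference of each resample.
function bootstrapMeanCI(deltas, { resamples = 1000, alpha = 0.05, seed = 42 } = {}) {
  const rand = mulberry32(seed);
  const n = deltas.length;
  const means = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor((alpha / 2) * resamples)];
  const hi = means[Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples))];
  const mean = deltas.reduce((s, d) => s + d, 0) / n;
  return { mean, lo, hi };
}

// Toy per-query deltas (hypothetical, not measured values).
const toyDeltas = [-0.05, -0.02, 0.01, -0.08, -0.03, 0.0, -0.04, -0.06, 0.02, -0.01];
const ci = bootstrapMeanCI(toyDeltas);
console.log(ci);
```

If the upper end of the interval sits below zero, the degradation is unlikely to be an artifact of which queries happened to be sampled.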
What DID work: recall
Recall@100 IS up on both:
| Dataset | Dense R@100 | RRF R@100 | Δ |
|---|---|---|---|
| NFCorpus | 0.305 | 0.321 | +0.016 |
| SciFact | 0.828 | 0.951 | +0.123 |
RRF surfaces more of the relevant candidates; it just ranks them worse at top-K. This is the right setup for stage 2: cross-encoder rerank on the wider candidate pool (ADR-088 / 3.10.28).
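To make the recall-up, nDCG-down pattern concrete, here is a small sketch of both metrics on a toy query. It assumes binary relevance (BEIR qrels can be graded, so this is a simplification), and the document ids and rankings are hypothetical.

```javascript
// Recall@k: fraction of relevant documents that appear in the top k.
function recallAtK(ranked, relevant, k) {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

// nDCG@k with binary gains: each hit is discounted by log2(rank + 1),
// then the sum is normalised by the best achievable DCG.
function ndcgAtK(ranked, relevant, k) {
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += 1 / Math.log2(i + 2);
  });
  let ideal = 0;
  for (let i = 0; i < Math.min(k, relevant.size); i++) ideal += 1 / Math.log2(i + 2);
  return ideal === 0 ? 0 : dcg / ideal;
}

const relevant = new Set(["a", "b"]);
const denseRun = ["a", "x", "y", "z"]; // finds "a" at rank 1, misses "b"
const fusedRun = ["x", "a", "y", "b"]; // finds both, but further down

console.log("dense", recallAtK(denseRun, relevant, 4), ndcgAtK(denseRun, relevant, 2));
console.log("fused", recallAtK(fusedRun, relevant, 4), ndcgAtK(fusedRun, relevant, 2));
```

In this toy case the fused run has the higher recall over the full list and the lower nDCG at the shallow cutoff, which is the same shape as the numbers above.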
Diagnosis (why RRF hurt)
The classic RRF win assumes comparably strong systems with different failure modes. Our setup is asymmetric: BGE-base dense is strong (0.626 nDCG@10 on SciFact), while our multi-field BM25 is weak (0.576 on SciFact vs. Lucene's published 0.679). Pure BM25 nDCG@10 on NFCorpus is 0.279 vs. Lucene's 0.325, so we are 14% relative below.
When one input is weak, RRF averages its noise into top positions instead of cancelling it. The math works perfectly for the documented Lucene+strong-dense case; we don't match that profile yet.
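A minimal sketch of weighted reciprocal rank fusion shows the effect. Each document scores the sum of weight / (k + rank) over the rankers that return it, with rank starting at 1. How the harness actually applies its dense/bm25 weights is an assumption here, and the toy rankings are hypothetical.

```javascript
// Weighted reciprocal rank fusion over named ranked lists.
// Missing weights default to 1; k = 60 is the textbook default.
function rrfFuse(rankings, { k = 60, weights = {} } = {}) {
  const scores = new Map();
  for (const [name, ranked] of Object.entries(rankings)) {
    const w = weights[name] ?? 1;
    ranked.forEach((docId, i) => {
      scores.set(docId, (scores.get(docId) ?? 0) + w / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// "rel" is the relevant document; the "n*" ids are noise.
// The strong ranker puts "rel" first, the weak one puts it last.
const rankings = {
  dense: ["rel", "n1", "n2", "n3"],
  bm25: ["n2", "n1", "n3", "rel"],
};

const fusedEqual = rrfFuse(rankings);
const fusedDenseHeavy = rrfFuse(rankings, { weights: { dense: 1.2, bm25: 0.8 } });
console.log(fusedEqual, fusedDenseHeavy);
```

With equal weights the relevant document falls from rank 1 to rank 3, because the noise documents collect moderate scores from both lists. Weighting dense at 1.2 pulls it back to rank 2 but not to the top, which matches the direction of the ablation rows above.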
Bug found and fixed
bge-cache/ was hardcoded to /tmp/beir-nfcorpus/bge-cache/, so the SciFact run silently overwrote the NFCorpus cache. This was caught only when the first RRF run returned nDCG=0.14 (random-noise level), forcing investigation. The cache path is now per-dataset. The 3.10.25 and 3.10.26 NFCorpus numbers were computed before the overwrite and are still valid.
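The shape of the fix can be sketched as deriving the cache directory from the dataset name. The function name, the /tmp root, and the beir-<dataset> layout below are assumptions modelled on the old hardcoded path; the actual change lives in run-beir-bge.mjs.

```javascript
// Derive a cache directory that is unique per dataset, so one run cannot
// overwrite another run's embeddings. Root and layout are hypothetical.
function bgeCacheDir(dataset, root = "/tmp") {
  if (!/^[a-z0-9-]+$/.test(dataset)) {
    throw new Error(`unexpected dataset name: ${dataset}`);
  }
  return `${root}/beir-${dataset}/bge-cache`;
}

console.log(bgeCacheDir("nfcorpus"));
console.log(bgeCacheDir("scifact"));
```

Validating the name keeps a stray slash or empty string from collapsing two datasets back into one directory.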
What's in the box
- scripts/run-beir-rrf-ablation.mjs: re-runnable ablation harness with bootstrap CI on the fixed default config + full ablation matrix.
- scripts/run-beir-hybrid.mjs: full RRF + opt-in cross-encoder rerank runner (rerank wired but pending ADR-088 measurement).
- bge-cache/ per-dataset path fix in run-beir-bge.mjs.
- ADR-087: full negative-result writeup with diagnosis + tracked next steps.
- Updated BEIR-MATRIX.md with ablation rows + the honest 2-dataset mean comparison.
- No default change: dense-only stays the BEIR runner default. RRF is opt-in for callers with Lucene-strength BM25.
Next steps (already tracked)
- ADR-088 / 3.10.28: Cross-encoder rerank on RRF's wider candidate pool (Recall@100 0.951 on SciFact says the candidates ARE there).
- Lucene-style BM25: Porter/Snowball stemmer + Lucene stopword list + length norm. Would make RRF actually work as designed.
- ruvnet/RuVector#524: bundle BGE in ruvector so downstream packages stop hitting the sharp dependency.
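For the length-norm item, here is a rough sketch of the classic Okapi BM25 score with document-length normalisation. It assumes tokens are already lowercased and skips stemming and stopword removal; k1 = 1.2 and b = 0.75 are common defaults, and the corpus statistics are hypothetical. It is not our multi-field scorer.

```javascript
// Okapi BM25 score of one document for a query.
// lengthNorm > 1 for longer-than-average documents, which damps their
// term-frequency contribution; b controls how strongly.
function bm25Score(queryTerms, docTerms, stats, { k1 = 1.2, b = 0.75 } = {}) {
  const tf = new Map();
  for (const t of docTerms) tf.set(t, (tf.get(t) ?? 0) + 1);
  const lengthNorm = 1 - b + b * (docTerms.length / stats.avgDocLength);
  let score = 0;
  for (const term of new Set(queryTerms)) {
    const f = tf.get(term) ?? 0;
    if (f === 0) continue;
    const n = stats.docFreq.get(term) ?? 0;
    // Smoothed IDF that stays positive for very common terms.
    const idf = Math.log(1 + (stats.docCount - n + 0.5) / (n + 0.5));
    score += (idf * f * (k1 + 1)) / (f + k1 * lengthNorm);
  }
  return score;
}

// Hypothetical corpus statistics for illustration.
const stats = { docCount: 1000, avgDocLength: 8, docFreq: new Map([["statin", 20]]) };
const shortDoc = ["statin", "therapy", "trial", "results"];
const longDoc = [...shortDoc, ...Array(12).fill("filler")];

console.log(bm25Score(["statin"], shortDoc, stats));
console.log(bm25Score(["statin"], longDoc, stats));
```

The same single match scores higher in the short document than in the padded one, which is the behaviour length normalisation is meant to provide.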
Reproduce
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )
mkdir -p /tmp/beir-nfcorpus && cd /tmp/beir-nfcorpus
curl -sL -o nf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip' && unzip -q nf.zip
node /path/to/v3/@claude-flow/cli/scripts/run-beir-bge.mjs # ingest
node /path/to/v3/@claude-flow/cli/scripts/run-beir-rrf-ablation.mjs # ablation matrix
Install
npx ruflo@3.10.27 # latest / alpha / v3alpha all aligned