v3.10.28: Lucene BM25 + RRF + CE rerank passes acceptance test on BOTH datasets (rank 3/13 on 2-dataset mean)
What ships
This release ships the pipeline that works. It fixes the problem ADR-087 diagnosed as "our multi-field BM25 is too weak for RRF": it adds a real Lucene-style BM25 (Porter 1980 stemmer + Lucene stopwords + length norm, 12/12 published Porter tests passing) and wires the cross-encoder rerank into the BEIR runner.
The acceptance test PASSES
| System | Params | NFCorpus | SciFact | Mean | Beats BM25 both? |
|---|---|---|---|---|---|
| BGE-large-v1.5 (published) | 335M | 0.380 | 0.722 | 0.551 | yes |
| SPLADE++ (published) | 110M | 0.347 | 0.704 | 0.526 | yes |
| ruflo Lucene RRF + CE rerank (us) | 110M | 0.358 | 0.683 | 0.521 | YES (+0.033 / +0.004) |
| Lucene BM25 alone (us, matches published) | - | 0.328 | 0.681 | 0.505 | tied |
| BM25 (published Lucene) | - | 0.325 | 0.679 | 0.502 | - |
| ruflo dense alone (BGE-base) | 110M | 0.352 | 0.626 | 0.489 | no |
That is rank 3 of 13 entries on the 2-dataset mean, using a 110M-parameter base model against BGE-large's 335M and GTR-XL's 1.2B.
Per-dataset:
- NFCorpus 0.358, rank 2/11 (only behind BGE-large 0.380)
- SciFact 0.683, rank 3/11 (behind SPLADE++ and BGE-large only)
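The scores above are nDCG@10 values (the metric printed by the Reproduce commands). For readers who want to sanity-check a number by hand, here is a minimal sketch of the metric. It assumes the linear-gain form (relevance divided by log2 of rank + 1); whether the ruflo runner uses exactly this variant is an assumption, and the names `Qrels`, `dcgAtK`, and `ndcgAtK` are hypothetical.

```typescript
// Graded relevance judgments for one query: docId -> gain (0 = not relevant).
type Qrels = Map<string, number>;

// Discounted cumulative gain over the first k positions.
// Position i (0-based) is discounted by log2(i + 2), i.e. log2(rank + 1).
function dcgAtK(gains: number[], k: number): number {
  return gains
    .slice(0, k)
    .reduce((sum, gain, i) => sum + gain / Math.log2(i + 2), 0);
}

// nDCG@k = DCG of the system ranking / DCG of the ideal (sorted) ranking.
function ndcgAtK(ranking: string[], qrels: Qrels, k = 10): number {
  const gains = ranking.map((docId) => qrels.get(docId) ?? 0);
  const ideal = Array.from(qrels.values()).sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(gains, k) / idealDcg;
}

// Toy judgments (made up for illustration, not BEIR data).
const qrels: Qrels = new Map([["d1", 2], ["d2", 1], ["d3", 0]]);
const perfect = ndcgAtK(["d1", "d2", "d3"], qrels);
const swapped = ndcgAtK(["d2", "d1", "d3"], qrels);
console.log(perfect.toFixed(3), swapped.toFixed(3)); // 1.000 0.860
```

A dataset-level score is the mean of this per-query value over all test queries.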
The diagnostic that earned this
ADR-087 (the previous release) measured RRF DEGRADING both datasets and diagnosed it as asymmetric input strength: our BM25 scored 0.279 on NFCorpus vs published Lucene 0.325, so RRF averaged its noise into the top-K. This release confirms the diagnosis: with a real Lucene-style BM25 that matches the published baseline within ±0.003, RRF + cross-encoder rerank produces real wins on both datasets.
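To make the "averaged its noise into the top-K" point concrete, here is a minimal sketch of reciprocal rank fusion with k=60. The function name `rrfFuse` and the toy rankings are hypothetical; this is the textbook RRF formula, not necessarily the exact code in the ruflo runner.

```typescript
// Reciprocal rank fusion: each ranker contributes 1 / (k + rank) per document,
// with rank starting at 1. Documents missing from a ranking contribute nothing.
function rrfFuse(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, i) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + i + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

// Toy example (made up): the dense ranker is good, the lexical ranker is noisy
// and puts unrelated documents "x" and "y" ahead of the relevant "a".
const dense = ["a", "b", "c"];
const weakBm25 = ["x", "y", "a"];
const fused = rrfFuse([dense, weakBm25]);
console.log(fused.map(([docId]) => docId).join(" ")); // a x b y c
```

RRF looks only at ranks, not scores, so the weak ranker's top hit "x" receives the same 1/61 contribution as a strong ranker's top hit and lands at position 2, ahead of the dense ranker's second choice. That is the failure mode a stronger BM25 removes.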
The user's reframe ("don't try to invent your way up BEIR; stack proven primitives, measure each lift, then decide where you add unique value") is exactly what this release executed.
Subtle finding from the full ablation
On NFCorpus, Lucene RRF k=60 alone (0.360) is effectively tied with Lucene RRF + CE rerank (0.358): the cross-encoder doesn't add value when the underlying RRF is already strong. CE's value shows on SciFact (RRF 0.639 → RRF+CE 0.683, +0.044 lift). In effect the pipeline adapts: rerank helps most when the candidate pool has high recall but low top-K precision. This matches published literature.
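The rerank stage can be sketched as follows: take the top-N fused candidates, rescore each (query, document) pair, and re-sort only that head. Everything named here (`Candidate`, `PairScorer`, `rerankTopN`, `overlapScorer`) is hypothetical, and the term-overlap scorer is a loud stand-in so the example runs without a model; a real cross-encoder would be an asynchronous model call.

```typescript
interface Candidate {
  id: string;
  text: string;
}

// Any function that scores a (query, document) pair; higher is more relevant.
type PairScorer = (query: string, docText: string) => number;

// Rescore the first topN candidates and re-sort them; the tail keeps its order.
function rerankTopN(
  query: string,
  candidates: Candidate[],
  score: PairScorer,
  topN = 100
): string[] {
  const head = candidates
    .slice(0, topN)
    .map((c) => ({ id: c.id, score: score(query, c.text) }))
    .sort((a, b) => b.score - a.score);
  const tail = candidates.slice(topN);
  return head.map((c) => c.id).concat(tail.map((c) => c.id));
}

// Stand-in scorer (NOT a cross-encoder): counts query terms present in the doc.
const overlapScorer: PairScorer = (query, docText) => {
  const docTerms = new Set(docText.toLowerCase().split(/\s+/));
  return query.toLowerCase().split(/\s+/).filter((t) => docTerms.has(t)).length;
};

// Toy fused list (made up): the relevant doc was recalled but sits at rank 2.
const fusedCandidates: Candidate[] = [
  { id: "d1", text: "calcium intake in adults" },
  { id: "d2", text: "vitamin d deficiency in adults" },
  { id: "d3", text: "vitamin supplements" },
];
const reranked = rerankTopN("vitamin d deficiency", fusedCandidates, overlapScorer, 2);
console.log(reranked.join(" ")); // d2 d1 d3
```

When the first stage already orders the head well, as on NFCorpus, the re-sort has little left to fix.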
What's in the box
- src/memory/lucene-bm25.ts: Porter 1980 + Lucene 8.x English stopwords (~120 tokens) + single-field BM25 (k1=1.2, b=0.75). No external deps. 12/12 published Porter tests passing.
- scripts/run-beir-hybrid.mjs gains USE_LUCENE_BM25=1 + RERANK=1 flags.
- scripts/run-beir-lucene-bm25.mjs: standalone runner for the Lucene BM25 + RRF ablation.
- ADR-088: full ablation matrix + diagnosis confirmation + honest limits.
- BEIR-MATRIX.md: updated 2-dataset mean leaderboard (13 entries, ruflo at rank 3).
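For orientation, single-field BM25 with k1=1.2 and b=0.75 can be sketched as below. This is an illustrative sketch, not the contents of lucene-bm25.ts: the names `CorpusStats`, `buildStats`, `idf`, and `bm25Score` are hypothetical, the input terms are assumed to be already stemmed and stopword-filtered, and the idf uses the Lucene-style ln(1 + (N - n + 0.5) / (n + 0.5)) form. The sketch keeps the classic (k1 + 1) numerator; whether the shipped implementation does is not stated here, and that constant factor does not change rankings.

```typescript
// Corpus-level statistics needed by BM25.
interface CorpusStats {
  docCount: number;
  avgDocLen: number;
  docFreq: Map<string, number>; // term -> number of documents containing it
}

const K1 = 1.2; // term-frequency saturation
const B = 0.75; // strength of length normalization

function buildStats(docs: string[][]): CorpusStats {
  const docFreq = new Map<string, number>();
  let totalLen = 0;
  for (const doc of docs) {
    totalLen += doc.length;
    for (const term of Array.from(new Set(doc))) {
      docFreq.set(term, (docFreq.get(term) ?? 0) + 1);
    }
  }
  return { docCount: docs.length, avgDocLen: totalLen / docs.length, docFreq };
}

// Lucene-style idf; always positive, even for very common terms.
function idf(termDocFreq: number, docCount: number): number {
  return Math.log(1 + (docCount - termDocFreq + 0.5) / (termDocFreq + 0.5));
}

function bm25Score(queryTerms: string[], docTerms: string[], stats: CorpusStats): number {
  const tf = new Map<string, number>();
  for (const term of docTerms) tf.set(term, (tf.get(term) ?? 0) + 1);
  // Longer-than-average documents get a larger denominator (length norm).
  const norm = K1 * (1 - B + B * (docTerms.length / stats.avgDocLen));
  let score = 0;
  for (const term of Array.from(new Set(queryTerms))) {
    const freq = tf.get(term) ?? 0;
    if (freq === 0) continue;
    const n = stats.docFreq.get(term) ?? 0;
    score += (idf(n, stats.docCount) * freq * (K1 + 1)) / (freq + norm);
  }
  return score;
}

// Toy corpus (made up): same term frequency, different document lengths.
const docs = [
  ["bm25", "rank"],
  ["bm25", "rank", "fusion", "dense", "vector", "search"],
  ["dense", "vector"],
];
const stats = buildStats(docs);
const scores = docs.map((doc) => bm25Score(["bm25"], doc, stats));
console.log(scores.map((s) => s.toFixed(3)).join(" "));
```

The short document outscores the long one for the same term frequency, which is the length normalization the release notes refer to.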
Reproduce
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )
# Re-use existing caches from ADR-085 (or re-ingest with run-beir-bge.mjs)
cd /tmp/beir-nfcorpus
USE_LUCENE_BM25=1 RERANK=1 node /path/to/v3/@claude-flow/cli/scripts/run-beir-hybrid.mjs
# → nDCG@10 0.358, rank 2/11
cd /tmp/beir-scifact
USE_LUCENE_BM25=1 RERANK=1 BEIR_DATA_DIR=/tmp/beir-scifact/scifact node /path/to/v3/@claude-flow/cli/scripts/run-beir-hybrid.mjs
# → nDCG@10 0.683, rank 3/11
Honest limits
- Two BEIR datasets measured. The 0.521 mean is suggestive, not BEIR-average.
- Zero-shot: no fine-tuning. The NFCorpus train split (110K pairs) could lift another ~0.02-0.05.
- Lucene BM25 is a re-implementation (matches published within ±0.003, not bit-identical).
- Rerank adds ~4.6s/query CPU latency at top-100; production callers should budget per latency tolerance.
- Production runtime defaults UNCHANGED: the runtime still uses multi-field BM25 (better for ruflo's commit-history corpora). Lucene BM25 is BEIR-benchmark-scoped.
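As a rough aid for the latency budgeting mentioned above, the measured ~4.6s/query at top-100 works out to about 46 ms per candidate if cost scales linearly with rerank depth. That linearity is an assumption (batching or a different CPU would change it), and `maxRerankDepth` is a hypothetical helper, not a ruflo API.

```typescript
// Figures from the limits above: ~4.6 s per query when reranking the top 100.
const MEASURED_MS_PER_QUERY = 4600;
const MEASURED_DEPTH = 100;

// ASSUMPTION: cost is linear in the number of reranked candidates.
const MS_PER_CANDIDATE = MEASURED_MS_PER_QUERY / MEASURED_DEPTH; // 46 ms

// Largest rerank depth that fits a latency budget, never exceeding the cap.
function maxRerankDepth(budgetMs: number, cap = MEASURED_DEPTH): number {
  if (budgetMs <= 0) return 0;
  return Math.min(cap, Math.floor(budgetMs / MS_PER_CANDIDATE));
}

console.log(maxRerankDepth(1000)); // 21
console.log(maxRerankDepth(5000)); // 100 (capped)
```

A shallower rerank trades some of the measured lift for latency; how much is not measured in this release.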
What's next (already tracked)
- BGE-large swap: drop-in BGE_MODEL=Xenova/bge-large-en-v1.5. Likely lifts further. ~3× embed latency.
- 3-5 more BEIR datasets via Tailscale GPU: TREC-COVID, FiQA, ArguAna, HotpotQA, NQ. Would establish a real BEIR-mini-average.
- Fine-tune BGE-base on NFCorpus train (GPU job, +0.02-0.05 expected).
- ruvector BGE bundling (ruvnet/ruvector#524): kills the silent-fallback bug at source.
Install
npx ruflo@3.10.28 # latest / alpha / v3alpha all aligned