feat: adaptive weighted-RRF fusion + alwaysRerank for accuracy mode by willgitdata · Pull Request #10 · willgitdata/augur

willgitdata · 2026-05-08T19:17:46Z

The user asked for competitive retrieval accuracy with private-team systems plus smarter techniques for predicting the best retrieval method. Two changes that push both:

What changed

1. Adaptive weighted-RRF fusion in the candidate pool

The fusion that feeds the cross-encoder used to be symmetric RRF — every doc got the same weight from each side. Now it's prior + evidence:

```
static_weight = pickVectorWeight(query_signals) // existing query-signal prior
adaptive_shift = (vec_confidence - kw_confidence) * 0.30 // bounded ±0.20
final_weight = clamp(static + shift, 0.10, 0.90)
```

Confidence is the normalized top-1-to-top-2 score gap of each retriever (range-normalized so BM25 and cosine score scales are comparable). The query-signal prior says "this looks lexical / semantic"; the retrieval evidence says "vector returned a clear winner this query, lean into it." Bounded shift means the prior is never fully overridden.

2. `HeuristicRouter({ alwaysRerank: true })` — accuracy mode

The router skipped reranking on keyword-strategy queries to keep a sub-ms BM25 fast path. For RAG (LLM call dominates latency) that fast path is leaving NDCG on the table. The new option pushes every query through the multi-stage gather → fuse → rerank pipeline. Default stays `false` for backwards compat; opt-in for accuracy-first deployments.

Numbers

	Bundled (504q)	SciFact	NFCorpus
Default	0.912	0.709	0.312
+ adaptive fusion	0.914	0.707	0.311
+ adaptive fusion + alwaysRerank	0.920	0.707	0.324

NFCorpus jumps +0.012 because 76% of medical queries previously hit the BM25 fast path; now they all see the cross-encoder. SciFact stays flat because it was already 98% hybrid. The bundled eval shows the breadth:

short_kw 0.923 → 0.967 (+0.044)
quoted 0.846 → 0.868 (+0.022)
non_english 0.824 → 0.839 (+0.015)
error_code 0.825 → 0.837 (+0.012)
code 0.994 → 1.000 (+0.006, perfect)

Latency cost

Always-rerank pays roughly +20 ms p50 on queries that used to take the BM25 fast path. NFCorpus measured: p50 0.9 ms → 18.2 ms, QPS 155 → 51. Tail (p95) barely moves because rerank-bound queries already paid this cost. For RAG where the LLM call afterwards is hundreds of ms, the trade is correct.

Tests

97 core tests pass (94 prior + 3 new covering the alwaysRerank routing decision)
Typecheck clean across workspace
Eval runner gains `--always-rerank` flag; BEIR runner ditto

🤖 Generated with Claude Code

…y mode Two intertwined changes that push accuracy without giving up the explainability or composability of the existing stack. ## 1. Adaptive weighted-RRF fusion (replaces symmetric RRF) The candidate-pool fusion now runs in two steps: static_weight = pickVectorWeight(query_signals) // existing prior adaptive_shift = clamp((vec_conf - kw_conf) * 0.30, -0.20, +0.20) final_weight = clamp(static_weight + adaptive_shift, 0.10, 0.90) Where `vec_conf` and `kw_conf` are the normalized top-1-to-top-2 score gap for each retriever (range-normalized over the visible top-10 so cosine and BM25 score distributions are comparable). Intuition: the query-signal prior tells us "this looks like a keyword-flavored query"; the retrieval evidence tells us "vector returned a clear winner this time, weight it." Both vote together, neither can fully override the other. On the bundled 504-query eval this lifts NDCG@10 from 0.912 → 0.914 without other changes. Notable per-category lifts: code 0.994 → 1.000, non_english +0.014, procedural +0.013, vector-strategy 0.926 → 0.931. ## 2. HeuristicRouter alwaysRerank option (accuracy mode) Today the router skips reranking on keyword-strategy queries because BM25 scores are already lexical and rerank costs latency. But for RAG — where the LLM call dominates total latency anyway — the +10-25ms cross-encoder spend on every query can recover NDCG that the BM25-only fast path leaves on the table. new HeuristicRouter({ alwaysRerank: true }) Now every keyword-routed query goes through the multi-stage gather → fuse → rerank pipeline. Latency budget still wins (a < 800 ms request still skips rerank). Combined with the adaptive fusion above: Aggregate SciFact NFCorpus Default 0.912 0.709 0.312 +adaptive fusion 0.914 0.707 0.311 +alwaysRerank 0.920 0.707 0.324 NFCorpus +0.012 because 76% of medical queries previously hit the BM25 fast path; now they all see the cross-encoder. SciFact flat because it was already 98% hybrid (and therefore already reranking). The bundled eval is the broader story: short_kw 0.923 → 0.967 (+0.044), quoted +0.022, error_code +0.012, non_english +0.015. Latency cost (NFCorpus measured): p50 0.9 ms → 18.2 ms (every query reranks) p95 23.0 ms → 24.9 ms (was already rerank-bound) QPS 155 → 51 (single-threaded) Eval CLI gains a `--always-rerank` flag; BEIR runner ditto. ## Tests 97 core tests pass (94 prior + 3 new for alwaysRerank routing behavior). Typecheck clean across workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ranker Implements 10 fixes from a critical code review on the publish-ready SDK. Each item is independently shippable; this lands them as one coordinated bump because several feed each other (the eval smoke test exercises the new fusion module; the explicit-reranker breaking change needs MIGRATING.md to land at the same time). #1 Smoke eval harness — 16-doc / 12-query synthetic fixture with a regression floor (NDCG@10 > 0.65 on the stub stack). Runs in <50 ms as part of `pnpm test`. The full BEIR + 504-query eval stays where it already lives — git history — for "did this tweak win?" measurement. #2 Extract pure helpers to `fusion.ts` (composeFilter, pickVectorWeight, weightedRrfFuse, adaptWeightByConfidence, topGapNormalized, clamp) + 31 unit tests. Each empirical threshold is now annotated with what it was tuned against. #3 autoLanguageFilter integration tests — OFF default, ON for non-English, user-filter override wins, soft-fallback when the filtered pool empties. #4 Basic-search example wires the recommended stack: LocalEmbedder + LocalReranker + MetadataChunker(SentenceChunker) + InMemoryAdapter({ useStemming: true }). Matches the README's headline configuration so users copying from "hello world" land on the auto path that produces NDCG@10 = 0.920. #5 PineconeAdapter mocked-fetch tests (13 tests). Pins the wire format (URL, method, auth header, body shape, response decode) so refactors can't silently regress one of the three production adapters. Previously had zero coverage. #6 Ad-hoc scratch adapter cache — bounded LRU keyed by a deterministic fingerprint of (id, content). Repeat searches over the same `req.documents` skip re-chunking + re-embedding. Tunable via `adHocCacheSize` (default 8; set 0 to disable). #7 **BREAKING** Drop `HeuristicReranker` as silent default. The previous default did almost nothing while emitting a "yes I rerank" line in the trace. Default is now `null`; pass `new LocalReranker()` (or any provider's reranker) explicitly to keep cross-encoder voting on. Documented in MIGRATING.md. #8 MIGRATING.md — covers every BREAKING change in 0.2 with smallest-diff examples; non-breaking adoptions documented separately. #9 SemanticChunker tests (8 tests) — boundary detection, maxSize cap, async-only API, metadata propagation. Was the only chunker without coverage. #10 Magic-number provenance documented in `router.ts` and `fusion.ts`. Every threshold (≤2 / ≤6 word counts, 0.6 ambiguity floor, 800 ms latency floor, k=60 RRF, ±0.20 shift, 0.10–0.90 weight clamp, 0.3/0.4/0.5/0.7 priors) now records what it's tuned against and what's load-bearing vs negotiable. Test count: 100 → 163. Build green; full eval still ships against the published packages, not against the smoke fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Round 1 left two critical lying-trace bugs and a handful of architectural papercuts. This commit clears the table. #1 (critical) Router no longer lies about reranking when no reranker is configured. `Router.decide` gained an optional `hasReranker` parameter; `HeuristicRouter` plumbs it through to `shouldRerank` and forces `reranked: false` when absent. The trace now records "reranking skipped (no reranker configured)" instead of pretending the cross-encoder fired. Augur passes `this.reranker !== null`. Default is `true` so existing third-party Router implementations keep working. #2 (critical) `SearchTrace` declares the four fields that augur.ts attaches at runtime: `adHoc`, `adHocCacheHit`, `autoLanguageFilter`, `autoLanguageFilterDropped`. `Tracer.finish` opts widened to `Omit<SearchTrace, "id"|"query"|"startedAt"|"totalMs"|"spans">` so adding a SearchTrace field propagates automatically. Tests drop their `as unknown as` casts. #3 (high) PgVectorAdapter mocked-fetch tests (14 tests). Pin SQL shape, INSERT batching at 200/round-trip, parameter renumbering across filter clauses, identifier-validation guard against `; DROP TABLE`, **and a filter-key SQL-injection regression test** — the JSON-path quote-doubling defense gets explicit coverage. #4 (high) Adapter trace-string format change reverted. `trace.adapter` is always the bare adapter name; ad-hoc / cache-hit signals surface as the new structured boolean fields from #2. No more "in-memory (ad-hoc, cached)" string parsing. #5 (high) `fingerprintDocs` extracted to `fingerprint.ts` with 10 direct unit tests covering reorder, byte-change, prefix-equal corpora, id|content boundary, doc-record boundary, empty list, unicode, and the output format contract. #6 (medium) Async chunkers no longer pretend to be `Chunker`s. Introduced an explicit `AsyncChunker` interface; `SemanticChunker`, `Doc2QueryChunker`, `ContextualChunker` implement it (no longer Chunker). The runtime traps in their throwing `chunk()` methods are gone — the type system catches misuse at compile time. APIs that accept either flavor (`AugurOptions.chunker`, all chunker `base` fields, `chunkDocument`) now use `Chunker | AsyncChunker`. `MetadataChunker` keeps its dual sync+async path with a runtime guard for the user-opted-in case where its base is async. #7 (medium) `StubEmbedder` consolidated into `packages/core/src/test-fixtures.ts`. Excluded from the published package via tsconfig. Three duplicated copies dropped. #8 (low) `eval-smoke.test.ts` header explicitly distinguishes the synthetic-fixture smoke test (structural) from the BEIR / 504-query eval that produced the README's NDCG@10 = 0.920 numbers (preserved at git `feffc73^`, runs in ~30 min). #9 (low) `BaseAdapter` JSDoc rewritten as the canonical "starting point for custom adapters" comment, including the RRF / capability / `searchHybrid` override guidance. `AsyncChunker` added to public exports. #10 (low) `examples/basic-search/index.ts` header documents the `npm i @huggingface/transformers` requirement for users copying the file out of the repo. Test count: 163 → 191 (+28). Build green; published-package contents verified clean (`find dist -name "test-fixtures*"` → empty, `find dist -name "*.test.*"` → empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ranker Implements 10 fixes from a critical code review on the publish-ready SDK. Each item is independently shippable; this lands them as one coordinated bump because several feed each other (the eval smoke test exercises the new fusion module; the explicit-reranker breaking change needs MIGRATING.md to land at the same time). #1 Smoke eval harness — 16-doc / 12-query synthetic fixture with a regression floor (NDCG@10 > 0.65 on the stub stack). Runs in <50 ms as part of `pnpm test`. The full BEIR + 504-query eval stays where it already lives — git history — for "did this tweak win?" measurement. #2 Extract pure helpers to `fusion.ts` (composeFilter, pickVectorWeight, weightedRrfFuse, adaptWeightByConfidence, topGapNormalized, clamp) + 31 unit tests. Each empirical threshold is now annotated with what it was tuned against. #3 autoLanguageFilter integration tests — OFF default, ON for non-English, user-filter override wins, soft-fallback when the filtered pool empties. #4 Basic-search example wires the recommended stack: LocalEmbedder + LocalReranker + MetadataChunker(SentenceChunker) + InMemoryAdapter({ useStemming: true }). Matches the README's headline configuration so users copying from "hello world" land on the auto path that produces NDCG@10 = 0.920. #5 PineconeAdapter mocked-fetch tests (13 tests). Pins the wire format (URL, method, auth header, body shape, response decode) so refactors can't silently regress one of the three production adapters. Previously had zero coverage. #6 Ad-hoc scratch adapter cache — bounded LRU keyed by a deterministic fingerprint of (id, content). Repeat searches over the same `req.documents` skip re-chunking + re-embedding. Tunable via `adHocCacheSize` (default 8; set 0 to disable). #7 **BREAKING** Drop `HeuristicReranker` as silent default. The previous default did almost nothing while emitting a "yes I rerank" line in the trace. Default is now `null`; pass `new LocalReranker()` (or any provider's reranker) explicitly to keep cross-encoder voting on. Documented in MIGRATING.md. #8 MIGRATING.md — covers every BREAKING change in 0.2 with smallest-diff examples; non-breaking adoptions documented separately. #9 SemanticChunker tests (8 tests) — boundary detection, maxSize cap, async-only API, metadata propagation. Was the only chunker without coverage. #10 Magic-number provenance documented in `router.ts` and `fusion.ts`. Every threshold (≤2 / ≤6 word counts, 0.6 ambiguity floor, 800 ms latency floor, k=60 RRF, ±0.20 shift, 0.10–0.90 weight clamp, 0.3/0.4/0.5/0.7 priors) now records what it's tuned against and what's load-bearing vs negotiable. Test count: 100 → 163. Build green; full eval still ships against the published packages, not against the smoke fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Round 1 left two critical lying-trace bugs and a handful of architectural papercuts. This commit clears the table. #1 (critical) Router no longer lies about reranking when no reranker is configured. `Router.decide` gained an optional `hasReranker` parameter; `HeuristicRouter` plumbs it through to `shouldRerank` and forces `reranked: false` when absent. The trace now records "reranking skipped (no reranker configured)" instead of pretending the cross-encoder fired. Augur passes `this.reranker !== null`. Default is `true` so existing third-party Router implementations keep working. #2 (critical) `SearchTrace` declares the four fields that augur.ts attaches at runtime: `adHoc`, `adHocCacheHit`, `autoLanguageFilter`, `autoLanguageFilterDropped`. `Tracer.finish` opts widened to `Omit<SearchTrace, "id"|"query"|"startedAt"|"totalMs"|"spans">` so adding a SearchTrace field propagates automatically. Tests drop their `as unknown as` casts. #3 (high) PgVectorAdapter mocked-fetch tests (14 tests). Pin SQL shape, INSERT batching at 200/round-trip, parameter renumbering across filter clauses, identifier-validation guard against `; DROP TABLE`, **and a filter-key SQL-injection regression test** — the JSON-path quote-doubling defense gets explicit coverage. #4 (high) Adapter trace-string format change reverted. `trace.adapter` is always the bare adapter name; ad-hoc / cache-hit signals surface as the new structured boolean fields from #2. No more "in-memory (ad-hoc, cached)" string parsing. #5 (high) `fingerprintDocs` extracted to `fingerprint.ts` with 10 direct unit tests covering reorder, byte-change, prefix-equal corpora, id|content boundary, doc-record boundary, empty list, unicode, and the output format contract. #6 (medium) Async chunkers no longer pretend to be `Chunker`s. Introduced an explicit `AsyncChunker` interface; `SemanticChunker`, `Doc2QueryChunker`, `ContextualChunker` implement it (no longer Chunker). The runtime traps in their throwing `chunk()` methods are gone — the type system catches misuse at compile time. APIs that accept either flavor (`AugurOptions.chunker`, all chunker `base` fields, `chunkDocument`) now use `Chunker | AsyncChunker`. `MetadataChunker` keeps its dual sync+async path with a runtime guard for the user-opted-in case where its base is async. #7 (medium) `StubEmbedder` consolidated into `packages/core/src/test-fixtures.ts`. Excluded from the published package via tsconfig. Three duplicated copies dropped. #8 (low) `eval-smoke.test.ts` header explicitly distinguishes the synthetic-fixture smoke test (structural) from the BEIR / 504-query eval that produced the README's NDCG@10 = 0.920 numbers (preserved at git `4d52844^`, runs in ~30 min). #9 (low) `BaseAdapter` JSDoc rewritten as the canonical "starting point for custom adapters" comment, including the RRF / capability / `searchHybrid` override guidance. `AsyncChunker` added to public exports. #10 (low) `examples/basic-search/index.ts` header documents the `npm i @huggingface/transformers` requirement for users copying the file out of the repo. Test count: 163 → 191 (+28). Build green; published-package contents verified clean (`find dist -name "test-fixtures*"` → empty, `find dist -name "*.test.*"` → empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

willgitdata merged commit adbe13e into main May 8, 2026
1 check passed

willgitdata deleted the feat/adaptive-fusion branch May 8, 2026 19:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: adaptive weighted-RRF fusion + alwaysRerank for accuracy mode#10

feat: adaptive weighted-RRF fusion + alwaysRerank for accuracy mode#10
willgitdata merged 1 commit into
mainfrom
feat/adaptive-fusion

willgitdata commented May 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

willgitdata commented May 8, 2026

What changed

1. Adaptive weighted-RRF fusion in the candidate pool

2. `HeuristicRouter({ alwaysRerank: true })` — accuracy mode

Numbers

Latency cost

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant