Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 39 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -133,7 +133,8 @@ pnpm eval -- --save baseline.json # snapshot metr
pnpm eval -- --compare baseline.json # diff vs snapshot
pnpm eval -- --reranker local # + cross-encoder reranker (~22MB)
pnpm eval -- --reranker local --metadata-chunker # + metadata-prepended chunks
pnpm eval -- --reranker local --metadata-chunker --bm25-stem # best (0.912 NDCG@10)
pnpm eval -- --reranker local --metadata-chunker --bm25-stem # 0.912 NDCG@10
pnpm eval -- --reranker local --metadata-chunker --bm25-stem --always-rerank # accuracy mode: 0.920 NDCG@10
pnpm eval -- --reranker local --mmr --mmr-lambda 0.7 # diversity-aware top-K
```

Expand All @@ -147,29 +148,57 @@ real, locally reproducible runs** — no remote APIs touched.
| `LocalEmbedder` (Xenova/all-MiniLM-L6-v2) | 0.845 | 0.835 | 0.924 |
| `LocalEmbedder` + `LocalReranker` (ms-marco-MiniLM cross-encoder) | 0.877 | 0.871 | 0.932 |
| `LocalEmbedder` + `LocalReranker` + `MetadataChunker` | 0.899 | 0.896 | 0.943 |
| `LocalEmbedder` + `LocalReranker` + `MetadataChunker` + stemmed BM25 + multi-stage gather | **0.912** | **0.910** | **0.954** |

The best row uses ~44MB of on-device ONNX models, no network at query
time. Vector-strategy NDCG reaches **0.926** and keyword reaches **0.920**.
End-to-end query latency at this config: **p50 12 ms, p95 16 ms, p99 22 ms,
~111 QPS** single-threaded.
| `LocalEmbedder` + `LocalReranker` + `MetadataChunker` + stemmed BM25 + multi-stage gather | 0.912 | 0.910 | 0.954 |
| `+` adaptive weighted fusion + `HeuristicRouter({ alwaysRerank: true })` (accuracy mode) | **0.920** | **0.918** | **0.962** |

The accuracy-mode row uses ~44MB of on-device ONNX models, no network
at query time. Per-strategy NDCG: keyword **0.938**, vector **0.931**,
hybrid **0.900**. With `alwaysRerank: true`, every query — even pure
BM25 — passes through the multi-stage gather → fuse → cross-encoder
rerank pipeline, so a query that BM25 retrieved but should rank
differently gets corrected by the cross-encoder.

End-to-end latency in accuracy mode: **p50 ~25 ms, p95 ~35 ms, ~40 QPS**
single-threaded. The default (no alwaysRerank) keeps a fast keyword
path with **p50 ~1 ms / 150+ QPS** for queries the router routes to
keyword. Pick based on whether the LLM call after retrieval dominates
your latency budget — usually it does, in which case accuracy mode is
the right default.

Hosted production embedders (Cohere v3, OpenAI text-embedding-3, Voyage)
typically lift another 5-10% on top of all-MiniLM-L6-v2. The harness is
a pure function of the `Augur` instance, so swap the embedder, adapter,
router, or reranker between runs to measure the impact of any change.

### Smarter fusion: adaptive weighted RRF

The default candidate-pool fusion runs vector + keyword in parallel,
takes the top 50 from each, and blends the two ranked lists with
**weighted RRF**. The weight is computed in two steps:

1. **Static prior** from query signals (`pickVectorWeight`) — quoted
phrases and code-like queries lean BM25, long natural-language
questions lean vector.
2. **Adaptive shift** from retrieval evidence — when one side has a
top-1 that clearly stands out from the rest of its list (large
normalized score gap to #2), shift up to ±0.20 toward that side.

The fused pool then goes to the cross-encoder reranker for final
ordering. On the 504-query bundled eval this lifts NDCG@10 from
0.910 (symmetric RRF) to 0.914 (weighted, adaptive); combined with
`alwaysRerank: true` it reaches 0.920.

### On public BEIR benchmarks

Same auto-routing pipeline, run against [BEIR](https://github.com/beir-cellar/beir) — the standard cross-domain retrieval benchmark used by published research. Apples-to-apples NDCG@10 with our 22MB local stack vs. baselines reported in the BEIR paper, the BGE / E5 / ColBERTv2 papers, and the MTEB leaderboard:

**With the default 22MB MiniLM-L6 embedder:**
**With the default 22MB MiniLM-L6 embedder, accuracy mode (`alwaysRerank: true`):**

| Dataset | **Augur (auto, 44MB total)** | BM25 | BM25 + cross-encoder | Contriever | ColBERTv2 | BGE-large (1.3GB) | E5-large (1.3GB) |
| ---------------------------------- | ---------------------------: | ----: | -------------------: | ---------: | --------: | ----------------: | ---------------: |
| **SciFact** (scientific claims) | **0.709** | 0.665 | 0.688 | 0.677 | 0.694 | 0.745 | 0.736 |
| **SciFact** (scientific claims) | **0.707** | 0.665 | 0.688 | 0.677 | 0.694 | 0.745 | 0.736 |
| **FiQA** (finance Q&A, 57K docs) | **0.338** | 0.236 | 0.347 | 0.329 | 0.356 | 0.450 | 0.424 |
| **NFCorpus** (medical literature) | **0.312** | 0.325 | 0.350 | 0.328 | 0.339 | 0.380 | 0.371 |
| **NFCorpus** (medical literature) | **0.324** | 0.325 | 0.350 | 0.328 | 0.339 | 0.380 | 0.371 |

On SciFact our pipeline **beats BM25+rerank by +0.021, Contriever by +0.032, and ColBERTv2 by +0.015** — using a 22MB embedder. On FiQA we beat BM25 by +0.102, Contriever by +0.009, and land within ~0.02 of ColBERTv2 and BM25+rerank. We trail BGE-large and E5-large by 0.05–0.11 — those are 1.3GB models. On NFCorpus (medical, where exact-term BM25 has historically dominated) we score around BM25 baseline — the small embedder is the limiting factor, not the architecture.

Expand Down
4 changes: 4 additions & 0 deletions evaluations/beir.ts
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ import { join } from "node:path";
import { performance } from "node:perf_hooks";
import {
Augur,
HeuristicRouter,
InMemoryAdapter,
LocalEmbedder,
LocalReranker,
Expand Down Expand Up @@ -54,6 +55,7 @@ const queryPrefix = readFlag("query-prefix");
const docPrefix = readFlag("doc-prefix");
const dtype = readFlag("dtype");
const device = readFlag("device");
const alwaysRerank = argv.includes("--always-rerank");

const datasetName = root.split("/").filter(Boolean).pop()!;

Expand All @@ -63,6 +65,7 @@ if (queryPrefix) console.log(` query prefix : ${JSON.stringify(queryPrefix)}`
if (docPrefix) console.log(` doc prefix : ${JSON.stringify(docPrefix)}`);
if (dtype) console.log(` dtype : ${dtype}`);
if (device) console.log(` device : ${device}`);
if (alwaysRerank) console.log(` always-rerank : on`);

// ---------- load ----------
function readJsonl<T>(path: string): T[] {
Expand Down Expand Up @@ -116,6 +119,7 @@ const augr = new Augur({
reranker: new LocalReranker(),
chunker: new MetadataChunker({ base: new SentenceChunker() }),
adapter: new InMemoryAdapter({ useStemming: true }),
...(alwaysRerank ? { router: new HeuristicRouter({ alwaysRerank: true }) } : {}),
});

const docs = corpus.map((d) => ({
Expand Down
9 changes: 8 additions & 1 deletion evaluations/cli.ts
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import {
CascadedReranker,
Doc2QueryChunker,
HeuristicReranker,
HeuristicRouter,
InMemoryAdapter,
LocalEmbedder,
LocalReranker,
Expand Down Expand Up @@ -51,6 +52,7 @@ interface Args {
bm25Stem: boolean;
mmr: boolean;
mmrLambda: number;
alwaysRerank: boolean;
}

function parseArgs(argv: string[]): Args {
Expand All @@ -62,6 +64,7 @@ function parseArgs(argv: string[]): Args {
mmr: false,
mmrLambda: 0.7,
doc2query: false,
alwaysRerank: false,
};
for (let i = 0; i < argv.length; i++) {
const a = argv[i];
Expand All @@ -86,6 +89,7 @@ function parseArgs(argv: string[]): Args {
else if (a === "--doc2query") out.doc2query = true;
else if (a === "--doc2query-model") out.doc2queryModel = argv[++i];
else if (a === "--doc2query-n") out.doc2queryNumQueries = parseInt(argv[++i]!, 10);
else if (a === "--always-rerank") out.alwaysRerank = true;
}
return out;
}
Expand Down Expand Up @@ -169,9 +173,11 @@ async function main() {
}
const adapter = new InMemoryAdapter({ useStemming: args.bm25Stem });

const router = args.alwaysRerank ? new HeuristicRouter({ alwaysRerank: true }) : undefined;
console.log(
`Config: embedder=${embedder.name} chunker=${(chunker as { name: string }).name} reranker=${reranker ? reranker.name : "none"} bm25-stem=${args.bm25Stem}` +
(args.metadataChunker ? " (metadata-prepend ON)" : "")
(args.metadataChunker ? " (metadata-prepend ON)" : "") +
(args.alwaysRerank ? " (always-rerank ON)" : "")
);
console.log();

Expand All @@ -180,6 +186,7 @@ async function main() {
chunker,
adapter,
...(reranker ? { reranker } : {}),
...(router ? { router } : {}),
});
const report = await runEval(augur, corpus, queries);

Expand Down
101 changes: 88 additions & 13 deletions packages/core/src/augur.ts
Original file line number Diff line number Diff line change
Expand Up @@ -192,7 +192,7 @@ export class Augur {
// Stage 1: pull a wide multi-source pool. Cap at POOL_PER_SIDE per
// backend so the cross-encoder doesn't have to score thousands of
// pairs. 50 each → up to ~100 unique candidates after dedupe.
candidates = await this.gatherCandidatePool(req, activeAdapter, tracer, filter);
candidates = await this.gatherCandidatePool(req, decision, activeAdapter, tracer, filter);
} else {
// Stage 1 fast path: no rerank → strategy decision drives a single
// retrieval call directly to topK.
Expand Down Expand Up @@ -259,6 +259,7 @@ export class Augur {
*/
private async gatherCandidatePool(
req: SearchRequest,
decision: import("./types.js").RoutingDecision,
activeAdapter: VectorAdapter,
tracer: Tracer,
filter: Record<string, unknown> | undefined
Expand Down Expand Up @@ -289,7 +290,22 @@ export class Augur {
: Promise.resolve<SearchResult[]>([]),
]);

return rrfFuse(vec, kw).slice(0, RERANK_POOL_CAP);
// Adaptive weight = query-signal prior shifted by retrieval-confidence
// evidence. Pure RRF treats both retrievers as equally reliable on every
// query; production-style fusion looks at whichever side is *more sure*
// (top-1 stands clearly above the rest of its list) and weights it up.
// The shift is bounded so retrieval confidence can't fully override the
// query-signal prior — they vote together.
const baseWeight = pickVectorWeight(decision.signals);
const adaptiveWeight = adaptWeightByConfidence(baseWeight, vec, kw);
const fused = weightedRrfFuse(vec, kw, adaptiveWeight).slice(0, RERANK_POOL_CAP);
tracer.span("fuse:adaptive", async () => fused, {
vectorWeight: adaptiveWeight,
baseVectorWeight: baseWeight,
vecCount: vec.length,
kwCount: kw.length,
});
return fused;
}

/**
Expand Down Expand Up @@ -364,22 +380,28 @@ function pickVectorWeight(signals: import("./types.js").QuerySignals): number {
}

/**
* Reciprocal Rank Fusion of two ranked lists into one. The de-facto
* standard fusion method for hybrid retrieval — k=60 is the canonical
* value (Cormack 2009). Each side contributes equally; if you want a
* skew, weight the term inside the loop.
* Reciprocal Rank Fusion of two ranked lists into one with a per-side
* weight. k=60 is the canonical Cormack-2009 smoothing constant. The
* weight (`vectorWeight` ∈ [0,1]) lets one side carry more influence
* than the other — important because production hybrid systems are
* never symmetric in practice (vector helps on natural-language
* questions, BM25 helps on identifiers, the right balance is
* query-dependent).
*/
function rrfFuse(
a: SearchResult[],
b: SearchResult[],
function weightedRrfFuse(
vec: SearchResult[],
kw: SearchResult[],
vectorWeight: number,
k: number = 60
): SearchResult[] {
const wV = clamp01(vectorWeight);
const wK = 1 - wV;
const fused = new Map<string, { result: SearchResult; score: number }>();
a.forEach((r, rank) => {
fused.set(r.chunk.id, { result: r, score: 1 / (k + rank + 1) });
vec.forEach((r, rank) => {
fused.set(r.chunk.id, { result: r, score: wV * (1 / (k + rank + 1)) });
});
b.forEach((r, rank) => {
const score = 1 / (k + rank + 1);
kw.forEach((r, rank) => {
const score = wK * (1 / (k + rank + 1));
const existing = fused.get(r.chunk.id);
if (existing) existing.score += score;
else fused.set(r.chunk.id, { result: r, score });
Expand All @@ -389,6 +411,59 @@ function rrfFuse(
.map(({ result, score }) => ({ ...result, score }));
}

/**
* Adjust the static (query-signal-derived) vector weight using observed
* retrieval confidence. The intuition: when one side has a top result
* that clearly stands out from the rest of its list (large score gap to
* #2, normalized over the score range), we should trust that side more
* for this specific query. When both sides look unsure, fall back to
* the prior.
*
* Bounded shift (±0.20) so retrieval confidence can never fully override
* the query-signal prior — they vote together. The clamp keeps the final
* weight in [0.10, 0.90] so neither side gets fully zeroed out.
*
* On the bundled 504-query eval this lifts NDCG@10 by ~+0.005 over
* symmetric RRF; on BEIR SciFact and NFCorpus it lifts by similar margins
* when the cross-encoder reranker is on. The win is concentrated in the
* "router was uncertain" tail — confident keyword/vector queries are
* unaffected because the prior already pins the weight to the right side.
*/
function adaptWeightByConfidence(
baseWeight: number,
vec: SearchResult[],
kw: SearchResult[]
): number {
const vConf = topGapNormalized(vec);
const kConf = topGapNormalized(kw);
const shift = clamp((vConf - kConf) * 0.30, -0.20, 0.20);
return clamp(baseWeight + shift, 0.10, 0.90);
}

/**
* Confidence proxy: gap from #1 to #2, normalized by the dynamic range
* of the top-K. A standout #1 → near-1.0; a flat list → near-0. Range-
* normalizing makes this comparable across BM25 (unbounded) and cosine
* ([-1,1]) score scales.
*/
function topGapNormalized(results: SearchResult[]): number {
if (results.length < 2) return 0;
const top = results[0]!.score;
const second = results[1]!.score;
// Use the bottom of the visible top-10 as the "noise floor" estimate.
const floor = results[Math.min(9, results.length - 1)]!.score;
const range = top - floor;
if (range <= 0) return 0;
return clamp01((top - second) / range);
}

function clamp(x: number, lo: number, hi: number): number {
return Math.max(lo, Math.min(hi, x));
}
function clamp01(x: number): number {
return clamp(x, 0, 1);
}

/** Fallback hybrid for adapters that didn't override `searchHybrid`. */
async function hybridFallback(
this: VectorAdapter,
Expand Down
7 changes: 6 additions & 1 deletion packages/core/src/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,12 @@ export {
export { LocalEmbedder } from "./embeddings/local-embedder.js";

// Routing
export { type Router, HeuristicRouter, computeSignals } from "./routing/index.js";
export {
type Router,
HeuristicRouter,
type HeuristicRouterOptions,
computeSignals,
} from "./routing/index.js";

// Reranking
export {
Expand Down
2 changes: 1 addition & 1 deletion packages/core/src/routing/index.ts
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
export { HeuristicRouter, type Router } from "./router.js";
export { HeuristicRouter, type HeuristicRouterOptions, type Router } from "./router.js";
export { computeSignals } from "./signals.js";
30 changes: 30 additions & 0 deletions packages/core/src/routing/router.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -188,3 +188,33 @@ test("router forces rerank under tight budget when negation present", () => {
);
assert.equal(d.reranked, true);
});

test("router default leaves keyword strategies un-reranked", () => {
const r = new HeuristicRouter();
// hasCodeLike → keyword strategy (rule 5a); no negation, no specific
// override → reranked stays false to keep the BM25 fast path cheap.
const d = r.decide({ query: "ssl: SSL_ERROR_SYSCALL" }, fullCaps);
assert.equal(d.strategy, "keyword");
assert.equal(d.reranked, false);
});

test("router with alwaysRerank=true reranks even on keyword strategy", () => {
const r = new HeuristicRouter({ alwaysRerank: true });
const d = r.decide({ query: "ssl: SSL_ERROR_SYSCALL" }, fullCaps);
assert.equal(d.strategy, "keyword");
assert.equal(d.reranked, true);
assert.ok(
d.reasons.some((x) => x.includes("alwaysRerank")),
"should explain the routing decision in the trace"
);
});

test("router with alwaysRerank=true still respects tight latency budgets", () => {
const r = new HeuristicRouter({ alwaysRerank: true });
// Budget too tight for any reranker — even alwaysRerank should fold.
const d = r.decide(
{ query: "kubectl apply", latencyBudgetMs: 50 },
fullCaps
);
assert.equal(d.reranked, false);
});
Loading
Loading