Skip to content

Add FiQA result to BEIR table (NDCG@10 = 0.338)#6

Merged
willgitdata merged 2 commits into
mainfrom
docs/fiqa-result
May 8, 2026
Merged

Add FiQA result to BEIR table (NDCG@10 = 0.338)#6
willgitdata merged 2 commits into
mainfrom
docs/fiqa-result

Conversation

@willgitdata
Copy link
Copy Markdown
Owner

Summary

Adds the FiQA row to the BEIR comparison table in the README. FiQA is a 57K-doc finance Q&A benchmark and was the third dataset I queued during the BEIR evaluation; it just finished after running ~60 min in the background (brute-force cosine on 57K docs takes time).

Metric Augur (auto)
NDCG@10 0.338
MRR@10 0.412
Recall@10 0.413

vs. published BEIR baselines:

  • BM25: 0.236 (we beat by +0.102)
  • Contriever: 0.329 (we beat by +0.009)
  • BM25 + cross-enc: 0.347 (-0.009)
  • ColBERTv2: 0.356 (-0.018)
  • BGE-large (1.3GB): 0.450
  • E5-large (1.3GB): 0.424

Strategy distribution: 72% vector, 23% hybrid, 5% keyword — the router correctly identified that natural-language finance questions reward semantic match over BM25.

Also adds a latency note: FiQA hit p50 ~118ms / 8 QPS because InMemoryAdapter does brute-force cosine over ~150K chunks. For corpora past ~100K chunks, the right move is to swap in PgVectorAdapter, PineconeAdapter, or TurbopufferAdapter — those bring native ANN, dropping per-query cost back to ~10ms regardless of corpus size. Orchestrator code is unchanged; only the adapter swaps.

Test plan

  • pnpm exec tsx evaluations/beir.ts /tmp/beir/fiqa → matches the numbers in this PR
  • Eyeball the README table renders cleanly on github.com

Note

The first commit on this branch (`d84b254`) accidentally included three root-level draft SVGs (`augur-logo*.svg`, `augur-wordmark.svg`) that were untracked working files in the local checkout. The follow-up commit (`dc76999`) un-stages them — they remain on disk locally as drafts, just not in the repo. Squash-merge will collapse both into one clean commit.

🤖 Generated with Claude Code

willgitdata and others added 2 commits May 7, 2026 21:57
FiQA (57K docs, 648 test queries, finance Q&A) — auto-routing pipeline:

  NDCG@10   0.338
  MRR@10    0.412
  Recall@10 0.413

Strategy distribution: 72% vector, 23% hybrid, 5% keyword. The router
correctly leaned semantic on natural-language finance questions.

Comparison to published baselines:
  BM25:               0.236  (we beat by +0.102)
  Contriever:         0.329  (we beat by +0.009)
  BM25 + rerank:      0.347  (-0.009)
  ColBERTv2:          0.356  (-0.018)
  BGE-large (1.3GB):  0.450
  E5-large (1.3GB):   0.424

Also adds a latency note: FiQA hit p50 118ms / 8 QPS because
InMemoryAdapter does brute-force cosine over ~150K chunks. For corpora
past 100K chunks, swap in PgVector/Pinecone/Turbopuffer for native
ANN — orchestrator code is unchanged, only the adapter swaps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@willgitdata willgitdata merged commit e8d60d5 into main May 8, 2026
1 check passed
@willgitdata willgitdata deleted the docs/fiqa-result branch May 8, 2026 05:05
willgitdata added a commit that referenced this pull request May 8, 2026
…ranker

Implements 10 fixes from a critical code review on the publish-ready
SDK. Each item is independently shippable; this lands them as one
coordinated bump because several feed each other (the eval smoke test
exercises the new fusion module; the explicit-reranker breaking change
needs MIGRATING.md to land at the same time).

  #1  Smoke eval harness — 16-doc / 12-query synthetic fixture with
      a regression floor (NDCG@10 > 0.65 on the stub stack). Runs in
      <50 ms as part of `pnpm test`. The full BEIR + 504-query eval
      stays where it already lives — git history — for "did this
      tweak win?" measurement.
  #2  Extract pure helpers to `fusion.ts` (composeFilter,
      pickVectorWeight, weightedRrfFuse, adaptWeightByConfidence,
      topGapNormalized, clamp) + 31 unit tests. Each empirical
      threshold is now annotated with what it was tuned against.
  #3  autoLanguageFilter integration tests — OFF default, ON for
      non-English, user-filter override wins, soft-fallback when the
      filtered pool empties.
  #4  Basic-search example wires the recommended stack: LocalEmbedder
      + LocalReranker + MetadataChunker(SentenceChunker) +
      InMemoryAdapter({ useStemming: true }). Matches the README's
      headline configuration so users copying from "hello world"
      land on the auto path that produces NDCG@10 = 0.920.
  #5  PineconeAdapter mocked-fetch tests (13 tests). Pins the wire
      format (URL, method, auth header, body shape, response decode)
      so refactors can't silently regress one of the three production
      adapters. Previously had zero coverage.
  #6  Ad-hoc scratch adapter cache — bounded LRU keyed by a
      deterministic fingerprint of (id, content). Repeat searches
      over the same `req.documents` skip re-chunking + re-embedding.
      Tunable via `adHocCacheSize` (default 8; set 0 to disable).
  #7  **BREAKING** Drop `HeuristicReranker` as silent default. The
      previous default did almost nothing while emitting a
      "yes I rerank" line in the trace. Default is now `null`;
      pass `new LocalReranker()` (or any provider's reranker)
      explicitly to keep cross-encoder voting on. Documented in
      MIGRATING.md.
  #8  MIGRATING.md — covers every BREAKING change in 0.2 with
      smallest-diff examples; non-breaking adoptions documented
      separately.
  #9  SemanticChunker tests (8 tests) — boundary detection, maxSize
      cap, async-only API, metadata propagation. Was the only
      chunker without coverage.
  #10 Magic-number provenance documented in `router.ts` and
      `fusion.ts`. Every threshold (≤2 / ≤6 word counts, 0.6
      ambiguity floor, 800 ms latency floor, k=60 RRF, ±0.20
      shift, 0.10–0.90 weight clamp, 0.3/0.4/0.5/0.7 priors) now
      records what it's tuned against and what's load-bearing
      vs negotiable.

Test count: 100 → 163. Build green; full eval still ships against
the published packages, not against the smoke fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
willgitdata added a commit that referenced this pull request May 8, 2026
Round 1 left two critical lying-trace bugs and a handful of
architectural papercuts. This commit clears the table.

  #1 (critical) Router no longer lies about reranking when no
     reranker is configured. `Router.decide` gained an optional
     `hasReranker` parameter; `HeuristicRouter` plumbs it through to
     `shouldRerank` and forces `reranked: false` when absent. The
     trace now records "reranking skipped (no reranker configured)"
     instead of pretending the cross-encoder fired. Augur passes
     `this.reranker !== null`. Default is `true` so existing
     third-party Router implementations keep working.

  #2 (critical) `SearchTrace` declares the four fields that augur.ts
     attaches at runtime: `adHoc`, `adHocCacheHit`, `autoLanguageFilter`,
     `autoLanguageFilterDropped`. `Tracer.finish` opts widened to
     `Omit<SearchTrace, "id"|"query"|"startedAt"|"totalMs"|"spans">`
     so adding a SearchTrace field propagates automatically. Tests
     drop their `as unknown as` casts.

  #3 (high) PgVectorAdapter mocked-fetch tests (14 tests). Pin SQL
     shape, INSERT batching at 200/round-trip, parameter renumbering
     across filter clauses, identifier-validation guard against
     `; DROP TABLE`, **and a filter-key SQL-injection regression
     test** — the JSON-path quote-doubling defense gets explicit
     coverage.

  #4 (high) Adapter trace-string format change reverted. `trace.adapter`
     is always the bare adapter name; ad-hoc / cache-hit signals
     surface as the new structured boolean fields from #2. No more
     "in-memory (ad-hoc, cached)" string parsing.

  #5 (high) `fingerprintDocs` extracted to `fingerprint.ts` with 10
     direct unit tests covering reorder, byte-change, prefix-equal
     corpora, id|content boundary, doc-record boundary, empty list,
     unicode, and the output format contract.

  #6 (medium) Async chunkers no longer pretend to be `Chunker`s.
     Introduced an explicit `AsyncChunker` interface;
     `SemanticChunker`, `Doc2QueryChunker`, `ContextualChunker`
     implement it (no longer Chunker). The runtime traps in their
     throwing `chunk()` methods are gone — the type system catches
     misuse at compile time. APIs that accept either flavor
     (`AugurOptions.chunker`, all chunker `base` fields,
     `chunkDocument`) now use `Chunker | AsyncChunker`.
     `MetadataChunker` keeps its dual sync+async path with a runtime
     guard for the user-opted-in case where its base is async.

  #7 (medium) `StubEmbedder` consolidated into
     `packages/core/src/test-fixtures.ts`. Excluded from the
     published package via tsconfig. Three duplicated copies dropped.

  #8 (low) `eval-smoke.test.ts` header explicitly distinguishes the
     synthetic-fixture smoke test (structural) from the BEIR /
     504-query eval that produced the README's NDCG@10 = 0.920
     numbers (preserved at git `feffc73^`, runs in ~30 min).

  #9 (low) `BaseAdapter` JSDoc rewritten as the canonical
     "starting point for custom adapters" comment, including the
     RRF / capability / `searchHybrid` override guidance.
     `AsyncChunker` added to public exports.

  #10 (low) `examples/basic-search/index.ts` header documents the
      `npm i @huggingface/transformers` requirement for users
      copying the file out of the repo.

Test count: 163 → 191 (+28). Build green; published-package
contents verified clean (`find dist -name "test-fixtures*"` →
empty, `find dist -name "*.test.*"` → empty).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
willgitdata added a commit that referenced this pull request May 11, 2026
…ranker

Implements 10 fixes from a critical code review on the publish-ready
SDK. Each item is independently shippable; this lands them as one
coordinated bump because several feed each other (the eval smoke test
exercises the new fusion module; the explicit-reranker breaking change
needs MIGRATING.md to land at the same time).

  #1  Smoke eval harness — 16-doc / 12-query synthetic fixture with
      a regression floor (NDCG@10 > 0.65 on the stub stack). Runs in
      <50 ms as part of `pnpm test`. The full BEIR + 504-query eval
      stays where it already lives — git history — for "did this
      tweak win?" measurement.
  #2  Extract pure helpers to `fusion.ts` (composeFilter,
      pickVectorWeight, weightedRrfFuse, adaptWeightByConfidence,
      topGapNormalized, clamp) + 31 unit tests. Each empirical
      threshold is now annotated with what it was tuned against.
  #3  autoLanguageFilter integration tests — OFF default, ON for
      non-English, user-filter override wins, soft-fallback when the
      filtered pool empties.
  #4  Basic-search example wires the recommended stack: LocalEmbedder
      + LocalReranker + MetadataChunker(SentenceChunker) +
      InMemoryAdapter({ useStemming: true }). Matches the README's
      headline configuration so users copying from "hello world"
      land on the auto path that produces NDCG@10 = 0.920.
  #5  PineconeAdapter mocked-fetch tests (13 tests). Pins the wire
      format (URL, method, auth header, body shape, response decode)
      so refactors can't silently regress one of the three production
      adapters. Previously had zero coverage.
  #6  Ad-hoc scratch adapter cache — bounded LRU keyed by a
      deterministic fingerprint of (id, content). Repeat searches
      over the same `req.documents` skip re-chunking + re-embedding.
      Tunable via `adHocCacheSize` (default 8; set 0 to disable).
  #7  **BREAKING** Drop `HeuristicReranker` as silent default. The
      previous default did almost nothing while emitting a
      "yes I rerank" line in the trace. Default is now `null`;
      pass `new LocalReranker()` (or any provider's reranker)
      explicitly to keep cross-encoder voting on. Documented in
      MIGRATING.md.
  #8  MIGRATING.md — covers every BREAKING change in 0.2 with
      smallest-diff examples; non-breaking adoptions documented
      separately.
  #9  SemanticChunker tests (8 tests) — boundary detection, maxSize
      cap, async-only API, metadata propagation. Was the only
      chunker without coverage.
  #10 Magic-number provenance documented in `router.ts` and
      `fusion.ts`. Every threshold (≤2 / ≤6 word counts, 0.6
      ambiguity floor, 800 ms latency floor, k=60 RRF, ±0.20
      shift, 0.10–0.90 weight clamp, 0.3/0.4/0.5/0.7 priors) now
      records what it's tuned against and what's load-bearing
      vs negotiable.

Test count: 100 → 163. Build green; full eval still ships against
the published packages, not against the smoke fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
willgitdata added a commit that referenced this pull request May 11, 2026
Round 1 left two critical lying-trace bugs and a handful of
architectural papercuts. This commit clears the table.

  #1 (critical) Router no longer lies about reranking when no
     reranker is configured. `Router.decide` gained an optional
     `hasReranker` parameter; `HeuristicRouter` plumbs it through to
     `shouldRerank` and forces `reranked: false` when absent. The
     trace now records "reranking skipped (no reranker configured)"
     instead of pretending the cross-encoder fired. Augur passes
     `this.reranker !== null`. Default is `true` so existing
     third-party Router implementations keep working.

  #2 (critical) `SearchTrace` declares the four fields that augur.ts
     attaches at runtime: `adHoc`, `adHocCacheHit`, `autoLanguageFilter`,
     `autoLanguageFilterDropped`. `Tracer.finish` opts widened to
     `Omit<SearchTrace, "id"|"query"|"startedAt"|"totalMs"|"spans">`
     so adding a SearchTrace field propagates automatically. Tests
     drop their `as unknown as` casts.

  #3 (high) PgVectorAdapter mocked-fetch tests (14 tests). Pin SQL
     shape, INSERT batching at 200/round-trip, parameter renumbering
     across filter clauses, identifier-validation guard against
     `; DROP TABLE`, **and a filter-key SQL-injection regression
     test** — the JSON-path quote-doubling defense gets explicit
     coverage.

  #4 (high) Adapter trace-string format change reverted. `trace.adapter`
     is always the bare adapter name; ad-hoc / cache-hit signals
     surface as the new structured boolean fields from #2. No more
     "in-memory (ad-hoc, cached)" string parsing.

  #5 (high) `fingerprintDocs` extracted to `fingerprint.ts` with 10
     direct unit tests covering reorder, byte-change, prefix-equal
     corpora, id|content boundary, doc-record boundary, empty list,
     unicode, and the output format contract.

  #6 (medium) Async chunkers no longer pretend to be `Chunker`s.
     Introduced an explicit `AsyncChunker` interface;
     `SemanticChunker`, `Doc2QueryChunker`, `ContextualChunker`
     implement it (no longer Chunker). The runtime traps in their
     throwing `chunk()` methods are gone — the type system catches
     misuse at compile time. APIs that accept either flavor
     (`AugurOptions.chunker`, all chunker `base` fields,
     `chunkDocument`) now use `Chunker | AsyncChunker`.
     `MetadataChunker` keeps its dual sync+async path with a runtime
     guard for the user-opted-in case where its base is async.

  #7 (medium) `StubEmbedder` consolidated into
     `packages/core/src/test-fixtures.ts`. Excluded from the
     published package via tsconfig. Three duplicated copies dropped.

  #8 (low) `eval-smoke.test.ts` header explicitly distinguishes the
     synthetic-fixture smoke test (structural) from the BEIR /
     504-query eval that produced the README's NDCG@10 = 0.920
     numbers (preserved at git `4d52844^`, runs in ~30 min).

  #9 (low) `BaseAdapter` JSDoc rewritten as the canonical
     "starting point for custom adapters" comment, including the
     RRF / capability / `searchHybrid` override guidance.
     `AsyncChunker` added to public exports.

  #10 (low) `examples/basic-search/index.ts` header documents the
      `npm i @huggingface/transformers` requirement for users
      copying the file out of the repo.

Test count: 163 → 191 (+28). Build green; published-package
contents verified clean (`find dist -name "test-fixtures*"` →
empty, `find dist -name "*.test.*"` → empty).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant