Add BM25 full-text search with hybrid search via RRF fusion by sroussey · Pull Request #478 · workglow-dev/libs

sroussey · 2026-05-10T03:11:02Z

Summary

This PR introduces a pluggable full-text search layer to the knowledge base, enabling hybrid search that fuses vector similarity with BM25F text relevance using Reciprocal Rank Fusion (RRF).

Key Changes

New ITextIndex interface (packages/storage/src/text/ITextIndex.ts): Defines the contract for full-text indexes with methods for adding/removing chunks and searching.
BM25F implementation (packages/storage/src/text/BM25Index.ts): In-memory BM25F index with:
- Per-field weighting (tuned defaults for hierarchical chunks: titles weighted higher than body)
- Configurable BM25 parameters (k1, b)
- Pluggable tokenizer support
- JSON serialization/deserialization for persistence
- Cascade deletion via document ID tracking
Tokenizer abstraction (packages/storage/src/text/Tokenizer.ts):
- Default English tokenizer with stopword filtering, case normalization, and Unicode-aware splitting
- Exported stopword set for customization
- Deterministic, side-effect-free design for consistent index/query behavior
KnowledgeBase integration:
- New HybridSearchOptions and TextOnlySearchOptions interfaces moved from vector storage to KB layer
- installTextIndex() / getTextIndex() / reindexText() methods for managing the text index
- Auto-indexing of text fields on chunk upsert/bulk operations
- Cascade deletion to text index on document removal
- hybridSearch() and textSearch() methods using RRF fusion with configurable vector weight
- supportsHybridSearch() predicate
Removed vector storage hybrid search: hybridSearch() methods removed from all vector storage implementations (InMemory, IndexedDB, SQLite, PostgreSQL) as this responsibility now belongs to the KB layer, enabling cleaner separation of concerns and support for pluggable text indexes.
Test coverage: Comprehensive tests for BM25 ranking, field weighting, upsert/remove semantics, and KB-level hybrid search with RRF fusion.
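
The tokenizer behaviour described above (case normalization, Unicode-aware splitting, stopword filtering, no side effects) can be illustrated with a minimal sketch. This is not the PR's `Tokenizer.ts`: the names `SKETCH_STOPWORDS` and `sketchTokenize` and the tiny stopword list are assumptions made for illustration only.

```typescript
// Hypothetical sketch of a deterministic tokenizer; not the PR's implementation.
// The stopword list is deliberately tiny and illustrative.
const SKETCH_STOPWORDS: ReadonlySet<string> = new Set([
  "a", "an", "and", "the", "of", "to", "in", "is",
]);

function sketchTokenize(
  text: string,
  stopwords: ReadonlySet<string> = SKETCH_STOPWORDS
): string[] {
  return text
    .toLowerCase() // case normalization
    .split(/[^\p{L}\p{N}]+/u) // split on anything that is not a Unicode letter or digit
    .filter((token) => token.length > 0 && !stopwords.has(token));
}

console.log(sketchTokenize("The Rabbit and the Hare, 2nd edition!"));
```

Because the function is pure, running it on a chunk at index time and on a query at search time yields the same terms for the same input, which is the property a text index needs.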

Implementation Details

RRF formula: score = vectorWeight / (rrfK + rank_vector) + (1 - vectorWeight) / (rrfK + rank_text) with configurable rrfK (default 60) and vectorWeight (default 0.7)
Field mapping: Text-indexable fields on ChunkRecord (text, doc_title, sectionTitles, summary, parentSummaries) are extracted and weighted according to DEFAULT_CHUNK_FIELD_WEIGHTS
Filtering: Both text and vector searches support metadata filtering; hybrid search applies filters before RRF fusion
Over-fetching: Hybrid search fetches topK * candidatePoolMultiplier candidates from each ranker to ensure sufficient overlap for fusion
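
The RRF formula above can be sketched as a standalone function. This is an illustrative sketch, not the PR's `KnowledgeBase.hybridSearch`: the function name `rrfFuse`, the plain string-ID inputs, and the assumption that ranks are 1-based (array index + 1) are all choices made for this example.

```typescript
// Hypothetical sketch of weighted Reciprocal Rank Fusion over two ranked ID lists.
// Assumes ranks are 1-based: the first element of each list has rank 1.
function rrfFuse(
  vectorRanked: string[],
  textRanked: string[],
  opts: { vectorWeight?: number; rrfK?: number; topK?: number } = {}
): Array<{ id: string; score: number }> {
  const { vectorWeight = 0.7, rrfK = 60, topK = 10 } = opts;
  const w = Math.max(0, Math.min(1, vectorWeight)); // keep the weight in [0, 1]
  const scores = new Map<string, number>();

  vectorRanked.forEach((id, i) => {
    scores.set(id, (scores.get(id) ?? 0) + w / (rrfK + i + 1));
  });
  textRanked.forEach((id, i) => {
    scores.set(id, (scores.get(id) ?? 0) + (1 - w) / (rrfK + i + 1));
  });

  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// "b" appears in both lists, so it accumulates two contributions.
console.log(rrfFuse(["a", "b"], ["b", "c"]));
```

An item found by only one ranker receives a single contribution, so agreement between rankers is what pushes a result to the top.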

https://claude.ai/code/session_01XkakhtNScxuC5RM6d5wLoR

Replaces the naive keyword-matching hybridSearch baked into each IVectorStorage implementation with a real BM25F text index that lives alongside the vector index, with fusion now performed at the KnowledgeBase layer via Reciprocal Rank Fusion.

- Adds @workglow/storage/text: ITextIndex interface, pluggable Tokenizer with default English stopwords, and an in-memory BM25F index with per-field weights, JSON-serialisable state, and idempotent upserts.
- KnowledgeBase gains installTextIndex / getTextIndex / reindexText, auto-writes chunks to the text index on upsert, cascades deletes, and fuses similaritySearch + textIndex.search through RRF in hybridSearch.
- createKnowledgeBase accepts a textIndex option.
- Removes IVectorStorage.hybridSearch, HybridSearchOptions, and the per-backend keyword-match implementations across InMemory, IndexedDB, Telemetry, Scoped, Sqlite, SqliteAi, and Postgres vector storages. Native Postgres FTS can be reintroduced later as an ITextIndex adapter.
- ChunkRetrievalTask error message updated for the new "install a text index" semantics; scoreThreshold is no longer forwarded to hybrid (RRF scores are not comparable to cosine scores).
- New tests: BM25Index unit tests covering ranking, BM25F field weights, length normalisation, upsert/remove, removeByDocument cascades, JSON round-trips, and tokenizer behaviour; KB integration test covering auto-indexing, cascade deletes, RRF promotion of text-strong matches, reindexText, and the no-index error path.
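
For readers unfamiliar with BM25F, the per-field weighting and length normalisation mentioned above can be sketched for a single query term. This follows the common textbook formulation of BM25F and is not taken from the PR's `BM25Index.ts`; the `FieldStat` shape, the function name, the IDF variant, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions.

```typescript
// Hypothetical sketch of a BM25F score for one query term in one chunk.
// Each field contributes a weighted, length-normalised term frequency.
interface FieldStat {
  tf: number; // occurrences of the term in this field
  length: number; // token count of this field in this chunk
  avgLength: number; // average token count of this field across the corpus
  weight: number; // field weight, e.g. a title weighted above the body
}

function bm25fTermScore(
  fields: FieldStat[],
  df: number, // number of chunks containing the term
  totalDocs: number, // number of chunks in the index
  k1 = 1.2,
  b = 0.75
): number {
  let pseudoTf = 0;
  for (const f of fields) {
    if (f.tf === 0 || f.avgLength === 0) continue;
    const lengthNorm = 1 - b + b * (f.length / f.avgLength);
    pseudoTf += (f.weight * f.tf) / lengthNorm;
  }
  // A common non-negative IDF variant (an assumption, not confirmed by the PR).
  const idf = Math.log(1 + (totalDocs - df + 0.5) / (df + 0.5));
  return idf * (pseudoTf / (k1 + pseudoTf));
}

const titleHit = bm25fTermScore([{ tf: 1, length: 5, avgLength: 5, weight: 3 }], 1, 10);
const bodyHit = bm25fTermScore([{ tf: 1, length: 5, avgLength: 5, weight: 1 }], 1, 10);
console.log(titleHit > bodyHit); // a title match outranks an otherwise identical body match
```

A full query score would sum this value over all query terms.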

pkg-pr-new · 2026-05-10T03:12:15Z

@workglow/cli

npm i https://pkg.pr.new/@workglow/cli@478

@workglow/ai

npm i https://pkg.pr.new/@workglow/ai@478

@workglow/job-queue

npm i https://pkg.pr.new/@workglow/job-queue@478

@workglow/knowledge-base

npm i https://pkg.pr.new/@workglow/knowledge-base@478

@workglow/storage

npm i https://pkg.pr.new/@workglow/storage@478

@workglow/task-graph

npm i https://pkg.pr.new/@workglow/task-graph@478

@workglow/tasks

npm i https://pkg.pr.new/@workglow/tasks@478

@workglow/util

npm i https://pkg.pr.new/@workglow/util@478

workglow

npm i https://pkg.pr.new/workglow@478

commit: 48660ae

github-actions · 2026-05-10T03:14:56Z

Coverage Report

Status	Category	Percentage	Covered / Total
🔵	Lines	62.04%	20093 / 32386
🔵	Statements	61.89%	20781 / 33575
🔵	Functions	64.54%	3871 / 5997
🔵	Branches	51.17%	9674 / 18903

File Coverage

No changed files found.

Generated in workflow #2139 for commit 48660ae by the Vitest Coverage Report Action

Copilot

Pull request overview

This PR introduces a pluggable full-text indexing/search layer (BM25F) and moves “hybrid search” (vector + text) into the KnowledgeBase layer, fusing rankings via Reciprocal Rank Fusion (RRF). It also removes the previous per-vector-backend hybridSearch APIs/implementations to centralize the responsibility in the KB.

Changes:

Add ITextIndex + default tokenizer + in-memory BM25Index with JSON round-tripping.
Integrate text indexing into KnowledgeBase (auto-index on chunk upserts, cascade deletes) and implement hybridSearch()/textSearch() with RRF fusion.
Remove hybridSearch from all IVectorStorage implementations/wrappers and delete corresponding integration tests.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.

File	Description
providers/sqlite/src/storage/SqliteVectorStorage.ts	Removes `hybridSearch` from SQLite vector storage implementation.
providers/sqlite/src/storage/SqliteAiVectorStorage.ts	Removes sqlite-vector hybrid search and its in-memory fallback.
providers/postgres/src/storage/PostgresVectorStorage.ts	Removes Postgres hybrid search implementation and fallback.
packages/storage/src/vector/TelemetryVectorStorage.ts	Drops telemetry wrapper support for `hybridSearch`.
packages/storage/src/vector/IVectorStorage.ts	Removes `HybridSearchOptions`, `hybridSearch` API, and related event listener hook.
packages/storage/src/vector/InMemoryVectorStorage.ts	Removes in-memory `hybridSearch` implementation.
packages/indexeddb/src/storage/IndexedDbVectorStorage.ts	Removes IndexedDB `hybridSearch` implementation.
packages/knowledge-base/src/knowledge-base/ScopedVectorStorage.ts	Removes `hybridSearch` passthrough from scoped wrapper.
packages/knowledge-base/src/knowledge-base/KnowledgeBase.ts	Adds text index management, auto-indexing, `textSearch()`, and RRF-based `hybridSearch()`.
packages/knowledge-base/src/knowledge-base/createKnowledgeBase.ts	Adds `textIndex` option wiring into KnowledgeBase creation.
packages/storage/src/text/Tokenizer.ts	Adds deterministic default tokenizer + exported stopwords.
packages/storage/src/text/ITextIndex.ts	Introduces `ITextIndex` interface and associated types.
packages/storage/src/text/BM25Index.ts	Implements in-memory BM25F index + JSON serialization.
packages/storage/src/text/index.ts	Exports text-index public surface from `@workglow/storage`.
packages/storage/src/common.ts	Re-exports the new text-index module from storage common entrypoint.
packages/ai/src/task/ChunkRetrievalTask.ts	Updates hybrid retrieval to require KB text index and stops passing `scoreThreshold` into hybrid.
packages/test/src/test/vector/SqliteAiVectorStorage.integration.test.ts	Removes vector-storage-level hybrid search integration tests.
packages/test/src/test/vector/IndexedDbVectorStorage.integration.test.ts	Removes vector-storage-level hybrid search integration tests.
packages/test/src/test/storage-text/BM25Index.test.ts	Adds unit tests for BM25 ranking, weights, upsert/remove, and JSON round-trip.
packages/test/src/test/rag/KnowledgeBaseHybridSearch.test.ts	Adds KB-level tests for auto-indexing, cascades, and RRF fusion behavior.

Comments suppressed due to low confidence (1)

packages/ai/src/task/ChunkRetrievalTask.ts:261

For method: "hybrid", scoreThreshold is ignored (by design of RRF scoring), but it is still accepted in the task input schema and the output schema still describes scores as “Similarity scores”. This is confusing and can mislead callers. Consider updating the input/output schema descriptions (and/or runtime validation) to clarify that scoreThreshold only applies to similarity search and that hybrid scores are RRF fusion scores (not cosine similarity).

    const results: ChunkSearchResult[] =
      method === "hybrid"
        ? await kb.hybridSearch(searchVector, {
            textQuery: queryText!,
            topK,
            filter,
            vectorWeight,
          })
        : await kb.similaritySearch(searchVector, {
            topK,
            filter,
            scoreThreshold,
          });

+    const stored = await this.chunkStorage.put(chunk);
+    if (this.textIndex) {
+      const fields = chunkTextFields(stored.metadata);
+      if (fields) this.textIndex.add(stored.chunk_id, stored.doc_id, fields);
+    }
+    return stored;


+    const {
+      textQuery,
+      topK = 10,
+      filter,
+      vectorWeight = 0.7,
+      rrfK = 60,
+      candidatePoolMultiplier = 2,
+    } = options;
+
+    const poolSize = Math.max(topK, topK * candidatePoolMultiplier);
+
+    const [vectorResults, textResults] = await Promise.all([
+      this.similaritySearch(query, { topK: poolSize, filter }),
+      Promise.resolve(index.search(textQuery, { topK: poolSize })),


+ * {@link toJSON} / {@link fromJSON}. Field weights and tokenizer are *not*
+ * serialised — the caller is responsible for restoring an index with the
+ * same configuration as the one that produced the state.


+    if (!s || typeof s !== "object" || s.version !== 1) {
+      throw new Error("BM25Index.fromJSON: unsupported or missing state version");
+    }
+


+    for (const [term, byField] of this.postings) {
+      for (const [field, list] of byField) {
+        const filtered = list.filter((p) => p.chunkId !== chunkId);
+        if (filtered.length === 0) {
+          byField.delete(field);
+        } else if (filtered.length !== list.length) {
+          byField.set(field, filtered);
+        }
+      }
+      if (byField.size === 0) {
+        this.postings.delete(term);
+      }
+    }


- BM25Index.fromJSON now restores k1/b (previously silently kept the constructor's defaults, so a round-trip with mismatched params could change scoring).
- BM25Index.remove uses a per-chunk reverse posting index instead of scanning the whole vocabulary. New chunkPostings map is populated on add and rebuilt by fromJSON.
- BM25Index class JSDoc corrected: scoring config (k1, b, fieldWeights) IS serialised; only the tokenizer is not.
- KnowledgeBase.upsertChunk / upsertChunksBulk now drop stale postings when an updated chunk has no indexable text fields.
- KnowledgeBase.put / putBulk delegate to upsertChunk / upsertChunksBulk so the text index stays in sync via either entry point.
- KnowledgeBase.hybridSearch uses Math.ceil for the candidate pool size to keep topK an integer when candidatePoolMultiplier is fractional.
- New regression tests for k1/b round-trip, post-fromJSON remove, empty-text upserts, put/putBulk indexing, and fractional multipliers.
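
The reverse posting index mentioned in this commit can be sketched as follows. This is a simplified illustration, not the PR's code: it ignores fields, and the class name `SketchPostings` and its `chunkTerms` map (standing in for the commit's chunkPostings) are assumptions.

```typescript
// Hypothetical sketch: removing a chunk touches only the terms it contributed,
// instead of scanning every posting list in the vocabulary.
type Posting = { chunkId: string; tf: number };

class SketchPostings {
  private postings = new Map<string, Posting[]>(); // term -> postings
  private chunkTerms = new Map<string, Set<string>>(); // chunkId -> its terms

  add(chunkId: string, tokens: string[]): void {
    this.remove(chunkId); // makes add an idempotent upsert
    const tf = new Map<string, number>();
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
    tf.forEach((count, term) => {
      const list = this.postings.get(term) ?? [];
      list.push({ chunkId, tf: count });
      this.postings.set(term, list);
    });
    this.chunkTerms.set(chunkId, new Set(Array.from(tf.keys())));
  }

  remove(chunkId: string): void {
    const terms = this.chunkTerms.get(chunkId);
    if (!terms) return;
    terms.forEach((term) => {
      const kept = (this.postings.get(term) ?? []).filter((p) => p.chunkId !== chunkId);
      if (kept.length === 0) this.postings.delete(term);
      else this.postings.set(term, kept);
    });
    this.chunkTerms.delete(chunkId);
  }

  postingsFor(term: string): Posting[] {
    return this.postings.get(term) ?? [];
  }

  vocabularySize(): number {
    return this.postings.size;
  }
}
```

The cost of `remove` is then proportional to the number of distinct terms in the removed chunk rather than to the size of the whole vocabulary.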

- BM25Index: maintain a per-term document frequency cache (termDf) so search() does O(1) df lookup instead of walking every posting list per query term. Updated incrementally on add/remove and rebuilt on fromJSON; the old distinctChunkCount helper is gone.
- KnowledgeBase.reindexText: now atomic with respect to async failures. Reads chunkStorage and stages tokenisation before mutating the index; if getAll() throws, the existing index is untouched.
- KnowledgeBase.hybridSearch: default candidatePoolMultiplier bumped 2 -> 5 to give RRF enough overlap between rankers (industry norm; with 2x and topK=10 the fusion was degenerating toward "OR of top-K" for queries with low ranker agreement).
- ChunkRetrievalTask: schema descriptions for method, scoreThreshold, and scores updated to disambiguate similarity (cosine in [0,1]) from hybrid (RRF fusion scores, not comparable). scoreThreshold still silently ignored for hybrid; schema now says so.
- Tests: switched KB names from Date.now() to uuid4() to remove same-millisecond collision risk; added regression tests for atomic reindexText (mocks getAll to throw, asserts index unchanged) and for RRF score shape (asserts hybrid scores are small positives, not cosine-range).
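
The document frequency cache described in this commit can be sketched as an incrementally maintained counter. This is an illustration under stated assumptions, not the PR's code: the class name `SketchDfCache`, the field-less model, and the IDF variant are hypothetical.

```typescript
// Hypothetical sketch: keep df(term) up to date on add/remove so that a query
// can read it in O(1) instead of walking posting lists.
class SketchDfCache {
  private termDf = new Map<string, number>(); // term -> number of chunks containing it
  private chunkTerms = new Map<string, Set<string>>(); // chunkId -> distinct terms

  add(chunkId: string, tokens: string[]): void {
    this.remove(chunkId); // upsert semantics: drop the old contribution first
    const unique = new Set(tokens);
    unique.forEach((term) => this.termDf.set(term, (this.termDf.get(term) ?? 0) + 1));
    this.chunkTerms.set(chunkId, unique);
  }

  remove(chunkId: string): void {
    const terms = this.chunkTerms.get(chunkId);
    if (!terms) return;
    terms.forEach((term) => {
      const next = (this.termDf.get(term) ?? 0) - 1;
      if (next <= 0) this.termDf.delete(term);
      else this.termDf.set(term, next);
    });
    this.chunkTerms.delete(chunkId);
  }

  df(term: string): number {
    return this.termDf.get(term) ?? 0;
  }

  // A common non-negative BM25 IDF variant (an assumption, not confirmed by the PR).
  idf(term: string): number {
    const n = this.chunkTerms.size;
    const d = this.df(term);
    return Math.log(1 + (n - d + 0.5) / (d + 0.5));
  }
}
```

Repeated tokens within one chunk count once, since document frequency measures how many chunks contain a term, not how often it occurs.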

GitMCP (gitmcp.io) sits behind Cloudflare and consistently rate-limits CI runs, returning a Cloudflare error page that fails the suite. The five other servers in the list have been stable, so removing GitMCP restores green CI without losing meaningful coverage.

Copilot

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

+    const poolSize = Math.max(topK, Math.ceil(topK * candidatePoolMultiplier));
+
+    const [vectorResults, textResults] = await Promise.all([
+      this.similaritySearch(query, { topK: poolSize, filter }),
+      Promise.resolve(index.search(textQuery, { topK: poolSize })),


+    const vectorWeightClamped = Math.max(0, Math.min(1, vectorWeight));
+    const textWeight = 1 - vectorWeightClamped;
+
+    const fused = new Map<string, { score: number; entity: ChunkSearchResult | undefined }>();
+
+    vectorResults.forEach((entity, rank) => {
+      const contribution = vectorWeightClamped / (rrfK + rank + 1);
+      fused.set(entity.chunk_id, { score: contribution, entity });
+    });
+
+    for (let rank = 0; rank < textResults.length; rank++) {
+      const { chunkId } = textResults[rank];
+      const contribution = textWeight / (rrfK + rank + 1);
+      const existing = fused.get(chunkId);


+    const missing = Array.from(fused.entries())
+      .filter(([, v]) => v.entity === undefined)
+      .map(([chunkId]) => chunkId);
+
+    if (missing.length > 0) {
+      const hydrated = await Promise.all(
+        missing.map((chunk_id) => this.chunkStorage.get({ chunk_id }))
+      );
+      for (let i = 0; i < missing.length; i++) {
+        const entity = hydrated[i] as ChunkVectorEntity | undefined;


+  it("hybridSearch throws when no text index is installed", async () => {
+    const kb = await createKnowledgeBase({ name: kbName, vectorDimensions: dimensions });
+    expect(kb.supportsHybridSearch()).toBe(false);
+    await expect(
+      kb.hybridSearch(vec(1, 0, 0), { textQuery: "rabbit", topK: 5 })
+    ).rejects.toThrow(/text index/i);
+  });
+
+  it("auto-indexes text fields on upsertChunk and exposes them via textSearch", async () => {
+    const index = new BM25Index();
+    const kb = await createKnowledgeBase({
+      name: kbName,
+      vectorDimensions: dimensions,
+      textIndex: index,
+    });
+


+    const hits = await kb.textSearch("rabbit");
+    expect(hits.map((r) => r.chunk_id)).toEqual(["c1"]);
+
+    // Restore so the destroy() in afterEach behaves.


- hybridSearch falls back to similaritySearch when textQuery is empty or whitespace-only. Previously it ran the full RRF path with an empty text-result list, returning vector-only RRF-shaped scores (~[0,0.05]) in the score field, which is surprising to callers expecting cosine scores in [0,1] when the query has no text signal.
- Clamp rrfK to a non-negative value at use site (Math.max(0, rrfK)). rrfK <= -1 previously produced Infinity / sign-flipped denominators for some ranks.
- Clamp candidatePoolMultiplier to >= 1 to prevent zero / negative pool sizes.
- KnowledgeBaseHybridSearch.test.ts: pass register: false to every createKnowledgeBase call so the global KB registry doesn't accumulate per-test entries. Drop the no-op afterEach (its comment claimed in-memory KBs need no teardown, but said nothing about the registry cost; with register: false there's nothing to do).
- New regression tests: empty/whitespace textQuery returns cosine scores; rrfK=-10 produces finite, positive scores.

Not fixed in this PR: the N+1 hydration concern (kb.hybridSearch / textSearch issue one chunkStorage.get per chunkId). Fixing it requires adding a multi-id batch fetch to ITabularStorage, which is broader than this PR's scope. Tracked as a follow-up.
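
The input guards listed in this commit can be sketched as a small pure function that decides how a hybrid search would proceed. The function `planHybridSearch` and its types are hypothetical and exist only for this illustration; the defaults mirror values quoted elsewhere in this PR (topK 10, vectorWeight 0.7, rrfK 60, candidatePoolMultiplier 5).

```typescript
// Hypothetical sketch of the guards: empty-query fallback plus parameter clamping.
interface SketchHybridOptions {
  textQuery: string;
  topK?: number;
  vectorWeight?: number;
  rrfK?: number;
  candidatePoolMultiplier?: number;
}

type SketchPlan =
  | { mode: "similarity" }
  | { mode: "hybrid"; poolSize: number; rrfK: number; vectorWeight: number };

function planHybridSearch(options: SketchHybridOptions): SketchPlan {
  const {
    textQuery,
    topK = 10,
    vectorWeight = 0.7,
    rrfK = 60,
    candidatePoolMultiplier = 5,
  } = options;

  // No text signal: fall back to plain similarity search (cosine scores).
  if (textQuery.trim().length === 0) return { mode: "similarity" };

  const multiplier = Math.max(1, candidatePoolMultiplier); // never shrink the pool
  return {
    mode: "hybrid",
    poolSize: Math.max(topK, Math.ceil(topK * multiplier)), // integer pool size
    rrfK: Math.max(0, rrfK), // keeps every denominator rrfK + rank >= 1
    vectorWeight: Math.max(0, Math.min(1, vectorWeight)),
  };
}

console.log(planHybridSearch({ textQuery: "rabbit", topK: 3, candidatePoolMultiplier: 1.5 }));
```

With rrfK clamped to zero or above and 1-based ranks, every denominator is at least 1, so contributions stay finite and positive.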

ChunkSearchResult now carries an optional `scoreType: "cosine" | "bm25" | "rrf"` field so callers (typically UIs) can render scores appropriately without having to remember which search method produced them. The three scorers live on incompatible scales:

- cosine: [-1,1], typically [0,1] for text embeddings. Absolute.
- bm25: [0,inf). Absolute but corpus-dependent.
- rrf: bounded above by 2/(rrfK+1) (~0.033 with defaults). Rank-based, not absolute, not comparable across queries.

KB.similaritySearch tags "cosine", KB.textSearch tags "bm25", and KB.hybridSearch tags "rrf", except the empty-textQuery fallback that routes through similaritySearch, which correctly surfaces "cosine". ChunkRetrievalTask exposes a top-level `scoreType` field in its output (single string, since every result in one call shares the same type). Field is optional on ChunkSearchResult to keep the type non-breaking for storage-only callers that construct results without going through the KB layer.
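
A short sketch of how a caller might use `scoreType` when rendering. The helper names `maxRrfScore` and `describeScore` and the output format are assumptions, not part of the PR. The arithmetic also shows that, when the two weights are clamped and sum to 1 as in the fusion excerpt earlier in this PR, a chunk ranked first by both rankers scores 1/(rrfK+1), about 0.0164 with rrfK = 60, so the 2/(rrfK+1) figure quoted above is a safe but loose bound.

```typescript
// Hypothetical rendering helper keyed on scoreType; names and formats are illustrative.
type SketchScoreType = "cosine" | "bm25" | "rrf";

// Largest achievable fused score: rank 1 in both rankers, weights w and (1 - w).
function maxRrfScore(rrfK = 60, vectorWeight = 0.7): number {
  const k = Math.max(0, rrfK);
  const w = Math.max(0, Math.min(1, vectorWeight));
  return w / (k + 1) + (1 - w) / (k + 1); // equals 1 / (k + 1) up to rounding
}

function describeScore(
  score: number,
  scoreType: SketchScoreType | undefined,
  rrfK = 60
): string {
  switch (scoreType) {
    case "cosine":
      return `cosine ${score.toFixed(3)}`;
    case "bm25":
      return `bm25 ${score.toFixed(2)} (corpus-dependent)`;
    case "rrf":
      // One possible presentation: position relative to the best achievable score.
      return `rrf ${((100 * score) / maxRrfScore(rrfK)).toFixed(0)}% of max`;
    default:
      return `score ${score}`; // untagged results from storage-only callers
  }
}

console.log(describeScore(0.5, "cosine"));
console.log(describeScore(maxRrfScore(), "rrf"));
```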

sroussey self-assigned this May 10, 2026

sroussey requested a review from Copilot May 10, 2026 03:11

Copilot started reviewing on behalf of sroussey May 10, 2026 03:11

Copilot AI reviewed May 10, 2026

claude added 2 commits May 10, 2026 03:19

sroussey requested a review from Copilot May 10, 2026 20:58

Copilot started reviewing on behalf of sroussey May 10, 2026 20:58

Copilot AI reviewed May 10, 2026

claude added 2 commits May 10, 2026 21:05

sroussey wants to merge 6 commits into main from claude/knowledge-base-explanation-PcAlh

3 participants
