Skip to content

Add BM25 full-text search with hybrid search via RRF fusion#478

Open
sroussey wants to merge 6 commits intomainfrom
claude/knowledge-base-explanation-PcAlh
Open

Add BM25 full-text search with hybrid search via RRF fusion#478
sroussey wants to merge 6 commits intomainfrom
claude/knowledge-base-explanation-PcAlh

Conversation

@sroussey
Copy link
Copy Markdown
Collaborator

Summary

This PR introduces a pluggable full-text search layer to the knowledge base, enabling hybrid search that fuses vector similarity with BM25F text relevance using Reciprocal Rank Fusion (RRF).

Key Changes

  • New ITextIndex interface (packages/storage/src/text/ITextIndex.ts): Defines the contract for full-text indexes with methods for adding/removing chunks and searching.

  • BM25F implementation (packages/storage/src/text/BM25Index.ts): In-memory BM25F index with:

    • Per-field weighting (tuned defaults for hierarchical chunks: titles weighted higher than body)
    • Configurable BM25 parameters (k1, b)
    • Pluggable tokenizer support
    • JSON serialization/deserialization for persistence
    • Cascade deletion via document ID tracking
  • Tokenizer abstraction (packages/storage/src/text/Tokenizer.ts):

    • Default English tokenizer with stopword filtering, case normalization, and Unicode-aware splitting
    • Exported stopword set for customization
    • Deterministic, side-effect-free design for consistent index/query behavior
  • KnowledgeBase integration:

    • New HybridSearchOptions and TextOnlySearchOptions interfaces moved from vector storage to KB layer
    • installTextIndex() / getTextIndex() / reindexText() methods for managing the text index
    • Auto-indexing of text fields on chunk upsert/bulk operations
    • Cascade deletion to text index on document removal
    • hybridSearch() and textSearch() methods using RRF fusion with configurable vector weight
    • supportsHybridSearch() predicate
  • Removed vector storage hybrid search: hybridSearch() methods removed from all vector storage implementations (InMemory, IndexedDB, SQLite, PostgreSQL) as this responsibility now belongs to the KB layer, enabling cleaner separation of concerns and support for pluggable text indexes.

  • Test coverage: Comprehensive tests for BM25 ranking, field weighting, upsert/remove semantics, and KB-level hybrid search with RRF fusion.

Implementation Details

  • RRF formula: score = vectorWeight / (rrfK + rank_vector) + (1 - vectorWeight) / (rrfK + rank_text) with configurable rrfK (default 60) and vectorWeight (default 0.7)
  • Field mapping: Text-indexable fields on ChunkRecord (text, doc_title, sectionTitles, summary, parentSummaries) are extracted and weighted according to DEFAULT_CHUNK_FIELD_WEIGHTS
  • Filtering: Both text and vector searches support metadata filtering; hybrid search applies filters before RRF fusion
  • Over-fetching: Hybrid search fetches topK * candidatePoolMultiplier candidates from each ranker to ensure sufficient overlap for fusion

https://claude.ai/code/session_01XkakhtNScxuC5RM6d5wLoR

Replaces the naive keyword-matching hybridSearch baked into each
IVectorStorage implementation with a real BM25F text index that lives
alongside the vector index, with fusion now performed at the
KnowledgeBase layer via Reciprocal Rank Fusion.

- Adds @workglow/storage/text: ITextIndex interface, pluggable Tokenizer
  with default English stopwords, and an in-memory BM25F index with
  per-field weights, JSON-serialisable state, and idempotent upserts.
- KnowledgeBase gains installTextIndex / getTextIndex / reindexText,
  auto-writes chunks to the text index on upsert, cascades deletes, and
  fuses similaritySearch + textIndex.search through RRF in hybridSearch.
- createKnowledgeBase accepts a textIndex option.
- Removes IVectorStorage.hybridSearch, HybridSearchOptions, and the
  per-backend keyword-match implementations across InMemory, IndexedDB,
  Telemetry, Scoped, Sqlite, SqliteAi, and Postgres vector storages.
  Native Postgres FTS can be reintroduced later as an ITextIndex
  adapter.
- ChunkRetrievalTask error message updated for the new "install a text
  index" semantics; scoreThreshold is no longer forwarded to hybrid
  (RRF scores are not comparable to cosine scores).
- New tests: BM25Index unit tests covering ranking, BM25F field weights,
  length normalisation, upsert/remove, removeByDocument cascades, JSON
  round-trips, and tokenizer behaviour; KB integration test covering
  auto-indexing, cascade deletes, RRF promotion of text-strong matches,
  reindexText, and the no-index error path.
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented May 10, 2026

Open in StackBlitz

@workglow/cli

npm i https://pkg.pr.new/@workglow/cli@478

@workglow/ai

npm i https://pkg.pr.new/@workglow/ai@478

@workglow/job-queue

npm i https://pkg.pr.new/@workglow/job-queue@478

@workglow/knowledge-base

npm i https://pkg.pr.new/@workglow/knowledge-base@478

@workglow/storage

npm i https://pkg.pr.new/@workglow/storage@478

@workglow/task-graph

npm i https://pkg.pr.new/@workglow/task-graph@478

@workglow/tasks

npm i https://pkg.pr.new/@workglow/tasks@478

@workglow/util

npm i https://pkg.pr.new/@workglow/util@478

workglow

npm i https://pkg.pr.new/workglow@478

commit: 48660ae

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 10, 2026

Coverage Report

Status Category Percentage Covered / Total
🔵 Lines 62.04% 20093 / 32386
🔵 Statements 61.89% 20781 / 33575
🔵 Functions 64.54% 3871 / 5997
🔵 Branches 51.17% 9674 / 18903
File CoverageNo changed files found.
Generated in workflow #2139 for commit 48660ae by the Vitest Coverage Report Action

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a pluggable full-text indexing/search layer (BM25F) and moves “hybrid search” (vector + text) into the KnowledgeBase layer, fusing rankings via Reciprocal Rank Fusion (RRF). It also removes the previous per-vector-backend hybridSearch APIs/implementations to centralize the responsibility in the KB.

Changes:

  • Add ITextIndex + default tokenizer + in-memory BM25Index with JSON round-tripping.
  • Integrate text indexing into KnowledgeBase (auto-index on chunk upserts, cascade deletes) and implement hybridSearch()/textSearch() with RRF fusion.
  • Remove hybridSearch from all IVectorStorage implementations/wrappers and delete corresponding integration tests.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
providers/sqlite/src/storage/SqliteVectorStorage.ts Removes hybridSearch from SQLite vector storage implementation.
providers/sqlite/src/storage/SqliteAiVectorStorage.ts Removes sqlite-vector hybrid search and its in-memory fallback.
providers/postgres/src/storage/PostgresVectorStorage.ts Removes Postgres hybrid search implementation and fallback.
packages/storage/src/vector/TelemetryVectorStorage.ts Drops telemetry wrapper support for hybridSearch.
packages/storage/src/vector/IVectorStorage.ts Removes HybridSearchOptions, hybridSearch API, and related event listener hook.
packages/storage/src/vector/InMemoryVectorStorage.ts Removes in-memory hybridSearch implementation.
packages/indexeddb/src/storage/IndexedDbVectorStorage.ts Removes IndexedDB hybridSearch implementation.
packages/knowledge-base/src/knowledge-base/ScopedVectorStorage.ts Removes hybridSearch passthrough from scoped wrapper.
packages/knowledge-base/src/knowledge-base/KnowledgeBase.ts Adds text index management, auto-indexing, textSearch(), and RRF-based hybridSearch().
packages/knowledge-base/src/knowledge-base/createKnowledgeBase.ts Adds textIndex option wiring into KnowledgeBase creation.
packages/storage/src/text/Tokenizer.ts Adds deterministic default tokenizer + exported stopwords.
packages/storage/src/text/ITextIndex.ts Introduces ITextIndex interface and associated types.
packages/storage/src/text/BM25Index.ts Implements in-memory BM25F index + JSON serialization.
packages/storage/src/text/index.ts Exports text-index public surface from @workglow/storage.
packages/storage/src/common.ts Re-exports the new text-index module from storage common entrypoint.
packages/ai/src/task/ChunkRetrievalTask.ts Updates hybrid retrieval to require KB text index and stops passing scoreThreshold into hybrid.
packages/test/src/test/vector/SqliteAiVectorStorage.integration.test.ts Removes vector-storage-level hybrid search integration tests.
packages/test/src/test/vector/IndexedDbVectorStorage.integration.test.ts Removes vector-storage-level hybrid search integration tests.
packages/test/src/test/storage-text/BM25Index.test.ts Adds unit tests for BM25 ranking, weights, upsert/remove, and JSON round-trip.
packages/test/src/test/rag/KnowledgeBaseHybridSearch.test.ts Adds KB-level tests for auto-indexing, cascades, and RRF fusion behavior.
Comments suppressed due to low confidence (1)

packages/ai/src/task/ChunkRetrievalTask.ts:261

  • For method: "hybrid", scoreThreshold is ignored (by design of RRF scoring), but it is still accepted in the task input schema and the output schema still describes scores as “Similarity scores”. This is confusing and can mislead callers. Consider updating the input/output schema descriptions (and/or runtime validation) to clarify that scoreThreshold only applies to similarity search and that hybrid scores are RRF fusion scores (not cosine similarity).
    const results: ChunkSearchResult[] =
      method === "hybrid"
        ? await kb.hybridSearch(searchVector, {
            textQuery: queryText!,
            topK,
            filter,
            vectorWeight,
          })
        : await kb.similaritySearch(searchVector, {
            topK,
            filter,
            scoreThreshold,
          });

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +405 to +410
const stored = await this.chunkStorage.put(chunk);
if (this.textIndex) {
const fields = chunkTextFields(stored.metadata);
if (fields) this.textIndex.add(stored.chunk_id, stored.doc_id, fields);
}
return stored;
Comment on lines +500 to +513
const {
textQuery,
topK = 10,
filter,
vectorWeight = 0.7,
rrfK = 60,
candidatePoolMultiplier = 2,
} = options;

const poolSize = Math.max(topK, topK * candidatePoolMultiplier);

const [vectorResults, textResults] = await Promise.all([
this.similaritySearch(query, { topK: poolSize, filter }),
Promise.resolve(index.search(textQuery, { topK: poolSize })),
Comment thread packages/storage/src/text/BM25Index.ts Outdated
Comment on lines +58 to +60
* {@link toJSON} / {@link fromJSON}. Field weights and tokenizer are *not*
* serialised — the caller is responsible for restoring an index with the
* same configuration as the one that produced the state.
if (!s || typeof s !== "object" || s.version !== 1) {
throw new Error("BM25Index.fromJSON: unsupported or missing state version");
}

Comment thread packages/storage/src/text/BM25Index.ts Outdated
Comment on lines +182 to +194
for (const [term, byField] of this.postings) {
for (const [field, list] of byField) {
const filtered = list.filter((p) => p.chunkId !== chunkId);
if (filtered.length === 0) {
byField.delete(field);
} else if (filtered.length !== list.length) {
byField.set(field, filtered);
}
}
if (byField.size === 0) {
this.postings.delete(term);
}
}
claude added 2 commits May 10, 2026 03:19
- BM25Index.fromJSON now restores k1/b (previously silently kept the
  constructor's defaults, so a round-trip with mismatched params could
  change scoring).
- BM25Index.remove uses a per-chunk reverse posting index instead of
  scanning the whole vocabulary. New chunkPostings map is populated on
  add and rebuilt by fromJSON.
- BM25Index class JSDoc corrected: scoring config (k1, b, fieldWeights)
  IS serialised; only the tokenizer is not.
- KnowledgeBase.upsertChunk / upsertChunksBulk now drop stale postings
  when an updated chunk has no indexable text fields.
- KnowledgeBase.put / putBulk delegate to upsertChunk / upsertChunksBulk
  so the text index stays in sync via either entry point.
- KnowledgeBase.hybridSearch uses Math.ceil for the candidate pool size
  to keep topK an integer when candidatePoolMultiplier is fractional.
- New regression tests for k1/b round-trip, post-fromJSON remove,
  empty-text upserts, put/putBulk indexing, and fractional multipliers.
- BM25Index: maintain a per-term document frequency cache (termDf) so
  search() does O(1) df lookup instead of walking every posting list per
  query term. Updated incrementally on add/remove and rebuilt on
  fromJSON; the old distinctChunkCount helper is gone.
- KnowledgeBase.reindexText: now atomic with respect to async failures.
  Reads chunkStorage and stages tokenisation before mutating the index;
  if getAll() throws, the existing index is untouched.
- KnowledgeBase.hybridSearch: default candidatePoolMultiplier bumped
  2 -> 5 to give RRF enough overlap between rankers (industry norm;
  with 2x and topK=10 the fusion was degenerating toward "OR of top-K"
  for queries with low ranker agreement).
- ChunkRetrievalTask: schema descriptions for method, scoreThreshold,
  and scores updated to disambiguate similarity (cosine in [0,1]) from
  hybrid (RRF fusion scores, not comparable). scoreThreshold still
  silently ignored for hybrid; schema now says so.
- Tests: switched KB names from Date.now() to uuid4() to remove
  same-millisecond collision risk; added regression tests for atomic
  reindexText (mocks getAll to throw, asserts index unchanged) and for
  RRF score shape (asserts hybrid scores are small positives, not
  cosine-range).
GitMCP (gitmcp.io) sits behind Cloudflare and consistently rate-limits
CI runs, returning a Cloudflare error page that fails the suite. The
five other servers in the list have been stable, so removing GitMCP
restores green CI without losing meaningful coverage.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 5 comments.

Comment on lines +529 to +533
const poolSize = Math.max(topK, Math.ceil(topK * candidatePoolMultiplier));

const [vectorResults, textResults] = await Promise.all([
this.similaritySearch(query, { topK: poolSize, filter }),
Promise.resolve(index.search(textQuery, { topK: poolSize })),
Comment on lines +536 to +549
const vectorWeightClamped = Math.max(0, Math.min(1, vectorWeight));
const textWeight = 1 - vectorWeightClamped;

const fused = new Map<string, { score: number; entity: ChunkSearchResult | undefined }>();

vectorResults.forEach((entity, rank) => {
const contribution = vectorWeightClamped / (rrfK + rank + 1);
fused.set(entity.chunk_id, { score: contribution, entity });
});

for (let rank = 0; rank < textResults.length; rank++) {
const { chunkId } = textResults[rank];
const contribution = textWeight / (rrfK + rank + 1);
const existing = fused.get(chunkId);
Comment on lines +557 to +566
const missing = Array.from(fused.entries())
.filter(([, v]) => v.entity === undefined)
.map(([chunkId]) => chunkId);

if (missing.length > 0) {
const hydrated = await Promise.all(
missing.map((chunk_id) => this.chunkStorage.get({ chunk_id }))
);
for (let i = 0; i < missing.length; i++) {
const entity = hydrated[i] as ChunkVectorEntity | undefined;
Comment on lines +40 to +55
it("hybridSearch throws when no text index is installed", async () => {
const kb = await createKnowledgeBase({ name: kbName, vectorDimensions: dimensions });
expect(kb.supportsHybridSearch()).toBe(false);
await expect(
kb.hybridSearch(vec(1, 0, 0), { textQuery: "rabbit", topK: 5 })
).rejects.toThrow(/text index/i);
});

it("auto-indexes text fields on upsertChunk and exposes them via textSearch", async () => {
const index = new BM25Index();
const kb = await createKnowledgeBase({
name: kbName,
vectorDimensions: dimensions,
textIndex: index,
});

const hits = await kb.textSearch("rabbit");
expect(hits.map((r) => r.chunk_id)).toEqual(["c1"]);

// Restore so the destroy() in afterEach behaves.
claude added 2 commits May 10, 2026 21:05
- hybridSearch falls back to similaritySearch when textQuery is empty
  or whitespace-only. Previously it ran the full RRF path with an empty
  text-result list, returning vector-only RRF-shaped scores (~[0,0.05])
  in the score field — surprising to callers expecting cosine scores
  in [0,1] when the query has no text signal.
- Clamp rrfK to a non-negative value at use site (Math.max(0, rrfK)).
  rrfK <= -1 previously produced Infinity / sign-flipped denominators
  for some ranks.
- Clamp candidatePoolMultiplier to >= 1 to prevent zero / negative
  pool sizes.
- KnowledgeBaseHybridSearch.test.ts: pass register: false to every
  createKnowledgeBase call so the global KB registry doesn't accumulate
  per-test entries. Drop the no-op afterEach (its comment claimed
  in-memory KBs need no teardown, but said nothing about the registry
  cost; with register: false there's nothing to do).
- New regression tests: empty/whitespace textQuery returns cosine
  scores; rrfK=-10 produces finite, positive scores.

Not fixed in this PR: the N+1 hydration concern (kb.hybridSearch /
textSearch issue one chunkStorage.get per chunkId). Fixing it requires
adding a multi-id batch fetch to ITabularStorage, which is broader than
this PR's scope. Tracked as a follow-up.
ChunkSearchResult now carries an optional `scoreType: "cosine" | "bm25"
| "rrf"` field so callers (typically UIs) can render scores
appropriately without having to remember which search method produced
them — the three scorers live on incompatible scales:

- cosine: [-1,1], typically [0,1] for text embeddings. Absolute.
- bm25:   [0,inf). Absolute but corpus-dependent.
- rrf:    bounded above by 2/(rrfK+1) (~0.033 with defaults).
          Rank-based, not absolute, not comparable across queries.

KB.similaritySearch tags "cosine", KB.textSearch tags "bm25",
KB.hybridSearch tags "rrf" — except the empty-textQuery fallback that
routes through similaritySearch, which correctly surfaces "cosine".

ChunkRetrievalTask exposes a top-level `scoreType` field in its output
(single string, since every result in one call shares the same type).

Field is optional on ChunkSearchResult to keep the type non-breaking
for storage-only callers that construct results without going through
the KB layer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants