docs: add design for storage getBulk plural-get by sroussey · Pull Request #480 · workglow-dev/libs

sroussey · 2026-05-11T00:19:28Z

Design spec for adding getBulk(keys) to IKvStorage / ITabularStorage,
reclaiming the name from the deprecated offset/limit method on tabular
(renamed to getOffsetPage). Vector inherits via ITabularStorage.

pkg-pr-new · 2026-05-11T00:22:57Z

Open in StackBlitz

@workglow/cli

npm i https://pkg.pr.new/@workglow/cli@480

@workglow/ai

npm i https://pkg.pr.new/@workglow/ai@480

@workglow/job-queue

npm i https://pkg.pr.new/@workglow/job-queue@480

@workglow/knowledge-base

npm i https://pkg.pr.new/@workglow/knowledge-base@480

@workglow/storage

npm i https://pkg.pr.new/@workglow/storage@480

@workglow/task-graph

npm i https://pkg.pr.new/@workglow/task-graph@480

@workglow/tasks

npm i https://pkg.pr.new/@workglow/tasks@480

@workglow/util

npm i https://pkg.pr.new/@workglow/util@480

workglow

npm i https://pkg.pr.new/workglow@480

commit: 99d438d

github-actions · 2026-05-11T02:34:59Z

Coverage Report

Status	Category	Percentage	Covered / Total
🔵	Lines	61.89%	20164 / 32579
🔵	Statements	61.74%	20872 / 33802
🔵	Functions	64.53%	3908 / 6056
🔵	Branches	51.08%	9716 / 19020

File Coverage

File	Stmts	Branches	Functions	Lines	Uncovered Lines
Changed Files
packages/storage/src/kv/IKvStorage.ts	100%	100%	100%	100%
packages/storage/src/kv/KvStorage.ts	40%	100%	14.28%	40%	104-154
packages/storage/src/kv/KvViaTabularStorage.ts	40%	57.14%	25%	41.86%	58-69, 88-94, 122-173
packages/storage/src/tabular/BaseTabularStorage.ts	18.63%	7.5%	10.93%	20.52%	78-117, 212, 215, 234, 237-239, 244, 258-265, 277-278, 284-287, 291-297, 307-308, 312, 317-957, 986-987, 990-991, 1025-1078, 1095-1201
packages/storage/src/tabular/FsFolderTabularStorage.ts	16.76%	6.45%	12.19%	17.9%	45-77, 137-170, 185-206, 216-223, 237-238, 268-394, 416-505
packages/storage/src/tabular/ITabularStorage.ts	0%	0%	0%	0%	207-212

Generated in workflow #2160 for commit 99d438d by the Vitest Coverage Report Action

Design spec for adding getBulk(keys) to IKvStorage / ITabularStorage, reclaiming the name from the deprecated offset/limit method on tabular (renamed to getOffsetPage). Vector inherits via ITabularStorage.

@deprecated

…etPage Frees the getBulk name for the upcoming plural-get-by-keys method. @deprecated JSDoc preserved; migration target remains getPage. https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

…setPage Missed in the initial rename sweep. Brings SupabaseTabularStorage in line with the renamed ITabularStorage interface.

Default implementation in BaseTabularStorage does Promise.all(get). SQL backends will override with a single batched query in follow-up commits.

Replaces the inherited N-parallel-get default in SqliteTabularStorage with a single-statement override: single-column PKs use `WHERE pk IN (?,?,...)`, compound PKs use `WHERE (p1,p2) IN ((?,?),...)` — one round-trip regardless of batch size. https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

Delegates to ITabularStorage.getBulk on tabular-backed KV implementations, picking up SQL batched-IN pushdown for free. https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

getPrimaryKeyAsOrderedArray already applies jsToSqlValue to each PK field. The new getBulk overrides were re-applying it on the returned values. Idempotent for current types (strings, numbers, ISO date strings, etc.), so no observable bug today, but it would silently double-serialize any future non-idempotent type. Match the existing _getInternal and _deleteInternal patterns and pass through directly.

…predicate The .filter((r): r is T => r !== undefined) form trips TS when T is a generic that gets resolved to Awaited<T>: the type predicate's target type must be assignable to its parameter's type, and Awaited<T>|undefined isn't reliably narrowable to T in generic position. Switch to an as-cast on the filtered array. Same runtime, types compile.

Match the KV-event pattern established in c8c9860: concrete KV implementations now emit each method's event. KvViaTabularStorage.getBulk already emits as part of the rebase resolution; this finishes the symmetry for FsFolderKvStorage.

Three callsites of the same try { JSON.parse } catch { raw } pattern collapse into a private deserialize() helper. Behavior unchanged; the helper also short-circuits the schema-not-needs-json case so getAll's inlined IIFE goes away.

Match the tabular layer (BaseTabularStorage / SqliteTabularStorage / PostgresTabularStorage), which all return early without emitting on empty input. Listeners using getBulk for cache invalidation or logging no longer see spurious empty-call events from KV stores while seeing none from tabular stores.

SQLite's SQLITE_MAX_VARIABLE_NUMBER is 999 on older builds; Postgres caps bind parameters at 65535. A single getBulk call with many compound-PK inputs would hit either ceiling and fail with a cryptic driver error. Wrap the override so inputs above a safe per-statement threshold split into chunks executed sequentially, each chunk taking the mutex once. The event is emitted once with the original key array, not per chunk.

Two scopes sharing a physical table must not leak rows across kb_id boundaries via getBulk, and the emitted event must carry the original unscoped keys. Both invariants verified against InMemory.</br>

Copilot

Pull request overview

Adds a new plural “get by keys” API across KV and Tabular storage (and thus Vector via inheritance), while reclaiming the getBulk name by renaming the deprecated offset/limit paging method to getOffsetPage. This extends the storage abstraction to support efficient batched reads (notably via SQL IN (...)) and wires the operation into existing telemetry and event systems.

Changes:

Introduces IKvStorage.getBulk(keys) and ITabularStorage.getBulk(keys), plus new getBulk events for both KV and Tabular.
Renames deprecated ITabularStorage.getBulk(offset, limit) to getOffsetPage(offset, limit) across implementations and tests.
Implements batched SQL overrides for getBulk(keys) in Postgres/SQLite; other backends inherit the default Promise.all(get) implementation.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
providers/supabase/src/storage/SupabaseTabularStorage.ts	Renames deprecated offset paging method to `getOffsetPage`.
providers/sqlite/src/storage/SqliteTabularStorage.ts	Adds SQL-batched `getBulk(keys)` with chunking; renames offset paging to `getOffsetPage`.
providers/postgres/src/storage/PostgresTabularStorage.ts	Adds SQL-batched `getBulk(keys)` with chunking; renames offset paging to `getOffsetPage`.
packages/test/src/test/storage-tabular/HuggingFaceTabularStorage.integration.test.ts	Updates tests to call `getOffsetPage`.
packages/test/src/test/storage-tabular/genericTabularStorageTests.ts	Renames offset paging suite + adds new `getBulk(keys)` behavior tests.
packages/test/src/test/storage-kv/genericKvRepositoryTests.ts	Adds `getBulk(keys)` tests including JSON deserialization and event emission.
packages/test/src/test/rag/ScopedStorage.test.ts	Adds `ScopedTabularStorage.getBulk` isolation + event tests.
packages/storage/src/tabular/TelemetryTabularStorage.ts	Adds tracing for `getBulk(keys)` and renames traced offset paging to `getOffsetPage`.
packages/storage/src/tabular/SharedInMemoryTabularStorage.ts	Renames delegated offset paging method to `getOffsetPage`.
packages/storage/src/tabular/ITabularStorage.ts	Adds `getBulk(keys)` API + event type; renames deprecated offset paging to `getOffsetPage`.
packages/storage/src/tabular/InMemoryTabularStorage.ts	Renames deprecated offset paging method to `getOffsetPage`.
packages/storage/src/tabular/HuggingFaceTabularStorage.ts	Renames internal/self calls to `getOffsetPage`.
packages/storage/src/tabular/FsFolderTabularStorage.ts	Renames offset paging to `getOffsetPage` and updates warning message text.
packages/storage/src/tabular/CachedTabularStorage.ts	Renames delegated offset paging method to `getOffsetPage`.
packages/storage/src/tabular/BaseTabularStorage.ts	Adds default `getBulk(keys)` implementation and renames abstract offset paging to `getOffsetPage`.
packages/storage/src/kv/TelemetryKvStorage.ts	Adds tracing wrapper for `getBulk(keys)`.
packages/storage/src/kv/KvViaTabularStorage.ts	Adds `getBulk(keys)` delegating to tabular + shared JSON deserialization helper.
packages/storage/src/kv/KvStorage.ts	Adds abstract `getBulk(keys)` to KV base class.
packages/storage/src/kv/IKvStorage.ts	Adds `getBulk(keys)` API + `getBulk` event type.
packages/storage/src/kv/FsFolderKvStorage.ts	Implements `getBulk(keys)` via parallel `get` calls + event emission.
packages/knowledge-base/src/knowledge-base/ScopedTabularStorage.ts	Adds scoped `getBulk(keys)` that injects `kb_id` and strips it from results.
packages/indexeddb/src/storage/IndexedDbTabularStorage.ts	Renames deprecated offset paging method to `getOffsetPage`.
docs/superpowers/specs/2026-05-10-storage-get-plural-design.md	New design spec documenting API, naming reclaim, implementations, and tests.
docs/superpowers/plans/2026-05-10-storage-get-plural.md	New step-by-step implementation plan and checklist.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    describe("getBulk(keys)", () => {
+      const seed = [
+        { name: "key1", type: "type1", option: "value1", success: true },
+        { name: "key2", type: "type2", option: "value2", success: false },
+        { name: "key3", type: "type3", option: "value3", success: true },


Compound-PK tabular tests already exercised the row-value-tuple SQL form (WHERE (p1,p2) IN ((?,?),...)) on SQLite and Postgres. The single-column branch (WHERE pk IN (?,?,...)) was structurally distinct and had no test coverage. Add an analogous getBulk(keys) block against SearchSchema (single-PK) so the simpler SQL path is exercised against every backend that registers a searchable-repository factory.

…dration Replaces the N+1 hydration patterns in kb.hybridSearch (text-only RRF candidates) and kb.textSearch (every hit) with a single getBulk call that now lives on ITabularStorage (added upstream in #480). Results come back unordered with their primary keys, so callers re-align via a chunk_id -> entity map and preserve the hit ordering produced by the upstream rankers. For DB-backed storages, this turns up to ~50 round-trips per query (topK=10, candidatePoolMultiplier=5) into one — addresses the N+1 concerns Copilot flagged on KnowledgeBase.ts:582 and :631.

* feat(knowledge-base): hybrid search via RRF over BM25F text index Replaces the naive keyword-matching hybridSearch baked into each IVectorStorage implementation with a real BM25F text index that lives alongside the vector index, with fusion now performed at the KnowledgeBase layer via Reciprocal Rank Fusion. - Adds @workglow/storage/text: ITextIndex interface, pluggable Tokenizer with default English stopwords, and an in-memory BM25F index with per-field weights, JSON-serialisable state, and idempotent upserts. - KnowledgeBase gains installTextIndex / getTextIndex / reindexText, auto-writes chunks to the text index on upsert, cascades deletes, and fuses similaritySearch + textIndex.search through RRF in hybridSearch. - createKnowledgeBase accepts a textIndex option. - Removes IVectorStorage.hybridSearch, HybridSearchOptions, and the per-backend keyword-match implementations across InMemory, IndexedDB, Telemetry, Scoped, Sqlite, SqliteAi, and Postgres vector storages. Native Postgres FTS can be reintroduced later as an ITextIndex adapter. - ChunkRetrievalTask error message updated for the new "install a text index" semantics; scoreThreshold is no longer forwarded to hybrid (RRF scores are not comparable to cosine scores). - New tests: BM25Index unit tests covering ranking, BM25F field weights, length normalisation, upsert/remove, removeByDocument cascades, JSON round-trips, and tokenizer behaviour; KB integration test covering auto-indexing, cascade deletes, RRF promotion of text-strong matches, reindexText, and the no-index error path. * fix(knowledge-base,storage): address Copilot review on PR #478 - BM25Index.fromJSON now restores k1/b (previously silently kept the constructor's defaults, so a round-trip with mismatched params could change scoring). - BM25Index.remove uses a per-chunk reverse posting index instead of scanning the whole vocabulary. New chunkPostings map is populated on add and rebuilt by fromJSON. - BM25Index class JSDoc corrected: scoring config (k1, b, fieldWeights) IS serialised; only the tokenizer is not. - KnowledgeBase.upsertChunk / upsertChunksBulk now drop stale postings when an updated chunk has no indexable text fields. - KnowledgeBase.put / putBulk delegate to upsertChunk / upsertChunksBulk so the text index stays in sync via either entry point. - KnowledgeBase.hybridSearch uses Math.ceil for the candidate pool size to keep topK an integer when candidatePoolMultiplier is fractional. - New regression tests for k1/b round-trip, post-fromJSON remove, empty-text upserts, put/putBulk indexing, and fractional multipliers. * fix(knowledge-base,storage,ai): address self-review follow-ups - BM25Index: maintain a per-term document frequency cache (termDf) so search() does O(1) df lookup instead of walking every posting list per query term. Updated incrementally on add/remove and rebuilt on fromJSON; the old distinctChunkCount helper is gone. - KnowledgeBase.reindexText: now atomic with respect to async failures. Reads chunkStorage and stages tokenisation before mutating the index; if getAll() throws, the existing index is untouched. - KnowledgeBase.hybridSearch: default candidatePoolMultiplier bumped 2 -> 5 to give RRF enough overlap between rankers (industry norm; with 2x and topK=10 the fusion was degenerating toward "OR of top-K" for queries with low ranker agreement). - ChunkRetrievalTask: schema descriptions for method, scoreThreshold, and scores updated to disambiguate similarity (cosine in [0,1]) from hybrid (RRF fusion scores, not comparable). scoreThreshold still silently ignored for hybrid; schema now says so. - Tests: switched KB names from Date.now() to uuid4() to remove same-millisecond collision risk; added regression tests for atomic reindexText (mocks getAll to throw, asserts index unchanged) and for RRF score shape (asserts hybrid scores are small positives, not cosine-range). * fix(knowledge-base): address Copilot review on hybridSearch + KB tests - hybridSearch falls back to similaritySearch when textQuery is empty or whitespace-only. Previously it ran the full RRF path with an empty text-result list, returning vector-only RRF-shaped scores (~[0,0.05]) in the score field — surprising to callers expecting cosine scores in [0,1] when the query has no text signal. - Clamp rrfK to a non-negative value at use site (Math.max(0, rrfK)). rrfK <= -1 previously produced Infinity / sign-flipped denominators for some ranks. - Clamp candidatePoolMultiplier to >= 1 to prevent zero / negative pool sizes. - KnowledgeBaseHybridSearch.test.ts: pass register: false to every createKnowledgeBase call so the global KB registry doesn't accumulate per-test entries. Drop the no-op afterEach (its comment claimed in-memory KBs need no teardown, but said nothing about the registry cost; with register: false there's nothing to do). - New regression tests: empty/whitespace textQuery returns cosine scores; rrfK=-10 produces finite, positive scores. Not fixed in this PR: the N+1 hydration concern (kb.hybridSearch / textSearch issue one chunkStorage.get per chunkId). Fixing it requires adding a multi-id batch fetch to ITabularStorage, which is broader than this PR's scope. Tracked as a follow-up. * feat(knowledge-base): tag search results with scoreType discriminator ChunkSearchResult now carries an optional `scoreType: "cosine" | "bm25" | "rrf"` field so callers (typically UIs) can render scores appropriately without having to remember which search method produced them — the three scorers live on incompatible scales: - cosine: [-1,1], typically [0,1] for text embeddings. Absolute. - bm25: [0,inf). Absolute but corpus-dependent. - rrf: bounded above by 2/(rrfK+1) (~0.033 with defaults). Rank-based, not absolute, not comparable across queries. KB.similaritySearch tags "cosine", KB.textSearch tags "bm25", KB.hybridSearch tags "rrf" — except the empty-textQuery fallback that routes through similaritySearch, which correctly surfaces "cosine". ChunkRetrievalTask exposes a top-level `scoreType` field in its output (single string, since every result in one call shares the same type). Field is optional on ChunkSearchResult to keep the type non-breaking for storage-only callers that construct results without going through the KB layer. * fix(ai): scoreType default reflects hybrid's empty-query cosine fallback ChunkRetrievalTask used to default scoreType to "rrf" whenever method was "hybrid", but kb.hybridSearch routes through similaritySearch when textQuery is empty/whitespace. On a zero-result hybrid call with an empty query, the task therefore reported "rrf" even though the KB would have returned cosine scores had any matched. Derive the default from (method, queryText) instead of method alone: hybrid + empty/whitespace query => "cosine", otherwise method-derived. With at least one result the existing fallback to results[0].scoreType already handled this; the fix only changes the empty-result branch. Not adding a regression test: triggering this path requires an empty corpus AND a string query (model is required for string queries, which would force a mocked embedding service for unit coverage). The KB-level fallback to cosine on empty textQuery is already covered by KnowledgeBaseHybridSearch.test.ts. * perf(knowledge-base): use ITabularStorage.getBulk for plural chunk hydration Replaces the N+1 hydration patterns in kb.hybridSearch (text-only RRF candidates) and kb.textSearch (every hit) with a single getBulk call that now lives on ITabularStorage (added upstream in #480). Results come back unordered with their primary keys, so callers re-align via a chunk_id -> entity map and preserve the hit ordering produced by the upstream rankers. For DB-backed storages, this turns up to ~50 round-trips per query (topK=10, candidatePoolMultiplier=5) into one — addresses the N+1 concerns Copilot flagged on KnowledgeBase.ts:582 and :631. * fix(knowledge-base): address review-agent findings on PR #478 - reindexText is now truly atomic: snapshots `index.toJSON()` before mutating and restores via `fromJSON()` on any failure during `clear`/`add`/`getAll`. Previous "atomic w.r.t. async failures" doc promise covered only the getAll path; a sync throw mid-loop could empty the index. - Auto-index writes in upsertChunk / upsertChunksBulk are wrapped in try/catch with a console.warn. The chunk is already durable in chunkStorage when the index write runs, so an index error must not fail the upsert. Recovery is via kb.reindexText(). - textSearch gains a candidatePoolMultiplier option (default 2, matching prior behavior) for symmetry with hybridSearch when filtering selectively. - Defensive guards: rrfK, candidatePoolMultiplier, and vectorWeight in hybridSearch now reject non-finite values (NaN/Infinity) and fall back to defaults instead of producing Infinity/NaN scores. - New tests: - BM25Index.test.ts: removeByDocument cascade after fromJSON round-trip (catches docToChunks reconstruction gaps). - KnowledgeBaseHybridSearch.test.ts: reindexText rollback on synchronous add failure via a stub ITextIndex; upsertChunk warn + chunk-still-stored when index write throws. Changeset: documents the breaking removal of HybridSearchOptions / hybridSearch from @workglow/storage and points callers at the @workglow/knowledge-base equivalent (with the noted shape change — no scoreThreshold for RRF). --------- Co-authored-by: Claude <noreply@anthropic.com>

…dTabularStorage Previous commit replaced inner.getBulk delegation with per-key fan-out via inner.get, citing a leak risk from SQL backends building WHERE from PK columns only. That diagnosis is correct in shape but the chosen mitigation doesn't actually fix it: inner.get on SQL backends *also* builds WHERE from PK columns only (via getPrimaryKeyAsOrderedArray), so per-key fan-out would silently drop kb_id whenever kb_id is not in the inner PK -- the same failure mode it claimed to fix. The actual contract is "inner schema's PK must include kb_id". Every production wrapping uses SharedDocumentPrimaryKey or SharedChunkPrimaryKey (both ["kb_id", ...]); the libs codebase has no internal construction sites of ScopedTabularStorage at all -- it's exported for external consumers. Restore inner.getBulk(scopedKeys) (the original PR #480 IN-tuple WHERE optimization) and enforce the kb_id-in-PK contract at construction. Any future misuse fails loudly at the constructor rather than leaking rows at query time. Storages that don't expose primaryKeyNames (third-party impls that don't extend BaseTabularStorage) get a console.warn so they keep working. Tests: - new "constructor contract" describe block covers the throw, the happy path, and the warn-instead-of-throw fallback; - new "delegates to inner.getBulk in one round trip" test inside the existing "getBulk isolation" block locks in the optimization (a regression to fan-out is caught by the inner.getBulk vs inner.get call-count assertion). https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

… Postgres-native hybrid search (#486) * fix(knowledge-base): cross-KB row leak in ScopedTabularStorage.getBulk The previous implementation delegated to `inner.getBulk(scopedKeys)`, but SQL backends build their IN-tuple WHERE from primary-key columns only via `getPrimaryKeyAsOrderedArray`. When the inner storage's PK does not include `kb_id` (per-KB tables wrapped for synthetic scoping, or any future backend that derives keys from PK columns only), our injected `kb_id` is silently dropped from the predicate, allowing rows from other KBs to leak through. Fan out per-key `get()` calls instead — each call goes through the same code path as `ScopedTabularStorage.get` and `delete`, both of which use `deleteSearch` / `inner.get` with the full `{ ...key, kb_id }` predicate that every backend correctly translates into `WHERE kb_id = ? AND ...`. Correctness over throughput. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * test(knowledge-base): SQL-backend cross-KB isolation tests for ScopedTabularStorage.getBulk The existing in-memory test in `ScopedStorage.test.ts` passes regardless of the fix because `InMemoryTabularStorage.getBulk` does not use `getPrimaryKeyAsOrderedArray` — it iterates keys with a direct lookup, so the in-memory path never exercised the SQL backends' IN-tuple WHERE construction. Mirror the isolation assertions against real SQL backends (PGlite-backed Postgres + in-memory SQLite) so a regression in `ScopedTabularStorage.getBulk` would surface in CI. The Postgres backend interprets `doc_id: { type: "string", "x-auto-generated": true }` as a `UUID` column, so the Postgres test uses real UUIDs for the colliding keys. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * feat(storage): optional beginRebuild/commitRebuild/abortRebuild on ITextIndex; search may return Promise Backends with server-side state — e.g. a Postgres-side `tsvector` table backing FTS — cannot round-trip through the in-memory toJSON/fromJSON snapshot path that `KnowledgeBase.reindexText` relies on for atomic rebuild. Expose three optional lifecycle hooks (`beginRebuild`, `commitRebuild`, `abortRebuild`) so such backends can wrap the rebuild in a database transaction; the in-memory `BM25Index` keeps working unchanged by simply omitting the hooks. Relax the return types on the index-mutation methods (`add`, `remove`, `removeByDocument`, `clear`, `size`) and `search` to `T | Promise<T>` so async backends can implement the interface without forcing in-memory backends into a Promise wrapper. The caller (`KnowledgeBase`) awaits either way. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * feat(knowledge-base): KnowledgeBase.reindexText uses ITextIndex rebuild hooks when available `reindexText` previously snapshotted the index via `toJSON()` and rolled back via `fromJSON(snapshot)` on error. That only works for backends whose state lives in memory; a server-side backend (e.g. a Postgres FTS index) cannot meaningfully round-trip through a JSON snapshot — the "snapshot" is durable on the database side. When the installed index exposes `beginRebuild`/`commitRebuild`/ `abortRebuild`, wrap the rebuild in those hooks instead. The hooks let the backend run the rebuild inside a database transaction, with abort mapped to ROLLBACK. Indices without the hooks (`BM25Index`) fall back to the existing JSON snapshot path with identical behaviour. Also await `index.search(...)` in `textSearch`/`hybridSearch` and handle fire-and-forget Promises in `syncTextIndexForChunk` so async backends can implement `ITextIndex` directly. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * feat(postgres): PostgresFtsTextIndex restoring native hybrid search via to_tsvector/plainto_tsquery PR #478 removed `IVectorStorage.hybridSearch` in favour of a KB-layer Reciprocal Rank Fusion over an in-memory `BM25Index`. That strands production Postgres KBs on a path where `reindexText` calls `chunkStorage.getAll()` — unbounded over server-side rows — and rebuilds postings in process memory. Restore the Postgres-native path as a new `ITextIndex` implementation backed by a single side table per KB indexed by a GIN `tsvector`: CREATE TABLE <table> (chunk_id TEXT PRIMARY KEY, doc_id TEXT NOT NULL, tsv TSVECTOR NOT NULL); CREATE INDEX <table>_tsv_idx ON <table> USING GIN (tsv); CREATE INDEX <table>_doc_idx ON <table> (doc_id); Scoring is `ts_rank_cd(tsv, plainto_tsquery(...))` — unbounded above and non-negative, so RRF fusion in `KnowledgeBase.hybridSearch` works without normalisation. Implements the optional `beginRebuild`/`commitRebuild`/`abortRebuild` hooks on `ITextIndex` so `KnowledgeBase.reindexText` wraps the rebuild in a real BEGIN/COMMIT — abort routes to ROLLBACK. State is durable server-side; `toJSON`/`fromJSON` are intentional no-ops. Exports the index from the new `@workglow/postgres/text` entry point and documents the setup pattern in the provider README. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * docs: changeset for ITextIndex async + PostgresFtsTextIndex - @workglow/storage minor: ITextIndex.search may now return a Promise; new optional beginRebuild/commitRebuild/abortRebuild lifecycle hooks for backends with server-side state. - @workglow/knowledge-base minor: reindexText uses the new hooks when available, falls back to toJSON/fromJSON snapshot rollback otherwise. - @workglow/postgres minor: new PostgresFtsTextIndex over a side `tsvector` GIN index for Postgres-native hybrid search. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * refactor(knowledge-base): restore inner.getBulk optimization in ScopedTabularStorage Previous commit replaced inner.getBulk delegation with per-key fan-out via inner.get, citing a leak risk from SQL backends building WHERE from PK columns only. That diagnosis is correct in shape but the chosen mitigation doesn't actually fix it: inner.get on SQL backends *also* builds WHERE from PK columns only (via getPrimaryKeyAsOrderedArray), so per-key fan-out would silently drop kb_id whenever kb_id is not in the inner PK -- the same failure mode it claimed to fix. The actual contract is "inner schema's PK must include kb_id". Every production wrapping uses SharedDocumentPrimaryKey or SharedChunkPrimaryKey (both ["kb_id", ...]); the libs codebase has no internal construction sites of ScopedTabularStorage at all -- it's exported for external consumers. Restore inner.getBulk(scopedKeys) (the original PR #480 IN-tuple WHERE optimization) and enforce the kb_id-in-PK contract at construction. Any future misuse fails loudly at the constructor rather than leaking rows at query time. Storages that don't expose primaryKeyNames (third-party impls that don't extend BaseTabularStorage) get a console.warn so they keep working. Tests: - new "constructor contract" describe block covers the throw, the happy path, and the warn-instead-of-throw fallback; - new "delegates to inner.getBulk in one round trip" test inside the existing "getBulk isolation" block locks in the optimization (a regression to fan-out is caught by the inner.getBulk vs inner.get call-count assertion). https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * fix(postgres): tighten PostgresFtsTextIndex.fromJSON + document identifier constraints Addresses Copilot review feedback on PR #486: - fromJSON now validates state.table === this.table (or symmetrically removes table from toJSON if it can't enforce equality on a static method) so a snapshot from a different table can't silently round-trip through a mismatched instance. - README documents the strict alphanumeric+underscore identifier whitelist enforced on the `table` constructor argument, so callers know schema- qualified names and dashes are rejected. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * fix(postgres,knowledge-base): bind rebuildClient.query + correct stale docs/comments Addresses 7 Copilot review comments from the round-2 review: - PostgresFtsTextIndex.exec: bind rebuildClient.query during beginRebuild so pg.PoolClient.query receives the correct `this`. The previous unbound reference would throw on real Postgres connections (PGlite happens to be permissive enough not to catch this in the integration test). - README: reword the "no in-memory state at reindex time" claim — reindex still iterates all chunks via chunkStorage.getAll(). The actual benefits are server-side durable index, no JS-heap BM25 state, and transactional rebuild. Identifier whitelist note retained. - Changeset: mirror the README rewording. Add a note documenting that ScopedTabularStorage's constructor now throws when the inner PK omits kb_id — a user-visible behavior change for external wrappers. - Sqlite + Postgres integration tests: move `await Sqlite.init()` from the unawaited async describe() callback into beforeAll() so init completes before any test runs (was a latent flake source). - Sqlite + Postgres test comments: update the stale "fans out per-key get()" description to match the restored inner.getBulk + constructor-enforced kb_id-in-PK approach. - PR description updated via API to align with the implemented strategy. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg --------- Co-authored-by: Claude <noreply@anthropic.com>

sroussey force-pushed the claude/storage-get-plural-YPPMp branch from 275d9b7 to b45403d Compare May 11, 2026 02:30

claude added 16 commits May 11, 2026 04:05

docs: add design for storage getBulk plural-get

77f1b50

Design spec for adding getBulk(keys) to IKvStorage / ITabularStorage, reclaiming the name from the deprecated offset/limit method on tabular (renamed to getOffsetPage). Vector inherits via ITabularStorage.

docs: add implementation plan for storage getBulk plural-get

269690d

refactor(storage): rename deprecated getBulk(offset,limit) to getOffs…

ce6caa1

…etPage Frees the getBulk name for the upcoming plural-get-by-keys method. @deprecated JSDoc preserved; migration target remains getPage. https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

test(storage): rename getBulk(offset,limit) test sites to getOffsetPage

f6dc4f6

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

refactor(supabase): rename deprecated getBulk(offset,limit) to getOff…

9f753b1

…setPage Missed in the initial rename sweep. Brings SupabaseTabularStorage in line with the renamed ITabularStorage interface.

feat(storage): add ITabularStorage.getBulk(keys) plural-get

cafd047

Default implementation in BaseTabularStorage does Promise.all(get). SQL backends will override with a single batched query in follow-up commits.

perf(postgres): batched getBulk(keys) using WHERE (pk) IN (...)

202d690

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

feat(storage): TelemetryTabularStorage forwards getBulk(keys)

cc7b0dd

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

feat(knowledge-base): ScopedTabularStorage scopes getBulk(keys) by kb_id

01e05ba

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

feat(storage): add IKvStorage.getBulk(keys) plural-get

2a5ab8c

Delegates to ITabularStorage.getBulk on tabular-backed KV implementations, picking up SQL batched-IN pushdown for free. https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

feat(storage): FsFolderKvStorage implements getBulk(keys)

54e9dac

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

feat(storage): TelemetryKvStorage forwards getBulk(keys)

8209988

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m

sroussey force-pushed the claude/storage-get-plural-YPPMp branch from b45403d to 04255b6 Compare May 11, 2026 04:07

claude added 4 commits May 11, 2026 04:33

test(knowledge-base): cover ScopedTabularStorage.getBulk isolation

7e3381e

Two scopes sharing a physical table must not leak rows across kb_id boundaries via getBulk, and the emitted event must carry the original unscoped keys. Both invariants verified against InMemory.</br>

sroussey self-assigned this May 11, 2026

sroussey requested a review from Copilot May 11, 2026 04:35

Copilot started reviewing on behalf of sroussey May 11, 2026 04:36 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

sroussey merged commit 99d08a4 into main May 11, 2026
22 checks passed

sroussey deleted the claude/storage-get-plural-YPPMp branch May 11, 2026 05:10

sroussey mentioned this pull request May 12, 2026

fix(knowledge-base,storage,postgres): cross-KB getBulk leak + restore Postgres-native hybrid search #486

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: add design for storage getBulk plural-get#480

docs: add design for storage getBulk plural-get#480
sroussey merged 21 commits into
mainfrom
claude/storage-get-plural-YPPMp

sroussey commented May 11, 2026

Uh oh!

pkg-pr-new Bot commented May 11, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 11, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sroussey commented May 11, 2026

Uh oh!

pkg-pr-new Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pkg-pr-new Bot commented May 11, 2026 •

edited

Loading

github-actions Bot commented May 11, 2026 •

edited

Loading