docs: add design for storage getBulk plural-get#480

Merged: sroussey merged 21 commits into main from claude/storage-get-plural-YPPMp on May 11, 2026

Conversation

@sroussey
Collaborator

Design spec for adding getBulk(keys) to IKvStorage / ITabularStorage,
reclaiming the name from the deprecated offset/limit method on tabular
(renamed to getOffsetPage). Vector inherits via ITabularStorage.

@pkg-pr-new

pkg-pr-new Bot commented May 11, 2026

@workglow/cli

npm i https://pkg.pr.new/@workglow/cli@480

@workglow/ai

npm i https://pkg.pr.new/@workglow/ai@480

@workglow/job-queue

npm i https://pkg.pr.new/@workglow/job-queue@480

@workglow/knowledge-base

npm i https://pkg.pr.new/@workglow/knowledge-base@480

@workglow/storage

npm i https://pkg.pr.new/@workglow/storage@480

@workglow/task-graph

npm i https://pkg.pr.new/@workglow/task-graph@480

@workglow/tasks

npm i https://pkg.pr.new/@workglow/tasks@480

@workglow/util

npm i https://pkg.pr.new/@workglow/util@480

workglow

npm i https://pkg.pr.new/workglow@480

commit: 99d438d

@sroussey sroussey force-pushed the claude/storage-get-plural-YPPMp branch from 275d9b7 to b45403d Compare May 11, 2026 02:30
@github-actions

github-actions Bot commented May 11, 2026

Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 61.89% | 20164 / 32579 |
| 🔵 | Statements | 61.74% | 20872 / 33802 |
| 🔵 | Functions | 64.53% | 3908 / 6056 |
| 🔵 | Branches | 51.08% | 9716 / 19020 |

File Coverage

Changed Files

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
| --- | --- | --- | --- | --- | --- |
| packages/storage/src/kv/IKvStorage.ts | 100% | 100% | 100% | 100% | |
| packages/storage/src/kv/KvStorage.ts | 40% | 100% | 14.28% | 40% | 104-154 |
| packages/storage/src/kv/KvViaTabularStorage.ts | 40% | 57.14% | 25% | 41.86% | 58-69, 88-94, 122-173 |
| packages/storage/src/tabular/BaseTabularStorage.ts | 18.63% | 7.5% | 10.93% | 20.52% | 78-117, 212, 215, 234, 237-239, 244, 258-265, 277-278, 284-287, 291-297, 307-308, 312, 317-957, 986-987, 990-991, 1025-1078, 1095-1201 |
| packages/storage/src/tabular/FsFolderTabularStorage.ts | 16.76% | 6.45% | 12.19% | 17.9% | 45-77, 137-170, 185-206, 216-223, 237-238, 268-394, 416-505 |
| packages/storage/src/tabular/ITabularStorage.ts | 0% | 0% | 0% | 0% | 207-212 |
Generated in workflow #2160 for commit 99d438d by the Vitest Coverage Report Action

claude added 16 commits May 11, 2026 04:05
Design spec for adding getBulk(keys) to IKvStorage / ITabularStorage,
reclaiming the name from the deprecated offset/limit method on tabular
(renamed to getOffsetPage). Vector inherits via ITabularStorage.
…etPage

Frees the getBulk name for the upcoming plural-get-by-keys method.
@deprecated JSDoc preserved; migration target remains getPage.

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m
…setPage

Missed in the initial rename sweep. Brings SupabaseTabularStorage in
line with the renamed ITabularStorage interface.
Default implementation in BaseTabularStorage does Promise.all(get).
SQL backends will override with a single batched query in follow-up commits.
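
A minimal sketch of what a `Promise.all(get)` default could look like, assuming a storage that already exposes a single-key `get`. The `SingleGet` interface, the `Key`/`Row` shapes, and the in-memory stand-in are hypothetical and are not taken from BaseTabularStorage:

```typescript
// Hypothetical key and row shapes, used only for this sketch.
type Key = { id: string };
type Row = { id: string; value: number };

// Minimal stand-in for the single-key read a storage already provides.
interface SingleGet {
  get(key: Key): Promise<Row | undefined>;
}

// Default plural get: one get() per key, run in parallel, misses dropped.
async function defaultGetBulk(storage: SingleGet, keys: readonly Key[]): Promise<Row[]> {
  if (keys.length === 0) return []; // nothing to fetch
  const rows = await Promise.all(keys.map((key) => storage.get(key)));
  return rows.filter((row): row is Row => row !== undefined);
}

// In-memory stand-in so the sketch can run without a database.
const memoryTable = new Map<string, Row>([
  ["a", { id: "a", value: 1 }],
  ["b", { id: "b", value: 2 }],
]);
const memoryStorage: SingleGet = {
  get: async (key) => memoryTable.get(key.id),
};
```

This costs one `get` per key, which is why a backend with a real query language would want to override it with a single batched statement.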
Replaces the inherited N-parallel-get default in SqliteTabularStorage
with a single-statement override: single-column PKs use
`WHERE pk IN (?,?,...)`, compound PKs use `WHERE (p1,p2) IN ((?,?),...)`
— one round-trip regardless of batch size.

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m
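
The two WHERE shapes described above can be sketched as a small statement builder. This is an illustration only: `buildGetBulkSql` is a hypothetical helper, it assumes table and column names are already validated identifiers, and it uses `?` placeholders as in the commit message (a Postgres driver would typically use numbered placeholders instead).

```typescript
type SqlValue = string | number;

// Builds one SELECT for a plural get. Each key is an ordered array of PK values.
function buildGetBulkSql(
  table: string,
  pkColumns: readonly string[],
  keys: readonly (readonly SqlValue[])[]
): { sql: string; params: SqlValue[] } {
  if (pkColumns.length === 0 || keys.length === 0) {
    throw new Error("need at least one PK column and one key");
  }
  const params: SqlValue[] = [];
  for (const key of keys) {
    if (key.length !== pkColumns.length) throw new Error("key does not match PK shape");
    for (const value of key) params.push(value);
  }
  // One "(?,?)" group per key for compound PKs, one "?" per key otherwise.
  const tuple = `(${pkColumns.map(() => "?").join(",")})`;
  const where =
    pkColumns.length === 1
      ? `${pkColumns[0]} IN (${keys.map(() => "?").join(",")})`
      : `(${pkColumns.join(",")}) IN (${keys.map(() => tuple).join(",")})`;
  return { sql: `SELECT * FROM ${table} WHERE ${where}`, params };
}
```

Either form sends a single statement, so the number of round-trips stays at one however many keys are passed.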
Delegates to ITabularStorage.getBulk on tabular-backed KV implementations,
picking up SQL batched-IN pushdown for free.

https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m
getPrimaryKeyAsOrderedArray already applies jsToSqlValue to each PK
field. The new getBulk overrides were re-applying it on the returned
values. Idempotent for current types (strings, numbers, ISO date
strings, etc.), so no observable bug today, but it would silently
double-serialize any future non-idempotent type. Match the existing
_getInternal and _deleteInternal patterns and pass through directly.
…predicate

The .filter((r): r is T => r !== undefined) form trips TS when T is a
generic that gets resolved to Awaited<T>: the type predicate's target
type must be assignable to its parameter's type, and Awaited<T>|undefined
isn't reliably narrowable to T in generic position. Switch to an as-cast
on the filtered array. Same runtime, types compile.
Match the KV-event pattern established in c8c9860: concrete KV
implementations now emit each method's event. KvViaTabularStorage.getBulk
already emits as part of the rebase resolution; this finishes the
symmetry for FsFolderKvStorage.
@sroussey sroussey force-pushed the claude/storage-get-plural-YPPMp branch from b45403d to 04255b6 Compare May 11, 2026 04:07
claude added 4 commits May 11, 2026 04:33
Three callsites of the same try { JSON.parse } catch { raw } pattern
collapse into a private deserialize() helper. Behavior unchanged; the
helper also short-circuits the schema-not-needs-json case so getAll's
inlined IIFE goes away.
Match the tabular layer (BaseTabularStorage / SqliteTabularStorage /
PostgresTabularStorage), which all return early without emitting on
empty input. Listeners using getBulk for cache invalidation or logging
no longer see spurious empty-call events from KV stores while seeing
none from tabular stores.
SQLite's SQLITE_MAX_VARIABLE_NUMBER is 999 on older builds; Postgres
caps bind parameters at 65535. A single getBulk call with many compound-PK
inputs would hit either ceiling and fail with a cryptic driver error.

Wrap the override so inputs above a safe per-statement threshold split
into chunks executed sequentially, each chunk taking the mutex once.
The event is emitted once with the original key array, not per chunk.
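
A rough sketch of the chunking idea, assuming the per-statement budget is expressed as a maximum number of bind parameters. The names `chunkKeys`, `getBulkChunked`, `runChunk`, and `emit` are hypothetical, and the actual threshold used by the overrides is not stated here:

```typescript
// Splits keys so that each chunk binds at most maxParams parameters.
function chunkKeys<K>(keys: readonly K[], pkColumnCount: number, maxParams: number): K[][] {
  const perChunk = Math.max(1, Math.floor(maxParams / pkColumnCount));
  const chunks: K[][] = [];
  for (let start = 0; start < keys.length; start += perChunk) {
    chunks.push(keys.slice(start, start + perChunk));
  }
  return chunks;
}

// Runs chunks one after another, then emits a single event with the original keys.
async function getBulkChunked<K, R>(
  keys: readonly K[],
  pkColumnCount: number,
  maxParams: number,
  runChunk: (chunk: readonly K[]) => Promise<R[]>,
  emit: (keys: readonly K[]) => void
): Promise<R[]> {
  if (keys.length === 0) return []; // no statement, no event
  const rows: R[] = [];
  for (const chunk of chunkKeys(keys, pkColumnCount, maxParams)) {
    const chunkRows = await runChunk(chunk); // sequential on purpose
    for (const row of chunkRows) rows.push(row);
  }
  emit(keys);
  return rows;
}
```

With a two-column primary key and a budget of 999 parameters, each statement carries at most 499 keys, so 2500 keys split into six statements.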
Two scopes sharing a physical table must not leak rows across kb_id
boundaries via getBulk, and the emitted event must carry the original
unscoped keys. Both invariants verified against InMemory.
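
The two invariants can be illustrated with a small wrapper. This is a sketch under assumptions, not the ScopedTabularStorage source: `BulkReader`, `ScopedReader`, and the `onGetBulk` callback are hypothetical names, and the inner storage is an in-memory filter.

```typescript
type ScopedKey = Record<string, string>;
type ScopedRow = Record<string, unknown>;

interface BulkReader {
  getBulk(keys: readonly ScopedKey[]): Promise<ScopedRow[]>;
}

class ScopedReader {
  private readonly inner: BulkReader;
  private readonly kbId: string;
  private readonly onGetBulk: (keys: readonly ScopedKey[]) => void;

  constructor(inner: BulkReader, kbId: string, onGetBulk: (keys: readonly ScopedKey[]) => void) {
    this.inner = inner;
    this.kbId = kbId;
    this.onGetBulk = onGetBulk;
  }

  async getBulk(keys: readonly ScopedKey[]): Promise<ScopedRow[]> {
    if (keys.length === 0) return [];
    // Inject the scope column so the inner lookup cannot match another KB's rows.
    const scopedKeys = keys.map((key) => ({ ...key, kb_id: this.kbId }));
    const rows = await this.inner.getBulk(scopedKeys);
    this.onGetBulk(keys); // event carries the original, unscoped keys
    // Strip the scope column before returning rows to the caller.
    return rows.map((row) => {
      const copy: ScopedRow = { ...row };
      delete copy["kb_id"];
      return copy;
    });
  }
}

// Shared physical table holding rows for two knowledge bases with the same doc_id.
const sharedRows: ScopedRow[] = [
  { kb_id: "kbA", doc_id: "d1", text: "from A" },
  { kb_id: "kbB", doc_id: "d1", text: "from B" },
];
const sharedInner: BulkReader = {
  async getBulk(keys) {
    return sharedRows.filter((row) =>
      keys.some((key) => Object.keys(key).every((col) => row[col] === key[col]))
    );
  },
};
```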
Contributor

Copilot AI left a comment


Pull request overview

Adds a new plural “get by keys” API across KV and Tabular storage (and thus Vector via inheritance), while reclaiming the getBulk name by renaming the deprecated offset/limit paging method to getOffsetPage. This extends the storage abstraction to support efficient batched reads (notably via SQL IN (...)) and wires the operation into existing telemetry and event systems.

Changes:

  • Introduces IKvStorage.getBulk(keys) and ITabularStorage.getBulk(keys), plus new getBulk events for both KV and Tabular.
  • Renames deprecated ITabularStorage.getBulk(offset, limit) to getOffsetPage(offset, limit) across implementations and tests.
  • Implements batched SQL overrides for getBulk(keys) in Postgres/SQLite; other backends inherit the default Promise.all(get) implementation.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| providers/supabase/src/storage/SupabaseTabularStorage.ts | Renames deprecated offset paging method to getOffsetPage. |
| providers/sqlite/src/storage/SqliteTabularStorage.ts | Adds SQL-batched getBulk(keys) with chunking; renames offset paging to getOffsetPage. |
| providers/postgres/src/storage/PostgresTabularStorage.ts | Adds SQL-batched getBulk(keys) with chunking; renames offset paging to getOffsetPage. |
| packages/test/src/test/storage-tabular/HuggingFaceTabularStorage.integration.test.ts | Updates tests to call getOffsetPage. |
| packages/test/src/test/storage-tabular/genericTabularStorageTests.ts | Renames offset paging suite + adds new getBulk(keys) behavior tests. |
| packages/test/src/test/storage-kv/genericKvRepositoryTests.ts | Adds getBulk(keys) tests including JSON deserialization and event emission. |
| packages/test/src/test/rag/ScopedStorage.test.ts | Adds ScopedTabularStorage.getBulk isolation + event tests. |
| packages/storage/src/tabular/TelemetryTabularStorage.ts | Adds tracing for getBulk(keys) and renames traced offset paging to getOffsetPage. |
| packages/storage/src/tabular/SharedInMemoryTabularStorage.ts | Renames delegated offset paging method to getOffsetPage. |
| packages/storage/src/tabular/ITabularStorage.ts | Adds getBulk(keys) API + event type; renames deprecated offset paging to getOffsetPage. |
| packages/storage/src/tabular/InMemoryTabularStorage.ts | Renames deprecated offset paging method to getOffsetPage. |
| packages/storage/src/tabular/HuggingFaceTabularStorage.ts | Renames internal/self calls to getOffsetPage. |
| packages/storage/src/tabular/FsFolderTabularStorage.ts | Renames offset paging to getOffsetPage and updates warning message text. |
| packages/storage/src/tabular/CachedTabularStorage.ts | Renames delegated offset paging method to getOffsetPage. |
| packages/storage/src/tabular/BaseTabularStorage.ts | Adds default getBulk(keys) implementation and renames abstract offset paging to getOffsetPage. |
| packages/storage/src/kv/TelemetryKvStorage.ts | Adds tracing wrapper for getBulk(keys). |
| packages/storage/src/kv/KvViaTabularStorage.ts | Adds getBulk(keys) delegating to tabular + shared JSON deserialization helper. |
| packages/storage/src/kv/KvStorage.ts | Adds abstract getBulk(keys) to KV base class. |
| packages/storage/src/kv/IKvStorage.ts | Adds getBulk(keys) API + getBulk event type. |
| packages/storage/src/kv/FsFolderKvStorage.ts | Implements getBulk(keys) via parallel get calls + event emission. |
| packages/knowledge-base/src/knowledge-base/ScopedTabularStorage.ts | Adds scoped getBulk(keys) that injects kb_id and strips it from results. |
| packages/indexeddb/src/storage/IndexedDbTabularStorage.ts | Renames deprecated offset paging method to getOffsetPage. |
| docs/superpowers/specs/2026-05-10-storage-get-plural-design.md | New design spec documenting API, naming reclaim, implementations, and tests. |
| docs/superpowers/plans/2026-05-10-storage-get-plural.md | New step-by-step implementation plan and checklist. |

Comment on lines +2425 to +2429
    describe("getBulk(keys)", () => {
      const seed = [
        { name: "key1", type: "type1", option: "value1", success: true },
        { name: "key2", type: "type2", option: "value2", success: false },
        { name: "key3", type: "type3", option: "value3", success: true },
Compound-PK tabular tests already exercised the row-value-tuple SQL
form (WHERE (p1,p2) IN ((?,?),...)) on SQLite and Postgres. The
single-column branch (WHERE pk IN (?,?,...)) was structurally distinct
and had no test coverage. Add an analogous getBulk(keys) block against
SearchSchema (single-PK) so the simpler SQL path is exercised against
every backend that registers a searchable-repository factory.
@sroussey sroussey merged commit 99d08a4 into main May 11, 2026
22 checks passed
@sroussey sroussey deleted the claude/storage-get-plural-YPPMp branch May 11, 2026 05:10
sroussey pushed a commit that referenced this pull request May 11, 2026
…dration

Replaces the N+1 hydration patterns in kb.hybridSearch (text-only RRF
candidates) and kb.textSearch (every hit) with a single getBulk call
that now lives on ITabularStorage (added upstream in #480). Results
come back unordered with their primary keys, so callers re-align
via a chunk_id -> entity map and preserve the hit ordering produced
by the upstream rankers.

For DB-backed storages, this turns up to ~50 round-trips per query
(topK=10, candidatePoolMultiplier=5) into one — addresses the N+1
concerns Copilot flagged on KnowledgeBase.ts:582 and :631.
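
The re-alignment step could look roughly like the following. It is a sketch with hypothetical `RankedHit` and `ChunkEntity` shapes; the real KnowledgeBase types carry more fields.

```typescript
interface RankedHit {
  chunk_id: string;
  score: number;
}
interface ChunkEntity {
  chunk_id: string;
  text: string;
}

// Re-attaches bulk-fetched rows to ranked hits, keeping the ranker's order.
function alignToHits(
  hits: readonly RankedHit[],
  rows: readonly ChunkEntity[]
): Array<{ chunk: ChunkEntity; score: number }> {
  const byId = new Map<string, ChunkEntity>();
  for (const row of rows) byId.set(row.chunk_id, row);

  const aligned: Array<{ chunk: ChunkEntity; score: number }> = [];
  for (const hit of hits) {
    const chunk = byId.get(hit.chunk_id);
    // A hit whose row is missing (for example, deleted since indexing) is skipped.
    if (chunk !== undefined) aligned.push({ chunk, score: hit.score });
  }
  return aligned;
}
```

Skipping hits with no matching row is a choice made for this sketch; the commit message does not say how missing rows are handled.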
sroussey added a commit that referenced this pull request May 11, 2026
* feat(knowledge-base): hybrid search via RRF over BM25F text index

Replaces the naive keyword-matching hybridSearch baked into each
IVectorStorage implementation with a real BM25F text index that lives
alongside the vector index, with fusion now performed at the
KnowledgeBase layer via Reciprocal Rank Fusion.

- Adds @workglow/storage/text: ITextIndex interface, pluggable Tokenizer
  with default English stopwords, and an in-memory BM25F index with
  per-field weights, JSON-serialisable state, and idempotent upserts.
- KnowledgeBase gains installTextIndex / getTextIndex / reindexText,
  auto-writes chunks to the text index on upsert, cascades deletes, and
  fuses similaritySearch + textIndex.search through RRF in hybridSearch.
- createKnowledgeBase accepts a textIndex option.
- Removes IVectorStorage.hybridSearch, HybridSearchOptions, and the
  per-backend keyword-match implementations across InMemory, IndexedDB,
  Telemetry, Scoped, Sqlite, SqliteAi, and Postgres vector storages.
  Native Postgres FTS can be reintroduced later as an ITextIndex
  adapter.
- ChunkRetrievalTask error message updated for the new "install a text
  index" semantics; scoreThreshold is no longer forwarded to hybrid
  (RRF scores are not comparable to cosine scores).
- New tests: BM25Index unit tests covering ranking, BM25F field weights,
  length normalisation, upsert/remove, removeByDocument cascades, JSON
  round-trips, and tokenizer behaviour; KB integration test covering
  auto-indexing, cascade deletes, RRF promotion of text-strong matches,
  reindexText, and the no-index error path.
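
For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal unweighted sketch. It assumes ranks start at 1 and a default `rrfK` of 60 (a common choice; the library's default is not stated in this commit), and it omits the per-ranker weighting that `hybridSearch` may apply.

```typescript
// Fuses several ranked id lists: each list contributes 1 / (rrfK + rank) per id.
function reciprocalRankFusion(
  rankings: readonly (readonly string[])[],
  rrfK: number = 60
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (rrfK + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}

// Example inputs: a vector ranking and a text ranking over chunk ids.
const vectorRanking = ["a", "b", "c"];
const textRanking = ["c", "a", "d"];
const fused = reciprocalRankFusion([vectorRanking, textRanking]);
```

Ids that appear in both lists accumulate two terms, which is how a chunk ranked well by both rankers rises above one ranked first by only one of them.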

* fix(knowledge-base,storage): address Copilot review on PR #478

- BM25Index.fromJSON now restores k1/b (previously silently kept the
  constructor's defaults, so a round-trip with mismatched params could
  change scoring).
- BM25Index.remove uses a per-chunk reverse posting index instead of
  scanning the whole vocabulary. New chunkPostings map is populated on
  add and rebuilt by fromJSON.
- BM25Index class JSDoc corrected: scoring config (k1, b, fieldWeights)
  IS serialised; only the tokenizer is not.
- KnowledgeBase.upsertChunk / upsertChunksBulk now drop stale postings
  when an updated chunk has no indexable text fields.
- KnowledgeBase.put / putBulk delegate to upsertChunk / upsertChunksBulk
  so the text index stays in sync via either entry point.
- KnowledgeBase.hybridSearch uses Math.ceil for the candidate pool size
  to keep topK an integer when candidatePoolMultiplier is fractional.
- New regression tests for k1/b round-trip, post-fromJSON remove,
  empty-text upserts, put/putBulk indexing, and fractional multipliers.

* fix(knowledge-base,storage,ai): address self-review follow-ups

- BM25Index: maintain a per-term document frequency cache (termDf) so
  search() does O(1) df lookup instead of walking every posting list per
  query term. Updated incrementally on add/remove and rebuilt on
  fromJSON; the old distinctChunkCount helper is gone.
- KnowledgeBase.reindexText: now atomic with respect to async failures.
  Reads chunkStorage and stages tokenisation before mutating the index;
  if getAll() throws, the existing index is untouched.
- KnowledgeBase.hybridSearch: default candidatePoolMultiplier bumped
  2 -> 5 to give RRF enough overlap between rankers (industry norm;
  with 2x and topK=10 the fusion was degenerating toward "OR of top-K"
  for queries with low ranker agreement).
- ChunkRetrievalTask: schema descriptions for method, scoreThreshold,
  and scores updated to disambiguate similarity (cosine in [0,1]) from
  hybrid (RRF fusion scores, not comparable). scoreThreshold still
  silently ignored for hybrid; schema now says so.
- Tests: switched KB names from Date.now() to uuid4() to remove
  same-millisecond collision risk; added regression tests for atomic
  reindexText (mocks getAll to throw, asserts index unchanged) and for
  RRF score shape (asserts hybrid scores are small positives, not
  cosine-range).

* fix(knowledge-base): address Copilot review on hybridSearch + KB tests

- hybridSearch falls back to similaritySearch when textQuery is empty
  or whitespace-only. Previously it ran the full RRF path with an empty
  text-result list, returning vector-only RRF-shaped scores (~[0,0.05])
  in the score field — surprising to callers expecting cosine scores
  in [0,1] when the query has no text signal.
- Clamp rrfK to a non-negative value at use site (Math.max(0, rrfK)).
  rrfK <= -1 previously produced Infinity / sign-flipped denominators
  for some ranks.
- Clamp candidatePoolMultiplier to >= 1 to prevent zero / negative
  pool sizes.
- KnowledgeBaseHybridSearch.test.ts: pass register: false to every
  createKnowledgeBase call so the global KB registry doesn't accumulate
  per-test entries. Drop the no-op afterEach (its comment claimed
  in-memory KBs need no teardown, but said nothing about the registry
  cost; with register: false there's nothing to do).
- New regression tests: empty/whitespace textQuery returns cosine
  scores; rrfK=-10 produces finite, positive scores.
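
The clamps described above, together with the `Math.ceil` pool sizing from the earlier review round, might be sketched like this. `sanitizeHybridParams` is a hypothetical helper name; the default `rrfK` of 60 is an assumption, while the multiplier default of 5 comes from the commit messages.

```typescript
// Normalises hybrid-search tuning parameters before they are used for scoring.
function sanitizeHybridParams(
  topK: number,
  rrfK: number = 60,
  candidatePoolMultiplier: number = 5
): { rrfK: number; poolSize: number } {
  // A negative rrfK can make 1 / (rrfK + rank) infinite or negative.
  const safeRrfK = Math.max(0, rrfK);
  // A multiplier below 1 would shrink the candidate pool below topK.
  const safeMultiplier = Math.max(1, candidatePoolMultiplier);
  // Fractional multipliers must still yield an integer pool size.
  const poolSize = Math.ceil(topK * safeMultiplier);
  return { rrfK: safeRrfK, poolSize };
}
```

For instance, `rrfK = -1` would give a first-rank term of `1 / (-1 + 1)`, which is `Infinity`; after clamping, the term becomes `1 / (0 + 1)`.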

Not fixed in this PR: the N+1 hydration concern (kb.hybridSearch /
textSearch issue one chunkStorage.get per chunkId). Fixing it requires
adding a multi-id batch fetch to ITabularStorage, which is broader than
this PR's scope. Tracked as a follow-up.

* feat(knowledge-base): tag search results with scoreType discriminator

ChunkSearchResult now carries an optional `scoreType: "cosine" | "bm25"
| "rrf"` field so callers (typically UIs) can render scores
appropriately without having to remember which search method produced
them — the three scorers live on incompatible scales:

- cosine: [-1,1], typically [0,1] for text embeddings. Absolute.
- bm25:   [0,inf). Absolute but corpus-dependent.
- rrf:    bounded above by 2/(rrfK+1) (~0.033 with defaults).
          Rank-based, not absolute, not comparable across queries.

KB.similaritySearch tags "cosine", KB.textSearch tags "bm25",
KB.hybridSearch tags "rrf" — except the empty-textQuery fallback that
routes through similaritySearch, which correctly surfaces "cosine".

ChunkRetrievalTask exposes a top-level `scoreType` field in its output
(single string, since every result in one call shares the same type).

Field is optional on ChunkSearchResult to keep the type non-breaking
for storage-only callers that construct results without going through
the KB layer.

* fix(ai): scoreType default reflects hybrid's empty-query cosine fallback

ChunkRetrievalTask used to default scoreType to "rrf" whenever method
was "hybrid", but kb.hybridSearch routes through similaritySearch when
textQuery is empty/whitespace. On a zero-result hybrid call with an
empty query, the task therefore reported "rrf" even though the KB
would have returned cosine scores had any matched.

Derive the default from (method, queryText) instead of method alone:
hybrid + empty/whitespace query => "cosine", otherwise method-derived.
With at least one result the existing fallback to results[0].scoreType
already handled this; the fix only changes the empty-result branch.
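
The derivation described above can be sketched as a pure function. The `RetrievalMethod` labels and the `resolveScoreType` name are assumptions for illustration and may not match ChunkRetrievalTask's actual schema.

```typescript
type ScoreType = "cosine" | "bm25" | "rrf";
type RetrievalMethod = "similarity" | "hybrid"; // hypothetical method labels

// Picks the scoreType to report: prefer what the results say, else derive a default.
function resolveScoreType(
  method: RetrievalMethod,
  queryText: string | undefined,
  results: readonly { scoreType?: ScoreType }[]
): ScoreType {
  const fromResults = results[0]?.scoreType;
  if (fromResults !== undefined) return fromResults;

  if (method === "hybrid") {
    // Hybrid with no text signal falls back to similarity search, hence cosine.
    const hasText = queryText !== undefined && queryText.trim() !== "";
    return hasText ? "rrf" : "cosine";
  }
  return "cosine";
}
```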

Not adding a regression test: triggering this path requires an empty
corpus AND a string query (model is required for string queries,
which would force a mocked embedding service for unit coverage).
The KB-level fallback to cosine on empty textQuery is already covered
by KnowledgeBaseHybridSearch.test.ts.

* perf(knowledge-base): use ITabularStorage.getBulk for plural chunk hydration

Replaces the N+1 hydration patterns in kb.hybridSearch (text-only RRF
candidates) and kb.textSearch (every hit) with a single getBulk call
that now lives on ITabularStorage (added upstream in #480). Results
come back unordered with their primary keys, so callers re-align
via a chunk_id -> entity map and preserve the hit ordering produced
by the upstream rankers.

For DB-backed storages, this turns up to ~50 round-trips per query
(topK=10, candidatePoolMultiplier=5) into one — addresses the N+1
concerns Copilot flagged on KnowledgeBase.ts:582 and :631.

* fix(knowledge-base): address review-agent findings on PR #478

- reindexText is now truly atomic: snapshots `index.toJSON()` before
  mutating and restores via `fromJSON()` on any failure during
  `clear`/`add`/`getAll`. Previous "atomic w.r.t. async failures" doc
  promise covered only the getAll path; a sync throw mid-loop could
  empty the index.
- Auto-index writes in upsertChunk / upsertChunksBulk are wrapped in
  try/catch with a console.warn. The chunk is already durable in
  chunkStorage when the index write runs, so an index error must not
  fail the upsert. Recovery is via kb.reindexText().
- textSearch gains a candidatePoolMultiplier option (default 2,
  matching prior behavior) for symmetry with hybridSearch when
  filtering selectively.
- Defensive guards: rrfK, candidatePoolMultiplier, and vectorWeight in
  hybridSearch now reject non-finite values (NaN/Infinity) and fall
  back to defaults instead of producing Infinity/NaN scores.
- New tests:
  - BM25Index.test.ts: removeByDocument cascade after fromJSON
    round-trip (catches docToChunks reconstruction gaps).
  - KnowledgeBaseHybridSearch.test.ts: reindexText rollback on
    synchronous add failure via a stub ITextIndex; upsertChunk warn
    + chunk-still-stored when index write throws.

Changeset: documents the breaking removal of HybridSearchOptions /
hybridSearch from @workglow/storage and points callers at the
@workglow/knowledge-base equivalent (with the noted shape change —
no scoreThreshold for RRF).

---------

Co-authored-by: Claude <noreply@anthropic.com>
sroussey pushed a commit that referenced this pull request May 12, 2026
…dTabularStorage

Previous commit replaced inner.getBulk delegation with per-key fan-out via
inner.get, citing a leak risk from SQL backends building WHERE from PK
columns only. That diagnosis is correct in shape but the chosen mitigation
doesn't actually fix it: inner.get on SQL backends *also* builds WHERE
from PK columns only (via getPrimaryKeyAsOrderedArray), so per-key fan-out
would silently drop kb_id whenever kb_id is not in the inner PK -- the
same failure mode it claimed to fix.

The actual contract is "inner schema's PK must include kb_id". Every
production wrapping uses SharedDocumentPrimaryKey or SharedChunkPrimaryKey
(both ["kb_id", ...]); the libs codebase has no internal construction
sites of ScopedTabularStorage at all -- it's exported for external
consumers.

Restore inner.getBulk(scopedKeys) (the original PR #480 IN-tuple WHERE
optimization) and enforce the kb_id-in-PK contract at construction. Any
future misuse fails loudly at the constructor rather than leaking rows
at query time. Storages that don't expose primaryKeyNames (third-party
impls that don't extend BaseTabularStorage) get a console.warn so they
keep working.

Tests:
- new "constructor contract" describe block covers the throw, the
  happy path, and the warn-instead-of-throw fallback;
- new "delegates to inner.getBulk in one round trip" test inside the
  existing "getBulk isolation" block locks in the optimization (a
  regression to fan-out is caught by the inner.getBulk vs inner.get
  call-count assertion).
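
A sketch of a constructor-time check along the lines described, assuming the inner storage may or may not expose `primaryKeyNames`. The class name, error text, and injectable `warn` callback are hypothetical.

```typescript
interface InnerStorageLike {
  primaryKeyNames?: readonly string[];
}

class ScopedStorageSketch {
  readonly inner: InnerStorageLike;
  readonly kbId: string;

  constructor(
    inner: InnerStorageLike,
    kbId: string,
    warn: (message: string) => void = (message) => console.warn(message)
  ) {
    const pk = inner.primaryKeyNames;
    if (pk === undefined) {
      // Cannot verify the contract, so warn and keep working.
      warn("inner storage does not expose primaryKeyNames; kb_id-in-PK not verified");
    } else if (pk.indexOf("kb_id") === -1) {
      // Fail loudly at construction instead of leaking rows at query time.
      throw new Error("inner storage primary key must include kb_id");
    }
    this.inner = inner;
    this.kbId = kbId;
  }
}
```

Checking once in the constructor keeps the read path free of per-call validation while still catching a misconfigured wrapper before any query runs.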

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg
sroussey added a commit that referenced this pull request May 12, 2026
… Postgres-native hybrid search (#486)

* fix(knowledge-base): cross-KB row leak in ScopedTabularStorage.getBulk

The previous implementation delegated to `inner.getBulk(scopedKeys)`, but
SQL backends build their IN-tuple WHERE from primary-key columns only via
`getPrimaryKeyAsOrderedArray`. When the inner storage's PK does not include
`kb_id` (per-KB tables wrapped for synthetic scoping, or any future backend
that derives keys from PK columns only), our injected `kb_id` is silently
dropped from the predicate, allowing rows from other KBs to leak through.

Fan out per-key `get()` calls instead — each call goes through the same
code path as `ScopedTabularStorage.get` and `delete`, both of which use
`deleteSearch` / `inner.get` with the full `{ ...key, kb_id }` predicate
that every backend correctly translates into `WHERE kb_id = ? AND ...`.
Correctness over throughput.

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

* test(knowledge-base): SQL-backend cross-KB isolation tests for ScopedTabularStorage.getBulk

The existing in-memory test in `ScopedStorage.test.ts` passes regardless of
the fix because `InMemoryTabularStorage.getBulk` does not use
`getPrimaryKeyAsOrderedArray` — it iterates keys with a direct lookup, so
the in-memory path never exercised the SQL backends' IN-tuple WHERE
construction. Mirror the isolation assertions against real SQL backends
(PGlite-backed Postgres + in-memory SQLite) so a regression in
`ScopedTabularStorage.getBulk` would surface in CI.

The Postgres backend interprets `doc_id: { type: "string",
"x-auto-generated": true }` as a `UUID` column, so the Postgres test uses
real UUIDs for the colliding keys.

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

* feat(storage): optional beginRebuild/commitRebuild/abortRebuild on ITextIndex; search may return Promise

Backends with server-side state — e.g. a Postgres-side `tsvector` table
backing FTS — cannot round-trip through the in-memory toJSON/fromJSON
snapshot path that `KnowledgeBase.reindexText` relies on for atomic
rebuild. Expose three optional lifecycle hooks (`beginRebuild`,
`commitRebuild`, `abortRebuild`) so such backends can wrap the rebuild
in a database transaction; the in-memory `BM25Index` keeps working
unchanged by simply omitting the hooks.

Relax the return types on the index-mutation methods (`add`, `remove`,
`removeByDocument`, `clear`, `size`) and `search` to `T | Promise<T>` so
async backends can implement the interface without forcing in-memory
backends into a Promise wrapper. The caller (`KnowledgeBase`) awaits
either way.

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

* feat(knowledge-base): KnowledgeBase.reindexText uses ITextIndex rebuild hooks when available

`reindexText` previously snapshotted the index via `toJSON()` and rolled
back via `fromJSON(snapshot)` on error. That only works for backends
whose state lives in memory; a server-side backend (e.g. a Postgres FTS
index) cannot meaningfully round-trip through a JSON snapshot — the
"snapshot" is durable on the database side.

When the installed index exposes `beginRebuild`/`commitRebuild`/
`abortRebuild`, wrap the rebuild in those hooks instead. The hooks let
the backend run the rebuild inside a database transaction, with abort
mapped to ROLLBACK. Indices without the hooks (`BM25Index`) fall back
to the existing JSON snapshot path with identical behaviour.
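
The two rebuild paths can be sketched side by side. `TextIndexSketch` is a trimmed, hypothetical stand-in for `ITextIndex` (the real interface has more members and different signatures), and `reindexSketch` is not the `reindexText` source.

```typescript
interface TextIndexSketch {
  clear(): void | Promise<void>;
  add(id: string, text: string): void | Promise<void>;
  toJSON(): unknown;
  fromJSON(state: unknown): void;
  beginRebuild?(): void | Promise<void>;
  commitRebuild?(): void | Promise<void>;
  abortRebuild?(): void | Promise<void>;
}

// Rebuilds the index; on failure, rolls back via hooks if present, else via snapshot.
async function reindexSketch(
  index: TextIndexSketch,
  chunks: readonly { id: string; text: string }[]
): Promise<void> {
  const hasHooks =
    index.beginRebuild !== undefined &&
    index.commitRebuild !== undefined &&
    index.abortRebuild !== undefined;

  // In-memory indices are snapshotted; hook-based ones rely on the backend instead.
  const snapshot = hasHooks ? undefined : index.toJSON();
  if (hasHooks) await index.beginRebuild?.();

  try {
    await index.clear();
    for (const chunk of chunks) {
      await index.add(chunk.id, chunk.text); // awaiting works for sync and async backends
    }
    if (hasHooks) await index.commitRebuild?.();
  } catch (error) {
    if (hasHooks) {
      await index.abortRebuild?.(); // e.g. mapped to ROLLBACK by a database backend
    } else {
      index.fromJSON(snapshot);
    }
    throw error;
  }
}
```

Methods are always invoked as `index.method()` rather than through detached references, so backends that depend on `this` keep working.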

Also await `index.search(...)` in `textSearch`/`hybridSearch` and handle
fire-and-forget Promises in `syncTextIndexForChunk` so async backends
can implement `ITextIndex` directly.

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

* feat(postgres): PostgresFtsTextIndex restoring native hybrid search via to_tsvector/plainto_tsquery

PR #478 removed `IVectorStorage.hybridSearch` in favour of a
KB-layer Reciprocal Rank Fusion over an in-memory `BM25Index`. That
strands production Postgres KBs on a path where `reindexText` calls
`chunkStorage.getAll()` — unbounded over server-side rows — and rebuilds
postings in process memory.

Restore the Postgres-native path as a new `ITextIndex` implementation
backed by a single side table per KB indexed by a GIN `tsvector`:

  CREATE TABLE <table> (chunk_id TEXT PRIMARY KEY, doc_id TEXT NOT NULL,
                       tsv TSVECTOR NOT NULL);
  CREATE INDEX <table>_tsv_idx ON <table> USING GIN (tsv);
  CREATE INDEX <table>_doc_idx ON <table> (doc_id);

Scoring is `ts_rank_cd(tsv, plainto_tsquery(...))` — unbounded above and
non-negative, so RRF fusion in `KnowledgeBase.hybridSearch` works
without normalisation.

Implements the optional `beginRebuild`/`commitRebuild`/`abortRebuild`
hooks on `ITextIndex` so `KnowledgeBase.reindexText` wraps the rebuild
in a real BEGIN/COMMIT — abort routes to ROLLBACK. State is durable
server-side; `toJSON`/`fromJSON` are intentional no-ops.

Exports the index from the new `@workglow/postgres/text` entry point and
documents the setup pattern in the provider README.

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

* docs: changeset for ITextIndex async + PostgresFtsTextIndex

- @workglow/storage minor: ITextIndex.search may now return a Promise;
  new optional beginRebuild/commitRebuild/abortRebuild lifecycle hooks
  for backends with server-side state.
- @workglow/knowledge-base minor: reindexText uses the new hooks when
  available, falls back to toJSON/fromJSON snapshot rollback otherwise.
- @workglow/postgres minor: new PostgresFtsTextIndex over a side
  `tsvector` GIN index for Postgres-native hybrid search.

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

* refactor(knowledge-base): restore inner.getBulk optimization in ScopedTabularStorage

Previous commit replaced inner.getBulk delegation with per-key fan-out via
inner.get, citing a leak risk from SQL backends building WHERE from PK
columns only. That diagnosis is correct in shape but the chosen mitigation
doesn't actually fix it: inner.get on SQL backends *also* builds WHERE
from PK columns only (via getPrimaryKeyAsOrderedArray), so per-key fan-out
would silently drop kb_id whenever kb_id is not in the inner PK -- the
same failure mode it claimed to fix.

The actual contract is "inner schema's PK must include kb_id". Every
production wrapping uses SharedDocumentPrimaryKey or SharedChunkPrimaryKey
(both ["kb_id", ...]); the libs codebase has no internal construction
sites of ScopedTabularStorage at all -- it's exported for external
consumers.

Restore inner.getBulk(scopedKeys) (the original PR #480 IN-tuple WHERE
optimization) and enforce the kb_id-in-PK contract at construction. Any
future misuse fails loudly at the constructor rather than leaking rows
at query time. Storages that don't expose primaryKeyNames (third-party
impls that don't extend BaseTabularStorage) get a console.warn so they
keep working.

Tests:
- new "constructor contract" describe block covers the throw, the
  happy path, and the warn-instead-of-throw fallback;
- new "delegates to inner.getBulk in one round trip" test inside the
  existing "getBulk isolation" block locks in the optimization (a
  regression to fan-out is caught by the inner.getBulk vs inner.get
  call-count assertion).

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

* fix(postgres): tighten PostgresFtsTextIndex.fromJSON + document identifier constraints

Addresses Copilot review feedback on PR #486:

- fromJSON now validates state.table === this.table (or symmetrically removes
  table from toJSON if it can't enforce equality on a static method) so a
  snapshot from a different table can't silently round-trip through a
  mismatched instance.
- README documents the strict alphanumeric+underscore identifier whitelist
  enforced on the `table` constructor argument, so callers know schema-
  qualified names and dashes are rejected.
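
An identifier whitelist of the kind described might be as small as the following sketch. The exact pattern and the `assertSafeIdentifier` name are assumptions; the actual check in PostgresFtsTextIndex may differ, for example on leading digits or length limits.

```typescript
// Letters, digits, and underscores only: no dots, dashes, quotes, or whitespace.
const SAFE_IDENTIFIER = /^[A-Za-z0-9_]+$/;

// Returns the name unchanged if it is safe to interpolate into SQL, else throws.
function assertSafeIdentifier(name: string): string {
  if (!SAFE_IDENTIFIER.test(name)) {
    throw new Error(`invalid table identifier: ${JSON.stringify(name)}`);
  }
  return name;
}
```

Rejecting anything outside the whitelist matters because table names cannot be passed as bind parameters and end up interpolated into the statement text.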

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

* fix(postgres,knowledge-base): bind rebuildClient.query + correct stale docs/comments

Addresses 7 Copilot review comments from the round-2 review:

- PostgresFtsTextIndex.exec: bind rebuildClient.query during beginRebuild so
  pg.PoolClient.query receives the correct `this`. The previous unbound
  reference would throw on real Postgres connections (PGlite happens to be
  permissive enough not to catch this in the integration test).
- README: reword the "no in-memory state at reindex time" claim — reindex
  still iterates all chunks via chunkStorage.getAll(). The actual benefits
  are server-side durable index, no JS-heap BM25 state, and transactional
  rebuild. Identifier whitelist note retained.
- Changeset: mirror the README rewording. Add a note documenting that
  ScopedTabularStorage's constructor now throws when the inner PK omits
  kb_id — a user-visible behavior change for external wrappers.
- Sqlite + Postgres integration tests: move `await Sqlite.init()` from the
  unawaited async describe() callback into beforeAll() so init completes
  before any test runs (was a latent flake source).
- Sqlite + Postgres test comments: update the stale "fans out per-key get()"
  description to match the restored inner.getBulk + constructor-enforced
  kb_id-in-PK approach.
- PR description updated via API to align with the implemented strategy.
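
The unbound-method problem in the first bullet is a general JavaScript pitfall and can be reproduced without a database. The sketch below uses a hypothetical `FakeClient`; it is not `pg.PoolClient`.

```typescript
class FakeClient {
  readonly label = "client-1";
  query(sql: string): string {
    // Relies on `this`, as a real client's query method would.
    return `${this.label}: ${sql}`;
  }
}

const fakeClient = new FakeClient();

// Detached reference: `this` is no longer the client when this is called.
const unboundQuery = fakeClient.query;

// Bound reference: `this` is fixed to the client.
const boundQuery = fakeClient.query.bind(fakeClient);

// Calls a query function and reports failure instead of throwing.
function tryQuery(fn: (sql: string) => string): string {
  try {
    return fn("SELECT 1");
  } catch (error) {
    return `failed: ${(error as Error).name}`;
  }
}
```

Storing `client.query.bind(client)` (or an arrow function that calls `client.query(...)`) is the usual way to keep a method reference safe to pass around.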

https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg

---------

Co-authored-by: Claude <noreply@anthropic.com>

3 participants