docs: add design for storage getBulk plural-get#480
Merged
Conversation
@workglow/cli
@workglow/ai
@workglow/job-queue
@workglow/knowledge-base
@workglow/storage
@workglow/task-graph
@workglow/tasks
@workglow/util
workglow
commit: |
275d9b7 to
b45403d
Compare
Coverage Report
File Coverage
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Design spec for adding getBulk(keys) to IKvStorage / ITabularStorage, reclaiming the name from the deprecated offset/limit method on tabular (renamed to getOffsetPage). Vector inherits via ITabularStorage.
…etPage Frees the getBulk name for the upcoming plural-get-by-keys method. @deprecated JSDoc preserved; migration target remains getPage. https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m
…setPage Missed in the initial rename sweep. Brings SupabaseTabularStorage in line with the renamed ITabularStorage interface.
Default implementation in BaseTabularStorage does Promise.all(get). SQL backends will override with a single batched query in follow-up commits.
Replaces the inherited N-parallel-get default in SqliteTabularStorage with a single-statement override: single-column PKs use `WHERE pk IN (?,?,...)`, compound PKs use `WHERE (p1,p2) IN ((?,?),...)` — one round-trip regardless of batch size. https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m
Delegates to ITabularStorage.getBulk on tabular-backed KV implementations, picking up SQL batched-IN pushdown for free. https://claude.ai/code/session_01YVoVQgmfvX4rp9f4f5dc6m
getPrimaryKeyAsOrderedArray already applies jsToSqlValue to each PK field. The new getBulk overrides were re-applying it on the returned values. Idempotent for current types (strings, numbers, ISO date strings, etc.), so no observable bug today, but it would silently double-serialize any future non-idempotent type. Match the existing _getInternal and _deleteInternal patterns and pass through directly.
…predicate The .filter((r): r is T => r !== undefined) form trips TS when T is a generic that gets resolved to Awaited<T>: the type predicate's target type must be assignable to its parameter's type, and Awaited<T>|undefined isn't reliably narrowable to T in generic position. Switch to an as-cast on the filtered array. Same runtime, types compile.
Match the KV-event pattern established in c8c9860: concrete KV implementations now emit each method's event. KvViaTabularStorage.getBulk already emits as part of the rebase resolution; this finishes the symmetry for FsFolderKvStorage.
b45403d to
04255b6
Compare
Three callsites of the same try { JSON.parse } catch { raw } pattern
collapse into a private deserialize() helper. Behavior unchanged; the
helper also short-circuits the schema-not-needs-json case so getAll's
inlined IIFE goes away.
Match the tabular layer (BaseTabularStorage / SqliteTabularStorage / PostgresTabularStorage), which all return early without emitting on empty input. Listeners using getBulk for cache invalidation or logging no longer see spurious empty-call events from KV stores while seeing none from tabular stores.
SQLite's SQLITE_MAX_VARIABLE_NUMBER is 999 on older builds; Postgres caps bind parameters at 65535. A single getBulk call with many compound-PK inputs would hit either ceiling and fail with a cryptic driver error. Wrap the override so inputs above a safe per-statement threshold split into chunks executed sequentially, each chunk taking the mutex once. The event is emitted once with the original key array, not per chunk.
Two scopes sharing a physical table must not leak rows across kb_id boundaries via getBulk, and the emitted event must carry the original unscoped keys. Both invariants verified against InMemory.</br>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new plural “get by keys” API across KV and Tabular storage (and thus Vector via inheritance), while reclaiming the getBulk name by renaming the deprecated offset/limit paging method to getOffsetPage. This extends the storage abstraction to support efficient batched reads (notably via SQL IN (...)) and wires the operation into existing telemetry and event systems.
Changes:
- Introduces
IKvStorage.getBulk(keys)andITabularStorage.getBulk(keys), plus newgetBulkevents for both KV and Tabular. - Renames deprecated
ITabularStorage.getBulk(offset, limit)togetOffsetPage(offset, limit)across implementations and tests. - Implements batched SQL overrides for
getBulk(keys)in Postgres/SQLite; other backends inherit the defaultPromise.all(get)implementation.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| providers/supabase/src/storage/SupabaseTabularStorage.ts | Renames deprecated offset paging method to getOffsetPage. |
| providers/sqlite/src/storage/SqliteTabularStorage.ts | Adds SQL-batched getBulk(keys) with chunking; renames offset paging to getOffsetPage. |
| providers/postgres/src/storage/PostgresTabularStorage.ts | Adds SQL-batched getBulk(keys) with chunking; renames offset paging to getOffsetPage. |
| packages/test/src/test/storage-tabular/HuggingFaceTabularStorage.integration.test.ts | Updates tests to call getOffsetPage. |
| packages/test/src/test/storage-tabular/genericTabularStorageTests.ts | Renames offset paging suite + adds new getBulk(keys) behavior tests. |
| packages/test/src/test/storage-kv/genericKvRepositoryTests.ts | Adds getBulk(keys) tests including JSON deserialization and event emission. |
| packages/test/src/test/rag/ScopedStorage.test.ts | Adds ScopedTabularStorage.getBulk isolation + event tests. |
| packages/storage/src/tabular/TelemetryTabularStorage.ts | Adds tracing for getBulk(keys) and renames traced offset paging to getOffsetPage. |
| packages/storage/src/tabular/SharedInMemoryTabularStorage.ts | Renames delegated offset paging method to getOffsetPage. |
| packages/storage/src/tabular/ITabularStorage.ts | Adds getBulk(keys) API + event type; renames deprecated offset paging to getOffsetPage. |
| packages/storage/src/tabular/InMemoryTabularStorage.ts | Renames deprecated offset paging method to getOffsetPage. |
| packages/storage/src/tabular/HuggingFaceTabularStorage.ts | Renames internal/self calls to getOffsetPage. |
| packages/storage/src/tabular/FsFolderTabularStorage.ts | Renames offset paging to getOffsetPage and updates warning message text. |
| packages/storage/src/tabular/CachedTabularStorage.ts | Renames delegated offset paging method to getOffsetPage. |
| packages/storage/src/tabular/BaseTabularStorage.ts | Adds default getBulk(keys) implementation and renames abstract offset paging to getOffsetPage. |
| packages/storage/src/kv/TelemetryKvStorage.ts | Adds tracing wrapper for getBulk(keys). |
| packages/storage/src/kv/KvViaTabularStorage.ts | Adds getBulk(keys) delegating to tabular + shared JSON deserialization helper. |
| packages/storage/src/kv/KvStorage.ts | Adds abstract getBulk(keys) to KV base class. |
| packages/storage/src/kv/IKvStorage.ts | Adds getBulk(keys) API + getBulk event type. |
| packages/storage/src/kv/FsFolderKvStorage.ts | Implements getBulk(keys) via parallel get calls + event emission. |
| packages/knowledge-base/src/knowledge-base/ScopedTabularStorage.ts | Adds scoped getBulk(keys) that injects kb_id and strips it from results. |
| packages/indexeddb/src/storage/IndexedDbTabularStorage.ts | Renames deprecated offset paging method to getOffsetPage. |
| docs/superpowers/specs/2026-05-10-storage-get-plural-design.md | New design spec documenting API, naming reclaim, implementations, and tests. |
| docs/superpowers/plans/2026-05-10-storage-get-plural.md | New step-by-step implementation plan and checklist. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+2425
to
+2429
| describe("getBulk(keys)", () => { | ||
| const seed = [ | ||
| { name: "key1", type: "type1", option: "value1", success: true }, | ||
| { name: "key2", type: "type2", option: "value2", success: false }, | ||
| { name: "key3", type: "type3", option: "value3", success: true }, |
Compound-PK tabular tests already exercised the row-value-tuple SQL form (WHERE (p1,p2) IN ((?,?),...)) on SQLite and Postgres. The single-column branch (WHERE pk IN (?,?,...)) was structurally distinct and had no test coverage. Add an analogous getBulk(keys) block against SearchSchema (single-PK) so the simpler SQL path is exercised against every backend that registers a searchable-repository factory.
sroussey
pushed a commit
that referenced
this pull request
May 11, 2026
…dration Replaces the N+1 hydration patterns in kb.hybridSearch (text-only RRF candidates) and kb.textSearch (every hit) with a single getBulk call that now lives on ITabularStorage (added upstream in #480). Results come back unordered with their primary keys, so callers re-align via a chunk_id -> entity map and preserve the hit ordering produced by the upstream rankers. For DB-backed storages, this turns up to ~50 round-trips per query (topK=10, candidatePoolMultiplier=5) into one — addresses the N+1 concerns Copilot flagged on KnowledgeBase.ts:582 and :631.
sroussey
added a commit
that referenced
this pull request
May 11, 2026
* feat(knowledge-base): hybrid search via RRF over BM25F text index Replaces the naive keyword-matching hybridSearch baked into each IVectorStorage implementation with a real BM25F text index that lives alongside the vector index, with fusion now performed at the KnowledgeBase layer via Reciprocal Rank Fusion. - Adds @workglow/storage/text: ITextIndex interface, pluggable Tokenizer with default English stopwords, and an in-memory BM25F index with per-field weights, JSON-serialisable state, and idempotent upserts. - KnowledgeBase gains installTextIndex / getTextIndex / reindexText, auto-writes chunks to the text index on upsert, cascades deletes, and fuses similaritySearch + textIndex.search through RRF in hybridSearch. - createKnowledgeBase accepts a textIndex option. - Removes IVectorStorage.hybridSearch, HybridSearchOptions, and the per-backend keyword-match implementations across InMemory, IndexedDB, Telemetry, Scoped, Sqlite, SqliteAi, and Postgres vector storages. Native Postgres FTS can be reintroduced later as an ITextIndex adapter. - ChunkRetrievalTask error message updated for the new "install a text index" semantics; scoreThreshold is no longer forwarded to hybrid (RRF scores are not comparable to cosine scores). - New tests: BM25Index unit tests covering ranking, BM25F field weights, length normalisation, upsert/remove, removeByDocument cascades, JSON round-trips, and tokenizer behaviour; KB integration test covering auto-indexing, cascade deletes, RRF promotion of text-strong matches, reindexText, and the no-index error path. * fix(knowledge-base,storage): address Copilot review on PR #478 - BM25Index.fromJSON now restores k1/b (previously silently kept the constructor's defaults, so a round-trip with mismatched params could change scoring). - BM25Index.remove uses a per-chunk reverse posting index instead of scanning the whole vocabulary. New chunkPostings map is populated on add and rebuilt by fromJSON. - BM25Index class JSDoc corrected: scoring config (k1, b, fieldWeights) IS serialised; only the tokenizer is not. - KnowledgeBase.upsertChunk / upsertChunksBulk now drop stale postings when an updated chunk has no indexable text fields. - KnowledgeBase.put / putBulk delegate to upsertChunk / upsertChunksBulk so the text index stays in sync via either entry point. - KnowledgeBase.hybridSearch uses Math.ceil for the candidate pool size to keep topK an integer when candidatePoolMultiplier is fractional. - New regression tests for k1/b round-trip, post-fromJSON remove, empty-text upserts, put/putBulk indexing, and fractional multipliers. * fix(knowledge-base,storage,ai): address self-review follow-ups - BM25Index: maintain a per-term document frequency cache (termDf) so search() does O(1) df lookup instead of walking every posting list per query term. Updated incrementally on add/remove and rebuilt on fromJSON; the old distinctChunkCount helper is gone. - KnowledgeBase.reindexText: now atomic with respect to async failures. Reads chunkStorage and stages tokenisation before mutating the index; if getAll() throws, the existing index is untouched. - KnowledgeBase.hybridSearch: default candidatePoolMultiplier bumped 2 -> 5 to give RRF enough overlap between rankers (industry norm; with 2x and topK=10 the fusion was degenerating toward "OR of top-K" for queries with low ranker agreement). - ChunkRetrievalTask: schema descriptions for method, scoreThreshold, and scores updated to disambiguate similarity (cosine in [0,1]) from hybrid (RRF fusion scores, not comparable). scoreThreshold still silently ignored for hybrid; schema now says so. - Tests: switched KB names from Date.now() to uuid4() to remove same-millisecond collision risk; added regression tests for atomic reindexText (mocks getAll to throw, asserts index unchanged) and for RRF score shape (asserts hybrid scores are small positives, not cosine-range). * fix(knowledge-base): address Copilot review on hybridSearch + KB tests - hybridSearch falls back to similaritySearch when textQuery is empty or whitespace-only. Previously it ran the full RRF path with an empty text-result list, returning vector-only RRF-shaped scores (~[0,0.05]) in the score field — surprising to callers expecting cosine scores in [0,1] when the query has no text signal. - Clamp rrfK to a non-negative value at use site (Math.max(0, rrfK)). rrfK <= -1 previously produced Infinity / sign-flipped denominators for some ranks. - Clamp candidatePoolMultiplier to >= 1 to prevent zero / negative pool sizes. - KnowledgeBaseHybridSearch.test.ts: pass register: false to every createKnowledgeBase call so the global KB registry doesn't accumulate per-test entries. Drop the no-op afterEach (its comment claimed in-memory KBs need no teardown, but said nothing about the registry cost; with register: false there's nothing to do). - New regression tests: empty/whitespace textQuery returns cosine scores; rrfK=-10 produces finite, positive scores. Not fixed in this PR: the N+1 hydration concern (kb.hybridSearch / textSearch issue one chunkStorage.get per chunkId). Fixing it requires adding a multi-id batch fetch to ITabularStorage, which is broader than this PR's scope. Tracked as a follow-up. * feat(knowledge-base): tag search results with scoreType discriminator ChunkSearchResult now carries an optional `scoreType: "cosine" | "bm25" | "rrf"` field so callers (typically UIs) can render scores appropriately without having to remember which search method produced them — the three scorers live on incompatible scales: - cosine: [-1,1], typically [0,1] for text embeddings. Absolute. - bm25: [0,inf). Absolute but corpus-dependent. - rrf: bounded above by 2/(rrfK+1) (~0.033 with defaults). Rank-based, not absolute, not comparable across queries. KB.similaritySearch tags "cosine", KB.textSearch tags "bm25", KB.hybridSearch tags "rrf" — except the empty-textQuery fallback that routes through similaritySearch, which correctly surfaces "cosine". ChunkRetrievalTask exposes a top-level `scoreType` field in its output (single string, since every result in one call shares the same type). Field is optional on ChunkSearchResult to keep the type non-breaking for storage-only callers that construct results without going through the KB layer. * fix(ai): scoreType default reflects hybrid's empty-query cosine fallback ChunkRetrievalTask used to default scoreType to "rrf" whenever method was "hybrid", but kb.hybridSearch routes through similaritySearch when textQuery is empty/whitespace. On a zero-result hybrid call with an empty query, the task therefore reported "rrf" even though the KB would have returned cosine scores had any matched. Derive the default from (method, queryText) instead of method alone: hybrid + empty/whitespace query => "cosine", otherwise method-derived. With at least one result the existing fallback to results[0].scoreType already handled this; the fix only changes the empty-result branch. Not adding a regression test: triggering this path requires an empty corpus AND a string query (model is required for string queries, which would force a mocked embedding service for unit coverage). The KB-level fallback to cosine on empty textQuery is already covered by KnowledgeBaseHybridSearch.test.ts. * perf(knowledge-base): use ITabularStorage.getBulk for plural chunk hydration Replaces the N+1 hydration patterns in kb.hybridSearch (text-only RRF candidates) and kb.textSearch (every hit) with a single getBulk call that now lives on ITabularStorage (added upstream in #480). Results come back unordered with their primary keys, so callers re-align via a chunk_id -> entity map and preserve the hit ordering produced by the upstream rankers. For DB-backed storages, this turns up to ~50 round-trips per query (topK=10, candidatePoolMultiplier=5) into one — addresses the N+1 concerns Copilot flagged on KnowledgeBase.ts:582 and :631. * fix(knowledge-base): address review-agent findings on PR #478 - reindexText is now truly atomic: snapshots `index.toJSON()` before mutating and restores via `fromJSON()` on any failure during `clear`/`add`/`getAll`. Previous "atomic w.r.t. async failures" doc promise covered only the getAll path; a sync throw mid-loop could empty the index. - Auto-index writes in upsertChunk / upsertChunksBulk are wrapped in try/catch with a console.warn. The chunk is already durable in chunkStorage when the index write runs, so an index error must not fail the upsert. Recovery is via kb.reindexText(). - textSearch gains a candidatePoolMultiplier option (default 2, matching prior behavior) for symmetry with hybridSearch when filtering selectively. - Defensive guards: rrfK, candidatePoolMultiplier, and vectorWeight in hybridSearch now reject non-finite values (NaN/Infinity) and fall back to defaults instead of producing Infinity/NaN scores. - New tests: - BM25Index.test.ts: removeByDocument cascade after fromJSON round-trip (catches docToChunks reconstruction gaps). - KnowledgeBaseHybridSearch.test.ts: reindexText rollback on synchronous add failure via a stub ITextIndex; upsertChunk warn + chunk-still-stored when index write throws. Changeset: documents the breaking removal of HybridSearchOptions / hybridSearch from @workglow/storage and points callers at the @workglow/knowledge-base equivalent (with the noted shape change — no scoreThreshold for RRF). --------- Co-authored-by: Claude <noreply@anthropic.com>
5 tasks
sroussey
pushed a commit
that referenced
this pull request
May 12, 2026
…dTabularStorage Previous commit replaced inner.getBulk delegation with per-key fan-out via inner.get, citing a leak risk from SQL backends building WHERE from PK columns only. That diagnosis is correct in shape but the chosen mitigation doesn't actually fix it: inner.get on SQL backends *also* builds WHERE from PK columns only (via getPrimaryKeyAsOrderedArray), so per-key fan-out would silently drop kb_id whenever kb_id is not in the inner PK -- the same failure mode it claimed to fix. The actual contract is "inner schema's PK must include kb_id". Every production wrapping uses SharedDocumentPrimaryKey or SharedChunkPrimaryKey (both ["kb_id", ...]); the libs codebase has no internal construction sites of ScopedTabularStorage at all -- it's exported for external consumers. Restore inner.getBulk(scopedKeys) (the original PR #480 IN-tuple WHERE optimization) and enforce the kb_id-in-PK contract at construction. Any future misuse fails loudly at the constructor rather than leaking rows at query time. Storages that don't expose primaryKeyNames (third-party impls that don't extend BaseTabularStorage) get a console.warn so they keep working. Tests: - new "constructor contract" describe block covers the throw, the happy path, and the warn-instead-of-throw fallback; - new "delegates to inner.getBulk in one round trip" test inside the existing "getBulk isolation" block locks in the optimization (a regression to fan-out is caught by the inner.getBulk vs inner.get call-count assertion). https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg
sroussey
added a commit
that referenced
this pull request
May 12, 2026
… Postgres-native hybrid search (#486) * fix(knowledge-base): cross-KB row leak in ScopedTabularStorage.getBulk The previous implementation delegated to `inner.getBulk(scopedKeys)`, but SQL backends build their IN-tuple WHERE from primary-key columns only via `getPrimaryKeyAsOrderedArray`. When the inner storage's PK does not include `kb_id` (per-KB tables wrapped for synthetic scoping, or any future backend that derives keys from PK columns only), our injected `kb_id` is silently dropped from the predicate, allowing rows from other KBs to leak through. Fan out per-key `get()` calls instead — each call goes through the same code path as `ScopedTabularStorage.get` and `delete`, both of which use `deleteSearch` / `inner.get` with the full `{ ...key, kb_id }` predicate that every backend correctly translates into `WHERE kb_id = ? AND ...`. Correctness over throughput. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * test(knowledge-base): SQL-backend cross-KB isolation tests for ScopedTabularStorage.getBulk The existing in-memory test in `ScopedStorage.test.ts` passes regardless of the fix because `InMemoryTabularStorage.getBulk` does not use `getPrimaryKeyAsOrderedArray` — it iterates keys with a direct lookup, so the in-memory path never exercised the SQL backends' IN-tuple WHERE construction. Mirror the isolation assertions against real SQL backends (PGlite-backed Postgres + in-memory SQLite) so a regression in `ScopedTabularStorage.getBulk` would surface in CI. The Postgres backend interprets `doc_id: { type: "string", "x-auto-generated": true }` as a `UUID` column, so the Postgres test uses real UUIDs for the colliding keys. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * feat(storage): optional beginRebuild/commitRebuild/abortRebuild on ITextIndex; search may return Promise Backends with server-side state — e.g. a Postgres-side `tsvector` table backing FTS — cannot round-trip through the in-memory toJSON/fromJSON snapshot path that `KnowledgeBase.reindexText` relies on for atomic rebuild. Expose three optional lifecycle hooks (`beginRebuild`, `commitRebuild`, `abortRebuild`) so such backends can wrap the rebuild in a database transaction; the in-memory `BM25Index` keeps working unchanged by simply omitting the hooks. Relax the return types on the index-mutation methods (`add`, `remove`, `removeByDocument`, `clear`, `size`) and `search` to `T | Promise<T>` so async backends can implement the interface without forcing in-memory backends into a Promise wrapper. The caller (`KnowledgeBase`) awaits either way. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * feat(knowledge-base): KnowledgeBase.reindexText uses ITextIndex rebuild hooks when available `reindexText` previously snapshotted the index via `toJSON()` and rolled back via `fromJSON(snapshot)` on error. That only works for backends whose state lives in memory; a server-side backend (e.g. a Postgres FTS index) cannot meaningfully round-trip through a JSON snapshot — the "snapshot" is durable on the database side. When the installed index exposes `beginRebuild`/`commitRebuild`/ `abortRebuild`, wrap the rebuild in those hooks instead. The hooks let the backend run the rebuild inside a database transaction, with abort mapped to ROLLBACK. Indices without the hooks (`BM25Index`) fall back to the existing JSON snapshot path with identical behaviour. Also await `index.search(...)` in `textSearch`/`hybridSearch` and handle fire-and-forget Promises in `syncTextIndexForChunk` so async backends can implement `ITextIndex` directly. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * feat(postgres): PostgresFtsTextIndex restoring native hybrid search via to_tsvector/plainto_tsquery PR #478 removed `IVectorStorage.hybridSearch` in favour of a KB-layer Reciprocal Rank Fusion over an in-memory `BM25Index`. That strands production Postgres KBs on a path where `reindexText` calls `chunkStorage.getAll()` — unbounded over server-side rows — and rebuilds postings in process memory. Restore the Postgres-native path as a new `ITextIndex` implementation backed by a single side table per KB indexed by a GIN `tsvector`: CREATE TABLE <table> (chunk_id TEXT PRIMARY KEY, doc_id TEXT NOT NULL, tsv TSVECTOR NOT NULL); CREATE INDEX <table>_tsv_idx ON <table> USING GIN (tsv); CREATE INDEX <table>_doc_idx ON <table> (doc_id); Scoring is `ts_rank_cd(tsv, plainto_tsquery(...))` — unbounded above and non-negative, so RRF fusion in `KnowledgeBase.hybridSearch` works without normalisation. Implements the optional `beginRebuild`/`commitRebuild`/`abortRebuild` hooks on `ITextIndex` so `KnowledgeBase.reindexText` wraps the rebuild in a real BEGIN/COMMIT — abort routes to ROLLBACK. State is durable server-side; `toJSON`/`fromJSON` are intentional no-ops. Exports the index from the new `@workglow/postgres/text` entry point and documents the setup pattern in the provider README. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * docs: changeset for ITextIndex async + PostgresFtsTextIndex - @workglow/storage minor: ITextIndex.search may now return a Promise; new optional beginRebuild/commitRebuild/abortRebuild lifecycle hooks for backends with server-side state. - @workglow/knowledge-base minor: reindexText uses the new hooks when available, falls back to toJSON/fromJSON snapshot rollback otherwise. - @workglow/postgres minor: new PostgresFtsTextIndex over a side `tsvector` GIN index for Postgres-native hybrid search. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * refactor(knowledge-base): restore inner.getBulk optimization in ScopedTabularStorage Previous commit replaced inner.getBulk delegation with per-key fan-out via inner.get, citing a leak risk from SQL backends building WHERE from PK columns only. That diagnosis is correct in shape but the chosen mitigation doesn't actually fix it: inner.get on SQL backends *also* builds WHERE from PK columns only (via getPrimaryKeyAsOrderedArray), so per-key fan-out would silently drop kb_id whenever kb_id is not in the inner PK -- the same failure mode it claimed to fix. The actual contract is "inner schema's PK must include kb_id". Every production wrapping uses SharedDocumentPrimaryKey or SharedChunkPrimaryKey (both ["kb_id", ...]); the libs codebase has no internal construction sites of ScopedTabularStorage at all -- it's exported for external consumers. Restore inner.getBulk(scopedKeys) (the original PR #480 IN-tuple WHERE optimization) and enforce the kb_id-in-PK contract at construction. Any future misuse fails loudly at the constructor rather than leaking rows at query time. Storages that don't expose primaryKeyNames (third-party impls that don't extend BaseTabularStorage) get a console.warn so they keep working. Tests: - new "constructor contract" describe block covers the throw, the happy path, and the warn-instead-of-throw fallback; - new "delegates to inner.getBulk in one round trip" test inside the existing "getBulk isolation" block locks in the optimization (a regression to fan-out is caught by the inner.getBulk vs inner.get call-count assertion). https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * fix(postgres): tighten PostgresFtsTextIndex.fromJSON + document identifier constraints Addresses Copilot review feedback on PR #486: - fromJSON now validates state.table === this.table (or symmetrically removes table from toJSON if it can't enforce equality on a static method) so a snapshot from a different table can't silently round-trip through a mismatched instance. - README documents the strict alphanumeric+underscore identifier whitelist enforced on the `table` constructor argument, so callers know schema- qualified names and dashes are rejected. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg * fix(postgres,knowledge-base): bind rebuildClient.query + correct stale docs/comments Addresses 7 Copilot review comments from the round-2 review: - PostgresFtsTextIndex.exec: bind rebuildClient.query during beginRebuild so pg.PoolClient.query receives the correct `this`. The previous unbound reference would throw on real Postgres connections (PGlite happens to be permissive enough not to catch this in the integration test). - README: reword the "no in-memory state at reindex time" claim — reindex still iterates all chunks via chunkStorage.getAll(). The actual benefits are server-side durable index, no JS-heap BM25 state, and transactional rebuild. Identifier whitelist note retained. - Changeset: mirror the README rewording. Add a note documenting that ScopedTabularStorage's constructor now throws when the inner PK omits kb_id — a user-visible behavior change for external wrappers. - Sqlite + Postgres integration tests: move `await Sqlite.init()` from the unawaited async describe() callback into beforeAll() so init completes before any test runs (was a latent flake source). - Sqlite + Postgres test comments: update the stale "fans out per-key get()" description to match the restored inner.getBulk + constructor-enforced kb_id-in-PK approach. - PR description updated via API to align with the implemented strategy. https://claude.ai/code/session_01NDC5HAMu9gusbWHkozA1Tg --------- Co-authored-by: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Design spec for adding getBulk(keys) to IKvStorage / ITabularStorage,
reclaiming the name from the deprecated offset/limit method on tabular
(renamed to getOffsetPage). Vector inherits via ITabularStorage.