fix(server): batch upsert pipeline to prevent OOM on large datasets (v2)#1009

Closed
diecoscai wants to merge 8 commits into staging from fix/oom-pipeline-v2

Conversation

@diecoscai

Summary

Why

  • Large document runs can exceed memory limits when chunk save/upsert execute unbounded across the full dataset.
  • The previous implementation also had edge-case correctness issues around telemetry null handling and per-batch result accounting.

Verification

  • pnpm --filter flowise-components build
  • pnpm --filter flowise build
  • npx prettier --check packages/components/nodes/documentloaders/AAIDomains/AAIDomains.ts packages/server/src/services/documentstore/index.ts

@diecoscai diecoscai self-assigned this Mar 12, 2026
@vercel

vercel Bot commented Mar 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project        Deployment  Actions  Updated (UTC)
answerai-docs  Building    Preview  Mar 12, 2026 6:40pm
the-answerai   Building    Preview  Mar 12, 2026 6:40pm


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c48057327e

pageContent: chunk.pageContent,
metadata: JSON.parse(chunk.metadata)
const UPSERT_BATCH_SIZE = 500
const isFullCleanup = recordManagerObj && data.recordManagerConfig && JSON.parse(data.recordManagerConfig)?.cleanup === 'full'

[P1] Parse recordManagerConfig without assuming JSON string

In _insertIntoVectorStoreWorkerThread, isFullCleanup does JSON.parse(data.recordManagerConfig), but this codepath receives recordManagerConfig as an object (for example from saveVectorStoreConfig and the upsertDocStore internal call). When a record-manager-enabled upsert runs, JSON.parse on an object throws before any upsert happens, so these requests fail outright instead of indexing documents. Read cleanup from the object directly (or parse conditionally only when the value is a string).
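
The conditional parse suggested above can be sketched as a small helper. This is only an illustration: `readRecordManagerConfig` and `isFullCleanup` are hypothetical names, not functions from the PR, and treating malformed JSON as "no config" is an assumption rather than the project's documented behavior.

```typescript
// Hypothetical helper (not from the PR): accepts a config that may arrive
// either as a JSON string or as an already-parsed object.
type RecordManagerConfig = { cleanup?: string } & Record<string, unknown>

function readRecordManagerConfig(raw: unknown): RecordManagerConfig | undefined {
    if (raw == null) return undefined
    if (typeof raw === 'string') {
        try {
            const parsed: unknown = JSON.parse(raw)
            return typeof parsed === 'object' && parsed !== null ? (parsed as RecordManagerConfig) : undefined
        } catch {
            // Assumption: a malformed string is treated as "no config" instead of failing the job.
            return undefined
        }
    }
    return typeof raw === 'object' ? (raw as RecordManagerConfig) : undefined
}

// The config is only inspected when a record manager is actually present.
function isFullCleanup(hasRecordManager: boolean, raw: unknown): boolean {
    if (!hasRecordManager) return false
    return readRecordManagerConfig(raw)?.cleanup === 'full'
}
```

With this shape, `JSON.parse` is never called on an object, and a string is parsed only when a record manager is configured.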

where: filterOptions,
skip,
take: UPSERT_BATCH_SIZE,
order: { chunkNo: 'ASC' }

[P1] Add deterministic ordering for batched chunk pagination

The new batched upsert query uses skip/take with order: { chunkNo: 'ASC' }, but chunkNo is not unique across a store when multiple docs are indexed together. In that common storeId-only case, offset pagination over a non-unique sort key can return overlapping or missing rows between batches, which silently drops or duplicates chunks in the vector store. Add a unique tie-breaker (for example id) or use keyset pagination to make batching stable.
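
As a rough illustration of the keyset alternative mentioned above, the sketch below pages through an in-memory array that stands in for the chunk table. `ChunkRow`, `fetchPage`, and `readAllInBatches` are hypothetical names; a real implementation would express the "strictly after the cursor" condition in the database query instead of filtering in memory.

```typescript
// In-memory stand-in for the chunk table; only the fields used for ordering are modeled.
interface ChunkRow {
    id: string
    chunkNo: number
}

type Cursor = { chunkNo: number; id: string } | undefined

// Total order: chunkNo first, then id as the unique tie-breaker.
function compareRows(a: { chunkNo: number; id: string }, b: { chunkNo: number; id: string }): number {
    if (a.chunkNo !== b.chunkNo) return a.chunkNo - b.chunkNo
    return a.id < b.id ? -1 : a.id > b.id ? 1 : 0
}

// Stand-in for a query: rows strictly after the cursor, in order, limited to `take`.
function fetchPage(table: ChunkRow[], cursor: Cursor, take: number): ChunkRow[] {
    return table
        .filter((row) => cursor === undefined || compareRows(row, cursor) > 0)
        .sort(compareRows)
        .slice(0, take)
}

// Reads every row exactly once, even when many rows share the same chunkNo.
function readAllInBatches(table: ChunkRow[], take: number): ChunkRow[][] {
    const pages: ChunkRow[][] = []
    let cursor: Cursor = undefined
    while (true) {
        const page = fetchPage(table, cursor, take)
        if (page.length === 0) return pages
        pages.push(page)
        const last = page[page.length - 1]
        cursor = { chunkNo: last.chunkNo, id: last.id }
    }
}
```

Because the cursor carries both `chunkNo` and `id`, two rows with the same `chunkNo` can never land on the wrong side of a page boundary.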

@claude

claude Bot commented Mar 12, 2026

PR Review: fix(server): batch upsert pipeline to prevent OOM on large datasets (v2)

Summary: This PR re-applies the OOM fix reverted by #1008, batching both the chunk-save step (_saveChunksToStorage) and the vector-store upsert step (_insertIntoVectorStoreWorkerThread). It also moves _createVectorStoreObject outside the upsert loop (a regression fix from #1006) and guards the telemetry call against an undefined indexResult. The changes are targeted and well-scoped.


Critical Issues

None. No security, multi-tenancy, or authentication regressions were introduced. All existing organizationId/userId fields are preserved on DocumentStoreFileChunk rows and the filterOptions query object remains unchanged.


Major Concerns

1. isFullCleanup path still loads all chunks into memory
packages/server/src/services/documentstore/index.ts lines 1584-1597

When recordManagerConfig.cleanup === 'full', the code does an unbounded .find({ where: filterOptions }) followed by a single giant vectorStoreMethods.upsert() call. This is the same memory pattern that caused the original OOM. If a store has tens of thousands of chunks and a record manager is configured with cleanup: full, this path will still OOM.

Consider whether full cleanup semantics can be achieved with a two-phase approach (batch upsert then a separate delete pass), or at minimum document in a comment why batching is not safe here.

2. addedDocs is replaced per batch, not accumulated
packages/server/src/services/documentstore/index.ts lines 1626-1631

indexResult.totalKeys = batchResult.totalKeys ?? indexResult.totalKeys
indexResult.addedDocs = batchResult.addedDocs ?? []   // last batch only

numAdded, numDeleted, numUpdated, and numSkipped are correctly summed across batches, but addedDocs is overwritten with each batch's value. The comment in the PR description says history should reflect "full-run totals," but any consumer reading indexResult.addedDocs will only see the last batch. If addedDocs is not used downstream this is harmless, but it is worth either accumulating it or explicitly truncating/omitting it.
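
One way to accumulate rather than overwrite is sketched below. The field names (`numAdded`, `numDeleted`, `numUpdated`, `numSkipped`, `totalKeys`, `addedDocs`) come from the review text; the `BatchResult` and `RunTotals` types and the `mergeBatchResult` function are hypothetical, and taking the latest `totalKeys` simply mirrors the quoted snippet.

```typescript
// Shape of one batch's upsert result; every field is optional because a
// vector store may omit some of them (an assumption for this sketch).
interface BatchResult {
    numAdded?: number
    numDeleted?: number
    numUpdated?: number
    numSkipped?: number
    totalKeys?: number
    addedDocs?: unknown[]
}

interface RunTotals {
    numAdded: number
    numDeleted: number
    numUpdated: number
    numSkipped: number
    totalKeys: number
    addedDocs: unknown[]
}

function emptyTotals(): RunTotals {
    return { numAdded: 0, numDeleted: 0, numUpdated: 0, numSkipped: 0, totalKeys: 0, addedDocs: [] }
}

// Folds one batch into the running totals without mutating either argument.
function mergeBatchResult(totals: RunTotals, batch: BatchResult | undefined): RunTotals {
    if (!batch) return totals // a batch that returned nothing leaves the totals unchanged
    return {
        numAdded: totals.numAdded + (batch.numAdded ?? 0),
        numDeleted: totals.numDeleted + (batch.numDeleted ?? 0),
        numUpdated: totals.numUpdated + (batch.numUpdated ?? 0),
        numSkipped: totals.numSkipped + (batch.numSkipped ?? 0),
        totalKeys: batch.totalKeys ?? totals.totalKeys, // latest value wins, as in the quoted code
        addedDocs: [...totals.addedDocs, ...(batch.addedDocs ?? [])] // concatenated, not overwritten
    }
}
```

Note that concatenating `addedDocs` across every batch grows with the dataset, so truncating or omitting it may be preferable if memory is the main concern.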


Minor Issues and Suggestions

3. recordManagerConfig JSON parsing happens even when recordManagerObj is undefined
packages/server/src/services/documentstore/index.ts lines 1579-1581

const recordManagerConfig =
    typeof data.recordManagerConfig === 'string' ? JSON.parse(data.recordManagerConfig) : data.recordManagerConfig
const isFullCleanup = recordManagerObj && recordManagerConfig?.cleanup === 'full'

JSON.parse is called unconditionally. If data.recordManagerConfig is a malformed JSON string and no record manager is configured, this will throw inside _insertIntoVectorStoreWorkerThread rather than being silently ignored. Since isFullCleanup short-circuits on recordManagerObj being falsy, the parse could be guarded with the same condition:

const isFullCleanup =
    recordManagerObj &&
    (() => {
        const cfg = typeof data.recordManagerConfig === 'string'
            ? JSON.parse(data.recordManagerConfig)
            : data.recordManagerConfig
        return cfg?.cleanup === 'full'
    })()

Or simply move the parse inside the if (data.recordManagerName && data.recordManagerConfig) block that already exists above it.

4. Per-chunk .save() inside the batch still makes N individual INSERT calls
packages/server/src/services/documentstore/index.ts lines 1243-1266

Each chunk in a batch is saved with a separate appDataSource.getRepository(...).save(dChunk). While Promise.all runs them concurrently, this is still N round-trips to the database per batch (up to 500 concurrent INSERTs). A bulk insert via repository.insert(batch) or a TypeORM save(array) call would be meaningfully faster and reduce connection pressure. This is a performance suggestion rather than a correctness issue.

5. In-place mutation of data array in AAIDomains.ts
packages/components/nodes/documentloaders/AAIDomains/AAIDomains.ts lines 490-494

for (const domain of data as any[]) {
    domain.tags = domain.domain_tags?.map((dt: any) => dt.tags).filter(Boolean) || []
}

The previous code mapped to a new array (domainsWithTags). The new code mutates the raw API response objects in-place before passing them to filterByTags. This is functionally equivalent here because the objects are not reused, but it is worth noting as a stylistic divergence from immutable data-flow patterns used elsewhere in the method.

6. Telemetry flowGraph field receives undefined after batching
packages/server/src/services/documentstore/index.ts lines 1647-1658

flowGraph: omit(indexResult['result'], ['totalKeys', 'addedDocs'])

indexResult['result'] is only set when saving upsert history (line 1639, where result['result'] = JSON.stringify(...)). On the raw indexResult object returned from vectorStoreMethods.upsert, the result key may not exist. This means omit(undefined, [...]) is passed to telemetry. This was a pre-existing issue and not introduced by this PR, but it is now more visible since indexResult can be populated via the batch loop.


Positive Observations

  • Hoisting _createVectorStoreObject outside the for loop is the correct fix. Creating a new vector store connection on every batch iteration would be expensive and was a clear regression.
  • The guard if (!batchResult) continue cleanly handles vector stores that return nothing for a given batch without corrupting the accumulator.
  • The order: { chunkNo: 'ASC', id: 'ASC' } on the paginated fetch is a good practice - it ensures deterministic ordering across batch pages and avoids missed or duplicate rows if the underlying table is modified mid-run.
  • Wrapping telemetry in if (indexResult) correctly prevents a runtime crash when all upsert batches return nothing.
  • The metadata application refactor in AAIDomains.ts is simpler and easier to read than the original dual-branch docs.map() pattern. Logic is identical and the reduction in lines of code is genuine.
  • organizationId and userId are correctly propagated to every DocumentStoreFileChunk row in the batch loop, maintaining multi-tenancy invariants.

Next Steps

  • Address the isFullCleanup OOM risk (item 1) or document explicitly why it is acceptable.
  • Decide whether addedDocs accumulation is needed (item 2) or whether it should be dropped from the history result.
  • Items 3-6 are lower priority and can be deferred to a follow-up.

Overall this is a sound fix. The batch loop logic is correct and the regression fixes are well-targeted.


Reviewed by Claude (claude-sonnet-4-6)

Full-cleanup branch: replace unbounded find() with keyset-paginated while loop
accumulating all chunks into a single docs[] array, then calling upsert() once.
Preserves the single indexStartDt boundary required by record manager full-cleanup
semantics - splitting upsert() calls causes later calls to delete earlier keys.

Non-full path: replace count()+skip offset pagination with MoreThan(lastId) keyset
cursor on UUID PK, eliminating race condition where concurrent writes shift pages.
@claude

claude Bot commented Mar 12, 2026

PR Review posted below - see full review in next comment

@claude

claude Bot commented Mar 12, 2026

Deleting placeholder comment - full review follows

@claude

claude Bot commented Mar 12, 2026

PR Review: fix(server): batch upsert pipeline to prevent OOM on large datasets (v2)

Summary: This PR re-applies the OOM fix from #1006 (reverted in #1008). It adds keyset-based pagination for DocumentStoreFileChunk reads, batches the chunk-save step, and fixes two regressions: _createVectorStoreObject called inside the per-batch loop, and null-dereference on telemetry when indexResult is undefined. Two files changed: AAIDomains.ts and documentstore/index.ts.


Critical Issues

1. filterOptions is missing organizationId (multi-tenancy gap)
packages/server/src/services/documentstore/index.ts lines 1572-1577

filterOptions only contains storeId (and optionally docId). organizationId is absent, so every paginated find() reads chunks across ALL organizations for that store. DocumentStoreFileChunk has an indexed organizationId column and data.organizationId is available from entity.organizationId earlier in the same function. Per project conventions ALL DB queries must filter by organizationId. This must be fixed before merge.

2. UUID keyset pagination may skip rows under concurrent writes
Both while-true loops in _insertIntoVectorStoreWorkerThread

The cursor uses order id ASC with MoreThan(lastId) on a UUID v4 PK. UUID v4 is randomly generated, so concurrent inserts with UUIDs sorting before lastId are silently skipped. A composite (createdDate, id) cursor would be safer.


Major Concerns

3. addedDocs not accumulated across batches in non-full-cleanup path

numAdded/numDeleted/numUpdated/numSkipped are summed correctly but addedDocs is overwritten per batch. History records only show the last batch. Correctness bug if addedDocs is used for audit or display.

4. JSON.parse(chunk.metadata) unguarded in both pagination loops

Malformed metadata on any single chunk throws and fails the entire job. A per-chunk try/catch with fallback to empty object would be more resilient.

5. Full-cleanup path accumulates ALL docs in memory with no comment

The isFullCleanup branch correctly calls upsert() once (splitting breaks full-cleanup semantics since later calls delete keys from earlier ones) but there is NO comment explaining this. A future developer could split it and reintroduce the deletion bug. Please add an explanatory comment.


Suggestions

  • AAIDomains.ts tag transform mutates data in place vs original .map() approach. Minor style concern.
  • SAVE_BATCH_SIZE and UPSERT_BATCH_SIZE are inline constants. Consider hoisting to module-level with a note on rationale.
  • No tests added. Given this is the 2nd attempt after a regression caused a revert, tests for batching and result accumulation are strongly recommended.
  • Pre-existing typo: chatlowId (missing f) in telemetry is carried forward unchanged.


Positive Observations

  • Hoisting _createVectorStoreObject outside the per-batch loop correctly fixes the primary regression from #1006.
  • Guarding telemetry with if (indexResult) prevents the null-dereference crash.
  • AAIDomains.ts metadata refactor replaces two near-identical map() branches with one for-of loop cleanly.
  • userId and organizationId correctly set on every DocumentStoreFileChunk in _saveChunksToStorage.
  • Well-formed commit messages and PR description accurately explains the changes and tradeoffs.


Recommendation: Request Changes

Issue 1 (missing organizationId in filterOptions) is a multi-tenancy gap that must be resolved before merge. Issues 3 and 4 are correctness bugs to fix. Issue 5 needs a clarifying comment.

@claude

claude Bot commented Mar 12, 2026

PR Review: fix(server): batch upsert pipeline to prevent OOM on large datasets (v2)

Reviewed commit: 61a95f7 (latest as of review)

Summary: This PR re-applies the OOM fix reverted by #1008. It adds keyset-paginated reads of DocumentStoreFileChunk for both the full-cleanup and non-full-cleanup upsert paths, batches the chunk-save step in _saveChunksToStorage, hoists _createVectorStoreObject outside the per-batch loop, and guards telemetry against an undefined indexResult. The final commit also adds organizationId to filterOptions, which was flagged in a prior review pass and is now resolved.


Critical Issues

None remaining. The previously flagged multi-tenancy gap (organizationId absent from filterOptions) was resolved in the final commit. Multi-tenancy fields are correctly propagated on all DocumentStoreFileChunk rows in _saveChunksToStorage and both paginated read loops now scope queries to entity.organizationId.


Major Concerns

1. Full-cleanup path accumulates ALL chunks in memory before calling upsert
packages/server/src/services/documentstore/index.ts lines 1592-1620

The isFullCleanup branch paginates DB reads correctly but still accumulates every Document into a single docs[] array before calling vectorStoreMethods.upsert() once. For a store with hundreds of thousands of chunks this array will be the same size as the original unbounded .find() was. The comment explaining why a single upsert() call is required (record manager deletes keys with timestamp earlier than indexStartDt) is excellent and should stay. But the memory problem is only half-solved: the reads are paged, the vector store call is not.

Consider documenting a known limitation or opening a follow-up ticket to explore whether the vector store's record manager can accept a pre-set indexStartDt so that batched upsert calls can share a single timestamp boundary.
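
To make the timestamp-boundary argument concrete, here is a deliberately simplified in-memory model. `FakeRecordManager` and its optional `indexStartDt` parameter are hypothetical: the sketch only shows why per-call boundaries delete earlier batches and how a shared boundary would avoid that. It does not claim that the real record manager exposes such a parameter.

```typescript
// Simplified model of full-cleanup semantics: after writing a batch, every key
// whose timestamp is older than the call's start boundary is deleted.
class FakeRecordManager {
    private clock = 0
    readonly keys = new Map<string, number>()

    // Monotonic logical clock standing in for a database timestamp.
    now(): number {
        this.clock += 1
        return this.clock
    }

    // indexStartDt is a hypothetical parameter: when omitted, each call takes its own boundary.
    indexWithFullCleanup(batch: string[], indexStartDt?: number): void {
        const boundary = indexStartDt ?? this.now()
        batch.forEach((key) => this.keys.set(key, this.now()))
        this.keys.forEach((updatedAt, key) => {
            if (updatedAt < boundary) this.keys.delete(key)
        })
    }
}

// Two calls with independent boundaries: the second call removes the first batch.
const splitRun = new FakeRecordManager()
splitRun.indexWithFullCleanup(['a', 'b'])
splitRun.indexWithFullCleanup(['c', 'd'])

// Two calls sharing one boundary: stale keys go away, both batches survive.
const sharedRun = new FakeRecordManager()
sharedRun.keys.set('stale', sharedRun.now())
const sharedBoundary = sharedRun.now()
sharedRun.indexWithFullCleanup(['a', 'b'], sharedBoundary)
sharedRun.indexWithFullCleanup(['c', 'd'], sharedBoundary)
```

In this model `splitRun` ends up holding only `c` and `d`, while `sharedRun` holds all four new keys and drops `stale`. That is the behavior a batched full-cleanup path would need to preserve.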

2. recordManagerConfig is parsed unconditionally regardless of whether a record manager is configured
packages/server/src/services/documentstore/index.ts lines 1582-1584

const recordManagerConfig =
    typeof data.recordManagerConfig === 'string' ? JSON.parse(data.recordManagerConfig) : data.recordManagerConfig
const isFullCleanup = recordManagerObj && recordManagerConfig?.cleanup === 'full'

JSON.parse is called even when recordManagerObj is undefined (i.e., no record manager is configured). If data.recordManagerConfig contains malformed JSON, this throws inside the worker thread, failing the entire upsert job. Since isFullCleanup short-circuits on !recordManagerObj, the parse is only needed when a record manager is present. Moving the parse inside the existing if (data.recordManagerName && data.recordManagerConfig) block at line 1564 would make this safe.


Minor Issues and Suggestions

3. Per-chunk individual .save() calls inside the batch loop
packages/server/src/services/documentstore/index.ts lines 1246-1269

Each of the 500 chunks in a batch is saved with its own repository.save(dChunk) call. Promise.all runs them concurrently, but this is still up to 500 parallel round-trips per batch, which creates significant connection pool pressure. A single repository.save(batchArray) or repository.insert(batchArray) call would be meaningfully more efficient. This is a performance suggestion rather than a correctness issue.

4. addedDocs accumulation in the non-full-cleanup path
packages/server/src/services/documentstore/index.ts lines 1648-1656

The first-batch case initializes indexResult with addedDocs: batchResult.addedDocs ?? [], and subsequent batches correctly spread both arrays with [...indexResult.addedDocs, ...batchResult.addedDocs]. This is correct and the PR description's claim that "full-run totals" are accumulated is accurate. No change needed — this is a positive callout.

5. Telemetry flowGraph receives undefined
packages/server/src/services/documentstore/index.ts line 1683

flowGraph: omit(indexResult['result'], ['totalKeys', 'addedDocs'])

indexResult['result'] is only assigned at line 1668 during the UpsertHistory save block (result['result'] = JSON.stringify(...)). The raw indexResult returned by vectorStoreMethods.upsert() does not carry a result key, so omit(undefined, [...]) is passed to telemetry as flowGraph. This was a pre-existing bug before this PR, but it is now more consistently reached since indexResult is more reliably populated. Worth a follow-up fix.

6. No tests added
This PR is the second attempt after a regression caused a full revert (#1008). The batch accumulation logic, the full-cleanup single-upsert invariant, and the keyset pagination cursor are all non-trivial. Unit tests covering at least the result accumulation logic and the full-cleanup/non-full-cleanup code paths would significantly reduce the risk of future regressions.


Positive Observations

  • Hoisting _createVectorStoreObject outside the per-batch loop is the correct fix. Creating a new vector store connection on every batch would be expensive and was the primary regression introduced in #1006.
  • The explanatory comment at lines 1587-1591 (why full-cleanup must call upsert() exactly once) is excellent. This kind of rationale comment prevents future developers from "simplifying" the code and reintroducing the deletion bug.
  • Guarding telemetry with if (indexResult) correctly prevents a runtime crash when all batches return nothing.
  • organizationId is now present in filterOptions, correctly scoping all paginated reads to the current organization.
  • userId and organizationId are correctly set on every DocumentStoreFileChunk row in _saveChunksToStorage, preserving multi-tenancy invariants.
  • The AAIDomains.ts metadata refactor replaces two near-identical docs.map() branches with a single for...of loop. Logic is identical and the reduction in lines of code is a genuine readability improvement.
  • Module-level constants SAVE_BATCH_SIZE and UPSERT_BATCH_SIZE are well-placed and easy to tune.
  • The in-place mutation comment added in the final commit ("non-mutating") accurately describes the spread operator used in the tag transform.
  • Commit messages follow conventional commit format and the PR description accurately describes the tradeoffs.

Recommendation

Approve with suggestions. The two critical issues from prior review passes are resolved. The remaining items are either pre-existing issues or lower-priority improvements that can be deferred. The only actionable blocker I would flag for discussion before merge is item 1 (full-cleanup path still builds an in-memory docs[] of unbounded size) — whether that is acceptable depends on the maximum expected chunk count for stores using full-cleanup record managers.

Items 2 and 5 are the highest priority follow-ups. Item 6 (tests) is strongly recommended given the regression history.


Reviewed by Claude (claude-sonnet-4-6)

- Move delete-before-save to delete-after-save in both _saveChunksToStorage
  and syncAndRefreshChunks: old chunks are preserved until all new chunks
  are confirmed persisted
- Add per-batch error handling so partial failures set status STALE and
  preserve completed batches rather than throwing all-or-nothing
- Replace sequential one-by-one saves in syncAndRefreshChunks with
  SAVE_BATCH_SIZE-batched parallel saves matching the existing pattern
- Derive totalChunks/totalChars from actual persisted counts, not expected
  values from the chunk response
- Add durability contract comments to both save paths documenting the
  SYNCING -> SYNC/STALE state machine and safe-delete-timing mechanism
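
The delete-after-save ordering described in this commit message can be sketched roughly as follows. Everything here is hypothetical (`InMemoryChunkTable`, `replaceChunks`, the `failOnBatch` test hook), and the code is kept synchronous so that it stays self-contained; the real repository calls are asynchronous.

```typescript
type SyncStatus = 'SYNC' | 'STALE'

interface Chunk {
    id: string
    text: string
}

interface SaveOutcome {
    status: SyncStatus
    persistedChunks: number
    persistedChars: number
}

// In-memory stand-in for the chunk repository.
class InMemoryChunkTable {
    readonly rows = new Map<string, Chunk>()
    failOnBatch = -1 // test hook: zero-based index of the batch that should fail
    private batchesSeen = 0

    saveBatch(batch: Chunk[]): void {
        const index = this.batchesSeen
        this.batchesSeen += 1
        if (index === this.failOnBatch) throw new Error('simulated write failure')
        batch.forEach((chunk) => this.rows.set(chunk.id, chunk))
    }

    deleteByIds(ids: string[]): void {
        ids.forEach((id) => this.rows.delete(id))
    }
}

// Saves new chunks in batches and deletes the old ones only after every batch succeeded.
function replaceChunks(table: InMemoryChunkTable, oldIds: string[], next: Chunk[], batchSize: number): SaveOutcome {
    if (batchSize < 1) throw new RangeError('batchSize must be at least 1')
    let persistedChunks = 0
    let persistedChars = 0
    for (let i = 0; i < next.length; i += batchSize) {
        const batch = next.slice(i, i + batchSize)
        try {
            table.saveBatch(batch)
        } catch {
            // Partial failure: keep completed batches and the old chunks, report STALE.
            return { status: 'STALE', persistedChunks, persistedChars }
        }
        // Totals are derived from what was actually persisted, not from the expected input.
        persistedChunks += batch.length
        persistedChars += batch.reduce((sum, chunk) => sum + chunk.text.length, 0)
    }
    table.deleteByIds(oldIds)
    return { status: 'SYNC', persistedChunks, persistedChars }
}
```

Deleting by the ids captured before the save is an assumption of this sketch; it keeps the delete from touching the chunks that were just written.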
@claude

claude Bot commented Mar 13, 2026

PR Review: fix(server): batch upsert pipeline to prevent OOM on large datasets (v2)

Summary

This PR re-applies the OOM fix from #1006 that was reverted in #1008. It batches both the chunk-save step (_saveChunksToStorage / syncAndRefreshChunks) and the vector-store upsert step (_insertIntoVectorStoreWorkerThread), moves _createVectorStoreObject outside the upsert loop (a regression fix), guards the telemetry call against an undefined indexResult, and adds a new Jest test suite that codifies the durability contract.

The overall direction is correct and the changes are well-scoped. Below are the findings.


Critical Issues

None. No security regressions, no multi-tenancy gaps, and no authentication bypasses were introduced. All DocumentStoreFileChunk rows retain userId and organizationId fields, and the existing filterOptions query object continues to filter by storeId and (now also) organizationId — which is actually a positive security addition (see Positive Observations).


Major Concerns

1. isFullCleanup path still accumulates all documents in memory before upsert

Location: packages/server/src/services/documentstore/index.ts, isFullCleanup block

The DB reads are now paginated (good), but all pages are accumulated into a single docs[] array before calling upsert() once. For a store with tens of thousands of chunks this is the same memory shape as before. The comment explains why a single upsert call is required for full cleanup semantics (record-manager timestamp logic), which is correct. However, the fix deserves a note in the PR description or an inline TODO:

  • Is the record-manager full cleanup mode expected to be used with very large stores? If yes, this is still an OOM risk and the constraint should be documented in a LIMITATIONS comment or the record-manager feature should be gated on store size.
  • If the constraint is accepted, consider adding a log line warning users when they enter this path with a large chunk count, so production incidents are easier to diagnose.

2. delete in _saveChunksToStorage uses only docId — no organizationId or userId guard

Location: packages/server/src/services/documentstore/index.ts — step 8 delete call

// step 8: delete old chunks only after all new chunks are confirmed saved
await appDataSource.getRepository(DocumentStoreFileChunk).delete({ docId: newLoaderId })

The corresponding delete in syncAndRefreshChunks correctly scopes by { docId, userId, organizationId }. The _saveChunksToStorage delete only uses docId. Since docId / newLoaderId is a UUID and the upstream authorization checks are in place, this is not an immediate data-leak vector. But it is inconsistent with the multi-tenancy pattern used elsewhere and could become a risk if docId reuse is ever introduced. Suggest adding userId and organizationId to this delete filter to match the pattern in syncAndRefreshChunks:

await appDataSource.getRepository(DocumentStoreFileChunk).delete({
    docId: newLoaderId,
    userId: data.userId,
    organizationId: data.organizationId
})

Minor Issues and Suggestions

3. recordManagerConfig is parsed unconditionally before the guard check

Location: packages/server/src/services/documentstore/index.ts lines ~1579-1581

const recordManagerConfig =
    typeof data.recordManagerConfig === 'string' ? JSON.parse(data.recordManagerConfig) : data.recordManagerConfig
const isFullCleanup = recordManagerObj && recordManagerConfig?.cleanup === 'full'

JSON.parse runs even when recordManagerObj is undefined (the common case). If data.recordManagerConfig is a malformed JSON string this will throw inside the worker thread unexpectedly. The parse can be deferred to inside the guard:

const isFullCleanup = (() => {
    if (!recordManagerObj) return false
    const cfg = typeof data.recordManagerConfig === 'string'
        ? JSON.parse(data.recordManagerConfig)
        : data.recordManagerConfig
    return cfg?.cleanup === 'full'
})()

4. Per-chunk .save() inside batches still makes N individual INSERT calls

Location: packages/server/src/services/documentstore/index.ts — batch map + Promise.all in both syncAndRefreshChunks and _saveChunksToStorage

Each chunk is saved in a separate repository.save(dChunk) call. Promise.all runs them concurrently within a batch, but that is still up to 500 simultaneous DB round-trips. TypeORM supports bulk saves:

const docChunks = batch.map((chunk, localIndex) => {
    return repository.create({ ...fields, chunkNo: i + localIndex + 1 })
})
await repository.save(docChunks)  // single multi-row INSERT

This would reduce connection pressure significantly for large syncs and is the more idiomatic TypeORM pattern.

5. IIFE for JSON.parse in document mapping adds noise with no error recovery

Location: packages/server/src/services/documentstore/index.ts — both chunk-to-Document mappings

metadata: (() => { try { return chunk.metadata ? JSON.parse(chunk.metadata) : {} } catch { return {} } })()

Returning {} on a JSON parse failure silently discards metadata for that chunk. If metadata is malformed it would be more useful to log a warning (the logger is already imported). Consider:

metadata: (() => {
    try { return chunk.metadata ? JSON.parse(chunk.metadata) : {} }
    catch { logger.warn(`[documentstore] Failed to parse metadata for chunk ${chunk.id}`); return {} }
})()

6. Test file declares mockChunkCreate and buildMockAppDataSource assigns it to create, but mockChunkCreate is never independently asserted

Location: packages/server/test/services/documentstore/index.test.ts

The create mock in the repo is wired to mockChunkCreate, but no test currently asserts on it. This is fine for now, but if someone later adds a test that asserts mockChunkCreate was not called when it should have been, it can cause confusion because the function is used for the repo .create() call on every chunk. Worth a brief comment explaining the wiring.

7. Typo fix in telemetry — chatlowId corrected to chatflowId

Location: packages/server/src/services/documentstore/index.ts line ~1655

This is a good catch and worth noting explicitly. The original field name chatlowId was a long-standing typo. Any telemetry dashboards or downstream consumers that relied on the misspelled key will need to be updated.


Positive Observations

  • The durability contract comments (DURABILITY CONTRACT) are excellent. They make the intent of the safe-delete ordering immediately clear to future maintainers.
  • Adding organizationId: entity.organizationId to filterOptions in _insertIntoVectorStoreWorkerThread is a meaningful multi-tenancy hardening that was previously missing. This is a good addition.
  • Hoisting _createVectorStoreObject outside the batch loop in the non-full-cleanup path is the correct fix — constructing the vector store object is expensive and should not be repeated per batch.
  • The accumulation logic for numAdded, numDeleted, numUpdated, numSkipped, and addedDocs across batches is correct. addedDocs is properly spread-concatenated (not overwritten), which is the right behavior.
  • Per-page document conversion in AAIDomains.ts (createDocumentFromDomain called inside the pagination loop rather than at the end) reduces peak memory by converting as data is loaded rather than holding raw API response objects until all pages are fetched.
  • The new test suite clearly codifies the durability contract and the RED-comment pattern makes the intent of each test case self-documenting.

Checklist Assessment

  • No new routes added — enforceAbility check not applicable to this diff
  • Multi-tenancy: organizationId preserved on all chunk writes; added to vector store query filter
  • Multi-tenancy: _saveChunksToStorage delete should also include userId/organizationId (see Major Concern 2)
  • Error handling: InternalFlowiseError used in existing patterns; batch errors are caught and result in STALE status
  • tags: ['AAI'] not applicable (no new Flowise component nodes added)
  • Test coverage: durability contract unit tests added for _saveChunksToStorage
  • isFullCleanup OOM risk should be documented or mitigated (see Major Concern 1)

Overall assessment: Request changes on the two major concerns before merging. The multi-tenancy gap in the delete call (#2) is a correctness issue worth fixing now. The full-cleanup memory concern (#1) needs at minimum a documented decision. The minor items can be addressed as follow-up.

Reviewed by Claude Sonnet 4.6

@diecoscai
Author

Superseded by #1013 (consolidated branch from #1009 with only safe isolated carry-overs from #1012).

@diecoscai diecoscai closed this Mar 13, 2026