fix(search): BM25 unicode + vector index live-write — improved (closes #295, supersedes #296) #327
Conversation
…regex Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
…ldIndex Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
…, rebuildIndex vector test Adds migrateVectorIndex utility for re-embedding when embedding provider dimensions change. Adds comprehensive test coverage for Cyrillic tokenization, vector index population on remember/rebuildIndex, and migration edge cases. Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
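For context on the Unicode tokenization fix referenced in the commits above: a BM25 tokenizer that handles Cyrillic typically swaps ASCII `\w` matching for Unicode property escapes. A minimal sketch, not the PR's actual regex:

```typescript
// Tokenize on runs of Unicode letters/numbers instead of ASCII \w,
// so Cyrillic (and other non-Latin) text produces real tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
}

// Mixed Cyrillic/Latin input tokenizes cleanly; punctuation is dropped.
const tokens = tokenize("Поиск по векторному индексу: BM25 + vectors");
```

With an ASCII `\w+` pattern (`[A-Za-z0-9_]`), the Cyrillic words above would yield zero tokens and never reach the index.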
…arded

Pre-merge hardening on top of @nik1t7n's PR #296. The original PR had vectorIndex.add() inlined at four sites (remember, observe, compress, rebuildIndex) with the same try/catch shape repeated each time. That open-codes three sharp edges per site:

1. Missing dimension guard at the write boundary. Per #248, persistence load already refuses to start when persisted vectors mismatch the active provider's dimensions — but the live-write path had no symmetric check. A misconfigured embedding provider returning a different-length Float32Array would silently corrupt the in-memory index until the next restart cleared it.
2. No input clipping. Memory.content can be arbitrarily large (the MemorySlot path even allows 20KB of content per slot). Most embedding providers cap input around 8k tokens and return 400 on overflow. A single oversize memory could throw at the provider and lose the vector entry; worse, on local-Ollama-style providers it can cost real inference time.
3. Inconsistent error logging. rebuildIndex had `catch {}` swallowing the error entirely; the other three sites logged via logger.warn with slightly different shapes.

Consolidates into a single exported `vectorIndexAddGuarded()` in src/functions/search.ts:
- Reads vectorIndex + currentEmbeddingProvider from module-level state (the same singletons the PR added).
- Clips embed input to 16k chars (EMBED_MAX_CHARS) — well under every provider's documented limit, with headroom for chars-per-token variance across languages.
- Validates embedding.length === provider.dimensions before calling vi.add(); logs a structured warning and skips on mismatch.
- Emits a single logger.warn on embed failure with provider name + error message + indexed-item kind ("memory" | "observation" | "synthetic") + id.
- Returns boolean (true on success) so callers can branch if they want, but every existing call site fire-and-forgets — the contract is "soft-fail: index entry skipped, upstream save still succeeds".
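The contract described above, sketched with stand-in types. Only the names the description itself uses (vectorIndexAddGuarded, EMBED_MAX_CHARS, the kind union, the boolean soft-fail return) come from the PR; the rest is illustrative, not the actual src/functions/search.ts code:

```typescript
const EMBED_MAX_CHARS = 16_000;

interface EmbeddingProvider {
  name: string;
  dimensions: number;
  embed(text: string): Promise<Float32Array>;
}

interface VectorIndex {
  add(id: string, sessionId: string, vector: Float32Array, meta: unknown): void;
}

// Module-level singletons, mirroring the state the PR describes.
let vectorIndex: VectorIndex | null = null;
let currentEmbeddingProvider: EmbeddingProvider | null = null;

type IndexedKind = "memory" | "observation" | "synthetic";

// Soft-fail contract: true on success; on any failure the index entry
// is skipped and the upstream save still succeeds.
async function vectorIndexAddGuarded(
  id: string,
  sessionId: string,
  text: string,
  meta: { kind: IndexedKind; logId: string },
): Promise<boolean> {
  const vi = vectorIndex;
  const provider = currentEmbeddingProvider;
  if (!vi || !provider) return false; // vector search not configured

  try {
    // Clip embed input well under provider token limits.
    const embedding = await provider.embed(text.slice(0, EMBED_MAX_CHARS));
    // Dimension guard at the write boundary, symmetric to the load-time check.
    if (embedding.length !== provider.dimensions) {
      console.warn("embedding dimension mismatch; skipping index entry", {
        id, kind: meta.kind, got: embedding.length, want: provider.dimensions,
      });
      return false;
    }
    vi.add(id, sessionId, embedding, meta);
    return true;
  } catch (err) {
    console.warn("embed failed; skipping index entry", {
      provider: provider.name, id, kind: meta.kind,
      error: err instanceof Error ? err.message : String(err),
    });
    return false;
  }
}
```

Callers can fire-and-forget the returned promise, which is why the helper must never throw.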
Rewrites the four call sites (remember.ts, observe.ts synthetic branch, compress.ts post-LLM, search.ts rebuildIndex memories + obs loops) to call vectorIndexAddGuarded() instead of the inline try/catch. Side effect: cleans up the lingering `catch {}` in rebuildIndex that was silently swallowing rebuild-time embed failures — every failure now surfaces in logs with a consistent shape.

Tests:
- test/multimodal.test.ts mocked `../src/functions/search.js` and exposed only getSearchIndex; vectorIndexAddGuarded had to be added to the mock surface, or the test fails with "No vectorIndexAddGuarded export is defined on the search.js mock". Added the export to the mock.
- All other tests (search-index Cyrillic, search rebuildIndex vector, vector-index-populate, migrate-vector-index) are green on the helper refactor — no behavioural change beyond the three guards above.

886 / 886 tests pass. Build clean.

Out of scope (deferred to follow-ups):
- Blocking await on every save. The embed call is on the request hot path. For local Ollama / on-device providers this is fine; for network providers it adds 200ms-2s of latency per remember/observe. It should be async / fire-and-forget with a write-behind queue — a separate perf PR after this lands.
- Periodic persistence flush of the vector index. Today the index is serialized at process exit (per IndexPersistence). Runtime adds between flushes survive in memory but die on crash. Acceptable for this PR; a future PR can add a debounced flush.
- Batching in rebuildIndex. Today the rebuild path embeds one doc at a time. embedBatch() exists on every provider and would be ~5-10x faster for cold restarts on large corpora. Defer.
📝 Walkthrough

Adds optional vector-search embedding/indexing alongside BM25: module-level vector index and embedder state, guarded vector index writes, startup wiring, rebuild/migration utilities, Unicode-aware tokenization, and tests for population/migration/tokenization.
Actionable comments posted: 7
🧹 Nitpick comments (3)
src/functions/remember.ts (1)
115-129: 💤 Low value

Consider audit tracking and parallelization opportunities. Two optional improvements:

1. Audit trail: The coding guidelines state that state-changing operations should call `recordAudit()` for audit trail tracking. The `mem::forget` function (line 208) records audits, but `mem::remember` does not track the creation of new memories. If audit trails are desired for memory creation, consider adding a `recordAudit()` call after line 100.
2. Parallelization: The vector indexing (lines 115-120) and cascade trigger (lines 123-129) are independent operations that could run in parallel using `Promise.allSettled()` to reduce latency.

As per coding guidelines: "Use parallel operations where possible with Promise.all for independent kv writes/reads" and "Record state-changing operations with `recordAudit()` for audit trail tracking".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `src/functions/remember.ts` around lines 115-129: add an audit record for the new memory in mem::remember by calling recordAudit(...) right after the memory creation/DB write (use the memory.id and context info) to track the creation, and change the sequential calls to vectorIndexAddGuarded(...) and the sdk.trigger({ function_id: "mem::cascade-update", ... }) invocation so they run in parallel using Promise.allSettled([...]) (keep both operations as independent promises so one failing does not fail the other); ensure any errors from Promise.allSettled are logged/handled.

test/vector-index-dimensions.test.ts (1)
78-102: ⚡ Quick win

Strengthen test assertions for migration behavior. The test has two opportunities for improvement:

- Lines 99-100 use `toBeGreaterThan(0)` when exactly 1 observation is created, so `toBe(1)` would be more precise.
- The test doesn't verify that the migrated vectors actually have the new 4D dimensions from `newProvider`. Consider adding `expect(result.vectorSize).toBe(1)` and validating that the resulting index contains 4D vectors via `oldIndex.validateDimensions(4)`.

♻️ More specific assertions

      const result = await migrateVectorIndex(kv as never, oldIndex, newProvider);
      expect(result.success).toBe(true);
    - expect(result.totalProcessed).toBeGreaterThan(0);
    - expect(result.vectorSize).toBeGreaterThan(0);
    + expect(result.totalProcessed).toBe(1);
    + expect(result.vectorSize).toBe(1);
      expect(result.failed).toBe(0);
    + // Verify new dimensions were applied
    + const validation = oldIndex.validateDimensions(4);
    + expect(validation.mismatches).toEqual([]);
      });

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `test/vector-index-dimensions.test.ts` around lines 78-102: replace the loose "greater than 0" assertions with precise checks that reflect the single observation and 4D embedding: assert totalProcessed is 1, assert vectorSize is 1, and after migration verify the index dimensions via the VectorIndex validation helper (e.g., oldIndex.validateDimensions(4)) to confirm embeddings are 4-dimensional.

test/vector-index-populate.test.ts (1)
60-80: 💤 Low value

Consider using beforeEach/afterEach for test isolation. The test manually cleans up global state (lines 78-79) after each test. While this works, `beforeEach`/`afterEach` hooks would provide more robust test isolation and prevent state leakage if a test fails before reaching cleanup.

♻️ Example with lifecycle hooks

      describe("vector index population on remember", () => {
        const mockEmbedder: EmbeddingProvider = {
          name: "test",
          dimensions: 3,
          embed: async (_text: string) => new Float32Array([0.1, 0.2, 0.3]),
          embedBatch: async (_texts: string[]) =>
            _texts.map(() => new Float32Array([0.1, 0.2, 0.3])),
        };
    +
    +   afterEach(() => {
    +     setVectorIndex(null);
    +     setEmbeddingProvider(null);
    +   });

        it("calls vectorIndex.add() when remember saves a memory", async () => {
          // ... test code ...
          expect(vi.size).toBe(1);
    -
    -     // Cleanup
    -     setVectorIndex(null);
    -     setEmbeddingProvider(null);
        });

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `test/vector-index-populate.test.ts` around lines 60-80: move the setVectorIndex/setEmbeddingProvider lifecycle into beforeEach (create and set a fresh VectorIndex and the mock embedder) and afterEach (reset both to null), remove the manual cleanup lines from the test body, and keep the rest of the test (mockSdk, mockKV, registerRememberFunction, and expectations) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/functions/compress.ts`:
- Around line 176-183: The BM25 and vector indexing calls after persisting the
compressed observation should be wrapped in defensive try/catch blocks so
indexing failures don't abort the compression flow; specifically, surround
getSearchIndex().add(compressed) in a try/catch (log the error and continue) and
likewise wrap the vectorIndexAddGuarded(...) call in its own try/catch (log the
error and continue), using the same pattern as the BM25 handling in remember.ts;
include identifying info like compressed.id, compressed.sessionId and
compressed.title in the log messages to aid debugging.
In `@src/functions/migrate-vector-index.ts`:
- Around line 26-29: The filter and mapping for embeddings should ensure memory
content exists: update the filter that defines textMems to require m.content
(e.g., (m) => m.isLatest !== false && m.title && m.content && m.content.trim()
!== "") or alternatively keep the broader filter but change the texts mapping to
use a safe fallback like (m.title + " " + (m.content ?? "")). Refer to the
textMems variable and the texts mapping to implement the check/fallback so you
never concatenate "undefined" into the embedding input.
- Line 74: The return object always sets success: true regardless of failures;
update the return in migrate-vector-index (the line returning { success: true,
totalProcessed: processed, failed, vectorSize: newIndex.size }) to compute
success based on the failed counter (e.g., success: failed === 0) so callers can
rely on the boolean instead of inspecting failed separately; ensure you only
change the success expression and keep totalProcessed: processed, failed, and
vectorSize: newIndex.size unchanged.
In `@src/functions/observe.ts`:
- Around line 241-246: The call to vectorIndexAddGuarded(synthetic.id,
synthetic.sessionId, synthetic.title + " " + (synthetic.narrative || ""), {
kind: "synthetic", logId: synthetic.id }) should be wrapped in a try/catch so
embedding/indexing errors don't abort the synthetic compression flow; catch any
error and emit a warning log (similar to patterns in remember.ts and
compress.ts) indicating the synthetic id and error, but do not rethrow—ensure
the function continues since the observation was already persisted and
BM25-indexed.
In `@src/functions/remember.ts`:
- Around line 115-120: The vector indexing call using vectorIndexAddGuarded with
arguments (memory.id, memory.sessionIds[0] ?? "memory", memory.title + " " +
memory.content, { kind: "memory", logId: memory.id }) should be wrapped in a
try/catch so indexing failures don't abort the overall mem::remember flow; catch
any error and call processLogger.warn (or similar existing logger) with a
concise message including memory.id and the error, mirroring the BM25 try/catch
pattern used earlier for soft-failing index operations.
In `@src/functions/search.ts`:
- Around line 105-110: rebuildIndex currently clears BM25 but does not reset the
vector store, so stale embeddings remain; before you start adding with
vectorIndexAddGuarded in rebuildIndex, call the vector-index reset/clear API
(e.g., vectorIndexReset or vectorIndexClear — whichever exists in our codebase)
for the target index (use the same identifier you pass to vectorIndexAddGuarded,
e.g., memory.id or session id) to wipe existing vectors, then repopulate; apply
the same reset-before-add change to the other spot that calls
vectorIndexAddGuarded (the second block that appends vectors) so the vector
store and BM25 stay in sync.
In `@test/vector-index-dimensions.test.ts`:
- Line 70: Replace the no-op mockKV.delete with a real deletion that removes the
key from the same in-memory store used by the mock (so it mirrors the
implementation in vector-index-populate.test.ts); implement delete as an async
function matching the signature (delete: async (_scope: string, _key: string):
Promise<void>) that looks up the underlying store and removes the entry for the
provided key (ensuring awaits/Promise resolution if the store API is async) so
tests relying on migrateVectorIndex or other mutation behavior observe actual
deletions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f12aec1-1f4b-4288-b708-39146f804f4e
📒 Files selected for processing (12)
src/functions/compress.ts, src/functions/migrate-vector-index.ts, src/functions/observe.ts, src/functions/remember.ts, src/functions/search.ts, src/index.ts, src/state/search-index.ts, test/multimodal.test.ts, test/search-index.test.ts, test/search.test.ts, test/vector-index-dimensions.test.ts, test/vector-index-populate.test.ts
    getSearchIndex().add(compressed);

    await vectorIndexAddGuarded(
      compressed.id,
      compressed.sessionId,
      compressed.title + " " + (compressed.narrative || ""),
      { kind: "observation", logId: compressed.id },
    );
Add error handling for both BM25 and vector indexing.
Both the BM25 indexing (line 176) and vector indexing (lines 178-183) lack try/catch blocks. If either throws, the compression operation will fail even though the compressed observation was already persisted to KV at lines 170-174.
In remember.ts, the BM25 indexing is wrapped in try/catch (lines 107-114). Apply the same defensive pattern here to ensure indexing failures don't break the compression flow.
🛡️ Proposed fix
compressed,
);
- getSearchIndex().add(compressed);
-
- await vectorIndexAddGuarded(
- compressed.id,
- compressed.sessionId,
- compressed.title + " " + (compressed.narrative || ""),
- { kind: "observation", logId: compressed.id },
- );
+ try {
+ getSearchIndex().add(compressed);
+ } catch (err) {
+ logger.warn("Failed to index compressed observation into BM25", {
+ obsId: compressed.id,
+ error: err instanceof Error ? err.message : String(err),
+ });
+ }
+
+ try {
+ await vectorIndexAddGuarded(
+ compressed.id,
+ compressed.sessionId,
+ compressed.title + " " + (compressed.narrative || ""),
+ { kind: "observation", logId: compressed.id },
+ );
+ } catch (err) {
+ logger.warn("Failed to index compressed observation into vector index", {
+ obsId: compressed.id,
+ error: err instanceof Error ? err.message : String(err),
+ });
+ }
  const streamResults = await Promise.allSettled([

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    try {
      getSearchIndex().add(compressed);
    } catch (err) {
      logger.warn("Failed to index compressed observation into BM25", {
        obsId: compressed.id,
        error: err instanceof Error ? err.message : String(err),
      });
    }

    try {
      await vectorIndexAddGuarded(
        compressed.id,
        compressed.sessionId,
        compressed.title + " " + (compressed.narrative || ""),
        { kind: "observation", logId: compressed.id },
      );
    } catch (err) {
      logger.warn("Failed to index compressed observation into vector index", {
        obsId: compressed.id,
        error: err instanceof Error ? err.message : String(err),
      });
    }
    await vectorIndexAddGuarded(
      synthetic.id,
      synthetic.sessionId,
      synthetic.title + " " + (synthetic.narrative || ""),
      { kind: "synthetic", logId: synthetic.id },
    );
Wrap vectorIndexAddGuarded in try/catch to prevent synthetic compression failures.
The vector indexing call lacks error handling. If vectorIndexAddGuarded throws, the entire synthetic compression flow will fail, even though the observation was already persisted to KV at line 235 and indexed in BM25 at line 240.
This is the same pattern as in remember.ts and compress.ts. Wrap the call in try/catch with a warning to ensure embedding failures don't break observation capture.
🛡️ Proposed fix
getSearchIndex().add(synthetic);
- await vectorIndexAddGuarded(
- synthetic.id,
- synthetic.sessionId,
- synthetic.title + " " + (synthetic.narrative || ""),
- { kind: "synthetic", logId: synthetic.id },
- );
+ try {
+ await vectorIndexAddGuarded(
+ synthetic.id,
+ synthetic.sessionId,
+ synthetic.title + " " + (synthetic.narrative || ""),
+ { kind: "synthetic", logId: synthetic.id },
+ );
+ } catch (err) {
+ logger.warn("Failed to index synthetic observation into vector index", {
+ obsId: synthetic.id,
+ error: err instanceof Error ? err.message : String(err),
+ });
+ }

📝 Committable suggestion
    getSearchIndex().add(synthetic);

    try {
      await vectorIndexAddGuarded(
        synthetic.id,
        synthetic.sessionId,
        synthetic.title + " " + (synthetic.narrative || ""),
        { kind: "synthetic", logId: synthetic.id },
      );
    } catch (err) {
      logger.warn("Failed to index synthetic observation into vector index", {
        obsId: synthetic.id,
        error: err instanceof Error ? err.message : String(err),
      });
    }
    await vectorIndexAddGuarded(
      memory.id,
      memory.sessionIds[0] ?? "memory",
      memory.title + " " + memory.content,
      { kind: "memory", logId: memory.id },
    );
Wrap vectorIndexAddGuarded in try/catch for consistency with BM25 indexing.
The BM25 indexing at lines 107-114 uses try/catch to ensure indexing failures don't block the memory save. However, vectorIndexAddGuarded is awaited directly without error handling. If it throws, the entire mem::remember operation will fail even though the memory was already persisted to KV at line 100.
Per the PR summary: "embed failures soft-fail (index skip, upstream save still succeeds)". Wrap the vector indexing in try/catch with a warning, matching the BM25 pattern.
🛡️ Proposed fix
- await vectorIndexAddGuarded(
- memory.id,
- memory.sessionIds[0] ?? "memory",
- memory.title + " " + memory.content,
- { kind: "memory", logId: memory.id },
- );
+ try {
+ await vectorIndexAddGuarded(
+ memory.id,
+ memory.sessionIds[0] ?? "memory",
+ memory.title + " " + memory.content,
+ { kind: "memory", logId: memory.id },
+ );
+ } catch (err) {
+ logger.warn("Failed to index saved memory into vector index", {
+ memId: memory.id,
+ error: err instanceof Error ? err.message : String(err),
+ });
+ }

📝 Committable suggestion
    try {
      await vectorIndexAddGuarded(
        memory.id,
        memory.sessionIds[0] ?? "memory",
        memory.title + " " + memory.content,
        { kind: "memory", logId: memory.id },
      );
    } catch (err) {
      logger.warn("Failed to index saved memory into vector index", {
        memId: memory.id,
        error: err instanceof Error ? err.message : String(err),
      });
    }
        store.get(scope)!.set(key, data);
        return data;
      },
      delete: async (_scope: string, _key: string): Promise<void> => {},
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Align mockKV.delete with vector-index-populate.test.ts implementation.
The delete operation is a no-op here, while test/vector-index-populate.test.ts lines 26-27 implements actual deletion. For consistency and to avoid masking bugs if migrateVectorIndex later relies on deletion, use the same implementation across test files.
♻️ Align delete implementation
- delete: async (_scope: string, _key: string): Promise<void> => {},
+ delete: async (scope: string, key: string): Promise<void> => {
+ store.get(scope)?.delete(key);
+ },

📝 Committable suggestion
    delete: async (scope: string, key: string): Promise<void> => {
      store.get(scope)?.delete(key);
    },
Five reviewer findings applied, five skipped with reason.

APPLIED:

1. migrate-vector-index.ts filter now requires m.content (and rejects empty/whitespace). Without this, undefined m.content would concat to "title undefined" and poison the embed input. Memory.content is declared string, but #277-era data shapes had cases where it could be missing — defensive filter.
2. migrate-vector-index.ts success expression now derives from the failed counter (success: failed === 0). Previously hardcoded true, so callers couldn't rely on the boolean alone.
3. search.ts rebuildIndex now calls vectorIndex?.clear() before the repopulation loops, symmetric to the BM25 idx.clear() that was already there. Without this, memories deleted between runs leave orphan embeddings in the vector store after every rebuild.
4. test/vector-index-dimensions.test.ts assertions tightened from toBeGreaterThan(0) to toBe(1) for both totalProcessed and vectorSize. The seeded fixture has exactly one observation; the loose assertion would hide regressions where the migration produced extra/duplicate vectors.
5. test/vector-index-populate.test.ts switched to beforeEach/afterEach for the setVectorIndex / setEmbeddingProvider lifecycle. Manual cleanup in each test body was easy to drift on additions.

PLUS:

6. compress.ts BM25 add now wrapped in try/catch matching the remember.ts pattern. Helps cross-file symmetry even though SearchIndex.add doesn't throw on valid input.

SKIPPED (with reason):

- Wrapping vectorIndexAddGuarded calls in additional try/catch at remember.ts / observe.ts / compress.ts: the helper has an internal try/catch (src/functions/search.ts:67-89). It NEVER throws, so an outer try/catch is dead code. The reviewer missed the helper's contract.
- vector-index-dimensions.test.ts mockKV.delete -> real deletion: the tests in that file don't exercise deletion semantics, and migrateVectorIndex itself never calls kv.delete. Touching it would add noise without affecting any assertion.
- recordAudit + Promise.allSettled in remember.ts: out of scope. recordAudit is already called earlier in the function, and parallelizing cascade-update with vector-add changes failure semantics (cascade-update is currently fire-and-forget) — separate perf PR if wanted.
- validateDimensions(N) assertion in vector-index-dimensions.test.ts: vectorSize is the COUNT of stored vectors, not their dimensions. Adding a validateDimensions check on the result would require exposing the new index. Out of scope for assertion tightening.

886 / 886 tests pass. Build clean.
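The clear-before-repopulate ordering from applied item 3, as a minimal sketch. The index interface is a stand-in; only the symmetry between the vector clear and the existing BM25 clear comes from the thread:

```typescript
interface ClearableIndex {
  clear(): void;
  add(id: string): void;
  size: number;
}

// Rebuild: clear BOTH indexes first, then repopulate, so documents
// deleted since the last rebuild leave no orphan entries behind.
function rebuildIndexes(
  bm25: ClearableIndex,
  vector: ClearableIndex | null,
  ids: string[],
): void {
  bm25.clear();
  vector?.clear(); // symmetric to the BM25 clear (applied fix #3)
  for (const id of ids) {
    bm25.add(id);
    vector?.add(id);
  }
}
```

Skipping the vector clear would leave the two indexes permanently out of sync after any deletion.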
Actionable comments posted: 3
🧹 Nitpick comments (2)
src/functions/search.ts (1)
111-116: 🏗️ Heavy lift — Consider batching embedding calls during rebuild for better performance.

Currently, `rebuildIndex` calls `vectorIndexAddGuarded` once per memory and observation, resulting in sequential embed requests. Since `EmbeddingProvider` supports `embedBatch` (as seen in `migrate-vector-index.ts`), you could collect items, batch embed them, validate dimensions per-item, and then add successful embeddings to the index. This would significantly speed up rebuilds for larger datasets while maintaining the same soft-fail guarantees. As per coding guidelines, "Use parallel operations where possible with Promise.all for independent kv writes/reads" — the same principle applies to independent embedding operations.
Also applies to: 151-156
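The batching this comment suggests could look roughly like the sketch below. It is hypothetical: the item shape, `embedBatch` signature, and batch size are assumptions for illustration, not the real module API.

```typescript
// Hypothetical sketch of a batched rebuild: embed in batches, then apply
// the same per-item dimension guard before adding to the index.
interface IndexItem {
  id: string;
  sessionId: string;
  text: string;
}

async function rebuildBatched(
  items: IndexItem[],
  embedBatch: (texts: string[]) => Promise<Float32Array[]>,
  dimensions: number,
  add: (id: string, sessionId: string, embedding: Float32Array) => void,
  batchSize = 64,
): Promise<number> {
  let added = 0;
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    try {
      const embeddings = await embedBatch(batch.map((b) => b.text));
      batch.forEach((item, j) => {
        const embedding = embeddings[j];
        // per-item dimension guard, mirroring vectorIndexAddGuarded
        if (embedding !== undefined && embedding.length === dimensions) {
          add(item.id, item.sessionId, embedding);
          added++;
        }
      });
    } catch {
      // soft-fail: skip this batch, keep rebuilding the rest
    }
  }
  return added;
}
```

A wrong-dimension embedding is skipped silently here; the real implementation would log a structured warning, per the helper's contract.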
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/search.ts` around lines 111 - 116, rebuildIndex currently calls vectorIndexAddGuarded for each memory/observation sequentially which is slow; refactor rebuildIndex to collect items (id, sessionId, text, metadata) into batches, call EmbeddingProvider.embedBatch (see migrate-vector-index.ts for pattern), validate each returned embedding's dimensionality individually, and then call vectorIndexAddGuarded for only the successful embeddings (you can add those to the index in parallel with Promise.all); ensure you preserve the existing soft-fail behavior by skipping/logging failed embeddings and keep per-item metadata (e.g., { kind: "memory", logId: memory.id }) when adding to the index.

src/functions/migrate-vector-index.ts (1)
32-39: ⚡ Quick win — Validate embedding dimensions before adding to index.

Unlike `vectorIndexAddGuarded` (which validates `embedding.length === provider.dimensions` before writing), `migrateVectorIndex` calls `newIndex.add()` directly without dimension checks. If `newProvider.embedBatch` returns embeddings with unexpected dimensions (due to provider misconfiguration or bugs), the index will be silently corrupted. Add per-item dimension validation consistent with the defensive pattern in `vectorIndexAddGuarded`.

🛡️ Proposed validation for both phases

Memory phase:
```diff
 const embeddings = await newProvider.embedBatch(texts);
 for (let i = 0; i < textMems.length; i++) {
+  if (embeddings[i].length !== newProvider.dimensions) {
+    logger.warn("migrateVectorIndex: dimension mismatch for memory", {
+      memoryId: textMems[i].id,
+      expected: newProvider.dimensions,
+      received: embeddings[i].length,
+    });
+    continue;
+  }
   newIndex.add(
     textMems[i].id,
     textMems[i].sessionIds[0] ?? "memory",
     embeddings[i],
   );
   processed++;
 }
```

Observation phase (apply same pattern):
```diff
 const embeddings = await newProvider.embedBatch(texts);
 for (let i = 0; i < textObs.length; i++) {
+  if (embeddings[i].length !== newProvider.dimensions) {
+    logger.warn("migrateVectorIndex: dimension mismatch for observation", {
+      obsId: textObs[i].id,
+      expected: newProvider.dimensions,
+      received: embeddings[i].length,
+    });
+    continue;
+  }
   newIndex.add(textObs[i].id, textObs[i].sessionId, embeddings[i]);
   processed++;
 }
```

Also applies to: 60-64
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/migrate-vector-index.ts` around lines 32 - 39, The migrateVectorIndex function is adding embeddings directly via newIndex.add after calling newProvider.embedBatch, which risks corrupting the index if embedding lengths are wrong; mimic the defensive check used in vectorIndexAddGuarded by validating each embedding's length against newProvider.dimensions before calling newIndex.add (for both the memory phase and the observation phase around the other add calls), and on mismatch log/error and skip or handle that item consistently with vectorIndexAddGuarded semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/functions/migrate-vector-index.ts`:
- Around line 49-72: The current migrateVectorIndex re-embedding loop wraps all
sessions in one try/catch so a single kv.list or embed failure aborts remaining
sessions; change it to process sessions per-session (e.g., map sessions to async
tasks and use Promise.allSettled or iterate with a try/catch inside the for
loop) so each session's observation listing (KV.observations(session.id)),
embedding (newProvider.embedBatch), and index adds (newIndex.add) are handled
and failures are recorded per session; on each settled task increment processed
for successful observation adds and increment failed and collect the session id
for failed tasks, and log per-session errors rather than aborting the whole
migration.
- Around line 14-18: The parameter oldIndex on migrateVectorIndex (signature:
migrateVectorIndex(kv: StateKV, oldIndex: VectorIndex, newProvider:
EmbeddingProvider): Promise<MigrateVectorIndexResult>) is unused and confuses
the API; either remove oldIndex from the function signature and all callers (and
update any type annotations) or, if intended, document and actually use it
inside migrateVectorIndex to transform the provided VectorIndex. Ensure you
update all call sites to match the new signature and adjust imports/types
(StateKV, VectorIndex, EmbeddingProvider, MigrateVectorIndexResult) accordingly
to keep type-checking consistent.
In `@test/vector-index-populate.test.ts`:
- Line 60: The test declares a variable named vi which shadows the vitest import
(vi) and prevents use of vi.fn()/vi.spyOn(); rename the VectorIndex variable
(e.g., change the declaration "let vi: VectorIndex" to "let vectorIndex:
VectorIndex") and update every reference in this test file (setup, beforeEach,
assertions, calls to methods like populate or close) from vi -> vectorIndex so
the vitest vi import remains available for mocking and spying.
---
Nitpick comments:
In `@src/functions/migrate-vector-index.ts`:
- Around line 32-39: The migrateVectorIndex function is adding embeddings
directly via newIndex.add after calling newProvider.embedBatch, which risks
corrupting the index if embedding lengths are wrong; mimic the defensive check
used in vectorIndexAddGuarded by validating each embedding's length against
newProvider.dimensions before calling newIndex.add (for both the memory phase
and the observation phase around the other add calls), and on mismatch log/error
and skip or handle that item consistently with vectorIndexAddGuarded semantics.
In `@src/functions/search.ts`:
- Around line 111-116: rebuildIndex currently calls vectorIndexAddGuarded for
each memory/observation sequentially which is slow; refactor rebuildIndex to
collect items (id, sessionId, text, metadata) into batches, call
EmbeddingProvider.embedBatch (see migrate-vector-index.ts for pattern), validate
each returned embedding's dimensionality individually, and then call
vectorIndexAddGuarded for only the successful embeddings (you can add those to
the index in parallel with Promise.all); ensure you preserve the existing
soft-fail behavior by skipping/ logging failed embeddings and keep per-item
metadata (e.g., { kind: "memory", logId: memory.id }) when adding to the index.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8508e1bd-e6c1-4d86-b22a-76f95bfe68b5
📒 Files selected for processing (5)
- src/functions/compress.ts
- src/functions/migrate-vector-index.ts
- src/functions/search.ts
- test/vector-index-dimensions.test.ts
- test/vector-index-populate.test.ts
…x param

Four reviewer findings applied. One skipped with reason:

APPLIED:

1. Per-session isolation in migrateVectorIndex observations phase. The previous shape wrapped the entire sessions for-loop in one try/catch — a single bad session (kv.list throws, embedBatch rejects, etc.) aborted every subsequent session and silently truncated the migration. The loop now has per-session try/catch; failures increment `failed`, append the session id to a new `failedSessions` array on the result, and the loop continues. The kv.list<sessions> call itself is now guarded separately — if THAT fails the whole migration aborts (no sessions to iterate), but with `failed` recorded and `failedSessions` empty so the caller can distinguish.

2. Drop `oldIndex` parameter from migrateVectorIndex signature. It was dead — no read path inside the function. Three test call sites updated to match the new signature migrateVectorIndex(kv, newProvider). Confused the API contract; if we ever DO need a previous-index argument it can come back with a real use case.

3. Dim guard on every embedding result inside migrateVectorIndex, in both the memories phase and the per-session observations phase. Extracted into a local isValidEmbedding() helper that mirrors the shape of search.ts::vectorIndexAddGuarded's guard: log structured warn + skip on mismatch. Without this, a misconfigured provider returning the wrong-length Float32Array would corrupt the rebuilt index, defeating the #248 persistence-load guard.

4. Rename `vi` -> `vectorIndex` in test/vector-index-populate.test.ts so the let binding doesn't shadow the vitest `vi` import. Practical impact: future tests in that file can use `vi.fn()` / `vi.spyOn()` without renaming this binding first. Cosmetic but cheap.

SKIPPED (with reason):

- Use embedBatch in rebuildIndex (search.ts): perf optimization, not correctness. Already called out as deferred in the previous review's commit message. Cold-restart speed matters but the fix here is a bigger refactor (collect-into-batches with metadata, validate each, add only successful) and worth its own PR. Will pick up when we also do the write-behind queue for live-write batching.

Result envelope: Added `failedSessions: string[]` to MigrateVectorIndexResult so the caller can decide whether to retry per-session or just abort. Existing fields (success, totalProcessed, failed, vectorSize) unchanged in shape.

886 / 886 tests pass. Build clean.
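The per-session isolation and the result envelope described above can be sketched roughly as follows. This is an illustrative stand-in, assuming a `migrateOne` callback that does the real per-session work (list + embedBatch + index.add); the actual signatures in migrate-vector-index.ts differ.

```typescript
// Minimal sketch of per-session failure isolation: one bad session is
// recorded and skipped instead of aborting the whole migration.
interface SessionMigrateResult {
  success: boolean;
  totalProcessed: number;
  failed: number;
  failedSessions: string[];
}

async function migratePerSession(
  sessionIds: string[],
  migrateOne: (sessionId: string) => Promise<number>, // observations re-embedded
): Promise<SessionMigrateResult> {
  let totalProcessed = 0;
  let failed = 0;
  const failedSessions: string[] = [];
  for (const sessionId of sessionIds) {
    try {
      totalProcessed += await migrateOne(sessionId);
    } catch {
      // record the failure and continue with the remaining sessions
      failed++;
      failedSessions.push(sessionId);
    }
  }
  return { success: failed === 0, totalProcessed, failed, failedSessions };
}
```

With this shape, `success` derives from the `failed` counter rather than being hardcoded, matching the earlier review fix.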
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/functions/migrate-vector-index.ts (1)
89-98: 💤 Low value — Consider distinguishing list-failure from per-session failure in the result.

When `kv.list(KV.sessions)` rejects, the early return sets `failed = 1` and leaves `failedSessions` empty, which is technically consistent (no specific session was known to fail) but makes the partial result indistinguishable from a single mismatched embedding earlier in the memories phase. A dedicated flag or non-empty diagnostic in `failedSessions` (e.g., `"<list-error>"`) would make recovery decisions in the caller more deterministic. Optional polish — current behavior is safe.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/migrate-vector-index.ts` around lines 89 - 98, The kv.list(KV.sessions) failure currently returns failed = 1 with an empty failedSessions array, making it indistinguishable from a per-session failure; update the migrateVectorIndex error path (around sessions = await kv.list<{ id: string }>(KV.sessions)) to record a distinct diagnostic before returning — for example, push a marker string like "<list-error>" into failedSessions (and keep failed++), or alternatively add a boolean flag (e.g., listFailed: true) to the returned object; ensure you update the return to include this diagnostic so callers can tell apart a listing failure from individual session failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/functions/migrate-vector-index.ts`:
- Around line 76-81: The catch block inside migrateVectorIndex currently
increments failed by 1 even when operations like kv.list or
newProvider.embedBatch abort an entire batch; change the error handling in that
catch to add the actual number of affected memories (e.g., use textMems.length
when textMems is in scope or compute the batch size from the variable holding
the listed/processed items) to failed instead of 1, and ensure any per-session
bookkeeping that relies on success/failed counts (the same counters used
elsewhere in migrateVectorIndex) is updated consistently so callers see the
correct number of missed embeddings.
---
Nitpick comments:
In `@src/functions/migrate-vector-index.ts`:
- Around line 89-98: The kv.list(KV.sessions) failure currently returns failed =
1 with an empty failedSessions array, making it indistinguishable from a
per-session failure; update the migrateVectorIndex error path (around sessions =
await kv.list<{ id: string }>(KV.sessions)) to record a distinct diagnostic
before returning — for example, push a marker string like "<list-error>" into
failedSessions (and keep failed++), or alternatively add a boolean flag (e.g.,
listFailed: true) to the returned object; ensure you update the return to
include this diagnostic so callers can tell apart a listing failure from
individual session failures.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 74110c0e-b61b-4d05-88f6-e94d97570ae3
📒 Files selected for processing (3)
- src/functions/migrate-vector-index.ts
- test/vector-index-dimensions.test.ts
- test/vector-index-populate.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- test/vector-index-dimensions.test.ts
- test/vector-index-populate.test.ts
…nguish list-fail

Two new reviewer findings on migrate-vector-index. Six earlier findings re-flagged in the same review block had already been addressed in prior commits on this branch (verified inline):

- compress.ts BM25 try/catch (commit d2098a5)
- migrate filter requires m.content (commit d2098a5)
- migrate success: failed === 0 (commit d2098a5)
- rebuildIndex vector clear (commit d2098a5)
- observe.ts / remember.ts outer try/catch — SKIPPED with reason (vectorIndexAddGuarded has internal try/catch, NEVER throws)
- mockKV.delete no-op — SKIPPED with reason (no test exercises it)

NEW FINDINGS APPLIED:

1. Catch block in memories phase counted +1 for the entire batch regardless of batch size. If kv.list returned 250 memories and embedBatch threw, `failed` reflected 1 missed embedding when reality is 250. textMems is now declared outside the try so the catch can read its length. If kv.list itself threw before textMems was populated (length 0), we fall back to +1 (something failed but the batch size is unknown). If embedBatch threw, `failed += textMems.length` so the counter is honest.

2. kv.list<sessions> failure path returned `failed: 1, failedSessions: []`. Indistinguishable from "0 sessions, all OK" from the caller's perspective — they can't tell whether the enumeration itself blew up or simply found no sessions. The path now pushes a marker string "<sessions-list-failed>" into failedSessions before returning. Marker keeps the existing schema (failedSessions: string[]) — no boolean flag added — and gives callers a sentinel they can grep for. Per-session entries continue to use real session ids, so the marker is unambiguous.

886 / 886 tests pass. Build clean.
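The batch-honest failure accounting in finding 1 above can be sketched like this. The function and variable names (`reembedMemoriesPhase`, `listMemories`) are illustrative assumptions, not the real module API; only the counting shape mirrors the commit.

```typescript
// Sketch of batch-sized failure counting: the batch variable is declared
// outside the try so the catch can read its length when embedBatch throws.
const SESSIONS_LIST_FAILED = "<sessions-list-failed>"; // sentinel for the list-failure path

async function reembedMemoriesPhase(
  listMemories: () => Promise<{ id: string; text: string }[]>,
  embedBatch: (texts: string[]) => Promise<Float32Array[]>,
): Promise<{ processed: number; failed: number }> {
  let textMems: { id: string; text: string }[] = [];
  let processed = 0;
  let failed = 0;
  try {
    textMems = await listMemories();
    const embeddings = await embedBatch(textMems.map((m) => m.text));
    processed = embeddings.length;
  } catch {
    // embedBatch failing loses every item in the batch, not just one;
    // if kv.list itself threw, the batch size is unknown -> fall back to 1
    failed += textMems.length > 0 ? textMems.length : 1;
  }
  return { processed, failed };
}
```

The `SESSIONS_LIST_FAILED` sentinel keeps the `failedSessions: string[]` schema intact while letting callers distinguish an enumeration failure from per-session failures.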
The two regression tests added in this branch for the BM25 non-ASCII
tokenizer fix used Russian-language fixtures
("Проверка памяти", "Тестируем поиск по кириллице", etc). Switching
to Greek-language equivalents ("Προβολή μνήμης",
"Δοκιμάζουμε αναζήτηση σε ελληνικά", etc).
Reasoning: the test fixtures need to exercise the same code path —
non-ASCII Unicode letters that the old ASCII-only \w regex stripped.
Greek code points (U+0370..U+03FF) hit \p{L} identically to Cyrillic
(U+0400..U+04FF), so the regression coverage is preserved. Greek is
also less likely to be misread as politically loaded content in a
public repo.
Test ids renamed from obs_cyrillic to obs_greek; test names from
"indexes and finds Cyrillic text" to "indexes and finds non-ASCII
(Greek) text" (and the mixed-script variant likewise). Branch name
stays as-is to keep PR continuity but the shipped test fixtures are
language-neutral going forward.
11/11 search-index tests pass with the swap. Full suite 886/886.
…ardening

Three landed PRs since v0.9.11:

- #327 (#295) — BM25 tokenizer now accepts non-ASCII (Greek, accented Latin, Hebrew, Arabic, Cyrillic); VectorIndex.add now actually called at runtime via vectorIndexAddGuarded helper with dim guard + input clip; migrateVectorIndex for dim migrations.
- #326 (#277) — RetentionScore type no longer declares source twice; JSDoc back-compat note no longer shadowed.
- #335 (#323) — viewer drops Google Fonts <link> (CSP-blocked); loadDashboard now surfaces load errors inline instead of sticking on "Loading dashboard…".

Bumping 0.9.11 -> 0.9.12 across the 9 standard files:

- package.json
- packages/mcp/package.json
- plugin/.claude-plugin/plugin.json
- plugin/.codex-plugin/plugin.json
- src/version.ts
- src/types.ts (ExportData.version literal)
- src/functions/export-import.ts (supportedVersions)
- test/export-import.test.ts (round-trip expectation)
- CHANGELOG.md (new 0.9.12 entry)

886 / 886 tests pass. Build clean.
…ardening (#337)
What
Picks up @nik1t7n's PR #296 (BM25 Unicode tokenizer + vector-index live-write) — rebased onto current main, hardens three sharp edges before merge, keeps the original test suite intact.
Closes #295. Supersedes #296. Full credit to @nik1t7n for the bug report + initial fix; this PR's commits build on top.
Bugs fixed (from #295)
Bug 1 — BM25 tokenizer strips non-ASCII (Cyrillic, Greek, CJK, Arabic, accented Latin)
`src/state/search-index.ts:226` used `\w` in the replace regex. JS `\w` is ASCII-only — every Unicode letter outside `[a-zA-Z0-9_]` collapsed to a space during tokenization. Non-English searches returned empty.

Fix (from #296, kept verbatim):
Preserves Unicode letters / numbers, keeps underscore (matters for code identifiers like `memory_recall`), keeps `/`, `.`, `\`, `-`. Symmetric for index AND query path — same tokenizer, no recall asymmetry.

CJK note: this fix recovers Cyrillic, Greek, accented Latin, Hebrew, Arabic — anything that uses spaces between words. CJK still tokenizes as a single sentence-long token because there are no spaces between characters. CJK needs a segmenter (bigram / dictionary) — that's a separate concern (see PR #224 attempt).
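A tokenizer of the shape described above can be sketched as follows. This is an illustration of the `\p{L}` approach; the exact regex shipped in search-index.ts may differ in detail.

```typescript
// Illustrative Unicode-aware tokenizer: keep letters (\p{L}), numbers (\p{N}),
// underscore, and / . \ - ; collapse everything else to spaces.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}_/.\\-]+/gu, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// ASCII-only \w would have stripped these letters entirely:
tokenize("Проверка памяти");      // ["проверка", "памяти"]
tokenize("Δοκιμή αναζήτησης");    // ["δοκιμή", "αναζήτησης"]
tokenize("call memory_recall()"); // ["call", "memory_recall"]
```

Note the symmetric use on both index and query paths is what prevents recall asymmetry; a CJK string like "記憶検索" still comes out as one token because nothing splits it.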
Bug 2 — `VectorIndex.add()` never called at runtime

`VectorIndex` shipped with a working `add()` method, was instantiated in `src/index.ts`, and was persisted via `IndexPersistence` — but zero callers anywhere in `src/`. The vector index was populated ONLY from disk at startup; any new memory or observation written at runtime never made it into the vector index. `memory_search` with vector/hybrid mode returned stale results forever; cold start with no persisted vectors meant the vector index stayed empty.

Fix: wire `vectorIndex.add()` (via the new guarded helper, see below) into `remember.ts`, the `observe.ts` synthetic-compression branch, `compress.ts` post-LLM, and `search.ts:rebuildIndex` (both memories + observations loops). Global `setVectorIndex` / `setEmbeddingProvider` singletons in `search.ts` so the helper can read the current state without threading through every call site.

Improvements layered on top of #296
#296 inlined `vi.add(...)` at four sites with the same try/catch shape repeated each time. Three sharp edges open-coded at every site:

- No dimension guard at the write boundary — a mis-configured provider returning a different-length `Float32Array` would silently corrupt the in-memory index until next restart.
- No input clipping — `Memory.content` can be 20KB+ (MemorySlot defaults). Most embedding APIs cap input around 8k tokens. Without clipping: 400 errors at the provider, dropped vector entries, wasted local-Ollama inference time.
- Inconsistent error logging — `rebuildIndex` used `catch {}`, swallowing every embed failure silently during cold rebuilds; the other three sites had different `logger.warn` shapes.

Consolidates into `vectorIndexAddGuarded(id, sessionId, text, { kind, logId })` in `src/functions/search.ts`:

- Clips embed input to 16k chars (`EMBED_MAX_CHARS`) — well under every provider's documented limit, leaves chars-per-token headroom for non-Latin scripts where one Unicode char is multiple tokens.
- Validates `embedding.length === provider.dimensions` before `vi.add()`; logs structured warning + skips on mismatch.
- Single `logger.warn` on embed failure (provider + error + kind + id).

All four call sites rewritten to one-line invocations.
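The helper's contract can be sketched roughly as below. One deliberate difference from the description above, to keep the example self-contained: the real helper reads `vectorIndex` and the current embedding provider from module-level state, while this sketch takes them as parameters. Interface shapes are assumptions.

```typescript
// Hedged sketch of the guarded live-write contract: clip input, validate
// dimensions, soft-fail on any error so the upstream save still succeeds.
const EMBED_MAX_CHARS = 16_000;

interface EmbeddingProvider {
  name: string;
  dimensions: number;
  embed(text: string): Promise<Float32Array>;
}
interface VectorIndexLike {
  add(id: string, sessionId: string, embedding: Float32Array): void;
}
type IndexedKind = "memory" | "observation" | "synthetic";

async function vectorIndexAddGuarded(
  vi: VectorIndexLike | undefined,
  provider: EmbeddingProvider | undefined,
  id: string,
  sessionId: string,
  text: string,
  meta: { kind: IndexedKind; logId: string },
): Promise<boolean> {
  if (!vi || !provider) return false; // vector search not configured
  try {
    // clip before embedding to stay under provider token caps
    const embedding = await provider.embed(text.slice(0, EMBED_MAX_CHARS));
    if (embedding.length !== provider.dimensions) {
      console.warn("vector add skipped: dimension mismatch", { id, ...meta });
      return false; // skip rather than corrupt the in-memory index
    }
    vi.add(id, sessionId, embedding);
    return true;
  } catch (err) {
    // soft-fail: index entry skipped, upstream save still succeeds
    console.warn("vector add failed", { provider: provider.name, id, ...meta, err });
    return false;
  }
}
```

Callers can branch on the boolean, but every call site in the PR fire-and-forgets it.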
`rebuildIndex` `catch {}` removed — embed failures now surface consistently.

Out-of-scope (deferred follow-ups)
- Batch embeddings in `rebuildIndex`. Today: one `embed()` per doc, sequential. `embedBatch()` exists on every provider, ~5-10x faster on cold restart. Defer.

Migration
Existing indexes built with the ASCII tokenizer have zero tokens for non-ASCII docs. Fix only helps NEW writes unless `mem::rebuild-index` is invoked. README troubleshooting note may follow up.

Test plan
`npm test` — 886 / 886 (3 new tests from "fix: BM25 tokenizer strips non-ASCII text; vector index never populated on write" #296, + 1 mock fix in `test/multimodal.test.ts` to expose `vectorIndexAddGuarded` on the search.js mock).

`npm run build` — tsdown clean.