Skip to content

fix(search): BM25 unicode + vector index live-write — improved (closes #295, supersedes #296)#327

Merged
rohitg00 merged 10 commits into
mainfrom
fix/bm25-cyrillic-vector-index
May 13, 2026
Merged

fix(search): BM25 unicode + vector index live-write — improved (closes #295, supersedes #296)#327
rohitg00 merged 10 commits into
mainfrom
fix/bm25-cyrillic-vector-index

Conversation

@rohitg00
Copy link
Copy Markdown
Owner

@rohitg00 rohitg00 commented May 13, 2026

What

Picks up @nik1t7n's PR #296 (BM25 Unicode tokenizer + vector-index live-write) — rebased onto current main, hardens three sharp edges before merge, keeps the original test suite intact.

Closes #295. Supersedes #296. Full credit to @nik1t7n for the bug report + initial fix; this PR's commits build on top.

Bugs fixed (from #295)

Bug 1 — BM25 tokenizer strips non-ASCII (Cyrillic, Greek, CJK, Arabic, accented Latin)

src/state/search-index.ts:226 used \w in the replace regex. JS \w is ASCII-only — every Unicode letter outside [a-zA-Z0-9_] collapsed to a space during tokenization. Non-English searches returned empty.

Fix (from #296, kept verbatim):

// before
.replace(/[^\w\s/.\\-_]/g, " ")
// after
.replace(/[^\p{L}\p{N}\s/.\\-_]/gu, " ")

Preserves Unicode letters / numbers, keeps underscore (matters for code identifiers like memory_recall), keeps /, ., \, -. Symmetric for index AND query path — same tokenizer, no recall asymmetry.

CJK note: this fix recovers Cyrillic, Greek, accented Latin, Hebrew, Arabic — anything that uses spaces between words. CJK still tokenizes as a single sentence-long token because there are no spaces between characters. CJK needs a segmenter (bigram / dictionary) — that's a separate concern (see PR #224 attempt).

Bug 2 — VectorIndex.add() never called at runtime

VectorIndex shipped with a working add() method, was instantiated in src/index.ts, and was persisted via IndexPersistence — but zero callers anywhere in src/. The vector index was populated ONLY from disk at startup; any new memory or observation written at runtime never made it into the vector index. memory_search with vector/hybrid mode returned stale results forever; cold start with no persisted vectors meant the vector index stayed empty.

Fix: wire vectorIndex.add() (via the new guarded helper, see below) into remember.ts, observe.ts synthetic-compression branch, compress.ts post-LLM, and search.ts:rebuildIndex (both memories + observations loops). Global setVectorIndex / setEmbeddingProvider singletons in search.ts so the helper can read the current state without threading through every call site.

Improvements layered on top of #296

#296 inlined vi.add(...) at four sites with the same try/catch shape repeated each time. Three sharp edges open-coded at every site:

  1. No dimension guard at the write boundary. Per fix(embedding): guard provider responses against dimension mismatches #248, persistence-load refuses mismatched dimensions — but the live-write path had no symmetric check. A mis-configured embedder returning the wrong-length Float32Array would silently corrupt the in-memory index until next restart.
  2. No input clipping. Memory.content can be 20KB+ (MemorySlot defaults). Most embedding APIs cap input around 8k tokens. Without clipping: 400 errors at the provider, dropped vector entries, wasted local-Ollama inference time.
  3. Inconsistent error logging. rebuildIndex used catch {} — swallowed every embed failure silently during cold rebuilds. Other three sites had different logger.warn shapes.

Consolidates into vectorIndexAddGuarded(id, sessionId, text, { kind, logId }) in src/functions/search.ts:

  • Clips input to 16k chars (EMBED_MAX_CHARS) — well under every provider's documented limit, leaves chars-per-token headroom for non-Latin scripts where one Unicode char is multiple tokens.
  • Validates embedding.length === provider.dimensions before vi.add(); logs structured warning + skips on mismatch.
  • Single logger.warn on embed failure (provider + error + kind + id).
  • Returns boolean for future-callers that want to branch; current call sites fire-and-forget — soft-fail contract: index entry skipped, upstream save still succeeds.

All four call sites rewritten to one-line invocations. rebuildIndex catch {} removed — embed failures now surface consistently.

Out-of-scope (deferred follow-ups)

  • Blocking await on every save. Embed call is on hot path; adds 200ms-2s per remember/observe on network providers. Should be async / fire-and-forget with write-behind queue. Separate perf PR.
  • Periodic persistence flush of the vector index. Today: serialized at process exit. Runtime adds between flushes survive in memory, die on crash. Future debounced-flush PR.
  • Batching in rebuildIndex. Today: one embed() per doc, sequential. embedBatch() exists on every provider, ~5-10x faster on cold restart. Defer.

Migration

Existing indexes built with the ASCII tokenizer have zero tokens for non-ASCII docs. Fix only helps NEW writes unless mem::rebuild-index is invoked. README troubleshooting note may follow up.

Test plan

Summary by CodeRabbit

  • New Features

    • Optional semantic/vector indexing and search added; memories, observations, and synthetic compressions are also indexed. Startup now initializes the embedding/vector layer so it's available to search.
  • Reliability

    • Vector-indexing and embedding operations are guarded and soft-fail (log warnings) so failures don't break workflows or rebuilds.
  • Quality

    • Improved Unicode-aware tokenization for better multilingual search.
  • Tests

    • New tests for vector-index population, migration, embedding dimensions, and multilingual tokenization.

Review Change Stack

nik1t7n and others added 6 commits May 13, 2026 12:02
…regex

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
…ldIndex

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
…dex in index.ts

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
…, rebuildIndex vector test

Adds migrateVectorIndex utility for re-embedding when embedding provider dimensions change. Adds comprehensive test coverage for Cyrillic tokenization, vector index population on remember/rebuildIndex, and migration edge cases.

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
…arded

Pre-merge hardening on top of @nik1t7n's PR #296.

Original PR had vectorIndex.add() inlined at four sites (remember,
observe, compress, rebuildIndex) with the same try/catch shape repeated
each time. That open-codes three sharp edges per site:

1. Missing dimension guard at the write boundary. Per #248, persistence
   load already refuses to start when persisted vectors mismatch the
   active provider's dimensions — but the live-write path had no
   symmetric check. A mis-configured embedding provider returning a
   different-length Float32Array would silently corrupt the in-memory
   index until next restart cleared it.

2. No input clipping. Memory.content can be arbitrarily large (the
   MemorySlot path even allows 20KB content per slot). Most embedding
   providers cap input around 8k tokens and 400 on overflow. A single
   oversize memory could throw at the provider and lose the vector
   entry; worse, on local-Ollama-style providers it can cost real
   inference time.

3. Inconsistent error logging (rebuildIndex had `catch {}` swallowing
   the error entirely, the other three sites logged via logger.warn
   with slightly different shape).

Consolidates into a single exported `vectorIndexAddGuarded()` in
src/functions/search.ts:

  - Reads vectorIndex + currentEmbeddingProvider from module-level
    state (same singletons the PR added).
  - Clips embed input to 16k chars (EMBED_MAX_CHARS) — well under
    every provider's documented limit, leaves headroom for
    chars-per-token variance across languages.
  - Validates embedding.length === provider.dimensions before
    calling vi.add(); logs structured warning and skips on mismatch.
  - Single logger.warn on embed failure with provider name + error
    message + indexed-item kind ("memory" | "observation" |
    "synthetic") + id.
  - Returns boolean (true on success) so callers can branch if they
    want, but every existing call site fire-and-forgets — the contract
    is "soft-fail: index entry skipped, upstream save still succeeds".

Rewrites the four call sites (remember.ts, observe.ts synthetic
branch, compress.ts post-LLM, search.ts rebuildIndex memories + obs
loops) to call vectorIndexAddGuarded() instead of the inline try/catch.

Side effect: cleans up the lingering `catch {}` in rebuildIndex that
was swallowing rebuild-time embed failures silently — every failure
now surfaces in logs at a consistent shape.

Tests:
- test/multimodal.test.ts mocked `../src/functions/search.js` and
  exposed only getSearchIndex; needed vectorIndexAddGuarded added to
  the mock surface or test fails with "No vectorIndexAddGuarded export
  is defined on the search.js mock". Added the export to the mock.
- All other tests (search-index Cyrillic, search rebuildIndex vector,
  vector-index-populate, migrate-vector-index) green on the helper
  refactor — no behavioural change beyond the three guards above.

886 / 886 tests pass. Build clean.

Out-of-scope (deferred to follow-ups):
- Blocking await on every save. The embed call is on the request hot
  path. For local Ollama / on-device providers this is fine; for
  network providers it adds 200ms-2s latency per remember/observe.
  Should be async / fire-and-forget with a write-behind queue —
  separate perf PR after this lands.
- Periodic persistence flush of the vector index. Today the index is
  serialized at process exit (per IndexPersistence). Runtime adds
  between flushes survive in memory but die on crash. Acceptable for
  this PR; future PR can add a debounced flush.
- Batching in rebuildIndex. Today the rebuild path embeds one doc at
  a time. embedBatch() exists on every provider and would be ~5-10x
  faster for cold restarts on large corpora. Defer.
@vercel
Copy link
Copy Markdown

vercel Bot commented May 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agentmemory Ready Ready Preview, Comment May 13, 2026 2:23pm

Request Review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1705141d-b8fb-4a96-a576-8a14a9dee4d1

📥 Commits

Reviewing files that changed from the base of the PR and between 07e1ea5 and b44fd99.

📒 Files selected for processing (1)
  • test/search-index.test.ts

📝 Walkthrough

Walkthrough

Adds optional vector-search embedding/indexing alongside BM25: module-level vector index and embedder state, guarded vector index writes, startup wiring, rebuild/migration utilities, Unicode-aware tokenization, and tests for population/migration/tokenization.

Changes

Vector Search Indexing Integration

Layer / File(s) Summary
Vector index state & guarded add helper
src/functions/search.ts
Adds module state for VectorIndex and EmbeddingProvider, getters/setters, clipEmbedInput, and vectorIndexAddGuarded that embeds clipped text, validates dimensions, writes to the vector index, and soft-fails on errors. rebuildIndex clearing is included here.
Startup wiring of vector index and embedder
src/index.ts
Wires setVectorIndex and setEmbeddingProvider during startup to initialize search module state.
Remember / Observe / Compress integration
src/functions/remember.ts, src/functions/observe.ts, src/functions/compress.ts
After BM25 getSearchIndex().add(...), these flows call vectorIndexAddGuarded to add combined title+content or title+narrative embeddings with context metadata.
Rebuild index & migrate vector index
src/functions/search.ts, src/functions/migrate-vector-index.ts
rebuildIndex clears the vector index and attempts guarded writes; migrateVectorIndex re-embeds memories and observations using a provided EmbeddingProvider and returns processed/failed counts, vector size, and failed sessions.
Unicode tokenization & tests
src/state/search-index.ts, test/*.test.ts
Search tokenization uses a Unicode-aware regex; tests added/updated for Greek indexing, vector index population on remember, rebuildIndex vector population, migrateVectorIndex behavior, and test mocks.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

Poem

"I'm a rabbit with a tiny quill,
I clip the text and feed the hill,
From Greek and glyph to memory streams,
I stitch embeddings into dreams,
Hop, index, done — a searcher's thrill!" 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main changes: BM25 Unicode support and vector index live-write functionality, with explicit references to the issues being resolved.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/bm25-cyrillic-vector-index

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (3)
src/functions/remember.ts (1)

115-129: 💤 Low value

Consider audit tracking and parallelization opportunities.

Two optional improvements:

  1. Audit trail: The coding guidelines state that state-changing operations should call recordAudit() for audit trail tracking. The mem::forget function (line 208) records audits, but mem::remember does not track the creation of new memories. If audit trails are desired for memory creation, consider adding a recordAudit() call after line 100.

  2. Parallelization: The vector indexing (lines 115-120) and cascade trigger (lines 123-129) are independent operations that could run in parallel using Promise.allSettled() to reduce latency.

As per coding guidelines: "Use parallel operations where possible with Promise.all for independent kv writes/reads" and "Record state-changing operations with recordAudit() for audit trail tracking".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/remember.ts` around lines 115 - 129, Add an audit record for
the new memory in mem::remember by calling recordAudit(...) right after the
memory creation/DB write (use the memory.id and context info) to track the
creation, and change the sequential calls to vectorIndexAddGuarded(...) and the
sdk.trigger({ function_id: "mem::cascade-update", ... }) invocation so they run
in parallel using Promise.allSettled([...]) (keep both operations as independent
promises to avoid failing one when the other succeeds); reference the existing
vectorIndexAddGuarded and sdk.trigger calls and ensure any errors from
Promise.allSettled are logged/handled.
test/vector-index-dimensions.test.ts (1)

78-102: ⚡ Quick win

Strengthen test assertions for migration behavior.

The test has two opportunities for improvement:

  1. Lines 99-100 use toBeGreaterThan(0) when exactly 1 observation is created, so toBe(1) would be more precise.
  2. The test doesn't verify that the migrated vectors actually have the new 4D dimensions from newProvider. Consider adding expect(result.vectorSize).toBe(1) and validating that the resulting index contains 4D vectors via oldIndex.validateDimensions(4).
♻️ More specific assertions
     const result = await migrateVectorIndex(kv as never, oldIndex, newProvider);
     expect(result.success).toBe(true);
-    expect(result.totalProcessed).toBeGreaterThan(0);
-    expect(result.vectorSize).toBeGreaterThan(0);
+    expect(result.totalProcessed).toBe(1);
+    expect(result.vectorSize).toBe(1);
     expect(result.failed).toBe(0);
+    // Verify new dimensions were applied
+    const validation = oldIndex.validateDimensions(4);
+    expect(validation.mismatches).toEqual([]);
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/vector-index-dimensions.test.ts` around lines 78 - 102, Replace the
loose "greater than 0" assertions with precise checks that reflect the single
observation and 4D embedding: assert totalProcessed is 1
(expect(result.totalProcessed).toBe(1)), assert vectorSize equals 4
(expect(result.vectorSize).toBe(4)) and after migration verify the index
dimensions by calling the VectorIndex validation helper on the returned/modified
index (e.g., oldIndex.validateDimensions(4) or the appropriate returned index
instance) to confirm embeddings are 4-dimensional; update references to
migrateVectorIndex, oldIndex, newProvider, result.vectorSize and
oldIndex.validateDimensions accordingly.
test/vector-index-populate.test.ts (1)

60-80: 💤 Low value

Consider using beforeEach/afterEach for test isolation.

The test manually cleans up global state (lines 78-79) after each test. While this works, using beforeEach/afterEach hooks would provide more robust test isolation and prevent state leakage if a test fails before reaching cleanup.

♻️ Example with lifecycle hooks
 describe("vector index population on remember", () => {
   const mockEmbedder: EmbeddingProvider = {
     name: "test",
     dimensions: 3,
     embed: async (_text: string) => new Float32Array([0.1, 0.2, 0.3]),
     embedBatch: async (_texts: string[]) =>
       _texts.map(() => new Float32Array([0.1, 0.2, 0.3])),
   };
+
+  afterEach(() => {
+    setVectorIndex(null);
+    setEmbeddingProvider(null);
+  });

   it("calls vectorIndex.add() when remember saves a memory", async () => {
     // ... test code ...
     expect(vi.size).toBe(1);
-
-    // Cleanup
-    setVectorIndex(null);
-    setEmbeddingProvider(null);
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/vector-index-populate.test.ts` around lines 60 - 80, The test mutates
global state (setVectorIndex/setEmbeddingProvider) and manually cleans up in the
test body; change to use beforeEach to create and set a fresh VectorIndex and
embedding provider (e.g., call setVectorIndex(new VectorIndex()) and
setEmbeddingProvider(mockEmbedder) in beforeEach) and use afterEach to reset
them to null (setVectorIndex(null); setEmbeddingProvider(null)), and remove the
manual cleanup lines from the test; keep the rest of the test (mockSdk, mockKV,
registerRememberFunction, and expectations) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/functions/compress.ts`:
- Around line 176-183: The BM25 and vector indexing calls after persisting the
compressed observation should be wrapped in defensive try/catch blocks so
indexing failures don't abort the compression flow; specifically, surround
getSearchIndex().add(compressed) in a try/catch (log the error and continue) and
likewise wrap the vectorIndexAddGuarded(...) call in its own try/catch (log the
error and continue), using the same pattern as the BM25 handling in remember.ts;
include identifying info like compressed.id, compressed.sessionId and
compressed.title in the log messages to aid debugging.

In `@src/functions/migrate-vector-index.ts`:
- Around line 26-29: The filter and mapping for embeddings should ensure memory
content exists: update the filter that defines textMems to require m.content
(e.g., (m) => m.isLatest !== false && m.title && m.content && m.content.trim()
!== "") or alternatively keep the broader filter but change the texts mapping to
use a safe fallback like (m.title + " " + (m.content ?? "")). Refer to the
textMems variable and the texts mapping to implement the check/fallback so you
never concatenate "undefined" into the embedding input.
- Line 74: The return object always sets success: true regardless of failures;
update the return in migrate-vector-index (the line returning { success: true,
totalProcessed: processed, failed, vectorSize: newIndex.size }) to compute
success based on the failed counter (e.g., success: failed === 0) so callers can
rely on the boolean instead of inspecting failed separately; ensure you only
change the success expression and keep totalProcessed: processed, failed, and
vectorSize: newIndex.size unchanged.

In `@src/functions/observe.ts`:
- Around line 241-246: The call to vectorIndexAddGuarded(synthetic.id,
synthetic.sessionId, synthetic.title + " " + (synthetic.narrative || ""), {
kind: "synthetic", logId: synthetic.id }) should be wrapped in a try/catch so
embedding/indexing errors don't abort the synthetic compression flow; catch any
error and emit a warning log (similar to patterns in remember.ts and
compress.ts) indicating the synthetic id and error, but do not rethrow—ensure
the function continues since the observation was already persisted and
BM25-indexed.

In `@src/functions/remember.ts`:
- Around line 115-120: The vector indexing call using vectorIndexAddGuarded with
arguments (memory.id, memory.sessionIds[0] ?? "memory", memory.title + " " +
memory.content, { kind: "memory", logId: memory.id }) should be wrapped in a
try/catch so indexing failures don't abort the overall mem::remember flow; catch
any error and call processLogger.warn (or similar existing logger) with a
concise message including memory.id and the error, mirroring the BM25 try/catch
pattern used earlier for soft-failing index operations.

In `@src/functions/search.ts`:
- Around line 105-110: rebuildIndex currently clears BM25 but does not reset the
vector store, so stale embeddings remain; before you start adding with
vectorIndexAddGuarded in rebuildIndex, call the vector-index reset/clear API
(e.g., vectorIndexReset or vectorIndexClear — whichever exists in our codebase)
for the target index (use the same identifier you pass to vectorIndexAddGuarded,
e.g., memory.id or session id) to wipe existing vectors, then repopulate; apply
the same reset-before-add change to the other spot that calls
vectorIndexAddGuarded (the second block that appends vectors) so the vector
store and BM25 stay in sync.

In `@test/vector-index-dimensions.test.ts`:
- Line 70: Replace the no-op mockKV.delete with a real deletion that removes the
key from the same in-memory store used by the mock (so it mirrors the
implementation in vector-index-populate.test.ts); implement delete as an async
function matching the signature (delete: async (_scope: string, _key: string):
Promise<void>) that looks up the underlying store and removes the entry for the
provided key (ensuring awaits/Promise resolution if the store API is async) so
tests relying on migrateVectorIndex or other mutation behavior observe actual
deletions.

---

Nitpick comments:
In `@src/functions/remember.ts`:
- Around line 115-129: Add an audit record for the new memory in mem::remember
by calling recordAudit(...) right after the memory creation/DB write (use the
memory.id and context info) to track the creation, and change the sequential
calls to vectorIndexAddGuarded(...) and the sdk.trigger({ function_id:
"mem::cascade-update", ... }) invocation so they run in parallel using
Promise.allSettled([...]) (keep both operations as independent promises to avoid
failing one when the other succeeds); reference the existing
vectorIndexAddGuarded and sdk.trigger calls and ensure any errors from
Promise.allSettled are logged/handled.

In `@test/vector-index-dimensions.test.ts`:
- Around line 78-102: Replace the loose "greater than 0" assertions with precise
checks that reflect the single observation and 4D embedding: assert
totalProcessed is 1 (expect(result.totalProcessed).toBe(1)), assert vectorSize
equals 4 (expect(result.vectorSize).toBe(4)) and after migration verify the
index dimensions by calling the VectorIndex validation helper on the
returned/modified index (e.g., oldIndex.validateDimensions(4) or the appropriate
returned index instance) to confirm embeddings are 4-dimensional; update
references to migrateVectorIndex, oldIndex, newProvider, result.vectorSize and
oldIndex.validateDimensions accordingly.

In `@test/vector-index-populate.test.ts`:
- Around line 60-80: The test mutates global state
(setVectorIndex/setEmbeddingProvider) and manually cleans up in the test body;
change to use beforeEach to create and set a fresh VectorIndex and embedding
provider (e.g., call setVectorIndex(new VectorIndex()) and
setEmbeddingProvider(mockEmbedder) in beforeEach) and use afterEach to reset
them to null (setVectorIndex(null); setEmbeddingProvider(null)), and remove the
manual cleanup lines from the test; keep the rest of the test (mockSdk, mockKV,
registerRememberFunction, and expectations) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f12aec1-1f4b-4288-b708-39146f804f4e

📥 Commits

Reviewing files that changed from the base of the PR and between 25dddc4 and 6fd7906.

📒 Files selected for processing (12)
  • src/functions/compress.ts
  • src/functions/migrate-vector-index.ts
  • src/functions/observe.ts
  • src/functions/remember.ts
  • src/functions/search.ts
  • src/index.ts
  • src/state/search-index.ts
  • test/multimodal.test.ts
  • test/search-index.test.ts
  • test/search.test.ts
  • test/vector-index-dimensions.test.ts
  • test/vector-index-populate.test.ts

Comment thread src/functions/compress.ts Outdated
Comment on lines +176 to +183
getSearchIndex().add(compressed);

await vectorIndexAddGuarded(
compressed.id,
compressed.sessionId,
compressed.title + " " + (compressed.narrative || ""),
{ kind: "observation", logId: compressed.id },
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add error handling for both BM25 and vector indexing.

Both the BM25 indexing (line 176) and vector indexing (lines 178-183) lack try/catch blocks. If either throws, the compression operation will fail even though the compressed observation was already persisted to KV at lines 170-174.

In remember.ts, the BM25 indexing is wrapped in try/catch (lines 107-114). Apply the same defensive pattern here to ensure indexing failures don't break the compression flow.

🛡️ Proposed fix
         compressed,
       );
 
-      getSearchIndex().add(compressed);
-
-      await vectorIndexAddGuarded(
-        compressed.id,
-        compressed.sessionId,
-        compressed.title + " " + (compressed.narrative || ""),
-        { kind: "observation", logId: compressed.id },
-      );
+      try {
+        getSearchIndex().add(compressed);
+      } catch (err) {
+        logger.warn("Failed to index compressed observation into BM25", {
+          obsId: compressed.id,
+          error: err instanceof Error ? err.message : String(err),
+        });
+      }
+
+      try {
+        await vectorIndexAddGuarded(
+          compressed.id,
+          compressed.sessionId,
+          compressed.title + " " + (compressed.narrative || ""),
+          { kind: "observation", logId: compressed.id },
+        );
+      } catch (err) {
+        logger.warn("Failed to index compressed observation into vector index", {
+          obsId: compressed.id,
+          error: err instanceof Error ? err.message : String(err),
+        });
+      }
 
       const streamResults = await Promise.allSettled([
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
getSearchIndex().add(compressed);
await vectorIndexAddGuarded(
compressed.id,
compressed.sessionId,
compressed.title + " " + (compressed.narrative || ""),
{ kind: "observation", logId: compressed.id },
);
try {
getSearchIndex().add(compressed);
} catch (err) {
logger.warn("Failed to index compressed observation into BM25", {
obsId: compressed.id,
error: err instanceof Error ? err.message : String(err),
});
}
try {
await vectorIndexAddGuarded(
compressed.id,
compressed.sessionId,
compressed.title + " " + (compressed.narrative || ""),
{ kind: "observation", logId: compressed.id },
);
} catch (err) {
logger.warn("Failed to index compressed observation into vector index", {
obsId: compressed.id,
error: err instanceof Error ? err.message : String(err),
});
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/compress.ts` around lines 176 - 183, The BM25 and vector
indexing calls after persisting the compressed observation should be wrapped in
defensive try/catch blocks so indexing failures don't abort the compression
flow; specifically, surround getSearchIndex().add(compressed) in a try/catch
(log the error and continue) and likewise wrap the vectorIndexAddGuarded(...)
call in its own try/catch (log the error and continue), using the same pattern
as the BM25 handling in remember.ts; include identifying info like
compressed.id, compressed.sessionId and compressed.title in the log messages to
aid debugging.

Comment thread src/functions/migrate-vector-index.ts Outdated
Comment thread src/functions/migrate-vector-index.ts Outdated
Comment thread src/functions/observe.ts
Comment on lines +241 to +246
await vectorIndexAddGuarded(
synthetic.id,
synthetic.sessionId,
synthetic.title + " " + (synthetic.narrative || ""),
{ kind: "synthetic", logId: synthetic.id },
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap vectorIndexAddGuarded in try/catch to prevent synthetic compression failures.

The vector indexing call lacks error handling. If vectorIndexAddGuarded throws, the entire synthetic compression flow will fail, even though the observation was already persisted to KV at line 235 and indexed in BM25 at line 240.

This is the same pattern as in remember.ts and compress.ts. Wrap the call in try/catch with a warning to ensure embedding failures don't break observation capture.

🛡️ Proposed fix
         getSearchIndex().add(synthetic);
-        await vectorIndexAddGuarded(
-          synthetic.id,
-          synthetic.sessionId,
-          synthetic.title + " " + (synthetic.narrative || ""),
-          { kind: "synthetic", logId: synthetic.id },
-        );
+        try {
+          await vectorIndexAddGuarded(
+            synthetic.id,
+            synthetic.sessionId,
+            synthetic.title + " " + (synthetic.narrative || ""),
+            { kind: "synthetic", logId: synthetic.id },
+          );
+        } catch (err) {
+          logger.warn("Failed to index synthetic observation into vector index", {
+            obsId: synthetic.id,
+            error: err instanceof Error ? err.message : String(err),
+          });
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
await vectorIndexAddGuarded(
synthetic.id,
synthetic.sessionId,
synthetic.title + " " + (synthetic.narrative || ""),
{ kind: "synthetic", logId: synthetic.id },
);
getSearchIndex().add(synthetic);
try {
await vectorIndexAddGuarded(
synthetic.id,
synthetic.sessionId,
synthetic.title + " " + (synthetic.narrative || ""),
{ kind: "synthetic", logId: synthetic.id },
);
} catch (err) {
logger.warn("Failed to index synthetic observation into vector index", {
obsId: synthetic.id,
error: err instanceof Error ? err.message : String(err),
});
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/observe.ts` around lines 241 - 246, The call to
vectorIndexAddGuarded(synthetic.id, synthetic.sessionId, synthetic.title + " " +
(synthetic.narrative || ""), { kind: "synthetic", logId: synthetic.id }) should
be wrapped in a try/catch so embedding/indexing errors don't abort the synthetic
compression flow; catch any error and emit a warning log (similar to patterns in
remember.ts and compress.ts) indicating the synthetic id and error, but do not
rethrow—ensure the function continues since the observation was already
persisted and BM25-indexed.

Comment thread src/functions/remember.ts
Comment on lines +115 to +120
await vectorIndexAddGuarded(
memory.id,
memory.sessionIds[0] ?? "memory",
memory.title + " " + memory.content,
{ kind: "memory", logId: memory.id },
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap vectorIndexAddGuarded in try/catch for consistency with BM25 indexing.

The BM25 indexing at lines 107-114 uses try/catch to ensure indexing failures don't block the memory save. However, vectorIndexAddGuarded is awaited directly without error handling. If it throws, the entire mem::remember operation will fail even though the memory was already persisted to KV at line 100.

Per the PR summary: "embed failures soft-fail (index skip, upstream save still succeeds)". Wrap the vector indexing in try/catch with a warning, matching the BM25 pattern.

🛡️ Proposed fix
-        await vectorIndexAddGuarded(
-          memory.id,
-          memory.sessionIds[0] ?? "memory",
-          memory.title + " " + memory.content,
-          { kind: "memory", logId: memory.id },
-        );
+        try {
+          await vectorIndexAddGuarded(
+            memory.id,
+            memory.sessionIds[0] ?? "memory",
+            memory.title + " " + memory.content,
+            { kind: "memory", logId: memory.id },
+          );
+        } catch (err) {
+          logger.warn("Failed to index saved memory into vector index", {
+            memId: memory.id,
+            error: err instanceof Error ? err.message : String(err),
+          });
+        }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
await vectorIndexAddGuarded(
memory.id,
memory.sessionIds[0] ?? "memory",
memory.title + " " + memory.content,
{ kind: "memory", logId: memory.id },
);
try {
await vectorIndexAddGuarded(
memory.id,
memory.sessionIds[0] ?? "memory",
memory.title + " " + memory.content,
{ kind: "memory", logId: memory.id },
);
} catch (err) {
logger.warn("Failed to index saved memory into vector index", {
memId: memory.id,
error: err instanceof Error ? err.message : String(err),
});
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/remember.ts` around lines 115 - 120, The vector indexing call
using vectorIndexAddGuarded with arguments (memory.id, memory.sessionIds[0] ??
"memory", memory.title + " " + memory.content, { kind: "memory", logId:
memory.id }) should be wrapped in a try/catch so indexing failures don't abort
the overall mem::remember flow; catch any error and call processLogger.warn (or
similar existing logger) with a concise message including memory.id and the
error, mirroring the BM25 try/catch pattern used earlier for soft-failing index
operations.

Comment thread src/functions/search.ts
store.get(scope)!.set(key, data);
return data;
},
delete: async (_scope: string, _key: string): Promise<void> => {},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Align mockKV.delete with vector-index-populate.test.ts implementation.

The delete operation is a no-op here, while test/vector-index-populate.test.ts lines 26-27 implements actual deletion. For consistency and to avoid masking bugs if migrateVectorIndex later relies on deletion, use the same implementation across test files.

♻️ Align delete implementation
-      delete: async (_scope: string, _key: string): Promise<void> => {},
+      delete: async (scope: string, key: string): Promise<void> => {
+        store.get(scope)?.delete(key);
+      },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
delete: async (_scope: string, _key: string): Promise<void> => {},
delete: async (scope: string, key: string): Promise<void> => {
store.get(scope)?.delete(key);
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/vector-index-dimensions.test.ts` at line 70, Replace the no-op
mockKV.delete with a real deletion that removes the key from the same in-memory
store used by the mock (so it mirrors the implementation in
vector-index-populate.test.ts); implement delete as an async function matching
the signature (delete: async (_scope: string, _key: string): Promise<void>) that
looks up the underlying store and removes the entry for the provided key
(ensuring awaits/Promise resolution if the store API is async) so tests relying
on migrateVectorIndex or other mutation behavior observe actual deletions.

Five reviewer findings applied. Five skipped with reason:

APPLIED:

1. migrate-vector-index.ts filter now requires m.content (and rejects
   empty/whitespace). Without this, undefined m.content would concat
   to "title undefined" and poison the embed input. Memory.content is
   declared string but #277-era data shapes had cases where it could
   be missing — defensive filter.

2. migrate-vector-index.ts success expression now derives from the
   failed counter (success: failed === 0). Previously hardcoded true
   so callers couldn't rely on the boolean alone.

3. search.ts rebuildIndex now calls vectorIndex?.clear() before the
   repopulation loops. Symmetric to the BM25 idx.clear() that was
   already there. Without this, memories deleted between runs leave
   orphan embeddings in the vector store after every rebuild.

4. test/vector-index-dimensions.test.ts assertion tightened from
   toBeGreaterThan(0) to toBe(1) for both totalProcessed and
   vectorSize. The seeded fixture has exactly one observation; loose
   assertion would hide regressions where the migration produced
   extra/duplicate vectors.

5. test/vector-index-populate.test.ts switched to beforeEach/afterEach
   for the setVectorIndex / setEmbeddingProvider lifecycle. Manual
   cleanup in each test body was easy to drift on additions.

PLUS:

6. compress.ts BM25 add now wrapped in try/catch matching the
   remember.ts pattern for parity. Helps cross-file symmetry even
   though SearchIndex.add doesn't throw on valid input.

SKIPPED (with reason):

- Wrapping vectorIndexAddGuarded calls in additional try/catch at
  remember.ts / observe.ts / compress.ts: the helper has internal
  try/catch (src/functions/search.ts:67-89). It NEVER throws. Adding
  outer try/catch is dead code. The reviewer missed the helper's
  contract.
- vector-index-dimensions.test.ts mockKV.delete -> real deletion: the
  tests in that file don't exercise deletion semantics, and
  migrateVectorIndex itself never calls kv.delete. Touching it would
  add noise without effect on any assertion.
- recordAudit + Promise.allSettled in remember.ts: out of scope.
  recordAudit is already called earlier in the function. Parallelizing
  cascade-update with vector-add changes failure semantics (cascade-
  update is currently fire-and-forget) — separate perf PR if wanted.
- validateDimensions(N) assertion in vector-index-dimensions.test.ts:
  vectorSize is the COUNT of stored vectors, not their dimensions.
  Adding a validateDimensions check on the result would require
  exposing the new index. Out of scope for assertion tightening.

886 / 886 tests pass. Build clean.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
src/functions/search.ts (1)

111-116: 🏗️ Heavy lift

Consider batching embedding calls during rebuild for better performance.

Currently, rebuildIndex calls vectorIndexAddGuarded once per memory and observation, resulting in sequential embed requests. Since EmbeddingProvider supports embedBatch (as seen in migrate-vector-index.ts), you could collect items, batch embed them, validate dimensions per-item, and then add successful embeddings to the index. This would significantly speed up rebuilds for larger datasets while maintaining the same soft-fail guarantees.

As per coding guidelines, "Use parallel operations where possible with Promise.all for independent kv writes/reads" — the same principle applies to independent embedding operations.

Also applies to: 151-156

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/search.ts` around lines 111 - 116, rebuildIndex currently calls
vectorIndexAddGuarded for each memory/observation sequentially which is slow;
refactor rebuildIndex to collect items (id, sessionId, text, metadata) into
batches, call EmbeddingProvider.embedBatch (see migrate-vector-index.ts for
pattern), validate each returned embedding's dimensionality individually, and
then call vectorIndexAddGuarded for only the successful embeddings (you can add
those to the index in parallel with Promise.all); ensure you preserve the
existing soft-fail behavior by skipping/ logging failed embeddings and keep
per-item metadata (e.g., { kind: "memory", logId: memory.id }) when adding to
the index.
src/functions/migrate-vector-index.ts (1)

32-39: ⚡ Quick win

Validate embedding dimensions before adding to index.

Unlike vectorIndexAddGuarded (which validates embedding.length === provider.dimensions before writing), migrateVectorIndex calls newIndex.add() directly without dimension checks. If newProvider.embedBatch returns embeddings with unexpected dimensions (due to provider misconfiguration or bugs), the index will be silently corrupted.

Add per-item dimension validation consistent with the defensive pattern in vectorIndexAddGuarded.

🛡️ Proposed validation for both phases

Memory phase:

       const embeddings = await newProvider.embedBatch(texts);
       for (let i = 0; i < textMems.length; i++) {
+        if (embeddings[i].length !== newProvider.dimensions) {
+          logger.warn("migrateVectorIndex: dimension mismatch for memory", {
+            memoryId: textMems[i].id,
+            expected: newProvider.dimensions,
+            received: embeddings[i].length,
+          });
+          continue;
+        }
         newIndex.add(
           textMems[i].id,
           textMems[i].sessionIds[0] ?? "memory",
           embeddings[i],
         );
         processed++;
       }

Observation phase (apply same pattern):

         const embeddings = await newProvider.embedBatch(texts);
         for (let i = 0; i < textObs.length; i++) {
+          if (embeddings[i].length !== newProvider.dimensions) {
+            logger.warn("migrateVectorIndex: dimension mismatch for observation", {
+              obsId: textObs[i].id,
+              expected: newProvider.dimensions,
+              received: embeddings[i].length,
+            });
+            continue;
+          }
           newIndex.add(textObs[i].id, textObs[i].sessionId, embeddings[i]);
           processed++;
         }

Also applies to: 60-64

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/migrate-vector-index.ts` around lines 32 - 39, The
migrateVectorIndex function is adding embeddings directly via newIndex.add after
calling newProvider.embedBatch, which risks corrupting the index if embedding
lengths are wrong; mimic the defensive check used in vectorIndexAddGuarded by
validating each embedding's length against newProvider.dimensions before calling
newIndex.add (for both the memory phase and the observation phase around the
other add calls), and on mismatch log/error and skip or handle that item
consistently with vectorIndexAddGuarded semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/functions/migrate-vector-index.ts`:
- Around line 49-72: The current migrateVectorIndex re-embedding loop wraps all
sessions in one try/catch so a single kv.list or embed failure aborts remaining
sessions; change it to process sessions per-session (e.g., map sessions to async
tasks and use Promise.allSettled or iterate with a try/catch inside the for
loop) so each session's observation listing (KV.observations(session.id)),
embedding (newProvider.embedBatch), and index adds (newIndex.add) are handled
and failures are recorded per session; on each settled task increment processed
for successful observation adds and increment failed and collect the session id
for failed tasks, and log per-session errors rather than aborting the whole
migration.
- Around line 14-18: The parameter oldIndex on migrateVectorIndex (signature:
migrateVectorIndex(kv: StateKV, oldIndex: VectorIndex, newProvider:
EmbeddingProvider): Promise<MigrateVectorIndexResult>) is unused and confuses
the API; either remove oldIndex from the function signature and all callers (and
update any type annotations) or, if intended, document and actually use it
inside migrateVectorIndex to transform the provided VectorIndex. Ensure you
update all call sites to match the new signature and adjust imports/types
(StateKV, VectorIndex, EmbeddingProvider, MigrateVectorIndexResult) accordingly
to keep type-checking consistent.

In `@test/vector-index-populate.test.ts`:
- Line 60: The test declares a variable named vi which shadows the vitest import
(vi) and prevents use of vi.fn()/vi.spyOn(); rename the VectorIndex variable
(e.g., change the declaration "let vi: VectorIndex" to "let vectorIndex:
VectorIndex") and update every reference in this test file (setup, beforeEach,
assertions, calls to methods like populate or close) from vi -> vectorIndex so
the vitest vi import remains available for mocking and spying.

---

Nitpick comments:
In `@src/functions/migrate-vector-index.ts`:
- Around line 32-39: The migrateVectorIndex function is adding embeddings
directly via newIndex.add after calling newProvider.embedBatch, which risks
corrupting the index if embedding lengths are wrong; mimic the defensive check
used in vectorIndexAddGuarded by validating each embedding's length against
newProvider.dimensions before calling newIndex.add (for both the memory phase
and the observation phase around the other add calls), and on mismatch log/error
and skip or handle that item consistently with vectorIndexAddGuarded semantics.

In `@src/functions/search.ts`:
- Around line 111-116: rebuildIndex currently calls vectorIndexAddGuarded for
each memory/observation sequentially which is slow; refactor rebuildIndex to
collect items (id, sessionId, text, metadata) into batches, call
EmbeddingProvider.embedBatch (see migrate-vector-index.ts for pattern), validate
each returned embedding's dimensionality individually, and then call
vectorIndexAddGuarded for only the successful embeddings (you can add those to
the index in parallel with Promise.all); ensure you preserve the existing
soft-fail behavior by skipping/ logging failed embeddings and keep per-item
metadata (e.g., { kind: "memory", logId: memory.id }) when adding to the index.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8508e1bd-e6c1-4d86-b22a-76f95bfe68b5

📥 Commits

Reviewing files that changed from the base of the PR and between 6fd7906 and d2098a5.

📒 Files selected for processing (5)
  • src/functions/compress.ts
  • src/functions/migrate-vector-index.ts
  • src/functions/search.ts
  • test/vector-index-dimensions.test.ts
  • test/vector-index-populate.test.ts

Comment thread src/functions/migrate-vector-index.ts
Comment thread src/functions/migrate-vector-index.ts Outdated
Comment thread test/vector-index-populate.test.ts Outdated
…x param

Four reviewer findings applied. One skipped with reason:

APPLIED:

1. Per-session isolation in migrateVectorIndex observations phase. The
   previous shape wrapped the entire sessions for-loop in one try/catch
   — a single bad session (kv.list throws, embedBatch rejects, etc.)
   aborted every subsequent session and silently truncated the
   migration. The loop now has per-session try/catch; failures
   increment `failed`, append the session id to a new `failedSessions`
   array on the result, and the loop continues. The kv.list<sessions>
   call itself is now guarded separately — if THAT fails the whole
   migration aborts (no sessions to iterate), but with `failed`
   recorded and `failedSessions` empty so the caller can distinguish.

2. Drop `oldIndex` parameter from migrateVectorIndex signature. It was
   dead — no read path inside the function. Three test call sites
   updated to match the new signature
   migrateVectorIndex(kv, newProvider). Confused the API contract; if
   we ever DO need a previous-index argument it can come back with a
   real use case.

3. Dim guard on every embedding result inside migrateVectorIndex, in
   both the memories phase and the per-session observations phase.
   Extracted into a local isValidEmbedding() helper that mirrors the
   shape of search.ts::vectorIndexAddGuarded's guard: log structured
   warn + skip on mismatch. Without this, a misconfigured provider
   returning the wrong-length Float32Array would corrupt the rebuilt
   index, defeating the #248 persistence-load guard.

4. Rename `vi` -> `vectorIndex` in test/vector-index-populate.test.ts
   so the let binding doesn't shadow the vitest `vi` import. Practical
   impact: future tests in that file can use `vi.fn()` / `vi.spyOn()`
   without renaming this binding first. Cosmetic but cheap.

SKIPPED (with reason):

- Use embedBatch in rebuildIndex (search.ts): perf optimization, not
  correctness. Already called out as deferred in the previous review's
  commit message. Cold-restart speed matters but the fix here is
  bigger refactor (collect-into-batches with metadata, validate each,
  add only successful) and worth its own PR. Will pick up when we
  also do the write-behind queue for live-write batching.

Result envelope:
  Added `failedSessions: string[]` to MigrateVectorIndexResult so the
  caller can decide whether to retry per-session or just abort.
  Existing fields (success, totalProcessed, failed, vectorSize)
  unchanged in shape.

886 / 886 tests pass. Build clean.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/functions/migrate-vector-index.ts (1)

89-98: 💤 Low value

Consider distinguishing list-failure from per-session failure in the result.

When kv.list(KV.sessions) rejects, the early return sets failed = 1 and leaves failedSessions empty, which is technically consistent (no specific session was known to fail) but makes the partial result indistinguishable from a single mismatched embedding earlier in the memories phase. A dedicated flag or non-empty diagnostic in failedSessions (e.g., "<list-error>") would make recovery decisions in the caller more deterministic. Optional polish — current behavior is safe.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/migrate-vector-index.ts` around lines 89 - 98, The
kv.list(KV.sessions) failure currently returns failed = 1 with an empty
failedSessions array, making it indistinguishable from a per-session failure;
update the migrateVectorIndex error path (around sessions = await kv.list<{ id:
string }>(KV.sessions)) to record a distinct diagnostic before returning — for
example, push a marker string like "<list-error>" into failedSessions (and keep
failed++), or alternatively add a boolean flag (e.g., listFailed: true) to the
returned object; ensure you update the return to include this diagnostic so
callers can tell apart a listing failure from individual session failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/functions/migrate-vector-index.ts`:
- Around line 76-81: The catch block inside migrateVectorIndex currently
increments failed by 1 even when operations like kv.list or
newProvider.embedBatch abort an entire batch; change the error handling in that
catch to add the actual number of affected memories (e.g., use textMems.length
when textMems is in scope or compute the batch size from the variable holding
the listed/processed items) to failed instead of 1, and ensure any per-session
bookkeeping that relies on success/failed counts (the same counters used
elsewhere in migrateVectorIndex) is updated consistently so callers see the
correct number of missed embeddings.

---

Nitpick comments:
In `@src/functions/migrate-vector-index.ts`:
- Around line 89-98: The kv.list(KV.sessions) failure currently returns failed =
1 with an empty failedSessions array, making it indistinguishable from a
per-session failure; update the migrateVectorIndex error path (around sessions =
await kv.list<{ id: string }>(KV.sessions)) to record a distinct diagnostic
before returning — for example, push a marker string like "<list-error>" into
failedSessions (and keep failed++), or alternatively add a boolean flag (e.g.,
listFailed: true) to the returned object; ensure you update the return to
include this diagnostic so callers can tell apart a listing failure from
individual session failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 74110c0e-b61b-4d05-88f6-e94d97570ae3

📥 Commits

Reviewing files that changed from the base of the PR and between d2098a5 and 7985cfa.

📒 Files selected for processing (3)
  • src/functions/migrate-vector-index.ts
  • test/vector-index-dimensions.test.ts
  • test/vector-index-populate.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/vector-index-dimensions.test.ts
  • test/vector-index-populate.test.ts

Comment thread src/functions/migrate-vector-index.ts
…nguish list-fail

Two new reviewer findings on migrate-vector-index. Six earlier findings
re-flagged in the same review block had already been addressed in prior
commits on this branch (verified inline):

  - compress.ts BM25 try/catch (commit d2098a5)
  - migrate filter requires m.content (commit d2098a5)
  - migrate success: failed === 0 (commit d2098a5)
  - rebuildIndex vector clear (commit d2098a5)
  - observe.ts / remember.ts outer try/catch — SKIPPED with reason
    (vectorIndexAddGuarded has internal try/catch, NEVER throws)
  - mockKV.delete no-op — SKIPPED with reason (no test exercises it)

NEW FINDINGS APPLIED:

1. Catch block in memories phase counted +1 for the entire batch
   regardless of batch size. If kv.list returned 250 memories and
   embedBatch threw, `failed` reflected 1 missed embedding when
   reality is 250.

   textMems is now declared outside the try so the catch can read its
   length. If kv.list itself threw before textMems was populated
   (length 0), we fall back to +1 (something failed but the batch
   size is unknown). If embedBatch threw, `failed += textMems.length`
   so the counter is honest.

2. kv.list<sessions> failure path returned `failed: 1,
   failedSessions: []`. Indistinguishable from "0 sessions, all OK"
   from the caller's perspective — they can't tell whether the
   enumeration itself blew up or simply found no sessions.

   The path now pushes a marker string "<sessions-list-failed>" into
   failedSessions before returning. Marker keeps the existing schema
   (failedSessions: string[]) — no boolean flag added — and gives
   callers a sentinel they can grep for. Per-session entries continue
   to use real session ids, so the marker is unambiguous.

886 / 886 tests pass. Build clean.
The two regression tests added in this branch for the BM25 non-ASCII
tokenizer fix used Russian-language fixtures
("Проверка памяти", "Тестируем поиск по кириллице", etc). Switching
to Greek-language equivalents ("Προβολή μνήμης",
"Δοκιμάζουμε αναζήτηση σε ελληνικά", etc).

Reasoning: the test fixtures need to exercise the same code path —
non-ASCII Unicode letters that the old ASCII-only \w regex stripped.
Greek code points (U+0370..U+03FF) hit \p{L} identically to Cyrillic
(U+0400..U+04FF), so the regression coverage is preserved. Greek is
also less likely to be misread as politically loaded content in a
public repo.

Test ids renamed from obs_cyrillic to obs_greek; test names from
"indexes and finds Cyrillic text" to "indexes and finds non-ASCII
(Greek) text" (and the mixed-script variant likewise). Branch name
stays as-is to keep PR continuity but the shipped test fixtures are
language-neutral going forward.

11/11 search-index tests pass with the swap. Full suite 886/886.
@rohitg00 rohitg00 merged commit 24f6c5f into main May 13, 2026
5 checks passed
@rohitg00 rohitg00 deleted the fix/bm25-cyrillic-vector-index branch May 13, 2026 14:27
rohitg00 added a commit that referenced this pull request May 13, 2026
…ardening

Three landed PRs since v0.9.11:
  - #327 (#295) — BM25 tokenizer now accepts non-ASCII (Greek,
    accented Latin, Hebrew, Arabic, Cyrillic), VectorIndex.add now
    actually called at runtime via vectorIndexAddGuarded helper with
    dim guard + input clip, migrateVectorIndex for dim migrations.
  - #326 (#277) — RetentionScore type no longer declares source
    twice; JSDoc back-compat note no longer shadowed.
  - #335 (#323) — viewer drops Google Fonts <link> (CSP-blocked),
    loadDashboard now surfaces load errors inline instead of
    sticking on "Loading dashboard…".

Bumping 0.9.11 -> 0.9.12 across the 9 standard files:
- package.json
- packages/mcp/package.json
- plugin/.claude-plugin/plugin.json
- plugin/.codex-plugin/plugin.json
- src/version.ts
- src/types.ts (ExportData.version literal)
- src/functions/export-import.ts (supportedVersions)
- test/export-import.test.ts (round-trip expectation)
- CHANGELOG.md (new 0.9.12 entry)

886 / 886 tests pass. Build clean.
rohitg00 added a commit that referenced this pull request May 13, 2026
…ardening (#337)

Three landed PRs since v0.9.11:
  - #327 (#295) — BM25 tokenizer now accepts non-ASCII (Greek,
    accented Latin, Hebrew, Arabic, Cyrillic), VectorIndex.add now
    actually called at runtime via vectorIndexAddGuarded helper with
    dim guard + input clip, migrateVectorIndex for dim migrations.
  - #326 (#277) — RetentionScore type no longer declares source
    twice; JSDoc back-compat note no longer shadowed.
  - #335 (#323) — viewer drops Google Fonts <link> (CSP-blocked),
    loadDashboard now surfaces load errors inline instead of
    sticking on "Loading dashboard…".

Bumping 0.9.11 -> 0.9.12 across the 9 standard files:
- package.json
- packages/mcp/package.json
- plugin/.claude-plugin/plugin.json
- plugin/.codex-plugin/plugin.json
- src/version.ts
- src/types.ts (ExportData.version literal)
- src/functions/export-import.ts (supportedVersions)
- test/export-import.test.ts (round-trip expectation)
- CHANGELOG.md (new 0.9.12 entry)

886 / 886 tests pass. Build clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BM25 tokenizer strips non-ASCII text; vector index never populated on write

2 participants