
fix: BM25 tokenizer strips non-ASCII text; vector index never populated on write#296

Closed
nik1t7n wants to merge 5 commits into rohitg00:main from nik1t7n:fix/bm25-cyrillic-vector-index

Conversation

@nik1t7n

@nik1t7n nik1t7n commented May 12, 2026

Closes #295.

Summary

Two independent bugs that degrade search quality for non-English users and silently break vector-hybrid search. See the issue for full root-cause analysis.

Bug 1: BM25 tokenizer strips Cyrillic and all non-ASCII characters

src/state/search-index.ts:226 — the tokenizer regex \w only matches ASCII [a-zA-Z0-9_]. All Unicode letters (Cyrillic, Greek, CJK, Arabic) are stripped to spaces.

Fix: Replace \w with [\p{L}\p{N}] and add the u (unicode) flag. This preserves all Unicode letters and numbers while keeping ASCII semantics intact.
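A minimal before/after sketch of the tokenizer behavior (the `tokenizeAscii`/`tokenizeUnicode` helpers are illustrative names for this example; the actual tokenizer lives in src/state/search-index.ts and may differ in detail):

```typescript
// Before: \w only matches ASCII [a-zA-Z0-9_], so Cyrillic (and all other
// non-ASCII scripts) is silently dropped from the token stream.
const tokenizeAscii = (text: string): string[] =>
  text.toLowerCase().match(/\w+/g) ?? [];

// After: \p{L} (any Unicode letter) and \p{N} (any Unicode number),
// enabled by the u flag, keep non-ASCII scripts intact while still
// matching the same ASCII tokens as before.
const tokenizeUnicode = (text: string): string[] =>
  text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];

console.log(tokenizeAscii("Привет world 42"));   // ["world", "42"] — Cyrillic lost
console.log(tokenizeUnicode("Привет world 42")); // ["привет", "world", "42"]
```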

Bug 2: vectorIndex.add() is never called at runtime

The VectorIndex exists and is persisted at startup, but no code path calls add():

  • remember.ts — no vector indexing after BM25 add
  • observe.ts — no vector indexing on synthetic compression
  • compress.ts — no vector indexing on LLM compression
  • search.ts::rebuildIndex() — no vector embedding during index rebuild

Fix: Add global getVectorIndex()/getEmbeddingProvider() singletons in search.ts and wire vectorIndex.add() into all four code paths. Call setVectorIndex()/setEmbeddingProvider() in index.ts at startup.
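The singleton wiring can be sketched as follows. This is an illustrative shape only: `VectorIndexLike` and `indexForVectorSearch` are hypothetical names invented for the example, while the accessor names mirror those described above; the real code lives in src/functions/search.ts.

```typescript
interface EmbeddingProvider {
  name: string;
  dimensions: number;
  embed(text: string): Promise<Float32Array>;
}

interface VectorIndexLike {
  size: number;
  add(id: string, sessionId: string, embedding: Float32Array): void;
}

// Module-level singletons, set once at startup and swappable in tests.
let vectorIndex: VectorIndexLike | null = null;
let embeddingProvider: EmbeddingProvider | null = null;

export const setVectorIndex = (vi: VectorIndexLike | null): void => {
  vectorIndex = vi;
};
export const getVectorIndex = (): VectorIndexLike | null => vectorIndex;
export const setEmbeddingProvider = (p: EmbeddingProvider | null): void => {
  embeddingProvider = p;
};
export const getEmbeddingProvider = (): EmbeddingProvider | null =>
  embeddingProvider;

// Each write path guards on both singletons, keeping vector indexing
// optional and non-fatal when hybrid search is not configured.
export async function indexForVectorSearch(
  id: string,
  sessionId: string,
  text: string,
): Promise<void> {
  if (!vectorIndex || !embeddingProvider) return;
  try {
    const embedding = await embeddingProvider.embed(text);
    vectorIndex.add(id, sessionId, embedding);
  } catch (err) {
    console.warn("vector indexing failed (non-fatal)", { id, err });
  }
}
```

At startup, index.ts would call the two setters once, and each of the four write paths would invoke the guarded helper after its BM25 add.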

Migration utility (bonus)

src/functions/migrate-vector-index.ts — re-embeds all existing memories and observations against a new embedding provider when dimensions change (e.g., switching from 384-dim to 1536-dim embeddings). Handles empty KV gracefully.
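Why a dimension change forces a full re-embed can be seen in a toy sketch: vectors produced at 384 dimensions cannot be compared with 1536-dimension ones, so migration has to rebuild every entry against the new provider. The `Embedder` interface, `makeEmbedder`, and `reembedAll` below are illustrative stand-ins, not the PR's actual API.

```typescript
interface Embedder {
  dimensions: number;
  embedBatch(texts: string[]): Promise<Float32Array[]>;
}

// Toy deterministic embedder: hashes characters into `dims` buckets.
const makeEmbedder = (dims: number): Embedder => ({
  dimensions: dims,
  embedBatch: async (texts) =>
    texts.map((t) => {
      const v = new Float32Array(dims);
      for (let i = 0; i < t.length; i++) v[t.charCodeAt(i) % dims] += 1;
      return v;
    }),
});

// Re-embed every stored text with the new provider, producing a fresh
// index whose vectors all share the new dimensionality.
async function reembedAll(
  texts: string[],
  next: Embedder,
): Promise<Map<number, Float32Array>> {
  const vectors = await next.embedBatch(texts);
  const index = new Map<number, Float32Array>();
  vectors.forEach((v, i) => index.set(i, v));
  return index;
}

const docs = ["привет мир", "hello world"];
const migrated = await reembedAll(docs, makeEmbedder(1536));
console.log(migrated.get(0)!.length); // 1536
```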

Files changed

File                                    Change
src/state/search-index.ts               Replace \w with [\p{L}\p{N}] + u flag
src/functions/search.ts                 Add singleton accessors + vector embedding in rebuildIndex()
src/index.ts                            setVectorIndex() / setEmbeddingProvider() at startup
src/functions/remember.ts               vectorIndex.add() after BM25 add
src/functions/observe.ts                vectorIndex.add() on synthetic compression
src/functions/compress.ts               vectorIndex.add() on LLM compression
src/functions/migrate-vector-index.ts   NEW: migration utility
test/search-index.test.ts               +2 Cyrillic tokenization tests
test/search.test.ts                     +1 rebuildIndex vector population test
test/vector-index-populate.test.ts      NEW: +3 vector-populate-on-remember tests
test/vector-index-dimensions.test.ts    +3 migration tests

How to verify

npm test

Results: 864 passed, 13 failed — all 13 failures are pre-existing in test/mcp-standalone.test.ts (unrelated to this PR, related to memory_governance_delete and standalone MCP tool count assertions).

Each commit is signed off in compliance with the DCO requirement of the Apache-2.0 license.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added semantic search capability using vector embeddings alongside existing keyword search
    • Expanded Unicode support for multilingual text, including Cyrillic languages
    • New vector index migration utility for updating embeddings with different providers
  • Tests

    • Added tests validating vector search integration
    • Added tests confirming Unicode search support

Review Change Stack

nik1t7n added 5 commits May 12, 2026 02:43
…L}\p{N} regex

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
…in rebuildIndex

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
…ss + setVectorIndex in index.ts

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
…, rebuildIndex vector test

Adds migrateVectorIndex utility for re-embedding when embedding provider dimensions change. Adds comprehensive test coverage for Cyrillic tokenization, vector index population on remember/rebuildIndex, and migration edge cases.

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
@vercel

vercel Bot commented May 12, 2026

@nik1t7n is attempting to deploy a commit to rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented May 12, 2026

📝 Walkthrough


This PR fixes two critical bugs that broke non-English search and left the vector index unpopulated: Unicode tokenization is updated to preserve non-ASCII characters, the vector index is wired into the search module with state management, and all write paths (remember, observe, compress) plus rebuildIndex now populate the vector index with embeddings. A migration utility supports re-embedding when embedding providers change.

Changes

Vector Index Population and Unicode Tokenization

Layer / File(s) Summary
Unicode tokenization fix
src/state/search-index.ts, test/search-index.test.ts
BM25 tokenizer updated from ASCII-only \w to Unicode-aware \p{L}\p{N} pattern. Tests validate Cyrillic and mixed-language query matching.
Vector index state management
src/functions/search.ts
Module-level singletons for VectorIndex and EmbeddingProvider with exported setter/getter accessors enable decoupled initialization and testing.
Vector indexing on rebuild
src/functions/search.ts, test/search.test.ts
rebuildIndex now embeds memories and observations and adds them to the vector index during the rebuild pass. Test confirms vector index is populated.
Vector indexing on remember
src/functions/remember.ts
After BM25 indexing, saved memory is embedded and upserted to vector index if both provider and index are available; failures logged as warnings.
Vector indexing on synthetic compression
src/functions/observe.ts
When auto-compression is disabled, synthetic observations are embedded and added to vector index. Non-fatal failures are logged.
Vector indexing on LLM compression
src/functions/compress.ts
After persisting and BM25 indexing, compressed observations are embedded and added to vector index with non-fatal error handling.
Vector index migration utility
src/functions/migrate-vector-index.ts, test/vector-index-dimensions.test.ts
New migrateVectorIndex function re-embeds all stored memories and observations against a new embedding provider. Returns success, processed count, failure count, and final vector size. Tests cover observations, memories, and empty KV.
Startup integration
src/index.ts
After initialization, calls setVectorIndex and setEmbeddingProvider to register active instances with search module.
Write-path integration testing
test/vector-index-populate.test.ts
Suite validates mem::remember populates vector index when components are set, multiple memories increase size correctly, and flow gracefully succeeds when vector components are absent.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • rohitg00/agentmemory#248: Adds VectorIndex.validateDimensions and provider dimension guards; this PR uses that validation in migration and test context.
  • rohitg00/agentmemory#258: Adds BM25 memory indexing to the remember/search flow; this PR extends that same flow with optional vector embedding alongside BM25.
  • rohitg00/agentmemory#269: Adds memory-to-observation fallback logic in search; this PR wires vector index infrastructure that affects the same search/indexing flow.

Poem

🐰 Cyrillic texts no longer fade,
Vectors bloom where embeddings played.
Write once, index all the way—
Unicode search saves the day!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the title accurately summarizes both main bugs, the BM25 tokenizer stripping non-ASCII text and the vector index not being populated during writes, matching the core objectives.
  • Linked Issues check ✅ Passed: all coding requirements from issue #295 are met: Unicode-aware tokenizer fix [search-index.ts], vector-index population wiring in remember/observe/compress/rebuildIndex [multiple files], global singleton accessors for the vector index and embedding provider [search.ts, index.ts], and the migration utility [migrate-vector-index.ts]. Comprehensive tests validate all fixes.
  • Out of Scope Changes check ✅ Passed: all changes align with the stated objectives: tokenizer fix, vector index wiring, singleton accessors, migration utility, and targeted tests for Cyrillic tokenization and vector population. No unrelated modifications detected.







@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
test/vector-index-populate.test.ts (2)

124-127: 💤 Low value

Consider removing or simplifying the verbose comment.

The comment on lines 124-127 is lengthy and expresses uncertainty about the implementation. Per coding guidelines, avoid comments explaining WHAT the code does. The test assertion itself (expect(vi).toBeNull()) clearly documents the expected behavior.

🧹 Simplify or remove comment
     expect((result as { success: boolean }).success).toBe(true);

     const vi = getVectorIndex();
-    // The remember function initializes the SearchIndex singleton but doesn't
-    // touch vectorIndex when it's null — and no setVectorIndex was called.
-    // getVectorIndex() returns whatever was set, which is null here.
     expect(vi).toBeNull();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/vector-index-populate.test.ts` around lines 124 - 127, Remove or shorten
the verbose explanatory comment preceding the assertion in the test; the
behavior is already clear from the code (the remember function, SearchIndex,
getVectorIndex/setVectorIndex and the assertion expect(vi).toBeNull()), so
either delete the multi-line comment or replace it with a one-line note like
"vectorIndex should be null when not set" to keep the test concise.

77-79: ⚡ Quick win

Consider using beforeEach/afterEach for cleanup to improve test isolation.

The manual cleanup calls to setVectorIndex(null) and setEmbeddingProvider(null) are repeated in multiple tests. If a test throws or fails before reaching the cleanup code, global state will leak into subsequent tests.

♻️ Refactor to use beforeEach/afterEach
 describe("vector index population on remember", () => {
   const mockEmbedder: EmbeddingProvider = {
     name: "test",
     dimensions: 3,
     embed: async (_text: string) => new Float32Array([0.1, 0.2, 0.3]),
     embedBatch: async (_texts: string[]) =>
       _texts.map(() => new Float32Array([0.1, 0.2, 0.3])),
   };
+
+  afterEach(() => {
+    setVectorIndex(null);
+    setEmbeddingProvider(null);
+  });

   it("calls vectorIndex.add() when remember saves a memory", async () => {
     const sdk = mockSdk();
     const kv = mockKV();
     const vi = new VectorIndex();
     setVectorIndex(vi);
     setEmbeddingProvider(mockEmbedder);
     registerRememberFunction(sdk as never, kv as never);

     const result = await sdk.trigger({
       function_id: "mem::remember",
       payload: { content: "Test memory for vector indexing", type: "fact" },
     });

     expect((result as { success: boolean }).success).toBe(true);
-    // Vector index should have one entry after remember
     expect(vi.size).toBe(1);
-
-    // Cleanup
-    setVectorIndex(null);
-    setEmbeddingProvider(null);
   });

   it("calls vectorIndex.add() with short content (0% similarity dedup)", async () => {
     const sdk = mockSdk();
     const kv = mockKV();
     const vi = new VectorIndex();
     setVectorIndex(vi);
     setEmbeddingProvider(mockEmbedder);
     registerRememberFunction(sdk as never, kv as never);

-    // Save first memory
     await sdk.trigger({
       function_id: "mem::remember",
       payload: { content: "First unique memory", type: "fact" },
     });

-    // Save second (different content, no dedup)
     await sdk.trigger({
       function_id: "mem::remember",
       payload: { content: "Second completely different memory", type: "fact" },
     });

     expect(vi.size).toBe(2);
-
-    setVectorIndex(null);
-    setEmbeddingProvider(null);
   });

Also applies to: 104-105

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/vector-index-populate.test.ts` around lines 77 - 79, Replace repeated
manual cleanup calls to setVectorIndex(null) and setEmbeddingProvider(null)
inside individual tests with centralized test hooks: add a beforeEach that
ensures both setVectorIndex(null) and setEmbeddingProvider(null) are reset
before each test and an afterEach that calls setVectorIndex(null) and
setEmbeddingProvider(null) to guarantee cleanup even on failures; remove the
inline calls from the test bodies (the ones calling setVectorIndex and
setEmbeddingProvider) so all tests rely on the new beforeEach/afterEach hooks
for isolation.
src/functions/migrate-vector-index.ts (1)

23-72: 🏗️ Heavy lift

Consider parallelizing independent re-embedding operations.

The memory re-embedding (lines 23-47) and observation re-embedding (lines 49-72) are independent operations that run sequentially. Parallelizing them with Promise.all() could significantly speed up migrations, especially for large datasets. As per coding guidelines, use parallel operations where possible.

⚡ Proposed parallelization structure
 export async function migrateVectorIndex(
   kv: StateKV,
   oldIndex: VectorIndex,
   newProvider: EmbeddingProvider,
 ): Promise<MigrateVectorIndexResult> {
   const newIndex = new VectorIndex();
   let failed = 0;
   let processed = 0;

-  // Re-embed memories
-  try {
-    const memories = await kv.list<Memory>(KV.memories);
-    const textMems = memories.filter(
-      (m) => m.isLatest !== false && m.title,
-    );
-    const texts = textMems.map((m) => m.title! + " " + m.content);
-
-    if (texts.length > 0) {
-      const embeddings = await newProvider.embedBatch(texts);
-      for (let i = 0; i < textMems.length; i++) {
-        newIndex.add(
-          textMems[i].id,
-          textMems[i].sessionIds[0] ?? "memory",
-          embeddings[i],
-        );
-        processed++;
-      }
-    }
-  } catch (err) {
-    logger.warn("migrateVectorIndex: failed to re-embed memories", {
-      error: err instanceof Error ? err.message : String(err),
-    });
-    failed++;
-  }
-
-  // Re-embed observations
-  try {
-    const sessions = await kv.list<{ id: string }>(KV.sessions);
-    for (const session of sessions) {
-      const observations = await kv.list<CompressedObservation>(
-        KV.observations(session.id),
-      );
-      const textObs = observations.filter((o) => o.title);
-      const texts = textObs.map((o) => o.title! + " " + (o.narrative || ""));
-
-      if (texts.length > 0) {
-        const embeddings = await newProvider.embedBatch(texts);
-        for (let i = 0; i < textObs.length; i++) {
-          newIndex.add(textObs[i].id, textObs[i].sessionId, embeddings[i]);
-          processed++;
-        }
-      }
-    }
-  } catch (err) {
-    logger.warn("migrateVectorIndex: failed to re-embed observations", {
-      error: err instanceof Error ? err.message : String(err),
-    });
-    failed++;
-  }
+  const results = await Promise.all([
+    // Re-embed memories
+    (async () => {
+      try {
+        const memories = await kv.list<Memory>(KV.memories);
+        const textMems = memories.filter(
+          (m) => m.isLatest !== false && m.title,
+        );
+        const texts = textMems.map((m) => m.title! + " " + m.content);
+
+        if (texts.length > 0) {
+          const embeddings = await newProvider.embedBatch(texts);
+          for (let i = 0; i < textMems.length; i++) {
+            newIndex.add(
+              textMems[i].id,
+              textMems[i].sessionIds[0] ?? "memory",
+              embeddings[i],
+            );
+          }
+          return { processed: textMems.length, failed: 0 };
+        }
+        return { processed: 0, failed: 0 };
+      } catch (err) {
+        logger.warn("migrateVectorIndex: failed to re-embed memories", {
+          error: err instanceof Error ? err.message : String(err),
+        });
+        return { processed: 0, failed: 1 };
+      }
+    })(),
+    // Re-embed observations
+    (async () => {
+      try {
+        const sessions = await kv.list<{ id: string }>(KV.sessions);
+        let sessionProcessed = 0;
+        for (const session of sessions) {
+          const observations = await kv.list<CompressedObservation>(
+            KV.observations(session.id),
+          );
+          const textObs = observations.filter((o) => o.title);
+          const texts = textObs.map((o) => o.title! + " " + (o.narrative || ""));
+
+          if (texts.length > 0) {
+            const embeddings = await newProvider.embedBatch(texts);
+            for (let i = 0; i < textObs.length; i++) {
+              newIndex.add(textObs[i].id, textObs[i].sessionId, embeddings[i]);
+              sessionProcessed++;
+            }
+          }
+        }
+        return { processed: sessionProcessed, failed: 0 };
+      } catch (err) {
+        logger.warn("migrateVectorIndex: failed to re-embed observations", {
+          error: err instanceof Error ? err.message : String(err),
+        });
+        return { processed: 0, failed: 1 };
+      }
+    })(),
+  ]);
+
+  processed = results.reduce((sum, r) => sum + r.processed, 0);
+  failed = results.reduce((sum, r) => sum + r.failed, 0);

   return { success: true, totalProcessed: processed, failed, vectorSize: newIndex.size };
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/migrate-vector-index.ts` around lines 23 - 72, The memory and
observation re-embedding blocks should be run in parallel: extract each block
into its own async helper (e.g., reembedMemories and reembedObservations) that
performs its kv.list -> filter -> newProvider.embedBatch -> newIndex.add loop
and updates the shared counters (processed, failed) safely; then run
Promise.all([reembedMemories(), reembedObservations()]) so they execute
concurrently. Ensure each helper catches its own errors and increments failed on
error (and increments processed for each successful add), and avoid race
conditions on counters (use atomic increments or local counters returned by each
helper and sum them after Promise.all). Keep usage of symbols: kv.list,
newProvider.embedBatch, newIndex.add, processed, failed, memories, observations,
textMems, textObs, sessions, and session.id to locate code.
src/functions/search.ts (1)

53-58: 🏗️ Heavy lift

Consider batching embeddings for better rebuild performance.

The current implementation calls currentEmbeddingProvider.embed() sequentially for each memory and observation (lines 53-58, 93-98). This creates a performance bottleneck, especially with network-based embedding providers. The EmbeddingProvider interface includes embedBatch() which should be used to embed multiple texts in a single call. As per coding guidelines, use parallel operations where possible.

⚡ Proposed batching approach

For memories:

   const memories = await kv.list<Memory>(KV.memories)
+  const memoriesToEmbed: Memory[] = []
   for (const memory of memories) {
     if (memory.isLatest === false) continue
     if (!memory.title || !memory.content) continue
     idx.add(memoryToObservation(memory))
+    if (vectorIndex && currentEmbeddingProvider) {
+      memoriesToEmbed.push(memory)
+    }
+    count++
+  }
+  if (memoriesToEmbed.length > 0 && vectorIndex && currentEmbeddingProvider) {
+    try {
+      const texts = memoriesToEmbed.map(m => m.title + ' ' + m.content)
+      const embeddings = await currentEmbeddingProvider.embedBatch(texts)
+      for (let i = 0; i < memoriesToEmbed.length; i++) {
+        vectorIndex.add(
+          memoriesToEmbed[i].id,
+          memoriesToEmbed[i].sessionIds[0] ?? 'memory',
+          embeddings[i]
+        )
+      }
+    } catch (err) {
+      logger.warn('rebuildIndex: failed to batch embed memories', {
+        error: err instanceof Error ? err.message : String(err),
+      })
+    }
-    if (vectorIndex && currentEmbeddingProvider) {
-      try {
-        const embedding = await currentEmbeddingProvider.embed(memory.title + ' ' + memory.content)
-        vectorIndex.add(memory.id, memory.sessionIds[0] ?? 'memory', embedding)
-      } catch {}
-    }
-    count++
   }

Similar refactoring would apply to observations (lines 89-102).

Also applies to: 93-98

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/search.ts` around lines 53 - 58, The loop calling
currentEmbeddingProvider.embed for each memory/observation should be converted
to use currentEmbeddingProvider.embedBatch to reduce RPC/network calls: collect
the texts (e.g., memory.title + ' ' + memory.content and observation.text) into
arrays, call embedBatch once per batch, then iterate the returned embeddings and
call vectorIndex.add(memory.id, memory.sessionIds[0] ?? 'memory', embedding)
(and analogous for observations) to restore entries; if embedBatch is not
implemented, fallback to Promise.all on currentEmbeddingProvider.embed for
parallel execution. Ensure you reference currentEmbeddingProvider.embedBatch,
vectorIndex.add, and the memory/observation loops when applying the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/functions/migrate-vector-index.ts`:
- Line 74: Return object always sets success: true even when there are failures;
update the return to compute success from the failed count (e.g., success:
failed === 0) so callers get an accurate outcome. Locate the return in
migrate-vector-index.ts that currently returns { success: true, totalProcessed:
processed, failed, vectorSize: newIndex.size } and change the success value to
be derived from the failed variable (or equivalent check), leaving
totalProcessed, failed and vectorSize unchanged.

In `@src/functions/search.ts`:
- Around line 53-58: The empty catch blocks around the embedding and indexing
logic (where currentEmbeddingProvider.embed(...) is called and
vectorIndex.add(...) is invoked) swallow all errors; modify those catches to
capture the error (e.g., catch (err)) and log a helpful message including
memory.id, memory.sessionIds, and the error details using the module logger (or
processLogger) so embedding failures are visible but do not crash execution,
then continue; apply the same change to both occurrences that call
currentEmbeddingProvider.embed and vectorIndex.add.

In `@test/search.test.ts`:
- Around line 195-207: The test currently sets singletons via
setEmbeddingProvider(mockEmbedder) and setVectorIndex(new VectorIndex()) then
calls rebuildIndex(...) and asserts on getVectorIndex(); ensure singleton state
is always reset by wrapping the setup and assertions in a try/finally that calls
setVectorIndex(null) and setEmbeddingProvider(null) in finally, or alternatively
move those cleanup calls into an afterEach hook so cleanup runs even when
assertions fail; update the test around rebuildIndex/getVectorIndex to use
try/finally (or add afterEach) to guarantee failure-safe teardown.

---

Nitpick comments:
In `@src/functions/migrate-vector-index.ts`:
- Around line 23-72: The memory and observation re-embedding blocks should be
run in parallel: extract each block into its own async helper (e.g.,
reembedMemories and reembedObservations) that performs its kv.list -> filter ->
newProvider.embedBatch -> newIndex.add loop and updates the shared counters
(processed, failed) safely; then run Promise.all([reembedMemories(),
reembedObservations()]) so they execute concurrently. Ensure each helper catches
its own errors and increments failed on error (and increments processed for each
successful add), and avoid race conditions on counters (use atomic increments or
local counters returned by each helper and sum them after Promise.all). Keep
usage of symbols: kv.list, newProvider.embedBatch, newIndex.add, processed,
failed, memories, observations, textMems, textObs, sessions, and session.id to
locate code.

In `@src/functions/search.ts`:
- Around line 53-58: The loop calling currentEmbeddingProvider.embed for each
memory/observation should be converted to use
currentEmbeddingProvider.embedBatch to reduce RPC/network calls: collect the
texts (e.g., memory.title + ' ' + memory.content and observation.text) into
arrays, call embedBatch once per batch, then iterate the returned embeddings and
call vectorIndex.add(memory.id, memory.sessionIds[0] ?? 'memory', embedding)
(and analogous for observations) to restore entries; if embedBatch is not
implemented, fallback to Promise.all on currentEmbeddingProvider.embed for
parallel execution. Ensure you reference currentEmbeddingProvider.embedBatch,
vectorIndex.add, and the memory/observation loops when applying the change.

In `@test/vector-index-populate.test.ts`:
- Around line 124-127: Remove or shorten the verbose explanatory comment
preceding the assertion in the test; the behavior is already clear from the code
(the remember function, SearchIndex, getVectorIndex/setVectorIndex and the
assertion expect(vi).toBeNull()), so either delete the multi-line comment or
replace it with a one-line note like "vectorIndex should be null when not set"
to keep the test concise.
- Around line 77-79: Replace repeated manual cleanup calls to
setVectorIndex(null) and setEmbeddingProvider(null) inside individual tests with
centralized test hooks: add a beforeEach that ensures both setVectorIndex(null)
and setEmbeddingProvider(null) are reset before each test and an afterEach that
calls setVectorIndex(null) and setEmbeddingProvider(null) to guarantee cleanup
even on failures; remove the inline calls from the test bodies (the ones calling
setVectorIndex and setEmbeddingProvider) so all tests rely on the new
beforeEach/afterEach hooks for isolation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 47f3227f-e7ca-446e-97ad-c7f25bc5e4d3

📥 Commits

Reviewing files that changed from the base of the PR and between 03fb42d and 4255298.

📒 Files selected for processing (11)
  • src/functions/compress.ts
  • src/functions/migrate-vector-index.ts
  • src/functions/observe.ts
  • src/functions/remember.ts
  • src/functions/search.ts
  • src/index.ts
  • src/state/search-index.ts
  • test/search-index.test.ts
  • test/search.test.ts
  • test/vector-index-dimensions.test.ts
  • test/vector-index-populate.test.ts

Comment thread src/functions/migrate-vector-index.ts
Comment on line 74

failed++;
}

return { success: true, totalProcessed: processed, failed, vectorSize: newIndex.size };

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

success field is always true even when failures occur.

The function returns success: true regardless of the failed count. This may mislead callers into thinking the migration completed successfully when partial failures occurred. Consider setting success: failed === 0 to accurately reflect the migration outcome.

🔧 Proposed fix to reflect partial failures
-  return { success: true, totalProcessed: processed, failed, vectorSize: newIndex.size };
+  return { success: failed === 0, totalProcessed: processed, failed, vectorSize: newIndex.size };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return { success: true, totalProcessed: processed, failed, vectorSize: newIndex.size };
return { success: failed === 0, totalProcessed: processed, failed, vectorSize: newIndex.size };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/migrate-vector-index.ts` at line 74, Return object always sets
success: true even when there are failures; update the return to compute success
from the failed count (e.g., success: failed === 0) so callers get an accurate
outcome. Locate the return in migrate-vector-index.ts that currently returns {
success: true, totalProcessed: processed, failed, vectorSize: newIndex.size }
and change the success value to be derived from the failed variable (or
equivalent check), leaving totalProcessed, failed and vectorSize unchanged.

Comment thread src/functions/search.ts
Comment on lines +53 to +58
if (vectorIndex && currentEmbeddingProvider) {
try {
const embedding = await currentEmbeddingProvider.embed(memory.title + ' ' + memory.content)
vectorIndex.add(memory.id, memory.sessionIds[0] ?? 'memory', embedding)
} catch {}
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Empty catch blocks silently swallow embedding errors.

The empty catch {} blocks at lines 57 and 97 suppress all errors during vector embedding, including unexpected failures. While making vector indexing non-fatal is correct, completely silent failures make debugging difficult when embeddings fail.

🔍 Proposed fix to log suppressed errors
         if (vectorIndex && currentEmbeddingProvider) {
           try {
             const embedding = await currentEmbeddingProvider.embed(memory.title + ' ' + memory.content)
             vectorIndex.add(memory.id, memory.sessionIds[0] ?? 'memory', embedding)
-          } catch {}
+          } catch (err) {
+            logger.warn('rebuildIndex: failed to embed memory', {
+              memoryId: memory.id,
+              error: err instanceof Error ? err.message : String(err),
+            })
+          }
         }

...

         if (vectorIndex && currentEmbeddingProvider) {
           try {
             const embedding = await currentEmbeddingProvider.embed(obs.title + ' ' + obs.narrative)
             vectorIndex.add(obs.id, obs.sessionId, embedding)
-          } catch {}
+          } catch (err) {
+            logger.warn('rebuildIndex: failed to embed observation', {
+              obsId: obs.id,
+              error: err instanceof Error ? err.message : String(err),
+            })
+          }
         }

Also applies to: 93-98

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/search.ts` around lines 53 - 58, the empty catch blocks around
the embedding and indexing logic (where currentEmbeddingProvider.embed(...) is
called and vectorIndex.add(...) is invoked) swallow all errors; modify those
catches to capture the error (e.g., catch (err)) and log a helpful message
including memory.id, memory.sessionIds, and the error details using the module
logger (or processLogger) so embedding failures are visible but do not crash
execution, then continue; apply the same change to both occurrences that call
currentEmbeddingProvider.embed and vectorIndex.add.

Comment thread test/search.test.ts
Comment on lines +195 to +207
setEmbeddingProvider(mockEmbedder);
setVectorIndex(new VectorIndex());

await rebuildIndex(kv as never);

const vi = getVectorIndex();
expect(vi).not.toBeNull();
expect(vi!.size).toBeGreaterThan(0);

// Cleanup
setVectorIndex(null);
setEmbeddingProvider(null);
});

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make singleton cleanup failure-safe in this test.

If an assertion fails before Line 205/Line 206, singleton state can leak into subsequent tests. Wrap setup/assertions in try/finally (or move reset into afterEach).

Suggested fix
   it("rebuildIndex populates the vector index", async () => {
     const mockEmbedder = {
       name: "test",
       dimensions: 3,
       embed: async (_text: string) => new Float32Array([0.1, 0.2, 0.3]),
       embedBatch: async (_texts: string[]) =>
         _texts.map(() => new Float32Array([0.1, 0.2, 0.3])),
     };
-    setEmbeddingProvider(mockEmbedder);
-    setVectorIndex(new VectorIndex());
-
-    await rebuildIndex(kv as never);
-
-    const vi = getVectorIndex();
-    expect(vi).not.toBeNull();
-    expect(vi!.size).toBeGreaterThan(0);
-
-    // Cleanup
-    setVectorIndex(null);
-    setEmbeddingProvider(null);
+    setEmbeddingProvider(mockEmbedder);
+    setVectorIndex(new VectorIndex());
+    try {
+      await rebuildIndex(kv as never);
+      const vi = getVectorIndex();
+      expect(vi).not.toBeNull();
+      expect(vi!.size).toBeGreaterThan(0);
+    } finally {
+      setVectorIndex(null);
+      setEmbeddingProvider(null);
+    }
   });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/search.test.ts` around lines 195 - 207, the test currently sets
singletons via setEmbeddingProvider(mockEmbedder) and setVectorIndex(new
VectorIndex()) then calls rebuildIndex(...) and asserts on getVectorIndex();
ensure singleton state is always reset by wrapping the setup and assertions in a
try/finally that calls setVectorIndex(null) and setEmbeddingProvider(null) in
finally, or alternatively move those cleanup calls into an afterEach hook so
cleanup runs even when assertions fail; update the test around
rebuildIndex/getVectorIndex to use try/finally (or add afterEach) to guarantee
failure-safe teardown.

@rohitg00
Owner

Superseded by #327. Took your branch, rebased onto current main, and added a centralizing refactor commit on top (vectorIndexAddGuarded helper with dimension guard + 16k input clip + consistent logging). All your BM25 + vector-index wire-up + Cyrillic tests preserved. Full credit kept on every commit you wrote. Closing this one; please follow #327 for merge. Thanks for the precise repro + fix.

@rohitg00 rohitg00 closed this May 13, 2026
rohitg00 added a commit that referenced this pull request May 13, 2026
…#295, supersedes #296) (#327)

* fix #2: BM25 tokenizer now preserves Cyrillic/Unicode via \p{L}\p{N} regex

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
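
The regex change in that commit can be sketched as follows. The function names here are illustrative stand-ins for the tokenizer in src/state/search-index.ts, which may lowercase or split differently in detail; only the old `\w` pattern and the new `[\p{L}\p{N}]` + `u`-flag pattern come from the PR.

```typescript
// Old behaviour: \w only matches ASCII [a-zA-Z0-9_], so Cyrillic, Greek,
// CJK, and Arabic letters are silently dropped from the token stream.
function tokenizeAscii(text: string): string[] {
  return text.toLowerCase().match(/\w+/g) ?? [];
}

// New behaviour: \p{L} (any Unicode letter) and \p{N} (any Unicode number)
// with the u flag preserve non-ASCII tokens.
function tokenizeUnicode(text: string): string[] {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
}

console.log(tokenizeAscii('Проверка памяти BM25'));   // only the ASCII token survives
console.log(tokenizeUnicode('Проверка памяти BM25')); // all three tokens survive
```

One ASCII-semantics difference worth noting: `\w` also matches `_`, while `[\p{L}\p{N}]` does not, so `memory_test` tokenizes as one word under the old regex and two under the new one.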

* fix #1: add getVectorIndex/setVectorIndex + vector embedding in rebuildIndex

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>

* fix #1+#3: vectorIndex.add in remember/observe/compress + setVectorIndex in index.ts

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>

* fix: add migration script, Cyrillic BM25 tests, vector-populate tests, rebuildIndex vector test

Adds migrateVectorIndex utility for re-embedding when embedding provider dimensions change. Adds comprehensive test coverage for Cyrillic tokenization, vector index population on remember/rebuildIndex, and migration edge cases.

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>

* chore: remove internal research spec from PR branch

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>

* refactor(search): centralize vector-index writes via vectorIndexAddGuarded

Pre-merge hardening on top of @nik1t7n's PR #296.

Original PR had vectorIndex.add() inlined at four sites (remember,
observe, compress, rebuildIndex) with the same try/catch shape repeated
each time. That open-codes three sharp edges per site:

1. Missing dimension guard at the write boundary. Per #248, persistence
   load already refuses to start when persisted vectors mismatch the
   active provider's dimensions — but the live-write path had no
   symmetric check. A mis-configured embedding provider returning a
   different-length Float32Array would silently corrupt the in-memory
   index until next restart cleared it.

2. No input clipping. Memory.content can be arbitrarily large (the
   MemorySlot path even allows 20KB content per slot). Most embedding
   providers cap input around 8k tokens and return HTTP 400 on overflow. A
   single
   oversize memory could throw at the provider and lose the vector
   entry; worse, on local-Ollama-style providers it can cost real
   inference time.

3. Inconsistent error logging (rebuildIndex had `catch {}` swallowing
   the error entirely, the other three sites logged via logger.warn
   with slightly different shape).

Consolidates into a single exported `vectorIndexAddGuarded()` in
src/functions/search.ts:

  - Reads vectorIndex + currentEmbeddingProvider from module-level
    state (same singletons the PR added).
  - Clips embed input to 16k chars (EMBED_MAX_CHARS) — well under
    every provider's documented limit, leaves headroom for
    chars-per-token variance across languages.
  - Validates embedding.length === provider.dimensions before
    calling vi.add(); logs structured warning and skips on mismatch.
  - Single logger.warn on embed failure with provider name + error
    message + indexed-item kind ("memory" | "observation" |
    "synthetic") + id.
  - Returns boolean (true on success) so callers can branch if they
    want, but every existing call site fire-and-forgets — the contract
    is "soft-fail: index entry skipped, upstream save still succeeds".

Rewrites the four call sites (remember.ts, observe.ts synthetic
branch, compress.ts post-LLM, search.ts rebuildIndex memories + obs
loops) to call vectorIndexAddGuarded() instead of the inline try/catch.

Side effect: cleans up the lingering `catch {}` in rebuildIndex that
was swallowing rebuild-time embed failures silently — every failure
now surfaces in logs at a consistent shape.
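
A minimal sketch of the helper described above. The name `vectorIndexAddGuarded`, the 16k clip (`EMBED_MAX_CHARS`), the dimension guard, and the soft-fail boolean contract come from this commit; the `VectorIndex` and `EmbeddingProvider` interfaces and the `console.warn` logging are illustrative assumptions standing in for the real module state and logger.

```typescript
// Illustrative shapes; the real interfaces live elsewhere in the repo.
interface EmbeddingProvider {
  name: string;
  dimensions: number;
  embed(text: string): Promise<Float32Array>;
}
interface VectorIndex {
  add(id: string, sessionId: string, embedding: Float32Array): void;
}

const EMBED_MAX_CHARS = 16_000;

let vectorIndex: VectorIndex | null = null;
let currentEmbeddingProvider: EmbeddingProvider | null = null;

function setVectorIndex(vi: VectorIndex | null) { vectorIndex = vi; }
function setEmbeddingProvider(p: EmbeddingProvider | null) { currentEmbeddingProvider = p; }

async function vectorIndexAddGuarded(
  kind: 'memory' | 'observation' | 'synthetic',
  id: string,
  sessionId: string,
  text: string,
): Promise<boolean> {
  if (!vectorIndex || !currentEmbeddingProvider) return false;
  try {
    // Clip oversize input so providers with ~8k-token limits don't reject it.
    const embedding = await currentEmbeddingProvider.embed(text.slice(0, EMBED_MAX_CHARS));
    // Dimension guard: a misconfigured provider must not corrupt the index.
    if (embedding.length !== currentEmbeddingProvider.dimensions) {
      console.warn('vectorIndexAddGuarded: dimension mismatch', { kind, id, got: embedding.length });
      return false;
    }
    vectorIndex.add(id, sessionId, embedding);
    return true;
  } catch (err) {
    // Soft-fail contract: index entry skipped, upstream save still succeeds.
    console.warn('vectorIndexAddGuarded: embed failed', { kind, id, error: String(err) });
    return false;
  }
}
```

Call sites can fire-and-forget the returned boolean, which matches the "index entry skipped, upstream save still succeeds" contract stated above.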

Tests:
- test/multimodal.test.ts mocked `../src/functions/search.js` and
  exposed only getSearchIndex; needed vectorIndexAddGuarded added to
  the mock surface or test fails with "No vectorIndexAddGuarded export
  is defined on the search.js mock". Added the export to the mock.
- All other tests (search-index Cyrillic, search rebuildIndex vector,
  vector-index-populate, migrate-vector-index) green on the helper
  refactor — no behavioural change beyond the three guards above.

886 / 886 tests pass. Build clean.

Out-of-scope (deferred to follow-ups):
- Blocking await on every save. The embed call is on the request hot
  path. For local Ollama / on-device providers this is fine; for
  network providers it adds 200ms-2s latency per remember/observe.
  Should be async / fire-and-forget with a write-behind queue —
  separate perf PR after this lands.
- Periodic persistence flush of the vector index. Today the index is
  serialized at process exit (per IndexPersistence). Runtime adds
  between flushes survive in memory but die on crash. Acceptable for
  this PR; future PR can add a debounced flush.
- Batching in rebuildIndex. Today the rebuild path embeds one doc at
  a time. embedBatch() exists on every provider and would be ~5-10x
  faster for cold restarts on large corpora. Defer.

* fix(search): tighter migrate guard + rebuild vector clear + test hygiene

Five reviewer findings applied. Five skipped with reason:

APPLIED:

1. migrate-vector-index.ts filter now requires m.content (and rejects
   empty/whitespace). Without this, undefined m.content would concat
   to "title undefined" and poison the embed input. Memory.content is
   declared string but #277-era data shapes had cases where it could
   be missing — defensive filter.

2. migrate-vector-index.ts success expression now derives from the
   failed counter (success: failed === 0). Previously hardcoded true
   so callers couldn't rely on the boolean alone.

3. search.ts rebuildIndex now calls vectorIndex?.clear() before the
   repopulation loops. Symmetric to the BM25 idx.clear() that was
   already there. Without this, memories deleted between runs leave
   orphan embeddings in the vector store after every rebuild.

4. test/vector-index-dimensions.test.ts assertion tightened from
   toBeGreaterThan(0) to toBe(1) for both totalProcessed and
   vectorSize. The seeded fixture has exactly one observation; loose
   assertion would hide regressions where the migration produced
   extra/duplicate vectors.

5. test/vector-index-populate.test.ts switched to beforeEach/afterEach
   for the setVectorIndex / setEmbeddingProvider lifecycle. Manual
   cleanup in each test body was easy to drift on additions.

PLUS:

6. compress.ts BM25 add now wrapped in try/catch matching the
   remember.ts pattern for parity. Helps cross-file symmetry even
   though SearchIndex.add doesn't throw on valid input.
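
Finding 3 (the symmetric clear in rebuildIndex) can be sketched as below. This is an illustrative reduction, not the real rebuildIndex in src/functions/search.ts: the document shapes and index interfaces are assumptions, and only the clear-both-indices-before-repopulating behaviour comes from the commit.

```typescript
interface Doc { id: string; sessionId: string; text: string }

function rebuildIndex(
  docs: Doc[],
  bm25: { clear(): void; add(id: string, text: string): void },
  vector: { clear(): void; add(id: string, sessionId: string): void } | null,
): void {
  bm25.clear();
  // Symmetric to the BM25 clear: without this, docs deleted between runs
  // leave orphan embeddings in the vector store after every rebuild.
  vector?.clear();
  for (const d of docs) {
    bm25.add(d.id, d.text);
    vector?.add(d.id, d.sessionId);
  }
}
```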

SKIPPED (with reason):

- Wrapping vectorIndexAddGuarded calls in additional try/catch at
  remember.ts / observe.ts / compress.ts: the helper has internal
  try/catch (src/functions/search.ts:67-89). It NEVER throws. Adding
  outer try/catch is dead code. The reviewer missed the helper's
  contract.
- vector-index-dimensions.test.ts mockKV.delete -> real deletion: the
  tests in that file don't exercise deletion semantics, and
  migrateVectorIndex itself never calls kv.delete. Touching it would
  add noise without effect on any assertion.
- recordAudit + Promise.allSettled in remember.ts: out of scope.
  recordAudit is already called earlier in the function. Parallelizing
  cascade-update with vector-add changes failure semantics (cascade-
  update is currently fire-and-forget) — separate perf PR if wanted.
- validateDimensions(N) assertion in vector-index-dimensions.test.ts:
  vectorSize is the COUNT of stored vectors, not their dimensions.
  Adding a validateDimensions check on the result would require
  exposing the new index. Out of scope for assertion tightening.

886 / 886 tests pass. Build clean.

* fix(migrate): per-session isolation + dim guard + drop unused oldIndex param

Four reviewer findings applied. One skipped with reason:

APPLIED:

1. Per-session isolation in migrateVectorIndex observations phase. The
   previous shape wrapped the entire sessions for-loop in one try/catch
   — a single bad session (kv.list throws, embedBatch rejects, etc.)
   aborted every subsequent session and silently truncated the
   migration. The loop now has per-session try/catch; failures
   increment `failed`, append the session id to a new `failedSessions`
   array on the result, and the loop continues. The kv.list<sessions>
   call itself is now guarded separately — if THAT fails the whole
   migration aborts (no sessions to iterate), but with `failed`
   recorded and `failedSessions` empty so the caller can distinguish.

2. Drop `oldIndex` parameter from migrateVectorIndex signature. It was
   dead — no read path inside the function. Three test call sites
   updated to match the new signature
   migrateVectorIndex(kv, newProvider). Confused the API contract; if
   we ever DO need a previous-index argument it can come back with a
   real use case.

3. Dim guard on every embedding result inside migrateVectorIndex, in
   both the memories phase and the per-session observations phase.
   Extracted into a local isValidEmbedding() helper that mirrors the
   shape of search.ts::vectorIndexAddGuarded's guard: log structured
   warn + skip on mismatch. Without this, a misconfigured provider
   returning the wrong-length Float32Array would corrupt the rebuilt
   index, defeating the #248 persistence-load guard.

4. Rename `vi` -> `vectorIndex` in test/vector-index-populate.test.ts
   so the let binding doesn't shadow the vitest `vi` import. Practical
   impact: future tests in that file can use `vi.fn()` / `vi.spyOn()`
   without renaming this binding first. Cosmetic but cheap.

SKIPPED (with reason):

- Use embedBatch in rebuildIndex (search.ts): perf optimization, not
  correctness. Already called out as deferred in the previous review's
  commit message. Cold-restart speed matters but the fix here is
  bigger refactor (collect-into-batches with metadata, validate each,
  add only successful) and worth its own PR. Will pick up when we
  also do the write-behind queue for live-write batching.

Result envelope:
  Added `failedSessions: string[]` to MigrateVectorIndexResult so the
  caller can decide whether to retry per-session or just abort.
  Existing fields (success, totalProcessed, failed, vectorSize)
  unchanged in shape.

886 / 886 tests pass. Build clean.

* fix(migrate): attribute batch-level failures to correct count + distinguish list-fail

Two new reviewer findings on migrate-vector-index. Six earlier findings
re-flagged in the same review block had already been addressed in prior
commits on this branch (verified inline):

  - compress.ts BM25 try/catch (commit d2098a5)
  - migrate filter requires m.content (commit d2098a5)
  - migrate success: failed === 0 (commit d2098a5)
  - rebuildIndex vector clear (commit d2098a5)
  - observe.ts / remember.ts outer try/catch — SKIPPED with reason
    (vectorIndexAddGuarded has internal try/catch, NEVER throws)
  - mockKV.delete no-op — SKIPPED with reason (no test exercises it)

NEW FINDINGS APPLIED:

1. Catch block in memories phase counted +1 for the entire batch
   regardless of batch size. If kv.list returned 250 memories and
   embedBatch threw, `failed` reflected 1 missed embedding when
   reality is 250.

   textMems is now declared outside the try so the catch can read its
   length. If kv.list itself threw before textMems was populated
   (length 0), we fall back to +1 (something failed but the batch
   size is unknown). If embedBatch threw, `failed += textMems.length`
   so the counter is honest.

2. kv.list<sessions> failure path returned `failed: 1,
   failedSessions: []`. Indistinguishable from "0 sessions, all OK"
   from the caller's perspective — they can't tell whether the
   enumeration itself blew up or simply found no sessions.

   The path now pushes a marker string "<sessions-list-failed>" into
   failedSessions before returning. Marker keeps the existing schema
   (failedSessions: string[]) — no boolean flag added — and gives
   callers a sentinel they can grep for. Per-session entries continue
   to use real session ids, so the marker is unambiguous.

886 / 886 tests pass. Build clean.
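
The batch-attribution pattern from finding 1 can be sketched as follows. The kv access is reduced to injected callbacks for illustration; only the declare-texts-outside-the-try shape and the `failed += length, fall back to 1` accounting come from the commit (the `"<sessions-list-failed>"` sentinel from finding 2 lives in the separate sessions phase, not shown here).

```typescript
// Hypothetical reduction of the memories phase of migrateVectorIndex.
async function migrateMemoriesPhase(
  listMemories: () => Promise<string[]>,            // texts to embed
  embedBatch: (texts: string[]) => Promise<unknown[]>,
): Promise<{ processed: number; failed: number }> {
  let processed = 0;
  let failed = 0;
  let texts: string[] = [];  // declared outside the try so the catch can read its length
  try {
    texts = await listMemories();
    await embedBatch(texts);
    processed += texts.length;
  } catch {
    // If embedBatch threw, the whole batch was lost, not one item. If
    // listMemories itself threw (texts still empty), fall back to +1:
    // something failed but the batch size is unknown.
    failed += texts.length > 0 ? texts.length : 1;
  }
  return { processed, failed };
}
```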

* test: swap Cyrillic non-ASCII tokenizer fixtures for Greek

The two regression tests added in this branch for the BM25 non-ASCII
tokenizer fix used Russian-language fixtures
("Проверка памяти", "Тестируем поиск по кириллице", etc). Switching
to Greek-language equivalents ("Προβολή μνήμης",
"Δοκιμάζουμε αναζήτηση σε ελληνικά", etc).

Reasoning: the test fixtures need to exercise the same code path —
non-ASCII Unicode letters that the old ASCII-only \w regex stripped.
Greek code points (U+0370..U+03FF) hit \p{L} identically to Cyrillic
(U+0400..U+04FF), so the regression coverage is preserved. Greek is
also less likely to be misread as politically loaded content in a
public repo.

Test ids renamed from obs_cyrillic to obs_greek; test names from
"indexes and finds Cyrillic text" to "indexes and finds non-ASCII
(Greek) text" (and the mixed-script variant likewise). Branch name
stays as-is to keep PR continuity but the shipped test fixtures are
language-neutral going forward.

11/11 search-index tests pass with the swap. Full suite 886/886.
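
The equivalence claim above can be checked directly: Greek code points hit `\p{L}` the same way Cyrillic ones do. The `tokenize` helper below is an illustrative stand-in for the tokenizer in src/state/search-index.ts, using only the regex this branch ships.

```typescript
// Stand-in tokenizer matching the shipped regex (\p{L}\p{N} with the u flag).
const tokenize = (text: string): string[] =>
  text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];

const greek = tokenize('Δοκιμάζουμε αναζήτηση σε ελληνικά');  // 4 Greek tokens
const mixed = tokenize('Προβολή memory v2');                  // mixed-script fixture
console.log(greek, mixed);
```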

---------

Signed-off-by: Nikita Nosov <20nik.nosov21@gmail.com>
Co-authored-by: Nikita Nosov <20nik.nosov21@gmail.com>

Development

Successfully merging this pull request may close these issues.

BM25 tokenizer strips non-ASCII text; vector index never populated on write
