fix: clear deleted memories from bm25 and vector indices #636

Merged
rohitg00 merged 5 commits into rohitg00:main from abhinav-m22:fix/delete-index on May 25, 2026
Conversation

@abhinav-m22
Contributor

@abhinav-m22 abhinav-m22 commented May 24, 2026

Deleted memories used to stay in the BM25 and vector indices, so they kept occupying result slots and pushed live memories out of limit-capped searches. This PR closes the leak across every delete path and makes the cleanup survive a hard process exit.

  • Every delete path now also removes the entry from the BM25 and vector indices, instead of just deleting from KV. That covers forget, governance delete, bulk delete, auto-forget (TTL expiry and low-value observations), and retention eviction.
  • The index snapshot is flushed to disk synchronously after a delete, not on the debounced timer, so even if the process is killed a second after the delete, the deletion is already persisted and the ghost cannot come back at the next boot.
  • Added a SearchIndex.remove method that properly tears down the inverted index postings, the per-doc term counts, the total doc length, and the prefix cache. It is idempotent on unknown ids, so calling it twice is safe.
  • Filled in the test gaps. Every delete path now has coverage for both the index cleanup and the persistence flush.
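
To make the failure mode concrete, here is a minimal sketch of how a KV-only delete can starve a limit-capped search. TinyIndex, deleteKvOnly, deleteEverywhere, and liveResults are hypothetical stand-ins rather than the project's code, and the sketch assumes that index hits are resolved against KV after the limit has been applied.

```typescript
// Hypothetical in-memory stand-ins; none of these names come from the PR.
type Memory = { id: string; text: string };

class TinyIndex {
  private docs = new Map<string, string>();

  add(m: Memory): void {
    this.docs.set(m.id, m.text);
  }

  remove(id: string): void {
    this.docs.delete(id);
  }

  // Returns at most `limit` ids whose text contains `term`, in insertion order.
  search(term: string, limit: number): string[] {
    const hits: string[] = [];
    for (const [id, text] of this.docs) {
      if (text.includes(term)) hits.push(id);
      if (hits.length === limit) break;
    }
    return hits;
  }
}

const kv = new Map<string, Memory>();
const index = new TinyIndex();
const seed: Memory[] = [
  { id: "a", text: "rust notes" },
  { id: "b", text: "rust tips" },
  { id: "c", text: "rust book" },
];
for (const m of seed) {
  kv.set(m.id, m);
  index.add(m);
}

// Old behaviour (assumed): only KV is updated, the index keeps the entry.
function deleteKvOnly(id: string): void {
  kv.delete(id);
}

// Fixed behaviour: KV and the index are updated together.
function deleteEverywhere(id: string): void {
  kv.delete(id);
  index.remove(id);
}

// Resolve index hits against KV, dropping ids that no longer exist.
function liveResults(term: string, limit: number): string[] {
  return index.search(term, limit).filter((id) => kv.has(id));
}

deleteKvOnly("a");
deleteKvOnly("b");
const leaky = liveResults("rust", 2); // ghosts "a" and "b" take both slots

deleteEverywhere("a");
deleteEverywhere("b");
const fixed = liveResults("rust", 2); // the live memory "c" is reachable again
```

In this toy setup the two ghost entries consume the whole limit, so the live memory never surfaces until the index entries are removed as well.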
[Screenshot from 2026-05-24 19-07-26]

Summary by CodeRabbit

  • Bug Fixes

    • Deleted memories and observations are now properly removed from search and vector indexes across all deletion operations (auto-forget, governance deletions, user-initiated forget, and retention eviction), ensuring search results accurately reflect current data.
  • Tests

    • Added comprehensive test coverage for search index cleanup across all deletion scenarios.

@vercel

vercel Bot commented May 24, 2026

@abhinav-m22 is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Contributor

coderabbitai Bot commented May 24, 2026

Walkthrough

This PR ensures deleted memories and observations are removed from both BM25 and vector search indexes across all deletion paths, with persistence synchronization to prevent deleted entries from resurfacing on restart.

Changes

Search Index Deletion and Persistence

| Layer | File(s) | Summary |
|---|---|---|
| SearchIndex.remove method and unit tests | src/state/search-index.ts, test/search-index.test.ts | SearchIndex.remove(id) deletes observation entries from inverted indexes, term-frequency records, and caches. Unit tests validate removed docs no longer appear in search results, unknown removals are no-ops, and scoring remains consistent across add/remove/add cycles. |
| Search module index removal and persistence infrastructure | src/functions/search.ts | New exports: vectorIndexRemove(id) removes entries from in-memory vector index, setIndexPersistence(...) registers a persistence handle, scheduleIndexSave() triggers debounced save, and flushIndexSave() synchronously awaits a persistence save for delete paths. |
| Index persistence wiring | src/index.ts | Worker initialization imports and calls setIndexPersistence(indexPersistence) to connect the persistence layer so delete/mutation operations can flush BM25 and vector index changes to disk. |
| Governance delete operations with index cleanup and tests | src/functions/governance.ts, test/governance.test.ts | Single and bulk governance delete remove deleted memory IDs from search/vector indexes and conditionally flush persistence. Tests verify index removals occur and persistence is flushed only when deletions happen. |
| Forget operation with index cleanup and tests | src/functions/remember.ts, test/remember-forget-audit.test.ts | Forget operation removes deleted memory and observation IDs from search/vector indexes across three deletion paths (memory, specific observations, all observations for session), then flushes persistence. Tests validate index removals and immediate persistence flush. |
| Auto-forget with index cleanup and tests | src/functions/auto-forget.ts, test/auto-forget.test.ts | Background auto-forget removes TTL-expired memories and low-value observations from indexes, with conditional persistence flush when deletions occur (skipped in dryRun mode). Tests verify index cleanup and dryRun prevents both index mutation and persistence flush. |
| Retention eviction with index cleanup and tests | src/functions/retention.ts, test/retention.test.ts | Background retention eviction removes evicted memory IDs from search/vector indexes during candidate processing and flushes persistence after the eviction loop. Tests verify index removals and persistence flush only occur when memories are evicted. |
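
The split between a debounced save and an awaited flush described in the summary above could look roughly like the following. This is a sketch under stated assumptions: the three function names come from the walkthrough, but their bodies, the IndexPersistenceLike interface, the DEBOUNCE_MS value, and the choice to cancel a pending timer on flush are all guesses made for illustration.

```typescript
// Assumed shape of the persistence handle; the real interface may differ.
interface IndexPersistenceLike {
  save(): Promise<void>;
}

let persistence: IndexPersistenceLike | null = null;
let timer: ReturnType<typeof setTimeout> | null = null;
const DEBOUNCE_MS = 5000; // assumed value, not taken from the PR

// Registers the handle; until this is called both save helpers are no-ops.
function setIndexPersistence(p: IndexPersistenceLike): void {
  persistence = p;
}

// Debounced path for chatty writes: restart the timer on every call.
function scheduleIndexSave(): void {
  if (!persistence) return;
  if (timer) clearTimeout(timer);
  timer = setTimeout(() => {
    timer = null;
    void persistence?.save();
  }, DEBOUNCE_MS);
}

// Immediate path for deletes: cancel any pending timer and await the save.
async function flushIndexSave(): Promise<void> {
  const p = persistence;
  if (!p) return;
  if (timer) {
    clearTimeout(timer);
    timer = null;
  }
  await p.save();
}
```

A delete handler in this sketch would remove the id from both indexes and then `await flushIndexSave()`, so the snapshot on disk no longer contains the entry by the time the handler returns.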

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #632: This PR implements the exact fix—adding SearchIndex.remove() and vectorIndexRemove() and calling them from all delete paths (governance, forget, auto-forget, retention) so deleted memories no longer remain in search indexes.

Poem

🐰 With whiskers twitching and a keen eye,
We scrubbed the indexes clean and dry,
No ghost of memory shall stay,
Once deleted, gone they are for aye!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.25% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately and concisely describes the main fix: ensuring deleted memories are cleared from both BM25 and vector indices, which directly addresses the core problem solved by this changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/functions/governance.ts (1)

22-31: ⚡ Quick win

Parallelize per-id delete work in mem::governance-delete.

This loop does independent KV/index operations serially, which adds avoidable latency for larger memoryIds payloads. Consider Promise.allSettled (or batched chunks) per request, then compute deleted from fulfilled results.

As per coding guidelines, "Use parallel operations with Promise.all() for independent kv writes/reads".
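
For illustration only, the parallel shape suggested in this nitpick might be sketched as follows. KvLike, deleteOne, and deleteMany are hypothetical names, the KV interface is an assumption, and the index removals are reduced to a comment because their real call sites are not shown in this thread.

```typescript
// Assumed minimal KV surface; the real SDK type is not shown in this thread.
interface KvLike {
  get(id: string): Promise<unknown | null>;
  delete(id: string): Promise<void>;
}

// Deletes one id and reports whether anything was actually removed.
async function deleteOne(kv: KvLike, id: string): Promise<boolean> {
  const existing = await kv.get(id);
  if (existing == null) return false;
  await kv.delete(id);
  // The search-index and vector-index removals for `id` would go here.
  return true;
}

// Runs the per-id work concurrently and counts only the successful deletions.
async function deleteMany(kv: KvLike, ids: string[]): Promise<number> {
  const results = await Promise.allSettled(ids.map((id) => deleteOne(kv, id)));
  return results.filter((r) => r.status === "fulfilled" && r.value).length;
}
```

Promise.allSettled is used here instead of Promise.all so that one failing id does not hide the outcome of the others; the count is derived from the settled results rather than from a shared counter.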

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/governance.ts` around lines 22 - 31, The serial per-id delete
loop over data.memoryIds (using kv.get, kv.delete, deleteAccessLog,
getSearchIndex().remove, vectorIndexRemove) causes unnecessary latency; refactor
the delete logic in the mem::governance-delete handler to run independent per-id
operations in parallel (e.g., map memoryIds to async tasks and use
Promise.allSettled or chunked Promise.all for large lists), ensure each task
performs the existing sequence (get, conditional delete, deleteAccessLog, remove
from search and vector index), and compute the deleted count from the settled
results by counting fulfilled tasks that actually deleted an item rather than
incrementing a shared counter inside the serial loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/auto-forget.test.ts`:
- Around line 1-13: The test is missing a mock for the iii-sdk; add
vi.mock("iii-sdk") at the top of test/auto-forget.test.ts and return an object
exposing the mocked SDK shape used by the code under test (e.g., sdk: { trigger:
vi.fn() } and kv: { get: vi.fn(), set: vi.fn(), list: vi.fn() }) so the calls in
registerAutoForgetFunction and any code that imports iii-sdk (e.g., code paths
exercised by getSearchIndex, setIndexPersistence, memoryToObservation) have the
expected mocked methods available.

In `@test/retention.test.ts`:
- Around line 1-6: Add a module mock for "iii-sdk" in the test suite by calling
vi.mock("iii-sdk") at the top of test/retention.test.ts and provide mocked
implementations for sdk.trigger and the kv API (kv.get, kv.set, kv.list) so
tests use the module mock instead of the local SDK shim; specifically, export
from the mock an object with trigger (a vi.fn()) and kv (an object with get,
set, list as vi.fn()s or simple in-memory implementations) so tests that call
getSearchIndex, setIndexPersistence, or memoryToObservation interact with the
mocked iii-sdk functions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e2b89055-a816-4e11-8655-770358a09535

📥 Commits

Reviewing files that changed from the base of the PR and between 3551241 and fd58585.

📒 Files selected for processing (12)
  • src/functions/auto-forget.ts
  • src/functions/governance.ts
  • src/functions/remember.ts
  • src/functions/retention.ts
  • src/functions/search.ts
  • src/index.ts
  • src/state/search-index.ts
  • test/auto-forget.test.ts
  • test/governance.test.ts
  • test/remember-forget-audit.test.ts
  • test/retention.test.ts
  • test/search-index.test.ts

Comment thread test/auto-forget.test.ts
Comment thread test/retention.test.ts
@abhinav-m22
Contributor Author

Regarding the review points:

Verified each finding against current code: both vi.mock("iii-sdk") suggestions are no-ops because the source files only do import type { ISdk } (no runtime usage), and the governance-delete parallelization is a pre-existing serial loop unrelated to this fix's scope. Skipping all with detailed reasoning in the inline replies.

@abhinav-m22
Contributor Author

Hi @rohitg00, could you please take a look and review the changes?

Owner

@rohitg00 rohitg00 left a comment

LGTM. Real bug, clean fix, complete test coverage.

Verified

Issue is legit. Reproed #632 against current main — deleted memories still occupy BM25 result slots. SearchIndex.remove() did not exist; VectorIndex.remove() existed but was orphaned (zero callers).

Fix is minimal and complete. Five delete paths fixed: governance.delete-memory, governance.bulk-delete, forget, auto-forget (TTL + low-value), retention.evict. All now call getSearchIndex().remove(id) + vectorIndexRemove(id).

SearchIndex.remove() tears down every actual data structure:

  • entries map: deleted
  • docTermCounts map: deleted
  • invertedIndex posting lists: id pulled from every term, empty terms removed
  • totalDocLength: entry.termCount subtracted (clamped at 0)
  • sortedTerms cache: invalidated

(PR description mentions a "prefix cache" — there isn't one in the codebase, just inline prefix-match scoring. Doesn't matter, no field was missed.)
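
The teardown listed above can be sketched as below. The field names follow the review, but SketchIndex itself, its add and stats methods, the tokenizer, and the use of docTermCounts to locate the affected posting lists are assumptions made so the example is self-contained; the real SearchIndex may differ.

```typescript
// Illustrative class only; field names mirror the review, the rest is assumed.
class SketchIndex {
  private entries = new Map<string, { termCount: number }>();
  private docTermCounts = new Map<string, Map<string, number>>();
  private invertedIndex = new Map<string, Set<string>>();
  private totalDocLength = 0;
  private sortedTerms: string[] | null = null;

  add(id: string, text: string): void {
    this.remove(id); // assumption: re-adding an id replaces the old entry
    const terms = text.toLowerCase().split(/\W+/).filter(Boolean);
    const counts = new Map<string, number>();
    for (const t of terms) {
      counts.set(t, (counts.get(t) ?? 0) + 1);
      let postings = this.invertedIndex.get(t);
      if (!postings) {
        postings = new Set<string>();
        this.invertedIndex.set(t, postings);
      }
      postings.add(id);
    }
    this.entries.set(id, { termCount: terms.length });
    this.docTermCounts.set(id, counts);
    this.totalDocLength += terms.length;
    this.sortedTerms = null;
  }

  remove(id: string): void {
    const entry = this.entries.get(id);
    if (!entry) return; // unknown id: no-op, so repeated calls are safe
    for (const term of this.docTermCounts.get(id)?.keys() ?? []) {
      const postings = this.invertedIndex.get(term);
      if (!postings) continue;
      postings.delete(id);
      if (postings.size === 0) this.invertedIndex.delete(term); // drop empty terms
    }
    this.entries.delete(id);
    this.docTermCounts.delete(id);
    this.totalDocLength = Math.max(0, this.totalDocLength - entry.termCount);
    this.sortedTerms = null; // invalidate the cached term list
  }

  // Lazily rebuilt sorted term list, standing in for the sortedTerms cache.
  terms(): string[] {
    if (this.sortedTerms === null) {
      this.sortedTerms = [...this.invertedIndex.keys()].sort();
    }
    return this.sortedTerms;
  }

  stats(): { docs: number; terms: number; totalDocLength: number } {
    return {
      docs: this.entries.size,
      terms: this.invertedIndex.size,
      totalDocLength: this.totalDocLength,
    };
  }
}
```

Keeping totalDocLength and the posting lists in step with entries matters for BM25 because average document length and document frequency both feed the score, which is presumably why the unit tests check scoring across add/remove/add cycles.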

Persistence flush is well-designed.

  • flushIndexSave() is a sync variant of the debounced scheduleSave — appropriate for delete paths since adds are chatty but deletes are infrequent
  • setIndexPersistence(p) wiring keeps unit tests independent (no-op until wired)
  • IndexPersistence.save() already catches its own errors via logFailure() — callers can't crash on flush failure, which is the right tradeoff because the KV delete already committed before flush is invoked

Test coverage is thorough. 5 new test blocks across 5 files covering every delete path, dryRun no-op, persistence flush assertion via mock, plus a dedicated SearchIndex.remove unit-level test for idempotency + term cleanup.

Verified locally

  • git fetch origin pull/636/head:pr-636 && git checkout pr-636
  • npm test — 1114/1114 pass
  • npm test -- test/auto-forget.test.ts test/governance.test.ts test/remember-forget-audit.test.ts test/retention.test.ts test/search-index.test.ts — 59/59 pass
  • npm run build — 21 files, 2432 KB, clean

Minor (non-blocking)

flushIndexSave() is not internally serialized — two concurrent delete-handler invocations could race two save() calls. Both write the same serialized form so the worst case is one redundant KV write of a slightly stale snapshot. Not a correctness issue, and serializing it would add complexity for negligible win on a delete-shaped workload. Leaving as-is is the right call.
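
For reference, this is roughly what internal serialization would involve if it were ever wanted. The helper below is hypothetical and not part of the PR; it chains each save onto the previous one so that two concurrent callers never overlap.

```typescript
// Hypothetical helper, not part of the PR: runs saves strictly one at a time.
let tail: Promise<void> = Promise.resolve();

function serializedFlush(save: () => Promise<void>): Promise<void> {
  // Start this save only after the previous one has settled.
  const run = tail.then(save);
  // Swallow errors on the chain itself so one failed save does not block later ones;
  // the caller still sees the failure through the returned promise.
  tail = run.catch(() => undefined);
  return run;
}
```

The extra module-level state and error handling are the complexity the review refers to, which supports leaving the flush unserialized for an infrequent delete workload.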

Ship it.

@rohitg00 rohitg00 merged commit 8c558c6 into rohitg00:main May 25, 2026
1 of 2 checks passed
2 participants