Context
Surfaced during the design re-read of PR #304. The post-run hint correction landed in commit f48e8f1, but the underlying gap (orphan embeddings accumulating in vectors.db after delete-deduped purges rows from messages) is real and was deferred.
What happens
The vector backend's design contract is documented at internal/vector/sqlitevec/backend.go:300-306:
Dedup Execute does not remove vector-store rows by design: if a message is embedded then later soft-deleted, the embedding stays in the vector store and query-time live filtering (dropDeletedFromSource, filteredMessageIDs) enforces the live-message contract.
This is correct for soft-delete (deleted_at), where the message row still exists and the join still works. After delete-deduped permanently removes message rows, the vector-store rows whose message_id no longer joins are orphaned:
- They consume disk space in vectors.db.
- They get over-fetched by the deletedOverfetchFactor = 2 pad in dropDeletedFromSource (backend.go:797), which assumes a constant fraction of orphans.
- They never get pruned. Over months of dedup + purge cycles, the orphan count grows unbounded relative to the live corpus.
The post-run hint in delete-deduped (now corrected by f48e8f1) tells the user to run build-embeddings --full-rebuild, which recreates the vector index from scratch — a heavy operation that re-pays the embedding-API cost for the entire corpus. That's a workaround, not a maintenance command.
Why it matters
- build-embeddings --full-rebuild is expensive: it re-runs every embedding through the configured endpoint. Users running large archives will avoid it.
- The over-fetch factor was tuned for a low orphan ratio. As orphans accumulate, ANN recall degrades because the live subset of the top-K shrinks.
- Long-running daemonized deployments (serve) compound the problem.
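To make the recall concern concrete, here is a rough back-of-the-envelope sketch. It assumes (this is not confirmed against backend.go) that the backend fetches k * deletedOverfetchFactor candidates and then drops those without a live message row, and that orphans are spread evenly among the nearest neighbours. The function name and the percentage parameter are hypothetical.

```python
def live_results(k: int, overfetch_factor: int, orphan_percent: int) -> int:
    """Estimate how many live hits survive filtering.

    Assumption: the backend fetches k * overfetch_factor candidates and
    orphan_percent of them (0-100) no longer join to a message row.
    """
    fetched = k * overfetch_factor
    live = fetched * (100 - orphan_percent) // 100
    # The caller never needs more than k results.
    return min(k, live)


if __name__ == "__main__":
    for pct in (10, 50, 75):
        print(pct, live_results(k=10, overfetch_factor=2, orphan_percent=pct))
```

Under these assumptions a factor of 2 still fills the top-K up to a 50% orphan ratio. Beyond that the result set comes up short, which is the point at which the pad stops compensating.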
Proposed approach
Add a lightweight vectors prune (or build-embeddings --prune-orphans) command that:
- Reads message IDs from the vector backend.
- Anti-joins against main.messages.id.
- Deletes vector-store rows whose message_id has no live message row.
This is much cheaper than a full rebuild: no embedding API calls, just a DELETE FROM vec_chunks WHERE message_id NOT IN (SELECT id FROM messages)-shaped query.
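The anti-join delete can be sketched as follows. This is an illustration in Python with the standard sqlite3 module, not the project's Go code. It assumes the vector store is attached under a schema named vec and that vec_chunks has a plain message_id column; the two-column table layouts and the function names are hypothetical. NOT EXISTS is used instead of NOT IN so that a NULL in the subquery could never silently disable the delete.

```python
import sqlite3


def prune_orphans(conn: sqlite3.Connection) -> int:
    """Delete vector rows whose message_id has no row in main.messages.

    Soft-deleted messages still have a row, so their vectors are kept.
    Returns the number of vector rows removed.
    """
    cur = conn.execute(
        """
        DELETE FROM vec.vec_chunks
        WHERE NOT EXISTS (
            SELECT 1 FROM main.messages AS m
            WHERE m.id = vec_chunks.message_id
        )
        """
    )
    conn.commit()
    return cur.rowcount


def build_demo() -> sqlite3.Connection:
    """Build a toy main database plus an attached vector store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS vec")
    conn.execute(
        "CREATE TABLE main.messages (id INTEGER PRIMARY KEY, deleted_at TEXT)"
    )
    conn.execute(
        "CREATE TABLE vec.vec_chunks (message_id INTEGER NOT NULL, chunk BLOB)"
    )
    # Message 1 is live, message 2 is soft-deleted, message 3 was purged.
    conn.executemany(
        "INSERT INTO main.messages (id, deleted_at) VALUES (?, ?)",
        [(1, None), (2, "2024-01-01")],
    )
    conn.executemany(
        "INSERT INTO vec.vec_chunks (message_id, chunk) VALUES (?, ?)",
        [(1, b"a"), (2, b"b"), (3, b"c"), (3, b"d")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo()
    print("pruned rows:", prune_orphans(demo))
```

In this toy data only the two chunks for the purged message 3 are removed. The soft-deleted message 2 keeps its embedding, which matches the existing query-time filtering contract.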
Optionally hook it into delete-deduped as a post-step, gated by a flag, so the cleanup happens in-line for users who want it while users who batch their vector maintenance separately can skip it.