Add semantic and hybrid search via local vector embeddings by wesm · Pull Request #277 · wesm/msgvault

wesm · 2026-04-20T15:07:39Z

Summary

Adds optional semantic and hybrid search to msgvault, using vector embeddings generated against any OpenAI-compatible endpoint (Ollama, llama.cpp, LM Studio, etc.). Keyword (FTS5) search remains the default; vector support is opt-in behind a build tag and a config block.

What's new

Search modes

mode=vector — pure semantic search; the free-text query is embedded and nearest neighbors are returned.
mode=hybrid — BM25 and vector results fused via Reciprocal Rank Fusion, with an optional subject boost.
mode=fts — existing keyword search, still the default.

All three modes are available through:

msgvault search "..." --mode hybrid [--explain] [--json]
GET /api/v1/search?q=...&mode=hybrid
MCP search_messages tool (mode + explain arguments)

Filter operators (from:, to:, label:, has:attachment, after:, before:, account:) work uniformly across all three modes.

Find similar messages

New MCP tool find_similar_messages takes a seed message_id and returns nearest-neighbor messages, excluding the seed itself. Supports account, after, before, has_attachment filters.

Embedding lifecycle

msgvault build-embeddings [--full-rebuild] [--yes] — backfill or rebuild the index. Resumable: rerun to pick up where a prior run left off.
Generations — every rebuild creates a new generation with a model:dimension fingerprint. The old generation keeps serving queries until the new one is fully built, then activates atomically.
Incremental updates — new and repaired messages are enqueued automatically during Gmail/IMAP sync and drained by the worker.
Scheduler integration — msgvault serve runs the worker on a cron ([vector.embed.schedule] cron = "...") and optionally after every successful scheduled sync (run_after_sync = true).

Stats and observability

GET /api/v1/stats includes a vector_search sub-object: active generation, fingerprint, message count, pending-queue depth, building generation (if any).
--explain on search surfaces per-hit scores (BM25, vector distance, RRF, subject boost).
Clear sentinel errors for common conditions: vector_not_enabled, index_stale, index_building, missing_free_text, pagination_unsupported, invalid_mode, embedding_timeout.

Storage

Vectors live in a separate vectors.db (sqlite-vec ANN index + queue + generations table) next to msgvault.db. The main database is unchanged.
Attached over a single connection with foreign_keys=ON on every open.

Enabling

Build with vector support (the default make build already does this):
```
go build -tags "fts5 sqlite_vec"
```
Run an embedding endpoint (Ollama example):
```
ollama pull nomic-embed-text
```

Add a [vector] block to ~/.msgvault/config.toml:

[vector]
enabled = true
backend = "sqlite-vec"

[vector.embeddings]
endpoint = "http://localhost:11434/v1"
model = "nomic-embed-text"
dimension = 768
max_input_chars = 2000   # match to the embedder's context window

Backfill:

msgvault build-embeddings --full-rebuild --yes

For ongoing maintenance, chain the drain onto your existing sync cron:

0 2 * * * msgvault sync-incremental you@gmail.com && msgvault build-embeddings

Full guide: msgvault.io/usage/vector-search/. HTTP response shapes: msgvault.io/api-server/.

Compatibility

Vector support is strictly additive. Binaries built without -tags sqlite_vec, or configs without [vector] enabled = true, behave exactly as before.
No existing tables, API shapes, or CLI flags change for users who don't enable vector search.

🤖 Generated with Claude Code

Proposes hybrid BM25 + vector search over the email corpus with embeddings generated via OpenAI-compatible endpoints (Ollama, LM Studio, llama.cpp-server). Vectors live in a separate ~/.msgvault/vectors.db attached at query time so hybrid RRF fusion stays a single SQLite query. Search is exposed through CLI, HTTP, and stdio MCP; the TUI is unchanged. Reranker and RAG are deferred to follow-up specs.

- Track an active index generation with a model:dimension fingerprint so search refuses to run on a mismatched corpus instead of silently comparing cross-model vectors. Model rotation is an explicit msgvault embed --full-rebuild, never an auto-invalidation. Remove the --model one-shot override from msgvault search. - Rewrite the fused SQL so account, date, label, and attachment filters push into both signal CTEs. The previous post-fusion filter starved recall on narrow filters. - Extend the existing msgvault search CLI and MCP server instead of treating them as greenfield. Gmail-style query syntax is preserved; --mode and --explain flags are additive. All 8 existing MCP tools are unchanged; search_messages gains optional mode/explain args and find_similar_messages is the only new tool. - Drop total from vector/hybrid response shape and use returned/has_more instead. total has no stable definition without a similarity threshold or full scan. fts responses still report total. - Add a Compatibility and versioning section enumerating every affected public surface and its additive delta.

- Dual-enqueue during rebuild: sync writes pending_embeddings for every non-retired generation, not just active. The building generation now stays current with ingest throughout a long rebuild, so the atomic swap can't drop messages that arrived mid-build. - Backend contract carries GenerationID on Upsert, Search, Delete, Stats, plus new CreateGeneration / ActivateGeneration / RetireGeneration methods. Removes the ambiguity when two same-dim generations coexist in vectors_vec_dN. - One explicit pagination rule per mode: fts keeps page/page_size as today; vector/hybrid accepts only page=1 with page_size capped at max_page_size_hybrid. page>1 returns HTTP 400 pagination_unsupported. CLI --offset errors for non-fts modes. - index_building now only fires when no active generation exists (first-ever build). Rebuilds with an active generation are transparent to search. - Default mode is fts everywhere at the public API. Removed default_mode from config.

- Switch the embedding worker from claim-by-delete to claim-by-mark: pending_embeddings gains claimed_at and claim_token; completion transactionally DELETEs the row and writes the embedding. Startup reclaim pass releases stale claims. The previous recovery scan was unsafe under dual-enqueue because it relied on needs_embedding as an external source of truth. - Drop the needs_embedding column from messages. pending_embeddings is the only source of truth for pending work. Initial enqueue on CreateGeneration seeds the queue from SELECT id FROM messages in the same transaction that creates the generation, so there is no window where a generation exists with an incomplete queue. - Rewrite the sample fused SQL against the actual msgvault schema: messages.sender_id -> participants, messages.source_id -> sources, EXISTS on message_labels. Filter values are resolved to IDs at the Go layer (matching internal/query) and the SQL takes JSON arrays of IDs. Adds deleted_at IS NULL exclusion. - Fix stale package-layout note: /api/v1/search/hybrid and /vector sub-endpoints were removed in the previous revision; handlers.go gains a mode query param on the existing /api/v1/search handler. MCP server location and command wiring updated accordingly.

- Disallow --account with --full-rebuild. Generations are corpus-wide by definition (the activation flips the entire active index), so a partial rebuild cannot safely replace the current generation. The CLI rejects the combination with a clear error. --account remains valid on incremental runs as a drain-prioritization filter. - Fix the deletion column in the sample SQL: existing search paths in internal/query/sqlite.go and internal/search/parser.go exclude deleted_from_source_at IS NULL, not deleted_at. The previous wording would have changed which messages are searchable. - Unify the /api/v1/stats extension: one optional vector_search sub-object containing active_generation, building_generation, and pending_embeddings_total. Both the compatibility table and the observability section now reference the same shape.

Unify on deleted_from_source_at IS NULL as the one deletion predicate used everywhere: initial generation seed, sync-side enqueue, and the fused search SQL. This matches existing msgvault search behavior in internal/query/sqlite.go and internal/search/parser.go. Also document the policy for messages that are marked deleted after embedding: not actively purged in MVP (search hides them via the same predicate; the next --full-rebuild drops them naturally). Active purging is a storage-hygiene improvement, not a correctness fix.

Decomposes the design spec into 30 TDD tasks across 11 phases: foundation (config, backend interface, sqlite-vec extension, schema); generations and storage ops (Create/Activate/Retire, Upsert, Search, Delete, Stats, FusedSearch); embedding pipeline (HTTP client, preprocessing, claim-mark-complete queue, worker, reclaim, sync-side enqueuer); hybrid RRF and search engine; CLI extensions (--mode, --explain, msgvault embed); HTTP and MCP surfaces; scheduler wiring; documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n no-config path

Introduces internal/vector/sqlitevec, a small package that wires the sqlite-vec SQLite extension into mattn/go-sqlite3 when the sqlite_vec build tag is set. Callers invoke RegisterExtension() once at startup; it calls sqlite_vec.Auto() (a process-global sqlite3_auto_extension hook) and registers a "sqlite3_vec" driver alias for database/sql. The stub (!sqlite_vec) returns ErrNotBuilt and the default "sqlite3" driver name so the rest of the tree compiles without the extension. Parametrizes the Makefile on BUILD_TAGS so build, build-release, install, test, test-v, and bench all pick up fts5 and sqlite_vec uniformly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Define baseline SQLite schema for the vector-search co-located vectors.db (generations, embeddings, pending queue, embed runs, schema_version) plus an embed-backed Migrate() that applies it, enables WAL + foreign keys, and ensures a dim-specific vec0 table. EnsureVectorTable builds vectors_vec_dN on demand so multiple dimensions can coexist during model rollover.

…tire) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add ErrUnknownGeneration sentinel so callers can distinguish "no such generation" from generic DB errors. Wrap BeginTx and PrepareContext errors in Upsert. Expand test coverage for empty chunks, unknown generation, multi-chunk with mixed Truncated flags, and re-upsert replacement. Make existing tests capture query errors instead of swallowing them. Document ErrUnknownGeneration / ErrDimensionMismatch contract on Upsert's godoc. vec0 virtual tables reject INSERT OR REPLACE on primary-key conflicts, so Upsert now DELETEs the existing vec row before inserting to support idempotent re-embedding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds Filter.IsEmpty() and the sqlitevec Search path: look up the generation's dimension, resolve any structured filter into a message_id IN (...) fragment via a main-DB query, then run the vec0 MATCH query and return ranked hits ordered by ascending distance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… harden test helper Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Delete removes entries from embeddings and the dimension-specific vec0 table in one transaction; unknown generation wraps ErrUnknownGeneration; empty IDs is a no-op. Stats returns EmbeddingCount and PendingCount for the given generation, or aggregated across all generations when gen == 0. StorageBytes is left for the caller to derive from file size.

Adds the hybrid BM25 + ANN CTE (spec §5.3) using a fresh SQLite connection per query that opens msgvault.db and ATTACHes vectors.db under the "vec" alias. Filters resolve to JSON ID arrays and push into both signal branches via json_each(). RRF fusion is computed in SQL via FULL OUTER JOIN. Uses FTS5's built-in `rank` column instead of `bm25(messages_fts)` because the latter errors with "unable to use function bm25 in the requested context" when the FTS table participates in a JOIN. `k` for sqlite-vec and `LIMIT` for the BM25 CTE are interpolated from trusted integers on the request. Adds `MainPath` to Options, `Path() string` on Backend, and a stub `applySubjectBoost` — the real boost logic lives in Task 11. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SQLite ATTACH is per-connection, but *sql.DB is a pool. A query dispatched after ATTACH could check out a different physical connection and fail with "no such table: vec.vectors_vec_d768". Pin the pool to a single connection via SetMaxOpenConns(1) / SetMaxIdleConns(1) so every subsequent query reuses the attached connection. Add FusedSearch tests for FTS-only (VectorScore NaN), vector-only (BM25Score NaN), no-signals error, unknown generation, and dimension mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SQLite's bm25() function cannot be used inside subqueries or after joins ("unable to use function bm25 in the requested context"). The FTS5 built-in fts.rank column aliases to the default bm25 score, so reading fts.rank inside the CTE is semantically identical and correct. Discovered while implementing Task 9 (FusedSearch); spec updated so Task 11 (subject boost) and Task 17 (hybrid engine orchestration) don't copy the broken pattern.

New internal/vector/embed package providing a Client.Embed method that posts to {endpoint}/embeddings with the OpenAI-compatible JSON schema. Supports optional Bearer auth, per-request timeout, exponential-backoff retry on 5xx and network errors, and immediate failure on 4xx. Verifies returned vector dimensions and that every input index is populated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous sentinel (empty want = skip) silently hid the EmptySubjectAndBody case: it legitimately expects "" so the check short-circuited. Replace with a dedicated boolean so the empty-both assertion actually runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wraps the pending_embeddings table with a crash-safe Claim/Complete/ Release/ReclaimStale API. Each claim stamps rows with a crypto-random 16-hex token and timestamp; Complete/Release only act on rows whose token matches, so a stale or foreign token is a silent no-op. Stale claims can be reclaimed to the available pool via a cutoff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add Worker that drains pending_embeddings for a building generation by claiming a batch, fetching subject/body from the main DB, preprocessing, calling the embedding client, and upserting chunks into the backend. On embed or upsert error, claimed rows are released so another worker can retry them. Honors context cancellation and applies defaults for BatchSize (32) and Log. Drop the sqlite_vec build tag from queue.go since Queue is pure SQL and has no sqlite-vec dependency; keeps the queue usable by worker.go, which is tag-neutral. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds StaleThreshold field to WorkerDeps (default 10 minutes) and a ReclaimStale method that delegates to Queue.ReclaimStale. Callers invoke this at startup before RunOnce so rows claimed by crashed workers are released and retried. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds Enqueuer in internal/vector/embed that inserts message IDs into pending_embeddings for every non-retired generation. Wires the optional EmbedEnqueuer interface into the Syncer at the batch-commit point in both full and incremental sync paths, plus the rarer history-label-add path. Enqueue failures are logged and treated as non-fatal; missed IDs get picked up on the next full rebuild. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Implements internal/vector/hybrid.Fuse: combines BM25 and vector ranked lists via Reciprocal Rank Fusion (RRF), with an optional per-hit subject-term boost for messages whose subject matches any query term. BM25Score/VectorScore are NaN when a signal is absent, matching the FusingBackend contract. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…t path

GitHub's windows-latest runner doesn't ship sqlite3.h, which sqlite-vec's CGo bindings #include unconditionally with -DSQLITE_CORE. Build the Windows CI and release artifacts with -tags "fts5" only; vector search on Windows remains available to developers who provide the header themselves (e.g. via MSYS2's mingw-w64-x86_64-sqlite3 package). Documented the limitation in docs/vector-search.md and scripts/build.ps1. Linux release now installs libsqlite3-dev so the same header is available during the containerised build; mattn/go-sqlite3 still links its bundled amalgamation at runtime, so this only affects compile-time header resolution. Updated the Nix vendorHash to reflect the current go.sum (the new sqlite-vec-go-bindings dependency shifted it).

roborev-ci · 2026-04-20T15:28:46Z

roborev: Combined Review (`d87c29c`)

Verdict: No medium-or-higher findings; the diff looks clean based on the submitted reviews.

All reviewers either reported no actionable issues or provided no findings. No Medium, High, or Critical items to surface.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The previous commit dropped sqlite_vec from Windows builds as a shortcut, which meant Windows binaries would have silently lost vector search. windows-latest runners ship MSYS2, and installing mingw-w64-x86_64-sqlite3 + pointing CGO_CFLAGS at its include path is enough to satisfy sqlite-vec's #include <sqlite3.h>. Build Windows artifacts with -tags "fts5 sqlite_vec" again, restore the docs, and have scripts/build.ps1 auto-configure CGO_CFLAGS when the MSYS2 header is present.

roborev-ci · 2026-04-20T15:36:41Z

roborev: Combined Review (`f66eb71`)

Verdict: No medium-or-higher issues found; the reviewed diff looks clean.

All reviewers who provided findings agreed there were no issues requiring mention at Medium, High, or Critical severity.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Windows sqlite_vec builds hit undefined-reference link errors for ArrowArrayIsReleased, ArrowSchemaRelease, and other helpers from arrow-go/v18's arrow/c/helpers.h. The header declares those as plain `inline` rather than `static inline`, so C99 semantics require one TU to emit an external definition — and nothing in the cdata package does. GNU89 inline semantics always emit an external copy, which resolves the link. Apply the flag only on Windows, where the MinGW toolchain trips on this; Linux and macOS builds continue with their existing flags.

roborev-ci · 2026-04-20T15:45:41Z

roborev: Combined Review (`c9904cf`)

Verdict: No Medium, High, or Critical findings; the changes look clean.

All reviewed outputs found no actionable issues at Medium severity or above. Security review also found no security-impacting regressions in the diff.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

With only -fgnu89-inline, every translation unit that pulls in arrow-go/v18's helpers.h emits an external copy of each helper, so MinGW's ld then fails on duplicate definitions of ArrowArrayIsReleased et al. Passing --allow-multiple-definition to the linker lets it keep the first definition and drop the rest, which is the standard pairing for projects that can't patch the upstream header.

roborev-ci · 2026-04-20T15:52:41Z

roborev: Combined Review (`b2114dc`)

No medium-or-higher findings identified across the submitted reviews.

All reviewed outputs were clean at Medium, High, and Critical severity.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Empty-filter Search now loops instead of stopping at a fixed 2× over- fetch: each pass either satisfies k live hits or doubles fetch up to the generation's embedded count. Two new regression tests lock this in — one where >k deletions force an expansion round, and one where the full corpus produces fewer live hits than k and Search must terminate cleanly. scripts/build.ps1 now appends the required MSYS2 include path, -fgnu89-inline, and --allow-multiple-definition flags regardless of pre-existing CGO_CFLAGS/CGO_LDFLAGS exports, so users with their own SQLite include paths or linker flags don't silently skip the MinGW workaround. docs/vector-search.md replaces the Unix-shell snippet with PowerShell syntax and lists all three flags the Windows source build needs.

roborev-ci · 2026-04-20T16:04:40Z

roborev: Combined Review (`a7549ec`)

Verdict: No medium-or-higher findings; the reviewed changes look clean.

All provided review outputs either reported no issues or contained no findings to merge.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- F2: drop unused registerErr from sqlitevec/ext.go and document the forward-compatible nil return. - F6: isUniqueConstraintErr now uses errors.As against sqlite3.ErrConstraintUnique / ErrConstraintPrimaryKey instead of matching error text — robust against driver locale or formatting changes. - F8: search_messages MCP tool replaces the shared withOffset() with an inline parameter that explicitly states offset>0 is rejected for mode=vector and mode=hybrid. - F9: pure vector mode no longer reports a misleading rrf score (RRF needs two signals to fuse). vectorHitsToFused sets RRFScore=NaN, scoreBreakdown.RRF becomes a *float64 omitted when NaN, and the API/MCP docs note that rrf is absent in mode=vector.

…tivation enqueue - F12: TestFusedSearch_NullSubjectExcludedBySubjectFilter pins down the intentional drop of NULL-subject messages when a subject filter is supplied — exercises both the fused CTE path and the Search filter path so any future change has to touch them together. - F13: TestEmbedJob_Run_PostActivationEnqueueDrainsOnNextRun confirms the activation gate's eventual-consistency property: a pending row enqueued after pendingCount==0 still gets drained on the next tick via the active-generation top-up path. Added a comment in embed_job.go documenting why the gate is intentionally non-atomic.

- F5: Queue.Complete and Queue.Release replace per-id Exec loops with a single statement bound to json_each(message_ids). At default BatchSize=32 the change is small; at larger batch sizes it removes N round-trips against the vectors.db write lock. - F7: Enqueuer.EnqueueMessages uses one INSERT-OR-IGNORE … FROM json_each(?) per non-retired generation instead of N×M ExecContext calls. A 5,000-message incremental sync with two non-retired generations now issues 2 writes instead of 10,000, so sync flushes no longer starve the embed worker's Claim.

The fused CTE applied LIMIT in SQL and then did the subject-boost re-rank in Go, so any boost-eligible hit ranked at pre-boost rank Limit+1 was invisible — it never made it into the candidate set the Go code re-sorted. Over-fetch by subjectBoostOverfetchFactor (4×) when boost is active, apply the boost, re-sort, then trim back to the requested limit. TestFusedSearch_SubjectBoostOverFetchesBeyondLimit covers the case: seven non-boost-eligible subjects rank ahead of a single boost- eligible subject in pure ANN order. With Limit=3 and 5× boost, the boost-eligible hit is now promoted into the visible page; without the over-fetch the SQL LIMIT would have hidden it before re-rank ran.

…ed stale threshold - F1: ErrEmbeddingTimeout sentinel wraps context.DeadlineExceeded from the embed call; HTTP returns 503 embedding_timeout, MCP returns the matching tool error. Clients can now retry transient embedder slowness instead of treating it as a permanent failure. - F3: Document the FULL OUTER JOIN identity that makes the fused CTE's pool-size correlated subqueries safe to read on the first result row only (and zero-default safe when fused is empty). Add TestFusedSearch_EmptyFilteredSetReportsNotSaturated to lock the contract. - F14: Auto-derive Worker.StaleThreshold from EmbedTimeout × (MaxRetries + 1) × 2 with a 10-minute floor. Users who configure a long embed timeout (slow remote endpoint) no longer need to remember to bump StaleThreshold themselves; a ReclaimStale tick can no longer pull live work out from under a worker that is still inside its retry budget.

roborev-ci · 2026-04-20T19:26:46Z

roborev: Combined Review (`59325fb`)

Verdict: No medium, high, or critical findings.

The reviewed diff appears clean at Medium+ severity across the combined review outputs.

Notable note: one Low severity issue was reported in internal/api/handlers.go:430 and internal/api/handlers.go:459, but it is omitted here per instruction.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

… retries default - R1 (14845): CLI rendered NaN RRFScore as the literal string "NaN" in table mode and panicked encoding/json in JSON mode. Pure vector mode intentionally sets RRFScore=NaN to mark "no fusion happened"; the CLI now mirrors the API/MCP convention — formatOptionalScore for table output, omit the rrf_score key when NaN for JSON output. - R2 (14848): Subject boost was over-fetching 4 × Limit before reranking, which left correctness gaps for large boost factors or large KPerSignal. Replaced the heuristic with a full-pool fetch (2 × KPerSignal, the natural cap of bm25_used FULL OUTER JOIN ann_used). Adds TestFusedSearch_SubjectBoostPromotesDeepRankHit: 100x boost surfaces a hit ranked at the deepest pre-boost position into the top page. - R3 (14849 MED): Documented why chi/v5's Timeout middleware does not preempt our structured 503 embedding_timeout response — it wraps ctx with a deadline and only writes 504 in a deferred function after the inline handler returns, by which point our response is already on the wire. Made RequestTimeout configurable via ServerOptions and added TestHandleSearch_HybridEmbeddingTimeoutFiresChi with a 100ms cap + blocking embedder; locks the contract so a future chi switch to http.TimeoutHandler-style preemption fails the test instead of silently breaking the JSON error path. - R4 (14849 LOW): derivedStaleThreshold treated EmbedMaxRetries=0 as one attempt while embed.NewClient defaults zero to 3 retries — a worker built with EmbedTimeout set but EmbedMaxRetries left zero could derive a stale threshold shorter than the client's real retry budget and reclaim live work. Normalize zero to 3 and add a regression case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-04-20T19:49:45Z

roborev: Combined Review (`d221c83`)

Verdict: No medium-or-higher issues found in this PR.

Reviewed outputs from all agents; no deduplicated Medium, High, or Critical findings to report.

Security review also found no new security issues in the diff.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- R1 (14853 MED): derivedStaleThreshold mis-modeled embed.Client's MaxRetries semantics — the client treats it as the TOTAL number of HTTP attempts (the for-loop runs `for attempt := 1; attempt <= MaxRetries` and the docstring says "MaxRetries total attempts"), not retries-after- the-first. The worker was computing `attempts := maxRetries + 1`, over-deriving the budget by one timeout window for every caller. Aligned the math (`attempts := maxRetries`) and updated the WorkerDeps comment to call out the total-attempts semantics. Adjusted test expectations: 5m × 3 attempts × 2 = 30m (was 40m), and 10m × 3 × 2 = 60m for the zero-default case (was 80m). - R2 (14853 LOW): http.Server.WriteTimeout was hardcoded to 60s, so a caller setting RequestTimeout > 60s would still lose the race to the underlying socket cutoff and never deliver the structured error. Tied WriteTimeout to s.requestTimeout + 5s; the buffer covers chi's deferred WriteHeader and the response flush. Documented the contract on ServerOptions.RequestTimeout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-04-20T19:59:48Z

roborev: Combined Review (`ad356f8`)

Verdict: No medium-or-higher findings; the reviewed changes look clean.

All reviewers reported no actionable issues at Medium, High, or Critical severity. Security review also did not identify an actionable concern with the added github.com/asg017/sqlite-vec-go-bindings v0.1.6 dependency.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Vector search documentation lives in the docs site now. Delete docs/vector-search.md, revert the vector additions to docs/api.md, and point the README link at https://msgvault.io/usage/vector-search/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-04-20T21:14:41Z

roborev: Combined Review (`9b913e2`)

Verdict: No medium-or-higher findings identified; the reviewed diff looks clean.

All provided reviews found no actionable Medium, High, or Critical issues.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The new name follows the existing verb-first CLI pattern (build-cache, sync-full, import-emlx) and is discoverable to users who don't know what an embedding is. "embed" stays as a cobra alias so existing cronjobs and scripts keep working. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-04-20T21:32:25Z

roborev: Combined Review (`7c5530a`)

Vector/hybrid search is close, but the PR still has 1 high-severity performance issue and 4 medium-severity correctness/reliability gaps.

High

High — internal/vector/sqlitevec/fused.go (openFusedConn): FusedSearch opens a fresh *sql.DB with sql.Open on every query. That adds avoidable pool/driver/database-open overhead to the hot path and is a poor fit for database/sql, especially for repeated searches.
Fix: Keep a long-lived attached connection/pool on the backend, or reuse a *sql.Conn from the existing main DB.

Medium

Medium — cmd/msgvault/cmd/search.go:71: The CLI rejects --mode=vector|hybrid in remote mode even though this PR adds mode/explain support to /api/v1/search. That regresses remote CLI users by blocking semantic/hybrid search entirely.
Fix: Thread mode and explain through runRemoteSearch and the remote query builder instead of rejecting them.
Medium — cmd/msgvault/cmd/search_vector.go:37: runHybridSearch opens msgvault.db directly and skips the usual local-search preparation path (InitSchema() plus FTS/backfill/index checks). On fresh or upgraded archives, --mode=hybrid can run against missing or stale FTS state and return incomplete results or fail.
Fix: Reuse the normal local-search setup before constructing the hybrid engine, or explicitly run schema/init and FTS preparation in the vector/hybrid path.
Medium — internal/api/handlers.go:476, internal/mcp/handlers.go:321, internal/mcp/handlers.go:400: Bulk hydration failures are logged and then treated as summaries = nil, which converts a successful vector search into a successful response with zero hits. That is a silent false negative rather than a partial-result case.
Fix: Propagate hydration failures as API/tool errors, and only drop IDs that are genuinely missing after a successful summary fetch.
Medium — internal/query/sqlite.go (GetMessageSummariesByIDs), internal/store/api.go (GetMessagesSummariesByIDs): Hydration binds message IDs as IN (?, ?, ...). If the result set grows past SQLite’s parameter limit, hydration fails with too many SQL variables.
Fix: Pass IDs as a JSON array and use IN (SELECT value FROM json_each(?)), consistent with the filter-resolution approach already used elsewhere.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The HTTP API reference now lives on msgvault.io/api-server/. Keeping an in-repo copy invites drift. Point the README link at the docs site instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-04-20T21:51:08Z

roborev: Combined Review (`b2c941c`)

Verdict: Not ready to merge due to 1 high-severity correctness issue and 2 medium-severity regressions.

High

Hydration failures are silently converted into empty/partial success responses
Location: internal/api/handlers.go, internal/mcp/handlers.go
The new bulk-hydration path logs GetMessagesSummariesByIDs failures but still returns success with missing results. That misreports backend errors as "no matches" and can hide real search hits.
Suggested fix: Return a request/tool error on hydration failure, or fall back to the prior per-message lookup path instead of dropping hits.

Medium

Bulk summary hydration does not fully preserve the existing search summary shape
Location: internal/store/api.go in GetMessagesSummariesByIDs
The new hydrator restores To and Labels, but not other summary fields such as Cc. Since hybrid/vector search is switching to this path specifically to match existing search responses, this creates a response-shape regression versus FTS results.
Suggested fix: Hydrate the full payload expected by toMessageSummary, or keep the older projection until the bulk path is complete.
Remote CLI blocks newly added server-side vector/hybrid search support
Location: cmd/msgvault/cmd/search.go
Remote mode still rejects --mode=vector|hybrid even though the server now supports those modes on /api/v1/search. As a result, remote CLI users cannot use the new feature.
Suggested fix: Forward mode and explain in remote requests and render the remote hybrid response instead of rejecting it client-side.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The vector search feature has never been released, so there is no shipped "embed" command for users to be compatible with. Drop the alias and the Long-text mention of it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

roborev-ci · 2026-04-20T22:12:39Z

roborev: Combined Review (`18aa250`)

Verdict: No medium-or-higher issues identified in the reviewed diff.

All review outputs agree the changes are clean at Medium, High, and Critical severity.

Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Reconciles 172 ahead / 37 behind state with upstream wesm/msgvault. Strategy: accept upstream wholesale for connector code (M365, iMessage, gvoice, IMAP XOAUTH2) where upstream's implementations are more battle-tested and already cover the fork's bug fixes. Hand-resolve store/sync/search/cmd/build files to union both feature sets. Preserved from fork: - SQLCipher encryption-at-rest (passphrase, AES-GCM token encryption) - Advisory file locking (tryLock, lockFile, syscall.Flock) - AI Archive Intelligence subsystem (internal/embedding/, vec_messages, pipeline_runs, --semantic search) - Web UI (React/TypeScript SPA in web/) - Hot-path search tokenizer (dispatchToken, toLowerFast, parseSizeFast) - migrateAddContentID, InitVectorTable, content_id attachment column Adopted from upstream: - Dialect interface + loggedDB wrapper + structured logging pipeline (wesm#276 PostgreSQL dialect refactor foundation) - OpenReadOnly() for MCP read-only access - IsBusyError, SchemaStale helpers - Unified text import (wesm#238) — M365 OAuth (wesm#228), iMessage (wesm#224), Google Voice (wesm#225) — all wholesale - Search enhancements: regex, FTS5 snippets, sorting (wesm#252), domain normalization (normalizeAddr, looksLikeDomain, gTLDs) - rebuild-fts command (wesm#287), 8 bug fixes from wesm#254 - IMAP date filtering (wesm#222), greeting wait (wesm#248) - Vector subsystem (wesm#277) — coexists with fork's AI Archive Intelligence as parallel implementation; future cleanup needed Build/runtime fixes applied during merge: - Replaced mattn/go-sqlite3 imports with mutecomm/go-sqlcipher/v4 (drop-in API-compatible) to resolve duplicate symbol linker errors - Dropped sqlite_vec from default BUILD_TAGS (requires SQLite 3.38+ APIs sqlcipher v4.4.2 does not expose; re-enable when sqlcipher upgrades) - safeRowsAffected helper in db_logger.go: defer recover around RowsAffected() call (sqlcipher returns nil internal Result for multi-statement DDL) - Wired normalizeAddr into hot-path tokenizer for from:/to:/cc:/bcc: Stubbed under unreachable build tag (need follow-up decision): - cmd/msgvault/cmd/sync_gvoice.go — fork's sync API obsolete vs upstream's import-based gvoice - cmd/msgvault/cmd/sync_imessage.go — same situation Verified: go build ./... passing, go vet clean, 45/45 test packages pass with 0 failures. See MERGE_REPORT.md for file-by-file resolution notes.

wesm and others added 30 commits April 19, 2026 10:39

Add VectorConfig scaffolding and wire into top-level Config

d4230a1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tighten URL validation; use strings.Contains; apply vector defaults o…

c363e95

…n no-config path

Define vector.Backend and FusingBackend interfaces with core types

d52a036

Implement sqlite-vec backend generation lifecycle (Create/Activate/Re…

cd68d43

…tire) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Implement Backend.Upsert with dimension verification and pending cleanup

bb6adc5

Address T7 review nits: expand Search tests, correct scoring comment,…

ef457a5

… harden test helper Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add email-aware embedding preprocessing (quotes, signatures, subject)

77e67c6

Add hybrid search engine with fingerprint check and FusingBackend fas…

9d98d31

…t path

wesm added 5 commits April 20, 2026 14:09

Remove in-repo docs/api.md; docs site is authoritative

b2c941c

The HTTP API reference now lives on msgvault.io/api-server/. Keeping an in-repo copy invites drift. Point the README link at the docs site instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wesm merged commit 67fab47 into main Apr 20, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add semantic and hybrid search via local vector embeddings#277

Add semantic and hybrid search via local vector embeddings#277
wesm merged 97 commits intomainfrom
feat/vector-search

wesm commented Apr 20, 2026 •

edited

Loading

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

roborev-ci Bot commented Apr 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wesm commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's new

Search modes

Find similar messages

Embedding lifecycle

Stats and observability

Storage

Enabling

Compatibility

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (d87c29c)

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (f66eb71)

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (c9904cf)

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (b2114dc)

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (a7549ec)

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (59325fb)

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (d221c83)

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (ad356f8)

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (9b913e2)

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (7c5530a)

High

Medium

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (b2c941c)

High

Medium

Uh oh!

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (18aa250)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wesm commented Apr 20, 2026 •

edited

Loading

roborev: Combined Review (`d87c29c`)

roborev: Combined Review (`f66eb71`)

roborev: Combined Review (`c9904cf`)

roborev: Combined Review (`b2114dc`)

roborev: Combined Review (`a7549ec`)

roborev: Combined Review (`59325fb`)

roborev: Combined Review (`d221c83`)

roborev: Combined Review (`ad356f8`)

roborev: Combined Review (`9b913e2`)

roborev: Combined Review (`7c5530a`)

roborev: Combined Review (`b2c941c`)

roborev: Combined Review (`18aa250`)