Skip to content

Add semantic and hybrid search via local vector embeddings#277

Merged
wesm merged 97 commits intomainfrom
feat/vector-search
Apr 20, 2026
Merged

Add semantic and hybrid search via local vector embeddings#277
wesm merged 97 commits intomainfrom
feat/vector-search

Conversation

@wesm
Copy link
Copy Markdown
Owner

@wesm wesm commented Apr 20, 2026

Summary

Adds optional semantic and hybrid search to msgvault, using vector embeddings generated against any OpenAI-compatible endpoint (Ollama, llama.cpp, LM Studio, etc.). Keyword (FTS5) search remains the default; vector support is opt-in behind a build tag and a config block.

What's new

Search modes

  • mode=vector — pure semantic search; the free-text query is embedded and nearest neighbors are returned.
  • mode=hybrid — BM25 and vector results fused via Reciprocal Rank Fusion, with an optional subject boost.
  • mode=fts — existing keyword search, still the default.

All three modes are available through:

  • msgvault search "..." --mode hybrid [--explain] [--json]
  • GET /api/v1/search?q=...&mode=hybrid
  • MCP search_messages tool (mode + explain arguments)

Filter operators (from:, to:, label:, has:attachment, after:, before:, account:) work uniformly across all three modes.

Find similar messages

New MCP tool find_similar_messages takes a seed message_id and returns nearest-neighbor messages, excluding the seed itself. Supports account, after, before, has_attachment filters.

Embedding lifecycle

  • msgvault build-embeddings [--full-rebuild] [--yes] — backfill or rebuild the index. Resumable: rerun to pick up where a prior run left off.
  • Generations — every rebuild creates a new generation with a model:dimension fingerprint. The old generation keeps serving queries until the new one is fully built, then activates atomically.
  • Incremental updates — new and repaired messages are enqueued automatically during Gmail/IMAP sync and drained by the worker.
  • Scheduler integrationmsgvault serve runs the worker on a cron ([vector.embed.schedule] cron = "...") and optionally after every successful scheduled sync (run_after_sync = true).

Stats and observability

  • GET /api/v1/stats includes a vector_search sub-object: active generation, fingerprint, message count, pending-queue depth, building generation (if any).
  • --explain on search surfaces per-hit scores (BM25, vector distance, RRF, subject boost).
  • Clear sentinel errors for common conditions: vector_not_enabled, index_stale, index_building, missing_free_text, pagination_unsupported, invalid_mode, embedding_timeout.

Storage

  • Vectors live in a separate vectors.db (sqlite-vec ANN index + queue + generations table) next to msgvault.db. The main database is unchanged.
  • Attached over a single connection with foreign_keys=ON on every open.

Enabling

  1. Build with vector support (the default make build already does this):
    go build -tags "fts5 sqlite_vec"
    
  2. Run an embedding endpoint (Ollama example):
    ollama pull nomic-embed-text
    
  3. Add a [vector] block to ~/.msgvault/config.toml:
    [vector]
    enabled = true
    backend = "sqlite-vec"
    
    [vector.embeddings]
    endpoint = "http://localhost:11434/v1"
    model = "nomic-embed-text"
    dimension = 768
    max_input_chars = 2000   # match to the embedder's context window
  4. Backfill:
    msgvault build-embeddings --full-rebuild --yes
    
  5. For ongoing maintenance, chain the drain onto your existing sync cron:
    0 2 * * * msgvault sync-incremental you@gmail.com && msgvault build-embeddings
    

Full guide: msgvault.io/usage/vector-search/. HTTP response shapes: msgvault.io/api-server/.

Compatibility

  • Vector support is strictly additive. Binaries built without -tags sqlite_vec, or configs without [vector] enabled = true, behave exactly as before.
  • No existing tables, API shapes, or CLI flags change for users who don't enable vector search.

🤖 Generated with Claude Code

wesm and others added 30 commits April 19, 2026 10:39
Proposes hybrid BM25 + vector search over the email corpus with
embeddings generated via OpenAI-compatible endpoints (Ollama, LM Studio,
llama.cpp-server). Vectors live in a separate ~/.msgvault/vectors.db
attached at query time so hybrid RRF fusion stays a single SQLite
query. Search is exposed through CLI, HTTP, and stdio MCP; the TUI is
unchanged. Reranker and RAG are deferred to follow-up specs.
- Track an active index generation with a model:dimension fingerprint
  so search refuses to run on a mismatched corpus instead of silently
  comparing cross-model vectors. Model rotation is an explicit
  msgvault embed --full-rebuild, never an auto-invalidation. Remove
  the --model one-shot override from msgvault search.
- Rewrite the fused SQL so account, date, label, and attachment
  filters push into both signal CTEs. The previous post-fusion filter
  starved recall on narrow filters.
- Extend the existing msgvault search CLI and MCP server instead of
  treating them as greenfield. Gmail-style query syntax is preserved;
  --mode and --explain flags are additive. All 8 existing MCP tools
  are unchanged; search_messages gains optional mode/explain args and
  find_similar_messages is the only new tool.
- Drop total from vector/hybrid response shape and use
  returned/has_more instead. total has no stable definition without a
  similarity threshold or full scan. fts responses still report total.
- Add a Compatibility and versioning section enumerating every
  affected public surface and its additive delta.
- Dual-enqueue during rebuild: sync writes pending_embeddings for
  every non-retired generation, not just active. The building
  generation now stays current with ingest throughout a long rebuild,
  so the atomic swap can't drop messages that arrived mid-build.
- Backend contract carries GenerationID on Upsert, Search, Delete,
  Stats, plus new CreateGeneration / ActivateGeneration /
  RetireGeneration methods. Removes the ambiguity when two same-dim
  generations coexist in vectors_vec_dN.
- One explicit pagination rule per mode: fts keeps page/page_size as
  today; vector/hybrid accepts only page=1 with page_size capped at
  max_page_size_hybrid. page>1 returns HTTP 400
  pagination_unsupported. CLI --offset errors for non-fts modes.
- index_building now only fires when no active generation exists
  (first-ever build). Rebuilds with an active generation are
  transparent to search.
- Default mode is fts everywhere at the public API. Removed
  default_mode from config.
- Switch the embedding worker from claim-by-delete to claim-by-mark:
  pending_embeddings gains claimed_at and claim_token; completion
  transactionally DELETEs the row and writes the embedding. Startup
  reclaim pass releases stale claims. The previous recovery scan was
  unsafe under dual-enqueue because it relied on needs_embedding as
  an external source of truth.
- Drop the needs_embedding column from messages. pending_embeddings
  is the only source of truth for pending work. Initial enqueue on
  CreateGeneration seeds the queue from SELECT id FROM messages in
  the same transaction that creates the generation, so there is no
  window where a generation exists with an incomplete queue.
- Rewrite the sample fused SQL against the actual msgvault schema:
  messages.sender_id -> participants, messages.source_id -> sources,
  EXISTS on message_labels. Filter values are resolved to IDs at the
  Go layer (matching internal/query) and the SQL takes JSON arrays
  of IDs. Adds deleted_at IS NULL exclusion.
- Fix stale package-layout note: /api/v1/search/hybrid and /vector
  sub-endpoints were removed in the previous revision; handlers.go
  gains a mode query param on the existing /api/v1/search handler.
  MCP server location and command wiring updated accordingly.
- Disallow --account with --full-rebuild. Generations are corpus-wide
  by definition (the activation flips the entire active index), so a
  partial rebuild cannot safely replace the current generation. The
  CLI rejects the combination with a clear error. --account remains
  valid on incremental runs as a drain-prioritization filter.
- Fix the deletion column in the sample SQL: existing search paths
  in internal/query/sqlite.go and internal/search/parser.go exclude
  deleted_from_source_at IS NULL, not deleted_at. The previous
  wording would have changed which messages are searchable.
- Unify the /api/v1/stats extension: one optional vector_search
  sub-object containing active_generation, building_generation, and
  pending_embeddings_total. Both the compatibility table and the
  observability section now reference the same shape.
Unify on deleted_from_source_at IS NULL as the one deletion predicate
used everywhere: initial generation seed, sync-side enqueue, and the
fused search SQL. This matches existing msgvault search behavior in
internal/query/sqlite.go and internal/search/parser.go.

Also document the policy for messages that are marked deleted after
embedding: not actively purged in MVP (search hides them via the same
predicate; the next --full-rebuild drops them naturally). Active
purging is a storage-hygiene improvement, not a correctness fix.
Decomposes the design spec into 30 TDD tasks across 11 phases:
foundation (config, backend interface, sqlite-vec extension, schema);
generations and storage ops (Create/Activate/Retire, Upsert, Search,
Delete, Stats, FusedSearch); embedding pipeline (HTTP client,
preprocessing, claim-mark-complete queue, worker, reclaim, sync-side
enqueuer); hybrid RRF and search engine; CLI extensions (--mode,
--explain, msgvault embed); HTTP and MCP surfaces; scheduler wiring;
documentation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces internal/vector/sqlitevec, a small package that wires the
sqlite-vec SQLite extension into mattn/go-sqlite3 when the sqlite_vec
build tag is set. Callers invoke RegisterExtension() once at startup;
it calls sqlite_vec.Auto() (a process-global sqlite3_auto_extension
hook) and registers a "sqlite3_vec" driver alias for database/sql.

The stub (!sqlite_vec) returns ErrNotBuilt and the default "sqlite3"
driver name so the rest of the tree compiles without the extension.

Parametrizes the Makefile on BUILD_TAGS so build, build-release,
install, test, test-v, and bench all pick up fts5 and sqlite_vec
uniformly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Define baseline SQLite schema for the vector-search co-located
vectors.db (generations, embeddings, pending queue, embed runs,
schema_version) plus an embed-backed Migrate() that applies it,
enables WAL + foreign keys, and ensures a dim-specific vec0 table.

EnsureVectorTable builds vectors_vec_dN on demand so multiple
dimensions can coexist during model rollover.
…tire)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add ErrUnknownGeneration sentinel so callers can distinguish "no such
generation" from generic DB errors. Wrap BeginTx and PrepareContext
errors in Upsert. Expand test coverage for empty chunks, unknown
generation, multi-chunk with mixed Truncated flags, and re-upsert
replacement. Make existing tests capture query errors instead of
swallowing them. Document ErrUnknownGeneration / ErrDimensionMismatch
contract on Upsert's godoc.

vec0 virtual tables reject INSERT OR REPLACE on primary-key conflicts,
so Upsert now DELETEs the existing vec row before inserting to support
idempotent re-embedding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Filter.IsEmpty() and the sqlitevec Search path: look up the
generation's dimension, resolve any structured filter into a
message_id IN (...) fragment via a main-DB query, then run the vec0
MATCH query and return ranked hits ordered by ascending distance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… harden test helper

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Delete removes entries from embeddings and the dimension-specific vec0
table in one transaction; unknown generation wraps ErrUnknownGeneration;
empty IDs is a no-op.

Stats returns EmbeddingCount and PendingCount for the given generation,
or aggregated across all generations when gen == 0. StorageBytes is
left for the caller to derive from file size.
Adds the hybrid BM25 + ANN CTE (spec §5.3) using a fresh SQLite
connection per query that opens msgvault.db and ATTACHes vectors.db
under the "vec" alias. Filters resolve to JSON ID arrays and push into
both signal branches via json_each(). RRF fusion is computed in SQL via
FULL OUTER JOIN.

Uses FTS5's built-in `rank` column instead of `bm25(messages_fts)`
because the latter errors with "unable to use function bm25 in the
requested context" when the FTS table participates in a JOIN. `k` for
sqlite-vec and `LIMIT` for the BM25 CTE are interpolated from trusted
integers on the request.

Adds `MainPath` to Options, `Path() string` on Backend, and a stub
`applySubjectBoost` — the real boost logic lives in Task 11.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQLite ATTACH is per-connection, but *sql.DB is a pool. A query
dispatched after ATTACH could check out a different physical connection
and fail with "no such table: vec.vectors_vec_d768". Pin the pool to a
single connection via SetMaxOpenConns(1) / SetMaxIdleConns(1) so every
subsequent query reuses the attached connection.

Add FusedSearch tests for FTS-only (VectorScore NaN), vector-only
(BM25Score NaN), no-signals error, unknown generation, and dimension
mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQLite's bm25() function cannot be used inside subqueries or after joins
("unable to use function bm25 in the requested context"). The FTS5
built-in fts.rank column aliases to the default bm25 score, so reading
fts.rank inside the CTE is semantically identical and correct.

Discovered while implementing Task 9 (FusedSearch); spec updated so
Task 11 (subject boost) and Task 17 (hybrid engine orchestration) don't
copy the broken pattern.
New internal/vector/embed package providing a Client.Embed method that
posts to {endpoint}/embeddings with the OpenAI-compatible JSON schema.
Supports optional Bearer auth, per-request timeout, exponential-backoff
retry on 5xx and network errors, and immediate failure on 4xx. Verifies
returned vector dimensions and that every input index is populated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous sentinel (empty want = skip) silently hid the
EmptySubjectAndBody case: it legitimately expects "" so the check
short-circuited. Replace with a dedicated boolean so the empty-both
assertion actually runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps the pending_embeddings table with a crash-safe Claim/Complete/
Release/ReclaimStale API. Each claim stamps rows with a crypto-random
16-hex token and timestamp; Complete/Release only act on rows whose
token matches, so a stale or foreign token is a silent no-op. Stale
claims can be reclaimed to the available pool via a cutoff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add Worker that drains pending_embeddings for a building generation by
claiming a batch, fetching subject/body from the main DB, preprocessing,
calling the embedding client, and upserting chunks into the backend. On
embed or upsert error, claimed rows are released so another worker can
retry them. Honors context cancellation and applies defaults for
BatchSize (32) and Log.

Drop the sqlite_vec build tag from queue.go since Queue is pure SQL and
has no sqlite-vec dependency; keeps the queue usable by worker.go, which
is tag-neutral.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds StaleThreshold field to WorkerDeps (default 10 minutes) and a
ReclaimStale method that delegates to Queue.ReclaimStale. Callers
invoke this at startup before RunOnce so rows claimed by crashed
workers are released and retried.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds Enqueuer in internal/vector/embed that inserts message IDs into
pending_embeddings for every non-retired generation. Wires the optional
EmbedEnqueuer interface into the Syncer at the batch-commit point in
both full and incremental sync paths, plus the rarer history-label-add
path. Enqueue failures are logged and treated as non-fatal; missed IDs
get picked up on the next full rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements internal/vector/hybrid.Fuse: combines BM25 and vector ranked
lists via Reciprocal Rank Fusion (RRF), with an optional per-hit
subject-term boost for messages whose subject matches any query term.
BM25Score/VectorScore are NaN when a signal is absent, matching the
FusingBackend contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GitHub's windows-latest runner doesn't ship sqlite3.h, which
sqlite-vec's CGo bindings #include unconditionally with -DSQLITE_CORE.
Build the Windows CI and release artifacts with -tags "fts5" only;
vector search on Windows remains available to developers who provide
the header themselves (e.g. via MSYS2's mingw-w64-x86_64-sqlite3
package). Documented the limitation in docs/vector-search.md and
scripts/build.ps1.

Linux release now installs libsqlite3-dev so the same header is
available during the containerised build; mattn/go-sqlite3 still links
its bundled amalgamation at runtime, so this only affects compile-time
header resolution.

Updated the Nix vendorHash to reflect the current go.sum (the new
sqlite-vec-go-bindings dependency shifted it).
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (d87c29c)

Verdict: No medium-or-higher findings; the diff looks clean based on the submitted reviews.

All reviewers either reported no actionable issues or provided no findings. No Medium, High, or Critical items to surface.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The previous commit dropped sqlite_vec from Windows builds as a
shortcut, which meant Windows binaries would have silently lost vector
search. windows-latest runners ship MSYS2, and installing
mingw-w64-x86_64-sqlite3 + pointing CGO_CFLAGS at its include path is
enough to satisfy sqlite-vec's #include <sqlite3.h>. Build Windows
artifacts with -tags "fts5 sqlite_vec" again, restore the docs, and
have scripts/build.ps1 auto-configure CGO_CFLAGS when the MSYS2
header is present.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (f66eb71)

Verdict: No medium-or-higher issues found; the reviewed diff looks clean.

All reviewers who provided findings agreed there were no issues requiring mention at Medium, High, or Critical severity.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Windows sqlite_vec builds hit undefined-reference link errors for
ArrowArrayIsReleased, ArrowSchemaRelease, and other helpers from
arrow-go/v18's arrow/c/helpers.h. The header declares those as plain
`inline` rather than `static inline`, so C99 semantics require one TU
to emit an external definition — and nothing in the cdata package
does. GNU89 inline semantics always emit an external copy, which
resolves the link.

Apply the flag only on Windows, where the MinGW toolchain trips on
this; Linux and macOS builds continue with their existing flags.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (c9904cf)

Verdict: No Medium, High, or Critical findings; the changes look clean.

All reviewed outputs found no actionable issues at Medium severity or above. Security review also found no security-impacting regressions in the diff.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

With only -fgnu89-inline, every translation unit that pulls in
arrow-go/v18's helpers.h emits an external copy of each helper, so
MinGW's ld then fails on duplicate definitions of ArrowArrayIsReleased
et al. Passing --allow-multiple-definition to the linker lets it keep
the first definition and drop the rest, which is the standard pairing
for projects that can't patch the upstream header.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (b2114dc)

No medium-or-higher findings identified across the submitted reviews.

All reviewed outputs were clean at Medium, High, and Critical severity.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Empty-filter Search now loops instead of stopping at a fixed 2× over-
fetch: each pass either satisfies k live hits or doubles fetch up to
the generation's embedded count. Two new regression tests lock this
in — one where >k deletions force an expansion round, and one where
the full corpus produces fewer live hits than k and Search must
terminate cleanly.

scripts/build.ps1 now appends the required MSYS2 include path,
-fgnu89-inline, and --allow-multiple-definition flags regardless of
pre-existing CGO_CFLAGS/CGO_LDFLAGS exports, so users with their own
SQLite include paths or linker flags don't silently skip the MinGW
workaround.

docs/vector-search.md replaces the Unix-shell snippet with PowerShell
syntax and lists all three flags the Windows source build needs.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (a7549ec)

Verdict: No medium-or-higher findings; the reviewed changes look clean.

All provided review outputs either reported no issues or contained no findings to merge.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

wesm added 5 commits April 20, 2026 14:09
- F2: drop unused registerErr from sqlitevec/ext.go and document the
  forward-compatible nil return.
- F6: isUniqueConstraintErr now uses errors.As against
  sqlite3.ErrConstraintUnique / ErrConstraintPrimaryKey instead of
  matching error text — robust against driver locale or formatting
  changes.
- F8: search_messages MCP tool replaces the shared withOffset() with
  an inline parameter that explicitly states offset>0 is rejected
  for mode=vector and mode=hybrid.
- F9: pure vector mode no longer reports a misleading rrf score
  (RRF needs two signals to fuse). vectorHitsToFused sets
  RRFScore=NaN, scoreBreakdown.RRF becomes a *float64 omitted when
  NaN, and the API/MCP docs note that rrf is absent in mode=vector.
…tivation enqueue

- F12: TestFusedSearch_NullSubjectExcludedBySubjectFilter pins down
  the intentional drop of NULL-subject messages when a subject filter
  is supplied — exercises both the fused CTE path and the Search
  filter path so any future change has to touch them together.
- F13: TestEmbedJob_Run_PostActivationEnqueueDrainsOnNextRun confirms
  the activation gate's eventual-consistency property: a pending row
  enqueued after pendingCount==0 still gets drained on the next tick
  via the active-generation top-up path. Added a comment in
  embed_job.go documenting why the gate is intentionally non-atomic.
- F5: Queue.Complete and Queue.Release replace per-id Exec loops with
  a single statement bound to json_each(message_ids). At default
  BatchSize=32 the change is small; at larger batch sizes it removes
  N round-trips against the vectors.db write lock.
- F7: Enqueuer.EnqueueMessages uses one INSERT-OR-IGNORE … FROM
  json_each(?) per non-retired generation instead of N×M ExecContext
  calls. A 5,000-message incremental sync with two non-retired
  generations now issues 2 writes instead of 10,000, so sync flushes
  no longer starve the embed worker's Claim.
The fused CTE applied LIMIT in SQL and then did the subject-boost
re-rank in Go, so any boost-eligible hit ranked at pre-boost rank
Limit+1 was invisible — it never made it into the candidate set the
Go code re-sorted. Over-fetch by subjectBoostOverfetchFactor (4×)
when boost is active, apply the boost, re-sort, then trim back to
the requested limit.

TestFusedSearch_SubjectBoostOverFetchesBeyondLimit covers the case:
seven non-boost-eligible subjects rank ahead of a single boost-
eligible subject in pure ANN order. With Limit=3 and 5× boost, the
boost-eligible hit is now promoted into the visible page; without
the over-fetch the SQL LIMIT would have hidden it before re-rank
ran.
…ed stale threshold

- F1: ErrEmbeddingTimeout sentinel wraps context.DeadlineExceeded
  from the embed call; HTTP returns 503 embedding_timeout, MCP
  returns the matching tool error. Clients can now retry transient
  embedder slowness instead of treating it as a permanent failure.
- F3: Document the FULL OUTER JOIN identity that makes the fused
  CTE's pool-size correlated subqueries safe to read on the first
  result row only (and zero-default safe when fused is empty). Add
  TestFusedSearch_EmptyFilteredSetReportsNotSaturated to lock the
  contract.
- F14: Auto-derive Worker.StaleThreshold from
  EmbedTimeout × (MaxRetries + 1) × 2 with a 10-minute floor. Users
  who configure a long embed timeout (slow remote endpoint) no
  longer need to remember to bump StaleThreshold themselves; a
  ReclaimStale tick can no longer pull live work out from under a
  worker that is still inside its retry budget.
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (59325fb)

Verdict: No medium, high, or critical findings.

The reviewed diff appears clean at Medium+ severity across the combined review outputs.

Notable note: one Low severity issue was reported in internal/api/handlers.go:430 and internal/api/handlers.go:459, but it is omitted here per instruction.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

… retries default

- R1 (14845): CLI rendered NaN RRFScore as the literal string "NaN" in
  table mode and panicked encoding/json in JSON mode. Pure vector mode
  intentionally sets RRFScore=NaN to mark "no fusion happened"; the
  CLI now mirrors the API/MCP convention — formatOptionalScore for
  table output, omit the rrf_score key when NaN for JSON output.
- R2 (14848): Subject boost was over-fetching 4 × Limit before
  reranking, which left correctness gaps for large boost factors or
  large KPerSignal. Replaced the heuristic with a full-pool fetch
  (2 × KPerSignal, the natural cap of bm25_used FULL OUTER JOIN
  ann_used). Adds TestFusedSearch_SubjectBoostPromotesDeepRankHit:
  100x boost surfaces a hit ranked at the deepest pre-boost position
  into the top page.
- R3 (14849 MED): Documented why chi/v5's Timeout middleware does not
  preempt our structured 503 embedding_timeout response — it wraps
  ctx with a deadline and only writes 504 in a deferred function
  after the inline handler returns, by which point our response is
  already on the wire. Made RequestTimeout configurable via
  ServerOptions and added TestHandleSearch_HybridEmbeddingTimeoutFiresChi
  with a 100ms cap + blocking embedder; locks the contract so a
  future chi switch to http.TimeoutHandler-style preemption fails the
  test instead of silently breaking the JSON error path.
- R4 (14849 LOW): derivedStaleThreshold treated EmbedMaxRetries=0 as
  one attempt while embed.NewClient defaults zero to 3 retries — a
  worker built with EmbedTimeout set but EmbedMaxRetries left zero
  could derive a stale threshold shorter than the client's real retry
  budget and reclaim live work. Normalize zero to 3 and add a
  regression case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (d221c83)

Verdict: No medium-or-higher issues found in this PR.

Reviewed outputs from all agents; no deduplicated Medium, High, or Critical findings to report.

Security review also found no new security issues in the diff.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

- R1 (14853 MED): derivedStaleThreshold mis-modeled embed.Client's
  MaxRetries semantics — the client treats it as the TOTAL number of
  HTTP attempts (the for-loop runs `for attempt := 1; attempt <= MaxRetries`
  and the docstring says "MaxRetries total attempts"), not retries-after-
  the-first. The worker was computing `attempts := maxRetries + 1`,
  over-deriving the budget by one timeout window for every caller.
  Aligned the math (`attempts := maxRetries`) and updated the WorkerDeps
  comment to call out the total-attempts semantics. Adjusted test
  expectations: 5m × 3 attempts × 2 = 30m (was 40m), and
  10m × 3 × 2 = 60m for the zero-default case (was 80m).
- R2 (14853 LOW): http.Server.WriteTimeout was hardcoded to 60s, so a
  caller setting RequestTimeout > 60s would still lose the race to the
  underlying socket cutoff and never deliver the structured error.
  Tied WriteTimeout to s.requestTimeout + 5s; the buffer covers chi's
  deferred WriteHeader and the response flush. Documented the contract
  on ServerOptions.RequestTimeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (ad356f8)

Verdict: No medium-or-higher findings; the reviewed changes look clean.

All reviewers reported no actionable issues at Medium, High, or Critical severity. Security review also did not identify an actionable concern with the added github.com/asg017/sqlite-vec-go-bindings v0.1.6 dependency.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

Vector search documentation lives in the docs site now. Delete
docs/vector-search.md, revert the vector additions to docs/api.md,
and point the README link at https://msgvault.io/usage/vector-search/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (9b913e2)

Verdict: No medium-or-higher findings identified; the reviewed diff looks clean.

All provided reviews found no actionable Medium, High, or Critical issues.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The new name follows the existing verb-first CLI pattern
(build-cache, sync-full, import-emlx) and is discoverable to users
who don't know what an embedding is. "embed" stays as a cobra alias
so existing cronjobs and scripts keep working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (7c5530a)

Vector/hybrid search is close, but the PR still has 1 high-severity performance issue and 4 medium-severity correctness/reliability gaps.

High

  • Highinternal/vector/sqlitevec/fused.go (openFusedConn): FusedSearch opens a fresh *sql.DB with sql.Open on every query. That adds avoidable pool/driver/database-open overhead to the hot path and is a poor fit for database/sql, especially for repeated searches.
    Fix: Keep a long-lived attached connection/pool on the backend, or reuse a *sql.Conn from the existing main DB.

Medium

  • Mediumcmd/msgvault/cmd/search.go:71: The CLI rejects --mode=vector|hybrid in remote mode even though this PR adds mode/explain support to /api/v1/search. That regresses remote CLI users by blocking semantic/hybrid search entirely.
    Fix: Thread mode and explain through runRemoteSearch and the remote query builder instead of rejecting them.

  • Mediumcmd/msgvault/cmd/search_vector.go:37: runHybridSearch opens msgvault.db directly and skips the usual local-search preparation path (InitSchema() plus FTS/backfill/index checks). On fresh or upgraded archives, --mode=hybrid can run against missing or stale FTS state and return incomplete results or fail.
    Fix: Reuse the normal local-search setup before constructing the hybrid engine, or explicitly run schema/init and FTS preparation in the vector/hybrid path.

  • Mediuminternal/api/handlers.go:476, internal/mcp/handlers.go:321, internal/mcp/handlers.go:400: Bulk hydration failures are logged and then treated as summaries = nil, which converts a successful vector search into a successful response with zero hits. That is a silent false negative rather than a partial-result case.
    Fix: Propagate hydration failures as API/tool errors, and only drop IDs that are genuinely missing after a successful summary fetch.

  • Mediuminternal/query/sqlite.go (GetMessageSummariesByIDs), internal/store/api.go (GetMessagesSummariesByIDs): Hydration binds message IDs as IN (?, ?, ...). If the result set grows past SQLite’s parameter limit, hydration fails with too many SQL variables.
    Fix: Pass IDs as a JSON array and use IN (SELECT value FROM json_each(?)), consistent with the filter-resolution approach already used elsewhere.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The HTTP API reference now lives on msgvault.io/api-server/.
Keeping an in-repo copy invites drift. Point the README link at
the docs site instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (b2c941c)

Verdict: Not ready to merge due to 1 high-severity correctness issue and 2 medium-severity regressions.

High

  • Hydration failures are silently converted into empty/partial success responses
    Location: internal/api/handlers.go, internal/mcp/handlers.go
    The new bulk-hydration path logs GetMessagesSummariesByIDs failures but still returns success with missing results. That misreports backend errors as "no matches" and can hide real search hits.
    Suggested fix: Return a request/tool error on hydration failure, or fall back to the prior per-message lookup path instead of dropping hits.

Medium

  • Bulk summary hydration does not fully preserve the existing search summary shape
    Location: internal/store/api.go in GetMessagesSummariesByIDs
    The new hydrator restores To and Labels, but not other summary fields such as Cc. Since hybrid/vector search is switching to this path specifically to match existing search responses, this creates a response-shape regression versus FTS results.
    Suggested fix: Hydrate the full payload expected by toMessageSummary, or keep the older projection until the bulk path is complete.

  • Remote CLI blocks newly added server-side vector/hybrid search support
    Location: cmd/msgvault/cmd/search.go
    Remote mode still rejects --mode=vector|hybrid even though the server now supports those modes on /api/v1/search. As a result, remote CLI users cannot use the new feature.
    Suggested fix: Forward mode and explain in remote requests and render the remote hybrid response instead of rejecting it client-side.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

The vector search feature has never been released, so there is no
shipped "embed" command for users to be compatible with. Drop the
alias and the Long-text mention of it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@roborev-ci
Copy link
Copy Markdown

roborev-ci Bot commented Apr 20, 2026

roborev: Combined Review (18aa250)

Verdict: No medium-or-higher issues identified in the reviewed diff.

All review outputs agree the changes are clean at Medium, High, and Critical severity.


Synthesized from 3 reviews (agents: codex, gemini | types: default, security)

@wesm wesm merged commit 67fab47 into main Apr 20, 2026
6 checks passed
sternryan added a commit to sternryan/msgvault that referenced this pull request Apr 26, 2026
Reconciles 172 ahead / 37 behind state with upstream wesm/msgvault.

Strategy: accept upstream wholesale for connector code (M365, iMessage,
gvoice, IMAP XOAUTH2) where upstream's implementations are more
battle-tested and already cover the fork's bug fixes. Hand-resolve
store/sync/search/cmd/build files to union both feature sets.

Preserved from fork:
- SQLCipher encryption-at-rest (passphrase, AES-GCM token encryption)
- Advisory file locking (tryLock, lockFile, syscall.Flock)
- AI Archive Intelligence subsystem (internal/embedding/, vec_messages,
  pipeline_runs, --semantic search)
- Web UI (React/TypeScript SPA in web/)
- Hot-path search tokenizer (dispatchToken, toLowerFast, parseSizeFast)
- migrateAddContentID, InitVectorTable, content_id attachment column

Adopted from upstream:
- Dialect interface + loggedDB wrapper + structured logging pipeline
  (wesm#276 PostgreSQL dialect refactor foundation)
- OpenReadOnly() for MCP read-only access
- IsBusyError, SchemaStale helpers
- Unified text import (wesm#238) — M365 OAuth (wesm#228), iMessage (wesm#224),
  Google Voice (wesm#225) — all wholesale
- Search enhancements: regex, FTS5 snippets, sorting (wesm#252),
  domain normalization (normalizeAddr, looksLikeDomain, gTLDs)
- rebuild-fts command (wesm#287), 8 bug fixes from wesm#254
- IMAP date filtering (wesm#222), greeting wait (wesm#248)
- Vector subsystem (wesm#277) — coexists with fork's AI Archive
  Intelligence as parallel implementation; future cleanup needed

Build/runtime fixes applied during merge:
- Replaced mattn/go-sqlite3 imports with mutecomm/go-sqlcipher/v4
  (drop-in API-compatible) to resolve duplicate symbol linker errors
- Dropped sqlite_vec from default BUILD_TAGS (requires SQLite 3.38+
  APIs sqlcipher v4.4.2 does not expose; re-enable when sqlcipher
  upgrades)
- safeRowsAffected helper in db_logger.go: defer recover around
  RowsAffected() call (sqlcipher returns nil internal Result for
  multi-statement DDL)
- Wired normalizeAddr into hot-path tokenizer for from:/to:/cc:/bcc:

Stubbed under unreachable build tag (need follow-up decision):
- cmd/msgvault/cmd/sync_gvoice.go — fork's sync API obsolete vs
  upstream's import-based gvoice
- cmd/msgvault/cmd/sync_imessage.go — same situation

Verified: go build ./... passing, go vet clean, 45/45 test packages
pass with 0 failures. See MERGE_REPORT.md for file-by-file resolution
notes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant