MMR reranking via mmr_lambda hidden column #6

Merged: vlasky merged 4 commits into vlasky:main from MayCXC:mmr-reranking-vlasky on Feb 28, 2026

@MayCXC MayCXC commented Feb 27, 2026

Rebased version of asg017#267 (issue: asg017#266) for this fork. Adds an mmr_lambda hidden column to vec0 for Maximal Marginal Relevance reranking in KNN queries.

SELECT rowid, distance FROM vec_items
WHERE embedding MATCH ? AND k = 10 AND mmr_lambda = 0.5;

Composes with this fork's distance constraints, partition keys, and all vector types/metrics.

Also fixes the pre-existing test_shadow snapshot failure (missing ORDER BY on pragma_table_list).

Full design and rationale in the upstream PR.

Add Maximal Marginal Relevance (MMR) support to vec0 virtual table.
When mmr_lambda is provided in a KNN query, candidates are over-fetched
and then greedily re-selected to balance relevance against diversity.

API: WHERE embedding MATCH ? AND k = 10 AND mmr_lambda = 0.7

- mmr_lambda range [0.0, 1.0]: 1.0 = pure relevance, 0.0 = pure diversity
- Over-fetch factor: 5x (capped at k_max=4096)
- Supports float32, int8, and bit vector types
- All distance metrics (L2, cosine, L1, hamming)
- Zero impact when mmr_lambda is not provided
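For readers unfamiliar with MMR, here is a minimal Python sketch of the over-fetch and greedy re-selection described above. It is not the C code from this PR: the names `overfetch_count` and `mmr_select` are hypothetical, and distances are assumed to be already scaled into [0, 1]. Only the 5x factor, the 4096 cap, and the lambda range come from the commit message.

```python
from typing import List, Sequence

OVERFETCH_FACTOR = 5  # over-fetch factor stated in the commit message
K_MAX = 4096          # cap stated in the commit message


def overfetch_count(k: int) -> int:
    """Number of nearest neighbours to fetch before reranking (hypothetical helper)."""
    return min(k * OVERFETCH_FACTOR, K_MAX)


def mmr_select(query_dist: Sequence[float],
               pair_dist: Sequence[Sequence[float]],
               k: int,
               mmr_lambda: float) -> List[int]:
    """Greedily pick up to k candidate indices by MMR score.

    query_dist[i]   : distance from the query to candidate i (assumed in [0, 1])
    pair_dist[i][j] : distance between candidates i and j (assumed in [0, 1])
    """
    if not 0.0 <= mmr_lambda <= 1.0:
        raise ValueError("mmr_lambda must be in [0.0, 1.0]")
    selected: List[int] = []
    remaining = list(range(len(query_dist)))
    while remaining and len(selected) < k:
        best_idx, best_score = -1, float("-inf")
        for i in remaining:
            relevance = 1.0 - query_dist[i]
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max((1.0 - pair_dist[i][j] for j in selected),
                             default=0.0)
            score = mmr_lambda * relevance - (1.0 - mmr_lambda) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected
```

With two near-duplicate candidates close to the query and a third that is farther away but distinct, `mmr_lambda = 1.0` keeps both duplicates, while `mmr_lambda = 0.5` swaps the second duplicate for the distinct candidate.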

9 test functions covering:
- Cosine diversity (baseline vs lambda=1.0, 0.5, 0.0)
- L2 distance metric compatibility
- Int8 vector element type
- Cluster monopoly breaking
- Composition with distance constraints
- Composition with partition keys
- Edge cases (k=1, k=0)
- Error handling (invalid lambda range)
- Insert guard for hidden column

pragma_table_list does not guarantee row order. Add ORDER BY name
to the two shadow table queries so the snapshot is deterministic.
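The ordering fix can be illustrated with Python's built-in sqlite3 module. This is a sketch only: the helper `shadow_table_names` and the `demo_*` table names are made up for illustration and are not the tables or queries used in test_shadow. The fallback to sqlite_master is there because pragma_table_list needs SQLite 3.37 or newer.

```python
import sqlite3
from typing import List


def shadow_table_names(conn: sqlite3.Connection, prefix: str) -> List[str]:
    """Return names of tables starting with `prefix`, in a deterministic order."""
    if sqlite3.sqlite_version_info >= (3, 37, 0):
        source = "pragma_table_list"
        extra = ""
    else:
        # Older SQLite builds lack PRAGMA table_list; use the schema table.
        source = "sqlite_master"
        extra = "type = 'table' AND "
    sql = (
        f"SELECT name FROM {source} "
        f"WHERE {extra}substr(name, 1, ?) = ? "
        "ORDER BY name"  # without this, row order is unspecified
    )
    return [row[0] for row in conn.execute(sql, (len(prefix), prefix))]
```

Because the ORDER BY is part of the query, the result no longer depends on table creation order or on SQLite internals, which is what a snapshot test needs.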
mceachen added a commit to photostructure/sqlite-vec that referenced this pull request Feb 27, 2026

mceachen commented Feb 27, 2026

Thanks for this! Claude (and I) just merged it into my fork.

One fix it made: in vec0_mmr_rerank, the copy-back loop at the end iterates k_target times, but the greedy selection loop can terminate early (via if (best_idx < 0) break), leaving the tail of out_rowids/out_distances uninitialized. We added an n_selected counter and an out_n_selected output parameter so only the actually-selected entries are copied back. The caller now sets k_used = n_selected instead of k_used = k_original. (You can see my referenced commit for details)
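A rough Python model of the pattern described in this comment, for anyone following along without the C source open. The function name `rerank_into_buffers` is hypothetical, `None` stands in for uninitialized memory, and the selection rule is simplified to "smallest distance first" rather than the real MMR score. Only the `best_idx < 0` early exit and the `n_selected` counter mirror the comment.

```python
from typing import List, Optional, Sequence, Tuple


def rerank_into_buffers(
    cand_rowids: Sequence[int],
    cand_dists: Sequence[float],
    k_target: int,
) -> Tuple[List[Optional[int]], List[Optional[float]], int]:
    """Fill fixed-size output buffers and report how many slots are valid."""
    out_rowids: List[Optional[int]] = [None] * k_target
    out_distances: List[Optional[float]] = [None] * k_target
    used = [False] * len(cand_rowids)
    n_selected = 0
    for _ in range(k_target):
        best_idx = -1
        for i in range(len(cand_rowids)):
            if used[i]:
                continue
            if best_idx < 0 or cand_dists[i] < cand_dists[best_idx]:
                best_idx = i
        if best_idx < 0:
            break  # candidates exhausted before k_target was reached
        used[best_idx] = True
        out_rowids[n_selected] = cand_rowids[best_idx]
        out_distances[n_selected] = cand_dists[best_idx]
        n_selected += 1
    return out_rowids, out_distances, n_selected
```

A caller would then treat only the first `n_selected` slots as results (the equivalent of `k_used = n_selected`) instead of reading all `k_target` slots.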

MayCXC added a commit to MayCXC/sqlite-vec that referenced this pull request Feb 28, 2026
The copy-back loop iterated k_target times, but the greedy selection
loop can terminate early via `if (best_idx < 0) break`, leaving the
tail of out_rowids/out_distances uninitialized. Add an n_selected
counter and out_n_selected output parameter so only actually-selected
entries are copied back. The caller now sets k_used = n_selected
instead of k_used = k_original.

Credit: mceachen (vlasky#6)
@vlasky vlasky merged commit 69dfda1 into vlasky:main Feb 28, 2026

vlasky commented Feb 28, 2026

Thanks for this contribution! Merged.

I added a follow-up commit (8d4ef9e) to normalize the diversity term in the MMR greedy loop. The relevance term was already normalized to [0,1] by dividing by max_dist, but the inter-candidate similarity used raw distances (1 - d). For cosine distance this is fine since values are already in [0,1], but for L2 and L1, where distances are unbounded, the two terms in the MMR score were on different scales, meaning mmr_lambda didn't have a consistent effect across distance metrics.

The fix divides the inter-candidate distance by max_dist to match, so both terms are on the same scale regardless of metric.
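To make the scale mismatch concrete, here is a small Python sketch of a single MMR score with and without the normalization. The function `mmr_score` and its parameters are hypothetical. `max_dist` is assumed to be the largest query distance among the fetched candidates, and the exact formula in the C code may differ.

```python
def mmr_score(query_dist: float,
              dist_to_selected: float,
              max_dist: float,
              mmr_lambda: float,
              normalize_diversity: bool = True) -> float:
    """Score one candidate against its closest already-selected neighbour."""
    # Relevance term: already divided by max_dist before the follow-up commit.
    relevance = 1.0 - query_dist / max_dist
    # Diversity term: raw distance before the fix, scaled by max_dist after it.
    d = dist_to_selected / max_dist if normalize_diversity else dist_to_selected
    similarity = 1.0 - d
    return mmr_lambda * relevance - (1.0 - mmr_lambda) * similarity


# L2-style numbers: query distance 10, nearest selected item 20 away, max_dist 40.
raw = mmr_score(10.0, 20.0, 40.0, 0.5, normalize_diversity=False)
scaled = mmr_score(10.0, 20.0, 40.0, 0.5, normalize_diversity=True)
```

With these inputs the raw variant gives 9.875, dominated by the diversity term, while the scaled variant gives 0.125, so the relevance term still has a say at `mmr_lambda = 0.5`.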

MayCXC commented Feb 28, 2026

@mceachen @vlasky thank you both for the very rapid fixes :)
