feat: MUVERA embeddings #542
Conversation
Pull Request Overview
Adds support for MUVERA embeddings by implementing the MUVERA algorithm and integrating it with existing late interaction models.
- Introduces `SimHashProjection`, `MuveraAlgorithm`, and `MuveraEmbedding` for multi-vector to fixed-dimension encoding.
- Registers `MuveraEmbedding` in the `TextEmbedding` registry and exposes it in the package `__init__`.
- Integrates MUVERA into the `embed`, `query_embed`, and `passage_embed` workflows.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| fastembed/text/text_embedding.py | Imported and registered MuveraEmbedding |
| fastembed/text/muvera_embedding.py | Implemented MUVERA algorithm and embedding class |
| fastembed/__init__.py | Exposed MuveraEmbedding in the top-level package exports |
Comments suppressed due to low confidence (1)
fastembed/text/muvera_embedding.py:1
- The new MUVERA functionality lacks unit tests. Add tests covering `MuveraAlgorithm.encode`, `encode_document`, and the `MuveraEmbedding` methods (`embed`, `query_embed`) to ensure correctness and maintain coverage.
import numpy as np
fastembed/text/muvera_embedding.py
Outdated
```python
for i in range(B):
    if cluster_vector_counts[i] == 0:  # Empty cluster found
        min_hamming = float("inf")
        best_vector = None
        # Find vector whose cluster ID has minimum Hamming distance to i
        for vector in vectors:
            vector_cluster_id = simhash.get_cluster_id(vector)
```
Copilot AI · Jul 9, 2025
Recomputing cluster IDs for each vector when filling empty clusters leads to O(B * n) get_cluster_id calls. Consider caching each vector's cluster_id once before the empty-cluster loop to improve performance.
Suggested change:

```diff
+# Cache cluster IDs for all vectors
+cached_cluster_ids = {tuple(vector): simhash.get_cluster_id(vector) for vector in vectors}
 for i in range(B):
     if cluster_vector_counts[i] == 0:  # Empty cluster found
         min_hamming = float("inf")
         best_vector = None
         # Find vector whose cluster ID has minimum Hamming distance to i
         for vector in vectors:
-            vector_cluster_id = simhash.get_cluster_id(vector)
+            vector_cluster_id = cached_cluster_ids[tuple(vector)]
```
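The effect of the suggestion can be sketched standalone; the `get_cluster_id` below is a hypothetical stand-in for the SimHash method, kept only to show the n-call cache versus the up-to-B·n-call loop:

```python
import numpy as np

rng = np.random.default_rng(0)
k_sim, dim = 3, 8
B = 2 ** k_sim  # number of clusters
hyperplanes = rng.standard_normal((k_sim, dim))

def get_cluster_id(vector):
    # Hypothetical stand-in for simhash.get_cluster_id: sign bits packed into an int.
    bits = (hyperplanes @ vector > 0).astype(int)
    return int(bits @ (2 ** np.arange(k_sim)))

vectors = rng.standard_normal((10, dim))

# Cache each vector's cluster ID once (n calls) instead of recomputing it
# inside the empty-cluster loop (up to B * n calls).
cached_cluster_ids = [get_cluster_id(v) for v in vectors]
cluster_vector_counts = np.bincount(cached_cluster_ids, minlength=B)

filled = {}
for i in range(B):
    if cluster_vector_counts[i] == 0:  # empty cluster: reuse the cached IDs
        hammings = [bin(cid ^ i).count("1") for cid in cached_cluster_ids]
        filled[i] = vectors[int(np.argmin(hammings))]
```

Caching by position (a list parallel to `vectors`, as above) also avoids hashing each vector into a `tuple` key, though the dict form in the suggestion is equally correct.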
fastembed/text/muvera_embedding.py
Outdated
```python
for projection_index, simhash in enumerate(self.simhash_projections):
    # Initialize cluster centers and count vectors assigned to each cluster
    cluster_centers = np.zeros((B, self.d))
    cluster_vector_counts = np.zeros(B)
```
Copilot AI · Jul 9, 2025
[nitpick] Initialize cluster_vector_counts with an integer dtype (e.g., np.zeros(B, dtype=int)) to accurately represent counts and avoid unintended float usage.
```diff
-cluster_vector_counts = np.zeros(B)
+cluster_vector_counts = np.zeros(B, dtype=int)
```
…rove parameter defaults
…ng size and add Jupyter notebook for MUVERA usage
- fix: fix types, doctest, rename variables, refactor
- fix: fix python3.9 compatibility
I ran benchmarks on beir/scidocs and got the following results for colbert (parameters I was using are listed above). Though I was expecting better results, we can still see consistent improvements in the metrics as the embedding size grows, so I think we can conclude that the implementation is correct.
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a new postprocess package exposing Muvera via `fastembed/postprocess/__init__.py` and implements MUVERA in `fastembed/postprocess/muvera.py`. The new module provides a POPCOUNT lookup, a `hamming_distance_matrix` helper, `SimHashProjection` for k-bit SimHash clustering, and the `Muvera` class that performs repeated SimHash clustering, cluster-center accumulation, optional normalization, empty-cluster filling by nearest Hamming-distance clusters, per-repetition ±1 random projections, and flattening into a fixed-dimensional embedding. Includes a `from_multivector_model` constructor and a small `__main__` demo. Tests validate deterministic outputs for `process`, `process_document`, and `process_query`.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
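The pipeline described in the walkthrough can be condensed into a plain-numpy sketch; the parameter names (`k_sim`, `dim_proj`, `r_reps`) follow the docstrings quoted later in this PR, and everything else here is illustrative rather than the actual fastembed code:

```python
import numpy as np

def muvera_fde(vectors, dim, k_sim=3, dim_proj=4, r_reps=2,
               normalize_by_count=True, fill_empty_clusters=True, seed=42):
    """Illustrative fixed-dimensional encoding following the steps above."""
    rng = np.random.default_rng(seed)
    B = 2 ** k_sim                                        # clusters per repetition
    parts = []
    for _ in range(r_reps):
        hyperplanes = rng.standard_normal((k_sim, dim))   # SimHash hyperplanes
        bits = (vectors @ hyperplanes.T > 0).astype(int)
        ids = bits @ (2 ** np.arange(k_sim))              # cluster IDs in [0, B)
        centers = np.zeros((B, dim))
        counts = np.bincount(ids, minlength=B)
        np.add.at(centers, ids, vectors)                  # accumulate cluster sums
        if normalize_by_count:                            # sums -> means (document mode)
            nz = counts > 0
            centers[nz] /= counts[nz, None]
        if fill_empty_clusters:                           # nearest-Hamming fill
            for i in np.flatnonzero(counts == 0):
                ham = [bin(int(c) ^ int(i)).count("1") for c in ids]
                centers[i] = vectors[int(np.argmin(ham))]
        S = rng.choice([-1.0, 1.0], size=(dim, dim_proj)) # per-repetition ±1 projection
        parts.append((centers @ S) / np.sqrt(dim_proj))
    return np.concatenate([p.ravel() for p in parts])     # flatten to fixed dim

rng = np.random.default_rng(0)
fde = muvera_fde(rng.standard_normal((10, 16)), dim=16)
```

With `k_sim=3`, `dim_proj=4`, `r_reps=2` this yields a 2 · 8 · 4 = 64-dimensional vector regardless of how many token vectors go in, which is what makes the result usable with single-vector search.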
Actionable comments posted: 0
🧹 Nitpick comments (4)
fastembed/postprocess/muvera.py (4)
72-75: Consider using numpy for binary-to-decimal conversion. The current implementation is correct but could be optimized using numpy's vectorized operations.
Replace the loop with a more efficient numpy operation:
```diff
-cluster_id = 0
-for i, bit in enumerate(binary_values):
-    cluster_id += bit * (2**i)
+# Convert binary array to decimal using numpy
+powers_of_2 = 2 ** np.arange(len(binary_values))
+cluster_id = int(np.dot(binary_values, powers_of_2))
```
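Both forms agree; a quick standalone check of the suggested replacement (the bit array below is an arbitrary example, interpreted LSB-first as in the loop):

```python
import numpy as np

binary_values = np.array([1, 0, 1, 1])  # LSB-first bits

# Loop form from the original code
cluster_id_loop = 0
for i, bit in enumerate(binary_values):
    cluster_id_loop += bit * (2 ** i)

# Vectorized form from the suggestion
powers_of_2 = 2 ** np.arange(len(binary_values))
cluster_id_vec = int(np.dot(binary_values, powers_of_2))

# Both compute 1 + 4 + 8 = 13
```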
149: Remove unnecessary noqa directive. The `# noqa[naming]` comment appears unnecessary, as `r_reps` is a valid parameter name that matches the class attribute.

```diff
-    r_reps: int = 20,  # noqa[naming]
+    r_reps: int = 20,
```
236: Fix typo in docstring. Remove the extra closing bracket.

```diff
-    vectors (NumpyArray]): Query vectors of shape (n_tokens, dim)
+    vectors (NumpyArray): Query vectors of shape (n_tokens, dim)
```
292: Remove unnecessary `type: ignore` comments. The `get_cluster_id` method accepts `np.ndarray`, which is compatible with the loop variable `vector`, so the `type: ignore` comments are not needed.

```diff
-    cluster_id = simhash.get_cluster_id(vector)  # type: ignore
+    cluster_id = simhash.get_cluster_id(vector)
```

```diff
-    vector_cluster_id = simhash.get_cluster_id(vector)  # type: ignore
+    vector_cluster_id = simhash.get_cluster_id(vector)
```

Also applies to: 311-311
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- fastembed/postprocess/__init__.py (1 hunks)
- fastembed/postprocess/muvera.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
fastembed/postprocess/__init__.py (1)
- fastembed/postprocess/muvera.py (1): `Muvera` (79-338)
fastembed/postprocess/muvera.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
LateInteractionTextEmbeddingBase(8-71)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
LateInteractionMultimodalEmbeddingBase(10-78)fastembed/text/text_embedding.py (1)
embedding_size(132-136)
🔇 Additional comments (3)
fastembed/postprocess/muvera.py (2)
1-14: LGTM! Clean imports and type definition. The imports are appropriate and the `MultiVectorModel` type alias correctly captures both late interaction model types.
297-320: Inconsistent normalization for filled empty clusters. When both `normalize_by_count=True` and `fill_empty_clusters=True` (document mode), empty clusters are filled with raw vectors but aren't normalized by count. This creates an inconsistency where regular clusters contain mean vectors while filled clusters contain raw vectors. Consider normalizing the filled vectors to maintain consistency:

```diff
 # Assign the best matching vector to the empty cluster
 if best_vector is not None:
-    cluster_centers[i] = best_vector
+    if normalize_by_count:
+        cluster_centers[i] = best_vector  # Already normalized (single vector)
+    else:
+        cluster_centers[i] = best_vector
```

Alternatively, document this behavior in the docstring if it's intentional.
Likely an incorrect or invalid review comment.
fastembed/postprocess/__init__.py (1)
1-3: LGTM! Standard package initialization. Correctly imports and exports the `Muvera` class for the postprocess module.
The paper suggests using Muvera embeddings for candidate retrieval and the original multivectors for reranking, so that's kind of expected. Thanks for putting in all this effort! When can we expect the next release of FastEmbed, so we can announce Muvera support?
Oops! I have forgotten to do the reranking |
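That two-stage setup can be sketched as follows; the dot-product retrieval over FDEs and the MaxSim (Chamfer) reranking over the original multi-vectors are the scheme the paper describes, but all names, shapes, and data here are illustrative:

```python
import numpy as np

def maxsim(query_mv, doc_mv):
    # Late-interaction (MaxSim/Chamfer) score: each query token is matched to
    # its best document token, and the similarities are summed.
    return float((query_mv @ doc_mv.T).max(axis=1).sum())

rng = np.random.default_rng(0)
query_fde = rng.standard_normal(64)
doc_fdes = rng.standard_normal((100, 64))      # one fixed-dim encoding per document
query_mv = rng.standard_normal((8, 16))        # original query multi-vector
doc_mvs = rng.standard_normal((100, 12, 16))   # original document multi-vectors

# Stage 1: cheap candidate retrieval with single-vector FDE dot products
top_k = np.argsort(doc_fdes @ query_fde)[::-1][:10]

# Stage 2: rerank the candidates with the exact late-interaction score
reranked = sorted(top_k, key=lambda i: maxsim(query_mv, doc_mvs[i]), reverse=True)
```

Skipping stage 2 is exactly the mistake mentioned above: the FDE ordering alone is only an approximation of the MaxSim ordering.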
```python
k_sim (int, optional): Number of SimHash functions (creates 2^k_sim clusters).
    Defaults to 5.
dim_proj (int, optional): Dimensionality after random projection (must be <= dim).
    Defaults to 16.
r_reps (int, optional): Number of random projection repetitions for robustness.
    Defaults to 20.
random_seed (int, optional): Seed for random number generator to ensure
    reproducible results. Defaults to 42.
```
"optional" is typically used for args that can be None, not args that have a default value, though other things in fastembed use it for "default value", too, so it's fine.
Not sure if we need to keep this if it's empty.
- vectorize operations
- fix: fill empty clusters with dataset vectors
- rollback get_output_dimension
- fix: fix type hints
- fix: review comments
Actionable comments posted: 0
🧹 Nitpick comments (1)
fastembed/postprocess/muvera.py (1)
244: Fix minor typographical error in docstring. There's an extra closing bracket in the type annotation.

```diff
 Args:
-    vectors (NumpyArray]): Query vectors of shape (n_tokens, dim)
+    vectors (NumpyArray): Query vectors of shape (n_tokens, dim)
```
📜 Review details
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- fastembed/postprocess/muvera.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: Python 3.12.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.12.x on macos-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.11.x on windows-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on macos-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.9.x on ubuntu-latest test
- GitHub Check: Python 3.10.x on windows-latest test
- GitHub Check: Python 3.10.x on macos-latest test
- GitHub Check: Python 3.9.x on macos-latest test
- GitHub Check: Python 3.9.x on windows-latest test
🔇 Additional comments (9)
fastembed/postprocess/muvera.py (9)
1-17: LGTM! Clean imports and well-defined module constants. The imports are appropriately organized, the type alias clearly defines the supported multi-vector models, and the constants are well-documented. The POPCOUNT LUT optimization for Hamming distance computation is a nice performance touch.
19-32: Efficient Hamming distance implementation using a lookup table. The implementation correctly leverages the POPCOUNT LUT for fast bit counting. The use of broadcasting and vectorized operations ensures good performance for computing the full pairwise distance matrix.
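For reference, a byte-wide popcount lookup table gives pairwise Hamming distances in a few vectorized operations. This standalone sketch assumes cluster IDs fit in one byte (i.e. k_sim ≤ 8); it is modeled on the helper described above, not copied from it:

```python
import numpy as np

# 256-entry popcount lookup table: POPCOUNT[v] = number of set bits in byte v.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_distance_matrix(a, b):
    # Pairwise Hamming distances between integer cluster IDs; assumes the IDs
    # fit in one byte, wider IDs would need a per-byte view.
    xor = np.bitwise_xor(a[:, None], b[None, :]).astype(np.uint8)
    return POPCOUNT[xor]

a = np.array([0b0000, 0b1111])
b = np.array([0b0001, 0b1110, 0b1111])
d = hamming_distance_matrix(a, b)  # d[i, j] = popcount(a[i] ^ b[j])
```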
34-84: Well-implemented SimHash clustering with clear documentation. The class provides a clean interface for locality-sensitive hashing using random hyperplanes. The bit manipulation in `get_cluster_ids` is elegant and mathematically sound.
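A minimal sketch of such a SimHash projection; the class and method names mirror the review's description, while the constructor signature and the Gaussian hyperplane initialization are assumptions:

```python
import numpy as np

class SimHashProjection:
    # k-bit SimHash clustering with random hyperplanes (illustrative sketch).
    def __init__(self, k_sim, dim, rng):
        self.hyperplanes = rng.standard_normal((k_sim, dim))
        self.powers = 2 ** np.arange(k_sim)

    def get_cluster_ids(self, vectors):
        # One bit per hyperplane: which side of the plane each vector falls on.
        bits = (vectors @ self.hyperplanes.T > 0).astype(int)
        # Pack the bits into integer cluster IDs in [0, 2**k_sim).
        return bits @ self.powers

rng = np.random.default_rng(42)
simhash = SimHashProjection(k_sim=4, dim=8, rng=rng)
X = rng.standard_normal((6, 8))
ids = simhash.get_cluster_ids(X)
```

Because the hyperplanes are fixed at construction, the same input always lands in the same bucket, which is what makes the per-repetition clustering deterministic under a fixed seed.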
86-150: Comprehensive MUVERA implementation with robust parameter validation. The class structure is well-designed with clear documentation explaining the algorithm steps. The parameter validation in the constructor prevents common configuration errors.
151-205: Convenient factory method with excellent documentation and example. The `from_multivector_model` class method provides a user-friendly way to create MUVERA instances from existing models. The comprehensive docstring with a practical example is particularly helpful.
207-220: Clean property implementation following consistent patterns. The embedding size property follows the same pattern as other FastEmbed models, ensuring API consistency across the codebase.
221-249: Well-designed document and query processing methods with appropriate defaults. The separation of document and query processing with different parameter defaults aligns well with the MUVERA paper's recommendations. The methods provide clear abstractions over the core `process` method.
251-357: Comprehensive core processing implementation with robust error handling. The `process` method implements the full MUVERA algorithm correctly with proper handling of edge cases like empty clusters. The step-by-step approach makes the complex algorithm understandable and maintainable.
359-365: Simple demo implementation for testing the module. The `__main__` block provides a basic demonstration of the MUVERA functionality, which is useful for manual testing and verification.
Actionable comments posted: 1
🧹 Nitpick comments (7)
tests/test_postprocess.py (7)
21-22: Use the DIM constant and float32 to minimize numerical drift. Use the declared DIM instead of a magic number and cast to float32 to better match typical embedding dtypes, reducing cross-platform numeric variance.
Apply this diff:
```diff
-    random_generator = np.random.default_rng(42)
-    multivector = random_generator.random((10, 128))
+    rng = np.random.default_rng(42)
+    multivector = rng.random((10, DIM)).astype(np.float32)
```
19: Rename the test for intent clarity. The name should communicate purpose; the current name is vague.
Apply this diff:
```diff
-def test_single_input():
+def test_muvera_process_consistency_and_constructors():
```
28-31: Assert the expected output dimensionality explicitly. Strengthen the shape assertion by checking against the formula r_reps * k_sim * dim_proj and the Muvera-reported embedding_size.
Apply this diff:
```diff
-    fde = muvera.process(multivector)
-    assert fde.shape[0] == muvera.embedding_size
-    assert np.allclose(fde[:3], CANONICAL_VALUES)
+    fde = muvera.process(multivector)
+    expected_dim = R_REPS * K_SIM * DIM_PROJ
+    assert fde.shape[0] == expected_dim
+    assert muvera.embedding_size == expected_dim
+    assert np.allclose(fde[:3], CANONICAL_VALUES, rtol=1e-5, atol=1e-7)
```
32-35: Keep the doc-path equality check; ensure numeric tolerance. Retain equality with an explicit tolerance, mirroring the change above.
Apply this diff:
```diff
-    fde_doc = muvera.process_document(multivector)
-    assert fde_doc.shape[0] == muvera.embedding_size
-    assert np.allclose(fde, fde_doc)
+    fde_doc = muvera.process_document(multivector)
+    assert fde_doc.shape[0] == muvera.embedding_size
+    assert np.allclose(fde, fde_doc, rtol=1e-5, atol=1e-7)
```
36-38: Strengthen query-path invariants and loosen brittleness. Add checks that are robust across minor numeric changes: the query vector should differ from the doc vector, and it should contain zeros when fill_empty_clusters=False. Keep canonical checks but with explicit tolerance.
Apply this diff:
```diff
-    fde_query = muvera.process_query(multivector)
-    assert fde_query.shape[0] == muvera.embedding_size
-    assert np.allclose(fde_query[np.nonzero(fde_query)][:3], CANONICAL_QUERY_VALUES)
+    fde_query = muvera.process_query(multivector)
+    assert fde_query.shape[0] == muvera.embedding_size
+    # Query FDE differs from doc FDE by design (no fill/normalization)
+    assert not np.allclose(fde_query, fde_doc, rtol=1e-5, atol=1e-7)
+    # Expect some zeros in the query FDE
+    assert np.count_nonzero(fde_query) < muvera.embedding_size
+    # Canonical sentinel check with explicit tolerance
+    nonzero_vals = fde_query[np.nonzero(fde_query)][:3]
+    assert np.allclose(nonzero_vals, CANONICAL_QUERY_VALUES, rtol=1e-5, atol=1e-7)
```
24-27: Cross-verify that both constructors produce identical outputs. Given identical hyperparameters and seeds, process() results should match across constructors; assert this explicitly.
Apply this diff to append after the loop:
```diff
@@ assert np.allclose(fde_query[np.nonzero(fde_query)][:3], CANONICAL_QUERY_VALUES)
+    # Cross-verify: both constructors should produce identical outputs for the same input
+    muvera_a = Muvera(dim=DIM, k_sim=K_SIM, dim_proj=DIM_PROJ, r_reps=R_REPS, random_seed=42)
+    muvera_b = Muvera.from_multivector_model(model, k_sim=K_SIM, dim_proj=DIM_PROJ, r_reps=R_REPS)
+    fde_a = muvera_a.process(multivector)
+    fde_b = muvera_b.process(multivector)
+    assert np.allclose(fde_a, fde_b, rtol=1e-5, atol=1e-7)
```
39: Add a negative test for invalid dim_proj to catch regression. from_multivector_model is documented to raise when dim_proj > embedding_size; add a regression test to enforce this contract.
Apply this patch at the end of the file:
```diff
+
+def test_from_multivector_model_raises_on_invalid_dim_proj():
+    model = SimpleNamespace(embedding_size=4)
+    with pytest.raises(ValueError):
+        Muvera.from_multivector_model(model, k_sim=K_SIM, dim_proj=8, r_reps=R_REPS)
```
📜 Review details
📒 Files selected for processing (1)
- tests/test_postprocess.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/test_postprocess.py (2)
- fastembed/late_interaction/late_interaction_text_embedding.py (1): `LateInteractionTextEmbedding` (14-153)
- fastembed/postprocess/muvera.py (6): `Muvera` (86-356), `from_multivector_model` (152-205), `process` (251-356), `embedding_size` (218-219), `process_document` (221-234), `process_query` (236-249)
🔇 Additional comments (1)
tests/test_postprocess.py (1)
24-38: Solid baseline checks and determinism across paths. Good coverage to exercise both constructors and to check consistency between process and process_document. Seeding the RNG ensures determinism.
```python
from fastembed import LateInteractionTextEmbedding
from fastembed.postprocess import Muvera
```
Avoid network/model dependency in unit tests; stub the model.
Creating LateInteractionTextEmbedding, even with lazy_load=True, can still trigger heavyweight imports or network access in CI. This test only needs embedding_size; replace the real model with a lightweight stub to eliminate flakiness and speed up tests.
Apply this diff:
```diff
@@
-import numpy as np
+import numpy as np
+import pytest
@@
-from fastembed import LateInteractionTextEmbedding
 from fastembed.postprocess import Muvera
+from types import SimpleNamespace
@@
-def test_single_input():
-    model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0", lazy_load=True)
-    random_generator = np.random.default_rng(42)
-    multivector = random_generator.random((10, 128))
+def test_single_input():
+    # Avoid loading external models in unit tests; only embedding_size is needed here.
+    model = SimpleNamespace(embedding_size=DIM)
+    rng = np.random.default_rng(42)
+    multivector = rng.random((10, DIM)).astype(np.float32)
```

Also applies to: 20-22
I implemented MUVERA embeddings for all the late interaction models that FastEmbed supports.
Context: https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/
It's still a draft, as I'm running some experiments in parallel.
All Submissions:
New Feature Submissions:
- Have you installed `pre-commit` with `pip3 install pre-commit` and set up hooks with `pre-commit install`?