
Conversation

@kacperlukawski (Member)
I implemented MUVERA embeddings for all the late interaction models that FastEmbed supports.

Context: https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/

It's still a draft, as I'm running some experiments in parallel.

All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  • Does your submission pass the existing tests?
  • Have you added tests for your feature?
  • Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

@kacperlukawski kacperlukawski requested review from Copilot and joein and removed request for Copilot July 9, 2025 12:13
Copilot AI left a comment
Pull Request Overview

Adds support for MUVERA embeddings by implementing the MUVERA algorithm and integrating it with existing late interaction models.

  • Introduces SimHashProjection, MuveraAlgorithm, and MuveraEmbedding for multi-vector to fixed-dimension encoding.
  • Registers MuveraEmbedding in TextEmbedding registry and exposes it in the package __init__.
  • Integrates MUVERA into the embed, query_embed, and passage_embed workflows.
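For orientation, the encoding these components implement can be sketched roughly in plain NumPy. This is a simplified illustration of the MUVERA fixed-dimensional encoding, not the actual FastEmbed code; the function name and the simplifications (e.g. no empty-cluster filling or count normalization) are mine:

```python
import numpy as np

def muvera_fde(vectors: np.ndarray, k_sim: int = 5, dim_proj: int = 16,
               r_reps: int = 20, seed: int = 42) -> np.ndarray:
    """Simplified MUVERA fixed-dimensional encoding of one multi-vector."""
    rng = np.random.default_rng(seed)
    dim = vectors.shape[1]
    parts = []
    for _ in range(r_reps):
        # SimHash: k_sim random hyperplanes partition space into 2**k_sim buckets
        hyperplanes = rng.normal(size=(k_sim, dim))
        bits = (vectors @ hyperplanes.T > 0).astype(np.int64)
        cluster_ids = bits @ (2 ** np.arange(k_sim))
        # Accumulate cluster centers (sum of the vectors assigned to each bucket)
        centers = np.zeros((2 ** k_sim, dim))
        np.add.at(centers, cluster_ids, vectors)
        # Random +/-1 projection down to dim_proj per cluster, then flatten
        proj = rng.choice([-1.0, 1.0], size=(dim, dim_proj)) / np.sqrt(dim_proj)
        parts.append((centers @ proj).ravel())
    return np.concatenate(parts)  # length: r_reps * 2**k_sim * dim_proj

fde = muvera_fde(np.random.default_rng(0).random((10, 128)))
print(fde.shape)  # (10240,)
```

The key property is that a dot product between two such fixed-dimensional vectors approximates the multi-vector late-interaction score.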

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Files reviewed:
  • fastembed/text/text_embedding.py: imported and registered MuveraEmbedding
  • fastembed/text/muvera_embedding.py: implemented the MUVERA algorithm and embedding class
  • fastembed/__init__.py: exposed MuveraEmbedding in the top-level package exports
Comments suppressed due to low confidence (1)

fastembed/text/muvera_embedding.py:1

  • The new MUVERA functionality lacks unit tests. Add tests covering MuveraAlgorithm.encode, encode_document, and the MuveraEmbedding methods (embed, query_embed) to ensure correctness and maintain coverage.
import numpy as np

Comment on lines 234 to 240
for i in range(B):
    if cluster_vector_counts[i] == 0:  # Empty cluster found
        min_hamming = float("inf")
        best_vector = None
        # Find vector whose cluster ID has minimum Hamming distance to i
        for vector in vectors:
            vector_cluster_id = simhash.get_cluster_id(vector)
Copilot AI Jul 9, 2025
Recomputing cluster IDs for each vector when filling empty clusters leads to O(B * n) get_cluster_id calls. Consider caching each vector's cluster_id once before the empty-cluster loop to improve performance.

Suggested change
-for i in range(B):
-    if cluster_vector_counts[i] == 0:  # Empty cluster found
-        min_hamming = float("inf")
-        best_vector = None
-        # Find vector whose cluster ID has minimum Hamming distance to i
-        for vector in vectors:
-            vector_cluster_id = simhash.get_cluster_id(vector)
+# Cache cluster IDs for all vectors
+cached_cluster_ids = {tuple(vector): simhash.get_cluster_id(vector) for vector in vectors}
+for i in range(B):
+    if cluster_vector_counts[i] == 0:  # Empty cluster found
+        min_hamming = float("inf")
+        best_vector = None
+        # Find vector whose cluster ID has minimum Hamming distance to i
+        for vector in vectors:
+            vector_cluster_id = cached_cluster_ids[tuple(vector)]

for projection_index, simhash in enumerate(self.simhash_projections):
    # Initialize cluster centers and count vectors assigned to each cluster
    cluster_centers = np.zeros((B, self.d))
    cluster_vector_counts = np.zeros(B)
Copilot AI Jul 9, 2025

[nitpick] Initialize cluster_vector_counts with an integer dtype (e.g., np.zeros(B, dtype=int)) to accurately represent counts and avoid unintended float usage.

Suggested change
-cluster_vector_counts = np.zeros(B)
+cluster_vector_counts = np.zeros(B, dtype=int)


joein added 2 commits July 24, 2025 18:05
* fix: fix types, doctest, rename variables, refactor

* fix: fix python3.9 compatibility
@joein (Member)
joein commented Jul 31, 2025

I ran benchmarks on beir/scidocs and got the following results for colbert:

Vector: colbert, Recall@4: 0.4470, Recall@5: 0.4920, Recall@10: 0.7030
Vector: muvera-1280, Recall@4: 0.2350, Recall@5: 0.2610, Recall@10: 0.3700
Vector: muvera-2560, Recall@4: 0.2680, Recall@5: 0.3070, Recall@10: 0.4290
Vector: muvera-5120, Recall@4: 0.3020, Recall@5: 0.3370, Recall@10: 0.4770
Vector: muvera-10240, Recall@4: 0.3350, Recall@5: 0.3810, Recall@10: 0.5580
Vector: muvera-15360, Recall@4: 0.3500, Recall@5: 0.4000, Recall@10: 0.5580
Vector: muvera-20480, Recall@4: 0.3570, Recall@5: 0.4150, Recall@10: 0.5940

Parameters I was using:

r_reps, k_sim, dim_proj:
(20, 3, 8),
(20, 4, 8),
(20, 5, 8),
(20, 5, 16),
(30, 5, 16),
(40, 5, 16)
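Each reported vector size follows directly from these parameters, since the MUVERA embedding dimension is r_reps * 2**k_sim * dim_proj:

```python
# MUVERA embedding size from the benchmark parameters: r_reps * 2**k_sim * dim_proj
params = [(20, 3, 8), (20, 4, 8), (20, 5, 8), (20, 5, 16), (30, 5, 16), (40, 5, 16)]
sizes = [r_reps * 2 ** k_sim * dim_proj for r_reps, k_sim, dim_proj in params]
print(sizes)  # [1280, 2560, 5120, 10240, 15360, 20480]
```

These match the muvera-1280 through muvera-20480 labels in the benchmark above.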

I was expecting better results, but we can still see consistent improvement in the metrics as the embedding size grows, so I think we can conclude that the implementation is correct.

@joein joein marked this pull request as ready for review July 31, 2025 19:46
@coderabbitai

coderabbitai bot commented Jul 31, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds a new postprocess package exposing Muvera via fastembed/postprocess/__init__.py and implements MUVERA in fastembed/postprocess/muvera.py. The new module provides a POPCOUNT lookup table, a hamming_distance_matrix helper, SimHashProjection for k-bit SimHash clustering, and the Muvera class that performs repeated SimHash clustering, cluster-center accumulation, optional normalization, empty-cluster filling by nearest Hamming-distance clusters, per-repetition ±1 random projections, and flattening into a fixed-dimensional embedding. Includes a from_multivector_model constructor and a small __main__ demo. Tests validate deterministic outputs for process, process_document, and process_query.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
fastembed/postprocess/muvera.py (4)

72-75: Consider using numpy for binary-to-decimal conversion.

The current implementation is correct but could be optimized using numpy's vectorized operations.

Replace the loop with a more efficient numpy operation:

-        cluster_id = 0
-        for i, bit in enumerate(binary_values):
-            cluster_id += bit * (2**i)
+        # Convert binary array to decimal using numpy
+        powers_of_2 = 2 ** np.arange(len(binary_values))
+        cluster_id = int(np.dot(binary_values, powers_of_2))

149-149: Remove unnecessary noqa directive.

The # noqa[naming] comment appears unnecessary as r_reps is a valid parameter name that matches the class attribute.

-        r_reps: int = 20,  # noqa[naming]
+        r_reps: int = 20,

236-236: Fix typo in docstring.

Remove the extra closing bracket.

-            vectors (NumpyArray]): Query vectors of shape (n_tokens, dim)
+            vectors (NumpyArray): Query vectors of shape (n_tokens, dim)

292-292: Remove unnecessary type: ignore comments.

The get_cluster_id method accepts np.ndarray which is compatible with the loop variable vector, so the type: ignore comments are not needed.

-                cluster_id = simhash.get_cluster_id(vector)  # type: ignore
+                cluster_id = simhash.get_cluster_id(vector)
-                            vector_cluster_id = simhash.get_cluster_id(vector)  # type: ignore
+                            vector_cluster_id = simhash.get_cluster_id(vector)

Also applies to: 311-311

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between acec312 and f0fb50e.

📒 Files selected for processing (2)
  • fastembed/postprocess/__init__.py (1 hunks)
  • fastembed/postprocess/muvera.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
fastembed/postprocess/__init__.py (1)
fastembed/postprocess/muvera.py (1)
  • Muvera (79-338)
fastembed/postprocess/muvera.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • LateInteractionTextEmbeddingBase (8-71)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • LateInteractionMultimodalEmbeddingBase (10-78)
fastembed/text/text_embedding.py (1)
  • embedding_size (132-136)
🔇 Additional comments (3)
fastembed/postprocess/muvera.py (2)

1-14: LGTM! Clean imports and type definition.

The imports are appropriate and the MultiVectorModel type alias correctly captures both late interaction model types.


297-320: Inconsistent normalization for filled empty clusters.

When both normalize_by_count=True and fill_empty_clusters=True (document mode), empty clusters are filled with raw vectors but aren't normalized by count. This creates inconsistency where regular clusters contain mean vectors while filled clusters contain raw vectors.

Consider normalizing the filled vectors to maintain consistency:

                        # Assign the best matching vector to the empty cluster
                        if best_vector is not None:
-                            cluster_centers[i] = best_vector
+                            if normalize_by_count:
+                                cluster_centers[i] = best_vector  # Already normalized (single vector)
+                            else:
+                                cluster_centers[i] = best_vector

Alternatively, document this behavior in the docstring if it's intentional.

Likely an incorrect or invalid review comment.

fastembed/postprocess/__init__.py (1)

1-3: LGTM! Standard package initialization.

Correctly imports and exports the Muvera class for the postprocess module.

@kacperlukawski (Member, Author)

> I was expecting better results, but we can still see consistent improvement in the metrics as the embedding size grows, so I think we can conclude that the implementation is correct.

The paper suggests using Muvera embeddings for candidate retrieval and original multivectors for reranking, so that's kind of expected. Thanks for putting in all this effort! When can we expect the next release of FastEmbed, so we can announce Muvera support?
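The retrieve-then-rerank setup the paper recommends looks roughly like this. This is a hypothetical sketch on top of precomputed embeddings, not a FastEmbed API; maxsim stands in for the exact late-interaction score:

```python
import numpy as np

def maxsim(query_mv: np.ndarray, doc_mv: np.ndarray) -> float:
    """ColBERT-style late-interaction score: sum over query tokens of max similarity."""
    return float((query_mv @ doc_mv.T).max(axis=1).sum())

def retrieve_then_rerank(query_fde, query_mv, doc_fdes, doc_mvs, top_k=100):
    # Stage 1: cheap candidate retrieval with single-vector MUVERA FDEs (dot product)
    scores = doc_fdes @ query_fde
    candidates = np.argsort(-scores)[:top_k]
    # Stage 2: exact multi-vector reranking of the shortlisted candidates
    return sorted(candidates, key=lambda i: maxsim(query_mv, doc_mvs[i]), reverse=True)

rng = np.random.default_rng(0)
order = retrieve_then_rerank(
    query_fde=rng.random(8), query_mv=rng.random((3, 4)),
    doc_fdes=rng.random((5, 8)), doc_mvs=[rng.random((6, 4)) for _ in range(5)],
    top_k=3,
)
print(len(order))  # 3
```

Stage 1 only needs a single-vector index over the FDEs, which is what makes MUVERA retrieval as fast as ordinary dense search; the expensive maxsim is applied only to the top_k candidates.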

@joein (Member)

joein commented Aug 1, 2025

Oops! I forgot to do the reranking.
I also wanted to add model2vec to the next release, so I'd expect it by the middle or end of next week.

@joein joein requested a review from tbung August 14, 2025 14:10
Comment on lines +113 to +120
k_sim (int, optional): Number of SimHash functions (creates 2^k_sim clusters).
Defaults to 5.
dim_proj (int, optional): Dimensionality after random projection (must be <= dim).
Defaults to 16.
r_reps (int, optional): Number of random projection repetitions for robustness.
Defaults to 20.
random_seed (int, optional): Seed for random number generator to ensure
reproducible results. Defaults to 42.
"optional" is typically used for args that can be None, not args that have a default value, though other things in fastembed use it for "default value", too, so it's fine.

Not sure if we need to keep this if it's empty.

* vectorize operations

* fix: fill empty clusters with dataset vectors

* rollback get_output_dimension

* fix: fix type hints

* fix: review comments
@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
fastembed/postprocess/muvera.py (1)

244-244: Fix minor typographical error in docstring.

There's an extra closing bracket in the type annotation.

-        Args:
-            vectors (NumpyArray]): Query vectors of shape (n_tokens, dim)
+        Args:
+            vectors (NumpyArray): Query vectors of shape (n_tokens, dim)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between f0fb50e and 05c4e04.

📒 Files selected for processing (1)
  • fastembed/postprocess/muvera.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
fastembed/postprocess/muvera.py (3)
fastembed/late_interaction/late_interaction_embedding_base.py (1)
  • LateInteractionTextEmbeddingBase (8-71)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • LateInteractionMultimodalEmbeddingBase (10-78)
fastembed/text/text_embedding.py (1)
  • embedding_size (132-136)
⏰ Context from checks skipped due to the 90000 ms timeout (15): the Python 3.9–3.13 test matrix on ubuntu-latest, macos-latest, and windows-latest.
🔇 Additional comments (9)
fastembed/postprocess/muvera.py (9)

1-17: LGTM! Clean imports and well-defined module constants.

The imports are appropriately organized, the type alias clearly defines the supported multi-vector models, and the constants are well-documented. The POPCOUNT LUT optimization for Hamming distance computation is a nice performance touch.


19-32: Efficient Hamming distance implementation using lookup table.

The implementation correctly leverages the POPCOUNT LUT for fast bit counting. The use of broadcasting and vectorized operations ensures good performance for computing the full pairwise distance matrix.


34-84: Well-implemented SimHash clustering with clear documentation.

The class provides a clean interface for locality-sensitive hashing using random hyperplanes. The bit manipulation in get_cluster_ids is elegant and mathematically sound.
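The two pieces praised here, the LUT-based Hamming distance and the k-bit SimHash bucketing, can be sketched in a few lines. These are simplified, hypothetical versions; the actual module's signatures may differ:

```python
import numpy as np

# 8-bit popcount lookup table, as described in the review comments above
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_distance_matrix(ids_a: np.ndarray, ids_b: np.ndarray) -> np.ndarray:
    """Pairwise Hamming distance between (<= 8-bit) cluster IDs via the LUT."""
    xor = np.bitwise_xor(ids_a[:, None], ids_b[None, :]).astype(np.uint8)
    return POPCOUNT[xor]

def simhash_cluster_ids(vectors: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """k-bit SimHash: sign of projection onto k hyperplanes, packed into an integer."""
    bits = (vectors @ hyperplanes.T > 0).astype(np.int64)
    return bits @ (2 ** np.arange(hyperplanes.shape[0]))

rng = np.random.default_rng(42)
ids = simhash_cluster_ids(rng.normal(size=(10, 128)), rng.normal(size=(5, 128)))
dists = hamming_distance_matrix(ids, ids)
print(ids.max() < 32, dists.diagonal().max())  # True 0
```

With k_sim = 5 hyperplanes, every cluster ID fits in 5 bits (32 clusters), so a single 256-entry table covers all XOR values; nearby vectors tend to land in buckets at small Hamming distance, which is what the empty-cluster filling exploits.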


86-150: Comprehensive MUVERA implementation with robust parameter validation.

The class structure is well-designed with clear documentation explaining the algorithm steps. The parameter validation in the constructor prevents common configuration errors.


151-205: Convenient factory method with excellent documentation and example.

The from_multivector_model class method provides a user-friendly way to create MUVERA instances from existing models. The comprehensive docstring with a practical example is particularly helpful.


207-220: Clean property implementation following consistent patterns.

The embedding size property follows the same pattern as other FastEmbed models, ensuring API consistency across the codebase.


221-249: Well-designed document and query processing methods with appropriate defaults.

The separation of document and query processing with different parameter defaults aligns well with the MUVERA paper's recommendations. The methods provide clear abstractions over the core process method.


251-357: Comprehensive core processing implementation with robust error handling.

The process method implements the full MUVERA algorithm correctly with proper handling of edge cases like empty clusters. The step-by-step approach makes the complex algorithm understandable and maintainable.


359-365: Simple demo implementation for testing the module.

The __main__ block provides a basic demonstration of the MUVERA functionality, which is useful for manual testing and verification.

@joein joein merged commit ca023be into main Aug 20, 2025
23 checks passed
@joein joein deleted the feature/muvera-embedding branch August 20, 2025 23:59
@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (7)
tests/test_postprocess.py (7)

21-22: Use DIM constant and float32 to minimize numerical drift.

Use the declared DIM instead of a magic number and cast to float32 to better match typical embedding dtypes, reducing cross-platform numeric variance.

Apply this diff:

-    random_generator = np.random.default_rng(42)
-    multivector = random_generator.random((10, 128))
+    rng = np.random.default_rng(42)
+    multivector = rng.random((10, DIM)).astype(np.float32)

19-19: Rename test for intent clarity.

Name communicates purpose; current name is vague.

Apply this diff:

-def test_single_input():
+def test_muvera_process_consistency_and_constructors():

28-31: Assert the expected output dimensionality explicitly.

Strengthen the shape assertion by checking against the formula r_reps * k_sim * dim_proj and the Muvera-reported embedding_size.

Apply this diff:

-        fde = muvera.process(multivector)
-        assert fde.shape[0] == muvera.embedding_size
-        assert np.allclose(fde[:3], CANONICAL_VALUES)
+        fde = muvera.process(multivector)
+        expected_dim = R_REPS * K_SIM * DIM_PROJ
+        assert fde.shape[0] == expected_dim
+        assert muvera.embedding_size == expected_dim
+        assert np.allclose(fde[:3], CANONICAL_VALUES, rtol=1e-5, atol=1e-7)

32-35: Keep the doc-path equality check; ensure numeric tolerance.

Retain equality with an explicit tolerance, mirroring the change above.

Apply this diff:

-        fde_doc = muvera.process_document(multivector)
-        assert fde_doc.shape[0] == muvera.embedding_size
-        assert np.allclose(fde, fde_doc)
+        fde_doc = muvera.process_document(multivector)
+        assert fde_doc.shape[0] == muvera.embedding_size
+        assert np.allclose(fde, fde_doc, rtol=1e-5, atol=1e-7)

36-38: Strengthen query-path invariants and loosen brittleness.

Add checks that are robust across minor numeric changes: query vector should differ from doc vector, and it should contain zeros when fill_empty_clusters=False. Keep canonical checks but with explicit tolerance.

Apply this diff:

-        fde_query = muvera.process_query(multivector)
-        assert fde_query.shape[0] == muvera.embedding_size
-        assert np.allclose(fde_query[np.nonzero(fde_query)][:3], CANONICAL_QUERY_VALUES)
+        fde_query = muvera.process_query(multivector)
+        assert fde_query.shape[0] == muvera.embedding_size
+        # Query FDE differs from doc FDE by design (no fill/normalization)
+        assert not np.allclose(fde_query, fde_doc, rtol=1e-5, atol=1e-7)
+        # Expect some zeros in the query FDE
+        assert np.count_nonzero(fde_query) < muvera.embedding_size
+        # Canonical sentinel check with explicit tolerance
+        nonzero_vals = fde_query[np.nonzero(fde_query)][:3]
+        assert np.allclose(nonzero_vals, CANONICAL_QUERY_VALUES, rtol=1e-5, atol=1e-7)

24-27: Cross-verify both constructors produce identical outputs.

Given identical hyperparameters and seeds, process() results should match across constructors; assert explicitly.

Apply this diff to append after the loop:

@@
         assert np.allclose(fde_query[np.nonzero(fde_query)][:3], CANONICAL_QUERY_VALUES)
 
+    # Cross-verify: both constructors should produce identical outputs for the same input
+    muvera_a = Muvera(dim=DIM, k_sim=K_SIM, dim_proj=DIM_PROJ, r_reps=R_REPS, random_seed=42)
+    muvera_b = Muvera.from_multivector_model(model, k_sim=K_SIM, dim_proj=DIM_PROJ, r_reps=R_REPS)
+    fde_a = muvera_a.process(multivector)
+    fde_b = muvera_b.process(multivector)
+    assert np.allclose(fde_a, fde_b, rtol=1e-5, atol=1e-7)

Also applies to: 28-38


39-39: Add a negative test for invalid dim_proj to catch regression.

from_multivector_model is documented to raise when dim_proj > embedding_size. Add a regression test to enforce this contract.

Apply this patch at the end of the file:

+
+def test_from_multivector_model_raises_on_invalid_dim_proj():
+    model = SimpleNamespace(embedding_size=4)
+    with pytest.raises(ValueError):
+        Muvera.from_multivector_model(model, k_sim=K_SIM, dim_proj=8, r_reps=R_REPS)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 05c4e04 and e924524.

📒 Files selected for processing (1)
  • tests/test_postprocess.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/test_postprocess.py (2)
fastembed/late_interaction/late_interaction_text_embedding.py (1)
  • LateInteractionTextEmbedding (14-153)
fastembed/postprocess/muvera.py (6)
  • Muvera (86-356)
  • from_multivector_model (152-205)
  • process (251-356)
  • embedding_size (218-219)
  • process_document (221-234)
  • process_query (236-249)
⏰ Context from checks skipped due to the 90000 ms timeout (15): the Python 3.9–3.13 test matrix on ubuntu-latest, macos-latest, and windows-latest.
🔇 Additional comments (1)
tests/test_postprocess.py (1)

24-38: Solid baseline checks and determinism across paths.

Good coverage to exercise both constructors and to check consistency between process and process_document. Seeding the RNG ensures determinism.

Comment on lines +3 to +4
from fastembed import LateInteractionTextEmbedding
from fastembed.postprocess import Muvera

⚠️ Potential issue

Avoid network/model dependency in unit tests; stub the model.

Creating LateInteractionTextEmbedding, even with lazy_load=True, can still trigger heavyweight imports or network access in CI. This test only needs embedding_size; replace the real model with a lightweight stub to eliminate flakiness and speed up tests.

Apply this diff:

@@
-import numpy as np
+import numpy as np
+import pytest
@@
-from fastembed import LateInteractionTextEmbedding
 from fastembed.postprocess import Muvera
+from types import SimpleNamespace
@@
-def test_single_input():
-    model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0", lazy_load=True)
-    random_generator = np.random.default_rng(42)
-    multivector = random_generator.random((10, 128))
+def test_single_input():
+    # Avoid loading external models in unit tests; only embedding_size is needed here.
+    model = SimpleNamespace(embedding_size=DIM)
+    rng = np.random.default_rng(42)
+    multivector = rng.random((10, DIM)).astype(np.float32)

Also applies to: 20-22

