add Multilingual E5 models #1

brendanator · 2024-04-09T09:24:54Z

This PR adds the intfloat/multilingual-e5-base and intfloat/multilingual-e5-small models.

These models have no "token_type_ids" input, so I added logic that checks whether the session expects it.

Curiously, the intfloat/multilingual-e5-large ONNX model is only 546 kB (small is 470 MB, base is 1.11 GB), and it cannot run inference, so I commented out the large model definition.
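
A size sanity check along the following lines would flag a file like the large model's before it is loaded. This is only a sketch: the 1 MiB threshold and the names `MIN_PLAUSIBLE_ONNX_BYTES`, `is_plausible_onnx_size`, and `check_model_file` are assumptions for illustration and are not part of this PR or of fastembed-rs.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical threshold: a transformer ONNX file smaller than 1 MiB
/// almost certainly does not contain its weights.
const MIN_PLAUSIBLE_ONNX_BYTES: u64 = 1024 * 1024;

/// Returns true when the reported size is at least the assumed threshold.
fn is_plausible_onnx_size(size_bytes: u64) -> bool {
    size_bytes >= MIN_PLAUSIBLE_ONNX_BYTES
}

/// Reads the file size from disk and applies the same check.
/// Propagates the I/O error if the file cannot be inspected.
#[allow(dead_code)]
fn check_model_file(path: &Path) -> io::Result<bool> {
    let size = fs::metadata(path)?.len();
    Ok(is_plausible_onnx_size(size))
}
```

With the sizes quoted above, the 546 kB file fails the check while the 470 MB and 1.11 GB files pass.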

brendanator · 2024-04-09T09:24:55Z

This is a benchmark review for experiment review_of_100_reviews_20240409.
Run ID: review_of_100_reviews_20240409/benchmark_2024-04-09T10-24-34_v1-16-0-95-g65fefaeef-dirty.

This pull request was cloned from https://github.com/Anush008/fastembed-rs/pull/48. (Note: the URL is not a link to avoid triggering a notification on the original pull request.)

brendanator

Sourcery Review:

Hey @brendanator - I've reviewed your changes and they look great!

Here's what I looked at during the review

🟡 General issues: 1 issue found
🟢 Security: all looks good
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟡 Docstrings: 1 issue found

LangSmith trace

Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.

brendanator · 2024-04-09T09:27:29Z

src/lib.rs

+    // Large model is something wrong, model.onnx size is only 546kB
+    // /// Large model of multilingual E5 Text Embeddings


suggestion (code_clarification): Consider clarifying the status of the MultilingualE5Large model in the code.

It's helpful for future maintenance to include a more detailed comment on why the Large model is commented out and under what conditions it might be revisited or fixed.

Suggested change

-    // Large model is something wrong, model.onnx size is only 546kB
-    // /// Large model of multilingual E5 Text Embeddings
+    // The Large model of multilingual E5 Text Embeddings is currently not included due to an issue with its ONNX file size (only 546kB),
+    // indicating a potential export or conversion problem. This model variant should be revisited once the underlying issue is resolved
+    // or if a correctly sized model file becomes available.

Is this comment correct?

Is this comment helpful?

Is the comment type correct?

Is the comment area correct?

brendanator · 2024-04-09T09:27:29Z

src/lib.rs

    }

    /// Private method to return an instance
    fn new(tokenizer: Tokenizer, session: Session) -> Self {


suggestion (docstrings): Please update the docstring for function: TextEmbedding::new

Reason for update: The function signature and internal logic have changed to include a new field need_token_type_ids.

Suggested new docstring:

/// Private method to return an instance, determining if `token_type_ids` are needed based on the session inputs.
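
For context, the behaviour this docstring describes could look roughly like the sketch below. It is not the PR's actual code: `Session` and `Tokenizer` are simplified stand-ins for the real types in src/lib.rs, the `input_names` field is an assumption about how input names are exposed, and `input_tensor_names` is a hypothetical helper added only to show how the flag would be used.

```rust
/// Hypothetical stand-in for the ONNX session; only input names matter here.
#[allow(dead_code)]
struct Session {
    input_names: Vec<String>,
}

/// Hypothetical stand-in for the tokenizer.
#[allow(dead_code)]
struct Tokenizer;

#[allow(dead_code)]
struct TextEmbedding {
    tokenizer: Tokenizer,
    session: Session,
    need_token_type_ids: bool,
}

#[allow(dead_code)]
impl TextEmbedding {
    /// Private method to return an instance, determining if `token_type_ids`
    /// are needed based on the session inputs.
    fn new(tokenizer: Tokenizer, session: Session) -> Self {
        // The flag is computed once, so inference does not re-scan the inputs.
        let need_token_type_ids = session
            .input_names
            .iter()
            .any(|name| name == "token_type_ids");
        Self {
            tokenizer,
            session,
            need_token_type_ids,
        }
    }

    /// Names of the tensors that would be passed to the session for a batch.
    fn input_tensor_names(&self) -> Vec<&'static str> {
        let mut names = vec!["input_ids", "attention_mask"];
        if self.need_token_type_ids {
            names.push("token_type_ids");
        }
        names
    }
}
```

Computing the flag in the constructor keeps the per-batch path to a single boolean test, which is one reason to mention it in the docstring.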

kounoike added 3 commits April 8, 2024 06:57

add Multilingual E5 models

f921155

fix session inputs creation

9211da0

add multilingual E5 models to README

b657560
