add Multilingual E5 models #7

brendanator · 2024-04-15T15:20:08Z

This PR adds the intfloat/multilingual-e5-small and intfloat/multilingual-e5-base models.

These models have no "token_type_ids" input, so I added logic that checks whether the session expects it.
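A minimal sketch of that kind of check, not the PR's actual code. The `Input` and `Session` types below are hypothetical stand-ins for the ONNX runtime session type, and the `input_ids` / `attention_mask` names are assumed as the inputs that are always present.

```rust
// Hypothetical stand-ins for the runtime's session type; only the
// `inputs` list with a `name` per input is modelled here.
struct Input {
    name: String,
}

struct Session {
    inputs: Vec<Input>,
}

/// Returns true when the model declares a `token_type_ids` input.
fn needs_token_type_ids(session: &Session) -> bool {
    session
        .inputs
        .iter()
        .any(|input| input.name == "token_type_ids")
}

/// Lists the input names to feed, adding `token_type_ids` only when the
/// model declares it. `input_ids` and `attention_mask` are assumed here.
fn input_names(session: &Session) -> Vec<&'static str> {
    let mut names = vec!["input_ids", "attention_mask"];
    if needs_token_type_ids(session) {
        names.push("token_type_ids");
    }
    names
}

/// Hypothetical helper that builds a stand-in session from input names.
fn session_with(names: &[&str]) -> Session {
    Session {
        inputs: names
            .iter()
            .map(|n| Input { name: n.to_string() })
            .collect(),
    }
}
```

Calling `input_names(&session_with(&["input_ids", "attention_mask"]))` yields only the two assumed inputs, which is the situation described for the E5 models.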

Curiously, the intfloat/multilingual-e5-large ONNX model is only 546 kB (small is 470 MB, base is 1.11 GB), and it can't run inference, so I commented out the large model definition.
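One way to catch a file like this before loading it is a plain size check. The sketch below is illustrative only: `MIN_PLAUSIBLE_MODEL_BYTES` and `looks_truncated` are hypothetical names, and the 1 MiB threshold is an assumption chosen because the other two variants are hundreds of megabytes.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical threshold: anything under 1 MiB is treated as suspicious
/// for a model whose smaller siblings are hundreds of megabytes.
const MIN_PLAUSIBLE_MODEL_BYTES: u64 = 1024 * 1024;

/// Returns Ok(true) when the file at `path` is smaller than the threshold.
/// Errors from reading the file metadata are passed through to the caller.
fn looks_truncated(path: &Path) -> io::Result<bool> {
    let size = fs::metadata(path)?.len();
    Ok(size < MIN_PLAUSIBLE_MODEL_BYTES)
}
```

A 546 kB file falls below this threshold, so a check like this would flag the large model before a session is created for it.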

brendanator · 2024-04-15T15:20:21Z

This is a benchmark review for experiment review_of_100_reviews_20240409.
Run ID: review_of_100_reviews_20240409/benchmark_2024-04-15T16-19-27_v1-16-0-123-g1e31c5645-dirty.

This pull request was cloned from https://github.com/Anush008/fastembed-rs/pull/48. (Note: the URL is not a link to avoid triggering a notification on the original pull request.)

Experiment configuration

review_config:
  # User configuration for the review
  # - benchmark - use the user config from the benchmark reviews
  # - <value> - use the value directly
  user_config:
    enable_ai_review: true
    enable_rule_comments: false

    enable_complexity_comments: benchmark
    enable_docstring_comments: benchmark
    enable_security_comments: benchmark
    enable_tests_comments: benchmark
    enable_comment_suggestions: benchmark

    enable_approvals: true

  ai_review_config:
    # The model responses to use for the experiment
    # - benchmark - use the model responses from the benchmark reviews
    # - llm - call the language model to generate responses
    model_responses:
      comments_model: benchmark
      comment_validation_model: benchmark
      comment_suggestion_model: benchmark
      complexity_model: benchmark
      docstrings_model: benchmark
      security_model: benchmark
      tests_model: benchmark

# The pull request dataset to run the experiment on
pull_request_dataset:
- https://github.com/mraniki/iamlistening/pull/294
- https://github.com/gdsfactory/gplugins/pull/373
- https://github.com/Anush008/fastembed-rs/pull/48
- https://github.com/mraniki/tt/pull/1435
- https://github.com/kloudlite/operator/pull/172
- https://github.com/mraniki/iamlistening/pull/293
- https://github.com/mraniki/iamlistening/pull/292
- https://github.com/mraniki/cefi/pull/434
- https://github.com/kloudlite/operator/pull/171
- https://github.com/usama-maxenius/image-editor/pull/62
- https://github.com/mraniki/tt/pull/1434
- https://github.com/mraniki/dxsp/pull/614
- https://github.com/albumentations-team/albumentations/pull/1637
- https://github.com/erxes/erxes/pull/5119
- https://github.com/mraniki/cefi/pull/433
- https://github.com/Quarticai/QuarticSDK/pull/358
- https://github.com/mraniki/cefi/pull/432
- https://github.com/tpaviot/pythonocc-core/pull/1311
- https://github.com/lightning-bot/Lightning/pull/144
- https://github.com/ignition-api/8.1/pull/265
- https://github.com/fairdataihub/fairdataihub.org/pull/616
# - https://github.com/suttacentral/suttacentral/pull/3122
# - https://github.com/jquagga/ttt/pull/30
# - https://github.com/jquagga/ttt/pull/29
# - https://github.com/Harrytimbog/Peer-Pal/pull/22
# - https://github.com/bengosney/cerberus/pull/790
# - https://github.com/Harrytimbog/Peer-Pal/pull/21
# - https://github.com/mraniki/cefi/pull/431
# - https://github.com/bengosney/cerberus/pull/789
# - https://github.com/bengosney/cerberus/pull/788
# - https://github.com/jmcerrejon/PiKISS/pull/214
# - https://github.com/mraniki/dxsp/pull/613
# - https://github.com/mraniki/cefi/pull/430
# - https://github.com/mraniki/cefi/pull/429
# - https://github.com/gdsfactory/gdsfactory/pull/2658
# - https://github.com/Bilbottom/sql-learning-materials/pull/7
# - https://github.com/mraniki/cefi/pull/428
# - https://github.com/KonScanner/synthr-farming/pull/1
# - https://github.com/rtk-rnjn/algorithms/pull/78
# - https://github.com/malayilneil/lab04/pull/1
# - https://github.com/nbhirud/system_update/pull/6
# - https://github.com/mraniki/cefi/pull/427
# - https://github.com/Kilo59/ruff-sync/pull/16
# - https://github.com/jquagga/ttt/pull/27
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/10
# - https://github.com/jquagga/ttt/pull/25
# - https://github.com/mraniki/tt/pull/1425
# - https://github.com/albumentations-team/albumentations_stats/pull/1
# - https://github.com/jquagga/ttt/pull/24
# - https://github.com/jsugg/retry-on/pull/1
# - https://github.com/strawberry-graphql/strawberry/pull/3442
# - https://github.com/jquagga/ttt/pull/23
# - https://github.com/jquagga/ttt/pull/22
# - https://github.com/jquagga/ttt/pull/21
# - https://github.com/jquagga/ttt/pull/20
# - https://github.com/Kilo59/ruff-sync/pull/14
# - https://github.com/jquagga/ttt/pull/19
# - https://github.com/jquagga/ttt/pull/18
# - https://github.com/jquagga/ttt/pull/17
# - https://github.com/brendancsmith/diffbot-kg/pull/3
# - https://github.com/2lambda123/StenaIT-stenajs-webui/pull/1
# - https://github.com/jkool702/openwrt/pull/24
# - https://github.com/KevinNitroG/VNULIB-Downloader/pull/28
# - https://github.com/CPUT-DEVS/devpost-hackathon/pull/15
# - https://github.com/code-Harsh247/FRSS-project/pull/34
# - https://github.com/code-Harsh247/FRSS-project/pull/33
# - https://github.com/kurianbenoy/samam-ml-verification/pull/1
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/8
# - https://github.com/mraniki/dxsp/pull/612
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/7
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/6
# - https://github.com/neurodatascience/cohort_creator/pull/207
# - https://github.com/albumentations-team/albumentations-demo/pull/12
# - https://github.com/mraniki/cefi/pull/426
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/5
# - https://github.com/jkool702/openwrt/pull/23
# - https://github.com/jkool702/openwrt/pull/22
# - https://github.com/ynvtlmr/intergenerational-family-code/pull/108
# - https://github.com/PythonFreeCourse/lms/pull/390
# - https://github.com/jquagga/ttt/pull/16
# - https://github.com/PythonFreeCourse/lms/pull/389
# - https://github.com/PythonFreeCourse/lms/pull/389
# - https://github.com/PythonFreeCourse/lms/pull/389
# - https://github.com/jquagga/ttt/pull/13
# - https://github.com/Speccy-Rom/Leetcode_aka_speccy-rom/pull/293
# - https://github.com/Bilbottom/sql-learning-materials/pull/5
# - https://github.com/mraniki/cefi/pull/425
# - https://github.com/approvals/Approvals.NodeJS/pull/173
# - https://github.com/gdsfactory/gdsfactory/pull/2657
# - https://github.com/mraniki/tt/pull/1420
# - https://github.com/vibikerski/trackingtasks/pull/2
# - https://github.com/yaitoo/sqle/pull/30
# - https://github.com/jquagga/ttt/pull/12
# - https://github.com/Mesteriis/test-repo/pull/4
# - https://github.com/Mesteriis/test-repo/pull/3
# - https://github.com/Mesteriis/test-repo/pull/2
# - https://github.com/Mesteriis/test-repo/pull/1
# - https://github.com/letsdoitnowus/planium-backend/pull/34
# - https://github.com/code-Harsh247/FRSS-project/pull/32
# - https://github.com/letsdoitnowus/planium-backend/pull/33

# Questions to ask to label the review comments
review_comment_labels:
- label: correct
  question: Is this comment correct?
- label: helpful
  question: Is this comment helpful?
- label: comment-type
  question: Is the comment type correct?
- label: comment-area
  question: Is the comment area correct?

# Benchmark reviews generated by running
#   python -m scripts.experiment benchmark <experiment_name>
benchmark_reviews:
- dataset_pull_request: https://github.com/jquagga/ttt/pull/12
  review_pull_request: https://github.com/sourcery-ai-experiments/ttt/pull/25

SourceryAI

Hey @brendanator - I've reviewed your changes and they look great!

Here's what I looked at during the review

🟡 General issues: 2 issues found
🟢 Security: all looks good
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟡 Docstrings: 1 issue found

LangSmith trace

Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.

SourceryAI · 2024-04-15T15:22:03Z

src/lib.rs

+    MultilingualE5Small,
+    /// Base model of multilingual E5 Text Embeddings
+    MultilingualE5Base,
+    // Large model is something wrong, model.onnx size is only 546kB


suggestion (code_clarification): Clarify the comment about the large model issue.

The comment about the large model being 'something wrong' is vague. It would be helpful to specify what the issue is, whether it's a temporary or permanent problem, and any planned steps to resolve it.

Suggested change

// Large model is something wrong, model.onnx size is only 546kB

// The Large model of multilingual E5 Text Embeddings appears to be incorrect due to its unusually small size (546kB). This issue is currently under investigation to determine if it's a file corruption or a misconfiguration. Updates or fixes will be applied once the problem is fully diagnosed.

SourceryAI · 2024-04-15T15:22:03Z

src/lib.rs

+        let need_token_type_ids = session
+            .inputs
+            .iter()
+            .any(|input| input.name == "token_type_ids");


suggestion (code_refinement): Consider initializing 'need_token_type_ids' directly in the struct declaration.

Initializing 'need_token_type_ids' directly in the struct declaration could simplify the 'new' method and improve readability.

Suggested change

let need_token_type_ids = session
    .inputs
    .iter()
    .any(|input| input.name == "token_type_ids");

Self {
    tokenizer,
    session,
    need_token_type_ids: session.inputs.iter().any(|input| input.name == "token_type_ids")
}
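A note on field order when folding the check into the struct literal: Rust evaluates struct field expressions in the order they are written, so an expression that reads `session.inputs` after `session` has already been moved into its field is rejected as a use of a moved value. The sketch below keeps the flag in a local binding computed before the move. `Input`, `Session` and `Tokenizer` are hypothetical stand-ins, and the struct layout is an assumption based on the field names in this thread.

```rust
// Hypothetical stand-ins for the real tokenizer and session types.
struct Input {
    name: String,
}

struct Session {
    inputs: Vec<Input>,
}

struct Tokenizer;

// Assumed layout: the three fields named in the review comment.
struct TextEmbedding {
    tokenizer: Tokenizer,
    session: Session,
    need_token_type_ids: bool,
}

impl TextEmbedding {
    /// Builds an instance, setting `need_token_type_ids` from the session inputs.
    fn new(tokenizer: Tokenizer, session: Session) -> Self {
        // Read `session.inputs` while `session` is still owned by this scope.
        let need_token_type_ids = session
            .inputs
            .iter()
            .any(|input| input.name == "token_type_ids");

        // `session` is moved only after the flag has been computed.
        Self {
            tokenizer,
            session,
            need_token_type_ids,
        }
    }
}
```

Writing the inline expression before the `session` field in the literal would also compile, but the local binding keeps the literal short and makes the ordering requirement explicit.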

SourceryAI · 2024-04-15T15:22:03Z

src/lib.rs

        Ok(Self::new(tokenizer, session))
    }

    /// Private method to return an instance


suggestion (docstrings): Please update the docstring for function: TextEmbedding::new

Reason for update: Initialization logic has changed to include a new field based on session inputs.

Suggested new docstring:

/// Private method to return an instance, initializing `need_token_type_ids` based on session inputs.

kounoike added 3 commits April 8, 2024 06:57

add Multilingual E5 models

f921155

fix session inputs creation

9211da0

add multilingual E5 models to README

b657560

SourceryAI approved these changes Apr 15, 2024
