-
Notifications
You must be signed in to change notification settings - Fork 0
add Multilingual E5 models #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: base-sha/77439ea5a87ef0ba082009226a521011873f7dfc
Are you sure you want to change the base?
add Multilingual E5 models #7
Conversation
|
This is a benchmark review for experiment This pull request was cloned from Experiment configurationreview_config:
# User configuration for the review
# - benchmark - use the user config from the benchmark reviews
# - <value> - use the value directly
user_config:
enable_ai_review: true
enable_rule_comments: false
enable_complexity_comments: benchmark
enable_docstring_comments: benchmark
enable_security_comments: benchmark
enable_tests_comments: benchmark
enable_comment_suggestions: benchmark
enable_approvals: true
ai_review_config:
# The model responses to use for the experiment
# - benchmark - use the model responses from the benchmark reviews
# - llm - call the language model to generate responses
model_responses:
comments_model: benchmark
comment_validation_model: benchmark
comment_suggestion_model: benchmark
complexity_model: benchmark
docstrings_model: benchmark
security_model: benchmark
tests_model: benchmark
# The pull request dataset to run the experiment on
pull_request_dataset:
- https://github.com/mraniki/iamlistening/pull/294
- https://github.com/gdsfactory/gplugins/pull/373
- https://github.com/Anush008/fastembed-rs/pull/48
- https://github.com/mraniki/tt/pull/1435
- https://github.com/kloudlite/operator/pull/172
- https://github.com/mraniki/iamlistening/pull/293
- https://github.com/mraniki/iamlistening/pull/292
- https://github.com/mraniki/cefi/pull/434
- https://github.com/kloudlite/operator/pull/171
- https://github.com/usama-maxenius/image-editor/pull/62
- https://github.com/mraniki/tt/pull/1434
- https://github.com/mraniki/dxsp/pull/614
- https://github.com/albumentations-team/albumentations/pull/1637
- https://github.com/erxes/erxes/pull/5119
- https://github.com/mraniki/cefi/pull/433
- https://github.com/Quarticai/QuarticSDK/pull/358
- https://github.com/mraniki/cefi/pull/432
- https://github.com/tpaviot/pythonocc-core/pull/1311
- https://github.com/lightning-bot/Lightning/pull/144
- https://github.com/ignition-api/8.1/pull/265
- https://github.com/fairdataihub/fairdataihub.org/pull/616
# - https://github.com/suttacentral/suttacentral/pull/3122
# - https://github.com/jquagga/ttt/pull/30
# - https://github.com/jquagga/ttt/pull/29
# - https://github.com/Harrytimbog/Peer-Pal/pull/22
# - https://github.com/bengosney/cerberus/pull/790
# - https://github.com/Harrytimbog/Peer-Pal/pull/21
# - https://github.com/mraniki/cefi/pull/431
# - https://github.com/bengosney/cerberus/pull/789
# - https://github.com/bengosney/cerberus/pull/788
# - https://github.com/jmcerrejon/PiKISS/pull/214
# - https://github.com/mraniki/dxsp/pull/613
# - https://github.com/mraniki/cefi/pull/430
# - https://github.com/mraniki/cefi/pull/429
# - https://github.com/gdsfactory/gdsfactory/pull/2658
# - https://github.com/Bilbottom/sql-learning-materials/pull/7
# - https://github.com/mraniki/cefi/pull/428
# - https://github.com/KonScanner/synthr-farming/pull/1
# - https://github.com/rtk-rnjn/algorithms/pull/78
# - https://github.com/malayilneil/lab04/pull/1
# - https://github.com/nbhirud/system_update/pull/6
# - https://github.com/mraniki/cefi/pull/427
# - https://github.com/Kilo59/ruff-sync/pull/16
# - https://github.com/jquagga/ttt/pull/27
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/10
# - https://github.com/jquagga/ttt/pull/25
# - https://github.com/mraniki/tt/pull/1425
# - https://github.com/albumentations-team/albumentations_stats/pull/1
# - https://github.com/jquagga/ttt/pull/24
# - https://github.com/jsugg/retry-on/pull/1
# - https://github.com/strawberry-graphql/strawberry/pull/3442
# - https://github.com/jquagga/ttt/pull/23
# - https://github.com/jquagga/ttt/pull/22
# - https://github.com/jquagga/ttt/pull/21
# - https://github.com/jquagga/ttt/pull/20
# - https://github.com/Kilo59/ruff-sync/pull/14
# - https://github.com/jquagga/ttt/pull/19
# - https://github.com/jquagga/ttt/pull/18
# - https://github.com/jquagga/ttt/pull/17
# - https://github.com/brendancsmith/diffbot-kg/pull/3
# - https://github.com/2lambda123/StenaIT-stenajs-webui/pull/1
# - https://github.com/jkool702/openwrt/pull/24
# - https://github.com/KevinNitroG/VNULIB-Downloader/pull/28
# - https://github.com/CPUT-DEVS/devpost-hackathon/pull/15
# - https://github.com/code-Harsh247/FRSS-project/pull/34
# - https://github.com/code-Harsh247/FRSS-project/pull/33
# - https://github.com/kurianbenoy/samam-ml-verification/pull/1
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/8
# - https://github.com/mraniki/dxsp/pull/612
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/7
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/6
# - https://github.com/neurodatascience/cohort_creator/pull/207
# - https://github.com/albumentations-team/albumentations-demo/pull/12
# - https://github.com/mraniki/cefi/pull/426
# - https://github.com/alexiusstrauss/CryptoTrendAnalyzer/pull/5
# - https://github.com/jkool702/openwrt/pull/23
# - https://github.com/jkool702/openwrt/pull/22
# - https://github.com/ynvtlmr/intergenerational-family-code/pull/108
# - https://github.com/PythonFreeCourse/lms/pull/390
# - https://github.com/jquagga/ttt/pull/16
# - https://github.com/PythonFreeCourse/lms/pull/389
# - https://github.com/PythonFreeCourse/lms/pull/389
# - https://github.com/PythonFreeCourse/lms/pull/389
# - https://github.com/jquagga/ttt/pull/13
# - https://github.com/Speccy-Rom/Leetcode_aka_speccy-rom/pull/293
# - https://github.com/Bilbottom/sql-learning-materials/pull/5
# - https://github.com/mraniki/cefi/pull/425
# - https://github.com/approvals/Approvals.NodeJS/pull/173
# - https://github.com/gdsfactory/gdsfactory/pull/2657
# - https://github.com/mraniki/tt/pull/1420
# - https://github.com/vibikerski/trackingtasks/pull/2
# - https://github.com/yaitoo/sqle/pull/30
# - https://github.com/jquagga/ttt/pull/12
# - https://github.com/Mesteriis/test-repo/pull/4
# - https://github.com/Mesteriis/test-repo/pull/3
# - https://github.com/Mesteriis/test-repo/pull/2
# - https://github.com/Mesteriis/test-repo/pull/1
# - https://github.com/letsdoitnowus/planium-backend/pull/34
# - https://github.com/code-Harsh247/FRSS-project/pull/32
# - https://github.com/letsdoitnowus/planium-backend/pull/33
# Questions to ask to label the review comments
review_comment_labels:
- label: correct
question: Is this comment correct?
- label: helpful
question: Is this comment helpful?
- label: comment-type
question: Is the comment type correct?
- label: comment-area
question: Is the comment area correct?
# Benchmark reviews generated by running
# python -m scripts.experiment benchmark <experiment_name>
benchmark_reviews:
- dataset_pull_request: https://github.com/jquagga/ttt/pull/12
review_pull_request: https://github.com/sourcery-ai-experiments/ttt/pull/25
|
SourceryAI
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @brendanator - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟡 Docstrings: 1 issue found
Help me be more useful! Please click 👍 or 👎 on each comment to tell me if it was helpful.
| MultilingualE5Small, | ||
| /// Base model of multilingual E5 Text Embeddings | ||
| MultilingualE5Base, | ||
| // Large model is something wrong, model.onnx size is only 546kB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code_clarification): Clarify the comment about the large model issue.
The comment about the large model being 'something wrong' is vague. It would be helpful to specify what the issue is, whether it's a temporary or permanent problem, and any planned steps to resolve it.
| // Large model is something wrong, model.onnx size is only 546kB | |
| // The Large model of multilingual E5 Text Embeddings appears to be incorrect due to its unusually small size (546kB). This issue is currently under investigation to determine if it's a file corruption or a misconfiguration. Updates or fixes will be applied once the problem is fully diagnosed. |
| let need_token_type_ids = session | ||
| .inputs | ||
| .iter() | ||
| .any(|input| input.name == "token_type_ids"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code_refinement): Consider initializing 'need_token_type_ids' directly in the struct declaration.
Initializing 'need_token_type_ids' directly in the struct declaration could simplify the 'new' method and improve readability.
| let need_token_type_ids = session | |
| .inputs | |
| .iter() | |
| .any(|input| input.name == "token_type_ids"); | |
| Self { | |
| tokenizer, | |
| session, | |
| need_token_type_ids: session.inputs.iter().any(|input| input.name == "token_type_ids") | |
| } |
| Ok(Self::new(tokenizer, session)) | ||
| } | ||
|
|
||
| /// Private method to return an instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (docstrings): Please update the docstring for function: TextEmbedding::new
Reason for update: Initialization logic has changed to include a new field based on session inputs.
Suggested new docstring:
/// Private method to return an instance, initializing `need_token_type_ids` based on session inputs.
This PR adds intfloat/multilingual-e5-base, -small models.
these models have no "token_type_ids" inputs, so I introduce check logic.
curiously, intfloat/multilingual-e5-large onnx model is only 546kB(small is 470MB, base is 1.11GB). and it can't run inference. so, I commented out for large model definition.