[CI] Add mteb testing for rerank models #19344
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Hello @noooop, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here, providing a summary for this pull request. This PR introduces MTEB (Massive Text Embedding Benchmark) testing specifically for rerank models within the CI pipeline. The goal is to ensure the correctness and performance of rerank models supported by the project by evaluating them against a standard benchmark task. The chosen task is NFCorpus (English part) using a two-stage approach: BM25 for initial retrieval followed by the rerank model on the top 10 query-corpus pairs, which aligns with common reranking use cases.
The PR adds the necessary infrastructure to run these tests, including new MTEB encoder implementations for the OpenAI API endpoints and helper functions to manage the MTEB evaluation process for reranking.
Highlights
- Add MTEB Rerank Testing: Introduces MTEB testing for rerank models to the CI pipeline to validate their performance and correctness.
- NFCorpus Rerank Task: Uses the English subset of the NFCorpus dataset with a BM25 retriever and reranking on the top 10 results as the benchmark task.
- OpenAI API Rerank/Score Client: Adds new MTEB encoder classes (`ScoreClientMtebEncoder`, `RerankClientMtebEncoder`) to evaluate models served via the OpenAI-compatible API's `/score` and `/rerank` endpoints.
- Rerank Testing Infrastructure: Adds helper functions (`run_mteb_rerank`, `mteb_test_rerank_models`) and a `RerankModelInfo` class to streamline defining and running MTEB rerank tests for various models (a usage sketch follows this list).
- Specific Model Tests Added: Adds MTEB rerank tests for BAAI rerankers, a cross-encoder model, and a Jina reranker.
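As a rough illustration of how these pieces fit together, here is a minimal sketch of a model-specific test in the style of tests/models/language/pooling/test_cross_encoder.py. The import paths, the `RerankModelInfo` constructor, and the `mteb_test_rerank_models` signature are assumptions inferred from the changelog below, not the PR's verbatim code.

```python
# Hypothetical sketch modeled on tests/models/language/pooling/test_cross_encoder.py.
# Import paths and the mteb_test_rerank_models signature are assumptions.
import pytest

from ...utils import RerankModelInfo       # assumed location (tests/models/utils.py)
from .mteb_utils import mteb_test_rerank_models

RERANK_MODELS = [
    RerankModelInfo("cross-encoder/ms-marco-MiniLM-L-6-v2"),
    RerankModelInfo("BAAI/bge-reranker-base"),
]


@pytest.mark.parametrize("model_info", RERANK_MODELS)
def test_rerank_models_mteb(hf_runner, vllm_runner,
                            model_info: RerankModelInfo) -> None:
    # Evaluates the model on the NFCorpus (English) rerank task with both the
    # HF/sentence-transformers reference and vLLM, then compares the main
    # scores within MTEB_RERANK_TOL.
    mteb_test_rerank_models(hf_runner, vllm_runner, model_info)
```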
Changelog
- tests/conftest.py
  - Modified the `score` method of `OpenAIClientMtebEncoder` to accept arbitrary `*args` and `**kwargs` (lines 1036-1038).
- tests/entrypoints/openai/correctness/test_mteb_embed.py
  - Removed import and usage of `run_mteb_embed_task_st` (lines 10, 36).
  - Updated `MODEL_NAME` from `BAAI/bge-m3` to `intfloat/e5-small` (line 15).
  - Updated `MAIN_SCORE` from 0.7873 to 0.7423 (line 16).
  - Removed `dtype` and `max-model-len` from the server arguments and added `--disable-uvicorn-access-log` (lines 22-23).
  - Renamed `test_mteb` to `test_mteb_embed` (line 29).
- tests/entrypoints/openai/correctness/test_mteb_score.py
  - New file adding MTEB rerank tests for models served via the OpenAI API.
  - Defines `MODEL_NAME` (`cross-encoder/ms-marco-MiniLM-L-6-v2`) and `MAIN_SCORE` (0.33702) for the test (lines 17-18).
  - Adds `test_mteb_score` to test the `/score` endpoint using MTEB (lines 31-42).
  - Adds `test_mteb_rerank` to test the `/rerank` endpoint using MTEB (lines 45-56).
- tests/models/language/pooling/mteb_utils.py
  - Imported `shutil`, `Optional`, `requests`, `HfRunner`, `VllmRunner`, and `RerankModelInfo` (lines 4-14).
  - Added constants `MTEB_RERANK_TASKS`, `MTEB_RERANK_LANGS`, and `MTEB_RERANK_TOL` for reranking (lines 24-26).
  - Added a `predict` method to `VllmMtebEncoder` to conform to the MTEB `Encoder` interface for reranking (lines 51-70).
  - Added a `ScoreClientMtebEncoder` class for MTEB evaluation using the OpenAI `/score` endpoint (lines 95-129); a sketch of this pattern follows the changelog.
  - Added a `RerankClientMtebEncoder` class for MTEB evaluation using the OpenAI `/rerank` endpoint (lines 132-142).
  - Modified `run_mteb_embed_task` to disable the progress bar during encoding (lines 152-154).
  - Removed the `run_mteb_embed_task_st` function (lines 74-77 in the original).
  - Added a `run_mteb_rerank` function to perform the two-stage MTEB reranking evaluation (lines 204-240).
  - Added an `mteb_test_rerank_models` helper function to run MTEB rerank tests for a given model info (lines 243-299).
- tests/models/language/pooling/test_baai.py
  - Imported `RerankModelInfo` and `mteb_test_rerank_models` (lines 5, 7).
  - Defined a `RERANK_MODELS` list containing BAAI reranker models for testing (lines 61-73).
  - Added a `test_rerank_models_mteb` parameterized test using the new rerank models and helper function (lines 90-93).
- tests/models/language/pooling/test_cross_encoder.py
  - New file adding MTEB rerank tests for cross-encoder models.
  - Imported `RerankModelInfo` and `mteb_test_rerank_models` (line 5).
  - Defined a `RERANK_MODELS` list containing cross-encoder models (lines 7-10).
  - Added a `test_rerank_models_mteb` parameterized test (lines 13-16).
- tests/models/language/pooling/test_jina.py
  - Imported `RerankModelInfo` and `mteb_test_rerank_models` (lines 9, 12).
  - Removed the old scoring test functions (`test_llm_1_to_1`, `test_llm_1_to_N`) and related constants (`SCORING_MODELS`, `TEXTS_1`, `TEXTS_2`) (lines 13-30, 39-78 in the original).
  - Defined a `RERANK_MODELS` list containing Jina reranker models (lines 20-26).
  - Added a `test_rerank_models_mteb` parameterized test using the new rerank models and helper function (lines 57-60).
- tests/models/utils.py
  - Added a `RerankModelInfo` NamedTuple to define rerank models for testing (lines 341-345).
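To make the `ScoreClientMtebEncoder`/`RerankClientMtebEncoder` entries above concrete, below is a minimal sketch of a client-side encoder that exposes the `predict` interface MTEB's reranking evaluation expects while forwarding each query-passage pair to the `/score` endpoint. The request payload, the response parsing, and the pair format are assumptions, not the PR's exact implementation.

```python
# Minimal sketch of a /score-backed MTEB encoder; not the PR's exact code.
# The payload shape and the resp["data"][0]["score"] parsing are assumptions.
import numpy as np
import requests


class ScoreClientMtebEncoder:
    def __init__(self, model_name: str, url: str = "http://localhost:8000/score"):
        self.model_name = model_name
        self.url = url  # assumed default for a local `vllm serve` instance

    def predict(self, sentences, *args, **kwargs) -> np.ndarray:
        # MTEB's reranking evaluation passes (query, passage, ...) tuples and
        # expects one relevance score per pair.
        scores = []
        for query, passage, *_ in sentences:
            resp = requests.post(self.url, json={
                "model": self.model_name,
                "text_1": query,
                "text_2": passage,
            }).json()
            scores.append(resp["data"][0]["score"])
        return np.array(scores)
```

A `RerankClientMtebEncoder` along the same lines would POST to `/rerank` with `query`/`documents` fields instead.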
Code Review
This pull request introduces MTEB (Massive Text Embedding Benchmark) testing for rerank models, a valuable addition for ensuring the quality and correctness of reranking functionality. The changes are generally well-structured, with new test files for OpenAI endpoint correctness and updates to utility functions for MTEB integration.
Summary of Findings
- Hardcoded Benchmark Scores: The tests in `test_mteb_embed.py` and `test_mteb_score.py` now rely on hardcoded `MAIN_SCORE` values. This can improve CI stability and speed but removes dynamic checks against reference implementations. It's important to have a clear strategy for updating these scores if the underlying models or libraries change.
- Clarity of the `truncate_prompt_tokens` Parameter: The parameter `truncate_prompt_tokens=-1` is used in several places related to scoring and reranking. Its specific meaning within vLLM (e.g., no truncation, default behavior) could be clarified with a comment to aid understanding (a hedged request example follows).
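For reference, the parameter in question is passed per scoring request; something like the example below (field names assumed from vLLM's OpenAI-compatible score API, not taken from this PR) is the kind of call where an explanatory comment about the -1 sentinel would help.

```python
# Hedged example of a /score request carrying truncate_prompt_tokens; the exact
# semantics of the -1 sentinel is what the review asks to document in a comment.
import requests

resp = requests.post("http://localhost:8000/score", json={  # assumed local server
    "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "text_1": "What are the health effects of sugar?",
    "text_2": "Sugar intake is associated with higher blood glucose.",
    "truncate_prompt_tokens": -1,
})
print(resp.json())
```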
Merge Readiness
The PR is in good shape and adds valuable test coverage for reranking models. Addressing the comments regarding hardcoded scores and the `truncate_prompt_tokens` parameter would enhance clarity and long-term maintainability. I am setting the status to `REQUEST_CHANGES` to encourage discussion on these points. I am not authorized to approve pull requests; please ensure further review and approval before merging.
LGTM as long as tests pass
PTAL at the failing test.
Can this PR be merged? I want to use it in #19260.
PTAL at the failing OpenAI API correctness test.
Head branch was pushed to by a user without write access
One ask wrt formatting, but the spec decode failure can be ignored here.
so set the threshold to MTEB_RERANK_TOL = 1e-3
Please do a final review and see what still needs to be improved.
There may be small precision differences when different hardware is used; that's normal. Maybe we need to set the dtype to float32 for other poolers as well...
I have tested it, and there is no significant difference for other poolers. I will continue to pay attention to this.
Can this PR be merged? I hope to use this test in #19675.
Thanks for reviewing.
OpenAI API correctness was broken by #18957; this PR is unrelated.
Summary
Task selection
This PR introduces MTEB (Massive Text Embedding Benchmark) testing specifically for rerank models within the CI pipeline.
The goal is to ensure the correctness of rerank models supported by the project by evaluating them against a standard benchmark task.
The chosen task is NFCorpus (English part) using a two-stage approach: BM25 for initial retrieval followed by the rerank model on the top 10 query-corpus pairs, which aligns with common reranking use cases.
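As a standalone illustration of this two-stage setup (independent of the MTEB harness used in the tests), the sketch below retrieves candidates with BM25 and then rescores the top hits with a cross-encoder. The rank_bm25 and sentence-transformers usage here is only for illustration and is not the code added by this PR.

```python
# Illustration of the two-stage rerank evaluation idea; not the PR's code.
# Assumes the rank_bm25 and sentence-transformers packages are installed.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "Sugar intake is associated with higher blood glucose.",
    "Exercise improves cardiovascular health.",
    "Vitamin C is found in citrus fruits.",
]
query = "What are the health effects of sugar?"

# Stage 1: BM25 retrieval over a whitespace-tokenized corpus.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])[:10]

# Stage 2: rerank the top-k query-document pairs with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, corpus[i]) for i in top_k])
ranked = [corpus[i] for _, i in sorted(zip(rerank_scores, top_k), reverse=True)]
print(ranked[0])
```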
Results
Here are some model test results for reference.
Threshold
MTEB_RERANK_TOL = 1e-3

CI
Model | Backend | dtype | Main score |
---|---|---|---|
BAAI/bge-reranker-base | vllm | fp32 | 0.32398 |
BAAI/bge-reranker-base | vllm | fp16 | 0.32399 |
cross-encoder/ms-marco-MiniLM-L-6-v2 | vllm | fp16 | 0.33457 |

local
Model | Backend | dtype | Main score |
---|---|---|---|
BAAI/bge-reranker-base | st | fp32 | 0.32379 |
BAAI/bge-reranker-base | vllm | fp32 | 0.32379 |
BAAI/bge-reranker-base | vllm | fp16 | 0.32378 |
cross-encoder/ms-marco-MiniLM-L-6-v2 | st | fp32 | 0.33437 |
cross-encoder/ms-marco-MiniLM-L-6-v2 | vllm | fp32 | 0.33437 |
cross-encoder/ms-marco-MiniLM-L-6-v2 | vllm | fp16 | 0.33437 |
I don't know why, but the differences between the local and CI machines are greater than those between fp16 and fp32.
Is there a hidden bug, or is it just bad luck?
So the threshold is set to MTEB_RERANK_TOL = 1e-3.
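Concretely, the threshold feeds a tolerance comparison roughly like the sketch below, using the fp32 sentence-transformers score as the reference; the actual assertion in mteb_utils.py may be written differently.

```python
# Sketch of how the tolerance could be applied; the PR's actual assertion may differ.
import pytest

MTEB_RERANK_TOL = 1e-3

st_main_score = 0.33437    # sentence-transformers reference (local, fp32)
vllm_main_score = 0.33457  # vLLM result observed on the CI machine

assert vllm_main_score == pytest.approx(st_main_score, abs=MTEB_RERANK_TOL)
```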