
Conversation

noooop
Contributor

@noooop noooop commented Jun 9, 2025

Summary

  • Add mteb testing for rerank models
  • Use float32 for torch.cumsum in MeanPool
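
For context on the second bullet: vLLM's mean pooling works over packed (flattened) token hidden states using prefix sums, so accumulating in the model's fp16/bf16 dtype loses precision as the running total grows. A minimal sketch of the idea follows; the function shape is illustrative, not vLLM's exact code:

```python
import torch

def mean_pool(hidden_states: torch.Tensor,
              prompt_lens: torch.Tensor) -> list[torch.Tensor]:
    """Mean-pool each sequence out of a packed [total_tokens, hidden] tensor."""
    # Do the prefix sum in float32 even when hidden_states is fp16/bf16;
    # low-precision running totals are what degraded the MTEB scores.
    cumsum = torch.cumsum(hidden_states.float(), dim=0)
    offsets = torch.cumsum(prompt_lens, dim=0)
    pooled, start = [], 0
    for end, n in zip(offsets.tolist(), prompt_lens.tolist()):
        seq_sum = cumsum[end - 1] - (cumsum[start - 1] if start > 0 else 0)
        pooled.append((seq_sum / n).to(hidden_states.dtype))
        start = end
    return pooled
```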

Task selection

This PR introduces MTEB (Massive Text Embedding Benchmark) testing specifically for rerank models within the CI pipeline.

The goal is to ensure the correctness of rerank models supported by the project by evaluating them against a standard benchmark task.

The chosen task is NFCorpus (English part) using a two-stage approach: BM25 for initial retrieval followed by the rerank model on the top 10 query-corpus pairs, which aligns with common reranking use cases.
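
For readers unfamiliar with the setup, the two-stage flow can be sketched with the mteb library roughly as follows. This is not the exact run_mteb_rerank helper from the PR; the "bm25s" model registration, the predictions path, and the result access are assumptions based on mteb's documented reranking workflow:

```python
import mteb
from sentence_transformers import CrossEncoder

tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)

# Stage 1: BM25 retrieval over the full corpus, saving ranked predictions.
bm25 = mteb.get_model("bm25s")  # assumes the bm25s package is installed
evaluation.run(bm25, save_predictions=True, output_folder="results/stage1")

# Stage 2: rerank only the top 10 query-corpus pairs from stage 1.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
results = evaluation.run(
    reranker,
    top_k=10,
    previous_results="results/stage1/NFCorpus_default_predictions.json",
    output_folder="results/stage2",
)
print(results[0].scores["test"][0]["main_score"])  # NDCG@10 on NFCorpus
```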

Results

Here are some model test results for reference (st = sentence-transformers float32 baseline; the diff columns show vLLM at each dtype minus that baseline).

| model_name | float32 (st) | float16 diff | bfloat16 diff | float32 diff |
| --- | --- | --- | --- | --- |
| bm25s retriever only | 0.32102 | - | - | - |
| random reranker | 0.25306 | - | - | - |
| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 0.3288 | 4.0e-05 | 1.7e-04 | 0.0 |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.33437 | 0.0 | 1.1e-04 | 0.0 |
| tomaarsen/Qwen3-Reranker-0.6B-seq-cls (wo/ template) | 0.25782 | 4.0e-05 | -7.5e-04 | 0.0 |
| tomaarsen/Qwen3-Reranker-0.6B-seq-cls (w/ template) | 0.33699 | 4.0e-04 | -1.26e-03 | 0.0 |
| jinaai/jina-reranker-v2-base-multilingual | 0.33623 | 1.14e-03 | 1.01e-03 | 0.0 |
| BAAI/bge-reranker-base | 0.32379 | -1.0e-05 | 0.0 | 0.0 |
| BAAI/bge-reranker-large | 0.33321 | -7.0e-05 | -3.1e-04 | 0.0 |
| BAAI/bge-reranker-v2-m3 | 0.32803 | 7.0e-05 | -4.6e-04 | 0.0 |

Threshold

MTEB_RERANK_TOL = 1e-3

CI
BAAI/bge-reranker-base vllm fp32 0.32398
BAAI/bge-reranker-base vllm fp16 0.32399
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp16 0.33457

local
BAAI/bge-reranker-base st fp32 0.32379
BAAI/bge-reranker-base vllm fp32 0.32379
BAAI/bge-reranker-base vllm fp16 0.32378
cross-encoder/ms-marco-MiniLM-L-6-v2 st fp32 0.33437
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp32 0.33437
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp16 0.33437

In [7]: 0.32399 == pytest.approx(0.32379, abs=1e-3)
Out[7]: True

In [8]: 0.32399 == pytest.approx(0.32379, abs=1e-4)
Out[8]: False

In [9]: 0.33457 == pytest.approx(0.33437, abs=1e-3)
Out[9]: True

In [10]: 0.33457 == pytest.approx(0.33437, abs=1e-4)
Out[10]: False

I don't know why, but the score differences for the same model between the local and CI machines are greater than those between fp16 and fp32.

Is there a hidden bug, or is it just bad luck?

So the threshold is set to MTEB_RERANK_TOL = 1e-3.
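
For reference, the tolerance check in the test helper presumably reduces to something like this (function and variable names here are illustrative):

```python
import pytest

MTEB_RERANK_TOL = 1e-3

def check_main_score(st_main_score: float, vllm_main_score: float) -> None:
    # Absolute tolerance: the vLLM score must land within 1e-3 of the
    # sentence-transformers reference, which absorbs the CI-vs-local drift
    # shown above while still catching real regressions.
    assert vllm_main_score == pytest.approx(st_main_score, abs=MTEB_RERANK_TOL)
```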


github-actions bot commented Jun 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Hello @noooop, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary for this pull request. This PR introduces MTEB (Massive Text Embedding Benchmark) testing specifically for rerank models within the CI pipeline. The goal is to ensure the correctness and performance of rerank models supported by the project by evaluating them against a standard benchmark task. The chosen task is NFCorpus (English part) using a two-stage approach: BM25 for initial retrieval followed by the rerank model on the top 10 query-corpus pairs, which aligns with common reranking use cases.

The PR adds the necessary infrastructure to run these tests, including new MTEB encoder implementations for the OpenAI API endpoints and helper functions to manage the MTEB evaluation process for reranking.

Highlights

  • Add MTEB Rerank Testing: Introduces MTEB testing for rerank models to the CI pipeline to validate their performance and correctness.
  • NFCorpus Rerank Task: Uses the English subset of the NFCorpus dataset with a BM25 retriever and reranking on the top 10 results as the benchmark task.
  • OpenAI API Rerank/Score Client: Adds new MTEB encoder classes (ScoreClientMtebEncoder, RerankClientMtebEncoder) to evaluate models served via the OpenAI compatible API's /score and /rerank endpoints (a usage sketch follows this list).
  • Rerank Testing Infrastructure: Adds helper functions (run_mteb_rerank, mteb_test_rerank_models) and a RerankModelInfo class to streamline the process of defining and running MTEB rerank tests for various models.
  • Specific Model Tests Added: Adds MTEB rerank tests for BAAI rerankers, a cross-encoder model, and a Jina reranker.
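
The two client encoders wrap HTTP calls to a running server, roughly as sketched below against vLLM's OpenAI-compatible /score and /rerank endpoints (the request and response field names follow vLLM's documented schemas, but treat the exact shapes here as assumptions):

```python
import requests

BASE = "http://localhost:8000"
MODEL = "BAAI/bge-reranker-base"

# /score: one query scored against a batch of documents.
score = requests.post(f"{BASE}/score", json={
    "model": MODEL,
    "text_1": "What is the capital of France?",
    "text_2": ["Paris is the capital of France.", "The sky is blue."],
}).json()
print([item["score"] for item in score["data"]])

# /rerank: Jina/Cohere-style API returning documents sorted by relevance.
rerank = requests.post(f"{BASE}/rerank", json={
    "model": MODEL,
    "query": "What is the capital of France?",
    "documents": ["Paris is the capital of France.", "The sky is blue."],
}).json()
print([(r["index"], r["relevance_score"]) for r in rerank["results"]])
```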

Changelog

  • tests/conftest.py
    • Modified the score method of OpenAIClientMtebEncoder to accept arbitrary *args and **kwargs (lines 1036-1038).
  • tests/entrypoints/openai/correctness/test_mteb_embed.py
    • Removed import and usage of run_mteb_embed_task_st (lines 10, 36).
    • Updated MODEL_NAME from BAAI/bge-m3 to intfloat/e5-small (line 15).
    • Updated MAIN_SCORE from 0.7873 to 0.7423 (line 16).
    • Removed dtype and max-model-len from server arguments and added --disable-uvicorn-access-log (lines 22-23).
    • Renamed test_mteb to test_mteb_embed (line 29).
  • tests/entrypoints/openai/correctness/test_mteb_score.py
    • New file added to include MTEB rerank tests for models served via the OpenAI API.
    • Defines MODEL_NAME (cross-encoder/ms-marco-MiniLM-L-6-v2) and MAIN_SCORE (0.33702) for the test (lines 17-18).
    • Adds test_mteb_score to test the /score endpoint using MTEB (lines 31-42).
    • Adds test_mteb_rerank to test the /rerank endpoint using MTEB (lines 45-56).
  • tests/models/language/pooling/mteb_utils.py
    • Imported shutil, Optional, requests, HfRunner, VllmRunner, and RerankModelInfo (lines 4-14).
    • Added constants MTEB_RERANK_TASKS, MTEB_RERANK_LANGS, and MTEB_RERANK_TOL for reranking (lines 24-26).
    • Added a predict method to VllmMtebEncoder to conform to the MTEB Encoder interface for reranking (lines 51-70).
    • Added ScoreClientMtebEncoder class for MTEB evaluation using the OpenAI /score endpoint (lines 95-129).
    • Added RerankClientMtebEncoder class for MTEB evaluation using the OpenAI /rerank endpoint (lines 132-142).
    • Modified run_mteb_embed_task to disable the progress bar during encoding (lines 152-154).
    • Removed run_mteb_embed_task_st function (lines 74-77 in original).
    • Added run_mteb_rerank function to perform the two-stage MTEB reranking evaluation (lines 204-240).
    • Added mteb_test_rerank_models helper function to run MTEB rerank tests for given model info (lines 243-299).
  • tests/models/language/pooling/test_baai.py
    • Imported RerankModelInfo and mteb_test_rerank_models (lines 5, 7).
    • Defined RERANK_MODELS list containing BAAI reranker models for testing (lines 61-73).
    • Added test_rerank_models_mteb parameterized test using the new rerank models and helper function (lines 90-93).
  • tests/models/language/pooling/test_cross_encoder.py
    • New file added to include MTEB rerank tests for cross-encoder models.
    • Imported RerankModelInfo and mteb_test_rerank_models (line 5).
    • Defined RERANK_MODELS list containing cross-encoder models (lines 7-10).
    • Added test_rerank_models_mteb parameterized test (lines 13-16).
  • tests/models/language/pooling/test_jina.py
    • Imported RerankModelInfo and mteb_test_rerank_models (lines 9, 12).
    • Removed old scoring test functions (test_llm_1_to_1, test_llm_1_to_N) and related constants (SCORING_MODELS, TEXTS_1, TEXTS_2) (lines 13-30, 39-78 in original).
    • Defined RERANK_MODELS list containing Jina reranker models (lines 20-26).
    • Added test_rerank_models_mteb parameterized test using the new rerank models and helper function (lines 57-60).
  • tests/models/utils.py
    • Added RerankModelInfo NamedTuple to define rerank models for testing (lines 341-345).
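
For orientation, RerankModelInfo plausibly looks something like the following; only the name field is certain from this changelog, and the remaining fields are guesses from how the tests use it:

```python
from typing import NamedTuple

class RerankModelInfo(NamedTuple):
    # Hypothetical shape: only `name` is confirmed by the changelog above.
    name: str                 # HF model id, e.g. "BAAI/bge-reranker-base"
    architecture: str = ""    # expected vLLM architecture name (assumed)
    enable_test: bool = True  # allow skipping heavyweight models (assumed)
```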
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., commenting on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces MTEB (Massive Text Embedding Benchmark) testing for rerank models, a valuable addition for ensuring the quality and correctness of reranking functionality. The changes are generally well-structured, with new test files for OpenAI endpoint correctness and updates to utility functions for MTEB integration.

Summary of Findings

  • Hardcoded Benchmark Scores: The tests in test_mteb_embed.py and test_mteb_score.py now rely on hardcoded MAIN_SCORE values. This can improve CI stability and speed but removes dynamic checks against reference implementations. It's important to have a clear strategy for updating these scores if underlying models or libraries change.
  • Clarity of truncate_prompt_tokens Parameter: The parameter truncate_prompt_tokens=-1 is used in several places related to scoring and reranking. Its specific meaning within vLLM (e.g., no truncation, default behavior) could be clarified with a comment to aid understanding.
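
For context on the second finding: in vLLM's request schemas, truncate_prompt_tokens=-1 is generally understood to mean "truncate the input to the model's max_model_len" rather than "do not truncate". A hedged illustration against the /score endpoint (whether the field is accepted there exactly as shown is an assumption):

```python
import requests

resp = requests.post("http://localhost:8000/score", json={
    "model": "BAAI/bge-reranker-base",
    "text_1": "a very long query ...",
    "text_2": ["a very long document ..."],
    # -1 clamps the prompt to max_model_len instead of rejecting it.
    "truncate_prompt_tokens": -1,
})
print(resp.json())
```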

Merge Readiness

The PR is in good shape and adds valuable test coverage for reranking models. Addressing the comments regarding hardcoded scores and the truncate_prompt_tokens parameter would enhance clarity and long-term maintainability. I am setting the status to REQUEST_CHANGES to encourage discussion on these points. I am not authorized to approve pull requests; please ensure further review and approval before merging.

@mergify mergify bot added the ci/build label Jun 9, 2025
@noooop noooop marked this pull request as ready for review June 9, 2025 03:38
Member

@DarkLight1337 DarkLight1337 left a comment


LGTM as long as tests pass

@DarkLight1337
Member

PTAL at the failing test

@noooop
Contributor Author

noooop commented Jun 10, 2025

@DarkLight1337

Can this PR be merged? I want to use it in #19260.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 10, 2025 04:30
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 10, 2025
@DarkLight1337
Member

PTAL at the failing OpenAI API Correctness test

auto-merge was automatically disabled June 10, 2025 05:53

Head branch was pushed to by a user without write access

Collaborator

@aarnphm aarnphm left a comment


One ask wrt formatting, but the spec decode failure can be ignored here.

@noooop
Contributor Author

noooop commented Jun 12, 2025

@DarkLight1337

  1. The precision drop reported in "Improve the output precision of embedding models" #19092 was caused by MeanPool's torch.cumsum not using float32.

  2. I don't know why, but the score differences for the same model between the local and CI machines are greater than those between fp16 and fp32. Is there a hidden bug, or is it just bad luck?

So the threshold is set to MTEB_RERANK_TOL = 1e-3.

CI
BAAI/bge-reranker-base vllm fp32 0.32398
BAAI/bge-reranker-base vllm fp16 0.32399
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp16 0.33457

local
BAAI/bge-reranker-base st fp32 0.32379
BAAI/bge-reranker-base vllm fp32 0.32379
BAAI/bge-reranker-base vllm fp16 0.32378
cross-encoder/ms-marco-MiniLM-L-6-v2 st fp32 0.33437
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp32 0.33437
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp16 0.33437

In [7]: 0.32399 == pytest.approx(0.32379, abs=1e-3)
Out[7]: True

In [8]: 0.32399 == pytest.approx(0.32379, abs=1e-4)
Out[8]: False

In [9]: 0.33457 == pytest.approx(0.33437, abs=1e-3)
Out[9]: True

In [10]: 0.33457 == pytest.approx(0.33437, abs=1e-4)
Out[10]: False

Please do a final review and see what still needs to be improved.

@DarkLight1337
Member

There may be small precision differences when different hardware is used; that's normal. Maybe we need to set the dtype to float32 for the other poolers as well...

@noooop
Contributor Author

noooop commented Jun 12, 2025

Maybe we need to set the dtype to float32 for the other poolers as well...

I have tested this, and there is no significant difference for the other poolers. I will continue to keep an eye on this.

@noooop
Contributor Author

noooop commented Jun 16, 2025

@DarkLight1337

Can this PR be merged? I hope to use this test in #19675.

@vllm-bot vllm-bot merged commit f40f763 into vllm-project:main Jun 16, 2025
94 of 97 checks passed
@noooop
Contributor Author

noooop commented Jun 16, 2025

Thanks for reviewing

@noooop
Contributor Author

noooop commented Jun 16, 2025

cc @DarkLight1337

OpenAI API correctness was broken by #18957; this PR is unrelated.

yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Jun 22, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Jun 24, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
@noooop noooop deleted the reranker branch July 10, 2025 04:46
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025