
Conversation

noooop
Contributor

@noooop noooop commented Jun 9, 2025

Summary

  • Add mteb testing for rerank models
  • Use float32 for torch.cumsum in MeanPool
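
For context on the second bullet: vLLM's mean pooling works over packed (flattened) token hidden states using prefix sums, so accumulating in the model's fp16/bf16 dtype loses precision as the running total grows. A minimal sketch of the idea follows; the function shape is illustrative, not vLLM's exact code:

```python
import torch

def mean_pool(hidden_states: torch.Tensor,
              prompt_lens: torch.Tensor) -> list[torch.Tensor]:
    """Mean-pool each sequence out of a packed [total_tokens, hidden] tensor."""
    # Do the prefix sum in float32 even when hidden_states is fp16/bf16;
    # low-precision running totals are what degraded the MTEB scores.
    cumsum = torch.cumsum(hidden_states.float(), dim=0)
    offsets = torch.cumsum(prompt_lens, dim=0)
    pooled, start = [], 0
    for end, n in zip(offsets.tolist(), prompt_lens.tolist()):
        seq_sum = cumsum[end - 1] - (cumsum[start - 1] if start > 0 else 0)
        pooled.append((seq_sum / n).to(hidden_states.dtype))
        start = end
    return pooled
```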

Task selection

This PR introduces MTEB (Massive Text Embedding Benchmark) testing specifically for rerank models within the CI pipeline.

The goal is to ensure the correctness of rerank models supported by the project by evaluating them against a standard benchmark task.

The chosen task is NFCorpus (English part) using a two-stage approach: BM25 for initial retrieval followed by the rerank model on the top 10 query-corpus pairs, which aligns with common reranking use cases.
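
For readers unfamiliar with the setup, the two-stage flow can be sketched with the mteb library roughly as follows. This is not the exact run_mteb_rerank helper from the PR; the "bm25s" model registration, the predictions path, and the result access are assumptions based on mteb's documented reranking workflow:

```python
import mteb
from sentence_transformers import CrossEncoder

tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)

# Stage 1: BM25 retrieval over the full corpus, saving ranked predictions.
bm25 = mteb.get_model("bm25s")  # assumes the bm25s package is installed
evaluation.run(bm25, save_predictions=True, output_folder="results/stage1")

# Stage 2: rerank only the top 10 query-corpus pairs from stage 1.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
results = evaluation.run(
    reranker,
    top_k=10,
    previous_results="results/stage1/NFCorpus_default_predictions.json",
    output_folder="results/stage2",
)
print(results[0].scores["test"][0]["main_score"])  # NDCG@10 on NFCorpus
```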

Results

Here are some model test results for reference (st = sentence-transformers float32 baseline; the diff columns show vLLM at each dtype minus that baseline).

| model_name | float32 (st) | float16 diff | bfloat16 diff | float32 diff |
| --- | --- | --- | --- | --- |
| bm25s retriever only | 0.32102 | - | - | - |
| random reranker | 0.25306 | - | - | - |
| cross-encoder/ms-marco-TinyBERT-L-2-v2 | 0.3288 | 4.0e-05 | 1.7e-04 | 0.0 |
| cross-encoder/ms-marco-MiniLM-L-6-v2 | 0.33437 | 0.0 | 1.1e-04 | 0.0 |
| tomaarsen/Qwen3-Reranker-0.6B-seq-cls (wo/ template) | 0.25782 | 4.0e-05 | -7.5e-04 | 0.0 |
| tomaarsen/Qwen3-Reranker-0.6B-seq-cls (w/ template) | 0.33699 | 4.0e-04 | -1.26e-03 | 0.0 |
| jinaai/jina-reranker-v2-base-multilingual | 0.33623 | 1.14e-03 | 1.01e-03 | 0.0 |
| BAAI/bge-reranker-base | 0.32379 | -1.0e-05 | 0.0 | 0.0 |
| BAAI/bge-reranker-large | 0.33321 | -7.0e-05 | -3.1e-04 | 0.0 |
| BAAI/bge-reranker-v2-m3 | 0.32803 | 7.0e-05 | -4.6e-04 | 0.0 |

Threshold

MTEB_RERANK_TOL = 1e-3

CI
BAAI/bge-reranker-base vllm fp32 0.32398
BAAI/bge-reranker-base vllm fp16 0.32399
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp16 0.33457

local
BAAI/bge-reranker-base st fp32 0.32379
BAAI/bge-reranker-base vllm fp32 0.32379
BAAI/bge-reranker-base vllm fp16 0.32378
cross-encoder/ms-marco-MiniLM-L-6-v2 st fp32 0.33437
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp32 0.33437
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp16 0.33437

In [7]: 0.32399 == pytest.approx(0.32379, abs=1e-3)
Out[7]: True

In [8]: 0.32399 == pytest.approx(0.32379, abs=1e-4)
Out[8]: False

In [9]: 0.33457 == pytest.approx(0.33437, abs=1e-3)
Out[9]: True

In [10]: 0.33457 == pytest.approx(0.33437, abs=1e-4)
Out[10]: False

I don't know why, but the score differences for the same model between the local and CI machines are greater than those between fp16 and fp32.

Is there a hidden bug, or is it just bad luck?

So the threshold is set to MTEB_RERANK_TOL = 1e-3.
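
For reference, the tolerance check in the test helper presumably reduces to something like this (function and variable names here are illustrative):

```python
import pytest

MTEB_RERANK_TOL = 1e-3

def check_main_score(st_main_score: float, vllm_main_score: float) -> None:
    # Absolute tolerance: the vLLM score must land within 1e-3 of the
    # sentence-transformers reference, which absorbs the CI-vs-local drift
    # shown above while still catching real regressions.
    assert vllm_main_score == pytest.approx(st_main_score, abs=MTEB_RERANK_TOL)
```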


github-actions bot commented Jun 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Hello @noooop, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary for this pull request. This PR introduces MTEB (Massive Text Embedding Benchmark) testing specifically for rerank models within the CI pipeline. The goal is to ensure the correctness and performance of rerank models supported by the project by evaluating them against a standard benchmark task. The chosen task is NFCorpus (English part) using a two-stage approach: BM25 for initial retrieval followed by the rerank model on the top 10 query-corpus pairs, which aligns with common reranking use cases.

The PR adds the necessary infrastructure to run these tests, including new MTEB encoder implementations for the OpenAI API endpoints and helper functions to manage the MTEB evaluation process for reranking.

Highlights

  • Add MTEB Rerank Testing: Introduces MTEB testing for rerank models to the CI pipeline to validate their performance and correctness.
  • NFCorpus Rerank Task: Uses the English subset of the NFCorpus dataset with a BM25 retriever and reranking on the top 10 results as the benchmark task.
  • OpenAI API Rerank/Score Client: Adds new MTEB encoder classes (ScoreClientMtebEncoder, RerankClientMtebEncoder) to evaluate models served via the OpenAI compatible API's /score and /rerank endpoints (a usage sketch follows this list).
  • Rerank Testing Infrastructure: Adds helper functions (run_mteb_rerank, mteb_test_rerank_models) and a RerankModelInfo class to streamline the process of defining and running MTEB rerank tests for various models.
  • Specific Model Tests Added: Adds MTEB rerank tests for BAAI rerankers, a cross-encoder model, and a Jina reranker.
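
The two client encoders wrap HTTP calls to a running server, roughly as sketched below against vLLM's OpenAI-compatible /score and /rerank endpoints (the request and response field names follow vLLM's documented schemas, but treat the exact shapes here as assumptions):

```python
import requests

BASE = "http://localhost:8000"
MODEL = "BAAI/bge-reranker-base"

# /score: one query scored against a batch of documents.
score = requests.post(f"{BASE}/score", json={
    "model": MODEL,
    "text_1": "What is the capital of France?",
    "text_2": ["Paris is the capital of France.", "The sky is blue."],
}).json()
print([item["score"] for item in score["data"]])

# /rerank: Jina/Cohere-style API returning documents sorted by relevance.
rerank = requests.post(f"{BASE}/rerank", json={
    "model": MODEL,
    "query": "What is the capital of France?",
    "documents": ["Paris is the capital of France.", "The sky is blue."],
}).json()
print([(r["index"], r["relevance_score"]) for r in rerank["results"]])
```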

Changelog

  • tests/conftest.py
    • Modified the score method of OpenAIClientMtebEncoder to accept arbitrary *args and **kwargs (lines 1036-1038).
  • tests/entrypoints/openai/correctness/test_mteb_embed.py
    • Removed import and usage of run_mteb_embed_task_st (lines 10, 36).
    • Updated MODEL_NAME from BAAI/bge-m3 to intfloat/e5-small (line 15).
    • Updated MAIN_SCORE from 0.7873 to 0.7423 (line 16).
    • Removed dtype and max-model-len from server arguments and added --disable-uvicorn-access-log (lines 22-23).
    • Renamed test_mteb to test_mteb_embed (line 29).
  • tests/entrypoints/openai/correctness/test_mteb_score.py
    • New file added to include MTEB rerank tests for models served via the OpenAI API.
    • Defines MODEL_NAME (cross-encoder/ms-marco-MiniLM-L-6-v2) and MAIN_SCORE (0.33702) for the test (lines 17-18).
    • Adds test_mteb_score to test the /score endpoint using MTEB (lines 31-42).
    • Adds test_mteb_rerank to test the /rerank endpoint using MTEB (lines 45-56).
  • tests/models/language/pooling/mteb_utils.py
    • Imported shutil, Optional, requests, HfRunner, VllmRunner, and RerankModelInfo (lines 4-14).
    • Added constants MTEB_RERANK_TASKS, MTEB_RERANK_LANGS, and MTEB_RERANK_TOL for reranking (lines 24-26).
    • Added a predict method to VllmMtebEncoder to conform to the MTEB Encoder interface for reranking (lines 51-70).
    • Added ScoreClientMtebEncoder class for MTEB evaluation using the OpenAI /score endpoint (lines 95-129).
    • Added RerankClientMtebEncoder class for MTEB evaluation using the OpenAI /rerank endpoint (lines 132-142).
    • Modified run_mteb_embed_task to disable the progress bar during encoding (lines 152-154).
    • Removed run_mteb_embed_task_st function (lines 74-77 in original).
    • Added run_mteb_rerank function to perform the two-stage MTEB reranking evaluation (lines 204-240).
    • Added mteb_test_rerank_models helper function to run MTEB rerank tests for given model info (lines 243-299).
  • tests/models/language/pooling/test_baai.py
    • Imported RerankModelInfo and mteb_test_rerank_models (lines 5, 7).
    • Defined RERANK_MODELS list containing BAAI reranker models for testing (lines 61-73).
    • Added test_rerank_models_mteb parameterized test using the new rerank models and helper function (lines 90-93).
  • tests/models/language/pooling/test_cross_encoder.py
    • New file added to include MTEB rerank tests for cross-encoder models.
    • Imported RerankModelInfo and mteb_test_rerank_models (line 5).
    • Defined RERANK_MODELS list containing cross-encoder models (lines 7-10).
    • Added test_rerank_models_mteb parameterized test (lines 13-16).
  • tests/models/language/pooling/test_jina.py
    • Imported RerankModelInfo and mteb_test_rerank_models (lines 9, 12).
    • Removed old scoring test functions (test_llm_1_to_1, test_llm_1_to_N) and related constants (SCORING_MODELS, TEXTS_1, TEXTS_2) (lines 13-30, 39-78 in original).
    • Defined RERANK_MODELS list containing Jina reranker models (lines 20-26).
    • Added test_rerank_models_mteb parameterized test using the new rerank models and helper function (lines 57-60).
  • tests/models/utils.py
    • Added RerankModelInfo NamedTuple to define rerank models for testing (lines 341-345).
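
For orientation, RerankModelInfo plausibly looks something like the following; only the name field is certain from this changelog, and the remaining fields are guesses from how the tests use it:

```python
from typing import NamedTuple

class RerankModelInfo(NamedTuple):
    # Hypothetical shape: only `name` is confirmed by the changelog above.
    name: str                 # HF model id, e.g. "BAAI/bge-reranker-base"
    architecture: str = ""    # expected vLLM architecture name (assumed)
    enable_test: bool = True  # allow skipping heavyweight models (assumed)
```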
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., commenting on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces MTEB (Massive Text Embedding Benchmark) testing for rerank models, a valuable addition for ensuring the quality and correctness of reranking functionality. The changes are generally well-structured, with new test files for OpenAI endpoint correctness and updates to utility functions for MTEB integration.

Summary of Findings

  • Hardcoded Benchmark Scores: The tests in test_mteb_embed.py and test_mteb_score.py now rely on hardcoded MAIN_SCORE values. This can improve CI stability and speed but removes dynamic checks against reference implementations. It's important to have a clear strategy for updating these scores if underlying models or libraries change.
  • Clarity of truncate_prompt_tokens Parameter: The parameter truncate_prompt_tokens=-1 is used in several places related to scoring and reranking. Its specific meaning within vLLM (e.g., no truncation, default behavior) could be clarified with a comment to aid understanding.
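
For context on the second finding: in vLLM's request schemas, truncate_prompt_tokens=-1 is generally understood to mean "truncate the input to the model's max_model_len" rather than "do not truncate". A hedged illustration against the /score endpoint (whether the field is accepted there exactly as shown is an assumption):

```python
import requests

resp = requests.post("http://localhost:8000/score", json={
    "model": "BAAI/bge-reranker-base",
    "text_1": "a very long query ...",
    "text_2": ["a very long document ..."],
    # -1 clamps the prompt to max_model_len instead of rejecting it.
    "truncate_prompt_tokens": -1,
})
print(resp.json())
```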

Merge Readiness

The PR is in good shape and adds valuable test coverage for reranking models. Addressing the comments regarding hardcoded scores and the truncate_prompt_tokens parameter would enhance clarity and long-term maintainability. I am setting the status to REQUEST_CHANGES to encourage discussion on these points. I am not authorized to approve pull requests; please ensure further review and approval before merging.

@mergify mergify bot added the ci/build label Jun 9, 2025
@noooop noooop marked this pull request as ready for review June 9, 2025 03:38
Member

@DarkLight1337 DarkLight1337 left a comment


LGTM as long as tests pass

@DarkLight1337
Member

PTAL at the failing test

@noooop
Contributor Author

noooop commented Jun 10, 2025

@DarkLight1337

Can this PR be merged? I want to use it in #19260.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 10, 2025 04:30
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 10, 2025
@DarkLight1337
Member

PTAL at the failing OpenAI API Correctness test

auto-merge was automatically disabled June 10, 2025 05:53

Head branch was pushed to by a user without write access

Collaborator

@aarnphm aarnphm left a comment


One ask wrt formatting, but the spec decode failure can be ignored here.

@noooop
Contributor Author

noooop commented Jun 12, 2025

@DarkLight1337

  1. The precision drop reported in "Improve the output precision of embedding models" #19092 was caused by MeanPool's torch.cumsum not using float32.

  2. I don't know why, but the score differences for the same model between the local and CI machines are greater than those between fp16 and fp32. Is there a hidden bug, or is it just bad luck?

So the threshold is set to MTEB_RERANK_TOL = 1e-3.

CI
BAAI/bge-reranker-base vllm fp32 0.32398
BAAI/bge-reranker-base vllm fp16 0.32399
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp16 0.33457

local
BAAI/bge-reranker-base st fp32 0.32379
BAAI/bge-reranker-base vllm fp32 0.32379
BAAI/bge-reranker-base vllm fp16 0.32378
cross-encoder/ms-marco-MiniLM-L-6-v2 st fp32 0.33437
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp32 0.33437
cross-encoder/ms-marco-MiniLM-L-6-v2 vllm fp16 0.33437

In [7]: 0.32399 == pytest.approx(0.32379, abs=1e-3)
Out[7]: True

In [8]: 0.32399 == pytest.approx(0.32379, abs=1e-4)
Out[8]: False

In [9]: 0.33457 == pytest.approx(0.33437, abs=1e-3)
Out[9]: True

In [10]: 0.33457 == pytest.approx(0.33437, abs=1e-4)
Out[10]: False

Please do a final review and see what still needs to be improved.

@DarkLight1337
Member

There may be small precision differences when different hardware is used; that's normal. Maybe we need to set the dtype to float32 for the other poolers as well...

@noooop
Contributor Author

noooop commented Jun 12, 2025

Maybe we need to set the dtype to float32 for the other poolers as well...

I have tested this, and there is no significant difference for the other poolers. I will continue to keep an eye on this.

@noooop
Contributor Author

noooop commented Jun 16, 2025

@DarkLight1337

Can this PR be merged? I hope to use this test in #19675.

@vllm-bot vllm-bot merged commit f40f763 into vllm-project:main Jun 16, 2025
94 of 97 checks passed
@noooop
Contributor Author

noooop commented Jun 16, 2025

Thanks for reviewing

@noooop
Contributor Author

noooop commented Jun 16, 2025

cc @DarkLight1337

OpenAI API correctness was broken by #18957; this PR is unrelated.

yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Jun 22, 2025
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Jun 24, 2025
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
@noooop noooop deleted the reranker branch July 10, 2025 04:46
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025