
Same reranker model, same input: results differ greatly between xinference and native Transformers, and xinference returns the wrong result #3104


Description

@zhangever

System Info

CUDA Version: 12.4
Ubuntu 22.04.3 LTS

Running Xinference with Docker?

  • docker
    pip install
    installation from source

Version info

xinference: v1.3.1

transformers: 4.40.1

The command used to start Xinference

docker run -d --name xinference --restart=always \
-e HF_ENDPOINT=https://hf-mirror.com \
-e HUGGING_FACE_HUB_TOKEN=hf_xx \
-e LOG_TZ=Asia/Shanghai \
-e TZ=Asia/Shanghai \
-v /root/.xinference:/root/.xinference \
-v /root/.cache/huggingface:/root/.cache/huggingface \
-v /root/.cache/modelscope:/root/.cache/modelscope \
-v /data2/models:/data2/models \
-p 9997:9997 \
-p 8777:8777 \
--gpus all \
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v1.3.1 \
xinference-local -H 0.0.0.0 --port 9997 -mp 8777

Reproduction

When asked "中国的首都是哪里?" ("Where is the capital of China?"), xinference gives Shanghai the higher score, which is not expected, while Transformers ranks Beijing higher, as expected.

  1. Launch MiniCPM-Reranker-Light within xinference
  2. curl to the reranker model:
curl -X 'POST' 'http://xxx:9997/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "MiniCPM-Reranker-Light",
    "query": "中国的首都是哪里?",
    "documents": [
        "beijing",
        "shanghai"
    ]
  }'

Output:

{"id":"4e605156-0634-11f0-a430-0242c0a80102","results":[{"index":1,"relevance_score":0.021355781704187393,"document":null},{"index":0,"relevance_score":0.011472251266241074,"document":null}],"meta":{"api_version":null,"billed_units":null,"tokens":null,"warnings":null}}
  3. Run Python code with Transformers:
from transformers import AutoModelForSequenceClassification
import torch

model_name = "openbmb/MiniCPM-Reranker-Light"
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
# You can also use the following code to use flash_attention_2
# model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True,attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()

query = "中国的首都是哪里?"  # "Where is the capital of China?"
passages = ["beijing", "shanghai"]  # Beijing, Shanghai

# Score passages with the model's rerank helper.
rerank_score = model.rerank(query, passages, query_instruction="Query:", batch_size=32, max_length=1024)
print(rerank_score)  # [0.01791382 0.00024533]

# Equivalent: score explicit (query, document) pairs with compute_score.
sentence_pairs = [[f"Query: {query}", doc] for doc in passages]
scores = model.compute_score(sentence_pairs, batch_size=32, max_length=1024)
print(scores)  # [0.01791382 0.00024533]

Output:

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Computing scores: 100%|██████████| 1/1 [00:00<00:00,  4.09it/s]
[0.01785278 0.00024915]
Computing scores: 100%|██████████| 1/1 [00:00<00:00, 17.27it/s]
[0.01785278 0.00024915]

Expected behavior

Beijing is expected to receive the higher score.

Activity

added this to the v1.x milestone on Mar 21, 2025
qinxuye

qinxuye commented on Mar 22, 2025

@qinxuye
Contributor

The engine behind Xinference here is sentence-transformers. This needs to be reproduced.
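
For reference, the same pairs can be scored directly with sentence-transformers' CrossEncoder to see whether the discrepancy already appears at the engine level. This is a minimal sketch, assuming a recent sentence-transformers release whose CrossEncoder accepts trust_remote_code; it only approximates Xinference's internal code path.

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "openbmb/MiniCPM-Reranker-Light",
    trust_remote_code=True,
    max_length=1024,
)

query = "中国的首都是哪里?"  # "Where is the capital of China?"
passages = ["beijing", "shanghai"]

# Without the model's "Query: " instruction, mirroring what the /v1/rerank call sends.
print(model.predict([(query, doc) for doc in passages]))

# With the instruction prepended, mirroring the model card's usage.
print(model.predict([(f"Query: {query}", doc) for doc in passages]))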

zhangever

zhangever commented on Mar 25, 2025

@zhangever
Author

(image attached)

These two scores are consistent.

qinxuye

qinxuye commented on Mar 25, 2025

@qinxuye
Contributor

I see the tokenizer setting padding_side and so on; I'm not sure whether it's exactly the same.
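
To illustrate the concern, here is a hypothetical check of how padding_side changes the padded layout of a batch; this is not Xinference's actual code path, just a way to inspect the difference.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openbmb/MiniCPM-Reranker-Light", trust_remote_code=True)
texts = [f"Query: 中国的首都是哪里?{doc}" for doc in ("beijing", "shanghai")]

for side in ("left", "right"):
    tok.padding_side = side
    batch = tok(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    # The shorter sequence is padded on a different side, so the token found at a
    # fixed position (e.g. the last one) can change between the two settings.
    print(side, batch["input_ids"][0].tolist()[:10])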

zhangever

zhangever commented on Mar 25, 2025

@zhangever
Author

> I see the tokenizer setting padding_side and so on; I'm not sure whether it's exactly the same.

Right now, the reranker model served through xinference gives results that are even worse than no reranking at all.

github-actions

github-actions commented on Apr 1, 2025

@github-actions

This issue is stale because it has been open for 7 days with no activity.

zhangever

zhangever commented on Apr 7, 2025

@zhangever
Author

Could you please help look into this, @qinxuye?

qinxuye

qinxuye commented on Apr 7, 2025

@qinxuye
Contributor

I'll look into it this week.

github-actions

github-actions commented on Apr 14, 2025

@github-actions

This issue is stale because it has been open for 7 days with no activity.

zhangever

zhangever commented on Apr 20, 2025

@zhangever
Author

Still following this.

github-actions

github-actions commented on Apr 27, 2025

@github-actions

This issue is stale because it has been open for 7 days with no activity.

github-actions

github-actions commented on May 2, 2025

@github-actions

This issue was closed because it has been inactive for 5 days since being marked as stale.

reopened this on May 3, 2025
github-actions

github-actions commented on May 11, 2025

@github-actions

This issue is stale because it has been open for 7 days with no activity.

qinxuye

qinxuye commented on May 12, 2025

@qinxuye
Contributor

@llyycchhee please track this issue.

llyycchhee

llyycchhee commented on May 14, 2025

@llyycchhee
Collaborator

This is because the MiniCPM-Reranker-Light model requires the instruction "Query: " to be prepended to each query. In the xinference curl request, users can change the query parameter to "query": "Query: 中国的首都是哪里?" as the model requires. A note about this will be added to the documentation later.
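
For example, the same request from the report with the instruction prepended, sketched here with Python's requests (the host placeholder is kept from the original curl):

import requests

resp = requests.post(
    "http://xxx:9997/v1/rerank",
    json={
        "model": "MiniCPM-Reranker-Light",
        "query": "Query: 中国的首都是哪里?",  # instruction prepended to the query
        "documents": ["beijing", "shanghai"],
    },
)
print(resp.json())  # beijing should now receive the higher relevance_score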

zhangever

zhangever commented on May 15, 2025

@zhangever
Author

Tested it, and it works. Could xinference prepend this "Query: " automatically for the user? Many platforms, including Dify, find it hard to handle this kind of instruction. @llyycchhee
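
Until such an option exists, one hypothetical client-side workaround is a thin wrapper that prepends the instruction before forwarding the call to Xinference; the helper below is illustrative only and not part of any Xinference or Dify API.

import requests

RERANK_URL = "http://xxx:9997/v1/rerank"  # host placeholder as in the report
INSTRUCTION = "Query: "                   # what MiniCPM-Reranker-Light expects

def rerank_with_instruction(model: str, query: str, documents: list[str]) -> dict:
    payload = {
        "model": model,
        "query": INSTRUCTION + query,  # prepend the instruction on behalf of the caller
        "documents": documents,
    }
    return requests.post(RERANK_URL, json=payload).json()

print(rerank_with_instruction("MiniCPM-Reranker-Light", "中国的首都是哪里?", ["beijing", "shanghai"]))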

