
Same reranker model, same input: results differ greatly between xinference and native Transformers, and xinference returns the wrong result #3104


Description

@zhangever

System Info

CUDA Version: 12.4
Ubuntu 22.04.3 LTS

Running Xinference with Docker?

  • docker
    pip install
    installation from source

Version info

xinference: v1.3.1

transformers: 4.40.1

The command used to start Xinference

docker run -d --name xinference --restart=always \
-e HF_ENDPOINT=https://hf-mirror.com \
-e HUGGING_FACE_HUB_TOKEN=hf_xx \
-e LOG_TZ=Asia/Shanghai \
-e TZ=Asia/Shanghai \
-v /root/.xinference:/root/.xinference \
-v /root/.cache/huggingface:/root/.cache/huggingface \
-v /root/.cache/modelscope:/root/.cache/modelscope \
-v /data2/models:/data2/models \
-p 9997:9997 \
-p 8777:8777 \
--gpus all \
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v1.3.1 \
xinference-local -H 0.0.0.0 --port 9997 -mp 8777

Reproduction

When asked "中国的首都是哪里?" ("Where is the capital of China?"), xinference gives Shanghai the higher score, which is not expected, while Transformers ranks Beijing higher, as expected.

  1. Launch MiniCPM-Reranker-Light within xinference
  2. curl to the reranker model:
curl -X 'POST' 'http://xxx:9997/v1/rerank' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "MiniCPM-Reranker-Light",
    "query": "中国的首都是哪里?",
    "documents": [
        "beijing",
        "shanghai"
    ]
  }'

Output:

{"id":"4e605156-0634-11f0-a430-0242c0a80102","results":[{"index":1,"relevance_score":0.021355781704187393,"document":null},{"index":0,"relevance_score":0.011472251266241074,"document":null}],"meta":{"api_version":null,"billed_units":null,"tokens":null,"warnings":null}}
  3. Run Python code with Transformers:
from transformers import AutoModelForSequenceClassification
import torch

model_name = "openbmb/MiniCPM-Reranker-Light"
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
# You can also use the following code to use flash_attention_2
# model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True,attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()

query = "中国的首都是哪里?"  # "Where is the capital of China?"
passages = ["beijing", "shanghai"]  # Beijing, Shanghai

# Score passages with the model's rerank helper.
rerank_score = model.rerank(query, passages, query_instruction="Query:", batch_size=32, max_length=1024)
print(rerank_score)  # [0.01791382 0.00024533]

# Equivalent: score explicit (query, document) pairs with compute_score.
sentence_pairs = [[f"Query: {query}", doc] for doc in passages]
scores = model.compute_score(sentence_pairs, batch_size=32, max_length=1024)
print(scores)  # [0.01791382 0.00024533]

Output:

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Computing scores: 100%|██████████| 1/1 [00:00<00:00,  4.09it/s]
[0.01785278 0.00024915]
Computing scores: 100%|██████████| 1/1 [00:00<00:00, 17.27it/s]
[0.01785278 0.00024915]

Expected behavior

Beijing is expected to receive the higher score.

Activity

added this to the v1.x milestone on Mar 21, 2025
qinxuye

qinxuye commented on Mar 22, 2025

@qinxuye
Contributor

The engine behind Xinference here is sentence-transformers. This needs to be reproduced.
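
For reference, the same pairs can be scored directly with sentence-transformers' CrossEncoder to see whether the discrepancy already appears at the engine level. This is a minimal sketch, assuming a recent sentence-transformers release whose CrossEncoder accepts trust_remote_code; it only approximates Xinference's internal code path.

from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "openbmb/MiniCPM-Reranker-Light",
    trust_remote_code=True,
    max_length=1024,
)

query = "中国的首都是哪里?"  # "Where is the capital of China?"
passages = ["beijing", "shanghai"]

# Without the model's "Query: " instruction, mirroring what the /v1/rerank call sends.
print(model.predict([(query, doc) for doc in passages]))

# With the instruction prepended, mirroring the model card's usage.
print(model.predict([(f"Query: {query}", doc) for doc in passages]))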

zhangever

zhangever commented on Mar 25, 2025

@zhangever
Author

(image attached)

These two scores are consistent.

qinxuye

qinxuye commented on Mar 25, 2025

@qinxuye
Contributor

I see the tokenizer setting padding_side and so on; I'm not sure whether it's exactly the same.
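
To illustrate the concern, here is a hypothetical check of how padding_side changes the padded layout of a batch; this is not Xinference's actual code path, just a way to inspect the difference.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openbmb/MiniCPM-Reranker-Light", trust_remote_code=True)
texts = [f"Query: 中国的首都是哪里?{doc}" for doc in ("beijing", "shanghai")]

for side in ("left", "right"):
    tok.padding_side = side
    batch = tok(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")
    # The shorter sequence is padded on a different side, so the token found at a
    # fixed position (e.g. the last one) can change between the two settings.
    print(side, batch["input_ids"][0].tolist()[:10])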

zhangever

zhangever commented on Mar 25, 2025

@zhangever
Author

> I see the tokenizer setting padding_side and so on; I'm not sure whether it's exactly the same.

Right now, the reranker model served through xinference gives results that are even worse than no reranking at all.

github-actions

github-actions commented on Apr 1, 2025

@github-actions

This issue is stale because it has been open for 7 days with no activity.

zhangever

zhangever commented on Apr 7, 2025

@zhangever
Author

Could you please help look into this, @qinxuye?

qinxuye

qinxuye commented on Apr 7, 2025

@qinxuye
Contributor

I'll look into it this week.

github-actions

github-actions commented on Apr 14, 2025

@github-actions

This issue is stale because it has been open for 7 days with no activity.

zhangever

zhangever commented on Apr 20, 2025

@zhangever
Author

Still following this.

github-actions

github-actions commented on Apr 27, 2025

@github-actions

This issue is stale because it has been open for 7 days with no activity.

github-actions

github-actions commented on May 2, 2025

@github-actions

This issue was closed because it has been inactive for 5 days since being marked as stale.

reopened this on May 3, 2025
github-actions

github-actions commented on May 11, 2025

@github-actions

This issue is stale because it has been open for 7 days with no activity.

qinxuye

qinxuye commented on May 12, 2025

@qinxuye
Contributor

@llyycchhee please track this issue.

llyycchhee

llyycchhee commented on May 14, 2025

@llyycchhee
Collaborator

This is because the MiniCPM-Reranker-Light model requires the instruction "Query: " to be prepended to each query. In the xinference curl request, users can change the query parameter to "query": "Query: 中国的首都是哪里?" as the model requires. A note about this will be added to the documentation later.
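
For example, the same request from the report with the instruction prepended, sketched here with Python's requests (the host placeholder is kept from the original curl):

import requests

resp = requests.post(
    "http://xxx:9997/v1/rerank",
    json={
        "model": "MiniCPM-Reranker-Light",
        "query": "Query: 中国的首都是哪里?",  # instruction prepended to the query
        "documents": ["beijing", "shanghai"],
    },
)
print(resp.json())  # beijing should now receive the higher relevance_score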

zhangever

zhangever commented on May 15, 2025

@zhangever
Author

Tested it, and it works. Could xinference prepend this "Query: " automatically for the user? Many platforms, including Dify, find it hard to handle this kind of instruction. @llyycchhee
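
Until such an option exists, one hypothetical client-side workaround is a thin wrapper that prepends the instruction before forwarding the call to Xinference; the helper below is illustrative only and not part of any Xinference or Dify API.

import requests

RERANK_URL = "http://xxx:9997/v1/rerank"  # host placeholder as in the report
INSTRUCTION = "Query: "                   # what MiniCPM-Reranker-Light expects

def rerank_with_instruction(model: str, query: str, documents: list[str]) -> dict:
    payload = {
        "model": model,
        "query": INSTRUCTION + query,  # prepend the instruction on behalf of the caller
        "documents": documents,
    }
    return requests.post(RERANK_URL, json=payload).json()

print(rerank_with_instruction("MiniCPM-Reranker-Light", "中国的首都是哪里?", ["beijing", "shanghai"]))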

