Closed
Description
System Info
CUDA Version: 12.4
Ubuntu 22.04.3 LTS
Running Xinference with Docker?
- docker
- pip install
- installation from source
Version info
xinference: v1.3.1
transformers: 4.40.1
The command used to start Xinference
docker run -d --name xinference --restart=always \
-e HF_ENDPOINT=https://hf-mirror.com \
-e HUGGING_FACE_HUB_TOKEN=hf_xx \
-e LOG_TZ=Asia/Shanghai \
-e TZ=Asia/Shanghai \
-v /root/.xinference:/root/.xinference \
-v /root/.cache/huggingface:/root/.cache/huggingface \
-v /root/.cache/modelscope:/root/.cache/modelscope \
-v /data2/models:/data2/models \
-p 9997:9997 \
-p 8777:8777 \
--gpus all \
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v1.3.1 \
xinference-local -H 0.0.0.0 --port 9997 -mp 8777
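Reproduction
When asked "Where is the capital of China?", Xinference gives Shanghai the higher score, which is not expected; Transformers gives Beijing the higher score, as expected.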
- Launch MiniCPM-Reranker-Light within xinference
- curl to the reranker model:
curl -X 'POST' 'http://xxx:9997/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "MiniCPM-Reranker-Light",
"query": "中国的首都是哪里?",
"documents": [
"beijing",
"shanghai"
]
}'
Output:
{"id":"4e605156-0634-11f0-a430-0242c0a80102","results":[{"index":1,"relevance_score":0.021355781704187393,"document":null},{"index":0,"relevance_score":0.011472251266241074,"document":null}],"meta":{"api_version":null,"billed_units":null,"tokens":null,"warnings":null}}
- Run Python code with Transformers:
from transformers import AutoModelForSequenceClassification
import torch
model_name = "openbmb/MiniCPM-Reranker-Light"
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
# You can also use the following code to use flash_attention_2
# model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True,attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()
query = "中国的首都是哪里?" # "Where is the capital of China?"
passages = ["beijing", "shanghai"] # Beijing, Shanghai
rerank_score = model.rerank(query, passages,query_instruction="Query:", batch_size=32, max_length=1024)
print(rerank_score) #[0.01791382 0.00024533]
sentence_pairs = [[f"Query: {query}", doc] for doc in passages]
scores = model.compute_score(sentence_pairs, batch_size=32, max_length=1024)
print(scores) #[0.01791382 0.00024533]
outputs:
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Computing scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.09it/s]
[0.01785278 0.00024915]
Computing scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.27it/s]
[0.01785278 0.00024915]
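The two results above can be compared side by side. The following is a minimal sketch that turns each output into a document ranking; the helper names `ranking_from_results` and `ranking_from_scores` are hypothetical, and the numbers are copied from the outputs in this report.

```python
import json

documents = ["beijing", "shanghai"]

# "results" field copied from the Xinference response above.
xinference_results = json.loads(
    '[{"index":1,"relevance_score":0.021355781704187393,"document":null},'
    '{"index":0,"relevance_score":0.011472251266241074,"document":null}]'
)

# Scores printed by the Transformers script above, in input order.
transformers_scores = [0.01785278, 0.00024915]


def ranking_from_results(results, documents):
    """Order documents by the relevance_score of each rerank result."""
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in ordered]


def ranking_from_scores(scores, documents):
    """Order documents by a list of scores given in input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in order]


xinference_ranking = ranking_from_results(xinference_results, documents)
transformers_ranking = ranking_from_scores(transformers_scores, documents)
print("xinference:  ", xinference_ranking)
print("transformers:", transformers_ranking)
```

With these inputs the Xinference ranking puts "shanghai" first, while the Transformers ranking puts "beijing" first, which is the mismatch this issue reports.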
Expected behavior
Beijing is expected to receive the higher score.
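qinxuye commented on Mar 22, 2025
The engine behind Xinference is sentence-transformers. This needs to be reproduced.
zhangever commented on Mar 25, 2025
These two scores are the same.
qinxuye commented on Mar 25, 2025
I see that the tokenizer sets padding_side and the like; I am not sure whether the two setups are exactly the same.
zhangever commented on Mar 25, 2025
Right now, using a reranker model through Xinference gives worse results than not reranking at all.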
github-actions commented on Apr 1, 2025
This issue is stale because it has been open for 7 days with no activity.
zhangever commented on Apr 7, 2025
Could you please help with this? @qinxuye
qinxuye commented on Apr 7, 2025
We will look into it this week.
github-actions commented on Apr 14, 2025
This issue is stale because it has been open for 7 days with no activity.
zhangever commented on Apr 20, 2025
Still following this.
github-actions commented on Apr 27, 2025
This issue is stale because it has been open for 7 days with no activity.
github-actions commented on May 2, 2025
This issue was closed because it has been inactive for 5 days since being marked as stale.
github-actions commented on May 11, 2025
This issue is stale because it has been open for 7 days with no activity.
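qinxuye commented on May 12, 2025
@llyycchhee please track this issue.
llyycchhee commented on May 14, 2025
This happens because the MiniCPM-Reranker-Light model needs INSTRUCTION="Query: " placed before each query. In the Xinference curl request, users can change the query parameter to match what the model requires: "query":"Query: 中国的首都是哪里?". We will add a note about this to the documentation later.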
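The workaround described above could be sketched in Python as follows. This is only an illustration using the standard library: `build_rerank_payload` and `rerank` are hypothetical helper names, and `base_url` stands in for the server address, which is elided as `http://xxx:9997` in the original report.

```python
import json
import urllib.request

INSTRUCTION = "Query: "  # prefix named in the comment above


def build_rerank_payload(query, documents, model="MiniCPM-Reranker-Light"):
    """Build the /v1/rerank request body with the instruction prepended to the query."""
    return {"model": model, "query": f"{INSTRUCTION}{query}", "documents": documents}


def rerank(base_url, query, documents):
    """POST the payload to a running Xinference server and return the parsed JSON."""
    body = json.dumps(build_rerank_payload(query, documents)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the payload needs no server; calling rerank() does.
payload = build_rerank_payload("中国的首都是哪里?", ["beijing", "shanghai"])
print(json.dumps(payload, ensure_ascii=False))
```

Only the `query` field changes relative to the original curl request; the model name and documents are sent as before.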
zhangever commented on May 15, 2025
Tested, and it works. Could Xinference add this "Query: " prefix for users automatically? Many platforms, including Dify, have a hard time handling this INSTRUCTION. @llyycchhee
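One way such automatic prefixing might look is a per-model lookup applied before the request is sent. The sketch below is an assumption for illustration, not Xinference's actual behavior: the `QUERY_INSTRUCTIONS` table and the `with_instruction` helper are hypothetical, and only the MiniCPM-Reranker-Light entry comes from this thread.

```python
# Hypothetical lookup table: model name -> instruction prefix it expects.
# Only the MiniCPM-Reranker-Light entry is taken from this thread.
QUERY_INSTRUCTIONS = {"MiniCPM-Reranker-Light": "Query: "}


def with_instruction(model, query):
    """Prepend the model's instruction unless it is unknown or already present."""
    prefix = QUERY_INSTRUCTIONS.get(model, "")
    if prefix and not query.startswith(prefix):
        return prefix + query
    return query


print(with_instruction("MiniCPM-Reranker-Light", "中国的首都是哪里?"))
```

Checking for an existing prefix keeps the helper safe for callers that already follow the manual workaround, and models without an entry in the table pass through unchanged.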