Description
Hi authors, thanks for this great work! When trying to reproduce the results, I can successfully reproduce the GLM-9B and GPT-4o numbers. But when benchmarking the Qwen models using vLLM, the performance is much lower than expected: I only get a 1% Overall Success Rate and 45.12% Overall Call Acc, far below your official numbers (40.10% and 58.32%).
My environment uses vllm==0.6.3, transformers==4.48.3, and torch==2.4.0, and I am benchmarking the Qwen2.5-72B-Instruct model.
To extend the model's context length to 131072, following their official repo, I manually add `"rope_scaling": { "factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn" }` to the model's config.json and store the modified checkpoint under the checkpoints/Qwen2.5-72B-Instruct-YaRN directory.
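For clarity, the config change amounts to something like the following sketch (using Python's standard json module; it assumes the original checkpoint sits at checkpoints/Qwen2.5-72B-Instruct and that the remaining weight/tokenizer files are copied to the YaRN directory separately):

```python
import json

# Read the original Qwen2.5-72B-Instruct config and add the YaRN rope_scaling
# entry recommended in the Qwen repo for 131072-token contexts.
with open("checkpoints/Qwen2.5-72B-Instruct/config.json") as f:
    config = json.load(f)

config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

# Write the modified config into the YaRN checkpoint directory
# (the other checkpoint files are copied there unchanged).
with open("checkpoints/Qwen2.5-72B-Instruct-YaRN/config.json", "w") as f:
    json.dump(config, f, indent=2)
```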
The vLLM command I use is:

```bash
vllm serve checkpoints/Qwen2.5-72B-Instruct-YaRN --tensor-parallel-size 4 --gpu-memory-utilization 0.8 \
    --max_model_len 131072 --trust-remote-code --port 4008 \
    --enable-auto-tool-choice --tool-call-parser hermes
```

My evaluation command is:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python evaluation.py \
    --model_name checkpoints/Qwen2.5-72B-Instruct-YaRN \
    --proc_num 10 \
    --vllm_url http://0.0.0.0:4008/v1
```
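For reference, a quick way to sanity-check that the served endpoint emits parsed tool calls (rather than plain text) might look like the sketch below, using the openai Python client against the endpoint above; the `get_weather` tool is just a placeholder for illustration and is not part of the benchmark:

```python
# Minimal tool-calling sanity check against the vLLM OpenAI-compatible server
# started above. The get_weather tool is a placeholder example.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:4008/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="checkpoints/Qwen2.5-72B-Instruct-YaRN",
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
)

# With --enable-auto-tool-choice and the hermes parser, this should print a
# structured tool call instead of None.
print(resp.choices[0].message.tool_calls)
```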
Could you share insights on how to reproduce the reported results with the Qwen model? Additionally, it would be great if you could add the code you used to serve the Qwen models with vLLM to the repository.