Question about how to reproduce Qwen's result using vLLM #2

Description

@haochengxi

Hi authors, thanks for this great work! While trying to reproduce the results, I can successfully reproduce the GLM-9B and GPT-4o numbers. But when benchmarking the Qwen models with vLLM, the performance is much lower than expected: I only get a 1% Overall Success Rate and 45.12% Overall Call Acc, far below your official numbers (40.10% and 58.32%).

My environment uses vllm==0.6.3, transformers==4.48.3, and torch==2.4.0, and I am benchmarking the Qwen2.5-72B-Instruct model.
To extend the model's context length to 131072, following their official repo, I manually add "rope_scaling": { "factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn" } to the model's config.json and store the modified checkpoint under the checkpoints/Qwen2.5-72B-Instruct-YaRN directory.
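Concretely, the edit to config.json is equivalent to this small sketch (the path is just my local copy of the checkpoint):

```python
import json

# Local copy of the Qwen2.5-72B-Instruct checkpoint (path as described above).
cfg_path = "checkpoints/Qwen2.5-72B-Instruct-YaRN/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# YaRN rope scaling from Qwen's official instructions for 131072-token context.
cfg["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```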

The vLLM command I use is:

vllm serve checkpoints/Qwen2.5-72B-Instruct-YaRN --tensor-parallel-size 4 --gpu-memory-utilization 0.8 \
    --max_model_len 131072 --trust-remote-code --port 4008 \
    --enable-auto-tool-choice --tool-call-parser hermes
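
Before running the benchmark, a quick way to confirm that tool calling works through the served endpoint is something like the sketch below (assuming vLLM's OpenAI-compatible API; the get_weather tool is only an illustrative placeholder, not part of the benchmark):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint on the port passed to `vllm serve`.
client = OpenAI(base_url="http://0.0.0.0:4008/v1", api_key="EMPTY")

# Illustrative tool definition for a sanity check only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    # The served model name defaults to the path given to `vllm serve`.
    model="checkpoints/Qwen2.5-72B-Instruct-YaRN",
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```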

My evaluation command is:

CUDA_VISIBLE_DEVICES=0,1,2,3 python evaluation.py \
    --model_name checkpoints/Qwen2.5-72B-Instruct-YaRN \
    --proc_num 10 \
    --vllm_url http://0.0.0.0:4008/v1

Could you share insights on how to achieve the expected results with the Qwen models? Additionally, it would be great if you could add the code you use for serving Qwen models with vLLM to the repository.
