Question about how to reproduce Qwen's result using vLLM #2

Description

@haochengxi

Hi authors, thanks for this great work! While trying to reproduce the results, I can successfully reproduce the GLM-9B and GPT-4o numbers. But when benchmarking the Qwen models with vLLM, the performance is much lower than expected: I only get a 1% Overall Success Rate and 45.12% Overall Call Acc, far below your official numbers (40.10% and 58.32%).

My environment uses vllm==0.6.3, transformers==4.48.3, and torch==2.4.0, and I am benchmarking the Qwen2.5-72B-Instruct model.
To extend the model's context length to 131072, following their official repo, I manually add "rope_scaling": { "factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn" } to the model's config.json and store the modified checkpoint under the checkpoints/Qwen2.5-72B-Instruct-YaRN directory.
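Concretely, the edit to config.json is equivalent to this small sketch (the path is just my local copy of the checkpoint):

```python
import json

# Local copy of the Qwen2.5-72B-Instruct checkpoint (path as described above).
cfg_path = "checkpoints/Qwen2.5-72B-Instruct-YaRN/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# YaRN rope scaling from Qwen's official instructions for 131072-token context.
cfg["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```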

The vLLM command I use is:

vllm serve checkpoints/Qwen2.5-72B-Instruct-YaRN --tensor-parallel-size 4 --gpu-memory-utilization 0.8 \
    --max_model_len 131072 --trust-remote-code --port 4008 \
    --enable-auto-tool-choice --tool-call-parser hermes
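
Before running the benchmark, a quick way to confirm that tool calling works through the served endpoint is something like the sketch below (assuming vLLM's OpenAI-compatible API; the get_weather tool is only an illustrative placeholder, not part of the benchmark):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint on the port passed to `vllm serve`.
client = OpenAI(base_url="http://0.0.0.0:4008/v1", api_key="EMPTY")

# Illustrative tool definition for a sanity check only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    # The served model name defaults to the path given to `vllm serve`.
    model="checkpoints/Qwen2.5-72B-Instruct-YaRN",
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```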

My evaluation command is:

CUDA_VISIBLE_DEVICES=0,1,2,3 python evaluation.py \
    --model_name checkpoints/Qwen2.5-72B-Instruct-YaRN \
    --proc_num 10 \
    --vllm_url http://0.0.0.0:4008/v1

Could you share insights on how to achieve the expected results with the Qwen models? Additionally, it would be great if you could add the code you use for serving Qwen models with vLLM to the repository.
