
[Performance]: vLLM 0.6.5 with GLM4-9B-Chat and dynamically loaded LoRA: inference speed drops significantly on long inputs #11317

@zh19980310

Description


Proposal to improve performance

No response

Report of performance regression

A800, single GPU, processing one request at a time

  1. vLLM 0.6.5 without LoRA
    (1) Launch:
    CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --trust-remote-code
    (2) Request:
    response = client.chat.completions.create(
        model='/Work/....../glm-4-9b-chat/',
        messages=messages,
        n=1,
        temperature=0,
        extra_body={"stop_token_ids": [151329, 151336, 151338]},
        max_tokens=2048,
        stream=True)

  2. vLLM 0.6.5 with dynamically loaded LoRA
    [The LoRA adapter was trained with the llama_factory framework]
    (1) Launch:
    CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --max-loras 10 --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64
    (2) Request:
    response = client.chat.completions.create(
        model='summary',
        messages=messages,
        n=1,
        temperature=0,
        extra_body={"stop_token_ids": [151329, 151336, 151338]},
        max_tokens=2048,
        stream=True)

I measured inference speed in both configurations with input texts of different lengths in messages:
[screenshot: table of inference speeds at different input lengths]
With LoRA loaded, inference is much slower than without LoRA when the input text is long, but only slightly slower when the input is short.
What causes this, and how can I fix it? Thanks!
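Since the comparison above hinges on measuring streamed decode speed, here is a minimal sketch of how generated tokens can be counted from the raw SSE lines of an OpenAI-style streaming response (the chunk format vLLM's OpenAI-compatible server emits); dividing the count by wall-clock time gives tokens/s. The function name and the canned transcript below are illustrative, not part of the original report.

```python
import json

def count_stream_tokens(sse_lines):
    """Count chunks carrying delta content in an OpenAI-style SSE stream.
    Each content chunk corresponds to roughly one generated token."""
    n = 0
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # stream terminator
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # Role-only or empty deltas (first/last chunks) are skipped.
            if choice.get("delta", {}).get("content"):
                n += 1
    return n

# Example on a canned transcript (illustrative data):
lines = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"你"}}]}',
    'data: {"choices":[{"delta":{"content":"好"}}]}',
    'data: [DONE]',
]
print(count_stream_tokens(lines))  # → 2
```

Timing the loop with `time.perf_counter()` around the real streamed `response` for the base model and for `summary` on the same prompt would give a directly comparable tokens/s figure for each input length.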

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels

performance (Performance-related issues), stale (Over 90 days of inactivity)
