Description
Report of performance regression
A800, single request on a single GPU
vLLM 0.6.5 without LoRA
(1) Launch:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --trust-remote-code
(2) Request:
response = client.chat.completions.create(
    model='/Work/....../glm-4-9b-chat/',
    messages=messages,
    n=1,
    temperature=0,
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
    max_tokens=2048,
    stream=True)
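The snippets above assume an OpenAI-compatible `client` pointed at the local vLLM server. A minimal sketch of that setup (the port and `api_key` are assumptions; 8000 is vLLM's default, and the key is required by the SDK but not checked by vLLM):

```python
def build_chat_kwargs(model, messages, max_tokens=2048):
    # Request parameters shared by both tests; the stop_token_ids are
    # GLM-4's end tokens, passed through vLLM's extra_body.
    return dict(
        model=model,
        messages=messages,
        n=1,
        temperature=0,
        extra_body={"stop_token_ids": [151329, 151336, 151338]},
        max_tokens=max_tokens,
        stream=True,
    )

if __name__ == "__main__":
    from openai import OpenAI  # OpenAI Python SDK, used as a vLLM client
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    messages = [{"role": "user", "content": "Summarize the following text: ..."}]
    stream = client.chat.completions.create(**build_chat_kwargs("summary", messages))
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
```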
vLLM 0.6.5 with a dynamically loaded LoRA
[The LoRA model was trained with the llama_factory framework]
(1) Launch:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --max-loras 10 --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64
(2) Request:
response = client.chat.completions.create(
    model='summary',
    messages=messages,
    n=1,
    temperature=0,
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
    max_tokens=2048,
    stream=True)
I tested inference speed in both setups with input texts of different lengths in `messages`:

I found that after loading the LoRA, inference speed with long inputs drops considerably compared to the no-LoRA run, while with short inputs the drop is small.
What causes this, and how can I fix it? Thanks!
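To make the comparison reproducible, a small timing harness can measure streamed tokens per second at several input lengths against each server. This is a sketch under assumptions: the server from one of the setups above is running on localhost:8000, one streamed chunk is counted as roughly one token, and the repeated-filler prompts are illustrative only:

```python
import time

def make_prompt(n_chars, filler="test text. "):
    # Build an input of roughly n_chars characters by repeating a filler
    # string (real tests should use representative documents instead).
    reps = n_chars // len(filler) + 1
    return (filler * reps)[:n_chars]

def tokens_per_second(n_tokens, elapsed):
    # Throughput over the whole stream; 0.0 guards against a zero timer.
    return n_tokens / elapsed if elapsed > 0 else 0.0

if __name__ == "__main__":
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    for n_chars in (500, 2000, 8000):  # short vs. long inputs
        messages = [{"role": "user", "content": make_prompt(n_chars)}]
        start = time.perf_counter()
        n_tokens = 0
        stream = client.chat.completions.create(
            model="summary",  # use the base model path for the no-LoRA run
            messages=messages,
            temperature=0,
            max_tokens=256,
            extra_body={"stop_token_ids": [151329, 151336, 151338]},
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                n_tokens += 1  # approximation: one chunk per token
        elapsed = time.perf_counter() - start
        print(f"{n_chars} chars: {tokens_per_second(n_tokens, elapsed):.1f} tok/s")
```

Running this once against the plain server and once against the LoRA-enabled one should show whether the slowdown grows with input length.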