Description
Report of performance regression
A800, single request on a single GPU
vLLM 0.6.5 without LoRA
(1) Launch:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --trust-remote-code
(2) Request:
response = client.chat.completions.create(
    model='/Work/....../glm-4-9b-chat/',
    messages=messages,
    n=1,
    temperature=0,
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
    max_tokens=2048,
    stream=True)
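The snippets above assume an OpenAI-compatible `client` pointed at the local vLLM server. A minimal sketch of that setup (the port and `api_key` are assumptions; 8000 is vLLM's default, and the key is required by the SDK but not checked by vLLM):

```python
def build_chat_kwargs(model, messages, max_tokens=2048):
    # Request parameters shared by both tests; the stop_token_ids are
    # GLM-4's end tokens, passed through vLLM's extra_body.
    return dict(
        model=model,
        messages=messages,
        n=1,
        temperature=0,
        extra_body={"stop_token_ids": [151329, 151336, 151338]},
        max_tokens=max_tokens,
        stream=True,
    )

if __name__ == "__main__":
    from openai import OpenAI  # OpenAI Python SDK, used as a vLLM client
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    messages = [{"role": "user", "content": "Summarize the following text: ..."}]
    stream = client.chat.completions.create(**build_chat_kwargs("summary", messages))
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
```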
vLLM 0.6.5 with a dynamically loaded LoRA
[The LoRA model was trained with the llama_factory framework]
(1) Launch:
CUDA_VISIBLE_DEVICES=7 python -m vllm.entrypoints.openai.api_server --model /Work/....../glm-4-9b-chat/ --enable-lora --max-loras 10 --lora-modules summary=/Work/....../sft_1218/ --trust-remote-code --max-lora-rank 64
(2) Request:
response = client.chat.completions.create(
    model='summary',
    messages=messages,
    n=1,
    temperature=0,
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
    max_tokens=2048,
    stream=True)
I tested inference speed in both setups with input texts of different lengths in `messages`:

I found that after loading the LoRA, inference speed with long inputs drops considerably compared to the no-LoRA run, while with short inputs the drop is small.
What causes this, and how can I fix it? Thanks!
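To make the comparison reproducible, a small timing harness can measure streamed tokens per second at several input lengths against each server. This is a sketch under assumptions: the server from one of the setups above is running on localhost:8000, one streamed chunk is counted as roughly one token, and the repeated-filler prompts are illustrative only:

```python
import time

def make_prompt(n_chars, filler="test text. "):
    # Build an input of roughly n_chars characters by repeating a filler
    # string (real tests should use representative documents instead).
    reps = n_chars // len(filler) + 1
    return (filler * reps)[:n_chars]

def tokens_per_second(n_tokens, elapsed):
    # Throughput over the whole stream; 0.0 guards against a zero timer.
    return n_tokens / elapsed if elapsed > 0 else 0.0

if __name__ == "__main__":
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    for n_chars in (500, 2000, 8000):  # short vs. long inputs
        messages = [{"role": "user", "content": make_prompt(n_chars)}]
        start = time.perf_counter()
        n_tokens = 0
        stream = client.chat.completions.create(
            model="summary",  # use the base model path for the no-LoRA run
            messages=messages,
            temperature=0,
            max_tokens=256,
            extra_body={"stop_token_ids": [151329, 151336, 151338]},
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                n_tokens += 1  # approximation: one chunk per token
        elapsed = time.perf_counter() - start
        print(f"{n_chars} chars: {tokens_per_second(n_tokens, elapsed):.1f} tok/s")
```

Running this once against the plain server and once against the LoRA-enabled one should show whether the slowdown grows with input length.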