
[Bug]: Inference is exceptionally slow on the L20 GPU #10652

@joey9503

Description


Your current environment

PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Alibaba Group Enterprise Linux Server 7.2 (Paladin) (x86_64)
GCC version: (GCC) 9.2.1 20200522 (Alibaba 9.2.1-3 2.17)
Clang version: Could not collect
CMake version: version 3.20.1
Libc version: glibc-2.30

Python version: 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.9.151-015.ali3000.alios7.x86_64-x86_64-with-glibc2.30
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: L20-2-PCIE-48GB-48GB-L-H-V
Nvidia driver version: 535.161.08
cuDNN version: Probably one of the following:
/usr/local/cuda/targets/x86_64-linux/lib/libcudnn.so.8.9.3
/usr/local/cuda/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.3
/usr/local/cuda/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.3
/usr/local/cuda/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.3
/... [environment dump truncated]
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

LD_LIBRARY_PATH=/home/t4/huse/images/tansformers-gpu.1d12f125-1d2d-464b-9ea7-2c616fd2e256/1/spark-3.2.0-dag-snapshotTest-20241030180001-SNAPSHOT-py39/python/lib/conda/lib/python3.9/site-packages/cv2/../../lib64:/home/t4/huse/images/tansformers-gpu.1d12f125-1d2d-464b-9ea7-2c616fd2e256/1/spark-3.2.0-dag-snapshotTest-20241030180001-SNAPSHOT-py39/python/lib/conda/lib/python3.9/site-packages/nvidia/nvjitlink/lib:/dev/shm/xpdk-lite:/dev/shm/xpdk-lite:/opt/conda/lib/python3.8/site-packages/aistudio_common/reader/libs/:/opt/taobao/java/jre/lib/amd64/server/:/usr/local/cuda/lib64:/usr/local/TensorRT-8.6.1/lib/:/usr/local/lib:/usr/local/lib64:/opt/ai-inference/
NVIDIA_VISIBLE_DEVICES=GPU-a18c8a32-ef6d-8d76-5a61-7a81da133490
NVIDIA_DRIVER_CAPABILITIES=utility,compute
CUDA_MPS_PIPE_DIRECTORY=/dev/shm/nvidia-mps
CUDA_MODULE_LOADING=LAZY

Model Input Dumps

No response

🐛 Describe the bug

Inference is exceptionally slow on the L20 GPU, at about 0.08 tokens/s:
[Screenshot 2024-11-25 15:35:43]
and GPU utilization is low:
[Screenshot 2024-11-25 14:24:49]
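For context on the figure above: a throughput number like 0.08 tokens/s is typically obtained by dividing the count of newly generated tokens by the wall-clock time of the run. The helper below is a minimal sketch of that calculation (the function name and numbers are illustrative, not taken from the report):

```python
import time


def tokens_per_second(num_new_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: newly produced tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return num_new_tokens / elapsed_s


# Timing a (stand-in) generation call with a monotonic clock:
start = time.monotonic()
# ... run model.generate(...) here ...
elapsed = time.monotonic() - start

# Illustrative numbers: 8 new tokens in 100 s matches the reported ~0.08 tokens/s.
print(tokens_per_second(8, 100.0))  # → 0.08
```

Measuring this way (rather than trusting a progress bar) makes it easy to compare runs while changing one variable at a time, e.g. dtype, batch size, or driver settings.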

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), stale (Over 90 days of inactivity)
