
[Bug]: Inference is exceptionally slow on the L20 GPU #10652

@joey9503

Description


Your current environment

PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Alibaba Group Enterprise Linux Server 7.2 (Paladin) (x86_64)
GCC version: (GCC) 9.2.1 20200522 (Alibaba 9.2.1-3 2.17)
Clang version: Could not collect
CMake version: version 3.20.1
Libc version: glibc-2.30

Python version: 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.9.151-015.ali3000.alios7.x86_64-x86_64-with-glibc2.30
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: L20-2-PCIE-48GB-48GB-L-H-V
Nvidia driver version: 535.161.08
cuDNN version: Probably one of the following:
/usr/local/cuda/targets/x86_64-linux/lib/libcudnn.so.8.9.3
/usr/local/cuda/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.3
/usr/local/cuda/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.3
/usr/local/cuda/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.3
/... [environment dump truncated]
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

LD_LIBRARY_PATH=/home/t4/huse/images/tansformers-gpu.1d12f125-1d2d-464b-9ea7-2c616fd2e256/1/spark-3.2.0-dag-snapshotTest-20241030180001-SNAPSHOT-py39/python/lib/conda/lib/python3.9/site-packages/cv2/../../lib64:/home/t4/huse/images/tansformers-gpu.1d12f125-1d2d-464b-9ea7-2c616fd2e256/1/spark-3.2.0-dag-snapshotTest-20241030180001-SNAPSHOT-py39/python/lib/conda/lib/python3.9/site-packages/nvidia/nvjitlink/lib:/dev/shm/xpdk-lite:/dev/shm/xpdk-lite:/opt/conda/lib/python3.8/site-packages/aistudio_common/reader/libs/:/opt/taobao/java/jre/lib/amd64/server/:/usr/local/cuda/lib64:/usr/local/TensorRT-8.6.1/lib/:/usr/local/lib:/usr/local/lib64:/opt/ai-inference/
NVIDIA_VISIBLE_DEVICES=GPU-a18c8a32-ef6d-8d76-5a61-7a81da133490
NVIDIA_DRIVER_CAPABILITIES=utility,compute
CUDA_MPS_PIPE_DIRECTORY=/dev/shm/nvidia-mps
CUDA_MODULE_LOADING=LAZY

Model Input Dumps

No response

🐛 Describe the bug

Inference is exceptionally slow on the L20 GPU, at about 0.08 tokens/s:
[Screenshot 2024-11-25 15:35:43]
and GPU utilization is low:
[Screenshot 2024-11-25 14:24:49]
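For context on the figure above: a throughput number like 0.08 tokens/s is typically obtained by dividing the count of newly generated tokens by the wall-clock time of the run. The helper below is a minimal sketch of that calculation (the function name and numbers are illustrative, not taken from the report):

```python
import time


def tokens_per_second(num_new_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: newly produced tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return num_new_tokens / elapsed_s


# Timing a (stand-in) generation call with a monotonic clock:
start = time.monotonic()
# ... run model.generate(...) here ...
elapsed = time.monotonic() - start

# Illustrative numbers: 8 new tokens in 100 s matches the reported ~0.08 tokens/s.
print(tokens_per_second(8, 100.0))  # → 0.08
```

Measuring this way (rather than trusting a progress bar) makes it easy to compare runs while changing one variable at a time, e.g. dtype, batch size, or driver settings.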

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), stale (Over 90 days of inactivity)
