
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable #100691

@zhz17

🐛 Describe the bug

When transferring tensors or loading a model onto the GPU, the call fails with "RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable":

```python
# check.py
import torch
import os

a = torch.cuda.is_available()
print(a)

ngpu = 1
device = torch.device("cuda:0" if (torch.cuda.is_available() and ngpu > 0) else "cpu")
print(device)

dev_2 = torch.device("cuda:1")
print(dev_2)

for d in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(d))

print(torch.rand(3,3).cuda())
```

The output is shown below:
True
cuda:0
cuda:1
Tesla T4
Tesla T4
Traceback (most recent call last):
File "/data/check.py", line 17, in
print(torch.rand(3,3).cuda())
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
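
Following the suggestion in the error message, rerunning the failing allocation with synchronous kernel launches should make the reported stack trace point at the exact failing call. Below is a minimal sketch (the file name debug_check.py is hypothetical and not part of the report):

```python
# debug_check.py (hypothetical name) -- rerun the failing allocation with
# synchronous kernel launches, as suggested by the error message.
import os

# Must be set before the first CUDA call so launches are blocking and the
# traceback points at the actual failing line.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

print(torch.rand(3, 3).cuda())
```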

nvidia-smi

Fri May 5 13:28:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:02:01.0 Off | 0 |
| N/A 42C P0 28W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:02:02.0 Off | 0 |
| N/A 42C P0 28W / 70W | 2MiB / 15360MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
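
Since nvidia-smi reports no running processes and both boards look idle, it may help to narrow down whether the failure is tied to one specific device or affects both. A minimal sketch (per_device_check.py is a hypothetical file name, not part of the report):

```python
# per_device_check.py (hypothetical name) -- try to create a tensor on each
# visible GPU separately to see whether the failure is specific to one device.
import torch

for d in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(d)
    try:
        t = torch.ones(1, device=f"cuda:{d}")
        print(f"cuda:{d} ({name}): OK, value={t.item()}")
    except RuntimeError as e:
        print(f"cuda:{d} ({name}): FAILED -> {e}")
```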

Versions

PyTorch version: 2.0.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.17

Python version: 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.88.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4

Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 16
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 6
Model name: QEMU Virtual CPU version 2.5+
Stepping: 3
CPU MHz: 2095.078
BogoMIPS: 4190.15
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-31
Flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl xtopology eagerfpu pni cx16 x2apic hypervisor lahf_lm rsb_ctxsw

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.1
[pip3] torchvision==0.15.1
[pip3] triton==2.0.0
[conda] numpy 1.24.3 pypi_0 pypi
[conda] torch 2.0.0 pypi_0 pypi
[conda] torchaudio 2.0.1 pypi_0 pypi
[conda] torchvision 0.15.1 pypi_0 pypi
[conda] triton 2.0.0 pypi_0 pypi

cc @ngimel
