
Cannot select specific CUDA device #12967

Open · 1 task done
DP1701 opened this issue Apr 26, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

DP1701 commented Apr 26, 2024

Search before asking

Question

Hello everyone,

Unfortunately, I cannot select a specific CUDA device. I run the following:

python train.py --epochs 10 --device 0

I get the following output:

train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
Traceback (most recent call last):
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 306, in _lazy_init
    queued_call()
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 174, in _check_capability
    capability = get_device_capability(d)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 448, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 848, in <module>
    main(opt)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 607, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/utils/torch_utils.py", line 134, in select_device
    p = torch.cuda.get_device_properties(i)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

CUDA call was originally invoked at:

  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 34, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/__init__.py", line 1478, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 238, in <module>
    _lazy_call(_check_capability)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 235, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

But if I simply omit --device 0, it works; however, all CUDA devices are then selected.
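As a workaround sketch (assuming the assert is triggered because the device visibility is changed after torch has already been imported, as YOLOv5's select_device does in-process), exporting CUDA_VISIBLE_DEVICES before launch restricts the whole process to one GPU:

CUDA_VISIBLE_DEVICES=0 python train.py --epochs 10 --device 0

With only GPU 0 visible, omitting --device should also fall back to that single device rather than all of them.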

I have installed the following pip packages:

Package                  Version
------------------------ --------------------
absl-py                  2.1.0
albumentations           1.4.4
annotated-types          0.6.0
certifi                  2024.2.2
charset-normalizer       3.3.2
contourpy                1.2.1
cycler                   0.12.1
filelock                 3.13.4
fonttools                4.51.0
fsspec                   2024.3.1
gitdb                    4.0.11
GitPython                3.1.43
grpcio                   1.62.2
idna                     3.7
imageio                  2.34.1
Jinja2                   3.1.3
joblib                   1.4.0
kiwisolver               1.4.5
lazy_loader              0.4
Markdown                 3.6
MarkupSafe               2.1.5
matplotlib               3.8.4
mpmath                   1.3.0
networkx                 3.3
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
opencv-python            4.9.0.80
opencv-python-headless   4.9.0.80
packaging                24.0
pandas                   2.2.2
pillow                   10.3.0
pip                      22.0.2
protobuf                 5.26.1
psutil                   5.9.8
py-cpuinfo               9.0.0
pydantic                 2.7.1
pydantic_core            2.18.2
pyparsing                3.1.2
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
requests                 2.31.0
scikit-image             0.23.2
scikit-learn             1.4.2
scipy                    1.13.0
seaborn                  0.13.2
setuptools               69.5.1
six                      1.16.0
smmap                    5.0.1
sympy                    1.12
tensorboard              2.16.2
tensorboard-data-server  0.7.2
thop                     0.1.1.post2209072238
threadpoolctl            3.4.0
tifffile                 2024.4.24
torch                    2.3.0
torchvision              0.18.0
tqdm                     4.66.2
triton                   2.3.0
typing_extensions        4.11.0
tzdata                   2024.1
ultralytics              8.2.3
urllib3                  2.2.1
Werkzeug                 3.0.2
wheel                    0.43.0

Python 3.10.12

Additional

No response

DP1701 added the question (Further information is requested) label Apr 26, 2024
DP1701 (Author) commented Apr 26, 2024

OK, it seems that something is wrong with PyTorch v2.3. The error does not occur with v2.2.2.
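For reference, the active PyTorch version and the number of visible CUDA devices can be confirmed from the same environment with:

python -c "import torch; print(torch.__version__, torch.cuda.device_count())"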

glenn-jocher (Member) commented

Hey there! 😊 Thanks for pinpointing that the issue seems tied to PyTorch v2.3. Different versions of PyTorch can have unique behaviors or bugs that affect other software, including YOLOv5.

For now, sticking with PyTorch v2.2.2 where you're not encountering this error sounds like a solid workaround. It's always good practice to test different versions of dependencies if you run into issues. If anything else comes up or if you have further questions, feel free to ask! Your observations make a valuable contribution to the community. 👍
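A minimal sketch of the pin, assuming torchvision 0.17.2 is the release paired with torch 2.2.2 in the official compatibility matrix:

pip install torch==2.2.2 torchvision==0.17.2

After downgrading, rerunning python train.py --epochs 10 --device 0 should confirm whether single-device selection works again.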
