
Cannot select specific CUDA device #12967

Open · 1 task done
DP1701 opened this issue Apr 26, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

DP1701 commented Apr 26, 2024

Search before asking

Question

Hello everyone,

Unfortunately, I cannot select a specific CUDA device. I run the following:

python train.py --epochs 10 --device 0

I get the following output:

train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
Traceback (most recent call last):
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 306, in _lazy_init
    queued_call()
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 174, in _check_capability
    capability = get_device_capability(d)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 448, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 848, in <module>
    main(opt)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 607, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/utils/torch_utils.py", line 134, in select_device
    p = torch.cuda.get_device_properties(i)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

CUDA call was originally invoked at:

  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 34, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/__init__.py", line 1478, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 238, in <module>
    _lazy_call(_check_capability)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 235, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

But if I simply omit --device 0, it works; however, all CUDA devices are then selected.
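As a workaround sketch (assuming the assert is triggered because the device visibility is changed after torch has already been imported, as YOLOv5's select_device does in-process), exporting CUDA_VISIBLE_DEVICES before launch restricts the whole process to one GPU:

CUDA_VISIBLE_DEVICES=0 python train.py --epochs 10 --device 0

With only GPU 0 visible, omitting --device should also fall back to that single device rather than all of them.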

I have installed the following pip packages:

Package                  Version
------------------------ --------------------
absl-py                  2.1.0
albumentations           1.4.4
annotated-types          0.6.0
certifi                  2024.2.2
charset-normalizer       3.3.2
contourpy                1.2.1
cycler                   0.12.1
filelock                 3.13.4
fonttools                4.51.0
fsspec                   2024.3.1
gitdb                    4.0.11
GitPython                3.1.43
grpcio                   1.62.2
idna                     3.7
imageio                  2.34.1
Jinja2                   3.1.3
joblib                   1.4.0
kiwisolver               1.4.5
lazy_loader              0.4
Markdown                 3.6
MarkupSafe               2.1.5
matplotlib               3.8.4
mpmath                   1.3.0
networkx                 3.3
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
opencv-python            4.9.0.80
opencv-python-headless   4.9.0.80
packaging                24.0
pandas                   2.2.2
pillow                   10.3.0
pip                      22.0.2
protobuf                 5.26.1
psutil                   5.9.8
py-cpuinfo               9.0.0
pydantic                 2.7.1
pydantic_core            2.18.2
pyparsing                3.1.2
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
requests                 2.31.0
scikit-image             0.23.2
scikit-learn             1.4.2
scipy                    1.13.0
seaborn                  0.13.2
setuptools               69.5.1
six                      1.16.0
smmap                    5.0.1
sympy                    1.12
tensorboard              2.16.2
tensorboard-data-server  0.7.2
thop                     0.1.1.post2209072238
threadpoolctl            3.4.0
tifffile                 2024.4.24
torch                    2.3.0
torchvision              0.18.0
tqdm                     4.66.2
triton                   2.3.0
typing_extensions        4.11.0
tzdata                   2024.1
ultralytics              8.2.3
urllib3                  2.2.1
Werkzeug                 3.0.2
wheel                    0.43.0

Python 3.10.12

Additional

No response

DP1701 added the question (Further information is requested) label Apr 26, 2024
DP1701 (Author) commented Apr 26, 2024

OK, it seems that something is wrong with PyTorch v2.3. The error does not occur with v2.2.2.
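For reference, the active PyTorch version and the number of visible CUDA devices can be confirmed from the same environment with:

python -c "import torch; print(torch.__version__, torch.cuda.device_count())"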

glenn-jocher (Member) commented

Hey there! 😊 Thanks for pinpointing that the issue seems tied to PyTorch v2.3. Different versions of PyTorch can have unique behaviors or bugs that affect other software, including YOLOv5.

For now, sticking with PyTorch v2.2.2 where you're not encountering this error sounds like a solid workaround. It's always good practice to test different versions of dependencies if you run into issues. If anything else comes up or if you have further questions, feel free to ask! Your observations make a valuable contribution to the community. 👍
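A minimal sketch of the pin, assuming torchvision 0.17.2 is the release paired with torch 2.2.2 in the official compatibility matrix:

pip install torch==2.2.2 torchvision==0.17.2

After downgrading, rerunning python train.py --epochs 10 --device 0 should confirm whether single-device selection works again.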
