
Multi-GPU train #12925

Closed · 1 task done
Cho-Hong-Seok opened this issue on Apr 15, 2024 · 2 comments
Labels: question (Further information is requested), Stale

@Cho-Hong-Seok
Search before asking

Question

First of all, thank you for your always kind and detailed answers!

I'm trying to train the YOLOv6 segmentation model with multi-GPU and I'm getting an error; I don't know which part I need to fix.

train code

!python -m torch.distributed.launch --nproc_per_node 2 tools/train.py --batch 64 --conf configs/yolov6s_seg.py --epoch 150 --data ../FST1/data.yaml --device 0,1

error

/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] 
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] *****************************************
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] *****************************************
Traceback (most recent call last):
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 143, in <module>
    main(args)
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 116, in main
    cfg, device, args = check_and_init(args)
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 102, in check_and_init
    device = select_device(args.device)
  File "/media/HDD/조홍석/YOLOv6/yolov6/utils/envs.py", line 32, in select_device
    assert torch.cuda.is_available()
AssertionError
(both ranks printed the same traceback, interleaved)
[2024-04-15 19:19:48,319] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 29482) of binary: /home/dilab03/anaconda3/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-15_19:19:48
  host      : dilab03-Super-Server
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 29483)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-15_19:19:48
  host      : dilab03-Super-Server
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 29482)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Additional

No response

@Cho-Hong-Seok added the question label on Apr 15, 2024
@glenn-jocher (Member)

Hello! 😊

Thank you for reaching out and for your kind words. It looks like your training script encountered an issue because it couldn't find any available GPUs. This is typically indicated by the AssertionError in your stack trace when the script checks torch.cuda.is_available().

Here's what you can do to troubleshoot this issue:

  1. Ensure you are running your script on a machine with CUDA-capable GPUs.
  2. Verify that your PyTorch installation is correctly configured to use CUDA. You can test this by running import torch; print(torch.cuda.is_available()) in your Python environment; it should return True if CUDA is available (a minimal check is sketched after this list).
  3. Since torch.distributed.launch is deprecated, as the warning in your log notes, consider launching distributed training with the recommended torchrun instead, if your environment supports it (an example command is sketched further below).
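
If it helps, here is a minimal sanity check you can run in the same environment before launching training. It is only a sketch using standard PyTorch calls and assumes nothing about your YOLOv6 setup:

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against; None means a CPU-only build
print(torch.cuda.is_available())  # must be True before --device 0,1 can work
print(torch.cuda.device_count())  # should report 2 for a two-GPU run

If torch.version.cuda prints None, the AssertionError is expected: that environment has a CPU-only PyTorch build, and reinstalling PyTorch with CUDA support should resolve it. It is also worth checking that nvidia-smi lists both GPUs and that CUDA_VISIBLE_DEVICES is not hiding them.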

If after checking these points you still face issues, it might be helpful to double-check your PyTorch and CUDA setup or consider running a simpler script to ensure your setup can successfully utilize the GPUs.
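
As a sketch only (assuming the same paths and arguments as your original command, and that your version of tools/train.py reads LOCAL_RANK from the environment as the deprecation warning describes), the torchrun equivalent of your launch would look like:

!torchrun --nproc_per_node 2 tools/train.py --batch 64 --conf configs/yolov6s_seg.py --epoch 150 --data ../FST1/data.yaml --device 0,1

Note that switching the launcher will not by itself fix the AssertionError above; CUDA still has to be visible to PyTorch first.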

Happy coding! 🚀

@github-actions (Contributor)

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label on May 16, 2024
github-actions bot closed this as not planned on May 26, 2024