Multigpu and multinode performance #12424
Comments
@unrue hi there! Thanks for reaching out. This is a known behavior due to the overhead of synchronizing across multiple GPUs and inter-node communication. YOLOv5 and its multi-GPU capability are actively optimized, with scaling improvements ongoing. For real-time updates, please see the training best practices in our documentation. If you have further questions or feedback, feel free to let us know. 🚀
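For reference, below is a minimal sketch of how a two-node DDP launch of `train.py` typically looks, with the launcher flag names taken from the public YOLOv5 Multi-GPU training guide; the hostname, port, and dataset config are placeholders, not values from this issue:

```python
# Sketch: launching YOLOv5 DDP training across two 4-GPU nodes via
# torch.distributed.run. Hostname, port, and dataset paths are placeholders.
import subprocess
import sys

launcher = [
    sys.executable, "-m", "torch.distributed.run",
    "--nproc_per_node", "4",      # GPUs per node
    "--nnodes", "2",              # total number of nodes in the job
    "--master_addr", "node01",    # placeholder: hostname of the rank-0 node
    "--master_port", "29500",     # placeholder: any free TCP port
]
train_cmd = [
    "train.py",
    "--batch", "64",              # total batch size, split across all GPUs
    "--data", "custom.yaml",      # placeholder dataset config
    "--weights", "yolov5s.pt",
    "--device", "0,1,2,3",
    "--epochs", "100",
]

# Run this on the first node; on the second node use --node_rank 1.
subprocess.run(launcher + ["--node_rank", "0"] + train_cmd, check=True)
```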
Thanks Glenn, so in fact, at the moment, multi-GPU multi-node on YOLO is not useful. In the above link, I don't see any tip to improve multi-GPU performance. I'm using YOLO on an HPC cluster with a lot of GPUs available. But if YOLO does not scale up, I'm limited to running on a single node :/ I'll follow future updates. Thanks.
@unrue you're welcome! I appreciate your understanding. Our team is actively working to enhance multi-GPU and multi-node performance, and we value your feedback in this process. Your support and patience mean a lot. If you have any more questions or run into any issues, feel free to ask. We're here to help!
Thanks Glenn. Apart from time performance, are there other reasons to enable multi-node in YOLO? More data processing?
@unrue Absolutely, multinode setups can certainly enable larger-scale data processing and model training when dealing with massive datasets and resource-intensive tasks. This can be especially beneficial for distributed data parallel training or for handling extremely large models. Keep an eye on our updates for improvements and new features in this area. If you have any more questions, feel free to ask. Good luck with your work! 🌟
Thanks Glenn, yes, I have another question. Suppose YOLO starts with 4 GPUs and 50 epochs. In a second test, YOLO runs with 8 GPUs; in that case, should the number of epochs be 25? Or do the epochs remain the same? I mean, should the number of epochs be scaled down as the number of GPUs grows, or does it remain constant? Thanks.
@unrue The number of epochs should remain constant regardless of the number of GPUs used. You do not need to resize the number of epochs when scaling up the number of GPUs. However, when increasing the number of GPUs, you may observe shorter wall-clock training time due to increased parallelism, since each epoch is spread across more workers. If you have any more questions or need further clarification, feel free to ask. Happy to help!
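To illustrate what typically changes (and what does not) when going from 4 to 8 GPUs under DDP: the epoch count stays fixed, the total batch is split across workers, and a common heuristic (the linear scaling rule) increases the learning rate in proportion to the total batch. The base values in this sketch are illustrative, not YOLOv5's internal defaults:

```python
# Sketch: epochs stay fixed when adding GPUs; only the per-GPU batch and
# (optionally, via the linear scaling heuristic) the learning rate change.
def ddp_schedule(total_batch: int, n_gpus: int,
                 base_lr: float = 0.01, base_batch: int = 64,
                 epochs: int = 100):
    assert total_batch % n_gpus == 0, "total batch must divide evenly across GPUs"
    per_gpu_batch = total_batch // n_gpus
    scaled_lr = base_lr * total_batch / base_batch   # linear scaling heuristic
    return {"epochs": epochs,             # unchanged regardless of GPU count
            "per_gpu_batch": per_gpu_batch,
            "lr": scaled_lr}

print(ddp_schedule(total_batch=64, n_gpus=4))    # 16 images per GPU
print(ddp_schedule(total_batch=128, n_gpus=8))   # 16 images per GPU, larger LR
```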
Do you already have an idea why YOLO does not scale? Where is the bottleneck?
@unrue The main bottleneck in scaling YOLOv5 across multiple GPUs and nodes is the communication and synchronization overhead between the GPUs. Our team is actively working to optimize and improve the scalability of YOLOv5, so keep an eye out for updates as we continue to address these challenges. Your feedback is invaluable as we work to enhance the multi-GPU and multi-node performance. If you have further questions or need assistance, feel free to ask. Thank you for your understanding and support!
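One rough way to check whether inter-GPU and inter-node communication really is the limiting factor is to benchmark raw NCCL all-reduce bandwidth separately from training. A minimal sketch follows; the payload size and iteration count are arbitrary choices, and it is meant to be launched with torchrun, one process per GPU:

```python
# Sketch: rough NCCL all-reduce bandwidth check, launched with torchrun.
# Low effective bandwidth across nodes would point at the interconnect
# as the cause of poor gradient-synchronization scaling.
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    n_elems = 25_000_000                      # ~100 MB of float32 (arbitrary)
    x = torch.randn(n_elems, device="cuda")

    # Warm-up so NCCL sets up its communicators before timing.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    gb_moved = x.numel() * x.element_size() * iters / 1e9
    if dist.get_rank() == 0:
        print(f"all_reduce: {gb_moved / elapsed:.1f} GB/s effective")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```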
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help. For additional resources and information, please see the links below:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
Search before asking
Question
I'm training YOLOv5 on a custom dataset on an HPC machine with 4 GPUs per node. I'm doing some performance tests using different numbers of GPUs. Each test runs for 100 epochs. The following are the time results:
The AP is more or less the same. I'm a bit confused. Why doesn't YOLOv5 training time scale with more GPUs? Each epoch should finish in less time with more GPUs, and so should the total execution time, right? Could someone explain this behaviour? Thanks.
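A quick way to quantify "does not scale" is scaling efficiency: the measured speedup divided by the ideal speedup. A minimal sketch follows; the timings are placeholder values, not actual measurements from this issue:

```python
# Sketch: speedup and scaling efficiency from wall-clock training times.
# All numbers below are hypothetical placeholders.
baseline_gpus, baseline_time = 1, 10.0    # hours for 100 epochs (placeholder)
measurements = {2: 6.0, 4: 4.0, 8: 3.5}   # gpus -> hours (placeholders)

for gpus, t in measurements.items():
    speedup = baseline_time / t
    efficiency = speedup / (gpus / baseline_gpus)   # 1.0 = perfect scaling
    print(f"{gpus} GPUs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```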
Additional
No response