
Multigpu and multinode performance #12424

Closed
unrue opened this issue Nov 24, 2023 · 10 comments
Labels
question (Further information is requested), Stale (Stale and scheduled for closing soon)

Comments


unrue commented Nov 24, 2023

Search before asking

Question

I'm training YOLOv5 on a custom dataset on an HPC machine with 4 GPUs per node. I'm running some performance tests with different numbers of GPUs; each test runs for 100 epochs. The timing results are:

4 GPU: 2h, 30 min
8 GPU: 2h 21 min
16 GPU: 2h 25 min
32 GPU: 2h 46 min

The AP is more or less the same in all cases. I'm a bit confused: why doesn't YOLOv5 training time scale with more GPUs? Each epoch, and therefore the total execution time, should finish faster when using more GPUs, right? Could someone explain this behaviour? Thanks.
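For reference, a quick back-of-the-envelope comparison of these times against ideal strong scaling from the 4-GPU baseline (pure arithmetic on the numbers above, nothing YOLOv5-specific):

```python
# Observed wall-clock times from the runs above, in minutes.
observed_minutes = {4: 150, 8: 141, 16: 145, 32: 166}

baseline_gpus, baseline_min = 4, observed_minutes[4]
for gpus, minutes in observed_minutes.items():
    ideal = baseline_min * baseline_gpus / gpus   # perfect strong scaling
    efficiency = ideal / minutes                  # 1.0 would mean perfect scaling
    print(f"{gpus:>2} GPUs: observed {minutes} min, ideal {ideal:>5.0f} min, "
          f"efficiency {efficiency:.0%}")
```

Beyond 4 GPUs, almost none of the extra hardware translates into shorter wall-clock time.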

Additional

No response

unrue added the question (Further information is requested) label Nov 24, 2023
glenn-jocher (Member) commented:

@unrue hi there! Thanks for reaching out. This is a known behavior due to the overhead of synchronizing across multiple GPUs and inter-node communication. YOLOv5 and its multi-GPU capability are actively optimized, with scaling improvements ongoing. For real-time updates, please see the training best practices on our documentation. If you have further questions or feedback, feel free to let us know. 🚀
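If you want to see that synchronization cost directly on your cluster, one option is to time the gradient all-reduce in isolation. The sketch below is not part of YOLOv5; it is a minimal torch.distributed benchmark (the 7M-element tensor is an arbitrary stand-in for a small detector's gradients) that can be launched with the same torch.distributed.run command used for training:

```python
# allreduce_bench.py -- minimal all-reduce timing sketch (not YOLOv5 code).
# Launch, for example, with:
#   python -m torch.distributed.run --nproc_per_node 4 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")      # uses the env vars set by the launcher
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # ~7M fp32 elements; an arbitrary stand-in for a small model's gradients.
    grads = torch.randn(7_000_000, device="cuda")

    for _ in range(5):                           # warm-up
        dist.all_reduce(grads)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(50):
        dist.all_reduce(grads)
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) / 50 * 1e3

    if dist.get_rank() == 0:
        print(f"world_size={dist.get_world_size()}  all_reduce ~{elapsed_ms:.1f} ms/step")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Running it with 4, 8, 16 and 32 processes (across nodes, with --nnodes/--node_rank set accordingly) should show the per-step synchronization cost rising once communication crosses the node boundary, which is consistent with the flat wall-clock times you reported.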


unrue commented Nov 27, 2023

Thanks Glenn,

so, in fact, at the moment multi-GPU/multi-node YOLO doesn't help me. In the documentation linked above, I don't see any tips for improving multi-GPU performance. I'm using YOLO on an HPC cluster with a lot of GPUs available, but if YOLO doesn't scale up, I'm limited to running on a single node :/

I'll follow future updates. Thanks.

glenn-jocher (Member) commented:

@unrue you're welcome! I appreciate your understanding. Our team is actively working to enhance multi-GPU and multi-node performance, and we value your feedback in this process. Your support and patience mean a lot. If you have any more questions or run into any issues, feel free to ask. We're here to help!


unrue commented Nov 28, 2023

Thanks Glenn. Apart from time performance, are there other reasons to enable multi-node in YOLO? More data processing?

glenn-jocher (Member) commented:

@unrue Absolutely, multinode setups can certainly enable larger-scale data processing and model training when dealing with massive datasets and resource-intensive tasks. This can be especially beneficial for distributed data parallel training or for handling extremely large models. Keep an eye on our updates for improvements and new features in this area. If you have any more questions, feel free to ask. Good luck with your work! 🌟
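To make the data-parallel aspect concrete, here is a small generic PyTorch sketch (not YOLOv5 code, with made-up dataset and batch sizes) showing how a DistributedSampler shards a dataset: each rank only iterates over its own slice per epoch, which is where multi-GPU/multi-node throughput on large datasets comes from.

```python
# Generic PyTorch sketch (not YOLOv5 code): data-parallel training shards the
# dataset so each rank processes roughly len(dataset)/world_size samples per epoch.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10_000))   # 10k dummy samples

world_size = 8      # e.g. 2 nodes x 4 GPUs (hypothetical)
rank = 0            # this process's global rank

sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

print(len(sampler))  # 1250 -> samples this rank handles per epoch (10_000 / 8)
print(len(loader))   # 79   -> batches this rank runs per epoch (ceil(1250 / 16))
```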


unrue commented Nov 29, 2023

Thanks Glenn, yes, I have another question. Suppose YOLO trains with 4 GPUs for 50 epochs. In a second test, YOLO runs with 8 GPUs; in that case, should the number of epochs be 25, or should it stay the same? In other words, should the number of epochs be scaled down as the number of GPUs grows, or does it remain constant?

Thanks.

glenn-jocher (Member) commented:

@unrue The number of epochs should remain constant regardless of the number of GPUs used. You do not need to resize the number of epochs when scaling up the number of GPUs. However, when increasing the number of GPUs, you may observe faster convergence due to increased parallelism, potentially reducing training time. If you have any more questions or need further clarification, feel free to ask. Happy to help!
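As a back-of-the-envelope illustration (hypothetical dataset and batch sizes, not numbers from this issue): the epoch count stays fixed, but if the per-GPU batch is held constant while adding GPUs, each epoch needs fewer optimizer steps, which is where the per-epoch speedup is supposed to come from.

```python
import math

dataset_size = 20_000     # images (made-up number)
epochs = 50               # stays the same for every run
per_gpu_batch = 16        # held constant as GPUs are added

for gpus in (4, 8, 16, 32):
    global_batch = per_gpu_batch * gpus              # total batch split across GPUs
    steps_per_epoch = math.ceil(dataset_size / global_batch)
    print(f"{gpus:>2} GPUs: global batch {global_batch:>4}, "
          f"{steps_per_epoch:>4} steps/epoch x {epochs} epochs")
```

If those fewer, larger steps do not translate into shorter epochs in practice, the time is going into synchronization, data loading, or per-step overhead rather than compute.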


unrue commented Nov 29, 2023

Do you already have an idea of why YOLO does not scale? Where is the bottleneck?

glenn-jocher (Member) commented:

@unrue The main bottleneck in scaling YOLOv5 across multiple GPUs and nodes is the communication and synchronization overhead between the GPUs. Our team is actively working to optimize and improve the scalability of YOLOv5, so keep an eye out for updates as we continue to address these challenges. Your feedback is invaluable as we work to enhance the multi-GPU and multi-node performance. If you have further questions or need assistance, feel free to ask. Thank you for your understanding and support!
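One way to confirm where the time goes on a specific cluster is to profile a handful of training steps and compare compute, data loading, and communication. The sketch below is generic PyTorch with a tiny placeholder model and random data so it runs standalone; it is not YOLOv5 code. Inside a real multi-GPU run, you would look for NCCL all-reduce entries in the same table, and setting NCCL_DEBUG=INFO in the environment is a standard way to check which interconnect NCCL is actually using across nodes.

```python
# Generic profiling sketch (placeholder model/data, not YOLOv5 code).
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(10):                               # profile a few training steps
        x = torch.randn(8, 3, 64, 64, device=device)
        y = torch.randint(0, 2, (8,), device=device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))
```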

github-actions bot (Contributor) commented:

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale (Stale and scheduled for closing soon) label Dec 30, 2023
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 9, 2024