Process group init fails when training YOLOv8 after successful tuning [Databricks] [single node GPU] #13833
👋 Hello @lbeaucourt, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users, where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package:

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.
@lbeaucourt hi there,

Thank you for providing a detailed report and the minimal reproducible example. This is very helpful! 😊

The error you're encountering, "Default process group has not been initialized", indicates that a distributed (DDP) process group was expected but never created. Here are a few steps to help troubleshoot and resolve this issue:

Please try these steps and let us know if the issue persists. Your feedback is invaluable to us, and we appreciate your patience as we work to resolve this.
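As a first check, it can help to see whether the Databricks session already carries distributed-training state before calling model.train(). A minimal diagnostic sketch along those lines (the variable names below are assumptions about what a Spark/Databricks session may pre-set, not a confirmed list from the steps above):

```python
# Diagnostic sketch only: the listed variables are assumptions about what a
# Databricks/Spark session may pre-set, not a confirmed list.
import os

for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")
```

If some of these are already set, Ultralytics may assume it is running inside a DDP launch and expect a process group to exist.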
Hi @glenn-jocher, thank you very much for this clear reply! I tested your solution and it works fine, BUT only for model.train(). Let me explain a bit: if I set the environment variables BEFORE model.tune() as follows
Then tuning fails with the error: "Default process group has not been initialized, please make sure to call init_process_group". But if I keep the previous env variable settings for tuning and only change them before training, it works! So thanks for your answer, it solves my problem. I'm still not sure I understand why the behaviour differs between model.tune() and model.train(), but it's not a pain point. The final version of the code that works for me is:
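A minimal sketch of the pattern described above, assuming the environment variables in question are RANK and LOCAL_RANK (an assumption; the dataset file and argument values are likewise placeholders):

```python
# Sketch only -- RANK/LOCAL_RANK are assumed to be the variables in question,
# and dataset/argument values are placeholders.
import os
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Leave the distributed env vars at their session defaults for tuning.
model.tune(data="custom.yaml", epochs=10, iterations=20)

# Switch to single-process (non-DDP) values only before training.
os.environ["RANK"] = "-1"
os.environ["LOCAL_RANK"] = "-1"

model.train(data="custom.yaml", epochs=100)
```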
Hi @lbeaucourt,

Thank you for the detailed follow-up and for sharing your working solution! 😊 It's great to hear that the provided solution works for model.train(). For now, your approach of setting the environment variables before model.train() while leaving them at their defaults for model.tune() is a reasonable workaround.

Happy training! 🚀
Search before asking
YOLOv8 Component
Train
Bug
Environment
Ultralytics YOLOv8.2.35 🚀 Python-3.11.0rc1 torch-2.3.1+cu121 CUDA:0 (Tesla V100-PCIE-16GB, 16151MiB)
Setup complete ✅ (6 CPUs, 104.0 GB RAM, 42.4/250.9 GB disk)
Minimal Reproducible Example
Additional
Hello, I'm working on Databricks with a single node GPU cluster (Standard_NC6s_v3).
I am trying to re-train YOLOv8 on a custom dataset after adjusting the hyperparameters. I run all the commands in the same notebook/session. While the tuning works fine, the training raises an error related to the initialization of the process group (as far as I can see).
The error occurs right after the training data are scanned:
I've tried manually initialising the process group with torch.distributed.init_process_group('nccl'), but it doesn't work.
I don't understand how model.tune() (where the model is trained) can work successfully while model.train() fails with the same system configuration.
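A minimal sketch of the workflow being described (model weights, dataset file, and argument values are placeholders, not the exact reproduction code):

```python
# Sketch of the reported workflow -- model weights, dataset and argument values
# are placeholders for illustration.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Hyperparameter tuning in the same notebook/session completes without error.
model.tune(data="custom.yaml", epochs=10, iterations=20)

# Training in the same session fails with:
# "Default process group has not been initialized, please make sure to call init_process_group"
model.train(data="custom.yaml", epochs=100)
```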
Thanks.
Are you willing to submit a PR?