add a naive DDP for model interface #78
Conversation
for more information, see https://pre-commit.ci
@Laughing-q The CLI will only throw an error in DDP mode, right?
@AyushExel As we check …
@Laughing-q I think in yolov5 local_rank is set to -1 by default, so it is -1 for CPU, and 0 for single GPU and for the master process in DDP mode.
@AyushExel Can we make it 0 for all of the CPU, single-GPU, and master-process cases?
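A minimal sketch of the LOCAL_RANK convention being discussed (these helper names are illustrative, not the actual ultralytics code): -1 means no DDP (CPU or a single non-distributed process), 0 is the DDP master process, and higher values are workers.

```python
# Illustrative sketch only; function names are assumptions.
import os

def get_local_rank() -> int:
    """Read LOCAL_RANK set by the DDP launcher; default to -1 (no DDP)."""
    return int(os.getenv("LOCAL_RANK", -1))

def is_master(rank: int) -> bool:
    # Both the non-DDP case (-1) and the DDP master (0) handle
    # logging, checkpoint saving, etc.
    return rank in (-1, 0)
```

Collapsing -1 and 0 into a single value, as suggested above, would simplify checks like `is_master`, at the cost of no longer being able to tell DDP mode apart from single-process mode.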
@AyushExel OK, now the CI passed; that's because I wanted to remove the …
@Laughing-q I think I tried that, but for some reason I had to revert to the same v5 method. Can't really remember what it was. We can get to it after the CLI DDP fix.

EDIT: …
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laughing-q <1185102784@qq.com>
@Laughing-q So now DDP is working as it's supposed to, and it is as fast as v5, right? Is there anything else needed here?
@AyushExel Yes, you can merge it when you think it's OK after your review. :)
@Laughing-q OK, I think I'll handle the temp file deletion in this PR itself; otherwise it'll get into the backlog.
@AyushExel I think cleaning it up would be better, as we'll save everything we need in the save_dir, like models, args, etc.
@Laughing-q OK, I'll attempt that.
@Laughing-q Okay, so I've updated the temp-file creation logic so that the temp file created has the trainer object id in its suffix, so it can be identified later for deletion. It worked on my non-DDP setup. Please check it on DDP, and then this should be good to go.
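The idea described above can be sketched like this (a hedged illustration, not the actual ultralytics implementation; function names and file contents are assumptions): embed `id(trainer)` in the temp file's suffix at creation time, then match on that same suffix when cleaning up.

```python
# Illustrative sketch; names and contents are assumptions.
import os
import tempfile

def create_temp_launch_file(trainer) -> str:
    """Write a temp launch file whose name embeds the trainer's id()."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=f"_{id(trainer)}.py", delete=False
    ) as f:
        f.write("# generated DDP launch stub\n")
        return f.name

def cleanup_temp_launch_file(trainer, path: str) -> None:
    """Delete only the file carrying this trainer's id in its suffix."""
    if path.endswith(f"_{id(trainer)}.py") and os.path.exists(path):
        os.remove(path)
```

Keying the filename on `id(trainer)` lets the cleanup step find exactly the file created for this training run without tracking extra state.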
@AyushExel It works when I remove the pdb part. Is that part only for you to debug? Can I just remove it?
@Laughing-q Ohh yes, the pdb is for debugging. I forgot to remove it. I'll push a fix.
Hmm... that's strange.
@AyushExel OK, but …
@AyushExel Maybe we just remove the temp file at the end, after we finish the training?
@Laughing-q Ohh I see... it's a race condition.
@AyushExel Or maybe add a time.sleep before ddp_cleanup?
Hmm, OK. How do we get a signal that training is done? Would sleeping for a couple of seconds work here?
@AyushExel Yes, the time.sleep way works; we can take this as a temp solution.
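The temporary workaround agreed on here can be sketched as follows (an illustration under assumptions: the signature and the delay value are not from the actual code). Sleeping before deletion gives worker processes that may still be reading the temp file a chance to finish, which papers over the race condition without truly synchronizing the processes.

```python
# Illustrative sketch of the time.sleep workaround; the delay value
# and function signature are assumptions, not the actual ultralytics code.
import os
import time

def ddp_cleanup(path: str, delay: float = 2.0) -> None:
    """Crude guard against the race: wait, then remove the temp file."""
    time.sleep(delay)
    if os.path.exists(path):
        os.remove(path)
```

A fixed sleep is inherently fragile (the "right" delay varies across machines and OSes), which is why the conversation below considers a temp-dir strategy as the fallback.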
@AyushExel OK, I committed the time.sleep as a temp solution. I'm going to sleep now; back tomorrow. :)
@Laughing-q Hmm, OK. I'm not an expert in handling race conditions. I'm wondering if this could behave differently on different systems.

Yeah, sounds good. It's too late for you; let's chat tomorrow.
@AyushExel I also think this can differ across OSes, and it's hard to handle across all of them. Actually, I didn't expect there would be race conditions. Maybe we should just add a …
Yes, that would be the solution if this doesn't work well, but I'd like to stick to the current approach if possible because it cleans up after every execution.
@Laughing-q OK, let's merge this now. Once we start testing it, if any problem occurs due to the race condition, we can then move to the temp-dir strategy.
@AyushExel OK, merging now.
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
@AyushExel
usage: …
If you want to update the CLI for DDP, it's probably better to commit it in this PR and complete the DDP?
BTW, do we have to keep the RANK=-1 case? I think we've talked about this, but I forgot the result of the discussion.

🛠️ PR Summary
Made with ❤️ by Ultralytics Actions

🌟 Summary
Improvements to distributed training and validation logic for better performance and stability.
📊 Key Changes
- DDP training is now launched via subprocess.Popen rather than mp.spawn.
- Added dist.py for distributed training, including dynamic free port finding.
- A DDP setup method (_setup_ddp) is defined, changing environment variables, used in _setup_train.
- Updated val.py for detection and segmentation.

🎯 Purpose & Impact
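The "dynamic free port finding" mentioned in the summary is commonly done by binding a socket to port 0 so the OS assigns a free port. The sketch below pairs that with a subprocess.Popen launch as the summary describes; the exact command line and function names are assumptions, not the PR's actual code.

```python
# Illustrative sketch; command line and names are assumptions.
import socket
import subprocess
import sys

def find_free_network_port() -> int:
    """Ask the OS for a free TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def launch_ddp(world_size: int, train_file: str) -> subprocess.Popen:
    """Launch DDP training as a subprocess instead of mp.spawn."""
    port = find_free_network_port()
    cmd = [
        sys.executable, "-m", "torch.distributed.run",
        f"--nproc_per_node={world_size}",
        f"--master_port={port}",
        train_file,
    ]
    return subprocess.Popen(cmd)
```

Using subprocess.Popen rather than mp.spawn keeps the launcher decoupled from the parent Python process, which is also what makes the temp-launch-file cleanup discussed above necessary.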