add a naive DDP for model interface #78

Merged
merged 23 commits from the DDP branch into main on Dec 19, 2022

Conversation

Laughing-q
Member

Laughing-q commented Dec 16, 2022

@AyushExel
usage:

from ultralytics.yolo.engine.model import YOLO
import torch

model = YOLO()
# model.new("yolov5n.yaml") # automatically detects task type
model.new("yolov5n-seg.yaml") # automatically detects task type
model.train(data="runs/balloon.yaml", epochs=50, device="0,1", exist_ok=True, batch_size=16)

If you want to update the CLI for DDP, it's probably better to commit that in this PR and complete the DDP support here.
BTW, do we have to keep the RANK=-1 case? I think we've talked about this before, but I forgot the result of the discussion.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improvements to distributed training and validation logic for better performance and stability.

📊 Key Changes

  • 🤝 Enhanced distributed training logic to use a subprocess with subprocess.Popen rather than mp.spawn.
  • 🏗️ Introduced new utility functions in dist.py for distributed training, including dynamic free port finding (see the sketch after this list).
  • 🛠️ Refactored how the DDP setup (_setup_ddp) is defined, changing environment variables.
  • 🔄 Adjusted data loader creation to consider single GPU cases explicitly in _setup_train.
  • 🧹 Cleaned up the temporary DDP launch file after use, ensuring no leftover files.
  • 📏 Standardized how IOU vectors are utilized across different files like val.py for detection and segmentation.
  • 🧮 Updated the internal logic for metrics computation to streamline the evaluation process.
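
For illustration, here is a minimal sketch of the Popen-based launch and the dynamic free-port lookup described above; the helper names (find_free_port, generate_ddp_command) and the file layout are assumptions, not necessarily the exact dist.py implementation.

# Minimal sketch, not the exact dist.py code: find a free port and build a DDP launch command.
import socket
import subprocess
import sys
import tempfile

def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused port for the DDP master.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def generate_ddp_command(world_size: int, trainer) -> list:
    # Write a small temporary launch script tagged with the trainer's id,
    # then build a torch.distributed.run command that runs it on world_size GPUs.
    launch_file = f"{tempfile.gettempdir()}/_temp_{id(trainer)}.py"
    with open(launch_file, "w") as f:
        f.write("# launch code for the subprocess goes here\n")
    return [sys.executable, "-m", "torch.distributed.run",
            f"--nproc_per_node={world_size}",
            f"--master_port={find_free_port()}",
            launch_file]

# The trainer would then start DDP training with subprocess.Popen(generate_ddp_command(world_size, trainer)).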

🎯 Purpose & Impact

  • 🔧 These changes aim to make distributed training setup more robust and less prone to errors due to fixed ports or file management issues.
  • 🚀 Users benefit from more efficient training across multiple GPUs, potentially leading to faster experimentation and development cycles.
  • 📈 The updates in the validation files will enhance the consistency and accuracy of the model evaluation stages, offering more reliable metrics for performance benchmarks.

@AyushExel
Contributor

@Laughing-q The cli will only throw an error in DDP mode, right?
And which rank=-1 case? Are you talking about this?
[Screenshot: 2022-12-16 at 3:41:34 PM]

@Laughing-q
Member Author

@AyushExel Since we check if rank in [0, -1], what is the -1 for? Can we just check if rank == 0?
[image]

@AyushExel
Contributor

@Laughing-q I think in yolov5 local_rank is set to -1 by default, so it's -1 for CPU and 0 for single GPU and the master process in DDP mode.
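
For context, a minimal sketch of the YOLOv5-style rank convention being discussed here, assuming the usual RANK environment variable set by the DDP launcher:

import os

# -1 means "not launched with DDP" (the launcher never set RANK);
# under DDP the launcher sets RANK per process, and rank 0 is the master.
RANK = int(os.getenv("RANK", -1))

if RANK in (-1, 0):
    # Only the plain (non-DDP) run or the DDP master does logging, checkpointing, validation, etc.
    print("main-process work happens here")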

@Laughing-q
Member Author

@AyushExel Can we use 0 for all of the CPU, single-GPU and master-process cases?

@Laughing-q
Member Author

@AyushExel Ok, the CI passes now. It was failing because I removed the metric_keys in val.py for detect/segment but forgot about classify.

@AyushExel
Contributor

AyushExel commented Dec 16, 2022

@AyushExel Can we use 0 for all of the CPU, single-GPU and master-process cases?

@Laughing-q I think I tried that, but for some reason I had to revert to the same v5 method. I can't really remember what it was. We can get to it after the cli DDP fix.
Can I test the subprocess method on my Mac by just changing the train function like this:

    def train(self):
        world_size = torch.cuda.device_count()
        os.environ["LOCAL_RANK"] = "-1"  # environment variable values must be strings
        command = generate_ddp_command(world_size)
        subprocess.Popen(command)

@AyushExel
Contributor

EDIT:
Ohh, there is an assert... I'll probably have to check on the server. I'm meeting with Glenn in 5 mins, so I'll try after that.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laughing-q <1185102784@qq.com>
@AyushExel
Contributor

@Laughing-q So now DDP is working as it's supposed to, and it's as fast as v5, right? Is there anything else needed here, except for the deletion or reordering of the temp file?

@Laughing-q
Member Author

@AyushExel yes, you can merge it when you think it's ok after your review. :)

@AyushExel
Contributor

@Laughing-q Okay, I think I'll handle the temp file deletion in this PR itself, otherwise it'll end up in the backlog.
So what do you think would be better here: clean up the temp file after every run, or just create an ultralytics/ dir for all temp files?

@Laughing-q
Member Author

@AyushExel I think cleaning it up would be better, since we already save everything we need in the save_dir, like models, args, etc.

@AyushExel
Contributor

@Laughing-q okay I'll attempt that

@AyushExel
Contributor

AyushExel commented Dec 17, 2022

@Laughing-q Okay, I've updated the temp-file creation logic so the generated file has the trainer object's id in its suffix and can be identified later for deletion. It worked on my non-DDP setup. Please check it on DDP, and then this should be good to go.
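
A rough sketch of that idea, assuming a cleanup helper along these lines (the actual name and signature in the PR may differ):

import os

def ddp_cleanup(command, trainer):
    # Remove the temporary launch file only if its name ends with this trainer's id,
    # so concurrent runs never delete each other's files.
    suffix = f"_temp_{id(trainer)}.py"
    for arg in command:
        if isinstance(arg, str) and arg.endswith(suffix) and os.path.exists(arg):
            os.remove(arg)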

@Laughing-q
Member Author

Laughing-q commented Dec 17, 2022

@AyushExel It works when I remove the pdb part. Is that part only for your debugging? Can I just remove it?
EDIT: ah, actually the temp file can't be found when launching DDP after I removed the pdb part, but the temp file is still there if I keep the pdb part, and my terminal just drops into pdb mode.

@AyushExel
Contributor

@Laughing-q Ohh yes, the pdb is for debugging. I forgot to remove it. I'll push a fix.

@AyushExel
Contributor

ah, actually the temp file can't be found when launching DDP after I removed the pdb part, but the temp file is still there if I keep the pdb part, and my terminal just drops into pdb mode.

Hmm.. that's strange

@Laughing-q
Member Author

@AyushExel Okay, but ddp_cleanup will remove the temp file before subprocess.Popen launches DDP...

@Laughing-q
Member Author

@AyushExel Maybe we should just remove the temp file at the end, after training finishes?

@AyushExel
Contributor

@Laughing-q Ohh, I see... it's a race condition.

@Laughing-q
Member Author

@AyushExel or maybe add a time.sleep before ddp_cleanup

@AyushExel
Contributor

@AyushExel Maybe we should just remove the temp file at the end, after training finishes?

Hmm, okay. How do we get a signal that training is done? Would sleeping for a couple of seconds work here?

@Laughing-q
Member Author

Laughing-q commented Dec 17, 2022

@AyushExel Yes, the time.sleep way works; we can take this as a temporary solution.

@Laughing-q
Member Author

@AyushExel Okay, I committed the time.sleep as a temporary solution. I'm going to sleep now, back tomorrow. :)

@AyushExel
Contributor

@Laughing-q Hmm, okay. I'm not an expert in handling race conditions, and I'm wondering if this could behave differently on different systems.
The problem here is that os.remove() gets executed before popen() accesses the file, right? If that's the case, then yeah, two seconds of sleep should solve the problem.
Is one of these things happening here (see the sketch below):

  • Once the DDP command starts running, the program's wait blocks, so os.remove() doesn't get executed, or
  • while the training is running, the os.remove() call waits for the file to become available for deletion.

If either of the above is happening, then I think it's fine. I just don't know which, or whether the behavior stays the same across all OSes.
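
To make the suspected ordering concrete, this is roughly how the launch path would look with the temporary sleep fix, reusing the helper names sketched earlier in this thread (illustrative only, not the exact trainer code):

import subprocess
import time
import torch

def train(self):
    world_size = torch.cuda.device_count()
    command = generate_ddp_command(world_size, self)  # writes the temp launch file
    subprocess.Popen(command)   # returns immediately; the child only reads the file a moment later
    time.sleep(5)               # temporary fix: give the child time to open the launch file
    ddp_cleanup(command, self)  # calls os.remove() on the temp file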

@AyushExel
Contributor

@AyushExel Okay, I committed the time.sleep as a temporary solution. I'm going to sleep now, back tomorrow. :)

Yeah, sounds good. It's too late for you. Let's chat tomorrow.

@Laughing-q
Member Author

@AyushExel I also think this could differ across OSes, and that's hard to handle. Actually, I didn't expect this to be a race condition. Maybe we should just add an ultralytics dir to keep all temp files, like /tmp/ultralytics on Linux. What do you think?
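
For reference, a minimal sketch of that alternative, with the path and helper name as assumptions:

import tempfile
from pathlib import Path

def make_launch_file_path(trainer) -> Path:
    # Keep all DDP launch files under one dedicated directory
    # (e.g. /tmp/ultralytics on Linux) instead of racing os.remove() after every run.
    tmp_dir = Path(tempfile.gettempdir()) / "ultralytics"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    return tmp_dir / f"_temp_{id(trainer)}.py"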

@AyushExel
Contributor

Yes, that would be the solution if this doesn't work well, but I'd like to stick with the current approach if possible because it cleans up after every execution.
I think if it works on Linux and Windows it's fine, because there is no CUDA on Mac.
Also, I need to confirm what's actually happening in the subprocess call. If it's the race condition (i.e. the file gets deleted before it can be accessed), then it's a simple race condition for which the sleep solution is fine.

@AyushExel
Contributor

@Laughing-q Okay, let's merge this now. Once we start testing it, if any problem occurs due to the race condition we can move to the temp-dir strategy.

@Laughing-q
Member Author

@AyushExel okay merging now

Laughing-q merged commit 7690cae into main on Dec 19, 2022
Laughing-q deleted the DDP branch on December 19, 2022 at 05:24
0iui0 pushed a commit to 0iui0/ultralytics that referenced this pull request Jan 3, 2024
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>