add a naive DDP for model interface #78

Merged
merged 23 commits from the DDP branch into main on Dec 19, 2022

Conversation

Laughing-q
Member

Laughing-q commented Dec 16, 2022

@AyushExel
usage:

from ultralytics.yolo.engine.model import YOLO
import torch

model = YOLO()
# model.new("yolov5n.yaml") # automatically detects task type
model.new("yolov5n-seg.yaml") # automatically detects task type
model.train(data="runs/balloon.yaml", epochs=50, device="0,1", exist_ok=True, batch_size=16)

If you want to update the CLI for DDP, it's probably better to commit that in this PR and complete the DDP support here.
BTW, do we have to keep the RANK=-1 case? I think we've talked about this before, but I forgot the result of the discussion.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improvements to distributed training and validation logic for better performance and stability.

📊 Key Changes

  • 🤝 Enhanced distributed training logic to use a subprocess with subprocess.Popen rather than mp.spawn.
  • 🏗️ Introduced new utility functions in dist.py for distributed training, including dynamic free port finding (see the sketch after this list).
  • 🛠️ Refactored how the DDP setup (_setup_ddp) is defined, changing environment variables.
  • 🔄 Adjusted data loader creation to consider single GPU cases explicitly in _setup_train.
  • 🧹 Cleaned up the temporary DDP launch file after use, ensuring no leftover files.
  • 📏 Standardized how IOU vectors are utilized across different files like val.py for detection and segmentation.
  • 🧮 Updated the internal logic for metrics computation to streamline the evaluation process.
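
For illustration, here is a minimal sketch of the Popen-based launch and the dynamic free-port lookup described above; the helper names (find_free_port, generate_ddp_command) and the file layout are assumptions, not necessarily the exact dist.py implementation.

# Minimal sketch, not the exact dist.py code: find a free port and build a DDP launch command.
import socket
import subprocess
import sys
import tempfile

def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused port for the DDP master.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def generate_ddp_command(world_size: int, trainer) -> list:
    # Write a small temporary launch script tagged with the trainer's id,
    # then build a torch.distributed.run command that runs it on world_size GPUs.
    launch_file = f"{tempfile.gettempdir()}/_temp_{id(trainer)}.py"
    with open(launch_file, "w") as f:
        f.write("# launch code for the subprocess goes here\n")
    return [sys.executable, "-m", "torch.distributed.run",
            f"--nproc_per_node={world_size}",
            f"--master_port={find_free_port()}",
            launch_file]

# The trainer would then start DDP training with subprocess.Popen(generate_ddp_command(world_size, trainer)).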

🎯 Purpose & Impact

  • 🔧 These changes aim to make distributed training setup more robust and less prone to errors due to fixed ports or file management issues.
  • 🚀 Users benefit from more efficient training across multiple GPUs, potentially leading to faster experimentation and development cycles.
  • 📈 The updates in the validation files will enhance the consistency and accuracy of the model evaluation stages, offering more reliable metrics for performance benchmarks.

@AyushExel
Contributor

@Laughing-q The cli will only throw an error in DDP mode, right?
And which rank=-1 case? Are you talking about this?
[Screenshot: 2022-12-16 at 3:41:34 PM]

@Laughing-q
Member Author

@AyushExel Since we check if rank in [0, -1], what is the -1 for? Can we just check if rank == 0?
[image]

@AyushExel
Contributor

@Laughing-q I think in yolov5 local_rank is set to -1 by default, so it's -1 for CPU and 0 for single GPU and the master process in DDP mode.
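
For context, a minimal sketch of the YOLOv5-style rank convention being discussed here, assuming the usual RANK environment variable set by the DDP launcher:

import os

# -1 means "not launched with DDP" (the launcher never set RANK);
# under DDP the launcher sets RANK per process, and rank 0 is the master.
RANK = int(os.getenv("RANK", -1))

if RANK in (-1, 0):
    # Only the plain (non-DDP) run or the DDP master does logging, checkpointing, validation, etc.
    print("main-process work happens here")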

@Laughing-q
Member Author

@AyushExel Can we use 0 for all of the CPU, single-GPU and master-process cases?

@Laughing-q
Member Author

@AyushExel Ok, the CI passes now. It was failing because I removed the metric_keys in val.py for detect/segment but forgot about classify.

@AyushExel
Contributor

AyushExel commented Dec 16, 2022

@AyushExel Can we use 0 for all of the CPU, single-GPU and master-process cases?

@Laughing-q I think I tried that, but for some reason I had to revert to the same v5 method. I can't really remember what it was. We can get to it after the cli DDP fix.
Can I test the subprocess method on my Mac by just changing the train function like this:

    def train(self):
        world_size = torch.cuda.device_count()
        os.environ["LOCAL_RANK"] = "-1"  # environment variable values must be strings
        command = generate_ddp_command(world_size)
        subprocess.Popen(command)

@AyushExel
Contributor

EDIT:
Ohh, there is an assert... I'll probably have to check on the server. I'm meeting with Glenn in 5 mins, so I'll try after that.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Laughing-q <1185102784@qq.com>
@AyushExel
Contributor

@Laughing-q So now DDP is working as it's supposed to, and it's as fast as v5, right? Is there anything else needed here, except for the deletion or reordering of the temp file?

@Laughing-q
Member Author

@AyushExel yes, you can merge it when you think it's ok after your review. :)

@AyushExel
Contributor

@Laughing-q Okay, I think I'll handle the temp file deletion in this PR itself, otherwise it'll end up in the backlog.
So what do you think would be better here: clean up the temp file after every run, or just create an ultralytics/ dir for all temp files?

@Laughing-q
Member Author

@AyushExel I think cleaning it up would be better, since we already save everything we need in the save_dir, like models, args, etc.

@AyushExel
Contributor

@Laughing-q okay I'll attempt that

@AyushExel
Contributor

AyushExel commented Dec 17, 2022

@Laughing-q Okay, I've updated the temp-file creation logic so the generated file has the trainer object's id in its suffix and can be identified later for deletion. It worked on my non-DDP setup. Please check it on DDP, and then this should be good to go.
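
A rough sketch of that idea, assuming a cleanup helper along these lines (the actual name and signature in the PR may differ):

import os

def ddp_cleanup(command, trainer):
    # Remove the temporary launch file only if its name ends with this trainer's id,
    # so concurrent runs never delete each other's files.
    suffix = f"_temp_{id(trainer)}.py"
    for arg in command:
        if isinstance(arg, str) and arg.endswith(suffix) and os.path.exists(arg):
            os.remove(arg)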

@Laughing-q
Member Author

Laughing-q commented Dec 17, 2022

@AyushExel It works when I remove the pdb part. Is that part only for your debugging? Can I just remove it?
EDIT: ah, actually the temp file can't be found when launching DDP after I removed the pdb part, but the temp file is still there if I keep the pdb part, and my terminal just drops into pdb mode.

@AyushExel
Contributor

@Laughing-q Ohh yes, the pdb is for debugging. I forgot to remove it. I'll push a fix.

@AyushExel
Contributor

ah, actually the temp file can't be found when launching DDP after I removed the pdb part, but the temp file is still there if I keep the pdb part, and my terminal just drops into pdb mode.

Hmm.. that's strange

@Laughing-q
Member Author

@AyushExel Okay, but ddp_cleanup will remove the temp file before subprocess.Popen launches DDP...

@Laughing-q
Member Author

@AyushExel Maybe we should just remove the temp file at the end, after training finishes?

@AyushExel
Contributor

@Laughing-q Ohh, I see... it's a race condition.

@Laughing-q
Member Author

@AyushExel or maybe add a time.sleep before ddp_cleanup

@AyushExel
Contributor

@AyushExel Maybe we should just remove the temp file at the end, after training finishes?

Hmm, okay. How do we get a signal that training is done? Would sleeping for a couple of seconds work here?

@Laughing-q
Member Author

Laughing-q commented Dec 17, 2022

@AyushExel Yes, the time.sleep way works; we can take this as a temporary solution.

@Laughing-q
Member Author

@AyushExel Okay, I committed the time.sleep as a temporary solution. I'm going to sleep now, back tomorrow. :)

@AyushExel
Contributor

@Laughing-q Hmm, okay. I'm not an expert in handling race conditions, and I'm wondering if this could behave differently on different systems.
The problem here is that os.remove() gets executed before popen() accesses the file, right? If that's the case, then yeah, two seconds of sleep should solve the problem.
Is one of these things happening here (see the sketch below):

  • Once the DDP command starts running, the program's wait blocks, so os.remove() doesn't get executed, or
  • while the training is running, the os.remove() call waits for the file to become available for deletion.

If either of the above is happening, then I think it's fine. I just don't know which, or whether the behavior stays the same across all OSes.
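
To make the suspected ordering concrete, this is roughly how the launch path would look with the temporary sleep fix, reusing the helper names sketched earlier in this thread (illustrative only, not the exact trainer code):

import subprocess
import time
import torch

def train(self):
    world_size = torch.cuda.device_count()
    command = generate_ddp_command(world_size, self)  # writes the temp launch file
    subprocess.Popen(command)   # returns immediately; the child only reads the file a moment later
    time.sleep(5)               # temporary fix: give the child time to open the launch file
    ddp_cleanup(command, self)  # calls os.remove() on the temp file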

@AyushExel
Contributor

@AyushExel Okay, I committed the time.sleep as a temporary solution. I'm going to sleep now, back tomorrow. :)

Yeah, sounds good. It's too late for you. Let's chat tomorrow.

@Laughing-q
Member Author

@AyushExel I also think this could differ across OSes, and that's hard to handle. Actually, I didn't expect this to be a race condition. Maybe we should just add an ultralytics dir to keep all temp files, like /tmp/ultralytics on Linux. What do you think?
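
For reference, a minimal sketch of that alternative, with the path and helper name as assumptions:

import tempfile
from pathlib import Path

def make_launch_file_path(trainer) -> Path:
    # Keep all DDP launch files under one dedicated directory
    # (e.g. /tmp/ultralytics on Linux) instead of racing os.remove() after every run.
    tmp_dir = Path(tempfile.gettempdir()) / "ultralytics"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    return tmp_dir / f"_temp_{id(trainer)}.py"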

@AyushExel
Contributor

Yes, that would be the solution if this doesn't work well, but I'd like to stick with the current approach if possible because it cleans up after every execution.
I think if it works on Linux and Windows it's fine, because there is no CUDA on Mac.
Also, I need to confirm what's actually happening in the subprocess call. If it's the race condition (i.e. the file gets deleted before it can be accessed), then it's a simple race condition for which the sleep solution is fine.

@AyushExel
Contributor

@Laughing-q Okay, let's merge this now. Once we start testing it, if any problem occurs due to the race condition we can move to the temp-dir strategy.

@Laughing-q
Member Author

@AyushExel okay merging now

Laughing-q merged commit 7690cae into main on Dec 19, 2022
Laughing-q deleted the DDP branch on December 19, 2022 at 05:24
0iui0 pushed a commit to 0iui0/ultralytics that referenced this pull request Jan 3, 2024
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>