Closed
Labels
high priority, module: ddp, module: docs, oncall: distributed, triage review, triaged
Description
📚 The doc issue
https://discuss.pytorch.org/t/properly-implementing-ddp-in-training-loop-with-cleanup-barrier-and-its-expected-output/146465 mentions a couple of issues with the DDP tutorial that show up when its framework is repurposed to train a different model (in this case a CIFAR classifier):
- `dist.barrier()` causes a hang in the demo with checkpoint: https://github.com/pytorch/tutorials/blob/a1ad9ed50305e96597a1a5d3c3d3d565e881e27e/intermediate_source/ddp_tutorial.rst
- The model parallel demo causes a duplicate GPU issue, fixed by `set_device`.
The forum post linked above includes a repro script for both issues.
Suggest a potential alternative/fix
No response
cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @brianjo @mruberry