Investigate and update DDP tutorials #74246

@rohan-varma

Description

📚 The doc issue

https://discuss.pytorch.org/t/properly-implementing-ddp-in-training-loop-with-cleanup-barrier-and-its-expected-output/146465 reports a couple of issues with the DDP tutorial when its code is repurposed to train a different model (in this case, a CIFAR classifier):

  1. dist.barrier() causes a hang in the checkpoint demo: https://github.com/pytorch/tutorials/blob/a1ad9ed50305e96597a1a5d3c3d3d565e881e27e/intermediate_source/ddp_tutorial.rst
  2. The model parallel demo causes a duplicate-GPU issue (multiple ranks end up on the same device); this is fixed by calling torch.cuda.set_device.

The linked forum post includes a repro script for both issues.
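The two fixes discussed above can be sketched together in one minimal checkpoint demo. This is an illustrative sketch, not the tutorial's exact code: the function name `run_checkpoint_demo` and the single-process smoke run at the bottom are assumptions for demonstration, and the gloo/CPU fallback is only there so the sketch runs on machines without GPUs.

```python
# Sketch of the recommended pattern: pin each rank to its own GPU with
# torch.cuda.set_device (issue 2), and load checkpoints with map_location so
# every rank does not deserialize rank 0's tensors onto cuda:0 (which is what
# leads to the dist.barrier() hang in issue 1).
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_checkpoint_demo(rank: int, world_size: int) -> None:
    """One rank of a checkpoint demo with both fixes applied (hypothetical name)."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # NCCL on GPU machines; gloo keeps this sketch runnable on CPU.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    # Fix for issue 2: pin this process to its own device *before* building
    # the model, so all ranks do not default to GPU 0.
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
        device = torch.device("cuda", rank)
    else:
        device = torch.device("cpu")

    model = DDP(nn.Linear(10, 10).to(device))
    ckpt = os.path.join(tempfile.gettempdir(), "ddp_demo_ckpt.pt")

    if rank == 0:
        torch.save(model.state_dict(), ckpt)
    # All ranks wait until rank 0 has written the checkpoint.
    dist.barrier()

    # Fix for issue 1: remap rank 0's saved tensors onto this rank's device
    # instead of letting every rank load them onto cuda:0.
    map_location = (
        {"cuda:0": f"cuda:{rank}"} if torch.cuda.is_available() else "cpu"
    )
    model.load_state_dict(torch.load(ckpt, map_location=map_location))

    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    # Single-process smoke run; in practice, spawn one process per GPU with
    # torch.multiprocessing.spawn or launch via torchrun.
    run_checkpoint_demo(rank=0, world_size=1)
```

With `set_device` and `map_location` in place, every rank reaches the same barriers in the same order, which is the property the original repro violated.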

Suggest a potential alternative/fix

No response

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @brianjo @mruberry

Metadata

Labels

  * high priority
  * module: ddp — Issues/PRs related to distributed data parallel training
  * module: docs — Related to our documentation, both in docs/ and docblocks
  * oncall: distributed — Add this issue/PR to distributed oncall triage queue
  * triage review
  * triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
