Investigate and update DDP tutorials #74246

@rohan-varma

Description

📚 The doc issue

https://discuss.pytorch.org/t/properly-implementing-ddp-in-training-loop-with-cleanup-barrier-and-its-expected-output/146465 reports a couple of issues with the DDP tutorial when its code is repurposed to train a different model (in this case, a CIFAR classifier):

  1. dist.barrier() causes a hang in the checkpoint demo: https://github.com/pytorch/tutorials/blob/a1ad9ed50305e96597a1a5d3c3d3d565e881e27e/intermediate_source/ddp_tutorial.rst
  2. The model parallel demo causes a duplicate-GPU issue (multiple ranks end up on the same device); this is fixed by calling torch.cuda.set_device.

The linked forum post includes a repro script for both issues.
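The two fixes discussed above can be sketched together in one minimal checkpoint demo. This is an illustrative sketch, not the tutorial's exact code: the function name `run_checkpoint_demo` and the single-process smoke run at the bottom are assumptions for demonstration, and the gloo/CPU fallback is only there so the sketch runs on machines without GPUs.

```python
# Sketch of the recommended pattern: pin each rank to its own GPU with
# torch.cuda.set_device (issue 2), and load checkpoints with map_location so
# every rank does not deserialize rank 0's tensors onto cuda:0 (which is what
# leads to the dist.barrier() hang in issue 1).
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run_checkpoint_demo(rank: int, world_size: int) -> None:
    """One rank of a checkpoint demo with both fixes applied (hypothetical name)."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # NCCL on GPU machines; gloo keeps this sketch runnable on CPU.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    # Fix for issue 2: pin this process to its own device *before* building
    # the model, so all ranks do not default to GPU 0.
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
        device = torch.device("cuda", rank)
    else:
        device = torch.device("cpu")

    model = DDP(nn.Linear(10, 10).to(device))
    ckpt = os.path.join(tempfile.gettempdir(), "ddp_demo_ckpt.pt")

    if rank == 0:
        torch.save(model.state_dict(), ckpt)
    # All ranks wait until rank 0 has written the checkpoint.
    dist.barrier()

    # Fix for issue 1: remap rank 0's saved tensors onto this rank's device
    # instead of letting every rank load them onto cuda:0.
    map_location = (
        {"cuda:0": f"cuda:{rank}"} if torch.cuda.is_available() else "cpu"
    )
    model.load_state_dict(torch.load(ckpt, map_location=map_location))

    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    # Single-process smoke run; in practice, spawn one process per GPU with
    # torch.multiprocessing.spawn or launch via torchrun.
    run_checkpoint_demo(rank=0, world_size=1)
```

With `set_device` and `map_location` in place, every rank reaches the same barriers in the same order, which is the property the original repro violated.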

Suggest a potential alternative/fix

No response

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @brianjo @mruberry

Metadata

Labels

  * high priority
  * module: ddp — Issues/PRs related to distributed data parallel training
  * module: docs — Related to our documentation, both in docs/ and docblocks
  * oncall: distributed — Add this issue/PR to distributed oncall triage queue
  * triage review
  * triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
