
Conversation

awgu
Contributor

@awgu awgu commented Jul 16, 2021

I wrote the tutorial as if the feature would be in PyTorch v1.10 (meaning the preceding underscores were removed). Here is the PDF of the render:
render.pdf
The bot rendered a preview.

This PR is meant as a draft to show what I have for the tutorial. It is not meant to actually be merged any time soon.

@netlify

netlify bot commented Jul 16, 2021

✔️ Deploy Preview for pytorch-tutorials-preview ready!

🔨 Explore the source changes: ca5cb34

🔍 Inspect the deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/610ac4aaecb14c0007fd5b2c

😎 Browse the preview: https://deploy-preview-1610--pytorch-tutorials-preview.netlify.app/intermediate/flask_rest_api_tutorial

Contributor

@rohan-varma rohan-varma left a comment


is it ready for review yet?

@awgu
Contributor Author

awgu commented Jul 31, 2021

is it ready for review yet?

Yes, I think the draft is ready for review.

@awgu awgu marked this pull request as ready for review July 31, 2021 00:51
@@ -0,0 +1,434 @@
Distributed Training with Uneven Inputs Using the Join Context Manager
Contributor


This page might fit better in the tutorials folder than in recipes (which are usually short and simple), as this one has quite a bit of content.

I would suggest adding this to https://pytorch.org/tutorials/advanced/ and adding a link to the overview page under the DDP section: https://pytorch.org/tutorials/beginner/dist_overview.html

Contributor

@mrshenli mrshenli left a comment


Looks great to me! Thanks for putting this together. The main comment is that it might be better to make the docstrings public first; otherwise, readers of this page might be confused about why the APIs aren't described anywhere.


In this recipe, you will see:

- An overview of the ``Join`` context manager.
Contributor


Shall we first make the Join API public (by removing the prefix _), and then convert Join here into a link pointing to the documentation page?

Contributor Author


Made API public and added link.
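
For readers following along, here is a minimal sketch of how the now-public context manager is driven, assuming the PyTorch 1.10 prototype API (`torch.distributed.algorithms.Join`) and a two-process CPU/gloo setup; it is a condensed illustration rather than the tutorial's exact example:

```
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.algorithms import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # Minimal single-machine process-group setup (gloo backend, CPU tensors).
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(1, 1))
    # Uneven inputs: rank 0 gets 5 inputs, rank 1 gets 6.
    inputs = [torch.tensor([1.0]) for _ in range(5 + rank)]

    # Once a rank runs out of inputs, Join shadows its collective
    # communications so that the other rank does not hang.
    with Join([model]):
        for inp in inputs:
            model(inp).sum().backward()

    print(f"Rank {rank} has exhausted all {len(inputs)} of its inputs!")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

With uneven inputs (5 vs. 6), rank 0 finishes first, and the context manager shadows the remaining collectives so neither rank hangs.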


As we saw in the previous examples, the constructor takes in a list of the
``Joinable`` s that participate in the training loop. These should be the
classes that perform collective communciations in each iteration.
Contributor


communciations -> communications

@@ -0,0 +1,434 @@
Distributed Training with Uneven Inputs Using the Join Context Manager
======================================================================

Contributor


Add your name here as the author, and you can link this to your home page.

.. note:: ``Join`` is introduced in PyTorch 1.10 as a prototype feature. This
API is subject to change.

In this recipe, you will see:
Contributor


If this is moved to the tutorial folder, let's replace "recipe" to "tutorial" as well.

Rank 1 has exhausted all 6 of its inputs!

.. note::
``DistributedDataParallel`` provided its own ``join()`` context manager
Contributor


Let's convert DistributedDataParallel and DDP's join() into hyperlinks as well.
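
For context, the DDP-specific context manager referenced in this snippet is driven roughly as follows; this is a sketch that reuses `model` and `inputs` from the example earlier in the thread, not the tutorial's exact code:

```
# Using DDP's built-in join() directly (the pre-1.10 approach), with `model`
# and `inputs` set up as in the earlier sketch:
with model.join():
    for inp in inputs:
        model(inp).sum().backward()
```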

``with model.join():``. One limitation of the existing
``DistributedDataParallel.join()`` is that it does not allow multiple
participating classes, e.g. ``DistributedDataParallel`` and
``ZeroRedundancyOptimizer`` together.
Contributor


Ditto: convert ZeroRedundancyOptimizer into a link as well.
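
The generic context manager lifts that limitation by accepting multiple participating classes at once. A hedged sketch under the same per-rank setup as the earlier example, assuming the public 1.10 `Join` and `ZeroRedundancyOptimizer` APIs:

```
import torch
from torch.distributed.algorithms import Join
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Inside the same per-rank worker as the earlier sketch:
model = DDP(torch.nn.Linear(1, 1))
optim = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.Adam, lr=0.01
)

# Both DDP and ZeRO perform per-iteration collectives, so both are passed to
# the single generic context manager.
with Join([model, optim]):
    for inp in inputs:  # `inputs` uneven across ranks, as before
        optim.zero_grad()
        model(inp).sum().backward()
        optim.step()
```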

context manager, let us delve deeper into how it works. This will provide
greater insight into the full capability it offers and prepare you to make
your own custom classes compatible. Here, we will go over the ``Join`` class as
well as the supporting classes ``Joinable`` and ``JoinHook``.
Contributor


Let's also make Joinable and JoinHook API docs public first.


Comment on lines 182 to 188
- ``join_device(self) -> torch.device``

This returns a device to be used by the ``Join`` context manager to perform
collective communications, e.g. ``torch.device("cuda:0")`` or
``torch.device("cpu")``.

- ``join_process_group(self) -> ProcessGroup``
Contributor


Shall we also explain why we need the device and process group objects?
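
One way to answer this in the tutorial: the context manager itself issues collective communications (for example, an all-reduce each iteration to count how many ranks have not yet joined), and it takes the device and process group for those collectives from these two attributes. Below is a toy `Joinable` sketch, assuming the public 1.10 abstract class; the `Counter` class is illustrative and not part of the library, and its join hook is sketched further down the thread:

```
import torch
import torch.distributed as dist
from torch.distributed.algorithms import Join, Joinable, JoinHook


class Counter(Joinable):
    """Toy Joinable: each __call__ all-reduces a count of one per rank."""

    def __init__(self, device, process_group):
        super().__init__()
        self.device = device
        self.process_group = process_group
        self.total = 0

    def __call__(self):
        # Tell the context manager that a per-iteration collective is starting.
        Join.notify_join_context(self)
        t = torch.ones(1, device=self.device)
        dist.all_reduce(t, group=self.process_group)
        self.total += int(t.item())

    def join_hook(self, **kwargs) -> JoinHook:
        # CounterJoinHook is sketched later in this thread.
        return CounterJoinHook(self)

    @property
    def join_device(self) -> torch.device:
        # Join runs its own collectives (e.g. the per-iteration all-reduce that
        # counts the ranks that have not yet joined), so it needs a device ...
        return self.device

    @property
    def join_process_group(self):
        # ... and a process group on which to run them.
        return self.process_group
```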

``ZeroRedundancyOptimizer`` main hook performs an optimizer step per normal
since the joined rank is still responsible for updating and synchronizing its
shard of the parameters, and the ``DistributedDataParallel`` post-hook
broadcasts the final updated model from one of the last joining ranks to ensure
Contributor


I lost some context here. Question: why would it be different if we skipped the final broadcast? Shouldn't the join hooks for DDP/ZeRO already make sure that the model params are in sync?

Contributor Author


I wanted to make the idea concrete by mentioning what the provided hooks for ZeroRedundancyOptimizer and DistributedDataParallel look like. Because ZeroRedundancyOptimizer does not use a post-hook and because the DistributedDataParallel main hook is quite complex, I thought it would be best to use ZeroRedundancyOptimizer to exemplify the main hook and DistributedDataParallel to exemplify the post-hook.

Perhaps, to make it clearer, I can add the word "provided":

To give concrete examples of what these hooks may look like, the provided
``ZeroRedundancyOptimizer`` main hook performs an optimizer step per normal
since the joined rank is still responsible for updating and synchronizing its
shard of the parameters, and the provided ``DistributedDataParallel`` post-hook
broadcasts the final updated model from one of the last joining ranks to ensure
that it is the same across all ranks.

I am not sure if we can skip the final broadcast in DistributedDataParallel's post-hook as a result of using ZeroRedundancyOptimizer. The reason is that I do not believe that module buffers are included in model.parameters(), so buffers would not be synced.

Code pointers:
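
Separately from the code pointers above, here is a toy join hook to make the main-hook/post-hook split concrete. It completes the `Counter` sketch from earlier in the thread; the hook bodies are illustrative paraphrases of the behavior discussed above, not the actual DDP/ZeRO implementations:

```
import torch
import torch.distributed as dist
from torch.distributed.algorithms import JoinHook


class CounterJoinHook(JoinHook):
    """Join hook for the toy Counter Joinable sketched earlier."""

    def __init__(self, counter):
        self.counter = counter

    def main_hook(self):
        # Runs once per iteration after this rank has joined: shadow the
        # all-reduce so the still-active ranks are not left hanging (loosely
        # analogous to the provided ZeRO main hook still stepping its shard).
        t = torch.zeros(1, device=self.counter.device)
        dist.all_reduce(t, group=self.counter.process_group)

    def post_hook(self, is_last_joiner: bool):
        # Runs once after all ranks have joined: propagate the final count from
        # one of the last joining ranks (loosely analogous to the provided DDP
        # post-hook broadcasting the final model, including buffers).
        t = torch.tensor(
            [float(self.counter.total) if is_last_joiner else 0.0],
            device=self.counter.device,
        )
        dist.all_reduce(t, op=dist.ReduceOp.MAX, group=self.counter.process_group)
        self.counter.total = int(t.item())
```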

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Aug 3, 2021
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](pytorch/tutorials#1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.

Pull Request resolved: #62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See #62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
Contributor

@mrshenli mrshenli left a comment


You might also need to add an index entry to this file so that this page shows up in the left navigation bar.

https://github.com/mrshenli/tutorials/blob/master/index.rst

Contributor

@mrshenli mrshenli left a comment


LGTM! Thank you!

@awgu awgu merged commit fb6a49d into pytorch:master Aug 4, 2021
rodrigo-techera pushed a commit to Experience-Monks/tutorials that referenced this pull request Nov 29, 2021