
Conversation

awgu
Contributor

@awgu awgu commented Jul 16, 2021

I wrote the tutorial as if the feature would be in PyTorch v1.10 (meaning the preceding underscores were removed). Here is the PDF of the render:
render.pdf
The bot rendered a preview.

This PR is meant as a draft to show what I have for the tutorial. It is not meant to actually be merged any time soon.

@netlify

netlify bot commented Jul 16, 2021

✔️ Deploy Preview for pytorch-tutorials-preview ready!

🔨 Explore the source changes: ca5cb34

🔍 Inspect the deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/610ac4aaecb14c0007fd5b2c

😎 Browse the preview: https://deploy-preview-1610--pytorch-tutorials-preview.netlify.app/intermediate/flask_rest_api_tutorial

Contributor

@rohan-varma rohan-varma left a comment


is it ready for review yet?

@awgu
Contributor Author

awgu commented Jul 31, 2021

is it ready for review yet?

Yes, I think the draft is ready for review.

@awgu awgu marked this pull request as ready for review July 31, 2021 00:51
@@ -0,0 +1,434 @@
Distributed Training with Uneven Inputs Using the Join Context Manager
Contributor


This page might fit better in the tutorials folder than in recipes (which are usually short and simple), as this one has quite a bit of content.

I would suggest adding this to https://pytorch.org/tutorials/advanced/ and adding a link to the overview page under the DDP section: https://pytorch.org/tutorials/beginner/dist_overview.html

Contributor

@mrshenli mrshenli left a comment


Looks great to me! Thanks for putting this together. The main comment is that it might be better to make the docstrings public first; otherwise, readers of this page might be confused about why the APIs aren't described anywhere.


In this recipe, you will see:

- An overview of the ``Join`` context manager.
Contributor


Shall we first make the Join API public (by removing the prefix _), and then convert Join here into a link pointing to the documentation page?

Contributor Author


Made API public and added link.
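
For readers following along, here is a minimal sketch of how the now-public context manager is driven, assuming the PyTorch 1.10 prototype API (`torch.distributed.algorithms.Join`) and a two-process CPU/gloo setup; it is a condensed illustration rather than the tutorial's exact example:

```
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.algorithms import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # Minimal single-machine process-group setup (gloo backend, CPU tensors).
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(1, 1))
    # Uneven inputs: rank 0 gets 5 inputs, rank 1 gets 6.
    inputs = [torch.tensor([1.0]) for _ in range(5 + rank)]

    # Once a rank runs out of inputs, Join shadows its collective
    # communications so that the other rank does not hang.
    with Join([model]):
        for inp in inputs:
            model(inp).sum().backward()

    print(f"Rank {rank} has exhausted all {len(inputs)} of its inputs!")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

With uneven inputs (5 vs. 6), rank 0 finishes first, and the context manager shadows the remaining collectives so neither rank hangs.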


As we saw in the previous examples, the constructor takes in a list of the
``Joinable`` s that participate in the training loop. These should be the
classes that perform collective communciations in each iteration.
Contributor


communciations -> communications

@@ -0,0 +1,434 @@
Distributed Training with Uneven Inputs Using the Join Context Manager
======================================================================

Contributor


Add your name here as the author, and you can link this to your home page.

.. note:: ``Join`` is introduced in PyTorch 1.10 as a prototype feature. This
API is subject to change.

In this recipe, you will see:
Contributor


If this is moved to the tutorial folder, let's replace "recipe" to "tutorial" as well.

Rank 1 has exhausted all 6 of its inputs!

.. note::
``DistributedDataParallel`` provided its own ``join()`` context manager
Contributor


Let's convert DistributedDataParallel and DDP's join() into hyperlinks as well.
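
For context, the DDP-specific context manager referenced in this snippet is driven roughly as follows; this is a sketch that reuses `model` and `inputs` from the example earlier in the thread, not the tutorial's exact code:

```
# Using DDP's built-in join() directly (the pre-1.10 approach), with `model`
# and `inputs` set up as in the earlier sketch:
with model.join():
    for inp in inputs:
        model(inp).sum().backward()
```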

``with model.join():``. One limitation of the existing
``DistributedDataParallel.join()`` is that it does not allow multiple
participating classes, e.g. ``DistributedDataParallel`` and
``ZeroRedundancyOptimizer`` together.
Contributor


Ditto: convert ZeroRedundancyOptimizer into a link as well.
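
The generic context manager lifts that limitation by accepting multiple participating classes at once. A hedged sketch under the same per-rank setup as the earlier example, assuming the public 1.10 `Join` and `ZeroRedundancyOptimizer` APIs:

```
import torch
from torch.distributed.algorithms import Join
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Inside the same per-rank worker as the earlier sketch:
model = DDP(torch.nn.Linear(1, 1))
optim = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.Adam, lr=0.01
)

# Both DDP and ZeRO perform per-iteration collectives, so both are passed to
# the single generic context manager.
with Join([model, optim]):
    for inp in inputs:  # `inputs` uneven across ranks, as before
        optim.zero_grad()
        model(inp).sum().backward()
        optim.step()
```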

context manager, let us delve deeper into how it works. This will provide
greater insight into the full capability it offers and prepare you to make
your own custom classes compatible. Here, we will go over the ``Join`` class as
well as the supporting classes ``Joinable`` and ``JoinHook``.
Contributor


Let's also make Joinable and JoinHook API docs public first.


Comment on lines 182 to 188
- ``join_device(self) -> torch.device``

This returns a device to be used by the ``Join`` context manager to perform
collective communications, e.g. ``torch.device("cuda:0")`` or
``torch.device("cpu")``.

- ``join_process_group(self) -> ProcessGroup``
Contributor


Shall we also explain why we need the device and process group objects?
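
One way to answer this in the tutorial: the context manager itself issues collective communications (for example, an all-reduce each iteration to count how many ranks have not yet joined), and it takes the device and process group for those collectives from these two attributes. Below is a toy `Joinable` sketch, assuming the public 1.10 abstract class; the `Counter` class is illustrative and not part of the library, and its join hook is sketched further down the thread:

```
import torch
import torch.distributed as dist
from torch.distributed.algorithms import Join, Joinable, JoinHook


class Counter(Joinable):
    """Toy Joinable: each __call__ all-reduces a count of one per rank."""

    def __init__(self, device, process_group):
        super().__init__()
        self.device = device
        self.process_group = process_group
        self.total = 0

    def __call__(self):
        # Tell the context manager that a per-iteration collective is starting.
        Join.notify_join_context(self)
        t = torch.ones(1, device=self.device)
        dist.all_reduce(t, group=self.process_group)
        self.total += int(t.item())

    def join_hook(self, **kwargs) -> JoinHook:
        # CounterJoinHook is sketched later in this thread.
        return CounterJoinHook(self)

    @property
    def join_device(self) -> torch.device:
        # Join runs its own collectives (e.g. the per-iteration all-reduce that
        # counts the ranks that have not yet joined), so it needs a device ...
        return self.device

    @property
    def join_process_group(self):
        # ... and a process group on which to run them.
        return self.process_group
```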

``ZeroRedundancyOptimizer`` main hook performs an optimizer step per normal
since the joined rank is still responsible for updating and synchronizing its
shard of the parameters, and the ``DistributedDataParallel`` post-hook
broadcasts the final updated model from one of the last joining ranks to ensure
Contributor


I lost some context here. Question: why would it be different if we skipped the final broadcast? Shouldn't the join hooks for DDP/ZeRO already make sure that the model params are in sync?

Contributor Author


I wanted to make the idea concrete by mentioning what the provided hooks for ZeroRedundancyOptimizer and DistributedDataParallel look like. Because ZeroRedundancyOptimizer does not use a post-hook and because the DistributedDataParallel main hook is quite complex, I thought it would be best to use ZeroRedundancyOptimizer to exemplify the main hook and DistributedDataParallel to exemplify the post-hook.

Perhaps, to make it clearer, I can add the word "provided":

To give concrete examples of what these hooks may look like, the provided
``ZeroRedundancyOptimizer`` main hook performs an optimizer step per normal
since the joined rank is still responsible for updating and synchronizing its
shard of the parameters, and the provided ``DistributedDataParallel`` post-hook
broadcasts the final updated model from one of the last joining ranks to ensure
that it is the same across all ranks.

I am not sure if we can skip the final broadcast in DistributedDataParallel's post-hook as a result of using ZeroRedundancyOptimizer. The reason is that I do not believe that module buffers are included in model.parameters(), so buffers would not be synced.

Code pointers:
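
Separately from the code pointers above, here is a toy join hook to make the main-hook/post-hook split concrete. It completes the `Counter` sketch from earlier in the thread; the hook bodies are illustrative paraphrases of the behavior discussed above, not the actual DDP/ZeRO implementations:

```
import torch
import torch.distributed as dist
from torch.distributed.algorithms import JoinHook


class CounterJoinHook(JoinHook):
    """Join hook for the toy Counter Joinable sketched earlier."""

    def __init__(self, counter):
        self.counter = counter

    def main_hook(self):
        # Runs once per iteration after this rank has joined: shadow the
        # all-reduce so the still-active ranks are not left hanging (loosely
        # analogous to the provided ZeRO main hook still stepping its shard).
        t = torch.zeros(1, device=self.counter.device)
        dist.all_reduce(t, group=self.counter.process_group)

    def post_hook(self, is_last_joiner: bool):
        # Runs once after all ranks have joined: propagate the final count from
        # one of the last joining ranks (loosely analogous to the provided DDP
        # post-hook broadcasting the final model, including buffers).
        t = torch.tensor(
            [float(self.counter.total) if is_last_joiner else 0.0],
            device=self.counter.device,
        )
        dist.all_reduce(t, op=dist.ReduceOp.MAX, group=self.counter.process_group)
        self.counter.total = int(t.item())
```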

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Aug 3, 2021
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](pytorch/tutorials#1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.

Pull Request resolved: #62605

Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See #62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
Contributor

@mrshenli mrshenli left a comment


You might also need to add an index entry to this file so that this page shows up in the left navigation bar.

https://github.com/mrshenli/tutorials/blob/master/index.rst

Contributor

@mrshenli mrshenli left a comment


LGTM! Thank you!

@awgu awgu merged commit fb6a49d into pytorch:master Aug 4, 2021
rodrigo-techera pushed a commit to Experience-Monks/tutorials that referenced this pull request Nov 29, 2021