Add Tutorial for Generic Join Context Manager #1610
Conversation
✔️ Deploy Preview for pytorch-tutorials-preview ready!
🔨 Explore the source changes: ca5cb34
🔍 Inspect the deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/610ac4aaecb14c0007fd5b2c
😎 Browse the preview: https://deploy-preview-1610--pytorch-tutorials-preview.netlify.app/intermediate/flask_rest_api_tutorial
Is it ready for review yet?
Yes, I think the draft is ready for review.
@@ -0,0 +1,434 @@
Distributed Training with Uneven Inputs Using the Join Context Manager
This page might better fit in the tutorial folder than a recipe (which is usually short and simple), as this one has quite a bit of content.
I would suggest adding it to https://pytorch.org/tutorials/advanced/
and adding a link to the overview page under the DDP section: https://pytorch.org/tutorials/beginner/dist_overview.html
Looks great to me! Thanks for putting this together. The main comment was that it might be better to make the docstrings public first, otherwise readers of this page might be confused about why the APIs weren't described anywhere.
recipes_source/generic_join.rst
Outdated
In this recipe, you will see:

- An overview of the ``Join`` context manager.
Shall we first make the `Join` API public (by removing the `_` prefix), and then convert `Join` here into a link pointing to the documentation page?
Made API public and added link.
recipes_source/generic_join.rst
Outdated
As we saw in the previous examples, the constructor takes in a list of the
``Joinable`` s that participate in the training loop. These should be the
classes that perform collective communciations in each iteration.
communciations -> communications
@@ -0,0 +1,434 @@
Distributed Training with Uneven Inputs Using the Join Context Manager
======================================================================
Add your name here as the author, and you can link this to your home page.
recipes_source/generic_join.rst
Outdated
.. note:: ``Join`` is introduced in PyTorch 1.10 as a prototype feature. This
   API is subject to change.
In this recipe, you will see: |
If this is moved to the tutorial folder, let's replace "recipe" with "tutorial" as well.
recipes_source/generic_join.rst
Outdated
Rank 1 has exhausted all 6 of its inputs!

.. note::
    ``DistributedDataParallel`` provided its own ``join()`` context manager
Let's convert `DistributedDataParallel` and DDP's `join()` into hyperlinks as well.
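For context, here is a minimal sketch (not taken from the tutorial itself) of the `DistributedDataParallel.join()` usage that the quoted note refers to, assuming a two-process gloo setup on CPU:

```python
# Minimal sketch of DDP's built-in join() context manager with uneven inputs.
# Assumptions: CPU tensors, gloo backend, two processes spawned locally.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(1, 1))
    # Rank 0 gets 5 inputs and rank 1 gets 6; without join(), rank 0 would hang
    # waiting on rank 1's gradient all-reduce in the sixth iteration.
    inputs = [torch.tensor([1.0]) for _ in range(5 + rank)]

    with model.join():
        for inp in inputs:
            model(inp).sum().backward()

    print(f"Rank {rank} has exhausted all {5 + rank} of its inputs!")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```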
recipes_source/generic_join.rst
Outdated
``with model.join():``. One limitation of the existing
``DistributedDataParallel.join()`` is that it does not allow multiple
participating classes, e.g. ``DistributedDataParallel`` and
``ZeroRedundancyOptimizer`` together.
Ditto: convert `ZeroRedundancyOptimizer` into a link.
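And a minimal sketch (assuming the PyTorch 1.10 public `torch.distributed.algorithms.Join` API) of the combination that the generic context manager enables but `DistributedDataParallel.join()` alone does not, namely DDP together with `ZeroRedundancyOptimizer`:

```python
# Sketch of wrapping DDP and ZeroRedundancyOptimizer together under the generic
# Join context manager; process-group setup is the same as in the sketch above.
import torch
from torch.distributed.algorithms.join import Join
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, num_inputs):
    model = DDP(torch.nn.Linear(1, 1))
    optim = ZeroRedundancyOptimizer(
        model.parameters(), optimizer_class=torch.optim.SGD, lr=0.01
    )
    inputs = [torch.tensor([1.0]) for _ in range(num_inputs)]

    # Both Joinables are passed in a single list; ranks that exhaust their
    # inputs early shadow the collectives of the ranks still training.
    with Join([model, optim]):
        for inp in inputs:
            loss = model(inp).sum()
            loss.backward()
            optim.step()
            optim.zero_grad()
```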
context manager, let us delve deeper into how it works. This will provide a
greater insight into the full capability that it offers and prepare you to make
your own custom classes compatible. Here, we will go over the ``Join`` class as
well as the supporting classes ``Joinable`` and ``JoinHook``.
Let's also make Joinable and JoinHook API docs public first.
Done: pytorch/pytorch#62605
- ``join_device(self) -> torch.device``

  This returns a device to be used by the ``Join`` context manager to perform
  collective communications, e.g. ``torch.device("cuda:0")`` or
  ``torch.device("cpu")``.

- ``join_process_group(self) -> ProcessGroup`` |
Shall we also explain why we need the device and process group objects?
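For illustration, here is a hypothetical `Joinable` skeleton (not from the tutorial) showing where the two attributes fit: the `Join` context manager issues its own collectives, such as the per-iteration all-reduce that checks whether all ranks have joined, so each `Joinable` must tell it which device and process group to run them on.

```python
# Hypothetical Joinable skeleton; the class name and constructor arguments are
# illustrative, while join_hook/join_device/join_process_group are the real
# abstract members a Joinable must provide.
import torch
from torch.distributed.algorithms.join import Joinable, JoinHook

class MyJoinable(Joinable):
    def __init__(self, device, process_group):
        super().__init__()
        self._device = device                # e.g. torch.device("cuda:0") or torch.device("cpu")
        self._process_group = process_group  # group used for this class's collectives

    def join_hook(self, **kwargs) -> JoinHook:
        # Returns the hook that shadows this class's per-iteration collectives;
        # see the JoinHook sketch further below.
        raise NotImplementedError

    @property
    def join_device(self) -> torch.device:
        # Device on which the Join context manager performs its collectives.
        return self._device

    @property
    def join_process_group(self):
        # Process group over which those collectives are issued.
        return self._process_group
```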
``ZeroRedundancyOptimizer`` main hook performs an optimizer step per normal
since the joined rank is still responsible for updating and synchronizing its
shard of the parameters, and the ``DistributedDataParallel`` post-hook
broadcasts the final updated model from one of the last joining ranks to ensure
I lost some context here. Question: why would it be different if we skipped the final broadcast? The join hooks for DDP/ZeRO should already make sure that model params are in sync?
I wanted to make the idea concrete by mentioning what the provided hooks for `ZeroRedundancyOptimizer` and `DistributedDataParallel` look like. Because `ZeroRedundancyOptimizer` does not use a post-hook and because the `DistributedDataParallel` main hook is quite complex, I thought it would be best to use `ZeroRedundancyOptimizer` to exemplify the main hook and `DistributedDataParallel` to exemplify the post-hook.

Perhaps, to make it more clear, I can add the word "provided":
To give concrete examples of what these hooks may look like, the provided
``ZeroRedundancyOptimizer`` main hook performs an optimizer step per normal
since the joined rank is still responsible for updating and synchronizing its
shard of the parameters, and the provided ``DistributedDataParallel`` post-hook
broadcasts the final updated model from one of the last joining ranks to ensure
that it is the same across all ranks.
I am not sure if we can skip the final broadcast in `DistributedDataParallel`'s post-hook as a result of using `ZeroRedundancyOptimizer`. The reason is that I do not believe that module buffers are included in `model.parameters()`, so buffers would not be synced.
Code pointers:
- The final broadcast happens through `_sync_final_model()`.
- `_sync_final_model()` calls `_sync_params_and_buffers()`, which iterates over the model's `state_dict()`. This should include both parameters and buffers, as the method name suggests.
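To make the main-hook/post-hook split concrete, here is a toy `JoinHook` sketch (explicitly not the actual DDP or ZeRO hooks; the `buffer` attribute on the joinable is hypothetical):

```python
# Toy JoinHook sketch illustrating the division of labor: main_hook shadows the
# per-iteration collective on a joined rank, and post_hook runs once after all
# ranks have joined, e.g. to broadcast final state.
import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import JoinHook

class ToyJoinHook(JoinHook):
    def __init__(self, joinable):
        # `joinable` is assumed to expose a `buffer` tensor that non-joined
        # ranks all-reduce each iteration (a hypothetical attribute), plus the
        # standard join_device / join_process_group properties.
        self.joinable = joinable

    def main_hook(self):
        # Runs once per iteration while this rank is joined: contribute zeros
        # so the all-reduce issued by the still-training ranks does not hang.
        zeros = torch.zeros_like(self.joinable.buffer)
        dist.all_reduce(zeros, group=self.joinable.join_process_group)

    def post_hook(self, is_last_joiner: bool):
        # Runs once after all ranks have joined. Agree on a source rank (the
        # highest-ranked last joiner; assumes the default process group so that
        # group rank equals global rank) and broadcast the final state so every
        # rank ends up identical, analogous in spirit to DDP's final-model
        # broadcast discussed above.
        rank = dist.get_rank(self.joinable.join_process_group)
        src = torch.tensor(
            rank if is_last_joiner else -1, device=self.joinable.join_device
        )
        dist.all_reduce(src, op=dist.ReduceOp.MAX, group=self.joinable.join_process_group)
        dist.broadcast(
            self.joinable.buffer, src=int(src.item()),
            group=self.joinable.join_process_group,
        )
```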
Summary:

**Overview:** This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](pytorch/tutorials#1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.

Pull Request resolved: #62605

Test Plan:

`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```

`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```

NOTE: DDP overlap tests are failing due to a landing race. See #62592. Once the fix is landed, I will rebase, and tests should be passing.

`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```

Reviewed By: mrshenli

Differential Revision: D30055544

Pulled By: andwgu

fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
You might also need to add an index entry to this file so that this page shows in the left navigation bar.
LGTM! Thank you!
Add generic join tutorial
I wrote the tutorial as if the feature would be in PyTorch v1.10 (meaning the preceding underscores were removed).
Here is the PDF of the render: render.pdf (the bot also rendered a preview).
This PR is meant as a draft to show what I have for the tutorial. It is not meant to actually be merged any time soon.