Don't use subclass when tracing and call wait_tensor immediately. #98001
Conversation
This change expects that proper scheduling of the wait_tensor call will happen over the traced graph.
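For orientation, here is a minimal sketch of what that amounts to, assuming the helpers visible in this diff (_are_we_tracing, wait_tensor, AsyncCollectiveTensor, _register_wrapper_tensor) are in scope; the real helper appears in the excerpt just below.

```python
import torch

def _maybe_wrap_tensor_sketch(tensor: torch.Tensor) -> torch.Tensor:
    # Sketch only: control flow simplified; assumes _are_we_tracing, wait_tensor,
    # AsyncCollectiveTensor and _register_wrapper_tensor from this file are in scope.
    if _are_we_tracing():
        # While tracing, skip the AsyncCollectiveTensor subclass entirely and
        # record wait_tensor in the graph right away; scheduling the wait is
        # left to whatever consumes the traced graph.
        return wait_tensor(tensor)
    # In eager mode, keep the lazy wrapper and register it so a later
    # wait_tensor call can find the outstanding work.
    res = AsyncCollectiveTensor(tensor)
    _register_wrapper_tensor(res, tensor)
    return res
```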
        return False
    return mode.tracer is not None


def _maybe_wrap_tensor(self):
this should also be easy to support in dynamo:
we just have to implement one special case, for _maybe_wrap_tensor, which always traces wait_tensor and never traces the guts of _maybe_wrap_tensor or _are_we_tracing.
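For illustration, a tiny torch.fx example of the graph shape that special case would aim for: the helper shows up as a single wait_tensor call, with none of its internals traced. The wait_tensor here is a local stand-in, not the real op.

```python
import torch
import torch.fx as fx

def wait_tensor(t):
    # Local stand-in for the functional-collectives wait_tensor op.
    return t

fx.wrap("wait_tensor")  # keep it as a leaf node instead of tracing through it

def maybe_wrap_as_seen_by_the_tracer(t):
    # What "always traces wait_tensor, never traces the guts" should produce.
    return wait_tensor(t)

gm = fx.symbolic_trace(maybe_wrap_as_seen_by_the_tracer)
print(gm.graph)  # a single call_function node targeting wait_tensor
```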
), f"input dimension 0 ({self.size(0)} must be a multiple of group_size {group_size}" | ||
tensor = torch._C._nn.reduce_scatter_tensor(self, reduceOp, scatter_dim, tag, rankset, group_size) # type: ignore[attr-defined] | ||
res = AsyncCollectiveTensor(tensor) | ||
_register_wrapper_tensor(res, tensor) |
i keep forgetting what the policy should be on calling _register_wrapper_tensor...
we need to put a better comment here to immortalize it...
my current thought is (see the sketch below):
- during eager we must register here
- during tracing we never register here, but we require that backends register when calling the actual collective
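One possible way to immortalize that policy is a docstring on the registration helper; the signature is assumed from the call sites in this diff and the wording is only a suggestion.

```python
def _register_wrapper_tensor(wrapper_tensor, inner_tensor):
    """Record that `inner_tensor` has an in-flight collective behind `wrapper_tensor`.

    Policy (as discussed in this review):
      * Eager: the collective entry points must call this, so that a later
        wait_tensor(inner_tensor) can find the outstanding work.
      * Tracing: entry points never call this; the compiling backend either
        registers when it lowers the collective, or skips registration if it
        also lowers the matching wait_tensor to work.wait() in the same graph.
    """
    ...  # sketch only; the real implementation lives in _functional_collectives.py
```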
Backends can optionally skip the _register_wrapper_tensor call if they emit a wait_tensor.
hmmm can they? i thought wait_tensor would call into this code https://github.com/pytorch/pytorch/blob/master/torch/distributed/_functional_collectives.py#L96
which will find that nothing has been registered, UNLESS the backend did the registration first
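A hedged sketch of the lookup being described; the dict and helper names here are illustrative, not the actual internals behind that link.

```python
import torch

# Illustrative registry: maps a tensor's data_ptr to its in-flight Work handle.
_outstanding_work: dict = {}

def _register_work(tensor: torch.Tensor, work) -> None:
    # What "the backend did the registration first" would amount to.
    _outstanding_work[tensor.data_ptr()] = work

def wait_tensor(tensor: torch.Tensor) -> torch.Tensor:
    # If nothing was ever registered for this tensor (e.g. the backend emitted
    # its own wait instead), the lookup misses and this is effectively a no-op.
    work = _outstanding_work.pop(tensor.data_ptr(), None)
    if work is not None:
        work.wait()
    return tensor
```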
well, this is complicated.
in inductor, I made the lowered collectives call register and made the lowered wait_tensor call the eager wait_tensor.
- this way, if the wait happens in a separate inductor graph or in eager, it still works
but if your backend knows that the allreduce and its wait will land in the same graph, it can emit more efficient code that lowers the collective to work = dist.<collective> and lowers wait_tensor to work.wait(), which skips the need for ever registering.
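As a plain-c10d illustration (not Inductor's actual codegen, and assuming an initialized process group), the collective-and-wait-in-the-same-graph pattern corresponds to roughly:

```python
import torch
import torch.distributed as dist

def fused_allreduce_then_wait(t: torch.Tensor) -> torch.Tensor:
    # Lowered collective: keep the async Work handle instead of registering it.
    work = dist.all_reduce(t, op=dist.ReduceOp.SUM, async_op=True)
    # ...independent compute could be scheduled here to overlap with comms...
    # Lowered wait_tensor: wait directly on the handle.
    work.wait()
    return t
```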
i guess the risk/concern is that if you aren't careful in backend design, the wait_tensor op can become a silent no-op
yeah, we can look into this sort of optimization later.
General thoughts:
(is 2 fully true? and what is the status of the passes we'll need for wait placement? cc @lessw2020 @fegin)
Can we add some unit tests to guard the behavior and make sure it works? You can take the tests I wrote in #97945
Re: 2 - correct, it doesn't matter where the waits start per se; we are going to roll them up with the comm fusion pass and use only the last wait for each fused section (the rest will be removed with DCE). Re: status - the fusion pass has been working for some time and is currently the pass that modifies the waits.
@pytorchmergebot merge
    mode = get_innermost_proxy_mode()
    if mode is None:
        return False
    return mode.tracer is not None
Did you intentionally check mode.tracer? It seems to me that this is guaranteed to be not None