Fix torch.distributed._functional_collectives.AsyncCollectiveTensor for aten.to. #134661

PHLens · 2024-08-28T07:38:35Z

Fixes #133421

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot · 2024-08-28T07:38:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134661

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c4425cc with merge base b336d72:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

tianyu-l · 2024-08-29T18:21:23Z

@bdhirsh can you help review?

bdhirsh · 2024-09-04T20:06:51Z

torch/distributed/_functional_collectives.py

@@ -432,6 +432,9 @@ def reduce_scatter_tensor_coalesced(
 # Today, this maps 1:1 with "aten ops that are views".
 def _is_view_op(tgt):
    assert isinstance(tgt, torch._ops.OpOverload)
+    # Special case for `aten.to`. See issue: https://github.com/pytorch/pytorch/issues/133421
+    if "to" in tgt.__name__.split('.'):
+        return False


Hmm, this feels like a bit of a band-aid fix (although I agree it solves the aten.to problem).

The issue is that we have several composite operations that are "maybe-aliasing", and therefore can lie about their schemas. aten.to.device is an example: it may-or-may-not alias input, depending on whether the device matches the input tensor, but its schema reports as always aliasing.
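To make the "maybe-aliasing" point concrete, here is a minimal sketch. It uses the dtype overload of `to` on CPU as a stand-in for `aten.to.device` (an assumption made only so the snippet runs without a second device); the variable names are illustrative.

```python
import torch

x = torch.ones(4, dtype=torch.float32)

# Same dtype and device: `to` hands back the input tensor itself (aliasing).
same = x.to(torch.float32)

# Different dtype: `to` has to allocate and copy, so the result does not alias.
cast = x.to(torch.float64)

aliases_same = same is x
aliases_cast = cast.data_ptr() == x.data_ptr()

# The schema nevertheless annotates the first argument as aliased: Tensor(a).
schema_says_alias = (
    torch.ops.aten.to.dtype._schema.arguments[0].alias_info is not None
)

print(aliases_same, aliases_cast, schema_says_alias)
```

A check that only reads the schema would therefore treat both calls as views, even though the second one produces fresh storage.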

Ordinarily, these ops will always decompose before we get to __torch_dispatch__. But under inference_mode, these ops can show up directly in __torch_dispatch__.

The way we've generally dealt with this for other subclasses is to force them to decompose these composite ops, by adding this code:

    # this will attempt to run the eager-mode decomposition if one exists,
    # and return NotImplemented otherwise
    r = func.decompose(*args, **kwargs)
    if r is not NotImplemented:
        return r
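As a rough illustration of what `decompose` returns, the sketch below calls it directly on two overloads outside of any subclass. The choice of `aten.add.Tensor` and `aten.to.dtype` is an assumption made for the example; inside a real subclass the call would sit at the top of `__torch_dispatch__`, as in the snippet above.

```python
import torch

x = torch.ones(2)

# aten.add.Tensor has its own backend kernels, so there is nothing to
# decompose and the call signals that with NotImplemented.
no_decomp = torch.ops.aten.add.Tensor.decompose(x, x)

# aten.to.dtype is a composite op, so decompose() runs its eager-mode
# decomposition and returns the converted tensor.
decomposed = torch.ops.aten.to.dtype.decompose(x, torch.float64)

print(no_decomp is NotImplemented, decomposed.dtype)
```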

Although cc @weifengpy, I'm curious what you think, since this could have an impact on eager performance (we'd probably have to measure it)

However, given that AsyncCollectiveTensor doesn't really do much other than branch on view ops, another strategy that is more general than this PR, but might be less risky for perf, would be to have AsyncCollectiveTensor assume that none of these composite ops are views (worst case, it does an early sync when it doesn't have to, but most view ops are not composite anyway).

You can do it like this:

    # don't apply the view optimization to any `CompositeImplicitAutograd` ops
    if torch._C._dispatch_has_kernel_for_dispatch_key(func.name(), DispatchKey.CompositeImplicitAutograd):
        return False
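Put together with the schema check from the diff hunk above, the suggestion could look roughly like the following. This is a sketch under assumptions, not the merged code: the function name `is_view_op_sketch` is hypothetical, and the schema-based part is inferred from the "maps 1:1 with aten ops that are views" comment in the hunk.

```python
import torch


def is_view_op_sketch(tgt: torch._ops.OpOverload) -> bool:
    """Hypothetical stand-in for _is_view_op with the composite-op guard."""
    assert isinstance(tgt, torch._ops.OpOverload)
    # Composite ops may or may not alias their input, so never treat them
    # as views; worst case this only causes an earlier sync than necessary.
    if torch._C._dispatch_has_kernel_for_dispatch_key(
        tgt.name(), torch._C.DispatchKey.CompositeImplicitAutograd
    ):
        return False
    schema = tgt._schema
    if len(schema.arguments) == 0:
        return False
    # A view aliases its first argument without writing to it.
    alias_info = schema.arguments[0].alias_info
    return alias_info is not None and not alias_info.is_write


view_result = is_view_op_sketch(torch.ops.aten.view.default)
to_result = is_view_op_sketch(torch.ops.aten.to.device)
add_result = is_view_op_sketch(torch.ops.aten.add.Tensor)
print(view_result, to_result, add_result)
```

With this guard, `aten.to.device` is no longer classified as a view even though its schema claims aliasing, while a true view op such as `aten.view.default` still is.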

fyi @weifengpy @wanchaol

PHLens · 2024-09-24

Sorry for the late reply. I've fixed it in a more general way, following your suggestion. @bdhirsh

github-actions · 2024-11-23T07:33:59Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

bdhirsh

Thanks for the fix!

bdhirsh · 2024-12-26T19:34:15Z

@pytorchbot merge

pytorch-bot · 2024-12-26T19:34:19Z

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

bdhirsh · 2024-12-26T19:34:43Z

@pytorchbot merge

pytorchmergebot · 2024-12-26T19:37:31Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-12-26T19:42:58Z

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

bdhirsh · 2025-01-03T19:35:06Z

@pytorchbot rebase

bdhirsh · 2025-01-03T19:35:21Z

@pytorchbot help

pytorch-bot · 2025-01-03T19:35:23Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'help' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

bdhirsh · 2025-01-03T19:35:49Z

@pytorchbot --help

pytorch-bot · 2025-01-03T19:35:52Z

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. Some commands may be used on issues as specified below.

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label,drci,cherry-pick,close}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR
    drci                Update Dr. CI
    cherry-pick         Cherry pick a PR onto a release branch
    close               Close a PR

Merge

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditting purpose, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
                        
                        Please use `-f` as last resort, prefer `--ignore-current` to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
  -i, --ignore-current  Merge while ignoring the currently failing jobs.  Behaves like -f if there are no pending jobs.
  -ic                   Old flag for --ignore-current. Deprecated in favor of -i.
  -r [{viable/strict,main}], --rebase [{viable/strict,main}]
                        Rebase the PR to re run checks before merging.  Accepts viable/strict or main as branch options and will default to viable/strict if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
Repeat contributor may use this command to rebase their PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR or Issue [Can be used on Issues]

positional arguments:
  labels  Labels to add to given Pull Request or Issue [Can be used on Issues]

Dr CI

usage: @pytorchbot drci 

Update Dr. CI. Updates the Dr. CI comment on the PR in case it's gotten out of sync with actual CI results.

cherry-pick

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Cherry pick a pull request onto a release branch for inclusion in a release

optional arguments:
  --onto ONTO           Branch you would like to cherry pick onto (Example: release/2.1)
  --fixes FIXES         Link to the issue that your PR fixes (Example: https://github.com/pytorch/pytorch/issues/110666)
  -c {regression,critical,fixnewfeature,docs,release}, --classification {regression,critical,fixnewfeature,docs,release}
                        A machine-friendly classification of the cherry-pick reason.

Close

usage: @pytorchbot close

Close a PR [Can be used on issues]

bdhirsh · 2025-01-03T19:36:19Z

@pytorchbot merge -r

pytorchmergebot · 2025-01-03T19:36:30Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-01-03T19:36:33Z

Successfully rebased fix_asy_tensor onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_asy_tensor && git pull --rebase)

pytorchmergebot · 2025-01-03T19:37:53Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-01-03T19:37:57Z

Tried to rebase and push PR #134661, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

pytorchmergebot · 2025-01-03T19:37:58Z

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

bdhirsh · 2025-01-03T23:25:39Z

@pytorchbot merge

pytorchmergebot · 2025-01-03T23:27:25Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

…2688) We never added a proper test for the fix from #134661 Pull Request resolved: #152688 Approved by: https://github.com/kwen2501 ghstack dependencies: #152195

pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 28, 2024

pytorchbot added the open source label Aug 28, 2024

awgu requested review from yifuwang, bdhirsh and tianyu-l August 28, 2024 12:51

janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Aug 30, 2024

bdhirsh reviewed Sep 4, 2024

PHLens force-pushed the fix_asy_tensor branch from 944c172 to 85ddb7a Compare September 24, 2024 02:15

kwen2501 added the topic: bug fixes topic category label Sep 24, 2024

PHLens force-pushed the fix_asy_tensor branch 2 times, most recently from 354568d to bf91cd1 Compare September 24, 2024 06:53

github-actions bot added the Stale label Nov 23, 2024

github-actions bot closed this Dec 23, 2024

bdhirsh reopened this Dec 26, 2024

bdhirsh approved these changes Dec 26, 2024

bdhirsh added the release notes: distributed (dtensor) release notes category label Dec 26, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 26, 2024

pytorchmergebot added the merging label Dec 26, 2024

pytorchmergebot removed the merging label Dec 26, 2024

Fix AsyncCollectiveTensor logic for CompositeImplicitAutograd ops.

c4425cc

pytorchmergebot force-pushed the fix_asy_tensor branch from bf91cd1 to c4425cc Compare January 3, 2025 19:36

pytorchmergebot added the merging label Jan 3, 2025

pytorchmergebot added the Merged label Jan 4, 2025

pytorchmergebot closed this in 98949df Jan 4, 2025

pytorchmergebot removed the merging label Jan 4, 2025

kwen2501 mentioned this pull request May 1, 2025

AsyncCollectiveTensor doesn't trigger wait upon dtype cast #152534

Closed

bdhirsh mentioned this pull request May 2, 2025

Add a test for AsyncCollectiveTensor handling for maybe-view ops #152688

Closed
