Skip to content

Conversation

PHLens
Copy link
Contributor

@PHLens PHLens commented Aug 28, 2024

Copy link

pytorch-bot bot commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134661

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c4425cc with merge base b336d72 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 28, 2024
@awgu awgu requested review from yifuwang, bdhirsh and tianyu-l August 28, 2024 12:51
@tianyu-l
Copy link
Contributor

@bdhirsh can you help review?

@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Aug 30, 2024
@@ -432,6 +432,9 @@ def reduce_scatter_tensor_coalesced(
# Today, this maps 1:1 with "aten ops that are views".
def _is_view_op(tgt):
assert isinstance(tgt, torch._ops.OpOverload)
# Special case for `aten.to`. See issue: https://github.com/pytorch/pytorch/issues/133421
if "to" in tgt.__name__.split('.'):
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this feels like a bit of bandaid fix (although I agree it solves the aten.to problem).

The issue is that we have several composite operations that are "maybe-aliasing", and therefore can lie about their schemas. aten.to.device is an example: it may-or-may-not alias input, depending on whether the device matches the input tensor, but its schema reports as always aliasing.

Ordinarily, these ops will always decompose before we get to torch_dispatch. But under inference_mode, these ops can show up directly in torch_dispatch.

The way we've generally dealt with this for other subclasses is to force them to decompose these composite ops, by adding this code:

r = func.decompose(*args, **kwargs)
# this will attempt to run the eager-mode decomposition if one exists, and return NotImplemented otherwise
if r is not NotImplemented:
    return r

Although cc @weifengpy, I'm curious what you think, since this could have an impact on eager performance (we'd probably have to measure it)

Copy link
Contributor

@bdhirsh bdhirsh Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however, given that AsyncCollectiveTensor doesn't really do very much other than branch on view ops... another strategy that is more general than this PR but might be less risky for perf would just be to have AsyncCollectiveTensor assume that all of these composite ops are not views (worst case, it does an early sync when it doesn't have to, but most view ops are not composite anyway).

You can do it like this:

# don't apply the view optimization to any `CompositeImplicitAutograd` ops
if torch._C._dispatch_has_kernel_for_dispatch_key(func.name(), DispatchKey.CompositeImplicitAutograd):
    return False

fyi @weifengpy @wanchaol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sry for the late reply, I've fixed it in a more general way according to your suggestion. @bdhirsh

@kwen2501 kwen2501 added the topic: bug fixes topic category label Sep 24, 2024
@PHLens PHLens force-pushed the fix_asy_tensor branch 2 times, most recently from 354568d to bf91cd1 Compare September 24, 2024 06:53
Copy link
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Nov 23, 2024
@github-actions github-actions bot closed this Dec 23, 2024
@bdhirsh bdhirsh reopened this Dec 26, 2024
Copy link
Contributor

@bdhirsh bdhirsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@bdhirsh
Copy link
Contributor

bdhirsh commented Dec 26, 2024

@pytorchbot merge

Copy link

pytorch-bot bot commented Dec 26, 2024

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@bdhirsh bdhirsh added the release notes: distributed (dtensor) release notes category label Dec 26, 2024
@bdhirsh
Copy link
Contributor

bdhirsh commented Dec 26, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 26, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@bdhirsh
Copy link
Contributor

bdhirsh commented Jan 3, 2025

@pytorchbot rebase

@bdhirsh
Copy link
Contributor

bdhirsh commented Jan 3, 2025

@pytorchbot help

Copy link

pytorch-bot bot commented Jan 3, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'help' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@bdhirsh
Copy link
Contributor

bdhirsh commented Jan 3, 2025

@pytorchbot --help

Copy link

pytorch-bot bot commented Jan 3, 2025

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. Some commands may be used on issues as specified below.

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label,drci,cherry-pick,close}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR
    drci                Update Dr. CI
    cherry-pick         Cherry pick a PR onto a release branch
    close               Close a PR

Merge

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditting purpose, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
                        
                        Please use `-f` as last resort, prefer `--ignore-current` to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
  -i, --ignore-current  Merge while ignoring the currently failing jobs.  Behaves like -f if there are no pending jobs.
  -ic                   Old flag for --ignore-current. Deprecated in favor of -i.
  -r [{viable/strict,main}], --rebase [{viable/strict,main}]
                        Rebase the PR to re run checks before merging.  Accepts viable/strict or main as branch options and will default to viable/strict if not specified.

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst}
                        A machine-friendly classification of the revert reason.

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
Repeat contributor may use this command to rebase their PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR or Issue [Can be used on Issues]

positional arguments:
  labels  Labels to add to given Pull Request or Issue [Can be used on Issues]

Dr CI

usage: @pytorchbot drci 

Update Dr. CI. Updates the Dr. CI comment on the PR in case it's gotten out of sync with actual CI results.

cherry-pick

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Cherry pick a pull request onto a release branch for inclusion in a release

optional arguments:
  --onto ONTO           Branch you would like to cherry pick onto (Example: release/2.1)
  --fixes FIXES         Link to the issue that your PR fixes (Example: https://github.com/pytorch/pytorch/issues/110666)
  -c {regression,critical,fixnewfeature,docs,release}, --classification {regression,critical,fixnewfeature,docs,release}
                        A machine-friendly classification of the cherry-pick reason.

Close

usage: @pytorchbot close

Close a PR [Can be used on issues]

@bdhirsh
Copy link
Contributor

bdhirsh commented Jan 3, 2025

@pytorchbot merge -r

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased fix_asy_tensor onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_asy_tensor && git pull --rebase)

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Tried to rebase and push PR #134661, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@pytorchmergebot
Copy link
Collaborator

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

@bdhirsh
Copy link
Contributor

bdhirsh commented Jan 3, 2025

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (dtensor) release notes category Stale topic: bug fixes topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

torch.distributed._functional_collectives.AsyncCollectiveTensor has issue with aten.to.dtpye_layout.
7 participants