[dtensor] directly return local_tensor under no_grad #128145

wanchaol · 2024-06-06T16:43:38Z

Stack from ghstack (oldest at bottom):

As titled: skip the autograd function and return the local_tensor
directly when running under a no_grad context. This avoids creating
views.
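The fast path described above can be illustrated with a minimal sketch. This is not the actual DTensor code: `ToyShard` and `_ViewThroughAutograd` are hypothetical stand-ins for a tensor wrapper and for the autograd function that `to_local` normally goes through, and the real method handles more than is shown here. Only `torch.is_grad_enabled()`, `torch.no_grad()`, and `torch.autograd.Function` are real PyTorch APIs.

```python
import torch


class _ViewThroughAutograd(torch.autograd.Function):
    """Hypothetical stand-in for the autograd function used by to_local."""

    @staticmethod
    def forward(ctx, local_tensor):
        # Return a view so the output is a distinct tensor object
        # that still shares storage with the input.
        return local_tensor.view_as(local_tensor)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity backward: pass the gradient straight through.
        return grad_output


class ToyShard:
    """Hypothetical wrapper that holds a local tensor (not a real DTensor)."""

    def __init__(self, local_tensor):
        self._local_tensor = local_tensor

    def to_local(self):
        if not torch.is_grad_enabled():
            # Fast path: no autograd function, no view creation.
            return self._local_tensor
        # Differentiable path: go through the autograd function.
        return _ViewThroughAutograd.apply(self._local_tensor)


shard = ToyShard(torch.ones(2, 2, requires_grad=True))

with torch.no_grad():
    fast = shard.to_local()   # the stored tensor itself
slow = shard.to_local()       # a view produced by the autograd function

print("no_grad returns stored tensor:", fast is shard._local_tensor)
print("grad mode returns stored tensor:", slow is shard._local_tensor)
```

In this sketch the no_grad branch hands back the stored tensor object itself, so no view is created and no autograd node is recorded.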

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k


pytorch-bot · 2024-06-06T16:43:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128145


✅ You can merge normally! (6 Unrelated Failures)

As of commit 7a1cfdf with merge base 2fc9079 ():

FLAKY - The following job failed but was likely due to flakiness present on trunk:

linux-binary-manywheel / manywheel-py3_8-cuda11_8-test / test (gh) (similar failure)
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory

UNSTABLE - The following jobs failed, likely due to flakiness present on trunk, and have been marked as unstable:

inductor / cuda12.1-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) ()
sebotnet33ts_256
inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu, unstable) (gh) (#126884)
cspdarknet53
inductor / linux-jammy-cpu-py3.8-gcc11-inductor / test (inductor_torchbench_cpu_smoketest_perf, 1, 1, linux.24xl.spr-metal, unstable) (gh) (#126993)
Process completed with exit code 1.
linux-binary-manywheel / manywheel-py3_8-cuda12_1-test / test (gh) (#127288)
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory
linux-binary-manywheel / manywheel-py3_8-cuda12_4-test / test (gh) (#127289)
ImportError: libcudnn.so.9: cannot open shared object file: No such file or directory

This comment was automatically generated by Dr. CI and updates every 15 minutes.

awgu · 2024-06-06T18:28:51Z

torch/distributed/_tensor/api.py

        .. note:: `to_local` is differentiable, the `requires_grad` of the local tensor returned
            will depend on if the `DTensor` requires_grad or not.
        """
+        if not torch.is_grad_enabled():


Now the question is: how expensive is torch.is_grad_enabled()? 😆

I hope it's pretty cheap, given that we use it pervasively in PyTorch ;)

I tested this and got a trace locally; I didn't even see the to_local or is_grad_enabled calls in the CPU trace :)

I guess that's because there is no aten op in that call stack, so the profiler cannot see it? (It could still contribute to gaps between other ops in the profiler trace, though.)
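Since the call may not show up in a profiler trace, one way to get a rough number for the cost question is a plain wall-clock microbenchmark. The sketch below is only an illustration: the helper name `time_is_grad_enabled` and the iteration count are assumptions, and the result depends on the machine and the PyTorch build.

```python
import timeit

import torch


def time_is_grad_enabled(number=100_000):
    """Return the average seconds per call of torch.is_grad_enabled()."""
    total = timeit.timeit(torch.is_grad_enabled, number=number)
    return total / number


per_call = time_is_grad_enabled()
print(f"torch.is_grad_enabled(): ~{per_call * 1e9:.0f} ns per call")
```

This measures only the Python-side call overhead in isolation; it does not capture how the check interacts with surrounding ops in a real training step.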

wanchaol · 2024-06-06T22:46:18Z

@pytorchbot merge

pytorchmergebot · 2024-06-06T22:48:19Z

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team

Raised by workflow job

wanchaol · 2024-06-07T03:42:17Z

@pytorchbot merge

pytorchmergebot · 2024-06-07T03:44:03Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


wanchaol · 2024-06-07T03:59:05Z

@pytorchbot merge -f "queued job"

pytorchmergebot · 2024-06-07T03:59:23Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

pytorchmergebot · 2024-06-07T04:01:38Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


as titled, skip the autograd function and directly return the local_tensor if it's under no_grad context, this would avoid creating views Pull Request resolved: pytorch#128145 Approved by: https://github.com/awgu ghstack dependencies: pytorch#128112

[dtensor] directly return local_tensor under no_grad

7a1cfdf


wanchaol mentioned this pull request Jun 6, 2024

[dtensor] reuse DTensorSpec as much as possible #128112

Closed

wanchaol mentioned this pull request Jun 6, 2024

[fsdp2] update foreach_reduce accumulate_grad #128117

Closed

pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jun 6, 2024

awgu approved these changes Jun 6, 2024


awgu reviewed Jun 6, 2024


wanchaol requested a review from albanD June 6, 2024 20:16

pytorch-bot bot added the ciflow/trunk label Jun 6, 2024

pytorchmergebot added the merging label Jun 6, 2024

pytorchmergebot removed the merging label Jun 6, 2024

wanchaol added the release notes: distributed (dtensor) label Jun 6, 2024

pytorchmergebot added the merging label Jun 7, 2024

pytorchmergebot closed this in 3df53c2 Jun 7, 2024

pytorchmergebot added Merged and removed merging labels Jun 7, 2024

github-actions bot deleted the gh/wanchaol/482/head branch July 8, 2024 01:57
