
Conversation

pritamdamania87
Contributor

@pritamdamania87 pritamdamania87 commented Mar 29, 2024

DDPSink clones the outputs of DDP to avoid in-place modification of the loss (see #61982). However, when the outputs are very large (2-3 GB), this clone adds significant peak-memory overhead.

This PR therefore adds a mode that skips the clone for users who do not modify the loss in place.
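For illustration, the hazard the clone guards against can be sketched without DDP: if a module returns an internal buffer and the caller mutates it in place, the module's later use of that buffer is silently corrupted; returning a copy (a clone) avoids this at the cost of peak memory. Below is a minimal pure-Python sketch of the trade-off; the class and method names are hypothetical and are not PyTorch APIs.

```python
class Sink:
    """Toy stand-in for a module that must reuse its output buffer later
    (as DDPSink must for autograd). All names here are hypothetical."""

    def __init__(self, clone_outputs=True):
        self.clone_outputs = clone_outputs
        self._buffer = [1.0, 2.0, 3.0]  # internal state reused after forward()

    def forward(self):
        # Cloning protects the internal buffer from in-place edits by the
        # caller, but doubles the memory held for the output.
        return list(self._buffer) if self.clone_outputs else self._buffer

    def internal_sum(self):
        # A later step that silently depends on the buffer being untouched.
        return sum(self._buffer)


safe = Sink(clone_outputs=True)
out = safe.forward()
out[0] = -100.0                      # caller mutates the returned value in place
assert safe.internal_sum() == 6.0    # clone kept internal state intact

fast = Sink(clone_outputs=False)
out = fast.forward()
out[0] = -100.0                      # same mutation now corrupts internal state
assert fast.internal_sum() == -95.0  # -100.0 + 2.0 + 3.0
```

With the mode this PR adds, users who never modify the loss in place can opt out of the clone and reclaim that extra peak memory, since the protection is unnecessary for them.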

cc @mrshenli @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang


pytorch-bot bot commented Mar 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122927

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 48e2ef7 with merge base e70bf23 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 29, 2024
@albanD albanD removed their request for review March 29, 2024 01:58
@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 1, 2024
@yf225 yf225 requested review from fegin and rohan-varma April 4, 2024 19:26
@mikaylagawarecki mikaylagawarecki removed their request for review April 5, 2024 21:14
Contributor

@fegin fegin left a comment


LGTM

Contributor

@rohan-varma rohan-varma left a comment


Makes sense, thanks!

@pritamdamania87
Contributor Author

Thanks for the review @fegin and @rohan-varma!

@pritamdamania87
Contributor Author

@pytorchbot merge


pytorch-bot bot commented Apr 8, 2024

The pull workflow has not been scheduled for this PR yet. This could be because the author doesn't have permission to run the workflows, or because skip-checks keywords were added to the PR/commits; aborting the merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@pritamdamania87
Contributor Author

@fegin @rohan-varma Could you approve the appropriate workflows as suggested in #122927 (comment)? Thanks!

@pritamdamania87
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 11, 2024
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job.

@awgu awgu added the release notes: distributed (ddp) release notes category label Apr 11, 2024
@pritamdamania87
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@pritamdamania87
Contributor Author

@pytorchbot merge -f "flaky windows test"


pytorch-bot bot commented Apr 11, 2024

You are not authorized to force merges to this repository. Please use the regular @pytorchmergebot merge command instead

@pritamdamania87
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@pritamdamania87
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased user/pdamania/ddp_avoid_clone onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout user/pdamania/ddp_avoid_clone && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the user/pdamania/ddp_avoid_clone branch from a72a5b8 to 48e2ef7 Compare April 12, 2024 00:02
@pritamdamania87
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
DDPSink clones the outputs of DDP to avoid in-place modification of the loss (see pytorch#61982). However, when the outputs are very large (2-3 GB), this clone adds significant peak-memory overhead.

This PR therefore adds a mode that skips the clone for users who do not modify the loss in place.

Pull Request resolved: pytorch#122927
Approved by: https://github.com/fegin, https://github.com/rohan-varma
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
DDPSink clones the outputs of DDP to avoid in-place modification of the loss (see pytorch#61982). However, when the outputs are very large (2-3 GB), this clone adds significant peak-memory overhead.

This PR therefore adds a mode that skips the clone for users who do not modify the loss in place.

Pull Request resolved: pytorch#122927
Approved by: https://github.com/fegin, https://github.com/rohan-varma

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
release notes: distributed (ddp) (release notes category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants