
Conversation

pritamdamania87
Contributor

@pritamdamania87 pritamdamania87 commented Mar 29, 2024

DDPSink clones the outputs of DDP to avoid in-place modification of the loss (see #61982). However, when the outputs are very large (2-3 GB), this clone adds significant peak-memory overhead.

This PR therefore adds a mode that skips the clone for users who do not modify the loss in place.
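For illustration, the hazard the clone guards against can be sketched without DDP: if a module returns an internal buffer and the caller mutates it in place, the module's later use of that buffer is silently corrupted; returning a copy (a clone) avoids this at the cost of peak memory. Below is a minimal pure-Python sketch of the trade-off; the class and method names are hypothetical and are not PyTorch APIs.

```python
class Sink:
    """Toy stand-in for a module that must reuse its output buffer later
    (as DDPSink must for autograd). All names here are hypothetical."""

    def __init__(self, clone_outputs=True):
        self.clone_outputs = clone_outputs
        self._buffer = [1.0, 2.0, 3.0]  # internal state reused after forward()

    def forward(self):
        # Cloning protects the internal buffer from in-place edits by the
        # caller, but doubles the memory held for the output.
        return list(self._buffer) if self.clone_outputs else self._buffer

    def internal_sum(self):
        # A later step that silently depends on the buffer being untouched.
        return sum(self._buffer)


safe = Sink(clone_outputs=True)
out = safe.forward()
out[0] = -100.0                      # caller mutates the returned value in place
assert safe.internal_sum() == 6.0    # clone kept internal state intact

fast = Sink(clone_outputs=False)
out = fast.forward()
out[0] = -100.0                      # same mutation now corrupts internal state
assert fast.internal_sum() == -95.0  # -100.0 + 2.0 + 3.0
```

With the mode this PR adds, users who never modify the loss in place can opt out of the clone and reclaim that extra peak memory, since the protection is unnecessary for them.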

cc @mrshenli @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang


pytorch-bot bot commented Mar 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122927

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 48e2ef7 with merge base e70bf23 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 29, 2024
@albanD albanD removed their request for review March 29, 2024 01:58
@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 1, 2024
@yf225 yf225 requested review from fegin and rohan-varma April 4, 2024 19:26
@mikaylagawarecki mikaylagawarecki removed their request for review April 5, 2024 21:14
Contributor

@fegin fegin left a comment


LGTM

Contributor

@rohan-varma rohan-varma left a comment


Makes sense, thanks!

@pritamdamania87
Contributor Author

Thanks for the review @fegin and @rohan-varma!

@pritamdamania87
Contributor Author

@pytorchbot merge


pytorch-bot bot commented Apr 8, 2024

The pull workflow has not been scheduled for this PR yet. This could be because the author doesn't have permission to run the workflows, or because skip-checks keywords were added to the PR/commits; aborting the merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@pritamdamania87
Contributor Author

@fegin @rohan-varma Could you approve the appropriate workflows as suggested in #122927 (comment)? Thanks!

@pritamdamania87
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 11, 2024
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job.

@awgu awgu added the release notes: distributed (ddp) release notes category label Apr 11, 2024
@pritamdamania87
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@pritamdamania87
Contributor Author

@pytorchbot merge -f "flaky windows test"


pytorch-bot bot commented Apr 11, 2024

You are not authorized to force merges to this repository. Please use the regular @pytorchmergebot merge command instead

@pritamdamania87
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@pritamdamania87
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased user/pdamania/ddp_avoid_clone onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout user/pdamania/ddp_avoid_clone && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the user/pdamania/ddp_avoid_clone branch from a72a5b8 to 48e2ef7 Compare April 12, 2024 00:02
@pritamdamania87
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
DDPSink clones the outputs of DDP to avoid in-place modification of the loss (see pytorch#61982). However, when the outputs are very large (2-3 GB), this clone adds significant peak-memory overhead.

This PR therefore adds a mode that skips the clone for users who do not modify the loss in place.

Pull Request resolved: pytorch#122927
Approved by: https://github.com/fegin, https://github.com/rohan-varma
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
DDPSink clones the outputs of DDP to avoid in-place modification of the loss (see pytorch#61982). However, when the outputs are very large (2-3 GB), this clone adds significant peak-memory overhead.

This PR therefore adds a mode that skips the clone for users who do not modify the loss in place.

Pull Request resolved: pytorch#122927
Approved by: https://github.com/fegin, https://github.com/rohan-varma

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
release notes: distributed (ddp) (release notes category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants