Update CUDA out of memory message with private pool info #124673

Closed
wants to merge 6 commits into from

Conversation

isuruf
Collaborator

@isuruf isuruf commented Apr 22, 2024

[ghstack-poisoned]

pytorch-bot bot commented Apr 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124673

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ca6dc21 with merge base 023f05c (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

isuruf added a commit that referenced this pull request Apr 22, 2024
ghstack-source-id: 2371f7f9f781aae61b79352a58fe8ab0e71c273a
Pull Request resolved: #124673
[ghstack-poisoned]
isuruf added a commit that referenced this pull request Apr 22, 2024
ghstack-source-id: aa44b8c9c14a3995e25e4f8ed05075d6106ad59d
Pull Request resolved: #124673
[ghstack-poisoned]
isuruf added a commit that referenced this pull request Apr 23, 2024
ghstack-source-id: 115a9f29ba6c7107a518523d35dd60a9eaea8bed
Pull Request resolved: #124673
@isuruf
Collaborator Author

isuruf commented Apr 23, 2024

cc @eellison @peterbell10

    return res;
  };
  for (const auto& p : graph_pools) {
    allocated_in_private_pools += get_size_block(p.second->large_blocks);
Collaborator

This likely needs to happen before the mutex unlock on line 1119.
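
To illustrate the concern, here is a minimal sketch of summing private-pool block sizes while the allocator lock is still held, so the pool map cannot change during the walk. `ToyAllocator`, `Block`, `PrivatePool`, and every member name below are hypothetical stand-ins, not the real CUDACachingAllocator types, and the assumption that the total is a plain sum of block sizes is mine rather than something the diff confirms.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

// Hypothetical stand-ins for illustration only.
struct Block {
  std::size_t size;
};

struct PrivatePool {
  std::vector<Block> large_blocks;
  std::vector<Block> small_blocks;
};

class ToyAllocator {
 public:
  // Register a block in the given private pool (takes the lock).
  void add_block(int pool_id, Block block, bool large) {
    std::lock_guard<std::mutex> guard(mutex_);
    PrivatePool& pool = graph_pools_[pool_id];
    (large ? pool.large_blocks : pool.small_blocks).push_back(block);
  }

  // Sum the sizes of all private-pool blocks. The walk over graph_pools_
  // happens entirely inside the locked scope; only the local total is
  // used after the lock is released.
  std::size_t allocated_in_private_pools() {
    std::size_t total = 0;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      for (const auto& p : graph_pools_) {
        total += sum_sizes(p.second.large_blocks);
        total += sum_sizes(p.second.small_blocks);
      }
    }
    return total;
  }

 private:
  static std::size_t sum_sizes(const std::vector<Block>& blocks) {
    std::size_t res = 0;
    for (const Block& b : blocks) {
      res += b.size;
    }
    return res;
  }

  std::mutex mutex_;
  std::map<int, PrivatePool> graph_pools_;
};
```

Any message formatting can then run on the returned value after the lock is dropped, without touching shared state.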

Comment on lines 1173 to 1177
format_size(allocated_bytes + allocated_in_private_pools),
" is allocated by PyTorch, with ",
format_size(allocated_in_private_pools),
" allocated in private pools, and ",
format_size(
Contributor

I don't think we should print this if there aren't any allocated private pools. Also, it may be worth explicitly mentioning CUDA graphs here: users of torch.compile(mode='reduce-overhead'), for example, might not make the connection.
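
A rough sketch of that suggestion: emit the private-pool clause only when something is actually allocated there, and name CUDA Graphs in it. `format_bytes` and `allocation_summary` are hypothetical helpers written for this illustration (the real code uses `format_size`, whose output format is not reproduced here), and the wording is not the text that was eventually merged.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical formatter; prints a raw byte count for simplicity.
std::string format_bytes(std::size_t n) {
  std::ostringstream os;
  os << n << " bytes";
  return os.str();
}

// Build the allocation part of an out-of-memory message. The private-pool
// clause is added only when the private-pool total is non-zero.
std::string allocation_summary(
    std::size_t allocated_bytes,
    std::size_t allocated_in_private_pools) {
  std::ostringstream os;
  os << format_bytes(allocated_bytes + allocated_in_private_pools)
     << " is allocated by PyTorch";
  if (allocated_in_private_pools > 0) {
    os << ", with " << format_bytes(allocated_in_private_pools)
       << " allocated in private pools (e.g., CUDA Graphs)";
  }
  os << ".";
  return os.str();
}
```

With no private pools in use, the message reads the same as it would without the private-pool accounting, so users who never touch CUDA Graphs see nothing new.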

[ghstack-poisoned]
isuruf added a commit that referenced this pull request Apr 23, 2024
ghstack-source-id: df2d76357eac5113798fabd7f02bc6eaeb41afca
Pull Request resolved: #124673
[ghstack-poisoned]
isuruf added a commit that referenced this pull request Apr 24, 2024
ghstack-source-id: d08a69009e6404ae38ac34fa768fdb15c96974af
Pull Request resolved: #124673
Contributor

@eellison eellison left a comment

🚢

@isuruf added the labels "module: cuda graphs" (Ability to capture and then replay streams of CUDA kernels) and "release notes: cuda" (release notes category) on May 14, 2024
@isuruf
Collaborator Author

isuruf commented May 14, 2024

@pytorchbot merge


pytorch-bot bot commented May 14, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: ',' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@isuruf
Collaborator Author

isuruf commented May 14, 2024

@pytorchbot merge

@pytorch-bot added the label "ciflow/trunk" (Trigger trunk jobs on your pull request) on May 14, 2024
@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

  • EasyCLA

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@peterbell10 added the label "topic: improvements" (topic category) on May 15, 2024
@peterbell10
Collaborator

@pytorchbot merge -r

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/isuruf/46/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/124673)

pytorchmergebot pushed a commit that referenced this pull request May 15, 2024
ghstack-source-id: 9ea4a6b25304fa1975659177910018099b83fc72
Pull Request resolved: #124673
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
@github-actions github-actions bot deleted the gh/isuruf/46/head branch June 15, 2024 02:09
Labels
ciflow/trunk (Trigger trunk jobs on your pull request); Merged; module: cuda graphs (Ability to capture and then replay streams of CUDA kernels); open source; release notes: cuda (release notes category); topic: improvements (topic category)
6 participants