Update CUDA out of memory message with private pool info #124673
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124673
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit ca6dc21 with merge base 023f05c
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 2371f7f9f781aae61b79352a58fe8ab0e71c273a Pull Request resolved: #124673
ghstack-source-id: aa44b8c9c14a3995e25e4f8ed05075d6106ad59d Pull Request resolved: #124673
ghstack-source-id: 115a9f29ba6c7107a518523d35dd60a9eaea8bed Pull Request resolved: #124673
c10/cuda/CUDACachingAllocator.cpp
Outdated
      return res;
    };
    for (const auto& p : graph_pools) {
      allocated_in_private_pools += get_size_block(p.second->large_blocks);
This likely needs to happen before the mutex unlock on line 1119.
c10/cuda/CUDACachingAllocator.cpp
Outdated
    format_size(allocated_bytes + allocated_in_private_pools),
    " is allocated by PyTorch, with ",
    format_size(allocated_in_private_pools),
    " allocated in private pools, and ",
    format_size(
I don't think we should print this if there aren't any allocated private pools. Also, it may be worth explicitly mentioning cudagraphs here; e.g. users doing torch.compile(mode='reduce-overhead') might not make the connection.
ghstack-source-id: df2d76357eac5113798fabd7f02bc6eaeb41afca Pull Request resolved: #124673
ghstack-source-id: d08a69009e6404ae38ac34fa768fdb15c96974af Pull Request resolved: #124673
🚢
@pytorchbot merge
❌ 🤖 pytorchbot command failed:
Try
@pytorchbot merge
Merge failed. Reason: 1 mandatory check(s) are pending/not yet run. The first few are:
Dig deeper by viewing the pending checks on hud
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
ghstack-source-id: 9ea4a6b25304fa1975659177910018099b83fc72 Pull Request resolved: #124673
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Fixes pytorch#121932
Pull Request resolved: pytorch#124673
Approved by: https://github.com/eellison, https://github.com/eqy
Stack from ghstack (oldest at bottom):
Fixes #121932
cc @mcarilli @ezyang @eellison @peterbell10