Update CUDA out of memory message with private pool info #124673

isuruf · 2024-04-22T22:45:43Z

Stack from ghstack (oldest at bottom):

-> Update CUDA out of memory message with private pool info #124673

Fixes #121932

cc @mcarilli @ezyang @eellison @peterbell10

[ghstack-poisoned]

pytorch-bot · 2024-04-22T22:45:46Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124673


✅ No Failures

As of commit ca6dc21 with merge base 023f05c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 2371f7f9f781aae61b79352a58fe8ab0e71c273a Pull Request resolved: #124673

[ghstack-poisoned]

ghstack-source-id: aa44b8c9c14a3995e25e4f8ed05075d6106ad59d Pull Request resolved: #124673

[ghstack-poisoned]

ghstack-source-id: 115a9f29ba6c7107a518523d35dd60a9eaea8bed Pull Request resolved: #124673

isuruf · 2024-04-23T13:07:18Z

cc @eellison @peterbell10

peterbell10 · 2024-04-23T14:58:13Z

c10/cuda/CUDACachingAllocator.cpp

+        return res;
+      };
+      for (const auto& p : graph_pools) {
+        allocated_in_private_pools += get_size_block(p.second->large_blocks);


This likely needs to happen before the mutex unlock on line 1119.
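The concern above can be illustrated with a minimal, self-contained sketch. The types below (`Block`, `BlockPool`, `PrivatePool`, `PoolOwner`) are hypothetical, simplified stand-ins and not the allocator's real definitions, and summing `small_blocks` as well as `large_blocks` is an assumption. The only point is that the walk over `graph_pools` completes while the lock is still held, and the lock is released afterwards.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical, simplified stand-ins for the allocator's internal types.
struct Block {
  std::size_t size;
};
struct BlockPool {
  std::vector<Block> blocks;
};
struct PrivatePool {
  BlockPool large_blocks;
  BlockPool small_blocks;
};

struct PoolOwner {
  std::recursive_mutex mutex;
  std::map<int, std::unique_ptr<PrivatePool>> graph_pools;

  std::size_t bytes_in_private_pools() {
    std::unique_lock<std::recursive_mutex> lock(mutex);

    // Sum the sizes of all blocks in one pool.
    auto get_size_block = [](const BlockPool& pool) {
      std::size_t res = 0;
      for (const auto& block : pool.blocks) {
        res += block.size;
      }
      return res;
    };

    // Read the shared pool state while the lock is still held.
    std::size_t allocated_in_private_pools = 0;
    for (const auto& p : graph_pools) {
      allocated_in_private_pools += get_size_block(p.second->large_blocks);
      allocated_in_private_pools += get_size_block(p.second->small_blocks);
    }

    // Release the lock only after the shared state has been read.
    lock.unlock();
    return allocated_in_private_pools;
  }
};
```

Reading `graph_pools` after the unlock would allow another thread to modify the pools during the iteration, which is the race the review comment points at.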

eellison · 2024-04-23T15:58:53Z

c10/cuda/CUDACachingAllocator.cpp

+          format_size(allocated_bytes + allocated_in_private_pools),
+          " is allocated by PyTorch, with ",
+          format_size(allocated_in_private_pools),
+          " allocated in private pools, and ",
+          format_size(


I don't think we should print this if there aren't any allocated private pools. Also, maybe it's worth explicitly mentioning CUDA graphs here? For example, users of torch.compile(mode='reduce-overhead') might not make the connection.
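A minimal sketch of the suggestion, assuming a hypothetical `format_bytes` helper in place of the allocator's `format_size` and an illustrative message wording (not necessarily the text that was merged): the private-pool clause is appended only when private pools hold memory, and it names CUDA Graphs explicitly.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical stand-in for the allocator's format_size helper.
std::string format_bytes(std::size_t bytes) {
  return std::to_string(bytes) + " bytes";
}

// Builds the "allocated by PyTorch" part of an out-of-memory message.
// The wording is illustrative only.
std::string oom_summary(
    std::size_t allocated_bytes,
    std::size_t allocated_in_private_pools) {
  std::ostringstream msg;
  msg << format_bytes(allocated_bytes + allocated_in_private_pools)
      << " is allocated by PyTorch";
  // Mention private pools only when they actually hold memory.
  if (allocated_in_private_pools > 0) {
    msg << ", with " << format_bytes(allocated_in_private_pools)
        << " allocated in private pools (e.g., CUDA Graphs)";
  }
  msg << ".";
  return msg.str();
}
```

With no private pools in use, the message stays as short as before; the extra clause appears only for users who are affected, such as those capturing CUDA graphs.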

[ghstack-poisoned]

ghstack-source-id: df2d76357eac5113798fabd7f02bc6eaeb41afca Pull Request resolved: #124673

c10/cuda/CUDACachingAllocator.cpp

[ghstack-poisoned]

ghstack-source-id: d08a69009e6404ae38ac34fa768fdb15c96974af Pull Request resolved: #124673

eellison

🚢

isuruf · 2024-05-14T20:20:10Z

@pytorchbot merge

pytorch-bot · 2024-05-14T20:20:12Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: ',' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

isuruf · 2024-05-14T20:20:54Z

@pytorchbot merge

pytorchmergebot · 2024-05-14T20:22:45Z

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

EasyCLA

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

peterbell10 · 2024-05-15T00:59:54Z

@pytorchbot merge -r

pytorchmergebot · 2024-05-15T01:01:31Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]

pytorchmergebot · 2024-05-15T01:01:43Z

Successfully rebased gh/isuruf/46/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/124673)

ghstack-source-id: 9ea4a6b25304fa1975659177910018099b83fc72 Pull Request resolved: #124673

pytorchmergebot · 2024-05-15T01:02:57Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Fixes pytorch#121932 Pull Request resolved: pytorch#124673 Approved by: https://github.com/eellison, https://github.com/eqy

Update

b634475

[ghstack-poisoned]

isuruf added a commit that referenced this pull request Apr 22, 2024

Update CUDA out of memory message with private pool info

fb5edaf

ghstack-source-id: 2371f7f9f781aae61b79352a58fe8ab0e71c273a Pull Request resolved: #124673

pytorchbot added the open source label Apr 22, 2024

Update

2110e02

[ghstack-poisoned]

isuruf added a commit that referenced this pull request Apr 22, 2024

Update CUDA out of memory message with private pool info

e08ca29

ghstack-source-id: aa44b8c9c14a3995e25e4f8ed05075d6106ad59d Pull Request resolved: #124673

Update

f289659

[ghstack-poisoned]

isuruf added a commit that referenced this pull request Apr 23, 2024

Update CUDA out of memory message with private pool info

6935adc

ghstack-source-id: 115a9f29ba6c7107a518523d35dd60a9eaea8bed Pull Request resolved: #124673

peterbell10 reviewed Apr 23, 2024


eellison reviewed Apr 23, 2024


Update

88b8e42

[ghstack-poisoned]

isuruf added a commit that referenced this pull request Apr 23, 2024

Update CUDA out of memory message with private pool info

93d1fa6

ghstack-source-id: df2d76357eac5113798fabd7f02bc6eaeb41afca Pull Request resolved: #124673

eqy reviewed Apr 24, 2024


Update

7176983

[ghstack-poisoned]

isuruf added a commit that referenced this pull request Apr 24, 2024

Update CUDA out of memory message with private pool info

6d39843

ghstack-source-id: d08a69009e6404ae38ac34fa768fdb15c96974af Pull Request resolved: #124673

isuruf requested review from peterbell10, eellison and eqy May 13, 2024 17:16

eellison approved these changes May 14, 2024


eqy approved these changes May 14, 2024


isuruf added the module: cuda graphs and release notes: cuda labels May 14, 2024

pytorch-bot bot added the ciflow/trunk label May 14, 2024

pytorchmergebot added the merging label May 14, 2024

pytorchmergebot removed the merging label May 14, 2024

peterbell10 added the topic: improvements label May 15, 2024

Update

ca6dc21

[ghstack-poisoned]

pytorchmergebot pushed a commit that referenced this pull request May 15, 2024

Update CUDA out of memory message with private pool info

313071d

ghstack-source-id: 9ea4a6b25304fa1975659177910018099b83fc72 Pull Request resolved: #124673

pytorchmergebot added the merging label May 15, 2024

pytorchmergebot added the Merged label May 15, 2024

pytorchmergebot closed this in 0dedc1a May 15, 2024

pytorchmergebot removed the merging label May 15, 2024

ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024

Update CUDA out of memory message with private pool info (pytorch#124673)

11aea9e

Fixes pytorch#121932 Pull Request resolved: pytorch#124673 Approved by: https://github.com/eellison, https://github.com/eqy

github-actions bot deleted the gh/isuruf/46/head branch June 15, 2024 02:09
