Improve detection of workspace/non-output allocations in cudagraphs #99985

eellison · 2023-04-25T15:19:24Z

Stack from ghstack (oldest at bottom):

-> Improve detection of workspace/non-output allocations in cudagraphs #99985

When we run cudagraph trees, we are not allowed to have permanent workspace allocations (like the ones in cublas), for two reasons: we might need to reclaim that memory for a previous cudagraph recording, and the memory is not accounted for in output weakrefs, so it does not work with checkpointing. Previously, I checked through snapshotting that we didn't have any additional allocations. This was extremely slow, so I had to turn it off.

This PR first does a quick check to see whether we are in an error state, and only if we are does it run the slow logic of creating a snapshot. It also turns on history recording so that we get a stack trace of where the bad allocation came from.
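The "quick check first, snapshot only on failure" idea described above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: `FakePool`, `live_ptrs_match`, `snapshot`, and `check_memory_pool` are hypothetical names, and the pool is a plain dict standing in for the CUDA caching allocator.

```python
from dataclasses import dataclass, field


@dataclass
class FakePool:
    """Hypothetical stand-in for a cudagraph memory pool (not a PyTorch class)."""

    # Maps a data pointer to the stack trace recorded when it was allocated.
    blocks: dict = field(default_factory=dict)
    snapshots_taken: int = 0

    def live_ptrs_match(self, expected_ptrs: set) -> bool:
        # Cheap check: only compares pointer sets and builds no report.
        return set(self.blocks) == expected_ptrs

    def snapshot(self) -> dict:
        # Stands in for the expensive snapshot; here we only count the calls.
        self.snapshots_taken += 1
        return dict(self.blocks)


def check_memory_pool(pool: FakePool, live_output_ptrs: set) -> None:
    # Fast path: nothing unexpected is allocated, so skip the snapshot.
    if pool.live_ptrs_match(live_output_ptrs):
        return

    # Slow path: only reached once we already know we are in an error state.
    snapshot = pool.snapshot()
    unaccounted = {
        ptr: trace for ptr, trace in snapshot.items() if ptr not in live_output_ptrs
    }
    details = "\n".join(
        f"ptr {ptr:#x} allocated at:\n{trace}" for ptr, trace in unaccounted.items()
    )
    raise RuntimeError(
        "Allocations in the pool are not accounted for as outputs:\n" + details
    )
```

In this sketch the cost of the snapshot is paid only when the check has already failed, so the common (passing) case stays cheap while the error message can still include where each stray allocation came from.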

cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

[ghstack-poisoned]

pytorch-bot · 2023-04-25T15:19:27Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99985

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit c7c6d51:

NEW FAILURE - The following job has failed:

linux-docs / build-docs-functorch-false (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.


zdevito

A couple of minor comments, but I am good with landing this.

zdevito · 2023-04-25T23:50:59Z

torch/_inductor/cudagraph_trees.py

+def format_tb(caching_allocator_trace):
+    formatted_traceback = []

+    MAX_LENGHTH = 20
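For readers who want a picture of what a helper like this might do, here is a rough sketch of formatting allocator frames into a readable trace. It is not the function from the diff, whose body is not shown in this excerpt. It assumes each frame is a dict with `filename`, `line`, and `name` keys and that the constant caps the number of frames printed; both are assumptions, and `format_frames` and `MAX_FRAMES` are hypothetical names.

```python
MAX_FRAMES = 20  # hypothetical cap on how many frames are printed


def format_frames(frames, max_frames=MAX_FRAMES):
    """Render a list of frame dicts in a traceback-like layout.

    Each frame is assumed to look like
    {"filename": "model.py", "line": 12, "name": "forward"}.
    """
    lines = []
    for frame in frames[:max_frames]:
        lines.append(
            f'  File "{frame["filename"]}", line {frame["line"]}, in {frame["name"]}'
        )
    omitted = len(frames) - max_frames
    if omitted > 0:
        # Make truncation visible instead of silently dropping frames.
        lines.append(f"  ... {omitted} more frame(s) omitted")
    return "\n".join(lines)
```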


zdevito · 2023-04-26T00:33:50Z

torch/_inductor/cudagraph_trees.py

+@contextlib.contextmanager
+def enable_history_recording():
+    "Turns on history recording in the CUDA Caching Allocator"
+    enabled = torch._C._cuda_isHistoryEnabled()


There are caveats when simple recording is enabled but stack traces are not. This function takes like 4 arguments, so I wonder whether, when _record_memory_history is called, we can return the current settings as a dict. Maybe for now we can just have _record_memory_history return whether it was enabled before. I guess I am trying to avoid adding more virtual functions to the allocator interface than necessary.

eellison · 2023-04-26

At least I have a default impl here, so you don't need to override it in subclasses.
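The suggestion above (have the call that changes the setting return the previous value, so no separate "is it enabled?" query is needed) can be sketched as follows. This is an illustration only: `FakeAllocator` and its `record_memory_history` method are hypothetical and do not reflect the real signature of `_record_memory_history`.

```python
import contextlib


class FakeAllocator:
    """Hypothetical allocator with a single on/off history flag."""

    def __init__(self):
        self._history_enabled = False

    def record_memory_history(self, enabled: bool) -> bool:
        """Set the flag and return the value it had before the call."""
        previous = self._history_enabled
        self._history_enabled = enabled
        return previous


@contextlib.contextmanager
def enable_history_recording(allocator: FakeAllocator):
    """Turn history recording on, then restore whatever was set before."""
    previous = allocator.record_memory_history(True)
    try:
        yield
    finally:
        # Restore the saved state even if the body raised.
        allocator.record_memory_history(previous)
```

With this shape the context manager needs only the one setter, which is the trade-off the review comment is pointing at: fewer entry points on the allocator interface, at the cost of a setter that also returns state.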


eellison · 2023-05-01T15:56:38Z

@pytorchbot merge -f "unrelated error"

pytorchmergebot · 2023-05-01T15:58:39Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…ytorch#99985) Pull Request resolved: pytorch#99985 Approved by: https://github.com/zdevito

xwang233 · 2024-02-29T06:02:54Z

torch/_inductor/cudagraph_trees.py

-        lambda: f"These live storage data ptrs are in the cudagraph pool but not "
-        f"accounted for as an output of cudagraph trees {allocated_not_in_live_storages}",
-    )
+    if allocated_not_in_live_storages != 0:


Just curious: wouldn't comparing a dictionary with an int always return not equal, so that this effectively turns into if True:?
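The behaviour asked about here is easy to confirm in plain Python. The snippet below assumes, as the comment does, that the variable holds a dictionary; the emptiness helper is a hypothetical alternative, not the fix that was applied upstream.

```python
# An empty dict stands in for "no unaccounted allocations".
allocated_not_in_live_storages = {}

# A dict never compares equal to an int, so `!= 0` is True even when empty.
always_true = allocated_not_in_live_storages != 0


def has_unaccounted(storages: dict) -> bool:
    """Check emptiness explicitly instead of comparing against 0."""
    return len(storages) != 0
```

Here `always_true` is `True` although the dict is empty, while `has_unaccounted` returns `False` for an empty dict and `True` once it has entries.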

Improve detection of workspace/non-output allocations in cudagraphs

6e5510f

[ghstack-poisoned]

github-actions bot added ciflow/inductor module: inductor labels Apr 25, 2023

Update on "Improve detection of workspace/non-output allocations in cudagraphs"

1e50941

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]

Update on "Improve detection of workspace/non-output allocations in cudagraphs"

26cdbdb

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]

eellison requested a review from zdevito April 25, 2023 15:49

eellison added the topic: not user facing topic category label Apr 25, 2023

eellison added a commit that referenced this pull request Apr 25, 2023

Improve detection of workspace/non-output allocations in cudagraphs

a3e103a

ghstack-source-id: ea92c69 Pull Request resolved: #99985

zdevito approved these changes Apr 26, 2023


eellison added a commit that referenced this pull request Apr 28, 2023

Improve detection of workspace/non-output allocations in cudagraphs

27ab41d

ghstack-source-id: 0476827 Pull Request resolved: #99985

eellison added a commit that referenced this pull request Apr 29, 2023

Improve detection of workspace/non-output allocations in cudagraphs

0daf399

ghstack-source-id: ad9661a Pull Request resolved: #99985

pytorchmergebot added merging Merged labels May 1, 2023

pytorchmergebot closed this in 3edff6b May 1, 2023

facebook-github-bot deleted the gh/eellison/435/head branch June 8, 2023 16:22

xwang233 reviewed Feb 29, 2024
