
Conversation

@eqy (Collaborator) commented Aug 3, 2023

An alternative to #106235 that just adds our own uid generation so that we can call `beginAllocateStreamToPool` (which notifies the caching allocator that a capture is starting) before actually starting the capture. Note that this does appear to change the uid generation behavior a bit relative to the CUDA API call (which seems to increment the id by 3 each time instead of 1).

Looking at the changes again, I'm not sure whether the begin-capture ordering change is needed in addition to the end-capture ordering change, but dropping it makes me uneasy, as I'm not sure anything prevents the autograd thread from running cleanup code "in between" captures.
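
As a rough illustration of the intended ordering (a minimal sketch, not the actual PyTorch code; only `beginAllocateStreamToPool` and the CUDA stream-capture API are real names, the counter and helper below are assumptions):

```cpp
#include <atomic>
#include <cuda_runtime.h>

// PR-style uid, generated independently of the id CUDA assigns during stream
// capture (which, as noted above, seems to advance by 3 per capture, not 1).
static std::atomic<unsigned long long> next_capture_uid{1};

void begin_capture(cudaStream_t stream) {
  unsigned long long uid = next_capture_uid.fetch_add(1);

  // 1) Notify the caching allocator that a capture is starting *before* the
  //    capture actually begins, so the autograd thread cannot run allocator
  //    cleanup in between. In PyTorch this is the beginAllocateStreamToPool
  //    notification; the line below is a commented-out placeholder for it.
  // CUDACachingAllocator::beginAllocateStreamToPool(device, stream, /*pool id from*/ uid);

  // 2) Only then start the CUDA stream capture itself.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
}
```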

CC @zdevito @eellison

cc @mcarilli @ezyang

@eqy added the open source, module: cuda graphs, topic: not user facing, and module: CUDACachingAllocator labels on Aug 3, 2023
@eqy requested review from eellison and zdevito on August 3, 2023 19:29
@pytorch-bot (bot) commented Aug 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106570

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 537dd71:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eqy changed the title from "[CUDA][CUDA Graphs] Fix potential race with autograd thread during a graph capture" to "[CUDA][CUDA Graphs] Fix potential race with autograd thread during a graph capture 2" on Aug 3, 2023
@zdevito (Contributor) left a comment

This looks a lot cleaner as a way of resolving the race conditions. I don't see anything obviously wrong with CUDAGraph coming up with its own IDs. I think using the capture ID was a holdover from when the allocator would look up what pool to use via that ID. Now that it is per-stream anyway, this seems much simpler.

Contributor left a comment

I guess we don't need this call anymore.

@eqy (Collaborator, Author) commented Aug 4, 2023

@pytorchmergebot rebase

@pytorchmergebot (Collaborator) commented:
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator) commented:
Successfully rebased graphsuid onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout graphsuid && git pull --rebase)

@eqy added the ciflow/trunk and ciflow/periodic labels on Aug 7, 2023
@eqy (Collaborator, Author) commented Aug 7, 2023

@pytorchmergebot rebase

@pytorchmergebot (Collaborator) commented:
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator) commented:
Successfully rebased graphsuid onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout graphsuid && git pull --rebase)

@eqy (Collaborator, Author) commented Aug 8, 2023

@pytorchmergebot merge

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

Cyril-Anto pushed a commit to Cyril-Anto/pytorch that referenced this pull request Aug 17, 2023
…graph capture 2 (pytorch#106570)

Pull Request resolved: pytorch#106570
Approved by: https://github.com/zdevito

Labels

ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cuda graphs (Ability to capture and then replay streams of CUDA kernels), module: CUDACachingAllocator, open source, topic: not user facing (topic category)


3 participants