[NCCL][CUDA][CUDA Graphs] Flush enqueued work before starting a graph capture 2 #110665
eqy wants to merge 4 commits into pytorch:main
Conversation
CI status (Dr. CI, hud.pytorch.org/pr/110665): as of commit 59dc4ee with merge base 543a763 — 1 new failure, plus flaky/unstable failures present on trunk and unrelated to this PR.
Aidyn-A left a comment:

This seems like a less intrusive but still effective solution. However, I think it would be safer and more precise to place the increment and decrement before the work is enqueued and after it is completed, respectively. I do not really understand why the increment and decrement are currently placed as they are.
aten/src/ATen/cuda/CUDAGraph.cpp (outdated):

     * describes memory management for captures.
     */

    int CUDAGraph::pending_event_queries = 0;
It would be easier to make it atomic, so you do not need to store a mutex object and lock/unlock it every time you modify it.
    if (!coalescing_state_ && capture_status == c10::cuda::CaptureStatus::None) {
      workEnqueue(work);
    } else {
      at::cuda::CUDAGraph::dec_pending_event_queries();

See the comment below on this: it's not the prettiest, but I believe it's safe to increment before we check; otherwise it is potentially too late to notify the graphs-side code.
(On the same hunk:) I suggest incrementing before workEnqueue and decrementing after the work is completed.
I thought about this, but we must preemptively increment here: incrementing after the capture status is checked is potentially unsafe. We could have an interleaving where the capture status observed here is None while, at the same time, the graphs code begins a capture (updating the capture status); it is then too late to notify the graphs code that there is work enqueued. By preemptively incrementing, we tell the graphs code that work may be enqueued, until we are certain that either (1) the work is done, or (2) we never actually enqueue it because we learned that a capture has begun.

In other words, we increment as early as possible to avoid any chance of the graphs code erroneously assuming that nothing is or could be enqueued while the NCCL code is still between the None check and workEnqueue.
Bad interaction in excessive detail:

1. NCCL (autograd thread): sees that the capture status is None.
2. graphs (forward thread): sets the capture status.
3. graphs (forward thread): sees that the number of pending event queries is zero.
4. graphs (forward thread): starts the capture.
5. NCCL (autograd thread): increments the number of pending event queries [this is now too late].
6. NCCL (autograd thread): enqueues the work.

We are now in a state where the watchdog can query the enqueued work's events and crash the capture.
I see, now it makes sense. Thanks for the explanation! 👍

The inc/dec pattern that is a bit tricky handles the case where we were about to enqueue work but found out just in time that a graph capture is starting, so we bail out and decrement without actually enqueuing. The decrement for the case where the work actually completed is done in the work-cleanup loop, as expected.
    } else {
      it = workMetaList_.erase(it);
    }
    at::cuda::CUDAGraph::dec_pending_event_queries();

@Aidyn-A this is where we decrement, when the work was actually enqueued and is done and cleaned up.
aten/src/ATen/cuda/CUDAGraph.h (outdated)
@pytorchmergebot rebase

CC @jithunnair-amd; the added test is currently disabled on ROCm because it fails there. Not sure if similar changes are needed for the AMD backends.
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

Successfully rebased.
@pytorchmergebot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

Successfully rebased (25cb1cc to 59dc4ee).
@pytorchmergebot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
… capture 2 (2.1 release cherry-pick of #110665)
Alternative to #104487.
Several have chimed in that #104487 introduces a dependency from torch (c10d) to ATen, which is considered backward and messy. This alternative switches the dependency relationship at the cost of requiring graphs to potentially do some polling before the capture.
CC @huydhn @malfet @Aidyn-A @ptrblck