Skip to content

[NCCL][CUDA][CUDA Graphs] Flush enqueued work before starting a graph capture 2#110665

Closed
eqy wants to merge 4 commits intopytorch:mainfrom
eqy:graphincdecnccl
Closed

[NCCL][CUDA][CUDA Graphs] Flush enqueued work before starting a graph capture 2#110665
eqy wants to merge 4 commits intopytorch:mainfrom
eqy:graphincdecnccl

Conversation

@eqy
Copy link
Copy Markdown
Collaborator

@eqy eqy commented Oct 5, 2023

Alternative to #104487.

Several have chimed in that #104487 introduces a dependency from torch (c10d) to ATen, which is considered backward and messy. This alternative switches the dependency relationship at the cost of requiring graphs to potentially do some polling before the capture.

CC @huydhn @malfet @Aidyn-A @ptrblck

@pytorch-bot
Copy link
Copy Markdown

pytorch-bot bot commented Oct 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110665

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 59dc4ee with merge base 543a763 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eqy eqy added open source topic: not user facing topic category ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR module: nccl Problems related to nccl support labels Oct 5, 2023
@soulitzer soulitzer added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 5, 2023
Copy link
Copy Markdown
Collaborator

@Aidyn-A Aidyn-A left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like less intrusive but effective solution. However, I think it would be safer and more precise to place increment and decrement before the work is enqueued and after it is completed respectively. Do not really understand why currently increment and decrement are placed so.

* describes memory management for captures.
*/

int CUDAGraph::pending_event_queries = 0;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be easier to make it atomic, so you do not need to store mutex object and lock/unlock it every time you modify it.

if (!coalescing_state_ && capture_status == c10::cuda::CaptureStatus::None) {
workEnqueue(work);
} else {
at::cuda::CUDAGraph::dec_pending_event_queries();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below on this---it's not the prettiest but I believe it's safe to increment before we check---otherwise it's potentially too late to notify the graphs-side code.

if (!coalescing_state_ && capture_status == c10::cuda::CaptureStatus::None) {
workEnqueue(work);
} else {
at::cuda::CUDAGraph::dec_pending_event_queries();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to increment before workEnqueue and decrement after the work is completed.

Copy link
Copy Markdown
Collaborator Author

@eqy eqy Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but we must preemptively increment here as incrementing after the capture status is checked is potentially unsafe. In this case, we can have an interaction where the observed capture status here is None but at the same time the graphs code begins a capture (updating capture status), but it is now too late to notify graphs code that there is work enqueued. By preemptively incrementing, we notify the graphs code that there may potentially be work enqueued until we are certain that either 1. the work is done, or 2. we don't actually enqueue it as we have received notification that a capture has begun.

In other words, we increment as early as possible to avoid any chance of the graphs code erroneously assuming that nothing is/could be enqueued when there is still a chance that the NCCL code is between the None check and workEnqueue.

Bad interaction in excessive detail:

NCCL (autograd thread) -> I see that the capture status is None
graphs (forward thread) -> I set the capture status now
graphs (forward thread) -> I see that the number of pending event queries is zero
graphs (forward thread) -> I am starting the capture
NCCL (autograd thread) -> I am incrementing the number of pending event queries [this is now too late]
NCCL (autograd thread) -> I am enqueueing the work now
[we are now at a state where the watchdog can query the enqueued work's events and crash the capture]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, now it makes sense. Thanks for the explanation! 👍

@eqy
Copy link
Copy Markdown
Collaborator Author

eqy commented Oct 6, 2023

This seems like less intrusive but effective solution. However, I think it would be safer and more precise to place increment and decrement before the work is enqueued and after it is completed respectively. Do not really understand why currently increment and decrement are placed so.

The inc/dec pattern that is a bit tricky is for handling the case where we were about to enqueue work, but found out just in time that a graph capture is starting, so we bail out and decrement without actually enqueuing.

The decrement for the case where the actual work is completed is done in the work cleanup loop, as expected.

} else {
it = workMetaList_.erase(it);
}
at::cuda::CUDAGraph::dec_pending_event_queries();
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aidyn-A this is where we decrement, when the work is actually enqueud and done and cleaned up.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unused mutex_

@eqy
Copy link
Copy Markdown
Collaborator Author

eqy commented Oct 12, 2023

@pytorchmergebot rebase

@eqy
Copy link
Copy Markdown
Collaborator Author

eqy commented Oct 12, 2023

CC @jithunnair-amd ; currently the added test is disabled on ROCm as it fails, not sure if similar changes are needed for the AMD backends.

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Successfully rebased graphincdecnccl onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout graphincdecnccl && git pull --rebase)

@eqy
Copy link
Copy Markdown
Collaborator Author

eqy commented Oct 18, 2023

@pytorchmergebot rebase

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Successfully rebased graphincdecnccl onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout graphincdecnccl && git pull --rebase)

@malfet malfet self-assigned this Oct 20, 2023
Copy link
Copy Markdown
Collaborator

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @eqy for the PR and @Aidyn-A for the review.

@eqy
Copy link
Copy Markdown
Collaborator Author

eqy commented Oct 20, 2023

@pytorchmergebot merge

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

huydhn pushed a commit that referenced this pull request Oct 27, 2023
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
… capture 2 (pytorch#110665)

Alternative to pytorch#104487.

Several have chimed in that pytorch#104487 introduces a dependency from torch (c10d) to ATen, which is considered backward and messy. This alternative switches the dependency relationship at the cost of requiring graphs to potentially do some polling before the capture.

CC @huydhn @malfet @Aidyn-A @ptrblck
Pull Request resolved: pytorch#110665
Approved by: https://github.com/kwen2501
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
… capture 2 (pytorch#110665)

Alternative to pytorch#104487.

Several have chimed in that pytorch#104487 introduces a dependency from torch (c10d) to ATen, which is considered backward and messy. This alternative switches the dependency relationship at the cost of requiring graphs to potentially do some polling before the capture.

CC @huydhn @malfet @Aidyn-A @ptrblck
Pull Request resolved: pytorch#110665
Approved by: https://github.com/kwen2501
acphile pushed a commit to acphile/pytorch that referenced this pull request Jan 22, 2024
… capture 2 (pytorch#110665)

Alternative to pytorch#104487.

Several have chimed in that pytorch#104487 introduces a dependency from torch (c10d) to ATen, which is considered backward and messy. This alternative switches the dependency relationship at the cost of requiring graphs to potentially do some polling before the capture.

CC @huydhn @malfet @Aidyn-A @ptrblck
Pull Request resolved: pytorch#110665
Approved by: https://github.com/kwen2501
acphile pushed a commit to acphile/pytorch that referenced this pull request Jan 22, 2024
… capture 2 (pytorch#110665)

Alternative to pytorch#104487.

Several have chimed in that pytorch#104487 introduces a dependency from torch (c10d) to ATen, which is considered backward and messy. This alternative switches the dependency relationship at the cost of requiring graphs to potentially do some polling before the capture.

CC @huydhn @malfet @Aidyn-A @ptrblck
Pull Request resolved: pytorch#110665
Approved by: https://github.com/kwen2501
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged module: nccl Problems related to nccl support open source release notes: distributed (c10d) release notes category topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

6 participants