[NCCL][CUDA][CUDA Graphs] Flush enqueued work before starting a graph capture 2 #110665
eqy wants to merge 4 commits into pytorch:main
Conversation
CI status (Dr. CI, hud.pytorch.org/pr/110665): as of commit 59dc4ee with merge base 543a763 — 1 new failure, plus flaky/unstable failures present on trunk and unrelated to this PR.
Aidyn-A left a comment:

This seems like a less intrusive but still effective solution. However, I think it would be safer and more precise to place the increment and decrement before the work is enqueued and after it is completed, respectively. I do not really understand why the increment and decrement are currently placed as they are.
aten/src/ATen/cuda/CUDAGraph.cpp (outdated):

     * describes memory management for captures.
     */

    int CUDAGraph::pending_event_queries = 0;
It would be easier to make it atomic, so you do not need to store a mutex object and lock/unlock it every time you modify it.
    if (!coalescing_state_ && capture_status == c10::cuda::CaptureStatus::None) {
      workEnqueue(work);
    } else {
      at::cuda::CUDAGraph::dec_pending_event_queries();

See the comment below on this: it's not the prettiest, but I believe it's safe to increment before we check; otherwise it is potentially too late to notify the graphs-side code.
(On the same hunk:) I suggest incrementing before workEnqueue and decrementing after the work is completed.
I thought about this, but we must preemptively increment here: incrementing after the capture status is checked is potentially unsafe. We could have an interleaving where the capture status observed here is None while, at the same time, the graphs code begins a capture (updating the capture status); it is then too late to notify the graphs code that there is work enqueued. By preemptively incrementing, we tell the graphs code that work may be enqueued, until we are certain that either (1) the work is done, or (2) we never actually enqueue it because we learned that a capture has begun.

In other words, we increment as early as possible to avoid any chance of the graphs code erroneously assuming that nothing is or could be enqueued while the NCCL code is still between the None check and workEnqueue.
Bad interaction in excessive detail:

1. NCCL (autograd thread): sees that the capture status is None.
2. graphs (forward thread): sets the capture status.
3. graphs (forward thread): sees that the number of pending event queries is zero.
4. graphs (forward thread): starts the capture.
5. NCCL (autograd thread): increments the number of pending event queries [this is now too late].
6. NCCL (autograd thread): enqueues the work.

We are now in a state where the watchdog can query the enqueued work's events and crash the capture.
I see, now it makes sense. Thanks for the explanation! 👍

The inc/dec pattern that is a bit tricky handles the case where we were about to enqueue work but found out just in time that a graph capture is starting, so we bail out and decrement without actually enqueuing. The decrement for the case where the work actually completed is done in the work-cleanup loop, as expected.
    } else {
      it = workMetaList_.erase(it);
    }
    at::cuda::CUDAGraph::dec_pending_event_queries();

@Aidyn-A this is where we decrement, when the work was actually enqueued and is done and cleaned up.
aten/src/ATen/cuda/CUDAGraph.h (outdated)
@pytorchmergebot rebase

CC @jithunnair-amd; the added test is currently disabled on ROCm because it fails there. Not sure if similar changes are needed for the AMD backends.
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

Successfully rebased.
@pytorchmergebot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

Successfully rebased (25cb1cc to 59dc4ee).
@pytorchmergebot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
… capture 2 (2.1 release cherry-pick of #110665)
Alternative to #104487.
Several have chimed in that #104487 introduces a dependency from torch (c10d) to ATen, which is considered backward and messy. This alternative switches the dependency relationship at the cost of requiring graphs to potentially do some polling before the capture.
CC @huydhn @malfet @Aidyn-A @ptrblck