[C10D] Fix coalescedCollective op Flight Recording #120430

wconstab · 2024-02-22T19:40:13Z

Stack from ghstack (oldest at bottom):

-> [C10D] Fix coalescedCollective op Flight Recording #120430

Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang

pytorch-bot · 2024-02-22T19:40:16Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120430

✅ You can merge normally! (1 Unrelated Failure)

As of commit b54bdbc with merge base 92a2b21:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / linux-focal-cuda12.1-py3.10-gcc9 / test (nogpu_AVX512, 1, 1, linux.2xlarge) (gh)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

The coalescing manager API works by accumulating operations in Python via a context manager and then making a single call into C++ to an <op>_coalesced API. It supports only a limited set of ops and was added recently to avoid the overhead of making individual Python-to-C++ calls. This complicates flight recording. For now, flight recording of coalescing_manager collectives is less detailed than that of C++ coalesced collectives.

TODO: test the other 2 ops that are supported by coalescing_manager.

ghstack-source-id: c85d4eb
Pull Request resolved: #120430
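To make the accumulate-then-flush shape described above concrete, here is a minimal toy sketch. It is not the PyTorch implementation: `ToyBackend`, `ToyCoalescingScope`, `allreduceCoalesced`, and `runToyExample` are hypothetical names invented for illustration, and tensors are represented as plain strings. The point is only that a recorder sitting behind the single coalesced entry point sees one call for the whole batch, which is one way the per-op detail could end up coarser.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the C++ backend; not the ProcessGroupNCCL API.
struct ToyBackend {
  std::size_t boundaryCalls = 0;             // calls that crossed into the backend
  std::vector<std::size_t> recordedBatches;  // one trace entry per call (batch size)

  // Models a single <op>_coalesced entry point: one call, many tensors.
  void allreduceCoalesced(const std::vector<std::string>& tensors) {
    ++boundaryCalls;
    recordedBatches.push_back(tensors.size());
  }
};

// Models the context manager: ops are queued while the scope is alive,
// and a single backend call is issued when the scope ends.
class ToyCoalescingScope {
 public:
  explicit ToyCoalescingScope(ToyBackend& backend) : backend_(backend) {}
  ~ToyCoalescingScope() {
    if (!pending_.empty()) {
      backend_.allreduceCoalesced(pending_);
    }
  }
  void allreduce(std::string tensorName) {
    pending_.push_back(std::move(tensorName));  // deferred, nothing launched yet
  }

 private:
  ToyBackend& backend_;
  std::vector<std::string> pending_;
};

// Queues three allreduces inside one scope and returns the backend state.
ToyBackend runToyExample() {
  ToyBackend backend;
  {
    ToyCoalescingScope scope(backend);
    scope.allreduce("a");
    scope.allreduce("b");
    scope.allreduce("c");
    assert(backend.boundaryCalls == 0);  // still accumulating
  }  // scope exit flushes the whole batch in one call
  return backend;
}
```

In this toy, three queued ops produce one backend call and one trace entry of size 3. How the real flight recorder attributes sizes and op types inside a coalesced batch is exactly what this PR and #120516 are about, so the sketch should not be read as describing that behavior.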

kwen2501 · 2024-02-27T22:14:06Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

+  workEnqueue(work);
+  // TODO(whc) did I delete a relevant codepath here?
+  // at::cuda::CUDAGraph::dec_pending_event_queries();
+  // }


Yes, I think you would need to put back the capture_status == c10::cuda::CaptureStatus::None check and the else path.

It would also be nice to pack everything:

inc_pending_event_queries()
check capture_status
dec_pending_event_queries()

inside workEnqueue so that we don't have these checks at every call site. (This can be done in a different PR.)

I think my reasoning here was that coalescing_state_ should NEVER be active inside this function, because this function is part of the 'coalescing manager' implementation, which would call its own startCoalescing function later.

If that is correct, I'll add an assertion here to clarify it.

I'm not sure how the CUDA graph capture path works. Is it exclusive with coalescing, or is it additive?

Should the new logic be like this? I don't understand why the CUDA event increment is unconditional and the decrement is conditional.

// Notify graphs before we check the capture status preemptively
at::cuda::CUDAGraph::inc_pending_event_queries();
workEnqueue(work);
if (capture_status != c10::cuda::CaptureStatus::None) {
  at::cuda::CUDAGraph::dec_pending_event_queries();
}

Sorry for the confusion.

What I meant is like this:

at::cuda::CUDAGraph::inc_pending_event_queries();
if (capture_status == c10::cuda::CaptureStatus::None) {
  workEnqueue(work);
} else {
  at::cuda::CUDAGraph::dec_pending_event_queries();
}

This preemptively increments the pending work count. If the work is actually enqueued, that is fine.
If the work ends up not being enqueued because CUDA Graph capture is active, it rolls the pending work count back.

You can check elsewhere in the code how workEnqueue(work) is conditioned. I guess they are more or less done the same way. Hence I propose putting the above CUDA Graph logic into the workEnqueue function itself, so that we don't have the if-else at every call site.
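The proposal of moving the increment, capture check, and rollback into the enqueue function itself can be sketched with toy types. This is only an illustration under assumptions: `ToyCaptureStatus`, `ToyGraphCounter`, `ToyProcessGroup`, and `enqueueUnlessCapturing` are hypothetical names, the capture status is passed in as a parameter (so the ordering of the status query relative to the increment is not modeled), and the toy never retires work, so the count stays raised for enqueued items.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for c10::cuda::CaptureStatus; only two states are modeled.
enum class ToyCaptureStatus { None, Active };

// Toy stand-in for the pending-event-query counter kept by at::cuda::CUDAGraph.
struct ToyGraphCounter {
  int pending = 0;
  void incPendingEventQueries() { ++pending; }
  void decPendingEventQueries() {
    assert(pending > 0);  // a rollback must match an earlier increment
    --pending;
  }
};

struct ToyProcessGroup {
  ToyGraphCounter counter;
  std::vector<int> workQueue;  // ids of work items that were enqueued

  // Hypothetical helper: bumps the counter first, then either enqueues the
  // work or undoes the bump when a capture is in progress.
  // Returns true if the work was enqueued.
  bool enqueueUnlessCapturing(int workId, ToyCaptureStatus status) {
    counter.incPendingEventQueries();  // preemptive increment
    if (status == ToyCaptureStatus::None) {
      workQueue.push_back(workId);
      return true;
    }
    counter.decPendingEventQueries();  // not enqueued, so roll back
    return false;
  }
};
```

With this shape, call sites would reduce to a single call such as `pg.enqueueUnlessCapturing(id, status)` and no longer repeat the if-else. Whether the real workEnqueue should absorb the check this way is left to the follow-up PR mentioned above.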

hmm.

Is it necessary to preemptively increment and then decrement? Or can we just increment when we call workEnqueue and never decrement?

There was a reason. See here:
#110665 (comment)
Cc @eqy

kwen2501

Thanks for the last fix about workEnqueue. LGTM.

eqy · 2024-03-01T23:08:52Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

            << "Launching ProcessGroupNCCL abort asynchrounously.";
  std::future<bool> fut = std::async(
      std::launch::async, [this, &reason]() { return this->abort(reason); });



Yes, I think this explanation is accurate; one nit would maybe be editing "(a) capturing is already in progress, so we can not enqueue" to "(a) capturing is already in progress (we cannot enqueue in this case)."

wconstab · 2024-03-02T00:06:21Z

@pytorchbot merge

pytorchmergebot · 2024-03-02T00:08:55Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2024-03-02T00:46:24Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Lint / lintrunner-clang / linux-job

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

wconstab · 2024-03-13T21:04:52Z

@pytorchbot merge

pytorchmergebot · 2024-03-13T21:06:41Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

working on coalesced

bbd12b9

wconstab mentioned this pull request Feb 22, 2024

[C10D] Add test case for crashing isend_irecv dump_entries combo #119757

Closed

wconstab mentioned this pull request Feb 22, 2024

[C10D] Fix pointToPoint op Flight Recording #120270

Closed

pytorch-bot bot added the release notes: distributed (c10d) release notes category label Feb 22, 2024

This was referenced Feb 22, 2024

WIP debug why timing=true is crashing #120350

Closed

wip add test for individual p2p and start fixing input/output sizes for non coalescedo ps #120410

Closed

github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Feb 22, 2024

Update on "working on coalesced"

b2fe462

Update on "Make coalescedCollective work with FlightRecorder"

c1ad42b

wconstab changed the title ~~working on coalesced~~ Make coalescedCollective work with FlightRecorder Feb 23, 2024

Update on "Make coalescedCollective work with FlightRecorder"

9c7c46b

Update on "[C10D] Fix coalescedCollective op Flight Recording"

5f93033

wconstab changed the title ~~Make coalescedCollective work with FlightRecorder~~ [C10D] Fix coalescedCollective op Flight Recording Feb 23, 2024

wconstab added a commit that referenced this pull request Feb 23, 2024

[C10D] Fix coalescedCollective op Flight Recording

4fce728

ghstack-source-id: 37ba636 Pull Request resolved: #120430

Update on "[C10D] Fix coalescedCollective op Flight Recording"

7724ad7

Update on "[C10D] Fix coalescedCollective op Flight Recording"

f1aee2b

Update on "[C10D] Fix coalescedCollective op Flight Recording"

125da9a

Update on "[C10D] Fix coalescedCollective op Flight Recording"

2a31b87

wconstab mentioned this pull request Feb 27, 2024

[C10D] Fix logic for default group=None in _set_pg_timeout #120686

Closed

Update on "[C10D] Fix coalescedCollective op Flight Recording"

b91dac9

kwen2501 reviewed Feb 27, 2024

wconstab mentioned this pull request Feb 28, 2024

[C10D] Add ProcessGroup op_id to track ops inside coalescing region #120745

Closed

Update on "[C10D] Fix coalescedCollective op Flight Recording"

ffb3768

Update on "[C10D] Fix coalescedCollective op Flight Recording"

0d6cb96

Update on "[C10D] Fix coalescedCollective op Flight Recording"

6b86000

Update on "[C10D] Fix coalescedCollective op Flight Recording"

87abd4a

kwen2501 approved these changes Mar 1, 2024

eqy reviewed Mar 1, 2024

Update on "[C10D] Fix coalescedCollective op Flight Recording"

b7be886

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 2, 2024

pytorchmergebot added the merging label Mar 2, 2024

pytorchmergebot removed the merging label Mar 2, 2024

pytorchmergebot added the merging label Mar 13, 2024

pytorchmergebot added the Merged label Mar 13, 2024

pytorchmergebot closed this in 7e076c7 Mar 13, 2024

pytorchmergebot removed the merging label Mar 13, 2024

github-actions bot deleted the gh/wconstab/274/head branch April 13, 2024 01:47
