Skip to content

Conversation

@wconstab
Copy link
Contributor

@wconstab wconstab commented Feb 22, 2024

Stack from ghstack (oldest at bottom):

Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @yf225 @chauhang

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Feb 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120430

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit b54bdbc with merge base 92a2b21 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225

[ghstack-poisoned]
@wconstab wconstab changed the title working on coalesced Make coalescedCollective work with FlightRecorder Feb 23, 2024
@wconstab wconstab changed the title Make coalescedCollective work with FlightRecorder [C10D] Fix coalescedCollective op Flight Recording Feb 23, 2024
wconstab added a commit that referenced this pull request Feb 23, 2024
ghstack-source-id: 37ba636
Pull Request resolved: #120430

Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.


[ghstack-poisoned]
wconstab added a commit that referenced this pull request Feb 23, 2024
The coalescing manager api works by accumulating operations in python
via a contextmanager, and then making one call into c++ to an <op>_coalesced
API.  It has limited support for ops and has been added recently to
avoid overheads of making individual py-cpp calls.

This complicates flight recording..

For now, flight recording of coalescing_manager collectives is less
detailed than cpp coalesced collectives.

TODO test the other 2 ops that are supported by coalescing_manager

ghstack-source-id: c85d4eb
Pull Request resolved: #120430

Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.


[ghstack-poisoned]
wconstab added a commit that referenced this pull request Feb 24, 2024
The coalescing manager api works by accumulating operations in python
via a contextmanager, and then making one call into c++ to an <op>_coalesced
API.  It has limited support for ops and has been added recently to
avoid overheads of making individual py-cpp calls.

This complicates flight recording..

For now, flight recording of coalescing_manager collectives is less
detailed than cpp coalesced collectives.

TODO test the other 2 ops that are supported by coalescing_manager

ghstack-source-id: 604bda7
Pull Request resolved: #120430

Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.


[ghstack-poisoned]
wconstab added a commit that referenced this pull request Feb 27, 2024
The coalescing manager api works by accumulating operations in python
via a contextmanager, and then making one call into c++ to an <op>_coalesced
API.  It has limited support for ops and has been added recently to
avoid overheads of making individual py-cpp calls.

This complicates flight recording..

For now, flight recording of coalescing_manager collectives is less
detailed than cpp coalesced collectives.

TODO test the other 2 ops that are supported by coalescing_manager

ghstack-source-id: a00a1d0
Pull Request resolved: #120430
Comment on lines 2608 to 2611
workEnqueue(work);
// TODO(whc) did I delete a relevant codepath here?
// at::cuda::CUDAGraph::dec_pending_event_queries();
// }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think you would need to put back the capture_status == c10::cuda::CaptureStatus::None check and the else path.

Would be also nice to pack everything:

inc_pending_event_queries()
check capture_status
dec_pending_event_queries()

inside workEnqueue so that we don't have these checks everywhere. (can be in a different PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my reasoning here was that coalescing_state_ should NEVER be active inside this function. Becuase this function is part of the 'coalescing manager' implementation, which would call its own startCoalescing function later.

If that is correct, i'll add an assertion here to clarify it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure how the cudagraph capture path works. Is it exclusive with coalescing or is it additive?

should the new logic be like this? I don't understand why the cuda event increment is unconditional and the decrement is conditional.

  // Notify graphs before we check the capture status preemptively
  at::cuda::CUDAGraph::inc_pending_event_queries();

  workEnqueue(work);
  if (capture_status != c10::cuda::CaptureStatus::None) {
    at::cuda::CUDAGraph::dec_pending_event_queries();
  }

Copy link
Contributor

@kwen2501 kwen2501 Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion.

What I meant is like this:

  at::cuda::CUDAGraph::inc_pending_event_queries();

  if (capture_status == c10::cuda::CaptureStatus::None) {
    workEnqueue(work);
  } else {
    at::cuda::CUDAGraph::dec_pending_event_queries();
  }

It preemptively increases the pending work number, if this work is actually enqueued, then this is fine.
If this work ends up not being enqueued because CUDA Graph capture is active, then it winds back the pending work number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check elsewhere in the code how workEnqueue(work) is conditioned. I guess they are more or less done the same way. Hence I propose putting the above CUDA Graph logic into the workEnqueue function itself, so that we don't have the if-else at every call site.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.

is it necessary to preemptively increment and then decrement? or can we just increment when we workEnqueue nad never decrement?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a reason. See here:
#110665 (comment)
Cc @eqy


Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.


[ghstack-poisoned]

Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.


[ghstack-poisoned]
wconstab added a commit that referenced this pull request Feb 28, 2024
The coalescing manager api works by accumulating operations in python
via a contextmanager, and then making one call into c++ to an <op>_coalesced
API.  It has limited support for ops and has been added recently to
avoid overheads of making individual py-cpp calls.

This complicates flight recording..

For now, flight recording of coalescing_manager collectives is less
detailed than cpp coalesced collectives.

TODO test the other 2 ops that are supported by coalescing_manager

ghstack-source-id: b3e7e68
Pull Request resolved: #120430

Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.


[ghstack-poisoned]
wconstab added a commit that referenced this pull request Feb 28, 2024
The coalescing manager api works by accumulating operations in python
via a contextmanager, and then making one call into c++ to an <op>_coalesced
API.  It has limited support for ops and has been added recently to
avoid overheads of making individual py-cpp calls.

This complicates flight recording..

For now, flight recording of coalescing_manager collectives is less
detailed than cpp coalesced collectives.

TODO test the other 2 ops that are supported by coalescing_manager

ghstack-source-id: ddb126a
Pull Request resolved: #120430

Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.


[ghstack-poisoned]
wconstab added a commit that referenced this pull request Mar 1, 2024
The coalescing manager api works by accumulating operations in python
via a contextmanager, and then making one call into c++ to an <op>_coalesced
API.  It has limited support for ops and has been added recently to
avoid overheads of making individual py-cpp calls.

This complicates flight recording..

For now, flight recording of coalescing_manager collectives is less
detailed than cpp coalesced collectives.

TODO test the other 2 ops that are supported by coalescing_manager

ghstack-source-id: cecff23
Pull Request resolved: #120430
Copy link
Contributor

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the last fix about workEnqueue. LGTM.

<< "Launching ProcessGroupNCCL abort asynchrounously.";
std::future<bool> fut = std::async(
std::launch::async, [this, &reason]() { return this->abort(reason); });

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this explanation is accurate; one nit would maybe be editing "(a) capturing is already in progress, so we can not enqueue" to "(a) capturing is already in progress (we cannot enqueue in this case)."


Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.


[ghstack-poisoned]
wconstab added a commit that referenced this pull request Mar 2, 2024
The coalescing manager api works by accumulating operations in python
via a contextmanager, and then making one call into c++ to an <op>_coalesced
API.  It has limited support for ops and has been added recently to
avoid overheads of making individual py-cpp calls.

This complicates flight recording..

For now, flight recording of coalescing_manager collectives is less
detailed than cpp coalesced collectives.

TODO test the other 2 ops that are supported by coalescing_manager

ghstack-source-id: 35d0aac
Pull Request resolved: #120430
@wconstab
Copy link
Contributor Author

wconstab commented Mar 2, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 2, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers


Also noticed and filed #120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.


cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Mar 13, 2024
The coalescing manager api works by accumulating operations in python
via a contextmanager, and then making one call into c++ to an <op>_coalesced
API.  It has limited support for ops and has been added recently to
avoid overheads of making individual py-cpp calls.

This complicates flight recording..

For now, flight recording of coalescing_manager collectives is less
detailed than cpp coalesced collectives.

TODO test the other 2 ops that are supported by coalescing_manager

ghstack-source-id: e8dac5e
Pull Request resolved: #120430
@wconstab
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions github-actions bot deleted the gh/wconstab/274/head branch April 13, 2024 01:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants