Comprehensively test NCCL's get_future() API #56838

Open
rohan-varma opened this issue Apr 23, 2021 · 1 comment
Labels
better-engineering: Relatively self-contained tasks for better engineering contributors
oncall: distributed: Add this issue/PR to distributed oncall triage queue
pt_distributed_rampup: Ramp up tasks for new developers on PT distributed
triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@rohan-varma
Member

rohan-varma commented Apr 23, 2021

🚀 Feature

In ProcessGroupNCCL, we added a get_future() API to support gradient compression use cases, where a user can call get_future() to schedule additional callbacks when implementing custom gradient compression algorithms.

However, get_future() can be more generally useful: today a future is created for all NCCL collectives as well as the recv p2p op, but it does not appear to be tested anywhere. It would be great to add tests that call get_future(), enqueue more CUDA operations on the result, and verify that all synchronization happens correctly, to ensure this API works as expected.

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23

@rohan-varma rohan-varma added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Apr 23, 2021
@rohan-varma rohan-varma added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 7, 2021
@rohan-varma
Member Author

@agolynski Do you have any thoughts on this issue? Can it be assigned to you?

@rohan-varma rohan-varma added the better-engineering Relatively self-contained tasks for better engineering contributors label Oct 12, 2021
@rohan-varma rohan-varma added the pt_distributed_rampup Ramp up tasks for new developers on PT distributed label Nov 5, 2021