Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ProcessGroupNCCL::alltoall_base needs to call recordStream #46603

Closed

Conversation

jeffdaily
Copy link
Collaborator

For similar reasons as documented in the [Sync Streams] note. For a current example, ProcessGroupNCCL::allgather must also call recordStream and does so already.

The output tensor is created on the default stream (by the application). NCCL/RCCL internally uses another stream (i.e., ncclStream). If we do not record the output tensor on the ncclStream, there is a chance that the output tensor might be deallocated while NCCL/RCCL is using it.

The application is not aware of the ncclStream since it's internal to ProcessGroupNCCL. So, the application cannot record the output tensor on the ncclStream.

Patch originally developed by @sarunyap.

@dr-ci
Copy link

dr-ci bot commented Oct 20, 2020

💊 CI failures summary and remediations

As of commit 72bc60e (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 1 time.

@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 21, 2020
@osalpekar
Copy link
Member

cc @srinivas212

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pritamdamania87 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in ce5bca5.

@jeffdaily jeffdaily deleted the ProcessGroupNCCL_use_recordStream branch October 26, 2020 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Merged module: nccl Problems related to nccl support oncall: distributed Add this issue/PR to distributed oncall triage queue open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants