
[Train] Fix lightning checkpoint report callback #42751

Merged

Conversation


@woshiyyya woshiyyya commented Jan 26, 2024

Why are these changes needed?

In RayTrainReportCallback, we were using torch.distributed.barrier() to coordinate workers without specifying the device argument. If users do not set the torch CUDA device themselves, all of the collective calls get bound to the default device (cuda:0), which can break the device mapping of the barrier call.
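
For illustration, a minimal sketch of the old pattern (hypothetical code, not the actual RayTrainReportCallback implementation; the class name and checkpoint path are made up):

import torch.distributed as dist
import lightning.pytorch as pl

class ReportCallbackSketch(pl.Callback):
    # Hypothetical callback mirroring the problematic pattern.
    def on_train_epoch_end(self, trainer, pl_module):
        # Lightning writes the checkpoint from rank 0 and syncs internally.
        trainer.save_checkpoint("/tmp/sketch.ckpt")
        # Coordinate workers with a bare barrier: no device is passed, and if
        # torch.cuda.set_device() was never called, the barrier's dummy tensor
        # lands on cuda:0 for every process.
        dist.barrier()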

Why does it hang?

pytorch/pytorch#53658

Internally, torch.distributed.barrier calls allreduce on a dummy tensor (see the linked PyTorch issue). The dummy tensor is created on the GPU specified by barrier(device_ids=...). If the current process has no NCCL communicator on that device yet, NCCL creates one, which is a blocking operation.

# rank 0 on node 0
collective1(tensor on dev 0) # creates new communicator, OK
collective2(tensor on dev 0) # doesn't try to create new communicator

# rank 1 on node 1
collective1(tensor on dev 0) # creates new communicator, OK
collective2(tensor on dev 1) # tries to create new communicator, hangs

When users don't specify device_ids, the barrier falls back to torch.cuda.current_device(), which is cuda:0 on every worker if torch.cuda.set_device was never called. The first collective call works, but a subsequent collective hangs if it binds the same processes to different devices, because creating the new communicator blocks until all ranks join.

Therefore, to avoid deadlocks, the rule of thumb is to always explicitly assign each worker its own device for every collective call.
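
As a sketch of that rule of thumb with plain torch.distributed (assuming an NCCL process group is already initialized and the launcher, e.g. torchrun, exports LOCAL_RANK):

import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])

# Pin this process to its own GPU before any collective, so NCCL creates
# the communicator on the right device.
torch.cuda.set_device(local_rank)

# device_ids also makes the device explicit for the barrier's internal
# allreduce on the dummy tensor.
dist.barrier(device_ids=[local_rank])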

Solution

This PR switches from torch.distributed.barrier() to Lightning's trainer.strategy.barrier(), which explicitly specifies the CUDA device for each barrier call so that the device mapping stays consistent across workers.
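
A minimal sketch of the fixed pattern (again hypothetical, not the actual RayTrainReportCallback code):

import lightning.pytorch as pl

class ReportCallbackSketch(pl.Callback):
    # Hypothetical callback using the strategy barrier instead.
    def on_train_epoch_end(self, trainer, pl_module):
        trainer.save_checkpoint("/tmp/sketch.ckpt")
        # The strategy barrier resolves each worker's root device, so the
        # collective is issued on the correct GPU even if user code never
        # called torch.cuda.set_device().
        trainer.strategy.barrier()

Because the strategy knows every worker's root device, the barrier's dummy allreduce is bound consistently across ranks and the NCCL communicator creation no longer deadlocks.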

Related issue number

Closes #42927

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

woshiyyya and others added 3 commits November 1, 2023 18:28
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <xiaoyunxuan1998@gmail.com>
Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com>
@woshiyyya woshiyyya marked this pull request as ready for review January 31, 2024 01:30
@woshiyyya woshiyyya changed the title [test] Fix lightning report callback [Train] Fix lightning checkpoint report callback Jan 31, 2024
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>
Signed-off-by: woshiyyya <xiaoyunxuan1998@gmail.com>

@matthewdeng matthewdeng left a comment


Awesome investigation!

@matthewdeng matthewdeng merged commit c3de7f8 into ray-project:master Feb 2, 2024
9 checks passed
tterrysun pushed a commit to tterrysun/ray that referenced this pull request Feb 14, 2024
Signed-off-by: Yunxuan Xiao <yunxuanx@anyscale.com>
Signed-off-by: tterrysun <terry@anyscale.com>
Development

Successfully merging this pull request may close these issues.

[Train] Unable to find checkpoint folder in Lightning RayTrainReportCallback.