[Train] Fix lightning checkpoint report callback #42751
Merged
Why are these changes needed?
In `RayTrainReportCallback`, we were using `torch.distributed.barrier()` to coordinate workers without specifying the device argument. If users do not set up the torch CUDA device themselves, the collective calls will all be bound to the default device (`cuda:0`). This can mess up the device map of the barrier call.
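To illustrate the hazard, here is a minimal sketch (not the actual Ray callback code) of the problematic pattern, assuming a `torchrun` launch with the NCCL backend; the `worker` function and variable names are illustrative:

```python
import os
import torch
import torch.distributed as dist

def worker():
    # Assumes launch via `torchrun --nproc-per-node=N`; NCCL backend for GPUs.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])

    # Hazard: torch.cuda.set_device(local_rank) was never called, so
    # torch.cuda.current_device() is cuda:0 on every rank and the barrier's
    # dummy allreduce binds every rank's communicator to device 0.
    dist.barrier()

    # A later collective that uses each rank's own GPU now disagrees with the
    # device mapping created above and can hang.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)

if __name__ == "__main__":
    worker()
```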
Why does it hang?
pytorch/pytorch#53658
Internally, `torch.distributed.barrier` calls `allreduce` on a dummy tensor. The dummy tensor is created on the GPU specified by `barrier(device_ids=)`. NCCL will try to create a communicator for the current process on that device if one doesn't exist yet, which is a blocking operation.

When users don't specify `device_ids`, the barrier uses `torch.cuda.current_device`, which can be `cuda:0` on every rank if no `torch.cuda.set_device` was called beforehand. The first call works, but a subsequent call hangs if it binds processes to different device ids.

Therefore, to avoid deadlock, the rule of thumb is to always explicitly specify a different device id for each worker for all collective calls.
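Following that rule of thumb, a sketch of the safe pattern (again illustrative, not the Ray callback itself) is to bind each rank to its own GPU before any collective, or to pass the device explicitly to the barrier:

```python
import os
import torch
import torch.distributed as dist

def worker():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun

    # Option 1: make this rank's GPU the current device, so the dummy tensor
    # inside barrier() (and every other collective) lands on the right card.
    torch.cuda.set_device(local_rank)
    dist.barrier()

    # Option 2: tell the barrier explicitly which device to use.
    dist.barrier(device_ids=[local_rank])

if __name__ == "__main__":
    worker()
```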
Solution
This PR switches from `torch.distributed.barrier()` to Lightning's `trainer.strategy.barrier()`, which explicitly specifies the CUDA device for each barrier call to ensure a correct device mapping across workers.
Related issue number
Closes #42927
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.