
[train] RayTrainReportCallback should only save a checkpoint on rank 0 for xgboost/lightgbm #45083

Merged

justinvyu merged 11 commits into ray-project:master from xgb_lgb_rank0_only_ckpt on May 9, 2024

Conversation

@justinvyu (Contributor) commented May 1, 2024

Why are these changes needed?

This PR adds a condition to only save and report a checkpoint on the rank 0 worker for xgboost and lightgbm. This prevents unnecessary checkpoints from being created, since all data-parallel workers hold the same model state. Note: this also accounts for usage in Tune, where ray.train.get_context().get_world_rank() returns None.
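For illustration, the gating described above boils down to something like the following sketch (the helper name _should_save_checkpoint is hypothetical and not necessarily the exact code in this PR):

# Hypothetical sketch of the rank-0 gating, not the exact diff in this PR.
# Under Tune (no Ray Train workers), get_world_rank() returns None and the
# single process should still save/report the checkpoint.
import ray.train

def _should_save_checkpoint() -> bool:
    world_rank = ray.train.get_context().get_world_rank()
    return world_rank is None or world_rank == 0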

This also includes a drive-by fix for checkpoint_at_end in the xgboost callback: we no longer report a separate end-of-training checkpoint if the checkpoint frequency happens to line up with the last iteration. For example, when saving every 5 iterations, the sequence used to be: [iter] 0 1 2 3 4 (checkpoint) 5 6 7 8 9 (checkpoint) (checkpoint); after this fix, the trailing "duplicate" checkpoint at the end is no longer created.
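A rough sketch of the deduplication idea (hypothetical class and method names; the real callback tracks a _last_checkpoint_iteration attribute, as discussed in the review thread below):

# Hypothetical sketch, not the PR's exact code: remember the last iteration
# that produced a checkpoint, and skip checkpoint_at_end if that already
# covered the final boosting round.
class CheckpointDedupSketch:
    def __init__(self, frequency: int, checkpoint_at_end: bool = True):
        self._frequency = frequency
        self._checkpoint_at_end = checkpoint_at_end
        self._last_checkpoint_iteration = None

    def after_iteration(self, iteration: int) -> None:
        # Periodic checkpoint, e.g. frequency=5 checkpoints at iterations 4, 9, ...
        if self._frequency > 0 and (iteration + 1) % self._frequency == 0:
            self._last_checkpoint_iteration = iteration
            # ... save and report the periodic checkpoint here ...

    def after_training(self, final_iteration: int) -> None:
        # Skip the end-of-training checkpoint if the last periodic checkpoint
        # already covered the final iteration.
        if self._checkpoint_at_end and self._last_checkpoint_iteration != final_iteration:
            # ... save and report the end-of-training checkpoint here ...
            pass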

Related issue number

Reported on slack

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@justinvyu changed the title [train] `RayTrain… → [train] RayTrainReportCallback should only save a checkpoint on rank 0 for xgboost/lightgbm (May 1, 2024)
Comment on lines 150 to 175
callback = RayTrainReportCallback(frequency=2, checkpoint_at_end=True)

booster = mock.MagicMock()

with mock.patch("ray.train.report") as mock_report, mock.patch(
    "ray.train.get_context"
) as mock_get_context:
    mock_context = mock.MagicMock()
    mock_context.get_world_rank.return_value = rank
    mock_get_context.return_value = mock_context

    booster.num_boosted_rounds.return_value = 2
    callback.after_iteration(booster, epoch=1, evals_log={})

    # Only rank 0 should report a checkpoint on iterations that match `frequency`.
    reported_checkpoint = bool(mock_report.call_args.kwargs.get("checkpoint"))
    if rank == 0:
        assert reported_checkpoint
    else:
        assert not reported_checkpoint

    booster.num_boosted_rounds.return_value = 3
    callback.after_iteration(booster, epoch=2, evals_log={})
    # No worker should report a checkpoint on off-frequency iterations.
    reported_checkpoint = bool(mock_report.call_args.kwargs.get("checkpoint"))
    assert not reported_checkpoint
justinvyu (Contributor, Author) commented:
Is this test unnecessarily complicated? Any better ways to do this?

Contributor commented:
Just went through some MagicMock docs. I think this unit-testing function looks good to me :)
For simplicity, we could just check the RayTrainReportCallback._last_checkpoint_iteration attribute after callback.after_iteration() is called. That way, we can avoid mocking ray.train.report. The code from lines 154–169 could then be simplified to:

with mock.patch("ray.train.get_context") as mock_get_context:
    mock_context = mock.MagicMock()
    mock_context.get_world_rank.return_value = rank
    mock_get_context.return_value = mock_context

    booster.num_boosted_rounds.return_value = 2
    callback.after_iteration(booster, epoch=1, evals_log={})
    assert callback._last_checkpoint_iteration == 1  # same as the epoch number

However, this approach assumes ray.train.report and RayTrainReportCallback._get_checkpoint() always work, and it's also a little indirect. Correct me if I am wrong :)

justinvyu (Contributor, Author) commented:
You're right, the test was a little too indirect and I was able to find a similar way to simplify it. PTAL

Contributor commented:
Nice simplification! Learnt a lot from the code 👍

python/ray/train/tests/test_xgboost_trainer.py (outdated review thread, resolved)

@matthewdeng (Contributor) left a comment

Clean!

@hongpeng-guo (Contributor) left a comment

LGTM!

@justinvyu justinvyu merged commit 112e859 into ray-project:master May 9, 2024
5 checks passed
@justinvyu justinvyu deleted the xgb_lgb_rank0_only_ckpt branch May 9, 2024 19:56
liuxsh9 pushed a commit to liuxsh9/ray that referenced this pull request May 10, 2024
…k 0 for xgboost/lightgbm (ray-project#45083)

This PR adds a condition to only save and report a checkpoint on the
rank 0 worker for xgboost and lightgbm. This prevents unnecessary
checkpoints being created, since all data parallel workers have the same
model states. Note: this also accounts for usage in Tune, where
`ray.train.get_context().get_world_rank()` returns `None`.

Fix `checkpoint_at_end` for the xgboost callback to avoid duplicate checkpoints.
---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Hongpeng Guo <hpguo@anyscale.com>
HenryZJY pushed a commit to HenryZJY/ray that referenced this pull request May 10, 2024
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request May 13, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Jun 7, 2024
GabeChurch pushed a commit to GabeChurch/ray that referenced this pull request Jun 11, 2024