[air] New `train.Checkpoint` API: Update `Train tests + examples` (batch 1) #38709

justinvyu · 2023-08-22T05:30:53Z

Why are these changes needed?

This PR updates the first batch of `Train tests + examples` CI tests to use the new persistence mode and checkpoint API. Each converted test gets the `new_storage` tag in CI, which excludes it from the old runners, so the tests don't need to stay backwards compatible with the old code path.

Related issue number

#38570

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu · 2023-08-22T05:31:48Z

python/ray/tune/experiment/trial.py

@@ -1093,6 +1093,7 @@ def on_checkpoint(self, checkpoint: Union[_TrackedCheckpoint, _TrainingResult]):
            self.storage.current_checkpoint_index += 1
        else:
            self.run_metadata.checkpoint_manager.on_checkpoint(checkpoint)
+        self.invalidate_json_state()


self.storage is not part of run_metadata, so it's not enough to just invalidate the run_metadata cache. @krfricke
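
To make the caching concern concrete, here is a minimal sketch of the failure mode. `ToyTrial` and its attributes are hypothetical stand-ins, not Ray's actual `Trial` class: it caches its serialized state, and a counter that lives outside the cached metadata (like `self.storage.current_checkpoint_index`) goes stale unless the cache is explicitly invalidated.

```python
import json


class ToyTrial:
    """Hypothetical stand-in for a trial that caches its serialized state."""

    def __init__(self):
        # Plays the role of storage.current_checkpoint_index.
        self.checkpoint_index = 0
        # Plays the role of run_metadata.
        self.metadata = {"iteration": 0}
        self._cached_json = None

    def invalidate_json_state(self):
        self._cached_json = None

    def get_json_state(self):
        # Serialize only when no cached copy exists.
        if self._cached_json is None:
            self._cached_json = json.dumps(
                {"checkpoint_index": self.checkpoint_index, "metadata": self.metadata},
                sort_keys=True,
            )
        return self._cached_json

    def on_checkpoint(self, invalidate):
        self.checkpoint_index += 1
        if invalidate:
            self.invalidate_json_state()


# Without invalidation, the serialized state keeps the old index.
stale = ToyTrial()
stale.get_json_state()
stale.on_checkpoint(invalidate=False)
stale_index = json.loads(stale.get_json_state())["checkpoint_index"]

# With invalidation, the next serialization picks up the new index.
fresh = ToyTrial()
fresh.get_json_state()
fresh.on_checkpoint(invalidate=True)
fresh_index = json.loads(fresh.get_json_state())["checkpoint_index"]

print(stale_index, fresh_index)  # prints "0 1"
```

In this toy model, the added `invalidate_json_state()` call corresponds to the `invalidate=True` branch.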

justinvyu · 2023-08-22T05:34:15Z

python/ray/train/tests/test_torch_trainer.py

+        # TorchCheckpoint.from_model fails, so just save the state dict only.
+        train.report(
+            {}, checkpoint=TorchCheckpoint.from_state_dict(model.module.state_dict())
+        )


@amogkam This test fails if I try to torch.save(model) or pickle.dump(model). Is the fix from #25335 supposed to fix that?

Is this unit test still relevant? Saving the state dict seems to work fine.
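
The thread does not say why serializing the whole model fails. As a generic illustration only, the sketch below uses plain Python with hypothetical `ToyModule` and `ToyWrapper` classes (no torch involved) to show one way that pickling a wrapper object can fail while pickling just the inner state succeeds: the wrapper holds an unpicklable attribute, whereas the state is a plain dict.

```python
import pickle
import threading


class ToyModule:
    """Hypothetical inner model whose state is a plain dict."""

    def __init__(self):
        self.weights = {"w": [0.1, 0.2], "b": [0.0]}

    def state_dict(self):
        return dict(self.weights)


class ToyWrapper:
    """Hypothetical wrapper that carries an unpicklable attribute."""

    def __init__(self, module):
        self.module = module
        self._lock = threading.Lock()  # locks cannot be pickled


model = ToyWrapper(ToyModule())

# Pickling the whole wrapper fails because of the lock attribute.
try:
    pickle.dumps(model)
    whole_object_pickled = True
except (TypeError, pickle.PicklingError):
    whole_object_pickled = False

# Pickling only the inner state works: it is plain data.
restored_state = pickle.loads(pickle.dumps(model.module.state_dict()))

print(whole_object_pickled, restored_state == model.module.state_dict())
```

Whether the actual failure in this test has a similar cause is an open question in the thread.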

justinvyu · 2023-08-22T05:44:04Z

python/ray/train/tests/test_trainer_restore.py

+    # TODO(justinvyu): Failure injection callback doesn't work very well to test this.
+    pytest.skip("Re-enable after coming up with a better way to inject a failure.")
+    monkeypatch.setenv("RAY_AIR_LOCAL_CACHE_DIR", str(tmp_path))


This test is failing for some complicated reasons:

We inject a failure in a callback. Errors in callbacks get handled in different ways.

If we fail during on_trial_save, the error gets raised immediately WITHOUT calling tune_controller.checkpoint one last time. This means that the checkpoint gets processed -> checkpoint 00001 gets deleted, but the experiment state still thinks checkpoint 00001 is the latest one. This is a consequence of raising exactly during on_trial_save + num_to_keep=1.

I tried failing on the next iteration's on_trial_result, but that also didn't work. The trial advanced to the next iteration for some reason.

So, I'm skipping for these (legacy) trainers for now.
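
The sequence described above can be sketched with a toy model. `ToyController` and its methods are hypothetical simplifications, not the real Tune controller: the checkpoint is registered and the old one is pruned (`num_to_keep=1`) before the callback runs, so an exception in the callback skips the final state save.

```python
class ToyController:
    """Hypothetical, simplified model of the sequence described above."""

    def __init__(self, num_to_keep=1):
        self.num_to_keep = num_to_keep
        self.checkpoints_on_disk = []
        self.latest_in_memory = None
        self.persisted_latest = None  # what a later restore would read

    def save_experiment_state(self):
        self.persisted_latest = self.latest_in_memory

    def handle_checkpoint(self, name, on_trial_save):
        # 1. Register the new checkpoint and prune old ones.
        self.checkpoints_on_disk.append(name)
        self.latest_in_memory = name
        while len(self.checkpoints_on_disk) > self.num_to_keep:
            self.checkpoints_on_disk.pop(0)
        # 2. Run the callback; an exception here propagates immediately.
        on_trial_save(name)
        # 3. Persist experiment state (skipped if the callback raised).
        self.save_experiment_state()


def ok_callback(name):
    pass


def failing_callback(name):
    raise RuntimeError("injected failure in on_trial_save")


controller = ToyController(num_to_keep=1)
controller.handle_checkpoint("00001", ok_callback)
try:
    controller.handle_checkpoint("00002", failing_callback)
except RuntimeError:
    pass

# The persisted state still points at a checkpoint that was pruned.
restorable = controller.persisted_latest in controller.checkpoints_on_disk
print(controller.persisted_latest, controller.checkpoints_on_disk, restorable)
# prints: 00001 ['00002'] False
```

With a larger `num_to_keep`, the old checkpoint would still exist in this model, which matches the remark that the problem depends on raising exactly during `on_trial_save` with `num_to_keep=1`.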

Fine with me to skip the test, but for the case where on_trial_save fails (in a custom callback or so), should we make sure this works? (Can be in a follow-up)

Sounds good. I'm not sure why this worked before actually, would be good to get to the bottom of this in a follow-up.

Tracking that here #38734

justinvyu · 2023-08-22T05:50:00Z

python/ray/train/tests/test_gpu_examples.py

+    if not _use_storage_context():
+        # TODO(justinvyu): [skipped_test]
+        pytest.skip("Skipping for now.")


This test is in a different (GPU) test suite, but it shares example code with another test that has been updated, so it is skipped here unless the new storage context is enabled.
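
As a rough sketch of the same flag-gated skip pattern, the example below uses the standard library's `unittest` instead of pytest so that it runs without extra dependencies. The `use_new_storage` helper and the `EXAMPLE_NEW_STORAGE` environment variable are made up for illustration and are not Ray's `_use_storage_context` or its real configuration.

```python
import os
import unittest


def use_new_storage():
    # Hypothetical feature flag, read from a made-up environment variable.
    return os.environ.get("EXAMPLE_NEW_STORAGE", "0") == "1"


class SharedExampleTest(unittest.TestCase):
    def test_shared_example(self):
        if not use_new_storage():
            self.skipTest("Skipping for now.")
        # The real test body would run the shared example here.
        self.assertTrue(True)


def run_suite():
    """Run the test case and return (number skipped, number run)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return len(result.skipped), result.testsRun


os.environ["EXAMPLE_NEW_STORAGE"] = "0"
skipped_when_off, _ = run_suite()

os.environ["EXAMPLE_NEW_STORAGE"] = "1"
skipped_when_on, _ = run_suite()

print(skipped_when_off, skipped_when_on)  # prints "1 0"
```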

krfricke

Test changes lgtm

krfricke · 2023-08-22T12:40:52Z

python/ray/train/tests/test_trainer_restore.py

+    # TODO(justinvyu): Failure injection callback doesn't work very well to test this.
+    pytest.skip("Re-enable after coming up with a better way to inject a failure.")
+    monkeypatch.setenv("RAY_AIR_LOCAL_CACHE_DIR", str(tmp_path))


Fine with me to skip the test, but for the case where on_trial_save fails (in a custom callback or so), should we make sure this works? (Can be in a follow-up)

justinvyu added 29 commits August 21, 2023 12:27 (each non-merge commit carries Signed-off-by: Justin Yu <justinvyu@anyscale.com>)

- ca3c94d exclude converted tests in old codepath CI runners
- 11b8fd0 hvd_pytorch_example (message body: update tag)
- d16f6f3 test_transformers_trainer_steps
- 32e7943 tune_tensorflow_mnist_example
- bcb5173 test_torch_trainer (message body: [test_torch_trainer] remove commented test; [test_torch_trainer] finish)
- ad3ab4e [test_torch_trainer] remove preprocessor
- a29bdd2 [test_trainer_restore] pt1
- e690af9 remove preprocessor tests
- 3d3bcd1 [test_trainer_restore] finished
- c9a374d FIX: invalidate trial cache on storage ckpt index update
- 3fc1233 FIX: transformers trainer to user train.report
- a1fd329 test_training_iterator
- 0836353 test_transformers_checkpoint (move_only)
- 7e7cc21 test_utils (move_only)
- e07b647 [test_training_iterator] pt1
- 4c17417 [test_training_iterator] finished
- d8e7917 [test_training_iterator] delete ckpt test
- 83646fa tune_torch_regression_example + torch_regression_example
- 299337f [torch_regression_example] misc
- 8bca3cd test_torch_checkpoint (move_only)
- 88c6c26 test_torch_utils (move_only)
- 39454bf test_transformers_trainer
- b5a956d Merge branch 'master' of https://github.com/ray-project/ray into air/… …persistence/ci/train
- e1bc34d test_worker_group (move_only)
- 66c8109 fix lint
- 597730c mlflow_simple_example (move_only)
- e2f2306 torch_quick_start (move_only)
- aba19a4 tensorflow_quickstart (move_only)
- 6d63204 e2e_wandb + test_examples (move_only)

justinvyu requested a review from krfricke August 22, 2023 07:09

justinvyu assigned krfricke and ericl Aug 22, 2023

justinvyu requested a review from ericl August 22, 2023 07:11

justinvyu added 3 commits August 22, 2023 00:12 (each non-merge commit carries Signed-off-by: Justin Yu <justinvyu@anyscale.com>)

- 8b37414 update test_horovod_trainer (message body: update test_horovod_trainer pt2)
- 2dabc94 exclude updated tests in 3.7 compatibility suite for now
- b5d5b66 Merge branch 'master' of https://github.com/ray-project/ray into air/… …persistence/ci/train

krfricke approved these changes Aug 22, 2023

Merge branch 'master' into air/persistence/ci/train

a8df510

krfricke merged commit 6156470 into ray-project:master Aug 22, 2023
59 of 64 checks passed

justinvyu deleted the air/persistence/ci/train branch August 22, 2023 18:15
