Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[air] New train.Checkpoint API: Update Train tests + examples (batch 1) #38709

Merged
merged 33 commits into from
Aug 22, 2023

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Aug 22, 2023

Why are these changes needed?

This PR updates the first batch of Train tests + examples CI tests to use the new persistence mode and checkpoint API. Each converted test will add the new_storage tag in the CI, which excludes it from the old runners. This is so that we don't need to keep backwards compatibility within the tests.

Related issue number

#38570

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

update tag

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

[test_torch_trainer] remove commented test

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

[test_torch_trainer] finish

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@@ -1093,6 +1093,7 @@ def on_checkpoint(self, checkpoint: Union[_TrackedCheckpoint, _TrainingResult]):
self.storage.current_checkpoint_index += 1
else:
self.run_metadata.checkpoint_manager.on_checkpoint(checkpoint)
self.invalidate_json_state()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.storage is not part of run_metadata, so it's not enough to just invalidate the run_metadata cache. @krfricke

Comment on lines +297 to +300
# TorchCheckpoint.from_model fails, so just save the state dict only.
train.report(
{}, checkpoint=TorchCheckpoint.from_state_dict(model.module.state_dict())
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amogkam This test fails if I try to torch.save(model) or pickle.dump(model). Is the fix from #25335 supposed to fix that?

Is this unit test still relevant? Saving the state dict seems to work fine.

Comment on lines +202 to +204
# TODO(justinvyu): Failure injection callback doesn't work very well to test this.
pytest.skip("Re-enable after coming up with a better way to inject a failure.")
monkeypatch.setenv("RAY_AIR_LOCAL_CACHE_DIR", str(tmp_path))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is failing for some complicated reasons:

  1. We inject a failure in a callback. Errors in callbacks get handled in different ways.
  2. If we fail during on_trial_save, the error gets raised immediately WITHOUT calling tune_controller.checkpoint one last time. This means that the checkpoint gets processed -> checkpoint 00001 gets deleted, but the experiment state still thinks checkpoint 00001 is the latest one. This is a consequence of raising exactly during on_trial_save + num_to_keep=1.
  3. I tried failing on the next iteration's on_trial_result, but that also didn't work. The trial advanced to the next iteration for some reason.

So, I'm skipping for these (legacy) trainers for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me to skip the test, but for the case where on_trial_save fails (in a custom callback or so), should we make sure this works? (Can be in a follow-up)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'm not sure why this worked before actually, would be good to get to the bottom of this in a follow-up.

Tracking that here #38734

Comment on lines +103 to +105
if not _use_storage_context():
# TODO(justinvyu): [skipped_test]
pytest.skip("Skipping for now.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in a different (GPU) test suite, but shares example code for a different test that has been updated, so excluding it here.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

update test_horovod_trainer pt2

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test changes lgtm

Comment on lines +202 to +204
# TODO(justinvyu): Failure injection callback doesn't work very well to test this.
pytest.skip("Re-enable after coming up with a better way to inject a failure.")
monkeypatch.setenv("RAY_AIR_LOCAL_CACHE_DIR", str(tmp_path))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with me to skip the test, but for the case where on_trial_save fails (in a custom callback or so), should we make sure this works? (Can be in a follow-up)

@krfricke krfricke merged commit 6156470 into ray-project:master Aug 22, 2023
59 of 64 checks passed
@justinvyu justinvyu deleted the air/persistence/ci/train branch August 22, 2023 18:15
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…tch 1) (ray-project#38709)

This PR updates the first batch of `Train tests + examples` CI tests to use the new persistence mode and checkpoint API. Each converted test will add the `new_storage` tag in the CI, which excludes it from the old runners. This is so that we don't need to keep backwards compatibility within the tests.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…tch 1) (ray-project#38709)

This PR updates the first batch of `Train tests + examples` CI tests to use the new persistence mode and checkpoint API. Each converted test will add the `new_storage` tag in the CI, which excludes it from the old runners. This is so that we don't need to keep backwards compatibility within the tests.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants