[air] New train.Checkpoint API: Update Train tests + examples (batch 1) #38709
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@@ -1093,6 +1093,7 @@ def on_checkpoint(self, checkpoint: Union[_TrackedCheckpoint, _TrainingResult]):
            self.storage.current_checkpoint_index += 1
        else:
            self.run_metadata.checkpoint_manager.on_checkpoint(checkpoint)
        self.invalidate_json_state()
`self.storage` is not part of `run_metadata`, so it's not enough to just invalidate the `run_metadata` cache. @krfricke
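The concern above can be illustrated with a minimal sketch of a serialized-state cache fed by two pieces of state: if only one of them triggers invalidation, the snapshot goes stale. All class and attribute names here are hypothetical stand-ins, not Ray's actual implementation:

```python
class Experiment:
    """Sketch: two pieces of state feed one cached JSON-like snapshot."""

    def __init__(self):
        self.run_metadata = {"latest_checkpoint": None}
        self.checkpoint_index = 0   # lives OUTSIDE run_metadata
        self._json_cache = None     # cached serialized state

    def json_state(self):
        # Rebuild the snapshot only when the cache was invalidated.
        if self._json_cache is None:
            self._json_cache = {
                "metadata": dict(self.run_metadata),
                "checkpoint_index": self.checkpoint_index,
            }
        return self._json_cache

    def on_checkpoint(self, name):
        self.run_metadata["latest_checkpoint"] = name
        self.checkpoint_index += 1
        # Invalidating the whole cache is required here: checkpoint_index
        # is not part of run_metadata, so a metadata-only invalidation
        # would serve a snapshot with a stale checkpoint_index.
        self._json_cache = None
```

This is why the diff calls `self.invalidate_json_state()` unconditionally after either branch updates state.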
# TorchCheckpoint.from_model fails, so just save the state dict only.
train.report(
    {}, checkpoint=TorchCheckpoint.from_state_dict(model.module.state_dict())
)
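For context on the `model.module` access above: DDP-style distributed wrappers expose the wrapped model through a `.module` attribute, and saving the wrapper's own state dict prefixes every key with `module.`, which then fails to load into the plain model. A framework-free sketch with stand-in classes (not real PyTorch):

```python
class Linear:
    """Stand-in for a plain model with parameters."""
    def state_dict(self):
        return {"weight": [1.0, 2.0], "bias": [0.0]}

class DistributedWrapper:
    """Stand-in for DistributedDataParallel: wraps a model as .module."""
    def __init__(self, module):
        self.module = module

model = DistributedWrapper(Linear())

# Unwrap first so the saved keys match the plain model's keys.
state = model.module.state_dict()
```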
# TODO(justinvyu): Failure injection callback doesn't work very well to test this.
pytest.skip("Re-enable after coming up with a better way to inject a failure.")
monkeypatch.setenv("RAY_AIR_LOCAL_CACHE_DIR", str(tmp_path))
This test is failing for some complicated reasons:
- We inject a failure in a callback. Errors in callbacks get handled in different ways.
- If we fail during `on_trial_save`, the error gets raised immediately WITHOUT calling `tune_controller.checkpoint` one last time. This means that the checkpoint gets processed -> checkpoint 00001 gets deleted, but the experiment state still thinks checkpoint 00001 is the latest one. This is a consequence of raising exactly during `on_trial_save` + `num_to_keep=1`.
- I tried failing on the next iteration's `on_trial_result`, but that also didn't work. The trial advanced to the next iteration for some reason.

So, I'm skipping for these (legacy) trainers for now.
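The failure mode described above can be reproduced with a toy keep-last-N manager: with `num_to_keep=1`, rotation deletes the previous checkpoint before the save hook runs, so an error raised in the hook leaves the experiment state pointing at a checkpoint that no longer exists. All names here are illustrative, not Tune's internals:

```python
class ToyCheckpointManager:
    """Keeps only the last `num_to_keep` checkpoints on disk."""

    def __init__(self, num_to_keep=1):
        self.num_to_keep = num_to_keep
        self.on_disk = []              # checkpoints currently kept
        self.experiment_state = None   # last successfully synced "latest"

    def save(self, name, on_save_callback=None):
        self.on_disk.append(name)
        # Rotation happens first: older checkpoints get deleted.
        while len(self.on_disk) > self.num_to_keep:
            self.on_disk.pop(0)
        # A callback failure here aborts BEFORE experiment_state is synced.
        if on_save_callback:
            on_save_callback(name)
        self.experiment_state = name

mgr = ToyCheckpointManager(num_to_keep=1)
mgr.save("checkpoint_000000")
mgr.save("checkpoint_000001")

def fail(name):
    raise RuntimeError("injected failure in on_trial_save")

try:
    mgr.save("checkpoint_000002", on_save_callback=fail)
except RuntimeError:
    pass

# checkpoint_000001 was deleted by rotation, but experiment_state still
# points at it -> restoration would look for a checkpoint that is gone.
stale = mgr.experiment_state not in mgr.on_disk
```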
Fine with me to skip the test, but for the case where `on_trial_save` fails (in a custom callback or so), should we make sure this works? (Can be in a follow-up)
Sounds good. I'm not sure why this worked before, actually; it would be good to get to the bottom of this in a follow-up.
Tracking that here: #38734
if not _use_storage_context():
    # TODO(justinvyu): [skipped_test]
    pytest.skip("Skipping for now.")
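For readers outside the PR: `_use_storage_context()` is the internal flag check gating Ray's new persistence mode, and the snippet skips the test whenever that mode is off. A hedged sketch of the pattern, with the real flag check emulated by an environment variable (the variable name and helper are assumptions, not Ray's actual API):

```python
import os

def use_storage_context():
    # Stand-in for the internal _use_storage_context() check: the real
    # implementation reads Ray's feature flag; here we read an env var.
    return os.environ.get("NEW_PERSISTENCE_MODE", "0") == "1"

def maybe_skip():
    """Return a skip reason when the new storage context is disabled."""
    if not use_storage_context():
        return "Skipping for now."
    return None

os.environ["NEW_PERSISTENCE_MODE"] = "0"
reason = maybe_skip()
```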
This is in a different (GPU) test suite, but it shares example code with a different test that has been updated, so I'm excluding it here.
Test changes lgtm
…tch 1) (ray-project#38709)

This PR updates the first batch of `Train tests + examples` CI tests to use the new persistence mode and checkpoint API. Each converted test will add the `new_storage` tag in the CI, which excludes it from the old runners. This is so that we don't need to keep backwards compatibility within the tests.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?

This PR updates the first batch of `Train tests + examples` CI tests to use the new persistence mode and checkpoint API. Each converted test will add the `new_storage` tag in the CI, which excludes it from the old runners. This is so that we don't need to keep backwards compatibility within the tests.

Related issue number

#38570
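The `new_storage` tagging described above (in Ray's CI it is a test tag that the old runners filter out) can be sketched as a simple tag-based test filter. The function and test names here are illustrative, not Ray's CI configuration:

```python
def select_tests(tests, exclude_tags=()):
    """Keep tests whose tags don't intersect the excluded set.

    Sketch of how a tag like 'new_storage' excludes converted tests
    from runners that still exercise the old persistence code path.
    """
    excluded = set(exclude_tags)
    return [name for name, tags in tests if not (set(tags) & excluded)]

# Converted tests carry the new_storage tag; old runners exclude it,
# so converted tests never need backwards compatibility with old runners.
tests = [
    ("test_torch_trainer", {"new_storage"}),
    ("test_legacy_trainer", set()),
]
old_runner = select_tests(tests, exclude_tags=("new_storage",))
```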
Checks

- I've signed off every commit (using `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.