[tune] Fix an edge case where DurableTrainable would not delete checkpoints in remote storage #18318

Merged: 13 commits merged into ray-project:master from the delete_checkpoint branch on Sep 8, 2021

Conversation

@Yard1 (Member) commented on Sep 2, 2021:

Why are these changes needed?

If the runner and the trial are on the same node, DurableTrainable will never remove files in remote storage, because it cannot find the local files, which have already been deleted by the checkpoint_deleter. This change ensures that the checkpoint_deleter does not remove local files by itself when the runner is on the same node.
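A minimal sketch of that guard (hypothetical names such as runner_ip, trial_node_ip, and delete_remote_fn; not the actual Ray Tune internals):

```python
import os
import shutil


def delete_checkpoint(checkpoint_path, runner_ip, trial_node_ip,
                      delete_remote_fn):
    # The (durable) trainable deletes the copy in remote storage; it resolves
    # the checkpoint directory from the local files, possibly asynchronously
    # on another actor.
    delete_remote_fn(checkpoint_path)

    # Only remove the local files here when the runner sits on a different
    # node than the trial. If they share a node, removing the files now could
    # leave the trainable unable to find the checkpoint it has to delete
    # remotely, which is the bug this PR fixes.
    if runner_ip != trial_node_ip and os.path.exists(checkpoint_path):
        shutil.rmtree(checkpoint_path, ignore_errors=True)
```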

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xwjiang2010 (Contributor) commented:

How about instead we change the logic in DurableTrainable to always remove the checkpoint in remote storage, whether or not a local copy is found? Roughly a try/except/finally, with the remote deletion in the finally block, as in the sketch below.
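A rough sketch of that alternative (delete_local_copy and delete_remote_copy are hypothetical callables, not the actual DurableTrainable API):

```python
def delete_checkpoint(checkpoint_path, delete_local_copy, delete_remote_copy):
    try:
        # Best-effort removal of the local copy; it may already be gone.
        delete_local_copy(checkpoint_path)
    except FileNotFoundError:
        pass
    finally:
        # Always remove the checkpoint from remote storage, whether or not
        # a local copy was found.
        delete_remote_copy(checkpoint_path)
```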

@Yard1 (Member, Author) commented on Sep 2, 2021:

@xwjiang2010 It won't be able to find the checkpoint directory through the find method then

@xwjiang2010 (Contributor) commented:

> @xwjiang2010 It won't be able to find the checkpoint directory through the find method then

Yeah, that's another thing I don't quite understand. find_checkpoint_dir handles some nested-checkpoint case and recursively tries to find the top-level directory that contains .is_checkpoint. In the cases I have seen, it always returns the same output as its input, though. It would be good to understand the case this function is trying to address.
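For reference, a rough approximation of the behavior being described (not the actual ray.tune implementation of find_checkpoint_dir): walk up from the given path until a directory containing the .is_checkpoint marker is found.

```python
import os


def find_checkpoint_dir_sketch(checkpoint_path):
    # Approximation only: climb toward the filesystem root and stop at the
    # first directory that contains the ".is_checkpoint" marker file.
    current = (checkpoint_path if os.path.isdir(checkpoint_path)
               else os.path.dirname(checkpoint_path))
    while current != os.path.dirname(current):  # stop at the filesystem root
        if os.path.exists(os.path.join(current, ".is_checkpoint")):
            break
        current = os.path.dirname(current)
    # For a flat checkpoint whose directory already holds ".is_checkpoint",
    # this returns the input directory unchanged, matching the observation
    # above that input == output in the cases seen so far.
    return current
```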

@xwjiang2010 (Contributor) commented:

I am OK with the direction of the PR. Thanks, Antoni, for spotting the bug :)

Besides my question about when we need to care about nested checkpoints (and thus find_checkpoint_dir), if we want to proceed in this direction, can we:

  • Add a high-level description of the responsibilities of CheckpointManager and DurableTrainable, including the case where the Trainable and the CheckpointManager happen to be on the same physical machine. This would greatly help with reasoning about the code.
  • Add a test?

Review thread on python/ray/tune/trial.py (outdated, resolved).
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
@krfricke (Contributor) left a comment:

Thanks, I think we can go ahead with this, but I agree we might want to refactor this (remote) checkpointing logic later.

@@ -13,6 +13,7 @@
 import ray
 import ray.cloudpickle as cloudpickle
+from ray.exceptions import GetTimeoutError
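Presumably the new import supports a bounded wait on a remote deletion call; a guess at the usage pattern (not the PR's exact code):

```python
import ray
from ray.exceptions import GetTimeoutError


def wait_for_remote_delete(object_ref, timeout_s=60.0):
    # Wait a bounded amount of time for a remote call (e.g. a checkpoint
    # deletion task) instead of blocking indefinitely.
    try:
        ray.get(object_ref, timeout=timeout_s)
        return True
    except GetTimeoutError:
        print(f"Remote checkpoint deletion timed out after {timeout_s}s.")
        return False
```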
A reviewer (Contributor) commented on the diff:

Is there some way we can test this PR?

@Yard1 (Member, Author) commented on Sep 8, 2021:

We could include a test for this in one of the release tests using S3, but I don't see how a unit test would be possible unless we somehow mock node IPs; this would have to run on a multi-node setup.

A reviewer (Contributor) replied:

OK. How about we keep track of this in an issue, since we should add multi-node testing soon.

A reviewer (Contributor) commented:

Do you know why the release test setup cannot access S3?

@richardliaw merged commit dd6abed into ray-project:master on Sep 8, 2021.
@Yard1 deleted the delete_checkpoint branch on Sep 8, 2021, 22:13.
@Yard1 mentioned this pull request on Sep 8, 2021.