[tune] Fix an edge case where DurableTrainable would not delete checkpoints in remote storage #18318
Conversation
How about instead, we change the logic in DurableTrainable to always remove the checkpoint in remote storage, whether or not a local copy is found?
@xwjiang2010 It won't be able to find the checkpoint directory through the find method then.
Yeah, that's another thing I don't quite understand.
I am ok with the direction of the PR. Thanks Antoni for spotting the bug :) Besides my question about when to care about nested checkpoints and thus
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
Thanks, I think we can go ahead with this. But I agree we might want to refactor this (remote) checkpointing logic later.
@@ -13,6 +13,7 @@
  import ray
+ import ray.cloudpickle as cloudpickle
  from ray.exceptions import GetTimeoutError
Is there some way we can test this PR?
We could include a test for this in one of the release tests using S3, but I don't see how a unit test for this would be possible, unless we somehow mock node IPs? This would have to run on a multi-node setup.
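One way a unit test could approximate the multi-node condition is by injecting the node-IP lookup instead of reading it from a real cluster. The predicate below is purely illustrative (the name `runner_on_same_node` and its signature are assumptions, not actual Ray Tune internals):

```python
def runner_on_same_node(runner_ip: str, get_node_ip) -> bool:
    """Illustrative check: decides whether the trial runner is
    co-located with the checkpoint files. `get_node_ip` is injected
    so a test can simulate different cluster topologies."""
    return runner_ip == get_node_ip()


# Simulate a single-node cluster: the runner shares the trial's node,
# so local checkpoint files must be kept for the remote cleanup pass.
assert runner_on_same_node("10.0.0.5", lambda: "10.0.0.5")

# Simulate a multi-node cluster: the runner lives elsewhere,
# so the local copy can be deleted immediately.
assert not runner_on_same_node("10.0.0.5", lambda: "10.0.0.9")
```

Injecting the lookup sidesteps the need for real multi-node infrastructure, at the cost of not exercising the actual IP-resolution code path.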
OK. How about we keep track of this in an issue, as we should add multi-node testing soon.
Do you know why the release test setup cannot access S3?
Why are these changes needed?
If both the runner and the trial are present on the same node, DurableTrainable will never remove files in remote storage, as it would be unable to find the local files, which had already been deleted by the checkpoint_deleter. This change ensures that checkpoint_deleter doesn't remove local files by itself if the runner is on the same node.
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.
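The behavior this PR describes can be sketched in isolation: the checkpoint deleter skips local cleanup when the runner shares the node, leaving the local files in place so the durable trainable can still resolve the checkpoint path and mirror the deletion to remote storage. The names below (`CheckpointDeleter`, `get_node_ip`, the boolean return value) are illustrative stand-ins, not the actual Ray Tune implementation:

```python
import os
import shutil


def get_node_ip() -> str:
    """Placeholder for a cluster-aware node-IP lookup (illustrative)."""
    return "127.0.0.1"


class CheckpointDeleter:
    """Sketch of the guarded deletion: local files are only removed
    when the runner lives on a different node."""

    def __init__(self, runner_ip: str):
        self.runner_ip = runner_ip

    def delete(self, checkpoint_dir: str) -> bool:
        """Returns True if local cleanup was performed, False if it
        was deliberately skipped."""
        if self.runner_ip == get_node_ip():
            # The runner is on this node: its remote-storage cleanup
            # still needs the local files to locate the checkpoint,
            # so leave them for the runner to handle.
            return False
        # Runner is elsewhere: safe to remove the local copy now.
        if os.path.isdir(checkpoint_dir):
            shutil.rmtree(checkpoint_dir)
        return True
```

Under this sketch, the bug in the original code corresponds to always taking the deletion branch, after which the co-located runner's lookup of the (now missing) local directory fails and the remote copy is never removed.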