-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] Cross-Node Recovery #3725
Conversation
Test FAILed. |
Test PASSed. |
Test FAILed. |
@@ -122,6 +123,12 @@ def flush(self): | |||
self._log_syncer.sync_now(force=True) | |||
self._log_syncer.wait() | |||
|
|||
def update_location(self, worker_ip): | |||
"""Sends the current log directory to the remote node.""" | |||
if worker_ip != self._log_syncer.worker_ip: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this crash on non-autoscaler clusters?
Also, log a warning if sync is happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done; and no, won't crash (1sync_to_worker_if_possible1 catches it)
python/ray/tune/logger.py
Outdated
@@ -122,6 +123,12 @@ def flush(self): | |||
self._log_syncer.sync_now(force=True) | |||
self._log_syncer.wait() | |||
|
|||
def update_location(self, worker_ip): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update_location() sounds more harmless than this is, maybe sync_results_to_new_location()?
Test PASSed. |
Test PASSed. |
Test PASSed. |
This is ready for another review. |
Test PASSed. |
Test PASSed. |
Augments trial restore to also check if the runner is at the same
location. If not, the checkpoint files are pushed onto the new location.
TODO:
We don't have any tests for this right now...