[tune] Sync logs from workers and improve tensorboard reporting #1567
Conversation
Test PASSed.
Test FAILed.
python/ray/tune/log_sync.py
Outdated
def _refresh_worker_ip(self):
    if self.worker_ip_fut:
        try:
            self.worker_ip = ray.get(self.worker_ip_fut)
todo: this can block forever if the trial is dead
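A minimal sketch of one way to bound the wait, assuming ray.wait is used to poll the future instead of a bare ray.get (the timeout value is illustrative, and timeout units differ across Ray versions):

def _refresh_worker_ip(self):
    if self.worker_ip_fut:
        # ray.wait returns after the timeout even if the task never
        # finishes, so a dead trial cannot block the driver here.
        # (Timeout units vary by Ray version: milliseconds in older
        # releases, seconds in newer ones.)
        ready, _ = ray.wait([self.worker_ip_fut], timeout=1000)
        if ready:
            self.worker_ip = ray.get(ready[0])
            self.worker_ip_fut = None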
Test FAILed.
@richardliaw ready for review (though I have yet to test on a cluster)
Tested; seems to work fine.
Test FAILed.
jenkins retest this please
Test PASSed.
The dependence on autoscaler functionality, rather than on well-defined abstractions, seems brittle.
Also, separately (not in this PR), it would make sense to have some sort of command-building tool that takes care of things like Docker environments.
ssh_key = get_ssh_key()
ssh_user = get_ssh_user()
if ssh_key is None or ssh_user is None:
    print(
why not allow arbitrary clusters to use this?
cluster_info.py will have to support that. I'm not sure how we can infer the SSH key unless it's set up by Ray.
print("Error: log sync requires rsync to be installed.") | ||
return | ||
worker_to_local_sync_cmd = ( | ||
("""rsync -avz -e "ssh -i '{}' -o ConnectTimeout=120s """ |
this also doesn't work with Docker btw
Yeah, it would be nice to have a more generalized way of running remote commands. Probably when Kubernetes support lands we can expose a common run_cmd interface from the autoscaler.
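A rough sketch of what such an interface could look like; NodeRunner, SSHRunner, and the run_cmd signature are hypothetical names, not part of the autoscaler:

import subprocess

class NodeRunner(object):
    """Hypothetical interface: run a command on a node regardless of
    whether it is reached via SSH, docker exec, or kubectl exec."""

    def run_cmd(self, cmd):
        raise NotImplementedError

class SSHRunner(NodeRunner):
    def __init__(self, ip, ssh_user, ssh_key):
        self.ip = ip
        self.ssh_user = ssh_user
        self.ssh_key = ssh_key

    def run_cmd(self, cmd):
        # Log sync could then be expressed on top of run_cmd instead of
        # building raw ssh/rsync strings for each backend.
        return subprocess.call([
            "ssh", "-i", self.ssh_key,
            "{}@{}".format(self.ssh_user, self.ip), cmd])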
python/ray/tune/logger.py
Outdated
values = []
for attr in result.keys():
    value = result[attr]
    if value is not None:
less verbose to just do for attr, value in result.items()
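For reference, a self-contained sketch of the suggested form (result here is a stand-in dict):

result = {"mean_accuracy": 0.92, "timesteps_total": 1000, "extra": None}
values = []
# Unpack key and value in one pass instead of indexing back into the dict.
for attr, value in result.items():
    if value is not None:
        values.append((attr, value))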
Done
python/ray/tune/logger.py
Outdated
for attr in result.keys():
    value = result[attr]
    if value is not None:
        if type(value) in [int, float]:
perhaps also check for long
I don't think that's a real type?
eh it's only for py2 so this is fine
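For context, a version-agnostic sketch of the check using the standard numbers module instead of a type list (is_scalar is an illustrative helper, not code from this PR):

import numbers

def is_scalar(value):
    """True for int and float (and Python 2's long), but not bool or
    complex; numbers.Real covers all the real numeric types, and bool
    is excluded explicitly since it subclasses int."""
    return isinstance(value, numbers.Real) and not isinstance(value, bool)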
    self.worker_ip = None
    print("Created LogSyncer for {} -> {}".format(local_dir, remote_dir))

def set_worker_ip(self, worker_ip):
why preserve state?
It seems like sync_now(force=True) is called when forced to flush and when the trial stops. In both cases, why do you only care about the last worker IP?
Separately, it might also make sense to split this into separate functions for local -> remote sync and local <- worker sync.
The worker IP can change over time if the trial is restarted. Unfortunately Ray doesn't have an API to get the worker IP synchronously, so we piggyback on the IP reported by the last result.
Note that syncing is per-trial, so we actually track the IP per trial.
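A condensed sketch of the pattern described above (method names follow the snippets in this PR; the bodies are illustrative):

class LogSyncer(object):
    def __init__(self, local_dir, remote_dir):
        self.local_dir = local_dir
        self.remote_dir = remote_dir
        self.worker_ip = None  # last IP reported by this trial's results

    def set_worker_ip(self, worker_ip):
        # Called on each result, so a restarted trial updates its IP.
        self.worker_ip = worker_ip

    def sync_now(self, force=False):
        if self.worker_ip is None:
            return  # no result seen yet; nothing to pull from the worker
        # ... rsync worker -> local, then local -> remote ...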
python/ray/tune/cluster_info.py
Outdated
def get_ssh_key():
    """Returns ssh key for connecting to cluster workers."""

    path = os.path.expanduser("~/ray_bootstrap_key.pem")
hardcoding this seems brittle
I added a TODO that this only supports this type of cluster for now.
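As one possible way to soften the hardcoding (sketch only; the RAY_SSH_KEY environment variable is hypothetical and not added by this PR):

import os

def get_ssh_key():
    """Sketch: allow an environment override before falling back to the
    autoscaler's bootstrap key."""
    path = os.environ.get(
        "RAY_SSH_KEY", os.path.expanduser("~/ray_bootstrap_key.pem"))
    return path if os.path.exists(path) else None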
Test FAILed.
Test PASSed.
Test failure (Valgrind and custom resources) unrelated.
What do these changes do?
… in the info object and have them show up in tensorboard.

Related issue number
#1546