
[tune] Sync logs from workers and improve tensorboard reporting #1567

Merged
merged 13 commits into ray-project:master from s3sync on Feb 26, 2018

Conversation

Contributor

@ericl commented Feb 20, 2018

What do these changes do?

  • Sync logs automatically from workers to the head node (then optionally to S3).
  • Log all valid sub-fields reported to TensorBoard. This allows you to put arbitrary metrics in the info object and have them show up in TensorBoard (see the sketch below).
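A minimal sketch of the idea (illustrative only; the helper name and tag format here are assumptions, not the PR's exact code): numeric leaves of a nested result dict are flattened into individual scalar tags that a TensorBoard logger can write out.

def flatten_result(result, prefix=""):
    """Recursively flatten a result dict into {tag: scalar} pairs."""
    scalars = {}
    for attr, value in result.items():
        tag = "{}/{}".format(prefix, attr) if prefix else attr
        if isinstance(value, dict):
            scalars.update(flatten_result(value, tag))
        elif isinstance(value, (int, float)):
            scalars[tag] = value
    return scalars

# Example: arbitrary metrics nested under "info" become their own tags.
result = {"episode_reward_mean": 10.5, "info": {"learner": {"loss": 0.2}}}
print(flatten_result(result))
# {'episode_reward_mean': 10.5, 'info/learner/loss': 0.2}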

Related issue number

#1546

@ericl changed the title from "[tune] Sync logs to workers" to "[tune] Sync logs to workers and improve tensorboard reporting" on Feb 20, 2018
@ericl changed the title from "[tune] Sync logs to workers and improve tensorboard reporting" to "[tune] Sync logs from workers and improve tensorboard reporting" on Feb 20, 2018
@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3855/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3857/

def _refresh_worker_ip(self):
    if self.worker_ip_fut:
        try:
            self.worker_ip = ray.get(self.worker_ip_fut)
Contributor Author

TODO: this can block forever if the trial is dead.
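One way the blocking ray.get above could be avoided (a sketch only, assuming Ray's ray.wait with a timeout given in seconds, as in current Ray releases) is to poll the future and skip it if it is not yet ready:

import ray

def _refresh_worker_ip(self):
    # Variant of the method above: poll the pending future instead of blocking,
    # so an unfulfilled future from a dead trial is simply skipped on this pass.
    if self.worker_ip_fut:
        ready, _ = ray.wait([self.worker_ip_fut], timeout=1.0)
        if ready:
            self.worker_ip = ray.get(ready[0])
            self.worker_ip_fut = None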

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3888/

Contributor Author

@ericl commented Feb 22, 2018

@richardliaw ready for review (though I have yet to test on a cluster)

Contributor Author

@ericl commented Feb 22, 2018

Tested; seems to work fine.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3923/

Contributor Author

@ericl commented Feb 24, 2018

jenkins retest this please

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3926/

Contributor

@richardliaw left a comment

The dependence on autoscaler functionality rather than particular abstractions seems brittle.

Also separately (not in this PR), it would make sense to have some sort of command building tool that takes care of things like Docker environments.

ssh_key = get_ssh_key()
ssh_user = get_ssh_user()
if ssh_key is None or ssh_user is None:
    print(
Contributor

why not allow arbitrary clusters to use this?

Contributor Author

cluster_info.py will have to support that. I'm not sure how we can infer the SSH key unless it's set up by Ray.

print("Error: log sync requires rsync to be installed.")
return
worker_to_local_sync_cmd = (
("""rsync -avz -e "ssh -i '{}' -o ConnectTimeout=120s """
Contributor

this also doesn't work with Docker btw

Contributor Author

Yeah, it would be nice to have a more generalized way of running remote commands. Probably when kubernetes support lands we can expose a common run_cmd interface from the autoscaler.
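For context, a rough sketch of how such a worker-to-head rsync command might be assembled (the key path, user, IP, and directories below are placeholders, and the exact flags in the PR may differ):

ssh_key = "~/ray_bootstrap_key.pem"  # illustrative values only
ssh_user = "ubuntu"
worker_ip = "10.0.0.2"
local_dir = "/tmp/ray/trial_logs"

worker_to_local_sync_cmd = (
    """rsync -avz -e "ssh -i '{}' -o ConnectTimeout=120s """
    """-o StrictHostKeyChecking=no" '{}@{}:{}/' '{}/'""").format(
        ssh_key, ssh_user, worker_ip, local_dir, local_dir)

print(worker_to_local_sync_cmd)
# The syncer would then run this, e.g. via subprocess.Popen(cmd, shell=True).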

values = []
for attr in result.keys():
    value = result[attr]
    if value is not None:
Contributor

less verbose to just do for attr, value in result.items()

Contributor Author

Done

for attr in result.keys():
    value = result[attr]
    if value is not None:
        if type(value) in [int, float]:
Contributor

perhaps also check for long

Contributor Author

I don't think that's a real type?

Contributor

eh it's only for py2 so this is fine
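If Python 2's long ever needed to be covered, one portable option (not what the PR does) is to test against numbers.Real, which matches int, float, and long alike:

import numbers

value = 3.14
# numbers.Real covers int and float (and long on Python 2); bool is excluded
# explicitly because it subclasses int.
is_scalar = isinstance(value, numbers.Real) and not isinstance(value, bool)
print(is_scalar)  # True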

        self.worker_ip = None
        print("Created LogSyncer for {} -> {}".format(local_dir, remote_dir))

    def set_worker_ip(self, worker_ip):
Contributor

why preserve state?

Contributor

it seems like sync_now(force=True) is called when forced to flush and when the trial stops. In both cases, why do you only care about the last worker ip?

Separately, it might also make sense to split functions for local -> remote sync and local <- worker sync.

Contributor Author

The worker IP can change over time if the trial is restarted. Unfortunately, Ray doesn't have an API to get the worker IP synchronously, so we piggyback on the IP reported by the last result.

Note that syncing is per-trial, so we actually track the IP per trial.
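As a rough illustration of that per-trial tracking (the class and attribute names mirror the snippet above, but the rest is an assumption, not the PR's code):

class LogSyncer(object):
    """Per-trial helper that mirrors a trial's logs from its worker node."""

    def __init__(self, local_dir, remote_dir=None):
        self.local_dir = local_dir
        self.remote_dir = remote_dir
        self.worker_ip = None  # refreshed from the latest reported result

    def set_worker_ip(self, worker_ip):
        # Called on each result; a restarted trial may land on a different
        # node, so only the most recent IP is kept.
        self.worker_ip = worker_ip

# Hypothetical usage: each trial gets its own syncer.
syncer = LogSyncer("/tmp/ray/trial_1")
syncer.set_worker_ip("10.0.0.7")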

def get_ssh_key():
    """Returns the ssh key for connecting to cluster workers."""

    path = os.path.expanduser("~/ray_bootstrap_key.pem")
Contributor

hardcoding this seems brittle

Contributor Author

I added a TODO that this only supports this type of cluster for now.
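One possible way to soften the hardcoding (purely illustrative; the environment variable name is an assumption, not part of the PR) would be an overridable default:

import os

def get_ssh_key():
    """Return the SSH key path used to reach cluster workers, if it exists."""
    # Hypothetical override hook; the PR itself only checks the path written
    # out by the Ray autoscaler (~/ray_bootstrap_key.pem).
    path = os.environ.get("TUNE_CLUSTER_SSH_KEY",
                          os.path.expanduser("~/ray_bootstrap_key.pem"))
    return path if os.path.exists(path) else None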

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3944/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3958/

@richardliaw
Contributor

Test failure (Valgrind and custom resources) unrelated.

@richardliaw merged commit 87e107e into ray-project:master Feb 26, 2018
@richardliaw deleted the s3sync branch February 26, 2018 19:35