
[tune] Sync logs from workers and improve tensorboard reporting #1567

Merged
merged 13 commits into ray-project:master from s3sync on Feb 26, 2018

Conversation

Contributor

@ericl commented Feb 20, 2018

What do these changes do?

  • Sync logs automatically from workers to the head node (then optionally to S3).
  • Log all valid sub-fields reported to TensorBoard. This allows you to put arbitrary metrics in the info object and have them show up in TensorBoard (see the sketch below).
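A minimal sketch of the idea (illustrative only; the helper name and tag format here are assumptions, not the PR's exact code): numeric leaves of a nested result dict are flattened into individual scalar tags that a TensorBoard logger can write out.

def flatten_result(result, prefix=""):
    """Recursively flatten a result dict into {tag: scalar} pairs."""
    scalars = {}
    for attr, value in result.items():
        tag = "{}/{}".format(prefix, attr) if prefix else attr
        if isinstance(value, dict):
            scalars.update(flatten_result(value, tag))
        elif isinstance(value, (int, float)):
            scalars[tag] = value
    return scalars

# Example: arbitrary metrics nested under "info" become their own tags.
result = {"episode_reward_mean": 10.5, "info": {"learner": {"loss": 0.2}}}
print(flatten_result(result))
# {'episode_reward_mean': 10.5, 'info/learner/loss': 0.2}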

Related issue number

#1546

@ericl changed the title from "[tune] Sync logs to workers" to "[tune] Sync logs to workers and improve tensorboard reporting" on Feb 20, 2018
@ericl changed the title from "[tune] Sync logs to workers and improve tensorboard reporting" to "[tune] Sync logs from workers and improve tensorboard reporting" on Feb 20, 2018
@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3855/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3857/

def _refresh_worker_ip(self):
    if self.worker_ip_fut:
        try:
            self.worker_ip = ray.get(self.worker_ip_fut)
Contributor Author

TODO: this can block forever if the trial is dead.
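One way the blocking ray.get above could be avoided (a sketch only, assuming Ray's ray.wait with a timeout given in seconds, as in current Ray releases) is to poll the future and skip it if it is not yet ready:

import ray

def _refresh_worker_ip(self):
    # Variant of the method above: poll the pending future instead of blocking,
    # so an unfulfilled future from a dead trial is simply skipped on this pass.
    if self.worker_ip_fut:
        ready, _ = ray.wait([self.worker_ip_fut], timeout=1.0)
        if ready:
            self.worker_ip = ray.get(ready[0])
            self.worker_ip_fut = None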

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3888/

Contributor Author

@ericl commented Feb 22, 2018

@richardliaw ready for review (though I have yet to test on a cluster)

Contributor Author

@ericl commented Feb 22, 2018

Tested; seems to work fine.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3923/

Contributor Author

@ericl commented Feb 24, 2018

jenkins retest this please

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3926/

Contributor

@richardliaw left a comment

The dependence on autoscaler functionality rather than particular abstractions seems brittle.

Also separately (not in this PR), it would make sense to have some sort of command building tool that takes care of things like Docker environments.

ssh_key = get_ssh_key()
ssh_user = get_ssh_user()
if ssh_key is None or ssh_user is None:
    print(
Contributor

why not allow arbitrary clusters to use this?

Contributor Author

cluster_info.py will have to support that. I'm not sure how we can infer the SSH key unless it's set up by Ray.

print("Error: log sync requires rsync to be installed.")
return
worker_to_local_sync_cmd = (
("""rsync -avz -e "ssh -i '{}' -o ConnectTimeout=120s """
Contributor

this also doesn't work with Docker btw

Contributor Author

Yeah, it would be nice to have a more generalized way of running remote commands. Probably when kubernetes support lands we can expose a common run_cmd interface from the autoscaler.
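For context, a rough sketch of how such a worker-to-head rsync command might be assembled (the key path, user, IP, and directories below are placeholders, and the exact flags in the PR may differ):

ssh_key = "~/ray_bootstrap_key.pem"  # illustrative values only
ssh_user = "ubuntu"
worker_ip = "10.0.0.2"
local_dir = "/tmp/ray/trial_logs"

worker_to_local_sync_cmd = (
    """rsync -avz -e "ssh -i '{}' -o ConnectTimeout=120s """
    """-o StrictHostKeyChecking=no" '{}@{}:{}/' '{}/'""").format(
        ssh_key, ssh_user, worker_ip, local_dir, local_dir)

print(worker_to_local_sync_cmd)
# The syncer would then run this, e.g. via subprocess.Popen(cmd, shell=True).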

values = []
for attr in result.keys():
    value = result[attr]
    if value is not None:
Contributor

less verbose to just do for attr, value in result.items()

Contributor Author

Done

for attr in result.keys():
    value = result[attr]
    if value is not None:
        if type(value) in [int, float]:
Contributor

perhaps also check for long

Contributor Author

I don't think that's a real type?

Contributor

eh it's only for py2 so this is fine
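If Python 2's long ever needed to be covered, one portable option (not what the PR does) is to test against numbers.Real, which matches int, float, and long alike:

import numbers

value = 3.14
# numbers.Real covers int and float (and long on Python 2); bool is excluded
# explicitly because it subclasses int.
is_scalar = isinstance(value, numbers.Real) and not isinstance(value, bool)
print(is_scalar)  # True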

        self.worker_ip = None
        print("Created LogSyncer for {} -> {}".format(local_dir, remote_dir))

    def set_worker_ip(self, worker_ip):
Contributor

why preserve state?

Contributor

it seems like sync_now(force=True) is called when forced to flush and when the trial stops. In both cases, why do you only care about the last worker ip?

Separately, it might also make sense to split functions for local -> remote sync and local <- worker sync.

Contributor Author

The worker IP can change over time if the trial is restarted. Unfortunately, Ray doesn't have an API to get the worker IP synchronously, so we piggyback on the IP reported by the last result.

Note that syncing is per-trial, so we actually track the IP per trial.
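As a rough illustration of that per-trial tracking (the class and attribute names mirror the snippet above, but the rest is an assumption, not the PR's code):

class LogSyncer(object):
    """Per-trial helper that mirrors a trial's logs from its worker node."""

    def __init__(self, local_dir, remote_dir=None):
        self.local_dir = local_dir
        self.remote_dir = remote_dir
        self.worker_ip = None  # refreshed from the latest reported result

    def set_worker_ip(self, worker_ip):
        # Called on each result; a restarted trial may land on a different
        # node, so only the most recent IP is kept.
        self.worker_ip = worker_ip

# Hypothetical usage: each trial gets its own syncer.
syncer = LogSyncer("/tmp/ray/trial_1")
syncer.set_worker_ip("10.0.0.7")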

def get_ssh_key():
    """Returns the ssh key for connecting to cluster workers."""

    path = os.path.expanduser("~/ray_bootstrap_key.pem")
Contributor

hardcoding this seems brittle

Contributor Author

I added a TODO that this only supports this type of cluster for now.
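One possible way to soften the hardcoding (purely illustrative; the environment variable name is an assumption, not part of the PR) would be an overridable default:

import os

def get_ssh_key():
    """Return the SSH key path used to reach cluster workers, if it exists."""
    # Hypothetical override hook; the PR itself only checks the path written
    # out by the Ray autoscaler (~/ray_bootstrap_key.pem).
    path = os.environ.get("TUNE_CLUSTER_SSH_KEY",
                          os.path.expanduser("~/ray_bootstrap_key.pem"))
    return path if os.path.exists(path) else None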

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3944/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3958/

@richardliaw
Contributor

Test failure (Valgrind and custom resources) unrelated.

@richardliaw merged commit 87e107e into ray-project:master Feb 26, 2018
@richardliaw deleted the s3sync branch February 26, 2018 19:35