
[Experimental] Add experimental distributed SGD API #2858

Merged (20 commits) on Sep 20, 2018

Conversation

@ericl (Contributor) commented Sep 11, 2018

No description provided.


@robertnishihara (Collaborator) left a comment:

We're copying a lot of code from TF in this PR. Can you say why? Some of it is just to get a Resnet model, right? What about the allreduce stuff?

import time


class Timeline(object):
A Collaborator commented:

Let's remove this from this PR.

A Collaborator commented:

and run_timeline from sgd.py

@@ -0,0 +1,504 @@
from __future__ import absolute_import


assert(len(self.per_device_grads) == num_devices)
self.num_grads = num_grads = len(self.packed_grads_and_vars[0])
if max_bytes:
print("Packed grads => {} tensors".format(num_grads))
A Collaborator commented:

use logger
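For context, the change the reviewer is asking for is small: route diagnostic output through a module-level logger rather than print. A minimal sketch (the helper name is illustrative, not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def report_packing(num_grads):
    # Emit through the module logger instead of print(), so verbosity
    # can be controlled centrally via the logging configuration.
    message = "Packed grads => {} tensors".format(num_grads)
    logger.info(message)
    return message
```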

plasma_manager_socket_name=manager_socket)
grad_ph = tf.reshape(
grad_ph, self.packed_grads_and_vars[0][j][0].shape)
print("Packed tensor", grad_ph)
@robertnishihara (Collaborator) commented Sep 11, 2018:

use logger, same with all other prints

@robertnishihara (Collaborator) commented Sep 11, 2018

Using the default model (but with batch size 64) model_creator = lambda worker_idx, device_idx: TFBenchModel(batch=64, use_cpus=False), I see the following times for sgd.step(). This is all without the Plasma op.

  • 1 machine, 1 GPU: ~0.8s
  • 1 machine, 8 GPUs: ~0.9s
  • 2 machines, 8 GPUs each: ~2s
  • 3 machines, 8 GPUs each: ~2.1s
  • 4 machines, 8 GPUs each: ~2.4s

So, scaling from 1 GPU to 8 GPUs is quite good.
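Timings like the ones above are typically collected by averaging repeated sgd.step() calls after a few warmup iterations. A hypothetical measurement helper (the function name and warmup/iteration counts are illustrative, not part of the PR):

```python
import time

def mean_step_time(step_fn, warmup=2, iters=10):
    # Run warmup steps first so one-time costs (graph construction,
    # GPU memory allocation) do not skew the average.
    for _ in range(warmup):
        step_fn()
    start = time.time()
    for _ in range(iters):
        step_fn()
    # Average wall-clock seconds per step.
    return (time.time() - start) / iters
```

For example, mean_step_time(sgd.step) would yield the kind of per-step averages quoted above.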

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8162/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8193/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8196/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8199/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8198/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8197/

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8200/

@ericl (Contributor, Author) commented Sep 14, 2018

I did some reorganization of the code.

  • Moved the benchmark code to tfbench. We'll need this if we want to run standard ImageNet benchmarks without too much pain.
  • Moved the timeline code to a util.py file. This is needed to provide proper performance instrumentation for Ray and TF<->GPU scheduling.

I also have started fixing the plasma op code. Currently it crashes since I left out the parameter server actor setup code. We can either try to merge this first, and merge the PS code after, or do both here.

@pcmoritz (Contributor) left a comment:

LGTM (also the plasma op changes). Happy to merge this after the linting error is fixed and do the parameter server changes as a followup.

self.start_time = self.time()
self.tid = tid

def patch_ray(self):
A Collaborator commented:

Why patch the Ray logging? As opposed to just using the existing logging?

@ericl (author) replied:

The Ray logging functionality doesn't provide the fine-grained control we have here to capture just one SGD iteration, or to add additional event types.

Note that this still calls the original log call, so you can still get overall timelines that way.

A Collaborator replied:

Hm, you can just do (assuming xray)

with ray.profile("custom_event"):
    # do one SGD iteration

to add the event to the timeline. Is that what you want? Anyway, if there's additional functionality needed here for profiling, then in a follow-up PR we should just extend the profiling API (so it's available more generally, outside of SGD).
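For readers unfamiliar with the suggestion, a profiling context manager of this shape can be sketched in a few lines. This is an illustrative stand-in, not Ray's actual implementation:

```python
import time
from contextlib import contextmanager

@contextmanager
def profile(event_name, events):
    # Record a (name, start, end) tuple for the wrapped block, mimicking
    # how a profiling context manager attaches custom events to a timeline.
    start = time.time()
    try:
        yield
    finally:
        events.append((event_name, start, time.time()))
```

Usage mirrors the reviewer's snippet: with profile("custom_event", events): run one SGD iteration.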

from __future__ import division
from __future__ import print_function

import ray
A Collaborator commented:

The ray import should be in a separate group, below the standard library imports: https://github.com/google/styleguide/blob/gh-pages/pyguide.md#313-imports-formatting

This applies in other places too.

@ericl (author) replied:

Fixed.

@@ -0,0 +1,627 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
A Collaborator commented:

Can you add a comment at the top of all the copied files saying something like "This file was copied from [RELEVANT URL]"

@ericl (author) replied:

Done

@richardliaw (Contributor) left a comment:

I think we should scope the imports correctly (there are places where we import from . and from files by name...).

import tensorflow.contrib.nccl as nccl
import tensorflow.contrib.slim as slim

from util import Timeline, fetch, run_timeline
@richardliaw (Contributor) commented Sep 14, 2018:

do we want all of these to be local imports (rather than global)?

@ericl (author) replied:

Fixed

ray.worker.global_worker.plasma_client.fetch([plasma_id])


def run_timeline(sess, ops, feed_dict={}, write_timeline=False, name=""):
A Collaborator commented:

Shouldn't use a mutable value as a default argument.
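The pitfall the reviewer is pointing at: a mutable default argument is created once, at function definition time, and then shared across every call. A minimal illustration with the conventional None-sentinel fix (function names here are illustrative):

```python
def bad_append(item, acc=[]):
    # The default list is created once and shared by all calls,
    # so mutations leak between invocations.
    acc.append(item)
    return acc

def good_append(item, acc=None):
    # Use None as the sentinel and build a fresh list per call.
    if acc is None:
        acc = []
    acc.append(item)
    return acc
```

Applied to the snippet above, run_timeline's feed_dict={} default would become feed_dict=None, with a fresh dict created inside the body.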

@ericl (author) replied:

Fixed.

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8247/

@robertnishihara (Collaborator) left a comment:

Looks good to me (pending tests passing; it looks like there is a linting error).

Currently this code is not touched by any of our tests. Can we add simple tests (just to touch the code) in a follow-up PR (probably on Jenkins)?

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8273/

@robertnishihara (Collaborator)

Looks like we need to exclude the copied TF files from the flake8 test.

@ericl (Contributor, Author) commented Sep 18, 2018

Let me add that.

@robertnishihara (Collaborator)

@ericl there is a merge conflict.

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8304/

@AmplabJenkins
Test FAILed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8305/

@robertnishihara (Collaborator)

jenkins, retest this please

@robertnishihara (Collaborator)

The test failure was

ERROR: testTrainMultiCartpoleSinglePolicy (__main__.TestMultiAgentEnv)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ray/python/ray/rllib/test/test_multi_agent_env.py", line 364, in testTrainMultiCartpoleSinglePolicy
    raise Exception("failed to improve reward")
Exception: failed to improve reward

----------------------------------------------------------------------

@AmplabJenkins
Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8306/

@pcmoritz pcmoritz merged commit 3267676 into ray-project:master Sep 20, 2018
@pcmoritz pcmoritz deleted the sgd branch September 20, 2018 04:12
@robertnishihara (Collaborator)

Progress on #1945.


5 participants