[Experimental] Add experimental distributed SGD API #2858
Conversation
python/ray/experimental/sgd/sgd.py (outdated)
    ray.worker.global_worker.plasma_client.store_socket_name)
manager_socket = (
    ray.worker.global_worker.plasma_client.manager_socket_name)
memcpy_plasma_module = tf.load_op_library(
This needs to change, see https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_plasma_tf_op.py#L87
We're copying a lot of code from TF in this PR. Can you say why? Some of it is just to get a Resnet model, right? What about the allreduce stuff?
import time


class Timeline(object):
Let's remove this from this PR.
and run_timeline from sgd.py
@@ -0,0 +1,504 @@
from __future__ import absolute_import
python/ray/experimental/sgd/sgd.py (outdated)
assert(len(self.per_device_grads) == num_devices)
self.num_grads = num_grads = len(self.packed_grads_and_vars[0])
if max_bytes:
    print("Packed grads => {} tensors".format(num_grads))
use logger
python/ray/experimental/sgd/sgd.py (outdated)
    plasma_manager_socket_name=manager_socket)
grad_ph = tf.reshape(
    grad_ph, self.packed_grads_and_vars[0][j][0].shape)
print("Packed tensor", grad_ph)
use logger, same with all other prints
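The fix being asked for here is to route diagnostics through the standard logging module instead of bare print calls. A minimal sketch of the pattern (the logger name and the stand-in value are illustrative, not the actual sgd.py code):

```python
import logging

# Module-level logger, the usual convention for library code.
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

num_grads = 40  # stand-in value for illustration

# Instead of: print("Packed grads => {} tensors".format(num_grads))
logger.info("Packed grads => %d tensors", num_grads)
```

Unlike print, this lets callers adjust verbosity per module and defers string formatting until the record is actually emitted.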
Using the default model (but with batch size 64), scaling from 1 GPU to 8 GPUs is quite good.
Test PASSed.
Test FAILed.
Test PASSed.
I did some reorganization of the code.
I also have started fixing the plasma op code. Currently it crashes since I left out the parameter server actor setup code. We can either try to merge this first and merge the PS code after, or do both here.
LGTM (also the plasma op changes). Happy to merge this after the linting error is fixed and do the parameter server changes as a followup.
self.start_time = self.time()
self.tid = tid

def patch_ray(self):
Why patch the Ray logging? As opposed to just using the existing logging?
The Ray logging functionality doesn't provide the fine-grained control we have here to capture just one SGD iteration, or to add additional types of events.
Note that this still calls the original log call, so you can still get overall timelines that way.
Hm, you can just do (assuming xray)

    with ray.profile("custom_event"):
        # do one SGD iteration

to add the event to the timeline. Is that what you want? Anyway, if there's additional functionality here that's needed for profiling, then in a follow-up PR we should just extend the profiling API (so it's available more generally outside of SGD).
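To make the discussion concrete, the span-recording behavior the Timeline class provides (and that an extended profiling API could absorb) can be sketched as a plain context manager. This is an illustrative stand-in, not Ray's actual profiling implementation; the names `profile` and `events` are hypothetical:

```python
import time
from contextlib import contextmanager

events = []  # collected (name, start, duration) records

@contextmanager
def profile(name):
    # Record a named span, similar in spirit to ray.profile("custom_event").
    start = time.time()
    try:
        yield
    finally:
        events.append((name, start, time.time() - start))

with profile("sgd_iteration"):
    time.sleep(0.01)  # stand-in for one SGD iteration

print(events[0][0])  # sgd_iteration
```

The try/finally ensures the span is recorded even if the profiled block raises, which is the main thing a hand-rolled timer tends to get wrong.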
python/ray/experimental/sgd/util.py (outdated)
from __future__ import division
from __future__ import print_function

import ray
The ray import should be separate and below standard library imports: https://github.com/google/styleguide/blob/gh-pages/pyguide.md#313-imports-formatting
This applies in other places also.
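The grouping the style guide asks for, sketched with stand-in modules (the third-party and first-party lines are shown as comments since they are project-specific):

```python
# 1. Standard library imports come first, one group, alphabetized.
import os
import time

# 2. Third-party imports next, e.g.:
#        import tensorflow as tf
# 3. First-party / project imports last, e.g.:
#        import ray
# Each group is separated from the next by a blank line.

print(os.sep, time.__name__)
```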
Fixed.
@@ -0,0 +1,627 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
Can you add a comment at the top of all the copied files saying something like "This file was copied from [RELEVANT URL]"
Done
I think we should scope the imports correctly (there are places where we import "." and from a file by name...).
python/ray/experimental/sgd/sgd.py (outdated)
import tensorflow.contrib.nccl as nccl
import tensorflow.contrib.slim as slim

from util import Timeline, fetch, run_timeline
do we want all of these to be local imports (rather than global)?
Fixed
python/ray/experimental/sgd/util.py (outdated)
ray.worker.global_worker.plasma_client.fetch([plasma_id])


def run_timeline(sess, ops, feed_dict={}, write_timeline=False, name=""):
We shouldn't use a mutable value as a default argument.
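The pitfall behind this comment: a mutable default is created once, at function definition time, and shared across every call. A minimal illustration of the bug and the conventional `None` fix (the function names are illustrative, not the actual `run_timeline` code):

```python
def buggy(ops, feed_dict={}):
    # One dict object is shared by every call that omits feed_dict.
    feed_dict["seen"] = feed_dict.get("seen", 0) + 1
    return feed_dict

def fixed(ops, feed_dict=None):
    # A fresh dict is created per call.
    if feed_dict is None:
        feed_dict = {}
    feed_dict["seen"] = feed_dict.get("seen", 0) + 1
    return feed_dict

buggy(1); print(buggy(2))   # {'seen': 2} -- state leaked between calls
fixed(1); print(fixed(2))   # {'seen': 1} -- independent dicts
```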
Fixed.
Test PASSed.
Looks good to me (pending tests passing; it looks like there is a linting error).
Currently this code is not touched by any of our tests. Can we add simple tests (just to touch the code) in a follow-up PR (probably on Jenkins)?
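The kind of minimal "touch the code" smoke test asked for above can be sketched as an import check; just importing a module catches syntax errors and broken top-level imports. The helper name is illustrative, and the stdlib module stands in for `ray.experimental.sgd.sgd`, which would need Ray installed:

```python
import importlib

def smoke_test(module_name):
    # Import the module and assert it loaded; nothing else is exercised.
    mod = importlib.import_module(module_name)
    assert mod is not None
    return mod

# Stand-in target: the follow-up PR would point this at
# "ray.experimental.sgd.sgd" on Jenkins instead.
m = smoke_test("json")
print(m.__name__)  # json
```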
Test PASSed.
Looks like we need to exclude the copied TF files from the
Let me add that.
@ericl there is a merge conflict.
Test PASSed.
Test FAILed.
jenkins, retest this please
The test failure was
Test PASSed.
Progress on #1945.
No description provided.