[sgd] Merge sharded param server based SGD implementation #3033

Merged
merged 46 commits into ray-project:master on Oct 28, 2018

Conversation

@ericl (Contributor) commented Oct 7, 2018

What do these changes do?

This includes most of the TF code used for the OSDI experiment. Perf sanity check on p3.16xl instances: overall scaling looks OK, with the multi-node results within 5% of the final OSDI numbers. This seems reasonable given that hugepages are not enabled here and the param server shards are placed randomly.

$ RAY_USE_XRAY=1 ./test_sgd.py --gpu --batch-size=64 --num-workers=N \
  --devices-per-worker=M --strategy=<simple|ps> \
  --warmup --object-store-memory=10000000000

Images per second (total)

gpus total              | simple |  ps
========================|========|======
1                       |  218   |   -
2 (1 worker)            |  388   |   -
4 (1 worker)            |  759   |   -
4 (2 workers)           |  176   |  623
8 (1 worker)            |  985   |   -
8 (2 workers)           |  349   | 1031
16 (2 nodes, 2 workers) |  600   | 1661
16 (2 nodes, 4 workers) |  468   | 1712   <--- OSDI perf was 1817
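
For context on what the ps strategy does, below is a minimal sketch of the sharded parameter server pattern using Ray actors. The names here (ParameterServerShard, apply_gradients, the 0.01 learning rate, the shapes) are illustrative stand-ins, not the actual API introduced by this PR.

import numpy as np
import ray


@ray.remote
class ParameterServerShard(object):
    """One shard of the flattened parameter vector (illustrative sketch)."""

    def __init__(self, shard_size):
        self.params = np.zeros(shard_size, dtype=np.float32)

    def apply_gradients(self, *grads):
        # Average this shard's gradient slices from all workers, then apply.
        self.params -= 0.01 * np.mean(grads, axis=0)
        return self.params


ray.init()

num_shards, shard_size, num_workers = 4, 1000, 2
shards = [ParameterServerShard.remote(shard_size) for _ in range(num_shards)]

for step in range(10):
    # Each worker would compute these on its GPUs; random data stands in here.
    worker_grads = [
        [np.random.randn(shard_size).astype(np.float32) for _ in range(num_shards)]
        for _ in range(num_workers)
    ]
    # Each shard aggregates only its slice, so aggregation bandwidth is
    # spread across num_shards actors instead of one central server.
    updates = [
        shards[i].apply_gradients.remote(*[g[i] for g in worker_grads])
        for i in range(num_shards)
    ]
    new_params = ray.get(updates)  # workers would then pull these slices

Sharding the parameters this way is what lets the ps strategy scale past the simple strategy in the table above: no single process has to receive every worker's full gradient.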

Related issue number

#3019

ericl added some commits Oct 7, 2018

ps

@ericl changed the title from "[WIP] Merge distributed parameter server code" to "[WIP] Merge sharded param server based SGD implementation" on Oct 7, 2018

@AmplabJenkins (Collaborator) commented Oct 7, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8571/

@robertnishihara requested a review from @stephanie-wang on Oct 7, 2018

ericl added some commits Oct 21, 2018

@AmplabJenkins (Collaborator) commented Oct 21, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8771/

ericl added some commits Oct 21, 2018

fix
fix
up
@AmplabJenkins (Collaborator) commented Oct 21, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8772/

ericl added some commits Oct 21, 2018

fix

@ericl changed the title from "[WIP] Merge sharded param server based SGD implementation" to "[sgd] Merge sharded param server based SGD implementation" on Oct 21, 2018

@AmplabJenkins (Collaborator) commented Oct 21, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8773/

@AmplabJenkins (Collaborator) commented Oct 21, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8775/

@AmplabJenkins (Collaborator) commented Oct 21, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8774/

@AmplabJenkins (Collaborator) commented Oct 21, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8776/

@AmplabJenkins (Collaborator) commented Oct 26, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8907/

@alexbao (Contributor) commented Oct 26, 2018

Thanks for the change. LGTM once the remaining minor comments are addressed.

Perhaps you can also share this awesome work at our next meeting!

@robertnishihara (Contributor) commented Oct 26, 2018

@ericl The PR that removed legacy Ray introduced some conflicts, so this PR needs to be rebased.

@AmplabJenkins (Collaborator) commented Oct 26, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8922/

@richardliaw (Member) left a comment

Looks fine; cleanup can be done in a separate PR for:

  1. replacing the time.time() calls with something like the timers from RLlib (see the sketch below)
  2. documenting all the methods
Review threads (resolved): python/ray/experimental/sgd/model.py (outdated), python/ray/experimental/sgd/param_server.py, python/ray/experimental/sgd/sgd.py, python/ray/experimental/sgd/sgd_worker.py (outdated)
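
Regarding point 1, here is a minimal sketch of what such a timer could look like. This TimerStat is a hand-rolled stand-in for RLlib's timer utility, not its actual implementation:

import time


class TimerStat(object):
    """Accumulates wall-clock time across `with` blocks and reports the
    mean, replacing scattered time.time() bookkeeping (illustrative sketch)."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def __enter__(self):
        self._start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, tb):
        self.total += time.time() - self._start
        self.count += 1

    @property
    def mean(self):
        return self.total / max(1, self.count)


grad_timer = TimerStat()
for _ in range(3):
    with grad_timer:
        time.sleep(0.01)  # stands in for a compute_gradients() call
print("mean grad time: %.4fs over %d calls" % (grad_timer.mean, grad_timer.count))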
@richardliaw (Member) commented Oct 27, 2018

(Maybe we should sit down with @stephanie-wang to walk through the code before merging.)

ericl added some commits Oct 27, 2018

@AmplabJenkins (Collaborator) commented Oct 28, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8932/

@AmplabJenkins (Collaborator) commented Oct 28, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8934/

@AmplabJenkins (Collaborator) commented Oct 28, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8935/

@AmplabJenkins (Collaborator) commented Oct 28, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8936/

ericl added some commits Oct 28, 2018

max

@ericl merged commit af0c117 into ray-project:master on Oct 28, 2018

1 check was pending: default (Merged build started)
@ericl (Contributor) commented Oct 28, 2018

@stephanie-wang let's go through this on Tuesday?

@AmplabJenkins (Collaborator) commented Oct 28, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/8942/

@ericl (Contributor) commented Oct 28, 2018

@stephanie-wang

@stephanie-wang (Contributor) commented Oct 29, 2018

Oops, sorry I didn't get around to this in time. Yeah, I'd like to meet to go over the code if you guys have time, thanks!


def get(self, object_id):
    """Put the accumulated gradients into the given object id."""
    self.timeline.start("get")

@chunyang-wen (Contributor) commented Nov 20, 2018

I think this kind of start/end bookkeeping could be replaced by a decorator, which would also make later refactoring easier (see the sketch below).
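
A minimal sketch of the suggested decorator. It assumes the instance has a self.timeline whose start(name) is paired with a matching end(name), as the surrounding PR code suggests; the name timeline_event is illustrative:

import functools


def timeline_event(name):
    """Wrap a method in timeline start/end bookkeeping so the pair can
    never be forgotten or mismatched (illustrative sketch)."""

    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            self.timeline.start(name)
            try:
                return method(self, *args, **kwargs)
            finally:
                self.timeline.end(name)
        return wrapper
    return decorator


# Usage (illustrative fragment):
#
# class ParameterServer(object):
#     @timeline_event("get")
#     def get(self, object_id):
#         ...  # no explicit start/end calls needed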

@robertnishihara (Contributor) commented Nov 20, 2018

Yeah, let's not fix this code though. Ultimately it should just be removed and replaced with the ray.profile API.
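For reference, a minimal sketch of that suggested replacement, assuming the context-manager form of ray.profile from the Ray profiling docs of this era (verify against your Ray version); the function name and data are illustrative:

import time

import ray

ray.init()


@ray.remote
def apply_gradients(grads):
    # ray.profile records a named span in the Ray timeline, replacing the
    # hand-rolled time.time() / Timeline bookkeeping (sketch only).
    with ray.profile("apply_gradients"):
        time.sleep(0.01)  # stands in for the actual parameter update
    return len(grads)


ray.get(apply_gradients.remote([1.0, 2.0, 3.0]))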

import logging

from ray.experimental.sgd.modified_allreduce import sum_gradients_all_reduce, \
    unpack_small_tensors

logger = logging.getLogger(__name__)

@chunyang-wen (Contributor) commented Nov 20, 2018

A unified logger module would be better if we want to set a consistent logging level and format. Currently each module creates its logger directly from the logging module, so it may interleave with the user's logging configuration and cause trouble (see the sketch below).
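
A minimal sketch of such a unified setup, assuming a single package-level "ray" logger; the helper name setup_ray_logger and the format string are illustrative:

import logging


def setup_ray_logger(level=logging.INFO):
    """Configure one package-level "ray" logger with its own handler and
    format, and stop propagation so it cannot interleave with the user's
    root-logger configuration (illustrative sketch)."""
    logger = logging.getLogger("ray")
    logger.setLevel(level)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s -- %(message)s"))
        logger.addHandler(handler)
    logger.propagate = False
    return logger


setup_ray_logger()
# Modules then pick up the shared configuration automatically, since
# logging.getLogger(__name__) inside ray/ returns a child of the "ray" logger:
logger = logging.getLogger("ray.experimental.sgd.sgd_worker")
logger.info("worker started")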
