Distributed Version #23

Closed
jermainewang opened this Issue Nov 9, 2015 · 54 comments

jermainewang commented Nov 9, 2015

Is there any distributed version of TensorFlow that could work on multiple machines?

-Minjie

rusenask commented Nov 9, 2015

I think that's the point - Google hasn't open sourced the "scalable" version :)

vrv (Contributor) commented Nov 9, 2015

Thanks for the question! To reiterate what I said here, we are working on making a distributed implementation available; it's currently not in the initial release. Please stay tuned, and take a look at the CIFAR multi-GPU tutorial for a flavor of how we handle multiple 'devices': http://tensorflow.org/tutorials/deep_cnn/index.md
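For illustration, the flavor of that tutorial is explicit device placement with tf.device. The sketch below is not the tutorial's code; it assumes the 2015-era Python API and made-up shapes, with shared variables on the CPU and two data-parallel "towers" pinned to separate GPUs on one machine.

```python
# Illustrative sketch in the spirit of the CIFAR multi-GPU tutorial.
# Shapes and names are made up; the API shown is the 2015-era one.
import tensorflow as tf

with tf.Graph().as_default():
    # Shared parameters live on the CPU so every tower can read/update them.
    with tf.device("/cpu:0"):
        w = tf.Variable(tf.truncated_normal([1024, 10], stddev=0.1), name="weights")

    tower_logits = []
    for gpu_id in range(2):
        # Each data-parallel "tower" runs its forward pass on a different GPU.
        with tf.device("/gpu:%d" % gpu_id):
            x = tf.placeholder(tf.float32, [None, 1024], name="input_%d" % gpu_id)
            tower_logits.append(tf.matmul(x, w))
```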

saraswat commented Nov 9, 2015

Would appreciate any insight on the availability of the distributed version. Is the distributed code that is being worked on in GitHub? That is one place where some of us who are interested could contribute.

zhangandyx commented Nov 10, 2015

👍

edwardyoon commented Nov 10, 2015

Hello,

After reading these plans and ideas, I'm somewhat surprised. According to http://static.googleusercontent.com/media/research.google.com/en//people/jeff/BayLearn2015.pdf, both data and model parallelism are needed to train large and powerful models quickly. Also, transferring data between GPUs takes time, as described in http://tensorflow.org/tutorials/deep_cnn/index.md. So how is it possible to efficiently support both model parallelism and heterogeneous multi-device (single-node) execution on a distributed cluster? Could you please roughly explain how it differs from DistBelief?

Thanks!

edwardyoon commented Nov 10, 2015

P.S. GPU acceleration could also be limited by the model partitioning strategy.

jeffreyadean (Contributor) commented Nov 11, 2015

Our current internal distributed extensions are somewhat entangled with Google-internal infrastructure, which is why we released the single-machine version first. The code is not yet on GitHub, because it still has dependencies on other parts of the Google code base; most of these have been trimmed, but some remain.

We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.

edwardyoon commented Nov 11, 2015

Awesome.

After reading the whitepaper, I just realized that a large neural network model can be partitioned into sub-graphs by layer (horizontal partitioning) and executed serially.

One thing that's still not clear to me is the performance of fully connected networks on a multi-node cluster equipped with GPUs.
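To make the "partitioned by layer" idea concrete, a rough, hypothetical sketch of layer-wise placement is below; the device names and shapes are only for illustration, and the runtime inserts the cross-device transfer of activations automatically.

```python
# Hypothetical layer-wise (model-parallel) placement; shapes/devices are illustrative.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4096], name="input")

with tf.device("/gpu:0"):
    w1 = tf.Variable(tf.truncated_normal([4096, 2048], stddev=0.05))
    h1 = tf.nn.relu(tf.matmul(x, w1))   # layer 1 runs on device 0

with tf.device("/gpu:1"):
    w2 = tf.Variable(tf.truncated_normal([2048, 10], stddev=0.05))
    logits = tf.matmul(h1, w2)          # layer 2 consumes h1 on device 1;
                                        # the transfer is added by the runtime
```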

kdunn926 commented Nov 14, 2015

In theory, something like Dask could be layered on top to handle this - at least for the Python front-end.
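Purely as a speculative sketch of that idea: Dask could schedule independent TensorFlow evaluations from the Python front-end (with a dask.distributed scheduler these would land on separate workers), but this does not provide shared model state. The run_tf_task function below is hypothetical, not a TensorFlow or Dask API.

```python
# Speculative sketch: Dask used only to schedule independent TF evaluations.
import dask
from dask import delayed

def run_tf_task(seed):
    import tensorflow as tf
    with tf.Graph().as_default(), tf.Session() as sess:
        x = tf.random_normal([128, 128], seed=seed)
        return sess.run(tf.reduce_sum(tf.matmul(x, x)))

# Four independent tasks; a dask.distributed client would spread them over workers.
results = dask.compute(*[delayed(run_tf_task)(s) for s in range(4)])
```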

saraswat commented Nov 24, 2015

Any update on the timeline?

edwardyoon commented Dec 2, 2015

Dask looks like an interesting project, but the drawback of the blocking algorithm is that it's not memory-optimal. Since a large amount of memory is required for fully-connected layers, I thought that Pregel-like model parallelism on CPUs with vertical partitioning would be more attractive for fully connected layers (blocked matrix multiplication on a GPU also seems slow and memory-demanding to me). Of course, I may be wrong, but that's why I launched the Apache Horn project recently. Since layers can be pipelined, I hope our projects can collaborate in a complementary way.

bhack (Contributor) commented Jan 15, 2016

saudet commented Jan 16, 2016

@bhack I wonder what their Java/Scala interface looks like...

martinwicke (Member) commented Jan 16, 2016

We've started work on this using gRPC. We'll hopefully have something to show soon.

RobotiAi commented Jan 17, 2016

Really in desperate need of the distributed version. I dream of seeing it released early and contributing my efforts to this great project.

bhack (Contributor) commented Jan 17, 2016
shendiaomo commented Jan 28, 2016

Any update on the timeline? Can't wait any longer... @martinwicke

andykitchen (Contributor) commented Jan 31, 2016

Bump. I'm really glad this is your top priority. Thanks for all your amazing work so far.

krzysztof-magosa commented Feb 14, 2016

+1

grillermo commented Feb 14, 2016

+1

ctn commented Feb 15, 2016

Sorry I completely missed the mention; I have been heads-down on various matters. This looked just like the other hundreds of commit notifications in my mailbox.

Yes, we'll be talking about a distributed implementation of TensorFlow on Spark, PySpark in particular, with some pretty interesting results on scaling, GPU vs. CPU, etc. I'll see you there if you'll be in NYC for Spark Summit, or else on the live stream.

The primary motivation is from the Spark perspective - e.g., easily adding another useful workload to an existing Spark deployment.

For distributed TensorFlow in the abstract, Google will release a distributed implementation "soon".

HTH.

Update on the above (Distributed TensorFlow on Spark):

mschonwe commented Feb 16, 2016

+1

jesuisnicolasdavid commented Feb 17, 2016

+1

shendiaomo commented Feb 17, 2016

TensorFlow Serving was released today; it seems networking over gRPC has proved to be a mature solution. Great news!

LiorZ commented Apr 6, 2016

+1

mrry (Contributor) commented Apr 9, 2016

I just wanted to draw everyone's attention to 6d83874, which modifies the interface to some of the distributed runtime methods. In particular, tf.GrpcServer becomes tf.train.Server, and its constructor is more ergonomic, so you no longer need to construct a (now-renamed-to) tf.train.ServerDef proto to instantiate a server. The docs in the repository are now updated, but haven't yet made it onto the website.

Let me know if there are any questions!
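For anyone updating scripts, a minimal sketch of the new-style construction follows; the hostnames and ports are placeholders, not taken from the docs.

```python
# Post-6d83874 API: tf.train.ClusterSpec + tf.train.Server, no hand-built ServerDef.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# One server per process; this one plays the role of worker task 0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([100, 10]), name="weights")

with tf.Session(server.target) as sess:
    sess.run(tf.initialize_all_variables())
```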

mrry (Contributor) commented Apr 15, 2016

Since 0.8 is now released, I think it's time to close this issue. Please create new issues for anything that arises with the distributed version, and thanks for all of your input!

mrry closed this Apr 15, 2016

LiorZ commented Apr 17, 2016

Great! Amazing work

JinXinDeep commented Aug 1, 2016

@mrry Thanks for the distributed version of TensorFlow since v0.8. In my opinion, TensorFlow is very flexible: it can do model-parallel, data-parallel, or mixed-parallel training, although the examples are for data parallelism.

For model parallelism, for example, there is typically a cluster consisting of a number of distributed nodes (e.g. machines) for model training. If we use the in-graph mode, a main program can be used to define the tasks for all the nodes. To reduce communication overhead and do model parallelism efficiently, each node includes a parameter server (ps) task that holds the parameters of a sub-model, and a worker task that performs the computations for that sub-model; each node contains a different sub-model, assigned by the main program, and these sub-models collectively form the whole model.
The ps task on each node will receive the computation results from at least one node's worker in the cluster. In model parallelism, for example, each node's worker task typically does the computation corresponding to its ps task's sub-model, and the ps task on each node updates its sub-model according to the training results it receives.

Is that right? Thanks!
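As a hedged sketch of that reading (the addresses, shapes, and task layout below are made up), an in-graph client could pin each sub-model's variables to a ps task and its computation to the co-located worker task like this:

```python
# Hypothetical in-graph, model-parallel placement across two nodes.
# Each node would also run tf.train.Server(cluster, ...) for its ps and worker tasks.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222", "node1:2222"],
    "worker": ["node0:2223", "node1:2223"],
})

x = tf.placeholder(tf.float32, [None, 1024], name="input")

# Sub-model 1: parameters on node0's ps task, compute on node0's worker task.
with tf.device("/job:ps/task:0"):
    w1 = tf.Variable(tf.truncated_normal([1024, 512], stddev=0.05))
with tf.device("/job:worker/task:0"):
    h1 = tf.nn.relu(tf.matmul(x, w1))

# Sub-model 2: parameters on node1's ps task, compute on node1's worker task.
with tf.device("/job:ps/task:1"):
    w2 = tf.Variable(tf.truncated_normal([512, 10], stddev=0.05))
with tf.device("/job:worker/task:1"):
    logits = tf.matmul(h1, w2)

# A single client drives the whole graph through one worker's target,
# e.g. sess = tf.Session("grpc://node0:2223")
```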

raghav20 commented Sep 22, 2016

I am wondering how I can run distributed TensorFlow on top of Spark.

plutoshe pushed a commit to plutoshe/tensorflow that referenced this issue Nov 23, 2016

aselle added type:feature and removed enhancement labels Feb 9, 2017

benoitsteiner added a commit to benoitsteiner/tensorflow that referenced this issue Mar 30, 2017

benoitsteiner added a commit to benoitsteiner/tensorflow that referenced this issue Mar 30, 2017

gunan added a commit that referenced this issue Mar 30, 2017

sschaetz pushed a commit to ButterflyNetwork/tensorflow that referenced this issue Apr 6, 2017

tarasglek pushed a commit to tarasglek/tensorflow that referenced this issue Jun 20, 2017

tarasglek pushed a commit to tarasglek/tensorflow that referenced this issue Jun 20, 2017

scottcjt added a commit to scottcjt/tensorflow that referenced this issue May 16, 2018
