
Distributed Version #23

Closed · jermainewang opened this issue Nov 9, 2015 · 54 comments

@jermainewang commented Nov 9, 2015

Is there any distributed version of TensorFlow that could work on multiple machines?

-Minjie

@rusenask commented Nov 9, 2015

I think that's the point - Google hasn't open-sourced the "scalable" version :)

@vrv (Contributor) commented Nov 9, 2015

Thanks for the question! To reiterate what I said here, we are working on making a distributed implementation available; it's currently not in the initial release. Please stay tuned, and take a look at the CIFAR multi-GPU tutorial for a flavor of how we handle multiple 'devices': http://tensorflow.org/tutorials/deep_cnn/index.md
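For readers who haven't seen the tutorial, here is a minimal sketch (not taken from the tutorial itself) of the explicit device placement mechanism the multi-GPU example builds on; the device strings and soft-placement option are standard graph-mode TensorFlow API:

```python
# Minimal sketch of explicit device placement: ops are pinned to devices,
# and the runtime inserts the necessary copies between them.
import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])

with tf.device('/gpu:0'):
    b = tf.matmul(a, a)  # placed on GPU 0 if one is available

# allow_soft_placement falls back to CPU when the requested device is absent.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(b))
```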

@saraswat commented Nov 9, 2015

Would appreciate any insight on the availability of the distributed version. Is the distributed code that is being worked on available on GitHub? That is one place where some of us who are interested could contribute.

@zhangandyx commented Nov 10, 2015

👍

@edwardyoon commented Nov 10, 2015

Hello,

After reading these plans and ideas, I'm somewhat surprised. According to http://static.googleusercontent.com/media/research.google.com/en//people/jeff/BayLearn2015.pdf, both data and model parallelism are needed to train large and powerful models quickly. Also, transferring data between GPUs takes time, as described in http://tensorflow.org/tutorials/deep_cnn/index.md. So how is it possible to efficiently support both model parallelism and heterogeneous multi-device execution (within a single node) on a distributed cluster? Could you please roughly explain how it differs from DistBelief?

Thanks!

@edwardyoon commented Nov 10, 2015

P.S. GPU acceleration could also be limited by the model partitioning strategy.

@jeffreyadean (Contributor) commented Nov 11, 2015

Our current internal distributed extensions are somewhat entangled with Google-internal infrastructure, which is why we released the single-machine version first. The code is not yet on GitHub because it still has dependencies on other parts of the Google code base; most of these have been trimmed, but some remain.

We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.

@edwardyoon commented Nov 11, 2015

Awesome.

After reading the whitepaper, I just realized that a large neural network model can be partitioned into sub-graphs by layer (horizontal partitioning) and executed serially.

One thing that is still not clear to me is the performance of a fully connected network on a multi-node cluster equipped with GPUs.
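As an illustration of that layer-wise partitioning on a single machine, here is a hedged sketch in which each sub-graph (layer) is pinned to a different GPU and the activation tensor crosses the device boundary between them; the shapes and device names are made up:

```python
# Hedged sketch of layer-wise (model-parallel) partitioning: each layer's
# variables and ops live on a different device, and the activation h1 is
# copied from GPU 0 to GPU 1 between the two sub-graphs.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 1024])

with tf.device('/gpu:0'):
    w1 = tf.Variable(tf.random_normal([1024, 4096]))
    h1 = tf.nn.relu(tf.matmul(x, w1))       # first sub-graph

with tf.device('/gpu:1'):
    w2 = tf.Variable(tf.random_normal([4096, 10]))
    logits = tf.matmul(h1, w2)              # second sub-graph

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(logits, feed_dict={x: np.zeros([8, 1024], np.float32)})
    print(out.shape)  # (8, 10)
```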

@kdunn926 commented Nov 14, 2015

In theory, something like Dask could be layered on top for handling this - at least for the Python front-end.
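A hypothetical sketch of what that layering might look like, assuming a running dask.distributed scheduler: each Dask worker builds and runs its own single-machine TensorFlow graph, and Dask handles only scheduling and result collection. This is an illustration of the idea, not an existing integration; the scheduler address is a placeholder.

```python
# Hypothetical sketch: Dask schedules independent single-machine TensorFlow
# computations across workers; 'scheduler-address:8786' is a placeholder.
from dask.distributed import Client

def run_tf_shard(shard_id):
    import tensorflow as tf  # imported on the worker process
    with tf.Graph().as_default():
        x = tf.constant(float(shard_id))
        y = x * x
        with tf.Session() as sess:
            return float(sess.run(y))

client = Client('scheduler-address:8786')
futures = [client.submit(run_tf_shard, i) for i in range(4)]
print(client.gather(futures))  # e.g. [0.0, 1.0, 4.0, 9.0]
```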

@saraswat commented Nov 24, 2015

Any update on timeline?

@edwardyoon commented Dec 2, 2015

Dask looks like an interesting project, but the drawback of the blocking algorithm is that it's not memory-optimal. Since a large amount of memory is required for fully connected layers, I thought that Pregel-like model parallelism on CPUs with vertical partitioning would be more attractive for fully connected layers (blocked matrix multiplication on GPUs also appears to me to be slow and memory-demanding). Of course, I may be wrong, but that's why I launched the Apache Horn project recently. Since layers can be pipelined, I hope our projects can collaborate in a complementary way.

@bhack (Contributor) commented Jan 15, 2016

@saudet commented Jan 16, 2016

@bhack I wonder what their Java/Scala interface looks like...

@martinwicke (Member) commented Jan 16, 2016

We've started work on this using gRPC. Hopefully we'll have something to show soon.
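For context, a sketch of the kind of gRPC-backed cluster setup the distributed runtime eventually exposed through tf.train.ClusterSpec and tf.train.Server; the API was not public at the time of this comment, and the host names below are placeholders, so treat this purely as an illustration:

```python
# Illustrative only: a two-worker, one-parameter-server cluster definition.
# Host names are placeholders; each process would start its own server.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps':     ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# This process acts as worker 0 and starts a gRPC server for its task.
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# Ops can then be placed on remote devices by job/task name.
with tf.device('/job:ps/task:0'):
    w = tf.Variable(tf.zeros([10]))

with tf.device('/job:worker/task:0'):
    update = w.assign_add(tf.ones([10]))

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(update))
```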

@RobotiAi commented Jan 17, 2016

Really in desperate need of the distributed version. I dream of seeing it released early and contributing my efforts to this great project.

@bhack (Contributor) commented Jan 17, 2016

@shendiaomo commented Jan 28, 2016

Any update on the timeline? Can't wait any longer... @martinwicke

@andykitchen (Contributor) commented Jan 31, 2016

Bump. I'm really glad this is your top priority. Thanks for all your amazing work so far.

@krzysztof-magosa commented Feb 14, 2016

+1

@grillermo commented Feb 14, 2016

+1

@ctn commented Feb 15, 2016

Sorry, I completely missed the mention; I've been heads-down on various matters. This looked just like the hundreds of other commit notifications in my mailbox.

Yes, we'll be talking about a distributed implementation of TensorFlow on Spark, PySpark in particular. Some pretty interesting results on scaling, GPU vs. CPU, etc. I'll see you there if you'll be in NYC for Spark Summit, or else on the live stream.

The primary motivation is from the Spark perspective: e.g., easily adding another useful workload to an existing Spark deployment.
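As a hedged illustration of that pattern (not necessarily what the talk itself describes): Spark distributes data partitions, and each executor runs an ordinary single-machine TensorFlow session over its partition, e.g. for parallel inference. The toy "model" below is a stand-in expression.

```python
# Illustrative sketch: each Spark executor runs a local TensorFlow session
# over its partition of the data. The "model" here is a stand-in expression.
from pyspark import SparkContext

def score_partition(rows):
    import tensorflow as tf  # imported on the executor
    with tf.Graph().as_default(), tf.Session() as sess:
        x = tf.placeholder(tf.float32)
        y = 2.0 * x + 1.0
        for row in rows:
            yield float(sess.run(y, feed_dict={x: row}))

sc = SparkContext(appName='tf-on-spark-sketch')
data = sc.parallelize([1.0, 2.0, 3.0, 4.0], numSlices=2)
print(data.mapPartitions(score_partition).collect())  # [3.0, 5.0, 7.0, 9.0]
```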

For distributed TensorFlow in the abstract, Google will release a distributed implementation "soon".

HTH.

Update on the above (Distributed TensorFlow on Spark):

@mschonwe commented Feb 16, 2016

+1

@jesuisnicolasdavid commented Feb 17, 2016

+1

@shendiaomo commented Feb 17, 2016

TensorFlow Serving was released today; it seems gRPC has proven to be a mature solution for networking. Great news!
