Distributed Version #23

Is there any distributed version of TensorFlow that could work on multiple machines?

-Minjie

Comments
I think that's the point - Google hasn't open sourced the "scalable" version :)
Thanks for the question! To reiterate what I said here: we are working on making a distributed implementation available; it's not part of the initial release. Please stay tuned, and take a look at the CIFAR multi-GPU tutorial for a flavor of how we handle multiple 'devices': http://tensorflow.org/tutorials/deep_cnn/index.md
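For readers skimming the thread, the tutorial linked above comes down to explicit device placement in the single-machine API. A minimal sketch, assuming a 0.x-era TensorFlow and soft placement to fall back to CPU when no GPU is present; the toy graph itself is made up:

```python
import tensorflow as tf

# Pin the variables to the CPU so any device can read them.
with tf.device('/cpu:0'):
    w = tf.Variable(tf.random_normal([784, 10]), name='weights')

# Pin the compute to the first GPU (soft placement falls back if absent).
with tf.device('/gpu:0'):
    x = tf.placeholder(tf.float32, shape=[None, 784])
    y = tf.matmul(x, w)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.initialize_all_variables())
```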
Would appreciate any insight on the availability of the distributed version. Is the distributed code that is being worked on in GitHub? That is one place where some of us who are interested could contribute.
👍
Hello, after reading these plans and ideas, I'm somewhat surprised. According to http://static.googleusercontent.com/media/research.google.com/en//people/jeff/BayLearn2015.pdf, both data and model parallelism are needed to train large and powerful models quickly. Meanwhile, transferring data between GPUs takes time, as described in http://tensorflow.org/tutorials/deep_cnn/index.md. How is it possible, then, to efficiently support both model parallelism and heterogeneous multi-device (single-node) execution on a distributed cluster? Could you please roughly explain how it differs from DistBelief? Thanks!

P.S. GPU acceleration could also be limited by the model partitioning strategy.
Our current internal distributed extensions are somewhat entangled with Google-internal infrastructure, which is why we released the single-machine version first. The code is not yet on GitHub because it still has dependencies on other parts of the Google code base; most of these have been trimmed, but some remain. We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment.
Awesome. After reading the whitepaper, I realized that a large neural network model can be partitioned into sub-graphs by layer (horizontal partitioning) and executed serially. One thing that isn't clear is the performance of fully connected networks on a multi-node, GPU-equipped cluster.
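To make the layer-wise partitioning concrete: with the released single-machine API, each sub-graph can be pinned to a different device, and TensorFlow inserts transfers where activations cross a device boundary. A hypothetical sketch (layer sizes and names are illustrative):

```python
import tensorflow as tf

def fc(inputs, n_out, name):
    # A plain fully connected layer; its weights live on whichever
    # device encloses the call.
    n_in = inputs.get_shape()[1].value
    w = tf.Variable(tf.random_normal([n_in, n_out]), name=name + '_w')
    b = tf.Variable(tf.zeros([n_out]), name=name + '_b')
    return tf.nn.relu(tf.matmul(inputs, w) + b)

x = tf.placeholder(tf.float32, shape=[None, 1024])

with tf.device('/gpu:0'):          # first sub-graph
    h1 = fc(x, 512, 'fc1')

with tf.device('/gpu:1'):          # second sub-graph; h1 is copied
    h2 = fc(h1, 256, 'fc2')        # across devices when the graph runs
```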
In theory, something like Dask could be layered on top for handling this - at least for the Python front-end.
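For what it's worth, Dask's model is a blocked task graph over array chunks, which is the "layered on top" idea above. A minimal sketch using only dask.array, independent of TensorFlow:

```python
import dask.array as da

# A 10000x10000 array split into 1000x1000 blocks; operations build a
# task graph over the blocks instead of materializing the whole array.
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x.dot(x.T).mean()

# The scheduler executes the graph, potentially across many workers.
print(y.compute())
```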
Any update on the timeline?
Dask looks like an interesting project, but the drawback of its blocked algorithms is that they're not memory-optimal. Since fully connected layers require a large amount of memory, I thought that Pregel-like model parallelism on CPUs with vertical partitioning would be more attractive for them (blocked matrix multiplication on a GPU also appears slow and memory-hungry to me). Of course, I may be wrong, but that's why I recently launched the Apache Horn project. Since layers can be pipelined, I hope our projects can collaborate in a complementary way.
I don't know if you can give us any early details on the framework choice, but..
@bhack I wonder what their Java/Scala interface looks like...
We've started work on this using gRPC. We hope to have something to show soon.
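For context, the gRPC-based distributed runtime that eventually shipped (TensorFlow 0.8) describes a cluster as named jobs of tasks; a minimal sketch with placeholder addresses:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps':     ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# Each process in the cluster starts one gRPC server for its own role;
# this process plays worker 0.
server = tf.train.Server(cluster, job_name='worker', task_index=0)

with tf.device('/job:ps/task:0'):
    w = tf.Variable(tf.zeros([10]))        # parameters live on the ps job

with tf.device('/job:worker/task:0'):
    update = w.assign_add(tf.ones([10]))   # workers compute updates

with tf.Session(server.target) as sess:
    sess.run(tf.initialize_all_variables())
    sess.run(update)
```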
Really in desperate need of the distributed version. I dream of seeing it released early and contributing my efforts to this great project.
Any update on the timeline? Can't wait any longer... @martinwicke
Bump. I'm really glad this is your top priority. Thanks for all your amazing work so far.
+1
+1
Sorry, I completely missed the mention; I have been heads down on various matters, and this looked just like the hundreds of other commit notifications in my mailbox. Yes, we'll be talking about a distributed implementation of TensorFlow on Spark, pyspark in particular, with some pretty interesting results on scaling, GPU vs. CPU, etc. I'll see you there if you'll be in NYC for Spark Summit, or else on the live stream. The primary motivation is from the Spark perspective, e.g., easily adding another useful workload to an existing Spark deployment. For distributed TensorFlow in the abstract, Google will release a distributed implementation "soon". HTH.

Update on the above (distributed TensorFlow on Spark):
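A hypothetical sketch of that pattern, with all names illustrative rather than taken from the talk: Spark handles partitioning the data and broadcasting parameters, and each executor runs an ordinary single-machine TensorFlow graph over its partition:

```python
from pyspark import SparkContext
import tensorflow as tf

sc = SparkContext(appName='tf-on-spark-sketch')
weights = sc.broadcast([0.5, -0.2])        # ship parameters to executors

def score_partition(rows):
    # Each executor builds its own local TensorFlow graph.
    x = tf.placeholder(tf.float32, shape=[None, 2])
    w = tf.constant(weights.value, dtype=tf.float32)
    y = tf.reduce_sum(x * w, 1)
    batch = list(rows)
    if batch:
        with tf.Session() as sess:
            for v in sess.run(y, feed_dict={x: batch}):
                yield float(v)

data = sc.parallelize([[1.0, 2.0], [3.0, 4.0]], 2)
print(data.mapPartitions(score_partition).collect())
```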
+1
+1
TensorFlow Serving was released today; it seems gRPC networking has proven to be a mature solution. Great news!