GRPC causes training to pause in individual worker (distributed tensorflow, synchronised) #13213
Comments
@mrry could you please take a look.
Someone mentioned here: "I'm also seeing this error in distributed tensorflow when I close and reset a lot of FIFOQueues."
There are a couple of possible issues, but it's difficult to tell without a reproducible example:
(2) Yes, I am using an initializable iterator to get a local end-of-epoch signal for every worker. In my case the 'epoch' count is a Python variable that is local to the worker; it does not interfere with the 'global_step' and 'local_step' that 'SyncReplicasOptimizer' uses. When a worker is initializing its iterator, the other workers need to wait briefly, but training runs as expected afterwards, so all the original assumptions of SyncReplicasOptimizer seem to hold. The problem seems to arise in gRPC. (1) Are you suggesting updating the C++ library or the Python version of gRPC? I updated the Python package, and it does not solve the problem; I am guessing I should compile TensorFlow (1.2) with Bazel to use the latest gRPC. I will provide minimal reproducible code if (1) does not fix it.
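For readers unfamiliar with this pattern, here is a minimal sketch (not the poster's actual code) of the per-worker end-of-epoch mechanism described above: each worker keeps a local Python `epoch` counter and re-runs the iterator's initializer when its input is exhausted. The `make_dataset` helper and the dataset contents are hypothetical placeholders; in TF 1.2/1.3 the Dataset API lived under `tf.contrib.data` rather than `tf.data`.

```python
import tensorflow as tf

def make_dataset():
    # Hypothetical input pipeline; a real one would read this
    # worker's shard of the training data.
    return tf.data.Dataset.range(1000).batch(32)

dataset = make_dataset()
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

epoch = 0  # local to this worker; independent of global_step/local_step
with tf.Session() as sess:
    sess.run(iterator.initializer)
    while epoch < 10:
        try:
            sess.run(next_batch)  # in practice: run the training op here
        except tf.errors.OutOfRangeError:
            # Local end-of-epoch signal: bump the counter and restart
            # the iterator so training can continue on the next pass.
            epoch += 1
            sess.run(iterator.initializer)
```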
On (1), I think you might need to upgrade to a nightly build to get the upgraded version of gRPC. (It doesn't look like the 1.3 branch has the upgraded version.)
Thank you very much @mrry for the suggestion. The problem went away with this nightly build:
System information
Describe the problem
The distributed synchronized training (between-graph replication, 4 workers, 3 PS tasks) works fine until one of the PS tasks reports the following error. After that, one of the worker processes simply stops, and the remaining workers may later stop with the same error.
For more detail, see the Stack Overflow post:
https://stackoverflow.com/questions/46322337/frozen-training-in-distributed-tensorflow
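For context, a minimal sketch of the setup described above (between-graph replication, 4 workers, 3 PS tasks, synchronized via SyncReplicasOptimizer) might look like the following, using the TF 1.x API. The host names, the trivial `loss`, and the learning rate are hypothetical placeholders, not the reporter's actual code; `tf.train.get_or_create_global_step` may live under `tf.contrib.framework` in older 1.x releases.

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "worker": ["worker0:2222", "worker1:2222", "worker2:2222", "worker3:2222"],
    "ps": ["ps0:2222", "ps1:2222", "ps2:2222"],
})
job_name, task_index = "worker", 0  # set per process, e.g. from flags
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # PS tasks just serve variables
else:
    # Between-graph replication: each worker builds its own copy of the
    # graph, with variables placed on the PS tasks.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        w = tf.Variable(0.0)
        loss = tf.square(w - 1.0)  # trivial stand-in for the real model loss
        opt = tf.train.SyncReplicasOptimizer(
            tf.train.GradientDescentOptimizer(0.01),
            replicas_to_aggregate=4, total_num_replicas=4)
        train_op = opt.minimize(loss, global_step=global_step)

    is_chief = (task_index == 0)
    hooks = [opt.make_session_run_hook(is_chief)]
    with tf.train.MonitoredTrainingSession(
            master=server.target, is_chief=is_chief, hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)  # blocks until all replicas' grads aggregate
```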