GRPC causes training to pause in individual worker (distributed tensorflow, synchronised) #13213

Closed
utkrist opened this issue Sep 21, 2017 · 7 comments

utkrist commented Sep 21, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian GNU/Linux 8.9 (jessie)
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.2.0-5-g435cdfc 1.2.1
  • Python version: 3.6.2
  • CUDA/cuDNN version: cuda-8.0 / cudnn-5.1.5
  • GPU model and memory: GeForce GTX Titan X, 12 GB
  • Exact command to reproduce:

Describe the problem

The distributed synchronized training (between-graph replication, 4 workers, 3 ps tasks) works fine until one of the ps tasks reports the following error. After that, one of the worker processes simply stops, and the rest of the workers may also stop later with the same error.

2017-09-21 16:45:55.606842: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2000, 1 -> localhost:2001, 2 -> localhost:2002}
 2017-09-21 16:45:55.606877: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2003, 1 -> localhost:2004, 2 -> localhost:2005, 3 -> localhost:2006}
 2017-09-21 16:45:55.608066: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:2002
 E0921 16:48:52.596846076    3037 parsing.c:801]              ignoring out of order new grpc_chttp2_stream request on server; last grpc_chttp2_stream id=12325, new grpc_chttp2_stream id=12317
 2017-09-21 16:48:57.497244: W tensorflow/core/framework/op_kernel.cc:1158] Out of range: End of sequence
      [[Node: data_source_task_index_0/IteratorGetNext = IteratorGetNext[output_shapes=[[-1,-1], [-1,-1], [-1,-1], [-1,-1], [-1,-1]], output_types=[DT_INT64, DT_INT64, DT_INT64, DT_INT64, DT_INT64], _device="/job:ps/replica:0/task:0/cpu:0"](data_source_task_index_0/Iterator)]]
      [[Node: data_source_task_index_0/cond/Merge_2_S341 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/cpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-6450759800525444137, tensor_name="edge_359_data_source_task_index_0/cond/Merge_2", tensor_type=DT_INT64, _device="/job:ps/replica:0/task:2/cpu:0"]()]]
 E0921 16:49:58.462749643    3036 parsing.c:801]              ignoring out of order new grpc_chttp2_stream request on server; last grpc_chttp2_stream id=24775, new grpc_chttp2_stream id=24769
 E0921 16:49:58.462780714    3036 parsing.c:801]              ignoring out of order new grpc_chttp2_stream request on server; last grpc_chttp2_stream id=24775, new grpc_chttp2_stream id=24773
 E0921 16:49:58.463260203    3036 parsing.c:801]              ignoring out of order new grpc_chttp2_stream request on server; last grpc_chttp2_stream id=24793, new grpc_chttp2_stream id=24777
 E0921 16:49:58.463277333    3036 parsing.c:801]              ignoring out of order new grpc_chttp2_stream request on server; last grpc_chttp2_stream id=24793, new grpc_chttp2_stream id=24779
 E0921 16:49:58.463283953    3036 parsing.c:801]              ignoring out of order new grpc_chttp2_stream request on server; last grpc_chttp2_stream id=24793, new grpc_chttp2_stream id=24781
 E0921 16:49:58.463289625    3036 parsing.c:801]              ignoring out of order new grpc_chttp2_stream request on server; last grpc_chttp2_stream id=24793, new grpc_chttp2_stream id=24783
 E0921 16:49:58.463295275    3036 parsing.c:801]              ignoring out of order new grpc_chttp2_stream request on server; last grpc_chttp2_stream id=24793, new grpc_chttp2_stream id=24785

For more detail, see the Stack Overflow post:
https://stackoverflow.com/questions/46322337/frozen-training-in-distributed-tensorflow
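
For reference, the cluster layout implied by the GrpcChannelCache lines above corresponds to a setup along the following lines (a minimal sketch; the actual training script is not included in this report, and the job_name/task_index values are illustrative):

    import tensorflow as tf

    # Cluster layout reconstructed from the GrpcChannelCache log lines above:
    # 3 ps tasks on ports 2000-2002 and 4 worker tasks on ports 2003-2006.
    cluster = tf.train.ClusterSpec({
        "ps": ["localhost:2000", "localhost:2001", "localhost:2002"],
        "worker": ["localhost:2003", "localhost:2004",
                   "localhost:2005", "localhost:2006"],
    })

    # Each process starts one in-process server for its own job/task; e.g. the
    # ps task that logged "Started server with target: grpc://localhost:2002":
    server = tf.train.Server(cluster, job_name="ps", task_index=2)
    server.join()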

shivaniag added the stat:awaiting tensorflower label on Sep 21, 2017
shivaniag (Contributor) commented:

@mrry could you please take a look.


utkrist commented Sep 21, 2017

Someone mentions here that: "I'm also seeing this error in distributed tensorflow when I close and reset a lot of FIFOQueues."


mrry commented Sep 21, 2017

There are a couple of possible issues, but it's difficult to tell without a reproducible example:

  1. According to the linked issue in gRPC, the problem might be due to an old version of gRPC. Try upgrading to a later version of TensorFlow to get the update.
  2. It looks like you're using an initializable iterator in order to get end-of-epoch signals. However, SyncReplicasOptimizer depends on all of the replicas taking an equal number of steps. (It wasn't designed with initializable iterators in mind; see the sketch below.)
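
For context, a minimal sketch of the sync-replicas setup under discussion (the model, loss, and learning rate below are placeholders, not taken from the reporter's code):

    import tensorflow as tf

    num_workers = 4
    # Placeholder model: a single variable fitted towards 1.0.
    w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
    loss = tf.square(w - 1.0)
    global_step = tf.Variable(0, trainable=False, name="global_step")

    # SyncReplicasOptimizer aggregates gradients from `replicas_to_aggregate`
    # replicas before applying a single update, so every replica must keep stepping.
    opt = tf.train.SyncReplicasOptimizer(
        tf.train.GradientDescentOptimizer(0.01),
        replicas_to_aggregate=num_workers,
        total_num_replicas=num_workers)
    train_op = opt.minimize(loss, global_step=global_step)

    # A replica that pauses (e.g. while re-initializing its input iterator) leaves
    # the other replicas waiting at the aggregation barrier.
    sync_hook = opt.make_session_run_hook(is_chief=True)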

mrry added the stat:awaiting response label and removed the stat:awaiting tensorflower label on Sep 21, 2017

utkrist commented Sep 22, 2017

(2) Yes, I am using an initializable iterator to get a local end-of-epoch signal for every worker. In my case the 'epoch' count is a Python variable that is local to the worker. It does not interfere with the 'global_step' and 'local_step' that 'SyncReplicasOptimizer' uses. When some worker is initializing its iterator, the other workers need to wait for a brief time, but the training runs as expected afterwards. All original assumptions of SyncReplicasOptimizer seem to hold. The problem seems to arise in gRPC.
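
A hypothetical sketch of that pattern (not the actual code from this job; the dataset contents, batch size, and epoch count are placeholders):

    import numpy as np
    import tensorflow as tf

    # Per-worker initializable iterator used as a local end-of-epoch signal
    # (the TF 1.2-era Dataset API lives under tf.contrib.data).
    features = np.arange(1000, dtype=np.int64)
    dataset = tf.contrib.data.Dataset.from_tensor_slices(features).batch(32)
    iterator = dataset.make_initializable_iterator()
    next_batch = iterator.get_next()

    num_epochs = 5  # local Python variable on each worker
    with tf.Session() as sess:
        for epoch in range(num_epochs):
            sess.run(iterator.initializer)       # reset the iterator for a new epoch
            while True:
                try:
                    sess.run(next_batch)         # in the real job this is the train_op
                except tf.errors.OutOfRangeError:
                    break                        # iterator exhausted: local end of epoch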

(1) Are you suggesting updating the C++ library or the Python version of gRPC? I updated the Python package; it does not solve it. I am guessing I should compile TensorFlow (1.2) with Bazel to use the latest gRPC.

I will provide a minimal reproducible example if (1) does not fix it.

aselle removed the stat:awaiting response label on Sep 22, 2017

mrry commented Sep 23, 2017

On (1), I think you might need to upgrade to a nightly build to get the upgraded version of gRPC. (It doesn't look like the 1.3 branch has the upgraded version.)

mrry added the stat:awaiting response label on Sep 23, 2017

utkrist commented Sep 24, 2017

Is this the relevant one, or this?

aselle removed the stat:awaiting response label on Sep 24, 2017

utkrist commented Sep 25, 2017

utkrist closed this as completed on Sep 25, 2017