GRPC causes training to pause in individual worker (distributed tensorflow, synchronised) #13213
Comments
@mrry could you please take a look.
Someone mentioned here: "I'm also seeing this error in distributed tensorflow when I close and reset a lot of FIFOQueues."
There are a couple of possible issues, but it's difficult to tell without a reproducible example:
(2) Yes, I am using an initializable iterator to get a local end-of-epoch signal for every worker. In my case the 'epoch' count is a Python variable that is local to the worker; it does not interfere with the 'global_step' and 'local_step' that 'SyncReplicasOptimizer' uses. When a worker is initializing its iterator, the other workers need to wait briefly, but training runs as expected afterwards, so all the original assumptions of SyncReplicasOptimizer seem to hold. The problem seems to arise in gRPC. (1) Are you suggesting updating the C++ library or the Python version of gRPC? I updated the Python package, and it does not solve the problem; I am guessing I should compile TensorFlow (1.2) with Bazel to use the latest gRPC. I will provide minimal reproducible code if (1) does not fix it.
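For readers unfamiliar with this pattern, here is a minimal sketch (not the poster's actual code) of the per-worker end-of-epoch mechanism described above: each worker keeps a local Python `epoch` counter and re-runs the iterator's initializer when its input is exhausted. The `make_dataset` helper and the dataset contents are hypothetical placeholders; in TF 1.2/1.3 the Dataset API lived under `tf.contrib.data` rather than `tf.data`.

```python
import tensorflow as tf

def make_dataset():
    # Hypothetical input pipeline; a real one would read this
    # worker's shard of the training data.
    return tf.data.Dataset.range(1000).batch(32)

dataset = make_dataset()
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

epoch = 0  # local to this worker; independent of global_step/local_step
with tf.Session() as sess:
    sess.run(iterator.initializer)
    while epoch < 10:
        try:
            sess.run(next_batch)  # in practice: run the training op here
        except tf.errors.OutOfRangeError:
            # Local end-of-epoch signal: bump the counter and restart
            # the iterator so training can continue on the next pass.
            epoch += 1
            sess.run(iterator.initializer)
```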
On (1), I think you might need to upgrade to a nightly build to get the upgraded version of gRPC. (It doesn't look like the 1.3 branch has the upgraded version.)
Thank you very much @mrry for the suggestion. The problem went away with this nightly build:
System information
Describe the problem
The distributed synchronized training (between-graph replication, 4 workers, 3 PS tasks) works fine until one of the PS tasks reports the following error. After that, one of the worker processes simply stops, and the remaining workers may later stop with the same error.
For more detail, see the Stack Overflow post:
https://stackoverflow.com/questions/46322337/frozen-training-in-distributed-tensorflow
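For context, a minimal sketch of the setup described above (between-graph replication, 4 workers, 3 PS tasks, synchronized via SyncReplicasOptimizer) might look like the following, using the TF 1.x API. The host names, the trivial `loss`, and the learning rate are hypothetical placeholders, not the reporter's actual code; `tf.train.get_or_create_global_step` may live under `tf.contrib.framework` in older 1.x releases.

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "worker": ["worker0:2222", "worker1:2222", "worker2:2222", "worker3:2222"],
    "ps": ["ps0:2222", "ps1:2222", "ps2:2222"],
})
job_name, task_index = "worker", 0  # set per process, e.g. from flags
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # PS tasks just serve variables
else:
    # Between-graph replication: each worker builds its own copy of the
    # graph, with variables placed on the PS tasks.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        w = tf.Variable(0.0)
        loss = tf.square(w - 1.0)  # trivial stand-in for the real model loss
        opt = tf.train.SyncReplicasOptimizer(
            tf.train.GradientDescentOptimizer(0.01),
            replicas_to_aggregate=4, total_num_replicas=4)
        train_op = opt.minimize(loss, global_step=global_step)

    is_chief = (task_index == 0)
    hooks = [opt.make_session_run_hook(is_chief)]
    with tf.train.MonitoredTrainingSession(
            master=server.target, is_chief=is_chief, hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)  # blocks until all replicas' grads aggregate
```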