TF1.6/1.7 PS/Worker Distributed Run Failed with "UnavailableError: OS Error" when jobs are not running on current machine #17852
Comments
Update: I reverted the gRPC upgrade change locally, and I can now start the ps task successfully. For users who are hitting this, that might be considered a workaround. I would like to hear some feedback from you experts. :)
A different user is reporting something similar on SO: https://stackoverflow.com/questions/49403392/unavailableerror-os-error-while-training-on-gc-ml-engine
Hi, I'm the person from the SO link @rhaertel80 posted.
@simpeng which gRPC commit are you going back to? This looks like the PS is simply not up when you first try to connect. The resulting error should be handled; the most surprising thing here is that a gRPC downgrade helps resolve this. Is this repeatable?
@martinwicke, yes, it's repeatable; I did it twice (once on my local machine for debugging, and once on another build machine). My test case is:
The change I reverted is the gRPC commit.
I am struggling with the same error with TensorFlow 1.6. The same error has been reported elsewhere with TensorFlow 1.6; it seems to be an OOM error. Any suggested workarounds? How can I change the grpc version if I am using the gcloud ml-engine? Any help will be appreciated.
I can confirm that I get an OOM error with a different signature if I run the same config on a single gcloud ml-engine instance (with no GPUs).
The issue reproduces on the TF 1.7 release, and the workaround (reverting the gRPC change) still works so far. I am quite curious whether anybody else who uses distributed TF has the exact same repro. (To be clear, the initial issue has nothing to do with OOM.)
Nagging Assignee @jart: It has been 15 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.
This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there. If you think we've misinterpreted a bug, please comment again with a clear explanation, as well as all of the information requested in the issue template. Thanks!
@simpeng I had the same error. Could you tell me which version of grpc you downgraded to and which version of TensorFlow you are using? Thank you.
@simpeng Also, could you explain a little how to downgrade grpc? I manually ran pip uninstall grpcio and pip install grpcio==1.0.0, but it still reports the same error. I wonder if grpc needs to be reverted somewhere else in TensorFlow?
No, the issue is not gone! Even with 1.8 the issue persists for me.
@g-hrafiq what is your issue? I still get "UnavailableError: OS Error" here. My running commands are as follows. On server 1 (172.19.120.41), I run: On server 2 (172.19.120.39), I run: Then I still get the issue. Just FYI, I installed TensorFlow from pip.
Today I ran an object detection model retraining job via CMLE using runtime 1.8 and the config.yml file below; training ran fine for 11000 steps the first time, then errored out with "UnavailableError: OS Error". I restarted the job later, and it then ran until 60000 steps before erroring.
I am getting this same error. There seem to be a lot of tickets for this issue - has no one solved it yet?
As someone mentioned above, this got solved for me by using a larger-memory instance: I changed from "masterType": "complex_model_m" to a larger machine type.
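For illustration only, a hypothetical CMLE config.yml along these lines (the machine types shown are just examples, not necessarily the exact ones used; pick whatever fits your model's memory needs):

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_l      # larger-memory master instead of complex_model_m
  workerType: complex_model_m
  workerCount: 2
  parameterServerType: large_model
  parameterServerCount: 1
```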
This has been troubling me for a while. I found out that the problem is that gRPC uses the native "epoll" polling engine for communication. Changing this to a portable polling engine solved the issue for me: set the environment variable "GRPC_POLL_STRATEGY=poll" before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.
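For example (assuming a bash-like shell; the mnist_replica.py flags below are just the ones quoted elsewhere in this issue), export the variable in the shell that launches each ps/worker process:

```
# switch gRPC from the native epoll engine to the portable poll engine
export GRPC_POLL_STRATEGY=poll
python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'ps'
```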
@nerdyalbin thanks for sharing! Was the default polling engine changed in recent gRPC releases? Or did epoll behaving differently (compared with before) cause the issue?
Same problem, it works! ^-^
Could you tell me how to set the environment variable? Thanks very much.
import os
os.environ["GRPC_POLL_STRATEGY"] = "poll"  # set before any TF server/session is created
It works! Thanks for your help!
System information
You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
('v1.6.0-rc1-1503-g47407cc', '1.6.0')
Describe the problem
The expected behavior
The source code below uses ps/worker mode to do some training. To use it, we need to run it respectively on the "ps job" machine and the "worker job" machine (example commands are sketched below).
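For reference, the two invocations look roughly like this (the ps command is the one quoted in the error description further down; the worker command is assumed to differ only in --job_name and, for additional workers, --task_index):

```
# on the ps machine (10.0.1.5)
python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'ps'

# on the worker machine (10.0.1.4)
python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'worker'
```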
If we run the script first on the ps machine, it will normally wait for the worker machine to be ready before going further; the log is as below:
Then we run the script on the worker machine, and the two machines communicate and coordinate to get things done.
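For context, this is the standard TF 1.x ps/worker pattern the script follows; a minimal sketch (not the actual mnist_replica.py, with placeholder values for what the script reads from its flags):

```python
import tensorflow as tf

# Placeholder values; in mnist_replica.py these come from command-line flags.
job_name, task_index = "ps", 0

cluster = tf.train.ClusterSpec({
    "ps": ["10.0.1.5:14416"],
    "worker": ["10.0.1.4:14417"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
  # The ps task blocks here, serving variables to workers over gRPC.
  server.join()
else:
  # The worker builds its graph and opens a session against its own server;
  # MasterSession initialization (where the reported error surfaces) happens here.
  with tf.Session(server.target) as sess:
    pass  # build and run the training graph
```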
The problem
It works pretty well on TF 1.5/1.4 or earlier versions, but on the latest 1.6.0 release (and I also tried building from master source code) it sometimes fails. I did some investigation and testing; here are the symptoms:
If the specified ps/worker hosts have the same IP as the current machine running the scripts (e.g. ps/worker run on different ports of the current machine), everything is fine and they work.
If the specified ps/worker hosts both have the same IP (call it A-IP) but it is different from the current machine's IP, it fails even though the current machine can successfully ping A-IP. The error log after starting the ps task (with python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'ps') is:
If the specified ps/worker hosts have different IPs (in the same LAN and able to ping each other), the error when starting the ps task is similar to the second situation.
The exception happens in MasterSession initialization (I guess some communication via gRPC is needed there).
My personal thinking
To be honest, I am wondering whether the gRPC upgrade (introduced since v1.6rc0) is responsible, but I am pretty new to this component, and I am not sure whether anybody else has similar issues (though I would expect people using TF 1.6 and master to hit this on distributed runs).
It would be great if any experts could share some insights or thoughts. Thanks in advance!
Source code / logs
source code:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import math
import sys
import tempfile
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
IMAGE_PIXELS = 28
def create_done_queue(ps_task_index, worker_count):
  """Queue used to signal death for i'th ps shard. Intended to have
  all workers enqueue an item onto it to signal doneness."""
  # Body reconstructed from the docstring: a shared FIFO queue on the i'th ps task.
  with tf.device("/job:ps/task:%d" % ps_task_index):
    return tf.FIFOQueue(worker_count, tf.int32, shared_name="done_queue" + str(ps_task_index))
def create_done_queues(ps_count, worker_count):
  return [create_done_queue(ps_task_index, worker_count) for ps_task_index in range(ps_count)]
def main(args):
  mnist = input_data.read_data_sets(args.input_training_data_path, one_hot=True)
  if args.download_only:
    sys.exit(0)
if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.add_argument("--input-training-data-path", default="/tmp/mnist-data")
  parser.add_argument("--input_training_data_path", default="/tmp/mnist-data")
  parser.add_argument("--download_only", type=bool, default=False)
  parser.add_argument("--task-index", type=int)
  parser.add_argument("--task_index", type=int)
  parser.add_argument("--num_gpus", type=int, default=1)
  parser.add_argument("--replicas_to_aggregate", type=int)
  parser.add_argument("--hidden_units", type=int, default=100)
  parser.add_argument("--train_steps", type=int, default=200)
  parser.add_argument("--batch_size", type=int, default=100)
  parser.add_argument("--learning_rate", type=float, default=0.01)
  parser.add_argument("--sync_replicas", type=bool, default=False)
  parser.add_argument("--existing_servers", type=bool, default=False)
  parser.add_argument("--ps-hosts", default="localhost:2222")
  parser.add_argument("--ps_hosts", default="localhost:2222")
  parser.add_argument("--worker-hosts", default="localhost:2223,localhost:2224")
  parser.add_argument("--worker_hosts", default="localhost:2223,localhost:2224")
  parser.add_argument("--job-name")
  parser.add_argument("--job_name")
  parser.add_argument("--protocol", default="grpc")
  parser.add_argument("--infer_shapes", type=bool, default=False)