TF1.6/1.7 PS/Worker Distributed Run Failed with "UnavailableError: OS Error" when jobs are not running on current machine #17852
Comments
Update: I reverted the gRPC upgrade change locally, and I can now start the ps task successfully. For users who are hitting this, that might be considered a workaround. I would like to hear some feedback from you experts. :)
A different user is reporting something similar on SO: https://stackoverflow.com/questions/49403392/unavailableerror-os-error-while-training-on-gc-ml-engine
Hi, I'm the person from the SO link @rhaertel80 posted.
@simpeng which gRPC commit are you going back to? This looks like the PS is simply not up when you first try to connect. The resulting error should be handled; the most surprising thing here is that a gRPC downgrade helps resolve this. Is this repeatable?
@martinwicke, yes, it's repeatable; I did it twice (once on my local machine for debugging, and once on another build machine). My test case is:
The change I reverted is the gRPC commit.
I am struggling with the same error with TensorFlow 1.6. The same error has been reported elsewhere with TensorFlow 1.6; it seems to be an OOM error. Any suggested workarounds? How can I change the grpc version if I am using the gcloud ml-engine? Any help will be appreciated.
I can confirm that I get an OOM error with a different signature if I run the same config on a single gcloud ml-engine instance (with no GPUs).
The issue reproduces on the TF 1.7 release, and the workaround (reverting the gRPC change) still works so far. I am quite curious whether anybody else who uses distributed TF has the exact same repro. (To be clear, the initial issue has nothing to do with OOM.)
Nagging Assignee @jart: It has been 15 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.
This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there. If you think we've misinterpreted a bug, please comment again with a clear explanation, as well as all of the information requested in the issue template. Thanks!
@simpeng I had the same error. Could you tell me which version of grpc you downgraded to and which version of TensorFlow you are using? Thank you.
@simpeng Also, could you explain a little how to downgrade grpc? I manually ran pip uninstall grpcio and pip install grpcio==1.0.0, but it still reports the same error. I wonder if grpc needs to be reverted somewhere else in TensorFlow?
No, the issue is not gone! Even with 1.8 the issue persists for me.
@g-hrafiq what is your issue? I still get "UnavailableError: OS Error" here. My running commands are as follows. On server 1 (172.19.120.41), I run: On server 2 (172.19.120.39), I run: Then I still get the issue. Just FYI, I installed TensorFlow from pip.
Today I ran an object detection model retraining job via CMLE using runtime 1.8 and the config.yml file below; training ran fine for 11000 steps the first time, then errored out with "UnavailableError: OS Error". I restarted the job later, and it then ran until 60000 steps before erroring.
I am getting this same error. There seem to be a lot of tickets for this issue - has no one solved it yet?
As someone mentioned above, this got solved for me by using a larger-memory instance: I changed from "masterType": "complex_model_m" to a larger machine type.
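For illustration only, a hypothetical CMLE config.yml along these lines (the machine types shown are just examples, not necessarily the exact ones used; pick whatever fits your model's memory needs):

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_l      # larger-memory master instead of complex_model_m
  workerType: complex_model_m
  workerCount: 2
  parameterServerType: large_model
  parameterServerCount: 1
```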
This has been troubling me for a while. I found out that the problem is that gRPC uses the native "epoll" polling engine for communication. Changing this to a portable polling engine solved the issue for me: set the environment variable "GRPC_POLL_STRATEGY=poll" before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.
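For example (assuming a bash-like shell; the mnist_replica.py flags below are just the ones quoted elsewhere in this issue), export the variable in the shell that launches each ps/worker process:

```
# switch gRPC from the native epoll engine to the portable poll engine
export GRPC_POLL_STRATEGY=poll
python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'ps'
```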
@nerdyalbin thanks for sharing! Was the default polling engine changed in recent gRPC releases? Or did epoll behaving differently (compared with before) cause the issue?
Same problem, it works! ^-^
Could you tell me how to set the environment variable? Thanks very much.
import os
os.environ["GRPC_POLL_STRATEGY"] = "poll"  # set before any TF server/session is created
It works! Thanks for your help!
System information
You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
('v1.6.0-rc1-1503-g47407cc', '1.6.0')
Describe the problem
The expected behavior
The source code below uses ps/worker mode to do some training. To use it, we need to run it respectively on the "ps job" machine and the "worker job" machine (example commands are sketched below).
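For reference, the two invocations look roughly like this (the ps command is the one quoted in the error description further down; the worker command is assumed to differ only in --job_name and, for additional workers, --task_index):

```
# on the ps machine (10.0.1.5)
python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'ps'

# on the worker machine (10.0.1.4)
python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'worker'
```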
If we run the script first on the ps machine, it will normally wait for the worker machine to be ready before going further; the log is as below:
Then we run the script on the worker machine, and the two machines communicate and coordinate to get things done.
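For context, this is the standard TF 1.x ps/worker pattern the script follows; a minimal sketch (not the actual mnist_replica.py, with placeholder values for what the script reads from its flags):

```python
import tensorflow as tf

# Placeholder values; in mnist_replica.py these come from command-line flags.
job_name, task_index = "ps", 0

cluster = tf.train.ClusterSpec({
    "ps": ["10.0.1.5:14416"],
    "worker": ["10.0.1.4:14417"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
  # The ps task blocks here, serving variables to workers over gRPC.
  server.join()
else:
  # The worker builds its graph and opens a session against its own server;
  # MasterSession initialization (where the reported error surfaces) happens here.
  with tf.Session(server.target) as sess:
    pass  # build and run the training graph
```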
The problem
It works pretty well on TF 1.5/1.4 or earlier versions, but on the latest 1.6.0 release (and I also tried building from master source code) it sometimes fails. I did some investigation and testing; here are the symptoms:
If the specified ps/worker hosts have the same IP as the current machine running the scripts (e.g. ps/worker run on different ports of the current machine), everything is fine and they work.
If the specified ps/worker hosts both have the same IP (call it A-IP) but it is different from the current machine's IP, it fails even though the current machine can successfully ping A-IP. The error log after starting the ps task (with python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'ps') is:
If the specified ps/worker hosts have different IPs (in the same LAN and able to ping each other), the error when starting the ps task is similar to the second situation.
The exception happens in MasterSession initialization (I guess some communication via gRPC is needed there).
My personal thinking
To be honest, I am wondering whether the gRPC upgrade (introduced since v1.6rc0) is responsible, but I am pretty new to this component, and I am not sure whether anybody else has similar issues (though I would expect people using TF 1.6 and master to hit this on distributed runs).
It would be great if any experts could share some insights or thoughts. Thanks in advance!
Source code / logs
source code:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import math
import sys
import tempfile
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
IMAGE_PIXELS = 28
def create_done_queue(ps_task_index, worker_count):
  """Queue used to signal death for i'th ps shard. Intended to have
  all workers enqueue an item onto it to signal doneness."""
  # Body reconstructed from the docstring: a shared FIFO queue on the i'th ps task.
  with tf.device("/job:ps/task:%d" % ps_task_index):
    return tf.FIFOQueue(worker_count, tf.int32, shared_name="done_queue" + str(ps_task_index))
def create_done_queues(ps_count, worker_count):
  return [create_done_queue(ps_task_index, worker_count) for ps_task_index in range(ps_count)]
def main(args):
  mnist = input_data.read_data_sets(args.input_training_data_path, one_hot=True)
  if args.download_only:
    sys.exit(0)
if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.add_argument("--input-training-data-path", default="/tmp/mnist-data")
  parser.add_argument("--input_training_data_path", default="/tmp/mnist-data")
  parser.add_argument("--download_only", type=bool, default=False)
  parser.add_argument("--task-index", type=int)
  parser.add_argument("--task_index", type=int)
  parser.add_argument("--num_gpus", type=int, default=1)
  parser.add_argument("--replicas_to_aggregate", type=int)
  parser.add_argument("--hidden_units", type=int, default=100)
  parser.add_argument("--train_steps", type=int, default=200)
  parser.add_argument("--batch_size", type=int, default=100)
  parser.add_argument("--learning_rate", type=float, default=0.01)
  parser.add_argument("--sync_replicas", type=bool, default=False)
  parser.add_argument("--existing_servers", type=bool, default=False)
  parser.add_argument("--ps-hosts", default="localhost:2222")
  parser.add_argument("--ps_hosts", default="localhost:2222")
  parser.add_argument("--worker-hosts", default="localhost:2223,localhost:2224")
  parser.add_argument("--worker_hosts", default="localhost:2223,localhost:2224")
  parser.add_argument("--job-name")
  parser.add_argument("--job_name")
  parser.add_argument("--protocol", default="grpc")
  parser.add_argument("--infer_shapes", type=bool, default=False)