
TF1.6/1.7 PS/Worker Distributed Run Failed with "UnavailableError: OS Error" when jobs are not running on current machine #17852

Closed
simpeng opened this issue Mar 20, 2018 · 24 comments

simpeng commented Mar 20, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): official 1.6.0 release binary, or build from master branch (with latest commit: 47407cc)
  • TensorFlow version (use command below): 1.6.0 official release or master
  • Python version: python 3.5 or python 2.7
  • Bazel version (if compiling from source): 0.11.1
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
  • CUDA/cuDNN version: 9.0/7.0
  • GPU model and memory: Tesla K80, 12206MiB
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
('v1.6.0-rc1-1503-g47407cc', '1.6.0')

Describe the problem

The expected behavior

The source code below uses ps/worker mode to do some training. To use it, we run

python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'ps'

python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'worker'

on the "ps job" machine and the "worker job" machine, respectively.

If we run the script on the ps machine first, it normally waits for the worker machine to become ready before going further. The log is as below:

2018-03-20 05:49:40.410488: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:14416}
2018-03-20 05:49:40.410614: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.1.4:14417}
2018-03-20 05:49:40.418149: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:14416
ps 0, create done queue
ps 0, running
2018-03-20 05:49:50.430531: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-03-20 05:50:00.430728: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-03-20 05:50:10.430943: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-03-20 05:50:20.431080: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-03-20 05:50:30.431351: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
^C2018-03-20 05:50:40.434895: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-03-20 05:50:50.435104: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-03-20 05:51:00.435244: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

Then we run the script on the worker machine, and the two machines communicate and coordinate to get things done.

The problem

It works pretty well on TF 1.4/1.5 or earlier versions, but on the latest 1.6.0 release (and also when built from master source code) it sometimes fails. I did some investigation and testing; here are the symptoms:

  • If the specified ps/worker hosts have the same IP as the current machine running the scripts (e.g. ps/worker are running on different ports of the current machine), everything is fine and they work.

  • If the specified ps/worker hosts have the same IP (call it A-IP) but it is different from the current machine's, it fails even though the current machine can ping A-IP successfully. The error log after starting the ps task (with python mnist_replica.py --data_dir /tmp/tensorflow/mnist/input_data --task_index 0 --ps_hosts '10.0.1.5:14416' --worker_hosts '10.0.1.4:14417' --job_name 'ps'):

2018-03-20 05:57:29.228323: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:14416}
2018-03-20 05:57:29.228478: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 10.0.1.4:14417}
2018-03-20 05:57:29.229155: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:14416
ps 0, create done queue
ps 0, running
I0320 05:57:29.309552659 3803 subchannel.cc:677] Connect failed: {"created":"@1521525449.309441854","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":198,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.0.1.4:14417"}
I0320 05:57:29.309786369 3803 subchannel.cc:484] Retry in 998 milliseconds
I0320 05:57:30.307312499 3796 subchannel.cc:437] Failed to connect to channel, retrying
I0320 05:57:30.308555551 3804 subchannel.cc:677] Connect failed: {"created":"@1521525450.308464247","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":198,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.0.1.4:14417"}
I0320 05:57:30.308750759 3804 subchannel.cc:484] Retry in 999 milliseconds
I0320 05:57:31.307171978 3796 subchannel.cc:437] Failed to connect to channel, retrying
I0320 05:57:31.308303225 3802 subchannel.cc:677] Connect failed: {"created":"@1521525451.308214021","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":198,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.0.1.4:14417"}
I0320 05:57:31.308338927 3802 subchannel.cc:484] Retry in 999 milliseconds
I0320 05:57:32.307163816 3796 subchannel.cc:437] Failed to connect to channel, retrying
I0320 05:57:32.308250261 3801 subchannel.cc:677] Connect failed: {"created":"@1521525452.308164957","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":198,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.0.1.4:14417"}
I0320 05:57:32.308284662 3801 subchannel.cc:484] Retry in 999 milliseconds
I0320 05:57:33.307136307 3796 subchannel.cc:437] Failed to connect to channel, retrying
I0320 05:57:33.308314356 3806 subchannel.cc:677] Connect failed: {"created":"@1521525453.308215652","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":198,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.0.1.4:14417"}
I0320 05:57:33.308375658 3806 subchannel.cc:484] Retry in 999 milliseconds
I0320 05:57:34.307172752 3796 subchannel.cc:437] Failed to connect to channel, retrying
2018-03-20 05:57:34.308793: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
Traceback (most recent call last):
File "mnist_replica.py", line 304, in
main(args)
File "mnist_replica.py", line 102, in main
sess.run(queue.dequeue())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

  • If the specified ps/worker hosts have different IPs (in the same LAN, and they can ping each other successfully), the errors on starting the ps task are similar to the second situation.

  • The exception happens in MasterSession initialization (I guess some communication via gRPC is needed there).

My personal thinking

To be honest, I am wondering whether the gRPC upgrade (introduced since v1.6rc0) is the cause, but I am pretty new to this component, and I am not sure whether anybody else has similar issues (though I think people using TF 1.6 and master will suffer from this on distributed runs).

It would be great if any experts could share some insights or thoughts. Thanks in advance!

Source code / logs

source code:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import math
import sys
import tempfile
import time

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

IMAGE_PIXELS = 28

def create_done_queue(ps_task_index, worker_count):
    """Queue used to signal death for i'th ps shard. Intended to have
    all workers enqueue an item onto it to signal doneness."""

    with tf.device("/job:ps/task:%d/cpu:0" % (ps_task_index)):
        return tf.FIFOQueue(worker_count, tf.int32, shared_name="done_queue" + str(ps_task_index))

def create_done_queues(ps_count, worker_count):
    return [create_done_queue(ps_task_index, worker_count) for ps_task_index in range(ps_count)]

def main(args):
    mnist = input_data.read_data_sets(args.input_training_data_path, one_hot=True)
    if args.download_only:
        sys.exit(0)

    if args.job_name is None or args.job_name == "":
        raise ValueError("Must specify an explicit `job_name`")
    if args.task_index is None or args.task_index == "":
        raise ValueError("Must specify an explicit `task_index`")

    print("job name = %s" % args.job_name)
    print("task index = %d" % args.task_index)

    # Construct the cluster and start the server
    ps_spec = args.ps_hosts.split(",")
    worker_spec = args.worker_hosts.split(",")

    # Get the number of workers.
    num_workers = len(worker_spec)
    num_pss = len(ps_spec)

    cluster = tf.train.ClusterSpec({
        "ps": ps_spec,
        "worker": worker_spec})

    if not args.existing_servers:
        # Not using existing servers. Create an in-process server.
        server = tf.train.Server(
            cluster, job_name=args.job_name, task_index=args.task_index, protocol=args.protocol)
        if args.job_name == "ps":
            config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
            sess = tf.Session(server.target, config=config)

            print("ps %d, create done queue" % args.task_index)
            queue = create_done_queue(args.task_index, num_workers)

            print("ps %d, running" % args.task_index)
            for i in range(num_workers):
                sess.run(queue.dequeue())
                print("ps %d received worker %d done" % (args.task_index, i))

            print("all workers are done, ps %d: exit" % (args.task_index))
            sys.exit()

    is_chief = (args.task_index == 0)
    if args.num_gpus > 0:
        # Avoid gpu allocation conflict: now allocate task_num -> #gpu
        # for each worker in the corresponding machine
        gpu = (args.task_index % args.num_gpus)
        worker_device = "/job:worker/task:%d/gpu:%d" % (args.task_index, gpu)
    elif args.num_gpus == 0:
        # Just allocate the CPU to worker server
        cpu = 0
        worker_device = "/job:worker/task:%d/cpu:%d" % (args.task_index, cpu)

    print("worker %d, worker_device=%s" % (args.task_index, worker_device))
    print("worker %d, create done queue" % args.task_index)
    queues = create_done_queues(num_pss, num_workers)
    print("worker %d, done queue created" % args.task_index)

    # The device setter will automatically place Variables ops on separate
    # parameter servers (ps). The non-Variable ops will be placed on the workers.
    # The ps use CPU and workers use corresponding GPU

    with tf.device(
            tf.train.replica_device_setter(
                worker_device=worker_device,
                ps_device="/job:ps/cpu:0",
                cluster=cluster)):
        global_step = tf.Variable(0, name="global_step", trainable=False)

        # Variables of the hidden layer
        hid_w = tf.Variable(
            tf.truncated_normal(
                [IMAGE_PIXELS * IMAGE_PIXELS, args.hidden_units],
                stddev=1.0 / IMAGE_PIXELS),
            name="hid_w")
        hid_b = tf.Variable(tf.zeros([args.hidden_units]), name="hid_b")

        # Variables of the softmax layer
        sm_w = tf.Variable(
            tf.truncated_normal(
                [args.hidden_units, 10],
                stddev=1.0 / math.sqrt(args.hidden_units)),
            name="sm_w")
        sm_b = tf.Variable(tf.zeros([10]), name="sm_b")

        # Ops: located on the worker specified with args.task_index
        x = tf.placeholder(tf.float32, [None, IMAGE_PIXELS * IMAGE_PIXELS])
        y_ = tf.placeholder(tf.float32, [None, 10])

        hid_lin = tf.nn.xw_plus_b(x, hid_w, hid_b)
        hid = tf.nn.relu(hid_lin)

        y = tf.nn.softmax(tf.nn.xw_plus_b(hid, sm_w, sm_b))
        cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

        opt = tf.train.AdamOptimizer(args.learning_rate)

        if args.sync_replicas:
            if args.replicas_to_aggregate is None:
                replicas_to_aggregate = num_workers
            else:
                replicas_to_aggregate = args.replicas_to_aggregate

            opt = tf.train.SyncReplicasOptimizer(
                opt,
                replicas_to_aggregate=replicas_to_aggregate,
                total_num_replicas=num_workers,
                name="mnist_sync_replicas")

        train_step = opt.minimize(cross_entropy, global_step=global_step)

        if args.sync_replicas:
            local_init_op = opt.local_step_init_op
            if is_chief:
                local_init_op = opt.chief_init_op

            ready_for_local_init_op = opt.ready_for_local_init_op

            # Initial token and chief queue runners required by the sync_replicas mode
            chief_queue_runner = opt.get_chief_queue_runner()
            sync_init_op = opt.get_init_tokens_op()

        init_op = tf.global_variables_initializer()
        train_dir = tempfile.mkdtemp()

        enq_ops = []
        for q in queues:
            qop = q.enqueue(1)
            enq_ops.append(qop)
    if args.sync_replicas:
        sv = tf.train.Supervisor(
            is_chief=is_chief,
            logdir=train_dir,
            init_op=init_op,
            local_init_op=local_init_op,
            ready_for_local_init_op=ready_for_local_init_op,
            recovery_wait_secs=1,
            global_step=global_step)
    else:
        sv = tf.train.Supervisor(
            is_chief=is_chief,
            logdir=train_dir,
            init_op=init_op,
            recovery_wait_secs=1,
            global_step=global_step)

    sess_config = tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=False,
        device_filters=["/job:ps", "/job:worker/task:%d" % args.task_index])
    if args.infer_shapes == True:
        sess_config.graph_options.infer_shapes = args.infer_shapes

    # The chief worker (task_index==0) session will prepare the session,
    # while the remaining workers will wait for the preparation to complete.
    if is_chief:
        print("Worker %d: Initializing session..." % args.task_index)
    else:
        print("Worker %d: Waiting for session to be initialized..." %
              args.task_index)

    if args.existing_servers:
        server_grpc_url = "grpc://" + worker_spec[args.task_index]
        print("Using existing server at: %s" % server_grpc_url)

        sess = sv.prepare_or_wait_for_session(server_grpc_url,
                                              config=sess_config)
    else:
        sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)

    print("Worker %d: Session initialization complete." % args.task_index)

    if args.sync_replicas and is_chief:
        # Chief worker will start the chief queue runner and call the init op.
        sess.run(sync_init_op)
        sv.start_queue_runners(sess, [chief_queue_runner])

    # Perform training
    time_begin = time.time()
    print("Training begins @ %f" % time_begin)

    local_step = 0
    while True:
        # Training feed
        batch_xs, batch_ys = mnist.train.next_batch(args.batch_size)
        train_feed = {x: batch_xs, y_: batch_ys}

        _, step = sess.run([train_step, global_step], feed_dict=train_feed)
        local_step += 1

        now = time.time()
        print("%f: Worker %d: training step %d done (global step: %d)" %
              (now, args.task_index, local_step, step))

        if step >= args.train_steps:
            break

    time_end = time.time()
    print("Training ends @ %f" % time_end)
    training_time = time_end - time_begin
    print("Training elapsed time: %f s" % training_time)

    # Validation feed
    val_feed = {x: mnist.validation.images, y_: mnist.validation.labels}
    val_xent = sess.run(cross_entropy, feed_dict=val_feed)
    print("After %d training step(s), validation cross entropy = %g" %
          (args.train_steps, val_xent))

    for op in enq_ops:
        sess.run(op)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-training-data-path", default="/tmp/mnist-data")
    parser.add_argument("--input_training_data_path", default="/tmp/mnist-data")
    parser.add_argument("--download_only", type=bool, default=False)
    parser.add_argument("--task-index", type=int)
    parser.add_argument("--task_index", type=int)
    parser.add_argument("--num_gpus", type=int, default=1)
    parser.add_argument("--replicas_to_aggregate", type=int)
    parser.add_argument("--hidden_units", type=int, default=100)
    parser.add_argument("--train_steps", type=int, default=200)
    parser.add_argument("--batch_size", type=int, default=100)
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--sync_replicas", type=bool, default=False)
    parser.add_argument("--existing_servers", type=bool, default=False)
    parser.add_argument("--ps-hosts", default="localhost:2222")
    parser.add_argument("--ps_hosts", default="localhost:2222")
    parser.add_argument("--worker-hosts", default="localhost:2223,localhost:2224")
    parser.add_argument("--worker_hosts", default="localhost:2223,localhost:2224")
    parser.add_argument("--job-name")
    parser.add_argument("--job_name")
    parser.add_argument("--protocol", default="grpc")
    parser.add_argument("--infer_shapes", type=bool, default=False)

    (args, unknown) = parser.parse_known_args()
    main(args)

simpeng commented Mar 20, 2018

Update: I reverted the gRPC upgrade change locally, and I can now start the ps task successfully.

For users who are hitting this, I believe this can be considered a workaround.

I would like to hear some feedback from you experts. :)

@rhaertel80

A different user is reporting something similar on SO: https://stackoverflow.com/questions/49403392/unavailableerror-os-error-while-training-on-gc-ml-engine

@martinwicke added the type:bug label Mar 21, 2018
@TopHatCroat

Hi, I'm the person from @rhaertel80's SO link.
I have since lowered the version to 1.5, but now I'm experiencing a different problem similar to this one, on which I have posted some more details.
This might be completely unrelated, though.

@martinwicke

@simpeng which gRPC commit are you going back to?

This looks like the PS is simply not up when you first try to connect. The resulting error should be handled; the most surprising thing here is that a gRPC downgrade helps resolve this. Is this repeatable?
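
One way to handle it on the user side while the remote job is still coming up is to catch the error and retry the first sess.run. A minimal sketch (the helper name, retry interval, and per-attempt session re-creation are assumptions, not taken from mnist_replica.py):

import time

import tensorflow as tf

def wait_for_workers_done(server_target, queue, wait_secs=10, max_attempts=30):
    """Block on the done-queue dequeue, retrying while the remote job is unreachable."""
    dequeue_op = queue.dequeue()
    for _ in range(max_attempts):
        try:
            # Use a fresh session per attempt, since the failed master
            # initialization surfaces as UnavailableError on the first run call.
            with tf.Session(server_target) as sess:
                return sess.run(dequeue_op)
        except tf.errors.UnavailableError as err:
            print("Remote job not reachable yet (%s); retrying in %d s" % (err.message, wait_secs))
            time.sleep(wait_secs)
    raise RuntimeError("remote job never became reachable")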


simpeng commented Mar 23, 2018

@martinwicke, yes it's repeatable; I did it twice (once on my local machine for debugging, and once on another build machine).

My testing case is:

  • Start the PS job first; normally it keeps waiting for a response from the worker until the worker gets started.

2018-03-20 05:49:50.430531: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

  • With 1.6, it tries to "connect to the channel" and retries a few times, then throws an exception:

{"created":"@1521525453.308215652","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":198,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.0.1.4:14417"}
I0320 05:57:33.308375658 3806 subchannel.cc:484] Retry in 999 milliseconds
I0320 05:57:34.307172752 3796 subchannel.cc:437] Failed to connect to channel, retrying
2018-03-20 05:57:34.308793: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
Traceback (most recent call last):
File "mnist_replica.py", line 304, in
main(args)
File "mnist_replica.py", line 102, in main
sess.run(queue.dequeue())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

The change I reverted is the gRPC commit.

@choudharydhruv

I am struggling with the same error with TensorFlow 1.6. The same error has been reported elsewhere with TensorFlow 1.6:
yahoo/TensorFlowOnSpark#245

It seems to be an OOM error.

Any suggested workarounds? How can I change the gRPC version if I am using gcloud ml-engine? I believe it comes pre-packaged in the runtime environment.

Any help will be appreciated.

@choudharydhruv

I can confirm that I get an OOM error with a different signature if I run the same config on a single gcloud ml-engine instance (with no GPUs):
The replica master 0 ran out-of-memory and exited with a non-zero status of 247.


simpeng commented Apr 8, 2018

The issue reproduces on the TF 1.7 release, and the workaround (reverting the gRPC change) still works.

I am quite curious whether anybody else who uses distributed TF has the exact same repro. (To be clear, the initial issue has nothing to do with OOM.)

simpeng changed the title from "TF1.6 PS/Worker Distributed Run Failed with "UnavailableError: OS Error" when jobs are not running on current machine" to "TF1.6/1.7 PS/Worker Distributed Run Failed with "UnavailableError: OS Error" when jobs are not running on current machine" Apr 8, 2018
@tensorflowbutler

Nagging Assignee @jart: It has been 15 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.


jart commented Apr 26, 2018

This question is better asked on StackOverflow since it is not a bug or feature request. There is also a larger community that reads questions there.

If you think we've misinterpreted a bug, please comment again with a clear explanation, as well as all of the information requested in the issue template. Thanks!

jart closed this as completed Apr 26, 2018
@tanndx17

@simpeng I had the same error. Could you tell me which version of gRPC you downgraded to and which version of TensorFlow you are using? Thank you.

@tanndx17

@simpeng Also, could you show a little of how to downgrade gRPC? I manually ran pip uninstall grpcio and pip install grpcio==1.0.0, but it still seems to report the same error. I wonder if gRPC needs to be reverted somewhere else in TensorFlow?


pengwa commented Jul 24, 2018

@tanndx17 I reverted the gRPC commit cb49899 and built from source.

AFAIK, since tf1.8, the issue is gone.

@g-hrafiq

No, the issue is not gone! Even with 1.8 the issue persists for me.

@tanndx17

@g-hrafiq what is your issue? I still have "UnavailableError: OS Error" here. My running commands are:

On server 1 (172.19.120.41), I run:
CUDA_VISIBLE_DEVICES=0 python3 mnist_2.py
--ps_hosts=localhost:2222
--worker_hosts=172.19.120.39:2223
--job_name=ps --task_index=0

On server 2 (172.19.120.39), I run:
CUDA_VISIBLE_DEVICES=0 python3 mnist_2.py
--ps_hosts=172.19.120.41:2222
--worker_hosts=localhost:2223
--job_name=worker -task_index=0

Then I still get the issue here. Just FYI, I installed TensorFlow from pip.


g-hrafiq commented Jul 28, 2018

Today I ran an object detection model retraining via CMLE using runtime 1.8 and the config.yml below. Training ran fine for 11000 steps the first time, then errored out with "UnavailableError: OS Error". I restarted the job later, and it then ran until 60000 steps before erroring again.

trainingInput:
  runtimeVersion: "1.8"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard


cclough commented Aug 2, 2018

I am getting the same error. There seem to be a lot of tickets for this issue; has no one solved it yet?

@elmer-garduno

As someone mentioned above, this was solved for me by using a larger-memory instance. I changed from

"masterType": "complex_model_m",
to
"masterType": "large_model",

@nerdyalbin

This has been troubling me for a while. I found out that the problem is that gRPC uses the native "epoll" polling engine for communication. Changing this to a portable polling engine solved the issue for me. The way to do this is to set the environment variable GRPC_POLL_STRATEGY=poll before running the TensorFlow programs. For reference, see https://github.com/grpc/grpc/blob/master/doc/environment_variables.md.
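
A minimal sketch of setting it from Python, assuming the variable has to be in the process environment before TensorFlow (and its bundled gRPC) is initialized:

import os

# Ask gRPC for its portable "poll" engine instead of the native epoll-based one.
# Assumption: set this before importing tensorflow in the same process;
# exporting GRPC_POLL_STRATEGY=poll in the shell before launching the script
# has the same effect.
os.environ["GRPC_POLL_STRATEGY"] = "poll"

import tensorflow as tf  # imported only after the variable is set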


pengwa commented Aug 21, 2018

@nerdyalbin thanks for sharing! Was the default polling engine changed in recent gRPC? Or did epoll behaving differently (compared with before) cause these issues?


zyhzdp commented Oct 19, 2018

(Quoting @nerdyalbin's GRPC_POLL_STRATEGY=poll suggestion above.)

Same problem here; it works! ^-^

@lihao056

(Quoting @nerdyalbin's GRPC_POLL_STRATEGY=poll suggestion above.)

Could you tell me how to set the environment variable? Thanks very much.


formath commented Mar 13, 2019

(Quoting @nerdyalbin's suggestion and @lihao056's question above about how to set the environment variable.)

import os
os.environ['GRPC_POLL_STRATEGY'] = "poll"

@lihao056

(Quoting @formath's os.environ snippet above.)

It works! Thanks for your help!
