"UnavailableError: OS Error" when running training on Google Cloud with TensorFlow 1.8 #4314

ProjectDent opened this Issue May 19, 2018 · 3 comments

ProjectDent commented May 19, 2018

My model trains fine in 20 minutes with TensorFlow 1.2. I changed my cloud.yml file's runtimeVersion to 1.8, my setup.py file's REQUIRED_PACKAGES to require 'tensorflow>=1.8.0', and my submit training command's --runtime-version to 1.8.
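For reference, a minimal sketch of the setup.py change described above. The package name and metadata below are placeholders, not the reporter's exact file; only the tensorflow pin reflects the change discussed here.

# setup.py for the training package uploaded to Cloud ML Engine -- sketch only.
from setuptools import find_packages, setup

REQUIRED_PACKAGES = ['tensorflow>=1.8.0']

setup(
    name='object_detection_trainer',  # hypothetical package name
    version='0.1',
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
)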

Now training takes about 80 minutes before crashing at 4940 steps (60 short of the 5000 steps I'd set for training), with this error:

The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 184, in <module>
    tf.app.run()
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 399, in train
    saver=saver)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 769, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error
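For context, the UnavailableError here comes out of session.run when a remote task in the distributed job (a worker or a parameter server) can no longer be reached over gRPC. A minimal sketch of where it surfaces, assuming a plain distributed tf.Session in TF 1.x; this is illustrative only, not the object_detection training loop:

# Illustrative only: shows the exception type and where it is raised when a
# remote task drops out of a distributed TF 1.x job mid-training.
import tensorflow as tf

def run_one_step(sess, train_op, global_step):
    try:
        return sess.run([train_op, global_step])
    except tf.errors.UnavailableError as err:
        # Raised by the gRPC layer when a worker or parameter server becomes
        # unreachable (e.g. it is preempted or restarted). slim's training
        # loop does not handle it, so the replica exits with status 1.
        tf.logging.error('Lost contact with a remote task: %s', err)
        raise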

tensorflowbutler commented May 27, 2018

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

wpp commented Jul 16, 2018

Hi, I'm experiencing the same behaviour (and error). After training successfully for a while (see below for details), the master-replica-0 spits out:

Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnavailableError'>, OS Error
[[Node: clip_grads/clip_by_norm_2/truediv_S5417 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:2/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=2772952984520217269, tensor_name="edge_8753_clip_grads/clip_by_norm_2/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:2/device:CPU:0"]()]]
[[Node: Momentum/update/NoOp_2_S5432 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:2/device:CPU:0", send_device_incarnation=3296691442993841585, tensor_name="edge_8991_Momentum/update/NoOp_2", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/device:CPU:0"]()]]
[[Node: clip_grads/clip_by_norm_186/truediv_S5067 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:1/device:CPU:0", send_device="/job:master/replica:0/task:0/device:CPU:0", send_device_incarnation=2772952984520217269, tensor_name="edge_4785_clip_grads/clip_by_norm_186/truediv", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:1/device:CPU:0"]()]]

This is followed by a tear-down of the job.

This happened once after ~40 minutes and once after ~2 hours. It was suggested elsewhere that this might be an OOM error, but the master as well as the workers hover at around 20% memory usage.
[Screenshots: memory utilization graphs for the master and workers, 2018-07-16]

Let me know if and how I can provide more information.
Regards

gamcoh commented Oct 30, 2018

I have the same issue when I train the faster_rcnn_resnet101 model with ML Engine.

It's my trial account; I have a Tesla K80.

Here's my cloud.yml config file:

trainingInput:
  runtimeVersion: "1.9"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 3
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard