
Get UnavailableError when running object detection training on CloudML #3071

Closed
glarchev opened this issue Dec 27, 2017 · 24 comments

Comments

@glarchev

I can train an Object Detection model just fine locally, but when I try to run the training on CloudML, it runs for a little bit (during the last run it ran for about 340 steps) and then terminates because of the following error:

UnavailableError: Endpoint read failed

The full stack trace is pasted at the end of this post.

System information

  • What is the top-level directory of the model you are using: N/A, training my own model
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No, but the Object Detection source code was modified to work around bugs #2739 (Error running training in google ML engine - No matplotlib.pyplot module) and #2653 (Train.py in object_detection crash. AttributeError: module 'tensorflow.contrib.slim.python.slim.data.tfexample_decoder' has no attribute 'BackupHandler')
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CloudML default
  • TensorFlow installed from (source or binary): N/A
  • TensorFlow version (use command below): 1.4
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version: CloudML default
  • GPU model and memory: CloudML default
  • Exact command to reproduce: sudo gcloud ml-engine jobs submit training object_detection_171227 --job-dir=gs://my-sandbox/ml/train --packages /Users/user/object_detection/models/research/dist/object_detection-0.1.tar.gz,/Users/user/object_detection/models/research/slim/dist/slim-0.1.tar.gz --module-name object_detection.train --runtime-version 1.4 --region us-east1 --config /Users/user/cloud_yml/cloud.yml -- --train_dir=gs://my-sandbox/ml/train --pipeline_config_path=gs://my-sandbox/ml/data/pipeline.config

Full stack trace:

severity: "ERROR"
textPayload: "The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 332, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 763, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: Endpoint read failed

@vasudevmaduri

Faced the same issue. While training on CloudML, it threw the following error after 640 iterations:

textPayload: "The replica worker 2 exited with a non-zero status of 1. Termination reason: Error.
UnavailableError: Endpoint read failed

@bignamehyp added the stat:awaiting model gardener (Waiting on input from TensorFlow model gardener) label on Dec 31, 2017
@bignamehyp
Member

@tombstone can you please take a look or point this to the CloudML folks?

@kannan60

kannan60 commented Jan 2, 2018

I have the same issue. https://stackoverflow.com/questions/48058198/google-object-detection-api-using-faster-rcnn-resnet101-coco-model-for-trainin

@vasudevmaduri

@glarchev Can you try changing the runtime version to 1.2 in the command?

I installed TensorFlow 1.4 locally, but with the 1.4 runtime version the training could not be completed. When I tried 1.2, it ran successfully:

--runtime-version 1.2

@jiaxunwu

jiaxunwu commented Jan 2, 2018

Thanks for the feedback. Please use the 1.2 runtime version, as @vasudevmaduri suggested, for now.
We are investigating the issue.

@glarchev
Author

glarchev commented Jan 2, 2018

I can confirm that changing to --runtime-version 1.2 fixes the problem.

@kannan60

kannan60 commented Jan 3, 2018

It fixed the problem, but after exporting the inference graph I ran into another error.

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1322     try:
-> 1323       return fn(*args)
   1324     except errors.OpError as e:

c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _run_fn(session, feed_dict, fetch_list, target_list, options, run_metadata)
   1301                                    feed_dict, fetch_list, target_list,
-> 1302                                    status, run_metadata)
   1303 

c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    472             compat.as_text(c_api.TF_Message(self.status.status)),
--> 473             c_api.TF_GetCode(self.status.status))
    474     # Delete the underlying status object from memory otherwise it stays alive

InternalError: cuDNN launch failure : input shape([300,512,7,7]) filter shape([3,3,512,512])
	 [[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/Relu, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights)]]
	 [[Node: SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity/_107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1917_SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopSecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/strided_slice/stack_1/_1)]]

During handling of the above exception, another exception occurred:

InternalError                             Traceback (most recent call last)
<ipython-input-9-7493eea60222> in <module>()
     20       (boxes, scores, classes, num) = sess.run(
     21           [detection_boxes, detection_scores, detection_classes, num_detections],
---> 22           feed_dict={image_tensor: image_np_expanded})
     23       # Visualization of the results of a detection.
     24       vis_util.visualize_boxes_and_labels_on_image_array(

c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
    887     try:
    888       result = self._run(None, fetches, feed_dict, options_ptr,
--> 889                          run_metadata_ptr)
    890       if run_metadata:
    891         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1118     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1119       results = self._do_run(handle, final_targets, final_fetches,
-> 1120                              feed_dict_tensor, options, run_metadata)
   1121     else:
   1122       results = []

c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1315     if handle is None:
   1316       return self._do_call(_run_fn, self._session, feeds, fetches, targets,
-> 1317                            options, run_metadata)
   1318     else:
   1319       return self._do_call(_prun_fn, self._session, handle, feeds, fetches)

c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1334         except KeyError:
   1335           pass
-> 1336       raise type(e)(node_def, op, message)
   1337 
   1338   def _extend_graph(self):

InternalError: cuDNN launch failure : input shape([300,512,7,7]) filter shape([3,3,512,512])
	 [[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/Relu, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights)]]
	 [[Node: SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity/_107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1917_SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopSecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/strided_slice/stack_1/_1)]]

Caused by op 'SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D', defined at:
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
    app.start()
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tornado\ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\zmq\eventloop\zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\ipykernel\zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\IPython\core\interactiveshell.py", line 2728, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\IPython\core\interactiveshell.py", line 2850, in run_ast_nodes
    if self.run_code(code, result):
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-0d8b8f2357e8>", line 7, in <module>
    tf.import_graph_def(od_graph_def, name='')
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\importer.py", line 313, in import_graph_def
    op_def=op_def)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\ops.py", line 2956, in create_op
    op_def=op_def)
  File "c:\users\kannan\appdata\local\programs\python\python35\lib\site-packages\tensorflow\python\framework\ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): cuDNN launch failure : input shape([300,512,7,7]) filter shape([3,3,512,512])
	 [[Node: SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/Relu, SecondStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights)]]
	 [[Node: SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity/_107 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1917_SecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopSecondStagePostprocessor/BatchMultiClassNonMaxSuppression/map/while/strided_slice/stack_1/_1)]]

@glarchev
Author

glarchev commented Jan 3, 2018

I can export the inference graph, but for some reason the resulting model performs a lot worse than the model trained locally.

@kannan60

kannan60 commented Jan 3, 2018

I tried using the MobileNet model and it worked. However, the results are not that great, as you mentioned. When I use Faster R-CNN for better accuracy, it throws the error above after exporting the inference graph. What model did you use?

@glarchev
Author

glarchev commented Jan 3, 2018

I used Faster R-CNN as the seed model. It works as expected when trained locally but gives poor results when trained via CloudML.

@jiaxunwu

jiaxunwu commented Jan 4, 2018

@glarchev, could you provide more details of the training results? I tried to train via CloudML and got 94.08% mAP.

@mrfortynine

I'm also seeing the "Endpoint read failed" error when switching from 1.2 to 1.4. Is this related to this gRPC issue?

@glarchev
Author

glarchev commented Jan 5, 2018

@jiaxunwu I don't have hard metrics for my training results; I typically train until the loss is low enough and then visually evaluate the resulting model (my application is object detection). With local training, the model performs roughly as expected. With CloudML training, however, it produces a lot of false positives, even though the training set is the same and the loss at the end of training is roughly the same.

@puneetjindal

Has anybody resolved this on the TF 1.4 runtime?
I am also getting the same error now, even though I haven't made any code changes. It doesn't fail at the beginning; it trains for an arbitrary number of steps and then fails.

@jiaxunwu

@puneetjindal
Copy link

puneetjindal commented Jan 24, 2018 via email

@aysark
Contributor

aysark commented Feb 7, 2018

Running into this issue. Using 1.4 with a single GPU works fine, but when I try scaling up with anywhere from 1 to 10 worker nodes it runs into UnavailableError: Endpoint read failed, which I'm guessing happens when TensorFlow loses connectivity to a node.

I can't really revert to 1.2 because I had to resolve another issue and modified the code to work with 1.4. Guess it's back to AWS for now...

@siddharthm83

siddharthm83 commented Feb 9, 2018

Same here: 1.4 failed twice at arbitrary points, but 1.2 works. @jiaxunwu Can I export my trained model with TF 1.4 and expect it to work? My TF Serving infrastructure is based on TF 1.4.

@jhovell

jhovell commented Feb 26, 2018

So far this issue seems to mainly affect Faster R-CNN (I've tried ResNet-101), similar to what others have described: it fails during the first several hundred iterations, and the loss also drops far too quickly to be believable.

Has anyone had any luck with different Faster R-CNN models, or do they all fail with this error? I have actually seen this UnavailableError: Endpoint read failed using SSD MobileNet as well, but only once or twice, usually after an hour or more of training.

Downgrading to 1.2 causes other issues with Cloud ML Engine, but it's good to know this might solve the problem and may be worth trying.

My setup (Google Cloud ML Engine):

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

@jhovell

jhovell commented Mar 30, 2018

Is this just a corner case that a few of us are hitting with our particular data sets or cloud.yml configurations? I'm surprised this isn't a more popular/urgent issue, because while ODAPI on Cloud ML is still officially on 1.2 (as @jiaxunwu points out above), in at least two related issues (see my two references above) the workaround for other problems seems to be to use 1.4 or 1.5. So using newer versions seems to be the popular, and only supported, workaround for those other issues.

Has anyone figured out any workarounds? Right now I am using 1.4 with SSD only, and manually restarting training when I hit a timeout, which is less painful than going back to 1.2 and customizing more of the ODAPI code.
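
A minimal sketch of that manual-restart workaround (not from this thread), assuming the gcloud CLI is installed and authenticated; the job name and SUBMIT_FLAGS below are placeholders standing in for the full submit command from the original report:

# Sketch only: poll the Cloud ML Engine job state with gcloud and resubmit the
# training job when it fails. Because each resubmission points at the same
# --train_dir, the Object Detection trainer resumes from the latest checkpoint.
import subprocess
import time

SUBMIT_FLAGS = [
    '--runtime-version', '1.4',
    '--module-name', 'object_detection.train',
    # ...plus the --packages/--region/--config/--train_dir flags from the
    # original report, all pointing at the same gs://.../train directory.
]

def job_state(job_id):
    # `gcloud ml-engine jobs describe` reports QUEUED/RUNNING/SUCCEEDED/FAILED/...
    out = subprocess.check_output(
        ['gcloud', 'ml-engine', 'jobs', 'describe', job_id,
         '--format=value(state)'])
    return out.decode('utf-8').strip()

def babysit(job_id, poll_secs=300):
    attempt = 0
    while True:
        state = job_state(job_id)
        if state == 'SUCCEEDED':
            return
        if state in ('FAILED', 'CANCELLED'):
            attempt += 1
            new_job_id = '{}_retry{}'.format(job_id.split('_retry')[0], attempt)
            subprocess.check_call(
                ['gcloud', 'ml-engine', 'jobs', 'submit', 'training', new_job_id]
                + SUBMIT_FLAGS)
            job_id = new_job_id
        time.sleep(poll_secs)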

@tensorflowbutler removed the stat:awaiting model gardener (Waiting on input from TensorFlow model gardener) label on Apr 6, 2018
@mrfortynine

Judging by the discussion in this issue, this is expected behavior when there is a network connection problem. The remedy is to catch the error and restart the session. In TF 1.4 this is done for us by the Estimator class, so I would guess there won't be any incentive to "fix" the existing script, and the way forward for using the object detection code with 1.4 is to rewrite the relevant part using the Estimator API.
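
A minimal session-level sketch of that catch-and-restart idea (this is not the ODAPI's actual code); it assumes a train_fn wrapper around the existing trainer.train(...) call, which restores from the latest checkpoint in train_dir each time it is re-entered:

# Sketch only: restart the training loop when a transient gRPC failure surfaces
# as tf.errors.UnavailableError ("Endpoint read failed"). Slim-based training
# restores from the newest checkpoint in train_dir, so little progress is lost.
import time
import tensorflow as tf

def train_with_retries(train_fn, max_retries=5, backoff_secs=60):
    for attempt in range(1, max_retries + 1):
        try:
            return train_fn()  # e.g. a closure over the existing trainer.train(...) call
        except tf.errors.UnavailableError as err:
            tf.logging.warning(
                'UnavailableError (%s); restarting training, attempt %d/%d',
                err, attempt, max_retries)
            time.sleep(backoff_secs)
    raise RuntimeError('Training failed after %d retries' % max_retries)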

@wordjelly

It is not possible for me to move back down to 1.2. With 1.2 I get the error: TensorFlow AttributeError: 'module' object has no attribute 'data'. That was the reason for moving to 1.4+. Is there any way to fix this other than reducing the worker count to 1?

@AnubhavSi

When I moved from runtime version 1.4 to 1.2, I ran into a weird error: "/usr/bin/python: No module named util". FYI, I have not changed anything else, and to double-check, when I used 1.4 again the model ran for an hour and then the job failed with UnavailableError: Endpoint read failed.

@ymodak
Contributor

ymodak commented Dec 28, 2018

Closing this issue since it's resolved. Feel free to reopen if the issue still persists. Thanks!

@ymodak closed this as completed on Dec 28, 2018