Get UnavailableError when running object detection training on CloudML #3071
Comments
Faced the same issue: textPayload: "The replica worker 2 exited with a non-zero status of 1. Termination reason: Error."
@tombstone can you please take a look or point to CloudML folks?
@glarchev Can you try changing the runtime version to 1.2 in the command? I installed the TensorFlow 1.4 version, but with runtime version 1.4 the training could not be completed. When I tried 1.2, it executed: --runtime-version 1.2
Thanks for the feedback; please use the 1.2 runtime version as @vasudevmaduri suggested for now.
I can confirm that changing to --runtime-version 1.2 fixes the problem.
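For reference, here is a minimal sketch of what the submission command looks like with that flag added, roughly following the running_on_cloud.md instructions; the bucket, package, and config paths are placeholders, not values taken from this thread:

gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --runtime-version 1.2 \
    --job-dir=gs://YOUR_BUCKET/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config object_detection/samples/cloud/cloud.yml \
    -- \
    --train_dir=gs://YOUR_BUCKET/train \
    --pipeline_config_path=gs://YOUR_BUCKET/data/YOUR_PIPELINE.config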
It fixed the problem, but after exporting the inference graph I end up with another error.
I can export the inference graph, but for some reason the resulting model performs a lot worse than the model trained locally.
I tried using a MobileNet model and it worked. The results, however, are not that great, as you mentioned. So for better accuracy, when I use Faster R-CNN, it throws the error after exporting the inference graph. What model did you use?
I used Faster R-CNN as a seed. It seems to work as expected when trained locally, but gives poor results when trained via CloudML.
@glarchev, could you provide more details of the training results? I tried to train via CloudML and got 94.08% mAP.
I'm also seeing the "Endpoint read failed" error when switching from 1.2 to 1.4. Is this related to this gRPC issue?
@jiaxunwu I don't have hard metrics for my training results; I typically train until the loss is low enough, and then visually evaluate the resulting model (my application is object detection). With local training, the model performs roughly as expected. With CloudML training, however, it seems to produce a lot of false positives (even though the training set is the same, and the loss at the end of training is roughly the same).
Has anybody resolved it on the TF 1.4 runtime?
@puneetjindal could you try 1.2 instead as described in https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md?
But we have dataset API limitations in 1.2.
Running into this issue. Using 1.4 with a single GPU it works fine, but when I try scaling up with anywhere from 1 to 10 worker nodes it runs into this error. I can't really revert back to 1.2 because I had to resolve another issue and modified the code to work with 1.4. Guess it's back to AWS for now...
Same here: 1.4 failed twice arbitrarily, but 1.2 works. @jiaxunwu Can I export my trained model with TF 1.4 and expect it to work? My TF Serving infrastructure is based on TF 1.4.
So far this issue seems to mainly affect Faster R-CNN (I've tried ResNet-101), similar to what others have described: it fails during the first several hundred iterations, and the loss drops far too quickly to be believable. Has anyone had any luck with different Faster R-CNN models, or do they all fail with this error? I have actually seen this. Downgrading to 1.2 causes other issues with Cloud ML Engine, but good to know this might solve the issue and may be worth trying. My setup: Google Cloud ML Engine.
Is this just a corner case that a few of us are hitting with our various datasets or cloud.yml configurations? I'm surprised this isn't a more popular/urgent issue, because while ODAPI on Cloud ML is still officially on 1.2 (as @jiaxunwu points out above), in at least two related issues (see my two references above) the workaround seems to be to use 1.4 or 1.5. So using newer versions seems to be popular, and the only supported workaround for those other issues. Has anyone figured out any workarounds for this one? Right now I am using 1.4 with SSD only, and manually restarting training when I hit a timeout, which is less painful than using 1.2 and trying to customize more of the ODAPI code.
Judging by the discussion in this issue, this is expected behavior when there is a network connection issue. The remedy is catching the error and restarting the session. In TF 1.4 this is done for us by the Estimator class. So I would guess there won't be any incentive to "fix" the existing script, and the way forward to use the object detection code with 1.4 is to rewrite the relevant part using the Estimator API.
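As an illustration of that remedy (not code from the Object Detection API), here is a minimal sketch that catches the transient error and restarts the training loop; train_once and max_retries are hypothetical names, and train_once is assumed to resume from the latest checkpoint in train_dir:

import tensorflow as tf

def train_with_retries(train_once, max_retries=5):
    # train_once is assumed to build the graph and run the training loop
    # (e.g. slim.learning.train); each retry resumes from the latest checkpoint.
    for attempt in range(max_retries):
        try:
            return train_once()
        except tf.errors.UnavailableError as err:
            tf.logging.warning('Endpoint read failed (attempt %d of %d): %s',
                               attempt + 1, max_retries, err)
    raise RuntimeError('Training did not finish after %d attempts' % max_retries)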
It is not possible for me to move back down to 1.2. With 1.2 I get the TensorFlow error AttributeError: 'module' object has no attribute 'data'. That was the reason for moving to 1.4+. Is there any way to fix this other than reducing the worker count to 1?
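For context, that AttributeError appears because tf.data only became a top-level module in TF 1.4; in the 1.2 runtime the same Dataset classes live under tf.contrib.data. A minimal compatibility sketch (illustrative only, not code from this thread):

import tensorflow as tf

# tf.data exists in TF >= 1.4; TF 1.2/1.3 ship the Dataset API as tf.contrib.data.
if hasattr(tf, 'data'):
    Dataset = tf.data.Dataset
else:
    Dataset = tf.contrib.data.Dataset

dataset = Dataset.from_tensor_slices([1, 2, 3]).batch(2)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()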
When I moved from runtime version 1.4 to 1.2, I ran into a weird error: "/usr/bin/python: No module named util". FYI, I have not changed anything else, and to make sure, when I used 1.4 again the model ran for an hour and then the job failed with the error UnavailableError: Endpoint read failed.
Closing this issue since it's resolved. Feel free to reopen if the issue still persists. Thanks!
I can train an Object Detection model just fine locally, but when I try to run the training on CloudML, it runs for a little bit (during the last run it ran for about 340 steps) and then terminates because of the following error:
UnavailableError: Endpoint read failed
The full stack trace is pasted at the end of this post.
System information
Full stack trace:
severity: "ERROR"
textPayload: "The replica worker 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 332, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 763, in train
sess, train_op, global_step, train_step_kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
UnavailableError: Endpoint read failed