Error running training in google ML engine - No matplotlib.pyplot module #2739
Comments
I had exactly the same issue and posted in #2724. But unfortunately, they do not seem to care. I managed to add matplotlib as a dependency in the setup.py under the research folder, which is hacky: I feel uncomfortable modifying the source code to make it work. However, I ran into more errors after that. |
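A minimal sketch of the kind of setup.py change described in the comment above. The package name, version, and requirement strings are assumptions for illustration, not copied from the repository. The sketch only builds the keyword arguments; a real setup.py would pass them to setuptools.setup().

```python
# Hypothetical sketch: declare matplotlib as an install-time dependency so the
# packaged trainer pulls it in on the remote workers.
# The requirement strings and metadata below are assumptions for illustration.
REQUIRED_PACKAGES = [
    "Pillow>=1.0",  # assumed to be listed already
    "matplotlib",   # the module that was missing at import time
]


def build_setup_kwargs(extra_requirements=()):
    """Return keyword arguments intended for setuptools.setup()."""
    return {
        "name": "object_detection",
        "version": "0.1",
        "install_requires": REQUIRED_PACKAGES + list(extra_requirements),
        "include_package_data": True,
    }


# In an actual setup.py one would write something like:
#   from setuptools import setup, find_packages
#   setup(packages=find_packages(), **build_setup_kwargs())
```

Keeping the requirement list in one place makes it easy to append a further pin later, for example `build_setup_kwargs(["tensorflow>=1.4"])`, without touching the rest of the file.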
@derekjchow This looks like a problem with the Pets Tutorial. Can you take a look? |
Why is matplotlib needed for training? |
The |
Apologies, we will take a look at this and respond shortly. |
Same issue here, thanks. |
Having the same issue with pet's retrainer tutorial. |
I was having a similar issue in that whenever I tried importing matplotlib it would give some _tkinter error on the cluster I was training on. I installed matplotlib with pip and then as a temporary fix in research/object_detection/utils/visualization_utils.py before
which is taken from https://stackoverflow.com/a/40931739/2698494. I have not tested it on Google's Cloud ML though. |
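The snippet itself did not survive in the comment above. A later traceback in this thread shows visualization_utils.py containing `import matplotlib; matplotlib.use('Agg')`, so the workaround was presumably along the following lines. This is a sketch under that assumption; the helper name `use_headless_backend` is made up for illustration.

```python
def use_headless_backend(backend="Agg"):
    """Select a non-interactive matplotlib backend.

    Returns the active backend name, or None if matplotlib is not installed.
    """
    try:
        import matplotlib
    except ImportError:
        return None
    # Call this before importing pyplot; older matplotlib versions require
    # the backend to be chosen first.
    matplotlib.use(backend)
    return matplotlib.get_backend()


active = use_headless_backend()
if active is not None:
    # With Agg selected, pyplot does not need Tk or a display.
    import matplotlib.pyplot as plt
```

The Agg backend renders to in-memory buffers and files only, which is why it avoids the _tkinter import on machines without a display.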
I have tried it on Google's Cloud ML. So I got around the matplotlib issue as @glc12125 suggested and around the tkinter error as @floft suggested. I am not sure if my problem is specific to running transfer learning on the new faster_rcnn_nas_coco model, but I get the following downstream error (after correcting for #2668 as well):
|
Was this issue ever addressed? I am unable to run a model I could previously run with an older version of the object_detection API due to the matplotlib and python-tk dependency issue. |
This blog mentions some changes in setup.py that address compatibility issues with Tensorflow and GCP: https://medium.com/google-cloud/object-detection-tensorflow-and-google-cloud-platform-72e0a3f3bdd6 |
Hey ashizhao, thanks for pointing out the direction. I tried the "fix" for setup.py from the blog you mentioned, and then hit another issue, an out-of-date GCP runtime version, described in #2707. I followed the solution from #2707, which again meant changing the source code: in both object_detection/evaluator.py (line 184) and object_detection/builders/optimizer_builder.py (line 103), use tf.contrib.framework.get_or_create_global_step instead of tf.train.get_or_create_global_step. I finally got it working for both training and evaluation. I guess the main issues are twofold:
|
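The global-step change above swaps one attribute path for another by hand. As an alternative sketch (not what the commenter did), a small lookup helper can pick whichever path exists at run time. The helper `resolve_first` and the fake module object are hypothetical, so the example runs without TensorFlow installed.

```python
from types import SimpleNamespace


def resolve_first(root, dotted_names):
    """Return the first attribute path in `dotted_names` that exists under `root`."""
    for dotted in dotted_names:
        obj = root
        try:
            for part in dotted.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            continue  # this path is missing, try the next candidate
        return obj
    raise AttributeError("none of %r found" % (list(dotted_names),))


# Stand-in for an older TensorFlow module that only has the contrib function.
# This fake object is an assumption so the sketch runs without TensorFlow.
fake_old_tf = SimpleNamespace(
    train=SimpleNamespace(),
    contrib=SimpleNamespace(
        framework=SimpleNamespace(get_or_create_global_step=lambda: "contrib step")
    ),
)

get_step = resolve_first(
    fake_old_tf,
    [
        "train.get_or_create_global_step",
        "contrib.framework.get_or_create_global_step",
    ],
)
print(get_step())
```

With a real `tensorflow` module passed as `root`, the same call would prefer tf.train and fall back to tf.contrib.framework, which avoids editing line numbers that shift between releases.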
Hi @glc12125 , thank you for your help. I tried everything you mentioned above and got this error:
Are you familiar with this error? |
Hey @ashizhao, I believe you need to replace the YOUR_GCS_BUCKET with your actual bucket name in the pipeline config file. |
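A quick way to catch a forgotten placeholder before submitting a job is to scan the pipeline config text for it. This is a sketch only: the sample config fragment and its paths are invented for illustration, and `find_placeholder_lines` is a hypothetical helper.

```python
def find_placeholder_lines(config_text, placeholder="YOUR_GCS_BUCKET"):
    """Return (line_number, line) pairs that still contain the placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if placeholder in line
    ]


# Illustrative config fragment; the paths are assumptions.
sample_config = """
fine_tune_checkpoint: "gs://YOUR_GCS_BUCKET/data/model.ckpt"
input_path: "gs://my-real-bucket/data/pet_train.record"
"""

for number, line in find_placeholder_lines(sample_config):
    print("line %d still has a placeholder: %s" % (number, line))
```

An empty result means every occurrence of the placeholder has been replaced; it does not prove that the bucket or the files exist.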
@glc12125 thanks for summarising how to fix these issues. I also succeeded in fixing the not-up-to-date issue by adding 'tensorflow>=1.4' to the required packages in setup.py. |
Thanks for the answers, I was able to solve the issue as well. Nonetheless, I ended up buying myself a GTX 1080 Ti. Training locally is way better. (Sent by email in reply to EffePen's comment of 1 Dec 2017.) |
I think we are going to try out Amazon's Sagemaker, instead of dealing with all of these dependency issues with training and model version hosting. Somehow it supports Tensorflow > 1.2 and Google Cloud does not. |
FWIW TF 1.4 is now available on CloudML Engine. |
Thanks @glc12125 for your help. Will post the current solution for this. Now that TF 1.4 is available on CloudML Engine, the following changes fix this problem: Make sure your yaml version is 1.4, eg:
Change setup.py to the following:
In object_detection/utils/visualization_utils.py, line 24 (before import matplotlib.pyplot as plt) add:
In line 184 of object_detection/evaluator.py, change
to
Finally, in line 103 of object_detection/builders/optimizer_builder.py, change
to
Hope this helps! |
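The yaml example from the first step is missing above. As a hedged sketch, the check below reads the `runtimeVersion` field from the job config text and compares it with the version the fix expects. The sample YAML is illustrative rather than the commenter's file, and the parsing uses a regular expression instead of a YAML library to stay dependency-free.

```python
import re


def read_runtime_version(yaml_text):
    """Return the runtimeVersion value found in the YAML text, or None."""
    match = re.search(
        r"""^\s*runtimeVersion\s*:\s*["']?([0-9.]+)["']?\s*$""",
        yaml_text,
        re.MULTILINE,
    )
    return match.group(1) if match else None


def version_tuple(version):
    """Turn '1.4' into (1, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Illustrative job config; the values are assumptions.
sample_yaml = """
trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
"""

found = read_runtime_version(sample_yaml)
if found is None or version_tuple(found) < (1, 4):
    print("runtimeVersion is missing or older than 1.4")
else:
    print("runtimeVersion is %s" % found)
```

Comparing tuples rather than strings avoids the trap where "1.10" sorts before "1.4" alphabetically.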
It's working great now with those fixes. Thanks! |
Followed all these instructions, yet the same error appears. Refer to the log:
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 254, in train
    var_map, train_config.fine_tune_checkpoint))
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/variables_helper.py", line 122, in get_variables_available_in_checkpoint
    ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 150, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://detect_animal/model.ckpt
The replicas worker 0, worker 1, worker 2 and worker 4 exited with the same status and an identical traceback.
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=267395040285&resource=ml_job%2Fjob_id%2Fdetect_animal_2&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22detect_animal_2%22 Training input: |
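The NotFoundError above says that nothing matched the checkpoint prefix gs://detect_animal/model.ckpt. A TensorFlow checkpoint path is a prefix shared by several files (for example .index, .meta and .data shards) rather than a single file, so a mismatch in the folder part of the prefix is enough to trigger it. The sketch below illustrates that matching rule on a made-up bucket listing; the thread does not say where the checkpoint files actually were, and the helper name is hypothetical.

```python
def files_matching_checkpoint(prefix, object_names):
    """Return object names that look like files of the checkpoint `prefix`."""
    return sorted(
        name
        for name in object_names
        if name == prefix or name.startswith(prefix + ".")
    )


# Hypothetical bucket listing; these object names are made up for illustration.
listing = [
    "gs://detect_animal/data/model.ckpt.index",
    "gs://detect_animal/data/model.ckpt.meta",
    "gs://detect_animal/data/model.ckpt.data-00000-of-00001",
    "gs://detect_animal/data/pipeline.config",
]

at_root = files_matching_checkpoint("gs://detect_animal/model.ckpt", listing)
in_data = files_matching_checkpoint("gs://detect_animal/data/model.ckpt", listing)
print("matches at bucket root:", len(at_root))
print("matches under data/:", len(in_data))
```

In this invented listing the prefix at the bucket root matches nothing while the prefix under data/ matches three files, which is the kind of discrepancy worth ruling out when fine_tune_checkpoint points at a bucket path.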
Thanks to @andersskog's fixes, it worked. Hopefully the repo will get updated soon? |
I've made the fixes posted by @andersskog, but I still get errors similar to those reported by @ddurgaprasad. Any ideas? Edit: Below are the tracebacks from my logs ...
Job submitted with runtime version 1.2
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 106, in main
    overwrite=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 384, in copy
    compat.as_bytes(oldpath), compat.as_bytes(newpath), overwrite, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
FailedPreconditionError: .
Job submitted with runtime version 1.4
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 106, in main
    overwrite=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 385, in copy
    compat.as_bytes(oldpath), compat.as_bytes(newpath), overwrite, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
NotFoundError: ; No such file or directory |
Seems like an unrelated error, @davidblumntcgeo? It's probably worth mentioning here, for the folks suggesting upgrading to the 1.4 runtime, that this sort of does work. However, the docs aren't updated because the Object Detection API really doesn't support 1.4 yet, which is why 1.2 is still in the documentation. There are other issues you may run into on 1.4, at least for now, so you may need to follow the instructions based on 1.2. Hopefully this saves someone some effort: I found any model besides SSD unusable in 1.4, and even SSD models would flake out occasionally after hours of training. |
Thanks, @jhovell . I was getting the failure to load matplotlib error up until I made the fixes recommended in this issue. After making the fixes, the errors changed to the ones I posted, similar to the other user. It sounds, though, that you’re saying the errors are unrelated, and the matplotlib issue was just masking this second issue. |
@andersskog |
@sainttelant it worked on TF 1.4. TF1.6 is not currently supported (https://cloud.google.com/ml-engine/docs/runtime-version-list) |
@andersskog, hi, I tried uninstalling TF 1.6 and reinstalling version 1.4, but the same error is still there. I checked the places you mentioned above in order to fix them, but found that they had all been fixed already, so I still couldn't figure out how to address the error. |
@andersskog, thanks for sharing the link to the list of compatible runtime versions, I hope it will be helpful. |
@andersskog, I tried to deploy training following the page you mentioned (https://cloud.google.com/ml-engine/docs/runtime-version-list) and switched to Python 3.5, but the errors are the same:
The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 51, in <module>
    from object_detection.builders import model_builder
  File "/root/.local/lib/python3.5/site-packages/object_detection/builders/model_builder.py", line 29, in <module>
    from object_detection.meta_architectures import ssd_meta_arch
  File "/root/.local/lib/python3.5/site-packages/object_detection/meta_architectures/ssd_meta_arch.py", line 31, in <module>
    from object_detection.utils import visualization_utils
  File "/root/.local/lib/python3.5/site-packages/object_detection/utils/visualization_utils.py", line 25, in <module>
    import matplotlib; matplotlib.use('Agg')  # pylint: disable=multiple-statements
ImportError: No module named 'matplotlib'
The replicas ps 1 and ps 2 exited with the same status and an identical traceback.
To find out more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?project=741093659524&resource=ml_job%2Fjob_id%2Fgoldedseito_shaka_object_detection_1521171159&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22goldedseito_shaka_object_detection_1521171159%22 |
It worked: the 'matplotlib' error disappeared after I removed the dist directories in the research folder and the slim folder and ran python setup.py sdist again. But another error popped up:
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/root/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 235, in train
    train_config.prefetch_queue_capacity, data_augmentation_options)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 59, in create_input_queue
    tensor_dict = create_tensor_dict_fn()
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 120, in get_next
    dataset_builder.build(config)).get_next()
  File "/root/.local/lib/python3.5/site-packages/object_detection/builders/dataset_builder.py", line 138, in build
    label_map_proto_file=label_map_proto_file)
  File "/root/.local/lib/python3.5/site-packages/object_detection/data_decoders/tf_example_decoder.py", line 110, in __init__
    dct_method=dct_method),
TypeError: __init__() got an unexpected keyword argument 'dct_method'
The replicas worker 0, worker 1, worker 2, worker 3 and worker 4 exited with the same status and an identical traceback.
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=741093659524&resource=ml_job%2Fjob_id%2Fgoldedseito_shaka_object_detection_1521181970&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22goldedseito_shaka_object_detection_1521181970%22 |
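The TypeError above means that the callable invoked at tf_example_decoder.py line 110 does not accept a `dct_method` keyword. One possible cause, which the thread does not confirm, is that the installed TensorFlow is older than the Object Detection API code expects. The sketch below shows a generic way to test whether a callable accepts a keyword before calling it; the two decoder functions are hypothetical stand-ins, not the real TensorFlow classes.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    parameters = inspect.signature(func).parameters
    if name in parameters:
        kind = parameters[name].kind
        return kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs parameter accepts any keyword name.
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in parameters.values()
    )


# Two stand-in decoders (hypothetical, for illustration only).
def old_image_decoder(image_key=None, format_key=None, channels=3):
    return "old"


def new_image_decoder(image_key=None, format_key=None, channels=3, dct_method=""):
    return "new"


for decoder in (old_image_decoder, new_image_decoder):
    print(decoder.__name__, accepts_keyword(decoder, "dct_method"))
```

Running such a check against the real class in the deployed environment would show whether the keyword is available there, which helps separate a packaging problem from a version mismatch.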
Same problem as @sainttelant |
Running the latest ODAPI release (1.8) with TF runtime 1.6, and after implementing the fixes described in this post by @andersskog, I am now able to run faster_rcnn_nas on ML Engine (prior to the latest release, I could only run SSD). However, this only works without GPU acceleration. There are bugs that prevent this model from running on ML Engine with GPUs. (SSD runs with GPUs, no problem.) I haven't tried TPUs yet, nor have I tried other Faster RCNN models. Edit: I spoke too soon. I had performed a relatively small run, which went to completion. However, a longer run using faster_rcnn_nas errored out with an "UnavailableError: OS Error". I suspect it may be the error referenced by @jhovell and reported in issue #3071 |
How can I edit setup.py when the files are on Google Cloud? When I edit the file using Cloud Shell, it says permission denied. |
I was following along with the fixes suggested by @andersskog, and ran into the following error: I have no idea where to even begin looking for the error, any suggestions? |
Thanks to @andersskog. I think now we just need to set the version in the yaml file to 1.6 and change setup.py as in andersskog's code. |
I have tried the above with versions 1.2, 1.6 and 1.8 and updated setup.py, but I still get an error regarding dependencies. Does anyone have any ideas? Replicas ps 0, ps 1 and ps 2 all exited with the same traceback:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 52, in <module>
    from object_detection.builders import model_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/model_builder.py", line 18, in <module>
    from object_detection.builders import box_coder_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/box_coder_builder.py", line 21, in <module>
    from object_detection.protos import box_coder_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/box_coder_pb2.py", line 28, in <module>
    dependencies=[object__detection_dot_protos_dot_faster__rcnn__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_keypoint__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_mean__stddev__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_square__box__coder__pb2.DESCRIPTOR,])
  File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 829, in __new__
    return _message.default_pool.AddSerializedFile(serialized_pb)
TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "object_detection/protos/box_coder.proto":
  object_detection/protos/box_coder.proto: Import "object_detection/protos/keypoint_box_coder.proto" has not been loaded.
  object_detection.protos.BoxCoder.keypoint_box_coder: "object_detection.protos.KeypointBoxCoder" seems to be defined in "keypoint_box_coder.proto", which is not imported by "object_detection/protos/box_coder.proto". To use it here, please add the necessary import.

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=36123659232&resource=ml_job%2Fjob_id%2Fgrewe_object_detection_5&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22grewe_object_detection_5%22 |
Hi There, |
Following the procedure to run training on Google ML Engine, I encounter a problem compiling the project. Here is a snippet from the job log:
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/eval.py", line 50, in <module>
    from object_detection import evaluator
  File "/root/.local/lib/python2.7/site-packages/object_detection/evaluator.py", line 24, in <module>
    from object_detection import eval_util
  File "/root/.local/lib/python2.7/site-packages/object_detection/eval_util.py", line 29, in <module>
    from object_detection.utils import visualization_utils as vis_utils
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/visualization_utils.py", line 24, in <module>
    import matplotlib.pyplot as plt
ImportError: No module named matplotlib.pyplot
Evaluation job code:
gcloud ml-engine jobs submit training object_detection_eval_`date +%s` \
    --job-dir=gs://${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.eval \
    --region us-central1 \
    --scale-tier BASIC_GPU \
    -- \
    --checkpoint_dir=gs://${TRAIN_DIR} \
    --eval_dir=gs://${EVAL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
Training input:
Any solution for this?
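One way to narrow down an ImportError like the one above is to check which modules the job actually needs can be imported in a given environment. The sketch below is only a diagnostic aid under assumptions: the helper name `missing_modules` is made up for this example, and the module list in the usage section is a guess at what the evaluation code imports, not an authoritative list.

```python
import importlib


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers "No module named ..." as well as import-time failures
            # raised as ImportError by the module itself.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical list of modules to probe; adjust to the packages in use.
    print(missing_modules(["matplotlib.pyplot", "PIL"]))
```

Running this locally only shows what is missing on the local machine; the remote workers install just what the submitted packages declare as dependencies, so a module can be present locally and still be missing in the job.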