Error running training in google ML engine - No matplotlib.pyplot module #2739

sagi44222 · 2017-11-08T11:32:18Z

Following the procedure to run training on Google ML Engine, I encountered a problem compiling the project. Here is a snippet from the job log:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/eval.py", line 50, in <module>
    from object_detection import evaluator
  File "/root/.local/lib/python2.7/site-packages/object_detection/evaluator.py", line 24, in <module>
    from object_detection import eval_util
  File "/root/.local/lib/python2.7/site-packages/object_detection/eval_util.py", line 29, in <module>
    from object_detection.utils import visualization_utils as vis_utils
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/visualization_utils.py", line 24, in <module>
    import matplotlib.pyplot as plt
ImportError: No module named matplotlib.pyplot

Evaluation job code:
gcloud ml-engine jobs submit training object_detection_eval_`date +%s` \
    --job-dir=gs://${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.eval \
    --region us-central1 \
    --scale-tier BASIC_GPU \
    -- \
    --checkpoint_dir=gs://${TRAIN_DIR} \
    --eval_dir=gs://${EVAL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}

Training input:

{
  "scaleTier": "BASIC_GPU",
  "packageUris": [
    "gs://TTT/packages/9a42ca0914663739bd0e7eef559ea1e821bdd82971d3a4d3c0a9fc427fc982f1/object_detection-0.1.tar.gz",
    "gs://TTT/packages/9a42ca0914663739bd0e7eef559ea1e821bdd82971d3a4d3c0a9fc427fc982f1/slim-0.1.tar.gz"
  ],
  "pythonModule": "object_detection.eval",
  "args": [
    "--checkpoint_dir=gs://TTT/training",
    "--eval_dir=gs://TTT",
    "--pipeline_config_path=gs://TTT/ssd_mobilenet_v1_pets.config"
  ],
  "region": "asia-east1",
  "jobDir": "gs://TTT/"
}

Any solution for this?


glc12125 · 2017-11-08T17:49:41Z

I had exactly the same issue and posted in #2724. But unfortunately, they do not seem to care.

I managed to add matplotlib as a dependency in setup.py under the research folder, which is hacky; I am uncomfortable having to modify the source code to make it work. However, I ran into more errors after that.

angerson · 2017-11-08T20:41:11Z

@derekjchow This looks like a problem with the Pets Tutorial. Can you take a look?

andrew-veresov · 2017-11-09T06:59:27Z

Why is matplotlib needed for training?
While we can install matplotlib using the setup.py script, python-tk cannot be installed this way on Google ML Engine.

andrew-veresov · 2017-11-09T07:32:41Z

The ssd_meta_arch.py imports the visualization_utils and uses the add_cdf_image_summary function from it to generate images of anchor classification losses.
Shouldn't we just add positive/negative losses to the summary and render images during summary visualization instead?
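
One way to keep training independent of plotting, sketched here as an assumption and not as anything the repository does: import the plotting module lazily and tolerate its absence. The helper name `optional_import` is hypothetical.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Hypothetical usage inside a visualization helper:
#   plt = optional_import('matplotlib.pyplot')
#   if plt is None:
#       skip the image summary instead of failing the whole job
```

With this pattern only the code paths that actually draw would need matplotlib; whether that fits ssd_meta_arch.py is a design question for the maintainers.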

tombstone · 2017-11-10T16:57:00Z

Apologies, we will take a look at this and respond shortly.

lozuwa · 2017-11-16T05:12:45Z

Same issue here, thanks.

ashizhao · 2017-11-16T06:45:01Z

Having the same issue with pet's retrainer tutorial.

floft · 2017-11-16T15:16:41Z

I was having a similar issue: whenever I tried importing matplotlib, it would give some _tkinter error on the cluster I was training on. I installed matplotlib with pip and then, as a temporary fix, added the following in research/object_detection/utils/visualization_utils.py before import matplotlib.pyplot as plt:

import os
import matplotlib as mpl
# Fall back to the non-interactive Agg backend only when no display is set.
if os.environ.get('DISPLAY', '') == '':
    print('no display found. Using non-interactive Agg backend')
    mpl.use('Agg')

which is taken from https://stackoverflow.com/a/40931739/2698494. I have not tested it on Google's Cloud ML though.
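
The same idea can be split into a small decision function and a lazy import, which makes the check testable on machines without matplotlib. This is a sketch only; `pick_backend` and `configure_matplotlib` are hypothetical names, not part of the object detection code.

```python
import os


def pick_backend(environ=None):
    """Return 'Agg' when no X display is configured, else None (keep default)."""
    env = os.environ if environ is None else environ
    if env.get('DISPLAY', '') == '':
        return 'Agg'
    return None


def configure_matplotlib(environ=None):
    """Select the backend; must run before matplotlib.pyplot is imported."""
    backend = pick_backend(environ)
    if backend is not None:
        import matplotlib  # imported lazily so the check itself needs nothing
        matplotlib.use(backend)
    return backend
```

Calling `configure_matplotlib()` at the top of visualization_utils.py would mirror the temporary fix above.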

law826 · 2017-11-17T16:29:12Z

I have tried it on Google's Cloud ML. So I got around the matplotlib issue as @glc12125 suggested and around the tkinter error as @floft suggested. I am not sure if my problem is specific to running transfer learning on the new faster_rcnn_nas_coco model, but I get the following downstream error (after correcting for #2668 as well):

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 228, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/root/.local/lib/python2.7/site-packages/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 165, in _create_losses
    prediction_dict = detection_model.predict(images)
  File "/root/.local/lib/python2.7/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 531, in predict
    image_shape) = self._extract_rpn_feature_maps(preprocessed_inputs)
  File "/root/.local/lib/python2.7/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 685, in _extract_rpn_feature_maps
    preprocessed_inputs, scope=self.first_stage_feature_extractor_scope)
  File "/root/.local/lib/python2.7/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 134, in extract_proposal_features
    return self._extract_proposal_features(preprocessed_inputs, scope)
  File "/root/.local/lib/python2.7/site-packages/object_detection/models/faster_rcnn_nas_feature_extractor.py", line 188, in _extract_proposal_features
    final_endpoint='Cell_11')
  File "/root/.local/lib/python2.7/site-packages/nets/nasnet/nasnet.py", line 378, in build_nasnet_large
    hparams = _large_imagenet_config(is_training=is_training)
  File "/root/.local/lib/python2.7/site-packages/nets/nasnet/nasnet.py", line 70, in _large_imagenet_config
    return tf.contrib.training.HParams(
AttributeError: 'module' object has no attribute 'HParams'

Replica workers 0, 1, 2, 3 and 4 exited with the same status and an identical traceback.

staturecrane · 2017-11-20T01:04:09Z

Was this issue ever addressed? I am unable to run a model I could previously run with an older version of the object_detection API due to the matplotlib and python-tk dependency issue.

ashizhao · 2017-11-21T17:29:03Z

This blog mentions some changes in setup.py that address compatibility issues with Tensorflow and GCP: https://medium.com/google-cloud/object-detection-tensorflow-and-google-cloud-platform-72e0a3f3bdd6

glc12125 · 2017-11-21T20:10:08Z

Hey ashizhao,

Thanks for pointing me in the right direction. I tried the "fix" for setup.py from the blog you mentioned and hit another issue, described in #2707, caused by the GCP runtime version being out of date. I followed the solution in #2707, which required changing the source code in both object_detection/evaluator.py (line 184) and object_detection/builders/optimizer_builder.py (line 103) to use tf.contrib.framework.get_or_create_global_step instead of tf.train.get_or_create_global_step. I finally got it working for both training and evaluation.

I guess the main issues are twofold:

1. GCP library dependencies (matplotlib and python-tk; I am not sure why we need them, though, since nothing needs to be visualised when running in the cloud).
2. The GCP TensorFlow version is not up to date.
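
A job could fail early with a clear message when the runtime's TensorFlow is older than the code expects. The following is a minimal sketch under that assumption; `version_tuple` and `meets_minimum` are hypothetical helpers, and the minimum of 1.4 is taken from later comments in this thread.

```python
def version_tuple(version):
    """Turn '1.4.0' into (1, 4, 0); a non-numeric suffix such as '-rc1' is dropped."""
    parts = []
    for piece in version.split('.'):
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == '':
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, required):
    """True if `installed` is at least `required`, comparing numeric parts only."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so that '1.4' equals '1.4.0'
    b += (0,) * (width - len(b))
    return a >= b


# Hypothetical use at the top of a training script:
#   import tensorflow as tf
#   if not meets_minimum(tf.__version__, '1.4'):
#       raise RuntimeError('TensorFlow %s is too old' % tf.__version__)
```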

ashizhao · 2017-11-28T00:05:14Z

Hi @glc12125 , thank you for your help. I tried everything you mentioned above and got this error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 254, in train
    var_map, train_config.fine_tune_checkpoint))
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/variables_helper.py", line 122, in get_variables_available_in_checkpoint
    ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 110, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://${YOUR_GCS_BUCKET}/data/model.ckpt: Not found: Error executing an HTTP request (HTTP response code 404, error code 0, error message '') when reading gs://${YOUR_GCS_BUCKET}/data

Are you familiar with this error?

glc12125 · 2017-11-28T03:07:58Z

Hey @ashizhao,

I believe you need to replace the YOUR_GCS_BUCKET with your actual bucket name in the pipeline config file.
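
A quick way to catch this before submitting a job is to scan the pipeline config text for shell-style placeholders that were never expanded. This is only a sketch; `unexpanded_placeholders` is a hypothetical helper, and `my-bucket` in the comment is a made-up bucket name.

```python
import re

# Matches ${NAME}, the form that appears literally in the error message above.
_PLACEHOLDER = re.compile(r'\$\{([A-Za-z_][A-Za-z0-9_]*)\}')


def unexpanded_placeholders(config_text):
    """Return the names of any ${NAME} placeholders still present in the text."""
    return _PLACEHOLDER.findall(config_text)


# A correctly edited line would contain a real bucket instead, for example
#   fine_tune_checkpoint: "gs://my-bucket/data/model.ckpt"
line = 'fine_tune_checkpoint: "gs://${YOUR_GCS_BUCKET}/data/model.ckpt"'
print(unexpanded_placeholders(line))
```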

EffePen · 2017-12-01T16:14:16Z

@glc12125 thanks for summarising how to fix these issues.
I also succeeded in fixing the out-of-date TensorFlow issue by adding 'tensorflow>=1.4' to the required packages in setup.py.

lozuwa · 2017-12-02T14:43:58Z

Thanks for the answers, I was able to solve the issue as well. Nonetheless, I ended up buying myself a GTX 1080 ti. Training locally is way better.


staturecrane · 2017-12-02T23:30:32Z

I think we are going to try out Amazon's Sagemaker, instead of dealing with all of these dependency issues with training and model version hosting. Somehow it supports Tensorflow > 1.2 and Google Cloud does not.

rhaertel80 · 2017-12-12T06:34:29Z

FWIW TF 1.4 is now available on CloudML Engine.

andersskog · 2017-12-12T22:18:44Z

Thanks @glc12125 for your help. Will post the current solution for this.

Now that TF 1.4 is available on CloudML Engine, the following changes fix this problem:

Make sure the runtimeVersion in your YAML config is 1.4, e.g.:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

Change setup.py to the following:

"""Setup script for object_detection."""

import logging
import subprocess
from setuptools import find_packages
from setuptools import setup
from setuptools.command.install import install


class CustomCommands(install):
    """Install command that first installs python-tk via apt-get."""

    def RunCustomCommand(self, command_list):
        # Run the command, capturing stdout and stderr together for the log.
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        logging.info('Log command output: %s', stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' %
                               (command_list, p.returncode))

    def run(self):
        # python-tk provides the Tkinter module behind the _tkinter import error.
        self.RunCustomCommand(['apt-get', 'update'])
        self.RunCustomCommand(['apt-get', 'install', '-y', 'python-tk'])
        install.run(self)


# Matplotlib is listed because visualization_utils.py imports it.
REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
    cmdclass={
        'install': CustomCommands,
    }
)
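
The RunCustomCommand helper above can be tried locally without touching apt-get. The sketch below is a standalone variant (the function name `run_command` is hypothetical) that returns the captured output and uses the current Python interpreter as a harmless stand-in command.

```python
import subprocess
import sys


def run_command(command_list):
    """Run a command, return its combined output, and raise if it fails."""
    p = subprocess.Popen(
        command_list,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT)
    stdout_data, _ = p.communicate()
    if p.returncode != 0:
        raise RuntimeError('Command %s failed: exit code: %s' %
                           (command_list, p.returncode))
    return stdout_data


# Stand-in for ['apt-get', 'update']: ask the interpreter to print one line.
output = run_command([sys.executable, '-c', 'print("ok")'])
```

Raising on a non-zero exit code is what makes a failed apt-get visible in the job log instead of surfacing later as an import error.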

In object_detection/utils/visualization_utils.py, line 24 (before import matplotlib.pyplot as plt) add:

import matplotlib
matplotlib.use('agg')

In line 184 of object_detection/evaluator.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Finally, in line 103 of object_detection/builders/optimizer_builder.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Hope this helps!
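
As an alternative to editing evaluator.py and optimizer_builder.py by hand, the two call sites could look the function up at runtime. This is a sketch under the assumption that one of the two locations named above exists; `resolve_global_step_fn` is a hypothetical helper, and the demo objects are stand-ins, not real TensorFlow modules.

```python
import types  # SimpleNamespace requires Python 3; the helper itself does not


def resolve_global_step_fn(tf_module):
    """Prefer tf.train.get_or_create_global_step, else the tf.contrib.framework one."""
    train = getattr(tf_module, 'train', None)
    fn = getattr(train, 'get_or_create_global_step', None)
    if fn is not None:
        return fn
    return tf_module.contrib.framework.get_or_create_global_step


# Stand-in for a runtime where only the contrib location exists.
older_tf = types.SimpleNamespace(
    train=types.SimpleNamespace(),
    contrib=types.SimpleNamespace(
        framework=types.SimpleNamespace(
            get_or_create_global_step=lambda: 'contrib')))

# Stand-in for a runtime where tf.train already has the function.
newer_tf = types.SimpleNamespace(
    train=types.SimpleNamespace(get_or_create_global_step=lambda: 'train'))

# Hypothetical use with real TensorFlow:
#   global_step = resolve_global_step_fn(tf)()
```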

law826 · 2017-12-15T13:02:38Z

It's working great now with those fixes. Thanks!

ddurgaprasad · 2018-01-21T07:56:42Z

Followed all these instructions, yet the same error is appearing. Refer to the log.

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 254, in train
    var_map, train_config.fine_tune_checkpoint))
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/variables_helper.py", line 122, in get_variables_available_in_checkpoint
    ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 150, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://detect_animal/model.ckpt

Replica workers 0, 1, 2 and 4 exited with the same status and an identical traceback.

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=267395040285&resource=ml_job%2Fjob_id%2Fdetect_animal_2&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22detect_animal_2%22

Training input:
{
"scaleTier": "CUSTOM",
"masterType": "standard_gpu",
"workerType": "standard_gpu",
"parameterServerType": "standard",
"workerCount": "5",
"parameterServerCount": "3",
"packageUris": [
"gs://detect_animal/train/packages/6e4b986b141806a38dc6a02983bca09c8e02016689e89d3d9ff361c62725da93/object_detection-0.1.tar.gz",
"gs://detect_animal/train/packages/6e4b986b141806a38dc6a02983bca09c8e02016689e89d3d9ff361c62725da93/slim-0.1.tar.gz"
],
"pythonModule": "object_detection.train",
"args": [
"\",
"--pipeline_config_path=gs://detect_animal/data/faster_rcnn_resnet101_pets.config",
"--train_dir=gs://detect_animal/train"
],
"region": "us-central1",
"runtimeVersion": "1.4",
"jobDir": "gs://detect_animal/train"
}
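
The args list in this training input contains a lone backslash entry, which may have come from a line-continuation character in the submit command; the thread does not establish whether it contributed to the failure. A small sketch for spotting such entries before submitting (the helper name `stray_args` is hypothetical):

```python
def stray_args(args):
    """Return the entries that do not look like --flag or --flag=value."""
    return [a for a in args if not a.startswith('--')]


# The args exactly as they appear in the training input above.
job_args = [
    '\\',
    '--pipeline_config_path=gs://detect_animal/data/faster_rcnn_resnet101_pets.config',
    '--train_dir=gs://detect_animal/train',
]
print(stray_args(job_args))
```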

aysark · 2018-02-06T22:34:15Z

Thanks to @andersskog's fixes, it worked.

Hopefully repo will get updated soon?

davidblumntcgeo · 2018-03-06T21:27:59Z

I've made the fixes posted by @andersskog, but I still get errors similar to those reported by @ddurgaprasad. Any ideas?

Edit: Below are the tracebacks from my logs ...

Job submitted with runtime version 1.2

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 106, in main
    overwrite=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 384, in copy
    compat.as_bytes(oldpath), compat.as_bytes(newpath), overwrite, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
FailedPreconditionError: .

Job submitted with runtime version 1.4

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 106, in main
    overwrite=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 385, in copy
    compat.as_bytes(oldpath), compat.as_bytes(newpath), overwrite, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
NotFoundError: ; No such file or directory

jhovell · 2018-03-06T22:11:37Z

Seems like an unrelated error, @davidblumntcgeo? "NotFoundError: ; No such file or directory" doesn't seem to have anything to do with matplotlib.

It's probably worth mentioning, for the folks suggesting an upgrade to the 1.4 runtime, that this sort of works, but the docs haven't been updated because the Object Detection API doesn't really support 1.4 yet, which is why 1.2 is still in the documentation. You may run into other issues on 1.4, at least for now, so you may need to follow the instructions based on 1.2. Hopefully this saves someone some effort: I found every model besides SSD unusable on 1.4, and even SSD models would flake out occasionally after hours of training.

davidblumntcgeo · 2018-03-06T22:35:43Z

Thanks, @jhovell . I was getting the failure to load matplotlib error up until I made the fixes recommended in this issue. After making the fixes, the errors changed to the ones I posted, similar to the other user. It sounds, though, that you’re saying the errors are unrelated, and the matplotlib issue was just masking this second issue.

sainttelant · 2018-03-15T11:30:49Z

@andersskog
Did it work with TF 1.4? Is TF 1.6 not supported on GCP?

andersskog · 2018-03-15T14:38:03Z

@sainttelant it worked on TF 1.4. TF1.6 is not currently supported (https://cloud.google.com/ml-engine/docs/runtime-version-list)

sainttelant · 2018-03-16T01:04:46Z

@andersskog, hi, I tried uninstalling TF 1.6 and reinstalling version 1.4, but the same error is still there. I checked the places you mentioned above and tried to fix them, but found they had all been fixed already, so I still couldn't figure out how to address the error.

sainttelant · 2018-03-16T01:09:01Z

@andersskog, thanks for sharing the link with the compatible runtime versions. I hope it will be helpful.

sainttelant · 2018-03-16T04:23:18Z

@andersskog, I tried to deploy training following the runtime version list you mentioned (https://cloud.google.com/ml-engine/docs/runtime-version-list) and switched to Python 3.5, but the errors are the same.

sainttelant · 2018-03-16T04:23:31Z

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 51, in <module>
    from object_detection.builders import model_builder
  File "/root/.local/lib/python3.5/site-packages/object_detection/builders/model_builder.py", line 29, in <module>
    from object_detection.meta_architectures import ssd_meta_arch
  File "/root/.local/lib/python3.5/site-packages/object_detection/meta_architectures/ssd_meta_arch.py", line 31, in <module>
    from object_detection.utils import visualization_utils
  File "/root/.local/lib/python3.5/site-packages/object_detection/utils/visualization_utils.py", line 25, in <module>
    import matplotlib; matplotlib.use('Agg')  # pylint: disable=multiple-statements
ImportError: No module named 'matplotlib'
The replica ps 1 exited with a non-zero status of 1. Termination reason: Error. [same traceback as ps 0]
The replica ps 2 exited with a non-zero status of 1. Termination reason: Error. [same traceback as ps 0]
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=741093659524&resource=ml_job%2Fjob_id%2Fgoldedseito_shaka_object_detection_1521171159&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22goldedseito_shaka_object_detection_1521171159%22

sainttelant · 2018-03-16T06:58:17Z

It worked: the 'matplotlib' error disappeared after I removed the dist directories in the research folder and the slim folder and ran python setup.py sdist again. But another error popped up:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/root/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 235, in train
    train_config.prefetch_queue_capacity, data_augmentation_options)
  File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 59, in create_input_queue
    tensor_dict = create_tensor_dict_fn()
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 120, in get_next
    dataset_builder.build(config)).get_next()
  File "/root/.local/lib/python3.5/site-packages/object_detection/builders/dataset_builder.py", line 138, in build
    label_map_proto_file=label_map_proto_file)
  File "/root/.local/lib/python3.5/site-packages/object_detection/data_decoders/tf_example_decoder.py", line 110, in __init__
    dct_method=dct_method),
TypeError: __init__() got an unexpected keyword argument 'dct_method'
The replica worker 0 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
The replica worker 1 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
The replica worker 2 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
The replica worker 3 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=741093659524&resource=ml_job%2Fjob_id%2Fgoldedseito_shaka_object_detection_1521181970&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22goldedseito_shaka_object_detection_1521181970%22
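
The TypeError above says the decoder constructor called from tf_example_decoder.py does not take a dct_method argument. One plausible reading, not confirmed anywhere in this thread, is that the object_detection code is newer than the TensorFlow build that provides the decoder class. Below is a minimal sketch of how to check whether a callable accepts a given keyword before relying on it. The two decoder classes are hypothetical stand-ins, not the real TensorFlow API.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    if param is not None and param.kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY):
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for an older and a newer decoder class.
class OldImageDecoder:
    def __init__(self, image_key=None, channels=3):
        self.image_key = image_key


class NewImageDecoder:
    def __init__(self, image_key=None, channels=3, dct_method=""):
        self.dct_method = dct_method


print(accepts_keyword(OldImageDecoder, "dct_method"))
print(accepts_keyword(NewImageDecoder, "dct_method"))
```

Running the same check against the real decoder class inside the training environment would show whether the installed TensorFlow matches what the object_detection package expects.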

giksa · 2018-03-17T20:48:13Z

Same problem as @sainttelant.
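
For anyone hitting the same sequence: the step that cleared the matplotlib error above was deleting the old dist directories before running python setup.py sdist again, so that the tarballs passed to --packages are rebuilt from the edited setup.py. Here is a minimal sketch of that cleanup, assuming it is run from the research directory with slim as a subdirectory. The function names are made up for illustration.

```python
import os
import shutil
import subprocess
import sys


def remove_stale_dists(package_roots):
    """Delete the dist/ directory under each root; return the paths removed."""
    removed = []
    for root in package_roots:
        dist_dir = os.path.join(root, "dist")
        if os.path.isdir(dist_dir):
            shutil.rmtree(dist_dir)
            removed.append(dist_dir)
    return removed


def rebuild_sdist(root):
    """Run `python setup.py sdist` inside `root` (expects a setup.py there)."""
    subprocess.check_call([sys.executable, "setup.py", "sdist"], cwd=root)


# Example usage, from the research directory (layout assumed):
#   remove_stale_dists([".", "slim"])
#   rebuild_sdist(".")
#   rebuild_sdist("slim")
```

Nothing is deleted until remove_stale_dists is called, and it only touches directories literally named dist under the roots you pass in.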

davidblumntcgeo · 2018-04-05T00:04:45Z

Running the latest ODAPI release (1.8) with TF runtime 1.6, and after implementing the fixes described in this post by @andersskog, I am now able to run faster_rcnn_nas on ML Engine (prior to the latest release, I could only run SSD). However, this only works without GPU acceleration. There are bugs that prevent this model from running on ML Engine with GPUs. (SSD runs with GPUs, no problem.) I haven't tried TPUs yet, nor have I tried other Faster RCNN models.

Edit: I spoke too soon. I had performed a relatively small run, which went to completion. However, a longer run using faster_rcnn_nas errored out with an "UnavailableError: OS Error". I suspect it may be the error referenced by @jhovell and reported in issue #3071.

KumailHussain · 2018-04-20T07:38:24Z

How do I edit setup.py when the files are on Google Cloud? When I edit the file using Cloud Shell, it says permission denied.

ghost · 2018-05-06T12:45:18Z

I was following along with the fixes suggested by @andersskog, and ran into the following error:
The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 235, in train
    train_config.prefetch_queue_capacity, data_augmentation_options)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue
    tensor_dict = create_tensor_dict_fn()
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next
    dataset_builder.build(config)).get_next()
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 162, in build
    functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000),
AttributeError: 'module' object has no attribute 'data'
The replica worker 0 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
The replica worker 1 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
The replica worker 2 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
The replica worker 3 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. [same traceback as master 0]
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1001770455422&resource=ml_job%2Fjob_id%2Fobject_detection_thermal_1525609335&advancedFilter=resource.type

I have no idea where to even begin looking for the error. Any suggestions?
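
One place to start: the AttributeError says the tf module has no data attribute, and tf.data only became part of the core TensorFlow API in 1.4 (before that it lived under tf.contrib.data). The runtime version this job was submitted with isn't shown, so treat the following as a guess at the cause rather than a diagnosis. Here is a minimal sketch of the version comparison, with helper names made up for illustration.

```python
def parse_version(text):
    """Turn a plain dotted version such as '1.4' or '1.4.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# tf.data joined the core TensorFlow API in 1.4.
TF_DATA_MIN_VERSION = (1, 4)


def has_tf_data(runtime_version):
    """Return True if a runtime with this TensorFlow version exposes tf.data."""
    return parse_version(runtime_version) >= TF_DATA_MIN_VERSION


for version in ("1.2", "1.4", "1.6"):
    print(version, has_tf_data(version))
```

If the job's runtimeVersion is below 1.4, code that calls tf.data.TFRecordDataset would be expected to fail this way. The helper only handles plain dotted versions, not suffixes such as release-candidate tags.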

sooonwoo · 2018-05-09T10:50:06Z

Thanks to @andersskog.

I think now we just need to change the yaml file to version 1.6 and modify setup.py as in andersskog's code
(see https://cloud.google.com/ml-engine/docs/runtime-version-list).
The rest has already been modified.
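
After editing setup.py and rebuilding, it can help to confirm that the tarball you upload really declares the extra dependency. Here is a minimal sketch, assuming the package is built with setuptools, which records install_requires in an .egg-info/requires.txt file inside the sdist. The function names are made up for illustration.

```python
import tarfile


def sdist_requirements(tarball_path):
    """Return the requirement lines recorded in an sdist's requires.txt."""
    with tarfile.open(tarball_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.name.endswith(".egg-info/requires.txt"):
                text = archive.extractfile(member).read().decode("utf-8")
                return [line.strip() for line in text.splitlines() if line.strip()]
    # No requires.txt means no install_requires were recorded.
    return []


def declares(requirements, package):
    """Case-insensitive check that some requirement line names `package`."""
    package = package.lower()
    return any(line.lower().startswith(package) for line in requirements)


# Example usage (path taken from the submit command earlier in the thread):
#   reqs = sdist_requirements("dist/object_detection-0.1.tar.gz")
#   declares(reqs, "matplotlib")
```

If declares returns False for matplotlib, the tarball was probably built before setup.py was edited, or an old dist directory is still being picked up.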

grewe · 2018-05-21T04:22:20Z

I have tried the above with versions 1.2, 1.6 and 1.8 and updated setup.py, and I still get an error regarding dependencies. Does anyone have any ideas?

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "main", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 52, in from object_detection.builders import model_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/model_builder.py", line 18, in from object_detection.builders import box_coder_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/box_coder_builder.py", line 21, in from object_detection.protos import box_coder_pb2 File "/root/.local/lib/python2.7/site-packages/object_detection/protos/box_coder_pb2.py", line 28, in dependencies=[object__detection_dot_protos_dot_faster__rcnn__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_keypoint__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_mean__stddev__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_square__box__coder__pb2.DESCRIPTOR,]) File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 829, in new return _message.default_pool.AddSerializedFile(serialized_pb) TypeError: Couldn't build proto file into descriptor pool! Invalid proto descriptor for file "object_detection/protos/box_coder.proto": object_detection/protos/box_coder.proto: Import "object_detection/protos/keypoint_box_coder.proto" has not been loaded. object_detection.protos.BoxCoder.keypoint_box_coder: "object_detection.protos.KeypointBoxCoder" seems to be defined in "keypoint_box_coder.proto", which is not imported by "object_detection/protos/box_coder.proto". To use it here, please add the necessary import. The replica ps 1 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "main", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 52, in from object_detection.builders import model_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/model_builder.py", line 18, in from object_detection.builders import box_coder_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/box_coder_builder.py", line 21, in from object_detection.protos import box_coder_pb2 File "/root/.local/lib/python2.7/site-packages/object_detection/protos/box_coder_pb2.py", line 28, in dependencies=[object__detection_dot_protos_dot_faster__rcnn__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_keypoint__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_mean__stddev__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_square__box__coder__pb2.DESCRIPTOR,]) File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 829, in new return _message.default_pool.AddSerializedFile(serialized_pb) TypeError: Couldn't build proto file into descriptor pool! Invalid proto descriptor for file "object_detection/protos/box_coder.proto": object_detection/protos/box_coder.proto: Import "object_detection/protos/keypoint_box_coder.proto" has not been loaded. object_detection.protos.BoxCoder.keypoint_box_coder: "object_detection.protos.KeypointBoxCoder" seems to be defined in "keypoint_box_coder.proto", which is not imported by "object_detection/protos/box_coder.proto". To use it here, please add the necessary import. The replica ps 2 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "main", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 52, in from object_detection.builders import model_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/model_builder.py", line 18, in from object_detection.builders import box_coder_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/box_coder_builder.py", line 21, in from object_detection.protos import box_coder_pb2 File "/root/.local/lib/python2.7/site-packages/object_detection/protos/box_coder_pb2.py", line 28, in dependencies=[object__detection_dot_protos_dot_faster__rcnn__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_keypoint__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_mean__stddev__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_square__box__coder__pb2.DESCRIPTOR,]) File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 829, in new return _message.default_pool.AddSerializedFile(serialized_pb) TypeError: Couldn't build proto file into descriptor pool! Invalid proto descriptor for file "object_detection/protos/box_coder.proto": object_detection/protos/box_coder.proto: Import "object_detection/protos/keypoint_box_coder.proto" has not been loaded. object_detection.protos.BoxCoder.keypoint_box_coder: "object_detection.protos.KeypointBoxCoder" seems to be defined in "keypoint_box_coder.proto", which is not imported by "object_detection/protos/box_coder.proto". To use it here, please add the necessary import. 
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=36123659232&resource=ml_job%2Fjob_id%2Fgrewe_object_detection_5&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22grewe_object_detection_5%22
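The log above does not say why the import "has not been loaded". One possible cause, which is only an assumption here, is that the packaged tree is missing (or has stale) generated `*_pb2.py` files for some imported `.proto` files. A minimal sketch of a local check before building the tarball follows; the helper names `proto_imports` and `missing_generated_modules` are hypothetical and not part of the Object Detection API.

```python
import re
from pathlib import Path

# Matches `import "path/to/file.proto";` (optionally `import public` / `import weak`).
IMPORT_RE = re.compile(
    r'^\s*import\s+(?:public\s+|weak\s+)?"([^"]+)"\s*;', re.MULTILINE
)


def proto_imports(proto_text):
    """Return the paths named in the import statements of a .proto source."""
    return IMPORT_RE.findall(proto_text)


def missing_generated_modules(root, proto_rel_path):
    """List imported .proto files whose generated *_pb2.py is absent under root.

    `root` is the directory that import paths are relative to (hypothetically,
    the directory containing `object_detection/`).
    """
    root = Path(root)
    text = (root / proto_rel_path).read_text()
    missing = []
    for imported in proto_imports(text):
        # foo/bar.proto is compiled by protoc to foo/bar_pb2.py
        generated = root / (re.sub(r"\.proto$", "", imported) + "_pb2.py")
        if not generated.exists():
            missing.append(imported)
    return missing
```

For example, `missing_generated_modules(".", "object_detection/protos/box_coder.proto")` run from the directory containing `object_detection/` would list any imported proto without a compiled module. An empty list does not rule out other causes, such as a protobuf version mismatch on the workers.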

tensorflowbutler · 2020-01-29T23:32:43Z

Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.

angerson added the stat:awaiting model gardener and type:docs labels Nov 8, 2017

tombstone self-assigned this Nov 10, 2017

glc12125 mentioned this issue Nov 22, 2017

missing matplotlib.pyplot when following object_detection tutorial #2724

Closed

reedwm mentioned this issue Dec 4, 2017

ImportError: No module named 'tkinter' during testing the installation #2949

Closed

glarchev mentioned this issue Dec 27, 2017

Get UnavailableError when running object detection training on CloudML #3071

Closed

fabito pushed a commit to ciandt-d1/models that referenced this issue Jan 3, 2018

fix ssd training in CloudML: tensorflow#2739 (comment)

0b4a933

ddurgaprasad unassigned tombstone Jan 21, 2018

davidblumntcgeo mentioned this issue Mar 6, 2018

Error running Oxford Pets Tutorial on Google Cloud ML Engine #3291

Closed

sainttelant mentioned this issue Mar 15, 2018

no module named matplotlib in google cloud training #3607

Closed

tensorflowbutler removed the stat:awaiting model gardener label Apr 6, 2018

This was referenced May 21, 2018

Error training on Google ML #4325

Closed

Problem trying to run Object Detection API training on Google ML #4342

Closed

davidblumntcgeo mentioned this issue Nov 6, 2018

[object detection feature request]: use multiple gpu for training #1972

Closed

tensorflowbutler closed this as completed Feb 7, 2020
