Error running training in google ML engine - No matplotlib.pyplot module #2739

Closed
sagi44222 opened this issue Nov 8, 2017 · 39 comments

Comments

@sagi44222

Following the procedure for running a training job on Google ML Engine, I hit a problem when the job starts. Here is a snippet from the job log:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/eval.py", line 50, in <module>
    from object_detection import evaluator
  File "/root/.local/lib/python2.7/site-packages/object_detection/evaluator.py", line 24, in <module>
    from object_detection import eval_util
  File "/root/.local/lib/python2.7/site-packages/object_detection/eval_util.py", line 29, in <module>
    from object_detection.utils import visualization_utils as vis_utils
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/visualization_utils.py", line 24, in <module>
    import matplotlib.pyplot as plt
ImportError: No module named matplotlib.pyplot

Evaluation job code:
gcloud ml-engine jobs submit training object_detection_eval_`date +%s` \
    --job-dir=gs://${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.eval \
    --region us-central1 \
    --scale-tier BASIC_GPU \
    -- \
    --checkpoint_dir=gs://${TRAIN_DIR} \
    --eval_dir=gs://${EVAL_DIR} \
    --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}

Training input:

{
  "scaleTier": "BASIC_GPU",
  "packageUris": [
    "gs://TTT/packages/9a42ca0914663739bd0e7eef559ea1e821bdd82971d3a4d3c0a9fc427fc982f1/object_detection-0.1.tar.gz",
    "gs://TTT/packages/9a42ca0914663739bd0e7eef559ea1e821bdd82971d3a4d3c0a9fc427fc982f1/slim-0.1.tar.gz"
  ],
  "pythonModule": "object_detection.eval",
  "args": [
    "--checkpoint_dir=gs://TTT/training",
    "--eval_dir=gs://TTT",
    "--pipeline_config_path=gs://TTT/ssd_mobilenet_v1_pets.config"
  ],
  "region": "asia-east1",
  "jobDir": "gs://TTT/"
}

Any solution for this?

@glc12125

glc12125 commented Nov 8, 2017

I had exactly the same issue and posted in #2724. But unfortunately, they do not seem to care.

I managed to add matplotlib as a dependency in setup.py under the research folder. That feels hacky, since I would rather not modify the source code to make it work, and I ran into more errors after that.

@angerson

angerson commented Nov 8, 2017

@derekjchow This looks like a problem with the Pets Tutorial. Can you take a look?

@angerson added the stat:awaiting model gardener and type:docs labels Nov 8, 2017
@andrew-veresov

Why is matplotlib needed for training?
While matplotlib can be installed through the setup.py script, python-tk cannot be installed that way on Google ML Engine.

@andrew-veresov

andrew-veresov commented Nov 9, 2017

ssd_meta_arch.py imports visualization_utils and uses its add_cdf_image_summary function to generate images of the anchor classification losses.
Shouldn't we just add the positive/negative losses to the summary and render the images during summary visualization instead?

@tombstone self-assigned this Nov 10, 2017
@tombstone
Contributor

Apologies, we will take a look at this and respond shortly.

@lozuwa

lozuwa commented Nov 16, 2017

Same issue here, thanks.

@ashizhao

Having the same issue with the Pets retraining tutorial.

@floft

floft commented Nov 16, 2017

I was having a similar issue: whenever I tried importing matplotlib, it gave a _tkinter error on the cluster I was training on. I installed matplotlib with pip and then, as a temporary fix, added the following to research/object_detection/utils/visualization_utils.py before import matplotlib.pyplot as plt:

import os
import matplotlib as mpl
# Fall back to the non-interactive Agg backend when no X display is available.
if os.environ.get('DISPLAY','') == '':
    print('no display found. Using non-interactive Agg backend')
    mpl.use('Agg')

which is taken from https://stackoverflow.com/a/40931739/2698494. I have not tested it on Google's Cloud ML though.
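For anyone who wants to try the display check outside of visualization_utils.py, here is a minimal sketch of the same idea split into a testable decision function and a guarded apply step. The function names are hypothetical (they are not part of the Object Detection API), and the sketch assumes that an unset or empty DISPLAY variable means no GUI backend is usable.

```python
import os


def pick_matplotlib_backend(environ=None):
    """Return 'Agg' when no X display is configured, else None (keep the default)."""
    env = os.environ if environ is None else environ
    return 'Agg' if not env.get('DISPLAY') else None


def configure_matplotlib(environ=None):
    """Apply the chosen backend; must be called before importing matplotlib.pyplot."""
    backend = pick_matplotlib_backend(environ)
    try:
        import matplotlib
    except ImportError:
        return None  # matplotlib is not installed, so there is nothing to configure
    if backend is not None:
        matplotlib.use(backend)
    return backend
```

Keeping the decision separate from the matplotlib call means the logic can be checked on a machine that has neither a display nor matplotlib installed.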

@law826

law826 commented Nov 17, 2017

I have tried it on Google's Cloud ML. So I got around the matplotlib issue as @glc12125 suggested and around the tkinter error as @floft suggested. I am not sure if my problem is specific to running transfer learning on the new faster_rcnn_nas_coco model, but I get the following downstream error (after correcting for #2668 as well):

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 228, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/root/.local/lib/python2.7/site-packages/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 165, in _create_losses
    prediction_dict = detection_model.predict(images)
  File "/root/.local/lib/python2.7/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 531, in predict
    image_shape) = self._extract_rpn_feature_maps(preprocessed_inputs)
  File "/root/.local/lib/python2.7/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 685, in _extract_rpn_feature_maps
    preprocessed_inputs, scope=self.first_stage_feature_extractor_scope)
  File "/root/.local/lib/python2.7/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 134, in extract_proposal_features
    return self._extract_proposal_features(preprocessed_inputs, scope)
  File "/root/.local/lib/python2.7/site-packages/object_detection/models/faster_rcnn_nas_feature_extractor.py", line 188, in _extract_proposal_features
    final_endpoint='Cell_11')
  File "/root/.local/lib/python2.7/site-packages/nets/nasnet/nasnet.py", line 378, in build_nasnet_large
    hparams = _large_imagenet_config(is_training=is_training)
  File "/root/.local/lib/python2.7/site-packages/nets/nasnet/nasnet.py", line 70, in _large_imagenet_config
    return tf.contrib.training.HParams(
AttributeError: 'module' object has no attribute 'HParams'

Replica workers 0, 1, 2, 3, and 4 exited with the same status, termination reason, and traceback.

@staturecrane

staturecrane commented Nov 20, 2017

Was this issue ever addressed? I am unable to run a model I could previously run with an older version of the object_detection API due to the matplotlib and python-tk dependency issue.

@ashizhao

This blog post mentions some changes to setup.py that address compatibility issues between TensorFlow and GCP: https://medium.com/google-cloud/object-detection-tensorflow-and-google-cloud-platform-72e0a3f3bdd6

@glc12125

Hey ashizhao,

Thanks for pointing me in the right direction. I tried the setup.py "fix" from the blog you mentioned and then hit another issue, the outdated GCP runtime version described in #2707. I followed the solution from #2707, which again meant changing the source code: in both object_detection/evaluator.py (line 184) and object_detection/builders/optimizer_builder.py (line 103), use tf.contrib.framework.get_or_create_global_step instead of tf.train.get_or_create_global_step. With that I finally got both training and evaluation working.

I guess the main issues are twofold:

  1. GCP library dependencies (matplotlib and python-tk; I am not sure why we need them, since nothing is visualised when running in the cloud)
  2. The GCP TensorFlow version is not up to date.
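As an alternative to hand-editing each call site, the two locations of get_or_create_global_step named above could be tried in order at import time. The following is only a sketch of that lookup pattern: resolve_first is a hypothetical helper, and it is exercised here on a stand-in namespace object rather than on a real TensorFlow install, so treat it as an illustration and not as tested against any TensorFlow release.

```python
class _Namespace(object):
    """Tiny attribute bag used as a stand-in for a module tree."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)


def resolve_first(root, dotted_paths):
    """Return the first attribute reachable from `root` among `dotted_paths`."""
    for path in dotted_paths:
        obj = root
        try:
            for part in path.split('.'):
                obj = getattr(obj, part)
        except AttributeError:
            continue  # this location does not exist in this version
        return obj
    raise AttributeError('none of %r could be resolved' % (dotted_paths,))


# Stand-in for an older runtime where only the contrib location exists.
fake_tf = _Namespace(
    train=_Namespace(),
    contrib=_Namespace(
        framework=_Namespace(get_or_create_global_step=lambda: 'global_step')))

get_step = resolve_first(fake_tf, [
    'train.get_or_create_global_step',
    'contrib.framework.get_or_create_global_step',
])
```

With a real install, `fake_tf` would be replaced by the imported tensorflow module; the order of the list decides which location is preferred when both exist.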

@ashizhao

Hi @glc12125 , thank you for your help. I tried everything you mentioned above and got this error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 254, in train
    var_map, train_config.fine_tune_checkpoint))
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/variables_helper.py", line 122, in get_variables_available_in_checkpoint
    ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 110, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://${YOUR_GCS_BUCKET}/data/model.ckpt: Not found: Error executing an HTTP request (HTTP response code 404, error code 0, error message '') when reading gs://${YOUR_GCS_BUCKET}/data

Are you familiar with this error?

@glc12125

Hey @ashizhao,

I believe you need to replace ${YOUR_GCS_BUCKET} with your actual bucket name in the pipeline config file.
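A leftover placeholder like this is easy to miss until the job fails in the cloud, so it may help to scan the config text before uploading it. Below is a rough sketch, assuming placeholders always have the ${NAME} shape seen in the error above; the function name and the sample text are illustrative only and are not taken from the tutorial.

```python
import re

# Matches shell-style placeholders such as ${YOUR_GCS_BUCKET}.
PLACEHOLDER = re.compile(r'\$\{[A-Za-z_][A-Za-z0-9_]*\}')


def find_unfilled_placeholders(config_text):
    """Return (line_number, placeholder) pairs for every ${NAME} left in the text."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), 1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Illustrative config fragment: the first line still has the placeholder.
sample = (
    'fine_tune_checkpoint: "gs://${YOUR_GCS_BUCKET}/data/model.ckpt"\n'
    'label_map_path: "gs://my-bucket/data/label_map.pbtxt"\n'
)
leftovers = find_unfilled_placeholders(sample)
```

Running this over the local copy of the pipeline config and refusing to submit when the result is non-empty would catch this particular mistake early.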

@EffePen

EffePen commented Dec 1, 2017

@glc12125 thanks for summarising how to fix these issues.
I also managed to fix the out-of-date TensorFlow issue by adding 'tensorflow>=1.4' to the required packages in setup.py.
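To fail fast with a readable message when the runtime's TensorFlow is older than expected, a small version comparison can run at the top of the training module. This is a sketch under the assumption that only the first three numeric components matter; version_at_least is a hypothetical helper, not part of TensorFlow or the Object Detection API.

```python
import re


def version_at_least(version, minimum):
    """Compare dotted version strings numerically, e.g. '1.10.0' against '1.4'."""
    def parse(text):
        # Keep up to three leading numeric components and pad with zeros,
        # so '1.4' and '1.4.0' compare as equal and suffixes like 'rc1' are ignored.
        nums = [int(n) for n in re.findall(r'\d+', text)[:3]]
        return tuple(nums + [0] * (3 - len(nums)))
    return parse(version) >= parse(minimum)


# Typical use (left as a comment so the sketch runs without TensorFlow):
#   import tensorflow as tf
#   if not version_at_least(tf.__version__, '1.4'):
#       raise RuntimeError('TensorFlow >= 1.4 required, found %s' % tf.__version__)
```

Comparing integers rather than raw strings matters here because '1.10' sorts before '1.4' as text.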

@lozuwa

lozuwa commented Dec 2, 2017 via email

@staturecrane

I think we are going to try out Amazon's Sagemaker, instead of dealing with all of these dependency issues with training and model version hosting. Somehow it supports Tensorflow > 1.2 and Google Cloud does not.

@rhaertel80

FWIW TF 1.4 is now available on CloudML Engine.

@andersskog

Thanks @glc12125 for your help. Will post the current solution for this.

Now that TF 1.4 is available on CloudML Engine, the following changes fix this problem:

Make sure the runtimeVersion in your YAML config is 1.4, e.g.:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

Change setup.py to the following:

"""Setup script for object_detection."""

import logging
import subprocess

from setuptools import find_packages
from setuptools import setup
from setuptools.command.install import install


class CustomCommands(install):
    """Install command that first installs python-tk via apt-get."""

    def RunCustomCommand(self, command_list):
        # Run one system command, log its combined stdout/stderr, and
        # raise if it exits with a non-zero status.
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        logging.info('Log command output: %s', stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' %
                               (command_list, p.returncode))

    def run(self):
        # python-tk is a system package, so it cannot go in install_requires.
        self.RunCustomCommand(['apt-get', 'update'])
        self.RunCustomCommand(['apt-get', 'install', '-y', 'python-tk'])
        install.run(self)


REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
    cmdclass={
        'install': CustomCommands,
    },
)

In object_detection/utils/visualization_utils.py, line 24 (before import matplotlib.pyplot as plt) add:

import matplotlib
matplotlib.use('agg')

In line 184 of object_detection/evaluator.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Finally, in line 103 of object_detection/builders/optimizer_builder.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Hope this helps!
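The run-and-check pattern used by the custom install command in setup.py can be tried locally without apt-get. The sketch below is a standalone variant, not the code from setup.py: run_command is a hypothetical name, it returns the captured output instead of logging it, and it uses the current Python interpreter as a harmless command to run.

```python
import subprocess
import sys


def run_command(command_list):
    """Run a command, return its combined stdout/stderr, raise on non-zero exit."""
    proc = subprocess.Popen(
        command_list,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError('Command %s failed: exit code: %s' %
                           (command_list, proc.returncode))
    return output.decode('utf-8', 'replace')


# A command that succeeds: the interpreter printing a marker string.
ok_output = run_command([sys.executable, '-c', 'print("ok")'])
```

Raising on a non-zero exit code is what makes a failed apt-get step abort the package install instead of surfacing later as a missing module.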

@law826

law826 commented Dec 15, 2017

It's working great now with those fixes. Thanks!

@ddurgaprasad

Followed all these instructions, yet the same error appears. Refer to the log.

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 159, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 254, in train
    var_map, train_config.fine_tune_checkpoint))
  File "/root/.local/lib/python2.7/site-packages/object_detection/utils/variables_helper.py", line 122, in get_variables_available_in_checkpoint
    ckpt_reader = tf.train.NewCheckpointReader(checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 150, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://detect_animal/model.ckpt

Replica workers 0, 1, 2, and 4 exited with the same status, termination reason, and traceback.

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=267395040285&resource=ml_job%2Fjob_id%2Fdetect_animal_2&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22detect_animal_2%22

Training input:
{
"scaleTier": "CUSTOM",
"masterType": "standard_gpu",
"workerType": "standard_gpu",
"parameterServerType": "standard",
"workerCount": "5",
"parameterServerCount": "3",
"packageUris": [
"gs://detect_animal/train/packages/6e4b986b141806a38dc6a02983bca09c8e02016689e89d3d9ff361c62725da93/object_detection-0.1.tar.gz",
"gs://detect_animal/train/packages/6e4b986b141806a38dc6a02983bca09c8e02016689e89d3d9ff361c62725da93/slim-0.1.tar.gz"
],
"pythonModule": "object_detection.train",
"args": [
"\",
"--pipeline_config_path=gs://detect_animal/data/faster_rcnn_resnet101_pets.config",
"--train_dir=gs://detect_animal/train"
],
"region": "us-central1",
"runtimeVersion": "1.4",
"jobDir": "gs://detect_animal/train"
}

@aysark
Contributor

aysark commented Feb 6, 2018

Thanks to @andersskog's fixes, it worked.

Hopefully the repo will get updated soon?

@davidblumntcgeo

davidblumntcgeo commented Mar 6, 2018

I've made the fixes posted by @andersskog, but I still get errors similar to those reported by @ddurgaprasad. Any ideas?

Edit: Below are the tracebacks from my logs ...

Job submitted with runtime version 1.2

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 106, in main
    overwrite=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 384, in copy
    compat.as_bytes(oldpath), compat.as_bytes(newpath), overwrite, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
FailedPreconditionError: .

Job submitted with runtime version 1.4

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 106, in main
    overwrite=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 385, in copy
    compat.as_bytes(oldpath), compat.as_bytes(newpath), overwrite, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
NotFoundError: ; No such file or directory

@jhovell

jhovell commented Mar 6, 2018

Seems like an unrelated error, @davidblumntcgeo? "NotFoundError: ; No such file or directory" doesn't seem to have anything to do with matplotlib.

It's probably worth mentioning here, for the folks suggesting an upgrade to the 1.4 runtime, that this sort of works. The docs haven't been updated because the Object Detection API doesn't really support 1.4 yet, which is why 1.2 is still in the documentation. You may run into other issues on 1.4, at least for now, so you may need to follow the instructions based on 1.2. Hopefully this saves someone some effort: I found every model besides SSD unusable on 1.4, and even SSD models would flake out occasionally after hours of training.

@davidblumntcgeo

Thanks, @jhovell. I was getting the failure-to-load-matplotlib error up until I made the fixes recommended in this issue. After making the fixes, the errors changed to the ones I posted, similar to the other user's. It sounds, though, like you're saying the errors are unrelated, and the matplotlib issue was just masking this second issue.

@sainttelant

@andersskog
Did it work with TF 1.4? Is TF 1.6, as deployed on GCP, not supported?

@andersskog

@sainttelant It worked on TF 1.4. TF 1.6 is not currently supported (https://cloud.google.com/ml-engine/docs/runtime-version-list).

@sainttelant

@andersskog, hi, I tried uninstalling TF 1.6 and reinstalling version 1.4, but the same error is still there. I checked the places you mentioned above and tried to fix them, but found they had all already been fixed, so I still couldn't figure out how to address the error.

@sainttelant

@andersskog, thanks for sharing the link to the list of compatible runtime versions. I hope it will be helpful.

@sainttelant

@andersskog, I tried to deploy training following the page you mentioned (https://cloud.google.com/ml-engine/docs/runtime-version-list) and switched to Python 3.5, but the errors are the same.

@sainttelant

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 51, in <module>
    from object_detection.builders import model_builder
  File "/root/.local/lib/python3.5/site-packages/object_detection/builders/model_builder.py", line 29, in <module>
    from object_detection.meta_architectures import ssd_meta_arch
  File "/root/.local/lib/python3.5/site-packages/object_detection/meta_architectures/ssd_meta_arch.py", line 31, in <module>
    from object_detection.utils import visualization_utils
  File "/root/.local/lib/python3.5/site-packages/object_detection/utils/visualization_utils.py", line 25, in <module>
    import matplotlib; matplotlib.use('Agg')  # pylint: disable=multiple-statements
ImportError: No module named 'matplotlib'

The replica ps 1 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 51, in <module>
    from object_detection.builders import model_builder
  File "/root/.local/lib/python3.5/site-packages/object_detection/builders/model_builder.py", line 29, in <module>
    from object_detection.meta_architectures import ssd_meta_arch
  File "/root/.local/lib/python3.5/site-packages/object_detection/meta_architectures/ssd_meta_arch.py", line 31, in <module>
    from object_detection.utils import visualization_utils
  File "/root/.local/lib/python3.5/site-packages/object_detection/utils/visualization_utils.py", line 25, in <module>
    import matplotlib; matplotlib.use('Agg')  # pylint: disable=multiple-statements
ImportError: No module named 'matplotlib'

The replica ps 2 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 51, in <module>
    from object_detection.builders import model_builder
  File "/root/.local/lib/python3.5/site-packages/object_detection/builders/model_builder.py", line 29, in <module>
    from object_detection.meta_architectures import ssd_meta_arch
  File "/root/.local/lib/python3.5/site-packages/object_detection/meta_architectures/ssd_meta_arch.py", line 31, in <module>
    from object_detection.utils import visualization_utils
  File "/root/.local/lib/python3.5/site-packages/object_detection/utils/visualization_utils.py", line 25, in <module>
    import matplotlib; matplotlib.use('Agg')  # pylint: disable=multiple-statements
ImportError: No module named 'matplotlib'

To find out more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?project=741093659524&resource=ml_job%2Fjob_id%2Fgoldedseito_shaka_object_detection_1521171159&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22goldedseito_shaka_object_detection_1521171159%22

@sainttelant

It worked: the 'matplotlib' error disappeared after I removed the dist directories in the research and slim folders and ran python setup.py sdist again. But another error popped up:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "main", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in tf.app.run() File "/root/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run _sys.exit(main(argv)) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python3.5/site-packages/object_detection/builders/dataset_builder.py", line 138, in build label_map_proto_file=label_map_proto_file) File "/root/.local/lib/python3.5/site-packages/object_detection/data_decoders/tf_example_decoder.py", line 110, in init dct_method=dct_method), TypeError: init() got an unexpected keyword argument 'dct_method' The replica worker 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "main", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in tf.app.run() File "/root/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run _sys.exit(main(argv)) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python3.5/site-packages/object_detection/builders/dataset_builder.py", line 138, in build label_map_proto_file=label_map_proto_file) File "/root/.local/lib/python3.5/site-packages/object_detection/data_decoders/tf_example_decoder.py", line 110, in init dct_method=dct_method), TypeError: init() got an unexpected keyword argument 'dct_method' The replica worker 1 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "main", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in tf.app.run() File "/root/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run _sys.exit(main(argv)) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python3.5/site-packages/object_detection/builders/dataset_builder.py", line 138, in build label_map_proto_file=label_map_proto_file) File "/root/.local/lib/python3.5/site-packages/object_detection/data_decoders/tf_example_decoder.py", line 110, in init dct_method=dct_method), TypeError: init() got an unexpected keyword argument 'dct_method' The replica worker 2 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "main", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in tf.app.run() File "/root/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run _sys.exit(main(argv)) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python3.5/site-packages/object_detection/builders/dataset_builder.py", line 138, in build label_map_proto_file=label_map_proto_file) File "/root/.local/lib/python3.5/site-packages/object_detection/data_decoders/tf_example_decoder.py", line 110, in init dct_method=dct_method), TypeError: init() got an unexpected keyword argument 'dct_method' The replica worker 3 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "main", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in tf.app.run() File "/root/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run _sys.exit(main(argv)) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python3.5/site-packages/object_detection/builders/dataset_builder.py", line 138, in build label_map_proto_file=label_map_proto_file) File "/root/.local/lib/python3.5/site-packages/object_detection/data_decoders/tf_example_decoder.py", line 110, in init dct_method=dct_method), TypeError: init() got an unexpected keyword argument 'dct_method' The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "main", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 167, in tf.app.run() File "/root/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run _sys.exit(main(argv)) File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python3.5/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python3.5/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python3.5/site-packages/object_detection/builders/dataset_builder.py", line 138, in build label_map_proto_file=label_map_proto_file) File "/root/.local/lib/python3.5/site-packages/object_detection/data_decoders/tf_example_decoder.py", line 110, in init dct_method=dct_method), TypeError: init() got an unexpected keyword argument 'dct_method' To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=741093659524&resource=ml_job%2Fjob_id%2Fgoldedseito_shaka_object_detection_1521181970&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22goldedseito_shaka_object_detection_1521181970%22
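
The TypeError above means that the decoder constructor being called does not know the dct_method keyword, which usually points to a mismatch between the Object Detection code and the TensorFlow version installed on the replicas. The following sketch shows one way to check, before submitting a job, whether a callable accepts a given keyword. OldDecoder and NewDecoder are hypothetical stand-ins for illustration, not the real TensorFlow classes.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        # Positional-only parameters cannot be passed by keyword.
        return params[name].kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        )
    # A **kwargs parameter accepts any keyword.
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )


class OldDecoder:
    """Hypothetical decoder from an older release, without dct_method."""

    def __init__(self, image_key=None, channels=3):
        self.image_key = image_key
        self.channels = channels


class NewDecoder:
    """Hypothetical decoder from a newer release, with dct_method."""

    def __init__(self, image_key=None, channels=3, dct_method=""):
        self.image_key = image_key
        self.channels = channels
        self.dct_method = dct_method


print(accepts_keyword(OldDecoder, "dct_method"))
print(accepts_keyword(NewDecoder, "dct_method"))
```

Running the same check against the real decoder class in the environment that matches the job's runtime version would show whether the packaged code and the installed TensorFlow agree.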

@giksa

giksa commented Mar 17, 2018

Same problem as @sainttelant.

@davidblumntcgeo

davidblumntcgeo commented Apr 5, 2018

Running the latest ODAPI release (1.8) with TF runtime 1.6, and after implementing the fixes described in this post by @andersskog, I am now able to run faster_rcnn_nas on ML Engine (prior to the latest release, I could only run SSD). However, this only works without GPU acceleration. There are bugs that prevent this model from running on ML Engine with GPUs. (SSD runs with GPUs, no problem.) I haven't tried TPUs yet, nor have I tried other Faster RCNN models.

Edit: I spoke too soon. I had performed a relatively small run, which went to completion. However, a longer run using faster_rcnn_nas errored out with an "UnavailableError: OS Error". I suspect it may be the error referenced by @jhovell and reported in issue #3071.

@tensorflowbutler removed the stat:awaiting model gardener label Apr 6, 2018
@KumailHussain

How do I edit setup.py when the files are on Google Cloud? When I edit the file using Cloud Shell, it says permission denied.

@ghost

ghost commented May 6, 2018

I was following along with the fixes suggested by @andersskog, and ran into the following error:
The replica master 0 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module> tf.app.run() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run _sys.exit(main(_sys.argv[:1] + flags_passthrough)) File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 162, in build functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000), AttributeError: 'module' object has no attribute 'data' The replica worker 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module> tf.app.run() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run _sys.exit(main(_sys.argv[:1] + flags_passthrough)) File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 162, in build functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000), AttributeError: 'module' object has no attribute 'data' The replica worker 1 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module> tf.app.run() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run _sys.exit(main(_sys.argv[:1] + flags_passthrough)) File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 162, in build functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000), AttributeError: 'module' object has no attribute 'data' The replica worker 2 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module> tf.app.run() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run _sys.exit(main(_sys.argv[:1] + flags_passthrough)) File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 162, in build functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000), AttributeError: 'module' object has no attribute 'data' The replica worker 3 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module> tf.app.run() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run _sys.exit(main(_sys.argv[:1] + flags_passthrough)) File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 162, in build functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000), AttributeError: 'module' object has no attribute 'data' The replica worker 4 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 167, in <module> tf.app.run() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run _sys.exit(main(_sys.argv[:1] + flags_passthrough)) File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 163, in main worker_job_name, is_chief, FLAGS.train_dir) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 235, in train train_config.prefetch_queue_capacity, data_augmentation_options) File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 59, in create_input_queue tensor_dict = create_tensor_dict_fn() File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 120, in get_next dataset_builder.build(config)).get_next() File "/root/.local/lib/python2.7/site-packages/object_detection/builders/dataset_builder.py", line 162, in build functools.partial(tf.data.TFRecordDataset, buffer_size=8 * 1000 * 1000), AttributeError: 'module' object has no attribute 'data' To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1001770455422&resource=ml_job%2Fjob_id%2Fobject_detection_thermal_1525609335&advancedFilter=resource.type

I have no idea where to even begin looking for the error. Any suggestions?
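
The AttributeError above comes from tf.data, which only became part of the core tf namespace in TensorFlow 1.4, so the job was probably running on an older runtime (for example because no runtime version was set on the job). A minimal sketch of a pre-submit check follows; the function names are hypothetical.

```python
def parse_version(text):
    """Turn a runtime version string such as "1.4" into the tuple (1, 4)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def has_tf_data(runtime_version):
    """tf.data moved into the core API in TensorFlow 1.4."""
    return parse_version(runtime_version) >= (1, 4)


# Comparing tuples avoids the string-comparison trap where "1.10" < "1.4".
for version in ("1.2", "1.4", "1.10"):
    print(version, has_tf_data(version))
```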

@sooonwoo

sooonwoo commented May 9, 2018

Thanks to @andersskog.

I think we now just need to change the runtime version in the yaml file to 1.6 and update setup.py like andersskog's code (see https://cloud.google.com/ml-engine/docs/runtime-version-list).
The rest is already modified.
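
For readers landing on this comment, the setup.py change being referred to amounts to declaring matplotlib as a dependency so that ML Engine installs it on every replica. The sketch below is an assumption about the shape of that change, not a copy of @andersskog's code. The package name and version are taken from the object_detection-0.1.tar.gz archive name in the job input earlier in the thread, and everything else is illustrative.

```python
"""Sketch of a setup.py for the object_detection package (illustrative only)."""
import sys

# Assumption: matplotlib is the only dependency missing on the replicas.
# Keep whatever requirements the existing setup.py already lists.
REQUIRED_PACKAGES = ["matplotlib"]


def setup_kwargs():
    """Build the keyword arguments passed to setuptools.setup()."""
    return {
        "name": "object_detection",
        "version": "0.1",
        "install_requires": REQUIRED_PACKAGES,
        "include_package_data": True,
    }


def main():
    # Imported here so the module can be inspected without setuptools installed.
    from setuptools import find_packages, setup

    setup(packages=find_packages(), **setup_kwargs())


# setup.py is always invoked with a command such as "sdist", so only run
# the build when one was given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

After editing setup.py, the package has to be rebuilt with python setup.py sdist so that the new archive is the one uploaded with the job.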

@grewe

grewe commented May 21, 2018

I have tried the above with versions 1.2, 1.6 and 1.8, and updated setup.py, and I still get an error regarding dependencies. Does anyone have any ideas?

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "main", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 52, in from object_detection.builders import model_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/model_builder.py", line 18, in from object_detection.builders import box_coder_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/box_coder_builder.py", line 21, in from object_detection.protos import box_coder_pb2 File "/root/.local/lib/python2.7/site-packages/object_detection/protos/box_coder_pb2.py", line 28, in dependencies=[object__detection_dot_protos_dot_faster__rcnn__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_keypoint__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_mean__stddev__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_square__box__coder__pb2.DESCRIPTOR,]) File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 829, in new return _message.default_pool.AddSerializedFile(serialized_pb) TypeError: Couldn't build proto file into descriptor pool! Invalid proto descriptor for file "object_detection/protos/box_coder.proto": object_detection/protos/box_coder.proto: Import "object_detection/protos/keypoint_box_coder.proto" has not been loaded. object_detection.protos.BoxCoder.keypoint_box_coder: "object_detection.protos.KeypointBoxCoder" seems to be defined in "keypoint_box_coder.proto", which is not imported by "object_detection/protos/box_coder.proto". To use it here, please add the necessary import. The replica ps 1 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "main", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 52, in from object_detection.builders import model_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/model_builder.py", line 18, in from object_detection.builders import box_coder_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/box_coder_builder.py", line 21, in from object_detection.protos import box_coder_pb2 File "/root/.local/lib/python2.7/site-packages/object_detection/protos/box_coder_pb2.py", line 28, in dependencies=[object__detection_dot_protos_dot_faster__rcnn__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_keypoint__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_mean__stddev__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_square__box__coder__pb2.DESCRIPTOR,]) File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 829, in new return _message.default_pool.AddSerializedFile(serialized_pb) TypeError: Couldn't build proto file into descriptor pool! Invalid proto descriptor for file "object_detection/protos/box_coder.proto": object_detection/protos/box_coder.proto: Import "object_detection/protos/keypoint_box_coder.proto" has not been loaded. object_detection.protos.BoxCoder.keypoint_box_coder: "object_detection.protos.KeypointBoxCoder" seems to be defined in "keypoint_box_coder.proto", which is not imported by "object_detection/protos/box_coder.proto". To use it here, please add the necessary import. The replica ps 2 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "main", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 52, in from object_detection.builders import model_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/model_builder.py", line 18, in from object_detection.builders import box_coder_builder File "/root/.local/lib/python2.7/site-packages/object_detection/builders/box_coder_builder.py", line 21, in from object_detection.protos import box_coder_pb2 File "/root/.local/lib/python2.7/site-packages/object_detection/protos/box_coder_pb2.py", line 28, in dependencies=[object__detection_dot_protos_dot_faster__rcnn__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_keypoint__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_mean__stddev__box__coder__pb2.DESCRIPTOR,object__detection_dot_protos_dot_square__box__coder__pb2.DESCRIPTOR,]) File "/usr/local/lib/python2.7/dist-packages/google/protobuf/descriptor.py", line 829, in new return _message.default_pool.AddSerializedFile(serialized_pb) TypeError: Couldn't build proto file into descriptor pool! Invalid proto descriptor for file "object_detection/protos/box_coder.proto": object_detection/protos/box_coder.proto: Import "object_detection/protos/keypoint_box_coder.proto" has not been loaded. object_detection.protos.BoxCoder.keypoint_box_coder: "object_detection.protos.KeypointBoxCoder" seems to be defined in "keypoint_box_coder.proto", which is not imported by "object_detection/protos/box_coder.proto". To use it here, please add the necessary import. 
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=36123659232&resource=ml_job%2Fjob_id%2Fgrewe_object_detection_5&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22grewe_object_detection_5%22

@tensorflowbutler
Member

Hi There,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.
