
Attempt to access beyond input size: 4 >= 4 #45

Closed
ryanjay0 opened this issue Dec 28, 2017 · 9 comments

Comments

ryanjay0 commented Dec 28, 2017

https://github.com/tensorflow/tpu-demos/blob/cb18fe2a4bacf4c8ef7685aebfbffb4550d5e938/cloud_tpu/models/resnet_garden/resnet_main.py#L204

I can't get parallel_interleave to work here. It gives this error:

InvalidArgumentError (see above for traceback): Attempt to access beyond input size: 4 >= 4
	In ParallelInterleaveDataset = ParallelInterleaveDataset[Targuments=[], f=tf_map_func_c72e772a[], output_shapes=[[]], output_types=[DT_STRING]](RepeatDataset:handle:0, ParallelInterleaveDataset/input_pipeline_task0/cycle_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/block_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/sloppy:output:0)
	 [[Node: input_pipeline_task0/OneShotIterator = OneShotIterator[container="", dataset_factory=_make_dataset_737011ca[], output_shapes=[[1024,224,224,3], [1024,1001]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]()]]

The ImageNet data is coming from GCS and it's definitely accessible.
It works fine if I replace that line with plain interleave (no parallel) and remove the apply, but I'm afraid that might slow things down significantly.

Note: the GPU version of this code is happy with parallel_interleave.
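
For reference, a minimal sketch of the workaround described above (the file pattern and cycle_length below are hypothetical, not the exact values from resnet_main.py):

import tensorflow as tf

def fetch_dataset(filename):
  # One TFRecord dataset per input shard, as in the ResNet input pipeline.
  return tf.data.TFRecordDataset(filename, buffer_size=8 * 1024 * 1024)

files = tf.data.Dataset.list_files('gs://my-bucket/imagenet/train-*').repeat()

# Original form (fails on the older TPU build with "Attempt to access beyond input size"):
# dataset = files.apply(
#     tf.contrib.data.parallel_interleave(fetch_dataset, cycle_length=8, sloppy=True))

# Workaround: plain interleave, no apply(); likely slower, but avoids the incompatible op.
dataset = files.interleave(fetch_dataset, cycle_length=8, block_length=1)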

frankchn (Contributor) commented Jan 3, 2018

Are you using the same code and settings on GPU and Cloud TPUs? Specifically, I am curious about batch sizes.

frankchn (Contributor) commented Jan 3, 2018

We are also interested in which versions of Cloud TPU and TensorFlow you are using. It would be great if you could start a new Cloud TPU, install the latest TensorFlow nightly (pip install tf-nightly), run with the default batch sizes, and see if you can isolate the problem.

ryanjay0 (Author) commented Jan 4, 2018 via email

saeta (Contributor) commented Jan 4, 2018

Hey @ryanjay0

Sorry about the failure. The underlying issue is that tensorflow/tensorflow@76db97f changed the definition of parallel interleave. While the change is backwards compatible at the Python API level, it is not backwards compatible at the RPC level. Alas, RPC incompatibilities are a known issue without a good short-term fix, which is why the Cloud TPU nightly builds are not stable and don't come with any guarantees.

I've confirmed this morning that if you create a new Cloud TPU, and use a fresh install of TF-nightly, you shouldn't encounter this issue. Once TF-1.5 lands in the next few weeks, I strongly recommend switching to that, as the stable release won't encounter these issues.

Thank you very much for trying out Cloud TPUs, and please do file issues you encounter, and reopen if this is still an issue!

All the best,
-Brennan

@saeta saeta closed this as completed Jan 4, 2018
@sb2nov sb2nov reopened this Jan 5, 2018
sb2nov (Contributor) commented Jan 5, 2018

@ryanjay0 can you specify which version of TensorFlow is installed on your GCE VM? Also, please try installing the TF nightly and check whether that resolves this issue.
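
(A generic way to check this on the VM, not specific to this setup:)

import tensorflow as tf
print(tf.VERSION)       # e.g. 1.5.0-dev20180103 for a nightly build
print(tf.GIT_VERSION)   # the git revision the installed build was made from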

ryanjay0 (Author) commented Jan 5, 2018

We got past that error by pip installing tf-nightly. Creating a new instance and TPU resource with the tf-nightly parameter wasn't enough on its own.

Unfortunately we are still having an issue. The error is now "DataLossError (see above for traceback): corrupted record at 0"

Perhaps we need to recreate the TFRecords to be compatible with the new TensorFlow nightly, but they worked fine without parallel interleave on the old build. We are also working with Arhaan to resolve this, but if you have any suggestions I'd be happy to try them.

TF version: 1.5.0-dev20180103

Mistobaan commented Jan 6, 2018

I got the same error using the ml-engine job system with runtime-version HEAD.

Is this something that needs to be handled on the ML Engine side, or should we patch the model script?

gcloud ml-engine jobs submit training $JOB_NAME \
--staging-bucket $STAGING_BUCKET \
--runtime-version HEAD \
--scale-tier BASIC_TPU \
--module-name resnet_garden.resnet_main \
--package-path resnet_garden/ \
--region $REGION \
-- --data_dir=gs://cloudtpu-imagenet-data/train --model_dir=$OUTPUT_PATH --train_steps=5000
Caused by op u'input_pipeline_task0/OneShotIterator', defined at:
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
  "__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
  exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/resnet_garden/resnet_main.py", line 428, in <module>
  tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
  _sys.exit(main(argv))
File "/root/.local/lib/python2.7/site-packages/resnet_garden/resnet_main.py", line 369, in main
  input_fn=ImageNetInput(True), max_steps=next_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 314, in train
  loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 743, in _train_model
  features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 725, in _call_model_fn
  model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1598, in _model_fn
  input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 901, in generate_infeed_enqueue_ops_and_dequeue_fn
  enqueue_ops = self._invoke_input_fn_and_record_structure()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 954, in _invoke_input_fn_and_record_structure
  enqueue_ops.append(enqueue_ops_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 715, in enqueue_ops_fn
  inputs = input_fn()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1575, in _input_fn
  return input_fn(**kwargs)
File "/root/.local/lib/python2.7/site-packages/resnet_garden/resnet_main.py", line 217, in __call__
  images, labels = dataset.make_one_shot_iterator().get_next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 153, in make_one_shot_iterator
  self.output_classes))), None,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1173, in one_shot_iterator
  container=container, shared_name=shared_name, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
  op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3158, in create_op
  op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1624, in __init__
  self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Attempt to access beyond input size: 4 >= 4
  In ParallelInterleaveDataset = ParallelInterleaveDataset[Targuments=[], f=tf_map_func_c72e772a[], output_shapes=[[]], output_types=[DT_STRING]](RepeatDataset:handle:0, ParallelInterleaveDataset/input_pipeline_task0/cycle_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/block_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/sloppy:output:0)
   [[Node: input_pipeline_task0/OneShotIterator = OneShotIterator[container="", dataset_factory=_make_dataset_b456ecd5[], output_shapes=[[1024,224,224,3], [1024,1001]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]()]]

saeta (Contributor) commented Jan 7, 2018

@ryanjay0 I believe we've fixed the issue with the nightly GCE images, so new nightly GCE VMs should no longer encounter the "Attempt to access beyond input size" error. As for the DataLossError (see above for traceback): corrupted record at 0 error, I'd need more context / logs to be sure, but it looks like your error is generated from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/lib/io/record_reader.cc#L114 (or similar lines in the same file). Is it possible that you have an extra file in your data directory (e.g. did you put your model directory in the same location as your data directory)?
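
(One rough way to check for a stray or corrupted file, assuming TF 1.x; the glob below is hypothetical and should match whatever your input_fn reads from --data_dir:)

import tensorflow as tf

pattern = 'gs://my-bucket/imagenet/train-*'  # hypothetical; use your data_dir glob
for path in tf.gfile.Glob(pattern):
  try:
    # Count records; a non-TFRecord file or truncated shard raises DataLossError.
    n = sum(1 for _ in tf.python_io.tf_record_iterator(path))
    print('%s: %d records OK' % (path, n))
  except tf.errors.DataLossError as e:
    print('%s: corrupted or not a TFRecord file (%s)' % (path, e))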

@Mistobaan As a short-term fallback while the updated builds propagate through to CMLE, you can switch from parallel interleave to standard interleave (at a runtime performance cost). Once TF-1.5 lands on Cloud TPUs and CMLE, you shouldn't encounter any of these issues.

Please do re-open this issue if the issues aren't resolved by Tuesday morning. Thanks!

@saeta saeta closed this as completed Jan 7, 2018
ryanjay0 (Author) commented Jan 9, 2018

It works. Thanks everyone for the advice. It's very fast.
