
Attempt to access beyond input size: 4 >= 4 #45

Closed
ryanjay0 opened this issue Dec 28, 2017 · 9 comments

Comments

ryanjay0 commented Dec 28, 2017

https://github.com/tensorflow/tpu-demos/blob/cb18fe2a4bacf4c8ef7685aebfbffb4550d5e938/cloud_tpu/models/resnet_garden/resnet_main.py#L204

I can't get parallel_interleave to work here. It gives this error:

InvalidArgumentError (see above for traceback): Attempt to access beyond input size: 4 >= 4
	In ParallelInterleaveDataset = ParallelInterleaveDataset[Targuments=[], f=tf_map_func_c72e772a[], output_shapes=[[]], output_types=[DT_STRING]](RepeatDataset:handle:0, ParallelInterleaveDataset/input_pipeline_task0/cycle_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/block_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/sloppy:output:0)
	 [[Node: input_pipeline_task0/OneShotIterator = OneShotIterator[container="", dataset_factory=_make_dataset_737011ca[], output_shapes=[[1024,224,224,3], [1024,1001]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]()]]

The ImageNet data is coming from GCS and it's definitely accessible.
It works fine if I replace that line with plain interleave (no parallel) and remove the apply, but I'm afraid that might slow things down significantly.

Note: the GPU version of this code is happy with parallel_interleave.
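
For reference, a minimal sketch of the workaround described above (the file pattern and cycle_length below are hypothetical, not the exact values from resnet_main.py):

import tensorflow as tf

def fetch_dataset(filename):
  # One TFRecord dataset per input shard, as in the ResNet input pipeline.
  return tf.data.TFRecordDataset(filename, buffer_size=8 * 1024 * 1024)

files = tf.data.Dataset.list_files('gs://my-bucket/imagenet/train-*').repeat()

# Original form (fails on the older TPU build with "Attempt to access beyond input size"):
# dataset = files.apply(
#     tf.contrib.data.parallel_interleave(fetch_dataset, cycle_length=8, sloppy=True))

# Workaround: plain interleave, no apply(); likely slower, but avoids the incompatible op.
dataset = files.interleave(fetch_dataset, cycle_length=8, block_length=1)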

frankchn (Contributor) commented Jan 3, 2018

Are you using the same code and settings on GPU and Cloud TPUs? Specifically, I am curious about batch sizes.

frankchn (Contributor) commented Jan 3, 2018

We are also interested in which versions of Cloud TPU and TensorFlow you are using. It would be great if you could start a new Cloud TPU, install the latest TensorFlow nightly (pip install tf-nightly), run with the default batch sizes, and see if you can isolate the problem.

ryanjay0 (Author) commented Jan 4, 2018 via email

saeta (Contributor) commented Jan 4, 2018

Hey @ryanjay0

Sorry about the failure. The underlying issue is that tensorflow/tensorflow@76db97f changed the definition of parallel interleave. While the change is backwards compatible at the Python API level, it is not backwards compatible at the RPC level. Alas, RPC incompatibilities are a known issue without a good short-term fix, which is why the Cloud TPU nightly builds are not stable and don't come with any guarantees.

I've confirmed this morning that if you create a new Cloud TPU, and use a fresh install of TF-nightly, you shouldn't encounter this issue. Once TF-1.5 lands in the next few weeks, I strongly recommend switching to that, as the stable release won't encounter these issues.

Thank you very much for trying out Cloud TPUs, and please do file issues you encounter, and reopen if this is still an issue!

All the best,
-Brennan

@saeta saeta closed this as completed Jan 4, 2018
@sb2nov sb2nov reopened this Jan 5, 2018
sb2nov (Contributor) commented Jan 5, 2018

@ryanjay0 can you specify which version of TensorFlow is installed on your GCE VM? Also, please try installing the TF nightly and check whether that resolves this issue.
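
(A generic way to check this on the VM, not specific to this setup:)

import tensorflow as tf
print(tf.VERSION)       # e.g. 1.5.0-dev20180103 for a nightly build
print(tf.GIT_VERSION)   # the git revision the installed build was made from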

ryanjay0 (Author) commented Jan 5, 2018

We got past that error by pip installing tf-nightly. Creating a new instance and TPU resource with the tf-nightly parameter wasn't enough on its own.

Unfortunately we are still having an issue. The error is now "DataLossError (see above for traceback): corrupted record at 0"

Perhaps we need to recreate the TFRecords to be compatible with the new TensorFlow nightly, but they worked fine without parallel interleave on the old build. We are also working with Arhaan to resolve this, but if you have any suggestions I'd be happy to try them.

TF version: 1.5.0-dev20180103

Mistobaan commented Jan 6, 2018

I got the same error using the ml-engine job system with runtime-version HEAD.

Is this something that needs to be handled on the ML Engine side, or should we patch the model script?

gcloud ml-engine jobs submit training $JOB_NAME \
--staging-bucket $STAGING_BUCKET \
--runtime-version HEAD \
--scale-tier BASIC_TPU \
--module-name resnet_garden.resnet_main \
--package-path resnet_garden/ \
--region $REGION \
-- --data_dir=gs://cloudtpu-imagenet-data/train --model_dir=$OUTPUT_PATH --train_steps=5000
Caused by op u'input_pipeline_task0/OneShotIterator', defined at:
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
  "__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
  exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/resnet_garden/resnet_main.py", line 428, in <module>
  tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 124, in run
  _sys.exit(main(argv))
File "/root/.local/lib/python2.7/site-packages/resnet_garden/resnet_main.py", line 369, in main
  input_fn=ImageNetInput(True), max_steps=next_checkpoint)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 314, in train
  loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 743, in _train_model
  features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 725, in _call_model_fn
  model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1598, in _model_fn
  input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 901, in generate_infeed_enqueue_ops_and_dequeue_fn
  enqueue_ops = self._invoke_input_fn_and_record_structure()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 954, in _invoke_input_fn_and_record_structure
  enqueue_ops.append(enqueue_ops_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 715, in enqueue_ops_fn
  inputs = input_fn()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1575, in _input_fn
  return input_fn(**kwargs)
File "/root/.local/lib/python2.7/site-packages/resnet_garden/resnet_main.py", line 217, in __call__
  images, labels = dataset.make_one_shot_iterator().get_next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 153, in make_one_shot_iterator
  self.output_classes))), None,
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1173, in one_shot_iterator
  container=container, shared_name=shared_name, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
  op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3158, in create_op
  op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1624, in __init__
  self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Attempt to access beyond input size: 4 >= 4
  In ParallelInterleaveDataset = ParallelInterleaveDataset[Targuments=[], f=tf_map_func_c72e772a[], output_shapes=[[]], output_types=[DT_STRING]](RepeatDataset:handle:0, ParallelInterleaveDataset/input_pipeline_task0/cycle_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/block_length:output:0, ParallelInterleaveDataset/input_pipeline_task0/sloppy:output:0)
   [[Node: input_pipeline_task0/OneShotIterator = OneShotIterator[container="", dataset_factory=_make_dataset_b456ecd5[], output_shapes=[[1024,224,224,3], [1024,1001]], output_types=[DT_FLOAT, DT_FLOAT], shared_name="", _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"]()]]

saeta (Contributor) commented Jan 7, 2018

@ryanjay0 I believe we've fixed the issue with the nightly GCE images, so new nightly GCE VMs should no longer encounter the "Attempt to access beyond input size" error. As for the DataLossError (see above for traceback): corrupted record at 0 error, I'd need more context / logs to be sure, but it looks like your error is generated from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/lib/io/record_reader.cc#L114 (or similar lines in the same file). Is it possible that you have an extra file in your data directory (e.g. did you put your model directory in the same location as your data directory)?
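
(One rough way to check for a stray or corrupted file, assuming TF 1.x; the glob below is hypothetical and should match whatever your input_fn reads from --data_dir:)

import tensorflow as tf

pattern = 'gs://my-bucket/imagenet/train-*'  # hypothetical; use your data_dir glob
for path in tf.gfile.Glob(pattern):
  try:
    # Count records; a non-TFRecord file or truncated shard raises DataLossError.
    n = sum(1 for _ in tf.python_io.tf_record_iterator(path))
    print('%s: %d records OK' % (path, n))
  except tf.errors.DataLossError as e:
    print('%s: corrupted or not a TFRecord file (%s)' % (path, e))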

@Mistobaan As a short-term fallback while the updated builds propagate through to CMLE, you can switch from parallel interleave to standard interleave (at a runtime performance cost). Once TF-1.5 lands on Cloud TPUs and CMLE, you shouldn't encounter any of these issues.

Please do re-open this issue if the issues aren't resolved by Tuesday morning. Thanks!

@saeta saeta closed this as completed Jan 7, 2018
ryanjay0 (Author) commented Jan 9, 2018

It works. Thanks everyone for the advice. It's very fast.
