Attempt to access beyond input size: 4 >= 4 #45
Comments
Are you using the same code and settings on GPU and Cloud TPUs? Specifically, I am curious about batch sizes.
We are also interested in which versions of Cloud TPU and TensorFlow you are using. It would be great if you could start a new Cloud TPU and install the latest nightly of TensorFlow (from pip install tf-nightly) with default batch sizes and see if you can isolate the problem.
We are using the latest version of TensorFlow and the Cloud TPU demo code. On the phone with Arhaan and Marcos today, they were able to reproduce the bug immediately.
Hey @ryanjay0, sorry about the failure. The underlying issue is that tensorflow/tensorflow@76db97f changed the definition of parallel interleave. While it's backwards compatible at the Python API level, it is not backwards compatible at the RPC level. Alas, RPC incompatibilities are a known issue without a good short-term fix, which is why the Cloud TPU nightly builds are not stable and don't come with any guarantees.

I've confirmed this morning that if you create a new Cloud TPU and use a fresh install of tf-nightly, you shouldn't encounter this issue. Once TF 1.5 lands in the next few weeks, I strongly recommend switching to that, as the stable release won't encounter these issues.

Thank you very much for trying out Cloud TPUs. Please do file any issues you encounter, and reopen if this is still a problem!

All the best,
@ryanjay0 Can you specify which version of TensorFlow is installed on your GCE VM? Also, please try installing the TF nightly and check whether that resolves the issue.
We got past that error by pip-installing tf-nightly. Creating a new instance and TPU resource with the tf-nightly parameter wasn't enough.

Unfortunately, we are still having an issue. The error is now: "DataLossError (see above for traceback): corrupted record at 0". Perhaps we need to recreate the TFRecords to be compatible with the new TensorFlow nightly, but it was working fine without parallel interleave on the old build. We are also working with Arhaan to resolve this, but if you have any suggestions I'd be happy to try them.

TF version: 1.5.0-dev20180103
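One way to tell a genuinely corrupt TFRecord file apart from a pipeline or version mismatch is to read each file by itself and see whether the DataLossError reproduces. This is a hedged sketch, not code from the thread: it uses the modern eager tf.data API rather than the TF 1.5 nightly discussed here, and the helper name `first_corruption` is made up for illustration. It builds one valid and one deliberately broken file locally so it runs without GCS.

```python
# Hedged sketch (not from the thread): reproduce the
# "corrupted record at 0" DataLossError on a local file.
import os
import tempfile

import tensorflow as tf


def first_corruption(path):
    """Return the DataLossError message for `path`, or None if it reads cleanly."""
    try:
        for _ in tf.data.TFRecordDataset([path]):
            pass  # we only care whether iteration raises
    except tf.errors.DataLossError as err:
        return str(err)
    return None


tmpdir = tempfile.mkdtemp()

# A well-formed TFRecord file.
good = os.path.join(tmpdir, "good.tfrecord")
with tf.io.TFRecordWriter(good) as writer:
    writer.write(b"a valid record")

# Raw bytes that are not a TFRecord at all -> triggers DataLossError.
bad = os.path.join(tmpdir, "bad.tfrecord")
with open(bad, "wb") as f:
    f.write(b"this is not a TFRecord")

print(first_corruption(good))
print(first_corruption(bad))
```

If every shard reads cleanly in isolation, the corruption error more likely points at the pipeline or runtime version rather than the data.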
Got the same error using the ml-engine job system with runtime-version HEAD. Is this something that needs to be handled on the ML Engine side, or should we patch the model script?
@ryanjay0 I believe we've fixed the issue with the nightly GCE images, so new nightly GCE VMs should no longer encounter the "Attempt to access beyond input size" error.

@Mistobaan As a short-term fallback while the updated builds propagate through to CMLE, you can switch from parallel interleave to standard interleave (at a runtime performance cost).

Once TF 1.5 lands on Cloud TPUs and CMLE, you shouldn't encounter any of these issues. Please do re-open this issue if things aren't resolved by Tuesday morning. Thanks!
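The fallback described above amounts to replacing the parallel_interleave apply with a plain Dataset.interleave. A hedged sketch, not code from this repo: it uses the modern tf.data API (in the TF 1.5-era pipeline the parallel version was `dataset.apply(tf.contrib.data.parallel_interleave(...))`), and it reads tiny local TFRecord shards as stand-ins for the ImageNet shards on GCS so it runs without a TPU.

```python
# Hedged sketch (not the repo's code): plain interleave as a fallback
# for parallel_interleave, demonstrated on small local TFRecord shards.
import os
import tempfile

import tensorflow as tf

# Write a few tiny TFRecord shards (stand-ins for the real training shards).
tmpdir = tempfile.mkdtemp()
paths = []
for shard in range(3):
    path = os.path.join(tmpdir, "train-%05d.tfrecord" % shard)
    with tf.io.TFRecordWriter(path) as writer:
        for rec in range(2):
            writer.write(("shard%d-rec%d" % (shard, rec)).encode())
    paths.append(path)

files = tf.data.Dataset.from_tensor_slices(paths)

# Standard interleave: a deterministic round-robin over the shard readers,
# no .apply() needed. Slower than the sloppy parallel version, but it
# sidesteps the parallel_interleave incompatibility discussed above.
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=3)

records = [r.numpy().decode() for r in dataset]
print(records)
```

With cycle_length=3 and the default block_length of 1, records come out one per shard in round-robin order.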
It works. Thanks, everyone, for the advice. It's very fast.
https://github.com/tensorflow/tpu-demos/blob/cb18fe2a4bacf4c8ef7685aebfbffb4550d5e938/cloud_tpu/models/resnet_garden/resnet_main.py#L204
I can't get parallel_interleave to work here. It gives the "Attempt to access beyond input size: 4 >= 4" error in the title.
The ImageNet data is coming from GCS and is definitely accessible.
It works fine if I replace the line with interleave (no parallel) and remove the apply, but I'm afraid that might slow things down significantly.
Note: my GPU version of this code is happy with parallel_interleave.
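For intuition (this sketch is not from the thread), the non-parallel interleave mentioned above can be pictured as a deterministic round-robin over per-file record streams, whereas a sloppy parallel interleave may yield records in whatever order its workers finish. A minimal pure-Python illustration, with lists standing in for files:

```python
# Illustrative sketch only: plain interleave behaves like a deterministic
# round-robin over per-file record streams, which is why swapping
# parallel_interleave for interleave trades throughput for stability.
def round_robin(sources):
    """Yield one record from each source in turn until all are exhausted."""
    iterators = [iter(src) for src in sources]
    while iterators:
        still_alive = []
        for it in iterators:
            try:
                yield next(it)
                still_alive.append(it)
            except StopIteration:
                pass  # this source is done; drop it from the rotation
        iterators = still_alive

# Three "files", each holding two "records".
files = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
print(list(round_robin(files)))  # ['a0', 'b0', 'c0', 'a1', 'b1', 'c1']
```

The parallel variant gains speed precisely by relaxing this fixed rotation, which is also what makes its behavior sensitive to runtime-version mismatches.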