
Unknown error appears when I use the UCF101 dataset. Perhaps some bug exists in the file extractor.py. #2539

Open
Acebee opened this issue Oct 6, 2020 · 2 comments
Labels
bug Something isn't working

Comments


Acebee commented Oct 6, 2020

Short description
When I try to use the UCF101 dataset, the program reports something like this:
tensorflow.python.framework.errors_impl.OutOfRangeError: E:\tfdsdata\datasets\ucf\downloads\thumos14_files_UCF101_videosxm55JXkGdBSDxwckqpN5c7GNr_LXm9dTyoJdpxR_aas.zip; Unknown error

Environment information

  • Operating System: Win10

  • Python version: 3.7(Conda)

  • tensorflow-datasets/tfds-nightly version: tensorflow-datasets 3.2.1

  • tensorflow/tf-nightly version: tensorflow-gpu 2.3.1

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?
    Yes

Reproduction instructions

mnist_train = tfds.load(name="ucf101", data_dir="E:\\tfdsdata\\datasets\\ucf")

or just reproduce the problem like this:

# something.zip refers to any zip file
import zipfile
import tensorflow.compat.v2 as tf

with tf.io.gfile.GFile('E:\\tfdsdata\\datasets\\ucf\\downloads\\something.zip', 'rb') as f_obj:
    z = zipfile.ZipFile(f_obj)
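For comparison, with the plain Python stdlib, zipfile.ZipFile accepts either a path or a binary file object, which suggests the failure is specific to the GFile wrapper. A minimal self-contained sketch (the archive and member names here are made up, not TFDS files):

```python
import os
import tempfile
import zipfile

# Build a small throwaway archive (hypothetical contents; any zip works).
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "something.zip")
with zipfile.ZipFile(archive, "w") as z:
    z.writestr("hello.txt", "hello world")

# Both a path and a plain binary file object work with the stdlib:
with open(archive, "rb") as f_obj:
    names_from_fobj = zipfile.ZipFile(f_obj).namelist()
names_from_path = zipfile.ZipFile(archive).namelist()
assert names_from_fobj == names_from_path == ["hello.txt"]
```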


Expected behavior
I looked into the extractor.py file and found the reason:
it seems that when zipfile.ZipFile() tries to read a file that is wrapped by tf.io.gfile.GFile, it throws an exception.

@contextlib.contextmanager
def _open_or_pass(path_or_fobj):
  if isinstance(path_or_fobj, six.string_types):
    with tf.io.gfile.GFile(path_or_fobj, 'rb') as f_obj:
      yield f_obj
  else:
    yield path_or_fobj

I managed to work around the problem by passing the archive path to zipfile.ZipFile directly instead of the wrapped file object. The original code in extractor.py:

def iter_zip(arch_f):
  """Iterate over zip archive."""
  with _open_or_pass(arch_f) as fobj:
    z = zipfile.ZipFile(fobj)  # fails when fobj is a GFile
    for member in z.infolist():
      extract_file = z.open(member)
      if member.is_dir():  # Filter directories  # pytype: disable=attribute-error
        continue
      path = _normpath(member.filename)
      if not path:
        continue
      yield [path, extract_file]

and with my change:

def iter_zip(arch_f):
  """Iterate over zip archive."""
  with _open_or_pass(arch_f) as fobj:
    z = zipfile.ZipFile(arch_f)  # pass the path directly, not the wrapped fobj
    for member in z.infolist():
      extract_file = z.open(member)
      if member.is_dir():  # Filter directories  # pytype: disable=attribute-error
        continue
      path = _normpath(member.filename)
      if not path:
        continue
      yield [path, extract_file]


@Acebee Acebee added the bug Something isn't working label Oct 6, 2020
@vijayphoenix (Contributor)

Thanks for reporting.
It seems that using tf.io.gfile with the Python zipfile module results in corruption of the data (for some reason, on Windows only).

Related tensorflow/tensorflow#32975

@parasol4791

I am actually seeing a similar issue on Ubuntu 20.04, with the 'cats_vs_dogs' dataset. While fine-tuning EfficientNet B4 I get similar errors, and a separate investigation showed the 'corrupted' file names change on every run.
Epoch 1/20
75/234 [========>.....................] - ETA: 43s - loss: 0.3529 - accuracy: 0.9261Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
119/234 [==============>...............] - ETA: 31s - loss: 0.3379 - accuracy: 0.9352Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
205/234 [=========================>....] - ETA: 7s - loss: 0.3345 - accuracy: 0.9424Corrupt JPEG data: 2226 extraneous bytes before marker 0xd9
215/234 [==========================>...] - ETA: 5s - loss: 0.3345 - accuracy: 0.9428Corrupt JPEG data: 239 extraneous bytes before marker 0xd9
227/234 [============================>.] - ETA: 1s - loss: 0.3343 - accuracy: 0.9434Corrupt JPEG data: 1153 extraneous bytes before marker 0xd9
229/234 [============================>.] - ETA: 1s - loss: 0.3343 - accuracy: 0.9435Corrupt JPEG data: 228 extraneous bytes before marker 0xd9
234/234 [==============================] - ETA: 0s - loss: 0.3342 - accuracy: 0.9437Corrupt JPEG data: 65 extraneous bytes before marker 0xd9
