Corrupt files in the dogs_vs_cats dataset #2188
Comments
@tomergt45 I am unable to reproduce the bug. As far as I can see, all corrupt images are removed already.
Also, printing is an I/O operation, so it will take a lot of time to print an array of 20000+ images.
The problem happens when you just iterate over the data (without printing):
It also happens when you try to fit a model with this data.
@tomergt45 I was not able to reproduce the error. @vijayphoenix is correct, all 1738 corrupt images were skipped; see this colab.
@Eshan-Agarwal It still happens to me. I tried updating the full example:
However, I can reproduce this issue on Windows.
I encountered this bug during my TensorFlow certification exam yesterday.
It is possible that the code to auto-detect corrupted images does not work on Windows:
Or maybe there are additional corrupted images on Windows that work on Linux? Unfortunately, I do not have access to any Windows computer, so I can't really debug this. If someone wants to help us investigate this, it would be great.
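For context, detection of this kind usually boils down to a cheap byte-level scan of each file. The sketch below is an assumption about the shape of such a check, not TFDS's actual implementation: it flags files that lack a JFIF header near the start or an end-of-image marker (`0xFFD9`) at the end.

```python
def looks_corrupt(jpeg_bytes: bytes) -> bool:
    """Cheap byte-level heuristic for a broken JPEG (a sketch, not
    TFDS's real check): require a JFIF header near the start of the
    file and a JPEG end-of-image marker (0xFFD9) at the end."""
    has_jfif = b"JFIF" in jpeg_bytes[:32]
    # Some corrupt files are padded with NUL bytes, so strip those
    # before checking for the end-of-image marker.
    has_eoi = jpeg_bytes.rstrip(b"\x00").endswith(b"\xff\xd9")
    return not (has_jfif and has_eoi)
```

A heuristic like this can disagree across platforms only if the files on disk differ, which is one way a Windows-specific download or extraction difference could surface as "extra" corrupt images.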
@Conchylicultor I tried checking it out. I added some print calls in each function of the
Only the print in the
Edit: I'd like to point out that I am not very familiar with how the TensorFlow Datasets API is structured.
@tomergt45 Thanks for looking into this.
After investigating a bit, I managed to get the names of the corrupted images that were not skipped using this code:
which gave me the following output:
Hope this helps.
EDIT: It's very weird, but every time you execute this code you get different file names; I'm not sure why.
It is very likely that this is because of the following: for more info see tensorflow/tensorflow#32975
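That linked issue is about the "Corrupt JPEG data" warnings being emitted by libjpeg's C code, which writes directly to the process's stderr stream; Python-level `sys.stderr` redirection never sees them, which would also explain the nondeterministic interleaving observed above. A sketch of capturing them at the file-descriptor level, assuming a POSIX-like OS:

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def capture_c_stderr():
    """Temporarily redirect file descriptor 2 into a temp file so that
    warnings written by C code (e.g. libjpeg's "Corrupt JPEG data")
    can be read back afterwards. Redirecting sys.stderr alone misses
    them because libjpeg bypasses the Python layer entirely."""
    saved_fd = os.dup(2)
    with tempfile.TemporaryFile(mode="w+b") as tmp:
        os.dup2(tmp.fileno(), 2)
        try:
            yield tmp
        finally:
            os.dup2(saved_fd, 2)
            os.close(saved_fd)
```

Usage would be to iterate the dataset inside the `with` block, then `tmp.seek(0); tmp.read()` to inspect any warnings; note that while redirected, genuine errors also go to the temp file instead of the console.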
I encountered this today while training a VGG model:
224/727 [========>.....................] - ETA: 41s - loss: 0.6927 - accuracy: 0.5188
Corrupt JPEG data: 99 extraneous bytes before marker 0xd9
Similar problem:
gives
So, is there some solution to this?
Hi, has no solution been found yet?
Same here, on Ubuntu 18. |
The only thing that worked for me was using this software to filter the images: https://github.com/coderslagoon/BadPeggy
Same on macOS Ventura :( |
2+ years later, this is still an issue. I'm running the TensorFlow: Advanced Techniques Specialization Coursera Course 1 Week 4 quiz. My env is:
I encountered this today... any solution?
I came across the same problem too. I had downloaded the dataset from Kaggle and tried running it on my local machine, but I hit the same error. My solution was to write code that tries to open each file and, if there is any error, removes that file. Also, if the number of channels (or dimensions) in the image is not 3 (red, green, and blue), I remove the file as well. After running this code on the dataset I was able to get the model to train without any issues. My code:

```python
from pathlib import Path

from tensorflow.io import read_file
from tensorflow.image import decode_image

# data_dir is of type Path and points to the parent dir.
# The parent dir contains the directories 'Dog' and 'Cat';
# run the same code for the dir 'Cat' to remove corrupt files.
for image in sorted((data_dir / 'Dog').glob('*')):
    try:
        img = read_file(str(image))
        img = decode_image(img)
        if img.ndim != 3:
            print(f"[FILE_CORRUPT] {image.name} DELETED")
            image.unlink()
    except Exception as e:
        print(f"[ERR] {image.name}: {e} DELETED")
        image.unlink()
```
I have seen a similar error in the JPEG-reading functions of several libraries, not just TensorFlow, so I think this is an error in the underlying image-decoding library. You can get around this issue by re-encoding and rewriting the JPEG images. It's an expensive operation, but you should only need to do it once. I adapted the image-removal function provided for the dataset. On my machine, this fixed the Corrupt JPEG error. Note also that my directory name is "data/cats_dogs", which is different from the default directory name.
Hope this helps others as a workaround.
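The re-encoding idea above can be sketched as follows. This is a minimal version using Pillow (an assumption; the commenter did not say which library they used), with the `reencode_jpegs` helper name and directory layout chosen for illustration:

```python
from pathlib import Path

from PIL import Image


def reencode_jpegs(root: Path) -> None:
    """Decode every .jpg under `root` and write it back as a clean JPEG.
    Re-encoding discards the extraneous bytes that trigger libjpeg's
    "Corrupt JPEG data" warning. Files Pillow cannot decode at all
    are removed instead."""
    for path in sorted(root.rglob("*.jpg")):
        try:
            with Image.open(path) as img:
                # Force a full decode and normalize CMYK/grayscale files.
                img = img.convert("RGB")
            img.save(path, format="JPEG", quality=95)
        except OSError:
            print(f"unreadable, removing: {path}")
            path.unlink()
```

As noted above, this is a one-time cost: once the files on disk are clean, later training runs no longer pay for it.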
Python 3.10:
A simple working code:
Short description
I encountered this bug during my TensorFlow certification exam. When trying to work with images from the dataset, you constantly get the message
Corrupt JPEG data: 228 extraneous bytes before marker 0xd9
again and again, and it takes forever to iterate over the data once. I couldn't complete my exam because of that.
Environment information
- tensorflow-datasets/tfds-nightly version: tensorflow-datasets 3.1.0
- tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow-gpu 2.2.0

Reproduction instructions
A very simple way to reproduce the bug:
Expected behavior
I expect to be able to iterate over all the images without getting errors and without it taking forever to complete a single iteration.