
Error when train on customized dataset: Invalid JPEG data or crop window, data size 36864 #455

Closed
panfeng-hover opened this issue Jul 20, 2019 · 11 comments



panfeng-hover commented Jul 20, 2019

It seems to be an "Invalid JPEG data or crop window" error, but I double-checked that the images in my TFRecords are JPEGs. I am wondering what possible reasons could cause this error?

The code I use to check the image format in the TFRecords:

from io import BytesIO

import tensorflow as tf
from PIL import Image
from tqdm import tqdm

for tfrecord in tqdm(tfrecord_files):
    for example in tqdm(tf.python_io.tf_record_iterator(tfrecord)):  # TF 1.x API
        data = tf.train.Example.FromString(example)
        encoded_jpg = data.features.feature['image/encoded'].bytes_list.value[0]
        img = Image.open(BytesIO(encoded_jpg))
        assert img.format == 'JPEG'
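Note that `Image.open` only parses enough of the header to identify the format, so a truncated file can still pass the assertion above. A stricter byte-level pre-check (a stdlib-only sketch; `looks_like_jpeg` is a hypothetical helper that only verifies the SOI/EOI framing markers, not the compressed stream, and some encoders do append trailing bytes after EOI) could look like:

```python
def looks_like_jpeg(data: bytes) -> bool:
    """Cheap structural check: JPEG bytes must start with the SOI
    marker (FF D8) and end with the EOI marker (FF D9)."""
    return (
        len(data) >= 4
        and data[:2] == b"\xff\xd8"
        and data[-2:] == b"\xff\xd9"
    )

# A truncated file keeps its SOI marker but loses the EOI marker:
intact = b"\xff\xd8" + b"\x00" * 100 + b"\xff\xd9"
truncated = intact[:50]
print(looks_like_jpeg(intact), looks_like_jpeg(truncated))  # True False
```

A stronger (but slower) check is to force a full decode of each image, e.g. `Image.open(BytesIO(encoded_jpg)).load()`, at the cost of decompressing every file.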

The log when I met the error:

E0719 23:46:18.549607 139925925385984 error_handling.py:70] Error recorded frominfeed: From /job:worker/replica:0/task:0:
Invalid JPEG data or crop window, data size 36864
         [[{{node parser/case/cond/else/_20/cond_jpeg/then/_0/DecodeJpeg}}]]
         [[input_pipeline_task0/while/IteratorGetNext_1]]
E0719 23:46:18.572818 139925916993280 error_handling.py:70] Error recorded fromoutfeed: From /job:worker/replica:0/task:0:
Bad hardware status: 0x1
         [[node OutfeedDequeueTuple_4 (defined at /home/panfeng/projects/tpu/models/official/mask_rcnn/distributed_executer.py:115) ]]

Original stack trace for u'OutfeedDequeueTuple_4':
  File "tpu/models/official/mask_rcnn/mask_rcnn_main.py", line 156, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tpu/models/official/mask_rcnn/mask_rcnn_main.py", line 151, in main
    run_executer(params, train_input_fn, eval_input_fn)
  File "tpu/models/official/mask_rcnn/mask_rcnn_main.py", line 99, in run_executer
    executer.train(train_input_fn, FLAGS.eval_after_training, eval_input_fn)
  File "/home/panfeng/projects/tpu/models/official/mask_rcnn/distributed_executer.py", line 115, in train
    input_fn=train_input_fn, max_steps=self._model_params.total_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2721, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 362, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1184, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2560, in _call_model_fn
    config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1142, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2870, in _model_fn
    host_ops = host_call.create_tpu_hostcall()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1943, in create_tpu_hostcall
    device_ordinal=ordinal_id)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_tpu_ops.py", line 3190, in outfeed_dequeue_tuple
    device_ordinal=device_ordinal, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
E0719 23:46:19.930372 139927321310656 error_handling.py:70] Error recorded fromtraining_loop: From /job:worker/replica:0/task:0:
9 root error(s) found.
  (0) Cancelled: Node was closed
  (1) Cancelled: Node was closed
  (2) Cancelled: Node was closed
  (3) Cancelled: Node was closed
  (4) Cancelled: Node was closed
  (5) Cancelled: Node was closed
  (6) Cancelled: Node was closed
  (7) Cancelled: Node was closed
  (8) Invalid argument: Gradient for resnet50/batch_normalization_32/beta:0 is NaN : Tensor had NaN values
         [[node CheckNumerics_98 (defined at /home/panfeng/projects/tpu/models/official/mask_rcnn/distributed_executer.py:115) ]]
0 successful operations.
0 derived errors ignored.

@saberkun (Member)

Is there any data corruption? It turns out to be very common, e.g. tensorflow/tensorflow#7434.

In this case the error happens in the input pipeline. It is necessary to debug on CPU and validate that the data can be accessed correctly. I would recommend writing a simple program to test the data pipeline. Here is an example that reads data in eager mode: https://github.com/tensorflow/tpu/blob/master/models/official/mnasnet/post_quantization.py#L49
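The TFRecord framing itself can also be validated on CPU without TensorFlow, since the wire format is documented: each record is a little-endian uint64 payload length, a 4-byte masked CRC of the length, the payload, and a 4-byte masked CRC of the payload. Below is a stdlib-only sketch (`iter_tfrecord` is a hypothetical helper; it skips CRC verification and does not parse the `tf.train.Example` protobuf inside each record):

```python
import struct

def iter_tfrecord(path):
    """Yield the raw payload of each record in a TFRecord file.

    Wire format per record: uint64 length (little-endian),
    uint32 masked CRC of the length, `length` payload bytes,
    uint32 masked CRC of the payload. CRCs are skipped here.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                return  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header")
            (length,) = struct.unpack("<Q", header)
            f.read(4)  # skip the CRC of the length
            payload = f.read(length)
            if len(payload) < length:
                raise ValueError("truncated record payload")
            f.read(4)  # skip the CRC of the payload
            yield payload
```

Running something like `sum(1 for _ in iter_tfrecord(path))` over each shard will at least confirm that the record framing survived the file transfer.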

@panfeng-hover (Author)

Thanks for your reply. Yes, it was due to a file-transfer issue: I generated the TFRecords on another remote machine. I later hit corrupted TFRecord files with an error similar to corrupted record at 12, which I fixed by increasing the number of shards.


mkr2667 commented Dec 12, 2020

InvalidArgumentError: Invalid JPEG data or crop window, data size 114304 [[{{node DecodeJpeg}}]]

I am getting this error when I run the code below:

import numpy as np
import tensorflow as tf
from tqdm import tqdm

def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

# img_name_vector: list of image file paths, defined earlier
encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=1).batch(64)
for img, path in tqdm(image_dataset):
    print("\nimage path {} : {}".format(img, path))
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features, (batch_features.shape[0], -1, batch_features.shape[3]))
    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        # print("{}:{}".format(path_of_feature, bf.numpy()))
        np.save(path_of_feature, bf.numpy())

The log when I met the error:
InvalidArgumentError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in execution_mode(mode)
2101 ctx.executor = executor_new
-> 2102 yield
2103 finally:

11 frames
InvalidArgumentError: Invalid JPEG data or crop window, data size 114304
[[{{node DecodeJpeg}}]] [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

InvalidArgumentError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/executor.py in wait(self)
65 def wait(self):
66 """Waits for ops dispatched in this executor to finish."""
---> 67 pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
68
69 def clear_error(self):

InvalidArgumentError: Invalid JPEG data or crop window, data size 114304
[[{{node DecodeJpeg}}]]

Could you please help me resolve this? The same question was asked on Stack Overflow, but there is no clear answer on the internet. Please answer ASAP.


milad-4274 commented Dec 19, 2020

I faced a similar problem: there is a problem in some of your training data. You can use the code below to check which JPEG images are corrupted and delete them.

from struct import unpack
import os


marker_mapping = {
    0xffd8: "Start of Image",
    0xffe0: "Application Default Header",
    0xffdb: "Quantization Table",
    0xffc0: "Start of Frame",
    0xffc4: "Define Huffman Table",
    0xffda: "Start of Scan",
    0xffd9: "End of Image"
}


class JPEG:
    def __init__(self, image_file):
        with open(image_file, 'rb') as f:
            self.img_data = f.read()
    
    def decode(self):
        data = self.img_data
        while(True):
            marker, = unpack(">H", data[0:2])
            # print(marker_mapping.get(marker))
            if marker == 0xffd8:
                data = data[2:]
            elif marker == 0xffd9:
                return
            elif marker == 0xffda:
                data = data[-2:]
            else:
                lenchunk, = unpack(">H", data[2:4])
                data = data[2+lenchunk:]            
            if len(data)==0:
                break        


bads = []

for img in tqdm(images):
  image = osp.join(root_img,img)
  image = JPEG(image) 
  try:
    image.decode()   
  except:
    bads.append(img)


for name in bads:
  os.remove(osp.join(root_img,name))

I used yasoob's script to decode the JPEG images.


rdvelazquez commented Sep 2, 2021

Thank you @milad-4274 (and yasoob) for sharing this jpeg checking script. It saved the day for us!

For others who may be looking at this, I made a few small revisions to your script to get it working for us, the most important of which was replacing:

            if len(data)==0:
                break    

with:

            if len(data)==0:
               raise TypeError("issue reading jpeg file")    

The other small changes were importing tqdm (from tqdm import tqdm), replacing osp.join with os.path.join, and reading in the list of images with something like:

for dirName, subdirList, fileList in os.walk(img_dir):
    imagesList = fileList
    for img in tqdm(imagesList):

Thanks again 👍

UPDATE:
The script found one bad image (out of ~200,000) but after removing that image we still saw the invalid JPEG error.
Our next approach is to use the image size printed in the error message to find the offending image (ls -l | grep <image_size>) and then remove images with that exact file size. This seems to work for JPEGs because, although our images mostly share the same pixel dimensions, their file sizes are fairly unique.
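The `ls -l | grep` trick above can also be done recursively in Python (a sketch; `find_files_by_size` is a hypothetical helper, and note that the "data size" in the error message is the size of the byte buffer handed to the decoder, which matches the file size only when the record stores the file verbatim):

```python
import os

def find_files_by_size(root, size_bytes):
    """Return paths under `root` whose on-disk size is exactly `size_bytes`."""
    matches = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == size_bytes:
                matches.append(path)
    return matches
```

For the error at the top of this thread, that would be something like `find_files_by_size("images/", 36864)`.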


OnSebii commented Sep 28, 2021

(quoting @rdvelazquez's comment above)

And how do I run this script? The line for dirName, subdirList, fileList in os.walk(img_dir): raises NameError: name 'img_dir' is not defined.

@rdvelazquez

@OnSebii You need to define the path to the directory where your images are stored img_dir = "./<path_to_image_dir>/" as either an absolute path or a relative path (from where your python script is called) above the for dirName, subdirList, fileList in os.walk(img_dir): line.


antonison commented Jun 10, 2022

(quoting @rdvelazquez's reply above)

I did everything, but it does not recognize root_img. It raises an error that reads as follows:
NameError: name 'root_img' is not defined

What should I replace it with? Thank you!

@sarLum52

(quoting the exchange between @rdvelazquez and @antonison above)

I am having the same issue with root_img too. Were you able to resolve it? I am pretty new to all of this.

@choudharyfaisal

Set both variables to the same path, i.e. the directory that holds your images:

img_dir = "path/to/your/images"
root_img = "path/to/your/images"  # same path as img_dir

@biphasic

Here is the complete code, with modifications, that does the job for me:

from struct import unpack
import os
from tqdm import tqdm

marker_mapping = {
    0xffd8: "Start of Image",
    0xffe0: "Application Default Header",
    0xffdb: "Quantization Table",
    0xffc0: "Start of Frame",
    0xffc4: "Define Huffman Table",
    0xffda: "Start of Scan",
    0xffd9: "End of Image"
}


class JPEG:
    def __init__(self, image_file):
        with open(image_file, 'rb') as f:
            self.img_data = f.read()
    
    def decode(self):
        data = self.img_data
        while(True):
            marker, = unpack(">H", data[0:2])
            # print(marker_mapping.get(marker))
            if marker == 0xffd8:
                data = data[2:]
            elif marker == 0xffd9:
                return
            elif marker == 0xffda:
                data = data[-2:]
            else:
                lenchunk, = unpack(">H", data[2:4])
                data = data[2+lenchunk:]            
            if len(data)==0:
                raise TypeError("issue reading jpeg file")            


# list all files in directory
folder_path = 'data/train_v2'
image_paths = os.listdir(folder_path)

corrupted_jpegs = []

for img_path in tqdm(image_paths):
    full_image_path = os.path.join(folder_path, img_path)
    image = JPEG(full_image_path)
    try:
        image.decode()
    except Exception:
        corrupted_jpegs.append(img_path)
        print(f"Corrupted image: {img_path}")

print(corrupted_jpegs)

10 participants