DataLossError (see above for traceback): corrupted record at 12 #13463

Closed
huangrandong opened this Issue Oct 3, 2017 · 53 comments

@huangrandong commented Oct 3, 2017

I have a big problem: I use TFRecord files to feed data to my TensorFlow program, but after the program has run for a while it raises a DataLossError:

System information

OS Platform and Distribution : Linux Ubuntu 14.04
TensorFlow installed from : Anaconda
TensorFlow version : 1.3.0
Python version: 2.7.13
CUDA/cuDNN version: 8.0 / 6.0
GPU model and memory: Pascal TITAN X

Describe the problem

2017-10-03 19:45:43.854601: W tensorflow/core/framework/op_kernel.cc:1192] Data loss: corrupted record at 12
Traceback (most recent call last):
File "east_quad_train_backup.py", line 416, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "east_quad_train_backup.py", line 330, in main
Training()
File "east_quad_train_backup.py", line 312, in Training
feed_dict={learning_rate: lr})
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 12
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,512,512,3], [?,128,128,9]], output_types=[DT_UINT8, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"]]
[[Node: gradients/Tile_grad/Shape/_23 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_442_gradients/Tile_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Caused by op u'IteratorGetNext', defined at:
File "east_quad_train_backup.py", line 416, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "east_quad_train_backup.py", line 330, in main
Training()
File "east_quad_train_backup.py", line 251, in Training
batch_image, batch_label = iterator.get_next()
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 304, in get_next
name=name))
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 379, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

DataLossError (see above for traceback): corrupted record at 12
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,512,512,3], [?,128,128,9]], output_types=[DT_UINT8, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"]]
[[Node: gradients/Tile_grad/Shape/_23 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_442_gradients/Tile_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Thanks to anyone who can answer this question.

@cy89 commented Oct 9, 2017

@huangrandong is this problem repeatable, or did it happen just one time?

@huangrandong commented Oct 9, 2017

@cy89, thank you for your response. This problem has happened many times; it comes up every time I run my program, but not at a reproducible point. The cause may be my machine's configuration: the same program runs on another machine without showing the error.

@reedwm (Member) commented Oct 12, 2017

Can you post a small example that will cause the DataLossError after running it for a while, so that we can see what the problem is?

@huangrandong commented Oct 12, 2017

@reedwm My code writes a NumPy image array and a label array into a TFRecord file and then reads them back from the same file. Here it is:

Creating the TFRecord file:

img_tfrecord_name = image_base_name + ".tfrecord"
writer = tf.python_io.TFRecordWriter(new_label_path + img_tfrecord_name)
label_concate = np.concatenate((score_map, x1_offset, y1_offset,
                                x2_offset, y2_offset, x3_offset,
                                y3_offset, x4_offset, y4_offset), axis=-1)
org_train_image = cv2.imread(org_train_images_path + img_name)
org_train_image_resize = cv2.resize(org_train_image,
                                    (input_image_size, input_image_size))
assert org_train_image_resize.shape == (512, 512, 3)
org_train_image_resize = org_train_image_resize.astype(np.uint8)
org_train_image_resize_raw = org_train_image_resize.tostring()
label_concate = label_concate.astype(np.float32)
label_concate_raw = label_concate.tostring()
example = tf.train.Example(
    features=tf.train.Features(
        feature={'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[org_train_image_resize_raw])),
                 'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label_concate_raw]))}))
serialized = example.SerializeToString()
writer.write(serialized)
print 'writer ', img_name, ' DOWN!'
writer.close()

Reading the TFRecord file:

def _parse_function_for_train(example_proto):
    features = {'image': tf.FixedLenFeature((), tf.string, default_value=""),
                'label': tf.FixedLenFeature((), tf.string, default_value="")}
    parsed_features = tf.parse_single_example(example_proto, features)
    image_raw_out = parsed_features['image']
    label_raw_out = parsed_features['label']
    image_out = tf.decode_raw(image_raw_out, tf.uint8)
    label_out = tf.decode_raw(label_raw_out, tf.float32)
    image_out = tf.reshape(image_out, [512, 512, 3])
    label_out = tf.reshape(label_out, [128, 128, 9])
    return image_out, label_out

def CreateTrainDataset():
    train_image_label_tfrecord_list = ["t1.tfrecord", "t2.tfrecord", ......]
    train_dataset = tf.contrib.data.TFRecordDataset(train_image_label_tfrecord_list)
    train_dataset = train_dataset.map(_parse_function_for_train)
    batched_train_dataset = train_dataset.batch(512)
    return batched_train_dataset

batched_train_dataset = CreateTrainDataset()
iterator = batched_train_dataset.make_initializable_iterator()
batch_image, batch_label = iterator.get_next()
with tf.Session() as sess:
    sess.run(iterator.initializer)
When the above code has run for some iterations, the DataLossError comes up.

@reedwm (Member) commented Oct 13, 2017

@huangrandong can you post a complete, self-contained example I can copy to a text file and run? In the code above, image_base_name is not defined.

@saxenasaurabh @vrv, any idea what the problem could be?

@huangrandong commented Oct 14, 2017

@reedwm You can define the variables that the code leaves undefined. The code puts a NumPy image array and a label array into a TFRecord file, then reads the two arrays back from that file.

@reedwm (Member) commented Oct 16, 2017

It's much easier to quickly reproduce these issues if I have a self-contained example without having to define variables. Perhaps the issue only occurs for certain values of x1_offset, for example. So can you please add a complete example?

@guillaumekln (Contributor) commented Nov 10, 2017

I also had reports of this error which appears to occur randomly during the training. It happened on multiple occasions and with different reported offsets (see OpenNMT/OpenNMT-tf#19).

To investigate the issue, I wrote a small script that repeatedly loops over the same TFRecord dataset that threw the error and applies the same processing as done during training. However, I was not able to reproduce it, indicating that no records are corrupted in the file and something else is going on during training.

Any pointers to better investigate the issue would be appreciated.

@rjbruin commented Nov 14, 2017

Same problem here. For several different sets of TFRecord files we get this error at random times during training.

@homink commented Nov 14, 2017

I have reproduced the error at the same record location. The first and third runs hit the error in the middle of 'Filling up shuffle buffer', and the second hit it at the beginning of that stage. In my case the error looks closely related to the shuffle-buffer process, although changing the buffer size didn't help. I hope this is helpful for debugging.

[kwon@ssi-dnn-slave-002 wsj_kaldi_tf]$ grep DataLossError wsj.log
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 3449023918
DataLossError (see above for traceback): corrupted record at 3449023918
[kwon@ssi-dnn-slave-002 wsj_kaldi_tf]$ grep DataLossError wsj.log1
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 3449023918
DataLossError (see above for traceback): corrupted record at 3449023918
[kwon@ssi-dnn-slave-002 wsj_kaldi_tf]$ grep DataLossError wsj.log2
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 3449023918
DataLossError (see above for traceback): corrupted record at 3449023918
@FirefoxMetzger (Contributor) commented Nov 21, 2017

Allow me to further complicate matters. (Although I am not 100% sure that it is the same issue)

I have some custom data and know that the TFRecord is not corrupt, because I've iterated over it (using the same code) successfully before. Now I've encountered the same situation that homink described.
After restarting my machine it is again working as intended.

Assuming that it is related, is there any caching involved when reading the .tfrecord? Either from tensorflow, python or the OS? (I am currently running it on Win10)

@tjvandal commented Nov 22, 2017

@FirefoxMetzger I too am having this issue, so I tried restarting my machine as you did, and it did not fix the problem. I'm using Ubuntu 16.04.

@tensorflowbutler (Member) commented Dec 20, 2017

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue? Please update the label and/or status accordingly.

@reedwm (Member) commented Dec 20, 2017

/CC @mrry @saxenasaurabh, any ideas what the issue could be? This is hard to debug without a small example that reproduces the issue.

@mrry (Contributor) commented Dec 20, 2017

AFAICT, this problem only affects ZLIB-compressed TFRecord files (because that is the sole source of "corrupted record at" in an error message). The source indicates a CRC mismatch. I'm a little surprised that none of the code snippets mention ZLIB compression.

/CC @saxenasaurabh @rohan100jain, who last touched the ZLIB-related code in that file.
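
For reference, ZLIB compression has to be opted into explicitly in the TF 1.x API on both the write and the read side; a minimal sketch (the file name is hypothetical):

import tensorflow as tf

# Writing a ZLIB-compressed TFRecord file requires passing explicit options.
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB)
writer = tf.python_io.TFRecordWriter('compressed.tfrecord', options=options)
writer.write(b'some serialized example')
writer.close()

# Reading it back also requires naming the compression explicitly;
# leaving compression_type unset means "no compression".
dataset = tf.data.TFRecordDataset(['compressed.tfrecord'], compression_type='ZLIB')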

@guillaumekln (Contributor) commented Dec 20, 2017

I confirm that the issue was encountered without any compression configured, unless it is the default (which it is not, as far as I know).

@mrry (Contributor) commented Dec 21, 2017

Pardon my mistake, indeed there are other code paths that can print that message, and each of them is related to a CRC mismatch.

@tensorflowbutler (Member) commented Jan 4, 2018

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

@tjvandal commented Jan 6, 2018

Any more thoughts on this? It's a big issue for me, but I don't know where to start debugging. Each time I reprocess my data the errors appear in different locations. Sometimes it takes a couple of training epochs to occur.

@amj commented Jan 20, 2018

/sub

This is happening to us as well, any ideas?

Edit to add: We are using zlib compression, reading a bunch of files off GCS with interleave and shuffling them into one large Dataset; as a result, there's no way to catch the error and try and carry on.

Is it possible this is some transient GCS issue? I'm also having trouble reproducing it with the same data.

@cy89 commented Feb 16, 2018

@tnikolla meaning no disrespect, but for isolating and debugging issues, what we mean by "small" is code that has been minimized to exhibit the bug in as few lines as possible, rather than a whole model. If you've got a deterministic exhibitor, the binary search to minimize it shouldn't be that much work for the author, but it is a lot of work for someone unfamiliar with the code. If you could do that, we'd be able to converge rapidly.

I'm going to close this bug for lack of activity; please reopen if we can get a small exhibitor to work with.

cy89 closed this Feb 16, 2018

@ed-alertedh commented Feb 26, 2018

We just ran into this error and it turned out to be legitimately caused by a corrupted record in our TFRecord file. I threw these functions together to help check for this in the future; they may be useful to others: https://gist.github.com/ed-alertedh/9f49bfc6216585f520c7c7723d20d951
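
For a quicker check (a minimal sketch that simply relies on TensorFlow's built-in CRC verification rather than the gist above; the file names are hypothetical), you can iterate over each file and catch the error:

import tensorflow as tf

def validate_tfrecords(filenames):
    # tf_record_iterator verifies the length and data CRCs of every record,
    # so a corrupted file raises DataLossError while being iterated.
    for filename in filenames:
        try:
            count = sum(1 for _ in tf.python_io.tf_record_iterator(filename))
            print('{}: OK ({} records)'.format(filename, count))
        except tf.errors.DataLossError as e:
            print('{}: corrupted ({})'.format(filename, e))

validate_tfrecords(['t1.tfrecord', 't2.tfrecord'])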

@jsenellart commented Mar 18, 2018

I also ran into this error in a totally different context. The error occurs randomly with the same code/TFRecord. What seems constant is that it occurs only with very large TFRecord files (>5 GB).

@lhlmgr (Contributor) commented Apr 3, 2018

I also ran into this error and my tfrecord files are also >5Gb (~55Gb) without compression.

@igorgad commented Apr 17, 2018

Hi. I am also running into the corrupted record error. According to the functions written by @ed-alertedh, my TFRecord file is perfectly fine, without any CRC mismatch. I found that you can temporarily get rid of the corruption error by clearing the Linux page cache with sudo sh -c "sync; echo 1 > /proc/sys/vm/drop_caches". This might indicate that the records are getting corrupted in memory or while being read from the drive.

@john-parton commented May 31, 2018

Can we re-open this? I have a minimal example:

#!/usr/bin/env python

from __future__ import print_function

import tensorflow as tf


def get_iterator():

    output_buffer_size = 1000

    pattern = 'test.txt.gz'

    filenames = tf.data.Dataset.list_files(pattern).repeat()

    dataset = filenames.apply(
        tf.contrib.data.parallel_interleave(
            lambda filename: tf.data.TFRecordDataset(filename, compression_type='GZIP'),
            cycle_length=8,
        )
    )

    dataset = dataset.map(
        lambda src: tf.string_split([src]).values,
        num_parallel_calls=8
    ).prefetch(output_buffer_size)

    iterator = dataset.make_initializable_iterator()

    source = iterator.get_next()

    return iterator, source


def main():


    graph = tf.Graph()

    with graph.as_default():
        iterator, source = get_iterator()

    with tf.Session(graph=graph) as sess:
        table_initializer = tf.tables_initializer()
        sess.run(table_initializer)
        sess.run(iterator.initializer)
        sess.run(tf.global_variables_initializer())

        for __ in range(100):

            value = sess.run([source])

if __name__ == '__main__':
    main()

Here's the necessary file: test.txt.gz

Here's the output:

2018-05-31 06:38:55.956213: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-31 06:38:55.956679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: GeForce GTX 680 major: 3 minor: 0 memoryClockRate(GHz): 1.163
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.58GiB
2018-05-31 06:38:55.956700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0
2018-05-31 06:38:56.186321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-31 06:38:56.186355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958]      0 
2018-05-31 06:38:56.186366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   N 
2018-05-31 06:38:56.186483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1355 MB memory) -> physical GPU (device: 0, name: GeForce GTX 680, pci bus id: 0000:01:00.0, compute capability: 3.0)
Traceback (most recent call last):
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
	 [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[]], output_types=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example.py", line 54, in <module>
    main()
  File "example.py", line 51, in main
    value = sess.run([source])
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
	 [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[]], output_types=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]

Caused by op 'IteratorGetNext', defined at:
  File "example.py", line 54, in <module>
    main()
  File "example.py", line 41, in main
    iterator, source = get_iterator()
  File "example.py", line 30, in get_iterator
    source = iterator.get_next()
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 373, in get_next
    name=name)), self._output_types,
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1666, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3417, in create_op
    op_def=op_def)
  File "/home/john/Code/venv/tensorflow-rnn/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1743, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

DataLossError (see above for traceback): corrupted record at 0
	 [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[]], output_types=[DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]

My production code is a lot more complex, obviously. This is just the "minimal" example that was requested. Everything works fine if I use a regular interleave instead of the parallel_interleave.

Edit

I just tried it with an uncompressed txt and without the compression_type='GZIP' flag and it failed as well.

Maybe there's something I don't understand about parallel_interleave?

Thanks for all your hard work!

@lhlmgr (Contributor) commented May 31, 2018

You are trying to open a compressed text file, which doesn't work with a TFRecordReader.
Possible solutions would be to save your data as TFRecords, or to read your data with a generator (of course, there are other solutions too).
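
For a gzipped plain-text file, one option (the one used below) is tf.data.TextLineDataset, which supports compression directly; a minimal sketch reusing the test.txt.gz file from the example above:

import tensorflow as tf

# Read a gzip-compressed text file line by line instead of treating it
# as a TFRecord file.
dataset = tf.data.TextLineDataset('test.txt.gz', compression_type='GZIP')
dataset = dataset.map(lambda src: tf.string_split([src]).values)

iterator = dataset.make_initializable_iterator()
tokens = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(sess.run(tokens))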

@john-parton commented May 31, 2018

@lhlmgr Thank you for pointing out my silly mistake.

I changed the relevant code from TFRecordReader to TextLineDataset and everything seems to be working just fine.

@xiongzhiyao commented Jun 5, 2018

I can confirm that after clearing the cache / restarting the computer as @igorgad suggested, the problem is solved temporarily.

@xiongzhiyao commented Jun 5, 2018

The problem occurs more often on large datasets; I am using the shuffle function.

@CppChan commented Jun 28, 2018

I also encountered this problem recently. In the end, I found it was because I had not modified the code I downloaded from someone else's GitHub: Python was running the stale *.pyc files by default, and my root directory lacked the '__pycache__' directory that is supposed to contain those *.pyc files. I hope this helps.

@meyerjo commented Jul 16, 2018

I'm having the same problem with large dataset files. It often happens after multiple successful training runs.

I can fix it by copying the same file (identical according to diff, hash, and size) back from my backup server.

@aranga81 commented Aug 9, 2018

I am seeing this on a large dataset too, and it is repeatable. I tried clearing my cache and rebooting, and also tweaked parallel_interleave(cycle_length), num_parallel_calls, and the buffer_size of my shuffle_and_repeat. Everything acts like a temporary fix, but the underlying problem is still there.

@f90 commented Aug 9, 2018

I get this problem on a dataset where each example has about 200k 32-bit floats, and it definitely seems related to the shuffle buffer. Without any shuffling it works fine, and with a small buffer as well. But when I increase the shuffle buffer size from 500 to 1500, this error comes up with the same message every time:

DataLossError (see above for traceback): truncated record at 1718960034

EDIT: I'm encountering this issue with a small buffer size too, just later during training. I regenerated my dataset, to no avail.

@stillwalker1234 commented Aug 23, 2018

Same problem with a TFRecord file > 10 GB. TF version 1.10, Ubuntu 17, Python 3.6.

Rebooting only solves the problem temporarily :(

@aranga81 commented Aug 23, 2018

Things that solved this issue for me:

  • write the TFRecords to an SSD or HDD on your local machine
  • reduce the buffer size for shuffle_and_repeat, and also check your CPU core count and set num_parallel_calls to no more than the number of cores (see the sketch below)
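
A minimal sketch of an input pipeline under those constraints (TF 1.x contrib APIs; parse_fn and the file pattern are hypothetical):

import multiprocessing
import tensorflow as tf

NUM_CORES = multiprocessing.cpu_count()

filenames = tf.data.Dataset.list_files('/local_ssd/train-*.tfrecord')
dataset = filenames.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=NUM_CORES))  # no more parallel readers than CPU cores

# Keep the shuffle buffer modest instead of trying to hold the whole dataset.
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=1000))

dataset = dataset.map(parse_fn, num_parallel_calls=NUM_CORES)
dataset = dataset.batch(32).prefetch(1)
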
@eliorc commented Sep 17, 2018

Having the same problem

The setup is as follows.
I'm running inside a nvidia-docker container, TF version 1.10, Ubuntu 16.04.
My TFRecords are 161GB in size.

Since the code is sensitive I can't post it but I'll explain what goes on and how I can repeat this exception.

After restarting the machine and rebuilding the image (still TF 1.10), before training I first go over the entire dataset and count its size using train_size = sum(1 for _ in tf.python_io.tf_record_iterator(meta['train_tfr_path'])) - so far no problems - note that this is a full iteration over the TFRecords.
Then, while iterating through a session that accepts the handle string, training starts normally and suddenly fails with DataLossError: corrupted record at X.
If I then run the same script again, it fails immediately on the train_size = sum(1 for _ in tf.python_io.tf_record_iterator(meta['train_tfr_path'])) line, reporting a corrupted record at exactly the same value (the same X).

This is now repeatable forever.

If I restart the machine, the whole process starts over: the first run will randomly fail during training at an unexpected record number (a different number this time), and then no matter how many times I rerun the script it will fail on the first call at that same record.

I think this should be reopened and solved - TFRecords are very important for big data training.

@linrongc commented Sep 29, 2018

I encountered the same problem. The DataLossError: corrupted record at X occurs randomly when training on a large dataset.
TF version 1.10, Ubuntu 16.04, Python 3.6.
Can anyone help solve this problem?

@siavash-khodadadeh commented Oct 1, 2018

I also encountered the same problem, and there really was an issue with one of my TFRecord files. The way I found it was to loop over all the TFRecord files, create a TFRecordDataset for each, parse it, and just try to access the data, printing the name of each TFRecord file to find the problematic ones. In my case it happened because I was using a very large dataset, my disk quota on the cluster was hit, and my code produced some corrupted TFRecords.

@linrongc commented Oct 1, 2018

I just found that the problem was caused by unstable RAM (detected using the CRC checking for TFRecords here: https://gist.github.com/ed-alertedh/9f49bfc6216585f520c7c7723d20d951). After I downgraded the memory frequency from 3066 to 2800 in the BIOS, it works fine now.

@meyerjo commented Oct 1, 2018

In my experience it helps

  • to keep the dataset files rather small (which also helps with shuffling the data properly)
  • and to strictly restrict the number of threads and parallel reads to the number of processors available on the system.

Still, it happens occasionally (but far less frequently). So please fix it ;-)

@mhtrinh commented Oct 3, 2018

Me too:
TensorFlow 1.10.1 (recompiled locally from unchanged GitHub source at tags/v1.10.1)
Python 3.6
Kernel: 4.15.0-34-generic, Ubuntu 18.04 LTS
No NFS, just Linux software RAID (mdadm)
TFRecords: 10 files of 1.8 GB each
RAM: 62 GB

Dropping the cache fixes the corruption temporarily. When I was running TensorFlow 1.4.0 there were no crashes or corruption during any of my training.

I suspect it has to do with the shuffle operation, as I also get random segfaults/errors just before or after "Filling up shuffle buffer" or "Shuffle buffer filled." (I am using the unchanged object detection pipeline.)

Edit: the same thing happens with TensorFlow 1.11 (pip install).
There are 2 different error messages that I believe are related to the same problem:

  • The "DataLossError" and "corrupted record"
  • And "InvalidArgumentError: Invalid PNG data, size " (I use PNG files)

I saved the md5sums of my TFRecords before training. When it crashes with the error above and I run md5sum, one of my TFRecords has changed. Once I drop the cache, the md5sum is back to what it was before training.

@guillaumekln (Contributor) commented Oct 10, 2018

If someone from the TensorFlow team is still reading this thread, would it be possible to optionally ignore records that fail the CRC check instead of raising an exception? This would be an acceptable behavior during training.
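
In the meantime, one possible workaround (a sketch, not an official fix for the CRC path; whether it is acceptable depends on your data) is the ignore_errors transformation, which drops any element whose production raises an error, including the DataLossError from a corrupted record:

import tensorflow as tf

dataset = tf.data.TFRecordDataset(['train.tfrecord'])  # hypothetical file name
# Skip elements that raise an error instead of aborting the input pipeline.
# (tf.contrib.data.ignore_errors() in TF 1.x contrib; tf.data.experimental.ignore_errors()
# in newer releases.)
dataset = dataset.apply(tf.contrib.data.ignore_errors())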

@mhtrinh commented Oct 10, 2018

I am not sure this is acceptable behavior: we are lucky that the CRC check failed; otherwise you would be silently training on corrupted data, which is kind of scary. Some may say this is a kind of data augmentation? ;-)

@dimitarsh1 commented Jan 7, 2019

Hi all, I ran into the same error several times. Sometimes it gets fixed after restarting the machine (weird, I know). I updated the drivers as well, but the problem remains. Any update from the TF folks?

@ridvaneksi commented Jan 16, 2019

I was having a similar error with my TFRecord files. I went back to the script that converts my images to TFRecord format and reran it with a single thread instead of multiple threads, and this fixed the problem. Running the conversion script with multiple threads made it slower and produced corrupted files for me.
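
If the corruption really does come from the multi-threaded conversion (for example, several workers sharing a single writer, which is only an assumption here), a common safe pattern is one writer per worker, each writing its own shard; a minimal sketch with hypothetical file names:

import multiprocessing
import tensorflow as tf

def write_shard(args):
    shard_index, image_paths = args
    # Each worker owns its writer and its output file, so records from
    # different workers can never interleave inside one file.
    writer = tf.python_io.TFRecordWriter('train-%05d.tfrecord' % shard_index)
    for path in image_paths:
        with open(path, 'rb') as f:
            encoded = f.read()
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded]))}))
        writer.write(example.SerializeToString())
    writer.close()

if __name__ == '__main__':
    shards = [(0, ['a.png', 'b.png']), (1, ['c.png', 'd.png'])]
    pool = multiprocessing.Pool(processes=2)
    pool.map(write_shard, shards)
    pool.close()
    pool.join()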
