The tfrecords file is 8 times larger than raw image data #9675

Closed
daiab opened this issue May 5, 2017 · 10 comments

Comments

@daiab

daiab commented May 5, 2017

I am trying to write a tfrecords file, but the file is larger than the raw data.

import tensorflow as tf
from PIL import Image

img = Image.open('img_file')  # this image file's size: 24 KB
b = img.tobytes()  # len(b) is 24 KB, which looks right
feats = tf.train.Features(feature={
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[2])),
    'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b]))})
example = tf.train.Example(features=feats)  # wrap the features in an Example
example_string = example.SerializeToString()
len(example_string) / 8 / 1024  # the output == 24.0057373046875 KB, which looks fine

But when I write this 'example_string' to a tfrecords file, the file size becomes 192 KB. I can't understand why the tfrecords file is several times larger than both 'example_string' and the raw image data.
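A quick arithmetic check of the numbers above may help. This sketch assumes that `len()` on the serialized string already returns bytes (so the extra `/ 8` is not needed) and, as a guess that happens to match the figures, that the image is 256x256 RGB. Neither the image dimensions nor the variable names come from the post.

```python
# Value printed by `len(example_string) / 8 / 1024` in the post.
reported = 24.0057373046875

# len() of a bytes object is already in bytes, so undo the extra "/ 8".
serialized_bytes = int(reported * 8 * 1024)
serialized_kb = serialized_bytes / 1024

# Assumption (not stated in the post): a 256x256 image with 3 channels.
raw_pixel_bytes = 256 * 256 * 3

# Whatever is left over is protobuf framing plus the int64 label.
overhead_bytes = serialized_bytes - raw_pixel_bytes

print(serialized_bytes)  # 196655
print(serialized_kb)     # 192.0458984375
print(raw_pixel_bytes)   # 196608
print(overhead_bytes)    # 47
```

Under these assumptions the serialized example is itself about 192 KB, which matches the reported tfrecords file size, so the file would not be larger than what was written to it.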

@ppwwyyxx
Contributor

ppwwyyxx commented May 5, 2017

Because you use Int64List.

@drpngx drpngx closed this as completed May 6, 2017
@asanakoy

asanakoy commented May 19, 2017

@ppwwyyxx, but he uses BytesList for the image, not Int64.
The only int64 value is the single integer label.
I have the same problem of tfrecords blowing up. My images are 30G when stored as jpg, but 280G when I write them to a tfrecord.

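    import numpy as np
    import tensorflow as tf
    from PIL import Image

    # Helper definitions are assumed; the post does not show them.
    def _int64_feature(value):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

    def _bytes_feature(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    img = np.asarray(Image.open(img_path))
    img_raw = img.tobytes()  # raw decoded pixels, not the JPEG bytes

    height = img.shape[0]
    width = img.shape[1]
    example = tf.train.Example(features=tf.train.Features(feature={
        'height': _int64_feature(height),  # single integer
        'width': _int64_feature(width),  # single integer
        'image_raw': _bytes_feature(img_raw),  # image
        }))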

I think we should reopen the issue.

@ppwwyyxx
Contributor

"30G when stored in jpg, but 280G when I write them in tfrecord"... do you mean they are not JPEG when stored in the tfrecord?

@asanakoy

asanakoy commented May 19, 2017

@ppwwyyxx, they are probably not JPEG anymore when stored in the tfrecord. I edited my code in the previous post.
When I used leveldb binaries with caffe, they didn't blow up so much compared to the original data.
It looks like it stores the raw pixels, without any encoding?

@ppwwyyxx
Contributor

ppwwyyxx commented May 19, 2017

Then it will definitely be several times larger, and the factor depends on how well JPEG compresses your images. The factor is 5.x on the whole ImageNet, btw (reference).
TFRecord stores bytes so you can do any encoding you want.

@asanakoy

Is it because tfrecords stores decompressed images?
Is it possible to store them compressed? Do you think it will make everything much slower?

@ppwwyyxx
Contributor

ppwwyyxx commented May 19, 2017

TFRecord stores bytes so you can do any encoding you want.
The easiest thing you can do is just img_raw = open(img_path, 'rb').read() to use the original encoding.
The decompression time is usually much smaller than the training time, and could be completely hidden if the preprocessing runs in parallel with training. Then it will only make everything faster (as in the link I posted).
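To illustrate the point that the stored size follows whatever bytes are handed over, here is a minimal, dependency-free sketch. It is not TFRecord code: `zlib` stands in for JPEG only because it ships with Python, and the 256x256 grey gradient is a made-up test image.

```python
import zlib

# Hypothetical 256x256 RGB image: a vertical grey gradient in which every
# pixel of row y is (y, y, y). Built without any imaging library.
height, width = 256, 256
raw_pixels = bytes(y for y in range(height) for _ in range(width * 3))

# Stand-in for "the original encoding". A real pipeline would store the
# JPEG or PNG file bytes read in binary mode instead.
encoded = zlib.compress(raw_pixels)

# A bytes feature costs roughly len(payload), whichever payload is chosen.
print(len(raw_pixels))                 # 196608
print(len(encoded) < len(raw_pixels))  # True

# The reader must apply the matching decoder to get the pixels back.
# zlib is lossless, so the round trip is exact (JPEG would not be).
restored = zlib.decompress(encoded)
print(restored == raw_pixels)          # True
```

The compression ratio on this synthetic image says nothing about real photographs; the sketch only shows that the writer decides the encoding and the reader has to undo it.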

@laura-wang

@ppwwyyxx This solution really reduces the size. But when I use image = tf.decode_raw(features['image_raw'], tf.uint8) to decode the image, the image size is much smaller. Do you know how I can read from the tfrecord to get the original image?

@laura-wang

I found the solution:
tf.image.decode_jpeg(image_raw_data) or
cv2.imdecode(np.frombuffer(image_raw_data, dtype=np.uint8), -1)
will solve the problem.
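The difference between the two decoders can be sketched end to end. This is a minimal example that assumes TensorFlow 2.x with eager execution (the thread itself predates it); the 64x64 black test image, the temporary file name, and the variable names are all made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical 64x64 RGB test image, JPEG-encoded in memory.
image = tf.zeros([64, 64, 3], dtype=tf.uint8)
encoded = tf.io.encode_jpeg(image).numpy()  # the bytes of a JPEG file

# Store the encoded bytes, not the decoded pixels.
example = tf.train.Example(features=tf.train.Features(feature={
    'image_raw': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[encoded])),
}))

path = os.path.join(tempfile.mkdtemp(), 'sample.tfrecords')
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

spec = {'image_raw': tf.io.FixedLenFeature([], tf.string)}
for record in tf.data.TFRecordDataset(path):
    features = tf.io.parse_single_example(record, spec)
    # decode_raw only reinterprets the bytes: one uint8 per JPEG byte.
    as_raw = tf.io.decode_raw(features['image_raw'], tf.uint8)
    # decode_jpeg decompresses back to height x width x channels.
    decoded = tf.io.decode_jpeg(features['image_raw'])

print(as_raw.shape[0] == len(encoded))  # True
print(tuple(decoded.shape))             # (64, 64, 3)
```

This would explain the "much smaller" image seen with `decode_raw`: its output has as many elements as the compressed file has bytes, not as the image has pixels.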

@switchfootsid

switchfootsid commented Aug 21, 2018

For anyone who is confused about how serialization of digital images works, this is a pretty wonderful explanation, from the ground up, of why a TFRecords file might be larger than the original image: https://planspace.org/20170403-images_and_tfrecords/


6 participants