Converting numpy array to TFRecord is slow #16933

Closed
karthikeyann opened this issue Feb 11, 2018 · 24 comments
@karthikeyann

karthikeyann commented Feb 11, 2018

tf.train.FloatList and tf.train.Feature are slow for numpy arrays.

Saving numpy arrays with np.save and np.load is much faster than converting to TFRecord and reading it back.
While profiling the code, I found that half of the time is spent in _floats_feature; tf.train.FloatList alone takes about a third of the total time.
How can this be sped up?
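
For reference, a minimal sketch of the np.save / np.load baseline being compared against (the author's exact baseline script isn't shown, so the file name and array shape here are assumptions):

import numpy as np

# Hypothetical baseline: persist the same 10,000 x 4,005 integer matrix
# with np.save / np.load instead of TFRecord.
data = np.random.randint(0, 1000, size=(10000, 4005))
np.save("deleteme.npy", data)
restored = np.load("deleteme.npy")
assert (restored == data).all()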

System information

  • The snippet of code below converting a numpy array is much slower than np.save / np.load
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow version (use command below): 1.4.0
  • Python version: 2.7.12

Source code / logs

import tensorflow as tf
import numpy as np


# Wrap a sequence of floats in a FloatList proto.
def floatme(value):
    return tf.train.FloatList(value=value)

# Wrap the FloatList in a Feature proto.
def _floats_feature(value):
    return tf.train.Feature(float_list=floatme(value))

tfr_filename = "deleteme.tfr"
# 10,000 lines of 4,005 space-separated random integers each
data = [" ".join(np.random.randint(0, 1000, size=4005).astype(str)) for i in range(10000)]
with tf.python_io.TFRecordWriter(tfr_filename) as writer:
    print('Converting to vectors')
    vectors = [np.fromstring(line, dtype=int, sep=' ', count=4004+1) for line in data]
    print('Converting to examples')
    for i, vec in enumerate(vectors):
        # Create an example protocol buffer
        example = tf.train.Example(features=tf.train.Features(feature={
            'label': _floats_feature([vec[4004], vec[4004]<1.0]),
            'data' : _floats_feature(vec[:4004]),
            }))
        writer.write(example.SerializeToString())


ncalls tottime percall cumtime percall filename:lineno(function)
232810 49.887 0 49.887 0 convert_train_dataset_tfrecord.py:76(floatme)
116405 20.095 0 20.095 0 {numpy.core.multiarray.fromstring}
232810 13.328 0 63.216 0 convert_train_dataset_tfrecord.py:79(_floats_feature)
tensorflowbutler added the stat:awaiting response (Status - Awaiting response from author) label on Feb 12, 2018
@tensorflowbutler
Member

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
TensorFlow installed from
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

@tensorflowbutler
Member

Nagging Awaiting Response: It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

1 similar comment

@mbrio

mbrio commented Mar 27, 2018

I'd like to add to this: it seems as though the instantiation of a tf.train.Features object takes a tremendous amount of time. A very simple example of timings on my machine:

import time
import tensorflow as tf

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example()

time.perf_counter() - start

The instantiation of 2,000,000 examples with no features takes 0.76 seconds.

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example()
    feature_1 = tf.train.Int64List(value=[10])
    feature_2 = tf.train.Int64List(value=[10])
        
time.perf_counter() - start

The instantiation of 2,000,000 examples and two tf.train.Int64List objects takes 5 seconds.

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example()
    feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[10]))
    label = tf.train.Feature(int64_list=tf.train.Int64List(value=[10]))
        
time.perf_counter() - start

The instantiation of 2,000,000 examples and two tf.train.Feature objects wrapping Int64Lists takes 11 seconds.

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example(features = tf.train.Features(
            feature={
                'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
            }
        )
    )
        
time.perf_counter() - start

The instantiation of 2,000,000 examples, each with a Features map containing two Int64List features, takes 41 seconds.

And finally, when I put it all together with a TFRecordWriter:

start = time.perf_counter()

with tf.python_io.TFRecordWriter('/mnt/data/repository/test.tfrecord') as writer:
    for i in range(2000000):
        example = tf.train.Example(features = tf.train.Features(
                feature={
                    'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                    'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                }
            )
        )

        serialized = example.SerializeToString()
        writer.write(serialized)
        
time.perf_counter() - start

The final timing is 64 seconds for 2,000,000 records.

Unfortunately, when you have a dataset with 1 billion rows, this means roughly 8.9 hours (64 s × 500 ≈ 32,000 s) just to convert it to TFRecord files.

@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

@tensorflowbutler
Member

It has been 29 days with no activity and the awaiting response label was assigned. Is this still an issue?

@tensorflowbutler
Member

It has been 50 days with no activity and the awaiting response label was assigned. Is this still an issue?

@tensorflowbutler
Member

We are closing this issue for now due to lack of activity. Please comment if this is still an issue for you. Thanks!

@bstadlbauer

bstadlbauer commented Jan 25, 2019

I can confirm @mbrio's timing results. I wonder if there is any way to speed this up besides performing the computation in parallel?

Setup:
OS: Ubuntu 18.04
TensorFlow-GPU running on CPU (i7-7700HQ)
TensorFlow version: 1.12.0 (installed via pip)

@yongtang
Member

@lxkarthi @mbrio @bstadlbauer

The SIG I/O under TensorFlow:

https://github.com/tensorflow/community/blob/master/sigs/io/CHARTER.md

has a focus on data and streaming processing for TensorFlow. We are working on improving data input/output (likely with the tf.data API).

The SIG I/O Google group is:
https://groups.google.com/a/tensorflow.org/forum/#!forum/io

You could consider joining the Google group to explain your use case so that we can find a way to resolve the performance issue you are encountering, or create an issue in https://github.com/tensorflow/io.

@harahu
Contributor

harahu commented Jun 26, 2019

@shivaniag We've never heard back from the TensorFlow team on this issue. Could you give us an update on whether this is recognized as a problem or not?

shivaniag assigned jsimsa and unassigned shivaniag on Jun 26, 2019
@shivaniag
Contributor

@jsimsa, do you have an update on this?

@harahu
Contributor

harahu commented Jun 27, 2019

Just want to say that my team suffers from the same issue @mbrio mentions. Using tf.train.Feature and tf.train.Example to generate serialized examples is convenient, but prohibitively slow.

A solution could be to implement native TensorFlow ops for this that don't have to be wrapped in a tf.py_function?

@yongtang
Member

@harahu Are you looking for reading numpy files (on disk) into tf.data, or converting numpy array (in memory managed by python process) into tf.data (in memory and managed by tf process)?

The former might be easier to implement than the latter. The underlying TensorFlow (including tf.data) is implemented in C++, and interacting with Python's memory is not very straightforward (hence the not-so-convenient tf.py_function).

@harahu
Contributor

harahu commented Jun 27, 2019

@yongtang I am looking to serialize data in order to write to TFRecord files. The origin of my data is numpy arrays, but the solution doesn't need to be a direct mapping from numpy arrays to serialized strings, as I don't mind turning said arrays into tensors before serialization. We have tf.serialize_tensor already, which is efficient, but it is very limited in what it can serialize (one tensor at a time). A tf.data.Dataset can have quite complex structure. For instance, I have a time series dataset structured something like this:

dataset.output_types
>>> {'feature0': tf.float32, 'feature1': tf.int64, 'label': tf.int64, 'timestamp': tf.int64}

I would love to have a tf op with the efficiency of tf.serialize_tensor, but the usability and flexibility of tf.train.Feature and tf.train.Example. Some operation where I could map the dictionary structure above directly to a single tf.string, ready to write to file.
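
For context, a minimal sketch of what the existing single-tensor op can do (using the tf.io.serialize_tensor / tf.io.parse_tensor names from newer TF releases; the tensor values are made up for illustration):

import tensorflow as tf

# Fast, but handles exactly one tensor per call -- it cannot directly
# serialize a dict-structured dataset element like the one above.
t = tf.constant([[1.0, 2.0], [3.0, 4.0]])
serialized = tf.io.serialize_tensor(t)  # scalar tf.string
restored = tf.io.parse_tensor(serialized, out_type=tf.float32)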

@jsimsa
Contributor

jsimsa commented Jun 27, 2019

@rohan100jain and @frankchn are working on a mechanism for persisting the outputs of (a prefix of) an input pipeline which needs to solve the same problem (efficiently serializing elements of a tf.data.Dataset).

I believe that their solution could be extended to provide "save" and "load" functionality, but I also expect that it might take some time to settle on a format for which backwards compatibility is provided (i.e. it might initially only be possible to "load" data which was "saved" using the same version of TensorFlow).

@harahu
Contributor

harahu commented Jun 28, 2019

@jsimsa I also assume that some serialization happens when you cache a dataset to file:

dataset = dataset.cache(filename='somecachefile')

But I don't know if or how this might be usable for solving the issues mentioned here.

@areeh

areeh commented Jun 28, 2019

@jsimsa This sounds very promising, but I just want to mention that one use case for efficient serialization of tf.data is handling a dataset that is too large to fit in memory. If there is only a simple "save" and "load" command that attempts to load everything into memory, that use case is not covered. For me, the ideal solution would be something that can save/load batch by batch (or similar) so that I can avoid memory issues. Not implying you didn't already think of this; just trying to show interest in the specific part of the feature I would find most helpful.

@jsimsa
Contributor

jsimsa commented Jun 28, 2019

@harahu The cache(filename=...) transformation could indeed be used as a stopgap solution for serializing and deserializing data.

@areeh The prospective "save" and "load" functionality would work similarly to the rest of the tf.data transformations (in that it would support streaming of elements). In other words, it would not require that all of the data fit into memory.
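
As a rough sketch of that stopgap (assuming eager execution / TF 2.x; the cache path is an assumption), the cache file is written on the first full pass over the dataset and reused afterwards, provided the pipeline is defined identically when reading:

import tensorflow as tf

# Build the expensive part of the pipeline, then cache it to disk.
dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x * 2)  # stand-in for expensive preprocessing
dataset = dataset.cache(filename="/tmp/pipeline_cache")  # hypothetical path

for _ in dataset:
    pass  # the first full iteration materializes the cache file

# Later runs that construct the same pipeline and cache path read from the file.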

@areeh

areeh commented Jun 28, 2019

Functionality similar to the rest of tf.data was exactly what I was hoping for. Thank you!

@harahu
Contributor

harahu commented Jun 28, 2019

@jsimsa It could, in a way, but to be able to read from the cached file you would need a dataset instance configured exactly like the one that was used to write the cache. This seems fairly restrictive. My thinking was more that you, or rather @rohan100jain and/or @frankchn, could maybe gain some inspiration (in a very broad sense) from the cache implementation. I don't know the details of it, but it seems it uses protocol buffers at least.

@frankchn
Contributor

frankchn commented Jun 28, 2019

The current cache implementation uses TensorBundles (not TFRecords), which are not great (performance-wise) for reading and writing data and also don't support things like indexing into a specific record. We are still thinking through a better file format internally and will provide updates when we think we have a better solution.

@feihugis
Member

feihugis commented Jul 1, 2019

@frankchn Will HDF5 be a potential format? It supports various data types, indexing, chunking, compression, and efficient I/O.
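
For illustration, a minimal sketch of the kind of access pattern HDF5 offers (using h5py as one common Python binding; the file name, dataset name, and chunk size are assumptions):

import numpy as np
import h5py

# Write a 10,000 x 4,004 float matrix as a chunked, compressed HDF5 dataset.
data = np.random.rand(10000, 4004).astype("float32")
with h5py.File("example.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=(100, 4004), compression="gzip")

# Individual rows can be read back by index without loading the whole file.
with h5py.File("example.h5", "r") as f:
    row = f["data"][42]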

@austinzh

@frankchn Any update on the save and load functionality?
We are facing the same issue when trying to convert npz files into TFRecord; writing records one by one in Python is really slow.
