Converting numpy array to TFRecord is slow #16933

Closed
karthikeyann opened this issue Feb 11, 2018 · 24 comments
@karthikeyann

karthikeyann commented Feb 11, 2018

tf.train.FloatList and tf.train.Feature are slow for numpy arrays.

Saving numpy arrays with np.save and np.load is much faster than converting to TFRecord and reading it back.
While profiling the code, I found that half of the time is spent in _floats_feature; tf.train.FloatList alone takes about a third of the total time.
How can this be sped up?
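
For reference, a minimal sketch of the np.save / np.load baseline being compared against (the author's exact baseline script isn't shown, so the file name and array shape here are assumptions):

import numpy as np

# Hypothetical baseline: persist the same 10,000 x 4,005 integer matrix
# with np.save / np.load instead of TFRecord.
data = np.random.randint(0, 1000, size=(10000, 4005))
np.save("deleteme.npy", data)
restored = np.load("deleteme.npy")
assert (restored == data).all()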

System information

  • The snippet of code below converting a numpy array is much slower than np.save / np.load
  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow version (use command below): 1.4.0
  • Python version: 2.7.12

Source code / logs

import tensorflow as tf
import numpy as np


# Wrap a sequence of floats in a FloatList proto.
def floatme(value):
    return tf.train.FloatList(value=value)

# Wrap the FloatList in a Feature proto.
def _floats_feature(value):
    return tf.train.Feature(float_list=floatme(value))

tfr_filename = "deleteme.tfr"
# 10,000 lines of 4,005 space-separated random integers each
data = [" ".join(np.random.randint(0, 1000, size=4005).astype(str)) for i in range(10000)]
with tf.python_io.TFRecordWriter(tfr_filename) as writer:
    print('Converting to vectors')
    vectors = [np.fromstring(line, dtype=int, sep=' ', count=4004+1) for line in data]
    print('Converting to examples')
    for i, vec in enumerate(vectors):
        # Create an example protocol buffer
        example = tf.train.Example(features=tf.train.Features(feature={
            'label': _floats_feature([vec[4004], vec[4004]<1.0]),
            'data' : _floats_feature(vec[:4004]),
            }))
        writer.write(example.SerializeToString())


ncalls tottime percall cumtime percall filename:lineno(function)
232810 49.887 0 49.887 0 convert_train_dataset_tfrecord.py:76(floatme)
116405 20.095 0 20.095 0 {numpy.core.multiarray.fromstring}
232810 13.328 0 63.216 0 convert_train_dataset_tfrecord.py:79(_floats_feature)
tensorflowbutler added the stat:awaiting response (Status - Awaiting response from author) label on Feb 12, 2018
@tensorflowbutler
Member

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
TensorFlow installed from
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

@tensorflowbutler
Member

Nagging Awaiting Response: It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

1 similar comment

@mbrio

mbrio commented Mar 27, 2018

I'd like to add to this: it seems as though the instantiation of a tf.train.Features object takes a tremendous amount of time. A very simple example of timings on my machine:

import time
import tensorflow as tf

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example()

time.perf_counter() - start

The instantiation of 2,000,000 examples with no features takes 0.76 seconds.

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example()
    feature_1 = tf.train.Int64List(value=[10])
    feature_2 = tf.train.Int64List(value=[10])
        
time.perf_counter() - start

The instantiation of 2,000,000 examples and two tf.train.Int64List objects takes 5 seconds.

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example()
    feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[10]))
    label = tf.train.Feature(int64_list=tf.train.Int64List(value=[10]))
        
time.perf_counter() - start

The instantiation of 2,000,000 examples and two tf.train.Feature objects wrapping Int64Lists takes 11 seconds.

start = time.perf_counter()

for i in range(2000000):
    example = tf.train.Example(features = tf.train.Features(
            feature={
                'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
            }
        )
    )
        
time.perf_counter() - start

The instantiation of 2,000,000 examples, each with a Features map containing two Int64List features, takes 41 seconds.

And finally, when I put it all together with a TFRecordWriter:

start = time.perf_counter()

with tf.python_io.TFRecordWriter('/mnt/data/repository/test.tfrecord') as writer:
    for i in range(2000000):
        example = tf.train.Example(features = tf.train.Features(
                feature={
                    'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                    'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                }
            )
        )

        serialized = example.SerializeToString()
        writer.write(serialized)
        
time.perf_counter() - start

The final timing is 64 seconds for 2,000,000 records.

Unfortunately, when you have a dataset with 1 billion rows, this means roughly 8.9 hours (64 s × 500 ≈ 32,000 s) just to convert it to TFRecord files.

@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

@tensorflowbutler
Member

It has been 29 days with no activity and the awaiting response label was assigned. Is this still an issue?

@tensorflowbutler
Member

It has been 50 days with no activity and the awaiting response label was assigned. Is this still an issue?

@tensorflowbutler
Member

We are closing this issue for now due to lack of activity. Please comment if this is still an issue for you. Thanks!

@bstadlbauer

bstadlbauer commented Jan 25, 2019

I can confirm @mbrio's timing results. I wonder if there is any way to speed this up besides performing the computation in parallel?

Setup:
OS: Ubuntu 18.04
TensorFlow-GPU running on CPU (i7-7700HQ)
TensorFlow version: 1.12.0 (installed via pip)

@yongtang
Member

@lxkarthi @mbrio @bstadlbauer

The SIG I/O under TensorFlow:

https://github.com/tensorflow/community/blob/master/sigs/io/CHARTER.md

has a focus on data and streaming processing for TensorFlow. We are working on improving data input/output (likely with the tf.data API).

The SIG I/O Google group is:
https://groups.google.com/a/tensorflow.org/forum/#!forum/io

You could consider joining the Google group to explain your use case so that we can find a way to resolve the performance issue you are encountering, or create an issue in https://github.com/tensorflow/io.

@harahu
Contributor

harahu commented Jun 26, 2019

@shivaniag We've never heard back from the TensorFlow team on this issue. Could you give us an update on whether this is recognized as a problem or not?

shivaniag assigned jsimsa and unassigned shivaniag on Jun 26, 2019
@shivaniag
Contributor

@jsimsa, do you have an update on this?

@harahu
Contributor

harahu commented Jun 27, 2019

Just want to say that my team suffers from the same issue @mbrio mentions. Using tf.train.Feature and tf.train.Example to generate serialized examples is convenient, but prohibitively slow.

A solution could be to implement native TensorFlow ops for this that don't have to be wrapped in a tf.py_function?

@yongtang
Member

@harahu Are you looking for reading numpy files (on disk) into tf.data, or converting numpy array (in memory managed by python process) into tf.data (in memory and managed by tf process)?

The former might be easier to implement than the latter. The underlying TensorFlow (including tf.data) is implemented in C++, and interacting with Python's memory is not very straightforward (hence the not-so-convenient tf.py_function).

@harahu
Contributor

harahu commented Jun 27, 2019

@yongtang I am looking to serialize data in order to write to TFRecord files. The origin of my data is numpy arrays, but the solution doesn't need to be a direct mapping from numpy arrays to serialized strings, as I don't mind turning said arrays into tensors before serialization. We have tf.serialize_tensor already, which is efficient, but it is very limited in what it can serialize (one tensor at a time). A tf.data.Dataset can have quite complex structure. For instance, I have a time series dataset structured something like this:

dataset.output_types
>>> {'feature0': tf.float32, 'feature1': tf.int64, 'label': tf.int64, 'timestamp': tf.int64}

I would love to have a tf op with the efficiency of tf.serialize_tensor, but the usability and flexibility of tf.train.Feature and tf.train.Example. Some operation where I could map the dictionary structure above directly to a single tf.string, ready to write to file.
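
For context, a minimal sketch of what the existing single-tensor op can do (using the tf.io.serialize_tensor / tf.io.parse_tensor names from newer TF releases; the tensor values are made up for illustration):

import tensorflow as tf

# Fast, but handles exactly one tensor per call -- it cannot directly
# serialize a dict-structured dataset element like the one above.
t = tf.constant([[1.0, 2.0], [3.0, 4.0]])
serialized = tf.io.serialize_tensor(t)  # scalar tf.string
restored = tf.io.parse_tensor(serialized, out_type=tf.float32)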

@jsimsa
Contributor

jsimsa commented Jun 27, 2019

@rohan100jain and @frankchn are working on a mechanism for persisting the outputs of (a prefix of) an input pipeline which needs to solve the same problem (efficiently serializing elements of a tf.data.Dataset).

I believe that their solution could be extended to provide "save" and "load" functionality, but I also expect that it might take some time to settle on a format for which backwards compatibility is provided (i.e. it might initially only be possible to "load" data which was "saved" using the same version of TensorFlow).

@harahu
Contributor

harahu commented Jun 28, 2019

@jsimsa I also assume that some serialization happens when you cache a dataset to file:

dataset = dataset.cache(filename='somecachefile')

But I don't know if or how this might be usable for solving the issues mentioned here.

@areeh

areeh commented Jun 28, 2019

@jsimsa This sounds very promising, but I just want to mention that one use case for efficient serialization of tf.data is handling a dataset that is too large to fit in memory. If there is only a simple "save" and "load" command that attempts to load everything into memory, that use case is not covered. For me, the ideal solution would be something that can save/load batch by batch (or similar) so that I can avoid memory issues. Not implying you didn't already think of this; just trying to show interest in the specific part of the feature I would find most helpful.

@jsimsa
Contributor

jsimsa commented Jun 28, 2019

@harahu The cache(filename=...) transformation could indeed be used as a stopgap solution for serializing and deserializing data.

@areeh The prospective "save" and "load" functionality would work similarly to the rest of the tf.data transformations (in that it would support streaming of elements). In other words, it would not require that all of the data fit into memory.
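
As a rough sketch of that stopgap (assuming eager execution / TF 2.x; the cache path is an assumption), the cache file is written on the first full pass over the dataset and reused afterwards, provided the pipeline is defined identically when reading:

import tensorflow as tf

# Build the expensive part of the pipeline, then cache it to disk.
dataset = tf.data.Dataset.range(10)
dataset = dataset.map(lambda x: x * 2)  # stand-in for expensive preprocessing
dataset = dataset.cache(filename="/tmp/pipeline_cache")  # hypothetical path

for _ in dataset:
    pass  # the first full iteration materializes the cache file

# Later runs that construct the same pipeline and cache path read from the file.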

@areeh

areeh commented Jun 28, 2019

Functionality similar to the rest of tf.data was exactly what I was hoping for. Thank you!

@harahu
Contributor

harahu commented Jun 28, 2019

@jsimsa It could, in a way, but to be able to read from the cached file you would need a dataset instance configured exactly like the one that was used to write the cache. This seems fairly restrictive. My thinking was more that you, or rather @rohan100jain and/or @frankchn, could maybe gain some inspiration (in a very broad sense) from the cache implementation. I don't know the details of it, but it seems it uses protocol buffers at least.

@frankchn
Contributor

frankchn commented Jun 28, 2019

The current cache implementation uses TensorBundles (not TFRecords), which are not great (performance-wise) for reading and writing data and also don't support things like indexing into a specific record. We are still thinking through a better file format internally and will provide updates when we think we have a better solution.

@feihugis
Member

feihugis commented Jul 1, 2019

@frankchn Will HDF5 be a potential format? It supports various data types, indexing, chunking, compression, and efficient I/O.
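
For illustration, a minimal sketch of the kind of access pattern HDF5 offers (using h5py as one common Python binding; the file name, dataset name, and chunk size are assumptions):

import numpy as np
import h5py

# Write a 10,000 x 4,004 float matrix as a chunked, compressed HDF5 dataset.
data = np.random.rand(10000, 4004).astype("float32")
with h5py.File("example.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=(100, 4004), compression="gzip")

# Individual rows can be read back by index without loading the whole file.
with h5py.File("example.h5", "r") as f:
    row = f["data"][42]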

@austinzh

@frankchn Any update on the save and load functionality?
We are facing the same issue when trying to convert npz files into TFRecord; writing records one by one in Python is really slow.
