Converting numpy array to TFRecord is slow #16933
I'd like to add to this: it seems as though the instantiation of the proto objects themselves is what's slow.

```python
import time

import tensorflow as tf

start = time.perf_counter()
for i in range(2000000):
    example = tf.train.Example()
time.perf_counter() - start
```

The instantiation of 2,000,000 examples with no features takes .76 seconds. Adding two `Int64List`s per example:

```python
start = time.perf_counter()
for i in range(2000000):
    example = tf.train.Example()
    feature_1 = tf.train.Int64List(value=[10])
    feature_2 = tf.train.Int64List(value=[10])
time.perf_counter() - start
```

Wrapping those lists in two `Feature` objects per example:

```python
start = time.perf_counter()
for i in range(2000000):
    example = tf.train.Example()
    feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[10]))
    label = tf.train.Feature(int64_list=tf.train.Int64List(value=[10]))
time.perf_counter() - start
```

Building each example with the two features in a `Features` map:

```python
start = time.perf_counter()
for i in range(2000000):
    example = tf.train.Example(features=tf.train.Features(
        feature={
            'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
            'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
        }
    ))
time.perf_counter() - start
```

And finally, when I put it all together with a `TFRecordWriter`:

```python
start = time.perf_counter()
with tf.python_io.TFRecordWriter('/mnt/data/repository/test.tfrecord') as writer:
    for i in range(2000000):
        example = tf.train.Example(features=tf.train.Features(
            feature={
                'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
                'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[10])),
            }
        ))
        serialized = example.SerializeToString()
        writer.write(serialized)
time.perf_counter() - start
```

The final timing is 64 seconds for 2,000,000 records. Unfortunately, when you have a dataset with 1 billion rows, that means it takes about 8.8 hours just to convert it to TFRecord files.
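Not part of the original comment, but a rough sketch of the usual mitigation: since the proto construction is CPU-bound Python work, shard the rows and let several worker processes each write their own TFRecord file. The paths, shard count, and data below are made up for illustration.

```python
import time
from multiprocessing import Pool

import numpy as np
import tensorflow as tf


def write_shard(args):
    """Write one shard of rows to its own TFRecord file."""
    shard_id, rows = args
    path = '/tmp/test-%05d.tfrecord' % shard_id  # hypothetical output location
    with tf.python_io.TFRecordWriter(path) as writer:
        for src, dst in rows:
            example = tf.train.Example(features=tf.train.Features(feature={
                'src': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(src)])),
                'dst': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(dst)])),
            }))
            writer.write(example.SerializeToString())
    return path


if __name__ == '__main__':
    data = np.random.randint(0, 100, size=(200000, 2))
    shards = list(enumerate(np.array_split(data, 8)))  # 8 shards, 8 workers
    start = time.perf_counter()
    with Pool(8) as pool:
        pool.map(write_shard, shards)
    print(time.perf_counter() - start)
```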
We are closing this issue for now due to lack of activity. Please comment if this is still an issue for you. Thanks!
I can confirm the timing results of mbrio. I wonder if there is any way to speed this up besides performing the computation in parallel?
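One workaround that avoids building a proto object per value (a sketch under that assumption, not something proposed in this thread): pack many rows into a single `BytesList` feature via `ndarray.tobytes()`, so each record needs only a handful of proto objects instead of one per element.

```python
import numpy as np
import tensorflow as tf


def chunk_to_example(chunk):
    """Pack a whole chunk of rows into one Example as raw bytes."""
    return tf.train.Example(features=tf.train.Features(feature={
        'rows': tf.train.Feature(bytes_list=tf.train.BytesList(value=[chunk.tobytes()])),
        'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=list(chunk.shape))),
    }))


chunk = np.arange(20, dtype=np.int64).reshape(10, 2)  # hypothetical chunk of rows
serialized = chunk_to_example(chunk).SerializeToString()
# At read time, tf.io.decode_raw(..., tf.int64) plus a reshape recovers the chunk.
```

The trade-off is that the record is no longer self-describing per value, so readers need to know the dtype and shape.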
The SIG I/O under TensorFlow (https://github.com/tensorflow/community/blob/master/sigs/io/CHARTER.md) has a focus on data and streaming processing for TensorFlow. We are working on improving the data input/output (likely with the tf.data API). You could consider joining the SIG I/O Google group to explain your use case so that we can find a way to resolve the performance issue you are encountering, or create an issue in https://github.com/tensorflow/io
@shivaniag We've never heard back from the TensorFlow team on this issue. Could you give us an update on whether this is recognized as a problem or not?
@jsimsa do you have an update on this?
Just want to say that my team suffers from the same issue @mbrio mentions. A solution could be to implement native TensorFlow ops for this that don't have to be wrapped in a Python function.
@harahu Are you looking to read numpy files (on disk) into tf.data, or to convert a numpy array (in memory, managed by the Python process) into tf.data (in memory, managed by the TF process)? The former might be easier to implement than the latter. The underlying TensorFlow (including tf.data) is implemented in C++, and interacting with Python's memory is not very straightforward, hence the not-so-convenient workarounds.
@yongtang I am looking to serialize data in order to write to TFRecord files. The origin of my data is numpy arrays, but the solution doesn't need to be a direct mapping from numpy arrays to serialized strings, as I don't mind turning said arrays into tensors before serialization. We have a dataset along these lines:

```python
dataset.output_types
>>> {'feature0': tf.float32, 'feature1': tf.int64, 'lable': tf.int64, 'timestamp': tf.int64}
```

I would love to have a tf op that could serialize elements like these efficiently.
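For what it's worth, one direction along these lines is to serialize whole tensors in the graph instead of building `Example` protos in Python. A minimal sketch, assuming a TF release with eager execution that ships `tf.io.serialize_tensor` and `tf.data.experimental.TFRecordWriter` (the path and data are made up):

```python
import numpy as np
import tensorflow as tf

data = np.random.rand(1000, 4).astype(np.float32)
ds = tf.data.Dataset.from_tensor_slices(data)
ds = ds.map(tf.io.serialize_tensor)  # serialization happens in C++, not in Python protos
tf.data.experimental.TFRecordWriter('/tmp/tensors.tfrecord').write(ds)
# Read back with tf.data.TFRecordDataset plus tf.io.parse_tensor(..., tf.float32).
```

This covers single tensors per element; a dict of features like the one above would still need some per-feature handling.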
@rohan100jain and @frankchn are working on a mechanism for persisting the outputs of (a prefix of) an input pipeline, which needs to solve the same problem (efficiently serializing elements of a tf.data.Dataset). I believe that their solution could be extended to provide "save" and "load" functionality, but I also expect that it might take some time to settle on a format for which backwards compatibility is provided (i.e. it might initially only be possible to "load" data which was "saved" using the same version of TensorFlow).
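For concreteness, a rough sketch of what such a "save"/"load" pair might look like, assuming an API along the lines of `tf.data.experimental.save`/`load` as shipped in later TensorFlow 2.x releases (not something described in this comment; the path is made up):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(100).map(lambda x: x * 2)
tf.data.experimental.save(ds, '/tmp/saved_ds')                      # persist the pipeline output
restored = tf.data.experimental.load('/tmp/saved_ds',
                                     element_spec=ds.element_spec)   # load it back for streaming
```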
@jsimsa I also assume that some serialization happens when you cache a dataset to a file: `dataset = dataset.cache(filename='somecachefile')`. But I don't know if or how this might be usable in solving the issues mentioned here.
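For reference, a minimal sketch of the file-based cache being referred to, assuming TF 2.x eager execution so the dataset can be iterated directly (the filename is hypothetical):

```python
import tensorflow as tf

# The first full pass serializes elements to files with this prefix; later
# passes read them back instead of recomputing the upstream pipeline.
ds = tf.data.Dataset.range(1000).map(lambda x: x * 2)
ds = ds.cache('somecachefile')
for _ in ds:   # first epoch populates the on-disk cache
    pass
for _ in ds:   # later epochs read from the cache
    pass
```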
@jsimsa This sounds very promising, but I just want to mention that one use case for efficient serialization of tf.data is handling a dataset that is too large to fit in memory. If there is only a simple "save" and "load" command that attempts to load everything into memory, that use is not covered. For me, the ideal solution would be something that can save/load batch by batch or similar, so that I can avoid memory issues. Not implying you didn't already think of this, just trying to show interest for the specific part of the feature I would find most helpful.
@areeh the prospective "save" and "load" functionality would work similarly to the rest of the tf.data transformations (in that it would support streaming of elements). In other words, it would not require that all of the data fit into memory.
Functionality similar to the rest of tf.data was exactly what I was hoping for, thank you!
@jsimsa It could, in a way, but to be able to read from the cached file you would need a dataset instance configured exactly like the one that was used to write the cache. This seems fairly restrictive. My thinking was more that you, or rather @rohan100jain and/or @frankchn, could maybe gain some inspiration (in a very broad sense) from the cache implementation. I don't know the details of it, but it seems it uses protocol buffers at least.
@frankchn any update on the save and load?
`FloatList` and `Feature` are slow for numpy arrays.
Saving numpy arrays with `np.save` and `np.load` is much faster than converting to TFRecord and reading it back.
While profiling the code, I found that half of the time is spent in `_floats_feature`, and `tf.train.FloatList` alone takes about a third of the time.
How can this be sped up?
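For context, `_floats_feature` in reports like this usually follows the pattern below; this is a sketch under that assumption, since the profiled snippet itself is not shown here.

```python
import numpy as np
import tensorflow as tf


def _floats_feature(value):
    # Presumed shape of the helper being profiled, following the common
    # TFRecord example pattern of building a FloatList per feature.
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))


arr = np.random.rand(1000000).astype(np.float32)
example = tf.train.Example(features=tf.train.Features(
    feature={'data': _floats_feature(arr)}))
serialized = example.SerializeToString()  # this proto construction is the hot path
```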