
Support NumPy for tensorflow-io #68

Closed
yongtang opened this issue Jan 27, 2019 · 8 comments

Comments

@yongtang
Member

In TensorFlow's guide of "Importing Data":
https://www.tensorflow.org/guide/datasets

It is possible to read input data directly from TFRecord (TFRecordDataset), text (TextLineDataset), and CSV (CsvDataset) files, but not from NumPy. Reading input from NumPy still requires a rather inelegant approach, as shown in the example code from the TensorFlow guide:

with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

It should be possible to implement NumPy support so that reading input from NumPy can be done in the same fashion as the other input formats. This could also improve performance, since it may not be necessary to read everything into memory immediately (remotely related: tensorflow/tensorflow#16933).
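For context, plain NumPy can already defer reading via memory mapping; a minimal sketch of the idea such a dataset op could build on (the file path here is a temporary stand-in, not a real data file):

```python
import os
import tempfile
import numpy as np

# Create a small example file standing in for a real .npy data file.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "features.npy")
np.save(path, np.arange(12, dtype=np.float32).reshape(4, 3))

# mmap_mode="r" maps the file instead of loading it all into memory,
# so rows can be sliced lazily -- only the accessed pages are read.
features = np.load(path, mmap_mode="r")
first_batch = np.asarray(features[:2])  # materialize just these rows
print(first_batch.shape)
```

Note that `mmap_mode` works for `.npy` files only; `.npz` archives are zip files and need to be decompressed member by member.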

@BryanCutler
Member

@yongtang Arrow supports reading NumPy arrays into record batches, so I don't think it would take much effort to add this, but there would be a limit on dimensionality, for now at least. Hopefully that will change in the future.

@yongtang
Member Author

yongtang commented Feb 6, 2019

@BryanCutler I think there are two issues: one is the conversion between Apache Arrow and NumPy in memory; the other is reading data from the npy or npz file format. I haven't found a way to open npy or npz files with the Arrow library. However, the npy and npz file formats are fairly straightforward, so a simple parser should be enough. I will look into it and see if I can come up with something quick.
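For reference, the npy header really is simple to parse by hand. A minimal sketch following the layout documented in `numpy.lib.format` (handles v1.x/v2.x headers only, and only the header, not the data payload):

```python
import ast
import io
import struct
import numpy as np

def read_npy_header(f):
    """Parse the header of a .npy stream; return (dtype, fortran_order, shape)."""
    magic = f.read(6)
    assert magic == b"\x93NUMPY", "not an npy file"
    major, minor = f.read(1)[0], f.read(1)[0]
    if major == 1:
        (hlen,) = struct.unpack("<H", f.read(2))  # v1.0: 2-byte header length
    else:
        (hlen,) = struct.unpack("<I", f.read(4))  # v2.0+: 4-byte header length
    # The header is a Python dict literal, padded with spaces.
    header = ast.literal_eval(f.read(hlen).decode("latin1"))
    return np.dtype(header["descr"]), header["fortran_order"], header["shape"]

# Round-trip check against a buffer written by np.save itself.
buf = io.BytesIO()
np.save(buf, np.zeros((3, 5), dtype=np.int16))
buf.seek(0)
dtype, fortran, shape = read_npy_header(buf)
print(dtype, fortran, shape)
```

An npz file is just a zip archive of such npy members, so the same header parser applies per entry.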

@BryanCutler
Member

Actually, the conversion from NumPy to Arrow is zero-copy, so it wouldn't consume any more memory. But you're right, Arrow doesn't support reading these files. If you're able to read them directly in the op, that would be cool!

@areeh

areeh commented Jun 28, 2019

Performant solutions using tf.data and disk are quite painful right now, so this would be a welcome addition. I just want to mention that it would be particularly useful to support the case where the data looks like:

{'feature1': np.array([...], dtype=np.float32), 'feature2': np.array([...], dtype=np.int16)}

etc. This is somewhat similar to the example from the guide in its split between features and labels, if I'm understanding correctly.
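To make that shape concrete, here is a NumPy-only sketch (feature names and values are made up) of slicing such a dict row-wise, the way tf.data.Dataset.from_tensor_slices treats a dict of arrays:

```python
import numpy as np

data = {
    "feature1": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "feature2": np.array([7, 8, 9], dtype=np.int16),
}

def slice_dict(d):
    """Yield one {name: scalar} dict per row, mirroring from_tensor_slices on a dict."""
    n = len(next(iter(d.values())))
    # All features must share the first dimension, as in the guide's assert.
    assert all(len(v) == n for v in d.values())
    for i in range(n):
        yield {k: v[i] for k, v in d.items()}

rows = list(slice_dict(data))
print(rows[0])
```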

@yongtang
Member Author

yongtang commented Aug 3, 2019

@areeh PR #407 has been opened, which should support your case. The PR allows reading a local NumPy array through the same address space.

It may have limitations, but if your process is local then performance should be good for large NumPy arrays (since there is no serialization overhead beforehand).

Support for dicts/tuples of features has been added as well.

@areeh

areeh commented Aug 13, 2019

@yongtang The PR looks great, it supports everything I had in mind when I wrote the comment. Thank you

@kvignesh1420
Member

@yongtang can this be closed?

@yongtang
Member Author

yongtang commented Nov 8, 2020

@kvignesh1420 Ah yes thanks for the reminder 👍

@yongtang yongtang closed this as completed Nov 8, 2020