
Support NumPy for tensorflow-io #68

Closed
yongtang opened this issue Jan 27, 2019 · 8 comments

Comments

@yongtang
Member

In TensorFlow's guide of "Importing Data":
https://www.tensorflow.org/guide/datasets

It is possible to read input data directly from TFRecord (TFRecordDataset), text (TextLineDataset), and CSV (CsvDataset) files, but not from NumPy. Reading input from NumPy still requires a rather inelegant approach, as shown in the example code from the TensorFlow guide:

with np.load("/var/data/training_data.npy") as data:
  features = data["features"]
  labels = data["labels"]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

It should be possible to implement NumPy support so that reading input from NumPy can be done in the same fashion as the other input formats. This could also improve performance, since it may not be necessary to read everything into memory immediately (remotely related: tensorflow/tensorflow#16933).
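For context, plain NumPy can already defer reading via memory mapping; a minimal sketch of the idea such a dataset op could build on (the file path here is a temporary stand-in, not a real data file):

```python
import os
import tempfile
import numpy as np

# Create a small example file standing in for a real .npy data file.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "features.npy")
np.save(path, np.arange(12, dtype=np.float32).reshape(4, 3))

# mmap_mode="r" maps the file instead of loading it all into memory,
# so rows can be sliced lazily -- only the accessed pages are read.
features = np.load(path, mmap_mode="r")
first_batch = np.asarray(features[:2])  # materialize just these rows
print(first_batch.shape)
```

Note that `mmap_mode` works for `.npy` files only; `.npz` archives are zip files and need to be decompressed member by member.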

@BryanCutler
Member

@yongtang Arrow supports reading NumPy arrays into record batches, so I don't think it would take much effort to add this, but there would be a limit on dimensionality, for now at least. Hopefully that will change in the future.

@yongtang
Member Author

yongtang commented Feb 6, 2019

@BryanCutler I think there are two issues: one is the conversion between Apache Arrow and NumPy in memory; the other is reading data from the npy or npz file format. I haven't found a way to open npy or npz files with the Arrow library. However, the npy and npz file formats are fairly straightforward, so a simple parser should be enough. I will look into it and see if I can come up with something quick.
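For reference, the npy header really is simple to parse by hand. A minimal sketch following the layout documented in `numpy.lib.format` (handles v1.x/v2.x headers only, and only the header, not the data payload):

```python
import ast
import io
import struct
import numpy as np

def read_npy_header(f):
    """Parse the header of a .npy stream; return (dtype, fortran_order, shape)."""
    magic = f.read(6)
    assert magic == b"\x93NUMPY", "not an npy file"
    major, minor = f.read(1)[0], f.read(1)[0]
    if major == 1:
        (hlen,) = struct.unpack("<H", f.read(2))  # v1.0: 2-byte header length
    else:
        (hlen,) = struct.unpack("<I", f.read(4))  # v2.0+: 4-byte header length
    # The header is a Python dict literal, padded with spaces.
    header = ast.literal_eval(f.read(hlen).decode("latin1"))
    return np.dtype(header["descr"]), header["fortran_order"], header["shape"]

# Round-trip check against a buffer written by np.save itself.
buf = io.BytesIO()
np.save(buf, np.zeros((3, 5), dtype=np.int16))
buf.seek(0)
dtype, fortran, shape = read_npy_header(buf)
print(dtype, fortran, shape)
```

An npz file is just a zip archive of such npy members, so the same header parser applies per entry.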

@BryanCutler
Member

Actually, the conversion from NumPy to Arrow is zero-copy, so it wouldn't consume any more memory. But you're right, Arrow doesn't support reading these files. If you're able to read them directly in the op, that would be cool!

@areeh

areeh commented Jun 28, 2019

Performant solutions using tf.data and disk are quite painful right now, so this would be a welcome addition. I just want to mention that it would be particularly useful to support the case where the data looks like:

{'feature1': np.array([...], dtype=np.float32), 'feature2': np.array([...], dtype=np.int16)}

etc. This is somewhat similar to the example from the guide in its split between features and labels, if I'm understanding correctly.
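To make that shape concrete, here is a NumPy-only sketch (feature names and values are made up) of slicing such a dict row-wise, the way tf.data.Dataset.from_tensor_slices treats a dict of arrays:

```python
import numpy as np

data = {
    "feature1": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "feature2": np.array([7, 8, 9], dtype=np.int16),
}

def slice_dict(d):
    """Yield one {name: scalar} dict per row, mirroring from_tensor_slices on a dict."""
    n = len(next(iter(d.values())))
    # All features must share the first dimension, as in the guide's assert.
    assert all(len(v) == n for v in d.values())
    for i in range(n):
        yield {k: v[i] for k, v in d.items()}

rows = list(slice_dict(data))
print(rows[0])
```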

@yongtang
Member Author

yongtang commented Aug 3, 2019

@areeh PR #407 has been opened, which should support your case. The PR allows reading a local NumPy array through the same address space.

It may have limitations, but if your process is local then performance should be good for large NumPy arrays (since there is no serialization overhead beforehand).

Support for dicts/tuples of features has been added as well.

@areeh

areeh commented Aug 13, 2019

@yongtang The PR looks great, it supports everything I had in mind when I wrote the comment. Thank you

@kvignesh1420
Member

@yongtang can this be closed?

@yongtang
Member Author

yongtang commented Nov 8, 2020

@kvignesh1420 Ah yes thanks for the reminder 👍

@yongtang yongtang closed this as completed Nov 8, 2020