@BryanCutler BryanCutler commented Apr 19, 2019

This PR changes the input of ArrowDataset to a string Tensor that contains a serialized Arrow buffer. A placeholder Tensor can then be used to create the Dataset, and the iterator can be initialized with the serialized data when running the session. This keeps the data from being stored in the graph as a constant, which would lead to unnecessary copies.

Fixes #204
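
For readers unfamiliar with the pattern, here is a minimal sketch of feeding serialized bytes through a string placeholder into an initializable iterator. It deliberately uses plain `tf.data` ops and raw `uint8` bytes as a stand-in for an Arrow buffer; it is not the ArrowDataset API from this PR, and the helper name `read_fed_bytes` is hypothetical. It assumes a TensorFlow version that provides `tf.compat.v1`.

```python
import tensorflow as tf

tf1 = tf.compat.v1


def read_fed_bytes(serialized):
    """Build a graph around a string placeholder, feed bytes at init, read them back."""
    graph = tf.Graph()
    with graph.as_default():
        # Scalar string placeholder: the data is NOT embedded in the graph.
        buf = tf1.placeholder(tf.string, shape=[])
        # Stand-in for deserialization: decode the bytes into uint8 elements.
        dataset = tf.data.Dataset.from_tensors(buf).flat_map(
            lambda s: tf.data.Dataset.from_tensor_slices(
                tf.io.decode_raw(s, tf.uint8)))
        iterator = tf1.data.make_initializable_iterator(dataset)
        next_element = iterator.get_next()

    values = []
    with tf1.Session(graph=graph) as sess:
        # The serialized data is supplied only when the iterator is initialized.
        sess.run(iterator.initializer, feed_dict={buf: serialized})
        while True:
            try:
                values.append(int(sess.run(next_element)))
            except tf.errors.OutOfRangeError:
                break
    return values


result = read_fed_bytes(bytes([1, 2, 3, 4, 5]))
```

Because the graph only holds the placeholder, the same graph can be re-initialized with a different buffer without being rebuilt.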

BryanCutler commented Apr 19, 2019

TODO: make the test conditional on TF 2.0 because of the placeholder.
Done.

@BryanCutler BryanCutler force-pushed the arrow-dataset-enable-feed branch from 862d966 to 9947518 Compare April 23, 2019 21:04
@BryanCutler

Ping @yongtang and @terrytangyuan: could you please take a look when you get the chance? I ran some tests and these changes still help quite a bit. Thanks!

@yongtang yongtang left a comment

LGTM. That is a nice enhancement!

@yongtang

Overall, the Dataset itself is a pull model and is capable of pulling a chunk (or part) of the data at a time. But when the whole data is already in memory, it is hard to feed it into the Dataset piecemeal, because the Dataset requires converting the whole data into a tensor. We may want to address this issue, as was pointed out in:
https://stackoverflow.com/questions/48956404/tensorflow-dataset-tf-estimator-inputs-numpy-input-fn

I am playing with the idea of exposing the data in memory (in Python or other languages) as a "server" to allow the Dataset to "pull" from the "server". I will see if I can come up with a PoC.
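
To illustrate what "pulling a chunk at a time" from in-memory data could look like, here is a small pure-Python sketch. The function name `iter_chunks` and the chunk size are hypothetical, and this is not the design of the PoC mentioned above; it only shows a consumer-driven source that materializes one chunk per request.

```python
from typing import Iterator, List, Sequence


def iter_chunks(data: Sequence[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield successive chunks of `data`; each chunk is built only when pulled."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        yield list(data[start:start + chunk_size])


rows = list(range(7))
source = iter_chunks(rows, chunk_size=3)
first = next(source)      # the consumer pulls one chunk
remaining = list(source)  # and later pulls the rest
```

A generator like this can be wrapped with `tf.data.Dataset.from_generator`, which is one existing way to let a Dataset pull from Python-side data without first converting everything into a single tensor.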

@yongtang yongtang merged commit 2091537 into tensorflow:master Apr 24, 2019
@BryanCutler BryanCutler deleted the arrow-dataset-enable-feed branch April 25, 2019 04:12
@BryanCutler

Thanks @yongtang !

> I am playing with the idea of exposing the data in memory (in Python or other languages) as a "server" to allow the Dataset to "pull" from the "server". I will see if I can come up with a PoC.

Sounds good, I've been thinking about something very similar. I think it could be very useful.

@yongtang

@BryanCutler Created an initial PR #206.

i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
…le-feed

Allow ArrowDataset to create an initializable iterator for input data

Development

Successfully merging this pull request may close these issues.

Allow ArrowDataset to use an initializable iterator
