Add Support for Apache Arrow in TensorFlow Dataset #23001

Closed
BryanCutler opened this issue Oct 15, 2018 · 4 comments
Assignees
Labels
  • comp:ops (OPs related issues)
  • stat:awaiting response (Status - Awaiting response from author)
  • type:feature (Feature requests)

Comments

@BryanCutler
Member

BryanCutler commented Oct 15, 2018

Apache Arrow is a standard format for in-memory columnar data. It provides a cross-language platform for systems to communicate and operate on data efficiently.

Adding Arrow support in TensorFlow Dataset will allow systems to interface with TensorFlow in a well defined way, without the need to develop custom converters, serialize data, or write to specialized files.

It would be straightforward to add a base layer of Arrow support that works on Arrow record batches (a common struct for Arrow IPC) and extend that layer to support different kinds of Arrow Ops:

  • Python memory / Pandas DataFrames
  • Arrow Feather files
  • Parquet files
  • Socket / Pipes
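The layering described above can be sketched roughly as follows. This is only an illustration of the idea, written in plain Python so it runs without Arrow or TensorFlow installed: `RecordBatch`, `ArrowBaseDataset`, `InMemoryDataset`, and `StreamDataset` are hypothetical names invented for this sketch, not pyarrow or TensorFlow APIs. The base layer only knows how to walk record batches; each source-specific layer only knows where its batches come from.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass
class RecordBatch:
    """Hypothetical stand-in for an Arrow record batch: named, equal-length columns."""
    columns: Dict[str, List]

    def num_rows(self) -> int:
        # All columns share one length; an empty batch has zero rows.
        return len(next(iter(self.columns.values()), []))


class ArrowBaseDataset:
    """Hypothetical base layer: turns record batches into per-row tuples."""

    def record_batches(self) -> Iterable[RecordBatch]:
        raise NotImplementedError

    def __iter__(self) -> Iterator[Tuple]:
        for batch in self.record_batches():
            names = list(batch.columns)
            for i in range(batch.num_rows()):
                yield tuple(batch.columns[name][i] for name in names)


class InMemoryDataset(ArrowBaseDataset):
    """Source-specific layer: batches already held in Python memory."""

    def __init__(self, batches: List[RecordBatch]):
        self._batches = batches

    def record_batches(self) -> Iterable[RecordBatch]:
        return self._batches


class StreamDataset(ArrowBaseDataset):
    """Source-specific layer: batches arriving lazily, e.g. read off a socket or pipe."""

    def __init__(self, batch_source: Iterable[RecordBatch]):
        self._source = batch_source

    def record_batches(self) -> Iterable[RecordBatch]:
        return self._source


# Usage: two batches with the same schema are flattened into rows.
batches = [
    RecordBatch({"x": [1, 2], "y": [0.5, 1.5]}),
    RecordBatch({"x": [3], "y": [2.5]}),
]
rows = list(InMemoryDataset(batches))
```

Under this split, supporting a new source (a Feather file, a Parquet file) would only mean adding another small subclass that yields record batches; the row-producing logic stays in one place.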

A slightly more involved Op could use Arrow Flight, which is Arrow-based messaging over gRPC. Additionally, it would be possible to define Ops that connect directly to other systems that can export Arrow data.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): master branch
  • Python version: 3.6
  • Bazel version (if compiling from source): 0.17.1
  • GCC/Compiler version (if compiling from source): 5.4.0
  • CUDA/cuDNN version: 9.1
  • GPU model and memory: Quadro M1000M 4G
  • Exact command to reproduce: N/A
@tensorflowbutler
Member

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce
Mobile device

ymodak added the type:feature (Feature requests) and comp:ops (OPs related issues) labels on Oct 19, 2018
@pbk0

pbk0 commented Nov 19, 2018

I hope that RAPIDS will support TensorFlow if Apache Arrow integration is planned. Currently, only PyTorch and Chainer seem to work with Apache Arrow.

@tensorflowbutler
Member

It has been 44 days with no activity, and the awaiting response label was assigned. Is this still an issue?

@BryanCutler
Member Author

This issue has been moved to tensorflow/io#13.

5 participants