TFRecord file support with Hadoop Mapreduce/Spark #5293

Closed
llhe opened this issue Oct 31, 2016 · 7 comments
Labels
type:feature Feature requests

Comments

@llhe
Contributor

llhe commented Oct 31, 2016

MapReduce and Spark are commonly used for ETL and feature generation, so closer integration with these systems would be valuable. Specifically, I'd like to see support for the following:

  1. A MapReduce InputFormat/OutputFormat for TFRecord files
  2. Integration of the Feature/Example proto classes
@drpngx
Contributor

drpngx commented Oct 31, 2016

@jhseu for comments

The native format for Spark is Parquet, for which you can now get a C++ reader/writer from Impala.

The ORC format does have a C++ reader, but it lacks bloom filters and several compression algorithms. It was recently extracted from the Hadoop code as a standalone project.

The legacy MR RCFile format is very, very Java-dependent and I wouldn't want to support that.

@drpngx drpngx added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Oct 31, 2016
@llhe
Contributor Author

llhe commented Oct 31, 2016

@drpngx I mean writing TFRecord files directly from MR and Spark (e.g. to HDFS/GCS, or to a custom file system that TensorFlow can access), which avoids an unnecessary and slow data conversion step in Python code. We have some code that makes this possible and would like to contribute it if that is appropriate.
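
For context, here is a minimal sketch of the on-disk framing that a writer on the MR/Spark side would have to produce. It assumes the TFRecord layout of a little-endian 64-bit length, a masked CRC-32C of that length, the payload bytes, and a masked CRC-32C of the payload. The function names (`crc32c`, `masked_crc`, `write_record`, `read_records`) are hypothetical and not taken from TensorFlow or from the code mentioned above, and the payloads are plain byte strings standing in for serialized Example protos.

```python
import io
import struct

# CRC-32C (Castagnoli) lookup table, built from the reflected polynomial.
_POLY = 0x82F63B78
_TABLE = []
for _i in range(256):
    _c = _i
    for _ in range(8):
        _c = (_c >> 1) ^ _POLY if _c & 1 else _c >> 1
    _TABLE.append(_c)


def crc32c(data: bytes) -> int:
    """Plain table-driven CRC-32C; slow, but dependency-free."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add a constant, modulo 2**32."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_record(out, payload: bytes) -> None:
    """Append one framed record to a binary file-like object."""
    header = struct.pack("<Q", len(payload))
    out.write(header)
    out.write(struct.pack("<I", masked_crc(header)))
    out.write(payload)
    out.write(struct.pack("<I", masked_crc(payload)))


def read_records(src):
    """Yield payloads from a binary file-like object, checking both CRCs."""
    while True:
        header = src.read(8)
        if not header:
            return
        (length_crc,) = struct.unpack("<I", src.read(4))
        if masked_crc(header) != length_crc:
            raise ValueError("corrupt length header")
        (length,) = struct.unpack("<Q", header)
        payload = src.read(length)
        (data_crc,) = struct.unpack("<I", src.read(4))
        if masked_crc(payload) != data_crc:
            raise ValueError("corrupt payload")
        yield payload


# Round trip through an in-memory buffer; the payloads are placeholders.
buf = io.BytesIO()
for p in (b"first", b"second"):
    write_record(buf, p)
buf.seek(0)
records = list(read_records(buf))
```

A MapReduce OutputFormat or Spark writer would presumably emit the same four fields per record from the JVM side, so the files can be read by TensorFlow without a conversion pass in Python.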

@drpngx
Contributor

drpngx commented Oct 31, 2016

That sounds great!

@jhseu
Contributor

jhseu commented Oct 31, 2016

@llhe Yeah, that sounds useful. If you have it working, please send a pull request to http://github.com/tensorflow/ecosystem instead of the core TensorFlow repository.

@llhe
Contributor Author

llhe commented Nov 1, 2016

@jhseu Yeah, I noticed that repo. It looks like it's mainly for deployment, but perhaps I can add this application-level stuff there as well. Thanks.

@llhe
Contributor Author

llhe commented Nov 1, 2016

@jhseu @drpngx I just created a pull request for ecosystem: tensorflow/ecosystem#18

@drpngx drpngx removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Nov 1, 2016
@drpngx
Contributor

drpngx commented Nov 1, 2016

Nice!

@drpngx drpngx closed this as completed Nov 1, 2016
@aselle aselle added type:feature Feature requests and removed enhancement labels Feb 9, 2017
4 participants