TFRecord file support with Hadoop Mapreduce/Spark #5293
@jhseu for comments. The native format for Spark is Parquet, for which you can now get a C++ reader/writer from Impala. The ORC format does have a C++ reader, but it lacks bloom filters and a variety of compression algorithms; it was recently extracted from the Hadoop codebase as a standalone library. The legacy MR RCFile format is very, very Java-dependent, and I wouldn't want to support that.
@drpngx I mean writing TFRecord files with MR and Spark directly (e.g., to HDFS/GCS or a customized filesystem that TensorFlow can access), avoiding unnecessary and slow data conversion in Python code. We have some code that makes this possible and would like to contribute it if applicable.
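For context on what "writing TFRecord files directly" involves, here is a minimal sketch of the TFRecord on-disk framing that any MR/Spark writer would have to produce: each record is a little-endian uint64 length, a masked CRC32C of the length bytes, the payload, and a masked CRC32C of the payload. This is a hedged illustration, not code from the pull request under discussion; the class and method names are invented, and it assumes Java 9+ for `java.util.zip.CRC32C`.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

// Hypothetical helper illustrating the TFRecord framing:
// [length: uint64 LE][masked crc32c(length): uint32 LE][data][masked crc32c(data): uint32 LE]
public final class TFRecordSketch {
    private static final int MASK_DELTA = 0xa282ead8;

    // TensorFlow's "masked" CRC: rotate the CRC32C right by 15 bits, then add a constant.
    static int maskedCrc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        int c = (int) crc.getValue();
        return ((c >>> 15) | (c << 17)) + MASK_DELTA;
    }

    // Encode one payload as a framed TFRecord record.
    static byte[] encodeRecord(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        byte[] len = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN)
                .putLong(payload.length).array();
        out.write(len);
        writeIntLE(out, maskedCrc32c(len));
        out.write(payload);
        writeIntLE(out, maskedCrc32c(payload));
        return bos.toByteArray();
    }

    private static void writeIntLE(DataOutputStream out, int v) throws IOException {
        out.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(v).array());
    }

    public static void main(String[] args) throws IOException {
        byte[] rec = encodeRecord("hello".getBytes(StandardCharsets.UTF_8));
        // 8 (length) + 4 (length crc) + 5 (payload) + 4 (payload crc) = 21 bytes
        System.out.println(rec.length);
    }
}
```

A real MR/Spark integration would wrap this framing in a Hadoop `OutputFormat` so that reducers/executors emit TFRecord files straight to HDFS or GCS, which is exactly the data-conversion step the comment above wants to avoid doing in Python.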
That sounds great!
@llhe Yeah, that sounds useful. If you have it working, please send a pull request to http://github.com/tensorflow/ecosystem instead of the core TensorFlow repository. |
@jhseu Yeah, I noticed that repo; it looks like it's mainly for deployment. Perhaps I can also add this application-level stuff there. Thanks.
@jhseu @drpngx I just created a pull request for ecosystem: tensorflow/ecosystem#18 |
Nice! |
MR/Spark are commonly used for ETL and feature generation, so it would be better to support close integration with such systems. More specifically, supporting the following: