TFRecord file support with Hadoop Mapreduce/Spark #5293
@jhseu for comments. The native format for Spark is Parquet, for which you can now get a C++ reader/writer from Impala. The ORC format does have a C++ reader, but it lacks bloom filters and a variety of compression algorithms; it was recently extracted from the Hadoop codebase as a standalone library. The legacy MR RCFile format is very, very Java-dependent, and I wouldn't want to support that.
@drpngx I mean writing TFRecord files with MR and Spark directly (e.g., to HDFS/GCS or a customized filesystem that TensorFlow can access), avoiding unnecessary and slow data conversion in Python code. We have some code that makes this possible and would like to contribute it if applicable.
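For context on what "writing TFRecord files directly" involves, here is a minimal sketch of the TFRecord on-disk framing that any MR/Spark writer would have to produce: each record is a little-endian uint64 length, a masked CRC32C of the length bytes, the payload, and a masked CRC32C of the payload. This is a hedged illustration, not code from the pull request under discussion; the class and method names are invented, and it assumes Java 9+ for `java.util.zip.CRC32C`.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

// Hypothetical helper illustrating the TFRecord framing:
// [length: uint64 LE][masked crc32c(length): uint32 LE][data][masked crc32c(data): uint32 LE]
public final class TFRecordSketch {
    private static final int MASK_DELTA = 0xa282ead8;

    // TensorFlow's "masked" CRC: rotate the CRC32C right by 15 bits, then add a constant.
    static int maskedCrc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        int c = (int) crc.getValue();
        return ((c >>> 15) | (c << 17)) + MASK_DELTA;
    }

    // Encode one payload as a framed TFRecord record.
    static byte[] encodeRecord(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        byte[] len = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN)
                .putLong(payload.length).array();
        out.write(len);
        writeIntLE(out, maskedCrc32c(len));
        out.write(payload);
        writeIntLE(out, maskedCrc32c(payload));
        return bos.toByteArray();
    }

    private static void writeIntLE(DataOutputStream out, int v) throws IOException {
        out.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(v).array());
    }

    public static void main(String[] args) throws IOException {
        byte[] rec = encodeRecord("hello".getBytes(StandardCharsets.UTF_8));
        // 8 (length) + 4 (length crc) + 5 (payload) + 4 (payload crc) = 21 bytes
        System.out.println(rec.length);
    }
}
```

A real MR/Spark integration would wrap this framing in a Hadoop `OutputFormat` so that reducers/executors emit TFRecord files straight to HDFS or GCS, which is exactly the data-conversion step the comment above wants to avoid doing in Python.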
That sounds great!
@llhe Yeah, that sounds useful. If you have it working, please send a pull request to http://github.com/tensorflow/ecosystem instead of the core TensorFlow repository. |
@jhseu Yeah, I noticed that repo; it looks like it's mainly for deployment. Perhaps I can also add this application-level stuff there. Thanks.
@jhseu @drpngx I just created a pull request for ecosystem: tensorflow/ecosystem#18 |
Nice! |
MR/Spark are commonly used for ETL and feature generation, so it would be better to support close integration with such systems. More specifically, supporting the following: