Add Apache Parquet support for TensorFlow Dataset #19461

Closed
wants to merge 14 commits into from

Conversation

yongtang
Member

Apache Parquet is a widely used columnar storage format available in the Hadoop ecosystem.

This PR is a preliminary attempt to add Apache Parquet support to TensorFlow's Dataset API. It should help the many users who work with existing Parquet-formatted big data in TensorFlow.

The PR may not cover all use cases, but it could serve as a starting point for further improvement.

ParquetDataset depends on the Apache parquet-cpp project as well as other dependencies (e.g., Thrift). ParquetDataset only builds on Linux at the moment. This PR also adds an option in ./configure so that those dependencies can be skipped.
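
As a rough illustration of the opt-out described above, here is a minimal sketch of how a configure-style script might decide whether to pull in the Parquet dependencies. The variable name `TF_NEED_PARQUET`, the helper `parquet_enabled`, and the default of "disabled" are all assumptions made for this example; the thread does not state the actual option name or its default.

```python
import os


def parquet_enabled(environ=None, var="TF_NEED_PARQUET"):
    """Return True when the opt-in variable holds a truthy value.

    `TF_NEED_PARQUET` is a hypothetical name used only for illustration;
    the real option added by the PR is not named in this thread.
    """
    env = os.environ if environ is None else environ
    value = env.get(var, "0").strip().lower()
    # Treat a missing or unrecognized value as "skip the dependencies".
    return value in ("1", "y", "yes", "true")


if __name__ == "__main__":
    if parquet_enabled():
        print("Parquet support requested: parquet-cpp and Thrift are needed.")
    else:
        print("Parquet support skipped: no extra dependencies are fetched.")
```

Passing the environment in as a parameter keeps the check easy to exercise without touching the real process environment.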

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

@case540
Contributor

case540 commented May 22, 2018

Seeing compilation errors...

In file included from external/arrow/cpp/src/arrow/buffer.h:28:
external/arrow/cpp/src/arrow/memory_pool.h:24:10: fatal error: 'arrow/util/visibility.h' file not found
#include "arrow/util/visibility.h"

@case540 case540 added the stat:awaiting response Status - Awaiting response from author label May 22, 2018
@yongtang yongtang force-pushed the parquet branch 15 times, most recently from 9491050 to e0ad3bf Compare May 24, 2018 03:25
@yongtang yongtang force-pushed the parquet branch 11 times, most recently from 149c4ad to c7b6b28 Compare May 26, 2018 16:18
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
…packages

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@yongtang
Member Author

Thanks @skye for the review. The comment has been addressed. The current test is failing, but that is due to a Bazel bug. I opened an issue in Bazel:
bazelbuild/bazel#5932
The bug has been fixed, so the fix will be in the next Bazel release.

Can you take a look and see if the PR is OK? Once Bazel is updated, all tests should pass, I believe.

skye
skye previously approved these changes Aug 20, 2018
@yongtang yongtang added the kokoro:force-run Tests on submitted change label Sep 14, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 14, 2018
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@yongtang
Member Author

Bazel 0.17.1 has been released (bazelbuild/bazel#5059 (comment)) and it fixed the issue we are facing here. The tests in TF still use Bazel 0.15.0.

A separate PR #22281 has been created to update Bazel to 0.17.1.

/cc @gunan
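
For anyone checking locally whether their Bazel is new enough, a minimal sketch of the version comparison follows. The helper names `parse_version` and `has_fix` are made up for this example, and it assumes plain dotted numeric versions such as 0.15.0 or 0.17.1 (no release-candidate suffixes).

```python
def parse_version(text):
    """Turn '0.17.1' into (0, 17, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def has_fix(installed, required="0.17.1"):
    """Return True if `installed` is at least the release carrying the fix."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    for version in ("0.15.0", "0.17.1"):
        status = "has the fix" if has_fix(version) else "is too old"
        print(f"Bazel {version} {status}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.9.0" would sort after "0.17.1".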

@mrry mrry added the kokoro:force-run Tests on submitted change label Sep 23, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Sep 23, 2018
@yongtang
Member Author

yongtang commented Oct 9, 2018

This PR still depends on #22449 (to update bazel to 0.17.1). Will update the PR once #22449 is fixed.

@tensorflowbutler
Member

Nagging Assignee @case540: It has been 44 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@mrry
Contributor

mrry commented Dec 8, 2018

I'm removing myself as reviewer because I assume that we'll integrate this via SIG IO (where hopefully the same build issues won't be a problem!)...

@yongtang
Member Author

This PR has been migrated to tensorflow/io#21, where all builds pass. I will close this PR once tensorflow/io#21 is merged.

@yongtang
Member Author

Since tensorflow/io#21 has been merged, I will close this PR. Thanks everyone for the help!

@yongtang yongtang closed this Dec 26, 2018
@yongtang yongtang deleted the parquet branch May 4, 2020 19:41

7 participants