Parquet support #1

Open
dselivanov opened this issue Nov 30, 2015 · 6 comments

Comments

@dselivanov

Any plans to add parquet support?

@jorgemarsal
Contributor

Hi @dselivanov,

No plans to add Parquet support in the short term.
Shouldn't be too hard though. We just need to add a ParquetRecordParser that delegates to the official parquet-cpp implementation https://github.com/apache/parquet-cpp.

I can help you if you want to give it a shot.
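
To make the delegation idea above concrete, here is a minimal sketch in Python (chosen for brevity; the real work would be in C++ against parquet-cpp). Everything in it is an assumption for illustration: the `RecordParser` base class, the `records()` method, and the `backend` callable are hypothetical names, not this project's actual interface, and `fake_backend` is only a stand-in for the Parquet reader, not parquet-cpp's API.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]


class RecordParser(ABC):
    """Hypothetical parser interface; the project's real base class may differ."""

    @abstractmethod
    def records(self) -> Iterator[Record]:
        """Yield one record (column name -> value) at a time."""


class ParquetRecordParser(RecordParser):
    """Holds no decoding logic of its own; it forwards all reading to a backend."""

    def __init__(self, path: str, backend: Callable[[str], Iterable[Record]]):
        self._path = path
        # Assumption: the backend is any callable that takes a file path and
        # returns rows. In the real integration this role would be played by
        # a wrapper around parquet-cpp.
        self._backend = backend

    def records(self) -> Iterator[Record]:
        # Delegate the actual file reading, then hand rows on unchanged.
        for row in self._backend(self._path):
            yield dict(row)


def fake_backend(path: str) -> Iterable[Record]:
    """Stand-in reader returning fixed rows so the sketch runs without a file."""
    return [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


parser = ParquetRecordParser("example.parquet", fake_backend)
rows = list(parser.records())
```

The point of the design is that the parser stays a thin adapter: swapping `fake_backend` for a real reader would not change the code that consumes `records()`.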

@dselivanov
Author

@jorgemarsal, thanks for the clarification. I don't have time to port parquet-cpp at the moment either, so I'll use SparkR in the short term. I'll try to come back and take a closer look at parquet-cpp integration.

@dselivanov
Author

@jorgemarsal, FYI: reading Parquet through SparkR is totally unusable for any real problem because serialization/deserialization is very inefficient. Collecting even a tiny 100 MB data.frame takes more than 2 minutes...

@jorgemarsal
Contributor

Parquet support is at the top of my list. I'll get to it when I have some free time.

@dselivanov
Author

FYI: introducing-apache-arrow.
Also, it seems Cloudera developers have started actively refactoring parquet-cpp.

@DheerajAgarwal

Any updates on this one?


3 participants