Parquet store plugin #2284
Conversation
Can't wait to see this landing in master!
Quick comment regarding the plugin naming: could you rename it to `parquet`? All other plugins don't have the plugin type in their name, because (1) a single plugin can implement multiple plugin types, and (2) there is just one global namespace for names that the current plugins map to in a one-to-one fashion.
Force-pushed from b725c40 to 209afbf.
We probably need some CI scaffolding to build and test the parquet plugin.
You need to add it in three places:
This looks really good so far. As we talked about, I mostly took a look at the post-read table fixup that we sadly have to do. I think the way we're doing it now works nicely. I also left a few remarks inline for things I've noticed.
Force-pushed from af18b6f to 81f6c86.
This is looking really good! 🚀
I played around with it for an hour or so, did some benchmarks, and took a second look at the code, for which I left some comments, although none of them are major blockers in any way.
Regarding performance, there was no noticeable difference in import performance compared to current master, and the store files are ~40% smaller than with the segment store backend.
For the export performance I noticed a dip that was larger than I expected (~20%). When running many queries in parallel, the conversion from `std::shared_ptr<arrow::Table>` to `std::vector<table_slice>` dominates the benchmarks. I was wondering if it would be a good idea to cache the result of that conversion and do it only once, when the first query arrives (see the sketch below).
The next step after fixing that would be to take this PR and the `rebuild` command from #2321 and try a large-scale conversion on our test server, to evaluate just how well this performs for setups larger than I can run locally.
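A minimal sketch of that caching idea, assuming a hypothetical `parquet_store` class, a stand-in `table_slice` type, and a hypothetical `to_table_slices` helper (none of these names are from the PR); the only point is that the expensive conversion runs at most once, guarded by a `std::once_flag`:

```cpp
#include <arrow/api.h>

#include <memory>
#include <mutex>
#include <vector>

// Stand-in for the host application's table_slice type.
struct table_slice {
  std::shared_ptr<arrow::RecordBatch> batch;
};

// Hypothetical conversion helper; the real logic lives in the plugin.
std::vector<table_slice> to_table_slices(const arrow::Table& table) {
  arrow::TableBatchReader reader(table);
  std::vector<table_slice> result;
  std::shared_ptr<arrow::RecordBatch> batch;
  while (reader.ReadNext(&batch).ok() && batch != nullptr)
    result.push_back(table_slice{batch});
  return result;
}

// Hypothetical store state: convert on first access only, so that many
// parallel queries against the same store pay the conversion cost once.
class parquet_store {
public:
  explicit parquet_store(std::shared_ptr<arrow::Table> table)
    : table_{std::move(table)} {
  }

  const std::vector<table_slice>& slices() {
    std::call_once(converted_, [this] {
      cached_slices_ = to_table_slices(*table_);
    });
    return cached_slices_;
  }

private:
  std::shared_ptr<arrow::Table> table_;
  std::once_flag converted_;
  std::vector<table_slice> cached_slices_;
};
```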
@dominiklohmann: addressed all comments in the previous commit, so it's probably easiest to just review that one. Renaming of the plugin will be done last to make the diff more digestible.
Approving modulo outstanding comments. This is running fine, and the remaining things I've noticed are all minor. Looks great, and I think we can ship it as part of v2.1 actually. Really great work!
Force-pushed from 02403fc to 634e200.
This is WIP.
Reorganized code to work with table slices until the actual `write` operation.
In Arrow's Debian repositories, parquet is a separate dependency.
In the parquet store, we ignore query hints when evaluating a query, as we don't keep the global IDs around and therefore can't use them.
This is currently only used for the import time, which is stored per column but injected back into the table slices on read, choosing the max of the entire column (sketched below).
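A hedged sketch of the read-side half of that, using Arrow's compute API to reduce an assumed per-row import-time column back to a single value; the column name `import_time` and the function shape are illustrative, not the PR's actual code:

```cpp
#include <arrow/api.h>
#include <arrow/compute/api.h>

// Illustrative only: collapse the per-row import-time column into a
// single value by taking the maximum over the entire column.
arrow::Result<std::shared_ptr<arrow::Scalar>>
max_import_time(const arrow::Table& table) {
  // "import_time" is an assumed column name.
  auto column = table.GetColumnByName("import_time");
  if (column == nullptr)
    return arrow::Status::Invalid("no import_time column");
  ARROW_ASSIGN_OR_RAISE(auto max,
                        arrow::compute::CallFunction("max", {column}));
  return max.scalar();
}
```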
The parquet installation links to thrift, but fails to call `find_dependency` for it. Thrift itself tries to find libevent but doesn't install its own FindLibevent module.
This currently fails when reading back the parquet file, probably due to https://issues.apache.org/jira/browse/ARROW-5030
We're having trouble reading a parquet file with multiple row groups using `arrow::parquet`, so to avoid overly large table slices we instead split the record batch on read (see the sketch below).
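A small sketch of that splitting step, using zero-copy `arrow::RecordBatch::Slice` calls; the function name and the row cap are illustrative, not the PR's actual values:

```cpp
#include <arrow/api.h>

#include <memory>
#include <vector>

// Split one large record batch into batches of at most max_rows rows.
// Slice is zero-copy, so this only creates new views onto the same data.
std::vector<std::shared_ptr<arrow::RecordBatch>>
split_batch(const std::shared_ptr<arrow::RecordBatch>& batch,
            int64_t max_rows) {
  std::vector<std::shared_ptr<arrow::RecordBatch>> result;
  for (int64_t offset = 0; offset < batch->num_rows(); offset += max_rows)
    result.push_back(batch->Slice(offset, max_rows)); // length is clamped
  return result;
}
```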
Force-pushed from 3aa0e20 to 2070bef.
Now part of the feather PR #2413.
Preliminary write / read path for the parquet store plugin.
📝 Checklist
🎯 Review Instructions
Please take a look at the general approach, mostly on the read path with respect to transforming the `arrow::Table` back into table slices.