Skip to content

Parquet store plugin #2284

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 25 commits into from
Closed

Conversation

dispanser
Copy link
Contributor

Preliminary write / read path for the parquet store plugin.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

Please take a look at the general approach, mostly on the read path with respect to transforming the table

@dispanser dispanser added the feature New functionality label May 17, 2022
Copy link
Member

@mavam mavam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't wait to see this landing in master!

Quick comment regarding the plugin naming: could you rename it to parquet? All other plugins don't have the plugin type in their name, because (1) it's possible to implement multiple plugins, and (2) there is just one global namespace for names that the current plugins map to in a one-to-one fashion.

@dispanser dispanser force-pushed the story/sc-17753/parquet-store-plugin branch from b725c40 to 209afbf Compare May 18, 2022 06:53
@dispanser
Copy link
Contributor Author

We probably need some CI scaffolding to to build and test the parquet plugin.

@dominiklohmann
Copy link
Member

We probably need some CI scaffolding to to build and test the parquet plugin.

You need to add it in three places:

  • The Dockerfile
  • The Nix build script for the static binary
  • The configuration matrix of the VAST GitHub Actions workflow file

Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good so far. As we talked about I mostly took a look at the post read table fixup that we sadly have to do. I think the way we're doing it now works nicely. I also left a few remarks inline for things I've noticed.

@dispanser dispanser changed the title Story/sc 17753/parquet store plugin Parquet store plugin May 19, 2022
@dispanser dispanser force-pushed the story/sc-17753/parquet-store-plugin branch 5 times, most recently from af18b6f to 81f6c86 Compare May 31, 2022 16:37
@dispanser dispanser marked this pull request as ready for review May 31, 2022 19:35
Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good! 🚀

I played around with it for an hour or so and did some benchmarks, and I took a second look at the code, for which I left some comments, although none of them are major blockers in any way.

Regarding the performance, There was no noticeable difference compared to current master in the import performance, and the store files are ~40% smaller than with the segment store backend.

For the export performance I noticed a dip that was larger than I expected (~20%). When running many queries in parallel, the conversion from std::shared_ptr<arrow::Table> to std::vector<table_slice> dominates benchmarks. I was wondering if it would be a good idea to cache the result of that conversion and to only do it once the first query arrives.

The next step after fixing that would be to take this PR and the rebuild command from #2321 and to try a large-scale conversion on our test server to evaluate just how well this performs for larger setups than I can run locally.

@dispanser dispanser requested a review from dominiklohmann June 8, 2022 08:16
@dispanser
Copy link
Contributor Author

@dominiklohmann : addressed all comments in the previous commit, so it's probably easiest to just review that one. Renaming of the plugin will be done last to make the diff more digestible.

Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving modulo outstanding comments. This is running fine, and the remaining things I've noticed are all minor. Looks great, and I think we can ship it as part of v2.1 actually. Really great work!

@dispanser dispanser force-pushed the story/sc-17753/parquet-store-plugin branch from 02403fc to 634e200 Compare June 21, 2022 10:13
dispanser added 11 commits June 29, 2022 14:40
Reorganized code to work with table slices until the actual `write`
operation.
In Arrow's debian repositories, parquet is a separate dependency.
In the parquet store, we ignore query hints when evaluating a query, as
we don't keep the global IDs around so we can't use them.
This is currently only used for the import time, which is stored
per-column, but then on read injected back into the table slices,
chosing the max of the entire column.
@dispanser dispanser force-pushed the story/sc-17753/parquet-store-plugin branch from 3aa0e20 to 2070bef Compare June 29, 2022 13:28
@dispanser
Copy link
Contributor Author

now part of feather PR #2413

@dispanser dispanser closed this Jul 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants