
Implement BYTE_STREAM_SPLIT encoding for Parquet #8357

Closed · electrum opened this issue Jun 23, 2021 · 6 comments · Fixed by #12809

Comments

@electrum (Member) commented Jun 23, 2021

See #8338

@manupatteri (Member) commented:

I'd like to take a look at this, starting by reproducing it. I was able to generate a Parquet file in which a float column uses BYTE_STREAM_SPLIT encoding. Next I need to know which connectors support Parquet: I could identify only Hive and Iceberg, since those are the ones that read generic files/filesystems, while other databases have their own formats. In Iceberg I couldn't find support for external tables, so I would have to configure the table to use the Parquet format, and since Trino doesn't support this encoding on the write path, I can't produce such a file without modifying existing code. Is Hive the only option for reproducing this with minimal effort?
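For reference, a minimal sketch of producing such a file outside Trino, assuming pyarrow is available (the output path is hypothetical; dictionary encoding must be disabled for the column, since it otherwise takes precedence over BYTE_STREAM_SPLIT):

```python
# Sketch: write a Parquet file whose float32 column uses BYTE_STREAM_SPLIT.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"value": pa.array([1.5, 2.25, 3.125], type=pa.float32())})
pq.write_table(
    table,
    "/tmp/byte_stream_split.parquet",  # hypothetical path
    use_dictionary=False,              # dictionary encoding would win otherwise
    use_byte_stream_split=True,        # applies to the floating-point column
    compression="zstd",
)
```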

@findepi (Member) commented Jun 10, 2022

cc @raunaqmorarka @skrzypo987

@manupatteri (Member) commented:

I've had a PR in this state for some time and just wanted to get it out for early feedback. I should probably add more tests for the float type. I've also more or less duplicated the test classes; maybe I should templatize them to avoid repeating the tests.

@skrzypo987 (Member) commented:

Don't want to ruin the party, but do we know whether implementing this brings any performance benefit? These encodings are not widely used, as they require a lot of single-byte memory copying for little compression-ratio benefit.

@raunaqmorarka (Member) commented:

> Don't want to ruin the party, but do we know whether implementing this brings any performance benefit? These encodings are not widely used, as they require a lot of single-byte memory copying for little compression-ratio benefit.

As per the original Parquet issue https://issues.apache.org/jira/browse/PARQUET-1622, this makes floats/doubles compress better. The benefit of reduced IO may make the additional CPU effort worthwhile, but whether that trade-off pays off depends on the setup and workload. IMO the main reason to implement it is compatibility with the Parquet spec and the ability to read such data in case some other engine has produced it.
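For readers unfamiliar with the encoding, an illustrative Python sketch of the transform itself (not Trino's implementation): the k-th byte of every value is gathered into the k-th of four streams, so bytes with similar statistics, such as float exponents, end up adjacent and compress better under a general-purpose codec:

```python
import struct

def byte_stream_split(values: list[float]) -> bytes:
    raw = b"".join(struct.pack("<f", v) for v in values)  # 4 bytes per float32
    width = 4
    # Stream k holds byte k of value 0, byte k of value 1, ...
    streams = [raw[k::width] for k in range(width)]
    return b"".join(streams)

def byte_stream_join(encoded: bytes) -> list[float]:
    width = 4
    n = len(encoded) // width
    streams = [encoded[k * n:(k + 1) * n] for k in range(width)]
    # Interleave the streams back into per-value byte order.
    raw = bytes(streams[k][i] for i in range(n) for k in range(width))
    return [v[0] for v in struct.iter_unpack("<f", raw)]

values = [1.0, 1.5, 2.0, 2.5]
assert byte_stream_join(byte_stream_split(values)) == values
```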

@skrzypo987 (Member) commented:

I agree. Let's just provide some sort of benchmark mid-work so we're aware of how this trade-off works.
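As a starting point, a rough size comparison (not a CPU benchmark; a full evaluation would also have to measure decode cost in Trino's reader), assuming pyarrow and hypothetical output paths:

```python
# Write the same float64 column with and without BYTE_STREAM_SPLIT
# and compare the compressed on-disk footprint.
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"value": np.random.default_rng(0).normal(size=1_000_000)})
for split in (False, True):
    path = f"/tmp/bss_{split}.parquet"  # hypothetical paths
    pq.write_table(table, path, use_dictionary=False,
                   use_byte_stream_split=split, compression="zstd")
    print(f"byte_stream_split={split}: {os.path.getsize(path)} bytes")
```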
