Implement BYTE_STREAM_SPLIT encoding for Parquet #8357
Comments
I wanted to take a look at this and reproduce it first. I was able to generate a Parquet file where a float column used BYTE_STREAM_SPLIT encoding. Now I want to know which connectors support Parquet. I could identify only Hive and Iceberg, as they are the ones that support generic files/filesystems; other databases have their own formats.
I had a PR in this state for some time. I just wanted to get it out and get early feedback. I think I should probably add some more tests for the Float type. Also, I have somewhat duplicated the test classes; maybe I should templatize them to avoid repeating tests.
Don't want to ruin the party, but do we know if implementing this brings any performance benefit?
As per the original Parquet issue https://issues.apache.org/jira/browse/PARQUET-1622, this encoding makes floats/doubles compress better. The benefit of reduced IO may make the additional CPU effort worth it, but whether that trade-off pays off depends on the setup and workload. IMO the main reason to implement it is compatibility with the Parquet spec and the ability to read such data in case some other engine has produced it.
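For context on why this helps compression: BYTE_STREAM_SPLIT does not compress anything itself; it transposes the bytes of fixed-width values so that the k-th byte of every value lands in the k-th stream, which groups the slowly-varying sign/exponent bytes together for a downstream codec. A minimal sketch in Python (function names are hypothetical, not Trino's or parquet-mr's API):

```python
import struct

def byte_stream_split_encode(values):
    """Encode a list of float32 values with BYTE_STREAM_SPLIT:
    scatter the k-th byte of every value into the k-th stream,
    then concatenate the streams."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    width = 4  # float32 is 4 bytes; float64 would use 8
    streams = [raw[i::width] for i in range(width)]
    return b"".join(streams)

def byte_stream_split_decode(data, width=4):
    """Inverse: re-interleave the streams back into values."""
    n = len(data) // width
    streams = [data[i * n:(i + 1) * n] for i in range(width)]
    raw = bytes(streams[k][i] for i in range(n) for k in range(width))
    return [struct.unpack("<f", raw[j * width:(j + 1) * width])[0]
            for j in range(n)]
```

The transform is lossless and round-trips exactly; any size win only shows up after a general-purpose codec (e.g. Snappy, zstd) runs over the split output.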
I agree. Let's just provide some sort of benchmark mid-work so we are aware of how this trade-off works.
See #8338