
Implement BYTE_STREAM_SPLIT encoding for Parquet #8357

Closed · electrum opened this issue Jun 23, 2021 · 6 comments · Fixed by #12809

Comments

@electrum (Member) commented Jun 23, 2021

See #8338

@manupatteri (Member) commented:

I'd like to take a look at this, starting by reproducing it. I was able to generate a Parquet file in which a float column uses BYTE_STREAM_SPLIT encoding. Next I need to know which connectors support Parquet: I could identify only Hive and Iceberg, since those are the ones that read generic files/filesystems, while other databases have their own formats. In Iceberg I couldn't find support for external tables, so I would have to configure the table to use the Parquet format, and since Trino doesn't support this encoding on the write path, I can't produce such a file without modifying existing code. Is Hive the only option for reproducing this with minimal effort?
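For reference, a minimal sketch of producing such a file outside Trino, assuming pyarrow is available (the output path is hypothetical; dictionary encoding must be disabled for the column, since it otherwise takes precedence over BYTE_STREAM_SPLIT):

```python
# Sketch: write a Parquet file whose float32 column uses BYTE_STREAM_SPLIT.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"value": pa.array([1.5, 2.25, 3.125], type=pa.float32())})
pq.write_table(
    table,
    "/tmp/byte_stream_split.parquet",  # hypothetical path
    use_dictionary=False,              # dictionary encoding would win otherwise
    use_byte_stream_split=True,        # applies to the floating-point column
    compression="zstd",
)
```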

@findepi (Member) commented Jun 10, 2022

cc @raunaqmorarka @skrzypo987

@manupatteri (Member) commented:

I've had a PR in this state for some time and just wanted to get it out for early feedback. I should probably add more tests for the float type. I've also more or less duplicated the test classes; maybe I should templatize them to avoid repeating the tests.

@skrzypo987 (Member) commented:

Don't want to ruin the party, but do we know whether implementing this brings any performance benefit? These encodings are not widely used, as they require a lot of single-byte memory copying for little compression-ratio benefit.

@raunaqmorarka (Member) commented:

> Don't want to ruin the party, but do we know whether implementing this brings any performance benefit? These encodings are not widely used, as they require a lot of single-byte memory copying for little compression-ratio benefit.

As per the original Parquet issue https://issues.apache.org/jira/browse/PARQUET-1622, this makes floats/doubles compress better. The benefit of reduced IO may make the additional CPU effort worthwhile, but whether that trade-off pays off depends on the setup and workload. IMO the main reason to implement it is compatibility with the Parquet spec and the ability to read such data in case some other engine has produced it.
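For readers unfamiliar with the encoding, an illustrative Python sketch of the transform itself (not Trino's implementation): the k-th byte of every value is gathered into the k-th of four streams, so bytes with similar statistics, such as float exponents, end up adjacent and compress better under a general-purpose codec:

```python
import struct

def byte_stream_split(values: list[float]) -> bytes:
    raw = b"".join(struct.pack("<f", v) for v in values)  # 4 bytes per float32
    width = 4
    # Stream k holds byte k of value 0, byte k of value 1, ...
    streams = [raw[k::width] for k in range(width)]
    return b"".join(streams)

def byte_stream_join(encoded: bytes) -> list[float]:
    width = 4
    n = len(encoded) // width
    streams = [encoded[k * n:(k + 1) * n] for k in range(width)]
    # Interleave the streams back into per-value byte order.
    raw = bytes(streams[k][i] for i in range(n) for k in range(width))
    return [v[0] for v in struct.iter_unpack("<f", raw)]

values = [1.0, 1.5, 2.0, 2.5]
assert byte_stream_join(byte_stream_split(values)) == values
```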

@skrzypo987 (Member) commented:

I agree. Let's just provide some sort of benchmark mid-work so we're aware of how this trade-off works.
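As a starting point, a rough size comparison (not a CPU benchmark; a full evaluation would also have to measure decode cost in Trino's reader), assuming pyarrow and hypothetical output paths:

```python
# Write the same float64 column with and without BYTE_STREAM_SPLIT
# and compare the compressed on-disk footprint.
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"value": np.random.default_rng(0).normal(size=1_000_000)})
for split in (False, True):
    path = f"/tmp/bss_{split}.parquet"  # hypothetical paths
    pq.write_table(table, path, use_dictionary=False,
                   use_byte_stream_split=split, compression="zstd")
    print(f"byte_stream_split={split}: {os.path.getsize(path)} bytes")
```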
