
Native Parquet Writer writes Parquet V2 files that are not yet adopted by other engines. #7953

Closed
anjalinorwood opened this issue May 18, 2021 · 16 comments

Comments

@anjalinorwood
Member

The native Parquet writer introduced in Presto 337 writes Parquet files in the v2 format. Although query engines like Spark and table formats like Iceberg support this format, it is not well tested, since v2 has not seen widespread adoption. Newer features like Iceberg's vectorized read do not yet support v2.
Trino's Hive connector has not yet switched over to the new Parquet writer. Given this, it would be a good idea for the native Parquet writer to support Parquet v1 as well.

@electrum FYI, this is the issue we had discussed.
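
For comparison, the upstream parquet-mr writer already exposes the format version as an explicit property, which is the kind of switch this issue asks for in the native writer. A minimal sketch, assuming parquet-mr's example writer; the path and schema below are placeholders:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriterVersionSketch
{
    public static void main(String[] args)
            throws Exception
    {
        // placeholder schema and output path
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int32 id; }");
        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("/tmp/example-v1.parquet"))
                .withType(schema)
                // PARQUET_1_0 keeps v1 data pages; PARQUET_2_0 switches to v2 pages
                // and enables the newer encodings discussed below
                .withWriterVersion(WriterVersion.PARQUET_1_0)
                .build()) {
            // write Group records here
        }
    }
}
```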

@rdblue
Contributor

rdblue commented May 19, 2021

@electrum, is there a reason why Trino is writing Parquet v2 by default? My understanding is that the v2 spec wasn't ever finalized and v1 is the default. So although there are some interesting new encodings, it is fairly risky to write data as v2 because those were never really evaluated and marked for support in the format itself -- they're still experimental. I suggested some easier encodings that I think did better than those but we never got enough people interested to finalize anything.

So if this is because you've validated that the new delta encodings are working well, then we may want to build support elsewhere. But if this is just using v2 because it seems like a good choice, I'd recommend against it.

@martint
Member

martint commented May 19, 2021

By v2 do you mean Data Page v2?

I don't think we support anything that's not specified here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift and https://github.com/apache/parquet-format. Is that not the official spec? Or are there experimental/unfinished features there that we should be aware of? If so, it's not clear which ones.

@rdblue
Contributor

rdblue commented May 19, 2021

There are a few features that are experimental:

  • v2 data pages
  • all of the delta encodings

There used to be a clear table on parquet-format, but that was removed a while ago so the docs are now confusing. I think the Parquet community had a thread on this recently to clean up the spec. I'll look into the status of it.

Is Trino producing just v2 pages or delta encodings as well?
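
One way to check a file Trino produced is to read its footer back with parquet-mr, which lists the encodings each column chunk declares (the delta encodings would show up there; whether pages were written as DATA_PAGE or DATA_PAGE_V2 is recorded in the page headers rather than the footer). A rough sketch, with a placeholder file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class EncodingInspector
{
    public static void main(String[] args)
            throws Exception
    {
        HadoopInputFile file = HadoopInputFile.fromPath(
                new Path("/tmp/trino-written.parquet"), new Configuration());
        try (ParquetFileReader reader = ParquetFileReader.open(file)) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : block.getColumns()) {
                    // e.g. [PLAIN, RLE] for a plain v1-style file versus
                    // DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, etc.
                    System.out.println(column.getPath() + " -> " + column.getEncodings());
                }
            }
        }
    }
}
```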

@electrum
Member

I actually know very little about Parquet. The new writer was contributed by @qqibrow. We do have integration tests for Iceberg where we write with Trino and read with Spark: https://github.com/trinodb/trino/blob/master/testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestSparkCompatibility.java

But I now see that test is fairly limited and doesn’t cover all data types, or any nested types.

Do you have any suggestions for tests we can add to ensure we are compatible with Spark, or some existing test suite used to ensure Parquet compatibility or conformance?

Also, is this something that should be addressed in the Iceberg specification? I didn’t even know there were multiple Parquet versions, so it seems like we need to specify exactly what writers should produce and what readers need to support.

@electrum
Member

Can you see if changing this line fixes it?

@rdblue
Contributor

rdblue commented May 19, 2021

@anjalinorwood worked on the fix for us, so I'll defer to her on what we need to do to fix it.

@rdblue
Contributor

rdblue commented May 19, 2021

Do you have any suggestions for tests we can add to ensure we are compatible with Spark, or some existing test suite used to ensure Parquet compatibility or conformance?

I'd recommend using some of the test cases from Iceberg for schema evolution and types:

The Spark/Parquet test uses random records generated for various schemas in AvroDataTest, so it can easily cover a lot of cases. The vectorized tests go into more detail and test encoding results in Parquet, like full dictionary encoding and dictionary to plain encoding fallback.
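
In that spirit, a minimal round-trip sketch (not the Iceberg test code itself): write records with a fixed writer version, read them back, and compare. A real test would loop this over many randomly generated records and schemas, as AvroDataTest does; the schema and path here are placeholders:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class RoundTripSketch
{
    public static void main(String[] args)
            throws Exception
    {
        Schema schema = SchemaBuilder.record("row").fields()
                .requiredLong("id")
                .requiredString("name")
                .endRecord();
        Path path = new Path("/tmp/roundtrip.parquet");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(path)
                .withSchema(schema)
                .withWriterVersion(WriterVersion.PARQUET_1_0) // the variable under test
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 1L);
            record.put("name", "a");
            writer.write(record);
        }

        try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path).build()) {
            // a real test would assert equality against the written records
            System.out.println(reader.read());
        }
    }
}
```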

@anjalinorwood
Member Author

Can you see if changing this line fixes it?

No, that alone is not enough. The page header is written in V2 format here:
https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java#L220

Once I fixed that, I found that the v2 format differs from the v1 format in other ways, so those two changes are not enough. One would need to go through the v1 and v2 specs and make sure the native Parquet writer follows the v1 spec.
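
For concreteness, a sketch of how the two page header variants differ, using the thrift-generated classes from parquet-format (this is illustrative only, not the actual writer code; the encodings and sizes are placeholders):

```java
import org.apache.parquet.format.DataPageHeader;
import org.apache.parquet.format.DataPageHeaderV2;
import org.apache.parquet.format.Encoding;
import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.PageType;

public class PageHeaderSketch
{
    // v1: a single DataPageHeader; repetition/definition levels and values
    // share one (optionally compressed) page body
    static PageHeader v1Header(int valueCount, int uncompressedSize, int compressedSize)
    {
        PageHeader header = new PageHeader(PageType.DATA_PAGE, uncompressedSize, compressedSize);
        header.setData_page_header(new DataPageHeader(
                valueCount, Encoding.PLAIN, Encoding.RLE, Encoding.RLE));
        return header;
    }

    // v2: extra required fields (null and row counts, level byte lengths), and the
    // levels are stored uncompressed ahead of the values
    static PageHeader v2Header(int valueCount, int nullCount, int rowCount,
            int defLevelBytes, int repLevelBytes, int uncompressedSize, int compressedSize)
    {
        PageHeader header = new PageHeader(PageType.DATA_PAGE_V2, uncompressedSize, compressedSize);
        header.setData_page_header_v2(new DataPageHeaderV2(
                valueCount, nullCount, rowCount, Encoding.PLAIN, defLevelBytes, repLevelBytes));
        return header;
    }
}
```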

@electrum
Member

@anjalinorwood makes sense. Is this something you're working on?

When you say "go through the spec", is there something other than this one? The only mention I can find of v1 or v2 is a few references in the thrift definition and a couple mentions of "Parquet 2.0" on the encodings page.

@anjalinorwood
Member Author

@electrum I am not planning to work on this in the short term.

I don't remember the details, but after writing the page header in V1 format, I still had issues reading the resulting file (I think with the header itself) in the vectorized read path. I was referring to the spec links you posted.

@findepi
Member

findepi commented May 21, 2021

Is this related to #6377 ?

@minchowang

When will v2 be supported in Trino?

@rdblue
Contributor

rdblue commented Jul 1, 2021

@findepi, that sounds like the same issue.

@mincwang, I don't recommend using v2. The spec has not been finalized. Is there a reason you would like to use Parquet v2?

@minchowang

@findepi, that sounds like the same issue.

@mincwang, I don't recommend using v2. The spec has not been finalized. Is there a reason you would like to use Parquet v2?

@rdblue We write MySQL binlog data to Iceberg with Flink CDC, and then use Trino for ad-hoc queries.

@rdblue
Contributor

rdblue commented Jul 5, 2021

@mincwang, it sounds like you're talking about Iceberg's v2 row-level deletes and not Parquet's v2 encodings. I assumed that you were talking about Parquet because you were commenting on a Parquet v2 issue. To ask about Iceberg, I'd recommend checking in with the Trino slack community or finding an issue for Iceberg v2.

@findepi
Member

findepi commented Sep 28, 2021

Let me close this one in favor of the older issue on the same topic -- #6377.
