
Native Parquet Writer writes Parquet V2 files that are not yet adopted by other engines. #7953

Closed
anjalinorwood opened this issue May 18, 2021 · 16 comments

Comments

@anjalinorwood
Member

The native Parquet writer introduced in Presto 337 writes Parquet files in the v2 format. Although query engines like Spark and table formats like Iceberg support this format, it is not well tested, since v2 has not seen widespread adoption. Newer features like Iceberg's vectorized read do not yet support v2.
Trino's Hive connector has not yet switched over to the new Parquet writer. Given this, it would be a good idea for the native Parquet writer to support Parquet v1 as well.

@electrum FYI, this is the issue we had discussed.
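
For comparison, the upstream parquet-mr writer already exposes the format version as an explicit property, which is the kind of switch this issue asks for in the native writer. A minimal sketch, assuming parquet-mr's example writer; the path and schema below are placeholders:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriterVersionSketch
{
    public static void main(String[] args)
            throws Exception
    {
        // placeholder schema and output path
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int32 id; }");
        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("/tmp/example-v1.parquet"))
                .withType(schema)
                // PARQUET_1_0 keeps v1 data pages; PARQUET_2_0 switches to v2 pages
                // and enables the newer encodings discussed below
                .withWriterVersion(WriterVersion.PARQUET_1_0)
                .build()) {
            // write Group records here
        }
    }
}
```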

@rdblue
Contributor

rdblue commented May 19, 2021

@electrum, is there a reason why Trino is writing Parquet v2 by default? My understanding is that the v2 spec wasn't ever finalized and v1 is the default. So although there are some interesting new encodings, it is fairly risky to write data as v2 because those were never really evaluated and marked for support in the format itself -- they're still experimental. I suggested some easier encodings that I think did better than those but we never got enough people interested to finalize anything.

So if this is because you've validated that the new delta encodings are working well, then we may want to build support elsewhere. But if this is just using v2 because it seems like a good choice, I'd recommend against it.

@martint
Member

martint commented May 19, 2021

By v2 do you mean Data Page v2?

I don't think we support anything that's not specified here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift and https://github.com/apache/parquet-format. Is that not the official spec? Or are there experimental/unfinished features there that we should be aware of? If so, it's not clear which ones.

@rdblue
Contributor

rdblue commented May 19, 2021

There are a few features that are experimental:

  • v2 data pages
  • all of the delta encodings

There used to be a clear table on parquet-format, but that was removed a while ago so the docs are now confusing. I think the Parquet community had a thread on this recently to clean up the spec. I'll look into the status of it.

Is Trino producing just v2 pages or delta encodings as well?
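
One way to check a file Trino produced is to read its footer back with parquet-mr, which lists the encodings each column chunk declares (the delta encodings would show up there; whether pages were written as DATA_PAGE or DATA_PAGE_V2 is recorded in the page headers rather than the footer). A rough sketch, with a placeholder file path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class EncodingInspector
{
    public static void main(String[] args)
            throws Exception
    {
        HadoopInputFile file = HadoopInputFile.fromPath(
                new Path("/tmp/trino-written.parquet"), new Configuration());
        try (ParquetFileReader reader = ParquetFileReader.open(file)) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                for (ColumnChunkMetaData column : block.getColumns()) {
                    // e.g. [PLAIN, RLE] for a plain v1-style file versus
                    // DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, etc.
                    System.out.println(column.getPath() + " -> " + column.getEncodings());
                }
            }
        }
    }
}
```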

@electrum
Member

I actually know very little about Parquet. The new writer was contributed by @qqibrow. We do have integration tests for Iceberg where we write with Trino and read with Spark: https://github.com/trinodb/trino/blob/master/testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestSparkCompatibility.java

But I now see that test is fairly limited and doesn’t cover all data types, or any nested types.

Do you have any suggestions for tests we can add to ensure we are compatible with Spark, or some existing test suite used to ensure Parquet compatibility or conformance?

Also, is this something that should be addressed in the Iceberg specification? I didn’t even know there were multiple Parquet versions, so it seems like we need to specify exactly what writers should produce and what readers need to support.

@electrum
Member

Can you see if changing this line fixes it?

@rdblue
Contributor

rdblue commented May 19, 2021

@anjalinorwood worked on the fix for us, so I'll defer to her on what we need to do to fix it.

@rdblue
Contributor

rdblue commented May 19, 2021

Do you have any suggestions for tests we can add to ensure we are compatible with Spark, or some existing test suite used to ensure Parquet compatibility or conformance?

I'd recommend using some of the test cases from Iceberg for schema evolution and types:

The Spark/Parquet test uses random records generated for various schemas in AvroDataTest, so it can easily cover a lot of cases. The vectorized tests go into more detail and test encoding results in Parquet, like full dictionary encoding and dictionary to plain encoding fallback.
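
In that spirit, a minimal round-trip sketch (not the Iceberg test code itself): write records with a fixed writer version, read them back, and compare. A real test would loop this over many randomly generated records and schemas, as AvroDataTest does; the schema and path here are placeholders:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class RoundTripSketch
{
    public static void main(String[] args)
            throws Exception
    {
        Schema schema = SchemaBuilder.record("row").fields()
                .requiredLong("id")
                .requiredString("name")
                .endRecord();
        Path path = new Path("/tmp/roundtrip.parquet");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(path)
                .withSchema(schema)
                .withWriterVersion(WriterVersion.PARQUET_1_0) // the variable under test
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 1L);
            record.put("name", "a");
            writer.write(record);
        }

        try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path).build()) {
            // a real test would assert equality against the written records
            System.out.println(reader.read());
        }
    }
}
```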

@anjalinorwood
Member Author

Can you see if changing this line fixes it?

No, that alone is not enough. The page header is written in V2 format here:
https://github.com/trinodb/trino/blob/master/lib/trino-parquet/src/main/java/io/trino/parquet/writer/PrimitiveColumnWriter.java#L220

Once I fixed that, I found that the v2 format differs from the v1 format in other ways, so those two changes are not enough. One would need to go through the v1 and v2 specs and make sure the native Parquet writer follows the v1 spec.
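
For concreteness, a sketch of how the two page header variants differ, using the thrift-generated classes from parquet-format (this is illustrative only, not the actual writer code; the encodings and sizes are placeholders):

```java
import org.apache.parquet.format.DataPageHeader;
import org.apache.parquet.format.DataPageHeaderV2;
import org.apache.parquet.format.Encoding;
import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.PageType;

public class PageHeaderSketch
{
    // v1: a single DataPageHeader; repetition/definition levels and values
    // share one (optionally compressed) page body
    static PageHeader v1Header(int valueCount, int uncompressedSize, int compressedSize)
    {
        PageHeader header = new PageHeader(PageType.DATA_PAGE, uncompressedSize, compressedSize);
        header.setData_page_header(new DataPageHeader(
                valueCount, Encoding.PLAIN, Encoding.RLE, Encoding.RLE));
        return header;
    }

    // v2: extra required fields (null and row counts, level byte lengths), and the
    // levels are stored uncompressed ahead of the values
    static PageHeader v2Header(int valueCount, int nullCount, int rowCount,
            int defLevelBytes, int repLevelBytes, int uncompressedSize, int compressedSize)
    {
        PageHeader header = new PageHeader(PageType.DATA_PAGE_V2, uncompressedSize, compressedSize);
        header.setData_page_header_v2(new DataPageHeaderV2(
                valueCount, nullCount, rowCount, Encoding.PLAIN, defLevelBytes, repLevelBytes));
        return header;
    }
}
```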

@electrum
Member

@anjalinorwood makes sense. Is this something you're working on?

When you say "go through the spec", is there something other than this one? The only mention I can find of v1 or v2 is a few references in the thrift definition and a couple mentions of "Parquet 2.0" on the encodings page.

@anjalinorwood
Member Author

@electrum I am not planning to work on this in the short term.

I don't remember the details, but after writing the page header in V1 format, I still had issues reading the resulting file (I think with the header itself) in the vectorized read path. I was referring to the spec links you posted.

@findepi
Member

findepi commented May 21, 2021

Is this related to #6377 ?

@minchowang

When will v2 be supported in Trino?

@rdblue
Contributor

rdblue commented Jul 1, 2021

@findepi, that sounds like the same issue.

@mincwang, I don't recommend using v2. The spec has not been finalized. Is there a reason you would like to use Parquet v2?

@minchowang

@findepi, that sounds like the same issue.

@mincwang, I don't recommend using v2. The spec has not been finalized. Is there a reason you would like to use Parquet v2?

@rdblue We write MySQL binlog data to Iceberg with Flink CDC, and then use Trino for ad-hoc queries.

@rdblue
Contributor

rdblue commented Jul 5, 2021

@mincwang, it sounds like you're talking about Iceberg's v2 row-level deletes and not Parquet's v2 encodings. I assumed that you were talking about Parquet because you were commenting on a Parquet v2 issue. To ask about Iceberg, I'd recommend checking in with the Trino slack community or finding an issue for Iceberg v2.

@findepi
Member

findepi commented Sep 28, 2021

Let me close this one in favor of the older issue on the same topic -- #6377.
