-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support backwards compatible reads for unnanotated repeated primitive fields in Parquet #20943
Support backwards compatible reads for unnanotated repeated primitive fields in Parquet #20943
Conversation
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
76c7f3a
to
a13e0a1
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
lib/trino-parquet/src/main/java/io/trino/parquet/ParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
a13e0a1
to
e48dcb7
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
Fixed failing unit tests by adding an additional check in Other changes:
|
e48dcb7
to
8ac025f
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
plugin/trino-hive/src/test/java/io/trino/plugin/hive/parquet/AbstractTestParquetReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/ParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/ParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/parquet/AbstractTestParquetReader.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/parquet/AbstractTestParquetReader.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/parquet/AbstractTestParquetReader.java
Outdated
Show resolved
Hide resolved
8ac025f
to
f628064
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
f628064
to
80682f3
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
80682f3
to
b439b81
Compare
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla |
b439b81
to
ab353ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach looks good to me, @raunaqmorarka PTAL. @martint can you process the CLA on this one too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm % comment
plugin/trino-hive/src/test/java/io/trino/plugin/hive/parquet/AbstractTestParquetReader.java
Outdated
Show resolved
Hide resolved
ab353ce
to
063a45f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks @mxmarkovics !
Description
Per the Parquet Spec:
A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field
. So the following schemas should be treated equivalently.However, Trino currently does not support reading the old format and will throw the following error if attempting to read the field as a
array<int>
:If instead the field is defined in the table as simply the primitive type itself e.g.
int
then another unhelpful error will be thrown:This issue is also present in Presto, however the second case will actually return incorrect results instead of throwing the second error.
Spark is able to correctly handle this however, e.g.:
Have added two old parquet file that was written with the old format showing the reproduction and used for unit testing. Tested on AWS Athena.
There are several issues here:
ParquetTypeUtils
the table declared typearray<int>
assumes that the ParquetColumnIO
will be aGroupColumIO
, however in this case theColumnIO
is aPrimitiveColumnIO
which is repeated.PrimitiveField
with aGroupField
as it would be in the case of the new schema format causes theGroupField
to have the same repetition and definition levels of thePrimitiveField
child, where it should have 1 less for each since it is one level above. This change is also consistent with the change made in Spark to handle this case as well: apache/spark@d246010.Note: There is a similar code path in the Iceberg connector in
IcebergParquetColumnIOConverter
, however I was unable to reproduce this issue using Iceberg as it currently does not allow this case, so we don't need to handle the Iceberg case in Trino. https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java#L74-L77.Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: