Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support backwards compatible reads for unnanotated repeated primitive fields in Parquet #20943

Merged
merged 1 commit into from
Mar 13, 2024

Conversation

mxmarkovics
Copy link
Contributor

@mxmarkovics mxmarkovics commented Mar 5, 2024

Description

Per the Parquet Spec: A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field. So the following schemas should be treated equivalently.

# Old schema format for required list of required ints
message TestProtobuf.RepeatedIntMessage {
  repeated int32 repeatedInt;
}
# New schema format for required list of required ints
message TestProtobuf.RepeatedIntMessage {
  required group repeatedInt (LIST) {
    repeated group list {
      required int32 element;
    }
  }
}

However, Trino currently does not support reading the old format and will throw the following error if attempting to read the field as a array<int>:

Query 20240229_210455_00000_hegy2 failed: Error opening Hive split s3://...: class org.apache.parquet.io.PrimitiveColumnIO cannot be cast to class org.apache.parquet.io.GroupColumnIO (org.apache.parquet.io.PrimitiveColumnIO and org.apache.parquet.io.GroupColumnIO are in unnamed module of loader io.trino.server.PluginClassLoader @56b37c54)

If instead the field is defined in the table as simply the primitive type itself e.g. int then another unhelpful error will be thrown:

GENERIC_INTERNAL_ERROR: Loaded block positions count (3) doesn’t match lazy block positions count (1)

This issue is also present in Presto, however the second case will actually return incorrect results instead of throwing the second error.
Spark is able to correctly handle this however, e.g.:

inputDF = spark.read.parquet(“s3://...“).select(“*”);
inputDF.show()
...
+-----------+
|repeatedInt|
+-----------+
|  [1, 2, 3]     |
+-----------+

Have added two old parquet file that was written with the old format showing the reproduction and used for unit testing. Tested on AWS Athena.
There are several issues here:

  1. In ParquetTypeUtils the table declared type array<int> assumes that the Parquet ColumnIO will be a GroupColumIO, however in this case the ColumnIO is a PrimitiveColumnIO which is repeated.
  2. The above issue aside, Trino should interpret the repeated non-group field Parquet schema as a required list of required elements per the spec. However, when wrapping the PrimitiveField with a GroupField as it would be in the case of the new schema format causes the GroupField to have the same repetition and definition levels of the PrimitiveField child, where it should have 1 less for each since it is one level above. This change is also consistent with the change made in Spark to handle this case as well: apache/spark@d246010.
  3. When the declared table type is incompatible with the repeated field in the schema an unhelpful error message is thrown that does not indicate there is a schema mismatch, and in the case of Presto will give incorrect results. This should only be the case when the repeated primitive field as at the top level, i.e. parent is a MessageColumnIO, to still be backwards compatible with old Avro format as well.

Note: There is a similar code path in the Iceberg connector in IcebergParquetColumnIOConverter, however I was unable to reproduce this issue using Iceberg as it currently does not allow this case, so we don't need to handle the Iceberg case in Trino. https://github.com/apache/iceberg/blob/main/parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java#L74-L77.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Fix reading array types from parquet files produced by some legacy writes  ({issue}`20943`)

Copy link

cla-bot bot commented Mar 5, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added tests:hive hive Hive connector labels Mar 5, 2024
Copy link

cla-bot bot commented Mar 5, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Mar 6, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@mxmarkovics
Copy link
Contributor Author

mxmarkovics commented Mar 6, 2024

Fixed failing unit tests by adding an additional check in ColumnReaderFactory. The tests were failing because the batched reader used the FlatColumnReader instead of the NestedColumnReader via ColumnReaderFactory for repeated top level primitive fields, however the primitive field in the normal/new schema format for arrays it would use NestedColumnReader . In other words, the FlatColumnReader is only used to read top level primitive fields, however since top level repeated primitives should be treated the same as nested primitives inside of a list group, they should be considered as nested instead of flat columns. The additional check in isFlatColumns should work because the repetition and definition levels for top level repeated primitives are (1, 1) and the only case where a top level primitive can have a max repetition level of 1 is if itself is repeated. Nested repeated primitives of other forms are handled already in NestedColumnReader via handling for the old Avro format and the normal/current way of representing lists in the schema.

Other changes:

  • Added back the REPEATED check instead of required in ParquetTypeUtils
  • Improved error message for schema mismatch
  • Added unit test coverage for error message
  • Added unit test coverage for change to isFlatColumn

Copy link

cla-bot bot commented Mar 6, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Mar 7, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Mar 7, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@mxmarkovics mxmarkovics marked this pull request as ready for review March 7, 2024 17:30
Copy link

cla-bot bot commented Mar 7, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link
Member

@pettyjamesm pettyjamesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me, @raunaqmorarka PTAL. @martint can you process the CLA on this one too?

Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comment

Copy link
Member

@pettyjamesm pettyjamesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @mxmarkovics !

@pettyjamesm pettyjamesm merged commit cd01bb5 into trinodb:master Mar 13, 2024
58 checks passed
@github-actions github-actions bot added this to the 441 milestone Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

None yet

3 participants