
Native Avro File Reader #17221

Merged
merged 5 commits into trinodb:master from jklamer/AvroHiveFileFormatReader on Jun 21, 2023

Conversation

@jklamer (Member) commented Apr 24, 2023

Description

Add classes and utilities to replace the Hive library's AvroSerde deserialization functionality with a Trino-native Avro page source.

The changes are largely split into two modules:

  • trino-hive-formats: a general-use Avro library that allows connectors to define custom page-building code or use the default behavior. Responsible for resolving read/write schemas and decoding file bytes.
  • trino-hive: the Hive plugin's usage of the above library, ensuring backwards compatibility with the current implementation, along with custom schema-sourcing code. Includes custom solutions for handling reader projection use cases.

Main features:

  • Column masking pushdown to decrease memory footprint for most queries
  • Reading bytes directly from the file into pages, without Object creation or casting in the majority of cases

Main classes for review:

  • io.trino.hive.formats.avro.AvroPageDataReader (see the sketch after this list):
    • A cross between org.apache.avro.io.FastReaderBuilder and io.trino.hive.formats.line.json.JsonDeserializer
    • Implements org.apache.avro.io.DatumReader
    • Accumulates pages as the Avro library reads through the file, handling schema read resolution (reordering, skips, defaults, promotions)
  • io.trino.plugin.hive.avro.AvroHiveFileUtils: handles the HiveType -> Avro Schema mapping in place of org.apache.hadoop.hive.serde2.avro.AvroSerDe
  • io.trino.plugin.hive.avro.HiveAvroTypeManager: customizes the format library with special HiveType and Timestamp transformations in a backwards-compatible way. Flattens behavior currently spread across a number of files into functions it supplies to the library.
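
To make that concrete, here is a minimal, hypothetical sketch of the shape such a DatumReader can take, hard-coded to a single bigint column for illustration; the real AvroPageDataReader instead builds a tree of block-building decoders from the resolved schema:

import io.trino.spi.Page;
import io.trino.spi.PageBuilder;
import org.apache.avro.Schema;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;

import java.io.IOException;
import java.util.List;
import java.util.Optional;

import static io.trino.spi.type.BigintType.BIGINT;

// Hypothetical sketch: each read() call decodes one row directly into a
// PageBuilder, and a Page is emitted only when the builder fills up, so the
// Avro file-reading loop drives page production without per-row Objects.
public class PageAccumulatingDatumReader
        implements DatumReader<Optional<Page>>
{
    // Single bigint column for illustration only
    private final PageBuilder pageBuilder = new PageBuilder(List.of(BIGINT));

    @Override
    public void setSchema(Schema fileSchema)
    {
        // The real reader resolves the file (writer) schema against the read
        // schema here and builds its decoder tree once, up front.
    }

    @Override
    public Optional<Page> read(Optional<Page> ignored, Decoder decoder)
            throws IOException
    {
        pageBuilder.declarePosition();
        BIGINT.writeLong(pageBuilder.getBlockBuilder(0), decoder.readLong());
        if (pageBuilder.isFull()) {
            Page page = pageBuilder.build();
            pageBuilder.reset();
            return Optional.of(page);
        }
        return Optional.empty();
    }
}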

Additional context and related issues

This is part of a broader effort to make the Trino dependency on Hive/Hadoop optional.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text:

The new native Avro reader is enabled by default in the trino-hive plugin.
Disable it with the configuration property avro.native-reader.enabled=false, or with the session property SET SESSION <catalog_name>.avro_native_reader_enabled = false, if you experience issues with the new reader.

@cla-bot cla-bot bot added the cla-signed label Apr 24, 2023
@jklamer jklamer requested review from dain and electrum April 24, 2023 23:54
@github-actions github-actions bot added hive Hive connector tests:hive labels Apr 25, 2023
@jklamer jklamer force-pushed the jklamer/AvroHiveFileFormatReader branch 2 times, most recently from 1e57d63 to afde27b on April 25, 2023 18:13
@jklamer jklamer force-pushed the jklamer/AvroHiveFileFormatReader branch 2 times, most recently from 5033fb9 to 487d3a5 on April 28, 2023 15:17

@dain (Member) left a comment:

I'm still reviewing but here are some initial comments

}
case ENUM -> VarcharType.VARCHAR;
case ARRAY -> new ArrayType(typeFromAvro(schema.getElementType(), avroTypeManager, enclosingRecords));
case MAP -> new MapType(VarcharType.VARCHAR, typeFromAvro(schema.getValueType(), avroTypeManager, enclosingRecords), new TypeOperators());

Member:

You should use the TypeOperators from ConnectorContext.getTypeManager().getTypeOperators(), but if that is too difficult you could create a static final. The TypeOperators are effectively generating code with method handles, so if you are constantly creating new instances, the code will always be cold (a.k.a. slow).
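
A minimal sketch of the static-final fallback, using a hypothetical holder class (the preferred route remains wiring in the TypeOperators from the ConnectorContext):

import io.trino.spi.type.TypeOperators;

// Hypothetical holder: sharing one TypeOperators instance lets its generated
// method handles warm up, instead of staying cold because a fresh instance is
// created on every call.
final class SharedTypeOperators
{
    static final TypeOperators TYPE_OPERATORS = new TypeOperators();

    private SharedTypeOperators() {}
}

The MapType construction above would then pass SharedTypeOperators.TYPE_OPERATORS in place of new TypeOperators().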

Member Author:

I might need some help wiring this up the way you're envisioning

Member:

Reminder: we need to resolve this one.

@@ -113,6 +123,13 @@
<scope>test</scope>
</dependency>

<dependency>
<groupId>io.trino</groupId>
<artifactId>trino-main</artifactId>

Member:

Do we really need this dependency? I would prefer if we did not need this in this module.

Member Author:

I'm using it for io.trino.block.BlockAssertions. Can I move that class to SPI?

Member Author:

Should I make my own version?

Member:

I think we should move it to the SPI. When you do that, switch the TestNG assertEquals usage to AssertJ, since we're trying to move away from TestNG in new code.
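
For illustration, the switch looks like this (a sketch with assumed Block arguments, not the actual BlockAssertions code):

import io.trino.spi.block.Block;

import static org.assertj.core.api.Assertions.assertThat;

class BlockAssertionStyle
{
    static void assertBlockEquals(Block actual, Block expected)
    {
        // TestNG style being phased out:
        //     org.testng.Assert.assertEquals(actual, expected);

        // AssertJ style preferred in new code:
        assertThat(actual).isEqualTo(expected);
    }
}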

Member Author:

Getting the following IntelliJ errors on the refactor. I'll ping you offline on what you think we should do here.

  • Field ColorType.COLOR, referenced in method BlockAssertions.createColorRepeatBlock(int, int), will not be accessible in module trino-spi
  • Field TestingConnectorSession.SESSION, referenced in method BlockAssertions.getOnlyValue(Type, Block), will not be accessible in module trino-spi
  • Field IpAddressType.IPADDRESS, referenced in method BlockAssertions.createRandomBlockForType(Type, int, float), will not be accessible in module trino-spi

@jklamer jklamer force-pushed the jklamer/AvroHiveFileFormatReader branch 4 times, most recently from 5feca4b to d101b9e on May 2, 2023 21:29

@dain (Member) left a comment:

I reviewed the "Add Native Avro to Page code with connector defined mappings" commit. My comments are mostly stylistic, and the code looks good.

I am concerned about the testing. Most of the tests in this commit are just verifying the correct number of rows and that no exceptions were thrown. I think we need basic tests for every type and a few complex combinations of types.

protected static TrinoInputFile createWrittenFileWithSchema(int count, Schema schema)
        throws IOException
{
    Iterator<Object> randomData = new RandomData(schema, count).iterator();

Member:
this is really nice

@jklamer jklamer force-pushed the jklamer/AvroHiveFileFormatReader branch 2 times, most recently from 45aa870 to 9752755 on May 9, 2023 23:13
for (String columnName : maskedColumns) {
    Schema.Field field = tableSchema.getField(columnName);
    if (Objects.isNull(field)) {
        continue;

Member Author:

Check

}

private static BlockBuildingDecoder createBlockBuildingDecoderForAction(Resolver.Action action, AvroTypeManager typeManager)
throws AvroTypeException

Contributor:

I probably wouldn't recommend using the resolver. The only things that need to be resolved are the order in which to read struct fields, which fields to skip, and how to fill in defaults. That's actually pretty simple and can be done directly. Check out our implementation in Python.

Basically, that creates a list of (optional position, reader) pairs by doing the following:

  1. Loop over the file schema's record and create a pair for every field
    • If there is a corresponding field in the read schema, use that field's position from the read schema
    • If the field is not in the read schema, use empty
  2. Loop over the read schema and add any needed pairs
    • If the field was in the file schema, skip it
    • If the field was missing, create a pair from the field's position in the read schema and a constant reader with the field's default

Then to read, you just loop over those pairs. If the position is present, get the value from the reader and set that position in the record. If the position is not present, call skip to consume and discard the bytes.

I think this is way cleaner overall, although I like what you're doing here, using the resolver to create your reader tree rather than resolving in each record.
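
A hypothetical sketch of that pairing scheme, with Field, FieldReader, and Decoder as stand-ins for the real Avro types rather than the actual library API; union and type-promotion resolution, discussed below, are out of scope here:

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

interface Decoder {}

interface FieldReader
{
    Object read(Decoder decoder);

    // Real readers would skip the encoded bytes without materializing a value.
    default void skip(Decoder decoder)
    {
        read(decoder);
    }
}

record Field(String name, FieldReader reader, Object defaultValue) {}

record ResolvedField(Optional<Integer> readPosition, FieldReader reader) {}

final class RecordResolution
{
    private RecordResolution() {}

    static List<ResolvedField> resolve(List<Field> fileFields, List<Field> readFields)
    {
        List<ResolvedField> resolved = new ArrayList<>();
        // 1. Every file-schema field gets a pair: its position in the read
        //    schema if present, otherwise empty (its bytes will be skipped).
        for (Field fileField : fileFields) {
            int position = indexOf(readFields, fileField.name());
            resolved.add(new ResolvedField(
                    position >= 0 ? Optional.of(position) : Optional.empty(),
                    fileField.reader()));
        }
        // 2. Read-schema fields missing from the file get a constant reader
        //    supplying the field's default value.
        for (int position = 0; position < readFields.size(); position++) {
            Field readField = readFields.get(position);
            if (indexOf(fileFields, readField.name()) < 0) {
                resolved.add(new ResolvedField(
                        Optional.of(position),
                        decoder -> readField.defaultValue()));
            }
        }
        return resolved;
    }

    // Reading a record is then a single loop over the resolved pairs.
    static Object[] readRecord(List<ResolvedField> resolved, Decoder decoder, int readFieldCount)
    {
        Object[] record = new Object[readFieldCount];
        for (ResolvedField field : resolved) {
            if (field.readPosition().isPresent()) {
                record[field.readPosition().get()] = field.reader().read(decoder);
            }
            else {
                field.reader().skip(decoder);
            }
        }
        return record;
    }

    private static int indexOf(List<Field> fields, String name)
    {
        for (int i = 0; i < fields.size(); i++) {
            if (fields.get(i).name().equals(name)) {
                return i;
            }
        }
        return -1;
    }
}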

@jklamer (Member Author) commented Jun 7, 2023:

Oh, that's a good point. I definitely don't love the data model of the Actions returned for this or other use cases. When I was implementing, I was just peeling back layers of GenericDatumReader and FastDatumReader until I didn't need to anymore, and pretty much copied the usage from those classes. The reason I might keep using it is that it also provides the reader/writer union resolution steps that I originally wasn't super confident about implementing, but need (I imagine I could, but would need to think on it; I think it would involve load-bearing exceptions). From my understanding after looking at the pyiceberg code, Iceberg doesn't support non-trivial (optional) unions and doesn't need to do union resolution, correct?

Member:

Do you want to address this now or in a follow up?

Member Author:

I'll ping @rdblue offline and we'll get this sorted.

@electrum (Member):

Capitalize the commit titles: https://cbea.ms/git-commit/

}

private static BlockBuildingDecoder createBlockBuildingDecoderForAction(Resolver.Action action, AvroTypeManager typeManager)
throws AvroTypeException

Member:

Do you want to address this now or in a follow up?

{
    BlockBuilder entryBuilder = builder.beginBlockEntry();
    long entriesInBlock = decoder.readMapStart();
    // TODO need to filter out all but last value for key?

Member:

What would happen if we encounter a file with duplicate keys?

    logicalType = fromSchemaIgnoreInvalid(schema);
    break;
case LOCAL_TIMESTAMP_MICROS:
case LOCAL_TIMESTAMP_MILLIS:
    log.debug("Logical type " + typeName + " not currently supported by Trino");

Member:

Is this something we plan to support? I'm wondering how the log message might be useful.

Member Author:

I would actually like to talk with you about this. I want to make sure I understand our TimestampWithTimeZoneType well enough to know whether it maps directly to this logical type or can be coerced in a way that makes sense.

@jklamer jklamer force-pushed the jklamer/AvroHiveFileFormatReader branch from e5f4d4c to 13183c0 on June 21, 2023 16:59
@jklamer jklamer force-pushed the jklamer/AvroHiveFileFormatReader branch from 13183c0 to 35e959d on June 21, 2023 19:54
@electrum electrum merged commit 6f5d455 into trinodb:master Jun 21, 2023
68 checks passed

@jklamer (Member Author) commented Jun 21, 2023

@b-slim @weijiii, be aware this has been merged. Let me know if you find any issues with it while using it.
