Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix writing Hive timestamps array/map/struct timestamps from Trino to Parquet #6717

Merged
merged 3 commits into from Feb 4, 2021

Conversation

jirassimok
Copy link
Member

@jirassimok jirassimok commented Jan 25, 2021

The commits up to "Add tests for Hive struct timestamps inserted from Trino" are from #6622.

This addresses #6760.

@jirassimok
Copy link
Member Author

jirassimok commented Jan 26, 2021

Re: this comment

@findepi

do you have any perf-related concerns for Parquet?

The code currently in this PR should have the same performance as, but the code that converts Blocks to writable values for Hive is duplicated in three places to support Parquet (HiveWriteUtils.getField, FieldSetterFactory.getField, and the subclasses of FieldSetter), and two methods in FieldSetterFactory need to be overridden (in the exact same way) to make timestamps work properly with Parquet. Both of those issues may make it difficult (or at least inconvenient) to maintain/reuse this code.

what other options do you see here?

I have a cleaner version here that uses the same code to handle bare values and values in arrays, and, thinking about it after stepping back for a bit, I'm not sure it will have a significant impact. Basically, it creates an extra object or two for each value being written (including boxing and then unboxing primitives).

Another option would be to go through Hive's Writable.readField(DataInput). It might be able to avoid the potential downsides to the other approaches, but it seems harder to implement cleanly.

Another approach that might avoid both issues is to use objects similar to the ones in the original code, but creating a new one for each array/map/row entry. This will increase costs for those types, but not for regular values of any type. I'll take look at this. (This looks promising, but runs into some issues with under-documented Hive code and some weird things we do in HiveWriteUtils.)

@jirassimok
Copy link
Member Author

I figured out the third approach I mentioned in my previous comment, and this PR is now using it.

This will slightly lower performance for data being written in arrays, maps, and structs, but should have no impact on regular data.

@jirassimok jirassimok force-pushed the parquet-type-cleanup-1 branch 4 times, most recently from 5c79f51 to 22445af Compare January 29, 2021 02:49
@jirassimok
Copy link
Member Author

jirassimok commented Jan 29, 2021

The first four commits in this PR ("Clean up in FieldSetterFactory" and "Fix writing long timestamps in map/array/row to Hive") should get timestamps working in arrays et al. in Parquet.

The rest of the commits dramatically refactor FieldSetterFactory to reduce code duplication and allow easier extension.

I'm starting to think I should have made FieldTranslatorFactory to go with my changes, and used it to replace HiveWriteUtils.getField, which still duplicates code currently in FieldSetterFactory.

@jirassimok jirassimok force-pushed the parquet-type-cleanup-1 branch 2 times, most recently from a42b87e to f09a6d9 Compare January 29, 2021 04:32
@findepi
Copy link
Member

findepi commented Jan 29, 2021

The rest of the commits dramatically refactor FieldSetterFactory to reduce code duplication and allow easier extension.

@jirassimok would it make sense to separate fix from potentially-an-improvement into their own PRs?

@jirassimok
Copy link
Member Author

jirassimok commented Jan 29, 2021

Oh, yeah, I had that in my previous comment, but I forgot to leave it in when I updated the PR and changed the comment.

I'll split it off.

@jirassimok jirassimok force-pushed the parquet-type-cleanup-1 branch 8 times, most recently from d17afa8 to 54b33d6 Compare February 2, 2021 14:16
@jirassimok jirassimok force-pushed the parquet-type-cleanup-1 branch 3 times, most recently from e5620c9 to 758557d Compare February 3, 2021 16:13
- Clean up FieldSetterFactory.create's type conditions
- Rename FieldSetterFactory.BigintFieldBuilder to BigintFieldSetter
@jirassimok jirassimok force-pushed the parquet-type-cleanup-1 branch 2 times, most recently from 62c9206 to 236bf10 Compare February 4, 2021 17:16
@jirassimok
Copy link
Member Author

Here is the passing check, and here is a diff from then to now.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

@jirassimok
Copy link
Member Author

The test failure in Iceberg is #5758.

@findepi findepi merged commit 1d83cef into trinodb:master Feb 4, 2021
@findepi findepi added this to the 352 milestone Feb 4, 2021
@findepi findepi mentioned this pull request Feb 4, 2021
10 tasks
@findepi
Copy link
Member

findepi commented Feb 4, 2021

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

2 participants