Allow case-insensitive fieldname matching for struct coercion in hive connector #5575

phd3 · 2020-10-16T16:58:22Z

Co-authored-by: Xingyuan Lin @lxynov

Hive also performs case insensitive matching for struct field names:

martint · 2020-10-16T22:57:52Z

What happens if two row fields have the same name but with different case? E.g., ROW("foo" BIGINT, "FOO" BIGINT)?

dain · 2020-10-17T20:12:02Z

What happens if two row fields have the same name but with different case? E.g., ROW("foo" BIGINT, "FOO" BIGINT)?

According to the linked code, it appears that Hive uses the first one, and ignores the second. I'd be curious what happens when you do SELECT * on a table like that... I'd bet you get both.

dain

Do we need changes to Parquet also? Does the HiveCoercionPolicy cover changes at the partition level?

findepi · 2020-10-19T12:21:33Z

Hive also performs case insensitive matching for struct field names:

Can we extend TestHiveCoercion so that it runs control queries against Hive?

It would make it easier to maintain these tests. I do not think anyone wants to verify the expected outcomes against all supported Hive versions manually.

phd3 · 2020-10-20T16:45:06Z

@martint

What happens if two row fields have the same name but with different case? E.g., ROW("foo" BIGINT, "FOO" BIGINT)?

@dain's guess is accurate, but the behavior varies between the two Hive versions. Consider the following example, which I tried out through product tests:

CREATE TABLE u_padesai.abc (row_column STRUCT<foo: BIGINT, FOO: BIGINT>,bigint_column BIGINT) 
PARTITIONED BY (id BIGINT) 
STORED AS ORC
TBLPROPERTIES ('transactional'='false');

After inserting through Presto with INSERT INTO T VALUES (row(1, 10), 2, 3), the reads look like the following:

Query	Hive 1	Hive 3	Presto
SELECT *	[{"foo":1,"FOO":10}, 2, 3]	[{"foo":10,"FOO":null}, 2, 3]	Fails in planner: ambiguous fields
SELECT row_column.FOO	[1]	[10]	Fails while creating orc page source
SELECT row_column.foo	[1]	[10]	Fails while creating orc page source
SELECT row_column.FoO	[1]	[10]	Fails while creating orc page source

So I feel that Presto's approach of failing here is correct and avoids ambiguity. Maybe even for the change proposed in this PR, we should make sure that such duplicate fields are correctly flagged as errors. However, it is a bit strange that the insertion succeeds.
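To make the "flag duplicate fields as errors" idea concrete, here is a minimal sketch of such a check. `CaseInsensitiveFieldCheck` and its methods are hypothetical names invented for illustration, not part of Presto or of this PR; the sketch only assumes that field names are compared after lowercasing with `Locale.ENGLISH`, as the diff in this PR does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Presto: finds struct field names that
// collide once case is ignored.
final class CaseInsensitiveFieldCheck
{
    private CaseInsensitiveFieldCheck() {}

    // Returns lowercased name -> original spellings, only for names that collide.
    static Map<String, List<String>> findCollisions(List<String> fieldNames)
    {
        Map<String, List<String>> byLowerCase = new LinkedHashMap<>();
        for (String name : fieldNames) {
            byLowerCase.computeIfAbsent(name.toLowerCase(Locale.ENGLISH), key -> new ArrayList<>()).add(name);
        }
        // Keep only the groups with more than one spelling.
        byLowerCase.values().removeIf(spellings -> spellings.size() < 2);
        return byLowerCase;
    }

    // Fails fast instead of silently picking one of the ambiguous fields.
    static void verifyNoCollisions(List<String> fieldNames)
    {
        Map<String, List<String>> collisions = findCollisions(fieldNames);
        if (!collisions.isEmpty()) {
            throw new IllegalArgumentException("Ambiguous struct field names when ignoring case: " + collisions);
        }
    }
}
```

For the struct in the example above, `verifyNoCollisions(List.of("foo", "FOO"))` would throw, while a struct with distinct names would pass through unchanged.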

Does the HiveCoercionPolicy cover changes at the partition level?

Yes, it is used for generating TableToPartitionMapping, which tracks table-partition column ordering/naming and coercions.

Do we need changes to Parquet also?

I don't think so; it seems to be covered already. https://github.com/prestosql/presto/blob/master/presto-parquet/src/main/java/io/prestosql/parquet/ParquetTypeUtils.java#L174

Can we extend TestHiveCoercion so that it runs control queries against Hive?

@findepi this is a great idea. However, I am not sure which Hive version we should pick as the "control", as different versions can behave differently. I would tend towards picking Hive 3, but it seems from this comment that it doesn't support all the coercions. And if we then attempt to match Presto's behavior with Hive 3, it will not be backwards compatible.

findepi · 2020-10-21T09:11:00Z

this is a great idea, however not sure which hive version we pick as the "control", as different versions can have different behavior.

The point is not to pick a single version.
We run product tests (suites 1-5) against multiple Hive versions:
https://github.com/prestosql/presto/blob/54c6b2b66445bad3f32874992bacc6d772813c71/.github/workflows/ci.yml#L272-L282

The goal of the test is to "document" Hive behavior with the appropriate Presto behavior next to it
(if they are the same -- great; if they are soundly different -- great too).

For the case where hive behavior changed version to version, you can always base the expected behavior conditionally on
io.prestosql.tests.hive.HiveProductTest#getHiveVersionMajor like here
https://github.com/prestosql/presto/blob/53bafb41da3b36f923f57f995b012bec9b85be4c/presto-product-tests/src/main/java/io/prestosql/tests/hive/TestHiveTableStatistics.java#L1074-L1076

phd3 · 2020-12-17T01:15:33Z

Updated the tests - added assertions for hive's behavior across different versions and formats. Based on that, we can decide on whether we want to move ahead with this change :)

Co-authored-by: Xingyuan Lin <linxingyuan1102@gmail.com>

findepi · 2021-01-08T21:23:34Z

How does this relate to #1558 (comment)?
in particular, is Hive's behavior dependent on file format in use?

phd3 · 2021-01-28T00:23:37Z

testing/trino-product-tests/src/main/java/io/trino/tests/hive/TestHiveCoercion.java

+        // Document case-sensitivity related behavior for nested fields in hive
+        String hiveValueForCaseChangeField;
+        Predicate<String> isFormat = formatName -> tableName.toLowerCase(Locale.ENGLISH).contains(formatName);
+        if (isFormat.test("rctext") || isFormat.test("textfile")) {
+            hiveValueForCaseChangeField = "\"lower2uppercase\":2";
+        }
+        else if (getHiveVersionMajor() == 3 && isFormat.test("orc")) {
+            hiveValueForCaseChangeField = "\"LOWER2UPPERCASE\":null";
+        }
+        else {
+            hiveValueForCaseChangeField = "\"LOWER2UPPERCASE\":2";
+        }
+


@findepi

in particular, is Hive's behavior dependent on file format in use?

Yes, the value changes based on format and version. The behavior with dereference (as in assertNestedSubFields) is tricky too.

electrum · 2021-03-02T16:48:30Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/orc/OrcPageSourceFactory.java

-                                        column -> column.getHiveColumnProjectionInfo().map(HiveColumnProjectionInfo::getDereferenceNames).orElse(ImmutableList.<String>of()),
+                                        column -> column.getHiveColumnProjectionInfo()
+                                                .map(HiveColumnProjectionInfo::getDereferenceNames)
+                                                .map(names -> names.stream().map(name -> name.toLowerCase(ENGLISH)).collect(toUnmodifiableList()))


Should we do this lowercasing when we create HiveColumnProjectionInfo?
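As a rough illustration of that alternative, the sketch below normalizes the names once, at construction time, so that every consumer sees lowercased names. `ProjectionInfoSketch` is a hypothetical stand-in written for this example; it is not the real HiveColumnProjectionInfo, and its constructor signature is an assumption.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical stand-in for a projection-info holder; not the real HiveColumnProjectionInfo.
final class ProjectionInfoSketch
{
    private final List<String> dereferenceNames;

    ProjectionInfoSketch(List<String> dereferenceNames)
    {
        // Normalize once here so that every reader sees lowercased names.
        this.dereferenceNames = dereferenceNames.stream()
                .map(name -> name.toLowerCase(Locale.ENGLISH))
                .collect(Collectors.toUnmodifiableList());
    }

    List<String> getDereferenceNames()
    {
        return dereferenceNames;
    }
}
```

The trade-off is that the original spelling is no longer available to callers, which may or may not matter depending on how the names are used elsewhere.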

electrum · 2021-03-02T16:56:06Z

@phd3 Your test and table of results are great! The follow-up question is: what happens if you swap the order of the fields in the struct? It's not clear whether the different Hive versions are picking the lowercase or the uppercase version, or whether they are picking the first or the second one. Would this distinction make a difference to the behavior in this PR?
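The reason the swapped order matters can be sketched with two toy resolution rules. Both are hypothetical models written for this illustration, and neither is taken from Hive's source. They agree on a struct declared as (foo, FOO) and only disagree once the fields are declared as (FOO, foo).

```java
import java.util.List;
import java.util.Locale;

// Hypothetical models of two resolution rules; neither is taken from Hive's source.
final class FieldResolutionModels
{
    private FieldResolutionModels() {}

    // Rule A: the first field whose name matches ignoring case wins.
    static int firstMatchIgnoringCase(List<String> fieldNames, String requested)
    {
        for (int i = 0; i < fieldNames.size(); i++) {
            if (fieldNames.get(i).equalsIgnoreCase(requested)) {
                return i;
            }
        }
        return -1;
    }

    // Rule B: the field spelled exactly like the lowercased request wins,
    // regardless of its position.
    static int exactLowercaseMatch(List<String> fieldNames, String requested)
    {
        return fieldNames.indexOf(requested.toLowerCase(Locale.ENGLISH));
    }
}
```

With the field list (foo, FOO), both rules resolve a request for FoO to index 0. With (FOO, foo), rule A still resolves to index 0 while rule B resolves to index 1, so a test with the swapped declaration would tell the two apart.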

mosabua · 2022-10-20T17:15:12Z

👋 @phd3 - this PR has become inactive. Please let us know if you will continue to work on this or if we can close the PR.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

phd3 · 2022-10-21T20:03:13Z

superseded by #13423

cla-bot bot added the cla-signed label Oct 16, 2020

dain reviewed Oct 17, 2020

dain requested review from findepi, losipiuk and electrum October 17, 2020 20:15

phd3 mentioned this pull request Nov 9, 2020

Assert coercion behavior with Hive #5884

Merged

phd3 force-pushed the case-sensitivity-fix branch from acafe3a to 13a6320 Compare December 17, 2020 01:12

phd3 requested a review from dain December 22, 2020 18:44

Allow case insensitivity in fieldnames for coercing a Row type

8a5998a

Co-authored-by: Xingyuan Lin <linxingyuan1102@gmail.com>

phd3 force-pushed the case-sensitivity-fix branch from 13a6320 to 8a5998a Compare January 4, 2021 18:46

phd3 commented Jan 28, 2021

electrum reviewed Mar 2, 2021

dain removed their request for review March 3, 2021 02:31

phd3 mentioned this pull request Mar 24, 2021

Do case insensitive comparison between dereferenced fields and internal ORC field names #7350

Closed

findepi force-pushed the master branch from 8538e49 to 1f896ea Compare July 30, 2021 22:13

leetcode-1533 mentioned this pull request Jul 29, 2022

Allow case-insensitive fieldname matching for struct coercion in hive connector #13423

Merged

phd3 closed this Oct 21, 2022
