Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow case-insensitive fieldname matching for struct coercion in hive connector #5575

Closed
wants to merge 1 commit into from

Conversation

phd3
Copy link
Member

@phd3 phd3 commented Oct 16, 2020

@cla-bot cla-bot bot added the cla-signed label Oct 16, 2020
@martint
Copy link
Member

martint commented Oct 16, 2020

What happens if two row fields have the same name but with different case? E.g., ROW("foo" BIGINT, "FOO" BIGINT)?

@dain
Copy link
Member

dain commented Oct 17, 2020

What happens if two row fields have the same name but with different case? E.g., ROW("foo" BIGINT, "FOO" BIGINT)?

According to the linked code, it appears that Hive uses the first one, and ignores the second. I'd be curious what happens when you do SELECT * on a table like that... I'd bet you get both.

Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need changes to Parquet also? Does the HiveCoercionPolicy cover changes at the partition level?

@findepi
Copy link
Member

findepi commented Oct 19, 2020

Hive also performs case insensitive matching for struct field names:

Can we extend TestHiveCoercion so that it runs control queries against Hive?

It would make it easier to maintain these tests. I do not think anyone wants to verify the expected outcomes against all supported Hive versions manually.

@phd3
Copy link
Member Author

phd3 commented Oct 20, 2020

@martint

What happens if two row fields have the same name but with different case? E.g., ROW("foo" BIGINT, "FOO" BIGINT)?

@dain's guess is accurate, but the behavior varies in the two versions. consider the following example I tried out through product tests:

CREATE TABLE u_padesai.abc (row_column STRUCT<foo: BIGINT, FOO: BIGINT>,bigint_column BIGINT) 
PARTITIONED BY (id BIGINT) 
STORED AS ORC
TBLPROPERTIES ('transactional'='false');

The reads look like the following after inserting through presto: INSERT INTO T VALUES (row(1, 10), 2, 3)

Hive 1 Hive 3 Presto
SELECT * [{"foo":1,"FOO":10}, 2, 3] [{"foo":10,"FOO":null}, 2, 3] Fails in planner: ambiguous fields
SELECT row_column.FOO [1] [10] Fails while creating orc page source
SELECT row_column.foo [1] [10] Fails while creating orc page source
SELECT row_column.FoO [1] [10] Fails while creating orc page source

So I feel that Presto's approach of failing here is correct and avoids ambiguity. May be even for the change proposed in this PR, we should make sure that such duplicate fields are correctly flagged as errors. However, it is a bit strange that insertion succeeds.

Does the HiveCoercionPolicy cover changes at the partition level?

yes, it is used for generating TableToPartitionMapping, which tracks table-partition column ordering/naming and coercions.

Do we need changes to Parquet also?

I don't think so, it seems already covered. https://github.com/prestosql/presto/blob/master/presto-parquet/src/main/java/io/prestosql/parquet/ParquetTypeUtils.java#L174

Can we extend TestHiveCoercion so that it runs control queries against Hive?

@findepi this is a great idea, however not sure which hive version we pick as the "control", as different versions can have different behavior. I would tend towards picking hive 3, but it seems from this comment that it doesn't support all the coercions. But if we then attempt to match Presto's behavior with hive3, it'll not be backwards compatible.

@findepi
Copy link
Member

findepi commented Oct 21, 2020

this is a great idea, however not sure which hive version we pick as the "control", as different versions can have different behavior.

The point is not to pick a single version.
We run product tests (suites 1-5) against multiple Hive versions:
https://github.com/prestosql/presto/blob/54c6b2b66445bad3f32874992bacc6d772813c71/.github/workflows/ci.yml#L272-L282

the goal of the test is to "document" hive behavior with appropriate Presto behavior next to it
(if they are same -- great; if they are soundly different -- great too)

For the case where hive behavior changed version to version, you can always base the expected behavior conditionally on
io.prestosql.tests.hive.HiveProductTest#getHiveVersionMajor like here
https://github.com/prestosql/presto/blob/53bafb41da3b36f923f57f995b012bec9b85be4c/presto-product-tests/src/main/java/io/prestosql/tests/hive/TestHiveTableStatistics.java#L1074-L1076

@phd3
Copy link
Member Author

phd3 commented Dec 17, 2020

Updated the tests - added assertions for hive's behavior across different versions and formats. Based on that, we can decide on whether we want to move ahead with this change :)

@phd3 phd3 requested a review from dain December 22, 2020 18:44
Co-authored-by: Xingyuan Lin <linxingyuan1102@gmail.com>
@findepi
Copy link
Member

findepi commented Jan 8, 2021

How does this relate to #1558 (comment)?
in particular, is Hive's behavior dependent on file format in use?

Comment on lines +367 to +379
// Document case-sensitivity related behavior for nested fields in hive
String hiveValueForCaseChangeField;
Predicate<String> isFormat = formatName -> tableName.toLowerCase(Locale.ENGLISH).contains(formatName);
if (isFormat.test("rctext") || isFormat.test("textfile")) {
hiveValueForCaseChangeField = "\"lower2uppercase\":2";
}
else if (getHiveVersionMajor() == 3 && isFormat.test("orc")) {
hiveValueForCaseChangeField = "\"LOWER2UPPERCASE\":null";
}
else {
hiveValueForCaseChangeField = "\"LOWER2UPPERCASE\":2";
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi

in particular, is Hive's behavior dependent on file format in use?

yes, the value changes based on format and version. the behavior with dereference (as in assertNestedSubFields ) is tricky too.

column -> column.getHiveColumnProjectionInfo().map(HiveColumnProjectionInfo::getDereferenceNames).orElse(ImmutableList.<String>of()),
column -> column.getHiveColumnProjectionInfo()
.map(HiveColumnProjectionInfo::getDereferenceNames)
.map(names -> names.stream().map(name -> name.toLowerCase(ENGLISH)).collect(toUnmodifiableList()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this lowercase when we create HiveColumnProjectionInfo?

@electrum
Copy link
Member

electrum commented Mar 2, 2021

@phd3 Your test and table of the results is great! The follow up question is, what happens if you swap the order of the fields in the struct? It's not clear if the different Hive versions are picking the lower or uppercase version, or if they are picking the first or second one. Would this distinction make a difference to the behavior in this PR?

@mosabua
Copy link
Member

mosabua commented Oct 20, 2022

👋 @phd3 - this PR has become inactive. Please let us know if you will continue to work on this or if we can close the PR.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@phd3
Copy link
Member Author

phd3 commented Oct 21, 2022

superseded by #13423

@phd3 phd3 closed this Oct 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

6 participants