Allow case-insensitive fieldname matching for struct coercion in hive connector #5575
What happens if two row fields have the same name but with different case? E.g.,
According to the linked code, it appears that Hive uses the first one, and ignores the second. I'd be curious what happens when you do
Do we need changes to Parquet also? Does the HiveCoercionPolicy cover changes at the partition level?
Can we extend It would make it easier to maintain these tests. I do not think anyone wants to verify the expected outcomes against all supported Hive versions manually.
@dain's guess is accurate, but the behavior varies between the two versions. Consider the following example, which I tried out through product tests:
The reads look like the following after inserting through presto:
So I feel that Presto's approach of failing here is correct and avoids ambiguity. Maybe even for the change proposed in this PR, we should make sure that such duplicate fields are correctly flagged as errors. However, it is a bit strange that insertion succeeds.
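To make the "flag duplicates as errors" idea concrete, here is a minimal sketch of case-insensitive field resolution that refuses to guess when two fields differ only by case. The class `CaseInsensitiveFieldResolver` and its `resolve` method are hypothetical names used for illustration; this is not the code in this PR or in Presto.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Presto: resolves a requested struct field
// name against the field names found in a file, ignoring case, and rejects
// requests that match more than one field.
final class CaseInsensitiveFieldResolver
{
    private CaseInsensitiveFieldResolver() {}

    // Returns the index of the single field whose name equals `requested`
    // ignoring case, or -1 if no field matches.
    // Throws IllegalArgumentException if several fields match.
    static int resolve(List<String> fileFieldNames, String requested)
    {
        String wanted = requested.toLowerCase(Locale.ENGLISH);
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < fileFieldNames.size(); i++) {
            if (fileFieldNames.get(i).toLowerCase(Locale.ENGLISH).equals(wanted)) {
                matches.add(i);
            }
        }
        if (matches.size() > 1) {
            throw new IllegalArgumentException(
                    "Ambiguous field name '" + requested + "': matches " + matches.size() + " fields");
        }
        return matches.isEmpty() ? -1 : matches.get(0);
    }
}
```

With this shape, a struct containing both `x` and `X` fails loudly on read instead of silently picking one of the two fields.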
yes, it is used for generating
I don't think so, it seems already covered. https://github.com/prestosql/presto/blob/master/presto-parquet/src/main/java/io/prestosql/parquet/ParquetTypeUtils.java#L174
@findepi this is a great idea; however, I'm not sure which Hive version we would pick as the "control", as different versions can behave differently. I would lean towards picking Hive 3, but it seems from this comment that it doesn't support all the coercions. But if we then attempt to match Presto's behavior with Hive 3, it will not be backwards compatible.
The point is not to pick a single version. The goal of the test is to "document" Hive behavior with the appropriate Presto behavior next to it. For the case where Hive behavior changed from version to version, you can always base the expected behavior conditionally on
Force-pushed from acafe3a to 13a6320.
Updated the tests: added assertions for Hive's behavior across different versions and formats. Based on that, we can decide whether we want to move ahead with this change :)
Co-authored-by: Xingyuan Lin <linxingyuan1102@gmail.com>
Force-pushed from 13a6320 to 8a5998a.
How does this relate to #1558 (comment)?
    // Document case-sensitivity related behavior for nested fields in hive
    String hiveValueForCaseChangeField;
    Predicate<String> isFormat = formatName -> tableName.toLowerCase(Locale.ENGLISH).contains(formatName);
    if (isFormat.test("rctext") || isFormat.test("textfile")) {
        hiveValueForCaseChangeField = "\"lower2uppercase\":2";
    }
    else if (getHiveVersionMajor() == 3 && isFormat.test("orc")) {
        hiveValueForCaseChangeField = "\"LOWER2UPPERCASE\":null";
    }
    else {
        hiveValueForCaseChangeField = "\"LOWER2UPPERCASE\":2";
    }
In particular, does Hive's behavior depend on the file format in use?
Yes, the value changes based on format and version. The behavior with dereference (as in assertNestedSubFields) is tricky too.
    column -> column.getHiveColumnProjectionInfo().map(HiveColumnProjectionInfo::getDereferenceNames).orElse(ImmutableList.<String>of()),
    column -> column.getHiveColumnProjectionInfo()
            .map(HiveColumnProjectionInfo::getDereferenceNames)
            .map(names -> names.stream().map(name -> name.toLowerCase(ENGLISH)).collect(toUnmodifiableList()))
Should we do this lowercasing when we create HiveColumnProjectionInfo?
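As a rough illustration of that suggestion, the sketch below normalizes the names once, in the constructor, so that every caller sees lowercase names without repeating the mapping. `NormalizedProjectionInfo` is a hypothetical stand-in written for this example; the real `HiveColumnProjectionInfo` has more fields and a different constructor.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical stand-in for a projection-info class (NOT the real
// HiveColumnProjectionInfo): dereference names are lowercased once,
// at construction time, instead of at every call site.
final class NormalizedProjectionInfo
{
    private final List<String> dereferenceNames;

    NormalizedProjectionInfo(List<String> dereferenceNames)
    {
        // Locale.ENGLISH keeps the result independent of the JVM default locale.
        this.dereferenceNames = Collections.unmodifiableList(
                dereferenceNames.stream()
                        .map(name -> name.toLowerCase(Locale.ENGLISH))
                        .collect(Collectors.toList()));
    }

    List<String> getDereferenceNames()
    {
        return dereferenceNames;
    }
}
```

The trade-off is that the original casing is no longer available from the object, which matters only if some consumer needs the names exactly as written.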
@phd3 Your test and table of the results are great! The follow-up question is, what happens if you swap the order of the fields in the struct? It's not clear if the different Hive versions are picking the lower or uppercase version, or if they are picking the first or second one. Would this distinction make a difference to the behavior in this PR?
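The two possibilities raised here can be told apart with a swapped-order test. The sketch below models them as two hypothetical policies, `firstMatch` and `preferExactCase`; neither is taken from Hive or Presto source, and which one a given Hive version follows is exactly the open question.

```java
import java.util.List;

// Hypothetical policies for choosing among fields whose names differ only
// by case. These model the two candidate behaviors; they are not Hive code.
final class DuplicateFieldPolicies
{
    private DuplicateFieldPolicies() {}

    // Position-based: the first field matching ignoring case wins.
    static int firstMatch(List<String> fieldNames, String requested)
    {
        for (int i = 0; i < fieldNames.size(); i++) {
            if (fieldNames.get(i).equalsIgnoreCase(requested)) {
                return i;
            }
        }
        return -1;
    }

    // Case-based: an exact-case match wins regardless of position;
    // otherwise fall back to the first case-insensitive match.
    static int preferExactCase(List<String> fieldNames, String requested)
    {
        int exact = fieldNames.indexOf(requested);
        return exact >= 0 ? exact : firstMatch(fieldNames, requested);
    }
}
```

For a struct with fields `A, a` and a request for `a`, `firstMatch` selects `A` while `preferExactCase` selects `a`. After swapping to `a, A`, both select `a`. A result that changes when the fields are swapped therefore points to position-based selection.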
👋 @phd3 - this PR has become inactive. Please let us know if you will continue to work on this or if we can close the PR. We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.
Superseded by #13423.
Co-authored-by: Xingyuan Lin @lxynov
Hive also performs case insensitive matching for struct field names: