Presto does not support reading ORC structs by the ordinal of Hive metadata #4321

Open
ArvinZheng opened this issue Jul 2, 2020 · 3 comments

Comments

@ArvinZheng
Member

ArvinZheng commented Jul 2, 2020

We recently upgraded two of our Presto clusters to 0.208 and 317 and found that, after the upgrade, Presto's default schema evolution for structs changed from positional to name-based, with no option for positional mapping.

For example, the following query runs fine in 0.180, and data for cost.raw_cost is returned as expected.

select
	logdate ,
	cost.raw_cost
from
	core.log
cross join unnest(costs) AS t(cost)
where
	logdate = '2019-11-01'
	and hour = '10'
limit 1000

After moving to 0.208 or 317, the following query always returns null for cost.raw_cost.

select
	logdate ,
	cost.raw_cost
from
	core.log
cross join unnest(costs) AS cost
where
	logdate = '2019-11-01'
	and hour = '10'
limit 1000

Note:

  1. core.log is a Hive table in which costs is an array of structs
  2. the data file format for core.log is ORC
  3. the field that Hive calls cost.raw_cost has a different name in the ORC file: raw_cost_micros
  4. if I rename the Hive struct field from raw_cost to raw_cost_micros to match the ORC metadata, we get the correct data
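
To make the notes above concrete, here is a minimal Python sketch of the two mapping strategies. It is not Presto's reader code. The field names raw_cost and raw_cost_micros come from this issue; the second field (currency), the sample values, and the function names are hypothetical and only there for illustration.

```python
# Illustrative sketch only: models how a reader might resolve struct fields.
# "currency" and the sample values are made up; only raw_cost and
# raw_cost_micros come from the issue.

def read_by_name(table_fields, file_fields, file_values):
    """Name-based: a table field gets a value only if the file has the same name."""
    by_name = dict(zip(file_fields, file_values))
    return {name: by_name.get(name) for name in table_fields}


def read_by_position(table_fields, file_fields, file_values):
    """Positional: the i-th table field gets the i-th file value, names ignored."""
    return {
        name: (file_values[i] if i < len(file_values) else None)
        for i, name in enumerate(table_fields)
    }


table_fields = ["raw_cost", "currency"]        # names in the Hive struct
file_fields = ["raw_cost_micros", "currency"]  # names stored in the ORC file
file_values = [1250000, "USD"]                 # one struct value from the file

name_based = read_by_name(table_fields, file_fields, file_values)
positional = read_by_position(table_fields, file_fields, file_values)
```

Under name-based mapping raw_cost has no counterpart in the file and comes back as null, which matches what we see after the upgrade; under positional mapping it picks up the value stored as raw_cost_micros.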

I noticed that the change was introduced in https://github.com/prestodb/presto/pull/11123/files .

IMO, when we talk about the default behavior of Hive, we should always say which Hive version we mean.
IIRC the default schema evolution in Hive currently is

  1. versions up to and including 2.1 use positional
  2. versions later than 2.1 default to name-based, but the config item orc.force.positional.evolution is provided to force positional

But the default in Presto is positional for all other columns and name-based for structs, which is not aligned with any Hive version.
I understand that aligning default behavior with Hive is not easy and that #1558 has been created to track it, but until we can make a decision on #1558 and implement it, should we think about addressing the current issue?

Updating the column names in Hive to match ORC is not easy for us: we have multiple Hive columns whose names do not match the ORC files, and we also have many downstream consumers that already subscribe to this table and rely on the current Hive table definition.

What I can think of now is

  1. change the default schema evolution for structs back to positional and use name-based only when hive.orc.use-column-names is set to true
  2. keep the current implementation and add another experimental config item to force positional schema evolution for ORC structs.

Neither is ideal, but option 2 may be safer because it won't break the current default behavior (people may already be relying on it to change their ORC structs).
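
A rough Python sketch of what option 2 could look like, just to show the shape of the idea. The force_positional flag and the function name are hypothetical; no such Presto config item exists as far as this issue is concerned, and this is not how the Presto ORC reader is structured.

```python
# Hypothetical sketch of option 2: keep name-based as the default for structs
# and switch to positional only when an opt-in flag is set.
# "force_positional" is an invented name, not an existing Presto setting.

def resolve_struct_fields(table_fields, file_fields, force_positional=False):
    """Return {table_field: matching file_field or None}."""
    if force_positional:
        # Opt-in: match by ordinal, ignoring names.
        return {
            name: (file_fields[i] if i < len(file_fields) else None)
            for i, name in enumerate(table_fields)
        }
    # Default: unchanged name-based behavior.
    available = set(file_fields)
    return {name: (name if name in available else None) for name in table_fields}


default_mapping = resolve_struct_fields(["raw_cost"], ["raw_cost_micros"])
forced_mapping = resolve_struct_fields(
    ["raw_cost"], ["raw_cost_micros"], force_positional=True
)
```

Because the flag defaults to off, anyone relying on the current name-based behavior sees no change, which is the reason option 2 looks safer to me.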

@dain, @findepi feel free to comment, cc: @martint

@ArvinZheng
Member Author

ArvinZheng commented Jul 2, 2020

I also wanted to share another finding here: in Hive 2.3.4, when orc.force.positional.evolution is set to true, all columns are mapped by their ordinal except nested structs, which matches the current Presto behavior. I'm going to start another conversation with the ORC folks to see whether this is intentional or whether they also want to address it, and I will ping here if I have any updates.

@Sarrouna

Sarrouna commented Jan 8, 2021

@ArvinZheng Do you have any updates from the ORC folks, please? I'm also looking for a solution, but I'm still at the first step of choosing which data format to use for our project.
Does ORC support schema evolution with Presto or not? If I set hive.orc.use-column-names to true or force orc.force.positional.evolution to true, will ORC accept an evolution in the schema?

Thanks in advance.

@ArvinZheng
Member Author

@Sarrouna, yes, the issue has been fixed in https://issues.apache.org/jira/browse/ORC-626: a new config item, orc.force.positional.evolution.level, was added to determine how many levels of nested types are read by index.
BTW, orc.force.positional.evolution is a config item of Apache ORC, whereas Presto maintains its own ORC readers, so adding it to your Presto config won't work.
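
For anyone trying to picture what a level setting means, here is a small Python sketch. It is my own simplified model, not the Apache ORC implementation: I assume that fields at nesting depth below the level are matched by position and deeper ones by name, with top-level columns at depth 0. The schema representation and the function name are made up, and costs is modeled as a plain struct column (cost) rather than an array of structs.

```python
# Simplified model of level-limited positional evolution (an assumption about
# the semantics, not ORC's actual code). A schema is a list of
# (field_name, children) pairs; children is None for primitive fields.

def map_fields(table_schema, file_schema, level, depth=0, prefix=""):
    """Return {table field path: matched file field name or None}."""
    mapping = {}
    file_names = [name for name, _ in file_schema]
    for index, (name, children) in enumerate(table_schema):
        if depth < level:
            # Within the configured number of levels: match by ordinal.
            match = index if index < len(file_schema) else None
        else:
            # Deeper than the configured level: match by name.
            match = file_names.index(name) if name in file_names else None
        path = prefix + name
        if match is None:
            mapping[path] = None
            continue
        file_name, file_children = file_schema[match]
        mapping[path] = file_name
        if children and file_children:
            mapping.update(
                map_fields(children, file_children, level, depth + 1, path + ".")
            )
    return mapping


table_schema = [("logdate", None), ("cost", [("raw_cost", None)])]
file_schema = [("logdate", None), ("cost", [("raw_cost_micros", None)])]

one_level = map_fields(table_schema, file_schema, level=1)
two_levels = map_fields(table_schema, file_schema, level=2)
```

In this model, one level leaves cost.raw_cost unmatched because the nested struct still falls back to names, while two levels map it to raw_cost_micros by position.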
