HIVE-28938: Error in LATERAL VIEW with non native tables due to prese… #5798

Merged: 6 commits, May 12, 2025
34 changes: 17 additions & 17 deletions hbase-handler/src/test/results/positive/hbase_queries.q.out
@@ -151,14 +151,14 @@ STAGE PLANS:
predicate: UDFToDouble(key) is not null (type: boolean)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: key (type: int)
expressions: UDFToDouble(key) (type: double)

I am curious why the HBase (and Kudu, in another file) column type changed from int to double. The stats above show 4 bytes, which indicates an integer.

Contributor Author

With this PR, we don't see any virtual columns for the HBase table, whereas earlier the virtual columns intended for native tables were wrongly added to HBase tables as well. This affects the plan in CBO, especially in RelFieldTrimmer.

With the PR, the plan after RelFieldTrimmer is:

2025-05-09T12:06:15,305 DEBUG [0de9da36-a1b4-4747-8aa3-b84858aed485 main] rules.RelFieldTrimmer: Plan after trimming unused fields
HiveSortLimit(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC], fetch=[20])
  HiveProject(key=[$1], value=[$2])
    HiveJoin(condition=[=($0, $3)], joinType=[inner], algorithm=[none], cost=[not available])
      HiveProject(EXPR$0=[CAST($0):DOUBLE])
        HiveFilter(condition=[IS NOT NULL(CAST($0):DOUBLE)])
          HiveProject(key=[$0])
            HiveTableScan(table=[[default, hbase_table_1]], table:alias=[hbase_table_1])
      HiveProject(key=[$0], value=[$1], EXPR$0=[CAST($0):DOUBLE])
        HiveFilter(condition=[IS NOT NULL(CAST($0):DOUBLE)])
          HiveProject(key=[$0], value=[$1])
            HiveTableScan(table=[[default, src]], table:alias=[src])

whereas earlier the plan was:

2025-05-09T12:11:50,457 DEBUG [bd84dbdb-d563-43d7-a012-3132221196b4 main] rules.RelFieldTrimmer: Plan after trimming unused fields
HiveSortLimit(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC], fetch=[20])
  HiveProject(key=[$1], value=[$2])
    HiveJoin(condition=[=(CAST($0):DOUBLE, CAST($1):DOUBLE)], joinType=[inner], algorithm=[none], cost=[not available])
      HiveFilter(condition=[IS NOT NULL(CAST($0):DOUBLE)])
        HiveProject(key=[$0])
          HiveTableScan(table=[[default, hbase_table_1]], table:alias=[hbase_table_1])
      HiveFilter(condition=[IS NOT NULL(CAST($0):DOUBLE)])
        HiveProject(key=[$0], value=[$1])
          HiveTableScan(table=[[default, src]], table:alias=[src])

We see that the CASTs are now pushed down from the Join condition into the Projects beneath it.

The new plan, with the cast of 'key' to double pushed down on both sides of the join, looks good to me, although I don't know why the extraneous virtual columns previously prevented the push-down, since the cast is on a user column. But that's a separate topic.
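
For context, the behavioural change this thread describes comes from the new Table#getVirtualColumns helper added further down in this diff. A minimal sketch of the expected behaviour (hypothetical snippet, not part of the PR; it assumes a default HiveConf and a metadata handle obtained via Hive.get):

import java.util.List;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.metadata.Hive;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.metadata.Table;
import org.apache.hadoop.hive.ql.metadata.VirtualColumn;

// Hypothetical check, not part of this PR: a plain non-native table such as the
// HBase-backed hbase_table_1 should no longer inherit the native virtual-column
// registry (INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, ...), so its row type
// in CBO contains only user columns and RelFieldTrimmer sees no phantom fields.
public class VirtualColumnSketch {
  public static void main(String[] args) throws HiveException {
    HiveConf conf = new HiveConf();
    Table hbaseTable = Hive.get(conf).getTable("default", "hbase_table_1");
    List<VirtualColumn> vcs = hbaseTable.getVirtualColumns(conf);
    System.out.println(vcs); // expected: [] for a plain non-native (HBase) table
  }
}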

outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: UDFToDouble(_col0) (type: double)
key expressions: _col0 (type: double)
null sort order: z
sort order: +
Map-reduce partition columns: UDFToDouble(_col0) (type: double)
Map-reduce partition columns: _col0 (type: double)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: src
@@ -168,29 +168,29 @@ STAGE PLANS:
predicate: UDFToDouble(key) is not null (type: boolean)
Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: key (type: string), value (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
expressions: key (type: string), value (type: string), UDFToDouble(key) (type: double)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 500 Data size: 93000 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: UDFToDouble(_col0) (type: double)
key expressions: _col2 (type: double)
null sort order: z
sort order: +
Map-reduce partition columns: UDFToDouble(_col0) (type: double)
Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
Map-reduce partition columns: _col2 (type: double)
Statistics: Num rows: 500 Data size: 93000 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: string), _col1 (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 UDFToDouble(_col0) (type: double)
1 UDFToDouble(_col0) (type: double)
0 _col0 (type: double)
1 _col2 (type: double)
outputColumnNames: _col1, _col2
Statistics: Num rows: 550 Data size: 97900 Basic stats: COMPLETE Column stats: NONE
Statistics: Num rows: 550 Data size: 102300 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col1 (type: string), _col2 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 550 Data size: 97900 Basic stats: COMPLETE Column stats: NONE
Statistics: Num rows: 550 Data size: 102300 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
@@ -206,20 +206,20 @@ STAGE PLANS:
key expressions: _col0 (type: string), _col1 (type: string)
null sort order: zz
sort order: ++
Statistics: Num rows: 550 Data size: 97900 Basic stats: COMPLETE Column stats: NONE
Statistics: Num rows: 550 Data size: 102300 Basic stats: COMPLETE Column stats: NONE
TopN Hash Memory Usage: 0.1
Execution mode: vectorized
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 550 Data size: 97900 Basic stats: COMPLETE Column stats: NONE
Statistics: Num rows: 550 Data size: 102300 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 20
Statistics: Num rows: 20 Data size: 3560 Basic stats: COMPLETE Column stats: NONE
Statistics: Num rows: 20 Data size: 3720 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 20 Data size: 3560 Basic stats: COMPLETE Column stats: NONE
Statistics: Num rows: 20 Data size: 3720 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
@@ -0,0 +1,3 @@
-- A simple explain formatted test for an iceberg table to check virtual columns in the JSON output.
create external table test (a int, b int) stored by iceberg;
explain formatted select * from test;
@@ -0,0 +1,6 @@
create external table test(id int, arr array<string>) stored by iceberg;
insert into test values (1, array("a", "b")), (2, array("c", "d")), (3, array("e", "f"));

select * from test
lateral view explode(arr) tbl1 as name
lateral view explode(arr) tbl2 as name1;
@@ -0,0 +1,17 @@
PREHOOK: query: create external table test (a int, b int) stored by iceberg
PREHOOK: type: CREATETABLE
PREHOOK: Output: database:default
PREHOOK: Output: default@test
POSTHOOK: query: create external table test (a int, b int) stored by iceberg
POSTHOOK: type: CREATETABLE
POSTHOOK: Output: database:default
POSTHOOK: Output: default@test
PREHOOK: query: explain formatted select * from test
PREHOOK: type: QUERY
PREHOOK: Input: default@test
PREHOOK: Output: hdfs://### HDFS PATH ###
POSTHOOK: query: explain formatted select * from test
POSTHOOK: type: QUERY
POSTHOOK: Input: default@test
POSTHOOK: Output: hdfs://### HDFS PATH ###
{"CBOPlan":"{\n \"rels\": [\n {\n \"id\": \"0\",\n \"relOp\": \"org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveTableScan\",\n \"table\": [\n \"default\",\n \"test\"\n ],\n \"table:alias\": \"test\",\n \"inputs\": [],\n \"rowCount\": 1.0,\n \"avgRowSize\": 8.0,\n \"rowType\": {\n \"fields\": [\n {\n \"type\": \"INTEGER\",\n \"nullable\": true,\n \"name\": \"a\"\n },\n {\n \"type\": \"INTEGER\",\n \"nullable\": true,\n \"name\": \"b\"\n },\n {\n \"type\": \"INTEGER\",\n \"nullable\": true,\n \"name\": \"PARTITION__SPEC__ID\"\n },\n {\n \"type\": \"BIGINT\",\n \"nullable\": true,\n \"name\": \"PARTITION__HASH\"\n },\n {\n \"type\": \"VARCHAR\",\n \"nullable\": true,\n \"precision\": 2147483647,\n \"name\": \"FILE__PATH\"\n },\n {\n \"type\": \"BIGINT\",\n \"nullable\": true,\n \"name\": \"ROW__POSITION\"\n },\n {\n \"type\": \"VARCHAR\",\n \"nullable\": true,\n \"precision\": 2147483647,\n \"name\": \"PARTITION__PROJECTION\"\n },\n {\n \"type\": \"BIGINT\",\n \"nullable\": true,\n \"name\": \"SNAPSHOT__ID\"\n }\n ],\n \"nullable\": false\n },\n \"colStats\": [\n {\n \"name\": \"a\",\n \"ndv\": 1,\n \"minValue\": -9223372036854775808,\n \"maxValue\": 9223372036854775807\n },\n {\n \"name\": \"b\",\n \"ndv\": 1,\n \"minValue\": -9223372036854775808,\n \"maxValue\": 9223372036854775807\n }\n ]\n },\n {\n \"id\": \"1\",\n \"relOp\": \"org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveProject\",\n \"fields\": [\n \"a\",\n \"b\"\n ],\n \"exprs\": [\n {\n \"input\": 0,\n \"name\": \"$0\"\n },\n {\n \"input\": 1,\n \"name\": \"$1\"\n }\n ],\n \"rowCount\": 1.0\n }\n ]\n}","optimizedSQL":"SELECT `a`, `b`\nFROM `default`.`test`","cboInfo":"Plan optimized by CBO.","STAGE DEPENDENCIES":{"Stage-0":{"ROOT STAGE":"TRUE"}},"STAGE PLANS":{"Stage-0":{"Fetch Operator":{"limit:":"-1","Processor Tree:":{"TableScan":{"alias:":"test","columns:":["a","b"],"database:":"default","table:":"test","isTempTable:":"false","OperatorId:":"TS_0","children":{"Select Operator":{"expressions:":"a (type: int), b (type: int)","columnExprMap:":{"_col0":"a","_col1":"b"},"outputColumnNames:":["_col0","_col1"],"OperatorId:":"SEL_1","children":{"ListSink":{"OperatorId:":"LIST_SINK_3"}}}}}}}}}}
@@ -0,0 +1,40 @@
PREHOOK: query: create external table test(id int, arr array<string>) stored by iceberg
PREHOOK: type: CREATETABLE
PREHOOK: Output: database:default
PREHOOK: Output: default@test
POSTHOOK: query: create external table test(id int, arr array<string>) stored by iceberg
POSTHOOK: type: CREATETABLE
POSTHOOK: Output: database:default
POSTHOOK: Output: default@test
PREHOOK: query: insert into test values (1, array("a", "b")), (2, array("c", "d")), (3, array("e", "f"))
PREHOOK: type: QUERY
PREHOOK: Input: _dummy_database@_dummy_table
PREHOOK: Output: default@test
POSTHOOK: query: insert into test values (1, array("a", "b")), (2, array("c", "d")), (3, array("e", "f"))
POSTHOOK: type: QUERY
POSTHOOK: Input: _dummy_database@_dummy_table
POSTHOOK: Output: default@test
PREHOOK: query: select * from test
lateral view explode(arr) tbl1 as name
lateral view explode(arr) tbl2 as name1
PREHOOK: type: QUERY
PREHOOK: Input: default@test
PREHOOK: Output: hdfs://### HDFS PATH ###
POSTHOOK: query: select * from test
lateral view explode(arr) tbl1 as name
lateral view explode(arr) tbl2 as name1
POSTHOOK: type: QUERY
POSTHOOK: Input: default@test
POSTHOOK: Output: hdfs://### HDFS PATH ###
1 ["a","b"] a a
1 ["a","b"] a b
1 ["a","b"] b a
1 ["a","b"] b b
2 ["c","d"] c c
2 ["c","d"] c d
2 ["c","d"] d c
2 ["c","d"] d d
3 ["e","f"] e e
3 ["e","f"] e f
3 ["e","f"] f e
3 ["e","f"] f f
26 changes: 13 additions & 13 deletions kudu-handler/src/test/results/positive/kudu_complex_queries.q.out
@@ -91,15 +91,15 @@ STAGE PLANS:
predicate: UDFToDouble(key) is not null (type: boolean)
Statistics: Num rows: 309 Data size: 1236 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: key (type: int)
expressions: UDFToDouble(key) (type: double)
outputColumnNames: _col0
Statistics: Num rows: 309 Data size: 1236 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 309 Data size: 2472 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: UDFToDouble(_col0) (type: double)
key expressions: _col0 (type: double)
null sort order: z
sort order: +
Map-reduce partition columns: UDFToDouble(_col0) (type: double)
Statistics: Num rows: 309 Data size: 1236 Basic stats: COMPLETE Column stats: COMPLETE
Map-reduce partition columns: _col0 (type: double)
Statistics: Num rows: 309 Data size: 2472 Basic stats: COMPLETE Column stats: COMPLETE
Execution mode: vectorized
Map 4
Map Operator Tree:
@@ -111,15 +111,15 @@ STAGE PLANS:
predicate: UDFToDouble(key) is not null (type: boolean)
Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: key (type: string), value (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
expressions: key (type: string), value (type: string), UDFToDouble(key) (type: double)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 500 Data size: 93000 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: UDFToDouble(_col0) (type: double)
key expressions: _col2 (type: double)
null sort order: z
sort order: +
Map-reduce partition columns: UDFToDouble(_col0) (type: double)
Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
Map-reduce partition columns: _col2 (type: double)
Statistics: Num rows: 500 Data size: 93000 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: string), _col1 (type: string)
Execution mode: vectorized
Reducer 2
@@ -128,8 +128,8 @@ STAGE PLANS:
condition map:
Inner Join 0 to 1
keys:
0 UDFToDouble(_col0) (type: double)
1 UDFToDouble(_col0) (type: double)
0 _col0 (type: double)
1 _col2 (type: double)
outputColumnNames: _col1, _col2
Statistics: Num rows: 488 Data size: 86864 Basic stats: COMPLETE Column stats: COMPLETE
Top N Key Operator
19 changes: 19 additions & 0 deletions ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java
@@ -61,6 +61,7 @@
import org.apache.hadoop.hive.metastore.utils.MetaStoreServerUtils;
import org.apache.hadoop.hive.metastore.utils.MetaStoreUtils;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.io.AcidUtils;
import org.apache.hadoop.hive.ql.io.HiveFileFormatUtils;
import org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat;
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.TableSpec;
@@ -85,6 +86,8 @@

import com.google.common.base.Preconditions;

import static org.apache.commons.lang3.StringUtils.isBlank;

/**
* A Hive Table: is a fundamental unit of data in Hive that shares a common schema/DDL.
*
@@ -1364,4 +1367,20 @@ public SourceTable createSourceTable() {
sourceTable.setDeletedCount(0L);
return sourceTable;
}

public List<VirtualColumn> getVirtualColumns(HiveConf conf) {
List<VirtualColumn> virtualColumns = new ArrayList<>();
if (!isNonNative()) {
virtualColumns.addAll(VirtualColumn.getRegistry(conf));
}
if (isNonNative() && AcidUtils.isNonNativeAcidTable(this)) {
virtualColumns.addAll(getStorageHandler().acidVirtualColumns());
}
if (isNonNative() && getStorageHandler().areSnapshotsSupported() &&
isBlank(getMetaTable())) {
virtualColumns.add(VirtualColumn.SNAPSHOT_ID);
}

return virtualColumns;
}
}
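
With this helper in place, both the CBO path (CalcitePlanner, next hunk) and the legacy analyzer path (SemanticAnalyzer, the hunk after that) reduce to a single call. A rough sketch of the three behaviours the method encodes (hypothetical fragment; the table handles and conf are assumed to exist, and the exact column sets depend on the storage handler):

// Native table: the full native registry from VirtualColumn.getRegistry(conf).
List<VirtualColumn> nativeVcs = nativeTable.getVirtualColumns(conf);
// Non-native, non-ACID table without snapshot support (e.g. HBase-backed): empty list.
List<VirtualColumn> hbaseVcs = hbaseTable.getVirtualColumns(conf);
// Non-native table with snapshot support and no meta table (e.g. Iceberg-backed):
// the storage handler's ACID virtual columns (when it is a non-native ACID table)
// plus SNAPSHOT_ID.
List<VirtualColumn> icebergVcs = icebergTable.getVirtualColumns(conf);
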
26 changes: 8 additions & 18 deletions ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
@@ -366,7 +366,6 @@

import javax.sql.DataSource;

import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.hadoop.hive.ql.optimizer.calcite.HiveMaterializedViewASTSubQueryRewriteShuttle.getMaterializedViewByAST;
import static org.apache.hadoop.hive.ql.metadata.RewriteAlgorithm.ANY;

@@ -3095,23 +3094,14 @@ private RelNode genTableLogicalPlan(String tableAlias, QB qb) throws SemanticExc
final TableType tableType = obtainTableType(tabMetaData);

// 3.3 Add column info corresponding to virtual columns
List<VirtualColumn> virtualCols = new ArrayList<>();
if (tableType == TableType.NATIVE) {
virtualCols = VirtualColumn.getRegistry(conf);
if (AcidUtils.isNonNativeAcidTable(tabMetaData)) {
virtualCols.addAll(tabMetaData.getStorageHandler().acidVirtualColumns());
}
if (tabMetaData.isNonNative() && tabMetaData.getStorageHandler().areSnapshotsSupported() &&
isBlank(tabMetaData.getMetaTable())) {
virtualCols.add(VirtualColumn.SNAPSHOT_ID);
}
for (VirtualColumn vc : virtualCols) {
colInfo = new ColumnInfo(vc.getName(), vc.getTypeInfo(), tableAlias, true,
vc.getIsHidden());
rr.put(tableAlias, vc.getName().toLowerCase(), colInfo);
cInfoLst.add(colInfo);
}
}
List<VirtualColumn> virtualCols = tabMetaData.getVirtualColumns(conf);

virtualCols
.forEach(vc ->
rr.put(tableAlias, vc.getName().toLowerCase(),
new ColumnInfo(vc.getName(), vc.getTypeInfo(), tableAlias, true, vc.getIsHidden())
)
);

// 4. Build operator
Map<String, String> tabPropsFromQuery = qb.getTabPropsForAlias(tableAlias);
@@ -11988,13 +11988,7 @@ private Operator genTablePlan(String alias, QB qb) throws SemanticException {
}

// put virtual columns into RowResolver.
List<VirtualColumn> vcList = new ArrayList<>();
if (!tab.isNonNative()) {
vcList.addAll(VirtualColumn.getRegistry(conf));
}
if (tab.isNonNative() && AcidUtils.isNonNativeAcidTable(tab)) {
vcList.addAll(tab.getStorageHandler().acidVirtualColumns());
}
List<VirtualColumn> vcList = tab.getVirtualColumns(conf);

vcList.forEach(vc -> rwsch.put(alias, vc.getName().toLowerCase(), new ColumnInfo(vc.getName(),
vc.getTypeInfo(), alias, true, vc.getIsHidden()