
Spark SQL queries against a Hudi partitioned table generated by batch sync fail, while Hive and Presto queries work fine #131

Closed
ChenShuai1981 opened this issue Jun 8, 2022 · 2 comments
Labels: 3.8.0, bug

Comments

@ChenShuai1981

The order_info table is configured with the slashEncodedDay partition strategy on the create_time field, with pt set as the partition column, and the triggered sync completes successfully. Querying the table's partitions and row count with both Hive and Presto returns correct results, but switching to spark-sql to list the partitions produces the following error:

spark-sql> show partitions order_info;
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on order_info since its partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table order_info`;
  at org.apache.spark.sql.execution.command.DDLUtils$.verifyPartitionProviderIsHive(ddl.scala:835)
  at org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(tables.scala:888)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)

After running msck repair table order_info, re-running the command above works as expected:

scala> spark.sql("msck repair table order_info").show(100, false)
22/06/08 10:17:05 WARN command.AlterTableRecoverPartitionsCommand: ignore hdfs://namenode/user/admin/default/20220608181019/order_info/hudi/2019
++
||
++
++

spark-sql> show partitions order_info;
+-------------+
|partition    |
+-------------+
|pt=2019-11-23|
+-------------+

Next, querying the row count fails:

spark-sql> select count(1) from order_info
Caused by: java.io.IOException: Required column is missing in data file. Col: [pt]
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initializeInternal(VectorizedParquetRecordReader.java:292)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:132)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:418)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
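The data files themselves do not contain a pt column (pt only exists as a partition value in the path and metastore, which is why Hive and Presto can fill it in), so Spark's built-in vectorized parquet reader fails when it looks for it inside the files. Two possible workarounds from spark-shell, sketched here as assumptions rather than a confirmed fix for this issue:

// 1) Make Spark fall back to the Hive input format instead of its built-in
//    parquet reader, so partition columns come from the metastore the same
//    way they do for Hive and Presto.
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("select count(1) from order_info").show()

// 2) Bypass the Hive table and read through the Hudi datasource directly;
//    the base path below is copied from the msck repair warning above and
//    may differ. Older Hudi releases may need format("org.apache.hudi")
//    and a /*/*/* path glob for the day-level partition directories.
val hudiPath = "hdfs://namenode/user/admin/default/20220608181019/order_info/hudi"
val hudiDf = spark.read.format("hudi").load(hudiPath)
hudiDf.createOrReplaceTempView("order_info_hudi")
spark.sql("select count(1) from order_info_hudi").show()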
@baisui1981
Member

Received.

baisui1981 added the 3.8.0 and bug labels on Jun 8, 2022
@baisui1981
Member

complete
