
Spark SQL queries against a Hudi partitioned table generated by batch sync fail, while Hive and Presto queries work fine #131

Closed
ChenShuai1981 opened this issue Jun 8, 2022 · 2 comments
Labels: 3.8.0, bug

Comments

@ChenShuai1981

The order_info table is configured with the slashEncodedDay partition strategy on the create_time field, with pt set as the partition column, and the triggered sync completes successfully. Querying the table's partitions and row count with both Hive and Presto returns correct results, but switching to spark-sql to list the partitions produces the following error:

spark-sql> show partitions order_info;
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on order_info since its partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table order_info`;
  at org.apache.spark.sql.execution.command.DDLUtils$.verifyPartitionProviderIsHive(ddl.scala:835)
  at org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(tables.scala:888)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)

After running msck repair table order_info, re-running the command above works as expected:

scala> spark.sql("msck repair table order_info").show(100, false)
22/06/08 10:17:05 WARN command.AlterTableRecoverPartitionsCommand: ignore hdfs://namenode/user/admin/default/20220608181019/order_info/hudi/2019
++
||
++
++

spark-sql> show partitions order_info;
+-------------+
|partition    |
+-------------+
|pt=2019-11-23|
+-------------+

Next, querying the row count fails:

spark-sql> select count(1) from order_info
Caused by: java.io.IOException: Required column is missing in data file. Col: [pt]
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initializeInternal(VectorizedParquetRecordReader.java:292)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:132)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:418)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
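The data files themselves do not contain a pt column (pt only exists as a partition value in the path and metastore, which is why Hive and Presto can fill it in), so Spark's built-in vectorized parquet reader fails when it looks for it inside the files. Two possible workarounds from spark-shell, sketched here as assumptions rather than a confirmed fix for this issue:

// 1) Make Spark fall back to the Hive input format instead of its built-in
//    parquet reader, so partition columns come from the metastore the same
//    way they do for Hive and Presto.
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("select count(1) from order_info").show()

// 2) Bypass the Hive table and read through the Hudi datasource directly;
//    the base path below is copied from the msck repair warning above and
//    may differ. Older Hudi releases may need format("org.apache.hudi")
//    and a /*/*/* path glob for the day-level partition directories.
val hudiPath = "hdfs://namenode/user/admin/default/20220608181019/order_info/hudi"
val hudiDf = spark.read.format("hudi").load(hudiPath)
hudiDf.createOrReplaceTempView("order_info_hudi")
spark.sql("select count(1) from order_info_hudi").show()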
@baisui1981
Member

Received.

baisui1981 added the 3.8.0 and bug labels on Jun 8, 2022
@baisui1981
Member

complete
