
Blocks read inconsistent #101

Closed

shenghui361 opened this issue Apr 5, 2022 · 6 comments

Comments

@shenghui361

spark version: 3.2.1
rss version: master
sql: tpc-ds[10T] query17

spark parameters:
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.rss.storage.type=HDFS
spark.rss.base.path=hdfs://ns/tmp/rss/hdfs_base_path
spark.rss.data.replica=2
spark.dynamicAllocation.enabled=false
spark.shuffle.service.enabled=false
spark.rss.coordinator.quorum=coordinator1:19999,coordinator2:19999

```
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2455)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2404)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2403)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2403)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2643)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2585)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2574)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: com.tencent.rss.common.exception.RssException: Blocks read inconsistent: expected 310 blocks, actual 0 blocks
at com.tencent.rss.client.impl.ShuffleReadClientImpl.checkProcessedBlockIds(ShuffleReadClientImpl.java:222)
at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:126)
at org.apache.spark.shuffle.reader.RssShuffleReader$MultiPartitionIterator.hasNext(RssShuffleReader.java:213)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.sort_addToSorter_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.smj_findNextJoinRows_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:778)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:138)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

```

@colinmjj
Collaborator

colinmjj commented Apr 6, 2022

"Blocks read inconsistent" means the client can't read the expected blocks that were sent to the shuffle server. It may be caused by a write problem. For the storage type, please use MEMORY_HDFS instead of HDFS.
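This suggestion amounts to a one-line client-side change. A minimal sketch, assuming MEMORY_HDFS is an accepted value for spark.rss.storage.type in this build:

```properties
# spark-defaults.conf (client side): switch from HDFS-only to memory-backed storage
spark.rss.storage.type=MEMORY_HDFS
spark.rss.base.path=hdfs://ns/tmp/rss/hdfs_base_path
```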

@shenghui361
Author

MEMORY_LOCALFILE has the same problem.

@latincross

With the local_hdfs mode, you will find that the data stored locally and the data stored on HDFS are not the same, and then you get a file-not-found error.

@jerqi
Collaborator

jerqi commented Apr 7, 2022

> MEMORY_LOCALFILE has the same problem.

What's the configuration of the shuffle server?

@shenghui361
Author

shenghui361 commented Apr 8, 2022

> MEMORY_LOCALFILE has the same problem.

> What's the configuration of shuffle server?

rss.rpc.server.port 19999
rss.jetty.http.port 19998
rss.storage.basePath /HDATA/1/rssdata,/HDATA/2/rssdata,/HDATA/3/rssdata,/HDATA/4/rssdata,/HDATA/5/rssdata,/HDATA/6/rssdata
#rss.storage.type MEMORY_LOCALFILE
rss.storage.type MEMORY_LOCALFILE_HDFS
rss.coordinator.quorum coordinator1:19999, coordinator2:19999
rss.server.buffer.capacity 40gb
rss.server.buffer.spill.threshold 22gb
rss.server.partition.buffer.size 150mb
rss.server.read.buffer.capacity 20gb
rss.server.flush.thread.alive 50
rss.server.flush.threadPool.size 100

rss.server.hdfs.base.path hdfs://ns/tmp/rss/hdfs_base_path

# multistorage config
rss.server.multistorage.enable true
#rss.server.uploader.enable false
#rss.server.uploader.base.path hdfs://ns/tmp/rss/uploader_base_path
#rss.server.uploader.thread.number 32
# rss.server.disk.capacity 1011550697553

@jerqi
Collaborator

jerqi commented Apr 11, 2022

The client should keep its storage type consistent with the server's. Here the client was launched with spark.rss.storage.type=HDFS while the server was configured with rss.storage.type MEMORY_LOCALFILE_HDFS.
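A consistent client/server pair might look like the sketch below (assuming the client-side spark.rss.storage.type property accepts the same MEMORY_LOCALFILE_HDFS value as the server; the paths are the ones posted earlier in this thread):

```properties
# shuffle server conf
rss.storage.type MEMORY_LOCALFILE_HDFS
rss.server.hdfs.base.path hdfs://ns/tmp/rss/hdfs_base_path

# spark client conf: storage type must match the server
spark.rss.storage.type=MEMORY_LOCALFILE_HDFS
spark.rss.base.path=hdfs://ns/tmp/rss/hdfs_base_path
```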

@jerqi jerqi closed this as completed Apr 19, 2022