Failed when Spark ESS is enabled #83

Closed
Lobo2008 opened this issue Feb 22, 2022 · 4 comments

Comments

@Lobo2008

Lobo2008 commented Feb 22, 2022

Hi,
I am running WordCount on Spark 2.4.5 (on YARN) with Firestorm 0.2.0.
The Spark External Shuffle Service (ESS) is running on the NodeManager, with spark-2.4.5-yarn-shuffle.jar on its classpath.
With spark.shuffle.service.enabled=false the job works fine, but with spark.shuffle.service.enabled=true it fails.
In both cases the coordinator and the shuffle server detect the application.
So does Firestorm not need ESS to be enabled? If that is not the case, what should I do?
Thank you.

########  ESS enabled #########

# container log
22/02/22 10:52:33 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 1000, initial allocation : 200) intervals

[Stage 0:>                                                          (0 + 0) / 2]
[Stage 0:>                                                          (0 + 2) / 2]22/02/22 10:53:20 ERROR YarnClusterScheduler: Lost executor 2 on 10.163.4.4: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.UnsupportedOperationException: Unsupported shuffle manager of executor: ExecutorShuffleInfo{localDirs=[/home/yarn/nm-local-dir/usercache/MYUSER/appcache/application_1644546216413_0288/blockmgr-a9f411a8-baac-4dfd-9044-7902fc9ebbd9], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.RssShuffleManager}
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(ExternalShuffleBlockResolver.java:149)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:113)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:81)
	at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:180)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
	at org.spark_project.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
	at org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
	at org.spark_project.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.spark_project.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)

22/02/22 10:53:21 ERROR YarnClusterScheduler: Lost executor 1 on 10.163.4.4: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.UnsupportedOperationException: Unsupported shuffle manager of executor: ExecutorShuffleInfo{localDirs=[/home/yarn/nm-local-dir/usercache/MYUSER/appcache/application_1644546216413_0288/blockmgr-44708e30-2a4c-4063-a31e-5b9db7194f04], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.RssShuffleManager}
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(ExternalShuffleBlockResolver.java:149)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:113)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:81)
	at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:180)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
	at org.spark_project.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
	at org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
	at org.spark_project.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.spark_project.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)


[Stage 0:>                                                          (0 + 1) / 2]22/02/22 10:53:25 ERROR YarnClusterScheduler: Lost executor 5 on 10.163.4.4: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.UnsupportedOperationException: Unsupported shuffle manager of executor: ExecutorShuffleInfo{localDirs=[/home/yarn/nm-local-dir/usercache/MYUSER/appcache/application_1644546216413_0288/blockmgr-5ddc98e8-2935-4805-9c16-2c8a09a4d9e7], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.RssShuffleManager}
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(ExternalShuffleBlockResolver.java:149)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:113)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:81)
	at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:180)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
	at org.spark_project.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
	at org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
	at org.spark_project.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.spark_project.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)

22/02/22 10:53:25 ERROR YarnClusterScheduler: Lost executor 3 on 10.163.4.4: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.UnsupportedOperationException: Unsupported shuffle manager of executor: ExecutorShuffleInfo{localDirs=[/home/yarn/nm-local-dir/usercache/MYUSER/appcache/application_1644546216413_0288/blockmgr-443ec3cb-e1f4-40d1-8954-58c115527d93], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.RssShuffleManager}
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(ExternalShuffleBlockResolver.java:149)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:113)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:81)
	at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:180)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
	at org.spark_project.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
	at org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
	at org.spark_project.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.spark_project.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)


[Stage 0:>                                                          (0 + 0) / 2]22/02/22 10:54:06 ERROR YarnClusterScheduler: Lost executor 6 on 10.163.4.4: Unable to create executor due to Unable to register with external shuffle server due to : java.lang.UnsupportedOperationException: Unsupported shuffle manager of executor: ExecutorShuffleInfo{localDirs=[/home/yarn/nm-local-dir/usercache/MYUSER/appcache/application_1644546216413_0288/blockmgr-9b801e2a-30b7-4178-bfeb-c3c694ac6691], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.RssShuffleManager}
	at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(ExternalShuffleBlockResolver.java:149)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:113)
	at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:81)
	at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:180)
	at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at org.spark_project.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
	at org.spark_project.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
	at org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
	at org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
	at org.spark_project.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.spark_project.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)

# coordinator log:

[INFO] 2022-02-22 11:03:06,229 Grpc-181 CoordinatorGrpcService getShuffleAssignments - Request of getShuffleAssignments for appId[application_1644546216413_0292_1645498981354], shuffleId[0], partitionNum[2], partitionNumPerRange[1], replica[1]
[WARN] 2022-02-22 11:03:06,231 Grpc-181 PartitionBalanceAssignmentStrategy assign - Can't get expected servers [5] and found only [3]
[INFO] 2022-02-22 11:03:06,231 Grpc-181 CoordinatorGrpcService logAssignmentResult - Shuffle Servers of assignment for appId[application_1644546216413_0292_1645498981354], shuffleId[0] are [10.163.4.11-19999, 10.163.4.9-19999]
[INFO] 2022-02-22 11:03:26,338 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 1 applications
[INFO] 2022-02-22 11:03:56,338 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 1 applications
[INFO] 2022-02-22 11:04:26,338 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 1 applications


# shuffle server log
[INFO] 2022-02-22 11:03:07,064 Grpc-529 ShuffleServerGrpcService registerShuffle - Get register request for appId[application_1644546216413_0292_1645498981354], shuffleId[0] with 1 partition ranges
[INFO] 2022-02-22 11:03:12,153 Grpc-531 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0292_1645498981354
[INFO] 2022-02-22 11:03:22,095 Grpc-533 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0292_1645498981354
[INFO] 2022-02-22 11:03:32,085 Grpc-535 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0292_1645498981354
########### ESS disabled ############

# coordinator log
[INFO] 2022-02-22 11:09:30,033 Grpc-566 CoordinatorGrpcService getShuffleAssignments - Request of getShuffleAssignments for appId[application_1644546216413_0293_1645499365247], shuffleId[0], partitionNum[2], partitionNumPerRange[1], replica[1]
[WARN] 2022-02-22 11:09:30,034 Grpc-566 PartitionBalanceAssignmentStrategy assign - Can't get expected servers [5] and found only [3]
[INFO] 2022-02-22 11:09:30,035 Grpc-566 CoordinatorGrpcService logAssignmentResult - Shuffle Servers of assignment for appId[application_1644546216413_0293_1645499365247], shuffleId[0] are [10.163.4.11-19999, 10.163.4.9-19999]
[INFO] 2022-02-22 11:09:56,338 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 1 applications
[INFO] 2022-02-22 11:10:26,338 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 1 applications
[INFO] 2022-02-22 11:10:56,338 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 1 applications
[INFO] 2022-02-22 11:11:26,338 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 1 applications
[INFO] 2022-02-22 11:11:56,338 ApplicationManager-0 ApplicationManager statusCheck - Start to check status for 1 applications
[INFO] 2022-02-22 11:11:56,338 ApplicationManager-0 ApplicationManager statusCheck - Remove expired application:application_1644546216413_0293_1645499365247


# shuffle server log

[INFO] 2022-02-22 11:07:21,981 expiredAppCleaner ShuffleTaskManager checkResourceStatus - Detect expired appId[application_1644546216413_0292_1645498981354] according to rss.server.app.expired.withoutHeartbeat
[INFO] 2022-02-22 11:07:21,981 clearResourceThread ShuffleTaskManager removeResources - Start remove resource for appId[application_1644546216413_0292_1645498981354]
[INFO] 2022-02-22 11:07:21,981 clearResourceThread ShuffleTaskManager removeResources - Finish remove resource for appId[application_1644546216413_0292_1645498981354] cost 0 ms
[INFO] 2022-02-22 11:09:30,458 Grpc-558 ShuffleServerGrpcService registerShuffle - Get register request for appId[application_1644546216413_0293_1645499365247], shuffleId[0] with 1 partition ranges
[INFO] 2022-02-22 11:09:35,497 Grpc-560 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0293_1645499365247
[INFO] 2022-02-22 11:09:45,469 Grpc-562 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0293_1645499365247
[INFO] 2022-02-22 11:09:55,467 Grpc-564 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0293_1645499365247
[INFO] 2022-02-22 11:10:05,469 Grpc-566 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0293_1645499365247
[INFO] 2022-02-22 11:10:15,467 Grpc-580 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0293_1645499365247
[INFO] 2022-02-22 11:10:15,677 Grpc-584 ShuffleServerGrpcService reportShuffleResult - Report 1 blocks as shuffle result for the task of appId[application_1644546216413_0293_1645499365247], shuffleId[0], taskAttemptId[0]
[INFO] 2022-02-22 11:10:16,547 Grpc-590 ShuffleServerGrpcService finishShuffle - Get finishShuffle request for appId[application_1644546216413_0293_1645499365247], shuffleId[0]
[WARN] 2022-02-22 11:10:16,549 Grpc-590 ShuffleFlushManager getCommittedBlockIds - Unexpected value when getCommittedBlockIds for appId[application_1644546216413_0293_1645499365247]
[INFO] 2022-02-22 11:10:16,552 pool-5-thread-9 LocalStorageMeta createMetadataIfNotExist - Create metadata of shuffle application_1644546216413_0293_1645499365247/0.
[INFO] 2022-02-22 11:10:17,550 Grpc-590 ShuffleTaskManager commitShuffle - Checking commit result for appId[application_1644546216413_0293_1645499365247], shuffleId[0], expect committed[2], remain[2]
[INFO] 2022-02-22 11:10:17,553 Grpc-590 ShuffleTaskManager commitShuffle - Finish commit for appId[application_1644546216413_0293_1645499365247], shuffleId[0] with expectedCommitted[2], cost 1006 ms to check
[INFO] 2022-02-22 11:10:18,462 Grpc-592 ShuffleServerGrpcService reportShuffleResult - Report 1 blocks as shuffle result for the task of appId[application_1644546216413_0293_1645499365247], shuffleId[0], taskAttemptId[1]
[INFO] 2022-02-22 11:10:25,471 Grpc-594 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0293_1645499365247
[INFO] 2022-02-22 11:10:35,469 Grpc-596 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0293_1645499365247
[INFO] 2022-02-22 11:10:45,471 Grpc-598 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0293_1645499365247
[INFO] 2022-02-22 11:10:47,055 Grpc-604 ShuffleServerGrpcService getLocalShuffleIndex - Successfully getShuffleIndex cost 0 ms for 80 bytes with appId[application_1644546216413_0293_1645499365247], shuffleId[0], partitionId[1]
[INFO] 2022-02-22 11:10:47,076 Grpc-607 ShuffleServerGrpcService getLocalShuffleData - Successfully getShuffleData cost 1 ms for shuffle data with appId[application_1644546216413_0293_1645499365247], shuffleId[0], partitionId[1]offset[0]length[1473]
[INFO] 2022-02-22 11:10:55,972 Grpc-609 ShuffleServerGrpcService appHeartbeat - Get heartbeat from application_1644546216413_0293_1645499365247
[INFO] 2022-02-22 11:13:21,981 expiredAppCleaner ShuffleTaskManager checkResourceStatus - Detect expired appId[application_1644546216413_0293_1645499365247] according to rss.server.app.expired.withoutHeartbeat
[INFO] 2022-02-22 11:13:21,981 clearResourceThread ShuffleTaskManager removeResources - Start remove resource for appId[application_1644546216413_0293_1645499365247]
[INFO] 2022-02-22 11:13:21,984 clearResourceThread MultiStorageManager removeResources - Start to remove resource of appId: application_1644546216413_0293_1645499365247, shuffles: [0]
[INFO] 2022-02-22 11:13:21,984 clearResourceThread LocalStorage removeResources - Start to remove resource of application_1644546216413_0293_1645499365247/0
@colinmjj
Collaborator

colinmjj commented Feb 22, 2022

@Lobo2008 To enable dynamic allocation with Firestorm, refer to the README section "Support Spark dynamic allocation".
In short, a patch must be applied to Spark and the configuration updated.

@Lobo2008
Author

Lobo2008 commented Feb 22, 2022

@Lobo2008 To enable dynamic allocation with Firestorm, refer to the README section "Support Spark dynamic allocation". In short, a patch must be applied to Spark and the configuration updated.

Dynamic allocation is disabled.

spark.dynamicAllocation.enabled   false (always false)
spark.shuffle.service.enabled   false|true (I change this)

So you mean spark.shuffle.service.enabled should always be false, because Firestorm handles all shuffle tasks and YARN (NM) no longer needs to do that job?

@colinmjj
Collaborator

Yes, spark.shuffle.service.enabled should always be false when Firestorm is enabled.
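
As a rough illustration of this rule, here is a minimal sketch of a pre-submit check on a Spark configuration. It flags the combination seen in the container log above, where the executor registered with shuffleManager=org.apache.spark.shuffle.RssShuffleManager while spark.shuffle.service.enabled was true. The helper name `check_shuffle_conf` and the plain-dict representation of the configuration are assumptions made for this example; they are not part of Spark or Firestorm.

```python
# Hypothetical pre-submit check; not an API of Spark or Firestorm.
# The class name is taken from the "Unsupported shuffle manager" error in the log.
RSS_SHUFFLE_MANAGER = "org.apache.spark.shuffle.RssShuffleManager"


def check_shuffle_conf(conf):
    """Return a list of problems found in a Spark conf given as a dict of strings."""
    problems = []
    manager = conf.get("spark.shuffle.manager", "sort")
    # Spark conf values are strings, so compare case-insensitively against "true".
    ess_enabled = str(conf.get("spark.shuffle.service.enabled", "false")).lower() == "true"

    if manager == RSS_SHUFFLE_MANAGER and ess_enabled:
        problems.append(
            "spark.shuffle.service.enabled must be false when "
            "spark.shuffle.manager is " + RSS_SHUFFLE_MANAGER
        )
    return problems


# The failing setup from this issue: Firestorm shuffle manager with ESS enabled.
failing = {
    "spark.shuffle.manager": RSS_SHUFFLE_MANAGER,
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.enabled": "false",
}

# The working setup from this issue: same job with ESS disabled.
working = dict(failing, **{"spark.shuffle.service.enabled": "false"})

for name, conf in (("failing", failing), ("working", working)):
    print(name, check_shuffle_conf(conf))
```

The check only covers the case discussed in this thread. It does not cover dynamic allocation, which according to the README needs a Spark patch and further configuration.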

@Lobo2008
Author

Yes, spark.shuffle.service.enabled should always be false when Firestorm is enabled.

Thanks a lot!

@jerqi jerqi closed this as completed Feb 25, 2022