Used Spark version: 2.1.1.2.6.1.10-4
Used Spark Job Server version: based on https://github.com/spark-jobserver/spark-jobserver/tree/spark-2.0-preview
Deployed mode: yarn-client
Actual (wrong) behavior:
SJS cannot restart a Spark job after its SparkContext has been shut down. The Spark job is killed for the following reason:
```
[2017-10-06 10:40:55,550] ERROR .jobserver.JobManagerActor [] [] - About to restart actor due to exception:
java.util.concurrent.TimeoutException: Futures timed out after [3 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:167)
at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:165)
at scala.concurrent.Await$.result(package.scala:190)
at spark.jobserver.JobManagerActor.startJobInternal(JobManagerActor.scala:219)
at spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:157)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.ActorStack$$anonfun$receive$1.applyOrElse(ActorStack.scala:33)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1$$anonfun$applyOrElse$1.apply$mcV$sp(Slf4jLogging.scala:25)
at spark.jobserver.common.akka.Slf4jLogging$class.spark$jobserver$common$akka$Slf4jLogging$$withAkkaSourceLogging(Slf4jLogging.scala:34)
at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1.applyOrElse(Slf4jLogging.scala:24)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.ActorMetrics$$anonfun$receive$1.applyOrElse(ActorMetrics.scala:23)
at akka.actor.Actor$class.aroundReceive(Actor.scala:484)
at spark.jobserver.common.akka.InstrumentedActor.aroundReceive(InstrumentedActor.scala:8)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[2017-10-06 10:40:55,557] INFO .jobserver.JobManagerActor [] [] - Shutting down SparkContext hc1
```
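The timeout above points at a blocking `Await` inside the actor's receive. Below is a minimal, self-contained sketch of that pattern, under the assumption that `JobManagerActor` blocks on an ask to another actor with a 3-second timeout; actor and message names here are hypothetical, not the actual spark-jobserver code:

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// An actor that never replies, standing in for whatever dependency
// fails to answer within the 3-second window (hypothetical).
class SilentDependency extends Actor {
  def receive = { case _ => () } // swallow messages, send no reply
}

// Blocking with Await.result inside receive: when the dependency stays
// silent for 3 seconds, Await.result throws
// java.util.concurrent.TimeoutException, the supervisor restarts the
// actor, and (in SJS) the restart path shuts the SparkContext down.
class Manager(dep: akka.actor.ActorRef) extends Actor {
  implicit val timeout: Timeout = Timeout(3.seconds)
  def receive = {
    case msg => Await.result(dep ? msg, timeout.duration)
  }
}

object TimeoutRepro extends App {
  val system = ActorSystem("TimeoutRepro")
  val dep = system.actorOf(Props[SilentDependency], "dep")
  val manager = system.actorOf(Props(new Manager(dep)), "manager")
  manager ! "start-job" // TimeoutException in the manager after 3 s
}
```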
After this, SJS tries to restart the actors several times, but without success:
```
[2017-10-06 10:41:00,197] INFO YarnClientSchedulerBackend [] [] - Stopped
[2017-10-06 10:41:00,206] INFO utputTrackerMasterEndpoint [] [akka://JobServer/user/jobManager-fe-a4d8-39ca544b101d] - MapOutputTrackerMasterEndpoint stopped!
[2017-10-06 10:41:00,467] INFO storage.memory.MemoryStore [] [] - MemoryStore cleared
[2017-10-06 10:41:00,473] INFO spark.storage.BlockManager [] [] - BlockManager stopped
[2017-10-06 10:41:00,504] INFO storage.BlockManagerMaster [] [] - BlockManagerMaster stopped
[2017-10-06 10:41:00,512] INFO tCommitCoordinatorEndpoint [] [akka://JobServer/user/jobManager-fe-a4d8-39ca544b101d] - OutputCommitCoordinator stopped!
[2017-10-06 10:41:00,597] INFO .apache.spark.SparkContext [] [] - Successfully stopped SparkContext
[2017-10-06 10:41:00,598] INFO .jobserver.JobManagerActor [] [akka://JobServer/user/jobManager-fe-a4d8-39ca544b101d] - Starting actor spark.jobserver.JobManagerActor
[2017-10-06 10:41:00,623] ERROR .jobserver.JobManagerActor [] [] - About to restart actor due to exception:
java.lang.NullPointerException
at spark.jobserver.JobManagerActor.startJobInternal(JobManagerActor.scala:229)
at spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:157)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.ActorStack$$anonfun$receive$1.applyOrElse(ActorStack.scala:33)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1$$anonfun$applyOrElse$1.apply$mcV$sp(Slf4jLogging.scala:25)
at spark.jobserver.common.akka.Slf4jLogging$class.spark$jobserver$common$akka$Slf4jLogging$$withAkkaSourceLogging(Slf4jLogging.scala:34)
at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1.applyOrElse(Slf4jLogging.scala:24)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.ActorMetrics$$anonfun$receive$1.applyOrElse(ActorMetrics.scala:23)
at akka.actor.Actor$class.aroundReceive(Actor.scala:484)
at spark.jobserver.common.akka.InstrumentedActor.aroundReceive(InstrumentedActor.scala:8)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[2017-10-06 10:41:00,643] INFO .jobserver.JobManagerActor [] [] - Shutting down SparkContext hc1
```
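The NullPointerException after the restart is consistent with `startJobInternal` dereferencing a SparkContext reference that was cleared during shutdown and never re-initialized. A hypothetical illustration of that failure mode follows; the field and method names are illustrative, not the actual `JobManagerActor` internals:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical holder mirroring the suspected state bug: shutdown nulls
// the context field, the actor restarts, and the next StartJob hits the
// null reference -- matching the NPE at JobManagerActor.scala:229.
class ContextHolder {
  private var sc: SparkContext = _ // null until initialize() runs

  def initialize(conf: SparkConf): Unit = { sc = new SparkContext(conf) }

  def shutdown(): Unit = {
    if (sc != null) sc.stop()
    sc = null
  }

  def startJobInternal(): Unit = {
    // Throws NullPointerException if called after shutdown() without a
    // fresh initialize() -- the restarted actor skips re-initialization.
    println(sc.applicationId)
  }
}
```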
As a result, we are left with a live SJS context that points to a closed SparkContext. Any call to this SJS context times out, and the context has to be re-created manually before it can be used again. Example response:
```json
{
  "status": "ERROR",
  "result": {
    "message": "Ask timed out on [Actor[akka.tcp://JobServer@127.0.0.1:45639/user/jobManager-fe-a4d8-39ca544b101d#-863049129]] after [10000 ms]. Sender[null] sent message of type \"spark.jobserver.JobManagerActor$StartJob\".",
    "errorClass": "akka.pattern.AskTimeoutException",
    "stack": [
      "akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)",
      "akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)",
      "scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)",
      "scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)",
      "scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)",
      "akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)",
      "akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)",
      "akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)",
      "akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)",
      "java.lang.Thread.run(Thread.java:748)"
    ]
  }
}
```
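This JSON is the client-side face of the same problem: the HTTP layer asks a jobManager actor that can no longer answer, so the ask future fails with `AskTimeoutException` after 10 seconds. A self-contained sketch of that symptom, using a deliberately unresponsive actor as a stand-in for the wedged `JobManagerActor` (names are illustrative):

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._
import scala.util.Failure

// Stand-in for the wedged manager: it receives StartJob-like messages
// but never replies, just like a manager whose SparkContext is gone.
class WedgedManager extends Actor {
  def receive = { case _ => () }
}

object AskTimeoutDemo extends App {
  val system = ActorSystem("AskTimeoutDemo")
  import system.dispatcher
  implicit val timeout: Timeout = Timeout(10.seconds)

  val manager = system.actorOf(Props[WedgedManager], "jobManager")
  (manager ? "StartJob").onComplete {
    // Fails with akka.pattern.AskTimeoutException after 10 s, which SJS
    // then wraps into an ERROR response like the one shown above.
    case Failure(e) => println(s"${e.getClass.getName}: ${e.getMessage}")
    case other      => println(other)
  }
}
```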