Operations Guide
This guide helps you fix job failures in common scenarios.
The following errors may occur while registering jobs.
This is one of the most common issues when working with Spark: at runtime a required class cannot be found, typically surfacing as a ClassNotFoundException. Spark runs several JVMs during job execution, so the classpath must be set correctly for each of them. In practice this comes down to how dependencies are shipped to the executors: make sure your jar file contains all the dependencies your job requires, for example:
val conf = new SparkConf().setAppName(appName).setJars(Seq(System.getProperty("user.dir") + "/target/scala-2.10/sparktest.jar"))
If you don't want to pass around a large jar file, you can instead place all the dependencies on the default classpath of every worker node in the cluster.
If your application and the Spark server use different versions of a library, that mismatch can also cause ClassNotFoundException. Make sure the library versions on the server are identical to those in the program loaded onto the executor classpaths.
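A quick way to narrow down where a class is missing is to probe the classpath directly. The sketch below uses plain Class.forName; classVisible is a hypothetical helper, not part of Spark or PSTL. Running the same probe on the driver and again inside an executor task (e.g. within a map) shows which JVM is missing the class.

```scala
// Minimal sketch: probe whether a class is visible on the current classpath.
// Run it on the driver and again inside an executor task to compare.
def classVisible(name: String): Boolean =
  try { Class.forName(name); true }
  catch { case _: ClassNotFoundException => false }

println(classVisible("scala.collection.immutable.List")) // present in any Scala runtime
println(classVisible("com.example.NotOnClasspath"))      // a missing class yields false
```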
Sometimes you may come across errors like:
java.lang.UnsupportedOperationException: Accumulator must be registered before send to executor
This happens when a case class definition and dataset/dataframe operations are blended together in a single notebook cell, and the case class is later used in a separate Spark job.
Defining the case class and creating a dataset in the same cell:
case class MyClass(value: Int)
val dataset = spark.createDataset(Seq(1))
Creating an instance of the case class inside the Spark job is where the error originates:
dataset.map { i => MyClass(i) }.count()
The solution is to move the case class definition to its own cell:
case class MyClass(value: Int) // no other code in this cell
val dataset = spark.createDataset(Seq(1))
dataset.map { i => MyClass(i) }.count()
The same fix applies to java.lang.NoClassDefFoundError.
You might get an error like the following while trying to register a job:
[pstl@ambari1 ~]$ pstl-jobs --initial-contacts 127.0.0.1:2552 --register --job-id test1 --conf /tmp/jobs/v1_single_column.conf
21:35:37.829 [pstl-akka.actor.default-dispatcher-2] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
job register failed
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://pstl/user/$a#166793955]] after [30000 ms]. Sender[null] sent message of type "akka.cluster.client.ClusterClient$Send".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:601)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)
Check that the PSTL service is up, running, and listening on port 2552 (check with your PSTL sysadmin to verify whether the port has been changed). Bring the service up and try again.
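One quick way to verify that the listener is reachable is a plain TCP connect, sketched below with a JVM socket. The host and port are assumptions; substitute your PSTL node and configured port. portOpen is a hypothetical helper, not part of PSTL.

```scala
import java.net.{InetSocketAddress, Socket}

// Minimal sketch: check whether a TCP listener is reachable.
def portOpen(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
  val socket = new Socket()
  try { socket.connect(new InetSocketAddress(host, port), timeoutMs); true }
  catch { case _: java.io.IOException => false }
  finally socket.close()
}

// Substitute your PSTL host and port (2552 is the assumed default here).
println(portOpen("127.0.0.1", 2552))
```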
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 19559.0 failed 4 times, most recent failure: Lost task 1.3 in stage 19559.0 (TID 58785, nwk2-hdp-dn-008.gdcs-lab.microfocus.com, executor 11): org.apache.avro.AvroRuntimeException: Duplicate field dcimName in record pstl.record: dcimName type:STRING pos:5 and dcimName type:STRING pos:4.
Note the last line of the error: it states that there is a duplicate column in the SQL query you are trying to push into your sink. Modify your job and resubmit it with the pstl-jobs command using the --modify option:
pstl-jobs --initial-contacts 10.143.130.243:2552 --modify --job-id sevone_test2 --conf /home/amp/banddu01/jobs/sevone-power.conf
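Before resubmitting, you can catch duplicate output columns up front by checking the query's projected column names. This is a minimal sketch; duplicateColumns is a hypothetical helper, not part of PSTL or Spark.

```scala
// Minimal sketch: report output column names that appear more than once,
// which is what triggers the Avro "Duplicate field" error above.
def duplicateColumns(cols: Seq[String]): Seq[String] =
  cols.groupBy(identity).collect { case (name, occs) if occs.size > 1 => name }.toSeq

println(duplicateColumns(Seq("dcimName", "value", "dcimName"))) // List(dcimName)
```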
More troubleshooting and tuning solutions for Spark under heavy loads can be found here.