Operations Guide


This guide helps you fix job failures in common scenarios.

Possible Errors

These are some errors you may encounter while registering and running jobs.

Class Not Found Issue

This is one of the most common issues when working with Spark. A program that compiles fine can still fail at runtime with a missing class, because Spark runs several JVMs (the driver and the executors) during execution and the required classes must be available on each of their classpaths. The problem usually comes down to how dependencies are passed to the executors, so make sure the jar you ship contains all of the required dependencies.

val conf = new SparkConf().setAppName(appName).setJars(Seq(System.getProperty("user.dir") + "/target/scala-2.10/sparktest.jar"))

If you don't want to pass around a large jar file, an alternative is to place all of the dependencies on the default classpath of every worker node in the cluster. Also note that if your application and the Spark server use different versions of the same libraries, that mismatch can itself be the root cause of ClassNotFoundException, so make sure the library versions are identical on the server and in the program that is loaded onto the executor classpaths.
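As a minimal sketch of that alternative, assuming the dependency jar has already been installed at the same path on every worker node (the path and jar name below are hypothetical):

import org.apache.spark.SparkConf

// dependencies already present on every worker are added to the executor
// classpath instead of being shipped inside the application jar
val conf = new SparkConf()
  .setAppName("sparktest")
  .set("spark.executor.extraClassPath", "/opt/spark-libs/some-dependency.jar")  // hypothetical path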

Spark job fails with java.lang.UnsupportedOperationException

Sometimes you may come across an error like:

java.lang.UnsupportedOperationException: Accumulator must be registered before send to executor

This happens when the case class definition and the dataset/DataFrame operations are blended together in one notebook cell, and the case class is later used in a separate Spark job. For example, defining the case class and creating a dataset in the same cell:

case class MyClass(value: Int)

val dataset = spark.createDataset(Seq(1))

Creating an instance of the case class inside the Spark job is then where the error originates:

dataset.map { i => MyClass(i) }.count()

The solution is to move the case class definition into its own cell:

case class MyClass(value: Int)   // no other code in this cell

val dataset = spark.createDataset(Seq(1))

dataset.map { i => MyClass(i) }.count()

The same solution applies when this pattern causes a java.lang.NoClassDefFoundError.

If you cannot register the job

You might get an error like the following while trying to register a job:

[pstl@ambari1 ~]$ pstl-jobs --initial-contacts 127.0.0.1:2552 --register --job-id test1 --conf /tmp/jobs/v1_single_column.conf
21:35:37.829 [pstl-akka.actor.default-dispatcher-2] INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
job register failed
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://pstl/user/$a#166793955]] after [30000 ms]. Sender[null] sent message of type "akka.cluster.client.ClusterClient$Send".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:601)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)

Check that the PSTL service is up and running and listening on port 2552 (check with your PSTL sysadmin to verify whether the port has been changed). Bring up the service and try again.
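As a quick sanity check, one way to verify that something is listening on the cluster port is to open a plain TCP connection to it, for example from a Scala REPL on the same host (the host and port below match the command above; adjust as needed):

import java.net.Socket

// throws java.net.ConnectException if nothing is listening on the port
val probe = new Socket("127.0.0.1", 2552)
probe.close()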

When your job fails to start and you see the following error in the PSTL error log

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 19559.0 failed 4 times, most recent failure: Lost task 1.3 in stage 19559.0 (TID 58785, nwk2-hdp-dn-008.gdcs-lab.microfocus.com, executor 11): org.apache.avro.AvroRuntimeException: Duplicate field dcimName in record pstl.record: dcimName type:STRING pos:5 and dcimName type:STRING pos:4.

If you look at the last line of the error, it states that there is a duplicate column in the SQL query you are trying to push into your sink. Remove the duplicate column from your job, then resubmit it with the pstl-jobs command using the --modify option:

pstl-jobs --initial-contacts 10.143.130.243:2552 --modify --job-id sevone_test2 --conf /home/amp/banddu01/jobs/sevone-power.conf
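For illustration only, a sketch of the kind of query that triggers this error and its fix (the table name and second column are hypothetical; dcimName is the field named in the error above):

// selecting the same column twice produces two "dcimName" fields in the sink record
val bad = spark.sql("SELECT dcimName, dcimValue, dcimName FROM source_table")

// fixed: select each column only once (or alias the second occurrence)
val fixed = spark.sql("SELECT dcimName, dcimValue FROM source_table")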

More troubleshooting and tuning solutions for Spark under heavy load can be found here.