NullPointerException when Running CaffeOnSpark on EC2 #67

Closed · syuquad opened this issue May 18, 2016 · 21 comments

syuquad commented May 18, 2016

I am running CaffeOnSpark on EC2 following the instructions at https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_EC2, and I got the following errors:

16/05/06 00:10:35 INFO TaskSetManager: Starting task 1.1 in stage 1.0 (TID 5, ip-10-30-15-17.us-west-2.compute.internal, partition 1,PROCESS_LOCAL, 2197 bytes)
16/05/06 00:10:35 WARN TaskSetManager: Lost task 0.1 in stage 1.0 (TID 4, ip-10-30-15-17.us-west-2.compute.internal): java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply$mcVI$sp(CaffeOnSpark.scala:153)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
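
The NPE above is raised inside the training closure on an executor (CaffeOnSpark.scala:153) while the driver collects results, so the driver-side trace rarely shows the underlying cause. A hedged way to pull the failing executor's own stderr in standalone mode; the work-directory layout below is the standalone default, and the host name is taken from the trace above:

# Sketch only: standalone Spark keeps per-executor output under
# <spark-home>/work/<app-id>/<executor-id>/stderr on each worker host.
ssh ip-10-30-15-17.us-west-2.compute.internal \
    'tail -n 200 /root/spark/work/app-*/*/stderr'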

syuquad commented May 18, 2016

more logs:

16/05/06 00:10:35 ERROR TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
16/05/06 00:10:35 INFO TaskSchedulerImpl: Cancelling stage 1
16/05/06 00:10:35 INFO TaskSchedulerImpl: Stage 1 was cancelled
16/05/06 00:10:35 INFO DAGScheduler: ResultStage 1 (collect at CaffeOnSpark.scala:155) failed in 112.354 s
16/05/06 00:10:35 INFO DAGScheduler: Job 1 failed: collect at CaffeOnSpark.scala:155, took 112.367186 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 8, ip-10-30-15-17.us-west-2.compute.internal): java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply$mcVI$sp(CaffeOnSpark.scala:153)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:155)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply$mcVI$sp(CaffeOnSpark.scala:153)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/05/06 00:10:35 INFO SparkContext: Invoking stop() from shutdown hook
16/05/06 00:10:35 INFO TaskSetManager: Lost task 1.3 in stage 1.0 (TID 9) on executor ip-10-30-15-17.us-west-2.compute.internal: org.apache.spark.TaskKilledException (null) [duplicate 5]
16/05/06 00:10:35 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/05/06 00:10:35 INFO SparkUI: Stopped Spark web UI at http://ec2-52-24-22-149.us-west-2.compute.amazonaws.com:4040
16/05/06 00:10:35 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/05/06 00:10:35 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/05/06 00:10:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/05/06 00:10:35 INFO MemoryStore: MemoryStore cleared
16/05/06 00:10:35 INFO BlockManager: BlockManager stopped
16/05/06 00:10:35 INFO BlockManagerMaster: BlockManagerMaster stopped
16/05/06 00:10:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/05/06 00:10:35 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/05/06 00:10:35 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/05/06 00:10:35 INFO SparkContext: Successfully stopped SparkContext
16/05/06 00:10:35 INFO ShutdownHookManager: Shutdown hook called
16/05/06 00:10:35 INFO ShutdownHookManager: Deleting directory /mnt/spark/spark-23e9b9f9-8b51-4b03-bd3d-6fdb15bb9fe6/httpd-8c4d711a-1bf6-4917-911c-161187fb2b61
16/05/06 00:10:35 INFO ShutdownHookManager: Deleting directory /mnt/spark/spark-23e9b9f9-8b51-4b03-bd3d-6fdb15bb9fe6

syuquad commented May 18, 2016

Here are my parameters:

export CORES_PER_WORKER=8
export DEVICES=1
export SPARK_WORKER_INSTANCES=2
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
export MASTER_URL=spark://$(hostname):7077
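
With these values, spark.cores.max=16 and spark.task.cpus=8 (set in the submit command below) allow exactly two concurrent tasks, i.e. one Caffe instance per worker, matching SPARK_WORKER_INSTANCES=2. A quick sanity check of the derived values before submitting (just an illustrative sketch):

# Sketch: confirm the arithmetic; TOTAL_CORES should print 16 (8 * 2).
echo "CORES_PER_WORKER=${CORES_PER_WORKER}  SPARK_WORKER_INSTANCES=${SPARK_WORKER_INSTANCES}"
echo "TOTAL_CORES=${TOTAL_CORES}  MASTER_URL=${MASTER_URL}"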

syuquad commented May 18, 2016

Here is my command:

spark-submit --master ${MASTER_URL} \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} \
    -connection ethernet \
    -model /mnist.model \
    -output /mnist_features_result

mriduljain (Contributor) commented May 21, 2016

Please paste the content of this file: lenet_memory_train_test.prototxt
Alternatively, check that your dataset path is correct.

syuquad commented May 23, 2016

I see the following in the log and I believe that's the reason for the failure. But I don't know how to fix it.

16/05/20 01:23:58 ERROR TaskSchedulerImpl: Lost executor 1 on ip-10-30-23-120.us-west-2.compute.internal: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/05/20 01:23:58 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, ip-10-30-23-120.us-west-2.compute.internal): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
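
"Remote RPC client disassociated" usually just means the executor JVM died; the scheduler message itself never says why. Besides the executor stderr mentioned earlier, a hedged check on the worker host named above is the kernel log, in case the OOM killer ended the process (the hostname is from the log lines; the grep pattern is only a sketch):

# Sketch: look for OOM-killer activity on the worker that lost the executor.
ssh ip-10-30-23-120.us-west-2.compute.internal \
    'dmesg | grep -iE "killed process|out of memory" | tail'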

syuquad commented May 23, 2016

Pasting lenet_memory_train_test.prototxt per your request:

name: "LeNet"
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
source_class: "com.yahoo.ml.caffe.LMDB"
memory_data_param {
source: "file:///root/CaffeOnSpark/data/mnist_train_lmdb/"
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
}
transform_param {
scale: 0.00390625
}
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
source_class: "com.yahoo.ml.caffe.LMDB"
memory_data_param {
source: "file:///root/CaffeOnSpark/data/mnist_test_lmdb/"
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
}
transform_param {
scale: 0.00390625
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
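
One thing worth noting in this file: both MemoryData layers reference local file:/// URIs, so /root/CaffeOnSpark/data/mnist_train_lmdb/ and mnist_test_lmdb/ must exist, with valid LMDB contents, on every executor host, not just the driver. A missing or empty source directory on a worker is a plausible way to end up with the NullPointerException inside the training closure. A hedged check (the worker addresses vary between the runs in this thread; substitute the current ones):

# Sketch: an LMDB directory normally holds data.mdb and lock.mdb.
for h in 10.30.3.42 10.30.3.43; do
    ssh "$h" 'ls -l /root/CaffeOnSpark/data/mnist_train_lmdb/ \
                    /root/CaffeOnSpark/data/mnist_test_lmdb/'
done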

anfeng (Contributor) commented May 24, 2016

Any detailed logs from the killed executor?

Andy

syuquad commented May 24, 2016

Not sure which log you are looking for. Here is the log file from the master node under /spark/logs:

root@ip-10-30-10-204:/spark/logs# ls -lt
total 12
-rw-r--r-- 1 root root 9240 May 23 22:52 spark-root-org.apache.spark.deploy.master.Master-1-ip-10-30-10-204.us-west-2.compute.internal.out
root@ip-10-30-10-204:/spark/logs# more spark-root-org.apache.spark.deploy.master.Master-1-ip-10-30-10-204.us-west-2.compute.internal.out
Spark Command: /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /root/spark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spark/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/conf/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip 10.30.10.204 --port 7077 --webui-port 8080

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/23 21:31:00 INFO Master: Registered signal handlers for [TERM, HUP, INT]
16/05/23 21:31:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/23 21:31:00 INFO SecurityManager: Changing view acls to: root
16/05/23 21:31:00 INFO SecurityManager: Changing modify acls to: root
16/05/23 21:31:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/05/23 21:31:01 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
16/05/23 21:31:01 INFO Master: Starting Spark master at spark://10.30.10.204:7077
16/05/23 21:31:01 INFO Master: Running Spark version 1.6.0
16/05/23 21:31:01 INFO Utils: Successfully started service 'MasterUI' on port 8080.
16/05/23 21:31:01 INFO MasterWebUI: Started MasterWebUI at http://ec2-54-148-250-247.us-west-2.compute.amazonaws.com:8080
16/05/23 21:31:01 INFO Utils: Successfully started service on port 6066.
16/05/23 21:31:01 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
16/05/23 21:31:02 INFO Master: I have been elected leader! New state: ALIVE
16/05/23 21:31:24 INFO Master: Registering worker 10.30.3.42:51910 with 8 cores, 13.7 GB RAM
16/05/23 21:31:24 INFO Master: Registering worker 10.30.3.43:41021 with 8 cores, 13.7 GB RAM
16/05/23 21:55:18 INFO Master: Registering app com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 21:55:18 INFO Master: Registered app com.yahoo.ml.caffe.CaffeOnSpark with ID app-20160523215518-0000
16/05/23 21:55:18 INFO Master: Launching executor app-20160523215518-0000/0 on worker worker-20160523213123-10.30.3.42-51910
16/05/23 21:55:18 INFO Master: Launching executor app-20160523215518-0000/1 on worker worker-20160523213123-10.30.3.43-41021
16/05/23 21:55:21 INFO Master: Received unregister request from application app-20160523215518-0000
16/05/23 21:55:21 INFO Master: Removing app app-20160523215518-0000
16/05/23 21:55:21 INFO Master: ip-10-30-10-204.us-west-2.compute.internal:50082 got disassociated, removing it.
16/05/23 21:55:21 INFO Master: 10.30.10.204:44056 got disassociated, removing it.
16/05/23 21:55:21 WARN Master: Got status update for unknown executor app-20160523215518-0000/1
16/05/23 21:55:21 WARN Master: Got status update for unknown executor app-20160523215518-0000/0
16/05/23 21:58:40 INFO Master: Registering app com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 21:58:40 INFO Master: Registered app com.yahoo.ml.caffe.CaffeOnSpark with ID app-20160523215840-0001
16/05/23 21:58:40 INFO Master: Launching executor app-20160523215840-0001/0 on worker worker-20160523213123-10.30.3.42-51910
16/05/23 21:58:40 INFO Master: Launching executor app-20160523215840-0001/1 on worker worker-20160523213123-10.30.3.43-41021
16/05/23 21:58:42 INFO Master: Received unregister request from application app-20160523215840-0001
16/05/23 21:58:42 INFO Master: Removing app app-20160523215840-0001
16/05/23 21:58:42 INFO Master: ip-10-30-10-204.us-west-2.compute.internal:50091 got disassociated, removing it.
16/05/23 21:58:42 INFO Master: 10.30.10.204:52644 got disassociated, removing it.
16/05/23 21:58:43 WARN Master: Got status update for unknown executor app-20160523215840-0001/0
16/05/23 21:58:43 WARN Master: Got status update for unknown executor app-20160523215840-0001/1
16/05/23 22:00:08 INFO Master: Registering app com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:00:08 INFO Master: Registered app com.yahoo.ml.caffe.CaffeOnSpark with ID app-20160523220008-0002
16/05/23 22:00:08 INFO Master: Launching executor app-20160523220008-0002/0 on worker worker-20160523213123-10.30.3.42-51910
16/05/23 22:00:08 INFO Master: Launching executor app-20160523220008-0002/1 on worker worker-20160523213123-10.30.3.43-41021
16/05/23 22:00:11 INFO Master: Received unregister request from application app-20160523220008-0002
16/05/23 22:00:11 INFO Master: Removing app app-20160523220008-0002
16/05/23 22:00:11 INFO Master: ip-10-30-10-204.us-west-2.compute.internal:50094 got disassociated, removing it.
16/05/23 22:00:11 INFO Master: 10.30.10.204:58235 got disassociated, removing it.
16/05/23 22:00:11 WARN Master: Got status update for unknown executor app-20160523220008-0002/1
16/05/23 22:00:11 WARN Master: Got status update for unknown executor app-20160523220008-0002/0
16/05/23 22:00:28 INFO Master: Registering app com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:00:28 INFO Master: Registered app com.yahoo.ml.caffe.CaffeOnSpark with ID app-20160523220028-0003
16/05/23 22:00:28 INFO Master: Launching executor app-20160523220028-0003/0 on worker worker-20160523213123-10.30.3.42-51910
16/05/23 22:00:28 INFO Master: Launching executor app-20160523220028-0003/1 on worker worker-20160523213123-10.30.3.43-41021
16/05/23 22:02:58 INFO Master: Removing executor app-20160523220028-0003/1 because it is EXITED
16/05/23 22:02:58 INFO Master: Launching executor app-20160523220028-0003/2 on worker worker-20160523213123-10.30.3.43-41021
16/05/23 22:02:58 INFO Master: Removing executor app-20160523220028-0003/0 because it is EXITED
16/05/23 22:02:58 INFO Master: Launching executor app-20160523220028-0003/3 on worker worker-20160523213123-10.30.3.42-51910
16/05/23 22:03:01 INFO Master: Received unregister request from application app-20160523220028-0003
16/05/23 22:03:01 INFO Master: Removing app app-20160523220028-0003
16/05/23 22:03:01 INFO Master: ip-10-30-10-204.us-west-2.compute.internal:50096 got disassociated, removing it.
16/05/23 22:03:01 INFO Master: 10.30.10.204:44475 got disassociated, removing it.
16/05/23 22:03:01 WARN Master: Got status update for unknown executor app-20160523220028-0003/2
16/05/23 22:03:01 WARN Master: Got status update for unknown executor app-20160523220028-0003/3
16/05/23 22:14:23 INFO Master: Registering app com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:14:23 INFO Master: Registered app com.yahoo.ml.caffe.CaffeOnSpark with ID app-20160523221423-0004
16/05/23 22:14:23 INFO Master: Launching executor app-20160523221423-0004/0 on worker worker-20160523213123-10.30.3.42-51910
16/05/23 22:14:23 INFO Master: Launching executor app-20160523221423-0004/1 on worker worker-20160523213123-10.30.3.43-41021
16/05/23 22:14:39 INFO Master: Removing executor app-20160523221423-0004/0 because it is EXITED
16/05/23 22:14:39 INFO Master: Launching executor app-20160523221423-0004/2 on worker worker-20160523213123-10.30.3.42-51910
16/05/23 22:14:39 INFO Master: Removing executor app-20160523221423-0004/1 because it is EXITED
16/05/23 22:14:39 INFO Master: Launching executor app-20160523221423-0004/3 on worker worker-20160523213123-10.30.3.43-41021
16/05/23 22:14:42 INFO Master: Received unregister request from application app-20160523221423-0004
16/05/23 22:14:42 INFO Master: Removing app app-20160523221423-0004
16/05/23 22:14:42 INFO Master: ip-10-30-10-204.us-west-2.compute.internal:50113 got disassociated, removing it.
16/05/23 22:14:42 INFO Master: 10.30.10.204:44328 got disassociated, removing it.
16/05/23 22:14:43 WARN Master: Got status update for unknown executor app-20160523221423-0004/3
16/05/23 22:14:43 WARN Master: Got status update for unknown executor app-20160523221423-0004/2
16/05/23 22:52:31 INFO Master: Registering app com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:52:31 INFO Master: Registered app com.yahoo.ml.caffe.CaffeOnSpark with ID app-20160523225231-0005
16/05/23 22:52:31 INFO Master: Launching executor app-20160523225231-0005/0 on worker worker-20160523213123-10.30.3.42-51910
16/05/23 22:52:31 INFO Master: Launching executor app-20160523225231-0005/1 on worker worker-20160523213123-10.30.3.43-41021
16/05/23 22:52:47 INFO Master: Removing executor app-20160523225231-0005/0 because it is EXITED
16/05/23 22:52:47 INFO Master: Launching executor app-20160523225231-0005/2 on worker worker-20160523213123-10.30.3.42-51910
16/05/23 22:52:47 INFO Master: Removing executor app-20160523225231-0005/1 because it is EXITED
16/05/23 22:52:47 INFO Master: Launching executor app-20160523225231-0005/3 on worker worker-20160523213123-10.30.3.43-41021
16/05/23 22:52:50 INFO Master: Received unregister request from application app-20160523225231-0005
16/05/23 22:52:50 INFO Master: Removing app app-20160523225231-0005
16/05/23 22:52:50 INFO Master: ip-10-30-10-204.us-west-2.compute.internal:50159 got disassociated, removing it.
16/05/23 22:52:50 INFO Master: 10.30.10.204:45211 got disassociated, removing it.
16/05/23 22:52:51 WARN Master: Got status update for unknown executor app-20160523225231-0005/3
16/05/23 22:52:51 WARN Master: Got status update for unknown executor app-20160523225231-0005/2
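
The pattern here: for apps -0003 through -0005, both executors are removed "because it is EXITED" shortly after launch and are immediately relaunched, which points at the executors crashing on their own rather than the master killing them. A hedged one-liner to pull the affected app IDs out of this log for inspection on the workers:

# Sketch: list the app IDs whose executors exited, from the master log above.
grep 'because it is EXITED' spark-root-org.apache.spark.deploy.master.Master-1-*.out \
    | grep -o 'app-[0-9]*-[0-9]*' | sort -u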

syuquad commented May 24, 2016

Here is the log file from one of the slave nodes:

root@ip-10-30-3-42:/spark/logs# ls
spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-30-3-42.out
root@ip-10-30-3-42:/spark/logs# more spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-30-3-42.out
Spark Command: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -cp /root/spark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spark/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/conf/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://10.30.10.204:7077

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/23 21:31:22 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
16/05/23 21:31:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/23 21:31:23 INFO SecurityManager: Changing view acls to: root
16/05/23 21:31:23 INFO SecurityManager: Changing modify acls to: root
16/05/23 21:31:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/05/23 21:31:23 INFO Utils: Successfully started service 'sparkWorker' on port 51910.
16/05/23 21:31:23 INFO Worker: Starting Spark worker 10.30.3.42:51910 with 8 cores, 13.7 GB RAM
16/05/23 21:31:23 INFO Worker: Running Spark version 1.6.0
16/05/23 21:31:23 INFO Worker: Spark home: /root/spark
16/05/23 21:31:24 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/05/23 21:31:24 INFO WorkerWebUI: Started WorkerWebUI at http://ec2-54-149-117-170.us-west-2.compute.amazonaws.com:8081
16/05/23 21:31:24 INFO Worker: Connecting to master 10.30.10.204:7077...
16/05/23 21:31:24 INFO Worker: Successfully registered with master spark://10.30.10.204:7077
16/05/23 21:55:18 INFO Worker: Asked to launch executor app-20160523215518-0000/0 for com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 21:55:18 INFO SecurityManager: Changing view acls to: root
16/05/23 21:55:18 INFO SecurityManager: Changing modify acls to: root
16/05/23 21:55:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/05/23 21:55:18 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" "-cp" "/root/ephemeral-hdfs/conf/:/root/spark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spark/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/conf/" "-Xms12991M" "-Xmx12991M" "-Dspark.driver.port=44056" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@10.30.10.204:44056" "--executor-id" "0" "--hostname" "10.30.3.42" "--cores" "8" "--app-id" "app-20160523215518-0000" "--worker-url" "spark://Worker@10.30.3.42:51910"
16/05/23 21:55:21 INFO Worker: Asked to kill executor app-20160523215518-0000/0
16/05/23 21:55:21 INFO ExecutorRunner: Runner thread for executor app-20160523215518-0000/0 interrupted
16/05/23 21:55:21 INFO ExecutorRunner: Killing process!
16/05/23 21:55:21 ERROR FileAppender: Error writing stream to file /root/spark/work/app-20160523215518-0000/0/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
16/05/23 21:55:21 INFO Worker: Executor app-20160523215518-0000/0 finished with state KILLED exitStatus 143
16/05/23 21:55:21 INFO Worker: Cleaning up local directories for application app-20160523215518-0000
16/05/23 21:55:21 INFO ExternalShuffleBlockResolver: Application app-20160523215518-0000 removed, cleanupLocalDirs = true
16/05/23 21:58:40 INFO Worker: Asked to launch executor app-20160523215840-0001/0 for com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 21:58:40 INFO SecurityManager: Changing view acls to: root
16/05/23 21:58:40 INFO SecurityManager: Changing modify acls to: root
16/05/23 21:58:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/05/23 21:58:40 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" "-cp" "/root/ephemeral-hdfs/conf/:/root/spark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spark/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/conf/" "-Xms12991M" "-Xmx12991M" "-Dspark.driver.port=52644" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@10.30.10.204:52644" "--executor-id" "0" "--hostname" "10.30.3.42" "--cores" "8" "--app-id" "app-20160523215840-0001" "--worker-url" "spark://Worker@10.30.3.42:51910"
16/05/23 21:58:42 INFO Worker: Asked to kill executor app-20160523215840-0001/0
16/05/23 21:58:42 INFO ExecutorRunner: Runner thread for executor app-20160523215840-0001/0 interrupted
16/05/23 21:58:42 INFO ExecutorRunner: Killing process!
16/05/23 21:58:42 ERROR FileAppender: Error writing stream to file /root/spark/work/app-20160523215840-0001/0/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
16/05/23 21:58:42 INFO Worker: Executor app-20160523215840-0001/0 finished with state KILLED exitStatus 143
16/05/23 21:58:42 INFO Worker: Cleaning up local directories for application app-20160523215840-0001
16/05/23 21:58:42 INFO ExternalShuffleBlockResolver: Application app-20160523215840-0001 removed, cleanupLocalDirs = true
16/05/23 22:00:08 INFO Worker: Asked to launch executor app-20160523220008-0002/0 for com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:00:08 INFO SecurityManager: Changing view acls to: root
16/05/23 22:00:08 INFO SecurityManager: Changing modify acls to: root
16/05/23 22:00:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/05/23 22:00:08 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java
-7-openjdk-amd64/jre/bin/java" "-cp" "/root/ephemeral-hdfs/conf/:/root/sp
ark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spar
k/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2
.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/c
onf/" "-Xms12991M" "-Xmx12991M" "-Dspark.driver.port=58235" "-XX:MaxPermS
ize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--dri
ver-url" "spark://CoarseGrainedScheduler@10.30.10.204:58235" "--executor-
id" "0" "--hostname" "10.30.3.42" "--cores" "8" "--app-id" "app-201605232
20008-0002" "--worker-url" "spark://Worker@10.30.3.42:51910"
16/05/23 22:00:11 INFO Worker: Asked to kill executor app-20160523220008-
0002/0
16/05/23 22:00:11 INFO ExecutorRunner: Runner thread for executor app-201
60523220008-0002/0 interrupted
16/05/23 22:00:11 INFO ExecutorRunner: Killing process!
16/05/23 22:00:11 ERROR FileAppender: Error writing stream to file /root/
spark/work/app-20160523220008-0002/0/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.j
ava:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272
)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(
FileAppender.scala:70)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply$mcV$sp(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala
:1741)
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileApp
ender.scala:38)
16/05/23 22:00:11 INFO Worker: Executor app-20160523220008-0002/0 finishe
d with state KILLED exitStatus 143
16/05/23 22:00:11 INFO Worker: Cleaning up local directories for applicat
ion app-20160523220008-0002
16/05/23 22:00:11 INFO ExternalShuffleBlockResolver: Application app-2016
0523220008-0002 removed, cleanupLocalDirs = true
16/05/23 22:00:28 INFO Worker: Asked to launch executor app-2016052322002
8-0003/0 for com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:00:28 INFO SecurityManager: Changing view acls to: root
16/05/23 22:00:28 INFO SecurityManager: Changing modify acls to: root
16/05/23 22:00:28 INFO SecurityManager: SecurityManager: authentication d
isabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
16/05/23 22:00:28 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java
-7-openjdk-amd64/jre/bin/java" "-cp" "/root/ephemeral-hdfs/conf/:/root/sp
ark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spar
k/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2
.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/c
onf/" "-Xms12991M" "-Xmx12991M" "-Dspark.driver.port=44475" "-XX:MaxPermS
ize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--dri
ver-url" "spark://CoarseGrainedScheduler@10.30.10.204:44475" "--executor-
id" "0" "--hostname" "10.30.3.42" "--cores" "8" "--app-id" "app-201605232
20028-0003" "--worker-url" "spark://Worker@10.30.3.42:51910"
16/05/23 22:02:58 INFO Worker: Executor app-20160523220028-0003/0 finishe
d with state EXITED message Command exited with code 134 exitStatus 134
16/05/23 22:02:58 INFO Worker: Asked to launch executor app-2016052322002
8-0003/3 for com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:02:58 INFO SecurityManager: Changing view acls to: root
16/05/23 22:02:58 INFO SecurityManager: Changing modify acls to: root
16/05/23 22:02:58 INFO SecurityManager: SecurityManager: authentication d
isabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
16/05/23 22:02:58 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java
-7-openjdk-amd64/jre/bin/java" "-cp" "/root/ephemeral-hdfs/conf/:/root/sp
ark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spar
k/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2
.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/c
onf/" "-Xms12991M" "-Xmx12991M" "-Dspark.driver.port=44475" "-XX:MaxPermS
ize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--dri
ver-url" "spark://CoarseGrainedScheduler@10.30.10.204:44475" "--executor-
id" "3" "--hostname" "10.30.3.42" "--cores" "8" "--app-id" "app-201605232
20028-0003" "--worker-url" "spark://Worker@10.30.3.42:51910"
16/05/23 22:03:01 INFO Worker: Asked to kill executor app-20160523220028-
0003/3
16/05/23 22:03:01 INFO ExecutorRunner: Runner thread for executor app-201
60523220028-0003/3 interrupted
16/05/23 22:03:01 INFO ExecutorRunner: Killing process!
16/05/23 22:03:01 ERROR FileAppender: Error writing stream to file /root/
spark/work/app-20160523220028-0003/3/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.j
ava:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272
)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(
FileAppender.scala:70)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply$mcV$sp(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala
:1741)
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileApp
ender.scala:38)
16/05/23 22:03:01 INFO Worker: Executor app-20160523220028-0003/3 finishe
d with state KILLED exitStatus 143
16/05/23 22:03:01 INFO Worker: Cleaning up local directories for applicat
ion app-20160523220028-0003
16/05/23 22:03:01 INFO ExternalShuffleBlockResolver: Application app-2016
0523220028-0003 removed, cleanupLocalDirs = true
16/05/23 22:14:23 INFO Worker: Asked to launch executor app-2016052322142
3-0004/0 for com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:14:23 INFO SecurityManager: Changing view acls to: root
16/05/23 22:14:23 INFO SecurityManager: Changing modify acls to: root
16/05/23 22:14:23 INFO SecurityManager: SecurityManager: authentication d
isabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
16/05/23 22:14:23 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java
-7-openjdk-amd64/jre/bin/java" "-cp" "/root/ephemeral-hdfs/conf/:/root/sp
ark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spar
k/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2
.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/c
onf/" "-Xms5120M" "-Xmx5120M" "-Dspark.driver.port=44328" "-XX:MaxPermSiz
e=512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--drive
r-url" "spark://CoarseGrainedScheduler@10.30.10.204:44328" "--executor-id
" "0" "--hostname" "10.30.3.42" "--cores" "8" "--app-id" "app-20160523221
423-0004" "--worker-url" "spark://Worker@10.30.3.42:51910"
16/05/23 22:14:39 INFO Worker: Executor app-20160523221423-0004/0 finishe
d with state EXITED message Command exited with code 134 exitStatus 134
16/05/23 22:14:39 INFO Worker: Asked to launch executor app-2016052322142
3-0004/2 for com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:14:39 INFO SecurityManager: Changing view acls to: root
16/05/23 22:14:39 INFO SecurityManager: Changing modify acls to: root
16/05/23 22:14:39 INFO SecurityManager: SecurityManager: authentication d
isabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
16/05/23 22:14:39 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java
-7-openjdk-amd64/jre/bin/java" "-cp" "/root/ephemeral-hdfs/conf/:/root/sp
ark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spar
k/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2
.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/c
onf/" "-Xms5120M" "-Xmx5120M" "-Dspark.driver.port=44328" "-XX:MaxPermSiz
e=512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--drive
r-url" "spark://CoarseGrainedScheduler@10.30.10.204:44328" "--executor-id
" "2" "--hostname" "10.30.3.42" "--cores" "8" "--app-id" "app-20160523221
423-0004" "--worker-url" "spark://Worker@10.30.3.42:51910"
16/05/23 22:14:42 INFO Worker: Asked to kill executor app-20160523221423-
0004/2
16/05/23 22:14:42 INFO ExecutorRunner: Runner thread for executor app-201
60523221423-0004/2 interrupted
16/05/23 22:14:42 INFO ExecutorRunner: Killing process!
16/05/23 22:14:42 ERROR FileAppender: Error writing stream to file /root/
spark/work/app-20160523221423-0004/2/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.j
ava:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272
)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(
FileAppender.scala:70)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply$mcV$sp(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala
:1741)
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileApp
ender.scala:38)
16/05/23 22:14:42 INFO Worker: Executor app-20160523221423-0004/2 finishe
d with state KILLED exitStatus 143
16/05/23 22:14:42 INFO Worker: Cleaning up local directories for applicat
ion app-20160523221423-0004
16/05/23 22:14:42 INFO ExternalShuffleBlockResolver: Application app-2016
0523221423-0004 removed, cleanupLocalDirs = true
16/05/23 22:52:31 INFO Worker: Asked to launch executor app-2016052322523
1-0005/0 for com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:52:31 INFO SecurityManager: Changing view acls to: root
16/05/23 22:52:31 INFO SecurityManager: Changing modify acls to: root
16/05/23 22:52:31 INFO SecurityManager: SecurityManager: authentication d
isabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
16/05/23 22:52:31 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java
-7-openjdk-amd64/jre/bin/java" "-cp" "/root/ephemeral-hdfs/conf/:/root/sp
ark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spar
k/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2
.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/c
onf/" "-Xms12991M" "-Xmx12991M" "-Dspark.driver.port=45211" "-XX:MaxPermS
ize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--dri
ver-url" "spark://CoarseGrainedScheduler@10.30.10.204:45211" "--executor-
id" "0" "--hostname" "10.30.3.42" "--cores" "8" "--app-id" "app-201605232
25231-0005" "--worker-url" "spark://Worker@10.30.3.42:51910"
16/05/23 22:52:47 INFO Worker: Executor app-20160523225231-0005/0 finishe
d with state EXITED message Command exited with code 134 exitStatus 134
16/05/23 22:52:47 INFO Worker: Asked to launch executor app-2016052322523
1-0005/2 for com.yahoo.ml.caffe.CaffeOnSpark
16/05/23 22:52:47 INFO SecurityManager: Changing view acls to: root
16/05/23 22:52:47 INFO SecurityManager: Changing modify acls to: root
16/05/23 22:52:47 INFO SecurityManager: SecurityManager: authentication d
isabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
16/05/23 22:52:47 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java
-7-openjdk-amd64/jre/bin/java" "-cp" "/root/ephemeral-hdfs/conf/:/root/sp
ark/conf/:/root/spark/lib/spark-assembly-1.6.0-hadoop2.4.0.jar:/root/spar
k/lib/datanucleus-core-3.2.10.jar:/root/spark/lib/datanucleus-api-jdo-3.2
.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/ephemeral-hdfs/c
onf/" "-Xms12991M" "-Xmx12991M" "-Dspark.driver.port=45211" "-XX:MaxPermS
ize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--dri
ver-url" "spark://CoarseGrainedScheduler@10.30.10.204:45211" "--executor-
id" "2" "--hostname" "10.30.3.42" "--cores" "8" "--app-id" "app-201605232
25231-0005" "--worker-url" "spark://Worker@10.30.3.42:51910"
16/05/23 22:52:50 INFO Worker: Asked to kill executor app-20160523225231-
0005/2
16/05/23 22:52:50 INFO ExecutorRunner: Runner thread for executor app-201
60523225231-0005/2 interrupted
16/05/23 22:52:50 INFO ExecutorRunner: Killing process!
16/05/23 22:52:50 ERROR FileAppender: Error writing stream to file /root/
spark/work/app-20160523225231-0005/2/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.j
ava:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272
)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(
FileAppender.scala:70)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply$mcV$sp(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$ru
n$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala
:1741)
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileApp
ender.scala:38)
16/05/23 22:52:50 INFO Worker: Executor app-20160523225231-0005/2 finishe
d with state KILLED exitStatus 143
16/05/23 22:52:50 INFO Worker: Cleaning up local directories for applicat
ion app-20160523225231-0005
16/05/23 22:52:50 INFO ExternalShuffleBlockResolver: Application app-2016
0523225231-0005 removed, cleanupLocalDirs = true
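
For reference: exit code 134 is 128+6, i.e. the executor JVM was killed by SIGABRT, which points at a native-code crash rather than a Spark-level failure. A quick way to look for JVM crash reports on a worker (a sketch, assuming the default work directory shown in this log):

ls /root/spark/work/app-*/*/hs_err_pid*.log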

syuquad commented May 24, 2016

I use the following command to start the cluster:

${SPARK_HOME}/ec2/spark-ec2 --key-pair=${EC2_KEY} --identity-file=${EC2_PEM_FILE} --region=${EC2_REGION} --zone=${EC2_ZONE} --ebs-vol-size=50 --instance-type=${EC2_INSTANCE_TYPE} --master-instance-type=m4.xlarge --ami=${AMI_IMAGE} -s ${SPARK_WORKER_INSTANCES} --copy-aws-credentials --vpc-id=vpc-fe813496 --subnet-id=subnet-2d833645 --private-ips --hadoop-major-version=yarn --spark-version=1.6.0 --no-ganglia --user-data=/Users/myusername/work/CaffeOnSpark-master/scripts/ec2-cloud-config.txt launch CaffeOnSparkDemoVPC

syuquad commented May 25, 2016

I also see the following log from the worker nodes:

root@ip-10-30-9-14:/spark/work/app-20160525000805-0000/0# ls
caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
hs_err_pid3378.log
lenet_memory_solver.prototxt
lenet_memory_train_test.prototxt
stderr
stdout
root@ip-10-30-9-14:/spark/work/app-20160525000805-0000/0# more hs_err_pid3378.log

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007f3cdfd26ff9, pid=3378, tid=139898812618496

JRE version: OpenJDK Runtime Environment (7.0_101) (build 1.7.0_101-b00)
Java VM: OpenJDK 64-Bit Server VM (24.95-b01 mixed mode linux-amd64 compressed oops)
Derivative: IcedTea 2.6.6
Distribution: Ubuntu 14.04 LTS, package 7u101-2.6.6-0ubuntu0.14.04.1

Problematic frame:
C [libc.so.6+0x97ff9]

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

If you would like to submit a bug report, please include instructions on how to reproduce the bug and visit:
http://icedtea.classpath.org/bugzilla

The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.

--------------- T H R E A D ---------------

Current thread (0x00007f3c4c002000): JavaThread "Executor task launch worker-0" daemon [_thread_in_native, id=3455, stack(0x00007f3cbaf7d000,0x00007f3cbb07e000)]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=2 (SEGV_ACCERR), si_addr=0x00000002031e0000

Registers:
RAX=0x00007f399923e0b0, RBX=0x0000000000000001, RCX=0x0000000000000fa0, RDX=0x00000000000007d0
RSP=0x00007f3cbb07b998, RBP=0x0000000000000000, RSI=0x00007f3b9c41e880, RDI=0x00000002031e0000
R8 =0x0000000000000000, R9 =0x00007f3cbb07b580, R10=0x00007f3cbb07b760, R11=0x00007f3cab8384a0
R12=0x00007f3cab9e6e20, R13=0x00007f3b9c3e4910, R14=0x00000002031e0000, R15=0x00000000000001f4
...........
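
As the hs_err header notes, core dumps were disabled on the workers. A minimal sketch for capturing one on the next crash (run on each worker before restarting the Spark daemons; the core_pattern path is just an illustrative choice):

ulimit -c unlimited                                       # allow core files in this shell
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern    # optional: write cores to a known location
${SPARK_HOME}/sbin/stop-slave.sh
${SPARK_HOME}/sbin/start-slave.sh spark://10.30.10.204:7077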

mriduljain commented May 26, 2016

Will check tomorrow, but it looks like permission errors on your filesystems.

mriduljain commented May 27, 2016

If you want to specify/increase executor memory, you can use these flags with spark-submit:
--executor-memory 38g --conf spark.yarn.executor.memoryOverhead=16384
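
For context, a sketch of how those flags slot into the CaffeOnSpark submission used elsewhere in this thread (the concrete values are examples to adjust, not requirements; ${CAFFE_ON_SPARK} and ${SPARK_WORKER_INSTANCES} are assumed to be set as in the other commands here):

spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --executor-memory 38g \
    --conf spark.yarn.executor.memoryOverhead=16384 \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train -conf lenet_memory_solver.prototxt \
    -devices 1 -connection ethernet \
    -model hdfs:///mnist.model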

syuquad commented May 31, 2016

It doesn't help with "--executor-memory 38g --conf spark.yarn.executor.memoryOverhead=16384" or other adjusted values (the maximum is 13g on the g2.2xlarge box). Same errors.

guchensmile commented Jul 6, 2016

I have a similar issue on my local deployment following https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn

Master (no NVIDIA/CUDA setup and no CaffeOnSpark build; caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar is copied from another slave node)
Two slaves:
Slave1 (one GPU)
Slave2 (two GPUs)

The model files:
lenet_memory_solver.prototxt.txt
lenet_memory_train_test.prototxt.txt

The Hadoop config files:
etc-hadoop.zip

After I run:
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
hadoop fs -rm -f hdfs:///mnist.model
hadoop fs -rm -r -f hdfs:///mnist_features_result
spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices ${DEVICES} \
    -connection ethernet \
    -model hdfs:///mnist.model \
    -output hdfs:///mnist_features_result

I got the error below:
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 7, heracles): java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply$mcVI$sp(CaffeOnSpark.scala:153)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:

But notice that when I submit it in Spark standalone mode (also one master and two slaves):
export MASTER_URL=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))

pushd ${CAFFE_ON_SPARK}/data
hadoop fs -rm -f hdfs:///mnist.model
hadoop fs -rm -r -f hdfs:///mnist_features_result
spark-submit --master ${MASTER_URL} \
--files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-features accuracy,loss -label label \
-conf lenet_memory_solver.prototxt \
-clusterSize ${SPARK_WORKER_INSTANCES} \
-devices 1 \
-connection ethernet \
-model hdfs:///mnist.model \
-output hdfs:///mnist_features_result

Everything is OK.
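
For reference, comparing the two submits line by line: the only CaffeOnSpark flag the working standalone run passes that the failing YARN run does not is -clusterSize ${SPARK_WORKER_INSTANCES}. Purely as a hedged sketch (untested; everything else kept exactly as in the YARN command above), the YARN submit with that flag restored would be:

# Identical to the failing YARN command above, plus -clusterSize.
spark-submit --master yarn --deploy-mode cluster \
--num-executors ${SPARK_WORKER_INSTANCES} \
--files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-features accuracy,loss -label label \
-conf lenet_memory_solver.prototxt \
-clusterSize ${SPARK_WORKER_INSTANCES} \
-devices ${DEVICES} \
-connection ethernet \
-model hdfs:///mnist.model \
-output hdfs:///mnist_features_result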

junshi15 (Collaborator) commented Jul 6, 2016

This is usually caused by incorrect settings.
I believe syuquad has solved this issue. @syuquad, do you want to share your solution?
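
A hedged diagnostic sketch for anyone hitting this: in YARN cluster mode the root cause (often a native-library load failure during Caffe setup) tends to land in the executor container logs rather than the driver log, so the NullPointerException at CaffeOnSpark.scala:153 is usually just the symptom. Something like the following can surface it; <application_id> is a placeholder for whatever spark-submit printed, and the grep keywords are assumptions to adjust as needed:

# Pull all container logs for the finished application and search for the
# errors that typically precede this NPE.
yarn logs -applicationId <application_id> | grep -i -E "caffe|cuda|UnsatisfiedLinkError"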

syuquad commented Jul 6, 2016

I didn’t try to run it with YARN. I only ran it in local mode.

guchensmile commented Jul 7, 2016

Hello @syuquad, could you share your solution for this error? The root cause may be the same as in my similar case.

@junshi15, I have no idea what's wrong with the settings, because the job runs correctly in Spark standalone mode.

Thank you!

syuquad commented Jul 7, 2016

Hi,

You got the exception with
spark-submit --master yarn --deploy-mode cluster ….

That’s different from my case: I got the exception with
spark-submit --master ${MASTER_URL} ….

After I switched to Jun’s AMI, that exception was gone. I never used
spark-submit --master yarn --deploy-mode cluster ….

guchensmile commented Jul 8, 2016

OK, I will try reconfiguring everything for YARN mode.
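
A hedged starting point for that reconfiguration (the paths below are assumptions from a typical CaffeOnSpark build layout, not taken from this thread, and must be adjusted to the actual install): LD_LIBRARY_PATH has to cover the Caffe native libraries on every node that can run an executor before spark-submit is invoked.

# Assumed install locations -- adjust to your cluster.
export CAFFE_ON_SPARK=/path/to/CaffeOnSpark
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
# GPU slaves additionally need the CUDA runtime on the path; CPU-only nodes skip this.
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64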

mriduljain (Contributor) commented Nov 29, 2016

No update. Closing this; reopen if required.
