This repository has been archived by the owner on Nov 16, 2019. It is now read-only.

MemoryData & JAR error #44

Closed
mauricio-onoda opened this issue Mar 29, 2016 · 6 comments

Comments

@mauricio-onoda

We launched a cluster with your image and ran the "lenet_memory" example with success.

(1) Then we ran the same example with a different data layer type ("Data" instead of "MemoryData"), as shown in Caffe's directory, and an error occurred:

Example:
==> Original data type:

}
data_param {
source: "mnist_train_lmdb/"
batch_size: 64
backend: LMDB
}

==> New data type:

source_class: "com.yahoo.ml.caffe.LMDB"
memory_data_param {
source: "mnist_train_lmdb/"
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
}

Execution:

root@ip-172-31-14-118:~/CaffeOnSpark/data# spark-submit --master spark://$(hostname):7077
--files lenet_train_test.prototxt,lenet_solver.prototxt
--conf spark.cores.max=${TOTAL_CORES}
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--class com.yahoo.ml.caffe.CaffeOnSpark
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
-train
-features accuracy,loss -label label
-conf lenet_solver.prototxt
-clusterSize ${SPARK_WORKER_INSTANCES}
-devices ${DEVICES}
-connection ethernet
-model /mnist.model
-output /mnist_features_result
....
16/03/29 16:36:25 INFO DataSource$: Source data layer:0
16/03/29 16:36:25 ERROR DataSource$: source_class must be defined for input data layer:Data
Exception in thread "main" java.lang.NullPointerException
...

Does CaffeOnSpark support only the MemoryData type?

(2) We tested another example from Caffe: "mnist_autoencoder". After changing the data type to MemoryData in the prototxt file, we got an error:

root@ip-172-31-14-118:~/CaffeOnSpark/data# spark-submit --master spark://$(hostname):7077
--files mnist_memory_autoencoder.prototxt, mnist_memory_autoencoder_solver.prototxt
--conf spark.cores.max=${TOTAL_CORES}
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--class com.yahoo.ml.caffe.CaffeOnSpark
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
-train -persistent
-features accuracy,loss -label label
-conf mnist_memory_autoencoder_solver.prototxt
-clusterSize ${SPARK_WORKER_INSTANCES}
-devices ${DEVICES}
-connection ethernet
-model /mnist_memory.model
-output /mnist_memory_autoencoder
Error: Cannot load main class from JAR file:/root/CaffeOnSpark/data/mnist_memory_autoencoder_solver.prototxt

These files exist in ~/CaffeOnSpark/data:

root@ip-172-31-14-118:~/CaffeOnSpark/data# ls mni* -l
-rwxr-xr-x 1 root root 5102 Mar 29 16:14 mnist_memory_autoencoder.prototxt
-rwxr-xr-x 1 root root 417 Mar 29 16:06 mnist_memory_autoencoder_solver.prototxt

What are we missing?

@anfeng
Contributor

anfeng commented Mar 29, 2016

Yes. We only support MemoryData right now.

What is the content of your /data/mnist_memory_autoencoder_solver.prototxt? More specifically, I would like to know its value for source_class.
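
Since only MemoryData layers are supported at this point, one quick pre-flight check is to look for layers still declared as type "Data" in the net prototxt. The following is a minimal sketch, assuming a hypothetical helper named find_plain_data_layers and an illustrative throwaway file; it is a plain grep, not a CaffeOnSpark tool.

```shell
# find_plain_data_layers is a hypothetical helper: it prints (with line numbers)
# every line that declares a layer of type "Data". The leading quote in the
# pattern keeps it from matching "MemoryData".
find_plain_data_layers() {
  grep -n 'type:[[:space:]]*"Data"' "$1"
}

# Demo on a throwaway file; the layer definitions are illustrative only.
demo=$(mktemp)
printf 'layer {\n  type: "Data"\n}\nlayer {\n  type: "MemoryData"\n}\n' > "$demo"
find_plain_data_layers "$demo"
rm -f "$demo"
```

Any line this reports would need to be converted to a MemoryData layer with a source_class before submitting the job.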

@mauricio-onoda
Author

name: "MNISTAutoencoder"
layer {
name: "data"
type: "MemoryData"
top: "data"
include {
phase: TRAIN
}
transform_param {
scale: 0.0039215684
}
source_class: "com.yahoo.ml.caffe.LMDB"
memory_data_param {
source: "mnist_train_lmdb/"
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
}
}

The source_class is the same for the TEST phase layers (stages test-on-train and test-on-test).

@anfeng
Contributor

anfeng commented Mar 29, 2016

I suspect that you are missing some spaces in your CLI. Please add a space character before each \ (line continuation).

@mauricio-onoda
Author

I found the problem! On the second line of my CLI there was a space between the file names:

--files mnist_memory_autoencoder.prototxt, mnist_memory_autoencoder_solver.prototxt\

With a space after the comma, the error occurs.
After I removed the space, my CLI worked.
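
The failure follows from how the shell splits words: a space after the comma turns the second file name into a separate argument, and spark-submit then appears to take that stray argument as the application JAR, which is consistent with the "Cannot load main class from JAR" message above. A minimal sketch, using a hypothetical helper named count_args, shows the difference:

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() {
  echo "$#"
}

# Space after the comma: the shell passes two separate words after --files.
count_args --files mnist_memory_autoencoder.prototxt, mnist_memory_autoencoder_solver.prototxt   # prints 3

# No space: the comma-separated list stays a single argument.
count_args --files mnist_memory_autoencoder.prototxt,mnist_memory_autoencoder_solver.prototxt    # prints 2
```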

However, we got another error:

16/03/30 12:49:12 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-172-31-2-118.eu-west-1.compute.internal, partition 0,PROCESS_LOCAL, 2216 bytes)
16/03/30 12:49:12 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ip-172-31-2-117.eu-west-1.compute.internal, partition 1,PROCESS_LOCAL, 2216 bytes)
16/03/30 12:49:12 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-2-118.eu-west-1.compute.internal:39514 (size: 2.0 KB, free: 8.9 GB)
16/03/30 12:49:12 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-2-117.eu-west-1.compute.internal:58422 (size: 2.0 KB, free: 8.9 GB)
16/03/30 12:49:12 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, ip-172-31-2-117.eu-west-1.compute.internal): java.io.FileNotFoundException: /root/CaffeOnSpark/data/mnist_memory_autoencoder.prototxt (No such file or directory)

The question is: must the prototxt files exist on all workers (nodes)? If so, how can I copy these files to the workers?

Thanks again!

@mauricio-onoda
Author

My mistake! I had used a version of mnist_memory_autoencoder_solver.prototxt with a path in the "net" parameter. After I removed the path, it worked.
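
Files passed with --files are placed in the working directory of each executor, so a "net" entry that carries a driver-side directory may not resolve on the workers, which matches the FileNotFoundException in the log above. A minimal sketch of a check for this, assuming a hypothetical helper named check_net_path and illustrative file contents:

```shell
# check_net_path is a hypothetical helper: it reads a solver prototxt and
# reports whether its "net" entry contains a directory component.
check_net_path() {
  net=$(sed -n 's/^[[:space:]]*net:[[:space:]]*"\(.*\)".*/\1/p' "$1")
  case "$net" in
    */*) echo "path: $net" ;;   # has a directory; may not exist on workers
    *)   echo "bare: $net" ;;   # bare file name; resolved in the working dir
  esac
}

# Demo on a throwaway solver file; the contents are illustrative only.
demo=$(mktemp)
printf 'net: "/root/CaffeOnSpark/data/mnist_memory_autoencoder.prototxt"\n' > "$demo"
check_net_path "$demo"
printf 'net: "mnist_memory_autoencoder.prototxt"\n' > "$demo"
check_net_path "$demo"
rm -f "$demo"
```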

@githubier

I hit the same error (NullPointerException) when training another network (CaffeNet); see issue #217 for more detail.
I changed the spark-submit command to use the files under ${CAFFE_ON_SPARK}/data/, so the command is:
spark-submit --master ${MASTER_URL}
--files ${CAFFE_ON_SPARK}/data/solver.prototxt,${CAFFE_ON_SPARK}/data/train_val.prototxt
--conf spark.cores.max=${TOTAL_CORES}
--conf spark.task.cpus=${CORES_PER_WORKER}
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--class com.yahoo.ml.caffe.CaffeOnSpark
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
-train
-features accuracy,loss -label label
-conf solver.prototxt
-clusterSize ${SPARK_WORKER_INSTANCES}
-devices 1
-connection ethernet
-model file:${CAFFE_ON_SPARK}/myself_caffenet.model
-output file:${CAFFE_ON_SPARK}/myself_result

The solver.prototxt and train_val.prototxt files are located in ${CAFFE_ON_SPARK}/data/,
and the error is:
17/01/11 20:49:34 ERROR caffe.DataSource$: source_class must be defined for input data layer:Data
Exception in thread "main" java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:103)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/11 20:49:34 INFO spark.SparkContext: Invoking stop() from shutdown hook

Where is my error?
