AMI updated to new code #41

Closed
rahulbhalerao001 opened this Issue Mar 25, 2016 · 35 comments


rahulbhalerao001 commented Mar 25, 2016

Is the AMI ami-6373ca10 updated with the latest code? If not, what are the steps to bring it up to the latest development?

Contributor

anfeng commented Mar 25, 2016

We will need to bring it up to date per the instructions given at https://github.com/yahoo/CaffeOnSpark/wiki/Create_AMI

rahulbhalerao001 commented Mar 25, 2016

Thank you for your quick response. So to confirm: I need to follow steps 6, 7, 8, and 9 from the wiki (though not the clone step) on AMI ami-6373ca10.

Also, do I need to do it only on the master or also on the slaves?

rahulbhalerao001 commented Mar 25, 2016

Or would you recommend creating a new AMI from scratch, following all the steps, and then using it for all master and slave machines?

Contributor

anfeng commented Mar 25, 2016

We should launch an instance with the existing image, and apply steps 6, 8, and 9 to build a new image. For step 6, we will do a git pull to get the updated source code.

If I find time this weekend, I will try to create a new image.

rahulbhalerao001 commented Mar 26, 2016

I started a new g2.8xlarge instance with the latest AMI (ami-6373ca10). I did a git pull and then:

pushd CaffeOnSpark/caffe-public/
cp Makefile.config.example Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
pushd ..
export CAFFE_ON_SPARK=/root/CaffeOnSpark
export LD_LIBRARY_PATH="${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib:/usr/lib64:/lib64:/usr/local/cuda-7.0/lib64"
make build

But I am getting the following error:

[INFO] Compiling 16 source files to /root/CaffeOnSpark/caffe-grid/target/classes at 1458951756956
[ERROR] /root/CaffeOnSpark/caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataFrame.scala:35: error: value hasDataframeFormat is not a member of caffe.Caffe.MemoryDataParameter
[INFO] if (memdatalayer_param.hasDataframeFormat())
[INFO] ^
[ERROR] /root/CaffeOnSpark/caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataFrame.scala:36: error: value getDataframeFormat is not a member of caffe.Caffe.MemoryDataParameter
[INFO] reader = reader.format(memdatalayer_param.getDataframeFormat())
[INFO] ^
[ERROR] two errors found
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] caffe ............................................. SUCCESS [0.002s]
[INFO] caffe-distri ...................................... SUCCESS [4:58.843s]
[INFO] caffe-grid ........................................ FAILURE [38.472s]
[INFO] ------------------------------------------------------------------------

It would be great if you could let me know if I am missing something here.

Contributor

anfeng commented Mar 26, 2016

You need to update the caffe-public submodule.

cd caffe-public
git pull origin master
cd ..
make build
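The reason the pull at the top level was not enough: a submodule checkout is pinned to the commit recorded in the superproject, so new upstream code only arrives when you pull inside the submodule itself, as in the commands above. A self-contained toy illustration (all repo and file names here are hypothetical, built in a throwaway mktemp directory):

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in for the upstream caffe-public repo (hypothetical name).
git init -q -b master sub-upstream
( cd sub-upstream \
  && git -c user.email=a@b -c user.name=a commit -q --allow-empty -m v1 )

# Superproject records the submodule at commit v1.
git init -q -b master super
( cd super \
  && git -c protocol.file.allow=always submodule --quiet add "$work/sub-upstream" sub \
  && git -c user.email=a@b -c user.name=a commit -q -m "add sub" )

# Upstream gains new code (the "latest source").
( cd sub-upstream \
  && echo new > feature.txt \
  && git add feature.txt \
  && git -c user.email=a@b -c user.name=a commit -q -m v2 )

# The superproject's checkout still lacks feature.txt until we pull
# inside the submodule, exactly as in the instructions above.
( cd super/sub && git pull -q origin master )
ls super/sub/feature.txt
```

After the pull inside `sub`, the new file is present in the superproject's working tree, which mirrors why `cd caffe-public && git pull origin master` is needed before `make build`.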


rahulbhalerao001 commented Mar 26, 2016

It is giving another error now :(. I apologize for the spam; please let me know if this is an issue specific to my instance and one I should figure out myself.

Tests run: 17, Failures: 9, Errors: 0, Skipped: 7, Time elapsed: 0.785 sec <<< FAILURE!
setUp(com.yahoo.ml.jcaffe.CaffeNetTest) Time elapsed: 0.322 sec <<< FAILURE!
java.lang.UnsatisfiedLinkError: /root/CaffeOnSpark/caffe-distri/.build_release/lib/libcaffedistri.so: libcudart.so.7.0: cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1965)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1890)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1880)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at com.yahoo.ml.jcaffe.BaseObject.<clinit>(BaseObject.java:10)
at com.yahoo.ml.jcaffe.CaffeNetTest.setUp(CaffeNetTest.java:39)

Contributor

mriduljain commented Mar 26, 2016

Just specify the path to your CUDA libs in LD_LIBRARY_PATH, export it, and compile.
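A minimal sketch of that advice. The CUDA 7.0 path below is an assumption taken from the LD_LIBRARY_PATH used earlier in this thread; adjust it to wherever your toolkit actually lives:

```shell
# Prepend the CUDA library directory (assumed location) to LD_LIBRARY_PATH
# so the dynamic loader can resolve libcudart.so.7.0 at test/run time.
CUDA_LIB="${CUDA_LIB:-/usr/local/cuda-7.0/lib64}"
export LD_LIBRARY_PATH="${CUDA_LIB}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
echo "$LD_LIBRARY_PATH"
```

Then re-run `make build` in the same shell so the tests pick up the path.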

rahulbhalerao001 commented Mar 27, 2016

Thank you for your patience.
I was able to build successfully, and then I launched a 3-node cluster using the scripts given in the wiki, replacing the AMI ID with that of my AMI with the updated code.

I followed the steps further and ran the lenet example on the page, and am getting the following error. Your help will be greatly appreciated.

Exception in thread "main" org.apache.spark.SparkException: addFile does not support local directories when not running local mode.
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1368)
at com.yahoo.ml.caffe.LmdbRDD.getPartitions(LmdbRDD.scala:43)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:157)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Contributor

anfeng commented Mar 27, 2016

It sounds like we should not use SparkContext.addFile() and SparkFiles.get() when lmdb_path is a local file path. LmdbRDD.scala may need to be revised as below:

  • L43: if (!lmdb_path.startsWith(FSUtils.localfsPrefix)) sc.addFile(lmdb_path, true)
  • L168: val local_lmdb_folder = if (lmdb_path.startsWith(FSUtils.localfsPrefix)) lmdb_path.substring("file://".length) else SparkFiles.get(folder.getName)
rahulbhalerao001 commented Mar 27, 2016

OK, that worked, and I am not getting the same error now.
However, as shown below, the MNIST example is taking a long time: it has been stuck in the min step for more than 10 minutes. I remember running this just after CaffeOnSpark was open-sourced, and it used to complete within 5 minutes. Am I missing something here?
[image]

Contributor

mriduljain commented Mar 27, 2016

Could you check the executor logs, please?


rahulbhalerao001 commented Mar 27, 2016

I think that instead of copy-pasting, sharing the Spark UI URL is a better option: http://ec2-54-194-79-51.eu-west-1.compute.amazonaws.com:4040/jobs/

Please let me know if I should paste the logs here.

rahulbhalerao001 commented Mar 27, 2016

It's been stuck for 35 minutes now. Attaching the logs:
executor_stderr.txt

image

Contributor

anfeng commented Mar 27, 2016

What's your CLI command? It looks like the GPUs could not communicate with each other.

I0327 06:06:38.474051 4559 parallel.cpp:392] GPUs pairs 0:1, 2:3, 0:2
I0327 06:06:38.485669 4559 parallel.cpp:234] GPU 1 does not have p2p
access to GPU 0
I0327 06:06:38.496879 4559 parallel.cpp:234] GPU 2 does not have p2p
access to GPU 0
I0327 06:06:38.508085 4559 parallel.cpp:234] GPU 3 does not have p2p
access to GPU 2


rahulbhalerao001 commented Mar 27, 2016

The command is the same as in the wiki:

root@ip-172-31-29-4:~/CaffeOnSpark/data# spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} -connection ethernet \
    -model /mnist.model -output /mnist_features_result

Contributor

anfeng commented Mar 27, 2016

What are the values for TOTAL_CORES and DEVICES?


Collaborator

junshi15 commented Mar 27, 2016

In your setup, the GPUs cannot do p2p access, so communication among them will be slow. You can check your setup against
https://github.com/BVLC/caffe/blob/master/docs/multigpu.md#hardware-configuration-assumptions

Your code was stuck somewhere in the section below, where the program tries to find the minimal size of the partitions. I suspect one of the executors failed to read the LMDB file for one of various reasons: hard disk failure, Hadoop file system failure, LMDB parser failure, etc. I only see the log file for one executor; you may want to examine all of them.
https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/CaffeOnSpark.scala#L167-L182

Did the old AMI work for you? You only had problems after upgrading the AMI?
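Not from the thread, but a quick way to inspect GPU peer-to-peer connectivity on an NVIDIA host is the nvidia-smi topology matrix. A guarded sketch (requires the NVIDIA driver; on a machine without it the block just says so):

```shell
# Print the GPU topology matrix: closer link types (e.g. PIX, PXB)
# generally support p2p, while links that cross the host bridge or
# CPU interconnect (PHB, SYS) generally do not.
if command -v nvidia-smi >/dev/null 2>&1; then
  topo=$(nvidia-smi topo -m)
else
  topo="nvidia-smi not available on this host"
fi
printf '%s\n' "$topo"
```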

rahulbhalerao001 commented Mar 27, 2016

@anfeng :
export CORES_PER_WORKER=32
export DEVICES=4
export SPARK_WORKER_INSTANCES=3
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
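For reference, the arithmetic these exports perform (a g2.8xlarge has 32 vCPUs and 4 GPUs, hence the per-worker values above):

```shell
# 3 workers x 32 cores each -> spark.cores.max of 96 for the whole job.
CORES_PER_WORKER=32
SPARK_WORKER_INSTANCES=3
TOTAL_CORES=$((CORES_PER_WORKER * SPARK_WORKER_INSTANCES))
echo "$TOTAL_CORES"   # prints 96
```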

rahulbhalerao001 commented Mar 27, 2016

@junshi15:
All the log files had similar content.
The first AMI that was launched had worked, but after pulling the latest changes it did not work in any AMI.
I am not doing anything new or different; I just pulled the changes, built, and followed the steps on the wiki.

rahulbhalerao001 commented Mar 27, 2016

It would be great if someone could try this out and see if it works, because I have tried to run this basic example directly out of the box.

Contributor

anfeng commented Mar 27, 2016

@rahulbhalerao001 I just upgraded the AMI, and verified its execution per the guide with g2.8xlarge. Please try out the new AMI ami-790c8b0a in the eu-west-1 region.

@junshi15 @mriduljain Please review PR #42, which fixes the local file issue found by @rahulbhalerao001.


@rahulbhalerao001


rahulbhalerao001 Mar 28, 2016

Still having the same issue, am attaching the logs here and terminating the instances.

worker1.txt
worker2.txt
worker3.txt
master.txt


@anfeng


anfeng Mar 28, 2016

Contributor

From the log, GPU devices could not be synchronized. They are stuck at L180 of CaffeOnSpark. This should be a system issue unrelated to our code change.

Here is a response from nVidia engineer:

  • Known issue. You can’t enable P2P on a VM because of the VMs protection mechanism.

We may have to set devices=1 for EC2.
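
As a sketch of that workaround (reusing the variables from the commands earlier in this thread), one could first inspect the GPU peer-to-peer topology on a worker with `nvidia-smi topo -m`, then resubmit the training job with a single GPU device per executor:

```shell
# Workaround sketch, assuming the same environment as the commands above.
# On a worker: P2P links show up in the topology matrix; VMs typically
# expose GPUs without P2P connectivity.
nvidia-smi topo -m

# Resubmit with one device per executor instead of 4.
export DEVICES=1
spark-submit --master spark://$(hostname):7077 \
    --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} \
    -connection ethernet \
    -model /mnist.model \
    -output /mnist_features_result
```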


@rahulbhalerao001


rahulbhalerao001 Mar 28, 2016

@anfeng
Thank you for your response. I will try with devices=1, but I want to point out that it worked on EC2 with the initial AMI. I highly doubt it is an issue specific to EC2/NVIDIA, because the code ran perfectly fine with the same configuration previously.


@anfeng


anfeng Mar 28, 2016

Contributor

@rahulbhalerao001 We reproduced a hang problem within L180 in house, and will come up with a solution soon.


@rahulbhalerao001


rahulbhalerao001 Mar 28, 2016

@anfeng thank you for looking into it. Meanwhile, will it work with 1 device per machine?


@anfeng


anfeng Mar 28, 2016

Contributor

AFAIK, the code works fine for 2 machines, even with multiple devices.
Somehow, we have a problem with 3 machines or more.


@rahulbhalerao001


rahulbhalerao001 Mar 28, 2016

@anfeng : I am reframing my question:
The MNIST example produces a directory /mnist_features_result which contains intermediate accuracy and loss logging, e.g.:
{"SampleID":"00000000","accuracy":[1.0],"loss":[0.0019047105],"label":[7.0]}
{"SampleID":"00000001","accuracy":[1.0],"loss":[0.0019047105],"label":[2.0]}
{"SampleID":"00000002","accuracy":[1.0],"loss":[0.0019047105],"label":[1.0]}
{"SampleID":"00000003","accuracy":[1.0],"loss":[0.0019047105],"label":[0.0]}
{"SampleID":"00000004","accuracy":[1.0],"loss":[0.0019047105],"label":[4.0]}
{"SampleID":"00000005","accuracy":[1.0],"loss":[0.0019047105],"label":[1.0]}
{"SampleID":"00000006","accuracy":[1.0],"loss":[0.0019047105],"label":[4.0]}
{"SampleID":"00000007","accuracy":[1.0],"loss":[0.0019047105],"label":[9.0]}
{"SampleID":"00000008","accuracy":[1.0],"loss":[0.0019047105],"label":[5.0]}
{"SampleID":"00000009","accuracy":[1.0],"loss":[0.0019047105],"label":[9.0]}

For CIFAR-10, /cifar10_features_result is a file which has only the final loss and accuracy, e.g.:
loss: 1.367735541228092
accuracy: 0.6595959609205072

Could you provide some info on how to configure which form of result is produced? For example, how do I view periodic values for CIFAR-10?
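
As a side note, the per-sample lines above can also be aggregated by hand. This is only a sketch, assuming the output lines match exactly the JSON shape shown (a single `"accuracy":[x]` field per line):

```shell
# Sketch: compute the mean per-batch accuracy from the feature output.
# Assumes each line looks like the samples above, e.g.
#   {"SampleID":"00000000","accuracy":[1.0],"loss":[0.0019047105],"label":[7.0]}
hadoop fs -cat /mnist_features_result/* \
  | sed -n 's/.*"accuracy":\[\([0-9.eE+-]*\)\].*/\1/p' \
  | awk '{ sum += $1; n++ } END { if (n) printf "mean accuracy over %d samples: %f\n", n, sum/n }'
```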


@anfeng


anfeng Mar 29, 2016

Contributor

@rahulbhalerao001 It would be great if you could verify our new code per PR #43. It has been verified in our local environment only.


@rahulbhalerao001


rahulbhalerao001 Mar 29, 2016

Thank you for providing a fix. I will do a make build, verify, and let you know by the end of the day.


@rahulbhalerao001


rahulbhalerao001 Mar 29, 2016

@anfeng : I verified the MNIST and CIFAR-10 examples on a 3-node g2.8xlarge cluster. They are working correctly and no longer hang.

Before closing this issue, it would be great if you could help me figure out the previous question.


@junshi15


junshi15 Mar 29, 2016

Collaborator

@rahulbhalerao001
The accuracy and loss you got for MNIST are per-mini-batch numbers. You got them via "-features accuracy,loss". To get the overall accuracy and loss, replace that with "-test", as you did with CIFAR-10.

Interleaving training and testing, i.e. periodic testing while training, is not available in the current version of CaffeOnSpark. So you will not get any test accuracy/loss during training (though you could get the train accuracy/loss, which may not be as useful). The workaround would be snapshotting the model periodically and starting another Spark job to "-test" the accuracy/loss.
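
That workaround could look roughly like the following sketch. It assumes the solver prototxt has snapshot/snapshot_prefix configured so that snapshots land at the paths the loop expects (the /mnist_snapshot prefix and iteration numbers here are hypothetical), and it reuses the variables from the commands earlier in this thread:

```shell
# Sketch: evaluate periodic snapshots with separate "-test" Spark jobs.
# Assumes the solver's snapshot_prefix produced files named
# /mnist_snapshot_iter_<N>.caffemodel (standard Caffe snapshot naming),
# and reuses ${TOTAL_CORES}, ${SPARK_WORKER_INSTANCES}, ${CAFFE_ON_SPARK}
# from the commands above.
for iter in 1000 2000 3000; do
  spark-submit --master spark://$(hostname):7077 \
      --files lenet_memory_train_test.prototxt,lenet_memory_solver.prototxt \
      --conf spark.cores.max=${TOTAL_CORES} \
      --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
      --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
      --class com.yahoo.ml.caffe.CaffeOnSpark \
      ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
      -test \
      -conf lenet_memory_solver.prototxt \
      -model /mnist_snapshot_iter_${iter}.caffemodel \
      -clusterSize ${SPARK_WORKER_INSTANCES} \
      -devices 1 \
      -connection ethernet \
      -output /mnist_test_result_${iter}
done
```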


@rahulbhalerao001


rahulbhalerao001 Mar 29, 2016

@junshi15 : Thank you for the detailed explanation. Appreciate your help.

