ssl problem when installing pip #56
@jzhusc That line just installs pip into the Python distro that we're trying to package up for Spark. If Google Dataproc already has Python installed, can you just try something like […]? Unfortunately, I don't have any experience with this environment, so it's hard to say whether your Python dependencies will be available on the executors. But if they are, you may not need to supply a Python archive at all.
@leewyang Hi, installing libssl-dev works.
Yes, sorry, that line was missing in the wiki. I've corrected it now.
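For anyone hitting the same build issue: pip fails this way when the packaged Python was compiled without the OpenSSL headers. A quick sanity check on the Python you're shipping to the executors (a hypothetical snippet, not from this thread):

```python
# If this import fails, Python was built without SSL support; install
# libssl-dev and rebuild Python before running get-pip.py.
import ssl
print(ssl.OPENSSL_VERSION)
```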
@leewyang Another problem, I think, is in […]
@leewyang […]
Do you have any suggestions?
@jzhusc according to this line, your cluster_spec only defined one worker (and no PS):
Can you provide the command line that you used to launch this job?
Can you post the driver logs, which should print the following log statements:
https://github.com/yahoo/TensorFlowOnSpark/blob/master/src/com/yahoo/ml/tf/TFCluster.py#L267-L269
On Mon, Apr 10, 2017 at 10:04 AM, Jiaxu Zhu wrote:
@leewyang
spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 5G --py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py --images mnist/tfr/train --format tfr --mode train --model mnist_model
17/04/10 17:02:31 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2
Also I found an error in the driver: […]
But sometimes another error occurs: […]
That doesn't seem to match your command line from earlier... Generally, `cluster_size` == `num_executors`. And you may be encountering other issues with HDFS access from TensorFlow, e.g. #33 (comment).
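As a hedged illustration of that invariant, the driver can derive the executor count from the Spark conf and fail fast on a mismatch; `sc` and `args` are the names used in the MNIST example scripts, and the conf lookup is an assumption about this Spark version:

```python
# Sketch: fail fast if --cluster_size disagrees with --num-executors.
# "spark.executor.instances" is only present when --num-executors was set.
executors = sc._conf.get("spark.executor.instances")
num_executors = int(executors) if executors is not None else 1
assert args.cluster_size == num_executors, \
    "cluster_size (%d) != num executors (%d)" % (args.cluster_size, num_executors)
```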
On Mon, Apr 10, 2017 at 10:51 AM, Jiaxu Zhu wrote:
@leewyang
args: Namespace(cluster_size=10, epochs=0, format='tfr', images='mnist/tfr/train', labels=None, mode='train', model='mnist_model', output='predictions', rdma=False, readers=1, steps=1000, tensorboard=False)
2017-04-10T17:49:35.496284 ===== Start
2017-04-10 17:49:35,496 INFO (MainThread-2086) Reserving TFSparkNodes
2017-04-10 17:49:35,500 INFO (MainThread-2086) listening for reservations at ('snaprec-deep-w-4', 39247)
2017-04-10 17:49:35,501 INFO (MainThread-2086) Starting TensorFlow on executors
2017-04-10 17:49:35,859 INFO (MainThread-2086) Waiting for TFSparkNodes to start
2017-04-10 17:49:35,859 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:36,860 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:37,862 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:38,863 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:39,865 INFO (MainThread-2086) waiting for 8 reservations
2017-04-10 17:49:40,866 INFO (MainThread-2086) waiting for 6 reservations
2017-04-10 17:49:41,867 INFO (MainThread-2086) waiting for 3 reservations
2017-04-10 17:49:42,869 INFO (MainThread-2086) waiting for 2 reservations
2017-04-10 17:49:43,870 INFO (MainThread-2086) waiting for 2 reservations
2017-04-10 17:49:44,871 INFO (MainThread-2086) waiting for 2 reservations
2017-04-10 17:49:45,873 INFO (MainThread-2086) waiting for 1 reservations
[... the "waiting for 1 reservations" message repeats once per second until 2017-04-10 17:50:19 ...]
Also I found an error in the driver:
(0 + 10) / 10]17/04/10 17:49:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, snaprec-deep-w-9.c.snap-brain.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 762, in func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 411, in _mapfn
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/__pyfiles__/mnist_dist.py", line 140, in map_fun
x, y_ = read_tfr_examples(images, 100, num_epochs, index, workers)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/__pyfiles__/mnist_dist.py", line 83, in read_tfr_examples
files = tf.gfile.Glob(tf_record_pattern)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 269, in get_matching_files
compat.as_bytes(filename), status)]
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
UnimplementedError: File system scheme hdfs not implemented
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[Stage 0:> (0 + 10) / 10]
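The `UnimplementedError: File system scheme hdfs not implemented` above generally means the TensorFlow build on the executors has no HDFS filesystem support wired up. A hedged sketch of the environment TF's libhdfs client typically needs; the paths are illustrative assumptions, see #33 for the actual discussion:

```python
import os
import subprocess

# Assumption: TF reaches HDFS through libhdfs, which needs the Hadoop jars
# on CLASSPATH and HADOOP_HDFS_HOME pointing at the HDFS install.
os.environ.setdefault("HADOOP_HDFS_HOME", "/usr/lib/hadoop-hdfs")  # illustrative path
os.environ.setdefault(
    "CLASSPATH",
    subprocess.check_output(["hadoop", "classpath", "--glob"]).decode("utf-8").strip())
```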
@leewyang I expanded the cluster size to 10 but still have the same problem. Also I found that there are two tasks in each executor, which ended up using 5 executors instead of 10. Is that the reason?
I'd recommend simplifying your test case to one worker + one PS. You seem to be getting some very strange cluster_specs somehow. The first cluster_spec constructs two workers and no PS, yet it's trying to start a PS:
{'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394']}
The second cluster_spec seems to be skipping task_index numbers:
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-tYaBWE/listener-_ISeHu', 'task_index': 4, 'job_name': 'worker', 'authkey': 'TY\xaa\xfe-aGZ\xa5\xb8\xbe\x8b\x8ca&\xd1', 'worker_num': 5, 'host': 'snaprec-deep-w-0', 'ppid': 21376, 'port': 48886, 'tb_pid': 0, 'tb_port': 0}
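For contrast, a well-formed cluster_spec for a 1 PS + 2 worker launch would look something like this (hostnames and ports are illustrative, not taken from these logs):

```python
cluster_spec = {
    'ps':     ['snaprec-deep-w-0:40000'],
    'worker': ['snaprec-deep-w-5:40473',   # task_index 0
               'snaprec-deep-w-6:45394'],  # task_index 1 -- contiguous, no gaps
}
```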
On Mon, Apr 10, 2017 at 2:33 PM, Jiaxu Zhu wrote:
@leewyang
Now my job is stuck at 2/10 or 4/10 for a long time. I have set logdir=None. There are three types of errors in the executors:
17/04/10 21:23:37 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:23:39,672 INFO (MainThread-8481) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,674 INFO (MainThread-8481) TFSparkNode.reserve: {'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'tb_port': 0, 'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'ppid': 8473, 'task_index': 2, 'job_name': 'worker', 'tb_pid': 0, 'port': 45394}
2017-04-10 21:23:39,674 INFO (MainThread-8480) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,676 INFO (MainThread-8480) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,676 INFO (MainThread-8480) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,855 INFO (MainThread-8480) Starting TensorFlow worker:1 on cluster node 2 on background process
2017-04-10 21:23:40,727 INFO (MainThread-8530) 2: ======== worker:1 ========
2017-04-10 21:23:40,727 INFO (MainThread-8530) 2: Cluster spec: {'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394']}
2017-04-10 21:23:40,727 INFO (MainThread-8530) 2: Using CPU
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> snaprec-deep-w-5:40473, 1 -> localhost:45394}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:241] Started server with target: grpc://localhost:45394
tensorflow model path: None
Process Process-2:
Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/__pyfiles__/mnist_dist.py", line 122, in map_fun
save_model_secs=10)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 336, in __init__
self._verify_setup()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 881, in _verify_setup
"their device set: %s" % op)
ValueError: When using replicas, all Variables must have their device set: name: "hid_w"
op: "VariableV2"
attr {
key: "container"
value {
s: ""
}
}
attr {
key: "dtype"
value {
type: DT_FLOAT
}
}
attr {
key: "shape"
value {
shape {
dim {
size: 784
}
dim {
size: 128
}
}
}
}
attr {
key: "shared_name"
value {
s: ""
}
}
and
17/04/10 21:23:37 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:23:39,562 INFO (MainThread-8999) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,563 INFO (MainThread-8999) TFSparkNode.reserve: {'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'tb_port': 0, 'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'ppid': 8992, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 40473}
2017-04-10 21:23:39,711 INFO (MainThread-8998) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,713 INFO (MainThread-8998) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,713 INFO (MainThread-8998) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,861 INFO (MainThread-8998) Starting TensorFlow ps:0 on cluster node 0 on background process
2017-04-10 21:23:45,679 INFO (MainThread-9048) 0: ======== ps:0 ========
2017-04-10 21:23:45,679 INFO (MainThread-9048) 0: Cluster spec: {'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394']}
2017-04-10 21:23:45,679 INFO (MainThread-9048) 0: Using CPU
Process Process-2:
Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/__pyfiles__/mnist_dist.py", line 39, in map_fun
cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)
File "./tfspark.zip/com/yahoo/ml/tf/TFNode.py", line 88, in start_cluster_server
server = tf.train.Server(cluster, ctx.job_name, ctx.task_index)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
self._server_def.SerializeToString(), status)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
InternalError: Job "ps" was not defined in cluster
and
17/04/10 21:23:38 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:23:39,827 INFO (MainThread-21382) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,828 INFO (MainThread-21382) TFSparkNode.reserve: {'authkey': 'TY\xaa\xfe-aGZ\xa5\xb8\xbe\x8b\x8ca&\xd1', 'worker_num': 5, 'host': 'snaprec-deep-w-0', 'tb_port': 0, 'addr': '/tmp/pymp-tYaBWE/listener-_ISeHu', 'ppid': 21376, 'task_index': 4, 'job_name': 'worker', 'tb_pid': 0, 'port': 48886}
2017-04-10 21:23:39,831 INFO (MainThread-21383) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-tYaBWE/listener-_ISeHu', 'task_index': 4, 'job_name': 'worker', 'authkey': 'TY\xaa\xfe-aGZ\xa5\xb8\xbe\x8b\x8ca&\xd1', 'worker_num': 5, 'host': 'snaprec-deep-w-0', 'ppid': 21376, 'port': 48886, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,978 INFO (MainThread-21383) Starting TensorFlow worker:3 on cluster node 4 on background process
2017-04-10 21:23:40,825 INFO (MainThread-21432) 4: ======== worker:3 ========
2017-04-10 21:23:40,825 INFO (MainThread-21432) 4: Cluster spec: {'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394', 'snaprec-deep-w-0:48886']}
2017-04-10 21:23:40,825 INFO (MainThread-21432) 4: Using CPU
Process Process-2:
Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/__pyfiles__/mnist_dist.py", line 39, in map_fun
cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)
File "./tfspark.zip/com/yahoo/ml/tf/TFNode.py", line 88, in start_cluster_server
server = tf.train.Server(cluster, ctx.job_name, ctx.task_index)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
self._server_def.SerializeToString(), status)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
InvalidArgumentError: Task 3 was not defined in job "worker"
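A side note on the first error in that quote: the Supervisor raises `ValueError: When using replicas, all Variables must have their device set` when variables are not pinned to a device, and with no `ps` job in the cluster_spec, `tf.train.replica_device_setter` has nowhere to place them. A minimal hedged sketch of the usual pattern, with hypothetical stand-ins for the values TFNode normally supplies:

```python
import tensorflow as tf

# Hypothetical stand-ins for the cluster/task info the TF context provides:
cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                "worker": ["w0:2222", "w1:2222"]})
task_index = 0

# Pin variables to the PS tasks; without this (or without any "ps" job in
# the spec) the Supervisor raises the device-set error seen above.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    hid_w = tf.Variable(tf.truncated_normal([784, 128], stddev=0.1), name="hid_w")
```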
@leewyang I used only 2 executors and got these errors: […]
Seems like that's the reason.
@leewyang By limiting each executor to one task, the problem is solved.
Yes, we require that each executor only runs one task at a time (and no dynamic allocation). The exact configuration depends on your Spark version/setup, but you might be able to try […]
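The exact flags suggested here were truncated; a hedged sketch of one common way to pin one task per executor follows. These are standard Spark properties, but treat the values as assumptions for this particular setup:

```python
from pyspark import SparkConf

# One task per executor: make each task claim all of an executor's cores
# and keep dynamic allocation off so the executor set stays fixed.
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.executor.instances", "10")
        .set("spark.executor.cores", "1")    # or keep more cores and set
        .set("spark.task.cpus", "1"))        # spark.task.cpus == executor.cores
```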
I am installing TensorFlowOnSpark on Google Dataproc, which already has Python and OpenSSL installed. I followed the guide for a YARN cluster and hit this problem when running get-pip.py.