ssl problem when installing pip #56
@jzhusc That line just installs pip into the Python distro that we're trying to package up for Spark. If Google Dataproc already has Python installed, can you just try something like […]? Unfortunately, I don't have any experience with this environment, so it's hard to say whether your Python dependencies will be available on the executors. But if they are, you may not need to supply a Python archive at all.
@leewyang Hi, installing libssl-dev works.
Yes, sorry, that line was missing in the wiki. I've corrected it now.
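For anyone hitting the same build issue: pip fails this way when the packaged Python was compiled without the OpenSSL headers. A quick sanity check on the Python you're shipping to the executors (a hypothetical snippet, not from this thread):

```python
# If this import fails, Python was built without SSL support; install
# libssl-dev and rebuild Python before running get-pip.py.
import ssl
print(ssl.OPENSSL_VERSION)
```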
@leewyang Another problem, I think, is in […]
@leewyang […]
Do you have any suggestions?
@jzhusc according to this line, your cluster_spec only defined one worker (and no PS):
Can you provide the command line that you used to launch this job?
Can you post the driver logs, which should print the following log statements:
https://github.com/yahoo/TensorFlowOnSpark/blob/master/src/com/yahoo/ml/tf/TFCluster.py#L267-L269
On Mon, Apr 10, 2017 at 10:04 AM, Jiaxu Zhu wrote:
@leewyang
spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 5G --py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py --images mnist/tfr/train --format tfr --mode train --model mnist_model
17/04/10 17:02:31 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2
Also I found an error in the driver: […]
But sometimes another error occurs: […]
That doesn't seem to match your command line from earlier... Generally, `cluster_size` == `num_executors`. And you may be encountering other issues with HDFS access from TensorFlow, e.g. #33 (comment).
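As a hedged illustration of that invariant, the driver can derive the executor count from the Spark conf and fail fast on a mismatch; `sc` and `args` are the names used in the MNIST example scripts, and the conf lookup is an assumption about this Spark version:

```python
# Sketch: fail fast if --cluster_size disagrees with --num-executors.
# "spark.executor.instances" is only present when --num-executors was set.
executors = sc._conf.get("spark.executor.instances")
num_executors = int(executors) if executors is not None else 1
assert args.cluster_size == num_executors, \
    "cluster_size (%d) != num executors (%d)" % (args.cluster_size, num_executors)
```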
On Mon, Apr 10, 2017 at 10:51 AM, Jiaxu Zhu wrote:
@leewyang
args: Namespace(cluster_size=10, epochs=0, format='tfr', images='mnist/tfr/train', labels=None, mode='train', model='mnist_model', output='predictions', rdma=False, readers=1, steps=1000, tensorboard=False)
2017-04-10T17:49:35.496284 ===== Start
2017-04-10 17:49:35,496 INFO (MainThread-2086) Reserving TFSparkNodes
2017-04-10 17:49:35,500 INFO (MainThread-2086) listening for reservations at ('snaprec-deep-w-4', 39247)
2017-04-10 17:49:35,501 INFO (MainThread-2086) Starting TensorFlow on executors
2017-04-10 17:49:35,859 INFO (MainThread-2086) Waiting for TFSparkNodes to start
2017-04-10 17:49:35,859 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:36,860 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:37,862 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:38,863 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:39,865 INFO (MainThread-2086) waiting for 8 reservations
2017-04-10 17:49:40,866 INFO (MainThread-2086) waiting for 6 reservations
2017-04-10 17:49:41,867 INFO (MainThread-2086) waiting for 3 reservations
2017-04-10 17:49:42,869 INFO (MainThread-2086) waiting for 2 reservations
2017-04-10 17:49:43,870 INFO (MainThread-2086) waiting for 2 reservations
2017-04-10 17:49:44,871 INFO (MainThread-2086) waiting for 2 reservations
2017-04-10 17:49:45,873 INFO (MainThread-2086) waiting for 1 reservations
[... the "waiting for 1 reservations" message repeats once per second until 2017-04-10 17:50:19 ...]
Also I found an error in the driver:
(0 + 10) / 10]17/04/10 17:49:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, snaprec-deep-w-9.c.snap-brain.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 317, in func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 762, in func
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 411, in _mapfn
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/__pyfiles__/mnist_dist.py", line 140, in map_fun
x, y_ = read_tfr_examples(images, 100, num_epochs, index, workers)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/__pyfiles__/mnist_dist.py", line 83, in read_tfr_examples
files = tf.gfile.Glob(tf_record_pattern)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 269, in get_matching_files
compat.as_bytes(filename), status)]
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
UnimplementedError: File system scheme hdfs not implemented
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[Stage 0:> (0 + 10) / 10]
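The `UnimplementedError: File system scheme hdfs not implemented` above generally means the TensorFlow build on the executors has no HDFS filesystem support wired up. A hedged sketch of the environment TF's libhdfs client typically needs; the paths are illustrative assumptions, see #33 for the actual discussion:

```python
import os
import subprocess

# Assumption: TF reaches HDFS through libhdfs, which needs the Hadoop jars
# on CLASSPATH and HADOOP_HDFS_HOME pointing at the HDFS install.
os.environ.setdefault("HADOOP_HDFS_HOME", "/usr/lib/hadoop-hdfs")  # illustrative path
os.environ.setdefault(
    "CLASSPATH",
    subprocess.check_output(["hadoop", "classpath", "--glob"]).decode("utf-8").strip())
```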
@leewyang I expanded the cluster size to 10 but still have the same problem. Also I found that there are two tasks in each executor, which ended up using 5 executors instead of 10. Is that the reason?
I'd recommend simplifying your test case to one worker + one PS. You seem to be getting some very strange cluster_specs somehow. The first cluster_spec constructs two workers and no PS, yet it's trying to start a PS:
{'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394']}
The second cluster_spec seems to be skipping task_index numbers:
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-tYaBWE/listener-_ISeHu', 'task_index': 4, 'job_name': 'worker', 'authkey': 'TY\xaa\xfe-aGZ\xa5\xb8\xbe\x8b\x8ca&\xd1', 'worker_num': 5, 'host': 'snaprec-deep-w-0', 'ppid': 21376, 'port': 48886, 'tb_pid': 0, 'tb_port': 0}
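For contrast, a well-formed cluster_spec for a 1 PS + 2 worker launch would look something like this (hostnames and ports are illustrative, not taken from these logs):

```python
cluster_spec = {
    'ps':     ['snaprec-deep-w-0:40000'],
    'worker': ['snaprec-deep-w-5:40473',   # task_index 0
               'snaprec-deep-w-6:45394'],  # task_index 1 -- contiguous, no gaps
}
```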
On Mon, Apr 10, 2017 at 2:33 PM, Jiaxu Zhu wrote:
@leewyang
Now my job is stuck at 2/10 or 4/10 for a long time. I have set logdir=None. There are three types of errors in the executors:
17/04/10 21:23:37 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:23:39,672 INFO (MainThread-8481) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,674 INFO (MainThread-8481) TFSparkNode.reserve: {'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'tb_port': 0, 'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'ppid': 8473, 'task_index': 2, 'job_name': 'worker', 'tb_pid': 0, 'port': 45394}
2017-04-10 21:23:39,674 INFO (MainThread-8480) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,676 INFO (MainThread-8480) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,676 INFO (MainThread-8480) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,855 INFO (MainThread-8480) Starting TensorFlow worker:1 on cluster node 2 on background process
2017-04-10 21:23:40,727 INFO (MainThread-8530) 2: ======== worker:1 ========
2017-04-10 21:23:40,727 INFO (MainThread-8530) 2: Cluster spec: {'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394']}
2017-04-10 21:23:40,727 INFO (MainThread-8530) 2: Using CPU
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> snaprec-deep-w-5:40473, 1 -> localhost:45394}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:241] Started server with target: grpc://localhost:45394
tensorflow model path: None
Process Process-2:
Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/__pyfiles__/mnist_dist.py", line 122, in map_fun
save_model_secs=10)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 336, in __init__
self._verify_setup()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 881, in _verify_setup
"their device set: %s" % op)
ValueError: When using replicas, all Variables must have their device set: name: "hid_w"
op: "VariableV2"
attr {
key: "container"
value {
s: ""
}
}
attr {
key: "dtype"
value {
type: DT_FLOAT
}
}
attr {
key: "shape"
value {
shape {
dim {
size: 784
}
dim {
size: 128
}
}
}
}
attr {
key: "shared_name"
value {
s: ""
}
}
and
17/04/10 21:23:37 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:23:39,562 INFO (MainThread-8999) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,563 INFO (MainThread-8999) TFSparkNode.reserve: {'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'tb_port': 0, 'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'ppid': 8992, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 40473}
2017-04-10 21:23:39,711 INFO (MainThread-8998) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,713 INFO (MainThread-8998) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,713 INFO (MainThread-8998) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,861 INFO (MainThread-8998) Starting TensorFlow ps:0 on cluster node 0 on background process
2017-04-10 21:23:45,679 INFO (MainThread-9048) 0: ======== ps:0 ========
2017-04-10 21:23:45,679 INFO (MainThread-9048) 0: Cluster spec: {'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394']}
2017-04-10 21:23:45,679 INFO (MainThread-9048) 0: Using CPU
Process Process-2:
Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/__pyfiles__/mnist_dist.py", line 39, in map_fun
cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)
File "./tfspark.zip/com/yahoo/ml/tf/TFNode.py", line 88, in start_cluster_server
server = tf.train.Server(cluster, ctx.job_name, ctx.task_index)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
self._server_def.SerializeToString(), status)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
InternalError: Job "ps" was not defined in cluster
and
17/04/10 21:23:38 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:23:39,827 INFO (MainThread-21382) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,828 INFO (MainThread-21382) TFSparkNode.reserve: {'authkey': 'TY\xaa\xfe-aGZ\xa5\xb8\xbe\x8b\x8ca&\xd1', 'worker_num': 5, 'host': 'snaprec-deep-w-0', 'tb_port': 0, 'addr': '/tmp/pymp-tYaBWE/listener-_ISeHu', 'ppid': 21376, 'task_index': 4, 'job_name': 'worker', 'tb_pid': 0, 'port': 48886}
2017-04-10 21:23:39,831 INFO (MainThread-21383) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-tYaBWE/listener-_ISeHu', 'task_index': 4, 'job_name': 'worker', 'authkey': 'TY\xaa\xfe-aGZ\xa5\xb8\xbe\x8b\x8ca&\xd1', 'worker_num': 5, 'host': 'snaprec-deep-w-0', 'ppid': 21376, 'port': 48886, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,978 INFO (MainThread-21383) Starting TensorFlow worker:3 on cluster node 4 on background process
2017-04-10 21:23:40,825 INFO (MainThread-21432) 4: ======== worker:3 ========
2017-04-10 21:23:40,825 INFO (MainThread-21432) 4: Cluster spec: {'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394', 'snaprec-deep-w-0:48886']}
2017-04-10 21:23:40,825 INFO (MainThread-21432) 4: Using CPU
Process Process-2:
Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/__pyfiles__/mnist_dist.py", line 39, in map_fun
cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)
File "./tfspark.zip/com/yahoo/ml/tf/TFNode.py", line 88, in start_cluster_server
server = tf.train.Server(cluster, ctx.job_name, ctx.task_index)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
self._server_def.SerializeToString(), status)
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
InvalidArgumentError: Task 3 was not defined in job "worker"
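A side note on the first error in that quote: the Supervisor raises `ValueError: When using replicas, all Variables must have their device set` when variables are not pinned to a device, and with no `ps` job in the cluster_spec, `tf.train.replica_device_setter` has nowhere to place them. A minimal hedged sketch of the usual pattern, with hypothetical stand-ins for the values TFNode normally supplies:

```python
import tensorflow as tf

# Hypothetical stand-ins for the cluster/task info the TF context provides:
cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                "worker": ["w0:2222", "w1:2222"]})
task_index = 0

# Pin variables to the PS tasks; without this (or without any "ps" job in
# the spec) the Supervisor raises the device-set error seen above.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    hid_w = tf.Variable(tf.truncated_normal([784, 128], stddev=0.1), name="hid_w")
```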
@leewyang I used only 2 executors and got these errors: […]
Seems like that's the reason.
@leewyang By limiting each executor to one task, the problem is solved.
Yes, we require that each executor only runs one task at a time (and no dynamic allocation). The exact configuration depends on your Spark version/setup, but you might be able to try […]
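The exact flags suggested here were truncated; a hedged sketch of one common way to pin one task per executor follows. These are standard Spark properties, but treat the values as assumptions for this particular setup:

```python
from pyspark import SparkConf

# One task per executor: make each task claim all of an executor's cores
# and keep dynamic allocation off so the executor set stays fixed.
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.executor.instances", "10")
        .set("spark.executor.cores", "1")    # or keep more cores and set
        .set("spark.task.cpus", "1"))        # spark.task.cpus == executor.cores
```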
I am installing TensorFlowOnSpark on Google Dataproc, which already has Python and OpenSSL installed. I followed the guide for a YARN cluster and hit this problem when running get-pip.py.