UnimplementedError: File system scheme hdfs not implemented #125

Closed

xuande opened this issue Aug 30, 2017 · 6 comments

@xuande

xuande commented Aug 30, 2017

RUNTIME:

  • OS: CentOS7.2
  • JDK: 1.8
  • Python: 2.7.13
  • TensorFlow: 1.2.1
  • Hadoop: 2.7.3.2.6.1.0-129
  • Cluster: 3 Nodes

ENV

export PYTHON_ROOT=/opt/anaconda
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/opt/anaconda/bin/python"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export QUEUE=default

export HADOOP_COMMON_HOME=/usr/hdp/2.6.1.0-129
export LIB_HDFS=${HADOOP_COMMON_HOME}/usr/lib
export LIB_JVM=$JAVA_HOME/jre/lib/amd64/server

export HADOOP_HDFS_HOME=${HADOOP_COMMON_HOME}/hadoop-hdfs
export HADOOP_HOME=${HADOOP_COMMON_HOME}/hadoop
export HADOOP_CONFIG_DIR=$HADOOP_HOME/etc/hadoop

export SPARK_HOME="${HADOOP_COMMON_HOME}/spark2"
export PYSPARK_PYTHON="/opt/anaconda/bin/python"
export PYLIB="$SPARK_HOME/python/lib"
export PYTHONPATH="$PYLIB/py4j-0.10.4-src.zip:$PYTHONPATH"
export PYTHONPATH="$PYLIB/pyspark.zip:$PYTHONPATH"
export PYTHONPATH="/data/TensorFlowOnSpark/examples/mnist/spark:$PYTHONPATH"

export M2_HOME=/opt/apache-maven-3.5.0
export PATH=$M2_HOME/bin:$PATH
export CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}"
export LD_LIBRARY_PATH=/usr/hdp/2.6.1.0-129/usr/lib:${JAVA_HOME}/jre/lib/amd64/server:$LD_LIBRARY_PATH

COMMAND

/usr/hdp/2.6.1.0-129/spark2/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 3 \
--executor-cores 1 \
--executor-memory 2G \
--py-files /data/TensorFlowOnSpark/tfspark.zip,/data/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executorEnv.LD_LIBRARY_PATH="/usr/hdp/2.6.1.0-129/usr/lib:${JAVA_HOME}/jre/lib/amd64/server" \
--conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
/data/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
--images mnist/csv/train/images \
--labels mnist/csv/train/labels \
--mode train \
--model mnist_model_8

yarn.log

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/hadoop/yarn/local/filecache/10/spark2-hdp-yarn-archive.tar.gz/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.1.0-129/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/08/30 16:07:05 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 24579@idap-agent-217.idap.com
17/08/30 16:07:05 INFO SignalUtils: Registered signal handler for TERM
17/08/30 16:07:05 INFO SignalUtils: Registered signal handler for HUP
17/08/30 16:07:05 INFO SignalUtils: Registered signal handler for INT
17/08/30 16:07:05 INFO SecurityManager: Changing view acls to: yarn,root
17/08/30 16:07:05 INFO SecurityManager: Changing modify acls to: yarn,root
17/08/30 16:07:05 INFO SecurityManager: Changing view acls groups to: 
17/08/30 16:07:05 INFO SecurityManager: Changing modify acls groups to: 
17/08/30 16:07:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, root); groups with view permissions: Set(); users  with modify permissions: Set(yarn, root); groups with modify permissions: Set()
17/08/30 16:07:06 INFO TransportClientFactory: Successfully created connection to /10.110.18.218:37739 after 82 ms (0 ms spent in bootstraps)
17/08/30 16:07:06 INFO SecurityManager: Changing view acls to: yarn,root
17/08/30 16:07:06 INFO SecurityManager: Changing modify acls to: yarn,root
17/08/30 16:07:06 INFO SecurityManager: Changing view acls groups to: 
17/08/30 16:07:06 INFO SecurityManager: Changing modify acls groups to: 
17/08/30 16:07:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, root); groups with view permissions: Set(); users  with modify permissions: Set(yarn, root); groups with modify permissions: Set()
17/08/30 16:07:06 INFO TransportClientFactory: Successfully created connection to /10.110.18.218:37739 after 1 ms (0 ms spent in bootstraps)
17/08/30 16:07:06 INFO DiskBlockManager: Created local directory at /hadoop/yarn/local/usercache/root/appcache/application_1503653016725_0157/blockmgr-716fb7a5-8017-49de-bc03-8e0bf5ba4c8e
17/08/30 16:07:06 INFO DiskBlockManager: Created local directory at /data/hadoop/yarn/local/usercache/root/appcache/application_1503653016725_0157/blockmgr-376203a7-0f47-4737-8cb0-43b1d92668a2
17/08/30 16:07:06 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/08/30 16:07:07 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.110.18.218:37739
17/08/30 16:07:07 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
17/08/30 16:07:07 INFO Executor: Starting executor ID 3 on host idap-agent-217.idap.com
17/08/30 16:07:07 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39416.
17/08/30 16:07:07 INFO NettyBlockTransferService: Server created on idap-agent-217.idap.com:39416
17/08/30 16:07:07 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/08/30 16:07:07 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(3, idap-agent-217.idap.com, 39416, None)
17/08/30 16:07:07 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(3, idap-agent-217.idap.com, 39416, None)
17/08/30 16:07:07 INFO BlockManager: Initialized BlockManager: BlockManagerId(3, idap-agent-217.idap.com, 39416, None)
17/08/30 16:07:09 INFO CoarseGrainedExecutorBackend: Got assigned task 1
17/08/30 16:07:09 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/30 16:07:09 INFO TorrentBroadcast: Started reading broadcast variable 2
17/08/30 16:07:09 INFO TransportClientFactory: Successfully created connection to /10.110.18.218:48138 after 24 ms (0 ms spent in bootstraps)
17/08/30 16:07:09 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 16.3 KB, free 912.3 MB)
17/08/30 16:07:09 INFO TorrentBroadcast: Reading broadcast variable 2 took 300 ms
17/08/30 16:07:09 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 62.4 KB, free 912.2 MB)
2017-08-30 16:07:10,187 INFO (MainThread-24632) connected to server at ('10.110.18.218', 41629)
2017-08-30 16:07:10,189 INFO (MainThread-24632) TFSparkNode.reserve: {'authkey': '\x7f\x8a\xc5C\x01GG\x15\xb0\x97\xe7\xe1\x03\xa9\x13\xaf', 'worker_num': 1, 'host': '10.110.18.217', 'tb_port': 0, 'addr': '/tmp/pymp-Pf3luO/listener-h4IHYw', 'ppid': 24626, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 49231}
2017-08-30 16:07:12,197 INFO (MainThread-24632) node: {'addr': ('10.110.18.218', 40308), 'task_index': 0, 'job_name': 'ps', 'authkey': '\xecL\x9c=E;Iu\x9c\xda\x8a0C\xb5Q\xea', 'worker_num': 0, 'host': '10.110.18.218', 'ppid': 3591, 'port': 34459, 'tb_pid': 0, 'tb_port': 0}
2017-08-30 16:07:12,198 INFO (MainThread-24632) node: {'addr': '/tmp/pymp-Pf3luO/listener-h4IHYw', 'task_index': 0, 'job_name': 'worker', 'authkey': '\x7f\x8a\xc5C\x01GG\x15\xb0\x97\xe7\xe1\x03\xa9\x13\xaf', 'worker_num': 1, 'host': '10.110.18.217', 'ppid': 24626, 'port': 49231, 'tb_pid': 0, 'tb_port': 0}
2017-08-30 16:07:12,198 INFO (MainThread-24632) node: {'addr': '/tmp/pymp-5x1p33/listener-aCAUhw', 'task_index': 1, 'job_name': 'worker', 'authkey': '\x90Z\x85\xf0\xe7\x9fMI\x94\xac\x88;\xb4;\x90\xd1', 'worker_num': 2, 'host': '10.110.18.216', 'ppid': 29446, 'port': 55107, 'tb_pid': 0, 'tb_port': 0}
2017-08-30 16:07:12,205 INFO (MainThread-24632) Starting TensorFlow worker:0 on cluster node 1 on background process
17/08/30 16:07:12 INFO PythonRunner: Times: total = 2388, boot = 301, init = 51, finish = 2036
17/08/30 16:07:12 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2175 bytes result sent to driver
17/08/30 16:07:12 INFO CoarseGrainedExecutorBackend: Got assigned task 3
17/08/30 16:07:12 INFO Executor: Running task 0.0 in stage 1.0 (TID 3)
17/08/30 16:07:12 INFO TorrentBroadcast: Started reading broadcast variable 3
17/08/30 16:07:12 INFO TransportClientFactory: Successfully created connection to idap-server-216.idap.com/10.110.18.216:58668 after 4 ms (0 ms spent in bootstraps)
17/08/30 16:07:12 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 33.6 KB, free 912.2 MB)
17/08/30 16:07:12 INFO TorrentBroadcast: Reading broadcast variable 3 took 75 ms
17/08/30 16:07:12 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 166.9 KB, free 912.0 MB)
17/08/30 16:07:12 INFO HadoopRDD: Input split: hdfs://idap-agent-217.idap.com:8020/user/root/mnist/csv/train/images/part-00000:0+9338236
17/08/30 16:07:12 INFO TorrentBroadcast: Started reading broadcast variable 0
17/08/30 16:07:12 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 31.0 KB, free 912.0 MB)
17/08/30 16:07:12 INFO TorrentBroadcast: Reading broadcast variable 0 took 21 ms
17/08/30 16:07:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 451.3 KB, free 911.6 MB)
17/08/30 16:07:13 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/08/30 16:07:13 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/08/30 16:07:13 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/08/30 16:07:13 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/08/30 16:07:13 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
2017-08-30 16:07:13,835 INFO (MainThread-24647) 1: ======== worker:0 ========
2017-08-30 16:07:13,835 INFO (MainThread-24647) 1: Cluster spec: {'ps': ['10.110.18.218:34459'], 'worker': ['10.110.18.217:49231', '10.110.18.216:55107']}
2017-08-30 16:07:13,835 INFO (MainThread-24647) 1: Using CPU
2017-08-30 16:07:13.836451: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-30 16:07:13.836492: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-30 16:07:13.836502: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-30 16:07:13.836506: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
D0830 16:07:13.836708932   24647 env_linux.c:77]             Warning: insecure environment read function 'getenv' used
2017-08-30 16:07:13.844460: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 10.110.18.218:34459}
2017-08-30 16:07:13.844496: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:49231, 1 -> 10.110.18.216:55107}
2017-08-30 16:07:13.844783: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:49231
17/08/30 16:07:13 INFO HadoopRDD: Input split: hdfs://idap-agent-217.idap.com:8020/user/root/mnist/csv/train/labels/part-00000:0+204800
17/08/30 16:07:13 INFO TorrentBroadcast: Started reading broadcast variable 1
17/08/30 16:07:13 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 31.0 KB, free 911.5 MB)
17/08/30 16:07:13 INFO TorrentBroadcast: Reading broadcast variable 1 took 72 ms
17/08/30 16:07:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 451.3 KB, free 911.1 MB)
tensorflow model path: hdfs:///user/root/mnist_model7
2017-08-30 16:07:14,103 INFO (MainThread-24678) Connected to TFSparkNode.mgr on 10.110.18.217, ppid=24626, state='running'
2017-08-30 16:07:14,109 INFO (MainThread-24678) mgr.state='running'
2017-08-30 16:07:14,109 INFO (MainThread-24678) Feeding partition <itertools.chain object at 0x7f6ace1e68d0> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f6ab8322a50>
Process Process-2:
Traceback (most recent call last):
  File "/opt/anaconda/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/anaconda/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/data/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py", line 122, in map_fun
    save_model_secs=10)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 330, in __init__
    self._summary_writer = _summary.FileWriter(self._logdir)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/summary/writer/writer.py", line 310, in __init__
    filename_suffix)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 367, in recursive_create_dir
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status)
  File "/opt/anaconda/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
UnimplementedError: File system scheme hdfs not implemented
17/08/30 16:07:16 INFO PythonRunner: Times: total = 2563, boot = -1351, init = 1499, finish = 2415
17/08/30 16:07:16 INFO PythonRunner: Times: total = 91, boot = 7, init = 16, finish = 68

When running the demo in a Jupyter notebook, the same exception is thrown.


2017-08-30 15:39:53,166 INFO (MainThread-28832) Stopping TensorFlow nodes
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/opt/anaconda/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/opt/anaconda/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflowonspark/TFCluster.py", line 54, in _start
    background=(self.input_mode == InputMode.SPARK)))
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 798, in foreachPartition
    self.mapPartitions(func).count()  # Force evaluation
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1040, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 1031, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 905, in fold
    vals = self.mapPartitions(func).collect()
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 808, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 8, idap-agent-218.idap.com, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2408, in pipeline_func
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2408, in pipeline_func
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2408, in pipeline_func
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 345, in func
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 793, in func
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflowonspark/TFSparkNode.py", line 243, in _mapfn
    fn(tf_args, ctx)
  File "/data/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py", line 122, in map_fun
    save_model_secs=10)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 330, in __init__
    self._summary_writer = _summary.FileWriter(self._logdir)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/summary/writer/writer.py", line 310, in __init__
    filename_suffix)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 367, in recursive_create_dir
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status)
  File "/opt/anaconda/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
UnimplementedError: File system scheme hdfs not implemented

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 174, in main
    process()
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2408, in pipeline_func
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2408, in pipeline_func
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 2408, in pipeline_func
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 345, in func
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line 793, in func
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflowonspark/TFSparkNode.py", line 243, in _mapfn
    fn(tf_args, ctx)
  File "/data/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py", line 122, in map_fun
    save_model_secs=10)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 330, in __init__
    self._summary_writer = _summary.FileWriter(self._logdir)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/summary/writer/writer.py", line 310, in __init__
    filename_suffix)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 67, in __init__
    gfile.MakeDirs(self._logdir)
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 367, in recursive_create_dir
    pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(dirname), status)
  File "/opt/anaconda/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/opt/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
UnimplementedError: File system scheme hdfs not implemented

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more


2017-08-30 15:39:54,774 INFO (MainThread-28832) Shutting down cluster
@leewyang
Contributor

I'm assuming that /usr/hdp/2.6.1.0-129/usr/lib contains the libhdfs.so file and that it is available on all of your grid nodes at that location?

@xuande
Author

xuande commented Aug 31, 2017

@leewyang Thanks for your reply. I checked the configuration of each node in the cluster, and all of them have the libhdfs.so file in the /usr/hdp/2.6.1.0-129/usr/lib directory.

[root@idap-agent-218 ~]# cd /usr/hdp/2.6.1.0-129/usr/lib
[root@idap-agent-218 lib]# ls
libhdfs.so  libhdfs.so.0.0.0
[root@idap-agent-218 lib]# ssh root@10.110.18.217
Last login: Thu Aug 31 10:11:09 2017 from idap-agent-218.idap.com
[root@idap-agent-217 ~]# cd /usr/hdp/2.6.1.0-129/usr/lib
[root@idap-agent-217 lib]# ls
libhdfs.so  libhdfs.so.0.0.0
[root@idap-agent-217 lib]# ssh root@10.110.18.216
[root@idap-server-216 ~]# cd /usr/hdp/2.6.1.0-129/usr/lib
[root@idap-server-216 lib]# ls
libhdfs.so  libhdfs.so.0.0.0

@leewyang
Contributor

@xuande Here are a couple of thoughts...

  1. Does your core-site.xml have the following lines (or something similar), per this Stack Overflow answer?
<property>
   <name>fs.file.impl</name>
   <value>org.apache.hadoop.fs.LocalFileSystem</value>
   <description>The FileSystem for file: uris.</description>
</property>

<property>
   <name>fs.hdfs.impl</name>
   <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
   <description>The FileSystem for hdfs: uris.</description>
</property>
  2. I see you're using --conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}", which sets the executor's classpath to the Hadoop classpath of your gateway/launcher box. This is hopefully the same as on your grid nodes.

  3. Just to be sure, you may want to dump the CLASSPATH and LD_LIBRARY_PATH (and maybe $(hadoop classpath --glob)) as seen by the executors (i.e. inside the TF code), just to see if there's anything unexpected; one way to do this is sketched below.
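
A minimal sketch (not from the original thread) of inspecting these variables from inside the executors' Python workers; it assumes the existing SparkContext sc and simply prints whatever the workers see:

import os

def dump_env(_):
    # Read the variables inside the Python worker process on the executor.
    return [(os.environ.get("CLASSPATH", "<unset>"),
             os.environ.get("LD_LIBRARY_PATH", "<unset>"))]

# `sc` is the SparkContext already used to launch the TensorFlowOnSpark job;
# a few partitions are usually enough to sample every executor (not guaranteed).
for classpath, ld_path in sc.parallelize(range(3), 3).mapPartitions(dump_env).collect():
    print("CLASSPATH: %s" % classpath)
    print("LD_LIBRARY_PATH: %s" % ld_path)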

@xuande
Author

xuande commented Sep 1, 2017

@leewyang Thanks again for your reply. I have found the reason it does not work, and I will verify it again. The TensorFlow installed via pip install tensorflow raises this error when running on CentOS 7 (install_tensorflow_centos7). After I reinstalled it with pip install tensorflow-1.2.1-cp27-cp27mu-manylinux1_x86_64.whl (download it), it works well.
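
As a quick sanity check (a sketch, not code from this thread), a TensorFlow build with HDFS support can be exercised directly through the gfile API, independent of Spark; the hdfs:// path below is just this cluster's example data and is an assumption:

import tensorflow as tf

# With CLASSPATH and LD_LIBRARY_PATH set as in the ENV section above, this
# should return a boolean rather than raise
# "UnimplementedError: File system scheme hdfs not implemented".
print(tf.gfile.Exists("hdfs://idap-agent-217.idap.com:8020/user/root/mnist/csv/train/images"))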
In addition, I still have a problem: can the mnist demo TFOS_spark_demo.ipynb be run in a Jupyter notebook? When I run it in Jupyter, the Spark task keeps waiting. Here are the logs:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/hadoop/yarn/local/filecache/10/spark2-hdp-yarn-archive.tar.gz/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.6.1.0-129/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/09/01 23:56:01 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 2210@idap-server-216.idap.com
17/09/01 23:56:01 INFO SignalUtils: Registered signal handler for TERM
17/09/01 23:56:01 INFO SignalUtils: Registered signal handler for HUP
17/09/01 23:56:01 INFO SignalUtils: Registered signal handler for INT
17/09/01 23:56:01 INFO SecurityManager: Changing view acls to: yarn,root
17/09/01 23:56:01 INFO SecurityManager: Changing modify acls to: yarn,root
17/09/01 23:56:01 INFO SecurityManager: Changing view acls groups to: 
17/09/01 23:56:01 INFO SecurityManager: Changing modify acls groups to: 
17/09/01 23:56:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, root); groups with view permissions: Set(); users  with modify permissions: Set(yarn, root); groups with modify permissions: Set()
17/09/01 23:56:02 INFO TransportClientFactory: Successfully created connection to /10.110.18.218:44048 after 75 ms (0 ms spent in bootstraps)
17/09/01 23:56:02 INFO SecurityManager: Changing view acls to: yarn,root
17/09/01 23:56:02 INFO SecurityManager: Changing modify acls to: yarn,root
17/09/01 23:56:02 INFO SecurityManager: Changing view acls groups to: 
17/09/01 23:56:02 INFO SecurityManager: Changing modify acls groups to: 
17/09/01 23:56:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, root); groups with view permissions: Set(); users  with modify permissions: Set(yarn, root); groups with modify permissions: Set()
17/09/01 23:56:02 INFO TransportClientFactory: Successfully created connection to /10.110.18.218:44048 after 2 ms (0 ms spent in bootstraps)
17/09/01 23:56:02 INFO DiskBlockManager: Created local directory at /hadoop/yarn/local/usercache/root/appcache/application_1504261531784_0012/blockmgr-13b1432f-709e-4624-80da-ac39323572f4
17/09/01 23:56:02 INFO DiskBlockManager: Created local directory at /data/hadoop/yarn/local/usercache/root/appcache/application_1504261531784_0012/blockmgr-d236adcb-bf76-4228-b1de-c65922c8b40f
17/09/01 23:56:02 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/09/01 23:56:02 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.110.18.218:44048
17/09/01 23:56:03 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
17/09/01 23:56:03 INFO Executor: Starting executor ID 1 on host idap-server-216.idap.com
17/09/01 23:56:03 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38958.
17/09/01 23:56:03 INFO NettyBlockTransferService: Server created on idap-server-216.idap.com:38958
17/09/01 23:56:03 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/09/01 23:56:03 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, idap-server-216.idap.com, 38958, None)
17/09/01 23:56:03 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, idap-server-216.idap.com, 38958, None)
17/09/01 23:56:03 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, idap-server-216.idap.com, 38958, None)
17/09/01 23:56:05 INFO CoarseGrainedExecutorBackend: Got assigned task 2
17/09/01 23:56:05 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
17/09/01 23:56:05 INFO TorrentBroadcast: Started reading broadcast variable 0
17/09/01 23:56:05 INFO TransportClientFactory: Successfully created connection to /10.110.18.218:51817 after 4 ms (0 ms spent in bootstraps)
17/09/01 23:56:05 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.7 KB, free 912.3 MB)
17/09/01 23:56:05 INFO TorrentBroadcast: Reading broadcast variable 0 took 192 ms
17/09/01 23:56:05 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 10.6 KB, free 912.3 MB)
2017-09-01 23:56:06,422 INFO (MainThread-2475) connected to server at ('10.110.18.218', 42188)
2017-09-01 23:56:06,425 INFO (MainThread-2475) TFSparkNode.reserve: {'authkey': 'c\x11\xe0c\x03\xf0O\xe9\x96]\x98\xcf\xdf\xc3\x8b\xa2', 'worker_num': 2, 'host': '10.110.18.216', 'tb_port': 0, 'addr': '/tmp/pymp-X3S7tQ/listener-sZWH7X', 'ppid': 2448, 'task_index': 1, 'job_name': 'worker', 'tb_pid': 0, 'port': 36704}
2017-09-01 23:56:08,435 INFO (MainThread-2475) node: {'addr': ('10.110.18.217', 50067), 'task_index': 0, 'job_name': 'ps', 'authkey': '&$`J\xb0aE\xe3\x9dQ\x80\\bg\xd4\xcb', 'worker_num': 0, 'host': '10.110.18.217', 'ppid': 8715, 'port': 43031, 'tb_pid': 0, 'tb_port': 0}
2017-09-01 23:56:08,435 INFO (MainThread-2475) node: {'addr': '/tmp/pymp-GJZIHa/listener-5A3o0S', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xfc\x1dXz\t\xc3Hu\xb7^>\x90\xb4z\x1c\x7f', 'worker_num': 1, 'host': '10.110.18.218', 'ppid': 16294, 'port': 40333, 'tb_pid': 16308, 'tb_port': 52425}
2017-09-01 23:56:08,435 INFO (MainThread-2475) node: {'addr': '/tmp/pymp-X3S7tQ/listener-sZWH7X', 'task_index': 1, 'job_name': 'worker', 'authkey': 'c\x11\xe0c\x03\xf0O\xe9\x96]\x98\xcf\xdf\xc3\x8b\xa2', 'worker_num': 2, 'host': '10.110.18.216', 'ppid': 2448, 'port': 36704, 'tb_pid': 0, 'tb_port': 0}
2017-09-01 23:56:08,444 INFO (MainThread-2475) Starting TensorFlow worker:1 on cluster node 2 on background process
17/09/01 23:56:08 INFO PythonRunner: Times: total = 2465, boot = 348, init = 69, finish = 2048
17/09/01 23:56:08 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2175 bytes result sent to driver
2017-09-01 23:56:09,981 INFO (MainThread-2507) 2: ======== worker:1 ========
2017-09-01 23:56:09,981 INFO (MainThread-2507) 2: Cluster spec: {'ps': ['10.110.18.217:43031'], 'worker': ['10.110.18.218:40333', '10.110.18.216:36704']}
2017-09-01 23:56:09,981 INFO (MainThread-2507) 2: Using CPU
2017-09-01 23:56:09.982865: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-01 23:56:09.982885: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-01 23:56:09.982891: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-01 23:56:09.982896: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-01 23:56:09.992619: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 10.110.18.217:43031}
2017-09-01 23:56:09.992703: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 10.110.18.218:40333, 1 -> localhost:36704}
2017-09-01 23:56:09.996374: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:36704
tensorflow model path: hdfs://idap-agent-217.idap.com:8020/user/root/mnist_model
17/09/01 23:56:13 INFO CoarseGrainedExecutorBackend: Got assigned task 4
17/09/01 23:56:13 INFO Executor: Running task 1.0 in stage 1.0 (TID 4)
17/09/01 23:56:13 INFO TorrentBroadcast: Started reading broadcast variable 3
17/09/01 23:56:13 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 7.1 KB, free 912.3 MB)
17/09/01 23:56:13 INFO TorrentBroadcast: Reading broadcast variable 3 took 29 ms
17/09/01 23:56:13 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 12.7 KB, free 912.3 MB)
17/09/01 23:56:13 INFO HadoopRDD: Input split: hdfs://10.110.18.217:8020/user/root/mnist/csv/train/images/part-00001:0+11231804
17/09/01 23:56:13 INFO TorrentBroadcast: Started reading broadcast variable 1
17/09/01 23:56:13 INFO TransportClientFactory: Successfully created connection to idap-agent-218.idap.com/10.110.18.218:57294 after 4 ms (0 ms spent in bootstraps)
17/09/01 23:56:13 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 31.0 KB, free 912.2 MB)
17/09/01 23:56:13 INFO TorrentBroadcast: Reading broadcast variable 1 took 111 ms
17/09/01 23:56:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 451.3 KB, free 911.8 MB)
17/09/01 23:56:14 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/09/01 23:56:14 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/09/01 23:56:14 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/09/01 23:56:14 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/09/01 23:56:14 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/09/01 23:56:14 INFO HadoopRDD: Input split: hdfs://10.110.18.217:8020/user/root/mnist/csv/train/labels/part-00001:0+245760
17/09/01 23:56:14 INFO TorrentBroadcast: Started reading broadcast variable 2
17/09/01 23:56:14 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 31.0 KB, free 911.8 MB)
17/09/01 23:56:14 INFO TorrentBroadcast: Reading broadcast variable 2 took 30 ms
17/09/01 23:56:14 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 451.3 KB, free 911.3 MB)
2017-09-01 23:56:14,596 INFO (MainThread-2619) Connected to TFSparkNode.mgr on 10.110.18.216, ppid=2448, state='running'
2017-09-01 23:56:14,607 INFO (MainThread-2619) mgr.state='running'
2017-09-01 23:56:14,608 INFO (MainThread-2619) Feeding partition <itertools.chain object at 0x7f0b4d0d0250> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f0b4df96350>
2017-09-01 23:56:15.507988: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session 5bac03e47564b81e with config: 

INFO:tensorflow:Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
2017-09-01 23:56:15,556 INFO (MainThread-2507) Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
17/09/01 23:56:17 INFO PythonRunner: Times: total = 3086, boot = -5732, init = 5862, finish = 2956
17/09/01 23:56:17 INFO PythonRunner: Times: total = 66, boot = 3, init = 9, finish = 54
2017-09-01 23:56:45.586441: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session 3e9be1b8c62fc48b with config: 

INFO:tensorflow:Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad
2017-09-01 23:56:45,608 INFO (MainThread-2507) Waiting for model to be ready.  Ready_for_local_init_op:  None, ready: Variables not initialized: hid_w, hid_b, sm_w, sm_b, Variable, hid_w/Adagrad, hid_b/Adagrad, sm_w/Adagrad, sm_b/Adagrad

@leewyang
Contributor

leewyang commented Sep 1, 2017

@xuande Glad that you got your env working! As for the notebook issue, how are you starting up the notebook? i.e. are you setting spark.executorEnv.LD_LIBRARY_PATH somewhere?

@xuande
Author

xuande commented Sep 2, 2017

@leewyang Thanks for the reminder. Just as you say, after adding spark.executorEnv.LD_LIBRARY_PATH while starting the Jupyter notebook, the mnist demo works well. Thanks a lot.
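
A minimal sketch of one way to set this when the notebook creates its own SparkContext (an assumption about the notebook setup, not the exact code used above); the HDP and JAVA_HOME paths are the ones from this cluster and will differ elsewhere:

import os
from pyspark import SparkConf, SparkContext

# libhdfs.so / libjvm.so locations from this cluster's HDP layout; adjust for
# your own environment (these paths are assumptions, not general defaults).
ld_library_path = "/usr/hdp/2.6.1.0-129/usr/lib:" + \
    os.path.join(os.environ["JAVA_HOME"], "jre/lib/amd64/server")

conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("TFOS_spark_demo")
        .set("spark.executorEnv.LD_LIBRARY_PATH", ld_library_path)
        .set("spark.executorEnv.CLASSPATH", os.environ.get("CLASSPATH", "")))
sc = SparkContext(conf=conf)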
