
I want to train VDSR on TensorFlowOnSpark; which data format should I use? #74

Closed
bobo2001281 opened this issue Apr 29, 2017 · 22 comments

@bobo2001281

The original source is here:
https://github.com/Jongchan/tensorflow-vdsr

1. The data is a MATLAB 5.0 MAT-file. Following https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion, I changed the code, but only on the assumption that this data format is valid for TensorFlowOnSpark to use on HDFS.
2. Do I need to convert the data to TFRecords?

@leewyang
Contributor

leewyang commented May 1, 2017

@bobo2001281 You have several options:

  1. If your dataset fits into memory, you can just use any existing code to load it into the memory of each executor and train as usual. In this case, you'll need to ship the MAT file to the executors, much like the way we ship the "mnist.zip" file in the data conversion example (see the sketch after this list).
  2. If your dataset doesn't fit into memory, you'll need a way to split it into files that are easily "readable" by either Spark (e.g. sc.textFile() or sc.sequenceFile()) or TensorFlow (e.g. TFRecord)
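A minimal sketch of option 1, assuming the MAT file has already been shipped to each executor and extracted into a folder in the container's working directory (the archive, folder, and file names are hypothetical):

    def map_fun(args, ctx):
        # Runs on each Spark executor. The archive named after '#' in --archives
        # is extracted into the container's current working dir, so relative paths work.
        import scipy.io
        import tensorflow as tf

        data = scipy.io.loadmat("vdsr_data/train.mat")  # hypothetical folder/file names
        # ... build the VDSR graph and train on `data` as usual ...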

@bobo2001281
Author

Regarding the first option:
mnist_data_setup.py differs depending on what the data is.
The data in the other examples (cifar10, imagenet, slim) is downloaded directly from GitHub, following the README.md.

How can I ship my data? There does not seem to be any label data in my dataset.

@leewyang
Contributor

leewyang commented May 2, 2017

Shipping the data to the executors can be done with the --archives option. So, for the data conversion example, specifying --archives mnist/mnist.zip#mnist tells Spark to copy the mnist.zip file to each executor and extract it into a folder named mnist in the executor's current working dir.
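For example (hypothetical archive and folder names), the VDSR MAT data could be shipped the same way:

    --archives hdfs:///user/${USER}/vdsr_data.zip#vdsr_data

so that code on each executor can read it with a relative path such as vdsr_data/train.mat.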

@bobo2001281
Author

Now:

  1. I put the data on HDFS with:
    hadoop fs -put ./data
  2. I modified VDSR.py and added VDSR_spark.py.
  3. I submitted the job with:
    ${SPARK_HOME}/bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --queue ${QUEUE} \
      --num-executors 4 \
      --executor-memory 27G \
      --py-files TensorFlow_VDSR/tfspark.zip,TensorFlow_VDSR/VDSR.py \
      --conf spark.dynamicAllocation.enabled=false \
      --conf spark.yarn.maxAppAttempts=1 \
      --archives hdfs:///user/${USER}/Python.zip#Python \
      --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
      TensorFlow_VDSR/VDSR_spark.py \
      --images hdfs:///user/${USER}/data/train \
      --model_path ./model_VDSR_me
    and an error occurred:

17/05/03 16:39:30 INFO yarn.Client: Application report for application_1493036076768_0092 (state: ACCEPTED)
17/05/03 16:39:31 INFO yarn.Client: Application report for application_1493036076768_0092 (state: FAILED)
17/05/03 16:39:31 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1493036076768_0092 failed 1 times due to AM Container for appattempt_1493036076768_0092_000001 exited with exitCode: 1
For more detailed output, check application tracking page:http://u10-121-135-150:8088/cluster/app/application_1493036076768_0092Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1493036076768_0092_01_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1493800746814
final status: FAILED
tracking URL: http://u10-121-135-150:8088/cluster/app/application_1493036076768_0092
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1493036076768_0092 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/05/03 16:39:31 INFO util.ShutdownHookManager: Shutdown hook called
17/05/03 16:39:31 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b8f69730-babb-4b65-93e2-8bbc317bcebc

@leewyang
Contributor

leewyang commented May 4, 2017

Please grab the yarn logs via: yarn logs -applicationId application_1493036076768_0092 and search for any errors/exceptions.

@bobo2001281
Author

hadoop@u10-121-135-150:~/hadoop-2.7.1/logs$ yarn logs -applicationId application_1493036076768_0106
/tmp/logs/hadoop/logs/application_1493036076768_0106 does not exist.
Log aggregation has not completed or is not enabled.

The log does not exist, either while spark-submit is running or after the command has finished.

@bobo2001281
Author

bobo2001281 commented May 5, 2017

I modified yarn-site.xml as below, and now I can see my logs.

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/app-logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
    <value>logs</value>
  </property>

@bobo2001281
Author

bobo2001281 commented May 5, 2017

😢
How do I attach a file?

@bobo2001281
Author

17/05/05 11:40:07 INFO memory.MemoryStore: MemoryStore started with capacity 14.2 GB
17/05/05 11:40:07 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.121.135.150:42573
17/05/05 11:40:07 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/05/05 11:40:07 INFO executor.Executor: Starting executor ID 2 on host u10-121-135-150
17/05/05 11:40:07 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36225.
17/05/05 11:40:07 INFO netty.NettyBlockTransferService: Server created on u10-121-135-150:36225
17/05/05 11:40:07 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(2, u10-121-135-150, 36225)
17/05/05 11:40:07 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(2, u10-121-135-150, 36225)
17/05/05 11:40:10 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
17/05/05 11:40:10 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/05/05 11:40:10 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/05/05 11:40:10 INFO client.TransportClientFactory: Successfully created connection to /10.121.135.150:39980 after 2 ms (0 ms spent in bootstraps)
17/05/05 11:40:10 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.8 KB, free 14.2 GB)
17/05/05 11:40:10 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 306 ms
17/05/05 11:40:11 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.3 KB, free 14.2 GB)
2017-05-05 11:40:11,669 INFO (MainThread-46435) connected to server at ('u10-121-135-150', 43125)
2017-05-05 11:40:11,670 INFO (MainThread-46435) TFSparkNode.reserve: {'authkey': '\x0c!\xee\xc3Y1K\xee\xa1A\x1efI\xebz\xca', 'worker_num': 0, 'host': 'u10-121-135-150', 'tb_port': 0, 'addr': ('u10-121-135-150', 42646), 'ppid': 46414, 'task_index': 0, 'job_name': 'ps', 'tb_pid': 0, 'port': 44238}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': ('u10-121-135-150', 42646), 'task_index': 0, 'job_name': 'ps', 'authkey': '\x0c!\xee\xc3Y1K\xee\xa1A\x1efI\xebz\xca', 'worker_num': 0, 'host': 'u10-121-135-150', 'ppid': 46414, 'port': 44238, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': '/tmp/pymp-xTRaz/listener-zKctvL', 'task_index': 0, 'job_name': 'worker', 'authkey': '\x9e\xbd\xb8gJ\xc4@"\x93Q\x9bd\x8c\x85\x10S', 'worker_num': 1, 'host': 'u10-121-135-150', 'ppid': 46417, 'port': 42615, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': '/tmp/pymp-kXvJ0C/listener-J6qzjO', 'task_index': 1, 'job_name': 'worker', 'authkey': 'JhP\xff\xa2\xe2Nw\x9d\x02RG\x00 N]', 'worker_num': 2, 'host': 'u10-121-135-150', 'ppid': 46415, 'port': 42711, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': '/tmp/pymp-NPwGp/listener-YFUdU5', 'task_index': 2, 'job_name': 'worker', 'authkey': 'j):\x13\xf2\x00E\x80\x89\xa2\xe9\xd5\xa6*HI', 'worker_num': 3, 'host': 'u10-121-135-150', 'ppid': 46419, 'port': 40856, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:13,526 INFO (MainThread-46435) Starting TensorFlow ps:0 on cluster node 0 on background process
Process Process-2:
Traceback (most recent call last):
File "/home/hadoop/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/hadoop/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1493036076768_0112/container_1493036076768_0112_01_000003/pyfiles/VDSR.py", line 79, in map_fun
import tensorflow as tf
ImportError: No module named tensorflow
17/05/05 11:40:14 INFO executor.Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0)
2017-05-05 11:40:15,381 INFO (MainThread-46435) Got msg: None
2017-05-05 11:40:15,381 INFO (MainThread-46435) Terminating PS
17/05/05 11:40:15 WARN python.PythonRunner: Incomplete task interrupted: Attempting to kill Python Worker
17/05/05 11:40:15 INFO executor.Executor: Executor killed task 0.0 in stage 0.0 (TID 0)
17/05/05 17:19:36 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown

@bobo2001281
Author

But when I run python VDSR.py locally, no error occurs.

@leewyang
Contributor

leewyang commented May 5, 2017

It looks like you haven't installed tensorflow into your Python distribution that is shipped to the executors via --archives hdfs:///user/${USER}/Python.zip#Python, or you're not setting the following env vars:

export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"

@bobo2001281
Author

I installed TensorFlow in my local environment:
root@u10-121-135-150:~# python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally

How can I install tensorflow into my Python distribution?
It seems that TensorFlow is not present in the distributed Python environment.

hadoop@u10-121-135-150:~$ Python/bin/python
Python 2.7.12 (default, Apr 21 2017, 16:35:26)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named tensorflow

@bobo2001281
Author

bobo2001281 commented May 8, 2017

In the instructions at:
https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN#convert-the-mnist-zip-files-into-hdfs-files

Install and compile TensorFlow w/ RDMA Support

git clone git@github.com:yahoo/tensorflow.git

# For TensorFlow 0.12 w/ RDMA, checkout the 'yahoo' branch
# For TensorFlow 1.0 w/ RDMA, checkout the 'jun_r1.0' branch
# follow build instructions to install into ${PYTHON_ROOT}

Regarding the last line: how can I install TensorFlow into the Python distribution (${PYTHON_ROOT})?

@leewyang
Contributor

leewyang commented May 8, 2017

Actually, if you do not need RDMA support, you should be able to just run something like:
Python/bin/pip install tensorflow

If you need specific versions (e.g. Python 2.7 vs. 3.x, CPU/GPU, etc), you can adapt these instructions from TensorFlow
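A minimal sketch of the full loop, assuming the distribution lives in ./Python on the gateway and is shipped from HDFS as in the spark-submit command above (paths are illustrative):

    # install TensorFlow (and other dependencies, e.g. scipy) into the shipped distribution
    Python/bin/pip install tensorflow scipy
    # re-zip the distribution and overwrite the copy on HDFS that --archives points to
    zip -r Python.zip Python
    hadoop fs -put -f Python.zip /user/${USER}/Python.zip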

@bobo200128docker

bobo200128docker commented May 11, 2017

When I execute python VDSR.py locally, it takes about 10s
(I set BATCH_SIZE=256 and MAX_EPOCH=1).
When I run VDSR_spark.py (which calls VDSR.py, following the Conversion guide), the state stays RUNNING for a long time and never finishes.
There are no logs on HDFS under /app-logs.

@bobo200128docker

bobo200128docker commented May 11, 2017

2017-05-11 20:46:35,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from SCHEDULED to ALLOCATED_SAVING
2017-05-11 20:46:35,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from ALLOCATED_SAVING to ALLOCATED
2017-05-11 20:46:35,619 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1494227397798_0029_000001
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1494227397798_0029_01_000001, NodeId: u10-121-135-150:38391, NodeHttpAddress: u10-121-135-150:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.121.135.150:38391 }, ] for AM appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1494227397798_0029_01_000001 : LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH",{{JAVA_HOME}}/bin/java,-server,-Xmx1024m,-Djava.io.tmpdir={{PWD}}/tmp,-Dspark.yarn.app.container.log.dir=<LOG_DIR>,org.apache.spark.deploy.yarn.ApplicationMaster,--class,'org.apache.spark.deploy.PythonRunner',--primary-py-file,VDSR_spark.py,--arg,'--images',--arg,'hdfs:///user/hadoop/data/train',--arg,'--model',--arg,'./model_VDSR_me',--properties-file,{{PWD}}/spark_conf/spark_conf.properties,1>,<LOG_DIR>/stdout,2>,<LOG_DIR>/stderr
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,627 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1494227397798_0029_01_000001, NodeId: u10-121-135-150:38391, NodeHttpAddress: u10-121-135-150:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.121.135.150:38391 }, ] for AM appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,627 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from ALLOCATED to LAUNCHED
2017-05-11 20:46:36,618 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000001 Container Transitioned from ACQUIRED to RUNNING
2017-05-11 20:46:42,576 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1494227397798_0029_000001 (auth:SIMPLE)
2017-05-11 20:46:42,580 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1494227397798_0029_000001
2017-05-11 20:46:42,580 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop IP=10.121.135.150 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1494227397798_0029 APPATTEMPTID=appattempt_1494227397798_0029_000001
2017-05-11 20:46:42,580 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from LAUNCHED to RUNNING
2017-05-11 20:46:42,581 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1494227397798_0029 State change from ACCEPTED to RUNNING
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000002 Container Transitioned from NEW to ALLOCATED
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1494227397798_0029 CONTAINERID=container_1494227397798_0029_01_000002
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1494227397798_0029_01_000002 of capacity <memory:30720, vCores:1> on host u10-121-135-152:60402, which has 5 containers, <memory:124928, vCores:5> used and <memory:6144, vCores:3> available after allocation
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1494227397798_0029_000001 container=Container: [ContainerId: container_1494227397798_0029_01_000002, NodeId: u10-121-135-152:60402, NodeHttpAddress: u10-121-135-152:8042, Resource: <memory:30720, vCores:1>, Priority: 1, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:190464, vCores:9>, usedCapacity=0.7265625, absoluteUsedCapacity=0.7265625, numApps=3, numContainers=9 clusterResource=<memory:262144, vCores:16>
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:221184, vCores:10>, usedCapacity=0.84375, absoluteUsedCapacity=0.84375, numApps=3, numContainers=10
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.84375 absoluteUsedCapacity=0.84375 used=<memory:221184, vCores:10> cluster=<memory:262144, vCores:16>
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000003 Container Transitioned from NEW to ALLOCATED
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1494227397798_0029 CONTAINERID=container_1494227397798_0029_01_000003
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1494227397798_0029_01_000003 of capacity <memory:30720, vCores:1> on host u10-121-135-150:38391, which has 6 containers, <memory:126976, vCores:6> used and <memory:4096, vCores:2> available after allocation
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1494227397798_0029_000001 container=Container: [ContainerId: container_1494227397798_0029_01_000003, NodeId: u10-121-135-150:38391, NodeHttpAddress: u10-121-135-150:8042, Resource: <memory:30720, vCores:1>, Priority: 1, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:221184, vCores:10>, usedCapacity=0.84375, absoluteUsedCapacity=0.84375, numApps=3, numContainers=10 clusterResource=<memory:262144, vCores:16>
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:251904, vCores:11>, usedCapacity=0.9609375, absoluteUsedCapacity=0.9609375, numApps=3, numContainers=11
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.9609375 absoluteUsedCapacity=0.9609375 used=<memory:251904, vCores:11> cluster=<memory:262144, vCores:16>
2017-05-11 20:46:44,050 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : u10-121-135-152:60402 for container : container_1494227397798_0029_01_000002
2017-05-11 20:46:44,051 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2017-05-11 20:46:44,051 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : u10-121-135-150:38391 for container : container_1494227397798_0029_01_000003
2017-05-11 20:46:44,052 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000003 Container Transitioned from ALLOCATED to ACQUIRED
2017-05-11 20:46:44,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000002 Container Transitioned from ACQUIRED to RUNNING
2017-05-11 20:46:44,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000003 Container Transitioned from ACQUIRED to RUNNING

@leewyang
Contributor

Those logs don't reveal much... Please grab the yarn application logs and look for errors on the executors.

@bobo200128docker

If I cancel (Ctrl+C) the application while it is running, where will the application log be saved?

@leewyang
Contributor

You will need to do the following:

yarn application -kill <your_applicationId>
yarn logs -applicationId <your_applicationId>   >yarn.log

@bobo200128docker

hadoop@u10-121-135-150:~$ Python/bin/python
Python 2.7.12 (default, Apr 21 2017, 16:35:26)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow as tf
tf.sub(None,None)
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'module' object has no attribute 'sub'
tf.__version__
'1.0.1'

[4]+ Stopped Python/bin/python
hadoop@u10-121-135-150:~$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
tf.__version__
'0.12.1'

What is the difference between the two TensorFlow installations above?
The situation now, when I try VDSR.py:
'0.12.1' is in the local environment and has the function tf.sub();
'1.0.1' is on the executors and reports the error "AttributeError: 'module' object has no attribute 'sub'".
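For reference, tf.sub was removed from the public API in TensorFlow 1.0 and renamed tf.subtract, so code written for 0.12 needs updating before it will run against the 1.0.1 install on the executors. A minimal sketch of the change (output and target are hypothetical tensor names):

    # TensorFlow 0.12
    diff = tf.sub(output, target)
    # TensorFlow 1.0+
    diff = tf.subtract(output, target)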

@bobo200128docker

bobo200128docker commented May 13, 2017

I have installed scipy into the Python distribution that is shipped to the Spark executors.
Why does "ImportError: No module named scipy.io" still occur in the application log?

@leewyang
Contributor

A couple notes:

  • Python/bin/python refers to the custom python distribution that we create and zip up to ship to the executors. We do this because users often don't have control of the python version/packages on the executors.
  • python on the gateway node is just the local installation of python, which will not be distributed to the executors. If you install any dependencies into this local/gateway installation, the dependencies will not be automatically installed on the executors.
  • The TensorFlow APIs changed fairly significantly between 0.12 and 1.0. They have a migration script to help you update your code (see the sketch below), but your code will not be cross-compatible between these two versions.

So, with that all said, I'd recommend picking the versions of TensorFlow and Python that you wish to move forward with, then create a custom python distribution (with all necessary dependencies), and then use ONLY this distribution to test your code going forward (for "local" and "distributed" testing).
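Regarding the migration script mentioned above: for the 0.12 to 1.0 transition, TensorFlow shipped a tf_upgrade.py tool, and a typical invocation looks something like the following (file names are hypothetical):

    python tf_upgrade.py --infile VDSR.py --outfile VDSR_tf1.py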

leewyang closed this as completed Jun 8, 2017