
I want to train VDSR on TensorFlowOnSpark; which data format should I use? #74

Closed
bobo2001281 opened this issue Apr 29, 2017 · 22 comments

@bobo2001281

The original source is here:
https://github.com/Jongchan/tensorflow-vdsr

1. The data is a MATLAB 5.0 MAT-file. Following https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion, I changed the code, but only on the assumption that this data format is valid for TensorFlowOnSpark to use on HDFS.
2. Do I need to convert the data to TFRecords?

@leewyang
Contributor

leewyang commented May 1, 2017

@bobo2001281 You have several options:

  1. If your dataset fits into memory, you can just use any existing code to load it into the memory of each executor and train as usual. In this case, you'll need to ship the MAT file to the executors, much like the way we ship the "mnist.zip" file in the data conversion example (see the sketch after this list).
  2. If your dataset doesn't fit into memory, you'll need a way to split it into files that are easily "readable" by either Spark (e.g. sc.textFile() or sc.sequenceFile()) or TensorFlow (e.g. TFRecord)
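A minimal sketch of option 1, assuming the MAT file has already been shipped to each executor and extracted into a folder in the container's working directory (the archive, folder, and file names are hypothetical):

    def map_fun(args, ctx):
        # Runs on each Spark executor. The archive named after '#' in --archives
        # is extracted into the container's current working dir, so relative paths work.
        import scipy.io
        import tensorflow as tf

        data = scipy.io.loadmat("vdsr_data/train.mat")  # hypothetical folder/file names
        # ... build the VDSR graph and train on `data` as usual ...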

@bobo2001281
Author

Regarding the first option:
mnist_data_setup.py differs depending on what the data is.
The data in the other examples (cifar10, imagenet, slim) is downloaded directly from GitHub, following the README.md.

How can I ship my data? There does not seem to be any label data in my dataset.

@leewyang
Contributor

leewyang commented May 2, 2017

Shipping the data to the executors can be done with the --archives option. So, for the data conversion example, specifying --archives mnist/mnist.zip#mnist tells Spark to copy the mnist.zip file to each executor and extract it into a folder named mnist in the executor's current working dir.
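For example (hypothetical archive and folder names), the VDSR MAT data could be shipped the same way:

    --archives hdfs:///user/${USER}/vdsr_data.zip#vdsr_data

so that code on each executor can read it with a relative path such as vdsr_data/train.mat.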

@bobo2001281
Author

Now:

  1. I put the data on HDFS with:
    hadoop fs -put ./data
  2. I modified VDSR.py and added VDSR_spark.py.
  3. I submitted the job with:
    ${SPARK_HOME}/bin/spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --queue ${QUEUE} \
      --num-executors 4 \
      --executor-memory 27G \
      --py-files TensorFlow_VDSR/tfspark.zip,TensorFlow_VDSR/VDSR.py \
      --conf spark.dynamicAllocation.enabled=false \
      --conf spark.yarn.maxAppAttempts=1 \
      --archives hdfs:///user/${USER}/Python.zip#Python \
      --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
      TensorFlow_VDSR/VDSR_spark.py \
      --images hdfs:///user/${USER}/data/train \
      --model_path ./model_VDSR_me
    and an error occurred:

17/05/03 16:39:30 INFO yarn.Client: Application report for application_1493036076768_0092 (state: ACCEPTED)
17/05/03 16:39:31 INFO yarn.Client: Application report for application_1493036076768_0092 (state: FAILED)
17/05/03 16:39:31 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1493036076768_0092 failed 1 times due to AM Container for appattempt_1493036076768_0092_000001 exited with exitCode: 1
For more detailed output, check application tracking page:http://u10-121-135-150:8088/cluster/app/application_1493036076768_0092Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1493036076768_0092_01_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1493800746814
final status: FAILED
tracking URL: http://u10-121-135-150:8088/cluster/app/application_1493036076768_0092
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1493036076768_0092 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/05/03 16:39:31 INFO util.ShutdownHookManager: Shutdown hook called
17/05/03 16:39:31 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b8f69730-babb-4b65-93e2-8bbc317bcebc

@leewyang
Contributor

leewyang commented May 4, 2017

Please grab the yarn logs via: yarn logs -applicationId application_1493036076768_0092 and search for any errors/exceptions.

@bobo2001281
Author

hadoop@u10-121-135-150:~/hadoop-2.7.1/logs$ yarn logs -applicationId application_1493036076768_0106
/tmp/logs/hadoop/logs/application_1493036076768_0106 does not exist.
Log aggregation has not completed or is not enabled.

The log does not exist, either while spark-submit is running or after the command has finished.

@bobo2001281
Author

bobo2001281 commented May 5, 2017

I modified yarn-site.xml as below, and now I can see my logs.

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/app-logs</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
    <value>logs</value>
  </property>

@bobo2001281
Author

bobo2001281 commented May 5, 2017

😢
How do I attach a file?

@bobo2001281
Author

17/05/05 11:40:07 INFO memory.MemoryStore: MemoryStore started with capacity 14.2 GB
17/05/05 11:40:07 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.121.135.150:42573
17/05/05 11:40:07 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
17/05/05 11:40:07 INFO executor.Executor: Starting executor ID 2 on host u10-121-135-150
17/05/05 11:40:07 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36225.
17/05/05 11:40:07 INFO netty.NettyBlockTransferService: Server created on u10-121-135-150:36225
17/05/05 11:40:07 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(2, u10-121-135-150, 36225)
17/05/05 11:40:07 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(2, u10-121-135-150, 36225)
17/05/05 11:40:10 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
17/05/05 11:40:10 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/05/05 11:40:10 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
17/05/05 11:40:10 INFO client.TransportClientFactory: Successfully created connection to /10.121.135.150:39980 after 2 ms (0 ms spent in bootstraps)
17/05/05 11:40:10 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.8 KB, free 14.2 GB)
17/05/05 11:40:10 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 306 ms
17/05/05 11:40:11 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 11.3 KB, free 14.2 GB)
2017-05-05 11:40:11,669 INFO (MainThread-46435) connected to server at ('u10-121-135-150', 43125)
2017-05-05 11:40:11,670 INFO (MainThread-46435) TFSparkNode.reserve: {'authkey': '\x0c!\xee\xc3Y1K\xee\xa1A\x1efI\xebz\xca', 'worker_num': 0, 'host': 'u10-121-135-150', 'tb_port': 0, 'addr': ('u10-121-135-150', 42646), 'ppid': 46414, 'task_index': 0, 'job_name': 'ps', 'tb_pid': 0, 'port': 44238}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': ('u10-121-135-150', 42646), 'task_index': 0, 'job_name': 'ps', 'authkey': '\x0c!\xee\xc3Y1K\xee\xa1A\x1efI\xebz\xca', 'worker_num': 0, 'host': 'u10-121-135-150', 'ppid': 46414, 'port': 44238, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': '/tmp/pymp-xTRaz/listener-zKctvL', 'task_index': 0, 'job_name': 'worker', 'authkey': '\x9e\xbd\xb8gJ\xc4@"\x93Q\x9bd\x8c\x85\x10S', 'worker_num': 1, 'host': 'u10-121-135-150', 'ppid': 46417, 'port': 42615, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': '/tmp/pymp-kXvJ0C/listener-J6qzjO', 'task_index': 1, 'job_name': 'worker', 'authkey': 'JhP\xff\xa2\xe2Nw\x9d\x02RG\x00 N]', 'worker_num': 2, 'host': 'u10-121-135-150', 'ppid': 46415, 'port': 42711, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:12,678 INFO (MainThread-46435) node: {'addr': '/tmp/pymp-NPwGp/listener-YFUdU5', 'task_index': 2, 'job_name': 'worker', 'authkey': 'j):\x13\xf2\x00E\x80\x89\xa2\xe9\xd5\xa6*HI', 'worker_num': 3, 'host': 'u10-121-135-150', 'ppid': 46419, 'port': 40856, 'tb_pid': 0, 'tb_port': 0}
2017-05-05 11:40:13,526 INFO (MainThread-46435) Starting TensorFlow ps:0 on cluster node 0 on background process
Process Process-2:
Traceback (most recent call last):
File "/home/hadoop/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/hadoop/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/home/hadoop/hadoopSpace/tmp/nm-local-dir/usercache/hadoop/appcache/application_1493036076768_0112/container_1493036076768_0112_01_000003/pyfiles/VDSR.py", line 79, in map_fun
import tensorflow as tf
ImportError: No module named tensorflow
17/05/05 11:40:14 INFO executor.Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0)
2017-05-05 11:40:15,381 INFO (MainThread-46435) Got msg: None
2017-05-05 11:40:15,381 INFO (MainThread-46435) Terminating PS
17/05/05 11:40:15 WARN python.PythonRunner: Incomplete task interrupted: Attempting to kill Python Worker
17/05/05 11:40:15 INFO executor.Executor: Executor killed task 0.0 in stage 0.0 (TID 0)
17/05/05 17:19:36 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown

@bobo2001281
Author

But when I run python VDSR.py locally, no error occurs.

@leewyang
Contributor

leewyang commented May 5, 2017

It looks like you haven't installed tensorflow into your Python distribution that is shipped to the executors via --archives hdfs:///user/${USER}/Python.zip#Python, or you're not setting the following env vars:

export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"

@bobo2001281
Author

I installed TensorFlow in my local environment:
root@u10-121-135-150:~# python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally

How can I install tensorflow into my Python distribution?
It seems that TensorFlow is not present in the distributed Python environment.

hadoop@u10-121-135-150:~$ Python/bin/python
Python 2.7.12 (default, Apr 21 2017, 16:35:26)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named tensorflow

@bobo2001281
Author

bobo2001281 commented May 8, 2017

In the instructions at:
https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN#convert-the-mnist-zip-files-into-hdfs-files

Install and compile TensorFlow w/ RDMA Support

git clone git@github.com:yahoo/tensorflow.git

# For TensorFlow 0.12 w/ RDMA, checkout the 'yahoo' branch
# For TensorFlow 1.0 w/ RDMA, checkout the 'jun_r1.0' branch
# follow build instructions to install into ${PYTHON_ROOT}

Regarding the last line: how can I install TensorFlow into the Python distribution (${PYTHON_ROOT})?

@leewyang
Contributor

leewyang commented May 8, 2017

Actually, if you do not need RDMA support, you should be able to just run something like:
Python/bin/pip install tensorflow

If you need specific versions (e.g. Python 2.7 vs. 3.x, CPU/GPU, etc), you can adapt these instructions from TensorFlow
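A minimal sketch of the full loop, assuming the distribution lives in ./Python on the gateway and is shipped from HDFS as in the spark-submit command above (paths are illustrative):

    # install TensorFlow (and other dependencies, e.g. scipy) into the shipped distribution
    Python/bin/pip install tensorflow scipy
    # re-zip the distribution and overwrite the copy on HDFS that --archives points to
    zip -r Python.zip Python
    hadoop fs -put -f Python.zip /user/${USER}/Python.zip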

@bobo200128docker

bobo200128docker commented May 11, 2017

When I execute python VDSR.py locally, it takes about 10s
(I set BATCH_SIZE=256 and MAX_EPOCH=1).
When I run VDSR_spark.py (which calls VDSR.py, following the Conversion guide), the state stays RUNNING for a long time and never finishes.
There are no logs on HDFS under /app-logs.

@bobo200128docker

bobo200128docker commented May 11, 2017

2017-05-11 20:46:35,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from SCHEDULED to ALLOCATED_SAVING
2017-05-11 20:46:35,619 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from ALLOCATED_SAVING to ALLOCATED
2017-05-11 20:46:35,619 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1494227397798_0029_000001
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1494227397798_0029_01_000001, NodeId: u10-121-135-150:38391, NodeHttpAddress: u10-121-135-150:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.121.135.150:38391 }, ] for AM appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Command to launch container container_1494227397798_0029_01_000001 : LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH",{{JAVA_HOME}}/bin/java,-server,-Xmx1024m,-Djava.io.tmpdir={{PWD}}/tmp,-Dspark.yarn.app.container.log.dir=<LOG_DIR>,org.apache.spark.deploy.yarn.ApplicationMaster,--class,'org.apache.spark.deploy.PythonRunner',--primary-py-file,VDSR_spark.py,--arg,'--images',--arg,'hdfs:///user/hadoop/data/train',--arg,'--model',--arg,'./model_VDSR_me',--properties-file,{{PWD}}/spark_conf/spark_conf.properties,1>,<LOG_DIR>/stdout,2>,<LOG_DIR>/stderr
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,620 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,627 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1494227397798_0029_01_000001, NodeId: u10-121-135-150:38391, NodeHttpAddress: u10-121-135-150:8042, Resource: <memory:2048, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.121.135.150:38391 }, ] for AM appattempt_1494227397798_0029_000001
2017-05-11 20:46:35,627 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from ALLOCATED to LAUNCHED
2017-05-11 20:46:36,618 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000001 Container Transitioned from ACQUIRED to RUNNING
2017-05-11 20:46:42,576 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1494227397798_0029_000001 (auth:SIMPLE)
2017-05-11 20:46:42,580 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AM registration appattempt_1494227397798_0029_000001
2017-05-11 20:46:42,580 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop IP=10.121.135.150 OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1494227397798_0029 APPATTEMPTID=appattempt_1494227397798_0029_000001
2017-05-11 20:46:42,580 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1494227397798_0029_000001 State change from LAUNCHED to RUNNING
2017-05-11 20:46:42,581 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1494227397798_0029 State change from ACCEPTED to RUNNING
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000002 Container Transitioned from NEW to ALLOCATED
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1494227397798_0029 CONTAINERID=container_1494227397798_0029_01_000002
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1494227397798_0029_01_000002 of capacity <memory:30720, vCores:1> on host u10-121-135-152:60402, which has 5 containers, <memory:124928, vCores:5> used and <memory:6144, vCores:3> available after allocation
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1494227397798_0029_000001 container=Container: [ContainerId: container_1494227397798_0029_01_000002, NodeId: u10-121-135-152:60402, NodeHttpAddress: u10-121-135-152:8042, Resource: <memory:30720, vCores:1>, Priority: 1, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:190464, vCores:9>, usedCapacity=0.7265625, absoluteUsedCapacity=0.7265625, numApps=3, numContainers=9 clusterResource=<memory:262144, vCores:16>
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:221184, vCores:10>, usedCapacity=0.84375, absoluteUsedCapacity=0.84375, numApps=3, numContainers=10
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.84375 absoluteUsedCapacity=0.84375 used=<memory:221184, vCores:10> cluster=<memory:262144, vCores:16>
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000003 Container Transitioned from NEW to ALLOCATED
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1494227397798_0029 CONTAINERID=container_1494227397798_0029_01_000003
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1494227397798_0029_01_000003 of capacity <memory:30720, vCores:1> on host u10-121-135-150:38391, which has 6 containers, <memory:126976, vCores:6> used and <memory:4096, vCores:2> available after allocation
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignedContainer application attempt=appattempt_1494227397798_0029_000001 container=Container: [ContainerId: container_1494227397798_0029_01_000003, NodeId: u10-121-135-150:38391, NodeHttpAddress: u10-121-135-150:8042, Resource: <memory:30720, vCores:1>, Priority: 1, Token: null, ] queue=default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:221184, vCores:10>, usedCapacity=0.84375, absoluteUsedCapacity=0.84375, numApps=3, numContainers=10 clusterResource=<memory:262144, vCores:16>
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Re-sorting assigned queue: root.default stats: default: capacity=1.0, absoluteCapacity=1.0, usedResources=<memory:251904, vCores:11>, usedCapacity=0.9609375, absoluteUsedCapacity=0.9609375, numApps=3, numContainers=11
2017-05-11 20:46:43,622 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.9609375 absoluteUsedCapacity=0.9609375 used=<memory:251904, vCores:11> cluster=<memory:262144, vCores:16>
2017-05-11 20:46:44,050 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : u10-121-135-152:60402 for container : container_1494227397798_0029_01_000002
2017-05-11 20:46:44,051 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000002 Container Transitioned from ALLOCATED to ACQUIRED
2017-05-11 20:46:44,051 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Sending NMToken for nodeId : u10-121-135-150:38391 for container : container_1494227397798_0029_01_000003
2017-05-11 20:46:44,052 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000003 Container Transitioned from ALLOCATED to ACQUIRED
2017-05-11 20:46:44,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000002 Container Transitioned from ACQUIRED to RUNNING
2017-05-11 20:46:44,622 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1494227397798_0029_01_000003 Container Transitioned from ACQUIRED to RUNNING

@leewyang
Contributor

Those logs don't reveal much... Please grab the yarn application logs and look for errors on the executors.

@bobo200128docker

If I cancel (Ctrl+C) the application while it is running, where will the application log be saved?

@leewyang
Contributor

You will need to do the following:

yarn application -kill <your_applicationId>
yarn logs -applicationId <your_applicationId>   >yarn.log

@bobo200128docker

hadoop@u10-121-135-150:~$ Python/bin/python
Python 2.7.12 (default, Apr 21 2017, 16:35:26)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow as tf
tf.sub(None,None)
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'module' object has no attribute 'sub'
tf.__version__
'1.0.1'

[4]+ Stopped Python/bin/python
hadoop@u10-121-135-150:~$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
tf.__version__
'0.12.1'

What is the difference between the two TensorFlow installations above?
The situation now, when I try VDSR.py:
'0.12.1' is in the local environment and has the function tf.sub();
'1.0.1' is on the executors and reports the error "AttributeError: 'module' object has no attribute 'sub'".
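For reference, tf.sub was removed from the public API in TensorFlow 1.0 and renamed tf.subtract, so code written for 0.12 needs updating before it will run against the 1.0.1 install on the executors. A minimal sketch of the change (output and target are hypothetical tensor names):

    # TensorFlow 0.12
    diff = tf.sub(output, target)
    # TensorFlow 1.0+
    diff = tf.subtract(output, target)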

@bobo200128docker

bobo200128docker commented May 13, 2017

I have installed scipy into the Python distribution that is shipped to the Spark executors.
Why does "ImportError: No module named scipy.io" still occur in the application log?

@leewyang
Contributor

A couple notes:

  • Python/bin/python refers to the custom python distribution that we create and zip up to ship to the executors. We do this because users often don't have control of the python version/packages on the executors.
  • python on the gateway node is just the local installation of python, which will not be distributed to the executors. If you install any dependencies into this local/gateway installation, the dependencies will not be automatically installed on the executors.
  • The TensorFlow APIs changed fairly significantly between 0.12 and 1.0. They have a migration script to help you update your code (see the sketch below), but your code will not be cross-compatible between these two versions.

So, with that all said, I'd recommend picking the versions of TensorFlow and Python that you wish to move forward with, then create a custom python distribution (with all necessary dependencies), and then use ONLY this distribution to test your code going forward (for "local" and "distributed" testing).
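Regarding the migration script mentioned above: for the 0.12 to 1.0 transition, TensorFlow shipped a tf_upgrade.py tool, and a typical invocation looks something like the following (file names are hypothetical):

    python tf_upgrade.py --infile VDSR.py --outfile VDSR_tf1.py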

leewyang closed this as completed Jun 8, 2017