How to run the benchmark in distributed mode? #65
I always started the worker first and then started the PS with CUDA_VISIBLE_DEVICES= prefixed. |
Right, if I start the worker first, then the PS will also show an OOM error. Will CUDA_VISIBLE_DEVICES disable the GPU devices for the PS? Btw, if this is required, can someone update the official guide? |
I forgot to update it with the CUDA_VISIBLE_DEVICES prefix. I made some other mistakes as well in copying the commands I used to run the benchmark; I use a wrapper that manages the args for me when running tests, and I was not careful enough when typing them out by hand. I will try to find time to update the information and get it pushed out. Pushes to the website can take a long time; I will see if I can make the change and speed up the publish.
|
@yupeng9 |
@tfboyd thanks for the information. Since pushing to the website can take a while, do you mind posting the instructions here once you have them? I took a look at the current guide. More importantly, the TensorFlow website publishes useful results from this benchmark, so it has great reference value. |
+1. Any update on the latest doc for the distributed training steps? Thanks. |
I doubt I will update the web page anytime soon. I must have been in a hurry when I typed up that page; I also use my own testing harness that builds the commands, and I likely failed to copy and paste my exact commands from the logs. I did test what is likely the most recent code on AWS two weeks ago and everything seemed fine with TF 1.4. It was a very small test with 2x p2.8xlarge instances. I would suggest people not use this code unless they are going to write their own distributed or multi-GPU setup and can understand the variable management aspects. We use this code to test new ideas and a lot of different variations that are not matrix tested, meaning option A may not even work with option D, and that will not be documented. I am putting all of my time into helping the team get clean examples published with known accuracy numbers over the next few months. |
As @ppwwyyxx stated, when running the parameter servers on the same hosts as the workers, you should prefix the parameter server commands with CUDA_VISIBLE_DEVICES= so the PS processes do not see the GPUs.
I'm currently blocked by this issue, but once I have time I can update the README (and the website, once I figure out how) with the updated commands. |
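Concretely, a minimal sketch of that prefix on host_0, reusing the two-host flags that appear later in this thread (the IPs and ports are that example's, not a requirement):

# Parameter server on host_0: the empty CUDA_VISIBLE_DEVICES hides all GPUs,
# so this process allocates no GPU memory.
CUDA_VISIBLE_DEVICES= python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --job_name=ps \
  --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Worker on host_0: launched without the prefix, so it sees all four GPUs.
python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --job_name=worker \
  --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0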
@reedwm How about setting CUDA_VISIBLE_DEVICES for each worker, so that every process only sees the GPUs it should use? |
In the above example, each worker is on a separate machine, since they have different IP addresses (10.0.0.1 and 10.0.0.2). So they will each have their own set of 8 GPUs, and CUDA_VISIBLE_DEVICES does not need to be set. If multiple worker processes are run on the same machine, your strategy of setting CUDA_VISIBLE_DEVICES is the way to go. |
Yep! I know the trick of setting CUDA_VISIBLE_DEVICES. So on each machine, the worker gets access to both GPUs while the PS gets none, right? |
Yep, that is correct. On each machine, the worker will have access to both GPUs, and the parameter server will not, since its CUDA_VISIBLE_DEVICES is empty. |
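To make the multiple-workers-per-machine case concrete, a sketch of the partitioning (the GPU indices, the second worker port 50002, and the single ps are all illustrative assumptions):

# Worker 0 on this machine sees only GPUs 0 and 1:
CUDA_VISIBLE_DEVICES=0,1 python3 tf_cnn_benchmarks.py --num_gpus=2 --job_name=worker --task_index=0 \
  --ps_hosts=10.0.0.1:50000 --worker_hosts=10.0.0.1:50001,10.0.0.1:50002 \
  --model=resnet50 --variable_update=distributed_replicated

# Worker 1 on the same machine sees only GPUs 2 and 3:
CUDA_VISIBLE_DEVICES=2,3 python3 tf_cnn_benchmarks.py --num_gpus=2 --job_name=worker --task_index=1 \
  --ps_hosts=10.0.0.1:50000 --worker_hosts=10.0.0.1:50001,10.0.0.1:50002 \
  --model=resnet50 --variable_update=distributed_replicated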
@reedwm question about the start order. for example, in the same hostA, once run above cmd to start worker, the shall would not return, instead it keeps running, e.g. trying to start the session. so when shall I start PS? any strict order required. |
@Zhaojp-Frank Generally speaking, you had better launch the PS processes before worker 0. If no PS is running, worker 0 will throw an error. |
You should be able to launch the processes in any order. @DjangoPeng, what are the commands you use that sometimes cause an error? |
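For the shell-does-not-return question above, one sketch is to background the PS with & and drive both processes from one terminal (flags copied from the commands in this thread; adjust hosts and ports for your setup):

# Start the parameter server in the background, GPUs hidden as discussed above:
CUDA_VISIBLE_DEVICES= python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --job_name=ps \
  --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0 &
PS_PID=$!

# Run the worker in the foreground; it blocks until the benchmark finishes:
python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
  --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --job_name=worker \
  --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# The PS never exits on its own, so kill it once the worker is done:
kill $PS_PID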
Do we have to kill the parameter servers manually when the job is done? |
@abidmalik1967, yes. |
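A minimal sketch of that manual cleanup (the pattern simply matches the ps command lines used in this thread):

pkill -f 'tf_cnn_benchmarks.py.*--job_name=ps'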
Hi @reedwm / @tfboyd, I am running the benchmarks on a multi-node system (2 hosts, each with 4 GPUs) following the instructions at https://www.tensorflow.org/performance/performance_models#executing_the_script, but I am getting errors (note that I replaced python with python3 and used --num_gpus=4 for each host).

Run the following commands on host_0 (10.0.0.1):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

Run the following commands on host_1 (10.0.0.2):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 --worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

When the first (worker) command runs, it finds the four Tesla P100 GPUs, starts its gRPC server, prints the benchmark configuration and "Generating model", then waits and finally fails; each host shows the same error (host_0 output below, host_1 similar):

2018-05-15 18:30:13.954717: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:333] Started server with target: grpc://localhost:50001
2018-05-15 18:30:39.134526: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-05-15 18:30:39.134602: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-05-15 18:30:39.134617: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
[the three "CreateSession still waiting" lines repeat every 10 seconds for about two minutes]
2018-05-15 18:32:37.369403: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
[Python traceback elided; it ends with]
tensorflow.python.framework.errors_impl.UnavailableError: OS Error

When the second (ps) command runs, it prints the training info and then only the line below, producing no further output; the processes appear to hang on each host:

Running parameter server 0 |
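One thing worth checking for that "Master init: Unavailable: OS Error" (my suggestion; not something established in this thread): confirm that each host can actually reach the other's ps and worker ports before launching, e.g. with netcat from host_0:

nc -vz 10.0.0.2 50000   # host_1's ps port
nc -vz 10.0.0.2 50001   # host_1's worker port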
If you want to try distributed learning, try Horovod: https://github.com/uber/horovod. It's much cleaner and gives better performance. |
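For reference, a minimal sketch of such a run for the same two-host, four-GPU-per-host setup (this assumes Horovod is installed on both hosts and that train.py is a hypothetical Horovod-enabled training script; the horovodrun launcher ships with recent Horovod releases):

horovodrun -np 8 -H 10.0.0.1:4,10.0.0.2:4 python3 train.py

There are no parameter servers in this model; Horovod averages gradients with allreduce, so there is also no leftover PS process to kill.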
Hi,
I followed the instructions from the performance page (https://www.tensorflow.org/performance/performance_models) and ran on two EC2 p2.8xlarge instances, using the same benchmark hash (Benchmark GitHub hash: 9165a70).
However, the worker failed with an out-of-memory (OOM) error.
It seems each TF process allocates all of the available GPU memory, so the worker cannot get any memory if I start the parameter server command first.
Likewise, if I run the worker first, then the parameter server cannot get any memory.
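A quick way to observe this behavior (a sketch; these nvidia-smi query flags are standard): watch GPU memory while the first process starts, and a single TensorFlow process will show near-total memory use on every GPU it can see.

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 2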