
How to run the benchmark in the distributed mode? #65

Open
yupeng9 opened this issue Oct 12, 2017 · 20 comments

yupeng9 commented Oct 12, 2017

Hi,

I followed the instructions from the [performance page](https://www.tensorflow.org/performance/performance_models) and ran the benchmark on two EC2 p2.8xlarge instances, using the same benchmark hash (benchmark GitHub hash: 9165a70).

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

However, the worker failed with:

Generating model
save variable global_step:0
save variable ps_var/v0/conv0/conv2d/kernel:0
save variable ps_var/v0/conv0/biases:0
save variable ps_var/v0/conv1/conv2d/kernel:0
save variable ps_var/v0/conv1/biases:0
save variable ps_var/v0/conv2/conv2d/kernel:0
save variable ps_var/v0/conv2/biases:0
save variable ps_var/v0/conv3/conv2d/kernel:0
save variable ps_var/v0/conv3/biases:0
save variable ps_var/v0/conv4/conv2d/kernel:0
save variable ps_var/v0/conv4/biases:0
save variable ps_var/v0/affine0/weights:0
save variable ps_var/v0/affine0/biases:0
save variable ps_var/v0/affine1/weights:0
save variable ps_var/v0/affine1/biases:0
save variable ps_var/v0/affine2/weights:0
save variable ps_var/v0/affine2/biases:0
Traceback (most recent call last):
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1096, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1092, in main
    bench.run()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 573, in run
    self._benchmark_cnn()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 674, in _benchmark_cnn
    start_standard_services=start_standard_services) as sess:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,3,384,256]
         [[Node: v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@v0/conv4/conv2d/kernel"], dtype=DT_FLOAT, seed=1234, seed2=132, _device="/job:worker/replica:0/task:0/gpu:0"](v0/conv4/conv2d/kernel/Initializer/random_uniform/shape)]]
         [[Node: v0/conv2/biases/Initializer/Const_S21 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/gpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=-2694717678558735913, tensor_name="edge_53_v0/conv2/biases/Initializer/Const", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/gpu:0"]()]]

Caused by op u'v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform', defined at:
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1096, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1092, in main
    bench.run()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 573, in run
    self._benchmark_cnn()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 620, in _benchmark_cnn
    (enqueue_ops, fetches) = self._build_model()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 791, in _build_model
    gpu_grad_stage_ops)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 952, in add_forward_pass_and_gradients
    self.model.add_inference(network)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/alexnet_model.py", line 42, in add_inference
    cnn.conv(256, 3, 3)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 103, in conv
    use_bias=False)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 551, in conv2d
    return layer.apply(inputs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 503, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 443, in __call__
    self.build(input_shapes[0])
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 137, in build
    dtype=self.dtype)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 383, in add_variable
    trainable=trainable and self.trainable)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 360, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 84, in __call__
    return getter(name, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 199, in __init__
    expected_shape=expected_shape)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 277, in _init_from_args
    initial_value(), name="initial_value", dtype=dtype)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 701, in <lambda>
    shape.as_list(), dtype=dtype, partition_info=partition_info)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 441, in __call__
    dtype, seed=self.seed)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/random_ops.py", line 240, in random_uniform
    shape, dtype, seed=seed1, seed2=seed2)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 247, in _random_uniform
    seed=seed, seed2=seed2, name=name)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,384,256]
         [[Node: v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@v0/conv4/conv2d/kernel"], dtype=DT_FLOAT, seed=1234, seed2=132, _device="/job:worker/replica:0/task:0/gpu:0"](v0/conv4/conv2d/kernel/Initializer/random_uniform/shape)]]
         [[Node: v0/conv2/biases/Initializer/Const_S21 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/gpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=-2694717678558735913, tensor_name="edge_53_v0/conv2/biases/Initializer/Const", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/gpu:0"]()]]

It seems each TF process allocates all of the available GPU memory, so the worker cannot get any memory if I start the parameter server command first.

Likewise, if I start the worker first, then the parameter server cannot get any memory.
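(For reference, TensorFlow's default behavior is indeed to map nearly all of each visible GPU's memory in every process. Below is a minimal Python sketch of the TF 1.x session options that govern this; tf_cnn_benchmarks does not necessarily expose these options, so treat it as background rather than a fix.)

import tensorflow as tf

# Background sketch (TF 1.x): by default a session maps almost all GPU memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead
# config.gpu_options.per_process_gpu_memory_fraction = 0.5  # or cap the fraction per process
sess = tf.Session(config=config)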

@ppwwyyxx (Contributor)

I always started the worker first and then started PS with CUDA_VISIBLE_DEVICES=
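In shell form, that tip amounts to hiding the GPUs from only the PS process on each host, e.g. (other flags elided, a sketch rather than the exact commands):

python tf_cnn_benchmarks.py --job_name=worker ...                      # start the worker first
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --job_name=ps ...    # PS sees no GPUs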

yupeng9 (Author) commented Oct 12, 2017

Right, if I start the worker first, then the PS also shows an OOM error. Will CUDA_VISIBLE_DEVICES= hide the GPU devices from the PS?

By the way, if this is required, can someone update the official guide?

tfboyd (Member) commented Oct 13, 2017 via email

tfboyd (Member) commented Oct 13, 2017

@yupeng9
@yupeng9
If you are doing distributed TensorFlow on just a few servers, I would check out this example, which includes TensorBoard output and other nice features like automatic evaluation. Or you could try the Uber project, which is also nice for distributed training; I have not personally tested it, but I have seen their results and they are good. We are working on a nicer high-level API in TensorFlow for distributed training, but the above options are currently the best.

yupeng9 (Author) commented Oct 13, 2017

@tfboyd thanks for the information.

Since pushing to the website can take a while, do you mind posting the instructions here once you have them?

I took a look at cifar10. Is there a plan to migrate tf_cnn_benchmarks to include those additional features? A nice thing about tf_cnn_benchmarks is that it is more of a general benchmark test bed: it supports multiple models as well as different data sets, and therefore allows future additions.

More importantly, the TensorFlow website publishes useful results from this benchmark, so it has great reference value.

tfboyd self-assigned this Oct 24, 2017
@DjangoPeng

@yupeng9 What's the progress of the distributed testing? I'm starting to run the distributed TensorFlow benchmarks.
@tfboyd It seems like the official guide still hasn't been updated?

@Zhaojp-Frank

+1. Any update on the latest docs for the distributed training steps? Thanks.

tfboyd (Member) commented Nov 27, 2017

I doubt I will update the web page anytime soon. I must have been in a hurry when I typed up that page; I also use my own testing harness that builds the commands, and I likely failed to copy and paste my exact commands from the logs. I did test what is likely the most recent code on AWS two weeks ago and everything seemed fine with TF 1.4. It was a very small test with 2x p2.8xlarge instances.

I would suggest people not use this code unless they are going to write their own distributed or multi-GPU setup and can understand the variable-management aspects. We use this code to test new ideas and a lot of different variations that are not matrix tested, meaning option A may not even work with option D, and that will not be documented. I am putting all of my time into helping the team get clean examples published with known accuracy numbers over the next few months.

reedwm (Member) commented Nov 28, 2017

As @ppwwyyxx stated, when running the parameter servers on the same hosts as the workers, you should prefix the parameter server commands with CUDA_VISIBLE_DEVICES= . This hides the GPUs from TensorFlow, so it will not use them or allocate memory on them. I haven't tried it myself, but the updated commands should be:

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

I'm currently blocked by this issue, but afterwards, once I have time, I can update the README (and the website once I figure out how) with the updated commands.
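A quick way to confirm that an empty CUDA_VISIBLE_DEVICES really hides the GPUs (a sketch using the standard TF 1.x device-listing helper, not part of the benchmark itself):

CUDA_VISIBLE_DEVICES= python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"
# Should list only the CPU device; with the GPUs visible it also lists the GPU devices.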

DjangoPeng commented Nov 28, 2017

@reedwm How about setting CUDA_VISIBLE_DEVICES={0..7} for the corresponding worker? For example, GPU 0 for worker 0. The command would be:

CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

reedwm (Member) commented Nov 28, 2017

In the above example, each worker is on a separate machine, since they have different IP addresses (10.0.0.1 and 10.0.0.2). So they will each have their own set of 8 GPUs, and therefore CUDA_VISIBLE_DEVICES should not be set.

If multiple worker processes are run on the same machine, your strategy of setting CUDA_VISIBLE_DEVICES will work. But it's better to run a single worker per machine and have each worker use all the GPUs on the machine.
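If you do run multiple workers on one machine anyway, the pinning would look roughly like this (a hypothetical sketch: two workers on 10.0.0.1 with assumed ports 50001/50002, one GPU each; model and batch-size flags omitted):

CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --num_gpus=1 --job_name=worker \
--ps_hosts=10.0.0.1:50000 --worker_hosts=10.0.0.1:50001,10.0.0.1:50002 --task_index=0
CUDA_VISIBLE_DEVICES=1 python tf_cnn_benchmarks.py --num_gpus=1 --job_name=worker \
--ps_hosts=10.0.0.1:50000 --worker_hosts=10.0.0.1:50001,10.0.0.1:50002 --task_index=1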

@DjangoPeng

Yep! I know the trick of setting CUDA_VISIBLE_DEVICES. But I have just 3 machines, with 2 1080 Tis per machine. So the recommended cluster specification is 3 parameter servers and 3 workers, with a ps/worker pair on each machine. Am I right?

reedwm (Member) commented Nov 28, 2017

Yep, that is correct. On each machine, the worker will have access to both GPUs, and the parameter server will not since CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks ... will be run.
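Spelled out for the first of the three machines (a sketch assuming IPs 10.0.0.1–10.0.0.3 and the ports used earlier in this thread; machines 1 and 2 run the same pair with --task_index=1 and 2):

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000,10.0.0.3:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001,10.0.0.3:50001 --task_index=0

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000,10.0.0.3:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001,10.0.0.3:50001 --task_index=0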

@Zhaojp-Frank

@reedwm A question about the start order: for example, on the same host A, once I run the above command to start the worker, the shell does not return; instead it keeps running, e.g. trying to start the session. So when should I start the PS? Is any strict order required?
I have run into a number of errors such as 'Attempting to use uninitialized value p' and 'expects a different device'. It would be great to document the start order.

@DjangoPeng

@Zhaojp-Frank Generally speaking, you'd better launch the PS process before worker 0. If no PS is running, worker 0 will throw the uninitialized-variable error.

reedwm (Member) commented Nov 29, 2017

You should be able to launch the processes in any order. @DjangoPeng, what are the commands you use that sometimes cause an uninitialized error?

@abidmalikwaterloo

Do we have to kill the parameter servers manually when the job is done?

reedwm (Member) commented Jan 31, 2018

@abidmalik1967, yes.
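For example, something like this on each host cleans up a leftover parameter server once the workers finish (a sketch assuming the PS was started with the commands above):

pkill -f "tf_cnn_benchmarks.py.*--job_name=ps"   # match the PS process by its command line and terminate it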

vilmara commented May 15, 2018

Hi @reedwm / @tfboyd, I am running the benchmarks on a multi-node system (2 hosts, each with 4 GPUs) following the instructions below from https://www.tensorflow.org/performance/performance_models#executing_the_script, but I am getting errors (note that I replaced python with python3 and used --num_gpus=4 for each host).

Run the following commands on host_0 (10.0.0.1):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

Run the following commands on host_1 (10.0.0.2):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

When the system processes the first command, it throws the following error on each host:

host_0 output:
2018-05-15 18:32:29.136718: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-05-15 18:32:29.136759: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-05-15 18:32:29.136775: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-05-15 18:32:37.369403: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

host_1 output:
2018-05-15 18:32:47.220352: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-05-15 18:32:47.220364: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-05-15 18:32:54.466053: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

When it runs the second command, it prints the training info and afterwards just prints the lines below without producing any more output; the processes appear to hang on each host:
Running parameter server 0 # in the case of host_0
Running parameter server 1 # in the case of host_1

abidmalikwaterloo commented May 16, 2018 via email
