This repository was archived by the owner on Dec 9, 2024. It is now read-only.

How can I start a benchmark with distributed_all_reduce ? #64

@sleepfin

My Env:
TensorFlow: 1.3
CUDA: 8.0
cuDNN: 6.0

I noticed the update that adds distributed_all_reduce, so I want to give it a try. But I'm not sure what value controller_host should take.
My args are:

--variable_update=distributed_all_reduce
--all_reduce_spec=pscpu:32k:xring

I start 3 processes with the following additional args:
FIRST:

--job_name=worker
--worker_hosts=127.0.0.1:50001,127.0.0.1:50002
--task_index=0

SECOND:

--job_name=worker
--worker_hosts=127.0.0.1:50001,127.0.0.1:50002
--task_index=1

THIRD:

--job_name=controller
--controller_host=??
--task_index=0
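
To make the launch layout above easier to reproduce, here is a minimal sketch that assembles the three argument lists in Python. It uses only the flags quoted in this issue. The helper names (`worker_args`, `controller_args`, `all_commands`) are hypothetical, the controller address is just one of the values that was tried (not a known-good setting), and passing the shared flags to every process is an assumption.

```python
# Sketch only: builds the argument lists described in this issue.
# Nothing here launches a process or imports TensorFlow.

# Flags given at the top of the issue; assumed to be shared by all processes.
COMMON_ARGS = [
    "--variable_update=distributed_all_reduce",
    "--all_reduce_spec=pscpu:32k:xring",
]

WORKER_HOSTS = ["127.0.0.1:50001", "127.0.0.1:50002"]

# Placeholder: one of the values tried in this issue, not a confirmed answer.
CONTROLLER_HOST = "127.0.0.1:50000"


def worker_args(task_index):
    """Arguments for one worker process (hypothetical helper)."""
    return COMMON_ARGS + [
        "--job_name=worker",
        "--worker_hosts=" + ",".join(WORKER_HOSTS),
        "--task_index=%d" % task_index,
    ]


def controller_args():
    """Arguments for the controller process (hypothetical helper)."""
    return COMMON_ARGS + [
        "--job_name=controller",
        "--controller_host=" + CONTROLLER_HOST,
        "--task_index=0",
    ]


def all_commands(script="tf_cnn_benchmarks.py"):
    """One command line per process: two workers, then the controller."""
    arg_lists = [worker_args(i) for i in range(len(WORKER_HOSTS))]
    arg_lists.append(controller_args())
    return [["python", script] + args for args in arg_lists]


if __name__ == "__main__":
    for cmd in all_commands():
        print(" ".join(cmd))
```

Each printed line corresponds to one of the FIRST, SECOND, and THIRD processes listed above.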

When I set controller_host to 127.0.0.1:50000 or 127.0.0.1:50001, I get:

TensorFlow:  1.3
Model:       resnet50
Mode:        training
SingleSess:  True
Batch size:  128 global
             64 per device
Devices:     ['job:worker/task0/gpu:0', 'job:worker/task1/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   distributed_all_reduce
AllReduce:   pscpu:32k:xring
Sync:        True
==========
Generating model
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:486: __init__ (from tensorflow.contrib.data.python.ops.readers) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.TFRecordDataset`.
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:487: range (from tensorflow.contrib.data.python.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.range()`.
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:489: zip (from tensorflow.contrib.data.python.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.zip()`.
2017-10-10 14:03:34.183287: E tensorflow/core/common_runtime/session.cc:69] Not found: No session factory registered for the given session options: {target: "127.0.0.1:50001" config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
Traceback (most recent call last):
  File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 46, in <module>
    tf.app.run()
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 42, in main
    bench.run()
  File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 892, in run
    return self._benchmark_cnn()
  File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1068, in _benchmark_cnn
    start_standard_services=start_standard_services) as sess:
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
    config=config)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 178, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1482, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 622, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: No session factory registered for the given session options: {target: "127.0.0.1:50001" config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
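
A note on reading this error: the session target in the message is the bare string "127.0.0.1:50001", and the only registered factories are DIRECT_SESSION and GRPC_SESSION. As far as I understand TensorFlow 1.x, the direct factory accepts an empty target and the gRPC factory accepts targets that start with `grpc://`, so a bare host:port matches neither. The sketch below illustrates that matching rule only; `session_factory_for` and `to_grpc_target` are hypothetical helpers, not TensorFlow APIs, and whether the benchmark script is expected to add the prefix itself is not something this issue establishes.

```python
# Sketch of the target-matching rule described above (an assumption about
# TensorFlow 1.x behavior, not code taken from TensorFlow itself).

GRPC_PREFIX = "grpc://"


def session_factory_for(target):
    """Return the factory name that would accept `target`, or None."""
    if target == "":
        return "DIRECT_SESSION"   # in-process session
    if target.startswith(GRPC_PREFIX):
        return "GRPC_SESSION"     # session backed by a remote server
    return None                   # corresponds to the NotFoundError above


def to_grpc_target(host_port):
    """Prefix a bare host:port with grpc:// if it is not already prefixed."""
    if host_port.startswith(GRPC_PREFIX):
        return host_port
    return GRPC_PREFIX + host_port


if __name__ == "__main__":
    for t in ["", "127.0.0.1:50001", to_grpc_target("127.0.0.1:50001")]:
        print(repr(t), "->", session_factory_for(t))
```

Under that assumption, the target from the log maps to no factory, while the prefixed form maps to GRPC_SESSION.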
