All_reduce of collective_ops hangs in a distributed environment #31913
Comments
Reassigning this to @dubey since it involves collective ops.
Collective ops support both in-graph and between-graph communication. To enable between-graph communication, each worker needs to know the collective group leader. The fix is to add the following snippet:
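```python
import tensorflow as tf

# Reconstructed from the reproduction script later in this thread;
# the original snippet was lost in formatting.
config = tf.ConfigProto()
config.experimental.collective_group_leader = '/job:worker/replica:0/task:0'
```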
and pass this config to the `tf.train.Server`.
Thanks very much to all of you! The solution fixes the hanging. 🙂
Here comes a further question: does this fix work for both in-graph and between-graph communication concurrently? Two distributed workers are created with three towers each, and the between-graph communication works fine. However, the in-graph allreduce among the three towers on the same worker fails with the message `Check failed: cp->group.group_size == cp->instance.device_names.size() (3 vs. 2)0x7ff630142180`.

Code:

```python
"""Illustrate AllReduce"""
import multiprocessing as mp

MP_METHOD = 'fork'  # 'fork' (UNIX), 'spawn' (Windows)
NUM_PROCESSES = 2
NUM_TOWERS = 3


def process_fn(worker_hosts, task_index):
    """allreduce process"""
    import time
    import tensorflow as tf
    from tensorflow.python.ops import collective_ops

    cluster_spec = tf.train.ClusterSpec({'worker': worker_hosts})
    # An unconfigured collective_group_leader makes each worker the leader;
    # '/replica:0' is necessary in the configuration.
    config = tf.ConfigProto(device_count={'CPU': NUM_TOWERS})
    config.experimental.collective_group_leader = '/job:worker/replica:0/task:0'
    server = tf.train.Server(cluster_spec, config=config,
                             job_name='worker', task_index=task_index)

    with tf.Graph().as_default():
        # create weights
        all_weights = list()
        for worker_index in range(NUM_PROCESSES):
            worker_weights = list()
            with tf.variable_scope('worker{}'.format(worker_index)):
                for tower_index in range(NUM_TOWERS):
                    device = '/job:worker/replica:0/task:{}/device:CPU:{}'.format(
                        worker_index, tower_index)
                    with tf.device(device):
                        worker_weights.append(tf.get_variable(
                            'weight{}'.format(tower_index), shape=[]))
            all_weights.append(worker_weights)

        intra_reduced = list()
        inter_reduced = None
        with tf.variable_scope('worker{}'.format(task_index)):
            # intra-worker allreduce
            for weight in all_weights[task_index]:
                with tf.device(weight.device):
                    intra_reduced.append(collective_ops.all_reduce(
                        weight, NUM_TOWERS, task_index, 0, 'Add', 'Div'))
            # inter-worker allreduce
            weight = all_weights[task_index][0]
            with tf.device(weight.device):
                inter_reduced = collective_ops.all_reduce(
                    weight, NUM_PROCESSES, NUM_PROCESSES, 0, 'Add', 'Div')

        if task_index == 0:
            session_creator = tf.train.ChiefSessionCreator(
                master=server.target)
        else:
            session_creator = tf.train.WorkerSessionCreator(
                master=server.target)
        with tf.train.MonitoredSession(session_creator=session_creator) \
                as mon_sess:
            result_all = mon_sess.run(all_weights)
            print('task {} sense {}'.format(task_index, result_all))
            result_inter = mon_sess.run([inter_reduced])
            print('task {} inter_reduce {}'.format(task_index, result_inter))
            result_intra = mon_sess.run([intra_reduced])
            print('task {} intra_reduce {}'.format(task_index, result_intra))
            time.sleep(1)


def start_process():
    """start processes"""
    port = 60000
    host_fmt = 'localhost:{}'
    worker_hosts = list()
    for process_index in range(NUM_PROCESSES):
        worker_hosts.append(host_fmt.format(port + process_index))

    mp_ctx = mp.get_context(MP_METHOD)
    processes = list()
    for process_index in range(NUM_PROCESSES):
        process = mp_ctx.Process(target=process_fn,
                                 args=(worker_hosts, process_index,))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()


if __name__ == '__main__':
    start_process()
```

Log:

```
(tf-1.13-py3) [huwh1@huwh1-centos worksync]$ python ./tf_distribute_collective_ops_rev2.py
2019-08-29 15:28:57.955533: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-29 15:28:57.957342: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-29 15:28:57.969872: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3408000000 Hz
2019-08-29 15:28:57.970223: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3c8cb80 executing computations on platform Host. Devices:
2019-08-29 15:28:57.970257: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-08-29 15:28:57.971219: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3408000000 Hz
2019-08-29 15:28:57.971524: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x3c8ce20 executing computations on platform Host. Devices:
2019-08-29 15:28:57.971544: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-08-29 15:28:57.972298: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> localhost:60000, 1 -> localhost:60001}
2019-08-29 15:28:57.973081: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:252] Initialize GrpcChannelCache for job worker -> {0 -> localhost:60000, 1 -> localhost:60001}
2019-08-29 15:28:57.973289: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:60000
2019-08-29 15:28:57.974456: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:391] Started server with target: grpc://localhost:60001
WARNING:tensorflow:From /home/huwh1/virtualenv/tf-1.13-py3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/huwh1/virtualenv/tf-1.13-py3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-08-29 15:28:58.154712: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 35144c1b1c60a213 with config: device_count { key: "CPU" value: 3 } experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
2019-08-29 15:28:58.170439: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 3c9b794a1f1e9112 with config: device_count { key: "CPU" value: 3 } experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
task 0 sense [[-0.8433976, 1.4618217, 1.4098305], [-0.9607944, -0.2295152, 1.4139572]]
2019-08-29 15:29:28.195776: I tensorflow/core/distributed_runtime/master_session.cc:1192] Start master session 12138e7e76d8c500 with config: device_count { key: "CPU" value: 3 } experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
task 1 sense [[-0.8433976, 1.4618217, 1.4098305], [-0.9607944, -0.2295152, 1.4139572]]
task 0 inter_reduce [-0.90209603]
task 1 inter_reduce [-0.90209603]
2019-08-29 15:29:28.244516: F tensorflow/core/common_runtime/collective_param_resolver_local.cc:389] Check failed: cp->group.group_size == cp->instance.device_names.size() (3 vs. 2)0x7ff630142180
2019-08-29 15:29:30.742565: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
[[{{node worker1_1/CollectiveReduce}}]]
2019-08-29 15:29:30.742915: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
[[{{node worker1_1/CollectiveReduce_1}}]]
2019-08-29 15:29:30.743016: W tensorflow/core/common_runtime/base_collective_executor.cc:203] BaseCollectiveExecutor::StartAbort Unavailable: OS Error
[[{{node worker1_1/CollectiveReduce_2}}]]
2019-08-29 15:29:40.753678: W tensorflow/core/distributed_runtime/master_session.cc:1363] Timeout for closing worker session
2019-08-29 15:29:50.758593: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
Terminated
```
Are you trying to do something like a hierarchical all-reduce? I haven't tried this myself, but one thing that confuses me is the assignment of group keys and instance keys. You want a unique group key for each set of devices participating in a collective, and a unique instance key for every instance of a collective.
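A minimal sketch of one such assignment for the script above — untested, and the key values below are arbitrary as long as they don't collide; it gives the inter-worker reduce and each worker's intra-worker reduce their own group and instance keys:

```python
import tensorflow as tf
from tensorflow.python.ops import collective_ops


def build_reduces(all_weights, task_index, num_processes, num_towers):
    """Hypothetical rework of the all_reduce section of process_fn above:
    one group key per set of participating devices, one instance key per
    collective instance, as suggested in the comment above."""
    inter_group_key = 0               # group: CPU:0 of every worker
    intra_group_key = 1 + task_index  # group: the towers of this worker
    inter_instance_key = 0                           # the one inter-worker reduce
    intra_instance_key = num_processes + task_index  # this worker's intra reduce

    # intra-worker allreduce: all towers of one worker share one group key
    # and one instance key, since together they form a single collective
    intra_reduced = []
    for weight in all_weights[task_index]:
        with tf.device(weight.device):
            intra_reduced.append(collective_ops.all_reduce(
                weight, num_towers, intra_group_key, intra_instance_key,
                'Add', 'Div'))

    # inter-worker allreduce: its own group key and its own instance key,
    # so it can no longer collide with the intra-worker collectives
    weight = all_weights[task_index][0]
    with tf.device(weight.device):
        inter_reduced = collective_ops.all_reduce(
            weight, num_processes, inter_group_key, inter_instance_key,
            'Add', 'Div')
    return intra_reduced, inter_reduced
```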
Thank you for the fruitful advice. 😃 It seems a globally unique instance key is needed for each collective instance.
System information

Describe the current behavior
The monitored session hangs fetching the `reduced_weight` tensor.

Describe the expected behavior
The all_reduce tensor `reduced_weight` gives the proper answer on all workers.

Code to reproduce the issue

Other info / logs
The `collective_ops.all_reduce` seems to be referenced only once, in `build_collective_reduce`, whose docstring says "input_tensors: tensors within a single worker graph that are to be reduced together; must be one per device." Does that mean `all_reduce` is only applicable to in-graph replication?
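For reference, the in-graph pattern that docstring seems to describe — a minimal single-process sketch, untested, with illustrative device names and keys:

```python
import tensorflow as tf
from tensorflow.python.ops import collective_ops

NUM_DEVICES = 3  # three CPU devices in one process

config = tf.ConfigProto(device_count={'CPU': NUM_DEVICES})
reduced = []
for i in range(NUM_DEVICES):
    with tf.device('/device:CPU:{}'.format(i)):
        t = tf.constant(float(i))
        # one input tensor per device; all participants share a single
        # group key and instance key, forming one collective instance
        reduced.append(collective_ops.all_reduce(
            t, NUM_DEVICES, 0, 0, 'Add', 'Div'))

with tf.Session(config=config) as sess:
    print(sess.run(reduced))  # expect [1.0, 1.0, 1.0]: (0 + 1 + 2) / 3
```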