tf.train.SyncReplicasOptimizer no synchronization among workers #11753

Closed
smodlich opened this issue Jul 25, 2017 · 9 comments
Labels
stat:awaiting tensorflower Status - Awaiting response from tensorflower

Comments

@smodlich

System information

  • Have I written custom code: Yes
  • OS Platform and Distribution: Linux
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version: 1.2.1
  • Python version: 3.5.2

Problem Description

I'm trying to train an RNN model with distributed synchronous training and between-graph replication, using tf.train.replica_device_setter. Asynchronous training works perfectly fine. As described in the documentation, I wrap my optimizer and create the hook:

def training(loss, learning_rate, global_step, num_workers, is_chief):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    # Wrap the optimizer so gradients from all workers are aggregated
    # before a single update is applied.
    optimizer = tf.train.SyncReplicasOptimizer(optimizer,
                                               replicas_to_aggregate=num_workers,
                                               total_num_replicas=num_workers)
    gvs = optimizer.compute_gradients(loss)
    capped_gvs = [(tf.clip_by_value(grad, -CLIPPING_THRESHOLD, CLIPPING_THRESHOLD), var)
                  for grad, var in gvs]
    train_op = optimizer.apply_gradients(capped_gvs, global_step=global_step)
    print('Is Chief?: ' + str(is_chief))
    hook = optimizer.make_session_run_hook(is_chief)
    return train_op, hook

To create and run the session I follow the documentation exactly:

sess = tf.train.MonitoredTrainingSession(master=server.target,
                                         is_chief=(task_index == 0),
                                         hooks=[hook])
sess.run([train_op], feed_dict=...)
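
For context, a minimal sketch of the cluster setup these snippets assume (the host addresses, FLAGS, and build_graph are hypothetical placeholders):

import tensorflow as tf

# Hypothetical one-PS / two-worker cluster; addresses are placeholders.
cluster = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})
server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == 'ps':
    server.join()
else:
    # Between-graph replication: variables go to the PS,
    # ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:%d' % FLAGS.task_index,
            cluster=cluster)):
        global_step = tf.contrib.framework.get_or_create_global_step()
        loss = build_graph()  # hypothetical model-building function
        train_op, hook = training(loss, 0.001, global_step,
                                  num_workers=2,
                                  is_chief=(FLAGS.task_index == 0))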

However, as already noted in #9596 and several other issues [1, 2], the training does not seem to synchronize among workers. Is there a bug in SyncReplicasOptimizer? I see several hints supporting this hypothesis:

  1. One worker is constantly ahead by several steps in my logs.
  2. When I stop one worker, the other just continues training as if nothing happened. In a synchronized setting, training should stall or crash.
  3. Training steps take approximately the same time as in asynchronous training; synchronous training should be slower because of the synchronization overhead.

Questions:

  1. Is there a test with which one can confirm that sync_replicas_optimizer.py really synchronizes?
  2. Is the API documentation for sync_replicas_optimizer.py up to date?
  3. Is this somehow related to tf.train.replica_device_setter as mentioned by @jmchen-g in Synchronous distributed tensorflow training doesn't synchronize among workers #9596?
  4. Are there any workarounds for this?
@reedwm reedwm added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jul 25, 2017
@reedwm
Member

reedwm commented Jul 25, 2017

@ali01 Do you have time to look into this?

@utkrist

utkrist commented Jul 26, 2017

I haven't looked at synchronizing with tf.train.MonitoredTrainingSession yet, but I have some experience with tf.train.Supervisor, where I had the same problem as yours:
* one of the replicas (the chief) was far ahead of the others, while the others seemed to be waiting for some time

I noticed the default value (30 seconds) of the 'recovery_wait_secs' parameter that tf.train.Supervisor takes. Every replica checks every 30 seconds whether the model is ready, so the chief starts immediately while the rest simply wait for 30 seconds. After I set this value to 1, the replicas started training at almost the same time (except for the first few steps). So I suggest checking which input parameter of tf.train.MonitoredTrainingSession controls this wait time. That might be a direction to look at.
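
A minimal sketch of how I set it with tf.train.Supervisor (logdir and task_index are placeholders):

sv = tf.train.Supervisor(is_chief=(task_index == 0),
                         logdir='/tmp/train_logs',  # placeholder path
                         global_step=global_step,
                         recovery_wait_secs=1)      # default is 30
with sv.managed_session(server.target) as sess:
    ...  # training loop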

(The following also refers to tf.train.Supervisor, so please check for yourself whether it holds for MonitoredTrainingSession.) Another thing I have observed: SyncReplicasOptimizer does not seem to care whether the 'replicas_to_aggregate' gradients come from different workers or not. Even if the other workers are still waiting or not yet initialized, the chief starts training immediately, and if you print the global step you will see the same value repeated 'replicas_to_aggregate' times. This means the chief alone pushes enough gradients for tf.train.SyncReplicasOptimizer to average and apply. So start the chief worker process only after starting all the other workers.
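
One way to observe this (a sketch, assuming the MonitoredTrainingSession loop from the original post):

# With only the chief running, the same global step value appears
# 'replicas_to_aggregate' times before it advances.
while not sess.should_stop():
    _, step = sess.run([train_op, global_step], feed_dict=...)
    print('worker %d at global step %d' % (task_index, step))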

@smodlich
Author

@utkrist Thank you for your informative answer. I checked the global step, and it indeed behaves as you described for tf.train.Supervisor, except for a short initial phase. In my case the model error with asynchronous training degraded beyond 15 to 20 workers. With synchronous training I can scale beyond 40 workers after increasing my learning rate by sqrt(workers), to account for the larger effective batch size. So the synchronization seems to work as expected. The issue can be closed.

@utkrist

utkrist commented Jul 31, 2017

@smodlich I am curious how you set the learning rate: is it simply lr = 1/(sqrt(workers)), or something else? I also need to scale to many machines soon.

@jmchen-g
Contributor

Glad to see that it works as expected :)

Apparently different models need different settings. Just note that the newer sync replicas optimizer uses the average instead of the sum, so if you have N replicas you might want to try sqrt(N) * lr instead of making the learning rate smaller.

@smodlich
Author

smodlich commented Aug 1, 2017

@utkrist I'm using a base learning rate of 0.001, which works fine for a single worker. For distributed training I multiply this learning rate by sqrt(N), where N is the number of workers (just as @jmchen-g wrote). I also tried base lr * N (mentioned in this paper), but that was too high.
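
In code, the scaling rule is simply the following (a sketch; names are illustrative):

import math

BASE_LR = 0.001  # learning rate that works well for a single worker

def scaled_learning_rate(num_workers, linear=False):
    # sqrt(N) scaling, since the newer SyncReplicasOptimizer averages
    # (rather than sums) the N aggregated gradients; linear scaling
    # (BASE_LR * N) was too aggressive in my case.
    factor = num_workers if linear else math.sqrt(num_workers)
    return BASE_LR * factor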

@smodlich smodlich reopened this Aug 1, 2017
@smodlich smodlich closed this as completed Aug 1, 2017
@fengrussell

@smodlich Recently I used MonitoredTrainingSession and SyncReplicasOptimizer for distributed training and had the same problem as yours.
I found the solution:

hook=optimizer.make_session_run_hook(is_chief)

modified to

hook=optimizer.make_session_run_hook(is_chief, num_tokens=0)
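
(As far as I understand the API: make_session_run_hook defaults to num_tokens=-1, which adds replicas_to_aggregate tokens to the token queue at initialization, so workers can run that many steps without waiting for aggregation; num_tokens=0 skips that initial fill, so workers block until gradients have actually been aggregated.)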

@snakecon

> (quoting @fengrussell's fix above: hook=optimizer.make_session_run_hook(is_chief, num_tokens=0))

It seems to solve my problem. Thanks a lot!

@tbake0155

> (quoting @fengrussell's fix above: hook=optimizer.make_session_run_hook(is_chief, num_tokens=0))

I finally got my workers to synchronize by making this change. Thanks.
