tf.train.SyncReplicasOptimizer no synchronization among workers #11753

Closed
smodlich opened this issue Jul 25, 2017 · 9 comments
Labels
stat:awaiting tensorflower Status - Awaiting response from tensorflower

Comments

@smodlich

System information

  • Have I written custom code: Yes
  • OS Platform and Distribution: Linux
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version: 1.2.1
  • Python version: 3.5.2

Problem Description

I'm trying to train an RNN model with distributed synchronous training and between-graph replication, using tf.train.replica_device_setter. Asynchronous training works perfectly fine. As described in the documentation, I wrap my optimizer and create the hook:

def training(loss, learning_rate, global_step, num_workers, is_chief):
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    # Wrap the optimizer so gradients from all workers are aggregated
    # before a single update is applied.
    optimizer = tf.train.SyncReplicasOptimizer(optimizer,
                                               replicas_to_aggregate=num_workers,
                                               total_num_replicas=num_workers)
    gvs = optimizer.compute_gradients(loss)
    capped_gvs = [(tf.clip_by_value(grad, -CLIPPING_THRESHOLD, CLIPPING_THRESHOLD), var)
                  for grad, var in gvs]
    train_op = optimizer.apply_gradients(capped_gvs, global_step=global_step)
    print('Is Chief?: ' + str(is_chief))
    hook = optimizer.make_session_run_hook(is_chief)
    return train_op, hook

To create and run the session I follow the documentation exactly:

sess = tf.train.MonitoredTrainingSession(master=server.target,
                                         is_chief=(task_index == 0),
                                         hooks=[hook])
sess.run([train_op], feed_dict=...)
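
For context, a minimal sketch of the cluster setup these snippets assume (the host addresses, FLAGS, and build_graph are hypothetical placeholders):

import tensorflow as tf

# Hypothetical one-PS / two-worker cluster; addresses are placeholders.
cluster = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})
server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == 'ps':
    server.join()
else:
    # Between-graph replication: variables go to the PS,
    # ops stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device='/job:worker/task:%d' % FLAGS.task_index,
            cluster=cluster)):
        global_step = tf.contrib.framework.get_or_create_global_step()
        loss = build_graph()  # hypothetical model-building function
        train_op, hook = training(loss, 0.001, global_step,
                                  num_workers=2,
                                  is_chief=(FLAGS.task_index == 0))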

However, as already noted in #9596 and several other issues [1, 2], the training does not seem to synchronize among workers. Is there a bug in SyncReplicasOptimizer? I see several hints supporting this hypothesis:

  1. One worker is constantly ahead by several steps in my logs.
  2. When I stop one worker, the other just continues training as if nothing happened. In a synchronized setting, training should stall or crash.
  3. Training steps take approximately the same time as in asynchronous training; synchronous training should be slower because of the synchronization overhead.

Questions:

  1. Is there a test with which one can confirm that sync_replicas_optimizer.py really synchronizes?
  2. Is the API documentation for sync_replicas_optimizer.py up to date?
  3. Is this somehow related to tf.train.replica_device_setter as mentioned by @jmchen-g in Synchronous distributed tensorflow training doesn't synchronize among workers #9596?
  4. Are there any workarounds for this?
@reedwm reedwm added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jul 25, 2017
@reedwm
Member

reedwm commented Jul 25, 2017

@ali01 Do you have time to look into this?

@utkrist

utkrist commented Jul 26, 2017

I haven't looked at synchronizing with tf.train.MonitoredTrainingSession yet, but I have some experience with tf.train.Supervisor, where I had the same problem as yours:
* one of the replicas (the chief) was far ahead of the others, while the others seemed to be waiting for some time

I noticed the default value (30 seconds) of the 'recovery_wait_secs' parameter that tf.train.Supervisor takes. Every replica checks every 30 seconds whether the model is ready, so the chief starts immediately while the rest simply wait for 30 seconds. After I set this value to 1, the replicas started training at almost the same time (except for the first few steps). So I suggest checking which input parameter of tf.train.MonitoredTrainingSession controls this wait time. That might be a direction to look at.
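
A minimal sketch of how I set it with tf.train.Supervisor (logdir and task_index are placeholders):

sv = tf.train.Supervisor(is_chief=(task_index == 0),
                         logdir='/tmp/train_logs',  # placeholder path
                         global_step=global_step,
                         recovery_wait_secs=1)      # default is 30
with sv.managed_session(server.target) as sess:
    ...  # training loop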

(The following also refers to tf.train.Supervisor, so please check for yourself whether it holds for MonitoredTrainingSession.) Another thing I have observed: SyncReplicasOptimizer does not seem to care whether the 'replicas_to_aggregate' gradients come from different workers or not. Even if the other workers are still waiting or not yet initialized, the chief starts training immediately, and if you print the global step you will see the same value repeated 'replicas_to_aggregate' times. This means the chief alone pushes enough gradients for tf.train.SyncReplicasOptimizer to average and apply. So start the chief worker process only after starting all the other workers.
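
One way to observe this (a sketch, assuming the MonitoredTrainingSession loop from the original post):

# With only the chief running, the same global step value appears
# 'replicas_to_aggregate' times before it advances.
while not sess.should_stop():
    _, step = sess.run([train_op, global_step], feed_dict=...)
    print('worker %d at global step %d' % (task_index, step))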

@smodlich
Author

@utkrist Thank you for your informative answer. I checked the global step, and it indeed behaves as you described for tf.train.Supervisor, except for a short initial phase. In my case the model error with asynchronous training degraded beyond 15 to 20 workers. With synchronous training I can scale beyond 40 workers after increasing my learning rate by sqrt(workers), to account for the larger effective batch size. So the synchronization seems to work as expected. The issue can be closed.

@utkrist

utkrist commented Jul 31, 2017

@smodlich I am curious how you set the learning rate: is it simply lr = 1/(sqrt(workers)), or something else? I also need to scale to many machines soon.

@jmchen-g
Contributor

Glad to see that it works as expected :)

Apparently different models need different settings. Just note that the newer sync replicas optimizer uses the average instead of the sum, so if you have N replicas you might want to try sqrt(N) * lr instead of making the learning rate smaller.

@smodlich
Author

smodlich commented Aug 1, 2017

@utkrist I'm using a base learning rate of 0.001, which works fine for a single worker. For distributed training I multiply this learning rate by sqrt(N), where N is the number of workers (just as @jmchen-g wrote). I also tried base lr * N (mentioned in this paper), but that was too high.
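
In code, the scaling rule is simply the following (a sketch; names are illustrative):

import math

BASE_LR = 0.001  # learning rate that works well for a single worker

def scaled_learning_rate(num_workers, linear=False):
    # sqrt(N) scaling, since the newer SyncReplicasOptimizer averages
    # (rather than sums) the N aggregated gradients; linear scaling
    # (BASE_LR * N) was too aggressive in my case.
    factor = num_workers if linear else math.sqrt(num_workers)
    return BASE_LR * factor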

@smodlich smodlich reopened this Aug 1, 2017
@smodlich smodlich closed this as completed Aug 1, 2017
@fengrussell

@smodlich Recently I used MonitoredTrainingSession and SyncReplicasOptimizer for distributed training and had the same problem as yours.
I found the solution:

hook=optimizer.make_session_run_hook(is_chief)

modified to

hook=optimizer.make_session_run_hook(is_chief, num_tokens=0)
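
(As far as I understand the API: make_session_run_hook defaults to num_tokens=-1, which adds replicas_to_aggregate tokens to the token queue at initialization, so workers can run that many steps without waiting for aggregation; num_tokens=0 skips that initial fill, so workers block until gradients have actually been aggregated.)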

@snakecon

> (quoting @fengrussell's fix above: hook=optimizer.make_session_run_hook(is_chief, num_tokens=0))

It seems to solve my problem. Thanks a lot!

@tbake0155

> (quoting @fengrussell's fix above: hook=optimizer.make_session_run_hook(is_chief, num_tokens=0))

I finally got my workers to synchronize by making this change. Thanks.
