tf.train.SyncReplicasOptimizer no synchronization among workers #11753
Comments
@ali01 Do you have time to look into this?
I haven't looked at synchronizing with tf.train.MonitoredTrainingSession yet, but I have some experience using tf.train.Supervisor instead, and I was having the same problem as yours. I noticed the default value (30 seconds) of the 'recovery_wait_secs' parameter that tf.train.Supervisor takes: every replica checks every 30 seconds to see whether the model is ready. So the chief starts immediately and the rest simply wait for 30 seconds. After I set this value to 1, the replicas started training at almost the same time (except for the first few steps). So I suggest you look at which input parameter of tf.train.MonitoredTrainingSession sets this wait time. This might be a direction to look at.

(The following observation also refers to the use of tf.train.Supervisor, so please check for yourself whether it holds.) Another point I have observed is that SyncReplicasOptimizer does not seem to care whether the 'replicas_to_aggregate' gradients come from different workers or not. Even if the other workers are waiting or not yet initialized, the chief starts training immediately, and if you print the global step you will see the same value repeated 'replicas_to_aggregate' times. This means the chief alone pushes enough gradients for tf.train.SyncReplicasOptimizer to average and apply. So start the chief worker process only after starting all the other workers.
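For reference, a minimal sketch of lowering that wait interval with tf.train.Supervisor; `is_chief`, `logdir`, `server`, and `train_op` are assumptions standing in for the surrounding cluster setup:

```python
import tensorflow as tf

# Assumed to be defined by the surrounding training script:
# is_chief (bool), logdir (str), server (tf.train.Server), train_op.
# recovery_wait_secs controls how often non-chief replicas poll to see
# whether the model has been initialized; the default is 30 seconds.
sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir=logdir,
                         recovery_wait_secs=1)  # poll every second instead

with sv.managed_session(server.target) as sess:
    while not sv.should_stop():
        sess.run(train_op)
```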
@utkrist Thank you for your informative answer. I checked the global step, and indeed it behaves as you explained for tf.train.Supervisor, except for a short initial phase. In my case the model error with asynchronous training degraded beyond 15 to 20 workers. With synchronous training I can scale beyond 40 workers after increasing my learning rate by sqrt(workers), to account for the increased effective batch size. So the synchronization seems to work as expected. The issue can be closed.
@smodlich I am curious how you set the learning rate: is it simply lr = 1/sqrt(workers), or something else?
Glad to see that it works as expected :) Apparently different models need different settings. Just note that the newer sync replica optimizer uses the average instead of the sum, so if you have N replicas, you might want to try sqrt(N) * lr instead of making the rate smaller.
@utkrist I'm using a base learning rate of 0.001, which works fine for a single worker. For distributed training I multiply this learning rate by sqrt(N), where N is the number of workers (just as @jmchen-g wrote). I also tried base lr * N (mentioned in this paper), but that was too high.
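A sketch of the scaling rule discussed above; the names and values are illustrative, not from a particular script:

```python
import math

base_lr = 0.001   # learning rate tuned for a single worker
num_workers = 40  # N replicas whose gradients are aggregated per step

# Because the newer SyncReplicasOptimizer averages (rather than sums) the
# gradients of N replicas, the effective batch size grows by N;
# sqrt(N) * base_lr worked here, while base_lr * N proved too high.
lr = base_lr * math.sqrt(num_workers)
```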
@smodlich Recently I used MonitoredTrainingSession and SyncReplicasOptimizer for distributed training, and I was having the same problem as yours. I modified

modified to

It seems to solve my problem. Thanks a lot!
I finally got my workers to synchronize by making this change. Thanks. |
System information
Problem Description
I'm trying to train an RNN model with distributed synchronous training and between-graph replication, using tf.train.replica_device_setter. Asynchronous training works perfectly fine. As described in the documentation, I'm wrapping my optimizer and creating the hook:
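The original snippet did not survive extraction; the documented wrapping pattern looks roughly like the sketch below, with `learning_rate`, `loss`, `global_step`, `num_workers`, and `is_chief` assumed to come from the rest of the script:

```python
import tensorflow as tf

opt = tf.train.AdamOptimizer(learning_rate)
# Wrap the optimizer so gradients from num_workers replicas are
# aggregated (averaged) before a single update is applied.
opt = tf.train.SyncReplicasOptimizer(opt,
                                     replicas_to_aggregate=num_workers,
                                     total_num_replicas=num_workers)
train_op = opt.minimize(loss, global_step=global_step)

# The hook manages the token queue and initialization for sync training.
sync_replicas_hook = opt.make_session_run_hook(is_chief)
```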
For creating and running the session I'm following the documentation exactly:
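Again, the original code block is missing; the documented pattern is roughly the following, where `server` is assumed to be the tf.train.Server for this task and `checkpoint_dir` a shared directory:

```python
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=is_chief,
                                       checkpoint_dir=checkpoint_dir,
                                       hooks=[sync_replicas_hook]) as mon_sess:
    while not mon_sess.should_stop():
        mon_sess.run(train_op)
```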
However, as already noted in #9596 and several other issues [1, 2], training does not seem to synchronize among workers. So is there a bug in SyncReplicasOptimizer? I see several hints supporting this hypothesis:
Questions: