elastic averaging SGD update: support partitioner & more optimizers#21486
tensorflow-copybara merged 6 commits into tensorflow:master
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. "I signed it!").
I signed the CLA just now, thanks!
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
I signed it! I added the email address from the commit log to my CLA email list.
CLAs look good, thanks!
# this is to place the variables created by the optimizer into a local collection
# e.g., AdamOptimizer will create its beta accumulators as global variables
def _adjust_optimizer_variable_collection():
When apply_gradients runs, optimizers like MomentumOptimizer/AdamOptimizer/RMSPropOptimizer create new variables and put them into the GLOBAL_VARIABLES collection. However, these variables in fact belong to the local models: the local worker updates both the local model and these newly created variables.
I'm not sure I follow. I think this change is trying to do too much, and this will be fragile unless this optimizer is used in precisely the right way.
I'd like you to find less intrusive approaches to achieve this, or more invasive approaches that prescribe the code to be written in a certain way for elastic averaging to work.
Since this only concerns the variables created by apply_gradients, I'm considering taking care of just the incremental part instead of operating on the whole GLOBAL_VARIABLES collection. How does this sound to you?
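The "incremental" idea above could be sketched like this (a framework-free Python sketch; the function and collection names are hypothetical stand-ins for the TF implementation, with strings standing in for variables):

```python
# Plain-Python sketch of the incremental approach: snapshot the global
# collection before apply_gradients, then move only the variables that
# the call added into the local collection.
# (GLOBAL_VARIABLES/LOCAL_VARIABLES here are plain lists mimicking
# TensorFlow collections; this is an illustration, not TF code.)

def adjust_optimizer_variable_collection(global_vars, local_vars, apply_gradients):
    """Run apply_gradients, then relocate any newly created variables."""
    before = set(id(v) for v in global_vars)   # snapshot before the call
    result = apply_gradients()                 # may append new variables
    new_vars = [v for v in global_vars if id(v) not in before]
    for v in new_vars:                         # move only the increment
        global_vars.remove(v)
        local_vars.append(v)
    return result

# Tiny usage demo with strings standing in for variables:
global_vars = ["model/w", "model/b", "global_step"]
local_vars = []

def fake_apply_gradients():
    # e.g. an Adam-like optimizer creating slot variables
    global_vars.extend(["model/w/Adam", "model/b/Adam"])
    return "train_op"

adjust_optimizer_variable_collection(global_vars, local_vars, fake_apply_gradients)
print(local_vars)   # ['model/w/Adam', 'model/b/Adam']
print(global_vars)  # ['model/w', 'model/b', 'global_step']
```

Pre-existing globals (including global_step) are left untouched, which is the point of handling only the increment.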
  communication_period=10,
  moving_rate=None,
- rho=None,
+ rho=0.0,
Why change the default value here?
- in async mode, rho is not used in the original paper
- in my experiments, rho=0 works well

When rho=None, the default value self._rho = self._moving_rate / self._opt._learning_rate may result in a large value. E.g., moving_rate = self.BETA / communication_period / num_worker = 0.9 / 8 / 4 = 0.028125, while the initial learning_rate might be 1e-4, which makes rho = 281.25, and this makes the workers unable to perform exploitation.
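The arithmetic above, spelled out (plain Python; the constant names mirror the ones in the comment):

```python
# Reproducing the arithmetic from the comment above: the rho that falls
# out of the old default (moving_rate / learning_rate) can be huge.
BETA = 0.9                    # exploration constant from the comment
communication_period = 8
num_worker = 4

moving_rate = BETA / communication_period / num_worker   # ≈ 0.028125

learning_rate = 1e-4          # a typical small initial learning rate
rho = moving_rate / learning_rate                        # ≈ 281.25
```

With rho this large, the penalty term dominates and workers cannot explore away from the center variable, which is why rho=0.0 is proposed as the async-mode default.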
Regardless, we shouldn't change default values like this, as it affects the reproducibility of existing work which uses this code.
rho: the amount of exploration we allow in the model. The default
  value is moving_rate/learning_rate.
use_locking: If True, use locks for update operations.
sync_flag: Add_sync_queues_and_barrier or not, default to False, in case of
Explain what sync_queues_and_barrier is, and when people should set this to true or false
True: all workers will wait for each other before starting training.
False: a worker can start training as soon as its own initialization is done; there is no need to wait for everyone to be ready. In case one worker is restarted, it can rejoin and continue training without being blocked.
Will update these in the code.
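The two behaviors could be illustrated with a plain-Python analogy, using threading.Barrier in place of TF's sync queues (an analogy only, not the actual implementation):

```python
# Illustrating the sync_flag semantics with threads: with the flag set,
# no worker "starts training" until every worker has initialized; with
# it unset, each worker starts as soon as its own init is done.
import threading
import time

def run_workers(sync_flag, num_workers=3):
    barrier = threading.Barrier(num_workers)
    started = []

    def worker(i, init_delay):
        time.sleep(init_delay)          # simulate initialization time
        if sync_flag:
            barrier.wait()              # wait for all workers (barrier)
        started.append(i)               # "start training"

    threads = [threading.Thread(target=worker, args=(i, 0.01 * i))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return started

# In both modes every worker eventually trains; the flag only controls
# whether anyone may start before all are ready.
synced = run_workers(sync_flag=True)
unsynced = run_workers(sync_flag=False)
```

The restart scenario from the comment corresponds to sync_flag=False: a rejoining worker never blocks on the barrier.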
has_global_step = False
for key, var in var_list.items():
    tensor = var
    if False == has_global_step\
"not has_global_step" instead of "False == has_global_step"
Also split lines with parenthesis instead of \
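The suggested style fix, shown as a hypothetical before/after on a loop like the one above:

```python
# Before (the style the reviewer flagged): an explicit comparison
# against False, and a backslash line continuation.
def find_global_step_before(var_list, GLOBAL_STEP="global_step"):
    has_global_step = False
    for key, var in var_list.items():
        if False == has_global_step\
           and GLOBAL_STEP in key.split('/'):
            has_global_step = True
    return has_global_step

# After (the suggested style): "not has_global_step", and parentheses
# instead of a trailing backslash.
def find_global_step_after(var_list, GLOBAL_STEP="global_step"):
    has_global_step = False
    for key, var in var_list.items():
        if (not has_global_step
                and GLOBAL_STEP in key.split('/')):
            has_global_step = True
    return has_global_step

var_list = {"model/w": 1, "global_step": 2}
print(find_global_step_before(var_list))  # True
print(find_global_step_after(var_list))   # True
```

Both versions behave identically; the second is simply idiomatic PEP 8 style.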
    and GLOBAL_STEP in key.split('/'):
    has_global_step = True

if isinstance(var, list) == False:
    swapped_var_list[key] = tensor

# find global_step and add it if missing
if False == has_global_step:
    var_list = saver.BaseSaverBuilder.OpListToDict(var_list)

swapped_var_list = {}
has_global_step = False
Why is it so important to care about the global step here? Should it be affected by the optimizer?
In the single-GPU version, global_step is saved when saving a default model, and global_step is needed by the code when restoring a model with the single-GPU version. Considering global_step is only 8 bytes, I added it to avoid potential problems for users when restoring a model.
Yes, but I still do not understand why the global step has to be copied over in this case.
OK, let me remove it; users can still add it explicitly.
        return local_var
    else:
        return getter(name, trainable, collections, *args, **kwargs)

# 1. default to LOCAL_VARIABLES (instead of GLOBAL_VARIABLES)
This policy for variable placement seems really complicated and error-prone. Why is it being done?
Assuming user code was like this in the single-GPU version:

    with tf.variable_scope('model'):
        model = UserModel(...)

when using ElasticAverageOptimizer, the code will be changed to the following:

    with tf.device(tf.train.replica_device_setter(...)), \
         tf.variable_scope('model', custom_getter=custom_getter):
        model = UserModel(...)

Workers train their own local models and communicate with the PS to update the 'global_center_variable' model. In the single-GPU version it is fine for variables to default to GLOBAL_VARIABLES; in the distributed version, however, they logically belong to the local workers, so I move them to LOCAL_VARIABLES.
global_step is, of course, an exception, as it is shared across workers.
In case users have variables similar to global_step, they can put those variables into the GLOBAL_SHARE_VARS collection to make them 'global'.
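A minimal, framework-free sketch of the routing policy just described (the collection names mirror TensorFlow's GraphKeys, but this is the decision logic only, not the actual custom_getter):

```python
# Plain-Python sketch of the collection-routing policy: given a variable
# name and its requested collections, decide where it should land.
# GLOBAL_SHARE_VARS is the extra collection proposed in this PR.
GLOBAL_VARIABLES = "variables"
LOCAL_VARIABLES = "local_variables"
GLOBAL_SHARE_VARS = "global_share_vars"
GLOBAL_STEP = "global_step"

def route_collections(name, collections):
    """Return the collections a variable should be placed in."""
    # 1. no collections requested -> default to LOCAL_VARIABLES
    if not collections:
        return [LOCAL_VARIABLES]
    # 2. explicitly shared variables stay global
    if GLOBAL_SHARE_VARS in collections:
        return list(collections)
    # 3. global_step must remain global (shared across workers)
    if GLOBAL_STEP in name.split('/'):
        return list(collections)
    # everything else that asked for GLOBAL_VARIABLES becomes local
    return [LOCAL_VARIABLES if c == GLOBAL_VARIABLES else c
            for c in collections]

print(route_collections("model/w", None))
# ['local_variables']
print(route_collections("global_step", [GLOBAL_VARIABLES]))
# ['variables']
print(route_collections("model/stats", [GLOBAL_VARIABLES, GLOBAL_SHARE_VARS]))
# ['variables', 'global_share_vars']
```

As the reviewer notes below, collection membership does not by itself change device placement under replica_device_setter, which is why this part was ultimately removed.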
Right. But setting these variables to LOCAL_VARIABLES will not affect their placement by a replica_device_setter, I think, so I don't understand what is being done here.
Accepted; I'll remove this part.
Verified on 4 different models; works well.
I don't see it removed yet.

Sorry, just pushed the change.
# 2. put to global if explicitly defined (GLOBAL_SHARE_VARS)
# 3. other GLOBAL_VARIABLES put to LOCAL_VARIABLES
#    except global_step, which must be global
if collections is None or len(collections) == 0:
replace with "if not collections"
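A quick check that `not collections` covers both conditions in the longer test:

```python
# The longer check and the idiomatic one agree for None, the empty
# list, and non-empty lists.
def old_check(collections):
    return collections is None or len(collections) == 0

def new_check(collections):
    return not collections

for value in (None, [], ["variables"], ["a", "b"]):
    assert old_check(value) == new_check(value)
```

One caveat: `not collections` is also True for other falsy values such as an empty tuple, which is harmless (and arguably desirable) here.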
  value is moving_rate/learning_rate.
  rho=0.0 is suggested in async mode.
use_locking: If True, use locks for update operations.
sync_flag: Add_sync_queues_and_barrier or not.
Rename sync_flag to just "synchronous"?
g = ops.get_default_graph()
idx = 0
for _ in range(len(g._collections[ops.GraphKeys.GLOBAL_VARIABLES])):
    var = g._collections[ops.GraphKeys.GLOBAL_VARIABLES][idx]
Do not use _collections, use get_collection_ref instead.
However mutating these collections after the fact might not have the effect you want. Perhaps it's best to tell the optimizers to create local variables instead?
Changed to get_collection_ref, and verified with MomentumOptimizer/AdamOptimizer/RMSPropOptimizer.
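The distinction matters because, in TF 1.x, `Graph.get_collection` returns a copy of the collection list while `get_collection_ref` returns the underlying list itself, so only mutations through the latter affect the graph. A plain-Python mock illustrates this (the `MockGraph` class is illustrative, not the TF API):

```python
# Mocking collection semantics: get_collection hands back a copy, so
# removing from it does nothing to the graph; get_collection_ref hands
# back the actual list, so removals take effect.
class MockGraph:
    def __init__(self):
        self._collections = {"variables": ["w", "b", "w/Adam"]}

    def get_collection(self, name):
        return list(self._collections.get(name, []))   # a copy

    def get_collection_ref(self, name):
        return self._collections.setdefault(name, [])  # the list itself

g = MockGraph()

copy = g.get_collection("variables")
copy.remove("w/Adam")                  # mutates only the copy
print(g._collections["variables"])     # ['w', 'b', 'w/Adam']

ref = g.get_collection_ref("variables")
ref.remove("w/Adam")                   # mutates the real collection
print(g._collections["variables"])     # ['w', 'b']
```

This is also why the reviewer warns that mutating collections after the fact may not do what you want: other code may already hold copies taken earlier.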
@drpngx, I can't see the 'Details' of the feedback/copybara failure. Is there anything I should handle for this?

Somehow the test is creating a directory somewhere. We should be using

Thanks, that's for the checkpoint unit test; I will change it.

@drpngx Ubuntu contrib: Windows Bazel GPU:

Hi @drpngx, the feedback/copybara test passes now. ERROR: T:/src/github/tensorflow/tensorflow/core/BUILD:2911:1: C++ compilation of rule '//tensorflow/core:device_tracer' failed (Exit 2): msvc_wrapper_for_nvcc.bat failed: error executing command

We're testing internally. Let's see what happens.

Great! Thanks!

OK, it was pushed internally; it should appear soon.
PiperOrigin-RevId: 209198285
The old implementation of the elastic average optimizer does not support partitioned variables, nor optimizers like AdamOptimizer which create global variables such as beta_1 and beta_2.
Main changes:
As it turns out, the global model is better than the local workers' models. To allow the chief worker to save the global values, mapping logic is added.
Verifications I have done: CIFAR-10 on AlexNet/MobileNetV2 and other internal models.
All results show better performance and precision.
Related Item
#12472: Feature request: EASGD
@ali01, @jinxin0924, @tmulc18 Could you kindly help review the changes?