elastic averaging SGD update: support partitioner & more optimizers#21486
tensorflow-copybara merged 6 commits into tensorflow:master
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. "I signed it!").
I signed the CLA just now, thanks!
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
I signed it! I added the email address from the commit log to my CLA email list.
CLAs look good, thanks!
# this is to place the variables created by the optimizer into a local collection
# e.g., AdamOptimizer will create its beta accumulators as global variables
def _adjust_optimizer_variable_collection():
When apply_gradients runs, optimizers like MomentumOptimizer/AdamOptimizer/RMSPropOptimizer create new variables and put them into the GLOBAL_VARIABLES collection. However, these variables in fact belong to the local models: the local worker updates both the local model and these newly created variables.
I'm not sure I follow. I think this change is trying to do too much, and this will be fragile unless this optimizer is used in precisely the right way.
I'd like you to find less intrusive approaches to achieve this, or more invasive approaches that prescribe the code to be written in a certain way for elastic averaging to work.
Since this only concerns the variables created by apply_gradients, I'm considering taking care of just the incremental part instead of operating on the whole GLOBAL_VARIABLES collection. How does this sound to you?
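The "incremental" idea above could be sketched like this (a framework-free Python sketch; the function and collection names are hypothetical stand-ins for the TF implementation, with strings standing in for variables):

```python
# Plain-Python sketch of the incremental approach: snapshot the global
# collection before apply_gradients, then move only the variables that
# the call added into the local collection.
# (GLOBAL_VARIABLES/LOCAL_VARIABLES here are plain lists mimicking
# TensorFlow collections; this is an illustration, not TF code.)

def adjust_optimizer_variable_collection(global_vars, local_vars, apply_gradients):
    """Run apply_gradients, then relocate any newly created variables."""
    before = set(id(v) for v in global_vars)   # snapshot before the call
    result = apply_gradients()                 # may append new variables
    new_vars = [v for v in global_vars if id(v) not in before]
    for v in new_vars:                         # move only the increment
        global_vars.remove(v)
        local_vars.append(v)
    return result

# Tiny usage demo with strings standing in for variables:
global_vars = ["model/w", "model/b", "global_step"]
local_vars = []

def fake_apply_gradients():
    # e.g. an Adam-like optimizer creating slot variables
    global_vars.extend(["model/w/Adam", "model/b/Adam"])
    return "train_op"

adjust_optimizer_variable_collection(global_vars, local_vars, fake_apply_gradients)
print(local_vars)   # ['model/w/Adam', 'model/b/Adam']
print(global_vars)  # ['model/w', 'model/b', 'global_step']
```

Pre-existing globals (including global_step) are left untouched, which is the point of handling only the increment.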
  communication_period=10,
  moving_rate=None,
- rho=None,
+ rho=0.0,
Why change the default value here?
- in async mode, rho is not used in the original paper
- in my experiments, rho=0 works well

When rho=None, the default value self._rho = self._moving_rate / self._opt._learning_rate may result in a large value. E.g., moving_rate = self.BETA / communication_period / num_worker = 0.9 / 8 / 4 = 0.028125, while the initial learning_rate might be 1e-4, which makes rho = 281.25, and this makes the workers unable to perform exploitation.
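The arithmetic above, spelled out (plain Python; the constant names mirror the ones in the comment):

```python
# Reproducing the arithmetic from the comment above: the rho that falls
# out of the old default (moving_rate / learning_rate) can be huge.
BETA = 0.9                    # exploration constant from the comment
communication_period = 8
num_worker = 4

moving_rate = BETA / communication_period / num_worker   # ≈ 0.028125

learning_rate = 1e-4          # a typical small initial learning rate
rho = moving_rate / learning_rate                        # ≈ 281.25
```

With rho this large, the penalty term dominates and workers cannot explore away from the center variable, which is why rho=0.0 is proposed as the async-mode default.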
Regardless, we shouldn't change default values like this, as it affects the reproducibility of existing work which uses this code.
rho: the amount of exploration we allow in the model. The default
  value is moving_rate/learning_rate.
use_locking: If True, use locks for update operations.
sync_flag: Add_sync_queues_and_barrier or not, default to False, in case of
Explain what sync_queues_and_barrier is, and when people should set this to true or false
True: all workers will wait for each other before starting training.
False: a worker can start training as soon as its own initialization is done; there is no need to wait for everyone to be ready. In case one worker is restarted, it can rejoin and continue training without being blocked.
Will update these in the code.
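The two behaviors could be illustrated with a plain-Python analogy, using threading.Barrier in place of TF's sync queues (an analogy only, not the actual implementation):

```python
# Illustrating the sync_flag semantics with threads: with the flag set,
# no worker "starts training" until every worker has initialized; with
# it unset, each worker starts as soon as its own init is done.
import threading
import time

def run_workers(sync_flag, num_workers=3):
    barrier = threading.Barrier(num_workers)
    started = []

    def worker(i, init_delay):
        time.sleep(init_delay)          # simulate initialization time
        if sync_flag:
            barrier.wait()              # wait for all workers (barrier)
        started.append(i)               # "start training"

    threads = [threading.Thread(target=worker, args=(i, 0.01 * i))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return started

# In both modes every worker eventually trains; the flag only controls
# whether anyone may start before all are ready.
synced = run_workers(sync_flag=True)
unsynced = run_workers(sync_flag=False)
```

The restart scenario from the comment corresponds to sync_flag=False: a rejoining worker never blocks on the barrier.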
has_global_step = False
for key, var in var_list.items():
    tensor = var
    if False == has_global_step\
"not has_global_step" instead of "False == has_global_step"
Also split lines with parenthesis instead of \
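The suggested style fix, shown as a hypothetical before/after on a loop like the one above:

```python
# Before (the style the reviewer flagged): an explicit comparison
# against False, and a backslash line continuation.
def find_global_step_before(var_list, GLOBAL_STEP="global_step"):
    has_global_step = False
    for key, var in var_list.items():
        if False == has_global_step\
           and GLOBAL_STEP in key.split('/'):
            has_global_step = True
    return has_global_step

# After (the suggested style): "not has_global_step", and parentheses
# instead of a trailing backslash.
def find_global_step_after(var_list, GLOBAL_STEP="global_step"):
    has_global_step = False
    for key, var in var_list.items():
        if (not has_global_step
                and GLOBAL_STEP in key.split('/')):
            has_global_step = True
    return has_global_step

var_list = {"model/w": 1, "global_step": 2}
print(find_global_step_before(var_list))  # True
print(find_global_step_after(var_list))   # True
```

Both versions behave identically; the second is simply idiomatic PEP 8 style.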
    and GLOBAL_STEP in key.split('/'):
    has_global_step = True

if isinstance(var, list) == False:
    swapped_var_list[key] = tensor

# find global_step and add it if missing
if False == has_global_step:
    var_list = saver.BaseSaverBuilder.OpListToDict(var_list)

swapped_var_list = {}
has_global_step = False
Why is it so important to care about the global step here? Should it be affected by the optimizer?
In the single-GPU version, global_step is saved when saving a default model, and global_step is needed by the code when restoring a model with the single-GPU version. Considering global_step is only 8 bytes, I added it to avoid potential problems for users when restoring a model.
Yes, but I still do not understand why the global step has to be copied over in this case.
OK, let me remove it; users can still add it explicitly.
        return local_var
    else:
        return getter(name, trainable, collections, *args, **kwargs)

# 1. default to LOCAL_VARIABLES (instead of GLOBAL_VARIABLES)
This policy for variable placement seems really complicated and error-prone. Why is it being done?
Assuming user code was like this in the single-GPU version:

    with tf.variable_scope('model'):
        model = UserModel(...)

when using ElasticAverageOptimizer, the code will be changed to the following:

    with tf.device(tf.train.replica_device_setter(...)), \
         tf.variable_scope('model', custom_getter=custom_getter):
        model = UserModel(...)

Workers train their own local models and communicate with the PS to update the 'global_center_variable' model. In the single-GPU version it is fine for variables to default to GLOBAL_VARIABLES; in the distributed version, however, they logically belong to the local workers, so I move them to LOCAL_VARIABLES.
global_step is, of course, an exception, as it is shared across workers.
In case users have variables similar to global_step, they can put those variables into the GLOBAL_SHARE_VARS collection to make them 'global'.
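A minimal, framework-free sketch of the routing policy just described (the collection names mirror TensorFlow's GraphKeys, but this is the decision logic only, not the actual custom_getter):

```python
# Plain-Python sketch of the collection-routing policy: given a variable
# name and its requested collections, decide where it should land.
# GLOBAL_SHARE_VARS is the extra collection proposed in this PR.
GLOBAL_VARIABLES = "variables"
LOCAL_VARIABLES = "local_variables"
GLOBAL_SHARE_VARS = "global_share_vars"
GLOBAL_STEP = "global_step"

def route_collections(name, collections):
    """Return the collections a variable should be placed in."""
    # 1. no collections requested -> default to LOCAL_VARIABLES
    if not collections:
        return [LOCAL_VARIABLES]
    # 2. explicitly shared variables stay global
    if GLOBAL_SHARE_VARS in collections:
        return list(collections)
    # 3. global_step must remain global (shared across workers)
    if GLOBAL_STEP in name.split('/'):
        return list(collections)
    # everything else that asked for GLOBAL_VARIABLES becomes local
    return [LOCAL_VARIABLES if c == GLOBAL_VARIABLES else c
            for c in collections]

print(route_collections("model/w", None))
# ['local_variables']
print(route_collections("global_step", [GLOBAL_VARIABLES]))
# ['variables']
print(route_collections("model/stats", [GLOBAL_VARIABLES, GLOBAL_SHARE_VARS]))
# ['variables', 'global_share_vars']
```

As the reviewer notes below, collection membership does not by itself change device placement under replica_device_setter, which is why this part was ultimately removed.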
Right. But setting these variables to LOCAL_VARIABLES will not affect their placement by a replica_device_setter, I think, so I don't understand what is being done here.
Accepted; I'll remove this part.
Verified on 4 different models; works well.
I don't see it removed yet.

Sorry, just pushed the change.
# 2. put to global if explicitly defined (GLOBAL_SHARE_VARS)
# 3. other GLOBAL_VARIABLES put to LOCAL_VARIABLES
#    except global_step, which must be global
if collections is None or len(collections) == 0:
replace with "if not collections"
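A quick check that `not collections` covers both conditions in the longer test:

```python
# The longer check and the idiomatic one agree for None, the empty
# list, and non-empty lists.
def old_check(collections):
    return collections is None or len(collections) == 0

def new_check(collections):
    return not collections

for value in (None, [], ["variables"], ["a", "b"]):
    assert old_check(value) == new_check(value)
```

One caveat: `not collections` is also True for other falsy values such as an empty tuple, which is harmless (and arguably desirable) here.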
  value is moving_rate/learning_rate.
  rho=0.0 is suggested in async mode.
use_locking: If True, use locks for update operations.
sync_flag: Add_sync_queues_and_barrier or not.
Rename sync_flag to just "synchronous"?
g = ops.get_default_graph()
idx = 0
for _ in range(len(g._collections[ops.GraphKeys.GLOBAL_VARIABLES])):
    var = g._collections[ops.GraphKeys.GLOBAL_VARIABLES][idx]
Do not use _collections, use get_collection_ref instead.
However mutating these collections after the fact might not have the effect you want. Perhaps it's best to tell the optimizers to create local variables instead?
Changed to get_collection_ref, and verified with MomentumOptimizer/AdamOptimizer/RMSPropOptimizer.
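The distinction matters because, in TF 1.x, `Graph.get_collection` returns a copy of the collection list while `get_collection_ref` returns the underlying list itself, so only mutations through the latter affect the graph. A plain-Python mock illustrates this (the `MockGraph` class is illustrative, not the TF API):

```python
# Mocking collection semantics: get_collection hands back a copy, so
# removing from it does nothing to the graph; get_collection_ref hands
# back the actual list, so removals take effect.
class MockGraph:
    def __init__(self):
        self._collections = {"variables": ["w", "b", "w/Adam"]}

    def get_collection(self, name):
        return list(self._collections.get(name, []))   # a copy

    def get_collection_ref(self, name):
        return self._collections.setdefault(name, [])  # the list itself

g = MockGraph()

copy = g.get_collection("variables")
copy.remove("w/Adam")                  # mutates only the copy
print(g._collections["variables"])     # ['w', 'b', 'w/Adam']

ref = g.get_collection_ref("variables")
ref.remove("w/Adam")                   # mutates the real collection
print(g._collections["variables"])     # ['w', 'b']
```

This is also why the reviewer warns that mutating collections after the fact may not do what you want: other code may already hold copies taken earlier.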
@drpngx, I can't see the 'Details' of the feedback/copybara failure. Is there anything I should handle for this?

Somehow the test is creating a directory somewhere. We should be using

Thanks, that's for the checkpoint unit test; I will change it.

@drpngx Ubuntu contrib: Windows Bazel GPU:

Hi @drpngx, the feedback/copybara test passes now. ERROR: T:/src/github/tensorflow/tensorflow/core/BUILD:2911:1: C++ compilation of rule '//tensorflow/core:device_tracer' failed (Exit 2): msvc_wrapper_for_nvcc.bat failed: error executing command

We're testing internally. Let's see what happens.

Great! Thanks!

OK, it was pushed internally; it should appear soon.
PiperOrigin-RevId: 209198285
The old implementation of the elastic average optimizer does not support partitioned variables, nor optimizers like AdamOptimizer which create global variables such as beta_1 and beta_2.
Main changes:
As it turns out, the global model is better than the local workers' models. To allow the chief worker to save the global values, mapping logic is added.
Verifications I have done: CIFAR-10 on AlexNet/MobileNetV2 and other internal models.
All results show better performance and precision.
Related Item
#12472: Feature request: EASGD
@ali01, @jinxin0924, @tmulc18 Could you kindly help review the changes?