
elastic averaging SGD update: support partitioner & more optimizers#21486

Merged
tensorflow-copybara merged 6 commits into tensorflow:master from weidankong:elastic_average
Aug 17, 2018

Conversation

@weidankong
Contributor

The old implementation of the elastic average optimizer does not support partitioned variables, nor optimizers like AdamOptimizer that create global variables such as beta_1 and beta_2.

Main changes:

  1. supports partitioned variables
  2. supports more optimizers, e.g., Adam/RMSProp
  3. adds a saver to save the values in the global_center_variable.
    As it turns out, the global model is better than the local workers'. To allow the chief worker to save the global values, a mapping logic is added.

Verifications I have done: CIFAR-10 on AlexNet/MobileNetV2 and other internal models.
All results show better performance and precision.
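A minimal plain-Python sketch of the saver mapping in change 3 (the variable names and the `global_center_variable/` prefix are illustrative assumptions, not necessarily the PR's actual naming): the checkpoint keeps the ordinary variable names, but each key is pointed at the global-center copy before saving, so restoring the global model with single-GPU code works unchanged.

```python
# Plain-Python sketch of the checkpoint name-mapping idea (hypothetical names).
# Checkpoint keys stay the ordinary local variable names, but each key is
# pointed at the corresponding global_center copy before saving.
local_vars = ["model/conv1/kernel", "model/conv1/bias", "model/fc/weights"]

def global_center_name(local_name):
    # Assumed naming convention for the global-center copy of a local variable.
    return "global_center_variable/" + local_name

# checkpoint key -> name of the tensor actually saved under that key
swapped_var_list = {name: global_center_name(name) for name in local_vars}

print(swapped_var_list["model/fc/weights"])
```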

Related item
#12472: Feature request: EASGD
@ali01, @jinxin0924, @tmulc18 Could you kindly help review the changes?

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

@weidankong
Contributor Author

I signed the CLA just now, thanks!

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

@weidankong
Contributor Author

I signed it! Added the email address from the commit log to the CLA email list.

@googlebot

CLAs look good, thanks!

@googlebot googlebot added cla: yes and removed cla: no labels Aug 8, 2018
@drpngx drpngx requested a review from alextp August 9, 2018 01:03
@drpngx drpngx self-assigned this Aug 9, 2018
@drpngx drpngx added the awaiting review Pull request awaiting review label Aug 9, 2018

# Place the variables created by the optimizer into the local collection;
# e.g., AdamOptimizer creates its beta values as global variables.
def _adjust_optimizer_variable_collection():
Contributor

Why is this necessary?

Contributor Author

When apply_gradients is called, optimizers like MomentumOptimizer/AdamOptimizer/RMSPropOptimizer create new variables and put them into the GLOBAL_VARIABLES collection. However, these variables in fact belong to the local model: the local worker updates both the local model and these newly created variables.

Contributor

I'm not sure I follow. I think this change is trying to do too much, and this will be fragile unless this optimizer is used in precisely the right way.

I'd like you to find less intrusive approaches to achieve this, or more invasive approaches that prescribe the code to be written in a certain way for elastic averaging to work.

Contributor Author

Since this only concerns the variables created by apply_gradients, I'm considering handling only the incremental part, instead of doing it on the whole GLOBAL_VARIABLES collection. How does this sound to you?
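The incremental approach proposed here can be sketched in plain Python (the collection contents and slot-variable names are stand-ins; the real code operates on the graph's GLOBAL_VARIABLES and LOCAL_VARIABLES collections):

```python
# Simulate: snapshot GLOBAL_VARIABLES before apply_gradients, then move only
# the variables the wrapped optimizer created (e.g. Adam's slots) to the
# local collection, leaving pre-existing globals untouched.
global_vars = ["model/w", "model/b"]      # stand-in for GLOBAL_VARIABLES
local_vars = []                           # stand-in for LOCAL_VARIABLES

snapshot = set(global_vars)               # taken just before apply_gradients

# apply_gradients side effect: optimizer slot variables land in the global list
global_vars += ["model/w/Adam", "model/b/Adam", "beta1_power", "beta2_power"]

new_vars = [v for v in global_vars if v not in snapshot]
for v in new_vars:                        # move only the incremental part
    global_vars.remove(v)
    local_vars.append(v)

print(sorted(local_vars))
```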

Contributor

Yes, this is better.

communication_period=10,
moving_rate=None,
rho=None,
rho=0.0,
Contributor

Why change the default value here?

Contributor Author

  1. in async mode, rho is not used in the original paper
  2. in my experiments, rho=0 works well
    When rho=None, the default value self._rho = self._moving_rate / self._opt._learning_rate may result in a large value. E.g., moving_rate = self.BETA / communication_period / num_worker = 0.9/8/4 = 0.028125, while the initial learning_rate might be 1e-4, which makes rho = 281.25, and this makes the workers unable to perform exploitation.
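The arithmetic above checks out; reproducing it:

```python
# Reproducing the comment's arithmetic: with the old default
# rho = moving_rate / learning_rate, a small learning rate inflates rho.
BETA = 0.9
communication_period = 8
num_worker = 4
learning_rate = 1e-4

moving_rate = BETA / communication_period / num_worker   # 0.028125
rho = moving_rate / learning_rate                        # 281.25
print(moving_rate, rho)
```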

Contributor

Regardless, we shouldn't change default values like this, as it affects the reproducibility of existing work which uses this code.

rho: the amount of exploration we allow in the model. The default
value is moving_rate/learning_rate
use_locking: If True use locks for update operations.
sync_flag: Add_sync_queues_and_barrier or not, default to False, in case of
Contributor

Explain what sync_queues_and_barrier is, and when people should set this to true or false

Contributor Author

True: all workers will wait for each other before starting training.
False: a worker can start training as soon as its initialization is done; there is no need to wait for everyone to be ready. In case a worker is restarted, it can join and continue training without being blocked.

Will add these to the code.

has_global_step = False
for key, var in var_list.items():
tensor = var
if False == has_global_step\
Contributor

"not has_global_step" instead of "False == has_global_step"

Also split lines with parenthesis instead of \

and GLOBAL_STEP in key.split('/'):
has_global_step = True

if isinstance(var, list) == False:
Contributor

"if not isinstance(.."

swapped_var_list[key] = tensor

# find global_step and add it if missing
if False == has_global_step:
Contributor

"if not has_global_step"

var_list = saver.BaseSaverBuilder.OpListToDict(var_list)

swapped_var_list = {}
has_global_step = False
Contributor

Why is it so important to care about the global step here? Should it be affected by the optimizer?

Contributor Author

In the single-GPU version, when saving a default model, global_step is saved.
When restoring a model with the single-GPU version code, global_step is needed.
And considering global_step is only 8 bytes, I added it to avoid potential user problems when restoring the model.

Contributor

Yes but I still do not understand why the global step has to be copied over in this case.

Contributor Author

OK, let me remove it; users can still add it explicitly.

return local_var
else:
return getter(name, trainable, collections, *args, **kwargs)
# 1. default to LOCAL_VARIABLES (instead of GLOBAL_VARIABLES)
Contributor

This policy for variable placement seems really complicated and error-prone. Why is it being done?

Contributor Author

Assuming user code was like this in the single-GPU version:

with tf.variable_scope('model'):
    model = UserModel(...)

When using ElasticAverageOptimizer, the code will be changed to the following:

with tf.device(tf.train.replica_device_setter(...)), \
        tf.variable_scope('model', custom_getter=custom_getter):
    model = UserModel(...)

Workers train their own local model and communicate with the PS to update the 'global_center_variable' model. In the single-GPU version, defaulting variables to GLOBAL_VARIABLES is fine; however, in the distributed version they logically belong to the local workers, so I moved them to LOCAL_VARIABLES.
global_step is, of course, an exception, as it is shared across workers.
In case users have variables similar to global_step, they can put them into the GLOBAL_SHARE_VARS collection to make them 'global'.
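A plain-Python sketch of the routing rule being described (the real code is a `variable_scope` custom getter; the `GLOBAL_SHARE_VARS` contents and the name matching here are simplified assumptions):

```python
# Simplified stand-in for the custom getter's collection routing:
# 1. respect explicitly requested collections;
# 2. keep global_step (and anything in GLOBAL_SHARE_VARS) global;
# 3. default everything else to LOCAL_VARIABLES.
GLOBAL_SHARE_VARS = {"shared/embedding"}   # hypothetical user-marked variable

def target_collections(name, collections=None):
    if collections:
        return collections
    if name == "global_step" or name in GLOBAL_SHARE_VARS:
        return ["GLOBAL_VARIABLES"]        # shared across workers
    return ["LOCAL_VARIABLES"]             # per-worker by default

print(target_collections("model/dense/kernel"))   # ['LOCAL_VARIABLES']
print(target_collections("global_step"))          # ['GLOBAL_VARIABLES']
```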

Contributor

Right. But setting these variables to LOCAL_VARIABLES will not affect their placement by a replica_device_setter, I think, so I don't understand what is being done here.

Contributor Author

Accepted; I will remove this part.
Verified on 4 different models; works well.

Contributor

I don't see it removed yet.

Contributor Author

Sorry, just pushed the change.

# 2. put to global if explicitly defined (GLOBAL_SHARE_VARS)
# 3. other GLOBAL_VARIABLES put to LOCAL_VARIABLES
# except global_step, which must be global
if collections is None or len(collections) == 0:
Contributor

replace with "if not collections"

value is moving_rate/learning_rate
rho=0.0 is suggested in async mode.
use_locking: If True use locks for update operations.
sync_flag: Add_sync_queues_and_barrier or not.
Contributor

Rename sync_flag to just "synchronous"?

g = ops.get_default_graph()
idx = 0
for _ in range(len(g._collections[ops.GraphKeys.GLOBAL_VARIABLES])):
var = g._collections[ops.GraphKeys.GLOBAL_VARIABLES][idx]
Contributor

Do not use _collections, use get_collection_ref instead.

However mutating these collections after the fact might not have the effect you want. Perhaps it's best to tell the optimizers to create local variables instead?

Contributor Author

Changed to get_collection_ref, and verified with MomentumOptimizer/AdamOptimizer/RMSPropOptimizer.
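The distinction the reviewer points at can be illustrated with a small stand-in class (the real methods are `tf.Graph.get_collection`, which returns a copy, and `tf.Graph.get_collection_ref`, which returns the live list backing the collection):

```python
# Stand-in illustrating why get_collection_ref is needed for in-place edits:
# get_collection returns a copy, so mutations through it are silently lost.
class FakeGraph:
    def __init__(self):
        self._collections = {"global_variables": ["w", "b"]}

    def get_collection(self, name):
        return list(self._collections[name])      # defensive copy

    def get_collection_ref(self, name):
        return self._collections[name]            # the live backing list

g = FakeGraph()
g.get_collection("global_variables").remove("b")  # edit on a copy: no effect
ref = g.get_collection_ref("global_variables")
ref.remove("b")                                   # edit on the live list: sticks

print(g.get_collection("global_variables"))       # ['w']
```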

@alextp alextp added the kokoro:force-run Tests on submitted change label Aug 14, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 14, 2018
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Aug 15, 2018
@alextp alextp added the kokoro:force-run Tests on submitted change label Aug 15, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 15, 2018
alextp
alextp previously approved these changes Aug 15, 2018
alextp
alextp previously approved these changes Aug 15, 2018
@alextp alextp added the kokoro:force-run Tests on submitted change label Aug 15, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 15, 2018
@weidankong
Contributor Author

@drpngx , I can't see the 'Details' of the feedback/copybara failure; is there anything I should handle here?

@drpngx
Contributor

drpngx commented Aug 16, 2018

Somehow the test is creating a directory somewhere.

PermissionDeniedError (see above for traceback): Localfile::CreateDir mkdir failed on model: Permission denied

We should be using self.get_temp_dir.
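The suggested fix, sketched with the stdlib (inside a `tf.test.TestCase`, `self.get_temp_dir()` plays the role `tempfile.mkdtemp()` plays here; the helper name is hypothetical):

```python
import os
import tempfile

def make_checkpoint_dir(base_dir):
    """Create the checkpoint directory under the test's temp dir,
    never as a bare relative path like 'model' in the working directory."""
    path = os.path.join(base_dir, "model")
    os.makedirs(path)
    return path

base = tempfile.mkdtemp()        # stand-in for self.get_temp_dir()
ckpt_dir = make_checkpoint_dir(base)
print(os.path.isdir(ckpt_dir))   # True
```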

@weidankong
Contributor Author

Thanks, that's for the checkpoint UT; I will change it.

@drpngx drpngx added stat:awaiting response Status - Awaiting response from author and removed ready to pull PR ready for merge process labels Aug 16, 2018
@alextp alextp added the kokoro:force-run Tests on submitted change label Aug 16, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 16, 2018
@weidankong
Contributor Author

@drpngx
I found these two errors; both passed in the previous run. I don't think they are related to my UT-fix commit.

Ubuntu contrib:
2018-08-16 19:22:30.637876: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
*** Error in `/opt/python3.6/bin/python3': corrupted double-linked list: 0x00007fcc60016a60 ***

Windows Bazel GPU:
ERROR: T:/src/github/tensorflow/tensorflow/core/BUILD:2911:1: C++ compilation of rule '//tensorflow/core:device_tracer' failed (Exit 2): msvc_wrapper_for_nvcc.bat failed: error executing command

@alextp alextp added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Aug 16, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Aug 16, 2018
@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Aug 17, 2018
@weidankong
Contributor Author

Hi @drpngx , the feedback/copybara test passes now.
I also checked the error from the 'Windows Bazel GPU' test below; it is not introduced by my last change. What should I do next?

ERROR: T:/src/github/tensorflow/tensorflow/core/BUILD:2911:1: C++ compilation of rule '//tensorflow/core:device_tracer' failed (Exit 2): msvc_wrapper_for_nvcc.bat failed: error executing command
.\tensorflow/core/platform/default/gpu/cupti_wrapper.h(24): fatal error C1083: Cannot open include file: 'extras/CUPTI/include/cupti.h': No such file or directory

@drpngx
Contributor

drpngx commented Aug 17, 2018

We're testing internally. Let's see what happens.

@weidankong
Contributor Author

Great! Thanks!

@drpngx
Contributor

drpngx commented Aug 17, 2018

OK, it was pushed internally, should appear soon.

@tensorflow-copybara tensorflow-copybara merged commit 7d9a839 into tensorflow:master Aug 17, 2018
tensorflow-copybara pushed a commit that referenced this pull request Aug 17, 2018
@weidankong weidankong deleted the elastic_average branch August 24, 2018 17:31
@weidankong weidankong restored the elastic_average branch August 24, 2018 17:33