
Skip allreduce if array is not updated #510

Merged
merged 7 commits into master from feature/20190812-skip-all-reduce-if-zeroing on Sep 11, 2019

Conversation

@TE-TakuyaNarihira
Contributor

TE-TakuyaNarihira commented Sep 1, 2019

Coupled with sony/nnabla-ext-cuda#181, this PR changes the behavior of allreduce by skipping the operation when arrays have not been updated after calling .zero(). The advantages of this change are:

  • Saves redundant communication and computation for the allreduce of zeroed arrays.
  • Fixes an undesired behavior in the solver's update & weight decay: update & weight decay are applied only to gradient arrays that have been updated, but the previous implementation always executed allreduce, which needlessly modified the zeroed arrays and caused the solver's operations to run when they should not. A minimal sketch of the skip follows this list; the behavior itself is elaborated below.
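As a rough illustration, here is a minimal Python-level sketch of the intended behavior. The actual skip is implemented inside the communicator (together with sony/nnabla-ext-cuda#181), and `_grad_was_updated` below is a hypothetical placeholder for the internal flag that `.zero()` sets and any later write clears; it is not a real nnabla API.

```python
def _grad_was_updated(grad):
    # Placeholder only: returning True reproduces the old behavior
    # (always reduce). The real implementation consults the array's
    # zeroed ("zeroing") state inside the communicator.
    return True


def all_reduce_grads(comm, params, division=False, inplace=True):
    # Gradients still in the zeroed state (never written after zero())
    # carry no information, so reducing them is pure overhead and, worse,
    # marks them as "updated" from the solver's point of view. Skip them.
    grads = [v.grad for v in params.values() if _grad_was_updated(v.grad)]
    if grads:
        comm.all_reduce(grads, division=division, inplace=inplace)
```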

Suppose that some parameter variables in a graph do not require gradients (i.e., need_grad=False). The user expects those parameters to stay fixed during training. Suppose also that the gradients are all-reduced before weight decay is applied (weight_decay(wd)).

In the previous implementation, allreduce was performed regardless of the need_grad flags in the graph. Gradients that are not computed during backward are left in the zeroed state (zeroing() == True) before allreduce, but allreduce writes into them, so solver.weight_decay considers them updated. Weight decay then makes those gradients non-zero, and update() modifies the supposedly fixed parameters unintentionally.

One may think this can be handled by passing the parameter variables obtained by nn.get_parameters(grad_only=True) to the allreduce call, since it returns only variables that require gradient computation (need_grad=True). It can't, because nn.get_parameters returns the globally registered parameters, whose need_grad flags are fixed to the defaults determined by the parametric function implementations (e.g., convolution weights and biases are always True, while the running mean and variance of batch normalization are always False). Even if you create a computation graph with modified need_grad flags (e.g., by using the fix_parameters option of a parametric function), the change does not affect the variables returned by nn.get_parameters.

We may have to add a way to obtain the trainable parameters from a computation graph to guarantee that fixed parameters are not updated during training, but that is a design issue. At this moment, I fix it simply by skipping allreduce when gradients are not computed during backward. A hypothetical repro of the problematic sequence is sketched below.
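The following is a hypothetical single-process illustration of that sequence; the setup, the `fixed_affine` scope name, and the weight-decay coefficient are illustrative assumptions rather than code from this PR. Backward leaves the fixed parameters' gradients in the zeroed state, and the solver leaves the parameters alone as long as nothing (such as an unconditional allreduce) writes into those gradients.

```python
import numpy as np
import nnabla as nn
import nnabla.functions as F
import nnabla.parametric_functions as PF
import nnabla.solvers as S

x = nn.Variable.from_numpy_array(np.random.randn(8, 16).astype(np.float32))
t = nn.Variable.from_numpy_array(np.random.randn(8, 10).astype(np.float32))
# fix_parameters=True builds the graph with need_grad=False on the affine
# parameters, so backward never writes their gradients.
y = PF.affine(x, 10, fix_parameters=True, name="fixed_affine")
loss = F.mean(F.squared_error(y, t))

params = nn.get_parameters()      # still contains the "fixed" W and b
solver = S.Sgd(lr=0.1)
solver.set_parameters(params)

for p in params.values():
    p.grad.zero()                 # gradients enter the lazily-zeroed state
loss.forward()
loss.backward()                   # fixed parameters' gradients stay untouched

w = next(iter(params.values()))   # the affine weight W
w_before = w.d.copy()

# Weight decay and update are applied only to gradients written after zero(),
# so without a distributed allreduce in between, W is left alone here.
solver.weight_decay(0.01)
solver.update()
assert np.allclose(w_before, w.d)

# Before this PR, an unconditional comm.all_reduce([v.grad for v in
# params.values()]) placed between backward() and weight_decay() wrote into
# the zeroed gradients, the solver then saw them as updated, weight decay
# made them non-zero, and update() moved the supposedly fixed W. With this
# PR, allreduce skips gradients still in the zeroed state, so W stays fixed.
```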

@TE-TakuyaNarihira TE-TakuyaNarihira force-pushed the feature/20190812-skip-all-reduce-if-zeroing branch from 2ff0c01 to 0ac6a7c Sep 10, 2019
@TE-AkioHayakawa TE-AkioHayakawa merged commit 92371ad into master Sep 11, 2019
@TE-TakuyaNarihira TE-TakuyaNarihira changed the title Feature/20190812 skip all reduce if zeroing Skip allreduce if array is not updated Sep 11, 2019