Weight normalization #4

Open
GendalfSeriyy opened this issue May 19, 2020 · 4 comments

Comments

@GendalfSeriyy

Hello!
I found that without weight normalization the network stops learning and the loss becomes NaN. Could you please explain why this happens and how it can be fixed?

@yhhhli
Owner

yhhhli commented May 20, 2020

Hi,

When we first tested our algorithm without weight normalization, we also ran into that problem. It seems that the gradients of the clipping parameter in several layers suddenly explode. We then tried using small learning rates for the clipping parameter, but the performance was not good.

We think the problem is that the distribution of the weights changes significantly during training, and there is no heuristic that can tell when to increase the LR (to accommodate the shift in the weight distribution) or when to decrease it (to stabilize training). Therefore, we came up with a method to normalize the weights. Weight normalization is inspired by Batch Normalization on activations, because we find that learning the clipping parameter for activation quantization does not suffer from the NaN issue.
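To make the idea concrete, here is a minimal PyTorch-style sketch of standardizing the real-valued weights before they are clipped and quantized. The quantizer below is a plain uniform fake-quantizer stub (APoT would use additive power-of-two levels instead), the function names are illustrative rather than this repository's actual API, and the straight-through estimator needed for backprop is omitted:

```python
import torch

def normalize_weight(w, eps=1e-7):
    # Standardize the real-valued weights so the learnable clipping
    # threshold always sees a zero-mean, unit-variance distribution,
    # regardless of how the raw weights drift during training.
    return (w - w.mean()) / (w.std() + eps)

def clip_and_fake_quant(w, alpha, bits=4):
    # Stub uniform fake-quantizer: clip to [-alpha, alpha] and round
    # onto a symmetric grid (APoT uses power-of-two levels instead).
    w = torch.max(torch.min(w, alpha), -alpha)
    scale = (2 ** (bits - 1) - 1) / alpha
    return torch.round(w * scale) / scale

# Usage: normalize first, then clip/quantize with the learnable alpha.
w_real = torch.randn(64, 3, 3, 3)             # e.g. a conv layer's weights
alpha = torch.tensor(3.0, requires_grad=True)
w_q = clip_and_fake_quant(normalize_weight(w_real), alpha)
```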

@GendalfSeriyy
Author

Thanks for the answer!
We also noticed that when the first and last layers are quantized to the same precision as the other layers, the network also fails to learn. To be more precise, the network trains for a certain number of epochs, but then the accuracy drops to 10% and no longer improves. Have you carried out such experiments?

@yhhhli
Owner

yhhhli commented Jun 2, 2020

I think weight normalization cannot be applied to the last layer, because the output of the last layer is the output of the network and there is no BN to standardize its distribution. For the last layer, you could apply the DoReFa scheme to quantize the weights and our APoT quantization for the activations.
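For reference, a minimal sketch of the DoReFa-style weight quantization mentioned above is shown below. This is not code from this repository, and the straight-through estimator used for backpropagation is omitted:

```python
import torch

def dorefa_quantize_weight(w, bits=8):
    # DoReFa-Net weight quantization: squash weights into [0, 1] with tanh,
    # quantize uniformly to 2^bits - 1 levels, then map back to [-1, 1].
    levels = 2 ** bits - 1
    w_tanh = torch.tanh(w)
    w01 = w_tanh / (2 * w_tanh.abs().max()) + 0.5   # into [0, 1]
    w01_q = torch.round(w01 * levels) / levels      # uniform quantization
    return 2 * w01_q - 1                            # back to [-1, 1]

# Example: quantize a hypothetical final fully-connected layer's weights.
w_fc = torch.randn(1000, 512)
w_fc_q = dorefa_quantize_weight(w_fc, bits=8)
```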

@wu-hai

wu-hai commented Nov 29, 2021

Thanks for the great work and the clarification on the weight_norm! I want to ask: after applying weight normalization to the real-valued weights, should the lr for \alpha be the same as for the weights, or should the lr and weight_decay for \alpha be adjusted, like the settings in your commented code (`# customize the lr and wd for clipping thresholds`)?
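For context, giving the clipping thresholds their own lr and weight_decay is usually done with optimizer parameter groups, roughly as in the sketch below. The parameter-name filter (`'alpha'`) and the specific values are assumptions for illustration, not the repository's actual settings:

```python
import torch

def build_optimizer(model, base_lr=0.1, base_wd=1e-4,
                    alpha_lr=0.01, alpha_wd=0.0):
    # Split parameters into regular weights and clipping thresholds so the
    # thresholds (assumed to contain 'alpha' in their name) get their own
    # learning rate and weight decay.
    alpha_params = [p for n, p in model.named_parameters() if 'alpha' in n]
    other_params = [p for n, p in model.named_parameters() if 'alpha' not in n]
    return torch.optim.SGD(
        [{'params': other_params, 'lr': base_lr, 'weight_decay': base_wd},
         {'params': alpha_params, 'lr': alpha_lr, 'weight_decay': alpha_wd}],
        lr=base_lr, momentum=0.9)

# Example usage with any nn.Module whose clipping thresholds contain 'alpha':
# optimizer = build_optimizer(model)
```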
