question about gradient calculation with respect to weight #2

Closed
jay2002 opened this issue Jul 21, 2017 · 4 comments

Comments

@jay2002

jay2002 commented Jul 21, 2017

In Margin Inner Product, the gradient with respect to weight is very simple:

  // Gradient with respect to weight
  if (this->param_propagate_down_[0]) {
    caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans, N_, K_, M_, (Dtype)1.,
        top_diff, bottom_data, (Dtype)1., this->blobs_[0]->mutable_cpu_diff());
  }
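
If I read the GEMM correctly (my notation, not taken from the repository): with top_diff of shape M_ × N_ and bottom_data of shape M_ × K_, it just accumulates the standard inner-product gradient

  \frac{\partial L}{\partial W} \mathrel{+}= \Big(\frac{\partial L}{\partial f}\Big)^{\!\top} X,
  \qquad
  \frac{\partial L}{\partial W_{nk}} \mathrel{+}= \sum_{m=1}^{M} \frac{\partial L}{\partial f_{mn}}\, x_{mk},

where f = X W^T is the layer output and the sum runs over the M_ samples in the batch.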

But in the large margin softmax layer, the gradient calculation is much more complex...

Can you please tell me how to simplify the gradient calculation?
I failed to derive it...

@YYuanAnyVision

Same question here. Why is the update of the weight the same for m = 1, 2, 3, 4?

@YYuanAnyVision

Also, since you normalize the weight by overwriting it in place in Forward(), instead of keeping the original weight, as here:

  Dtype* norm_weight = this->blobs_[0]->mutable_cpu_data();
  Dtype temp_norm = (Dtype)0.;
  for (int i = 0; i < N_; i++) {
    temp_norm = caffe_cpu_dot(K_, norm_weight + i * K_, norm_weight + i * K_);
    temp_norm = (Dtype)1. / sqrt(temp_norm);
    caffe_scal(K_, temp_norm, norm_weight + i * K_);
  }

So every time after the parameter update (weight = weight + grad_w), the weight is simply clipped back to the normalized version in the next forward pass?
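
To make that concrete (my own notation, not from the repository), writing \Delta w_i for the solver's update to weight row w_i, the effective update would be

  w_i \;\leftarrow\; \frac{w_i + \Delta w_i}{\lVert w_i + \Delta w_i \rVert_2},

i.e. a step followed by a projection back onto the unit sphere at the next Forward() call.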

@tornadomeet

+1, the backward pass does not seem to correspond to the forward pass.

@wy1iu
Owner

wy1iu commented Jul 31, 2017

It is actually a normalized version of the gradient, which helps the optimization converge more stably. The direction is the same as before; what we do here is simply rescale the gradient (the learning rate can help us decide the scale). A similar idea and intuition also appear in https://arxiv.org/pdf/1707.04822.pdf.

However, if you use the original gradient to do the backprop, you can still make it work and obtain similar results, but it may not be as stable as the normalized one.
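
As a rough sketch of what such a per-row rescaling could look like (not the exact code in this repository; it just reuses the caffe_cpu_dot / caffe_scal helpers quoted above):

  // Sketch only: rescale each row of the accumulated N_ x K_ weight gradient
  // to unit L2 norm, keeping its direction; the learning rate sets the step size.
  Dtype* weight_diff = this->blobs_[0]->mutable_cpu_diff();
  for (int i = 0; i < N_; i++) {
    Dtype sq_norm = caffe_cpu_dot(K_, weight_diff + i * K_, weight_diff + i * K_);
    if (sq_norm > (Dtype)0.) {
      caffe_scal(K_, (Dtype)1. / sqrt(sq_norm), weight_diff + i * K_);
    }
  }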
