Questions about the two Tf-KD methods #2

Closed · pecanjk opened this issue Sep 30, 2019 · 6 comments

pecanjk commented Sep 30, 2019

The first Tf-KD method is self-training, which is quite similar to the "Deep Mutual Learning" paper.

The second Tf-KD method is actually equal to LSR. If we let a = (1-alpha) + alpha/K and u = 1/K, your manually designed distribution equals LSR = (1-alpha)p(k) + alpha*u, where p(k) is the hard label.
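A quick numeric sketch of this claim (my own check, assuming Eq. (8) puts probability a on the correct class and spreads the remaining 1-a evenly over the other K-1 classes, and ignoring the temperature for now):

```python
import numpy as np

K, alpha, c = 10, 0.1, 3          # number of classes, smoothing weight, true-class index (arbitrary)

# LSR target of Eq. (1): q'(k) = (1 - alpha) * q(k) + alpha * u(k), with u(k) = 1/K
q = np.zeros(K); q[c] = 1.0       # one-hot hard label
lsr_target = (1 - alpha) * q + alpha / K

# Manually designed teacher: probability a on the correct class,
# the remaining 1 - a spread evenly over the other K - 1 classes
a = (1 - alpha) + alpha / K
teacher = np.full(K, (1 - a) / (K - 1)); teacher[c] = a

print(np.allclose(lsr_target, teacher))   # True (when no temperature is applied)
```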

yuanli2333 (Owner) commented Sep 30, 2019

Hi, the first one is a good question, but DML is not self-training; it is a totally different method.
Deep Mutual Learning (DML) utilizes an ensemble of students that learn collaboratively and teach each other during training. There is no pre-trained model teaching itself; all models in DML learn from each other during training. This is very clear in the DML paper:
[figure from the DML paper showing the mutual-learning setup]
As you can see, neither of the two models is pre-trained: \theta_1 and \theta_2 are random parameters rather than pre-trained ones, and they are updated from each other during training, which is not self-training. Additionally, the motivations of the two works are totally different. DML believes a small model can reach the same capability as a large model by relieving the difficulty of optimization. Our method is motivated by our observation and explanation that KD acts more like a regularization than the common belief suggests.

For the second question, please make the equation LSR = (1-alpha)p(k) + alpha*u clear, then I can understand what you mean. If you read our paper, we have illustrated the difference between our method and LSR. Of course, you can view our second method as a special LSR, because standard KD is also a special LSR (its regularization term comes from a teacher rather than a pre-designed distribution), as stated in our paper.


pecanjk commented Oct 10, 2019

Thanks for your detailed explanation.

LSR = (1-alpha)p(k) + alpha*u is Eq. (1) in your paper. LSR: q'(k) = (1-alpha)q(k) + alpha*u(k) (sorry for typing q as p). Here q(k) is the hard label, i.e., q(k) = 1 if k = c and q(k) = 0 otherwise.

Eq. (8) is your manually designed distribution. If we let a = (1-alpha) + alpha/K and u = 1/K in Eq. (8), then Eq. (8) equals Eq. (1).
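For reference, a sketch of the algebra behind this substitution (my own working, writing p^d for the distribution of Eq. (8), with value a on the correct class c and (1-a)/(K-1) on every other class):

```latex
p^d(c) = a = (1-\alpha) + \frac{\alpha}{K} = q'(c), \qquad
p^d(k) = \frac{1-a}{K-1}
       = \frac{\alpha - \alpha/K}{K-1}
       = \frac{\alpha(K-1)}{K(K-1)}
       = \frac{\alpha}{K} = q'(k) \quad (k \neq c).
```

So, without the temperature, the two distributions agree at every k.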

yuanli2333 (Owner) commented Nov 13, 2019

> Thanks for your detailed explanation.
>
> LSR = (1-alpha)p(k) + alpha*u is Eq. (1) in your paper. LSR: q'(k) = (1-alpha)q(k) + alpha*u(k) (sorry for typing q as p). Here q(k) is the hard label, i.e., q(k) = 1 if k = c and q(k) = 0 otherwise.
>
> Eq. (8) is your manually designed distribution. If we let a = (1-alpha) + alpha/K and u = 1/K in Eq. (8), then Eq. (8) equals Eq. (1).

(1) Eq. (8) will not be equal to Eq. (1) unless the temperature is 1.
(2) Even if we let a = (1-alpha) + alpha/K, Eq. (8) is not a uniform distribution, so it is not the same as LSR.
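A small sketch of point (1) (my own check, not code from this repository; it treats the log of the designed distribution as the virtual teacher's logits before re-applying a softmax, which is only one possible reading of how Eq. (9) applies the temperature):

```python
import numpy as np

def soften(p, T):
    """Soften a probability vector by treating log(p) as logits and
    re-applying a softmax at temperature T (an assumption about the
    softening in Eq. (9), not taken from the paper or the repo)."""
    z = np.log(p) / T
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

K, alpha, c = 10, 0.1, 3
a = (1 - alpha) + alpha / K

# LSR target from Eq. (1)
lsr = np.full(K, alpha / K); lsr[c] += 1 - alpha

# Hand-designed teacher from Eq. (8)
teacher = np.full(K, (1 - a) / (K - 1)); teacher[c] = a

for T in (1.0, 20.0):
    print(T, np.allclose(soften(teacher, T), lsr))   # True at T = 1, False at T = 20
```

With that reading, a temperature above 1 flattens the designed distribution toward uniform, so the softened teacher no longer matches the LSR target of Eq. (1).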

@lijing1996

@yuanli2333 @pecanjk
I still don't get the difference between the proposed Tf-KD_{reg} and LSR.
(1) As you said, Eq. (8) has a temperature. But even though the temperature softens the distribution in Eq. (9), the probabilities of the remaining classes are still equal and all probabilities still sum to 1. So even with the additional temperature, it can still be rewritten in the form of Eq. (1): it is still a mixture of a one-hot distribution and a uniform distribution. I can't see the difference.
(2) As in Eq. (2), LSR can be interpreted as a mixture of KL divergences against the one-hot distribution and the uniform distribution. So can the loss in Eq. (9): it can also be written in the form of Eq. (2), as a mixture of KL divergences against the one-hot distribution and the uniform distribution.
I can't see the difference. Is there a mistake in my reasoning? I'm looking forward to your explanation. Thanks a lot.

@yuanli2333 (Owner)

> @yuanli2333 @pecanjk
> I still don't get the difference between the proposed Tf-KD_{reg} and LSR.
> (1) As you said, Eq. (8) has a temperature. But even though the temperature softens the distribution in Eq. (9), the probabilities of the remaining classes are still equal and all probabilities still sum to 1. So even with the additional temperature, it can still be rewritten in the form of Eq. (1): it is still a mixture of a one-hot distribution and a uniform distribution. I can't see the difference.
> (2) As in Eq. (2), LSR can be interpreted as a mixture of KL divergences against the one-hot distribution and the uniform distribution. So can the loss in Eq. (9): it can also be written in the form of Eq. (2), as a mixture of KL divergences against the one-hot distribution and the uniform distribution.
> I can't see the difference. Is there a mistake in my reasoning? I'm looking forward to your explanation. Thanks a lot.

Hi, the simplest way to check the difference between LSR and Tf-KD_{reg} is to rewrite Eq. (2) or Eq. (9) and check whether one can be rewritten as the other.
You should get your hands dirty rewriting the two equations, and you will find that Eq. (9) cannot be rewritten as Eq. (2) while the temperature is present. But they are similar, because you can view our second method as a special LSR.

@JiyueWang

The first Tf-KD is just Born-Again Networks, right?
