Questions about the two Tf-KD methods #2

Closed · pecanjk opened this issue Sep 30, 2019 · 6 comments

pecanjk commented Sep 30, 2019

The first Tf-KD method is self-training, which is quite similar to the "Deep Mutual Learning" paper.

The second Tf-KD method is actually equal to LSR. If we let a = (1-alpha) + alpha/K and u = 1/K, your manually designed distribution equals LSR = (1-alpha)p(k) + alpha*u, where p(k) is the hard label.
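A quick numeric sketch of this claim (my own check, assuming Eq. (8) puts probability a on the correct class and spreads the remaining 1-a evenly over the other K-1 classes, and ignoring the temperature for now):

```python
import numpy as np

K, alpha, c = 10, 0.1, 3          # number of classes, smoothing weight, true-class index (arbitrary)

# LSR target of Eq. (1): q'(k) = (1 - alpha) * q(k) + alpha * u(k), with u(k) = 1/K
q = np.zeros(K); q[c] = 1.0       # one-hot hard label
lsr_target = (1 - alpha) * q + alpha / K

# Manually designed teacher: probability a on the correct class,
# the remaining 1 - a spread evenly over the other K - 1 classes
a = (1 - alpha) + alpha / K
teacher = np.full(K, (1 - a) / (K - 1)); teacher[c] = a

print(np.allclose(lsr_target, teacher))   # True (when no temperature is applied)
```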

yuanli2333 (Owner) commented Sep 30, 2019

Hi, the first one is a good question, but DML is not self-training; it is a totally different method.
Deep Mutual Learning (DML) utilizes an ensemble of students that learn collaboratively and teach each other during training. There is no pre-trained model teaching itself; all models in DML learn from each other during training. This is very clear in the DML paper:
[figure from the DML paper showing the mutual-learning setup]
As you can see, neither of the two models is pre-trained: \theta_1 and \theta_2 are random parameters rather than pre-trained ones, and they are updated from each other during training, which is not self-training. Additionally, the motivations of the two works are totally different. DML believes a small model can reach the same capability as a large model by relieving the difficulty of optimization. Our method is motivated by our observation and explanation that KD acts more like a regularization than the common belief suggests.

For the second question, please make the equation LSR = (1-alpha)p(k) + alpha*u clear, then I can understand what you mean. If you read our paper, we have illustrated the difference between our method and LSR. Of course, you can view our second method as a special LSR, because standard KD is also a special LSR (its regularization term comes from a teacher rather than a pre-designed distribution), as stated in our paper.


pecanjk commented Oct 10, 2019

Thanks for your detailed explanation.

LSR = (1-alpha)p(k) + alpha*u is Eq. (1) in your paper. LSR: q'(k) = (1-alpha)q(k) + alpha*u(k) (sorry for typing q as p). Here q(k) is the hard label, i.e., q(k) = 1 if k = c and q(k) = 0 otherwise.

Eq. (8) is your manually designed distribution. If we let a = (1-alpha) + alpha/K and u = 1/K in Eq. (8), then Eq. (8) equals Eq. (1).
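For reference, a sketch of the algebra behind this substitution (my own working, writing p^d for the distribution of Eq. (8), with value a on the correct class c and (1-a)/(K-1) on every other class):

```latex
p^d(c) = a = (1-\alpha) + \frac{\alpha}{K} = q'(c), \qquad
p^d(k) = \frac{1-a}{K-1}
       = \frac{\alpha - \alpha/K}{K-1}
       = \frac{\alpha(K-1)}{K(K-1)}
       = \frac{\alpha}{K} = q'(k) \quad (k \neq c).
```

So, without the temperature, the two distributions agree at every k.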

yuanli2333 (Owner) commented Nov 13, 2019

> Thanks for your detailed explanation.
>
> LSR = (1-alpha)p(k) + alpha*u is Eq. (1) in your paper. LSR: q'(k) = (1-alpha)q(k) + alpha*u(k) (sorry for typing q as p). Here q(k) is the hard label, i.e., q(k) = 1 if k = c and q(k) = 0 otherwise.
>
> Eq. (8) is your manually designed distribution. If we let a = (1-alpha) + alpha/K and u = 1/K in Eq. (8), then Eq. (8) equals Eq. (1).

(1) Eq. (8) will not be equal to Eq. (1) unless the temperature is 1.
(2) Even if we let a = (1-alpha) + alpha/K, Eq. (8) is not a uniform distribution, so it is not the same as LSR.
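A small sketch of point (1) (my own check, not code from this repository; it treats the log of the designed distribution as the virtual teacher's logits before re-applying a softmax, which is only one possible reading of how Eq. (9) applies the temperature):

```python
import numpy as np

def soften(p, T):
    """Soften a probability vector by treating log(p) as logits and
    re-applying a softmax at temperature T (an assumption about the
    softening in Eq. (9), not taken from the paper or the repo)."""
    z = np.log(p) / T
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

K, alpha, c = 10, 0.1, 3
a = (1 - alpha) + alpha / K

# LSR target from Eq. (1)
lsr = np.full(K, alpha / K); lsr[c] += 1 - alpha

# Hand-designed teacher from Eq. (8)
teacher = np.full(K, (1 - a) / (K - 1)); teacher[c] = a

for T in (1.0, 20.0):
    print(T, np.allclose(soften(teacher, T), lsr))   # True at T = 1, False at T = 20
```

With that reading, a temperature above 1 flattens the designed distribution toward uniform, so the softened teacher no longer matches the LSR target of Eq. (1).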

@lijing1996

@yuanli2333 @pecanjk
I still don't get the difference between the proposed Tf-KD_{reg} and LSR.
(1) As you said, Eq. (8) has a temperature. But even though the temperature softens the distribution in Eq. (9), the probabilities of the remaining classes are still equal and all probabilities still sum to 1. So even with the additional temperature, it can still be rewritten in the form of Eq. (1): it is still a mixture of a one-hot distribution and a uniform distribution. I can't see the difference.
(2) As in Eq. (2), LSR can be interpreted as a mixture of KL divergences against the one-hot distribution and the uniform distribution. So can the loss in Eq. (9): it can also be written in the form of Eq. (2), as a mixture of KL divergences against the one-hot distribution and the uniform distribution.
I can't see the difference. Is there a mistake in my reasoning? I'm looking forward to your explanation. Thanks a lot.

@yuanli2333 (Owner)

> @yuanli2333 @pecanjk
> I still don't get the difference between the proposed Tf-KD_{reg} and LSR.
> (1) As you said, Eq. (8) has a temperature. But even though the temperature softens the distribution in Eq. (9), the probabilities of the remaining classes are still equal and all probabilities still sum to 1. So even with the additional temperature, it can still be rewritten in the form of Eq. (1): it is still a mixture of a one-hot distribution and a uniform distribution. I can't see the difference.
> (2) As in Eq. (2), LSR can be interpreted as a mixture of KL divergences against the one-hot distribution and the uniform distribution. So can the loss in Eq. (9): it can also be written in the form of Eq. (2), as a mixture of KL divergences against the one-hot distribution and the uniform distribution.
> I can't see the difference. Is there a mistake in my reasoning? I'm looking forward to your explanation. Thanks a lot.

Hi, the simplest way to check the difference between LSR and Tf-KD_{reg} is to rewrite Eq. (2) or Eq. (9) and check whether one can be rewritten as the other.
You should get your hands dirty rewriting the two equations, and you will find that Eq. (9) cannot be rewritten as Eq. (2) while the temperature is present. But they are similar, because you can view our second method as a special LSR.

@JiyueWang

The first Tf-KD is just Born-Again Networks, right?
