Hi,

Thanks for your amazing work.

I have a question regarding the implementation of the Similarity-Preserving KD loss in torchdistill/torchdistill/losses/single.py (line 467 at 7f533ba).

In the implementation, the loss is calculated by taking a Frobenius norm over the difference between the teacher's and student's square similarity matrices, and then summing the result. However, `torch.norm` would compute the norm and return a single value for a layer, so I am confused about why we take a sum over it (if it is a single value).

The paper says the loss is the summation, over different layer pairs, of the mean element-wise squared difference between the two square matrices. So does the sum correspond to different layers?

Thank you.
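For reference, here is how I understand Eq. (4) of the paper (my own transcription, so please correct me if I misread it):

```latex
% Eq. (4) of "Similarity-Preserving Knowledge Distillation" (Tung & Mori, 2019),
% as I understand it: \tilde{G}^{(l)} are the row-normalized b x b activation
% similarity matrices, b is the batch size, and \mathcal{I} is the set of
% (teacher, student) layer pairs.
\mathcal{L}_{SP}(G_T, G_S) = \frac{1}{b^2} \sum_{(l, l') \in \mathcal{I}}
    \left\lVert \tilde{G}_T^{(l)} - \tilde{G}_S^{(l')} \right\rVert_F^2
```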
I think the summation you mentioned above is taken over the batch. Say you have a training batch size of 32; then `spkd_losses` holds 32 loss values, one per sample. So we take the sum over the batch first, and if we choose `batch_mean` as the reduction, we reduce the value by dividing it by `b ** 2`, as shown in Eq. 4 of the paper.
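To illustrate, here is a minimal sketch of that reduction. It paraphrases the implementation rather than quoting it; the `dim=1` in `torch.norm` (which yields one value per sample) and the helper names are my assumptions for this example:

```python
import torch
from torch.nn import functional

def matmul_and_normalize(z):
    # Flatten each sample and build its normalized b x b similarity matrix
    z = torch.flatten(z, 1)
    return functional.normalize(torch.matmul(z, torch.t(z)), p=2, dim=1)

def spkd_loss(teacher_outputs, student_outputs, reduction='batch_mean'):
    g_t = matmul_and_normalize(teacher_outputs)  # shape: (b, b)
    g_s = matmul_and_normalize(student_outputs)  # shape: (b, b)
    # One squared L2 norm per row, i.e. one loss value per sample: shape (b,)
    spkd_losses = torch.norm(g_t - g_s, dim=1) ** 2
    # Summing over the batch gives the squared Frobenius norm of (g_t - g_s)
    loss = spkd_losses.sum()
    b = teacher_outputs.shape[0]
    # 'batch_mean' divides by b**2, recovering the 1/b^2 factor in Eq. 4
    return loss / (b ** 2) if reduction == 'batch_mean' else loss
```

With a batch size of 32, `spkd_losses` holds 32 values; summing them and dividing by `b ** 2` matches Eq. 4 of the paper.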
Also, please use the Discussions tab above for questions. As explained here, I want to keep Issues mainly for bug reports.