
some questions about experiment setting and discriminator #2

Closed
freeman-1995 opened this issue Sep 19, 2020 · 6 comments

Comments
@freeman-1995

freeman-1995 commented Sep 19, 2020

Hi @szq0214,

I'm highly interested in your work!
Here are a couple of questions; I hope you can share your thoughts on them.

  1. In the experiment settings, why is weight_decay set to 0? In general, weight_decay is an important factor for the final performance, often accounting for about a 1% difference in validation accuracy on ILSVRC2012 ImageNet.

  2. About the discriminator: it contains three convolution operations, and its inputs are the logits of the student and the combined logits of the teachers, but the target for the discriminator does not look right. In the code it is the following:

target = torch.FloatTensor([[1, 0] for _ in range(batch_size//2)] + [[0, 1] for _ in range(batch_size//2)])

I think the target should be [1, 0] across the whole batch_size, so this looks weird. Are there any considerations behind it? If so, is the effect of the discriminator loss to push the student's logits away from the teachers', something like regularization?
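For reference, the target construction quoted above labels the first half of the batch one way and the second half the other. A minimal plain-Python sketch (hypothetical helper names, no PyTorch dependency) contrasts it with the all-[1, 0] alternative suggested in the question:

```python
def make_targets_as_in_code(batch_size):
    # First half labeled [1, 0], second half [0, 1],
    # mirroring the torch.FloatTensor line quoted above.
    return [[1, 0]] * (batch_size // 2) + [[0, 1]] * (batch_size // 2)

def make_targets_all_real(batch_size):
    # The alternative proposed in the question: every sample labeled [1, 0].
    return [[1, 0]] * batch_size

targets = make_targets_as_in_code(8)
assert targets[:4] == [[1, 0]] * 4 and targets[4:] == [[0, 1]] * 4
```

In the actual repository code these lists are wrapped in `torch.FloatTensor` before being fed to the discriminator loss.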

@freeman-1995 freeman-1995 changed the title from "some questions about experiment setting" to "some questions about experiment setting and discriminator" on Sep 19, 2020
@szq0214
Owner

szq0214 commented Sep 19, 2020

Hi @alxer, thanks for your interest in our work!

  1. weight_decay is not always used on ImageNet; for example, in some cases of training binary neural networks, we also choose to set it to 0. I will run a comparison and let you know the results.
  2. The discriminator is used to distinguish the teacher ensemble from the student, acting as a regularizer. We want it to prevent the student's output from becoming identical to the teacher ensemble's. If you only want them to be similar, I think the KL loss is enough to achieve that purpose.
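As a side note on the KL point above: KL(teacher ‖ student) is zero exactly when the two output distributions match, which is why the KL term alone already drives similarity. A minimal plain-Python sketch (the distributions are illustrative, not the repository's code):

```python
import math

def kl_div(p, q):
    # KL(p || q) for discrete probability distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.7, 0.2, 0.1]        # softened ensemble output (illustrative)
student_far = [0.4, 0.35, 0.25]  # a student early in training (illustrative)
student_close = [0.7, 0.2, 0.1]  # a student matching the teacher exactly

assert kl_div(teacher, student_far) > 0   # mismatch is penalized
assert kl_div(teacher, student_close) == 0  # identical outputs give zero loss
```

This is why, per the answer above, the discriminator's role is not to make the outputs similar (KL handles that) but to keep them from collapsing into being identical.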

@szq0214 szq0214 closed this as completed Sep 20, 2020
@wuwuwuxxx

I'm curious about this target too. You didn't permute the input, so the discriminator outputs the same result across all training steps. Does this cause an overfitting problem?

@szq0214
Owner

szq0214 commented Oct 13, 2020

Hi @wuwuwuxxx, it seems the discriminator converges very fast and tends to overfit. I'm not sure whether shuffling the input can alleviate this, as I think the feature patterns from the teacher ensemble and the student are clearly different, so the discriminator can easily distinguish them, but I will try it later.

@normster

@szq0214 I don't think shuffling the input would make a difference. The model isn't aware of the batch index of each input, so it shouldn't be able to overfit to the ordering of the logits in the batch.

@szq0214
Owner

szq0214 commented Oct 22, 2020

@Freeman1937, see #4 for the comparison of using weight decay and without it.

@zimenglan-sysu-512

> I'm curious about this target too. You didn't permute the input, so the discriminator outputs the same result across all training steps. Does this cause an overfitting problem?

I'm also confused by this setting.
