
some questions about experiment setting and discriminator #2

Closed
freeman-1995 opened this issue Sep 19, 2020 · 6 comments

Comments
@freeman-1995

freeman-1995 commented Sep 19, 2020

Hi @szq0214,

I'm highly interested in your work!
Here are a couple of questions; I hope you can share your thoughts on them.

  1. In the experiment settings, why is weight_decay set to 0? In general, weight_decay is an important factor for the final performance, often accounting for about a 1% difference in validation accuracy on ILSVRC2012 ImageNet.

  2. About the discriminator: it contains three convolution operations, and its inputs are the logits of the student and the combined logits of the teachers, but the target for the discriminator does not look right. In the code it is the following:

target = torch.FloatTensor([[1, 0] for _ in range(batch_size//2)] + [[0, 1] for _ in range(batch_size//2)])

I think the target should be [1, 0] across the whole batch_size, so this looks weird. Are there any considerations behind it? If so, is the effect of the discriminator loss to push the student's logits away from the teachers', something like regularization?
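For reference, the target construction quoted above labels the first half of the batch one way and the second half the other. A minimal plain-Python sketch (hypothetical helper names, no PyTorch dependency) contrasts it with the all-[1, 0] alternative suggested in the question:

```python
def make_targets_as_in_code(batch_size):
    # First half labeled [1, 0], second half [0, 1],
    # mirroring the torch.FloatTensor line quoted above.
    return [[1, 0]] * (batch_size // 2) + [[0, 1]] * (batch_size // 2)

def make_targets_all_real(batch_size):
    # The alternative proposed in the question: every sample labeled [1, 0].
    return [[1, 0]] * batch_size

targets = make_targets_as_in_code(8)
assert targets[:4] == [[1, 0]] * 4 and targets[4:] == [[0, 1]] * 4
```

In the actual repository code these lists are wrapped in `torch.FloatTensor` before being fed to the discriminator loss.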

@freeman-1995 freeman-1995 changed the title from "some questions about experiment setting" to "some questions about experiment setting and discriminator" on Sep 19, 2020
@szq0214
Owner

szq0214 commented Sep 19, 2020

Hi @alxer, thanks for your interest in our work!

  1. weight_decay is not always used on ImageNet; for example, in some cases of training binary neural networks, we also choose to set it to 0. I will run a comparison and let you know the results.
  2. The discriminator is used to distinguish the teacher ensemble from the student, acting as a regularizer. We want it to prevent the student's output from becoming identical to the teacher ensemble's. If you only want them to be similar, I think the KL loss is enough to achieve that purpose.
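As a side note on the KL point above: KL(teacher ‖ student) is zero exactly when the two output distributions match, which is why the KL term alone already drives similarity. A minimal plain-Python sketch (the distributions are illustrative, not the repository's code):

```python
import math

def kl_div(p, q):
    # KL(p || q) for discrete probability distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.7, 0.2, 0.1]        # softened ensemble output (illustrative)
student_far = [0.4, 0.35, 0.25]  # a student early in training (illustrative)
student_close = [0.7, 0.2, 0.1]  # a student matching the teacher exactly

assert kl_div(teacher, student_far) > 0   # mismatch is penalized
assert kl_div(teacher, student_close) == 0  # identical outputs give zero loss
```

This is why, per the answer above, the discriminator's role is not to make the outputs similar (KL handles that) but to keep them from collapsing into being identical.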

@szq0214 szq0214 closed this as completed Sep 20, 2020
@wuwuwuxxx

I'm curious about this target too. You didn't permute the input, so the discriminator outputs the same result across all training steps. Does this cause an overfitting problem?

@szq0214
Owner

szq0214 commented Oct 13, 2020

Hi @wuwuwuxxx, it seems the discriminator converges very fast and tends to overfit. I'm not sure whether shuffling the input can alleviate this, as I think the feature patterns from the teacher ensemble and the student are clearly different, so the discriminator can easily distinguish them, but I will try it later.

@normster

@szq0214 I don't think shuffling the input would make a difference. The model isn't aware of the batch index of each input, so it shouldn't be able to overfit to the ordering of the logits in the batch.

@szq0214
Owner

szq0214 commented Oct 22, 2020

@Freeman1937, see #4 for the comparison of using weight decay and without it.

@zimenglan-sysu-512

> I'm curious about this target too. You didn't permute the input, so the discriminator outputs the same result across all training steps. Does this cause an overfitting problem?

I'm also confused by this setting.
