
Where is the parameter \gamma #2

Open
happynear opened this issue Oct 25, 2014 · 4 comments
@happynear

In formulation (3), there is a factor \gamma. This parameter is set to prevent the hinge loss from being 0. However, I can't find this parameter in the code.
A loss of 0 is quite common in deep learning, and this phenomenon is usually called "overfitting". In deep learning, people usually use dropout to prevent the loss from reaching zero too early.
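For reference, my reading of the thresholded companion term in formulation (3) is roughly
\sum_{m=1}^{M-1} \alpha_m [ l^{(m)}(W, w^{(m)}) - \gamma ]_+
i.e., once a hidden-layer loss l^{(m)} falls below \gamma, the bracketed term (and its gradient) is zero. This is a paraphrase, not a verbatim copy of the formula in the paper.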

@s9xie
Copy link
Owner

s9xie commented Nov 2, 2014

@happynear
This is a good question, and probably a common one. Of course one can tune gamma on the validation set, but this is really annoying. We tried that, but soon came up with another way to implement our formulation and avoid overfitting.

So if you look at our experiment configuration files, you can see we adopt an early-stopping policy during training: we first train the network with DSN for a number of epochs (determined by validation), then discard all the companion losses and continue to train the network with only the output loss.

Gamma is now implicitly and dynamically determined by the loss value achieved at the time we stop early; empirically, this is essential for DSN to achieve good performance.
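A minimal PyTorch-style sketch of this two-phase schedule (the repo itself uses Caffe configuration files; nothing below mirrors them, and all hyperparameter values here are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randint(0, 3, (256,))

# Tiny MLP: one hidden layer, one companion (deeply-supervised) head on it,
# and the usual output head.
hidden = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
companion_head = nn.Linear(64, 3)
output_head = nn.Linear(64, 3)

params = (list(hidden.parameters())
          + list(companion_head.parameters())
          + list(output_head.parameters()))
opt = torch.optim.SGD(params, lr=0.1)
ce = nn.CrossEntropyLoss()

alpha = 0.3           # companion-loss weight (made up)
epochs_with_dsn = 30  # switch point; determined by validation in practice
total_epochs = 60

for epoch in range(total_epochs):
    opt.zero_grad()
    h = hidden(X)
    loss = ce(output_head(h), y)          # output loss, always present
    if epoch < epochs_with_dsn:
        # Phase 1: deep supervision via the weighted companion loss.
        loss = loss + alpha * ce(companion_head(h), y)
    # Phase 2 (epoch >= epochs_with_dsn): companion losses are discarded,
    # so gamma is implicitly the companion-loss level reached at the switch.
    loss.backward()
    opt.step()
```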

@zhangliliang

In @happynear's comment, \gamma is said to be set to prevent the hinge loss from being 0.
However, from my point of view, \gamma is set to *make* the hinge loss of the hidden layers 0 (i.e., to make its gradient vanish), and \alpha_m does the same kind of thing.
But I don't understand the purpose of vanishing the gradient in the paper. Is it to speed up training, because it can skip part of the back-propagation? @s9xie
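To make concrete what I mean by the gradient vanishing (the gamma value below is made up):

```python
# Toy check: the hinged term [l - gamma]_+ stops contributing any gradient
# once the companion loss l has dropped below the threshold gamma.
gamma = 0.5

def hinged(l):
    return max(l - gamma, 0.0)

def hinged_subgrad(l):
    # d/dl [l - gamma]_+ : 1 while l > gamma, 0 once l has fallen below gamma
    return 1.0 if l > gamma else 0.0

for l in (1.2, 0.7, 0.4):
    print(f"loss={l}: term={hinged(l):.2f}, grad={hinged_subgrad(l):.1f}")
```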


s9xie commented Nov 3, 2014

@happynear @zhangliliang Sorry, yes, it is not "preventing the hinge loss from being zero" but making it vanish. I assume it is a typo in the original question?

In our paper we have explained that:
"This way, the overall goal of producing good classification of the output layer is not altered and the companion objective just acts as a proxy or regularization."
Intuitively, we should emphasize the role of the overall loss during training; this "early stop" policy can be a good way to avoid over-fitting the lower layers to their local losses.


sh0416 commented Feb 14, 2020

@s9xie I am working on implementing your method. So you mean that you don't use gamma explicitly, right? Actually, I am also curious about the other hyperparameter, alpha, whose search space grows exponentially as the number of layers increases. In your paper you use a relatively small architecture, i.e., a 3-layer NN. How do you tune this hyperparameter?

Thanks,
