Dropout layer misplaced in output? #1
Hi! Thank you so much for taking the time to read my paper and code! This was my class project last year, so please be aware that there might be some serious errors, as it didn't go through any rigorous review process. That said, I do remember noticing that in Gal's code but still chose to put dropout after the final linear layer, because I simply wasn't able to find the exact reason why I should not - I mean, aren't the final layer's weights also model parameters? I wasn't sure why I should assume a different distribution for them. I've moved on to other topics for now, so what I said above is quite likely complete nonsense. I will try to come back to this after checking Gal's paper again over the weekend. If you do know the answer, I'd really appreciate it if you could share it with me here!
Thanks for getting back so quickly. My primary source for this is just
Gal's paper 'Dropout as a Bayesian approximation: Representing model
uncertainty in deep learning', Section 3, where he says "...with dropout
applied *before* every weight layer...". I am not an expert in this area,
but my current thinking is that dropout effectively breaks the
connection to the next node in the sequence. If dropout occurs
after the final linear layer's calculation, then the model will
periodically return zero as its final answer, which just doesn't feel right.
That layer's output is not being passed into another layer, so we don't
need to randomly drop it. It's just being passed into our loss function or
stored as the trained model's output.
I've been finding and collecting examples of MC dropout code and running
them to compare their output to a model I'm working on myself. If I make
that tweak to your code then the results fall in line with the others.
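For what it's worth, the placement can be sketched in a few lines of NumPy (all names and sizes here are hypothetical, not taken from the repository): dropout is applied to the input of every weight layer, and nothing is dropped after the final linear layer, so a stochastic pass never zeroes the model's answer outright.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, rng):
    """Inverted dropout: zero units with probability p, rescale survivors."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def mc_forward(x, W1, b1, W2, b2, p, rng):
    """One stochastic pass: dropout *before* each weight layer,
    nothing after the final linear layer."""
    h = np.maximum(0.0, dropout(x, p, rng) @ W1 + b1)  # hidden layer (ReLU)
    return dropout(h, p, rng) @ W2 + b2                # final layer, no dropout after

# Hypothetical toy weights: 3 inputs -> 16 hidden -> 1 output
W1 = rng.normal(size=(3, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)); b2 = np.zeros(1)
x = np.array([0.5, -1.0, 2.0])

# Monte Carlo estimate of the predictive mean and variance over T passes
samples = np.stack([mc_forward(x, W1, b1, W2, b2, 0.1, rng) for _ in range(200)])
mean, var = samples.mean(axis=0), samples.var(axis=0)
```

The spread of `samples` then reflects only the dropped hidden units and inputs, not a dropped output unit, which is what keeps the variance estimate from being artificially inflated.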
-rps
So I had a look at the paper again and you are right: under Gal's proof (see the appendix of the paper), it is the units of each weight matrix and the corresponding inputs that get dropped, so having an additional dropout layer is not part of the equivalence. I usually thought of dropout as each layer's outputs being dropped, so I guess things got mixed up in my head at some point 😢 Plus, I've also realized that we shouldn't have the bias term in the final output layer to follow his formulation exactly.
I will have to re-run my experiments and revise the paper pretty soon, but I think this probably (hopefully) won't change most of the observations I've made there. The variance values will get smaller, though.
That's an interesting and good observation about the bias. Do you follow
why Gal does not use a bias term in his paper? The only detail I found is a
footnote in the appendix that says: "Note that we omit the outer-most bias
term as this is equivalent to centring the output". However, looking at
Gal's code, he does use a bias term. I guess leaving out the bias term
forces the last layer to only shear and stretch the output linearly,
since shifting and translation are handled by the bias. I'm not sure
what the significance of that tweak is in relation to the broader theory.
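The shear/stretch intuition can be stated concretely (a toy sketch, not Gal's code): without the outer bias, the last layer is a purely linear map, so it fixes the origin and can only scale or mix its inputs, never translate them.

```python
import numpy as np

rng = np.random.default_rng(3)

W = rng.normal(size=(4, 1))  # a final layer with no bias term

# A bias-free layer fixes the origin: zero features can only map to zero
z = np.zeros(4)
assert (z @ W == 0).all()

# ...and it is homogeneous: scaling the input scales the output
h = rng.normal(size=4)
assert np.allclose((2 * h) @ W, 2 * (h @ W))
```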
Regarding your model, couldn't another dropout layer be added before the
'noise' linear layer if the model is predicting the heteroscedastic noise?
That would match the pattern with the 'activation' layer at the very least.
I suppose that would give us a better sample of the posterior distribution
if I follow Gal's math correctly?
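To make the question concrete, here is a sketch of what that symmetric placement might look like (a NumPy toy with hypothetical names, not code from either repository): the shared features go through dropout once per head, so the 'noise' head sees a stochastic input just like the 'activation' head does.

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(x, p, rng):
    """Inverted dropout: zero units with probability p, rescale survivors."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def heteroscedastic_forward(h, W_mean, b_mean, W_noise, b_noise, p, rng):
    """One stochastic pass with dropout before *both* output heads:
    the mean ('activation') head and the log-variance ('noise') head
    each receive their own dropped copy of the shared features."""
    mean = dropout(h, p, rng) @ W_mean + b_mean
    log_var = dropout(h, p, rng) @ W_noise + b_noise
    return mean, log_var

# Hypothetical shared features produced by earlier layers
h = rng.normal(size=8)
W_mean = rng.normal(size=(8, 1)); b_mean = np.zeros(1)
W_noise = rng.normal(size=(8, 1)); b_noise = np.zeros(1)

mean, log_var = heteroscedastic_forward(h, W_mean, b_mean, W_noise, b_noise, 0.1, rng)
sigma2 = np.exp(log_var)  # exponentiating keeps the aleatoric noise positive
```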
Ryan
Hello Ryan, apologies for the late reply. I'm currently working on a few different things and haven't really had a chance to take a proper look at this lately :( If our discussion is no longer relevant to your project, please ignore this message.
I think the easiest explanation for the last bias term is that it never gets 'dropped out' in any way under our formulation. Unlike the other trainable weights, whose contributions to the output get dropped out at some point, this last bias goes straight to the output. As a result, it shifts the predictive distribution but doesn't actually contribute to the overall shape of the distribution.
Regarding your second point, you might want to check out Gal's GitHub repo about heteroscedastic noise. The idea there is that we assume the noise to be some linear function of each data point. Since this noise is not entirely caused by our main model, we do not try to estimate the distribution of these weights.
By the way, this is where I found thinking about epistemic and aleatoric uncertainty to be really interesting and confusing at the same time: if we get to observe some data, how much of it should be considered 'noise' versus actual 'truth'?
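That "shifts the distribution but not its shape" argument is easy to check numerically (a toy sketch, not the paper's model): adding a constant bias to every stochastic output moves the predictive mean but leaves the predictive variance untouched.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend these are T stochastic MC-dropout outputs for a single input
samples = rng.normal(loc=1.5, scale=0.4, size=1000)

bias = 0.7  # a final-layer bias that is never dropped out
shifted = samples + bias

# The bias translates the predictive distribution without changing its spread
assert np.isclose(shifted.mean(), samples.mean() + bias)
assert np.isclose(shifted.var(), samples.var())
```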
Hello,
I have been reading your paper 'Qualitative Analysis of Monte Carlo Dropout' and working through your code. Many thanks for making it available. I noticed something odd on line 88 of fcnet.py (see the link below). Adding a dropout layer after the last linear layer seems odd to me; I have typically seen it placed before the final linear output layer, not after. See Yarin Gal's MC dropout model definition here: the dropout precedes the final linear layer. In fcnet.py's case, the model output is 1-D and is periodically being dropped to zero. This appears to be incorrectly inflating the computed variance values, especially for models that produce large outputs.
If this is the case, then this issue can be fixed by swapping these two lines. Thoughts?
ronald_bdl/ronald_bdl/models/fcnet.py
Line 88 in 9485d2a