Dropout layer misplaced in output? #1
Hi! Thank you so much for taking the time to read my paper and code! This was my class project last year, so please be aware that there might be some serious errors, as it didn't go through any rigorous review process. That said, I do remember noticing that in Gal's code but still chose to put dropout after the final linear layer, because I simply wasn't able to find the exact reason why I should not - I mean, aren't the final layer's weights also model parameters? I wasn't sure why I should assume a different distribution for them. I've moved on to other topics for now, so what I said above is quite likely complete nonsense. I will try to come back to this after checking Gal's paper again over the weekend. If you do know the answer, I'd really appreciate it if you could share it with me here!
Thanks for getting back so quickly. My primary source for this is just
Gal's paper 'Dropout as a Bayesian approximation: Representing model
uncertainty in deep learning', Section 3, where he says "...with dropout
applied *before* every weight layer...". I am not an expert in this area,
but my current thinking is that dropout effectively breaks the
connection to the next node in the sequence. If dropout occurs
after the final linear layer's calculation, then the model will
periodically return zero as its final answer, which just doesn't feel right.
That layer's output is not being passed into another layer, so we don't
need to randomly drop it. It's just being passed into our loss function or
stored as the trained model's output.
I've been finding and collecting examples of MC dropout code and running
them to compare their output to a model I'm working on myself. If I make
that tweak to your code then the results fall in line with the others.
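For what it's worth, the placement can be sketched in a few lines of NumPy (all names and sizes here are hypothetical, not taken from the repository): dropout is applied to the input of every weight layer, and nothing is dropped after the final linear layer, so a stochastic pass never zeroes the model's answer outright.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, rng):
    """Inverted dropout: zero units with probability p, rescale survivors."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def mc_forward(x, W1, b1, W2, b2, p, rng):
    """One stochastic pass: dropout *before* each weight layer,
    nothing after the final linear layer."""
    h = np.maximum(0.0, dropout(x, p, rng) @ W1 + b1)  # hidden layer (ReLU)
    return dropout(h, p, rng) @ W2 + b2                # final layer, no dropout after

# Hypothetical toy weights: 3 inputs -> 16 hidden -> 1 output
W1 = rng.normal(size=(3, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)); b2 = np.zeros(1)
x = np.array([0.5, -1.0, 2.0])

# Monte Carlo estimate of the predictive mean and variance over T passes
samples = np.stack([mc_forward(x, W1, b1, W2, b2, 0.1, rng) for _ in range(200)])
mean, var = samples.mean(axis=0), samples.var(axis=0)
```

The spread of `samples` then reflects only the dropped hidden units and inputs, not a dropped output unit, which is what keeps the variance estimate from being artificially inflated.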
-rps
So I had a look at the paper again and you are right: under Gal's proof (see the appendix of the paper), it is the units of each weight matrix and the corresponding inputs that get dropped, so having an additional dropout layer is not part of the equivalence. I usually thought of dropout as each layer's outputs being dropped, so I guess things got mixed up in my head at some point 😢 Plus, I've also realized that we shouldn't have the bias term in the final output layer to follow his formulation exactly.
I will have to re-run my experiments and revise the paper pretty soon, but I think this probably (hopefully) won't change most of the observations I've made there. The variance values will get smaller, though.
That's an interesting and good observation about the bias. Do you follow
why Gal does not use a bias term in his paper? The only detail I found is a
footnote in the appendix that says: "Note that we omit the outer-most bias
term as this is equivalent to centring the output". However, looking at
Gal's code, he does use a bias term. I guess leaving out the bias term
forces the last layer to only shear and stretch the output linearly,
since shifting and translation are handled by the bias. I'm not sure
what the significance of that tweak is in relation to the broader theory.
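The shear/stretch intuition can be stated concretely (a toy sketch, not Gal's code): without the outer bias, the last layer is a purely linear map, so it fixes the origin and can only scale or mix its inputs, never translate them.

```python
import numpy as np

rng = np.random.default_rng(3)

W = rng.normal(size=(4, 1))  # a final layer with no bias term

# A bias-free layer fixes the origin: zero features can only map to zero
z = np.zeros(4)
assert (z @ W == 0).all()

# ...and it is homogeneous: scaling the input scales the output
h = rng.normal(size=4)
assert np.allclose((2 * h) @ W, 2 * (h @ W))
```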
Regarding your model, couldn't another dropout layer be added before the
'noise' linear layer if the model is predicting the heteroscedastic noise?
That would match the pattern with the 'activation' layer at the very least.
I suppose that would give us a better sample of the posterior distribution
if I follow Gal's math correctly?
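To make the question concrete, here is a sketch of what that symmetric placement might look like (a NumPy toy with hypothetical names, not code from either repository): the shared features go through dropout once per head, so the 'noise' head sees a stochastic input just like the 'activation' head does.

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(x, p, rng):
    """Inverted dropout: zero units with probability p, rescale survivors."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def heteroscedastic_forward(h, W_mean, b_mean, W_noise, b_noise, p, rng):
    """One stochastic pass with dropout before *both* output heads:
    the mean ('activation') head and the log-variance ('noise') head
    each receive their own dropped copy of the shared features."""
    mean = dropout(h, p, rng) @ W_mean + b_mean
    log_var = dropout(h, p, rng) @ W_noise + b_noise
    return mean, log_var

# Hypothetical shared features produced by earlier layers
h = rng.normal(size=8)
W_mean = rng.normal(size=(8, 1)); b_mean = np.zeros(1)
W_noise = rng.normal(size=(8, 1)); b_noise = np.zeros(1)

mean, log_var = heteroscedastic_forward(h, W_mean, b_mean, W_noise, b_noise, 0.1, rng)
sigma2 = np.exp(log_var)  # exponentiating keeps the aleatoric noise positive
```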
Ryan
Hello Ryan, apologies for the late reply. I'm currently working on a few different things and haven't really had a chance to take a proper look at this lately :( If our discussion is no longer relevant to your project, please ignore this message.
I think the easiest explanation for the last bias term is that it never gets 'dropped out' in any way under our formulation. Unlike the other trainable weights, whose contributions to the output get dropped out at some point, this last bias goes straight to the output. As a result, it shifts the predictive distribution but doesn't actually contribute to the overall shape of the distribution.
Regarding your second point, you might want to check out Gal's GitHub repo about heteroscedastic noise. The idea there is that we assume the noise to be some linear function of each data point. Since this noise is not entirely caused by our main model, we do not try to estimate the distribution of these weights.
By the way, this is where I found thinking about epistemic and aleatoric uncertainty to be really interesting and confusing at the same time: if we get to observe some data, how much of it should be considered 'noise' versus actual 'truth'?
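That "shifts the distribution but not its shape" argument is easy to check numerically (a toy sketch, not the paper's model): adding a constant bias to every stochastic output moves the predictive mean but leaves the predictive variance untouched.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend these are T stochastic MC-dropout outputs for a single input
samples = rng.normal(loc=1.5, scale=0.4, size=1000)

bias = 0.7  # a final-layer bias that is never dropped out
shifted = samples + bias

# The bias translates the predictive distribution without changing its spread
assert np.isclose(shifted.mean(), samples.mean() + bias)
assert np.isclose(shifted.var(), samples.var())
```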
Hello,
I have been reading your paper 'Qualitative Analysis of Monte Carlo Dropout' and working through your code. Many thanks for making it available. I noticed something odd on line 88 of fcnet.py (see the link below). Adding a dropout layer after the last linear layer seems odd to me; I have typically seen it placed before the final linear output layer, not after. See Yarin Gal's MC dropout model definition here: the dropout precedes the final linear layer. In fcnet.py's case, the model output is 1-D and is periodically being dropped to zero. This appears to be incorrectly inflating the computed variance values, especially for models that produce large outputs.
If this is the case, then this issue can be fixed by swapping these two lines. Thoughts?
ronald_bdl/ronald_bdl/models/fcnet.py
Line 88 in 9485d2a