
Dropout layer misplaced in output? #1

Open
rastna12 opened this issue Sep 25, 2020 · 6 comments

@rastna12

Hello,
I have recently been reading your paper 'Qualitative Analysis of Monte Carlo Dropout' and going through your code. Many thanks for making it available. I noticed something odd on line 88 of fcnet.py (see the link below): a dropout layer is applied after the last linear layer. I have typically seen dropout placed before the final linear output layer, not after it; in Yarin Gal's MC dropout model definition, for example, the dropout precedes the final linear layer. In fcnet.py's case, the model output is 1-D, so that scalar output is itself being randomly dropped. This appears to incorrectly inflate the computed variance values, especially for models that produce large outputs.

If this is the case, then this issue can be fixed by swapping these two lines. Thoughts?

activation = self.output['dropout'](activation)
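To make the suggestion concrete, here is a minimal sketch of the ordering I have in mind (the layer names and sizes are just illustrative, not taken from fcnet.py):

```python
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    # Illustrative stand-in for the model in fcnet.py; names and sizes are assumptions.
    def __init__(self, in_dim=10, hidden_dim=64, p=0.5):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden_dim)
        self.dropout = nn.Dropout(p)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        activation = torch.relu(self.hidden(x))
        # Dropout acts on the hidden activation, i.e. before the final
        # linear layer, not on the 1-D model output afterwards.
        activation = self.dropout(activation)
        return self.out(activation)
```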

@ronaldseoh
Owner

ronaldseoh commented Sep 25, 2020

Hi! Thank you so much for taking the time to read my paper and code! This was my class project last year, so please be aware that there might be some serious errors, as it didn't go through any rigorous review process.

That said, I do remember noticing that in Gal's code but still choosing to put dropout after the final linear layer, because I simply couldn't find the exact reason why I shouldn't. I mean, aren't the final layer's weights also model parameters? I wasn't sure why I should assume a different distribution for them.

I've moved on to other topics for now, so what I said above may well be complete nonsense. I will try to come back to this after checking Gal's paper again over the weekend. If you do know the answer, I'd really appreciate it if you could share it here!

@rastna12
Author

rastna12 commented Sep 25, 2020 via email

@ronaldseoh
Owner

ronaldseoh commented Sep 26, 2020

So I had a look at the paper again and you are right. Under Gal's proof (see the appendix of the paper), it is the units of each weight matrix and the corresponding inputs that get dropped, so having an additional dropout layer after the final linear layer is not part of the equivalence. I had usually thought of dropout as each layer's outputs being dropped, so I guess things got mixed up in my head at some point 😢
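A tiny toy check of that point, just to convince myself (nothing to do with the repo's actual code): dropping a layer's input units is the same as dropping the corresponding columns of its weight matrix, so the randomness is already tied to the weight matrices themselves.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4)                           # a layer's input
W = torch.randn(3, 4)                        # that layer's weight matrix
z = torch.bernoulli(torch.full((4,), 0.5))   # Bernoulli mask over the input units

# Dropping the input units ...
dropped_inputs = W @ (x * z)
# ... is the same thing as dropping the corresponding columns of W.
dropped_columns = (W * z) @ x
print(torch.allclose(dropped_inputs, dropped_columns))  # True
```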

Plus, I've also realized that, to follow his formulation exactly, we shouldn't have a bias term in the final output layer. PyTorch's Linear layer and Keras's Dense layer apparently add a bias term by default.
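In PyTorch that would just mean turning the bias off explicitly (sketch only; the actual layer sizes in fcnet.py differ):

```python
import torch.nn as nn

hidden_dim = 64  # illustrative size, not necessarily what fcnet.py uses

# Final output layer without a bias term, to match the formulation above.
output_layer = nn.Linear(hidden_dim, 1, bias=False)
```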

@ronaldseoh
Owner

ronaldseoh commented Sep 26, 2020

I will have to re-run my experiments and revise the paper pretty soon, but I think this probably (hopefully) won't change most of the observations I made there. The variance values will get smaller, though.
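A quick toy calculation of why the variance shrinks (made-up numbers, not from the experiments): with the dropout sitting after the output, every MC sample of a prediction y is either 0 or y/(1-p), so dropout alone injects a variance of roughly y^2 * p/(1-p), even if the network itself were perfectly deterministic.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
out = torch.full((10000, 1), 5.0)               # pretend every forward pass returns exactly 5.0
dropped = F.dropout(out, p=0.5, training=True)  # dropout applied to the 1-D output

print(out.var().item())      # 0.0 : the network itself has no variance here
print(dropped.var().item())  # ~25 : 5.0**2 * 0.5 / (1 - 0.5), injected by the misplaced dropout
```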

@rastna12
Author

rastna12 commented Sep 28, 2020 via email

@ronaldseoh
Owner

ronaldseoh commented Oct 22, 2020

Hello Ryan, apologies for the late reply. I'm currently working on a few different things and haven't really had a chance to take a proper look at this lately :( If our discussion is no longer relevant to your project, please ignore this message.

I think the easiest explanation for the last bias term is that it never gets 'dropped out' in any way under our formulation. Unlike the other trainable weights, whose contributions to the output get dropped out at some point, this last bias goes straight to the output. As a result, it shifts the predictive distribution but doesn't actually contribute to the overall shape of the distribution.
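A toy illustration with made-up numbers: adding a constant bias to every MC sample shifts the mean of the predictive distribution but leaves its variance, and hence its shape, untouched.

```python
import torch

torch.manual_seed(0)
samples = torch.randn(10000, 1)   # pretend these are MC outputs before the last bias
shifted = samples + 3.0           # the final-layer bias added to every sample

print(samples.mean().item(), samples.var().item())
print(shifted.mean().item(), shifted.var().item())  # mean moves by 3.0, variance is identical
```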

Regarding your second point, you might want to check out Gal's GitHub repo on heteroscedastic noise. The idea there is that the noise is assumed to be some linear function of each data point. Since this noise is not entirely caused by our main model, we do not try to estimate a distribution over those noise weights. By the way, this is where I found thinking about epistemic and aleatoric uncertainty really interesting and confusing at the same time: if we get to observe some data, how much of it should be considered 'noise' and how much the actual 'truth'?
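For reference, one common way to write that per-point noise down is the heteroscedastic Gaussian negative log-likelihood, where the network predicts a log-variance alongside the mean. This is just a sketch along the lines of Kendall & Gal, not the exact code from Gal's repo:

```python
import torch

def heteroscedastic_nll(mean, log_var, target):
    # The predicted log-variance lets the assumed observation noise
    # differ from one data point to the next.
    precision = torch.exp(-log_var)
    return (0.5 * precision * (target - mean) ** 2 + 0.5 * log_var).mean()
```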

@ronaldseoh ronaldseoh self-assigned this Oct 23, 2020