# Uncertainty in neural networks using noise contrastive priors

## Regression

(Based on **arXiv:1807.09289**)

Neural networks are often very successful at making predictions for inputs that are, in some sense, similar to the training data. However, if the training data is not sufficiently diverse, the network will often encounter test inputs that are *out-of-distribution (OOD)*, as opposed to the *in-distribution (ID)* training data, and for which it may yield unpredictable and inaccurate results. In those cases, it would be useful to have reliable estimates of the uncertainty of the prediction.

Bayesian neural networks are a standard way of tackling this problem. During training, instead of learning point estimates for the weights and biases of the network, one learns a probability distribution over those parameters. At test time, one first samples the network parameters from the learned distributions before making a prediction. As such, a Bayesian neural network represents a distribution of functions, which for a given input yields a certain distribution of outputs. 
However, it is not clear how to specify the prior distribution over the weights, and the way such a network generalizes to OOD data seems rather arbitrary.
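
The sampling step described above can be illustrated with a minimal NumPy sketch. It assumes a single linear layer with an independent Gaussian distribution over each weight and bias; the function name `sample_predictions` and all parameter values are hypothetical and chosen only for illustration, not taken from the paper.

```python
import numpy as np

def sample_predictions(x, w_mean, w_std, b_mean, b_std, n_samples=100, seed=0):
    """Sample layer parameters n_samples times and collect the outputs.

    Assumption: each weight and bias has its own independent Gaussian
    distribution (a mean-field approximation).
    """
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_samples):
        w = rng.normal(w_mean, w_std)  # one draw of the weight matrix
        b = rng.normal(b_mean, b_std)  # one draw of the bias vector
        outputs.append(x @ w + b)      # prediction for this parameter draw
    return np.stack(outputs)           # shape: (n_samples, n_inputs, n_outputs)

# Illustrative values only.
x = np.array([[0.0], [1.0], [5.0]])
w_mean, w_std = np.array([[1.0]]), np.array([[0.1]])
b_mean, b_std = np.array([0.0]), np.array([0.05])

preds = sample_predictions(x, w_mean, w_std, b_mean, b_std)
pred_mean = preds.mean(axis=0)  # average function value per input
pred_std = preds.std(axis=0)    # spread of the sampled functions per input
```

In this toy layer the spread of the outputs grows with the magnitude of the input, simply because the weight noise is multiplied by `x`. This hints at why the uncertainty such a network reports far from the training data depends on modelling choices rather than on the data.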

A simple toy example is given in the following figure. A neural network is used to predict the mean and standard deviation of a scalar variable (it has a two-dimensional output layer).
On the left, a simple deterministic network is used. On the right, a Bayesian layer is introduced just before the output layer.

<img src="./images/nn.png" width="600" />

Even though the Bayesian approach does introduce uncertainty in the predicted mean (which depends on the posterior of the weights in the final hidden layer), the generalization to unseen data points is, in some sense, random.

A recently proposed approach starts from the premise that, in order to encourage the network to output high uncertainty on OOD inputs, it is enough to encourage this at the boundary of the training data. The procedure is as follows:
1. Perturb the input data to approximate OOD behavior (e.g. add noise).
2. Encourage the network to output high uncertainty on the OOD data by adding an extra term to the loss function.
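
Step 1 can be sketched as follows, assuming additive Gaussian noise as the perturbation. The helper name `make_ood_inputs` and the noise scale are hypothetical; a suitable scale would depend on the data.

```python
import numpy as np

def make_ood_inputs(x, noise_std=0.5, seed=0):
    """Return a noisy copy of x that serves as approximate OOD input.

    Assumption: zero-mean Gaussian noise with a fixed standard deviation.
    """
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, noise_std, size=x.shape)

# Illustrative training inputs on a 1D grid.
x_train = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)
x_ood = make_ood_inputs(x_train)
```

Both `x_train` and `x_ood` would then be passed through the network in the same training step: the usual loss is evaluated on `x_train`, and the uncertainty-encouraging term on `x_ood`.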


In the example of the Bayesian neural network, the proposed loss function then looks like this:

$$\large \mathcal{L}_{\text{NCP}} (\phi) = \mathcal{L}_{\text{BBB}}(\phi) {\Big\rvert}_{\text{ID}} \quad + \quad \lambda \text{KL}\left[ \text{Normal}(\mu_{\mu}, \sigma_{\mu}^2) || q(\mu(x)) \right]\Big\rvert_{\text{OOD}}$$

in which $\lambda$ weights the OOD term and the variance $\sigma_{\mu}^2$ of the normal distribution is chosen very large, to encourage uncertainty in the distribution of the mean when the network is fed with OOD inputs.
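
To make the OOD term concrete, here is a minimal sketch that assumes $q(\mu(x))$ is itself a univariate Gaussian, so that the KL divergence has a closed form. The function names `kl_normal` and `ncp_loss`, the placeholder value used for the ID loss, and all numbers are hypothetical and serve only as illustration.

```python
import numpy as np

def kl_normal(mean_p, std_p, mean_q, std_q):
    """Closed-form KL[ Normal(mean_p, std_p^2) || Normal(mean_q, std_q^2) ]."""
    return (np.log(std_q / std_p)
            + (std_p**2 + (mean_p - mean_q)**2) / (2.0 * std_q**2)
            - 0.5)

def ncp_loss(id_loss, q_mean_ood, q_std_ood, prior_mean, prior_std, lam=1.0):
    """ID loss plus lam times the mean KL term over the OOD inputs.

    Assumption: id_loss is computed elsewhere (here just a number), and the
    network's distribution over the mean is Gaussian for each OOD input.
    """
    ood_term = np.mean(kl_normal(prior_mean, prior_std, q_mean_ood, q_std_ood))
    return id_loss + lam * ood_term

# Wide prior on the mean (large sigma_mu), illustrative values only.
prior_mean, prior_std = 0.0, 10.0
q_mean = np.array([0.2, -0.1, 0.4])

confident = ncp_loss(1.0, q_mean, np.full(3, 0.1), prior_mean, prior_std)
uncertain = ncp_loss(1.0, q_mean, np.full(3, 10.0), prior_mean, prior_std)
```

With a wide prior, a network that is confident about the mean on OOD inputs (small `q_std_ood`) incurs a much larger loss than one that is uncertain, which is the behavior the extra term is meant to encourage.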


## Classification

This project attempts to apply similar ideas in a classification setting. We want to see whether it is possible to train a neural network to output higher uncertainty on unseen data, and thus prevent overconfident classification, by adding an extra term to the loss function, similar to the approach of the paper discussed above. The following diagram explains the setup.

<img src="./images/diagram.svg" width="800" />


In a classification setting, we cannot just use the output distribution as a measure of uncertainty. However, we can use the entropy of the probabilities from the softmax output layer to represent the uncertainty of the classifier. This quantity is easy to calculate: it is zero when all probability mass sits on a single class and maximal when the probabilities are uniform.
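
As a minimal sketch of this uncertainty measure, the entropy can be computed directly from the logits of the output layer. The function names and example logits are hypothetical; the small constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of class probabilities."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

# Illustrative logits for three classes.
logits = np.array([
    [10.0, 0.0, 0.0],  # confident prediction, low entropy
    [0.0, 0.0, 0.0],   # uniform prediction, maximal entropy
])
entropy = predictive_entropy(softmax(logits))
```

For a classifier with $K$ classes the entropy lies between 0 and $\log K$, so the second row above reaches the maximum of $\log 3$.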