# Uncertainty in neural networks using noise contrastive priors

## Regression

(Based on **arXiv:1807.09289**)

Neural networks are often very successful at making predictions for inputs that are in some sense similar to the training data. However, if the training data is not sufficiently diverse, then at test time one will often encounter inputs that are *out-of-distribution (OOD)*, as opposed to the *in-distribution (ID)* training data, and for which the network might yield unpredictable and inaccurate results. In those cases, it would be useful to have reliable estimates of the uncertainty of the prediction.

Bayesian neural networks are a standard way of tackling this problem. During training, instead of learning point estimates for the weights and biases of the network, one learns a probability distribution over those parameters. At test time, one first samples the network parameters from the learned distributions before making a prediction. As such, a Bayesian neural network represents a distribution of functions, which for a given input yields a certain distribution of outputs. 
However, it is not clear exactly how to specify the prior distribution on the weights, and the way such a network generalizes to OOD data seems rather arbitrary.

A simple toy example is given in the following figure. A neural network is used to predict the mean and standard deviation of a scalar variable (it has a two-dimensional output layer).
On the left, a simple deterministic network is used. On the right, a Bayesian layer is introduced just before the output layer.


<img src="./images/nn.png" width="600" /> <img src="./images/nn1.png" width="300" />

Even though the Bayesian approach does introduce uncertainty in the predicted mean (which depends on the posterior of the weights in the final hidden layer), the generalization to unseen data points is, in some sense, random.

A recently proposed approach starts from the premise that, in order to encourage the network to output high uncertainty on OOD inputs, it is enough to encourage this at the boundary of the training data. The procedure is as follows:

1. Perturb the input data to approximate OOD behavior (e.g. add noise).
2. Encourage the network to output a high uncertainty on the OOD data by adding an additional contribution to the loss function.
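
Step 1 of this procedure can be sketched in a few lines. This is only a minimal illustration, assuming additive Gaussian noise as the perturbation; the function name `perturb_inputs` and the value of `noise_std` are hypothetical and not taken from the paper.

```python
import numpy as np

def perturb_inputs(x, noise_std=0.5, rng=None):
    """Return a noisy copy of x, used as a proxy for OOD inputs.

    noise_std is an assumed hyperparameter controlling how far the
    perturbed points move away from the training data.
    """
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, noise_std, size=x.shape)

# Toy 1D regression inputs (made up for illustration).
x_id = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
x_ood = perturb_inputs(x_id, noise_std=0.5, rng=np.random.default_rng(0))
```

The perturbed inputs `x_ood` would then be fed to the network alongside `x_id`, with the extra loss term applied only to the outputs for `x_ood`.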


In the example of the Bayesian neural network, the proposed new loss function then looks like this

$$\large \mathcal{L}_{\text{NCP}} (\phi) = \mathcal{L}_{\text{BBB}}(\phi) {\Big\rvert}_{\text{ID}} \quad + \quad \lambda \text{KL}\left[ \text{Normal}(\mu_{\mu}, \sigma_{\mu}^2) || q(\mu(x)) \right]\Big\rvert_{\text{OOD}}$$

in which the variance $\sigma_{\mu}^2$ of the normal distribution is chosen very large, to encourage uncertainty in the distribution of the mean when the network is fed OOD inputs.
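
To make the OOD term concrete, here is a minimal sketch that assumes $q(\mu(x))$ can itself be treated as a normal distribution with mean `mu_q` and standard deviation `sigma_q`, so that the KL divergence has a closed form. The function names and the way `loss_id` is passed in are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """KL[ Normal(mu_p, sigma_p^2) || Normal(mu_q, sigma_q^2) ], elementwise."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2)
            - 0.5)

def ncp_loss(loss_id, mu_prior, sigma_prior, mu_q_ood, sigma_q_ood, lam=1.0):
    """ID loss plus lam times the mean KL term evaluated on OOD inputs.

    loss_id stands in for the Bayes-by-backprop loss on ID data (assumed
    to be computed elsewhere); lam is the weighting factor lambda.
    """
    kl = kl_normal(mu_prior, sigma_prior, mu_q_ood, sigma_q_ood)
    return loss_id + lam * np.mean(kl)
```

With a wide prior (large `sigma_prior`), the KL term is small only when the network's distribution over the mean is also wide on OOD inputs, which is the behavior the extra term is meant to encourage.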


# Classification: MNIST

This project attempts to apply similar ideas in a classification setting. We want to see whether it is possible to train a neural network to output a higher uncertainty on unseen data and prevent overconfident classification, by adding an additional contribution to the loss function, similar to the previously discussed paper. The following graph explains the setup.


<img src="./images/diagram.svg" width="800" />


In a classification setting, we cannot just use the output distribution as a measure for uncertainty. **However, we can use the entropy of the probabilities from the softmax output layer to represent the uncertainty of the classifier.** This is an easy-to-calculate quantity. 
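
As a minimal sketch of this quantity (NumPy only, with made-up logits for an 8-class problem), the entropy of the softmax output can be computed as follows. The small constant `eps` is an assumption added for numerical safety.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of class probabilities."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

# First row: a confident prediction. Second row: a maximally uncertain one.
logits = np.array([[10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
H = entropy(softmax(logits))
```

The entropy is close to zero for a confident prediction and reaches its maximum, $\ln K$, when all $K$ classes are equally likely.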

As for the generation of OOD data, multiple possibilities exist. Here, we chose to **apply affine transformations to the images.**


# How to generate OOD data

Our training data will generally be “in-distribution” (similar to other data we have observed) whereas sometimes the test data can be drawn from a different distribution, i.e. it is “out-of-distribution” (OOD). We don’t want our uncertainty estimates to be overconfident on this OOD data which is inherently different from the in-distribution data. If we actually had an analytical form or a way of sampling the out-of-distribution data, we could just use that in training, but in general we don’t have this. Instead, we can attempt to generate OOD data to train on (in addition to our regular “in-distribution” training data), and encourage the model to output high uncertainty for this synthetically generated OOD data. This leads us to a procedure for generating OOD data.

For our MNIST classification testbed, we can make a very intuitive in vs. out of distribution split as follows: we take K of the digit classes as in-distribution, and the remaining 10-K classes as the out-of-distribution data. Clearly, if the model is trained on e.g. digits {0,1,2,3,4,5,6,7} but never sees {8,9}, then we expect it to: 1) perform “well” on the in-distribution data and poorly on the omitted OOD data, and 2) have greater certainty in classification decisions on in-distribution digit classes than on the omitted OOD digit classes. We thus have a legitimate way of making OOD data.

Meanwhile, during training, we want to generate something that looks like this OOD data. As a simple proof of concept, we take a given image and apply a transformation to it. This perturbs the image so that it may move away from the data manifold and into an out-of-distribution region. However, this is not always the case: some perturbations may change the image only slightly, so that it is still “in-distribution.” The procedure is also ill-posed: what does it mean to generate the complement of the in-distribution training set for a complicated dataset? I.e. what does it look like to be not a 7 or 4 or a dog?

For our case, we rotate the image by a random amount. This is just one of many possible transformations we could imagine applying. For example, we could generalize rotation to any kind of random affine transformation with arbitrary translation, rotation, scaling, and shear. Even more generally, we could apply a projective transformation. Furthermore, for MNIST we could do arbitrary pixelwise transformations on the grayscale intensities (or for different color channels if there is more than one), or add noise (e.g. Gaussian jitter on each pixel). Also, complex warpings and other transformations are possible, e.g. swirl, or any arbitrary deformation.
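
A random rotation of this kind can be sketched with NumPy alone. The helper below is a simplified stand-in (nearest-neighbour resampling, zero fill outside the image) and not necessarily the routine used in this project; `rotate_image` and the toy image are hypothetical.

```python
import numpy as np

def rotate_image(img, angle_deg):
    """Rotate a 2D image about its centre by angle_deg degrees.

    Uses nearest-neighbour sampling; pixels that map outside the
    original image are filled with zeros.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, find its source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

# Toy 28x28 image with a crude vertical stroke (illustration only).
rng = np.random.default_rng(0)
img = np.zeros((28, 28), dtype=np.float32)
img[4:24, 13:15] = 1.0
angle = rng.uniform(-180.0, 180.0)
ood_img = rotate_image(img, angle)
```

Any of the other transformations mentioned above (affine, projective, pixelwise noise) could be swapped in for `rotate_image` without changing the rest of the training setup.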



### Setup for MNIST (1)






1. Deterministic network -- **no Bayesian layers!**

2. Input data are 28x28 images of the digits 0, 1, 2, 3, 4, 5, 6, 7 -- **8 and 9 are omitted, and are used for evaluation.**

3. 256 --> 256 --> 8 network, using leaky ReLU activation functions and a softmax output.

4. Training with the Adam optimizer.

5. **OOD data generated through rotations**






(code cell 1: imports)

(code cell 2: network template, calculation of the entropy and of the cross-entropy (CE) loss)

(code cell 3: the loss, defined as the difference between the CE and the entropy, scaled by alpha)
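
The loss described in the last cell can be sketched as follows. This is a NumPy illustration under stated assumptions, not the project's actual code: in particular, which term alpha multiplies is a guess, chosen so that the entropy term gains weight as alpha goes to 0, in line with the discussion of the alpha experiment further on. All function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def cross_entropy(probs, labels, eps=1e-12):
    """Per-example CE for integer class labels."""
    return -np.log(probs[np.arange(len(labels)), labels] + eps)

def ncp_classification_loss(logits_id, labels_id, logits_ood, alpha):
    """alpha * CE on ID data minus mean entropy on OOD data (assumed form)."""
    ce = cross_entropy(softmax(logits_id), labels_id).mean()
    h_ood = entropy(softmax(logits_ood)).mean()
    return alpha * ce - h_ood
```

Minimizing this quantity pushes the network towards correct, confident predictions on ID inputs and towards high-entropy (uncertain) predictions on the synthetic OOD inputs.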

### Results for deterministic network

We perform several experiments to demonstrate how uncertainty can be estimated to prevent overconfidence in classification, and how NCPs can aid in this process. For a K-category classification task, one reasonable measure of uncertainty is the entropy of the length-K softmax output vector. (We also discuss alternatives in the Future Work section.)

OBSERVATIONS

1. Accuracy decreases as the NCP term in the loss is weighted more strongly.
2. The entropy difference between ID and OOD data increases, but its standard deviation is relatively high --> **the network is still unable to tell the difference between input it has already seen and new input.**


(show images of decrease in accuracy, increase in entropy of both)

“rotate” experiment:

As described earlier, a priori we might expect that doing such a simple transformation as rotation alone would not sufficiently move points away from in-distribution to other “good” OOD regions to sample, i.e. by rotating a “7” by any arbitrary amount we will never recover an “8.” We seek to confirm or refute this with a rotation experiment.
For a given set of holdout digit classes (e.g. {8,9}) and a reasonable alpha value, we vary the range of angles that the digits are allowed to rotate through to transform them from in-distribution to a synthetic sample of OOD data. We retrain the network from the same random seed 10 times. Each time, the range of allowed angles is increased, from 0 degrees (no rotation at all, so the synthetic OOD data looks exactly like the in-distribution data) to 180 degrees. The actual angle of rotation for each image is chosen uniformly at random from [-Theta, Theta].
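
The angle schedule for this experiment might look like the sketch below. The even spacing of the 10 values of Theta and the number of images per run are assumptions made for illustration; the text above only fixes the endpoints (0 and 180 degrees) and the number of runs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images = 1000                            # hypothetical number of images per run
max_angles = np.linspace(0.0, 180.0, 10)   # assumed evenly spaced values of Theta

angles_per_run = {}
for theta_max in max_angles:
    # One rotation angle per image, uniform on [-Theta, Theta].
    angles = rng.uniform(-theta_max, theta_max, size=n_images)
    angles_per_run[float(theta_max)] = angles
    # The network would be retrained here, rotating image i by angles[i]
    # to produce the synthetic OOD batch.
```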


“alpha” experiment

It is useful to see how the network’s uncertainty varies as a function of alpha, the weighting parameter in the loss function which trades off the standard cross-entropy loss for classification against the uncertainty term. We look at the properties of three partitions of the dataset:

- “in-distribution” or “id”: the usual training set MNIST digits
- “out-of-distribution” or “OOD” or “od”: those images that were transformed to be OOD
- “omitted” or “om”: the holdout digit classes completely omitted from training

Once the network is fully trained, we evaluate the mean uncertainty over each of these partitions. E.g. for the “out-of-distribution” data, we check the mean uncertainty over all OOD instances. We also look at basic statistics such as the standard deviation of this quantity.
We also want to make sure that the classification accuracy remains good while the network still gives reasonable uncertainty estimates.
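
The per-partition statistics can be gathered with a small helper like the one below. The entropy values are made up purely to show the shape of the computation; they are not results from this project, and `summarize_uncertainty` is a hypothetical name.

```python
import numpy as np

def summarize_uncertainty(entropies_by_partition):
    """Map each partition name to (mean, standard deviation) of its entropies."""
    return {name: (float(np.mean(h)), float(np.std(h)))
            for name, h in entropies_by_partition.items()}

# Placeholder per-example entropies for the three partitions (illustration only).
example = {
    "id": np.array([0.1, 0.2, 0.3]),
    "od": np.array([0.9, 1.1, 1.3]),
    "om": np.array([0.7, 1.0, 1.6]),
}
summary = summarize_uncertainty(example)
```

Repeating this for each value of alpha gives the curves of mean uncertainty (with error bars) per partition.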

Explanation:
We observe a general trend where accuracy is traded off for larger uncertainty estimates (alpha is related to the inverse of the uncertainty weighting, so as alpha goes to 0, the uncertainty loss is given higher and higher weight). But over a large range of values, the network can achieve good classification accuracy and still output large uncertainties for OOD data.
Note: the fact that the OOD accuracy is still reasonably high calls into question how well this transformation moves the data away from the data manifold and into an OOD region. Because accuracy is still relatively high, perhaps it does not move the points very far.
Also, note that the classification accuracy on the omitted data is always 0, because the network has no output for those categories (although it would be interesting to include an “other” or “anomaly” category as a catch-all for anything too new).


“successive holdout of digit classes” experiment

In this experiment, we look at the effects of holding out (“omitting”) more and more digit classes, i.e. we try various bipartitions of the set {0,…,9} that leave successively fewer digit categories in the in-distribution training set. We constrain the problem to have at least two classes for in-distribution training, and at least one digit class omitted. In terms of training accuracy, we might expect the easiest classification problem to be when there are two classes, i.e. binary classification, instead of multiclass classification with some digits being visually similar. In this case we might expect uncertainty during training to also be low, and perhaps to lead to overconfident decisions when the model sees omitted (unseen) classes. Also, we might reason that the highest uncertainty on the omitted classes would occur when there are very many different omitted categories.
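
One way to enumerate such splits is sketched below. The text does not say which digits are omitted at each step, so holding out the highest digits first is purely an assumption here, as are the helper names.

```python
import numpy as np

def holdout_splits(n_classes=10, min_in=2, min_out=1):
    """List (in_distribution, omitted) class splits with successively fewer ID classes."""
    digits = list(range(n_classes))
    return [(digits[:k], digits[k:])
            for k in range(n_classes - min_out, min_in - 1, -1)]

def select_classes(images, labels, classes):
    """Keep only the examples whose label is in the given list of classes."""
    mask = np.isin(labels, classes)
    return images[mask], labels[mask]

splits = holdout_splits()
# splits[0] keeps digits 0-8 and omits 9; splits[-1] keeps 0-1 and omits 2-9.
```

For each split, the network's output layer would need as many units as there are in-distribution classes.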



### Setup for MNIST (2): addition of Bayesian layer

### Results for Bayesian network

### Ongoing / Future Work

- We used a simple MLP with a few layers, which is good enough for MNIST, but more powerful NN architectures could still benefit from this Bayesian perspective and from NCPs.

- It is worth trying other experiments to generate OOD instances and to gain some understanding of how the in-distribution regions transition into the out-of-distribution regions. In particular, two kinds of interpolation could be interesting. The first is to interpolate from in-distribution data to known out-of-distribution data, borrowing ideas from the "Mixup" training procedure but applying them at test time as a form of analysis instead of during training as a form of data augmentation. In Mixup, one takes a convex combination in data space between pairs of inputs (and likewise for the labels). Here, for a holdout set, we would vary a parameter lambda over [0,1] so that the inputs go from being in-distribution to being OOD. At test time, we would evaluate a series of inputs made by interpolating between an in-distribution training point and something that is definitely OOD (such as a held-out MNIST digit), and then look at the uncertainty as a function of lambda, i.e. as a function of distance away from the in-distribution point. Roughly speaking, in a Bayesian NN with NCPs we would expect the uncertainty to increase as we go from in-distribution to OOD, whereas without NCPs the network may remain overconfident and the uncertainty may not change much, even for inputs we know are OOD. Confirming this could be interesting. One concern is that linearly interpolating between e.g. a 7 and a 9 directly in data space may not be as easily interpretable as we would hope: the intermediate interpolations may look like neither a 7 nor a 9, so the OOD "9" at lambda=1 could actually look more like a realistic input than any of the intermediate interpolations. In other words, there is no reason to expect the uncertainty to increase monotonically as we move away from the in-distribution space. It might still be interesting to see this and to compare against a regular Bayesian NN without NCPs.
- On the other hand, we can also interpolate between two categories of in-distribution data. For the MNIST digits, this could be e.g. going from a “3” to a “7.” We would expect that for data as high dimensional as MNIST images, the set would be non-convex. So when we interpolate between two points in the set, we are very likely to move outside of the set and into OOD regions. The issue here is how meaningful those regions may be. Although they would be OOD, would they be useful or realistic? What does a digit that is halfway between a “3” and a “7” tell you about digits you have never seen before?
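
The interpolation analysis could be set up roughly as follows. Everything here is a sketch under assumptions: `predict_logits` stands in for a trained network (a small random linear map is used so the example runs on its own), and the two inputs are random arrays rather than real MNIST digits.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def interpolation_entropy(predict_logits, x_start, x_end, n_steps=11):
    """Entropy of the prediction along the line from x_start (lambda=0) to x_end (lambda=1)."""
    lams = np.linspace(0.0, 1.0, n_steps)
    xs = np.stack([(1.0 - lam) * x_start + lam * x_end for lam in lams])
    return lams, entropy(softmax(predict_logits(xs)))

# Stand-in for a trained classifier: a random linear map to 8 logits (illustration only).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(28 * 28, 8))
predict_logits = lambda xs: xs.reshape(len(xs), -1) @ W

x_start = rng.random((28, 28))   # placeholder for an in-distribution image
x_end = rng.random((28, 28))     # placeholder for a held-out digit or a second ID image
lams, H = interpolation_entropy(predict_logits, x_start, x_end)
```

Plotting `H` against `lams` for networks trained with and without the NCP term would give the comparison described above; the same helper covers both the ID-to-OOD and the ID-to-ID interpolation.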

- Another very interesting experiment would be to apply the KL divergence term of the NCP loss directly to classification problems. This would be novel: not only are NCPs fairly new (the first version of the paper was submitted in July 2018), but the authors also seem to have focused on active learning in the regression setting, so extending this to classification is a new area. Here is one possibility for this approach.

.... 


Dirichlet label prior for the NCP KL divergence term. Will explain this in a lot more detail (net architecture, the softplus activations, etc.) and provide one or two figures (fig of Wikipedia Dirichlet distribution, and of net architecture). Also: compare to entropy, which is permutation invariant, which is bad for this task: the alpha concentration parameters simultaneously control both the mean and covariances of the distribution. This would allow it to somewhat capture interesting aspects of the confusion matrix, e.g. mistaking “1” for “7” is more likely than mistaking “1” for “6.”
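
For reference, if both the label prior and the network output are parameterized as Dirichlet distributions over the $K$ class probabilities (an assumption about how this idea might be set up, not something fixed in these notes), the KL divergence has a standard closed form. Writing the two sets of concentration parameters as $\alpha$ and $\beta$ (these are unrelated to the loss weight alpha used earlier), with $\alpha_0 = \sum_k \alpha_k$ and $\beta_0 = \sum_k \beta_k$:

$$\text{KL}\left[ \text{Dir}(\alpha) \,\|\, \text{Dir}(\beta) \right] = \ln\Gamma(\alpha_0) - \sum_{k=1}^{K} \ln\Gamma(\alpha_k) - \ln\Gamma(\beta_0) + \sum_{k=1}^{K} \ln\Gamma(\beta_k) + \sum_{k=1}^{K} (\alpha_k - \beta_k)\left[ \psi(\alpha_k) - \psi(\alpha_0) \right]$$

where $\psi$ is the digamma function. Unlike the entropy, this expression is not invariant under permutations of the classes of only one of the two distributions, so it can in principle penalize some confusions more than others.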

<img src="./images/NCP_BNN_dirichlet_categorical_classifier.PNG" width="800" />

--------------------------------------------------------------------------------------


- Another open question is what happens if we include output categories that are basically anomaly detectors, i.e. another neuron which is a catch-all category for unfamiliar things. But perhaps the best solution would be a dynamic network which can learn to increase or decrease the number of classification categories on the fly. In other words, a dynamic graph that can prune or grow nodes and has an architecture that changes during learning.
