# Uncertainty in Neural Networks using Noise Contrastive Priors

## Probabilistic Programming 6998 Final Project

## Regression

(This section is based on **arXiv:1807.09289**: Hafner, Tran, Lillicrap: "Reliable Uncertainty Estimates", https://arxiv.org/abs/1807.09289)

Neural networks are often very successful at making predictions for inputs that are in some sense similar to the training data. However, if the training data is not sufficiently diverse, then at test time one will often encounter inputs that are *out-of-distribution (OOD)*, as opposed to the *in-distribution (ID)* training data, and for which the network might yield unpredictable and inaccurate results. In those cases, it would be useful to have reliable estimates of the uncertainty of the prediction.

Bayesian neural networks are a standard way of tackling this problem. During training, instead of learning point estimates for the weights and biases of the network, one learns a probability distribution over those parameters. At test time, one first samples the network parameters from the learned distributions before making a prediction. As such, a Bayesian neural network represents a distribution of functions, which for a given input yields a certain distribution of outputs. 
However, it is not clear how to specify the prior distribution on the weights, and the way such a network generalizes to OOD data seems rather arbitrary.
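To make the "distribution of functions" idea concrete, here is a minimal NumPy sketch. It is not the project's code: it assumes a single linear layer whose weights have a factorized Gaussian posterior, and the names (`sample_predictions`, `w_mean`, `w_std`) and all numbers are illustrative.

```python
import numpy as np

def sample_predictions(x, w_mean, w_std, n_samples, rng):
    """Draw weight samples from N(w_mean, w_std^2) and return one prediction per sample."""
    # Shape (n_samples, n_features): each row is one sampled weight vector.
    weights = rng.normal(loc=w_mean, scale=w_std, size=(n_samples, w_mean.shape[0]))
    # Each sampled weight vector defines a different function of x.
    return weights @ x

rng = np.random.default_rng(0)
w_mean = np.array([1.0, -2.0])   # hypothetical posterior means
w_std = np.array([0.1, 0.1])     # hypothetical posterior standard deviations

x_small = np.array([0.5, 0.5])
x_large = np.array([50.0, 50.0])

preds_small = sample_predictions(x_small, w_mean, w_std, 2000, rng)
preds_large = sample_predictions(x_large, w_mean, w_std, 2000, rng)

print("predictive mean/std at x_small:", preds_small.mean(), preds_small.std())
print("predictive mean/std at x_large:", preds_large.mean(), preds_large.std())
```

In this linear toy model the predictive spread grows with the input magnitude purely because of the model form. In a deep network, how the spread behaves away from the training data depends on the weight prior and the architecture, which is the arbitrariness referred to above.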

A simple toy example is given in the following figure. A neural network is used to predict the mean and standard deviation of a scalar variable (it has a two-dimensional output layer).
On the left, a simple deterministic network is used. On the right, a Bayesian layer is introduced just before the output layer.


<img src="./images/nn.png" width="600" /> <img src="./images/nn1.png" width="300" />

Even though the Bayesian approach does introduce uncertainty in the predicted mean (which depends on the posterior of the weights in the final hidden layer), the generalization to unseen data points is, in some sense, random.

A recently proposed approach starts from the premise that priors in the data space are better behaved than the usual neural network priors in weight space, and in order to encourage the network to output high uncertainty it is enough to encourage this at the boundary of the training data. The procedure is as follows:
1. Perturb the input data to approximate OOD behavior (e.g. add noise) 
2. Stimulate the network to output a high uncertainty on the OOD data, by adding an additional contribution to the loss function.


In the example of the Bayesian neural network, the proposed new loss function then looks like this

$$\large \mathcal{L}_{\text{NCP}} (\phi) = \mathcal{L}_{\text{BBB}}(\phi) {\Big\rvert}_{\text{ID}} \quad + \quad \lambda \text{KL}\left[ \text{Normal}(\mu_{\mu}, \sigma_{\mu}^2) || q(\mu(x)) \right]\Big\rvert_{\text{OOD}}$$

in which the variance $\sigma_{\mu}^2$ of the normal distribution is chosen very large, to encourage uncertainty in the distribution of the mean when the network is fed OOD inputs.
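If we assume that $q(\mu(x))$ is itself a univariate Gaussian, the KL term has a closed form. The sketch below uses only the standard library; the function name `kl_normal` and the example numbers are illustrative and are not taken from the paper's code.

```python
import math

def kl_normal(mean_p, std_p, mean_q, std_q):
    """KL[ Normal(mean_p, std_p^2) || Normal(mean_q, std_q^2) ] for scalars."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return (math.log(std_q / std_p)
            + (var_p + (mean_p - mean_q) ** 2) / (2.0 * var_q)
            - 0.5)

# Wide prior on the predicted mean for OOD inputs (hypothetical values).
prior_mean, prior_std = 0.0, 10.0

# An overconfident distribution over the mean is penalized heavily ...
kl_confident = kl_normal(prior_mean, prior_std, 0.0, 0.1)
# ... while one that is as uncertain as the prior incurs no penalty.
kl_uncertain = kl_normal(prior_mean, prior_std, 0.0, 10.0)

print(kl_confident, kl_uncertain)
```

Because the prior appears as the first argument of the KL divergence, a narrow $q$ that fails to cover the wide prior is penalized strongly, which is what pushes the network toward high uncertainty on OOD inputs.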


# Classification: MNIST

<img src="./images/diagram.png" width="800" />


In a classification setting, we cannot just use the output distribution as a measure for uncertainty. **However, we can use the entropy of the probabilities from the softmax output layer to represent the uncertainty of the classifier.** This is an easy-to-calculate quantity. 
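As a minimal sketch of this quantity (NumPy only; the function names and the example logits are illustrative), the entropy of the softmax output can be computed and normalized by log(K) as follows:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def normalized_entropy(probs, eps=1e-20):
    """Shannon entropy per row, divided by log(K) so the result lies in [0, 1]."""
    k = probs.shape[-1]
    entropy = -np.sum(probs * np.log(np.clip(probs, eps, 1.0)), axis=-1)
    return entropy / np.log(k)

logits = np.array([
    [10.0, 0.0, 0.0, 0.0],   # confident prediction
    [0.0, 0.0, 0.0, 0.0],    # maximally uncertain prediction
])
uncertainty = normalized_entropy(softmax(logits))
print(uncertainty)
```

This sketch sums the per-class terms. The `network` function in this notebook instead averages them over the class axis (via `tf.nn.moments`), so its values are smaller by a factor of K.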

As for the generation of OOD data, multiple possibilities exist. Here, we chose to **apply affine transformations to the images.**


# How to Generate OOD Data

Training data is generally “in-distribution” (similar to other data we have observed) whereas sometimes the test data can be drawn from a different distribution, i.e. it is “out-of-distribution” (OOD). We don’t want our uncertainty estimates to be overconfident on this OOD data which is inherently different from the in-distribution data. If we actually had an analytical form or a way of sampling the out-of-distribution data, we could just use that in training, but in general we don’t have this. Instead, we can attempt to generate OOD data to train on (in addition to our regular “in-distribution” training data), and encourage the model to output high uncertainty for this synthetically generated OOD data. This leads us to a procedure for generating OOD data.

For our **MNIST classification testbed**, we can make a very intuitive in vs. out of distribution split as follows: we take K of the digit classes as in-distribution, and the remaining 10 - K classes as the out-of-distribution data. Clearly, if the model is trained on e.g. digits {0,1,2,3,4,5,6,7} but never sees {8,9}, then we expect: 

1) it will perform “well” on the in-distribution data and "poorly" on the omitted OOD data, and 

2) have greater certainty in classification decisions on in-distribution digit classes than on the omitted OOD digit classes. 
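The class-based split described above can be sketched as follows. This is an illustration on placeholder arrays rather than the project's data loader; `split_by_class` is a hypothetical helper name.

```python
import numpy as np

def split_by_class(images, labels, omitted_classes):
    """Split a labelled dataset into in-distribution and omitted (held-out) parts."""
    omitted_mask = np.isin(labels, list(omitted_classes))
    in_dist = (images[~omitted_mask], labels[~omitted_mask])
    omitted = (images[omitted_mask], labels[omitted_mask])
    return in_dist, omitted

# Tiny stand-in for MNIST: 20 blank 28x28 images, two per digit class.
labels = np.repeat(np.arange(10), 2)
images = np.zeros((labels.size, 28, 28), dtype=np.float32)

(id_images, id_labels), (om_images, om_labels) = split_by_class(images, labels, {8, 9})
print(sorted(set(id_labels.tolist())), sorted(set(om_labels.tolist())))
```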

We thus have a legitimate source of OOD data for evaluation. During training, meanwhile, we want to generate something that resembles this OOD data. As a simple proof of concept, we take a given image and apply a transformation to it. This perturbs the image so that it may move away from the data manifold and into an out-of-distribution region. However, this is not always the case: some perturbations change the image only slightly, so that it is still “in-distribution.” The procedure is also ill-posed: for a complicated dataset, it is unclear what it means to generate the complement of the in-distribution training set. That is, what does it look like to be **not** a "7" or a "4" or a dog?

For our demonstration, we rotate the image by a random amount. This is just one of many possible transformations we could imagine applying. For example, we could generalize rotation to any kind of random affine transformation with arbitrary translation, rotation, scaling, and shear. Even more generally, we could apply a projective transformation. Furthermore, for MNIST we could do arbitrary pixelwise transformations on the grayscale intensities (or for different color channels if there is more than one), or add noise (e.g. Gaussian jitter on each pixel). Also, complex warpings and other transformations are possible, e.g. swirl, or any arbitrary deformation.
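A self-contained sketch of the rotation-based perturbation is given below. It uses nearest-neighbour sampling in plain NumPy so that it runs without extra dependencies; the project's own implementation may differ, and in practice a library routine such as `scipy.ndimage.rotate` would be a natural choice. The names `rotate_image` and `make_ood_batch` are illustrative.

```python
import numpy as np

def rotate_image(image, angle_degrees):
    """Rotate a square 2-D image about its centre using nearest-neighbour sampling."""
    n = image.shape[0]
    centre = (n - 1) / 2.0
    theta = np.deg2rad(angle_degrees)
    rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    # Inverse mapping: for every output pixel, find the source pixel it came from.
    src_rows = np.cos(theta) * (rows - centre) + np.sin(theta) * (cols - centre) + centre
    src_cols = -np.sin(theta) * (rows - centre) + np.cos(theta) * (cols - centre) + centre
    src_rows = np.rint(src_rows).astype(int)
    src_cols = np.rint(src_cols).astype(int)
    inside = (src_rows >= 0) & (src_rows < n) & (src_cols >= 0) & (src_cols < n)
    rotated = np.zeros_like(image)  # pixels whose source lies outside the image stay zero
    rotated[inside] = image[src_rows[inside], src_cols[inside]]
    return rotated

def make_ood_batch(images, max_angle, rng):
    """Rotate each image by an angle drawn uniformly from [-max_angle, max_angle]."""
    angles = rng.uniform(-max_angle, max_angle, size=len(images))
    return np.stack([rotate_image(img, a) for img, a in zip(images, angles)])

rng = np.random.default_rng(0)
batch = rng.random((4, 28, 28)).astype(np.float32)  # stand-in for MNIST digits
ood_batch = make_ood_batch(batch, max_angle=90.0, rng=rng)
print(ood_batch.shape)
```

Small angles leave the digit essentially unchanged, which illustrates the caveat above: not every perturbed image actually leaves the in-distribution region.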



### Setup for MNIST (1)






1. Deterministic network -- **no Bayesian layers!**

2. Input data are 28x28 images of the digits 0, 1, 2, 3, 4, 5, 6, 7 -- **8 and 9 are omitted, and are used for evaluation.**

3. 256 --> 256 --> 8 network, using leaky ReLU activation functions and a softmax output.

4. Training through Adam.

5. **OOD data generated through rotations**

<img src="./images/layouta.png" width="500" />






In [1]:
## download MNIST data
# __future__ imports must come before any other import in the cell
from __future__ import print_function
from ncp_classifier.datasets.mnist import init
import tensorflow as tf
import pickle
import os
import numpy as np
from random import seed
from tensorflow_probability import distributions as tfd
import tensorflow_probability as tfp

init()



Downloading train-images-idx3-ubyte.gz...
Downloading t10k-images-idx3-ubyte.gz...
Downloading train-labels-idx1-ubyte.gz...
Downloading t10k-labels-idx1-ubyte.gz...
Download complete.
Save complete.


In [1]:
# define network layout
def network(data, layer_sizes, ncp_scale = 0.1):
    '''
    Defines network topology 
    '''
    # Define neural network topology (in this case, a simple MLP)
    hidden = data[0]
    labels = data[1]
    for size in layer_sizes[:-1]:
        hidden = tf.layers.dense(
                inputs=hidden,
                units=size,
                activation=tf.nn.leaky_relu
                )
    logits = tf.layers.dense(inputs=hidden, units=layer_sizes[-1], activation=None)
    # computes the traditional cross-entropy loss, 
    # which we want to minimize over the in-distribution training data
    standard_loss=tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)
            )
    # computes the ncp_loss (here simply the entropy); the training objective
    # subtracts this term, so the entropy is maximized on the out-of-distribution training data
    logits = logits - tf.reduce_mean(logits)
    class_probabilities = tf.nn.softmax(logits * tf.constant(ncp_scale, dtype=tf.float32))
    entropy = -class_probabilities * tf.log(tf.clip_by_value(class_probabilities, 1e-20, 1))
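    # Normalize by log(K): the entropy summed over the K classes then lies in
    # [0, 1], which makes values easier to compare across experiments.
    # NORMALIZE_ENTROPY is a global flag that must be defined before the call.
    if NORMALIZE_ENTROPY:
        baseK = tf.constant(layer_sizes[-1], dtype=tf.float32, shape=(layer_sizes[-1], ))
        entropy /= tf.log(baseK)
    # Per-example mean and variance of the K per-class terms; note that the
    # mean equals the (normalized) entropy divided by K, not the sum itself.
    mean, variance = tf.nn.moments(entropy, axes=[1])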
    ncp_loss = tf.reduce_mean(mean)
    ncp_std = tf.reduce_mean(tf.math.sqrt(variance))
    return standard_loss, ncp_loss, logits, class_probabilities, ncp_std


In [2]:
# network_tpl = tf.make_template(
#    'network',
#    network,
#    layer_sizes=logging['layer_sizes'],
#    ncp_scale=logging['ncp_scale']
#    )
# id_loss, id_ncp_loss, id_logits, _, id_ncp_std  = network_tpl(id_data)  # calculate CE loss for id input data
# od_loss, od_ncp_loss, od_logits, _, od_ncp_std = network_tpl(od_data)  # calculate entropy for od input data

In [3]:
# loss = alpha * id_loss - (1 - alpha) * od_ncp_loss

### Results for deterministic network

We perform several experiments to demonstrate how uncertainty can be estimated to prevent overconfident classification and how NCPs can aid in this process. For a K-category classification task, one reasonable measure of uncertainty is the entropy of the length-K softmax output vector. (We also discuss alternatives in the Future Work section.) One experiment is described below, and two others are provided in the appendix.

### Experiment 1: “alpha”

It is useful to see how the network’s uncertainty varies as a function of alpha, the weighting parameter in the loss function which trades off the standard cross-entropy loss for classification against the uncertainty term. We look at the properties of three partitions of the dataset:

•	“in-distribution” or “id”: the original training set MNIST digits

•	“out-of-distribution” or “OOD” or “od”: those images that were transformed to represent OOD data

•	“omitted” or “om”: the holdout digit classes completely omitted from training

Once the network is fully trained, we evaluate the mean uncertainty over each of these partitions. E.g. for the “out-of-distribution” data, we check the mean uncertainty over all OOD instances. We also look at basic statistics such as the standard deviation of this quantity.
We also want to make sure that the classification accuracy remains good while the network still gives reasonable uncertainty estimates.

Explanation:
We observe a general trend where accuracy is traded off for larger uncertainty estimates (alpha is related to the inverse of the uncertainty weighting, so as alpha goes to 0, the uncertainty loss is given higher and higher weight). But over a large range of values, the network can achieve good classification accuracy and still output large uncertainties for OOD data.
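The role of alpha can be illustrated with a short sketch of the objective `alpha * id_loss - (1 - alpha) * od_ncp_loss` used in this notebook. The function name `combined_loss` and the loss values below are hypothetical. The alpha grid is inferred from the values printed in the training log, which are consistent with steps of 0.75 decades starting at 1e-7; the full range of the sweep is not shown in the log and is not assumed here.

```python
def combined_loss(id_cross_entropy, od_entropy, alpha):
    """Weighted objective: fit the ID labels while rewarding high entropy on OOD inputs."""
    return alpha * id_cross_entropy - (1.0 - alpha) * od_entropy

# First four alpha values, matching those printed in the training log.
alphas = [10.0 ** (-7.0 + 0.75 * k) for k in range(4)]
print(alphas)

# Hypothetical loss values, only to show how alpha shifts the balance.
id_ce, od_ent = 2.0, 0.5
for alpha in (1e-7, 0.5, 1.0):
    print(alpha, combined_loss(id_ce, od_ent, alpha))
```

With alpha = 1 the objective reduces to the plain cross-entropy, while for alpha close to 0 it is dominated by the (negated) OOD entropy term.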

Note: the fact that the OOD accuracy is still reasonably high brings into question how good a job this transformation does at moving the data away from the data manifold and into an OOD region: because accuracy is still relatively high, perhaps it does not move the point very far?
Also, note that the classification accuracy on the omitted data ("om") is always 0 because the network is not even capable of predicting those categories (although it would be interesting to include an “other” or “anomaly” category as a catch-all for anything too new, as discussed later).


In [3]:
## code running deterministic network, on the alpha experiment
## parameters and more details are in the file itself:
## see ncp_classifier/models/mnist_det.py

#Reset tf graph in case you have already run one of our other experiments in this notebook session:
tf.reset_default_graph()

from ncp_classifier.models.mnist_det import alpha_experiment

alpha_experiment()

## During training, log files will be periodically saved in ncp_classifier/logs
#Full id_acc:     accuracy over the entire partition of in-distribution data
#Full od_acc:     accuracy over the entire partition of out-of-distribution data
    
#(Ignore Futurewarning)

alpha: 1e-07
Epoch: 0001 cost=24.975974242
Epoch: 0002 cost=13.169512903
Epoch: 0003 cost=9.372821758
Epoch: 0004 cost=7.024959794
Epoch: 0005 cost=5.371150734
Epoch: 0006 cost=4.189217833
Epoch: 0007 cost=3.361607764
Epoch: 0008 cost=2.779235546
Epoch: 0009 cost=2.392999147
Epoch: 0010 cost=2.152677618
Epoch: 0011 cost=1.991151148
Epoch: 0012 cost=1.874213417
Epoch: 0013 cost=1.785440108
Epoch: 0014 cost=1.720613751
Epoch: 0015 cost=1.674968458
Optimization Finished!
Full id_acc: 0.37866193
Full od_acc: 0.20349859




alpha: 5.62341325190349e-07
Epoch: 0001 cost=22.636331688
Epoch: 0002 cost=10.152962472
Epoch: 0003 cost=6.458519733
Epoch: 0004 cost=4.422609669
Epoch: 0005 cost=3.147070657
Epoch: 0006 cost=2.347727595
Epoch: 0007 cost=1.832818960
Epoch: 0008 cost=1.501410993
Epoch: 0009 cost=1.276239922
Epoch: 0010 cost=1.140459567
Epoch: 0011 cost=1.048024585
Epoch: 0012 cost=0.982088020
Epoch: 0013 cost=0.926946611
Epoch: 0014 cost=0.882430612
Epoch: 0015 cost=0.840814948
Optimization Finished!
Full id_acc: 0.79092807
Full od_acc: 0.34016347




alpha: 3.162277660168379e-06
Epoch: 0001 cost=15.960980630
Epoch: 0002 cost=4.885454197
Epoch: 0003 cost=2.767755221
Epoch: 0004 cost=1.759029398
Epoch: 0005 cost=1.206281020
Epoch: 0006 cost=0.889243320
Epoch: 0007 cost=0.690560057
Epoch: 0008 cost=0.579700486
Epoch: 0009 cost=0.510158610
Epoch: 0010 cost=0.458015509
Epoch: 0011 cost=0.419436774
Epoch: 0012 cost=0.387386701
Epoch: 0013 cost=0.358688111
Epoch: 0014 cost=0.332650278
Epoch: 0015 cost=0.308003272
Optimization Finished!
Full id_acc: 0.9129101
Full od_acc: 0.3214294




alpha: 1.778279410038923e-05


KeyboardInterrupt: 

In [None]:
#Run the plotting script
run_plotting('alpha')

#Output is saved in ncp_classifier/output/alpha
#Example plots are shown below

<img src="./images/det_alpha_entropy.png" width="500" />

<img src="./images/det_alpha_accuracy.png" width="500" />

KEY OBSERVATIONS

1. Accuracy decreases as NCP term in loss is weighted more strongly (toward left side of figure above)

2. Entropy difference between ID and OOD data increases, but the variances are relatively high --> **network still unable to tell the difference between input it has already seen vs. new input**


### Setup for MNIST (2): addition of Bayesian layer

<img src="./images/layoutb.png" width="500" />


In [9]:
def network(data, layer_sizes=[256, 256], ncp_scale=0.1):
    '''
    Defines network topology
    '''
    hidden = data[0]
    labels = data[1]
    for size in layer_sizes[:-1]:
        hidden = tf.layers.dense(
                inputs=hidden,
                units=size,
                activation=tf.nn.leaky_relu
                )
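    # Initial standard deviation of the last-layer weight posterior.
    weight_std = 0.1
    # Inverse softplus, so that softplus(init_std) == weight_std.
    init_std = np.log(np.exp(weight_std) - 1).astype(np.float32)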
    kernel_posterior = tfd.Independent(tfd.Normal(
        tf.get_variable(
            'kernel_mean',
            (hidden.shape[-1].value, layer_sizes[-1]),
            tf.float32,
            tf.random_normal_initializer(0, weight_std)),
        tf.nn.softplus(tf.get_variable(
            'kernel_std',
            (hidden.shape[-1].value, layer_sizes[-1]),
            tf.float32,
            tf.constant_initializer(init_std)))), 2)
    kernel_prior = tfd.Independent(tfd.Normal(
        tf.zeros_like(kernel_posterior.mean()),
        tf.zeros_like(kernel_posterior.mean()) + tf.nn.softplus(init_std)), 2)

    bias_prior = None
    bias_posterior = tfd.Deterministic(tf.get_variable(
        'bias_mean',
        (layer_sizes[-1],),
        tf.float32,
        tf.constant_initializer(0.0)))
    logits = tfp.layers.DenseReparameterization(
        layer_sizes[-1],
        kernel_prior_fn=lambda *args, **kwargs: kernel_prior,
        kernel_posterior_fn=lambda *args, **kwargs: kernel_posterior,
        bias_prior_fn=lambda *args, **kwargs: bias_prior,
        bias_posterior_fn=lambda *args, **kwargs: bias_posterior)(hidden)

    standard_loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits=logits,
                labels=labels)
            )
    logits = logits - tf.reduce_mean(logits)
    class_probabilities = tf.nn.softmax(
            logits * tf.constant(ncp_scale, dtype=tf.float32))
    entropy = -class_probabilities * tf.log(
            tf.clip_by_value(class_probabilities, 1e-20, 1))
    if NORMALIZE_ENTROPY is True:
        baseK = tf.constant(
                layer_sizes[-1],
                dtype=tf.float32,
                shape=(layer_sizes[-1],))
        entropy /= tf.log(baseK)
    mean, variance = tf.nn.moments(entropy, axes=[1])
    ncp_loss = tf.reduce_mean(mean)
    ncp_std = tf.reduce_mean(tf.math.sqrt(variance))
    return standard_loss, ncp_loss, logits, ncp_std


In [2]:
# code running Bayesian network, on the rotate experiment
# parameters and more details are in the file itself:
# see ncp_classifier/models/mnist_bayesian.py

#Reset tf graph in case you have already run one of our other experiments in this notebook session:
tf.reset_default_graph()

from ncp_classifier.models.mnist_bayesian import rotation_experiment

rotation_experiment()

## During training, log files will be periodically saved in ncp_classifier/logs
#Full id_acc:     accuracy over the entire partition of in-distribution data
#Full od_acc:     accuracy over the entire partition of out-of-distribution data
    
#(Ignore Futurewarning)


Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

Epoch: 0001 cost=1.385920509
Epoch: 0002 cost=0.260288010
Epoch: 0003 cost=0.155769217
Epoch: 0004 cost=0.127837482
Epoch: 0005 cost=0.114170436
Epoch: 0006 cost=0.113825693
Epoch: 0007 cost=0.101082829
Epoch: 0008 cost=0.112927634
Epoch: 0009 cost=0.107083318
Epoch: 0010 cost=0.110372019
Epoch: 0011 cost=0.110643857
Epoch: 0012 cost=0.096120410
Epoch: 0013 cost=0.077545325
Epoch: 0014 cost=0.077223694
Epoch: 0015 cost=0.116040999
Epoch: 0016 cost=0.061386717
Epoch: 0017 cost=0.063121414
Epoch: 0018 cost=0.064915277
Epoch: 0019 cost=0.064451125
Epoch: 0020 cost=0.062958490
Optimization Finished!
Full id_acc: 0.9887759
Full od_acc: 0.28844398




Epoch: 0001 cost=1.401556650
Epoch: 0002 cost=0.261669069
Epoch: 0003 cost=0.165492805
Epoch: 0004 cost=0.128131353
Epoch: 0005 cost=0.112207308
Epoch: 0006 cost=0.116970898
Epoch: 0007 cost=0.105518784
Epoch: 0008 cost=0.107642690
Epoch: 0009 cost=0.118183563
Epoch: 0010 cost=0.132875464
Epoch: 0011 cost=0.091409684
Epoch: 0012 cost=0.096308317
Epoch: 0013 cost=0.083674399
Epoch: 0014 cost=0.084087519
Epoch: 0015 cost=0.083728114
Epoch: 0016 cost=0.083450293
Epoch: 0017 cost=0.068298592
Epoch: 0018 cost=0.084664288
Epoch: 0019 cost=0.066607926
Epoch: 0020 cost=0.055899209
Optimization Finished!
Full id_acc: 0.9842946
Full od_acc: 0.2263278




Epoch: 0001 cost=1.415993463
Epoch: 0002 cost=0.269202844
Epoch: 0003 cost=0.166176918
Epoch: 0004 cost=0.128974009
Epoch: 0005 cost=0.115458864
Epoch: 0006 cost=0.114192300
Epoch: 0007 cost=0.129313072
Epoch: 0008 cost=0.105124109
Epoch: 0009 cost=0.098642305
Epoch: 0010 cost=0.111410574
Epoch: 0011 cost=0.113569829
Epoch: 0012 cost=0.125822809
Epoch: 0013 cost=0.090212189
Epoch: 0014 cost=0.079191717
Epoch: 0015 cost=0.081612732
Epoch: 0016 cost=0.085531130
Epoch: 0017 cost=0.082816751
Epoch: 0018 cost=0.062391651
Epoch: 0019 cost=0.076741228
Epoch: 0020 cost=0.075102939
Optimization Finished!
Full id_acc: 0.984917
Full od_acc: 0.1742946




Epoch: 0001 cost=1.414175890
Epoch: 0002 cost=0.266679985
Epoch: 0003 cost=0.173073583
Epoch: 0004 cost=0.138220014
Epoch: 0005 cost=0.126533599
Epoch: 0006 cost=0.104097093
Epoch: 0007 cost=0.127592015
Epoch: 0008 cost=0.118197800
Epoch: 0009 cost=0.102483354
Epoch: 0010 cost=0.110217021
Epoch: 0011 cost=0.125598973
Epoch: 0012 cost=0.090255968
Epoch: 0013 cost=0.081311449
Epoch: 0014 cost=0.099984785
Epoch: 0015 cost=0.090556699
Epoch: 0016 cost=0.069995164
Epoch: 0017 cost=0.061335341
Epoch: 0018 cost=0.082666107
Epoch: 0019 cost=0.075507766
Epoch: 0020 cost=0.084146202
Optimization Finished!
Full id_acc: 0.9843361
Full od_acc: 0.37981328




Epoch: 0001 cost=1.430252694
Epoch: 0002 cost=0.274533997
Epoch: 0003 cost=0.174690929
Epoch: 0004 cost=0.138522227
Epoch: 0005 cost=0.114805660
Epoch: 0006 cost=0.107113865
Epoch: 0007 cost=0.110320976
Epoch: 0008 cost=0.123253561
Epoch: 0009 cost=0.122440584
Epoch: 0010 cost=0.112358964
Epoch: 0011 cost=0.104228913
Epoch: 0012 cost=0.098558531
Epoch: 0013 cost=0.086626781
Epoch: 0014 cost=0.076850010
Epoch: 0015 cost=0.101666160
Epoch: 0016 cost=0.093148136
Epoch: 0017 cost=0.087364781
Epoch: 0018 cost=0.057816398
Epoch: 0019 cost=0.057671977
Epoch: 0020 cost=0.060922634
Optimization Finished!
Full id_acc: 0.9896473
Full od_acc: 0.26259336




Epoch: 0001 cost=1.456774112
Epoch: 0002 cost=0.263717954
Epoch: 0003 cost=0.171987178
Epoch: 0004 cost=0.132333702
Epoch: 0005 cost=0.116206844
Epoch: 0006 cost=0.116332592
Epoch: 0007 cost=0.116534928
Epoch: 0008 cost=0.115032876
Epoch: 0009 cost=0.108660011
Epoch: 0010 cost=0.103829893
Epoch: 0011 cost=0.125688366
Epoch: 0012 cost=0.104688706
Epoch: 0013 cost=0.071455682
Epoch: 0014 cost=0.084175305
Epoch: 0015 cost=0.109452106
Epoch: 0016 cost=0.072623637
Epoch: 0017 cost=0.082576785
Epoch: 0018 cost=0.086227874
Epoch: 0019 cost=0.052669654
Epoch: 0020 cost=0.050251905
Optimization Finished!
Full id_acc: 0.98470956
Full od_acc: 0.18678424




Epoch: 0001 cost=1.445305963
Epoch: 0002 cost=0.262265727
Epoch: 0003 cost=0.174942650
Epoch: 0004 cost=0.131063823
Epoch: 0005 cost=0.119551799
Epoch: 0006 cost=0.102295861
Epoch: 0007 cost=0.131355289
Epoch: 0008 cost=0.122235612
Epoch: 0009 cost=0.102608946
Epoch: 0010 cost=0.102187467
Epoch: 0011 cost=0.102250096
Epoch: 0012 cost=0.098785785
Epoch: 0013 cost=0.114411853
Epoch: 0014 cost=0.077399770
Epoch: 0015 cost=0.071134232
Epoch: 0016 cost=0.085112779
Epoch: 0017 cost=0.086233776
Epoch: 0018 cost=0.071606815
Epoch: 0019 cost=0.060250390
Epoch: 0020 cost=0.068694641
Optimization Finished!
Full id_acc: 0.9880913
Full od_acc: 0.18317427




Epoch: 0001 cost=1.436690042
Epoch: 0002 cost=0.263301497
Epoch: 0003 cost=0.169596112
Epoch: 0004 cost=0.135675165
Epoch: 0005 cost=0.114188269
Epoch: 0006 cost=0.123624373
Epoch: 0007 cost=0.122770524
Epoch: 0008 cost=0.107927109
Epoch: 0009 cost=0.124109505
Epoch: 0010 cost=0.111593101
Epoch: 0011 cost=0.116172210
Epoch: 0012 cost=0.086585968
Epoch: 0013 cost=0.115434562
Epoch: 0014 cost=0.073888725
Epoch: 0015 cost=0.070838030
Epoch: 0016 cost=0.091828810
Epoch: 0017 cost=0.075095684
Epoch: 0018 cost=0.080244831
Epoch: 0019 cost=0.065618461
Epoch: 0020 cost=0.096415291
Optimization Finished!
Full id_acc: 0.9841909
Full od_acc: 0.28838176




Epoch: 0001 cost=1.431007665
Epoch: 0002 cost=0.265647287
Epoch: 0003 cost=0.170975210
Epoch: 0004 cost=0.132090695
Epoch: 0005 cost=0.120762899
Epoch: 0006 cost=0.106702059
Epoch: 0007 cost=0.118220635
Epoch: 0008 cost=0.127847743
Epoch: 0009 cost=0.112342823
Epoch: 0010 cost=0.102962083
Epoch: 0011 cost=0.100329899
Epoch: 0012 cost=0.085683854
Epoch: 0013 cost=0.095772201
Epoch: 0014 cost=0.094640945
Epoch: 0015 cost=0.085296139
Epoch: 0016 cost=0.087824001
Epoch: 0017 cost=0.066217948
Epoch: 0018 cost=0.088257882
Epoch: 0019 cost=0.060709250
Epoch: 0020 cost=0.057652198
Optimization Finished!
Full id_acc: 0.9784647
Full od_acc: 0.19904564




Epoch: 0001 cost=1.429397490
Epoch: 0002 cost=0.263364158
Epoch: 0003 cost=0.167762156
Epoch: 0004 cost=0.133689059
Epoch: 0005 cost=0.113967265
Epoch: 0006 cost=0.116958645
Epoch: 0007 cost=0.118398267
Epoch: 0008 cost=0.122569570
Epoch: 0009 cost=0.117463648
Epoch: 0010 cost=0.092174112
Epoch: 0011 cost=0.103138497
Epoch: 0012 cost=0.124186643
Epoch: 0013 cost=0.086301818
Epoch: 0014 cost=0.075712571
Epoch: 0015 cost=0.090721949
Epoch: 0016 cost=0.091949441
Epoch: 0017 cost=0.062874140
Epoch: 0018 cost=0.070409531
Epoch: 0019 cost=0.061815710
Epoch: 0020 cost=0.077599688
Optimization Finished!
Full id_acc: 0.9781535
Full od_acc: 0.21954356




### Results for Bayesian network

<img src="./images/bayesian1.png" width="500" />

<img src="./images/bayesian2.png" width="500" />



### Ongoing / Future Work

-We used a simple MLP with a few layers, which is good enough for MNIST, but more powerful NN architectures could also benefit from this Bayesian perspective and from NCPs.

-It is worth trying other experiments to generate OOD instances and to understand how in-distribution regions transition into out-of-distribution regions. Two kinds of interpolation could be interesting. The first is to **interpolate from in-distribution data to out-of-distribution data**, borrowing ideas from the "Mixup" training procedure but applying them at test time as a form of analysis rather than during training as data augmentation. Mixup takes a convex combination in data space between pairs of inputs (and likewise for their labels). For a holdout set, we could instead vary the mixing parameter lambda over [0,1] so that the inputs move from an in-distribution training point (lambda=0) to something that is definitely OOD, such as an omitted MNIST digit (lambda=1), and then look at the uncertainty as a function of lambda, i.e. as a function of distance from the in-distribution data. Roughly speaking, we would expect a Bayesian NN with NCPs to show increasing uncertainty along this path, whereas without NCPs the network may remain overconfident, with uncertainty that changes little even for inputs we know are OOD. This remains to be confirmed. One concern is that linear interpolation directly in data space, e.g. between a "7" and a "9", may be harder to interpret than we would hope: the intermediate images may resemble neither digit, so the OOD "9" at lambda=1 could look more like a realistic input than any intermediate point. There is therefore no reason to expect the uncertainty to increase monotonically with distance from the in-distribution data. It would still be interesting to compare against a regular Bayesian NN without NCPs.
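A minimal sketch of this test-time interpolation is shown below. The arrays are random stand-ins for real digits and `interpolate_inputs` is a hypothetical helper; evaluating the trained classifier on each interpolated image is left out.

```python
import numpy as np

def interpolate_inputs(x_in, x_out, n_steps):
    """Return lambdas in [0, 1] and the convex combinations (1 - lam) * x_in + lam * x_out."""
    lambdas = np.linspace(0.0, 1.0, n_steps)
    # Broadcast lambdas over the image dimensions: result has shape (n_steps, H, W).
    lam = lambdas[:, None, None]
    return lambdas, (1.0 - lam) * x_in + lam * x_out

rng = np.random.default_rng(0)
x_in = rng.random((28, 28))    # stand-in for an in-distribution digit, e.g. a "7"
x_out = rng.random((28, 28))   # stand-in for an omitted digit, e.g. a "9"

lambdas, path = interpolate_inputs(x_in, x_out, n_steps=11)
print(path.shape)
# Each path[i] would be fed to the trained classifier and its predictive
# entropy plotted against lambdas[i].
```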

-On the other hand, we can also **interpolate between two categories of in-distribution data**. For the MNIST digits, this could be e.g. going from a “3” to a “7.” We would expect that for data as high dimensional as MNIST images, the set would be non-convex. So when we interpolate between two points in the set, we are very likely to move outside of the set and get to OOD regions. The issue here is how meaningful those regions may be. Although they would be OOD, would they be useful/realistic? What does a digit that is halfway between a “3” and a “7” tell you about digits you have never seen before?


-Another interesting experiment could build on the KL divergence term of the NCP loss used by Hafner et al. In the univariate regression setting, they use a network with two output nodes: one for a Gaussian mean and the other for its variance. Extending their ideas from regression to classification would be very useful but requires some changes. It would also be novel: NCPs are fairly new (the first version of the paper was submitted in July 2018), and the authors appear to focus on active learning in the regression setting. Here we outline one possibility for this approach. The KL divergence NCP term requires a prior on the label space. If we think of the network's final K-category softmax output vector as a draw from a K-dimensional distribution, we see that it has a few properties that lend themselves well to Bayesian modeling:

1) Considering the traditional softmax output, the elements are constrained to sum to 1.

2) The elements correspond to probabilities so are nonnegative

These conditions suggest a Dirichlet prior on the label space could be appropriate. First, the support of the Dirichlet distribution is nonnegative. Second, for the K-dimensional Dirichlet distribution, the support is a K-1 simplex, so it automatically has the property that the elements sum to 1.

To incorporate this into the NCP KL divergence term, to encourage higher uncertainty for OOD data, we could use the KL divergence to a Dirichlet distribution that has the mean vector as the equiprobable point (all dimensions equally uncertain). For in-distribution data, the Dirichlet would be parameterized such that the mode is at the corner of the simplex corresponding to the given class label.

One caveat is that the concentration parameters of the Dirichlet distribution must be positive but are allowed to be greater than 1, so we cannot keep our softmax layer, which imposes too strong a constraint. Instead we can use, for example, a softplus layer, i.e. log(1 + exp(x)), which is positive and can be arbitrarily large. The interpretation of the output layer changes slightly: it now represents the parameters of a distribution whose samples have the properties we require of any probabilistic classification decision.
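The following sketch illustrates the idea under stated assumptions: raw network outputs are mapped to concentration parameters with a softplus, and the standard formulas for the Dirichlet mean and per-component variance are evaluated. The raw output values are hypothetical, and this is not an implementation of the proposed network.

```python
import numpy as np

def softplus(x):
    """log(1 + exp(x)), computed stably; strictly positive for finite x."""
    return np.logaddexp(0.0, x)

def dirichlet_mean_and_variance(concentration):
    """Mean and per-component variance of Dirichlet(concentration)."""
    total = concentration.sum()
    mean = concentration / total
    variance = mean * (1.0 - mean) / (total + 1.0)
    return mean, variance

# Hypothetical raw network outputs for K = 4 classes.
raw_confident = np.array([8.0, -3.0, -3.0, -3.0])
raw_uncertain = np.array([0.0, 0.0, 0.0, 0.0])

for raw in (raw_confident, raw_uncertain):
    alpha = softplus(raw)
    mean, variance = dirichlet_mean_and_variance(alpha)
    print("alpha:", alpha, "mean:", mean, "variance:", variance)
```

Equal concentrations put the mean at the equiprobable point, while one dominant concentration moves the mean toward the corresponding corner of the simplex.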

Finally, one interesting aspect of this Dirichlet label prior is how the concentration parameters (the alpha vector) simultaneously control both the mean and the covariances of the distribution over classification probabilities. Compare this to our current work, which uses an entropy uncertainty penalty that is invariant to permutations: the entropy is the same as long as the probability vector contains the same elements, regardless of their order or structure. For this MNIST classification task, that may be a bad property. Imagine the probability vector of an MNIST classifier that assigns equal classification probability to "0" and "8" (which is understandable) versus another classifier that assigns equally high probabilities to "0" and "4" (which is less excusable). The current entropy formulation would give the same loss in both cases, even though one is more tolerable than the other.
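The permutation invariance is easy to check numerically. The two probability vectors below are illustrative outputs of a hypothetical 10-class MNIST classifier:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector; 0 * log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Class order is digits 0..9. Both vectors split the mass evenly between two digits.
confuses_0_and_8 = [0.5, 0, 0, 0, 0, 0, 0, 0, 0.5, 0]
confuses_0_and_4 = [0.5, 0, 0, 0, 0.5, 0, 0, 0, 0, 0]

print(entropy(confuses_0_and_8), entropy(confuses_0_and_4))
```

Both calls return log(2), so an entropy-based penalty cannot distinguish the two kinds of confusion.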

The network would have K output neurons, each representing a concentration parameter of the Dirichlet. The network could look something like this:

<img src="./images/NCP_BNN_dirichlet_categorical_classifier.PNG" width="800" />

--------------------------------------------------------------------------------------



Another open question is what happens if we include output categories that act as anomaly detectors, i.e. an additional neuron that serves as a catch-all category for unfamiliar inputs. Perhaps the best solution would be a dynamic network that can learn to increase or decrease the number of classification categories on the fly. In other words, a dynamic graph that can prune or grow nodes and whose architecture changes during learning.


### APPENDIX: Additional Experiments

### Experiment 2: “rotate”

As described earlier, a priori we might expect that a transformation as simple as rotation alone would not move points far enough from the in-distribution data into OOD regions worth sampling, i.e. by rotating a “7” by an arbitrary amount we will never recover an “8”. We seek to confirm or refute this with a rotation experiment.
For a given set of holdout digit classes (e.g. {8,9}) and a reasonable alpha value, we vary the range of angles through which the in-distribution digits may be rotated to produce synthetic OOD samples. We retrain the network from the same random seed 10 times, each time with a larger range of allowed angles, from 0 degrees (no rotation at all, so the synthetic OOD data looks exactly like the in-distribution data) to 180 degrees. The actual angle of rotation for each image is chosen uniformly at random from [-Theta, Theta].
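The sampling scheme described above can be sketched as follows. This is not the code in `ncp_classifier/models/mnist_det.py`: the function names, the evenly spaced Theta values, and the nearest-neighbor rotation on plain nested lists are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def max_angles(num_runs=10, largest=180.0):
    """Rotation limits Theta for each run, assumed evenly spaced from 0 to `largest`."""
    return [largest * i / (num_runs - 1) for i in range(num_runs)]


def rotate_image(image, degrees):
    """Rotate a square image (list of rows) about its center, nearest-neighbor."""
    n = len(image)
    c = (n - 1) / 2.0
    cos_t = math.cos(math.radians(degrees))
    sin_t = math.sin(math.radians(degrees))
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for col in range(n):
            # Inverse mapping: find the source pixel that lands on (r, col).
            y, x = r - c, col - c
            src_r = int(round(cos_t * y + sin_t * x + c))
            src_c = int(round(-sin_t * y + cos_t * x + c))
            if 0 <= src_r < n and 0 <= src_c < n:
                out[r][col] = image[src_r][src_c]
    return out


def synthetic_ood(images, theta, rng):
    """Rotate each image by an angle drawn uniformly from [-theta, theta]."""
    return [rotate_image(img, rng.uniform(-theta, theta)) for img in images]


rng = random.Random(0)
toy_image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # stand-in for a 28x28 digit
for theta in max_angles():
    batch = synthetic_ood([toy_image], theta, rng)
```

With Theta = 0 the synthetic OOD batch is identical to the in-distribution batch, which matches the first run described above.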


In [4]:
# code running the deterministic network on the rotate experiment
# parameters and more details are in the file itself:
# see ncp_classifier/models/mnist_det.py

#Reset tf graph in case you have already run one of our other experiments in this notebook session:
tf.reset_default_graph()

from ncp_classifier.models.mnist_det import rotation_experiment

rotation_experiment()

## During training, log files will be periodically saved in ncp_classifier/logs
#Full id_acc:     accuracy over the entire partition of in-distribution data
#Full od_acc:     accuracy over the entire partition of out-of-distribution data
    
#(Ignore FutureWarning)

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

Epoch: 0001 cost=6.094564092
Epoch: 0002 cost=1.237203221
Optimization Finished!
Full id_acc: 0.8814315
Full od_acc: 0.73116183
Epoch: 0001 cost=6.047517842
Epoch: 0002 cost=1.237732139
Optimization Finished!
Full id_acc: 0.8819502
Full od_acc: 0.7116805
Epoch: 0001 cost=5.939010039
Epoch: 0002 cost=1.202573476
Optimization Finished!
Full id_acc: 0.8897925
Full od_acc: 0.6310996
Epoch: 0001 cost=5.840448677
Epoch: 0002 cost=1.175111916
Optimization Finished!
Full id_acc: 0.8985477
Full od_acc: 0.5657469
Epoch: 0001 cost=5.741275329
Epoch: 0002 cost=1.152782226
Optimization Finished!
Full id_acc: 0.90678424
Full od_acc: 0.511473
Epoch: 0001 cost=5.667923265
Epoch: 0002 cost=1.134241587
Optimization Finished!
Full id_acc: 0.9085892
Full od_acc: 0.46201244
Epoch: 0001 cost=5.612907799
Epoch: 0002 cost

In [3]:
# The above training saves out data in ncp_classifier/logs
# Once above training is done, we can plot the results:

from ncp_classifier.scripts.plotting import run_plotting

# Run the plotting script
run_plotting('rotate')

# Output is saved in ncp_classifier/output/rotate

# Example plots are shown below

Running plotting script with experiment type
rotate
['ncp_classifier\\logs\\log_ncp_on__rotate_0.p', 'ncp_classifier\\logs\\log_ncp_on__rotate_1.p', 'ncp_classifier\\logs\\log_ncp_on__rotate_2.p', 'ncp_classifier\\logs\\log_ncp_on__rotate_3.p', 'ncp_classifier\\logs\\log_ncp_on__rotate_4.p']


<img src="./images/det_rotate_CE_loss.png" width="500" />

<img src="./images/det_rotate_entropy.png" width="500" />

### Experiment 3: “successive holdout of digit classes”

In this experiment, we look at the effects of holding out (“omitting”) more and more digit classes, i.e. we try various bipartitions of the set {0,…,9} that leave successively fewer digit categories in the in-distribution training set. We constrain the problem to have at least two classes for in-distribution training, and at least one digit class omitted. In terms of training accuracy, we might expect the easiest classification problem to be when there are two classes, i.e. binary classification (instead of multiclass classification with some digits being visually similar). In this case we might expect uncertainty during training to also be low, and perhaps to lead to overconfident decisions when the model sees omitted (unseen) classes. Also, we might reason that the highest uncertainty on the omitted classes would occur when there are very many different omitted categories.
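One simple way to enumerate such bipartitions is sketched below. The choice of nested splits (always keeping the lowest digits in-distribution) and the function name `holdout_splits` are assumptions for illustration; the actual splits used are defined in `digits_out_experiment`.

```python
def holdout_splits(digits=tuple(range(10)), min_in_dist=2, min_omitted=1):
    """Nested (in-distribution, omitted) splits with successively fewer ID classes."""
    splits = []
    # k is the number of in-distribution classes, shrinking from 9 down to 2.
    for k in range(len(digits) - min_omitted, min_in_dist - 1, -1):
        splits.append((set(digits[:k]), set(digits[k:])))
    return splits


for in_dist, omitted in holdout_splits():
    print(sorted(in_dist), sorted(omitted))
```

The last split produced this way is the binary case with in-distribution set {0,1} and omitted set {2,...,9}.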



In [None]:
## code running the deterministic network on the successive holdout ("digits out") experiment
## parameters and more details are in the file itself:
## see ncp_classifier/models/mnist_det.py


from ncp_classifier.models.mnist_det import digits_out_experiment

digits_out_experiment()

## During training, log files will be periodically saved in ncp_classifier/logs
#Full id_acc:     accuracy over the entire partition of in-distribution data
#Full od_acc:     accuracy over the entire partition of out-of-distribution data
    
#(Ignore FutureWarning)

In [None]:
#Run the plotting script
run_plotting('digout')

#Output is saved in ncp_classifier/output/alpha

#Example plots are shown below, for a case with
#a small in-distribution set {0,1}
#and a large omitted set {2,3,4,5,6,7,8,9}:

<img src="./images/det_digout_accuracy.png" width="500" />

<img src="./images/det_digout_entropy.png" width="500" />