# Introduction

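In this tutorial we review the concepts of Bayesian Neural Networks (BNN) and Variational Inference (VI), and then look at how to implement Monte-Carlo DropOut (MCDO), a method for realizing them in deep learning. We also look at how to apply MCDO to architectures other than Dense layers, such as CNNs and RNNs.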

# Recap : Bayesian NN (BNN) and Variational Inference (VI)

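Training in ordinary deep learning proceeds as follows.

* Define the network architecture and initialize the parameters
* Define the loss function
* Find the parameters that minimize the loss function (with SGD)

Because this procedure finds a single parameter value that minimizes the loss function, it is described as a frequentist approach. How, then, can we turn it into a Bayesian approach that finds a posterior distribution? We first define a prior distribution over the parameters ($p(\theta)$) and then modify the procedure above as follows.

* Define the network architecture and the prior distribution of the parameters ($p(\theta)$)
* Define the likelihood function ($p(\mathcal{D}|\theta)$)
* Apply Bayes' theorem ($p(\theta| \mathcal{D}) = \frac{p(\theta,\mathcal{D})}{p(\mathcal{D})} = \frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})}$) to find the posterior distribution of the parameters

Bayes' theorem requires the following integral, which is very hard to compute for a neural network.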

$$
p(\mathcal{D}) = \int_{\Omega} p(\mathcal{D} | \theta) p(\theta) d\theta
$$
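
To make the cost of this integral concrete, here is a minimal sketch for a hypothetical one-parameter model that is not part of the tutorial: a coin with unknown heads probability $\theta$, a uniform prior, and `k` heads observed in `n` flips. The function names and the toy model are assumptions made for illustration. In one dimension a simple grid sum approximates $p(\mathcal{D})$ well.

```python
import math

def evidence_grid(k, n, grid_size=10000):
    """Midpoint-rule estimate of p(D) = integral of p(D|theta) p(theta) over [0, 1].

    Assumed toy model: k heads in a fixed sequence of n coin flips,
    with a uniform prior p(theta) = 1 on [0, 1].
    """
    width = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * width
        likelihood = theta ** k * (1.0 - theta) ** (n - k)
        total += likelihood * 1.0 * width  # prior density is 1
    return total

def evidence_exact(k, n):
    """Closed form for the same model: the Beta function B(k + 1, n - k + 1)."""
    return math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)

approx = evidence_grid(k=7, n=10)
exact = evidence_exact(k=7, n=10)
print(approx, exact)
```

The same grid over $d$ parameters would need `grid_size ** d` likelihood evaluations, which is why this direct approach is not feasible for the weights of a neural network.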

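To sidestep this computation, we can characterize the posterior distribution of the parameters, $p(\theta | \mathcal{D})$, as the solution of an optimization problem.

$$
p(\theta | \mathcal{D} ) = \arg \inf_{q} \Big\{ -\int_{\Omega} \log p(\mathcal{D} | \theta ) q(\theta)d\theta + D_{KL}(q(\theta)\| p(\theta)) \Big\}
$$

Here the first term, $-\int_{\Omega} \log p(\mathcal{D} | \theta ) q(\theta)d\theta$, measures how well the proposed distribution $q(\theta)$ explains the data, that is, how large it makes the log-likelihood $\log p(\mathcal{D} | \theta )$ on average,
and the second term, $D_{KL}(q(\theta)\| p(\theta))$, measures how close the proposed distribution $q(\theta)$ is to the prior $p(\theta)$. The identity holds because the objective equals $D_{KL}(q(\theta)\| p(\theta | \mathcal{D})) - \log p(\mathcal{D})$, which is smallest when $q$ is the posterior.

In other words, the posterior $p(\theta | \mathcal{D})$ that we want is the solution, found over **arbitrary parameter distributions**, to the problem of explaining the data well while staying close to the prior.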

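However, the set of **arbitrary parameter distributions** is hard to optimize over, so we restrict it to a smaller family of distributions that is easier to optimize. For example, if we assume that

* the parameters are mutually independent, and
* each parameter follows a Gaussian distribution,

then the distribution is described entirely by **the mean and standard deviation of each Gaussian**. Among such distributions, the one that minimizes the objective above can be found by optimizing the mean and standard deviation of each Gaussian (with SGD). Whereas the exact posterior is the optimum over **arbitrary parameter distributions**, variational inference (VI) finds an approximate posterior by optimizing over **a smaller family of distributions that is suitable for optimization**. In summary, VI for a BNN proceeds as follows.

* Define the network architecture and the prior distribution of the parameters ($p(\theta)$)
* Define the likelihood function ($p(\mathcal{D}|\theta)$)
* Define the family of distributions to optimize over (e.g. mutually independent Gaussians)
* Find the variational parameters that minimize the expected negative log-likelihood plus the KL divergence from the prior (with SGD)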
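
The steps above can be sketched on a one-parameter problem where everything has a closed form. This is an illustrative assumption, not the tutorial's model: the data follow $x_i \sim \mathcal{N}(\theta, 1)$, the prior is $\theta \sim \mathcal{N}(0, 1)$, and the variational family is $q(\theta) = \mathcal{N}(m, s^2)$. Because the toy objective is available in closed form, the sketch uses exact gradient descent rather than SGD.

```python
import math

def fit_gaussian_vi(data, lr=0.01, n_steps=2000):
    """Gradient descent on F(m, s) = E_q[-log p(D|theta)] + KL(q || p).

    Assumed toy model: x_i ~ N(theta, 1), prior theta ~ N(0, 1),
    variational family q(theta) = N(m, s^2) with s = exp(rho).
    Up to a constant, F = 0.5 * sum((x_i - m)^2) + 0.5 * n * s^2
                          + 0.5 * (s^2 + m^2 - 1) - log(s).
    """
    n = len(data)
    total = sum(data)
    m, rho = 0.0, 0.0
    for _ in range(n_steps):
        s2 = math.exp(2.0 * rho)
        grad_m = (n + 1) * m - total   # dF/dm
        grad_rho = (n + 1) * s2 - 1.0  # dF/drho = s * dF/ds
        m -= lr * grad_m
        rho -= lr * grad_rho
    return m, math.exp(rho)

data = [0.8, 1.2, 1.0, 0.6, 1.4]
m, s = fit_gaussian_vi(data)
# Exact posterior for this conjugate model: N(sum(x) / (n + 1), 1 / (n + 1)).
print(m, s ** 2)
```

In this conjugate case the exact posterior is itself Gaussian, so the restricted family contains it and the optimized `m` and `s ** 2` should match the exact posterior mean and variance. For a neural network the family generally does not contain the posterior, and the result is only an approximation.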

MCDO, described next, defines the family of distributions to optimize over as **all parameter distributions that can be expressed with DropOut**.

# Recap : DropOut and Monte-Carlo DropOut (MCDO)

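Standard DropOut activates the network differently at train and test time.

* Train : drop each layer's inputs at random with probability $p$, setting them to 0, and compute with the result ("thinned" networks).
* Test : use all of each layer's inputs, but multiply the weight matrix by the keep probability $1-p$ for evaluation.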
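
The two modes can be sketched in plain Python as follows. This is a minimal illustration with hypothetical helper names; `p` is the drop probability, and scaling the inputs by the keep probability $1-p$ at test time is equivalent to scaling the weight matrix of the linear layer that follows.

```python
import random

def dropout_train(x, p, rng):
    """Train mode: zero each input independently with drop probability p."""
    return [0.0 if rng.random() < p else v for v in x]

def dropout_test(x, p):
    """Test mode: keep every input and scale by the keep probability 1 - p."""
    return [(1.0 - p) * v for v in x]

rng = random.Random(2019)
x = [1.0, -2.0, 0.5]
p = 0.3

# Averaging many thinned inputs approaches the scaled test-time input.
n_samples = 20000
sums = [0.0] * len(x)
for _ in range(n_samples):
    for i, v in enumerate(dropout_train(x, p, rng)):
        sums[i] += v
train_mean = [s / n_samples for s in sums]
print(train_mean, dropout_test(x, p))
```

Note that `tf.nn.dropout`, used in the code of this tutorial, implements the "inverted" variant: it scales the kept units by $1/(1-p)$ at train time, so no rescaling is needed when the rate is set to 0 at test time.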

![](./images/dropout.png)

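MCDO, in contrast, applies dropout to the network in the same way at both train and test time. That is, at test time MCDO also computes a different output depending on which inputs DropOut drops, and the final prediction is formed from these stochastic outputs (the `evaluate_MNIST` function later in this tutorial averages them over 30 forward passes).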
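
A minimal sketch of this test-time procedure, assuming a toy linear model with inverted dropout on its inputs. The model, the weights, and the function names are hypothetical and only illustrate the sampling loop.

```python
import math
import random

def forward(x, weights, rate, rng):
    """One stochastic pass of a toy linear model with inverted dropout on x."""
    keep = 1.0 - rate
    out = 0.0
    for w, v in zip(weights, x):
        mask = 1.0 / keep if rng.random() < keep else 0.0
        out += w * v * mask
    return out

def mc_dropout_predict(x, weights, rate, n_samples, rng):
    """Return the mean and standard deviation over n_samples stochastic passes."""
    outs = [forward(x, weights, rate, rng) for _ in range(n_samples)]
    mean = sum(outs) / n_samples
    var = sum((o - mean) ** 2 for o in outs) / n_samples
    return mean, math.sqrt(var)

rng = random.Random(2019)
weights = [0.5, -1.0, 2.0]  # hypothetical trained weights
x = [1.0, 2.0, 3.0]
mean, std = mc_dropout_predict(x, weights, rate=0.3, n_samples=5000, rng=rng)
print(mean, std)
```

The mean over passes serves as the prediction, and the spread across passes gives a rough indication of how uncertain the model is about this input.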

# MCDO on FFNN

In [1]:
# The network structure of this tutorial is based on 
# https://github.com/jwlee-ml/Tensorflow_FastCampus_9th

# import tf and set random seed
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

tf.set_random_seed(2019)

In [2]:
class MLP:
    
    def __init__(self, n_hidden=512, n_output=10):
        self.n_hidden = n_hidden
        self.n_output = n_output
        
        # define placeholders for training MLP
        self.x = tf.placeholder(tf.float32, shape=[None, 784])
        self.y = tf.placeholder(tf.float32, [None, 10])
        self.dropout_rate = tf.placeholder(tf.float32)
        self.lr = tf.placeholder(tf.float32)
        
        self.logit, self.loss, self.opt = self.build_graph(n_hidden=self.n_hidden, n_output=self.n_output)

    def build_graph(self, n_hidden=512, n_output=10):
        """
        Build computational graph for MNIST training with DropOut.

        Args:
            n_hidden : the number of hidden units in each hidden layer.
            n_output : the size of output layer (=10)

        Uses the placeholders defined in __init__:
            self.x : MNIST input data. (m, 784)
            self.dropout_rate : dropout rate for dropout layers in MLP (tf.placeholder)

        Returns:
            logit : unnormalized class scores (logits) of prediction. (m, 10)
            loss : cross entropy loss 
            opt : Adam training op that minimizes the loss
        """

        with tf.variable_scope('mlp'):
            # initializers for weight and bias
            w_init = tf.contrib.layers.variance_scaling_initializer()
            b_init = tf.constant_initializer(0.)

            # 1st hidden layer
            w0 = tf.get_variable('w0', [self.x.get_shape()[1], n_hidden], initializer=w_init)
            b0 = tf.get_variable('b0', [n_hidden], initializer=b_init)
            h0 = tf.matmul(self.x, w0) + b0
            h0 = tf.nn.relu(h0)
            h0 = tf.nn.dropout(h0, rate=self.dropout_rate)

            # 2nd hidden layer
            w1 = tf.get_variable('w1', [h0.get_shape()[1], n_hidden], initializer=w_init)
            b1 = tf.get_variable('b1', [n_hidden], initializer=b_init)
            h1 = tf.matmul(h0, w1) + b1
            h1 = tf.nn.relu(h1)
            h1 = tf.nn.dropout(h1, rate=self.dropout_rate)

            # 3rd hidden layer
            w2 = tf.get_variable('w2', [h1.get_shape()[1], n_hidden], initializer=w_init)
            b2 = tf.get_variable('b2', [n_hidden], initializer=b_init)
            h2 = tf.matmul(h1, w2) + b2
            h2 = tf.nn.relu(h2)
            h2 = tf.nn.dropout(h2, rate=self.dropout_rate)

            # 4th hidden layer
            w3 = tf.get_variable('w3', [h2.get_shape()[1], n_hidden], initializer=w_init)
            b3 = tf.get_variable('b3', [n_hidden], initializer=b_init)
            h3 = tf.matmul(h2, w3) + b3
            h3 = tf.nn.relu(h3)
            h3 = tf.nn.dropout(h3, rate=self.dropout_rate)

            # output layer
            wo = tf.get_variable('wo', [h3.get_shape()[1], n_output], initializer=w_init)
            bo = tf.get_variable('bo', [n_output], initializer=b_init)
            logit = tf.matmul(h3, wo) + bo
            
            # loss
            loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
            logits=logit, labels=self.y))
            
            # optimizer
            opt = tf.train.AdamOptimizer(learning_rate=self.lr).minimize(loss)

        return logit, loss, opt

In [3]:
mlp = MLP()
sess = tf.Session()
# get MNIST data
mnist = input_data.read_data_sets("./data/", one_hot=True)

Extracting ./data/train-images-idx3-ubyte.gz
Extracting ./data/train-labels-idx1-ubyte.gz
Extracting ./data/t10k-images-idx3-ubyte.gz


Extracting ./data/t10k-labels-idx1-ubyte.gz


In [4]:
def train_MNIST(net, sess, num_epoch=15, batch_size=100, data=mnist):
        
    sess.run(tf.global_variables_initializer())

    # iterating epoch
    for epoch in range(num_epoch):
        avg_loss = 0.
        num_batch = int(data.train.num_examples / batch_size)

        for i in range(num_batch):
            # get minibatch of X,Y
            batch_xs, batch_ys = data.train.next_batch(batch_size)
            feed_dict = {net.x: batch_xs,
                         net.y: batch_ys,
                         net.dropout_rate: 0.3, 
                         net.lr : 1e-3}
            l, _ = sess.run([net.loss, net.opt], feed_dict=feed_dict)
            avg_loss += l / num_batch

        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_loss))

    print('Learning Finished!')

In [5]:
# evaluation
def evaluate_MNIST(net, sess, data=mnist):
    
    correct_prediction = tf.equal(tf.argmax(net.logit, 1), tf.argmax(net.y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print('DropOut Accuracy:', sess.run(accuracy, feed_dict={net.x: data.test.images,
                                                             net.y: data.test.labels,
                                                             net.dropout_rate: 0.0}))
    
    logits = list()
    for i in range(30):
        logits.append(sess.run(net.logit, feed_dict={net.x: data.test.images,
                                                     net.y: data.test.labels,
                                                     net.dropout_rate: 0.3}))
    logit = np.mean(np.array(logits), 0)
    correct_prediction = np.equal(np.argmax(logit, 1), np.argmax(data.test.labels, 1))
    accuracy = np.mean(correct_prediction)
    print('MCDO Accuracy:', accuracy)

In [6]:
train_MNIST(mlp, sess)

Epoch: 0001 cost = 0.365000198
Epoch: 0002 cost = 0.162444700
Epoch: 0003 cost = 0.126597639
Epoch: 0004 cost = 0.105118152
Epoch: 0005 cost = 0.092673498
Epoch: 0006 cost = 0.080030959
Epoch: 0007 cost = 0.076510305
Epoch: 0008 cost = 0.070558808
Epoch: 0009 cost = 0.062834471
Epoch: 0010 cost = 0.058524062
Epoch: 0011 cost = 0.054050435
Epoch: 0012 cost = 0.053791622
Epoch: 0013 cost = 0.052053770
Epoch: 0014 cost = 0.047733511
Epoch: 0015 cost = 0.045533521
Learning Finished!


In [7]:
evaluate_MNIST(mlp, sess)

DropOut Accuracy: 0.9822
MCDO Accuracy: 0.9817
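
`evaluate_MNIST` averages the raw logits of the 30 stochastic passes. A common alternative, sketched below under the assumption that class probabilities are wanted, is to average the softmax outputs instead and to use the entropy of the averaged distribution as a simple uncertainty score. The helper names and the example logits are hypothetical, not results from the network above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mc_predictive(logit_samples):
    """Average class probabilities over MC samples and return (probs, entropy)."""
    n = len(logit_samples)
    probs_per_pass = [softmax(logits) for logits in logit_samples]
    n_classes = len(probs_per_pass[0])
    mean_probs = [sum(p[c] for p in probs_per_pass) / n for c in range(n_classes)]
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, entropy

# Hypothetical logits from three stochastic passes for one 3-class input.
agree = [[4.0, 0.0, 0.0], [5.0, 0.5, 0.0], [4.5, 0.0, 0.2]]
disagree = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
probs_a, h_a = mc_predictive(agree)
probs_d, h_d = mc_predictive(disagree)
print(h_a, h_d)
```

When the passes agree on one class the entropy is low; when they disagree the averaged distribution flattens and the entropy rises toward its maximum, $\log 3$ for three classes.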


# MCDO on CNN

In [8]:
class CNN:
    
    def __init__(self, kernel_size=3, n_hidden=512, n_output=10):
        self.kernel_size = kernel_size
        self.n_hidden = n_hidden
        self.n_output = n_output
        
        # define placeholders for training CNN
        self.x = tf.placeholder(tf.float32, shape=[None, 784])
        self.x_img = tf.reshape(self.x, [-1, 28, 28, 1])
        self.y = tf.placeholder(tf.float32, [None, 10])
        self.dropout_rate = tf.placeholder(tf.float32)
        self.lr = tf.placeholder(tf.float32)
        
        self.logit, self.loss, self.opt = self.build_graph(kernel_size=self.kernel_size, n_hidden=self.n_hidden, n_output=self.n_output)

    def build_graph(self, kernel_size=3, n_hidden=512, n_output=10):
        """
        Build computational graph for MNIST training with DropOut.

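        Args:
            kernel_size : the height and width of each convolution kernel.
            n_hidden : the number of hidden units in the dense layer.
            n_output : the size of output layer (=10)

        Uses the placeholders defined in __init__:
            self.x_img : MNIST input data reshaped to (m, 28, 28, 1)
            self.dropout_rate : dropout rate for dropout layers in CNN (tf.placeholder)

        Returns:
            logit : unnormalized class scores (logits) of prediction. (m, 10)
            loss : cross entropy loss 
            opt : Adam training op that minimizes the loss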
        """

        with tf.variable_scope('cnn'):
            strides = [1, 1, 1, 1]
            # initializers for weight and bias
            w_init = tf.contrib.layers.variance_scaling_initializer()
            b_init = tf.constant_initializer(0.)

            # 1st hidden layer : (m, 28, 28, 1) -> (m, 14, 14, 32)
            w0 = tf.get_variable('w0', [kernel_size, kernel_size, 1, 32], initializer=w_init)
            h0 = tf.nn.conv2d(self.x_img, w0, strides=strides, padding='SAME')
            h0 = tf.nn.relu(h0)
            h0 = tf.nn.max_pool(h0, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
            h0 = tf.nn.dropout(h0, rate=self.dropout_rate)

            # 2nd hidden layer : (m, 14, 14, 32) -> (m, 7, 7, 64)
            w1 = tf.get_variable('w1', [kernel_size, kernel_size, 32, 64], initializer=w_init)
            h1 = tf.nn.conv2d(h0, w1, strides=strides, padding='SAME')
            h1 = tf.nn.relu(h1)
            h1 = tf.nn.max_pool(h1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
            h1 = tf.nn.dropout(h1, rate=self.dropout_rate)

            # 3rd hidden layer : (m, 7, 7, 64) -> (m, 4, 4, 128)
            w2 = tf.get_variable('w2', [kernel_size, kernel_size, 64, 128], initializer=w_init)
            h2 = tf.nn.conv2d(h1, w2, strides=strides, padding='SAME')
            h2 = tf.nn.relu(h2)
            h2 = tf.nn.max_pool(h2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
            h2 = tf.nn.dropout(h2, rate=self.dropout_rate)
            
            # reshape : (m, 4, 4, 128) -> (m, 4 * 4 * 128)
            h2 = tf.reshape(h2, [-1, 4 * 4 * 128])

            # 4th hidden layer : (m, 4 * 4 * 128) -> (m, n_hidden)
            w3 = tf.get_variable('w3', [h2.get_shape()[1], n_hidden], initializer=w_init)
            b3 = tf.get_variable('b3', [n_hidden], initializer=b_init)
            h3 = tf.matmul(h2, w3) + b3
            h3 = tf.nn.relu(h3)
            h3 = tf.nn.dropout(h3, rate=self.dropout_rate)

            # output layer : (m, n_hidden) -> (m, n_output)
            wo = tf.get_variable('wo', [h3.get_shape()[1], n_output], initializer=w_init)
            bo = tf.get_variable('bo', [n_output], initializer=b_init)
            logit = tf.matmul(h3, wo) + bo
            
            # loss
            loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
            logits=logit, labels=self.y))
            
            # optimizer
            opt = tf.train.AdamOptimizer(learning_rate=self.lr).minimize(loss)

        return logit, loss, opt

In [9]:
cnn = CNN()

In [10]:
train_MNIST(cnn, sess)

Epoch: 0001 cost = 0.376906734
Epoch: 0002 cost = 0.109382831
Epoch: 0003 cost = 0.076635950
Epoch: 0004 cost = 0.064737637
Epoch: 0005 cost = 0.054634125
Epoch: 0006 cost = 0.049057720
Epoch: 0007 cost = 0.044841536
Epoch: 0008 cost = 0.041880070
Epoch: 0009 cost = 0.041395386
Epoch: 0010 cost = 0.036367649
Epoch: 0011 cost = 0.034328564
Epoch: 0012 cost = 0.033231129


KeyboardInterrupt: 

In [11]:
evaluate_MNIST(cnn, sess)

DropOut Accuracy: 0.9939
MCDO Accuracy: 0.9935


# MCDO on RNN

In [2]:
class RNN:
    
    def __init__(self, kernel_size=3, n_hidden=512, n_output=10):
        self.kernel_size = kernel_size
        self.n_hidden = n_hidden
        self.n_output = n_output
        
        # define placeholders for training RNN
        self.x = tf.placeholder(tf.float32, shape=[None, 784])
        self.x_img = tf.reshape(self.x, [-1, 28, 28])
        self.y = tf.placeholder(tf.float32, [None, 10])
        self.dropout_rate = tf.placeholder(tf.float32)
        self.lr = tf.placeholder(tf.float32)
        
        self.logit, self.loss, self.opt = self.build_graph(kernel_size=self.kernel_size, n_hidden=self.n_hidden, n_output=self.n_output)

    def build_graph(self, kernel_size=3, n_hidden=512, n_output=10):
        """
        Build computational graph for MNIST training with DropOut.

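        Args:
            kernel_size : not used by this RNN graph.
            n_hidden : the number of hidden units in the RNN cell.
            n_output : the size of output layer (=10)

        Uses the placeholders defined in __init__:
            self.x_img : MNIST input data as 28 time steps of 28 features. (m, 28, 28)
            self.dropout_rate : dropout rate for the RNN dropout wrapper (tf.placeholder)

        Returns:
            logit : unnormalized class scores (logits) of prediction. (m, 10)
            loss : cross entropy loss 
            opt : Adam training op that minimizes the loss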
        """

        with tf.variable_scope('rnn'):
            # initializers for weight and bias
            w_init = tf.contrib.layers.variance_scaling_initializer()
            b_init = tf.constant_initializer(0.)

            # rnn cell wrapped with dropout on inputs, outputs, and recurrent state
            rnn_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_hidden)
            keep_prob = 1. - self.dropout_rate
            var_rnn_cell = tf.contrib.rnn.DropoutWrapper(rnn_cell,
                                                        input_keep_prob=keep_prob,
                                                        output_keep_prob=keep_prob,
                                                        state_keep_prob=keep_prob,)
            # run the dropout-wrapped cell (the bare rnn_cell would apply no dropout)
            outputs, _ = tf.nn.dynamic_rnn(var_rnn_cell, self.x_img, dtype=tf.float32)

            # output layer
            wo = tf.get_variable('wo', [n_hidden, n_output], initializer=w_init)
            bo = tf.get_variable('bo', [n_output], initializer=b_init)
            last_rnn_output = outputs[:, -1, :]
            logit = tf.matmul(last_rnn_output, wo) + bo
            
            # loss
            loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
            logits=logit, labels=self.y))
            
            # optimizer
            opt = tf.train.AdamOptimizer(learning_rate=self.lr).minimize(loss)

        return logit, loss, opt

In [3]:
rnn = RNN()


In [None]:
train_MNIST(rnn, sess)

In [None]:
evaluate_MNIST(rnn, sess)
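
For recurrent networks, the variational RNN dropout of Gal and Ghahramani reuses a single dropout mask at every time step of a sequence instead of sampling a new mask per step. In TensorFlow 1.x, `DropoutWrapper` has a `variational_recurrent` argument for this (it additionally needs `dtype`, and `input_size` when input dropout is used); the `RNN` class above leaves it at its default. The sketch below only illustrates the difference between the two masking schemes in plain Python, with hypothetical helper names.

```python
import random

def sample_mask(size, rate, rng):
    """Inverted-dropout mask: 0 with probability rate, else 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]

def masks_for_sequence(n_steps, size, rate, rng, shared):
    """Return one mask per time step; with shared=True every step reuses one mask."""
    if shared:
        mask = sample_mask(size, rate, rng)
        return [list(mask) for _ in range(n_steps)]
    return [sample_mask(size, rate, rng) for _ in range(n_steps)]

rng = random.Random(2019)
# 28 time steps of 28 features, matching the row-by-row MNIST input above.
shared_masks = masks_for_sequence(28, 28, rate=0.3, rng=rng, shared=True)
per_step_masks = masks_for_sequence(28, 28, rate=0.3, rng=rng, shared=False)
print(shared_masks[0] == shared_masks[-1], per_step_masks[0] == per_step_masks[-1])
```

With a shared mask, the same units are dropped throughout the sequence, which corresponds to sampling one set of weights per sequence rather than one per time step.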