# Imports

In [1]:
import tensorflow as tf
import numpy as np
import pickle

# Load Dataset

In [2]:
# Load dataset from disk (context managers close the files automatically)
with open('X.pickle', 'rb') as pickle_in_x:
    X = pickle.load(pickle_in_x)

with open('y.pickle', 'rb') as pickle_in_y:
    y = pickle.load(pickle_in_y)

In [5]:
y[0]

[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

## Architecture

```
                            Loss
                              ↑
                    ┌─────────┴─────────┐
      Labels → Margin Loss      Reconstruction Loss
                    ↑                   ↑
                  Length             Decoder
                    ↑                   ↑ 
             Digit Capsules ────Mask────┘
               ↖↑↗ ↖↑↗ ↖↑↗
             Primary Capsules
                    ↑      
               Input Images
```

# Input Images

Input images are 28x28 pixels and grayscale, so there is a single channel. We don't know the batch size in advance, so the first dimension of the shape is None.

In [2]:
X = tf.placeholder(shape=[None, 28, 28, 1], dtype=tf.float32, name='X')

# Primary Capsules

The first layer will contain 32 maps of 6x6 capsules each, and each capsule will output an 8D activation vector.

In [3]:
caps1_n_maps = 32
caps1_n_caps = caps1_n_maps * 6 * 6  # 1152 primary capsules
caps1_n_dims = 8

To compute the output of the primary capsules, we first apply two regular convolutional layers:

In [4]:
conv1_params = {
    'filters': 256,
    'kernel_size': 9,
    'strides': 1,  # only conv2 downsamples; a stride of 2 here would not leave 6x6 maps
    'padding': 'valid',
    'activation': tf.nn.relu
}

conv2_params = {
    'filters': caps1_n_maps * caps1_n_dims,  # 256 convolutional filters
    'kernel_size': 9,
    'strides': 2,
    'padding': 'valid',
    'activation': tf.nn.relu
}

In [5]:
conv1 = tf.layers.conv2d(X, name='conv1', **conv1_params)
conv2 = tf.layers.conv2d(conv1, name='conv2', **conv2_params)

With a kernel size of 9 and valid padding, each convolutional layer trims 8 pixels from the image width and height, i.e. 28x28 -> 20x20 -> 12x12. Furthermore, the stride of 2 in the second layer halves the result, so we end up with 6x6 feature maps.
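The size arithmetic above can be checked with the usual formula for a convolution with valid padding, `(input - kernel) // stride + 1`. This is a minimal sketch in plain Python; `conv_output_size` is a hypothetical helper, and it assumes a stride of 1 for the first layer and 2 for the second, as the 28 -> 20 -> 6 progression implies.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Output width/height of a convolution with 'valid' padding."""
    return (input_size - kernel_size) // stride + 1

# First layer: 9x9 kernel, stride 1 (assumed)
after_conv1 = conv_output_size(28, kernel_size=9, stride=1)
# Second layer: 9x9 kernel, stride 2
after_conv2 = conv_output_size(after_conv1, kernel_size=9, stride=2)
# 32 maps of after_conv2 x after_conv2 capsules
n_primary_capsules = 32 * after_conv2 * after_conv2

print(after_conv1, after_conv2, n_primary_capsules)
```

The last value is the number of primary capsules that `caps1_n_caps` is expected to hold.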

Next, we reshape the output to get a bunch of 8D vectors representing the output of the primary capsules. The output of `conv2` is an array of scalars containing 32×8=256 feature maps for each instance, where each feature map is 6×6. So the shape of this output is (batch size, 6, 6, 256). We want to chop the 256 into 32 vectors of 8 dimensions each. We could do this by reshaping to (batch size, 6, 6, 32, 8). However, since this first capsule layer will be fully connected to the next capsule layer, we can simply flatten the 6×6 grids. This means we just need to reshape to (batch size, 6×6×32, 8).

In [6]:
caps1_raw = tf.reshape(conv2, [-1, caps1_n_caps, caps1_n_dims], name='caps1_raw')

Now the vectors need to be squashed.
We cannot directly use the `tf.norm()` function because the derivative of $\|\mathbf{s}\|$ is undefined when $\|\mathbf{s}\| = 0$. So we compute the norm ourselves, adding a small epsilon value inside the square root.

In [7]:
def squash(s, axis=-1, epsilon=1e-7, name=None):
    with tf.name_scope(name, default_name='squash'):
        squared_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
        safe_norm = tf.sqrt(squared_norm + epsilon)
        squash_factor = squared_norm / (1. + squared_norm)
        unit_vector = s / safe_norm
        return squash_factor * unit_vector

In [8]:
caps1_output = squash(caps1_raw, name='caps1_output')

`caps1_output` is the output of the first capsule layer.
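To see what squashing does to vectors of different lengths, here is a minimal NumPy sketch of the same computation. `squash_np` is a hypothetical stand-in for the TensorFlow `squash()` function above, not part of the model graph.

```python
import numpy as np

def squash_np(s, axis=-1, epsilon=1e-7):
    # Same steps as squash(): scale the unit vector by ||s||^2 / (1 + ||s||^2)
    squared_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    safe_norm = np.sqrt(squared_norm + epsilon)
    squash_factor = squared_norm / (1.0 + squared_norm)
    return squash_factor * (s / safe_norm)

vectors = np.array([
    [3.0, 4.0],     # long vector (length 5)
    [0.0, 0.0],     # zero vector: the epsilon avoids a division by zero
    [0.03, 0.04],   # short vector (length 0.05)
])
squashed = squash_np(vectors)
lengths = np.linalg.norm(squashed, axis=-1)
print(lengths)
```

Long vectors end up with a length just below 1, short vectors shrink toward 0, and the zero vector stays at zero without producing NaNs.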

# Digit Capsules

To compute the output of the digit capsules, we must first compute the predicted output vectors (one for each primary / digit capsule pair). Then we can run the routing by agreement algorithm.

# Compute the Predicted Output Vectors

The digit capsule layer contains 10 capsules (one for each digit) of 16 dimensions each:

In [9]:
caps2_n_caps = 10
caps2_n_dims = 16

For each capsule $i$ in the first layer, we want to predict the output of every capsule $j$ in the second layer. For this, we will need a transformation matrix $\mathbf{W}_{i,j}$ (one for each pair of capsules ($i$, $j$)), then we can compute the predicted output $\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{i,j} \, \mathbf{u}_i$. Since we want to transform an 8D vector into a 16D vector, each transformation matrix $\mathbf{W}_{i,j}$ must have a shape of (16, 8).

To compute $\hat{\mathbf{u}}_{j|i}$ for every pair of capsules ($i$, $j$), a feature of `tf.matmul()` function will be used. In addition to matrix multiplication, `tf.matmul()` also allows multiplication of higher dimensional arrays. It treats the higher dimensional arrays as arrays of matrices, and it performs itemwise matrix multiplication. For example, suppose there are two 4D arrays, each containing a 2×3 grid of matrices. The first contains matrices $\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}, \mathbf{E}, \mathbf{F}$ and the second contains matrices $\mathbf{G}, \mathbf{H}, \mathbf{I}, \mathbf{J}, \mathbf{K}, \mathbf{L}$. On multiplying these two 4D arrays using the `tf.matmul()` function, we get:

$
\pmatrix{
\mathbf{A} & \mathbf{B} & \mathbf{C} \\
\mathbf{D} & \mathbf{E} & \mathbf{F}
} \times
\pmatrix{
\mathbf{G} & \mathbf{H} & \mathbf{I} \\
\mathbf{J} & \mathbf{K} & \mathbf{L}
} = \pmatrix{
\mathbf{AG} & \mathbf{BH} & \mathbf{CI} \\
\mathbf{DJ} & \mathbf{EK} & \mathbf{FL}
}
$

We can apply this function to compute $\hat{\mathbf{u}}_{j|i}$ for every pair of capsules ($i$, $j$) like this (recall that there are 6×6×32=1152 capsules in the first layer, and 10 in the second layer):

$
\pmatrix{
  \mathbf{W}_{1,1} & \mathbf{W}_{1,2} & \cdots & \mathbf{W}_{1,10} \\
  \mathbf{W}_{2,1} & \mathbf{W}_{2,2} & \cdots & \mathbf{W}_{2,10} \\
  \vdots & \vdots & \ddots & \vdots \\
  \mathbf{W}_{1152,1} & \mathbf{W}_{1152,2} & \cdots & \mathbf{W}_{1152,10}
} \times
\pmatrix{
  \mathbf{u}_1 & \mathbf{u}_1 & \cdots & \mathbf{u}_1 \\
  \mathbf{u}_2 & \mathbf{u}_2 & \cdots & \mathbf{u}_2 \\
  \vdots & \vdots & \ddots & \vdots \\
  \mathbf{u}_{1152} & \mathbf{u}_{1152} & \cdots & \mathbf{u}_{1152}
}
=
\pmatrix{
\hat{\mathbf{u}}_{1|1} & \hat{\mathbf{u}}_{2|1} & \cdots & \hat{\mathbf{u}}_{10|1} \\
\hat{\mathbf{u}}_{1|2} & \hat{\mathbf{u}}_{2|2} & \cdots & \hat{\mathbf{u}}_{10|2} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\mathbf{u}}_{1|1152} & \hat{\mathbf{u}}_{2|1152} & \cdots & \hat{\mathbf{u}}_{10|1152}
}
$


The shape of the first array is (1152, 10, 16, 8), and the shape of the second array is (1152, 10, 8, 1). Note that the second array must contain 10 identical copies of the vectors $\mathbf{u}_1$ to $\mathbf{u}_{1152}$. To create this array, we will use the handy `tf.tile()` function, which lets you create an array containing many copies of a base array, tiled in any way you want.
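The tiling and itemwise multiplication described above can be tried on a small scale. The sketch below uses NumPy as a stand-in, since `np.matmul()` treats higher dimensional arrays as stacks of matrices in the same way. The sizes (4 primary capsules, 3 output capsules) are illustrative assumptions, not the model's 1152 and 10.

```python
import numpy as np

rng = np.random.default_rng(0)
n_caps1, n_caps2, d1, d2 = 4, 3, 8, 16  # small stand-ins for 1152, 10, 8, 16

# One (16, 8) transformation matrix per pair (i, j)
W = rng.normal(scale=0.1, size=(n_caps1, n_caps2, d2, d1))
# One 8D output vector per primary capsule
u = rng.normal(size=(n_caps1, d1))

# Turn each u_i into a column vector, then repeat it once per output capsule j
u_col = u[:, np.newaxis, :, np.newaxis]        # shape (4, 1, 8, 1)
u_tiled = np.tile(u_col, (1, n_caps2, 1, 1))   # shape (4, 3, 8, 1)

# Itemwise matrix multiplication over the leading (4, 3) grid
u_hat = np.matmul(W, u_tiled)                  # shape (4, 3, 16, 1)

# Spot-check one pair against an ordinary matrix-vector product
i, j = 2, 1
expected = W[i, j] @ u[i]
print(u_hat.shape, np.allclose(u_hat[i, j, :, 0], expected))
```

Each entry of the resulting grid is the 16D prediction that primary capsule `i` makes for output capsule `j`.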

Now, we also need to consider the _batch size_. Say we feed 50 images to the capsule network: it will make predictions for these 50 images simultaneously. So the shape of the first array must be (50, 1152, 10, 16, 8), and the shape of the second array must be (50, 1152, 10, 8, 1). The first layer capsules actually already output predictions for all 50 images, so the second array will be fine, but for the first array, we will need to use `tf.tile()` to have 50 copies of the transformation matrices.

So we start by creating a trainable variable of shape (1, 1152, 10, 16, 8) that will hold all the transformation matrices. The first dimension of size 1 will make this array easy to tile. The transformation matrices need to start from random values, so we initialize this variable using a normal distribution with a standard deviation of 0.1.

In [10]:
# Creating the transformation matrix variable W

init_sigma = 0.1

W_init = tf.random_normal(
    shape=(1, caps1_n_caps, caps2_n_caps, caps2_n_dims, caps1_n_dims),
    stddev=init_sigma, dtype=tf.float32, name='W_init'
)
W = tf.Variable(W_init, name='W')

In [11]:
# Creating the first array by repeating W per instance

batch_size = tf.shape(X)[0]
W_tiled = tf.tile(W, [batch_size, 1, 1, 1, 1], name='W_tiled')

Now moving on to the second array. We need to create an array of shape (_batch size_, 1152, 10, 8, 1), containing the output of the first layer capsules, repeated 10 times (once per digit, along the third dimension, which is axis=2). The `caps1_output` array has a shape of (_batch size_, 1152, 8), so we first need to expand it twice, to get an array of shape (_batch size_, 1152, 1, 8, 1), then we can repeat it 10 times along the third dimension:

In [12]:
caps1_output_expanded = tf.expand_dims(caps1_output, -1, name='caps1_output_expanded')
caps1_output_tile = tf.expand_dims(caps1_output_expanded, 2, name='caps1_output_tile')
caps1_output_tiled = tf.tile(caps1_output_tile, [1, 1, caps2_n_caps, 1, 1], name='caps1_output_tiled')

In [13]:
# Now check the shapes
print(W_tiled)
print(caps1_output_tiled)

Tensor("W_tiled:0", shape=(?, 1152, 10, 16, 8), dtype=float32)
Tensor("caps1_output_tiled:0", shape=(?, 1152, 10, 8, 1), dtype=float32)


Now, to get all the predicted output vectors $\hat{\mathbf{u}}_{j|i}$, we multiply the two arrays using `tf.matmul()`:

In [14]:
caps2_predicted = tf.matmul(W_tiled, caps1_output_tiled, name='caps2_predicted')

In [15]:
# Check the shape
caps2_predicted

<tf.Tensor 'caps2_predicted:0' shape=(?, 1152, 10, 16, 1) dtype=float32>

For each instance in the batch and for each pair of first and second layer capsules (1152×10) we have a 16D predicted output column vector (16×1). Now we can apply the routing by agreement algorithm!

# Routing by agreement

First, initialize the raw routing weights $b_{i,j}$ to zero:

In [16]:
raw_weights = tf.zeros([batch_size, caps1_n_caps, caps2_n_caps, 1, 1], dtype=np.float32, name='raw_weights')

The reason for adding two extra dimensions to raw_weights is given below.

### Round 1

First, apply the softmax function to compute the routing weights, $\mathbf{c}_{i} = \operatorname{softmax}(\mathbf{b}_i)$

In [17]:
routing_weights = tf.nn.softmax(raw_weights, axis=2, name='routing_weights')

Now compute the weighted sum of all the predicted output vectors for each second-layer capsule, $\mathbf{s}_j = \sum\limits_{i}{c_{i,j}\hat{\mathbf{u}}_{j|i}}$

In [18]:
weighted_predictions = tf.multiply(routing_weights, caps2_predicted, name='weighted_predictions')
weighted_sum = tf.reduce_sum(weighted_predictions, axis=1, keepdims=True, name='weighted_sum')

* To perform elementwise matrix multiplication, we use the `tf.multiply()` function. It requires `routing_weights` and `caps2_predicted` to have the same rank, which is why two extra dimensions of size 1 were added to `routing_weights`, earlier.
* The shape of `routing_weights` is (_batch size_, 1152, 10, 1, 1) while the shape of `caps2_predicted` is (_batch size_, 1152, 10, 16, 1).  Since they don't match on the fourth dimension (1 _vs_ 16), `tf.multiply()` automatically _broadcasts_ the `routing_weights` 16 times along that dimension.
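The broadcasting behaviour described in the two points above can be reproduced in NumPy, which follows the same broadcasting rules. This is a small sketch with assumed sizes (2 instances, 6 primary capsules, 3 output capsules, 4 dimensions); the `softmax` helper is written out by hand for the example.

```python
import numpy as np

def softmax(x, axis):
    # Subtract the max for numerical stability before exponentiating
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

batch, n_caps1, n_caps2, d2 = 2, 6, 3, 4
rng = np.random.default_rng(1)
predicted = rng.normal(size=(batch, n_caps1, n_caps2, d2, 1))
raw_weights = np.zeros((batch, n_caps1, n_caps2, 1, 1))

# With all-zero raw weights, every routing weight is 1 / n_caps2
routing_weights = softmax(raw_weights, axis=2)
# (2, 6, 3, 1, 1) * (2, 6, 3, 4, 1): the weights are broadcast along axis 3
weighted = routing_weights * predicted
# Sum over the primary capsules (axis 1)
weighted_sum = weighted.sum(axis=1, keepdims=True)

print(routing_weights[0, 0, :, 0, 0], weighted.shape, weighted_sum.shape)
```

In the first round, the weighted sum is therefore just the sum of the predictions divided by the number of output capsules.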

And finally, let's apply the squash function to get the outputs of the second layer capsules at the end of the first iteration of the routing by agreement algorithm, $\mathbf{v}_j = \operatorname{squash}(\mathbf{s}_j)$ :

In [19]:
caps2_output_round_1 = squash(weighted_sum, axis=-2, name='caps2_output_round_1')

In [20]:
caps2_output_round_1

<tf.Tensor 'caps2_output_round_1/mul:0' shape=(?, 1, 10, 16, 1) dtype=float32>

We have ten 16D output vectors for each instance, as expected.

### Round 2

First, let's measure how close each predicted vector $\hat{\mathbf{u}}_{j|i}$ is to the actual output vector $\mathbf{v}_j$ by computing their scalar product $\hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$.

* Quick math reminder: if $\vec{a}$ and $\vec{b}$ are two vectors of equal length, and $\mathbf{a}$ and $\mathbf{b}$ are their corresponding column vectors (i.e., matrices with a single column), then $\mathbf{a}^T \mathbf{b}$ (i.e., the matrix multiplication of the transpose of $\mathbf{a}$, and $\mathbf{b}$) is a 1×1 matrix containing the scalar product of the two vectors $\vec{a}\cdot\vec{b}$. In Machine Learning, we generally represent vectors as column vectors, so when we talk about computing the scalar product $\hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$, this actually means computing ${\hat{\mathbf{u}}_{j|i}}^T \mathbf{v}_j$.
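The reminder above is easy to verify numerically. In this short NumPy sketch, `a` and `b` are arbitrary example column vectors:

```python
import numpy as np

a = np.array([[1.0], [2.0], [3.0]])  # column vector, shape (3, 1)
b = np.array([[4.0], [5.0], [6.0]])  # column vector, shape (3, 1)

product = a.T @ b                        # 1x1 matrix
scalar = np.dot(a.ravel(), b.ravel())    # plain scalar product

print(product.shape, product[0, 0], scalar)
```

The single entry of the 1×1 matrix equals the scalar product 1·4 + 2·5 + 3·6.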

Since we need to compute the scalar product $\hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$ for each instance, and for each pair of first and second level capsules $(i, j)$, we will once again use `tf.matmul()` to multiply many matrices simultaneously. This will require playing around with `tf.tile()` to get all dimensions to match (except for the last 2). So let's look at the shape of `caps2_predicted`, which holds all the predicted output vectors $\hat{\mathbf{u}}_{j|i}$ for each instance and each pair of capsules:

In [21]:
caps2_predicted

<tf.Tensor 'caps2_predicted:0' shape=(?, 1152, 10, 16, 1) dtype=float32>

And now let's look at the shape of `caps2_output_round_1`, which holds 10 output vectors of 16D each, for each instance:

In [22]:
caps2_output_round_1

<tf.Tensor 'caps2_output_round_1/mul:0' shape=(?, 1, 10, 16, 1) dtype=float32>

To get these shapes to match, we just need to tile the `caps2_output_round_1` array 1152 times (once per primary capsule) along the second dimension:

In [23]:
caps2_output_round_1_tiled = tf.tile(
    caps2_output_round_1, [1, caps1_n_caps, 1, 1, 1], name='caps2_output_round_1_tiled'
)

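Now we can compute the agreement $\hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$ for every pair of capsules. Passing `transpose_a=True` transposes each 16×1 column vector in `caps2_predicted` into a 1×16 row vector, so each product is a 1×1 matrix: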

In [24]:
agreement = tf.matmul(caps2_predicted, caps2_output_round_1_tiled, transpose_a=True, name='agreement')

We can now update the raw routing weights $b_{i,j}$ by simply adding the scalar product $\hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$ we just computed: $b_{i,j} \gets b_{i,j} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j$

In [25]:
raw_weights_round_2 = tf.add(raw_weights, agreement, name='raw_weights_round_2')

The rest of round 2 is the same as in round 1:

In [26]:
routing_weights_round_2 = tf.nn.softmax(
    raw_weights_round_2, axis=2, name='routing_weights_round_2'
)

weighted_predictions_round_2 = tf.multiply(
    routing_weights_round_2, caps2_predicted, name='weighted_predictions_round_2'
)

weighted_sum_round_2 = tf.reduce_sum(
    weighted_predictions_round_2, axis=1, keepdims=True, name='weighted_sum_round_2'
)

caps2_output_round_2 = squash(weighted_sum_round_2, axis=-2, name='caps2_output_round_2')

If we want, we can go for a few more rounds, by repeating exactly the same steps as in round 2. But for now, we stop here.

In [27]:
caps2_output = caps2_output_round_2
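Since every additional round repeats the same steps, the rounds can also be written as a loop. The following is a NumPy sketch of that idea rather than the notebook's TensorFlow graph: `route`, `squash_np` and `softmax` are hypothetical helpers, the sizes are small assumptions, and no explicit tiling is needed here because `np.matmul()` broadcasts the leading dimensions.

```python
import numpy as np

def squash_np(s, axis=-1, epsilon=1e-7):
    squared_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    return (squared_norm / (1.0 + squared_norm)) * s / np.sqrt(squared_norm + epsilon)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(predicted, n_rounds=2):
    """predicted: (batch, n_caps1, n_caps2, d2, 1) -> (batch, 1, n_caps2, d2, 1)."""
    raw_weights = np.zeros(predicted.shape[:3] + (1, 1))
    for round_index in range(n_rounds):
        routing_weights = softmax(raw_weights, axis=2)
        weighted_sum = (routing_weights * predicted).sum(axis=1, keepdims=True)
        output = squash_np(weighted_sum, axis=-2)
        if round_index < n_rounds - 1:
            # Agreement u_hat^T v for every pair (i, j): shape (batch, n_caps1, n_caps2, 1, 1)
            agreement = np.matmul(np.swapaxes(predicted, -1, -2), output)
            raw_weights = raw_weights + agreement
    return output

rng = np.random.default_rng(3)
predicted = rng.normal(size=(2, 6, 3, 4, 1))  # 2 instances, 6 -> 3 capsules, 4D
output = route(predicted, n_rounds=3)
output_lengths = np.sqrt(np.sum(np.square(output), axis=-2))
print(output.shape, output_lengths.max())
```

With `n_rounds=2` the loop mirrors the two rounds built by hand above.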

# Estimated Class Probabilities (Length)

The lengths of the output vectors represent class probabilities. We cannot use the `tf.norm()` function directly because of the zero-norm gradient issue discussed when writing the `squash()` function. So we create our own function:

In [28]:
def safe_norm(s, axis=-1, epsilon=1e-7, keepdims=False, name=None):
    with tf.name_scope(name, default_name='safe_norm'):
        squared_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=keepdims)
        return tf.sqrt(squared_norm + epsilon)

In [29]:
# The 16D vectors live on the second to last axis, so the norm is taken along axis=-2
y_proba = safe_norm(caps2_output, axis=-2, name='y_proba')

To predict the class of each instance, we select the one with the highest estimated probability. We start by finding its index:

In [30]:
y_proba_argmax = tf.argmax(y_proba, axis=2, name='y_proba')

Let's look at the shape of `y_proba_argmax`:

In [31]:
y_proba_argmax

<tf.Tensor 'y_proba_1:0' shape=(?, 1, 1) dtype=int64>

We now have the index of the longest output vector. Let's get rid of the last two dimensions. This gives us the capsule network's predicted class for each instance:

In [32]:
y_pred = tf.squeeze(y_proba_argmax, axis=[1,2], name='y_pred')
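The length, argmax and squeeze steps can be followed on a tiny hand-made array. This NumPy sketch assumes 2 instances, 3 classes and 4D output vectors; the values are made up for illustration.

```python
import numpy as np

# Stand-in for caps2_output: shape (batch, 1, n_classes, n_dims, 1)
caps2_output_np = np.zeros((2, 1, 3, 4, 1))
caps2_output_np[0, 0, 0, :, 0] = [0.1, 0.0, 0.0, 0.0]  # short vector for class 0
caps2_output_np[0, 0, 2, :, 0] = [0.4, 0.4, 0.4, 0.4]  # length 0.8 for class 2
caps2_output_np[1, 0, 1, :, 0] = [0.0, 0.6, 0.0, 0.0]  # length 0.6 for class 1

# Norm along the vector axis (-2): shape (2, 1, 3, 1)
lengths = np.sqrt(np.sum(np.square(caps2_output_np), axis=-2) + 1e-7)
# Index of the longest vector along the class axis: shape (2, 1, 1)
longest = np.argmax(lengths, axis=2)
# Drop the two leftover dimensions of size 1: shape (2,)
y_pred_np = longest.squeeze(axis=(1, 2))

print(lengths.shape, longest.shape, y_pred_np)
```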

In [33]:
y_pred

<tf.Tensor 'y_pred:0' shape=(?,) dtype=int64>

We are now ready to define the training operations.

# Labels

We start by defining a placeholder for labels

In [34]:
y = tf.placeholder(shape=[None], dtype=tf.int64, name='y')

# Margin loss

The margin loss makes it possible to detect two or more digits in an image:

$ L_k = T_k \max(0, m^{+} - \|\mathbf{v}_k\|)^2 + \lambda (1 - T_k) \max(0, \|\mathbf{v}_k\| - m^{-})^2$

* $T_k$ is equal to 1 if the digit of class $k$ is present, or 0 otherwise.
* $m^{+} = 0.9$, $m^{-} = 0.1$ and $\lambda = 0.5$.

In [35]:
m_plus = 0.9
m_minus = 0.1
lambda_ = 0.5

Since `y` will contain the digit classes, from 0 to 9, to get $T_k$ for every instance and every class we can just use the `tf.one_hot()` function:

In [36]:
T = tf.one_hot(y, depth=caps2_n_caps, name='T')

A small example of how the `tf.one_hot()` works is:

In [37]:
with tf.Session():
    print(T.eval(feed_dict={y: np.array([0, 1, 2, 3, 9])}))

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]


Now we will compute the norm of the output vector for each output capsule and each instance. First, let's verify the shape of `caps2_output`:

In [38]:
caps2_output

<tf.Tensor 'caps2_output_round_2/mul:0' shape=(?, 1, 10, 16, 1) dtype=float32>

The 16D output vectors are in the second to last dimension of `caps2_output`, so we use `safe_norm()` with `axis=-2`

In [39]:
caps2_output_norm = safe_norm(caps2_output, axis=-2, keepdims=True, name='caps2_output_norm')

Now let's compute $\max(0, m^{+} - \|\mathbf{v}_k\|)^2$, and reshape the result to get a simple matrix of shape (_batch size_, 10):

In [40]:
present_error_raw = tf.square(tf.maximum(0., m_plus - caps2_output_norm), name='present_error_raw')
present_error = tf.reshape(present_error_raw, shape=(-1, 10), name='present_error')

Next let's compute $\max(0, \|\mathbf{v}_k\| - m^{-})^2$ and reshape it:

In [41]:
absent_error_raw = tf.square(tf.maximum(0., caps2_output_norm - m_minus), name='absent_error_raw')
absent_error = tf.reshape(absent_error_raw, shape=(-1, 10), name='absent_error')

Now, we can compute the loss for each instance and each digit.

In [42]:
L = tf.add(T * present_error, lambda_ * (1.0 - T) * absent_error, name='L')

Now we can sum the digit losses for each instance ($L_0 + L_1 + \cdots + L_9$), and compute the mean over all instances. This gives us the final margin loss:

In [43]:
margin_loss = tf.reduce_mean(tf.reduce_sum(L, axis=1), name='margin_loss')
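A small worked example helps check the margin loss by hand. The NumPy sketch below uses 3 classes instead of 10 and made-up output vector lengths; both instances are labelled as class 0.

```python
import numpy as np

m_plus, m_minus, lambda_ = 0.9, 0.1, 0.5

# Hypothetical output vector lengths for 2 instances and 3 classes
norms = np.array([
    [0.95, 0.20, 0.05],
    [0.50, 0.10, 0.60],
])
labels = np.array([0, 0])
T = np.eye(3)[labels]  # one-hot targets

present_error = np.square(np.maximum(0.0, m_plus - norms))
absent_error = np.square(np.maximum(0.0, norms - m_minus))
L = T * present_error + lambda_ * (1.0 - T) * absent_error

per_instance = L.sum(axis=1)
margin_loss = per_instance.mean()
print(per_instance, margin_loss)
```

For the first instance only class 1 is penalized, with 0.5 × (0.2 − 0.1)² = 0.005. For the second, the too-short target vector costs (0.9 − 0.5)² = 0.16 and class 2 adds 0.5 × (0.6 − 0.1)² = 0.125, so the mean over both instances is 0.145.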

# Reconstruction Loss

Now a decoder network needs to be added on top of the capsule network. It is a regular 3-layer fully connected neural network which will learn to reconstruct the input images based on the output of the capsule network. This will force the capsule network to preserve all the information required to reconstruct the digits, across the whole network. This constraint regularizes the model: it reduces the risk of overfitting the training set, and it helps the model generalize to new digits.

## Mask

During training, instead of sending all the outputs of the capsule network to the decoder network, we send only the output vector that corresponds to the target digit. All the other output vectors must be masked out. At inference time, we must mask all output vectors except for the longest one, i.e., the one that corresponds to the predicted digit.

We need a placeholder to tell TensorFlow whether we want to mask the output vectors based on the labels (`True`) or on the predictions (`False`, the default):

In [44]:
# mask_with_labels (a tf.placeholder_with_default) evaluates to False whenever it is not fed
# shape=() means a scalar value
mask_with_labels = tf.placeholder_with_default(False, shape=(), name='mask_with_labels')

Now we use `tf.cond()` to define the reconstruction targets as the labels `y` if `mask_with_labels` is `True`, or `y_pred` otherwise.

In [48]:
reconstruction_targets = tf.cond(
    mask_with_labels,  # condition
    lambda: y,  # true function (if True)
    lambda: y_pred,  # false function (if False)
    name='reconstruction_targets'
)

The `tf.cond()` function expects the if-True and if-False tensors to be passed _via_ functions: these functions will be called just once during the graph construction phase (not during the execution phase), similar to `tf.while_loop()`. This allows TensorFlow to add the necessary operations to handle the conditional evaluation of the if-True or if-False tensors.  
However, in our case, the tensors `y` and `y_pred` are already created by the time we call `tf.cond()`, so unfortunately TensorFlow will consider both `y` and `y_pred` to be dependencies of the `reconstruction_targets` tensor. The `reconstruction_targets` tensor will end up with the correct value, but:
1. whenever we evaluate a tensor that depends on `reconstruction_targets`, the `y_pred` tensor will be evaluated (even if `mask_with_labels` is `True`). This is not a big deal because computing `y_pred` adds no computing overhead during training, since we need it anyway to compute the margin loss. And during testing, if we are doing classification, we won't need reconstructions, so `reconstruction_targets` won't be evaluated at all.
2. we will always need to feed a value for the `y` placeholder (even if `mask_with_labels` is `False`). This is a bit annoying, but we can pass an empty array, because TensorFlow won't use it anyway (it just does not know it yet when it checks for dependencies).

After reconstruction_targets, we need to create the reconstruction_mask. It should be equal to 1.0 for the target class, and 0.0 for the other classes, for each instance. For this we can just use the `tf.one_hot()` function:

In [53]:
reconstruction_mask = tf.one_hot(reconstruction_targets, depth=caps2_n_caps, name='reconstruction_mask')

In [54]:
# check the shape of reconstruction_mask
reconstruction_mask

<tf.Tensor 'reconstruction_mask_1:0' shape=(?, 10) dtype=float32>

In [55]:
# compare the shape of reconstruction_mask with caps2_output
caps2_output

<tf.Tensor 'caps2_output_round_2/mul:0' shape=(?, 1, 10, 16, 1) dtype=float32>

To multiply `reconstruction_mask` with `caps2_output`, we need to reshape `reconstruction_mask` to (_batch size_, 1, 10, 1, 1):

In [56]:
reconstruction_mask_reshaped = tf.reshape(reconstruction_mask, [-1, 1, caps2_n_caps, 1, 1])

In [57]:
# Now we can apply the mask
caps2_output_masked = tf.multiply(caps2_output, reconstruction_mask_reshaped, name='caps2_output_masked')

In [58]:
caps2_output_masked

<tf.Tensor 'caps2_output_masked:0' shape=(?, 1, 10, 16, 1) dtype=float32>

In [59]:
# Flatten the decoder's inputs
# This will give an array of shape (batch_size, 160)
decoder_input = tf.reshape(caps2_output_masked, [-1, caps2_n_caps * caps2_n_dims], name='decoder_input')
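The effect of the mask is easiest to see on a small array. This NumPy sketch assumes 3 classes of 4 dimensions each (instead of 10 and 16) and made-up targets. After masking and flattening, only the slice belonging to the target capsule is non-zero.

```python
import numpy as np

n_classes, n_dims = 3, 4
rng = np.random.default_rng(2)
caps2_output_np = rng.normal(size=(2, 1, n_classes, n_dims, 1))
targets = np.array([2, 0])  # hypothetical reconstruction targets

mask = np.eye(n_classes)[targets]                        # one-hot, shape (2, 3)
mask_reshaped = mask.reshape(-1, 1, n_classes, 1, 1)     # shape (2, 1, 3, 1, 1)
masked = caps2_output_np * mask_reshaped                 # broadcast over the 4D vectors
decoder_input_np = masked.reshape(-1, n_classes * n_dims)  # shape (2, 12)

print(decoder_input_np.round(2))
```

The decoder still receives a fixed-size input, and the position of the non-zero slice tells it which class to reconstruct.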

In [60]:
decoder_input

<tf.Tensor 'decoder_input:0' shape=(?, 160) dtype=float32>

## Decoder

The decoder will contain two dense (fully connected) ReLU layers followed by a dense output sigmoid layer:

In [61]:
n_hidden1 = 512
n_hidden2 = 1024
n_output = 28 * 28

In [62]:
with tf.name_scope('decoder'):
    hidden1 = tf.layers.dense(decoder_input, n_hidden1, activation=tf.nn.relu, name='hidden1')
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name='hidden2')
    decoder_output = tf.layers.dense(hidden2, n_output, activation=tf.nn.sigmoid, name='decoder_output')

## Reconstruction Loss

Reconstruction loss is the squared difference between the input image and the reconstructed image:

In [63]:
X_flat = tf.reshape(X, [-1, n_output], name='X_flat')
squared_difference = tf.square(X_flat - decoder_output, name='squared_difference')
reconstruction_loss = tf.reduce_mean(squared_difference, name='reconstruction_loss')

## Final Loss

The final loss is the sum of the margin loss and the reconstruction loss (scaled down by a factor of 0.0005 to ensure the margin loss dominates training):

In [64]:
alpha = 0.0005
loss = tf.add(margin_loss, alpha * reconstruction_loss, name='loss')
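To get a feel for how little the reconstruction term contributes, here is a tiny NumPy sketch with a 4-pixel "image". The pixel values and the margin loss of 0.2 are hypothetical numbers chosen for the example.

```python
import numpy as np

X_flat_np = np.array([[0.0, 1.0, 1.0, 0.0]])        # flattened input image
decoder_output_np = np.array([[0.5, 0.5, 1.0, 0.0]])  # reconstruction

# Mean of the squared pixel differences: (0.25 + 0.25 + 0 + 0) / 4
reconstruction_loss_np = np.mean(np.square(X_flat_np - decoder_output_np))

margin_loss_np = 0.2  # hypothetical margin loss value
alpha = 0.0005
loss_np = margin_loss_np + alpha * reconstruction_loss_np

print(reconstruction_loss_np, loss_np)
```

Even a fairly poor reconstruction only shifts the total loss in the fifth decimal place here, which is the point of the small scaling factor.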

# Final Touches

## Accuracy

To measure the model's accuracy, we need to count the number of instances that are properly classified. For this, we can simply compare `y` and `y_pred`, convert the boolean value to a float32 (0.0 for False and 1.0 for True), and compute the mean over all instances:

In [65]:
correct = tf.equal(y, y_pred, name='correct')
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')
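The same compare, cast and mean computation looks like this in NumPy, with made-up labels and predictions:

```python
import numpy as np

y_true = np.array([1, 2, 3, 4])  # hypothetical labels
y_hat = np.array([1, 2, 0, 4])   # hypothetical predictions

correct_np = np.equal(y_true, y_hat)                 # [True, True, False, True]
accuracy_np = correct_np.astype(np.float32).mean()   # fraction of True values

print(correct_np, accuracy_np)
```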

## Training Operations

In [66]:
# We use Adam Optimizer with default parameter values
optimizer = tf.train.AdamOptimizer()
training_op = optimizer.minimize(loss, name='training_op')

## Init and Saver

In [67]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

Construction of the network is done!

# Training

Training of this capsule network will be done in a pretty standard way, i.e. there won't be any fancy hyperparameter tuning, dropout or anything. We will just run the training operation over and over again, displaying the loss. At the end of each epoch, we measure the accuracy on the validation set, display it, and save the model if the validation loss is the lowest seen so far (this is a basic way to implement early stopping, without actually stopping). Here are a few details to note:
* if a checkpoint file exists, it will be restored (this makes it possible to interrupt training, then restart it later from the last checkpoint),
* we must not forget to feed `mask_with_labels=True` during training,
* during testing, we let `mask_with_labels` default to `False` (but we still feed the labels since they are required to compute the accuracy),
* the images loaded _via_ `mnist.train.next_batch()` are represented as `float32` arrays of shape \[784\], but the input placeholder `X` expects a `float32` array of shape \[28, 28, 1\], so we must reshape the images before we feed them to our model,
* we evaluate the model's loss and accuracy on the full validation set (5,000 instances). To view progress and support systems that don't have a lot of RAM, the code evaluates the loss and accuracy on one batch at a time, and computes the mean loss and mean accuracy at the end.
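Two of the points above, the reshape and the batch-wise averaging, can be illustrated with NumPy alone. The batch of zeros and the per-batch accuracies below are placeholder values, not real results.

```python
import numpy as np

# A batch of 50 flattened images, as a placeholder for a real batch
X_batch_np = np.zeros((50, 784), dtype=np.float32)
# -1 lets NumPy infer the batch dimension
X_reshaped = X_batch_np.reshape([-1, 28, 28, 1])

# Hypothetical accuracies measured on three validation batches
batch_accuracies = [0.98, 1.0, 0.96]
mean_accuracy = np.mean(batch_accuracies)

print(X_reshaped.shape, mean_accuracy)
```

Averaging per-batch values matches the metric over the full set only when all batches have the same size, which holds here because the number of iterations is computed with an integer division by the batch size.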

In [None]:
n_epochs = 10
batch_size = 50
restore_checkpoint = True

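# NOTE: `mnist` is assumed to be a dataset object loaded beforehand, exposing
# train/validation/test splits with next_batch(); it is not defined in this notebook.
n_iterations_per_epoch = mnist.train.num_examples // batch_size
n_iterations_validation = mnist.validation.num_examples // batch_size
best_loss_val = np.inf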
checkpoint_path = './my_capsule_network'

with tf.Session() as sess:
    if restore_checkpoint and tf.train.checkpoint_exists(checkpoint_path):
        saver.restore(sess, checkpoint_path)
    else:
        init.run()
    
    for epoch in range(n_epochs):
        for iteration in range(1, n_iterations_per_epoch + 1):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            # Run the training operation and measure the loss:
            _, loss_train = sess.run(
                [training_op, loss],
                feed_dict={
                    X: X_batch.reshape([-1, 28, 28, 1]),
                    y: y_batch,
                    mask_with_labels: True
                }
            )
            print('\rIteration: {}/{} ({:.1f}%)  Loss: {:.5f}'.format(
                iteration, n_iterations_per_epoch,
                iteration * 100 / n_iterations_per_epoch,
                loss_train
            ), end='')
        
        # After the end of each epoch, measure validation loss and accuracy
        loss_vals = []
        acc_vals = []
        for iteration in range(1, n_iterations_validation + 1):
            X_batch, y_batch = mnist.validation.next_batch(batch_size)
            loss_val, acc_val = sess.run(
                [loss, accuracy],
                feed_dict={
                    X: X_batch.reshape([-1, 28, 28, 1]),
                    y: y_batch  # mask_with_labels is left at its default (False) for evaluation
                }
            )
            loss_vals.append(loss_val)
            acc_vals.append(acc_val)
            print("\rEvaluating the model: {}/{} ({:.1f}%)".format(
                iteration, n_iterations_validation,
                iteration * 100 / n_iterations_validation
            ), end=' ' * 10)
        loss_val = np.mean(loss_vals)
        acc_val = np.mean(acc_vals)
        print("\rEpoch: {}  Val accuracy: {:.4f}%  Loss: {:.6f}{}".format(
            epoch + 1, acc_val * 100, loss_val, " (improved)" if loss_val < best_loss_val else ""
        ))
        
        # Save the model if it has improved
        if loss_val < best_loss_val:
            save_path = saver.save(sess, checkpoint_path)
            best_loss_val = loss_val

Training is finished. Now we will evaluate the model on the test set.

# Evaluation

In [None]:
n_iterations_test = mnist.test.num_examples // batch_size

with tf.Session() as sess:
    saver.restore(sess, checkpoint_path)
    
    loss_tests = []
    acc_tests = []
    for iteration in range(1, n_iterations_test + 1):
        X_batch, y_batch = mnist.test.next_batch(batch_size)
        loss_test, acc_test = sess.run(
            [loss, accuracy],
            feed_dict={
                X: X_batch.reshape([-1, 28, 28, 1]),
                y: y_batch
            }
        )
        loss_tests.append(loss_test)
        acc_tests.append(acc_test)
        print("\rEvaluating the model: {}/{} ({:.1f}%)".format(
            iteration, n_iterations_test,
            iteration * 100 / n_iterations_test
        ), end=' ' * 10)
    loss_test = np.mean(loss_tests)
    acc_test = np.mean(acc_tests)
    print("\rFinal test accuracy: {:.4f}%  Loss: {:.6f}".format(acc_test * 100, loss_test))

# Predictions

To make predictions, we first fix a few images from the test set, then start a session, restore the trained model, evaluate `caps2_output` to get the capsule network's output vectors, `decoder_output` to get the reconstructions, and `y_pred` to get the class predictions:

In [None]:
n_samples = 5
sample_images = mnist.test.images[:n_samples].reshape([-1, 28, 28, 1])

with tf.Session() as sess:
    saver.restore(sess, checkpoint_path)
    caps2_output_value, decoder_output_value, y_pred_value = sess.run(
        [caps2_output, decoder_output, y_pred],
        feed_dict={
            X: sample_images,
            y: np.array([], dtype=np.int64)
        }
    )

**Note:** we feed `y` with an empty array, but TensorFlow will not use it. See the Mask section for the explanation.

In [None]:
import matplotlib.pyplot as plt

# Plot the images and their labels
sample_images = sample_images.reshape(-1, 28, 28)

plt.figure(figsize=(n_samples * 2, 3))
for index in range(n_samples):
    plt.subplot(1, n_samples, index + 1)
    plt.imshow(sample_images[index], cmap="binary")
    plt.title("Label:" + str(mnist.test.labels[index]))
    plt.axis("off")

plt.show()


# Plot the corresponding reconstructions and predictions
reconstructions = decoder_output_value.reshape([-1, 28, 28])

plt.figure(figsize=(n_samples * 2, 3))
for index in range(n_samples):
    plt.subplot(1, n_samples, index + 1)
    plt.title("Predicted:" + str(y_pred_value[index]))
    plt.imshow(reconstructions[index], cmap="binary")
    plt.axis("off")
    
plt.show()