# Fundamentals of Deep Learning
## Table of Contents
- Chapter 5. Convolutional Neural Networks
    - Vanilla Deep Neural Networks Don’t Scale
    - Filters and Feature Maps
    - Full Description of the Convolutional Layer
    - Max Pooling
    - Full Architectural Description of Convolution Networks
    - Closing the Loop on MNIST with Convolutional Networks
    - Building a Convolutional Network for CIFAR-10
    - Visualizing Learning in Convolutional Networks
    - Leveraging Convolutional Filters to Replicate Artistic Styles

## Vanilla Deep Neural Networks Don’t Scale
In MNIST, our images were only 28 x 28 pixels and were black and white. As a result, a neuron in a fully connected hidden layer would have 784 incoming weights. This seems pretty tractable for the MNIST task, and our vanilla neural net performed quite well. This technique, however, does not scale well as our images grow larger. For example, for a full-color 200 x 200 pixel image, each neuron in a fully connected hidden layer would have 200 x 200 x 3 = 120,000 incoming weights.
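To make the scaling problem concrete, the arithmetic above can be sketched in a few lines of Python. The hidden-layer width of 1,000 is a hypothetical value chosen only for illustration; it does not come from the text.

```python
def weights_per_neuron(height, width, channels):
    """Incoming weights for one fully connected neuron that sees every input value."""
    return height * width * channels

mnist = weights_per_neuron(28, 28, 1)      # black-and-white MNIST digit
color = weights_per_neuron(200, 200, 3)    # full-color 200 x 200 image

hidden_units = 1000  # hypothetical layer width, for illustration only
print(mnist)                 # 784
print(color)                 # 120000
print(color * hidden_units)  # 120000000 weights in a single hidden layer
```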

![5-3](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0503.png)

Figure 5-3. The density of connections between layers increases intractably as the size of the image increases

As we’ll see, the neurons in a convolutional layer are only connected to a small, local region of the preceding layer. A convolutional layer’s function can be expressed simply: it processes a three-dimensional volume of information to produce a new three-dimensional volume of information.

![5-4](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0504.png)

Figure 5-4. Convolutional layers arrange neurons in three dimensions, so layers have width, height, and depth

## Filters and Feature Maps
A `filter` is essentially a feature detector.

![5-5](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0505.png)

Figure 5-5. We’ll analyze this simple black-and-white image as a toy example

Let’s say that we want to detect vertical and horizontal lines in the image. For example, to detect vertical lines, we would use the feature detector on the top, slide it across the entirety of the image, and at every step check if we have a match. This result is our `feature map`, and it indicates where we’ve found the feature we’re looking for in the original image. We can do the same for the horizontal line detector (bottom), resulting in the feature map in the bottom-right corner.

![5-6](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0506.png)

Figure 5-6. Applying filters that detect vertical and horizontal lines on our toy example

This operation is called a convolution. We take a filter and we multiply it over the entire area of an input image.
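As a rough illustration of sliding a filter over an image, here is a minimal NumPy sketch. The 5 x 5 image and the 3 x 3 filter values are hypothetical and are not the ones shown in the figures. Like most deep learning libraries, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the
    element-wise products at every position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

# Hypothetical 5 x 5 toy image containing one vertical line in column 2.
image = np.zeros((5, 5))
image[:, 2] = 1.0

# Hypothetical 3 x 3 vertical-line detector: rewards a bright center column.
vertical = np.array([[-1.0, 2.0, -1.0],
                     [-1.0, 2.0, -1.0],
                     [-1.0, 2.0, -1.0]])

feature_map = slide_filter(image, vertical)
print(feature_map)
# [[-3.  6. -3.]
#  [-3.  6. -3.]
#  [-3.  6. -3.]]
```

The large positive values in the middle column of the feature map mark the positions where the filter is centered on the line.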

Filters represent combinations of connections (one such combination is highlighted in Figure 5-7) that get replicated across the entirety of the input.

The output layer is the feature map generated by this filter. A neuron in the feature map is activated if the filter contributing to its activity detected an appropriate feature at the corresponding position in the previous layer.

![5-7](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0507.png)

Figure 5-7. Representing filters and feature maps as neurons in a convolutional layer

We can express the feature map mathematically as follows:

$$m_{ij}^k=f((W \cdot x)_{ij} + b^k)$$

where:

- $m^k$ denotes the $k^{th}$ feature map in layer $m$
- $W$ denotes the weights of the corresponding filter
- $b^k$ is the bias of the neurons in the feature map (note that the bias is kept identical for all of the neurons in a feature map)

Suppose we want to detect faces and have accumulated three feature maps: one for eyes, one for noses, and one for mouths. We know that a particular location contains a face if the corresponding locations in the primitive feature maps contain the appropriate features (two eyes, a nose, and a mouth). In other words, **to make decisions about the existence of a face, we must combine evidence over multiple feature maps.**

As a result, feature maps must be able to operate over volumes, not just areas. This is shown below in Figure 5-8. Each cell in the input volume is a neuron. A local portion is multiplied with a filter (corresponding to weights in the convolutional layer) to produce a neuron in a feature map in the following volumetric layer of neurons.

![5-8](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0508.png)

Figure 5-8. Representing a full-color RGB image as a volume and applying a volumetric convolutional filter

The depth of the output volume of a convolutional layer is equivalent to the number of filters in that layer, because each filter produces its own slice. We visualize these relationships in Figure 5-9.

![5-9](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0509.png)

Figure 5-9. A three-dimensional visualization of a convolutional layer, where each filter corresponds to a slice in the resulting output volume

## Full Description of the Convolutional Layer
The input volume of a convolutional layer has the following characteristics:

- Its width $w_{in}$
- Its height $h_{in}$
- Its depth $d_{in}$
- Its zero padding p

This volume is processed by a total of k filters, which represent the weights and connections in the convolutional network. These filters have a number of hyperparameters, which are described as follows:

- Their spatial extent e, which is equal to the filter’s height and width.
- Their stride s, or the distance between consecutive applications of the filter on the input volume. If we use a stride of 1, we get the full convolution described in the previous section. We illustrate this in Figure 5-10.
- The bias b (a parameter learned like the values in the filter) which is added to each component of the convolution.

![5-10](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0510.png)

Figure 5-10. An illustration of a filter’s stride hyperparameter

This results in an output volume with the following characteristics:

- Its function f, which is applied to the incoming logit of each neuron in the output volume to determine its final value
- Its width $w_{out}=\lfloor \frac{w_{in}-e+2p}{s} \rfloor + 1$
- Its height $h_{out}=\lfloor \frac{h_{in}-e+2p}{s} \rfloor + 1$
- Its depth $d_{out}=k$
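The output-size arithmetic can be checked with a small helper. `conv_output_size` is a hypothetical function name, and the sketch uses floor division, the convention in which a filter position that would run past the padded input is dropped. When $s$ divides $w_{in}-e+2p$ evenly, as in both examples here, the rounding makes no difference.

```python
def conv_output_size(size_in, extent, stride, padding):
    """Output width (or height) of a convolutional layer.

    Uses floor division: filter positions that would extend past the
    padded input are dropped.
    """
    return (size_in - extent + 2 * padding) // stride + 1

# Width 5, zero padding 1, spatial extent 3, stride 2.
print(conv_output_size(5, extent=3, stride=2, padding=1))   # 3
# A 28-wide input with a 5 x 5 filter, stride 1, and padding 2 keeps its size.
print(conv_output_size(28, extent=5, stride=1, padding=2))  # 28
```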

![5-11](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0511.png)

Figure 5-11. This is a convolutional layer with an input volume that has width 5, height 5, depth 3, and zero padding 1. There are 2 filters, with spatial extent 3 and applied with a stride of 2. It results in an output volume with width 3, height 3, and depth 2. We apply the first convolutional filter to the upper-leftmost 3 x 3 piece of the input volume to generate the upper-leftmost entry of the first depth slice.

![5-12](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0512.png)

Figure 5-12. Using the same setup as Figure 5-11, we generate the next value in the first depth slice of the output volume.  
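To tie the pieces together, here is a deliberately naive NumPy sketch of a convolutional layer, run on the shapes from Figure 5-11. The random weights and inputs are placeholders, and ReLU is assumed as the function $f$. It is meant to show the mechanics, not to be efficient.

```python
import numpy as np

def conv_layer(volume, filters, biases, stride, padding):
    """Naive convolutional layer with ReLU as the function f.

    volume:  (h_in, w_in, d_in) input volume
    filters: (e, e, d_in, k) filter weights
    biases:  (k,) one bias per filter
    """
    e, _, _, k = filters.shape
    padded = np.pad(volume, ((padding, padding), (padding, padding), (0, 0)),
                    mode="constant")
    h_out = (padded.shape[0] - e) // stride + 1
    w_out = (padded.shape[1] - e) // stride + 1
    out = np.zeros((h_out, w_out, k))
    for i in range(h_out):
        for j in range(w_out):
            r, c = i * stride, j * stride
            patch = padded[r:r + e, c:c + e, :]   # local e x e x d_in region
            for f in range(k):
                logit = np.sum(patch * filters[..., f]) + biases[f]
                out[i, j, f] = max(logit, 0.0)    # ReLU
    return out

rng = np.random.default_rng(0)
volume = rng.standard_normal((5, 5, 3))      # height 5, width 5, depth 3
filters = rng.standard_normal((3, 3, 3, 2))  # 2 filters with spatial extent 3
biases = np.zeros(2)

out = conv_layer(volume, filters, biases, stride=2, padding=1)
print(out.shape)  # (3, 3, 2)
```

Each filter spans the full depth of the input and contributes one depth slice of the output, which is why the output depth equals the number of filters.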

TensorFlow provides us with a convenient operation to easily perform a convolution on a minibatch of input volumes (note that we must apply our choice of function $f$ ourselves; it is not performed by the operation itself):

```py
tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=True, name=None)
```

- `input`: a four-dimensional tensor of size $N \times h_{in} \times w_{in} \times d_{in}$, where $N$ is the number of examples in our minibatch.
- `filter`: also a four-dimensional tensor, representing all of the filters applied in the convolution. It is of size $e \times e \times d_{in} \times k$.
- The resulting tensor emitted by this operation has the same structure as `input`.
- Setting the `padding` argument to `"SAME"` selects the zero padding so that height and width are preserved by the convolutional layer (when the stride is 1).
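The arithmetic behind `"SAME"` can be sketched as follows, based on the rule described in the TensorFlow documentation: the output size is the input size divided by the stride, rounded up, and just enough zeros are added to make that possible. `same_padding` is a hypothetical helper written for illustration, not a TensorFlow function.

```python
import math

def same_padding(size_in, extent, stride):
    """Output size and (before, after) zero padding along one dimension
    under the "SAME" padding rule."""
    size_out = math.ceil(size_in / stride)
    total = max((size_out - 1) * stride + extent - size_in, 0)
    before = total // 2
    return size_out, (before, total - before)

print(same_padding(28, extent=5, stride=1))  # (28, (2, 2))
print(same_padding(28, extent=2, stride=2))  # (14, (0, 0))
```

With a stride of 1 the size is preserved. With a stride of 2 it is halved, regardless of the filter's extent.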

## Max Pooling
The essential idea behind max pooling is to break up each feature map into equally sized tiles. Then we create a condensed feature map. Specifically, we create a cell for each tile, compute the maximum value in the tile, and propagate this maximum value into the corresponding cell of the condensed feature map. This process is illustrated in Figure 5-13.

![5-13](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0513.png)

Figure 5-13. An illustration of how max pooling significantly reduces parameters as we move up the network

We can describe a pooling layer with two parameters:

- Its spatial extent e
- Its stride s

It’s important to note that only two major variations of the pooling layer are used. The first is the nonoverlapping pooling layer with e = 2, s = 2. The second is the overlapping pooling layer with e = 3, s = 2. The resulting dimensions of each feature map are as follows:

- Its width $w_{out}=\lceil \frac{w_{in}-e}{s} \rceil + 1$
- Its height $h_{out}=\lceil \frac{h_{in}-e}{s} \rceil + 1$
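A minimal NumPy sketch of max pooling over a single feature map follows. The 4 x 4 values are hypothetical, and the sketch keeps only tiles that fit entirely inside the feature map (no padding).

```python
import numpy as np

def max_pool(feature_map, extent, stride):
    """Max pooling over one 2-D feature map (no padding)."""
    h_out = (feature_map.shape[0] - extent) // stride + 1
    w_out = (feature_map.shape[1] - extent) // stride + 1
    pooled = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            r, c = i * stride, j * stride
            pooled[i, j] = feature_map[r:r + extent, c:c + extent].max()
    return pooled

# Hypothetical 4 x 4 feature map.
feature_map = np.array([[1., 3., 2., 0.],
                        [4., 2., 1., 5.],
                        [0., 1., 7., 2.],
                        [3., 2., 1., 6.]])

pooled = max_pool(feature_map, extent=2, stride=2)
print(pooled)
# [[4. 5.]
#  [3. 7.]]
```

With e = 2 and s = 2 the tiles do not overlap, so the 4 x 4 map is condensed to 2 x 2, a quarter of the original number of cells.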

## Full Architectural Description of Convolution Networks
![5-14](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0514.png)

Figure 5-14. Various convolutional network architectures of various complexities. The architecture of VGGNet, a deep convolutional network built for ImageNet, is shown in the rightmost network.

## Closing the Loop on MNIST with Convolutional Networks
We’ll build a convolutional network with a pretty standard architecture (modeled after the second network in Figure 5-14): two convolutional layers and two pooling layers, interleaved, followed by a fully connected layer (with dropout, p = 0.5) and a terminal softmax.

```py
import tensorflow as tf

def conv2d(input, weight_shape, bias_shape):
    # weight_shape is [extent, extent, depth_in, num_filters]
    incoming = weight_shape[0] * weight_shape[1] * weight_shape[2]
    weight_init = tf.random_normal_initializer(stddev=(2.0/incoming)**0.5)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    bias_init = tf.constant_initializer(value=0)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    # Stride 1 with "SAME" padding preserves height and width; ReLU is applied by us
    return tf.nn.relu(tf.nn.bias_add(tf.nn.conv2d(input, W, strides=[1, 1, 1, 1], padding='SAME'), b))

def max_pool(input, k=2):
    # Nonoverlapping pooling: extent k, stride k
    return tf.nn.max_pool(input, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

def layer(input, weight_shape, bias_shape):
    # Fully connected layer with ReLU
    weight_init = tf.random_normal_initializer(stddev=(2.0/weight_shape[0])**0.5)
    bias_init = tf.constant_initializer(value=0)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    return tf.nn.relu(tf.matmul(input, W) + b)

def inference(x, keep_prob):
    # Reshape flattened pixels into N x 28 x 28 x 1
    x = tf.reshape(x, shape=[-1, 28, 28, 1])
    with tf.variable_scope("conv_1"):
        conv_1 = conv2d(x, [5, 5, 1, 32], [32])
        pool_1 = max_pool(conv_1)

    with tf.variable_scope("conv_2"):
        conv_2 = conv2d(pool_1, [5, 5, 32, 64], [64])
        pool_2 = max_pool(conv_2)

    with tf.variable_scope("fc"):
        # Two rounds of pooling reduce 28 x 28 to 7 x 7
        pool_2_flat = tf.reshape(pool_2, [-1, 7 * 7 * 64])
        fc_1 = layer(pool_2_flat, [7*7*64, 1024], [1024])

        # apply dropout
        fc_1_drop = tf.nn.dropout(fc_1, keep_prob)

    with tf.variable_scope("output"):
        output = layer(fc_1_drop, [1024, 10], [10])
    return output

def loss(output, y):
    xentropy = tf.nn.softmax_cross_entropy_with_logits(logits=output, labels=y)
    loss = tf.reduce_mean(xentropy)
    return loss

def training(cost, global_step):
    tf.summary.scalar("cost", cost)
    # learning_rate is a module-level hyperparameter set in the training script
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.minimize(cost, global_step=global_step)
    return train_op

def evaluate(output, y):
    correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    tf.summary.scalar("validation error", (1.0 - accuracy))
    return accuracy
```

- Input layer
    - Take the flattened input pixel values and reshape them into a tensor of size $N \times 28 \times 28 \times 1$, where
        - $N$ is the number of examples in a minibatch,
        - 28 is the width and height of each image, and
        - 1 is the depth (because the images are black and white; if the images were in RGB color, the depth would instead be 3, one for each color map).
- First convolutional layer
    - Build a convolutional layer with 32 filters that have spatial extent 5.
    - This takes an input volume of depth 1 and emits an output tensor of depth 32.
- Max pooling
    - The output is then passed through a max pooling layer, which compresses the information.
- Second convolutional layer
    - Build a second convolutional layer with 64 filters, again with spatial extent 5, taking an input tensor of depth 32 and emitting an output tensor of depth 64.
- Max pooling
    - The output is again passed through a max pooling layer to compress information.
- Fully connected layer and output
    - Flatten the pooled $7 \times 7 \times 64$ volume, pass it through a fully connected layer with 1,024 units (with dropout), and then through the 10-unit output layer.
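The shapes flowing through this architecture can be traced with a short, framework-free sketch. `trace_shapes` is a hypothetical helper. It assumes, as in the TensorFlow code for this network, stride-1 convolutions with `"SAME"` padding and max pooling with extent 2 and stride 2.

```python
def trace_shapes(size=28, depth=1, conv_filters=(32, 64), pool=2):
    """Follow the (height, width, depth) of the volume through the network.

    "SAME" convolutions with stride 1 keep the spatial size; each max
    pooling layer with extent 2 and stride 2 halves it (rounding up).
    """
    shapes = [(size, size, depth)]
    for k in conv_filters:
        shapes.append((size, size, k))  # convolution: only the depth changes
        size = -(-size // pool)         # pooling: ceil(size / pool)
        shapes.append((size, size, k))
    return shapes

shapes = trace_shapes()
for shape in shapes:
    print(shape)
# (28, 28, 1)
# (28, 28, 32)
# (14, 14, 32)
# (14, 14, 64)
# (7, 7, 64)

h, w, d = shapes[-1]
print(h * w * d)  # 3136 inputs to the fully connected layer
```

The final value, 7 x 7 x 64 = 3,136, is the size of the flattened vector fed to the fully connected layer.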

```py
import input_data
mnist = input_data.read_data_sets("data/", one_hot=True)

import tensorflow as tf

# Architecture
n_hidden_1 = 256
n_hidden_2 = 256

# Parameters
learning_rate = 0.0001
training_epochs = 1000
batch_size = 100
display_step = 10

with tf.device("/gpu:0"):
    with tf.Graph().as_default():
        with tf.variable_scope("mnist_conv_model"):
            x = tf.placeholder("float", [None, 784]) # mnist data image of shape 28*28=784
            y = tf.placeholder("float", [None, 10]) # 0-9 digits recognition => 10 classes
            keep_prob = tf.placeholder(tf.float32) # dropout keep probability

            output = inference(x, keep_prob)
            cost = loss(output, y)
            global_step = tf.Variable(0, name='global_step', trainable=False)
            train_op = training(cost, global_step)
            eval_op = evaluate(output, y)
            summary_op = tf.summary.merge_all()
            saver = tf.train.Saver()

            sess = tf.Session()
            summary_writer = tf.summary.FileWriter("conv_mnist_logs/", graph=sess.graph)

            init_op = tf.global_variables_initializer()
            sess.run(init_op)

            # Training cycle
            for epoch in range(training_epochs):
                avg_cost = 0.
                total_batch = int(mnist.train.num_examples/batch_size)
                # Loop over all batches
                for i in range(total_batch):
                    minibatch_x, minibatch_y = mnist.train.next_batch(batch_size)
                    # Fit training using batch data
                    sess.run(train_op, feed_dict={x: minibatch_x, y: minibatch_y, keep_prob: 0.5})
                    # Compute average loss
                    avg_cost += sess.run(cost, feed_dict={x: minibatch_x, y: minibatch_y, keep_prob: 0.5})/total_batch
                # Display logs per epoch step
                if epoch % display_step == 0:
                    print("Epoch: %04d cost = %.9f" % (epoch + 1, avg_cost))
                    # Dropout is disabled (keep_prob = 1) when evaluating
                    accuracy = sess.run(eval_op, feed_dict={x: mnist.validation.images, y: mnist.validation.labels, keep_prob: 1})
                    print("Validation Error: %s" % (1 - accuracy))

                    summary_str = sess.run(summary_op, feed_dict={x: minibatch_x, y: minibatch_y, keep_prob: 0.5})
                    summary_writer.add_summary(summary_str, sess.run(global_step))
                    saver.save(sess, "conv_mnist_logs/model-checkpoint", global_step=global_step)

            print("Optimization Finished!")
            accuracy = sess.run(eval_op, feed_dict={x: mnist.test.images, y: mnist.test.labels, keep_prob: 1})
            print("Test Accuracy: %s" % accuracy)
```

## Building a Convolutional Network for CIFAR-10
The CIFAR-10 challenge consists of 32 x 32 color images that belong to one of 10 possible classes.

Normalization of image inputs helps out the training process by making it more robust to variations. Batch normalization takes this a step further by normalizing inputs to every layer in our neural network. Specifically, we modify the architecture of our network to include operations that:

1. Grab the vector of logits incoming to a layer before they pass through the nonlinearity
1. Normalize each component of the vector of logits across all examples of the minibatch by subtracting the mean and dividing by the standard deviation (we keep track of the moments using an exponentially weighted moving average)
1. Given normalized inputs x̂, use an affine transform to restore representational power with two vectors of (trainable) parameters: γx̂ + β
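The three steps above can be sketched in NumPy for a single fully connected layer. The minibatch values, the small constant `eps` added for numerical stability, and the decay of 0.9 for the moving average are illustrative assumptions.

```python
import numpy as np

def batch_norm(logits, gamma, beta, eps=1e-3):
    """Normalize each component across the minibatch, then apply the
    trainable affine transform gamma * x_hat + beta."""
    mean = logits.mean(axis=0)
    var = logits.var(axis=0)
    x_hat = (logits - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mean, var

rng = np.random.default_rng(0)
logits = rng.normal(loc=5.0, scale=3.0, size=(128, 4))  # hypothetical minibatch

gamma, beta = np.ones(4), np.zeros(4)  # initial values of the trainable vectors
normed, batch_mean, batch_var = batch_norm(logits, gamma, beta)
print(normed.mean(axis=0))  # approximately 0 for every component
print(normed.std(axis=0))   # approximately 1 for every component

# Exponentially weighted moving average of the mean, for use at test time.
decay = 0.9
ema_mean = np.zeros(4)
ema_mean = decay * ema_mean + (1 - decay) * batch_mean
```

At test time, the moving averages stand in for the minibatch statistics, so a prediction does not depend on which other examples happen to share the batch.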

In TensorFlow, batch normalization can be expressed as follows for a convolutional layer:

```py
def conv_batch_norm(x, n_out, phase_train):
    beta_init = tf.constant_initializer(value=0.0, dtype=tf.float32)
    gamma_init = tf.constant_initializer(value=1.0, dtype=tf.float32)

    beta = tf.get_variable("beta", [n_out], initializer=beta_init)
    gamma = tf.get_variable("gamma", [n_out], initializer=gamma_init)

    batch_mean, batch_var = tf.nn.moments(x, [0,1,2], name='moments')
    ema = tf.train.ExponentialMovingAverage(decay=0.9)
    ema_apply_op = ema.apply([batch_mean, batch_var])
    ema_mean, ema_var = ema.average(batch_mean), ema.average(batch_var)

    def mean_var_with_update():
        with tf.control_dependencies([ema_apply_op]):
            return tf.identity(batch_mean), tf.identity(batch_var)

    mean, var = tf.cond(phase_train,
        mean_var_with_update,
        lambda: (ema_mean, ema_var))

    normed = tf.nn.batch_norm_with_global_normalization(x, mean, var, beta, gamma, 1e-3, True)
    return normed
```

Without batch normalization: [convnet_cifar.py](convnet_cifar.py)

With batch normalization:

```py
# Written against the TensorFlow 1.x graph API (tf.summary.*, tf.cond)

def conv_batch_norm(x, n_out, phase_train):
    beta_init = tf.constant_initializer(value=0.0, dtype=tf.float32)
    gamma_init = tf.constant_initializer(value=1.0, dtype=tf.float32)

    beta = tf.get_variable("beta", [n_out], initializer=beta_init)
    gamma = tf.get_variable("gamma", [n_out], initializer=gamma_init)

    # Moments over batch, height, and width: one mean/variance per feature map
    batch_mean, batch_var = tf.nn.moments(x, [0,1,2], name='moments')
    ema = tf.train.ExponentialMovingAverage(decay=0.9)
    ema_apply_op = ema.apply([batch_mean, batch_var])
    ema_mean, ema_var = ema.average(batch_mean), ema.average(batch_var)
    def mean_var_with_update():
        with tf.control_dependencies([ema_apply_op]):
            return tf.identity(batch_mean), tf.identity(batch_var)
    # Training: batch statistics (and update the averages); testing: moving averages
    mean, var = tf.cond(phase_train,
        mean_var_with_update,
        lambda: (ema_mean, ema_var))

    normed = tf.nn.batch_norm_with_global_normalization(x, mean, var,
        beta, gamma, 1e-3, True)
    return normed

def layer_batch_norm(x, n_out, phase_train):
    beta_init = tf.constant_initializer(value=0.0, dtype=tf.float32)
    gamma_init = tf.constant_initializer(value=1.0, dtype=tf.float32)

    beta = tf.get_variable("beta", [n_out], initializer=beta_init)
    gamma = tf.get_variable("gamma", [n_out], initializer=gamma_init)

    # Moments over the batch dimension only
    batch_mean, batch_var = tf.nn.moments(x, [0], name='moments')
    ema = tf.train.ExponentialMovingAverage(decay=0.9)
    ema_apply_op = ema.apply([batch_mean, batch_var])
    ema_mean, ema_var = ema.average(batch_mean), ema.average(batch_var)
    def mean_var_with_update():
        with tf.control_dependencies([ema_apply_op]):
            return tf.identity(batch_mean), tf.identity(batch_var)
    mean, var = tf.cond(phase_train,
        mean_var_with_update,
        lambda: (ema_mean, ema_var))

    # The batch norm op expects a 4-D tensor, so reshape and then restore
    reshaped_x = tf.reshape(x, [-1, 1, 1, n_out])
    normed = tf.nn.batch_norm_with_global_normalization(reshaped_x, mean, var,
        beta, gamma, 1e-3, True)
    return tf.reshape(normed, [-1, n_out])

def filter_summary(V, weight_shape):
    # Put the filter index first so that each filter is logged as one image
    V_T = tf.transpose(V, (3, 0, 1, 2))
    tf.summary.image("filters", V_T, max_outputs=64)

def conv2d(input, weight_shape, bias_shape, phase_train, visualize=False):
    incoming = weight_shape[0] * weight_shape[1] * weight_shape[2]
    weight_init = tf.random_normal_initializer(stddev=(2.0/incoming)**0.5)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    if visualize:
        filter_summary(W, weight_shape)
    bias_init = tf.constant_initializer(value=0)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    logits = tf.nn.bias_add(tf.nn.conv2d(input, W, strides=[1, 1, 1, 1], padding='SAME'), b)
    # Normalize the logits before the nonlinearity
    return tf.nn.relu(conv_batch_norm(logits, weight_shape[3], phase_train))

def max_pool(input, k=2):
    return tf.nn.max_pool(input, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

def layer(input, weight_shape, bias_shape, phase_train):
    weight_init = tf.random_normal_initializer(stddev=(2.0/weight_shape[0])**0.5)
    bias_init = tf.constant_initializer(value=0)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    logits = tf.matmul(input, W) + b
    return tf.nn.relu(layer_batch_norm(logits, weight_shape[1], phase_train))

def inference(x, keep_prob, phase_train):
    with tf.variable_scope("conv_1"):
        conv_1 = conv2d(x, [5, 5, 3, 64], [64], phase_train, visualize=True)
        pool_1 = max_pool(conv_1)

    with tf.variable_scope("conv_2"):
        conv_2 = conv2d(pool_1, [5, 5, 64, 64], [64], phase_train)
        pool_2 = max_pool(conv_2)

    with tf.variable_scope("fc_1"):
        # Flattened size = height * width * depth of the pooled volume
        dim = 1
        for d in pool_2.get_shape()[1:].as_list():
            dim *= d

        pool_2_flat = tf.reshape(pool_2, [-1, dim])
        fc_1 = layer(pool_2_flat, [dim, 384], [384], phase_train)

        # apply dropout
        fc_1_drop = tf.nn.dropout(fc_1, keep_prob)

    with tf.variable_scope("fc_2"):
        fc_2 = layer(fc_1_drop, [384, 192], [192], phase_train)
        # apply dropout
        fc_2_drop = tf.nn.dropout(fc_2, keep_prob)

    with tf.variable_scope("output"):
        output = layer(fc_2_drop, [192, 10], [10], phase_train)

    return output

def loss(output, y):
    # Labels are integer class indices, so use the sparse variant
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output, labels=tf.cast(y, tf.int64))
    loss = tf.reduce_mean(xentropy)
    return loss

def training(cost, global_step):
    tf.summary.scalar("cost", cost)
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.minimize(cost, global_step=global_step)
    return train_op

def evaluate(output, y):
    correct_prediction = tf.equal(tf.cast(tf.argmax(output, 1), dtype=tf.int32), y)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    tf.summary.scalar("validation error", (1.0 - accuracy))
    return accuracy
```

A few notes on this network:

1. We integrate batch normalization into both the convolutional and the fully connected layers.
1. We use two convolutional layers, each followed by a max pooling layer.
1. These are followed by two fully connected layers and a softmax.
1. Dropout is included for reference, but in the batch normalization version `keep_prob=1` during training.

```py
import cifar10_input
cifar10_input.maybe_download_and_extract()

import tensorflow as tf
# In TensorFlow 1.x this module lives under tensorflow.python.ops
from tensorflow.python.ops import control_flow_ops
import numpy as np
import os

# Architecture
n_hidden_1 = 256
n_hidden_2 = 256

# Parameters
learning_rate = 0.01
training_epochs = 1000
batch_size = 128
display_step = 1

def inputs(eval_data=True):
    data_dir = os.path.join('data/cifar10_data', 'cifar-10-batches-bin')
    return cifar10_input.inputs(eval_data=eval_data, data_dir=data_dir,
                                batch_size=batch_size)

def distorted_inputs():
    data_dir = os.path.join('data/cifar10_data', 'cifar-10-batches-bin')
    return cifar10_input.distorted_inputs(data_dir=data_dir,
                                          batch_size=batch_size)

with tf.device("/gpu:0"):
    with tf.Graph().as_default():
        with tf.variable_scope("cifar_conv_bn_model"):
            x = tf.placeholder("float", [None, 24, 24, 3])
            y = tf.placeholder("int32", [None])
            keep_prob = tf.placeholder(tf.float32) # dropout keep probability
            phase_train = tf.placeholder(tf.bool) # training or testing

            distorted_images, distorted_labels = distorted_inputs()
            val_images, val_labels = inputs()

            output = inference(x, keep_prob, phase_train)
            cost = loss(output, y)
            global_step = tf.Variable(0, name='global_step', trainable=False)
            train_op = training(cost, global_step)
            eval_op = evaluate(output, y)

            summary_op = tf.summary.merge_all()

            saver = tf.train.Saver()
            sess = tf.Session()

            summary_writer = tf.summary.FileWriter("conv_cifar_bn_logs/", graph=sess.graph)
            init_op = tf.global_variables_initializer()
            sess.run(init_op)
            tf.train.start_queue_runners(sess=sess)

            # Training cycle
            for epoch in range(training_epochs):
                avg_cost = 0.
                total_batch = int(cifar10_input.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN/batch_size)
                # Loop over all batches
                for i in range(total_batch):
                    # Fit training using batch data

                    train_x, train_y = sess.run([distorted_images, distorted_labels])
                    _, new_cost = sess.run([train_op, cost], feed_dict={x: train_x, y: train_y, keep_prob: 1, phase_train: True})
                    # Compute average loss
                    avg_cost += new_cost/total_batch
                    print("Epoch %d, minibatch %d of %d. Cost = %0.4f." % (epoch, i, total_batch, new_cost))

                # Display logs per epoch step
                if epoch % display_step == 0:
                    print("Epoch: %04d cost = %.9f" % (epoch + 1, avg_cost))
                    val_x, val_y = sess.run([val_images, val_labels])
                    # phase_train=False makes batch norm use the moving averages
                    accuracy = sess.run(eval_op, feed_dict={x: val_x, y: val_y, keep_prob: 1, phase_train: False})
                    print("Validation Error: %s" % (1 - accuracy))

                    summary_str = sess.run(summary_op, feed_dict={x: train_x, y: train_y, keep_prob: 1, phase_train: False})
                    summary_writer.add_summary(summary_str, sess.run(global_step))
                    saver.save(sess, "conv_cifar_bn_logs/model-checkpoint", global_step=global_step)

            print("Optimization Finished!")
            val_x, val_y = sess.run([val_images, val_labels])
            accuracy = sess.run(eval_op, feed_dict={x: val_x, y: val_y, keep_prob: 1, phase_train: False})
            print("Test Accuracy: %s" % accuracy)
```

## Visualizing Learning in Convolutional Networks
![5-17](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0517.png)

Figure 5-17. Training a convolutional network without batch normalization (left) versus with batch normalization (right). Batch normalization vastly accelerates the training process. 

![5-18](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0518.png)

Figure 5-18. A subset of the learned filters in the first convolutional layer of our network

We then take this high-dimensional representation for each image and use an algorithm known as `t-Distributed Stochastic Neighbor Embedding`, or `t-SNE`, to compress it to a two-dimensional representation that we can visualize.

![5-19](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0519.png)

Figure 5-19. The t-SNE embedding (center) surrounded by zoomed-in subsegments of the embedding (periphery). 

## Leveraging Convolutional Filters to Replicate Artistic Styles
The goal of neural style is to be able to take an arbitrary photograph and re-render it as if it were painted in the style of a famous artist.

Let’s take a pre-trained convolutional network. There are three images that we’re dealing with. The first two are the source of content p and the source of style a. The third image is the generated image x. Our goal will be to derive an error function that we can backpropagate that, when minimized, will perfectly combine the content of the desired photograph and the style of the desired artwork.

We start with content first. If a layer in the network has $k_l$ filters, then it produces a total of $k_l$ feature maps. Let’s call the size of each feature map $m_l$, the height times the width of the feature map. This means that the activations in all the feature maps of this layer can be stored in a matrix $F^{(l)}$ of size $k_l \times m_l$. We can also represent all the activations of the photograph in a matrix $P^{(l)}$ and all the activations of the generated image in the matrix $X^{(l)}$. Using the activations of the relu4_2 layer of the original VGGNet, we define the content error as:

$$E_{content}(p,x)=\sum_{ij}(P_{ij}^{(l)}-X_{ij}^{(l)})^2$$
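As a small numerical illustration, the content error is just the sum of squared differences between two activation matrices. The matrices below are random placeholders of a hypothetical size, not real VGGNet activations.

```python
import numpy as np

def content_error(P, X):
    """Sum of squared differences between two k_l x m_l activation matrices."""
    return np.sum((P - X) ** 2)

rng = np.random.default_rng(0)
k_l, m_l = 4, 9                      # hypothetical: 4 feature maps of size 3 x 3
P = rng.standard_normal((k_l, m_l))  # stand-in for the photograph's activations
X = rng.standard_normal((k_l, m_l))  # stand-in for the generated image's activations

print(content_error(P, P))      # 0.0
print(content_error(P, X) > 0)  # True
```

The error is zero only when the generated image produces exactly the same activations as the photograph at the chosen layer.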

Now we can try tackling style. To do this we construct a matrix known as the `Gram matrix`, which represents correlations between feature maps in a given layer. The correlations represent the texture and feel that is common among all features, irrespective of which features we’re looking at. Constructing the Gram matrix, which is of size $k_l \times k_l$, for a given image is done as follows:

$$G_{ij}^{(l)}=\sum_{c=1}^{m_l} F_{ic}^{(l)}F_{jc}^{(l)}$$

We can compute the Gram matrices for both the artwork in matrix $A^{(l)}$ and the generated image in $G^{(l)}$. We can then represent the error function as:

$$E_{style}(a,x)=\frac{1}{L}\sum_{l=1}^L \frac{1}{4k_l^2m_l^2}\sum_{ij}(A_{ij}^{(l)}-G_{ij}^{(l)})^2$$
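A minimal NumPy sketch of the Gram matrix and the style error follows. The activation matrices are random placeholders with hypothetical shapes. The normalization $\frac{1}{4k_l^2m_l^2}$ is applied per layer, since $k_l$ and $m_l$ depend on $l$.

```python
import numpy as np

def gram_matrix(F):
    """k_l x k_l matrix of inner products between the rows (feature maps) of F."""
    return F @ F.T

def style_error(artwork_acts, generated_acts):
    """Average over layers of the normalized squared Gram-matrix difference."""
    L = len(artwork_acts)
    total = 0.0
    for F_a, F_x in zip(artwork_acts, generated_acts):
        k_l, m_l = F_a.shape
        A, G = gram_matrix(F_a), gram_matrix(F_x)
        total += np.sum((A - G) ** 2) / (4.0 * k_l**2 * m_l**2) / L
    return total

rng = np.random.default_rng(0)
layer_shapes = [(4, 9), (8, 4)]  # hypothetical (k_l, m_l) for two layers
artwork_acts = [rng.standard_normal(s) for s in layer_shapes]
generated_acts = [rng.standard_normal(s) for s in layer_shapes]

print(gram_matrix(artwork_acts[0]).shape)            # (4, 4)
print(style_error(artwork_acts, artwork_acts))       # 0.0
print(style_error(artwork_acts, generated_acts) > 0) # True
```

Because the Gram matrix sums over all positions in a feature map, it discards where features occur and keeps only how strongly pairs of features co-occur, which is what lets it capture texture rather than layout.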

TensorFlow code for neural style is available at https://github.com/darksigma/Fundamentals-of-Deep-Learning-Book/tree/master/archive/neural_style

![5-20](https://www.safaribooksonline.com/library/view/fundamentals-of-deep/9781491925607/assets/fodl_0520.png)

Figure 5-20. The result of mixing the Rain Princess with a photograph of the MIT Dome. Image credit: Anish Athalye.