# Getting Started With TensorFlow

This training session is intended to get you started programming in TensorFlow. 

TensorFlow provides multiple APIs. The lowest level API--**TensorFlow Core**-- provides you with complete programming control. We recommend TensorFlow Core for machine learning researchers and others who require fine levels of control over their models. The higher level APIs are built on top of TensorFlow Core. These higher level APIs are typically easier to learn and use than TensorFlow Core. In addition, the higher level APIs make repetitive tasks easier and more consistent between different users. A high-level API like `tf.estimator` helps you manage data sets, estimators, training and inference.

We begin with an introduction to TensorFlow Core. Later, you will see how to implement the same model in `tf.estimator`. Knowing TensorFlow Core principles will give you a great mental model of how things are working internally when you use the more compact higher level API.

# Tensors

The central unit of data in TensorFlow is the **tensor**. A tensor consists of a **set of primitive values shaped into an array of any number of dimensions**. A tensor's rank is its number of dimensions. Here are some examples of tensors:

In [None]:
3. # a rank 0 tensor; this is a scalar with shape []
[1., 2., 3.] # a rank 1 tensor; this is a vector with shape [3]
[[1., 2., 3.], [4., 5., 6.]] # a rank 2 tensor; a matrix with shape [2, 3]
[[[1., 2., 3.]], [[7., 8., 9.]]]; # a rank 3 tensor with shape [2, 1, 3]
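
To make rank and shape concrete, here is a minimal pure-Python sketch that derives both from a nested list. The helper names `shape_of` and `rank_of` are hypothetical and not part of TensorFlow, and the sketch assumes the nesting is regular (every sibling list has the same length).

```python
def shape_of(value):
    """Return the shape of a regularly nested list as a list of sizes."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        # Descend into the first element; assumes all siblings share its shape.
        value = value[0] if value else None
    return shape


def rank_of(value):
    """The rank is the number of dimensions, i.e. the length of the shape."""
    return len(shape_of(value))


examples = [
    3.0,                                  # scalar
    [1., 2., 3.],                         # vector
    [[1., 2., 3.], [4., 5., 6.]],         # matrix
    [[[1., 2., 3.]], [[7., 8., 9.]]],     # rank 3 tensor
]
for example in examples:
    print(rank_of(example), shape_of(example))
```

The loop prints one rank and shape pair per example, matching the comments in the cell above.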

## TensorFlow Core tutorial

In [None]:
import tensorflow as tf
import pandas as pd
from IPython.display import Image

In [None]:
print(tf.__version__)

In [None]:
print(pd.__version__)

### The Computational Graph

You might think of TensorFlow Core programs as consisting of two discrete sections:

 1. **Building the computational graph**.
 2. **Running the computational graph**.

A computational graph is a **series of TensorFlow operations arranged into a graph of nodes**. Each node takes zero or more tensors as inputs and produces a tensor as an output. One type of node is a **constant**. Like all TensorFlow constants, it takes **no inputs**, and it **outputs a value it stores internally**. We can create two floating point tensors (node1 and node2) as follows:

In [None]:
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0)  # also tf.float32 implicitly
print(node1, node2)

Notice that printing the nodes does not output the values 3.0 and 4.0 as you might expect. Instead, they are nodes that, when evaluated, would produce 3.0 and 4.0, respectively. **To actually evaluate the nodes, we must run the computational graph within a session**. A session encapsulates the control and state of the TensorFlow runtime.

In [None]:
sess = tf.Session()
print(sess.run([node1, node2]))

We can **build more complicated computations by combining `Tensor` nodes with operations (operations are also nodes in the computational graph)**. For example, we can add our two constant nodes and produce a new graph as follows:

In [None]:
node3 = tf.add(node1, node2)
print("node3:", node3)
print("sess.run(node3):", sess.run(node3))

As it stands, this graph is not especially interesting because it always produces a constant result. **A graph can be parameterized to accept external inputs, known as placeholders**. A placeholder is a promise to provide a value later.

In [None]:
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b  # + provides a shortcut for tf.add(a, b)

In [None]:
print(sess.run(adder_node, {a:3, b:4.5}))
print(sess.run(adder_node, {a: [1,3], b: [2,4]}))
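
The split between building and running a graph can be illustrated without TensorFlow. The following toy sketch is only an analogy, not how TensorFlow is implemented: the classes `Constant`, `Placeholder`, `Add` and the function `run` are hypothetical stand-ins for `tf.constant`, `tf.placeholder`, `tf.add` and `sess.run`.

```python
class Node:
    """A graph node: describes a computation without performing it."""

    def evaluate(self, feed):
        raise NotImplementedError


class Constant(Node):
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed):
        return self.value


class Placeholder(Node):
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        # The value is only supplied when the graph is run.
        return feed[self.name]


class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) + self.right.evaluate(feed)


def run(node, feed=None):
    """Toy stand-in for sess.run: walk the graph and compute a value."""
    return node.evaluate(feed or {})


total = Add(Constant(3.0), Constant(4.0))   # building: no arithmetic happens
print(total)                                # an object description, not a number
print(run(total))                           # running: the sum is computed now

adder = Add(Placeholder("a"), Placeholder("b"))
print(run(adder, {"a": 3, "b": 4.5}))       # values are fed in at run time
```

As in the TensorFlow cells above, printing a node shows only the node object, and a value appears only once the graph is run.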

We can make the computational graph more complex by adding another operation. For example,

In [None]:
add_and_triple = adder_node * 3.
print(sess.run(add_and_triple, {a: 3, b: 4.5}))

In machine learning we will typically want a model that can take arbitrary inputs, such as the one above. To make the model trainable, we need to be able to modify the graph to get new outputs with the same input. **Variables allow us to add trainable parameters to a graph. They are constructed with a type and initial value:**

In [None]:
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b

Constants are initialized when you call tf.constant, and their value can never change. By contrast, variables are not initialized when you call tf.Variable. **To initialize all the variables in a TensorFlow program, you must explicitly call a special operation as follows:**

In [None]:
init = tf.global_variables_initializer()
sess.run(init)

It is important to realize `init` is a handle to the TensorFlow sub-graph that initializes all the global variables. Until we call sess.run, the variables are uninitialized.

Since x is a placeholder, we can evaluate linear_model for several values of x simultaneously as follows:

In [None]:
print(sess.run(linear_model, {x: [1, 2, 3, 4]}))

We've created a model, but we don't know how good it is yet. To evaluate the model on training data, we need a `y` placeholder to provide the desired values, and we need to write a loss function.

A loss function measures how far apart the current model is from the provided data. We'll use a standard loss model for linear regression, which sums the squares of the deltas between the current model and the provided data. `linear_model - y` creates a vector where each element is the corresponding example's error delta. We call `tf.square` to square that error. Then, we sum all the squared errors to create a single scalar that abstracts the error of all examples using `tf.reduce_sum`:

In [None]:
y = tf.placeholder(tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
print(sess.run(loss, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))

We could improve this manually by reassigning `W` and `b` to the optimal values of -1 and 1. A variable is initialized to the value provided to `tf.Variable` but can be changed using operations like `tf.assign`. With `W=-1` and `b=1` the model reproduces the training data exactly, so the loss drops to zero. We can change `W` and `b` accordingly:

In [None]:
fixW = tf.assign(W, [-1.])
fixb = tf.assign(b, [1.])
sess.run([fixW, fixb])
print(sess.run(loss, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))
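
The two loss values above can be checked by hand. This is a small pure-Python sketch of the same sum-of-squares arithmetic; the function names `linear_model_py` and `sum_squared_error` are hypothetical helpers, not TensorFlow functions.

```python
def linear_model_py(W, b, xs):
    """Evaluate W * x + b for every input x."""
    return [W * x + b for x in xs]


def sum_squared_error(predictions, targets):
    """Sum of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


xs = [1, 2, 3, 4]
ys = [0, -1, -2, -3]

# Initial parameters: the deltas are 0, 1.3, 2.6 and 3.9.
initial_loss = sum_squared_error(linear_model_py(0.3, -0.3, xs), ys)
print(initial_loss)   # approximately 23.66

# Optimal parameters: every prediction matches its target.
perfect_loss = sum_squared_error(linear_model_py(-1, 1, xs), ys)
print(perfect_loss)   # 0
```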

## tf.train API

A complete discussion of machine learning is beyond the scope of this training session. However, **TensorFlow provides optimizers that slowly change each variable in order to minimize the loss function**. The simplest optimizer is **gradient descent**. It modifies each variable in proportion to the derivative of the loss with respect to that variable, stepping in the direction that reduces the loss. In general, computing symbolic derivatives manually is tedious and error-prone. Consequently, **TensorFlow can automatically produce derivatives** given only a description of the model using the function `tf.gradients`. For simplicity, optimizers typically do this for you. For example:

In [None]:
optimizer = tf.train.GradientDescentOptimizer(0.01)  # Using a learning rate of 0.01
train = optimizer.minimize(loss)

In [None]:
sess.run(init) # reset values to incorrect defaults.
for i in range(1000):
    sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})

print(sess.run([W, b]))
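
For this small model the derivatives are simple enough to write out by hand, which shows what the optimizer automates. The sketch below is plain Python with manually derived gradients; it is not how `GradientDescentOptimizer` is implemented, and `gradient_descent_step` is a hypothetical helper.

```python
def gradient_descent_step(W, b, xs, ys, learning_rate):
    """One gradient descent update for loss = sum((W*x + b - y)**2)."""
    errors = [W * x + b - y for x, y in zip(xs, ys)]
    # d(loss)/dW = sum(2 * error * x), d(loss)/db = sum(2 * error)
    grad_W = sum(2 * e * x for e, x in zip(errors, xs))
    grad_b = sum(2 * e for e in errors)
    # Step against the gradient, scaled by the learning rate.
    return W - learning_rate * grad_W, b - learning_rate * grad_b


W_py, b_py = 0.3, -0.3            # same starting values as the variables above
for _ in range(1000):
    W_py, b_py = gradient_descent_step(
        W_py, b_py, [1, 2, 3, 4], [0, -1, -2, -3], 0.01)

print(W_py, b_py)                 # close to -1 and 1
```

With the same learning rate and step count, the hand-written loop should land near the same parameters as the TensorFlow cell.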

Although this simple linear regression doesn't require much TensorFlow Core code, more complicated models and methods to feed data into your model necessitate more code. Thus **TensorFlow provides higher level abstractions for common patterns, structures, and functionality**. We will learn how to use some of these abstractions in the next section.

### Putting it all together

In [None]:
import tensorflow as tf

# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W * x + b
y = tf.placeholder(tf.float32)

# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # initialize variables to their (incorrect) starting values
for i in range(1000):
    sess.run(train, {x: x_train, y: y_train})

# evaluate the loss on the training data
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))

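Notice that the loss is a very small number (very close to zero). If you run this program, your result may differ in the last few digits because of floating-point differences between platforms. The model itself starts from the fixed values `W = 0.3` and `b = -0.3`, so no random initialization is involved.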

## tf.estimator

`tf.estimator` is a high-level TensorFlow library that simplifies the mechanics of machine learning, including the following:

- running training loops
- running evaluation loops
- managing data sets

`tf.estimator` defines many common models.

In [None]:
import tensorflow as tf
# NumPy is often used to load, manipulate and preprocess data.
import numpy as np

### Basic Usage

In [None]:
# Declare list of features. We only have one numeric feature. There are many
# other types of columns that are more complicated and useful.
feature_columns = [tf.feature_column.numeric_column("x", shape=[1])]

In [None]:
# An estimator is the front end to invoke training (fitting) and evaluation
# (inference). There are many predefined types like linear regression,
# linear classification, and many neural network classifiers and regressors.
# The following code provides an estimator that does linear regression.
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

In [None]:
# TensorFlow provides many helper methods to read and set up data sets.
# Here we use two data sets: one for training and one for evaluation.
# We have to tell the function how many passes over the data (num_epochs)
# we want and how big each batch should be (batch_size).
x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])
x_test = np.array([2., 5., 8., 1.])
y_test = np.array([-1.01, -4.1, -7, 0.])
input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=None, shuffle=True)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=1000, shuffle=False)
test_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_test}, y_test, batch_size=4, num_epochs=1000, shuffle=False)

In [None]:
# We can run 1000 training steps by calling the train method and passing the
# training data set.
estimator.train(input_fn=input_fn, steps=1000)

In [None]:
# Here we evaluate how well our model did.
train_metrics = estimator.evaluate(input_fn=train_input_fn)
test_metrics = estimator.evaluate(input_fn=test_input_fn)
print("train metrics: %r"% train_metrics)
print("test metrics: %r"% test_metrics)

Notice how our test data has a higher loss, but it is still close to zero. **That means we are learning properly.**

### A Custom Model

`tf.estimator` does not lock you into its predefined models. Suppose we wanted to create a custom model that is not built into TensorFlow. **We can still retain the high level abstraction of data set, feeding, training, etc. of `tf.estimator`.** For illustration, we will show how to implement our own equivalent model to LinearRegressor using our knowledge of the lower level TensorFlow API.

To define a custom model that works with tf.estimator, we need to use **`tf.estimator.Estimator`**. `tf.estimator.LinearRegressor` is actually a sub-class of `tf.estimator.Estimator`. Instead of sub-classing `Estimator`, we simply provide `Estimator` a function `model_fn` that tells `tf.estimator` how it can evaluate predictions, training steps, and loss. The code is as follows:

In [None]:
import numpy as np
import tensorflow as tf

In [None]:
# Model function: builds the prediction, loss, and training sub-graphs for a
# linear model with a single real-valued feature
def model_fn(features, labels, mode):
    # Build a linear model and predict values
    W = tf.get_variable("W", [1], dtype=tf.float64)
    b = tf.get_variable("b", [1], dtype=tf.float64)
    y = W * features['x'] + b
    # Loss sub-graph
    loss = tf.reduce_sum(tf.square(y - labels))
    # Training sub-graph
    global_step = tf.train.get_global_step() # getting global step tensor
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    train = tf.group(optimizer.minimize(loss), tf.assign_add(global_step, 1))  # Update global_step by adding 1
    # EstimatorSpec connects subgraphs we built to the appropriate functionality.
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions=y,
        loss=loss,
        train_op=train)

In [None]:
estimator = tf.estimator.Estimator(model_fn=model_fn)

In [None]:
# define our data sets
x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])
x_test = np.array([2., 5., 8., 1.])
y_test = np.array([-1.01, -4.1, -7, 0.])

In [None]:
input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=None, shuffle=True)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=1000, shuffle=False)
test_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_test}, y_test, batch_size=4, num_epochs=1000, shuffle=False)

In [None]:
# train
estimator.train(input_fn=input_fn, steps=1000)

In [None]:
# Here we evaluate how well our model did.
train_metrics = estimator.evaluate(input_fn=train_input_fn)
test_metrics = estimator.evaluate(input_fn=test_input_fn)
print("train metrics: %r"% train_metrics)
print("test metrics: %r"% test_metrics)

Notice how the contents of the custom `model_fn()` function are very similar to our manual model training loop from the lower level API.

# Example: SoftMax Regression

When one learns how to program, there's a tradition that the first thing you do is print "`Hello World.`" Just like programming has Hello World, machine learning has MNIST.

MNIST is a simple computer vision dataset. It consists of images of handwritten digits. It also includes labels for each image, telling us which digit it is. 

We're going to train a model to look at images and predict what digits they are. Our goal isn't to train a really elaborate model that achieves state-of-the-art performance, but rather to dip a toe into using TensorFlow. As such, we're going to start with a very simple model, called a Softmax Regression.

The actual code is very short, and all the interesting stuff happens in just three lines. However, it is very important to understand the ideas behind it: both how TensorFlow works and the core machine learning concepts. Because of this, we are going to very carefully work through the code.

### Loading the data

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("data/mnist/", one_hot=True)

The MNIST data is split into three parts: 55,000 data points of **training data** (`mnist.train`), 10,000 points of **test data** (`mnist.test`), and 5,000 points of **validation data** (`mnist.validation`). This split is very important: it's essential in machine learning that we have separate data which we don't learn from so that **we can make sure that what we've learned actually generalizes!**

As mentioned earlier, every MNIST data point has two parts: an image of a handwritten digit and a corresponding label. We'll call the images "x" and the labels "y". **Both the training set and test set contain images and their corresponding labels;** for example the training images are `mnist.train.images` and the training labels are `mnist.train.labels`.

Each image is 28 pixels by 28 pixels. We can interpret this as a big array of numbers.

We can **flatten this array into a vector of 28x28 = 784 numbers**. It doesn't matter how we flatten the array, as long as we're consistent between images. From this perspective, the MNIST images are just a bunch of points in a 784-dimensional vector space, with a very rich structure.

Flattening the data throws away information about the 2D structure of the image. The result is that **`mnist.train.images` is a tensor (an n-dimensional array) with a shape of [55000, 784]**. The first dimension is an index into the list of images and the second dimension is the index for each pixel in each image. **Each entry in the tensor is a pixel intensity between 0 and 1, for a particular pixel in a particular image**.
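
As an illustration of flattening, here is a minimal sketch that flattens a 28x28 image row by row. Row-major order is only one consistent choice, and the image contents here are made up so that each pixel stores its own flat index; `flatten_row_major` is a hypothetical helper.

```python
def flatten_row_major(image):
    """Concatenate the rows of a 2D nested list into one flat list."""
    return [pixel for row in image for pixel in row]


height, width = 28, 28
# Hypothetical image: the pixel at (row, col) holds its own flat index.
image = [[row * width + col for col in range(width)] for row in range(height)]

flat = flatten_row_major(image)
print(len(flat))                      # 784
print(flat[5 * width + 7] == image[5][7])
```

With this ordering, the pixel at `(row, col)` always ends up at position `row * 28 + col`, which is what "consistent between images" requires.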

Each image in MNIST has a corresponding label, a **number between 0 and 9** representing the digit drawn in the image.

For the purposes of this tutorial, we're going to want our labels as **"one-hot vectors"**. A one-hot vector is a vector which is 0 in most dimensions, and 1 in a single dimension. In this case, **the nth digit will be represented as a vector which is 1 in the nth dimension**. For example, 3 would be [0,0,0,1,0,0,0,0,0,0]. Consequently, `mnist.train.labels` is a `[55000, 10]` array of floats.
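
A one-hot encoding is easy to write by hand. The sketch below is a pure-Python illustration; `one_hot` and `from_one_hot` are hypothetical helpers, whereas the tutorial itself gets the encoding from `one_hot=True` when loading the data.

```python
def one_hot(digit, num_classes=10):
    """Return a vector that is 1.0 at index `digit` and 0.0 elsewhere."""
    if not 0 <= digit < num_classes:
        raise ValueError("digit must be in [0, num_classes)")
    vector = [0.0] * num_classes
    vector[digit] = 1.0
    return vector


def from_one_hot(vector):
    """Recover the digit as the index of the largest entry."""
    return vector.index(max(vector))


encoded = one_hot(3)
print(encoded)
print(from_one_hot(encoded))   # 3
```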

### Implementing the Regression

In [None]:
import tensorflow as tf

In [None]:
x = tf.placeholder(tf.float32, [None, 784])

In [None]:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

In [None]:
y = tf.nn.softmax(tf.matmul(x, W) + b)
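
To show what the softmax in the cell above does to a single row of scores, here is a minimal pure-Python sketch. The function `softmax` below is a hand-written illustration, not `tf.nn.softmax` itself; subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)                       # for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# With W and b initialized to zeros, every score is 0, so the untrained
# model assigns the same probability to each of the 10 classes.
uniform = softmax([0.0] * 10)
print(uniform[0])                             # approximately 0.1
```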

### Training

In [None]:
y_ = tf.placeholder(tf.float32, [None, 10])

In [None]:
y_

In [None]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

In [None]:
cross_entropy
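
The cross-entropy expression can be traced on a tiny example. This pure-Python sketch mirrors the formula (negative sum of `y_ * log(y)` per example, then the mean over the batch); the helper names are hypothetical. Unlike the TensorFlow expression, the sketch skips terms where the label is 0, so it never evaluates `log(0)` for a class that is not the true one.

```python
import math


def cross_entropy_single(true_one_hot, predicted_probs):
    """-sum(y_ * log(y)) for one example, skipping zero labels."""
    return -sum(t * math.log(p)
                for t, p in zip(true_one_hot, predicted_probs) if t > 0)


def mean_cross_entropy(labels, predictions):
    """Average the per-example cross-entropy over a batch."""
    losses = [cross_entropy_single(t, p) for t, p in zip(labels, predictions)]
    return sum(losses) / len(losses)


label_3 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

# An untrained model that predicts 0.1 for every class has loss log(10).
uniform_loss = mean_cross_entropy([label_3], [[0.1] * 10])
print(uniform_loss)            # approximately 2.3026

# A confident, correct prediction gives a much smaller loss.
confident = [0.01] * 10
confident[3] = 0.91
confident_loss = mean_cross_entropy([label_3], [confident])
print(confident_loss)
```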

In [None]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

In [None]:
sess = tf.InteractiveSession()

In [None]:
tf.global_variables_initializer().run()

In [None]:
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

### Evaluating Our Model

In [None]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_, 1))

In [None]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

In [None]:
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

# Example: Deep Learning

TensorFlow is a powerful library for doing large-scale numerical computation. One of the tasks at which it excels is **implementing and training deep neural networks**. We will now learn the basic building blocks of a TensorFlow model while constructing a deep convolutional MNIST classifier.

The first part builds a **softmax regression model**, which is a basic implementation of a TensorFlow model. The second part shows some ways to improve the accuracy.

We will take the following steps:

- Create a softmax regression function that is a model for recognizing MNIST digits, based on looking at every pixel in the image
- Use TensorFlow to train the model to recognize digits by having it "look" at thousands of examples (and run our first TensorFlow session to do so)
- Check the model's accuracy with our test data
- Build, train, and test a multilayer convolutional neural network to improve the results


Before we create our model, we will first load the MNIST dataset, and start a TensorFlow session.

### Load MNIST Data

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

**Here `mnist` is a lightweight class which stores the training, validation, and testing sets as NumPy arrays.** It also provides a function for iterating through data minibatches, which we will use below.

### Start TensorFlow InteractiveSession

TensorFlow relies on a highly efficient C++ backend to do its computation. **The connection to this backend is called a session**. The common usage for TensorFlow programs is to first create a computation graph and then launch it in a session.

Here we instead use the convenient InteractiveSession class, which makes TensorFlow more flexible about how you structure your code. **It allows you to interleave operations which build a computation graph with ones that run the graph.** This is particularly convenient when working in interactive contexts like IPython. If you are not using an InteractiveSession, then you should build the entire computation graph before starting a session and launching the graph.

In [None]:
import tensorflow as tf
sess = tf.InteractiveSession()

### Computation Graph

To do efficient numerical computing in Python, we typically use libraries like NumPy that do expensive operations such as matrix multiplication outside Python, using highly efficient code implemented in another language. **Unfortunately, there can still be a lot of overhead from switching back to Python every operation. This overhead is especially bad if you want to run computations on GPUs or in a distributed manner, where there can be a high cost to transferring data.**

TensorFlow also does its heavy lifting outside Python, but it takes things a step further to avoid this overhead. Instead of running a single expensive operation independently from Python, **TensorFlow lets us describe a graph of interacting operations that run entirely outside Python. This approach is similar to that used in other deep learning frameworks like Theano or Torch.**

The role of the Python code is therefore to build this external computation graph, and to dictate which parts of the computation graph should be run.

### Build a Softmax Regression Model

We will first build a **softmax regression model with a single linear layer**. Later, we will extend this to the case of softmax regression with a multilayer convolutional network.

#### Placeholders

We start building the computation graph by creating nodes for the input images and target output classes.

In [None]:
# Placeholders
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])

Here `x` and `y_` aren't specific values. Rather, they are each **a placeholder -- a value that we'll input when we ask TensorFlow to run a computation.**

The input images `x` will consist of a 2d tensor of floating point numbers. Here we assign it a shape of `[None, 784]`, where 784 is the dimensionality of a single flattened 28 by 28 pixel MNIST image, and **`None` indicates that the first dimension, corresponding to the batch size, can be of any size.** The target output classes `y_` will also consist of a 2d tensor, where each row is a one-hot 10-dimensional vector indicating which digit class (zero through nine) the corresponding MNIST image belongs to.

The shape argument to placeholder is optional, but **it allows TensorFlow to automatically catch bugs stemming from inconsistent tensor shapes**. We recommend specifying the shape, as most bugs in deep learning code originate from mistakes in the shapes of your tensors.

#### Variables

We now define the weights `W` and biases `b` for our model. We could imagine treating these like additional inputs, but TensorFlow has an even better way to handle them: `Variable`. A `Variable` is a value that lives in TensorFlow's computation graph. It can be used and even modified by the computation. In machine learning applications, you will generally define the model parameters as TensorFlow `Variable`s.

In [None]:
# Variables
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

We pass the initial value for each parameter in the call to `tf.Variable`. In this case, we initialize both `W` and `b` as tensors full of zeros. `W` is a 784x10 matrix (because we have 784 input features and 10 outputs) and `b` is a 10-dimensional vector (because we have 10 classes).

Before `Variables` can be used within a session, they must be initialized using that session. This step takes the initial values (in this case tensors full of zeros) that have already been specified, and assigns them to each Variable. This can be done for all `Variables` at once:

In [None]:
sess.run(tf.global_variables_initializer())

### Predicted Class and Loss Function

We can now implement our regression model. It only takes one line! We multiply the vectorized input images `x` by the weight matrix `W` and add the bias `b`.

In [None]:
y = tf.matmul(x, W) + b

We can specify a loss function just as easily. **Loss indicates how bad the model's prediction was on a single example;** we try to minimize that while training across all the examples. Here, **our loss function is the cross-entropy between the target and the softmax activation function applied to the model's prediction:**

In [None]:
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

Note that `tf.nn.softmax_cross_entropy_with_logits` internally applies the softmax to the model's unnormalized prediction (the logits) and sums across all classes, and `tf.reduce_mean` takes the average over these sums.
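
One reason to feed raw scores into a fused softmax and cross-entropy step is numerical stability. The sketch below shows one common way to do this in plain Python, using the log-sum-exp trick; it is an illustration under that assumption, not TensorFlow's actual implementation, and `softmax_cross_entropy_py` is a hypothetical name.

```python
import math


def softmax_cross_entropy_py(labels, logits):
    """Cross-entropy between labels and softmax(logits) for one example."""
    # log(softmax(logits)[i]) == logits[i] - logsumexp(logits)
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return -sum(t * (v - log_sum_exp) for t, v in zip(labels, logits))


# Equal scores over three classes: the loss is log(3).
print(softmax_cross_entropy_py([0, 1, 0], [0.0, 0.0, 0.0]))

# Very large scores are handled without overflow; a naive math.exp(1000.0)
# would raise OverflowError.
print(softmax_cross_entropy_py([1, 0, 0], [1000.0, 0.0, 0.0]))
```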

### Train the Model

Now that we have defined our model and training loss function, it is straightforward to train using TensorFlow. **Because TensorFlow knows the entire computation graph, it can use automatic differentiation to find the gradients of the loss with respect to each of the variables.** TensorFlow has a variety of built-in optimization algorithms. For this example, we will use steepest gradient descent, with a step length of 0.5, to descend the cross entropy.

In [None]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

What TensorFlow actually did in that single line was to **add new operations to the computation graph**. These operations included ones to **compute gradients, compute parameter update steps, and apply update steps to the parameters.**

The returned operation `train_step`, when run, will **apply the gradient descent updates to the parameters**. Training the model can therefore be accomplished by repeatedly running `train_step`.

In [None]:
for _ in range(1000):
    batch = mnist.train.next_batch(100)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})

We load 100 training examples in each training iteration. We then run the `train_step` operation, using `feed_dict` to replace the placeholder tensors `x` and `y_` with the training examples. **Note that you can replace any tensor in your computation graph using `feed_dict` -- it's not restricted to just placeholders.**

### Evaluate the Model

How well did our model do?

First we'll figure out where we predicted the correct label. `tf.argmax` is an extremely useful function which gives you the index of the highest entry in a tensor along some axis. For example, `tf.argmax(y,1)` is the label our model thinks is most likely for each input, while `tf.argmax(y_,1)` is the true label. We can use `tf.equal` to check if our prediction matches the truth.

In [None]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

That gives us a list of booleans. To determine what fraction are correct, we cast to floating point numbers and then take the mean. For example, `[True, False, True, True]` would become `[1,0,1,1]` which would become `0.75`.

In [None]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
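
The same accuracy calculation can be followed in plain Python on a handful of made-up predictions. The helpers `argmax` and `accuracy_py` are hypothetical and only mirror the roles of `tf.argmax`, `tf.equal`, `tf.cast` and `tf.reduce_mean`.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_py(predictions, labels):
    """Fraction of rows whose predicted class matches the true class."""
    correct = [argmax(p) == argmax(t) for p, t in zip(predictions, labels)]
    # Cast booleans to floats (True -> 1.0, False -> 0.0) and take the mean.
    return sum(1.0 if c else 0.0 for c in correct) / len(correct)


# Made-up scores for four examples over three classes.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
labels      = [[1, 0, 0],       [0, 1, 0],       [0, 1, 0],       [0, 0, 1]]

print(accuracy_py(predictions, labels))   # 0.75
```

Three of the four made-up rows are classified correctly, which corresponds to the `[True, False, True, True]` example in the text.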

Finally, we can evaluate our accuracy on the test data. This should be about 92% correct.

In [None]:
print(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

## Build a Multilayer Convolutional Network

Getting 92% accuracy on MNIST is bad. We'll fix that, **jumping from a very simple model to something moderately sophisticated: a small convolutional neural network**. This will get us to around 99.2% accuracy.

Here is a diagram, created with TensorBoard, of the model we will build:

In [None]:
Image(filename="figures/mnist_deep.png", width=400)

### Weight Initialization

To create this model, we're going to need to create a lot of weights and biases. **One should generally initialize weights with a small amount of noise for symmetry breaking, and to prevent 0 gradients.** Since we're using ReLU neurons, it is also good practice to initialize them with a slightly positive initial bias to avoid "dead neurons". Instead of doing this repeatedly while we build the model, let's create two handy functions to do it for us.

In [None]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
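
`tf.truncated_normal` draws from a normal distribution but re-picks any value more than two standard deviations from the mean. The following pure-Python sketch imitates that behavior with rejection sampling, to show what "a small amount of noise" looks like; `truncated_normal_py` is a hypothetical helper and the seed is arbitrary.

```python
import random


def truncated_normal_py(count, stddev=0.1, mean=0.0, rng=random):
    """Draw `count` normal samples, re-picking any beyond 2 stddev."""
    values = []
    while len(values) < count:
        sample = rng.gauss(mean, stddev)
        if abs(sample - mean) <= 2 * stddev:
            values.append(sample)
    return values


rng = random.Random(0)                      # fixed seed for repeatability
weights = truncated_normal_py(1000, stddev=0.1, rng=rng)

print(min(weights), max(weights))           # both within [-0.2, 0.2]
print(len(set(weights)) > 1)                # values differ: symmetry is broken
```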

### Convolution and Pooling

TensorFlow also gives us a lot of flexibility in convolution and pooling operations. How do we handle the boundaries? What is our stride size? In this example, we're always going to choose the vanilla version. **Our convolutions use a stride of one and are zero padded so that the output is the same size as the input. Our pooling is plain old max pooling over 2x2 blocks.** To keep our code cleaner, let's also abstract those operations into functions.

In [None]:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

### First Convolutional Layer

We can now implement our first layer. It will consist of a **convolution, followed by max pooling**. The convolution will compute 32 features for each 5x5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions are the patch size, the next is the number of input channels, and the last is the number of output channels. We will also have a bias vector with a component for each output channel.

In [None]:
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

To apply the layer, we **first reshape `x` to a 4d tensor**, with the second and third dimensions corresponding to image height and width, and the final dimension corresponding to the number of color channels.

In [None]:
x_image = tf.reshape(x, [-1, 28, 28, 1])

We then **convolve `x_image` with the weight tensor, add the bias, apply the ReLU function, and finally max pool.** The `max_pool_2x2` function will reduce the image size to 14x14.

In [None]:
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

### Second Convolutional Layer

In order to build a deep network, we **stack several layers of this type**. The second layer will have 64 features for each 5x5 patch.

In [None]:
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

### Densely Connected Layer

Now that the image size has been reduced to 7x7, we add a fully-connected layer with 1024 neurons to allow processing on the entire image. We reshape the tensor from the pooling layer into a batch of vectors, multiply by a weight matrix, add a bias, and apply a ReLU.

In [None]:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
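
The sizes 14x14, 7x7 and `7 * 7 * 64` follow from the padding rule: with `'SAME'` padding, each spatial dimension of the output is the input size divided by the stride, rounded up. Here is a small sketch of that arithmetic; `same_padding_output_size` is a hypothetical helper.

```python
import math


def same_padding_output_size(input_size, stride):
    """Output length along one dimension for 'SAME' padding."""
    return math.ceil(input_size / stride)


size = 28
size = same_padding_output_size(size, 1)    # conv1, stride 1
size = same_padding_output_size(size, 2)    # pool1, stride 2
after_pool1 = size
size = same_padding_output_size(size, 1)    # conv2, stride 1
size = same_padding_output_size(size, 2)    # pool2, stride 2
after_pool2 = size

flat_features = after_pool2 * after_pool2 * 64   # 64 output channels
print(after_pool1, after_pool2, flat_features)   # 14 7 3136
```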

### Dropout

To **reduce overfitting**, we will apply dropout before the readout layer. We create a placeholder for the probability that a neuron's output is kept during dropout. This allows us to **turn dropout on during training, and turn it off during testing**. TensorFlow's `tf.nn.dropout` op automatically handles scaling neuron outputs in addition to masking them, so dropout just works without any additional scaling.

Note: for this small convolutional network, performance is actually nearly identical with and without dropout. Dropout is often very effective at reducing overfitting, but it is most useful when training very large neural networks.

In [None]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
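
The masking and scaling that dropout performs can be sketched in a few lines of plain Python. Each value is kept with probability `keep_prob` and scaled by `1 / keep_prob`, otherwise set to zero, so the expected output equals the input. `dropout_py` is a hypothetical helper written for illustration, and the seed is arbitrary.

```python
import random


def dropout_py(activations, keep_prob, rng=random):
    """Keep each value with probability keep_prob, scaled by 1/keep_prob."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)                       # fixed seed for repeatability
activations = [1.0] * 10000

# Training: about half of the values are zeroed, the rest are doubled.
dropped = dropout_py(activations, keep_prob=0.5, rng=rng)
print(sum(dropped) / len(dropped))           # close to 1.0 on average

# Testing: keep_prob of 1.0 leaves the activations unchanged.
print(dropout_py([1.0, 2.0, 3.0], keep_prob=1.0, rng=rng))
```

Feeding `keep_prob: 1.0` at evaluation time therefore turns dropout into a no-op, which is how the placeholder switches it off.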

### Readout Layer

Finally, we add a layer, just like for the one layer softmax regression above.

In [None]:
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
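
To get a feel for the size of this network, the trainable parameters can be counted from the variable shapes used above. This is a plain-Python tally; `num_elements` is a hypothetical helper, and the shapes are copied from the cells in this section.

```python
def num_elements(shape):
    """Number of entries in a tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


variable_shapes = {
    "W_conv1": [5, 5, 1, 32],
    "b_conv1": [32],
    "W_conv2": [5, 5, 32, 64],
    "b_conv2": [64],
    "W_fc1": [7 * 7 * 64, 1024],
    "b_fc1": [1024],
    "W_fc2": [1024, 10],
    "b_fc2": [10],
}

for name, shape in variable_shapes.items():
    print(name, num_elements(shape))

total_parameters = sum(num_elements(s) for s in variable_shapes.values())
print(total_parameters)    # 3274634
```

Almost all of the parameters sit in `W_fc1`, the weight matrix of the densely connected layer.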

### Train and Evaluate the Model

How well does this model do? To train and evaluate it, we will use code that is nearly identical to that for the simple one-layer softmax network above.

The differences are that:

- We will replace the steepest gradient descent optimizer with the more sophisticated ADAM optimizer.
- We will include the additional parameter `keep_prob` in `feed_dict` to control the dropout rate.
- We will add logging to every 100th iteration in the training process.

We will also use `tf.Session` rather than `tf.InteractiveSession`. This better separates the process of creating the graph (model specification) and the process of evaluating the graph (model fitting). It generally makes for cleaner code. The `tf.Session` is created within a `with` block so that it is automatically closed once the block is exited.

**Be aware that the full version of the following code runs 20,000 training iterations and may take a while (possibly up to half an hour), depending on your processor.**

**To avoid having to wait, we run it here with only 1,000 training iterations, which results in a lower accuracy.**

In [None]:
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        batch = mnist.train.next_batch(50)
        if i % 100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
            print('step %d, training accuracy %g' % (i, train_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    print('test accuracy %g' % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

The final test set accuracy after running this code (**with 20,000 training iterations**) should be approximately **99.2%**.

We have learned how to quickly and easily build, train, and evaluate a fairly sophisticated deep learning model using TensorFlow.

##### MNIST TensorBoard

In [None]:
Image(filename="figures/mnist_tensorboard.png", width=600)

NOTE: A detailed explanation of the usage of TensorBoard is out of scope for this training session. Check out Google's tutorial on TensorBoard at https://www.tensorflow.org/get_started/summaries_and_tensorboard

Sources:
- https://www.tensorflow.org/get_started/
- https://www.tensorflow.org/programmers_guide/