<h2>Introduction to TensorFlow</h2>
<p>TensorFlow relies on a highly efficient C++ backend to do its computation. The connection to this backend is called a <b>Session</b>. The common usage for TensorFlow programs is to first create a <b>graph</b> and then launch it in a session.</p>

<p>TensorFlow represents computations as graphs. Nodes in the graph are called <b>ops</b> (short for operations). An op takes zero or more tensors, performs some computation, and produces zero or more tensors. A <b>Tensor</b> is a typed multi-dimensional array. For example, you can represent a mini-batch of images as a 4-D array of floating point numbers with dimensions [batch, height, width, channels].</p>

<p><b>Why use a graph?</b> Instead of running each expensive operation independently from Python, TensorFlow lets us define a graph of interacting operations that runs entirely outside Python, which avoids the overhead of switching back to Python after every operation. The role of the Python code is therefore to build this external computation graph and to dictate which parts of it should be run. This approach is similar to that used in Theano or Torch.</p>

<p>A TensorFlow graph is a description of computations. To compute anything, a graph must be launched in a Session. A Session places the graph ops onto <b>Devices</b>, such as CPUs or GPUs, and provides methods to execute them. These methods return tensors produced by ops as numpy ndarray objects.</p>
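<p>The build-then-run idea described above can be sketched in plain Python. This is only a toy illustration, not the TensorFlow API: the names <b>Node</b>, <b>constant</b>, <b>add</b>, <b>mul</b> and <b>run</b> are hypothetical helpers invented for this sketch.</p>

```python
# Toy deferred-execution graph in plain Python (NOT the TensorFlow API).
class Node(object):
    """A graph node: an operation plus the nodes that feed it."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs


def constant(value):
    # A constant is an op with no inputs that always yields the same value.
    return Node(lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, a, b)


def mul(a, b):
    return Node(lambda x, y: x * y, a, b)


def run(node):
    """Evaluate only the part of the graph that `node` depends on."""
    args = [run(dep) for dep in node.inputs]
    return node.fn(*args)


# Building the graph computes nothing yet; it only records the structure.
a = constant(3.0)
b = constant(2.0)
total = add(mul(a, b), mul(a, b))

# Values are produced only when the graph is run.
result = run(total)
print(result)  # 12.0
```

<p>As in a real session, asking for one node evaluates just the ops it depends on; the rest of the graph is left untouched.</p>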
<br>
<h3>TensorBoard</h3>
<p><b>Start TensorBoard</b>: tensorboard --logdir=path/to/logs<b> Or specify another port, such as 8008</b>: tensorboard --logdir=/tmp/mnist_logs_try --port=8008</p>
<p><b>Compare train &amp; test of your model:</b> output event summaries to /train and /test folders under path/to/logs</p>
<p>To get an updated graph representation you should stop tensorboard and jupyter, delete your tensorflow logdir, restart jupyter, run the script, and then restart tensorboard.</p>

In [1]:
import sys
sys.path = ['', '/Applications/Spyder-Py2.app/Contents/Resources', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python27.zip', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/plat-darwin', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/plat-mac', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/plat-mac/lib-scriptpackages', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/lib-tk', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/lib-old', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/lib-dynload', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/site-packages.zip', '/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/site-packages', '/Library/Python/2.7/site-packages/']

In [92]:
import tensorflow as tf
import numpy as np

### Simple Example1 - print out a constant

In [125]:
hello = tf.constant(12345) 
sess = tf.Session() # launch graph
print(sess.run(hello)) 
sess.close() 

12345


### Simple Example2 - matrix multiplication

In [89]:
# Create a Constant op that produces a 1x2 matrix.  The op is
# added as a node to the default graph.
matrix1 = tf.constant([[3., 3.]])

# Create another Constant that produces a 2x1 matrix.
matrix2 = tf.constant([[2.],[2.]])

# Create a Matmul op that takes 'matrix1' and 'matrix2' as inputs.
product = tf.matmul(matrix1, matrix2)

# Use 'with' to launch the default graph in a session that is closed automatically afterwards, releasing its CPU/GPU resources.
# 'run' executes three ops in the graph: the two constants and the matmul.
with tf.Session() as sess:
    result = sess.run(product)
    print(result)

[[ 12.]]
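
The result above can be checked by hand. The following is a minimal plain-Python sketch of the same 1x2 by 2x1 product; the helper `matmul` here is a hypothetical stand-in written for this check, not TensorFlow's `tf.matmul`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


matrix1 = [[3.0, 3.0]]      # shape 1x2
matrix2 = [[2.0], [2.0]]    # shape 2x1
product = matmul(matrix1, matrix2)
print(product)  # [[12.0]]
```

The single entry is 3*2 + 3*2 = 12, matching the session output.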


### Simple Example3 - fit a line to some data

In [132]:
tf.reset_default_graph()

with tf.name_scope('input'):
    # Create 100 phony x, y data points in NumPy, y = x * 0.1 + 0.3
    x_data = np.random.rand(100).astype(np.float32)
    y_data = x_data * 0.1 + 0.3

with tf.name_scope('parameters'):
    # Try to find values for W and b that compute y_data = W * x_data + b
    # (We know that W should be 0.1 and b 0.3, but TensorFlow will
    # figure that out for us.)
    W = tf.Variable(tf.random_uniform([1], -1.0, 1.0), name="weights")
    b = tf.Variable(tf.zeros([1]), name="bias")

with tf.name_scope('estimated_y'):
    y = W * x_data + b

# Minimize the mean squared error (MSE).
with tf.name_scope('MSE_loss'):
    loss = tf.reduce_mean(tf.square(y - y_data))
    # create a summary for loss
    tf.scalar_summary("loss", loss)
    tf.histogram_summary("loss_hist", loss)
with tf.name_scope('train'):
    train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss) # train operation

# Before starting, initialize the variables.  We will 'run' this first.
init = tf.initialize_all_variables()

# Launch the graph.
with tf.Session() as sess:
    sess.run(init)

    # Instead of executing every summary operation individually, merge them all into a single summary operation.
    summary_op = tf.merge_all_summaries()
    # TensorFlow summaries are essentially logs, and writing them requires a log writer: a SummaryWriter.
    writer = tf.train.SummaryWriter("/tmp/tensorboard_example3", graph=tf.get_default_graph())

    # Fit the line.
    for step in range(201):
        # run one training step and evaluate the merged summaries in the same call
        _, summary = sess.run([train_op, summary_op])

        # write log
        writer.add_summary(summary, step)

        if step % 50 == 0:
            print('step: %s  w: %s  b: %s' % (step, sess.run(W), sess.run(b)))
# Learns best fit is W: [0.1], b: [0.3]

step: 0  w: [ 0.29914504]  b: [ 0.2643007]
step: 50  w: [ 0.10568427]  b: [ 0.29688805]
step: 100  w: [ 0.10019204]  b: [ 0.29989487]
step: 150  w: [ 0.10000648]  b: [ 0.29999647]
step: 200  w: [ 0.10000023]  b: [ 0.29999989]
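
What the optimizer does in this example can be sketched without TensorFlow. The snippet below is a hand-written gradient descent on the same mean squared error, assuming the same learning rate (0.5) and step count (201); the random seed and the explicit gradient formulas are choices made for this sketch, not taken from the notebook.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable (an assumption)

# 100 phony data points on the line y = 0.1 * x + 0.3
x_data = [rng.random() for _ in range(100)]
y_data = [0.1 * x + 0.3 for x in x_data]

w = rng.uniform(-1.0, 1.0)  # same initial range as tf.random_uniform([1], -1.0, 1.0)
b = 0.0
learning_rate = 0.5
n = float(len(x_data))

for step in range(201):
    # residuals of the current line
    errors = [w * x + b - y for x, y in zip(x_data, y_data)]
    # gradients of mean(errors ** 2) with respect to w and b
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, x_data)) / n
    grad_b = 2.0 * sum(errors) / n
    # plain gradient descent update
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```

Because the data are noise-free, the fitted values should approach 0.1 and 0.3, as in the TensorFlow run above.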


### Example4 - Softmax regression model with a single linear layer on MNIST

In [5]:
# download and read data
from tensorflow.examples.tutorials.mnist import input_data
# mnist is an object that stores the training, validation, and testing sets as NumPy arrays
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [11]:
# use InteractiveSession to allow interactive operations in IPython
sess = tf.InteractiveSession()

In [12]:
'''
Start building the computation graph by creating nodes for the input images and target output classes.

Here x and y_ aren't specific values. Rather, they are each a placeholder -- a value that we'll input when we ask TensorFlow to run a computation.

The input images x will consist of a 2d tensor of floating point numbers. Here we assign it a shape of [None, 784], 
where 784 is the dimensionality of a single flattened 28 by 28 pixel MNIST image, and None indicates that the first dimension, 
corresponding to the batch size, can be of any size. The target output classes y_ will also consist of a 2d tensor, where each 
row is a one-hot 10-dimensional vector indicating which digit class (zero through nine) the corresponding MNIST image belongs to.
'''
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
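
To make the two shapes concrete, here is a small plain-Python sketch of what one row of x and one row of y_ look like. The helper `one_hot` and the all-zero image are hypothetical stand-ins for real MNIST data.

```python
def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row


# A blank 28x28 "image", flattened row by row into 784 values.
image = [[0.0] * 28 for _ in range(28)]
flat_image = [pixel for row in image for pixel in row]

# A batch of three labels becomes a 3x10 matrix of one-hot rows.
labels = [3, 0, 9]
y_batch = [one_hot(label) for label in labels]

print(len(flat_image))  # 784
print(y_batch[0])
```

A batch fed to the placeholders is then a list of such rows: any number of them, which is what the None in the shape allows.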

In [13]:
# define model parameters as Variables, will be used in TensorFlow's computation graph
# W is a 784x10 matrix (because we have 784 input features and 10 outputs)
# b is a 10-dimensional vector (because we have 10 classes)
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))

In [14]:
# Before Variables can be used within a session, they must be initialized using that session. 
# This can be done for all Variables at once
sess.run(tf.initialize_all_variables())

In [15]:
# calculate the unnormalized logits
y = tf.matmul(x,W) + b

In [16]:
# The manual formula below assumes y already holds softmax probabilities.
# First, tf.log computes the logarithm of each element of y. Next, we multiply each element of y_ with the corresponding
# element of tf.log(y). Then tf.reduce_sum adds the products along the second dimension (the classes), due to the
# reduction_indices=[1] parameter. Finally, tf.reduce_mean computes the mean over all the examples in the batch.
# This formula is numerically unstable in practice (tf.log(0) is -inf), so it is left commented out:

# cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

# tf.nn.softmax_cross_entropy_with_logits instead applies the softmax to the model's unnormalized predictions (logits)
# internally, computes the cross-entropy loss, and sums it across all classes.
# tf.reduce_mean then averages these sums over the batch, which gives a single number.
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, y_))
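
The stability issue mentioned above can be illustrated in plain Python. The sketch below compares a naive softmax-then-log computation with one common stable formulation, the log-sum-exp trick. It is an illustration only: both function names are hypothetical, and no claim is made that TensorFlow's kernel is implemented exactly this way.

```python
import math


def naive_cross_entropy(logits, one_hot_label):
    """Softmax followed by log; math.exp overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs) if t > 0)


def stable_cross_entropy(logits, one_hot_label):
    """Same quantity computed from logits via the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # log(softmax(z)_i) = z_i - log_sum_exp, so no probability is ever formed
    return -sum(t * (z - log_sum_exp) for t, z in zip(one_hot_label, logits))


moderate = [2.0, 1.0, 0.1]
label = [1.0, 0.0, 0.0]
print(naive_cross_entropy(moderate, label))
print(stable_cross_entropy(moderate, label))

# With extreme logits the naive version fails, the stable one does not.
extreme = [1000.0, 0.0]
try:
    naive_cross_entropy(extreme, [1.0, 0.0])
    naive_failed = False
except OverflowError:
    naive_failed = True
print(naive_failed, stable_cross_entropy(extreme, [1.0, 0.0]))
```

On moderate logits the two versions agree to floating-point precision; only the stable one survives logits around 1000.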

In [17]:
# Use gradient descent with learning rate = 0.5, to decrease cross-entropy loss.
# It adds new operations to computation graph to compute gradients, compute parameter update steps, and apply update 
# steps to the parameters.
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

In [20]:
# The returned train_step, when run, will apply the gradient descent updates to the parameters. 
# Training the model can therefore be accomplished by repeatedly running train_step.

# Train for 1000 steps with mini-batch size = 100 (each iteration processes one mini-batch, not a full epoch).
# Use feed_dict to replace the placeholder tensors x and y_ with the training examples. Note that you can replace
# any tensor in your computation graph using feed_dict; it is not restricted to just placeholders.
for i in range(1000):
    batch = mnist.train.next_batch(100) # batch is a tuple (mini_train, mini_label)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]}) # feed_dict: maps data onto Tensor placeholders
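
How much data this loop actually sees can be worked out with a little arithmetic. The sketch assumes the training split holds 55,000 examples, which is the usual size of mnist.train when 5,000 images are held out for validation; treat that number as an assumption rather than something the notebook states.

```python
train_examples = 55000   # assumed size of mnist.train (not stated in the notebook)
batch_size = 100
steps = 1000

# One epoch is one full pass over the training data.
steps_per_epoch = train_examples // batch_size
epochs_seen = steps * batch_size / float(train_examples)

print(steps_per_epoch)        # 550
print(round(epochs_seen, 2))  # 1.82
```

Under that assumption, 1000 iterations amount to fewer than two passes over the training set.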

In [21]:
# Figure out where we predicted the correct label. 
# tf.argmax gives the index of the highest entry in a tensor along some axis. 
# e.g. tf.argmax(y,1) is the predicted label for each input instance, tf.argmax(y_,1) is the true label
# tf.equal checks if our prediction matches the truth and gives boolean results: [True, False, True, True]
correct_predictions = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

# Cast the booleans to floats and then take the mean; the second parameter of tf.cast specifies the new data type.
# For example, [True, False, True, True] would become [1,0,1,1], which has mean=0.75.
accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))

# Evaluate our accuracy on the test data
print(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels})) # feed_dict: maps data onto Tensor placeholders

0.9245
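
The argmax, equal, cast and mean steps above can be traced on a tiny example in plain Python. The four predictions and labels below are made up for illustration, and `argmax` is a hypothetical helper, not `tf.argmax`.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=lambda i: row[i])


# Four made-up predictions over three classes, with one-hot true labels.
predictions = [[0.1, 0.8, 0.1],
               [0.6, 0.3, 0.1],
               [0.2, 0.2, 0.6],
               [0.3, 0.4, 0.3]]
labels = [[0, 1, 0],
          [0, 0, 1],
          [0, 0, 1],
          [0, 1, 0]]

# Compare predicted class to true class for each row.
correct = [argmax(p) == argmax(t) for p, t in zip(predictions, labels)]
# Cast booleans to floats and average them.
accuracy_value = sum(float(c) for c in correct) / len(correct)

print(correct)         # [True, False, True, True]
print(accuracy_value)  # 0.75
```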


### Example5 - Softmax on MNIST with Tensorboard Visualization

In [140]:
# reset everything to rerun in jupyter
tf.reset_default_graph()

# config
batch_size = 100
learning_rate = 0.5
training_epochs = 1000  # number of training steps (one mini-batch each), not full passes over the data
logs_path = "/tmp/mnist_softmax"

# different summaries for a single variable
def variable_summaries(var, name):
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.scalar_summary('mean/' + name, mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.scalar_summary('stddev/' + name, stddev)
        tf.scalar_summary('max/' + name, tf.reduce_max(var))
        tf.scalar_summary('min/' + name, tf.reduce_min(var))
        tf.histogram_summary('histogram/' + name, var)

# load mnist data set
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

# input images
with tf.name_scope('input'):
    # None -> batch size can be any size, 784 -> flattened mnist image
    x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input") 
    # target 10 output classes
    y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")

with tf.name_scope("parameters"):
    # model parameters will change during training so we use tf.Variable
    with tf.name_scope("weights"):
        W = tf.Variable(tf.zeros([784, 10]))
        variable_summaries(W, 'weights')
    with tf.name_scope("biases"):
        b = tf.Variable(tf.zeros([10]))
        variable_summaries(b, 'bias')

# implement model
with tf.name_scope("softmax"):
    # y is our prediction
    y = tf.nn.softmax(tf.matmul(x,W) + b)

# specify cost function
with tf.name_scope('cross_entropy'):
    # this is our cost
    cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
    tf.scalar_summary("cost", cross_entropy)
    
# specify optimizer
with tf.name_scope('train'):
    # optimizer is an "operation" which we can execute in a session
    train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)

with tf.name_scope('Accuracy'):
    with tf.name_scope('correct_prediction'):
        correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
    with tf.name_scope('accuracy'):
        accuracy_op = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        tf.scalar_summary("accuracy", accuracy_op)

# merge all summaries into a single "operation" which we can execute in a session 
summary_op = tf.merge_all_summaries()

with tf.Session() as sess:
    # create a log folder and save the graph structure, do this before training
    #writer = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())
    train_writer = tf.train.SummaryWriter(logs_path + '/train',graph=tf.get_default_graph())
    test_writer = tf.train.SummaryWriter(logs_path + '/test')
    
    # variables need to be initialized before we can use them
    sess.run(tf.initialize_all_variables())

    # perform training cycles
    for epoch in range(training_epochs):
        batch_x, batch_y = mnist.train.next_batch(batch_size)

        # write training summaries at every step
        summary, _ = sess.run([summary_op, train_op], feed_dict={x: batch_x, y_: batch_y})
        train_writer.add_summary(summary, epoch)

        if epoch % 100 == 0:
            # write testing summaries every 100 steps
            summary, acc = sess.run([summary_op, accuracy_op], feed_dict={x: mnist.test.images, y_: mnist.test.labels})
            test_writer.add_summary(summary, epoch)
            print('Accuracy at epoch %s: %s' % (epoch, acc))
    
    print("done")

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Accuracy at epoch 0: 0.4075
Accuracy at epoch 100: 0.8948
Accuracy at epoch 200: 0.9031
Accuracy at epoch 300: 0.9074
Accuracy at epoch 400: 0.9037
Accuracy at epoch 500: 0.9125
Accuracy at epoch 600: 0.9148
Accuracy at epoch 700: 0.9181
Accuracy at epoch 800: 0.9166
Accuracy at epoch 900: 0.9154
done
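
The scalar statistics that variable_summaries logs for each variable (mean, standard deviation, max, min) can be reproduced by hand. Below is a minimal plain-Python sketch; the function `summarize` and the sample values are hypothetical, and the standard deviation uses the same population formula as the notebook, sqrt(mean((var - mean)^2)).

```python
import math


def summarize(values):
    """Compute the scalar statistics logged for one variable."""
    n = float(len(values))
    mean = sum(values) / n
    # population standard deviation, as in variable_summaries
    stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return {"mean": mean, "stddev": stddev,
            "max": max(values), "min": min(values)}


stats = summarize([1.0, 2.0, 3.0, 4.0])
print(stats["mean"], stats["stddev"], stats["max"], stats["min"])
```

In TensorBoard each of these appears as its own curve over training steps, one point per add_summary call.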


### Example6 - ConvNet on MNIST

#### the next block is the same as the first three steps from softmax

In [None]:
# download and read data
from tensorflow.examples.tutorials.mnist import input_data
# mnist is an object that stores the training, validation, and testing sets as NumPy arrays
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

# use InteractiveSession to allow interactive operations in IPython
sess = tf.InteractiveSession()

'''
Start building the computation graph by creating nodes for the input images and target output classes.

Here x and y_ aren't specific values. Rather, they are each a placeholder -- a value that we'll input when we ask TensorFlow to run a computation.

The input images x will consist of a 2d tensor of floating point numbers. Here we assign it a shape of [None, 784], 
where 784 is the dimensionality of a single flattened 28 by 28 pixel MNIST image, and None indicates that the first dimension, 
corresponding to the batch size, can be of any size. The target output classes y_ will also consist of a 2d tensor, where each 
row is a one-hot 10-dimensional vector indicating which digit class (zero through nine) the corresponding MNIST image belongs to.
'''
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])

In [56]:
# Initialize weights and biases
def weight_variable(shape):
    # tf.truncated_normal returns random values from a normal distribution, re-drawing any value more than 2 std from the mean
    # shape must be a 1-D integer array
    initial = tf.truncated_normal(shape, stddev=0.1) 
    return tf.Variable(initial)

def bias_variable(shape):
    # Since we use ReLU, we will initialize them with a slightly positive initial bias to avoid "dead neurons"
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
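
The truncation used for the weights can be sketched with rejection sampling in plain Python. The function below is a hypothetical stand-in for tf.truncated_normal that produces a flat list instead of a tensor; the seed and sample count are arbitrary choices for this sketch.

```python
import random


def truncated_normal(count, stddev, rng, mean=0.0):
    """Draw `count` normal samples, re-drawing any that fall more than
    two standard deviations from the mean."""
    values = []
    while len(values) < count:
        v = rng.gauss(mean, stddev)
        if abs(v - mean) <= 2.0 * stddev:
            values.append(v)
    return values


rng = random.Random(0)
weights = truncated_normal(1000, stddev=0.1, rng=rng)
print(min(weights), max(weights))
```

With stddev=0.1, every initial weight therefore lies in the interval [-0.2, 0.2].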

In [29]:
# Functions for convolution and pooling

# Convolutions use a stride of one and are zero padded so that the output is a tensor with the same size as the input.
# x is the input: must be of shape [batch, in_height, in_width, in_channels]
# W is the filter/kernel: must be of shape [filter_height, filter_width, in_channels, out_channels]
# strides: a 1-D list of 4 ints, one per input dimension
# padding: a string, either "SAME" or "VALID", selecting the padding algorithm to use
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

# Max pooling over 2x2 blocks on the input
# x: A 4-D Tensor with shape [batch, height, width, channels] and type tf.float32
# ksize: A list of ints that has length >= 4. The size of the window for each dimension of the input tensor. In this case
# it is 2*2 over the height & width of each input instance
# strides: A list of ints that has length >= 4. The stride of the sliding window for each dimension of the input tensor. In this
# case the window moves 2 pixels at a time, so each non-overlapping 2*2 block yields one output pixel
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')
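
The spatial sizes that these two helpers produce can be computed ahead of time. The sketch below uses the standard output-size rules for 'SAME' and 'VALID' padding; the function names are hypothetical, and the 64 is the number of filters used by the second convolutional layer of this network.

```python
import math


def same_output_size(input_size, stride):
    """Output length along one dimension with 'SAME' padding."""
    return int(math.ceil(input_size / float(stride)))


def valid_output_size(input_size, window, stride):
    """Output length along one dimension with 'VALID' (no) padding."""
    return int(math.ceil((input_size - window + 1) / float(stride)))


size = 28
size = same_output_size(size, stride=1)  # conv1 keeps 28
size = same_output_size(size, stride=2)  # pool1 halves it to 14
size = same_output_size(size, stride=1)  # conv2 keeps 14
size = same_output_size(size, stride=2)  # pool2 halves it to 7
flat_features = size * size * 64         # 64 feature maps after conv2

print(size, flat_features)           # 7 3136
print(valid_output_size(28, 5, 1))   # 24
```

This is where the 7 * 7 * 64 input size of the fully connected layer comes from; with 'VALID' padding the first 5x5 convolution alone would already shrink 28 to 24.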

In [65]:
# First convolutional layer, which is the result of convolution, followed by max pooling
# [5, 5, 1, 32] indicates there are 32 filters with patch size 5*5 and input channel 1
# therefore the output channel will be 32
W_conv1 = weight_variable([5, 5, 1, 32])
# have a bias term for each filter
b_conv1 = bias_variable([32])

# To apply the layer, we first reshape x to a 4d tensor
# The first dimension is -1, which means the batch size is inferred from the other dimensions
# so that the total size remains constant
# The second and third dimensions correspond to the image size 28 * 28
# The final dimension corresponds to the number of color channels.
x_image = tf.reshape(x, [-1,28,28,1])

# convolve x_image with the weight tensor, add the bias, apply the ReLU function, and finally max pool
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) # h means hidden
h_pool1 = max_pool_2x2(h_conv1) # now activation map is 14 * 14 for each slice, and the whole tensor is batch_size*14*14*32
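
How a -1 in a reshape is resolved can be sketched in plain Python. The helper `infer_shape` is hypothetical and works on element counts only; it mirrors the rule that the total number of elements must stay constant.

```python
def infer_shape(total_elements, shape):
    """Replace a single -1 in `shape` so the element count is preserved."""
    if shape.count(-1) > 1:
        raise ValueError("at most one dimension can be -1")
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if total_elements % known != 0:
        raise ValueError("shape does not divide the number of elements")
    return [total_elements // known if dim == -1 else dim for dim in shape]


# A batch of 50 flattened images holds 50 * 784 values.
print(infer_shape(50 * 784, [-1, 28, 28, 1]))         # [50, 28, 28, 1]
# The same rule flattens the pooled activations later on.
print(infer_shape(50 * 7 * 7 * 64, [-1, 7 * 7 * 64]))  # [50, 3136]
```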

In [66]:
# Second convolutional layer

W_conv2 = weight_variable([5, 5, 32, 64]) # 64 filters with patch size 5*5 and input channel 32
b_conv2 = bias_variable([64]) # have a bias term for each filter

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2) # now activation map is 7 * 7 for each slice, and the whole tensor is batch_size*7*7*64

In [67]:
# Densely/Fully connected layer

# 7*7*64 is the dimension for the activation maps from the second convolutional layer
# add a fully-connected layer with 1024 neurons to allow processing on the entire image
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024]) # one bias for each neuron on FC layer

# Reshape the tensor from the pooling layer into a batch of vectors, one row per image holding all of its flattened activation maps
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
# Multiply by a weight matrix, add a bias, and apply a ReLU
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

In [68]:
# apply dropout before the readout/output layer. 
# keep_prob: a placeholder for the probability that a neuron's output is kept during dropout. 
# This allows us to turn dropout on during training, and turn it off during testing. 
# The tf.nn.dropout op automatically scales the kept outputs, so no additional scaling is needed.
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
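
The scaling mentioned above can be sketched in plain Python: each kept value is divided by keep_prob so that the expected output equals the input. The function below is a hypothetical list-based stand-in for tf.nn.dropout, with an explicit random generator added for repeatability.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and scale it by
    1 / keep_prob; otherwise output 0."""
    out = []
    for v in values:
        if rng.random() < keep_prob:
            out.append(v / keep_prob)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)
activations = [1.0] * 10000

train_out = dropout(activations, 0.5, rng)  # training: about half are zeroed
test_out = dropout(activations, 1.0, rng)   # testing: nothing is dropped

print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(test_out == activations)          # True
```

Because of the scaling, the layer after dropout sees activations of roughly the same magnitude in training and in testing.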

In [69]:
# Readout/Output layer

# weights that connect FC layer with output layer
# 10 output neurons since there are 10 classes
W_output = weight_variable([1024, 10])
b_output = bias_variable([10])

# unnormalized logits
y_conv = tf.matmul(h_fc1_drop, W_output) + b_output

In [70]:
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_conv, y_)) # cross-entropy loss
# used Adam optimizer instead of vanilla gradient descent to update parameters, learning rate = 0.0001
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy) 
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1)) # boolean prediction results
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) # prediction accuracy

# Before Variables can be used within a session, they must be initialized using that session. 
# This can be done for all Variables at once
sess.run(tf.initialize_all_variables())

# run this many training steps (one mini-batch per step)
for i in range(100): 
    # mini-batch size=50
    # batch is a tuple (mini_train, mini_label)
    batch = mnist.train.next_batch(50) 
    # print out accuracy on the current training batch every 100 steps
    if i%100 == 0: 
        # note keep_prob for dropout is 1, which means no neurons are dropped out when evaluating the training accuracy
        train_accuracy = accuracy.eval(feed_dict={
            x:batch[0], y_: batch[1], keep_prob: 1.0})  # feed_dict: maps data onto Tensor placeholders
        print("epoch %d, training accuracy %g"%(i, train_accuracy))
    # note keep_prob for dropout is 0.5, which means each neuron is dropped with probability 0.5 during training
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

# print testing results
# note keep_prob for dropout is 1, which means no neurons are dropped out during testing
print("test accuracy %g"%accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})) # feed_dict: maps data onto Tensor placeholders

epoch 0, training accuracy 0.12
test accuracy 0.8374
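
For readers curious how Adam differs from vanilla gradient descent, here is a plain-Python sketch of the textbook Adam update applied to a one-dimensional quadratic. It is not TensorFlow's AdamOptimizer: the function name, the toy objective (x - 3)^2, and the larger learning rate of 0.1 are all assumptions chosen so the sketch converges quickly.

```python
import math


def adam_minimize(grad_fn, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=1000):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1.0 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x


# Gradient of f(x) = (x - 3)^2 is 2 * (x - 3); the minimum is at x = 3.
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

The step size is normalized by the running gradient magnitude, which is why Adam is typically run with a much smaller learning rate (1e-4 here) than the 0.5 used with plain gradient descent in the softmax examples.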


### Example7 - ConvNet on MNIST with Tensorboard Visualization

In [141]:
# reset everything to rerun in jupyter
tf.reset_default_graph()

# config
batch_size = 50
learning_rate = 1e-4  # step size for the Adam optimizer (the training op below hard-codes the same value)
training_epochs = 500
logs_path = "/tmp/mnist_convnet"

# different summaries for a single variable
def variable_summaries(var, name):
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.scalar_summary('mean/' + name, mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.scalar_summary('stddev/' + name, stddev)
        tf.scalar_summary('max/' + name, tf.reduce_max(var))
        tf.scalar_summary('min/' + name, tf.reduce_min(var))
        tf.histogram_summary('histogram/' + name, var)

# Initialize weights and biases
def weight_variable(shape):
    # tf.truncated_normal returns random values from a normal distribution, re-drawing any value more than 2 std from the mean
    # shape must be a 1-D integer array
    initial = tf.truncated_normal(shape, stddev=0.1) 
    return tf.Variable(initial)

def bias_variable(shape):
    # Since we use ReLU, we will initialize them with a slightly positive initial bias to avoid "dead neurons"
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

# Convolutions use a stride of one and are zero padded so that the output is a tensor with the same size as the input.
# x is the input: must be of shape [batch, in_height, in_width, in_channels]
# W is the filter/kernel: must be of shape [filter_height, filter_width, in_channels, out_channels]
# strides: a 1-D list of 4 ints, one per input dimension
# padding: a string, either "SAME" or "VALID", selecting the padding algorithm to use
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

# Max pooling over 2x2 blocks on the input
# x: A 4-D Tensor with shape [batch, height, width, channels] and type tf.float32
# ksize: A list of ints that has length >= 4. The size of the window for each dimension of the input tensor. In this case
# it is 2*2 over the height & width of each input instance
# strides: A list of ints that has length >= 4. The stride of the sliding window for each dimension of the input tensor. In this
# case the window moves 2 pixels at a time, so each non-overlapping 2*2 block yields one output pixel
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# load mnist data set
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

# input images
with tf.name_scope('input'):
    # None -> batch size can be any size, 784 -> flattened mnist image
    x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input") 
    # To apply the conv layer, we first reshape x to a 4d tensor
    # The first dimension is -1, which means the batch size is inferred from the other dimensions
    # so that the total size remains constant
    # The second and third dimensions correspond to the image size 28 * 28
    # The final dimension corresponds to the number of color channels.
    x_image = tf.reshape(x, [-1,28,28,1])
    
    # target 10 output classes
    y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")

# First convolutional layer, which is the result of convolution, followed by max pooling
with tf.name_scope('Conv1'):
    with tf.name_scope('conv1_parameters'):
        with tf.name_scope("conv1_weights"):
            # [5, 5, 1, 32] indicates there are 32 filters with size 5*5 and input channel 1
            # therefore the output channel will be 32
            W_conv1 = weight_variable([5, 5, 1, 32])
            variable_summaries(W_conv1, 'W_conv1')
        with tf.name_scope("conv1_biases"):
            # have a bias term for each filter
            b_conv1 = bias_variable([32])
            variable_summaries(b_conv1, 'b_conv1')

    with tf.name_scope('Conv1_activated_conv'):
        # convolve x_image with the weight tensor, add the bias, apply the ReLU function, and finally max pool
        h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) # h means hidden
    with tf.name_scope('Conv1_max_pool'):
        h_pool1 = max_pool_2x2(h_conv1) # now activation map is 14 * 14 for each slice, and the whole tensor is batch_size*14*14*32


# Second convolutional layer
with tf.name_scope('Conv2'):
    with tf.name_scope('conv2_parameters'):
        with tf.name_scope("conv2_weights"):
            # [5, 5, 32, 64] indicates there are 64 filters with size 5*5 and input channel 32
            # therefore the output channel will be 64
            W_conv2 = weight_variable([5, 5, 32, 64]) 
            variable_summaries(W_conv2, 'W_conv2')
        with tf.name_scope("conv2_biases"):
            b_conv2 = bias_variable([64]) # have a bias term for each filter
            variable_summaries(b_conv2, 'b_conv2')
    
    with tf.name_scope('Conv2_activated_conv'):
        h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
    with tf.name_scope('Conv2_max_pool'):
        h_pool2 = max_pool_2x2(h_conv2) # now activation map is 7 * 7 for each slice, and the whole tensor is batch_size*7*7*64        

# Densely/Fully connected layer
with tf.name_scope('FC_layer'):
    with tf.name_scope('FC1_parameters'):
        with tf.name_scope("FC1_weights"):
            # 7*7*64 is the dimension for the activation maps from the second convolutional layer
            # add a fully-connected layer with 1024 neurons to allow processing on the entire image
            W_fc1 = weight_variable([7 * 7 * 64, 1024])
            variable_summaries(W_fc1, 'W_FC')
        with tf.name_scope("FC1_biases"):
            b_fc1 = bias_variable([1024]) # one bias for each neuron on FC layer
            variable_summaries(b_fc1, 'b_FC')

    with tf.name_scope('FC1_flatten'):
        # Reshape the tensor from the pooling layer into a batch of vectors, one row per image holding all of its flattened activation maps
        h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
    with tf.name_scope('FC1_activated'):
        # Multiply by a weight matrix, add a bias, and apply a ReLU
        h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

# apply dropout before the readout/output layer. 
# keep_prob: a placeholder for the probability that a neuron's output is kept during dropout. 
# This allows us to turn dropout on during training, and turn it off during testing. 
# The tf.nn.dropout op automatically scales the kept outputs, so no additional scaling is needed.
with tf.name_scope('Dropout_layer'):    
    keep_prob = tf.placeholder(tf.float32)
    h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# Readout/Output layer
with tf.name_scope('Output_layer'):
    with tf.name_scope('output_parameters'):
        with tf.name_scope("output_weights"):
            # weights that connect FC layer with output layer
            # 10 output neurons since there are 10 classes
            W_output = weight_variable([1024, 10])
            variable_summaries(W_output, 'W_output')
        with tf.name_scope("output_biases"):
            b_output = bias_variable([10])
            variable_summaries(b_output, 'b_output')

    with tf.name_scope('softmax'):
        y_conv = tf.matmul(h_fc1_drop, W_output) + b_output
    
with tf.name_scope('cross_entropy'):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y_conv, y_)) # cross-entropy loss
    tf.scalar_summary("cost", cross_entropy)

with tf.name_scope('train'):
    # used Adam optimizer instead of vanilla gradient descent to update parameters, learning rate = 0.0001
    train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy) 

with tf.name_scope('Accuracy'):
    with tf.name_scope('correct_prediction'):
        correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1)) # boolean prediction results
    with tf.name_scope('accuracy'):
        accuracy_op = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) # prediction accuracy
        tf.scalar_summary("accuracy", accuracy_op)    

# merge all summaries into a single "operation" which we can execute in a session 
summary_op = tf.merge_all_summaries()

with tf.Session() as sess:
    # create a log folder and save the graph structure, do this before training
    #writer = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())
    train_writer = tf.train.SummaryWriter(logs_path + '/train',graph=tf.get_default_graph())
    test_writer = tf.train.SummaryWriter(logs_path + '/test')
    
    # variables need to be initialized before we can use them
    sess.run(tf.initialize_all_variables())

    # perform training cycles
    for epoch in range(training_epochs):
        batch_x, batch_y = mnist.train.next_batch(batch_size)

        # write training summaries at every step
        summary, _ = sess.run([summary_op, train_op], feed_dict={x: batch_x, y_: batch_y, keep_prob: 0.5})
        train_writer.add_summary(summary, epoch)

        # write testing summaries every 50 steps
        if epoch % 50 == 0:
            summary, acc = sess.run([summary_op, accuracy_op], feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})
            test_writer.add_summary(summary, epoch)
            print('Test accuracy at epoch %s: %s' % (epoch, acc))
    
    print("done")


Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Test accuracy at epoch 0: 0.1155
Test accuracy at epoch 50: 0.7281
Test accuracy at epoch 100: 0.8236
Test accuracy at epoch 150: 0.8727
Test accuracy at epoch 200: 0.9059
Test accuracy at epoch 250: 0.9075
Test accuracy at epoch 300: 0.9239
Test accuracy at epoch 350: 0.9288
Test accuracy at epoch 400: 0.9344
Test accuracy at epoch 450: 0.9425
done

