# Deep Learning with TensorFlow
## Recitation Notebook

### Authors: Trevin Gandhi, Jordan Hurwitz, Brady Neal

This recitation will consist of two parts:  
[1) Building a feedforward Deep Neural Network in TensorFlow and discussing some best practices](#section1)  
[2) Using TensorBoard for visualizations](#section2)

<a id='section1' href='#section1'><h3> Section 1: Building a Deep Feedforward Neural Network</h3></a>  
(Based on the TensorFlow tutorials)

A quick note before we start: for most applications of deep learning
(for example, image recognition), instead of training a deep neural
network from scratch (which can take days or weeks), it
is common to download the weights of a pre-trained network and "fine-tune"
them for your application. This lets you train a neural
network even when you have relatively little labeled data. However, the data
that the pre-trained model was trained on has to be similar
to your data.

In this notebook, however, we train the network from scratch.

In [1]:
# First, we do the basic setup.
import tensorflow as tf
sess = tf.InteractiveSession()

In [2]:
# We will be training this deep neural network on MNIST,
# so let's first load the dataset.
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [3]:
# Now let's initialize some placeholders

# Here, x is a placeholder for our input data. Since MNIST
# uses 28x28 pixel images, we "unroll" each one into a
# 784-element vector. The `None` indicates that we can input
# an arbitrary number of data points. Thus x is a matrix with
# 784 columns and a number of rows that is decided when we
# supply the data.
x  = tf.placeholder(tf.float32, [None, 784])

# We define y_ to be the placeholder for our *true* y's.
# We give y_ 10 columns because each row will be a
# one-hot vector marking the correct class (digit 0-9)
# of the image.
y_ = tf.placeholder(tf.float32, shape=[None, 10])
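
To make the two placeholder shapes concrete, here is a small pure-Python sketch of what "unrolling" an image and one-hot encoding a label mean. The helper names `flatten_image` and `one_hot` are made up for illustration; the MNIST loader used above does this work for us.

```python
# Illustrative only: pure-Python stand-ins for what the MNIST loader provides.

def flatten_image(image):
    """Unroll a 2-D image (a list of rows) into one flat list of pixels."""
    return [pixel for row in image for pixel in row]

def one_hot(label, num_classes=10):
    """Return a list of length num_classes with a 1.0 at index `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

# A fake 28x28 "image" of zeros, standing in for one MNIST digit.
image = [[0.0] * 28 for _ in range(28)]
flat = flatten_image(image)
label_vec = one_hot(3)

print(len(flat))   # 784
print(label_vec)
```

A batch of N such examples would be fed as an N x 784 matrix for `x` and an N x 10 matrix for `y_`.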

In [4]:
# Here we make a handy function for initializing biases. 
# Note that we are returning a "Variable" - this means
# something that is subject to change during training.
# TensorFlow is actually using gradient descent to optimize
# the value of all "Variables" in our network. 
def bias_variable(shape):
    # Here we choose to initialize our biases to 0.01 to
    # ensure that all ReLU units fire in the beginning.
    # However, this is not an agreed-upon standard and
    # many just initialize the biases to 0.
    initial = tf.constant(0.01, shape=shape)
    return tf.Variable(initial)

In [5]:
# Let's define the first set of weights and biases (corresponding to our first layer).
# We use He initialization for the weights, which is good practice when training
# deeper networks with ReLU units. Here, get_variable is similar to creating a Variable
# and assigning it, except it also checks whether a variable with that name already
# exists (so re-running this cell without resetting the graph raises an error).

# This is: [number of input neurons, number of neurons in the first hidden layer,
# number of neurons in the second hidden layer, number of classes]
num_neurons = [784, 768, 1280, 10]

# Just store this for convenience
he_init = tf.contrib.layers.variance_scaling_initializer()

w1 = tf.get_variable("w1", shape=[num_neurons[0], num_neurons[1]], 
                     initializer=he_init)
b1 = bias_variable([num_neurons[1]])

# Now let's define the computation that takes this layer's input and runs it through
# the neurons. Note that we use the ReLU activation function, which helps avoid the
# vanishing gradients that saturating activations (e.g. sigmoid) can cause. This line
# is the equivalent of saying the output of the first hidden layer is max(x*w1 + b1, 0).
h1 = tf.nn.relu(tf.matmul(x, w1) + b1)

# We also apply dropout after this layer and the next. Dropout is a form of regularization
# in neural networks where we "turn off" randomly selected neurons during training.
# `keep_prob` is the probability that each neuron is kept.
keep_prob = tf.placeholder(tf.float32)
h1_drop = tf.nn.dropout(h1, keep_prob)
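
As a rough illustration of what this cell computes, the sketch below reimplements one ReLU layer and dropout in plain Python for a single input vector. The function names are hypothetical, and `he_stddev` shows only the basic sqrt(2 / fan_in) scale behind He initialization; TensorFlow's initializer may apply additional corrections. The dropout here rescales the surviving units by 1 / keep_prob, which is how `tf.nn.dropout` keeps the expected activation unchanged.

```python
import math
import random

def he_stddev(fan_in):
    """Basic He-initialization scale: sqrt(2 / number of inputs)."""
    return math.sqrt(2.0 / fan_in)

def dense_relu(inputs, weights, biases):
    """Compute max(inputs * weights + biases, 0) for one input vector.

    inputs: list of n_in floats; weights: n_in x n_out nested list;
    biases: list of n_out floats.
    """
    outputs = []
    for j in range(len(biases)):
        z = sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        outputs.append(max(z, 0.0))
    return outputs

def inverted_dropout(activations, keep_prob, rng):
    """Keep each unit with probability keep_prob and scale it by 1/keep_prob."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

# Tiny example: 2 inputs, 2 hidden units, biases of 0.01 as in bias_variable.
h = dense_relu([2.0, 4.0], [[1.0, -1.0], [0.5, -0.5]], [0.01, 0.01])
rng = random.Random(0)
h_train = inverted_dropout(h, 0.5, rng)  # training: some units zeroed
h_eval = inverted_dropout(h, 1.0, rng)   # evaluation: nothing is dropped
print(h, h_eval)
```

With `keep_prob` set to 1.0 the dropout step is a no-op, which is why the evaluation code in this notebook feeds `keep_prob: 1.0`.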

In [6]:
# Define the second layer, similarly to the first.
w2 = tf.get_variable("w2", shape=[num_neurons[1], num_neurons[2]], 
                     initializer=he_init)
b2 = bias_variable([num_neurons[2]])
h2 = tf.nn.relu(tf.matmul(h1_drop, w2) + b2)
h2_drop = tf.nn.dropout(h2, keep_prob)

# And define the third layer to output the unnormalized log probabilities (logits)
w3 = tf.get_variable("w3", shape=[num_neurons[2], num_neurons[3]], 
                     initializer=he_init)
b3 = bias_variable([num_neurons[3]])
y  = tf.matmul(h2_drop, w3) + b3

In [7]:
# We define our loss function to be cross entropy over softmax probabilities.
# Here our true labels are given by y_, and our unnormalized log probabilities
# (TensorFlow calls them `logits`) are given by y.
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
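
For intuition, here is a minimal pure-Python sketch of softmax cross entropy for a single example. It is not TensorFlow's implementation, only the textbook formula with the usual max-subtraction trick for numerical stability; the function names are illustrative.

```python
import math

def softmax(logits):
    """Turn unnormalized log probabilities into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(labels, logits):
    """Cross entropy between a one-hot label vector and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(labels, probs) if t > 0)

# An untrained network that outputs equal logits for all 10 digits
# assigns probability 0.1 to each class.
labels = [0.0] * 10
labels[7] = 1.0
loss = cross_entropy(labels, [0.0] * 10)
print(loss)  # about 2.3026, i.e. ln(10)
```

This is a useful sanity check: a freshly initialized 10-class classifier should start with a loss near ln(10), which is roughly what the step-0 line of the training log shows.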

In [8]:
# We will use the `Adam` optimizer. Adam is a variant of standard gradient
# descent that adapts the step size for each parameter.
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

# Here we build a binary vector corresponding to where our predicted 
# classes matched the actual classes.
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess.run(tf.global_variables_initializer())

for i in range(20000):
    batch = mnist.train.next_batch(50)
    if i%500 == 0:
#         train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
#         loss = cross_entropy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
        train_accuracy, loss = sess.run([accuracy, cross_entropy], 
                                        feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g, loss %g"%(i, train_accuracy, loss))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

# Evaluate on the test set in 20 batches of 500 (10,000 images in total)
# rather than all at once. TODO: clean this up.
test_accuracy = 0
for i in range(20):
    batch = mnist.test.next_batch(500)
    test_accuracy += 500 * accuracy.eval(feed_dict={
        x:batch[0], y_: batch[1], keep_prob: 1.0})

print("test accuracy %g"%(test_accuracy / 10000))
# print("test accuracy %g"%accuracy.eval(feed_dict={
#     x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

step 0, training accuracy 0.08, loss 2.36143
step 500, training accuracy 0.82, loss 0.355271
step 1000, training accuracy 0.92, loss 0.225537
step 1500, training accuracy 0.94, loss 0.246066
step 2000, training accuracy 0.98, loss 0.0766248
step 2500, training accuracy 0.98, loss 0.100453
step 3000, training accuracy 0.94, loss 0.0873284
step 3500, training accuracy 0.98, loss 0.120597
step 4000, training accuracy 0.96, loss 0.120495
step 4500, training accuracy 1, loss 0.0117914
step 5000, training accuracy 0.98, loss 0.0775767
step 5500, training accuracy 0.98, loss 0.063437
step 6000, training accuracy 1, loss 0.00433112
step 6500, training accuracy 0.98, loss 0.117304
step 7000, training accuracy 1, loss 0.0297559
step 7500, training accuracy 1, loss 0.0112341
step 8000, training accuracy 0.98, loss 0.154639
step 8500, training accuracy 0.96, loss 0.128729
step 9000, training accuracy 0.98, loss 0.0539484
step 9500, training accuracy 1, loss 0.00962552
step 10000, training accuracy
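
To give a sense of what the Adam optimizer does on each `train_step`, here is a sketch of the update rule for a single scalar parameter, following the formulation in the Adam paper. TensorFlow's implementation arranges the arithmetic slightly differently, so treat this as an illustration and not a line-for-line equivalent. The toy objective `(w - 3)^2` and the learning rate of 0.1 are made up for the example; the beta and epsilon values are the commonly used defaults.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter.

    m and v are running averages of the gradient and the squared gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # correct the bias toward zero at early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
print(w)  # ends up close to 3
```

Note that the very first step moves the parameter by roughly the learning rate regardless of how large the gradient is, which is one reason Adam is less sensitive to gradient scale than plain gradient descent.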

### Can we make this simpler? 
With TensorFlow 1.0, we can!

In [9]:
from tensorflow.contrib.layers import fully_connected, dropout, batch_norm

# Instead of making keep_prob a placeholder (like we did for dropout
# above), we can make a boolean `is_training` placeholder that dropout
# and batch normalization check to decide how to behave (i.e. if
# is_training is True, dropout uses a keep_prob of 0.5; otherwise
# it keeps every unit, which is equivalent to a keep_prob of 1.0).
is_training = tf.placeholder(tf.bool, shape=(), name='is_training')

# We can even easily add Batch Normalization, which can also be quite
# useful when training deep neural networks (although it won't do much
# here).
bn_params = {
    'is_training': is_training,
    'decay': 0.99,
    'updates_collections': None
}

# Define the first hidden layer using `fully_connected`
# There are similar functions (e.g. conv2d) for other
# types of layers.
hidden1 = fully_connected(x, num_neurons[1], 
                          weights_initializer=he_init,
                          normalizer_fn=batch_norm, 
                          normalizer_params=bn_params)
hidden1_drop = dropout(hidden1, keep_prob=0.5, is_training=is_training)

hidden2 = fully_connected(hidden1_drop, num_neurons[2], 
                          weights_initializer=he_init,
                          normalizer_fn=batch_norm, 
                          normalizer_params=bn_params)
hidden2_drop = dropout(hidden2, keep_prob=0.5, is_training=is_training)

logits = fully_connected(hidden2_drop, num_neurons[3], activation_fn=None)
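
The role of `is_training` and `decay` in batch normalization can be sketched in plain Python for a single feature. This is a simplified illustration, not the `tf.contrib.layers.batch_norm` implementation: it omits the learned offset and scale, the function name is made up, and the `eps` value is an assumption.

```python
import math

def batch_norm_1d(batch, moving_mean, moving_var, is_training,
                  decay=0.99, eps=1e-3):
    """Normalize one feature across a batch.

    In training mode, use the batch statistics and update the moving
    averages. In inference mode, use the stored moving averages instead.
    """
    if is_training:
        mean = sum(batch) / len(batch)
        var = sum((x - mean) ** 2 for x in batch) / len(batch)
        moving_mean = decay * moving_mean + (1 - decay) * mean
        moving_var = decay * moving_var + (1 - decay) * var
    else:
        mean, var = moving_mean, moving_var
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return normalized, moving_mean, moving_var

batch = [1.0, 2.0, 3.0, 4.0]
train_out, mm, mv = batch_norm_1d(batch, 0.0, 1.0, is_training=True)
eval_out, _, _ = batch_norm_1d(batch, mm, mv, is_training=False)
print(train_out)
print(mm, mv)
```

With a decay of 0.99, each training batch nudges the moving averages by only 1%, so they need many steps to settle before inference-mode outputs become reliable.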

<a id='section2' href='#section2'><h3>Section 2: Using TensorBoard for Visualizations</h3></a>

### Text Generation with RNNs
Based on the TensorFlow Tutorial and Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)

In [10]:
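# Let us first download the dataset we will be using,
# the works of Shakespeare. Dataset from Andrej Karpathy.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
print('Downloading Shakespeare data')
source = urlopen("https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt")
# Decode the raw bytes so that `shakespeare` is text on both Python versions.
shakespeare = source.read().decode('utf-8')
print('Download complete')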

Downloading Shakespeare data
Download complete


In [None]:
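# NOTE: `batch_size`, `num_steps`, and `words` are not defined above;
# they must be set up before this cell can run.
lstm_size = 512

def make_cell():
    # Create a fresh cell for each layer so the layers do not share weights.
    return tf.contrib.rnn.BasicLSTMCell(lstm_size, state_is_tuple=False)

stacked_lstm = tf.contrib.rnn.MultiRNNCell([make_cell() for _ in range(2)],
    state_is_tuple=False)
initial_state = state = stacked_lstm.zero_state(batch_size, tf.float32)
for i in range(num_steps):
    # The value of state is updated after processing the i-th time step of the batch.
    output, state = stacked_lstm(words[:, i], state)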

    # The rest of the code.
    # ...
 
final_state = state
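
The cell above indexes its input as `words[:, i]`, which suggests a `[batch_size, num_steps]` array of integer ids. For a character-level model, one simple way to get from raw text to arrays of that shape is sketched below in plain Python. The helpers `build_vocab` and `make_batches` are hypothetical and not part of the notebook, and this naive chunking ignores refinements such as carrying LSTM state across consecutive batches.

```python
def build_vocab(text):
    """Map each distinct character to an integer id (sorted for stability)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}

def make_batches(ids, batch_size, num_steps):
    """Cut a list of ids into batches shaped [batch_size][num_steps].

    Any leftover ids that do not fill a whole batch are dropped.
    """
    chunk = batch_size * num_steps
    batches = []
    for start in range(0, len(ids) - chunk + 1, chunk):
        block = ids[start:start + chunk]
        batches.append([block[r * num_steps:(r + 1) * num_steps]
                        for r in range(batch_size)])
    return batches

text = "hello world"  # stand-in for the downloaded Shakespeare text
vocab = build_vocab(text)
ids = [vocab[ch] for ch in text]
batches = make_batches(ids, batch_size=2, num_steps=2)
print(len(vocab), len(batches))
print(batches[0])
```

In a real model these ids would typically be passed through an embedding or one-hot layer before reaching the LSTM, since the cell expects floating-point inputs.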