# Classifying MNIST With Neural Networks Using TensorFlow

In this assignment you will train a simple two-layer fully connected neural network implemented in TensorFlow. Because constructing a neural network from scratch is complex, and because we are not testing your knowledge of specific neural network libraries, there will be no coding for this assignment; instead, we will ask questions that test your understanding of how neural networks function. Make sure that you have TensorFlow installed in your environment before beginning this assignment. The code uses the TensorFlow 1.x graph API (`tf.placeholder`, `tf.Session`), so it will not run unmodified on TensorFlow 2.x.

Let's begin by importing the necessary modules and data.

In [None]:
import numpy as np
import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets("MNIST_data/", one_hot=True)

We begin the neural network by defining the input layer, which we will call `x`, and the correct labels, which we will call `label`. Since each image is 28 by 28 pixels, we need 784 input neurons. We are predicting between 10 digits, and so our labels will be expressed with a 10-dimensional vector using one-hot encoding (where all elements are 0 except for a single 1).

In [None]:
x = tf.placeholder(tf.float32, [None, 784])
label = tf.placeholder(tf.float32, [None, 10])
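
To make the one-hot encoding concrete, here is a minimal NumPy sketch. The helper `one_hot` is a hypothetical name used only for illustration; the assignment itself gets one-hot labels directly from `read_data_sets(..., one_hot=True)`.

```python
import numpy as np

def one_hot(digits, num_classes=10):
    """Return a (len(digits), num_classes) array with a single 1 per row."""
    digits = np.asarray(digits)
    encoded = np.zeros((digits.size, num_classes), dtype=np.float32)
    # Row i gets a 1 in the column given by digits[i].
    encoded[np.arange(digits.size), digits] = 1.0
    return encoded

labels = one_hot([3, 0, 9])
print(labels.shape)            # (3, 10)
print(labels[0])               # a 1 at index 3, zeros elsewhere
print(labels.argmax(axis=1))   # argmax recovers the original digits
```

Taking the `argmax` of a one-hot row recovers the digit, which is the same trick the accuracy operation later in this assignment relies on.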

We construct our hidden layer with 400 neurons by creating a 784 by 400 weight matrix and a bias vector, and then apply a sigmoid activation function. We'll call the output of the hidden layer `h`.

In [None]:
W_h = tf.Variable(tf.truncated_normal([784, 400], stddev=0.1))
b_h = tf.Variable(tf.truncated_normal([400], stddev=0.1))
h = tf.sigmoid(tf.matmul(x, W_h) + b_h)
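
If it helps to see the shapes involved, the same computation can be sketched in plain NumPy. This is only an illustration under stated assumptions: the images are random stand-ins, the weights are drawn from an ordinary (not truncated) normal distribution, and the `_np` names are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    """Elementwise logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Five fake "images", each flattened to 784 pixel values in [0, 1).
images = rng.random((5, 784))

# Assumption: plain normal initialization stands in for tf.truncated_normal.
W_h_np = rng.normal(0.0, 0.1, size=(784, 400))
b_h_np = rng.normal(0.0, 0.1, size=400)

# (5, 784) @ (784, 400) -> (5, 400); the bias broadcasts across the batch.
h_np = sigmoid(images @ W_h_np + b_h_np)
print(h_np.shape)  # (5, 400)
```

Each row of `h_np` holds the 400 hidden activations for one image, and every activation lies strictly between 0 and 1.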

### Question 1:
What is the purpose of applying an activation function at each layer? How would our neural network be limited if we didn't apply any activation functions to the layers?

*YOUR ANSWER HERE*

### Question 2:
A problem with deep neural networks that use sigmoid or tanh activation functions is the vanishing gradient problem. Luckily, our neural network isn't deep enough for it to be an issue, but from observing the behavior of the sigmoid and tanh functions, can you see what problem might arise when computing gradients, especially in the first layers? What can be done to mitigate or resolve this?

*YOUR ANSWER HERE*


Now we construct the output layer. Since the neural network is predicting from 10 choices, we create a 400 by 10 matrix with bias, and then apply a softmax activation. The softmax function is like a generalized sigmoid, with the property that the sum of all outputs is equal to one (along with other properties that are similar to those from the sigmoid function). We will call our final output prediction `y`.

In [None]:
W_y = tf.Variable(tf.truncated_normal([400, 10], stddev=0.1))
b_y = tf.Variable(tf.truncated_normal([10], stddev=0.1))
y = tf.nn.softmax(tf.matmul(h, W_y) + b_y)
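
The two softmax properties mentioned above can be checked with a small NumPy sketch. The `softmax` helper here is an illustrative re-implementation, not TensorFlow's code; subtracting the row maximum is a common stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax for a 2-D array of scores."""
    # Shifting by the row max avoids overflow in exp without changing the output.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)
print(probs.sum(axis=1))   # each row sums to 1 (up to floating-point error)

# With two classes and one score fixed at 0, softmax reduces to the sigmoid.
z = 2.0
two_class = softmax(np.array([[z, 0.0]]))[0, 0]
sigmoid_z = 1.0 / (1.0 + np.exp(-z))
print(np.isclose(two_class, sigmoid_z))
```

The second check is the sense in which softmax is a "generalized sigmoid": the sigmoid is the two-class special case.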

Now we define our loss function. For this neural network, we'll be using the cross-entropy loss. TensorFlow also provides an operation that automatically finds the gradients of all operations in the neural network and performs the backpropagation step. We'll do backprop with the standard gradient descent optimizer, using a learning rate of `0.2`. Note that `GradientDescentOptimizer`, as used here, keeps the learning rate fixed; reducing it over time would require adding a separate decay schedule.

We'll also create an operation that lets us test the accuracy of our predictions.

In [None]:
# Cross entropy per example (sum over the 10 classes), averaged over the batch.
loss = tf.reduce_mean(-tf.reduce_sum(label * tf.log(y), axis=1))
backprop = tf.train.GradientDescentOptimizer(0.2).minimize(loss)

correct = tf.equal(tf.argmax(y,1), tf.argmax(label,1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
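
As a rough check on what these operations compute, here is a NumPy sketch of the same loss and accuracy on a hand-made batch of three examples with three classes. The function names and the `eps` clipping are assumptions added for illustration; the TensorFlow cell above takes `tf.log(y)` directly, which would produce infinities if a predicted probability were exactly zero.

```python
import numpy as np

def cross_entropy(probs, one_hot_labels, eps=1e-12):
    """Mean over the batch of -sum(label * log(prob)) for each example."""
    clipped = np.clip(probs, eps, 1.0)  # assumption: guard against log(0)
    return float(np.mean(-np.sum(one_hot_labels * np.log(clipped), axis=1)))

def accuracy_of(probs, one_hot_labels):
    """Fraction of rows where the most probable class matches the label."""
    return float(np.mean(probs.argmax(axis=1) == one_hot_labels.argmax(axis=1)))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)

loss_value = cross_entropy(probs, labels)
acc_value = accuracy_of(probs, labels)
print(loss_value)  # mean of -log(0.7), -log(0.8), -log(0.3)
print(acc_value)   # two of the three predictions are correct
```

Because the labels are one-hot, only the probability assigned to the correct class contributes to each example's loss.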

### Question 3:
If you were to calculate the backpropagation step yourself, how would you do it? Explain in words the general method for performing the backpropagation step. Please be as detailed in your explanation as possible.

*YOUR ANSWER HERE*

### Question 4:
Why might it be helpful to reduce the learning rate (the coefficient by which the gradient is multiplied before it is subtracted from the weights) over time?

*YOUR ANSWER HERE*

The next segment of code is for TensorFlow initialization; you don't need to worry about it.

In [None]:
sess = tf.Session()
# tf.initialize_all_variables() is deprecated in TensorFlow 1.x in favor of this call.
sess.run(tf.global_variables_initializer())

Now that we've constructed the neural network, it's time to start training. We'll train in mini-batches of 100 for 5000 steps. This may take a few minutes to execute. Training a neural network takes a lot of computing power!

In [None]:
for i in range(5000):
    x_batch, y_batch = data.train.next_batch(100)
    sess.run(backprop, feed_dict={x: x_batch, label: y_batch})
    if (i+1) % 500 == 0:
        train_accuracy = sess.run(accuracy, feed_dict={x: x_batch, label: y_batch})
        print('Iteration {}, current training accuracy is {}'.format(i+1, train_accuracy))
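
To give a feel for how mini-batches are drawn, here is a NumPy sketch of a shuffling batch iterator. It is a rough analogue written for illustration, not the actual implementation of `next_batch`; the name `iterate_minibatches` and the small dataset size are assumptions.

```python
import numpy as np

def iterate_minibatches(num_examples, batch_size, num_steps, seed=0):
    """Yield arrays of example indices, reshuffling once the data runs out."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_examples)
    start = 0
    for _ in range(num_steps):
        if start + batch_size > num_examples:
            # One full pass (epoch) is done: reshuffle and start over.
            order = rng.permutation(num_examples)
            start = 0
        yield order[start:start + batch_size]
        start += batch_size

# Toy setting: 1000 examples, batches of 100, 25 training steps.
batches = list(iterate_minibatches(num_examples=1000, batch_size=100, num_steps=25))
print(len(batches))       # 25
print(len(batches[0]))    # 100
```

In this toy setting, the first 10 batches together visit every example exactly once before the order is reshuffled.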

### Question 5:
What are the advantages and disadvantages of using mini-batches rather than pure stochastic training for each training step?

*YOUR ANSWER HERE*


Let's see how well our neural network performs on the test set. You should be getting around 95% accuracy.

In [None]:
print(sess.run(accuracy, feed_dict={x: data.test.images, label: data.test.labels}))

Now that you've successfully built and trained the neural network, try playing around with adjusting some parameters! Some suggestions include: different batch sizes, mean squared error loss, changing the number of neurons in the hidden layer, changing the activation function, or adding more layers. If you're feeling especially daring, you can even try implementing a convolutional neural network, which can achieve more than 99% accuracy!

This assignment was inspired by Google's MNIST for ML Beginners tutorial.