Code adapted from [siddk's tensorflow workshop](https://github.com/siddk/tensorflow-workshop).

## The Math

$$
\begin{align*}
X &= \text{ Flattened image as a row vector of shape } 1 \times 784 \\
Y &= \text{ Digit Label } \\
W_1 &= \text{ 1st matrix of weights, } B_1 = \text{ 1st vector of bias } \\
W_2 &= \text{ 2nd matrix of weights, } B_2 = \text{ 2nd vector of bias } \\
H &= \operatorname{ReLU}(X \times W_1 + B_1) \\
O &= \operatorname{Softmax}(H \times W_2 + B_2) \\
L &= \operatorname{Loss}(O, Y)
\end{align*}
$$

$X, Y, H, \text{ and } O$ all have an extra leading batch dimension that I'm ignoring here for simplicity.
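
To make the shapes concrete, here is a minimal NumPy sketch of the same forward pass with the batch dimension restored. It is an illustration only, not the TensorFlow model built below: the batch size `n = 4`, the random inputs, and the helper names `relu`, `softmax`, and `forward` are assumptions chosen for this example.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x)
    return np.maximum(0.0, x)

def softmax(z):
    # Row-wise softmax; subtracting the row max avoids overflow in exp
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def forward(X, W1, B1, W2, B2):
    H = relu(X @ W1 + B1)      # (n, 784) @ (784, 100) + (100,) -> (n, 100)
    O = softmax(H @ W2 + B2)   # (n, 100) @ (100, 10)  + (10,)  -> (n, 10)
    return H, O

rng = np.random.default_rng(0)
n = 4  # hypothetical batch size
X = rng.random((n, 784))
W1 = rng.normal(scale=0.1, size=(784, 100))
B1 = rng.normal(scale=0.1, size=100)
W2 = rng.normal(scale=0.1, size=(100, 10))
B2 = rng.normal(scale=0.1, size=10)

H, O = forward(X, W1, B1, W2, B2)
print(H.shape, O.shape)  # (4, 100) (4, 10)
```

The bias vectors have no batch dimension; NumPy broadcasting adds the same bias to every row of the batch, which is also what the `+ b_1` and `+ b_2` terms do in the TensorFlow code.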

## Setup

In [None]:
# Note: this notebook uses the TensorFlow 1.x API (placeholders, sessions, bundled MNIST loader)
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

# Fetch MNIST Dataset using the supplied Tensorflow Utility Function
mnist = input_data.read_data_sets("data/MNIST_data/", one_hot=True)

# Setup the Model Parameters
INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE = 784, 100, 10  

### Start Building the Computation Graph ###

# Initializer - draw initial variable values from a normal distribution with mean 0 and stddev 0.1
initializer = tf.random_normal_initializer(stddev=0.1)

# Setup Placeholders => None argument in shape lets us pass in arbitrary sized batches
X = tf.placeholder(tf.float32, shape=[None, INPUT_SIZE])  
Y = tf.placeholder(tf.float32, shape=[None, OUTPUT_SIZE])

## Hidden Layer & ReLU

Going from the input layer to the hidden layer we have

$$\underset{(n \times 100)}{\text{H}} = \operatorname{ReLU}(\underset{(n \times 784)}{\text{X}} \times
\underset{(784 \times 100)}{\text{$W_1$}} +
\underset{(1 \times 100)}{\text{$B_1$}} )$$

Here ReLU (Rectified Linear Unit) is defined as $\operatorname{ReLU}(x) = \max(0,x)$ and is applied element-wise to every value. This non-linear transformation lets the model go beyond purely linear predictions. Other activation functions such as the sigmoid are also used, but ReLU is cheap to compute and produces good results in practice.
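
As a quick illustration of the definition, the following NumPy sketch applies ReLU element-wise to a small array. The `relu` helper and the sample values are assumptions made for this example; the model itself uses `tf.nn.relu`.

```python
import numpy as np

def relu(x):
    # max(0, x) applied independently to every entry
    return np.maximum(0, x)

x = np.array([[-2.0, -0.5, 0.0],
              [0.5, 2.0, -3.0]])
print(relu(x))
```

Every negative entry is replaced by 0, while the positive entries pass through unchanged.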

In [3]:
# Hidden Layer Variables
W_1 = tf.get_variable("Hidden_W", shape=[INPUT_SIZE, HIDDEN_SIZE], initializer=initializer)
b_1 = tf.get_variable("Hidden_b", shape=[HIDDEN_SIZE], initializer=initializer)

# Hidden Layer Transformation
hidden = tf.nn.relu(tf.matmul(X, W_1) + b_1)

## Output Layer & Softmax

Going from the hidden layer to the output layer we have

$$ \underset{(n \times 10)}{\text{O}} = \operatorname{Softmax}(\underset{(n \times 100)}{\text{H}} \times
\underset{(100 \times 10)}{\text{$W_2$}} +
\underset{(1 \times 10)}{\text{$B_2$}}) $$

Softmax is defined as $$\operatorname{Softmax}(\vec{x})_j = \frac{e^{x_j}}{\sum_{i=1}^{\vert \vec{x} \vert} e^{x_i}}$$ 

Essentially, softmax takes a vector of real numbers and turns it into a vector of values between 0 and 1 that sum to 1, forming a valid probability distribution. The loss for the model is then the cross-entropy between the correct label and this probability vector (if you're not familiar with cross-entropy, check out [this blog post](https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/)). Minimizing this loss gradually pushes the model to assign the highest probability in the output vector to the correct label. During testing, the index of the maximum probability in the output vector is taken as the predicted digit.
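
A small NumPy sketch of softmax, the cross-entropy loss, and the argmax prediction for a single example may help here. The three-class `logits` and `label` values and the helper names are made up for illustration; the notebook itself relies on `tf.losses.softmax_cross_entropy` rather than this code.

```python
import numpy as np

def softmax(z):
    # Subtracting the max before exponentiating improves numerical
    # stability and does not change the result.
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()

def cross_entropy(one_hot_label, probs):
    # -sum_j y_j * log(p_j); with a one-hot label this is -log(p_correct)
    return -np.sum(one_hot_label * np.log(probs))

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
label = np.array([1.0, 0.0, 0.0])    # one-hot: class 0 is correct

probs = softmax(logits)              # non-negative values that sum to 1
loss = cross_entropy(label, probs)   # small when probs[0] is close to 1
prediction = int(np.argmax(probs))   # index of the largest probability
print(probs, loss, prediction)
```

Because the label is one-hot, the loss depends only on the probability assigned to the correct class, so lowering the loss means raising that probability.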

In [4]:
# Output Layer Variables
W_2 = tf.get_variable("Output_W", shape=[HIDDEN_SIZE, OUTPUT_SIZE], initializer=initializer)
b_2 = tf.get_variable("Output_b", shape=[OUTPUT_SIZE], initializer=initializer)

# Output Layer Transformation (raw scores, i.e. logits; softmax is not applied here)
output = tf.matmul(hidden, W_2) + b_2

# Compute Loss (softmax is applied to the logits inside the loss function)
loss = tf.losses.softmax_cross_entropy(Y, output)

## Training & Results

In [5]:
# Compute Accuracy
correct_prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(output, 1))
accuracy = 100 * tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Setup Optimizer
train_op = tf.train.AdamOptimizer().minimize(loss)

### Launch the Session, to Communicate with Computation Graph ###
BATCH_SIZE, NUM_TRAINING_STEPS = 100, 1000
with tf.Session() as sess:
    # Initialize all variables in the graph
    sess.run(tf.global_variables_initializer())

    # Training Loop
    for i in range(NUM_TRAINING_STEPS):
        batch_x, batch_y = mnist.train.next_batch(BATCH_SIZE)
        curr_acc, _ = sess.run([accuracy, train_op], feed_dict={X: batch_x, Y: batch_y})
        if i % 100 == 0:
            print('Step {} Current Training Accuracy: {:.3f}'.format(i, curr_acc))
    
    # Evaluate on Test Data
    print('Test Accuracy: {:.3f}'.format(sess.run(accuracy, feed_dict={X: mnist.test.images, 
                                                                Y: mnist.test.labels})))

Step 0 Current Training Accuracy: 9.000
Step 100 Current Training Accuracy: 87.000
Step 200 Current Training Accuracy: 89.000
Step 300 Current Training Accuracy: 94.000
Step 400 Current Training Accuracy: 92.000
Step 500 Current Training Accuracy: 94.000
Step 600 Current Training Accuracy: 96.000
Step 700 Current Training Accuracy: 96.000
Step 800 Current Training Accuracy: 95.000
Step 900 Current Training Accuracy: 94.000
Test Accuracy: 95.450
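
The accuracy figures printed above are percentages obtained by comparing the argmax of the labels with the argmax of the model output. A NumPy sketch of the same computation on a tiny made-up batch is shown here; the `accuracy_percent` helper and the `labels` and `scores` arrays are hypothetical and are not part of the notebook.

```python
import numpy as np

def accuracy_percent(one_hot_labels, scores):
    # A prediction is correct when the highest-scoring class matches the label
    correct = np.argmax(one_hot_labels, axis=1) == np.argmax(scores, axis=1)
    return 100.0 * correct.mean()

labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
scores = np.array([[2.0, 0.1, 0.3],    # predicts class 0 (correct)
                   [0.2, 1.5, 0.1],    # predicts class 1 (correct)
                   [0.9, 0.2, 0.4],    # predicts class 0 (wrong)
                   [0.1, 3.0, 0.2]])   # predicts class 1 (correct)
print(accuracy_percent(labels, scores))  # 75.0
```

Since softmax preserves the ordering of its inputs, taking the argmax of the raw scores gives the same prediction as taking the argmax of the softmax probabilities.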
