# MNIST for ML Beginners

## The MNIST Data
Download and read the MNIST data automatically:

In [41]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


The MNIST data is split into three parts: 55,000 data points of training data (mnist.train), 10,000 points of test data (mnist.test), and 5,000 points of validation data (mnist.validation).  
Every MNIST data point has two parts: an image of a handwritten digit and a corresponding label.  
Each image is 28 pixels by 28 pixels. We can flatten this array into a vector of 28x28 = 784 numbers.  
Each image in MNIST has a corresponding label, a number between 0 and 9 representing the digit drawn in the image.  
We're going to want our labels as "one-hot vectors". In this case, the nth digit will be represented as a vector which is 1 in the nth dimension. For example, 3 would be [0,0,0,1,0,0,0,0,0,0].

## Softmax Regressions

The evidence for a class $i$ given an input $x$ is:
$$\text{evidence}_i = \sum_j W_{i,j}x_j+b_i$$
where $W_i$ is the weights and bi is the bias for class $i$, and $j$ is an index for summing over the pixels in our input image $x$.  
Convert the evidence tallies into the predicted probabilities $y$ using the "softmax" function:
$$y=\text{softmax(evidence)}$$
Softmax equation:
$$\text{softmax}(x)_i=\frac{\text{exp}(x_i)}{\sum_j\text{exp}(x_j)}$$

Fox instance, input is a three dimensional array $x = [x_1, x_2, x_3]$ and output is a ternary class $y = [y_1, y_2, y_3]$.  
Picture softmax regression:
<img src="./MNIST-ML/softmax-regression-picture.png">
Write out as equations:
<img src="./MNIST-ML/softmax-regression-equation.png">
Matrix multiplication:
<img src="./MNIST-ML/softmax-regression-matrix.png">
Write compactly:
$$y=\text{softmax}(Wx+b)$$

## Implement the Regression
TensorFlow lets us describe a graph of interacting operations that run entirely outside Python. (Approaches like this can be seen in a few machine learning libraries.)

In [42]:
import tensorflow as tf
x = tf.placeholder(tf.float32, [None, 784])

$x$ is a **placeholder**, a value that we'll input when we ask TensorFlow to run a computation. We represent $x$ as a 2-D tensor of floating-point numbers, with a shape ```[None, 784]```. (Here ```None``` means that a dimension can be of any length.)  
A **Variable** is a modifiable tensor that lives in TensorFlow's graph of interacting operations.

In [43]:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

Implement the model:

In [44]:
y = tf.nn.softmax(tf.matmul(x, W) + b)

First, we multiply $x$ by $W$ with the expression ```tf.matmul(x, W)```.  We then add $b$, and finally apply ```tf.nn.softmax```.

## Training
Cross entropy (log-loss):
$$H_{y'}(y)=-\sum_i y'_i\log(y_i)$$
Where $y$ is our predicted probability distribution, and $y′$ is the true distribution.

In [45]:
y_ = tf.placeholder(tf.float32, [None, 10])

The raw formulation of cross-entropy:  

```cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))```  

First, ```tf.log``` computes the logarithm of each element of $y$. Next, we multiply each element of $y\_$ with the corresponding element of ```tf.log(y)```. Then ```tf.reduce_sum``` adds the elements in the second dimension of $y$, due to the ```reduction_indices=[1]``` parameter. Finally, ```tf.reduce_mean``` computes the mean over all the examples in the batch.
Unfortunatelly, this can be numerically unstable.  
Instead, we apply ```tf.nn.softmax_cross_entropy_with_logits``` on the unnormalized logits (e.g., we call ```softmax_cross_entropy_with_logits on tf.matmul(x, W) + b)```, because this more numerically stable function internally computes the softmax activation.

In [46]:
 cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

TensorFlow knows the entire graph of your computations, it can automatically use the <a href="http://colah.github.io/posts/2015-08-Backprop/">backpropagation algorithm</a> to efficiently determine how your variables affect the loss you ask it to minimize. Then it can apply your choice of optimization algorithm to modify the variables and reduce the loss.

In [47]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

In this case, we ask TensorFlow to minimize ```cross_entropy``` using the **gradient descent algorithm** with a learning rate of 0.5. Gradient descent is a simple procedure, where TensorFlow simply shifts each variable a little bit in the direction that reduces the cost. But TensorFlow also provides many other <a href="https://www.tensorflow.org/api_guides/python/train#Optimizers">optimization algorithms</a>: using one is as simple as tweaking one line. 

What TensorFlow actually does here, behind the scenes, is to add new operations to your graph which implement backpropagation and gradient descent. Then it gives you back a single operation which, when run, does a step of gradient descent training, slightly tweaking your variables to reduce the loss.

Launch the model in an ```InteractiveSession```:

In [48]:
sess = tf.InteractiveSession()

Initialize the variables:

In [49]:
tf.global_variables_initializer().run()

Run the training step 1000 times!

In [50]:
for _ in range(1000):
  batch_xs, batch_ys = mnist.train.next_batch(100)
  sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

Each step of the loop, we get a "batch" of one hundred random data points from our training set. We run ```train_step``` feeding in the batches data to replace the ```placeholder```'s.

## Evaluating Model
```tf.argmax``` gives you the index of the highest entry in a tensor along some axis. Use ```tf.equal``` to check if our prediction matches the truth:

In [51]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))

That gives us a list of booleans. To determine what fraction are correct, we cast to floating point numbers and then take the mean. For example, ```[True, False, True, True]``` would become ```[1,0,1,1]``` which would become $0.75$.

In [52]:
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

Accuracy on test data:

In [53]:
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

0.9075
