# Windows
Install Docker
Download and install Docker from the official Docker website.

Run the Docker Container
Run the command below to start a Jupyter notebook server with TensorFlow:

**docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow**
Users in China should use b.gcr.io/tensorflow/tensorflow instead of gcr.io/tensorflow/tensorflow.

You can access the Jupyter notebook at localhost:8888. The server includes 3 example TensorFlow notebooks,
but you can create a new notebook to test your own code.

In [1]:
import tensorflow as tf

# Create TensorFlow object called tensor
hello_constant = tf.constant('Hello World!')

with tf.Session() as sess:
    # Run the tf.constant operation in the session
    output = sess.run(hello_constant)
    print(output)

Hello World!


# Tensor
In TensorFlow, data isn't stored as integers, floats, or strings. These values are encapsulated in an object called a tensor.

In the case of hello_constant = tf.constant('Hello World!'), hello_constant is a 0-dimensional string tensor.

The tensor returned by tf.constant() is called a constant tensor, because the value of the tensor never changes.

![Flowchart](https://d17h27t6h515a5.cloudfront.net/topher/2016/October/580feadb_session/session.png "TensorFlow Sessions")

In [14]:
x = tf.placeholder(tf.string)  # Placeholder for a string tensor; its value is supplied at run time through feed_dict
y = tf.placeholder(tf.int32)
z = tf.placeholder(tf.float32)

with tf.Session() as sess:
    output = sess.run(z, feed_dict={x: 'Test String', y: 123, z: 45.67})  # Feed values for x, y, and z; only z is evaluated
    print(output)

45.6699981689


In [8]:
import tensorflow as tf
x = tf.add(5, 2)  # 7
y = tf.subtract(10, 4) # 6
z = tf.multiply(2, 5)  # 10
w = tf.subtract(tf.cast(tf.constant(2.0), tf.int32), tf.constant(1))   # 1
## Cast a value to another type. In this case, converting the 2.0 to an integer before subtracting
with tf.Session() as sess:
    output = sess.run(x)
    print(output)

7


In [3]:
import tensorflow as tf

x = tf.constant(10)
y = tf.constant(2)
z = tf.subtract(tf.divide(x,y),tf.cast(tf.constant(1), tf.float64))## Data type is float64 after division

# TODO: Print z from a session
with tf.Session() as sess:
    output = sess.run(z)
    print(output)

4.0


# TensorFlow Linear Function
For example, suppose we want to classify images as digits.

x would be our list of pixel values, and y would be the logits, one for each digit. Let's take a look at y = Wx, where the weights, W, determine the influence of x at predicting each y.

![Classification](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/5839dd1c_wx-1/wx-1.jpg "classification of characters")

In TensorFlow, we actually use y = xW + b: the inputs x are stored with one sample per row, so x multiplies W from the left.

![TensorFlow Forms](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/58353057_codecogseqn-18/codecogseqn-18.gif)
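To make the shapes in y = xW + b concrete, here is a minimal pure-Python sketch. The helper name `linear_rows` and the toy values are assumptions for illustration only; in the notebook itself this computation is done by TensorFlow.

```python
def linear_rows(x, W, b):
    """Compute y = xW + b, where each row of x is one sample.

    x has shape (n_samples, n_features), W has shape (n_features, n_labels),
    and b has length n_labels.
    """
    n_labels = len(b)
    return [
        [sum(row[k] * W[k][j] for k in range(len(row))) + b[j]
         for j in range(n_labels)]
        for row in x
    ]

# Toy values (assumed): 2 samples, 3 features, 2 labels.
x = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 0.0]]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [0.5, -0.5]

y = linear_rows(x, W, b)
print(y)  # [[4.5, 4.5], [0.5, 0.5]]
```

Each output row holds one logit per label, which is why W needs one column per label.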


# Weights and Bias in TensorFlow
The goal of training a neural network is to modify weights and biases to best predict the labels. In order to use weights and bias, you'll need a Tensor that can be modified. This leaves out tf.placeholder() and tf.constant(), since those Tensors can't be modified. This is where **tf.Variable** class comes in.

**x = tf.Variable(5)**

The tf.Variable class creates a tensor with an initial value that can be modified, much like a normal Python variable. This tensor stores its state in the session, so you must initialize the state of the tensor manually. You'll use the tf.global_variables_initializer() function to initialize the state of all the Variable tensors.

**init = tf.global_variables_initializer()**  
**with tf.Session() as sess:**  
**&nbsp;&nbsp;&nbsp;&nbsp;sess.run(init)**
    
The tf.global_variables_initializer() call returns an operation that will initialize all TensorFlow variables from the graph. You call the operation using a session to initialize all the variables as shown above. Using the tf.Variable class allows us to change the weights and bias, but an initial value needs to be chosen.

Initializing the weights with random numbers from a normal distribution is good practice. Randomizing the weights helps keep the model from becoming stuck in the same place every time you train it. You'll learn more about this in the next lesson, when you study gradient descent.

Similarly, choosing weights from a normal distribution prevents any one weight from overwhelming other weights. You'll use the **tf.truncated_normal()** function to generate random numbers from a normal distribution.

**n_features = 120**  
**n_labels = 5**  
**weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))**  

The tf.truncated_normal() function returns a tensor with random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean.
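The idea of a truncated normal can be sketched in plain Python with rejection sampling: draw from a normal distribution and redraw anything farther than 2 standard deviations from the mean. The function below is a hypothetical illustration of that idea, not TensorFlow's implementation.

```python
import random

def truncated_normal_samples(n, mean=0.0, stddev=1.0, rng=random):
    """Draw n normal samples, redrawing any value that falls more than
    2 standard deviations from the mean (simple rejection sampling)."""
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            samples.append(value)
    return samples

random.seed(0)  # fixed seed so the run is repeatable
sample_weights = truncated_normal_samples(1000)
print(max(abs(w) for w in sample_weights) <= 2.0)  # True
```

Because extreme draws are discarded, no single initial weight can start out far larger than the rest.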

Since the weights are already helping prevent the model from getting stuck, you don't need to randomize the bias. Let's use **tf.zeros()** to set the bias to 0.

**n_labels = 5**  
**bias = tf.Variable(tf.zeros(n_labels))**  



# Linear Classifier Quiz
![Subset of MNIST dataset](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/582cf7a7_mnist-012/mnist-012.png)

You'll be classifying the handwritten numbers 0, 1, and 2 from the MNIST dataset using TensorFlow. The above is a small sample of the data you'll be training on. Notice how some of the 1s are written with a serif at the top and at different angles. The similarities and differences will play a part in shaping the weights of the model.

<img src="https://d17h27t6h515a5.cloudfront.net/topher/2016/November/582ce9ef_weights-0-1-2/weights-0-1-2.png" width="500">

The images above are trained weights for each label (0, 1, and 2). The weights display the unique properties of each digit they have found. Complete this quiz to train your own weights using the MNIST dataset.

### Instructions
1. Open quiz.py.
    1. Implement get_weights to return a tf.Variable of weights
    2. Implement get_biases to return a tf.Variable of biases
    3. Implement xW + b in the linear function
2. Open sandbox.py
    1. Initialize all weights

In [9]:
import tensorflow as tf

def get_weights(n_features, n_labels):
    """
    Return TensorFlow weights
    :param n_features: Number of features
    :param n_labels: Number of labels
    :return: TensorFlow weights
    """
    # TODO: Return weights
    return tf.Variable(tf.truncated_normal((n_features, n_labels)))


def get_biases(n_labels):
    """
    Return TensorFlow bias
    :param n_labels: Number of labels
    :return: TensorFlow bias
    """
    # TODO: Return biases
    return tf.Variable(tf.zeros(n_labels))


def linear(input, w, b):
    """
    Return linear function in TensorFlow
    :param input: TensorFlow input
    :param w: TensorFlow weights
    :param b: TensorFlow biases
    :return: TensorFlow linear function
    """
    # TODO: Linear Function (xW + b)
    return tf.add(tf.matmul(input, w), b) ## xW in xW + b is matrix multiplication


from tensorflow.examples.tutorials.mnist import input_data

def mnist_features_labels(n_labels):
    """
    Gets the first <n> labels from the MNIST dataset
    :param n_labels: Number of labels to use
    :return: Tuple of feature list and label list
    """
    mnist_features = []
    mnist_labels = []

    mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):

        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])

    return mnist_features, mnist_labels


# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3

# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)

# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)

# Linear Function xW + b
logits = linear(features, w, b)

# Training data
train_features, train_labels = mnist_features_labels(n_labels)

with tf.Session() as session:
    # TODO: Initialize session variables
    session.run(tf.global_variables_initializer())
    # Softmax
    prediction = tf.nn.softmax(logits)

    # Cross entropy
    # This quantifies how far off the predictions were.
    # You'll learn more about this in future lessons.
    cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)

    # Training loss
    # You'll learn more about this in future lessons.
    loss = tf.reduce_mean(cross_entropy)

    # Rate at which the weights are changed
    # You'll learn more about this in future lessons.
    learning_rate = 0.08

    # Gradient Descent
    # This is the method used to train the model
    # You'll learn more about this in future lessons.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Run optimizer and get loss
    _, l = session.run(
        [optimizer, loss],
        feed_dict={features: train_features, labels: train_labels})

# Print loss
print('Loss: {}'.format(l))


Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /datasets/ud730/mnist/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /datasets/ud730/mnist/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /datasets/ud730/mnist/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /datasets/ud730/mnist/t10k-labels-idx1-ubyte.gz
Loss: 8.50671863556


We combine the two files, sandbox.py and quiz.py (which defines the functions), into the single cell above.  
In the sandbox.py section, the training data is imported from tensorflow.examples.tutorials.mnist.
Further reading about [MNIST](https://www.tensorflow.org/versions/r0.11/tutorials/mnist/beginners/)

# Softmax
The next step is to assign a probability to each label, which you can then use to classify the data. Use the softmax function to turn your logits into probabilities.

$S(y_i)=\dfrac{e^{y_i}}{\sum_j e^{y_j}}$

For the next quiz, you'll implement a softmax(x) function that takes in x, a one or two dimensional array of logits.

In the one dimensional case, the array is just a single set of logits. In the two dimensional case, each column in the array is a set of logits. The softmax(x) function should return a NumPy array of the same shape as x.


In [16]:
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

## In a two-dimensional array, each column represents a set of logits:
logits = np.array([
    [1, 2, 3, 6],
    [2, 4, 5, 6],
    [3, 8, 7, 6]])

print(softmax(logits))

[[ 0.09003057  0.00242826  0.01587624  0.33333333]
 [ 0.24472847  0.01794253  0.11731043  0.33333333]
 [ 0.66524096  0.97962921  0.86681333  0.33333333]]


Now let's see how softmax is done in TensorFlow:

**x = tf.nn.softmax([2.0, 1.0, 0.2])**  

tf.nn.softmax() implements the softmax function for you. It takes in logits and returns softmax activations.

The following cell uses the softmax function to return the softmax of the logits:

In [14]:
import tensorflow as tf

def run():
    output = None
    logit_data = [2.0, 1.0, 0.1]
    logits = tf.placeholder(tf.float32)

    softmax = tf.nn.softmax(logits)

    with tf.Session() as sess:
        output = sess.run(softmax, feed_dict={logits: logit_data})
    return output

When the logits are multiplied by 10, the probabilities get close to 0.0 or 1.0; when the logits are divided by 10, the probabilities get close to the uniform distribution (the classes become less distinguishable). In other words, increasing the magnitude of the outputs makes the classifier very confident about its predictions. For our neural network, we want it not to be too confident at the beginning, and to gain confidence as it learns over time.
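The effect of scaling the logits can be checked with a small standard-library sketch. The helper name `softmax_list` is an assumption for illustration; it applies the same formula as the NumPy version earlier, but to a flat Python list.

```python
import math

def softmax_list(logits):
    """Softmax of a flat list of logits."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

base_logits = [2.0, 1.0, 0.1]
scaled_up = softmax_list([v * 10 for v in base_logits])    # logits multiplied by 10
scaled_down = softmax_list([v / 10 for v in base_logits])  # logits divided by 10

print(softmax_list(base_logits))
print(scaled_up)    # first probability is very close to 1.0
print(scaled_down)  # all three probabilities are close to 1/3
```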

**One-hot encoding** represents the labels mathematically. A predicted output is a vector of size 1 × (number of classes), where each column is the probability of the corresponding class/label. With one-hot encoding, each label is represented by a vector with a single 1 and 0 everywhere else, e.g. [0 0 1 0 0].

When there are very many classes, one-hot encoding becomes inefficient because the vectors are 0 almost everywhere; we will solve this problem with embeddings later.
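As a small sketch of the encoding described above (the function name `one_hot` is an assumption for illustration):

```python
def one_hot(label_index, n_classes):
    """Return a list with a 1 at label_index and 0 everywhere else."""
    encoded = [0] * n_classes
    encoded[label_index] = 1
    return encoded

print(one_hot(2, 5))  # [0, 0, 1, 0, 0]
```

With thousands of classes, each such vector would hold a single 1 among thousands of zeros, which is the inefficiency mentioned above.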

Now we can measure how well we are predicting by simply comparing the vector of class probabilities that comes out of the classifier with the one-hot encoded vector that corresponds to the label.

# Cross Entropy


$D(S,L)=-\sum_i L_i \log(S_i)$,  S for the output prediction and L for the Label.

Remember to put the labels and the distributions in the right places, because the function is non-commutative: $D(S,L)\neq D(L,S)$.

<img src="https://s29.postimg.org/b6oxz186v/flowchart.png" width="500">

The setup above is usually called **multinomial logistic classification**.

### Reduce Sum

**x = tf.reduce_sum([1, 2, 3, 4, 5])  # 15**

The tf.reduce_sum() function takes an array of numbers and sums them together.

### Natural Log

**x = tf.log(100.0)  # 4.60517**

This function does exactly what you would expect it to do. tf.log() takes the natural log of a number.

We implement these two functions to calculate the cross entropy as illustrated below:

In [1]:
import tensorflow as tf

softmax_data = [0.7, 0.2, 0.1]
one_hot_data = [1.0, 0.0, 0.0]

softmax = tf.placeholder(tf.float32)
one_hot = tf.placeholder(tf.float32)

# TODO: Print cross entropy from session
cross_entropy = -tf.reduce_sum(tf.multiply(one_hot,tf.log(softmax)))
with tf.Session() as sess:
    output = sess.run(cross_entropy,feed_dict={softmax: softmax_data, one_hot: one_hot_data})
    print(output)


0.356675


We are calculating D(S(wx + b), L) for each input, and we want the cross entropy D to give a low distance for the correct class, D(A, a), and a high distance for an incorrect class, D(A, b).

### Training Loss

$L = \dfrac{1}{N}\sum D(S(wx_i+b),L_i)$ 

This measures the distance averaged over the entire training set, for all the inputs and all the labels available.

We want this loss, which characterizes how well we are classifying every example in the training data, to be small. Since it is a function of the weights and biases, we can use gradient descent to minimize it numerically.
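The training loss is just the per-example cross entropy averaged over the dataset. Here is a minimal sketch, assuming the softmax outputs are already computed; the function names and the two toy examples are hypothetical.

```python
import math

def cross_entropy_single(probs, one_hot_label):
    """D(S, L) = -sum_i L_i * log(S_i) for one example.

    Terms with L_i == 0 are skipped, since they contribute nothing.
    """
    return -sum(l * math.log(s) for s, l in zip(probs, one_hot_label) if l > 0)

def training_loss(all_probs, all_labels):
    """Average the cross entropy over every example."""
    losses = [cross_entropy_single(s, l) for s, l in zip(all_probs, all_labels)]
    return sum(losses) / len(losses)

# Toy softmax outputs and one-hot labels (assumed values).
toy_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
toy_labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(cross_entropy_single(toy_probs[0], toy_labels[0]))  # about 0.356675
print(training_loss(toy_probs, toy_labels))               # about 0.289909
```

The first value agrees with the TensorFlow cross-entropy cell earlier, which used the same softmax and one-hot data.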

Next we will discuss the tools that compute these derivatives, and the strengths and weaknesses of gradient descent. Before training our own model, there are still two major problems to solve:
1. How to feed image pixels to the classifier
2. Where to initialize the optimization

# Numerical Stability
We now add $10^{-6}$ to $10^9$ a total of $10^6$ times, then subtract $10^9$ again. Mathematically, the result should be exactly 1:

In [4]:
a = 1000000000
for i in range(1000000):
    a = a + 1e-6
print(a - 1000000000)

0.953674316406


When we calculate with values that are too large or too small, and especially when we add very small values to a very large value, a lot of error can be introduced: the exact answer here is 1.0, but we got about 0.954. If we replace the 1,000,000,000 with 1 in the example above, the error becomes tiny.
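The comparison described above can be run directly. This sketch wraps the loop from the cell in a hypothetical helper, `accumulated_sum`, and runs it for both starting values.

```python
def accumulated_sum(start, step=1e-6, n=1000000):
    """Add step to start n times, then subtract start again.

    With exact arithmetic the result would be n * step = 1.0.
    """
    a = start
    for _ in range(n):
        a = a + step
    return a - start

from_large = accumulated_sum(1000000000)
from_small = accumulated_sum(1)

print(from_large)  # 0.95367431640625
print(from_small)  # very close to 1.0
```

Near $10^9$, neighbouring double-precision values are about $1.2 \times 10^{-7}$ apart, so each added $10^{-6}$ is rounded to the nearest representable step and the rounding error accumulates.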

For this reason, we want the values involved in our training loss never to get too big or too small. In practice, a good guiding principle is that we always want our variables to have zero mean and equal variance whenever possible. (Imagine an oval far from the origin and a circle centered right at the origin. Our classifier doesn't need to do a lot of searching to find a good solution when the circle is around the origin.)

To deal with images, each pixel's RGB values, which lie in [0, 255], can be processed in the following way to make the optimization much easier numerically:

$Pixel = [\dfrac{R-128}{128}, \dfrac{G-128}{128}, \dfrac{B-128}{128}]$

In this way, the training data is normalized to have roughly zero mean and the same range, about [-1, 1], in every channel.
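As a quick check of the mapping, here is a sketch of the normalization applied to one pixel (the function name `normalize_pixel` is an assumption for illustration):

```python
def normalize_pixel(rgb):
    """Map each channel from [0, 255] to roughly [-1, 1] with (c - 128) / 128."""
    return [(channel - 128) / 128 for channel in rgb]

print(normalize_pixel([0, 128, 255]))  # [-1.0, 0.0, 0.9921875]
```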

To initialize the weights and biases at a good starting point for gradient descent, we can draw the weights randomly from a Gaussian distribution with zero mean and standard deviation $\sigma$. The value of $\sigma$ determines the order of magnitude of the outputs at the initial point of the optimization.

As illustrated previously in the softmax section, because the softmax sits on top of the linear function, i.e. S(wx+b), this order of magnitude also determines the peakedness of the initial probability distribution. A large $\sigma$ means large peaks (probabilities that differ sharply between outputs), so the classifier would be very opinionated, which is not what we want at the beginning. As we said, we want it not to be too confident at first and to gain confidence as it learns over time, so we should start with a small $\sigma$.

With everything we have so far, our packages will, at last, calculate the following updates for us:
1. $w \leftarrow w-\alpha \nabla_w L$
2. $b \leftarrow b-\alpha \nabla_b L$

That is, they compute the derivative of the loss with respect to the weights and biases and take a step in the direction opposite to that derivative.
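The two update rules can be sketched on a toy problem. The loss below, $L(w, b) = (w - 3)^2 + (b + 1)^2$, is an assumption chosen only because its gradient is easy to write by hand; it is not the cross-entropy loss of the classifier.

```python
def gradient_descent(w, b, alpha, steps):
    """Minimize the toy loss L(w, b) = (w - 3)^2 + (b + 1)^2."""
    for _ in range(steps):
        grad_w = 2.0 * (w - 3.0)  # dL/dw
        grad_b = 2.0 * (b + 1.0)  # dL/db
        w = w - alpha * grad_w    # step against the gradient
        b = b - alpha * grad_b
    return w, b

w_final, b_final = gradient_descent(w=0.0, b=0.0, alpha=0.1, steps=100)
print(w_final, b_final)  # close to 3.0 and -1.0
```

The step size alpha plays the role of the learning rate: each iteration moves the parameters a fraction of the way toward the minimum.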

Another important property of our classifiers is that they tend to perform well on images in the training set but fail to do as well on new images. This is because the classifier has memorized the training set and fails to generalize to new examples. In fact, every classifier that you build will tend to memorize the training set, but we will help it generalize to new data instead. So, how do we measure how well the classifier generalizes rather than simply memorizing data?

One way to measure classifier performance is to hide a portion of the training data and use it only for evaluation after the classifier is properly tuned. However, training a classifier is a process of trial and error. When we choose which classifier to use, or adjust parameters based on test-set performance, we are making modifications based on what we observed on the test set. Each such decision leaks only a little information into the classifier, but it adds up, and eventually the test data bleeds into the training data.

# Cross Validation

Here is some information about cross-validation that I found online:

[Why every statistician should know about cross-validation](http://robjhyndman.com/hyndsight/crossvalidation/)

[TAMU Lecture: Validation](http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf)

[CMU Cross Validation](https://www.cs.cmu.edu/~schneide/tut5/node42.html)

[3.1. Cross-validation: evaluating estimator performance](http://scikit-learn.org/stable/modules/cross_validation.html)

We have to be very careful about [overfitting](http://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/) on the test set, so tune on a validation set instead.
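One simple way to set this up is a three-way split: train on one part, tune on the validation part, and touch the test part only once at the end. The sketch below is a minimal illustration; the function name, the 10% fractions, and the fixed seed are all assumptions.

```python
import random

def split_dataset(samples, valid_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the samples, then cut them into train, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    n_test = int(len(shuffled) * test_fraction)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test

train_set, valid_set, test_set = split_dataset(range(1000))
print(len(train_set), len(valid_set), len(test_set))  # 800 100 100
```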

When the validation set is small, the accuracy might change when we tweak the model, but this could just be noise. A useful rule of thumb is that a change that affects 30 examples in your validation set, one way or another, is usually statistically significant and can typically be trusted.

Since 30 examples out of 3,000 is a 1% change, a 3,000-example validation set can only resolve changes of about 1%. This is why people tend to hold back more than 30,000 examples for validation: 30 examples out of 30,000 is 0.1%, which makes the accuracy figures significant to the first decimal place and gives enough resolution to see small improvements. However, if the classes are not well balanced and some classes are very rare compared to others, this heuristic no longer holds and more data is needed. Holding back more than 30,000 examples is already a lot, especially for small training sets. In this case, cross-validation is one possible way to mitigate the issue, but it can take more time.
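The arithmetic behind the rule of thumb is short enough to write out. The helper name below is hypothetical; it simply converts "30 examples" into percentage points of accuracy for a given validation set size.

```python
def smallest_trusted_change(n_validation, min_examples=30):
    """Smallest accuracy change, in percentage points, that moves
    at least min_examples validation examples."""
    return 100.0 * min_examples / n_validation

print(smallest_trusted_change(3000))   # 1.0
print(smallest_trusted_change(30000))  # 0.1
```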

Training logistic regression with gradient descent is a good approach because we directly optimize the error measure that we care about, which is why in practice a lot of ML research is about designing the right loss function to optimize. However, the biggest problem is that it is very difficult to scale. When a larger dataset is fed in, the computation becomes enormous, and that is why we use stochastic gradient descent.

# Stochastic Gradient Descent

If calculating the loss function takes n floating point operations, computing its gradient takes about three times that much compute. Our loss function is already huge and depends on every element in the training set, so we are going to cheat.

We will compute an estimate of the loss instead of the loss itself. The estimate is simply the average loss over a very small random fraction of the training data, between 1 and 1000 training samples each time. (In fact, this estimate is so poor that we need extra measures to make it usable.) Stochastic gradient descent (SGD) is at the heart of deep learning because it scales efficiently with both data and model size, which is exactly what big data and big models require. Still, we need to solve several issues that come with this rough estimate.

Some measures we use to help SGD are listed below:
1. Inputs
    1. zero mean
    2. equal and small variance
2. Initial weights
    1. random
    2. zero mean
    3. equal and small variance

## Momentum

Each random step is noisy, but in aggregate those steps take us toward the minimum of the loss. Momentum exploits this accumulated knowledge: we keep a running average of the gradients and step along that average instead of along the gradient of the current batch.

1. For plain gradient descent, the gradient is $\nabla L(w_i)$ and the step is $-\alpha\nabla L(w_i)$.
2. For the momentum technique, the running average is $M \leftarrow 0.9M + \nabla L(w_i)$ and the step is $-\alpha M$.

This momentum technique works well and often leads to better convergence.
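The momentum update can be sketched in a few lines. The toy loss $L(w) = (w - 3)^2$ and the step size are assumptions for illustration; only the 0.9 factor comes from the description above.

```python
def momentum_step(w, m, grad, alpha=0.1, beta=0.9):
    """One update: M <- beta * M + grad, then w <- w - alpha * M."""
    m = beta * m + grad
    w = w - alpha * m
    return w, m

# Toy loss (assumed): L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_mom, m_mom = 0.0, 0.0
for _ in range(500):
    grad = 2.0 * (w_mom - 3.0)
    w_mom, m_mom = momentum_step(w_mom, m_mom, grad)

print(w_mom)  # close to 3.0
```

Because M carries over from step to step, a single noisy gradient changes the direction only slightly, which is the smoothing effect momentum is used for.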

## Learning Rate Decay

Because we are using SGD, we take smaller and noisier steps toward our objective. Although much of this is still an open research question, making the steps smaller and smaller as we train is usually beneficial. (There are many ways to decay the learning rate: for example, we can lower it when the loss reaches a plateau, or simply apply an exponential decay function.)

It is tempting to believe that larger learning rates lead to faster learning; however, that is not true. A high learning rate starts learning faster but then plateaus, while a lower learning rate keeps going and gets better.
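One of the decay schedules mentioned above, exponential decay, can be sketched as follows. The function and parameter names are assumptions for illustration; the rate is multiplied by `decay_rate` once every `decay_steps` training steps.

```python
def exponential_decay(initial_rate, decay_rate, step, decay_steps):
    """Learning rate after `step` training steps of exponential decay."""
    return initial_rate * decay_rate ** (step / decay_steps)

print(exponential_decay(0.1, 0.5, 0, 1000))     # 0.1
print(exponential_decay(0.1, 0.5, 1000, 1000))  # 0.05
print(exponential_decay(0.1, 0.5, 2000, 1000))  # about 0.025
```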

So for SGD, we have many hyperparameters to play with:
1. initial learning rate
2. learning rate decay
3. momentum
4. batch size
5. weight initialization

We have to get these hyperparameters right for the neural network to learn, but when things don't work, try lowering the learning rate first.

There are related methods such as **AdaGrad**, a modification of SGD that implicitly does momentum and learning rate decay. This makes tuning hyperparameters easier, but the performance tends to be a little worse than precisely tuned SGD with momentum. Finally, remember that we are still dealing with a linear model; no deep learning has come into play yet.

# Mini-batch

Mini-batching is a technique for training on subsets of the dataset instead of all the data at one time. This provides the ability to train a model, even if a computer lacks the memory to store the entire dataset.

Mini-batching is computationally inefficient, since you can't calculate the loss simultaneously across all samples. However, this is a small price to pay in order to be able to run the model at all.

It's also quite useful combined with SGD. The idea is to randomly shuffle the data at the start of each epoch, then create the mini-batches. For each mini-batch, you train the network weights with gradient descent. Since these batches are random, you're performing SGD with each batch.
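The shuffle-then-batch idea can be sketched in plain Python. The generator name `shuffled_batches` and the toy data are assumptions for illustration; features and labels are shuffled together so that each feature row keeps its label.

```python
import random

def shuffled_batches(batch_size, features, labels, rng):
    """Yield (features, labels) mini-batches in a fresh random order."""
    assert len(features) == len(labels)
    order = list(range(len(features)))
    rng.shuffle(order)  # new random order on every call, i.e. every epoch
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield [features[i] for i in idx], [labels[i] for i in idx]

rng = random.Random(0)
toy_features = [[i, i] for i in range(10)]
toy_labels = list(range(10))

for epoch in range(2):
    sizes = [len(f) for f, _ in shuffled_batches(4, toy_features, toy_labels, rng)]
    print(epoch, sizes)  # batch sizes are [4, 4, 2] in each epoch
```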

Let's look at the MNIST dataset with weights and a bias:

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np  # needed for np.float32 below

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

In this sample, train_features (Shape: (55000, 784), Type: float32) takes up the most memory: 172.48 MB. However, the larger datasets that we'll use in the future are measured in gigabytes or more.
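The memory figure above follows from the array shape and the 4 bytes used by each float32 value. As a quick check (the variable names are assumptions for illustration):

```python
n_samples = 55000
n_pixels = 784         # features per sample
bytes_per_float32 = 4

size_mb = n_samples * n_pixels * bytes_per_float32 / 1e6
print(size_mb)  # 172.48
```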

In order to use mini-batching, you must first divide your data into batches.

Unfortunately, it's sometimes impossible to divide the data into batches of exactly equal size. For example, imagine you'd like to create batches of 128 samples each from a dataset of 1000 samples. Since 128 does not evenly divide into 1000, you'd wind up with 7 batches of 128 samples, and 1 batch of 104 samples. 

In that case, the size of the batches would vary, so you need to take advantage of TensorFlow's tf.placeholder() function to receive the varying batch sizes.

Continuing the example, if each sample had n_input = 784 features and n_classes = 10 possible labels, the dimensions for features would be [None, n_input] and labels would be [None, n_classes].


In [None]:
# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

What does None do here?

The None dimension is a placeholder for the batch size. At runtime, TensorFlow will accept any batch size greater than 0.

Going back to our earlier example, this setup allows you to feed features and labels into the model as either the batches of 128 samples or the single batch of 104 samples.
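The batch arithmetic from the 1000-sample example can be verified with a short sketch (the helper name `batch_sizes` is an assumption for illustration):

```python
import math

def batch_sizes(n_samples, batch_size):
    """Sizes of the consecutive batches when the last one may be smaller."""
    n_batches = math.ceil(n_samples / batch_size)
    return [min(batch_size, n_samples - i * batch_size) for i in range(n_batches)]

sizes_1000 = batch_sizes(1000, 128)
print(len(sizes_1000), sizes_1000[-1])  # 8 104
```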

Now let's see how to implement mini-batching. Implement the **batches function** to batch features and labels. The function should return each batch with a maximum size of batch_size.

In [1]:
import math
from pprint import pprint
def batches(batch_size, features, labels):
    """
    Create batches of features and labels
    :param batch_size: The batch size
    :param features: List of features
    :param labels: List of labels
    :return: Batches of (Features, Labels)
    """
    assert len(features) == len(labels)
    # TODO: Implement batching
    output_batches = []
    
    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)
        
    return output_batches
# 4 Samples of features
example_features = [
    ['F11','F12','F13','F14'],
    ['F21','F22','F23','F24'],
    ['F31','F32','F33','F34'],
    ['F41','F42','F43','F44']]
# 4 Samples of labels
example_labels = [
    ['L11','L12'],
    ['L21','L22'],
    ['L31','L32'],
    ['L41','L42']]

# PPrint prints data structures like 2d arrays, so they are easier to read
pprint(batches(3, example_features, example_labels))

[[[['F11', 'F12', 'F13', 'F14'],
   ['F21', 'F22', 'F23', 'F24'],
   ['F31', 'F32', 'F33', 'F34']],
  [['L11', 'L12'], ['L21', 'L22'], ['L31', 'L32']]],
 [[['F41', 'F42', 'F43', 'F44']], [['L41', 'L42']]]]


Let's then use mini-batching to feed batches of MNIST features and labels into a linear model. Set the batch size and run the optimizer over all the batches with the batches function. The recommended batch size is 128.

In [2]:
import math
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np

def batches(batch_size, features, labels):
    assert len(features) == len(labels)
    output_batches = []
    
    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)
        
    return output_batches

learning_rate = 0.001
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))


# TODO: Set batch size
batch_size = 128
assert batch_size is not None, 'You must set the batch size'

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
    # TODO: Train optimizer on all batches
    for batch_features, batch_labels in batches(batch_size, train_features, train_labels):
        sess.run(optimizer, feed_dict={features: batch_features, labels: batch_labels})

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /datasets/ud730/mnist/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /datasets/ud730/mnist/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /datasets/ud730/mnist/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /datasets/ud730/mnist/t10k-labels-idx1-ubyte.gz
Test Accuracy: 0.0854000002146


The accuracy is low, but you can train on the same dataset more than once to improve it.
# Epochs
An epoch is a single forward and backward pass of the whole dataset. This is used to increase the accuracy of the model without requiring more data. This section will cover epochs in TensorFlow and how to choose the right number of epochs.

The following TensorFlow code trains a model for the number of epochs set by epochs (100 in this cell).

In [4]:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
import math

def batches(batch_size, features, labels):
    assert len(features) == len(labels)
    output_batches = []
    
    sample_size = len(features)
    for start_i in range(0, sample_size, batch_size):
        end_i = start_i + batch_size
        batch = [features[start_i:end_i], labels[start_i:end_i]]
        output_batches.append(batch)
        
    return output_batches


def print_epoch_stats(epoch_i, sess, last_features, last_labels):
    """
    Print cost and validation accuracy of an epoch
    """
    current_cost = sess.run(
        cost,
        feed_dict={features: last_features, labels: last_labels})
    valid_accuracy = sess.run(
        accuracy,
        feed_dict={features: valid_features, labels: valid_labels})
    print('Epoch: {:<4} - Cost: {:<8.3} Valid Accuracy: {:<5.3}'.format(
        epoch_i,
        current_cost,
        valid_accuracy))

n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)

# Import MNIST data
mnist = input_data.read_data_sets('/datasets/ud730/mnist', one_hot=True)

# The features are already scaled and the data is shuffled
train_features = mnist.train.images
valid_features = mnist.validation.images
test_features = mnist.test.images

train_labels = mnist.train.labels.astype(np.float32)
valid_labels = mnist.validation.labels.astype(np.float32)
test_labels = mnist.test.labels.astype(np.float32)

# Features and Labels
features = tf.placeholder(tf.float32, [None, n_input])
labels = tf.placeholder(tf.float32, [None, n_classes])

# Weights & bias
weights = tf.Variable(tf.random_normal([n_input, n_classes]))
bias = tf.Variable(tf.random_normal([n_classes]))

# Logits - xW + b
logits = tf.add(tf.matmul(features, weights), bias)

# Define loss and optimizer
learning_rate = tf.placeholder(tf.float32)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

# Calculate accuracy
correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()

batch_size = 128
epochs = 100
learn_rate = 0.001

train_batches = batches(batch_size, train_features, train_labels)

with tf.Session() as sess:
    sess.run(init)

    # Training cycle
    for epoch_i in range(epochs):

        # Loop over all batches
        for batch_features, batch_labels in train_batches:
            train_feed_dict = {
                features: batch_features,
                labels: batch_labels,
                learning_rate: learn_rate}
            sess.run(optimizer, feed_dict=train_feed_dict)

        # Print cost and validation accuracy of an epoch
        print_epoch_stats(epoch_i, sess, batch_features, batch_labels)

    # Calculate accuracy for test dataset
    test_accuracy = sess.run(
        accuracy,
        feed_dict={features: test_features, labels: test_labels})

print('Test Accuracy: {}'.format(test_accuracy))

Extracting /datasets/ud730/mnist/train-images-idx3-ubyte.gz
Extracting /datasets/ud730/mnist/train-labels-idx1-ubyte.gz
Extracting /datasets/ud730/mnist/t10k-images-idx3-ubyte.gz
Extracting /datasets/ud730/mnist/t10k-labels-idx1-ubyte.gz
Epoch: 0    - Cost: 11.0     Valid Accuracy: 0.121
Epoch: 1    - Cost: 10.2     Valid Accuracy: 0.129
Epoch: 2    - Cost: 9.57     Valid Accuracy: 0.14 
Epoch: 3    - Cost: 9.03     Valid Accuracy: 0.152
Epoch: 4    - Cost: 8.56     Valid Accuracy: 0.169
Epoch: 5    - Cost: 8.13     Valid Accuracy: 0.185
Epoch: 6    - Cost: 7.75     Valid Accuracy: 0.207
Epoch: 7    - Cost: 7.39     Valid Accuracy: 0.227
Epoch: 8    - Cost: 7.07     Valid Accuracy: 0.244
Epoch: 9    - Cost: 6.76     Valid Accuracy: 0.26 
Epoch: 10   - Cost: 6.48     Valid Accuracy: 0.279
Epoch: 11   - Cost: 6.22     Valid Accuracy: 0.297
Epoch: 12   - Cost: 5.98     Valid Accuracy: 0.316
Epoch: 13   - Cost: 5.75     Valid Accuracy: 0.333
Epoch: 14   - Cost: 5.53     Valid Accuracy: 0.3

The accuracy only reached 0.73, but that could be because the learning rate was too high. Lowering the learning rate would require more epochs, but could ultimately achieve better accuracy. In the upcoming TensorFlow Lab, we will choose our own learning rate, epoch count, and batch size to improve the model's accuracy.

# Deep Neural Network