Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

First reload the data we generated in _notmnist.ipynb_.

In [2]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

('Training set', (200000, 28, 28), (200000,))
('Validation set', (10000, 28, 28), (10000,))
('Test set', (18724, 28, 28), (18724,))


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [3]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

('Training set', (200000, 784), (200000, 10))
('Validation set', (10000, 784), (10000, 10))
('Test set', (18724, 784), (18724, 10))
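
To make the one-hot step in `reformat` concrete, here is a small sketch of the same broadcasting comparison applied to three made-up labels. The `toy_labels` and `toy_one_hot` names are illustrative only and are not part of the notebook.

```python
import numpy as np

num_labels = 10

# Three hypothetical integer labels (illustrative values only).
toy_labels = np.array([2, 0, 9])

# toy_labels[:, None] has shape (3, 1) and np.arange(num_labels) has shape (10,).
# Broadcasting compares every label against every class index, giving a
# (3, 10) boolean matrix that is True exactly where column index == label.
toy_one_hot = (np.arange(num_labels) == toy_labels[:, None]).astype(np.float32)

print(toy_one_hot.shape)                # (3, 10)
print(np.argmax(toy_one_hot, axis=1))   # [2 0 9]
```

Taking `argmax` along axis 1 recovers the integer labels, which is what the `accuracy` helper relies on.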


In [4]:
def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

---
Problem 1
---------

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.

---
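
As a rough illustration of the penalty, `tf.nn.l2_loss(t)` computes `sum(t ** 2) / 2`. The NumPy sketch below mirrors that computation and adds it to a loss. The weight matrix, the `data_loss` value, and the name `beta` are hypothetical and chosen only for the example.

```python
import numpy as np

def l2_loss(t):
    """NumPy equivalent of tf.nn.l2_loss: half the sum of squared entries."""
    return np.sum(np.square(t)) / 2.0

# Hypothetical 2x2 weight matrix and data loss, for illustration only.
W = np.array([[1.0, -2.0],
              [0.5,  0.0]])
data_loss = 0.30   # stands in for the mean cross-entropy
beta = 5e-4        # regularization strength (the hyperparameter to tune)

penalty = l2_loss(W)                    # (1 + 4 + 0.25 + 0) / 2 = 2.625
total_loss = data_loss + beta * penalty

# The gradient of the penalty with respect to W is W itself, so each
# gradient step shrinks the weights toward zero by an extra beta * W.
penalty_grad = W

print(penalty)
print(total_loss)
```

Because the penalty gradient is proportional to the weights, larger values of `beta` pull the weights toward zero more strongly, which is why the constant needs tuning.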

Introduce L2 regularization for the multinomial logistic regression model in TensorFlow.

In [5]:
batch_size = 128

# define the input variable
# the input variable will receive the image's pixels for every batch
tf_train_dataset = tf.placeholder(tf.float32, [None, image_size * image_size]) # batch_sizex784 matrix 
tf_train_labels = tf.placeholder(tf.float32, shape=(None, num_labels))

tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)

# define the weights' parameters
W = tf.Variable(tf.truncated_normal([image_size*image_size, num_labels])) # 784x10 Matrix 

# define the biases
b = tf.Variable(tf.zeros([num_labels]))

logits = tf.matmul(tf_train_dataset, W) + b
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

regularizers = tf.nn.l2_loss(W)
loss += 5e-4 * regularizers 

# optimize the loss function using gradient descent
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

train_predictions = tf.nn.softmax(logits)

# run the validation dataset using the trained network (weights and biases)
valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, W) + b)

# run the test dataset through the trained network (weights and biases)
test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, W) + b)


Let's now train the model

- Note that the validation dataset is evaluated periodically during training (every 500 steps in the loop below). As the minibatches update the parameters, the validation set is run through the graph using the most recent weights and biases.

- At the end, the test dataset is run through the network to evaluate the final model, since the weights and biases have finished training.

- In Assignment 2, the same multinomial logistic regression trained with SGD (Stochastic Gradient Descent) reached an accuracy of 85.6%. With L2 regularization the model reaches about 88.1%, an improvement of roughly 2.5 percentage points.

In [6]:
# properly initialize the tensorflow variables
init = tf.initialize_all_variables()

# create a session and run the operation that initializes the variables
sess = tf.Session()
sess.run(init)

num_steps = 3001

for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)

    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    
    opt, l, predictions = sess.run(
      [optimizer, loss, train_predictions], feed_dict=feed_dict)
    
    if (step % 500 == 0):
        print("Minibatch loss at step %d: %f" % (step, l))
        print("Minibatch accuracy %.1f%%" % accuracy(predictions, batch_labels))
        print("Validation accuracy %.1f%%" % accuracy(valid_prediction.eval(session=sess), valid_labels))
        
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(session=sess), test_labels))

Minibatch loss at step 0: 17.299664
Minibatch accuracy 6.2%
Validation accuracy 10.1%
Minibatch loss at step 500: 2.236001
Minibatch accuracy 76.6%
Validation accuracy 75.1%
Minibatch loss at step 1000: 1.811796
Minibatch accuracy 75.8%
Validation accuracy 76.8%
Minibatch loss at step 1500: 1.225581
Minibatch accuracy 77.3%
Validation accuracy 78.8%
Minibatch loss at step 2000: 1.408456
Minibatch accuracy 78.9%
Validation accuracy 79.1%
Minibatch loss at step 2500: 0.966792
Minibatch accuracy 81.2%
Validation accuracy 80.4%
Minibatch loss at step 3000: 1.071054
Minibatch accuracy 75.8%
Validation accuracy 80.4%
Test accuracy: 88.1%
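
The offset arithmetic in the training loop can be checked in isolation. The sketch below reuses the same formula with the training set size printed earlier. The helper name `batch_offset` is introduced here for illustration.

```python
num_examples = 200000   # training set size printed above
batch_size = 128

def batch_offset(step):
    # Same formula as in the training loop: the modulus keeps
    # offset + batch_size within the bounds of the training set.
    return (step * batch_size) % (num_examples - batch_size)

# Print the step, the slice start, and the slice end for a few steps.
for step in (0, 1, 1561, 1562, 3000):
    start = batch_offset(step)
    print(step, start, start + batch_size)
```

With these sizes the offset wraps around right after step 1561, so the 3001 steps make a little under two passes over the training data.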


Now, let's introduce L2 regularization to a 2-layer neural network.

In [11]:
batch_size = 128

# define the input variable
# the input variable will receive the image's pixels for every batch
tf_train_dataset = tf.placeholder(tf.float32, [None, image_size * image_size]) # [128 x 784] matrix 
tf_train_labels = tf.placeholder(tf.float32, shape=(None, num_labels))

# load the valid and test datasets
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)

# define the hidden layer size
hidden_layer_size = 1024

# define the weights (parameters) of the first layer
W_layer1 = tf.Variable(tf.truncated_normal([image_size*image_size, hidden_layer_size])) # [784x1024] Matrix 

# define the biases for the first layer
b_layer1 = tf.Variable(tf.zeros([hidden_layer_size]))

# training computation
hidden_layer = tf.matmul(tf_train_dataset, W_layer1) + b_layer1 # [128 x 1024] Matrix

# apply the ReLU (Rectified Linear Unit) function
relu_hidden_layer = tf.nn.relu(hidden_layer)

# define the parameters for the second layer
W_layer2 = tf.Variable(tf.truncated_normal([hidden_layer_size, num_labels])) # [1024 x 10] Matrix

# define the biases for the second layer 
b_layer2 = tf.Variable(tf.zeros([num_labels]))

# training computation
logits = tf.matmul(relu_hidden_layer, W_layer2) + b_layer2 # [128 x 10]

# calculate cross entropy
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

# apply L2 regularization of the trained weights
regularizers = tf.nn.l2_loss(W_layer1) + tf.nn.l2_loss(W_layer2)
loss += 1e-4 * regularizers 

# optimize the loss function using gradient descent
optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

train_predictions = tf.nn.softmax(logits)

# run the validation dataset using the trained network (weights and biases)
valid_prediction_hidden_layer = tf.nn.relu(tf.matmul(tf_valid_dataset, W_layer1) + b_layer1)
valid_prediction = tf.nn.softmax(tf.matmul(valid_prediction_hidden_layer, W_layer2) + b_layer2)

# run the test dataset through the trained network (weights and biases)
test_prediction_hidden_layer = tf.nn.relu(tf.matmul(tf_test_dataset, W_layer1) + b_layer1)
test_prediction = tf.nn.softmax(tf.matmul(test_prediction_hidden_layer, W_layer2) + b_layer2)

Let's now train the network.
- In Assignment 2, the same 2-layer neural network reached an accuracy of roughly 89.6%. With L2 regularization and a constant of 4e-4, the final test accuracy rises to 89.9%, an improvement of about 0.3 percentage points.

- I noticed that different values of the L2 constant give different accuracy on the test set. Below are the final test accuracies recorded for several L2 constants.

- 5e-4 --> 89.3%
- 4e-4 --> 89.9%
- 3e-4 --> 89.0%
- 2e-4 --> 89.5%
- 1e-4 --> 89.7%

With one more hyperparameter to tune, the number of possible combinations to search for an optimal solution grows.
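
As a small sketch of how such a sweep can be summarized, the snippet below stores the accuracies listed above in a dictionary and picks the best constant. The numbers are the test accuracies recorded in this notebook. In practice one would usually select the constant on validation accuracy and report test accuracy only once, so treat this as an illustration of the bookkeeping rather than a recommended protocol.

```python
# Test accuracies recorded above for each L2 constant (values from the text).
recorded = {5e-4: 89.3, 4e-4: 89.9, 3e-4: 89.0, 2e-4: 89.5, 1e-4: 89.7}

# Pick the constant with the highest recorded accuracy.
best_beta = max(recorded, key=recorded.get)

# Difference between the best and worst result in the sweep.
spread = max(recorded.values()) - min(recorded.values())

print("best L2 constant: %g (%.1f%%)" % (best_beta, recorded[best_beta]))
print("spread across the sweep: %.1f points" % spread)
```

The spread is under one percentage point, so differences between neighbouring constants may be within run-to-run noise.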

In [12]:
# properly initialize the tensorflow variables
init = tf.initialize_all_variables()

# create a session and run the operation that initializes the variables
sess = tf.Session()
sess.run(init)

num_steps = 3001

for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)

    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    
    opt, l, predictions = sess.run(
      [optimizer, loss, train_predictions], feed_dict=feed_dict)
    
    if (step % 500 == 0):
        print("Minibatch loss at step %d: %f" % (step, l))
        print("Minibatch accuracy %.1f%%" % accuracy(predictions, batch_labels))
        print("Validation accuracy %.1f%%" % accuracy(valid_prediction.eval(session=sess), valid_labels))
        
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(session=sess), test_labels))

Minibatch loss at step 0: 399.356720
Minibatch accuracy 6.2%
Validation accuracy 15.7%
Minibatch loss at step 500: 51.906380
Minibatch accuracy 79.7%
Validation accuracy 76.9%
Minibatch loss at step 1000: 42.132526
Minibatch accuracy 86.7%
Validation accuracy 79.9%
Minibatch loss at step 1500: 43.823505
Minibatch accuracy 85.2%
Validation accuracy 81.0%
Minibatch loss at step 2000: 48.532589
Minibatch accuracy 78.9%
Validation accuracy 82.1%
Minibatch loss at step 2500: 38.938141
Minibatch accuracy 82.0%
Validation accuracy 81.2%
Minibatch loss at step 3000: 40.805653
Minibatch accuracy 79.7%
Validation accuracy 82.7%
Test accuracy: 89.7%


---
Problem 2
---------
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

---

In [13]:
batch_size = 4*128

# define the input variable
# the input variable will receive the image's pixels for every batch
tf_train_dataset = tf.placeholder(tf.float32, [None, image_size * image_size]) # [128 x 784] matrix 
tf_train_labels = tf.placeholder(tf.float32, shape=(None, num_labels))

# load the valid and test datasets
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)

# define the hidden layer size
hidden_layer_size = 1024

# define the weights (parameters) of the first layer
W_layer1 = tf.Variable(tf.truncated_normal([image_size*image_size, hidden_layer_size])) # [784x1024] Matrix 

# define the biases for the first layer
b_layer1 = tf.Variable(tf.zeros([hidden_layer_size]))

# training computation
hidden_layer = tf.matmul(tf_train_dataset, W_layer1) + b_layer1 # [128 x 1024] Matrix

# apply the ReLU (Rectified Linear Unit) function
relu_hidden_layer = tf.nn.relu(hidden_layer)

# define the parameters for the second layer
W_layer2 = tf.Variable(tf.truncated_normal([hidden_layer_size, num_labels])) # [1024 x 10] Matrix

# define the biases for the second layer 
b_layer2 = tf.Variable(tf.zeros([num_labels]))

# training computation
logits = tf.matmul(relu_hidden_layer, W_layer2) + b_layer2 # [128 x 10]

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

# regularizers = tf.nn.l2_loss(W_layer1) + tf.nn.l2_loss(W_layer2)
# loss += 5e-4 * regularizers 

# optimize the loss function using gradient descent
optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

train_predictions = tf.nn.softmax(logits)

# run the validation dataset using the trained network (weights and biases)
valid_prediction_hidden_layer = tf.nn.relu(tf.matmul(tf_valid_dataset, W_layer1) + b_layer1)
valid_prediction = tf.nn.softmax(tf.matmul(valid_prediction_hidden_layer, W_layer2) + b_layer2)

# run the test dataset through the trained network (weights and biases)
test_prediction_hidden_layer = tf.nn.relu(tf.matmul(tf_test_dataset, W_layer1) + b_layer1)
test_prediction = tf.nn.softmax(tf.matmul(test_prediction_hidden_layer, W_layer2) + b_layer2)

In [14]:
# properly initialize the tensorflow variables
init = tf.initialize_all_variables()

# create a session and run the operation that initializes the variables
sess = tf.Session()
sess.run(init)

num_steps = 3001

for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    # Let's try to overfit the model using a small portion of the full training set
    # offset = (step * batch_size) % (train_labels.shape[0] - batch_size)

    # Generate a minibatch.
    batch_data = train_dataset[0:(batch_size), :]
    batch_labels = train_labels[0:(batch_size), :]
    
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    
    opt, l, predictions = sess.run(
      [optimizer, loss, train_predictions], feed_dict=feed_dict)
    
    if (step % 500 == 0):
        print("Minibatch loss at step %d: %f" % (step, l))
        print("Minibatch accuracy %.1f%%" % accuracy(predictions, batch_labels))
        print("Validation accuracy %.1f%%" % accuracy(valid_prediction.eval(session=sess), valid_labels))
        
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(session=sess), test_labels))

Minibatch loss at step 0: 396.384552
Minibatch accuracy 10.2%
Validation accuracy 16.7%
Minibatch loss at step 500: 0.000011
Minibatch accuracy 100.0%
Validation accuracy 58.1%
Minibatch loss at step 1000: 0.000007
Minibatch accuracy 100.0%
Validation accuracy 58.1%
Minibatch loss at step 1500: 0.000005
Minibatch accuracy 100.0%
Validation accuracy 58.1%
Minibatch loss at step 2000: 0.000004
Minibatch accuracy 100.0%
Validation accuracy 58.1%
Minibatch loss at step 2500: 0.000004
Minibatch accuracy 100.0%
Validation accuracy 58.0%
Minibatch loss at step 3000: 0.000003
Minibatch accuracy 100.0%
Validation accuracy 58.0%
Test accuracy: 64.8%


With only 101 batches (steps 0 to 100) of size 128 and no regularization, the final test accuracy was 82.3%. If this setup were overfitting, I would expect the minibatch accuracy to climb toward 100%. However, at step 100 the minibatch accuracy was only 74.2%.

A second approach is to shrink the training data to the size of a few batches and run gradient descent over that data many times. Restricting the training set to 128x4 = 512 data points and running 3001 steps (each one on the same 512 points), the minibatch accuracy goes up to 100.0%. This makes sense: the model has now overfitted the training data. The test accuracy was 67.3% (64.8% in the run printed above), which also makes sense, since an overfitted model cannot generalize well.

- Using dropout in the same example, with probability 0.5 and applied only during training, the test accuracy jumps to 85.4%. This supports the claim that dropout helps keep the model from overfitting.
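
One way to see why 512 points are so easy to memorize is to count the trainable parameters of the network defined above. The arithmetic below uses only the sizes from the notebook.

```python
image_size = 28
hidden_layer_size = 1024
num_labels = 10
num_train_points = 4 * 128   # the restricted training set used above

# Weights plus biases for each of the two layers.
layer1_params = image_size * image_size * hidden_layer_size + hidden_layer_size
layer2_params = hidden_layer_size * num_labels + num_labels
total_params = layer1_params + layer2_params

print(total_params)                      # 814090
print(total_params // num_train_points)  # 1590
```

With roughly 1590 parameters per training example, the network has far more capacity than it needs to fit the 512 points exactly.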

---
Problem 3
---------
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

---
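
Before wiring dropout into the graph, it can help to see what the operation does to a batch of activations. The NumPy sketch below imitates the behaviour of `tf.nn.dropout`: it zeroes each unit with probability `1 - keep_prob` and scales the survivors by `1 / keep_prob`. The function name, the seed, and the all-ones input are assumptions made for the demonstration.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and
    scale the survivors by 1 / keep_prob, as tf.nn.dropout does."""
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)                  # fixed seed for a repeatable demo
hidden = np.ones((4, 1024), dtype=np.float32)   # stand-in for ReLU outputs

train_out = dropout(hidden, 0.5, rng)   # training: about half the units are zeroed
eval_out = dropout(hidden, 1.0, rng)    # evaluation: keep_prob = 1.0 is a no-op

print(np.unique(train_out))              # only 0.0 and 2.0 appear
print(train_out.mean())                  # close to 1.0, the mean of the input
print(np.array_equal(eval_out, hidden))  # True
```

Because the kept units are rescaled, the expected activation is unchanged, so the same weights can be used at evaluation time with `keep_prob` set to 1.0.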

In [15]:
batch_size = 128

# define the input variable
# the input variable will receive the image's pixels for every batch
tf_train_dataset = tf.placeholder(tf.float32, [None, image_size * image_size]) # [128 x 784] matrix 
tf_train_labels = tf.placeholder(tf.float32, shape=(None, num_labels))

# load the valid and test datasets
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)

# define the hidden layer size
hidden_layer_size = 1024

# define the weights (parameters) of the first layer
W_layer1 = tf.Variable(tf.truncated_normal([image_size*image_size, hidden_layer_size])) # [784x1024] Matrix 

# define the biases for the first layer
b_layer1 = tf.Variable(tf.zeros([hidden_layer_size]))

# training computation
hidden_layer = tf.matmul(tf_train_dataset, W_layer1) + b_layer1 # [128 x 1024] Matrix

# apply the ReLU (Rectified Linear Unit) function
relu_hidden_layer = tf.nn.relu(hidden_layer)

# to alleviate overfitting, let's add dropout between the layers,
# i.e. where the activations from the first layer flow into the
# second.
# keep_prob is a placeholder for the probability that an activation
# is kept (not dropped); it is fed at run time.
keep_prob = tf.placeholder(tf.float32)
relu_hidden_layer = tf.nn.dropout(relu_hidden_layer, keep_prob)

# define the parameters for the second layer
W_layer2 = tf.Variable(tf.truncated_normal([hidden_layer_size, num_labels])) # [1024 x 10] Matrix

# define the biases for the second layer 
b_layer2 = tf.Variable(tf.zeros([num_labels]))

# training computation
logits = tf.matmul(relu_hidden_layer, W_layer2) + b_layer2 # [128 x 10]

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

regularizers = tf.nn.l2_loss(W_layer1) + tf.nn.l2_loss(W_layer2)
loss += 5e-3 * regularizers 

# optimize the loss function using gradient descent
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

train_predictions = tf.nn.softmax(logits)

# run the validation dataset using the trained network (weights and biases)
valid_prediction_hidden_layer = tf.nn.relu(tf.matmul(tf_valid_dataset, W_layer1) + b_layer1)
valid_prediction = tf.nn.softmax(tf.matmul(valid_prediction_hidden_layer, W_layer2) + b_layer2)

# run the test dataset through the trained network (weights and biases)
test_prediction_hidden_layer = tf.nn.relu(tf.matmul(tf_test_dataset, W_layer1) + b_layer1)
test_prediction = tf.nn.softmax(tf.matmul(test_prediction_hidden_layer, W_layer2) + b_layer2)

Combining L2 regularization and dropout on this model raises the final test accuracy from 89.9% to 90.8%, using an L2 constant of 4e-4 and a dropout keep probability of 0.5 during training.

In [16]:
# properly initialize the tensorflow variables
init = tf.initialize_all_variables()

# create a session and run the operation that initializes the variables
sess = tf.Session()
sess.run(init)

num_steps = 3001

for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)

    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 0.5}
    
    opt, l, predictions = sess.run(
      [optimizer, loss, train_predictions], feed_dict=feed_dict)
    
    if (step % 500 == 0):
        print("Minibatch loss at step %d: %f" % (step, l))
        print("Minibatch accuracy %.1f%%" % accuracy(
                train_predictions.eval(
                    session=sess, feed_dict = {tf_train_dataset : batch_data, 
                                               tf_train_labels : batch_labels, 
                                               keep_prob: 1.0}), batch_labels))
        print("Validation accuracy %.1f%%" % 
              accuracy(valid_prediction.eval(session=sess), valid_labels))
        
print("Test accuracy: %.1f%%" % 
      accuracy(test_prediction.eval(session=sess), test_labels))

Minibatch loss at step 0: 2144.921875
Minibatch accuracy 39.8%
Validation accuracy 28.4%
Minibatch loss at step 500: 128.687347
Minibatch accuracy 91.4%
Validation accuracy 81.3%
Minibatch loss at step 1000: 11.063235
Minibatch accuracy 86.7%
Validation accuracy 84.6%
Minibatch loss at step 1500: 1.486928
Minibatch accuracy 91.4%
Validation accuracy 84.5%
Minibatch loss at step 2000: 0.841210
Minibatch accuracy 85.9%
Validation accuracy 84.1%
Minibatch loss at step 2500: 0.669946
Minibatch accuracy 88.3%
Validation accuracy 84.8%
Minibatch loss at step 3000: 0.793008
Minibatch accuracy 88.3%
Validation accuracy 84.5%
Test accuracy: 90.8%


---
Problem 4
---------

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
 
 ---
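
To see what the decay schedule does numerically, here is a plain-Python sketch of the formula documented for `tf.train.exponential_decay`: `learning_rate * decay_rate ^ (global_step / decay_steps)`, with integer division when `staircase` is set. The settings in the loop are hypothetical and chosen only so the steps are easy to follow.

```python
def exponential_decay(initial_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Plain-Python version of the schedule documented for
    tf.train.exponential_decay."""
    exponent = global_step / float(decay_steps)
    if staircase:
        # Decay in discrete jumps, once every decay_steps steps.
        exponent = global_step // decay_steps
    return initial_rate * decay_rate ** exponent

# Hypothetical settings, chosen only for illustration.
for step in (0, 500, 1000, 3000):
    print(step, exponential_decay(0.5, step, decay_steps=1000, decay_rate=0.9,
                                  staircase=True))
```

With `staircase=True` the rate stays at 0.5 for the first 1000 steps and then drops by a factor of 0.9 every 1000 steps. Without it the rate decreases smoothly at every step.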


Try to increase the accuracy by adding more layers and by using learning rate decay.

In [20]:
batch_size = 128

# define the input variable
# the input variable will receive the image's pixels for every batch
tf_train_dataset = tf.placeholder(tf.float32, [None, image_size * image_size]) # [128 x 784] matrix 
tf_train_labels = tf.placeholder(tf.float32, shape=(None, num_labels))

# placeholder for the probability of keeping each activation during dropout
keep_prob = tf.placeholder(tf.float32)

# load the valid and test datasets
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)

##############################################################
## Layer 1
##############################################################
# define the hidden layer size
hidden_layer_size = 1024

# define the weights (parameters) of the first layer
W_layer1 = tf.Variable(tf.truncated_normal([image_size*image_size, hidden_layer_size], stddev=0.03)) # [784x1024] Matrix 

# define the biases for the first layer
b_layer1 = tf.Variable(tf.zeros([hidden_layer_size]))

# training computation
hidden_layer = tf.matmul(tf_train_dataset, W_layer1) + b_layer1 # [128 x 1024] Matrix

# apply the ReLU (Rectified Linear Unit) function, followed by dropout
relu_hidden_layer = tf.nn.dropout(tf.nn.relu(hidden_layer), keep_prob)

##############################################################
## Layer 2
##############################################################
# define the second hidden layer size
hidden_layer2_size = 1024

# define the weights (parameters) of the second layer
W_layer2 = tf.Variable(tf.truncated_normal([hidden_layer_size, hidden_layer2_size], stddev=0.03)) # [1024 x 1024] Matrix

# define the biases for the second layer
b_layer2 = tf.Variable(tf.zeros([hidden_layer2_size])) # [1024]

# training computation
hidden_layer2 = tf.matmul(relu_hidden_layer, W_layer2) + b_layer2 # [128 x 1024] Matrix

# apply the ReLU (Rectified Linear Unit) function, followed by dropout
relu_hidden_layer2 = tf.nn.dropout(tf.nn.relu(hidden_layer2), keep_prob)

##############################################################
## Layer 3
##############################################################
# define the hidden layer size
# hidden_layer3_size = 1024

# define the weights (parameters) of the third layer
# W_layer3 = tf.Variable(tf.truncated_normal([hidden_layer2_size, hidden_layer3_size], stddev=0.03)) # [1024 x 1024] Matrix

# define the biases for the third layer
# b_layer3 = tf.Variable(tf.zeros([hidden_layer3_size])) # [1024]

# training computation
# hidden_layer3 = tf.matmul(relu_hidden_layer2, W_layer3) + b_layer3 # [128 x 1024] Matrix

# apply the ReLU (Rectified Linear Unit) function, followed by dropout
# relu_hidden_layer3 = tf.nn.dropout(tf.nn.relu(hidden_layer3), keep_prob)

##############################################################
## Layer 4
##############################################################

# define the parameters for the output layer
W_layer3 = tf.Variable(tf.truncated_normal([hidden_layer2_size, num_labels], stddev=0.03)) # [1024 x 10] Matrix

# define the biases for the output layer
b_layer3 = tf.Variable(tf.zeros([num_labels]))

# training computation
logits = tf.matmul(relu_hidden_layer2, W_layer3) + b_layer3 # [128 x 10]

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

# apply regularization
regularizers = tf.nn.l2_loss(W_layer1) + tf.nn.l2_loss(W_layer2) + tf.nn.l2_loss(W_layer3)
loss += 4e-4 * regularizers 

global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.5
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           decay_steps = batch_size, 
                                           decay_rate = 0.95, 
                                           staircase=True)

# optimize the loss function using gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)

train_predictions = tf.nn.softmax(logits)

# run the validation dataset using the trained network (weights and biases)
valid_prediction_hidden_layer = tf.nn.relu(tf.matmul(tf_valid_dataset, W_layer1) + b_layer1)
valid_prediction_hidden_layer = tf.nn.relu(tf.matmul(valid_prediction_hidden_layer, W_layer2) + b_layer2)
#valid_prediction_hidden_layer = tf.nn.relu(tf.matmul(valid_prediction_hidden_layer, W_layer3) + b_layer3)
valid_prediction = tf.nn.softmax(tf.matmul(valid_prediction_hidden_layer, W_layer3) + b_layer3)

# run the test dataset through the trained network (weights and biases)
test_prediction_hidden_layer = tf.nn.dropout(tf.nn.relu(tf.matmul(tf_test_dataset, W_layer1) + b_layer1), 1.0)
test_prediction_hidden_layer = tf.nn.dropout(tf.nn.relu(tf.matmul(test_prediction_hidden_layer, W_layer2) + b_layer2), 1.0)
#test_prediction_hidden_layer = tf.nn.dropout(tf.nn.relu(tf.matmul(test_prediction_hidden_layer, W_layer3) + b_layer3), 1.0)
test_prediction = tf.nn.softmax(tf.matmul(test_prediction_hidden_layer, W_layer3) + b_layer3)

To improve the final test accuracy, I made a few changes:
- Added a second hidden layer with 1024 neurons.
- Changed the weight initialization from stddev=1.0 (the default standard deviation in TensorFlow) to stddev=0.03, since small initial noise works better in deeper models.
- Added exponential learning rate decay (the learning rate decreases as training progresses).

We recorded a final test accuracy of roughly 94%. Better results would likely require more careful hyperparameter tuning.
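
The effect of the smaller initialization can be sketched with NumPy alone. The snippet below compares the spread of first-layer pre-activations for `stddev=1.0` and `stddev=0.03`. It assumes pixel values roughly in [-0.5, 0.5] and uses plain normal draws instead of truncated ones, so it is only an approximation of what the graph does.

```python
import numpy as np

rng = np.random.RandomState(0)
num_inputs = 28 * 28

# Stand-in for a batch of normalized pixels in [-0.5, 0.5] (assumed range).
x = rng.uniform(-0.5, 0.5, size=(128, num_inputs))

def preactivation_std(stddev):
    # Plain normal draws are used here for simplicity; tf.truncated_normal
    # additionally re-draws values more than two standard deviations out.
    W = rng.normal(0.0, stddev, size=(num_inputs, 1024))
    return np.dot(x, W).std()

wide = preactivation_std(1.0)
narrow = preactivation_std(0.03)
print(wide, narrow, wide / narrow)
```

Under these assumptions the pre-activations are about 33 times larger with `stddev=1.0`. Large initial activations lead to very confident, mostly wrong initial predictions, which may help explain why the loss at step 0 was about 396 in the unregularized run above and about 2.6 here.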

In [21]:
# properly initialize the tensorflow variables
init = tf.initialize_all_variables()

# create a session and run the operation that initializes the variables
sess = tf.Session()
sess.run(init)

num_steps = 3001
learning_rate_decay = []

for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)

    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob: 0.5}
    
    opt, l, l_rate, predictions = sess.run(
      [optimizer, loss, learning_rate, train_predictions], feed_dict=feed_dict)
    
    
    learning_rate_decay.append(l_rate)
    
    if (step % 500 == 0):
        print("Minibatch loss at step %d: %f" % (step, l))
        print("Minibatch accuracy %.1f%%" % accuracy(predictions, batch_labels))
        print("Validation accuracy %.1f%%" % accuracy(valid_prediction.eval(session=sess), valid_labels))
        
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(session=sess), test_labels))

import matplotlib.pyplot as plt

# plot how the learning rate decays over the training steps
plt.plot(learning_rate_decay)
plt.grid(True)
plt.xlabel('Iterations')
plt.ylabel('Learning rate')
plt.show()

Minibatch loss at step 0: 2.600160
Minibatch accuracy 8.6%
Validation accuracy 21.8%
Minibatch loss at step 500: 0.767629
Minibatch accuracy 83.6%
Validation accuracy 85.3%
Minibatch loss at step 1000: 0.688453
Minibatch accuracy 87.5%
Validation accuracy 86.6%
Minibatch loss at step 1500: 0.577422
Minibatch accuracy 86.7%
Validation accuracy 87.4%
Minibatch loss at step 2000: 0.788710
Minibatch accuracy 83.6%
Validation accuracy 87.9%
Minibatch loss at step 2500: 0.524377
Minibatch accuracy 89.1%
Validation accuracy 88.5%
Minibatch loss at step 3000: 0.697827
Minibatch accuracy 85.9%
Validation accuracy 88.6%
Test accuracy: 94.2%
