Deep Learning
=============

Assignment 3
------------

Previously in `2_fullyconnected.ipynb`, you trained a logistic regression and a neural network model.

The goal of this assignment is to explore regularization techniques.

In [2]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

First reload the data we generated in `1_notmnist.ipynb`.

In [3]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a shape that's more adapted to the models we're going to train:
- data as a flat matrix,
- labels as float 1-hot encodings.

In [4]:
image_size = 28
num_labels = 10

def reformat(dataset, labels):
  dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
  # Map 1 to [0.0, 1.0, 0.0 ...], 2 to [0.0, 0.0, 1.0 ...]
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 784) (200000, 10)
Validation set (10000, 784) (10000, 10)
Test set (10000, 784) (10000, 10)
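To make the one-hot step concrete, here is a small sketch of the same broadcasting comparison on three made-up labels. The label values and `num_classes = 4` are illustrative assumptions, not the notMNIST data.

```python
import numpy as np

num_classes = 4                 # illustrative; the notebook uses num_labels = 10
labels = np.array([0, 2, 3])    # made-up integer class labels

# labels[:, None] has shape (3, 1) and np.arange(num_classes) has shape (4,).
# Broadcasting compares every label with every class index, giving shape (3, 4).
one_hot = (np.arange(num_classes) == labels[:, None]).astype(np.float32)
print(one_hot)
```

Each row contains a single 1.0 in the column whose index equals the label, which is the format the softmax cross-entropy loss expects.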


In [5]:
def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

---
Problem 1
---------

Introduce and tune L2 regularization for both logistic and neural network models. Remember that L2 amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, you can compute the L2 loss for a tensor `t` using `nn.l2_loss(t)`. The right amount of regularization should improve your validation / test accuracy.

---
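As a rough illustration of what the penalty adds to the loss, the sketch below reproduces the quantity `tf.nn.l2_loss` computes (`sum(t ** 2) / 2`) in NumPy. The weight matrices and the `data_loss` value are hypothetical numbers chosen to keep the arithmetic easy; only `beta = 5e-4` is taken from the solution cell that follows.

```python
import numpy as np

def l2_loss(t):
    # Same quantity tf.nn.l2_loss computes: sum(t ** 2) / 2 (no square root).
    return np.sum(np.square(t)) / 2.0

# Hypothetical weights and data loss, chosen only to make the arithmetic easy.
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0], [-1.0]])
data_loss = 0.40     # stand-in for the cross-entropy term
beta = 5e-4          # regularization strength

penalty = l2_loss(W1) + l2_loss(W2)
total_loss = data_loss + beta * penalty
print(penalty, total_loss)
```

Only the weight matrices are penalized here, not the biases, which matches the solution cell.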

In [5]:
batch_size = 128
size_hidden = 1024

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  W1 = tf.Variable(tf.truncated_normal([image_size * image_size, size_hidden]))
  b1 = tf.Variable(tf.zeros([size_hidden]))
  W2 = tf.Variable(tf.truncated_normal([size_hidden, num_labels]))
  b2 = tf.Variable(tf.zeros([num_labels]))
  
  # Training computation.
  logits = tf.matmul(tf.nn.relu(tf.matmul(tf_train_dataset, W1) + b1), W2) + b2
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
  l2_regulize = tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2)
  loss += 5e-4 * l2_regulize
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, W1) + b1), W2) + b2)
  test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, W1) + b1), W2) + b2)

In [6]:
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 481.710632
Minibatch accuracy: 11.7%
Validation accuracy: 34.4%
Minibatch loss at step 500: 129.154663
Minibatch accuracy: 86.7%
Validation accuracy: 81.5%
Minibatch loss at step 1000: 101.674202
Minibatch accuracy: 82.0%
Validation accuracy: 78.2%
Minibatch loss at step 1500: 73.685928
Minibatch accuracy: 85.9%
Validation accuracy: 80.4%
Minibatch loss at step 2000: 56.385788
Minibatch accuracy: 85.2%
Validation accuracy: 83.8%
Minibatch loss at step 2500: 43.838940
Minibatch accuracy: 90.6%
Validation accuracy: 84.5%
Minibatch loss at step 3000: 34.167339
Minibatch accuracy: 90.6%
Validation accuracy: 84.5%
Test accuracy: 90.8%


---
Problem 2
---------
Let's demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

---
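One way to see how severe the restriction is: the sketch below counts how many distinct training examples are touched by the usual sweeping offset and by an offset limited to the first 4 batches. It assumes the training-set size of 200000 printed earlier; the fixed seed is only there to make the sketch repeatable.

```python
import numpy as np

batch_size = 128
num_steps = 3001
num_train = 200000                 # training-set size printed earlier in the notebook

rng = np.random.RandomState(0)     # fixed seed, only so the sketch is repeatable

seen_full = np.zeros(num_train, dtype=bool)    # examples touched by the full sweep
seen_small = np.zeros(num_train, dtype=bool)   # examples touched when restricted

for step in range(num_steps):
    # Full sweep: the offset advances by one batch per step.
    offset = (step * batch_size) % (num_train - batch_size)
    seen_full[offset:offset + batch_size] = True
    # Restricted: every step reuses one of the first 4 batches.
    offset = batch_size * rng.randint(4)
    seen_small[offset:offset + batch_size] = True

print(seen_full.sum(), seen_small.sum())
```

The restricted run can touch at most 4 * 128 = 512 examples, so a network with roughly 800000 weights can simply memorize them.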

In [7]:
batch_size = 128
size_hidden = 1024

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  W1 = tf.Variable(tf.truncated_normal([image_size * image_size, size_hidden]))
  b1 = tf.Variable(tf.zeros([size_hidden]))
  W2 = tf.Variable(tf.truncated_normal([size_hidden, num_labels]))
  b2 = tf.Variable(tf.zeros([num_labels]))
  
  # Training computation.
  logits = tf.matmul(tf.nn.relu(tf.matmul(tf_train_dataset, W1) + b1), W2) + b2
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
  l2_regulize = tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2)
  loss += 5e-4 * l2_regulize
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, W1) + b1), W2) + b2)
  test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, W1) + b1), W2) + b2)

In [8]:
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    # Extreme overfitting: draw every minibatch from only the first 4 batches
    # (4 * batch_size = 512 examples) instead of sweeping the whole training set.
    offset = batch_size * np.random.randint(4)

    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 489.446289
Minibatch accuracy: 7.0%
Validation accuracy: 31.0%
Minibatch loss at step 500: 122.007462
Minibatch accuracy: 100.0%
Validation accuracy: 75.3%
Minibatch loss at step 1000: 95.016602
Minibatch accuracy: 100.0%
Validation accuracy: 75.3%
Minibatch loss at step 1500: 73.996613
Minibatch accuracy: 100.0%
Validation accuracy: 75.3%
Minibatch loss at step 2000: 57.626900
Minibatch accuracy: 100.0%
Validation accuracy: 75.2%
Minibatch loss at step 2500: 44.878483
Minibatch accuracy: 100.0%
Validation accuracy: 75.2%
Minibatch loss at step 3000: 34.950302
Minibatch accuracy: 100.0%
Validation accuracy: 75.1%
Test accuracy: 81.8%


---
Problem 3
---------
Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise your evaluation results would be stochastic as well. TensorFlow provides `nn.dropout()` for that, but you have to make sure it's only inserted during training.

What happens to our extreme overfitting case?

Result: dropout gave about 3% more accuracy on the extreme overfitting case.

---
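The training-only requirement can be sketched in NumPy as a function with an explicit `training` flag. This is a minimal illustration, not the TensorFlow implementation: the `dropout` helper, its `training` argument, and the all-ones activations are assumptions made for the example. It mirrors the behaviour of `tf.nn.dropout`, which scales the kept units by `1 / keep_prob`.

```python
import numpy as np

def dropout(activations, keep_prob, training, rng):
    # Evaluation: pass activations through unchanged, so results are deterministic.
    if not training:
        return activations
    # Training: keep each unit with probability keep_prob and scale the
    # survivors by 1 / keep_prob.
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)    # fixed seed, only for repeatability
hidden = np.ones((2, 8))          # made-up hidden-layer activations

train_out = dropout(hidden, keep_prob=0.5, training=True, rng=rng)
eval_out = dropout(hidden, keep_prob=0.5, training=False, rng=rng)
print(train_out)   # every entry is either 0.0 or 2.0
print(eval_out)    # identical to `hidden`
```

In the graph-based solution the same separation is achieved by applying `tf.nn.dropout` only on the training path and building the validation and test predictions without it.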

In [6]:
batch_size = 128
size_hidden = 1024

graph = tf.Graph()
with graph.as_default():

  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)
  
  # Variables.
  W1 = tf.Variable(tf.truncated_normal([image_size * image_size, size_hidden]))
  b1 = tf.Variable(tf.zeros([size_hidden]))
  W2 = tf.Variable(tf.truncated_normal([size_hidden, num_labels]))
  b2 = tf.Variable(tf.zeros([num_labels]))
  
  # Training computation.
  logits = tf.matmul(tf.nn.relu(tf.nn.dropout(tf.matmul(tf_train_dataset, W1) + b1, keep_prob=0.5)), W2) + b2
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
  l2_regulize = tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2)
  loss += 5e-4 * l2_regulize
  
  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_valid_dataset, W1) + b1), W2) + b2)
  test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, W1) + b1), W2) + b2)
    
    
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    # Extreme overfitting: draw every minibatch from only the first 4 batches
    # (4 * batch_size = 512 examples) instead of sweeping the whole training set.
    offset = batch_size * np.random.randint(4)

    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
  print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 619.114685
Minibatch accuracy: 9.4%
Validation accuracy: 35.8%


KeyboardInterrupt: 

---
Problem 4
---------

Try to get the best performance you can using a multi-layer model! The best reported test accuracy using a deep network is [97.1%](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html?showComment=1391023266211#c8758720086795711595).

One avenue you can explore is to add multiple layers.

Another one is to use learning rate decay:

    global_step = tf.Variable(0)  # count the number of steps taken.
    learning_rate = tf.train.exponential_decay(0.5, global_step, ...)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
 
 ---
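To get a feel for the schedule, the sketch below re-implements the formula documented for `tf.train.exponential_decay`, `learning_rate * decay_rate ** (global_step / decay_steps)`, in plain Python. The helper is a stand-in written for this example, and the arguments are the ones the solution cell below passes (0.5, 100000, 0.96).

```python
def exponential_decay(base_rate, global_step, decay_steps, decay_rate, staircase=False):
    # Formula documented for tf.train.exponential_decay:
    #   base_rate * decay_rate ** (global_step / decay_steps)
    exponent = global_step / float(decay_steps)
    if staircase:
        exponent = global_step // decay_steps   # decay in discrete jumps
    return base_rate * decay_rate ** exponent

for step in (0, 1000, 100000):
    print(step, exponential_decay(0.5, step, decay_steps=100000, decay_rate=0.96))
```

With `decay_steps=100000` the rate is still about 0.4998 after 1000 steps, so over a short run this schedule behaves almost like a constant learning rate. A smaller `decay_steps` would make the decay noticeable.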


In [17]:
batch_size = 128
beta = 1e-5

h_size = [1024, 256, num_labels]
stddev= [np.sqrt(2.0/image_size**2)]
for i in range(len(h_size) - 1):
    stddev.append(np.sqrt(2.0/h_size[i]))

keep_prob = np.linspace(start=0.5, stop=0.9, num=len(h_size) - 1)
deep_graph = tf.Graph()
with deep_graph.as_default():
  tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size**2))
  tf_train_labels  = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset  = tf.constant(test_dataset)
  
  W = []
  b = []
  W.append(tf.Variable(tf.truncated_normal([image_size * image_size, h_size[0]], stddev=stddev[0])))
  b.append(tf.Variable(tf.zeros([h_size[0]])))
  for i in range(1, len(h_size)):
    W.append(tf.Variable(tf.truncated_normal([h_size[i-1], h_size[i]], stddev=stddev[i])))
    b.append(tf.Variable(tf.zeros([h_size[i]])))
  
  # Input -> Hidden
  logits       = tf.nn.dropout(tf.nn.relu(tf.matmul(tf_train_dataset, W[0]) + b[0]), keep_prob=keep_prob[0])
  valid_logits =               tf.nn.relu(tf.matmul(tf_valid_dataset, W[0]) + b[0])
  test_logits  =               tf.nn.relu(tf.matmul(tf_test_dataset,  W[0]) + b[0])

  # Hidden -> Hidden
  for i in range(1, len(W)-1):
    logits       = tf.nn.dropout(tf.nn.relu(tf.matmul(logits,       W[i]) + b[i]), keep_prob=keep_prob[i])
    valid_logits =               tf.nn.relu(tf.matmul(valid_logits, W[i]) + b[i])
    test_logits  =               tf.nn.relu(tf.matmul(test_logits,  W[i]) + b[i])

  # Hidden -> Output
  logits                =               tf.matmul(logits,       W[-1]) + b[-1]
  validation_prediction = tf.nn.softmax(tf.matmul(valid_logits, W[-1]) + b[-1])
  test_prediction       = tf.nn.softmax(tf.matmul(test_logits,  W[-1]) + b[-1])
  train_prediction      = tf.nn.softmax(logits)


  # Calculate the loss with regularization
  loss  = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
  loss += beta * sum([tf.nn.l2_loss(Wi) for Wi in W])
  
  # Learn with exponential rate decay.
  global_step = tf.Variable(0, trainable=False)
  learning_rate = tf.train.exponential_decay(0.5, global_step, 100000, 0.96)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)


In [18]:

num_steps = 1001

with tf.Session(graph=deep_graph) as session:
  tf.global_variables_initializer().run()
  print("Initialized")
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 100 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(validation_prediction.eval(), valid_labels))
  print("  Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
    
  

Initialized
Minibatch loss at step 0: 2.486742
Minibatch accuracy: 7.8%
Validation accuracy: 43.3%
Minibatch loss at step 100: 0.619711
Minibatch accuracy: 75.8%
Validation accuracy: 82.8%
Minibatch loss at step 200: 0.474756
Minibatch accuracy: 84.4%
Validation accuracy: 84.2%
Minibatch loss at step 300: 0.615917
Minibatch accuracy: 79.7%
Validation accuracy: 84.5%
Minibatch loss at step 400: 0.644437
Minibatch accuracy: 81.2%
Validation accuracy: 84.6%
Minibatch loss at step 500: 0.502544
Minibatch accuracy: 85.9%
Validation accuracy: 85.6%
Minibatch loss at step 600: 0.542684
Minibatch accuracy: 86.7%
Validation accuracy: 86.4%
Minibatch loss at step 700: 0.518109
Minibatch accuracy: 84.4%
Validation accuracy: 85.5%
Minibatch loss at step 800: 0.457067
Minibatch accuracy: 87.5%
Validation accuracy: 86.3%
Minibatch loss at step 900: 0.644939
Minibatch accuracy: 75.8%
Validation accuracy: 86.4%
Minibatch loss at step 1000: 0.579219
Minibatch accuracy: 84.4%
Validation accuracy: 86.8%


---
Notes and summary 
---
Simply adding more nodes to a hidden layer quickly becomes computationally expensive. Deep learning is about stacking many layers to make models deeper, which increases parameter efficiency: deeper > wider.
Many of the features we want to find have a hierarchical structure. The first layers often detect lines and edges, later layers combine them into geometric shapes, and the layers after that can represent whole objects.
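A quick count makes the cost of widening concrete. The sketch below counts weights and biases for fully connected networks; `count_parameters` is a helper written for this example, and the 2048-unit layer is a hypothetical "wider" variant that does not appear in the notebook.

```python
def count_parameters(layer_sizes):
    # A fully connected layer from n_in to n_out units has n_in * n_out weights
    # plus n_out biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

baseline = count_parameters([784, 1024, 10])       # one hidden layer (Problems 1-3)
wider = count_parameters([784, 2048, 10])          # hypothetical: hidden layer doubled
deeper = count_parameters([784, 1024, 256, 10])    # extra 256-unit layer (Problem 4)

print(baseline, wider, deeper)
```

Doubling the width roughly doubles the parameter count, while adding the 256-unit layer costs about a third as many extra parameters. The counts alone do not show which model generalizes better; they only show that adding a layer is the cheaper change.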

Deep models shine only when you have enough data to train them.

We often train a network that is way too big for the data and then do our best to prevent it from overfitting; this is the solution to the skinny jeans problem.

WAYS TO PREVENT OVERFITTING:

Early termination: stop training as soon as performance on the validation set stops improving.
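A minimal sketch of this rule, assuming a hypothetical `patience` parameter (how many non-improving evaluations to tolerate) that is not part of the original notes. The scores are the validation accuracies printed by the L2 run in Problem 1.

```python
def early_stopping_step(validation_scores, patience=2):
    # Return the index and value of the best score seen before `patience`
    # consecutive evaluations fail to improve on it (or the overall best
    # if that never happens).
    best_score, best_index, waited = float("-inf"), -1, 0
    for index, score in enumerate(validation_scores):
        if score > best_score:
            best_score, best_index, waited = score, index, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_index, best_score

# Validation accuracies from the Problem 1 run, evaluated every 500 steps.
scores = [34.4, 81.5, 78.2, 80.4, 83.8, 84.5, 84.5]
print(early_stopping_step(scores, patience=2))
print(early_stopping_step(scores, patience=3))
```

With `patience=2` training would stop at the second evaluation (81.5%), while `patience=3` rides out the temporary dip and reaches 84.5%. Validation accuracy is noisy, so stopping at the very first dip can be premature.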


Regularization: apply artificial constraints that implicitly reduce the number of free parameters without making the network more difficult to optimize.
L2 regularization: penalize large weights in the network. Add the squared L2 norm of the weights, multiplied by a small constant, to the loss. Minimizing the loss now also discourages individual weights from getting too large. The derivative of half the squared L2 norm with respect to a weight is just the weight itself.
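The claim about the derivative can be checked numerically. The sketch below compares a finite-difference gradient of `0.5 * sum(w ** 2)` with `w` itself; the weight values and helper names are made up for the example.

```python
import numpy as np

def half_squared_l2(w):
    # 0.5 * ||w||^2, the quantity whose gradient should be w itself.
    return 0.5 * np.sum(w ** 2)

def numerical_gradient(f, w, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step.flat[i] = eps
        grad.flat[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

w = np.array([0.5, -1.5, 2.0])    # made-up weights
print(numerical_gradient(half_squared_l2, w))   # approximately equal to w
```

This is why the L2 term is cheap to train with: each gradient step just subtracts a small multiple of the weight, shrinking it toward zero.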


DROPOUT:
Randomly set half of the activations flowing from one layer to the next to zero; in other words, take half of the data that flows through the network and destroy it. The network can no longer rely on any single activation being present and is therefore forced to learn a redundant representation for everything. This makes the network more robust and helps prevent overfitting. It works somewhat like training multiple networks and averaging their outputs, but within a single network. If dropout doesn't work, you should probably use a bigger network. During training, if you set half of the activations to zero, you must multiply the remaining activations by 2 so that the expected activation matches what the next layer sees at evaluation time, when nothing is dropped.
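The role of the factor of 2 can be illustrated by averaging over many random masks. The activations below are made up and the number of masks (20000) is arbitrary; the fixed seed only makes the sketch repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)                  # fixed seed, only for repeatability
activations = np.array([1.0, 2.0, 3.0, 4.0])    # made-up activations
keep_prob = 0.5

# Draw many random masks; survivors are scaled by 1 / keep_prob (here: times 2).
masks = rng.uniform(size=(20000, activations.size)) < keep_prob
dropped = activations * masks / keep_prob

# Averaged over many masks, the scaled activations approach the originals.
print(dropped.mean(axis=0))
```

Without the scaling the average would be about half of each activation, and the layer after dropout would see inputs of a different magnitude at evaluation time than during training.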



---