Deep Learning
=============

Assignment 4
------------

Previously in `2_fullyconnected.ipynb` and `3_regularization.ipynb`, we trained fully connected networks to classify [notMNIST](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html) characters.

The goal of this assignment is to make the neural network convolutional.

In [33]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
from six.moves import range

In [34]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (10000, 28, 28) (10000,)


Reformat into a TensorFlow-friendly shape:
- convolutions need the image data formatted as a cube (width by height by #channels)
- labels as float 1-hot encodings.

In [35]:
image_size = 28
num_labels = 10
num_channels = 1 # grayscale

def reformat(dataset, labels):
  dataset = dataset.reshape(
    (-1, image_size, image_size, num_channels)).astype(np.float32)
  labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
  return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28, 1) (200000, 10)
Validation set (10000, 28, 28, 1) (10000, 10)
Test set (10000, 28, 28, 1) (10000, 10)
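
The one-hot step in `reformat` relies on NumPy broadcasting: comparing a row of class indices against a column of labels yields one row per example. A minimal sketch on a toy label vector (the three labels below are made up for illustration and are not taken from the dataset):

```python
import numpy as np

num_labels = 10
# Hypothetical toy labels, standing in for a slice of train_labels.
labels = np.array([0, 3, 9])

# labels[:, None] has shape (3, 1); np.arange(num_labels) has shape (10,).
# Broadcasting compares every label with every class index, giving shape (3, 10).
one_hot = (np.arange(num_labels) == labels[:, None]).astype(np.float32)

print(one_hot.shape)
print(one_hot[1])  # the row for label 3 has its single 1.0 at index 3
```

Each row contains exactly one 1.0, which is the format the softmax cross-entropy loss expects for its labels.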


In [36]:
def accuracy(predictions, labels):
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])
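
To see what `accuracy` measures, it can be applied to a handful of hand-written rows. The sketch below repeats the function so it runs on its own; the toy predictions and labels are illustrative assumptions, not model output:

```python
import numpy as np

def accuracy(predictions, labels):
  # Percentage of rows whose highest-scoring class matches the one-hot label.
  return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
          / predictions.shape[0])

# Hypothetical softmax outputs for 4 examples and 3 classes.
toy_predictions = np.array([[0.8, 0.1, 0.1],
                            [0.2, 0.7, 0.1],
                            [0.3, 0.3, 0.4],
                            [0.6, 0.3, 0.1]])
# One-hot labels; the last example is deliberately misclassified above.
toy_labels = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1],
                       [0, 1, 0]], dtype=np.float32)

toy_accuracy = accuracy(toy_predictions, toy_labels)
print('%.1f%%' % toy_accuracy)  # 3 of the 4 rows match
```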

Let's build a small network with two convolutional layers, followed by one fully connected layer. Convolutional networks are more computationally expensive, so we'll limit the network's depth and its number of fully connected nodes.

In [30]:
batch_size = 16
patch_size = 5
depth = 16
num_hidden = 64

graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size, num_channels))
  tf_train_labels  = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset  = tf.constant(test_dataset)
  
  # Variables.
  layer1_weights = tf.Variable(tf.truncated_normal([patch_size, patch_size, num_channels, depth], stddev=0.1))
  layer1_biases = tf.Variable(tf.zeros([depth]))
  layer2_weights = tf.Variable(tf.truncated_normal([patch_size, patch_size, depth, depth], stddev=0.1))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth]))
  layer3_weights = tf.Variable(tf.truncated_normal([(image_size // 4) * (image_size // 4) * depth, num_hidden], stddev=0.1))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer4_weights = tf.Variable(tf.truncated_normal([num_hidden, num_labels], stddev=0.1))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))
  
  # Model.
  def model(data):
    # conv2d: input, filter/kernel, strides, padding. filter is [height, width, in_depth, out_depth], strides is [batch, height, width, channels].
    conv    = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden  = tf.nn.relu(conv + layer1_biases)
    conv    = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden  = tf.nn.relu(conv + layer2_biases)
    shape   = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden  = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases
  
  # Training computation.
  logits = model(tf_train_dataset)
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    
  # Optimizer.
  global_step = tf.Variable(0, trainable=False)
  learning_rate = tf.train.exponential_decay(0.05, global_step, 100000, 0.96)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(model(tf_valid_dataset))
  test_prediction  = tf.nn.softmax(model(tf_test_dataset))
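
The number of rows in `layer3_weights` follows from the strides. With `padding='SAME'`, a convolution produces an output of size ceil(input / stride), so the two stride-2 convolutions take 28 to 14 to 7. A small sketch of that arithmetic (the helper name `same_padding_output_size` is made up for illustration):

```python
import math

def same_padding_output_size(input_size, stride):
  # With padding='SAME', the output size depends only on input size and stride.
  return int(math.ceil(float(input_size) / stride))

image_size = 28
depth = 16

after_conv1 = same_padding_output_size(image_size, 2)   # first stride-2 convolution
after_conv2 = same_padding_output_size(after_conv1, 2)  # second stride-2 convolution
flattened = after_conv2 * after_conv2 * depth            # inputs to the fully connected layer

print(after_conv1, after_conv2, flattened)
```

For `image_size = 28` this gives 7 x 7 x 16 = 784 values per image, which is the size the reshape step hands to the fully connected layer.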

In [31]:
num_steps = 3001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 50 == 0):
      print('Minibatch loss at step %d: %f' % (step, l))
      print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
      print('Validation accuracy: %.1f%%' % accuracy(
        valid_prediction.eval(), valid_labels))
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3.435241
Minibatch accuracy: 0.0%
Validation accuracy: 10.0%
Minibatch loss at step 50: 2.169914
Minibatch accuracy: 31.2%


KeyboardInterrupt: 
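
The training loop walks through the training set in consecutive slices and wraps around with a modulo, so every slice has exactly `batch_size` rows. A minimal sketch of the offset arithmetic, using the training-set size printed earlier (the helper name `batch_offset` is an assumption for illustration):

```python
batch_size = 16
num_examples = 200000  # size of the training set loaded above

def batch_offset(step):
  # Wrap around before the last full batch so offset + batch_size never
  # runs past the end of the data.
  return (step * batch_size) % (num_examples - batch_size)

for step in (0, 1, 12498, 12499):
  offset = batch_offset(step)
  print(step, offset, offset + batch_size)
```

With these numbers, the offset returns to 0 at step 12499, long after the 3001 steps run here, so this run never revisits an example.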

---
Problem 1
---------

The convolutional model above uses convolutions with stride 2 to reduce the dimensionality. Replace the strides by a max pooling operation (`nn.max_pool()`) of stride 2 and kernel size 2.

---
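
Before swapping the strides for pooling, it helps to see what a 2x2 max pool with stride 2 does to a single feature map. The following is a NumPy sketch, not the TensorFlow implementation; it assumes an even height and width, and `max_pool_2x2` is a made-up helper name:

```python
import numpy as np

def max_pool_2x2(feature_map):
  # feature_map: [height, width], both even (an assumption of this sketch).
  h, w = feature_map.shape
  # Split into non-overlapping 2x2 blocks, then keep the largest value of each.
  blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
  return blocks.max(axis=(1, 3))

toy = np.array([[1, 2, 0, 1],
                [3, 4, 1, 0],
                [0, 0, 5, 6],
                [1, 0, 7, 8]], dtype=np.float32)

pooled = max_pool_2x2(toy)
print(pooled)
# [[4. 1.]
#  [1. 8.]]
```

Like a stride-2 convolution, the pool halves the width and height, but the convolution before it can now run at stride 1 and look at every position.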

In [57]:
batch_size = 16
patch_size = 5
depth = 16
num_hidden = 64

graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size, num_channels))
  tf_train_labels  = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset  = tf.constant(test_dataset)

  # Variables.
  W1 = tf.Variable(tf.truncated_normal([patch_size, patch_size, num_channels, depth], stddev=0.1))
  b1 = tf.Variable(tf.zeros([depth]))

  W2 = tf.Variable(tf.truncated_normal([patch_size, patch_size, depth, depth], stddev=0.1))
  b2 = tf.Variable(tf.constant(1.0, shape=[depth]))

  W3 = tf.Variable(tf.truncated_normal([(image_size // 4) * (image_size // 4) * depth, num_hidden], stddev=0.1))
  b3 = tf.Variable(tf.constant(1.0, shape=[num_hidden]))

  W4 = tf.Variable(tf.truncated_normal([num_hidden, num_labels], stddev=0.1))
  b4 = tf.Variable(tf.constant(1.0, shape=[num_labels]))

  # Model.
  def model_maxpool(data):
    # conv2d: input, filter/kernel, strides, padding. filter is [height, width, in_depth, out_depth], strides is [batch, height, width, channels].
    # max_pool: value, ksize = size of the pooling window, strides = how far the window moves at each step.
    conv = tf.nn.relu(tf.nn.conv2d(data, W1, [1, 1, 1, 1], padding='SAME') + b1)
    pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    
    
    conv = tf.nn.relu(tf.nn.conv2d(pool, W2, [1, 1, 1, 1], padding='SAME') + b2)
    pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    
    # Flatten the feature maps to feed the fully connected layer.
    shape   = pool.get_shape().as_list()
    reshape = tf.reshape(pool, [shape[0], shape[1] * shape[2] * shape[3]])
    
    hidden  = tf.nn.relu(tf.matmul(reshape, W3) + b3)
    return tf.matmul(hidden, W4) + b4

  logits = model_maxpool(tf_train_dataset)
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
    
  # Optimizer.
  global_step = tf.Variable(0, trainable=False)
  learning_rate = tf.train.exponential_decay(0.05, global_step, 100000, 0.96)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(model_maxpool(tf_valid_dataset))
  test_prediction  = tf.nn.softmax(model_maxpool(tf_test_dataset))
  

In [58]:
num_steps = 1001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 50 == 0):
      print('Minibatch loss at step %d: %f' % (step, l))
      print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
      print('Validation accuracy: %.1f%%' % accuracy(
        valid_prediction.eval(), valid_labels))
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 4.121006
Minibatch accuracy: 6.2%
Validation accuracy: 10.0%
Minibatch loss at step 50: 1.679028
Minibatch accuracy: 43.8%
Validation accuracy: 34.9%
Minibatch loss at step 100: 1.360486
Minibatch accuracy: 56.2%
Validation accuracy: 64.6%
Minibatch loss at step 150: 0.704471
Minibatch accuracy: 81.2%
Validation accuracy: 76.0%
Minibatch loss at step 200: 0.260568
Minibatch accuracy: 93.8%
Validation accuracy: 77.8%
Minibatch loss at step 250: 0.671885
Minibatch accuracy: 81.2%
Validation accuracy: 75.2%
Minibatch loss at step 300: 0.710637
Minibatch accuracy: 81.2%
Validation accuracy: 81.5%
Minibatch loss at step 350: 0.243036
Minibatch accuracy: 93.8%
Validation accuracy: 81.4%
Minibatch loss at step 400: 0.882618
Minibatch accuracy: 75.0%
Validation accuracy: 80.7%
Minibatch loss at step 450: 0.420508
Minibatch accuracy: 93.8%
Validation accuracy: 81.4%
Minibatch loss at step 500: 0.752772
Minibatch accuracy: 87.5%
Validation accuracy: 82.1%
Mi

---
Problem 2
---------

Try to get the best performance you can using a convolutional net. Look for example at the classic [LeNet5](http://yann.lecun.com/exdb/lenet/) architecture, adding Dropout, and/or adding learning rate decay.

---
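
On learning rate decay: the cells in this notebook call `tf.train.exponential_decay(0.05, global_step, 100000, 0.96)`. With the default non-staircase behaviour, the documented formula is `learning_rate * decay_rate ** (global_step / decay_steps)`. A plain-Python sketch of that formula (the function below is a stand-in, not the TensorFlow op) shows how little the rate moves over a run of about a thousand steps:

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate):
  # Non-staircase form: the rate shrinks smoothly with the step count.
  return initial_rate * decay_rate ** (float(step) / decay_steps)

for step in (0, 1000, 100000):
  print(step, exponential_decay(0.05, step, 100000, 0.96))
```

With `decay_steps = 100000`, the rate only reaches 0.048 after 100000 steps, so a smaller `decay_steps` is worth trying if the decay is meant to matter within a short run.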

In [72]:
batch_size = 32
patch_size = 5
depth = 16
num_hidden = 64
beta = 1e-5

graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size, num_channels))
  tf_train_labels  = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset  = tf.constant(test_dataset)

  # Variables.
  W1 = tf.Variable(tf.truncated_normal([patch_size, patch_size, num_channels, depth], stddev=0.1))
  b1 = tf.Variable(tf.zeros([depth]))

  W2 = tf.Variable(tf.truncated_normal([patch_size, patch_size, depth, depth], stddev=0.1))
  b2 = tf.Variable(tf.constant(1.0, shape=[depth]))

  W3 = tf.Variable(tf.truncated_normal([(image_size // 4) * (image_size // 4) * depth, num_hidden], stddev=0.1))
  b3 = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  
  W4 = tf.Variable(tf.truncated_normal([num_hidden, num_hidden], stddev=0.1))
  b4 = tf.Variable(tf.constant(1.0, shape=[num_hidden]))  
    
  W5 = tf.Variable(tf.truncated_normal([num_hidden, num_labels], stddev=0.1))
  b5 = tf.Variable(tf.constant(1.0, shape=[num_labels]))

  

  # Model with dropout on the fully connected layers, used for training.
  def model_tryhard_dropout(data):
    # conv2d: input, filter/kernel, strides, padding. filter is [height, width, in_depth, out_depth], strides is [batch, height, width, channels].
    # max_pool: value, ksize = size of the pooling window, strides = how far the window moves at each step.
    conv = tf.nn.relu(tf.nn.conv2d(data, W1, [1, 1, 1, 1], padding='SAME') + b1)
    pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    
    
    conv = tf.nn.relu(tf.nn.conv2d(pool, W2, [1, 1, 1, 1], padding='SAME') + b2)
    pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    
    # Flatten the feature maps to feed the fully connected layers.
    shape   = pool.get_shape().as_list()
    reshape = tf.reshape(pool, [shape[0], shape[1] * shape[2] * shape[3]])
    
    hidden  = tf.nn.dropout(tf.nn.relu(tf.matmul(reshape, W3) + b3), keep_prob = 0.7)
    hidden  = tf.nn.dropout(tf.nn.relu(tf.matmul(hidden,  W4) + b4), keep_prob = 0.7)
    return tf.matmul(hidden, W5) + b5

  # Same model without dropout, used for validation and test predictions.
  def model_tryhard(data):
    # conv2d: input, filter/kernel, strides, padding. filter is [height, width, in_depth, out_depth], strides is [batch, height, width, channels].
    # max_pool: value, ksize = size of the pooling window, strides = how far the window moves at each step.
    conv = tf.nn.relu(tf.nn.conv2d(data, W1, [1, 1, 1, 1], padding='SAME') + b1)
    pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    
    
    conv = tf.nn.relu(tf.nn.conv2d(pool, W2, [1, 1, 1, 1], padding='SAME') + b2)
    pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
    
    # Flatten the feature maps to feed the fully connected layers.
    shape   = pool.get_shape().as_list()
    reshape = tf.reshape(pool, [shape[0], shape[1] * shape[2] * shape[3]])
    
    hidden  = tf.nn.relu(tf.matmul(reshape, W3) + b3)
    hidden  = tf.nn.relu(tf.matmul(hidden, W4) + b4)
    return tf.matmul(hidden, W5) + b5

  logits = model_tryhard_dropout(tf_train_dataset)
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
  loss += beta*(tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2) + tf.nn.l2_loss(W3) + tf.nn.l2_loss(W4) + tf.nn.l2_loss(W5))
    
  # Optimizer.
  global_step = tf.Variable(0, trainable=False)
  learning_rate = tf.train.exponential_decay(0.05, global_step, 100000, 0.96)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=global_step)
  
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_prediction = tf.nn.softmax(model_tryhard(tf_valid_dataset))
  test_prediction  = tf.nn.softmax(model_tryhard(tf_test_dataset))
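
The graph above defines two versions of the model because dropout should only be active during training. `tf.nn.dropout` keeps each activation with probability `keep_prob` and scales the survivors by `1 / keep_prob`, so the expected activation is unchanged and the evaluation model can reuse the same weights without rescaling. A NumPy sketch of that behaviour (the `dropout` helper and the fixed seed are assumptions for illustration):

```python
import numpy as np

def dropout(activations, keep_prob, rng):
  # Keep each unit with probability keep_prob and scale survivors by
  # 1 / keep_prob, so the expected value of each activation is unchanged.
  mask = rng.uniform(size=activations.shape) < keep_prob
  return activations * mask / keep_prob

rng = np.random.RandomState(0)  # fixed seed, only so the sketch is repeatable
hidden = np.ones((1000, 64), dtype=np.float32)

dropped = dropout(hidden, keep_prob=0.7, rng=rng)
print(dropped.mean())         # close to 1.0 on average
print((dropped == 0).mean())  # roughly 0.3 of the units are zeroed
```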
  

In [73]:
num_steps = 1001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 50 == 0):
      print('Minibatch loss at step %d: %f' % (step, l))
      print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
      print('Validation accuracy: %.1f%%' % accuracy(
        valid_prediction.eval(), valid_labels))
  print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))

Initialized
Minibatch loss at step 0: 3.520660
Minibatch accuracy: 15.6%
Validation accuracy: 10.0%
Minibatch loss at step 50: 2.295704
Minibatch accuracy: 6.2%
Validation accuracy: 16.9%
Minibatch loss at step 100: 1.551952
Minibatch accuracy: 37.5%
Validation accuracy: 64.2%
Minibatch loss at step 150: 1.331701
Minibatch accuracy: 59.4%
Validation accuracy: 71.6%
Minibatch loss at step 200: 1.040500
Minibatch accuracy: 62.5%
Validation accuracy: 78.8%
Minibatch loss at step 250: 0.772387
Minibatch accuracy: 78.1%
Validation accuracy: 77.1%
Minibatch loss at step 300: 1.348500
Minibatch accuracy: 59.4%
Validation accuracy: 76.8%
Minibatch loss at step 350: 1.025625
Minibatch accuracy: 68.8%
Validation accuracy: 81.1%
Minibatch loss at step 400: 0.623338
Minibatch accuracy: 87.5%


KeyboardInterrupt: 

---
Notes and summary
---
Knowing the structure of the data helps: a network improves greatly when it can assume something about that structure, because it does not have to learn it itself. For example, color does not matter when recognizing letters, so grayscale input is enough.

Another example is translation invariance: an object is the same object wherever it appears in the image. Likewise, in text, a word learned once does not need to be relearned for every position in which it occurs.

This is achieved with weight sharing: when two inputs contain the same kind of information, they share their weights.

Convolutional networks (convnets) share their parameters across space.

The input is an image with a width, a height, and a depth (the color channels). A small neural network with K outputs is slid across the input, as you would slide a convolution kernel. The output is another image with a different width and height and a depth of K. This operation is called a convolution. Instead of stacks of matrix-multiply layers, a convnet has stacks of convolutions that form a pyramid: it starts with an input of large spatial area and small depth, and each convolution progressively reduces the spatial dimensions while increasing the depth.

- Patch (or kernel): the size of the area that is slid across the image.
- Depth: each channel of the output is called a feature map.
- Stride: the number of pixels the filter shifts each time it moves.
- Padding: how the edges of the image are handled (valid padding, zero padding, ...).
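
These terms can be tied together with a naive convolution written in NumPy. This is only a sketch for reading, not how TensorFlow implements `conv2d`; it uses valid padding (no padding at all), unlike the `'SAME'` padding in the cells above, and `conv2d_valid` is a made-up helper name:

```python
import numpy as np

def conv2d_valid(image, kernels, stride):
  # image:   [height, width, in_depth]
  # kernels: [patch, patch, in_depth, out_depth]
  # The same kernels are reused at every position: this is the weight sharing.
  h, w, _ = image.shape
  patch, _, _, out_depth = kernels.shape
  out_h = (h - patch) // stride + 1
  out_w = (w - patch) // stride + 1
  output = np.zeros((out_h, out_w, out_depth), dtype=image.dtype)
  for i in range(out_h):
    for j in range(out_w):
      window = image[i * stride:i * stride + patch,
                     j * stride:j * stride + patch, :]
      # Contract the window against the kernels: one number per feature map.
      output[i, j, :] = np.tensordot(window, kernels, axes=([0, 1, 2], [0, 1, 2]))
  return output

rng = np.random.RandomState(0)
image = rng.rand(28, 28, 1).astype(np.float32)      # one grayscale image
kernels = rng.rand(5, 5, 1, 16).astype(np.float32)  # patch 5, 16 feature maps

features = conv2d_valid(image, kernels, stride=2)
print(features.shape)  # (12, 12, 16)
print(kernels.size)    # 400
```

The layer has 400 weights however large the image is, and the output is spatially smaller but deeper than the input, which is the pyramid shape described above.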



---