# TensorFlow Assignment: Multilayer Perceptron (MLP) and Convolutional Neural Network (CNN)

**[Duke Community Standard](http://integrity.duke.edu/standard.html): By typing your name below, you are certifying that you have adhered to the Duke Community Standard in completing this assignment.**

Name: [Kun Xu（许堃）]

Now that you've run through a simple logistic regression model on MNIST, let's see if we can do better (Hint: we can). For this assignment, you'll build a multilayer perceptron (MLP) and a convolutional neural network (CNN), two popular types of neural networks, and compare their performance. Some potentially useful code:

In [1]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Import data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [2]:
# Helper functions for creating weight variables
def weight_variable(shape):
    """weight_variable generates a weight variable of a given shape."""
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    """bias_variable generates a bias variable of a given shape."""
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

# TensorFlow functions that might also be of interest:
# tf.nn.sigmoid()
# tf.nn.relu()

### Multilayer Perceptron

Build a multilayer perceptron for MNIST digit classification. Feel free to play around with the model architecture and see how the training time/performance changes, but to begin, try the following:

Image -> fully connected (500 hidden units) -> nonlinearity (Sigmoid/ReLU) -> fully connected (10 hidden units) -> softmax

Skeleton framework for you to fill in (Code you need to provide is marked by `###`):

In [27]:
# Model Inputs
x = tf.placeholder(tf.float32, [None, 784])
y_ =  tf.placeholder(tf.float32, [None, 10])
# Define the graph

activation = tf.nn.sigmoid
### Create your MLP here##
w1, b1 = weight_variable([784, 500]), bias_variable([1, 500])
ly_x = tf.matmul(x, w1) + b1
ly_x = activation(ly_x)

w2, b2 = weight_variable([500, 10]), bias_variable([1, 10])
y_mlp = tf.matmul(ly_x, w2) + b2
### Make sure to name your MLP output as y_mlp ###




# Loss 
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_mlp))

# Optimizer
train_step = tf.train.GradientDescentOptimizer(0.2).minimize(cross_entropy, var_list=[w1, b1, w2, b2])
# train_step = tf.train.MomentumOptimizer(0.1, 0.9).minimize(cross_entropy, var_list=[w1, b1, w2, b2])

# Evaluation
correct_prediction = tf.equal(tf.argmax(y_mlp, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
    # Initialize all variables
    sess.run(tf.global_variables_initializer())
    
    # Training regimen
    for i in range(8000):
        # train_x, train_y = mnist.train.next_batch(100)
        # feed_dict = {x: train_x, y_: train_y}
        # sess.run([train_step], feed_dict=feed_dict)
        
        # Validate every 500th batch
        if i % 500 == 0:
            # Average accuracy over 10 validation minibatches
            validation_accuracy = 0
            for v in range(10):
                batch = mnist.validation.next_batch(100)
                validation_accuracy += (1./10) * accuracy.eval(feed_dict={x: batch[0], y_: batch[1]})
            print('step %d, validation accuracy %g' % (i, validation_accuracy))
            # Estimate training accuracy the same way, over 10 training minibatches
            train_accuracy = 0
            for v in range(10):
                batch = mnist.train.next_batch(100)
                train_accuracy += (1./10) * accuracy.eval(feed_dict={x: batch[0], y_: batch[1]})
            print('step %d, train accuracy %g' % (i, train_accuracy))
        
        # Train    
        batch = mnist.train.next_batch(50)
        train_step.run(feed_dict={x: batch[0], y_: batch[1]})
        # print(cross_entropy.eval(feed_dict={x: batch[0], y_: batch[1]}))

    print('test accuracy %g' % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

step 0, validation accuracy 0.103
step 0, train accuracy 0.08
step 500, validation accuracy 0.901
step 500, train accuracy 0.892
step 1000, validation accuracy 0.918
step 1000, train accuracy 0.908
step 1500, validation accuracy 0.917
step 1500, train accuracy 0.92
step 2000, validation accuracy 0.926
step 2000, train accuracy 0.911
step 2500, validation accuracy 0.925
step 2500, train accuracy 0.921
step 3000, validation accuracy 0.924
step 3000, train accuracy 0.924
step 3500, validation accuracy 0.943
step 3500, train accuracy 0.925
step 4000, validation accuracy 0.94
step 4000, train accuracy 0.939
step 4500, validation accuracy 0.935
step 4500, train accuracy 0.934
step 5000, validation accuracy 0.942
step 5000, train accuracy 0.941
step 5500, validation accuracy 0.944
step 5500, train accuracy 0.942
step 6000, validation accuracy 0.963
step 6000, train accuracy 0.944
step 6500, validation accuracy 0.952
step 6500, train accuracy 0.953
step 7000, validation accuracy 0.955
step 700

#### Comparison

How do the sigmoid and rectified linear unit (ReLU) compare?

***

Overall, the ReLU network converges faster and reaches a better final accuracy. This suggests that a network with ReLU activations is easier to train because its gradients are far less prone to vanishing: the ReLU derivative is 1 for every positive input, whereas the sigmoid derivative never exceeds 0.25.

The same conclusion holds when I switch to the momentum optimizer. Hence, the ReLU activation substantially mitigates vanishing gradients and improves the performance of the network.

In addition, even with 8k training iterations, the sigmoid network still does not fit the training data completely, i.e., its training accuracy stays below 100%, whereas the ReLU network achieves better than 99% accuracy. This also indicates that the ReLU network is easier to train and has a greater capacity to model the data.
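
To make the vanishing-gradient argument concrete, here is a minimal sketch in plain Python (no TensorFlow). It multiplies the activation derivatives along a single path through several layers. The pre-activation values in `path` are made up for illustration, and the weights are ignored, so this only shows the factor contributed by the activation function itself, not the full backpropagated gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chained_gradient(grad_fn, pre_activations):
    """Product of the activation derivatives along one path through the layers."""
    product = 1.0
    for z in pre_activations:
        product *= grad_fn(z)
    return product

# Hypothetical pre-activation values, one per layer (assumed, not measured)
path = [0.5, 1.0, 2.0, 0.5, 1.5]
sigmoid_factor = chained_gradient(sigmoid_grad, path)
relu_factor = chained_gradient(relu_grad, path)

print("sigmoid factor:", sigmoid_factor)  # shrinks geometrically with depth
print("relu factor:", relu_factor)        # stays at 1 while units are active
```

With only one hidden layer, as in the MLP above, the effect is mild; it compounds as more sigmoid layers are stacked.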

***

### Convolutional Neural Network

Build a simple 2-layer CNN for MNIST digit classification. Feel free to play around with the model architecture and see how the training time/performance changes, but to begin, try the following:

Image -> CNN (32 5x5 filters) -> nonlinearity (ReLU) ->  (2x2 max pool) -> CNN (64 5x5 filters) -> nonlinearity (ReLU) -> (2x2 max pool) -> fully connected (1024 hidden units) -> nonlinearity (ReLU) -> fully connected (10 hidden units) -> softmax

Some additional functions that you might find helpful:

In [22]:
# Convolutional neural network functions
def conv2d(x, W):
    """conv2d returns a 2d convolution layer with full stride."""
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    """max_pool_2x2 downsamples a feature map by 2X."""
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# TensorFlow function that might also be of interest:
# tf.reshape()

Skeleton framework for you to fill in (Code you need to provide is marked by `###`):

*Hint: Convolutional Neural Networks are spatial models. You'll want to transform the flattened MNIST vectors into images for the CNN. Similarly, you might want to flatten it again sometime before you do a softmax. You also might want to look into the  [conv2d() documentation](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d) to see what shape kernel/filter it's expecting.*
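
The shape bookkeeping behind this hint can be checked without TensorFlow. The sketch below assumes the helpers above (`padding='SAME'`, stride 1 for convolutions, stride 2 for pooling), in which case the spatial output size is `ceil(size / stride)`. The function name `same_padding_output` is made up for this illustration.

```python
def same_padding_output(size, stride):
    """Spatial output size under 'SAME' padding: ceil(size / stride)."""
    return -(-size // stride)

# tf.nn.conv2d expects filters shaped [height, width, in_channels, out_channels]
conv1_filter = [5, 5, 1, 32]    # grayscale input -> 32 feature maps
conv2_filter = [5, 5, 32, 64]   # 32 feature maps -> 64 feature maps

after_conv1 = same_padding_output(28, 1)           # convolution keeps 28x28
after_pool1 = same_padding_output(after_conv1, 2)  # 2x2 max pool halves it
after_conv2 = same_padding_output(after_pool1, 1)  # convolution keeps the size
after_pool2 = same_padding_output(after_conv2, 2)  # 2x2 max pool halves it again

# Size of each flattened example before the first fully connected layer
flat_size = after_pool2 * after_pool2 * conv2_filter[3]

print(after_conv1, after_pool1, after_conv2, after_pool2)  # 28 14 14 7
print(flat_size)                                           # 3136
```

So the flattened MNIST vector is reshaped to `[-1, 28, 28, 1]` on the way in, and to `[-1, 7*7*64]` before the fully connected layers.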

In [23]:
# Model Inputs
x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

# Define the graph
ly_x = tf.reshape(x, [-1, 28, 28, 1])
w1, b1 = weight_variable([5, 5, 1, 32]), bias_variable([1, 1, 1, 32])
ly_x = conv2d(ly_x, w1) + b1
ly_x = tf.nn.relu(ly_x)
ly_x = max_pool_2x2(ly_x)

w2, b2 = weight_variable([5, 5, 32, 64]), bias_variable([1, 1, 1, 64])
ly_x = conv2d(ly_x, w2) + b2
ly_x = tf.nn.relu(ly_x)
ly_x = max_pool_2x2(ly_x)

ly_x = tf.reshape(ly_x, [-1, 7*7*64])
w3, b3 = weight_variable([7*7*64, 1024]), bias_variable([1, 1024])
ly_x = tf.matmul(ly_x, w3) + b3
ly_x = tf.nn.relu(ly_x)

w4, b4 = weight_variable([1024, 10]), bias_variable([1, 10])
ly_x = tf.matmul(ly_x, w4) + b4

y_conv = ly_x

### Create your CNN here##
### Make sure to name your CNN output as y_conv ###



# Loss 
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))

# Optimizer
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

# Evaluation
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
    # Initialize all variables
    sess.run(tf.global_variables_initializer())
    
    # Training regimen
    for i in range(10000):
        # Validate every 250th batch
        if i % 250 == 0:
            validation_accuracy = 0
            for v in range(10):
                batch = mnist.validation.next_batch(50)
                validation_accuracy += (1./10) * accuracy.eval(feed_dict={x: batch[0], y_: batch[1]})
            print('step %d, validation accuracy %g' % (i, validation_accuracy))
        
        # Train    
        batch = mnist.train.next_batch(50)
        train_step.run(feed_dict={x: batch[0], y_: batch[1]})

    print('test accuracy %g' % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

step 0, validation accuracy 0.076
step 250, validation accuracy 0.922
step 500, validation accuracy 0.96
step 750, validation accuracy 0.954
step 1000, validation accuracy 0.966
step 1250, validation accuracy 0.976
step 1500, validation accuracy 0.972
step 1750, validation accuracy 0.978
step 2000, validation accuracy 0.984
step 2250, validation accuracy 0.976
step 2500, validation accuracy 0.986
step 2750, validation accuracy 0.978
step 3000, validation accuracy 0.988
step 3250, validation accuracy 0.978
step 3500, validation accuracy 0.982
step 3750, validation accuracy 0.992
step 4000, validation accuracy 0.982
step 4250, validation accuracy 0.978
step 4500, validation accuracy 0.986
step 4750, validation accuracy 0.99
step 5000, validation accuracy 0.98
step 5250, validation accuracy 0.976
step 5500, validation accuracy 0.984
step 5750, validation accuracy 0.996
step 6000, validation accuracy 0.992
step 6250, validation accuracy 0.986
step 6500, validation accuracy 0.992
step 6750,

ResourceExhaustedError: OOM when allocating tensor with shape[10000,28,28,32]
	 [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape_55, Variable_74/read)]]

Caused by op u'Conv2D', defined at:
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2717, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2821, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-23-53c32d970e7d>", line 8, in <module>
    ly_x = conv2d(ly_x, w1) + b1
  File "<ipython-input-22-7dd13cf9c0ee>", line 4, in conv2d
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 399, in conv2d
    data_format=data_format, name=name)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/kunxu/ENV/envs/tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,28,28,32]
	 [[Node: Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Reshape_55, Variable_74/read)]]
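
The tensor shape in the error, `[10000,28,28,32]`, matches the output of the first convolution when all 10,000 test images are fed in a single call, which is what the final `accuracy.eval` does. One way around it is to evaluate the test set in smaller batches and average the results. The sketch below is an assumption about how that could look, not part of the original run: `batched_accuracy` and `eval_batch` are hypothetical names, and the stand-in data only demonstrates the averaging logic.

```python
def batched_accuracy(eval_batch, images, labels, batch_size=500):
    """Accuracy over a full dataset, evaluated batch_size examples at a time.

    eval_batch(images, labels) is a hypothetical callback that returns the
    accuracy on one batch. With the graph above it could wrap
    accuracy.eval(feed_dict={x: images, y_: labels}).
    """
    n = len(images)
    correct = 0.0
    for start in range(0, n, batch_size):
        xb = images[start:start + batch_size]
        yb = labels[start:start + batch_size]
        # Weight by batch length so a smaller last batch is handled correctly
        correct += eval_batch(xb, yb) * len(xb)
    return correct / n

# Memory needed for the failing float32 tensor alone (4 bytes per element)
bytes_needed = 10000 * 28 * 28 * 32 * 4
print(bytes_needed / 1e9, "GB")  # about 1 GB for a single activation tensor

# Stand-in data: the "model" predicts the input itself, labels differ on every 10th item
images = list(range(1050))
labels = [i if i % 10 else -1 for i in images]

def fake_eval(xb, yb):
    return sum(1 for a, b in zip(xb, yb) if a == b) / float(len(xb))

overall = batched_accuracy(fake_eval, images, labels)
print(overall)
```

Weighting each batch by its length matters only when the dataset size is not a multiple of `batch_size`, as in the stand-in data here.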


Some differences from the logistic regression model to note:

- The CNN model might take a while to train. Depending on your machine, you might expect this to take up to half an hour. If you see your validation performance start to plateau, you can kill the training.

- The logistic regression model we used previously was pretty basic, and as such, we were able to get away with using the GradientDescentOptimizer, which implements the gradient descent algorithm. For more difficult optimization spaces (such as the ones deep networks pose), we might want to use more sophisticated algorithms. Prof. David Carlson has a lecture on this later.
    
- Because of the larger size of our network, notice that our minibatch size has shrunk.
    
- We've added a validation step every 250 minibatches. This lets us see how our model is doing during the training process, rather than sit around twiddling our thumbs and hoping for the best when training finishes. This becomes especially significant as training regimens start approaching days and weeks in length. Normally, we validate on the entire validation set, but for the sake of time we'll just stick to 10 validation minibatches (500 images) for this homework assignment.
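
As a small illustration of why a "more sophisticated" optimizer can help, the sketch below compares plain gradient descent with a momentum update on a one-dimensional quadratic with low curvature. The CNN cell uses Adam; momentum is shown here only because it is the simplest extension and appears, commented out, in the MLP cell. The objective, curvature, and step count are assumptions chosen for the illustration, and the update rule is one common formulation of momentum.

```python
def gradient(w, curvature):
    """Gradient of the toy objective f(w) = 0.5 * curvature * w**2."""
    return curvature * w

def run_gradient_descent(w, learning_rate, curvature, steps):
    for _ in range(steps):
        w -= learning_rate * gradient(w, curvature)
    return w

def run_momentum(w, learning_rate, momentum, curvature, steps):
    velocity = 0.0
    for _ in range(steps):
        # Accumulate a decaying sum of past gradients, then step along it
        velocity = momentum * velocity + gradient(w, curvature)
        w -= learning_rate * velocity
    return w

# Assumed toy settings: a flat direction (low curvature) where plain GD crawls
w_gd = run_gradient_descent(1.0, learning_rate=0.1, curvature=0.1, steps=50)
w_momentum = run_momentum(1.0, learning_rate=0.1, momentum=0.9, curvature=0.1, steps=50)

print("gradient descent:", w_gd)
print("momentum:", w_momentum)
```

After the same number of steps, the momentum run ends much closer to the minimum at `w = 0`, because consecutive gradients pointing the same way reinforce each other.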

#### Comparison

How do the MLP and CNN compare in accuracy? Training time? Why would you use one vs the other? Is there a problem you see with MLPs when applied to other image datasets?

***

The CNN performs much better than the MLP and takes almost the same training time (on a Titan X, compared with 8k iterations for the MLP). However, the CNN contains many more parameters, so a direct comparison of the two models is not entirely fair.

CNNs are good at extracting visual features and perform very well on image classification tasks, so for visual tasks a CNN is a strong choice. Other kinds of networks, such as RNNs or LSTMs, mostly use fully connected layers rather than convolutional layers. A CNN is therefore more constrained in where it can be applied, but for visual tasks it would be my first choice.

On datasets more complicated than MNIST, such as CIFAR-10 or ImageNet, the MLP falls far behind the CNN, because it cannot share parameters across spatial positions of the feature map.
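
The parameter counts behind this comparison can be worked out directly from the layer shapes used in the two models above. The helper names `dense_params` and `conv_params` are made up for this sketch. The last two lines are an added illustration of parameter sharing, assuming a 32x32 RGB input as in CIFAR-10.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(k_h, k_w, c_in, c_out):
    """Weights plus biases of a convolution; independent of the image size."""
    return k_h * k_w * c_in * c_out + c_out

# MLP from this assignment: 784 -> 500 -> 10
mlp_total = dense_params(784, 500) + dense_params(500, 10)

# CNN from this assignment: two 5x5 convolutions, then 7*7*64 -> 1024 -> 10
cnn_total = (conv_params(5, 5, 1, 32)
             + conv_params(5, 5, 32, 64)
             + dense_params(7 * 7 * 64, 1024)
             + dense_params(1024, 10))

print(mlp_total)  # 397510
print(cnn_total)  # 3274634

# Assumed 32x32x3 input: the first dense layer grows with the image,
# while a first 5x5 convolution only grows with the channel count.
dense_first_layer = dense_params(32 * 32 * 3, 500)
conv_first_layer = conv_params(5, 5, 3, 32)
print(dense_first_layer, conv_first_layer)  # 1536500 2432
```

Most of the CNN's parameters sit in its first fully connected layer; the two convolutional layers together account for only about 52k of them.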

***