# Deep Learning on Image Processing: Vanilla CNN and AlexNet

(This is a first draft, not the final version; the tutorial will improve over time.)

## Introduction

In this tutorial we dive into deep learning and use it to solve an image classification problem. We build two models based on one of the most popular types of neural network, the Convolutional Neural Network (CNN): a vanilla version first, then AlexNet.

##  TensorFlow
Google's TensorFlow is currently the most popular deep learning repository on GitHub. It generally has a shorter graph compile time than other mainstream deep learning frameworks, and its computational graphs can be distributed across a cluster. We are not going to try every deep learning framework here (Theano, Torch, etc.). Models built with TensorFlow have also performed well on ImageNet.

### Installation
Detailed steps for installation can be found in the link below:

https://www.tensorflow.org/versions/r0.11/get_started/os_setup.html

In [None]:
import tensorflow as tf
import numpy as np

## Training/Testing data
We start with the digit recognition problem because each epoch finishes much faster than with large images. Each example in the MNIST data set has two parts: an image of a handwritten digit and a corresponding label.
<img src ="http://yann.lecun.com/exdb/lenet/gifs/asamples.gif">
TensorFlow makes it easy to load the digit recognition data through tensorflow.examples.tutorials.mnist.


The MNIST data is split into three parts: 55,000 data points of training data (mnist.train), 10,000 points of test data (mnist.test), and 5,000 points of validation data (mnist.validation).

Each image is 28 pixels by 28 pixels. We can interpret this as a big array of numbers:
<img src ="https://www.tensorflow.org/versions/r0.11/images/MNIST-Matrix.png">
Later on we can process larger images such as CIFAR-10.

Preview:
<img src = "http://karpathy.github.io/assets/cifar_preview.png">

A little history: the handwritten digit recognition problem can be traced back to the work of Yann LeCun at AT&T Bell Labs, which can be found here:
http://yann.lecun.com/exdb/lenet/ (LeNet)

##### One Hot Representation:
Each label is a one-hot 10-dimensional vector indicating which digit class (0 through 9) the corresponding MNIST image belongs to: the entry at the digit's index is 1 and all other entries are 0. One-hot vectors are also used in NLP, but they are definitely not the best word representation because they are very sparse.
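
To make the encoding concrete, here is a minimal NumPy sketch of one-hot labels. The helper name one_hot is hypothetical and only for illustration; read_data_sets with one_hot=True already does this for MNIST.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

digits = [3, 0, 9]
encoded = one_hot(digits)
print(encoded[0])               # digit 3 -> 1.0 at index 3, 0.0 elsewhere
print(encoded.argmax(axis=1))   # argmax recovers the original digits
```

Taking the argmax of a one-hot row gives back the digit, which is the same trick the evaluation code uses to compare predictions with labels.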

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

##  Parameters
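The variables are explained one by one below; some of them will only make sense later, once we have explored the model:

* training_iters : upper bound on the number of training examples to process; every mini batch goes through one forward propagation and one backward propagation pass
* batch_size : number of examples per mini batch; each training step estimates the gradient from a small batch instead of the whole training set
* img_vec_size : each MNIST image is 28 by 28 pixels, flattened into a vector of 784 values
* num_class : 10 classes, one for each digit 0-9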


In [None]:
training_iters = 200000
batch_size = 100
img_vec_size = 784  # MNIST img 28*28
num_class = 10      # MNIST classes 0-9
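
As a rough illustration of mini batch training, the sketch below shuffles a data set and walks through it in batches, then works out how many epochs the training loop of this notebook corresponds to. The helper iterate_minibatches and the random stand-in data are assumptions for illustration; the tutorial itself relies on mnist.train.next_batch.

```python
import numpy as np

def iterate_minibatches(images, labels, batch_size, rng):
    """Yield shuffled (images, labels) mini batches covering the data once (one epoch)."""
    order = rng.permutation(len(images))
    for start in range(0, len(images), batch_size):
        idx = order[start:start + batch_size]
        yield images[idx], labels[idx]

rng = np.random.RandomState(0)
fake_images = rng.rand(1000, 784).astype(np.float32)   # stand-in for MNIST images
fake_labels = rng.randint(0, 10, size=1000)            # stand-in for digit labels

batches = list(iterate_minibatches(fake_images, fake_labels, batch_size=100, rng=rng))
print(len(batches), batches[0][0].shape)

# With 55,000 training images, 20,000 steps of 128 images each is roughly:
epochs = 20000 * 128 / 55000.0
print(round(epochs, 1))
```

One epoch is a full pass over the training set, so the number of epochs is simply the number of processed examples divided by the training set size.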

### Placeholder
The placeholder is a central concept in TensorFlow: it is a value that we supply when we ask TensorFlow to run a computation. By creating nodes for the input images and the target output classes, we start building the computation graph inside TensorFlow. We assign a shape to both x and y_; the None in the first dimension lets the batch size vary.

### Concept of Dropout:
We will apply dropout to our CNN, intentionally "forgetting" some of the neurons in a layer during training, in order to:
1. prevent the network from overfitting
2. get an easy form of ensemble learning inside a neural network; a good analogy is the Random Forest, which combines the results of weak learners (individual decision trees)
3. in principle make forward propagation a bit cheaper, since fewer neurons are active (tf.nn.dropout only masks values, so do not expect a real speedup here)
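
The idea can be sketched in a few lines of NumPy. This is an illustrative version of "inverted" dropout, the scheme tf.nn.dropout follows (kept values are scaled by 1 / keep_prob); the function name and test data are assumptions.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and
    scale the survivors by 1 / keep_prob so the expected value is unchanged."""
    mask = rng.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
layer_output = np.ones((1000, 100), dtype=np.float32)

dropped = dropout(layer_output, keep_prob=0.75, rng=rng)
kept_fraction = (dropped != 0).mean()
print(kept_fraction)     # close to 0.75
print(dropped.mean())    # close to 1.0, the mean of the input
```

Because of the rescaling, nothing needs to change at test time: feeding keep_prob = 1.0 simply disables the mask.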

In [None]:
x = tf.placeholder(tf.float32, [None, img_vec_size])
y_ = tf.placeholder(tf.float32, [None, num_class])

#   Dropout to :    prevent overfit
#                   "Ensemble Learning"
#                   A bit faster fp
keep_prob = tf.placeholder(tf.float32) # probability of keeping a neuron (dropout)

## Initial Weight and Bias
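We do not want the weights to be initialized to exactly zero: every neuron would then compute the same output and receive the same gradient.
We still want the weights to be very close to zero but random, so we draw them from a truncated Gaussian distribution with a standard deviation of 0.1.

It is very common to simply initialize the bias to 0. Since we use ReLU neurons, the code below uses a small positive bias (0.1) instead, so that neurons start out active.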
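
For intuition, a truncated normal initializer can be sketched in NumPy as follows. It mimics the documented behaviour of tf.truncated_normal (values more than two standard deviations from the mean are redrawn); the function below is an illustrative stand-in, not TensorFlow's implementation.

```python
import numpy as np

def truncated_normal(shape, stddev=0.1, rng=None):
    """Sample N(0, stddev^2) and redraw any value beyond two standard deviations."""
    if rng is None:
        rng = np.random.RandomState()
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    while out_of_range.any():
        values[out_of_range] = rng.normal(0.0, stddev, size=int(out_of_range.sum()))
        out_of_range = np.abs(values) > 2 * stddev
    return values

weights = truncated_normal([5, 5, 1, 32], stddev=0.1, rng=np.random.RandomState(0))
print(weights.shape)
print(np.abs(weights).max() <= 0.2)   # no weight is further than 2 * stddev from zero
```

Cutting off the tails keeps a few unusually large initial weights from dominating the first training steps.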

The concept of a filter (CONV) is explained below.

In [None]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

W_dict = {
    'W_conv1': weight_variable([5, 5, 1, 32]),      # 5x5 filter, 1 input channel, 32 output depth
    'W_conv2': weight_variable([5, 5, 32, 64]),     # 5x5 filter, 32 input channels, 64 output depth
    'W_fc1': weight_variable([7 * 7 * 64, 1024]),   # fc layer, 'vectorize' 7*7*64 inputs, 1024 outputs
    'W_fc2': weight_variable([1024, num_class])     # 1024 inputs, 10 output classes
}

b_dict = {
    'b_conv1': bias_variable([32]), # match the depth of convolution "cube"
    'b_conv2': bias_variable([64]),
    'b_fc1': bias_variable([1024]),
    'b_fc2': bias_variable([num_class])
}

##### Now, let's build our first CNN (vanilla version first) block by block :)

##  Convolutional Neural Net
This is a picture of a traditional deep neural network vs. a convolutional neural network:
<img src = "https://annalyzin.files.wordpress.com/2016/01/cnn_overview.png?w=459&h=236">
<img src = "http://i.stack.imgur.com/OH3gI.png">

Note that a CNN has fully connected layers only at the very end, whereas a traditional DNN uses fully connected layers everywhere.

## Important to remember:
1. Keep increasing the "depth" of the convolution "block" by stacking conv + ReLU
2. Shrink the surface area (width and height) by pooling

## Filter

A CONV filter is usually a small window of weights that scans across the image. By choosing a number of filters and stacking their outputs together, we get a "block" with depth. Filters are usually very small, such as 3x3 or 5x5; larger sizes are generally not recommended.

<img src= "http://i.stack.imgur.com/GvsBA.jpg">

## Convolution
Convolution is the process of sliding the filter across the image to get the raw "extracted features".
We zero-pad the border of the image so that the output block has the same width and height as the original picture (this is what padding='SAME' does in the code further down).
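
The sliding-window idea can be written out directly. The sketch below is a slow, single-channel, stride-1 illustration in NumPy with zero padding; it assumes odd filter sizes, and the function name is hypothetical. tf.nn.conv2d does the same kind of computation over many channels and filters at once.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Stride-1 sliding-window filter with zero padding so output size == input size.
    Like most deep learning libraries, this does not flip the kernel."""
    kh, kw = kernel.shape
    pad_h, pad_w = kh // 2, kw // 2          # assumes odd filter sizes such as 3 or 5
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)), mode="constant")
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            window = padded[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)   # weighted sum under the filter
    return out

image = np.arange(16, dtype=np.float64).reshape(4, 4)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0                 # a filter that copies the centre pixel
box = np.ones((3, 3)) / 9.0          # a 3x3 averaging filter

print(np.array_equal(conv2d_same(image, identity), image))
print(conv2d_same(image, box).shape)
```

In a CNN the filter weights are not hand-picked like these two; they are learned during training.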

<img src= "http://deeplearning.stanford.edu/wiki/images/6/6c/Convolution_schematic.gif">

## ReLU

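ReLU is the activation function applied to each neuron.
Sigmoid and tanh are alternatives, but they are prone to vanishing gradients and slow convergence.
Backpropagation suffers from a fundamental problem: the vanishing gradient.
During training, the gradient shrinks as it is propagated back through the net.
Because larger gradient values lead to faster training, the layers closest to the input layer take the longest to train. ReLU helps because its gradient is exactly 1 for positive inputs, so it does not shrink the gradient the way a saturating activation does.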
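
A small numeric sketch shows the effect. It is a deliberate simplification: it ignores the weights and multiplies only the activation derivatives along a chain of 10 layers, which is an assumption made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # at most 0.25, reached at z = 0

def relu_grad(z):
    return (z > 0).astype(np.float64)  # 1 for positive inputs, 0 otherwise

depth = 10
# Simplification: ignore the weights and multiply only the activation derivatives
sigmoid_chain = np.prod(sigmoid_grad(np.zeros(depth)))   # best case for sigmoid
relu_chain = np.prod(relu_grad(np.ones(depth)))          # all units active

print(sigmoid_chain)   # 0.25 ** 10, roughly 9.5e-07
print(relu_chain)      # 1.0
```

Even in the best case for sigmoid, ten layers shrink the signal by about six orders of magnitude, while active ReLU units pass it through unchanged.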

<img src= "http://deepdish.io/public/images/activation-functions.svg">

## Pooling

We usually use max pooling to keep the most representative feature (the largest activation) of each pooling window in the convolved block.
Max pooling is used more frequently than mean pooling.
Why pooling? Think about a distorted picture, a picture with a different color tone, or a handwritten digit with some stains: we can still extract features despite this external interference, since we only look for the most important features.
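
Here is a minimal NumPy sketch of non-overlapping max pooling on a single feature map. It assumes the height and width are divisible by the window size; the function name is hypothetical.

```python
import numpy as np

def max_pool(feature_map, n=2):
    """Non-overlapping n x n max pooling; assumes height and width are divisible by n."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // n, n, w // n, n)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 2, 0, 0],
                        [3, 4, 0, 1],
                        [0, 0, 5, 6],
                        [0, 9, 7, 8]], dtype=np.float32)
pooled = max_pool(feature_map)
print(pooled)   # [[4. 1.] [9. 8.]]

# Moving the strongest value inside its 2x2 window does not change the result
shifted = feature_map.copy()
shifted[0, 0], shifted[1, 1] = 4, 1
print(np.array_equal(max_pool(shifted), pooled))
```

The second check is the point of pooling: small shifts inside a window leave the pooled output unchanged.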




In [None]:
# Merge conv2d + bias and ReLU into one convolution step
def convolution(x, W, b):
    return tf.nn.relu(tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME') + b)

# Pooling block size n*n
def max_pooling(x, n):
    return tf.nn.max_pool(x, ksize=[1, n, n, 1], strides=[1, n, n, 1], padding='SAME')


## Structure of CNN
We use three main types of layers to build ConvNet architectures: 
1. Convolutional Layer 
2. Pooling Layer 
3. Fully-Connected Layer 

Typical workflow:
#### [INPUT - CONV - RELU - POOL - FC]

We can use the combination CONV - RELU - POOL as the major building block of the entire CNN, as long as the dimensions of consecutive blocks are compatible.
ConvNets transform the original image layer by layer, from the original pixel values to the final class scores.
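
To check that the block dimensions are compatible, we can trace the shapes by hand. The sketch below uses the rule for 'SAME' padding (output size = ceil(input size / stride)); the helper names are hypothetical, and the depths 32 and 64 are taken from W_conv1 and W_conv2.

```python
import math

def same_output_size(size, stride):
    """Spatial size after a 'SAME'-padded op: ceil(size / stride)."""
    return int(math.ceil(size / float(stride)))

def trace_shapes(input_size, depths, pool=2):
    """Follow (height, width, depth) through conv(stride 1) + pool(stride pool) blocks."""
    size = input_size
    shapes = []
    for depth in depths:
        size = same_output_size(size, 1)      # stride-1 convolution keeps the size
        size = same_output_size(size, pool)   # pooling halves it (rounding up)
        shapes.append((size, size, depth))
    return shapes

shapes = trace_shapes(28, depths=[32, 64])
print(shapes)    # [(14, 14, 32), (7, 7, 64)]
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(flat)      # 3136 == 7 * 7 * 64, the row count of W_fc1
```

This is where the 7 * 7 * 64 in W_fc1 comes from: two rounds of 2x2 pooling turn 28x28 into 7x7, with 64 feature maps.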


In [None]:
def cnn(X, weight, bias, dropout):
    # Reshape input image to 4-D: [batch, height, width, channels]
    X = tf.reshape(X, shape=[-1, 28, 28, 1])

    # Convolution Layer 1
    conv1 = convolution(X, weight['W_conv1'], bias['b_conv1'])
    # Max Pooling
    conv1 = max_pooling(conv1, n=2)
    # Dropout
    # conv1_drop = tf.nn.dropout(conv1, dropout)

    # Convolution Layer 2
    conv2 = convolution(conv1, weight['W_conv2'], bias['b_conv2'])
    # Max Pooling
    conv2 = max_pooling(conv2, n=2)
    # Dropout
    # conv2_drop = tf.nn.dropout(conv2, dropout)

    # Fully Connected Layer 1
    conv2flat = tf.reshape(conv2, [-1, weight['W_fc1'].get_shape().as_list()[0]]) 
    # Flatten conv2 so its column count matches the row count of W_fc1; same as tf.reshape(conv2, [-1, 7*7*64])
    fc1 = tf.nn.relu(tf.matmul(conv2flat, weight['W_fc1']) + bias['b_fc1']) # Relu activation
    fc1_drop = tf.nn.dropout(fc1, dropout)

    output = tf.nn.softmax(tf.matmul(fc1_drop, weight['W_fc2']) + bias['b_fc2'])
    return output

# Build CNN Graph
y_conv = cnn(x, W_dict, b_dict, keep_prob)


Training will take a while; use the GPU-enabled version of TensorFlow if you have the hardware for it.

## Evaluation:

A good function for measuring the loss of a model is "cross-entropy". Cross-entropy comes from thinking about information compressing codes in information theory, but it winds up being an important idea in lots of areas, from gambling to machine learning. We train by minimizing the cross-entropy with a gradient-based optimizer (Adam in the code below).
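
As a small worked example, the NumPy sketch below computes the cross-entropy for a confident correct prediction and for a uniform guess. The probabilities are made-up illustration values, and the small eps term is an assumption added to avoid log(0).

```python
import numpy as np

def cross_entropy(probs, one_hot_labels, eps=1e-10):
    """Mean over the batch of -sum(y * log(p)); eps guards against log(0)."""
    return float(np.mean(-np.sum(one_hot_labels * np.log(probs + eps), axis=1)))

labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
confident = np.array([[0.98, 0.01, 0.01],
                      [0.01, 0.98, 0.01]])
uniform = np.full((2, 3), 1.0 / 3.0)

loss_confident = cross_entropy(confident, labels)   # about 0.02
loss_uniform = cross_entropy(uniform, labels)       # about log(3) = 1.0986
print(loss_confident, loss_uniform)
```

The loss is small when the model puts high probability on the correct class and grows as the prediction becomes less certain, which is why minimizing it trains the classifier.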


In [None]:
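# Cost (clip the probabilities so that tf.log never sees 0, which would give NaN)
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)), reduction_indices=[1]))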
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

# Eval
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

#Train and Eval in session
with tf.Session() as sess:

    sess.run(tf.initialize_all_variables())

    for i in range(20000):
        batch = mnist.train.next_batch(128) # batch size is a power of 2 (may help parallel processing)
        if i%100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
            print("step %d, training accuracy %g"%(i, train_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.75}) # keep_prob should be closer to 0.5 for deeper nets

    print("test accuracy %g"%accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

## AlexNet

<img src="http://images.duanshishi.com/mac_blogs_alexnet_architecture.jpg">

Next, let's build an AlexNet in the next notebook.

Let's restart the kernel and clear its output.


## Deep Learning on Image Processing 2 of 2: AlexNet Version

## AlexNet

<img src="http://www.panderson.me/images/alexnet2012-small.png">

Next, let's build an AlexNet.

AlexNet was developed by Alex Krizhevsky together with Ilya Sutskever and Geoffrey Hinton, and it was the first work that popularized Convolutional Networks in Computer Vision. Compared to the vanilla CNN and LeNet, AlexNet is deeper and bigger, and it has convolutional layers stacked on top of each other. AlexNet won the 2012 ImageNet competition with a top-5 error rate of about 16%.

In this tutorial we shrink the input surface area from the original 224 by 224 to 28 by 28 in order to match the MNIST data set.

First, import required modules and load MNIST data as usual.

In [None]:
import tensorflow as tf
import numpy as np

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

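###  Parameters and TensorFlow placeholders

* learning_rate : step size for the optimizer
* training_iters : upper bound on the number of training examples to process; every mini batch goes through one forward propagation and one backward propagation pass
* batch_size : number of examples per mini batch
* display_step : print the batch accuracy every display_step training steps
* img_vec_size : each MNIST image is 28 by 28 pixels, flattened into a vector of 784 values
* num_class : 10 classes, one for each digit 0-9
* dropout : probability of keeping a neuron (the keep probability, not the fraction that is dropped)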

In [None]:
learning_rate = 0.001
training_iters = 200000
batch_size = 64
display_step = 20
img_vec_size = 784 #img 28*28
num_class = 10  
dropout = 0.8 # keep probability; for large images this should move toward 0.6, but 0.8 works well for MNIST

# tf Graph input
x = tf.placeholder(tf.float32, [None, img_vec_size])
y = tf.placeholder(tf.float32, [None, num_class])
keep_prob = tf.placeholder(tf.float32) # dropout (keep probability)

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

### Define weights and biases for AlexNet
* Keep increasing the "depth" of the convolution "block" by stacking conv + ReLU
* Shrink the surface area by pooling
* We use 4 max pooling layers here and 2 fully connected layers, followed by the output layer

In [None]:
# Alex Net W & b
W_dict = {
    'wc1': weight_variable([3, 3, 1, 64]),          # 3x3 filter, 1 input channel, 64 output depth
    'wc2': weight_variable([3, 3, 64, 128]),        # 3x3 filter, 64 input channels, 128 output depth
    'wc3': weight_variable([3, 3, 128, 256]),       # 3x3 filter, 128 input channels, 256 output depth
    'wc4': weight_variable([2, 2, 256, 512]),       # 2x2 filter (try to avoid 2x2 filters in a shallow net)
    'wfc1': weight_variable([2 * 2 * 512, 1024]),   # fc1: 28 -> 14 -> 7 -> 4 -> 2 after four poolings, so 2*2*512 inputs, 1024 outputs
    'wfc2': weight_variable([1024, 1024]),          # fc2: 1024 inputs, 1024 outputs
    'wDest': weight_variable([1024, num_class])     # output layer: 1024 inputs, 10 output classes
}

b_dict = {
    'bc1': bias_variable([64]),
    'bc2': bias_variable([128]),
    'bc3': bias_variable([256]),
    'bc4': bias_variable([512]),
    'bfc1': bias_variable([1024]),
    'bfc2': bias_variable([1024]),
    'bDest': bias_variable([num_class])
}
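
To get a feel for the size of this network, the sketch below counts the trainable parameters from the shapes in W_dict and b_dict. The shapes are copied by hand with num_class = 10, so this is plain arithmetic rather than anything read from the TensorFlow graph.

```python
import numpy as np

# Shapes copied from W_dict and b_dict (num_class = 10)
weight_shapes = {
    'wc1': [3, 3, 1, 64], 'wc2': [3, 3, 64, 128], 'wc3': [3, 3, 128, 256],
    'wc4': [2, 2, 256, 512], 'wfc1': [2 * 2 * 512, 1024],
    'wfc2': [1024, 1024], 'wDest': [1024, 10],
}
bias_sizes = {'bc1': 64, 'bc2': 128, 'bc3': 256, 'bc4': 512,
              'bfc1': 1024, 'bfc2': 1024, 'bDest': 10}

weight_counts = {name: int(np.prod(shape)) for name, shape in weight_shapes.items()}
total = sum(weight_counts.values()) + sum(bias_sizes.values())
fc_weights = weight_counts['wfc1'] + weight_counts['wfc2'] + weight_counts['wDest']

print(total)                                  # 4052490
print(round(fc_weights / float(total), 2))    # 0.78
```

Roughly three quarters of the parameters sit in the fully connected layers, even though most of the computation happens in the convolutions.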

Then we define the conv and pool steps as we did in the last notebook (1 of 2), and add a normalization step for faster convergence.

In [None]:
# Merge conv2d + bias and ReLU into one convolution step
def convolution(name, x, W, b):
    return tf.nn.relu(tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME') + b, name=name)

# Pooling block size n*n
def max_pooling(name, x, n):
    return tf.nn.max_pool(x, ksize=[1, n, n, 1], strides=[1, n, n, 1], padding='SAME', name=name)

# Parameters follow an existing AlexNet implementation
def norm(name, poolingRes, depth_radius=4):
    # Local Response Normalization: divide each value by a term built from the
    # sum of squares of its neighbours within depth_radius along the depth axis
    return tf.nn.lrn(poolingRes, depth_radius, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name=name)
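
For reference, here is a NumPy sketch of the formula documented for tf.nn.lrn: output = input / (bias + alpha * sqr_sum) ** beta, where sqr_sum adds the squared activations within depth_radius along the depth axis. The loop-based function is an illustration under that assumption, not TensorFlow's implementation.

```python
import numpy as np

def local_response_norm(x, depth_radius=4, bias=1.0, alpha=0.001 / 9.0, beta=0.75):
    """x has shape (batch, height, width, depth); each value is divided by a term
    built from the squared activations of its neighbours along the depth axis."""
    depth = x.shape[-1]
    sqr_sum = np.zeros_like(x)
    for d in range(depth):
        lo = max(0, d - depth_radius)
        hi = min(depth, d + depth_radius + 1)
        sqr_sum[..., d] = np.sum(x[..., lo:hi] ** 2, axis=-1)
    return x / (bias + alpha * sqr_sum) ** beta

x = np.random.RandomState(0).rand(2, 4, 4, 16)
normed = local_response_norm(x)
print(normed.shape)
print(np.all(normed <= x))   # with bias = 1.0 the divisor is never below 1
```

With the small alpha used here the normalization is gentle: it mostly damps positions where many neighbouring feature maps fire strongly at once.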

## Structure of AlexNet
We use three main types of layers to build AlexNet architectures: 
1. Convolutional Layer 
2. Pooling Layer 
3. Fully-Connected Layer 

AlexNet workflow:
#### [INPUT - CONV - POOL - NORM - 
####                 CONV - POOL - NORM - 
####                 CONV - POOL - NORM - 
####                 CONV - POOL - NORM 
####                 - FC1- FC2 - OUTPUT]

In [None]:
def alexNet(X, weight, bias, dropout):
    # Reshape input image to 4-D: [batch, height, width, channels]
    X = tf.reshape(X, shape=[-1, 28, 28, 1])

    # Convolution Layer 1
    conv1 = convolution('conv1', X, weight['wc1'], bias['bc1'])
    # Max Pooling
    pool1 = max_pooling('pool1', conv1, n=2)
    # Normalization
    norm1 = norm('norm1', pool1)
    # Dropout
    norm1 = tf.nn.dropout(norm1, dropout)

    # Convolution Layer 2
    conv2 = convolution('conv2', norm1, weight['wc2'], bias['bc2'])
    # Max Pooling
    pool2 = max_pooling('pool2', conv2, n=2)
    # Normalization
    norm2 = norm('norm2', pool2)
    # Dropout
    norm2 = tf.nn.dropout(norm2, dropout)

    # Convolution Layer 3
    conv3 = convolution('conv3', norm2, weight['wc3'], bias['bc3'])
    # Max Pooling
    pool3 = max_pooling('pool3', conv3, n=2)
    # Normalization
    norm3 = norm('norm3', pool3)
    # Dropout
    norm3 = tf.nn.dropout(norm3, dropout)

    # Convolution Layer 4
    conv4 = convolution('conv4', norm3, weight['wc4'], bias['bc4'])
    # Max Pooling
    pool4 = max_pooling('pool4', conv4, n=2)
    # Normalization
    norm4 = norm('norm4', pool4)
    # Dropout
    norm4 = tf.nn.dropout(norm4, dropout)

    # Memory Peak here
    # Fully Connected Layer 1
    fc1 = tf.reshape(norm4, [-1, weight['wfc1'].get_shape().as_list()[0]]) 
    # Flatten norm4 so its column count matches the row count of wfc1
    fc1 = tf.nn.relu(tf.matmul(fc1, weight['wfc1']) + bias['bfc1'], name='fc1') 

    # Fully Connected Layer 2
    fc2 = tf.nn.relu(tf.matmul(fc1, weight['wfc2']) + bias['bfc2'], name='fc2') 

    # Output
    output = tf.matmul(fc2, weight['wDest']) + bias['bDest']

    return output

## Evaluation:

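We train by minimizing the cross-entropy loss and evaluate with classification accuracy. Note that alexNet returns raw logits (no softmax), because tf.nn.softmax_cross_entropy_with_logits applies the softmax internally.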
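
One reason to compute the loss directly from logits is numerical stability. The NumPy sketch below contrasts a naive softmax followed by log with a log-sum-exp formulation; both functions are illustrative assumptions and not TensorFlow's actual implementation.

```python
import numpy as np

def naive_cross_entropy(logits, one_hot_labels):
    exp = np.exp(logits)                          # overflows for large logits
    probs = exp / exp.sum(axis=1, keepdims=True)
    return -np.sum(one_hot_labels * np.log(probs), axis=1)

def cross_entropy_from_logits(logits, one_hot_labels):
    """log-softmax via the log-sum-exp trick, then -sum(y * log_softmax)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.sum(one_hot_labels * log_softmax, axis=1)

logits = np.array([[1000.0, 0.0, -1000.0]])   # extreme values chosen to force overflow
labels = np.array([[1.0, 0.0, 0.0]])

with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = naive_cross_entropy(logits, labels)
stable = cross_entropy_from_logits(logits, labels)
print(np.isnan(naive[0]))   # the naive version breaks down
print(stable[0])            # the stable version gives a loss of zero
```

For ordinary logits both versions agree; the difference only shows up when values become very large or very small, which can happen during training.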

In [None]:
# Softmax_cross_entropy_with_logits
# http://stackoverflow.com/questions/34240703/difference-between-tensorflow-tf-nn-softmax-and-tf-nn-softmax-cross-entropy-with

# Build AlexNet in Graph
y_alexnet = alexNet(x, W_dict, b_dict, keep_prob)

# Cost
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_alexnet, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

# Eval
correct_prediction = tf.equal(tf.argmax(y_alexnet,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Initializing the variables
init = tf.initialize_all_variables()

#Train and Eval in session
with tf.Session() as sess:

    sess.run(init)
    step = 1
    # Keep training until the number of processed examples reaches training_iters
    while step * batch_size < training_iters:
        batch_xs, batch_ys = mnist.train.next_batch(batch_size)
        # Train
        sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys, keep_prob: dropout})
        if step % display_step == 0:
            # batch accuracy
            train_accuracy = accuracy.eval(feed_dict={x: batch_xs, y: batch_ys, keep_prob: 1.0})
            print("Iter " + str(step*batch_size) + ", Batch Training Accuracy= " + "{:.5f}".format(train_accuracy))
        step += 1

    print("Final Testing Accuracy: %g" % sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels, keep_prob: 1.}))


## References:
1. TensorFlow official documents
2. Stanford CS231n 2016 Jan Lectures
3. Aymeric Damien TF example
4. Bay Area Deep Learning
5. Picture of models from various tech blogs
6. DeepLearning.TV