# Machine Learning Engineer Nanodegree
## Capstone Project
## Project: Digit Recognition Using a Convolutional Neural Network

In this project, I take on a problem that is common in Machine Learning, but with a less common approach. I build an MNIST classifier using a Convolutional Neural Network (CNN) on top of a neural network framework, with the following motivation:

1. Learn the nuances of Convolutional Neural Networks.
2. Compare different neural network frameworks for quick development.
3. Learn the chosen framework.
4. Solve a problem in the Computer Vision domain using that framework.

My ultimate goal is to build my own prototype of a self-driving car, and I believe the above goals lay a good foundation for it.

-----

## Getting Started
In this project, I solve the problem of digit recognition on the MNIST dataset using a Convolutional Neural Network model. I first go through the details of the input data and the network architecture, then compare different neural network frameworks, and finally present the results of the model on the chosen framework.

### Data Set
The MNIST dataset comprises 60000 training examples and 10000 test examples of the handwritten digits 0-9, formatted as 28x28-pixel monochrome images.

In [2]:
import numpy as np
import tensorflow as tf

mnist = tf.contrib.learn.datasets.load_dataset("mnist")
train_data = mnist.train.images
train_labels = np.asarray(mnist.train.labels, dtype=np.int32)
eval_data = mnist.test.images
eval_labels = np.asarray(mnist.test.labels, dtype=np.int32)

Extracting MNIST-data\train-images-idx3-ubyte.gz
Extracting MNIST-data\train-labels-idx1-ubyte.gz
Extracting MNIST-data\t10k-images-idx3-ubyte.gz
Extracting MNIST-data\t10k-labels-idx1-ubyte.gz




### Understanding the Network Architecture
The following model is used to classify the images in the MNIST dataset:

* **Convolution Layer #1**: Applies 32 filters of kernel size 5x5, with ReLU activation function.
* **Pooling Layer #1**: Max pooling with a 2x2 filter and a stride of 2.
* **Convolution Layer #2**: Increases the number of features extracted by applying 64 filters of kernel size 5x5, with ReLU activation function.
* **Pooling Layer #2**: Max pooling with a 2x2 filter and a stride of 2.
* **Fully Connected Layer #1**: A fully connected layer with 1024 nodes/neurons, with ReLU activation function.
* **Logits Layer #1:** A logits layer with 10 nodes/neurons that returns the raw values for the predictions. These values are in the range of [-inf, +inf].

Using the above architecture, we build each of these layers step by step in the following sections.
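
As a quick sanity check on the architecture listed above, the shape of one image can be traced through each layer with plain arithmetic. This is only a sketch: the helper names (`conv_same`, `max_pool`, `flatten`, `dense`, `trace_shapes`) are hypothetical and not part of TensorFlow, and the batch dimension is left out.

```python
# Sketch only: helper names are hypothetical; shapes omit the batch dimension.

def conv_same(shape, filters):
    """Stride-1 convolution with 'same' padding: height/width kept, depth = filters."""
    height, width, _ = shape
    return (height, width, filters)


def max_pool(shape, pool=2, stride=2):
    """Unpadded max pooling: each spatial size becomes (n - pool) // stride + 1."""
    height, width, depth = shape
    return ((height - pool) // stride + 1, (width - pool) // stride + 1, depth)


def flatten(shape):
    """Collapse height, width and depth into a single feature axis."""
    height, width, depth = shape
    return (height * width * depth,)


def dense(shape, units):
    """A fully connected layer maps any flat input to `units` outputs."""
    return (units,)


def trace_shapes(input_shape=(28, 28, 1)):
    """Return (layer name, output shape) pairs for the architecture listed above."""
    layers = [
        ("conv1", lambda s: conv_same(s, 32)),
        ("pool1", max_pool),
        ("conv2", lambda s: conv_same(s, 64)),
        ("pool2", max_pool),
        ("flatten", flatten),
        ("fc1", lambda s: dense(s, 1024)),
        ("logits", lambda s: dense(s, 10)),
    ]
    shape = input_shape
    trace = []
    for name, layer in layers:
        shape = layer(shape)
        trace.append((name, shape))
    return trace


for name, shape in trace_shapes():
    print(name, shape)
```

The printed shapes should agree with the Input/Output pairs given for each layer in the sections that follow.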

-----
## Implementing the Network

### Input Layer

The input is a gray-scale image from the MNIST dataset with a dimension of 28x28. The convolution and pooling layers in TensorFlow (TF), however, expect input in the format [batch_size, Height, Width, NumChannels]. To meet this requirement, each input image is treated as 28(H)x28(W)x1(D). We use the 'reshape' function to achieve this transformation.

* Input: [batch_size, 28, 28]
* Output: [batch_size, 28, 28, 1]
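
The reshape can be illustrated with NumPy alone. This is a minimal sketch using placeholder zero-valued pixels rather than real MNIST data; the array names are made up for the example. A `-1` in the target shape asks the library to infer the batch size.

```python
import numpy as np

# Placeholder batch of 3 gray-scale images, shape [batch_size, 28, 28].
images_2d = np.zeros((3, 28, 28), dtype=np.float32)

# Add the single-channel axis expected by convolution layers.
images_4d = images_2d.reshape(-1, 28, 28, 1)
print(images_4d.shape)  # (3, 28, 28, 1)

# The same call also works when each image arrives as a flat row of 784 pixels.
images_flat = np.zeros((3, 784), dtype=np.float32)
print(images_flat.reshape(-1, 28, 28, 1).shape)  # (3, 28, 28, 1)
```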

### Convolution Layer 1

In a convolution layer, we extract 'K' features by applying 'K' filters, each covering a sub-region of 'M' x 'N' pixels. There is no theoretical approach for choosing the right values of these hyper-parameters; they are determined empirically. For our experiments, we pick a common choice: 32 filters with a 5x5 kernel size, with zero padding (padding="same" in the code) so that the output keeps the 28x28 spatial size of the input.

* Input: [batch_size, 28, 28, 1]
* Output: [batch_size, 28, 28, 32]
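
The output size and the number of trainable parameters of this layer can be checked by hand. The sketch below assumes the usual "same"/"valid" padding conventions and a bias term per filter; the function names are hypothetical.

```python
import math


def conv_output_size(n, kernel, stride=1, padding="same"):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        # Zero padding is added so that only the stride shrinks the output.
        return math.ceil(n / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return math.ceil((n - kernel + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")


def conv_param_count(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


print(conv_output_size(28, 5, padding="same"))   # 28
print(conv_output_size(28, 5, padding="valid"))  # 24
print(conv_param_count(5, 5, 1, 32))             # 832
```

Without padding the 5x5 kernel would shrink each 28x28 image to 24x24, which is why zero padding is needed to obtain the [batch_size, 28, 28, 32] output listed above.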

### Pooling Layer 1

A convolution layer is typically followed by a pooling layer to down-sample the feature volume. Again, in the absence of any empirical data, we pick the most common choice of pooling layer, i.e. 2x2 with a step size (stride) of 2. We use the max-pooling function; because the 2x2 window moves with a stride of 2, it reduces the input dimension to H/2 x W/2.

* Input: [batch_size, 28, 28, 32]
* Output: [batch_size, 14, 14, 32]
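
To make the down-sampling concrete, here is a small pure-Python sketch of 2x2 max pooling with a stride of 2 on a single 4x4 feature map. The function name and the example values are made up for illustration; TensorFlow applies the same idea to every channel of every image in the batch.

```python
def max_pool_2d(feature_map, pool=2, stride=2):
    """Max pooling over a 2-D list of numbers, without padding."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for r in range(0, rows - pool + 1, stride):
        pooled_row = []
        for c in range(0, cols - pool + 1, stride):
            # Collect the values inside the current pool x pool window.
            window = [feature_map[r + i][c + j]
                      for i in range(pool) for j in range(pool)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


example = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [7, 1, 3, 8],
]
print(max_pool_2d(example))  # [[6, 2], [7, 9]]
```

The 4x4 input becomes 2x2, just as each 28x28 feature map becomes 14x14 in this layer.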

### Convolution Layer 2

Architectures such as AlexNet, GoogLeNet and MobileNet suggest that a convolution layer following a pooling layer should increase the number of features by roughly the same factor as the previous pooling layer reduced each spatial dimension (here, a factor of 2). Keeping this in mind, we now use 64 filters, with our default kernel size of 5x5.

* Input: [batch_size, 14, 14, 32]
* Output: [batch_size, 14, 14, 64]
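
The trade-off between spatial size and number of features can be seen by counting the values each stage produces per image. This is a back-of-the-envelope sketch; `volume` and `stage_shapes` are names invented for the example, and the weight count assumes one bias per filter.

```python
def volume(height, width, depth):
    """Number of activation values produced for one image."""
    return height * width * depth


# Output shapes per image, as listed in the layer descriptions.
stage_shapes = [
    ("conv1", (28, 28, 32)),
    ("pool1", (14, 14, 32)),
    ("conv2", (14, 14, 64)),
    ("pool2", (7, 7, 64)),
]

sizes = {name: volume(*shape) for name, shape in stage_shapes}
for name, size in sizes.items():
    print(name, size)

# Trainable parameters of the second convolution layer:
# 5x5 kernel, 32 input channels, 64 filters, plus one bias per filter.
conv2_params = 5 * 5 * 32 * 64 + 64
print(conv2_params)  # 51264
```

Each pooling step divides the volume by 4, while doubling the filters multiplies it by 2, so the representation still shrinks as it gets deeper.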

### Pooling Layer 2

As in Pooling Layer 1, we use another max-pooling function to down-sample the feature volume.

* Input: [batch_size, 14, 14, 64]
* Output: [batch_size, 7, 7, 64]

### Fully Connected Layer 1

Next, we add a fully connected layer to perform classification based on the features extracted by the convolution and pooling layers. A fully connected layer expects 2-dimensional input, so we first flatten the output of Pooling Layer 2 into a 2-dimensional feature volume of shape [batch_size, 7*7*64] = [batch_size, 3136]. We then pass it through a fully connected layer of 1024 neurons.

* Input: [batch_size, 7, 7, 64]
* Output: [batch_size, 1024]
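
The flattening step and the size of this layer can be checked with a few lines of plain Python. This is a sketch with hypothetical helper names; the parameter count assumes one bias per neuron.

```python
def flatten_volume(volume):
    """Flatten a nested [height][width][depth] list into one flat list."""
    return [value for row in volume for pixel in row for value in pixel]


def dense_param_count(inputs, units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return inputs * units + units


# A dummy 7x7x64 feature volume for a single image.
dummy_volume = [[[0.0] * 64 for _ in range(7)] for _ in range(7)]
flat = flatten_volume(dummy_volume)
print(len(flat))  # 3136

fc1_params = dense_param_count(len(flat), 1024)
print(fc1_params)  # 3212288
```

With roughly 3.2 million parameters, this layer holds far more weights than either convolution layer.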

### Fully Connected Layer 2

We add another fully connected layer to get the final classification. Since there are 10 output labels, this layer has 10 neurons. Its 10 output values range over [-inf, inf], so we apply the softmax function to map each of them into the range [0, 1]. Softmax converts the score for each class into a probability in such a way that the probabilities always sum to 1.

* Input: [batch_size, 1024]
* Output: [batch_size, 10]
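
A minimal sketch of the softmax function in plain Python is shown below. The three example scores are arbitrary, and subtracting the maximum score before exponentiating is a common numerical-stability trick assumed here, not something specific to this project.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # avoids overflow in exp() for large scores
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print([round(p, 3) for p in probs])
print(sum(probs))
```

The largest score always receives the largest probability, so taking the argmax of the logits or of the softmax output yields the same predicted class.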

## Neural Network Framework

We combine all the layers discussed above into the final CNN using TensorFlow in Python. There were multiple neural network frameworks to choose from. I did a literature survey to compare these frameworks and came up with the following summary:

| Property | Caffe | Neon | TensorFlow | Theano | Torch |
|---|---|---|---|---|---|
| Core | C++ | Python | C++ | Python | Lua |
| CPU | Yes | Yes | Yes | Yes | Yes |
| Multi-threaded CPU | Blas | Only Data Loader | Eigen | Blas, conv2D, limited OpenMP | Widely Used |
| GPU | Yes | Customized nVidia backend | Yes | Yes | Yes |
| Multi-GPU | Only Data Parallel | Yes | Most Flexible | No | Yes |
| Nvidia cuDNN | Yes | No | Yes | Yes | Yes |
| Quick Deploy on Standard Models | Easiest | Yes | Yes | No | Yes |
| Auto Gradient Computation | Yes | Yes | Yes | Most flexible | Yes |
    

Looking at the above comparison, *Torch* and *TensorFlow* are the top two choices, but TensorFlow has better support for multi-GPU. This is an important criterion because GPUs are heavily used for training, and better multi-GPU support can mean large time savings. For this reason, I chose TensorFlow as the framework to learn and use for my project.

## Tying up all the Layers

### Convolutional Neural Network

Combining all the layers explained above, we create a CNN model in TensorFlow as shown below:

In [3]:
# Create a function to create a CNN with the above architecture
# features --> Input Features
# labels --> Output labels
# mode --> Mode to specify if it's training or testing run
def digit_classifier(features, labels, mode):
    #Input Layer - 
    inputLayer = tf.reshape(features["x"], [-1, 28, 28, 1])
    
    #Convolution Layer 1
    #InputSize: [batch_size, 28, 28, 1]; 
    #OutputSize: [batch_size, 28, 28, 32]
    conv1 = tf.layers.conv2d(
                inputs=inputLayer,
                filters=32,
                kernel_size=[5, 5],
                padding="same",
                activation=tf.nn.relu)
    
    #Pooling Layer 1
    #InputSize: [batch_size, 28, 28, 32]
    #OutputSize: [batch_size, 14, 14, 32]
    pool1 = tf.layers.max_pooling2d(
                inputs=conv1,
                pool_size=[2, 2],
                strides=2)
    
    #Convolution Layer 2
    #InputSize: [batch_size, 14, 14, 32]
    #OutputSize: [batch_size, 14, 14, 64]
    conv2 = tf.layers.conv2d(
                inputs=pool1,
                filters=64,
                kernel_size=[5, 5],
                padding="same",
                activation=tf.nn.relu)
    
    #Pooling Layer 2
    #Input Size: [batch_size, 14, 14, 64]
    #Output Size: [batch_size, 7, 7, 64]
    pool2 = tf.layers.max_pooling2d(
                inputs=conv2,
                pool_size=[2,2],
                strides=2)
    
    #Preparation layer for Fully Connected Layer
    #Input Size: [batch_size, 7, 7, 64]
    #Output Size: [batch_size, 7*7*64]
    pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
    
    
    #Fully Connected Layer 1
    #Input Size: [batch_size, 3136]
    #Output Size: [batch_size, 1024]
    fc1 = tf.layers.dense(
                inputs=pool2_flat,
                units=1024,
                activation=tf.nn.relu)
    
    #Regularization to avoid over-fitting
    #Input Size: [batch_size, 1024]
    #Output Size: [batch_size, 1024]
    dropout = tf.layers.dropout(
                inputs=fc1,
                rate=0.4,  #Drop Out Rate
                training=mode==tf.estimator.ModeKeys.TRAIN)
    
    #Fully Connected Layer 2
    fc2 = tf.layers.dense(
                inputs=dropout,
                units=10)
    
    predictions = {
        #Generate predictions for PREDICT and EVAL mode
        "classes": tf.argmax(input=fc2, axis=1),
        
        #Generate probabilities
        "probabilities": tf.nn.softmax(fc2, name="softmax_tensor")
    }
    
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)
    
    #Calculate loss for both TRAIN and EVAL modes
    onehot_labels = tf.one_hot(indices=tf.cast(labels, tf.int32), depth=10)
    loss = tf.losses.softmax_cross_entropy(
            onehot_labels=onehot_labels,
            logits=fc2)
    
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(
                loss=loss,
                global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    
    eval_metric_ops = {
        "accuracy": tf.metrics.accuracy(
                labels=labels, predictions=predictions["classes"])}
    
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
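
The loss used in the model function combines one-hot labels with softmax cross-entropy. The plain-Python sketch below shows that computation for a single example. The helper names are hypothetical, and it is a simplification: it ignores batching and any weighting or reduction options that `tf.losses.softmax_cross_entropy` applies.

```python
import math


def one_hot(label, depth=10):
    """Encode an integer class label as a 0/1 vector of length `depth`."""
    return [1.0 if index == label else 0.0 for index in range(depth)]


def softmax_cross_entropy(onehot, logits):
    """Cross-entropy between a one-hot target and softmax(logits)."""
    shift = max(logits)  # numerical stability, as in a stable softmax
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    log_probs = [v - log_sum_exp for v in logits]
    return -sum(t * lp for t, lp in zip(onehot, log_probs))


# An untrained network that scores every digit equally is maximally unsure,
# so its loss equals ln(10).
uniform_loss = softmax_cross_entropy(one_hot(3), [0.0] * 10)
print(uniform_loss, math.log(10))

# A confident, correct prediction gives a loss close to zero.
confident_logits = [0.0] * 10
confident_logits[3] = 20.0
confident_loss = softmax_cross_entropy(one_hot(3), confident_logits)
print(confident_loss)
```

A loss near ln(10), about 2.3, at the start of training is therefore expected, and it should fall as the logits for the correct classes grow.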

### Training the Model

We create a classifier (named mnist_classifier) from the function defined above. We use this classifier to train on our data.

In [5]:
mnist_classifier = tf.estimator.Estimator(
                model_fn=digit_classifier, model_dir="./Model")

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': 1, '_save_checkpoints_steps': None, '_save_summary_steps': 100, '_log_step_count_steps': 100, '_model_dir': './Model', '_save_checkpoints_secs': 600, '_session_config': None}


Next, we set up the input for training the classifier created above. Training is done in batches of 100 examples instead of taking the entire training data in one go.

In [6]:
train_input_fn = tf.estimator.inputs.numpy_input_fn(
            x={"x": train_data},
            y=train_labels,
            batch_size=100,
            num_epochs=None,
            shuffle=True)

The above call only prepares the input function for training; it does not run the actual training. The actual training is done by feeding this input function to the classifier. We don't run this cell in our notebook as it is long-running (around 5 hours).
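
Conceptually, an input function like this hands out shuffled mini-batches, repeating indefinitely when `num_epochs=None`. The generator below is a simplified pure-Python sketch of that idea. It is not how `numpy_input_fn` is implemented internally, and `batch_indices` with its `seed` argument is a hypothetical helper.

```python
import itertools
import random


def batch_indices(num_examples, batch_size, num_epochs=None, shuffle=True, seed=0):
    """Yield lists of example indices, one list per mini-batch."""
    rng = random.Random(seed)
    epoch = 0
    # num_epochs=None means: keep cycling through the data until the caller stops.
    while num_epochs is None or epoch < num_epochs:
        order = list(range(num_examples))
        if shuffle:
            rng.shuffle(order)
        for start in range(0, num_examples, batch_size):
            yield order[start:start + batch_size]
        epoch += 1


# One pass over 10 examples in batches of 4, without shuffling.
print(list(batch_indices(10, 4, num_epochs=1, shuffle=False)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# With num_epochs=None the generator never ends, so take only a few batches.
first_batches = list(itertools.islice(batch_indices(10, 4), 7))
print(len(first_batches))  # 7
```

Because the stream is endless when `num_epochs=None`, the amount of training is controlled by the number of steps requested from the classifier rather than by the input function.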

In [None]:
mnist_classifier.train(
            input_fn=train_input_fn,
            steps=20000)
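
To get a feel for how much data 20000 steps cover, the number of passes over the training set can be estimated as below. This is a rough sketch that assumes all 60000 training examples mentioned earlier are fed to the input function; a loader that holds some examples out for validation would change the figure. The function name is hypothetical.

```python
def epochs_covered(steps, batch_size, num_examples):
    """Approximate number of passes over the training set."""
    return steps * batch_size / num_examples


estimated_epochs = epochs_covered(steps=20000, batch_size=100, num_examples=60000)
print(round(estimated_epochs, 1))  # 33.3
```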

### Evaluating the model
Once the training is complete, I evaluate the model on the *eval_data* data set to calculate its accuracy. Since we have not run the training above in the notebook, we do not run the evaluation there either. Both were run locally on a machine using CPU + GPU to produce the results.

In [None]:
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
                x={"x": eval_data},
                y=eval_labels,
                num_epochs=1,
                shuffle=False)
eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
print(eval_results)

### Results

INFO:tensorflow:Starting evaluation at 2017-09-23-07:13:55

INFO:tensorflow:Restoring parameters from /tmp/mnist_convnet_model\model.ckpt-60004

INFO:tensorflow:Finished evaluation at 2017-09-23-07:14:25

INFO:tensorflow:Saving dict for global step 60004: accuracy = 0.9841, global_step = 60004, loss = 0.050626

We observe an accuracy of 98.41% on the evaluation data of 10000 images. This means that the model correctly predicted 9841 out of the 10000 evaluation images in the MNIST data set.
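
The relation between the reported accuracy and the number of correctly classified images is simple arithmetic, sketched below. The `accuracy` helper and the toy labels are illustrative only; the 0.9841 and 10000 figures come from the evaluation log above.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Toy example: three of four predictions are right.
print(accuracy([7, 2, 1, 0], [7, 2, 1, 4]))  # 0.75

# Figures from the evaluation log.
num_eval_images = 10000
num_correct = round(0.9841 * num_eval_images)
num_errors = num_eval_images - num_correct
print(num_correct, num_errors)  # 9841 159
```

In other words, the model misclassifies 159 of the 10000 evaluation images, an error rate of 1.59%.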
