### Handwritten Digit Recognition using Convolutional Neural Networks

This is a step-by-step tutorial that uses Jupyter notebook code cells to illustrate how Google's TensorFlow library can be used to recognize images of handwritten digits. The data set is taken from MNIST, a well-known repository of handwritten digit images (http://yann.lecun.com/exdb/mnist/).

The purpose of this notebook is to provide some guided learning and to note observations while applying a neural network learning mechanism with the TensorFlow library.

##### Some Basic Things to Note

- The images are of handwritten digits between 0 and 9, which gives 10 possible classes.
- Each image comes with a label, which simply records which digit the image shows.
- The data set consists of the following:
    - Training Data Set
    - Test Data Set
    - Label Data Set
    
##### Handwritten Sketch (from MNIST)

<img src="images/MNIST-Matrix.png" style="width: 400px" align="left" /><br/><br/><br/><br/><br/><br/><br/><br/><br/>


##### Training Image Illustration

<img src="images/mnist-train-xs.png" style="width: 400px" align="center" /><br/><br/><br/>

##### Label Vectors

<img src="images/mnist-train-ys.png" style="width: 400px" align="center" />


### What is Softmax Regression?

First, we need to understand what problem we are trying to solve here.

Each image in this tutorial shows a digit between 0 and 9, and we need to develop an algorithm that can look at the image and predict which digit it is likely to be.

For example, take an image of the digit '9'. The model might be 80% sure that the digit is a nine, give a 5% chance that it is an '8' (since both have a loop at the top), and spread the remaining 15% over all the other digits.

- Probability of being the correct digit (i.e. '9') = 80%
- Probability of the digit being read as eight ('8') = 5%
- Probability of any other digit = 15%

Remember that there is no rule that guarantees a 100% correct guess.

Softmax regression is a simple and fast technique that suits problems like the one we are working on. It works in two steps:

1. It adds up the evidence that our input (i.e. the image data) belongs to each class.
2. It then converts this summed-up evidence into probabilities.

How do we find out whether the evidence supports the image being in a given class? We compute a weighted sum of the pixel intensities. A weight is positive if a high intensity at that pixel is evidence in favor of the class, and negative if it is evidence against the class.

On top of that, we add a bias ('b'), which lets us say that some classes are more likely than others independent of the provided input.

The illustration below uses two colors, <font color='red'>RED</font> and <font color='blue'>BLUE</font>. Blue marks positive weights (evidence for the class), while red marks negative weights (evidence against it):

<br/><img src="images/softmax-weights.png" style="width: 270px" align="center" /><br/><br/>

Moving forward, if we have an image 'x' and we are considering class 'i', the evidence for that class is:


$$ \text{evidence}_i = \sum_j W_{i,~ j} x_j + b_i $$

where $W_i$ holds the weights and $b_i$ is the bias for class $i$, and $j$ is an index for summing over the pixels in our input image $x$.
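As a minimal sketch of this weighted sum, consider a made-up 4-pixel "image" and 3 classes instead of 784 pixels and 10 digits. The weights, biases, pixel values, and the helper name `evidence` are illustrative assumptions, not values learned from MNIST.

```python
# Hypothetical toy example: 4 pixels and 3 classes (not real MNIST values).
x = [0.0, 1.0, 1.0, 0.0]           # pixel intensities x_j

W = [                               # W[i][j]: weight of pixel j for class i
    [0.5, 1.0, 1.0, -0.5],          # class 0 favors the two middle pixels
    [1.0, -1.0, -1.0, 1.0],         # class 1 favors the two outer pixels
    [0.0, 0.5, -0.5, 0.0],          # class 2 is mostly indifferent
]
b = [0.1, 0.0, -0.1]                # bias b_i for each class


def evidence(W, x, b):
    """Return evidence_i = sum_j W[i][j] * x[j] + b[i] for every class i."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]


ev = evidence(W, x, b)
print(ev)
```

Class 0 collects the most evidence here because its positive weights line up with the bright pixels, while the negative weights of class 1 push its evidence below zero.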

Then, as the second step, we convert the evidence into probabilities $y$ using the softmax function.

$$ y = \text{softmax}(\text{evidence}) $$


The output of the softmax function is a probability distribution over 10 cases: for each digit from 0 to 9, it gives the probability that the image in question represents that digit.

The equation above can be expanded into the following form:

$$ \text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $$

Let's take a simpler approach to understand what the softmax function does. Exponentiating the evidence ensures that no hypothesis ever gets a negative or zero weight. Softmax then normalizes these weights so that they add up to one, which makes the result a valid probability distribution.
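The two properties described above (strictly positive weights that sum to one) can be checked with a small pure-Python sketch. The evidence values are hypothetical, and subtracting the maximum before exponentiating is a common numerical-stability trick added here, not part of the formula itself.

```python
import math


def softmax(evidence):
    """Turn a list of evidence values into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids
    # overflow in exp() for large inputs.
    shift = max(evidence)
    exps = [math.exp(e - shift) for e in evidence]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical evidence for three classes, including negative values.
probs = softmax([2.1, -2.0, -0.1])

print(probs)       # every entry is strictly positive
print(sum(probs))  # the entries add up to 1 (up to rounding)
```

Even the class with negative evidence ends up with a small positive probability, and the ordering of the classes is preserved.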

For each output, we compute a weighted sum of the $x$'s, add a bias ($b$), and then apply the softmax function. The following illustration shows this for a few sample inputs ($x$):

<br/><img src="images/softmax-regression-scalargraph.png" style="width: 370px" align="center" /><br/><br/>

Written as a mathematical equation, the illustration above becomes:

<br/><img src="images/softmax-regression-scalarequation.png" style="width: 370px" align="center" /><br/>

... and this equation can also be written in matrix-vector form, which is simpler and easier to understand:

<br/><img src="images/softmax-regression-vectorequation.png" style="width: 370px" align="center" /><br/>

Last but not least, the most compact form is:

$$ y = \text{softmax}(Wx + b) $$
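To see that the compact form is the same computation as the per-class sums, here is a hedged NumPy sketch on a toy problem. The sizes (4 pixels, 3 classes), the random values, and the local `softmax` helper are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy sizes: 4 pixels, 3 classes (MNIST would use 784 and 10).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # W[i, j]: weight of pixel j for class i
b = rng.normal(size=3)        # one bias per class
x = rng.random(4)             # one flattened image


def softmax(z):
    """Softmax of a 1-D array, shifted by max(z) for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


# Scalar form: one explicit sum over the pixels for each class.
evidence_loop = np.array(
    [sum(W[i, j] * x[j] for j in range(4)) + b[i] for i in range(3)]
)

# Compact form: a single matrix-vector product.
evidence_matrix = W @ x + b

y = softmax(evidence_matrix)
print(np.allclose(evidence_loop, evidence_matrix))
print(y, y.sum())
```

Here $W$ has one row per class, matching the equation. The TensorFlow code in this notebook stores the transpose (784 x 10) so that a whole batch of images can be multiplied at once.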


#### Time to Get the Hands Dirty with TensorFlow

Now that we understand how prediction works with softmax regression, it's time to see the theory in action. Let's see how TensorFlow implements the regression.

The first step is to express this procedure in a form that TensorFlow can use.

1. We will define a graph of interactions that TensorFlow can use to run heavy numerical computations outside of Python. This graph has placeholders and variables, such as:

- '$x$' is a placeholder for the input images, each flattened into a 784-dimensional vector.
- '$W$' is a variable matrix of weights, adjusted during training, with dimensions 784 x 10.
- '$b$' is the bias vector that is added to every computation.
 
 

In [5]:
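import tensorflow as tf

# Placeholder for a batch of flattened images; None allows any batch size
x = tf.placeholder(tf.float32, [None, 784])

# Weights: one column of 784 pixel weights per digit class, initialized to zero
W = tf.Variable(tf.zeros([784, 10]))

# Biases: one per digit class, initialized to zero
b = tf.Variable(tf.zeros([10]))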


#### Why those dimensions?

Before we define the model, let's understand why we chose these dimensions, as this is quite important.

$W$, which represents the weights, is a 784 by 10 matrix full of zeros. It is multiplied with the image vector $x$, which has 784 dimensions, and matrix multiplication is only defined when the number of columns of the first operand ($x$ in this case) matches the number of rows of the second ($W$ here). The outcome is a 10-dimensional vector of evidence. The vector $b$, which has the same dimensions as the evidence vector, is then added to this result.
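The shape rule can be verified outside TensorFlow with a short NumPy sketch. The batch size of 5 and the random pixel values are assumptions made only to have something to multiply.

```python
import numpy as np

batch_size = 5                        # hypothetical number of images
x = np.random.rand(batch_size, 784)   # each row is one flattened image
W = np.zeros((784, 10))               # one column of weights per digit class
b = np.zeros(10)                      # one bias per digit class

# (5, 784) @ (784, 10) -> (5, 10); b is broadcast across the 5 rows.
evidence = x @ W + b
print(evidence.shape)

# Row-wise softmax. With all-zero W and b every class gets equal evidence,
# so each image starts out with probability 0.1 for every digit.
exps = np.exp(evidence - evidence.max(axis=1, keepdims=True))
y = exps / exps.sum(axis=1, keepdims=True)
print(y[0])
```

Swapping the operands (`W @ x`) would fail with a shape mismatch, which is why the weights are stored as 784 x 10 when the images are stacked as rows.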

To build the model, we first use the library function `tf.matmul(x, W)` to multiply the two operands and add '$b$' as the bias. Lastly, we apply the `tf.nn.softmax()` function from the TensorFlow library.


In [None]:
y = tf.nn.softmax(tf.matmul(x,W) + b)

The statement above is all it takes to define the model in TensorFlow. Once defined, the model can be used on many devices and with many shapes and forms of data. With the model defined, we now focus on training it.

#### Training the Model

Training the model requires a way to measure how good (or bad) the model is. In machine learning, we usually measure how far the model is from the desired results. This measure is called the "cost" or "loss" of the model. There are various ways to calculate this loss; one well-known function is cross-entropy. It is defined as:

$$ H_{y'}(y) = -\sum_i y'_i \log(y_i)  $$

where $y$ is the predicted probability distribution and $y'$ is the actual distribution (the one-hot vector of digit labels). To define the cross-entropy in TensorFlow, we first need a new placeholder:

In [None]:
y_ = tf.placeholder(tf.float32, [None, 10])

# Cross-entropy: sum over the 10 classes of each image, then average over the batch

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
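To get a feel for what this loss measures, the same formula can be worked through in NumPy on two made-up predictions. The probabilities, the labels, and the names `y_true` and `per_image` are assumptions for illustration; `y_true` plays the role of the placeholder `y_`.

```python
import numpy as np

# Hypothetical predictions for two images over the 10 digit classes.
y = np.full((2, 10), 0.02)
y[0, 9] = 0.82   # first image: confident guess of '9'
y[1, 8] = 0.82   # second image: confident guess of '8'

# One-hot true labels: both images actually show the digit 9.
labels = np.array([9, 9])
y_true = np.eye(10)[labels]

# Per-image cross-entropy: -sum_i y'_i * log(y_i), then mean over the batch.
per_image = -np.sum(y_true * np.log(y), axis=1)
cross_entropy = per_image.mean()

print(per_image)
print(cross_entropy)
```

Because the true label is one-hot, only the predicted probability of the correct digit contributes to each image's loss, so the confident but wrong guess on the second image is penalized far more heavily than the correct guess on the first.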