# Neural Network Example

Building a fully connected neural network with two hidden layers (a.k.a. multilayer perceptron) with TensorFlow.

This example uses some of TensorFlow's higher-level wrappers (tf.estimator, tf.layers, tf.metrics, ...). See the 'neural_network_raw' example for a raw, more detailed TensorFlow implementation.

- Author: Dr. Deepak Mishra, IIST

## Neural Network Overview

<img src="http://cs231n.github.io/assets/nn1/neural_net2.jpeg" alt="nn" style="width: 400px;"/>

## MNIST Dataset Overview

This example uses the MNIST handwritten digits. The dataset contains 60,000 examples for training and 10,000 examples for testing. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels) with values from 0 to 1. For simplicity, each image has been flattened and converted to a 1-D numpy array of 784 features (28*28).
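The flattening step can be illustrated without any library. The sketch below is only an illustration: the image is made up (it is not an MNIST digit) and the names `HEIGHT`, `WIDTH`, `image`, `flat` and `restored` are hypothetical.

```python
# Hypothetical 28x28 image: pixel (r, c) holds a made-up intensity in [0, 1].
HEIGHT, WIDTH = 28, 28
image = [[(r * WIDTH + c) / (HEIGHT * WIDTH - 1) for c in range(WIDTH)]
         for r in range(HEIGHT)]

# Flatten row by row into a single list of 784 features.
flat = [pixel for row in image for pixel in row]

# Recover the 2-D layout, e.g. before displaying the image.
restored = [flat[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]

print(len(flat))          # number of features
print(restored == image)  # the round trip preserves every pixel
```

The same round trip is what `np.reshape(..., [28, 28])` performs later in this example when the test images are displayed.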

![MNIST Dataset](http://neuralnetworksanddeeplearning.com/images/mnist_100_digits.png)

More info: http://yann.lecun.com/exdb/mnist/

## Importing MNIST dataset for training
We will use TensorFlow to build and train our neural network. We will also use matplotlib for visualization and numpy for matrix operations.

In [None]:
from __future__ import print_function

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=False)

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

## Defining hyperparameters for training
Hyperparameters are the variables that determine the network structure (e.g. number of hidden units) and the variables that determine how the network is trained (e.g. learning rate).

### Learning Rate
The learning rate defines how quickly a network updates its parameters.
A low learning rate slows down learning but tends to converge smoothly. A larger learning rate speeds up learning but may fail to converge.
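A toy problem makes this trade-off visible. The sketch below is not part of the MNIST model: it runs plain gradient descent on the made-up objective f(w) = w², whose minimum is at w = 0, and the function name `gradient_descent` and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize f(w) = w**2 (gradient 2*w), starting from w."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

# Distance from the minimum (w = 0) after 20 steps for three learning rates.
for lr in (0.01, 0.1, 1.1):
    print(lr, abs(gradient_descent(lr)))
```

On this problem the smallest rate is still far from the minimum after 20 steps, the middle rate gets close, and the largest rate moves away from the minimum at every step.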

### Number of epochs
An epoch is one pass of the whole training data through the network. Note that `num_steps` in this example counts training steps (one mini-batch each), not epochs.
Increase the number of epochs until the validation accuracy starts decreasing while the training accuracy is still increasing (overfitting).
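Since the code below describes `num_steps` as a number of iterations, the number of full passes over the data can be estimated with a little arithmetic. The sketch assumes the training-set size of 60,000 quoted above; if the data loader holds out part of the data for validation, the actual figure is smaller. The names `train_examples`, `examples_seen`, `epochs` and `steps_per_epoch` are hypothetical.

```python
num_steps = 1000         # value used in this example
batch_size = 128         # value used in this example
train_examples = 60000   # assumption: the training-set size quoted above

examples_seen = num_steps * batch_size
epochs = examples_seen / train_examples
steps_per_epoch = -(-train_examples // batch_size)  # ceiling division

print(examples_seen, round(epochs, 2), steps_per_epoch)
```

Under these assumptions, 1000 steps correspond to a little over two epochs.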

### Batch size
The mini-batch size is the number of training samples given to the network before each parameter update.
Powers of 2 are a common default, although any batch size can be used. We can try 32, 64, 128, 256, and so on.
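How a dataset is cut into mini-batches can be sketched in plain Python. This is a simplified illustration, not how `tf.estimator.inputs.numpy_input_fn` is implemented; the helper `minibatches` and the dataset size of 1000 are hypothetical.

```python
import random

def minibatches(num_examples, batch_size, seed=0):
    """Yield lists of example indices covering one shuffled epoch."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_examples, batch_size):
        yield indices[start:start + batch_size]

batches = list(minibatches(num_examples=1000, batch_size=128))

# Number of batches, size of the first batch, size of the last batch.
print(len(batches), len(batches[0]), len(batches[-1]))
```

When the batch size does not divide the dataset size, the last batch of an epoch is smaller than the others.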



In [None]:
# Parameters
learning_rate = 0.1
num_steps = 1000 # number of training steps (one mini-batch each)
batch_size = 128 # batch size for training
display_step = 100 # display progress every this many steps

# Network Parameters
n_hidden_1 = 256 # 1st layer number of neurons
n_hidden_2 = 256 # 2nd layer number of neurons
num_input = 784 # MNIST data input (img shape: 28*28)
num_classes = 10 # MNIST total classes (0-9 digits)

In [None]:
# Define the input function for training
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': mnist.train.images}, y=mnist.train.labels,
    batch_size=batch_size, num_epochs=None, shuffle=True)

## Defining the neural network as a function
We construct the neural network according to the hyperparameters defined earlier.

**You need to fill in the ToDO lines here**

In [None]:
# Define the neural network
def neural_net(x_dict):
    # TF Estimator input is a dict, in case of multiple inputs
    x = x_dict['images']
    # ToDO :  Hidden fully connected layer with 256 neurons
    layer_1 = 
    
    # ToDO :  Hidden fully connected layer with 256 neurons
    layer_2 = 
    
    # ToDO :  Output fully connected layer with a neuron for each class (use tf.layers.dense)
    out_layer = 
    
    return out_layer

## Model Function
In this model function we define how we want to train the neural network. The function is used for training as well as testing, so it includes a prediction mode that returns early, before the loss and the optimizer are defined.

For training, we use the softmax cross-entropy loss. Cross-entropy measures the mismatch between the output distribution the model predicts and the true distribution. It is defined as
$$ L(y,p) = -\sum_i y_i \log(p_i)$$
where $y_i$ is 1 for the true class and 0 otherwise, $a_i$ is the network output (logit) for class $i$, and $$p_i = \frac{e^{a_i}}{\sum_k e^{a_k}}$$
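The loss above can be checked by hand on a small example. The sketch below is a plain-Python illustration, not the TensorFlow op used later; the helper names `softmax` and `cross_entropy` and the three logits are made up. Because only the true class has $y_i = 1$, the sum reduces to $-\log(p_{\text{label}})$.

```python
import math

def softmax(logits):
    """Convert raw scores a_i into probabilities p_i."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(a - shift) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(label, probs):
    """L(y, p) when y is 1 at index `label` and 0 elsewhere."""
    return -math.log(probs[label])

logits = [2.0, 1.0, 0.1]   # made-up scores for three classes
probs = softmax(logits)

# The loss is small when the true class has the largest logit.
print(probs, cross_entropy(0, probs), cross_entropy(2, probs))
```

Subtracting the maximum logit does not change the result, since the factor cancels between numerator and denominator.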

We use **Stochastic Gradient Descent** as the optimizer. 

### Stochastic Gradient Descent (SGD)

Stochastic gradient descent (often shortened to SGD), also known as incremental gradient descent, is an iterative method for optimizing a differentiable objective function. It is a stochastic approximation of gradient descent optimization.

We consider the problem of minimizing an objective function that has the form of a sum, $Q(w) = \frac{1}{n}\sum_{i=1}^{n} Q_i(w)$, where the parameter $w$ that minimizes $Q(w)$ is to be estimated. Here $Q(w)$ is the loss $L(y,p)$ averaged over the $n$ training examples.

In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

The sum-minimization problem also arises for empirical risk minimization. In this case, $Q_i(w)$ is the value of the loss function at the $i$-th example, and $Q(w)$ is the empirical risk.

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

$$ w:=w-\eta \nabla Q(w)=w-\eta \sum _{i=1}^{n}\nabla Q_{i}(w)/n $$

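where $\eta$ is a step size (called the learning rate in machine learning). SGD approximates the full gradient with the gradient at a single randomly chosen example, $w := w - \eta \nabla Q_i(w)$, or with an average over a small mini-batch, as in this example.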
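The update rule can be tried on a one-parameter problem. The sketch below is an illustration under stated assumptions, not the optimizer used in this example: the data are made up (they follow y = 3x exactly), the per-example loss is taken to be $Q_i(w) = \frac{1}{2}(w x_i - y_i)^2$, and the function `sgd` and its default arguments are hypothetical. Each step averages the gradient over a small random mini-batch instead of the full sum.

```python
import random

# Made-up data following y = 3 * x. With Q_i(w) = 0.5 * (w * x_i - y_i) ** 2,
# the per-example gradient is (w * x_i - y_i) * x_i.
xs = [i / 10.0 for i in range(1, 21)]
ys = [3.0 * x for x in xs]

def sgd(eta=0.05, steps=500, batch_size=4, seed=0):
    """Mini-batch SGD: each step uses a random subset of the examples."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(range(len(xs)), batch_size)
        grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w = w - eta * grad
    return w

w_hat = sgd()
print(w_hat)  # should be close to the true slope, 3
```

Each step touches only `batch_size` examples, so its cost does not grow with the size of the dataset.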

In [None]:
# Define the model function (following TF Estimator Template)
def model_fn(features, labels, mode):
    
    # Build the neural network
    logits = neural_net(features)
    
    # Predictions
    pred_classes = tf.argmax(logits, axis=1)
    pred_probas = tf.nn.softmax(logits)
    
    # If prediction mode, early return
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions=pred_classes) 
        
    # Define loss and optimizer
    loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=tf.cast(labels, dtype=tf.int32)))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    train_op = optimizer.minimize(loss_op, global_step=tf.train.get_global_step())
    
    # Evaluate the accuracy of the model
    acc_op = tf.metrics.accuracy(labels=labels, predictions=pred_classes)
    
    # TF Estimators require the model function to return an EstimatorSpec
    # that specifies the different ops for training, evaluating, ...
    estim_specs = tf.estimator.EstimatorSpec(
      mode=mode,
      predictions=pred_classes,
      loss=loss_op,
      train_op=train_op,
      eval_metric_ops={'accuracy': acc_op})

    return estim_specs

In [None]:
# Build the Estimator
model = tf.estimator.Estimator(model_fn)

**Fill in the ToDO here**

In [None]:
# ToDO : Train the Model (use model.train())


In [None]:
# Evaluate the Model
# Define the input function for evaluating
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': mnist.test.images}, y=mnist.test.labels,
    batch_size=batch_size, shuffle=False)
# Use the Estimator 'evaluate' method
model.evaluate(input_fn)

In [None]:
# Predict single images
n_images = 4
# Get images from test set
test_images = mnist.test.images[:n_images]
# Prepare the input data
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'images': test_images}, shuffle=False)
# Use the model to predict the class of each image
preds = list(model.predict(input_fn))

# Display
for i in range(n_images):
    plt.imshow(np.reshape(test_images[i], [28, 28]), cmap='gray')
    plt.show()
    print("Model prediction:", preds[i])

## Bibliography

* aymericdamien (GitHub)