# Introduction to Neural Networks
We have thus far dealt with two types of simple classifiers:
1.  Logistic: when we have two classes
2.  Softmax: when we have more than two classes

In this workbook, we will introduce yet another classifier - a neural network.   In fact, one can view both Logistic Regression and Softmax Regression as simple versions of neural networks:
* They both have an input layer, where the features are presented to the algorithm.
* They have an output layer, with 1 or more outputs.
* The inputs are connected directly to the outputs via the $\theta$ or **weight** matrix.
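
To make the "no hidden layer" picture concrete, here is a minimal sketch of softmax regression written as a network whose input layer feeds the output layer directly. The sizes, the random data, and the helper name `softmax_rows` are assumptions made for illustration; they are not taken from the earlier workbooks.

```python
import numpy as np

def softmax_rows(z):
    # Subtract each row's max before exponentiating (this does not change the result)
    shifted = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 4, 3, 5   # illustrative sizes only

X = rng.normal(size=(n_samples, n_features))          # input layer
theta = rng.normal(size=(n_classes, n_features + 1))  # weights; column 0 is the bias

# Prepend a column of ones so the bias is handled by the same matrix product
X_bias = np.hstack([np.ones((n_samples, 1)), X])

# Input layer connected directly to the output layer: no hidden layer
probs = softmax_rows(X_bias.dot(theta.T))
print(probs.shape)        # one row per sample, one column per class
print(probs.sum(axis=1))  # each row of probabilities sums to 1
```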

Neural networks add another level of complexity to our classifier, by adding one or more so-called **hidden** layers between the input layer and the output layer.   

A cartoon of such a network is shown below:
![alt text](https://github.com/big-data-analytics-physics/data/blob/master/images/neural_network_model.jpg?raw=true)

In fact the lower part of the neural network is exactly the same as the softmax regression that we discussed earlier:

![alt text](https://github.com/big-data-analytics-physics/data/blob/master/images/softmax_classifcation.jpg?raw=true)


# Activation functions
In all of the networks we have used so far, as well as the neural network we are using today, we employ **activation** functions at each of the nodes.   These activations can (and often will) be different depending on the placement of the node:
* Input nodes: here the activation is simply **identity**: the output of the node ($A_1$ in the above graphic) is simply the input.
* Hidden nodes: here we have a variety of possibilities, but they usually are chosen from among 3: sigmoid, tanh, and relu (rectified linear unit).  In our neural network we will use $tanh$.   See the section titled "Commonly used activation functions" [in this reference ](http://cs231n.github.io/neural-networks-1/) for more details.
* Output nodes: here we typically have *sigmoid* if there is a single output, or *softmax* if there are multiple outputs.
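
Of the three hidden-layer choices listed above, relu is the only one not implemented later in this workbook. A minimal sketch is shown here for comparison; the function names `relu` and `relu_deriv` are our own, and the slope at exactly z = 0 is set to 0 by convention.

```python
import numpy as np

def relu(z):
    # Rectified linear unit: passes positive values through, zeroes out the rest
    return np.maximum(0.0, z)

def relu_deriv(z):
    # Slope is 1 for z > 0 and 0 otherwise (the value at z = 0 is a convention)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # negative inputs become 0, positive inputs are unchanged
print(relu_deriv(z))  # 0 for the first three entries, 1 for the last two
```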

# Neural Network Outputs and Cost Function

We have as our inputs and outputs:
* A set of "m" samples, each a vector of "n" input features: 
$$X_i = \{x_{1i},x_{2i},x_{3i},...,x_{ni}\}$$
* Instead of a single set of weights $\theta$, we will have two sets of weights $W_1$ and $W_2$, which connect the input layer to the hidden layer, and the hidden layer to the output layer.
* As with softmax, our classifier has "k" outputs, one per class. The output for class c has the form:
$$p_c = {{e^{X\theta_c}}\over{\sum_{j=0}^{k-1}e^{X\theta_j}}}$$

where $p_c$ is the probability that the class is c for that specific set of features.   


We will define our cost function $J(W_1,W_2)$:
$$J(W_1,W_2) = -{1\over{m}}\sum_{i=0}^{m-1}\sum_{c=0}^{k-1}y_{i,c} ~\log~{p_{i,c}}$$
where
*   $y_{i,c}$ is a binary indicator (0 or 1) if class label c is the correct classification for observation i
*  $p_{i,c}$ - predicted probability observation i is of class c

Although this might look different, this is actually just a more compact way of writing the same cost function we used for softmax regression.   In addition, with a little bit of algebra, we could show that for two classes this is exactly the same cost function we used for logistic regression.
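
As a small worked example, the sketch below evaluates the cost on made-up numbers, assuming the sum runs over all m observations and all k classes. It also checks numerically that the two-class case reproduces the logistic regression cost. All arrays here are invented for illustration.

```python
import numpy as np

# Two observations (m = 2), three classes (k = 3); values are made up
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]])
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

m = y_onehot.shape[0]
# The one-hot mask keeps only the log-probability of the correct class
J = -np.sum(y_onehot * np.log(p)) / m
J_by_hand = -(np.log(0.7) + np.log(0.6)) / 2
print(J, J_by_hand)

# Two-class case: with p1 = P(class 1) and p0 = 1 - p1, the same sum becomes
# the logistic regression cost -[y log(p1) + (1 - y) log(1 - p1)]
y = np.array([1, 0, 1])
p1 = np.array([0.9, 0.2, 0.6])
y2 = np.column_stack([1 - y, y])
p2 = np.column_stack([1 - p1, p1])
J_softmax_form = -np.sum(y2 * np.log(p2)) / len(y)
J_logistic_form = -np.mean(y * np.log(p1) + (1 - y) * np.log(1 - p1))
print(np.isclose(J_softmax_form, J_logistic_form))
```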

As with *softmax*, our goal is to find the set of weights $W_1$ and $W_2$, which minimize the cost function J.   Before we get to that, let's implement the functions necessary to perform the neural network calculation.

## Some activation functions and their derivatives

In [None]:
import numpy as np
def sigmoid(z):
  # Logistic function 1 / (1 + e^(-z)); note the minus sign in the exponent
  sm = 1.0 / (1.0 + np.exp(-z))
  return sm

def sigmoid_deriv(z):
  sm = sigmoid(z)*(1-sigmoid(z))
  return sm

def tanh(z):
  return np.tanh(z)


def tanh_deriv(z):
  return 1 - np.square(np.tanh(z))
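
One quick way to gain confidence in a derivative function is to compare it with a central finite difference. The sketch below does this for the sigmoid and tanh derivatives; it redefines the standard logistic function under the hypothetical name `sigmoid_ref` so that the block runs on its own, and the step size `eps` is an arbitrary choice.

```python
import numpy as np

def sigmoid_ref(z):
    # Standard logistic function 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def central_difference(f, z, eps=1e-6):
    # Numerical estimate of df/dz at each point of z
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = np.linspace(-3.0, 3.0, 13)

# Analytic derivatives, written the same way as in the cell above
sig_analytic = sigmoid_ref(z) * (1.0 - sigmoid_ref(z))
tanh_analytic = 1.0 - np.square(np.tanh(z))

sig_err = np.max(np.abs(sig_analytic - central_difference(sigmoid_ref, z)))
tanh_err = np.max(np.abs(tanh_analytic - central_difference(np.tanh, z)))
print(sig_err, tanh_err)  # both should be tiny compared with the derivatives themselves
```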

## Implementation of the softmax function
We implement a slightly different version of this than we did previously.   Before we had:

$$p_c = {{e^{X\theta_c}}\over{\sum_{j=0}^{k-1}e^{X\theta_j}}}$$

where $p_c$ is the probability that the class is c for that specific set of features.   

Instead we will use:
$$p_c = {{e^{z_c}}\over{\sum_{j=0}^{k-1}e^{z_j}}}$$

where $z$ is the input to the output layer.

Notice in our implementation below, we find the max of the exponents and subtract it from every exponent, in both the numerator and the denominator of the softmax expression.   This multiplies the numerator and denominator by the same factor, so you can check for yourself that it does not change the expression.   However, it does help to avoid overflow when calculating very large exponentials.


In [None]:
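def softmaxNew(v):
  # v: matrix of scores, one row per sample and one column per class
  # Shift each row by its own max so the largest exponent is 0 (avoids overflow)
  logC = -np.max(v, axis=1, keepdims=True)
  expV = np.exp(v + logC)
  return expV / np.sum(expV, axis=1, keepdims=True)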
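
The effect of the shift can be seen in a small experiment. The sketch below compares an unshifted softmax with a shifted one on small and on very large scores; the helper names `softmax_naive` and `softmax_shifted` and the test values are assumptions chosen for illustration.

```python
import numpy as np

def softmax_naive(v):
    # Direct evaluation of the softmax formula, with no shift
    e = np.exp(v)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_shifted(v):
    # Subtract each row's max before exponentiating
    shifted = v - np.max(v, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])

# On moderate scores the two versions agree
same_on_small = np.allclose(softmax_naive(small), softmax_shifted(small))

# On large scores exp() overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)

shifted_large = softmax_shifted(large)
print(same_on_small)
print(np.isnan(naive_large).any())
print(shifted_large)  # same probabilities as for the small scores
```

Because the large scores differ from the small ones by a constant, the shifted version returns the same probabilities for both rows.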

## Next we implement the cost function
As with softmax, the cost function we will actually implement below has an additional term for **regularization**.   We initially set this to be 0.


In [None]:
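def calc_cost(yp_oneHot,output,w1,w2,Lambda=0.0):
  m = yp_oneHot.shape[0] # number of training examples in this batch
  # Cross entropy averaged over the examples
  crossEntropy = (-1 / m) * np.sum(yp_oneHot * np.log(output))
  # L2 regularization; column 0 of each weight matrix multiplies the bias and is left out
  regularization = (Lambda/2.0)*(np.sum(np.square(w1[:, 1:])) + np.sum(np.square(w2[:, 1:])))
  return crossEntropy + regularization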


## The Forward Pass
Sending a single instance of our feature vectors through the neural network is called a **forward pass**.   If we look at the cartoon of our network, this is actually pretty straightforward to implement.


In [None]:
def forward_pass(x, w1, w2):
#
#
# x: input matrix, dimension samples by features
# w1: weight matrix connecting input layer to hidden layer (takes place of earlier Theta matrix) 
#       ==> dimension hidden nodes by (features+1)
# w2: weight matrix connecting hidden layer to output layer (takes place of earlier Theta matrix) 
#       ==> dimension output nodes by (hidden nodes+1)
# a1: the "output" (also called the activation) of the input layer => just a copy of the input layer
#     we need to add the ones column which activates the bias
  ones = np.ones((len(x),1))
  a1 = np.append(ones,x,axis=1)
#
# z2: the input to the hidden layer = weight w1 matrix applied to (input features plus bias)
  z2 = a1.dot(w1.T)
# z2 has dimension samples by hidden nodes; tanh maps each value into the range -1 to 1
#
# The output of the hidden layer is the input passed through an "activation" function.  This can
# be sigmoid, tanh, relu, etc
  a2 = tanh(z2)
#
# Need to add "ones" column to this just like a1
  ones = np.ones((len(a2),1))
  a2 = np.append(ones,a2,axis=1)

#
# z3: the input to the output layer = weight w2 matrix applied to (a2 plus bias)  
  z3 = a2.dot(w2.T)
# 
# The "output" of the output layer is z3 passed through the softmax activation
  a3 = softmaxNew(z3)
  return a1, z2, a2, z3, a3

## Backpropagation
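Remember that our goal is to find the weights $W_1$ and $W_2$ that minimize the cost function J.  Gradient descent needs the derivative of J with respect to every weight, and with a hidden layer in the way these derivatives are harder to write down than they were for softmax regression.  To compute them we will use a concept called **backpropagation**.  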

There are some helpful writeups that might make things clear for you:
* [A simple overview of the chain rule and backpropagation ](https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html)
* [A detailed  application of the chain rule and backpropagation](http://www.briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4)
* [The derivative of the softmax output and the cross entropy cost function](https://deepnotes.io/softmax-crossentropy)


We want to find the best $W_1$ and $W_2$, so we will adjust these weights in proportion to how much they contribute to the overall cost: 
$$W_1:= W_1 - \alpha {\partial J\over \partial W_1}$$
$$W_2:= W_2 - \alpha {\partial J\over \partial W_2}$$

This looks just like gradient descent, but the challenge is: how do we determine ${\partial J\over \partial W_1}$ and ${\partial J\over \partial W_2}$?   This is where backpropagation comes in.

If we look at the forward pass, we see how we get from an input X to an output a3:
$$x\rightarrow a_1 \rightarrow z_2=a_1 W_1^T \rightarrow a_2=f_{tanh}(z_2) \rightarrow z_3=a_2 W_2^T \rightarrow a_3=f_{softmax}(z_3) $$

The basic idea is to look at the forward pass, and view our cost function as a series of **nested** equations.
* Thinking of J as a function of $W_2$:
$$J = f_{cost}( f_{a3}(f_{z3}(W_2)))$$

* Thinking of J as a function of $W_1$:
$$J = f_{cost}( f_{a3}(f_{z3}(f_{a2}(f_{z2}(W_1)))))$$

To get the partial derivatives of J with respect to $W_1$ and $W_2$, we use the chain rule:
 $${\partial J\over \partial W_2} =    {\partial J\over \partial a_3} ~ {\partial a_3\over \partial z_3} ~{\partial z_3\over \partial W_2}       $$

 $${\partial J\over \partial W_1} =    {\partial J\over \partial a_3} ~ {\partial a_3\over \partial z_3} ~{\partial z_3\over \partial a_2}  ~ {\partial a_2\over \partial z_2}  ~ {\partial z_2\over \partial W_1}      $$
 
Here are each of the above derivatives:
1. From the third link above, we can get the combined derivative of the softmax output and the cross entropy cost function (the first two factors in both expressions): ${\partial J\over \partial a_3}{\partial a_3\over \partial z_3} = (a_3 - y)$
2. ${\partial z_3\over \partial W_2} = a_2$
3. ${\partial z_3\over \partial a_2} = W_2 $
4. ${\partial a_2\over \partial z_2} = 1 - \tanh^2(z_2)$
5. ${\partial z_2\over \partial W_1}  = a_1$
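
A common sanity check for derivatives like these is to compare them with finite differences of the cost. The sketch below builds a tiny network with the same structure (bias column, tanh hidden layer, softmax output) and compares the chain-rule gradients with numerical ones. It is a self-contained illustration, not the workbook's implementation: the helper names, sizes, and random data are all assumptions, and the cost is summed rather than averaged over samples so that it matches gradients built from $(a_3 - y)$ with no $1/m$ factor.

```python
import numpy as np

def add_bias(a):
    # Prepend the column of ones that multiplies the bias weights
    return np.hstack([np.ones((a.shape[0], 1)), a])

def softmax_rows(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def forward(x, w1, w2):
    a1 = add_bias(x)
    z2 = a1.dot(w1.T)
    a2 = add_bias(np.tanh(z2))
    a3 = softmax_rows(a2.dot(w2.T))
    return a1, z2, a2, a3

def summed_cost(x, y, w1, w2):
    # Cross entropy summed (not averaged) over the samples
    a3 = forward(x, w1, w2)[3]
    return -np.sum(y * np.log(a3))

def analytic_grads(x, y, w1, w2):
    a1, z2, a2, a3 = forward(x, w1, w2)
    delta3 = a3 - y                      # combined dJ/da3 * da3/dz3
    grad_w2 = delta3.T.dot(a2)           # dz3/dW2 = a2
    # dz3/da2 = W2 (bias column dropped), da2/dz2 = 1 - tanh^2(z2)
    delta2 = delta3.dot(w2[:, 1:]) * (1.0 - np.square(np.tanh(z2)))
    grad_w1 = delta2.T.dot(a1)           # dz2/dW1 = a1
    return grad_w1, grad_w2

def numeric_grad(f, w, eps=1e-6):
    # Central difference on each entry of w; w is modified in place and restored
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = f()
        w[idx] = old - eps
        down = f()
        w[idx] = old
        grad[idx] = (up - down) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))                    # 5 samples, 3 features
y = np.eye(4)[rng.integers(0, 4, size=5)]      # one-hot labels, 4 classes
w1 = rng.normal(scale=0.5, size=(6, 3 + 1))    # 6 hidden nodes
w2 = rng.normal(scale=0.5, size=(4, 6 + 1))

g1, g2 = analytic_grads(x, y, w1, w2)
n1 = numeric_grad(lambda: summed_cost(x, y, w1, w2), w1)
n2 = numeric_grad(lambda: summed_cost(x, y, w1, w2), w2)
err1 = np.max(np.abs(g1 - n1))
err2 = np.max(np.abs(g2 - n2))
print(err1, err2)  # both should be very small if the chain rule was applied correctly
```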

In [None]:
def backpropMine(a1, a2, a3, z2, y_enc, w1, w2,Lambda):

#
  pJ_pa3__pa3_pz3 = (a3 - y_enc)
  pz3_pw2 = a2
#
# Pull it all together
  grad_w2 = pJ_pa3__pa3_pz3.T.dot(pz3_pw2)

  pz3_pa2 = w2
  ones = np.ones((len(z2),1))
  z2 = np.append(ones,z2,axis=1)
  pa2_pz2 = tanh_deriv(z2)
  pz2_pw1 = a1
  sigma2 = pJ_pa3__pa3_pz3.dot(pz3_pa2) * pa2_pz2
#
# Pull it all together
  grad_w1 = sigma2[:, 1:].T.dot(pz2_pw1)
#
# add the regularization term
  grad_w1[:, 1:]+= (w1[:, 1:]*Lambda) # derivative of .5*l2*w1^2
  grad_w2[:, 1:]+= (w2[:, 1:]*Lambda) # derivative of .5*l2*w2^2
  return grad_w1, grad_w2


## Iterating until we converge
The basic algorithm then to implement gradient descent looks like this:
1. Initialize each of the $W_1$ and $W_2$  parameters to some reasonable value - in this case, it turns out that initializing these weights to 0 would be a *bad idea*.  Instead, we initialize them to a random, but small, number. Remember that these two matrices connect subsequent layers, so:
  *  $W_1$ is a matrix of (nhidden) by (nfeatures +1)
  * $W_2$ is a matrix of (noutputs) by (nhidden+1)
2. Choose a learning rate $\alpha$, and the maximum number of allowed iterations.  Iterations in which we have processed a full training set are called **epochs**.  
3. It is common for the algorithms to pass in chunks of the training set at a time (called mini-batches), and then update the weights after each mini-batch.   This idea is implemented below.   It is also common for the training to be stopped once the cost begins to increase on a held-out validation set (often called early stopping).   This is **not** implemented below.
4. Run a forward pass to get our values at each stage.
5. Run backpropagation to get our gradients for our weights and adjust the weights.
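
The mini-batch idea in step 3 can be sketched with `np.array_split`, which divides a list of row indices into nearly equal chunks. The sizes below are hypothetical, and shuffling the indices first is shown only as one common choice; it is an assumption of this sketch.

```python
import numpy as np

n_samples = 10    # hypothetical training-set size
num_batches = 3   # hypothetical number of mini-batches per epoch

rng = np.random.default_rng(0)
# Shuffling is optional; it is shown here only as one common choice
order = rng.permutation(n_samples)
batches = np.array_split(order, num_batches)

for b, idx in enumerate(batches):
    # In training, X[idx] and y[idx] would be passed to the forward pass here
    print("batch", b, "uses rows", idx)

sizes = [len(idx) for idx in batches]
print(sizes)  # [4, 3, 3]: the leftover row goes to the first batch
```

Every row index appears in exactly one batch, so one pass over `batches` is one epoch.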

It is helpful to keep track of the cost for each iteration (or with each mini-batch), so you can plot it and inspect its behavior.   And of course you need to keep track of the last value of the $W_1$ and $W_2$ parameters so you can return them.

An implementation of this iteration algorithm is shown below.

In [None]:
def fit(X, y_oneHot, n_hidden=100,epochs=100,numBatches=500,Lambda=0.0,learning_rate=0.001):
  
  decay_rate=0.00001
#
# Get copies of the data
  X_data = X.copy()
  y_enc = y_oneHot.copy()
# Get the initial values
  m,n_features = X_data.shape   # this has the true "n" features 
#
# How many outputs do we have
  m,n_output = y_enc.shape
#
# Initialize the weights to small random numbers
  w1 = np.random.uniform(-1.0, 1.0, size = n_hidden *(n_features + 1)).reshape(n_hidden, (n_features + 1))/(10.0*n_features + 1)
  w2 = np.random.uniform(-1.0, 1.0, size= n_output*(n_hidden+1)).reshape(n_output,n_hidden+ 1) /(10.0*n_hidden + 1)
  prev_grad_w1 = np.zeros(w1.shape)
  prev_grad_w2 = np.zeros(w2.shape)
  costs = []

# Run through the dataset some fixed number of epochs
  for i in range(epochs):
    learning_rate /= (1 + decay_rate*i)
#
# Split the data up into chunks
    mini = np.array_split(range(X_data.shape[0]), numBatches)
    print("epoch",i)
    for idx in mini:
#
# Forward pass
      a1, z2, a2, z3, a3= forward_pass(X_data[idx], w1, w2)
      cost = calc_cost(y_enc[idx], a3, w1, w2, Lambda=Lambda)
      costs.append(cost)

# compute gradient via backpropagation
      grad1, grad2 = backpropMine(a1=a1, a2=a2, a3=a3, z2=z2, y_enc=y_enc[idx], w1=w1, w2=w2, Lambda=Lambda)
      w1_update, w2_update = learning_rate*grad1, learning_rate*grad2
      w1 += -w1_update
      w2 += -w2_update
      prev_grad_w1, prev_grad_w2 = w1_update, w2_update
  return w1,w2,costs

# Apply the Neural Network Algorithm to the MNIST data

## Get the Data
We will use the MNIST data sample to test our neural network algorithm.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Form our test and train data
from sklearn.model_selection import train_test_split

#short = ""
short = "short_"
dfCombined = pd.DataFrame()
#
# Read in digits
for digit in range(10):
  fname = 'https://raw.githubusercontent.com/big-data-analytics-physics/data/master/ch3/digit_' + short + str(digit) + '.csv'
  df = pd.read_csv(fname,header=None)
  df['digit'] = digit
  dfCombined = pd.concat([dfCombined, df])


## Make Separate Test and Train Samples
We will do a simple 70/30 split to form our Train/Test sample.

We also need to:
* Scale the input data.   Since we know the input pixel data goes from 0-255, we can just divide by 255.
* Convert our output labels to 1-hot.   We will use a **keras** utility for this.
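
For readers who want to see what these two preparation steps do, here is a minimal NumPy-only sketch. The function name `one_hot` and the sample values are assumptions for illustration; this is an alternative to the utility mentioned above, not a replacement for it.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[labels]

labels = np.array([0, 2, 1, 2])
encoded = one_hot(labels, num_classes=3)
print(encoded)  # one row per label, with a single 1 in the column of that class

# Scaling: pixel values in 0-255 are mapped into 0-1
pixels = np.array([0, 128, 255])
scaled = pixels / 255.0
print(scaled)
```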

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
# On older Keras versions this import lives in keras.utils.np_utils
from keras.utils import to_categorical

train_digits,test_digits = train_test_split(dfCombined, test_size=0.3, random_state=42)
yTrain = train_digits['digit'].values
# DataFrame.as_matrix was removed from pandas; select the 784 pixel columns and convert
XTrain = train_digits[train_digits.columns[:784]].to_numpy()

yTest = test_digits['digit'].values
XTest = test_digits[test_digits.columns[:784]].to_numpy()

#
# one hot encode the labels
num_classes = len(np.unique(yTrain))
print("Number distinct classes ",num_classes)
yTrain_oneHot = to_categorical(yTrain, num_classes=num_classes)
yTest_oneHot = to_categorical(yTest, num_classes=num_classes)
for i in range(10):
  print("digit ",yTrain[i],"encoding",yTrain_oneHot[i])
  
#
# We need to normalize our data - just divide by 255!
XTrain = XTrain/255.0
XTest = XTest / 255.0
#
print("XTrain",XTrain.shape)
print("XTest",XTest.shape)

## Now run the algorithm
Set our parameters to reasonable values and run!

In [None]:

n_hidden=100
epochs=100
learning_rate=0.0001
numBatches=20

w1,w2,costs = fit(XTrain, yTrain_oneHot, n_hidden=n_hidden,epochs=epochs,numBatches=numBatches,learning_rate=learning_rate)
print("costs ",costs[-1])

## Examine Results
Look at the performance of our classifier.    We can look at both the accuracy (for test and train), as well as the cost plot.

In [None]:
def getResults(X,w1,w2):
#
# Run a forward pass for our input data
  a1, z2, a2, z3, a3 = forward_pass(X, w1, w2)
#
# Our output probability vector for each class is a3
#
# Get the max probabilities
  probs = np.max(a3, axis = 1)
#
# Get the predicted classes
  classes = np.argmax(a3, axis = 1)
#
  return classes,probs


In [None]:
#
# Train results
classes,probs = getResults(XTrain,w1,w2)
correct = 0
for i in range(yTrain.shape[0]):
  if yTrain[i] == classes[i]:
    correct += 1
acc = 100.0*correct / yTrain.shape[0]
print("Train accuracy ",acc)
#
# Test results
classes,probs = getResults(XTest,w1,w2)
correct = 0
for i in range(yTest.shape[0]):
  if yTest[i] == classes[i]:
    correct += 1
acc = 100.0*correct / yTest.shape[0]
print("Test accuracy ",acc)
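
The counting loops above can also be written as a single vectorized comparison. The sketch below shows the idea on made-up labels; `accuracy_percent` is a hypothetical helper name, not something defined elsewhere in this workbook.

```python
import numpy as np

def accuracy_percent(y_true, y_pred):
    # The elementwise comparison gives booleans; their mean is the fraction correct
    return 100.0 * np.mean(y_true == y_pred)

y_true = np.array([3, 1, 4, 1, 5])
y_pred = np.array([3, 1, 4, 0, 5])
acc = accuracy_percent(y_true, y_pred)
print(acc)  # 4 of the 5 predictions match
```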


## Note the difference in the enable_plotly_in_cell call:

In [None]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
#
# OLD (google colab)
#  display(IPython.core.display.HTML('''
#        <script src="/static/components/requirejs/require.js"></script>
#  '''))
#  init_notebook_mode(connected=False)
#
# New (OSC) [thanks to Stephen Gant for this!]
  init_notebook_mode(connected=True)


In [None]:
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()
data = go.Scatter(
    x=np.array(range(0,len(costs))),
    y=costs,
    mode='markers',
    name="fitted data"
)


iplot(dict(data=[data]))