# Week 5, Back-propagation

## Further reading

* Elements of Statistical Learning: Chapter 11, Neural Networks ([get a copy](https://web.stanford.edu/~hastie/ElemStatLearn/))
* Computer Age Statistical Inference: Chapter 18, Neural Networks and Deep Learning (p. 351) ([get a copy](https://web.stanford.edu/~hastie/CASI/))
* [Peter's Notes](http://peterroelants.github.io/posts/neural_network_implementation_part01/) are a bit mathy and specific, but I've found them helpful when confused

## Code

### Feed-forward basics

During the feed-forward, all data is pushed forward until it culminates in the output. The output depends on everything that precedes it.

#### Numpy importing

To get started, we'll need to import `numpy` to deal with all the matrices involved. Each NN library you use will have its own way of handling matrices. They tend to be similar and might even work with `numpy` matrices seamlessly.

In [8]:
import numpy as np

#### Values a-flowing

To feed-forward data through a neural network is to pass data through the network's weights and activation functions to create an output. The feed-forward technically begins with the input layer, a copy of the input data; however, the network's actual activity begins at the first hidden layer, when the input signal is passed through synapses (weights), and adjusted (bias) and transformed (activation function) by the neuron. Here is what it does:

Signal $X$ passing through weights $W_1$: $X \times W_1$

Signal adjusted by neuron bias $B_1$: $X \times W_1 + B_1 = z_1$

Signal shaped by activation function $\sigma$: $a_1 = \sigma(z_1)$

Let's look at that more closely. The $W_1$ (weights) are the "synapses" of the "neurons": they are the connections the neurons use to pull in data. These synapses can be increased to amplify an incoming variable, set to zero to ignore one, or made negative to invert the inbound signal. During training they are tuned by the neurons to help the NN minimize prediction error.

* Synapses strengthen or weaken incoming variables with weights
* Biases adjust the sum of the weighted signals
* Neurons put all of it together and transform it with an activation function

The weights are matrices with dimensions $in \times out$, where $in$ is the size of the data coming in and $out$ is the size of the data coming out. $out$ is the number of neurons in the layer, and $in$ is the number of values each of these neurons is fed during feed-forward. The biases are $1 \times out$, one bias for each neuron.

![Dimensions](W5_SimpleNeurons.png "Dimensions")

Above you can see that the input data has size 3 and that each neuron has 3 synapses ($2 \times 3 = 6$ in total). There are 2 neurons in the layer and they produce 2 outputs, one per neuron. There are also 2 biases, 1 for each of the 2 neurons.

Each layer's shape is set by only two quantities: how many values come in and how many go out.
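
Here is a small sketch of the dimensions described above, for a layer with 3 inputs and 2 neurons. The values in `x`, `W` and `B` are made up for illustration; only the shapes matter.

```python
import numpy as np

# Shapes for a layer with 3 input values and 2 neurons.
n_in, n_out = 3, 2

x = np.array([[0.5, -1.0, 2.0]])   # one record: 1 x n_in
W = np.ones((n_in, n_out))         # weights: n_in x n_out (6 synapses)
B = np.zeros((1, n_out))           # biases: 1 x n_out (one per neuron)

z = x.dot(W) + B                   # result is 1 x n_out

print(W.size)    # 6 synapses in total
print(z.shape)   # (1, 2): one output per neuron
```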

#### Neurons a-working

The trick of the neurons is that they each get their own copy of the data to work on. This happens because of the way matrix multiplication is performed: rows of the input don't mix with other rows, and columns of the weights don't mix with other columns.

Here is a trivial but familiar example. You can see that each neuron (column) does its own thing. Change one of the weight matrix's elements to see the effect on the output. The five neurons' outputs are the five elements in the output array.

In [9]:
x = np.array([[1, 2, 3, 4, 5]])
weight = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 1]])
x.dot(weight)

array([[1, 2, 3, 4, 5]])

I will give you my own example of using the neurons' synapses to play with the input data. I can perform operations on the inputs separately for each neuron. Look at each weight column vertically.

In [10]:
x = np.array([[1, 2, 3, 4, 5]])
weight = np.array([[-1,  0,  0,  0,  0],
                   [ 0,  2,  0,  0,  0],
                   [ 2,  0,  1, -1,  0],
                   [ 0,  0,  0,  0, -1],
                   [ 0,  0,  0,  1,  1]])
x.dot(weight)

array([[5, 4, 3, 2, 1]])

The feed-forward is then a great mixing of data over many layers of neurons. Each layer expands data into multiple copies and its neurons compress it back into a few outputs. In the diagram above you see 3 units of inputs expanded into 6 synapse signals and then collapsed into 2 output signals. The power of the neural network comes from the fact that the next layer *then copies* these 2 output signals to each of its own neurons, so everything affects everything.
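
The expansion and collapse can be made visible by splitting the matrix product into its two steps. This is only a sketch with made-up numbers for a 3-input, 2-neuron layer; `signals` and `collapsed` are names I chose for illustration.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])        # 1 x 3 input
W = np.array([[1.0,  0.0],
              [0.5, -1.0],
              [0.0,  2.0]])            # 3 x 2 weights

# Expansion: every input value is copied to every neuron and scaled
# by that synapse's weight, giving 3 x 2 = 6 synapse signals.
signals = x.T * W                      # shape (3, 2)

# Collapse: each neuron (column) sums its own incoming signals.
collapsed = signals.sum(axis=0, keepdims=True)   # shape (1, 2)

print(signals.size)                        # 6
print(np.allclose(collapsed, x.dot(W)))    # True
```

The two steps together are exactly what `x.dot(W)` does in one go.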

Or in other words, feed-forward is like a decision reached by successive committees. The neurons in a layer form a committee that looks at data together, performs analysis, and then summarizes its findings into a condensed form. Higher committees then analyze this report at a higher level, and so on, until the final output layer makes a decision based on the accumulated wisdom so far: it outputs a single value between 0 and 1.

#### Hidden layers feed-forwarding

With all that in mind, this is the feed-forward:

1. $z_1 = X \times W_1 + B_1$
2. $a_1 = \sigma(z_1)$
3. $z_2 = a_1 \times W_2 + B_2$
4. $a_2 = \sigma(z_2)$
5. $z_{output} = a_2 \times W_{output} + B_{output}$
6. $a_{output} = \sigma(z_{output})$
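
Since every layer repeats the same two steps, the six equations can be sketched as a loop. The helper `feed_forward` and the layer sizes 5, 3, 2, 1 are my own choices for illustration, and the weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feed_forward(X, layers):
    """Pass X through a list of (W, B) pairs, applying sigmoid after each."""
    a = X
    for W, B in layers:
        z = a.dot(W) + B    # weighted sum plus bias
        a = sigmoid(z)      # activation
    return a

rng = np.random.default_rng(0)
sizes = [5, 3, 2, 1]        # input, two hidden layers, output
layers = [(rng.random((n_in, n_out)), rng.random((1, n_out)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

X = rng.random((10, 5))     # ten records of 5 variables
a_output = feed_forward(X, layers)
print(a_output.shape)       # (10, 1)
```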

By way of comparison, here is Andrew Ng's notation.

1. $a^{(1)} = x$
2. $z^{(2)} = \theta^{(1)} a^{(1)}$
3. $a^{(2)} = g(z^{(2)})$
4. $z^{(3)} = \theta^{(2)} a^{(2)}$
5. $a^{(3)} = g(z^{(3)})$
6. $z^{(4)} = \theta^{(3)} a^{(3)}$
7. $a^{(4)} = h_{\theta}(x)=g(z^{(4)})$
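
The two notations compute the same thing. As far as I understand Ng's convention, activations are column vectors and the bias is folded into $\theta$ by prepending a constant 1 to each activation vector. The sketch below checks that equivalence for a single layer under that assumption; the variable names and random values are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1, 3))     # row-vector input, as in this document
W = rng.random((3, 2))
B = rng.random((1, 2))

# Row convention used here: z = x W + B
z_row = x.dot(W) + B                   # shape (1, 2)

# Column convention (assumed): prepend a constant 1 to the activations
# and put the bias in the first column of theta.
a = np.vstack([[1.0], x.T])            # shape (4, 1)
theta = np.hstack([B.T, W.T])          # shape (2, 4)
z_col = theta.dot(a)                   # shape (2, 1)

print(np.allclose(z_row, z_col.T))     # True
```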

Let's generate some data. Thanks to the properties of matrix multiplication, I can have 10 rows of input data and these will be processed entirely separately, yielding 10 rows of output data.

In [11]:
X = np.random.random((10,5)) # Ten records of 5 variables
b1 = np.random.random((1,3)) # Bias: 1 x layer_1_size
w1 = np.random.random((5,3)) # Weight: input_vars x layer_1_size
b2 = np.random.random((1,2)) # Bias: 1 x layer_2_size
w2 = np.random.random((3,2)) # Weight: layer_1_size x layer_2_size
b_out = np.random.random((1,1)) # Bias: 1 x output_size
w_out = np.random.random((2,1)) # Weight: layer_2_size x output_size

Here then are the feed-forward results.

In [12]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

# First hidden layer, three neurons each give an output
z1 = b1 + X.dot(w1)
a1 = sigmoid(z1)
print(a1)

[[ 0.82974261  0.76827086  0.92042924]
 [ 0.77466191  0.77161415  0.86631361]
 [ 0.84549502  0.85915791  0.931171  ]
 [ 0.89972488  0.8552271   0.95105239]
 [ 0.83860152  0.84618743  0.92708683]
 [ 0.85779854  0.77560829  0.91356551]
 [ 0.88537909  0.88456776  0.940265  ]
 [ 0.82648994  0.82618912  0.93272566]
 [ 0.89141779  0.87686866  0.95865405]
 [ 0.84877291  0.78046325  0.91539761]]


In [13]:
# Second hidden layer, two neurons each give an output
z2 = b2 + a1.dot(w2)
a2 = sigmoid(z2)
print(a2)

[[ 0.79655718  0.9159496 ]
 [ 0.79479093  0.91229396]
 [ 0.81032192  0.92284997]
 [ 0.81066034  0.92537856]
 [ 0.80829336  0.92163685]
 [ 0.79749173  0.91764283]
 [ 0.81437169  0.92633559]
 [ 0.80557097  0.91990679]
 [ 0.8139685   0.9264766 ]
 [ 0.79825718  0.91757714]]


In [14]:
# Output layer: one output for each input record
z_out = b_out + a2.dot(w_out)
a_out = sigmoid(z_out)
print(a_out)

[[ 0.72473087]
 [ 0.72410195]
 [ 0.72592529]
 [ 0.72635779]
 [ 0.7257159 ]
 [ 0.72502198]
 [ 0.72652471]
 [ 0.72541727]
 [ 0.72654841]
 [ 0.72501143]]
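
The claim that the records are processed separately can be checked directly. This is a sketch on a single layer with fresh random data, not the arrays from the cells above: feeding the records one at a time gives the same rows as feeding them all at once.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((10, 5))    # ten records of 5 variables
w = rng.random((5, 3))
b = rng.random((1, 3))

# All ten records at once.
batch = sigmoid(X.dot(w) + b)

# Each record fed through alone, then stacked back together.
one_by_one = np.vstack([sigmoid(X[i:i + 1].dot(w) + b) for i in range(10)])

print(np.allclose(batch, one_by_one))   # True
```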


### Back propagation

If the feed-forward is data pushed all the way forward to the outputs, then back-propagation is errors trickling back from the outputs all the way to the very first hidden layer.

#### Intuition

Back-propagation is a way of using gradient descent on neural networks of multiple layers. It isn't necessary with a linear or logistic regression because these are effectively single-layer neural networks. Back-propagation is a technique for applying gradient descent successively over multiple layers.

It all starts at the output. Here there is a clear link between the choice of weights and of biases and the output error, and the approach is the same as simple gradient descent.

At the layer preceding the output, we'll call it $l_2$, there is an extra step. What is the link between $l_2$ weights and biases and the output error? It has multiple steps: $l_2$ has an effect on the output layer's data, and the output layer's data has an effect on the accuracy of the model's output.
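
Written with the chain rule, the extra step looks like this. Here $E$ stands for the output error (a symbol I'm introducing for illustration), and the products are schematic: they ignore matrix ordering and transposes.

$$\frac{\partial E}{\partial W_{output}} = \frac{\partial E}{\partial a_{output}} \cdot \frac{\partial a_{output}}{\partial z_{output}} \cdot \frac{\partial z_{output}}{\partial W_{output}}$$

$$\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial a_{output}} \cdot \frac{\partial a_{output}}{\partial z_{output}} \cdot \frac{\partial z_{output}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2}$$

The first two factors are shared. The $l_2$ gradient reuses what the output layer already computed and extends it by the link $\partial z_{output} / \partial a_2$ through the output layer's weights.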

In other words, the output layer is the boss and it is directly responsible for the model's error. If the output layer changes its behaviour, it can directly improve its accuracy.

Similarly, the hidden layers are not directly responsible for the model's error; however, they are responsible for providing the output layer accurate analyses of the model's input data. Knowing their boss, they have an idea of how to change their computations so that the big cheese makes more informed decisions.
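
To make the boss-and-staff picture concrete, here is a minimal sketch of the gradients for a network with one hidden layer. It rests on assumptions of mine: a squared-error loss, made-up targets `y`, and names such as `delta_out` and `grad_w_h`. The last lines compare one back-propagated gradient against a numerical estimate.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((10, 5))
y = rng.integers(0, 2, size=(10, 1)).astype(float)     # made-up targets

w_h, b_h = rng.random((5, 2)), rng.random((1, 2))      # hidden layer
w_out, b_out = rng.random((2, 1)), rng.random((1, 1))  # output layer

def loss(w_hidden):
    """Squared error (assumed loss) as a function of the hidden weights."""
    a_hidden = sigmoid(X.dot(w_hidden) + b_h)
    a_o = sigmoid(a_hidden.dot(w_out) + b_out)
    return 0.5 * np.sum((a_o - y) ** 2)

# Feed-forward, keeping the intermediate activations.
a_h = sigmoid(X.dot(w_h) + b_h)
a_out = sigmoid(a_h.dot(w_out) + b_out)

# Output layer: directly linked to the error.
delta_out = (a_out - y) * a_out * (1 - a_out)          # dE/dz_out
grad_w_out = a_h.T.dot(delta_out)

# Hidden layer: the error reaches it only through the output weights.
delta_h = delta_out.dot(w_out.T) * a_h * (1 - a_h)     # dE/dz_h
grad_w_h = X.T.dot(delta_h)

# Numerical check of one hidden-layer weight (central difference).
eps = 1e-6
up, down = w_h.copy(), w_h.copy()
up[0, 0] += eps
down[0, 0] -= eps
numeric = (loss(up) - loss(down)) / (2 * eps)
print(numeric, grad_w_h[0, 0])
```

The two printed numbers should agree to several decimal places if the chain of deltas is right.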