# Week 5, Lecture 9, Back-propagation

These notes cover a bit of week 4 as well. It's better to have feed forward, back propagation, and gradient checking all in one place.

## Further reading

* Elements of Statistical Learning: Chapter 11, Neural Networks ([get a copy](https://web.stanford.edu/~hastie/ElemStatLearn/))
* Computer Age Statistical Inference: Chapter 18, Neural Networks and Deep Learning (p. 351) ([get a copy](https://web.stanford.edu/~hastie/CASI/))
* [Peter's Notes](http://peterroelants.github.io/posts/neural_network_implementation_part01/) are a bit mathy and specific, but I've found them helpful when confused

## Code

### Feed-forward basics

During the feed-forward, the data has to be fed to layers one after the other.

#### Numpy importing

To get started, we'll need to import `numpy` to deal with all the matrices involved.

Each NN library you use will have its own way of handling matrices. They tend to be similar, and some even work with `numpy` arrays seamlessly.

In [1]:
import numpy as np

#### Values a-flowing

In a neural network, values flow through layers of synapses and neurons. This is called feed-forward.

To feed-forward data through a neural network is to pass data through the network's weights and activation functions to create an output. The feed-forward receives its data at the input layer, which is simply a copy of the input. The activity begins at the first hidden layer, where the input signal is passed through synapses (weights), then adjusted (bias) and transformed (activation function) by the neuron. Here is what it does:

##### First hidden layer

Signal $x$ passing through weights $w_1$: $x \times w_1$

Signal adjusted by neuron bias $b_1$: $x \times w_1 + b_1 = z_1$

Signal shaped by activation function $\sigma$: $a_1 = \sigma(z_1)$

Let's look at that more closely. The $w_1$ matrix (weights) is the collection of "synapses" of the "neurons": they are the connections the neurons use to pull in data. These synapses can be increased to amplify an incoming variable, set to zero to ignore one, or made negative to invert the inbound signal. During training they are tuned by the neurons to help the NN minimize prediction error.

The biases $b_1$ are unique to each neuron. Each neuron uses its bias to shift the weighted sum it receives.

An activation function like $\sigma$, the sigmoid function, pushes the neuron's output toward a binary 0 or 1. Unlike a digital computer though, the signal can fall anywhere *between* 0 and 1 when the neuron is unsure.
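Here is a quick sketch of that squashing behaviour, assuming the standard logistic definition of the sigmoid. The sample inputs are arbitrary values picked for illustration.

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # arbitrary sample inputs
a = sigmoid(z)
print(a)
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 gives exactly 0.5, the "unsure" middle.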

* Synapses strengthen or weaken incoming variables with weights
* Biases adjust the sum of the weighted signals
* Neurons put all of it together and transform it with an activation function

The weights are matrices with dimensions $in \times out$, where $in$ is the size of the data coming in and $out$ is the size of the data coming out. $out$ is the number of neurons in the layer, and $in$ is the number of values each of these neurons is fed during feed-forward. The biases are $1 \times out$, one bias for each neuron.

![Dimensions](W5_SimpleNeurons.png "Dimensions")

Above you can see that the input data has size 3 and that each neuron has 3 synapses ($2 \times 3 = 6$ in total). There are 2 neurons in the layer and they produce 2 outputs, one per neuron. There are also 2 biases, one inside each of the 2 neurons.

Each layer's shape comes down to only two quantities: how much comes in and how much goes out.
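The shapes in the diagram can be checked with a small sketch. The values are placeholders (ones and zeros), since only the dimensions matter here, and the names `n_in` and `n_out` are mine.

```python
import numpy as np

n_in, n_out = 3, 2             # 3 inputs, 2 neurons, as in the diagram
x = np.ones((1, n_in))         # one record with 3 values (placeholder data)
w = np.zeros((n_in, n_out))    # in x out: 3 x 2 = 6 synapses in total
b = np.zeros((1, n_out))       # 1 x out: one bias per neuron

z = x.dot(w) + b               # (1 x 3) . (3 x 2) + (1 x 2) gives (1 x 2)
print(x.shape, w.shape, b.shape, z.shape)
```

The inner dimensions (3 and 3) must match, and the output takes its width from the number of neurons.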

The first aspect of the feed-forward is then the flowing of data through weights, biases, and activation functions.

#### Neurons a-working

A neural network has a dual nature: a linear nature at the unit level and a complex non-linear one at the network level.

The neurons' linear nature helps them perform computations. They each get their own copy of the data to work on. This amazing trick is possible because of matrix multiplication (or dot product). Rows of the input don't mix with other rows, and columns of the weights don't mix with other columns.
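A small sketch of that independence, using made-up random data and a fixed seed only so the run is repeatable. It compares a whole batch against processing each record on its own, and a whole layer against a single neuron on its own.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed, only for reproducibility
x = rng.random((4, 3))             # 4 records of 3 variables (made-up data)
w = rng.random((3, 2))             # a layer of 2 neurons

batch = x.dot(w)                                   # all records at once
one_by_one = np.vstack([row.dot(w) for row in x])  # each record on its own
print(np.allclose(batch, one_by_one))

# Each neuron (a column of w) is also computed independently of the others
first_neuron_only = x.dot(w[:, [0]])
print(np.allclose(batch[:, [0]], first_neuron_only))
```

Both comparisons should print `True`: neither records nor neurons leak into each other within a layer.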

##### Matrices are fun

Here is a trivial but familiar example. You can see that each neuron (column) does its own thing. Change one of the weight matrix's elements to see the effect on the output. The five neurons' outputs are the five elements in the output array.

In [2]:
x = np.array([[1, 2, 3, 4, 5]])
weight = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 1]])
x.dot(weight)

array([[1, 2, 3, 4, 5]])

I will give you my own example of using the neurons' synapses to play with the input data. I can perform operations on the inputs separately for each neuron. Look at each weight column vertically.

In [3]:
x = np.array([[1, 2, 3, 4, 5]])
weight = np.array([[-1,  0,  0,  0,  0],
                   [ 0,  2,  0,  0,  0],
                   [ 2,  0,  1, -1,  0],
                   [ 0,  0,  0,  0, -1],
                   [ 0,  0,  0,  1,  1]])
x.dot(weight)

# I could also just flip the identity matrix horizontally to do this

array([[5, 4, 3, 2, 1]])

Matrix algebra allows the NN to perform arithmetic.
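The shortcut mentioned in the cell's comment can be sketched with `np.fliplr`, which reverses the column order of a matrix. Applied to the identity matrix, it puts the ones on the anti-diagonal, so each neuron picks out the mirror-image input.

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5]])
flipped = np.fliplr(np.eye(5, dtype=int))  # ones on the anti-diagonal
print(flipped)
print(x.dot(flipped))                      # [[5 4 3 2 1]]
```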

##### The non-linear mixing

Data is not mixed within a layer, but it is mixed between them. Neural networks get their power from the interactions of their hidden layers.

When you take a single-layer model like linear or logistic regression and add layers to it, you get the extra power.
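One way to see where the extra power comes from is to remove the activation function and watch the layers collapse. The sketch below uses made-up shapes and random data. It shows that two layers with no activation between them are exactly one linear layer with merged weights, and that putting a sigmoid between them breaks that equivalence.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)     # fixed seed, only for reproducibility
x = rng.random((10, 5))
w1, b1 = rng.random((5, 3)), rng.random((1, 3))
w2, b2 = rng.random((3, 2)), rng.random((1, 2))

# Two layers with no activation in between...
stacked_linear = (x.dot(w1) + b1).dot(w2) + b2

# ...are exactly one linear layer with merged weights and biases
w_merged = w1.dot(w2)
b_merged = b1.dot(w2) + b2
single_linear = x.dot(w_merged) + b_merged
linear_layers_collapse = np.allclose(stacked_linear, single_linear)

# With a sigmoid in between, the merged layer no longer reproduces the output
stacked_nonlinear = sigmoid(x.dot(w1) + b1).dot(w2) + b2
nonlinear_layers_collapse = np.allclose(stacked_nonlinear, single_linear)

print(linear_layers_collapse, nonlinear_layers_collapse)
```

So it is the combination of stacking and a non-linear activation that matters, not the stacking alone.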

The feed-forward is a mixing of data over many layers of neurons. Each layer expands data into multiple copies and its neurons compress it back into a few outputs. In the diagram above you see 3 units of inputs expanded into 6 synapse signals and then collapsed into 2 output signals. The power of the neural network comes from the fact that the next layer *then copies* these 2 output signals to each of its own neurons, so everything affects everything.

![Re-using footage](W5_SimpleNeurons.png "Re-using footage")

See how many arrows there are going from layer to layer? Information expands and contracts, and network elements are generously connected together.

Or in other words, feed-forward is like a decision reached by successive committees. The neurons in a layer form a committee that looks at data together, performs analysis, and then summarizes its findings into a small report. Higher committees then analyze this report at a higher level, and so on. The final output layer makes a decision based on the accumulated wisdom of the executive summary it receives: it outputs a single value between 0 and 1.

All this mixing allows the neural network to work with very complicated data.

#### Hidden layers feed-forwarding

With all that in mind, this is the feed-forward:

1. $z_1 = X \times W_1 + B_1$
2. $a_1 = \sigma(z_1)$
3. $z_2 = a_1 \times W_2 + B_2$
4. $a_2 = \sigma(z_2)$
5. $z_{output} = a_2 \times W_{output} + B_{output}$
6. $a_{output} = \sigma(z_{output})$
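The six steps above repeat the same two operations per layer, so they can be sketched as a loop. The `feed_forward` helper and the `layers` list are hypothetical names of mine, and the shapes and random values are only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feed_forward(x, layers):
    """Pass x through a list of (weights, biases) pairs, sigmoid after each."""
    a = x
    for w, b in layers:
        z = a.dot(w) + b   # steps 1, 3, 5: weighted sum plus bias
        a = sigmoid(z)     # steps 2, 4, 6: activation
    return a

rng = np.random.default_rng(0)     # fixed seed, only for reproducibility
layers = [
    (rng.random((5, 3)), rng.random((1, 3))),  # W_1, B_1
    (rng.random((3, 2)), rng.random((1, 2))),  # W_2, B_2
    (rng.random((2, 1)), rng.random((1, 1))),  # W_output, B_output
]
x = rng.random((10, 5))
a_out = feed_forward(x, layers)
print(a_out.shape)
```

Each record comes out as a single value between 0 and 1, whatever the number of hidden layers in the list.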

By way of comparison, here is Andrew Ng's notation.

1. $a^{(1)} = x$
2. $z^{(2)} = \theta^{(1)} a^{(1)}$
3. $a^{(2)} = g(z^{(2)})$
4. $z^{(3)} = \theta^{(2)} a^{(2)}$
5. $a^{(3)} = g(z^{(3)})$
6. $z^{(4)} = \theta^{(3)} a^{(3)}$
7. $a^{(4)} = h_{\theta}(x) = g(z^{(4)})$
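The two notations compute the same thing once you account for orientation. The sketch below assumes, as in Ng's course, that inputs are column vectors with a bias unit of 1 prepended, so that $\theta$ has shape $out \times (in + 1)$ with the biases in its first column. The variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed, only for reproducibility
x = rng.random((1, 5))             # one record as a row vector
w1 = rng.random((5, 3))            # in x out
b1 = rng.random((1, 3))            # 1 x out

# Row-vector convention used in these notes
z_rows = x.dot(w1) + b1                         # shape (1, 3)

# Column-vector convention: theta is out x (in + 1), biases in column 0
theta1 = np.hstack([b1.T, w1.T])                # shape (3, 6)
a_input = np.vstack([np.ones((1, 1)), x.T])     # shape (6, 1), bias unit on top
z_cols = theta1.dot(a_input)                    # shape (3, 1)

print(np.allclose(z_rows, z_cols.T))
```

The result is the same up to a transpose, so formulas can be carried from one notation to the other by transposing and folding the bias into the weights.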

Let's generate some data. Thanks to the properties of matrix multiplication, I can have 10 rows of input data and these will be processed fully separately, yielding 10 rows of output data.

In [4]:
x = np.random.random((10,5)) # Ten records of 5 variables
b1 = np.random.random((1,3)) # Bias: 1 x layer_1_size
w1 = np.random.random((5,3)) # Weight: input_vars x layer_1_size
b2 = np.random.random((1,2)) # Bias: 1 x layer_2_size
w2 = np.random.random((3,2)) # Weight: layer_1_size x layer_2_size
b_out = np.random.random((1,1)) # Bias: 1 x output_size
w_out = np.random.random((2,1)) # Weight: layer_2_size x output_size

Here then are the feed-forward results.

In [5]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

# First hidden layer, three neurons each give an output
z1 = x.dot(w1) + b1
a1 = sigmoid(z1)
print(a1)

[[ 0.85964668  0.92911942  0.96598944]
 [ 0.74989039  0.83492003  0.894504  ]
 [ 0.82784755  0.92345775  0.94699802]
 [ 0.79368997  0.87538952  0.91478411]
 [ 0.8134559   0.87357528  0.93745025]
 [ 0.84377582  0.91963116  0.95285849]
 [ 0.82065815  0.88618434  0.9201231 ]
 [ 0.75517447  0.85887612  0.86282307]
 [ 0.7390089   0.77372     0.8186372 ]
 [ 0.72950202  0.82478689  0.88013197]]


In [6]:
# Second hidden layer, two neurons each give an output
z2 = b2 + a1.dot(w2)
a2 = sigmoid(z2)
print(a2)

[[ 0.81659826  0.91946945]
 [ 0.80826541  0.90506906]
 [ 0.81567922  0.91628225]
 [ 0.81173275  0.91072747]
 [ 0.81206762  0.9132373 ]
 [ 0.81565776  0.91750589]
 [ 0.81281659  0.91323257]
 [ 0.80962827  0.90450471]
 [ 0.80306589  0.89757058]
 [ 0.8071836   0.90235338]]


In [7]:
# Output layer: one output for each input record
z_out = b_out + a2.dot(w_out)
a_out = sigmoid(z_out)
print(a_out)

[[ 0.85298277]
 [ 0.85121437]
 [ 0.85261462]
 [ 0.85191587]
 [ 0.85219783]
 [ 0.85274726]
 [ 0.85221495]
 [ 0.85118472]
 [ 0.85026614]
 [ 0.85089053]]


### Back propagation

Backprop is how the network works out, after a feed-forward pass, how much each weight and bias contributed to the prediction error.

#### Intuition

If the feed-forward is data pushed all the way forward to the outputs, then back-propagation is errors trickling back from the outputs all the way to the very first hidden layer.

Backprop is a way of using gradient descent on neural networks of multiple layers. It isn't necessary with a linear or logistic regression because these are simple single-layer networks. Back-propagation lets you apply gradient descent at every layer, not just at the output.

It all starts at the output. Here there is a clear link between the choice of parameters (weights and biases) and the output error. The approach here is the same as simple gradient descent.
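Here is a minimal sketch of that output-layer step. These notes have not fixed a loss function at this point, so the mean binary cross-entropy below is my assumption, as are the made-up labels `y`, the stand-in activations `a2`, and the learning rate. With a sigmoid output and this loss, the gradient with respect to $z_{output}$ simplifies to $a_{output} - y$ (divided by the number of records).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(a_out, y):
    # Assumed loss: mean binary cross-entropy
    return -np.mean(y * np.log(a_out) + (1 - y) * np.log(1 - a_out))

rng = np.random.default_rng(0)     # fixed seed, only for reproducibility
a2 = rng.random((10, 2))           # stand-in for the last hidden layer's output
y = rng.integers(0, 2, size=(10, 1)).astype(float)  # made-up 0/1 labels
w_out = rng.random((2, 1))
b_out = rng.random((1, 1))

# Feed-forward through the output layer
a_out = sigmoid(a2.dot(w_out) + b_out)

# Gradient of the loss with respect to z_out, then to the parameters
delta_out = (a_out - y) / len(y)                    # shape (10, 1)
grad_w_out = a2.T.dot(delta_out)                    # shape (2, 1), same as w_out
grad_b_out = delta_out.sum(axis=0, keepdims=True)   # shape (1, 1), same as b_out

# One plain gradient descent step (learning rate chosen arbitrarily)
learning_rate = 0.5
w_new = w_out - learning_rate * grad_w_out
b_new = b_out - learning_rate * grad_b_out
a_new = sigmoid(a2.dot(w_new) + b_new)
print(loss(a_out, y), loss(a_new, y))
```

The second printed loss should be lower than the first. This is the same update you would write for a logistic regression whose inputs are `a2`.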

At the layer preceding the output, we'll call it $l_2$, there is an extra step. What is the link between $l_2$ weights and biases and the output error? It has multiple steps: $l_2$ has a direct effect on the output layer's data, and the output layer's data has a direct effect on what the model decides to output. It takes two steps to get back to the end.
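Written with the chain rule, the two cases look like this. The symbol $E$ for the output error is my own addition, and the products are in informal scalar form, ignoring matrix shapes and transposes.

Output layer, one step from the error:

$\frac{\partial E}{\partial W_{output}} = \frac{\partial E}{\partial a_{output}} \cdot \frac{\partial a_{output}}{\partial z_{output}} \cdot \frac{\partial z_{output}}{\partial W_{output}}$

Layer $l_2$, two steps from the error:

$\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial a_{output}} \cdot \frac{\partial a_{output}}{\partial z_{output}} \cdot \frac{\partial z_{output}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2}$

The first two factors are shared between the two formulas, which is why the error computed at the output can be reused on the way back. Since $z_{output} = a_2 \times W_{output} + B_{output}$, the middle factor $\frac{\partial z_{output}}{\partial a_2}$ is $W_{output}$ itself.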

In other words, the output layer is the boss and it is directly responsible for the model's error. If the output layer changes its behaviour, it can directly improve its accuracy. It's the easiest to train.

The hidden layers are not directly responsible for the model's error; however, they are responsible for providing the output layer accurate analyses of the model's input data. Knowing their boss, they have an idea of how to change their computations so that the big cheese makes more informed decisions. Their gradient formulas in fact depend on the output layer's weights (the boss's personality, you might say).
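The dependence on the boss's weights can be sketched in code, together with a one-weight gradient check. As before, the mean binary cross-entropy loss, the made-up labels `y`, and the stand-in activations `a1` are my assumptions, and `forward` is a hypothetical helper that covers only $l_2$ and the output layer.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(a1, w2, b2, w_out, b_out):
    # Feed-forward through l_2 and the output layer only
    a2 = sigmoid(a1.dot(w2) + b2)
    a_out = sigmoid(a2.dot(w_out) + b_out)
    return a2, a_out

def loss(a_out, y):
    # Assumed loss: mean binary cross-entropy
    return -np.mean(y * np.log(a_out) + (1 - y) * np.log(1 - a_out))

rng = np.random.default_rng(0)     # fixed seed, only for reproducibility
a1 = rng.random((10, 3))           # stand-in for the first hidden layer's output
y = rng.integers(0, 2, size=(10, 1)).astype(float)  # made-up 0/1 labels
w2, b2 = rng.random((3, 2)), rng.random((1, 2))
w_out, b_out = rng.random((2, 1)), rng.random((1, 1))

a2, a_out = forward(a1, w2, b2, w_out, b_out)

# Error signal at the output layer
delta_out = (a_out - y) / len(y)                    # shape (10, 1)
# Error signal at l_2: pushed back through w_out, then through the sigmoid slope
delta_2 = delta_out.dot(w_out.T) * a2 * (1 - a2)    # shape (10, 2)
grad_w2 = a1.T.dot(delta_2)                         # shape (3, 2), same as w2

# Gradient check on one weight with a centred finite difference
eps = 1e-6
w2_plus, w2_minus = w2.copy(), w2.copy()
w2_plus[0, 0] += eps
w2_minus[0, 0] -= eps
numeric = (loss(forward(a1, w2_plus, b2, w_out, b_out)[1], y)
           - loss(forward(a1, w2_minus, b2, w_out, b_out)[1], y)) / (2 * eps)
print(grad_w2[0, 0], numeric)
```

The two printed numbers should agree to several decimal places. Note that `w_out` appears in the formula for `delta_2`: the hidden layer cannot compute its own gradient without knowing the output layer's weights.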