# Week 5, Back-propagation

## Further reading

* Elements of Statistical Learning: Chapter 11, Neural Networks ([get a copy](https://web.stanford.edu/~hastie/ElemStatLearn/))
* Computer Age Statistical Inference: Chapter 18, Neural Networks and Deep Learning (p. 351) ([get a copy](https://web.stanford.edu/~hastie/CASI/))
* [Peter's Notes](http://peterroelants.github.io/posts/neural_network_implementation_part01/) are a bit mathy and specific, but I've found them helpful when confused

## Code

### Feed-forward basics

#### Numpy importing

To get started, we'll need to import `numpy` to deal with all the matrices involved. Each NN library you use will have its own way of handling matrices. They tend to be similar and might even just work with `numpy` matrices seamlessly.

In [10]:
import numpy as np

#### Values a-flowing

To feed-forward data through a neural network is to pass data through the network's weights and activation functions to create an output. The feed-forward technically begins with the input layer, a copy of the input data; however, the network's actual activity begins at the first hidden layer, when the input signal is passed through synapses (weights), and adjusted (bias) and transformed (activation function) by the neuron. Here is what it does:

Signal $X$ passing through weights $W_1$: $X W_1$

Signal adjusted by neuron bias $B_1$: $X W_1 + B_1 = z_1$

Signal shaped by activation function $\sigma$: $a_1 = \sigma(z_1)$
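The three steps above can be sketched for a single layer as follows. The numbers are made up for illustration (one record of 3 variables, a layer of 2 neurons), and $\sigma$ is assumed to be the logistic sigmoid used later in these notes.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

# Illustrative values only, not taken from the notes
X = np.array([[1.0, 2.0, 3.0]])       # 1 record x 3 variables
W1 = np.array([[ 0.5, -1.0],
               [ 0.0,  1.0],
               [-0.5,  0.0]])         # 3 inputs x 2 neurons
B1 = np.array([[1.0, 1.0]])           # 1 x 2, one bias per neuron

weighted = X.dot(W1)   # signal passing through the weights
z1 = weighted + B1     # signal adjusted by the bias
a1 = sigmoid(z1)       # signal shaped by the activation function

print(weighted)        # [[-1.  1.]]
print(z1)              # [[0. 2.]]
print(a1.shape)        # (1, 2)
```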

Let's look at that more closely. The $W_1$ (weights) are the "synapses" of the "neurons": they are the connections the neurons use to pull in data. These synapses can be increased to amplify an incoming variable, set to zero to ignore one, or made negative to invert the inbound signal. During training they are tuned by the neurons to help the NN minimize prediction error.

* Synapses strengthen or weaken incoming variables with weights
* Biases adjust the sum of the weighted signals
* Neurons put all of it together and transform it with an activation function
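As a small sketch of the first two bullets, here are three hypothetical neurons that look only at the first input variable: one amplifies it, one ignores it, and one inverts it. The bias on the second neuron then shifts its (empty) weighted sum. All values are invented for illustration.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0]])   # one record, three incoming variables

# One column per neuron; only the first variable gets a non-zero weight
weight = np.array([[3.0, 0.0, -1.0],   # amplify / ignore / invert variable 1
                   [0.0, 0.0,  0.0],
                   [0.0, 0.0,  0.0]])
bias = np.array([[0.0, 10.0, 0.0]])    # shifts the second neuron's sum

out = x.dot(weight) + bias
print(out)   # [[ 3. 10. -1.]]
```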

The weights are matrices with dimensions $in \times out$, where $in$ is the size of the data coming in and $out$ is the size of the data going out. $out$ is the number of neurons in the layer, and $in$ is the number of values each of these neurons is fed during feed-forward. The biases are $1 \times out$, one bias for each neuron.

![Dimensions](W5_SimpleNeurons.png "Dimensions")

Above you can see that the input data has size 3 and that each neuron has 3 synapses ($2 \times 3 = 6$ in total). There are 2 neurons in the layer and they produce 2 outputs, one per neuron. There are also 2 biases, 1 for each of the 2 neurons.

Each layer has only two quantities: how much comes in and how much goes out.
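The shapes in the diagram can be checked directly. This is a minimal sketch: the batch of 4 records is an arbitrary choice, and the zeros and ones are placeholders rather than meaningful weights.

```python
import numpy as np

n_in, n_out = 3, 2             # sizes taken from the diagram
X = np.ones((4, n_in))         # 4 records (arbitrary), 3 variables each
W = np.zeros((n_in, n_out))    # in x out: 3 * 2 = 6 synapses
B = np.zeros((1, n_out))       # 1 x out: one bias per neuron

Z = X.dot(W) + B               # B is broadcast across the 4 rows
print(W.size, B.size, Z.shape) # 6 2 (4, 2)
```

Whatever the number of records, the layer itself is fully described by `n_in` and `n_out`.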

#### Neurons a-working

The trick of the neurons is that they each get their own copy of the data to work on. This trick comes from the way matrix multiplication is performed: rows of the input (records) don't mix with other rows, and columns of the weights (neurons) don't mix with other columns.

Here is a trivial but familiar example. You can see that each neuron (column) does its own thing. Change one of the weight matrix's elements to see the effect on the output. The five neurons' outputs are the five elements in the output array.

In [11]:
x = np.array([[1, 2, 3, 4, 5]])
weight = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 1]])
x.dot(weight)

array([[1, 2, 3, 4, 5]])

I will give you my own example of using the neurons' synapses to play with the input data. I can perform operations on the inputs separately for each neuron. Look at each weight column vertically.

In [12]:
x = np.array([[1, 2, 3, 4, 5]])
weight = np.array([[-1,  0,  0,  0,  0],
                   [ 0,  2,  0,  0,  0],
                   [ 2,  0,  1, -1,  0],
                   [ 0,  0,  0,  0, -1],
                   [ 0,  0,  0,  1,  1]])
x.dot(weight)

array([[5, 4, 3, 2, 1]])
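The independence of the columns can also be checked numerically. The sketch below uses random values (seeded only for repeatability) and two checks of my own: computing each neuron's output from its column alone, and zeroing one neuron's weights to see which outputs move.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily
x = rng.random((1, 5))           # one record, five variables
weight = rng.random((5, 4))      # four neurons, one per column

full = x.dot(weight)

# Each neuron's output computed from its own column only
per_neuron = np.hstack([x.dot(weight[:, [j]]) for j in range(weight.shape[1])])
same = np.allclose(full, per_neuron)

# Zero out the third neuron's weights: only the third output changes
changed = weight.copy()
changed[:, 2] = 0.0
unchanged = np.isclose(x.dot(changed), full)

print(same)        # True
print(unchanged)   # [[ True  True False  True]]
```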

The feed-forward is then a great mixing of data over many layers of neurons. Each layer expands data into multiple copies and its neurons compress it back into a few outputs. In the diagram above you see 3 units of inputs expanded into 6 synapse signals and then collapsed into 2 output signals. The power of the neural network comes from the fact that the next layer *then copies* these 2 output signals to each of its own neurons, so everything affects everything.

Or in other words, feed-forward is like a decision reached by successive committees. The neurons in a layer form a committee that looks at data together, performs analysis, and then summarizes its findings into a condensed form. Higher committees then analyze this report at a higher level, and so on, until the final output layer makes a decision based on the accumulated wisdom so far: it outputs a single value between 0 and 1.

#### Hidden layers feed-forwarding

With all that in mind, this is the feed-forward:

$$z_1 = B_1 + X W_1$$

$$a_1 = \sigma(z_1)$$

$$z_2 = B_2 + a_1 W_2$$

$$a_2 = \sigma(z_2)$$

$$z_{output} = B_{output} + a_2 W_{output}$$

$$a_{output} = \sigma(z_{output})$$

Let's generate some data.

In [13]:
X = np.random.random((10,5)) # Ten records of 5 variables
b1 = np.random.random((1,3)) # Bias: 1 x layer_1_size
w1 = np.random.random((5,3)) # Weight: input_vars x layer_1_size
b2 = np.random.random((1,2)) # Bias: 1 x layer_2_size
w2 = np.random.random((3,2)) # Weight: layer_1_size x layer_2_size
b_out = np.random.random((1,1)) # Bias: 1 x output_size
w_out = np.random.random((2,1)) # Weight: layer_2_size x output_size

Here then are the feed-forward results.

In [14]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

# First hidden layer, three neurons each give an output
z1 = b1 + X.dot(w1)
a1 = sigmoid(z1)
print(a1)

[[ 0.64227262  0.60985251  0.78711892]
 [ 0.70114371  0.68118057  0.85642685]
 [ 0.91000195  0.85217972  0.88737508]
 [ 0.86233322  0.8824174   0.89839849]
 [ 0.87061492  0.81245418  0.92443492]
 [ 0.7876347   0.68648412  0.871667  ]
 [ 0.83796698  0.73418675  0.84921294]
 [ 0.88872636  0.89106346  0.9514142 ]
 [ 0.876029    0.87653442  0.93500203]
 [ 0.82500859  0.75478229  0.90458187]]


In [15]:
# Second hidden layer, two neurons each give an output
z2 = b2 + a1.dot(w2)
a2 = sigmoid(z2)
print(a2)

[[ 0.85911287  0.84150666]
 [ 0.87068112  0.85267735]
 [ 0.88492     0.87719358]
 [ 0.88616093  0.87476691]
 [ 0.88586449  0.87422142]
 [ 0.87421201  0.86118636]
 [ 0.8751496   0.86618194]
 [ 0.8915369   0.87902956]
 [ 0.88936122  0.87700965]
 [ 0.88089184  0.86792231]]


In [16]:
# Output layer: one output for each input record
z_out = b_out + a2.dot(w_out)
a_out = sigmoid(z_out)
print(a_out)

[[ 0.80715585]
 [ 0.80807655]
 [ 0.80968044]
 [ 0.80961853]
 [ 0.80958369]
 [ 0.80858287]
 [ 0.80883824]
 [ 0.81000174]
 [ 0.80983374]
 [ 0.80912454]]
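Since every layer repeats the same two lines, the three cells above can be folded into a loop. This is one possible sketch, not the only way to organise it: `feed_forward` and the `layers` list of `(bias, weight)` pairs are names introduced here for illustration, and the seeded generator means the numbers will differ from the outputs printed above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feed_forward(X, layers):
    """Pass X through a list of (bias, weight) pairs; return each layer's activation."""
    activations = []
    a = X
    for b, w in layers:
        z = b + a.dot(w)   # weighted signal adjusted by the bias
        a = sigmoid(z)     # shaped by the activation function
        activations.append(a)
    return activations

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
X = rng.random((10, 5))          # ten records of 5 variables
layers = [(rng.random((1, 3)), rng.random((5, 3))),   # first hidden layer
          (rng.random((1, 2)), rng.random((3, 2))),   # second hidden layer
          (rng.random((1, 1)), rng.random((2, 1)))]   # output layer

a1, a2, a_out = feed_forward(X, layers)
print(a1.shape, a2.shape, a_out.shape)   # (10, 3) (10, 2) (10, 1)
```

Each activation has one row per input record and one column per neuron in that layer, matching the shapes of the printed arrays above.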
