# Week 5, Back-propagation

## Further reading

* Elements of Statistical Learning: Chapter 11, Neural Networks ([get a copy](https://web.stanford.edu/~hastie/ElemStatLearn/))
* Computer Age Statistical Inference: Chapter 18, Neural Networks and Deep Learning (p. 351) ([get a copy](https://web.stanford.edu/~hastie/CASI/))
* [Peter's Notes](http://peterroelants.github.io/posts/neural_network_implementation_part01/) are a bit mathy and specific, but I've found them helpful when confused

## Code

### Feed-forward basics

#### Numpy importing

To get started, we'll need to import `numpy` to deal with all the matrices involved. Each NN library you use will have its own way of handling matrices. They tend to be similar, and some even work with `numpy` arrays seamlessly.

In [40]:
import numpy as np

#### Values flowing

The feed-forward of a neural network runs input through the network's weights and activation functions to produce an output. The first "layer" of the NN is called the input layer, and it's merely a copy of the input data. The first active layer is the first of the NN's hidden layers. Here is what it does:

$$z_1 = B_1 + x W_1$$
$$a_1 = \sigma(z_1)$$

Let's look at that more closely. $B_1$ (biases) and $W_1$ (weights) are the "neurons" of the first layer. The convention is that $W_1$ has dimensions $in \times out$ and $B_1$ has dimensions $1 \times out$: $in$ is the number of input variables and $out$ is the number of neurons emitting output signals. Each column of the biases and weights is one neuron that individually performs some work on the incoming data.

The crucial part of the neural network is that $x$ is matrix-multiplied by $W_1$, which is really $1 \times in$ by $in \times out$, giving a $1 \times out$ matrix. Because of how matrix multiplication works, each neuron "works" on the whole record independently and outputs its own value.
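
The shape arithmetic can be checked directly. This is a small sketch with made-up sizes (`n_in = 5`, `n_out = 3`) and hypothetical names `x_row` and `w_demo`; it also checks that each output value depends only on its own column of the weights.

```python
import numpy as np

n_in, n_out = 5, 3                    # illustrative sizes, chosen for this sketch
rng = np.random.default_rng(0)        # seeded so the sketch is repeatable
x_row = rng.random((1, n_in))         # one record: 1 x in
w_demo = rng.random((n_in, n_out))    # weights: in x out

out = x_row.dot(w_demo)
print(out.shape)                      # (1, 3), i.e. 1 x out

# Neuron j only ever uses column j of the weights
for j in range(n_out):
    assert np.isclose(out[0, j], x_row[0].dot(w_demo[:, j]))
```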

#### Neurons working

The data is attended to by multiple neurons, each able to perform its own processing of the data.

Here is a trivial but familiar example. You can see that each neuron (column) does its own thing. Change one element of `weight` to see the effect on the output.

In [41]:
x = np.array([[1, 2, 3, 4, 5]])
weight = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0],
                   [0, 0, 0, 0, 1]])
x.dot(weight)

array([[1, 2, 3, 4, 5]])
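
As one way to try that experiment, the sketch below rebuilds the identity weights and changes a single element. The name `tweaked` and the value 10 are arbitrary choices for illustration.

```python
import numpy as np

x = np.array([[1, 2, 3, 4, 5]])
tweaked = np.eye(5, dtype=int)   # same identity weights as the example
tweaked[0, 1] = 10               # the first input now also feeds the second neuron

result = x.dot(tweaked)
print(result)                    # only the second output changes: 1*10 + 2*1 = 12
```

Only the column that was edited produces a different value; the other four neurons are unaffected.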

The feed-forward is then a great mixing of data among layers of neurons, finally creating a network output. This "densely connected" NN has each neuron working separately from its neighbors but seeing the whole of the previous layer's data. These plentiful connections mean that the NN can model much more complicated functions than any single neuron could. Although each layer is essentially linear (a matrix product plus a bias, passed through a simple activation function), great power comes from the layers *interacting*.
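
One way to see why the activation between layers matters for that interaction: without it, two stacked layers collapse into a single linear layer. The sketch below illustrates this with hypothetical names and sizes; it is a side experiment, not part of the network built in this section.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)        # seeded so the sketch is repeatable
x_demo = rng.random((4, 5))           # four records of 5 variables
w_a, b_a = rng.random((5, 3)), rng.random((1, 3))
w_b, b_b = rng.random((3, 2)), rng.random((1, 2))

# Two layers with no activation in between...
stacked = b_b + (b_a + x_demo.dot(w_a)).dot(w_b)
# ...are the same as one layer with merged weights and biases
merged = (b_b + b_a.dot(w_b)) + x_demo.dot(w_a.dot(w_b))
print(np.allclose(stacked, merged))        # True

# With a sigmoid in between, the merged layer no longer matches
with_sigmoid = b_b + sigmoid(b_a + x_demo.dot(w_a)).dot(w_b)
print(np.allclose(with_sigmoid, merged))   # False
```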

Or in other words, feed-forward is like a decision reached by successive committees. Each neuron of each committee works on a group report sent to its superior committee, which works at a "higher" level further removed from the raw data. The big cheese at the output layer summarizes everything into a value from 0 to 1.

#### Hidden layers a-feed-forwarding

With all that in mind, this is the feed-forward:

$$z_1 = B_1 + X W_1$$
$$a_1 = \sigma(z_1)$$
$$z_2 = B_2 + a_1 W_2$$
$$a_2 = \sigma(z_2)$$
$$z_{output} = B_{output} + a_2 W_{output}$$
$$a_{output} = \sigma(z_{output})$$

Let's generate some data.

In [42]:
X = np.random.random((10,5)) # Ten records of 5 variables
b1 = np.random.random((1,3)) # 1 x layer_1_size
w1 = np.random.random((5,3)) # input_vars x layer_1_size
b2 = np.random.random((1,2)) # 1 x layer_2_size
w2 = np.random.random((3,2)) # layer_1_size x layer_2_size
b_out = np.random.random((1,1)) # 1 x output_size
w_out = np.random.random((2,1)) # layer_2_size x output_size

Here then are the feed-forward results.

In [43]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

# First hidden layer, three neurons each give an output
z1 = b1 + X.dot(w1)
a1 = sigmoid(z1)
print(a1)

[[ 0.89478841  0.89253446  0.80753584]
 [ 0.90960405  0.86890526  0.70819821]
 [ 0.9477195   0.89983057  0.88796524]
 [ 0.91068958  0.93866011  0.82139541]
 [ 0.85850324  0.86447675  0.78321093]
 [ 0.85971402  0.88829667  0.68667252]
 [ 0.88032779  0.8868804   0.76358102]
 [ 0.90843132  0.87460983  0.84149436]
 [ 0.89837268  0.87407415  0.85356932]
 [ 0.92133014  0.86259464  0.83790139]]


In [44]:
# Second hidden layer, two neurons each give an output
z2 = b2 + a1.dot(w2)
a2 = sigmoid(z2)
print(a2)

[[ 0.85199045  0.74286491]
 [ 0.83905824  0.73282678]
 [ 0.86341688  0.75287126]
 [ 0.85516863  0.74659035]
 [ 0.84718797  0.73797151]
 [ 0.83497427  0.72942731]
 [ 0.84583163  0.73780024]
 [ 0.85621194  0.74597947]
 [ 0.85737624  0.74672015]
 [ 0.85589858  0.74568201]]


In [45]:
# Output layer: one output for each input record
z_out = b_out + a2.dot(w_out)
a_out = sigmoid(z_out)
print(a_out)

[[ 0.8684594 ]
 [ 0.86674157]
 [ 0.86999788]
 [ 0.86891918]
 [ 0.8677859 ]
 [ 0.86618764]
 [ 0.86763444]
 [ 0.86901086]
 [ 0.8691588 ]
 [ 0.86896786]]
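
The three cells above repeat the same two steps, so they can be folded into a loop. The following is a minimal sketch, not a library API: the helper name `feed_forward` and the list of `(bias, weight)` pairs are assumptions made for illustration, and it uses freshly generated random parameters with the same shapes as before, so the values will differ from the outputs printed above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feed_forward(inputs, layers):
    """Apply a = sigmoid(b + a_prev . W) for each (b, W) pair, in order."""
    a = inputs
    for b, w in layers:
        a = sigmoid(b + a.dot(w))   # b (1 x out) broadcasts over every record
    return a

rng = np.random.default_rng(42)     # seeded so the sketch is repeatable
X_demo = rng.random((10, 5))        # ten records of 5 variables
layers = [
    (rng.random((1, 3)), rng.random((5, 3))),   # first hidden layer
    (rng.random((1, 2)), rng.random((3, 2))),   # second hidden layer
    (rng.random((1, 1)), rng.random((2, 1))),   # output layer
]

out_demo = feed_forward(X_demo, layers)
print(out_demo.shape)               # (10, 1): one output per input record
```

Keeping the layers in a list means adding or removing a hidden layer only changes the list, as long as each weight matrix has as many rows as the previous layer has neurons.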
