![tinhatben](tinhatben_svg.png)

# Neural Networks & Backpropagation: Part Two

This Jupyter notebook accompanies the
[tinhatben.com](https://tinhatben.com) article [Neural Networks & Backpropagation: Part Two]().

For more information on data science and machine learning go to
[tinhatben.com](https://www.tinhatben.com)

This Jupyter notebook is licensed under the Mozilla Public License 2.0. If a copy of the license was not provided with this notebook, it can be downloaded [here](https://www.mozilla.org/en-US/MPL/2.0/).

THIS NOTEBOOK IS PROVIDED UNDER THIS LICENSE ON AN “AS IS” BASIS, WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED, IMPLIED, OR STATUTORY, INCLUDING, WITHOUT LIMITATION, WARRANTIES THAT THE COVERED SOFTWARE IS FREE OF DEFECTS, MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE OR NON-INFRINGING

In [1]:
# Imports
from tinhatbenbranding import TINHATBEN_GRAY, TINHATBEN_YELLOW, add_tinhatbendotcom
from matplotlib import pyplot as plt
import numpy as np

%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 10

In [2]:
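def prettyprintarray(array, format="%d"):
    """Print the shape of a 2-D array, then its values as tab-separated rows."""
    # `format` is a printf-style specifier applied to every element
    rows, cols = array.shape
    msg = "Shape: %i x %i\n" % (rows, cols)
    for row in range(rows):
        for col in range(cols):
            msg += format % (array[row, col])
            msg += "\t"
        msg += "\n"

    print(msg)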

## Review of Part One: Forward Propagation

In Part One we established that, given the following inputs, we expect the outputs of our neural network to be:

| Input 1 | Input 2 | Output | 
|---------|:--------|:-------|
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 1 | 0 |

We will be constructing a neural network with the following topology and initial weight values.

<img src="net_all.png">

## Training Data
The training data is divided into inputs and targets: `x_train` holds the inputs and `y_train` holds the target outputs.

In [3]:
# Training data
x_train = np.array([
        [0, 0],
        [1, 0],   
        [0, 1],
        [1, 1],
    ])

y_train = np.array([
        [0],
        [1],
        [1],
        [0],
    ])

# Adding bias to the inputs
I = np.hstack((np.ones((x_train.shape[0], 1)),
              x_train))

## Cost Function
We need a cost function to measure the error in the network.  In the article we use the squared error function:

$$ E = \frac{1}{2}(y_{train} - a_o)^2$$

In [4]:
def cost_function(outputs, targets):
    return 0.5 * ((targets - outputs) ** 2)

We will also need the derivative of the cost function:

$$ \frac{\partial{E}}{\partial{a_o}} = a_o - y_{train}$$

In [5]:
def cost_function_prime(outputs, targets):
    return outputs - targets
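
As a worked example, take the first training example, whose target is $y_{train} = 0$, and the output $a_o = 0.5122$ that the network produces for it later in this notebook with the initial weights. The values below are rounded to four decimal places:

$$\begin{eqnarray}
E & = & \frac{1}{2}(0 - 0.5122)^2 \approx 0.1312 \\
\frac{\partial{E}}{\partial{a_o}} & = & 0.5122 - 0 = 0.5122
\end{eqnarray}$$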

## Sigmoid Function
This is the non-linearity (the activation function) used for the hidden and output layers in the network:

$$g(z) = \frac{1}{1 + e^{-z}}$$

In [6]:
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

We will also need the derivative of the sigmoid function for later use in backpropagation, so:

$$\frac{\partial{g}}{\partial{z}} = \sigma'(z) = g(z)(1 - g(z))$$

In [7]:
def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))
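
One way to gain confidence in the derivative is to compare it with a central finite difference. The sketch below is self-contained, so it repeats the two functions above; the step size `eps` and the sample points `z` are arbitrary choices made for illustration and are not taken from the article.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))


def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))


# Sample points and step size are assumptions made for this check
z = np.linspace(-5, 5, 11)
eps = 1e-5

# Central difference: (g(z + eps) - g(z - eps)) / (2 * eps)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_prime(z)

max_abs_diff = np.max(np.abs(numeric - analytic))
print("Largest difference between numeric and analytic: %.2e" % max_abs_diff)
```

If the analytic derivative is right, the two estimates should agree to many decimal places.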

## Initialised Weights

In [8]:
# Layer 1 weights
w1 = 0.1
w2 = 0.2
w3 = 0.3
w4 = 0.4
b1 = 0.5
b2 = 0.5

# Layer 2 weights
w5 = 0.01
w6 = 0.02
b3 = 0.03

# Representing the weights as a matrix
Weights_hidden = np.array([
        [b1, b2],        
        [w1, w2],
        [w3, w4],
    ])

Weights_output = np.array([
        [b3],
        [w5],
        [w6],
    ])

# Store the weights in a list for later use
Weights = [Weights_hidden, Weights_output]

## Forward Propagation
Given the weights and inputs, we can now compute the activations for each of the nodes in the network.  All training examples are computed at once using linear algebra:



In [9]:
# Hidden Layer activations
h_hidden = np.dot(I, Weights_hidden)
a_hidden = sigmoid(h_hidden)
print("Hidden layer activations")
print("a_hidden")
prettyprintarray(a_hidden, format="%0.3f")

Hidden layer activations
a_hidden
Shape: 4 x 2
0.622	0.622	
0.646	0.668	
0.690	0.711	
0.711	0.750	



In [10]:
# Output layer activations
# Insert the bias term
a_hidden_bias = np.hstack((np.ones((a_hidden.shape[0], 1)),
                a_hidden))
# hidden layer activations with bias
print("a_hidden_bias")
prettyprintarray(a_hidden_bias, format="%0.3f")
h_output = np.dot(a_hidden_bias, Weights_output)

print("h_output")
prettyprintarray(h_output, format="%0.3f")
a_output = sigmoid(h_output)
print("Network output")
print("a_output")
prettyprintarray(a_output, format="%0.4f")

a_hidden_bias
Shape: 4 x 3
1.000	0.622	0.622	
1.000	0.646	0.668	
1.000	0.690	0.711	
1.000	0.711	0.750	

h_output
Shape: 4 x 1
0.049	
0.050	
0.051	
0.052	

Network output
a_output
Shape: 4 x 1
0.5122	
0.5125	
0.5128	
0.5130	



## Computing backpropagation
We start at the output layer of the network and work backwards.  Remember that the intent of backpropagation is to determine how much each weight in the network contributes to the error, which gives us a means of changing the weights to reduce the error.

So at the output layer:

$$\begin{eqnarray}
\frac{\partial{E}}{\partial{h_o}} = \delta_o & = & (a_o - y_{train})\sigma'(h_o) \\ \delta_o & = & (a_o - y_{train})g(h_o)(1 - g(h_o))\\ \end{eqnarray}$$

So the gradient of the error with respect to the output layer weights is:

$$\frac{\partial{E}}{\partial{w_{output}}} = \delta_oa_{hidden}$$

In [11]:
# Output layer
# delta_o
delta_o = cost_function_prime(a_output, y_train) * sigmoid_prime(h_output)
print("delta_o")
prettyprintarray(delta_o, format="%0.3f")

# Change in error due to weights
# dE/dw_output
dE_dw_output = np.dot(delta_o.T, a_hidden_bias).T
print("dE/dw_output")
prettyprintarray(dE_dw_output, format="%0.3f")

delta_o
Shape: 4 x 1
0.128	
-0.122	
-0.122	
0.128	

dE/dw_output
Shape: 3 x 1
0.013	
0.008	
0.008	



Now we can determine the change in weights for the hidden layer:

$$\begin{eqnarray}\frac{\partial{E}}{\partial{h_{hidden}}} & = & \delta_h = \delta_ow_{output}\sigma'({h_{hidden}})\\\frac{\partial{E}}{\partial{w_{hidden}}} & = & \delta_hI\end{eqnarray}$$

In [12]:
# delta_h
# Need to strip out the biases in order to calculate the contribution to error at the activation units
delta_h = np.dot(Weights[1][1:], delta_o.T) * sigmoid_prime(h_hidden).T
print("delta_h")
prettyprintarray(delta_h, format="%0.3f")

# Change in error due to weights
# dE/dw hidden
dE_dw_hidden = np.dot(delta_h, I).T
print("dE/dw_hidden")
prettyprintarray(dE_dw_hidden, format="%0.6f")

delta_h
Shape: 2 x 4
0.000	-0.000	-0.000	0.000	
0.001	-0.001	-0.001	0.000	

dE/dw_hidden
Shape: 3 x 2
0.000025	0.000041	
-0.000015	-0.000060	
0.000003	-0.000020	



So now we have two matrices of $\frac{\partial{E}}{\partial{W}}$ with the correct size:

* we have 3 weights in the output layer (3 x 1 matrix); and
* 6 weights in the hidden layer (3 x 2 matrix)
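
A common sanity check for gradients like these is to compare them with a finite-difference estimate. The sketch below rebuilds the network from this notebook in a self-contained form and perturbs each weight in turn. The helper names (`add_bias`, `total_error`, `analytic_gradients`, `numerical_gradient`) and the step size `eps` are assumptions introduced for this example, and the error is summed over the four training examples.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))


def add_bias(a):
    # Prepend a column of ones, as done for I and a_hidden_bias above
    return np.hstack((np.ones((a.shape[0], 1)), a))


def total_error(w_hidden, w_output, x, y):
    # Forward pass followed by the squared error summed over all examples
    a_hidden = sigmoid(np.dot(add_bias(x), w_hidden))
    a_output = sigmoid(np.dot(add_bias(a_hidden), w_output))
    return np.sum(0.5 * (y - a_output) ** 2)


def analytic_gradients(w_hidden, w_output, x, y):
    # Same backpropagation maths as above; sigma'(h) is written as a * (1 - a)
    inputs = add_bias(x)
    a_hidden = sigmoid(np.dot(inputs, w_hidden))
    a_hidden_bias = add_bias(a_hidden)
    a_output = sigmoid(np.dot(a_hidden_bias, w_output))
    delta_o = (a_output - y) * a_output * (1 - a_output)
    delta_h = np.dot(delta_o, w_output[1:].T) * a_hidden * (1 - a_hidden)
    return np.dot(inputs.T, delta_h), np.dot(a_hidden_bias.T, delta_o)


def numerical_gradient(w, error_fn, eps=1e-5):
    # Central difference on one weight at a time; w is restored afterwards
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        original = w[idx]
        w[idx] = original + eps
        e_plus = error_fn()
        w[idx] = original - eps
        e_minus = error_fn()
        w[idx] = original
        grad[idx] = (e_plus - e_minus) / (2 * eps)
    return grad


x_train = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y_train = np.array([[0], [1], [1], [0]])
w_hidden = np.array([[0.5, 0.5], [0.1, 0.2], [0.3, 0.4]])
w_output = np.array([[0.03], [0.01], [0.02]])


def error_fn():
    return total_error(w_hidden, w_output, x_train, y_train)


grad_hidden, grad_output = analytic_gradients(w_hidden, w_output, x_train, y_train)
num_hidden = numerical_gradient(w_hidden, error_fn)
num_output = numerical_gradient(w_output, error_fn)

print("Hidden gradients agree:", np.allclose(grad_hidden, num_hidden, atol=1e-8))
print("Output gradients agree:", np.allclose(grad_output, num_output, atol=1e-8))
```

Finite differences are far too slow for training, but they are a handy test while developing a backpropagation implementation.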

### General backprop rule
No matter how many hidden layers there are, we can observe a general pattern in backpropagation, where $l$ is the current layer and $l+1$ is the next layer in the network (the layer calculated in the previous step of backprop):

$$\begin{eqnarray}\frac{\partial{E}}{\partial{h_l}} & = & \delta_l = \delta_{l+1}w_{l+1}\sigma'(h_l)\\ \frac{\partial{E}}{\partial{w_l}} & = & \delta_la_{l-1}\end{eqnarray}$$

Here $a_{l-1}$ is the activations of the previous layer or, in the case of the first hidden layer, the inputs to the network.
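
The general rule can be sketched as a loop over the layers, working from the output back to the inputs. This is a minimal illustration and not the article's implementation. It assumes every layer uses the sigmoid, that row 0 of each weight matrix holds the biases (as in this notebook), and that the error is summed over all training examples. The names `backprop` and `add_bias` are hypothetical helpers.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))


def add_bias(a):
    # Prepend a column of ones so row 0 of each weight matrix acts as the bias
    return np.hstack((np.ones((a.shape[0], 1)), a))


def backprop(weights, x, y):
    """Return a list with dE/dW for every weight matrix in `weights`."""
    # Forward pass: remember the bias-augmented input to every layer
    layer_inputs = []
    a = x
    for w in weights:
        layer_inputs.append(add_bias(a))
        a = sigmoid(np.dot(layer_inputs[-1], w))

    # Output layer: delta = (a_o - y) * sigma'(h_o), with sigma'(h) = a * (1 - a)
    delta = (a - y) * a * (1 - a)

    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        # dE/dw_l = delta_l * a_(l-1), summed over the training examples
        grads[l] = np.dot(layer_inputs[l].T, delta)
        if l > 0:
            # Activations of the previous layer without the bias column
            a_prev = layer_inputs[l][:, 1:]
            # delta_(l-1) = delta_l * w_l * sigma'(h_(l-1)), biases stripped out
            delta = np.dot(delta, weights[l][1:].T) * a_prev * (1 - a_prev)
    return grads


x_train = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y_train = np.array([[0], [1], [1], [0]])
weights = [
    np.array([[0.5, 0.5], [0.1, 0.2], [0.3, 0.4]]),
    np.array([[0.03], [0.01], [0.02]]),
]

grads = backprop(weights, x_train, y_train)
for layer, grad in enumerate(grads):
    print("Layer %d gradient shape: %s" % (layer, grad.shape))
```

With the two weight matrices from this notebook, the returned gradients have the same 3 x 2 and 3 x 1 shapes as the matrices computed above. Adding another entry to `weights` extends the same loop to a deeper network.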

## Update The Weights
Now that we have determined how much each of the weights contributes to the error, we can update the weights, hopefully improving the predictions made by the network.

$$W \leftarrow W - \eta\frac{\partial{E}}{\partial{W}}$$

We also have a new term, $\eta$.  This is the **learning rate**, which defines the size of the step taken down the $\frac{\partial{E}}{\partial{W}}$ gradient or, in other terms, how much of the change to apply to the weights.  In neural networks this is a critical parameter.  If the learning rate is too small, the network will take a long time to train; conversely, if it is too large, the network may never converge due to instability in the weights.  If you are having issues with your network and are unsure about $\eta$, a rule of thumb:

> Reduce the learning rate
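
The effect of $\eta$ is easiest to see on a toy problem. The sketch below is not part of the network: it applies the same update rule to a hypothetical one-weight error function $E(w) = w^2$, whose gradient is $2w$, using three illustrative learning rates chosen for this example.

```python
def gradient_descent(eta, w=1.0, steps=20):
    """Minimise the toy error E(w) = w**2 (gradient 2 * w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return w


# Illustrative learning rates, not values from the article
w_small = gradient_descent(eta=0.001)  # too small: w has barely moved from 1.0
w_good = gradient_descent(eta=0.1)     # reasonable: w is close to the minimum at 0
w_large = gradient_descent(eta=1.1)    # too large: each step overshoots and |w| grows

print("eta=0.001 -> w = %f" % w_small)
print("eta=0.1   -> w = %f" % w_good)
print("eta=1.1   -> w = %f" % w_large)
```

Each update multiplies $w$ by $(1 - 2\eta)$, so the weight shrinks towards the minimum only while that factor stays between -1 and 1.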



In [13]:
# Define the learning rate
eta = 0.01

# Update the hidden and output weights
Weights_hidden -= (eta * dE_dw_hidden)
Weights_output -= (eta * dE_dw_output)

# The new weights
print("New hidden weights")
prettyprintarray(Weights_hidden, format="%0.6f")

print("New output weights")
prettyprintarray(Weights_output, format="%0.6f")

New hidden weights
Shape: 3 x 2
0.500000	0.500000	
0.100000	0.200001	
0.300000	0.400000	

New output weights
Shape: 3 x 1
0.029874	
0.009919	
0.019921	
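
A single update barely changes the weights, so in practice the forward pass, backpropagation and weight update are repeated many times. The loop below is a minimal, self-contained sketch of that repetition using the same network and initial weights. The values of `eta` and `epochs` are assumptions chosen for illustration, and the error is summed over the four training examples.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))


def add_bias(a):
    # Prepend a column of ones for the bias weights
    return np.hstack((np.ones((a.shape[0], 1)), a))


x_train = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y_train = np.array([[0], [1], [1], [0]])
weights_hidden = np.array([[0.5, 0.5], [0.1, 0.2], [0.3, 0.4]])
weights_output = np.array([[0.03], [0.01], [0.02]])

eta = 0.1      # assumed learning rate for this sketch
epochs = 2000  # assumed number of passes over the training data
errors = []

for epoch in range(epochs):
    # Forward propagation
    inputs = add_bias(x_train)
    a_hidden = sigmoid(np.dot(inputs, weights_hidden))
    a_hidden_bias = add_bias(a_hidden)
    a_output = sigmoid(np.dot(a_hidden_bias, weights_output))
    errors.append(np.sum(0.5 * (y_train - a_output) ** 2))

    # Backpropagation (both deltas use the weights from before the update)
    delta_o = (a_output - y_train) * a_output * (1 - a_output)
    delta_h = np.dot(delta_o, weights_output[1:].T) * a_hidden * (1 - a_hidden)

    # Weight update
    weights_output -= eta * np.dot(a_hidden_bias.T, delta_o)
    weights_hidden -= eta * np.dot(inputs.T, delta_h)

print("Error before training: %0.6f" % errors[0])
print("Error after training:  %0.6f" % errors[-1])
```

The error should fall, but a network this small can sit on a plateau for a long time on this problem. Expect to need many more epochs, or different initial weights, before the outputs approach the targets.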

