# A Simple Artificial Neural Network in 11 Lines

The code and explanations are adapted from a blog post by Andrew Trask: https://iamtrask.github.io/2015/07/12/basic-python-network/

In [1]:
import numpy as np

The following code implements an ANN and trains it to solve the XOR problem.

In [4]:
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
t = np.array([[0,1,1,0]]).T
W1 = 2*np.random.random((3,4)) - 1
W2 = 2*np.random.random((4,1)) - 1

print(W1)
for j in range(60000):
    h = 1/(1+np.exp(-(np.dot(X,W1))))
    y = 1/(1+np.exp(-(np.dot(h,W2))))
    y_delta = (t - y)*(y*(1-y))
    h_delta = np.dot(y_delta, W2.T) * (h * (1-h))
    W2 += np.dot(h.T, y_delta)
    W1 += np.dot(X.T, h_delta)

[[ 0.65698538 -0.64522731 -0.8600113   0.29002209]
 [-0.11121401 -0.76011788 -0.25390778  0.42609755]
 [ 0.4212753  -0.39859889  0.67257366  0.90874581]]


In [24]:
print(X)
print("***********")
print(y)
print("***********")

print(W1)
print("***********")

print(W2)

[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]
***********
[[0.00271669]
 [0.99641162]
 [0.99583942]
 [0.00519848]]
***********
[[-0.98897579  0.25363973 -0.3106236  -0.6600815 ]
 [-0.56374814  0.04757171 -0.95067411 -0.27628703]
 [-0.69173401  0.27664081 -0.7694167  -0.69348932]]
***********
[[ 0.54990431]
 [ 0.73516547]
 [-0.26938963]
 [-0.02559256]]


In [31]:
np.random.random((3,4)) -1

array([[-0.14238645, -0.98753243, -0.27991274, -0.99500197],
       [-0.93598625, -0.64517199, -0.34431362, -0.6127013 ],
       [-0.42466773, -0.80132635, -0.57919529, -0.82266348]])

Here is some code that passes an input through the network, to produce a prediction.

In [0]:
def predict(x):
  '''Produce a prediction for x, using the weights defined above.'''
  ### INPUT -> HIDDEN LAYER
  # Multiply x by the first set of weights
  z1 = np.dot(x,W1)
  # Activation function
  h = 1/(1+np.exp(-(z1)))

  ### HIDDEN LAYER -> OUTPUT
  # Multiply the hidden layer outputs by the next set of weights
  z2 = np.dot(h,W2)
  # Activation function
  y = 1/(1+np.exp(-(z2)))
  return y

In [0]:
# The final input is always 1 because it corresponds with the bias parameter
print(predict([0,0,1]))
print(predict([0,1,1]))
print(predict([1,0,1]))
print(predict([1,1,1]))

[0.00465699]
[0.99754167]
[0.99521407]
[0.00361097]


# Breakdown of the 11 lines:

In [4]:
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
W1 = 2*np.random.random((3,4)) - 1

print(X)
print(W1)

[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]
[[ 0.34477793 -0.8700625  -0.90486842  0.31404457]
 [ 0.93204318 -0.82879262 -0.63249823 -0.77777939]
 [ 0.20453021 -0.33803014  0.00851125 -0.14883615]]


In [5]:
X.dot(W1)

array([[ 0.20453021, -0.33803014,  0.00851125, -0.14883615],
       [ 1.13657339, -1.16682277, -0.62398698, -0.92661554],
       [ 0.54930814, -1.20809264, -0.89635716,  0.16520842],
       [ 1.48135132, -2.03688527, -1.5288554 , -0.61257097]])

## Lines 1-2:

```Python
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
t = np.array([[0,1,1,0]]).T
```

The training set `X` and the training labels `t`. The last input is always $1$, as it represents the bias parameter, so the inputs can be thought of as 2-dimensional.

For example, the expected output when the input is `[0,1]` is `1`.
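As a quick sanity check that these labels really are XOR, the sketch below recomputes them from the first two input columns using `np.logical_xor`. The variable name `xor` is introduced here for illustration only.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
t = np.array([[0, 1, 1, 0]]).T

# XOR of the first two columns; the third column is the constant bias input
xor = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

print(xor)        # [0 1 1 0]
print(t.ravel())  # [0 1 1 0]
```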

## Lines 3-4:

```Python
W1 = 2*np.random.random((3,4)) - 1
W2 = 2*np.random.random((4,1)) - 1
```

These are all the parameters of the ANN. The ANN has 2 input nodes (+1 bias), a hidden layer containing 4 hidden nodes, and 1 output node.

`W1` contains the weights that connect the 2+1 inputs to the 4 hidden nodes.

`W2` contains the weights that connect the 4 hidden nodes to the 1 output node.

The parameters are initialized uniformly at random in the interval $[-1, 1)$, which gives them mean $0$ and variance $\frac{1}{3}$.
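Since `np.random.random` draws uniformly from $[0, 1)$, the expression `2*np.random.random(...) - 1` should produce values in $[-1, 1)$. The sketch below checks this empirically by drawing many more values than the network needs; the name `samples` and the sample size are illustrative choices, not part of the original code.

```python
import numpy as np

# Same formula as W1 and W2, but with a large sample so the statistics are stable
samples = 2 * np.random.random(1_000_000) - 1

print(samples.min(), samples.max())  # both lie within [-1, 1)
print(samples.mean())                # close to 0
print(samples.var())                 # close to 1/3
```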

## Lines 5-7:

```Python
for j in range(60000):
  h = 1/(1+np.exp(-(np.dot(X,W1))))
  y = 1/(1+np.exp(-(np.dot(h,W2))))
```

The weights of the ANN are adjusted over the course of 60,000 passes over the training set.

`h` is the **output of the hidden layer.**

`y` is the **output in the final layer.**

`np.dot(X, W1)` is the matrix product of `X` and `W1`. For example:
>$\begin{pmatrix}
    0 & 0 & 1 \\
    0 & 1 & 1 \\
    1 & 0 & 1 \\
    1 & 1 & 1
  \end{pmatrix}
  \begin{pmatrix}
    w_{11} & w_{12} & w_{13} & w_{14}\\
    w_{21} & w_{22} & w_{23} & w_{24} \\
    w_{31} & w_{32} & w_{33} & w_{34}
  \end{pmatrix} = 
  \begin{pmatrix}
    z_{11} & z_{12} & z_{13} & z_{14} \\
    z_{21} & z_{22} & z_{23} & z_{24} \\
    z_{31} & z_{32} & z_{33} & z_{34} \\
    z_{41} & z_{42} & z_{43} & z_{44}
  \end{pmatrix}$

$w_{ij}$ is the weight connecting input $i$ to hidden node $j$.

$z_{ij}$ represents the weighted input to hidden node $j$ when the input is example number $i$ (corresponding to a row of `X`).

Each weighted input then has the activation function applied to it, which in this case is $\sigma(z) = \frac{1}{1 + e^{-z}}$. This gives the values in `h`.

This process is repeated to get the output layer, `y`, but this time the $4 \times 4$ matrix `h` is multiplied by the $4 \times 1$ matrix `W2`. This gives a $4 \times 1$ matrix, where each entry is the output for one input example. Ideally we want `y` to look like `t`: $\begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \end{pmatrix}$
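To make the shapes concrete, here is a minimal sketch of a single forward pass with freshly initialized weights. The helper `sigmoid` and the intermediate names `z1` and `z2` are introduced for readability; the original loop inlines these expressions.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, written out inline in the original code."""
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
W1 = 2 * np.random.random((3, 4)) - 1
W2 = 2 * np.random.random((4, 1)) - 1

z1 = np.dot(X, W1)   # weighted inputs to the hidden layer
h = sigmoid(z1)      # hidden layer outputs
z2 = np.dot(h, W2)   # weighted inputs to the output layer
y = sigmoid(z2)      # network outputs, one per training example

print(z1.shape, h.shape, y.shape)  # (4, 4) (4, 4) (4, 1)

# The first example is [0, 0, 1], so its weighted inputs are just the bias row of W1
print(np.allclose(z1[0], W1[2]))   # True
```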

## Line 8

```Python
y_delta = (t - y)*(y*(1-y))
```

Lines 8 and 9 are the backpropagation step.

The loss function is $L(y,t) = \frac{1}{2}(y-t)^2$ and the activation function is $y = \sigma(z) = \frac{1}{1+e^{-z}}$.

`y_delta` represents how the loss changes with respect to **the weighted inputs that go into the output layer**.

We know by the chain rule that $\frac{dL}{dz} = \frac{dL}{dy} \cdot \frac{dy}{dz}$.

Taking derivatives we find that:

$\frac{dL}{dy} = y-t$,

$\frac{dy}{dz} = \frac{d\sigma}{dz} = \sigma(z)\cdot(1-\sigma(z)) = y(1-y)$.

Therefore:

$\frac{dL}{dz} = (y-t)\cdot y \cdot (1-y)$.

But since we want to minimize the loss, we want to take a step in the direction of $-\frac{dL}{dz} = (t-y)\cdot y \cdot (1-y)$, which is exactly `y_delta`.
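One way to convince yourself of this derivative is to compare it against a finite-difference estimate for a single output node. The sketch below does that; the values of `z` and `t`, the step `eps`, and the helper names `sigmoid` and `loss` are arbitrary illustrative choices rather than anything from the original code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(z, t):
    """Squared-error loss as a function of the weighted input z."""
    return 0.5 * (sigmoid(z) - t) ** 2

z, t = 0.3, 1.0  # arbitrary weighted input and target
y = sigmoid(z)

# Analytic derivative from the chain rule: dL/dz = (y - t) * y * (1 - y)
analytic = (y - t) * y * (1 - y)

# Central finite-difference estimate of the same derivative
eps = 1e-6
numeric = (loss(z + eps, t) - loss(z - eps, t)) / (2 * eps)

# y_delta from the network code is the negative of this derivative
y_delta = (t - y) * (y * (1 - y))

print(analytic, numeric, y_delta)
```

The first two printed values should agree to many decimal places, and the third should be their negative.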

## Line 9

```Python
h_delta = np.dot(y_delta, W2.T) * (h * (1-h))
```

`h_delta` represents how the loss changes with respect to **the weighted inputs that go into the hidden layer.** However, the output of the hidden layer goes through the second set of weights, `W2`, so ultimately `h_delta` must depend on `W2`. 

We already know how $L$ depends on the weighted inputs to the output layer, `y_delta`. We can pass those values backwards across connections between the output and hidden layer. We do this simply by multiplying `y_delta` by the weights in `W2`, and using transpose where appropriate. The resulting values describe how the loss changes with respect to the output of the hidden layer.

Then, according to the chain rule, we also multiply by a factor which describes how the output of the hidden node changes with respect to the weighted input to the hidden node. We have already seen this factor: it is the derivative of the activation function, $\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z)) = h \cdot (1 - h)$, where $h = \sigma(z)$.
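The two steps described above can be spelled out for a single training example and compared with the vectorized line. This is only a sketch: the index `i`, the helper `sigmoid`, and the names `back_through_W2` and `h_delta_i` are introduced here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
t = np.array([[0, 1, 1, 0]]).T
W1 = 2 * np.random.random((3, 4)) - 1
W2 = 2 * np.random.random((4, 1)) - 1

h = sigmoid(np.dot(X, W1))
y = sigmoid(np.dot(h, W2))
y_delta = (t - y) * (y * (1 - y))

# Vectorized version from the network code: all four examples at once
h_delta = np.dot(y_delta, W2.T) * (h * (1 - h))

# The same computation for one example, step by step
i = 1  # the example [0, 1, 1]
# Step 1: pass y_delta back across the 4 connections in W2
back_through_W2 = y_delta[i, 0] * W2[:, 0]
# Step 2: multiply by the sigmoid derivative at each hidden node
h_delta_i = back_through_W2 * (h[i] * (1 - h[i]))

print(np.allclose(h_delta_i, h_delta[i]))
```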

## Lines 10-11

```Python
W2 += np.dot(h.T, y_delta)
W1 += np.dot(X.T, h_delta)
```

This is the gradient descent step. No learning rate appears in the code, so the step size is implicitly $1$.

The second set of weights `W2` should be updated by taking a step in direction `y_delta`. The size of the step is scaled by the hidden layer output coming along that connection. The matrix product does this for every training example simultaneously, so that the weight update improves the ANN's ability to classify _every_ example.

A similar calculation is done to update the weights of `W1`, this time using `h_delta` to determine the direction of the step, and the inputs from the first layer `X` to dictate the magnitude of the step.
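To see how the matrix product handles every training example at once, the sketch below builds the `W2` update one example at a time and compares it with `np.dot(h.T, y_delta)`. The names `update_by_example`, `batched_update`, `W2_new` and the explicit `learning_rate` are assumptions made for illustration; the original code has no learning rate parameter.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
t = np.array([[0, 1, 1, 0]]).T
W1 = 2 * np.random.random((3, 4)) - 1
W2 = 2 * np.random.random((4, 1)) - 1

h = sigmoid(np.dot(X, W1))
y = sigmoid(np.dot(h, W2))
y_delta = (t - y) * (y * (1 - y))

# Accumulate the W2 update one training example at a time
update_by_example = np.zeros_like(W2)
for i in range(len(X)):
    # Each hidden output h[i, j] scales the step y_delta[i] for its connection
    update_by_example += np.outer(h[i], y_delta[i])

# The single matrix product used in the network code
batched_update = np.dot(h.T, y_delta)
print(np.allclose(update_by_example, batched_update))

# Hypothetical explicit learning rate; 1.0 reproduces the original update
learning_rate = 1.0
W2_new = W2 + learning_rate * batched_update
```

Choosing a `learning_rate` smaller than 1 would take smaller steps in the same direction.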

# Exercises (for your benefit, won't be marked)

1. How would you modify the code to add more nodes to the hidden layer?

2. How would you modify the code to add another hidden layer?

3. Suppose I wanted to use a different activation function. Which lines of the code would have to change?

4. Try removing a training example. If you train the network and ask it to predict a value for the missing example, does it do it correctly?

5. You can measure the overall loss of the network by taking the average loss over all the training examples. Try recording this overall loss once per training loop, and plotting it with respect to the number of weight updates.

In [7]:
np.sum(W1, axis=1, keepdims=True)

array([[5.87276665],
       [4.91662386],
       [5.47691733]])