# Chapter 11: Implementing a Multilayer Artificial Neural Network from Scratch

## Modeling complex functions with artificial neural networks

The ideas behind *neural networks* date back to the 1940s, and early single-layer models such as the perceptron and ADALINE, already covered in this book, followed in the late 1950s and 1960. Neural networks have become far more popular since it became practical to train *deep neural networks*, which stack multiple layers of neurons.

### Single layer neural network recap

Consider ADALINE as an example. Recall that in every epoch, the weight vector, $w$, and bias unit, $b$, are updated as
$$
    w := w + \Delta w \text{ and } b := b + \Delta b
$$
where
$$
    \Delta w_j = -\eta \frac{\partial L}{\partial w_j}
$$
and
$$
    \Delta b = -\eta\frac{\partial L}{\partial b}
$$

These updates are applied over multiple passes through the training set. In each pass, the output of an activation function (in ADALINE, the identity function) is compared with the true target value to compute the loss $L$, and the parameters take a step in the opposite direction of the loss gradient, $\nabla L(w)$, to move toward the optimal weights. For ADALINE, $L$ is the mean squared error (MSE). Learning can be accelerated with stochastic gradient descent, which updates the parameters after each training example or mini-batch rather than once per full pass.
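The update rule above can be sketched in NumPy as follows. This is a minimal illustration, assuming full-batch gradient descent and an MSE loss; the function name `adaline_epoch`, the synthetic data, and the learning rate are hypothetical choices, not taken from the book.

```python
import numpy as np

def adaline_epoch(X, y, w, b, eta=0.01):
    """One full-batch gradient descent step for ADALINE with an MSE loss."""
    output = X @ w + b                 # identity activation: output equals net input
    errors = y - output                # compare output with the true values
    loss = (errors ** 2).mean()        # L = mean squared error (before the update)
    # Gradients of L with respect to w and b
    grad_w = -2.0 * X.T @ errors / X.shape[0]
    grad_b = -2.0 * errors.mean()
    # Step in the opposite direction of the gradient
    w = w - eta * grad_w               # Delta w_j = -eta * dL/dw_j
    b = b - eta * grad_b               # Delta b   = -eta * dL/db
    return w, b, loss

# Synthetic regression-style data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + 0.5

w, b = np.zeros(2), 0.0
losses = []
for _ in range(200):                   # 200 epochs
    w, b, loss = adaline_epoch(X, y, w, b, eta=0.1)
    losses.append(loss)
```

On this noise-free data, the recorded loss shrinks from epoch to epoch, and `w` and `b` approach the values used to generate `y`.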

The same gradient-based optimization will be used to implement and train a *multilayer perceptron* (MLP) model.

### Introducing the multilayer neural network architecture

An MLP consists of an input layer, one or more hidden layers, and an output layer. If a network has more than one hidden layer, it is called a *deep NN*.

The $i$th activation unit in the $l$th layer is denoted $a_i^{(l)}$. Numerical indices are usually not used for the layers; instead, $x_i^{(in)}$ refers to the $i$th input feature value, $a_i^{(h)}$ refers to the $i$th unit in the hidden layer, and $a_i^{(out)}$ refers to the $i$th unit in the output layer. The bias vectors $b^{(h)}$ and $b^{(out)}$ each store $d$ bias units, where $d$ is the number of nodes in the corresponding layer (hidden or output).

### Activating a neural network via forward propagation

The MLP learning procedure is summarized in three steps:

1. Starting at the input layer, $x^{(in)}$, forward propagate the patterns of the training data through the network to generate an output.
2. Based on the network's output, calculate the loss to be minimized using a loss function.
3. Backpropagate the loss, find its derivative with respect to each weight and bias unit, and update the model.

These steps are repeated for each epoch.
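The three steps can be sketched as a single training loop. This is only an illustration under stated assumptions: one hidden layer, sigmoid activations in both layers, an MSE loss, and full-batch gradient descent. The text at this point does not fix the loss function or derive the gradients, so `train_mlp`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, Y, n_hidden=8, eta=0.5, epochs=300, seed=0):
    """Train a one-hidden-layer MLP; returns the parameters and per-epoch losses."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    t = Y.shape[1]
    W_h = rng.normal(scale=0.1, size=(n_hidden, m))
    b_h = np.zeros(n_hidden)
    W_out = rng.normal(scale=0.1, size=(t, n_hidden))
    b_out = np.zeros(t)
    losses = []
    for _ in range(epochs):
        # Step 1: forward propagate the inputs through the network
        A_h = sigmoid(X @ W_h.T + b_h)
        A_out = sigmoid(A_h @ W_out.T + b_out)
        # Step 2: compute the loss (MSE over all outputs, an assumption)
        losses.append(np.mean((A_out - Y) ** 2))
        # Step 3: backpropagate the loss and update weights and biases
        delta_out = 2.0 * (A_out - Y) / Y.size * A_out * (1.0 - A_out)  # dL/dZ_out
        delta_h = (delta_out @ W_out) * A_h * (1.0 - A_h)               # dL/dZ_h
        W_out -= eta * delta_out.T @ A_h
        b_out -= eta * delta_out.sum(axis=0)
        W_h -= eta * delta_h.T @ X
        b_h -= eta * delta_h.sum(axis=0)
    return (W_h, b_h, W_out, b_out), losses

# Toy two-class problem with one-hot targets (made up for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(2)[labels]

params, losses = train_mlp(X, Y)
```

Each pass through the loop body corresponds to one epoch. The hidden-layer error term is computed before the output weights are updated, so both gradients refer to the same parameter values.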

The forward pass proceeds step by step. First, calculate the net input of the first activation unit in the hidden layer:
$$
    z_1^{(h)} = x_1^{(in)}w_{1,1}^{(h)} + \dots + x_m^{(in)}w_{1,m}^{(h)} + b_1^{(h)}
$$
The activation of the unit is then
$$
    a_1^{(h)} = \sigma(z_1^{(h)})
$$
Note that solving complex problems requires a nonlinear activation function, for example the sigmoid (logistic) activation function used in logistic regression, $\sigma(z) = \frac{1}{1 + e^{-z}}$.
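As a small worked example, the net input and activation of a single hidden unit can be computed directly. The feature values, weights, and bias below are made up for illustration, and the bias term is included to match the matrix form of the net input.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

x_in = np.array([0.5, -1.0, 2.0])   # m = 3 input features (made-up values)
w_1 = np.array([0.1, 0.4, -0.2])    # weights into hidden unit 1 (made-up values)
b_1 = 0.05                          # bias of hidden unit 1 (made-up value)

z_1 = x_in @ w_1 + b_1              # net input: 0.05 - 0.4 - 0.4 + 0.05, about -0.7
a_1 = sigmoid(z_1)                  # activation, about 0.332
```

Because the net input is negative, the sigmoid maps it to a value below 0.5.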

The calculation can be generalized to all $n$ examples at once, where $X^{(in)}$ is an $n \times m$ matrix with one example per row:
$$
    Z^{(h)} = X^{(in)}W^{(h)T} + b^{(h)}
$$
Thus:
$$
    A^{(h)} = \sigma(Z^{(h)})
$$
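Finally, the hidden-layer activations are passed to the output layer in the same way:
$$
    Z^{(out)} = A^{(h)}W^{(out)T} + b^{(out)}
$$
and
$$
    A^{(out)} = \sigma(Z^{(out)})
$$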
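The matrix form of the forward pass translates almost line for line into NumPy. The sketch below assumes the output layer's net input is computed like the hidden layer's, that is, as the hidden activations times the transposed output weights plus the output bias. The layer sizes and random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W_h, b_h, W_out, b_out):
    """Forward propagate n examples through one hidden layer and the output layer."""
    Z_h = X @ W_h.T + b_h            # shape (n, d_hidden)
    A_h = sigmoid(Z_h)
    Z_out = A_h @ W_out.T + b_out    # shape (n, n_outputs)
    A_out = sigmoid(Z_out)
    return A_h, A_out

# Hypothetical sizes: 4 examples, 784 features, 50 hidden units, 10 outputs
n, m, d_hidden, n_outputs = 4, 784, 50, 10
rng = np.random.default_rng(1)
X = rng.normal(size=(n, m))
W_h = rng.normal(scale=0.1, size=(d_hidden, m))
b_h = np.zeros(d_hidden)
W_out = rng.normal(scale=0.1, size=(n_outputs, d_hidden))
b_out = np.zeros(n_outputs)

A_h, A_out = forward(X, W_h, b_h, W_out, b_out)
```

Storing each weight matrix with one row per unit is what makes the transpose necessary. The bias vectors are broadcast across the $n$ rows.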

## Classifying handwritten digits

Load the MNIST dataset with scikit-learn's `fetch_openml`.

In [1]:
from sklearn.datasets import fetch_openml 
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
X = X.values
y = y.astype(int).values

In [2]:
print(X.shape)
print(y.shape)

(70000, 784)
(70000,)
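Each of the 784 columns holds one pixel of a flattened 28 × 28 image, so a row can be reshaped back into a picture. The sketch below uses a small hypothetical stand-in array, `X_demo`, instead of the downloaded data so that it runs offline. With the real data, `X_demo` would be replaced by `X`.

```python
import numpy as np

# Stand-in for the MNIST feature matrix: 5 rows of random pixel values (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(5, 784))

# Undo the flattening for the first example
image = X_demo[0].reshape(28, 28)
```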
