# Multilayer Perceptrons

---
When are they going to call them Sigmoid neurons, if ever?

- I need a new definition of perceptron, if they aren't restricted to binary input and output.

---

## Implement the hidden layer

> Before, we were dealing with only one output node which made the code straightforward. However now that we have multiple input units and multiple hidden units, the weights between them will require two indices: $w_{ij}$ where $i$ denotes input units and $j$ are the hidden units.

- The indices on $w$ are like matrix indices. Nothing more complicated
  - Just where the dimensions are coming from changed :)

Imagine the network:

![image](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/589978f4_network-with-labeled-weights/network-with-labeled-weights.png "Weights are labeled with the __input__ source node and the __hidden layer__ destination node")

It's funny that this notation is like a matrix, because __we store these in matrices__

- That's right, our weights array just became a matrix (at least)
  - Tensors will flow, in due time

Also note:

- Rows will be all weights leading __out__ of a __single Input Node__
  - Depicted by the first index of the matrix
- In the following image, note that everything in a given column will be taken in for a given node
  - This is the second index of the matrix

![weighted labeled hidden layer](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/58a49908_multilayer-diagram-weights/multilayer-diagram-weights.png "Relax and let the data flow")

Examining the following code, a few things become apparent:

```python
# Number of records and input units
n_records, n_inputs = features.shape
# Number of hidden units
n_hidden = 2
weights_input_to_hidden = np.random.normal(0, n_inputs**-0.5, size=(n_inputs, n_hidden))
```

1. $n^{\frac {-1} 2} = 1/{n^{\frac 1 2}}  = 1/{\sqrt{n}}$
  - We used the latter in the previous sections.
  - We use the former now. More concise
2. `size=...` is the dimensionality that comes with this being a __matrix__ now
  - Assignment via a tuple is a rather elegant way to go about it
3. In the example, the number of hidden nodes is trivial and appears irrelevant
  - But we know that the layers provide deeper insight...
    - Specifically because networks can grow the number of nodes as they learn the problem
    - Meta-learning, in a way
  - There may be more to this
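To make the row/column convention concrete, here is a minimal sketch. The `features` array is made-up stand-in data (5 records, 3 inputs), an assumption for illustration rather than the course dataset.

```python
import numpy as np

np.random.seed(0)
# Stand-in data (an assumption): 5 records with 3 input units each
features = np.random.randn(5, 3)

n_records, n_inputs = features.shape
n_hidden = 2

# n ** -0.5 and 1 / sqrt(n) give the same scale
scale = n_inputs ** -0.5
assert np.isclose(scale, 1 / np.sqrt(n_inputs))

weights_input_to_hidden = np.random.normal(0, scale, size=(n_inputs, n_hidden))

# Row i (first index): every weight leaving input unit i
weights_out_of_x1 = weights_input_to_hidden[0, :]
# Column j (second index): every weight arriving at hidden unit j
weights_into_h1 = weights_input_to_hidden[:, 0]

print(weights_input_to_hidden.shape)
print(weights_out_of_x1.shape, weights_into_h1.shape)
```

A row has one entry per hidden unit and a column has one entry per input unit, which is why the `size` tuple is `(n_inputs, n_hidden)`.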

### Determining the node-input is easy

... right after you get a dot product

Let's use $h_1$ for this

- Its weighted inputs are the dot product of the inputs - $x_1, x_2, x_3$ - and the hidden layer weights for just its column - $w_{11}, w_{21}, w_{31}$
  - See the orange above
  - Important: "__The inputs__" to $h_1$ are a vector
    - As is its column, taken independently

So we get something a little like:

![h1 weighted inputs](https://d17h27t6h515a5.cloudfront.net/topher/2017/January/588ae392_codecogseqn-2/codecogseqn-2.png "Notice this is just h1...")

- And we'd have to do the same thing for $h_2$...

\begin{equation}
h_2 = x_1 w_{12} + x_2 w_{22} + x_3 w_{32}
\end{equation}

- But wouldn't that get a little long winded?
  - Yes, little Timmy. I believe it would...

So hows about we just take the matrix product (vector x matrix) and get back a vector of the hidden, weighted input values?

- Seems legit. Simple, straightforward
- Do the math in one swing :)

That looks a little something like:

\begin{equation*}
h = x \cdot w = \begin{bmatrix}
x_1 & x_2 & x_3
\end{bmatrix} \cdot \begin{bmatrix}
w_{11} & w_{12} \\
w_{21} & w_{22} \\
w_{31} & w_{32}
\end{bmatrix}
\end{equation*}

And that outputs a vector of:

\begin{equation*}
\begin{bmatrix}
{x \cdot w_{i1}} & {x \cdot w_{i2} }
\end{bmatrix}
\end{equation*}

- Where we let $i$ stand in for the row

So just like we talked about earlier :) Instead of column-wise, one-at-a-time, we just do the matrix maths
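A small check that the one-shot matrix product really is the same as the per-column dot products. The numbers are borrowed from the programming exercise further down; the variable names `h1`, `h2`, and `h` are my own.

```python
import numpy as np

x = np.array([0.5, 0.1, -0.2])   # x_1, x_2, x_3
w = np.array([[0.5, -0.6],       # w_11, w_12
              [0.1, -0.2],       # w_21, w_22
              [0.1,  0.7]])      # w_31, w_32

# One column at a time: a dot product per hidden unit
h1 = np.dot(x, w[:, 0])
h2 = np.dot(x, w[:, 1])

# All at once: vector-by-matrix product
h = np.dot(x, w)

assert np.allclose(h, [h1, h2])
print(h)
```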

### A word of caution to this tale

You could very well set up the inputs as a column, and transpose the matrix.

- Just be aware that each row would then hold the weights going into one hidden node, and each column the weights coming out of one input node

That would look something like:

![column-flipped hidden layer](https://d17h27t6h515a5.cloudfront.net/topher/2017/January/588b7c74_inputs-matrix/inputs-matrix.png "Oh the choices available")

---
And for the sake of being unabashedly clear:

Where a vector is an array, in code...

- If you want to multiply __matrix-by-vector__
  - the number of columns in the matrix must equal the number of rows of the (column) vector
    - Like in the above picture
- If you want to multiply __vector-by-matrix__
  - the (row) vector's length must equal the number of rows of the matrix
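The dimension rules can be checked directly in NumPy. This is a sketch with made-up values; `h_row`, `h_col`, and `mismatch_raised` are illustrative names only.

```python
import numpy as np

w = np.arange(6.0).reshape(3, 2)   # 3 inputs (rows) x 2 hidden units (columns)
x = np.array([1.0, 2.0, 3.0])      # one value per input

# vector-by-matrix: len(x) must equal the number of ROWS of w
h_row = np.dot(x, w)               # shape (2,)

# matrix-by-vector: transpose w so its COLUMNS match the column vector's rows
h_col = np.dot(w.T, x[:, None])    # shape (2, 1)

assert np.allclose(h_row, h_col.ravel())

# Mismatched dimensions fail at run time with a ValueError
try:
    np.dot(w, x)                   # (3, 2) by (3,) does not line up
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(h_row.shape, h_col.shape, mismatch_raised)
```

Both orderings give the same numbers; only the shape of the result differs.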

---
It's really quite humorous that this is a point they keep harping on

> Don't mix up your dimensions, or else your code won't compile.

Yup... Makes sense.

- Maybe I've just seen so much bad Java, that has 10+ parameters to a function, that I just "get it"
- Or maybe it's the fact that I still have the algebra *relatively* fresh in mind
  - Getting fresher all the time, reviewing all of grade school with [Khan academy](https://www.khanacademy.org/mission/pre-algebra)

---
And just some code, for icing on the cake

### Nuances of NumPy vectors

> You see above that sometimes you'll want a column vector, even though by default Numpy arrays work like row vectors. It's possible to get the transpose of an array like so `arr.T`, but for a 1D array, the transpose will return a row vector. Instead, use `arr[:,None]` to create a column vector.

That said, here's what it looks like

In [1]:
import numpy as np

In [2]:
features = np.random.normal(size=3)

In [3]:
print(features)

[ 0.47491054  0.43920491 -1.5094122 ]


In [4]:
print(features.T)

[ 0.47491054  0.43920491 -1.5094122 ]


In [5]:
print(features[:, None])

[[ 0.47491054]
 [ 0.43920491]
 [-1.5094122 ]]


Or you could just tell NumPy to give you a 2-D array, so you can work with it in a matrix-transpository way :P

In [6]:
np.array(features, ndmin=2)

array([[ 0.47491054,  0.43920491, -1.5094122 ]])

In [7]:
np.array(features, ndmin=2).T # there's that column we love

array([[ 0.47491054],
       [ 0.43920491],
       [-1.5094122 ]])

There's a coding test, to implement a 4x3x2 feed-forward network

- We're calling them "hidden" layers, because soon they will be generated programmatically.
  - At least, that is my belief

The exercise skeleton code:

```python
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

# Network size
N_input = 4
N_hidden = 3
N_output = 2

np.random.seed(42)
# Make some fake data
X = np.random.randn(4)

weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))


# TODO: Make a forward pass through the network

hidden_layer_in = None
hidden_layer_out = None

print('Hidden-layer Output:')
print(hidden_layer_out)

output_layer_in = None
output_layer_out = None

print('Output-layer Output:')
print(output_layer_out)

```

- I'm a little caught on how to produce the hidden layer's output
  - I would think it's just a mapping of the sigmoid function over the arrays
  - But that feels weird to me

### Quiz reflection

Turns out that "mapping" the function was the right approach, because that's exactly what happens.

- One way to do this is [`np.vectorize`](https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.vectorize.html), though its docs describe it as a convenience (essentially a `for` loop), not a performance optimization

```python
# Given that everything is activated by a sigmoid function
activation_func = np.vectorize(sigmoid)
```

At this point the only thing one would question is probably:

- What was the rest of the solution?

__I finally understand the simplicity__

- Just invoke the function (Functional Programming is a straight flush here) for all relevant nodes
  - The activation function __is__ the function for the output
  - If it's a Sigmoid, then pass $h$ as a parameter to that function
  - Boom! There's your output

```python
# input layer
X = np.random.randn(4)

hidden_layer_inputs = np.dot(X, weights_input_to_hidden)
# here's that simplicity
hidden_layer_output = activation_func(hidden_layer_inputs)
```
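For completeness, a sketch of the whole forward pass for the 4x3x2 exercise, filling in the skeleton's `None` placeholders. It calls `sigmoid` directly instead of going through `activation_func`; that is my choice here, not necessarily the course's reference solution.

```python
import numpy as np

def sigmoid(x):
    """Calculate sigmoid"""
    return 1 / (1 + np.exp(-x))

N_input, N_hidden, N_output = 4, 3, 2

np.random.seed(42)
X = np.random.randn(N_input)
weights_input_to_hidden = np.random.normal(0, scale=0.1, size=(N_input, N_hidden))
weights_hidden_to_output = np.random.normal(0, scale=0.1, size=(N_hidden, N_output))

# Input layer -> hidden layer
hidden_layer_in = np.dot(X, weights_input_to_hidden)
hidden_layer_out = sigmoid(hidden_layer_in)

# Hidden layer -> output layer: same recipe, one layer further along
output_layer_in = np.dot(hidden_layer_out, weights_hidden_to_output)
output_layer_out = sigmoid(output_layer_in)

print('Hidden-layer Output:', hidden_layer_out)
print('Output-layer Output:', output_layer_out)
```

The output layer repeats the hidden layer's two steps (weighted sum, then activation), with the hidden activations playing the role of the inputs.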

---
The mathematics here is 

- Recursive in logic
  - Do this algorithm throughout the network, until there are no more "child" nodes
- Iterative in explanation
  - For every column, do this operation *omitted*
  - Pass those results to the next column
  - Repeat

Whereas the code is... iterative.

Just food for thought

---
I imagine we're not very far at all from just inlining the input calculation.

- But wait. That doesn't make any sense.
  - Need the inputs available to calculate the errors...

Shoot. Almost.

- Though I'm not really a fan of inlining :P

---
In review (3 days later) I say this bit about inlining because each node is a little calculator:

- You clean up what goes in (input $h$)
- Then it just spits out the output of some simple function
  - Often the result of a Sigmoid.

And __that's it__. Thus, "nearly-inline-able" :P

---
Turns out you can just call a function on an `np.array` and it will behave as you expect it to.

In [9]:
map_example_arr = np.array([2,3,4])
print(map_example_arr)

[2 3 4]


In [10]:
map_example_arr * 5

array([10, 15, 20])

- Didn't, necessarily, need to "map" via `vectorize`
  - Might in the future, with larger tensors
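A quick comparison of the two approaches on a tiny array, as a sketch. Because `sigmoid` is built from `np.exp`, it already works element-wise without any wrapper.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

h = np.array([-2.0, 0.0, 2.0])

direct = sigmoid(h)                 # np.exp operates element-wise on the array
mapped = np.vectorize(sigmoid)(h)   # explicit element-by-element mapping

assert np.allclose(direct, mapped)
print(direct)
```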

# Backpropagation

> We're dealing with multiple layers, but we'd still like to train with Gradient Descent.

We've already learned that error in the output node is $\delta = (y - \hat{y})f'(h)$

- But this was only with one layer
- __In our case__, this produces the error between the hidden layer and output layer

So... How do we get the error between the input layer and the hidden layer?

### How do we find the error to use in the gradient descent step?

> Error for units is proportional to the error in the output layer *times* the weight between the units.

Remember that analogy, that this is taking an opinion from a person?

- Now imagine that *they* heard it through the grapevine.
- Not very credible now, are they?

__In other words__

The unit with a __strong weight__ to the output will contribute a __stronger error__

- Before training of course

In the following picture, note the subscripts on our error-term $\delta$ and weights $W_1$ and $W_2$

- Are those weights for a column? (matrix column; for more, see above "column" usage)

---
By the way, that $\delta$ is lowercase Greek __delta__. Haven't mentioned it all this time.

Sorry!

---

![weighted backprop level 1](./weighted backpropogation through layers.png)

> Instead of __propagating__ your inputs *forward*, you're __propagating__ the error *__backwards__*

- This is where this exercise gets its name

You can also imagine that you are feeding your error, as input, into a mirror network

- Everything is laid out exactly reflected as to how they were originally.

## Backpropagation Fundamentals

I would say that mathematics is *succinct*, once you understand what the symbols are there for.

- Otherwise, it will just appear to be a foreign language.

---
Calculating the error for the output layer is done on a per-output-node basis.

Observe the following equation:

\begin{equation}
\delta_j^h = \sum W_{jk} \delta_k^o f'(h_j)
\end{equation}

Notation breakdown:

- $x^o$ is identifying our __o__utput-related nodes
- $k$ indexes the output nodes of the current layer in the backpropagation (the sum runs over every $k$)
- $\delta_j^h$ is the error-term of some $h$ in the hidden-layer
- $x_j$ identifies which node we're talking about in the given hidden layer
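The equation can be sketched in NumPy for a layer with several output nodes. All numbers below are made up for illustration, and the activation is assumed to be a sigmoid so that $f'(h_j) = f(h_j)(1 - f(h_j))$.

```python
import numpy as np

# Made-up values: 3 hidden units feeding 2 output units
W = np.array([[ 0.1, -0.3],
              [ 0.4,  0.2],
              [-0.5,  0.6]])             # W[j, k]: hidden unit j -> output unit k
delta_o = np.array([0.05, -0.02])        # one error term per output unit k
hidden_out = np.array([0.6, 0.5, 0.4])   # f(h_j) for each hidden unit j

# Sigmoid derivative expressed through the activation itself
f_prime_h = hidden_out * (1 - hidden_out)

# Sum over k for every j at once, then scale by f'(h_j)
delta_h = np.dot(W, delta_o) * f_prime_h

# The same computation with the sum written out
delta_h_loop = np.array([
    sum(W[j, k] * delta_o[k] for k in range(2)) * f_prime_h[j]
    for j in range(3)
])

assert np.allclose(delta_h, delta_h_loop)
print(delta_h)
```

The matrix product handles the sum over $k$, so each hidden unit collects error from every output unit it feeds.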

Vocabulary:

- __current__ is meant to simplify the observation of the problem
  - If you are *recursively iterating* back through each layer, to correct the error, you are aware of two contexts
    - The layer you previously observed
      - On __first iteration__ this is the output layer
    - The layer currently being observed
      - (In a 3x2x1 network) On __first iteration__, this is the 2-node layer
  - To take a programmatic lens to it
    - In functional programming, a `reduce : (b -> a -> b) -> b -> [a] -> b` can be written (logically) as either perspective
      - "Current, next":
        - I've got the new data I'm building and the next item to consume
      - "Current, previous":
        - I've got the thing that is being built and the current item to consume
  - Then try looking at backpropagation in that way
    - This is a stretch for holistically absorbing this concept
    - But everything in programming can be written as a reduce ;) (or map)

---

Let's take a look at the new equation for calculating the *next step of our gradient descent*:

![new-old weight_ij](https://d17h27t6h515a5.cloudfront.net/topher/2017/January/588bc2d4_backprop-general/backprop-general.gif)

__It's the same as the old one!__ Great!

I want to take steps and eventually reach our goal, like walking isn't a novel experience for me
> That's expressed by our friend $V_{in}$.
- He's the value that got us here (be it on- or off-course)

I want to walk in the right direction, eventually reaching the bottom of the valley
> That direction is easily discernable from our friend $\delta$

If I'm going in the correct direction, I want to run - not walk.
> That's what $\eta$ is for. Bigger strides, not just "steps"

---
Mat takes a moment here to walk through an example. I will notate anything new that pops out.

> Again, __note__ this is "Lesson 8: Intro to Neural Networks, 15. Backpropogation"

### Notes from "working through an example"

- $a$ for __a__ctivation, is the definition of the result of the output function
  - "One *outputs* an *activation*"
- $o^h$ should be read as "the *output* for a given $h$"
  - Or a particular weighted input, if that suits your fancy.
- $\eta$ has utility
  - Because these numbers are so small, but there's so much to correct for, it's good to scale them up
    - In the example the $\delta^o$ (error-term of the output) is about half of the target
    - But the $\delta^h$ is about 0.003.
    - Needless to say, that would take a long time to correct (especially given how low the weights are in general)

---
What I just observed is called the __vanishing gradient problem__.

> Because the maximum derivative of a Sigmoid function is 0.25, the errors in the output layer get reduced by at least 75%
- And the hidden layer before it 93.75%!
  - Check the math (the first one is just subtraction. The next is harder)
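Checking the math numerically, as a small sketch: the sigmoid derivative peaks at $h = 0$ with value 0.25, so one layer keeps at most 25% of the error and two layers keep at most $0.25^2 = 6.25\%$. The helper name `sigmoid_prime` is my own.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Scan a range of inputs; the derivative is largest at h = 0
hs = np.linspace(-10, 10, 2001)
max_deriv = sigmoid_prime(hs).max()

# Best-case shrinkage after one and two sigmoid layers
one_layer = 1 - max_deriv          # about 0.75
two_layers = 1 - max_deriv ** 2    # about 0.9375

print(max_deriv, one_layer, two_layers)
```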

Using a sigmoid activation function quickly reduces the weight steps (which we *try* to counteract with the learning rate)

- __Why not just use a better activation function?__

Speculation:

- This might just be proof that certain activations model the "butterfly effect" better than others.
  - That's hard to say, given how new I am to this
  - We need young minds to ask the basic, out-of-the-box questions
  - We need learned minds to reconsider what they feel is impossible

## Implementing in Numpy

We have to consider the error for *each unit*\* in the hidden layer

- I've been correcting it, but for *practically* this whole entry, Mat (or whoever) has been using *unit* in place of *node/neuron/perceptron*.
  - For the love of Pete, please! Some consistency would be nice.

Row x row multiplication sucks, so do the transposition trick we learned earlier:

```python
some_inputs[:, None] # BAM! Got a column now
```

- Matrix math, of course, has to have the proper dimensions
  - So check yourself at the door
- The instructor says we're likely to have a different number of *hidden* vs *input* units, so just straight multiplication across will fail
  - Again, of course. The dimensions have to match and (x,) $\times$ (y,) doesn't exactly work
  - The math falls through
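A sketch of why the column trick matters here. The `hidden_error_term` values are made up; the point is only the shapes.

```python
import numpy as np

x = np.array([0.5, 0.1, -0.2])                  # 3 input units
hidden_error_term = np.array([0.003, -0.008])   # 2 hidden units (made-up values)

# (3,) * (2,) cannot broadcast, so plain element-wise multiplication fails
try:
    x * hidden_error_term
    failed = False
except ValueError:
    failed = True

# (3, 1) * (2,) broadcasts to (3, 2): one entry per weight w_ij
step = x[:, None] * hidden_error_term

print(failed, step.shape)
```

The result has the same shape as the input-to-hidden weight matrix, which is what a weight update needs. It is the same thing as `np.outer(x, hidden_error_term)`.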

## Programming exercise
To see how we're holding up

As before, the skeleton code:

```python
import numpy as np


def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5

weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])

weights_hidden_output = np.array([0.1, -0.3])

## Forward pass
hidden_layer_input = np.dot(x, weights_input_hidden)
hidden_layer_output = sigmoid(hidden_layer_input)

output_layer_in = np.dot(hidden_layer_output, weights_hidden_output)
output = sigmoid(output_layer_in)

## Backwards pass
## TODO: Calculate output error
error = target - output

# TODO: Calculate error term for output layer
# inline f'(h) of sigmoid (output * (1 - output))
output_error_term = error * output * (1 - output)

# TODO: Calculate error term for hidden layer
# f'(h) = hidden_layer_output * (1 - hidden_layer_output)
# delta^o = output_error_term
# W = weights between the hidden layer and the output layer
hidden_error_term = (weights_hidden_output * output_error_term
                     * hidden_layer_output * (1 - hidden_layer_output))

# TODO: Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output

# TODO: Calculate change in weights for input layer to hidden layer
# x[:, None] makes x a column, so (3, 1) * (2,) broadcasts to (3, 2)
delta_w_i_h = learnrate * hidden_error_term * x[:, None]

print('Change in weights for hidden layer to output layer:')
print(delta_w_h_o)
print('Change in weights for input layer to hidden layer:')
print(delta_w_i_h)

```

And my solution:

It was really as simple as following the equations

- I see why they use the variable names
  - It's still lazy, but *I see why* now.

#### error is ALWAYS the difference between the prediction and target

> I forgot that simplest of ideas.

```python
## Backwards pass
## TODO: Calculate output error
error = target - output
```

#### inlining a sigmoid can be justified
but I needed a comment to do so

- It's so easy to shoot yourself in the foot

```python
# TODO: Calculate error term for output layer
# inline f'(h) of sigmoid (output * (1 - output))
output_error_term = error * output * (1 - output)
```

#### Lines of code that are *far* too long
I had forgotten to apply the sigmoid derivative
\begin{equation}
f'(h) = f(h)(1 - f(h))
\end{equation}
... against the hidden layer and had to debug for about 5 minutes

So in the following code block:

- $f'(h)$ = `hidden_layer_output * (1 - hidden_layer_output)`
- $\delta^o$ = `output_error_term`
- $W$ = weight between current layer and previous
  - `weights_hidden_output` __in this case__

```python
# TODO: Calculate error term for hidden layer
deriv_hidden_outputs = hidden_layer_output * (1 - hidden_layer_output)
hidden_error_term = weights_hidden_output * output_error_term * deriv_hidden_outputs
print('hidden error term:', hidden_error_term)
```

The shaping issue occurred, because I'm a know-it-all half the time.

- On second edit, I added the dimensionality to our inputs $x_i$ and everything was kosher

```python
# TODO: Calculate change in weights for hidden layer to output layer
delta_w_h_o = learnrate * output_error_term * hidden_layer_output

# TODO: Calculate change in weights for input layer to hidden layer
delta_w_i_h = learnrate * hidden_error_term * x[:, None]
```
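One way to gain confidence in these weight steps is a finite-difference check, which is my addition and not part of the course exercise. The sketch assumes the loss is the half squared error $E = \frac{1}{2}(y - \hat{y})^2$, so each backprop step should equal $-\eta \, \partial E / \partial w$; the `loss` helper and `eps` are hypothetical names.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([0.5, 0.1, -0.2])
target = 0.6
learnrate = 0.5
weights_input_hidden = np.array([[0.5, -0.6],
                                 [0.1, -0.2],
                                 [0.1, 0.7]])
weights_hidden_output = np.array([0.1, -0.3])

def loss(w_ih, w_ho):
    """Half squared error of the forward pass (assumed loss)."""
    hidden = sigmoid(np.dot(x, w_ih))
    return 0.5 * (target - sigmoid(np.dot(hidden, w_ho))) ** 2

# Backprop, following the solution above
hidden_layer_output = sigmoid(np.dot(x, weights_input_hidden))
output = sigmoid(np.dot(hidden_layer_output, weights_hidden_output))
output_error_term = (target - output) * output * (1 - output)
hidden_error_term = (weights_hidden_output * output_error_term
                     * hidden_layer_output * (1 - hidden_layer_output))
delta_w_h_o = learnrate * output_error_term * hidden_layer_output
delta_w_i_h = learnrate * hidden_error_term * x[:, None]

# Central-difference estimate of -learnrate * dE/dw for each w_ij
eps = 1e-5
numeric = np.zeros_like(weights_input_hidden)
for i in range(3):
    for j in range(2):
        up = weights_input_hidden.copy()
        down = weights_input_hidden.copy()
        up[i, j] += eps
        down[i, j] -= eps
        slope = (loss(up, weights_hidden_output)
                 - loss(down, weights_hidden_output)) / (2 * eps)
        numeric[i, j] = -learnrate * slope

print(np.allclose(delta_w_i_h, numeric, rtol=1e-3, atol=1e-9))
```

If the $(1 - f(h))$ factor is dropped from `hidden_error_term`, the two arrays stop agreeing, which makes this a handy way to catch that kind of slip.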

# Implementing backpropagation

So what do we know?

- Error in the output layer
- Error in a given hidden layer

And __remember__: we care about the error, so we can correct our predictions, increase accuracy, and automate the world :)

## Error in the output layer

The "error term" defined for a given output node $k$ is

\begin{equation}
\delta_k = (y_k - \hat{y}_k)f'(a_k)
\end{equation}

- Note that $f'$ is evaluated at the weighted input to that node (written $a_k$ here, $h$ or $z$ elsewhere in these notes), not at the node's output

## Error in the hidden layer
Defined for a given hidden node $j$:

\begin{equation}
\delta_j = \sum [w_{jk}\delta_k]f'(h_j)
\end{equation}

- Wrote that from memory by talking to myself :)

## A general algorithm
... for a simple network with one hidden layer and one output node

### 1. Set the weight steps for each layer to zero
- The input-to-hidden-node weight steps $\Delta w_{ij} = 0$
- The hidden-to-output-node weight steps $\Delta W_j = 0$
- __Thought__: These might be programmatically organized (like a `Map a Map ...` [Haskell syntax])

### 2. For each record in the data, do the following:

#### 1. Forward pass for $\hat{y}$
#### 2. Calculate error $y - \hat{y}$

#### 3. Calculate error __gradient__
  - This is our error (we just calculated) times the derivative of the output's activation function
    - $\delta^o = (y - \hat{y})f'(z)$
  - That derivative of the activation function takes a parameter $z$
    - This $z$ is equivalent to the sum of all weighted *activations*
    - Thus, scaling the outputs based on their corresponding inputs' weights

#### 4. Propagate the errors to the hidden layer
- The error term for a given hidden node $j$ is the output layer's error term, times the weight connecting node $j$ to the output, times the derivative of the activation function at that node's weighted input $h_j$
  - In math, it may be more clear as $\delta_j^h = \delta^o W_j f'(h_j)$
    - $\delta^o$ is the output layer's error term, $W_j$ is the weight from hidden node $j$ to the output, and $f'(h_j)$ is the derivative of the activation function at the weighted input $h_j$

#### 5. Update weight steps
  - This is always $\text{old} + \text{correction}$
    - Where $\text{correction} = \text{error term} \times \text{activation}$
  - Or:
    - $\Delta W_j = \Delta W_j + \delta^o a_j$
    - $\Delta w_{ij} = \Delta w_{ij} + \delta_j^h a_i$
      - $i$ for *i*nput, so $a_i$ is just the input value $x_i$

#### 6. Update weights
  - This is always $\text{old} + \text{scaled step}/m$
    - Where $m$ = number of records (gotta be consistent with the math)
    - and $\eta$ is our multiplier/accelerator friend
  - Or:
    - $W_j = W_j + \eta \Delta W_j / m$
    - $w_{ij} = w_{ij} + \eta \Delta w_{ij} / m$
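The steps above can be sketched end to end on toy data. Everything about the data here is an assumption: `features` and `targets` are synthetic stand-ins (not the course's `data_prep` output), and the hyperparameters are picked only so the example runs quickly.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.RandomState(0)
# Synthetic stand-in data (an assumption): 40 records, 3 features,
# target is 1 when the first feature is positive
features = rng.randn(40, 3)
targets = (features[:, 0] > 0).astype(float)

n_records, n_features = features.shape
n_hidden = 2
epochs = 300
learnrate = 0.5

w = rng.normal(scale=n_features ** -0.5, size=(n_features, n_hidden))  # w_ij
W = rng.normal(scale=n_features ** -0.5, size=n_hidden)                # W_j

def mse(w, W):
    out = sigmoid(np.dot(sigmoid(np.dot(features, w)), W))
    return np.mean((targets - out) ** 2)

losses = [mse(w, W)]
for e in range(epochs):
    # Step 1: zero the weight steps
    del_w = np.zeros(w.shape)
    del_W = np.zeros(W.shape)
    for x, y in zip(features, targets):
        a = sigmoid(np.dot(x, w))                # sub-step 1: forward pass
        y_hat = sigmoid(np.dot(a, W))
        error = y - y_hat                        # sub-step 2: error
        delta_o = error * y_hat * (1 - y_hat)    # sub-step 3: output error term
        delta_h = delta_o * W * a * (1 - a)      # sub-step 4: hidden error terms
        del_W += delta_o * a                     # sub-step 5: accumulate steps
        del_w += delta_h * x[:, None]
    # sub-step 6: apply the averaged, scaled steps once per pass over the data
    W += learnrate * del_W / n_records
    w += learnrate * del_w / n_records
    losses.append(mse(w, W))

print(losses[0], losses[-1])
```

Accumulating the steps over all records and dividing by $m$ makes each epoch a single averaged gradient descent step, which is why the weight update sits outside the inner loop.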

## One last exercise

As always, the skeleton code:

```python
import numpy as np
from data_prep import features, targets, features_test, targets_test

np.random.seed(21)

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))


# Hyperparameters
n_hidden = 2  # number of hidden units
epochs = 900
learnrate = 0.005

n_records, n_features = features.shape
last_loss = None
# Initialize weights
weights_input_hidden = np.random.normal(scale=1 / n_features ** .5,
                                        size=(n_features, n_hidden))
weights_hidden_output = np.random.normal(scale=1 / n_features ** .5,
                                         size=n_hidden)

for e in range(epochs):
    del_w_input_hidden = np.zeros(weights_input_hidden.shape)
    del_w_hidden_output = np.zeros(weights_hidden_output.shape)
    for x, y in zip(features.values, targets):
        ## Forward pass ##
        # TODO: Calculate the output
        hidden_input = None
        hidden_output = None
        output = None

        ## Backward pass ##
        # TODO: Calculate the network's prediction error
        error = None

        # TODO: Calculate error term for the output unit
        output_error_term = None

        ## propagate errors to hidden layer

        # TODO: Calculate the hidden layer's contribution to the error
        hidden_error = None
        
        # TODO: Calculate the error term for the hidden layer
        hidden_error_term = None
        
        # TODO: Update the change in weights
        del_w_hidden_output += 0
        del_w_input_hidden += 0

    # TODO: Update weights
    weights_input_hidden += 0
    weights_hidden_output += 0

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        hidden_output = sigmoid(np.dot(features, weights_input_hidden))
        out = sigmoid(np.dot(hidden_output,
                             weights_hidden_output))
        loss = np.mean((out - targets) ** 2)

        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss

# Calculate accuracy on test data
hidden = sigmoid(np.dot(features_test, weights_input_hidden))
out = sigmoid(np.dot(hidden, weights_hidden_output))
predictions = out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

```

The supporting files can be found