# BACKPROPAGATION

## Full, batch and stochastic gradient descent

* **Stochastic**: makes a prediction and updates the weights one training example at a time
* **Full**: makes a prediction for every training example, then updates the weights once per pass over the dataset,   
using the average of all per-example `weight_deltas` as the `weight_delta`
* **Batch**: makes a prediction and updates the weights after every `batch_size` training examples;   
`batch_size` is usually between 8 and 256
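
The three variants differ only in how many examples feed each weight update. Below is a minimal sketch, assuming the same single-layer linear network and streetlight data as the cell that follows. The helper `train` and its parameters are illustrative names, not part of the original notes.

```python
import numpy as np

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1],
                         [0, 1, 1],
                         [1, 0, 1]])
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

def train(batch_size, alpha=0.1, epochs=200):
    """Hypothetical helper: gradient descent with a configurable batch size."""
    weights = np.array([0.5, 0.48, -0.7])
    for _ in range(epochs):
        for start in range(0, len(streetlights), batch_size):
            inputs = streetlights[start:start + batch_size]
            targets = walk_vs_stop[start:start + batch_size]
            deltas = inputs.dot(weights) - targets   # one delta per example
            # average the per-example weight_deltas over the batch
            weight_delta = inputs.T.dot(deltas) / len(inputs)
            weights = weights - alpha * weight_delta
    return weights

stochastic = train(batch_size=1)              # update after every example
batch = train(batch_size=2)                   # update after every 2 examples
full = train(batch_size=len(streetlights))    # one update per pass over the data

for name, w in [('stochastic', stochastic), ('batch', batch), ('full', full)]:
    print(name, [f'{x:.3f}' for x in w])
```

Every call makes the same number of passes over the data, so a smaller `batch_size` means more weight updates per pass.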


In [14]:
import numpy as np

weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1

# input: training data
streetlights = np.array( [[1, 0, 1],
                          [0, 1, 1],
                          [0, 0, 1],
                          [1, 1, 1],
                          [0, 1, 1],
                          [1, 0, 1]] )

# input: training labels
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

# first training example (both variables are reassigned inside the training loop below)
input = streetlights[0]         # [1, 0, 1]
target_pred = walk_vs_stop[0]   # 0

for iteration in range(40):
    error_for_all_lights = 0
    
    # Stochastic gradient descent:
    # prediction & weight update for each training example
    for row_index in range(len(walk_vs_stop)):
        
        # current training example (example + label)
        input = streetlights[row_index]
        target_pred = walk_vs_stop[row_index]
        
        pred = input.dot(weights)
        
        error = (pred - target_pred) ** 2
        error_for_all_lights += error
        
        delta = pred - target_pred
        
        weights = weights - (alpha * (input * delta))
        
        if (iteration % 10 == 0):
            print(f'-- {iteration} -- {row_index} --')
            print(f'Error      {error:.3f}')
            print(f'Prediction {pred:.3f}')
            print(f'Delta      {delta:.3f}')
            print(f"Weights    {[f'{w:.3f}' for w in weights]}")

-- 0 -- 0 --
Error      0.040
Prediction -0.200
Delta      -0.200
Weights    ['0.520', '0.480', '-0.680']
-- 0 -- 1 --
Error      1.440
Prediction -0.200
Delta      -1.200
Weights    ['0.520', '0.600', '-0.560']
-- 0 -- 2 --
Error      0.314
Prediction -0.560
Delta      -0.560
Weights    ['0.520', '0.600', '-0.504']
-- 0 -- 3 --
Error      0.147
Prediction 0.616
Delta      -0.384
Weights    ['0.558', '0.638', '-0.466']
-- 0 -- 4 --
Error      0.684
Prediction 0.173
Delta      -0.827
Weights    ['0.558', '0.721', '-0.383']
-- 0 -- 5 --
Error      0.031
Prediction 0.176
Delta      0.176
Weights    ['0.541', '0.721', '-0.400']
-- 10 -- 0 --
Error      0.001
Prediction 0.037
Delta      0.037
Weights    ['0.166', '1.058', '-0.136']
-- 10 -- 1 --
Error      0.006
Prediction 0.922
Delta      -0.078
Weights    ['0.166', '1.066', '-0.128']
-- 10 -- 2 --
Error      0.016
Prediction -0.128
Delta      -0.128
Weights    ['0.166', '1.066', '-0.116']
-- 10 -- 3 --
Error      0.014
Prediction 1.117
...

## How does the network identify correlation?

* The NN identifies **_correlation_** between the output and some of the inputs,   
and **_randomness_** between the output and the rest of the inputs
* each example/batch/iteration exerts either **_up pressure_** or **_down pressure_** on the weights
* pressure comes from the data:
  * each node **_independently_** tries to correctly predict the output
  * the only **_cross communication_** between nodes is that all weights must share the same error measure
  * weight update = this shared error measure multiplied by each weight's respective input
  * each weight is trying to compensate for the error
* error attribution: given a shared error, the NN needs to figure out:
  * which weights contributed = **_relevant nodes for prediction_** -> up pressure
  * which didn't = **_irrelevant nodes_** -> down pressure
  
  
The prediction is a **weighted sum of the input**
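
A minimal sketch of a single update, using the first streetlight example and the starting weights from the cell above. The `pressure` array is an illustrative name, not from the notes, and reading "pressure" as the direction of the weight update is an assumption.

```python
import numpy as np

weights = np.array([0.5, 0.48, -0.7])
inputs = np.array([1, 0, 1])    # first streetlight pattern
target = 0

pred = inputs.dot(weights)      # the prediction is a weighted sum of the input
delta = pred - target           # shared error measure, identical for every weight
weight_deltas = inputs * delta  # each weight's share: its own input times delta

# The update is weights - alpha * weight_deltas, so:
#   negative weight_delta -> the weight is pushed up
#   positive weight_delta -> the weight is pushed down
#   zero input            -> no pressure on that weight for this example
pressure = -np.sign(weight_deltas)

print('prediction   ', pred)
print('weight_deltas', weight_deltas)
print('pressure     ', pressure)
```

The middle weight receives no pressure here because its input is 0, so only the examples where that light is on can move it.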
  
### Edge case: Overfitting

* If a particular configuration of weights **_accidentally_** creates perfect correlation between the predictions and the output dataset (`error == 0`) without giving the heaviest weights to the best inputs, the NN will **_stop learning_**, because a zero error produces a zero weight update
* NNs can find many different configurations that will correctly predict for a subset of training data
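
A small sketch of this edge case. The weights `[1.0, 2.0, -1.0]` are a hypothetical "accidental" configuration chosen for illustration: they predict the first two examples perfectly, so training on that subset never changes them, yet they fail on the remaining examples.

```python
import numpy as np

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
walk_vs_stop = np.array([0, 1, 0, 1])

weights = np.array([1.0, 2.0, -1.0])   # hypothetical accidental configuration
alpha = 0.1

# train only on the first two examples: the error is already 0,
# so delta is 0 and every weight update is 0
for inputs, target in zip(streetlights[:2], walk_vs_stop[:2]):
    delta = inputs.dot(weights) - target
    weights = weights - alpha * inputs * delta

subset_error = np.sum((streetlights[:2].dot(weights) - walk_vs_stop[:2]) ** 2)
full_error = np.sum((streetlights.dot(weights) - walk_vs_stop) ** 2)

print('weights after training on the subset', weights)
print('error on the subset      ', subset_error)
print('error on all four examples', full_error)
```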

### Edge case: Conflicting pressure

* As other nodes learn, they absorb some of the error; they **absorb part of the correlation**.
* They cause the network to predict with **moderate** correlative power, which reduces the error.
* The remaining weights then only adjust to correctly predict **what's left**.

**Regularization**
* Forces weights with conflicting pressure to move toward 0
* If a weight receives equal upward and downward pressure, it is just noise, so it **_shouldn't stay on_**
* So only weights with strong correlation should stay on
* Side effect = **faster training** (fewer iterations)

### Edge case: Data doesn't have correlation

* All inputs have conflicting pressure (equal pressure between up and down)
* Create correlation with an intermediate layer, trained with **backpropagation**
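
A sketch of what "no correlation" looks like, assuming the four-example streetlight dataset with labels `[1, 1, 0, 0]` used in the backpropagation cell later in these notes. `np.linalg.lstsq` is used here only to find the best possible weights for a network with no intermediate layer. It is not how the notes train the network.

```python
import numpy as np

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
walk_vs_stop = np.array([1, 1, 0, 0])

# best achievable weights for a direct input -> output network (least squares)
weights, *_ = np.linalg.lstsq(streetlights, walk_vs_stop, rcond=None)
preds = streetlights.dot(weights)
deltas = preds - walk_vs_stop
error = np.sum(deltas ** 2)

# pressure of each example on each weight (input * delta), one row per example
per_example = streetlights * deltas[:, None]

print('best weights', np.round(weights, 3))
print('error       ', round(float(error), 3))
print('pressure per example:\n', np.round(per_example, 3))
print('net pressure', np.round(per_example.sum(axis=0), 3))
```

At these weights the upward and downward pressure on every input cancels, yet the error is still not 0.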

## Backpropagation

* Add an intermediate layer to create correlation: an intermediate dataset that correlates with the output:
  * `layer_0` input
  * `layer_1` intermediate network (new input)
  * `layer_2` prediction   
<br/>       
   
* If `layer_2` is too high by *x* amount, how do we know which values at `layer_1` contributed to the error?
  * the ones with **_higher weights_** among `weights_1_2`
  * `weights_1_2` exactly describe:
    * how much each `layer_1` node contributed to `layer_2` prediction
    * in other words, how much each contributed to `layer_2` error   
<br/>       
   
* We use the `delta` at `layer_2` to figure out the `delta` at `layer_1` = `delta * weights_1_2`   

  = If you want this node to be *x* amount higher:
    * then each of the previous nodes needs to be `x * weights_1_2` amount higher/lower,   
    * because these weights were amplifying the prediction by `weights_1_2` times    
  
  = Prediction logic in reverse   

* Weight update = multiply a node's `delta` by its input `value`, then adjust its `weight` by that much (scaled with `alpha`)
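
A worked sketch of this step for one example, without any nonlinearity. The activations, weights and target below are hypothetical values chosen for illustration.

```python
import numpy as np

layer_1 = np.array([[0.5, 0.0, 2.0]])             # hypothetical layer_1 values
weights_1_2 = np.array([[0.1], [0.9], [-0.3]])    # hypothetical weights, 3 nodes -> 1 output
target = np.array([[1.0]])
alpha = 0.1

# forward: the prediction is a weighted sum of layer_1
layer_2 = layer_1.dot(weights_1_2)
layer_2_delta = layer_2 - target

# backward: send the output delta back through the same weights;
# the node with the largest weight gets the largest share of the delta
layer_1_delta = layer_2_delta.dot(weights_1_2.T)

# weight update: input value times output delta, scaled by alpha
new_weights_1_2 = weights_1_2 - alpha * layer_1.T.dot(layer_2_delta)
new_layer_2 = layer_1.dot(new_weights_1_2)

print('layer_2      ', layer_2)
print('layer_1_delta', layer_1_delta)
print('new layer_2  ', new_layer_2)
```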

### Linear vs nonlinear

* The intermediate layer:
  * Each node in the intermediate layer has a certain amount of correlation with each input node
  * If the weight from an input node to a middle node is 1.0, then the middle node subscribes to 100% of that input node's movement
  * So if that input node goes up by 0.3, the middle node will follow
  * If the weight from an input node is 0.5, the middle node subscribes to 50% of that input node's movement
  * The only way for a middle node to escape the correlation of a particular input node is to take correlation from another input node    
  = Each middle node subscribes to a little correlation from each input node   
* We need the middle layer to **selectively correlate** with the input nodes = **_conditional correlation_**  
  * middle nodes either correlate by `weight`% or not at all   
   
<p style="background:#DDEEEE;padding: 15px;">
<b>Nonlinearity</b> = if the node is negative, set it to 0
<br/>
By turning off any middle node whenever it would be negative, you allow the network to subscribe to correlation from the most important inputs.
<br/>
This is the simplest type of nonlinearity; it's called <b>relu</b>
</p>
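
A small sketch of conditional correlation with a single middle node. The weights `1.0` and `-1.0` are hypothetical: without `relu` the node always follows the first input, while with `relu` it follows the first input only when the second input is off.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

inputs = np.array([[1, 0],
                   [0, 1],
                   [1, 1]])
weights = np.array([[1.0],
                    [-1.0]])      # hypothetical weights into one middle node

linear = inputs.dot(weights)      # plain weighted sum
gated = relu(linear)              # negative values are switched off

for row, lin, gat in zip(inputs, linear.ravel(), gated.ravel()):
    print(f'input {row}  linear {lin:+.1f}  relu {gat:+.1f}')
```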

Adjusting the weights to reduce the error over a series of training examples ultimately searches for correlation between the input and the output layers. If no correlation exists, then the error will never reach 0.

Neural networks **search for correlation** between input and output **by adjusting weights**.

In [1]:
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    # derivative of relu: True (1) where output > 0, otherwise False (0)
    return output > 0

# training data
streetlights = np.array( [[1, 0, 1],
                          [0, 1, 1],
                          [0, 0, 1],
                          [1, 1, 1]] )

# training labels
walk_vs_stop = np.array([[1, 1, 0, 0]]).T  # column vector, shape (4, 1); .T on a 1-D array would do nothing

# learning parameters
alpha = 0.2
hidden_size = 4
weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1  # 3*4 matrix (inputs*hidden nodes)
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1  # 4*1 matrix (hidden nodes*outputs)
# dimensions are based on previous and next layers

for iteration in range(61):
    layer_2_error = 0
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        
        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)
        
        layer_2_delta = layer_2 - walk_vs_stop[i:i+1]
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        # layer_1_delta = backpropagation: compute layer_1 delta given layer_2 delta
        # relu2deriv(layer_1) = 
        #   if relu set the output of a layer_1 node to 0,
        #   then that node didn't contribute to the error, 
        #   so it shouldn't have any impact on the weight update either,
        #   so we set the delta of that node to 0.
        
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)
    
    if iteration % 10 == 0:
        print('----')
        print(f'Error {layer_2_error:.3f}')
        print(f"Prediction {layer_2}")

----
Error 1.414
Prediction [[0.]]
----
Error 0.597
Prediction [[0.27767701]]
----
Error 0.326
Prediction [[0.19402478]]
----
Error 0.067
Prediction [[0.04904907]]
----
Error 0.005
Prediction [[0.0030369]]
----
Error 0.000
Prediction [[0.]]
----
Error 0.000
Prediction [[0.]]


### Backpropagation =
* Once you know how much the final prediction should move up or down (`delta`), you need to figure out how much each middle (`layer_1`) node should move up or down = **_intermediate predictions_**.
* If `relu` set the output of a `layer_1` node to 0, then it didn't contribute to the `error`, so we set the `delta` of this node to 0 (`relu2deriv`).
* Once you have the `delta` at `layer_1`, you can calculate a weight update:
  * for each weight, multiply its input value by its output `delta`
  * then subtract that amount, scaled by `alpha`, from the `weight` value   
<br/>   
  
* When there is no direct correlation between input and output, the intermediate layers try to **identify/create configurations of features** that may or may not correlate with the output (e.g. for output=cat: an ear, cat eyes, cat hair)
* The presence of many such configurations will give the final layer the information (correlation) it needs to correctly predict the output.
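
One way to see the correlation that the intermediate layer creates is to compare how well the labels can be fit linearly from `layer_0` and from `layer_1` after training. The sketch below reruns the training loop from the cell above. The helper `best_linear_error` is a hypothetical name, and the least-squares fit is only a measuring tool here, not part of the network.

```python
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
walk_vs_stop = np.array([[1, 1, 0, 0]]).T   # column vector, shape (4, 1)

alpha, hidden_size = 0.2, 4
weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

# same training loop as the cell above
for iteration in range(61):
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(layer_0.dot(weights_0_1))
        layer_2 = layer_1.dot(weights_1_2)
        layer_2_delta = layer_2 - walk_vs_stop[i:i+1]
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)

def best_linear_error(features, labels):
    """Hypothetical helper: squared error of the best linear fit to the labels."""
    fit, *_ = np.linalg.lstsq(features, labels, rcond=None)
    return np.sum((features.dot(fit) - labels) ** 2)

hidden = relu(streetlights.dot(weights_0_1))   # the intermediate dataset
raw_error = best_linear_error(streetlights, walk_vs_stop)
hidden_error = best_linear_error(hidden, walk_vs_stop)

print(f'best linear fit on layer_0: {raw_error:.3f}')
print(f'best linear fit on layer_1: {hidden_error:.3f}')
```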