# GRADIENT DESCENT

How to set weight values so the network predicts accurately.
NN =
1. **predict**
  
2. **compare**:
  * measure how much a prediction missed by
  * only says 'a lot' or 'a little'
  * amplify bigger errors, minimize smaller ones
  
3. **learn**:
  * figuring out how each weight played its part in creating error
  * calculating a number for each weight
  * number that says how that weight should be higher or lower to reduce error
  * = adjusting weights up and down so the error is reduced   
  * = modifying specific parts of an error function until the `error` value goes to 0

NN = *searching* for the best possible configuration of weights so the network's error falls to 0

# Compare

In [1]:
weight, input, target_pred = (0.5, 0.5, 0.8)
pred = input * weight
error = (pred - target_pred) ** 2
print(error)

0.30250000000000005


* **error squared**:
  * forces the output to be positive: `(pred - target_pred)` could be negative
  * prioritization: squaring makes big errors (>1) bigger and small errors (<1) smaller
  * why only positive error: so that negative errors don't cancel positive errors when summing or averaging the errors
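
The points above can be checked with a few made-up numbers. This is only a sketch: the values in `raw_errors` are hypothetical and are not taken from the network in this notebook.

```python
# Hypothetical signed errors (pred - target_pred) for four predictions
raw_errors = [-2.0, 2.0, 0.5, -0.5]

# Averaging the signed errors: positives and negatives cancel out,
# so the mean suggests a perfect network even though every prediction missed
mean_raw = sum(raw_errors) / len(raw_errors)

# Squaring first: every term is non-negative, so nothing cancels.
# Errors larger than 1 grow (2.0 -> 4.0), errors smaller than 1 shrink (0.5 -> 0.25)
squared_errors = [e ** 2 for e in raw_errors]
mean_squared = sum(squared_errors) / len(squared_errors)

print(f'mean of raw errors    : {mean_raw}')
print(f'squared errors        : {squared_errors}')
print(f'mean of squared errors: {mean_squared}')
```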

# Hot and cold learning

Iteratively try moving the weight both up and down, and see which direction reduces the error:
* after a prediction, you predict 2 more times:
  * once with higher weight, once with lower weight
* then move `weight` depending on which direction gave the smaller error (up or down)
* repeat until `error` is 0    
   
### Problem 1: 
It's inefficient to predict multiple times to make a single `weight` update
  
### Problem 2: 
Unless the distance between the starting `weight` and the ideal weight is an exact multiple of `step_amount` (exactly `n*step_amount`):
  * the network will overshoot by some amount less than `step_amount`
  * then it will start alternating back and forth between predictions higher and lower than `target_pred`

So even though you know the correct **_direction_** to move `weight`, you don't know the correct **_amount_**, so you pick a fixed step arbitrarily and see what happens
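
The oscillation described in Problem 2 can be reproduced with a deliberately coarse step. This is a minimal sketch, not the notebook's own cell: the helper `hot_and_cold`, the name `input_value` (used instead of `input` to avoid shadowing the Python builtin) and `step_amount = 0.2` are assumptions chosen for illustration. The ideal weight is 1.6, which is 1.1 away from the start, so it is not a multiple of 0.2.

```python
def hot_and_cold(weight, input_value, target_pred, step_amount, iterations):
    """Run hot and cold learning and return the prediction made at each iteration."""
    preds = []
    for _ in range(iterations):
        preds.append(input_value * weight)

        # error if the weight is nudged up or down by one step
        up_error = (input_value * (weight + step_amount) - target_pred) ** 2
        dn_error = (input_value * (weight - step_amount) - target_pred) ** 2

        # move in whichever direction gives the smaller error
        if dn_error < up_error:
            weight -= step_amount
        elif up_error < dn_error:
            weight += step_amount
    return preds


preds = hot_and_cold(weight=0.5, input_value=0.5, target_pred=0.8,
                     step_amount=0.2, iterations=12)

# The last predictions keep jumping from one side of target_pred to the other
print([round(p, 2) for p in preds[-4:]])
```

The error never reaches 0 here: the weight keeps stepping over the ideal value in both directions.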

In [7]:
weight, input, target_pred = (0.5, 0.5, 0.8)
step_amount = 0.001  # how much to move weights in each iteration

for iteration in range(1101):
    pred  = input * weight
    error = (pred - target_pred) ** 2
    print(f'{iteration} Error: {error}  Prediction: {pred}')
    
    # try up
    up_pred = input * (weight + step_amount)
    up_error = (up_pred - target_pred) ** 2
    
    # try down
    dn_pred = input * (weight - step_amount)
    dn_error = (dn_pred - target_pred) ** 2
    
    # if down is better, go down
    if (dn_error < up_error):
        weight = weight - step_amount
    
    # if up is better, go up
    if (dn_error > up_error):
        weight = weight + step_amount

0 Error: 0.30250000000000005  Prediction: 0.25
1 Error: 0.3019502500000001  Prediction: 0.2505
2 Error: 0.30140100000000003  Prediction: 0.251
3 Error: 0.30085225  Prediction: 0.2515
4 Error: 0.30030400000000007  Prediction: 0.252
5 Error: 0.2997562500000001  Prediction: 0.2525
6 Error: 0.29920900000000006  Prediction: 0.253
7 Error: 0.29866224999999996  Prediction: 0.2535
8 Error: 0.29811600000000005  Prediction: 0.254
9 Error: 0.2975702500000001  Prediction: 0.2545
10 Error: 0.29702500000000004  Prediction: 0.255
11 Error: 0.29648025  Prediction: 0.2555
12 Error: 0.29593600000000003  Prediction: 0.256
13 Error: 0.2953922500000001  Prediction: 0.2565
14 Error: 0.294849  Prediction: 0.257
15 Error: 0.29430625  Prediction: 0.2575
16 Error: 0.293764  Prediction: 0.258
17 Error: 0.2932222500000001  Prediction: 0.2585
18 Error: 0.292681  Prediction: 0.259
19 Error: 0.29214025  Prediction: 0.2595
20 Error: 0.2916  Prediction: 0.26
21 Error: 0.2910602500000001  Prediction: 0.2605
22 Error: 0

# Calculating both direction and amount: Gradient Descent

In [40]:
weight, input, target_pred = (0.5, 0.5, 0.8)

for iteration in range(5001):
    pred = input * weight
    error = (pred - target_pred) ** 2
    
    delta = (pred - target_pred)
    weight_delta = delta * input    # GD: scaling, negative reversal, and stopping
                                    # weight_delta = direction + amount
    
    alpha = 0.01                            # controls how fast the network learns
    weight -= weight_delta * alpha
    
    if (iteration % 250 == 0) or (iteration == 5000):
        print(f'-----{str(iteration)}')
        print(f'Weight: {weight:.2f}')
        print(f'Error : {error:.2f}  Prediction: {pred:.2f}')
        print(f'Delta : {delta:.2f} Weight delta: {weight_delta:.2f}')

# input = 1.1 -> pred reaches 0.8 after roughly 300 iterations (the output below is from this run)
# input = 0.5 -> takes roughly 5000 iterations

-----0
Weight: 0.50
Error : 0.06  Prediction: 0.55
Delta : -0.25 Weight delta: -0.28
-----250
Weight: 0.72
Error : 0.00  Prediction: 0.79
Delta : -0.01 Weight delta: -0.01
-----500
Weight: 0.73
Error : 0.00  Prediction: 0.80
Delta : -0.00 Weight delta: -0.00
-----750
Weight: 0.73
Error : 0.00  Prediction: 0.80
Delta : -0.00 Weight delta: -0.00
-----1000
Weight: 0.73
Error : 0.00  Prediction: 0.80
Delta : -0.00 Weight delta: -0.00
-----1250
Weight: 0.73
Error : 0.00  Prediction: 0.80
Delta : -0.00 Weight delta: -0.00
-----1500
Weight: 0.73
Error : 0.00  Prediction: 0.80
Delta : -0.00 Weight delta: -0.00
-----1750
Weight: 0.73
Error : 0.00  Prediction: 0.80
Delta : -0.00 Weight delta: -0.00
-----2000
Weight: 0.73
Error : 0.00  Prediction: 0.80
Delta : -0.00 Weight delta: -0.00
-----2250
Weight: 0.73
Error : 0.00  Prediction: 0.80
Delta : -0.00 Weight delta: -0.00
-----2500
Weight: 0.73
Error : 0.00  Prediction: 0.80
Delta : -0.00 Weight delta: -0.00
-----2750
Weight: 0.73
Error : 0.00  Prediction: 0.80
... (output truncated)

* **`weight_delta`**:
  * `delta` = the *pure error*, raw amount that node was too high or too low
  * `* input` = scaling the output node's `delta` by the weight's `input`   
  -> the **weight delta**    
<br />    
    
* **pure error**:
  * raw *direction* and *amount* you missed
  * positive = you predicted too high, negative = predicted too low     
<br />  
    
* **stopping**:
  * first and simplest effect of multiplying pure error by `input`
  * if `input` is 0, then `weight_delta` will be 0 = nothing to learn, no moving the weight   
<br />  
    
* **negative reversal**:
  * most important effect of multiplying pure error by `input`
  * ensures that weight **_moves in the correct direction_** even if `input` is negative
    * if `input` is positive, moving `weight` up makes `pred` move *up*
    * if `input` is negative, moving `weight` up makes `pred` move *down* -> reversed!   
    = the direction in which the weight must move flips!   
    solution: multiply by `input` and it will **_reverse the sign_** of `weight_delta` if `input` is negative   
<br />  
    
* **scaling**:
  * third effect caused by multiplying pure error by `input`
  * more of a side effect
  * if `input` is big, `weight` update should also be big
  * use `alpha` to avoid getting out of control    
<br />  
    
* **alpha**:
  * control how fast the network learns
  * if learns too fast, will update weights too aggressively -> overshoot  
<br />  

**Summary.** If my `input` was 0, then my weight wouldn't have mattered, and I wouldn't change a thing (*stopping*). If my `input` was negative, then I'd want to decrease my weight instead of increasing it (*negative reversal*). But my `input` is positive and quite large, so I'm *guessing* that my personal prediction mattered a lot to the aggregated output. I'm going to move my weight up a lot to compensate (*scaling*).
<br />  
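
The three effects of multiplying the pure error by `input` can be seen by holding `delta` fixed and varying only the input. This is a small sketch under assumed values: `compute_weight_delta`, `input_value` and the numbers below are illustrative and not part of the original cells.

```python
def compute_weight_delta(delta, input_value):
    """Scale the pure error by the input, as the training loop does."""
    return delta * input_value


delta = -0.25  # pure error: the prediction was too low by 0.25

# stopping: with a zero input the weight had no effect, so there is no update
stopped = compute_weight_delta(delta, 0.0)

# positive input: weight_delta is negative, so `weight -= weight_delta * alpha`
# raises the weight, which raises the prediction
positive = compute_weight_delta(delta, 1.1)

# negative reversal: the sign flips, so the weight is lowered instead,
# which raises the prediction because the input is negative
reversed_ = compute_weight_delta(delta, -1.1)

# scaling: an input 10 times larger gives an update 10 times larger
scaled = compute_weight_delta(delta, 11.0)

print(stopped, positive, reversed_, scaled)
```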


## Relationship between weight and error

#### Sensitivity

* How changing one variable changes the other (direction and amount)
  * how sensitive `error` is to `weight`
  * the direction and amount that `error` changes when you change `weight`   
  = **the goal**
  
* How can we use the formula to know how to change `weight` so that `error` moves to a particular direction?
  
  
#### Derivative

* Represents the direction and amount that one variable changes if you change the other variable.
  Defines the relationship between each weight and how much you missed.
  Ex: if derivative is 2, both variables move in the same direction, and one moves twice as much as the other   
<br />

* **positive derivative**: when you change one variable, the other moves in the *same* direction    
  = *positive sensitivity*
* **negative derivative**: when you change one variable, the other moves in the *opposite* direction    
  = *negative sensitivity*
* **zero derivative**: one variable stays fixed regardless of changes to the other    
  = *zero sensitivity*
<br />
  
* Derivative = the slope at a point on a line or curve:
  * curve = U-shaped
  * middle lower point at which `error` = 0
  * right of that point: slope is positive - left of that point: slope is negative
  * the farther away from the *goal weight* you move, the steeper the slope gets
  * **slope's sign** = direction - **slope's steepness** = amount   
<br />  
  
* We can use the slope (derivative, `direction_and_amount` or `weight_delta`) to know which direction reduces the error    
  Based on the steepness, you can get at least an idea of how far away you are from optimal point where slope is 0
  
= How to change one variable so that you can move another variable in a direction
NN = bunch of weights we use to compute an error function   
**_Gradient Descent_** = finding error minimums
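
The shape of the curve can be checked numerically. The sketch below estimates the slope of `error` with respect to `weight` at a few points, using the same `input` (0.5) and `target_pred` (0.8) as the cells above. The helper names `error_at` and `numerical_slope`, and the list of sample weights, are assumptions made for this example. For this error function the exact slope is `2 * input * (pred - target_pred)`, so the `weight_delta` used in the training loop is the same quantity without the constant factor 2, which `alpha` absorbs.

```python
INPUT_VALUE = 0.5   # same value as `input` in the cells above
TARGET_PRED = 0.8


def error_at(weight):
    """Squared error of the one-weight network at a given weight."""
    return (INPUT_VALUE * weight - TARGET_PRED) ** 2


def numerical_slope(weight, eps=1e-6):
    """Central-difference estimate of d(error)/d(weight)."""
    return (error_at(weight + eps) - error_at(weight - eps)) / (2 * eps)


goal_weight = TARGET_PRED / INPUT_VALUE  # 1.6, where the error is 0

for weight in (0.5, 1.0, goal_weight, 2.0, 3.0):
    analytic = 2 * INPUT_VALUE * (INPUT_VALUE * weight - TARGET_PRED)
    print(f'weight={weight:.1f}  slope~{numerical_slope(weight):+.3f}  '
          f'analytic={analytic:+.3f}')
```

Left of the goal weight the slope is negative, right of it the slope is positive, and its magnitude grows with the distance from the goal weight.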

## Divergence & Alpha

* If input is large, weight update can be large even if error is small    
  -> Network **overcorrects** -> causes a phenomenon called **divergence**    
  = predictions explode and alternate from negative to positive, farther and farther away from target prediction   
<br />   

* **Alpha**: simplest way to prevent overcorrecting weight updates:   

  * *problem*: if input too big, weight update overcorrects
  * *symptom*: overshooting.  
    when you overcorrect, new derivative is even larger in magnitude than previous steps
  * *solution*: multiply weight update by **_alpha_**, real-valued number between 0..1, a fraction that will make it smaller   
    If alpha is small (0.01), will reduce weight update considerably and prevent overshooting   
<br />   
  
* Finding **appropriate alpha** often done by guessing, even for state-of-the-art NNs
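
Divergence and the effect of alpha can be reproduced with a larger input. This is a minimal sketch: the `train` helper, `input_value = 2.0` and the two alpha values are assumptions picked to make the effect visible, not settings taken from the cells above.

```python
def train(alpha, weight=0.5, input_value=2.0, target_pred=0.8, iterations=20):
    """Run gradient descent on one weight and return the prediction at each step."""
    preds = []
    for _ in range(iterations):
        pred = input_value * weight
        preds.append(pred)

        delta = pred - target_pred
        weight_delta = delta * input_value
        weight -= weight_delta * alpha
    return preds


# alpha = 1.0: every update overcorrects, so each miss is larger than the
# previous one and lands on the opposite side of target_pred
diverging = train(alpha=1.0)

# alpha = 0.1: the same update, scaled down, settles on target_pred
converging = train(alpha=0.1)

print('alpha=1.0:', [round(p, 1) for p in diverging[:5]])
print('alpha=0.1:', [round(p, 3) for p in converging[-3:]])
```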