<a href="https://colab.research.google.com/github/rahiakela/grokking-deep-learning/blob/4-gradient-descent/gradient_descent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to neural learning: gradient descent

## Predict, compare, and learn

We learned about the paradigm “predict, compare, learn,” and we dove
deep into the first step: **predict**. So now we cover the next two steps of the paradigm: **compare and learn**.

### Compare

**Comparing gives a measurement of how much a prediction
“missed” by.**

Once you’ve made a prediction, the next step is to evaluate how well you did. This may
seem like a simple concept, but you’ll find that coming up with a good way to measure
error is one of the most important and complicated subjects of deep learning.

You’ll also learn that error is always positive! We’ll consider the analogy of an
archer hitting a target: whether the shot is too low by an inch or too high by an inch, the
error is still just 1 inch. 

In the neural network compare step, you need to consider these
kinds of properties when measuring error.

Here we evaluate only one simple way of measuring error: mean
squared error. It’s but one of many ways to evaluate the accuracy of a neural network.
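
As a minimal sketch of what "mean squared error" means when there is more than one prediction: square each miss, then average the squares. The helper name `mean_squared_error` and the sample values are assumptions made up for illustration; they are not part of the original notebook.

```python
def mean_squared_error(predictions, goals):
    """Average of the squared differences between predictions and goals."""
    squared_errors = [(pred - goal) ** 2 for pred, goal in zip(predictions, goals)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical values, chosen only for illustration.
predictions = [0.5, 1.0, 2.0]
goals = [1.0, 1.0, 1.0]

mse = mean_squared_error(predictions, goals)
print(mse)
```

With a single prediction, this reduces to the `(pred - goal_pred) ** 2` expression used in the rest of this notebook.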



### Learn
**Learning tells each weight how it can change to reduce the error.**

Learning is all about error attribution, or the art of figuring out how each weight played its part in creating error. It’s the blame game of deep learning.

So we’ll spend time looking at the most popular version of the deep learning blame game: **gradient descent**.

At the end of the day, it results in computing a number for each weight. That number
represents how that weight should be higher or lower in order to reduce the error. Then
you’ll move the weight according to that number, and you’ll be finished.

## Compare: Does your network make good predictions?
**Let’s measure the error and find out!**

<img src='https://github.com/rahiakela/img-repo/blob/master/measure-error-1.JPG?raw=1' width='800'/>

In [0]:
knob_weight = 0.5
input = 0.5
goal_pred = 0.8

pred = input * knob_weight

error = (pred - goal_pred) ** 2
print(error)

0.30250000000000005


## Why measure error?
**Measuring error simplifies the problem.**

The goal of training a neural network is to make correct predictions. That’s what you want.
And in the most pragmatic world you want the
network to take input that you can easily calculate (today’s stock price) and predict things that
are hard to calculate (tomorrow’s stock price). That’s what makes a neural network useful.

It turns out that changing knob_weight to make the network correctly predict
goal_prediction is slightly more complicated than changing knob_weight to make
error == 0. There’s something more concise about looking at the problem this way.

In [0]:
knob_weight = 0.9
input = 0.5
goal_pred = 0.8

pred = input * knob_weight

error = (pred - goal_pred) ** 2
print(error)

0.12250000000000003


**Different ways of measuring error prioritize error differently.**

By squaring the error, numbers that are less than 1 get smaller, whereas numbers that are greater than 1 get bigger. You’re going to change what I call pure error (pred - goal_pred) so that bigger errors become very big and smaller errors quickly become irrelevant.

By measuring error this way, you can prioritize big errors over smaller ones. When you have
somewhat large pure errors (say, 10), you’ll tell yourself that you have very large error ($10^2 = 100$), and in contrast, when you have small pure errors (say, 0.01), you’ll tell yourself that you
have very small error ($0.01^2 = 0.0001$). See what I mean about prioritizing? It’s just modifying
what you consider to be error so that you amplify big ones and largely ignore small ones.

In contrast, if you took the absolute value instead of squaring the error, you wouldn’t have this
type of prioritization. The error would just be the positive version of the pure error—which
would be fine, but different.
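
To make the prioritization concrete, here is a small sketch comparing squared error with absolute error for the two pure errors mentioned above (10 and 0.01). The variable names are assumptions chosen for this illustration.

```python
pure_errors = [10, 0.01]  # the two example values from the text

for pure_error in pure_errors:
    squared = pure_error ** 2
    absolute = abs(pure_error)
    print(f'pure: {pure_error}  squared: {squared}  absolute: {absolute}')

# How many times larger is the big error than the small one under each measure?
ratio_squared = (10 ** 2) / (0.01 ** 2)
ratio_absolute = abs(10) / abs(0.01)
print(ratio_squared, ratio_absolute)
```

Under absolute error the big miss counts roughly a thousand times more than the small one; under squared error it counts roughly a million times more.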

**Why do you want only positive error?**

Eventually, you’ll be working with millions of input -> goal_prediction pairs, and you’ll
still want to make accurate predictions. So, you’ll try to take the average error down to 0.

This presents a problem if the error can be positive and negative.

Imagine if you were
trying to get the neural network to correctly predict two datapoints—two input ->
goal_prediction pairs. 

If the first had an error of 1,000 and the second had an error of
–1,000, then the average error would be zero! 

You’d fool yourself into thinking you predicted
perfectly, when you missed by 1,000 each time! That would be really bad. 

Thus, you want the
error of each prediction to always be positive so they don’t accidentally cancel each other out
when you average them.
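
The cancellation described above is easy to check. This sketch uses the two pure errors from the text (1,000 and -1,000); the variable names are assumptions for illustration.

```python
pure_errors = [1000, -1000]  # the two datapoints from the text

# Averaging signed errors hides both misses.
average_pure_error = sum(pure_errors) / len(pure_errors)

# Squaring first keeps every term positive, so nothing cancels.
average_squared_error = sum(e ** 2 for e in pure_errors) / len(pure_errors)

print(average_pure_error)
print(average_squared_error)
```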

## What’s the simplest form of neural learning?

**Learning using the hot and cold method.**

At the end of the day, learning is really about one thing: adjusting knob_weight either up
or down so the error is reduced. If you keep doing this and the error goes to 0, you’re done
learning! How do you know whether to turn the knob up or down? Well, you try both up and
down and see which one reduces the error! Whichever one reduces the error is used to update
knob_weight. It’s simple but effective. After you do this over and over again, eventually
error == 0, which means the neural network is predicting with perfect accuracy.

### Hot and cold learning

Hot and cold learning means wiggling the weights to see which direction reduces the error
the most, moving the weights in that direction, and repeating until the error gets to 0.

### An empty network

<img src='https://github.com/rahiakela/img-repo/blob/master/hot-and-cold-learning-1.JPG?raw=1' width='800'/>


In [0]:
weight = 0.1
lr = 0.01

def neural_network(input, weight):
  prediction = input * weight
  return prediction

### PREDICT: Making a prediction and evaluating error

<img src='https://github.com/rahiakela/img-repo/blob/master/hot-and-cold-learning-2.JPG?raw=1' width='800'/>

In [0]:
number_of_toes = [8.5]
win_or_lose_binary = [1] # (won!!!)

input = number_of_toes[0]
true = win_or_lose_binary[0]

pred = neural_network(input, weight)

error = (pred - true) ** 2
print(error)

0.022499999999999975


### COMPARE: Making a prediction with a higher weight and evaluating error

<img src='https://github.com/rahiakela/img-repo/blob/master/hot-and-cold-learning-3.JPG?raw=1' width='800'/>

In [0]:
lr = 0.1  # step for trying a higher weight (note: 10x the 0.01 step used for the lower try)

pred_up = neural_network(input, weight + lr)

err_up = (pred_up - true) ** 2
print(err_up)

0.49000000000000027


In [0]:
lr = 0.01  # lower

pred_down = neural_network(input, weight - lr)

err_down = (pred_down - true) ** 2
print(err_down)

0.05522499999999994


### COMPARE + LEARN: Comparing the errors and setting the new weight

<img src='https://github.com/rahiakela/img-repo/blob/master/hot-and-cold-learning-4.JPG?raw=1' width='800'/>

In [0]:
if (error > err_down or error > err_up):
  if (err_down < err_up):
    weight -= lr
  if (err_up < err_down):   # compare against err_down, not err_up itself
    weight += lr

This reveals what learning in neural networks really is: a search problem. You’re searching
for the best possible configuration of weights so the network’s error falls to 0 (and predicts
perfectly). As with all other forms of search, you might not find exactly what you’re looking
for, and even if you do, it may take some time.

## Hot and cold learning
**This is perhaps the simplest form of learning.**

#### Complete Code

In [0]:
weight = 0.5
input = 0.5

goal_prediction = 0.8

step_amount = 0.001  # How much to move the weights each iteration

# Repeat learning many times so the error can keep getting smaller.
for iteration in range(1101):
  prediction = input * weight
  error = (prediction - goal_prediction) ** 2

  print(f'Error: {str(error)}\t\tPrediction: {str(prediction)}')

  up_prediction = input * (weight + step_amount)   # try up!
  up_error = (goal_prediction - up_prediction) ** 2

  down_prediction = input * (weight - step_amount)  # try down!
  down_error = (goal_prediction - down_prediction) ** 2

  if (down_error < up_error):
    weight = weight - step_amount    # If down is better, go down!
  if (down_error > up_error):
    weight = weight + step_amount    # If up is better, go up!

### Characteristics of hot and cold learning
**It’s simple.**

Hot and cold learning is simple. After making a prediction, you predict two more times, once with a
slightly higher weight and again with a slightly lower weight. You then move weight depending on
which direction gave a smaller error. Repeating this enough times eventually reduces error to 0.

#### Problem 1: It’s inefficient.


You have to predict multiple times to make a single knob_weight update. This seems very inefficient.

#### Problem 2: Sometimes it’s impossible to predict the exact goal prediction.

With a set step_amount, unless the perfect weight is exactly n*step_amount away, the network
will eventually overshoot by some number less than step_amount. When it does, it will then
start alternating back and forth between each side of goal_prediction. Set step_amount to 0.2
to see this in action. If you set step_amount to 10, you’ll really break it. When I try this, I see the
following output.

It never remotely comes close to 0.8!

In [0]:
weight = 0.5
input = 0.5

goal_prediction = 0.8

step_amount = 0.2  # How much to move the weights each iteration

# Repeat learning many times so the error can keep getting smaller.
for iteration in range(1101):
  prediction = input * weight
  error = (prediction - goal_prediction) ** 2

  print(f'Error: {str(error)}\t\tPrediction: {str(prediction)}')

  up_prediction = input * (weight + step_amount)   # try up!
  up_error = (goal_prediction - up_prediction) ** 2

  down_prediction = input * (weight - step_amount)  # try down!
  down_error = (goal_prediction - down_prediction) ** 2

  if (down_error < up_error):
    weight = weight - step_amount    # If down is better, go down!
  if (down_error > up_error):
    weight = weight + step_amount    # If up is better, go up!

In [0]:
weight = 0.5
input = 0.5

goal_prediction = 0.8

step_amount = 10  # How much to move the weights each iteration

# Repeat learning many times so the error can keep getting smaller.
for iteration in range(1101):
  prediction = input * weight
  error = (prediction - goal_prediction) ** 2

  print(f'Error: {str(error)}\t\tPrediction: {str(prediction)}')

  up_prediction = input * (weight + step_amount)   # try up!
  up_error = (goal_prediction - up_prediction) ** 2

  down_prediction = input * (weight - step_amount)  # try down!
  down_error = (goal_prediction - down_prediction) ** 2

  if (down_error < up_error):
    weight = weight - step_amount    # If down is better, go down!
  if (down_error > up_error):
    weight = weight + step_amount    # If up is better, go up!

The real problem is that even though you know the correct direction to move weight, you don’t know
the correct amount. Instead, you pick a fixed one at random (step_amount). Furthermore, this amount
has nothing to do with error. Whether error is big or tiny, step_amount is the same. 

So, hot and cold
learning is kind of a bummer. It’s inefficient because you predict three times for each weight update, and
step_amount is arbitrary, which can prevent you from learning the correct weight value.
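
To compare step sizes without scrolling through 1,101 printed lines each time, the loop can be wrapped in a function that returns only the final prediction. This is a sketch under assumptions: the name `hot_and_cold` and its parameters are hypothetical, and the update rule is the same one used in the cells above.

```python
def hot_and_cold(step_amount, weight=0.5, input_value=0.5,
                 goal_prediction=0.8, iterations=1101):
    """Run hot and cold learning and return the final prediction."""
    for _ in range(iterations):
        up_error = (goal_prediction - input_value * (weight + step_amount)) ** 2
        down_error = (goal_prediction - input_value * (weight - step_amount)) ** 2
        if down_error < up_error:
            weight -= step_amount    # down is better
        elif down_error > up_error:
            weight += step_amount    # up is better
    return input_value * weight

# The three step sizes tried in this notebook.
results = {step: hot_and_cold(step) for step in (0.001, 0.2, 10)}
for step, prediction in results.items():
    print(f'step_amount: {step}\tfinal prediction: {prediction}')
```

With the small step the prediction should land close to 0.8. With the two larger steps the weight ends up bouncing between two values on either side of the goal.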

## Calculating both direction and amount from error

<img src='https://github.com/rahiakela/img-repo/blob/master/direction-and-amount.JPG?raw=1' width='800'/>

In [0]:
weight = 0.5
goal_pred = 0.8
input = 0.5

for iteration in range(20):
  pred = input * weight
  error = (pred - goal_pred) ** 2
  direction_and_amount = (pred - goal_pred) * input
  weight = weight - direction_and_amount
  print(f'Error: {str(error)} \t\t\t Prediction: {str(pred)}')

## One iteration of gradient descent

**This performs a weight update on a single training example
(input->true) pair.**

<img src='https://github.com/rahiakela/img-repo/blob/master/gradient-descent-1.JPG?raw=1' width='800'/>

In [0]:
weight = 0.1
alpha = 0.01

def neural_network(input , weight):
  prediction = input * weight
  return prediction

<img src='https://github.com/rahiakela/img-repo/blob/master/gradient-descent-2.JPG?raw=1' width='800'/>

In [0]:
number_of_toes = [8.5]
win_or_lose_binary = [1]  # (won!!!)

input = number_of_toes[0]
goal_pred = win_or_lose_binary[0]

pred = neural_network(input, weight)

error = (pred - goal_pred) ** 2
print(error)

0.022499999999999975


<img src='https://github.com/rahiakela/img-repo/blob/master/gradient-descent-3.JPG?raw=1' width='800'/>

In [0]:
number_of_toes = [8.5]
win_or_lose_binary = [1]  # (won!!!)

input = number_of_toes[0]
goal_pred = win_or_lose_binary[0]

pred = neural_network(input, weight)
print(pred)

delta = pred - goal_pred
print(delta)

0.8500000000000001
-0.1499999999999999


delta is a measurement of how much this node missed. The true prediction is 1.0, and the network’s prediction was 0.85, so the network was too low by 0.15. Thus, delta is negative 0.15.

The primary difference between gradient descent and this implementation is the new variable
delta. It’s the raw amount that the node was too high or too low. Instead of computing
direction_and_amount directly, you first calculate how much you want the output node to be
different. Only then do you compute direction_and_amount to change weight (in step 4, now
renamed weight_delta):

<img src='https://github.com/rahiakela/img-repo/blob/master/gradient-descent-4.JPG?raw=1' width='800'/>

In [0]:
number_of_toes = [8.5]
win_or_lose_binary = [1]  # (won!!!)

input = number_of_toes[0]
goal_pred = win_or_lose_binary[0]

pred = neural_network(input, weight)
print(pred)

error = (pred - goal_pred) ** 2
print(error)

delta = pred - goal_pred
print(delta)

weight_delta = input * delta
print(weight_delta)

0.8500000000000001
0.022499999999999975
-0.1499999999999999
-1.2749999999999992


weight_delta is a measure of how much a weight caused the network to miss. You calculate
it by multiplying the weight’s output node delta by the weight’s input. Thus, you create
each weight_delta by scaling its output node delta by the weight’s input. This accounts
for the three aforementioned properties of direction_and_amount: scaling, negative
reversal, and stopping.
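
The three properties can be seen by holding delta fixed and varying only the input. In this sketch, stopping means a zero input produces no update, negative reversal means a negative input flips the direction of the update, and scaling means a larger input produces a larger update. The helper `weight_delta_for` and the chosen inputs are assumptions for illustration.

```python
def weight_delta_for(input_value, delta):
    """Scale the output node's delta by the weight's input."""
    return input_value * delta

delta = -0.15  # the network predicted too low, as in the example above

stopping = weight_delta_for(0.0, delta)            # zero input: nothing to learn from
negative_reversal = weight_delta_for(-8.5, delta)  # negative input: sign flips
small_input = weight_delta_for(1.0, delta)         # small input: small update
large_input = weight_delta_for(8.5, delta)         # large input: large update

print(stopping, negative_reversal, small_input, large_input)
```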

<img src='https://github.com/rahiakela/img-repo/blob/master/gradient-descent-6.JPG?raw=1' width='800'/>

In [0]:
number_of_toes = [8.5]
win_or_lose_binary = [1]  # (won!!!)

input = number_of_toes[0]
goal_pred = win_or_lose_binary[0]

pred = neural_network(input, weight)
print(pred)

error = (pred - goal_pred) ** 2
print(error)

delta = pred - goal_pred
print(delta)

weight_delta = input * delta
print(weight_delta)

alpha = 0.01  # learning rate
weight -= weight_delta * alpha
print(weight)

0.8500000000000001
0.022499999999999975
-0.1499999999999999
-1.2749999999999992
0.11275


You multiply weight_delta by a small number alpha before using it to update weight. This lets you control how fast the network learns. If it learns too fast, it can update weights too aggressively and overshoot.

Note that the weight update made the same change (small increase) as hot and cold learning.

## Learning is just reducing error
**You can modify weight to reduce error.**

In [0]:
weight, goal_pred, input = (0.0, 0.8, 0.5)

for i in range(4):
  # These lines have a secret.
  pred = input * weight             # prediction: input feature times weight
  error = (pred - goal_pred) ** 2   # squared difference between prediction and goal

  delta = pred - goal_pred          # how much the prediction missed (the pure error)
  weight_delta = input * delta      # scale the miss by the input
  weight = weight - weight_delta    # subtract weight_delta from the weight
  print(f'Error: {str(error)} \t\t Prediction: {str(pred)}')

Error: 0.6400000000000001 		 Prediction: 0.0
Error: 0.3600000000000001 		 Prediction: 0.2
Error: 0.2025 		 Prediction: 0.35000000000000003
Error: 0.11390625000000001 		 Prediction: 0.4625


**This approach adjusts each weight in the correct direction and by the correct amount so that error reduces to 0.**

All you’re trying to do is figure out the right direction and amount to modify weight so that
error goes down. The secret lies in the pred and error calculations. Notice that you use pred
inside the error calculation.

Let’s replace the pred variable with the code used to generate it:

In [0]:
error = ((input * weight) - goal_pred) ** 2
print(error)

0.06407226562500003


This doesn’t change the value of error at all! It just combines the two lines of code and computes error directly.

Remember that input and goal_pred are fixed at 0.5 and
0.8, respectively (you set them before the network starts training).

So, if you replace their
variable names with the values, the secret becomes clear:

In [0]:
error = ((0.5 * weight) - 0.8) ** 2
print(error)

0.06407226562500003


Let’s say you increased weight by 0.5. If there’s an exact relationship between error and weight,
you should be able to calculate how much this also moves error. What if you wanted to move
error in a specific direction? Could it be done?

<img src='https://github.com/rahiakela/img-repo/blob/master/reduce-error-1.JPG?raw=1' width='800'/>

This graph represents every value of error for every weight according to the relationship in the
previous formula. Notice it makes a nice bowl shape. The black dot is at the point of both the
current weight and error. The dotted circle is where you want to be (error == 0).

**The slope points to the bottom of the bowl (lowest error) no matter where you are in the
bowl. You can use this slope to help the neural network reduce the error.**
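
The bowl can also be tabulated without a plot by evaluating the error formula over a range of weights. This is a sketch assuming the same fixed values as the text (input 0.5, goal 0.8); the grid of candidate weights and the helper `error_for` are hypothetical choices for illustration.

```python
input_value, goal_pred = 0.5, 0.8  # the fixed values used in the text

def error_for(weight):
    """Error as a function of weight alone."""
    return ((input_value * weight) - goal_pred) ** 2

# Hypothetical grid of candidate weights from 0.0 to 3.2 in steps of 0.1.
weights = [i / 10 for i in range(33)]
errors = [error_for(w) for w in weights]

# The bottom of the bowl is the weight with the smallest error.
best_weight = weights[errors.index(min(errors))]
print(best_weight)
```

The error falls as the weight approaches `goal_pred / input_value` from either side and rises again past it, which is the bowl shape in the graph.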

## Let’s watch several steps of learning

**Will we eventually find the bottom of the bowl?**

In [0]:
weight, goal_pred, input = (0.0, 0.8, 1.1)

for i in range(4):

  print(f'-----------\nWeight: {str(weight)}')

  # These lines have a secret.
  pred = input * weight             # multiply weight parameter and input feature
  error = (pred - goal_pred) ** 2   # measure the diffrerence of predicted value and target value

  delta = pred - goal_pred          # get the diffrerence(missing value) of predicted value and target value
  weight_delta = input * delta      # multiply again input feature by the diffrerence(missing value)
  weight = weight - weight_delta    # adjust the weight by subtracting with multiplied input feature and the diffrerence(missing value)
  print(f'Error: {str(error)}  Prediction: {str(pred)}')
  print(f'Delta: {str(delta)}  Weight Delta: {str(weight_delta)}')

-----------
Weight: 0.0
Error: 0.6400000000000001  Prediction: 0.0
Delta: -0.8  Weight Delta: -0.8800000000000001
-----------
Weight: 0.8800000000000001
Error: 0.02822400000000005  Prediction: 0.9680000000000002
Delta: 0.16800000000000015  Weight Delta: 0.1848000000000002
-----------
Weight: 0.6951999999999999
Error: 0.0012446784000000064  Prediction: 0.76472
Delta: -0.03528000000000009  Weight Delta: -0.0388080000000001
-----------
Weight: 0.734008
Error: 5.4890317439999896e-05  Prediction: 0.8074088
Delta: 0.007408799999999993  Weight Delta: 0.008149679999999992


<img src='https://github.com/rahiakela/img-repo/blob/master/reduce-error-2.JPG?raw=1' width='800'/>

<img src='https://github.com/rahiakela/img-repo/blob/master/reduce-error-3.JPG?raw=1' width='800'/>

<img src='https://github.com/rahiakela/img-repo/blob/master/reduce-error-4.JPG?raw=1' width='800'/>

<img src='https://github.com/rahiakela/img-repo/blob/master/reduce-error-5.JPG?raw=1' width='800'/>

<img src='https://github.com/rahiakela/img-repo/blob/master/reduce-error-6.JPG?raw=1' width='800'/>

## Why does this work? What is weight_delta, really?

**Let’s back up and talk about functions. What is a function?
How do you understand one?**

Consider this function:

    def my_function(x):
        return x * 2

Every function has what you might call moving parts: pieces you can tweak or change to make
the output the function generates different. Consider my_function in the previous example. Ask
yourself, “What’s controlling the relationship between the input and the output of this function?”
The answer is, the 2. Ask the same question about the following function:

$$\text{error} = ((\text{input} \times \text{weight}) - \text{goal\_pred})^2$$

What’s controlling the relationship between input and the output (error)? Plenty of things
are—this function is a bit more complicated! goal_pred, input, **2, weight, and all the
parentheses and algebraic operations (addition, subtraction, and so on) play a part in calculating
the error. Tweaking any one of them would change the error. This is important to consider.

As a thought exercise, consider changing goal_pred to reduce the error. This is silly, but totally
doable. In life, you might call this (setting goals to be whatever your capability is) “giving up.”
You’re denying that you missed! That wouldn’t do.

What if you changed input until error went to 0? Well, that’s akin to seeing the world as you
want to see it instead of as it actually is. You’re changing the input data until you’re predicting
what you want to predict (this is loosely how inceptionism works).

Now consider changing the 2, or the additions, subtractions, or multiplications. This is just
changing how you calculate error in the first place. The error calculation is meaningless if
it doesn’t actually give a good measure of how much you missed (with the right properties
discussed earlier). This won’t do, either.

What’s left? The only variable remaining is weight. Adjusting it doesn’t change your perception
of the world, doesn’t change your goal, and doesn’t destroy your error measure. Changing
weight means the function conforms to the patterns in the data. By forcing the rest of the
function to be unchanging, you force the function to correctly model some pattern in the data.
It’s only allowed to modify how the network predicts.

To sum up: you modify specific parts of an error function until the error value goes to 0. This error
function is calculated using a combination of variables, some of which you can change (weights) and
some of which you can’t (input data, output data, and the error logic):

In [0]:
weight = 0.5
goal_pred = 0.8
input = 0.5

for i in range(20):
  print(f'-----------\nWeight: {str(weight)}')

  # These lines have a secret.
  pred = input * weight             # prediction: input feature times weight
  error = (pred - goal_pred) ** 2   # squared difference between prediction and goal

  # (pred - goal_pred) is how much the prediction missed;
  # multiplying it by the input gives both the direction and the amount to move
  direction_and_amount = (pred - goal_pred) * input
  # subtract direction_and_amount from the weight
  weight = weight - direction_and_amount
  print(f'Error: {str(error)}  Prediction: {str(pred)}')

Learning is all about automatically changing the prediction function so that it
makes good predictions—aka, so that the subsequent error goes down to 0.

Now that you know what you’re allowed to change, how do you go about doing the changing?
That’s the good stuff. That’s the machine learning, right? In the next section, we’re going to talk
about exactly that.

## How to use a derivative to learn
**weight_delta is your derivative.**

Error is a measure of how much you missed. The
derivative defines the relationship between each
weight and how much you missed. In other
words, it tells how much changing a weight
contributed to the error. So, now that you know
this, how do you use it to move the error in a
particular direction?

<img src='https://github.com/rahiakela/img-repo/blob/master/derivative-1.JPG?raw=1' width='800'/>

You’ve learned the relationship between
two variables in a function, but how do you
exploit that relationship? As it turns out, this
is incredibly visual and intuitive. Check out
the error curve again. The black dot is where
weight starts out: (0.5). The dotted circle is where you want it to go: the goal weight. Do you see
the dotted line attached to the black dot? That’s the slope, otherwise known as the derivative. It
tells you at that point in the curve how much error changes when you change weight. Notice
that it’s pointed downward: it’s a negative slope.

The slope of a line or curve always points in the opposite direction of the lowest point of the line or
curve. So, if you have a negative slope, you increase weight to find the minimum of error.

Remember back to the goal again: you’re trying to figure out the direction and the amount to
change the weight so the error goes down. A derivative gives you the relationship between any
two variables in a function. You use the derivative to determine the relationship between any
weight and error. You then move the weight in the opposite direction of the derivative to find the
lowest error. Voilà! The neural network learns.

**This method for learning (finding error minimums) is called gradient descent.** This name should
seem intuitive. You move the weight value opposite the gradient value, which reduces error to 0. By opposite, I mean you increase the weight when you have a negative gradient, and vice versa.
It’s like gravity.
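
One way to check that weight_delta behaves like a derivative is to estimate the slope of the error curve numerically and compare. For squared error, the exact slope with respect to weight is `2 * input * delta`, so weight_delta as computed in this notebook is half of it: same sign, same direction, and the constant factor can be absorbed into the learning rate. The sketch below assumes the starting values used earlier (weight 0.5, input 0.5, goal 0.8); the step `h` and the helper `error_for` are hypothetical.

```python
input_value, goal_pred, weight = 0.5, 0.8, 0.5

def error_for(w):
    """Error as a function of weight alone."""
    return ((input_value * w) - goal_pred) ** 2

# Central finite difference: nudge the weight a little in both directions.
h = 1e-4
numerical_slope = (error_for(weight + h) - error_for(weight - h)) / (2 * h)

# The quantity this notebook uses to update the weight.
delta = (input_value * weight) - goal_pred
weight_delta = delta * input_value

print(numerical_slope, weight_delta)
```

Both numbers are negative here, so both say the same thing: increase the weight to reduce the error.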

In [3]:
weight = 0.0
goal_pred = 0.8
input = 1.1

for i in range(4):
  pred = input * weight
  error = (pred - goal_pred) ** 2
  delta = pred - goal_pred
  weight_delta = delta * input   # Derivative(how fast the error changes, given changes in the weight)
  weight = weight - weight_delta
  print(f'Error: {str(error)} Prediction: {str(pred)} Weight: {str(weight)} Weight Delta: {str(weight_delta)} Delta: {str(delta)}')


Error: 0.6400000000000001 Prediction: 0.0 Weight: 0.8800000000000001 Weight Delta: -0.8800000000000001 Delta: -0.8
Error: 0.02822400000000005 Prediction: 0.9680000000000002 Weight: 0.6951999999999999 Weight Delta: 0.1848000000000002 Delta: 0.16800000000000015
Error: 0.0012446784000000064 Prediction: 0.76472 Weight: 0.734008 Weight Delta: -0.0388080000000001 Delta: -0.03528000000000009
Error: 5.4890317439999896e-05 Prediction: 0.8074088 Weight: 0.72585832 Weight Delta: 0.008149679999999992 Delta: 0.007408799999999993


<img src='https://github.com/rahiakela/img-repo/blob/master/derivative-2.JPG?raw=1' width='800'/>

<img src='https://github.com/rahiakela/img-repo/blob/master/derivative-3.JPG?raw=1' width='800'/>

## Breaking gradient descent
**Just give me the code!**

In [5]:
weight = 0.5
goal_pred = 0.8
input = 0.5

for i in range(20):
  pred = input * weight
  error = (pred - goal_pred) ** 2
  delta = pred - goal_pred
  weight_delta = delta * input   # Derivative(how fast the error changes, given changes in the weight)
  weight = weight - weight_delta
  print(f'Error: {str(error)} Prediction: {str(pred)}')

Error: 0.30250000000000005 Prediction: 0.25
Error: 0.17015625000000004 Prediction: 0.3875
Error: 0.095712890625 Prediction: 0.49062500000000003
Error: 0.05383850097656251 Prediction: 0.56796875
Error: 0.03028415679931642 Prediction: 0.6259765625
Error: 0.0170348381996155 Prediction: 0.669482421875
Error: 0.00958209648728372 Prediction: 0.70211181640625
Error: 0.005389929274097089 Prediction: 0.7265838623046875
Error: 0.0030318352166796153 Prediction: 0.7449378967285156
Error: 0.0017054073093822882 Prediction: 0.7587034225463867
Error: 0.0009592916115275371 Prediction: 0.76902756690979
Error: 0.0005396015314842384 Prediction: 0.7767706751823426
Error: 0.000303525861459885 Prediction: 0.7825780063867569
Error: 0.00017073329707118678 Prediction: 0.7869335047900676
Error: 9.603747960254256e-05 Prediction: 0.7902001285925507
Error: 5.402108227642978e-05 Prediction: 0.7926500964444131
Error: 3.038685878049206e-05 Prediction: 0.7944875723333098
Error: 1.7092608064027242e-05 Prediction: 0.7958

Now that it works, let’s break it. Play around with the starting weight, goal_pred, and
input numbers. You can set them all to just about anything, and the neural network will
figure out how to predict the output given the input using the weight. See if you can find
some combinations the neural network can’t predict. I find that trying to break something
is a great way to learn about it.

Let’s try setting input equal to 2, but still try to get the algorithm to predict 0.8.

In [6]:
weight = 0.5
goal_pred = 0.8
input = 2

for i in range(20):
  pred = input * weight
  error = (pred - goal_pred) ** 2
  delta = pred - goal_pred
  weight_delta = delta * input   # Derivative(how fast the error changes, given changes in the weight)
  weight = weight - weight_delta
  print(f'Error: {str(error)} Prediction: {str(pred)}')

Error: 0.03999999999999998 Prediction: 1.0
Error: 0.3599999999999998 Prediction: 0.20000000000000018
Error: 3.2399999999999984 Prediction: 2.5999999999999996
Error: 29.159999999999986 Prediction: -4.599999999999999
Error: 262.4399999999999 Prediction: 16.999999999999996
Error: 2361.959999999998 Prediction: -47.79999999999998
Error: 21257.639999999978 Prediction: 146.59999999999994
Error: 191318.75999999983 Prediction: -436.5999999999998
Error: 1721868.839999999 Prediction: 1312.9999999999995
Error: 15496819.559999991 Prediction: -3935.799999999999
Error: 139471376.03999993 Prediction: 11810.599999999997
Error: 1255242384.3599997 Prediction: -35428.59999999999
Error: 11297181459.239996 Prediction: 106288.99999999999
Error: 101674633133.15994 Prediction: -318863.79999999993
Error: 915071698198.4395 Prediction: 956594.5999999997
Error: 8235645283785.954 Prediction: -2869780.599999999
Error: 74120807554073.56 Prediction: 8609344.999999996
Error: 667087267986662.1 Prediction: -25828031.7999

Whoa! That’s not what you want. The predictions exploded! They alternate from negative to
positive and negative to positive, getting farther away from the true answer at every step. In
other words, every update to the weight overcorrects.

## Visualizing the overcorrections

In [7]:
weight = 0.5
goal_pred = 0.8
input = 2

for i in range(20):
  pred = input * weight
  error = (pred - goal_pred) ** 2
  delta = pred - goal_pred
  weight_delta = delta * input   # Derivative(how fast the error changes, given changes in the weight)
  weight = weight - weight_delta
  print(f'Error: {str(error)} Prediction: {str(pred)} Weight: {str(weight)} Weight Delta: {str(weight_delta)} Delta: {str(delta)}')

Error: 0.03999999999999998 Prediction: 1.0 Weight: 0.10000000000000009 Weight Delta: 0.3999999999999999 Delta: 0.19999999999999996
Error: 0.3599999999999998 Prediction: 0.20000000000000018 Weight: 1.2999999999999998 Weight Delta: -1.1999999999999997 Delta: -0.5999999999999999
Error: 3.2399999999999984 Prediction: 2.5999999999999996 Weight: -2.2999999999999994 Weight Delta: 3.599999999999999 Delta: 1.7999999999999996
Error: 29.159999999999986 Prediction: -4.599999999999999 Weight: 8.499999999999998 Weight Delta: -10.799999999999997 Delta: -5.399999999999999
Error: 262.4399999999999 Prediction: 16.999999999999996 Weight: -23.89999999999999 Weight Delta: 32.39999999999999 Delta: 16.199999999999996
Error: 2361.959999999998 Prediction: -47.79999999999998 Weight: 73.29999999999997 Weight Delta: -97.19999999999996 Delta: -48.59999999999998
Error: 21257.639999999978 Prediction: 146.59999999999994 Weight: -218.2999999999999 Weight Delta: 291.59999999999985 Delta: 145.79999999999993
Error: 19131

<img src='https://github.com/rahiakela/img-repo/blob/master/derivative-4.JPG?raw=1' width='800'/>

<img src='https://github.com/rahiakela/img-repo/blob/master/derivative-5.JPG?raw=1' width='800'/>

## Divergence
**Sometimes neural networks explode in value. Oops?**

<img src='https://github.com/rahiakela/img-repo/blob/master/derivative-6.JPG?raw=1' width='800'/>

What really happened? The explosion in the error was caused by the fact that you made the
input larger. Consider how you’re updating the weight:

$$weight = weight - (input * (pred - goal\_pred))$$

If the input is sufficiently large, this can make the weight update large even when the error is
small. What happens when you have a large weight update and a small error? The network
overcorrects. If the new error is even bigger, the network overcorrects even more. This
causes the phenomenon you saw earlier, called divergence.
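
To see why the error grows rather than shrinks, it can help to follow a single update algebraically. This is only a sketch, assuming the unscaled update rule above and writing $delta$ for $pred - goal\_pred$, as the code does:

$$
\begin{aligned}
weight_{new} &= weight - input \cdot delta \\
pred_{new} &= input \cdot weight_{new} = pred - input^2 \cdot delta \\
delta_{new} &= pred_{new} - goal\_pred = (1 - input^2) \cdot delta
\end{aligned}
$$

With $input = 2$, this gives $delta_{new} = -3 \cdot delta$: the miss flips sign and triples at every step, so the squared error is multiplied by 9 each time. That matches the printed errors (0.04, 0.36, 3.24, ...). Under the same assumptions, the miss shrinks only when $|1 - input^2| < 1$.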

If you have a big input, the prediction is very sensitive to changes in the weight (because
pred = input * weight). This can cause the network to overcorrect. In other words, even
though the weight is still starting at 0.5, the derivative at that point is very steep. See how
tight the U-shaped error curve is in the graph?
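
The effect of input size can be checked directly. The sketch below reuses the starting weight and goal from this section and applies the same unscaled update for a few inputs. The helper name `run_updates` and the inputs 0.5 and 1.1 are hypothetical, chosen only for illustration:

```python
def run_updates(input_value, weight=0.5, goal_pred=0.8, steps=5):
    """Apply the unscaled update `weight -= input * (pred - goal_pred)`
    for `steps` iterations and return the squared error seen at each step."""
    errors = []
    for _ in range(steps):
        pred = input_value * weight
        delta = pred - goal_pred
        errors.append(delta ** 2)
        weight -= delta * input_value  # no alpha: the full derivative is applied
    return errors

# Compare how the error evolves for small, moderate, and large inputs.
for input_value in (0.5, 1.1, 2):
    errors = run_updates(input_value)
    print(f"input={input_value}: first error {errors[0]:.4f}, last error {errors[-1]:.4f}")
```

For the two smaller inputs the error should fall over the five steps, while for `input_value = 2` it should grow, which is the divergence described above.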

This is really intuitive. How do you predict? By multiplying the input by the weight. So, if the
input is huge, small changes in the weight will cause large changes in the prediction. The error is
very sensitive to the weight. In other words, the derivative is really big. How do you make
it smaller?



## Introducing alpha
**It’s the simplest way to prevent overcorrecting weight updates.**

What’s the problem you’re trying to solve? That if the input is too big, then the weight
update can overcorrect. What’s the symptom? That when you overcorrect, the new
derivative is even larger in magnitude than when you started (although the sign will be
the opposite).

The symptom is this overshooting. The solution is to multiply the weight update by a
fraction to make it smaller. In most cases, this involves multiplying the weight update
by a single real-valued number between 0 and 1, known as alpha. Note: this has no
effect on the core issue, which is that the input is larger. It will also reduce the weight
updates for inputs that aren’t too large.
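
A short derivation suggests why scaling the update helps. This is a sketch, assuming the scaled update $weight = weight - alpha \cdot input \cdot delta$ with $delta = pred - goal\_pred$:

$$
\begin{aligned}
pred_{new} &= input \cdot (weight - alpha \cdot input \cdot delta) = pred - alpha \cdot input^2 \cdot delta \\
delta_{new} &= pred_{new} - goal\_pred = (1 - alpha \cdot input^2) \cdot delta
\end{aligned}
$$

The miss shrinks when $|1 - alpha \cdot input^2| < 1$, that is, when $0 < alpha < 2 / input^2$. For $input = 2$ this means alpha has to stay below 0.5. With $alpha = 0.1$ the miss is multiplied by 0.6 at each step, so the squared error is multiplied by 0.36.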

Finding the appropriate alpha, even for state-of-the-art neural networks, is often done
by guessing. You watch the error over time. If it starts diverging (going up), then the
alpha is too high, and you decrease it. If learning is happening too slowly, then the alpha
is too low, and you increase it. There are methods other than simple gradient descent
that attempt to account for this, but gradient descent is still very popular.

### Alpha in code
**Where does our “alpha” parameter come into play?**

You just learned that alpha reduces the weight update so it doesn’t overshoot. How does this
affect the code? Well, you were updating the weights according to the following formula:

$$weight = weight - derivative$$

Accounting for alpha is a rather small change, as shown next. Notice that if alpha is
small (say, 0.01), it will reduce the weight update considerably, thus preventing it from
overshooting:

$$weight = weight - (alpha * derivative)$$

That was easy. Let’s install alpha into the tiny implementation from the beginning of this
chapter and run it where input = 2 (which previously didn’t work):

In [8]:
weight = 0.5
goal_pred = 0.8
input = 2
alpha = 0.1   # What happens when you make alpha crazy small or big? What about making it negative?

for i in range(20):
  pred = input * weight                      # forward pass: the prediction
  error = (pred - goal_pred) ** 2            # squared error, always non-negative
  derivative = input * (pred - goal_pred)    # derivative (how fast the error changes, given changes in the weight)
  weight = weight - (alpha * derivative)     # alpha scales the update so it doesn't overshoot
  print(f'Error: {error} Prediction: {pred}')

Error: 0.03999999999999998 Prediction: 1.0
Error: 0.0144 Prediction: 0.92
Error: 0.005183999999999993 Prediction: 0.872
Error: 0.0018662400000000014 Prediction: 0.8432000000000001
Error: 0.0006718464000000028 Prediction: 0.8259200000000001
Error: 0.00024186470400000033 Prediction: 0.815552
Error: 8.70712934399997e-05 Prediction: 0.8093312
Error: 3.134566563839939e-05 Prediction: 0.80559872
Error: 1.1284439629823931e-05 Prediction: 0.803359232
Error: 4.062398266736526e-06 Prediction: 0.8020155392
Error: 1.4624633760252567e-06 Prediction: 0.8012093235200001
Error: 5.264868153690924e-07 Prediction: 0.8007255941120001
Error: 1.8953525353291194e-07 Prediction: 0.8004353564672001
Error: 6.82326912718715e-08 Prediction: 0.8002612138803201
Error: 2.456376885786678e-08 Prediction: 0.8001567283281921
Error: 8.842956788836216e-09 Prediction: 0.8000940369969153
Error: 3.1834644439835434e-09 Prediction: 0.8000564221981492
Error: 1.1460471998340758e-09 Prediction: 0.8000338533188895
Error: 4.1257699

Voilà! The tiniest neural network can now make good predictions again. How did I
know to set alpha to 0.1? To be honest, I tried it, and it worked. And despite all the crazy
advancements of deep learning in the past few years, most people just try several orders of
magnitude of alpha (10, 1, 0.1, 0.01, 0.001, 0.0001) and then tweak it from there to see what
works best. It’s more art than science.
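
The trial-and-error search over orders of magnitude can be sketched as a small loop. The helper name `final_error` is hypothetical. The setup (input of 2, starting weight of 0.5, goal of 0.8, 20 updates) is taken from the cell above:

```python
def final_error(alpha, input_value=2, weight=0.5, goal_pred=0.8, steps=20):
    """Run `steps` gradient descent updates and return the squared error afterwards."""
    for _ in range(steps):
        pred = input_value * weight
        derivative = input_value * (pred - goal_pred)
        weight -= alpha * derivative
    return (input_value * weight - goal_pred) ** 2

# Try the orders of magnitude mentioned in the text and compare the results.
for alpha in (10, 1, 0.1, 0.01, 0.001, 0.0001):
    print(f"alpha={alpha}: error after 20 updates = {final_error(alpha):.3e}")
```

In this setup the large values (10 and 1) should diverge, the tiny values should barely move the error within 20 updates, and 0.1 should land closest to the goal. That is the pattern you would then refine by hand.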

## Memorizing
**It’s time to really learn this stuff.**

Why does this work? Well, for starters, the only way to know you’ve gleaned all the
information necessary from this chapter is to try to produce it from your head. Neural
networks have lots of small moving parts, and it’s easy to miss one.

Why is this important for the rest of the book? In the following chapters, I’ll be referring to
the concepts discussed in this chapter at a faster pace so that I can spend plenty of time on
the newer material. It’s vitally important that when I say something like “Add your alpha
parameterization to the weight update,” you immediately recognize which concepts from
this chapter I’m referring to.

All that is to say, memorizing small bits of neural network code has been hugely beneficial
for me personally, as well as for many individuals who have taken my advice on this subject
in the past.