# Building Your First "Deep" Neural Network: Introduction to Backpropagation

## IN THIS CHAPTER

* The Streetlight Problem
* Matrices and the Matrix Relationship
* Full / Batch / Stochastic Gradient Descent
* Neural Networks Learn Correlation
* Overfitting
* Creating our Own Correlation
* Backpropagation: Long Distance Error Attribution
* Linear vs Non-Linear
* The Secret to Sometimes Correlation
* Our First "Deep" Network
* Backpropagation in Code / Bringing it all Together

In [1]:
import numpy as np 

# inputs 
streetlights = np.array( 
                         [  
                            [ 1, 0, 1 ],
                            [ 0, 1, 1 ],
                            [ 0, 0, 1 ],
                            [ 1, 1, 1 ],
                            [ 0, 1, 1 ],
                            [ 1, 0, 1 ] 
                         ] 
                       )

# outputs
walk_vs_stop = np.array( 
                            [   
                                 [ 0 ],
                                 [ 1 ],
                                 [ 0 ],
                                 [ 1 ],
                                 [ 1 ],
                                 [ 0 ] 
                             ] 
                        )

# weights 
weights = np.array( [0.5, 0.48, -0.7] )

# input 
input = streetlights[0]

# target value 
goal_prediction = walk_vs_stop[0]

# alpha scales each weight update so the weights do not overshoot the goal
alpha = 0.1

prediction = 0
for iteration in range(20):
    prediction = input.dot(weights)
    error = (goal_prediction - prediction) ** 2 
    delta = prediction - goal_prediction
    #print(f'Before Weights = {weights}')
    weights = weights - (alpha * (input * delta))
    #print(f'Weights = {weights}, ->>{alpha * (input * delta)}')
    print ("Error:" + str(error) + " Prediction:" + str(prediction))

Error:[0.04] Prediction:-0.19999999999999996
Error:[0.0256] Prediction:-0.15999999999999992
Error:[0.016384] Prediction:-0.1279999999999999
Error:[0.01048576] Prediction:-0.10239999999999982
Error:[0.00671089] Prediction:-0.08191999999999977
Error:[0.00429497] Prediction:-0.06553599999999982
Error:[0.00274878] Prediction:-0.05242879999999994
Error:[0.00175922] Prediction:-0.04194304000000004
Error:[0.0011259] Prediction:-0.03355443200000008
Error:[0.00072058] Prediction:-0.02684354560000002
Error:[0.00046117] Prediction:-0.021474836479999926
Error:[0.00029515] Prediction:-0.01717986918399994
Error:[0.00018889] Prediction:-0.013743895347199997
Error:[0.00012089] Prediction:-0.010995116277759953
Error:[7.73712525e-05] Prediction:-0.008796093022207963
Error:[4.95176016e-05] Prediction:-0.007036874417766459
Error:[3.1691265e-05] Prediction:-0.0056294995342132115
Error:[2.02824096e-05] Prediction:-0.004503599627370569
Error:[1.29807421e-05] Prediction:-0.003602879701896544
Error:[8.30767497

### A small practice exercise, unrelated to our network

In [15]:
import numpy as np
a = np.array([0,1,2,1])
b = np.array([2,2,2,3])
print (a*b) #elementwise multiplication
print (a+b) #elementwise addition
print (a * 0.5) # vector-scalar multiplication
print (a + 0.5) # vector-scalar addition

[0 2 4 3]
[2 3 4 4]
[0.  0.5 1.  0.5]
[0.5 1.5 2.5 1.5]


### We need it to know more than one streetlight! How do we do this? Well... we train it on all the streetlights at once!

In [4]:
import numpy as np
weights = np.array([0.5,0.48,-0.7])
alpha = 0.1
streetlights = np.array( [  [ 1, 0, 1 ],
                            [ 0, 1, 1 ],
                            [ 0, 0, 1 ],
                            [ 1, 1, 1 ],
                            [ 0, 1, 1 ],
                            [ 1, 0, 1 ] ] )
walk_vs_stop = np.array( [ 0, 1, 0, 1, 1, 0 ] )
input = streetlights[0] # [1,0,1]
goal_prediction = walk_vs_stop[0] # equals 0... i.e. "stop"

for iteration in range(40):
    error_for_all_lights = 0 
    for row_index in range(len(walk_vs_stop)):
        input = streetlights[row_index]
        prediction = input.dot(weights)
        
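        # NOTE: goal_prediction was set once, before the loop, to walk_vs_stop[0] (0),
        # so every row is trained toward 0 here. To train each row toward its own
        # label, set goal_prediction = walk_vs_stop[row_index] inside this loop.
        error = (goal_prediction - prediction) ** 2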
        error_for_all_lights += error
        #print(f'input = {input}, delta={delta}, prediction = {prediction}')
        delta = prediction - goal_prediction 
        #print(f'input*delta = {input*delta}')
        weights = weights - (alpha * (input * delta))
        print(f'Prediction : {prediction}')
    print(f'Error for all : {error_for_all_lights} \n')

Prediction : -0.19999999999999996
Prediction : -0.19999999999999996
Prediction : -0.6599999999999999
Prediction : 0.42600000000000005
Prediction : -0.17919999999999997
Prediction : -0.1412799999999999
Error for all : 0.7491486783999998 

Prediction : -0.1130239999999999
Prediction : -0.11792959999999986
Prediction : -0.5814566399999999
Prediction : 0.46663238399999996
Prediction : -0.1295244927999999
Prediction : -0.10085460351999997
Error for all : 0.6094676664160464 

Prediction : -0.08068368281600002
Prediction : -0.08546576560640001
Prediction : -0.53032135992576
Prediction : 0.4589805137410559
Prediction : -0.1071365792407552
Prediction : -0.08405067852371972
Error for all : 0.5242608737226604 

Prediction : -0.06724054281897579
Prediction : -0.07058014125833462
Prediction : -0.4902864811231111
Prediction : 0.4361165960994073
Prediction : -0.09465878411423806
Prediction : -0.07546321282549368
Error for all : 0.454736347803985 

Prediction : -0.06037057026039494
Prediction : -0.062

In [5]:
weights

array([ 0.03208283,  0.03223432, -0.0371222 ])
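
Note that the cell above sets `goal_prediction` once, to `walk_vs_stop[0]`, so every row is compared against 0; this is consistent with the weights shrinking toward zero in the output. A minimal sketch of what is presumably intended, looking up each row's own label inside the loop, might look like this. The name `light` is ours, chosen only to avoid shadowing Python's built-in `input`; everything else follows the cell above.

```python
import numpy as np

weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1],
                         [0, 1, 1],
                         [1, 0, 1]])
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

for iteration in range(40):
    error_for_all_lights = 0
    for row_index in range(len(walk_vs_stop)):
        light = streetlights[row_index]
        goal_prediction = walk_vs_stop[row_index]  # this row's own label
        prediction = light.dot(weights)

        error = (goal_prediction - prediction) ** 2
        error_for_all_lights += error

        delta = prediction - goal_prediction
        weights = weights - (alpha * (light * delta))
    print(f'Error for all : {error_for_all_lights}')

print(weights)
```

With per-row labels, the total error should fall steadily and the weights should drift toward putting almost all of the weight on the middle light, which is the only light that correlates perfectly with walking in this dataset.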

## Full / Batch / Stochastic Gradient Descent

* Stochastic Gradient Descent - Updating weights one example at a time.

  As it turns out, this idea of learning "one example at a time" is a variant of Gradient
  Descent called Stochastic Gradient Descent, and it is one of the handful of methods that can
  be used for learning an entire dataset.

  How does Stochastic Gradient Descent work? As shown in the previous code cell,
  it simply performs a prediction and weight update for each training example separately. In
  other words, it takes the first streetlight, tries to predict it, calculates the weight_delta, and
  updates the weights. Then it moves on to the second streetlight, etc. It iterates through the
  entire dataset many times until it can find a weight configuration that works well for all of
  the training examples.


* (Full) Gradient Descent - Updating weights one dataset at a time.
As it turns out, another method for learning an entire dataset is just called Gradient Descent
(or "Average/Full Gradient Descent" if you like). Instead of updating the weights once
for each training example, the network simply calculates the average weight_delta over the
entire dataset, only actually changing the weights each time it computes a full average.


* Batch Gradient Descent - Updating weights after "n" examples.
This will be covered in more detail later, but there is also a third configuration that sort of
"splits the difference" between Stochastic Gradient Descent and Full Gradient Descent.
Instead of updating the weights after just one example or after the entire dataset, you
choose a "batch size" (typically between 8 and 256) after which the weights are updated.
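
The three variants above differ only in how many examples contribute to each weight update. The sketch below illustrates this on the six-row streetlight data with a single hypothetical helper, `train`, whose `batch_size` and `epochs` parameters are our own names and not part of the chapter's code. It assumes the weight_delta is averaged over the examples in each batch, and the epoch count is an arbitrary choice that is large enough for the slowest variant to settle.

```python
import numpy as np

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1],
                         [0, 1, 1],
                         [1, 0, 1]])
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

def train(batch_size, epochs=3000, alpha=0.1):
    """Hypothetical helper: gradient descent with a configurable batch size."""
    weights = np.array([0.5, 0.48, -0.7])
    for _ in range(epochs):
        for start in range(0, len(streetlights), batch_size):
            inputs = streetlights[start:start + batch_size]
            goals = walk_vs_stop[start:start + batch_size]
            deltas = inputs.dot(weights) - goals
            # average weight_delta over the examples in this batch
            weight_delta = inputs.T.dot(deltas) / len(inputs)
            weights = weights - alpha * weight_delta
    predictions = streetlights.dot(weights)
    return weights, np.sum((predictions - walk_vs_stop) ** 2)

results = {
    "stochastic (1 example per update)": train(batch_size=1),
    "batch (2 examples per update)": train(batch_size=2),
    "full (whole dataset per update)": train(batch_size=len(streetlights)),
}

for name, (weights, error) in results.items():
    print(name, weights, error)
```

All three runs start from the same weights, so any difference in the printed results comes only from how often the weights are updated.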

## Our First "Deep" Neural Network

* In the code below, we initialize our weights and perform one forward propagation.

In [31]:
import numpy as np 
np.random.seed(1)

def relu(x):  # this function sets all negative numbers to 0
    return (x>0) * x  # 0 if x is less than 0, otherwise x itself

alpha = 0.2
hidden_size = 4 

streetlights = np.array( [  
                            [ 1, 0, 1 ],
                            [ 0, 1, 1 ],
                            [ 0, 0, 1 ],
                            [ 1, 1, 1 ] 
                         ] 
                       )

walk_vs_stop = np.array([[ 1, 1, 0, 0]]).T

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1 

layer_0 = streetlights[0]
layer_1 = relu(np.dot(layer_0, weights_0_1))
layer_2 = np.dot(layer_1, weights_1_2)

In [16]:
layer_1

array([-0.        ,  0.51828245, -0.        , -0.        ])

In [7]:
layer_2

array([0.39194327])

In [13]:
layer_0

array([1, 0, 1])

In [32]:
weights_0_1

array([[-0.16595599,  0.44064899, -0.99977125, -0.39533485],
       [-0.70648822, -0.81532281, -0.62747958, -0.30887855],
       [-0.20646505,  0.07763347, -0.16161097,  0.370439  ]])
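
The forward propagation above handles a single streetlight pattern. As a hedged sketch, the same two matrix multiplications can process all four patterns at once by passing the whole `streetlights` matrix as `layer_0`; the shape comments are our own annotations, and the seed and sizes simply repeat those of the cell above.

```python
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x  # negative values become 0

hidden_size = 4

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])

weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

layer_0 = streetlights                          # shape (4, 3): one row per pattern
layer_1 = relu(np.dot(layer_0, weights_0_1))    # shape (4, hidden_size)
layer_2 = np.dot(layer_1, weights_1_2)          # shape (4, 1): one prediction per pattern

print(layer_1.shape, layer_2.shape)
print(layer_2)
```

The first row of `layer_2` corresponds to `streetlights[0]`, so it should agree with the single-example `layer_2` shown above.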

## Backpropagation in Code

 * How we can learn the amount that each weight contributes to the final error.

In [42]:
import numpy as np 

np.random.seed(1)

streetlights = np.array( [  
                            [ 1, 0, 1 ],
                            [ 0, 1, 1 ],
                            [ 0, 0, 1 ],
                            [ 1, 1, 1 ] 
                         ] 
                       )

walk_vs_stop = np.array([[ 1, 1, 0, 0]]).T

def relu(x):
    return (x > 0) * x      # returns x if x > 0
                            # return 0 otherwise
    
def relu2deriv(output):
    return output>0         # returns 1 for input > 0
                            # return 0 otherwise 
    
alpha = 0.2
hidden_size = 4

weights_0_1 = 2 * np.random.random((3,hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size,1 )) - 1

for iteration in range(60):
    layer_2_error = 0
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        
        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)
        
        layer_2_delta = (walk_vs_stop[i:i+1] - layer_2)
        
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        # The line above computes the delta at layer_1 given the delta at layer_2 by
        # taking layer_2_delta and multiplying it by its connecting weights_1_2.
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
        
    if(iteration % 10 == 9):
        print(f'Error : {layer_2_error}')

Error : 0.6342311598444467
Error : 0.35838407676317513
Error : 0.0830183113303298
Error : 0.006467054957103705
Error : 0.0003292669000750734
Error : 1.5055622665134859e-05


In [36]:
(layer_2 - walk_vs_stop[0:0+1]) ** 2

array([[0.36973299]])

In [37]:
layer_2

array([0.39194327])

In [39]:
weights_1_2.T

array([[-0.5910955 ,  0.75623487, -0.94522481,  0.34093502]])

# One Iteration of Backpropagation

## 1) Initialize the Network's Weights and Data

In [54]:
import numpy as np 

np.random.seed(1)

def relu(x):
    return (x>0) * x

def relu2deriv(output):
    return output>0

lights = np.array( 
                    [
                        [1, 0, 1],
                        [0, 1, 1],
                        [0, 0, 1],
                        [1, 1, 1]
                    ]
                 )

walk_stop = np.array([[1, 1, 0, 0]]).T

alpha = 0.2
hidden_size = 3

weights_0_1 = 2*np.random.random((3, hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size, 1)) - 1

## 2) PREDICT & COMPARE: Make a Prediction, Calculate Output Error and Delta

In [55]:
layer_0 = lights[0:1]

layer_1 = np.dot(layer_0, weights_0_1)
layer_1 = relu(layer_1)
print(f'layer_1 = {layer_1}')

layer_2 = np.dot(layer_1, weights_1_2)
print(f'layer_2 = {layer_2}')

error = (layer_2 - walk_stop[0:1]) ** 2

layer_2_delta = (layer_2 - walk_stop[0:1])

layer_1 = [[-0.          0.13177044 -0.        ]]
layer_2 = [[-0.02129555]]


## 3) LEARN: Backpropagate From layer_2 to layer_1

In [56]:
layer_1_delta = layer_2_delta.dot(weights_1_2.T)

In [57]:
# just checking the value
layer_2_delta

array([[-1.02129555]])

In [58]:
# just checking the value
weights_1_2

array([[ 0.07763347],
       [-0.16161097],
       [ 0.370439  ]])

In [59]:
# just checking the value
layer_1_delta

array([[-0.07928672,  0.16505257, -0.3783277 ]])

In [60]:
layer_1_delta *= relu2deriv(layer_1) # find the nodes that had negative values (zeroed by relu) and cancel their deltas

In [61]:
# just checking the value
relu2deriv(layer_1)

array([[False,  True, False]])

In [62]:
# just checking the value
layer_1_delta

array([[-0.        ,  0.16505257, -0.        ]])

## 4) LEARN: Generate Weight Deltas and Update Weights

In [72]:
weight_delta_1_2 = layer_1.T.dot(layer_2_delta)
weight_delta_0_1 = layer_0.T.dot(layer_1_delta)

weights_1_2 -= alpha * weight_delta_1_2
weights_0_1 -= alpha * weight_delta_0_1

In [69]:
# just checking the value
layer_0.T

array([[1],
       [0],
       [1]])

In [70]:
# just checking the value
layer_1_delta

array([[-0.        ,  0.16505257, -0.        ]])

In [71]:
# just checking the value
weight_delta_0_1

array([[-0.        ,  0.16505257, -0.        ],
       [-0.        ,  0.        , -0.        ],
       [-0.        ,  0.16505257, -0.        ]])

In [73]:
# just checking the value
weights_1_2

array([[ 0.07763347],
       [-0.13469566],
       [ 0.370439  ]])

* As we can see, backpropagation is entirely about calculating deltas for intermediate
layers so that we can perform Gradient Descent. To do so, we simply take
the weighted average delta on layer_2 for layer_1 (weighted by the weights in between them).
We then turn off (set to 0) nodes that weren't participating in the forward prediction, since
they could not have contributed to the error.
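
One way to build confidence in these deltas is a numerical gradient check. The sketch below repeats steps 1 to 4 for the first streetlight and compares the resulting weight deltas against finite-difference estimates. Two assumptions are ours: the helper names `half_squared_error` and `numerical_gradient` are hypothetical, and the comparison uses half the squared error, because the deltas in this chapter drop the factor of 2 that differentiating the squared error would produce.

```python
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

layer_0 = np.array([[1, 0, 1]])   # first streetlight pattern
target = np.array([[1]])          # its walk/stop label

weights_0_1 = 2 * np.random.random((3, 3)) - 1
weights_1_2 = 2 * np.random.random((3, 1)) - 1

def half_squared_error():
    # Hypothetical helper: 0.5 * squared error for the current weights
    layer_1 = relu(layer_0.dot(weights_0_1))
    layer_2 = layer_1.dot(weights_1_2)
    return 0.5 * np.sum((layer_2 - target) ** 2)

# Backpropagation, following steps 2 to 4
layer_1 = relu(layer_0.dot(weights_0_1))
layer_2 = layer_1.dot(weights_1_2)
layer_2_delta = layer_2 - target
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
weight_delta_1_2 = layer_1.T.dot(layer_2_delta)
weight_delta_0_1 = layer_0.T.dot(layer_1_delta)

def numerical_gradient(weights, eps=1e-6):
    # Hypothetical helper: central differences, nudging one weight at a time in place
    grad = np.zeros_like(weights)
    for idx in np.ndindex(*weights.shape):
        original = weights[idx]
        weights[idx] = original + eps
        loss_plus = half_squared_error()
        weights[idx] = original - eps
        loss_minus = half_squared_error()
        weights[idx] = original   # restore the weight
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

numeric_0_1 = numerical_gradient(weights_0_1)
numeric_1_2 = numerical_gradient(weights_1_2)

print(np.allclose(numeric_0_1, weight_delta_0_1, atol=1e-6))
print(np.allclose(numeric_1_2, weight_delta_1_2, atol=1e-6))
```

A check like this is only reliable when no hidden node sits right at the ReLU's kink at 0, where a tiny nudge could switch the node on or off.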

## Putting it all together: Here's the self-contained program

In [79]:
import numpy as np 

np.random.seed(1)

def relu(x):
    return (x>0) * x

def relu2deriv(output):
    return output>0

streetlights = np.array( [  
                            [ 1, 0, 1 ],
                            [ 0, 1, 1 ],
                            [ 0, 0, 1 ],
                            [ 1, 1, 1 ] 
                         ] 
                       )

walk_vs_stop = np.array([[ 1, 1, 0, 0]]).T

alpha = 0.2 
hidden_size = 4

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) -1 

for iteration in range(60):
    layer_2_error = 0
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        
        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)  # squared error
        
        layer_2_delta = (layer_2 - walk_vs_stop[i:i+1])
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)
    
    if(iteration % 10 == 9):
        print(f'Error : {layer_2_error}')

Error : 0.6342311598444467
Error : 0.35838407676317513
Error : 0.0830183113303298
Error : 0.006467054957103705
Error : 0.0003292669000750734
Error : 1.5055622665134859e-05
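
Once training has finished, the learned weights can be used for prediction with a forward pass alone. The following is a minimal sketch under the same seed, data, and hyperparameters as the program above; the final loop that prints each prediction next to its goal is our own addition for illustration.

```python
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1]])
walk_vs_stop = np.array([[1, 1, 0, 0]]).T

alpha = 0.2
hidden_size = 4
weights_0_1 = 2 * np.random.random((3, hidden_size)) - 1
weights_1_2 = 2 * np.random.random((hidden_size, 1)) - 1

# Same training loop as the program above, without the error bookkeeping
for iteration in range(60):
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        layer_2_delta = layer_2 - walk_vs_stop[i:i+1]
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)

# Forward pass only: predict all four patterns with the trained weights
predictions = np.dot(relu(np.dot(streetlights, weights_0_1)), weights_1_2)

for light, prediction, goal in zip(streetlights, predictions[:, 0], walk_vs_stop[:, 0]):
    print(f'{light} -> prediction {prediction:.4f}, goal {goal}')
```

Each prediction should land close to its goal of 0 or 1 for these four training patterns; nothing here tests how the network behaves on patterns it has not seen.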
