### Backpropagation

By calculating the gradient, which is the vector of partial derivatives of a function with respect to each of its parameters, we can apply the chain rule: the derivative of a chain of functions is the product of the derivatives of each function in the chain. Backpropagation uses this to work out how much each parameter contributes to the loss.

Let's demonstrate backpropagation by first computing it for a single neuron with a ReLU activation, treating the ReLU output as the value to minimize. We will then do this for the loss function:

In [2]:
import math
#the input values, their weights, and the bias
x = [1.0, -2.0, 3.0]
w = [-3.0, -1.0, 2.0]
b = 1.0

The cell above defines two vectors, the inputs x and the weights w, plus a scalar bias b.

Because there is only one sample in x and one set of weights for the inputs, this layer has a single neuron and produces a single output value:

(n_samples, num_features) * (num_features, n_neurons) = (n_samples, n_neurons)

(1, 3) * (3, 1) = (1, 1), so the output is a single value.
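As a quick check of the shape rule, here is a minimal NumPy sketch. The array names X, W and out are illustrative assumptions and are not part of the notebook's own code, which uses plain Python lists at this point.

```python
import numpy as np

# one sample with three features: shape (1, 3)
X = np.array([[1.0, -2.0, 3.0]])
# three weights feeding a single neuron: shape (3, 1)
W = np.array([[-3.0], [-1.0], [2.0]])

# matrix product: (1, 3) @ (3, 1) -> (1, 1)
out = X @ W
print(out.shape)
```

The bias is left out here on purpose, since the sketch is only about the shapes.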

In [3]:
input_weights = [x*w for x, w in zip(x, w)]
print(input_weights)

[-3.0, 2.0, 6.0]


input_weights holds the element-wise products of the inputs and weights: each element of x multiplied by the corresponding element of w.

z is the sum of input_weights plus the bias.

y is the ReLU activation function, which returns z if z > 0, otherwise 0.

In [7]:
z = sum(input_weights) + b
print('output before reLU activation function:', z)

y = max(z, 0)
print('output after the reLU activation function: ', y)

output before reLU activation function: 6.0
output after the reLU activation function:  6.0


Now, let's compute the gradients.

First, we must compute the derivative of the ReLU function.

> the derivative of ReLU with respect to its input z is:

* 1 if z > 0, otherwise 0.

We also need the derivative (gradient) passed back from the next layer. In this case we will make up this value for demonstration purposes:
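The rule above can be sketched as a pair of small functions. The names relu and relu_derivative are hypothetical helpers introduced only for illustration, and treating the derivative at exactly z == 0 as 0 is a common convention assumed here.

```python
def relu(z):
    # returns z when z is positive, otherwise 0
    return max(z, 0.0)


def relu_derivative(z):
    # slope of ReLU: 1 on the positive side, 0 elsewhere
    # (the value at exactly z == 0 is a convention; 0 is assumed here)
    return 1.0 if z > 0 else 0.0


print(relu(6.0), relu_derivative(6.0))
print(relu(-2.0), relu_derivative(-2.0))
```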


In [8]:
#derivative from the next layer
dvalue = 1.0

#derivative of ReLU combined through the chain rule: the gradient from the
#next layer multiplied by the derivative of the current layer
drelu_dz = dvalue * (1 if z > 0 else 0)

print('the derivative of the next layer is: ', dvalue)
print('the derivative of the current layer: ', (1 if z > 0 else 0))
print('the chain rule of the next layer with the current: ', drelu_dz)

the derivative of the next layer is:  1.0
the derivative of the current layer:  1
the chain rule of the next layer with the current:  1.0


Next, we calculate the partial derivatives of the sum function, then use the chain rule to multiply each of them by the ReLU derivative computed above.

results will be:

> drelu_dwx0: the partial derivative of the ReLU wrt. the first weighted input, w0x0

> drelu_dwx1: the partial derivative of the ReLU wrt. the second weighted input, w1x1

> drelu_dwx2: the partial derivative of the ReLU wrt. the third weighted input, w2x2

> drelu_db: the partial derivative of the ReLU wrt. the bias, b

The partial derivative of a sum with respect to any one of its inputs is always 1, no matter what the inputs are.

$ f(x, y) = x + y $

partial derivative of the function with respect to x:
$ \frac{\partial}{\partial x}f(x,y) = 1 $

partial derivative of the function with respect to y:
$ \frac{\partial}{\partial y}f(x,y) = 1 $
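A numerical check can make this concrete. The sketch below estimates both partial derivatives of the sum with a central difference; the helper numerical_partial, the step size h and the evaluation point are illustrative assumptions.

```python
def f(x, y):
    return x + y


def numerical_partial(func, args, index, h=1e-6):
    # central difference: nudge one argument up and down by h
    up = list(args)
    up[index] += h
    down = list(args)
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)


point = (3.0, -5.0)  # arbitrary illustrative values
df_dx = numerical_partial(f, point, 0)
df_dy = numerical_partial(f, point, 1)
print(round(df_dx, 6), round(df_dy, 6))
```

Changing point should leave both estimates at approximately 1, which is what "no matter the inputs" means in practice.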

Thus, the partial derivatives of the sum with respect to each of its weighted inputs (input * weight) and the bias are:

> dsum_dxw0: 1

> dsum_dxw1: 1

> dsum_dxw2: 1

> dsum_db: 1

From here, we apply the chain rule to find the derivative of the ReLU with respect to each of the weighted inputs and the bias:


In [11]:
#partial derivatives of the sum function wrt each of the weighted inputs
#and the bias. the derivative of the sum operation is always 1
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1

#using chain rule so that we know the partial derivatives of the reLU
#function with respect to each of the weighted inputs. 
drelu_dwx0 = dsum_dxw0 * drelu_dz
drelu_dwx1 = dsum_dxw1 * drelu_dz
drelu_dwx2 = dsum_dxw2 * drelu_dz
drelu_db = dsum_db * drelu_dz

print('partial derivative of the sum function wrt. weighted inputs0: ', dsum_dxw0)
print('partial derivative of the sum function wrt. the weighted inputs1', dsum_dxw1)
print('partial derivative of the sum function wrt the weighted inputs2: ', dsum_dxw2)
print('partial derivative of the sum function wrt the bias: ', dsum_db)

print('partial derivative of the reLU function wrt the weighted inputs0', drelu_dwx0)
print('partial derivative of the reLU function wrt the weighted inputs1', drelu_dwx1)
print('partial derivative of the reLU function wrt the weighted inputs2', drelu_dwx2)
print('partial derivative of the reLU function wrt the bias', drelu_db)

partial derivative of the sum function wrt. weighted inputs0:  1
partial derivative of the sum function wrt. the weighted inputs1 1
partial derivative of the sum function wrt the weighted inputs2:  1
partial derivative of the sum function wrt the bias:  1
partial derivative of the reLU function wrt the weighted inputs0 1.0
partial derivative of the reLU function wrt the weighted inputs1 1.0
partial derivative of the reLU function wrt the weighted inputs2 1.0
partial derivative of the reLU function wrt the bias 1.0


Now, we must find the partial derivatives of the multiplication function, then use the chain rule again.

The partial derivative of a product with respect to one of its factors is the other factor, that is, whatever the input is being multiplied by. For example:

$ f(x, y) = x \cdot y $

partial derivative of the function with respect to x:
$ \frac{\partial}{\partial x}f(x,y) = y $

partial derivative of the function with respect to y:
$ \frac{\partial}{\partial y}f(x,y) = x $

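Thus, the partial derivative of the first weighted input, w0x0, with respect to the input x0 is the weight w0, and with respect to the weight w0 it is the input x0.

Then, we apply the chain rule and multiply this partial derivative by the partial derivative of the subsequent functions (the sum and the ReLU, computed above).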
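Written out for the first input and weight, the chain looks like this. It is a sketch that assumes the notation y = ReLU(z) and z = w_0 x_0 + w_1 x_1 + w_2 x_2 + b, and uses the made-up upstream gradient of 1 together with this example's values x_0 = 1 and w_0 = -3:

$ \frac{\partial y}{\partial x_0} = \frac{dy}{dz} \cdot \frac{\partial z}{\partial (w_0 x_0)} \cdot \frac{\partial (w_0 x_0)}{\partial x_0} = 1 \cdot 1 \cdot w_0 = -3 $

$ \frac{\partial y}{\partial w_0} = \frac{dy}{dz} \cdot \frac{\partial z}{\partial (w_0 x_0)} \cdot \frac{\partial (w_0 x_0)}{\partial w_0} = 1 \cdot 1 \cdot x_0 = 1 $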

In [18]:
#partial derivatives of the multiplication function wrt. each of the inputs:
dmul_dx0 = w[0]
dmul_dx1 = w[1]
dmul_dx2 = w[2]

#the bias needs no multiplication derivative, because it is not being multiplied.
print('partial derivative of the multiplication function wrt x0: ', dmul_dx0)
print('partial derivative of the multiplication function wrt x1: ', dmul_dx1)
print('partial derivative of the multiplication function wrt x2: ', dmul_dx2, '\n')


#now, do chain rule of these derivatives with the subsequent one we did previously
drelu_dx0 = drelu_dwx0 * dmul_dx0
drelu_dx1 = drelu_dwx1 * dmul_dx1
drelu_dx2 = drelu_dwx2 * dmul_dx2

#partial derivatives of the multiplication function wrt. each of the weights:
dmul_dw0 = x[0]
dmul_dw1 = x[1]
dmul_dw2 = x[2]

print('partial derivative of the multiplication function wrt w0: ', dmul_dw0)
print('partial derivative of the multiplication function wrt w1: ', dmul_dw1)
print('partial derivative of the multiplication function wrt w2: ', dmul_dw2, '\n')

#now, do chain rule of these derivatives with the subsequent one we did previously
drelu_dw0 = drelu_dwx0 * dmul_dw0
drelu_dw1 = drelu_dwx1 * dmul_dw1
drelu_dw2 = drelu_dwx2 * dmul_dw2

print('partial derivative of the reLU function with respect to the input 0: ', drelu_dx0)
print('partial derivative of the reLU function with respect to the weight 0: ', drelu_dw0)
print('partial derivative of the reLU function with respect to the input 1: ', drelu_dx1)
print('partial derivative of the reLU function with respect to the weight 1: ', drelu_dw1)
print('partial derivative of the reLU function with respect to the input 2: ', drelu_dx2)
print('partial derivative of the reLU function with respect to the weight 2: ', drelu_dw2)

print('partial derivative of the reLU function with respect to the bias: ', drelu_db)

partial derivative of the multiplication function wrt x0:  -3.0
partial derivative of the multiplication function wrt x1:  -1.0
partial derivative of the multiplication function wrt x2:  2.0 

partial derivative of the multiplication function wrt w0:  1.0
partial derivative of the multiplication function wrt w1:  -2.0
partial derivative of the multiplication function wrt w2:  3.0 

partial derivative of the reLU function with respect to the input 0:  -3.0
partial derivative of the reLU function with respect to the weight 0:  1.0
partial derivative of the reLU function with respect to the input 1:  -1.0
partial derivative of the reLU function with respect to the weight 1:  -2.0
partial derivative of the reLU function with respect to the input 2:  2.0
partial derivative of the reLU function with respect to the weight 2:  3.0
partial derivative of the reLU function with respect to the bias:  1.0
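Because the derivative of the sum is always 1, the whole backward pass for this neuron collapses to a few lines. The sketch below is one compact way to write it, with a finite-difference check on the first weight; the helper forward, the list names dw and dx, and the step size h are illustrative assumptions.

```python
x = [1.0, -2.0, 3.0]
w = [-3.0, -1.0, 2.0]
b = 1.0
dvalue = 1.0  # made-up gradient from the next layer, as above


def forward(x, w, b):
    # weighted sum plus bias, followed by ReLU
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(z, 0.0)


z = sum(xi * wi for xi, wi in zip(x, w)) + b
drelu_dz = dvalue * (1.0 if z > 0 else 0.0)

# the sum contributes a factor of 1, so only the multiplication partials remain
dw = [drelu_dz * xi for xi in x]  # derivative wrt each weight is its input
dx = [drelu_dz * wi for wi in w]  # derivative wrt each input is its weight
db = drelu_dz

# finite-difference check on the first weight
h = 1e-6
w_up = [w[0] + h] + w[1:]
w_down = [w[0] - h] + w[1:]
numeric_dw0 = (forward(x, w_up, b) - forward(x, w_down, b)) / (2 * h)

print(dw, dx, db)
print(round(numeric_dw0, 6))
```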


Now we will use these partial derivatives (together they form the gradient) to adjust the weights and the bias so that the output decreases. The component that performs this update is referred to as the optimizer.

In this case, we add a small negative fraction of each gradient to the corresponding weight and to the bias, since we want to decrease the final output value.

Let's show the current values of the weights and the bias:

In [20]:
print(w, b)

[-3.0, -1.0, 2.0] 1.0


Let's now modify the weights and the bias using their respective gradients:

> change the weights and bias slightly so as to decrease the output.

In [23]:
dx = [drelu_dx0, drelu_dx1, drelu_dx2]
dw = [drelu_dw0, drelu_dw1, drelu_dw2]
db = drelu_db

w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db

print(w, b)

[-3.0029999999999997, -0.994, 1.9910000000000003] 0.998
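For reference, here is a minimal sketch of a single update step starting from the initial values. The name learning_rate for the 0.001 factor and the helper neuron_output are assumptions made for illustration. Since the squared gradients sum to 1 + 4 + 9 + 1 = 15, one step should lower the output from 6.0 by about 0.001 * 15, to roughly 5.985.

```python
x = [1.0, -2.0, 3.0]
w = [-3.0, -1.0, 2.0]
b = 1.0
dw = [1.0, -2.0, 3.0]  # weight gradients computed above
db = 1.0               # bias gradient computed above
learning_rate = 0.001  # hypothetical name for the 0.001 factor


def neuron_output(x, w, b):
    # weighted sum plus bias, followed by ReLU
    return max(sum(xi * wi for xi, wi in zip(x, w)) + b, 0.0)


before = neuron_output(x, w, b)

# move each parameter a small step against its gradient
w = [wi - learning_rate * gi for wi, gi in zip(w, dw)]
b = b - learning_rate * db

after = neuron_output(x, w, b)
print(before, after)
```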


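Let's do another forward pass and see the changes. Note that the values printed above are consistent with the update cell having been run more than once (each weight has moved by three steps of 0.001 times its gradient); a single update from the initial values would give roughly w = [-3.001, -0.998, 1.997] and b = 0.999.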

In [24]:
#multiplying the inputs and weights
input_weights = [x*w for x, w in zip(x, w)]
print(input_weights)

[-3.0029999999999997, 1.988, 5.973000000000001]


In [25]:
#summing the weighted inputs (completing the dot product) and adding the bias
z = sum(input_weights) + b
print('output before the reLU activation function: ', z)

#reLU activation function
y = max(z, 0)
print('output after the reLU activation function: ', y)

output before the reLU activation function:  5.956000000000001
output after the reLU activation function:  5.956000000000001


As you can see from the output above, we have reduced the ReLU output. In a real-world application we do not minimize a layer's output, but rather the loss function. We minimized the ReLU output here for demonstration purposes only.

#### Backprop with multiple neurons

Now, instead of a single neuron, we have a layer with multiple neurons. During backprop, each neuron from the current layer will receive a vector of partial derivatives instead of a single value.

Below is the code to show this:

> take the transposed weights, which form the array of derivatives wrt the inputs, and multiply them by their respective gradients (those of the corresponding neurons) to apply the chain rule.

> then, we sum these products for each input.

> the result is the gradient for the next layer in backpropagation. the next layer is the previous layer in the order of creation of the model.

In [48]:
import numpy as np

#passed in gradient from the next layer. use a vector of 1s for this example
#here it is a 1D vector for a single sample; for a batch of samples
#it becomes a 2D array with one row per sample.
dvalues = np.array([1., 1., 1.])
print('gradients from the next layer: ', dvalues)
print('dvalue shape: ', dvalues.shape)

gradients from the next layer:  [1. 1. 1.]
dvalue shape:  (3,)


In [33]:
#we have 3 sets of weights, one set for each neuron. each neuron has 4 inputs, thus 4 weights.
weights = np.array([[0.2, 0.8, -0.5, 1], [0.5, 10.91, 0.26, -0.5], [-0.26, -0.27, 0.17, 0.87]])
print('this is the weights: \n', weights)
print('this is the weight''s shape', weights.shape)
weights_T = weights.T
print('this is the weights transposed: \n', weights_T)
print('this is now the weight''s shape: ', weights_T.shape)

this is the weights: 
 [[ 0.2   0.8  -0.5   1.  ]
 [ 0.5  10.91  0.26 -0.5 ]
 [-0.26 -0.27  0.17  0.87]]
this is the weights shape (3, 4)
this is the weights transposed: 
 [[ 0.2   0.5  -0.26]
 [ 0.8  10.91 -0.27]
 [-0.5   0.26  0.17]
 [ 1.   -0.5   0.87]]
this is now the weights shape:  (4, 3)


In [49]:
#for each input, multiply its weights (one per neuron) by the gradient passed in
#for the matching neuron, then sum over the neurons
dx0 = sum(weights_T[0] * dvalues)
dx1 = sum(weights_T[1] * dvalues)
dx2 = sum(weights_T[2] * dvalues)
dx3 = sum(weights_T[3] * dvalues)

#gradient of the neuron's function wrt the inputs
dinputs = np.array([dx0, dx1, dx2, dx3])
print(dinputs)

[ 0.44 11.44 -0.07  1.37]


The sum of the element-wise products is the dot product, so we can achieve the same result with np.dot:

In [59]:
import numpy as np

dvalues = np.array([1,1,1])
print(dvalues.shape)

weights = np.array([[0.2, 0.8, -0.5, 1], [0.5, 10.91, 0.26, -0.5], [-0.26, -0.27, 0.17, 0.87]])

#for each input, sum over the neurons of weight times the passed in gradient
dinputs = np.dot(dvalues, weights)
print(dinputs)

(3,)
[ 0.44 11.44 -0.07  1.37]
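The equivalence can be checked directly. The sketch below compares an explicit sum over the neurons with np.dot; the names manual and vectorized are illustrative assumptions.

```python
import numpy as np

dvalues = np.array([1.0, 1.0, 1.0])
weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, 10.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])

# explicit version: for each input i, sum weight * gradient over the neurons n
manual = np.array([sum(weights[n, i] * dvalues[n] for n in range(3))
                   for i in range(4)])

# vectorized version of the same computation
vectorized = np.dot(dvalues, weights)

print(np.allclose(manual, vectorized))
```

Unlike a plain sum of the weights, this form still gives the right answer when the gradients passed in differ from neuron to neuron.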


Now, let's account for batches of samples. Above, we used a single sample, responsible for a single vector of gradients that is backpropagated between layers. The row vector that we created for dvalues was in preparation for a batch of data.

With more samples, the layer returns one row of gradients per sample.

In [61]:
import numpy as np

dvalues = np.array([[1,1,1],[2,2,2],[3,3,3]])
print(dvalues.shape)

weights = np.array([[0.2, 0.8, -0.5, 1], [0.5, 10.91, 0.26, -0.5], [-0.26, -0.27, 0.17, 0.87]])

dinputs = np.dot(dvalues, weights)
print(dinputs)

(3, 3)
[[ 0.44 11.44 -0.07  1.37]
 [ 0.88 22.88 -0.14  2.74]
 [ 1.32 34.32 -0.21  4.11]]
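To see that the batched product is just the single-sample computation repeated per row, here is a small sketch; the names batched and per_sample are illustrative assumptions.

```python
import numpy as np

dvalues = np.array([[1.0, 1.0, 1.0],
                    [2.0, 2.0, 2.0],
                    [3.0, 3.0, 3.0]])
weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, 10.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])

# one matrix product for the whole batch: (3, 3) @ (3, 4) -> (3, 4)
batched = np.dot(dvalues, weights)

# the same result built one sample (row of gradients) at a time
per_sample = np.array([np.dot(row, weights) for row in dvalues])

print(batched.shape, np.allclose(batched, per_sample))
```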


Let's combine the forward and backward pass of a single neuron with a full layer and batch-based partial derivatives. Once again, we'll minimize the ReLU output, for this example only.

In [71]:
import numpy as np

#passed in gradient from the next layer, array of incremental gradient values
#(not used further in this cell; the ReLU output stands in for it below)
dvalues = np.array([[1,1,1],[2,2,2],[3,3,3]])

#we have 3 sets of inputs - samples
inputs =  np.array([[1, 2, 3, 2.5],
                    [2., 5., -1., 2],
                    [-1.5, 2.7, 3.3, -0.8]])

print('input shape: ', inputs.shape)

'''we have 3 sets of weights - one set for each neuron,
   we have 4 inputs, thus 4 weights for each neuron'''
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

print('weights shape: ', weights.shape)

#one bias for each neuron, biases are just a row vector of the shape (1, num_neurons)
biases = np.array([[2,3,0.5]])

#forward pass: perform the dot product, then perform the reLU activation function
layer_outputs = np.dot(inputs, weights) + biases
print('layer outputs before reLU activation function: \n', layer_outputs)
relu_outputs = np.maximum(0, layer_outputs)
print('layer outputs after the reLU activation function: \n', relu_outputs)

#let's optimize and test the backpropagation.
#the ReLU output stands in for the gradient passed back from the next layer
#to the current layer; zeroing it where the pre-activation output was <= 0
#applies the ReLU derivative
drelu = relu_outputs.copy()
drelu[layer_outputs <= 0] = 0
print(drelu)

#Dense layer: 
#dinputs - multiply by weights
dinputs = np.dot(drelu, weights.T)
#dweights - multiply by inputs
dweights = np.dot(inputs.T, drelu)
#dbiases - sum values, do this over the samples (first axis), keepdims.
dbiases = np.sum(drelu, axis=0, keepdims=True)

#update parameters
weights += -0.001 * dweights
biases += -0.001 * dbiases

print(weights)
print(biases)

input shape:  (3, 4)
weights shape:  (4, 3)
layer outputs before reLU activation function: 
 [[ 4.8    1.21   2.385]
 [ 8.9   -1.81   0.2  ]
 [ 1.41   1.051  0.026]]
layer outputs after the reLU activation function: 
 [[4.8   1.21  2.385]
 [8.9   0.    0.2  ]
 [1.41  1.051 0.026]]
[[4.8   1.21  2.385]
 [8.9   0.    0.2  ]
 [1.41  1.051 0.026]]
[[ 0.179515   0.5003665 -0.262746 ]
 [ 0.742093  -0.9152577 -0.2758402]
 [-0.510153   0.2529017  0.1629592]
 [ 0.971328  -0.5021842  0.8636583]]
[[1.98489  2.997739 0.497389]]
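One way to read the cell above: using the ReLU output itself as the incoming gradient is the same as taking the gradient of half the sum of the squared ReLU outputs. The sketch below checks that reading numerically. The function objective and the step size h are illustrative assumptions; the notebook does not define such an objective.

```python
import numpy as np

inputs = np.array([[1, 2, 3, 2.5],
                   [2., 5., -1., 2],
                   [-1.5, 2.7, 3.3, -0.8]])
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T
biases = np.array([[2, 3, 0.5]])


def objective(w):
    # hypothetical scalar being minimized: half the sum of squared ReLU outputs
    out = np.maximum(0, np.dot(inputs, w) + biases)
    return 0.5 * np.sum(out ** 2)


# analytic gradient, computed the same way as in the cell above
layer_outputs = np.dot(inputs, weights) + biases
drelu = np.maximum(0, layer_outputs)
drelu[layer_outputs <= 0] = 0
dweights = np.dot(inputs.T, drelu)

# central-difference estimate for every weight
h = 1e-6
numeric = np.zeros_like(weights)
for i in range(weights.shape[0]):
    for j in range(weights.shape[1]):
        step = np.zeros_like(weights)
        step[i, j] = h
        numeric[i, j] = (objective(weights + step) - objective(weights - step)) / (2 * h)

print(np.allclose(dweights, numeric, atol=1e-4))
```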


now, we will update the dense layer and ReLU activation code with a backward method.

In [94]:
class Dense_Layer:
    def __init__(self, num_features, num_neurons):
        self.weights = 0.01 * np.random.randn(num_features, num_neurons)
        self.biases = np.zeros((1, num_neurons))

    #store the inputs during the forward pass; backpropagation needs them
    def forward(self, samples):
        self.outputs = np.dot(samples, self.weights) + self.biases
        self.samples = samples

    #it takes in the gradients from the next layer
    def backward(self, dvalues):
        #gradients on parameters
        self.dweights = np.dot(self.samples.T, dvalues)
        self.dbiases = np.sum(dvalues, axis=0, keepdims=True)
        #gradients on values
        self.dinputs = np.dot(dvalues, self.weights.T)

class reLU:
    def forward(self, inputs):
        self.outputs = np.maximum(0, inputs)
        self.inputs = inputs

    def backward(self, dvalues):
        #copy the values first, since we are about to modify them
        self.dinputs = dvalues.copy()

        #zero the gradient where the input values were zero or negative
        self.dinputs[self.inputs <= 0] = 0

X  = np.array([[1, 2, 3, 2.5],
                [2., 5., -1., 2],
                [-1.5, 2.7, 3.3, -0.8]])

dvalues = np.array([[1,1,1],[2,2,2],[3,3,3]])

#4 input features, 3 neurons (the random weights are overwritten below)
dense1 = Dense_Layer(4, 3)

dense1.weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T

dense1.biases = np.array([[2,3,0.5]])

print(f'current weights: {dense1.weights}\ncurrent biases: {dense1.biases}')
dense1.forward(X)
print(f'current sample(s): {dense1.samples}')
print(f'output before reLU activation function: {dense1.outputs}')
activation1 = reLU()
activation1.forward(dense1.outputs)
print(f'output after reLU activation function: {activation1.outputs}')

#for simplicity, dvalues is passed straight to the dense layer here; in a full
#backward pass it would first go through activation1.backward
dense1.backward(dvalues)
print(f'weight gradients: {dense1.dweights}\nbias gradients: {dense1.dbiases}\ninput gradients: {dense1.dinputs}')


current weights: [[ 0.2   0.5  -0.26]
 [ 0.8  -0.91 -0.27]
 [-0.5   0.26  0.17]
 [ 1.   -0.5   0.87]]
current biases: [[2.  3.  0.5]]
current sample(s): [[ 1.   2.   3.   2.5]
 [ 2.   5.  -1.   2. ]
 [-1.5  2.7  3.3 -0.8]]
output before reLU activation function: [[ 4.8    1.21   2.385]
 [ 8.9   -1.81   0.2  ]
 [ 1.41   1.051  0.026]]
output after reLU activation function: [[4.8   1.21  2.385]
 [8.9   0.    0.2  ]
 [1.41  1.051 0.026]]
weight gradients: [[ 0.5  0.5  0.5]
 [20.1 20.1 20.1]
 [10.9 10.9 10.9]
 [ 4.1  4.1  4.1]]
bias gradients: [[6 6 6]]
input gradients: [[ 0.44 -0.38 -0.07  1.37]
 [ 0.88 -0.76 -0.14  2.74]
 [ 1.32 -1.14 -0.21  4.11]]
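In a full backward pass the gradient would normally travel through the ReLU before reaching the dense layer. The sketch below shows that order with plain arrays rather than the classes above, to keep it short; it assumes the same X, weights, biases and dvalues, and corresponds to calling the ReLU's backward method first and feeding its dinputs to the dense layer's backward method.

```python
import numpy as np

X = np.array([[1, 2, 3, 2.5],
              [2., 5., -1., 2],
              [-1.5, 2.7, 3.3, -0.8]])
weights = np.array([[0.2, 0.8, -0.5, 1],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]]).T
biases = np.array([[2, 3, 0.5]])
dvalues = np.array([[1., 1., 1.], [2., 2., 2.], [3., 3., 3.]])

# forward pass: dense layer output before the activation
z = np.dot(X, weights) + biases

# backward through the ReLU first: zero the gradient where z <= 0
drelu = dvalues.copy()
drelu[z <= 0] = 0

# then backward through the dense layer
dweights = np.dot(X.T, drelu)
dbiases = np.sum(drelu, axis=0, keepdims=True)
dinputs = np.dot(drelu, weights.T)

print(drelu)
print(dbiases)
```

Only one pre-activation value is negative in this example (second sample, second neuron), so only that entry of the gradient is zeroed, and the second neuron's weight and bias gradients differ from the ones printed above.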
