## __Code Backpropagation__

Before applying backpropagation to our code, let's understand and implement it with a basic example.

In [1]:
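# Weights, inputs, and bias of a single neuron
w = [1, -2, 3]
x = [-3, -1, 2]
b = 1

# Multiply each input by its weight
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Sum the weighted inputs and the bias
z = xw0 + xw1 + xw2 + b

# ReLU activation
y = max(0, z)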

Here we implemented a single neuron with three inputs and a ReLU activation. Let's take a look at the neuron's function:

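$f(x) = \mathrm{ReLU}(\mathrm{sum}(\mathrm{mul}(x_{0}, w_{0}), \mathrm{mul}(x_{1}, w_{1}), \mathrm{mul}(x_{2}, w_{2}), b))$

Suppose we want to find the impact of $x_{0}$ on this function. It would look like:

#### $\frac{\partial}{\partial x_{0}}f(x) = \frac{d\mathrm{ReLU}()}{d\mathrm{sum}()} \cdot \frac{\partial \mathrm{sum}()}{\partial \mathrm{mul}(x_{0}, w_{0})} \cdot \frac{\partial \mathrm{mul}(x_{0}, w_{0})}{\partial x_{0}}$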

By applying the chain rule, we can now compute the impact of $x_{0}$ on this function. During the backward pass, we receive gradients from the next layer (the one that comes after this neuron in the forward pass), which we can set to 1 in our example. The derivative of ReLU with respect to its input, the sum, is 1 when the sum is greater than 0 and 0 otherwise. In our example the sum is 6, so the derivative is 1. We then multiply it by the gradient received from the next layer, so that we follow the chain rule.

In [2]:
dvalue = 1
drelu_dsum = (1 if z > 0 else 0) * dvalue
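
Our example only exercises the positive branch of ReLU, so here is a small sketch of both branches. The helper name `relu_grad` is hypothetical and only used for illustration; it is not part of our main code.

```python
def relu_grad(z, dvalue=1.0):
    """Gradient of ReLU(z) with respect to z, scaled by the upstream gradient."""
    # ReLU passes the gradient through when z > 0 and blocks it otherwise.
    return (1.0 if z > 0 else 0.0) * dvalue

print(relu_grad(6.0))       # positive input: the gradient passes through
print(relu_grad(-2.0))      # negative input: the gradient is blocked
print(relu_grad(6.0, 0.5))  # the upstream gradient scales the result
```

With a negative sum, every gradient further down the chain would become 0 as well, because each of them is multiplied by this value.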

Now we can calculate the derivative of the sum with respect to each weighted input, which is always 1, since the partial derivative of a sum with respect to any one of its terms is 1. Using this, we can get the derivative of ReLU with respect to the weighted inputs. The derivative of the sum with respect to the bias is also 1.

In [3]:
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1
drelu_dxw0 = dsum_dxw0 * drelu_dsum
drelu_dxw1 = dsum_dxw1 * drelu_dsum
drelu_dxw2 = dsum_dxw2 * drelu_dsum
drelu_db = dsum_db * drelu_dsum

Now we can use the multiplication rule of partial derivatives. The derivative of the multiplication with respect to an input is just the corresponding weight, so multiplying it by the gradient we have computed so far finishes the chain! We can get the impact of the weights the same way: the derivative of the multiplication with respect to a weight is the corresponding input.

In [4]:
dmul_dx0 = w[0]
dmul_dx1 = w[1]
dmul_dx2 = w[2]
dmul_dw0 = x[0]
dmul_dw1 = x[1]
dmul_dw2 = x[2]
drelu_dx0 = drelu_dxw0 * dmul_dx0
drelu_dw0 = drelu_dxw0 * dmul_dw0
drelu_dx1 = drelu_dxw1 * dmul_dx1
drelu_dw1 = drelu_dxw1 * dmul_dw1
drelu_dx2 = drelu_dxw2 * dmul_dx2
drelu_dw2 = drelu_dxw2 * dmul_dw2
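
Since the derivative of the sum is always 1, the whole chain for one input or one weight collapses into a single expression. The sketch below repeats the example values so it runs on its own; the names `relu_mask`, `drelu_dx0_direct`, and `drelu_dw0_direct` are made up for this illustration.

```python
# Same example values as above, repeated so this cell is self-contained
w = [1, -2, 3]
x = [-3, -1, 2]
b = 1
z = x[0] * w[0] + x[1] * w[1] + x[2] * w[2] + b

dvalue = 1                      # gradient received from the next layer
relu_mask = 1 if z > 0 else 0   # derivative of ReLU with respect to the sum

# upstream gradient * dReLU/dsum * dsum/dmul (which is 1) * dmul/dx0
drelu_dx0_direct = dvalue * relu_mask * 1 * w[0]
# Same chain, but ending with dmul/dw0
drelu_dw0_direct = dvalue * relu_mask * 1 * x[0]

print(drelu_dx0_direct, drelu_dw0_direct)
```

The intermediate variables in the step-by-step version are useful for learning, but in practice we would likely write the collapsed form.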

Now we can represent these as gradients!

In [5]:
dx = [drelu_dx0, drelu_dx1, drelu_dx2]
dw = [drelu_dw0, drelu_dw1, drelu_dw2]
db = drelu_db
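
One way to convince ourselves that these gradients are right is to compare them with a numerical estimate. The sketch below is not part of our main code: the helpers `neuron` and `bump` and the step size `eps` are assumptions made for this check. It nudges each value slightly in both directions and measures how the output changes (a central difference).

```python
def neuron(x, w, b):
    """Forward pass of the single ReLU neuron from the example above."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(0.0, z)

def bump(values, index, delta):
    """Return a copy of values with one entry shifted by delta."""
    bumped = list(values)
    bumped[index] += delta
    return bumped

w = [1.0, -2.0, 3.0]
x = [-3.0, -1.0, 2.0]
b = 1.0
eps = 1e-6  # small step for the central difference (an arbitrary choice)

# Estimate each partial derivative as (f(v + eps) - f(v - eps)) / (2 * eps)
dx_numeric = [
    (neuron(bump(x, i, eps), w, b) - neuron(bump(x, i, -eps), w, b)) / (2 * eps)
    for i in range(len(x))
]
dw_numeric = [
    (neuron(x, bump(w, i, eps), b) - neuron(x, bump(w, i, -eps), b)) / (2 * eps)
    for i in range(len(w))
]
db_numeric = (neuron(x, w, b + eps) - neuron(x, w, b - eps)) / (2 * eps)

print(dx_numeric)
print(dw_numeric)
print(db_numeric)
```

The estimates should come out very close to the analytic gradients for this example: `dx = [1, -2, 3]`, `dw = [-3, -1, 2]`, and `db = 1`. A check like this is only reliable away from the kink of ReLU at 0, which is the case here since the sum is 6.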

Now let's implement our findings in more realistic code before incorporating them into our main code. All we are doing here is taking what we did in the last few cells and applying it to more realistic data and parameters: a batch of three samples and a layer of three neurons.

In [6]:
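import numpy as np
# Gradients passed back from the next layer: one row per sample, one column per neuron
dvalues = np.array([[1, 1, -1], [2, -2, 2], [-3, 3, -3]])
# Three neurons with four weights each, transposed to shape (inputs, neurons)
weights = np.array([[0.1, 0.5, 0.8, -0.3],
                   [-0.7, 1.2, -0.9, 1.1],
                   [0.5, -0.8, -0.7, 0.3]]).T
# Batch of three samples with four features each
inputs = np.array([[1, 2, 3, 2.5],
                   [2, 5, -1, 2],
                   [-1.5, 2.7, 3.3, -0.8]])

# Gradient with respect to the inputs: shape (samples, inputs)
dinputs = np.dot(dvalues, weights.T)
print(dinputs)
# Gradient with respect to the weights: same shape as weights
dweights = np.dot(inputs.T, dvalues)
print(dweights)
# Gradient with respect to the biases: sum over the samples
dbiases = np.sum(dvalues, axis=0, keepdims=True)
print(dbiases)
# ReLU backward pass. In this demo, dvalues also stands in for the values that entered ReLU
drelu = np.zeros_like(dvalues)
drelu[dvalues > 0] = 1
drelu *= dvalues
print(drelu)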

[[-1.1  2.5  0.6  0.5]
 [ 2.6 -3.   2.  -2.2]
 [-3.9  4.5 -3.   3.3]]
[[  9.5  -7.5   7.5]
 [  3.9   0.1  -0.1]
 [ -8.9  14.9 -14.9]
 [  8.9  -3.9   3.9]]
[[ 0  2 -2]]
[[1 1 0]
 [2 0 2]
 [0 3 0]]
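
In the cell above, the ReLU mask is built from `dvalues` itself to keep the demo short. In a full layer, the mask would normally come from the values that entered ReLU during the forward pass. Here is a minimal sketch of that idea, assuming a made-up array `z` of forward-pass values and made-up upstream gradients; neither comes from our main code.

```python
import numpy as np

# Hypothetical values that entered ReLU during the forward pass (illustrative only)
z = np.array([[1.0, 2.0, -3.0, -4.0],
              [2.0, -7.0, -1.0, 3.0],
              [-1.0, 2.0, 5.0, -1.0]])

# Hypothetical gradients received from the next layer, same shape as z
dvalues = np.array([[1.0, 2.0, 3.0, 4.0],
                    [5.0, 6.0, 7.0, 8.0],
                    [9.0, 10.0, 11.0, 12.0]])

# Copy the upstream gradients, then zero them wherever ReLU was inactive (z <= 0)
drelu = dvalues.copy()
drelu[z <= 0] = 0

print(drelu)
```

Copying `dvalues` and zeroing the blocked entries gives the same result as building a 0/1 mask and multiplying, with one fewer array operation.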


Now we are finally ready to add everything to our original code. Then we will learn to compute the gradients of the softmax and loss functions. After that we will apply optimization.