# Chapter 9: Backpropagation

Pg.180-248

# Introduction 
(Pg.181-214)

Instead of starting with minimizing the loss, we first do something simpler: minimize the output of a single neuron. Minimizing the loss remains the end goal for the neural network. We will calculate the impact each variable (input, weight, and bias) has on the ReLU-activated output. To do that, we use the chain rule to calculate the derivative of the output with respect to each variable.

In [36]:
# Single Neuron with 3 inputs
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias

# first input 
xw0 = x[0] * w[0]
print(f"xw0: {xw0}")

# second input
xw1 = x[1] * w[1]
print(f"xw1: {xw1}")

# third input
xw2 = x[2] * w[2]
print(f"xw2: {xw2}")

# all
print(xw0, xw1, xw2)

xw0: -3.0
xw1: 2.0
xw2: 6.0
-3.0 2.0 6.0


![](../references/input_weight.png)

### Summation (2nd Operation)

In [37]:
# Summation
# adding weighted inputs and bias
z = xw0 + xw1 + xw2 + b
print(f"z = {xw0} + {xw1} + {xw2} + {b}")
print(f"z: {z}")

z = -3.0 + 2.0 + 6.0 + 1.0
z: 6.0


![](../references/weight_input_bias.png)

### ReLU Activation Function

In [38]:
# ReLU activation function
# forms the output of the neuron
print(f"z (input): {z}")
y = max(z, 0)
print(f"y (ReLU output): {y}")

z (input): 6.0
y (ReLU output): 6.0


![](../references/relu_output.png)

This was a complete forward pass through a single neuron and a ReLU activation function.

### One Big Function

This forward pass is a series of chained functions essentially. We are going to treat this as one big function. This big function consists of much simpler functions:
- multiplication (weight * input)
- summation (IxW1 + IxW2 + IxW3 + b)
- ReLU (max(0, IxW1 + IxW2 + IxW3 + b))
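
The composition listed above can be sketched as a single Python function. The name `neuron_output` is my own label for illustration and does not come from the book; the values are the same inputs, weights, and bias used in the cells above.

```python
def neuron_output(x, w, b):
    """Forward pass of one neuron: multiply, sum, then ReLU."""
    weighted = [xi * wi for xi, wi in zip(x, w)]  # multiplication (weight * input)
    z = sum(weighted) + b                         # summation of weighted inputs and bias
    return max(z, 0.0)                            # ReLU

print(neuron_output([1.0, -2.0, 3.0], [-3.0, -1.0, 2.0], 1.0))  # 6.0
```

Backpropagation then works through these three steps in reverse order.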

### Derivative of ReLU

![](../references/relu_der.png)

In [39]:
# backward pass

# derivative from the next layer
# example value of 1.0, which leaves the chained values unchanged
dvalue = 1.0

# input value to ReLU (z): 6.0
# derivative of ReLU: 1 when z > 0, otherwise 0
# so the largest value the derivative of ReLU can take is 1
print(f"Z : {z}")
# derivative of ReLU and chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print(f"drelu_dz: {drelu_dz}")

Z : 6.0
drelu_dz: 1.0
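
As a sanity check on the rule used above, here is a minimal sketch that compares the ReLU derivative with a numerical (finite-difference) estimate. The helper names `relu` and `relu_derivative` and the step size `h` are my own assumptions for illustration.

```python
def relu(z):
    return max(z, 0.0)

def relu_derivative(z):
    # 1 where the input is positive, 0 otherwise
    # (treating z == 0 as 0 is a convention, matching the cell above)
    return 1.0 if z > 0 else 0.0

h = 1e-6  # small step for the numerical estimate
for z in (6.0, -2.0):
    numeric = (relu(z + h) - relu(z - h)) / (2 * h)
    print(f"z={z}: analytic={relu_derivative(z)}, numeric={round(numeric, 6)}")
```

The two values should agree for any `z` that is not right at zero.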


In [None]:
# during the backward pass
# 1. calculate the derivative of the loss function
# 2. multiply it by the derivative of the output layer's activation function
# 3. multiply that by the derivative of the output layer
# 4. and so on, back through the earlier layers

# derivatives with respect to weights and biases form the gradients used to update the weights and biases
# derivatives with respect to inputs form the gradient that is chained with the previous layer
# gradients - how much each weight/bias contributes to the error
# use the gradients to adjust weights and biases to reduce the error

# CHAIN GRADIENTS
# each layer's adjustments are based on errors propagated back from the layers after it
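
To make the chaining idea concrete, here is a small sketch with two toy functions that are not part of the neuron (`inner` and `outer` are hypothetical names I picked). It multiplies the two derivatives together, as the chain rule says, and compares the result with a numerical estimate.

```python
def inner(x):
    return 3.0 * x   # derivative with respect to x: 3

def outer(u):
    return u ** 2    # derivative with respect to u: 2 * u

x = 2.0
d_outer = 2.0 * inner(x)     # derivative of outer, evaluated at inner(x)
d_inner = 3.0                # derivative of inner
chained = d_outer * d_inner  # chain rule: multiply the derivatives

h = 1e-6
numeric = (outer(inner(x + h)) - outer(inner(x - h))) / (2 * h)
print(chained, round(numeric, 4))
```

In a network, the same multiplication is repeated layer by layer, starting from the loss.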

In [None]:
# derivatives are shown in red

![](../references/relu_chain.png)

![](../references/grad1.png)

w.r.t. - stands for "with respect to"

In [40]:
# partial deriv. of the sum, chain rule
# drelu_dxw0 - partial deriv. of relu with respect to the first weighted input x0w0
dsum_dxw0 = 1 # partial deriv. of sum with respect to the first weighted input x0w0
# drelu_dz - partial deriv. of relu with respect to z, which is 1
drelu_dxw0 = drelu_dz * dsum_dxw0
print(f"drelu_dxw0: {drelu_dxw0}")

drelu_dxw0: 1.0


In [41]:
# partial deriv. of the sum, chain rule
# next weighted input
dsum_dxw1 = 1
drelu_dxw1 = drelu_dz * dsum_dxw1
print(f"drelu_dxw1: {drelu_dxw1}")

drelu_dxw1: 1.0


In [42]:
# partial deriv. of the sum, chain rule
# last weighted input
dsum_dxw2 = 1 
drelu_dxw2 = drelu_dz * dsum_dxw2
print(f"drelu_dxw2: {drelu_dxw2}")

drelu_dxw2: 1.0


In [43]:
# bias
dsum_db = 1
drelu_db = drelu_dz * dsum_db
print(f"drelu_db : {drelu_db}")

drelu_db : 1.0
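
The partial derivative of a sum with respect to any one of its terms is always 1, which is why every `dsum_*` value above is 1. A minimal numerical check, assuming a helper `weighted_sum` (my own name) and a small step `h`:

```python
def weighted_sum(xw0, xw1, xw2, b):
    return xw0 + xw1 + xw2 + b

args = [-3.0, 2.0, 6.0, 1.0]  # xw0, xw1, xw2, b from the forward pass
h = 1e-6

partials = []
for i in range(len(args)):
    bumped = list(args)
    bumped[i] += h  # nudge one term, keep the others fixed
    partials.append((weighted_sum(*bumped) - weighted_sum(*args)) / h)

print([round(p, 3) for p in partials])
```

Each estimate should come out very close to 1, no matter what values the terms have.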


In [44]:
# forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias

# multiplying inputs by weights 
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# add weighted inputs and bias
z = xw0 + xw1 + xw2 + b
print(f"Weighted Input + Bias : {z}")

# relu activation
y = max(z, 0)
print(f"ReLU Activation : {y}")

# backward pass

# derivative from the next layer (example value)
dvalue = 1.0

# Derivative of ReLU and chain rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print(f"Derivative of ReLU : {drelu_dz}")

# Partial derivatives of the sum, the chain rule
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1
drelu_dxw0 = drelu_dz * dsum_dxw0
drelu_dxw1 = drelu_dz * dsum_dxw1
drelu_dxw2 = drelu_dz * dsum_dxw2
drelu_db = drelu_dz * dsum_db
print(f"Partial Derivatives w.r.t. Weighted Inputs and Bias : {drelu_dxw0}, {drelu_dxw1}, {drelu_dxw2}, {drelu_db}")

Weighted Input + Bias : 6.0
ReLU Activation : 6.0
Derivative of ReLU : 1.0
Partial Derivatives w.r.t. Weighted Inputs and Bias : 1.0, 1.0, 1.0, 1.0


![](../references/9_17.png)

In [45]:
# before the summation comes the multiplication of inputs and weights
# so we need the partial derivatives of the multiplication
# partial deriv. of the first weighted input w.r.t. the input = the weight
# apply the chain rule:
# multiply by the derivative chained from the subsequent function (sum)

dmul_dx0 = w[0]
print(f"dmul_dx0 (partial deriv. of first weighted input) : {dmul_dx0}")
drelu_dx0 = drelu_dxw0 * dmul_dx0
print(f"drelu_dx0: {drelu_dx0}")

dmul_dx0 (partial deriv. of first weighted input) : -3.0
drelu_dx0: -3.0
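
Written out as one expression, the chain for the first input looks like this. It is a worked form of the cell above, using $z = x_0 w_0 + x_1 w_1 + x_2 w_2 + b$ and the example gradient of 1 from the next layer; since $z = 6 > 0$, the ReLU derivative is 1.

$$
\frac{\partial\,\mathrm{ReLU}(z)}{\partial x_0}
= \frac{\partial\,\mathrm{ReLU}(z)}{\partial z}
\cdot \frac{\partial z}{\partial (x_0 w_0)}
\cdot \frac{\partial (x_0 w_0)}{\partial x_0}
= 1 \cdot 1 \cdot w_0 = -3
$$

The derivative with respect to the weight $w_0$ has the same form, except that the last factor is $x_0$ instead of $w_0$.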


![](../references/later_chain.png)

In [46]:
# same for all other inputs and weights

# forward pass
x = [1.0, -2.0, 3.0] # input values
w = [-3.0, -1.0, 2.0] # weights
b = 1.0 # bias

# multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# add weighted inputs and a bias
z = xw0 + xw1 + xw2 + b

# ReLU Activation Function
y = max(z, 0)

# backward pass
# derivative from the next layer
dvalue = 1.0

# Derivative of ReLU and the Chain Rule
drelu_dz = dvalue * (1. if z > 0 else 0.)
print(f"drelu_dz: {drelu_dz}")

# Partial derivatives of the sum, the chain rule
# (the partial derivative of a sum w.r.t. any of its terms is 1)
dsum_dxw0 = 1
dsum_dxw1 = 1
dsum_dxw2 = 1
dsum_db = 1
drelu_dxw0 = drelu_dz * dsum_dxw0 # 1.0 * 1
drelu_dxw1 = drelu_dz * dsum_dxw1 # 1.0 * 1
drelu_dxw2 = drelu_dz * dsum_dxw2 # 1.0 * 1
drelu_db = drelu_dz * dsum_db # 1.0 * 1
print(f"drelu_dxw0: {drelu_dxw0}")
print(f"drelu_dxw1: {drelu_dxw1}")
print(f"drelu_dxw2: {drelu_dxw2}")
print(f"drelu_db: {drelu_db}")
#print(drelu_dxw0, drelu_dxw1, drelu_dxw2, drelu_db)
print("----------------------")

# Partial derivatives of the multiplication, the chain rule
dmul_dx0 = w[0] # -3.0
dmul_dx1 = w[1] # -1.0
dmul_dx2 = w[2] # 2.0
dmul_dw0 = x[0] # 1.0
dmul_dw1 = x[1] # -2.0
dmul_dw2 = x[2] # 3.0
drelu_dx0 = drelu_dxw0 * dmul_dx0 # 1.0 * -3.0
drelu_dw0 = drelu_dxw0 * dmul_dw0 # 1.0 * 1.0
drelu_dx1 = drelu_dxw1 * dmul_dx1 # 1.0 * -1.0
drelu_dw1 = drelu_dxw1 * dmul_dw1 # 1.0 * -2.0
drelu_dx2 = drelu_dxw2 * dmul_dx2 # 1.0 * 2.0
drelu_dw2 = drelu_dxw2 * dmul_dw2 # 1.0 * 3.0
#print(drelu_dx0, drelu_dw0, drelu_dx1, drelu_dw1, drelu_dx2, drelu_dw2)
print(f"drelu_dx0: {drelu_dx0}")
print(f"drelu_dw0: {drelu_dw0}")
print(f"drelu_dx1: {drelu_dx1}")
print(f"drelu_dw1: {drelu_dw1}")
print(f"drelu_dx2: {drelu_dx2}")
print(f"drelu_dw2: {drelu_dw2}")

drelu_dz: 1.0
drelu_dxw0: 1.0
drelu_dxw1: 1.0
drelu_dxw2: 1.0
drelu_db: 1.0
----------------------
drelu_dx0: -3.0
drelu_dw0: 1.0
drelu_dx1: -1.0
drelu_dw1: -2.0
drelu_dx2: 2.0
drelu_dw2: 3.0


![](../references/complete_back.png)

This animation by Sentdex really helped me understand the backpropagation process.

https://nnfs.io/pro/

In [47]:
# associated animation
# https://nnfs.io/com/ 

# simplify
drelu_dx0 = drelu_dxw0 * dmul_dx0
# where
dmul_dx0 = w[0]
# insert
drelu_dx0 = drelu_dxw0 * w[0]
# where
drelu_dxw0 = drelu_dz * dsum_dxw0
# insert
drelu_dx0 = drelu_dz * dsum_dxw0 * w[0]
# where
dsum_dxw0 = 1
# insert
drelu_dx0 = drelu_dz * 1 * w[0]
drelu_dx0 = drelu_dz * w[0]
# where
drelu_dz = dvalue * (1. if z > 0 else 0.)
# insert
drelu_dx0 = dvalue * (1. if z > 0 else 0.) * w[0]

In [48]:
dx = [drelu_dx0, drelu_dx1, drelu_dx2] # gradients on inputs
dw = [drelu_dw0, drelu_dw1, drelu_dw2] # gradients on weights
db = drelu_db # gradient on bias (only 1 bias here)

In [49]:
print(w, b)

[-3.0, -1.0, 2.0] 1.0


In [50]:
# apply fractions of gradients
# trying to decrease the output
w[0] += -0.001 * dw[0]
w[1] += -0.001 * dw[1]
w[2] += -0.001 * dw[2]
b += -0.001 * db

print(w, b)

[-3.001, -0.998, 1.997] 0.999


In [51]:
# check the new output to see if it decreased
# Multiplying inputs by weights
xw0 = x[0] * w[0]
xw1 = x[1] * w[1]
xw2 = x[2] * w[2]

# Adding
z = xw0 + xw1 + xw2 + b

# ReLU activation function
print(f"Old Y : {y}")
y = max(z, 0)
print(f"New Y : {y}")

Old Y : 6.0
New Y : 5.985
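
The update step above can be repeated. The following is a minimal sketch that wraps the forward and backward pass into two helper functions (`forward` and `backward` are names I chose, not from the book) and applies the same 0.001 factor five times. It assumes the same single neuron with an example gradient of 1.0 from the next layer.

```python
def forward(x, w, b):
    """Return the pre-activation z and the ReLU output y."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return z, max(z, 0.0)

def backward(x, w, z, dvalue=1.0):
    """Return gradients on inputs, weights, and bias via the chain rule."""
    drelu_dz = dvalue * (1.0 if z > 0 else 0.0)
    dx = [drelu_dz * wi for wi in w]  # d(x*w)/dx = w
    dw = [drelu_dz * xi for xi in x]  # d(x*w)/dw = x
    db = drelu_dz                     # d(sum)/db = 1
    return dx, dw, db

x = [1.0, -2.0, 3.0]
w = [-3.0, -1.0, 2.0]
b = 1.0
step = 0.001  # same fraction of the gradient as in the cell above

outputs = []
for _ in range(5):
    z, y = forward(x, w, b)
    outputs.append(y)
    dx, dw, db = backward(x, w, z)
    # move weights and bias against their gradients to decrease the output
    w = [wi - step * dwi for wi, dwi in zip(w, dw)]
    b -= step * db

print(outputs)
```

The printed outputs should shrink a little with each step. In a real network the quantity being decreased is the loss, not the neuron's output.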


## Amazing Explanation of Backpropagation

I found this amazing YouTube channel explaining it.

https://www.youtube.com/watch?v=tUoUdOdTkRw&t=559s 

# Categorical Cross-Entropy loss derivative 

(Pg.215-217)

# Categorical Cross-Entropy loss derivative code implementation 

(Pg.218-219)

# Softmax activation derivative 

(Pg.220-225)

# Softmax activation derivative code implementation 

(Pg.226-229)

# Common Categorical Cross-Entropy loss and Softmax activation derivative 

(Pg.230-233)

# Common Categorical Cross-Entropy loss and Softmax activation derivative - code implementation 

(Pg.234-242)

# Full code up to this point 

(Pg.243-248)