<a href="https://colab.research.google.com/github/udlbook/udlbook/blob/main/CM20315_2023/CM20315_Coursework_V_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Coursework V: Backpropagation in Toy Model**

This notebook computes the derivatives of a toy function similar to (but different from) the one in section 7.3 of the book.

Work through the cells below, running each cell in turn. In various places you will see the words "TO DO". Follow the instructions at these places and make predictions about what is going to happen or write code to complete the functions.  At various points, you will get an answer that you need to copy into Moodle to be marked.

Post to the content forum if you find any mistakes or need to clarify something.

# Problem setting

We're going to investigate how to take the derivatives of functions where one operation is composed with another, which is composed with a third and so on.  For example, consider the model:

\begin{equation}
     \mbox{f}[x,\boldsymbol\phi] = \beta_3+\omega_3\cdot\mbox{PReLU}\Bigl[\gamma, \beta_2+\omega_2\cdot\mbox{PReLU}\bigl[\gamma, \beta_1+\omega_1\cdot\mbox{PReLU}[\gamma, \beta_0+\omega_0x]\bigr]\Bigr],
\end{equation}

with parameters $\boldsymbol\phi=\{\beta_0,\omega_0,\beta_1,\omega_1,\beta_2,\omega_2,\beta_3,\omega_3\}$, where

\begin{equation}
\mbox{PReLU}[\gamma, z] = \begin{cases} \gamma\cdot z & \quad z \leq0 \\ z & \quad z> 0\end{cases}.
\end{equation}

Suppose that we have a binary cross-entropy loss function (equation 5.20 from the book):

\begin{equation*}
\ell_i  = -(1-y_{i})\log\Bigl[1-\mbox{sig}[\mbox{f}[\mathbf{x}_i,\boldsymbol\phi]]\Bigr] - y_{i}\log\Bigl[\mbox{sig}[\mbox{f}[\mathbf{x}_i,\boldsymbol\phi]]\Bigr].
\end{equation*}

Assume that we know the current values of $\beta_{0},\beta_{1},\beta_{2},\beta_{3},\omega_{0},\omega_{1},\omega_{2},\omega_{3}$, $\gamma$, $x_i$ and $y_i$. We want to know how $\ell_i$ changes when we make a small change to $\beta_{0},\beta_{1},\beta_{2},\beta_{3},\omega_{0},\omega_{1},\omega_{2}$, or $\omega_{3}$.  In other words, we want to compute the eight derivatives:

\begin{eqnarray*}
\frac{\partial \ell_i}{\partial \beta_{0}}, \quad \frac{\partial \ell_i}{\partial \beta_{1}}, \quad \frac{\partial \ell_i}{\partial \beta_{2}}, \quad \frac{\partial \ell_i }{\partial \beta_{3}},  \quad \frac{\partial \ell_i}{\partial \omega_{0}}, \quad \frac{\partial \ell_i}{\partial \omega_{1}}, \quad \frac{\partial \ell_i}{\partial \omega_{2}},  \quad\mbox{and} \quad \frac{\partial \ell_i}{\partial \omega_{3}}.
\end{eqnarray*}

In [None]:
# import library
import numpy as np

Let's first define the original function and the loss term:

In [None]:
# Defines the activation function
def paramReLU(gamma,x):
  if x > 0:
    return x
  else:
    return x * gamma

# Defines the main function
def fn(x, beta0, beta1, beta2, beta3, omega0, omega1, omega2, omega3, gamma):
  return beta3+omega3 * paramReLU(gamma, beta2 + omega2 * paramReLU(gamma, beta1 + omega1 * paramReLU(gamma, beta0 + omega0 * x)))

# Logistic sigmoid
def sig(z):
  return 1./(1+np.exp(-z))

# The loss function (equation 5.20 from book)
def loss(f,y):
  sig_net_out = sig(f)
  l = -(1-y) * np.log(1-sig_net_out) - y * np.log(sig_net_out)
  return l

Now we'll choose some values for the betas and the omegas and x and compute the output of the function:

In [None]:
beta0 = 1.0; beta1 = -2.0; beta2 = -3.0; beta3 = 0.4
omega0 = 0.1; omega1 = -0.4; omega2 = 2.0; omega3 = -3.0
gamma = 0.2
x = 2.3; y =1.0
f_val = fn(x,beta0,beta1,beta2,beta3,omega0,omega1,omega2,omega3, gamma)
l_i_func = loss(f_val, y)
print('Loss full function = %3.3f'%l_i_func)

# Forward pass

We compute a series of intermediate values $f_0, h_1, f_1, h_2, f_2, h_3, f_3$, and finally the loss $\ell_i$:

In [None]:
x = 2.3; y =1.0
gamma = 0.2
# Compute all the f_k and h_k terms
# I've done the first two for you
f0 = beta0+omega0 * x
h1 = paramReLU(gamma, f0)


# TODO:  Replace the code below
f1 = 0
h2 = 0
f2 = 0
h3 = 0
f3 = 0


# Compute the loss and print
# The answer should be the same as when we computed the full function above
l_i = loss(f3, y)
print("Loss forward pass = %3.3f"%(l_i))


# Backward pass:  Derivative of loss function with respect to function output

Now, we'll compute the derivative $\frac{dl}{df_3}$ of the loss function with respect to the network output $f_3$.  In other words, we are asking how does the loss change as we make a small change in the network output.

Since the loss is itself a function of $\mbox{sig}[f_3]$, we'll compute this using the chain rule:

\begin{equation}
\frac{dl}{df_3} = \frac{d\mbox{sig}[f_3]}{df_3}\cdot \frac{dl}{d\mbox{sig}[f_3]}
\end{equation}

Your job is to compute the two quantities on the right-hand side.


In [None]:
# Compute the derivative of the loss with respect to the function output f_val
def dl_df(f_val,y):
  # Compute sigmoid of network output
  sig_f_val = sig(f_val)
  # Compute the derivative of loss with respect to network output using chain rule
  dl_df_val = dsig_df(f_val) * dl_dsigf(sig_f_val, y)
  # Return the derivative
  return dl_df_val

In [None]:
# MOODLE ANSWER # Notebook V 1a: Copy this code when you have finished it.

# Compute the derivative of the logistic sigmoid function with respect to its input (as a closed form solution)
def dsig_df(f_val):
  # TODO Write this function
  # Replace this line:
  return 1

# Compute the derivative of the loss with respect to the logistic sigmoid (as a closed form solution)
def dl_dsigf(sig_f_val, y):
  # TODO Write this function
  # Replace this line:
  return 1

Let's run that for some f_val, y.  Check previous practicals to see how you can check whether your answer is correct.
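
One common way to check a closed-form derivative is to compare it with a finite-difference estimate. The sketch below is only an illustration under stated assumptions: the helper name `finite_difference` and the choice of `np.tanh` as the test function are hypothetical and not part of the coursework. The same pattern could be applied to your own functions.

```python
import numpy as np

# Hypothetical helper (not part of the notebook): central-difference
# estimate of the derivative of a scalar function fn at the point z
def finite_difference(fn, z, delta=1e-5):
  return (fn(z + delta) - fn(z - delta)) / (2 * delta)

# Illustrative function with a well-known derivative: d tanh(z)/dz = 1 - tanh(z)^2
def tanh_derivative(z):
  return 1.0 - np.tanh(z) ** 2

z_test = 0.7
numeric = finite_difference(np.tanh, z_test)
analytic = tanh_derivative(z_test)
print("numeric = %3.6f, analytic = %3.6f" % (numeric, analytic))
```

If the two numbers agree to several decimal places, the closed-form expression is very likely correct. A large disagreement usually points to a sign error or a missing factor.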

In [None]:
y = 0.0
dl_df3 = dl_df(f3,y)
print("Moodle Answer Notebook V 1b: dldf3=%3.3f"%(dl_df3))

y = 1.0
dl_df3 = dl_df(f3,y)
print("Moodle Answer Notebook V 1c: dldf3=%3.3f"%(dl_df3))

# Backward pass:  Derivative of activation function with respect to preactivations

Write a function to compute the derivative $\frac{\partial h}{\partial f}$ of the activation function (parametric ReLU) with respect to its input.


In [None]:
# MOODLE ANSWER Notebook V 2a: Copy this code when you have finished it.

def dh_df(gamma, f_val):
  # TODO:  Write this function
  # Replace this line:
  return 1


Let's run that for some values of f_val.  Check previous practicals to see how you can check whether your answer is correct.

In [None]:
f_val_test = 0.6
dh_df_val = dh_df(gamma, f_val_test)
print("Moodle Answer Notebook V 2b: dhdf=%3.3f"%(dh_df_val))

f_val_test = -0.4
dh_df_val = dh_df(gamma, f_val_test)
print("Moodle Answer Notebook V 2c: dhdf=%3.3f"%(dh_df_val))

# Backward pass:  Compute the derivatives of $\ell_i$ with respect to the intermediate quantities but in reverse order:

\begin{eqnarray}
\frac{\partial \ell_i}{\partial h_3}, \quad \frac{\partial \ell_i}{\partial f_2}, \quad
\frac{\partial \ell_i}{\partial h_2}, \quad \frac{\partial \ell_i}{\partial f_1}, \quad\frac{\partial \ell_i}{\partial h_1},  \quad\mbox{and} \quad \frac{\partial \ell_i}{\partial f_0}.
\end{eqnarray}

The first of these derivatives can be calculated using the chain rule:

\begin{equation}
\frac{\partial \ell_i}{\partial h_{3}} =\frac{\partial f_{3}}{\partial h_{3}} \frac{\partial \ell_i}{\partial f_{3}} .
\end{equation}

The left-hand side asks how $\ell_i$ changes when $h_{3}$ changes.  The right-hand side says we can decompose this into (i) how $\ell_i$ changes when $f_{3}$ changes and (ii) how $f_{3}$ changes when $h_{3}$ changes.  So you get a chain of events happening:  $h_{3}$ changes $f_{3}$, which changes $\ell_i$, and the derivatives represent the effects of this chain.  Notice that we computed the first of these derivatives already.  The second term is the derivative of $\beta_{3} + \omega_{3}h_{3}$ with respect to $h_3$, which is simply $\omega_3$.

We can continue in this way, computing the derivatives of the loss with respect to these intermediate quantities:

\begin{eqnarray}
\frac{\partial \ell_i}{\partial f_{2}} &=& \frac{\partial h_{3}}{\partial f_{2}}\left(
\frac{\partial f_{3}}{\partial h_{3}}\frac{\partial \ell_i}{\partial f_{3}} \right)
\nonumber \\
\frac{\partial \ell_i}{\partial h_{2}} &=& \frac{\partial f_{2}}{\partial h_{2}}\left(\frac{\partial h_{3}}{\partial f_{2}}\frac{\partial f_{3}}{\partial h_{3}}\frac{\partial \ell_i}{\partial f_{3}}\right)\nonumber \\
\frac{\partial \ell_i}{\partial f_{1}} &=& \frac{\partial h_{2}}{\partial f_{1}}\left( \frac{\partial f_{2}}{\partial h_{2}}\frac{\partial h_{3}}{\partial f_{2}}\frac{\partial f_{3}}{\partial h_{3}}\frac{\partial \ell_i}{\partial f_{3}} \right)\nonumber \\
\frac{\partial \ell_i}{\partial h_{1}} &=& \frac{\partial f_{1}}{\partial h_{1}}\left(\frac{\partial h_{2}}{\partial f_{1}} \frac{\partial f_{2}}{\partial h_{2}}\frac{\partial h_{3}}{\partial f_{2}}\frac{\partial f_{3}}{\partial h_{3}}\frac{\partial \ell_i}{\partial f_{3}} \right)\nonumber \\
\frac{\partial \ell_i}{\partial f_{0}} &=& \frac{\partial h_{1}}{\partial f_{0}}\left(\frac{\partial f_{1}}{\partial h_{1}}\frac{\partial h_{2}}{\partial f_{1}} \frac{\partial f_{2}}{\partial h_{2}}\frac{\partial h_{3}}{\partial f_{2}}\frac{\partial f_{3}}{\partial h_{3}}\frac{\partial \ell_i}{\partial f_{3}} \right).
\end{eqnarray}

In each case, the bracketed terms were already computed in the previous step, and the one new term at the front is simple to evaluate: it is either the derivative of a linear function or the derivative of the activation function.  This is called the **backward pass**.
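
The pattern of reusing the previous derivative at each step can be sketched on a smaller, unrelated model. Everything in the block below is an assumption made for illustration: the two-stage model, the tanh activation, the squared loss, and the names `toy_loss` and `toy_forward_backward` are hypothetical and are not the coursework model.

```python
import numpy as np

# Hypothetical two-stage model (not the coursework model):
#   f0 = b0 + w0 * x,  h1 = tanh(f0),  f1 = b1 + w1 * h1,  loss = (f1 - y)^2
def toy_loss(x, y, b0, w0, b1, w1):
  f0 = b0 + w0 * x
  h1 = np.tanh(f0)
  f1 = b1 + w1 * h1
  return (f1 - y) ** 2

def toy_forward_backward(x, y, b0, w0, b1, w1):
  # Forward pass: store every intermediate value
  f0 = b0 + w0 * x
  h1 = np.tanh(f0)
  f1 = b1 + w1 * h1
  loss_val = (f1 - y) ** 2
  # Backward pass: each line reuses the derivative from the line before
  dl_df1 = 2 * (f1 - y)             # derivative of the squared loss
  dl_dh1 = w1 * dl_df1              # df1/dh1 = w1
  dl_df0 = (1 - h1 ** 2) * dl_dh1   # dh1/df0 = 1 - tanh(f0)^2
  dl_dw0 = x * dl_df0               # df0/dw0 = x
  return loss_val, dl_dw0

args = dict(x=0.5, y=1.0, b0=0.3, w0=-1.2, b1=0.1, w1=0.8)
loss_val, grad_w0 = toy_forward_backward(**args)

# Finite-difference check on w0
delta = 1e-6
plus = dict(args, w0=args["w0"] + delta)
minus = dict(args, w0=args["w0"] - delta)
grad_w0_fd = (toy_loss(**plus) - toy_loss(**minus)) / (2 * delta)
print("backprop = %3.6f, finite difference = %3.6f" % (grad_w0, grad_w0_fd))
```

The backward pass costs roughly the same as the forward pass because no product of derivatives is ever recomputed from scratch.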

In [None]:
x = 2.3; y =1.0
dldf3 = dl_df(f3,y)

In [None]:
# MOODLE ANSWER Notebook V 3a: Copy this code when you have finished it.
# TODO -- Compute the derivatives of the loss with respect
# to the intermediate computations h_k and f_k (i.e., run the backward pass)
# Replace the code below:
dldh3 = 1
dldf2 = 1
dldh2 = 1
dldf1 = 1
dldh1 = 1
dldf0 = 1

Finally, we consider how the loss $\ell_{i}$ changes when we change the parameters $\beta_{\bullet}$ and $\omega_{\bullet}$. Once more, we apply the chain rule:




\begin{eqnarray}
\frac{\partial \ell_i}{\partial \beta_{k}} &=& \frac{\partial f_{k}}{\partial \beta_{k}}\frac{\partial \ell_i}{\partial f_{k}}\nonumber \\
\frac{\partial \ell_i}{\partial \omega_{k}} &=& \frac{\partial f_{k}}{\partial \omega_{k}}\frac{\partial \ell_i}{\partial f_{k}}.
\end{eqnarray}

In each case, the second term on the right-hand side was computed during the backward pass above. When $k>0$, we have $f_{k}=\beta_{k}+\omega_k \cdot h_{k}$, so:

\begin{eqnarray}
\frac{\partial f_{k}}{\partial \beta_{k}} = 1 \quad\quad\mbox{and}\quad \quad \frac{\partial f_{k}}{\partial \omega_{k}} &=& h_{k}.
\end{eqnarray}

When $k=0$, we have $f_{0}=\beta_{0}+\omega_0 \cdot x$, so the input $x$ takes the place of $h_{k}$.

In [None]:
# MOODLE ANSWER Notebook V 3b: Copy this code when you have finished it.
# TODO -- Calculate the final derivatives with respect to the beta and omega terms
# Replace these terms
dldbeta3 = 1
dldomega3 = 1
dldbeta2 = 1
dldomega2 = 1
dldbeta1 = 1
dldomega1 = 1
dldbeta0 = 1
dldomega0 = 1

In [None]:
# Print the last two values out (enter these into Moodle).  Again, think about how you can test whether these are correct.
print('Moodle Answer Notebook V 3c: dldbeta0=%3.3f'%(dldbeta0))
print('Moodle Answer Notebook V 3d: dldOmega0=%3.3f'%(dldomega0))

# Compute the derivative of $\ell_i$ with respect to the parameter $\gamma$ of the parametric ReLU function

In other words, compute:

\begin{equation}
\frac{d\ell_i}{d\gamma}
\end{equation}

Along the way, we will need to compute the derivatives:

\begin{equation}
\frac{dh_k(\gamma,f_{k-1})}{d\gamma}
\end{equation}

This is quite difficult and not worth many marks, so don't spend too much time on it if you are confused!
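
When one parameter is used in several places, it influences the result along several paths, and the total derivative is the sum of one chain-rule term per path. The sketch below shows this on an unrelated toy function in which a parameter `a` appears twice. The function and the names `shared_param_fn` and `shared_param_grad` are hypothetical and are not the coursework model.

```python
import numpy as np

# Hypothetical function in which the parameter a is used twice:
#   u = a * x,  v = sin(u),  w = a * v,  out = sin(w)
def shared_param_fn(a, x):
  u = a * x
  v = np.sin(u)
  w = a * v
  return np.sin(w)

def shared_param_grad(a, x):
  u = a * x
  v = np.sin(u)
  w = a * v
  dout_dw = np.cos(w)
  # Path 1: a multiplies v directly in w = a * v, so dw/da = v
  direct = dout_dw * v
  # Path 2: a reaches w through v: dw/dv = a, dv/du = cos(u), du/da = x
  indirect = dout_dw * a * np.cos(u) * x
  # Total derivative is the sum over both paths
  return direct + indirect

a_test, x_test = 0.4, 1.3
grad_a = shared_param_grad(a_test, x_test)

# Finite-difference check
delta = 1e-6
grad_a_fd = (shared_param_fn(a_test + delta, x_test) - shared_param_fn(a_test - delta, x_test)) / (2 * delta)
print("analytic = %3.6f, finite difference = %3.6f" % (grad_a, grad_a_fd))
```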

In [None]:
# Computes how an activation changes with a small change in gamma assuming preactivations are f
# MOODLE ANSWER # Notebook V 4a: Copy this code when you have finished it.
def dhdgamma(gamma, f):
  # TODO -- Write this function
  # Replace this line
  return 1

In [None]:
# MOODLE ANSWER # Notebook V 4b: Copy this code when you have finished it.
# Compute how the loss changes with gamma
# Replace this line:
dldgamma = 1

In [None]:
print("Moodle Answer Notebook V 4c: dldgamma = %3.3f"%(dldgamma))