## 10.1. Pen & Paper: Backpropagation

Perform a forward and backward pass to calculate the gradients for the weights $w_0, w_1, w_2, w_s$ in the following MLP. Each node represents one unit with a weight $w_i, i \in \{0, 1, 2\}$ connecting it to the previous node. The connection from unit 0 to unit 2 is called a _skip connection_, which means unit 2 receives input from two sources and thus has an additional weight $w_s$. The weighted inputs are added before the nonlinearity is applied.

**![There should be an image here. If you can't see it, you probably forgot to download the mlp.png!](mlp.png)**

We assume that we want to solve a binary classification task, therefore the nonlinearity for unit 2 is the logistic function, so $h_2(a_2) = h_{\text{logistic}}(a_2)$. The corresponding loss function is the cross-entropy $L = CE(\hat{y}, y) = -y \log \hat{y} - (1 - y) \log (1 - \hat{y})$. The nonlinearities for units 0 and 1 are the hyperbolic tangent function: $h_0 = h_1 = h_{\text{tanh}}$.
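The forward pass described above can be sketched in a few lines of Python. This is a minimal sketch, not the official solution. It assumes a scalar input `x`, no bias terms, and the chain $x \rightarrow$ unit 0 $\rightarrow$ unit 1 $\rightarrow$ unit 2 with the skip connection from unit 0 to unit 2, as read from the text rather than from the figure. The names `forward`, `cross_entropy`, `a0`, `z0` and so on are hypothetical and chosen for illustration only.

```python
import math


def logistic(a):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w0, w1, w2, ws):
    """Forward pass through the three-unit MLP (assumed: scalar input, no biases)."""
    a0 = w0 * x                # pre-activation of unit 0
    z0 = math.tanh(a0)         # output of unit 0
    a1 = w1 * z0               # pre-activation of unit 1
    z1 = math.tanh(a1)         # output of unit 1
    a2 = w2 * z1 + ws * z0     # unit 2 adds both weighted inputs before the nonlinearity
    y_hat = logistic(a2)       # prediction
    return {"a0": a0, "z0": z0, "a1": a1, "z1": z1, "a2": a2, "y_hat": y_hat}


def cross_entropy(y_hat, y):
    """Binary cross-entropy loss for one sample."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


# Arbitrary example values, chosen only to exercise the code
cache = forward(x=1.0, w0=0.5, w1=-1.0, w2=2.0, ws=0.3)
loss = cross_entropy(cache["y_hat"], y=1.0)
print(cache["y_hat"], loss)
```

Keeping the intermediate values in a dictionary mirrors what you do on paper: the backward pass reuses $z_0$, $z_1$ and $\hat{y}$ from the forward pass.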

We provide you with one evaluation step for the backward pass, because deriving the gradient of the cross-entropy loss can take some time. Feel free to try it, but you have been warned! Otherwise simply use this (CE = Cross-Entropy):

\begin{equation}
\frac{\partial CE}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_2} = h_{\text{logistic}}(a_2) - y
\end{equation}

**Q 8.1.1: What difference does the skip connection make when propagating back the error?** 

**Answer:** Regarding backpropagation, unit 2 receives input from two sources, so in the backward pass the gradient also flows directly from unit 2 back to unit 0. The result is an additional term in the corresponding equations.

Regarding the effectiveness of training, this helps to reduce the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). If we have many layers in our network, the magnitude of the gradients can shrink towards the first layers to a point where those layers are basically not trained. Skip connections help bypass this by propagating the gradient uninterrupted from the last layers. As networks have become increasingly deep, skip connections are applied quite often in modern architectures.
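To make the additional term concrete, here is a minimal sketch of the backward pass. It rests on the same assumptions as the exercise text (scalar input `x`, no biases, $a_2 = w_2 z_1 + w_s z_0$). The function names `loss_fn`, `gradients` and `numerical_gradients` are hypothetical. The finite-difference helper is only a sanity check and is not part of backpropagation itself.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def loss_fn(x, y, w0, w1, w2, ws):
    """Forward pass followed by the cross-entropy loss."""
    z0 = math.tanh(w0 * x)
    z1 = math.tanh(w1 * z0)
    y_hat = logistic(w2 * z1 + ws * z0)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)


def gradients(x, y, w0, w1, w2, ws):
    """Analytic gradients of the loss w.r.t. (w0, w1, w2, ws)."""
    # Forward pass (values are reused in the backward pass)
    z0 = math.tanh(w0 * x)
    z1 = math.tanh(w1 * z0)
    y_hat = logistic(w2 * z1 + ws * z0)

    # Backward pass
    d2 = y_hat - y                   # dL/da2, the step provided in the exercise
    g_w2 = d2 * z1
    g_ws = d2 * z0
    d1 = d2 * w2 * (1 - z1 ** 2)     # dL/da1, using tanh'(a) = 1 - tanh(a)^2
    g_w1 = d1 * z0
    # Unit 0 feeds unit 1 AND unit 2: the second summand is the skip-connection term
    dz0 = d1 * w1 + d2 * ws
    d0 = dz0 * (1 - z0 ** 2)         # dL/da0
    g_w0 = d0 * x
    return [g_w0, g_w1, g_w2, g_ws]


def numerical_gradients(x, y, weights, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    grads = []
    for i in range(len(weights)):
        plus = list(weights)
        minus = list(weights)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss_fn(x, y, *plus) - loss_fn(x, y, *minus)) / (2 * eps))
    return grads


# Arbitrary example values
weights = [0.5, -1.0, 2.0, 0.3]
print(gradients(0.7, 1.0, *weights))
print(numerical_gradients(0.7, 1.0, weights))
```

Without the skip connection, `dz0` would consist of the first summand only. The term `d2 * ws` reaches unit 0 without passing through the derivative of unit 1's nonlinearity.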

<img src='bp_solution.png'>

In [1]:
# This is for the exercise session in order to show the steps successively

import ipywidgets as widgets
from IPython.display import display, display_html

i = 0

html_pre = """
<svg width="1000px" height=0><defs><clipPath id="mask">
"""
html_post = """
</clipPath></defs></svg>
<style>
img {
    display: block;
    margin: 0 auto;
}
.bp {
    width: 1000px;
    -webkit-clip-path: url(#mask);
    clip-path: url(#mask);
}
</style>
<img src='mlp.png' text-align='center'>
<img src="bp_solution.png" class=bp />
"""

fp = [390, 320, 250, 180, 110, 0]
bp = [100, 170, 170, 240, 310, 310, 380, 450]

def get_state(i):
    path = "<rect x=0 y=0 width='1000px' height='50' />"
    if i >= 14:
        path = "<rect x=0 y=0 width='1000px' height='500' />"
    else:
        # Forward pass
        path += "<rect x=0 y={:d} width='28%' height='500' />".format(fp[min(i, len(fp) - 1)])
        # Backward pass
        if i >= len(fp):
            path += "<rect x=0 y=0 width='72%' height='{:d}' />".format(bp[min(i - len(fp), len(bp) - 1)])
        # Weights
        if i >= 8:
            path += "<rect x='62%' y=0 width='1000px' height='240' />"
        if i >= 11:
            path += "<rect x='70%' y=0 width='1000px' height='330' />"


    return html_pre + path + html_post

image = widgets.HTML(value=get_state(i))

def show_image(btn):
    global i
    if btn.description == 'Previous':
        i = max(0, i - 1) 
    if btn.description == 'Next':
        i = min(14, i + 1)
    image.value = get_state(i)
        
prev_btn = widgets.Button(description='Previous')
next_btn = widgets.Button(description='Next')
prev_btn.on_click(show_image)
next_btn.on_click(show_image)

display(image)
box_layout = widgets.Layout(display='flex',
                    flex_flow='row',
                    align_items='stretch',
                    justify_content='center',
                    width='1000px')
box = widgets.Box(children=[prev_btn, next_btn], layout=box_layout)
display(box)


### Bonus: Derivation of the Cross-Entropy gradient

We're interested in the gradient $\frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_2}$.

Let's start with the loss:

\begin{eqnarray}
L &=& -y \log \hat{y} - (1 - y) \log (1 - \hat{y}) \\
\frac{\partial L}{\partial \hat{y}} &=& -y \cdot \frac{1}{\hat{y}} + (1 - y) \cdot \frac{1}{1 - \hat{y}}
\end{eqnarray}

Now we turn to the logistic function:
\begin{eqnarray}
\hat{y} = h_{\text{logistic}}(a_2) = \frac{1}{1 + e^{-a_2}} = \frac{e^{a_2}}{1 + e^{a_2}}
\end{eqnarray}

The last reformulation will make it easier to get a nice derivative of the logistic function. It's mostly refactoring after applying the quotient rule:

\begin{eqnarray}
\frac{\partial \hat{y}}{\partial a_2} &=& \frac{e^{a_2} (1 + e^{a_2}) - e^{a_2} e^{a_2}}{(1 + e^{a_2})^2} = \frac{e^{a_2}}{(1 + e^{a_2})^2} \\
&=&  \frac{e^{a_2}}{(1 + e^{a_2})}  \frac{1}{(1 + e^{a_2})} \\
&=& h_{\text{logistic}}(a_2) \cdot \left( \frac{1 + e^{a_2} - e^{a_2}}{1 + e^{a_2}} \right) = h_{\text{logistic}}(a_2) \left(1 - h_{\text{logistic}}(a_2)\right)
\end{eqnarray}
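As a quick sanity check of this result, the closed form $h_{\text{logistic}}(a_2)(1 - h_{\text{logistic}}(a_2))$ can be compared against a central finite difference. This is only an illustrative sketch. The helper names `logistic_grad` and `central_diff` are hypothetical.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def logistic_grad(a):
    """Closed-form derivative of the logistic function."""
    s = logistic(a)
    return s * (1.0 - s)


def central_diff(f, a, eps=1e-6):
    """Numerical derivative of f at a via central differences."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)


for a in (-2.0, 0.0, 1.5):
    print(a, logistic_grad(a), central_diff(logistic, a))
```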

With these two partials at hand, we can combine them into our desired gradient:

\begin{eqnarray}
\frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_2} &=& \left( -y \cdot \frac{1}{\hat{y}} + (1 - y) \cdot \frac{1}{1 - \hat{y}} \right) h_{\text{logistic}}(a_2) \left(1 - h_{\text{logistic}}(a_2)\right) \\
&=&  -y \cdot \frac{ h_{\text{logistic}}(a_2) \left(1 - h_{\text{logistic}}(a_2)\right)}{\hat{y}} + (1 - y) \cdot \frac{ h_{\text{logistic}}(a_2) \left(1 - h_{\text{logistic}}(a_2)\right)}{1 - \hat{y}} \\
&=& -y \cdot (1 - h_{\text{logistic}}(a_2)) + (1 - y) \cdot h_{\text{logistic}}(a_2) \\
&=& -y + y \cdot h_{\text{logistic}}(a_2) + h_{\text{logistic}}(a_2) - y \cdot h_{\text{logistic}}(a_2) \\
&=& h_{\text{logistic}}(a_2) - y
\end{eqnarray}
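The simplification can also be checked numerically. The sketch below multiplies the two partial derivatives exactly as in the first line of the derivation and compares the product with the closed form $h_{\text{logistic}}(a_2) - y$. The function names `ce_grad_chain` and `ce_grad_closed` are hypothetical and the test values are arbitrary.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def ce_grad_chain(a2, y):
    """dL/da2 as the product of the two partials from the derivation."""
    y_hat = logistic(a2)
    dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)
    dyhat_da2 = y_hat * (1 - y_hat)
    return dL_dyhat * dyhat_da2


def ce_grad_closed(a2, y):
    """dL/da2 in simplified form."""
    return logistic(a2) - y


for a2 in (-3.0, -0.5, 0.0, 2.0):
    for y in (0.0, 1.0):
        print(a2, y, ce_grad_chain(a2, y), ce_grad_closed(a2, y))
```

In practice the closed form is preferable: it avoids dividing by $\hat{y}$ or $1 - \hat{y}$, which become very small when the logistic function saturates.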