# Residual Networks

This chapter introduces $residual \ blocks$. These compute an additive change to the current representation at each layer instead of transforming it directly. 

This allows deeper networks to be trained, but it causes an exponential increase in the activation magnitudes at initialization. 

A $residual \ block$ uses $batch \ normalization$ to compensate for these exploding activations at each layer. 

## Sequential Processing

So far, most of our computations have involved sequential processing, for example: 

$$ \begin{align}
h_1 &= f_1[x, \phi_1] \\
h_2 &= f_2[h_1, \phi_2] \\
h_3 &= f_3[h_2, \phi_3] \\ 
y &= f_4[h_3, \phi_4]
\end{align}$$

This can also be viewed as: 

$$y = f_4[f_3[f_2[f_1[x, \phi_1], \phi_2], \phi_3], \phi_4]$$
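
The equivalence of the two forms can be checked numerically. The sketch below is a minimal NumPy example; the `layer` function (a linear map followed by a ReLU), the width `D` and the random parameters are assumptions chosen purely for illustration.

```python
import numpy as np

def layer(h, phi):
    """Hypothetical layer f[h, phi]: linear map followed by a ReLU."""
    W, b = phi
    return np.maximum(0.0, W @ h + b)

rng = np.random.default_rng(0)
D = 3  # assumed width of every layer
phis = [(rng.normal(size=(D, D)), rng.normal(size=D)) for _ in range(4)]
x = rng.normal(size=D)

# Step-by-step form: h_1, h_2, h_3, y
h1 = layer(x, phis[0])
h2 = layer(h1, phis[1])
h3 = layer(h2, phis[2])
y_steps = layer(h3, phis[3])

# Nested form: y = f_4[f_3[f_2[f_1[x]]]]
y_nested = layer(layer(layer(layer(x, phis[0]), phis[1]), phis[2]), phis[3])

print(np.allclose(y_steps, y_nested))  # True: both forms apply the same four operations
```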

### Limitation

In principle, the more layers we add, the greater the capacity of the network to model complex patterns. <br> However, in CNNs, performance decreases again when even more layers are added.

<div align="center">
<img src="../images/chap9/seqLimits.png" width="710"/>
</div>



## Residual connections (Skip connections)

$Skip \ connections$ are branches in the computational path where the input to each network layer $f[\bullet]$ is added back to its output, for example:

$$\begin{align} 
h_1 &= x + f_1[x, \phi_1] \\ 
h_2 &= h_1 + f_2[h_1, \phi_2] \\ 
h_3 &= h_2 + f_3[h_2, \phi_3] \\
y &= h_3 + f_4[h_3, \phi_4]
\end{align}$$

Substituting each line into the next, this can be written as a single expression: 

$$\begin{align} 
y &= x + f_1[x, \phi_1] \\ & + f_2[x + f_1[x, \phi_1], \phi_2] \\ & + f_3[x + f_1[x, \phi_1] + f_2[x + f_1[x, \phi_1], \phi_2], \phi_3] \\ & + f_4[x + f_1[x, \phi_1] + f_2[x + f_1[x, \phi_1], \phi_2] + f_3[x + f_1[x, \phi_1] + f_2[x + f_1[x, \phi_1], \phi_2], \phi_3], \phi_4]
\end{align}$$
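
As a sanity check, the step-by-step residual form and the expanded sum can be compared numerically. This is a minimal sketch: the branch function `f` (linear map plus ReLU), the width `D` and the random parameters are illustrative assumptions.

```python
import numpy as np

def f(h, phi):
    """Hypothetical residual branch f[h, phi]: linear map followed by a ReLU."""
    W, b = phi
    return np.maximum(0.0, W @ h + b)

rng = np.random.default_rng(0)
D = 3  # assumed width
phi1, phi2, phi3, phi4 = [
    (rng.normal(size=(D, D)), rng.normal(size=D)) for _ in range(4)
]
x = rng.normal(size=D)

# Residual form: each layer adds a change to the current representation.
h1 = x + f(x, phi1)
h2 = h1 + f(h1, phi2)
h3 = h2 + f(h2, phi3)
y = h3 + f(h3, phi4)

# Expanded form: the input plus the output of each of the four branches.
t1 = f(x, phi1)
t2 = f(x + t1, phi2)
t3 = f(x + t1 + t2, phi3)
t4 = f(x + t1 + t2 + t3, phi4)
y_expanded = x + t1 + t2 + t3 + t4

print(np.allclose(y, y_expanded))
```

Writing the output as `x` plus four branch terms makes it visible that the input reaches the output unchanged along one path, with every branch contributing additively.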

### How this helps

- A residual connection can be viewed as turning the original network into an ensemble of smaller networks whose outputs are summed to compute the result.

<div align="center">
<img src="../images/chap9/resViewOne.png" width="710"/>
</div>


- In another view, this network now has **16 paths** with differing numbers of transformations between the input and output. <br> This helps because the additive gradient contributions from these paths make the network suffer less from shattered gradients. 

Suppose we want to compute the partial derivative of the output with respect to the first function, $\frac{\partial y}{\partial f_1}$. <br> The first function appears **8 times** in the forward pass, so the derivative has eight terms:

$$\frac{\partial y}{\partial f_1} = I + \frac{\partial f_2}{\partial f_1} + \left(\frac{\partial f_3}{\partial f_1} + \frac{\partial f_2}{\partial f_1}\frac{\partial f_3}{\partial f_2}\right) + \left(\frac{\partial f_4}{\partial f_1} + \frac{\partial f_2}{\partial f_1}\frac{\partial f_4}{\partial f_2} + \frac{\partial f_3}{\partial f_1}\frac{\partial f_4}{\partial f_3} + \frac{\partial f_2}{\partial f_1}\frac{\partial f_3}{\partial f_2}\frac{\partial f_4}{\partial f_3}\right)$$


<div align="center">
<img src="../images/chap9/resViewTwo.png" width="710"/>
</div>
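
The path counts and the eight-term derivative can be verified with a toy case. The sketch below assumes every block is a scalar linear function $f_k[h] = a_k h$ (so each factor $\frac{\partial f_k}{\partial f_j}$ reduces to $a_k$); the slope values are hypothetical.

```python
from itertools import product
from math import prod, isclose

# Hypothetical slopes: f_k[h] = a_k * h for blocks 2, 3 and 4.
slopes = {2: 0.5, 3: -0.3, 4: 0.2}

# Every path through 4 residual blocks either skips or uses each block.
all_paths = list(product([False, True], repeat=4))
paths_through_f1 = [p for p in all_paths if p[0]]

# Each path through f_1 contributes the product of the slopes of the later
# blocks it passes through; the empty product (1) is the identity term I.
terms = [
    prod(slopes[k] for k, used in zip((2, 3, 4), path[1:]) if used)
    for path in paths_through_f1
]
dy_df1_from_paths = sum(terms)

# Direct derivative: y = (x + f_1) * (1 + a_2) * (1 + a_3) * (1 + a_4).
dy_df1_direct = prod(1 + a for a in slopes.values())

print(len(all_paths), len(paths_through_f1))        # 16 8
print(isclose(dy_df1_from_paths, dy_df1_direct))    # True
```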


### Order of Operations in a Residual Block

For skip connections to work effectively, the residual branches must contain non-linear functions; otherwise the entire network would be linear. <br> Typically, a linear transformation is followed by a non-linear activation, and the result is added to the skip connection. 

The figure below shows three different types of residual block: 
1. The usual order: linear transformation, then activation, then the skip connection is added.
2. The reverse order: activation first, then the linear transformation, then the skip connection is added.
3. A residual block as commonly used in practice.

<div align="center">
<img src="../images/chap9/skipconnect.png" width="400"/>
</div>
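
The three orderings can be sketched in NumPy as follows. All function names, the width `D` and the random parameters are hypothetical, and the third variant is assumed here to stack two activation-plus-linear layers inside one block; the figure may show a different arrangement.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def block_linear_then_relu(h, W, b):
    """Variant 1: linear transformation, then ReLU, then add the skip."""
    return h + relu(W @ h + b)

def block_relu_then_linear(h, W, b):
    """Variant 2: ReLU, then linear transformation, then add the skip."""
    return h + (W @ relu(h) + b)

def block_two_layers(h, W1, b1, W2, b2):
    """Variant 3 (assumed form): two ReLU + linear layers inside one block."""
    z = W1 @ relu(h) + b1
    return h + (W2 @ relu(z) + b2)

rng = np.random.default_rng(0)
D = 4  # assumed width
h = rng.normal(size=D)
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
b1, b2 = rng.normal(size=D), rng.normal(size=D)

update_1 = block_linear_then_relu(h, W1, b1) - h  # what variant 1 adds to h
update_2 = block_relu_then_linear(h, W1, b1) - h  # what variant 2 adds to h
out_3 = block_two_layers(h, W1, b1, W2, b2)
```

In variant 1 the added term is the output of a ReLU, so it can never be negative and the block can only increase the representation. Variant 2 ends in a linear transformation, so it can add values of either sign.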

### Exploding Gradients

Residual blocks do add stability to the gradient flow in backpropagation, since there exists a path through which each layer contributes directly to the network output. That said, we can still observe exploding gradients (even with He initialisation). 

Consider a network where we add the result of the processing in each residual block back to its input. Each branch contributes its own variance, so the variance of their sum is larger than that of either branch alone. With ReLU activations and He initialisation, the expected variance is unchanged by the processing in each block, so adding the block's output back to its input roughly doubles the variance. The same happens at the next connection, and so on, giving exponential growth with the number of blocks. Note that this idea also applies in the backpropagation algorithm. 


<div align="center">
<img src="../images/chap9/stablaisation.png" width="600"/>
</div>
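
The growth in variance can be observed in a small simulation. This is a rough sketch under stated assumptions: each block applies a ReLU followed by a He-initialised linear layer, there are no biases, and the width, batch size and number of blocks are hypothetical. The mean squared activation is used as a stand-in for the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
D, batch, n_blocks = 200, 500, 8   # hypothetical sizes

h = rng.normal(size=(D, batch))    # unit-variance input, one column per example
second_moments = [float(np.mean(h ** 2))]

for _ in range(n_blocks):
    # He initialisation: weight variance 2 / D
    W = rng.normal(0.0, np.sqrt(2.0 / D), size=(D, D))
    # Residual block: skip connection plus (ReLU, then linear) branch
    h = h + W @ np.maximum(0.0, h)
    second_moments.append(float(np.mean(h ** 2)))

# Ratio of successive second moments: how much each block scales the variance
growth = [after / before
          for before, after in zip(second_moments, second_moments[1:])]
print([round(g, 2) for g in growth])
```

Under these assumptions each entry of `growth` should come out close to 2, so the magnitude after `n_blocks` blocks is on the order of $2^{8}$ times the input magnitude.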

As such there are two solutions: 

1. Multiply the result of every skip connection (the sum of the input and the block's output) by $\frac{1}{\sqrt{2}}$, which halves the variance and so stabilises the values.
2. Batch Normalisation
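
Both remedies can be compared against an unmodified stack in a small simulation. This is an illustrative sketch only: the `run` and `batch_norm` helpers are hypothetical, batch normalisation is applied at the start of each residual branch without its learned scale and offset, and the branch is assumed to be a ReLU followed by a He-initialised linear layer.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def batch_norm(h, eps=1e-5):
    """Normalise each hidden unit (row) over the batch (columns).
    The learned scale and offset are omitted in this sketch."""
    mean = h.mean(axis=1, keepdims=True)
    var = h.var(axis=1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def run(mode, D=200, batch=500, n_blocks=8, seed=0):
    """Return the mean squared activation after n_blocks residual blocks."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(D, batch))
    for _ in range(n_blocks):
        W = rng.normal(0.0, np.sqrt(2.0 / D), size=(D, D))  # He initialisation
        if mode == "rescale":
            h = (h + W @ relu(h)) / np.sqrt(2.0)   # solution 1
        elif mode == "batchnorm":
            h = h + W @ relu(batch_norm(h))        # solution 2
        else:
            h = h + W @ relu(h)                    # unmodified residual block
    return float(np.mean(h ** 2))

results = {mode: run(mode) for mode in ("plain", "rescale", "batchnorm")}
print(results)
```

With these assumptions the plain stack should grow by roughly a factor of two per block, the rescaled stack should stay near its starting value, and the batch-norm variant should grow only about linearly, since each block then adds a branch of roughly fixed variance.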