## Batch Normalisation

### Learning Objectives
- Formula behind batch normalisation
- Application in PyTorch
- Advantage of batch normalisation

### Prerequisites
- Data normalisation
- Feed-forward neural networks
- Convolutional neural networks

### Intro
Before training a machine learning algorithm, it is common practice to normalise the input data, especially when the features have different scales, e.g. house prices and building year. Without normalisation, features with larger values can have an unwanted, disproportionate impact on the predictor. Normalising the data avoids this and can lead to better performance. Since this technique works well for the input data, it is natural to apply the same idea to the hidden layers of a neural network. <br>
Batch normalisation was introduced to tackle the problem of internal covariate shift in deep neural networks: the distribution of each layer's input changes during training because that input depends on the parameters of all preceding layers, so even small parameter changes in early layers can have a large effect on the inputs of layers further down the line. This problem typically requires careful initialisation of the parameters as well as small learning rates.

### Mathematical Notation
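Batch normalisation standardises each feature to zero mean and unit variance:
\begin{equation*}
\hat{x} = \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}}
\end{equation*}
where $\epsilon$ is a small constant added for numerical stability. The mean $\mu(x)$ and variance $\sigma^2(x)$ are computed over the current batch during training; after training, population statistics are used instead (in practice, running averages collected during training).<br>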
The normalised features are then scaled and shifted by introducing the parameters $\gamma$ and $\beta$:
\begin{equation*}
y = \gamma \hat{x} + \beta
\end{equation*}
These steps are applied to the activations of each layer before feeding them as input to the next layer. Because they are part of the forward pass, they must also be included in backpropagation: gradients flow through the normalisation, and we additionally compute the gradients w.r.t. the new parameters $\gamma$ and $\beta$, which are learned along with the other weights.<br><br>
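
The two equations above can be sketched in a few lines of PyTorch. This is a minimal illustration for the training-time case only (batch statistics, no running averages); the helper name `batch_norm_train` and the value `eps=1e-5` are assumptions made for this example. The sketch also checks the result against `nn.BatchNorm1d` and verifies the gradients w.r.t. $\gamma$ and $\beta$ by hand.

```python
import torch
import torch.nn as nn

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Hypothetical helper: normalise each feature (column) over the batch, then scale and shift."""
    mu = x.mean(dim=0)                    # per-feature batch mean
    var = x.var(dim=0, unbiased=False)    # per-feature (biased) batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta, x_hat

torch.manual_seed(0)
x = torch.randn(8, 3) * 5.0 + 2.0         # batch of 8 samples with 3 features
gamma = torch.ones(3, requires_grad=True)
beta = torch.zeros(3, requires_grad=True)

y, x_hat = batch_norm_train(x, gamma, beta)

# Compare with PyTorch's built-in layer in training mode (gamma=1, beta=0 at initialisation).
bn = nn.BatchNorm1d(3)
bn.train()
matches_builtin = torch.allclose(y, bn(x), atol=1e-5)

# Backpropagate an arbitrary upstream gradient and compare with the analytic gradients:
# dL/dgamma = sum_i dL/dy_i * x_hat_i  and  dL/dbeta = sum_i dL/dy_i (sums over the batch).
upstream = torch.randn_like(y)
y.backward(upstream)
grad_gamma_ok = torch.allclose(gamma.grad, (upstream * x_hat.detach()).sum(dim=0), atol=1e-5)
grad_beta_ok = torch.allclose(beta.grad, upstream.sum(dim=0), atol=1e-5)
```

At inference time the built-in layer switches to its running statistics once `bn.eval()` is called, which this sketch does not reproduce.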

In a convolutional neural network, the normalisation is applied jointly over all locations in a feature map: the mean and variance of each feature map (channel) are computed over the batch and all spatial positions. We therefore learn the parameters $\gamma$ and $\beta$ per feature map rather than per activation, because a feature should be normalised in the same way regardless of where in the image the convolutional window is located.
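
To make the per-feature-map statistics concrete, the following sketch computes them by hand for a tensor of shape `(N, C, H, W)` and compares the result with `nn.BatchNorm2d` in training mode. The tensor sizes and `eps=1e-5` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6, 5, 5)   # (batch N, channels C, height H, width W)

# One mean and one variance per channel: reduce over the batch and both spatial dimensions.
mu = x.mean(dim=(0, 2, 3), keepdim=True)                   # shape (1, C, 1, 1)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)   # shape (1, C, 1, 1)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)

# The built-in layer does the same in training mode (gamma=1, beta=0 at initialisation).
bn = nn.BatchNorm2d(num_features=6)
bn.train()
same_as_builtin = torch.allclose(x_hat, bn(x), atol=1e-5)

# gamma (bn.weight) and beta (bn.bias) hold one value per feature map, not per activation.
n_gamma, n_beta = bn.weight.numel(), bn.bias.numel()
```

Here `n_gamma` and `n_beta` both equal the number of channels, independent of the spatial size of the feature maps.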

### Implementation
PyTorch: MNIST image classification with a CNN, with batch normalisation as an optional parameter so that both variants can be compared later.
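
One possible shape for such a model is sketched below. The class name `SmallCNN`, the flag `use_batch_norm`, and the layer sizes are assumptions chosen for illustration, not a prescribed architecture. Batch normalisation is placed between the convolution and the ReLU here, which is a common choice but not the only one. Random tensors stand in for MNIST images so that the block runs without downloading data.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Hypothetical small CNN for 1x28x28 inputs (MNIST-sized) with optional batch normalisation."""

    def __init__(self, use_batch_norm: bool = False, num_classes: int = 10):
        super().__init__()

        def block(in_ch, out_ch):
            layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)]
            if use_batch_norm:
                layers.append(nn.BatchNorm2d(out_ch))   # one gamma/beta pair per feature map
            layers += [nn.ReLU(), nn.MaxPool2d(2)]      # pooling halves height and width
            return layers

        self.features = nn.Sequential(*block(1, 16), *block(16, 32))   # 28 -> 14 -> 7
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, start_dim=1))

model_plain = SmallCNN(use_batch_norm=False)
model_bn = SmallCNN(use_batch_norm=True)

images = torch.randn(8, 1, 28, 28)   # stand-in for a batch of MNIST images
logits = model_bn(images)            # one score per class for each image
```

Remember to call `model.train()` before training and `model.eval()` before evaluation, since batch normalisation layers use batch statistics in the former mode and running statistics in the latter.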

### Exercise
Compare the results to a CNN without batch normalisation. <br>
Add plots to show training stability and speed of convergence.
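
As a possible starting point for the comparison, the sketch below records the loss after every optimisation step for a model with and without batch normalisation. Everything in it is an assumption made for illustration: the helpers `make_mlp` and `train_and_record`, the tiny fully connected model (used only to keep the block short), the learning rate, and the random stand-in data, which should be replaced by batches from an MNIST `DataLoader` and by the CNN under study.

```python
import torch
import torch.nn as nn

def make_mlp(use_batch_norm):
    """Hypothetical tiny model, used only to keep this sketch self-contained."""
    layers = [nn.Flatten(), nn.Linear(28 * 28, 64)]
    if use_batch_norm:
        layers.append(nn.BatchNorm1d(64))
    layers += [nn.ReLU(), nn.Linear(64, 10)]
    return nn.Sequential(*layers)

def train_and_record(model, batches, lr=0.1):
    """Run one pass of SGD over `batches` and return the loss measured at each step."""
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()                          # batch norm uses batch statistics in this mode
    losses = []
    for images, labels in batches:
        optimiser.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimiser.step()
        losses.append(loss.item())
    return losses

torch.manual_seed(0)
# Random stand-in data with MNIST-like shapes; labels are random, so nothing is learned here.
batches = [(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))) for _ in range(5)]

loss_curves = {
    "without BN": train_and_record(make_mlp(False), batches),
    "with BN": train_and_record(make_mlp(True), batches),
}
```

The two lists in `loss_curves` can then be plotted against the step index, for several learning rates, to compare stability and convergence speed on real data.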

### Summary
With batch normalisation, we can typically use larger learning rates and need to be less careful with parameter initialisation, which often leads to faster convergence and better performance.