# Batch Normalization

# Accelerating Deep Network Training by Reducing Internal Covariate Shift


## Introduction

Deep neural networks often suffer from a training complication: the distribution of each layer's inputs changes during training. As data flows downstream through the hidden layers, the distribution of the inputs to each layer shifts because the parameters of the preceding layers keep changing. This slows down training by requiring *lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities*. This change in the distribution of the network activations is known as **Internal Covariate Shift**.

**Covariate shift** refers to a change in the distribution of the input variables between the training and the test data. Here it is termed **internal** because the distribution in question is that of the inputs to the **hidden layers**.

Covariate shift is known to slow down training, since the layers first need to adapt to the new distribution each time it changes before they can make progress on their own transformations.

The BatchNorm paper proposes to address this issue by ***dividing the dataset into mini-batches and normalizing the layer inputs per batch***. Using **mini-batches**, as opposed to a single data item at a time, is known to have several advantages - 

- The gradient of the loss over a mini-batch is an estimate of the gradient over the full training set. This estimate gets closer to the full-dataset gradient as the batch size increases.

- Computation over a batch can exploit the parallel processing capabilities of the hardware.
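The first point can be checked numerically. The sketch below is only an illustration on a made-up least-squares problem; the data, the helper names `mse_gradient` and `mean_estimation_error`, and the chosen batch sizes are all assumptions, not something taken from the paper. It measures how far the mini-batch gradient is from the full-dataset gradient on average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: least-squares regression with N examples.
N, d = 1024, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
w = np.zeros(d)  # current parameters


def mse_gradient(X_batch, y_batch, w):
    """Gradient of 0.5 * mean((Xw - y)^2) with respect to w."""
    residual = X_batch @ w - y_batch
    return X_batch.T @ residual / len(y_batch)


full_grad = mse_gradient(X, y, w)


def mean_estimation_error(batch_size, trials=200):
    """Average distance between mini-batch and full-dataset gradients."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(N, size=batch_size, replace=False)
        batch_grad = mse_gradient(X[idx], y[idx], w)
        errors.append(np.linalg.norm(batch_grad - full_grad))
    return float(np.mean(errors))


for m in (1, 8, 64, 512):
    print(f"batch size {m:4d}: mean gradient error {mean_estimation_error(m):.4f}")
```

On this toy problem the average error should shrink as the batch size grows, and it vanishes (up to floating-point rounding) when the batch is the whole dataset.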

The idea of covariate shift can be extended beyond the network as a whole to its parts, such as a sub-network or a block of hidden layers. Consider a network computing - <br>
**l = F<sub>2</sub>(F<sub>1</sub>(u, θ<sub>1</sub>), θ<sub>2</sub>)** where,

- F<sub>1</sub> & F<sub>2</sub> are arbitrary transformations.
- θ<sub>1</sub> & θ<sub>2</sub> are the parameters to be learned so as to minimize the loss **l**.
- u is the network input.

In the above system, F<sub>2</sub> can be seen as a sub-network getting its inputs from F<sub>1</sub>. Let **x = F<sub>1</sub>(u, θ<sub>1</sub>)**. Then, **l = F<sub>2</sub>(x, θ<sub>2</sub>)** <br>
Now, if the distribution of **x** changes, the distribution of the input to the sub-network **F<sub>2</sub>** changes as well, and the learning algorithm first needs to adapt to this new distribution before it can make progress on **θ<sub>2</sub>**.

Therefore, it is beneficial for the input distribution to remain fixed over the course of training, so that the network trains more smoothly and quickly.


## Covariate Shifts and Activations

Inputs with a fixed distribution also have a positive effect on the activation layers of a network or sub-network. Consider a layer with a sigmoid activation, **z = g(Wu + b)**, where **W** represents the weights, **b** is the bias term, **u** represents the layer input, and **g(x) = 1/(1 + exp(-x))**.

As **|x|** increases, **g'(x)** tends to **zero**, so the gradient flowing down to **u** vanishes for every dimension of **x = Wu + b** except those with small absolute values. Since **x** is affected by **W**, **b** and the parameters of all the previous layers, changes to these parameters during training are likely to move many dimensions of **x** into the saturated regime of the nonlinearity. This effect is amplified as the network depth increases, and it results in slow training of the network.

If the input distribution remains stable during training, the optimizer is less likely to get stuck in the saturated regime, and training accelerates.
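The saturation of the sigmoid can be seen directly from its derivative, **g'(x) = g(x)(1 - g(x))**. The snippet below is a small illustrative sketch; the sample points are chosen arbitrarily.

```python
import numpy as np


def sigmoid(x):
    """g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    """g'(x) = g(x) * (1 - g(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0 and decays quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  g(x) = {sigmoid(x):.5f}  g'(x) = {sigmoid_grad(x):.2e}")
```

The derivative is largest at **x = 0** and becomes very small once **|x|** reaches a few units, which is why inputs drifting away from zero slow down learning.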

## Batch Normalization

Batch Norm attempts to reduce the **Internal Covariate Shift** via a *normalization step that attempts to fix the mean and variance of a layer's input*. This also helps gradients flow more smoothly through the network by reducing their dependence on - 

- the scale of the parameters.
- initial values of the parameters.

### Reducing Internal Covariate Shift

Internal Covariate Shift is defined as the change in the distribution of network activations due to the change in network parameters during training. To improve training performance and speed, internal covariate shift should be reduced. It is well known that a neural network converges faster if its inputs are whitened. **Whitening** refers to the inputs being *linearly transformed to have zero means and unit variances, and decorrelated*. Applied to the inputs of each layer, it would counter the ill effects of internal covariate shift.
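To make the definition of whitening concrete, here is a minimal NumPy sketch that centers a data matrix and multiplies it by the inverse square root of its covariance. The toy data, the `mixing` matrix and the function name `whiten` are assumptions made for illustration; this is one common way to whiten data, not a procedure prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: 1000 examples, 3 features.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
X = rng.normal(size=(1000, 3)) @ mixing.T + np.array([1.0, -2.0, 3.0])


def whiten(X):
    """Center X and multiply by the inverse square root of its covariance."""
    mean = X.mean(axis=0)
    centered = X - mean
    cov = centered.T @ centered / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)      # cov is symmetric
    inv_sqrt_cov = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return centered @ inv_sqrt_cov


X_white = whiten(X)
cov_white = X_white.T @ X_white / X.shape[0]

print(np.round(X_white.mean(axis=0), 6))   # per-feature means
print(np.round(cov_white, 6))              # covariance after whitening
```

After the transform, the features have zero mean, unit variance and no correlation, so the covariance matrix is the identity up to rounding. Note that the sketch needs the covariance of the whole dataset and an eigendecomposition of it.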

Whitening of activations can be considered at every training step or at some interval. This can be done either by modifying the network directly or by changing the parameters of the optimization algorithm to depend on the network activation values. However, if these modifications are interspersed with the optimization steps, gradient descent may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.

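For example, consider a layer that adds a learned bias **b** to its input **u** and normalizes the result by subtracting the mean of the activation computed over the training data: **$\hat{x}$ = x - E[x]**, where **x = u + b**. If a gradient descent step ignores the dependence of **E[x]** on **b**, it updates **b ← b + Δb**, yet the layer output does not change, since **u + (b + Δb) - E[u + (b + Δb)] = u + b - E[u + b]**. As training continues, the bias grows indefinitely while the loss remains fixed. This can get worse if the normalization not only centers but also scales the input activations. The network can even blow up when the normalization parameters are computed outside of the gradient descent step. This happens because gradient descent does not take into account the fact that normalization takes place. The [paper](https://arxiv.org/abs/1502.03167) proposes to address this by ensuring that, for any parameter values, the network always produces activations with the desired distribution. This in turn ensures that the gradient of the loss w.r.t. the model parameters accounts for the normalization and its dependence on the model parameters.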
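The growing-bias problem can be illustrated with a tiny sketch. It assumes a layer that adds a bias **b** to its inputs **u** and then subtracts the mean; the array `u`, the step `delta_b` and the helper `centered_output` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=100)   # layer inputs over a (toy) training set
b = 0.0                    # learned bias


def centered_output(u, b):
    """x_hat = x - E[x] with x = u + b."""
    x = u + b
    return x - x.mean()


before = centered_output(u, b)

# An update that treats E[x] as a constant simply shifts b by some amount.
delta_b = 5.0              # arbitrary illustrative step
after = centered_output(u, b + delta_b)

# The centered output, and hence the loss, is unaffected by the update.
print(np.allclose(before, after))
```

Because the mean absorbs any shift of **b**, the layer output stays the same after the update. An optimizer that is unaware of the centering would keep applying the same kind of update, so **b** keeps growing without ever changing the loss.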

Normalization: **$\hat{x}$ = Norm(x, X)** where,

- x: input to a layer
- X: inputs over the full training dataset.

Let **θ** denote the trainable parameters. The normalization depends not only on the training example **x** but also on all examples **X**, each of which depends on **θ** if **x** is generated by a hidden layer. For back-propagation, both of the following Jacobians need to be computed, the one with respect to **x** and the one with respect to **X**; ignoring the latter leads to the explosion discussed above. 

![Norm_BackPropagationPNG.PNG](attachment:Norm_BackPropagationPNG.PNG)
**Image Source:** [(Ioffe and Szegedy 2015)](https://arxiv.org/abs/1502.03167)

However, whitening the layer inputs in this way is expensive: the required statistics would have to be recomputed over every data point in the training set after each parameter update.

## Normalization via Mini-Batch Statistics

Since full whitening of every layer's inputs is very expensive and not differentiable everywhere, the [paper](https://arxiv.org/abs/1502.03167) makes 2 simplifications - 

#### 1. Normalize each scalar feature independently to have zero mean and unit variance.
For a layer with a **d**-dimensional input **x = (x<sup>(1)</sup>, x<sup>(2)</sup>, ..., x<sup>(d)</sup>)**, each dimension is normalized as per - 
![MultiDimensionalNorm.PNG](attachment:MultiDimensionalNorm.PNG)
**Image Source:** [(Ioffe and Szegedy 2015)](https://arxiv.org/abs/1502.03167)
    
where the expectation and variance are computed over the entire training dataset. Simply normalizing each input of a layer may, however, change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. This is addressed by ensuring that the "*transformation inserted in the network can represent the identity transformation*", so that the layer retains its representational power. To accomplish this, 2 new learnable parameters are added for each activation **x<sup>(k)</sup>**. The parameters are **γ<sup>(k)</sup>** (gamma) and **β<sup>(k)</sup>** (beta), which scale and shift the normalized values respectively.

**y<sup>(k)</sup> = γ<sup>(k)</sup>  $\hat{x}$ <sup>(k)</sup> + β<sup>(k)</sup>**

**γ<sup>(k)</sup>** and **β<sup>(k)</sup>** are learned along with the other parameters during training and help to restore the representational power of the network. 

By setting - <br>
**γ<sup>(k)</sup> = $\sqrt{Var[x^{(k)}]}$** , and <br>
**β<sup>(k)</sup> = E[x<sup>(k)</sup>]**

the original activations can be recovered, if that is the optimal thing to do. This allows the network to undo the normalization, so that the inserted transformation simply acts as an **identity transformation**. *The identity transform is a data transformation that copies the source data into the destination data without change*.
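The recovery of the original activations can be verified numerically. The following is a small sketch under simplifying assumptions: the statistics are computed over a made-up array `x` rather than a real training set, and no stability constant is added to the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(256, 4))   # 256 examples, 4 features

mean = x.mean(axis=0)                 # E[x^(k)] for each feature k
var = x.var(axis=0)                   # Var[x^(k)] for each feature k

x_hat = (x - mean) / np.sqrt(var)     # normalize each feature independently

gamma = np.sqrt(var)                  # gamma^(k) = sqrt(Var[x^(k)])
beta = mean                           # beta^(k)  = E[x^(k)]
y = gamma * x_hat + beta              # scale and shift

# With this choice of gamma and beta the transform is the identity.
print(np.allclose(y, x))
```

With these particular values of **γ** and **β**, the scale-and-shift step exactly undoes the normalization. During training the network is free to learn these values or any others.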

#### 2. Use of Mini-batch Statistics

*Each mini-batch produces estimates of the mean and variance of each activation.* This allows the statistics used for normalization to participate in gradient backpropagation. Since normalization is applied to each activation independently, consider a single activation **x<sup>(k)</sup>** and omit **k** for clarity. Let **B** be a mini-batch of size **m** holding the values of this activation, B = {x<sub>1</sub>, x<sub>2</sub>, ..., x<sub>m</sub>}. Let the corresponding normalized values be {$\hat{x}$<sub>1</sub>, $\hat{x}$<sub>2</sub>, ..., $\hat{x}$<sub>m</sub>} and the corresponding linear transformations be {y<sub>1</sub>, y<sub>2</sub>, ..., y<sub>m</sub>}. Then

**BN<sub>γ,β</sub> : x<sub>1...m</sub> → y<sub>1...m</sub>** is known as the **Batch Normalizing Transform**.

![BatchNormAlgo.PNG](attachment:BatchNormAlgo.PNG)
**Image Source:** [Algorithm 1: Batch Normalizing Transform, applied to activation x over a mini-batch](https://arxiv.org/abs/1502.03167)

**ε** is a constant added to the mini-batch variance for numerical stability of the BatchNorm.
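The transform can be sketched in a few lines of NumPy. This is a minimal illustration of the training-time forward pass only; the function name `batchnorm_forward`, the `(m, d)` array layout and the default `eps` value are assumptions made for the example.

```python
import numpy as np


def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform over a mini-batch.

    x:     array of shape (m, d), m examples with d activations each
    gamma: scale parameters, shape (d,)
    beta:  shift parameters, shape (d,)
    """
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    y = gamma * x_hat + beta                 # scale and shift
    return y


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))    # toy mini-batch
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(np.round(y.mean(axis=0), 6))   # per-activation means
print(np.round(y.var(axis=0), 6))    # per-activation variances
```

With **γ = 1** and **β = 0**, each column of the output has a mean of approximately zero and a variance slightly below one, the small gap being due to **ε**.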

- The scaled and shifted values are passed to the subsequent network layers.

- Each normalized activation $\hat{x}$ <sup>(k)</sup> can be viewed as an input to a sub-network composed of the linear transformation **y<sup>(k)</sup> = γ<sup>(k)</sup>  $\hat{x}$ <sup>(k)</sup> + β<sup>(k)</sup>**, followed by the rest of the original network. The introduction of normalized inputs is expected to accelerate the training of the sub-network and consequently of the whole network.

- The gradient of the loss is back-propagated through the transform during training, and gradients are also computed with respect to the batch normalization parameters **γ** and **β**. The BatchNorm transform is differentiable and introduces normalized activations into the network. This ensures that the hidden layers can continue to learn on input distributions with less internal covariate shift, which accelerates training.
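Because the transform is differentiable, its gradients follow from the chain rule applied to the mean, variance, normalization and scale-and-shift steps. The sketch below is an illustrative NumPy version with a numerical gradient check; the function names, the cache layout and the toy loss `sum(r * y)` are assumptions made for the example.

```python
import numpy as np


def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Forward pass; also returns the values needed for the backward pass."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x, mu, var, x_hat, gamma, eps)


def batchnorm_backward(dy, cache):
    """Chain rule through the transform; dy is dLoss/dy with shape (m, d)."""
    x, mu, var, x_hat, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)

    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * inv_std ** 3
    dmu = np.sum(-dx_hat * inv_std, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    return dx, dgamma, dbeta


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
r = rng.normal(size=(8, 3))      # toy loss = sum(r * y), so dLoss/dy = r


def loss(x_in):
    y, _ = batchnorm_forward(x_in, gamma, beta)
    return np.sum(r * y)


_, cache = batchnorm_forward(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward(r, cache)

# Central-difference estimate of dLoss/dx for comparison.
h = 1e-6
dx_num = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        step = np.zeros_like(x)
        step[i, j] = h
        dx_num[i, j] = (loss(x + step) - loss(x - step)) / (2 * h)

print(np.max(np.abs(dx - dx_num)))   # largest deviation from the numerical gradient
```

The analytic gradient with respect to the inputs should agree with the finite-difference estimate to within numerical precision. The terms `dvar` and `dmu` are what let the mini-batch statistics take part in backpropagation.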
