## 27: [Normalizing Activations](https://youtu.be/tNIpEZLv_eg)

## 28: [Fitting Batch Norm into Neural Networks](https://youtu.be/em6dfRxYkYU)

## 29: [Why Does Batch Norm Work?](https://youtu.be/nUUqwaxLnWs)

## 30: [Batch Norm at Test Time](https://youtu.be/5qefnAek8OA)

## Batch Normalization

The basic idea: in Video 9 we normalized our input data to make the loss landscape easier to navigate. We can do the same thing for the activations (outputs) $a^{[n]}$ of our hidden layers, so that the next layer receives a more normalized input and its weights can be optimized more easily. This is called **Batch Normalization**.

There is some debate about whether to apply batch normalization before or after the activation function. In practice, normalizing the pre-activation values $z^{[n]}$ (i.e. before the activation function) is more common.

### Implementing Batch Normalization

Given some intermediate values at some layer $l$ in the NN, $z^{(1)}, \dots, z^{(m)}$, we want to normalize them. (This could be written as $z^{[l](1)}, \dots, z^{[l](m)}$ if we want to be more explicit about the layer, but we omit it here for clarity.)

We define the mean and variance of the batch as:

$$
\mu = \frac{1}{m} \sum_{i=1}^m z^{(i)} \\
\, \\
\sigma^2 = \frac{1}{m} \sum_{i=1}^m (z^{(i)} - \mu)^2
$$

And we normalize the batch as:

$$
z^{(i)}_{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

Where $\epsilon$ is a small number to avoid division by zero.


Now every component of $z_{norm}$ has mean 0 and variance 1 across the batch. But we don't want the hidden units to be forced to always have mean 0 and variance 1, since a different distribution may suit them better. So we add two new parameters $\gamma$ and $\beta$ to the layer, and we define the "corrected" values as:

$$
\tilde{z}^{(i)} = \gamma \times z^{(i)}_{norm} + \beta
$$

And here, $\gamma$ and $\beta$ are learnable parameters of the model, and we pass the $\tilde{z}^{(i)}$ to the activation function.

Note that we have a unique $\gamma$ and $\beta$ for each layer of the NN.
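
The steps above can be sketched in NumPy. This is a minimal sketch rather than a reference implementation: it assumes one example per column, so `z` has shape `(n_units, m)`, and the function name `batch_norm_forward` and the default `eps=1e-5` are illustrative choices that do not come from the videos.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Batch norm for one layer.

    z:           pre-activations, shape (n_units, m), one example per column (assumed layout)
    gamma, beta: learnable scale and shift, shape (n_units, 1)
    """
    mu = z.mean(axis=1, keepdims=True)        # per-unit mean over the batch
    var = z.var(axis=1, keepdims=True)        # per-unit variance (1/m normalization)
    z_norm = (z - mu) / np.sqrt(var + eps)    # mean 0, variance ~1 per unit
    z_tilde = gamma * z_norm + beta           # learnable rescale and shift
    return z_tilde, z_norm

# Toy data: 4 hidden units, 64 examples, deliberately not zero-mean / unit-variance.
rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(4, 64))
gamma = np.ones((4, 1))    # with gamma = 1 and beta = 0, z_tilde equals z_norm
beta = np.zeros((4, 1))

z_tilde, z_norm = batch_norm_forward(z, gamma, beta)
```

`z_tilde` is what would then be passed to the activation function. During training, `gamma` and `beta` would be updated by gradient descent like any other parameter.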

### Batch Normalization in Practice

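In practice, batch norm is usually applied with mini-batches of your training set: $\mu$ and $\sigma^2$ are computed on the current mini-batch only. \
(Also, because batch norm subtracts the mean, any constant bias $b$ added to $z$ is cancelled out, and $\beta$ takes over its role. So we can just omit the $b$ parameters of each layer.)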

Also note that the dimension of $\gamma^{[l]}$ and $\beta^{[l]}$ is the same as the number of neurons in layer $l$. (Each neuron gets its own scalar $\gamma$ and $\beta$.)
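
The claim that the bias $b$ drops out can be checked numerically. The sketch below is only an illustration under assumed shapes (one example per column); `normalize`, `W`, `a_prev` and `b` are hypothetical names and random toy values, not part of any particular library.

```python
import numpy as np

def normalize(z, eps=1e-5):
    """Normalize each unit (row) over the batch (columns)."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))        # 3 units in this layer, 5 in the previous one
a_prev = rng.normal(size=(5, 32))  # activations of the previous layer, 32 examples
b = rng.normal(size=(3, 1))        # one bias per unit

with_bias = normalize(W @ a_prev + b)
without_bias = normalize(W @ a_prev)

# Adding b shifts each row's mean by exactly b, and the mean is then subtracted,
# so both versions agree up to floating-point error.
same = np.allclose(with_bias, without_bias)
```

Since `same` comes out true, the only shift that survives is the one applied after normalization, which is $\beta$.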

## Intuition for Batch Norm

The first reason it works is the same argument as for normalizing the inputs: it makes the loss landscape easier to navigate, and therefore the weights are easier to optimize.

Now, to get some intuition for the second reason, let's look at an example:

If you have a NN for figuring out if a picture is a cat picture or not, your model will probably not be able to generalize well to a new dataset if the training set only has pictures of black cats, and no pictures of colored cats.

This idea of your data distribution changing goes by the name of **Covariate Shift**.

The idea is that if you have learned some $X \mapsto Y$ mapping and the distribution of $X$ changes, you might need to retrain your model. This is true even if the ground truth function mapping from $X$ to $Y$ (cat pictures to outputting "cat") doesn't change. And the need to retrain gets worse if the ground truth changes as well.

So the $2^{nd}$ reason why Batch Norm works is that it makes the weights in later layers of your NN more robust to changes in the weights of earlier layers. \
This is because, for each layer, the output of the previous layer is effectively its input data. As the earlier weights change during training, the distribution of that input shifts. Batch norm keeps the mean and variance of that input fixed (at values set by $\beta$ and $\gamma$), so the weights of that layer don't have to change as much to adapt to changes in the previous layers.

Pictures:

![](.graphics/2023-04-26-21-06-32.png)

![](.graphics/2023-04-26-21-06-42.png)




## Batch Norm as Regularization

We know that each mini-batch is scaled by the mean/variance computed on _just that_ mini-batch.

This (similar to dropout) adds some noise to the values $z^{[l]}$ within that mini-batch, because the mean and variance we compute from that mini-batch aren't the true mean and variance of the entire training set.

Adding noise to the hidden units forces the downstream hidden units not to rely too much on any one unit that comes before them. Because the noise is comparatively small, this is not going to be a huge regularizing effect, but it is there.

Conversely, a larger mini-batch gives less noisy estimates of the mean and variance, so increasing the size of your mini-batch reduces this regularization effect.
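
The link between mini-batch size and noise can be illustrated with a small simulation. This is a toy sketch under assumed settings: `z_all` stands in for the values of one hidden unit over a whole training set, and `spread_of_batch_means` is a hypothetical helper that measures how much the mini-batch mean fluctuates from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one hidden unit's z values over a full training set (toy data).
z_all = rng.normal(loc=2.0, scale=3.0, size=10_000)

def spread_of_batch_means(batch_size, n_batches=500):
    """Standard deviation of the mini-batch mean across many random mini-batches."""
    means = [
        rng.choice(z_all, size=batch_size, replace=False).mean()
        for _ in range(n_batches)
    ]
    return float(np.std(means))

noise_small = spread_of_batch_means(16)    # small mini-batches: noisy estimate of the mean
noise_large = spread_of_batch_means(512)   # large mini-batches: much less noisy
```

With these settings, `noise_large` comes out well below `noise_small`, which matches the point above: the larger the mini-batch, the closer its statistics are to those of the whole training set, and the weaker the noise injected by batch norm.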

Having said that, you should not rely on batch norm as a regularization technique, and you should still use other methods.

## Batch Norm at Test Time

This version of batch normalization does not make sense at test time, because you might be processing one example at a time, and the mean and variance of a single example are not meaningful (the variance would be 0).

So we need to come up with some separate estimate of $\mu$ and $\sigma^2$ to use at test time. \
In typical implementations of batch norm, this is done by keeping, for each layer, an exponentially weighted average of the means and variances seen across the mini-batches during training.

We then use these exponentially weighted averages to normalize the test data.
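
A minimal sketch of this idea for a single layer follows. The names `update_running` and `batch_norm_test`, the decay value `momentum=0.9`, and the random toy mini-batches are all assumptions made for illustration. Real frameworks differ in details such as the decay value and whether the variance estimate is bias-corrected.

```python
import numpy as np

def update_running(running, batch_value, momentum=0.9):
    """Exponentially weighted average of a per-batch statistic."""
    return momentum * running + (1 - momentum) * batch_value

def batch_norm_test(z, running_mu, running_var, gamma, beta, eps=1e-5):
    """Test-time batch norm: uses the stored averages, not statistics of z itself."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
n_units = 3
running_mu = np.zeros((n_units, 1))
running_var = np.ones((n_units, 1))

# "Training": update the running statistics once per mini-batch.
for _ in range(200):
    z_batch = rng.normal(loc=5.0, scale=2.0, size=(n_units, 64))  # toy pre-activations
    running_mu = update_running(running_mu, z_batch.mean(axis=1, keepdims=True))
    running_var = update_running(running_var, z_batch.var(axis=1, keepdims=True))

# "Test": a single example can now be normalized sensibly.
gamma = np.ones((n_units, 1))
beta = np.zeros((n_units, 1))
z_single = np.full((n_units, 1), 5.0)
out = batch_norm_test(z_single, running_mu, running_var, gamma, beta)
```

Because the toy mini-batches have mean 5 and variance 4, the running averages settle near those values, and a test example at `z = 5` is mapped close to 0 even though it arrives on its own.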
