In [4]:
import numpy as np

# Batch Normalization

The benefits of this technique:

1. Makes the neural network more robust
2. Widens the range of hyperparameters that work well
3. Makes hyperparameters easier to tune
4. Makes very deep networks easier to train

## How does it work?

In batch gradient descent, normalizing our input features can allow our neural network to train faster. It turns the contours of the loss surface from something elongated into something more uniform.

However, in deeper models the inputs to each layer may not be normalized. Wouldn't it be nice if we could normalize them at each layer, so that each layer trains quickly as well?

## Before or After Activation?

In practice, normalizing **before** the activation (that is, normalizing $z$ rather than $a$) is done more frequently.

## Implementation

Given some intermediate values in one layer of the neural network, $z^{(1)}, \dots, z^{(m)}$, one per example:

$$\mu = \frac{1}{m} \sum_{i} z^{(i)}$$
$$\sigma^2 = \frac{1}{m} \sum_{i} (z^{(i)} - \mu)^2$$
$$z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
$$\tilde{z}^{(i)} = \gamma z^{(i)}_{\text{norm}} + \beta$$
where $\gamma$ and $\beta$ are learnable parameters and $\epsilon$ is a small constant added for numerical stability.

Note that if:

$$\gamma = \sqrt{\sigma^2 + \epsilon}$$
$$\beta = \mu$$

then $\tilde{z}^{(i)} = z^{(i)}$, that is, the scale and shift exactly undo the normalization.
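
The four equations above can be sketched in NumPy as follows. This is a minimal sketch, not a reference implementation: the function name `batch_norm_forward`, the convention that rows are examples and columns are hidden units, and the toy data are all assumptions made for illustration. It also checks the identity case just described.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize z across examples (axis 0), then scale by gamma and shift by beta."""
    mu = z.mean(axis=0)                      # per-unit mean over the m examples
    var = z.var(axis=0)                      # per-unit variance (1/m convention)
    z_norm = (z - mu) / np.sqrt(var + eps)
    z_tilde = gamma * z_norm + beta
    return z_tilde, mu, var

rng = np.random.default_rng(0)
z = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # toy data: m = 64 examples, 4 units

# With gamma = 1 and beta = 0 each unit ends up with mean ~0 and variance ~1.
z_tilde, mu, var = batch_norm_forward(z, gamma=np.ones(4), beta=np.zeros(4))

# With gamma = sqrt(var + eps) and beta = mu the original z is recovered.
eps = 1e-8
z_same, _, _ = batch_norm_forward(z, gamma=np.sqrt(var + eps), beta=mu, eps=eps)
print(np.allclose(z_same, z))
```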

Why should we have $\gamma$ and $\beta$? Let me work out an example to convince myself.

In [5]:
# TODO: Example
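
One possible way to fill in this example, assuming a sigmoid activation (the activation, the toy data, and the values chosen for $\gamma$ and $\beta$ are all hypothetical): with $\gamma = 1$ and $\beta = 0$, every hidden unit is forced to mean 0 and variance 1, so most inputs to the sigmoid sit in its near-linear region around 0. Learnable $\gamma$ and $\beta$ let the network pick a different mean and spread when that is more useful.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=0.1, size=1000)        # toy pre-activations
z_norm = (z - z.mean()) / np.sqrt(z.var() + 1e-8)

# Fixed normalization: activations are centred on 0.5, mostly in the linear regime.
a_fixed = sigmoid(z_norm)

# Hypothetical learned values gamma = 4, beta = 2: activations shift and spread out,
# so the unit can make use of the saturating parts of the sigmoid.
a_scaled = sigmoid(4.0 * z_norm + 2.0)

print("gamma=1, beta=0:", a_fixed.mean(), a_fixed.std())
print("gamma=4, beta=2:", a_scaled.mean(), a_scaled.std())
```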

# Adding Batch Norm into a Neural Network

$\beta^{[l]}$ and $\gamma^{[l]}$ will now be added to the parameters to be trained. With these, we will also need the gradients $d\beta^{[l]}$ and $d\gamma^{[l]}$ during backpropagation.

Fortunately, most programming frameworks provide utilities for implementing batch normalization.

## Working with Minibatches

Batch norm is normally applied to minibatches rather than to the entire training set: $\mu$ and $\sigma^2$ are computed on each minibatch.

Note that batch normalization cancels out the effect that our bias vector $b$ has on the $z$ values. This is because batch norm subtracts the mean taken across the training examples, so any constant added to every example is removed in that step.

Therefore we can set $b$ to a zero vector, or drop it and let the $\beta$ parameter take its place.
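
A quick numerical check of this cancellation, as a sketch on made-up data (the shapes, the helper `normalize`, and the rows-are-examples convention are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
a_prev = rng.normal(size=(32, 5))   # toy minibatch: 32 examples, 5 input units
W = rng.normal(size=(5, 3))
b = rng.normal(size=3)              # bias, broadcast to every example

def normalize(z, eps=1e-8):
    """Subtract the per-unit batch mean and divide by the per-unit batch std."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

z_with_b = a_prev @ W + b
z_without_b = a_prev @ W

# The bias shifts the batch mean by exactly b, so it disappears after normalization.
bias_has_no_effect = np.allclose(normalize(z_with_b), normalize(z_without_b))
print(bias_has_no_effect)
```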

## Implementing Gradient Descent with Batch Normalization

For each minibatch $t$:

1. Compute forward propagation on $X^{\{t\}}$
2. In each hidden layer, replace $Z^{[l]}$ with $\tilde{Z}^{[l]}$
3. Use backprop to compute `dW`, `dbeta`, and `dgamma`
4. Update parameters

This works with other optimization algorithms as well, such as momentum, RMSprop, or Adam.
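
The four steps can be sketched for a single layer with plain gradient descent. Everything specific here is an assumption for illustration: the helper names `bn_forward` and `bn_backward`, the squared-error loss, the toy regression data, the learning rate, and the rows-are-examples layout. The bias $b$ is omitted, as discussed above.

```python
import numpy as np

def bn_forward(z, gamma, beta, eps=1e-8):
    mu = z.mean(axis=0)
    inv_std = 1.0 / np.sqrt(z.var(axis=0) + eps)
    z_norm = (z - mu) * inv_std
    return gamma * z_norm + beta, (z_norm, inv_std, gamma)

def bn_backward(dz_tilde, cache):
    """Gradients of the loss with respect to z, gamma, and beta."""
    z_norm, inv_std, gamma = cache
    m = dz_tilde.shape[0]
    dgamma = (dz_tilde * z_norm).sum(axis=0)
    dbeta = dz_tilde.sum(axis=0)
    dz_norm = dz_tilde * gamma
    # mu and sigma^2 depend on every example, hence the two correction terms
    dz = inv_std / m * (m * dz_norm - dz_norm.sum(axis=0)
                        - z_norm * (dz_norm * z_norm).sum(axis=0))
    return dz, dgamma, dbeta

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]]) + 1.0   # made-up targets

W = rng.normal(size=(3, 1))
gamma, beta = np.ones(1), np.zeros(1)
lr, batch_size = 0.1, 32
losses = []

for epoch in range(50):
    for start in range(0, len(X), batch_size):
        Xt, yt = X[start:start + batch_size], y[start:start + batch_size]
        z = Xt @ W                                        # 1. forward propagation
        z_tilde, cache = bn_forward(z, gamma, beta)       # 2. replace z with z_tilde
        losses.append(0.5 * np.mean((z_tilde - yt) ** 2))
        dz_tilde = (z_tilde - yt) / len(Xt)               # 3. backprop
        dz, dgamma, dbeta = bn_backward(dz_tilde, cache)
        dW = Xt.T @ dz
        W -= lr * dW                                      # 4. update parameters
        gamma -= lr * dgamma
        beta -= lr * dbeta

print("first epoch loss:", np.mean(losses[:8]), "last epoch loss:", np.mean(losses[-8:]))
```

Swapping the update in step 4 for another optimizer leaves steps 1 to 3 unchanged.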

# Batch Normalization Intuition

Why does batch normalization work? Let's explore the intuition behind this trick.

We saw how normalizing input features can speed up learning. Batch norm does something similar, not just for the input units but for the hidden units as well.

## Covariate Shift

Suppose we built a cat classifier that has learned images of black cats very well. When we feed it pictures of white cats, it may not perform very well. This change in the input distribution is known as **covariate shift**.

The idea is that if we learn some mapping from $x$ to $y$ and the distribution of $x$ changes, then we may have to retrain our model. Batch normalization limits this effect for the hidden layers: the $z$ values at each layer keep a fixed mean and variance (set by $\beta$ and $\gamma$), even as the parameters of earlier layers change.

This lets each layer of the network learn somewhat independently, and partially **decouples** the layers from each other.

## Regularization

Each minibatch is normalized using its own mean and variance. Because these are computed from only a few examples, they carry a little bit of **noise**, which provides a slight regularization effect.
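
To see this noise, here is a small sketch that normalizes the same example inside several different minibatches. The data, the batch size of 32, and the number of repetitions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
population = rng.normal(loc=1.0, scale=2.0, size=10_000)   # toy z values for one unit
example = population[0]

normalized = []
for _ in range(5):
    # Put the same example into a freshly drawn minibatch of 32 values.
    others = rng.choice(population[1:], size=31, replace=False)
    batch = np.concatenate(([example], others))
    normalized.append((example - batch.mean()) / np.sqrt(batch.var() + 1e-8))

normalized = np.array(normalized)
# The example's normalized value differs from batch to batch.
print(normalized)
```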

# Batch Normalization during Inference

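At inference time we may have only a single example, so we cannot compute a minibatch mean and variance. Instead, we estimate $\mu$ and $\sigma^2$ using an exponentially weighted average (across minibatches) during training.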

Suppose we have minibatches $X^{\{1\}}, X^{\{2\}}, X^{\{3\}}, \dots$. For each layer, we keep an exponentially weighted average of the mean and variance computed on each minibatch.

Then, at inference, we calculate $z_{\text{norm}}$ using our estimated $\mu$ and $\sigma^2$, and $\tilde{z}$ using the learned $\gamma$ and $\beta$.
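
A minimal sketch of this procedure, assuming a decay rate of 0.9 for the exponentially weighted average and simulated minibatches (the names `running_mu`, `running_var`, and `momentum`, and all of the values, are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
momentum = 0.9               # assumed decay rate of the exponentially weighted average
eps = 1e-8
running_mu = np.zeros(4)     # running estimates for 4 hidden units
running_var = np.ones(4)

# Training: update the running estimates with each minibatch's statistics.
for t in range(200):
    z_batch = rng.normal(loc=2.0, scale=3.0, size=(32, 4))   # simulated z values
    running_mu = momentum * running_mu + (1 - momentum) * z_batch.mean(axis=0)
    running_var = momentum * running_var + (1 - momentum) * z_batch.var(axis=0)

# Inference: a single example is normalized with the running estimates,
# not with statistics computed from the example itself.
gamma, beta = np.ones(4), np.zeros(4)   # stand-ins for the learned parameters
z_single = rng.normal(loc=2.0, scale=3.0, size=(1, 4))
z_norm = (z_single - running_mu) / np.sqrt(running_var + eps)
z_tilde = gamma * z_norm + beta

print(running_mu, running_var)
```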