# Neural Network Optimization

## Regularization

With regularization added, the cost function of a neural network becomes,

$$
\mathcal{J}(W, b) = \frac{1}{m} \sum_{i = 1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l = 1}^{L} \| W^{[l]} \|_{F}^2
$$

where

$$
\| W^{[l]} \|_{F}^2 = \sum_{i = 1}^{n^{[l]}} \sum_{j = 1}^{n^{[l - 1]}} (w_{i, j}^{[l]})^2
$$

Because we want to minimize $\mathcal{J}$, a large $\lambda$ pushes the weights $W$ toward 0; otherwise the penalty term would keep $\mathcal{J}$ large.

We have double sums because each $W^{[l]}$ is a **matrix** of weights, and its dimension is,

$$
W: (n^{[l]} \times n^{[l - 1]})
$$

So for one weight matrix, we square every element and sum over all rows and columns to get a single **scalar**. $\| W^{[l]} \|_{F}$ is called the **Frobenius norm**, and the penalty uses its square.
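As a minimal sketch of the penalty term, assuming NumPy and a hypothetical helper `l2_penalty` (neither appears in these notes), the squared Frobenius norm of each layer is summed and scaled by $\frac{\lambda}{2m}$:

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """Return (lam / (2m)) * sum over layers of the squared Frobenius norm.

    weights: list of 2-D arrays, one W^[l] per layer (hypothetical layout).
    lam: regularization parameter lambda.
    m: number of training examples.
    """
    # Square every element, then sum over rows, columns, and layers.
    frob_sq = sum(np.sum(np.square(W)) for W in weights)
    return lam / (2 * m) * frob_sq

# Two tiny layers: W1 is (2 x 3), W2 is (1 x 2).
W1 = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
W2 = np.array([[2.0, 1.0]])
penalty = l2_penalty([W1, W2], lam=0.5, m=4)
```

Here the squared norms are 7 and 5, so the penalty is $\frac{0.5}{2 \times 4} \times 12 = 0.75$.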

**Gradient descent** with regularization works as follows. First, take the derivative of the cost function with respect to the weights. Here $\alpha$ is the learning rate, $\lambda$ is the **regularization parameter** that controls how strong the regularization is, and $m$ is the number of examples. Let $\text{backprops}$ be the partial derivative of the unregularized cost obtained from **backward propagation**.

$$
\frac{\partial \mathcal{J}}{\partial W} = \text{backprops} + \frac{\lambda}{m} W
$$

So the gradient descent will be,

$$
W = W - \alpha \frac{\partial \mathcal{J}}{\partial W}
$$
$$
= W - \alpha \left[ \text{backprops} + \frac{\lambda}{m} W \right]
$$
$$
= W - \alpha \frac{\lambda}{m} W - \alpha \left[ \text{backprops} \right]
$$
$$
= (1 - \frac{\alpha \lambda}{m}) W - \alpha \left[ \text{backprops} \right]
$$

Typically $\frac{\alpha \lambda}{m} \ll 1$, so the factor $1 - \frac{\alpha \lambda}{m}$ is slightly less than 1, and each update shrinks $W$ a little before the backprop step is applied. This is the same effect as regularization in linear regression, which keeps the parameters small. Because the weights shrink at every update, this regularization is also called **weight decay**.
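The first and last lines of the derivation above describe the same update. A minimal sketch that checks this numerically, assuming NumPy and hypothetical function names, with a random array standing in for the backprop term:

```python
import numpy as np

def update_plain(W, backprop_grad, alpha, lam, m):
    """W - alpha * (backprops + (lam / m) * W)."""
    return W - alpha * (backprop_grad + (lam / m) * W)

def update_weight_decay(W, backprop_grad, alpha, lam, m):
    """(1 - alpha * lam / m) * W - alpha * backprops."""
    return (1 - alpha * lam / m) * W - alpha * backprop_grad

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW = rng.standard_normal((3, 4))  # stand-in for the backprop term
a = update_plain(W, dW, alpha=0.1, lam=0.7, m=50)
b = update_weight_decay(W, dW, alpha=0.1, lam=0.7, m=50)
same = np.allclose(a, b)  # the two forms agree up to rounding
```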

**Dropout** is another way to regularize: during training it randomly sets some units' activations to 0, so each iteration effectively trains a smaller network.
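A minimal sketch of one common variant, inverted dropout, assuming NumPy; the name `keep_prob` and the rescaling by `1 / keep_prob` are assumptions of this variant rather than something stated in these notes:

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, then rescale.

    Dividing by keep_prob keeps the expected activation unchanged.
    """
    mask = rng.random(A.shape) < keep_prob
    return A * mask / keep_prob, mask

rng = np.random.default_rng(0)
A = np.ones((4, 1000))  # activations of one layer (hypothetical)
A_drop, mask = inverted_dropout(A, keep_prob=0.8, rng=rng)
```

The mask is applied only during training; at test time the activations are used unchanged.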

In image recognition tasks, **data augmentation** can act as regularization: adding noise, rotating, zooming in or out, and flipping images gives more challenging and more diverse training data.
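A minimal sketch of two of these transformations (horizontal flip and additive noise), assuming NumPy, images stored as `(height, width, channels)` arrays with values in $[0, 1]$, and a hypothetical `augment` helper:

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Return a randomly flipped, noise-perturbed copy of an (H, W, C) image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0.0, 1.0)  # keep pixel values in [0, 1]

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
augmented = augment(image, rng)
```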

## Normalization

After normalizing the inputs, the contours of the cost function look more circular than like elongated ellipses, so gradient descent heads more directly toward the minimum. With elongated contours, gradient descent tends to overshoot and oscillate.

Note that the mean and variance computed from the training data must also be applied to the test data. Do not recompute the mean and variance from the test data.
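A minimal sketch of this rule, assuming NumPy and a layout where rows are examples and columns are features (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(200, 2))
X_test = rng.normal(loc=5.0, scale=3.0, size=(50, 2))

# Statistics come from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma  # reuse mu and sigma, do not recompute
```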

## Weight Initialization

Vanishing/exploding gradients

He initialization for ReLU

Xavier initialization for tanh
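The two schemes named above differ in how they scale random weights. A minimal sketch, assuming NumPy, hypothetical helper names `he_init` and `xavier_init`, and the common forms where He scaling uses variance $\frac{2}{n^{[l - 1]}}$ and Xavier scaling uses variance $\frac{1}{n^{[l - 1]}}$ (another widespread Xavier variant uses $\frac{2}{n^{[l - 1]} + n^{[l]}}$):

```python
import numpy as np

def he_init(n_out, n_in, rng):
    """Weights with variance 2 / n_in, commonly paired with ReLU."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

def xavier_init(n_out, n_in, rng):
    """Weights with variance 1 / n_in, commonly paired with tanh."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

rng = np.random.default_rng(0)
W_he = he_init(256, 512, rng)          # shape (n^[l], n^[l-1])
W_xavier = xavier_init(256, 512, rng)
```

Scaling by the number of inputs keeps the variance of each layer's pre-activations roughly constant, which is meant to limit vanishing or exploding signals in deep networks.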

## Gradient Checking

**Grad check**
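Gradient checking compares the gradient from backward propagation with a numerical estimate. A minimal sketch, assuming NumPy, a two-sided finite difference, and a toy cost whose analytic gradient is known; the helper names and the `eps` value are illustrative choices, not taken from these notes:

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-7):
    """Two-sided finite-difference estimate of df/dtheta for a 1-D theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def relative_difference(g1, g2):
    """Norm of the difference, scaled by the norms of both gradients."""
    return np.linalg.norm(g1 - g2) / (np.linalg.norm(g1) + np.linalg.norm(g2))

def cost(theta):
    """Toy cost J(theta) = sum(theta^2), with analytic gradient 2 * theta."""
    return np.sum(theta ** 2)

theta = np.array([1.0, -2.0, 3.0])
analytic = 2 * theta
numeric = numerical_gradient(cost, theta)
diff = relative_difference(analytic, numeric)
```

A very small `diff` suggests the analytic gradient is implemented correctly; a large one points to a bug in the derivative code.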

## Batch

An **epoch** is one pass through the entire training set.

**Batch gradient descent** uses the entire training set for each gradient descent step, so there is 1 step per epoch.

**Mini-batch gradient descent** uses only a part of the data for each gradient descent step, so there are multiple steps per epoch.

Mini-batch gradient descent is popular because it updates the weights many times per epoch, so training usually progresses faster.

When the mini-batch size is 1, it's called **stochastic gradient descent**.

When the training set is huge, we should use mini-batches, but if the training set is small, batch gradient descent is fine.

Mini-batch sizes are often chosen as powers of two, on the belief that computers run faster with them, for example, $64 = 2^6$, $128 = 2^7$, $256 = 2^8$, $512 = 2^9$.
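A minimal sketch of splitting one epoch into mini-batches, assuming NumPy, rows as examples, and a hypothetical `iterate_minibatches` generator; shuffling before splitting is a common practice assumed here, not something these notes specify:

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data once."""
    m = X.shape[0]
    order = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(1000 * 3, dtype=float).reshape(1000, 3)
y = np.arange(1000)
batches = list(iterate_minibatches(X, y, batch_size=64, rng=rng))
```

With 1000 examples and a batch size of 64, one epoch has 16 gradient descent steps, and the last mini-batch holds the remaining 40 examples.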

## Exponentially weighted moving average

$$
v_t = \beta v_{t - 1} + (1 - \beta) \theta_{t}
$$

where, for example,

$$
\beta = 0.9
$$

and $\theta_t$ is the observed data point at step $t$ that is being averaged.

$v_t$ is approximately the average over the last $\frac{1}{1 - \beta}$ data points. For example, when $\beta = 0.9$,

$$
\frac{1}{1 - 0.9} = \frac{1}{0.1} = 10
$$

So it's roughly the average over the last 10 data points. When $\beta = 0.98$,

$$
\frac{1}{1 - 0.98} = \frac{1}{0.02} = 50
$$

So it's roughly the average over the last 50 data points.

- If $\beta$ is large, we average over more data points and the moving average is smoother.
- If $\beta$ is small, we average over fewer data points and the moving average is more spiky.

When $\beta = 0.9$

$$
v_3 = 0.9 v_2 + 0.1 \theta_3
$$
$$
v_2 = 0.9 v_1 + 0.1 \theta_2
$$
$$
v_1 = 0.9 v_0 + 0.1 \theta_1
$$

By substituting $v_2$ and $v_1$ into $v_3$ equation,

$$
v_3 = 0.9 (0.9 \times (0.9 v_0 + 0.1 \theta_1) + 0.1 \theta_2) + 0.1 \theta_3
$$
$$
= 0.9^3 v_0 + 0.9^2 \times 0.1 \theta_1 + 0.9 \times 0.1 \theta_2 + 0.1 \theta_3
$$

By rearranging,

$$
= 0.1 \theta_3 + 0.1 \times 0.9 \theta_2 + 0.1 \times 0.9^2 \theta_1 + 0.9^3 v_0
$$
$$
= 0.1 \times 0.9^0 \theta_3 + 0.1 \times 0.9^1 \theta_2 + 0.1 \times 0.9^2 \theta_1 + 0.9^3 v_0
$$

Because we choose $0 < \beta < 1$, higher powers of $\beta$ are smaller. So the exponentially weighted moving average puts more weight on recent data (e.g. $\theta_3$) and less weight on old data (e.g. $\theta_1$).
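The same substitution works for any $t$. As a sketch generalizing the $v_3$ expansion above (the index $k$ is introduced here for illustration),

$$
v_t = (1 - \beta) \sum_{k = 0}^{t - 1} \beta^{k} \theta_{t - k} + \beta^{t} v_0
$$

Setting $t = 3$ and $\beta = 0.9$ recovers the expression for $v_3$. The weights on the data sum to $(1 - \beta) \sum_{k = 0}^{t - 1} \beta^{k} = 1 - \beta^{t}$, which is less than 1 for small $t$.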

**Bias correction**. Because the recursion starts from $v_0 = 0$, the moving average is biased toward 0 during the first iterations. The correction divides by $1 - \beta^t$, using $\frac{v_t}{1 - \beta^t}$ in place of $v_t$. In practice, people often skip this, because the bias fades after the initial iterations as $\beta^t$ approaches 0.
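A minimal sketch of the moving average with optional bias correction, in plain Python with a hypothetical `ewma` helper and constant toy data chosen so the bias is easy to see:

```python
def ewma(data, beta=0.9, bias_correction=False):
    """Exponentially weighted moving average of a sequence, with v_0 = 0."""
    v = 0.0
    averages = []
    for t, theta in enumerate(data, start=1):
        v = beta * v + (1 - beta) * theta
        # Optionally divide by 1 - beta^t to undo the pull toward v_0 = 0.
        averages.append(v / (1 - beta ** t) if bias_correction else v)
    return averages

data = [10.0] * 5
raw = ewma(data)
corrected = ewma(data, bias_correction=True)
```

For this constant input, the uncorrected average starts near 1 and climbs slowly toward 10, while the corrected average stays at about 10 from the first step.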


https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normalizing-inputs


With mini-batch gradient descent, the optimization path oscillates because each mini-batch is a random sample of the training set, so the gradients are noisy. Smoothing this oscillation makes learning faster. The idea is to apply the exponentially weighted moving average to the noisy gradients. This approach is called **gradient descent with momentum**.

In its basic form, with learning rate $\alpha$, gradient descent is,

$$
W = W - \alpha \frac{\partial \mathcal{J}}{\partial W}
$$

In gradient descent with momentum, after computing the derivative, we first apply the exponentially weighted moving average to the gradients, and then take the gradient descent step with the averaged gradient.

$$
v_{\frac{\partial \mathcal{J}}{\partial W}, t} = \beta v_{\frac{\partial \mathcal{J}}{\partial W}, (t - 1)} + (1 - \beta) \frac{\partial \mathcal{J}}{\partial W}_t
$$
$$
W = W - \alpha v_{\frac{\partial \mathcal{J}}{\partial W}, t}
$$

A typical choice for $\beta$ is 0.9, which roughly averages the last 10 gradients.
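The two update equations above can be sketched as follows, assuming NumPy, a hypothetical `momentum_step` helper, and a toy cost $\mathcal{J}(W) = \frac{1}{2} \| W \|^2$ with Gaussian noise added to its gradient to mimic mini-batch sampling:

```python
import numpy as np

def momentum_step(W, grad, v, alpha=0.01, beta=0.9):
    """One update: v is the moving average of gradients, W moves along v."""
    v = beta * v + (1 - beta) * grad
    W = W - alpha * v
    return W, v

# Toy cost J(W) = 0.5 * ||W||^2, whose gradient is W itself.
rng = np.random.default_rng(0)
W = np.array([5.0, -3.0])
v = np.zeros_like(W)
for _ in range(500):
    noisy_grad = W + rng.normal(0.0, 0.1, size=W.shape)  # mimic mini-batch noise
    W, v = momentum_step(W, noisy_grad, v, alpha=0.1)
```

After the loop, `W` sits close to the minimum at the origin even though every individual gradient was noisy.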

https://www.coursera.org/learn/deep-neural-network/lecture/y0m1f/gradient-descent-with-momentum