# Neural Network Optimization

## Regularization

With regularization added, the cost function of a neural network is,

$$
\mathcal{J}(W, b) = \frac{1}{m} \sum_{i = 1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l = 1}^{L} \| W^{[l]} \|_{F}^2
$$

where

$$
\| W^{[l]} \|_{F}^2 = \sum_{i = 1}^{n^{[l]}} \sum_{j = 1}^{n^{[l - 1]}} (w_{i, j}^{[l]})^2
$$

Because we want to minimize $\mathcal{J}$, setting a large $\lambda$ pushes the entries of $W$ toward 0; otherwise the penalty term would keep $\mathcal{J}$ large.

We have double sums because each $W^{[l]}$ is a **matrix** of weights, and its dimension is,

$$
W^{[l]}: (n^{[l]} \times n^{[l - 1]})
$$

So for one weight matrix, we square every element and sum over all rows and columns to get a single **scalar**. This $\| W^{[l]} \|_{F}^2$ is the squared **Frobenius norm**.
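As a small illustration of the penalty term, the sketch below computes the squared Frobenius norm and the resulting $\frac{\lambda}{2m} \sum_{l} \| W^{[l]} \|_{F}^2$ with NumPy. The function name `l2_penalty`, the two small matrices, and the values of `lam` and `m` are made up for this example and are not from the notes.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """Return (lam / 2m) * sum over layers of the squared Frobenius norm."""
    return (lam / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

# Hypothetical two-layer network: W1 is (3 x 2), W2 is (1 x 3).
W1 = np.array([[1.0, 2.0], [0.0, -1.0], [2.0, 0.0]])
W2 = np.array([[1.0, -2.0, 3.0]])

# Square every element, then sum over rows and columns: a single scalar.
squared_frobenius_W1 = np.sum(np.square(W1))  # 1 + 4 + 0 + 1 + 4 + 0 = 10

# Penalty added to the cost: (0.5 / 20) * (10 + 14) = 0.6
penalty = l2_penalty([W1, W2], lam=0.5, m=10)
```

`np.linalg.norm(W1, "fro") ** 2` gives the same scalar as the explicit double sum, up to floating-point rounding.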

**Gradient descent** with regularization works as follows. First, we take the derivative of the cost function with respect to the weights. Here $\alpha$ is the learning rate, $\lambda$ is the **regularization parameter** that controls how strong the regularization is, and $m$ is the number of examples. Let $\text{backprops}$ be the partial derivatives of the unregularized cost obtained in **backward propagation**.

$$
\frac{\partial \mathcal{J}}{\partial W} = \text{backprops} + \frac{\lambda}{m} W
$$

So the gradient descent will be,

$$
W = W - \alpha \frac{\partial \mathcal{J}}{\partial W}
$$
$$
= W - \alpha \left[ \text{backprops} + \frac{\lambda}{m} W \right]
$$
$$
= W - \alpha \frac{\lambda}{m} W - \alpha \left[ \text{backprops} \right]
$$
$$
= (1 - \frac{\alpha \lambda}{m}) W - \alpha \left[ \text{backprops} \right]
$$

Typically $\frac{\alpha \lambda}{m} \ll 1$, so the factor $1 - \frac{\alpha \lambda}{m}$ is slightly less than 1, and multiplying $W$ by it makes $W$ a little smaller at every step. It is the same effect we see with regularization in linear regression, which keeps parameters small. Because the weights shrink under regularization, some people call it **weight decay**.
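The derivation above can be checked numerically. The sketch below is a sanity check only: `backprops` is a random stand-in for the gradient from backward propagation, and the values of `alpha`, `lam`, and `m` are arbitrary assumptions. It confirms that the plain gradient step and the weight decay form give the same updated $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
backprops = rng.normal(size=(3, 2))  # stand-in for the backprop gradient term
alpha, lam, m = 0.1, 0.7, 50         # arbitrary example values

# Form 1: ordinary gradient step on the regularized cost.
dW = backprops + (lam / m) * W
W_gradient_step = W - alpha * dW

# Form 2: shrink W by the decay factor, then apply the unregularized step.
decay_factor = 1 - alpha * lam / m   # slightly less than 1
W_weight_decay = decay_factor * W - alpha * backprops

same_update = np.allclose(W_gradient_step, W_weight_decay)
```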

**Dropout** is another way to introduce regularization: during training it randomly sets the outputs of some units to 0, which removes their connections for that step and effectively makes the neural network smaller.
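A minimal sketch of one common variant, inverted dropout, applied to a layer's activations. The notes do not specify a variant, so the function name `dropout_forward`, the `keep_prob` parameter, and the rescaling by `1 / keep_prob` are assumptions made for illustration.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Zero each activation with probability (1 - keep_prob), training time only."""
    mask = rng.random(A.shape) < keep_prob  # True where the unit is kept
    # Divide by keep_prob so the expected value of the activations is unchanged.
    return (A * mask) / keep_prob, mask

rng = np.random.default_rng(0)
A = np.ones((4, 5))  # hypothetical activations of one layer
A_drop, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
```

At test time no units are dropped; with this variant the activations are used as they are, since the rescaling was already applied during training.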

In image recognition tasks, **data augmentation** can also act as regularization: adding noise, rotating, zooming in or out, and flipping images gives us more challenging and more diverse examples to add to the training data.
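Two of these transformations, flipping and adding noise, can be sketched with plain NumPy. The function name `augment`, the `noise_std` parameter, and the assumption that images are float arrays with values in $[0, 1]$ are all illustrative choices, not part of the notes; rotation and zoom would normally come from an image library.

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Return a randomly flipped, noisy copy of an (H, W, C) image in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip: reverse the width axis
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # Gaussian noise
    return np.clip(out, 0.0, 1.0)  # keep pixel values in the valid range

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))  # hypothetical 8x8 RGB image
augmented = [augment(image, rng) for _ in range(4)]
```

Each call yields a slightly different version of the same image, and the label stays the same, so the training set grows without collecting new data.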

https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normalizing-inputs