# Regularization

- Penalizes large weights by adding an L2 term to the cost.
- $J(w,b) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}||w||^{2}$
- Where $||w||^{2} = \displaystyle\sum_{j=1}^{n_{x}}w_{j}^{2} = w^{T}w$
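
The regularized cost above can be sketched in NumPy for logistic regression. This is a minimal illustration, not a reference implementation: the function name `l2_regularized_cost`, the argument name `lambd` (used because `lambda` is a Python keyword), and the column-per-example layout of `X` are assumptions.

```python
import numpy as np

def l2_regularized_cost(w, b, X, Y, lambd):
    """Cross-entropy cost plus (lambda / 2m) * ||w||^2 for logistic regression.

    Assumed shapes: X is (n_x, m), Y is (1, m), w is (n_x, 1).
    """
    m = X.shape[1]
    z = w.T @ X + b                        # shape (1, m)
    y_hat = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation
    # Average cross-entropy loss over the m examples
    cross_entropy = -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
    # L2 penalty: (lambda / 2m) * w^T w
    l2_penalty = (lambd / (2 * m)) * float(np.sum(np.square(w)))
    return cross_entropy + l2_penalty
```

Setting `lambd=0` recovers the unregularized cost, so the penalty can be switched off without changing the code path.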

In general, for a network with $L$ layers:
- $J(w^{[1]},b^{[1]} \dots w^{[L]},b^{[L]}) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m}\displaystyle\sum_{l=1}^{L}||w^{[l]}||_{F}^{2}$
- Where $||w^{[l]}||_{F}^{2} = \displaystyle\sum_{i=1}^{n^{[l]}}\displaystyle\sum_{j=1}^{n^{[l-1]}}(w_{ij}^{[l]})^{2}$ is the squared Frobenius norm ($w^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$)
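- Add $\dfrac{\lambda}{m}w^{[l]}$ to the $dw^{[l]}$ computed by backprop.
- The update rule $w^{[l]} = w^{[l]} - \alpha dw^{[l]}$ remains the same. Because $dw^{[l]}$ now contains the penalty term, each step also scales $w^{[l]}$ by $\left(1 - \dfrac{\alpha\lambda}{m}\right)$ (weight decay).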
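
As a rough sketch of the multi-layer case, the penalty and the modified gradient step might look like the following. The helper names `frobenius_penalty` and `regularized_update`, and the argument name `dW_backprop` for the gradient before the penalty is added, are hypothetical.

```python
import numpy as np

def frobenius_penalty(weights, lambd, m):
    """(lambda / 2m) * sum over layers of the squared Frobenius norm."""
    return (lambd / (2 * m)) * sum(float(np.sum(np.square(W))) for W in weights)

def regularized_update(W, dW_backprop, lambd, m, alpha):
    """One gradient descent step with the L2 term added to the gradient."""
    dW = dW_backprop + (lambd / m) * W   # add (lambda / m) * W to dW
    return W - alpha * dW                # update rule itself is unchanged

# Toy usage with made-up values
W = np.array([[1.0, -2.0], [0.5, 3.0]])
dW_backprop = np.array([[0.1, 0.2], [-0.3, 0.4]])
W_new = regularized_update(W, dW_backprop, lambd=0.7, m=10, alpha=0.1)
```

Expanding the update gives `(1 - alpha * lambd / m) * W - alpha * dW_backprop`, which is why L2 regularization is often called weight decay.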

## Dropout

Example (inverted dropout) for layer $l = 3$:
- keep_prop = 0.8 (each unit has a 20% chance of being shut down)
- d3 = np.random.rand(a3.shape[0], a3.shape[1]) $\lt$ keep_prop
- a3 = np.multiply(a3, d3)
- a3 = a3 / keep_prop (rescales so the expected value of a3 is unchanged)
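
The steps above can be collected into one small function. This is a sketch under assumptions: the function name `inverted_dropout` and the use of a NumPy `Generator` (`rng`) for reproducibility are not from the notes. A uniform draw falls below `keep_prop` with probability `keep_prop`, so the mask keeps a unit when the draw is smaller than `keep_prop`.

```python
import numpy as np

def inverted_dropout(a, keep_prop, rng):
    """Apply inverted dropout to activations `a`; return (a_dropped, mask)."""
    d = rng.random(a.shape) < keep_prop   # True with probability keep_prop
    a = a * d                             # shut down the dropped units
    a = a / keep_prop                     # rescale to keep the expected value
    return a, d

rng = np.random.default_rng(0)
a3 = np.ones((5, 1000))                   # stand-in activations for layer 3
a3_dropped, d3 = inverted_dropout(a3, keep_prop=0.8, rng=rng)
```

Because of the rescaling, no extra scaling is needed at test time: dropout is simply not applied.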

Other regularization methods
- Data augmentation.
- Early stopping.

## Normalizing inputs

- To make gradient descent faster. 
- Applies to both training and test sets: normalize the test set with the $\mu$ and $\sigma$ computed on the training set.
- $\mu = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}x^{(i)}$, $\sigma^{2} = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}(x^{(i)} - \mu)^{2}$ (element-wise square)
- $x = \dfrac{x-\mu}{\sigma}$
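
A minimal NumPy sketch of this normalization follows. The function names `fit_normalizer` and `normalize`, the column-per-example layout, and the small `eps` added to guard against constant features are assumptions, not part of the notes.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Per-feature mean and standard deviation; X_train has shape (n_x, m)."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma2 = np.mean((X_train - mu) ** 2, axis=1, keepdims=True)
    sigma = np.sqrt(sigma2) + eps   # eps (assumption) avoids division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Shift and scale X using statistics fitted elsewhere."""
    return (X - mu) / sigma
```

In use, `mu, sigma = fit_normalizer(train_X)` would be computed once, and the same pair passed to `normalize` for both the training and the test inputs.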

## Vanishing/exploding gradient

- Happens in deep neural networks.
- Careful weight initialization partially mitigates it.
    - $Var(w_{i}) = \dfrac{1}{n}$, or $\dfrac{2}{n}$ (good for ReLU), where $n$ is the number of inputs to the unit
    - $w^{[l]}$ = np.random.randn($n^{[l]}$, $n^{[l-1]}$) * init_factor
    - init_factor = np.sqrt$\left(\dfrac{2}{n^{[l-1]}}\right)$ (good for ReLU) or np.sqrt$\left(\dfrac{1}{n^{[l-1]}}\right)$ (Xavier initialization)
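
One way to sketch this initialization for a whole network is shown below. The function name `initialize_weights`, the `layer_dims` list, the `activation` switch, and the parameter-dictionary keys are illustrative assumptions.

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu", seed=0):
    """Scaled random initialization; layer_dims = [n_x, n_1, ..., n_L]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Variance 2/n for ReLU, 1/n otherwise (Xavier-style)
        scale = 2.0 if activation == "relu" else 1.0
        init_factor = np.sqrt(scale / n_prev)
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * init_factor
        params[f"b{l}"] = np.zeros((n_curr, 1))   # biases can start at zero
    return params

params = initialize_weights([500, 400, 1])
```

Scaling by the fan-in keeps the variance of each layer's pre-activations roughly constant with depth, which is what slows down vanishing or exploding signals.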
    
## Gradient checking

- Take $w^{[1]},b^{[1]} \dots w^{[L]},b^{[L]}$ and reshape into a big vector $\theta$
    - $J(w^{[1]},b^{[1]} \dots w^{[L]},b^{[L]}) = J(\theta)$
- Take $dw^{[1]},db^{[1]} \dots dw^{[L]},db^{[L]}$ and reshape into a big vector $d\theta$
- For each $i$
    - $d\theta_{approx}[i] = \dfrac{J(\theta_{1}, \theta_{2} \dots \theta_{i}+\epsilon \dots) - J(\theta_{1}, \theta_{2} \dots \theta_{i}-\epsilon \dots)}{2\epsilon} \approx d\theta[i] = \dfrac{\partial J}{\partial \theta_{i}}$
- Check
    - $\dfrac{||d\theta_{approx}-d\theta||_{2}}{||d\theta_{approx}||_{2} + ||d\theta||_{2}} \approx 10^{-7}$ is good
    - bigger than $10^{-3}$ means something is probably wrong!
- Don't use in training, only to debug.
- If the cost is regularized, include the regularization term in both $J(\theta)$ and $d\theta$.
- Doesn't work with dropout. (you can grad check with keep_prop=1.0, then later turn on dropout)
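
The check described above can be sketched as a short function. This assumes the parameters are already flattened into a 1-D float array `theta` and that `J` is a callable returning the scalar cost; the function name `gradient_check` and the toy quadratic cost are illustrative only.

```python
import numpy as np

def gradient_check(J, theta, dtheta, epsilon=1e-7):
    """Relative difference between analytic and two-sided numerical gradients."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon           # nudge only component i upward
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon          # and downward
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator

# Toy cost J(theta) = sum(theta^2), whose true gradient is 2 * theta
J = lambda theta: float(np.sum(theta ** 2))
theta = np.array([1.0, -2.0, 3.0])
good = gradient_check(J, theta, 2 * theta)   # correct gradient
bad = gradient_check(J, theta, 3 * theta)    # deliberately wrong gradient
```

The loop evaluates `J` twice per parameter, which is why the check is only practical as a debugging step and not during training.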

## Example

### Packages

```python
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.datasets
import scipy.io

from dl_utils import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (7.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
```

### Data

```python
train_X, train_Y, test_X, test_Y = load_2D_dataset()
```

NameError: name 'scipy' is not defined