## Weight Initialization

* Why do we initialize weights randomly, like this?
> np.random.randn(D) / np.sqrt(D)
* Read the paper ["Efficient BackProp"](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)
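
One way to see what the division by `np.sqrt(D)` does is to compare the variance of a unit's pre-activation with and without it. The sketch below assumes standard-normal inputs and uses illustrative values for `D` and `n_trials` that are not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000          # input dimensionality (illustrative value)
n_trials = 2000   # number of independent (x, w) draws (illustrative value)

# Assumption: inputs are standard normal, one row per trial
x = rng.standard_normal((n_trials, D))

# Unscaled weights vs. weights scaled as in the snippet above
w_unscaled = rng.standard_normal((n_trials, D))
w_scaled = w_unscaled / np.sqrt(D)

# Pre-activation of a single unit: dot product of input and weights
a_unscaled = np.sum(x * w_unscaled, axis=1)
a_scaled = np.sum(x * w_scaled, axis=1)

print("variance without scaling:", a_unscaled.var())  # on the order of D
print("variance with scaling:   ", a_scaled.var())    # on the order of 1
```

Under these assumptions, the scaled weights keep the pre-activation variance near 1 regardless of `D`, while the unscaled weights let it grow with `D`.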

### Weight initialization is important

### Initializing to 0 (or constant)

* It is common to initialize weights to 0 in linear models (e.g., linear regression), and it works fine
* But this does not work in a neural network. Why?



* If all weights are initialized to the same constant, every unit in a layer computes the same feature and receives the same gradient update, so the units stay identical throughout training. It is like having only 1 unit in that layer. In other words, adding more units will not increase the expressiveness of the neural network.

> Initializing weights randomly allows us to break this symmetry and make use of all the hidden units in the neural network
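
The symmetry argument can be checked numerically. The following is a minimal sketch, assuming a one-hidden-layer tanh network with a squared-error loss; the helper names `hidden_and_grads` and `columns_identical`, the layer sizes, and the constant 0.5 are all hypothetical choices made for illustration.

```python
import numpy as np

def hidden_and_grads(W1, w2, X, y):
    """One forward/backward pass of a 1-hidden-layer tanh network (squared error)."""
    Z = np.tanh(X @ W1)                      # hidden activations, shape (N, M)
    y_hat = Z @ w2                           # predictions, shape (N,)
    err = y_hat - y                          # prediction error, shape (N,)
    dZ = np.outer(err, w2) * (1 - Z ** 2)    # backprop through tanh, shape (N, M)
    dW1 = X.T @ dZ / len(X)                  # gradient w.r.t. W1, shape (D, M)
    return Z, dW1

def columns_identical(A):
    """True if every column of A equals the first column."""
    return np.allclose(A, A[:, [0]])

rng = np.random.default_rng(0)
N, D, M = 20, 3, 4                           # samples, inputs, hidden units
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)

# Constant initialization: every hidden unit starts out identical
Z_const, dW1_const = hidden_and_grads(np.full((D, M), 0.5), np.full(M, 0.5), X, y)

# Random initialization: each hidden unit starts out different
W1_rand = rng.standard_normal((D, M)) / np.sqrt(D)
w2_rand = rng.standard_normal(M) / np.sqrt(M)
Z_rand, dW1_rand = hidden_and_grads(W1_rand, w2_rand, X, y)

print("constant init, units identical:", columns_identical(Z_const))
print("constant init, grads identical:", columns_identical(dW1_const))
print("random init,   units identical:", columns_identical(Z_rand))
print("random init,   grads identical:", columns_identical(dW1_rand))
```

With constant weights, both the hidden activations and their gradients are the same for every unit, so a gradient step cannot make the units differ. With random weights, neither is the same.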

* Now we know that we should initialize weights randomly. The next questions are which distribution they should come from and what the parameters of that distribution should be
* Before answering these questions, we first look at the vanishing gradients and exploding gradients problems.

## Vanishing Gradients

* When it comes to neural networks, one premise is that deeper is better
    * Researchers have found that by adding more layers, the neural network can have fewer hidden units per layer and achieve better performance (add paper)
* Researchers used to believe that the sigmoid ("S" shape) activation function was the best possible activation function. This could be due to the fact that the sigmoid function has a nice derivative, one that can be written in terms of the function's own output.
    * for sigmoid, the derivative is: 
    
    $$ \text{output} \cdot (1 - \text{output}) $$
    
    * for tanh, the derivative is: 
    
    $$ 1 - \text{output} \cdot \text{output} $$
    
    
* These functions are <b style="color:red">smooth</b>, meaning they are differentiable everywhere. Differentiability is important because the learning method is gradient descent, and we cannot do gradient descent if we cannot take derivatives. 
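
The two derivative formulas above can be sanity-checked against a numerical derivative. This is a small sketch; the grid `x` and the step size `eps` are arbitrary choices, not values from the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 101)   # evaluation points (arbitrary grid)
eps = 1e-5                    # step for the central difference (arbitrary)

# Sigmoid: derivative expressed through the output itself
s = sigmoid(x)
sigmoid_grad = s * (1 - s)
sigmoid_numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Tanh: derivative expressed through the output itself
t = np.tanh(x)
tanh_grad = 1 - t * t
tanh_numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

print("sigmoid formula matches:", np.allclose(sigmoid_grad, sigmoid_numeric, atol=1e-8))
print("tanh formula matches:   ", np.allclose(tanh_grad, tanh_numeric, atol=1e-8))
print("largest sigmoid derivative on the grid:", sigmoid_grad.max())
```

Both derivatives reuse the output already computed in the forward pass, which is part of what makes them convenient for backpropagation.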

**Problem with the sigmoid in deep network**
* The neural network has the basic form $ y = f(g(h(\dots p(x) \dots))) $, where $f, g, h$, all the way down to $p$, represent neural network layers.
* Due to the chain rule of calculus, the derivative with respect to the weights at the first layer is calculated by multiplying the derivative at each layer that comes after that. 
$$ {dy \over dw_1} = {df \over dg} \cdot {dg \over dh} \cdots {dp \over dw_1} $$
* The maximum value of the derivative of the sigmoid function is 0.25. If we multiply by 0.25 a large number of times, the result approaches zero. And this assumes that we get the peak value of the sigmoid derivative at every layer. In normal scenarios, the derivative we get is smaller than 0.25, in which case the product diminishes even faster. 
<img src="images/derivative_sigmoid.png" alt="Drawing" style="width:50%;height:50%"/>
* Therefore, with sigmoid activations, a deep neural network has many derivatives very close to zero, causing the neural network to learn very slowly. Because of this, the "standard" backpropagation is not a good way to train a very deep neural network.
* We can address the limitation of standard backpropagation in many ways:
    * <b style="color:red">Greedy layer-wise unsupervised pre-training</b>, developed by Geoff Hinton's 
    * Or use the <b style="color:red">ReLU</b> activation function
        * No pre-training required
        * Train the whole neural network from scratch with backpropagation (sometimes called "end-to-end training")
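
To make the shrinking product concrete, here is a minimal sketch that multiplies only the activation derivatives along a chain of layers. It deliberately ignores the weight terms in the chain rule, draws the pre-activations from a standard normal, and assumes every ReLU unit on the path is active; all three are simplifying assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Best case for sigmoid: every layer contributes the peak derivative 0.25
for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}: 0.25 ** depth = {0.25 ** depth:.3e}")

# More typical case: pre-activations are not exactly 0 (assumed standard normal)
rng = np.random.default_rng(0)
depth = 20
pre_activations = rng.standard_normal(depth)
s = sigmoid(pre_activations)
sigmoid_product = np.prod(s * (1 - s))   # product of per-layer sigmoid derivatives

# ReLU derivative is 1 for positive inputs (assumption: all units are active)
relu_product = np.prod(np.ones(depth))

print("sigmoid derivative product over 20 layers:", sigmoid_product)
print("ReLU derivative product over 20 layers:   ", relu_product)
```

In this sketch the sigmoid product falls below even the best-case bound of `0.25 ** 20`, while the ReLU product stays at 1 along an active path.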

## Exploding Gradients

* Happens when we multiply $ w * w * w * w * ...$, where $ |w| > 1 $
* May happen in recurrent neural networks
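
The effect of repeated multiplication can be sketched with a single scalar weight. The helper `repeated_product` and the specific values of `w` and `steps` are hypothetical, chosen only to show the two regimes.

```python
def repeated_product(w, steps):
    """Multiply w by itself `steps` times, as in w * w * w * ..."""
    result = 1.0
    for _ in range(steps):
        result *= w
    return result

for w in (0.9, 1.0, 1.1):
    for steps in (10, 50, 100):
        print(f"w = {w}, steps = {steps:3d}: product = {repeated_product(w, steps):.3e}")
```

A weight slightly above 1 in magnitude grows without bound as the number of factors increases, while a weight slightly below 1 shrinks toward zero, which mirrors the vanishing case.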

> In order to avoid both vanishing and exploding gradients, we need to initialize weights just right

## Ways to Initialize Weights