# Deep Learning With Computer Vision And Advanced NLP (DL_CV_NLP)

$$ Revision Notes $$
$$ A-Note-by-**Bappy Ahmed** $$

# Understanding the problems faced in training neural networks:

 ### 1. Vanishing & Exploding Gradient.
   - **Vanishing Gradient**: The gradient becomes very small.
   - **Exploding Gradient**: The gradient becomes very large.
  
  Together these are also called the unstable gradient problem. It cannot be eliminated entirely, but it can be mitigated.

 ### 2. Training requires a lot of data.
      - This can be addressed with Transfer Learning or Data augmentation techniques.

 ### 3. Increasing the size of the NN means increasing the number of hidden layers, which may slow down training.
    - A possible solution is to use a better optimizer or a better activation function.

 ### 4. Risk of overfitting (millions of parameters). This also happens when there is not enough data or when the data is noisy.
    - A possible solution is regularization or dropout.





# Vanishing & Exploding Gradient:


Roughly speaking, when the weights are larger than 1 in magnitude, the gradient can grow as it is propagated backward through the layers. This is called the exploding gradient problem, and it hampers the gradient descent algorithm. When the weights are smaller than 1 in magnitude, the gradient can become considerably smaller from layer to layer. This is called the vanishing gradient problem.

Now let's look at vanishing and exploding gradients in detail. Let's take a simple network.

   <img src="https://github.com/entbappy/Branching-tutorial/blob/master/18.png?raw=true" width="600" 
     height="300">


**Assumptions**,
 - Each layer has 1 neuron
 - Bias = 0
 - Error function, $e = (y-\hat{y})^2$
 - $y$ = actual value
 - $\hat{y}$ = predicted value
 - $\sigma$ is the activation function (sigmoid)


Weight update rule (gradient descent),

$w = w - \eta \nabla e$

Here,

$\nabla e = \frac{\partial e}{\partial w}$

$\therefore w = w - \eta \frac{\partial e}{\partial w}$
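
The update rule can be tried out on a toy problem. The following is a minimal sketch, assuming a single weight and the hypothetical error $e(w) = (y - w x)^2$ with $x = 1$, $y = 2$ and $\eta = 0.1$. None of these values come from the notes, they are only chosen for illustration.

```python
def gradient_descent_step(w, grad, eta):
    """Apply one update: w <- w - eta * de/dw."""
    return w - eta * grad


# Hypothetical toy problem: e(w) = (y - w * x)^2
x, y = 1.0, 2.0
w, eta = 0.0, 0.1

for step in range(50):
    grad = -2.0 * (y - w * x) * x  # de/dw for the toy error above
    w = gradient_descent_step(w, grad, eta)

print(w)  # moves toward y / x = 2.0
```

Each step shrinks the remaining error by a constant factor here, so $w$ moves steadily toward the minimum.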

From the chain rule we find the following dependencies (see the figure above),

 - $e= (y-\hat{y})^2 = (y-a_2)^2\rightarrow [f(a_2)]$  
 - $a_2 = \sigma(z_2) \rightarrow [f(z_2)]$
 - $z_2 = w_2 \cdot a_1 \rightarrow [f(w_2)]$

#### Now let's find the derivative of $e$ w.r.t. $w_2$

#### $\frac{\partial e}{\partial w_2} = \frac{\partial e}{\partial a_2}* \frac{\partial a_2}{\partial z_2}*\frac{\partial z_2}{\partial w_2} \rightarrow (1)$

$\therefore \frac{\partial e}{\partial w_2} = -2(y-a_2)*\sigma(z_2)(1-\sigma(z_2))*a_1$

Now we can update the weight $w_2$ by,
$$w_2= w_2 - \eta \frac{\partial e}{\partial w_2} $$
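
The expression for $\frac{\partial e}{\partial w_2}$ can be checked numerically. This is a minimal sketch under the assumptions listed above (one neuron per layer, zero bias, sigmoid activation). The input and weight values are arbitrary and only serve as an illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(a0, w1, w2):
    """Forward pass of the two-layer, one-neuron-per-layer network."""
    a1 = sigmoid(w1 * a0)
    a2 = sigmoid(w2 * a1)
    return a1, a2


def error(a0, w1, w2, y):
    _, a2 = forward(a0, w1, w2)
    return (y - a2) ** 2


def grad_w2(a0, w1, w2, y):
    """de/dw2 = -2(y - a2) * sigma(z2)(1 - sigma(z2)) * a1, with a2 = sigma(z2)."""
    a1, a2 = forward(a0, w1, w2)
    return -2.0 * (y - a2) * a2 * (1.0 - a2) * a1


# Arbitrary illustrative values (not taken from the notes)
a0, w1, w2, y = 1.0, 0.5, -0.3, 1.0

analytic = grad_w2(a0, w1, w2, y)

# Central finite difference as an independent check
h = 1e-6
numeric = (error(a0, w1, w2 + h, y) - error(a0, w1, w2 - h, y)) / (2 * h)

print(analytic, numeric)
```

If the two printed numbers agree to several decimal places, the chain-rule expression matches the slope of the error function.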


#### Now let's find the derivative of $e$ w.r.t. $w_1$

Again, from the chain rule we find the following dependencies (see the figure above),

 - $e= (y-\hat{y})^2 = (y-a_2)^2\rightarrow [f(a_2)]$  
 - $a_2 = \sigma(z_2) \rightarrow [f(z_2)]$
 - $z_2 = w_2 \cdot a_1 \rightarrow [f(a_1)]$
 - $a_1 = \sigma(z_1) \rightarrow [f(z_1)]$
 - $z_1 = w_1 \cdot a_0 \rightarrow [f(w_1)]$

So,

#### $\frac{\partial e}{\partial w_1} = \frac{\partial e}{\partial a_2}* \frac{\partial a_2}{\partial z_2}*\frac{\partial z_2}{\partial a_1}*\frac{\partial a_1}{\partial z_1}*\frac{\partial z_1}{\partial w_1}\rightarrow (2)$

Note: 

  - The first two factors were already calculated as part of $\frac{\partial e}{\partial w_2}$ (which is this product times $a_1$): $\frac{\partial e}{\partial a_2}* \frac{\partial a_2}{\partial z_2} = \frac{\partial e}{\partial z_2} = -2(y-a_2)*\sigma(z_2)(1-\sigma(z_2))$
  - The remaining factors are $\frac{\partial z_2}{\partial a_1} = w_2$, $\frac{\partial a_1}{\partial z_1} = \sigma(z_1)(1-\sigma(z_1))$ and $\frac{\partial z_1}{\partial w_1} = a_0$


$\therefore \frac{\partial e}{\partial w_1} = -2(y-a_2)*\sigma(z_2)(1-\sigma(z_2)) * w_2*\sigma(z_1)(1-\sigma(z_1))*a_0$


Now we can update the weight $w_1$ by,
$$w_1= w_1 - \eta \frac{\partial e}{\partial w_1} $$
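
Both gradients can be computed in a single backward pass. The sketch below assumes the same tiny network (one neuron per layer, zero bias, sigmoid activation) and arbitrary illustrative values, and compares $\frac{\partial e}{\partial w_1}$ with a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def error(a0, w1, w2, y):
    a1 = sigmoid(w1 * a0)
    a2 = sigmoid(w2 * a1)
    return (y - a2) ** 2


def gradients(a0, w1, w2, y):
    """Return (de/dw1, de/dw2) using the chain rule."""
    # Forward pass
    a1 = sigmoid(w1 * a0)
    a2 = sigmoid(w2 * a1)
    # Backward pass
    de_dz2 = -2.0 * (y - a2) * a2 * (1.0 - a2)    # de/da2 * da2/dz2
    de_dw2 = de_dz2 * a1                          # times dz2/dw2 = a1
    de_dw1 = de_dz2 * w2 * a1 * (1.0 - a1) * a0   # times dz2/da1 * da1/dz1 * dz1/dw1
    return de_dw1, de_dw2


# Arbitrary illustrative values (not taken from the notes)
a0, w1, w2, y = 1.0, 0.5, -0.3, 1.0
g1, g2 = gradients(a0, w1, w2, y)

# Central finite difference for de/dw1
h = 1e-6
numeric_g1 = (error(a0, w1 + h, w2, y) - error(a0, w1 - h, w2, y)) / (2 * h)

print(g1, numeric_g1, g2)
```

For these particular values the gradient reaching $w_1$ is smaller in magnitude than the one reaching $w_2$, because it picks up the extra factors $w_2$ and $\sigma(z_1)(1-\sigma(z_1))$.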




### Major Observations:

 1. Every factor in equations $(1)$ and $(2)$ is a partial derivative, i.e. a kind of ratio.

 2. If these ratios lie between 0 and 1 (the sigmoid derivative $\sigma(z)(1-\sigma(z))$ is never larger than 0.25), their product becomes very small, so the weight update is tiny or there is no change at all. The more such factors are multiplied together, the smaller the product gets and the slower the learning (minor updates), so the lower layers are hard to train. This is called the vanishing gradient.

 3. If the ratios or product terms are > 1, the gradient grows with every additional factor. This is called the exploding gradient. RNNs face this problem frequently. In this case the solution diverges.
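
The effect of multiplying many such factors can be illustrated with a small calculation. This is an idealised sketch that assumes every layer contributes the same factor (weight times activation slope). A real network has a different factor in each layer, and the helper name `backprop_scale` is made up for this example.

```python
def backprop_scale(weight, activation_slope, depth):
    """Product of `depth` identical per-layer factors weight * activation_slope."""
    return (weight * activation_slope) ** depth


SIGMOID_MAX_SLOPE = 0.25  # largest possible value of sigma(z) * (1 - sigma(z))

# Per-layer factor 0.25 (< 1): the product shrinks with depth
vanishing = backprop_scale(weight=1.0, activation_slope=SIGMOID_MAX_SLOPE, depth=10)

# Per-layer factor 2.5 (> 1): the product grows with depth
exploding = backprop_scale(weight=10.0, activation_slope=SIGMOID_MAX_SLOPE, depth=10)

print(f"factor 0.25 over 10 layers: {vanishing:.3e}")
print(f"factor 2.50 over 10 layers: {exploding:.3e}")
```

With only ten layers, the first product is already below one millionth while the second is in the thousands.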

## Note: 
Vanishing and exploding gradients depend on your choice of activation function as well as the weight initialization technique. This was observed in 2010 by Xavier Glorot and Yoshua Bengio in their paper ```Understanding the difficulty of training deep feedforward neural networks``` [Paper link](http://proceedings.mlr.press/v9/glorot10a)

Consequence of vanishing and exploding gradients: each layer learns at a different speed.


    

### Let's look at the sigmoid function,

   <img src="https://github.com/entbappy/Branching-tutorial/blob/master/17.png?raw=true" width="600" 
     height="300">



## Case 1.

Assumptions,

- $bias = 0$
- Randomly initialized weight $w = 10$
- and $x = 1$

so,

$z = w \cdot x = 10$

> $\sigma(z=10) = \frac{1}{1+e^{-10}}$
>>>>$\approx \frac{1}{1+0}$

>>>>$\approx 1$

Now find the derivative of the sigmoid function at $z=10$.

We know that the derivative of the sigmoid function is,

$\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$

$\therefore \frac{\partial \sigma(z)}{\partial z}\Big|_{z=10} \approx 1(1-1) = 0$

As you can see, the gradient is approximately zero, so the weight barely gets updated.



## Case 2.

Assumptions,

- $bias = 0$
- Randomly initialized weight $w = -10$
- and $x = 1$

so,

$z = w \cdot x = -10$

> $\sigma(z=-10) = \frac{1}{1+e^{10}}$

>>>>$\approx 0$

Now find the derivative of the sigmoid function at $z=-10$.

We know that the derivative of the sigmoid function is,

$\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$

$\therefore \frac{\partial \sigma(z)}{\partial z}\Big|_{z=-10} \approx 0(1-0) = 0$

As you can see, the gradient is approximately zero, so the weight barely gets updated.
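
Both cases can be reproduced numerically. The sketch below evaluates the sigmoid and its derivative at $z = 10$ and $z = -10$, and adds $z = 0$ (a point not covered by the two cases) for comparison with the slope region.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


for z in (-10.0, 0.0, 10.0):
    print(f"z = {z:6.1f}  sigma = {sigmoid(z):.6f}  derivative = {sigmoid_prime(z):.6f}")
```

The derivative is exactly 0.25 at $z = 0$ and is tiny but not exactly zero at $z = \pm 10$, which is why the weight update in the saturation regions is negligible.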


## Conclusion:
Vanishing and exploding gradients depend on your choice of activation function as well as the weight initialization technique, as the two cases above show. So, instead of naive random initialization of the weights, we have to use some other technique so that our model does not suffer from vanishing or exploding gradients. In the figure we can see that the weights should be initialized so that the sigmoid operates in its slope region, not in its saturation region. The same experiment can be repeated with other activation functions.


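### Connection weights must be initialized in the following way,

$$fan_{avg} = \frac{fan_{in}+fan_{out}}{2}$$

Here $fan_{in}$ is the number of inputs to the layer and $fan_{out}$ is the number of outputs of the layer.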



   <img src="https://github.com/entbappy/Branching-tutorial/blob/master/20.png?raw=true" width="500" 
     height="300">


Reference: [Paper link](http://proceedings.mlr.press/v9/glorot10a)


## Weight initialization techniques are,
 - Glorot (Xavier Glorot) $\rightarrow$ None, tanh, sigmoid (logistic), softmax
 - He (Kaiming He) $\rightarrow$ ReLU & its variants
 - LeCun (Yann LeCun) $\rightarrow$ SELU
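
The three schemes can be sketched in plain Python. The variance formulas below are the commonly quoted normal-distribution variants ($1/fan_{avg}$ for Glorot, $2/fan_{in}$ for He, $1/fan_{in}$ for LeCun). They are not spelled out in these notes, so treat them as an assumption. The function names are made up for this example.

```python
import math
import random


def init_std(fan_in, fan_out, scheme):
    """Standard deviation of a zero-mean normal initializer (assumed formulas)."""
    fan_avg = (fan_in + fan_out) / 2
    if scheme == "glorot":
        return math.sqrt(1.0 / fan_avg)
    if scheme == "he":
        return math.sqrt(2.0 / fan_in)
    if scheme == "lecun":
        return math.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")


def init_weights(fan_in, fan_out, scheme, seed=0):
    """Return a fan_in x fan_out list of lists of normally distributed weights."""
    rng = random.Random(seed)
    std = init_std(fan_in, fan_out, scheme)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


# Hypothetical layer sizes, chosen only for illustration
for scheme in ("glorot", "he", "lecun"):
    print(scheme, round(init_std(300, 100, scheme), 4))

weights = init_weights(300, 100, "glorot")
```

Larger layers get a smaller standard deviation, which keeps the pre-activations away from the saturation regions at the start of training.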