# Deep Learning With Computer Vision And Advanced NLP (DL_CV_NLP)

$$ Revision Notes $$
$$ A-Note-by-**Bappy Ahmed** $$

# Optimizers:

Optimizers are algorithms or methods used to change the attributes of a neural network, such as its weights and biases, in order to reduce the loss. In other words, an optimizer solves an optimization problem by minimizing the loss function.

   <img src="https://media.geeksforgeeks.org/wp-content/uploads/20200511223856/f110.jpg" width="600"
     height="300">


## Different optimizers:
 - Gradient Descent (GD)
 - Stochastic Gradient Descent (SGD)
 - Momentum based GD
 - Nesterov Accelerated Gradient (NAG)
 - Adagrad (Adaptive Gradient)
 - RMSprop
 - Adam
  - Adamax
  - Nadam

# Drawbacks of Gradient Descent:
Consider the gradient descent update rule:

$w_{new} = w_{old} - \eta \frac{\partial c}{\partial w}|_{w=w_{old}}$


- **1st step**:

  initial weight = $w_0$

  $\therefore w_1 = w_0 - \eta \frac{\partial c}{\partial w_0}....(1)$


- **2nd step**:

  weight = $w_1$

  $\therefore w_2 = w_1 - \eta \frac{\partial c}{\partial w_1}....(2)$

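So here, the update in $(1)$ is larger than the update in $(2)$, because the slope $\frac{\partial c}{\partial w}$ usually gets smaller as the weight moves towards the minimum.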

### **Observation**:
 - $\Delta w \uparrow$ if $\frac{\partial c}{\partial w} \uparrow$: steeper slope (large update), where $\Delta w$ is the size of the weight update

 - $\Delta w \downarrow$ if $\frac{\partial c}{\partial w} \downarrow$: flatter slope (small update)

 - It can get stuck at a saddle point
 - It takes more time to converge
 - It doesn't depend on past weight updates or their accumulation
 - It is very slow for deep neural networks
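
The update rule and the shrinking step size can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming a hypothetical one-dimensional cost $c(w) = w^2$ (so $\frac{\partial c}{\partial w} = 2w$); the cost and the names `grad` and `gradient_descent` are not from these notes.

```python
def grad(w):
    """Gradient of the assumed toy cost c(w) = w**2."""
    return 2.0 * w


def gradient_descent(w0, eta=0.1, steps=5):
    """Run plain gradient descent and return every weight visited."""
    weights = [w0]
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)  # w_new = w_old - eta * dc/dw
        weights.append(w)
    return weights


weights = gradient_descent(w0=1.0)
# Size of each update |w_{k+1} - w_k|; it shrinks as the slope flattens.
updates = [abs(b - a) for a, b in zip(weights, weights[1:])]
print(weights)
print(updates)
```

On this toy cost every update is smaller than the previous one, which is why plain gradient descent slows down as it approaches the minimum.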







# Momentum based GD

It was proposed by Boris Polyak in 1964. The idea comes from physics: when a ball rolls down a hill, it starts slowly but picks up momentum until it reaches terminal velocity. In physics,
$$\text{momentum} = \text{mass} \times \text{velocity}$$
$$p = m \cdot v$$

- terminal velocity => maximum attainable velocity during a free fall through a fluid such as air.

The ball reaches terminal velocity by accumulating past changes in velocity. Our algorithm works similarly: past gradients are accumulated and used to accelerate gradient descent. We introduce a new term, the momentum ($m$):

$m \leftarrow \beta m+ \eta \bigtriangledown_\theta J(\theta)$

$\theta \leftarrow \theta - m$

or

$\theta \leftarrow \theta - (\beta m + \eta \bigtriangledown_\theta J(\theta))$

$m= \beta m +\eta \frac{\partial c}{\partial w}|_{w=w}$   [Assumption bias = 0]

$\therefore w= w-m$

Here $\beta$ is the coefficient of the momentum term; it plays the role of friction. We usually keep $\beta = 0.9$, a value found experimentally to work well, but it can be tuned.
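
The two-line update above can be sketched in plain Python. This is a minimal illustration under assumptions: the toy cost $c(w) = w^2$ and the names `grad` and `momentum_gd` are hypothetical and not part of these notes.

```python
def grad(w):
    """Gradient of the assumed toy cost c(w) = w**2."""
    return 2.0 * w


def momentum_gd(w0, eta=0.1, beta=0.9, steps=3):
    """Momentum-based gradient descent; returns (m, w) after each step."""
    w, m = w0, 0.0  # the momentum starts at zero
    history = []
    for _ in range(steps):
        m = beta * m + eta * grad(w)  # accumulate past gradients
        w = w - m                     # move by the accumulated momentum
        history.append((m, w))
    return history


history = momentum_gd(w0=1.0)
for step, (m, w) in enumerate(history, start=1):
    print(step, m, w)
```

Calling `momentum_gd(1.0, beta=0.0)` reduces to plain gradient descent, because no past momentum is carried over.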

### Observation:
 ### **step 1**:-

 initially, $w= w_0, m_0=0, \beta= 0.9$

 so,

 $m_1 = \beta m_0 + \eta \frac{\partial c}{\partial w}|_{w=w_0}$

 =>$m_1 =\eta \frac{\partial c}{\partial w}|_{w=w_0}$

 $w_1 = w_0 -m_1$

 $\therefore w_1= w_0 -\eta \frac{\partial c}{\partial w}|_{w=w_0} \rightarrow$ similar to GD


 ### **step 2:-**

 $m_1 =\eta \frac{\partial c}{\partial w}|_{w=w_0}, \beta = 0.9, w= w_1$

 so,

 $m_2 = \beta m_1 + \eta \frac{\partial c}{\partial w}|_{w=w_1}$

 => $m_2 = \beta \eta \frac{\partial c}{\partial w}|_{w=w_0} + \eta \frac{\partial c}{\partial w}|_{w=w_1}$

 =>$m_2 = \eta [\beta \frac{\partial c}{\partial w}|_{w=w_0} + \frac{\partial c}{\partial w}|_{w=w_1}]$

 $\therefore w_2 = w_1 - m_2$

 This time the weight update will be larger, provided both gradients point in the same direction.

 If $\beta = 0$, then it is just gradient descent.

 ### **step 3**:-

 $w = w_2, \beta = 0.9, m_2 = \eta [\beta \frac{\partial c}{\partial w}|_{w=w_0} + \frac{\partial c}{\partial w}|_{w=w_1}]$

 so,

 $m_3 = \beta m_2 + \eta \frac{\partial c}{\partial w}|_{w=w_2}$

 => $m_3 = \beta \eta [\beta \frac{\partial c}{\partial w}|_{w=w_0} + \frac{\partial c}{\partial w}|_{w=w_1}] + \eta \frac{\partial c}{\partial w}|_{w=w_2}$

 => $m_3 = \eta [\beta^2 \frac{\partial c}{\partial w}|_{w=w_0} + \beta \frac{\partial c}{\partial w}|_{w=w_1} + \frac{\partial c}{\partial w}|_{w=w_2}]$

 $\therefore w_3 = w_2 - m_3$

 This time the weight update ($m_3$) will be larger than the previous one ($m_2$), as long as the gradients keep pointing in the same direction.
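
The three steps show that the momentum is a weighted sum of all past gradients, with older gradients discounted by powers of $\beta$. The sketch below checks this numerically by comparing the recursive update with the unrolled sum $m_3 = \eta(\beta^2 g_0 + \beta g_1 + g_2)$. The toy cost $c(w) = w^2$ and all variable names are assumptions made for illustration.

```python
def grad(w):
    """Gradient of the assumed toy cost c(w) = w**2."""
    return 2.0 * w


eta, beta = 0.1, 0.9
w, m = 1.0, 0.0
grads = []  # gradients evaluated at w_0, w_1, w_2

for _ in range(3):
    g = grad(w)
    grads.append(g)
    m = beta * m + eta * g  # recursive form of the momentum
    w = w - m

# Unrolled form: the gradient from step k is weighted by beta**(2 - k).
n = len(grads)
m3_unrolled = eta * sum(beta ** (n - 1 - k) * g for k, g in enumerate(grads))
print(m, m3_unrolled)
```

Both values agree up to floating-point rounding, which matches the expression for $m_3$ derived in step 3.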


## Drawbacks of Momentum:
- One extra hyperparameter $\beta$, although $\beta = 0.9$ works fine in most cases.
- It oscillates when it gets close to a local or global minimum, because of the accumulated past momentum. Sometimes it overshoots.

## Advantages:
- Momentum helps in faster convergence
- Oscillation can also help to escape local minima.

## Keras implementation:




In [None]:
import tensorflow as tf
tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Nesterov Accelerated Gradient (NAG):

It was introduced by Yurii Nesterov in 1983. This is also known as Nesterov momentum optimization. It is generally faster than plain momentum optimization.

### What's new?
It calculates the gradient slightly ahead in the direction of the momentum, at $(\theta - \beta m)$ or $(w - \beta m)$.

### Algorithm:

$m \leftarrow \beta m + \eta \bigtriangledown_\theta J(\theta-\beta m)$

$\theta \leftarrow \theta - m$

$m = \beta m + \eta \frac{\partial c}{\partial w}|_{w= (w-\beta m)}$

$w= w-m$  [Assuming bias = 0]
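
A minimal Python sketch of this update follows. The only change from plain momentum is the point where the gradient is evaluated. The toy cost $c(w) = w^2$ and the names `grad`, `nesterov_gd` and `lookahead` are assumptions made for illustration.

```python
def grad(w):
    """Gradient of the assumed toy cost c(w) = w**2."""
    return 2.0 * w


def nesterov_gd(w0, eta=0.1, beta=0.9, steps=3):
    """Nesterov accelerated gradient; returns (m, w) after each step."""
    w, m = w0, 0.0
    history = []
    for _ in range(steps):
        lookahead = w - beta * m              # peek ahead along the momentum
        m = beta * m + eta * grad(lookahead)  # gradient taken at the look-ahead point
        w = w - m
        history.append((m, w))
    return history


history = nesterov_gd(w0=1.0)
for step, (m, w) in enumerate(history, start=1):
    print(step, m, w)
```

Because $m_0 = 0$, the first look-ahead point is $w_0$ itself, so the first step is the same as plain gradient descent.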


### step 1:-

initially, $m_0 = 0, w= w_0, \beta = 0.9$

so,

$m_1 = \beta m_0 + \eta \frac{\partial c}{\partial w}|_{w= (w_0-\beta m_0)} $

=> $m_1 =  \eta \frac{\partial c}{\partial w}|_{w= w_0} \rightarrow$ simple GD

$\therefore w_1 = w_0 - m_1$


### step 2:-

 $m_1 = \eta \frac{\partial c}{\partial w}|_{w= w_0}, w= w_1, \beta = 0.9$

 so,

 $m_2 = \beta m_1 + \eta \frac{\partial c}{\partial w}|_{w= (w_1-\beta m_1)}$

 =>$ m_2 = \beta  \eta \frac{\partial c}{\partial w}|_{w=w_0} + \eta \frac{\partial c}{\partial w}|_{w= (w_1-\beta m_1)}$

 => $m_2 = \eta [\beta \frac{\partial c}{\partial w}|_{w=w_0} + \frac{\partial c}{\partial w}|_{w= (w_1-\beta m_1)}]$

 $\therefore w_2 = w_1 - m_2$


### step 3:-


 $m_2 = \eta [\beta \frac{\partial c}{\partial w}|_{w=w_0} + \frac{\partial c}{\partial w}|_{w= (w_1-\beta m_1)}], w= w_2, \beta = 0.9$

 so,

 $m_3 = \beta m_2 + \eta \frac{\partial c}{\partial w}|_{w= (w_2-\beta m_2)}$

 => $m_3 = \eta [\beta^2 \frac{\partial c}{\partial w}|_{w=w_0} +\beta \frac{\partial c}{\partial w}|_{w= (w_1-\beta m_1)} + \frac{\partial c}{\partial w}|_{w= (w_2-\beta m_2)}]$

 $\therefore w_3 = w_2 - m_3$


## Advantages:
 - It is generally faster than momentum
 - Less oscillation
 - It gets close to the local or global minimum

## Disadvantages:
 - Extra hyperparameter $\beta$, although $\beta = 0.9$ works fine

## Keras implementation:



In [None]:
import tensorflow as tf
tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

# Adagrad (Adaptive Gradient):
Adagrad can handle the elongated bowl problem.

   <img src="https://www.holehouse.org/mlclass/17_Large_Scale_Machine_Learning_files/Image%20[8].png" width="600"
     height="300">

It tries to take very small (baby) steps to avoid this zigzag problem.

### Algorithm:

$S\leftarrow S + \bigtriangledown_\theta J(\theta) \otimes \bigtriangledown_\theta J(\theta)$

$\theta \leftarrow \theta - \eta \bigtriangledown_\theta J(\theta) \phi \sqrt{S+\epsilon}$

or,

$S = S + (\frac{\partial c}{\partial w} |_{w=w})^2$

$w = w - \eta \frac{\frac{\partial c}{\partial w}}{\sqrt{S+\epsilon }}$


here,

$\phi = $ element-wise division

$\epsilon = $ small constant that avoids division by zero, e.g. $\epsilon = 10^{-7}$

$S = $ scaling factor (accumulated squared gradients)

$\otimes = $ element-wise multiplication
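
The element-wise operations can be sketched with plain Python lists. This is only an illustration, assuming a hypothetical elongated-bowl cost $J(\theta) = \frac{1}{2}(\theta_1^2 + 100\,\theta_2^2)$, which is much steeper along $\theta_2$; the cost and the names `grad` and `adagrad_step` are not from these notes.

```python
import math


def grad(theta):
    """Gradient of the assumed cost J = 0.5 * (theta[0]**2 + 100 * theta[1]**2)."""
    return [theta[0], 100.0 * theta[1]]


def adagrad_step(theta, s, eta=0.1, eps=1e-7):
    """One Adagrad update; every operation is applied element by element."""
    g = grad(theta)
    # S <- S + g (*) g  (element-wise multiplication)
    s = [s_i + g_i * g_i for s_i, g_i in zip(s, g)]
    # theta <- theta - eta * g (/) sqrt(S + eps)  (element-wise division)
    theta = [t_i - eta * g_i / math.sqrt(s_i + eps)
             for t_i, g_i, s_i in zip(theta, g, s)]
    return theta, s


theta, s = [1.0, 1.0], [0.0, 0.0]
theta, s = adagrad_step(theta, s)
print(theta, s)
```

Although the raw gradient is 100 times larger along the second dimension, both coordinates move by roughly $\eta$ on the first step, because each gradient component is divided by its own accumulated magnitude.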


### Step 1:

$S_0 = 0$, $w= w_0$

$S_1 = S_0 + (\frac{\partial c}{\partial w} |_{w=w_0})^2$

> $=(\frac{\partial c}{\partial w} |_{w=w_0})^2$

$w_1 = w_0 - \eta \frac{\frac{\partial c}{\partial w} |_{w=w_0}}{\sqrt{S_1 + \epsilon }}$

$\therefore w_1 = w_0 - \eta \frac{\frac{\partial c}{\partial w} |_{w=w_0}}{\sqrt{(\frac{\partial c}{\partial w} |_{w=w_0})^2 + \epsilon }}$


### Step 2:

$S_1 = (\frac{\partial c}{\partial w} |_{w=w_0})^2$, $w=w_1$

$S_2 = S_1 + (\frac{\partial c}{\partial w} |_{w=w_1})^2$

>$= (\frac{\partial c}{\partial w} |_{w=w_0})^2 + (\frac{\partial c}{\partial w} |_{w=w_1})^2$

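$w_2 = w_1 - \eta \frac{\frac{\partial c}{\partial w} |_{w=w_1}}{\sqrt{S_2 + \epsilon }}$ $\rightarrow$ the denominator is (up to $\epsilon$) the $L_2$ norm of the past gradients


This effectively decays the learning rate, because $S$ keeps growing at every step.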
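
The decay can be made visible by printing the effective learning rate $\frac{\eta}{\sqrt{S + \epsilon}}$ at each step. The sketch below assumes the toy cost $c(w) = w^2$; the cost and the variable names are hypothetical and only serve the illustration.

```python
import math


def grad(w):
    """Gradient of the assumed toy cost c(w) = w**2."""
    return 2.0 * w


eta, eps = 0.1, 1e-7
w, s = 1.0, 0.0
effective_rates = []

for _ in range(5):
    g = grad(w)
    s = s + g * g                   # S only ever grows
    rate = eta / math.sqrt(s + eps)  # effective learning rate at this step
    effective_rates.append(rate)
    w = w - rate * g

print(effective_rates)
```

Each entry is smaller than the one before it, since a non-zero squared gradient is added to $S$ at every step and nothing is ever removed.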



## Observation:
It corrects the direction by scaling down the gradient vector along the steepest dimensions.

## Advantages:
 - It corrects the direction early on
 - Less tuning of the learning rate

## Disadvantages:
- It can stop early, before reaching the global minimum (saddle point)
- It takes longer to converge due to the decaying learning rate

## Notes: Adagrad is generally not recommended.

# RMSprop:
It was introduced by Geoffrey Hinton et al. to solve the early stopping of Adagrad by accumulating only the gradients from recent iterations, using exponential decay.

### Algorithm:


$S\leftarrow \beta S + (1-\beta)\bigtriangledown_\theta J(\theta) \otimes \bigtriangledown_\theta J(\theta)$

$\theta \leftarrow \theta - \eta \bigtriangledown_\theta J(\theta) \phi \sqrt{S+\epsilon}$

or,

$S = \beta S + (1-\beta)(\frac{\partial c}{\partial w} |_{w=w})^2$

$w = w - \eta \frac{\frac{\partial c}{\partial w}}{\sqrt{S+\epsilon }}$


Here, $\beta = 0.9$ works well
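
A minimal Python sketch of the scalar form of this update follows. The toy cost $c(w) = w^2$ and the names `grad` and `rmsprop` are assumptions made for illustration, not part of these notes.

```python
import math


def grad(w):
    """Gradient of the assumed toy cost c(w) = w**2."""
    return 2.0 * w


def rmsprop(w0, eta=0.01, beta=0.9, eps=1e-7, steps=3):
    """RMSprop on a single weight; returns (S, w) after each step."""
    w, s = w0, 0.0
    history = []
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1.0 - beta) * g * g  # exponentially decaying average
        w = w - eta * g / math.sqrt(s + eps)
        history.append((s, w))
    return history


history = rmsprop(w0=1.0)
for step, (s, w) in enumerate(history, start=1):
    print(step, s, w)
```

The only difference from Adagrad is the accumulator: old squared gradients are multiplied by $\beta$ at every step, so their influence fades instead of piling up.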

#### Note:
  - Adagrad - the learning rate decays fast
  - RMSprop - the learning rate decays slowly
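
The difference between the two accumulators is easiest to see with a constant gradient. The sketch below is a hypothetical experiment (the constant gradient `g = 2.0` and the 100 iterations are assumed values) that feeds the same gradient into both update rules.

```python
g = 2.0      # assumed constant gradient, for illustration only
beta = 0.9

s_adagrad = 0.0
s_rmsprop = 0.0
for _ in range(100):
    s_adagrad = s_adagrad + g * g                         # Adagrad: plain sum
    s_rmsprop = beta * s_rmsprop + (1.0 - beta) * g * g   # RMSprop: decaying average

print(s_adagrad, s_rmsprop)
```

The Adagrad accumulator keeps growing with the number of steps, while the RMSprop accumulator levels off near $g^2$, so the RMSprop step size does not shrink towards zero.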

### Observation:
- It was one of the best optimizers before Adam came along
- $\beta = 0.9$ usually works well, but it is still an extra hyperparameter to tune
- It solves the early stopping problem of Adagrad

In [None]:
import tensorflow as tf
tf.keras.optimizers.RMSprop(learning_rate=0.01, rho=0.9)

# Adam Optimizer:
Adam stands for adaptive moment estimation. It combines the ideas of momentum, RMSprop, and exponential decay.

### Algorithm:

1. $m\leftarrow \beta_1 m + (1-\beta_1)\bigtriangledown_\theta J(\theta) $
2. $S\leftarrow \beta_2 S + (1-\beta_2)\bigtriangledown_\theta J(\theta) \otimes \bigtriangledown_\theta J(\theta)$
3. $\hat{m} \leftarrow \frac{m}{1-\beta_1^t}$
4. $\hat{S} \leftarrow \frac{S}{1-\beta_2^t}$
5. $\theta \leftarrow \theta - \eta \hat{m} \phi \sqrt{\hat{S}+\epsilon }$

here,

$t=$ iteration step

$\hat{m} = $ bias-corrected $m$ ($m$ starts at 0, so it is biased towards 0 in early iterations)

$\hat{S} = $ bias-corrected $S$ (for the same reason)


or,

1. $m = \beta_1m + (1-\beta_1) \frac{\partial c}{\partial w}|_w$
2. $S = \beta_2S + (1-\beta_2) (\frac{\partial c}{\partial w}|_w)^2$
3. $\hat{m} = \frac{m}{1-\beta_1^t}$
4. $\hat{S} = \frac{S}{1-\beta_2^t}$
5. $w= w - \eta \frac{\hat{m}}{\sqrt{\hat{S}+\epsilon }}$
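
The five steps can be sketched in plain Python for a single weight. This is a minimal illustration of the standard Adam update under assumptions: the toy cost $c(w) = w^2$, the value `eps=1e-7` and the names `grad` and `adam` are hypothetical and not taken from these notes.

```python
import math


def grad(w):
    """Gradient of the assumed toy cost c(w) = w**2."""
    return 2.0 * w


def adam(w0, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-7, steps=3):
    """Adam on a single weight; returns (m_hat, s_hat, w) after each step."""
    w, m, s = w0, 0.0, 0.0
    history = []
    for t in range(1, steps + 1):            # t starts at 1 for the bias correction
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g      # 1. decaying average of gradients
        s = beta2 * s + (1.0 - beta2) * g * g  # 2. decaying average of squared gradients
        m_hat = m / (1.0 - beta1 ** t)         # 3. bias-corrected m
        s_hat = s / (1.0 - beta2 ** t)         # 4. bias-corrected S
        w = w - eta * m_hat / math.sqrt(s_hat + eps)  # 5. weight update
        history.append((m_hat, s_hat, w))
    return history


history = adam(w0=1.0)
for step, (m_hat, s_hat, w) in enumerate(history, start=1):
    print(step, m_hat, s_hat, w)
```

At $t = 1$ the bias correction exactly undoes the $(1-\beta_1)$ and $(1-\beta_2)$ factors, so $\hat{m}$ equals the gradient and $\hat{S}$ equals the squared gradient, and the first step has a size of about $\eta$.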


For more information, please refer to the paper:
[Adam](https://arxiv.org/pdf/1412.6980.pdf)

In [None]:
import tensorflow as tf
tf.keras.optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999)