# Intuition about RMSprop and Adam optimizations

RMSprop and Adam are two well-known optimization algorithms that are widely used in deep learning. Here I give some intuition about them.  
I begin with gradient descent with momentum, which Adam builds on. For simplicity, I only consider updating one parameter.

## 1. Gradient descent with momentum
At iteration $t$, gradient descent with momentum updates the weight $w$ as follows:
\begin{align} 
 v_{dw}(t) &= \beta_1 v_{dw}(t-1) + (1-\beta_1)dw \\
w &= w - \alpha v_{dw}(t),
\end{align}
where $dw$ is the derivative of the loss $J$ with respect to the weight $w$ at this iteration, and $\beta_1$ and $\alpha$ (the learning rate) are two hyperparameters.

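This algorithm yields a very small $v_{dw}(t)$ when $dw$ oscillates, because gradients of alternating sign largely cancel in the exponentially weighted average. This damps out the oscillation. Along a non-oscillating direction, the gradients keep the same sign, so $w$ is updated normally.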
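
To make the damping concrete, here is a minimal Python sketch that computes $v_{dw}(t)$ for a gradient that flips sign every iteration and for a gradient that stays constant. The function name `momentum_velocity`, the value $\beta_1 = 0.9$, the initialization $v_{dw}(0) = 0$, and the two toy gradient sequences are my own assumptions for illustration.

```python
def momentum_velocity(grads, beta1=0.9):
    """Return v_dw(t) for each gradient in grads, assuming v_dw(0) = 0."""
    v = 0.0
    history = []
    for dw in grads:
        # Exponentially weighted average of the gradient.
        v = beta1 * v + (1 - beta1) * dw
        history.append(v)
    return history


# Toy gradients (assumed): one flips sign every step, one is constant.
oscillating = [1.0 if t % 2 == 0 else -1.0 for t in range(50)]
steady = [1.0] * 50

v_osc = momentum_velocity(oscillating)
v_steady = momentum_velocity(steady)

print("oscillating:", abs(v_osc[-1]))
print("steady:     ", v_steady[-1])
```

For the alternating gradient, $|v_{dw}|$ settles near $(1-\beta_1)/(1+\beta_1) \approx 0.05$, while for the constant gradient it approaches the gradient itself, so the oscillating direction takes much smaller steps.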

## 2. RMSprop (Root Mean Square prop)
In this algorithm, at iteration $t$, the weight $w$ is updated as follows:
\begin{align} 
 s_{dw}(t) &= \beta_2 s_{dw}(t-1) + (1-\beta_2)dw^2 \\
w &= w - {\alpha \over \sqrt{s_{dw}(t)} } dw,
\end{align}
where $dw$ and $\alpha$ are the same as before and $\beta_2$ is a hyperparameter.

Consider two directions corresponding to two parameters, $w_1$ and $w_2$, where the $w_1$ direction is steep and the $w_2$ direction is flat. We would like a smaller learning rate along the steep direction, $w_1$, to avoid overshooting, and a larger learning rate along the flat direction, $w_2$, to converge faster.  
This is what RMSprop implements. Since the gradient is larger along the steep direction, $s_{dw_1} > s_{dw_2}$. 
Effectively, RMSprop updates $w_1$ with a smaller learning rate, $\alpha_1 \equiv {\alpha \over \sqrt{s_{dw_1}}}$, than $w_2$, whose effective learning rate is $\alpha_2 \equiv {\alpha \over \sqrt{s_{dw_2}}}$, as we wished.
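
The effective learning rates $\alpha_1$ and $\alpha_2$ can be checked numerically with a small sketch. The helper `rmsprop_effective_lr`, the hyperparameter values, and the constant toy gradients (10 along the steep direction, 0.1 along the flat one) are assumptions I picked for illustration. Practical implementations usually also add a small constant to the denominator to avoid division by zero, which I leave out here to match the formula above.

```python
import math


def rmsprop_effective_lr(grads, alpha=0.01, beta2=0.999):
    """Return alpha / sqrt(s_dw) after processing grads, assuming s_dw(0) = 0."""
    s = 0.0
    for dw in grads:
        # Exponentially weighted average of the squared gradient.
        s = beta2 * s + (1 - beta2) * dw ** 2
    return alpha / math.sqrt(s)


# Toy gradients (assumed): large along w_1 (steep), small along w_2 (flat).
steep_grads = [10.0] * 100
flat_grads = [0.1] * 100

alpha_1 = rmsprop_effective_lr(steep_grads)
alpha_2 = rmsprop_effective_lr(flat_grads)

print("alpha_1 (steep):", alpha_1)
print("alpha_2 (flat): ", alpha_2)
```

Because the gradients differ by a factor of 100, so do the effective learning rates: the flat direction gets a rate 100 times larger than the steep one.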

## 3. Adam optimization algorithm


Gradient descent with momentum is efficient at damping out oscillations, but it cannot adjust the learning rate along different directions according to their steepness. 
On the other hand, RMSprop adjusts the learning rate automatically, but it is not specifically designed to damp oscillations. For example, if a parameter is oscillating along a flat (plateau) direction, it is difficult for RMSprop to damp out the oscillation. 

By combining the two algorithms, the Adam optimization algorithm can damp out oscillations and, at the same time, adjust the learning rate according to the steepness!

In the Adam algorithm, at the iteration $t$, the weight $w$ is updated as follows:
\begin{align} 
 v_{dw}(t) &= \beta_1 v_{dw}(t-1) + (1-\beta_1)dw \\
 s_{dw}(t) &= \beta_2 s_{dw}(t-1) + (1-\beta_2)dw^2 \\
 \hat{v}_{dw}(t) &= {v_{dw}(t) \over 1 - \beta_1^t} \\
 \hat{s}_{dw}(t) &= {s_{dw}(t) \over 1 - \beta_2^t} \\
w &= w - {\alpha \over \sqrt{\hat{s}_{dw}(t)} } \hat{v}_{dw}(t),
\end{align}
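This is almost the combination of the first two algorithms, except for the extra factors $1/( 1 - \beta_1^t)$ and $1/( 1 - \beta_2^t)$. These factors serve as bias corrections: when the averages start from zero, they are biased toward zero during the first several iterations, and dividing by $1 - \beta^t$ gives better exponentially weighted averages there. As $t$ grows, $\beta^t \to 0$ and the correction fades away.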
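
The effect of the bias correction is easy to see on a constant gradient. The sketch below is only an illustration under my own assumptions: the helper `ewa`, the value $\beta = 0.9$, the zero initialization, and the constant gradient $dw = 1$ are not from any particular library. It also assumes the common formulation in which the correction is applied to the value that is read out, while the uncorrected average is the one carried to the next iteration.

```python
def ewa(grads, beta, bias_correction):
    """Exponentially weighted average of grads, starting from zero."""
    avg = 0.0
    out = []
    for t, dw in enumerate(grads, start=1):
        avg = beta * avg + (1 - beta) * dw
        # The correction rescales the output only; avg itself is not modified.
        out.append(avg / (1 - beta ** t) if bias_correction else avg)
    return out


grads = [1.0] * 5  # constant toy gradient (assumed)
raw = ewa(grads, beta=0.9, bias_correction=False)
corrected = ewa(grads, beta=0.9, bias_correction=True)

for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(t, r, c)
```

Without the correction, the first average is only $(1-\beta)\,dw = 0.1$ even though every gradient equals 1. With the correction, the average matches the true value of 1 from the first iteration on.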

To summarize the effect of each term:
- $v_{dw}$: damps out the oscillation by using the exponentially weighted average of the gradient.
- $s_{dw}$: adjusts the learning rate according to the steepness by dividing the learning rate by the square root of the exponentially weighted average of the squared gradient.
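
Putting the two terms together, here is a minimal, self-contained sketch of an Adam loop on a toy problem. Everything specific in it is my assumption for illustration: the function `adam_minimize`, the hyperparameter values, the small constant `eps` added to the denominator for numerical safety, and the toy loss $J = {1 \over 2}(100 w_1^2 + w_2^2)$ with one steep and one flat direction. The sketch uses the common formulation where the bias-corrected values are used for the update only and the uncorrected averages are carried to the next iteration.

```python
import math


def adam_minimize(grad_fn, w, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimal Adam loop over a list of parameters (illustrative sketch)."""
    w = list(w)
    v = [0.0] * len(w)  # exponentially weighted average of dw
    s = [0.0] * len(w)  # exponentially weighted average of dw^2
    for t in range(1, steps + 1):
        dw = grad_fn(w)
        for i in range(len(w)):
            v[i] = beta1 * v[i] + (1 - beta1) * dw[i]
            s[i] = beta2 * s[i] + (1 - beta2) * dw[i] ** 2
            # Bias-corrected copies, used only for this update.
            v_hat = v[i] / (1 - beta1 ** t)
            s_hat = s[i] / (1 - beta2 ** t)
            # eps is an assumed safeguard against division by zero.
            w[i] -= alpha * v_hat / (math.sqrt(s_hat) + eps)
    return w


# Toy quadratic (assumed): steep along w_1, flat along w_2.
CURVATURE = [100.0, 1.0]


def loss(w):
    return 0.5 * sum(a * wi ** 2 for a, wi in zip(CURVATURE, w))


def grad(w):
    return [a * wi for a, wi in zip(CURVATURE, w)]


w0 = [1.0, 1.0]
w_final = adam_minimize(grad, w0)
initial_loss = loss(w0)
final_loss = loss(w_final)

print("initial loss:", initial_loss)
print("final loss:  ", final_loss)
print("final w:     ", w_final)
```

Note that dividing by $\sqrt{s_{dw}}$ makes the step nearly independent of the overall scale of the gradient, so in this toy problem $w_1$ and $w_2$ move toward the minimum at almost the same pace even though their gradients differ by a factor of 100.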