# Optimizer

Neural networks and other machine learning algorithms use an optimizer to find proper weights $W$ for the model.

----
### Gradient Descent
Another way to estimate $W$ is optimization. Gradient Descent is the most basic first-order iterative optimization method.

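- GD (Gradient Descent) uses the partial derivatives of the cost function.
- The basic update rule is shown below ($\alpha$ is the learning rate).
- It is important to make the cost function convex: in a convex function every local minimum is a global minimum, so GD cannot get stuck in a poor local minimum.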

$$ \Theta \leftarrow \Theta - \alpha \frac{\partial L}{\partial \Theta} $$
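The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `gradient_descent`, the default values, and the toy cost $L(\Theta) = (\Theta - 3)^2$ are all assumptions chosen for the example.

```python
def gradient_descent(grad, theta, alpha=0.1, steps=100):
    """Repeat theta <- theta - alpha * dL/dtheta for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


# Hypothetical convex cost L(theta) = (theta - 3)^2, so dL/dtheta = 2 * (theta - 3).
theta_hat = gradient_descent(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(theta_hat)  # ends up very close to the minimizer 3.0
```

Because the toy cost is convex, the iterate moves steadily toward the single minimum; with a learning rate that is too large the same loop would overshoot and diverge.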

----
### Stochastic Gradient Descent
Gradient Descent is too slow for large datasets, because the cost is computed over every row of the dataset at each step. So we split the dataset into smaller pieces, called mini-batches.

- Gradient Descent on mini-batches is called SGD (Stochastic Gradient Descent).
- Each step can be relatively inaccurate, but it is much faster than GD.
- Two problems remain unsolved:
    - Direction of training
    - Step size of training (learning rate)
    

In [2]:
# pseudo code: one SGD step for weight i, using the gradient from the current mini-batch
weight[i] -= learning_rate * gradient[i]
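As a rough, runnable version of the pseudo code, the sketch below fits a one-variable linear model with mini-batch SGD. The model `y = w * x + b`, the mean-squared-error cost, the function name `sgd_linear_fit`, and the noise-free toy data are assumptions made for illustration only.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the rows in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the MSE computed on this mini-batch only.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [i / 10.0 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
print(w, b)  # should land near 2 and 1
```

Each update only looks at `batch_size` rows, which is what makes a single step cheap and also what makes it noisier than a full-batch GD step.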

----
### Focus on Direction of Training

These optimization methods focus on the direction of training.

#### Momentum

- Momentum here means the same as inertia.
- The weight update takes the previous update direction into account.
- The following equations describe this optimizer. $ V(t) $ is the momentum (velocity) term, built from the previous state $ V(t-1) $, and `m` is the momentum hyper-parameter (usually set to 0.9~0.95).

$$ V(t) = m * V(t-1) - \alpha*\frac{\partial Cost(W)}{\partial W} $$

$$ W(t+1) = W(t) + V(t) $$

In [None]:
# pseudo code: keep a velocity v per weight, then move the weight by that velocity
v[i] = m * v[i] - learning_rate * gradient[i]
weight[i] += v[i]
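The two momentum equations translate almost line by line into Python. The sketch below is only an illustration under assumed names and values (`momentum_descent`, a single scalar weight, and the toy cost $(w - 3)^2$).

```python
def momentum_descent(grad, weight, learning_rate=0.1, m=0.9, steps=300):
    """V(t) = m * V(t-1) - lr * gradient;  W(t+1) = W(t) + V(t)."""
    v = 0.0  # the velocity starts at zero
    for _ in range(steps):
        v = m * v - learning_rate * grad(weight)
        weight = weight + v
    return weight


# Hypothetical cost (w - 3)^2 with gradient 2 * (w - 3).
w_momentum = momentum_descent(lambda w: 2.0 * (w - 3.0), weight=0.0)
print(w_momentum)  # close to 3.0 after the oscillations die out
```

With `m = 0.9` the weight overshoots the minimum and oscillates around it before settling, which is the inertia described above.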

----
### Focus on Step Size of Training

These optimization methods focus on the step size of training.

#### Adagrad(ADAptive GRADient)

- Adjusts the learning rate (step size, $ \alpha $) for each weight at every update step.
- For weights whose gradients have been small so far, the effective learning rate stays large. In contrast, for weights whose gradients have been large, it shrinks quickly. This follows from the equations below, where the accumulated squared gradient $ G_i(t) $ sits in the denominator.

$$ W = \{w_1, w_2, ... w_n\} $$

$$ G_i(t) = G_i(t-1) + (\frac{\partial}{\partial w_i(t)}Cost(w_i(t)))^2 $$

$$ w_i(t+1) = w_i(t) - \alpha * \frac{1}{\sqrt{G_i(t) + \epsilon}} * \frac{\partial}{\partial w_i} Cost(w_i)$$

- To compute each feature's learning rate ($  \alpha * \frac{1}{\sqrt{G_i(t) + \epsilon}} $), the gradient must be tracked separately for each weight $ w_i $. Loosely (not an exact equation), think of a per-term cost $ \frac{1}{N}(y_i-\hat y_i) $ instead of the summed cost $ \frac{1}{N}\sum_{i=1}^{n} (y_i-\hat y_i) $.
- Adagrad is very useful when features have different frequencies in a sparse dataset `(e.g. Word2Vec)`
    - In Word2Vec, the frequency of each word (feature) varies widely.
    - So words that appear frequently in the data are updated many more times than infrequent words.
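- `Additional note` : In plain GD, each w has its own gradient but the same lr is applied to all of them. Adagrad applies a different lr to each w as well. Every previous step is accumulated into $ G_i(t) $, which affects the lr at each step, so in a sparse dataset a high-frequency feature is likely to have a large accumulated value and is therefore trained with a low learning rate. Conversely, a low-frequency feature is trained with a high learning rate, which matches intuition.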
----
- $ G_i(t) $ is, at time point $ t $, the sum of squared gradients over all steps so far.

$$ G_i(t) = G_i(t-1) + (\frac{\partial}{\partial w_i(t)}Cost(w_i(t)))^2 $$

$$ G_i(t) = \sum_{j=0}^{t} (\frac{\partial}{\partial w_i(j)} Cost(w_i(j)))^2 $$

- Because Adagrad accumulates squared gradients without ever discarding them, $ G_i(t) $ only grows, so the effective learning rate is destined to shrink toward zero. RMSProp overcomes this problem.
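To make the per-weight learning rate concrete, here is a small sketch of one Adagrad step, following the equations above with $ \epsilon $ inside the square root. The function name `adagrad_step` and the artificial gradient pattern (one frequent feature, one rare feature) are assumptions for illustration.

```python
import math


def adagrad_step(weights, grads, accum, alpha=0.1, eps=1e-8):
    """One Adagrad update; accum[i] is G_i, the running sum of squared gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        weights[i] -= alpha * g / math.sqrt(accum[i] + eps)
    return weights, accum


# Feature 0 receives a gradient on every step, feature 1 only on every 10th step.
weights, accum = [0.0, 0.0], [0.0, 0.0]
for step in range(100):
    grads = [1.0, 1.0 if step % 10 == 0 else 0.0]
    adagrad_step(weights, grads, accum)

# Effective learning rate alpha / sqrt(G_i + eps) for each feature.
effective_lr = [0.1 / math.sqrt(g + 1e-8) for g in accum]
print(accum)         # G grows much faster for the frequent feature
print(effective_lr)  # the rare feature keeps the larger learning rate
```

The frequent feature ends with a much larger $ G_i $ and therefore a smaller effective learning rate, and since $ G_i $ never decreases, that rate can only keep shrinking.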

#### RMSProp 

- Adagrad's $ G(t) $ can grow toward infinity, so RMSProp uses a moving average instead.
- RMSProp is almost the same as Adagrad, but it uses an `Exponential Moving Average`: recent gradients get the highest weight, while older gradients still have some influence.
- $ \gamma $ is sometimes called the `decay factor`.

$$ G_i(t) = \gamma * G_i(t-1) + (1-\gamma)*(\frac{\partial}{\partial w_i(t)}Cost(w_i(t)))^2 $$

$$ w_i(t+1) = w_i(t) - \alpha * \frac{1}{\sqrt{G_i(t) + \epsilon}} * \frac{\partial}{\partial w_i} Cost(w_i)$$
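The difference from Adagrad is easiest to see by feeding both accumulators the same constant gradient. The sketch below follows the two RMSProp equations above; the function name `rmsprop_step`, the default values, and the constant gradient of 1.0 are assumptions for illustration.

```python
import math


def rmsprop_step(weights, grads, avg_sq, alpha=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update; avg_sq[i] is the moving average G_i of squared gradients."""
    for i, g in enumerate(grads):
        avg_sq[i] = gamma * avg_sq[i] + (1.0 - gamma) * g * g
        weights[i] -= alpha * g / math.sqrt(avg_sq[i] + eps)
    return weights, avg_sq


weights, avg_sq = [0.0], [0.0]
adagrad_sum = 0.0  # what Adagrad would have accumulated for the same gradients
for _ in range(200):
    rmsprop_step(weights, [1.0], avg_sq)
    adagrad_sum += 1.0 ** 2

print(avg_sq[0])    # levels off near the squared gradient, 1.0
print(adagrad_sum)  # keeps growing with the number of steps
```

Since the moving average stays bounded, the effective step size does not decay toward zero the way it does with Adagrad.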

----
### Hybrid Method

#### Adam(ADAptive Moment estimation)

A hybrid of a step-size-focused optimizer and a direction-focused optimizer.

- Hybrid of Momentum and RMSProp.
    - Proper step size for each feature at each update (from RMSProp, using an exponential moving average of squared gradients)
    - Proper step direction in terms of momentum or inertia (from Momentum)

$$ M_i(t) = \beta_1 * M_i(t-1) + (1-\beta_1)*\frac{\partial Cost(w_i(t))}{\partial w_i(t)} $$

$$ V_i(t) = \beta_2 * V_i(t-1) + (1-\beta_2)(\frac{\partial}{\partial w_i(t)}Cost(w_i(t)))^2 $$

$$ \hat{M}_i(t) = \frac{M_i(t)}{1-\beta_1^t} $$

$$ \hat{V}_i(t) = \frac{V_i(t)}{1-\beta_2^t} $$

$$ w_i(t+1) = w_i(t) - \alpha * \frac{\hat{M}_i(t)}{\sqrt{\hat{V}_i(t) + \epsilon}}$$

- The update must use the corrected $ \hat{M}(t) $ and $ \hat{V}(t) $.
    - $ M(t) $ and $ V(t) $ are initialized to 0.
    - Because the moving averages start from zero, the estimates are biased toward zero, especially during the first steps.
    - The Adam authors correct this bias by dividing by $ 1-\beta^t $.
- These correction factors come out of taking the expected value of $ M(t) $ and $ V(t) $.
- The derivation of this equation is [here](https://arxiv.org/abs/1412.6980)
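The five Adam equations can be sketched for a single scalar weight as follows. This is an illustrative version under assumptions: the function name `adam_descent`, the toy cost $(w - 3)^2$, and the learning rate are made up for the example, and $ \epsilon $ is kept inside the square root to match the formula above (the Adam paper adds it outside the square root).

```python
import math


def adam_descent(grad, weight, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam for one scalar weight, following the equations above."""
    m, v = 0.0, 0.0  # both moving averages start at zero
    for t in range(1, steps + 1):
        g = grad(weight)
        m = beta1 * m + (1.0 - beta1) * g        # first moment (direction)
        v = beta2 * v + (1.0 - beta2) * g * g    # second moment (step size)
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        weight -= alpha * m_hat / math.sqrt(v_hat + eps)
    return weight


# Hypothetical cost (w - 3)^2 with gradient 2 * (w - 3).
w_adam = adam_descent(lambda w: 2.0 * (w - 3.0), weight=0.0)
print(w_adam)  # close to 3.0
```

At `t = 1` the corrections divide by `1 - beta1` and `1 - beta2`, which undoes the shrinkage caused by starting both averages at zero, so the very first step already has a size of about `alpha`.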

----

#### FTRL-Proximal (Follow The Regularized Leader)

In large-scale binary prediction tasks (e.g. logistic regression with extremely sparse features), full-batch gradient descent is not usable, so we have to use online, SGD-style algorithms such as Momentum, Adagrad, and Adam. But these SGD-style algorithms perform relatively poorly in this setting. FTRL-Proximal is a SOTA (state-of-the-art) algorithm for it. Its idea builds on SGD, Regularized Dual Averaging (RDA), and related methods.

- In sparse datasets, L1 regularization is usually preferred because it performs feature selection (it sets weights to exactly zero).
- The FTRL-Proximal update is given below.

$$ w_{t+1} = \arg\min_w \left( g_{1:t} \cdot w + \frac{1}{2} \sum_{s=1}^{t}\sigma_s \|w-w_s\|_2^2 + \lambda \|w\|_1 \right) $$

- In summary, it builds a stable model that follows the trajectory of the sub-gradients, minimizes the approximate loss, and prevents the model from changing too quickly through regularization and proximity.
- In general, FTRL is just an optimizer. So a predictor designed for sparse problems, such as FM (Factorization Machine), is generally better than LR + FTRL. But FTRL can be used as the optimizer for FM.
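For a single scalar weight, the argmin in the FTRL-Proximal equation has a closed form, which shows where the exact zeros come from. Expanding the squares gives an objective of the form $ z w + \frac{1}{2} S w^2 + \lambda |w| $ plus a constant, with $ z = g_{1:t} - \sum_s \sigma_s w_s $ and $ S = \sum_s \sigma_s $. The sketch below solves that one-dimensional problem; the function name `ftrl_proximal_1d` and the example numbers are assumptions for illustration, and this is not the full per-coordinate algorithm from the paper.

```python
def ftrl_proximal_1d(grad_sum, sigmas, past_ws, lam):
    """Solve argmin_w  grad_sum*w + 0.5*sum(sigma_s*(w - w_s)^2) + lam*|w|  for scalar w."""
    z = grad_sum - sum(s * w for s, w in zip(sigmas, past_ws))
    total_sigma = sum(sigmas)
    if abs(z) <= lam:
        return 0.0  # the L1 penalty wins: the weight is set to exactly zero
    sign = 1.0 if z > 0 else -1.0
    return -(z - sign * lam) / total_sigma


# Hypothetical history: two past weights with their sigma values.
sigmas, past_ws = [1.0, 0.5], [0.2, -0.1]
w_small = ftrl_proximal_1d(0.3, sigmas, past_ws, lam=0.5)
w_large = ftrl_proximal_1d(2.0, sigmas, past_ws, lam=0.5)
print(w_small)  # 0.0, the accumulated gradient is too weak to overcome the L1 term
print(w_large)  # a non-zero weight, shrunk toward zero by lam
```

A weight stays at exactly zero until the accumulated gradient signal `z` exceeds `lam` in magnitude, which is how the L1 term yields sparse models.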

----
#### references
- https://seamless.tistory.com/38
- http://incredible.ai/artificial-intelligence/2017/04/09/Optimizer-Adagrad/
- https://twinw.tistory.com/247
- https://arxiv.org/abs/1412.6980
- https://brunch.co.kr/@kakao-it/84#comment
- http://proceedings.mlr.press/v15/mcmahan11b/mcmahan11b.pdf
- https://dos-tacos.github.io/paper%20review/FTRL/