# Optimizers

## Batch GD
- Computes the gradient over the entire training set for every update, which is slow and impractical for large datasets.  

## SGD
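- Updates the parameters using one training example at a time. 
- Much faster per update than batch GD. 
- Converges almost certainly to a local minimum (or the global minimum for convex losses) when the learning rate (LR) is decreased slowly. 
- The high variance of its updates can help it escape local minima, but also makes convergence noisy.  
- Annealing (gradually decreasing the LR) helps stabilize convergence.  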

## Minibatch SGD
- Updates on a small batch of examples instead of a single example or the full dataset.
- Reduces the variance of the updates, which gives more stable convergence.
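
The three variants above differ only in how many examples feed each update. A minimal sketch in NumPy, assuming a linear model with a squared-error loss; the function name `minibatch_sgd`, the synthetic data, and the `lr0 / (1 + decay * step)` annealing schedule are illustrative choices, not something these notes prescribe:

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr0=0.1, decay=0.01, epochs=50, seed=0):
    """Fit y ~ X @ w by gradient descent on batches of `batch_size` examples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    step = 0
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            grad = X[idx].T @ err / len(idx)       # gradient of 0.5 * mean squared error on the batch
            lr = lr0 / (1.0 + decay * step)        # annealing: LR shrinks as training proceeds
            w -= lr * grad
            step += 1
    return w

# Synthetic regression problem (hypothetical data, for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=512)

w_sgd = minibatch_sgd(X, y, batch_size=1)      # SGD: one example per update
w_mini = minibatch_sgd(X, y, batch_size=32)    # minibatch SGD
w_batch = minibatch_sgd(X, y, batch_size=512)  # batch GD: whole dataset per update
```

With `batch_size=512` each epoch performs a single update, which is why batch GD makes the least progress for the same number of passes over the data.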

## Challenges
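- Choosing a good LR is difficult. LR schedules help, but they must be defined in advance and cannot adapt to the data, which motivates adaptive LR methods.
- [Saddle points](https://arxiv.org/abs/1406.2572): points where one dimension slopes up and another slopes down. They are usually surrounded by a plateau of the same error, which makes them notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.  
- Navigating ravines: areas where the surface curves much more steeply in one dimension than in another, common around local optima. SGD oscillates across the slopes of a ravine and makes slow progress along the bottom. 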

## Notes: 
- **Avoiding Saddle points** - start with random weight initializations. 
![image.png](images/saddle_point.png)
- **Avoiding Plateaux** - use leaky ReLU.
- **Avoiding Ravines**
    - Normalize hidden units to zero mean and unit variance, e.g. with batch normalization (batchnorm).
    - An area of research known as second-order optimization develops algorithms which explicitly use curvature information, but these are complicated and difficult to scale to large neural nets and large datasets.
    - There is an optimization procedure called Adam which uses just a little bit of curvature information and often works much better than gradient descent. It’s available in all the major neural net frameworks.  

![image.png](images/ravines.png)

### Momentum 
- Momentum accumulates a decaying sum of past gradients, which dampens oscillations across a ravine and speeds up progress along it. 
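
A small sketch of the momentum update on a toy ravine may help. The quadratic `f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)`, the starting point, and the values of `lr` and `gamma` are assumptions picked for illustration:

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2): shallow along w[0], steep along w[1]
    return np.array([1.0, 100.0]) * w

def run(lr=0.018, gamma=0.0, steps=100):
    w = np.array([10.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v + lr * grad(w)   # velocity: decaying sum of past (LR-scaled) gradients
        w = w - v
    return w

w_plain = run(gamma=0.0)   # gamma = 0 reduces to plain gradient descent
w_mom = run(gamma=0.9)     # momentum with the commonly used gamma = 0.9
```

On this toy problem the momentum run ends much closer to the minimum at the origin, because the velocity keeps building along the shallow `w[0]` direction while the alternating gradients along `w[1]` partly cancel.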

### Adagrad, Adadelta, RMSprop
- Use the history of past squared gradients to compute an adaptive learning rate for each parameter.  
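
As a rough sketch, the per-parameter scaling can be written as follows for Adagrad and RMSprop (Adadelta is omitted for brevity). The function names and default hyperparameters are illustrative assumptions:

```python
import numpy as np

def adagrad_step(w, g, cache, lr=0.5, eps=1e-8):
    cache = cache + g**2                        # running SUM of squared gradients
    w = w - lr * g / (np.sqrt(cache) + eps)     # effective LR only ever shrinks
    return w, cache

def rmsprop_step(w, g, avg_sq, lr=0.01, rho=0.9, eps=1e-8):
    avg_sq = rho * avg_sq + (1 - rho) * g**2    # decaying AVERAGE of squared gradients
    w = w - lr * g / (np.sqrt(avg_sq) + eps)
    return w, avg_sq

w0 = np.array([10.0, 1.0])
g0 = np.array([10.0, 100.0])                    # gradients differing by 10x in scale
w_ada, cache = adagrad_step(w0, g0, np.zeros(2))
w_rms, avg_sq = rmsprop_step(w0, g0, np.zeros(2))
```

In both cases the first step moves the two parameters by the same amount even though their gradients differ by a factor of ten, which is the sense in which the learning rate is adapted per parameter.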

### Adam 
- Can be seen as a combination of RMSprop (an exponentially decaying average of past squared gradients, v_t) and momentum (an exponentially decaying average of past gradients, m_t).
- m_t and v_t are estimates of the mean and the uncentered variance of the gradients. 
- Uses these estimates to compute an adaptive LR for each parameter.  
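
The update can be sketched in a few lines of NumPy. The defaults below are the commonly quoted ones (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`); the function name `adam_step` and the example values are assumptions for illustration:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # m_t: decaying average of gradients (momentum part)
    v = beta2 * v + (1 - beta2) * g**2       # v_t: decaying average of squared gradients (RMSprop part)
    m_hat = m / (1 - beta1**t)               # bias correction, since m and v start at zero
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w0 = np.array([10.0, 1.0])
g0 = np.array([10.0, 100.0])
# t starts at 1 so that the bias-correction denominators are non-zero
w1, m1, v1 = adam_step(w0, g0, np.zeros(2), np.zeros(2), t=1)
```

On the first step the bias-corrected ratio `m_hat / sqrt(v_hat)` is roughly the sign of the gradient, so each parameter moves by about `lr` regardless of gradient scale.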

### Adamax
- Replaces the L2 norm in the v_t update with the L-infinity norm (a max over past gradient magnitudes), which converges to a more stable value. 
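
A sketch of the resulting update, under the assumption that the max-based scale `u` replaces Adam's `sqrt(v_hat)`. The name `adamax_step` is illustrative, and the small `eps` is added here only to guard against division by zero:

```python
import numpy as np

def adamax_step(w, g, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))     # infinity-norm scale: decayed max of |g|
    # Only m needs bias correction; u is not biased towards zero the way v is in Adam
    w = w - (lr / (1 - beta1**t)) * m / (u + eps)
    return w, m, u

w0 = np.array([10.0, 1.0])
g0 = np.array([10.0, 100.0])
w1, m1, u1 = adamax_step(w0, g0, np.zeros(2), np.zeros(2), t=1)
```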

### Nadam
- Combines Nesterov accelerated gradient (NAG) with Adam. 
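
One way to write this, following the simplified form given in the overview listed under References (ruder.io), is to add a look-ahead term built from the current gradient to Adam's bias-corrected momentum. This is a sketch under that assumption; `nadam_step` and its defaults are illustrative:

```python
import numpy as np

def nadam_step(w, g, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Nesterov-style look-ahead: blend corrected momentum with the current gradient
    lookahead = beta1 * m_hat + (1 - beta1) * g / (1 - beta1**t)
    w = w - lr * lookahead / (np.sqrt(v_hat) + eps)
    return w, m, v

w0 = np.array([10.0, 1.0])
g0 = np.array([10.0, 100.0])
w1, m1, v1 = nadam_step(w0, g0, np.zeros(2), np.zeros(2), t=1)
```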

### AMSgrad
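- Argues that the exponential moving average of past squared gradients is a reason adaptive learning rate methods can converge to poor solutions and generalize badly: rare, informative large gradients are forgotten too quickly.  
- Instead of v_t, normalizes the update with the running maximum max(v_t-1, v_t), so the per-parameter step size never increases. 
- In practice, it did not show much improvement over Adam. 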
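
The max trick can be sketched as below. This version omits bias correction, as in the overview listed under References; the name `amsgrad_step` and the two example gradients are assumptions for illustration:

```python
import numpy as np

def amsgrad_step(w, g, m, v, v_hat, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_hat = np.maximum(v_hat, v)             # keep the largest v seen so far
    w = w - lr * m / (np.sqrt(v_hat) + eps)
    return w, m, v, v_hat

w = np.array([1.0])
zeros = np.zeros(1)
# A large gradient followed by a small one: v decays, v_hat does not
w, m, v1, v_hat1 = amsgrad_step(w, np.array([10.0]), zeros, zeros, zeros)
w, m, v2, v_hat2 = amsgrad_step(w, np.array([0.1]), m, v1, v_hat1)
```

After the second call `v2` has decayed below `v1`, while `v_hat2` still holds the earlier, larger value.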

### AdamW
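- L2 regularization and weight decay are not equivalent for Adam. An L2 term added to the loss enters the gradient (g_t + wt_decay*w) and is then rescaled by the adaptive LR, which makes it less effective. 
- AdamW decouples the two: the decay term is kept out of both the loss and the gradient. 
- It is applied directly to the weights in the update step: w = w - lr*adam_update - lr*wt_decay*w, i.e. the weights also shrink by a factor (1 - lr*wt_decay) at each step. 
- Often generalizes better than Adam with L2 regularization. 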
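
To make the difference concrete, here is a sketch contrasting Adam with an L2 term folded into the gradient against the decoupled weight decay described in the AdamW paper, where the decay acts on the weights themselves and bypasses the adaptive scaling. The function names, `lr`, and `wd` values are illustrative assumptions:

```python
import numpy as np

def _adam_update(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_l2_step(w, g, m, v, t, lr=0.1, wd=0.1):
    # L2 regularization: the decay term goes through Adam's adaptive scaling
    update, m, v = _adam_update(g + wd * w, m, v, t)
    return w - lr * update, m, v

def adamw_step(w, g, m, v, t, lr=0.1, wd=0.1):
    # Decoupled weight decay: the decay term is applied to the weights directly
    update, m, v = _adam_update(g, m, v, t)
    return w - lr * update - lr * wd * w, m, v

w0 = np.array([1.0, -2.0])
g0 = np.zeros(2)                             # zero loss gradient isolates the decay effect
w_l2, _, _ = adam_l2_step(w0, g0, np.zeros(2), np.zeros(2), t=1)
w_dec, _, _ = adamw_step(w0, g0, np.zeros(2), np.zeros(2), t=1)
```

With a zero loss gradient, the decoupled version shrinks every weight by the same factor `(1 - lr * wd)`, whereas the L2 version moves every weight by about `lr` towards zero regardless of its size, because the adaptive scaling normalizes the decay term away.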



### References
- http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec7.pdf
- http://ruder.io/optimizing-gradient-descent/
- https://arxiv.org/abs/1406.2572
