# Adam (Adaptive Moment Estimation)


# Gradient Descent Update Rule

The main equation used by **Gradient Descent** to update the parameter **θ**, given a learning rate **η** and the gradient of the cost function **∇θJ(θ)**, is as follows:

$$\theta = \theta - \eta \nabla_{\theta} J(\theta)$$

The basic version of **Gradient Descent** computes the gradient for the cost function over the entire dataset.  
The most commonly used variation is **Mini-batch Gradient Descent**, which uses the same equation but calculates the gradients on one batch at a time.
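The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear model, the mean squared error loss, the synthetic data, and the names `minibatch_gradient_descent`, `batch_size`, and `epochs` are all assumptions chosen for the example.

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.1, batch_size=16, epochs=200, seed=0):
    """Fit linear weights theta by mini-batch gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle the data every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # MSE gradient on this batch only
            theta = theta - lr * grad                    # theta = theta - eta * grad
    return theta

# Hypothetical noiseless data generated from known weights.
rng = np.random.default_rng(42)
X = rng.normal(size=(128, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta

theta_hat = minibatch_gradient_descent(X, y)
print(theta_hat)
```

Setting `batch_size` to the number of samples recovers the basic full-dataset version.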


# Limitations of Basic Gradient Descent

This basic form of optimization comes with several flaws:

- The convergence of the optimization is **highly sensitive** to the learning rate **η**.  
  - A **small** learning rate leads to **very slow convergence**.  
  - A **large** learning rate often results in **divergence**.  

- It uses the **same learning rate** for all parameters, regardless of their role in the network, such as:  
  - Which layer they belong to  
  - Whether that layer is pre-trained or not  

- It is **prone to getting stuck in local minima**, a common issue in neural networks because their cost functions are **non-convex**.  

- Implementing **learning rate scheduling** (i.e., adapting **η** based on predefined schedules) is **not straightforward** and may become **ineffective** depending on the dataset characteristics.  
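
The first limitation in the list above, sensitivity to the learning rate, is easy to reproduce. The sketch below assumes a hypothetical one-dimensional cost $J(\theta) = \theta^2$ (gradient $2\theta$) and a helper named `run_gd`; neither comes from the original text.

```python
def run_gd(lr, steps=50, theta0=1.0):
    """Run plain gradient descent on J(theta) = theta**2 and return the final theta."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * theta          # dJ/dtheta for J = theta**2
        theta = theta - lr * grad   # each step multiplies theta by (1 - 2 * lr)
    return theta

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: theta after 50 steps = {run_gd(lr):.3e}")
```

Each step scales θ by $(1 - 2\eta)$, so a tiny rate barely moves it, a moderate rate shrinks it quickly, and any rate above 1 makes its magnitude grow at every step.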


# Optimization Tweaks

To overcome the limitations of basic Mini-batch Gradient Descent (GD), several tweaks and improvements have been introduced.

## **1. Weight Decay**  

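Weight Decay (WD) is a form of **regularization**. **L2 regularization** adds the sum of squared parameters to the loss function to penalize large weights; WD instead **adds a proportion of the weights** (i.e., **wd × θ**) directly to the gradient in the update step. For plain gradient descent this yields the same update as an L2 penalty of $\frac{wd}{2}\lVert\theta\rVert^2$, whose gradient is $wd \cdot \theta$; the difference is that the penalty itself is never computed.  

Because the squared weights are never summed into the loss, this technique helps **numerical stability**. The updated weight equation becomes: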

$$\theta = \theta - \eta (\nabla_{\theta} J(\theta) + wd \cdot \theta)$$
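
As a small sketch of this update (the function name `sgd_step_with_weight_decay` and the sample values are made up for illustration), note what happens when the gradient is zero: the weights still shrink by a factor of $(1 - \eta \cdot wd)$.

```python
import numpy as np

def sgd_step_with_weight_decay(theta, grad, lr=0.1, wd=0.01):
    """One update: theta = theta - lr * (grad + wd * theta)."""
    return theta - lr * (grad + wd * theta)

theta = np.array([1.0, -2.0])
zero_grad = np.zeros_like(theta)

# With a zero gradient, only the decay term acts on the weights.
shrunk = sgd_step_with_weight_decay(theta, zero_grad)
print(shrunk)
```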

---

## **2. Momentum**  

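Momentum is a **convergence acceleration** technique that helps GD navigate regions where the cost function is **much steeper in some directions than in others** (ravines, which are common around local optima). In such regions plain GD oscillates across the steep direction while making slow progress along the flat one; momentum damps these oscillations.  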

Momentum achieves this by adding to the gradient a fraction **β** (typically **0.9**) of the previous update applied to the weights. The weight update equations are:

$$ m_t = \beta m_{t-1} + \eta \nabla_{\theta} J(\theta)$$

$$ \theta_{t+1} = \theta_t - m_t $$

This helps smooth out updates and speeds up convergence, especially in deep learning scenarios.
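
The two momentum equations can be tried on a toy problem. The sketch below assumes a hypothetical quadratic cost that is 50 times steeper along one axis than the other; the names `CURVATURE`, `loss`, `grad`, and `run` are illustrative only. Setting `beta=0.0` reduces the update to plain gradient descent, which gives a baseline for comparison.

```python
import numpy as np

# Hypothetical cost: flat along theta[0], steep along theta[1].
CURVATURE = np.array([1.0, 50.0])

def loss(theta):
    return 0.5 * np.sum(CURVATURE * theta ** 2)

def grad(theta):
    return CURVATURE * theta

def run(beta, lr=0.02, steps=100):
    """Run the momentum update and return the final loss."""
    theta = np.array([1.0, 1.0])
    m = np.zeros_like(theta)
    for _ in range(steps):
        m = beta * m + lr * grad(theta)  # m_t = beta * m_{t-1} + eta * grad
        theta = theta - m                # theta_{t+1} = theta_t - m_t
    return loss(theta)

loss_plain = run(beta=0.0)     # no momentum: plain gradient descent
loss_momentum = run(beta=0.9)
print(f"plain GD: {loss_plain:.3e}, momentum: {loss_momentum:.3e}")
```

In this setup the learning rate is capped by the steep direction, so plain GD crawls along the flat one; the accumulated term `m` lets momentum cover more ground there in the same number of steps.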


# **Adam Optimizer**

**Adaptive Moment Estimation (Adam)** is an optimization algorithm that computes **adaptive learning rates** for each parameter individually.  

It keeps track of:  
- $v_t$: A vector holding the **exponentially decaying average** of previous **squared gradients**.  
- $m_t$: A vector holding the **exponentially decaying average** of previous **gradients** (similar to momentum).  

## **Mathematical Formulation**

The momentum term is updated as:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_{\theta} J(\theta)$$

The squared gradient term is updated as (the square is taken element-wise):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left(\nabla_{\theta} J(\theta)\right)^2$$

### **Bias Correction**
To prevent $m_t$ and $v_t$ from being biased toward zero during the first steps (both are initialized as zero vectors), the authors of Adam propose **bias correction**:

$$\hat{m_t} = \frac{m_t}{1 - \beta_1^t}$$

$$\hat{v_t} = \frac{v_t}{1 - \beta_2^t}$$
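
As a quick check of what the correction does, consider the first step. This worked sketch assumes the zero initialization $m_0 = v_0 = 0$ and writes $g_1$ (a shorthand not used elsewhere in this chapter) for the first gradient $\nabla_{\theta} J(\theta)$:

$$m_1 = (1 - \beta_1)\, g_1 \quad\Rightarrow\quad \hat{m_1} = \frac{(1 - \beta_1)\, g_1}{1 - \beta_1^1} = g_1$$

$$v_1 = (1 - \beta_2)\, g_1^2 \quad\Rightarrow\quad \hat{v_1} = \frac{(1 - \beta_2)\, g_1^2}{1 - \beta_2^1} = g_1^2$$

Without the correction, a value such as $\beta_1 = 0.9$ would shrink the first estimate to $0.1\, g_1$. As $t$ grows, $\beta_1^t$ and $\beta_2^t$ approach zero and the correction fades out.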

### **Final Update Equation**
The final weight update equation is:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v_t}} + \epsilon} \hat{m_t}$$

### **Common Hyperparameters**
- $\beta_1 = 0.9$  
- $\beta_2 = 0.999$  
- $\epsilon = 10^{-8}$  
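
Putting the equations of this section together, a single Adam step might look like the sketch below. The function name `adam_step`, its signature, and the toy cost $J(\theta) = \sum_i \theta_i^2$ are assumptions made for illustration; library implementations differ in details.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

grad = 2.0 * theta  # gradient of the hypothetical cost J = sum(theta**2)
theta_1, m, v = adam_step(theta, grad, m, v, t=1)
print(theta_1)
```

On the first step the corrected ratio $\hat{m_t} / \sqrt{\hat{v_t}}$ has magnitude close to 1 for every parameter, so each one moves by roughly `lr` toward lower cost, whatever the size of its gradient. This is the per-parameter adaptivity described at the start of the section.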


# **Visualization of Optimization Algorithms**

We will implement different **optimization algorithms** and apply them to a **simple optimization problem** using various learning rates.


### Learning Rate: $\eta = 0.1$

Using a **learning rate of 0.1**, the **loss** is evaluated at each iteration of the optimization algorithm.
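
The test problem behind the figures is not given in this chapter, so the sketch below substitutes a hypothetical one-dimensional cost $J(\theta) = \theta^2$ to show how a per-iteration loss curve can be recorded for each optimizer. All function names and the starting point are assumptions.

```python
import math

def loss(theta):
    return theta ** 2

def grad(theta):
    return 2.0 * theta

def run_gd(theta, lr, steps):
    history = [loss(theta)]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        history.append(loss(theta))
    return history

def run_momentum(theta, lr, steps, beta=0.9):
    m, history = 0.0, [loss(theta)]
    for _ in range(steps):
        m = beta * m + lr * grad(theta)
        theta = theta - m
        history.append(loss(theta))
    return history

def run_adam(theta, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v, history = 0.0, 0.0, [loss(theta)]
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
        theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(loss(theta))
    return history

# Same start point, learning rate, and step count for every optimizer.
histories = {
    "gd": run_gd(2.0, lr=0.1, steps=100),
    "momentum": run_momentum(2.0, lr=0.1, steps=100),
    "adam": run_adam(2.0, lr=0.1, steps=100),
}
for name, history in histories.items():
    print(f"{name}: initial loss {history[0]:.3f}, final loss {history[-1]:.3e}")
```

Each `history` list holds the loss before the first step followed by one value per iteration, which is the kind of series that can be plotted to compare optimizers.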

---
![Adam Optimizer Animation](https://dzlab.github.io/assets/2019/20190615-optimizers-animation-adam-1.png)


### When $\eta = 0.1$

---
![Adam Optimizer Animation](https://dzlab.github.io/assets/2019/20190615-optimizers-animation-adam-2.png)
