```{contents}
```

## Adam Optimizer

Adam (Adaptive Moment Estimation) combines the **key ideas of two earlier optimizers**:

1. **Momentum (from SGD with momentum)**

   * Smooths updates of weights and biases to reduce zigzag/noisy paths.
   * Uses **exponentially weighted moving average** (EWMA) of past gradients.

2. **RMSProp features**

   * Dynamic learning rate that adapts per parameter.
   * Normalizes very large or very small gradient updates so the effective step size neither explodes nor vanishes.
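
The per-parameter scaling can be illustrated with a minimal NumPy sketch. The function name `rmsprop_scaled_step`, the variable `sq_avg`, and the constants (`lr`, `beta`, `eps`) are assumptions chosen for illustration, not part of these notes. Two parameters receive gradients that differ by five orders of magnitude, and the step is divided by the square root of a running average of squared gradients.

```python
import numpy as np

def rmsprop_scaled_step(grad, sq_avg, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp-style scaling step (illustrative sketch).

    Returns the step to subtract from the parameters and the updated
    running average of squared gradients.
    """
    # EWMA of squared gradients, tracked separately for each parameter
    sq_avg = beta * sq_avg + (1 - beta) * grad**2
    # Divide by the root of that average: large gradients are scaled down,
    # small gradients are scaled up
    step = lr * grad / (np.sqrt(sq_avg) + eps)
    return step, sq_avg

# Hypothetical gradients: one very large, one very small
grad = np.array([100.0, 0.001])
sq_avg = np.zeros_like(grad)
for _ in range(50):
    step, sq_avg = rmsprop_scaled_step(grad, sq_avg)

print("raw gradients:", grad)
print("scaled steps: ", step)
```

Once the running average has warmed up, both parameters move by a step close to `lr` even though their raw gradients are very different, which is the sense in which the learning rate "adapts per parameter".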

---

### **Weight Update Formula**

For weights (w) and bias (b):

$$
w_t = w_{t-1} - \eta_t \cdot vdw_t
$$

$$
b_t = b_{t-1} - \eta_t \cdot vdb_t
$$

Where:

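* $\eta_t$ = **dynamic learning rate**: the base learning rate $\eta$ divided by $\sqrt{s_t} + \epsilon$, where $s_t$ is an EWMA of the squared gradients for that parameter (the RMSProp term) and $\epsilon$ is a small constant for numerical stability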
* $vdw_t$, $vdb_t$ = **smoothed gradients** via momentum:

$$
vdw_t = \beta \cdot vdw_{t-1} + (1-\beta) \cdot \frac{\partial L}{\partial w_{t-1}}
$$

$$
vdb_t = \beta \cdot vdb_{t-1} + (1-\beta) \cdot \frac{\partial L}{\partial b_{t-1}}
$$

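* $\beta$ = smoothing factor (e.g., 0.9)
* Exponentially weighted averages are used for **both the momentum term and the RMSProp-style dynamic learning rate**: one over the gradients, one over the squared gradients. Adam uses a separate smoothing factor for each, commonly written $\beta_1$ (e.g., 0.9) and $\beta_2$ (e.g., 0.999).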
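
As a rough illustration of the smoothing in the $vdw_t$ formula above, the sketch below applies the same EWMA recurrence to a stream of noisy gradients. The helper name `ewma_update` and the synthetic data (a true gradient of 1.0 plus Gaussian noise) are assumptions made up for this example.

```python
import numpy as np

def ewma_update(v_prev, grad, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * grad, as in the formula above."""
    return beta * v_prev + (1 - beta) * grad

# Synthetic noisy gradients around a hypothetical true value of 1.0
rng = np.random.default_rng(0)
noisy_grads = 1.0 + rng.normal(scale=0.5, size=200)

v = 0.0  # the EWMA starts at zero
smoothed = []
for g in noisy_grads:
    v = ewma_update(v, g)
    smoothed.append(v)
smoothed = np.array(smoothed)

# Compare the spread after a warm-up period of 50 steps
print("std of raw gradients:     ", noisy_grads[50:].std())
print("std of smoothed gradients:", smoothed[50:].std())
```

The smoothed sequence fluctuates much less than the raw gradients. Because `v` starts at zero, the first few smoothed values are also pulled toward zero, which is what the bias-correction step in Adam compensates for.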

---

### **How Adam Works**

1. **Momentum smoothing**: Past gradients influence current update → reduces oscillations.
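2. **Adaptive learning rate**: Parameters with consistently large gradients take smaller effective steps; parameters with small gradients take larger ones.
3. **Bias correction**: Because the EWMAs start at zero, the early averages are biased toward zero. Dividing each average by $1 - \beta^t$ compensates, so the initial steps are not underestimated.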
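
The three steps above can be put together in a short NumPy sketch. This is an illustration under stated assumptions rather than a reference implementation: the function name `adam_step`, the state names `v` (momentum average, the $vdw$ of these notes) and `s` (squared-gradient average), the toy loss $L(w) = (w - 3)^2$, and the learning rate of 0.05 are all chosen for the example. The defaults `beta1=0.9`, `beta2=0.999`, `eps=1e-8` follow commonly used values.

```python
import numpy as np

def adam_step(param, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (illustrative sketch). t is the 1-based step count."""
    # 1. Momentum smoothing: EWMA of gradients
    v = beta1 * v + (1 - beta1) * grad
    # 2. Adaptive learning rate: EWMA of squared gradients
    s = beta2 * s + (1 - beta2) * grad**2
    # 3. Bias correction: undo the pull toward zero from zero initialization
    v_hat = v / (1 - beta1**t)
    s_hat = s / (1 - beta2**t)
    # Parameter update with a per-parameter effective learning rate
    param = param - lr * v_hat / (np.sqrt(s_hat) + eps)
    return param, v, s

# Toy problem (assumed for illustration): minimize L(w) = (w - 3)^2
w = np.array([0.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)  # dL/dw
    w, v, s = adam_step(w, grad, v, s, t, lr=0.05)

print("w after training:", w)
```

One detail worth noting: at `t = 1` the bias-corrected averages reduce to `v_hat = grad` and `s_hat = grad**2`, so the very first step has magnitude close to `lr` regardless of how large the gradient is.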

---

### **Why Adam is Popular**

* Combines **momentum** with **RMSProp-style adaptive learning rates**
* Often converges faster than plain SGD in practice
* Works well for deep neural networks, CNNs, RNNs, and most other architectures
* Robust to noisy or sparse gradients
