```{contents}
```

## SGD with Momentum

* In **mini-batch SGD**, each weight update uses the gradient computed on a small batch of data.
* Averaging over a batch makes the gradient less noisy than in pure SGD (one data point per update), but **some noise remains**.
* This noise makes the updates **zigzag** across the loss surface, which slows convergence toward a minimum.

---

### **SGD with Momentum**

Momentum is introduced to **smooth these zigzag updates**.

#### **Weight Update Formula**

Without momentum (standard SGD):
$$
w_t = w_{t-1} - \eta \frac{\partial L}{\partial w_{t-1}}
$$
Where:

* $w_t$ = new weight
* $w_{t-1}$ = previous weight
* $\eta$ = learning rate
* $L$ = loss function
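
As a minimal sketch of the plain update rule, the snippet below applies it to a toy one-dimensional loss. The loss $L(w) = (w - 3)^2$, the learning rate `lr=0.1`, and the names `sgd_step` and `grad_fn` are illustrative assumptions, not part of the formula itself.

```python
def sgd_step(w, grad, lr):
    """One plain gradient step: w_t = w_{t-1} - eta * dL/dw_{t-1}."""
    return w - lr * grad


def grad_fn(w):
    """Gradient of the hypothetical toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w = 0.0  # arbitrary starting weight
for _ in range(50):
    w = sgd_step(w, grad_fn(w), lr=0.1)

print(w)  # approaches the minimiser w = 3
```

Here the gradient is exact, so there is no zigzag. With mini-batches, `grad_fn` would return a noisy estimate instead.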

With momentum:
$$
v_t = \beta v_{t-1} + (1-\beta) \frac{\partial L}{\partial w_{t-1}}
$$
$$
w_t = w_{t-1} - \eta v_t
$$
Where:

* $v_t$ = velocity term (smoothed gradient)
* $\beta$ = momentum factor ($0 < \beta < 1$)
* High $\beta$ → previous gradients dominate → smoother updates
* Low $\beta$ → current gradient dominates → less smoothing
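
The two momentum equations above can be sketched in a few lines. This is an illustration under stated assumptions: the velocity starts at $v_0 = 0$, the toy loss is $L(w) = (w - 3)^2$, and `momentum_step`, `grad_fn`, `lr=0.1`, and `beta=0.9` are hypothetical choices.

```python
def momentum_step(w, v, grad, lr, beta):
    """One update using the formulas above.

    v_t = beta * v_{t-1} + (1 - beta) * grad
    w_t = w_{t-1} - lr * v_t
    """
    v_new = beta * v + (1.0 - beta) * grad
    w_new = w - lr * v_new
    return w_new, v_new


def grad_fn(w):
    """Gradient of the hypothetical toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w, v = 0.0, 0.0  # assumption: velocity is initialised to zero
for _ in range(200):
    w, v = momentum_step(w, v, grad_fn(w), lr=0.1, beta=0.9)

print(w)  # approaches the minimiser w = 3
```

Some libraries define the velocity without the $(1-\beta)$ factor, so a given learning rate is not directly comparable between the two conventions.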

---

### Exponential Weighted Average

* Momentum uses an **exponentially weighted average (EWA)** of past gradients.
* The same formula smooths any time series, including a sequence of gradients:

  $$
  V_t = \beta V_{t-1} + (1-\beta) A_t
  $$
* $A_t$ = current value (gradient at time $t$)
* $V_t$ = smoothed value (smoothed gradient)
* Effect: reduces noise and zigzag in weight updates.
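
To see the smoothing effect, the sketch below applies the EWA formula to a made-up sequence that flips sign on every step, a stand-in for a zigzagging gradient. The function name `ewa`, the starting value $V_0 = 0$, and the input data are assumptions for illustration.

```python
def ewa(values, beta):
    """Exponentially weighted average: V_t = beta * V_{t-1} + (1 - beta) * A_t."""
    v = 0.0  # assumption: V_0 = 0
    smoothed = []
    for a in values:
        v = beta * v + (1.0 - beta) * a
        smoothed.append(v)
    return smoothed


# Hypothetical "gradient" signal that alternates between +1 and -1.
noisy = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
smoothed = ewa(noisy, beta=0.9)

for a, s in zip(noisy, smoothed):
    print(f"A_t = {a:+.1f}   V_t = {s:+.4f}")
```

The smoothed values stay close to zero because the alternating terms largely cancel. A constant input would instead be tracked more and more closely as $t$ grows. Starting from $V_0 = 0$ biases the first few values toward zero.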

---

### Intuition

* Without momentum: updates bounce back and forth → slow convergence.
* With momentum: updates accumulate direction → smoother path toward a minimum.
* Analogy: momentum keeps “pushing” the updates in the direction that successive gradients agree on, while components that flip sign from step to step largely cancel, which dampens oscillations.
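
A rough way to check this intuition is to count how often the weight jumps across the minimum. The sketch below does this for a steep one-dimensional quadratic, $L(w) = \tfrac{1}{2} c\, w^2$, using the EWA form of momentum from above. The curvature `25.0`, learning rate `0.07`, step count, and the helpers `run` and `count_sign_flips` are all hypothetical choices. The setup is deterministic, so it isolates the zigzag caused by a steep direction rather than by mini-batch noise.

```python
def count_sign_flips(xs):
    """Count how often consecutive values switch sign."""
    return sum(1 for a, b in zip(xs, xs[1:]) if a * b < 0)


def run(curvature, lr, beta, steps, w0=1.0):
    """Minimise L(w) = 0.5 * curvature * w^2 and return the path of w.

    beta = 0 reduces the update to plain gradient descent.
    """
    w, v = w0, 0.0
    path = [w]
    for _ in range(steps):
        grad = curvature * w
        v = beta * v + (1.0 - beta) * grad
        w = w - lr * v
        path.append(w)
    return path


plain = run(curvature=25.0, lr=0.07, beta=0.0, steps=50)
momentum = run(curvature=25.0, lr=0.07, beta=0.9, steps=50)

print("sign flips without momentum:", count_sign_flips(plain))
print("sign flips with momentum:   ", count_sign_flips(momentum))
```

With these settings, the plain run overshoots the minimum on every step, while the momentum run crosses it far less often. The example shows only the smoothing effect. Whether momentum also converges faster depends on how the learning rate and $\beta$ are tuned.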

---

### Advantages

1. Reduces the effect of gradient noise in mini-batch SGD.
2. Often gives faster convergence toward a minimum.
3. Smoother weight updates → more stable training.

---
