```{contents}
```

## Adagrad Optimizer

---

### **Problem Recap**

* In standard SGD or Mini-batch SGD, the **learning rate (η)** is fixed.
* A fixed learning rate forces a trade-off:

  * Too high → updates overshoot the minimum and can diverge.
  * Too low → convergence is slow.
* Ideal: a **dynamic learning rate** that takes bigger steps initially and smaller steps near the minimum.
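
As a minimal sketch of this trade-off, the snippet below runs plain gradient descent with a fixed learning rate on a toy objective $f(w) = w^2$. The objective, the function name `gradient_descent`, and the three learning-rate values are assumptions chosen for illustration.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Plain gradient descent on f(w) = w**2 with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # same step-size factor on every iteration
    return w

too_high = gradient_descent(lr=1.1)    # each step overshoots 0 and |w| grows
too_low = gradient_descent(lr=0.001)   # w has barely moved away from 1.0
moderate = gradient_descent(lr=0.3)    # w ends very close to the minimum at 0
```

On this toy problem each update multiplies `w` by `1 - 2 * lr`, so the run diverges whenever that factor has magnitude above 1 and crawls when the factor is close to 1.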

---

### **Adagrad Idea**

* **Adagrad = Adaptive Gradient** algorithm.
* Adjusts the learning rate individually for each parameter.
* Parameters with **frequent (or large) gradient updates → smaller learning rate**.
* Parameters with **infrequent (or small) gradient updates → larger learning rate**.

---

### **Weight Update Formula**

$$
w_t = w_{t-1} - \eta_t \frac{\partial L}{\partial w_{t-1}}
$$

Where the **dynamic learning rate** is:
$$
\eta_t = \frac{\eta}{\sqrt{\alpha_t} + \epsilon}
$$

* $\eta$ = initial learning rate
* $\alpha_t = \sum_{i=1}^{t} \left(\frac{\partial L}{\partial w_{i-1}}\right)^2$ → accumulated squared gradients, tracked separately for each parameter
* $\epsilon$ = small constant to avoid division by zero

**Explanation:**

* Initially, αₜ is small → ηₜ is large → bigger steps → faster convergence.
* Over time, αₜ grows → ηₜ decreases → smaller steps → a more precise approach to the minimum.
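
The update rule above can be sketched in a few lines of Python. This is a minimal illustration for a single scalar parameter, not a reference implementation: the function name `adagrad_step`, the toy objective $L(w) = w^2$, and the values `lr=0.1` and `eps=1e-8` are assumptions. With several parameters, the same update would be applied to each one independently, each with its own accumulator.

```python
import math

def adagrad_step(w, grad, alpha, lr=0.1, eps=1e-8):
    """Apply one Adagrad update to a single scalar parameter.

    Returns the new weight, the new accumulator, and the effective
    learning rate used for this step.
    """
    alpha = alpha + grad ** 2               # alpha_t: running sum of squared gradients
    eta_t = lr / (math.sqrt(alpha) + eps)   # eta_t = eta / (sqrt(alpha_t) + eps)
    w = w - eta_t * grad                    # w_t = w_{t-1} - eta_t * gradient
    return w, alpha, eta_t

# Toy objective (assumed for illustration): L(w) = w**2, so dL/dw = 2 * w.
w, alpha = 1.0, 0.0
etas = []
for _ in range(20):
    grad = 2.0 * w
    w, alpha, eta_t = adagrad_step(w, grad, alpha)
    etas.append(eta_t)
```

In this run `etas` decreases at every step, because each nonzero gradient adds a positive term to `alpha`.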

---

### **Intuition**

* Think of it as **automatic step size adjustment**:

  * Large accumulated gradient history → smaller steps (avoids overshoot).
  * Small accumulated gradient history → relatively larger steps (rarely or weakly updated parameters keep moving).

---

### **Advantages**

1. Dynamic learning rate for each parameter.
2. Faster initial convergence, since the early steps are large.
3. Handles sparse data well (features updated infrequently get bigger steps).
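
The sparse-data point can be illustrated with a small sketch that compares the effective learning rate of a parameter whose feature is active on every step with one whose feature is active only occasionally. The gradient values, the 1-in-10 activation pattern, and the variable names are assumptions made up for this example.

```python
import math

lr, eps = 0.1, 1e-8
alpha_dense, alpha_sparse = 0.0, 0.0

for step in range(1, 101):
    g_dense = 1.0                               # gradient arrives on every step
    g_sparse = 1.0 if step % 10 == 0 else 0.0   # gradient arrives on every 10th step
    alpha_dense += g_dense ** 2
    alpha_sparse += g_sparse ** 2

# Effective learning rates after 100 steps.
eta_dense = lr / (math.sqrt(alpha_dense) + eps)    # about 0.1 / sqrt(100)
eta_sparse = lr / (math.sqrt(alpha_sparse) + eps)  # about 0.1 / sqrt(10)
```

Here the rarely updated parameter keeps an effective learning rate roughly $\sqrt{10} \approx 3.16$ times larger than the frequently updated one, so each of its infrequent updates moves it further.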

---

### **Disadvantages**

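1. **Aggressive decay**: Over long training runs (common when training deep networks), αₜ can become very large because it only accumulates.

   * ηₜ → almost zero.
   * Weight updates become negligible → learning stalls.
2. αₜ never decreases, so the learning rate cannot recover once it is too small → training can freeze before reaching the minimum.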
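
To get a rough feel for how fast the decay sets in, consider a simplified case (an assumption for illustration) in which the gradient has the same magnitude $|g|$ at every step:

$$
\alpha_t = \sum_{i=1}^{t} g^2 = t\,g^2
\quad\Rightarrow\quad
\eta_t = \frac{\eta}{\sqrt{t}\,|g| + \epsilon} \approx \frac{\eta}{|g|\sqrt{t}}
$$

With $\eta = 0.1$ and $|g| = 1$, this gives $\eta_{100} \approx 0.01$ and $\eta_{10000} \approx 0.001$, so under this assumption the effective learning rate shrinks like $1/\sqrt{t}$.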

---


