```{contents}
```

## Adadelta and RMSProp

---

### **Problem Recap**

* **Adagrad** introduced a dynamic (per-step) learning rate:
  $$
  \eta_t = \frac{\eta}{\sqrt{\alpha_t} + \epsilon}
  $$
* **Issue:** $\alpha_t$ is a running sum of squared gradients, so it never decreases. In deep networks trained for many steps it can become **very large** → $\eta_t$ becomes **almost zero** → weight updates stop → learning stalls.
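
To make the stall concrete, here is a minimal sketch that tracks $\eta_t$ for a single weight. It assumes a constant gradient of 1.0 purely for illustration; the function name `adagrad_effective_lr` and the default values are hypothetical, not part of any library.

```python
import math

def adagrad_effective_lr(grads, eta=0.1, eps=1e-8):
    """Return eta_t = eta / (sqrt(alpha_t) + eps) after each step for one weight."""
    alpha = 0.0  # alpha_t: running sum of squared gradients (never decreases)
    rates = []
    for g in grads:
        alpha += g * g
        rates.append(eta / (math.sqrt(alpha) + eps))
    return rates

# Assumed toy input: the same gradient (1.0) at every step.
rates = adagrad_effective_lr([1.0] * 10_000)
print(rates[0], rates[99], rates[-1])  # roughly 0.1, 0.01, 0.001
```

Even though the gradient never changes, the effective learning rate drops by a factor of 100 over 10,000 steps, which is the shrinking behaviour described above.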

---

### **Adadelta & RMSProp Solution**

* Goal: Prevent learning rate from shrinking too much.
* Key ideas:

  1. Use **exponential weighted average (EWA)** of past squared gradients instead of raw sum.
  2. Restrict extreme growth of gradient history.

---

### **Formula**

1. Compute smoothed squared gradient:
   $$
   sdw_t = \beta \cdot sdw_{t-1} + (1 - \beta) \cdot \left(\frac{\partial L}{\partial w_{t-1}}\right)^2
   $$

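   * $\beta$ = smoothing factor (e.g., 0.95)
   * Older squared gradients are down-weighted by powers of $\beta$, so $sdw_t$ stays on the scale of recent squared gradients instead of growing without limit.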

2. Update weight with **dynamic learning rate**:
   $$
   w_t = w_{t-1} - \frac{\eta}{\sqrt{sdw_t} + \epsilon} \cdot \frac{\partial L}{\partial w_{t-1}}
   $$
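
The two steps above can be sketched for a single scalar weight as follows. This is an illustrative sketch, not a library implementation: the function name `rmsprop_step`, the toy loss $L(w) = w^2$, and the default hyperparameters are assumptions chosen for the example.

```python
import math

def rmsprop_step(w, grad, sdw, eta=0.01, beta=0.95, eps=1e-8):
    """One update for a single weight, following the two formulas above."""
    # Step 1: exponential weighted average of squared gradients.
    sdw = beta * sdw + (1.0 - beta) * grad * grad
    # Step 2: scale the global learning rate by the smoothed gradient magnitude.
    w = w - eta / (math.sqrt(sdw) + eps) * grad
    return w, sdw

# Assumed toy problem: L(w) = w^2, so dL/dw = 2w; sdw starts at 0.
w, sdw = 5.0, 0.0
for _ in range(2000):
    w, sdw = rmsprop_step(w, 2.0 * w, sdw)
print(w)  # ends close to the minimum at w = 0
```

Because $sdw_t$ is an average rather than a sum, feeding a constant gradient keeps it near the squared gradient itself, so the step size settles near $\eta$ instead of decaying. With a fixed $\eta$ the weight keeps hovering around the minimum rather than landing on it exactly.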

---

**Key Points**

* **Dynamic Learning Rate:** Like Adagrad, but avoids shrinking too much.
* **Smoothing:** The exponential weighted average smooths the squared-gradient estimate, so a single noisy gradient has only a limited effect on the step size.
* **Initialization:** $sdw_0 = 0$ at the start.
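* **Difference between Adadelta & RMSProp:** Both rest on the same EWA of squared gradients; they differ in what multiplies the gradient in the numerator (a fixed $\eta$ for RMSProp, an adaptive term for Adadelta).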

---

**Benefits**

1. Keeps the effective learning rate from decaying toward zero, because old gradients are gradually forgotten (unlike Adagrad).
2. Smooths gradient updates → more stable convergence.
3. Works well in deep neural networks.

---

### **Differences Between Adadelta & RMSProp**

* **RMSProp**: Standard use of smoothed squared gradients in denominator.
* **Adadelta**: Further modifies the update to remove the need for a global learning rate $\eta$ by also tracking an EWA of previous squared parameter updates (adaptive step size).
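
A minimal sketch of the Adadelta variant, for comparison with the RMSProp update. It follows the commonly cited formulation in which $\epsilon$ sits inside the square roots, which differs slightly from the formulas above. The names `adadelta_step` and `sdelta` (the EWA of squared updates), the toy loss, and the defaults are assumptions for illustration only.

```python
import math

def adadelta_step(w, grad, sdw, sdelta, beta=0.95, eps=1e-6):
    """One Adadelta update for a single weight; note there is no eta argument."""
    # EWA of squared gradients, same as in RMSProp.
    sdw = beta * sdw + (1.0 - beta) * grad * grad
    # RMS of past updates in the numerator takes the place of eta.
    delta = -math.sqrt(sdelta + eps) / math.sqrt(sdw + eps) * grad
    # EWA of squared parameter updates (hypothetical name: sdelta).
    sdelta = beta * sdelta + (1.0 - beta) * delta * delta
    return w + delta, sdw, sdelta

# Assumed toy problem: L(w) = w^2, so dL/dw = 2w; both averages start at 0.
w, sdw, sdelta = 5.0, 0.0, 0.0
w1, sdw, sdelta = adadelta_step(w, 2.0 * w, sdw, sdelta)
print(w1)  # slightly below 5.0: the first step is tiny because sdelta starts at 0
```

Since `sdelta` starts at zero, the first steps are on the order of $\sqrt{\epsilon}$ and grow as update history accumulates, so in this formulation $\epsilon$ also sets the initial step scale.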