```{contents}
```


## Exploding Gradient Problem 

* Exploding gradients happen when gradients (derivatives of the loss with respect to weights) become **extremely large** during backpropagation.
* This makes weight updates huge, causing the model to **overshoot the minimum** or diverge entirely.
* It’s the opposite of vanishing gradients, where updates become too small.

---

### **Why It Happens**

1. **Backpropagation multiplies gradients layer by layer**:

   $$
   \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial o_3} \cdot \frac{\partial o_3}{\partial o_2} \cdot \frac{\partial o_2}{\partial o_1} \cdot \frac{\partial o_1}{\partial w_1}
   $$

2. **Large weights amplify gradients**:

   * If weights are initialized too large (e.g., 500, 1000), each factor in the chain rule scales the gradient up further, so it grows roughly exponentially with the number of layers.
   * Result: the update step in `w_new = w_old - learning_rate * gradient` becomes **too large**, causing huge swings in weight values.

3. **Effect on training**:

   * Loss may oscillate or shoot up instead of decreasing.
   * The model may fail to converge to a good minimum; in extreme cases the loss or weights overflow to `inf` or `NaN`.
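
The repeated multiplication described above can be sketched with a toy model. This is an illustration only, under assumptions that are not part of the notes: every layer is a scalar linear map `o_k = weight * o_(k-1)`, so each chain-rule factor equals `weight`, and the function name `gradient_magnitudes` is hypothetical.

```python
def gradient_magnitudes(weight, depth, upstream=1.0):
    """Return |dL/do_k| from the last layer back to the first.

    Hypothetical toy model: `depth` scalar linear layers, o_k = weight * o_{k-1},
    so every chain-rule factor d o_k / d o_{k-1} equals `weight`.
    """
    grad = upstream  # dL/d(o_last), assumed to be 1.0 for illustration
    magnitudes = [abs(grad)]
    for _ in range(depth):
        grad *= weight  # one chain-rule multiplication per layer
        magnitudes.append(abs(grad))
    return magnitudes


# Compare a shrinking, a neutral, and a growing per-layer factor.
for w in (0.5, 1.0, 1.5):
    final = gradient_magnitudes(w, depth=50)[-1]
    print(f"weight={w}: gradient magnitude at first layer = {final:.3e}")
```

With a per-layer factor of 1.5 the magnitude after 50 layers is on the order of 10^8, while a factor of 0.5 drives it toward zero, which is the vanishing case.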

---

### **Key Factors**

* **Weight Initialization**: Large initial weights → exploding gradients.
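* **Activation Functions**: The activation's derivative is one of the factors in each chain-rule step. Sigmoid and tanh have bounded derivatives (at most 0.25 and 1) and tend to saturate, which shrinks gradients; ReLU passes gradients through unchanged for active units, so it does nothing to damp large weights.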
* **Deep Networks**: More layers → more multiplications → higher risk.
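
A rough bound ties these three factors together. It is a sketch under assumptions not stated in these notes: each layer computes $o_k = \phi(W_k\, o_{k-1})$, $\|\cdot\|$ is the spectral norm, and $\gamma = \sup |\phi'|$ is the largest slope of the activation (0.25 for sigmoid, 1 for tanh and ReLU). Each layer's Jacobian is then $\mathrm{diag}(\phi'(\cdot))\, W_k$, whose norm is at most $\gamma \|W_k\|$, so for an $n$-layer network:

$$
\left\| \frac{\partial L}{\partial o_1} \right\| \le \left\| \frac{\partial L}{\partial o_n} \right\| \prod_{k=2}^{n} \gamma \, \|W_k\|
$$

If every factor $\gamma \|W_k\|$ is above 1, the bound grows exponentially with depth $n$, so explosion becomes possible. If every factor is below 1, the gradient has to shrink instead.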

---

### **Solutions**

* Use **careful weight initialization techniques**, like:

  * **Xavier/Glorot Initialization** for sigmoid/tanh
  * **He Initialization** for ReLU
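* Use **gradient clipping**: Rescale the gradient whenever its norm exceeds a chosen threshold (or clamp each component to a fixed range), so a single update can never be arbitrarily large.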
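* Prefer **ReLU activations** for deep networks, paired with He initialization. ReLU mainly reduces the vanishing-gradient risk; it does not prevent explosion on its own, so initialization and clipping still matter.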
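
The initialization and clipping ideas above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names are hypothetical, weights are stored as nested lists, and the formulas used are the commonly cited ones (Xavier/Glorot uniform with limit `sqrt(6 / (fan_in + fan_out))`, He normal with standard deviation `sqrt(2 / fan_in)`, and clipping by global L2 norm).

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=random):
    """Xavier/Glorot uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


def he_normal(fan_in, fan_out, rng=random):
    """He init: N(0, std^2) with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # already within the threshold
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# Usage: a gradient of norm 5 is scaled down to norm 1, keeping its direction.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Clipping by norm keeps the update direction and only shortens the step, which is why it is often preferred over clamping each component separately.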