# Vanishing Gradients in RNNs


## 1. Introduction

The **vanishing gradient** problem is common when training deep neural networks and RNNs, especially on long sequences.

* During **backpropagation**, gradients are propagated backward through layers or time steps.
* If gradients become **very small**, the network **fails to learn long-term dependencies**.
* This is particularly a problem in **RNNs** because hidden states are repeatedly multiplied by weights at each time step.



## 2. How It Happens

Consider an RNN with hidden state update:

$$
h_t = f(W_{hh} h_{t-1} + W_{xh} x_t)
$$

During backpropagation, the gradient of the loss with respect to an earlier hidden state is:

$$
\frac{\partial L}{\partial h_{t-k}} = \frac{\partial L}{\partial h_t} \cdot \prod_{i=t-k+1}^{t} f'(h_i) W_{hh}
$$

* $f'$ is the derivative of the activation function (e.g., tanh, sigmoid), evaluated at step $i$
* $W_{hh}$ is the recurrent weight matrix

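**Problem:**

* For **sigmoid or tanh**, $|f'(h)| \le 1$: the sigmoid derivative never exceeds 0.25, and the tanh derivative reaches 1 only at $h = 0$
* Unless $W_{hh}$ is large enough to compensate, multiplying many such factors shrinks the gradient **exponentially** in $k$, so it approaches **zero**

$$
\prod_{i=t-k+1}^{t} f'(h_i) W_{hh} \approx 0
$$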

* Early layers or time steps **learn very slowly**, losing long-term memory.
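To make the shrinking product concrete, here is a minimal NumPy sketch, not a training setup. It assumes a toy tanh RNN with no input term and small Gaussian recurrent weights (the size, scale, and seed are arbitrary choices for illustration), and it tracks the spectral norm of the accumulated Jacobian product from the first hidden state to the current one.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 16, 50

# Assumed toy setup: small Gaussian recurrent weights, no input term
W_hh = rng.normal(scale=0.3 / np.sqrt(hidden_size), size=(hidden_size, hidden_size))
h = rng.normal(size=hidden_size)

jacobian_product = np.eye(hidden_size)   # accumulates d h_t / d h_0
norms = []
for _ in range(steps):
    a = W_hh @ h                         # pre-activation
    h = np.tanh(a)
    # One-step Jacobian d h_t / d h_{t-1} = diag(tanh'(a)) @ W_hh, with tanh'(a) = 1 - tanh(a)^2
    step_jacobian = np.diag(1.0 - h**2) @ W_hh
    jacobian_product = step_jacobian @ jacobian_product
    norms.append(np.linalg.norm(jacobian_product, 2))

print(f"norm after  1 step:  {norms[0]:.3e}")
print(f"norm after 10 steps: {norms[9]:.3e}")
print(f"norm after 50 steps: {norms[-1]:.3e}")
```

With these assumed weights the norm should fall by many orders of magnitude over 50 steps. Larger recurrent weights slow the decay, which is why the scale of $W_{hh}$ matters as much as the activation.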



## 3. Consequences in RNNs

* Network **cannot capture long-range dependencies** in sequences
* Only learns **short-term patterns**
* Training may **stall**, resulting in poor performance for tasks like language translation or time series prediction with long sequences



## 4. Illustration

```
Sequence: x1 → x2 → x3 → x4 → x5
Hidden states: h1 → h2 → h3 → h4 → h5

Loss depends on h5

Gradient w.r.t. h1 (four steps back from h5):
∂L/∂h1 = ∂L/∂h5 * W^4 * f'(h2)f'(h3)f'(h4)f'(h5)

If |W| < 1 and f' < 1 → ∂L/∂h1 ≈ 0
```

* Early time steps “forget” their influence.
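As a rough worked example of the five-step illustration, the sketch below treats the recurrent weight as a scalar and multiplies the four step-to-step factors between h1 and h5. All numbers are hypothetical values picked for illustration, not taken from a trained model.

```python
# Hypothetical scalar values chosen only for illustration
w = 0.9                          # scalar stand-in for W_hh
f_prime = [0.6, 0.5, 0.4, 0.3]   # assumed activation derivatives at steps 2..5

grad_h5 = 1.0                    # assume dL/dh5 = 1
grad_h1 = grad_h5
for d in f_prime:
    grad_h1 *= d * w             # one backward step: multiply by f' * W

print(grad_h1)
```

Under these assumptions the gradient reaching h1 is roughly 0.024, already about 40 times smaller than at h5 after only four steps.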



## 5. Solutions

1. **Use LSTM or GRU**

   * Special gates and cell state help maintain long-term memory
   * Reduce vanishing gradient problem

2. **Proper weight initialization**

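   * Initialize the recurrent weights so repeated multiplication neither shrinks nor inflates the signal, for example with an orthogonal or identity matrix, rather than with very small values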

3. **Use ReLU activation**

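   * ReLU has a derivative of exactly 1 for positive inputs, so it does not shrink gradients the way saturated tanh/sigmoid units do; in RNNs it is usually paired with careful initialization because its output is unbounded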

4. **Gradient clipping**

   * Caps gradients at a maximum threshold, which guards against exploding gradients; it does not stop gradients from vanishing, so use it alongside the techniques above

5. **Shorter sequences / truncated BPTT**

   * Backpropagate through smaller chunks of sequences
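The effect of weight initialization can be sketched by comparing how a gradient vector shrinks under two recurrent matrices. This example assumes a linear recurrence (activation derivative fixed at 1) so that only $W_{hh}$ affects the result. The orthogonal matrix built from a QR decomposition is one common choice, shown here as an illustration; `backprop_norm` is a hypothetical helper, and the sizes and scales are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 32, 100

# Small Gaussian recurrent matrix (assumed scale, for illustration)
W_small = rng.normal(scale=0.3 / np.sqrt(n), size=(n, n))

# Orthogonal recurrent matrix from a QR decomposition: all singular values equal 1
W_orth, _ = np.linalg.qr(rng.normal(size=(n, n)))

def backprop_norm(W, steps):
    """Norm of a unit gradient after `steps` backward steps through h_t = W h_{t-1}."""
    g = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(steps):
        g = W.T @ g              # backward step of a linear recurrence
    return np.linalg.norm(g)

small_norm = backprop_norm(W_small, steps)
orth_norm = backprop_norm(W_orth, steps)
print(f"small Gaussian init: {small_norm:.3e}")
print(f"orthogonal init:     {orth_norm:.3e}")
```

The orthogonal matrix preserves the gradient norm exactly in this linear setting, while the small Gaussian matrix drives it toward zero. With a real nonlinearity the orthogonal case would still decay somewhat, since the activation derivatives also enter the product.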

---

## 6. Summary Table

| Concept            | Description                                                              |
| ------------------ | ------------------------------------------------------------------------ |
| Vanishing Gradient | Gradient becomes extremely small → early layers/time steps stop learning |
| Cause              | Multiplication of small numbers during backpropagation through time      |
| Effect             | RNN cannot capture long-term dependencies                                |
| Solution           | LSTM/GRU, ReLU, weight initialization, truncated BPTT                    |


# Exploding Gradients in RNNs


## 1. Introduction

**Exploding gradients** are the opposite problem to vanishing gradients.

* During **backpropagation**, gradients can grow **very large**.
* This can **destabilize training**, causing the model to produce NaNs or fail to converge.
* Common in **deep networks** and **RNNs**, especially with long sequences.



## 2. How It Happens

Consider an RNN hidden state:

$$
h_t = f(W_{hh} h_{t-1} + W_{xh} x_t)
$$

During backpropagation:

$$
\frac{\partial L}{\partial h_{t-k}} = \frac{\partial L}{\partial h_t} \cdot \prod_{i=t-k+1}^{t} f'(h_i) W_{hh}
$$

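* If $W_{hh}$ has **eigenvalues with magnitude > 1** and the activation derivatives $f'(h_i)$ do not shrink the product enough to compensate (for tanh and sigmoid, $f'$ never exceeds 1, so the growth comes from $W_{hh}$)
* Multiplying many such factors → gradient **grows exponentially**

$$
\left\| \prod_{i=t-k+1}^{t} f'(h_i) W_{hh} \right\| \gg 1
$$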

* Early layers or time steps **receive extremely large gradient updates**
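The eigenvalue condition can be checked numerically. The sketch below assumes a toy recurrent matrix rescaled so that its largest eigenvalue magnitude (spectral radius) is 1.2, and a linear recurrence with $f' = 1$ so that only $W_{hh}$ drives the growth. The matrix size, seed, and target radius are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, steps = 16, 50

# Assumed toy recurrent matrix, rescaled so its spectral radius is 1.2
W = rng.normal(size=(n, n))
W *= 1.2 / np.max(np.abs(np.linalg.eigvals(W)))
spectral_radius = np.max(np.abs(np.linalg.eigvals(W)))

# Start from a unit-norm gradient and push it backward through time
g = rng.normal(size=n)
g /= np.linalg.norm(g)
grad_norms = []
for _ in range(steps):
    g = W.T @ g                  # backward step with f' = 1
    grad_norms.append(np.linalg.norm(g))

print(f"spectral radius:          {spectral_radius:.3f}")
print(f"grad norm after  1 step:  {grad_norms[0]:.3e}")
print(f"grad norm after 50 steps: {grad_norms[-1]:.3e}")
```

A spectral radius only slightly above 1 is enough: growth compounds at every step, so the norm after 50 steps should be orders of magnitude larger than after one.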



## 3. Consequences

* Weight updates can become so large that weights **overflow to infinity**
* Training becomes **unstable**
* Loss function may **diverge**
* Gradients may become **NaN**, preventing learning
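A small sketch of how overflow turns into NaN, and of a check that could be run on gradients before applying an update. `all_finite` is a hypothetical helper, and the float32 values are chosen only to force an overflow.

```python
import numpy as np

def all_finite(grads):
    """True if no gradient array contains inf or NaN."""
    return all(np.isfinite(g).all() for g in grads)

# Silence NumPy's overflow/invalid warnings for this deliberate demonstration
with np.errstate(over="ignore", invalid="ignore"):
    big = np.array([3e38], dtype=np.float32) * np.float32(10.0)  # exceeds float32 range -> inf
    bad = big - big                                              # inf - inf -> nan

print(all_finite([np.array([0.1, -2.0])]))
print(all_finite([big]))
print(all_finite([bad]))
```

Once a single value becomes `inf`, ordinary arithmetic on it tends to produce `nan`, which then spreads through every later computation that touches it.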

---

## 4. Illustration

```
Sequence: x1 → x2 → x3 → x4 → x5
Hidden states: h1 → h2 → h3 → h4 → h5

Gradient w.r.t. h1 (four steps back from h5):
∂L/∂h1 = ∂L/∂h5 * W^4 * f'(h2)f'(h3)f'(h4)f'(h5)

If |W| > 1 and f' stays close to 1 → ∂L/∂h1 → very large
```

* Early time steps dominate updates → instability



## 5. Solutions

1. **Gradient Clipping**

   * Limit gradients to a maximum threshold:

   ```python
   # Call after loss.backward() and before optimizer.step(); `model` is your nn.Module
   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
   ```

   * Prevents excessively large updates.

2. **Proper Weight Initialization**

   * Avoid large initial weights for recurrent connections.

3. **Use LSTM / GRU**

   * Gates help stabilize gradient flow.

4. **Use Appropriate Activation Functions**

   * Prefer bounded activations such as `tanh`; unbounded ones (like plain ReLU) let hidden states and gradients grow without limit.
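To show what norm-based clipping does numerically, here is a NumPy sketch of the idea behind the PyTorch call above. It is not PyTorch's implementation: `clip_by_global_norm` is a hypothetical helper, and the gradient values are made up for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1 when the norm is already small enough, otherwise max_norm / total_norm
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Hypothetical gradients for two parameter tensors
grads = [np.array([30.0, -40.0]), np.array([[120.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(f"norm before clipping: {norm_before:.1f}")
print(f"norm after clipping:  {norm_after:.6f}")
```

Because every array is multiplied by the same factor, clipping shrinks the update's size but keeps its direction. Gradients whose norm is already below `max_norm` pass through unchanged.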



## 6. Comparison with Vanishing Gradients

| Property           | Vanishing Gradient                    | Exploding Gradient                       |
| ------------------ | ------------------------------------- | ---------------------------------------- |
| Gradient Magnitude | Very small → 0                        | Very large → ∞                           |
| Effect on Training | Early layers stop learning            | Training becomes unstable                |
| Common Cause       | Small weights, sigmoids, deep network | Large weights, repeated multiplications  |
| Solution           | LSTM/GRU, ReLU, truncated BPTT        | Gradient clipping, weight init, LSTM/GRU |

---

## 7. Key Takeaway

* **Vanishing gradient:** prevents learning **long-term dependencies**
* **Exploding gradient:** prevents **stable training**
* Both are inherent challenges of deep RNNs.
* Modern RNN architectures (LSTM, GRU) and techniques (gradient clipping) mitigate these issues.