# Vanishing Gradient Problem Revision Notebook

A one-page flashcard-style revision tool explaining the vanishing gradient problem in RNNs, its causes, effects, and solutions. Designed for quick, deep review.

## 1. What is the Vanishing Gradient Problem?
- **Definition**: A training issue in RNNs where gradients become extremely small during backpropagation, slowing or halting learning of long-term dependencies.
- **Context**: Arises in deep networks and on long sequences; especially relevant for RNNs that process text (e.g., sentiment analysis).
- **Example**: In "the cat that ate the food," the gradient signal linking "cat" to "food" weakens with every time step between them.

## 2. How Does It Happen in RNNs?
- **Backpropagation Through Time (BPTT)**:
  - Unrolls RNN over $T$ time steps, computing gradient of loss $L$ w.r.t. weights (e.g., $W_{hh}$).
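  - Gradient: $\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^T \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}}$, where $\frac{\partial L}{\partial h_t}$ collects the loss signal flowing back from step $t$ and all later steps, and $\frac{\partial h_t}{\partial W_{hh}}$ is the direct effect of $W_{hh}$ at step $t$ (holding $h_{t-1}$ fixed).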
- **Chain Rule Effect**:
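  - $\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\big(f'(W_{hh} h_{t-1} + W_{xh} x_t)\big)\, W_{hh}$, where $f'$ is the activation derivative (e.g., of tanh).
  - Linking step $t$ back to an earlier step $k$ multiplies these Jacobians: $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$, a product of $t-k$ factors.
- **Vanishing Cause**: If every factor has norm below 1 (roughly, $\max|f'| \cdot \sigma_{\max}(W_{hh}) < 1$, with $\sigma_{\max}$ the largest singular value of $W_{hh}$), the product shrinks exponentially in $t-k$, making gradients from early steps tiny. If the factors have norm above 1, gradients can explode instead.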
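
The repeated multiplication above can be sketched with a scalar RNN, where each Jacobian is just the number $f'(a_t)\, w_{hh}$. This is a minimal illustration, not a training setup: the function name `backprop_factor`, the weights, and the constant input sequence are all assumptions chosen for the example.

```python
import math

def backprop_factor(w_hh, w_xh, inputs, h0=0.0):
    """Return |d h_t / d h_0| for t = 1..T in a scalar tanh RNN."""
    h = h0
    factor = 1.0
    magnitudes = []
    for x in inputs:
        h = math.tanh(w_hh * h + w_xh * x)
        # Scalar Jacobian for this step: tanh'(a_t) * w_hh, with tanh'(a) = 1 - tanh(a)^2.
        factor *= (1.0 - h * h) * w_hh
        magnitudes.append(abs(factor))
    return magnitudes

# Hypothetical values, picked only to make the decay visible.
inputs = [0.5] * 50
mags = backprop_factor(w_hh=0.9, w_xh=0.5, inputs=inputs)
for T in (1, 10, 25, 50):
    print(f"T={T:2d}  |dh_T/dh_0| = {mags[T - 1]:.3e}")
```

Because both `w_hh` and the tanh derivative are below 1 here, every extra step multiplies the gradient by a number smaller than 1, so the printed magnitudes fall off exponentially with `T`.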

## 3. Why It Affects Long-Term Dependencies
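- **Memory Loss**: Small gradients mean early inputs (e.g., "the cat") barely influence the weight updates, so the network never learns their effect on later outputs (e.g., "food").
- **Activation Role**: Tanh (outputs in $(-1, 1)$) has derivative at most 1, and sigmoid (outputs in $(0, 1)$) has derivative at most 0.25. Both derivatives fall toward 0 when a unit saturates, so the shrinkage compounds with every additional step.
- **Example**: In "I am not happy," the impact of "not" on "happy" fades when the sequence is long and many tokens sit between the negation and the word it modifies, which can cause the sentiment to be misclassified.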
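
To make the activation role concrete, the sketch below evaluates the tanh and sigmoid derivatives at a few pre-activation values and shows the best case for a chain of sigmoid steps. It considers the activation derivative only and assumes unit recurrent weights; the helper names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_sigmoid(a):
    """Derivative of the sigmoid: s * (1 - s), maximal (0.25) at a = 0."""
    s = sigmoid(a)
    return s * (1.0 - s)

def d_tanh(a):
    """Derivative of tanh: 1 - tanh(a)^2, maximal (1.0) at a = 0."""
    return 1.0 - math.tanh(a) ** 2

# Both derivatives peak at a = 0 and decay as |a| grows (saturation).
for a in (0.0, 1.0, 2.0, 4.0):
    print(f"a={a:3.1f}  sigmoid'={d_sigmoid(a):.4f}  tanh'={d_tanh(a):.4f}")

# Best case for a sigmoid chain: every step sits exactly at the peak derivative.
steps = 20
best_case_sigmoid = d_sigmoid(0.0) ** steps
print(f"largest possible activation factor after {steps} sigmoid steps: {best_case_sigmoid:.3e}")
```

Even in this best case the sigmoid factor is $0.25^{20}$, which is why saturating activations make long chains so hard to train.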

## 4. Effects on Training
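- **Slow Learning**: Gradient contributions from early time steps are tiny, so the shared weights learn long-range patterns very slowly or not at all.
- **Short-Term Bias**: Weight updates are dominated by recent inputs, so the RNN ignores distant context.
- **Practical Impact**: As a rough rule of thumb, vanilla RNNs struggle to learn dependencies that span more than ~10 words, which is inadequate for complex tasks like long reviews.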
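
The short-term bias can be made visible by splitting $\frac{\partial L}{\partial W_{hh}}$ into one term per time step. The sketch below does this for a scalar tanh RNN with a squared-error loss on the last hidden state only; the loss choice, the weights, and the function names are assumptions made for illustration.

```python
import math

def forward(w_hh, w_xh, inputs, h0=0.0):
    """Run a scalar tanh RNN and return every hidden state, h_0 included."""
    hs = [h0]
    for x in inputs:
        hs.append(math.tanh(w_hh * hs[-1] + w_xh * x))
    return hs

def final_loss(w_hh, w_xh, inputs, target):
    """Squared error on the last hidden state only."""
    return 0.5 * (forward(w_hh, w_xh, inputs)[-1] - target) ** 2

def per_step_contributions(w_hh, w_xh, inputs, target):
    """Split dL/dw_hh into one term per time step via BPTT."""
    hs = forward(w_hh, w_xh, inputs)
    T = len(inputs)
    dL_dh = hs[T] - target              # dL/dh_T
    contributions = [0.0] * T
    for k in range(T, 0, -1):           # walk backward from step T to step 1
        local = 1.0 - hs[k] ** 2        # tanh'(a_k)
        # Signal reaching h_k times the direct effect of w_hh on h_k.
        contributions[k - 1] = dL_dh * local * hs[k - 1]
        # Push the signal one step further back: dL/dh_{k-1}.
        dL_dh *= local * w_hh
    return contributions

# Hypothetical setup: 30 identical inputs, weights chosen for illustration.
inputs = [0.5] * 30
contribs = per_step_contributions(0.9, 0.5, inputs, target=0.0)
total = sum(contribs)
recent_share = sum(contribs[-5:]) / total
print(f"dL/dw_hh = {total:.4f}")
print(f"share contributed by the last 5 of 30 steps: {recent_share:.1%}")
```

In this setup nearly all of the gradient comes from the last few steps, so whatever the early inputs could teach the weights is effectively drowned out.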

## 5. Solutions to the Vanishing Gradient Problem
- **LSTM (Long Short-Term Memory)**:
  - Uses gates (forget, input, output) and an additively updated cell state, so gradients can flow across many steps without being squashed at each one.
  - Sentiment Benefit: Retains "not" over sentences.
- **GRU (Gated Recurrent Unit)**:
  - Uses simpler gates (update, reset) to stabilize gradients; it has fewer parameters than an LSTM and is often faster to train.
  - Sentiment Benefit: Efficient for shorter texts.
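- **Gradient Clipping**: Caps gradient magnitude during BPTT. It prevents exploding gradients but does not fix vanishing ones, so it complements the other remedies rather than replacing them.
- **Better Initialization**: Techniques like orthogonal initialization start $W_{hh}$ with all singular values equal to 1, so repeated multiplication by $W_{hh}$ neither shrinks nor amplifies the gradient early in training.
- **ReLU Activation**: The derivative is exactly 0 or 1, so active units do not shrink the gradient. This reduces vanishing but risks dead (always-zero) units and, in RNNs, unbounded activations.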
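
Two of the remedies above are easy to sketch in plain Python. The first part shows norm-based clipping, which rescales a large gradient but leaves a vanished one untouched. The second uses a 2x2 rotation matrix as a stand-in for an orthogonal $W_{hh}$ and ignores the activation derivative, which is a simplifying assumption. All function names and numbers are illustrative.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [30.0, -40.0]    # L2 norm 50: gets scaled down
vanishing = [3e-9, -4e-9]    # L2 norm 5e-9: already below the cap
clipped_big = clip_by_global_norm(exploding, max_norm=5.0)
clipped_small = clip_by_global_norm(vanishing, max_norm=5.0)
print("clipped exploding gradient:", clipped_big)
print("clipped vanishing gradient:", clipped_small)   # unchanged, still tiny

def rotate(v, theta):
    """Multiply a 2-vector by a 2x2 rotation matrix, which is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def norm_after_steps(v, theta, scale, steps):
    """Apply (scale * rotation) repeatedly and return the final L2 norm."""
    for _ in range(steps):
        v = [scale * x for x in rotate(v, theta)]
    return math.hypot(v[0], v[1])

# Orthogonal matrix (all singular values 1) versus the same matrix scaled by 0.9.
orthogonal_norm = norm_after_steps([1.0, 0.0], theta=0.3, scale=1.0, steps=100)
shrunk_norm = norm_after_steps([1.0, 0.0], theta=0.3, scale=0.9, steps=100)
print(f"norm after 100 steps, singular values 1.0: {orthogonal_norm:.6f}")
print(f"norm after 100 steps, singular values 0.9: {shrunk_norm:.3e}")
```

Clipping only ever scales gradients down, which is why it cannot rescue a vanished gradient. The orthogonal case preserves the vector norm exactly (up to rounding), while singular values of 0.9 shrink it by $0.9^{100}$.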

## 6. Why It Matters for Your Path
- **RNN Limitation**: Highlights need for LSTMs/GRUs, bridging to attention/Transformers.
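- **Transformer Advantage**: Self-attention connects any two positions directly, so the gradient between distant tokens does not pass through a long chain of recurrent steps.
- **Sentiment Evolution**: Understanding this limitation explains the move toward better contextual models (e.g., BERT).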
- **Practice**: Test "the cat that ate the food" with basic RNN vs. LSTM.
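
One way to try the practice item without a deep learning framework is to measure how strongly the final hidden state reacts to a small change in the input at "cat". The sketch below does this with finite differences for a scalar tanh RNN and a scalar LSTM. Everything here is an assumption for illustration: the one-feature token encoding, the `<filler>` tokens used to lengthen the gap, and especially the hand-picked, untrained parameters (the LSTM's forget gate is fixed near 1 and its input gate opens only on "cat").

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def rnn_final_state(xs, w_hh=0.9, w_xh=1.0):
    """Scalar tanh RNN; returns the last hidden state."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_hh * h + w_xh * x)
    return h

def lstm_final_state(xs):
    """Scalar LSTM with hand-picked (untrained) parameters; returns the last hidden state."""
    h, c = 0.0, 0.0
    for x in xs:
        i = sigmoid(8.0 * x - 4.0)   # input gate: opens only when x is large
        f = sigmoid(4.0)             # forget gate: held near 1 (about 0.98)
        o = sigmoid(2.0)             # output gate: about 0.88
        g = math.tanh(x + 0.5 * h)   # candidate value
        c = f * c + i * g            # additive cell-state update
        h = o * math.tanh(c)
    return h

def sensitivity(model, xs, index, eps=1e-5):
    """Central finite-difference estimate of |d h_T / d x_index|."""
    up, down = list(xs), list(xs)
    up[index] += eps
    down[index] -= eps
    return abs(model(up) - model(down)) / (2.0 * eps)

def encode(tokens):
    # Hypothetical one-feature encoding: 1.0 for "cat", 0.0 for every other token.
    return [1.0 if tok == "cat" else 0.0 for tok in tokens]

sentence = "the cat that ate the food".split()
results = {}
for extra in (0, 10, 30):
    # Lengthen the gap between "cat" and "food" with filler tokens.
    tokens = sentence[:2] + ["<filler>"] * extra + sentence[2:]
    xs = encode(tokens)
    cat = tokens.index("cat")
    results[extra] = (sensitivity(rnn_final_state, xs, cat),
                      sensitivity(lstm_final_state, xs, cat))
    rnn_s, lstm_s = results[extra]
    print(f"extra tokens={extra:2d}  RNN={rnn_s:.4f}  LSTM={lstm_s:.4f}")
```

With these particular parameters the RNN's sensitivity to "cat" collapses as the gap grows, while the LSTM's stays roughly level because the signal travels along the cell state. A trained model would behave differently in detail, so treat this as a picture of the mechanism rather than a benchmark.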

## 7. Key Points (Milestones)
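- **Cause**: Repeated multiplication of Jacobian factors with norm < 1 over $T$ steps.
- **Effect**: Long-term dependencies go unlearned, which limits the usable sequence length.
- **Fix**: LSTMs/GRUs and better initialization mitigate vanishing; clipping handles explosion only; Transformers avoid the recurrent chain altogether.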
- **Sentiment**: Illustrative figures only (not a cited benchmark): left unaddressed, a vanilla RNN might drop to ~70% accuracy on long texts, while an LSTM might improve that to ~85%.