# RNN Revision Notebook

A one-page, flashcard-style revision tool explaining Recurrent Neural Networks (RNNs): their architecture, variants, and applications such as sentiment analysis. Designed for quick, deep review.

## 1. What Are RNNs?
- **Definition**: Neural networks designed for sequential data, maintaining a "memory" of previous inputs to process sequences (e.g., sentences, time series).
- **Goal**: Learn patterns that depend on order and context, unlike Word2Vec's static embeddings, which give each word one fixed vector.
- **Example**: Predicting the next word after "I am happy", or the sentiment of "not good".
- **Key Idea**: The same weights are applied at every time step, and the hidden state carries information forward.

## 2. RNN Architecture (Step-by-Step)
- **Input**: $x_t$ at time $t$ (e.g., a word vector from Word2Vec, abbreviated W2V).
- **Hidden State**: $h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b)$, where:
  - $h_{t-1}$: Previous hidden state (memory).
  - $W_{xh}$: Input-to-hidden weights.
  - $W_{hh}$: Hidden-to-hidden weights.
  - $f$: Activation (e.g., tanh) for non-linearity.
  - $b$: Bias.
- **Output**: $y_t = W_{hy} h_t + b_y$ (e.g., sentiment score), optional per step.
- **Flow**: $h_t$ is fed back as input to the next step, so the sequence is processed left to right.
- **Parameters**: $W_{xh}$, $W_{hh}$, $W_{hy}$ are shared across all time steps, so the parameter count does not grow with sequence length.
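
The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes, the zero initial state $h_0$, and the random weights are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b, b_y):
    """Run a vanilla tanh RNN over a sequence; return all hidden states and outputs."""
    h = np.zeros(W_hh.shape[0])                   # h_0 = 0 (an assumption, not from the notes)
    hs, ys = [], []
    for x_t in xs:                                # left to right over time steps
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)    # h_t = f(W_xh x_t + W_hh h_{t-1} + b)
        hs.append(h)
        ys.append(W_hy @ h + b_y)                 # y_t = W_hy h_t + b_y
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, T = 4, 3, 2, 5  # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

xs = rng.normal(size=(T, input_dim))               # stand-in for T word vectors
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b, b_y)
print(hs.shape, ys.shape)                          # (5, 3) (5, 2)

# The same weights serve every step, so this count does not depend on T.
n_params = W_xh.size + W_hh.size + W_hy.size + b.size + b_y.size
print(n_params)                                    # 12 + 9 + 6 + 3 + 2 = 32
```

Doubling `T` changes the shapes of `hs` and `ys` but leaves `n_params` untouched, which is the point of weight sharing.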

## 3. Loss Calculation and Backpropagation
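- **Loss Function**: Typically cross-entropy for classification (e.g., sentiment). For binary labels at every step:
  - $L = -\frac{1}{T} \sum_{t=1}^T [y_t \log \hat{y}_t + (1 - y_t) \log (1 - \hat{y}_t)]$, where $y_t \in \{0, 1\}$ is the true label and $\hat{y}_t$ is the predicted probability. Note that $y_t$ here is the label, not the raw output from Section 2; $\hat{y}_t$ is the sigmoid of that output.
  - The $\frac{1}{T}$ factor averages the per-step losses over all $T$ time steps. With one label per sequence (as in sentence-level sentiment), the loss is computed once, from the final step.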
- **Backpropagation Through Time (BPTT)**:
  - Unrolls RNN over $T$ steps into a feedforward network.
  - Computes gradients of $L$ w.r.t. $W_{xh}$, $W_{hh}$, $W_{hy}$ by chain rule across time.
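  - Gradient for $W_{hh}$: $\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^T \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}}$, accumulating over steps. Here $\frac{\partial L}{\partial h_t}$ is the total gradient reaching $h_t$ (including paths through later steps), and $\frac{\partial h_t}{\partial W_{hh}}$ is the direct derivative with $h_{t-1}$ held fixed.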
- **Challenges**:
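  - Vanishing gradients: The gradient flowing from step $t$ back to step $k$ contains the product $\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$. When these factors have norm below 1, the product shrinks exponentially and early steps receive weak updates.
  - Exploding gradients: When the factors have norm above 1, the product grows exponentially; clipping the gradient norm is the usual fix.
  - Mitigations: Truncated BPTT (e.g., backprop over 5-10 steps) bounds cost and limits explosion, but also caps the dependency length that can be learned; LSTMs address vanishing gradients directly.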
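
As a sanity check on the BPTT sum for $W_{hh}$, the sketch below computes that gradient by hand for a tiny tanh RNN with a sigmoid output and the mean binary cross-entropy above, then compares it against finite differences. The names (`forward`, `bptt_W_hh`, `w_hy`), the sizes, and the random data are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(xs, labels, W_xh, W_hh, w_hy, b, b_y):
    """Return the mean binary cross-entropy plus cached states for BPTT."""
    hs = [np.zeros(W_hh.shape[0])]                   # hs[0] is h_0
    probs = []
    for x_t in xs:
        hs.append(np.tanh(W_xh @ x_t + W_hh @ hs[-1] + b))
        probs.append(sigmoid(w_hy @ hs[-1] + b_y))   # predicted probability at this step
    probs = np.array(probs)
    loss = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return loss, hs, probs

def bptt_W_hh(xs, labels, W_xh, W_hh, w_hy, b, b_y):
    """Gradient of the loss w.r.t. W_hh, accumulated backwards over time."""
    _, hs, probs = forward(xs, labels, W_xh, W_hh, w_hy, b, b_y)
    T = len(xs)
    dW_hh = np.zeros_like(W_hh)
    dh_next = np.zeros(W_hh.shape[0])                # gradient arriving from the later step
    for t in reversed(range(T)):
        dlogit = (probs[t] - labels[t]) / T          # d loss / d logit at this step
        dh = w_hy * dlogit + dh_next                 # total gradient reaching this hidden state
        da = dh * (1.0 - hs[t + 1] ** 2)             # back through tanh
        dW_hh += np.outer(da, hs[t])                 # direct term, previous state held fixed
        dh_next = W_hh.T @ da                        # pass gradient one step back in time
    return dW_hh

rng = np.random.default_rng(0)
D, H, T = 3, 4, 6                                    # illustrative sizes
W_xh = rng.normal(scale=0.5, size=(H, D))
W_hh = rng.normal(scale=0.5, size=(H, H))
w_hy = rng.normal(scale=0.5, size=H)
b, b_y = np.zeros(H), 0.0
xs = rng.normal(size=(T, D))
labels = rng.integers(0, 2, size=T).astype(float)

analytic = bptt_W_hh(xs, labels, W_xh, W_hh, w_hy, b, b_y)

# Central finite differences, one entry of W_hh at a time.
eps = 1e-6
numeric = np.zeros_like(W_hh)
for i in range(H):
    for j in range(H):
        W_plus, W_minus = W_hh.copy(), W_hh.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        loss_plus = forward(xs, labels, W_xh, W_plus, w_hy, b, b_y)[0]
        loss_minus = forward(xs, labels, W_xh, W_minus, w_hy, b, b_y)[0]
        numeric[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)                                      # should be tiny if the BPTT sum is right
```

The line `dh_next = W_hh.T @ da` is where the repeated multiplication happens: each step back in time multiplies the gradient by another Jacobian, which is the source of both vanishing and exploding gradients.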

## 4. How RNNs Work for Sentiment Analysis
- **Sequence Processing**: Reads words one by one (e.g., "I", "am", "not", "good").
- **Memory**: $h_t$ accumulates context, so a trained model can learn that "not" before "good" flips the sentiment to negative.
- **Output**: The final hidden state $h_T$ (or output $y_T$) is fed to a classifier (e.g., softmax) for positive/negative.
- **Advantage**: Captures word order, unlike averaging Word2Vec vectors, which treats the sentence as a bag of words.
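
To make the order argument concrete, here is a small sketch comparing an RNN's final hidden state with a mean-of-embeddings representation on two orderings of the same words. The embeddings and weights are random and untrained (hypothetical stand-ins for Word2Vec vectors and a fitted model), so this shows only that the RNN is order-sensitive, not that it predicts sentiment correctly.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["i", "am", "not", "good"]
embed_dim, hidden_dim = 5, 4                       # illustrative sizes

# Hypothetical random embeddings standing in for pretrained Word2Vec vectors.
E = {w: rng.normal(size=embed_dim) for w in vocab}
W_xh = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def final_state(words):
    """Final hidden state h_T after reading the words in order."""
    h = np.zeros(hidden_dim)
    for w in words:
        h = np.tanh(W_xh @ E[w] + W_hh @ h + b)
    return h

def mean_embedding(words):
    """Bag-of-words style representation: order is discarded by averaging."""
    return np.mean([E[w] for w in words], axis=0)

seq_a = ["i", "am", "not", "good"]
seq_b = ["good", "i", "am", "not"]                 # same words, different order

same_bag = np.allclose(mean_embedding(seq_a), mean_embedding(seq_b))
same_rnn = np.allclose(final_state(seq_a), final_state(seq_b))
print(same_bag, same_rnn)                          # True False
```

In a real classifier, `final_state(...)` would be passed through a trained softmax or sigmoid layer; training is what turns this order sensitivity into a learned "not flips good" behavior.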

## 5. Limitations of Basic RNNs
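- **Vanishing Gradients**: Long-term dependencies (e.g., "the cat that ate" → "food") fade because the gradient is multiplied by a factor at every step it travels back.
- **Exploding Gradients**: The same repeated multiplication can blow up, causing very large weight updates and unstable training; gradient clipping is the usual remedy.
- **Sequential Nature**: Each step depends on the previous one, so time steps cannot be processed in parallel, which is slow for long sequences.
- **Short Memory**: In practice, often struggles beyond roughly 10 words without fixes.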
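
A deliberately simplified sketch of the first two limitations: it pushes a gradient back through a linear recurrence (the tanh derivative is omitted, and since that derivative is at most 1 it would only shrink the gradient further) and shows norm-based clipping. The scaled identity matrices and the `max_norm` value are assumptions picked to make the arithmetic easy to follow.

```python
import numpy as np

def backprop_norms(W_hh, steps):
    """Norm of a gradient after being pushed back through `steps` linear recurrences."""
    n = W_hh.shape[0]
    g = np.ones(n) / np.sqrt(n)          # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W_hh.T @ g                   # one step back in time (tanh derivative omitted)
        norms.append(np.linalg.norm(g))
    return norms

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

H = 4
small = 0.5 * np.eye(H)                  # every eigenvalue 0.5: gradients shrink
large = 1.5 * np.eye(H)                  # every eigenvalue 1.5: gradients grow

shrinking = backprop_norms(small, 20)
growing = backprop_norms(large, 20)
print(shrinking[-1])                     # 0.5**20, about 9.5e-07
print(growing[-1])                       # 1.5**20, about 3.3e+03

clipped = clip_by_norm(np.array([30.0, 40.0]), max_norm=5.0)
print(clipped)                           # [3. 4.]
```

After only 20 steps the gradient has changed by a factor of about a million in one case and a few thousand in the other, which is why plain RNNs find distant context hard to learn.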

## 6. Advanced RNN Variants
- **LSTM (Long Short-Term Memory)**:
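  - Adds gates: Forget ($f_t$), Input ($i_t$), Output ($o_t$) to control what is kept in, written to, and read from memory.
  - Cell state $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ retains long-term info ($\odot$ is element-wise multiplication); the hidden state is $h_t = o_t \odot \tanh(c_t)$.
  - Sentiment Benefit: Can carry a negation such as "not" across many words.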
- **GRU (Gated Recurrent Unit)**:
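  - Simplified LSTM with update ($z_t$) and reset ($r_t$) gates and no separate cell state.
  - $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ (some libraries, e.g., PyTorch, swap the roles of $z_t$ and $1 - z_t$). Fewer parameters than an LSTM (three weight sets instead of four), so usually faster, but less flexible.
  - Sentiment Benefit: Efficient for shorter texts.
- **Why Use**: Mitigate vanishing gradients and handle longer dependencies.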
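
The gate equations above can be sketched as single-step functions. This is an illustrative layout only: storing weights in dicts keyed by gate name, the candidate-state formulas for $\tilde{c}_t$ and $\tilde{h}_t$ (the standard ones, which the notes do not spell out), and all sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by 'f', 'i', 'o', 'c' (hypothetical layout)."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])       # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])       # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                           # cell state update
    h_t = o_t * np.tanh(c_t)                                     # exposed hidden state
    return h_t, c_t

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step; W, U, b are dicts keyed by 'z', 'r', 'h' (hypothetical layout)."""
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])       # update gate
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])       # reset gate
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ (r_t * h_prev) + b["h"])
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # convention used in these notes

def init(keys, input_dim, hidden_dim, rng):
    """Random weights and zero biases for each gate name in `keys`."""
    W = {k: rng.normal(scale=0.1, size=(hidden_dim, input_dim)) for k in keys}
    U = {k: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for k in keys}
    b = {k: np.zeros(hidden_dim) for k in keys}
    return W, U, b

def count_params(params):
    return sum(v.size for d in params for v in d.values())

rng = np.random.default_rng(0)
D, H, T = 3, 4, 6                                                # illustrative sizes
xs = rng.normal(size=(T, D))
lstm_params = init("fioc", D, H, rng)
gru_params = init("zrh", D, H, rng)

h, c = np.zeros(H), np.zeros(H)
for x_t in xs:
    h, c = lstm_step(x_t, h, c, *lstm_params)

g = np.zeros(H)
for x_t in xs:
    g = gru_step(x_t, g, *gru_params)

print(count_params(lstm_params), count_params(gru_params))      # 128 96
```

The 4:3 parameter ratio comes from the LSTM's four weight sets against the GRU's three. Note that $c_t$ is updated additively, which gives gradients a path that is not squashed by tanh at every step.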

## 7. Why RNNs Matter
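- **Bridge to Attention**: Attention was first added on top of RNN encoder-decoders, letting the model look back at all hidden states $h_t$ instead of relying only on the last one.
- **Pre-Transformer**: RNNs, including bidirectional variants, were the standard sequence models before Transformers replaced recurrence with self-attention.
- **Sentiment Evolution**: From Word2Vec's static embeddings to the RNN's dynamic context, then to attention/Transformers.
- **Practice**: Apply to "I am not happy" to see the impact of word order.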

## 8. Key Points (Milestones)
- **Structure**: A single layer with a loop; the concept dates to the ~1980s and was popularized in the 2010s.
- **Dims**: Hidden state $h_t$ typically has 100-512 dimensions.
- **Training**: Backpropagation through time (BPTT); gradient issues are common.
- **Next Step**: Learn LSTMs/GRUs, then attention for Transformers.
- **Sentiment**: Roughly 85% accuracy on SST-2 with LSTMs vs. roughly 80% with a Word2Vec-based baseline; treat both as ballpark figures that vary with the setup.