# RNN, Attention, and Transformers Learning Path Revision Notebook

A one-page, flashcard-style revision tool explaining why RNNs and attention are the natural next step after Word2Vec on the way to mastering Transformers. Designed for quick, deep review.

## 1. Why Learn RNNs and Attention Before Transformers?
- **Progression**: Word2Vec (static embeddings) → Contextual models (RNNs, attention) → Transformers (state-of-the-art).
- **Foundation**: RNNs introduce sequence processing, a precursor to Transformers' handling of text.
- **Attention Bridge**: Attention was first added on top of RNN encoder-decoders; Transformers keep the mechanism to focus on relevant words, which is critical for sentiment analysis.
- **Complexity**: Transformers drop recurrence but keep the two core ideas, order-aware sequence modeling (from RNNs) and attention (context), so understanding both simplifies the leap.
- **Practicality**: Sentiment tasks (e.g., sarcasm, negation) need sequential/contextual understanding, mastered via this path.

## 2. What Are RNNs?
- **Definition**: Neural networks that process sequences by maintaining a "memory" of previous inputs (e.g., words in a sentence).
- **How They Work**:
  - Updates hidden state $h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b)$ at each time step $t$.
  - $x_t$: Current input (e.g., word vector), $h_{t-1}$: Previous state, $f$: Activation (e.g., tanh).
- **Sentiment Use**: Captures order (e.g., "not good" as negative) via sequential memory.
- **Limitations**: Vanishing/exploding gradients, struggles with long dependencies.
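
The hidden-state update above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the weight values and the two-dimensional vectors standing in for "not" and "good" are hand-picked assumptions, and the helper names (`matvec`, `rnn_step`, `rnn_forward`) are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    from_input = matvec(W_xh, x_t)
    from_state = matvec(W_hh, h_prev)
    return [math.tanh(a + r + b_k) for a, r, b_k in zip(from_input, from_state, b)]

def rnn_forward(xs, W_xh, W_hh, b):
    """Apply rnn_step across a sequence, starting from a zero hidden state."""
    h = [0.0] * len(b)
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b)
        states.append(h)
    return states

# Toy values (assumptions for illustration): 2-d inputs, 2-d hidden state.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]
not_vec, good_vec = [1.0, 0.0], [0.0, 1.0]

h_not_good = rnn_forward([not_vec, good_vec], W_xh, W_hh, b)[-1]
h_good_not = rnn_forward([good_vec, not_vec], W_xh, W_hh, b)[-1]
print(h_not_good)
print(h_good_not)  # differs from the line above: the final state depends on word order
```

Because each state feeds into the next, the same two words in a different order produce a different final state, which is what lets a classifier on top of $h_t$ tell "not good" apart from a bag of the same words.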

## 3. What Are Advanced RNNs?
- **LSTM (Long Short-Term Memory)**:
  - Adds gates (forget, input, output) and a separate cell state to control memory, which largely mitigates the long-term dependency problem.
  - Sentiment Benefit: Remembers context over sentences (e.g., "not" affects "good").
- **GRU (Gated Recurrent Unit)**:
  - Simplified LSTM with only update and reset gates (no separate cell state); fewer parameters and usually faster to train, at some cost in flexibility.
  - Sentiment Benefit: Efficient for shorter texts.
- **Why Learn**: Base for bidirectional processing and attention integration.
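
To make the gating idea concrete, here is a minimal sketch of one LSTM step in plain Python, using the standard formulation (forget gate $f$, input gate $i$, output gate $o$, candidate $g$). The parameter dictionary layout, the helper names, and the extreme bias values are assumptions chosen only to show what the forget gate does; the weights are not learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update; p is a hypothetical dict of weights and biases."""
    z = x_t + h_prev  # concatenate input and previous hidden state
    f = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]    # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]    # input gate
    o = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]    # output gate
    g = [math.tanh(a) for a in affine(p["W_g"], z, p["b_g"])]  # candidate values
    # The cell state keeps a gated share of the old memory and adds gated new content.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def run(forget_bias, steps=50):
    """Start with cell state 1.0 and see how much survives `steps` updates."""
    zero_w = [[0.0, 0.0]]  # 1 hidden unit; input dim 1 + hidden dim 1
    p = {"W_f": zero_w, "b_f": [forget_bias],
         "W_i": zero_w, "b_i": [-10.0],  # input gate nearly closed
         "W_o": zero_w, "b_o": [0.0],
         "W_g": zero_w, "b_g": [0.0]}
    h, c = [0.0], [1.0]
    for _ in range(steps):
        h, c = lstm_step([0.0], h, c, p)
    return c[0]

kept = run(forget_bias=10.0)      # forget gate near 1: memory is carried forward
dropped = run(forget_bias=-10.0)  # forget gate near 0: memory is erased
print(kept, dropped)
```

With the forget gate held open, the cell state is still close to 1.0 after 50 steps; with it closed, the memory vanishes almost immediately. In a trained LSTM the gates are computed from the input and previous state, so the network learns when to do which.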

## 4. What is Attention?
- **Definition**: Mechanism to focus on relevant parts of the input sequence, weighting importance of each word.
- **How It Works**:
  - Computes attention scores $a_{ij} = \text{align}(h_i, h_j)$ between query $h_i$ (current state) and keys $h_j$ (all states).
  - Normalizes to weights $w_{ij} = \frac{\exp(a_{ij})}{\sum_k \exp(a_{ik})}$ (a softmax over the keys), then takes the weighted sum of the values to form a context vector $c_i = \sum_j w_{ij} h_j$.
- **Sentiment Use**: Highlights "not" in "not good" for accurate classification.
- **Advantage**: Overcomes RNN’s long-dependency limit by direct word connections.
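
The score, normalize, and sum steps above can be sketched as follows. Two assumptions to flag: the `align` function is taken to be a plain dot product (Bahdanau-style attention uses a small learned network instead), and the state vectors for "not" and "good" plus the query are made-up toy values. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, keys, values):
    """Score each key against the query, normalize, and average the values."""
    scores = [dot(query, k) for k in keys]  # a_ij, with align = dot product
    weights = softmax(scores)               # w_ij
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Toy hidden states (assumed values) for the two words in "not good".
words = ["not", "good"]
states = [[2.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.5]

weights, context = attend(query, states, states)
for word, w in zip(words, weights):
    print(word, round(w, 3))
```

With these toy numbers the query aligns more strongly with the state for "not", so that word receives the larger weight and dominates the context vector, regardless of how far apart the two words are in the sequence.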

## 5. How Do They Lead to Transformers?
- **RNN Evolution**: Bidirectional RNNs + attention (e.g., Bahdanau et al., 2014) improved machine translation and inspired Transformers.
- **Transformer Innovation**:
  - Replaces RNNs with self-attention, processing all words simultaneously.
  - Uses multi-head attention and feedforward layers, scaling to large data.
- **Sentiment Impact**: Captures full context (e.g., sarcasm across sentences), typically outperforming RNNs.
- **Why Learn Path**: RNNs teach sequence handling, attention adds focus, Transformers combine both efficiently.
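
As a rough sketch of the self-attention idea, the snippet below computes scaled dot-product attention where every position attends to every other position. It is simplified on purpose: queries, keys, and values are all the raw input vectors (a real Transformer applies learned projections and several heads), and the input matrix `X` is a made-up toy example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d_k = len(X[0])
    scale = math.sqrt(d_k)
    outputs, all_weights = [], []
    # Each position's output depends only on X, never on another position's
    # output, so the iterations are independent and could run in parallel.
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in X]
        w = softmax(scores)
        out = [sum(w_j * v[d] for w_j, v in zip(w, X)) for d in range(d_k)]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights

# Toy input (assumed values): three token vectors of dimension 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Unlike the RNN update, nothing here is carried from one position to the next, which is what allows all words to be processed simultaneously. The flip side is that shuffling the input rows only shuffles the output rows, so Transformers add positional information to the inputs to recover word order.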

## 6. Learning Path & Tips
- **Order**: Start with basic RNNs → LSTMs/GRUs → Attention → Transformers.
- **Resources**: 
  - RNNs: "Deep Learning" by Goodfellow, Bengio, and Courville (Chapter 10).
  - Attention: "Attention Is All You Need" (Vaswani et al., 2017); for the original RNN-based attention, Bahdanau et al. (2014).
  - Transformers: Hugging Face tutorials or "The Illustrated Transformer".
- **Practice**: Apply to sentiment (e.g., classify "I am not happy" with RNN vs. Transformer).
- **Goal**: Build intuition for BERT’s bidirectional attention, which is key for modern sentiment analysis.

## 7. Key Points (Milestones)
- **W2V**: Static embeddings, ~2013.
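- **RNNs**: Sequence models; the LSTM dates to 1997, with LSTMs/GRUs dominating NLP by ~2015.
- **Attention**: Contextual focus, ~2014 (Bahdanau et al.); self-attention replaces recurrence in 2017.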
- **Transformers**: Scalable context, 2017 onward (BERT, 2018).
- **Sentiment Gain**: From roughly 80% accuracy (W2V-based baselines) to above 90% (Transformers) on benchmarks like SST-2.