# Seq2Seq Models Revision Notebook

A one-page, flashcard-style revision tool covering sequence-to-sequence (seq2seq) models: why they emerged over standalone RNNs, how the architecture works, and how it applies to sentiment-related tasks. Designed for quick, deep review.

## 1. What Are Seq2Seq Models?
- **Definition**: A framework that uses two RNNs (an encoder and a decoder) to transform an input sequence into an output sequence whose length may differ from the input’s.
- **Context**: Introduced for machine translation; adaptable to sentiment analysis tasks.
- **Example**: Input "I am not happy" → Output "negative sentiment".
- **Goal**: Map variable-length inputs to variable-length outputs.

## 2. Why Seq2Seq Came Up (Why a Single RNN Wasn’t Enough)
- **Limitation 1: Single Direction Processing**:
  - A single RNN processes input sequentially, outputting one value per step (e.g., next word prediction).
  - Problem: Can’t naturally map an input sequence (e.g., "Je suis heureux") to a different output sequence (e.g., "I am happy") without a separate generation phase.
  - Example: A single RNN might predict "happy" after "suis," but lacks structure to translate the whole sentence.
- **Limitation 2: Fixed Input-Output Alignment**:
  - A per-step RNN emits one $y_t$ per $x_t$, so output length is tied to input length, but real tasks need flexible lengths.
  - Problem: Can’t compress a long input to a short output (e.g., review to "negative") or expand a short input to a long output.
  - Example: Input "bad film" → Output "this movie was terrible"; a single RNN struggles to adjust length dynamically.
- **Limitation 3: Context Loss in Long Sequences**:
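  - A single RNN’s final hidden state $h_T$ is a fixed-size summary, so information from early tokens tends to be overwritten; vanishing gradients during training also make long-range dependencies hard to learn.
  - Problem: In a long input containing "the film was not good," the contribution of "not" may fade, so the sentiment is misclassified.
  - Example: "I loved the start, but the end was terrible" → a single RNN may retain little of "loved" in $h_T$.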
- **Limitation 4: No Separate Encoding/Decoding**:
  - A single RNN entangles input processing and output generation, limiting flexibility.
  - Problem: Can’t "read" the input fully then "write" a new sequence; e.g., translating requires understanding before generating.
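  - Counterargument: Could a single RNN read, then autoregressively generate? Yes, but it must reuse the same weights for reading and writing, and it still squeezes the whole input into one hidden state, losing details (e.g., "Je suis" to "I am" fails if the state misses "suis"). Seq2seq’s encoder-decoder split gives each phase its own parameters and a clean read-then-write structure; the single-vector bottleneck itself remains until attention is added (Sections 5 and 6).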
- **Compelling Reason**: Seq2Seq (2014) emerged to overcome these, offering a robust encoder-decoder framework, later enhanced by attention and Transformers.
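
The length-alignment problem in Limitation 2 can be seen in a minimal sketch of a per-step RNN. The scalar weights `w_xh`, `w_hh` and `w_hy` are arbitrary illustrative values, not trained parameters; the point is only that the number of outputs always equals the number of inputs.

```python
import math

def rnn_outputs(inputs, w_xh=0.5, w_hh=0.8, w_hy=1.0):
    """Run a scalar tanh RNN and emit one output y_t per input x_t."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = math.tanh(w_xh * x + w_hh * h)  # update the hidden state
        outputs.append(w_hy * h)            # exactly one output per step
    return outputs

short = rnn_outputs([1.0, -1.0])
longer = rnn_outputs([1.0, -1.0, 0.5, 0.2, -0.3])
print(len(short), len(longer))  # 2 5
```

Nothing in this loop can produce a two-token answer for a five-token input, which is the gap the encoder-decoder split fills.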

## 3. Seq2Seq Architecture (Step-by-Step)
- **Encoder**:
  - Processes input sequence (e.g., "I am not happy") with RNN/LSTM.
  - Produces hidden states $h_1, h_2, ..., h_T$.
  - Outputs a fixed-size context vector (e.g., the final state $h_T$ or an average of all states) that summarizes the input.
- **Decoder**:
  - Takes the context vector as its initial state and generates the output sequence (e.g., "negative").
  - Uses another RNN, predicting one token at a time (e.g., "neg" → "ative").
  - Updates its hidden state $s_t$ from the previous state $s_{t-1}$, the previous output token, and the context.
- **Training**: Maximizes the likelihood of the target sequence given the input, using teacher forcing (the decoder is fed the gold previous token rather than its own prediction).
- **Flow**: Encoder compresses, decoder expands, bridging different lengths.
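
The encoder, decoder and teacher-forced objective above can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the vocabularies, the hidden size `H`, the `<sos>`/`<eos>` markers and the random weights are all hypothetical, and a simple tanh RNN stands in for the LSTM an actual model would likely use.

```python
import math
import random

random.seed(0)
H = 4                                                   # hidden size (assumed)
SRC_VOCAB = ["i", "am", "not", "happy"]                 # toy source vocabulary
TGT_VOCAB = ["<sos>", "<eos>", "negative", "positive"]  # toy target vocabulary

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def rnn_step(w_in, w_rec, x, h):
    """h_new = tanh(W_in x + W_rec h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]

def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Untrained, randomly initialised parameters (illustration only).
enc_in, enc_rec = rand_matrix(H, len(SRC_VOCAB)), rand_matrix(H, H)
dec_in, dec_rec = rand_matrix(H, len(TGT_VOCAB)), rand_matrix(H, H)
dec_out = rand_matrix(len(TGT_VOCAB), H)

def encode(tokens):
    """Return all hidden states h_1..h_T and the context vector h_T."""
    h = [0.0] * H
    states = []
    for tok in tokens:
        x = one_hot(SRC_VOCAB.index(tok), len(SRC_VOCAB))
        h = rnn_step(enc_in, enc_rec, x, h)
        states.append(h)
    return states, states[-1]

def decoder_step(prev_token, s):
    """One decoder update: new state s_t and a distribution over target tokens."""
    s = rnn_step(dec_in, dec_rec, one_hot(prev_token, len(TGT_VOCAB)), s)
    return s, softmax(matvec(dec_out, s))

def greedy_decode(context, max_len=5):
    """Generate tokens one at a time until <eos> or max_len is reached."""
    s, prev, out = context, TGT_VOCAB.index("<sos>"), []
    for _ in range(max_len):
        s, probs = decoder_step(prev, s)
        prev = max(range(len(probs)), key=probs.__getitem__)
        if TGT_VOCAB[prev] == "<eos>":
            break
        out.append(TGT_VOCAB[prev])
    return out

def teacher_forced_nll(context, target):
    """Negative log-likelihood of target when gold tokens are fed back in."""
    s, prev, nll = context, TGT_VOCAB.index("<sos>"), 0.0
    for tok in target + ["<eos>"]:
        s, probs = decoder_step(prev, s)
        gold = TGT_VOCAB.index(tok)
        nll -= math.log(probs[gold])
        prev = gold  # teacher forcing: feed the gold token, not the prediction
    return nll

states, context = encode(["i", "am", "not", "happy"])
decoded = greedy_decode(context)
loss = teacher_forced_nll(context, ["negative"])
print(len(states), len(context), decoded, loss)
```

Because the weights are random, the decoded tokens are meaningless; training would adjust the weights to lower `teacher_forced_nll` over many input-output pairs. The output length is bounded by `max_len`, not by the input length.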

## 4. How Seq2Seq Works for Sentiment-Related Tasks
- **Sentiment Classification Variant**: Encoder summarizes text, decoder outputs sentiment label or explanation.
- **Example**: Input "the movie was not good" → Output "negative" or "disappointing experience".
- **Advantage**: Handles variable-length reviews and separates reading from writing, which a single per-step RNN cannot do.
- **Limitation**: Early models rely on a single fixed-size context vector, so details from long inputs are lost.
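
When the output is a single label rather than a generated phrase, the decoder can shrink to one softmax layer over the context vector. The sketch below assumes a hypothetical two-label set, a toy vocabulary and random untrained weights, so the probabilities it prints carry no meaning beyond showing the data flow.

```python
import math
import random

random.seed(1)
H = 3                                           # hidden size (assumed)
LABELS = ["negative", "positive"]               # hypothetical label set
VOCAB = ["the", "movie", "was", "not", "good"]  # toy vocabulary

def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]

# Untrained, randomly initialised parameters (illustration only).
embed = {tok: rand_vec(H) for tok in VOCAB}  # input contribution per token
w_rec = [rand_vec(H) for _ in range(H)]      # recurrent weights
w_out = [rand_vec(H) for _ in LABELS]        # one weight row per label

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode(tokens):
    """Return the final hidden state h_T as the context vector."""
    h = [0.0] * H
    for tok in tokens:
        h = [math.tanh(e + dot(row, h)) for e, row in zip(embed[tok], w_rec)]
    return h

def classify(context):
    """Softmax over label scores computed from the context vector."""
    logits = [dot(row, context) for row in w_out]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(LABELS, exps)}

probs = classify(encode("the movie was not good".split()))
print(probs)  # untrained weights, so the probabilities are arbitrary
```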

## 5. Limitations of Basic Seq2Seq
- **Fixed Context Bottleneck**: Single vector (e.g., $h_T$) can’t capture full input for long sequences.
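- **Vanishing Gradient**: The encoder struggles to learn distant dependencies, as basic RNNs do; LSTM/GRU cells reduce the problem but do not remove it.
- **Slow Decoding**: Generates output one token at a time, limiting speed.
- **Solution Hint**: The attention mechanism (next step) removes the fixed-context bottleneck and gives the decoder direct access to distant tokens; decoding itself stays sequential.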

## 6. Attention Integration with Seq2Seq
- **Improvement**: Adds attention to weight encoder hidden states dynamically.
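- **Process**: At each decoder step $t$, the decoder computes a context $c_t = \sum_j w_{tj} h_j$, where the weights $w_{tj}$ are non-negative, sum to 1 over $j$ (a softmax over alignment scores), and highlight the relevant $h_j$ (e.g., "not" in "not good").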
- **Benefit**: Enhances long-sequence handling, boosting tasks like translation/sentiment.
- **Evolution**: Leads to self-attention in Transformers.
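
The context computation $c_t = \sum_j w_{tj} h_j$ can be sketched as follows. The dot-product scoring function is an assumption chosen for brevity (other scoring functions, such as a small feed-forward network, are also used), and the encoder states and decoder state are made-up toy values.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: weights w_tj and context c_t = sum_j w_tj * h_j."""
    scores = [sum(s * h for s, h in zip(decoder_state, h_j)) for h_j in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h_j[d] for w, h_j in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy h_1..h_3
decoder_state = [0.0, 2.0]                              # toy decoder state
weights, context = attention_context(decoder_state, encoder_states)
print(weights, context)  # the largest weight falls on the second encoder state
```

The context is recomputed at every decoder step, so the decoder is no longer limited to a single summary vector.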

## 7. Why Seq2Seq Matters for Your Path
- **Bridge to Attention**: Introduces encoder-decoder, setting stage for attention mechanisms.
- **Pre-Transformer**: Foundation for Transformer’s encoder-decoder architecture.
- **Sentiment Practice**: Test "the film was not entertaining" → "negative" with/without attention.
- **Next Step**: Learn attention with seq2seq, then self-attention/Transformers.

## 8. Key Points (Milestones)
- **Origin**: Introduced for machine translation, ~2014 (encoder-decoder: Sutskever et al., Cho et al.; attention: Bahdanau et al.).
- **Structure**: Encoder RNN + Decoder RNN, flexible sequence lengths.
- **Improvement**: Better suited than a single RNN to variable-length outputs; the ~85% figure for sentiment tasks is indicative only and depends on the dataset and model.
- **Limit**: Fixed context; attention resolves this, leading to Transformers.