# seq2seq 

## background

### the problem with traditional neural networks

traditional neural networks like anns (artificial neural networks) and cnns (convolutional neural networks) weren't cutting it for text data.

two main reasons:

- **fixed input size**: these models typically expect a fixed input size, which doesn't work well for variable-length sequences like sentences.
- **lack of temporal understanding**: they don't naturally capture the order and context of words in a sequence.

### rnns (recurrent neural networks)

rnns were introduced to handle sequential data better. they process input sequentially, maintaining a hidden state that can capture some context. however, they had their own issues:

- **vanishing gradient problem**: during backpropagation through time, gradients shrink roughly exponentially with sequence length, so rnns struggle to learn from earlier time steps.
- **limited context**: as a result, they have trouble capturing long-term dependencies in the data.
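
to make the vanishing gradient concrete, here is a minimal sketch using a scalar rnn `h_t = tanh(w * h_{t-1} + u * x_t)`. the function name, the weights and the inputs are made up for illustration; real rnns use weight matrices, but the same chain-rule product appears.

```python
import math

def gradient_through_time(w, inputs, u=1.0, h0=0.0):
    """Return d h_T / d h_0 for the scalar rnn h_t = tanh(w * h_{t-1} + u * x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
        # chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# hypothetical weight and inputs, chosen only for illustration
short_grad = gradient_through_time(w=0.9, inputs=[0.5] * 5)
long_grad = gradient_through_time(w=0.9, inputs=[0.5] * 50)
print(abs(short_grad), abs(long_grad))
```

each factor in the product is at most `|w|`, so with `|w| < 1` the gradient with respect to the earliest state decays at least geometrically as the sequence grows.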

### lstm (a partial fix)

long short-term memory (lstm) networks were designed to address these rnn problems:

- **gating mechanisms**: lstms use gates to control the flow of information, helping to mitigate the vanishing gradient problem.
- **better at long-term dependencies**: they can carry relevant information across longer sequences.

but even lstms (and their variants like grus - gated recurrent units) struggle with very long sequences.
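
the gating idea can be sketched with a single scalar lstm cell. this is a simplified illustration, not a production implementation: the parameter layout (`p` mapping a gate name to `(w_x, w_h, bias)`) and all numeric values are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar lstm cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))   # how much of the old cell state to keep
    i = sigmoid(pre("input"))    # how much new information to write
    o = sigmoid(pre("output"))   # how much of the cell state to expose
    g = math.tanh(pre("cand"))   # candidate values to write
    c = f * c_prev + i * g       # additive cell-state update
    h = o * math.tanh(c)
    return h, c

# hypothetical parameters: forget gate almost fully open, input gate almost closed
params = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
          "output": (0.0, 0.0, 0.0), "cand": (1.0, 1.0, 0.0)}
h, c = 0.0, 1.0
for x in [0.3, -0.7, 0.2] * 20:   # 60 time steps
    h, c = lstm_step(x, h, c, params)
print(c)
```

because the cell state is updated additively and scaled by the forget gate, a gate value close to 1 lets the initial cell state survive many steps almost unchanged, which is what helps with long-term dependencies.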

## seq2seq

seq2seq (sequence-to-sequence) models were designed to handle tasks where both input and output are sequences, like machine translation.

### core idea

the seq2seq model consists of two main parts:

1. **encoder**: processes the input sequence
2. **decoder**: generates the output sequence

this architecture allows the model to map sequences of different lengths, which is crucial for tasks like translation where input and output lengths may vary.

### how seq2seq works

let's break down the process:

1. **input processing**:
   - text input is tokenized (split into words or subwords)
   - tokens are converted to numerical representations via an embedding layer

2. **encoding**:
   - the embedded input sequence is fed into the encoder (usually lstm-based)
   - encoder processes the sequence, updating its hidden state at each step
   - final hidden state of the encoder captures the essence of the input sequence

3. **context vector**:
   - the final hidden state of the encoder becomes the "context vector"
   - this vector is meant to encapsulate the meaning of the entire input sequence

4. **decoding**:
   - decoder initializes its hidden state with the context vector
   - at each step, the decoder:
     - takes the previous output and its current hidden state as input
     - produces a probability distribution over the output vocabulary
     - selects the most likely token as the output for that step

5. **output generation**:
   - the process continues until the decoder generates an end-of-sequence token or reaches a maximum length
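
the five steps above can be sketched end to end in plain python. this is a toy under loud assumptions: the vocabularies are invented, the weights are random and untrained (so the output is meaningless), and a plain tanh rnn cell stands in for the lstm to keep the code short.

```python
import math
import random

random.seed(0)
SRC_VOCAB = {"<unk>": 0, "i": 1, "like": 2, "tea": 3}   # hypothetical vocabularies
TGT_VOCAB = ["<sos>", "<eos>", "ich", "mag", "tee"]
HIDDEN = 4

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rnn_step(w_x, w_h, x, h):
    # plain tanh rnn cell, standing in for an lstm cell
    return [math.tanh(a + b) for a, b in zip(matvec(w_x, x), matvec(w_h, h))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# untrained, randomly initialised parameters
src_emb = rand_matrix(len(SRC_VOCAB), HIDDEN)
tgt_emb = rand_matrix(len(TGT_VOCAB), HIDDEN)
enc_wx, enc_wh = rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, HIDDEN)
dec_wx, dec_wh = rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, HIDDEN)
out_w = rand_matrix(len(TGT_VOCAB), HIDDEN)

def encode(sentence):
    # 1. input processing: tokenize, then map tokens to embedding rows
    ids = [SRC_VOCAB.get(tok, SRC_VOCAB["<unk>"]) for tok in sentence.lower().split()]
    h = [0.0] * HIDDEN
    # 2. encoding: update the hidden state token by token
    for i in ids:
        h = rnn_step(enc_wx, enc_wh, src_emb[i], h)
    return h  # 3. the final hidden state is the context vector

def greedy_decode(context, max_len=10):
    # 4. decoding: start from the context vector and the start-of-sequence token
    h, prev, output = context, TGT_VOCAB.index("<sos>"), []
    for _ in range(max_len):                # 5. stop at max_len ...
        h = rnn_step(dec_wx, dec_wh, tgt_emb[prev], h)
        probs = softmax(matvec(out_w, h))   # distribution over the output vocabulary
        prev = max(range(len(probs)), key=probs.__getitem__)  # most likely token
        if TGT_VOCAB[prev] == "<eos>":      # ... or at end-of-sequence
            break
        output.append(TGT_VOCAB[prev])
    return output

context = encode("I like tea")
translation = greedy_decode(context)
print(translation)
```

note that the input has three tokens while the output length is decided by the decoder alone, which is how the architecture maps between sequences of different lengths.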

### important findings in seq2seq

1. **separate encoder and decoder**:
   - allows handling different languages or domains for input and output
   - increases the number of parameters at little extra computational cost
   - makes it natural to train on several language pairs at once, adding flexibility

2. **deep lstms**:
   - stacking multiple lstm layers (4 in the original seq2seq paper) in both encoder and decoder
   - increases model capacity to capture complex patterns
   - helps maintain long-term dependencies

3. **input reversal**:
   - reversing the order of input tokens (but not output tokens)
   - puts the first source words close to the first target words, creating many short-range dependencies between source and target
   - makes optimization easier for gradient-based methods like sgd

4. **attention mechanism** (a later addition):
   - allows decoder to focus on different parts of input for each output token
   - significantly improves performance, especially for long sequences
   - paved the way for transformer models
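
as a rough illustration of the attention idea from point 4, here is a sketch of dot-product scoring for a single decoder step. this is one simple scoring variant chosen for brevity (the earliest attention models used a small learned network to compute scores), and the state values are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step."""
    # score each encoder state by its dot product with the decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    # context = attention-weighted average of the encoder states
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]   # made-up values
weights, context = dot_attention([2.0, 0.0], encoder_states)
print(weights, context)
```

instead of relying on one fixed context vector, the decoder recomputes `context` at every step, so each output token can draw on a different part of the input.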


### beam search decoding

instead of greedily selecting the most probable token at each step, beam search maintains multiple candidate sequences:

- keeps top-k most likely sequences at each step
- improves output quality by exploring more possibilities
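
a minimal sketch of beam search over a hypothetical toy model follows. `TOY_MODEL` and `next_probs` are invented stand-ins for a real decoder, and refinements such as length normalization are left out.

```python
import math

# hypothetical toy model: next-token probabilities given the prefix so far
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def next_probs(prefix):
    # any prefix not in the table ends the sequence
    return TOY_MODEL.get(tuple(prefix), {"<eos>": 1.0})

def beam_search(k, max_len=5):
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # finished: carry over unchanged
                continue
            for tok, p in next_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # keep only the k highest-scoring candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(t[-1] == "<eos>" for t, _ in beams):
            break
    return beams

greedy = beam_search(k=1)[0]   # k=1 is equivalent to greedy decoding
best = beam_search(k=2)[0]
print(greedy, best)
```

in this toy table, greedy decoding commits to `a` (probability 0.6) and ends with a sequence of probability 0.3, while a beam of 2 keeps `b` alive and finds `b x` with probability 0.36.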

### handling unknown words

seq2seq models struggle with words not in their vocabulary. solutions include:

- subword tokenization (e.g., byte-pair encoding)
- pointer-generator networks for copying unknown words from input
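
to show how subword tokenization sidesteps unknown words, here is a small sketch of byte-pair encoding: learn merge rules from word frequencies, then apply them to a word that never appeared in training. the function names and the tiny corpus are assumptions for illustration; real tokenizers add many details this omits.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # start from single characters plus an end-of-word marker
    vocab = {tuple(word) + ("</w>",): n for word, n in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, n in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): n for symbols, n in vocab.items()}
    return merges

def segment(word, merges):
    symbols = tuple(word) + ("</w>",)
    for pair in merges:   # apply merges in the order they were learned
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# tiny made-up corpus; "lowest" never appears in it
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
pieces = segment("lowest", merges)
print(pieces)
```

the unseen word is still representable because it decomposes into subword pieces the model has already seen, and in the worst case into single characters.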

### bidirectional encoders

a bidirectional lstm encoder reads the input both left-to-right and right-to-left, so the representation of each position captures context from both directions of the input sequence.
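
a minimal sketch of the bidirectional idea, using a scalar tanh rnn in place of an lstm for brevity. the weights are hypothetical, and each direction gets its own parameters, as it would in a real bidirectional layer.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Scalar tanh rnn; returns the hidden state after each input."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_encode(inputs):
    forward = run_rnn(inputs, w_x=0.8, w_h=0.5)
    # run over the reversed input, then flip back to align with original positions
    backward = run_rnn(inputs[::-1], w_x=0.6, w_h=0.4)[::-1]
    # each position gets a (forward, backward) pair of states
    return list(zip(forward, backward))

states = bidirectional_encode([1.0, 0.0, 0.0, -1.0])
print(states)
```

at position 0 the forward state has seen only the first token, while the backward state has already read every later token, so the pair carries context from both sides.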


## limitations 


- still struggle with very long sequences, since without attention a single fixed-size context vector has to summarize the whole input
- computationally intensive, especially during training
- require large amounts of parallel data for training
