## Sequence-to-Sequence (Seq2Seq) Models

Sequence-to-sequence (Seq2Seq) models are encoder-decoder neural architectures designed to transform one sequence into another. They were originally built from recurrent neural networks (RNNs), and the same design is now commonly implemented with Transformers. Unlike models that map a sequence to a single label, or a single input to a sequence, **Seq2Seq models handle both variable-length input and output sequences.**

They typically consist of two main components:

* **Encoder:** Processes the input sequence and compresses it into a fixed-length vector representation (the "context vector" or "thought vector"). This vector aims to capture the essential information from the entire input sequence.
* **Decoder:** Takes the context vector as input and generates the output sequence, step by step. It uses the information in the context vector and previously generated outputs to predict the next element in the sequence.

**Key characteristic:** Both input and output are sequences of potentially different lengths.
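
The encoder/decoder split described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the weights are random and untrained, the sizes and the `BOS`/`EOS` indices are made up for the example, and the names `encode` and `decode` are hypothetical helpers rather than part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 6, 4   # illustrative sizes (assumption)
BOS, EOS = 4, 5                 # assumed special-token indices

# Randomly initialised, untrained parameters
W_xh_enc = rng.normal(size=(vocab_size, hidden_dim))
W_hh_enc = rng.normal(size=(hidden_dim, hidden_dim))
W_xh_dec = rng.normal(size=(vocab_size, hidden_dim))
W_hh_dec = rng.normal(size=(hidden_dim, hidden_dim))
W_hy = rng.normal(size=(hidden_dim, vocab_size))

def encode(token_ids):
    """Run the encoder RNN; the final hidden state is the context vector."""
    h = np.zeros(hidden_dim)
    for idx in token_ids:
        # W[idx] is the same as one_hot(idx) @ W
        h = np.tanh(W_xh_enc[idx] + h @ W_hh_enc)
    return h

def decode(context, max_len=10):
    """Greedy decoding: start from <BOS>, stop at <EOS> or after max_len steps."""
    h, token, output = context, BOS, []
    for _ in range(max_len):
        h = np.tanh(W_xh_dec[token] + h @ W_hh_dec)
        token = int(np.argmax(h @ W_hy))  # most likely next token
        if token == EOS:
            break
        output.append(token)
    return output

context = encode([2, 1, 3])      # a 3-token input sequence
generated = decode(context)      # output length is decided by the decoder
print("Context vector shape:", context.shape)
print("Generated token ids:", generated)
```

Because the weights are untrained, the generated ids are meaningless; the point is that the input length (3) and the output length are independent, linked only by the fixed-size context vector.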

---

## Comparison with Label-to-Sequence and Sequence-to-Label Models

Here's a brief comparison:

| Model Type          | Input                      | Output                     | Use Cases                                                              |
| :------------------ | :------------------------- | :------------------------- | :----------------------------------------------------------------------- |
| **Sequence-to-Label** | Variable-length sequence   | Single label/classification | Sentiment analysis (text -> positive/negative), text categorization   |
| **Label-to-Sequence** | Single label/input vector  | Variable-length sequence   | Image captioning (image features -> sentence), music generation (genre -> melody) |
| **Sequence-to-Sequence** | Variable-length sequence   | Variable-length sequence   | Machine translation (English sentence -> French sentence), text summarization (long text -> short summary), chatbot responses |

**In essence:**

* **Sequence-to-Label:** Focuses on understanding the *entire* input sequence to make a single prediction.
* **Label-to-Sequence:** Generates a sequence based on a single starting point or concept.
* **Sequence-to-Sequence:** Deals with transforming a complex, ordered input into a new, ordered output, where the length and elements of both can vary.

---
**Step 1: Sample Sentence**

Let's consider an English sentence and its Spanish translation:


In [3]:
# Input and target sentence
input_sentence = "how are you"
output_sentence = "cómo estás tú"

**Step 2: Tokenization**

Tokenization is the process of splitting a sentence into individual words (or subwords).

It lets the model process a sentence one token at a time.

For the phrase "how are you", the tokenized form would be:

> `["how", "are", "you"]`

Each word becomes a separate token.

In [4]:
input_tokens = input_sentence.lower().split()
output_tokens = output_sentence.lower().split()
print("Input Tokens:", input_tokens)
print("Output Tokens:", output_tokens)


Input Tokens: ['how', 'are', 'you']
Output Tokens: ['cómo', 'estás', 'tú']
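
Whitespace splitting leaves punctuation attached to words (for example `"you?"`). A slightly more careful tokenizer can be sketched with a regular expression. The function name `tokenize` is a hypothetical helper, and real systems usually rely on a trained subword tokenizer instead.

```python
import re

def tokenize(text):
    # \w+ matches runs of letters/digits (Unicode-aware in Python 3);
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace, i.e. punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenize("How are you?"))      # ['how', 'are', 'you', '?']
print(tokenize("¿Cómo estás tú?"))   # ['¿', 'cómo', 'estás', 'tú', '?']
```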


**Step 3: Vocabulary Creation (Word → Integer)**

After tokenization, we create a vocabulary, which is a mapping from each unique token in our dataset to a unique integer. This allows the model to work with numerical representations instead of raw text.

For our example tokens: `["how", "are", "you"]`, a possible vocabulary could be:

| Word | Integer |
|---|---|
| how | 1 |
| are | 2 |
| you | 3 |

So, the tokenized sentence `["how", "are", "you"]` would be represented as the integer sequence:

> `[1, 2, 3]`

In [5]:
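def build_vocab(tokens):
    # Unique words get indices 1..N. Note: iterating over a set has no
    # guaranteed order, so the word-to-index assignment can change between runs.
    vocab = {word: idx + 1 for idx, word in enumerate(set(tokens))}
    vocab["<PAD>"] = 0           # index 0 is reserved for padding
    vocab["<BOS>"] = len(vocab)  # next free index (N + 1)
    vocab["<EOS>"] = len(vocab)  # next free index (N + 2)
    return vocab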

input_vocab = build_vocab(input_tokens)
output_vocab = build_vocab(output_tokens)

print("Input Vocab:", input_vocab)
print("Output Vocab:", output_vocab)


Input Vocab: {'are': 1, 'how': 2, 'you': 3, '<PAD>': 0, '<BOS>': 4, '<EOS>': 5}
Output Vocab: {'tú': 1, 'estás': 2, 'cómo': 3, '<PAD>': 0, '<BOS>': 4, '<EOS>': 5}
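
Because `set` iteration order is not guaranteed, the indices printed above can change from run to run. One way to make the mapping reproducible is to assign indices in first-seen order. The sketch below uses a hypothetical helper named `build_vocab_ordered`; it is an alternative, not the function used in the cells of this notebook.

```python
def build_vocab_ordered(tokens):
    vocab = {"<PAD>": 0}
    # dict.fromkeys removes duplicates while keeping first-seen order
    for word in dict.fromkeys(tokens):
        vocab[word] = len(vocab)
    vocab["<BOS>"] = len(vocab)
    vocab["<EOS>"] = len(vocab)
    return vocab

print(build_vocab_ordered(["how", "are", "you"]))
# {'<PAD>': 0, 'how': 1, 'are': 2, 'you': 3, '<BOS>': 4, '<EOS>': 5}
```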


**Step 4: Encode Sentences (Words → Numbers)**

Now, we take our tokenized sentence and convert each token into its corresponding integer based on the vocabulary we created. We also typically add special tokens to signal the start and end of a sentence to the model.

For this walkthrough, assume an illustrative vocabulary in which `<BOS>` (Begin of Sentence) maps to index 0 and `<EOS>` (End of Sentence) maps to index 4. These indices differ from the ones the `build_vocab` function above assigns (`<PAD>`: 0, `<BOS>`: 4, `<EOS>`: 5); only the procedure is the same. Encoding our example sentence "how are you" then looks like this:

1. **Add `<BOS>` at the beginning:** `["<BOS>", "how", "are", "you"]`
2. **Add `<EOS>` at the end:** `["<BOS>", "how", "are", "you", "<EOS>"]`
3. **Map each token to its integer index:**

   | Token | Integer |
   |---|---|
   | `<BOS>` | 0 |
   | how | 1 |
   | are | 2 |
   | you | 3 |
   | `<EOS>` | 4 |

Therefore, the encoded representation of the sentence "how are you" becomes the numerical sequence:

> `[0, 1, 2, 3, 4]`

In [6]:
def encode_sentence(tokens, vocab, add_bos_eos=False):
    # Map each token to its integer index
    encoded = [vocab[word] for word in tokens]
    if add_bos_eos:
        # Wrap the sequence in start/end markers (used for the decoder target)
        return [vocab["<BOS>"]] + encoded + [vocab["<EOS>"]]
    return encoded

input_encoded = encode_sentence(input_tokens, input_vocab)
output_encoded = encode_sentence(output_tokens, output_vocab, add_bos_eos=True)

print("Encoded Input:", input_encoded)
print("Encoded Output:", output_encoded)


Encoded Input: [2, 1, 3]
Encoded Output: [4, 3, 2, 1, 5]
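
The reverse direction, turning integers back into words, is what a decoder's output needs before it can be read. A minimal sketch, assuming the output vocabulary printed earlier (copied here as `example_vocab`) and a hypothetical helper `decode_sentence`:

```python
def decode_sentence(ids, vocab, strip_special=True):
    # Invert the word -> index mapping
    inverse = {idx: word for word, idx in vocab.items()}
    special = {"<PAD>", "<BOS>", "<EOS>"}
    words = [inverse[i] for i in ids]
    if strip_special:
        words = [w for w in words if w not in special]
    return words

example_vocab = {"tú": 1, "estás": 2, "cómo": 3, "<PAD>": 0, "<BOS>": 4, "<EOS>": 5}
print(decode_sentence([4, 3, 2, 1, 5], example_vocab))
# ['cómo', 'estás', 'tú']
```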


**Step 5: Padding (to equal length for batching)**

When training neural networks, it's often more efficient to process data in batches of sentences rather than one at a time. However, sentences in a dataset typically have varying lengths. To create batches of equal size, we use a technique called padding.

Padding involves adding a special "padding" token (often `<PAD>`) to the end of shorter sequences until they reach the length of the longest sequence in the batch. In the illustrative vocabulary used in this walkthrough, `<PAD>` has index 5; the code cells use index 0 instead.

**Example:**

Let's say we have another encoded sentence:

> `[0, 1, 4]`  (representing "<BOS> how <EOS>")

And our current encoded sentence is:

> `[0, 1, 2, 3, 4]` (representing "<BOS> how are you <EOS>")

If we want to batch these two sentences, the maximum length is 5. The shorter sentence needs to be padded:

Original shorter sentence: `[0, 1, 4]`

Padded shorter sentence (to length 5): `[0, 1, 4, 5, 5]`

Now both sentences have the same length and can be processed together in a batch:

> `[[0, 1, 2, 3, 4],`
>
> ` [0, 1, 4, 5, 5]]`

The `<PAD>` tokens are usually masked or ignored by the model during training so they don't contribute to the learning process.

In [7]:
def pad_sequence(seq, max_len):
    # Append the <PAD> index (0 in build_vocab) until the sequence has max_len items
    return seq + [0] * (max_len - len(seq))

max_input_len = 5
max_output_len = 6

input_padded = pad_sequence(input_encoded, max_input_len)
output_padded = pad_sequence(output_encoded, max_output_len)

print("Padded Input:", input_padded)
print("Padded Output:", output_padded)


Padded Input: [2, 1, 3, 0, 0]
Padded Output: [4, 3, 2, 1, 5, 0]
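
Padding a whole batch and building the mask mentioned above can be sketched as follows. `pad_batch` is a hypothetical helper, and the sketch assumes `<PAD>` has index 0 and that no real token shares that index.

```python
import numpy as np

PAD_ID = 0  # assumed <PAD> index, as in the code cells

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one; return the batch and a boolean mask."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    mask = batch != pad_id  # True for real tokens, False for padding
    return batch, mask

batch, mask = pad_batch([[4, 3, 2, 1, 5], [4, 1, 5]])
print(batch)
print(mask)
```

During training, a loss computed per position would typically be multiplied by `mask` so that padded positions contribute nothing.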


**Step 6: One-Hot Encoding**

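While the integer encoding from the previous steps provides a numerical representation, feeding raw indices to a network implies an ordering that does not exist (index 3 is not "larger" than index 1 in any meaningful sense). One-hot encoded vectors avoid this by treating each word index as a separate category.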

One-hot encoding transforms each integer into a binary vector where only the index corresponding to the integer has a value of 1, and all other indices have a value of 0. The length of the vector is equal to the size of the vocabulary (including special tokens like `<BOS>`, `<EOS>`, and `<PAD>`).

**Example (using our vocabulary of size 6: {`<BOS>`: 0, `how`: 1, `are`: 2, `you`: 3, `<EOS>`: 4, `<PAD>`: 5}):**

Let's take our encoded sentence: `[0, 1, 2, 3, 4]` (we'll ignore the padded example for simplicity in this step).

Each integer in this sequence will be converted into a 6-dimensional one-hot vector:

* **0** (`<BOS>`) becomes: `[1, 0, 0, 0, 0, 0]`
* **1** (`how`) becomes: `[0, 1, 0, 0, 0, 0]`
* **2** (`are`) becomes: `[0, 0, 1, 0, 0, 0]`
* **3** (`you`) becomes: `[0, 0, 0, 1, 0, 0]`
* **4** (`<EOS>`) becomes: `[0, 0, 0, 0, 1, 0]`

So, the one-hot encoded representation of the sentence "how are you" would be a sequence of these vectors:

> `[[1, 0, 0, 0, 0, 0],`  (`<BOS>`)
>
> ` [0, 1, 0, 0, 0, 0],`  (`how`)
>
> ` [0, 0, 1, 0, 0, 0],`  (`are`)
>
> ` [0, 0, 0, 1, 0, 0],`  (`you`)
>
> ` [0, 0, 0, 0, 1, 0]]` (`<EOS>`)

Each word in the input sequence is now represented by a sparse vector that the neural network can process. While effective, one-hot encoding becomes memory-intensive for large vocabularies, because every vector is as long as the vocabulary. This is why dense, learned embeddings are often used in practice.

In [9]:
import numpy as np

def one_hot_encode(seq, vocab_size):
    one_hot = np.zeros((len(seq), vocab_size))
    for t, val in enumerate(seq):
        one_hot[t, val] = 1
    return one_hot

input_vocab_size = len(input_vocab)
output_vocab_size = len(output_vocab)

input_oh = one_hot_encode(input_padded, input_vocab_size)
output_oh = one_hot_encode(output_padded, output_vocab_size)

print("One-hot Encoded Input Shape:", input_oh.shape)
print("One-hot Encoded Output Shape:", output_oh.shape)


One-hot Encoded Input Shape: (5, 6)
One-hot Encoded Output Shape: (6, 6)
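
The link between one-hot vectors and embeddings can be made concrete: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, so an embedding layer can skip the multiplication and index directly. The matrix `E` below is random and its size is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3           # illustrative sizes (assumption)
E = rng.normal(size=(vocab_size, embed_dim))

ids = np.array([2, 1, 3, 0, 0])        # a padded, integer-encoded sequence
one_hot = np.eye(vocab_size)[ids]      # shape (5, 6)

via_matmul = one_hot @ E               # one-hot route: (5, 6) @ (6, 3)
via_lookup = E[ids]                    # embedding route: plain row lookup

print(via_matmul.shape)                      # (5, 3)
print(np.allclose(via_matmul, via_lookup))   # True
```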


## Recurrent Neural Networks (RNNs)

**Why Do We Need Memory in Models?**

Traditional neural networks, such as those built with simple feedforward layers, process each input independently without considering the order or context of previous inputs in a sequence. This limitation makes them unsuitable for tasks where the order of elements is crucial for understanding meaning.

**For example:**

Consider these two sentences:

* Sentence A: "The cat ate the food."
* Sentence B: "The food ate the cat."

Both sentences consist of the exact same words. However, the different order of these words completely changes the meaning of the sentences. A feedforward network, processing each word in isolation, would struggle to differentiate between these two semantically distinct phrases.

**RNNs are designed to remember previous inputs while processing the current input.** They achieve this through a mechanism called a **hidden state**.

At each step of processing a sequence, an RNN takes the current input and the previous hidden state as input. It then produces an output for the current step and updates its hidden state. This updated hidden state carries information about the history of the sequence encountered so far, effectively acting as the model's "memory". This allows RNNs to understand the relationships and dependencies between elements in a sequence, making them well-suited for tasks involving sequential data like text, time series, and audio.

---
**Example Sentence for RNN**

Let's use a simple toy sentence to illustrate how an RNN might process it:

In [11]:
sentence = input("Enter a sentence: ")

Enter a sentence: hello there


**Step 1: Tokenization**



In [12]:
tokens = sentence.lower().split()
print("Tokens:", tokens)


Tokens: ['hello', 'there']


**Step 2: Vocabulary with `<BOS>` and `<EOS>`**

Before encoding, we define our vocabulary, including the special tokens for the beginning and end of the sentence. For our example, let's say our vocabulary and their corresponding integer indices are:


In [13]:
def build_vocab_with_special(tokens):
    vocab = {word: idx + 1 for idx, word in enumerate(set(tokens))}
    vocab["<PAD>"] = 0
    vocab["<BOS>"] = len(vocab)
    vocab["<EOS>"] = len(vocab)
    return vocab

vocab = build_vocab_with_special(tokens)
print("Vocab:", vocab)


Vocab: {'there': 1, 'hello': 2, '<PAD>': 0, '<BOS>': 3, '<EOS>': 4}


**Step 3: Encode with `<BOS>` and `<EOS>`**

Now, we take our tokenized sentence and encode it using the vocabulary we defined in Step 2, including the `<BOS>` and `<EOS>` tokens:



In [14]:
def encode_with_bos_eos(tokens, vocab):
    return [vocab["<BOS>"]] + [vocab[t] for t in tokens] + [vocab["<EOS>"]]

encoded = encode_with_bos_eos(tokens, vocab)
print("Encoded Sentence:", encoded)


Encoded Sentence: [3, 2, 1, 4]


**Step 4: One-hot Encoding for Input**

In [15]:
def one_hot_encode(seq, vocab_size):
    one_hot = np.zeros((len(seq), vocab_size))
    for t, val in enumerate(seq):
        one_hot[t, val] = 1
    return one_hot

vocab_size = len(vocab)
input_seq = one_hot_encode(encoded, vocab_size)
print("Input Shape:", input_seq.shape)


Input Shape: (4, 5)


**Step 5: The RNN Forward Pass — Why Do We Need It?**

**First, Recall the Two Sentences:**

* Sentence A: "The cat ate the food."
* Sentence B: "The food ate the cat."

These two sentences have exactly the same words, just in a different order. But the meaning is completely different — in A, the cat is eating; in B, the food is eating (which is nonsense!).

So the model **must understand order**.

**What Happens if We Don’t Use an RNN?**

If we use a normal feedforward neural network, each word is treated independently — it won’t know what came before or after.

That means:

"cat" -> output vector

"ate" -> output vector

"food" -> output vector


But it won’t know if "cat" came before "food" or after.

**So it treats Sentence A and Sentence B as the same!**
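
The "same words, different order" problem can be checked directly by comparing word counts, which is all an order-blind model gets to see. The sentences are lowercased and stripped of punctuation for this sketch.

```python
from collections import Counter

sentence_a = "the cat ate the food".split()
sentence_b = "the food ate the cat".split()

# Identical word counts: an order-blind model cannot tell them apart
print(Counter(sentence_a) == Counter(sentence_b))  # True

# As ordered sequences they are different
print(sentence_a == sentence_b)                    # False
```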

**What Does Step 5 (RNN Forward Pass) Do?**

RNN introduces **memory**.

At every time step $t$, it computes the hidden state $h_t$ based on the current input $x_t$ and the previous hidden state $h_{t-1}$:

$$h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b)$$

That means:

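* $x_t$ is the current word at time $t$ (e.g., "the", "cat", etc.), represented as a vector
* $h_{t-1}$ is the memory from the previous step, i.e. the context
* $W_{xh}$, $W_{hh}$, and $b$ are the learned parameters: input-to-hidden weights, hidden-to-hidden weights, and a bias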
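
The code in this notebook stores each input as a row vector, so the same update can also be written with the weights on the right. As a shape check, with $V$ the vocabulary size and $H$ the hidden size (both symbols are introduced here only for illustration):

$$h_t = \tanh(x_t W_{xh} + h_{t-1} W_{hh} + b), \qquad x_t \in \mathbb{R}^{1 \times V},\; W_{xh} \in \mathbb{R}^{V \times H},\; W_{hh} \in \mathbb{R}^{H \times H},\; b,\, h_t \in \mathbb{R}^{1 \times H}$$

When $x_t$ is one-hot, the product $x_t W_{xh}$ just picks out the row of $W_{xh}$ that belongs to the current word.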

**So:**

When the RNN sees the word “ate”, its interpretation depends on what it saw before:

* In Sentence A: it saw “the cat” $\rightarrow$ “ate” likely means the subject is “cat”
* In Sentence B: it saw “the food” $\rightarrow$ “ate” likely means the subject is “food”

**Example Flow (in Intuition)**

**Sentence A: "The cat ate the food"**

| Step | Word  | Hidden state ($h_t$) remembers |
|------|-------|------------------------------|
| 1    | "The" | Start of sentence            |
| 2    | "cat" | "The cat"                    |
| 3    | "ate" | "The cat ate"                |
| 4    | "the" | "The cat ate the"            |
| 5    | "food"| "The cat ate the food"       |

**Sentence B: "The food ate the cat"**

| Step | Word  | Hidden state ($h_t$) remembers |
|------|-------|------------------------------|
| 1    | "The" | Start of sentence            |
| 2    | "food"| "The food"                   |
| 3    | "ate" | "The food ate"               |
| 4    | "the" | "The food ate the"           |
| 5    | "cat" | "The food ate the cat"       |

**So What Does Step 5 Do?**

It creates a **unique hidden state for every word, based on its position and history.**

This is what helps the RNN understand sequence and context — so it can tell the difference between:

> "The cat ate the food"
>
> "The food ate the cat"

Without this, the model would see just a bag of words and get totally confused.

In [16]:
import numpy as np

# Hyperparameters
input_dim = vocab_size
hidden_dim = 4  # for simplicity

# Weights and bias
np.random.seed(42)
W_xh = np.random.randn(input_dim, hidden_dim)
W_hh = np.random.randn(hidden_dim, hidden_dim)
b = np.zeros((hidden_dim,))

def rnn_forward(inputs):
    h_t = np.zeros((hidden_dim,))
    hidden_states = []

    for x_t in inputs:
        h_t = np.tanh(np.dot(x_t, W_xh) + np.dot(h_t, W_hh) + b)
        hidden_states.append(h_t.copy())

    return np.array(hidden_states)

hidden_states = rnn_forward(input_seq)
print("Hidden States Shape:", hidden_states.shape)


Hidden States Shape: (4, 4)


**Visualize Hidden States**

In [17]:
for t, h in enumerate(hidden_states):
    print(f"Step {t}: {h}")


Step 0: [ 0.23734834 -0.95736006 -0.93845248 -0.50967271]
Step 1: [ 0.74875492  0.83230964  0.66401981 -0.9792767 ]
Step 2: [ 0.02453996  0.48632465 -0.48786326  0.9848857 ]
Step 3: [-0.74511185 -0.49053645 -0.34737921 -0.99764036]
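
To connect this back to the cat/food example, the sketch below runs the same kind of untrained RNN over both sentences. It assumes a tiny four-word vocabulary and random weights (the names `one_hot` and `final_hidden_state` are hypothetical helpers), so the numbers carry no meaning; what matters is that the order-blind sum of inputs is identical while the final hidden states are not.

```python
import numpy as np

words = ["the", "cat", "ate", "food"]
word_to_id = {w: i for i, w in enumerate(words)}

def one_hot(tokens):
    # One row per token, one column per vocabulary entry
    return np.eye(len(words))[[word_to_id[t] for t in tokens]]

rng = np.random.default_rng(42)
hidden_dim = 4
W_xh = rng.normal(size=(len(words), hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))

def final_hidden_state(inputs):
    h = np.zeros(hidden_dim)
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h

x_a = one_hot("the cat ate the food".split())
x_b = one_hot("the food ate the cat".split())

# Summing the one-hot rows discards order: both sentences look identical
same_bag = np.array_equal(x_a.sum(axis=0), x_b.sum(axis=0))

# The recurrent update depends on order, so the final states differ
difference = np.abs(final_hidden_state(x_a) - final_hidden_state(x_b)).max()

print("Same bag of words:", same_bag)
print("Largest gap between final hidden states:", difference)
```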


In [2]:
# Real Seq2Seq Translation with Transformers (English → French)

!pip install -q transformers sentencepiece


In [3]:
from transformers import MarianMTModel, MarianTokenizer

# Load Pretrained English → French Model
model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)


# Translate Function
def translate_real(sentence):
    print(f"Input: {sentence}")
    tokens = tokenizer(sentence, return_tensors="pt", padding=True)
    output = model.generate(**tokens)
    translated = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Output: {translated}\n")


# Try Demo Sentences
translate_real("I am a student.")
translate_real("The cat ate the food.")
translate_real("The food ate the cat.")
translate_real("I love computer science.")


Input: I am a student.
Output: Je suis étudiant.

Input: The cat ate the food.
Output: Le chat a mangé la nourriture.

Input: The food ate the cat.
Output: La nourriture a mangé le chat.

Input: I love computer science.
Output: J'adore l'informatique.

