# Sequence-to-Sequence (Seq2Seq) Models

## Introduction

Sequence-to-Sequence (Seq2Seq) models are a class of deep learning architectures designed to convert one sequence of elements (e.g., words in a sentence) into another sequence. They have proven highly effective in a variety of Natural Language Processing (NLP) tasks.

In this notebook, we will:

- **Understand** the core architecture of Seq2Seq models, known as the Encoder-Decoder framework.
- **Discuss** how Seq2Seq models handle variable-length inputs and outputs.
- **Explore** key applications of Seq2Seq models in NLP, such as Machine Translation, Text Summarization, and Chatbots.
- **Examine** how attention mechanisms can be integrated into Seq2Seq models to improve performance.

**Resources for Further Reading:**

- [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) by Sutskever et al.
- [TensorFlow Seq2Seq Tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention)

**Prerequisites:**

- Familiarity with RNNs, LSTMs, or GRUs.
- Understanding of basic Python and PyTorch fundamentals.
- Knowledge of NLP tasks and tokenization.

---

## 1. The Seq2Seq Architecture

### 1.1 Encoder-Decoder Framework

The Seq2Seq model typically consists of two main components:

- **Encoder:**  
  The encoder reads the input sequence and encodes it into a fixed-length vector representation (also called the context vector or thought vector). This vector is meant to summarize the information in the input sequence that the decoder needs.

- **Decoder:**  
  The decoder takes the context vector and generates the output sequence, one token at a time. At each timestep, the decoder predicts the next output token based on the previously generated tokens and the context vector from the encoder.

**Key Idea:**  
The encoder transforms a variable-length input sequence into a fixed-length representation. The decoder then uses this representation to produce a variable-length output sequence. The encoder and decoder are often implemented using RNN variants like LSTMs or GRUs. Transformer-based encoder-decoders follow the same overall pattern, although their decoder attends over per-token encoder states rather than a single vector.
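
As a rough illustration of the key idea, the sketch below runs two inputs of different lengths through a small GRU encoder and checks that both yield a context vector of the same size. The sizes, the `encode` helper, and the untrained random weights are assumptions made for this example only; they are not the model built later in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for this sketch.
vocab_size, embed_dim, hidden_dim = 20, 8, 16
embedding = nn.Embedding(vocab_size, embed_dim)
encoder_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

def encode(token_ids):
    """Return the final hidden state for a single sequence of token ids."""
    x = torch.tensor(token_ids, dtype=torch.long).unsqueeze(0)  # (1, seq_len)
    _, h = encoder_rnn(embedding(x))                            # h: (1, 1, hidden_dim)
    return h.squeeze(0).squeeze(0)                              # (hidden_dim,)

short_ctx = encode([3, 1, 4])                  # 3 tokens
long_ctx = encode([2, 7, 1, 8, 2, 8, 1, 8])    # 8 tokens

# Both context vectors have the same shape regardless of input length.
print(short_ctx.shape, long_ctx.shape)
```

The fixed size is what lets a single decoder consume inputs of any length, and it is also the source of the bottleneck discussed in Section 4.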

### 1.2 Handling Variable-Length Inputs and Outputs

RNNs inherently process sequences step-by-step, making them suitable for variable-length inputs. By defining the end of a sequence with a special token (e.g., `<EOS>`), the model also handles variable-length outputs. Common practices include:

- **Padding and Masking:**  
  For batching, we often pad sequences to the same length. A masking mechanism ignores the padded positions during computation.

- **Special Tokens:**  
  Use `<SOS>` (start-of-sequence) and `<EOS>` (end-of-sequence) tokens to indicate where the decoder should begin and stop generating output.
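
The two practices above can be sketched with `torch.nn.utils.rnn.pad_sequence`. The token ids for `PAD`, `SOS`, and `EOS`, the `add_special_tokens` helper, and the random logits are assumptions for illustration; a real pipeline would take these ids from its tokenizer.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical special-token ids, chosen only for this sketch.
PAD, SOS, EOS = 0, 1, 2

def add_special_tokens(seq):
    """Wrap a list of token ids as <SOS> ... <EOS>."""
    return torch.tensor([SOS] + seq + [EOS], dtype=torch.long)

# Three sequences of different lengths (ids 3 and up are ordinary tokens).
raw = [[5, 6, 7], [8, 9], [4, 5, 6, 7, 8]]
batch = pad_sequence([add_special_tokens(s) for s in raw],
                     batch_first=True, padding_value=PAD)  # (3, 7)

# Boolean mask: True at real tokens, False at padding.
mask = batch != PAD
lengths = mask.sum(dim=1)

# Random logits stand in for decoder outputs: (batch, seq_len, num_classes).
# The padded batch doubles as the target here purely for illustration.
num_classes = 10
logits = torch.randn(batch.size(0), batch.size(1), num_classes)
criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD)  # padded targets add no loss
loss = criterion(logits.reshape(-1, num_classes), batch.reshape(-1))

print(batch)
print(lengths.tolist())
```

Giving padding its own id, distinct from `<SOS>` and `<EOS>`, keeps `ignore_index` from silently dropping real targets.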

---

## 2. Applications of Seq2Seq Models

### 2.1 Machine Translation

**Example:** Translating an English sentence into French.  
The Seq2Seq model reads the entire English sentence (e.g., "I love cats") and then generates the French translation (e.g., "J'aime les chats").

### 2.2 Text Summarization

**Example:** Given a long document (e.g., a news article), the Seq2Seq model can produce a short summary. It reads the entire article as input and outputs a concise summary capturing the main points.

### 2.3 Chatbots

**Example:** The Seq2Seq model can be used for conversational agents. Given a user’s query, the model produces a coherent and contextually relevant response. Over multiple turns, this can simulate a conversation.

---

## 3. Implementing a Basic Seq2Seq Model in PyTorch

**Note:** The following code is a simplified example to illustrate the Seq2Seq architecture. Training this model on real data would require a prepared dataset and possibly a more complex setup.

### 3.1 Setup

In [17]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### 3.2 Example Data

For demonstration, let's consider a toy dataset of simple "translation"-like tasks. We'll create a small dataset where the input is a sequence of integers and the output is that sequence reversed. Although trivial, this demonstrates the Seq2Seq model structure.

In [18]:
# Toy dataset: input is a sequence of numbers [1,2,3,...] and output is the reversed sequence.
# For example, input: [1, 2, 3], output: [3, 2, 1]

def create_toy_dataset(n_samples=1000, seq_len=5, vocab_size=10):
    inputs = []
    targets = []
    for _ in range(n_samples):
        seq = np.random.randint(1, vocab_size, size=seq_len).tolist()
        rev = seq[::-1]
        inputs.append(seq)
        targets.append(rev)
    return inputs, targets

inputs, targets = create_toy_dataset(n_samples=1000, seq_len=6, vocab_size=20)

# Create vocab mappings
# In a real scenario, you'd have predefined vocabularies for source and target languages.
# Here, we just have numbers as tokens.
input_vocab_size = 21  # generated tokens are 1-19 (randint's upper bound is exclusive); index 0 is reserved
target_vocab_size = 21
SOS_token = 0  # start-of-sequence
EOS_token = 0  # end-of-sequence (reuses 0 for simplicity; unused below because decoding runs a fixed number of steps)

### 3.3 Dataset and DataLoader

In [19]:
class SeqDataset(Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets
    def __len__(self):
        return len(self.inputs)
    def __getitem__(self, idx):
        inp = self.inputs[idx]
        tgt = self.targets[idx]
        return torch.tensor(inp, dtype=torch.long), torch.tensor(tgt, dtype=torch.long)

dataset = SeqDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)

### 3.4 Encoder and Decoder Definitions

**Encoder:** Takes an input sequence and produces the context, here the LSTM's final hidden state `h` and cell state `c`.

In [20]:
class Encoder(nn.Module):
    def __init__(self, input_dim, embed_dim, hidden_dim, num_layers=1):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        
    def forward(self, x):
        # x: (batch_size, seq_len)
        embedded = self.embedding(x)  # (batch_size, seq_len, embed_dim)
        outputs, (h, c) = self.rnn(embedded)  # h,c: (num_layers, batch_size, hidden_dim)
        return h, c

**Decoder:** Uses the context vector from the encoder to generate the output sequence one token at a time.

In [21]:
class Decoder(nn.Module):
    def __init__(self, output_dim, embed_dim, hidden_dim, num_layers=1):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x, h, c):
        # x: (batch_size) - single token input
        x = x.unsqueeze(1)  # (batch_size, 1)
        embedded = self.embedding(x)  # (batch_size, 1, embed_dim)
        output, (h, c) = self.rnn(embedded, (h, c))  # (batch_size, 1, hidden_dim)
        logits = self.fc(output.squeeze(1))  # (batch_size, output_dim)
        return logits, h, c

**Seq2Seq Model:** Combines the encoder and decoder. During training, we can use "teacher forcing": with probability `teacher_forcing_ratio`, the ground-truth target token from the previous timestep, rather than the model's own prediction, is fed as input to the next decoder step.

In [22]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.size(0)
        trg_len = trg.size(1)
        output_dim = self.decoder.fc.out_features
        
        outputs = torch.zeros(batch_size, trg_len, output_dim, device=src.device)

        h, c = self.encoder(src)

        # First input to decoder is the <SOS> token.
        input_tok = torch.full((batch_size,), SOS_token, dtype=torch.long, device=src.device)
        for t in range(trg_len):
            logits, h, c = self.decoder(input_tok, h, c)
            outputs[:, t, :] = logits
            # Decide whether to use teacher forcing
            teacher_force = np.random.random() < teacher_forcing_ratio
            top1 = logits.argmax(1)
            input_tok = trg[:, t] if teacher_force else top1
        
        return outputs

### 3.5 Training the Seq2Seq Model

In [23]:
embed_dim = 32
hidden_dim = 64
encoder = Encoder(input_vocab_size, embed_dim, hidden_dim).to(device)
decoder = Decoder(target_vocab_size, embed_dim, hidden_dim).to(device)
model = Seq2Seq(encoder, decoder).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore index 0 (pad/SOS); the toy targets never contain 0, so this is a no-op here
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 30
model.train()
for epoch in range(epochs):
    total_loss = 0
    for inp, tgt in dataloader:
        inp, tgt = inp.to(device), tgt.to(device)
        optimizer.zero_grad()
        output = model(inp, tgt)
        # output: (batch_size, trg_len, output_dim)
        # tgt: (batch_size, trg_len)
        # reshape
        output_dim = output.shape[-1]
        output = output.contiguous().view(-1, output_dim)
        tgt = tgt.contiguous().view(-1)
        
        loss = criterion(output, tgt)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(dataloader):.4f}")

Epoch 1/30, Loss: 3.0035
Epoch 2/30, Loss: 2.8352
Epoch 3/30, Loss: 2.5144
Epoch 4/30, Loss: 2.2524
Epoch 5/30, Loss: 2.0697
Epoch 6/30, Loss: 1.9232
Epoch 7/30, Loss: 1.7965
Epoch 8/30, Loss: 1.6745
Epoch 9/30, Loss: 1.5690
Epoch 10/30, Loss: 1.4711
Epoch 11/30, Loss: 1.3606
Epoch 12/30, Loss: 1.2822
Epoch 13/30, Loss: 1.2178
Epoch 14/30, Loss: 1.1608
Epoch 15/30, Loss: 1.0738
Epoch 16/30, Loss: 1.0124
Epoch 17/30, Loss: 0.9337
Epoch 18/30, Loss: 0.8508
Epoch 19/30, Loss: 0.8189
Epoch 20/30, Loss: 0.7538
Epoch 21/30, Loss: 0.7440
Epoch 22/30, Loss: 0.6229
Epoch 23/30, Loss: 0.6018
Epoch 24/30, Loss: 0.5460
Epoch 25/30, Loss: 0.5123
Epoch 26/30, Loss: 0.4501
Epoch 27/30, Loss: 0.4257
Epoch 28/30, Loss: 0.3657
Epoch 29/30, Loss: 0.3400
Epoch 30/30, Loss: 0.3120


### 3.6 Testing the Model

In [24]:
model.eval()
with torch.no_grad():
    test_seq = torch.tensor([2,5,8,3,7,9], dtype=torch.long).unsqueeze(0).to(device)
    # We know the reversed sequence should be [9,7,3,8,5,2]
    h, c = model.encoder(test_seq)
    input_tok = torch.zeros(1, dtype=torch.long).to(device)  # <SOS>
    decoded = []
    for _ in range(6):
        logits, h, c = model.decoder(input_tok, h, c)
        top1 = logits.argmax(1)
        decoded.append(top1.item())
        input_tok = top1
    print("Input sequence:", test_seq.squeeze().tolist())
    print("Decoded sequence:", decoded)

Input sequence: [2, 5, 8, 3, 7, 9]
Decoded sequence: [9, 7, 3, 8, 5, 2]


You should see the model produce the reversed sequence, as in the output above. After 30 epochs on this small dataset the training loss is still around 0.31, so results on other inputs might not be perfect, but they should improve if you train longer or with more data.
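
The test loop above always decodes exactly six steps, which works only because every toy target has the same length. For variable-length outputs, decoding usually stops when the model emits `<EOS>`. The following is a minimal, framework-free sketch of that loop. `greedy_decode`, `toy_step`, and the ids `SOS = 10` and `EOS = 11` are hypothetical names and values introduced here; `toy_step` merely stands in for one call to a trained decoder.

```python
# Hypothetical special-token ids for this sketch (distinct from real tokens).
SOS, EOS = 10, 11
VOCAB_SIZE = 12

def greedy_decode(step_fn, state, sos_token, eos_token, max_len=20):
    """Pick the highest-scoring token at each step until <EOS> or max_len."""
    tokens = []
    tok = sos_token
    for _ in range(max_len):
        scores, state = step_fn(tok, state)
        tok = scores.index(max(scores))   # greedy choice (argmax)
        if tok == eos_token:
            break                         # stop without emitting <EOS>
        tokens.append(tok)
    return tokens

def toy_step(prev_token, remaining):
    """Stand-in for a decoder step: scores the next 'remaining' token, then <EOS>."""
    scores = [0.0] * VOCAB_SIZE
    next_tok = remaining[0] if remaining else EOS
    scores[next_tok] = 1.0
    return scores, remaining[1:]

decoded = greedy_decode(toy_step, [9, 7, 3], SOS, EOS)
print(decoded)  # [9, 7, 3]
```

To use this pattern with the notebook's model, `step_fn` would wrap `model.decoder` and its `(h, c)` state, and the dataset would need an `<EOS>` id that is distinct from `<SOS>` and appended to every target. The `max_len` cap guards against a model that never predicts `<EOS>`.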

---

## 4. Enhancements: Incorporating Attention Mechanisms

**Issue with Vanilla Seq2Seq:**  
The encoder compresses the entire input sequence into a single fixed-length vector. For long sequences, this can become a "bottleneck," making it difficult for the decoder to extract all necessary information.

**Solution: Attention Mechanisms:**  
Attention allows the decoder to look back at the encoder outputs and focus on the most relevant parts of the input at each decoding step, rather than relying solely on a single context vector.

### 4.1 Integration

To incorporate attention:

- Store all encoder hidden states rather than just the final one (in the code above, the `outputs` tensor that `Encoder.forward` currently discards).
- Compute attention weights over these encoder states for each decoder step.
- Take a weighted sum of the encoder outputs to form a context vector dynamically.
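
The three steps above can be sketched with dot-product scoring, one of several possible scoring functions. The function name `dot_product_attention`, the optional `src_mask` argument, and the random tensors are assumptions for illustration; this is not the attention layer of any particular library.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_hidden, encoder_outputs, src_mask=None):
    """
    decoder_hidden:  (batch, hidden_dim)           current decoder state
    encoder_outputs: (batch, src_len, hidden_dim)  all encoder hidden states
    src_mask:        (batch, src_len) bool, True at real (non-padded) tokens
    """
    # Score each encoder state against the decoder state: (batch, src_len)
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
    if src_mask is not None:
        # Padded positions get -inf so softmax assigns them zero weight.
        scores = scores.masked_fill(~src_mask, float("-inf"))
    weights = F.softmax(scores, dim=1)             # attention weights sum to 1
    # Weighted sum of encoder states: (batch, hidden_dim)
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights

torch.manual_seed(0)
encoder_outputs = torch.randn(2, 5, 8)             # batch=2, src_len=5, hidden_dim=8
decoder_hidden = torch.randn(2, 8)
src_mask = torch.tensor([[True, True, True, False, False],
                         [True, True, True, True, True]])
context, weights = dot_product_attention(decoder_hidden, encoder_outputs, src_mask)
print(context.shape, weights.shape)
```

In a full model, `context` would typically be combined with the decoder's input or output at every step, so a fresh context vector is computed per generated token.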

### 4.2 Benefit

- **Improved Performance:** Especially on longer sequences.
- **Interpretability:** Attention weights provide insight into which input tokens the model focused on.

---

## Further Steps and Resources

1. **Try Different Architectures:**  
   Experiment with GRUs, LSTMs, or even Transformers for encoder-decoder models.

2. **Add Attention:**  
   Implement additive attention (Bahdanau) or multiplicative attention (Luong) to improve performance on longer sequences.

3. **Use Real Datasets:**  
   Apply the Seq2Seq model to a machine translation dataset (e.g., WMT) or a text summarization dataset.

4. **Learn From Tutorials:**  
   The [TensorFlow Seq2Seq Tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention) and PyTorch tutorials offer hands-on guidance.

**Remember:** Seq2Seq models form the foundation of many advanced NLP applications. With attention and transformer architectures, Seq2Seq models have become even more powerful and efficient.

---

## References

- [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) by Sutskever, Vinyals, and Le (2014)  
- [TensorFlow Seq2Seq Tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention)
