# Module 12: Sequence-to-Sequence

**Encoder-Decoder Architecture for Translation**

---

## 1. Objectives

- ✅ Understand encoder-decoder architecture
- ✅ Implement Seq2Seq from scratch
- ✅ Use teacher forcing
- ✅ Implement greedy and beam search decoding

## 2. Prerequisites

- [Module 11: Language Modeling](../11_language_modeling/11_language_modeling.ipynb)

## 3. Seq2Seq Architecture

```
        Encoder                      Decoder
        
  "hello world"               "<SOS> bonjour monde <EOS>"
        ↓                              ↓
  [Embed] → [LSTM] → context → [LSTM] → [Output]
                      vector
```

### Key Insight
- **Encoder**: Compress input into context vector
- **Decoder**: Generate output conditioned on context

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import random

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

Device: cpu


## 4. Sample Data (Toy Translation)

In [2]:
# Toy English-French pairs
pairs = [
    ("hello", "bonjour"),
    ("world", "monde"),
    ("cat", "chat"),
    ("dog", "chien"),
    ("good morning", "bonjour"),
]

# Build vocabularies
SOS_TOKEN, EOS_TOKEN, PAD_TOKEN = 0, 1, 2

def build_vocab(sentences):
    vocab = {'<SOS>': 0, '<EOS>': 1, '<PAD>': 2}
    for sent in sentences:
        for word in sent.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

src_vocab = build_vocab([p[0] for p in pairs])
tgt_vocab = build_vocab([p[1] for p in pairs])
tgt_idx2word = {v: k for k, v in tgt_vocab.items()}

print(f"Source vocab: {len(src_vocab)}, Target vocab: {len(tgt_vocab)}")

Source vocab: 9, Target vocab: 7
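
Later cells look up words with `vocab.get(w, PAD_TOKEN)`, so an out-of-vocabulary word is silently treated as padding. A common alternative is a dedicated unknown-word token. The sketch below assumes a fourth special token `<UNK>` with id 3; `UNK_TOKEN`, `build_vocab_with_unk`, and `encode` are hypothetical names that do not appear in this notebook.

```python
# Hypothetical extension of build_vocab: reserve id 3 for unknown words.
SOS_TOKEN, EOS_TOKEN, PAD_TOKEN, UNK_TOKEN = 0, 1, 2, 3

def build_vocab_with_unk(sentences):
    """Map each distinct word to an id, after the four special tokens."""
    vocab = {'<SOS>': SOS_TOKEN, '<EOS>': EOS_TOKEN,
             '<PAD>': PAD_TOKEN, '<UNK>': UNK_TOKEN}
    for sent in sentences:
        for word in sent.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Convert a sentence to ids, using UNK_TOKEN for unseen words."""
    return [vocab.get(word, UNK_TOKEN) for word in sentence.split()]

vocab = build_vocab_with_unk(["hello", "good morning"])
print(encode("good evening", vocab))  # "evening" is unseen: [5, 3]
```

Adopting this in the notebook would add one entry to each vocabulary, so the printed vocabulary sizes and the parameter count would change accordingly.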


## 5. Encoder

In [3]:
class Encoder(nn.Module):
    """LSTM Encoder."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=1, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD_TOKEN)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, seq)
        embedded = self.dropout(self.embedding(x))  # (batch, seq, embed)
        outputs, (hidden, cell) = self.lstm(embedded)
        # hidden, cell: (layers, batch, hidden)
        return hidden, cell

encoder = Encoder(len(src_vocab), embed_dim=64, hidden_dim=128)
x = torch.tensor([[src_vocab.get(w, PAD_TOKEN) for w in "hello".split()]])
h, c = encoder(x)
print(f"Context shapes - h: {h.shape}, c: {c.shape}")

Context shapes - h: torch.Size([1, 1, 128]), c: torch.Size([1, 1, 128])


## 6. Decoder

In [4]:
class Decoder(nn.Module):
    """LSTM Decoder."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=1, dropout=0.1):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD_TOKEN)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, hidden, cell):
        # x: (batch, 1) single token
        embedded = self.dropout(self.embedding(x))  # (batch, 1, embed)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        logits = self.fc(output.squeeze(1))  # (batch, vocab)
        return logits, hidden, cell

decoder = Decoder(len(tgt_vocab), embed_dim=64, hidden_dim=128)
start = torch.tensor([[SOS_TOKEN]])
logits, h, c = decoder(start, h, c)
print(f"Decoder output: {logits.shape}")

Decoder output: torch.Size([1, 7])


## 7. Seq2Seq Model

In [5]:
class Seq2Seq(nn.Module):
    """Complete Seq2Seq model."""

    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        """
        Training forward pass with teacher forcing.

        Args:
            src: (batch, src_len)
            tgt: (batch, tgt_len) including <SOS>
            teacher_forcing_ratio: Probability of using true target
        """
        batch_size = src.size(0)
        tgt_len = tgt.size(1)
        tgt_vocab_size = self.decoder.vocab_size

        # Encode
        hidden, cell = self.encoder(src)

        # Store outputs
        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size, device=src.device)

        # First input is <SOS>
        decoder_input = tgt[:, 0:1]  # (batch, 1)

        for t in range(1, tgt_len):
            logits, hidden, cell = self.decoder(decoder_input, hidden, cell)
            outputs[:, t, :] = logits

            # Teacher forcing
            use_teacher = random.random() < teacher_forcing_ratio
            top1 = logits.argmax(dim=1, keepdim=True)
            decoder_input = tgt[:, t:t+1] if use_teacher else top1

        return outputs

    def translate(self, src, max_len=20):
        """Inference: greedy decoding for a single source sentence (batch size 1)."""
        self.eval()
        with torch.no_grad():
            hidden, cell = self.encoder(src)
            decoder_input = torch.tensor([[SOS_TOKEN]], device=src.device)

            output_tokens = []
            for _ in range(max_len):
                logits, hidden, cell = self.decoder(decoder_input, hidden, cell)
                top1 = logits.argmax(dim=1)

                if top1.item() == EOS_TOKEN:
                    break

                output_tokens.append(top1.item())
                decoder_input = top1.unsqueeze(1)

            return output_tokens

# Create model
encoder = Encoder(len(src_vocab), 64, 128)
decoder = Decoder(len(tgt_vocab), 64, 128)
model = Seq2Seq(encoder, decoder).to(device)

print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")

Total parameters: 200,583


## 8. Training

In [6]:
def prepare_batch(pairs, src_vocab, tgt_vocab):
    """Prepare training batch."""
    src_batch, tgt_batch = [], []
    for src, tgt in pairs:
        src_tokens = [src_vocab.get(w, PAD_TOKEN) for w in src.split()]
        tgt_tokens = [SOS_TOKEN] + [tgt_vocab.get(w, PAD_TOKEN) for w in tgt.split()] + [EOS_TOKEN]
        src_batch.append(src_tokens)
        tgt_batch.append(tgt_tokens)

    # Pad
    max_src = max(len(s) for s in src_batch)
    max_tgt = max(len(t) for t in tgt_batch)

    src_batch = [s + [PAD_TOKEN] * (max_src - len(s)) for s in src_batch]
    tgt_batch = [t + [PAD_TOKEN] * (max_tgt - len(t)) for t in tgt_batch]

    return torch.tensor(src_batch), torch.tensor(tgt_batch)

# Training
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN)

src, tgt = prepare_batch(pairs, src_vocab, tgt_vocab)
src, tgt = src.to(device), tgt.to(device)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()

    outputs = model(src, tgt, teacher_forcing_ratio=0.5)

    # outputs: (batch, tgt_len, vocab), tgt: (batch, tgt_len)
    # Skip first target (<SOS>)
    loss = criterion(outputs[:, 1:].reshape(-1, len(tgt_vocab)), tgt[:, 1:].reshape(-1))

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}: Loss={loss.item():.4f}")

Epoch 50: Loss=0.0316
Epoch 100: Loss=0.0053
Epoch 150: Loss=0.0028
Epoch 200: Loss=0.0019
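
`prepare_batch` right-pads the shorter source sentences, and the encoder runs its LSTM over those `<PAD>` positions. During training the context for "hello" is therefore taken after a padding step, while `translate` encodes the same word without padding. One way to make the two agree is `torch.nn.utils.rnn.pack_padded_sequence`, which stops the LSTM at each sequence's true length. The sketch below uses a small stand-alone embedding and LSTM with arbitrary sizes, not the notebook's `Encoder`.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

PAD_TOKEN = 2  # same id as in the notebook
torch.manual_seed(0)
embedding = nn.Embedding(10, 8, padding_idx=PAD_TOKEN)  # toy sizes (assumption)
lstm = nn.LSTM(8, 16, batch_first=True)

# Two source sentences: one token long and three tokens long.
padded = torch.tensor([[3, PAD_TOKEN, PAD_TOKEN],
                       [4, 5, 6]])
lengths = (padded != PAD_TOKEN).sum(dim=1)  # true lengths: 1 and 3

# Packed: the LSTM sees only the real tokens of each sequence.
packed = pack_padded_sequence(embedding(padded), lengths.cpu(),
                              batch_first=True, enforce_sorted=False)
_, (h_packed, _) = lstm(packed)

# Unpacked: the LSTM also steps over the padding positions.
_, (h_padded, _) = lstm(embedding(padded))

# Reference: the short sentence encoded alone, as at inference time.
_, (h_single, _) = lstm(embedding(torch.tensor([[3]])))

same_with_packing = torch.allclose(h_packed[0, 0], h_single[0, 0], atol=1e-5)
same_without_packing = torch.allclose(h_padded[0, 0], h_single[0, 0], atol=1e-5)
print(same_with_packing, same_without_packing)
```

With packing, the final hidden state of the padded sentence matches the unpadded encoding; without it, the extra padding steps change the state.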


## 9. Inference

In [7]:
# Test translation
def translate(model, sentence, src_vocab, tgt_idx2word):
    tokens = [src_vocab.get(w, PAD_TOKEN) for w in sentence.split()]
    src = torch.tensor([tokens]).to(device)
    output_tokens = model.translate(src)
    return ' '.join([tgt_idx2word.get(t, '<UNK>') for t in output_tokens])

# Use `sentence` rather than `src` so the training tensor `src` is not overwritten.
for sentence, expected in pairs:
    result = translate(model, sentence, src_vocab, tgt_idx2word)
    print(f"{sentence} → {result} (expected: {expected})")

hello → bonjour (expected: bonjour)
world → monde (expected: monde)
cat → chat (expected: chat)
dog → chien (expected: chien)
good morning → bonjour (expected: bonjour)
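
The `translate` method decodes greedily, keeping only the single most likely token at each step. Beam search, which the objectives also list, instead keeps the `beam_width` best partial hypotheses. The sketch below is independent of the model: `step_fn` is a hypothetical callback that returns next-token log-probabilities for a prefix. With the notebook's decoder it would wrap `F.log_softmax(logits, dim=-1)` and carry a separate `(hidden, cell)` per hypothesis. The toy distribution is made up for illustration.

```python
import math

SOS_TOKEN, EOS_TOKEN = 0, 1  # same ids as in the notebook

def beam_search(step_fn, beam_width=3, max_len=20):
    """Return (tokens, log_prob) of the best hypothesis, without <SOS>/<EOS>."""
    beams = [([SOS_TOKEN], 0.0)]  # (token list, summed log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every vocabulary token.
        candidates = [
            (tokens + [tok], score + lp)
            for tokens, score in beams
            for tok, lp in enumerate(step_fn(tokens))
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Hypotheses ending in <EOS> are set aside; the rest stay live.
            if tokens[-1] == EOS_TOKEN:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    # Length-normalize so longer hypotheses are not penalized just for length.
    tokens, score = max(finished, key=lambda c: c[1] / max(1, len(c[0]) - 1))
    return [t for t in tokens[1:] if t != EOS_TOKEN], score

def log(p):
    return math.log(p) if p > 0 else float("-inf")

# Toy model (made up): token 2 looks best at the first step, but token 3
# leads to a more probable complete sequence.
TOY_PROBS = {
    SOS_TOKEN: [0.0, 0.0, 0.6, 0.4],
    2: [0.0, 0.4, 0.3, 0.3],
    3: [0.0, 0.9, 0.05, 0.05],
}

def toy_step(prefix):
    """Next-token log-probabilities, depending only on the last token."""
    return [log(p) for p in TOY_PROBS[prefix[-1]]]

greedy_tokens, greedy_score = beam_search(toy_step, beam_width=1)
beam_tokens, beam_score = beam_search(toy_step, beam_width=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # [2] 0.24
print(beam_tokens, round(math.exp(beam_score), 2))      # [3] 0.36
```

A beam width of 1 reduces to greedy decoding. On the five training pairs above, both strategies would likely agree, since each source has one clearly dominant translation.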


## 10. 🔥 Real-World Usage

### Evolution

| Year | Model | Innovation |
|------|-------|------------|
| 2014 | Seq2Seq | Encoder-decoder with a fixed context vector |
| 2015 | Attention | Decoder attends to relevant source positions |
| 2017 | Transformer | Attention only, no recurrence |
| 2019+ | BART, T5, mBART | Pretrained Seq2Seq |

### Modern Practice
- Start from a pretrained checkpoint via `transformers.AutoModelForSeq2SeqLM`
- Fine-tune it on your task
- Example model families: T5, BART, mT5
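
As a rough sketch of the pretrained route, the function below loads a checkpoint through `AutoModelForSeq2SeqLM` and generates a translation. It assumes the `transformers` and `sentencepiece` packages are installed and that the `t5-small` checkpoint can be downloaded from the Hugging Face Hub; `translate_with_t5` is a hypothetical helper name, and the checkpoint choice is only an example.

```python
def translate_with_t5(text, model_name="t5-small", max_new_tokens=20):
    """Translate English to French with a pretrained T5 checkpoint (sketch)."""
    # Imported lazily so the function can be defined without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # The original T5 checkpoints were trained with task prefixes such as this one.
    inputs = tokenizer("translate English to French: " + text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example call (downloads the checkpoint on first use):
# translate_with_t5("hello world")
```

Fine-tuning follows the same pattern as the training loop in this module, with the pretrained model in place of `Seq2Seq`.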

## 11. Interview Questions

**Q1: What is teacher forcing?**
<details><summary>Answer</summary>

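During training, the decoder receives the ground-truth previous token as input instead of its own prediction. This speeds up training but can cause exposure bias (train-test mismatch): at inference the decoder must condition on its own, possibly wrong, predictions.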
</details>

**Q2: What is the bottleneck problem in Seq2Seq?**
<details><summary>Answer</summary>

All source information must be compressed into a single fixed-size context vector, so long sequences lose information. The attention mechanism addresses this by letting the decoder look at every encoder state instead of only the final one.
</details>
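
To make the bottleneck concrete, the sketch below encodes a 3-token and a 300-token sequence with the same LSTM and compares the shape of the final hidden state. The embedding and LSTM sizes are arbitrary stand-ins, not the notebook's `Encoder`, and the token ids are random.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(100, 16)       # toy vocabulary of 100 ids (assumption)
lstm = nn.LSTM(16, 32, batch_first=True)

def context_shape(seq_len):
    """Shape of the final hidden state for a random sequence of seq_len tokens."""
    tokens = torch.randint(0, 100, (1, seq_len))
    _, (hidden, _) = lstm(embedding(tokens))
    return tuple(hidden.shape)

short_ctx = context_shape(3)
long_ctx = context_shape(300)
print(short_ctx, long_ctx)  # (1, 1, 32) (1, 1, 32)
```

The context has 32 values in both cases, so the 300-token input has to fit into the same space as the 3-token one.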

## 12. Summary

- **Encoder**: Compress input into context vector
- **Decoder**: Generate output token by token
- **Teacher forcing**: Use true targets during training
- **Limitation**: Bottleneck in context vector → fixed by Attention

## 13. References

- [Seq2Seq Paper (2014)](https://arxiv.org/abs/1409.3215)
- [PyTorch Seq2Seq Tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

---
**Next:** [Module 13: Attention Mechanism](../13_attention/13_attention.ipynb)