`Helsinki-NLP/opus_books` is a collection of copyright-free books aligned by Andras Farkas. All texts are freely available for personal, educational and research use. Commercial use (e.g. reselling as parallel books) and mass redistribution without explicit permission are not granted.

See more about the dataset: https://huggingface.co/datasets/Helsinki-NLP/opus_books

In this section, we will use this dataset to train a seq2seq translation model.

We use the English-French (`en-fr`) language pair as the example.

In [5]:
from datasets import load_dataset

books = load_dataset("Helsinki-NLP/opus_books", "en-fr")

In [None]:
# Resources are limited, so we only train on a small subset of the dataset (you can train on the full one if you want).
# Keep 10% of the original training data.
books_small = books["train"].train_test_split(train_size=0.1, seed=42)

# Split that subset 90/10 into train and test sets.
books = books_small["train"].train_test_split(test_size=0.1, seed=42)

print(books)
print(books["train"][0])

The main purpose of this practical session is to get familiar with how the relevant libraries and models run. Considering possible resource limitations, we are not using large datasets or big models here: this is intended purely as usage guidance, and model performance is not our primary focus. You are welcome to explore longer training.

# Task1: Fine-tune a pre-trained seq2seq model using the Transformers library

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

### Preprocessing Function

The `preprocess_function` prepares translation data for model training:

- Extracts source and target sentences from `examples["translation"]`.
- Tokenizes the source text into `input_ids` and `attention_mask`.
- Tokenizes the target text as labels.
- Returns a dictionary with `input_ids`, `attention_mask`, and `labels` for the model.

In [None]:
source_lang = "en"
target_lang = "fr"
max_length = 128

def preprocess_function(examples):
    inputs = #YOUR CODE
    targets = #YOUR CODE
    model_inputs = #YOUR CODE

    # `text_target` tokenizes the targets as labels; it replaces the
    # deprecated `as_target_tokenizer` context manager.
    labels = tokenizer(text_target=targets, max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = books.map(preprocess_function, batched=True)
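
The steps listed above can be sketched on a toy batch without downloading anything. This is only an illustration: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the real one, and the two sentence pairs are made up. What it shows is the layout of a batched example (`examples["translation"]` is a list of dictionaries keyed by language code) and the three keys the function should return.

```python
# Toy batch in the layout that `books.map(..., batched=True)` passes in:
# examples["translation"] is a list of {"en": ..., "fr": ...} dicts.
examples = {
    "translation": [
        {"en": "The cat sleeps.", "fr": "Le chat dort."},
        {"en": "I like books.", "fr": "J'aime les livres."},
    ]
}

source_lang, target_lang = "en", "fr"


def toy_tokenize(texts, max_length=128):
    """Hypothetical stand-in for the real tokenizer: one id per whitespace token."""
    vocab = {}
    input_ids = []
    for text in texts:
        ids = [vocab.setdefault(tok, len(vocab)) for tok in text.split()]
        input_ids.append(ids[:max_length])  # truncate to max_length
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Extract source and target sentences from the batch.
inputs = [pair[source_lang] for pair in examples["translation"]]
targets = [pair[target_lang] for pair in examples["translation"]]

# Tokenize the sources, then attach the tokenized targets as labels.
model_inputs = toy_tokenize(inputs)
model_inputs["labels"] = toy_tokenize(targets)["input_ids"]

print(sorted(model_inputs))
print(model_inputs["input_ids"])
```

With the real tokenizer the ids differ and each sequence ends with an end-of-sequence token, but the returned dictionary has the same three keys.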

In [None]:
!pip install evaluate
!pip install sacrebleu

### Evaluation Function

The code defines evaluation with the **sacreBLEU** metric:

- `postprocess_text`: strips surrounding whitespace from predictions and labels, and wraps each label in a list because sacreBLEU accepts several references per prediction.  
- `compute_metrics`:  
  - Decodes model predictions and labels.  
  - Postprocesses them for evaluation.  
  - Computes BLEU score using `sacrebleu`.  
  - Returns the BLEU score as a dictionary.

In [None]:
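import evaluate
import numpy as np

metric = #YOUR CODE

def postprocess_text(preds, labels):
    preds = [p.strip() for p in preds]
    labels = [[l.strip()] for l in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Some models return a tuple; the generated token IDs come first.
    if isinstance(preds, tuple):
        preds = preds[0]
    # The collator pads labels with -100, which the tokenizer cannot decode.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)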
    decoded_preds, decoded_labels = #YOUR CODE
    result = #YOUR CODE
    return {"bleu": result["score"]}
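
One detail of the decoding step is easy to miss: `DataCollatorForSeq2Seq` pads labels with `-100` by default so the loss ignores those positions, and that value has to be mapped back to the pad token id before decoding. The sketch below walks through this on toy data. The vocabulary, `PAD_ID`, `restore_pad_ids`, and `toy_decode` are all hypothetical stand-ins for the real tokenizer, not part of any library.

```python
PAD_ID = 0           # hypothetical pad token id
IGNORE_INDEX = -100  # default label padding value of DataCollatorForSeq2Seq

# Hypothetical toy vocabulary standing in for the real tokenizer.
id_to_token = {0: "<pad>", 1: "le", 2: "chat", 3: "dort", 4: "livre"}


def restore_pad_ids(label_rows):
    """Replace the ignore index with the pad id so the rows can be decoded."""
    return [[PAD_ID if t == IGNORE_INDEX else t for t in row] for row in label_rows]


def toy_decode(rows):
    """Stand-in for batch_decode(..., skip_special_tokens=True)."""
    return [" ".join(id_to_token[t] for t in row if t != PAD_ID) for row in rows]


def postprocess_text(preds, labels):
    preds = [p.strip() for p in preds]
    labels = [[l.strip()] for l in labels]  # one list of references per prediction
    return preds, labels


labels = [[1, 2, 3], [1, 4, IGNORE_INDEX]]  # second row was padded by the collator
preds = [[1, 2, 3], [1, 2, PAD_ID]]

decoded_preds, decoded_labels = postprocess_text(
    toy_decode(preds), toy_decode(restore_pad_ids(labels))
)
print(decoded_preds)
print(decoded_labels)
```

The nested list in `decoded_labels` is the shape the sacreBLEU metric expects for its `references` argument: one list of reference strings per prediction.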

### Training with Seq2SeqTrainer

This task sets up and runs training for a sequence-to-sequence model:

- **Seq2SeqTrainingArguments**: defines training parameters such as batch size, epochs, logging.  
- **DataCollatorForSeq2Seq**: prepares batches with dynamic padding for seq2seq training.  
- **Seq2SeqTrainer**: combines the model, datasets, tokenizer, collator, and evaluation metric.  
- **trainer.train()**: starts the training process.

In [None]:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

args = Seq2SeqTrainingArguments(
    #YOUR CODE
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    #YOUR CODE
)


trainer.train()

In [None]:
results = trainer.evaluate()
print(results)
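
To get a feel for how long a run will take before starting it, it helps to estimate the number of optimizer steps from the training arguments. The sketch below is a rough estimate under stated assumptions: the keys of `training_kwargs` are real keyword arguments of `Seq2SeqTrainingArguments`, but every value, the `optimizer_steps` helper, and the dataset size of 10,000 examples are hypothetical. It also assumes no gradient accumulation.

```python
import math

# Hypothetical values; the keys are keyword arguments of Seq2SeqTrainingArguments.
training_kwargs = {
    "output_dir": "opus-mt-en-fr-books",  # hypothetical output folder
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 1,
    "weight_decay": 0.01,
    "logging_steps": 50,
    # Needed so compute_metrics receives generated token IDs rather than logits.
    "predict_with_generate": True,
}


def optimizer_steps(num_examples, kwargs, num_devices=1):
    """Estimate total optimizer steps, assuming no gradient accumulation."""
    effective_batch = kwargs["per_device_train_batch_size"] * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * kwargs["num_train_epochs"]


# Hypothetical training-set size, for illustration only.
print(optimizer_steps(10_000, training_kwargs))
```

With these values the estimate is 625 steps; halving the batch size to fit a smaller GPU doubles it.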

# Task2: Define a seq2seq transformer model using PyTorch

In [12]:
from tqdm import tqdm
import torch
import torch.nn as nn
import math

You can try to train this model, but doing so needs a GPU and a reasonable amount of data, so you may not be able to finish it with limited resources. This exercise is only meant to familiarize you with the structure of the model. Training is optional, so it's fine if you don't run the training process.

### Task: Implement Positional Encoding

Define a `PositionalEncoding` class in PyTorch that:

- Creates sinusoidal positional encodings based on sequence length and embedding dimension.  
- Uses sine for even embedding dimensions and cosine for odd ones.  
- Adds positional encodings to input embeddings during the forward pass.  
- Ensures the encoding matrix is stored as a non-trainable buffer.

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # [max_len, 1]
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        #YOUR CODE
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        x: [batch_size, seq_len, d_model]
        """
        seq_len = x.size(1)
        #YOUR CODE
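
A small reference table can help when checking your implementation. The sketch below computes the same sinusoidal values in plain Python, using the `div_term` expression from the cell above; `sinusoidal_table` is a hypothetical helper written for this check, not the PyTorch solution. Comparing its output with the first rows of your `pe` buffer should reveal swapped sine/cosine columns or a wrong frequency.

```python
import math


def sinusoidal_table(max_len, d_model):
    """Return pe as nested lists, with
    pe[pos][2i]   = sin(pos * w_i)
    pe[pos][2i+1] = cos(pos * w_i)
    where w_i = exp(2i * (-ln(10000) / d_model)), as in div_term above."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for even_dim in range(0, d_model, 2):  # mirrors torch.arange(0, d_model, 2)
            w = math.exp(even_dim * (-math.log(10000.0) / d_model))
            pe[pos][even_dim] = math.sin(pos * w)      # even dimension: sine
            pe[pos][even_dim + 1] = math.cos(pos * w)  # odd dimension: cosine
    return pe


pe = sinusoidal_table(max_len=4, d_model=4)
print(pe[0])  # position 0: sin(0) on even dimensions, cos(0) on odd dimensions
print(pe[1])
```

At position 0 every even dimension is 0 and every odd dimension is 1, which makes a quick sanity check for the PyTorch version.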

### Task: Implement a Simple Seq2Seq Transformer

Define a `SimpleSeq2SeqTransformer` class in PyTorch that:

- Uses an embedding layer to convert token IDs into dense vectors.  
- Adds positional encoding to capture token order information.  
- Employs `nn.Transformer` with configurable encoder and decoder layers.  
- Applies a final linear layer to project transformer outputs to vocabulary size.  
- Implements a forward pass that encodes source inputs, decodes target inputs with a causal mask, and produces output logits.

In [None]:
class SimpleSeq2SeqTransformer(nn.Module):

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=3, dropout=0.1):
        super().__init__()


        self.embedding = #YOUR CODE

        self.pos_encoder  = #YOUR CODE

        self.transformer = nn.Transformer(
            #YOUR CODE
        )

        self.output = #YOUR CODE

    def forward(self, src, tgt):
        src_emb = self.embedding(src)
        tgt_emb = self.embedding(tgt)

        tgt_len = tgt.size(1)
        # Build the causal mask on the same device as the inputs.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_len).to(tgt.device)

        out = #YOUR CODE

        return self.output(out)
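
The causal mask used in the forward pass is easier to reason about once you see its values. `nn.Transformer.generate_square_subsequent_mask` returns a square float mask with `0.0` where attention is allowed and `-inf` above the diagonal; the mask is added to the attention scores before the softmax, so blocked positions get zero weight. The plain-Python sketch below reproduces that pattern. `causal_mask` is a hypothetical helper for illustration only.

```python
def causal_mask(size):
    """Float mask: 0.0 where a query may attend, -inf where it may not.

    Row index = query (target) position, column index = key position.
    A position may attend to itself and to earlier positions only.
    """
    neg_inf = float("-inf")
    return [
        [0.0 if key <= query else neg_inf for key in range(size)]
        for query in range(size)
    ]


for row in causal_mask(3):
    print(row)
```

Row 0 can see only the first target token, while the last row can see the whole prefix, which is what lets the decoder train on all positions in parallel without looking ahead.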

# MCQs

### Question: Encoder–Decoder Models

Which of the following statements about encoder–decoder (seq2seq) models is **correct**?

A. The encoder always produces a single hidden vector, which is directly mapped to the final output sequence without a decoder.  

B. In Transformer-based seq2seq models, the decoder attends to both the previously generated tokens and the encoder’s hidden states.  

C. Encoder–decoder models cannot handle variable-length input or output sequences; they require fixed-length sequences on both sides.  

D. Attention mechanisms are only useful in the encoder and have no role in the decoder.  



### Question: Attention Mechanisms

Which of the following statements about the **Scaled Dot-Product Attention** used in Transformers is **correct**?

A. The scaling factor \( \sqrt{d_k} \) is applied to the keys before the dot product to reduce computation cost.  

B. Multi-head attention simply averages the outputs of multiple attention heads to improve stability.  

C. The softmax function in attention ensures that the attention weights over all keys for a given query sum to 1.  

D. Self-attention cannot capture long-range dependencies because the receptive field is limited to local context.  



### Question: Attentive Encoder–Decoder Models

Which of the following statements about **attentive encoder–decoder models** is correct?

A. Without attention, the decoder always conditions on the entire sequence of encoder hidden states at every step.  

B. Attention allows the decoder to dynamically focus on different parts of the encoder’s hidden states when generating each output token.  

C. In attentive encoder–decoder models, attention weights are random and fixed; they are not learned during training.  

D. The introduction of attention prevents the decoder from using its own previously generated tokens.  



### Question: Transformer Block

Which of the following statements about a **Transformer block** (as used in the original Transformer architecture) is correct?

A. Each Transformer block consists of a single feed-forward network followed by multi-head attention.  

B. Residual connections and layer normalization are applied around both the multi-head attention sublayer and the feed-forward sublayer.  

C. The position-wise feed-forward network applies the same parameters to every position, but different positions use different networks.  

D. In the encoder block, masked self-attention is applied to prevent attending to future tokens.  



### Question: Self-Attention

Which of the following statements about **self-attention** in Transformers is correct?

A. Self-attention computes attention weights only between different sequences in a batch, not within a single sequence.  

B. In self-attention, queries, keys, and values are all derived from the same input representations.  

C. Self-attention cannot model long-range dependencies because it only attends to local neighbors.  

D. The complexity of self-attention is linear in sequence length, making it more efficient than RNNs for long inputs.  
