# Assignment 3: Machine Translation with T5

**Description:** This assignment notebook builds on the material from the
[lesson 6 notebook](https://github.com/datasci-w266/2025-spring-main/blob/master/materials/lesson_notebooks/lesson_6_Machine_Translation_With_Transformer.ipynb), in which we set up a new, very small version of a T5 encoder decoder model to train from scratch on translations from Shakespearean to Modern English. Since the model was trained from scratch, it didn't work very well. In this notebook, we'll first try to make that model work a little better, changing the model configuration and output generation parameters. Then we'll fine tune a small pre-trained T5 model on this task, to see how much better we can do with even a small pre-trained model. We'll apply several evaluation metrics, find some trade-offs, and try adding a secondary dataset to address some of the remaining challenges.

This notebook should be run on a Google Colab leveraging a GPU. By default, when you open the notebook in Colab it will try to use a GPU. Since colab is providing free access to a GPU they place constraints on that access.  Therefore you might want to turn off the GPU access (Edit -> Notebook Settings) while editing and initially debugging your code (at least the setup before you train each model). You will need a GPU to full train or evaluate each of the models. Total runtime of the entire notebook (with solutions and a Colab GPU) should be about 1-2h, but potentially more depending on how much you experiment. If Colab tells you that you have reached your GPU limit, wait 10-24 hours and you should be able to access a GPU again.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2024-fall-main/blob/master/assignment/a3/Machine_Translation_T5.ipynb)

The overall assignment structure is as follows:


0. Setup
  
  0.1 Libraries

  0.2 Data Acquisition

  0.3. Data Preparation


1. Tiny Seq2Seq Model Trained From Scratch
  
  1.1 Tokenizer and Model Setup

  1.2 Experimenting with Model Dimensions

  1.3 Text Generation Parameters

  1.4 Test Set Evaluation Metrics

2. Small Pre-Trained T5 Model

  2.1 Pre-Trained Model Setup and Tokenization

  2.2 Fine-Tuning the Pre-Trained Model

  2.3 Fine-Tuned Model Evaluation

  2.4 Style Classifier

  2.5 Revisit Decoder .Generate() Options

3. Adding Supplementary Paraphrase Dataset

  3.1 Load and preprocess the supplemental dataset

  3.2 Train T5 on Paraphrasing Task

  3.3 Fine-Tune Paraphrase-Trained Model on Main Task
  
  3.4 Paraphrase-Trained Model Evaluation

## 0. Setup

### 0.1 Libraries

In [1]:
!pip install -q transformers
!pip install -q datasets
!pip install -q evaluate
!pip install -q tokenizers

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/485.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m481.3/485.4 kB[0m [31m19.5 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m485.4/485.4 kB[0m [31m10.5 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/116.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m8.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.8/194.8 kB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import re
import random
import numpy as np
from scipy.special import softmax

import torch
import transformers
import evaluate
from datasets import Dataset, load_dataset

# For from-scratch T5 model
from transformers import T5TokenizerFast, T5Config, T5ForConditionalGeneration

# For pre-trained T5 model
from transformers import T5Tokenizer, T5ForConditionalGeneration  # this won't import twice, just noting here what's for each model

# For all T5 models
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# For BLEURT (to load a trained model for evaluation)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# For style classifier model (also for evaluating the seq2seq model output)
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer

### O.2 Data Acquisition

We'll use the Shakespeare-to-Modern-English translation dataset from Lesson 6. The data includes aligned sentences from a number of plays by William Shakespeare.

The data was copied from this repo --[https://github.com/cocoxu/Shakespeare](https://github.com/cocoxu/Shakespeare) -- and consolidated into one file for easier handling.

You will to grab a copy from our git repo and import it to your Google drive.  From there you'll be able to easily load it in to a Colab notebook.

In [3]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Modify this path to the appropriate location in your Drive
text_file = 'drive/MyDrive/DATASCI_266/train_plays-org-mod.txt'

### O.3 Data Preparation

Each line contains a Shakespearean sentence and its corresponding modern English translation.

The Shakesperean sentence is the *source sequence* and modern English one is the *target sequence*.

In [7]:
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]

text_pairs = []
for line in lines:
    old, mod = line.split("\t")
    old = old
    mod = mod
    text_pairs.append((old, mod))

In [8]:
# Look at some examples
for _ in range(5):
    print(random.choice(text_pairs))

('I do not well understand that.', 'I don’t understand that at all.')
('How I have thought of this and of these times, I shall recount hereafter; for this present, I would not, so with love I might entreat you, Be any further moved.', 'I’ll tell you later what I’ve thought about this And about these times,; for now, Please don’t to try to persuade me any further.')
('Ay, and he’ll tame her.', 'Yeah, and he’s going to tame her.')
("How have I been behaved, that he might stick The small'st opinion on my least misuse?", 'How have I behaved, that he might put The smallest opinion on my least misconduct?')
('Tut, I have the best armor of the world.', 'I have the best armor in the world.')


In [9]:
# Let's create some splits
random.shuffle(text_pairs)
num_val_samples = int(0.06 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

19088 total pairs
16798 training pairs
1145 validation pairs
1145 test pairs


Like we did in the lesson notebook, let's create a Huggingface dataset object from our data, so that it's easy to work with and pass to our model trainer.

In [10]:
def make_dataset(pairs):
    org_texts, mod_texts = zip(*pairs)
    org_texts = list(org_texts)
    mod_texts = list(mod_texts)

    dataset = Dataset.from_dict({"shakespeare": org_texts, "modern": mod_texts})
    return dataset.shuffle()

# Make the training data
train_dataset = make_dataset(train_pairs)

# Make the validation data
val_dataset = make_dataset(val_pairs)

## 1. Tiny Seq2Seq Model Trained From Scratch

As in the lesson 6 notebook, for our first model, we'll make a new tokenizer and model based on the T5 architecture, which we'll train from scratch only on our task dataset.

### 1.1 Tokenizer and Model Setup

The easiest way to make a new tokenizer is to load an existing T5 one, then call .train_new_from_iterator(), providing our own dataset and vocab size.

In [11]:
VOCAB_SIZE = 15000

def get_word_piece_tokenizer(text_samples, vocab_size):

    base_tokenizer = T5TokenizerFast.from_pretrained('t5-base')
    new_tokenizer = base_tokenizer.train_new_from_iterator(
        text_samples,
        vocab_size=VOCAB_SIZE
    )

    return new_tokenizer

In [12]:
shakespeare_samples = [text_pair[0] for text_pair in train_pairs]
modern_samples = [text_pair[1] for text_pair in train_pairs]

part1_tokenizer = get_word_piece_tokenizer(shakespeare_samples + modern_samples, VOCAB_SIZE)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

We'll need to preprocess the data using the tokenizer. Since our task is to translate from Shakespearean to Modern English, the Shakespeare text will be our input_ids and the Modern English will be the labels we use for training and evaluation. We'll create a function to do the tokenization, and then map it to our Huggingface datasets containing the train and validation data. We'll have the function take a tokenizer, because later we'll use a different pre-trained one.

In [13]:
MAX_SEQUENCE_LENGTH = 40

def preprocess_translation_batch(batch_text_pairs, tokenizer, prefix=""):
    if prefix:
        batch_text_pairs["shakespeare"] = [prefix + text for text in batch_text_pairs["shakespeare"]]

    shakespeare_encoded = tokenizer.batch_encode_plus(
        batch_text_pairs["shakespeare"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    modern_encoded = tokenizer.batch_encode_plus(
        batch_text_pairs["modern"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )

    return {'input_ids': shakespeare_encoded['input_ids'],
            'labels': modern_encoded['input_ids']}

In [14]:
train_ds_part1 = train_dataset.map(preprocess_translation_batch, batched=True,
                                   fn_kwargs={'tokenizer': part1_tokenizer})
val_ds_part1 = val_dataset.map(preprocess_translation_batch, batched=True,
                               fn_kwargs={'tokenizer': part1_tokenizer})

Map:   0%|          | 0/16798 [00:00<?, ? examples/s]

Map:   0%|          | 0/1145 [00:00<?, ? examples/s]

We'll need to create the new model from a config, specifying the model's dimensions. Then we'll need to make training arguments and trainer objects to be able to train the model. Let's create a function for each of those purposes, so that later we can use the functions to experiment with the available options.

First, make a function to create the model config and the model itself. Use the Lesson 6 notebook as a guide, and make sure to include all of the arguments that we've included in the function definition below. Those are what you'll experiment with next.

In [16]:
"""
Fill in the code to create a T5Config and new T5 model, using all of the function arguments
"""

def create_from_scratch_model(num_layers, embed_dim, keyvalue_dim, dense_dim, num_heads):

    ### YOUR CODE HERE

    t5_config = T5Config(
        vocab_size=VOCAB_SIZE,
        d_model=embed_dim,
        d_ff=dense_dim,
        num_heads=num_heads,
        num_layers=num_layers,
        decoder_start_token_id=part1_tokenizer.pad_token_id,
        key_value_dim=keyvalue_dim
    )

    # Create the T5 model based on the configuration
    t5_model = T5ForConditionalGeneration(config=t5_config)

    ### END YOUR CODE

    return t5_model

We'll also need to specify training arguments and a trainer for our model. Use the Seq2SeqTrainingArguments and Seq2SeqTrainer classes imported at the top of this notebook. You can use the Lesson 6 notebook as a guide for this too.

In [17]:
def create_seq2seq_training_args(batch_size, num_epochs):

    ### YOUR CODE HERE

    training_args = Seq2SeqTrainingArguments(
        "shakespeare_translation_model",  # Output directory
        evaluation_strategy='epoch',      # Evaluate after each epoch
        per_device_train_batch_size=batch_size,  # Batch size for training
        per_device_eval_batch_size=batch_size,   # Batch size for evaluation
        num_train_epochs=num_epochs,         # Number of epochs
        report_to='none'                 # Disable reporting to external services
    )

    ### END YOUR CODE

    return training_args

In [18]:
def create_seq2seq_trainer(model, training_args, train_ds, val_ds):

    ### YOUR CODE HERE

    trainer = Seq2SeqTrainer(
        model=model,             # The model to train
        args=training_args,      # The training arguments
        train_dataset=train_ds,  # The training dataset
        eval_dataset=val_ds      # The validation dataset
    )

    ### END YOUR CODE

    return trainer

### 1.2: Experimenting with Model Dimensions

In the Lesson 6 Notebook, we created a very small T5-style model with just one transformer layer and smaller dimensions for some of the internal layers. Now, you'll explore these options yourself, to see if you can get the model to work a little better when trained on this task.

Without adding any additional training data, can we configure the model to perform better when trained on this task? What happens if we add another one or more transformer layers to the encoder and decoder, or make some of the internal dimensions smaller or larger?

The T5Config gives us several hyperparameters to adjust the model's parameter dimensions. You can see the available arguments and their default values in the [T5Config documentation](https://huggingface.co/docs/transformers/v4.46.3/en/model_doc/t5#transformers.T5Config).

We'll give you the batch size and num_epochs:


In [19]:
part1_batch_size = 64
part1_num_epochs = 30

Now you decide the rest.

Try changing the values for *num_layers* (number of transformer blocks), *d_model* (size of embedding and pooler layers), *d_kv* (size of query, key, and value vectors per attention head), *num_heads* (the number of attention heads), and *d_ff* (size of feed forward layers after each attention layer).

Find hyperparameters that finish training 30 epochs in 10-20 minutes on a free Colab T4 GPU, and that give you as low of a validation loss as you can, at least below 1.8. Also try to do this without overwhelming overfitting, i.e. try to keep training_loss / validation_loss > 0.6 after 30 epochs.

Then answer the questions below.

In [20]:
"""
Define the values you want to use for d_model, d_kv, num_heads, and d_ff, for the T5Config below.
"""

### YOUR CODE HERE

embed_dim = 256
keyvalue_dim = 256
num_heads = 8
dense_dim = 2048
num_layers = 4


### END YOUR CODE

In [21]:
part1_model = create_from_scratch_model(num_layers, embed_dim, keyvalue_dim, dense_dim, num_heads)
part1_training_args = create_seq2seq_training_args(part1_batch_size, part1_num_epochs)
part1_trainer = create_seq2seq_trainer(part1_model, part1_training_args,
                                       train_ds_part1, val_ds_part1)

part1_trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,No log,2.506739
2,2.553400,2.351666
3,2.553400,2.254676
4,2.246900,2.176433
5,2.246900,2.11645
6,2.103900,2.073208
7,2.103900,2.030288
8,2.005900,2.004958
9,2.005900,1.971138
10,1.928600,1.947111


TrainOutput(global_step=7890, training_loss=1.852191912478216, metrics={'train_runtime': 941.892, 'train_samples_per_second': 535.029, 'train_steps_per_second': 8.377, 'total_flos': 1776170314137600.0, 'train_loss': 1.852191912478216, 'epoch': 30.0})

**QUESTION:**

 1.a What is the final validation loss that you were able to achieve for the part1 model after training for 30 epochs? (Copy and paste the decimal value for the final validation loss, to 5 significant digits, e.g. a number like 0.56781 or 0.87632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 1.b Which model config parameters (if any) did you increase, to achieve a lower validation loss, while staying within the training time and overfitting guidelines? (List the names of the parameters you increased, e.g. embed_dim, keyvalue_dim, num_heads, dense_dim, num_layers. Put this list in square brackets in the answers file.)

**QUESTION:**

 1.c Which model config parameters (if any) did you decrease, to achieve a lower validation loss, while staying within the training time and overfitting guidelines? (List the names of the parameters you decreased, e.g. embed_dim, keyvalue_dim, num_heads, dense_dim, num_layers. Put this list in square brackets in the answers file.)

1.a: 1.76248
1.b: embed_dim, dense_dim
1.c: num_layers

In [22]:
"""
Before moving on, save a checkpoint of the model you just trained in your Drive,
So that you can pick up where you left off later if needed
"""

# Modify this path to the location in your Drive where you want to save the part1 model
part1_model_checkpoint_filepath = 'drive/MyDrive/DATASCI_266/model_checkpoints/part1_model'

In [23]:
# Run this line only after you've trained the part1 model
part1_model.save_pretrained(part1_model_checkpoint_filepath, from_pt=True)

In [24]:
# Run this line only if you need to reload the model you trained earlier
part1_model = T5ForConditionalGeneration.from_pretrained(part1_model_checkpoint_filepath)

### 1.3: Text Generation Parameters

Cross-entropy loss is great for training, but it's not a very interpretable metric for manually reviewing how well the model is doing as we experiment with available options. Ultimately, we want to actually look at the translations the model outputs, compare them to human translations, and potentially judge other aspects of the actual output.

To do that, we need to actually generate some model output. Remember that the model itself predicts probabilities for each word in the vocabulary, based on what words have already been generated, at each decoder time-step. In order to select which actual words to output, there are multiple decoder strategies we can use that are build on top of the model's predicted probabilities. (E.g. beam search, top-k or top-p sampling, repeat ngram constraints, min/max length constraints, etc.)

Let's define a function below to generate translations for new inputs. Then we'll define another function to translate the validation set and calculate some standard evaluation metrics for translation, as well as print out some translations for manual inspection. We'll include some arguments that you'll experiment with next.

In [25]:
def generate_output(model, tokenizer, input_sentences, batch_size, **kwargs):

    all_outputs = []

    for i in range(int(len(input_sentences) / batch_size) + 1):
        start_i, end_i = i * batch_size, (i + 1) * batch_size
        if start_i >= len(input_sentences):
            break

        inputs_encoded = tokenizer(input_sentences[start_i:end_i], padding=True, return_tensors='pt')
        output_ids = model.cuda().generate(inputs_encoded['input_ids'].cuda(), **kwargs)
        generated_sentences = tokenizer.batch_decode(output_ids,
                                                     skip_special_tokens=True,
                                                     clean_up_tokenization_spaces=False)
        all_outputs.extend(generated_sentences)

    return all_outputs

In [26]:
# Load the BLEU metric and the trained BLEURT model for semantic similarity scoring

bleu = evaluate.load("bleu")

bleurt_tokenizer = AutoTokenizer.from_pretrained("Elron/bleurt-base-512")
bleurt_model = AutoModelForSequenceClassification.from_pretrained("Elron/bleurt-base-512")

Downloading builder script:   0%|          | 0.00/5.94k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/321 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/779 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [27]:
def calculate_eval_metrics(text_pairs, model, tokenizer, batch_size, prefix="", **kwargs):
    original_texts = [prefix + pair[0] for pair in text_pairs]
    label_texts = [pair[1] for pair in text_pairs]

    # Translate original texts
    translations = generate_output(model, tokenizer, original_texts, batch_size, **kwargs)

    # Calculate BLEU scores
    bleu_results = bleu.compute(predictions=translations, references=label_texts)
    print('BLEU: ', bleu_results)

    # Calculate BLEURT scores
    bleurt_scores = []
    for i in range(int(len(translations) / batch_size) + 1):
        start_i, end_i = i * batch_size, (i + 1) * batch_size
        if start_i >= len(translations):
            break

        with torch.no_grad():
            scores = bleurt_model(**bleurt_tokenizer(label_texts[start_i:end_i],
                                                     translations[start_i:end_i],
                                                     truncation=True,
                                                     max_length=MAX_SEQUENCE_LENGTH,
                                                     padding='max_length',
                                                     return_tensors='pt'))[0].squeeze().numpy()
            if scores.shape:
                bleurt_scores.extend(scores)
            else:  # Happens when there was only one example in the last batch
                bleurt_scores.append(float(scores))

    print('BLEURT: ', np.mean(bleurt_scores))

    return translations

First, choose some keyword arguments to pass to the generate_output() function. These can be any parameters for the .generate() method (e.g. beam search or top-k or top-p sampling, no_repeat_ngram_size, etc). You will want to try the options listed in Question 1.e below, to be able to answer that question (but some of them can't be used at the same time). More info on each can be found in the [Huggingface documentation on text generation here](https://huggingface.co/docs/transformers/en/main_classes/text_generation).

Then run the function to translate the validation set and print out eval metrics. The function returns the translations, so we'll also print out a sample of those to manually inspect. Use what you see to iterate on the .generate() arguments, trying to find the most reasonable .generate() arguments that you can for the model you trained.

The output will not be great no matter what you do, but you should be able to make it a little more readable, with slightly better BLEU and BLEURT metrics, than the basic options specified in the Lesson 6 notebook.

Then answer the questions below.

In [34]:
"""
Fill in the decoder .generate() arguments that you want to use, like num_beams or top_p, etc.
"""

part1_generate_kwargs = {

    ### YOUR CODE HERE

    "num_beams": 5,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.9,
    "temperature": 1.0,
    "no_repeat_ngram_size": 4,

    ### END YOUR CODE
}

part1_val_translations = calculate_eval_metrics(
    val_pairs,
    part1_model,
    part1_tokenizer,
    part1_batch_size,
    **part1_generate_kwargs
)

BLEU:  {'bleu': 0.050238777393346486, 'precisions': [0.29869057449569425, 0.0932380289708394, 0.028968713789107765, 0.007896096229060586], 'brevity_penalty': 1.0, 'length_ratio': 1.2043759323719543, 'translation_length': 16954, 'reference_length': 14077}
BLEURT:  -1.0008367


In [35]:
# Print out a sample of outputs to manually review
for i in range(10):
    sample_i = random.choice(range(len(part1_val_translations)))
    print('Original:    ', val_pairs[sample_i][0])
    print('Reference:   ', val_pairs[sample_i][1])
    print('Translation: ', part1_val_translations[sample_i])
    print()

Original:     Seem they religious?
Reference:    Do they seem religious?
Translation:  they have they they they they will they they they in they they they?

Original:     I was, but I do find more pain in banishment Than death can yield me here by my abode.
Reference:    I was.
Translation:  I can do me in my death, but I was in me to do, but I can do

Original:     So I will, my liege, as I live.
Reference:    So I will, my liege.
Translation:  So I will, my I will, I will, as I will, I’m as I will

Original:     Since his Majesty went into the field, I have seen her rise from her bed, throw her nightgown upon her, unlock her closet, take forth paper, fold it, write upon't, read it, afterwards seal it, and again return to bed; yet all this while in a most fast sleep.
Reference:    Since his majesty went into the field, I have seen her rise from her bed, throw her nightgown upon her, unlock her closet, take out paper, fold it, write upon it, read it, afterwards seal it, and again retur

**QUESTION:**

 1.d What seems to be particularly bad about the part1 model's translations? (Choose one of the following options that you agree with most and put it in the answers file.)

 - A. The model keeps repeating the same common words or phrases over and over, which don't produce very meaningful statements.

 - B. The model is generating pretty good modern English, but it's quite offensive.

 - C. The model's output has mostly the same meaning as the input, but with minor grammatical mistakes.

 - D. The model is making up elaborate narrative details that don't appear in the original text.

**QUESTION:**

 1.e Which .generate() parameter seemed to help the most in addressing the main shortcoming(s) that you noticed in the part1 model's output? (Choose one of the following options and put it in the answers file.)

 - A. num_beams
 - B. do_sample
 - C. top_k
 - D. top_p
 - E. temperature
 - F. no_repeat_ngram_size

1.d: The model keeps repeating the same common words or phrases over and over, which don't produce very meaningful statements.
1.e: F. no_repeat_ngram_size

### 1.4 Test Set Evaluation Metrics

Once you've settled on training hyperparameters that produce good validation loss, and generation options that produce the best output you can so far, go ahead and calculate evaluation metrics on the test set, to warp up this from-scratch model.

Then answer the questions below.

In [36]:
# Print out eval metrics for the part1_model on the test set

part1_test_translations = calculate_eval_metrics(
    test_pairs,
    part1_model,
    part1_tokenizer,
    part1_batch_size,
    **part1_generate_kwargs
)

BLEU:  {'bleu': 0.05310149514645578, 'precisions': [0.2983491268847965, 0.09287450428553154, 0.02883951980129709, 0.009949876561681753], 'brevity_penalty': 1.0, 'length_ratio': 1.1650465213164838, 'translation_length': 16779, 'reference_length': 14402}
BLEURT:  -1.0047845


**QUESTION:**

 1.f What is the overall BLEU score that you achieved on the test set for the part1 model? (Copy and paste the decimal value for the overall BLEU score, to 5 significant digits, e.g. a number like 0.03671 or 0.09763. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 1.g What is the mean BLEURT score that you achieved on the test set for the part1 model? (Copy and paste the decimal value for the mean BLEURT score, to 5 significant digits, e.g. a number like -1.12345 or -0.54321. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

1.f: 0.05310
1.g: -1.00478

## 2. Small Pre-Trained T5 Model

What if we use a model that has already been pre-trained to recognize English (at least modern English), even if it hasn't yet been trained for our particular translation task?

We'll use a T5 small model, which should be able to generate good modern English, but we'll need to train it to encode and translate Shakespearean text.

### 2.1 Pre-trained Model Setup and Tokenization

The next two cells load the pre-trained model, and preprocess the data with the pre-trained tokenizer. Fill in the necessary code for each of these cells.

For preprocessing, you'll need to map the `preprocess_translation_batch` function that we created earlier to the `train_dataset` and `val_dataset`. Use the code from part 1 as an example, but now pass in the pretrained T5 tokenizer as a function keyward argument (kwarg). Also pass in the given task_prefix as the "prefix" kwarg for the preprocessing function.

In [37]:
"""
Load the pre-trained model and tokenizer
"""

t5_pretrained_checkpoint_name = 'google-t5/t5-small'

### YOUR CODE HERE

part2_tokenizer = T5Tokenizer.from_pretrained(t5_pretrained_checkpoint_name)
part2_model = T5ForConditionalGeneration.from_pretrained(t5_pretrained_checkpoint_name)


### END YOUR CODE

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [38]:
"""
Preprocess the datasets using the pretrained tokenizer, and the given task_prefix.
Use the task_prefix as the "prefix" argument to the function preprocess_translation_batch().
"""

task_prefix = 'Translate Shakespeare to Modern English: '

train_ds_part2 = train_dataset.map(
    preprocess_translation_batch,
    batched=True,
    fn_kwargs={

        ### YOUR CODE HERE
        'tokenizer': part2_tokenizer,
        'prefix': task_prefix
        ### END YOUR CODE

})

val_ds_part2 = val_dataset.map(preprocess_translation_batch,
    batched=True,
    fn_kwargs={

        ### YOUR CODE HERE
        'tokenizer': part2_tokenizer,
        'prefix': task_prefix
        ### END YOUR CODE

})

Map:   0%|          | 0/16798 [00:00<?, ? examples/s]

Map:   0%|          | 0/1145 [00:00<?, ? examples/s]

### 2.2 Fine-Tuning the Pre-Trained Model

Now create the training args and trainer to fine-tune this pre-trained model. We've given you part of the code: you'll use the same functions as above for `create_seq2seq_training_args` and `create_seq2seq_trainer`. Fill in the rest of the arguments that you need for this version of the model. Use the provided batch size and num_epochs.

In [40]:
"""
Create the training args and trainer for the pre-trained model.
Use the batch size and num_epochs provided below for this model.
"""

part2_batch_size = 32
part2_num_epochs = 4

part2_training_args = create_seq2seq_training_args(

    ### YOUR CODE HERE
    batch_size=part2_batch_size,
    num_epochs=part2_num_epochs,
    ### END YOUR CODE
)

part2_trainer = create_seq2seq_trainer(

    ### YOUR CODE HERE
    model=part2_model,
    training_args=part2_training_args,
    train_ds=train_ds_part2,
    val_ds=val_ds_part2
    ### END YOUR CODE
)



Run the cell below to fine-tune the part2 model, then answer the following questions.

In [41]:
part2_trainer.train()

Epoch,Training Loss,Validation Loss
1,0.9792,0.703788
2,0.7424,0.68585
3,0.7253,0.678866
4,0.7104,0.677436


TrainOutput(global_step=2100, training_loss=0.7858361017136347, metrics={'train_runtime': 392.4422, 'train_samples_per_second': 171.215, 'train_steps_per_second': 5.351, 'total_flos': 710459869102080.0, 'train_loss': 0.7858361017136347, 'epoch': 4.0})

**QUESTION:**

 2.a What is the final validation loss that you were able to achieve for the part2 model after training for 4 epochs? (Copy and paste the decimal value for the final validation loss, to 5 significant digits, e.g. a number like 0.56781 or 0.87632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

2.a: 0.67744

In [45]:
"""
Before moving on, save a checkpoint of the model you just trained in your Drive,
So that you can pick up where you left off later if needed
"""

# Modify this path to the location in your Drive where you want to save the part2 model
part2_model_checkpoint_filepath = 'drive/MyDrive/DATASCI_266/model_checkpoints/part2_model'

In [46]:
# Run this line only after you've fine-tuned the part2_model
part2_model.save_pretrained(part2_model_checkpoint_filepath, from_pt=True)

In [47]:
# Run this line only if you need to reload the model you fine-tuned earlier
part2_model = T5ForConditionalGeneration.from_pretrained(part2_model_checkpoint_filepath)

### 2.3 Fine-Tuned Model Evaluation

Now use the calculate_eval_metrics() function defined above to translate the test set and calculate evaluation metrics. Also print out a sample of the translated outputs. For now, use the same decoder .generate() kwargs that you chose for part1.

Then answer the questions below.

In [48]:
# Print out eval metrics for the part2_model on the test set

part2_test_translations = calculate_eval_metrics(
    test_pairs,
    part2_model,
    part2_tokenizer,
    part2_batch_size,
    task_prefix,
    **part1_generate_kwargs
)

BLEU:  {'bleu': 0.2788832967085738, 'precisions': [0.6337644057359022, 0.4129328898454314, 0.29932797179684917, 0.22467956773058556], 'brevity_penalty': 0.7656723305135211, 'length_ratio': 0.78926537980836, 'translation_length': 11367, 'reference_length': 14402}
BLEURT:  -0.0331898


In [49]:
# Print out a sample of the translated outputs to look at as well

for i in range(10):
    sample_i = random.choice(range(len(part2_test_translations)))
    print('Original:    ', test_pairs[sample_i][0])
    print('Reference:   ', test_pairs[sample_i][1])
    print('Translation: ', part2_test_translations[sample_i])
    print()

Original:     My lady has a white hand, and the Myrmidons are no bottle-ale houses.
Reference:    My girlfriend has beautiful white hands, and great warriors aren’t  Ha, ha!
Translation:  My lady has a white hand, and the Myrmidons are no bottle-a

Original:     Thou call’st on him that hates thee.
Reference:    You’re appealing to a son who hates you.
Translation:  You call him who hates you.

Original:     Tut, I have the best armor of the world.
Reference:    I have the best armor in the world.
Translation:  Tut, I have the best armor of the world.

Original:     God and our innocence defend and guard us!
Reference:    God defend and guard us innocents against them!
Translation:  God and our innocence defend and protect us!

Original:     Upon the stroke of ten.
Reference:    It’s almost ten o'clock.
Translation:  Upon the stroke of ten.

Original:     Had he not resembled My father as he slept, I had done't.
Reference:    If the King hadn’t resembled My father as he slept, I would’

**QUESTION:**

 2.b What is the overall BLEU score that you achieved on the test set for the part2 model? (Copy and paste the decimal value for the overall BLEU score, to 5 significant digits, e.g. a number like 0.03671 or 0.09763. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 2.c What is the mean BLEURT score that you achieved on the test set for the part2 model? (Copy and paste the decimal value for the mean BLEURT score, to 5 significant digits, e.g. a number like -1.12345 or -0.54321. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 2.d What do you notice about the part2 model's output? It should be much better than the part1 model's output. But the translations still probably don't perfectly match the reference human translations. What does the part2 model seem to still be doing poorly? (Chose one of the following options that you agree with most, and put it in the answers file.)

 - A. The generated translations are gibberish.

 - B. The generated translations are written in a far more casual style than the reference human translations.

 - C. The generated translations mean something completely different from the input text and reference translations.

 - D. The generated translations are too similar to the input text, and haven't been rephrased as much as the reference human translations.

2.b: 0.27888
2.c: -0.03319
2.d: D. The generated translations are too similar to the input text, and haven't been rephrased as much as the reference human translations.

### 2.4 Style Classifier

Now that the model is able to output more coherent translations, we can start to get more picky about different aspects of the output. We should also make sure that our evaluation metrics are capturing everything we want to be able to assess and improve in the model's output.

One thing we're not capturing yet is if the output has the right **style**. This task is sort of a translation task, but since it's between two different forms of English, we can also think of it as a style transfer task.

BLEU might help a little with that, but when the model chooses different words from the human reference, it could do so in ways that are still good modern English or that are still too much like Shakespeare. BLEURT won't tell us anything about the style, as long as the meaning is still similar to the reference.

How can we tell whether the output has the right style? We could train a separate classification model to predict whether text is Shakespearean or modern English. We have the data to do it! We just need to repurpose our data for a classification problem.

Use the code below to train a BERT classifier to predict whether a sentence is Shakespearean or modern English. We're providing this code for you, because it's not the main task and not based on a similar example from class. We want you to use it as one of your evaluation metrics, to help you iterate on your models for the main task.

In [50]:
def make_style_classifier_data(text_pairs):
    style_texts = [pair[0] for pair in train_pairs] + [pair[1] for pair in train_pairs]
    style_labels = [0 for pair in train_pairs] + [1 for pair in train_pairs]

    style_dataset = Dataset.from_dict({"text": style_texts, "label": style_labels})

    return style_dataset.shuffle()

style_train_ds = make_style_classifier_data(train_pairs)
style_valid_ds = make_style_classifier_data(val_pairs)

In [51]:
bert_checkpoint_name = 'bert-base-cased'
bert_tokenizer = BertTokenizer.from_pretrained(bert_checkpoint_name)
bert_style_classifier_model = BertForSequenceClassification.from_pretrained(bert_checkpoint_name)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [52]:
def preprocess_style_text(data):
    return bert_tokenizer.batch_encode_plus(
            data['text'],
            max_length=MAX_SEQUENCE_LENGTH,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_token_type_ids=True,
            return_tensors="pt"
        )

style_train_ds_preprocessed = style_train_ds.map(preprocess_style_text, batched=True)
style_valid_ds_preprocessed = style_valid_ds.map(preprocess_style_text, batched=True)

Map:   0%|          | 0/33596 [00:00<?, ? examples/s]

Map:   0%|          | 0/33596 [00:00<?, ? examples/s]

In [53]:
style_classifier_batch_size = 32
style_classifier_num_epochs = 2

style_training_args = TrainingArguments(
    output_dir="bert_shakespeare_style_classifier",
    per_device_train_batch_size=style_classifier_batch_size,
    per_device_eval_batch_size=style_classifier_batch_size,
    num_train_epochs=style_classifier_num_epochs,
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to='none'
)

In [54]:
metric = evaluate.load('accuracy')

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [55]:
style_trainer = Trainer(
    model=bert_style_classifier_model,
    args=style_training_args,
    train_dataset=style_train_ds_preprocessed,
    eval_dataset=style_valid_ds_preprocessed,
    compute_metrics=compute_metrics
)

In [56]:
style_trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.412,0.299432,0.855638
2,0.2894,0.226104,0.887248


TrainOutput(global_step=2100, training_loss=0.36195233844575425, metrics={'train_runtime': 651.0491, 'train_samples_per_second': 103.206, 'train_steps_per_second': 3.226, 'total_flos': 1381168596230400.0, 'train_loss': 0.36195233844575425, 'epoch': 2.0})

In [57]:
"""
Before moving on, save a checkpoint of the model you just trained in your Drive,
So that you can pick up where you left off later if needed
"""

# Modify this path to the location in your Drive where you want to save the style classifier
style_classifier_checkpoint_filepath = 'drive/MyDrive/DATASCI_266/model_checkpoints/style_classifier'

In [58]:
# Run this line only after you've trained the style classifier model
bert_style_classifier_model.save_pretrained(style_classifier_checkpoint_filepath, from_pt=True)

In [59]:
# Run this line only if you need to reload the style classifier you trained earlier
bert_style_classifier_model = BertForSequenceClassification.from_pretrained(style_classifier_checkpoint_filepath)

Now let's use the style classifier to classify the output from the Shakespeare translation model, using the test set from our main task. The function reports the average predicted probability of the positive class, which is the modern English style (and which is our goal for our main task model).

We should also classify the original Shakespearean text and the human translations from the test set, to compare the scores as references.

Run the next two cells of code, then answer the following questions.

In [60]:
def get_modern_style_score(text):
  text_inputs = bert_tokenizer.batch_encode_plus(
            text,
            max_length=MAX_SEQUENCE_LENGTH,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_token_type_ids=True,
            return_tensors="pt"
        )

  with torch.no_grad():
      logits = bert_style_classifier_model.cuda()(text_inputs['input_ids'].cuda(),
                                                  attention_mask=text_inputs['attention_mask'].cuda()).logits

  probs = softmax(logits.cpu().numpy(), axis=1)
  return np.mean(probs[:, 1])

In [61]:
test_original_texts = [task_prefix + pair[0] for pair in test_pairs]
test_label_texts = [pair[1] for pair in test_pairs]

translations_score = get_modern_style_score(part2_test_translations)
reference_score = get_modern_style_score(test_label_texts)
shakespeare_score = get_modern_style_score(test_original_texts)

print("Modern style score for generated translations:  ", translations_score)
print("Modern style score for reference translations:  ", reference_score)
print("Modern style score for original Shakespeare:    ", shakespeare_score)

Modern style score for generated translations:   0.498473
Modern style score for reference translations:   0.8375867
Modern style score for original Shakespeare:     0.31582302


**QUESTION:**

 2.e What is the modern style classifier score that you got for the part2 model's generated translations? (Copy and paste the decimal value from the get_modern_style_score function above, to 5 significant digits, e.g. a number like 0.36712 or 0.97632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 2.f What is the modern style classifier score that you got for the human reference translations? (Copy and paste the decimal value from the get_modern_style_score function above, to 5 significant digits, e.g. a number like 0.36712 or 0.97632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 2.g What is the modern style classifier score that you got for the original Shakespeare text? (Copy and paste the decimal value from the get_modern_style_score function above, to 5 significant digits, e.g. a number like 0.36712 or 0.97632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 2.h What do you notice about differences between these scores, and what does that tell you about what the part2 model is doing? (Chose one of the following options that you agree with most, and put it in the answers file.)

 - A. The part2 model is generating output that is way more modern, casual, and younger generation speak than the human translations.

 - B. The part2 model is generating output that looks about as modern as the human translations, even if it doesn't always mean the same thing.

 - C. The part2 model is generating output that is partly modernized, more modern than the original Shakespeare, but still not as modern as the human references.

 - D. The part2 model is generating output that is still pretty much the same style as the original Shakespeare text.

2.e: 0.49847
2.f: 0.83759
2.g: 0.31582
2.h: C. The part2 model is generating output that is partly modernized, more modern than the original Shakespeare, but still not as modern as the human references.

### 2.5 Revisit Decoder .Generate() Options

Now that we have one more evaluation metrics, let's go back to the decoder .generate() arguments we used before. Are there any arguments you want to change, to try to do better on this latest evaluation metric?

Try different options for the part2_generate_kwargs below, and run the two cells afterward with each set of choices to see how the evaluation metrics change.

Then answer the questions below.

In [71]:
"""
Fill in the decoder .generate() arguments that you want to use for the part2 model, like num_beams or top_p, etc.
"""

part2_generate_kwargs = {

    ### YOUR CODE HERE

    "num_beams": 1,
    "do_sample": True,
    "top_k": 100,
    "top_p": 0.95,
    "temperature": 1.5,
    "no_repeat_ngram_size": 3

    ### END YOUR CODE
}

In [72]:
# Print out eval metrics for the part2_model on the test set, with the new kwargs

part2_test_translations = calculate_eval_metrics(
    test_pairs,
    part2_model,
    part2_tokenizer,
    part2_batch_size,
    task_prefix,
    **part2_generate_kwargs
)

BLEU:  {'bleu': 0.05801802229593404, 'precisions': [0.28446837554982557, 0.08670376214600116, 0.0381791483113069, 0.017400204708290685], 'brevity_penalty': 0.9119054019926945, 'length_ratio': 0.9155672823218998, 'translation_length': 13186, 'reference_length': 14402}
BLEURT:  -0.7483555


In [73]:
# Calculate modern style scores for the part2 translations after using the new kwargs

translations_score = get_modern_style_score(part2_test_translations)

print("Modern style score for generated translations:  ", translations_score)

Modern style score for generated translations:   0.7735577


In [74]:
# Print out a sample of the translated outputs with the revised .generate() parameters

for i in range(10):
    sample_i = random.choice(range(len(part2_test_translations)))
    print('Original:    ', test_pairs[sample_i][0])
    print('Reference:   ', test_pairs[sample_i][1])
    print('Translation: ', part2_test_translations[sample_i])
    print()

Original:     I am the very man— I’ll see that straight.
Reference:    I’m the one who— I’ll get right on that.
Translation:  I am the only man we’ve ever had in mind: I can just conceive it

Original:     How is ’t?
Reference:    How do you feel?
Translation:  To whom should I mean?

Original:     Fare you well.
Reference:    Goodbye.
Translation:  Get out loud...

Original:     I tell you 'tis not very well.
Reference:    I tell you it is not very well.
Translation:  I really don’t know what makes this well to do.

Original:     My voice is ragged.
Reference:    My voice is ragged.
Translation:  I’m not accustomed enough enough!

Original:     Petruchio is my name, Antonio’s son, A man well known throughout all Italy.
Reference:    My name is Petruchio, son of Antonio, a man well known throughout Italy.
Translation:  Petruchio’s son, an authentic man made by Italian origin in 1582 but it sounds

Original:     aside to Sebastian] I am right glad that he's so out [aside to Sebastian] I

**QUESTION:**

 2.i Which decoder strategy seemed to increase the modern style score the most? (Choose one of the following options and put it in the answers file.)

 - A. Using a stricter option to always choose the highest predicted possibility output (e.g. beam search, or small k or p when using sampling).

 - B. Using a looser sampling method to allow the model to choose more varied output (e.g. top-k or top-p rather than beam search, especially with higher k or p and/or higher temperature).

**QUESTION:**

 2.j What happens to the other evaluation metrics when you try to increase the modern style score by varying the decoder strategy discussed in 2.i? (Choose one of the following options and put it in the answers file.)

 - A. BLEU and BLEURT both seem to be positively correlated with the modern style score, when changing the decoder strategy.

 - B. BLEU and BLEURT both seem to be negatively correlated with the modern style score, when changing the decoder strategy.

 - C. BLEU seems to move with the modern style score, but BLEURT seems to go the other direction.

 - D. BLEURT seems to move with the modern style score, but BLEU seems to go the other direction.

**QUESTION:**

 2.k Why do you think the relationship in question 2.j is happening? (Choose one of the following options and put it in the answers file.)

 - A. A stricter decoder strategy makes the model more likely to output the best translation, which is good for BLEU, BLEURT, and modern style objectives.

 - B. A looser decoder strategy gives the model more freedom to find a good modern style translation, which should also end up saying more of the same things in the same way as the human translation.

 - C. A stricter decoder strategy makes the model more likely to output a translation that has correct exact words and style, increasing BLEU and modern style scores, but might not mean the same thing as the human translation.

 - D. A looser decoder strategy gives the model more freedom to choose more modern style words, which the pre-trained model is more familiar with, but that freedom can make the model less likely to end up with the exact same words or meaning as the human translation.

 - E. A stricter decoder strategy makes the model more likely to choose more of the exact same words as used in the dataset, but not necessarily in the same order, so the meaning and style don't end up being as close to the human translation.

2.i: B. Using a looser sampling method to allow the model to choose more varied output (e.g. top-k or top-p rather than beam search, especially with higher k or p and/or higher temperature).
2.j: B. BLEU and BLEURT both seem to be negatively correlated with the modern style score, when changing the decoder strategy.
2.k: D. A looser decoder strategy gives the model more freedom to choose more modern style words, which the pre-trained model is more familiar with, but that freedom can make the model less likely to end up with the exact same words or meaning as the human translation.

## 3. Adding Supplementary Paraphrase Dataset

Can we do anything else to make the model capable of rephrasing the input text more into a different style (i.e. modernizing it more fully away from the Shakespeare), but still keep the same meaning?

One related task that could help is a paraphrasing task. The [GLUE Microsoft Research Paraphrase Corpus (MRPC)](https://huggingface.co/datasets/nyu-mll/glue) dataset has pairs of sentences with labels indicating whether the two sentences are equivalent (i.e. they mean the same thing) or not.

We could use that data as an additional supporting task for our T5 model, to see if it helps our model get better at accurately rephrasing the input text into a differently worded output.

### 3.1 Load and preprocess the supplemental dataset

Load the dataset from Huggingface and look at the contents

In [97]:
mrpc_data = load_dataset('nyu-mll/glue', 'mrpc')

In [98]:
mrpc_data

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [99]:
mrpc_data['train'].features['label'].names

['not_equivalent', 'equivalent']

Let's use just the pairs that are labeled as equivalent (correct paraphrases), and split the sentences to use the first as the model's input and second as the model's output. Then we can use that to train our T5 model to better generate rephrased statements in modern English with the same meaning as the input but in different words.

In [100]:
mrpc_equiv_train = mrpc_data['train'].filter(lambda example: example['label'] == 1)
mrpc_equiv_valid = mrpc_data['validation'].filter(lambda example: example['label'] == 1)

Fill in the code below to encode "sentence1" as the model's input and "sentence2" as the model's output.

We will also add a different prefix for this supporting task, so make sure to add the prefix to the inputs in the function below. (You can use the preprocess_translation_batch function above as an example.)

In [106]:
def preprocess_mrpc_for_paraphrase_generation(mrpc_ds, tokenizer, prefix):

    ### YOUR CODE HERE

    if prefix:
        mrpc_ds["sentence1"] = [prefix + sentence for sentence in mrpc_ds["sentence1"]]

    # Encode both sentence1 and sentence2
    input_encoded = tokenizer.batch_encode_plus(
        mrpc_ds["sentence1"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    )

    output_encoded = tokenizer.batch_encode_plus(
        mrpc_ds["sentence2"],
        max_length=MAX_SEQUENCE_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    )


    ### END YOUR CODE

    return {'input_ids': input_encoded['input_ids'],
            'labels': output_encoded['input_ids']}

Now map the preprocessing function to the MRPC train and validation datasets. Use the part2_tokenizer to preprocess the data, since we're using the same T5 pre-trained model checkpoint as in part 2. For the preprocessing function's "prefix" argument, use the paraphrase_prefix provided below.

In [107]:
paraphrase_prefix = 'Paraphrase in modern English: '

### YOUR CODE HERE

mrpc_paraphrase_train = mrpc_equiv_train.map(
    preprocess_mrpc_for_paraphrase_generation,
    batched=True,
    fn_kwargs={'tokenizer': part2_tokenizer, 'prefix': paraphrase_prefix}
)

mrpc_paraphrase_valid = mrpc_equiv_valid.map(
    preprocess_mrpc_for_paraphrase_generation,
    batched=True,
    fn_kwargs={'tokenizer': part2_tokenizer, 'prefix': paraphrase_prefix}
)

### END YOUR CODE

### 3.2 Train T5 on Paraphrasing Task

Load a fresh copy of the pre-trained T5 model (using the same pre-trained model checkpoint as part2), so that we can train it first on the paraphrase task, and last on the main task data.

In [108]:
"""
Load a new copy of the same pre-trained model (we'll use the same in tokenizer as part2)
"""

t5_pretrained_checkpoint_name = 'google-t5/t5-small'

### YOUR CODE HERE

part3_model = T5ForConditionalGeneration.from_pretrained(t5_pretrained_checkpoint_name)
part2_tokenizer = T5Tokenizer.from_pretrained(t5_pretrained_checkpoint_name)

### END YOUR CODE

Now create the training args and trainer for the paraphrase task, and train the model. Use the `create_seq2seq_training_args` and `create_seq2seq_trainer` functions like before.

You'll be using the part3_model you just loaded, and the MRPC data you preprocessed. Use the batch_size and num_epochs provided for the paraphrase task below.

In [112]:
"""
Create the training args and trainer for the paraphrase task.
"""

paraphrase_batch_size = 32
paraphrase_num_epochs = 4

### YOUR CODE HERE

# Create the training arguments for the paraphrase task
paraphrase_training_args = create_seq2seq_training_args(
    batch_size=paraphrase_batch_size,
    num_epochs=paraphrase_num_epochs
)

# Use the previously loaded model (part3_model) for training
paraphrase_trainer = create_seq2seq_trainer(
    model=part3_model,  # Use the part3_model you already loaded
    training_args=paraphrase_training_args,
    train_ds=mrpc_paraphrase_train,
    val_ds=mrpc_paraphrase_valid
)
### END YOUR CODE



In [113]:
paraphrase_trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,1.194336
2,No log,1.17907
3,No log,1.170664
4,No log,1.169051


TrainOutput(global_step=312, training_loss=1.2888753842084835, metrics={'train_runtime': 67.9836, 'train_samples_per_second': 145.564, 'train_steps_per_second': 4.589, 'total_flos': 104636130263040.0, 'train_loss': 1.2888753842084835, 'epoch': 4.0})

### 3.3 Fine-Tune Paraphrase-Trained Model on Main Task

Now create the training args and trainer for the main task. Use the `create_seq2seq_training_args` and `create_seq2seq_trainer` functions one more time.

You'll be using the same model that you just trained on the paraphrase task (part3_model). Use the batch size and num epochs provided below.

For training data, use the same data as part2: `train_ds_part2` and `val_ds_part2`. We're using the same pre-trained model checkpoint, i.e. the same tokenizer, and the same task prefix, so the data has already been preprocessed correctly.

In [114]:
"""
Create the training args and trainer for the main task using the part3_model.
"""

part3_batch_size = 32
part3_num_epochs = 4

### YOUR CODE HERE

part3_training_args = create_seq2seq_training_args(
    batch_size=part3_batch_size,
    num_epochs=part3_num_epochs
)

part3_model = T5ForConditionalGeneration.from_pretrained('drive/MyDrive/DATASCI_266/model_checkpoints/part3_model')

part3_trainer = create_seq2seq_trainer(
    model=part3_model,
    training_args=part3_training_args,
    train_ds=train_ds_part2,
    val_ds=val_ds_part2
)


### END YOUR CODE



In [115]:
part3_trainer.train()

Epoch,Training Loss,Validation Loss
1,0.6411,0.658234
2,0.6336,0.653408
3,0.6388,0.65138
4,0.6404,0.650632


TrainOutput(global_step=2100, training_loss=0.6393035162062872, metrics={'train_runtime': 402.7794, 'train_samples_per_second': 166.821, 'train_steps_per_second': 5.214, 'total_flos': 710459869102080.0, 'train_loss': 0.6393035162062872, 'epoch': 4.0})

In [116]:
"""
Before moving on, save a checkpoint of the model you just trained in your Drive,
So that you can pick up where you left off later if needed
"""

# Modify this path to the location in your Drive where you want to save the part3 model
part3_model_checkpoint_filepath = 'drive/MyDrive/DATASCI_266/model_checkpoints/part3_model'

In [117]:
# Run this line only after you've trained the part3 model on both tasks
part3_model.save_pretrained(part3_model_checkpoint_filepath, from_pt=True)

In [118]:
# Run this line only if you need to reload the part3 model you trained earlier
part3_model = T5ForConditionalGeneration.from_pretrained(part3_model_checkpoint_filepath)

### 3.4 Paraphrase-Trained Model Evaluation

Use the functions defined above to translate the test set and calculate the same set of evaluation metrics as used on the part2 model.

Use the same decoder .generate() arguments as part2 (`part2_generate_kwargs`), so that we can compare the part2 and part3 models as closely as possible.

Run the next three cells, then answer the questions below.

In [119]:
# Print out eval metrics for the part3_model on the test set, with the new kwargs

part3_test_translations = calculate_eval_metrics(
    test_pairs,
    part3_model,
    part2_tokenizer,
    part3_batch_size,
    task_prefix,
    **part2_generate_kwargs
)

BLEU:  {'bleu': 0.07690250293430466, 'precisions': [0.32618803688233905, 0.11382536382536383, 0.056159246081353975, 0.028783958602846053], 'brevity_penalty': 0.8737169445311601, 'length_ratio': 0.8810581863630051, 'translation_length': 12689, 'reference_length': 14402}
BLEURT:  -0.65186644


In [120]:
# Calculate modern style scores for the part3 translations after using the new kwargs

translations_score = get_modern_style_score(part3_test_translations)

print("Modern style score for generated translations:  ", translations_score)

Modern style score for generated translations:   0.7754838


In [121]:
# Print out a sample of the translated outputs to look at as well

for i in range(10):
    sample_i = random.choice(range(len(part3_test_translations)))
    print('Original:    ', test_pairs[sample_i][0])
    print('Reference:   ', test_pairs[sample_i][1])
    print('Translation: ', part3_test_translations[sample_i])
    print()

Original:     My lady has a white hand, and the Myrmidons are no bottle-ale houses.
Reference:    My girlfriend has beautiful white hands, and great warriors aren’t  Ha, ha!
Translation:  My lady has a white hand while the ones I’ve mentioned have no bottled wine from

Original:     Thou call’st on him that hates thee.
Reference:    You’re appealing to a son who hates you.
Translation:  From man who hates you you call on him.

Original:     Tut, I have the best armor of the world.
Reference:    I have the best armor in the world.
Translation:  Our old grandfather, of all of me, has the finest knight of today, where all

Original:     God and our innocence defend and guard us!
Reference:    God defend and guard us innocents against them!
Translation:  “Dear God and what are you going to do will defend us and guard for that!

Original:     Upon the stroke of ten.
Reference:    It’s almost ten o'clock.
Translation:  The stroke ten.

Original:     Had he not resembled My father as he slept

**QUESTION:**

 3.a What is the overall BLEU score that you achieved on the test set for the part3 model? (Copy and paste the decimal value for the overall BLEU score, to 5 significant digits, e.g. a number like 0.03671 or 0.09763. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 3.b What is the mean BLEURT score that you achieved on the test set for the part3 model? (Copy and paste the decimal value for the mean BLEURT score, to 5 significant digits, e.g. a number like -1.12345 or -0.54321. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 3.c What is the modern style classifier score that you got for the part3 model's generated translations? (Copy and paste the decimal value from the get_modern_style_score function above, to 5 significant digits, e.g. a number like 0.36712 or 0.97632. Put the answer in the answers file; it should match the value shown in your output in this notebook.)

**QUESTION:**

 3.d How do the part3 model's evaluation scores and output compare to the part2 model? Write a short answer about what you observe in the markdown cell below.

3.a: 0.07690
3.b: -0.65187
3.c: 0.77548

*** YOUR ANSWER TO QUESTION 3.d HERE ***

The part3 model is producing translations that sound more modern, with a modern style score of 0.77548, compared to part2's score of 0.49847. However, this shift towards modern language comes with a trade-off. The translations appear to be less accurate compared to the original Shakespearean text, as seen in the lower BLEU score of 0.07690 and BLEURT score of -0.65187 for part3, compared to part2's BLEU score of 0.27888 and BLEURT score of -0.03319.