# Project 4: Finetuning Transformer Language Models

In this project, you will first learn how to use Huggingface's Transformers library to load large language models. Next, we will generate text from these models. Finally, we will finetune models on two tasks (sentiment analysis and machine translation).

This project will be more open-ended than the previous projects. We expect you to learn how to use the Huggingface and `torch` documentation.

## Setup

First we install and import the required dependencies. These include:
* `torch` for modeling and training
* `transformers` for pre-trained models
* `datasets` from huggingface to load existing datasets.

In [1]:
%%capture
!pip install transformers
!pip install datasets
!pip install --upgrade sacrebleu sentencepiece

# Core imports: PyTorch and Huggingface Transformers
import torch
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForCausalLM, GenerationConfig, EarlyStoppingCallback

Before proceeding, let's verify that we're connected to a GPU runtime and that `torch` can detect the GPU.
We'll define a variable `device` here to use throughout the code so that we can easily change to run on CPU for debugging.

In [2]:
if torch.cuda.is_available():
    print("Found GPU")
    device = "cuda"
else:
    print("Did not find GPU")
    device = "cpu"
print("Using device:", device)

Found GPU
Using device: cuda


In [3]:
import os

os.makedirs('./saves/', exist_ok=True)

### Loading Model

We will use GPT-2 medium for this project. This includes both the GPT-2 tokenizer and the GPT-2 model weights. If you want to learn more about this model, you can read the GPT-2 paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Let's first load the tokenizer for the GPT-2 medium model. You can find out how to do this by reading the documentation for `AutoTokenizer` in `transformers` and finding the GPT-2 model with ~345 million parameters there.

In [4]:
# Your code here
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token

Let's tokenize and detokenize some text from this model.

In [5]:
print(tokenizer.encode('Hello world'))
print(tokenizer.decode(tokenizer.encode('Hello world')))
print(tokenizer.encode("Hola, cómo estás😍"))

[15496, 995]
Hello world
[39, 5708, 11, 269, 10205, 5908, 1556, 40138, 47249, 235]


Now let's load the GPT-2 medium model. Make sure you also put the model onto the GPU.

In [6]:
# Your code here
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
gpt2_model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout)

## Generate From the Model

Now let's generate some text from the model to test its LM capabilities. Generate 10 pieces of text of 50 tokens each using random sampling with temperature set to 0.7. Random sampling keeps the text diverse, while a temperature below 1 maintains reasonable quality. When generating text, you can condition on a phrase such as "The coolest thing right now in NLP is". Find the relevant function and arguments for generating text in the Huggingface documentation.

Hint: you may find https://huggingface.co/docs/transformers/main_classes/text_generation to be useful for learning about generating from LMs.
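To make the effect of temperature concrete, here is a minimal pure-Python sketch of temperature-scaled sampling. It does not use the Huggingface API; the function names and the four toy scores are hypothetical, chosen only to show that dividing the scores by a temperature below 1 sharpens the distribution before sampling.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical next-token scores for a 4-word vocabulary.
toy_logits = [2.0, 1.0, 0.5, 0.1]
probs_t1 = softmax_with_temperature(toy_logits, 1.0)
probs_t07 = softmax_with_temperature(toy_logits, 0.7)

rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [sample_token(toy_logits, 0.7, rng) for _ in range(10)]

print("T=1.0:", [round(p, 3) for p in probs_t1])
print("T=0.7:", [round(p, 3) for p in probs_t07])
print("samples:", samples)
```

At temperature 0.7 the most likely token receives more probability mass than at temperature 1, which is why the samples stay diverse but drift less often into low-quality continuations.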

In [7]:
inputs = tokenizer("<|startoftext|>The coolest thing right now in NLP is", return_tensors="pt").input_ids.to(device)
# Your code here
generation_config = GenerationConfig.from_pretrained(
    "gpt2-medium", do_sample=True, num_return_sequences=10, max_new_tokens=50-len(inputs[0]), temperature=0.7, pad_token_id=tokenizer.pad_token_id
)
sample_outputs = gpt2_model.generate(inputs, generation_config=generation_config)

Now let's print the text.

In [8]:
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: <|startoftext|>The coolest thing right now in NLP is that you can have a lot of different words with different meanings depending on how the language is used. So if you're learning the language, then anything that is translated to
1: <|startoftext|>The coolest thing right now in NLP is the one that allows you to "fake" human speech. What it's like to be a human being in this world. It's like you're in a movie and
2: <|startoftext|>The coolest thing right now in NLP is that you can get a really really deep insight into the brain of someone who's been studying them, looking at their thoughts, or anything else. So, you can get
3: <|startoftext|>The coolest thing right now in NLP is that the subject of your query is often an image or text. With this, you can query for every instance of the subject in the database while not using any query engine
4: <|startoftext|>The coolest thing right now in NLP is the one-shot approach, where you just do one thing rather than having multiple steps. I

Now generate one piece of text of length 50 with the same prompt ("The coolest thing right now in NLP is") but use greedy decoding. This roughly corresponds to generating some text that is high likelihood for the model.
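As a rough illustration of what greedy decoding does, the sketch below always appends the single highest-scoring next token. The bigram table and helper names are hypothetical stand-ins for a real model, so treat this as a picture of the idea and not of how `transformers` implements it.

```python
def greedy_decode(next_token_scores, prompt, max_new_tokens, eos="<eos>"):
    """Repeatedly append the highest-scoring next token (argmax)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # argmax over candidate tokens
        tokens.append(best)
        if best == eos:
            break  # stop as soon as the end-of-sequence token is chosen
    return tokens


# Hypothetical bigram "model": scores depend only on the last token.
TOY_BIGRAMS = {
    "is": {"the": 0.6, "a": 0.4},
    "the": {"ability": 0.5, "model": 0.3, "<eos>": 0.2},
    "ability": {"<eos>": 0.9, "the": 0.1},
}


def toy_scores(tokens):
    """Look up next-token scores; unknown contexts end the sequence."""
    return TOY_BIGRAMS.get(tokens[-1], {"<eos>": 1.0})


greedy_output = greedy_decode(toy_scores, ["NLP", "is"], max_new_tokens=5)
print(greedy_output)
```

Because there is no randomness, running the decoder twice on the same prompt gives the same output, which is why we only ask for one piece of text here.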

In [9]:
inputs = tokenizer("<|startoftext|>The coolest thing right now in NLP is", return_tensors="pt").input_ids.to(device)
# Your code here
generation_config = GenerationConfig.from_pretrained(
    "gpt2-medium", max_new_tokens=50-len(inputs[0]), pad_token_id=tokenizer.pad_token_id
)
outputs = gpt2_model.generate(inputs, generation_config=generation_config)[0]
tokenizer.decode(outputs, skip_special_tokens=True)

'<|startoftext|>The coolest thing right now in NLP is the ability to use the word "startoftext" to refer to a word that is not part of the text. This is useful for things like "the word"'

### Translation

Now let's try to see how good of a translation system GPT-2 medium is when used "out of the box". To accomplish this, we can condition on a prompt like the one below and generate from the model with greedy decoding. This will attempt to translate the sentence "UC Berkeley ist eine Schule in Kalifornien", which means "UC Berkeley is a school in California". Make sure to set the max length to be high enough so that the model generates sufficient text.

In [10]:
prompt = """Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English:"""

In [11]:
# Your code here. Generate from the model using greedy decoding with the above prompt
inputs = tokenizer("<|startoftext|>"+prompt, return_tensors="pt").input_ids.to(device)
generation_config = GenerationConfig.from_pretrained(
    "gpt2-medium", max_new_tokens=20, pad_token_id=tokenizer.pad_token_id
)
outputs = gpt2_model.generate(inputs, generation_config=generation_config)[0]
tokenizer.decode(outputs, skip_special_tokens=True)

'<|startoftext|>Translate the following texts into English.\n\nGerman: UC Berkeley ist eine Schule in Kalifornien\nEnglish: UC Berkeley ist eine Schule in Kalifornien\n\nTranslate the following texts'

As we can see, the translation quality is terrible: the model just copies the German sentence instead of translating it.

Now, let's finetune GPT-2 on the translation task to improve the results. We will use a translation dataset from the Huggingface dataset repository (it has thousands of other datasets available). This dataset is one of TED talks translated between German and English.

In [12]:
import datasets
dataset = datasets.load_dataset("ted_talks_iwslt", language_pair=("de", "en"), year="2014")

Found cached dataset ted_talks_iwslt (/home/ubuntu/.cache/huggingface/datasets/ted_talks_iwslt/de_en_2014-9408486716c87367/1.1.0/a42f763b98f8e9cc19358a2ac1007b0b600554e260ee48e6316df39703ef37a4)



In [13]:
print(dataset['train'][0]['translation'])

{'de': '"Ich habe Zerebralparese. Ich zappele die ganze Zeit", kündigt Maysoon Zayid zu Anfang dieses ungeheuer witzigen, erheiternden an. (Er ist wirklich ungeheur witzig.) "Als würde Shakira auf Muhammad Ali treffen." Elegant und scharfsinnig nimmt uns die arabisch-amerikanische Komikerin auf eine Reise durch ihre Abenteuer als Schauspielerin, Komikerin, Philanthropin und Fürsprecherin für Menschen mit Behinderungen mit.', 'en': '"I have cerebral palsy. I shake all the time," Maysoon Zayid announces at the beginning of this exhilarating, hilarious talk. (Really, it\'s hilarious.) "I\'m like Shakira meets Muhammad Ali." With grace and wit, the Arab-American comedian takes us on a whistle-stop tour of her adventures as an actress, stand-up comic, philanthropist and advocate for the disabled.'}


Now we can create a dataset. Each element in the dataset should consist of a text prompt followed by the translation, similar to above. Your job is to fill in the labels field below. This field sets the labels to use for training during the language modeling task.

For the labels, we only want to train the model to output the text after the words "English:". This is because in the prompt, everything before the words "English:" will also be provided to the model as input. Hint: use -100 as the label for tokens you do not want to train on.
Hint 2: When doing LM training, the labels are the same as the input tokens, except shifted to the left by one. You should check whether Huggingface is already doing the shifting, or whether you need to do the shifting yourself.

One thing to be careful of with all LMs is to make sure there are no extra spaces. The text should be formatted like "English: Hello...", not "English:  Hello...". This is a common problem people face when using APIs like GPT-3, which we will cover next time.
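The two hints can be sketched in plain Python. This is a toy illustration with made-up token ids, not the dataset class itself: `build_labels` and `shifted_training_pairs` are hypothetical helpers. To our understanding, the causal LM heads in `transformers` perform the one-position shift internally, so the labels can simply be a masked copy of the input ids, but you should confirm this in the documentation for the model you load.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_labels(input_ids, attention_mask, prompt_length):
    """Copy input_ids, masking the prompt and the padding positions."""
    labels = []
    for position, (token_id, mask) in enumerate(zip(input_ids, attention_mask)):
        if position < prompt_length or mask == 0:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels


def shifted_training_pairs(input_ids, labels):
    """Pair each input token with the label one position to its right,
    which is what a causal LM loss compares after shifting."""
    return [
        (input_ids[i], labels[i + 1])
        for i in range(len(input_ids) - 1)
        if labels[i + 1] != IGNORE_INDEX
    ]


# Toy ids: 3 prompt tokens, 2 answer tokens, the EOS token (50256 in GPT-2),
# then 2 padding tokens that reuse the EOS id but have attention_mask 0.
toy_input_ids = [11, 12, 13, 21, 22, 50256, 50256, 50256]
toy_attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]

toy_labels = build_labels(toy_input_ids, toy_attention_mask, prompt_length=3)
toy_pairs = shifted_training_pairs(toy_input_ids, toy_labels)

print(toy_labels)
print(toy_pairs)
```

Only the answer tokens and the real end-of-text token contribute to the loss; the last prompt token is still used as input for predicting the first answer token.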

In [14]:
prompt1 = """Translate the following texts into English.
German:"""
prompt2 = """
English:"""

class TranslationDataset(Dataset):
    def __init__(self, examples, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for example in examples:
            training_text = prompt1+" "+example['translation']['de']+prompt2+" "+example['translation']['en']+"<|endoftext|>"
            encodings_dict = tokenizer(training_text, max_length=275, padding="max_length", truncation=True)
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            prompt_and_input_length = len(tokenizer.encode(prompt1+" "+example['translation']['de']+prompt2))
            # your code below
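            labels = torch.tensor(encodings_dict['input_ids'])
            labels[:prompt_and_input_length] = -100  # no loss on the prompt and the German source
            labels[torch.tensor(encodings_dict['attention_mask']) == 0] = -100  # no loss on padding (pad shares the EOS id)
            self.labels.append(labels)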

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids':self.input_ids[idx], 'attention_mask':self.attn_masks[idx], 'labels':self.labels[idx]}

In [15]:
translation_dataset = TranslationDataset(dataset['train'], tokenizer)

Now let's break the dataset into a train and validation split.

In [16]:
train_size = int(0.9 * len(translation_dataset))
train_dataset, val_dataset = random_split(translation_dataset, [train_size, len(translation_dataset) - train_size])
print(len(train_dataset))
print(len(val_dataset))

2674
298


In [17]:
print(train_dataset[0])

{'input_ids': tensor([ 8291, 17660,   262,  1708, 13399,   656,  3594,    13,   198, 16010,
           25,  9626,  8687, 18320, 42803,  2688,  6877,   288,   292,   304,
        10745,  4891, 11896, 46097, 45371, 23773,   920, 35875,    83,    11,
          288,   562,  4656,   412,   328,   641, 11693, 14785, 18042,   520,
        11033,    67,  1452,   842,  2120,    13,   370, 48988,  1481,    11,
          509, 22157,   270, 11033,   912, 10366,   268,    11, 45371,   354,
         7972,   328,   365,   270,  4587,   376,  1046,    70, 11033,   782,
          263,  3318,   410,   494,   293,   290,   567,  1081,   431,    74,
          660, 18042,   520, 11033,    67,  1452,   300,   562,   268,   264,
          488,   257,   385,   304,  7274,   304,   259,    89,  9324,  1902,
         9101,    82,   325,   607,   293,   270,   268,    25,  9626,  1355,
           85,  9101,    75,  6122,  2150,    82,    89, 15668,    13,   554,
        10564,   368,  6184,   120,   527,  8847, 

Now we can use the Huggingface Trainer to finetune GPT-2 on this dataset. This abstracts away all of the details of training. Set up the training arguments to perform 3 epochs of training on this dataset, with a per-device batch size of 2, gradient accumulation set to 8, 100 warmup steps, and a weight decay of 0.05. Set the eval batch size to 2. Save a checkpoint every epoch. Set fp16 to True. Save the checkpoint in a specific output_dir so you can load it later. Hint: if it tries to launch Wandb, you may add the argument report_to="none".
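It helps to work out how many optimizer updates these settings imply. The sketch below is plain arithmetic with a hypothetical helper name. It assumes a single GPU and assumes that a partial accumulation window at the end of an epoch is dropped (floor division); that assumption is consistent with the `global_step=501` reported in the training output further down, but other library versions may count steps differently.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, num_epochs):
    """Estimate the number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # Assumption: leftover batches that do not fill an accumulation window are dropped.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


# Each update sees batch_size * accumulation_steps examples.
effective_batch_size = 2 * 8

# 2674 is the size of the training split printed above.
total_steps = optimizer_steps(2674, per_device_batch_size=2, grad_accum_steps=8, num_epochs=3)

print(effective_batch_size, total_steps)  # 16 501
```

With only about 500 updates in total, the 100 warmup steps cover roughly the first fifth of training.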

In [18]:
# Your code here
training_args = TrainingArguments(output_dir="./saves/translation_trainer", num_train_epochs=3, optim="adamw_torch",
    per_device_train_batch_size=2, gradient_accumulation_steps=8, warmup_steps=100, weight_decay=0.05,
    per_device_eval_batch_size=2, evaluation_strategy="epoch", save_strategy="epoch",
    fp16=(device == "cuda"))  # mixed precision as requested above; fp16 training needs a GPU

Next create a Huggingface Trainer object and call train() on it.

In [19]:
# Your code here
trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)
trainer.train()

Epoch,Training Loss,Validation Loss


TrainOutput(global_step=501, training_loss=0.7546390847650593, metrics={'train_runtime': 1899.9537, 'train_samples_per_second': 4.222, 'train_steps_per_second': 0.264, 'total_flos': 3998491818393600.0, 'train_loss': 0.7546390847650593, 'epoch': 3.0})

Now load your saved checkpoint and see how well the finetuned GPT-2 model does on translating the sentence from before.

In [20]:
# your code here
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation_config = GenerationConfig.from_pretrained(
    "gpt2-medium", max_new_tokens=20, pad_token_id=tokenizer.pad_token_id
)
outputs = gpt2_model.generate(inputs, generation_config=generation_config)[0]
tokenizer.decode(outputs, skip_special_tokens=True)

'Translate the following texts into English.\n\nGerman: UC Berkeley ist eine Schule in Kalifornien\nEnglish: UC Berkeley: A school in Kalifornia'

If training went correctly, you should see a reasonable translation of the sentence, with some errors.

For the project report, find two sentences where the model succeeds and two sentences where the model fails. Describe what might be causing these types of failures.

Finally, revisit the code from project 2 on using and running the Multi30k dataset. Your goal will be to translate the test set using the GPT-2 model you just finetuned. You will then submit your test predictions as a txt file, with your model's prediction for each test example on a separate line. Feel free to copy and paste any code from HW2 that may be useful. Name the file mt_predictions.txt and submit it to Gradescope.

The GPT-2 model may not work that well on the Multi30k dataset because of distribution shift: the Multi30k data looks different from the TED talks data that you finetuned the model on. The takeaway is that a general-purpose LM system can be decent at a task like translation; however, a domain-specific model, such as an LSTM trained specifically on Multi30k, can outperform the general-purpose model.

For the project report, compare two translations from the GPT-2 versus LSTM model. Which one works better?

Hint: One failure mode for GPT-2 is that it may generate fluent sentences that are actually unrelated to the input.
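One quick way to spot a fluent but unrelated output is to measure how many of its words also appear in the reference translation. The sketch below computes a clipped unigram precision in plain Python. It is a rough diagnostic and not BLEU, and the function name and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(hypothesis, reference):
    """Fraction of hypothesis words that also occur in the reference,
    counting each reference word at most as often as it appears there."""
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hyp_tokens)
    matched = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    return matched / len(hyp_tokens)


# Hypothetical reference and two hypothetical system outputs.
reference = "a man is riding a bike"
related = "a man rides a bike"
unrelated = "the weather is nice today"

score_related = unigram_precision(related, reference)
score_unrelated = unigram_precision(unrelated, reference)

print(score_related, score_unrelated)  # 0.8 0.2
```

A low score on a grammatical sentence is a sign that the model ignored the source. For the report, looking at a few such cases by hand is more informative than the number alone.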

In [30]:
prompt1 = """Translate the following texts into English.
German:"""
prompt2 = """
English:"""

def get_raw_predictions(gpt_model, dataset, tokenizer, method, batch_size=64):
    source_sentences = [prompt1+" "+example.src+prompt2 for example in dataset]
    if method == "greedy":
        # Greedy decoding keeps a single hypothesis (num_beams=1).
        generation_config = GenerationConfig.from_pretrained(
            "gpt2-medium", max_new_tokens=100, num_beams=1, pad_token_id=tokenizer.pad_token_id
        )
    else:
        # Any other method uses beam search with 5 beams.
        generation_config = GenerationConfig.from_pretrained(
            "gpt2-medium", max_new_tokens=100, num_beams=5, pad_token_id=tokenizer.pad_token_id
        )

    padding_side = tokenizer.padding_side
    tokenizer.padding_side = "left"
    predictions = []
    for start_index in range(0, len(source_sentences), batch_size):
        inputs = tokenizer(source_sentences[start_index:start_index + batch_size],
            padding=True, return_tensors="pt")

        outputs = gpt_model.generate(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
            generation_config=generation_config)

        prediction_batch = tokenizer.batch_decode(
            outputs[:, inputs["input_ids"].shape[-1]:],
            skip_special_tokens=True)
        predictions.extend([prediction.lstrip(" ") for prediction in prediction_batch])
    tokenizer.padding_side = padding_side
    return predictions

In [31]:
# Your code for generating mt_predictions.txt below
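from torchtext.legacy import data
# Alias the import so it does not shadow Huggingface's `datasets` module.
from torchtext.legacy import datasets as torchtext_datasets

extensions = [".de", ".en"]
# Keep each sentence as a raw string; the GPT-2 tokenizer handles tokenization later.
source_field = data.Field(tokenize=lambda x: x)
target_field = data.Field(tokenize=lambda x: x)
_, _, test_data = torchtext_datasets.Multi30k.splits(
    extensions, [source_field, target_field], root=".")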

predictions = get_raw_predictions(gpt2_model, test_data, tokenizer, "beam", batch_size=64)
with open('./saves/mt_predictions.txt', 'w') as outfile:
    for prediction in predictions:
        outfile.write(f"{prediction}\n")

### Sentiment Analysis

The beauty of language models is that we can apply this exact same machinery to solve a completely different task of sentiment analysis. Here, we will be given a movie review and the goal is to have the model predict whether the review is positive or negative.

First, we will load some sentiment analysis data. Your job is to follow what we did above for machine translation: load the dataset, build a class to create the dataset, and so on.

When doing so, use the prompt below. Put the text of the input in the first [], and in the second [] put the word Positive if the label is 1 and the word Negative if the label is 0. Make sure to also set the self.labels field correctly: we only want to compute a loss on the words Positive/Negative, and on no other tokens in the model's input.

The following is a movie review. [Movie Review Text Here]. The sentiment of the review is [Positive/Negative].
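A small sketch of how the template can be filled in is shown below. The helper names are hypothetical. The layout (a newline before "The sentiment of the review is" and a single space before the label word) follows the `prompt1`/`prompt2` strings used in the code cells of this notebook, not the one-line template above.

```python
SENTIMENT_WORDS = ["Negative", "Positive"]  # list index matches the SST-2 label (0 or 1)


def build_sentiment_prompt(review):
    """Everything the model sees before it has to produce the label word."""
    return "The following is a movie review. " + review.strip() + "\nThe sentiment of the review is"


def build_sentiment_example(review, label):
    """Full training text: prompt, one space, label word, end-of-text marker."""
    return build_sentiment_prompt(review) + " " + SENTIMENT_WORDS[label] + "<|endoftext|>"


# Stray whitespace around the review is stripped so no double spaces appear.
example_text = build_sentiment_example("  a charming and often affecting journey . ", 1)
print(repr(example_text))
```

Building the prompt and the full example from the same helper keeps the training format and the test-time format identical, which avoids the extra-space problem mentioned earlier.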

In [10]:
import datasets
dataset = datasets.load_dataset("sst2")

Found cached dataset sst2 (/home/ubuntu/.cache/huggingface/datasets/sst2/default/2.0.0/9896208a8d85db057ac50c72282bcb8fe755accc671a57dd8059d4e130961ed5)



Note: If this line of code does not work for you, try `dataset = datasets.load_dataset('glue', 'sst2')` instead.

In [11]:
prompt1 = """The following is a movie review."""
prompt2 = """
The sentiment of the review is"""

sentiments = ["Negative", "Positive"]

class SentimentDataset(Dataset):
    # Your code below
    def __init__(self, examples, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for example in examples:
            training_text = prompt1+" "+example['sentence']+prompt2+" "+sentiments[example['label']]+"<|endoftext|>"
            encodings_dict = tokenizer(training_text, max_length=275, padding="max_length", truncation=True)
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

            prompt_and_input_length = len(tokenizer.encode(prompt1+" "+example['sentence']+prompt2))
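            labels = torch.tensor(encodings_dict['input_ids'])
            labels[:prompt_and_input_length] = -100  # no loss on the prompt and the review text
            labels[torch.tensor(encodings_dict['attention_mask']) == 0] = -100  # no loss on padding (pad shares the EOS id)
            self.labels.append(labels)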

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids':self.input_ids[idx], 'attention_mask':self.attn_masks[idx], 'labels':self.labels[idx]}

In [12]:
sentiment_train_dataset = SentimentDataset(dataset['train'], tokenizer)
sentiment_val_dataset = SentimentDataset(dataset['validation'], tokenizer)

The data already comes with a train and validation split.

In [13]:
print(len(sentiment_train_dataset))
print(len(sentiment_val_dataset))

67349
872


Now let's train the model using the same trainer arguments as before, except train for less than 1 epoch, because this dataset is quite large and training on the entire thing will take some time. Make sure you also use a different output_dir so it doesn't overwrite your old results.
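One way to train for a fixed fraction of an epoch is to cap the number of optimizer updates, for example through the `max_steps` argument of `TrainingArguments`. The sketch below only does the arithmetic; the helper name is hypothetical, and it assumes a single GPU and that incomplete accumulation windows are dropped.

```python
import math


def steps_for_epoch_fraction(num_examples, per_device_batch_size, grad_accum_steps, fraction):
    """Number of optimizer updates that covers `fraction` of one epoch."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # Assumption: leftover batches that do not fill an accumulation window are dropped.
    updates_per_epoch = batches_per_epoch // grad_accum_steps
    return int(updates_per_epoch * fraction)


# 67349 is the size of the SST-2 training split printed above.
full_epoch_steps = steps_for_epoch_fraction(67349, 2, 8, 1.0)
half_epoch_steps = steps_for_epoch_fraction(67349, 2, 8, 0.5)

print(full_epoch_steps, half_epoch_steps)  # 4209 2104
```

Under these assumptions a full epoch is about 4209 updates, so a run that stops after 2850 updates has covered roughly 0.68 of an epoch.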

In [14]:
# Your code here
training_args = TrainingArguments(output_dir="./saves/sentiment_trainer", num_train_epochs=1, optim="adamw_torch",
    per_device_train_batch_size=2, gradient_accumulation_steps=8, warmup_steps=100, weight_decay=0.05,
    per_device_eval_batch_size=2, evaluation_strategy="steps", eval_steps=50, save_total_limit=5, load_best_model_at_end=True)

trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    train_dataset=sentiment_train_dataset,
    eval_dataset=sentiment_val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
trainer.train()

Step,Training Loss,Validation Loss
50,No log,0.003866
100,No log,0.001307
150,No log,0.001018
200,No log,0.00121
250,No log,0.00123
300,No log,0.001142
350,No log,0.000969
400,No log,0.000867
450,No log,0.001354
500,0.360500,0.00098


TrainOutput(global_step=2850, training_loss=0.06385934582927771, metrics={'train_runtime': 14173.247, 'train_samples_per_second': 4.752, 'train_steps_per_second': 0.297, 'total_flos': 2.274591154176e+16, 'train_loss': 0.06385934582927771, 'epoch': 0.68})

At test time, when you want to classify an incoming movie review, you can just check whether the model generates the word Positive or Negative as the final word.

In [15]:
prompt = """The following is a movie review. The acting was great but overall I was left disappointed by the film.
The sentiment of the review is"""

In [16]:
# Your code here
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation_config = GenerationConfig.from_pretrained(
    "gpt2-medium", max_new_tokens=20, pad_token_id=tokenizer.pad_token_id
)
outputs = gpt2_model.generate(inputs, generation_config=generation_config)[0]
tokenizer.decode(outputs, skip_special_tokens=True)

'The following is a movie review. The acting was great but overall I was left disappointed by the film.\nThe sentiment of the review is Negative'

Finally, run the entire validation set through the model and get your model predictions. Save the results as a txt file, where each line contains just "1" if your model predicted Positive or "0" if it predicted Negative. You will get full credit if your model's accuracy is greater than 80%. Save the file as sst_predictions.txt and submit it to Gradescope.
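Before submitting, it is worth checking the accuracy yourself. The sketch below shows one possible way to turn generated text into 0/1 labels and score them against the gold labels. The helper names and the toy generations are hypothetical, and the tolerant parsing (case-insensitive, trailing punctuation ignored) is a design choice, not a requirement of the assignment.

```python
def parse_sentiment(generated_text):
    """Map a generated continuation to 1 (Positive), 0 (Negative), or -1 (unrecognized)."""
    words = generated_text.strip().split()
    if not words:
        return -1
    first_word = words[0].strip(".,!?").lower()
    if first_word == "positive":
        return 1
    if first_word == "negative":
        return 0
    return -1


def accuracy(predictions, gold_labels):
    """Fraction of predictions that match the gold labels exactly."""
    correct = sum(int(p == g) for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


# Hypothetical model continuations and gold labels.
toy_generations = [" Positive", "Negative.", "negative", "Great movie"]
toy_gold = [1, 0, 1, 1]

toy_predictions = [parse_sentiment(text) for text in toy_generations]
toy_accuracy = accuracy(toy_predictions, toy_gold)

print(toy_predictions, toy_accuracy)  # [1, 0, 0, -1] 0.5
```

Counting how many predictions come back as -1 is also a useful sanity check: a large number usually means the prompt format at test time does not match the one used for training.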

For the report, describe two possible improvements to your sentiment classifier.

In [17]:
prompt1 = """The following is a movie review."""
prompt2 = """
The sentiment of the review is"""

def get_raw_predictions(gpt_model, dataset, tokenizer, method, batch_size=64):
    source_sentences = [prompt1+" "+example['sentence']+prompt2 for example in dataset]
    if method == "greedy":
        # Greedy decoding keeps a single hypothesis (num_beams=1).
        generation_config = GenerationConfig.from_pretrained(
            "gpt2-medium", max_new_tokens=100, num_beams=1, pad_token_id=tokenizer.pad_token_id
        )
    else:
        # Any other method uses beam search with 5 beams.
        generation_config = GenerationConfig.from_pretrained(
            "gpt2-medium", max_new_tokens=100, num_beams=5, pad_token_id=tokenizer.pad_token_id
        )

    padding_side = tokenizer.padding_side
    tokenizer.padding_side = "left"
    predictions = []
    for start_index in range(0, len(source_sentences), batch_size):
        inputs = tokenizer(source_sentences[start_index:start_index + batch_size],
            padding=True, return_tensors="pt")

        outputs = gpt_model.generate(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
            generation_config=generation_config)

        prediction_batch = tokenizer.batch_decode(
            outputs[:, inputs["input_ids"].shape[-1]:],
            skip_special_tokens=True)
        predictions.extend([prediction.lstrip(" ") for prediction in prediction_batch])
    tokenizer.padding_side = padding_side
    return predictions

In [28]:
# Your code here for generating sst_predictions
predictions = get_raw_predictions(gpt2_model, dataset['validation'], tokenizer, "beam", batch_size=64)
with open('./saves/sst_predictions.txt', 'w') as outfile:
    for prediction in predictions:
        if prediction in ("Positive", "positive"):
            outfile.write("1\n")
        elif prediction in ("Negative", "negative"):
            outfile.write("0\n")
        else:
            # The model produced something other than the two label words.
            outfile.write("-1\n")

## Submission

Turn in the following files on Gradescope:
* hw4.ipynb (this file; please rename to match)
* mt_predictions.txt (the predictions for the Multi30k test set)
* sst_predictions.txt (the predictions for the SST-2 validation set)
* report.pdf

Be sure to check the output of the autograder after it runs. It should confirm that no files are missing and that the output files have the correct format.