# Project 4: Generating and Finetuning Transformer Language Models With Huggingface 

In this project, you will first learn how to use Huggingface's Transformers library to load large language models. Next, we will generate text from these models. Finally, we will finetune models on two tasks (sentiment analysis and machine translation).

This project will be more open ended than the previous projects. We expect you to learn how to use the huggingface and torch documentation.

## Setup

First we install and import the required dependencies. These include:
* `torch` for modeling and training
* `transformers` for pre-trained models
* `datasets` from huggingface to load existing datasets.

In [1]:
%%capture
!pip install transformers
!pip install datasets
!pip install --upgrade sacrebleu sentencepiece

# Standard library imports
import torch
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelWithLMHead

Before proceeding, let's verify that we're connected to a GPU runtime and that `torch` can detect the GPU.
We'll define a variable `device` here to use throughout the code so that we can easily change to run on CPU for debugging.

In [2]:
assert torch.cuda.is_available()
device = torch.device("cuda")
print("Using device:", device)

Using device: cuda


### Loading Model

We will use GPT-2 medium for this project. This includes both the GPT-2 tokenizer and the GPT-2 model weights itself. If you want to learn more about this model, you can read the GPT-2 paper https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Let's first load the tokenizer for the GPT-2 medium model. You can find how to do this by reading the documentation for AutoTokenzier in transformers, and finding the GPT-2 model of ~345 million params in there.

In [3]:
from transformers import AutoTokenizer
# Your code here
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")

Downloading (…)lve/main/config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Let's tokenize and detokenize some text from this model.

In [4]:
print(tokenizer.encode('Hello world'))
print(tokenizer.decode(tokenizer.encode('Hello world')))
print(tokenizer.encode("Hola, cómo estás😍"))

[15496, 995]
Hello world
[39, 5708, 11, 269, 10205, 5908, 1556, 40138, 47249, 235]


Now let's load the GPT-2 medium model. Make sure you also put the model onto the GPU.

In [43]:
from transformers import AutoModelWithLMHead
# Your code here
gpt2_model = AutoModelWithLMHead.from_pretrained("gpt2-medium").to('cuda')



## Generate From the Model

Now let's generate some text from the model to test its LM capabilities. Let's generate 10 pieces of random text of length 50 tokens from the model using random sampling with temperature set to 0.7. This will allow the text to be somewhat high in diversity (random sampling) while maintaining reasonable quality (temperature < 1). When generating text, you can condition on phrases such as "The coolest thing in NLP right now is". Find the relevant function and arguments to use for generating text using the Huggingface documentation.

Hint: you may find https://huggingface.co/docs/transformers/main_classes/text_generation to be useful for learning about generating from LMs.

In [8]:
inputs = tokenizer("<|startoftext|>The coolest thing right now in NLP is", return_tensors="pt").input_ids.cuda()
# Your code here
sample_outputs = gpt2_model.generate(
    inputs, num_return_sequences=10, do_sample=True, temperature=0.7, max_length=50, top_k=0, 
)

Now lets print the text.

In [9]:
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

0: <|startoftext|>The coolest thing right now in NLP is machine learning.

|endoftext|>Networking is ever changing.

|endoftext|>Postgres is quite the contrary.


1: <|startoftext|>The coolest thing right now in NLP is the ability to identify a person's face in-game. It's not a gimmick, but it's cool. So when you're trying to identify a face from a
2: <|startoftext|>The coolest thing right now in NLP is Learning by Example, a tool that lets you follow a user's text as they type into a browser. You can use it to learn a language or just observe their language
3: <|startoftext|>The coolest thing right now in NLP is more efficient object detection based on machine learning. Although that is not an official EDA (even though I am using it), I am able to detect recursively changes
4: <|startoftext|>The coolest thing right now in NLP is :epub|<|print|>The beloved, the beloved one. :epub|<|print|>The one that's always shifting the needle and
5: <|startoftext|>The coolest thing right now in NLP

Now generate one piece of text of length 50 with the same prompt ("The coolest thing right now in NLP is") but use greedy decoding (temperature = 0). This roughly corresponds to generating some text that is high likelihood for the model.

In [10]:
inputs = tokenizer("<|startoftext|>The coolest thing right now in NLP is", return_tensors="pt").input_ids.cuda()
# Your code here
sample_outputs = gpt2_model.generate(
    inputs, do_sample=False, temperature=0, max_length=50
)

Now let's try to see how good of a translation system GPT-2 medium is when used "out of the box". To accomplish this, we can condition on a prompt like the one below and generate from the model with greedy decoding. This will attempt to translate the sentence "UC Berkeley ist eine Schule in Kalifornien", which means "UC Berkeley is a school in California". Make sure to set the max length to be high enough so that the model generates sufficient text.

In [12]:
prompt = """Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English:"""

In [14]:
# Your code here. Generate from the model using greedy decoding with the above prompt
eos_token_id = tokenizer.eos_token_id
inputs = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
greedy_output = gpt2_model.generate(
    inputs, do_sample=False, max_length=50, temperature=0, eos_token_id=eos_token_id,
)
print(tokenizer.decode(greedy_output[0]))

Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English: UC Berkeley ist eine Schule in Kalifornien

English: UC Berkeley ist e


As we can see, translation quality is terrible, as it just repeats the words from the previous text.

Now, let's finetune GPT-2 on the translation task to improve the results. We will use a translation dataset from the Huggingface dataset repository (it has thousands of other datasets available). This dataset is one of TED talks translated between German and English.

In [15]:
import datasets
dataset = datasets.load_dataset("ted_talks_iwslt", language_pair=("de", "en"), year="2014")

Downloading builder script:   0%|          | 0.00/4.72k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.84k [00:00<?, ?B/s]

Downloading and preparing dataset ted_talks_iwslt/de_en_2014 to /root/.cache/huggingface/datasets/ted_talks_iwslt/de_en_2014-c6e771351acd148b/1.1.0/43935b3fe470c753a023642e1f54b068c590847f9928bd3f2ec99f15702ad6a6...


Downloading data:   0%|          | 0.00/1.67G [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset ted_talks_iwslt downloaded and prepared to /root/.cache/huggingface/datasets/ted_talks_iwslt/de_en_2014-c6e771351acd148b/1.1.0/43935b3fe470c753a023642e1f54b068c590847f9928bd3f2ec99f15702ad6a6. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [16]:
tokenizer.pad_token = tokenizer.eos_token

In [17]:
print(dataset['train'][0]['translation'])

{'de': '"Ich habe Zerebralparese. Ich zappele die ganze Zeit", kündigt Maysoon Zayid zu Anfang dieses ungeheuer witzigen, erheiternden an. (Er ist wirklich ungeheur witzig.) "Als würde Shakira auf Muhammad Ali treffen." Elegant und scharfsinnig nimmt uns die arabisch-amerikanische Komikerin auf eine Reise durch ihre Abenteuer als Schauspielerin, Komikerin, Philanthropin und Fürsprecherin für Menschen mit Behinderungen mit.', 'en': '"I have cerebral palsy. I shake all the time," Maysoon Zayid announces at the beginning of this exhilarating, hilarious talk. (Really, it\'s hilarious.) "I\'m like Shakira meets Muhammad Ali." With grace and wit, the Arab-American comedian takes us on a whistle-stop tour of her adventures as an actress, stand-up comic, philanthropist and advocate for the disabled.'}


Now we can create a dataset. For each element in the dataset, it should have a text prompt and then the translation, similar to above. Your job is to fill in the labels field below. This field sets the labels to use for training during the language modeling task. 

For the labels, we only want to train the model to output the text after the words "English:". This is because in the prompt, everything before the words "English:" will also be provided to the model as input. Hint: use -100 as the label for tokens you do not want to train on.
Hint 2: When doing LM training, the labels are the same as the input tokens, except shifted to the left by one. You should check whether Huggingface is already doing the shifting, or whether you need to do the shifting yourself.

One thing to be careful of with all LMs is to make sure there are not extra spaces. So, the text should be formatted as like "English: Hello..." not "English:  Hello...". This issue is a common problem people face when using APIs like GPT-3 which we will cover next time.

In [19]:
prompt = """Translate the following texts into English.
German: """

class TranslationDataset(Dataset):
    def __init__(self, examples, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for example in examples:
            training_text = prompt + example['translation']['de'] + '\nEnglish: ' + example['translation']['en'] + "<|endoftext|>"
            encodings_dict = tokenizer(training_text, max_length=275, padding="max_length", truncation=True)
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            prompt_and_input_length = len(tokenizer.encode(prompt + example['translation']['de'] + '\nEnglish:'))
            # your code below
            label = encodings_dict['input_ids']
            label[:prompt_and_input_length] = [-100] * prompt_and_input_length
            self.labels.append(torch.tensor(label))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids':self.input_ids[idx], 'attention_mask':self.attn_masks[idx], 'labels':self.labels[idx]}

In [20]:
translation_dataset = TranslationDataset(dataset['train'], tokenizer)

Now let's break the dataset into a train and test split.

In [21]:
train_size = int(0.9 * len(translation_dataset))
train_dataset, val_dataset = random_split(translation_dataset, [train_size, len(translation_dataset) - train_size])
print(len(train_dataset))
print(len(val_dataset))

2674
298


In [22]:
print(train_dataset[0])

{'input_ids': tensor([ 8291, 17660,   262,  1708, 13399,   656,  3594,    13,   198, 16010,
           25, 14236,   666, 15617,   647,    83,  3222, 33565,   861,   304,
          500,   285,  9101,    70,   677,   258,  1168,  2724,   403,   701,
          748,  7157,   893,   257,  3046, 38436,    87, 24814,  2815,   532,
          304,   259,  7157,    88,    11,   288,   292,   288,  2575,  5178,
           86,   392,    75,  2150,  3318,   402,   413,   488,   912,   332,
           75,  3536,  2150,  6188,   268,  1729,  4703,   518,   297,   288,
          283,   301,   695,    83,  3318,   523,   304,   500,   302,   528,
           85,   692, 19933,  3683,  4587,   509,  2002,   403,  1134,   341,
         1931,    76,  9101,  4743, 30830,    13,   198, 15823,    25, 14236,
          666, 15617,   647,    83, 35551,   530,  2003,   286,   262,  5175,
         3072,  1377,   257,  5485,    12,  1477, 13309,   290,  3463,    12,
         1477, 13309, 40445,   326,   366,  6381, 

Now we can use the Huggingface Trainer to finetune GPT-2 on this dataset. This abstracts away all of the details of training. Setup the training arguments to perform 3 epochs of training on this dataset, use a per-device batch size of 2 with gradient accumulation set to 8, use 100 warmup steps, a weight decay of 0.05. Set the eval batch size to be 2. Save a checkpoint every 250 steps. Set fp16 to True. Save the checkpoint in a specific output_dir so you can load it later. Hint: if it tries to launch Wandb, you may add the argument report_to="none".

In [23]:
# Your code here
training_args = TrainingArguments(
    output_dir="./gpt2_translation_checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    warmup_steps=100,
    weight_decay=0.05,
    save_steps=250,
    fp16=True,
    report_to="none" 
)

Next create a Huggingface Trainer object and call train() on it.

In [24]:
# Your code here
trainer = Trainer(
    model=gpt2_model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset,
)
trainer.train()



Step,Training Loss
500,0.8599


TrainOutput(global_step=501, training_loss=0.8586751594990789, metrics={'train_runtime': 1148.129, 'train_samples_per_second': 6.987, 'train_steps_per_second': 0.436, 'total_flos': 3998491818393600.0, 'train_loss': 0.8586751594990789, 'epoch': 3.0})

Now load your saved checkpoint and see how well the finetuned GPT-2 model does on translating the sentence from before.

In [25]:
prompt = """Translate the following texts into English.

German: UC Berkeley ist eine Schule in Kalifornien
English:"""

In [49]:
# Your code here
inputs = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
greedy_output = gpt2_model.generate(
    inputs, do_sample=False, max_length=50, temperature=0, eos_token_id=eos_token_id,
)
print(tokenizer.decode(greedy_output[0]))

Translate the following texts into English.

        German: Ein Mädchen an einer Küste mit einem Berg im Hintergrund.
        English: The


If training went correctly, you should see a reasonable translation of the sentence, with some errors.

For the project report, find two sentences where the model succeeds and two sentences where the model fails. Describe what might be causing these types of failures.

In [39]:
# Find two sentences where the model succeeds and two sentences where the model fails
prompt = """Translate the following texts into English.

German: Der Mann ist im Wasser.
English:"""
inputs = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
greedy_output = gpt2_model.generate(
    inputs, do_sample=False, max_length=50, temperature=0, eos_token_id=eos_token_id,
)
print(tokenizer.decode(greedy_output[0]))

Translate the following texts into English.

German: Der Mann ist im Wasser.
English: The man is the ocean.<|endoftext|>


Finally, revisit the code from project 2 on using and running the Multi30k dataset. Your goal will be to translate the test set using the GPT-2 model you just finetuned. You will then submit your test predictions as a txt file, where you place your model's prediction for each test example on a separate line. Feel free to copy and paste any code from HW2 that may be useful. Submit the file named as mt_predictions.txt to gradescope.

The GPT-2 model may not work that well on the Multi30k dataset, because there is a distribution shift where the Multi30k data looks different than the Ted talks data that you finetuned the model on. The takeaway I want people to have is that a general-purpose LM system can be decent at a task like translation, however, if you create a domain-specific model like a LSTM trained specifically on Multi30k, you can outperform the general purpose model.

For the project report, compare two translations from the GPT-2 versus LSTM model. Which one works better?

Hint: One failure mode for GPT-2 is that it may generate fluent sentences that are actually unrelated to the input.

In [42]:
# Your code for generating mt_predictions.txt below
from tqdm.notebook import tqdm
from datasets import load_dataset
import sacrebleu
mydataset = load_dataset('bentrevett/multi30k')
test_dataset = mydataset['test']
target_sentences = []
predictions = []
with open("mt_predictions.txt", "w") as f:
    for sentence_pair in tqdm(test_dataset):
        prompt = """Translate the following texts into English.

        German: {}
        English:""".format(sentence_pair['de'])
        target_sentences.append(sentence_pair['en'])
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
        input_len = len(input_ids[0])
        greedy_output = gpt2_model.generate(
            input_ids, do_sample=False, max_length=100, temperature=0, eos_token_id=eos_token_id
        )
        prediction = tokenizer.decode(greedy_output[0][input_len:-1], skip_special_tokens=True)
        predictions.append(prediction)
        f.write(prediction+'\n')
print(sacrebleu.corpus_bleu(predictions, [target_sentences]).score)

Downloading and preparing dataset json/bentrevett--multi30k to /root/.cache/huggingface/datasets/json/bentrevett--multi30k-fd2305abd2b24ec2/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/4.60M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/156k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/bentrevett--multi30k-fd2305abd2b24ec2/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/1000 [00:00<?, ?it/s]

6.988990202838409


### Sentiment Analysis

The beauty of language models is that we can apply this exact same machinery to solve a completely different task of sentiment analysis. Here, we will be given a movie review and the goal is to have the model predict whether the review is positive or negative.

First, we will load some sentiment analysis data. Your job is to copy what we did above for machine translation to load the dataset, build a Class to create the dataset, etc., 

When doing so, use the prompt below, where you put the text of the input in the first [] and in the second [], put the word Positive if the label is 1 and the word Negative if the label is 0. Make sure to also set the self.labels field correctly, we only want to compute a loss on the words Positive/Negative, and no other tokens in the model's input.

The following is a movie review. [Movie Review Text Here]. The sentiment of the review is [Positive/Negative].

In [61]:
import datasets
dataset = load_dataset('glue', 'sst2')

  0%|          | 0/3 [00:00<?, ?it/s]

Note: Some people were saying that this line of code wasn't working and they needed to use "dataset = datasets.load_dataset('glue', 'sst2')" instead.

In [62]:
class SentimentDataset(Dataset):
    # Your code below
    def __init__(self, examples, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        self.label = ['Negative', 'Positive']
        for example in tqdm(examples):
            training_text = prompt + example['sentence'] + 'The sentiment of the review is ' + self.label[example['label']] + ' <|endoftext|>'
            encodings_dict = tokenizer(training_text, max_length=275, padding="max_length", truncation=True)
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))
            prompt_and_input_length = len(tokenizer(prompt + example['sentence'] + 'The sentiment of the review is ')['input_ids'])
            label = encodings_dict['input_ids']
            label[:prompt_and_input_length-1] = [-100]*(prompt_and_input_length-1)
            self.labels.append(torch.tensor(label))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx], 
            'attention_mask': self.attn_masks[idx], 
            'labels': self.labels[idx]
        }

In [63]:
sentiment_train_dataset = SentimentDataset(dataset['train'], tokenizer)
sentiment_val_dataset = SentimentDataset(dataset['validation'], tokenizer)

  0%|          | 0/67349 [00:00<?, ?it/s]

  0%|          | 0/872 [00:00<?, ?it/s]

The data already comes with a validation and train split

In [64]:
print(len(sentiment_train_dataset))
print(len(sentiment_val_dataset))

67349
872


Now let's train the model using the same trainer arguments as before, except just do $<$1 epoch of training because this dataset is quite large and training on the entire thing will take some time. Make sure you also use a different output_dir so it doesn't overwrite your old results.

In [65]:
# Your code here
training_args = TrainingArguments(
    output_dir="./gpt2_sentiment_checkpoints",
    num_train_epochs=0.05,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    warmup_steps=100,
    weight_decay=0.05,
    save_steps=250,
    fp16=True,
    report_to="none", 
)

sentiment_gpt2_model = AutoModelWithLMHead.from_pretrained("gpt2-medium").to('cuda')
trainer = Trainer(
    model=sentiment_gpt2_model, args=training_args, train_dataset=sentiment_train_dataset, eval_dataset=sentiment_val_dataset
)
trainer.train()



Step,Training Loss


TrainOutput(global_step=211, training_loss=1.1155539417718825, metrics={'train_runtime': 473.1975, 'train_samples_per_second': 7.116, 'train_steps_per_second': 0.446, 'total_flos': 1683995556249600.0, 'train_loss': 1.1155539417718825, 'epoch': 0.05})

At test-time, when you want to classify an incoming movie review, you can just check whether the model generates the words Positive or Negative as the final word.

In [69]:
prompt = """The following is a movie review. The acting was great but overall I was left disappointed by the film. The sentiment of the review is"""

In [70]:
# Your code here
inputs = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
greedy_output = sentiment_gpt2_model.generate(
    inputs, do_sample=False, max_length=500, temperature=0, eos_token_id=eos_token_id,
)
print(tokenizer.decode(greedy_output[0]))

The following is a movie review. The acting was great but overall I was left disappointed by the film. The sentiment of the review is Negative <|endoftext|>


Finally, run the entire validation set through the model and get your model predictions. Save the results as a txt file, where each line just contains either "1" if your model predicted Positive and "0" if the model predicted Negative. You will get full credit if your model's accuracy is greater than 80%. Save the file as sst_predictions.txt and submit it to gradescope.

For the report, describe two possible improvements to your sentiment classifier.

In [71]:
# Your code here for generating sst_predictions
from sklearn.metrics import accuracy_score
def label_check(l):
    if l==33733:
        return 1
    elif l==36183:
        return 0
    else:
        return 0
    
test_dataset = sentiment_val_dataset
targets = []
predictions = []
with open("sst_predictions.txt", "w") as f:
    for text in tqdm(test_dataset):
        inputs = text['input_ids'].clone()
        label_pos = torch.where(inputs == 220)[0].tolist()[0]-1
        label = inputs[label_pos].item()
        targets.append(label_check(label))
        inputs = inputs[:label_pos]
        inputs = inputs.unsqueeze(0).to('cuda')
        greedy_output = sentiment_gpt2_model.generate(
            inputs, do_sample=False, max_length=500, temperature=0
        )
        prediction = label_check(greedy_output[0][label_pos].item())
        predictions.append(prediction)
        f.write(str(prediction)+'\n')
                          
acc = accuracy_score(targets, predictions)
print(acc)

  0%|          | 0/872 [00:00<?, ?it/s]

0.9059633027522935


## Submission

Turn in the following files on Gradescope:
* hw4.ipynb (this file; please rename to match)
* mt_predictions.txt (the predictions for the Multi30k test set)
* sst_predictions.txt (the predictions for the SST-2 validation set)
* report.pdf

Be sure to check the output of the autograder after it runs.  It should confirm that no files are missing and that the output files have the correct format.