# Assignment II: Pretraining & Fine-Tuning of Language Models

In this second assignment we will continue to work with PyTorch and Open AI's early Open Source Model GPT2 to develop a deeper understanding and intuition of how language models are trained. We will look at a specific simple task, Sentiment Classification, and see in two ways how we can use language models for this problem.

The structure of the Assignment is as follows:

1. **Continued GPT-2 Pretraining of GPT with a Movie Plots dataset**  

   Here we will explore how continued pretraining affects a Language Model. This gives us a good idea how pretraining morks, and more specifically, we will look how additional domain-specific pretraining affects the perplexity for text within this domain vs outside of the domain. We will use the *CMU Movie Summary Corpus* (https://www.cs.cmu.edu/~ark/personas/), released under the Creative Commons Attribution-ShareAlike License (http://creativecommons.org/licenses/by-sa/3.0/us/legalcode).
   We will learn that additional pretraining generally helps language models for in-domain tasks.

3. **Fine-tuning of GPT-2 for Sentiment Analysis of the IMDB dataset (using Pre/Post-Modifiers)**  
   We will then use both the original GPT-2 model and the model that was further pretrained on the CMU Movie Summary dataset for a Sentiment Analysis of the IMDB movie review dataset, which is part of Hugging Face datasets. We will (hopefully(!)... there are statistical fluctuations) see that fine-tuning the model that was further pre-trained on the movie plot dataset behaves somewhat better than the original gpt-2 model fine-tuned.

4. **IMDB Sentiment Classification with a Masked Language Model (BERT)**
   Finally, we will also look at a Masked Language Model, specifically BERT, as a tool for Sentiment Classification of the same dataset.



For reference, please consider the Lecture material for weeks 2 & 3 as well as the two Special Session notebooks:

* Intro to PyTorch I (Basics)
* Intro to PyTorch II (Huggingface & Language Models)
* All lesson material and notebooks to date



**INSTRUCTIONS:**

* This notebook needs to be run using a GPU. If you use Google Colab, a T4 chip is the recommendation.
  
* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the answers file as you did in a1. Please do not remove the output from your notebooks when you submit them as we'll look at the output as well as your code for grading purposes.

* \### YOUR CODE HERE indicates that you are supposed to write code. All the way up to \### END YOUR CODE     

* **Important!!:** When you are done please re-run your notebook from beginning to end to that all of the seeds apply! This is very important!



## 0. Setup

Let us first install a few required packages. (You may want to comment this out in case you use a local environment that already has the suitable packages installed.)

In [1]:
%%capture

#!pip install torch           # not required for colabs. Uncomment if needed otherwise
#!pip install transformers    # not required for colabs. Uncomment if needed otherwise
#!pip install numpy           # not required for colabs. Uncomment if needed otherwise
!pip install portalocker
!pip install datasets         # Hugging Face's dataset library
#!pip install pandas          # not required for colabs. Uncomment if needed otherwise

Next, we will import required libraries

In [2]:
import copy
import random

import torch
import numpy as np
import pandas as pd

from datasets import load_dataset

#from torchtext import data as torchtext_data
from torch import nn

from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, GPT2Model, GPT2ForSequenceClassification, GPT2LMHeadModel

Let's make sure we will later put data and models on the proper device.

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#device = torch.device("mps")               # in case you run on a local Mac with metal performance shaders (setup/support is up to you)
device

device(type='cpu')

Now we will define various functions that we will use in this nodebook. You can jump over this part... but you don't have to.

We'll define:

1. perplexity(text, model) - a calculation giving us the perplexity for a given text and model. 'How certain is the model in picking the actual nect token?'
2. ClassificationData class - a class that created the Dataset for our IMDB classification with GPT-2. It has a number of options that we'll use to augment the text to make the classification easier for the model.
3. BERT ClassificationData class - same for a BERT model.
4. create_temp_set(base_data, split) - a function used to massage the IMDB dataset, just as we did in PyTorch Intro I.
5. random_huggingface_blog_text - A list of text of random Hugging Face blog snippets for some validation tests.
6. fake_data_loader -  a function that converts an array of text (fixed batch size for now) into a format that the perplexity calculation can use.


In [4]:
#@title Some Definitions

def perplexity(text_data, model):

    loss = []
    for step, batch_data in enumerate(text_data):

      batch_input = batch_data['input']
      batch_labels = batch_data['labels']

      if step % 100 == 0:
          print('Current batch: ', step)

      with torch.no_grad():
            try:
              batch_output = model(batch_input)
              batch_output_reshaped = batch_output.reshape((batch_size * (max_len - 1), -1))

              batch_labels_reshaped = batch_labels.reshape((batch_size * (max_len - 1),))
              batch_loss = loss_fn(batch_output_reshaped, batch_labels_reshaped)

              loss.append(batch_loss)
            except:
              continue

    avg_cost = np.mean([x.cpu().detach() for x in loss])
    avg_perplexity = np.exp(avg_cost)

    return avg_perplexity


class ClassificationData(Dataset):
    def __init__(self,
                 base_data,
                 tokenizer,
                 max_len,
                 use_prompt=False,
                 prompt_pre_text='',
                 prompt_post_text='',
                 classification_tokenset={1: 'good', 0: 'bad'},
                 num_examples=-1):

        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []

        # really  not ideal having to iterate through the whole set. But ok for this small data volume


        prompt_pre_text = prompt_pre_text.strip()
        prompt_post_text = prompt_post_text.strip()


        for num_example, example in enumerate(base_data):

            if num_examples != -1 and num_example >= num_examples:
              break

            if num_example == 0:
              print(example)

            stripped_example = example['text'].strip()

            token_encoder = self.tokenizer(' ' + stripped_example)['input_ids'] # simulating that the text will not be at the beginning

            if len(token_encoder) <= self.max_len:
                continue    # avoids complications with short sentences. No padding is needed then.

            truncated_encoding = token_encoder[:self.max_len]
            truncated_example = tokenizer.decode(truncated_encoding) # reconstruct shortened review


            # LLMs do next-word predictions. You may want to add a prompt that the model can work with!


            if use_prompt:

                additional_token_length = len(self.tokenizer(prompt_pre_text)['input_ids']) + len(self.tokenizer(' ' + prompt_post_text)['input_ids'])  # simulating that the prompt_post_text will not be at the beginning

                cutoff = self.max_len + additional_token_length

                prompted_text_line = prompt_pre_text + truncated_example + ' ' + prompt_post_text

            else:
                cutoff = self.max_len
                prompted_text_line = truncated_example

            if len(self.tokenizer(prompted_text_line)['input_ids']) != cutoff:
                    continue



            tokenized_example = self.tokenizer(prompted_text_line,
                                               return_tensors="pt",
                                               max_length=cutoff,
                                               truncation=True,
                                               padding='max_length').to(device)

            if example['label'] == 1:
              token = classification_tokenset[1]
            else:
              token = classification_tokenset[0]

            token_id = self.tokenizer.encode(' ' + token)[0]
            label = torch.tensor(token_id, dtype=torch.int64, device=device)

            self.data.append({'label': label,
                              'input_ids': torch.squeeze(tokenized_example['input_ids']).to(device)
                              })

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        return {
            'input_ids': self.data[index]['input_ids'],
            'label': self.data[index]['label']
        }

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        return {
            'input_ids': self.data[index]['input_ids'],
            'label': self.data[index]['label']
        }


class BERTClassificationData(Dataset):
    def __init__(self,
                 base_data,
                 tokenizer,
                 max_len,
                 num_examples=-1):

        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []

        # really  not ideal having to iterate through the whole set. But ok for this small data volume



        for num_example, example in enumerate(base_data):

            if num_examples != -1 and num_example >= num_examples:
              break


            token_encoder = self.tokenizer(example['text'])['input_ids']

            if len(token_encoder) <= self.max_len:
                continue    # avoids complications with short sentences. No padding is needed then.

            truncated_encoding = token_encoder[1:self.max_len + 1]
            truncated_example = tokenizer.decode(truncated_encoding) # reconstruct shortened review

            cutoff = self.max_len
            prompted_text_line = truncated_example

            tokenized_example = self.tokenizer(prompted_text_line,
                                               return_tensors="pt",
                                               max_length=cutoff,
                                               truncation=True,
                                               padding='max_length').to(device)

            label_val = example['label']

            label = torch.tensor(label_val, dtype=torch.int64, device=device)

            self.data.append({'label': label,
                              'input_ids': torch.squeeze(tokenized_example['input_ids']).to(device)
                              })

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        return {
            'input_ids': self.data[index]['input_ids'],
            'label': self.data[index]['label']
        }


def create_temp_set(base_data, num_examples=1000000000):
    num_positive = 0
    num_negative = 0
    num_other = 0

    temp_data = []
    out_data = []

    for example_num, example in enumerate(base_data):

      temp_data.append(example)

    random.shuffle(temp_data)

    for example_num, example in enumerate(temp_data):

      if num_examples != -1 and example_num > num_examples:
        break

      if example['label'] == 0:
        num_negative += 1
      elif example['label'] == 1:
        num_positive += 1
      else:
        num_other += 1

      out_data.append(example)


    print('positive: ', num_positive)
    print('negative: ', num_negative)
    print('other: ', num_other)

    return out_data


random_huggingface_blog_text = ["""1. Aspect candidate extraction

In this work we assume that aspects, which are usually features of products and services, are mostly nouns or noun compounds (strings of consecutive nouns). We use spaCy to tokenize and extract nouns/noun compounds from the sentences in the (few-shot) training set. Since not all extracted nouns/noun compounds are aspects, we refer to them as aspect candidates.

2. Aspect/Non-aspect classification

Now that we have aspect candidates, we need to train a model to be able to distinguish between nouns that are aspects and nouns that are non-aspects. For this purpose, we need training samples with aspect/no-aspect labels. This is done by considering aspects in the training set as True aspects, while other non-overlapping candidate aspects are considered non-aspects and therefore labeled as False:

Training sentence: "Waiters aren't friendly but the cream pasta is out of this world."
Tokenized: [Waiters, are, n't, friendly, but, the, cream, pasta, is, out, of, this, world, .]
Extracted aspect candidates: [Waiters, are, n't, friendly, but, the, cream, pasta, is, out, of, this, world, .]
Gold labels from training set, in BIO format: [B-ASP, O, O, O, O, O, B-ASP, I-ASP, O, O, O, O, O, .]
Generated aspect/non-aspect Labels: [Waiters, are, n't, friendly, but, the, cream, pasta, is, out, of, this, world, .]
Now that we have all the aspect candidates labeled, how do we use it to train the candidate aspect classification model? In other words, how do we use SetFit, a sentence classification framework, to classify individual tokens? Well, this is the trick: each aspect candidate is concatenated with the entire training sentence to create a training instance using the following template:""",
             """Normalization interrogations
During our first deeper dive in these surprising behavior, we observed that the normalization step was possibly not working as intended: in some cases, this normalization ignored the correct numerical answers when they were directly followed by a whitespace character other than a space (a line return, for example). Let's look at an example, with the generation being 10\n\nPassage: The 2011 census recorded a population of 1,001,360, and the gold answer being 10.

Normalization happens in several steps, both for generation and gold:

Split on separators |, -, or The beginning sequence of the generation 10\n\nPassage: contain no such separator, and is therefore considered a single entity after this step.
Punctuation removal The first token then becomes 10\n\nPassage (: is removed)
Homogenization of numbers Every string that can be cast to float is considered a number and cast to float, then re-converted to string. 10\n\nPassage stays the same, as it cannot be cast to float, whereas the gold 10 becomes 10.0.
Other steps A lot of other normalization steps ensue (removing articles, removing other whitespaces, etc.) and our original example becomes 10 passage 2011.0 census recorded population of 1001360.0.
However, the overall score is not computed on the string, but on the bag of words (BOW) extracted from the string, here {'recorded', 'population', 'passage', 'census', '2011.0', '1001360.0', '10'}, which is compared with the BOW of the gold, also normalized in the above manner, {10.0}. As you can see, they don’t intersect, even though the model predicted the correct output!

In summary, if a number is followed by any kind of whitespace other than a simple space, it will not pass through the number normalization, hence never match the gold if it is also a number! This first issue was likely to mess up the scores quite a bit, but clearly it was not the only factor causing DROP scores to be so low. We decided to investigate a bit more.

Diving into the results
Extending our investigations, our friends at Zeno joined us and undertook a much more thorough exploration of the results, looking at 5 models which were representative of the problems we noticed in DROP scores: falcon-180B and mistral-7B were underperforming compared to what we were expecting, Yi-34B and tigerbot-70B had a very good performance on DROP correlated with their average scores, and facebook/xglm-7.5B fell in the middle.

You can give analyzing the results a try in the Zeno project here if you want to!

The Zeno team found two even more concerning features:

Not a single model got a correct result on floating point answers
High quality models which generate long answers actually have a lower f1-score
At this point, we believed that both failure cases were actually caused by the same root factor: using . as a stopword token (to end the generations):

Floating point answers are systematically interrupted before their generation is complete
Higher quality models, which try to match the few-shot prompt format, will generate Answer\n\nPlausible prompt for the next question., and only stop during the plausible prompt continuation after the actual answer on the first ., therefore generating too many words and getting a bad f1 score.
We hypothesized that both these problems could be fixed by using \n instead of . as an end of generation stop word.""",
             """Text generation is a rich topic, and there exist several generation strategies for different purposes. We recommend this excellent overview on the subject. Many generation algorithms are supported by the text generation endpoints, and they can be configured using the following parameters:

do_sample: If set to False (the default), the generation method will be greedy search, which selects the most probable continuation sequence after the prompt you provide. Greedy search is deterministic, so the same results will always be returned from the same input. When do_sample is True, tokens will be sampled from a probability distribution and will therefore vary across invocations.
temperature: Controls the amount of variation we desire from the generation. A temperature of 0 is equivalent to greedy search. If we set a value for temperature, then do_sample will automatically be enabled. The same thing happens for top_k and top_p. When doing code-related tasks, we want less variability and hence recommend a low temperature. For other tasks, such as open-ended text generation, we recommend a higher one.""",
             """Recently, we released our Object Detection Leaderboard, ranking object detection models available in the Hub according to some metrics. In this blog, we will demonstrate how the models were evaluated and demystify the popular metrics used in Object Detection, from Intersection over Union (IoU) to Average Precision (AP) and Average Recall (AR). More importantly, we will spotlight the inherent divergences and pitfalls that can occur during evaluation, ensuring that you're equipped with the knowledge not just to understand but to assess model performance critically.

Every developer and researcher aims for a model that can accurately detect and delineate objects. Our Object Detection Leaderboard is the right place to find an open-source model that best fits their application needs. But what does "accurate" truly mean in this context? Which metrics should one trust? How are they computed? And, perhaps more crucially, why some models may present divergent results in different reports? All these questions will be answered in this blog.

So, let's embark on this exploration together and unlock the secrets of the Object Detection Leaderboard! If you prefer to skip the introduction and learn how object detection metrics are computed, go to the Metrics section. If you wish to find how to pick the best models based on the Object Detection Leaderboard, you may check the Object Detection Leaderboard section."""]

def fake_data_loader(text, tokenizer, max_len):
  return [{'input': tokenizer(text, return_tensors='pt', truncation=True, max_length=max_len)['input_ids'][:, :max_len-1].to(device),
            'labels': tokenizer(text, return_tensors='pt', truncation=True, max_length=max_len)['input_ids'][:, 1:max_len].to(device)}]


This should say 'cpu' if using a CPU, or 'cuda', if a GPU is used (or 'mps' per the comment).

Now we are ready to move to Language Models.

## 1. Continued Pretraining of GPT2 with a Movie Plots dataset

We are now downloading GPT-2 from Hugging Face, i.e. the tokenizer and the model. We will i) make sure that it is on the proper device, and ii) copy the model to a second one that will see additional pre-training before being used for a classification task.

In [5]:
%%capture

torch.manual_seed(10)
random.seed(10)
np.random.seed(10)

gpt_2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

base_gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
additinal_pretrain_gpt2_model = copy.deepcopy(base_gpt2_model)


We will now continue to pretrain the model *additinal_pretrain_gpt2_model* on a dataset of a specific domain - movies. We use the *CMU Movie Summary Corpus* (https://www.cs.cmu.edu/~ark/personas/, license: http://creativecommons.org/licenses/by-sa/3.0/us/legalcode). It contains 40k+ unlabeled movie plot summaries. As such, they represent domain-specific text which is available at many companies and institutions using their internal documents.

Get the dataset by uncommenting the first line below (we have it commented here because you may need to rerun the notebook multiple times when you already have the dataset. When you do that... make sure you comment out this line again):

In [None]:
#!wget https://www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz
# We reccomend downloading the file! You can then upload the file later when needed from your local computer. Go to the folder icon on the left.

In [None]:
#!tar -xvf MovieSummaries.tar.gz

Next, we will create a list of 15k plots and convince ourselves that the data looks roughly as expected:

In [6]:
plots =  pd.read_csv('MovieSummaries/plot_summaries.txt', delimiter='\t')
plots.columns = ['id', 'plot']
plot_summary_list = [x for x in plots['plot']][:15000]
plot_summary_list[0][:400]

'The nation of Panem consists of a wealthy Capitol and twelve poorer districts. As punishment for a past rebellion, each district must provide a boy and girl  between the ages of 12 and 18 selected by lottery  for the annual Hunger Games. The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth. In her first Reaping, 12-year-old Primrose Everdeen is chose'

Next, we will create a Dataset class that takes text data and returns input token ids and labels for the next word predictions (simply the input token ids shifted one to the left). For simplicity, we will throw out any examples that are shorter than our desired length and truncate all other examples to this length:

In [7]:
#@title Class for Creation of Continued Pretraining Dataset (Movie Plots)

class ContinuedPretrainData(Dataset):
    def __init__(self, base_data, tokenizer, max_len, device):
        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []

        tokenized_examples = tokenizer(base_data,
                                       max_length=max_len,
                                       truncation=True, padding='max_length',
                                       return_tensors="pt")

        tokens = tokenized_examples['input_ids'][tokenized_examples['attention_mask'][:, max_len - 1] > 0]

        self.data = tokens.to(device)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):

        return {'input': self.data[index][:self.max_len - 1],
                'labels': self.data[index][1:]
        }

Now, please build a simple neural net that serves for continued pre-training:

In [8]:
%%capture

class ContinuedTrainingNetwork(torch.nn.Module):
    """
    Build a simple PyTorch network that takes a batch (from a Dataloader)
    as input and returns the logits of each next word prediction.
    When instantiated, you need to pass in a pretrained base model.
    You need to define both, the __init__ and the forward methods.
    """
    def __init__(self, pretrainModel):

         ### YOUR CODE HERE

        super(ContinuedTrainingNetwork, self).__init__()
        self.pretrained_model = pretrainModel

        ### END YOUR CODE

    def forward(self, x):                             # x stands for the input that the network will use/act on later
        # get the logits for all tokens in all examples and call it logits.

        ### YOUR CODE HERE

        outputs = self.pretrained_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits  
        
        ### END YOUR CODE

        return logits

pretrain_network = ContinuedTrainingNetwork(pretrainModel=additinal_pretrain_gpt2_model)

pretrain_network.to(device)

In [None]:
#... = {'input': tensor([...])}


Then we create the training sets:

In [9]:
max_len=100

train_data = ContinuedPretrainData(plot_summary_list[:10000], tokenizer=gpt_2_tokenizer, max_len=max_len, device=device)
test_data = ContinuedPretrainData(plot_summary_list[10000:], tokenizer=gpt_2_tokenizer, max_len=max_len, device=device)

In [10]:
train_data[0]

{'input': tensor([  464,  3277,   286,  5961,   368, 10874,   286,   257, 11574, 13241,
           290, 14104, 26647, 12815,    13,  1081,  9837,   329,   257,  1613,
         21540,    11,  1123,  4783,  1276,  2148,   257,  2933,   290,  2576,
           220,  1022,   262,  9337,   286,  1105,   290,  1248,  6163,   416,
         22098,   220,   329,   262,  5079, 32367,  5776,    13,   383,   256,
          7657,  1276,  1907,   284,   262,  1918,   287,   281, 13478,    26,
           262,  6195, 23446,   318, 20945,   351, 16117,   290,  5129,    13,
           554,   607,   717,   797,  9269,    11,  1105,    12,  1941,    12,
           727, 11460, 13698, 10776, 39060,   318,  7147,   422,  5665,  1105,
            13,  2332,  4697,  6621,  8595,    77,   747, 11661,   284]),
 'labels': tensor([ 3277,   286,  5961,   368, 10874,   286,   257, 11574, 13241,   290,
         14104, 26647, 12815,    13,  1081,  9837,   329,   257,  1613, 21540,
            11,  1123,  4783,  1276,  

Here is the data loader:

In [11]:
torch.manual_seed(10)
batch_size = 4
train_texts = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_texts = DataLoader(test_data, batch_size=batch_size, shuffle=True)

Next, we construct a network that takes the input from the loaders and returns the logits of each next word prediction:

Test the shape of the output. Is it correct? We first need to grab an example and then look at the shape of the model output:

In [12]:
example_data = next(iter(test_texts))
print(example_data)  # This will show you what keys are available

{'input': tensor([[38743,   494,    11, 40349,    11,   290,  1338, 39556,   338, 23169,
          4502,   389,   287,  3240,    13, 23169,  4502, 15314,   257,   905,
          1444,   366, 25946,  1869,   422,   347,  8553,    78,  1600,  9593,
           257, 44897,   351,   257, 21332,   286,   257,  3598,    12,  1941,
            12,   727,  1200,    13,   383,  3988,     6,  2988, 17567,   284,
          1309,   262,  1103, 23169,  4502,  1282,   625,   523,   511,  2802,
           468,   262,  3988,  3187,   683,   379,   262,   905,   338,  4067,
            13,  5334,  2802,  6688,   284,   262,  3988,   326, 23169,  4502,
           318,   262,  2042, 15900,   286,   262,  1641,    13,  1119,  1282,
           284,   262,   905,  4067,   290,  1194,  8383,  4952,   262],
        [  818,  4842,    11, 37211, 37547,    11, 41492, 34821,   494, 11345,
           220, 11864,   257, 26617,    11,   810,  3491, 17378,  2611,  9077,
           220, 17706,   764, 34821,   494,   31

In [13]:
example_data = next(iter(test_texts))

# Please call your model output pretrain_model_output

### YOUR CODE HERE

input_ids = example_data['input'].to(device)  
attention_mask = (input_ids != 0).to(device)  
pretrain_model_output = pretrain_network((input_ids, attention_mask))
print(pretrain_model_output.shape)

### END YOUR CODE
pretrain_model_output.shape


torch.Size([4, 99, 50257])


torch.Size([4, 99, 50257])

**QUESTION:**

1.a. What do the numbers above refer to?

**ANSWER:**

1.a. Batch Size, Sequence Lenght and Vocabulary Size

Next, we will calculate the initial perplexity for the test set of the movie plot summaries. We need the loss function for this. Please use the cross entropy to define the loss function *loss_fn*.

In [14]:
"""
Define the loss function loss_fn as the cross entropy and validate/report on the
calculation for the average loss for the two examples

example 1:
  label: 0
  logits: [-3.1, -2.4]
example 2:
  label: 1
  logits: [2.4, -3.1]

"""

### YOUR CODE HERE

loss_fn = nn.CrossEntropyLoss()

example_1_loss = loss_fn(torch.tensor([[-3.1, -2.4]]), torch.tensor([0]))
example_2_loss = loss_fn(torch.tensor([[2.4, -3.1]]), torch.tensor([1]))
average_loss = (example_1_loss + example_2_loss) / 2

### END YOUR CODE


#loss_fn(test_input,test_target)



In [15]:
# What is the average loss for these two examples?    

print(example_1_loss)
print(example_2_loss)
print(average_loss)

tensor(1.1032)
tensor(5.5041)
tensor(3.3036)


**QUESTION:**

1.b. What is the average loss for these two examples?    
1.c. (Ideally, by just looking at the labels and logits), which example contributes the higher loss? Choose from 'first' or 'second'.

**ANSWER:**

1.b. 3.3036 

1.c. second

Also, consider https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html and investigate the dimensions of the input! We will need to reshape the output and the labels, because CrossEntropy expects a tensor of individual decisions, not a tensor of decision sequences. Similar for the labels. Recall how to reshape from the previous notebook.

For the example data we calculate the loss. Then you need to calculate the perplexity.

In [16]:
# Reshape pretrain_model_output and example_data['labels']. Name them reshaped_pretrain_model_output and reshaped_pretrain_model_labels

### YOUR CODE HERE

reshaped_pretrain_model_output = pretrain_model_output.view(batch_size * (max_len - 1), -1)
reshaped_pretrain_model_labels = example_data['labels'].view(batch_size * (max_len - 1))

### END YOUR CODE

print('Shape of reshaped outputs: ', reshaped_pretrain_model_output.shape)
print('Shape of reshaped labels: ', reshaped_pretrain_model_labels.shape)

Shape of reshaped outputs:  torch.Size([396, 50257])
Shape of reshaped labels:  torch.Size([396])


In [17]:
# Now use the loss function loss_fn to first calculate - for this batch - the loss and then the perplexity.

### YOUR CODE HERE

initial_batch_loss = loss_fn(reshaped_pretrain_model_output, reshaped_pretrain_model_labels)
initial_batch_perplexity = np.exp(initial_batch_loss.cpu().detach().item())

### END YOUR CODE

print('Initial batch loss: ', initial_batch_loss)
print('Initial batch perplexity: ', initial_batch_perplexity)

Initial batch loss:  tensor(3.8828, grad_fn=<NllLossBackward0>)
Initial batch perplexity:  48.559882062318394


**QUESTION:**

1.d. What is the perplexity of this batch before the training?

Next, we will calculate the perplexity of the whole test set. For that we will use the perplexity function defined at the outset:

In [18]:
%%time

test_movie_plot_perplexity_before =  perplexity(test_texts, pretrain_network)
test_movie_plot_perplexity_before

Current batch:  0
Current batch:  100
Current batch:  200
Current batch:  300
Current batch:  400
Current batch:  500
Current batch:  600
Current batch:  700
Current batch:  800
Current batch:  900
CPU times: user 1h 38min 49s, sys: 3min 26s, total: 1h 42min 15s
Wall time: 14min 31s


131019.2

**QUESTION**:

1.e. What is the perplexity of the test set before further training?

Ok. What about random text not from this domain? Let us look at the random snippets from Hugging Face blogs defined above in random_huggingface_blog_text:

In [19]:
test_hf_perplexity_before = perplexity(fake_data_loader(random_huggingface_blog_text, tokenizer=gpt_2_tokenizer, max_len=100), pretrain_network)
test_hf_perplexity_before

Current batch:  0


236230.06

Good, about the same. (As hoped/expected. The model should not have any particular better understanding for either type of text.)

Next, we need to create the optimizer and generate a training loop. Nothing to do here for you, but take a look if interested.

In [20]:
pretrain_optimizer = torch.optim.AdamW(pretrain_network.parameters(), lr=0.00001)

def continued_train_loop(dataloader,
               model,
               loss_fn,
               optimizer,
               reporting_interval=100,
               max_len=100,
               steps=None):

    """
    Write the training loop for continued pre-training. In particular, you need to:
    - initialize the epoch_loss to 0 and set the model into training mode
    - iterate over the batches:
      - break if you are at 'steps' number of batches
      - get the inputs X and labels y (which in this case will be the actual next token)
      - get the model outputs
      - reshape y and model outputs in proper format for cross entropy calculation
      - zero out the gradient
      - calculate loss
      - propagate loss (loss.backward) and apply optimizer step
      - add the loss to the epoch_loss
    Reporting:
      - report the current average loss and perplexity every 'reporting_interval' batches
      - report the average loss and perplexity at the end of the epoch (done for you)

    """


    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices

    ### YOUR CODE HERE

    model.train()
    epoch_loss = 0
    batch_count = 0 

    for step, batch_data in enumerate(dataloader):
      if steps and step >= steps:
        break  # Stop training after a certain number of steps

        batch_input = batch_data['input'].to(device)
        batch_labels = batch_data['labels'].to(device)

        # Forward pass: get model outputs
        batch_output = model(batch_input)
        
        # Reshape model outputs and labels for cross-entropy loss
        batch_output_reshaped = batch_output.view(-1, batch_output.size(-1))
        batch_labels_reshaped = batch_labels.view(-1)

        # Zero out gradients
        optimizer.zero_grad()

        # Compute loss
        batch_loss = loss_fn(batch_output_reshaped, batch_labels_reshaped)

        # Backpropagate loss and apply optimizer step
        batch_loss.backward()
        optimizer.step()

        # Add batch loss to epoch loss
        epoch_loss += batch_loss.item()
        batch_count += 1

        # Report progress at intervals
        if step % reporting_interval == 0 and step > 0:
            avg_loss = epoch_loss / batch_count
            perplexity = np.exp(avg_loss)
            print(f"Step {step}, Avg train loss: {avg_loss:.6f}, Avg perplexity: {perplexity:.6f}")

    # Handle the case where no batches were processed
    if batch_count == 0:
      print("No batches were processed.")
      return

    # At the end of the epoch, report the average loss and perplexity
    avg_epoch_loss = epoch_loss / batch_count
    avg_perplexity = np.exp(avg_epoch_loss)
    

    ### END YOUR CODE

    print(f"Training Results: \n  Avg train loss: {epoch_loss/batch:>8f} \n Avg train perplexity: {np.exp(epoch_loss/batch):>8f} ")

Now we do the training:

In [21]:
epochs = 1
for t in range(epochs):
    # we just train for 1000 batches
    print(f"Epoch {t+1}\n-------------------------------")
    continued_train_loop(train_texts, pretrain_network, loss_fn, pretrain_optimizer, steps=1000)

print("Done!")

Epoch 1
-------------------------------
No batches were processed.
Done!


How did the perplexity of the test set change after the additional pre-training?

In [22]:
%%time

test_movie_plot_perplexity_after = perplexity(test_texts, pretrain_network)
test_movie_plot_perplexity_after

Current batch:  0
Current batch:  100
Current batch:  200
Current batch:  300
Current batch:  400
Current batch:  500
Current batch:  600
Current batch:  700
Current batch:  800
Current batch:  900
CPU times: user 1h 48min 18s, sys: 4min 46s, total: 1h 53min 4s
Wall time: 16min 52s


63214.67

What about the Hugging Face blog snippets that were not in the movie domain:



In [23]:
test_hf_perplexity_after =  perplexity(fake_data_loader(random_huggingface_blog_text, tokenizer=gpt_2_tokenizer, max_len=100), pretrain_network)
test_hf_perplexity_after

Current batch:  0


115401.84


**QUESTION:**

1.f. What is your observation about the perplexity change for the test movie plot set texts after the additional pre-training? About the same ('within 2'), higher, or lower?

1.g. What is your observation about the perplexity change for the Hugging Face texts after the additional pre-training? About the same ('within 2'), higher, or lower?

1.h. (Free form) What would these observations imply in terms of where/how this model could be used?

## 2. Sentiment Classification of the IMDB Movie dataset using GPT2 and Prompts

We will now get the IMDB dataset, just like we did in the PyTorch Intro II notebook. Refer to it for details.

In [24]:
imdb_dataset = load_dataset("IMDB")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [25]:
imdb_dataset['train'][10]

{'text': 'It was great to see some of my favorite stars of 30 years ago including John Ritter, Ben Gazarra and Audrey Hepburn. They looked quite wonderful. But that was it. They were not given any characters or good lines to work with. I neither understood or cared what the characters were doing.<br /><br />Some of the smaller female roles were fine, Patty Henson and Colleen Camp were quite competent and confident in their small sidekick parts. They showed some talent and it is sad they didn\'t go on to star in more and better films. Sadly, I didn\'t think Dorothy Stratten got a chance to act in this her only important film role.<br /><br />The film appears to have some fans, and I was very open-minded when I started watching it. I am a big Peter Bogdanovich fan and I enjoyed his last movie, "Cat\'s Meow" and all his early ones from "Targets" to "Nickleodeon". So, it really surprised me that I was barely able to keep awake watching this one.<br /><br />It is ironic that this movie is a

In [26]:
imdb_train_set = create_temp_set(imdb_dataset['train'])
imdb_test_set = create_temp_set(imdb_dataset['test'])

positive:  12500
negative:  12500
other:  0
positive:  12500
negative:  12500
other:  0


Usually we would first get the Dataset and the Dataloader, however for reasons that hopefully become apparent, in this case we first want to build the network. You should think of the network as taking input_ids $x$ from a batch of suitably tokenized sentences, and the **output should be the logits of the last token for each example in the batch.**

Please fill in the missing line:

In [27]:
%%capture

class TokenPredictionNetworkClass(torch.nn.Module):
    """
    Define a simple PyTorch network that takes a batch (from a Dataloader)
    as input and returns the logits for the last next-token prediction.
    When instantiated, you need to pass in a pretrained base language model
    (the 'logit_model').
    You need to define both, the __init__ and the forward methods.
    """

    def __init__(self, logit_model):
        ### YOUR CODE HERE

        super(TokenPredictionNetworkClass, self).__init__()
        self.logit_model = logit_model

        ### END YOUR CODE

    def forward(self, x):
        # get the logits for the last position of each each example. (Call them last_token_logits.). This will be just one line.
        # Use self.logit_model(x) to get the model output
        ### YOUR CODE HERE

        last_token_logits = self.logit_model(x).logits[:, -1, :]

        ### END YOUR CODE

        return last_token_logits


loss_fn = torch.nn.CrossEntropyLoss()

token_prediction_network_base_model = TokenPredictionNetworkClass(logit_model=base_gpt2_model)
token_prediction_network_addnl_pretrain_model = TokenPredictionNetworkClass(logit_model=additinal_pretrain_gpt2_model)

token_prediction_network_base_model.to(device)
token_prediction_network_addnl_pretrain_model.to(device)

We now construct our training and test sets for the Sentiment Classification. We want to follow a different approach than we have in the PyTorch intro II notebook. We want to leverage the language model and what it is good at - predicting the next tokens - to the maximum. So why not put a 'wrapper' around the review in a way that the proper sentiment would be naturally the next word?

As a simple example, rather than trying to use the last output vector and add a classification layer let's try to reframe the problem like this (as an illustrative example):

  
 "This is a review: <truncated review text>... The reviewer classifies reviews as good or bad. In this case they thought the movie was"

 or

 "This is a review: <truncated review text>... The reviewer has positive or negative sentiments about movies. In this case the sentiment was"

 ...

 One would think that the LM should already do a decent job getting the proper sentiment simply using the next word prediction task it is trained on!

 How could we test this? We could simply consider the cross entropy loss for the next token relative to the actual sentiment, i.e. the next word we would expect for a positive or a negative review.

 So we can experient with:

 * The pre-fix before the review
 * The text after the review
 * The words we would expect for pos/neg reviews

Try a few combinations and see which ones give you the lowest loss.

**NOTE:** Usually we also would do a good chunk of text pre-processing (take out html, etc.), but for simplicity we will ignore this.

In [28]:
class ClassificationData(Dataset):
    def __init__(self,
                 base_data,
                 tokenizer,
                 max_len,
                 use_prompt=False,
                 prompt_pre_text='',
                 prompt_post_text='',
                 classification_tokenset={1: 'good', 0: 'bad'},
                 num_examples=-1):

        self.max_len = max_len
        self.tokenizer = tokenizer  # assume that padding token has already been added to tokenizer
        self.data = []

        # really  not ideal having to iterate through the whole set. But ok for this small data volume



        for num_example, example in enumerate(base_data):

            if num_examples != -1 and num_example >= num_examples:
              break

            if num_example == 0:
              print(example)


            token_encoder = self.tokenizer(example['text'])['input_ids']

            if len(token_encoder) <= self.max_len:
                continue    # avoids complications with short sentences. No padding is needed then.

            truncated_encoding = token_encoder[:self.max_len]
            truncated_example = tokenizer.decode(truncated_encoding) # reconstruct shortened review

            # LLMs do next-word predictions. You may want to add a prompt that the model can work with!


            if use_prompt:

                additional_token_length = len(self.tokenizer(prompt_pre_text)['input_ids']) + len(self.tokenizer(prompt_post_text)['input_ids'])
                cutoff = self.max_len + additional_token_length - 1

                prompted_text_line = prompt_pre_text + truncated_example + prompt_post_text

            else:
                cutoff = self.max_len
                prompted_text_line = truncated_example

            if len(self.tokenizer(prompted_text_line)['input_ids']) != cutoff:
                    continue



            tokenized_example = self.tokenizer(prompted_text_line,
                                               return_tensors="pt",
                                               max_length=cutoff,
                                               truncation=True,
                                               padding='max_length').to(device)

            #if num_example == 0:
            #  print(self.tokenizer.decode(tokenized_example['input_ids'][0]))

            if example['label'] == 1:
              token = classification_tokenset[1]
            else:
              token = classification_tokenset[0]

            token_id = self.tokenizer.encode(' ' + token)[0]
            label = torch.tensor(token_id, dtype=torch.int64, device=device)

            self.data.append({'label': label,
                              'input_ids': torch.squeeze(tokenized_example['input_ids']).to(device)
                              })

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        return {
            'input_ids': self.data[index]['input_ids'],
            'label': self.data[index]['label']
        }

In [29]:
#Suggested, but try a bunch!
# prompt_pre_text = 'Here is a movie review: '
#prompt_post_text = ' ...  The reviewer classifies reviews as good or bad. In this case they thought the movie was'
#classification_tokenset = {1: 'good', 0: 'bad'}

prompt_pre_text = 'Here is a movie review: '
prompt_post_text = ' ...  The reviewer classifies reviews as good or bad. In this case they thought the movie was'
classification_tokenset = {1: 'good', 0: 'bad'}

# make a modification to the prompt_pre_text, prompt_post_text, and classification_tokenset
# that gets the loss below 1.7

### YOUR CODE HERE

prompt_pre_text = 'Here is a movie review: '
prompt_post_text = ' ...  The reviewer classifies reviews as good or bad. In this case they thought the movie was'
classification_tokenset = {1: 'good', 0: 'bad'}

### END YOUR CODE

play_data = ClassificationData(imdb_train_set,
                                tokenizer=gpt_2_tokenizer,
                                max_len=100,
                                use_prompt=True,
                                prompt_pre_text = prompt_pre_text,
                                prompt_post_text = prompt_post_text,
                                classification_tokenset=classification_tokenset,
                                num_examples=20
                                )

batch_size = 4
toy_texts = DataLoader(play_data, batch_size=batch_size, shuffle=True)


loss = 0
predicted_tokens = []
labels = []

for batch_num, toy_text_batch in enumerate(toy_texts):
    sample_output = token_prediction_network_base_model(toy_text_batch['input_ids']).to(device)
    sample_labels = toy_text_batch['label']
    loss += loss_fn(sample_output, sample_labels).detach()

    predicted_tokens += gpt_2_tokenizer.decode(torch.argmax(sample_output, dim=-1)).split()
    labels += gpt_2_tokenizer.decode(sample_labels).split()

loss /= (batch_num + 1)



print('Average loss: ', loss)
print('Predicted tokens vs labels: ', [(x, y) for x,y in zip(predicted_tokens, labels)])


{'text': "Normally I would never rent a movie like this, because you know it's going to be bad just by looking at the box. I rented seven movies at the same time, including Nightmare on Elm Street 5, 6 and Wes Craven's New Nightmare. Unfortunately, when I got home I found out the videostore-guy gave me the wrong tape. In the box of Wes Craven's New Nightmare I found this lame movie.<br /><br />This movie is incredibly boring, the acting is bad and the plot doesn't make any sense. It's hard to write a good review, because I have no idea what the movie was really about. At the end of the movie you have more questions then answers.<br /><br />On 'Max Power's Scale of 1 to 10' I rate this movie: 1<br /><br />PS I would like to correct Corinthian's review (right below mine). He says Robert Englund is ripping off lingerie, riding horses naked, etc. The guy that did those things was Mahmoud, played by Juliano Mer, not by Robert Englund.", 'label': 0}


2024-10-02 19:09:16.626694: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Average loss:  tensor(2.0389)
Predicted tokens vs labels:  [('good', 'bad'), ('good', 'good'), ('good', 'good'), ('good', 'good'), ('good', 'bad'), ('good', 'good'), ('good', 'good'), ('good', 'good'), ('good', 'good'), ('a', 'good'), ('good', 'good'), ('good', 'bad'), ('bad', 'bad'), ('good', 'good'), ('good', 'good'), ('good', 'good'), ('a', 'bad'), ('good', 'bad'), ('good', 'bad')]


Ok, the accuracy using the old GPT2 is not exactly amazing (newer and larger models would be MUCH better out of the box). However, even for GPT2 at least a token of the right type is predicted. Fine-tuning should make this much better!


**QUESTION:**

2.a. Write down two different prompt_pre_text/prompt_post_text combinations and their respective average loss. Pick one that sounds reasonable but is quite a bit worse (say, average loss > 3), and another that gets the loss below 1.7. (Note, for the later you probably havde to counteract a bit the model's tendency to be positve. You may also want to be more clear where the review
starts and ends.)


Now let's do the fine-tuning that is supposed to help! Start by getting the full dataset and dataloaders:

In [30]:
imdb_train_data = ClassificationData(imdb_train_set,
                                tokenizer=gpt_2_tokenizer,
                                max_len=100,
                                use_prompt=True,
                                prompt_pre_text = prompt_pre_text,
                                prompt_post_text = prompt_post_text,
                                classification_tokenset=classification_tokenset,
                                num_examples=-1
                                )

imdb_test_data = ClassificationData(imdb_test_set,
                                tokenizer=gpt_2_tokenizer,
                                max_len=100,
                                use_prompt=True,
                                prompt_pre_text = prompt_pre_text,
                                prompt_post_text = prompt_post_text,
                                classification_tokenset=classification_tokenset,
                                num_examples=-1
                                )


batch_size = 4
imdb_train_loader = DataLoader(imdb_train_data, batch_size=batch_size, shuffle=True)
imdb_test_loader = DataLoader(imdb_test_data, batch_size=batch_size, shuffle=True)

{'text': "Normally I would never rent a movie like this, because you know it's going to be bad just by looking at the box. I rented seven movies at the same time, including Nightmare on Elm Street 5, 6 and Wes Craven's New Nightmare. Unfortunately, when I got home I found out the videostore-guy gave me the wrong tape. In the box of Wes Craven's New Nightmare I found this lame movie.<br /><br />This movie is incredibly boring, the acting is bad and the plot doesn't make any sense. It's hard to write a good review, because I have no idea what the movie was really about. At the end of the movie you have more questions then answers.<br /><br />On 'Max Power's Scale of 1 to 10' I rate this movie: 1<br /><br />PS I would like to correct Corinthian's review (right below mine). He says Robert Englund is ripping off lingerie, riding horses naked, etc. The guy that did those things was Mahmoud, played by Juliano Mer, not by Robert Englund.", 'label': 0}


Token indices sequence length is longer than the specified maximum sequence length for this model (1038 > 1024). Running this sequence through the model will result in indexing errors


{'text': "I know most of the other reviews say that this movie was great, but I have to disagree.<br /><br />Sure, it's a good book! It was actually one of my favorites when I was verrry little. But it's just not meant for theaters. Maybe for a little half-hour short, but I don't see how they can turn a short kiddie book into a whole feature film.<br /><br />It is a cute movie, but I would only recommend it for really little kids. Older kids will have no interest it. Adults may have a little more interest if they watch it with their young ones. But anyone ages 7-Adult will have a snore-fest.<br /><br />Sorry if you disagree with me, but this is my opinion. :)", 'label': 0}


Let's set up the optimizers as before:

In [31]:
adam_optimizer_base_model = torch.optim.AdamW(token_prediction_network_base_model.parameters(), lr=0.00001)
adam_optimizer_addtl_pretrain_model = torch.optim.AdamW(token_prediction_network_addnl_pretrain_model.parameters(), lr=0.00001)

Here is the new training loop. Please fill in the lines for optimizer zeroing, the prediction calculation, and the loss.

In [32]:
def train_loop(dataloader, model, loss_fn, optimizer, reporting_interval=100, steps=None):

    """
    Write the training loop to fine-tune the model for sentiment
    classification using the final next-token-prediction task.
    In particular, you need to:
    - initialize the epoch_loss to 0 and set the model into training mode
    - iterate over the batches:
      - break if you are at 'steps' number of batches
      - get the inputs X and labels y (which in this case will be the actual next token)
      - get the model outputs
      - reshape y and model outputs in proper format for cross entropy calculation
      - zero out the gradient
      - calculate loss
      - propagate loss (loss.backward) and apply optimizer step
      - add the loss to the epoch_loss
    Reporting:
      - report the current average loss every 'reporting_interval' batches
      - report the average loss at the end of the epoch (done for you)

    """

    ### YOUR CODE HERE

    model.train()
    epoch_loss = 0
    batch_count = 0

    for step, batch_data in enumerate(dataloader):
      if steps and step >= steps:
        break

      batch_input = batch_data['input_ids'].to(device)
      batch_labels = batch_data['label'].to(device)

      batch_output = model(batch_input)

      batch_output_reshaped = batch_output.view(-1, batch_output.size(-1))
      batch_labels_reshaped = batch_labels.view(-1)

      optimizer.zero_grad()

      batch_loss = loss_fn(batch_output_reshaped, batch_labels_reshaped)

      batch_loss.backward()
      optimizer.step()

      epoch_loss += batch_loss.item()
      batch_count += 1

      if step % reporting_interval == 0 and step > 0:
          avg_loss = epoch_loss / batch_count
          print(f"Step {step}, Avg train loss: {avg_loss:.6f}")

    if batch_count == 0:
      print("No batches were processed.")
      return
    
    avg_epoch_loss = epoch_loss / batch_count

    ### END YOUR CODE

    print(f"Training Results: \n  Avg train loss: {epoch_loss/batch:>8f} \n")


def test_loop(dataloader, model, loss_fn, reporting_interval=100, contrast_pair=None,steps=None):
    """
    Write the test loop to fine-tune the model for sentiment classification using the final next-token-prediction task.
    In particular, you need to:

    - set the model into eval mode and initialize the test_loss to 0. Also, set the number of correct &
      total test examples to 0, like:
      'test_loss, correct_token_predictions,  correct_label_class, total = 0, 0, 0, 0'
        (See the two approaches for accuracy below for correct_token_predictions
        and correct_label_class)

    - use torch.no_grad to iterate over the batches:
        - break if you are at 'steps' number of batches
        - from the batch, get the test inputs X and labels y (which in this case will be the actual next token). You may want to look at the format of batches by using 'next(iter(imdb_test_loader))' in a separate cell
        - get the model outputs
        - calculate loss and add to test_loss (reshaping should not be necessary)
        - For the accuracy, we can try two approaches (and in this case they should turn out to be
            probably the same in the end):
              i) Test Class Accuracy:
                  - Define the predicted class (I call it selected class) by comparing the logits for our two
                    'evaluating tokens' (like 'good', 'bad'). if the logit for (in this example) 'good' is higher,
                    then the predicted class is the positive one, etc.
              ii)  Token Prediction Accuracy:
                  - see how often the correct 'evaluation token' is predicted. I.e., here we do not compare
                    whether the model believes that 'good' is a more likely next token than 'bad', but was
                    'good' the actual next token prediction (and vise versa).
              - get these numbers for each batch and add to the totals

    - add the loss to the epoch_loss

    Reporting:

    - report on the average token accuracy, average class accuracy and average test loss every 'reporting_interval' batches
    - report on the same at the end (done for you)

    """

    # let's get the proper class ids for the 'evaluating next tokens' (like 'good', 'bad')

    if contrast_pair is not None:
      class_1, class_2 = contrast_pair
      class_1_id, class_2_id = gpt_2_tokenizer.encode(' ' + class_1 + ' ' + class_2)

    # now the loop starts:

    ### YOUR CODE HERE

    model.eval()

    test_loss, correct_token_predictions, correct_label_class, total = 0, 0, 0, 0

    for step, batch_data in enumerate(dataloader):
      if steps and step >= steps:
        break

      batch_input = batch_data['input_ids'].to(device)
      batch_labels = batch_data['label'].to(device)

      batch_output = model(batch_input)

      batch_loss = loss_fn(batch_output, batch_labels)
      test_loss += batch_loss.item()

      for i in range(batch_output.size(0)):
        predicted_class = torch.argmax(batch_output[i]).item()
        correct_class = batch_labels[i].item()

        if predicted_class == correct_class:
          correct_label_class += 1

        if contrast_pair:
          if predicted_class == class_1_id:
            predicted_token = class_1
          elif predicted_class == class_2_id:
            predicted_token = class_2
          else:
            predicted_token = 'unknown'

          if correct_class == class_1_id:
            correct_token = class_1
          elif correct_class == class_2_id:
            correct_token = class_2
          else:
            correct_token = 'unknown'

          if predicted_token == correct_token:
            correct_token_predictions += 1

        else:
          if predicted_class == correct_class:
            correct_token_predictions += 1

        total += 1

    avg_test_loss = test_loss / total
    avg_token_accuracy = correct_token_predictions / total
    avg_class_accuracy = correct_label_class / total

    ### END YOUR CODE

    print(correct_label_class)
    print(f"Test Results: \n\t Test Token Accuracy: {(100*correct):>0.1f}% \n\t Test Class Accuracy: {(100*correct_label_class):>0.1f}%  \n\t Avg test loss: {test_loss:>8f} \n")

In [33]:
epochs = 1
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(imdb_train_loader, token_prediction_network_base_model, loss_fn, adam_optimizer_base_model, steps=2000)
    test_loop(imdb_test_loader, token_prediction_network_base_model, loss_fn, contrast_pair=('good', 'bad'),
              steps=500
              ) # no optimizer use here!
print("Done!")

Epoch 1
-------------------------------
Step 100, Avg train loss: 0.691266
Step 200, Avg train loss: 0.600858
Step 300, Avg train loss: 0.556341
Step 400, Avg train loss: 0.538435
Step 500, Avg train loss: 0.515655
Step 600, Avg train loss: 0.507377
Step 700, Avg train loss: 0.499678
Step 800, Avg train loss: 0.493247
Step 900, Avg train loss: 0.488447
Step 1000, Avg train loss: 0.479650
Step 1100, Avg train loss: 0.467019
Step 1200, Avg train loss: 0.463335
Step 1300, Avg train loss: 0.457502
Step 1400, Avg train loss: 0.451654
Step 1500, Avg train loss: 0.446935
Step 1600, Avg train loss: 0.442717
Step 1700, Avg train loss: 0.440059
Step 1800, Avg train loss: 0.433151
Step 1900, Avg train loss: 0.430274


NameError: name 'batch' is not defined

Now we redo this for the model that saw the additional pre-training:

In [None]:
epochs = 1
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(imdb_train_loader, token_prediction_network_addnl_pretrain_model, loss_fn, adam_optimizer_addtl_pretrain_model, steps=2000)
    test_loop(imdb_test_loader, token_prediction_network_addnl_pretrain_model, loss_fn, contrast_pair=('good', 'bad'),
              steps=500
              ) # no optimizer use here!
print("Done!")

This looks good! So we had better movie review sentiment classification using the model had had seen the additional pretraining.

**QUESTION:**

2.b. What was your test accuracy after fine-tuning, when starting with the base model?

2.c. What was your test accuracy after fine-tuning, when starting with the model that had additional pre-training?

2.d. Based on this and what we saw in the previous section (and, as there are statistical fluctuations, based on what 'should' be the case), what would be your expectation for these two starting models when used for sentiment analysis tasks that deal with data **inside** the movie domain? ('base model slightly better', or 'additional pretrain model slightly better')

2.e. Based on this and what we saw in the previous section (and, as there are statistical fluctuations, based on what 'should' be the case), what would be your expectation for these two starting models when used for sentiment analysis tasks that deal with data  **outside** the movie domain? ('base model slightly better', or 'additional pretrain model slightly better')

## 3. Sentiment Classification with BERT

Now we will see how well the classification with BERT works in comparison. We will get the model tokenizer for that model, then - as discussed in class - use the output of the initial [CLS] token to classify the sentiment. Will it be better? Or worse?

See https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel for more details around the model.



In [34]:
%%capture

from transformers import AutoTokenizer, BertModel


bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert_model = BertModel.from_pretrained("bert-base-cased").to(device)

Let us look at a simple bert tokenization

In [35]:
bert_toy_inputs = bert_tokenizer("This is new", return_tensors="pt").to(device)
bert_toy_outputs = bert_model(**bert_toy_inputs)

last_hidden_states = bert_toy_outputs.last_hidden_state

last_hidden_states.shape

torch.Size([1, 5, 768])

Play with decode method of the tokenizer to see why the shape is ... x 5 x ... . Then identify the first value of the output of the [CLS] token.

In [36]:
# Decode the tokenization. Call it bert_toy_tokens.
### YOUR CODE HERE

bert_toy_tokens = bert_tokenizer.decode(bert_toy_inputs['input_ids'][0])

### END YOUR CODE
print('The tokens after tokenization: ', bert_toy_tokens)

The tokens after tokenization:  [CLS] This is new [SEP]


**QUESTION:**

3.a Why is the shape .. x 5 x ... and not .. x 3 x ... ? Explain. (You may need to to look up what the purpose is of one of the extra tokens. Don't write more than 2-3 lines.)

In [37]:
# Get the output for the [CLS] token. Call it cls_first_out.
### YOUR CODE HERE

cls_first_out = bert_toy_outputs.last_hidden_state[0][0]

### END YOUR CODE
print('First output of [CLS] token: ', cls_first_out)

First output of [CLS] token:  tensor([ 4.7248e-01,  1.7685e-01,  5.4234e-01, -5.4906e-01, -1.1503e-01,
         1.1630e-01,  4.2311e-01,  2.2489e-02, -1.9720e-01, -9.8925e-01,
        -4.3665e-01,  3.0220e-01, -4.5189e-01,  1.0892e-01, -6.0475e-01,
         2.3494e-01,  1.0624e-01,  3.3463e-01,  9.6197e-02, -7.3822e-02,
         1.8399e-01, -2.7128e-01,  6.3856e-01, -2.2248e-01,  4.8449e-01,
        -2.6155e-01,  6.0228e-01,  1.4545e-01,  4.9357e-02,  4.4853e-01,
        -5.3474e-02,  6.7213e-02,  4.8560e-02, -6.9567e-02,  3.0451e-02,
        -1.2569e-01,  1.5189e-02, -6.5966e-01, -5.1185e-03, -1.5514e-01,
        -5.0256e-01,  2.3907e-01,  2.9280e-01, -1.1793e-01,  5.0254e-01,
        -7.8381e-01, -8.3310e-02, -1.2723e-01, -4.2026e-01,  1.1898e-01,
         7.1867e-02,  1.9298e-01, -5.5773e-02,  3.2795e-01, -5.6385e-03,
        -1.5043e-01, -5.4291e-01,  3.0932e-01, -5.1025e-01,  2.8927e-01,
         1.8317e-01, -4.7179e-03,  4.0482e-01, -2.3334e-03, -5.9042e-02,
        -9.0397e-02, 

**QUESTION:**

3.b What is the first value of the output of the [CLS] token?

Now we construct the dataset and the dataloader. The BERT dataset class is defined at the beginning.

In [38]:
bert_train_data = BERTClassificationData(imdb_train_set,
                                tokenizer=bert_tokenizer,
                                max_len=100,
                                num_examples=-1
                                )

bert_test_data = BERTClassificationData(imdb_test_set,
                                tokenizer=bert_tokenizer,
                                max_len=100,
                                num_examples=-1
                                )

batch_size = 4
bert_imdb_train_loader = DataLoader(bert_train_data, batch_size=batch_size, shuffle=True)
bert_imdb_test_loader = DataLoader(bert_test_data, batch_size=batch_size, shuffle=True)



Token indices sequence length is longer than the specified maximum sequence length for this model (621 > 512). Running this sequence through the model will result in indexing errors


Now build the classification network that uses the output of the [CLS] token for the classification.

In [39]:
%%capture

class BERTClassificationNetworkClass(torch.nn.Module):
    """
    Write the class for the classification network using
    the Masked Language Model BERT.
    Specificaly, you will need to extract the output of the [CLS] token
    from the BERT model (i.e., the very first token), apply a suitable linear layer,
    and apply the sigmoid function.
    """

    def __init__(self):
        ### YOUR CODE HERE

        super(BERTClassificationNetworkClass, self).__init__()
        self.bert_model = bert_model
        self.linear = torch.nn.Linear(768, 1)
        self.sigmoid = torch.nn.Sigmoid()

        ### END YOUR CODE

    def forward(self, x):
        # Get the forward pass. Apply the BERT model, then the linear layer, and
        # then apply the sigmoid
        ### YOUR CODE HERE

        outputs = self.bert_model(**x)
        cls_output = outputs.last_hidden_state[:, 0, :]
        linear_output = self.linear(cls_output)
        sigmoid_output = self.sigmoid(linear_output)

        ### END YOUR CODE

        return torch.squeeze(sigmoid_output) # removing 'x 1 x ' dimensions


loss_fn = torch.nn.BCELoss()

bert_classification_model = BERTClassificationNetworkClass().to(device)


Let's test it. Is the structure correct?

In [40]:
test = next(iter(bert_imdb_train_loader))

out = bert_classification_model({'input_ids': test['input_ids']})

loss = loss_fn(out.float(), test['label'].float())

print('Output: ', out)
print('Loss: ', loss)


Output:  tensor([0.4195, 0.4163, 0.4187, 0.4307], grad_fn=<SqueezeBackward0>)
Loss:  tensor(0.6988, grad_fn=<BinaryCrossEntropyBackward0>)


Good. Finally, we need train and test loops:

In [41]:
def bert_train_loop(dataloader, model, loss_fn, optimizer, reporting_interval=100, steps=None):
    """
    Following the same logic as above, write the training loop to use the
    Masked Language Model BERT for the sentiment classification task. You
    only need to report the average loss after the reporting interval
    and end of each epoch.
    """

    ### YOUR CODE HERE

    model.train()
    epoch_loss = 0
    batch_count = 0

    for step, batch_data in enumerate(dataloader):
        if steps and step >= steps:
            break
    
        batch_input = {k: v.to(device) for k, v in batch_data.items()}
    
        optimizer.zero_grad()
    
        batch_output = model(batch_input)
        batch_loss = loss_fn(batch_output, batch_input['label'].float())
    
        batch_loss.backward()
        optimizer.step()
    
        epoch_loss += batch_loss.item()
        batch_count += 1
    
        if step % reporting_interval == 0 and step > 0:
            avg_loss = epoch_loss / batch_count
            print(f"Step {step}, Avg train loss: {avg_loss:.6f}")

    if batch_count == 0:
        print("No batches were processed.")
        return
    
    avg_epoch_loss = epoch_loss / batch_count

    ### END YOUR CODE

    print(f"Training Results: \n  Avg train loss: {epoch_loss/batch:>8f} \n")


def bert_test_loop(dataloader, model, loss_fn, reporting_interval=100, contrast_pair=None,steps=None):
    """
    Following the same logic as above, write the test loop to use the
    Masked Language Model BERT for the sentiment classification task.
    Please report on the accuracy after the reporting interval and end of each epoch.
    """

    ### YOUR CODE HERE

    model.eval()
    test_loss, correct_class_predictions

    for step, batch_data in enumerate(dataloader):
        if steps and step >= steps:
            break

        batch_input = {k: v.to(device) for k, v in batch_data.items()}

        batch_output = model(batch_input)
        batch_loss = loss_fn(batch_output, batch_input['label'].float())
        test_loss += batch_loss.item()

        for i in range(batch_output.size(0)):
            predicted_class = torch.round(batch_output[i]).item()
            correct_class = batch_input['label'][i].item()

            if predicted_class == correct_class:
                correct_class_predictions += 1

    avg_test_loss = test_loss / len(dataloader.dataset)

    ### END YOUR CODE

    correct = float(correct.cpu().detach().numpy())

    test_loss /= batch
    correct /= total
    print(correct)

    print(f"Test Results: \n Test Accuracy: {(100*correct):>0.1f}%, Avg test loss: {test_loss:>8f} \n")

Now let's see:

In [42]:
adam_optimizer_bert = torch.optim.AdamW(bert_classification_model.parameters(), lr=0.00001)

epochs = 1
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    bert_train_loop(bert_imdb_train_loader, bert_classification_model, loss_fn, adam_optimizer_bert, steps=2000)
    bert_test_loop(bert_imdb_test_loader, bert_classification_model, loss_fn,
              steps=500
              ) # no optimizer use here!
print("Done!")

Epoch 1
-------------------------------


TypeError: forward() got an unexpected keyword argument 'label'


**QUESTION:**

3.c What was the test accuracy you got for the BERT model?


And that is it. Congratulations!