# Pre-trained Language Models: SubTask A.
## [6 Marks]

In this assignment, you will work on the [ComVE](https://competitions.codalab.org/competitions/21080) shared task that was part of SemEval-2020. The task evaluates whether a system can tell if a natural language statement makes sense to humans and can provide a reason for its decision. **ComVE** includes three subtasks that require models to acquire and apply commonsense knowledge. In this notebook you will focus on **SubTask A**:

- Given two similar statements that differ by only a few words, select the statement of the two that does not make sense. For example, within the statements below, *Statement 2* is the nonsensical statement:

     *Statement 1*: He put a turkey into the fridge.    
     *Statement 2*: He put an elephant into the fridge.
     
     This subtask can be approached as a Text Matching problem where the input is the two statements and the output is a label indicating which is the nonsensical one.
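
To make the input and output of this formulation concrete, here is a minimal sketch in plain Python. The `StatementPair` container and the `nonsensical_statement` helper are hypothetical names introduced only for illustration; the label convention (`0` for the first statement, `1` for the second) is the one the dataset uses.

```python
from dataclasses import dataclass


@dataclass
class StatementPair:
    """One SubTask A example (hypothetical container, for illustration only)."""
    sent0: str
    sent1: str
    label: int  # 0 -> sent0 is nonsensical, 1 -> sent1 is nonsensical


def nonsensical_statement(pair: StatementPair) -> str:
    """Return the statement that the label marks as nonsensical."""
    return pair.sent0 if pair.label == 0 else pair.sent1


example = StatementPair(
    sent0="He put a turkey into the fridge.",
    sent1="He put an elephant into the fridge.",
    label=1,
)
print(nonsensical_statement(example))  # He put an elephant into the fridge.
```

A classifier for this subtask only has to predict the `label` field; the statement itself can always be recovered from the pair.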

You will fine-tune a Pre-trained Language Model with the [Transformers](https://huggingface.co/docs/transformers/index) library, which provides a set of tools for fine-tuning and deploying a wide variety of Pre-trained Language Models. The [Hugging Face Hub](https://huggingface.co/models) allows you to explore all the models supported by **Transformers** and even share your own models with the community. In this assignment, you will work with [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta), a model that uses **BERT**'s architecture but has been pre-trained with more data and a more carefully selected set of hyperparameters.

Fine-tuning a Pre-trained Language Model usually requires a great amount of time and computational resources, and your personal computer will probably not be enough. To complete the assignment, you can work with a reduced version of the dataset and the base version of **RoBERTa**:

In [1]:
#Mount Drive

from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
!pip install -r /content/drive/MyDrive/AIML/Assessment_4_requirements.txt

Collecting ipython==8.5.0 (from -r /content/drive/MyDrive/AIML/Assessment_4_requirements.txt (line 1))
  Downloading ipython-8.5.0-py3-none-any.whl (752 kB)
Collecting jupyter==1.0.0 (from -r /content/drive/MyDrive/AIML/Assessment_4_requirements.txt (line 2))
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting nbimporter==0.3.4 (from -r /content/drive/MyDrive/AIML/Assessment_4_requirements.txt (line 3))
  Downloading nbimporter-0.3.4-py3-none-any.whl (4.9 kB)

In [1]:
!pip install tensorflow==2.12

Collecting tensorflow==2.12
  Downloading tensorflow-2.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (585.9 MB)
Collecting flatbuffers>=2.0 (from tensorflow==2.12)
  Downloading flatbuffers-23.5.26-py2.py3-none-any.whl (26 kB)
Collecting keras<2.13,>=2.12.0 (from tensorflow==2.12)
  Downloading keras-2.12.0-py2.py3-none-any.whl (1.7 MB)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow==2.12)
  Downloading protobuf-4.24.4-cp37-abi3-manylinux2014_x86_64.whl (311 kB)
Collecting tensorboard<2.13,>=2.12 (from tensorflow==2.12)
  Downloading tensorboard-2.12.3-

In [2]:
import subprocess

def get_package_version(command):
    try:
        version = subprocess.getoutput(command)
        return version
    except Exception as e:
        return str(e)

commands = {
    'IPython': 'ipython --version',
    'Jupyter': 'pip show jupyter | grep Version',
    'Nbimporter': 'pip show nbimporter | grep Version',
    'Pytest': 'pip show pytest | grep Version',
    'Pandas': 'pip show pandas | grep Version',
    'TensorFlow': 'pip show tensorflow | grep Version',
    'torch': 'pip show torch | grep Version',
    'transformers': 'pip show transformers | grep Version',
    'datasets': 'pip show datasets | grep Version',
    'evaluate': 'pip show evaluate | grep Version',
    'rouge-score': 'pip show rouge-score | grep Version'
    }

for package, command in commands.items():
    version = get_package_version(command)
    print(f'{package}: {version}')

IPython: 8.5.0
Jupyter: Version: 1.0.0
Nbimporter: Version: 0.3.4
Pytest: Version: 7.1.3
Pandas: Version: 1.3.5
TensorFlow: Version: 2.12.0
torch: Version: 2.0.0
transformers: Version: 4.23.1
datasets: Version: 2.11.0
evaluate: Version: 0.4.0
rouge-score: Version: 0.1.2


In [3]:
shrink_dataset = True
base_model = True
colab = True

Although the values of these variables do not affect the tests that will evaluate your code, the output examples distributed throughout this notebook are based on the `shrink_dataset` and `base_model` variables set to `True`, and the `colab` variable set to `False`.

If you want to perform a full training of the model to obtain its real performance, you can use a cloud service like [Google Colab](https://colab.research.google.com/). **Colab** is a **Jupyter** notebook environment that supports both GPU and TPU instances, which makes it possible to train large-scale Deep Learning models. Set the `shrink_dataset` and `base_model` variables to `False`, the `colab` variable to `True`, and follow the instructions provided to you to run the notebook in **Colab**.

In [4]:
if colab:
    ! pip install transformers datasets evaluate
    import os
    if not os.path.exists("SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_data_all.csv"):
        ! git clone https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.git SemEval2020-Task4-Data

Cloning into 'SemEval2020-Task4-Data'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (88/88), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 88 (delta 36), reused 64 (delta 19), pack-reused 0
Receiving objects: 100% (88/88), 2.22 MiB | 4.35 MiB/s, done.
Resolving deltas: 100% (36/36), done.


You will use the following objects and functions:

In [5]:
import pandas as pd
import numpy as np
import evaluate
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer,
                          enable_full_determinism)

When working with Neural Networks, many operations are random, such as initializing the weights of the network, shuffling the training data, or choosing samples. As a result, different training runs of the same model can lead to different results. To ensure reproducibility, i.e. obtaining the same results across runs, the random number generator must be initialized with a fixed value known as a seed. In **Transformers**, this can be done as follows:

In [6]:
enable_full_determinism(seed=42)

> **Note!** With models as complex as Neural Networks, reproducibility is susceptible to factors such as software versions or the hardware on which the models are run. Even with seed initialization, there may be slight differences in the results.
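
The principle behind seeding can be illustrated with Python's standard `random` module alone. This is only a sketch of the idea, not of what `enable_full_determinism` does internally; the `draw_numbers` helper is a hypothetical name used for illustration.

```python
import random


def draw_numbers(seed, n=5):
    """Seed a private generator and draw n pseudo-random integers."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


first_run = draw_numbers(seed=42)
second_run = draw_numbers(seed=42)

# The same seed reproduces exactly the same "random" sequence.
print(first_run == second_run)  # True
```

Training a network draws from several such generators (for weight initialization, shuffling, dropout), so every one of them has to be seeded for two runs to match.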

Working with Neural Networks also involves defining a number of hyperparameters that set the configuration of the model. Finding the appropriate hyperparameter values requires training the model with different combinations and testing them on the development set. This hyperparameter tuning is a costly process that needs multiple rounds of experimentation. However, for this assignment, you will use the following values:

In [7]:
epochs = 3  # Number of epochs to train the model
train_batch_size = 8  # Number of examples used per gradient update
learning_rate = 1e-5  # The learning rate for the optimizer
max_length = 50  # Maximum length of the input sequence
output_dir = "modelA"  # The output directory where the model will be written to
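
For reference, the tuning process described above can be sketched as a small grid search. Everything in this block is an assumption made for illustration: the candidate values in `search_space` are hypothetical, and `dev_score` is a placeholder that returns a made-up number instead of fine-tuning a model and measuring development-set accuracy.

```python
from itertools import product

# Hypothetical candidate values; the assignment itself fixes one configuration.
search_space = {
    "learning_rate": [1e-5, 3e-5],
    "train_batch_size": [8, 16],
    "epochs": [2, 3],
}


def dev_score(config):
    """Placeholder for 'fine-tune with config, return dev-set accuracy'.

    The value is invented so that the sketch runs without a model.
    """
    return (
        -abs(config["learning_rate"] - 1e-5) * 1e4
        - abs(config["train_batch_size"] - 8) * 0.001
        - abs(config["epochs"] - 3) * 0.01
    )


# Enumerate every combination of candidate values.
names = list(search_space)
configs = [dict(zip(names, values)) for values in product(*search_space.values())]

# Keep the configuration with the highest development score.
best_config = max(configs, key=dev_score)

print(len(configs))  # 8 combinations (2 * 2 * 2)
print(best_config)
```

Each call to the real scoring function would be a full fine-tuning run, which is why the number of combinations grows expensive so quickly.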

## Loading the Pre-trained Model - [1 Mark]

The first step you must perform in this assignment is to load the model and its corresponding tokenizer. **Transformers** provides support for a wide variety of pre-trained models via specific classes. However, the library also allows automatically retrieving a model given just its name or path using [AutoClasses](https://huggingface.co/docs/transformers/v4.27.2/en/model_doc/auto). To fine-tune a pre-trained model for a downstream task, it is necessary to replace the original top layer of the model with a new task-specific output layer. **AutoClasses** also allow you to do this automatically for various types of Natural Language Processing tasks. For instance, `AutoModelForSequenceClassification` instantiates the model with a top layer for Text Classification.

You must complete the code for the `load_model` function. This function takes the name of the pre-trained model and should load and return both the model, initialized for Text Classification, and its corresponding tokenizer. You can get some tips from the [Transformers documentation](https://huggingface.co/docs/transformers/autoclass_tutorial).

In [8]:
def load_model(model_name):   # [1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    #
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Load the model with a classification head (sequence classification)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    return model, tokenizer

In [9]:
model_name = "roberta-base" if base_model else "roberta-large"
model, tokenizer = load_model(model_name)

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

## Data Pre-processing - [1 Mark]

The **ComVE** dataset consists of 10000 pairs of statements for the train set, 997 pairs for development and 1000 for test. Each of the statement pairs is annotated with a `0` or `1` label depending on whether the nonsensical statement is the first or the second one respectively. The dataset can be loaded into three `DataFrames` as follows:

In [10]:
def load_data(data_csv, answers_csv):
    data = pd.read_csv(data_csv)
    reasons = pd.read_csv(answers_csv, header=None).rename(columns={0: "id", 1: "label"})
    return pd.merge(data, reasons, on="id")

In [11]:
train_data_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_data_all.csv"
train_answers_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_answers_all.csv"
train_data = load_data(train_data_csv, train_answers_csv)
dev_data_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskA_dev_data.csv"
dev_answers_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskA_gold_answers.csv"
dev_data = load_data(dev_data_csv, dev_answers_csv)
test_data_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskA_test_data.csv"
test_answers_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskA_gold_answers.csv"
test_data = load_data(test_data_csv, test_answers_csv)
if shrink_dataset:
    train_data = train_data.sample(n=100, random_state=42)
    dev_data = dev_data.sample(n=100, random_state=42)
    test_data = test_data.sample(n=100, random_state=42)
train_data

| | id | sent0 | sent1 | label |
| --- | --- | --- | --- | --- |
| 6252 | 6252 | a duck walks on three legs | a duck walks on two legs | 0 |
| 4684 | 4684 | Jack's mom praised him because he broke the plate | Jack's mom condemned him because he broke the ... | 0 |
| 1731 | 1731 | People use electricity to buy things | People use money to buy things | 0 |
| 4742 | 4742 | The speaker is damaged, thus I can't hear anyt... | The display is damaged, thus I can't hear anyt... | 1 |
| 4521 | 4521 | Santa Claus is the legend of the East | Santa Claus is the legend of the West | 0 |
| ... | ... | ... | ... | ... |
| 3787 | 3787 | If you want to visit a museum then you should ... | If you want to visit a museum then you should ... | 1 |
| 9189 | 9189 | the mirror in the bathroom fogged up after she... | the mirror in the bathroom fogged up after she... | 1 |
| 7825 | 7825 | My mother keeps a tiger and takes care of it a... | My mother keeps a cat and takes care of it as ... | 0 |
| 7539 | 7539 | His lens cover is broken, so he can not use it... | His lens is broken, so he can not use it to ta... | 0 |


[Datasets](https://huggingface.co/docs/datasets/index) is a library for dataset management that provides a set of tools to manipulate data in an easy and efficient way. Since it is fully integrated with **Transformers**, it is very convenient to use both libraries together. **Datasets** allows accessing and sharing datasets through the [Hugging Face Hub](https://huggingface.co/datasets). The core component of this library is the [Dataset](https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/main_classes#datasets.Dataset) class that implements an [Apache Arrow table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html). Similar to a **pandas** `DataFrame`, a `Dataset` object stores a table where each row corresponds to an example of the dataset and each column contains a different type of data. There are different ways to load the data into a `Dataset`, for example, from a `DataFrame`:

In [12]:
train_dataset = Dataset.from_pandas(train_data)
dev_dataset = Dataset.from_pandas(dev_data)
test_dataset = Dataset.from_pandas(test_data)
train_dataset[0]

{'id': 6252,
 'sent0': 'a duck walks on three legs',
 'sent1': 'a duck walks on two legs',
 'label': 0,
 '__index_level_0__': 6252}

One of the most powerful **Datasets** tools is the [map](https://huggingface.co/docs/datasets/v2.10.0/en/nlp_process#map) function which allows pre-processing the dataset in batches. The function takes another callable as argument and applies it to every row in the `Dataset`. The goal of the next exercise is to implement a function to tokenize the statement pairs that will be used as a parameter of the `map` function.
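
When `map` is called with `batched=True`, the callable receives a dictionary that maps each column name to a list of values, and it returns a dictionary of new columns in the same layout. The following toy re-creation in plain Python is only meant to show that data flow; `toy_batched_map` and `count_words` are hypothetical helpers, not part of **Datasets**, and the real implementation is far more general.

```python
def toy_batched_map(columns, function, batch_size=2):
    """Apply `function` to column-wise batches and append the columns it returns.

    `columns` maps each column name to a list of values (one per row).
    Simplification: `function` may only return columns that do not exist yet.
    """
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, num_rows, batch_size):
        # Slice every column to build one batch in dict-of-lists form.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for new_name, new_values in function(batch).items():
            result.setdefault(new_name, []).extend(new_values)
    return result


def count_words(batch):
    """Batched function: takes lists of sentences, returns one list per new column."""
    return {"sent0_words": [len(s.split()) for s in batch["sent0"]]}


table = {
    "sent0": [
        "a duck walks on three legs",
        "People use electricity to buy things",
        "Pens are for writing",
    ],
    "label": [0, 0, 1],
}
mapped = toy_batched_map(table, count_words)
print(mapped["sent0_words"])  # [6, 6, 4]
```

This is why a batched pre-processing function indexes its argument by column name, as in `examples["sent0"]`, and gets back a list of sentences rather than a single string.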

You must complete the code for the `preprocess_data` function. This function takes a batch of examples from a `Dataset`, the tokenizer returned by `load_model` and the `max_length` hyperparameter. The function must run the tokenizer jointly on the `sent0` and `sent1` columns of the `Dataset`, so that each pair is encoded as a single sequence. The tokenizer must pad and truncate the sequences to the `max_length` value. You can use the [Preprocessing](https://huggingface.co/docs/transformers/v4.27.2/en/preprocessing) and the [Tokenizer](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/tokenizer) documentation as reference. `preprocess_data` should return the output of the tokenizer.

The `tokenizer` should return a [BatchEncoding](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/tokenizer#transformers.BatchEncoding) object with two fields for each data example:
* *input_ids*: A list of token indices that will be used as the input of the model.
* *attention_mask*: A list of 0/1 flags marking which tokens the model should attend to (`1`) and which it should ignore (`0`), such as padding.

The `map` function takes these fields and inserts them into the `Dataset` as two new columns. For example, the result for the first row in the `Dataset` should look like:

> <pre>
{'id': 6252, 'sent0': 'a duck walks on three legs', 'sent1': 'a duck walks on two legs', 'label': 0, '__index_level_0__': 6252, 'input_ids': [0, 102, 15223, 5792, 15, 130, 5856, 2, 2, 102, 15223, 5792, 15, 80, 5856, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
</pre>

Each value in `input_ids` represents a sub-word of the `tokenizer` vocabulary. For the example above, `input_ids` corresponds to the following sequence of sub-words:

> <pre>
['&lt;s&gt;', 'a', 'Ġduck', 'Ġwalks', 'Ġon', 'Ġthree', 'Ġlegs', '&lt;/s&gt;', '&lt;/s&gt;', 'a', 'Ġduck', 'Ġwalks', 'Ġon', 'Ġtwo', 'Ġlegs', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
</pre>

Notice that the **Hugging Face** implementation of **RoBERTa**'s tokenizer uses the `<s>` token as the equivalent of **BERT**'s `[CLS]` token and the `</s>` token to mark both the end of a sentence and the separation between sentences. The `Ġ` character indicates that the token was preceded by a blank space in the original text, which shows which tokens are the first sub-word of a word.
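
The layout of a sentence pair can be reproduced by hand to check one's understanding. The sketch below starts from the token ids already shown in the example output and only adds the special tokens, padding and attention mask. `encode_pair` is a hypothetical helper, and its truncation is deliberately crude; it is not how the tokenizer itself truncates pairs.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # <s>, <pad>, </s> as seen in the example output


def encode_pair(ids_a, ids_b, max_length):
    """Lay out two tokenized sentences as <s> A </s></s> B </s> plus padding."""
    input_ids = [BOS_ID] + ids_a + [EOS_ID, EOS_ID] + ids_b + [EOS_ID]
    input_ids = input_ids[:max_length]  # crude cut, not the library's strategy
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


sent0_ids = [102, 15223, 5792, 15, 130, 5856]  # a duck walks on three legs
sent1_ids = [102, 15223, 5792, 15, 80, 5856]   # a duck walks on two legs
encoded = encode_pair(sent0_ids, sent1_ids, max_length=50)

print(encoded["input_ids"][:18])
print(sum(encoded["attention_mask"]))  # 16 real tokens, the rest is padding
```

The 16 attended positions are the 12 sub-word tokens plus the four special tokens; the remaining 34 positions are `<pad>` with an attention mask of `0`.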

In [13]:
def preprocess_data(examples, tokenizer, max_length):   # [1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    #
    # Tokenize each statement pair jointly, padding/truncating to max_length.
    # Plain Python lists are returned; `Dataset.map` stores them as new columns.
    tokenized_inputs = tokenizer(
        examples['sent0'],
        examples['sent1'],
        padding='max_length',
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs

In [14]:
train_dataset = train_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
dev_dataset = dev_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
test_dataset = test_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
print(train_dataset[0])
print(tokenizer.convert_ids_to_tokens(train_dataset[0]["input_ids"]))

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

{'id': 6252, 'sent0': 'a duck walks on three legs', 'sent1': 'a duck walks on two legs', 'label': 0, '__index_level_0__': 6252, 'input_ids': [0, 102, 15223, 5792, 15, 130, 5856, 2, 2, 102, 15223, 5792, 15, 80, 5856, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
['<s>', 'a', 'Ġduck', 'Ġwalks', 'Ġon', 'Ġthree', 'Ġlegs', '</s>', '</s>', 'a', 'Ġduck', 'Ġwalks', 'Ġon', 'Ġtwo', 'Ġlegs', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']


## Fine-tuning - [4 Marks]

Although it is possible to write customized training loops for the **Transformers** models using **Keras** or **PyTorch**, **Transformers** provides a [Trainer](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer) API that allows fine-tuning efficiently with a few simple steps. The training is highly customizable through a wide range of options and hyperparameters that are handled by the [TrainingArguments](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.TrainingArguments) class. Your next goal is to create both the `TrainingArguments` and `Trainer` objects that will be used to fine-tune **RoBERTa**. See the [documentation](https://huggingface.co/docs/transformers/training) for an introduction on how to perform these steps.

You must complete the code for the `create_training_arguments` function. This function takes as arguments the `epochs`, `train_batch_size` and `learning_rate` hyperparameters along with the `output_dir`. The function should use these arguments to create and return a `TrainingArguments` object. During the training, the model must be evaluated on the development set after every epoch. `TrainingArguments` should include this strategy.

> **Important!** By default, `Trainer` saves a checkpoint of the model every 500 training steps. For this assignment, avoid this behavior by setting `save_strategy="no"` when creating the `TrainingArguments`.
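
To put the 500-step figure in context, the number of optimization steps can be estimated from the dataset size, the batch size and the number of epochs. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation; `optimization_steps` is a hypothetical helper, not a **Transformers** function.

```python
import math


def optimization_steps(num_examples, batch_size, epochs):
    """Steps per epoch and in total, assuming one device and no accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Reduced dataset used in this notebook: 100 examples, batch size 8, 3 epochs.
per_epoch, total = optimization_steps(num_examples=100, batch_size=8, epochs=3)
print(per_epoch, total)  # 13 39
```

With the full training set of 10000 pairs, the same formula gives 1250 steps per epoch and 3750 in total, so a 500-step checkpoint interval would be reached several times during a full run.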

In [15]:
def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):    # [1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    #
    training_args = TrainingArguments(
        output_dir=output_dir,           # Directory where model outputs are written
        num_train_epochs=epochs,        # Total number of training epochs
        per_device_train_batch_size=train_batch_size,  # Batch size per device during training
        evaluation_strategy="epoch",    # Evaluate on the development set after every epoch
        save_strategy="no",             # Do not save model checkpoints during training
        logging_dir="./logs",           # Directory for storing logs
        logging_steps=10,               # Log every 10 steps
        learning_rate=learning_rate,    # Learning rate for the optimizer
        load_best_model_at_end=False,   # Keep the final weights; no checkpoints exist to reload
        metric_for_best_model="eval_accuracy",  # No effect in this setup (load_best_model_at_end=False)
        save_total_limit=1,             # No effect while save_strategy="no"
        push_to_hub=False               # Do not push checkpoints to the Hugging Face Hub
    )

    return training_args

In [16]:
train_args = create_training_arguments(epochs, train_batch_size, learning_rate, output_dir)

Next, you will create a `Trainer` object with the training arguments.

You must complete the code for the `create_trainer` function. The function takes as input the model returned by `load_model`, the `TrainingArguments` created by `create_training_arguments` and the train and development `Datasets`. The `create_trainer` function must create and return a `Trainer` object with the model and the training arguments. The `Trainer` must be set up so that the train `Dataset` is used for training and the development `Dataset` is used to evaluate the model during training.

In [17]:
def create_trainer(model, train_args, train_dataset, dev_dataset):    # [1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    #
    trainer = Trainer(
        model=model,                     # The model to be trained
        args=train_args,                 # Training arguments created earlier
        train_dataset=train_dataset,     # Training dataset
        eval_dataset=dev_dataset         # Development dataset for evaluation during training
    )
    return trainer

In [18]:
trainer = create_trainer(model, train_args, train_dataset, dev_dataset)

The `trainer` object created by `create_trainer` is ready to fine-tune the model by just running:

In [19]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sent1, id, __index_level_0__, sent0. If sent1, id, __index_level_0__, sent0 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 100
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 39


| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 0.6985 | 0.692225 |
| 2 | 0.6885 | 0.691137 |
| 3 | 0.673 | 0.691083 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sent1, id, __index_level_0__, sent0. If sent1, id, __index_level_0__, sent0 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sent1, id, __index_level_0__, sent0. If sent1, id, __index_level_0__, sent0 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 8
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sent1, id, __index_level_0__, sent0. If sent1, id, __index_le

TrainOutput(global_step=39, training_loss=0.6868079136579465, metrics={'train_runtime': 335.7432, 'train_samples_per_second': 0.894, 'train_steps_per_second': 0.116, 'total_flos': 7708331700000.0, 'train_loss': 0.6868079136579465, 'epoch': 3.0})

After training, the model can be used to make predictions on unlabeled data using the [predict](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.Trainer.predict) method of the `Trainer` class.

You must complete the code for the `make_predictions` function. The function takes as input the `Trainer` object and test `Dataset`. The function must run the `predict` method on the input data. The `predict` method will return a `NamedTuple` including a **numpy** array with the predictions. For each statement pair in the input, the array should contain a vector with the logits (the values used as input of the softmax) predicted for every label. The output of `make_predictions` must include only the index of the label with the highest logit value. For example, if the prediction for one statement pair is `[0.10053499, -0.01917896]`, the output for that example should be `0`. For this, you can apply the [argmax](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) method along the last axis of the **numpy** array.
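
The selection step described above can be checked on a tiny example. The block below is a dependency-free stand-in for applying `argmax` along the last axis; `argmax_last_axis` is a hypothetical helper, and the second row of logits is made up for illustration.

```python
def argmax_last_axis(logits):
    """Return, for each row of logits, the index of the largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


logits = [
    [0.10053499, -0.01917896],  # example from the text: label 0 wins
    [-0.3, 0.7],                # made-up row: label 1 wins
]
print(argmax_last_axis(logits))  # [0, 1]
```

Because softmax preserves the ordering of its inputs, taking the argmax of the logits gives the same label as taking the argmax of the probabilities.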

In [20]:
def make_predictions(trainer, test_dataset):    # [2 Marks]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    #
    # Run prediction on the test dataset
    raw_predictions = trainer.predict(test_dataset).predictions

    # Get the index of the label with the highest logit value for each example
    predicted_labels = np.argmax(raw_predictions, axis=-1)  # last axis holds the per-label logits

    return predicted_labels

In [21]:
predictions = make_predictions(trainer, test_dataset)
test_data["prediction"] = predictions
test_data

The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sent1, id, __index_level_0__, sent0. If sent1, id, __index_level_0__, sent0 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 100
  Batch size = 8


| | id | sent0 | sent1 | label | prediction |
| --- | --- | --- | --- | --- | --- |
| 521 | 324 | She put the papers into the filing cabinet. | She put the filing cabinet into the papers. | 1 | 0 |
| 737 | 1455 | The cat used the litter box | The lion used the litter box | 1 | 0 |
| 740 | 13 | Cigarette is good for healthy | Cereal is good for healthy | 0 | 0 |
| 660 | 207 | Pens are for writing | Pens are for painting | 1 | 0 |
| 411 | 774 | he put a piece of plastic on the bread | he put a piece of cheese on the bread | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 436 | 225 | Carol turned on the potato | Carol turned on the flashlight | 0 | 0 |
| 764 | 1663 | i use my dog to play cricket | i use the bat to play cricket | 0 | 0 |
| 88 | 82 | Dolphins are fish. | Dolphins are mammals. | 0 | 0 |
| 63 | 800 | the family adopted a dinosaur to be their new pet | the family adopted a dog to be their new pet | 0 | 0 |


**SubTask A** of **ComVE** is evaluated using accuracy. The [evaluate](https://huggingface.co/docs/evaluate/index) library provides support to apply this and other metrics. The `evaluate_prediction` function takes the test `DataFrame` and calculates the accuracy by comparing the `prediction` and `label` columns. With `shrink_dataset` and `base_model` set to `True`, the model is not able to learn the task, so the expected score is only *0.49*, roughly chance level for a two-label task. With a full training run, i.e. with `shrink_dataset` and `base_model` set to `False`, the score should be around *0.929*.
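
Accuracy itself is simply the fraction of examples whose prediction matches the gold label. The sketch below computes it by hand on the first five rows of the prediction table shown earlier; the `accuracy` function here is a hypothetical stand-in written for illustration, not the **evaluate** implementation.

```python
def accuracy(predictions, references):
    """Fraction of positions where prediction and reference agree."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(int(p == r) for p, r in zip(predictions, references))
    return matches / len(references)


labels = [1, 1, 0, 1, 0]     # gold labels of the first five test rows shown
predicted = [0, 0, 0, 0, 0]  # predictions for those same rows
print(accuracy(predicted, labels))  # 0.4
```

In the rows shown, the under-trained model predicts `0` every time, so it is credited only for the examples whose gold label happens to be `0`.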

In [22]:
def evaluate_prediction(test_data):
    accuracy = evaluate.load("accuracy")
    return accuracy.compute(predictions=test_data["prediction"].values, references=test_data["label"].values)
evaluate_prediction(test_data)

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

{'accuracy': 0.49}

In [23]:
!pip install nbconvert



In [24]:
from nbconvert import HTMLExporter
import nbformat

# Load your Jupyter Notebook
notebook_path = '/content/Pretrained_LM_Subtask_A.ipynb'

with open(notebook_path, 'r', encoding='utf-8') as notebook_file:
    notebook_content = nbformat.read(notebook_file, as_version=4)

# Create an HTMLExporter
html_exporter = HTMLExporter()

# Generate the HTML content
(html_output, resources) = html_exporter.from_notebook_node(notebook_content)

# Save the HTML content to a file
html_file_path = '/content/Pretrained_LM_Subtask_A.html'

with open(html_file_path, 'w', encoding='utf-8') as html_file:
    html_file.write(html_output)

  warn(
