# Pre-trained Language Models: SubTask B
## [6 Marks]

In this assignment, you will work on the [ComVE](https://competitions.codalab.org/competitions/21080) shared task that was part of SemEval-2020. The task evaluates whether a system can tell if a natural language statement makes sense to humans and provide a reason for its decision. **ComVE** includes three subtasks that require models to acquire and apply commonsense knowledge. In this notebook you will focus on **SubTask B**:

- Given a statement that does not make sense and three possible reasons, select which reason explains why the given statement is against common sense. For example, for the following nonsensical statement the correct answer is *Reason A*:

     *Statement*: He put an elephant into the fridge.  
     *Reason A*: An elephant is much bigger than a fridge.  
     *Reason B*: Elephants are usually white while fridges are usually white.  
     *Reason C*: An elephant cannot eat a fridge.
     
     This subtask can be approached as a Multiple Choice problem where the input is the nonsensical statement and the three possible explanations, and the output is a label indicating which of the reasons is the correct one.
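
To make the Multiple Choice framing concrete, the sketch below expands the elephant example into three statement-reason pairs plus a label index. The dictionary keys and the `to_pairs` helper are illustrative names chosen for this sketch; they are not part of the dataset or of any library.

```python
# Illustrative only: the field names below are assumptions for this sketch.
example = {
    "statement": "He put an elephant into the fridge.",
    "reasons": [
        "An elephant is much bigger than a fridge.",
        "Elephants are usually white while fridges are usually white.",
        "An elephant cannot eat a fridge.",
    ],
    "label": 0,  # index of the correct reason (Reason A)
}


def to_pairs(example):
    """Pair the statement with each candidate reason."""
    return [(example["statement"], reason) for reason in example["reasons"]]


pairs = to_pairs(example)
for index, (statement, reason) in enumerate(pairs):
    marker = "correct" if index == example["label"] else "wrong"
    print(f"{statement} || {reason} -> {marker}")
```

A model for this subtask scores each of the three pairs and the pair with the highest score determines the predicted label.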

You will fine-tune a Pre-trained Language Model with the [Transformers](https://huggingface.co/docs/transformers/index) library, which provides a set of tools for fine-tuning and deploying a wide variety of Pre-trained Language Models. The [Hugging Face Hub](https://huggingface.co/models) allows you to explore all the models supported by **Transformers** and even share your own models with the community. In this assignment, you will work with [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta), a model that uses **BERT**'s architecture but has been pre-trained with more data and a more carefully selected set of hyperparameters.

Fine-tuning a Pre-trained Language Model usually requires a great amount of time and computational resources. Your personal computer will probably not be enough. To complete the assignment, you can work with a reduced version of the dataset and the base version of **RoBERTa**:

In [None]:
shrink_dataset = True
base_model = True
colab = True

Although the values of these variables do not affect the tests that will evaluate your code, the output examples distributed throughout this notebook are based on the `shrink_dataset` and `base_model` variables set to `True`, and the `colab` variable set to `False`.

If you want to perform a full training of the model to obtain its real performance, you can use a cloud service like [Google Colab](https://colab.research.google.com/). **Colab** is a **Jupyter** notebook environment that supports both GPU and TPU instances, which makes it possible to train large-scale Deep Learning models. Set the `shrink_dataset` and `base_model` variables to `False`, the `colab` variable to `True`, and follow the instructions provided to you to run the notebook in **Colab**.

> **Note!** To run this notebook in **Colab** you will need to upload the `datacollator.py` file included in the repository of the assignment.

In [None]:
if colab:
    ! pip install transformers datasets evaluate
    import os
    if not os.path.exists("SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_data_all.csv"):
        ! git clone https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.git SemEval2020-Task4-Data


You will use the following objects and functions:

In [None]:
import pandas as pd
import evaluate
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMultipleChoice,
                          TrainingArguments, Trainer,
                          enable_full_determinism)
# DataCollatorForMultipleChoice is imported later in the notebook

When working with Neural Networks, there are a large number of random operations, such as initializing the weights of the network, shuffling the data for training, or choosing samples. As a result, different training runs of the same model can lead to different results. To ensure reproducibility, i.e. obtaining the same results across runs, the random number generator must be initialized with a fixed value known as a seed. In **Transformers**, this can be done as follows:

In [None]:
enable_full_determinism(seed=42)

> **Note!** With models as complex as Neural Networks, reproducibility is susceptible to factors such as software versions or the hardware on which the models are run. Even with seed initialization, there may be slight differences in the results.
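
The effect of a seed can be illustrated without any Deep Learning library. The sketch below uses Python's standard `random` module as a stand-in for the random operations of a training run; `sample_run` is a hypothetical helper written for this example and is unrelated to `enable_full_determinism`.

```python
import random


def sample_run(seed):
    """Draw a few pseudo-random numbers from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(5)]


first_run = sample_run(seed=42)
second_run = sample_run(seed=42)
other_seed = sample_run(seed=7)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # a different seed starts a different sequence
```

Seeding a training run works on the same principle, although, as the note above says, hardware and software versions can still introduce small differences.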

Working with Neural Networks also involves defining a number of hyperparameters that set the configuration of the model. Finding the appropriate hyperparameter values requires training the model with different combinations and testing them on the development set. This hyperparameter tuning is a costly process that needs multiple rounds of experimentation. However, for this assignment, you will use the following values:

In [None]:
epochs = 3  # Number of epochs to train the model
train_batch_size = 8  # Number of examples used per gradient update
learning_rate = 1e-5  # The learning rate for the optimizer
max_length = 50  # Maximum length of the input sequence
output_dir = "modelB"  # The output directory where the model will be written to
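
As a rough sanity check on these values, the number of optimizer steps can be worked out by hand. The sketch below assumes the reduced training set of 100 examples, a single device and no gradient accumulation; those assumptions are not stated by the hyperparameters themselves.

```python
import math

# Values taken from the notebook; a single device and no gradient
# accumulation are assumptions of this sketch.
epochs = 3
train_batch_size = 8
num_train_examples = 100  # size of the reduced training set

steps_per_epoch = math.ceil(num_train_examples / train_batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 13 39
```

Under these assumptions the total agrees with the `global_step=39` reported in the training output later in the notebook.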

## Loading the Pre-trained Model - [1 Mark]

The first step you must perform in this assignment is to load the model and its corresponding tokenizer. **Transformers** provides support for a wide variety of pre-trained models via specific classes. However, the library also allows automatically retrieving a model given just its name or path using [AutoClasses](https://huggingface.co/docs/transformers/v4.27.2/en/model_doc/auto). To fine-tune a pre-trained model for a downstream task, it is necessary to replace the original top layer of the model with a new task-specific output layer. **AutoClasses** also allow you to do this automatically for various types of Natural Language Processing tasks. For instance, `AutoModelForMultipleChoice` instantiates the model with a top layer for Multiple Choice.

You must complete the code for the `load_model` function. This function takes the name of the pre-trained model and should load and return both the model, initialized for Multiple Choice, and its corresponding tokenizer. You can get some tips from the [Transformers documentation](https://huggingface.co/docs/transformers/autoclass_tutorial).

In [None]:
def load_model(model_name):   # [1 Mark]
    # Load the pre-trained model with a new Multiple Choice head on top
    model = AutoModelForMultipleChoice.from_pretrained(model_name)
    # Load the tokenizer that matches the pre-trained model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

In [None]:
model_name = "roberta-base" if base_model else "roberta-large"
model, tokenizer = load_model(model_name)

Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Data Pre-processing - [1 Mark]

The **ComVE** dataset consists of 9997 nonsensical statements with their corresponding 3 possible reasons for the train set, 997 statements for development and 1000 for test. Each nonsensical statement is annotated with an `A`, `B` or `C` label depending on which reason is the correct one. The dataset can be loaded into three `DataFrame` objects as follows:

In [None]:
def load_data(data_csv, answers_csv, labels):
    data = pd.read_csv(data_csv).dropna()
    answers = pd.read_csv(answers_csv, header=None).rename(columns={0: "id", 1: "label"})
    answers["label"] = answers["label"].apply(lambda x: labels.index(x))
    return pd.merge(data, answers, on="id")

In [None]:
labels = ["A", "B", "C"]
train_data_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskB_data_all.csv"
train_answers_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskB_answers_all.csv"
train_data = load_data(train_data_csv, train_answers_csv, labels)
dev_data_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskB_dev_data.csv"
dev_answers_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskB_gold_answers.csv"
dev_data = load_data(dev_data_csv, dev_answers_csv, labels)
test_data_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskB_test_data.csv"
test_answers_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskB_gold_answers.csv"
test_data = load_data(test_data_csv, test_answers_csv, labels)
if shrink_dataset:
    train_data = train_data.sample(n=100, random_state=42)
    dev_data = dev_data.sample(n=100, random_state=42)
    test_data = test_data.sample(n=100, random_state=42)
train_data

| | id | FalseSent | OptionA | OptionB | OptionC | label |
|---|---|---|---|---|---|---|
| 4122 | 4122 | You are likely to find a computer in the bathroom | The computer needs to take a shower in the bat... | The computer may be broken in the bathroom | The computer won't walk into the bathroom | 1 |
| 4065 | 4065 | Something you find in a stone is a blue flower | Sometimes stones are heavier than flowers | Sometimes stones are lighter than flowers | Flowers cannot grow on stones | 2 |
| 1731 | 1731 | People use electricity to buy things | It is impossible to buy things with electricity | Electricity is essential to live | Many appliances in home works on electricity | 0 |
| 4740 | 4740 | There is a way to cure every kind of cancer now | Cancer can kill people in a very short time | There is not a way to cure every kind of cance... | There is currently no vaccine to prevent peopl... | 1 |
| 6391 | 6392 | You can break the cement column | Cement column is so hard | Cement column is gray | Cement column has many shapes | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 6835 | 6836 | Tall is a disease | Tall is a normal physiological phenomenon | Tall is determined by both environment and gene | Tall is an external expression | 0 |
| 5399 | 5399 | Tara spread some gasoline on her bread | the price of gasoline has been raised rapidly | people have jam with bread | people do not have bread with gasoline | 2 |
| 6182 | 6183 | he went to the school to deposit a sum of money | students take money to their school to pay the... | building a school costs a lot of money | the school has no place to save money | 2 |
| 7536 | 7538 | You can only use chopsticks to eat noodles | You can also use forks to eat noodles | Many Chinese people use chopsticks to eat noodles | There are different types of noodles in the world | 0 |


Notice that the `load_data` function translates the labels into their corresponding numerical index: `0`, `1` and `2`.
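
The translation relies on the position of each letter in the `labels` list. A minimal sketch of the mapping in both directions, using only the `labels` list defined above (the dictionary name `label_to_index` is introduced here for illustration):

```python
labels = ["A", "B", "C"]

# Letter -> index, as done when the answers are loaded
label_to_index = {label: labels.index(label) for label in labels}
print(label_to_index)  # {'A': 0, 'B': 1, 'C': 2}

# Index -> letter, useful for reading predictions back
predicted_index = 1
print(labels[predicted_index])  # B
```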

[Datasets](https://huggingface.co/docs/datasets/index) is a library for dataset management that provides a set of tools to manipulate data in an easy and efficient way. Since it is fully integrated with **Transformers**, it is very convenient to use both libraries together. **Datasets** allows accessing and sharing datasets through the [Hugging Face Hub](https://huggingface.co/datasets). The core component of this library is the [Dataset](https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/main_classes#datasets.Dataset) class that implements an [Apache Arrow table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html). Similar to a **pandas** `DataFrame`, a `Dataset` object stores a table where each row corresponds to an example of the dataset and each column contains a different type of data. There are different ways to load the data into a `Dataset`, for example, from a `DataFrame`:

In [None]:
train_dataset = Dataset.from_pandas(train_data)
dev_dataset = Dataset.from_pandas(dev_data)
test_dataset = Dataset.from_pandas(test_data)
train_dataset[0]

{'id': 4122,
 'FalseSent': 'You are likely to find a computer in the bathroom',
 'OptionA': 'The computer needs to take a shower in the bathroom',
 'OptionB': 'The computer may be broken in the bathroom',
 'OptionC': "The computer won't walk into the bathroom",
 'label': 1,
 '__index_level_0__': 4122}

One of the most powerful **Datasets** tools is the [map](https://huggingface.co/docs/datasets/v2.10.0/en/nlp_process#map) function, which allows pre-processing the dataset in batches. The function takes another callable as argument and applies it to every row in the `Dataset`. The goal of the next exercise is to implement a function that tokenizes the statement-reason pairs and that will be passed as a parameter to the `map` function.

You must complete the code for the `preprocess_data` function. This function takes a batch of examples from a `Dataset`, the tokenizer returned by `load_model` and the `max_length` hyperparameter. The function should make three copies of each statement in the `FalseSent` field and pair them with each of the possible reasons in `OptionA`, `OptionB` and `OptionC`. Then, the statement-reason pairs must be tokenized jointly. The tokenizer must pad and truncate the sequences to the `max_length` value. You can use the [Preprocessing](https://huggingface.co/docs/transformers/v4.27.2/en/preprocessing) and the [Tokenizer](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/tokenizer) documentation as reference.

The `tokenizer` should return a [BatchEncoding](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/tokenizer#transformers.BatchEncoding) object with two fields for each data example:
* *input_ids*: A list of token indices that will be used as the input of the model.
* *attention_mask*: A list of indices masking out which tokens the model should not attend to.
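
The relationship between the two fields can be sketched in plain Python. The `pad_with_mask` helper below is hypothetical and much simpler than a real tokenizer (for instance, its truncation does not preserve special tokens); the padding index `1` is taken from the RoBERTa output shown in this notebook.

```python
def pad_with_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad a list of token ids and build its attention mask."""
    token_ids = token_ids[:max_length]
    num_padding = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * num_padding
    attention_mask = [1] * len(token_ids) + [0] * num_padding
    return input_ids, attention_mask


# pad_id=1 matches the padding index visible in the notebook's RoBERTa output
input_ids, attention_mask = pad_with_mask([0, 133, 3034, 2], max_length=8, pad_id=1)
print(input_ids)       # [0, 133, 3034, 2, 1, 1, 1, 1]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Every position holding a real token gets a `1` in the mask and every padding position gets a `0`, so both lists always have the same length.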

After running the tokenizer, `preprocess_data` should unflatten the `input_ids` and `attention_mask` corresponding to the same statement, i.e., for each example, the value of `input_ids` should be a list of three lists of token indices and, similarly, the value of `attention_mask` should be a list of three lists of masking indices. The **Transformers** documentation provides a [guide for Multiple Choice](https://huggingface.co/docs/transformers/tasks/multiple_choice) problems that you can use as reference. `preprocess_data` should return the output of the unflattening step.
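
The flatten, tokenize and unflatten steps can be sketched without loading a real model. In the example below, `fake_tokenizer` is a hypothetical stand-in that only mimics the shape of a tokenizer's output, and the toy batch uses made-up strings; only the column names come from the dataset.

```python
def fake_tokenizer(first_sentences, second_sentences):
    """Hypothetical stand-in for a real tokenizer: one 'id' per whitespace token."""
    input_ids, attention_mask = [], []
    for first, second in zip(first_sentences, second_sentences):
        ids = [len(word) for word in (first + " " + second).split()]
        input_ids.append(ids)
        attention_mask.append([1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = {
    "FalseSent": ["s1 x", "s2 y"],
    "OptionA": ["a1", "a2"],
    "OptionB": ["b1", "b2"],
    "OptionC": ["c1", "c2"],
}
num_choices = 3

# Flatten: repeat each statement three times and line up the reasons
first = [s for s in batch["FalseSent"] for _ in range(num_choices)]
second = [
    batch[key][i]
    for i in range(len(batch["FalseSent"]))
    for key in ("OptionA", "OptionB", "OptionC")
]
flat = fake_tokenizer(first, second)

# Unflatten: regroup every three consecutive sequences under one example
unflattened = {
    name: [values[i:i + num_choices] for i in range(0, len(values), num_choices)]
    for name, values in flat.items()
}
print(len(flat["input_ids"]), len(unflattened["input_ids"]))  # 6 2
```

Tokenizing the flattened lists in a single call is usually faster than tokenizing pair by pair, which is one reason the batched pattern is commonly used.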

The `map` function takes the `input_ids` and `attention_mask` fields and inserts them into the `Dataset` as two new columns. For example, the result for the first row in the `Dataset` should look like:

> <pre>
{'id': 4122, 'FalseSent': 'You are likely to find a computer in the bathroom', 'OptionA': 'The computer needs to take a shower in the bathroom', 'OptionB': 'The computer may be broken in the bathroom', 'OptionC': "The computer won't walk into the bathroom", 'label': 1, '__index_level_0__': 4122, 'input_ids': [[0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 782, 7, 185, 10, 9310, 11, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 189, 28, 3187, 11, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 351, 75, 1656, 88, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
</pre>

The `input_ids` field contains three lists, one for each statement-reason pair. Each value in each list of `input_ids` represents a sub-word of the `tokenizer` vocabulary. For the example above, `input_ids` corresponds to the following three sequences of sub-words:

> <pre>
> ['&lt;s&gt;', 'You', 'Ġare', 'Ġlikely', 'Ġto', 'Ġfind', 'Ġa', 'Ġcomputer', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;/s&gt;', 'The', 'Ġcomputer', 'Ġneeds', 'Ġto', 'Ġtake', 'Ġa', 'Ġshower', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
>
> ['&lt;s&gt;', 'You', 'Ġare', 'Ġlikely', 'Ġto', 'Ġfind', 'Ġa', 'Ġcomputer', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;/s&gt;', 'The', 'Ġcomputer', 'Ġmay', 'Ġbe', 'Ġbroken', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
>
> ['&lt;s&gt;', 'You', 'Ġare', 'Ġlikely', 'Ġto', 'Ġfind', 'Ġa', 'Ġcomputer', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;/s&gt;', 'The', 'Ġcomputer', 'Ġwon', "'t", 'Ġwalk', 'Ġinto', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
</pre>


Notice that the **Hugging Face** implementation of **RoBERTa**'s tokenizer uses the `<s>` token equivalently to **BERT**'s `[CLS]` token and the `</s>` token to mark both the end and the separation of the sentences. The `Ġ` character indicates when there is a blank space before the token in the original text. This helps to know which tokens are the first sub-words of the words.
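
As a small illustration of the `Ġ` convention, the sketch below rebuilds the statement from its sub-word tokens by hand. It is a simplified, assumption-laden version of what the tokenizer's own decoding does and is meant only to show how the marker encodes spaces.

```python
tokens = ['<s>', 'You', 'Ġare', 'Ġlikely', 'Ġto', 'Ġfind', 'Ġa', 'Ġcomputer',
          'Ġin', 'Ġthe', 'Ġbathroom', '</s>']
special_tokens = {'<s>', '</s>', '<pad>'}

# Drop special tokens, then turn the leading-space marker back into a space
text = "".join(t for t in tokens if t not in special_tokens).replace("Ġ", " ")
print(text)  # You are likely to find a computer in the bathroom

# Tokens that start a new word: the first token or any token marked with Ġ
word_starts = [t for i, t in enumerate(tokens[1:-1]) if i == 0 or t.startswith("Ġ")]
print(len(word_starts))  # 10
```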

In [None]:
def preprocess_data(batch, tokenizer, max_length):   # [1 Mark]
    input_ids = []
    attention_mask = []

    # Process each example in the batch
    for i in range(len(batch['FalseSent'])):
        statement = batch['FalseSent'][i]
        option_a = batch['OptionA'][i]
        option_b = batch['OptionB'][i]
        option_c = batch['OptionC'][i]

        # Tokenize each statement-reason pair with padding and truncation
        encoding_a = tokenizer(statement, option_a, padding='max_length', truncation=True, max_length=max_length)
        encoding_b = tokenizer(statement, option_b, padding='max_length', truncation=True, max_length=max_length)
        encoding_c = tokenizer(statement, option_c, padding='max_length', truncation=True, max_length=max_length)

        # Group the three encodings that belong to the same statement
        input_ids.append([encoding_a['input_ids'],
                          encoding_b['input_ids'],
                          encoding_c['input_ids']])

        attention_mask.append([encoding_a['attention_mask'],
                               encoding_b['attention_mask'],
                               encoding_c['attention_mask']])

    # Return one list of three sequences per example for each field
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
    }

In [None]:
train_dataset = train_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
dev_dataset = dev_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
test_dataset = test_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
print(train_dataset[0])
print("")
for seq in train_dataset[0]["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(seq))
    print("")

{'id': 4122, 'FalseSent': 'You are likely to find a computer in the bathroom', 'OptionA': 'The computer needs to take a shower in the bathroom', 'OptionB': 'The computer may be broken in the bathroom', 'OptionC': "The computer won't walk into the bathroom", 'label': 1, '__index_level_0__': 4122, 'input_ids': [[0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 782, 7, 185, 10, 9310, 11, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 189, 28, 3187, 11, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 351, 75, 1656, 88, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1

## Fine-tuning - [4 Marks]

Although it is possible to write customized training loops for the **Transformers** models using **Keras** or **PyTorch**, **Transformers** provides a [Trainer](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer) API that allows fine-tuning efficiently with a few simple steps. The training is highly customizable through a wide range of options and hyperparameters that are handled by the [TrainingArguments](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.TrainingArguments) class. Your next goal is to create both the `TrainingArguments` and `Trainer` objects that will be used to fine-tune **RoBERTa**. See the [documentation](https://huggingface.co/docs/transformers/training) for an introduction on how to perform these steps.

You must complete the code for the `create_training_arguments` function. This function takes as arguments the `epochs`, `train_batch_size` and `learning_rate` hyperparameters along with the `output_dir`. The function should use these arguments to create and return a `TrainingArguments` object. During the training, the model must be evaluated on the development set after every epoch. `TrainingArguments` should include this strategy.

> **Important!** By default, `Trainer` saves a checkpoint of the model every 500 training steps. For this assignment, avoid this behavior by setting `save_strategy="no"` when creating the `TrainingArguments`.

In [None]:
def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):   # [1 Mark]
    training_args = TrainingArguments(
        output_dir=output_dir,
        # Evaluate on the development set after every epoch
        # (recent Transformers releases rename this argument to `eval_strategy`)
        evaluation_strategy="epoch",
        save_strategy="no",  # Do not save intermediate checkpoints
        learning_rate=learning_rate,
        per_device_train_batch_size=train_batch_size,
        num_train_epochs=epochs
    )
    return training_args

In [None]:
train_args = create_training_arguments(epochs, train_batch_size, learning_rate, output_dir)



Next, you will create a `Trainer` object with the training arguments. When the input format of a task has some special characteristic, the `Trainer` must be created with a data collator that can handle the batches of examples accordingly during the training. This is the case for Multiple Choice problems, since the input of each example is a list of sequences. **Transformers** provides a set of [DataCollator](https://huggingface.co/docs/transformers/main_classes/data_collator) objects for different tasks, but not for Multiple Choice. However, a `DataCollatorForMultipleChoice` is provided along with this notebook.

You must complete the code for the `create_trainer` function. The function takes as input the model returned by `load_model`, the `TrainingArguments` created by `create_training_arguments` and the train and development `Datasets`. The function also takes the `tokenizer` returned by `load_model` that is required to initialize `DataCollatorForMultipleChoice`. The `create_trainer` function must create and return a `Trainer` object with the model, the training arguments and a `DataCollatorForMultipleChoice` object. The `Trainer` must be set up so that the train `Dataset` is used for training and the development `Dataset` is used to evaluate the model during the training.

In [None]:
import urllib.request

# URL of the raw Python file that defines DataCollatorForMultipleChoice
url = "https://raw.githubusercontent.com/thanhxuan1995/NLP/refs/heads/master/week4/assignment/download.py"

# Download the file next to the notebook so that it can be imported
urllib.request.urlretrieve(url, "download.py")

('download.py', <http.client.HTTPMessage at 0x796968729b90>)

In [None]:
from download import DataCollatorForMultipleChoice

In [None]:
def create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer):   # [1 Mark]
    # The collator batches the three sequences of each example together
    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
    trainer = Trainer(
        model=model,
        args=train_args,
        train_dataset=train_dataset,  # used for training
        eval_dataset=dev_dataset,  # used for evaluation during training
        tokenizer=tokenizer,
        data_collator=data_collator
    )
    return trainer

In [None]:
trainer = create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer)


The `trainer` object created by `create_trainer` is ready to fine-tune the model by just running:

In [None]:
trainer.train()



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 1.097851 |
| 2 | No log | 1.097274 |
| 3 | No log | 1.096982 |


TrainOutput(global_step=39, training_loss=1.0965563456217449, metrics={'train_runtime': 67.543, 'train_samples_per_second': 4.442, 'train_steps_per_second': 0.577, 'total_flos': 23124787470000.0, 'train_loss': 1.0965563456217449, 'epoch': 3.0})
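
One way to read the validation losses above is to compare them with the loss of a model that guesses uniformly among the three reasons. The sketch below computes that reference value; it assumes the reported loss is the usual cross-entropy over the three choices.

```python
import math

num_choices = 3

# Cross-entropy of a model that assigns probability 1/3 to every reason
uniform_loss = -math.log(1 / num_choices)
print(round(uniform_loss, 4))  # 1.0986
```

The reported validation losses are only marginally below this value, which suggests that the model trained on the reduced dataset is still close to chance level.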

After training, the model can be used to make predictions on unlabeled data using the [predict](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.Trainer.predict) method of the `Trainer` class.

You must complete the code for the `make_predictions` function. The function takes as input the `Trainer` object and the test `Dataset`. The function must run the `predict` method on the input data. The `predict` method will return a `NamedTuple` including a **numpy** array with the predictions. For each statement in the input, the array contains a vector with the logits (the values used as input of the softmax) predicted for every label corresponding to a possible reason. The output of `make_predictions` must include only the index of the label with the highest logit value. For example, if the prediction for one statement is `[-0.856213458, 1.39899943, -0.703246286]`, the output for that example should be `1`. For this, you can apply the [argmax](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) method along the last axis of the **numpy** array.
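
The `argmax` step can be tried on a toy array before running it on real predictions. In the sketch below, the first row holds the logits from the example in the text and the second row is made up for illustration.

```python
import numpy as np

# Toy logits for two statements; the first row is the example from the text
logits = np.array([
    [-0.856213458, 1.39899943, -0.703246286],
    [2.1, -0.3, 0.4],
])

# Index of the highest logit for every row (last axis)
predicted_labels = logits.argmax(axis=-1)
print(predicted_labels)  # [1 0]
```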

In [None]:
def make_predictions(trainer, test_dataset):   # [2 Marks]
    # Run the model on the test set; `predictions` holds one logit per reason
    output = trainer.predict(test_dataset)
    # Keep only the index of the reason with the highest logit
    predictions = output.predictions.argmax(axis=-1)
    return predictions

In [None]:
predictions = make_predictions(trainer, test_dataset)
test_data["prediction"] = predictions
test_data

| | id | FalseSent | OptionA | OptionB | OptionC | label | prediction |
|---|---|---|---|---|---|---|---|
| 521 | 324 | She put the filing cabinet into the papers. | Papers are fragile than the filing cabinet. | Nothing can be put into the paper. | Filing cabinets are usually gray while papers ... | 1 | 0 |
| 737 | 1455 | The lion used the litter box | A lion is normally found in the wild | A lion cannot eat a cat | A domestic cat is tame and use litter boxes | 2 | 1 |
| 740 | 13 | Cigarette is good for healthy | Cigarette contains lots of nicotines | Lung will be damaged by smoking cigarette | Cigarettes always have a high tax | 1 | 2 |
| 660 | 207 | Pens are for painting | Pens are too small | Pens are a writing utensil | pens would make a mess | 1 | 0 |
| 411 | 774 | he put a piece of plastic on the bread | I don't like sliced bread in plastic bags | the plastic usually is toxic | some plastic is biodegradable | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 436 | 225 | Carol turned on the potato | Carol eats the potato | A potato can't be turned on | potatoes can be used to make electricity | 1 | 1 |
| 764 | 1663 | i use my dog to play cricket | Dog is not interested in the cricket ball | the dog is so brisk | No one can play the cricket ball by their's dog | 2 | 0 |
| 88 | 82 | Dolphins are fish. | Dolphins are warm-blooded and breathe air, whi... | Dolphins have fins. | Dolphins live in the ocean. | 0 | 2 |
| 63 | 800 | the family adopted a dinosaur to be their new pet | the dinosaurs died out long ago | many different animals can make good pets | some dinosaurs are carnivorous animals | 0 | 1 |


**SubTask B** of **ComVE** is evaluated using accuracy. The [evaluate](https://huggingface.co/docs/evaluate/index) library provides support to apply this and other metrics. The `evaluate_prediction` function takes the test `DataFrame` and calculates the accuracy by comparing the `prediction` and `label` columns. With `shrink_dataset` and `base_model` set to `True`, the model is not able to learn the task, so the expected score is only *0.51*. With a full training run, i.e. with `shrink_dataset` and `base_model` set to `False`, the score should be around *0.928*.

In [None]:
def evaluate_prediction(test_data):
    accuracy = evaluate.load("accuracy")
    return accuracy.compute(predictions=test_data["prediction"].values, references=test_data["label"].values)
evaluate_prediction(test_data)

{'accuracy': 0.43}
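
Accuracy is the fraction of predictions that match the reference labels, so the metric can also be checked by hand. The `accuracy` helper and the toy values below are written for this sketch only; they are not the `evaluate` implementation and are not taken from the notebook's results.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy values, not taken from the notebook's results
toy_predictions = [0, 1, 2, 0, 1]
toy_references = [1, 1, 2, 0, 0]
print(accuracy(toy_predictions, toy_references))  # 0.6
```

With three candidate reasons per statement, random guessing would be expected to score about one third, which is a useful reference point when reading the accuracy above.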