# Pre-trained Language Models: SubTask C
## [8 Marks]

In this assignment, you will work on the [ComVE](https://competitions.codalab.org/competitions/21080) shared task that was part of SemEval-2020. The task aims to evaluate whether a system can distinguish if a natural language statement makes sense to humans or not and provide a reason. **ConVE** includes three subtasks that require models to acquire and apply commonsense knowledge. In this notebook you will focus on **SubTask C**:

- Given a statement that does not make sense, generate the reason why this statement does not make sense. For each nonsensical statement, three valid reasons are given as reference:

     *Statement*: He put an elephant into the fridge.  
     *Reason A*: An elephant is much bigger than a fridge.  
     *Reason B*: A fridge is much smaller than an elephant.  
     *Reason C*: Most of the fridges aren't large enough to contain an elephant.

     This subtask can be approached as a Sequence-to-Sequence problem where the input is the nonsensical statement and the output is a valid reason.

You will fine-tune a Pre-trained Language Model with [Transformers](https://huggingface.co/docs/transformers/index) library that provides a set of tools for fine-tunning and deploying a wide variety of Pre-trained Language Models. The [Hugging Face Hub](https://huggingface.co/models) allows you to explore all the models supported by **Transformers** and even share your own models with the community. In this assignment, you will work with [BART](https://huggingface.co/docs/transformers/model_doc/bart), a pre-trained Sequence-to-Sequence model.

Fine-tuning a Pre-trained Language Model usually requires a great amount of time and computational resources. Your personal computer will not be enough. In order to complete the assignment, you can work with a reduced version of the dataset and the base version of **BART**:

In [1]:
shrink_dataset = True
base_model = True
colab = True

Although the value of these variables do not affect the tests that will evaluate your code, the output examples distributed throughout this notebook are based on a `shrink_dataset` and a `base_model` variables set as `True`, and a `colab` variable set as `False`.

If you want to perform a full training of the model to obtain its real performance, you can use a cloud service like [Google Colab](https://colab.research.google.com/). **Colab** is a **Jupyter** notebook environment that supports both GPU and TPU instances, allowing training large scale Deep Learning models. Set the `shrink_dataset` and a `base_model` variables to `False`, the `colab` variable to `True`, and follow the instructions provided to you to run the notebook in **Colab**.

In [2]:
if colab:
    ! pip install transformers datasets evaluate
    import os
    if not os.path.exists("SemEval2020-Task4-Data/ALL data/Training  Data/subtaskC_data_all.csv"):
        ! git clone https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.git SemEval2020-Task4-Data



You will use the following objects and functions:

In [3]:
import pandas as pd
import evaluate
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer,
                          DataCollatorForSeq2Seq, enable_full_determinism)

When working with Neural Networks, there are a large number of random operations such as initializing the weights of the network, shuffling the data for training, or choosing samples. This causes that different training runs of the same model can lead to different results. To ensure reproducibility, i.e. obtaining the same results in the different runs, the random number generator must be initialized with a fixed value known as seed. In Transformers, this can be done as follows:

In [4]:
enable_full_determinism(seed=42)

> **Note!** With models as complex as Neural Networks, reproducibility is susceptible to factors such as software versions or the hardware on which the models are run. Even with seed initialization, there may be slight differences in the results.

Working with Neural Networks also involves defining a number of hyperparameters that set the configuration of the model. Finding the appropriate hyperparameter values requires training the model with different combinations and testing them on the development set. This hyperparameter tuning is a costly process that needs multiple rounds of experimentation. However, for this assignments, you will use the following values:

In [5]:
epochs = 3  # Number of epochs to train the model
train_batch_size = 8  # Number of examples used per gradient update
learning_rate = 1e-5  # The learning rate for the optimizer
max_length = 25  # Maximum lenght of the input sequence
output_dir = "modelC"  # The output directory where the model will be written

> **Note!** The notebook for this assignment provides very little guidance. You are expected to refer to the [documentation](https://huggingface.co/docs) for details on how to solve the exercises.

## Loading the Pre-trained Model - [1 Mark]

The first step you must perform in this assignment is to load the model and its corresponding tokenizer using the classes imported above.

In [6]:
def load_model(model_name):   #[1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer
    #
    pass

In [7]:
model_name = "facebook/bart-base" if base_model else "facebook/bart-large"
model, tokenizer = load_model(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## Data Pre-processing - [2 Marks]

The **ComVE** dataset consists of 10000 nonsensical statements for the train set, 997 statements for development and 1000 for test. Each nonsensical statements comes with with three reference valid reasons. You must load the three sets into three `DataFrames`. For the training and development splits, the `DataFrame` should contain three columns: the `id` of the nonsensical statement, a `FalseSent` column with the nonsensical statement and a `reason` column with the reference reasons. For the test set, the `DataFrame` should contain five columns: the `id` of the nonsensical statement, a `FalseSent` column with the nonsensical statement and three columns (`reason1`, `reason2` and `reason3`) containing each of the reference reasons.

Train DataFrame:

|       |   id | FalseSent                                         | reason                                                                         |
|------:|-----:|:--------------------------------------------------|:-------------------------------------------------------------------------------|
|   769 |  769 | Computers is an ingredient used in preparing food | Computers are not used for food and they are not edible                        |
| 10769 |  769 | Computers is an ingredient used in preparing food | Computer is not something that can be used in preparing food.                  |
| 20769 |  769 | Computers is an ingredient used in preparing food | You cannot eat a computer                                                      |
|   888 |  888 | he did hear music in his cooling glass            | cooling glass can not play the song, it's not a electronic thing to play music |
| 10888 |  888 | he did hear music in his cooling glass            | Glass does not produce music.                                                  |
| 20888 |  888 | he did hear music in his cooling glass            | Any sound that might be made by a cooling glass is not music.                  |

Test DataFrame:

|     |   id | FalseSent                                      | reason1                                                  | reason2                                                | reason3                                                            |
|----:|-----:|:-----------------------------------------------|:---------------------------------------------------------|:-------------------------------------------------------|:-------------------------------------------------------------------|
|  76 | 1280 | Beer that is drunk by humans is white          | Beer is made of barley and it is a yellow drink          | A beer that is drunk by humans is not white.           | Beer is brown                                                      |
| 101 |  860 | eating trash food every day makes you stronger | eating trash food every day makes your body fat and weak | eating trash food every day is bad for your health     | Trash food could be contaminated                                   |
| 136 |  777 | he put some cooking oil in his wine            | cooking oil will destroy the taste of the wine           | Cooking oil does not go in wine                        | Cooking oil does not taste nice and therefore would ruin the wine. |
| 174 |  570 | Lobsters live in the mountains                 | Lobsters needs water to live                             | Lobsters live in the sea.                              | Lobsters live in the sea, not the mountains                        |
| 210 | 1929 | the clock shows animals                        | the clock is used to show the time to people             | Clocks are required to tell the time, not show animals | a clock shows the time not animals                                 |
| 235 | 1619 | she put the giraffe in the freezer             | A giraffe is much bigger than the freezer                | There is no way a giraffe is fitting in the freezer.   | A giraffe is too big to be put in a freezer.                       |

In [8]:
def load_data(data_csv, answers_csv, is_test=False):   # 1 Mark
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    data = pd.read_csv(data_csv)
    answer = pd.read_csv(answers_csv, sep=',', names=['id', 'reason1', 'reason2', 'reason3'])
    #print(data.columns)
    #print(answer.columns)
    df = pd.merge(data, answer, on = 'id')
    if  is_test:
      df = df
    else:
      df = pd.melt(df, id_vars=['id', 'FalseSent'], value_vars=['reason1', 'reason2', 'reason3'],value_name='reason' )
      df.drop('variable', axis=1, inplace=True)
    return df.reset_index(drop= True)
    #
    pass
#train_data_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskC_data_all.csv"
#train_answers_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskC_answers_all.csv"


In [9]:
def load_data(data_csv, answers_csv, is_test=False):
    # Load the nonsensical statements (data) and the corresponding reasons (answers)
    data = pd.read_csv(data_csv)
    answers = pd.read_csv(answers_csv, sep=',', names=['id', 'reason1', 'reason2', 'reason3'])

    # Merge the data and answers based on the 'id' column

    # Initialize an empty DataFrame to hold the final data
    if is_test:
        # For test set, there will be 5 columns: id, FalseSent, reason1, reason2, reason3
        df = pd.DataFrame(columns=['id', 'FalseSent', 'reason1', 'reason2', 'reason3'])
        # Merge data and answers for the test set
        df['id'] = data['id']
        df['FalseSent'] = data['FalseSent']
        df['reason1'] = answers['reason1']
        df['reason2'] = answers['reason2']
        df['reason3'] = answers['reason3']
    else:
        # For train and development sets, there will be 3 columns: id, FalseSent, reason
        df = pd.DataFrame(columns=['id', 'FalseSent', 'reason'])
        # Flatten the train/development data to match the format (id, FalseSent, reason)
        df['id'] = data['id'].repeat(3)  # Repeat the id for each of the three reasons
        df['FalseSent'] = data['FalseSent'].repeat(3)  # Repeat the nonsensical statement
        df['reason'] = answers[['reason1', 'reason2', 'reason3']].values.flatten()  # Flatten the reasons column

    return df

In [10]:
train_data_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskC_data_all.csv"
train_answers_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskC_answers_all.csv"
train_data = load_data(train_data_csv, train_answers_csv)
dev_data_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskC_dev_data.csv"
dev_answers_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskC_gold_answers.csv"
dev_data = load_data(dev_data_csv, dev_answers_csv)
test_data_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskC_test_data.csv"
test_answers_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskC_gold_answers.csv"
test_data = load_data(test_data_csv, test_answers_csv, True)
if shrink_dataset:
    idxs = train_data["id"].sample(frac=1, random_state=42).unique()[:30]
    train_data = train_data[train_data.id.isin(idxs)]
    idxs = dev_data["id"].sample(frac=1, random_state=42).unique()[:30]
    dev_data = dev_data[dev_data.id.isin(idxs)]
    idxs = test_data["id"].sample(frac=1, random_state=42).unique()[:30]
    test_data = test_data[test_data.id.isin(idxs)]
pd.set_option("display.max_colwidth", None)
print("Train DataFrame:")
display(train_data[:6])
print("Test DataFrame:")
display(test_data[:6])

Train DataFrame:


Unnamed: 0,id,FalseSent,reason
769,769,Computers is an ingredient used in preparing food,Computers are not used for food and they are not edible
769,769,Computers is an ingredient used in preparing food,Computer is not something that can be used in preparing food.
769,769,Computers is an ingredient used in preparing food,You cannot eat a computer
888,888,he did hear music in his cooling glass,"cooling glass can not play the song, it's not a electronic thing to play music"
888,888,he did hear music in his cooling glass,Glass does not produce music.
888,888,he did hear music in his cooling glass,Any sound that might be made by a cooling glass is not music.


Test DataFrame:


Unnamed: 0,id,FalseSent,reason1,reason2,reason3
76,1280,Beer that is drunk by humans is white,Beer is made of barley and it is a yellow drink,A beer that is drunk by humans is not white.,Beer is brown
101,860,eating trash food every day makes you stronger,eating trash food every day makes your body fat and weak,eating trash food every day is bad for your health,Trash food could be contaminated
136,777,he put some cooking oil in his wine,cooking oil will destroy the taste of the wine,Cooking oil does not go in wine,Cooking oil does not taste nice and therefore would ruin the wine.
174,570,Lobsters live in the mountains,Lobsters needs water to live,Lobsters live in the sea.,"Lobsters live in the sea, not the mountains"
210,1929,the clock shows animals,the clock is used to show the time to people,"Clocks are required to tell the time, not show animals",a clock shows the time not animals
235,1619,she put the giraffe in the freezer,A giraffe is much bigger than the freezer,There is no way a giraffe is fitting in the freezer.,A giraffe is too big to be put in a freezer.


In [11]:
train_dataset = Dataset.from_pandas(train_data)
dev_dataset = Dataset.from_pandas(dev_data)
test_dataset = Dataset.from_pandas(test_data)
print("Train Dataset example:")
display(train_dataset[0])
print("Test Dataset example:")
display(test_dataset[0])

Train Dataset example:


{'id': 769,
 'FalseSent': 'Computers is an ingredient used in preparing food',
 'reason': 'Computers are not used for food and they are not edible',
 '__index_level_0__': 769}

Test Dataset example:


{'id': 1280,
 'FalseSent': 'Beer that is drunk by humans is white',
 'reason1': 'Beer is made of barley and it is a yellow drink',
 'reason2': 'A beer that is drunk by humans is not white.',
 'reason3': 'Beer is brown',
 '__index_level_0__': 76}

The `Datasets` should be pre-processed following two different approaches. For the test `Dataset`, you must run the tokenizer on the `FalseSent` column and store the result in the `input_ids` and `attention_mask` fields. For the train and development `Datasets` you must also run the tokenizer on the `reason` column and store the resulting `input_ids` in the `labels` field. In all cases, the tokenizer must pad and truncate the sequences to the `max_length` value.

><pre>
>Train formated Dataset example:
>
>{'id': 769, 'FalseSent': 'Computers is an ingredient used in preparing food', 'reason': 'Computers are not used for food and they are not edible', '__index_level_0__': 769, 'input_ids': [0, 14721, 43990, 16, 41, 16181, 341, 11, 4568, 689, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [0, 14721, 43990, 32, 45, 341, 13, 689, 8, 51, 32, 45, 27532, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
>
>Test formated Dataset example:
>
>{'id': 1280, 'FalseSent': 'Beer that is drunk by humans is white', 'reason1': 'Beer is made of barley and it is a yellow drink', 'reason2': 'A beer that is drunk by humans is not white.', 'reason3': 'Beer is brown', '__index_level_0__': 76, 'input_ids': [0, 45562, 14, 16, 10789, 30, 5868, 16, 1104, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
></pre>

In [12]:
def preprocess_data(examples, tokenizer, max_length, is_test=False):   # [1 Mark]
    #
    input_ids= []
    attention_mask = []
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    if is_test:
      for i, data in enumerate(examples["FalseSent"]):
        result = tokenizer(data, padding = 'max_length', max_length = max_length, truncation=True)
        input_ids.append(result["input_ids"])
        attention_mask.append(result["attention_mask"])
      return {
          "input_ids" : input_ids,
          "attention_mask" : attention_mask
      }

    else:
      for i, data in enumerate(examples["reason"]):
        result = tokenizer(data, padding= 'max_length', max_length = max_length, truncation=True)
        input_ids.append(result["input_ids"])
        attention_mask.append(result["attention_mask"])
      return {
         "input_ids" : input_ids,
          "attention_mask" : attention_mask,
         "labels" : input_ids
    }
    #
    pass

In [81]:
def preprocess_data(examples, tokenizer, max_length, is_test=False):
    # Tokenizing the FalseSent column for both train and test sets
    inputs = tokenizer(
        examples['FalseSent'],
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

    # Initialize the output dictionary
    result = {
        'id': examples['id'],  # Keep the id from the original data
        'FalseSent': examples['FalseSent'],  # Keep the FalseSent statement
        '__index_level_0__': examples['id'],  # To preserve index
        'input_ids': inputs['input_ids'],  # Tokenized FalseSent
        'attention_mask': inputs['attention_mask'],  # Attention mask for FalseSent
    }

    if not is_test:
        # For train and development data, also tokenize the reason column and store it in the labels field
        # Tokenizing the reason column
        labels = tokenizer(
            examples['reason'],
            padding='max_length',
            truncation=True,
            max_length=max_length,
            return_tensors='pt'
        )

        # Adding the labels (reason) to the result dictionary
        result['labels'] = labels['input_ids']

    else:
        # For test set, store reasons as reason1, reason2, reason3
        reasons = ['reason1', 'reason2', 'reason3']
        for idx, reason in enumerate(reasons):
            reason_tokenized = tokenizer(
                examples[reason],
                padding='max_length',
                truncation=True,
                max_length=max_length,
                return_tensors='pt'
            )
            result[f'input_ids_{idx+1}'] = reason_tokenized['input_ids']
            result[f'attention_mask_{idx+1}'] = reason_tokenized['attention_mask']

    return result

In [13]:
train_dataset = train_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
dev_dataset = dev_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
test_dataset = test_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length, True), batched=True)
print("Train formated Dataset example:\n")
print(train_dataset[0])
print("\nTest formated Dataset example:\n")
print(test_dataset[0])

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

Train formated Dataset example:

{'id': 769, 'FalseSent': 'Computers is an ingredient used in preparing food', 'reason': 'Computers are not used for food and they are not edible', '__index_level_0__': 769, 'input_ids': [0, 14721, 43990, 32, 45, 341, 13, 689, 8, 51, 32, 45, 27532, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'labels': [0, 14721, 43990, 32, 45, 341, 13, 689, 8, 51, 32, 45, 27532, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Test formated Dataset example:

{'id': 1280, 'FalseSent': 'Beer that is drunk by humans is white', 'reason1': 'Beer is made of barley and it is a yellow drink', 'reason2': 'A beer that is drunk by humans is not white.', 'reason3': 'Beer is brown', '__index_level_0__': 76, 'input_ids': [0, 45562, 14, 16, 10789, 30, 5868, 16, 1104, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


## Fine-tuning - [5 Marks]

In general, when using a `Trainer` to make predictions, it returns the logits for each class in the task. However, the `Seq2SeqTrainingArguments` class provides an option that allows the `Trainer` to generate sequences of tokens in the prediction. The `create_training_arguments` function must create the `Seq2SeqTrainingArguments` with that option and the hyperparamters passed as arguments. During the training, the model must be evaluated on the development set after every epoch. `TrainingArguments` should include this strategy.

> **Important!** By default, `Trainer` saves a checkpoint of the model every 500 training steps. For this assignment, avoid this behavior by setting `save_strategy="no"` when creating the `TrainingArguments`.

In [29]:
def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):   # [1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    training_args = Seq2SeqTrainingArguments(
        output_dir= output_dir,
        learning_rate= learning_rate,
        per_device_train_batch_size= train_batch_size,
        per_device_eval_batch_size= train_batch_size,
        num_train_epochs= epochs,
        save_strategy= "no",
        evaluation_strategy="epoch",
        #predict_with_generate=True
        #logging_steps=500,
        #prediction_loss_only=False
    )
    return training_args
    #
    pass

In [86]:


def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):
    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,  # Directory to save the model and checkpoints
        evaluation_strategy="epoch",  # Evaluate the model at the end of every epoch
        save_strategy="no",  # Prevent saving checkpoints every 500 steps
        num_train_epochs=epochs,  # Number of training epochs
        per_device_train_batch_size=train_batch_size,  # Batch size for training
        per_device_eval_batch_size=train_batch_size,  # Batch size for evaluation (same as training for consistency)
        learning_rate=learning_rate,  # Learning rate for the optimizer
        logging_dir="./logs",  # Directory to store logs (optional)
        logging_steps=50,  # Log every 50 steps (you can adjust this as needed)
        predict_with_generate=True,  # Enable sequence generation during predictions
        #load_best_model_at_end=True,  # Load the best model based on evaluation metric
        metric_for_best_model="loss",  # Use loss as the metric for selecting the best model
        greater_is_better=False,  # Lower loss is better
        warmup_steps=500,  # Number of warmup steps for the learning rate scheduler (optional)
        weight_decay=0.01,  # Weight decay to avoid overfitting (optional)
        logging_first_step=True,  # Log the first step as well
    )

    return training_args


In [30]:
train_args = create_training_arguments(epochs, train_batch_size, learning_rate, output_dir)



Next, you can create a `Trainer` object initializing the appropriate data collator.

In [31]:
def create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer):   # [1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    data_collator= DataCollatorForSeq2Seq(tokenizer=tokenizer, model = model)
    trainer = Seq2SeqTrainer(
        model = model,
        args = train_args,
        train_dataset = train_dataset,
        eval_dataset = dev_dataset,
        tokenizer = tokenizer,
        data_collator= data_collator
    )
    return trainer
    #
    pass

In [17]:
def create_trainer(model, training_args, train_dataset, eval_dataset, tokenizer):
    # Load pre-trained model and tokenizer

    # Initialize the DataCollatorForSeq2Seq, which will pad the sequences dynamically during training
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    # Initialize the Trainer
    trainer = Seq2SeqTrainer(
        model=model,  # The model to train
        args=training_args,  # Training arguments created earlier
        train_dataset=train_dataset,  # The training dataset
        eval_dataset=eval_dataset,  # The evaluation (development) dataset
        tokenizer=tokenizer,  # The tokenizer to process data
        data_collator=data_collator,  # The data collator to handle padding
    )

    return trainer

In [32]:
trainer = create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer)

  trainer = Seq2SeqTrainer(


In [33]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,5.245742
2,No log,4.330564
3,No log,3.911906


TrainOutput(global_step=36, training_loss=5.189124637179905, metrics={'train_runtime': 216.5097, 'train_samples_per_second': 1.247, 'train_steps_per_second': 0.166, 'total_flos': 4019258880000.0, 'train_loss': 5.189124637179905, 'epoch': 3.0})

If you have set the `Seq2SeqTrainingArguments` properly, you could now use the `Trainer` to predict sequences of tokens. Take into account that `Trainer` will return the indexes of the tokens, so the sequence must be decoded to obtain the text strings. The `tokenizer` provides functionality to do this. The result of this process can be stored in the `prediction` column of the test `DataFrame`:

|     |   id | FalseSent                                      | reason1                                                  | reason2                                                | reason3                                                            | prediction                                     |
|----:|-----:|:-----------------------------------------------|:---------------------------------------------------------|:-------------------------------------------------------|:-------------------------------------------------------------------|:-----------------------------------------------|
|  76 | 1280 | Beer that is drunk by humans is white          | Beer is made of barley and it is a yellow drink          | A beer that is drunk by humans is not white.           | Beer is brown                                                      | Beer that is drunk by humans is white                             |
| 101 |  860 | eating trash food every day makes you stronger | eating trash food every day makes your body fat and weak | eating trash food every day is bad for your health     | Trash food could be contaminated                                   | eating trash food every day makes you stronger |
| 136 |  777 | he put some cooking oil in his wine            | cooking oil will destroy the taste of the wine           | Cooking oil does not go in wine                        | Cooking oil does not taste nice and therefore would ruin the wine. | he put some cooking oil in his wine            |
| 174 |  570 | Lobsters live in the mountains                 | Lobsters needs water to live                             | Lobsters live in the sea.                              | Lobsters live in the sea, not the mountains                        | Lobsters live in mountains                 |
| 210 | 1929 | the clock shows animals                        | the clock is used to show the time to people             | Clocks are required to tell the time, not show animals | a clock shows the time not animals                                 | the clock shows animals                        |
| 235 | 1619 | she put the giraffe in the freezer             | A giraffe is much bigger than the freezer                | There is no way a giraffe is fitting in the freezer.   | A giraffe is too big to be put in a freezer.                       | she put the giraffe in the freezer             |


In [40]:

def make_predictions(trainer, test_dataset, tokenizer):   # [1 Mark]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    predictions = trainer.predict(test_dataset)
    logits = predictions.predictions[0].argmax(axis= -1)  # This is a 3D array: (batch_size, sequence_length, vocab_size)

    predictions_de = tokenizer.batch_decode(logits, skip_special_tokens=True)
    #print(predictions_de
    ##print( predictions_de)
    return predictions_de
    #
    pass

In [57]:
def make_predictions(trainer, test_dataset, tokenizer):
    # Get predictions from the trainer
    predictions = trainer.predict(test_dataset)

    # `predictions.predictions` is a tuple, where the first item is the logits
    logits = predictions.predictions[0]  # The logits output from the model

    # Get the token IDs with the highest probability for each position in the sequence
    predicted_ids = logits.argmax(axis=-1)  # Find the argmax across the vocabulary dimension

    # Decode the predicted token IDs into strings
    decoded_predictions = tokenizer.batch_decode(predicted_ids, clean_up_tokenization_spaces=True)

    decoder_final = [item.split('</s>')[0].split('<s>')[1] for item in decoded_predictions]
    # Return the decoded predictions
    return decoder_final


In [58]:
predictions = make_predictions(trainer, test_dataset, tokenizer)
test_data["prediction"] = predictions
test_data

Unnamed: 0,id,FalseSent,reason1,reason2,reason3,prediction
76,1280,Beer that is drunk by humans is white,Beer is made of barley and it is a yellow drink,A beer that is drunk by humans is not white.,Beer is brown,Beer that is drunk by humans is white
101,860,eating trash food every day makes you stronger,eating trash food every day makes your body fat and weak,eating trash food every day is bad for your health,Trash food could be contaminated,eating trash food every day makes you stronger
136,777,he put some cooking oil in his wine,cooking oil will destroy the taste of the wine,Cooking oil does not go in wine,Cooking oil does not taste nice and therefore would ruin the wine.,he put some cooking oil in his wine
174,570,Lobsters live in the mountains,Lobsters needs water to live,Lobsters live in the sea.,"Lobsters live in the sea, not the mountains",Lobsters live in the mountains
210,1929,the clock shows animals,the clock is used to show the time to people,"Clocks are required to tell the time, not show animals",a clock shows the time not animals,the clock shows animals
235,1619,she put the giraffe in the freezer,A giraffe is much bigger than the freezer,There is no way a giraffe is fitting in the freezer.,A giraffe is too big to be put in a freezer.,she put the giraffe in the freezer
280,979,he installed the carpet on the lake,The carpet will absorb water and sink,Carpets need a subfloor.,A lake would not be able to grip onto carpet.,he installed the carpet on the lake
319,75,My son had us write an essay on The National Monument.,My son isn't smart enough to assign an essay.,My son is studying in the seconds standard only,Children don't ask parents to write essays.,My son had us write an essay on The National Monument.
371,1810,He drove up the stairs to the bedroom,A car is too large to fit upstairs,Stairs are too small and weak for a car to drive up,People don't drive up indoor stairs.,He drove up the stairs to the bedroom
411,774,he put a piece of plastic on the bread,the plastic usually is toxic,You can't eat plastic.,People do not eat plastic because it's not a food,he put a piece of plastic on the bread


The **Subtasks B** of **ComVE** is evaluated using the *bleu* metric. In this assignment, you will also evaluate using *rouge*. With `shrink_dataset` and `base_model` set to `True`, the expected scores are *0.216* and *0.446* for *bleu* and *rouge* respectively. With a full training run, i.e. with `shrink_dataset` and `base_model` set to `False`, the scores should be around *0.228* and *0.461*.

In [64]:
from datasets import load_metric
from rouge_score import rouge_scorer

# Load the BLEU metric from the datasets library
bleu_metric = load_metric("bleu")
def evaluate_prediction(test_data, metric):   # [2 Marks]
    #
    #  REPLACE THE pass STATEMENT WITH YOUR CODE
    predictions = test_data["prediction"]
    references = test_data["reason1"]
    if metric== 'bleu':
      bleu_score = bleu_metric.compute(predictions=predictions, references=[[ref] for ref in references])
      return {"bleu": bleu_score["bleu"]}

    elif metric == "rouge":
        # Compute ROUGE score using the rouge-score library
        scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
        rouge_scores = {"rouge1": [], "rouge2": [], "rougeL": []}

        # Iterate through each prediction and reference pair
        for pred, ref in zip(predictions, references):
            scores = scorer.score(ref, pred)
            rouge_scores["rouge1"].append(scores["rouge1"].fmeasure)
            rouge_scores["rouge2"].append(scores["rouge2"].fmeasure)
            rouge_scores["rougeL"].append(scores["rougeL"].fmeasure)

        # Compute average ROUGE scores
        rouge1_avg = sum(rouge_scores["rouge1"]) / len(rouge_scores["rouge1"])
        rouge2_avg = sum(rouge_scores["rouge2"]) / len(rouge_scores["rouge2"])
        rougeL_avg = sum(rouge_scores["rougeL"]) / len(rouge_scores["rougeL"])

        return {
            "rouge1": rouge1_avg,
            "rouge2": rouge2_avg,
            "rougeL": rougeL_avg
        }
    else:
      raise ValueError("Unsupported metric. Use either 'bleu' or 'rouge'.")
    #
    pass

ImportError: cannot import name 'load_metric' from 'datasets' (/usr/local/lib/python3.11/dist-packages/datasets/__init__.py)

In [63]:
 evaluate_prediction(test_data, "bleu")

NameError: name 'bleu_metric' is not defined

In [22]:
evaluate_prediction(test_data, "rouge")

The scores for the partial training and the full training are so similar that it would appear that the full training does not provide any benefit in this task. However, it should be noted that the test sets in the two cases are different. More importantly, these results are indicative of the limitations of metrics such as *bleu* and *rouge* for evaluating text generation. Take, for example, the following case from the test set:


| FalseSent                 | reason1                                        | reason2                          | reason3                         |
|:--------------------------|:-----------------------------------------------|:---------------------------------|:--------------------------------|
| Beer that is drunk by humans is white | Beer is made of barley and it is a yellow drink | A beer that is drunk by humans is not white. | Beer is brown |

The predictions obtained by the partial and full trainings and their corresponding scores are the following:

| full training    | prediction                 | bleu     | rouge    |
|:-----------------|:---------------------------|---------:|---------:|
| no               | Beer that is drunk by humans is white  | 0.731    | 0.889    |
| yes              | White beer is not suitable for human consumption. | 0.000    | 0.364    |

The text generated by the full training is a better explanation than the reason generated by the partial training, which is a mere repetition of the nonsensical statement. However, the latter obtains much better scores than the former. Metrics such as *bleu* and *rouge* do not always replace accurately the human judgement.