# Pre-trained Language Models: SubTask A. 
## [6 Marks]

In this assignment, you will work on the [ComVE](https://competitions.codalab.org/competitions/21080) shared task that was part of SemEval-2020. The task aims to evaluate whether a system can tell if a natural language statement makes sense to humans or not, and provide a reason. **ComVE** includes three subtasks that require models to acquire and apply commonsense knowledge. In this notebook you will focus on **SubTask A**:

- Given two similar statements that differ by only a few words, select the statement of the two that does not make sense. For example, within the statements below, *Statement 2* is the nonsensical statement:

     *Statement 1*: He put a turkey into the fridge.    
     *Statement 2*: He put an elephant into the fridge.
     
     This subtask can be approached as a Text Matching problem where the input is the two statements and the output is a label indicating which is the nonsensical one.

You will fine-tune a Pre-trained Language Model with the [Transformers](https://huggingface.co/docs/transformers/index) library, which provides a set of tools for fine-tuning and deploying a wide variety of Pre-trained Language Models. The [Hugging Face Hub](https://huggingface.co/models) allows you to explore all the models supported by **Transformers** and even share your own models with the community. In this assignment, you will work with [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta), a model that uses **BERT**'s architecture but has been pre-trained with more data and a more carefully selected set of hyperparameters.

Fine-tuning a Pre-trained Language Model usually requires a great amount of time and computational resources. Your personal computer will probably not be enough. In order to complete the assignment, you can work with a reduced version of the dataset and the base version of **RoBERTa**:

In [28]:
# from tqdm import tqdm
shrink_dataset = False
base_model = False
colab = False

Although the values of these variables do not affect the tests that will evaluate your code, the output examples distributed throughout this notebook assume that the `shrink_dataset` and `base_model` variables are set to `True` and the `colab` variable is set to `False`.

If you want to perform a full training run to obtain the model's real performance, you can use a cloud service like [Google Colab](https://colab.research.google.com/). **Colab** is a **Jupyter** notebook environment that supports both GPU and TPU instances, which makes it possible to train large-scale Deep Learning models. Set the `shrink_dataset` and `base_model` variables to `False` and the `colab` variable to `True`, and follow the instructions provided to you to run the notebook in **Colab**.

In [29]:
if colab:
    ! pip install transformers datasets evaluate
    import os
    if not os.path.exists("SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_data_all.csv"):
        ! git clone https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.git SemEval2020-Task4-Data

You will use the following objects and functions:

In [30]:
import pandas as pd
import evaluate
from datasets import Dataset

from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          TrainingArguments, Trainer, 
                          enable_full_determinism)

When working with Neural Networks, many operations are random, such as initializing the weights of the network, shuffling the training data, or choosing samples. As a result, different training runs of the same model can lead to different results. To ensure reproducibility, i.e. obtaining the same results across runs, the random number generator must be initialized with a fixed value known as a seed. In **Transformers**, this can be done as follows:

In [31]:
enable_full_determinism(seed=42)

> **Note!** With models as complex as Neural Networks, reproducibility is susceptible to factors such as software versions or the hardware on which the models are run. Even with seed initialization, there may be slight differences in the results.
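
The effect of a seed can be seen without training anything. The following is a toy sketch that uses Python's `random` module and **NumPy** rather than **Transformers**; `sample_run` is a hypothetical helper that stands in for the shuffling and weight initialization of a real training run.

```python
import random

import numpy as np


def sample_run(seed):
    """Seed both generators, then shuffle some indices and draw fake 'weights'."""
    random.seed(seed)
    rng = np.random.default_rng(seed)
    order = list(range(10))
    random.shuffle(order)             # stands in for shuffling the training data
    init_weights = rng.normal(size=3)  # stands in for weight initialization
    return order, init_weights


order_a, weights_a = sample_run(42)
order_b, weights_b = sample_run(42)
order_c, weights_c = sample_run(7)

print(order_a == order_b)                    # same seed: identical shuffle
print(np.array_equal(weights_a, weights_b))  # same seed: identical weights
print(np.array_equal(weights_a, weights_c))  # different seed: different weights
```

Two calls with the same seed produce identical draws, while a different seed changes them. `enable_full_determinism` applies the same idea to the generators used during training.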

Working with Neural Networks also involves defining a number of hyperparameters that set the configuration of the model. Finding the appropriate hyperparameter values requires training the model with different combinations and testing them on the development set. This hyperparameter tuning is a costly process that needs multiple rounds of experimentation. However, for this assignment, you will use the following values:

In [32]:
epochs = 3  # Number of epochs to train the model
train_batch_size = 8  # Number of examples used per gradient update
learning_rate = 1e-5  # The learning rate for the optimizer
max_length = 50  # Maximum length of the input sequence
output_dir = "modelA"  # The output directory where the model will be written to

## Loading the Pre-trained Model - [1 Mark]

The first step you must perform in this assignment is to load the model and its corresponding tokenizer. **Transformers** provides support for a wide variety of pre-trained models via specific classes. However, the library can also automatically retrieve a model given just its name or path using [AutoClasses](https://huggingface.co/docs/transformers/v4.27.2/en/model_doc/auto). To fine-tune a pre-trained model for a downstream task, it is necessary to replace the original top layer of the model with a new task-specific output layer. **AutoClasses** also allow you to do this automatically for various types of Natural Language Processing tasks. For instance, `AutoModelForSequenceClassification` instantiates the model with a top layer for Text Classification.

You must complete the code for the `load_model` function. This function takes the name of the pre-trained model and should load and return both the model, initialized for Text Classification, and its corresponding tokenizer. You can get some tips from the [Transformers documentation](https://huggingface.co/docs/transformers/autoclass_tutorial).

In [33]:
def load_model(model_name):   # [1 Mark]
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return model, tokenizer
   

In [34]:
model_name = "roberta-base" if base_model else "roberta-large"
model, tokenizer = load_model(model_name)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Data Pre-processing - [1 Mark]

The **ComVE** dataset consists of 10000 pairs of statements for the train set, 997 pairs for development and 1000 for test. Each of the statement pairs is annotated with a `0` or `1` label depending on whether the nonsensical statement is the first or the second one respectively. The dataset can be loaded into three `DataFrames` as follows:

In [35]:
def load_data(data_csv, answers_csv):
    data = pd.read_csv(data_csv)
    reasons = pd.read_csv(answers_csv, header=None).rename(columns={0: "id", 1: "label"})
    return pd.merge(data, reasons, on="id")
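
To see what `load_data` produces, here is a minimal sketch that runs the same steps on in-memory CSV text instead of the real files. The two rows are copied from the training set. The layout is an assumption inferred from the code above: a data file with an `id,sent0,sent1` header and an answers file with no header.

```python
import io

import pandas as pd

# Stand-ins for the data and answers CSV files (assumed layout, see above).
data_csv = io.StringIO(
    "id,sent0,sent1\n"
    "0,He poured orange juice on his cereal.,He poured milk on his cereal.\n"
    "3,A mosquito stings me,I sting a mosquito\n"
)
answers_csv = io.StringIO("0,0\n3,1\n")

data = pd.read_csv(data_csv)
# The answers file has no header row, so its columns are named by hand.
answers = pd.read_csv(answers_csv, header=None).rename(columns={0: "id", 1: "label"})
# Join each statement pair with its gold label through the shared id.
merged = pd.merge(data, answers, on="id")

print(list(merged.columns))
print(merged["label"].tolist())
```

Each row of the merged table holds one statement pair together with the label that says which of the two statements is nonsensical.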

In [36]:
train_data_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_data_all.csv"
train_answers_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_answers_all.csv"
train_data = load_data(train_data_csv, train_answers_csv)
dev_data_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskA_dev_data.csv"
dev_answers_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskA_gold_answers.csv"
dev_data = load_data(dev_data_csv, dev_answers_csv)
test_data_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskA_test_data.csv"
test_answers_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskA_gold_answers.csv"
test_data = load_data(test_data_csv, test_answers_csv)
if shrink_dataset:
    train_data = train_data.sample(n=100, random_state=42)
    dev_data = dev_data.sample(n=100, random_state=42)
    test_data = test_data.sample(n=100, random_state=42)
train_data

| | id | sent0 | sent1 | label |
|---|---|---|---|---|
| 0 | 0 | He poured orange juice on his cereal. | He poured milk on his cereal. | 0 |
| 1 | 1 | He drinks apple. | He drinks milk. | 0 |
| 2 | 2 | Jeff ran a mile today | Jeff ran 100,000 miles today | 1 |
| 3 | 3 | A mosquito stings me | I sting a mosquito | 1 |
| 4 | 4 | A niece is a person. | A giraffe is a person. | 1 |
| ... | ... | ... | ... | ... |
| 9995 | 9995 | Mark ate a big bitter cherry pie | Mark ate a big sweet cherry pie | 0 |
| 9996 | 9996 | Gloria wears a cat on her head | Gloria wears a hat on her head | 0 |
| 9997 | 9997 | Harry went to the barbershop to have his hair cut | Harry went to the barbershop to have his glass... | 1 |
| 9998 | 9998 | Reilly is sleeping on the couch | Reilly is sleeping on the window | 1 |


In [37]:
len(train_data)

10000

[Datasets](https://huggingface.co/docs/datasets/index) is a library for dataset management that provides a set of tools to manipulate data in an easy and efficient way. Since it is fully integrated with **Transformers**, it is very convenient to use both libraries together. **Datasets** allows accessing and sharing datasets through the [Hugging Face Hub](https://huggingface.co/datasets). The core component of this library is the [Dataset](https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/main_classes#datasets.Dataset) class that implements an [Apache Arrow table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html). Similar to a **pandas** `DataFrame`, a `Dataset` object stores a table where each row corresponds to an example of the dataset and each column contains a different type of data. There are different ways to load the data into a `Dataset`, for example, from a `DataFrame`:

In [38]:
train_dataset = Dataset.from_pandas(train_data)
dev_dataset = Dataset.from_pandas(dev_data)
test_dataset = Dataset.from_pandas(test_data)
train_dataset[0]

{'id': 0,
 'sent0': 'He poured orange juice on his cereal.',
 'sent1': 'He poured milk on his cereal.',
 'label': 0}

One of the most powerful **Datasets** tools is the [map](https://huggingface.co/docs/datasets/v2.10.0/en/nlp_process#map) function which allows pre-processing the dataset in batches. The function takes another callable as argument and applies it to every row in the `Dataset`. The goal of the next exercise is to implement a function to tokenize the statement pairs that will be used as a parameter of the `map` function.

You must complete the code for the `preprocess_data` function. This function takes a batch of examples from a `Dataset`, the tokenizer returned by `load_model` and the `max_length` hyperparameter. The function must run the tokenizer jointly on the `sent0` and `sent1` columns of the `Dataset`. The tokenizer must pad and truncate the sequences to the `max_length` value. You can use the [Preprocessing](https://huggingface.co/docs/transformers/v4.27.2/en/preprocessing) and the [Tokenizer](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/tokenizer) documentation as reference. The `preprocess_data` function should return the output of the tokenizer.

The `tokenizer` should return a [BatchEncoding](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/tokenizer#transformers.BatchEncoding) object with two fields for each data example:
* *input_ids*: A list of token indices that will be used as the input of the model.
* *attention_mask*: A list of indices masking out which tokens the model should not attend to.

The `map` function takes these fields and inserts them into the `Dataset` as two new columns. For example, the result for the first row in the `Dataset` should look like:

> <pre>
{'id': 6252, 'sent0': 'a duck walks on three legs', 'sent1': 'a duck walks on two legs', 'label': 0, '__index_level_0__': 6252, 'input_ids': [0, 102, 15223, 5792, 15, 130, 5856, 2, 2, 102, 15223, 5792, 15, 80, 5856, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
</pre>

Each value in `input_ids` represents a sub-word of the `tokenizer` vocabulary. For the example above, `input_ids` corresponds to the following sequence of sub-words:

> <pre>
['&lt;s&gt;', 'a', 'Ġduck', 'Ġwalks', 'Ġon', 'Ġthree', 'Ġlegs', '&lt;/s&gt;', '&lt;/s&gt;', 'a', 'Ġduck', 'Ġwalks', 'Ġon', 'Ġtwo', 'Ġlegs', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
</pre>

Notice that the **Hugging Face** implementation of **RoBERTa**'s tokenizer uses the `<s>` token equivalently to **BERT**'s `[CLS]` token and the `</s>` token to mark both the end and the separation of the sentences. The `Ġ` character indicates when there is a blank space before the token in the original text. This helps to know which tokens are the first sub-words of the words. 
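
The layout described above can be rebuilt by hand. This is a minimal sketch, not the tokenizer's actual implementation: `build_pair_encoding` is a hypothetical helper, the sub-word ids are copied from the "duck" example, and the special-token ids (`<s>` = 0, `<pad>` = 1, `</s>` = 2) are read off that same output. Truncation is not modelled.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # <s>, <pad>, </s> as seen in the example output


def build_pair_encoding(ids_a, ids_b, max_length):
    """Lay out two already-tokenized sentences as <s> A </s></s> B </s> plus padding."""
    input_ids = [BOS_ID] + ids_a + [EOS_ID, EOS_ID] + ids_b + [EOS_ID]
    if len(input_ids) > max_length:
        raise ValueError("truncation is not modelled in this sketch")
    n_real = len(input_ids)
    n_pad = max_length - n_real
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        # 1 for real tokens, 0 for padding the model should ignore
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


duck_three = [102, 15223, 5792, 15, 130, 5856]  # 'a duck walks on three legs'
duck_two = [102, 15223, 5792, 15, 80, 5856]     # 'a duck walks on two legs'

encoding = build_pair_encoding(duck_three, duck_two, max_length=50)
print(encoding["input_ids"][:18])
print(sum(encoding["attention_mask"]), len(encoding["input_ids"]))
```

The first 16 ids are real tokens and the remaining 34 are padding, so the attention mask is 16 ones followed by 34 zeros, matching the example.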

In [39]:
def preprocess_data(examples, tokenizer, max_length):   # [1 Mark]
    # Tokenize each (sent0, sent1) pair jointly as a single sequence,
    # padding and truncating every pair to exactly max_length tokens.
    # The tokenizer output (input_ids and attention_mask) is returned as-is.
    return tokenizer(
        examples["sent0"],
        examples["sent1"],
        padding="max_length",
        truncation=True,
        max_length=max_length,
    )

In [40]:
train_dataset = train_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
dev_dataset = dev_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
test_dataset = test_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
print(train_dataset[0])
print(tokenizer.convert_ids_to_tokens(train_dataset[0]["input_ids"]))

Map: 100%|██████████| 10000/10000 [00:00<00:00, 49283.41 examples/s]
Map: 100%|██████████| 997/997 [00:00<00:00, 46695.49 examples/s]
Map: 100%|██████████| 1000/1000 [00:00<00:00, 48046.37 examples/s]

{'id': 0, 'sent0': 'He poured orange juice on his cereal.', 'sent1': 'He poured milk on his cereal.', 'label': 0, 'input_ids': [0, 894, 13414, 8978, 10580, 15, 39, 25629, 4, 2, 2, 894, 13414, 5803, 15, 39, 25629, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
['<s>', 'He', 'Ġpoured', 'Ġorange', 'Ġjuice', 'Ġon', 'Ġhis', 'Ġcereal', '.', '</s>', '</s>', 'He', 'Ġpoured', 'Ġmilk', 'Ġon', 'Ġhis', 'Ġcereal', '.', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']





## Fine-tuning - [4 Marks]

Although it is possible to write customized training loops for the **Transformers** models using **Keras** or **PyTorch**, **Transformers** provides a [Trainer](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer) API that allows fine-tuning efficiently with a few simple steps. The training is highly customizable through a wide range of options and hyperparameters that are handled by the [TrainingArguments](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.TrainingArguments) class. Your next goal is to create both the `TrainingArguments` and `Trainer` objects that will be used to fine-tune **RoBERTa**. See the [documentation](https://huggingface.co/docs/transformers/training) for an introduction on how to perform these steps.

You must complete the code for the `create_training_arguments` function. This function takes as arguments the `epochs`, `train_batch_size` and `learning_rate` hyperparameters along with the `output_dir`. The function should use these arguments to create and return a `TrainingArguments` object. During the training, the model must be evaluated on the development set after every epoch. `TrainingArguments` should include this strategy.

> **Important!** By default, `Trainer` saves a checkpoint of the model every 500 training steps. For this assignment, avoid this behavior by setting `save_strategy="no"` when creating the `TrainingArguments`.

In [41]:
! pip3 install accelerate
def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):    # [1 Mark]
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=train_batch_size,
        learning_rate=learning_rate,
        evaluation_strategy="epoch",  # evaluate on the development set after every epoch
        save_strategy="no",  # do not write intermediate checkpoints
    )

    return training_args

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [42]:
train_args = create_training_arguments(epochs, train_batch_size, learning_rate, output_dir)



Next, you will create a `Trainer` object with the training arguments.

You must complete the code for the `create_trainer` function. The function takes as input the model returned by `load_model`, the `TrainingArguments` created by `create_training_arguments`, and the train and development `Datasets`. The `create_trainer` function must create and return a `Trainer` object with the model and the training arguments. The `Trainer` must be set up so that the train `Dataset` is used for training and the development `Dataset` is used to evaluate the model during the training.

In [43]:
def create_trainer(model, train_args, train_dataset, dev_dataset):    # [1 Mark]
    return Trainer(model=model, train_dataset=train_dataset, args=train_args, eval_dataset=dev_dataset)

In [44]:
trainer = create_trainer(model, train_args, train_dataset, dev_dataset)

The `trainer` object created by `create_trainer` is ready to fine-tune the model by just running:

In [45]:
trainer.train()

 13%|█▎        | 500/3750 [03:56<25:31,  2.12it/s]

{'loss': 0.7048, 'grad_norm': 3.9966278076171875, 'learning_rate': 8.666666666666668e-06, 'epoch': 0.4}


 27%|██▋       | 1000/3750 [07:52<21:28,  2.14it/s]

{'loss': 0.6996, 'grad_norm': 11.398848533630371, 'learning_rate': 7.333333333333333e-06, 'epoch': 0.8}


 33%|███▎      | 1250/3750 [09:50<19:35,  2.13it/s]
 33%|███▎      | 1250/3750 [10:01<19:35,  2.13it/s]

{'eval_loss': 0.6966229677200317, 'eval_runtime': 11.1594, 'eval_samples_per_second': 89.342, 'eval_steps_per_second': 11.201, 'epoch': 1.0}


 40%|████      | 1500/3750 [12:00<17:34,  2.13it/s]  

{'loss': 0.6937, 'grad_norm': 5.285187244415283, 'learning_rate': 6e-06, 'epoch': 1.2}


 53%|█████▎    | 2000/3750 [15:54<13:39,  2.14it/s]

{'loss': 0.698, 'grad_norm': 3.18569278717041, 'learning_rate': 4.666666666666667e-06, 'epoch': 1.6}


 67%|██████▋   | 2500/3750 [19:49<09:47,  2.13it/s]

{'loss': 0.6972, 'grad_norm': 12.77448558807373, 'learning_rate': 3.3333333333333333e-06, 'epoch': 2.0}



 67%|██████▋   | 2500/3750 [19:59<09:47,  2.13it/s]

{'eval_loss': 0.6929813027381897, 'eval_runtime': 10.2769, 'eval_samples_per_second': 97.013, 'eval_steps_per_second': 12.163, 'epoch': 2.0}


 80%|████████  | 3000/3750 [23:55<05:50,  2.14it/s]  

{'loss': 0.696, 'grad_norm': 3.2175445556640625, 'learning_rate': 2.0000000000000003e-06, 'epoch': 2.4}


 93%|█████████▎| 3500/3750 [27:49<01:58,  2.12it/s]

{'loss': 0.6989, 'grad_norm': 5.092981338500977, 'learning_rate': 6.666666666666667e-07, 'epoch': 2.8}


100%|██████████| 3750/3750 [29:46<00:00,  2.13it/s]
100%|██████████| 3750/3750 [29:57<00:00,  2.09it/s]

{'eval_loss': 0.6928222179412842, 'eval_runtime': 10.2712, 'eval_samples_per_second': 97.067, 'eval_steps_per_second': 12.17, 'epoch': 3.0}
{'train_runtime': 1797.2688, 'train_samples_per_second': 16.692, 'train_steps_per_second': 2.086, 'train_loss': 0.6982589640299479, 'epoch': 3.0}





TrainOutput(global_step=3750, training_loss=0.6982589640299479, metrics={'train_runtime': 1797.2688, 'train_samples_per_second': 16.692, 'train_steps_per_second': 2.086, 'total_flos': 2730267666000000.0, 'train_loss': 0.6982589640299479, 'epoch': 3.0})
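
The step counts and learning rates in the log above can be checked by hand. The sketch below assumes a single device, no gradient accumulation, and a linear learning-rate decay with no warmup. These are assumptions, not settings made explicitly in this notebook, although the logged values are consistent with them. `linear_lr` is a hypothetical helper.

```python
import math

num_examples = 10000   # size of the full training set
train_batch_size = 8
epochs = 3
learning_rate = 1e-5

# One optimizer step per batch (assumes one device, no gradient accumulation).
steps_per_epoch = math.ceil(num_examples / train_batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step):
    """Learning rate after `step` updates under linear decay to zero, no warmup."""
    return learning_rate * (total_steps - step) / total_steps


print(steps_per_epoch, total_steps)
for step in (500, 1000, 1500):
    print(step, linear_lr(step))
```

This gives 1250 steps per epoch and 3750 steps in total, which matches the progress bar, and the rates at steps 500, 1000 and 1500 agree with the logged `learning_rate` values.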

After training, the model can be used to make predictions on unlabeled data using the [predict](https://huggingface.co/docs/transformers/v4.27.2/en/main_classes/trainer#transformers.Trainer.predict) method of the `Trainer` class.

You must complete the code for the `make_predictions` function. The function takes as input the `Trainer` object and test `Dataset`. The function must run the `predict` method on the input data. The `predict` method will return a `NamedTuple` including a **numpy** array with the predictions. For each statement pair in the input, the array should contain a vector with the logits (the values used as input of the softmax) predicted for every label. The output of `make_predictions` must include only the index of the label with the highest logit value. For example, if the prediction for one statement pair is `[0.10053499, -0.01917896]`, the output for that example should be `0`. For this, you can apply the [argmax](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) method along the last axis of the **numpy** array.
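
As a small illustration of this step, the sketch below applies `argmax` to a hand-written logits array. The first row is the example from the paragraph above; the second row is made up. It also shows that applying a softmax first does not change the selected label, since softmax preserves the ordering of the logits.

```python
import numpy as np

# Row 0 is the example from the text; row 1 is invented for illustration.
logits = np.array([[0.10053499, -0.01917896],
                   [-1.2, 0.7]])

# Index of the highest logit along the last axis (one label per statement pair).
labels = np.argmax(logits, axis=-1)

# Numerically stable softmax: subtract the row maximum before exponentiating.
shifted = logits - logits.max(axis=-1, keepdims=True)
probabilities = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(labels)
print(np.argmax(probabilities, axis=-1))
```

Both prints show the same labels, so `make_predictions` only needs the `argmax` of the raw logits.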

In [46]:
import numpy as np
def make_predictions(trainer, test_dataset):
    # predict() returns a NamedTuple; its `predictions` field holds the logits
    # with shape (num_examples, num_labels).
    logits = trainer.predict(test_dataset).predictions

    # Softmax preserves the ordering of the logits, so the argmax of the raw
    # logits is already the predicted label.
    return np.argmax(logits, axis=-1)

In [47]:
predictions = make_predictions(trainer, test_dataset)
test_data["prediction"] = predictions
test_data

100%|██████████| 125/125 [00:12<00:00,  9.95it/s]


| | id | sent0 | sent1 | label | prediction |
|---|---|---|---|---|---|
| 0 | 1175 | He loves to stroll at the park with his bed | He loves to stroll at the park with his dog. | 0 | 0 |
| 1 | 452 | The inverter was able to power the continent. | The inverter was able to power the house | 0 | 0 |
| 2 | 275 | The chef put extra lemons on the pizza. | The chef put extra mushrooms on the pizza. | 0 | 0 |
| 3 | 869 | sugar is used to make coffee sour | sugar is used to make coffee sweet | 0 | 0 |
| 4 | 50 | There are beautiful flowers here and there in ... | There are beautiful planes here and there in t... | 1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 995 | 1114 | If it had rained, you would got wet. | If it is a sunny day, you would got wet. | 1 | 0 |
| 996 | 8 | ice hockey is a sport | ice hockey is a financial institution | 1 | 0 |
| 997 | 1945 | He put water without a container in the freeze... | He put a watermelon in the freezer for 24 hours | 0 | 0 |
| 998 | 1053 | The desert has sand that you can drink. | The desert is very dry, so bring water when yo... | 0 | 0 |


**SubTask A** of **ComVE** is evaluated using accuracy. The [evaluate](https://huggingface.co/docs/evaluate/index) library provides support for this and other metrics. The `evaluate_prediction` function takes the test `DataFrame` and calculates the accuracy by comparing the `prediction` and `label` columns. With `shrink_dataset` and `base_model` set to `True`, the model is not able to learn the task, so the expected score is only *0.49*. With a full training run, i.e. with `shrink_dataset` and `base_model` set to `False`, the score should be around *0.929*.

In [48]:
def evaluate_prediction(test_data):
    accuracy = evaluate.load("accuracy")
    return accuracy.compute(predictions=test_data["prediction"].values, references=test_data["label"].values)
evaluate_prediction(test_data)

{'accuracy': 0.508}
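
Accuracy is the fraction of statement pairs whose predicted label equals the gold label. The sketch below computes it by hand for the first five rows of the prediction table above. It also computes the cross-entropy of a two-class model that always outputs equal probabilities, as a rough reference point for reading the losses logged during training.

```python
import math

import numpy as np

# First five rows of the prediction table shown above.
labels = np.array([0, 0, 0, 0, 1])
predictions = np.array([0, 0, 0, 0, 0])

# Fraction of rows where the prediction matches the gold label.
accuracy = float((labels == predictions).mean())
print(accuracy)

# Cross-entropy of a binary classifier that always predicts 50/50.
chance_loss = -math.log(0.5)
print(round(chance_loss, 4))
```

A loss that stays near ln 2 (about 0.693) together with an accuracy near 0.5 suggests the run has not moved beyond chance level. The `eval_loss` values and the 0.508 accuracy shown in this notebook fall in that range.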