# 2022-2023 Natural Language Interaction
# Group 2
# 61287 Anna Ricker
# 60552 Rodrigo Santos

Installation

In [1]:
!pip install datasets evaluate transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Load the BoolQ dataset and the RoBERTa tokenizer, then tokenize each question/passage pair

In [2]:
from datasets import load_dataset
from transformers import DataCollatorWithPadding, RobertaTokenizer

raw_datasets = load_dataset("super_glue", "boolq")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")


def tokenize_function(example):
    return tokenizer(example["question"], example["passage"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Downloading and preparing dataset super_glue/boolq to /root/.cache/huggingface/datasets/super_glue/boolq/1.0.3/bb9675f958ebfee0d5d6dc5476fafe38c79123727a7258d515c450873dbdbbed...


Generating train split:   0%|          | 0/9427 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3270 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3245 [00:00<?, ? examples/s]

Dataset super_glue downloaded and prepared to /root/.cache/huggingface/datasets/super_glue/boolq/1.0.3/bb9675f958ebfee0d5d6dc5476fafe38c79123727a7258d515c450873dbdbbed. Subsequent calls will reuse this data.


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
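
The warning above refers to the 'longest_first' strategy that truncation=True selects for sequence pairs. As a rough illustration only (not the tokenizer's actual implementation), the idea can be sketched in plain Python: tokens are dropped one at a time from whichever sequence is currently longer until the pair fits. The function name, the tie-breaking rule, and the budget of 508 tokens (512 minus 4 special tokens for a RoBERTa pair) are assumptions made for this sketch.

```python
def truncate_longest_first(first, second, max_tokens):
    """Drop tokens from the end of the longer sequence until the pair fits.

    Hypothetical helper for illustration; ties are resolved by trimming
    the second sequence, which is an assumption of this sketch.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    first, second = list(first), list(second)
    while len(first) + len(second) > max_tokens:
        if len(first) > len(second):
            first.pop()
        else:
            second.pop()
    return first, second


# A short question and a long passage: only the passage is shortened.
question_ids = list(range(10))
passage_ids = list(range(600))
q, p = truncate_longest_first(question_ids, passage_ids, max_tokens=508)
print(len(q), len(p))
```

In this setting the question is almost always much shorter than the passage, so truncation would be expected to remove tokens from the end of the passage and leave the question intact.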

Remove the columns the model does not use, rename label to labels (the argument name the model expects), and return PyTorch tensors.

In [3]:
tokenized_datasets = tokenized_datasets.remove_columns(["question", "passage", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'attention_mask']

Wrap the train and validation splits in DataLoaders

In [4]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=12, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=12, collate_fn=data_collator
)

In [5]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([12]),
 'input_ids': torch.Size([12, 275]),
 'attention_mask': torch.Size([12, 275])}
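
The second dimension (275 here) changes from batch to batch because the collator pads each batch only up to its own longest sequence. The sketch below shows that idea in plain Python, without the library; pad_batch is a hypothetical helper, and pad_id=1 is assumed to be the padding id of the roberta-base vocabulary.

```python
def pad_batch(sequences, pad_id=1):
    """Pad token-id lists to the length of the longest one in the batch.

    Hypothetical stand-in for what a padding collator does; pad_id=1 is
    an assumption about the roberta-base vocabulary.
    """
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch_example = pad_batch([[0, 5, 2], [0, 7, 8, 9, 2]])
print(batch_example["input_ids"])
print(batch_example["attention_mask"])
```

Padding per batch rather than to a fixed maximum keeps short batches small, which usually saves compute when sequence lengths vary as much as they do in this dataset.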

Train the model

In [6]:
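import torch
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from transformers import RobertaForSequenceClassification, get_scheduler
from tqdm.auto import tqdm

model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
# eps and weight_decay are set to the defaults of the deprecated transformers.AdamW
optimizer = AdamW(model.parameters(), lr=3e-5, eps=1e-6, weight_decay=0.0)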

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 2
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))
loss = 0

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    # Loss of the last batch in this epoch (not an average over the epoch)
    print(loss)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

  0%|          | 0/1572 [00:00<?, ?it/s]

tensor(0.4521, device='cuda:0', grad_fn=<NllLossBackward0>)
tensor(0.2355, device='cuda:0', grad_fn=<NllLossBackward0>)
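
The progress bar total of 1572 and the shape of the "linear" schedule can be checked with a few lines of arithmetic. The sketch below is a plain-Python approximation of a linear schedule with warmup, not the library code itself: the learning rate rises linearly from 0 to its base value over the warmup steps, then decays linearly to 0 at the last step. The function name linear_warmup_factor is hypothetical; the numbers (9427 training examples, batch size 12, 2 epochs, 100 warmup steps, lr 3e-5) are the ones used above.

```python
import math


def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(9427 / 12)
total_steps = 2 * steps_per_epoch
base_lr = 3e-5

for step in (0, 50, 100, 836, total_steps):
    factor = linear_warmup_factor(step, 100, total_steps)
    print(step, base_lr * factor)
```

With these settings the peak learning rate is reached after 100 of the 1572 steps, so warmup covers roughly the first 6% of training.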


Evaluate the model

In [7]:
import evaluate

# load_metric from datasets is deprecated; the evaluate library replaces it
metric = evaluate.load("super_glue", "boolq")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()


{'accuracy': 0.772782874617737}
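
For BoolQ the reported metric is plain accuracy: the fraction of validation examples whose predicted label matches the reference. The sketch below recomputes that by hand; the accuracy helper is hypothetical and only mirrors the definition. Assuming all 3270 validation examples were scored, the value above is consistent with 2527 correct predictions.

```python
def accuracy(predictions, references):
    """Fraction of positions where prediction and reference agree."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy example: 3 of 4 labels match.
toy_score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(toy_score)

# The reported score expressed as a count of correct answers.
implied_score = 2527 / 3270
print(implied_score)
```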

Save the model

In [8]:
model.save_pretrained("question-answerer")

Load the model

In [9]:
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("question-answerer")

Test function

In [10]:
import torch

def test(passage, question):
  # Tokenize the question/passage pair in the same order used for training
  encoded_input = tokenizer(question, passage, truncation=True, return_tensors="pt")

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model.to(device)
  model.eval()
  encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

  # Inference only, so disable gradient tracking
  with torch.no_grad():
    outputs = model(**encoded_input)

  # Get the predicted label
  logits = outputs.logits
  predicted_label = torch.argmax(logits, dim=-1).item()

  # Convert label (0 or 1) to actual prediction
  labels = ['False', 'True']
  prediction = labels[predicted_label]

  return prediction
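
The test function returns only the argmax label. When a rough confidence is also wanted, the two logits can be turned into probabilities with a softmax. The sketch below does this in plain Python; the logit values are made up for illustration and are not output of the trained model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the classes ['False', 'True'].
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
labels = ["False", "True"]
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], probs[best])
```

Softmax preserves the ordering of the logits, so the predicted label is the same as with argmax; only the extra probability is new.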

In [11]:

# Should be True (1)
question = "is confectionary sugar the same as powdered sugar"
print("question", question)
passage = "Powdered sugar, also called confectioner's sugar, icing sugar, and icing cake, is a finely ground sugar produced by milling granulated sugar into a powdered state. It usually contains a small amount of anti-caking agent to prevent clumping and improve flow. Although most often produced in a factory, powdered sugar can also be made by processing ordinary granulated sugar in a coffee grinder, or by crushing it by hand in a mortar and pestle."
print("passage", passage)
prediction = test(passage, question)
print(f"Prediction: {prediction}")

# Should be False (0)
question = "is saline and sodium chloride the same thing"
print("question", question)
passage = "Saline, also known as saline solution, is a mixture of sodium chloride in water and has a number of uses in medicine. Applied to the affected area it is used to clean wounds, help remove contact lenses, and help with dry eyes. By injection into a vein it is used to treat dehydration such as from gastroenteritis and diabetic ketoacidosis. It is also used to dilute other medications to be given by injection."
print("passage", passage)
prediction = test(passage, question)
print(f"Prediction: {prediction}")

# Should be False (0)
question = "is tomato puree and tomato sauce the same thing"
print("question", question)
passage = "Tomato purée -- Tomato purée is a thick liquid made by cooking and straining tomatoes. The difference between tomato paste, tomato purée, and tomato sauce is consistency; tomato puree has a thicker consistency and a deeper flavour than sauce."
print("passage", passage)
prediction = test(passage, question)
print(f"Prediction: {prediction}")


question is confectionary sugar the same as powdered sugar
passage Powdered sugar, also called confectioner's sugar, icing sugar, and icing cake, is a finely ground sugar produced by milling granulated sugar into a powdered state. It usually contains a small amount of anti-caking agent to prevent clumping and improve flow. Although most often produced in a factory, powdered sugar can also be made by processing ordinary granulated sugar in a coffee grinder, or by crushing it by hand in a mortar and pestle.
Prediction: True
question is saline and sodium chloride the same thing
passage Saline, also known as saline solution, is a mixture of sodium chloride in water and has a number of uses in medicine. Applied to the affected area it is used to clean wounds, help remove contact lenses, and help with dry eyes. By injection into a vein it is used to treat dehydration such as from gastroenteritis and diabetic ketoacidosis. It is also used to dilute other medications to be given by injection.
