In [1]:
! pip install transformers datasets accelerate evaluate diffusers

Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting diffusers
  Downloading diffusers-0.18.2-py3-none-any.whl (1.2 MB)
Installing collected packages: diffusers, evaluate
Successfully installed diffusers-0.18.2 evaluate-0.4.0


# **Multiple Choice**

A multiple choice task is similar to **question answering**, except that several candidate answers are provided along with a context and the model is trained to select the correct answer.

This guide will show you how to:

- Finetune BERT on the `regular` configuration of the **[SWAG](https://huggingface.co/datasets/swag)** dataset to select the best answer given multiple options and some context
- Use your finetuned model for inference

## **1. Load SWAG dataset**

Start by loading the `regular` configuration of the **SWAG dataset**

In [2]:
from datasets import load_dataset
import warnings; warnings.filterwarnings('ignore')

swag = load_dataset("swag", "regular")
print(swag['train'].column_names)

Downloading and preparing dataset swag/regular (download: 41.92 MiB, generated: 44.96 MiB, post-processed: Unknown size, total: 86.88 MiB) to /root/.cache/huggingface/datasets/swag/regular/0.0.0/9640de08cdba6a1469ed3834fcab4b8ad8e38caf5d1ba5e7436d8b1fd067ad4c...


Generating train split:   0%|          | 0/73546 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/20006 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/20005 [00:00<?, ? examples/s]

Dataset swag downloaded and prepared to /root/.cache/huggingface/datasets/swag/regular/0.0.0/9640de08cdba6a1469ed3834fcab4b8ad8e38caf5d1ba5e7436d8b1fd067ad4c. Subsequent calls will reuse this data.


['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label']


In [3]:
swag["train"][0]

{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3416',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'gold-source': 'gold',
 'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'label': 0}

In [4]:
swag["train"][1]

{'video-id': 'anetv_jkn6uvmqwh4',
 'fold-ind': '3417',
 'startphrase': 'A drum line passes by walking down the street playing their instruments. Members of the procession',
 'sent1': 'A drum line passes by walking down the street playing their instruments.',
 'sent2': 'Members of the procession',
 'gold-source': 'gen',
 'ending0': 'are playing ping pong and celebrating one left each in quick.',
 'ending1': 'wait slowly towards the cadets.',
 'ending2': 'continues to play as well along the crowd along with the band being interviewed.',
 'ending3': 'continue to play marching, interspersed.',
 'label': 3}

In [5]:
# Starting Sentence
swag['train'][0]['sent1']

'Members of the procession walk down the street holding small horn brass instruments.'

In [6]:
# Ending Sentences (Multiple Choice)
print(f"{swag['train'][0]['sent2']}",swag['train'][0]['ending0'])
print(f"{swag['train'][0]['sent2']}",swag['train'][0]['ending1'])
print(f"{swag['train'][0]['sent2']}",swag['train'][0]['ending2'])
print(f"{swag['train'][0]['sent2']}",swag['train'][0]['ending3'])

A drum line passes by walking down the street playing their instruments.
A drum line has heard approaching them.
A drum line arrives and they're outside dancing and asleep.
A drum line turns the lead singer watches the performance.


Model data input:

- `sent1` and `sent2`: these fields show how a sentence starts, and if you put the two together, you get the `startphrase` field
- `ending0` to `ending3`: four possible endings for the sentence, only one of which is correct
- `label`: identifies the correct sentence ending
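
To make the role of `label` concrete, here is a minimal sketch that rebuilds the correct continuation for the first training example shown above. The field values are copied from that example; the helper name `gold_continuation` is hypothetical and not part of any library.

```python
# Fields copied from the first training example printed above.
example = {
    "sent1": "Members of the procession walk down the street holding small horn brass instruments.",
    "sent2": "A drum line",
    "ending0": "passes by walking down the street playing their instruments.",
    "ending1": "has heard approaching them.",
    "ending2": "arrives and they're outside dancing and asleep.",
    "ending3": "turns the lead singer watches the performance.",
    "label": 0,
}


def gold_continuation(ex):
    """Return `sent2` completed with the ending that `label` points to."""
    correct_ending = ex["ending" + str(ex["label"])]
    return f"{ex['sent2']} {correct_ending}"


print(example["sent1"])
print(gold_continuation(example))
```

With `label` equal to 0, the second line printed is `sent2` followed by `ending0`.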

## **2. Preprocess**
The next step is to load a BERT tokenizer to process the sentence starts and the four possible endings

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

The preprocessing function you want to create needs to:

- Make four copies of the `sent1` field, one per candidate ending, to serve as the first sentence of each pair.
- Combine `sent2` with each of the four possible sentence endings to form the second sentence of each pair.
- Flatten these two lists so you can tokenize them, and then unflatten them afterward so each example has `input_ids`, `token_type_ids`, and `attention_mask` fields that each hold four entries, one per ending. The existing `label` column is kept unchanged.

In [8]:
ending_names = ["ending0", "ending1", "ending2", "ending3"]

# multiple choice preprocessing function
def preprocess_function(examples):

    # starting sentences
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]

    # ending sentences
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]

    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    return {k: [v[i : i+4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
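
To see what the flatten and regroup steps do to the shape of a batch, the sketch below runs the same logic on invented toy data. The whitespace-splitting `toy_tokenizer` is a hypothetical stand-in for the real BERT tokenizer, and `toy_preprocess` is an illustrative name, so treat this as a shape check rather than real preprocessing.

```python
ENDING_NAMES = ["ending0", "ending1", "ending2", "ending3"]

# Invented toy batch in the column-oriented layout used by map(..., batched=True).
batch = {
    "sent1": ["A chef picks up a knife.", "A dog runs to the door."],
    "sent2": ["The chef", "The dog"],
    "ending0": ["chops an onion.", "barks loudly."],
    "ending1": ["sings a song.", "reads a book."],
    "ending2": ["flies away.", "drives a car."],
    "ending3": ["paints a wall.", "cooks dinner."],
}


def toy_tokenizer(first, second):
    """Hypothetical stand-in: one whitespace-split token list per sentence pair."""
    return {"input_ids": [(a + " " + b).split() for a, b in zip(first, second)]}


def toy_preprocess(examples):
    n = len(ENDING_NAMES)
    first = [[context] * n for context in examples["sent1"]]
    second = [
        [f"{header} {examples[end][i]}" for end in ENDING_NAMES]
        for i, header in enumerate(examples["sent2"])
    ]
    # Flatten: one (first, second) pair per candidate ending.
    first = [s for group in first for s in group]
    second = [s for group in second for s in group]
    tokenized = toy_tokenizer(first, second)
    # Regroup: each consecutive run of n pairs belongs to the same example.
    return {k: [v[i:i + n] for i in range(0, len(v), n)] for k, v in tokenized.items()}


out = toy_preprocess(batch)
print(len(out["input_ids"]), len(out["input_ids"][0]))  # examples, choices per example
```

Each of the two toy examples ends up with four token lists, one per candidate ending.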

In [9]:
tokenized_swag = swag.map(preprocess_function, batched=True)

- Transformers doesn't have a data collator for **multiple choice**; adapt the `DataCollatorWithPadding` to create a batch of examples
- It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length
- `DataCollatorForMultipleChoice` flattens all the model inputs, applies padding, and then unflattens the results
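
The flatten, pad, and unflatten sequence described in the bullets above can be sketched with plain Python lists. This is a simplified illustration, not the collator itself: the token ids are invented, the function name `pad_multiple_choice` is hypothetical, and a padding id of 0 is assumed.

```python
def pad_multiple_choice(features, pad_id=0):
    """Pad a batch where each example is a list of `num_choices` token-id lists."""
    batch_size = len(features)
    num_choices = len(features[0])

    # Flatten to batch_size * num_choices sequences.
    flat = [seq for example in features for seq in example]

    # Dynamic padding: pad only to the longest sequence in this batch.
    longest = max(len(seq) for seq in flat)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in flat]
    mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in flat]

    def regroup(rows):
        # Unflatten back to (batch_size, num_choices, longest).
        return [rows[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]

    return {"input_ids": regroup(padded), "attention_mask": regroup(mask)}


# Two examples with two choices each; ids are made up for illustration.
features = [
    [[101, 7, 8, 102], [101, 7, 102]],
    [[101, 5, 102], [101, 5, 6, 9, 102]],
]
batch = pad_multiple_choice(features)
print(batch["input_ids"][0][1])
print(batch["attention_mask"][0][1])
```

Every sequence is padded to length 5 here because that is the longest sequence in this particular batch; a different batch would get a different padded length.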

In [10]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

## **3. Evaluate**

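Including a metric during training helps you evaluate your model's performance. Load the accuracy metric with the Evaluate library, then create a `compute_metrics` function that passes your predictions and labels to `compute` to calculate the accuracy.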

In [11]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
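
For intuition, the same calculation can be written without any library: take the index of the largest score in each row, then count how often it matches the label. The logits and labels below are hypothetical values, and `accuracy_from_logits` is an illustrative name rather than an API of the Evaluate library.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring choice equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical scores for three examples with four choices each.
logits = [
    [2.1, 0.3, -1.0, 0.5],
    [0.2, 0.1, 0.4, 1.7],
    [0.9, 1.2, 0.3, 0.0],
]
labels = [0, 3, 2]
result = accuracy_from_logits(logits, labels)
print(result)
```

The first two rows are predicted correctly and the third is not, so two out of three predictions match.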

## **4. Fine Tune Model**

For multiple choice problems, we load `AutoModelForMultipleChoice`.

In [12]:
from transformers import AutoModelForMultipleChoice

# load the pretrained BERT weights with a multiple choice head
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultipleChoice: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight'

At this point, only three steps remain:

- Define your **training hyperparameters** in `TrainingArguments`
- Pass the training arguments to `Trainer` along with the model, dataset, tokenizer, data collator, and `compute_metrics` function
- Call `train()` to finetune your model

In [14]:
import os
from transformers import TrainingArguments, Trainer

os.environ['WANDB_DISABLED'] = 'true'

# trainer arguments
training_args = TrainingArguments(
    output_dir="my_awesome_swag_model",
    evaluation_strategy="epoch",
    load_best_model_at_end=False,
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="no",
    save_total_limit=2,
    disable_tqdm=True  # remove status bar
)

# train model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_swag["train"],
    eval_dataset=tokenized_swag["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)


trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 0.9781, 'learning_rate': 4.818722355159162e-05, 'epoch': 0.11}
{'loss': 0.9057, 'learning_rate': 4.637444710318324e-05, 'epoch': 0.22}
{'loss': 0.875, 'learning_rate': 4.4561670654774854e-05, 'epoch': 0.33}
{'loss': 0.844, 'learning_rate': 4.2748894206366476e-05, 'epoch': 0.44}
{'loss': 0.808, 'learning_rate': 4.093611775795809e-05, 'epoch': 0.54}
{'loss': 0.8046, 'learning_rate': 3.912334130954971e-05, 'epoch': 0.65}
{'loss': 0.8032, 'learning_rate': 3.731056486114133e-05, 'epoch': 0.76}
{'loss': 0.7601, 'learning_rate': 3.549778841273294e-05, 'epoch': 0.87}
{'loss': 0.7608, 'learning_rate': 3.368501196432456e-05, 'epoch': 0.98}
{'eval_loss': 0.6071198582649231, 'eval_accuracy': 0.7676696990902729, 'eval_runtime': 137.2609, 'eval_samples_per_second': 145.752, 'eval_steps_per_second': 9.114, 'epoch': 1.0}
{'loss': 0.48, 'learning_rate': 3.187223551591618e-05, 'epoch': 1.09}
{'loss': 0.3944, 'learning_rate': 3.0059459067507794e-05, 'epoch': 1.2}
{'loss': 0.4135, 'learning_rate'

TrainOutput(global_step=13791, training_loss=0.4609237691920625, metrics={'train_runtime': 4946.0691, 'train_samples_per_second': 44.609, 'train_steps_per_second': 2.788, 'train_loss': 0.4609237691920625, 'epoch': 3.0})
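
The step count and learning rates in the log above can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, a linear decay to zero with no warmup, and a log entry every 500 optimizer steps; these are assumptions about the run, not values shown in the notebook, although the results agree with the logged numbers.

```python
import math

train_examples = 73546   # size of the train split reported when the dataset was generated
batch_size = 16          # per_device_train_batch_size
epochs = 3               # num_train_epochs
base_lr = 5e-5           # learning_rate

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step):
    """Learning rate after `step` updates, assuming linear decay to zero and no warmup."""
    return base_lr * (total_steps - step) / total_steps


print(total_steps)                       # 13791
print(linear_lr(500))                    # compare with the first logged learning_rate
print(round(500 / steps_per_epoch, 2))   # compare with the first logged epoch
```

Under these assumptions, `total_steps` equals the `global_step` in `TrainOutput`, and the rate at step 500 lines up with the first `learning_rate` entry in the log.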

In [15]:
# save model and tokeniser
trainer.save_model("my_awesome_swag_model")
tokenizer.save_pretrained("my_awesome_swag_model")

('my_awesome_swag_model/tokenizer_config.json',
 'my_awesome_swag_model/special_tokens_map.json',
 'my_awesome_swag_model/vocab.txt',
 'my_awesome_swag_model/added_tokens.json',
 'my_awesome_swag_model/tokenizer.json')

## **5. Inference**

Now that we have finetuned a model, we can use it for inference
- Let's come up with some text and two candidate answers

In [None]:
# Our example
prompt = "France is the biggest country in the world"
candidate1 = "True"
candidate2 = "False"

- Tokenize each prompt and candidate answer pair and return tensors
- You also need to create some `labels`

In [16]:
import torch
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice

# Tokenize each prompt and candidate answer pair and return PyTorch tensors
tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]],
                    return_tensors="pt",
                    padding=True)

# Label of the answer treated as correct: index 0 (candidate1), with a batch dimension added
labels = torch.tensor(0).unsqueeze(0)

In [17]:
# Pass your inputs and labels to the model and return the `logits`:

model = AutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()},
                labels=labels)
logits = outputs.logits

# get the class with the highest probability
predicted_class = logits.argmax().item()
predicted_class

1
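
The predicted index refers to the position of the candidate in the list passed to the tokenizer, so `1` corresponds to `candidate2` ("False"). As a rough sketch of how you might map the index back to text and turn logits into probabilities, the example below uses hypothetical logit values (they are not taken from the run above) and a hand-written `softmax` helper.

```python
import math

candidates = ["True", "False"]   # same order as passed to the tokenizer
logits = [-0.8, 1.3]             # hypothetical values, not from the model run above


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(candidates[best], probs)
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly, as the notebook does, selects the same candidate.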