In [1]:
# colab resource monitor
from urllib.request import urlopen
exec(urlopen("http://colab-monitor.smankusors.com/track.py").read())
_colabMonitor = ColabMonitor().start()

Now live at : http://colab-monitor.smankusors.com/609d348c89838


In [2]:
# !pip install transformers
# !pip install datasets

In [3]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-6f4bdbeb-9c2c-c310-6bf4-00dbd32e4841)


# Fine-tuning a model on a multiple choice task
This notebook shows how to fine-tune a 🤗 Transformers model on a multiple choice task, that is, selecting the most plausible option from a given set of candidates. The dataset used here is [SWAG](https://www.aclweb.org/anthology/D18-1009/), but you can adapt the pre-processing to any other multiple choice dataset, including your own data.<br>
[SWAG](https://www.aclweb.org/anthology/D18-1009/) is a commonsense reasoning dataset: each example describes a situation and then proposes four candidate endings that could follow it.<br>
- [SWAG from huggingface](https://huggingface.co/datasets/swag)

In [4]:
# set main parameters
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

# check execution time for whole code
import time
s_time = time.time()

In [5]:
import datasets

import pandas as pd
import numpy as np

import random
from IPython.display import HTML, display
import collections
import tqdm

import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
from transformers import default_data_collator

import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# datasets : 1.6.2  |  pd : 1.1.5  |  np : 1.19.5  |  tqdm : 4.41.1  |  transformers : 4.6.0  |  torch : 1.8.1+cu101
print(f'datasets : {datasets.__version__}  |  pd : {pd.__version__}  |  np : {np.__version__}  |  tqdm : {tqdm.__version__}  |  transformers : {transformers.__version__}  |  torch : {torch.__version__}')
print('device :', device)

datasets : 1.6.2  |  pd : 1.1.5  |  np : 1.19.5  |  tqdm : 4.41.1  |  transformers : 4.6.0  |  torch : 1.8.1+cu101
device : cuda


## 1. Loading the dataset & metric
- We use the 🤗 Datasets library to download the data and to obtain the metric needed for evaluation (so that our model can be compared to the benchmark). Both are available through the functions `load_dataset` and `load_metric`.
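Because the `load_metric` call in the next cell is commented out, one option is to score predictions with plain accuracy computed in NumPy. The sketch below is an assumption, not the notebook's own evaluation code: `accuracy_from_logits` is a hypothetical helper, and the toy scores are made up purely for illustration.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring choice equals the gold label."""
    logits = np.asarray(logits)    # shape: (num_examples, num_choices)
    labels = np.asarray(labels)    # shape: (num_examples,)
    predictions = logits.argmax(axis=-1)
    return float((predictions == labels).mean())

# Made-up scores for three examples with four candidate endings each.
toy_logits = [
    [2.0, 0.1, -1.0, 0.3],   # predicted choice 0
    [0.2, 0.1, 1.5, 0.0],    # predicted choice 2
    [0.9, 1.1, 0.3, 0.2],    # predicted choice 1
]
toy_labels = [0, 2, 3]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(f"accuracy: {toy_accuracy:.3f}")
```

Two of the three toy predictions match their labels, so the helper reports an accuracy of about 0.667.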

In [6]:
# load dataset & metric
dset_dict = datasets.load_dataset('swag')
# metric = datasets.load_metric('swag')  # not run

# check dataset
print('\n>>> dataset object :')
display(dset_dict)
print('\n>>> sample data :')
display(dset_dict['train'][0])

No config specified, defaulting to: swag/regular
Reusing dataset swag (/root/.cache/huggingface/datasets/swag/regular/0.0.0/9640de08cdba6a1469ed3834fcab4b8ad8e38caf5d1ba5e7436d8b1fd067ad4c)



>>> dataset object :


DatasetDict({
    train: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 73546
    })
    validation: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20006
    })
    test: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 20005
    })
})


>>> sample data :


{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}
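
To make the task format concrete, the sample record above can be expanded into four (context, continuation) pairs, one per candidate ending. This is a minimal sketch under assumptions, not the notebook's own pre-processing: `build_choice_pairs` and `ending_names` are hypothetical names introduced here, and the record is copied by hand from the output above so the block runs without downloading the dataset.

```python
# Fields copied from the sample record printed above.
sample = {
    "sent1": "Members of the procession walk down the street holding small horn brass instruments.",
    "sent2": "A drum line",
    "ending0": "passes by walking down the street playing their instruments.",
    "ending1": "has heard approaching them.",
    "ending2": "arrives and they're outside dancing and asleep.",
    "ending3": "turns the lead singer watches the performance.",
    "label": 0,
}

ending_names = ["ending0", "ending1", "ending2", "ending3"]

def build_choice_pairs(example):
    """Return one (context, continuation) pair per candidate ending."""
    # The context sentence is repeated once for each candidate.
    contexts = [example["sent1"]] * len(ending_names)
    # Each continuation is the start of the second sentence plus one ending.
    continuations = [f"{example['sent2']} {example[name]}" for name in ending_names]
    return list(zip(contexts, continuations))

pairs = build_choice_pairs(sample)
for index, (context, continuation) in enumerate(pairs):
    marker = "  <- gold" if index == sample["label"] else ""
    print(f"{index}: {continuation}{marker}")
```

A model for this task scores each of the four pairs, and the index of the highest-scoring pair is compared with `label`.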

In [7]:
# show random sample of a dataset
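def show_random_elements(dataset, num_examples=5):
    """Display `num_examples` randomly picked rows of `dataset` as an HTML table."""
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # fixed seed: for a given dataset and num_examples, every call shows the same rows
    random.seed(777)
    picks = random.sample(range(len(dataset)), k=num_examples)

    df = pd.DataFrame(dataset[picks])
    # replace integer class ids with their label names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))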

show_random_elements(dset_dict["train"], 2)

| | ending0 | ending1 | ending2 | ending3 | fold-ind | gold-source | label | sent1 | sent2 | startphrase | video-id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | takes possession of the surging metal. | someone and his friends follow each and open his mechanics for the last remark on the chess piece. | are row upon row of the weapons his father once used. | someone reaches into the passenger window, moves the wall punctuated back from the cabinet side - first as kids rush out. | 18668 | gold | 2 | Someone crashes into a set of shelves filled with spherical bombs. | There | Someone crashes into a set of shelves filled with spherical bombs. There | lsmdc1008_Spider-Man2-76017 |
| 1 | takes out her hand and pulls the wheel under. | stands on his side at the shore, looking at the hair of someone with his hands. | hangs low with a closed phone. | pauses and turns back and slowly drives away. | 11809 | gen | 3 | The unit moves on, spread out along the hazy street. | Someone | The unit moves on, spread out along the hazy street. Someone | lsmdc3009_BATTLE_LOS_ANGELES-121 |
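
The function above fixes the seed at 777, which is why repeated calls display the same rows. The sketch below illustrates the same reproducible sampling with a local generator, so the global `random` state used elsewhere in a notebook stays untouched. `pick_indices` is a hypothetical helper, not part of the notebook; 73546 is the size of the train split reported earlier.

```python
import random

def pick_indices(num_rows, num_examples, seed=777):
    """Sample distinct row indices reproducibly without reseeding the global RNG."""
    rng = random.Random(seed)  # local generator, independent of random.seed()
    return rng.sample(range(num_rows), k=num_examples)

first_picks = pick_indices(73546, 2)
second_picks = pick_indices(73546, 2)

# Same seed and same arguments give the same indices on every call.
print(first_picks == second_picks)
```

Passing a different `seed` is one way to inspect a different set of rows while keeping each view reproducible.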
