In [1]:
# !pip install transformers
# !pip install datasets

In [2]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-ef3d59f4-ddf0-c0d1-c3db-d3180bbc33ab)


## Fine-tuning a model on a question-answering task
This notebook shows how to fine-tune one of the 🤗 Transformers models on a question-answering task, that is, the task of extracting the answer to a question from a given context.
<br><br>
**Note**: This notebook fine-tunes models that answer a question by selecting a substring of the context, not by generating new text.

In [3]:
# set main parameters
squad_v2_flag = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

# check execution time for whole code
import time
s_time = time.time()

In [4]:
import datasets

import pandas as pd
import numpy as np
import random
import collections
import tqdm

from IPython.display import display, HTML

import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# datasets : 1.6.1  |  pd : 1.1.5  |  np : 1.19.5  |  tqdm : 4.41.1  |  transformers : 4.5.1  |  torch : 1.8.1+cu101
print(f'datasets : {datasets.__version__}  |  pd : {pd.__version__}  |  np : {np.__version__}  |  tqdm : {tqdm.__version__}  |  transformers : {transformers.__version__}  |  torch : {torch.__version__}')
print('device :', device)

datasets : 1.6.1  |  pd : 1.1.5  |  np : 1.19.5  |  tqdm : 4.41.1  |  transformers : 4.5.1  |  torch : 1.8.1+cu101
device : cuda


## 1. Loading the dataset & metric
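- We will use the 🤗 Datasets library to download the data and to get the metric needed for evaluation (so that our model can be compared against the benchmark). Both are available through the functions `load_dataset` and `load_metric`. These calls match the `datasets` 1.6.1 release used here; in later releases, metric loading moved to the separate `evaluate` library.

- The 🤗 Datasets library also provides a `list_datasets()` function that returns the names of all available datasets. Filtering those names for `quad` finds 21 SQuAD-style QA datasets at the time of this run, including two Korean versions: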
  - ref : https://huggingface.co/datasets/squad_kor_v1 (Korean squad_v1 by LG CNS)
  - ref : https://huggingface.co/datasets/squad_kor_v2 (Korean squad_v2 by LG CNS)
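
The `squad` metric reports two scores, exact match and F1. The sketch below shows how such scores are commonly computed for a single prediction and a single reference, assuming the usual SQuAD v1.1 normalization (lowercasing, removing punctuation and English articles, collapsing whitespace). The function names `normalize_answer`, `exact_match`, and `f1_score` are illustrative and are not part of the 🤗 Datasets API; the real metric also handles multiple reference answers per question, which this sketch leaves out.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("the Saint Bernadette Soubirous.", "Saint Bernadette Soubirous")
f1 = f1_score("Bernadette Soubirous", "Saint Bernadette Soubirous")
print("exact match:", em, "| F1:", round(f1, 3))
```

A partially correct span such as "Bernadette Soubirous" gets no exact-match credit but still earns a high F1, which is why both numbers are usually reported together.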

In [5]:
# check dataset list
dset_list = datasets.list_datasets()
qa_dset_list = [i for i in dset_list if 'quad' in i]

print('>>> Total No of provided datasets :', len(dset_list))
print('>>> No of QA datasets :', len(qa_dset_list))
print(np.array(qa_dset_list))

>>> Total No of provided datasets : 849
>>> No of QA datasets : 21
['fquad' 'iapp_wiki_qa_squad' 'lc_quad' 'squad' 'squad_adversarial'
 'squad_es' 'squad_it' 'squad_kor_v1' 'squad_kor_v2' 'squad_v1_pt'
 'squad_v2' 'squadshifts' 'thaiqa_squad' 'xquad' 'xquad_r'
 'lhoestq/custom_squad' 'lhoestq/squad' 'piEsposito/br-quad-2.0'
 'piEsposito/br_quad_20' 'piEsposito/squad_20_ptbr'
 'susumu2357/squad_v2_sv']


In [6]:
# load dataset & metric
dset_dict = datasets.load_dataset('squad_v2' if squad_v2_flag else 'squad')
metric = datasets.load_metric("squad_v2" if squad_v2_flag else "squad")

# check dataset
print('\n>>> dataset object :')
display(dset_dict)
print('\n>>> sample data :')
display(dset_dict['train'][0])

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/4fffa6cf76083860f85fa83486ec3028e7e32c342c218ff2a620fc6b2868483a)



>>> dataset object :


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


>>> sample data :


{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}
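
In a record like the one above, each entry in `answers['answer_start']` is a character offset into `context`, and the matching entry in `answers['text']` is the substring that starts there. A minimal sketch of that relationship follows, using a shortened, hypothetical record (the context is cut down to one sentence and the offset is recomputed with `str.find`, so it does not equal the 515 shown above). The helper name `extract_answers` is an assumption made for illustration.

```python
# Shortened, hypothetical record laid out like a SQuAD example.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"
record = {
    "context": context,
    "answers": {
        "text": [answer_text],
        # Character offset of the answer inside the context.
        "answer_start": [context.find(answer_text)],
    },
}


def extract_answers(record):
    """Slice every gold answer out of the context using its character offset."""
    answers = record["answers"]
    spans = []
    for start, text in zip(answers["answer_start"], answers["text"]):
        end = start + len(text)
        spans.append(record["context"][start:end])
    return spans


recovered = extract_answers(record)
print(recovered)
```

The same check can be run on real rows of `dset_dict['train']` to confirm that the stored offsets and answer texts agree before any tokenization is done.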

In [7]:
# show a random sample of a dataset
def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # fixed seed so the same rows are shown on every run
    random.seed(42)
    picks = random.sample(range(len(dataset)), k=num_examples)

    df = pd.DataFrame(dataset[picks])
    # replace ClassLabel integer ids with their human-readable names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(dset_dict["train"], 2)

Unnamed: 0,answers,context,id,question,title
0,"{'answer_start': [346], 'text': ['Ovid']}","The meaning and origin of many archaic festivals baffled even Rome's intellectual elite, but the more obscure they were, the greater the opportunity for reinvention and reinterpretation — a fact lost neither on Augustus in his program of religious reform, which often cloaked autocratic innovation, nor on his only rival as mythmaker of the era, Ovid. In his Fasti, a long-form poem covering Roman holidays from January to June, Ovid presents a unique look at Roman antiquarian lore, popular customs, and religious practice that is by turns imaginative, entertaining, high-minded, and scurrilous; not a priestly account, despite the speaker's pose as a vates or inspired poet-prophet, but a work of description, imagination and poetic etymology that reflects the broad humor and burlesque spirit of such venerable festivals as the Saturnalia, Consualia, and feast of Anna Perenna on the Ides of March, where Ovid treats the assassination of the newly deified Julius Caesar as utterly incidental to the festivities among the Roman people. But official calendars preserved from different times and places also show a flexibility in omitting or expanding events, indicating that there was no single static and authoritative calendar of required observances. In the later Empire under Christian rule, the new Christian festivals were incorporated into the existing framework of the Roman calendar, alongside at least some of the traditional festivals.",5731ab21b9d445190005e44f,What poet wrote a long poem describing Roman religious holidays?,Religion_in_ancient_Rome
1,"{'answer_start': [64], 'text': ['hydrocarbons']}","Hydrogen forms a vast array of compounds with carbon called the hydrocarbons, and an even vaster array with heteroatoms that, because of their general association with living things, are called organic compounds. The study of their properties is known as organic chemistry and their study in the context of living organisms is known as biochemistry. By some definitions, ""organic"" compounds are only required to contain carbon. However, most of them also contain hydrogen, and because it is the carbon-hydrogen bond which gives this class of compounds most of its particular chemical characteristics, carbon-hydrogen bonds are required in some definitions of the word ""organic"" in chemistry. Millions of hydrocarbons are known, and they are usually formed by complicated synthetic pathways, which seldom involve elementary hydrogen.",56e08b457aa994140058e5e3,What is the form of hydrogen and carbon called?,Hydrogen
