If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it.

In [None]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

import psutil
import humanize
import os
import GPUtil as GPU

GPUs = GPU.getGPUs()
# Colab provides at most one GPU, and it isn't guaranteed to be available.
gpu = GPUs[0] if GPUs else None

def printm():
    # Report free system RAM and the resident size of the current process.
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " |     Proc size: " + humanize.naturalsize(process.memory_info().rss))
    if gpu is None:
        print("No GPU detected")
        return
    # Report GPU memory as seen by GPUtil.
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total     {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

Collecting gputil
  Downloading https://files.pythonhosted.org/packages/ed/0e/5c61eedde9f6c87713e89d794f01e378cfd9565847d4576fa627d758c554/GPUtil-1.4.0.tar.gz
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-cp37-none-any.whl size=7411 sha256=7164534ab3ef184db410b0a4918e9237b69124333cb4d930d54b68b0c4742dbb
  Stored in directory: /root/.cache/pip/wheels/3d/77/07/80562de4bb0786e5ea186911a2c831fdd0018bda69beab71fd
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
Gen RAM Free: 12.8 GB  |     Proc size: 112.2 MB
GPU RAM Free: 15109MB | Used: 0MB | Util   0% | Total     15109MB


In [None]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [None]:
! pip install datasets transformers > /dev/null

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/question-answering).

# Fine-tuning a model on a question-answering task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for this kind of task and use the `Trainer` API to fine-tune a model on it.

![Widget inference representing the QA task](https://github.com/huggingface/notebooks/blob/master/examples/images/question_answering.png?raw=1)

**Note:** This notebook fine-tunes models that answer questions by taking a substring of a context, not by generating new text.

This notebook is built to run on any question answering task with the same format as SQuAD (version 1 or 2), with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a token classification head and a fast tokenizer (check [this table](https://huggingface.co/transformers/index.html#bigtable) to see if this is the case). It might need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [None]:
# This flag is the difference between SQUAD v1 or 2 (if you're using another dataset, it indicates if impossible
# answers are allowed or not).
squad_v2 = False
model_checkpoint = "bert-large-uncased-whole-word-masking-finetuned-squad"
batch_size = 16

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

For our example here, we'll use the COVID-QA dataset (`covid_qa_deepset`), which follows a format similar to the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or CSV file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), you might need to adjust the names of the columns used.
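
As an illustration of the kind of column adjustment a custom file might need, here is a minimal sketch using only the standard library. The input column names (`passage`, `query`, `answer`), the sample row, and the helper `to_squad_style` are all hypothetical; the output layout mirrors the `context`, `question` and `answers` columns used in this notebook.

```python
import json

# Hypothetical custom file content whose column names differ from the SQuAD layout.
raw_json = """
[
  {"passage": "Cathepsin B is a lysosomal enzyme.",
   "query": "What is cathepsin B?",
   "answer": "a lysosomal enzyme"}
]
"""


def to_squad_style(row, index):
    """Rename columns and compute the character offset of the answer."""
    start = row["passage"].find(row["answer"])
    if start == -1:
        raise ValueError(f"answer not found in passage for row {index}")
    return {
        "id": str(index),
        "context": row["passage"],
        "question": row["query"],
        "answers": {"answer_start": [start], "text": [row["answer"]]},
    }


records = [to_squad_style(row, i) for i, row in enumerate(json.loads(raw_json))]
print(records[0]["answers"])
```

Records in this shape could then be written back to a JSON file and loaded as described in the Datasets documentation linked above.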

In [None]:
datasets = load_dataset("covid_qa_deepset")



Downloading and preparing dataset covid_qa_deepset/covid_qa_deepset (download: 4.21 MiB, generated: 62.13 MiB, post-processed: Unknown size, total: 66.35 MiB) to /root/.cache/huggingface/datasets/covid_qa_deepset/covid_qa_deepset/1.0.0/db8b3251603d2c1afd9b1dd8a46d7ab63bce6e3d14d8fc48062fd68b5a0bc6d7...



Dataset covid_qa_deepset downloaded and prepared to /root/.cache/huggingface/datasets/covid_qa_deepset/covid_qa_deepset/1.0.0/db8b3251603d2c1afd9b1dd8a46d7ab63bce6e3d14d8fc48062fd68b5a0bc6d7. Subsequent calls will reuse this data.


In [None]:
datasets["train"]

Dataset({
    features: ['document_id', 'context', 'question', 'is_impossible', 'id', 'answers'],
    num_rows: 2019
})

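The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split. This dataset provides only a `train` split, so we set aside its first 300 examples as a validation set and keep the rest for training.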

In [None]:

# Hold out the first 300 training examples for validation...
datasets["validation"] = datasets["train"].select(range(300))
# ...and keep the remaining examples for training.
datasets["train"] = datasets["train"].select(range(300, len(datasets["train"])))

In [None]:
len(datasets["validation"])

300
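
The split above takes the first 300 rows in their original order. If neighbouring rows tend to come from the same document, a shuffled split may be more representative; that is an assumption about the data, not something verified here. A minimal sketch that builds two disjoint, shuffled index lists with the standard library (the helper `split_indices` is hypothetical):

```python
import random


def split_indices(num_rows, num_validation, seed=0):
    """Return (train_indices, validation_indices) as disjoint, sorted lists."""
    indices = list(range(num_rows))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    validation = sorted(indices[:num_validation])
    train = sorted(indices[num_validation:])
    return train, validation


train_idx, val_idx = split_indices(num_rows=2019, num_validation=300)
print(len(train_idx), len(val_idx))
```

Index lists like these could be passed to `select` instead of the contiguous ranges.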

We can see that both the training and validation sets have a column for the context, the question and the answers to those questions.

To access an actual element, you need to select a split first, then give an index, for example `datasets["train"][0]`.

The answers are indicated by their start position in the text (a character offset stored in `answer_start`) and their full text (stored in `text`), which is a substring of the context as we mentioned above.
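
To make the answer format concrete, here is a minimal sketch using a hand-written record in the same layout. The record is invented for illustration and is not taken from the dataset; the helper name `answer_matches_context` is hypothetical. It checks that each answer text really is the substring of the context that starts at `answer_start`.

```python
# A hand-written record in the SQuAD-style layout (illustrative values only).
example = {
    "context": "Archazolid is a v-ATPase inhibitor isolated from a myxobacterium.",
    "question": "What kind of compound is archazolid?",
    "answers": {"answer_start": [16], "text": ["v-ATPase inhibitor"]},
}


def answer_matches_context(record):
    """Check that every answer text appears in the context at its stated offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for start, text in zip(answers["answer_start"], answers["text"])
    )


print(answer_matches_context(example))
```

The same check can be run over a real split to catch misaligned offsets before training.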

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [None]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=1):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
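
The retry loop in `show_random_elements` draws indices until it has enough distinct ones. As a side note, the same sampling without replacement can be sketched with `random.sample` from the standard library; the helper name `pick_unique_indices` and the optional `seed` parameter are assumptions added for illustration.

```python
import random


def pick_unique_indices(dataset_len, num_examples, seed=None):
    """Pick `num_examples` distinct row indices from a dataset of `dataset_len` rows."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so no retry loop is needed.
    return random.Random(seed).sample(range(dataset_len), num_examples)


picks = pick_unique_indices(dataset_len=1719, num_examples=5, seed=0)
print(picks)
```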

In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,answers,context,document_id,id,is_impossible,question
0,"{'answer_start': [775], 'text': ['v-ATPase inhibitor']}","The vacuolar-type ATPase inhibitor archazolid increases tumor cell adhesion to endothelial cells by accumulating extracellular collagen\n\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC6133348/\n\nSHA: f1b81916fac1ca3d50dde774df2e1bb26bf0fb39\n\nAuthors: Luong, Betty; Schwenk, Rebecca; Bräutigam, Jacqueline; Müller, Rolf; Menche, Dirk; Bischoff, Iris; Fürst, Robert\nDate: 2018-09-11\nDOI: 10.1371/journal.pone.0203053\nLicense: cc-by\n\nAbstract: The vacuolar-type H(+)-ATPase (v-ATPase) is the major proton pump that acidifies intracellular compartments of eukaryotic cells. Since the inhibition of v-ATPase resulted in anti-tumor and anti-metastatic effects in different tumor models, this enzyme has emerged as promising strategy against cancer. Here, we used the well-established v-ATPase inhibitor archazolid, a natural product first isolated from the myxobacterium Archangium gephyra, to study the consequences of v-ATPase inhibition in endothelial cells (ECs), in particular on the interaction between ECs and cancer cells, which has been neglected so far. Human endothelial cells treated with archazolid showed an increased adhesion of tumor cells, whereas the transendothelial migration of tumor cells was reduced. The adhesion process was independent from the EC adhesion molecules ICAM-1, VCAM-1, E-selectin and N-cadherin. Instead, the adhesion was mediated by β1-integrins expressed on tumor cells, as blocking of the integrin β1 subunit reversed this process. Tumor cells preferentially adhered to the β1-integrin ligand collagen and archazolid led to an increase in the amount of collagen on the surface of ECs. The accumulation of collagen was accompanied by a strong decrease of the expression and activity of the protease cathepsin B. Overexpression of cathepsin B in ECs prevented the capability of archazolid to increase the adhesion of tumor cells onto ECs. 
Our study demonstrates that the inhibition of v-ATPase by archazolid induces a pro-adhesive phenotype in endothelial cells that promotes their interaction with cancer cells, whereas the transmigration of tumor cells was reduced. These findings further support archazolid as a promising anti-metastatic compound.\n\nText: The vacuolar-type H + -ATPase (v-ATPase) is the major proton pump responsible for acidification of intracellular compartments in eukaryotic cells [1] . The enzyme consists of two multi-subunit complexes, the soluble V 1 transmembrane V o subcomplex required for the proton transport across membranes [1, 2] . In most cell types v-ATPases are only expressed in the endomembrane system to regulate and maintain the acidic pH of intracellular compartments such as lysosomes, endosomes, the Golgi apparatus, secretory granules and coated vesicles [3] . The function of v-ATPases is essential for cellular processes such as vesicular trafficking, receptor-mediated endocytosis and protein degradation and processing. In specialized cell types including osteoclasts and renal epithelial cells, v-ATPases can also be expressed on the plasma membrane, where they pump protons into the extracellular space [2] [3] [4] . In cancer cells v-ATPases are expressed on the plasma membrane in order to eliminate toxic cytosolic H + . Most importantly, v-ATPases contribute to the acidic tumor microenvironment, which leads to the activation of proteases, thus facilitating tumor cell migration, invasion and angiogenesis [5] [6] [7] . Since the inhibition of v-ATPase was shown to reduce the invasiveness of cancer cells and metastasis formation [8, 9] , this enzyme has emerged as a promising drug target in the recent years. Archazolid A and B are highly potent and specific inhibitors of v-ATPases [10] . They were first isolated from the myxobacterium Archangium gephyra [11] . 
These compounds inhibit v-ATPase at low nanomolar concentrations [10, 12] by binding to the subunit c of the V o complex. As their biological activity is comparable to the v-ATPase inhibitors bafilomycin and concanamycin [10, 11] , archazolids are natural compounds of high interest that can be used both as a tool to study the consequences of v-ATPase inhibition and as a lead for drug development. Archazolids can be either produced by fermentation [11] or by total synthesis [13, 14] .\n\nIn the field of cancer research several studies reported on interesting pharmacological effects of archazolid: It reduced the migration of different invasive tumor cells in vitro and cancer cell metastasis in vivo in a breast tumor mouse model [15] . Furthermore, archazolid activated pathways of cellular stress response and apoptosis in highly invasive tumor cells [16] . In classically activated macrophages, archazolid selectively induced the generation of tumor necrosis factor α (TNFα), which may indirectly promote tumor suppression [17] .\n\nUp to now, the role of v-ATPases in endothelial cells has only rarely been investigated. The endothelium plays a crucial role in the pathogenesis and progression of cancer: The metastatic cascade includes local angiogenesis at the site of the primary tumor and adhesion of tumor cells at the site of metastasis [18] . Angiogenesis, the development of new blood vessels out of existing ones, depends on the proliferation, migration and differentiation of endothelial cells [19] . This process ensures the nutrient supply of the tumor and its growth [20] . Circulating cancer cells can adhere to the endothelium at distant sites. This adhesive interaction is mediated by receptors and corresponding ligands expressed on tumor and endothelial cells [18, 21] . V-ATPases have been reported to regulate intracellular pH and cell migration in microvascular endothelial cells [22, 23] . 
A recent study showed that the inhibition of v-ATPase by concanamycin prevented proliferation, reduced migration and impaired angiogenesis-related signaling in endothelial cells [24] . So far, there are no investigations on the role of endothelial v-ATPases for the process of tumor cell adhesion onto the endothelium. Thus, we were interested in the consequences of the inhibition of endothelial v-ATPase by archazolid on the interaction between endothelial and cancer cells. Various cell adhesion molecules on the endothelium, such as intercellular adhesion molecule 1 (ICAM-1), vascular cell adhesion protein (VCAM-1), E-selectin or N-cadherin [21] as well as integrins expressed on cancer cells have been reported to mediate cell adhesion of cancer cells onto endothelial cells [25] [26] [27] . Accordingly, we focused on these cell adhesion molecules and integrins. For the first time, our study revealed a link between the function of v-ATPases and the adhesion and transmigration properties of endothelial cells.\n\nCellTiter-Blue Cell Viability Assay (Promega, Mannheim, Germany) was performed according to the manufacturer's protocol for determining the cell viability of cells after treatment with archazolid. This assay is based on the ability of metabolically active cells to reduce resazurin which results in fluorescent resorufin. The CellTiter-Blue Reagent was added to the cells 4 h before the endpoint of treatment. Fluorescence was measured with an Infinite F200 pro microplate reader (Tecan, Männedorf, Switzerland) at 560 nm (excitation) and 590 nm (emission).\n\nCytoTox 96 Non-Radioactive Cytotoxicity Assay (Promega) was performed according to the manufacturer's instructions for determining the lactate dehydrogenase (LDH) release after treatment with archazolid. Lysis buffer was added to untreated cells 45 min before the end of treatment to induce the release of this enzyme. LDH is a cytosolic enzyme that is released by leaky cells. 
Released LDH catalyzes the enzymatic conversion of lactate to pyruvate which provides NADH for the conversion of iodonitrotetrazolium violet into a red formazan product in the presence of diaphorase. The absorbance was measured with a Varioskan Flash microplate reader (Thermo Fisher Scientific) at 490 nm.\n\nLysoTracker Red DND-99 (Life Technologies, Thermo Fisher Scientific) is a dye to measure pH values in viable cells. HUVECs were cultured to confluence on collagen G-coated μ-slides (80826, ibidi, Martinsried, Germany) before they were treated with archazolid for 24 h. 1 μg/ ml Hoechst 33342 (Sigma-Aldrich, Munich, Germany) was used to visualize the nuclei and 50 nM LysoTracker Red DND-99 was used to visualize the acidic compartments which correspond to the lysosomes. Both dyes were incubated for 10 min at 37˚C before acquisition of single images by a Leica DMI6000 B fluorescence microscope (Leica Microsystems, Wetzlar, Germany).\n\nHUVECs were seeded in collagen G-coated 24-well plates and grown to confluence for two days before treatment. The cells were incubated with indicated concentrations of archazolid for 24 h. Untreated MDA-MB-231 or PC-3 cells were labeled with CellTracker Green CMFDA Dye (5 μM in serum-free DMEM, 37˚C) for 30 min before 100,000 cells per well were added to HUVECs and were allowed to adhere for various time points at 37˚C. Non-adherent tumor cells were washed off three times with PBS containing Ca 2+ and Mg 2+ . Tumor cell adhesion was determined by fluorescence measurements with an Infinite F200 pro microplate reader (Tecan) at 485 nm (excitation) and 535 nm (emission).\n\nFor blocking the integrin β1 subunit on MDA-MB-231 or PC-3 cells, CellTracker Greenlabeled MDA-MB-231 or PC-3 cells were incubated with an anti-integrin β1 antibody (P5D2, ab24693, Abcam, Cambridge, United Kingdom) at a concentration of 1 μg antibody per one million cells in 1 ml DMEM. 
Before adding to archazolid-treated HUVECs, MDA-MB-231 or PC-3 cells were washed once with DMEM. For blocking the integrin β1 subunit on HUVECs, the cells were incubated with the anti-integrin β1 antibody (0.1 μg/well in ECGM). HUVECs were washed once with ECGM before untreated MDA-MB-231 or PC-3 cells were added to HUVECs.\n\nFor the adhesion of MDA-MB-231 or PC-3 cells onto extracellular matrix (ECM) components 24-well plates were coated with collagen G (10 μg/ml in PBS), human plasma fibronectin (10 μg/ml PBS) or laminin-411 (10 μg/ml in Dulbecco's PBS [DPBS] containing Ca 2+ and Mg 2+ ) at 4˚C overnight. The adhesion of MDA-MB-231 and PC-3 cells onto these three most prominent ECM components was carried out as described above (10 min adhesion at 37˚C).\n\nHUVECs were grown on a porous filter membrane (Transwell insert, polycarbonate membrane, 8 μm pores; Corning, New York, USA) for 48 h and were treated as indicated. Untreated MDA-MB-231 cells were labeled with CellTracker Green CMFDA Dye (as described in the section cell adhesion assay) and resuspended in medium 199 (PAN-Biotech) containing 0.1% BSA. HUVECs were washed twice with medium 199 containing 0.1% BSA before MDA-MB-231 cells were allowed to transmigrate through the endothelial monolayer for 24 h. Medium 199 containing 0.1% BSA was used as negative control and medium 199 containing 20% FCS was used as chemoattractant for transmigration in the lower compartment. Non-migrated cells remaining in the upper compartment were carefully removed using a cotton swab. Transmigrated cells were lysed in radioimmunoprecipitation assay (RIPA) buffer and transmigration was quantified by measuring the fluorescence signal at 485 nm (excitation) and 535 nm (emission).\n\nHUVECs were grown to confluence on 6-well plates before they were treated with archazolid for 12 h. The cells were induced to upregulate the gene expression of cell adhesion molecules by TNFα. 
RNA was isolated using the RNeasy Mini Kit from Qiagen (Hilden, Germany) according to the manufacturer's protocol. On-column DNase digestion was performed to remove genomic DNA. RNA was transcribed into cDNA by Superscript II (Life Technologies, Thermo Fisher Scientific). qPCR experiments were performed using a StepOnePlus System (Applied Biosystems, Thermo Fisher Scientific) and data was analyzed by the StepOne and Ste-pOnePlus Software v2.3. Power SYBR Green PCR Master Mix (Life Technologies) and the comparative C T quantitation method (2 -ΔΔCT ) were used. \n\nHUVECs were grown to confluence on 12-well plates before they were treated with archazolid for 24 h. Cells were treated with TNFα for 24 h to induce the expression of cell adhesion molecules. Subsequently, the cells were detached with HyClone HyQTase (GE Healthcare, Freiburg, Germany). In the case of ICAM-1 the detached cells were fixed with 4% formaldehyde (Polysciences, Hirschberg an der Bergstraße, Germany) in PBS for 10 min and washed once with PBS before incubating with the fluorescein isothiocyanate (FITC)-labeled anti-human CD54 (mouse, ICAM-1) antibody (MCA1615F, Biozol, Eching, Germany) at room temperature for 45 min. For all other proteins, the cells were not fixed and washed once with PBS before incubating with the antibodies phycoerythrin (PE)-labeled anti-human CD106 (mouse, VCAM-1), PE-labeled anti-human CD62E (mouse, E-selectin) and PE-labeled anti-human CD325 (mouse, N-cadherin) from Becton Dickinson on ice for 45 min. These antibodies were diluted in PBS containing 0.2% BSA. The surface expression of cell adhesion molecules was measured by flow cytometry (FACSVerse, Becton Dickinson, Heidelberg, Germany).\n\nTo stain the surface collagen on HUVECs, cells were incubated with an anti-human collagen antibody (rabbit, 1:40, ab36064, Abcam) on ice for 30 min. The staining procedure was performed on ice to ensure that surface proteins or antibodies are not endocytosed. 
The cells were washed once with PBS containing Ca 2+ and Mg 2+ before they were fixed with Roti-Histofix (Carl Roth). Alexa Fluor 488-conjugated anti-rabbit antibody (goat, 1:400, A11008, Life Technologies) was used as secondary antibody and Hoechst 33342 (1 μg/ml, Sigma-Aldrich) was used to visualize nuclei.\n\nConfluent HUVECs in 6-well plates were treated as indicated. Cells were washed with ice-cold PBS and lysed with RIPA buffer supplemented with protease inhibitors (Complete Mini EDTA-free; Roche, Mannheim, Germany), sodium orthovanadate, sodium fluoride, phenylmethylsulphonyl fluoride, β-glycerophosphate, sodium pyrophosphate and H 2 O 2 . Protein determination was performed using the Pierce BCA Protein Assay Kit (Thermo Fisher Scientific). Equal amounts of proteins (10-20 μg) were separated by sodium dodecyl sulfatepolyacrylamide gel electrophoresis (SDS-PAGE; Bio-Rad Laboratories, Munich, Germany). Separated proteins were transferred onto polyvinylidene difluoride membranes by tank blotting (Bio-Rad Laboratories) for immunodetection. Membranes were blocked with 5% boltinggrade milk powder (Carl Roth) in TBS containing 0.1% Tween 20 (Sigma-Aldrich). The following antibodies were used: mouse anti-human cathepsin B antibody (IM27L, Merck) (1:500), mouse anti-β-actin-peroxidase antibody (A3854, Sigma-Aldrich) (1:100,000) and antimouse IgG horse radish peroxidase (HRP)-linked antibody (7076, Cell Signaling, Frankfurt, Germany) (1:5,000). ImageJ version 1.49m was used for densitometric analysis.\n\nCathepsin B activity assay was performed as described in the publication by Kubisch et al. [28] . Confluent HUVECs or HMEC-1 seeded in 6-well plates were treated as indicated. Cells were washed with PBS and lysed with the non-denaturating M-PER mammalian protein extraction reagent (78501, Thermo Fisher Scientific) supplemented with protease inhibitors (Complete Mini EDTA-free, Roche), sodium orthovanadate, sodium fluoride, phenylmethylsulphonyl fluoride. 
The fluorogenic cathepsin B substrate Z-Arg-Arg-7-amido-4-methylcoumarin hydrochloride (C5429, Sigma-Aldrich) was added to 30 μg of the cell lysate diluted in assay buffer supplemented with 2 mM L-cysteine (C7880, Sigma-Aldrich) and incubated for 30 min at 40˚C. Fluorescence was measured at 348 nm (excitation) and 440 nm (emission) with a microplate reader (Varioskan Flash, Thermo Fisher Scientific). The intensity of the fluorescence signal corresponded to the cathepsin B enzyme activity. For background subtraction the cathepsin B inhibitor CA-074Me (Enzo Life Sciences, Lörrach, Germany) was added to an additional reaction.\n\nThe HUVEC Nucleofector Kit (Lonza, Cologne, Germany) was used to transfect HUVECs. The transfection was performed according to the manufacturer's protocol using 2.5 μg plasmid DNA for 500,000 cells (Nucleofector 2b Device, Lonza). 48 h after transfection the cells were treated for further experiments. The addgene plasmid #11249 hCathepsin B was kindly provided by Hyeryun Choe [29] . hCathepsin B was digested with PmeI and XbaI and the linear DNA fragment not corresponding to the human CTSB gene was religated to generate the empty pcDNA3.1 (-) delta MCS plasmid that was used for control transfections. The original backbone of hCathepsin B is the pcDNA3.1 (-) from Thermo Fisher Scientific. The control vector pcDNA3.1 (-) delta MCS used for our transfections was cloned on the basis of hCathepsin B and is therefore lacking almost the whole part of the multiple cloning site of the pcDNA3.1 (-).\n\nStatistical analyses were performed using GraphPad Prism 5.0 (San Diego, USA). One-way ANOVA followed by Tukey's post-hoc test or unpaired t-test was used for the evaluation of a minimum of three independent experiments. The numbers of independently performed experiments (n) are stated in the corresponding figure legends. p 0.05 was considered as statistically significant. 
Data are expressed as mean ± standard error of the mean (SEM).\n\nSince the v-ATPase inhibitor archazolid has never been used for studies in endothelial cells, we first performed cytotoxicity assays. We treated confluent HUVECs with up to 1 nM archazolid for 24 and 48 h and observed that this treatment has neither an influence on the metabolic activity nor on the release of LDH after 24 h (Fig 1A and 1B, left panels) . The metabolic activity and the release of LDH were only slightly affected by the highest concentration of archazolid after 48 h (Fig 1A and 1B, right panels) . Consequently, the following experiments were all carried out after 24 h (or less) of archazolid treatment in order to exclude any cytotoxic effects of archazolid within our experimental settings.\n\nMicroscopic analysis revealed that also the integrity of the endothelial monolayer was not affected by archazolid, but the cells showed a slightly different morphology (Fig 2A) : Archazolid-treated cells were swollen compared to control cells, which was not unexpected, as vacuolation of the endoplasmic reticulum (ER) has been described for other cell types and is typical for v-ATPase inhibitors [11, 16, 24, 30] . This effect was obvious both in subconfluent and in confluent cells (Fig 2A) . Inhibition of v-ATPase prevents the acidification of lysosomes [1, 31] . Using the cell-permeable dye LysoTracker Red DND-99, it is possible to label the acidic lysosomes in living cells. Thus, this dye can serve as an indicator of v-ATPase inhibition. To proof that archazolid is also functionally active as a v-ATPase inhibitor in HUVECs, cells were treated with 1 nM archazolid before they were incubated with LysoTracker Red DND-99 and Hoechst 33342. As shown in Fig 2B, the red vesicular staining corresponding to acidified lysosomes in control cells disappeared completely after treatment with archazolid. 
In summary, archazolid treatment for 24 h was not cytotoxic to quiescent HUVECs, but inhibited the functionality of the v-ATPase.\n\nWe analyzed the adhesion of MDA-MB-231 cells onto HUVECs. Confluent HUVECs were treated with up to 1 nM archazolid for 24 h. Untreated MDA-MB-231 cells were labeled with Cell-Tracker Green CMFDA Dye. Interestingly, v-ATPase inhibition strongly increased the attachment of the metastatic breast carcinoma cell line MDA-MB-231 onto HUVECs after 10 and 120 min of adhesion (Fig 3A and 3B) . We also investigated the influence of archazolid on the transendothelial migration of MDA-MB-231 cells. HUVECs seeded in a Boyden chamber were treated with 1 nM archazolid for 24 h. CellTracker Green-labeled MDA-MB-231 cells (not treated with archazolid) were allowed to transmigrate through the endothelial monolayer for 24 h. As shown in Fig 3C, archazolid significantly decreased the transendothelial migration of MDA-MB-231 cells.\n\nThe influence of archazolid on tumor cell adhesion was not only studied in HUVECs, which represent macrovascular endothelial cells, but also in microvascular HMEC-1 cells. Moreover, besides the breast cancer cell line MDA-MB-231, also PC-3 prostate cancer cells were used as a second metastatic cancer cell line. Archazolid treatment of endothelial cells increased the attachment of MDA-MB-231 cells onto the HMEC-1 monolayer after 120 min of adhesion ( Fig 4A) and increased the attachment of PC-3 cells onto the HUVEC monolayer after 30 and 60 min of adhesion (Fig 4B) . Of note, the adhesion of non-metastatic Jurkat cells, an acute T cell leukemia cell line, remained unaffected after treatment of HUVECs with archazolid (S1A Fig). Taken together, archazolid treatment augmented the adhesive properties of both micro-and macrovascular endothelial cells for metastatic tumor cells, but not for non-metastatic ones. 
Of note, cancer cell adhesion onto archazolid-activated endothelial cells increased with the time of adhesion.\n\nThe adhesion of tumor cells onto the endothelium is in principle similar to that of leukocytes, but slightly differs in the molecules that mediate the adhesion process. Ligands for the endothelial cell adhesion molecules ICAM-1, VCAM-1, E-selectin and N-cadherin were found to be expressed on tumor cells and to mediate tumor-endothelial cell interaction [21] . Inhibition of the v-ATPase might affect the expression of endothelial cell adhesion molecules on mRNA or protein levels. To determine the mRNA expression of ICAM-1, VCAM-1, E-selectin and Ncadherin, HUVECs were treated with archazolid for 12 h. TNFα is known to upregulate the expression of ICAM-1, VCAM-1 and E-selectin [32] and, thus, served as positive control. Quantitative real-time PCR showed that v-ATPase inhibition in HUVECs did not alter the mRNA levels of ICAM-1, VCAM-1, E-selectin and N-cadherin (Fig 5A) . The protein expression of these adhesion molecules on the surface of endothelial cells was analyzed by flow cytometry. Archazolid (1 nM, 24 h) did not affect the cell surface expression of ICAM-1, VCAM-1, E-selectin and N-cadherin (Fig 5B) .\n\nBesides ICAM-1, VCAM-1, E-selectin and N-cadherin, also integrins are able to mediate the process of cell adhesion [33] [34] [35] . Since none of the cell adhesion molecules expressed on HUVECs were regulated upon archazolid treatment, we considered integrins as potential interaction partners. Within this protein family β1-integrins have been reported to mediate tumor cell adhesion onto quiescent endothelial cells [25] . In order to elucidate the role of β1-integrins for the archazolid-induced tumor cell adhesion, the integrin β1-subunit was blocked either on MDA-MB-231 cells, PC-3 cells or on HUVECs. (Of note, as in all experiments throughout this study, only endothelial cells were treated with archazolid.) 
After blocking β1-integrins on MDA-MB-231 or PC-3 cells, the archazolid-induced tumor cell adhesion was reduced almost to control level (Fig 6A and 6B , left panels), whereas blocking of β1-integrins on HUVECs had no significant effect on the increase of tumor cell adhesion by v-ATPase inhibition (Fig 6A and 6B , right panels).\n\nDepending on their α subunit, β1-integrins have a variety of ligands including extracellular matrix (ECM) components such as collagen, fibronectin and laminin [35] . Therefore, we hypothesized that archazolid treatment of endothelial cells might lead to an upregulation of these components. MDA-MB-231 and PC-3 cells were allowed to adhere onto plastic that was coated with these ECM components. This cell adhesion assay revealed that MDA-MB-231 as well as PC-3 cells favor the interaction with the ECM component collagen, as the adhesion onto collagen is much higher than onto the uncoated plastic control (Fig 7A) . MDA-MB-231 and PC-3 cells also adhered to fibronectin-coated plastic, but to a much lesser extent compared to the collagen coating. Therefore, we focused on the interaction between these two tumor cell lines and collagen. Blocking of the integrin β1 subunit on MDA-MB-231 and PC-3 cells clearly abolished the interaction with collagen (Fig 7B) , indicating that the attachment of these tumor cells to collagen is mediated by β1-integrins.\n\nSince collagen is the major ECM component MDA-MB-231 and PC-3 cells interact with, the next step was to prove whether v-ATPase inhibition influences the amount of collagen expressed by HUVECs as extracellular matrix. To detect collagen on the endothelial surface, archazolid-treated HUVECs were labeled with an antibody against collagen type I-IV on ice to prevent endocytosis and to ensure that the antibody does not bind to intracellular collagen. Interestingly, archazolid increased the amount of surface collagen on HUVECs by about 50% (Fig 7C) . 
Control stainings were performed using an antibody against the cytosolic p65 subunit of the transcription factor nuclear factor κB (NFκB) to show that intracellular proteins were not detected by this staining method (S2 Fig) . \n\nIt was reported that v-ATPase inhibition by archazolid impairs the activity of cathepsin B [28, 36] , a lysosomal enzyme that degrades extracellular matrix components including collagen [37] [38] [39] [40] [41] . As collagen is degraded by cathepsin B and the activation of cathepsin B depends on v-ATPase activity [28, [36] [37] [38] 42] , we suggested that an accumulation of collagen on the surface of endothelial cells might be a consequence of an impaired functionality of cathepsin B. Therefore, an enzyme activity assay based on the proteolysis of a fluorogenic cathepsin B substrate was performed. In archazolid-treated HUVECs and HMEC-1 the activity of cathepsin B was induce both the mRNA (1 ng/ml TNF) and the cell surface expression (10 ng/ml TNF) of ICAM-1, VCAM-1, E-selectin and Ncadherin.\n\nhttps://doi.org/10.1371/journal.pone.0203053.g005 Inhibition of endothelial vATPase increases tumor cell adhesion to endothelial cells strongly decreased by approximately 50% compared to control cells at an archazolid concentration of 1 nM (Fig 8A) . In line with this result, western blot analysis showed that archazolid (1 nM) reduces the protein expression of the mature, active form of cathepsin B to less than 40% of the control in HUVECs (Fig 8B) . To proof whether the archazolid-induced tumor cell adhesion is a consequence of the decreased amount of cathepsin B, HUVECs were transfected with a plasmid coding for human cathepsin B or with the empty vector as control. After 48 h, the transfected cells were treated with 1 nM archazolid. The level of cathepsin B after transfection and treatment was assessed by western blot analysis (Fig 9A) . 
Overexpression of cathepsin B strongly diminished both the basal and the archazolid-induced adhesion of MDA-MB-231 cells (Fig 9B) .\n\nTargeting the proton pump v-ATPase for cancer therapy has gained great interest since its inhibition was reported to reduce the invasiveness of cancer cells and, most importantly, also metastasis [8, 9] . Thus, intensive research related to v-ATPases was done in cancer cells, whereas there are only few studies investigating v-ATPases in endothelial cells indicating a role in migration, proliferation and possibly angiogenesis [22] [23] [24] . In the present study we used the myxobacterial natural product archazolid to investigate the consequences of v-ATPase inhibition in the endothelium on tumor-endothelial cell interactions.\n\nFor the first time, we were able to show a link between v-ATPase and the adhesion and transmigration properties of the endothelium. Inhibition of the v-ATPase in endothelial cells by archazolid significantly increased the adhesion of metastatic cancer cells and decreased the transendothelial migration of cancer cells which was attributed to augmented collagen levels on the surface on archazolid-treated endothelial cells. Of note, adhesion of the non-metastatic Jurkat cell line onto archazolid-treated endothelial cells remained unaffected. The archazolidinduced adhesion of tumor cells was independent from the endothelial cell adhesion molecules ICAM-1, VCAM-1, E-selectin and N-cadherin, as their expression was not regulated by the compound. However, we found that the archazolid-induced tumor cell adhesion was mediated by β1-integrins expressed on MDA-MB-231 breast cancer and PC-3 prostate cancer cells as blocking of the integrin β1 subunit on these tumor cells reversed the pro-adhesive effect of archazolid. 
In adhesion experiments on plastic coated with extracellular matrix components, we could show that MDA-MB-231 and PC-3 cells clearly favored the interaction with collagen, whereas the adhesion of non-metastatic Jurkat cells was largely independent from extracellular matrix proteins (S1B Fig). The different adhesion properties of metastatic cancer cells and Jurkat cells might be a result of the distinct integrin expression pattern of each cell line. MDA-MB-231 and PC-3 cells express α2β1-and α3β1-integrins, which represent collagen receptors [43, 44] , while Jurkat cells express α4β1-integrins but lack α2β1-, α3β1-integrins [44] . α4β1integrins are receptors for VCAM-1 and fibronectin [35] and it has been shown that Jurkat cells interact with human endothelial cells that express VCAM-1 after cytokine treatment or cells transfected with VCAM-1 [45] . Our results are in line with previous studies showing that α2β1-and α3β1-integrin expressing MDA-MB-231 and PC-3 cells were able to rapidly attach to collagen in the cortical bone matrix. In contrast, Jurkat cells were not able to adhere [44] and might preferentially interact with cell adhesion molecules rather than with ECM proteins. α2β1-and α3β1-integrins can additionally act as laminin receptors [46] and at least α3β1integrins recognize fibronectin [46, 47] . Though expressing receptors for fibronectin and laminin, MDA-MB-231 and PC-3 cells adhered to fibronectin to a much lesser extent and did not adhere to laminin, probably due to lower affinities to these extracellular matrix components.\n\nImportantly, v-ATPase inhibition by archazolid increased the surface levels of the extracellular matrix component collagen, which might explain that the increase of MDA-MB-231 and PC-3 cells onto archazolid-treated HUVECs is independent of endothelial cell adhesion molecules. By performing a live cell proteolysis assay, Cavallo-Medved et al. 
demonstrated ECM degradation, in particular of gelatin and collagen IV, in association with active cathepsin B in caveolae of endothelial cells during tube formation [40] . In addition, recent studies reported that v-ATPase inhibition impairs the activity of cathepsin B in cancer cells [28, 36] . Therefore, we suggested that the accumulation of collagen on the endothelial surface might be a consequence of impaired cathepsin B activity or expression in endothelial cells. In fact, we confirmed the impairment of cathepsin B activity by archazolid as the expression levels of the mature active form of this enzyme was strongly reduced. Cathepsin B is synthesized as preprocathepsin B on membrane-bound ribosomes. Following transport to the Golgi apparatus, the preprocathepsin B is glycosylated with mannose-containing oligosaccharides. The targeting of procathepsin B to lysosomes is mannose-6-phosphate receptor-dependent and its dissociation from the receptor as well as its proteolytic processing into mature cathepsin B requires acidification of the compartment [48] . In cancer cells v-ATPase inhibition by archazolid impaired the mannose-6-phosphate receptor-mediated trafficking from the trans-Golgi network to prelysosomal compartments resulting in a decrease of active lysosomal proteases like cathepsin B [28] . We assumed that the archazolid-induced decrease in cathepsin B activity and expression was based on the same mechanism. Interestingly, overexpression of cathepsin B attenuated the archazolid-induced adhesion of breast cancer cells onto endothelial cells, indicating that the adhesion negatively correlates with the expression of cathepsin B.\n\nAs cathepsin B can also degrade other extracellular matrix components such as fibronectin and laminin [38, 49] , v-ATPase inhibition could lead to an accumulation of these proteins and an increased adhesion of cells expressing fibronectin or laminin receptors. 
However, we did not focus on these ECM components since they were not relevant for the adhesion of MDA-MB-231 and PC-3 cells. These cells predominantly adhered to collagen, while the adhesion of Jurkat cells is mostly independent from the ECM proteins collagen, fibronectin or laminin (S1B Fig). Interestingly [50] . In hepatic cancer cells, archazolid reduces Ras/Raf/MEK/ERK signaling by altering the membrane composition and fluidity [51] . We assume that archazolid affects endothelial cells in a similar way leading to inhibition of Ras signaling and, therefore, reduced transendothelial migration of MDA-MB-231 cells.\n\nTaken together, our study shows that archazolid reduces the activity and expression of cathepsin B in endothelial cells. As a result, the amount of collagen on the surface of endothelial cells was significantly upregulated, which finally resulted in an increased adhesion of the β1-integrin-expressing metastatic cancer cell lines MDA-MB-231 and PC-3 onto archazolidtreated endothelial cells, whereas the adhesion of non-metastatic Jurkat cells was unaffected. This study shows that the v-ATPase plays an important role in regulating the adhesion of cells expressing receptors for extracellular matrix components. Archazolid represents a promising tool to elucidate the role of v-ATPase in endothelial cells. Moreover, we for the first time linked the function of v-ATPase to the adhesion and transmigration of tumor cells onto endothelial cells as well as to the remodeling of the extracellular matrix on the surface of endothelial cells. The fact that the adhesion of metastatic tumor cells onto endothelial cells is increased while their transendothelial migration is reduced upon inhibition of endothelial v-ATPase by archazolid further supports the view of archazolid as a potential anti-metastatic compound.",1633,5302,False,What is archazolid?


## Preprocessing the training data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs that the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The following assertion ensures that our tokenizer is a fast tokenizer (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, and we will need some of the special features they have for our preprocessing.

In [None]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

You can check which type of models have a fast tokenizer available and which don't on the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

You can directly call this tokenizer on two sentences (one for the question, one for the context):

In [None]:
tokenizer("What is your name?", "My name is Sylvain.")

{'input_ids': [101, 2054, 2003, 2115, 2171, 1029, 102, 2026, 2171, 2003, 25353, 22144, 2378, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later). You can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

One thing specific to preprocessing for question answering is how to deal with very long documents. In other tasks we usually truncate them when they are longer than the model's maximum sequence length, but here, removing part of the context might result in losing the answer we are looking for. To deal with this, we will allow one (long) example in our dataset to give several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). Also, just in case the answer lies at the point where we split a long context, we allow some overlap between the features we generate, controlled by the hyper-parameter `doc_stride`:

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting it is needed.

Let's find one long example in our dataset:

In [None]:
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break
example = datasets["train"][i]

Token indices sequence length is longer than the specified maximum sequence length for this model (5818 > 512). Running this sequence through the model will result in indexing errors


Without any truncation, we get the following length for the input IDs:

In [None]:
len(tokenizer(example["question"], example["context"])["input_ids"])

5818

Now, if we just truncate, we will lose information (and possibly the answer to our question):

In [None]:
len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])

384

Note that we never want to truncate the question, only the context, hence the `only_second` truncation we picked. Now, our tokenizer can automatically return a list of features capped at a certain maximum length, with the overlap we talked about above. We just have to tell it to do so with `return_overflowing_tokens=True` and by passing the stride:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

Now we don't have one list of `input_ids`, but several: 

In [None]:
[len(x) for x in tokenized_example["input_ids"]]

[384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384,
 384]
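
The count of 23 features can be sanity-checked by hand. The sketch below is a simplified model of the sliding window, not the tokenizer's actual implementation. It assumes 6 question tokens and 3 special tokens (`[CLS]` plus two `[SEP]`) for a BERT-style pair, and the helper name `window_starts` is made up for illustration.

```python
def window_starts(n_context_tokens, window, overlap):
    """Start offsets of consecutive context windows that share `overlap` tokens."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window")
    step = window - overlap
    starts = [0]
    # Keep adding windows until the last one reaches the end of the context.
    while starts[-1] + window < n_context_tokens:
        starts.append(starts[-1] + step)
    return starts


max_length, doc_stride = 384, 128
total_tokens = 5818   # untruncated length reported above
question_tokens = 6   # assumption: what / can / nuclear / receptors / regulate / ?
special_tokens = 3    # assumption: [CLS] + two [SEP]

context_tokens = total_tokens - question_tokens - special_tokens
window = max_length - question_tokens - special_tokens  # context tokens per feature
starts = window_starts(context_tokens, window, doc_stride)
print(len(starts), window - doc_stride)  # 23 247
```

Under these assumptions each feature holds 375 context tokens and the window advances by 247 tokens, which gives 23 windows, matching the 23 lengths printed above.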

And if we decode them, we can see the overlap:

In [None]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

[CLS] what can nuclear receptors regulate? [SEP] inr - drug : predicting the interaction of drugs with nuclear receptors in cellular networking https : / / www. ncbi. nlm. nih. gov / pmc / articles / pmc3975431 / sha : ee55aea26f816403476a7cb71816b8ecb1110329 authors : fan, yue - nong ; xiao, xuan ; min, jian - liang ; chou, kuo - chen date : 2014 - 03 - 19 doi : 10. 3390 / ijms15034915 license : cc - by abstract : nuclear receptors ( nrs ) are closely associated with various major diseases such as cancer, diabetes, inflammatory disease, and osteoporosis. therefore, nrs have become a frequent target for drug development. during the process of developing drugs against these diseases by targeting nrs, we are often facing a problem : given a nr and chemical compound, can we identify whether they are really in interaction with each other in a cell? to address this problem, a predictor called “ inr - drug ” was developed. in the predictor, the drug compound concerned was formulated by a 256

Now this will give us some work to properly treat the answers: we need to find in which of those features the answer actually is, and where exactly in that feature. The models we will use require the start and end positions of these answers in the tokens, so we will also need to map parts of the original context to some tokens. Thankfully, the tokenizer we're using can help us with that by returning an `offset_mapping`:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 4), (5, 8), (9, 16), (17, 26), (27, 35), (35, 36), (0, 0), (0, 2), (2, 3), (3, 4), (4, 8), (8, 9), (10, 20), (21, 24), (25, 36), (37, 39), (40, 45), (46, 50), (51, 58), (59, 68), (69, 71), (72, 80), (81, 91), (93, 98), (98, 99), (99, 100), (100, 101), (101, 104), (104, 105), (105, 107), (107, 109), (109, 110), (110, 112), (112, 113), (113, 114), (114, 116), (116, 117), (117, 118), (118, 121), (121, 122), (122, 124), (124, 125), (125, 126), (126, 134), (134, 135), (135, 137), (137, 138), (138, 140), (140, 142), (142, 144), (144, 145), (145, 146), (148, 151), (151, 152), (153, 155), (155, 157), (157, 160), (160, 162), (162, 163), (163, 164), (164, 166), (166, 168), (168, 170), (170, 171), (171, 172), (172, 173), (173, 174), (174, 176), (176, 177), (177, 179), (179, 181), (181, 182), (182, 183), (183, 185), (185, 186), (186, 188), (188, 190), (190, 192), (192, 193), (195, 202), (202, 203), (204, 207), (207, 208), (209, 212), (212, 213), (213, 216), (216, 217), (217, 218), (21

This gives, for each index of our input IDs, the corresponding start and end character in the original text that gave our token. The very first token (`[CLS]`) has (0, 0) because it doesn't correspond to any part of the question or context, then the second token corresponds to characters 0 to 3 of the question (the end offset, 4, is exclusive):

In [None]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

what What
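
To make the idea of an offset mapping concrete, here is a minimal sketch that builds one by hand with a regular expression. It is only an illustration: the real tokenizer splits into word pieces rather than whole words, and `simple_offsets` is a hypothetical helper, not part of 🤗 Transformers.

```python
import re


def simple_offsets(text):
    """Return (token, (start, end)) pairs for words and punctuation marks."""
    return [(m.group().lower(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


question = "What can nuclear receptors regulate?"
pairs = simple_offsets(question)
for token, (start, end) in pairs:
    # Slicing the original text with the offsets recovers the surface form.
    assert question[start:end].lower() == token
print(pairs[0], pairs[-1])  # ('what', (0, 4)) ('?', (35, 36))
```

For this question the spans coincide with the first six non-special offsets printed above, because none of its words is split into sub-word pieces.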


So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which parts correspond to the context. This is where the `sequence_ids` method of our `tokenized_example` can be useful:

In [None]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

It returns `None` for the special tokens, then 0 or 1 depending on whether the corresponding token comes from the first sentence passed (the question) or the second (the context). Now with all of this, we can find the first and last token of the answer in one of our input features (or determine that the answer is not in this feature):

In [None]:
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

The answer is not in this feature.
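
Since the first feature of our long example does not contain the answer, it can help to run the same logic on a case small enough to check by hand. The sketch below wraps the steps of the cell above in a function and applies it to hand-written offsets and sequence IDs. The function name `find_answer_token_span` and the toy data are assumptions made for illustration, not output of the real tokenizer.

```python
def find_answer_token_span(offsets, sequence_ids, start_char, end_char, context_id=1):
    """Return (start_token, end_token) of the answer, or None if it is outside this feature."""
    # First and last token that belong to the context.
    first = 0
    while sequence_ids[first] != context_id:
        first += 1
    last = len(sequence_ids) - 1
    while sequence_ids[last] != context_id:
        last -= 1

    # The answer must lie entirely inside the characters covered by this feature.
    if not (offsets[first][0] <= start_char and offsets[last][1] >= end_char):
        return None

    # Walk inwards until the tokens line up with the answer boundaries.
    while first < len(offsets) and offsets[first][0] <= start_char:
        first += 1
    while offsets[last][1] >= end_char:
        last -= 1
    return first - 1, last + 1


# Hand-written toy feature for: [CLS] who ? [SEP] my name is sylvain . [SEP]
context = "My name is Sylvain."
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 2), (3, 7), (8, 10), (11, 18), (18, 19), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, None]

start_char = context.index("Sylvain")
end_char = start_char + len("Sylvain")
span = find_answer_token_span(offsets, sequence_ids, start_char, end_char)
print(span)  # (7, 7)
print(find_answer_token_span(offsets, sequence_ids, 100, 107))  # None
```

Token 7 is `sylvain`, so the start and end positions coincide for this single-token answer. The second call mimics an answer that lives in a later part of a long context and therefore falls outside this feature.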


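When the answer does lie in the feature, we can double-check that the tokens we found decode to the expected answer (the check is commented out here because the answer is not in this first feature):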

In [None]:
# print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
# print(answers["text"][0])

For this notebook to work with any kind of models, we need to account for the special case where the model expects padding on the left (in which case we switch the order of the question and the context):

In [None]:
pad_on_right = tokenizer.padding_side == "right"

Now let's put everything together in one function that we will apply to our training set. In the case of impossible answers (the answer is in another feature given by an example with a long context), we set the CLS index for both the start and end position. We could also simply discard those examples from the training set if the flag `allow_impossible_answers` is `False`. Since the preprocessing is already complex enough as it is, we've kept it simple for this part.

In [None]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit with the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
features = prepare_train_features(datasets['train'][:5])

To apply this function to all the examples in our dataset, we just use the `map` method of the `datasets` object we created earlier. This will apply the function to all the elements of all the splits in `datasets`, so our training, validation and testing data will be preprocessed in one single command. Since our preprocessing changes the number of samples, we need to remove the old columns when applying it.

In [None]:
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
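
To see why the original columns have to be dropped when a batched function returns more rows than it received, here is a toy stand-in written in plain Python. It does not use the 🤗 Datasets API: `expand_batch` is a hypothetical function that splits strings into fixed-size chunks, playing the role of `prepare_train_features`.

```python
def expand_batch(batch, window=3):
    """Toy batched function: split each text into chunks of `window` characters."""
    chunks, sample_mapping = [], []
    for i, text in enumerate(batch["text"]):
        for start in range(0, len(text), window):
            chunks.append(text[start:start + window])
            sample_mapping.append(i)  # remember which input row produced this chunk
    return {"chunk": chunks, "overflow_to_sample_mapping": sample_mapping}


batch = {"id": ["a", "b"], "text": ["abcdefg", "xy"]}
out = expand_batch(batch)
print(len(batch["id"]), len(out["chunk"]))  # 2 4

# The old "id" column no longer lines up row by row with the new rows,
# but the mapping lets us recover the source row of every chunk.
ids = [batch["id"][i] for i in out["overflow_to_sample_mapping"]]
print(ids)  # ['a', 'a', 'a', 'b']
```

Two rows go in and four come out, so the old columns cannot be kept alongside the new ones. The sample mapping plays the same role as `overflow_to_sample_mapping` in `prepare_train_features`.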

## Fine-tuning the model

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

The warning is telling us we are throwing away some weights (the `vocab_transform` and `vocab_layer_norm` layers) and randomly initializing some other (the `pre_classifier` and `classifier` layers). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
args = TrainingArguments(
    "test-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay.

Then we will need a data collator that will batch our processed examples together, here the default one will work:

In [None]:
from transformers import default_data_collator

data_collator = default_data_collator

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

We can now finetune our model by just calling the `train` method:

In [None]:
# trainer.train()

Since this training is particularly long, let's save the model just in case we need to restart.

In [None]:
# trainer.save_model("distil-covid-trained")

## Evaluation

Evaluating our model will require a bit more work, as we will need to map the predictions of our model back to parts of the context. The model itself predicts logits for the start and end positions of our answers: if we take a batch from our validation dataloader, here is the output our model gives us:

In [None]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

The output of the model is a dict-like object that contains the loss (since we provided labels) and the start and end logits. We won't need the loss for our predictions, so let's have a look at the logits:

In [None]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([16, 384]), torch.Size([16, 384]))

We have one logit for each feature and each token. The most obvious way to predict an answer for each feature is to take the index of the maximum of the start logits as the start position and the index of the maximum of the end logits as the end position.

In [None]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([145,   0,   0,   0,   0, 293,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0], device='cuda:0'),
 tensor([150,  14,  14,   0,  14, 299,   0,   0,   0,   0,   0,  14,  14,  14,
           0, 146], device='cuda:0'))

This will work great in a lot of cases, but what if this prediction gives us something impossible? The start position could be greater than the end position, or point to a span of text in the question instead of the context. In that case, we might want to look at the second best prediction to see if it gives a possible answer and select that instead.

However, picking the second best answer is not as easy as picking the best one: is it the second best index in the start logits with the best index in the end logits? Or the best index in the start logits with the second best index in the end logits? And if that second best answer is not possible either, it gets even trickier for the third best answer.
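To make the ambiguity concrete, here is a small sketch with made-up logits (the values and the five-token length are assumptions for illustration, not model output). It checks the pair formed by the two argmaxes and the two natural candidates for "second best", and all three turn out to be impossible spans.

```python
# Toy logits for a 5-token feature (hypothetical values, not model output).
start_logits = [0.1, 0.3, 2.0, 5.0, 0.2]
end_logits = [0.0, 4.0, 3.5, 1.0, 0.5]


def ranked_indices(logits):
    """Return token indices ordered from highest to lowest logit."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)


def is_valid(start, end):
    """A span is only possible if it does not end before it starts."""
    return start <= end


start_rank = ranked_indices(start_logits)
end_rank = ranked_indices(end_logits)

best = (start_rank[0], end_rank[0])      # best start with best end
second_a = (start_rank[1], end_rank[0])  # second-best start with best end
second_b = (start_rank[0], end_rank[1])  # best start with second-best end

for name, span in [("best", best), ("second (a)", second_a), ("second (b)", second_b)]:
    print(name, span, "valid" if is_valid(*span) else "impossible")
```

With these values every one of the three candidates ends before it starts, so looking one rank further in a single direction is not enough.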


To rank our answers, we will use the score obtained by adding the start and end logits. We won't try to order all the possible answers; instead we limit the number of candidates with a hyper-parameter we call `n_best_size`. We'll pick the best `n_best_size` indices in the start and end logits and gather all the answers these predict. After checking if each one is valid, we will sort them by their score and keep the best one. Here is how we would do this on the first feature in the batch:

In [None]:
n_best_size = 20

In [None]:
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )
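The slice used above to pick the top indices can be checked on a small array. This is a minimal sketch with hypothetical values; `n_best` stands in for `n_best_size`.

```python
import numpy as np

logits = np.array([0.5, 3.0, -1.0, 2.0, 4.0])  # hypothetical values
n_best = 3

# np.argsort sorts in ascending order, so the reversed slice walks back from the
# last (largest) entry and stops after n_best items.
top = np.argsort(logits)[-1 : -n_best - 1 : -1].tolist()
print(top)

# An equivalent spelling: reverse the ascending order, then truncate.
same = np.argsort(logits)[::-1][:n_best].tolist()
print(same)
```

Both forms return the indices of the largest logits, ordered from best to worst.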

And then we can sort the `valid_answers` according to their `score` and only keep the best one. The only point left is how to check a given span is inside the context (and not the question) and how to get back the text inside. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [None]:
def prepare_validation_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
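As a toy illustration of the masking step in the function above, the sketch below uses hand-written sequence ids and offsets (assumed values standing in for tokenizer output, not produced by a real tokenizer). The helper `span_text` is hypothetical and only shows how a token span maps back to characters of the context once question tokens are set to `None`.

```python
# Hand-written stand-ins for tokenizer output (assumed values, for illustration only).
context = "Paris is the capital of France."
# sequence_ids: None for special tokens, 0 for question tokens, 1 for context tokens.
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, None]
offsets = [
    (0, 0), (0, 5), (5, 6), (0, 0),                     # special, question, question, special
    (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), (30, 31),  # context tokens
    (0, 0),                                             # special
]
context_index = 1

# Same masking as in prepare_validation_features: keep offsets only for context tokens.
masked = [o if sequence_ids[k] == context_index else None for k, o in enumerate(offsets)]


def span_text(start_token, end_token):
    """Return the context substring for a token span, or None if it leaves the context."""
    if masked[start_token] is None or masked[end_token] is None:
        return None
    return context[masked[start_token][0] : masked[end_token][1]]


print(span_text(7, 9))  # a span fully inside the context
print(span_text(1, 5))  # a span starting in the question
```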

And like before, we can apply that function to our validation set easily:

In [None]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

Now we can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(validation_features)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

We can now refine the test we had before: since we set `None` in the offset mappings for tokens that belong to the question, it's easy to check if an answer is fully inside the context. We also eliminate very long answers from our considerations (with a hyper-parameter we can tune).

In [None]:
max_answer_length = 30

In [None]:
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index.
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        # At this point the span is inside the context and has a valid length.
        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char:end_char],
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 15.663379, 'text': 'Mother-to-child transmission'},
 {'score': 14.666584, 'text': 'Mother-to-child transmission (MTCT)'},
 {'score': 13.394403, 'text': 'Mother-to-child transmission (MTCT'},
 {'score': 12.228895, 'text': 'MTCT)'},
 {'score': 11.11519, 'text': 'Mother-to-child'},
 {'score': 10.956715, 'text': 'MTCT'},
 {'score': 10.29887, 'text': 'Mother-to-Child Transmission'},
 {'score': 9.768068,
  'text': 'Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection'},
 {'score': 9.374699, 'text': '(MTCT)'},
 {'score': 9.229617,
  'text': 'Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children'},
 {'score': 8.881094, 'text': 'transmission'},
 {'score': 8.728713,
  'text': 'Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide'},
 {'score': 8.455909,
  'text': 'Mother-to-child transmission (MTCT) is the main cause'},
 {'score': 8.389352,
  'text': 'Mother-to-child transmission (MTCT) is the 

We can compare to the actual ground-truth answer:

In [None]:
datasets["validation"][0]["answers"]

{'answer_start': [370],
 'text': ['Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide.']}

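Our model's most likely answer, 'Mother-to-child transmission', is the opening span of the ground-truth answer, so it points to the right place even though it is not an exact match.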

As we mentioned in the code above, this was easy on the first feature because we knew it comes from the first example. For the other features, we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather together all the answers in all the features generated by a given example, then pick the best one. The following code builds a map from each example index to its corresponding feature indices:

In [None]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
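On a tiny hypothetical input the resulting map looks as follows. The ids below are made up, and example `"b"` is assumed to have been split into two features because of a long context.

```python
import collections

# Hypothetical ids: one entry per example, then one entry per feature.
example_ids = ["a", "b", "c"]
feature_example_ids = ["a", "b", "b", "c"]

example_id_to_index = {k: i for i, k in enumerate(example_ids)}
features_per_example = collections.defaultdict(list)
for i, example_id in enumerate(feature_example_ids):
    features_per_example[example_id_to_index[example_id]].append(i)

# Example 1 ("b") owns features 1 and 2.
print(dict(features_per_example))
```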

We're almost ready for our post-processing function. The last bit to deal with is the impossible answer (when `squad_v2 = True`). The code above only keeps answers that are inside the context, so we also need to grab the score for the impossible answer (which has start and end indices corresponding to the index of the CLS token). When one example gives several features, we should predict the impossible answer only when all the features give a high score to it (since one feature could predict the impossible answer just because the answer isn't in the part of the context it has access to). This is why the score of the impossible answer for one example is the *minimum* of the scores for the impossible answer in each feature generated by the example.
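A minimal sketch of that aggregation, with made-up scores: the helper names `example_null_score` and `pick_answer` are hypothetical and not part of any library. It takes the minimum null score over the features of one example and returns the empty string only when that minimum is at least as high as the best span score.

```python
def example_null_score(feature_null_scores):
    """Null score of an example: the minimum over the features it was split into."""
    return min(feature_null_scores)


def pick_answer(best_text, best_score, feature_null_scores):
    """Keep the best span unless the example-level null score is at least as high."""
    return best_text if best_score > example_null_score(feature_null_scores) else ""


# Hypothetical scores: two of three features see no answer in their window,
# but the third one contains the answer and gives the null answer a low score.
null_scores = [9.0, 8.5, 1.2]
print(example_null_score(null_scores))
print(repr(pick_answer("MTCT", 7.0, null_scores)))

# If every feature scores the null answer highly, the example is predicted as impossible.
print(repr(pick_answer("MTCT", 7.0, [9.0, 8.5])))
```

Taking the maximum instead would let a single window without the answer force an empty prediction for the whole example.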

We then predict the impossible answer when that score is greater than the score of the best non-impossible answer. All combined together, this gives us this post-processing function:

In [None]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score > feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Post-processing 300 example predictions split into 6116 features.


In [None]:
# import pickle
import json

# with open('output-distil-covid.pickle', 'wb') as handle:
#     pickle.dump(final_predictions, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('bert-covid.json', 'w', encoding='utf-8') as f:
    json.dump(final_predictions, f, indent=4, ensure_ascii=False)
    

In [None]:
# import pickle
# file = open("output-distil-covid.pickle",'rb')
# object_file = pickle.load(file)
# file.close()

In [None]:
# import pickle


# with open('output-distil-covid.pickle', 'wb') as handle:
#     pickle.dump(final_predictions, handle, protocol=pickle.HIGHEST_PROTOCOL)

# with open('{}-covid-1.json'.format("distil"), 'w', encoding='utf-8') as f:
#     json.dump(object_file, f, indent=4, ensure_ascii=False)

In [None]:
from collections import Counter
import re
import string

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

In [None]:
def f1_score(prediction, ground_truth):
    prediction_tokens = prediction.split()
    ground_truth_tokens = ground_truth.split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    return (prediction == ground_truth)
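As a worked example of the token-level metrics, the sketch below scores the top prediction shown earlier against its ground-truth answer. It is self-contained, so `normalize` and `token_f1` are compact re-statements of the functions above (the names are chosen here for illustration), applying the same steps in the same order.

```python
import re
import string
from collections import Counter


def normalize(s):
    """Lowercase, drop punctuation, drop articles, collapse whitespace."""
    punctuation = set(string.punctuation)
    s = "".join(ch for ch in s.lower() if ch not in punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def token_f1(prediction, ground_truth):
    """Token-overlap F1 between two answers, computed on normalized text."""
    pred, gold = normalize(prediction).split(), normalize(ground_truth).split()
    num_same = sum((Counter(pred) & Counter(gold)).values())
    if num_same == 0:
        return 0.0
    precision, recall = num_same / len(pred), num_same / len(gold)
    return 2 * precision * recall / (precision + recall)


prediction = "Mother-to-child transmission"
truth = "Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide."

# 2 shared tokens, 2 predicted tokens, 12 reference tokens after normalization:
# precision = 1, recall = 1/6, F1 = 2/7.
print(token_f1(prediction, truth))
print(normalize(prediction) == normalize(truth))  # exact match
```

The prediction gets no credit under exact match but a partial score under token-level F1, which is why the two metrics are usually reported together.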

In [None]:
my_predictions = list(final_predictions.items())

for i in range(0,len(my_predictions)):
  my_predictions[i] = normalize_answer(my_predictions[i][1])

In [None]:
orig = list(datasets["validation"])
for i in range(0,len(orig)):
  orig[i] = normalize_answer(orig[i]["answers"]["text"][0])  

In [None]:
em = 0
for i in range(0,len(orig)):
  if orig[i] == my_predictions[i]: em+=1


print("Exact Match {}".format(em/len(orig)))

Exact Match 0.3342036553524804
Exact Match 0.4266666666666667
300


In [None]:
# Note: scikit-learn's f1_score shadows the token-level f1_score defined above and treats every
# normalized answer string as a class label, so these are classification scores over exact
# strings (the micro average equals the exact-match rate), not a token-overlap F1.
from sklearn.metrics import f1_score

for average in ("weighted", "micro", "macro"):
    f1 = f1_score(orig, my_predictions, average=average)
    print("f1_score {} {}".format(f1, average))

f1_score 0.4244444444444444 weighted
f1_score 0.42666666666666675 micro
f1_score 0.27279358132749815 macro
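One reading of these numbers: when every distinct normalized string is treated as a class label, the micro-averaged F1 reduces to the fraction of exact matches, which would explain why the micro value equals the 0.4267 exact-match score. The sketch below checks that identity by hand on a few hypothetical labels, without scikit-learn; `micro_f1` is an illustrative helper, not a library function.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label predictions, from counts pooled over all labels."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            tp += int(t == label and p == label)
            fp += int(t != label and p == label)
            fn += int(t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical normalized answers: two exact matches out of four.
y_true = ["mtct", "hiv1", "children", "plasma"]
y_pred = ["mtct", "hiv", "children", ""]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, micro_f1(y_true, y_pred))
```

Each wrong prediction counts once as a false positive (for the predicted string) and once as a false negative (for the true string), so pooled precision and recall both equal the accuracy.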
