If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the `pip install` cell below to do so.

In [None]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

import psutil
import humanize
import os
import GPUtil as GPU

GPUs = GPU.getGPUs()
# XXX: Colab provides at most one GPU, and even that isn't guaranteed
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " |     Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total     {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

Collecting gputil
  Downloading https://files.pythonhosted.org/packages/ed/0e/5c61eedde9f6c87713e89d794f01e378cfd9565847d4576fa627d758c554/GPUtil-1.4.0.tar.gz
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-cp37-none-any.whl size=7411 sha256=40242dfc8554716ee205231f0c7bf3611e4a7c34e743628efca117396d78e279
  Stored in directory: /root/.cache/pip/wheels/3d/77/07/80562de4bb0786e5ea186911a2c831fdd0018bda69beab71fd
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
Gen RAM Free: 12.8 GB  |     Proc size: 118.3 MB
GPU RAM Free: 15109MB | Used: 0MB | Util   0% | Total     15109MB


In [None]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [None]:
! pip install datasets transformers > /dev/null

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/question-answering).

# Fine-tuning a model on a question-answering task

In this notebook, we will see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for this kind of task and use the `Trainer` API to fine-tune a model on it.

![Widget inference representing the QA task](https://github.com/huggingface/notebooks/blob/master/examples/images/question_answering.png?raw=1)

**Note:** This notebook fine-tunes models that answer questions by taking a substring of a context, not by generating new text.

This notebook is built to run on any question answering task with the same format as SQuAD (version 1 or 2), with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a token classification head and a fast tokenizer (check [this table](https://huggingface.co/transformers/index.html#bigtable) to see whether this is the case). It might need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [None]:
# This flag is the difference between SQUAD v1 or 2 (if you're using another dataset, it indicates if impossible
# answers are allowed or not).
squad_v2 = False
model_checkpoint = "deepset/roberta-base-squad2-covid"
batch_size = 16
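
To make the `squad_v2` flag more concrete: in SQuAD v2-style data, a question with no answer in the context is typically stored with empty `text` and `answer_start` lists. The sketch below uses two made-up records and a hypothetical helper, `has_answer`, to tell the two cases apart. It is an illustration only, not code from the rest of this notebook.

```python
# Two made-up records in the SQuAD column layout (illustrative data only).
answerable = {
    "question": "What does NR stand for?",
    "answers": {"text": ["Nuclear receptor"], "answer_start": [0]},
}
unanswerable = {
    "question": "Who discovered nuclear receptors?",
    "answers": {"text": [], "answer_start": []},
}


def has_answer(example):
    """Hypothetical helper: True when at least one answer span is annotated."""
    return len(example["answers"]["text"]) > 0


print(has_answer(answerable))
print(has_answer(unanswerable))
```

With `squad_v2 = False`, every example is expected to look like the first record.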

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [None]:
from datasets import load_dataset, load_metric

For our example here, we'll use the `covid_qa_deepset` dataset, which has the same `context`, `question` and `answers` columns as the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). The notebook should work with any question answering dataset provided by the 🤗 Datasets library. If you're using your own dataset defined from a JSON or CSV file (see the [Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) on how to load them), you might need to adjust the column names used.

In [None]:
datasets = load_dataset("covid_qa_deepset")

Downloading and preparing dataset covid_qa_deepset/covid_qa_deepset (download: 4.21 MiB, generated: 62.13 MiB, post-processed: Unknown size, total: 66.35 MiB) to /root/.cache/huggingface/datasets/covid_qa_deepset/covid_qa_deepset/1.0.0/db8b3251603d2c1afd9b1dd8a46d7ab63bce6e3d14d8fc48062fd68b5a0bc6d7...



Dataset covid_qa_deepset downloaded and prepared to /root/.cache/huggingface/datasets/covid_qa_deepset/covid_qa_deepset/1.0.0/db8b3251603d2c1afd9b1dd8a46d7ab63bce6e3d14d8fc48062fd68b5a0bc6d7. Subsequent calls will reuse this data.


In [None]:
datasets["train"]

Dataset({
    features: ['document_id', 'context', 'question', 'is_impossible', 'id', 'answers'],
    num_rows: 2019
})

The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key per split. This dataset only ships a `train` split, so the next cell holds out part of it as a validation set.

In [None]:

# Hold out the first 300 rows for validation and keep the remaining rows for training.
datasets["validation"] = datasets["train"].select(range(300))
datasets["train"] = datasets["train"].select(range(300, len(datasets["train"])))
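
Taking the first 300 rows keeps the data in its original order. If a shuffled hold-out is preferred, one option is to draw the indices at random and pass them to `select`. The helper below is a hypothetical, standard-library-only sketch: `split_indices` is not part of 🤗 Datasets, and the seed value is an arbitrary assumption.

```python
import random


def split_indices(num_rows, num_validation, seed=0):
    """Hypothetical helper: return disjoint (train, validation) index lists."""
    if not 0 <= num_validation <= num_rows:
        raise ValueError("num_validation must be between 0 and num_rows")
    indices = list(range(num_rows))
    # Use a local generator so the global random state is left untouched.
    random.Random(seed).shuffle(indices)
    validation = sorted(indices[:num_validation])
    train = sorted(indices[num_validation:])
    return train, validation


train_idx, valid_idx = split_indices(2019, 300)
print(len(train_idx), len(valid_idx))
```

Both lists would have to be applied to the original `train` split, saved before it is reassigned. 🤗 Datasets also provides `Dataset.train_test_split`, which may be a simpler route to the same result.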

In [None]:
datasets["train"][0]

{'answers': {'answer_start': [2040],
  'text': ['homeostasis, differentiation, embryonic development, and organ physiology']},
 'context': 'iNR-Drug: Predicting the Interaction of Drugs with Nuclear Receptors in Cellular Networking\n\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC3975431/\n\nSHA: ee55aea26f816403476a7cb71816b8ecb1110329\n\nAuthors: Fan, Yue-Nong; Xiao, Xuan; Min, Jian-Liang; Chou, Kuo-Chen\nDate: 2014-03-19\nDOI: 10.3390/ijms15034915\nLicense: cc-by\n\nAbstract: Nuclear receptors (NRs) are closely associated with various major diseases such as cancer, diabetes, inflammatory disease, and osteoporosis. Therefore, NRs have become a frequent target for drug development. During the process of developing drugs against these diseases by targeting NRs, we are often facing a problem: Given a NR and chemical compound, can we identify whether they are really in interaction with each other in a cell? To address this problem, a predictor called “iNR-Drug” was developed. In the predi

We can see that each split has a column for the context, the question and the answers to that question.

To access an actual element, select a split first, then give an index, as in `datasets["train"][0]` above.

The answers are indicated by their start position in the text (here at character 2040) and their full text, which is a substring of the context as we mentioned above.
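
The relationship between `answer_start`, `text` and `context` can be checked directly. The sketch below uses a short made-up record (not taken from the dataset) and a hypothetical helper, `answer_matches_context`, to verify that slicing the context at the start position recovers the answer text.

```python
# A made-up record in the same column layout as the dataset (illustrative only).
example = {
    "context": "Nuclear receptors (NRs) are a frequent target for drug development.",
    "answers": {"text": ["drug development"], "answer_start": [50]},
}


def answer_matches_context(example):
    """Hypothetical helper: check every answer is the substring at its offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for start, text in zip(answers["answer_start"], answers["text"])
    )


print(answer_matches_context(example))
```

A check like this can help catch offset errors when adapting the notebook to a custom dataset.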

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).

In [None]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=1):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
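
The loop in `show_random_elements` keeps drawing until it has `num_examples` distinct indices. As a side note, the standard library's `random.sample` returns distinct picks in a single call. The sketch below isolates just that step in a hypothetical helper, `pick_unique_indices`, and leaves out the display logic.

```python
import random


def pick_unique_indices(dataset_len, num_examples=1):
    """Hypothetical helper: draw distinct row indices without a retry loop."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    # random.sample draws without replacement, so the indices never repeat.
    return random.sample(range(dataset_len), num_examples)


picks = pick_unique_indices(1719, 5)
print(len(picks), len(set(picks)))
```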

In [None]:
show_random_elements(datasets["train"])

Unnamed: 0,answers,context,document_id,id,is_impossible,question
0,"{'answer_start': [28444], 'text': ['Rhinolophus bats seem to harbor a wide diversity of CoVs']}","Characterization of a New Member of Alphacoronavirus with Unique Genomic Features in Rhinolophus Bats\n\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC6521148/\n\nSHA: ee14de143337eec0e9708f8139bfac2b7b8fdd27\n\nAuthors: Wang, Ning; Luo, Chuming; Liu, Haizhou; Yang, Xinglou; Hu, Ben; Zhang, Wei; Li, Bei; Zhu, Yan; Zhu, Guangjian; Shen, Xurui; Peng, Cheng; Shi, Zhengli\nDate: 2019-04-24\nDOI: 10.3390/v11040379\nLicense: cc-by\n\nAbstract: Bats have been identified as a natural reservoir of a variety of coronaviruses (CoVs). Several of them have caused diseases in humans and domestic animals by interspecies transmission. Considering the diversity of bat coronaviruses, bat species and populations, we expect to discover more bat CoVs through virus surveillance. In this study, we described a new member of alphaCoV (BtCoV/Rh/YN2012) in bats with unique genome features. Unique accessory genes, ORF4a and ORF4b were found between the spike gene and the envelope gene, while ORF8 gene was found downstream of the nucleocapsid gene. All the putative genes were further confirmed by reverse-transcription analyses. One unique gene at the 3’ end of the BtCoV/Rh/YN2012 genome, ORF9, exhibits ~30% amino acid identity to ORF7a of the SARS-related coronavirus. Functional analysis showed ORF4a protein can activate IFN-β production, whereas ORF3a can regulate NF-κB production. We also screened the spike-mediated virus entry using the spike-pseudotyped retroviruses system, although failed to find any fully permissive cells. Our results expand the knowledge on the genetic diversity of bat coronaviruses. Continuous screening of bat viruses will help us further understand the important role played by bats in coronavirus evolution and transmission.\n\nText: Members of the Coronaviridae family are enveloped, non-segmented, positive-strand RNA viruses with genome sizes ranging from 26-32 kb [1] . 
These viruses are classified into two subfamilies: Letovirinae, which contains the only genus: Alphaletovirus; and Orthocoronavirinae (CoV), which consists of alpha, beta, gamma, and deltacoronaviruses (CoVs) [2, 3] . Alpha and betacoronaviruses mainly infect mammals and cause human and animal diseases. Gamma-and delta-CoVs mainly infect birds, but some can also infect mammals [4, 5] . Six human CoVs (HCoVs) are known to cause human diseases. HCoV-HKU1, HCoV-OC43, HCoV-229E, and HCoV-NL63 commonly cause mild respiratory illness or asymptomatic infection; however, severe acute respiratory syndrome coronavirus (SARS-CoV) and\n\nAll sampling procedures were performed by veterinarians, with approval from Animal Ethics Committee of the Wuhan Institute of Virology (WIVH5210201). The study was conducted in accordance with the Guide for the Care and Use of Wild Mammals in Research of the People's Republic of China.\n\nBat fecal swab and pellet samples were collected from November 2004 to November 2014 in different seasons in Southern China, as described previously [16] .\n\nViral RNA was extracted from 200 µL of fecal swab or pellet samples using the High Pure Viral RNA Kit (Roche Diagnostics GmbH, Mannheim, Germany) as per the manufacturer's instructions. RNA was eluted in 50 µL of elution buffer, aliquoted, and stored at -80 • C. One-step hemi-nested reverse-transcription (RT-) PCR (Invitrogen, San Diego, CA, USA) was employed to detect coronavirus, as previously described [17, 18] .\n\nTo confirm the bat species of an individual sample, we PCR amplified the cytochrome b (Cytob) and/or NADH dehydrogenase subunit 1 (ND1) gene using DNA extracted from the feces or swabs [19, 20] . The gene sequences were assembled excluding the primer sequences. 
BLASTN was used to identify host species based on the most closely related sequences with the highest query coverage and a minimum identity of 95%.\n\nFull genomic sequences were determined by one-step PCR (Invitrogen, San Diego, CA, USA) amplification with degenerate primers (Table S1 ) designed on the basis of multiple alignments of available alpha-CoV sequences deposited in GenBank or amplified with SuperScript IV Reverse Transcriptase (Invitrogen) and Expand Long Template PCR System (Roche Diagnostics GmbH, Mannheim, Germany) with specific primers (primer sequences are available upon request). Sequences of the 5' and 3' genomic ends were obtained by 5' and 3' rapid amplification of cDNA ends (SMARTer Viruses 2019, 11, 379 3 of 19 RACE 5'/3' Kit; Clontech, Mountain View, CA, USA), respectively. PCR products were gel-purified and subjected directly to sequencing. PCR products over 5kb were subjected to deep sequencing using Hiseq2500 system. For some fragments, the PCR products were cloned into the pGEM-T Easy Vector (Promega, Madison, WI, USA) for sequencing. At least five independent clones were sequenced to obtain a consensus sequence.\n\nThe Next Generation Sequencing (NGS) data were filtered and mapped to the reference sequence of BatCoV HKU10 (GenBank accession number NC_018871) using Geneious 7.1.8 [21] . Genomes were preliminarily assembled using DNAStar lasergene V7 (DNAStar, Madison, WI, USA). Putative open reading frames (ORFs) were predicted using NCBI's ORF finder (https://www.ncbi.nlm.nih.gov/ orffinder/) with a minimal ORF length of 150 nt, followed by manual inspection. The sequences of the 5' untranslated region (5'-UTR) and 3'-UTR were defined, and the leader sequence, the leader and body transcriptional regulatory sequence (TRS) were identified as previously described [22] . 
The cleavage of the 16 nonstructural proteins coded by ORF1ab was determined by alignment of aa sequences of other CoVs and the recognition pattern of the 3C-like proteinase and papain-like proteinase. Phylogenetic trees based on nt or aa sequences were constructed using the maximum likelihood algorithm with bootstrap values determined by 1000 replicates in the MEGA 6 software package [23] . Full-length genome sequences obtained in this study were aligned with those of previously reported alpha-CoVs using MUSCLE [24] . The aligned sequences were scanned for recombination events by using Recombination Detection Program [25] . Potential recombination events as suggested by strong p-values (<10 -20 ) were confirmed using similarity plot and bootscan analyses implemented in Simplot 3.5.1 [26] . The number of synonymous substitutions per synonymous site, Ks, and the number of nonsynonymous substitutions per nonsynonymous site, Ka, for each coding region were calculated using the Ka/Ks calculation tool of the Norwegian Bioinformatics Platform (http://services.cbu.uib.no/tools/kaks) with default parameters [27] . The protein homology detection was analyzed using HHpred (https://toolkit.tuebingen.mpg.de/#/tools/hhpred) with default parameters [28] .\n\nA set of nested RT-PCRs was employed to determine the presence of viral subgenomic mRNAs in the CoV-positive samples [29] . Forward primers were designed targeting the leader sequence at the 5'-end of the complete genome, while reverse primers were designed within the ORFs. 
Specific and suspected amplicons of expected sizes were purified and then cloned into the pGEM-T Easy vector for sequencing.\n\nBat primary or immortalized cells (Rhinolophus sinicus kidney immortalized cells, RsKT; Rhinolophus sinicus Lung primary cells, RsLu4323; Rhinolophus sinicus brain immortalized cells, RsBrT; Rhinolophus affinis kidney primary cells, RaK4324; Rousettus leschenaultii Kidney immortalized cells, RlKT; Hipposideros pratti lung immortalized cells, HpLuT) generated in our laboratory were all cultured in DMEM/F12 with 15% FBS. Pteropus alecto kidney cells (Paki) was maintained in DMEM/F12 supplemented with 10% FBS. Other cells were maintained according to the recommendations of American Type Culture Collection (ATCC, www.atcc.org).\n\nThe putative accessory genes of the newly detected virus were generated by RT-PCR from viral RNA extracted from fecal samples, as described previously [30] . The influenza virus NS1 plasmid was generated in our lab [31] . The human bocavirus (HBoV) VP2 plasmid was kindly provided by prof. Hanzhong Wang of the Wuhan Institute of Virology, Chinese Academy of Sciences. SARS-CoV ORF7a was synthesized by Sangon Biotech. The transfections were performed with Lipofectamine 3000 Reagent (Life Technologies). Expression of these accessory genes were analyzed by Western blotting using an mAb (Roche Diagnostics GmbH, Mannheim, Germany) against the HA tag. \n\nThe virus isolation was performed as previously described [12] . Briefly, fecal supernatant was acquired via gradient centrifugation and then added to Vero E6 cells, 1:10 diluted in DMEM. After incubation at 37°C for 1 h the inoculum was replaced by fresh DMEM containing 2% FBS and the antibiotic-antimycotic (Gibco, Grand Island, NY, USA). Three blind passages were carried out. Cells were checked daily for cytopathic effect. Both culture supernatant and cell pellet were examined for CoV by RT-PCR [17] .\n\nApoptosis was analyzed as previously described [18] . 
Briefly, 293T cells in 12-well plates were transfected with 3 µg of expression plasmid or empty vector, and the cells were collected 24 h post transfection. Apoptosis was detected by flow cytometry using by the Annexin V-FITC/PI Apoptosis Detection Kit (YEASEN, Shanghai, China) following the manufacturer's instructions. Annexin-V-positive and PI-negative cells were considered to be in the early apoptotic phase and those stained for both Annexin V and PI were deemed to undergo late apoptosis or necrosis. All experiments were repeated three times. Student's t-test was used to evaluate the data, with p < 0.05 considered significant.\n\nHEK 293T cells were seeded in 24-well plates and then co-transfected with reporter plasmids (pRL-TK and pIFN-βIFN-or pNF-κB-Luc) [30] , as well as plasmids expressing accessory genes, empty vector plasmid pcAGGS, influenza virus NS1 [32] , SARS-CoV ORF7a [33] , or HBoV VP2 [34] . At 24 h post transfection, cells were treated with Sendai virus (SeV) (100 hemagglutinin units [HAU]/mL) or human tumor necrosis factor alpha (TNF-α; R&D system) for 6 h to activate IFNβ or NF-κB, respectively. Cell lysates were prepared, and luciferase activity was measured using the dual-luciferase assay kit (Promega, Madison, WI, USA) according to the manufacturer's instructions.\n\nRetroviruses pseudotyped with BtCoV/Rh/YN2012 RsYN1, RsYN3, RaGD, or MERS-CoV spike, or no spike (mock) were used to infect human, bat or other mammalian cells in 96-well plates. The pseudovirus particles were confirmed with Western blotting and negative-staining electromicroscopy. 
The production process, measurements of infection and luciferase activity were conducted, as described previously [35, 36] .\n\nThe complete genome nucleotide sequences of BtCoV/Rh/YN2012 strains RsYN1, RsYN2, RsYN3, and RaGD obtained in this study have been submitted to the GenBank under MG916901 to MG916904.\n\nThe surveillance was performed between November 2004 to November 2014 in 19 provinces of China. In total, 2061 fecal samples were collected from at least 12 Rhinolophus bat species ( Figure 1A ). CoVs were detected in 209 of these samples ( Figure 1B and Table 1 ). Partial RdRp sequences suggested the presence of at least 8 different CoVs. Five of these viruses are related to known species: Mi-BatCoV 1 (>94% nt identity), Mi-BatCoV HKU8 [37] (>93% nt identity), BtRf-AlphaCoV/HuB2013 [11] (>99% nt identity), SARSr-CoV [38] (>89% nt identity), and HKU2-related CoV [39] (>85% nt identity). While the other three CoV sequences showed less than 83% nt identity to known CoV species. These three viruses should represent novel CoV species. Virus isolation was performed as previously described [12] , but was not successful. identity). While the other three CoV sequences showed less than 83% nt identity to known CoV species. These three viruses should represent novel CoV species. Virus isolation was performed as previously described [12] , but was not successful. \n\nWe next characterized a novel alpha-CoV, BtCoV/Rh/YN2012. It was detected in 3 R.affinis and 6 R.sinicus, respectively. Based on the sequences, we defined three genotypes, which represented by RsYN1, RsYN3, and RaGD, respectively. Strain RsYN2 was classified into the RsYN3 genotype. Four full-length genomes were obtained. Three of them were from R.sinicus (Strain RsYN1, RsYN2, and RsYN3), while the other one was from R.affinis (Strain RaGD). The sizes of these 4 genomes are between 28,715 to 29,102, with G+C contents between 39.0% to 41.3%. 
The genomes exhibit similar structures and transcription regulatory sequences (TRS) that are identical to those of other alpha-CoVs ( Figure 2 and Table 2 ). Exceptions including three additional ORFs (ORF3b, ORF4a and ORF4b) were observed. All the 4 strains have ORF4a & ORF4b, while only strain RsYN1 has ORF3b.\n\nThe replicase gene, ORF1ab, occupies~20.4 kb of the genome. The replicase gene, ORF1ab, occupies~20.4 kb of the genome. It encodes polyproteins 1a and 1ab, which could be cleaved into 16 non-structural proteins (Nsp1-Nsp16). The 3'-end of the cleavage sites recognized by 3C-like proteinase (Nsp4-Nsp10, Nsp12-Nsp16) and papain-like proteinase (Nsp1-Nsp3) were confirmed. The proteins including Nsp3 (papain-like 2 proteas, PL2pro), Nsp5 (chymotrypsin-like protease, 3CLpro), Nsp12 (RdRp), Nsp13 (helicase), and other proteins of unknown function ( Table 3 ). The 7 concatenated domains of polyprotein 1 shared <90% aa sequence identity with those of other known alpha-CoVs ( Table 2 ), suggesting that these viruses represent a novel CoV species within the alpha-CoV. The closest assigned CoV species to BtCoV/Rh/YN2012 are BtCoV-HKU10 and BtRf-AlphaCoV/Hub2013. The three strains from Yunnan Province were clustered into two genotypes (83% genome identity) correlated to their sampling location. The third genotype represented by strain RaGD was isolated to strains found in Yunnan (<75.4% genome identity). We then examined the individual genes ( Table 2) . All of the genes showed low aa sequence identity to known CoVs. The four strains of BtCoV/Rh/YN2012 showed genetic diversity among all different genes except ORF1ab (>83.7% aa identity). Notably, the spike proteins are highly divergent among these strains. Other structure proteins (E, M, and N) are more conserved than the spike and other accessory proteins. Comparing the accessory genes among these four strains revealed that the strains of the same genotype shared a 100% identical ORF3a. 
However, the proteins encoded by ORF3as were highly divergent among different genotypes (<65% aa identity). The putative accessory genes were also BLASTed against GenBank records. Most accessory genes have no homologues in GenBank-database, except for ORF3a (52.0-55.5% aa identity with BatCoV HKU10 ORF3) and ORF9 (28.1-32.0% aa identity with SARSr-CoV ORF7a). We analyzed the protein homology with HHpred software. The results showed that ORF9s and SARS-CoV OR7a are homologues (possibility: 100%, E value <10 −48 ). We further screened the genomes for potential recombination evidence. No significant recombination breakpoint was detected by bootscan analysis.\n\nTo confirm the presence of subgenomic RNA, we designed a set of primers targeting all the predicted ORFs as described. The amplicons were firstly confirmed via agarose-gel electrophoresis and then sequencing ( Figure 3 and Table 2 ). The sequences showed that all the ORFs, except ORF4b, had preceding TRS. Hence, the ORF4b may be translated from bicistronic mRNAs. In RsYN1, an additional subgenomic RNA starting inside the ORF3a was found through sequencing, which led to a unique ORF3b. \n\nTo confirm the presence of subgenomic RNA, we designed a set of primers targeting all the predicted ORFs as described. The amplicons were firstly confirmed via agarose-gel electrophoresis and then sequencing ( Figure 3 and Table 2 ). The sequences showed that all the ORFs, except ORF4b, had preceding TRS. Hence, the ORF4b may be translated from bicistronic mRNAs. In RsYN1, an additional subgenomic RNA starting inside the ORF3a was found through sequencing, which led to a unique ORF3b. \n\nPhylogenetic trees were constructed using the aa sequences of RdRp and S of BtCoV/Rh/YN2012 and other representative CoVs (Figure 4) . In both trees, all BtCoV/Rh/YN2012 were clustered together and formed a distinct lineage to other known coronavirus species. Two distinct sublineages were observed within BtCoV/Rh/YN2012. 
One was from Ra sampled in Guangdong, while the other was from Rs sampled in Yunnan Among the strains from Yunnan, RsYN2 and RsYN3 were clustered together, while RsYN1 was isolated. The topology of these four strains was correlated to the sampling location. The relatively long branches reflect a high diversity among these strains, indicating a long independent evolution history. \n\nPhylogenetic trees were constructed using the aa sequences of RdRp and S of BtCoV/Rh/YN2012 and other representative CoVs (Figure 4) . In both trees, all BtCoV/Rh/YN2012 were clustered together and formed a distinct lineage to other known coronavirus species. Two distinct sublineages were observed within BtCoV/Rh/YN2012. One was from Ra sampled in Guangdong, while the other was from Rs sampled in Yunnan Among the strains from Yunnan, RsYN2 and RsYN3 were clustered together, while RsYN1 was isolated. The topology of these four strains was correlated to the sampling location. The relatively long branches reflect a high diversity among these strains, indicating a long independent evolution history. \n\nPhylogenetic trees were constructed using the aa sequences of RdRp and S of BtCoV/Rh/YN2012 and other representative CoVs (Figure 4) . In both trees, all BtCoV/Rh/YN2012 were clustered together and formed a distinct lineage to other known coronavirus species. Two distinct sublineages were observed within BtCoV/Rh/YN2012. One was from Ra sampled in Guangdong, while the other was from Rs sampled in Yunnan Among the strains from Yunnan, RsYN2 and RsYN3 were clustered together, while RsYN1 was isolated. The topology of these four strains was correlated to the sampling location. The relatively long branches reflect a high diversity among these strains, indicating a long independent evolution history. \n\nThe Ka/Ks ratios (Ks is the number of synonymous substitutions per synonymous sites and Ka is the number of nonsynonymous substitutions per nonsynonymous site) were calculated for all genes. 
The Ka/Ks ratios for most of the genes were generally low, which indicates these genes were under purified selection. However, the Ka/Ks ratios of ORF4a, ORF4b, and ORF9 (0.727, 0.623, and 0.843, respectively) were significantly higher than those of other ORFs (Table 4 ). For further selection pressure evaluation of the ORF4a and ORF4b gene, we sequenced another four ORF4a and ORF4b genes (strain Rs4223, Rs4236, Rs4240, and Ra13576 was shown in Figure 1B \n\nAs SARS-CoV ORF7a was reported to induce apoptosis, we conducted apoptosis analysis on BtCoV/Rh/YN2012 ORF9, a~30% aa identity homologue of SARSr-CoV ORF7a. We transiently transfected ORF9 of BtCoV/Rh/YN2012 into HEK293T cells to examine whether this ORF9 triggers apoptosis. Western blot was performed to confirm the expression of ORF9s and SARS-CoV ORF7a ( Figure S1 ). ORF9 couldn't induce apoptosis as the ORF7a of SARS-CoV Tor2 ( Figure S2 ). The results indicated that BtCoV/Rh/YN2012 ORF9 was not involved in apoptosis induction.\n\nTo determine whether these accessory proteins modulate IFN induction, we transfected reporter plasmids (pIFNβ-Luc and pRL-TK) and expression plasmids to 293T cells. All the cells over-expressing the accessory genes, as well as influenza virus NS1 (strain PR8), HBoV VP2, or empty vector were tested for luciferase activity after SeV infection. Luciferase activity stimulated by SeV was remarkably higher than that without SeV treatment as expected. Influenza virus NS1 inhibits the expression from IFN promoter, while HBoV VP2 activate the expression. Compared to those controls, the ORF4a proteins exhibit an active effect as HBoV VP2 ( Figure 5A ). Other accessory proteins showed no effect on IFN production ( Figure S3 ). Expression of these accessory genes were confirmed by Western blot ( Figure S1 ). was remarkably higher than that without SeV treatment as expected. Influenza virus NS1 inhibits the expression from IFN promoter, while HBoV VP2 activate the expression. 
Compared to those controls, the ORF4a proteins exhibit an active effect as HBoV VP2 ( Figure 5A ). Other accessory proteins showed no effect on IFN production ( Figure S3 ). Expression of these accessory genes were confirmed by Western blot (Figure S1 ). Samples were collected at 6 h postinfection, followed by dual-luciferase assay. The results were expressed as the firefly luciferase value normalized to that of Renilla luciferase. (B) ORF3a protein activate NF-κB. 293T cells were transfected with 100 ng pNF-κB-Luc, 10 ng pRL-TK, empty vector (500 ng), an NS1-expressing plasmid (500 ng), a SARS-CoV ORF7a-expressing plasmid (500 ng), or ORF3a-expressing plasmids (500 ng). After 24 h, the cells were treated with TNF-α. Dual-luciferase activity was determined after 6 h. The results were expressed as the firefly luciferase activity normalized to that of Renilla luciferase. The experiments were performed three times independently. Data are representative of at least three independent experiments, with each determination performed in triplicate (mean ± SD of fold change). Asterisks indicate significant differences between groups (compared with Empty vector-NC, p < 0.05, as determined by student t test).\n\nNF-κB plays an important role in regulating the immune response to viral infection and is also a key factor frequently targeted by viruses for taking over the host cell. In this study, we tested if these accessory proteins could modulate NF-κB. 293T cells were co-transfected with reporter Samples were collected at 6 h postinfection, followed by dual-luciferase assay. The results were expressed as the firefly luciferase value normalized to that of Renilla luciferase. (B) ORF3a protein activate NF-κB. 293T cells were transfected with 100 ng pNF-κB-Luc, 10 ng pRL-TK, empty vector (500 ng), an NS1-expressing plasmid (500 ng), a SARS-CoV ORF7a-expressing plasmid (500 ng), or ORF3a-expressing plasmids (500 ng). After 24 h, the cells were treated with TNF-α. 
Dual-luciferase activity was determined after 6 h. The results were expressed as the firefly luciferase activity normalized to that of Renilla luciferase. The experiments were performed three times independently. Data are representative of at least three independent experiments, with each determination performed in triplicate (mean ± SD of fold change). Asterisks indicate significant differences between groups (compared with Empty vector-NC, p < 0.05, as determined by student t test).\n\nNF-κB plays an important role in regulating the immune response to viral infection and is also a key factor frequently targeted by viruses for taking over the host cell. In this study, we tested if these accessory proteins could modulate NF-κB. 293T cells were co-transfected with reporter plasmids (pNF-κB-Luc and pRL-TK), as well as accessory protein-expressing plasmids, or controls (empty vector, NS1, SARS-CoV Tor2-ORF7a). The cells were mock treated or treated with TNF-α for 6 h at 24 h post-transfection. The luciferase activity was determined. RsYN1-ORF3a and RaGD-ORF3a activated NF-κB as SARS-CoV ORF7a, whereas RsYN2-ORF3a inhibited NF-κB as NS1 ( Figure 5B ). Expressions of ORF3as were confirmed with Western blot ( Figure S1 ). Other accessory proteins did not modulate NF-κB production ( Figure S4 ).\n\nTo understand the infectivity of these newly detected BtCoV/Rh/YN2012, we selected the RsYN1, RsYN3 and RaGD spike proteins for spike-mediated pseudovirus entry studies. Both Western blot analysis and negative-staining electron microscopy observation confirmed the preparation of BtCoV/Rh/YN2012 successfully ( Figure S5 ). A total of 11 human cell lines, 8 bat cells, and 9 other mammal cell lines were tested, and no strong positive was found (Table S2) .\n\nIn this study, a novel alpha-CoV species, BtCoV/Rh/YN2012, was identified in two Rhinolophus species. The 4 strains with full-length genome were sequences. 
The 7 conserved replicase domains of these viruses possessed <90% aa sequence identity to those of other known alpha-CoVs, which defines a new species in accordance with the ICTV taxonomy standard [42] . These novel alpha-CoVs showed high genetic diversity in their structural and non-structural genes. Strain RaGD from R. affinis, collected in Guangdong province, formed a divergent independent branch from the other 3 strains from R. sinicus, sampled in Yunnan Province, indicating an independent evolution process associated with geographic isolation and host restrain. Though collected from same province, these three virus strains formed two genotypes correlated to sampling locations. These two genotypes had low genome sequence identity, especially in the S gene and accessory genes. Considering the remote geographic location of the host bat habitat, the host tropism, and the virus diversity, we suppose BtCoV/Rh/YN2012 may have spread in these two provinces with a long history of circulation in their natural reservoir, Rhinolophus bats. With the sequence evidence, we suppose that these viruses are still rapidly evolving.\n\nOur study revealed that BtCoV/Rh/YN2012 has a unique genome structure compared to other alpha-CoVs. First, novel accessory genes, which had no homologues, were identified in the genomes. Second, multiple TRSs were found between S and E genes while other alphacoronavirus only had one TRS there. These TRSs precede ORF3a, ORF3b (only in RsYN1), and ORF4a/b respectively. Third, accessory gene ORF9 showed homology with those of other known CoV species in another coronavirus genus, especially with accessory genes from SARSr-CoV.\n\nAccessory genes are usually involved in virus-host interactions during CoV infection [43] . In most CoVs, accessory genes are dispensable for virus replication. However, an intact 3c gene of feline CoV was required for viral replication in the gut [44] [45] [46] . 
Deletion of the genus-specific genes in mouse hepatitis virus led to a reduction in virulence [47] . SARS-CoV ORF7a, which was identified to be involved in the suppression of RNA silencing [48] , inhibition of cellular protein synthesis [49] , cell-cycle blockage [50] , and apoptosis induction [51, 52] . In this study, we found that BtCoV/Rh/YN2012 ORF9 shares~30% aa sequence identity with SARS-CoV ORF7a. Interestingly, BtCoV/Rh/YN2012 and SARSr-CoV were both detected in R. sinicus from the same cave. We suppose that SARS-CoV and BtCoV/Rh/YN2012 may have acquired ORF7a or ORF9 from a common ancestor through genome recombination or horizontal gene transfer. Whereas, ORF9 of BtCoV/Rh/YN2012 failed to induce apoptosis or activate NF-κB production, these differences may be induced by the divergent evolution of these proteins in different pressure.\n\nThough different BtCoV/Rh/YN2012 ORF4a share <64.4% amino acid identity, all of them could activate IFN-β. ORF3a from RsYN1 and RaGD upregulated NF-κB, but the homologue from RsYN2 downregulated NF-κB expression. These differences may be caused by amino acid sequence variations and may contribute to a viruses' pathogenicity with a different pathway.\n\nThough lacking of intestinal cell lines from the natural host of BtCoV/Rh/YN2012, we screened the cell tropism of their spike protein through pseudotyped retrovirus entry with human, bat and other mammalian cell lines. Most of cell lines screened were unsusceptible to BtCoV/Rh/YN2012, indicating a low risk of interspecies transmission to human and other animals. Multiple reasons may lead to failed infection of coronavirus spike-pseudotyped retrovirus system, including receptor absence in target cells, failed recognition to the receptor homologue from non-host species, maladaptation in non-host cells during the spike maturation or virus entry, or the limitation of retrovirus system in stimulating coronavirus entry. 
The weak infectivity of RsYN1 pseudotyped retrovirus in Huh-7 cells could be explained by the binding of spike protein to polysaccharide secreted to the surface. The assumption needs to be further confirmed by experiments.\n\nOur long-term surveillances suggest that Rhinolophus bats seem to harbor a wide diversity of CoVs. Coincidently, the two highly pathogenic agents, SARS-CoV and Rh-BatCoV HKU2 both originated from Rhinolophus bats. Considering the diversity of CoVs carried by this bat genus and their wide geographical distribution, there may be a low risk of spillover of these viruses to other animals and humans. Long-term surveillances and pathogenesis studies will help to prevent future human and animal diseases caused by these bat CoVs.\n\nSupplementary Materials: The following are available online at http://www.mdpi.com/1999-4915/11/4/379/s1, Figure S1 : western blot analysis of the expression of accessory proteins. Figure S2 : Apoptosis analysis of ORF9 proteins of BtCoV/Rh/YN2012. Figure S3 : Functional analysis of ORF3a, ORF3b, ORF4b, ORF8 and ORF9 proteins on the production of Type I interferon. Figure S4 : Functional analysis of ORF3b, ORF4a, ORF4b, ORF8 and ORF9 proteins on the production of NF-κB. Figure S5 : Characteristic of BtCoV/Rh/YN2012 spike mediated pseudovirus. Table S1 : General primers for AlphaCoVs genome sequencing. Table S2 : Primers for the detection of viral sugbenomic mRNAs. Table S3",1576,3686,False,What is the conclusion of the coronavirus long-term surveillance studies?


## Preprocessing the training data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put them in the format the model expects, as well as generate the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The following assertion ensures that our tokenizer is a fast tokenizer (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, and we will need some of the special features they have for our preprocessing.

In [None]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

You can check which type of models have a fast tokenizer available and which don't on the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

You can directly call this tokenizer on two sentences (one for the question, one for the context):

In [None]:
tokenizer("What is your name?", "My name is Sylvain.")

{'input_ids': [0, 2264, 16, 110, 766, 116, 2, 2, 2387, 766, 16, 28856, 1851, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
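To make the layout of that output concrete, here is a minimal sketch that splits the `input_ids` shown above back into a question part and a context part. It assumes that ids 0 and 2 are the start and separator special tokens for this checkpoint (the decoded example further down, `<s>...</s></s>...`, suggests so); the helper name `split_pair` is made up for illustration and is not part of 🤗 Transformers.

```python
def split_pair(input_ids, bos_id=0, sep_id=2):
    """Split [bos] question [sep][sep] context [sep] into its two parts.

    The special-token ids and the double separator are assumptions that
    match the output printed above; other checkpoints may differ.
    """
    assert input_ids[0] == bos_id and input_ids[-1] == sep_id
    first_sep = input_ids.index(sep_id)
    # Assumed layout: two consecutive separators sit between the sentences.
    assert input_ids[first_sep + 1] == sep_id
    question_ids = input_ids[1:first_sep]
    context_ids = input_ids[first_sep + 2:-1]
    return question_ids, context_ids


input_ids = [0, 2264, 16, 110, 766, 116, 2, 2, 2387, 766, 16, 28856, 1851, 4, 2]
question_ids, context_ids = split_pair(input_ids)
print(len(question_ids), len(context_ids))
```

Note that id 766 appears in both parts, which is consistent with the word "name" occurring in the question and in the context.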

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Now one specific thing for the preprocessing in question answering is how to deal with very long documents. In other tasks we usually truncate them when they are longer than the model's maximum sequence length, but here, removing part of the context might result in losing the answer we are looking for. To deal with this, we will allow one (long) example in our dataset to give several input features, each shorter than the maximum length of the model (or the one we set as a hyper-parameter). Also, just in case the answer lies at the point where we split a long context, we allow some overlap between the features we generate, controlled by the hyper-parameter `doc_stride`:

In [None]:
max_length = 500 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.

Let's find one long example in our dataset:

In [None]:
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        break
example = datasets["train"][i]

Token indices sequence length is longer than the specified maximum sequence length for this model (5908 > 512). Running this sequence through the model will result in indexing errors


Without any truncation, we get the following length for the input IDs:

In [None]:
len(tokenizer(example["question"], example["context"])["input_ids"])

5908

Now, if we just truncate, we will lose information (and possibly the answer to our question):

In [None]:
len(tokenizer(example["question"], example["context"], max_length=max_length, truncation="only_second")["input_ids"])

500

Note that we never want to truncate the question, only the context, which is why we picked the `only_second` truncation. Now, our tokenizer can automatically return a list of features capped at a certain maximum length, with the overlap we talked about above; we just have to tell it so with `return_overflowing_tokens=True` and by passing the stride:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)

Now we don't have one list of `input_ids`, but several: 

In [None]:
[len(x) for x in tokenized_example["input_ids"]]

[500,
 500,
 500,
 500,
 500,
 500,
 500,
 500,
 500,
 500,
 500,
 500,
 500,
 500,
 500,
 478]
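The lengths above can be reproduced with a little arithmetic. The sketch below is a minimal model of the splitting, not the tokenizer's actual implementation: it assumes each feature spends 10 tokens on the question and special tokens (6 question tokens plus 4 special tokens for this example), and that consecutive windows share exactly `doc_stride` context tokens. The function name `window_lengths` is hypothetical.

```python
def window_lengths(n_context_tokens, n_overhead_tokens, max_length, doc_stride):
    """Length of each feature when a long context is cut into overlapping windows."""
    capacity = max_length - n_overhead_tokens  # context tokens that fit in one feature
    step = capacity - doc_stride               # how far each new window advances
    assert step > 0, "doc_stride must be smaller than the per-feature context capacity"
    lengths = []
    start = 0
    while True:
        end = min(start + capacity, n_context_tokens)
        lengths.append(end - start + n_overhead_tokens)
        if end == n_context_tokens:
            break
        start += step
    return lengths


# 5908 tokens in total, of which 10 are assumed to be question + special tokens.
lengths = window_lengths(n_context_tokens=5908 - 10, n_overhead_tokens=10,
                         max_length=500, doc_stride=128)
print(len(lengths), lengths[-1])
```

Under these assumptions each window advances by 490 - 128 = 362 context tokens, which gives the same 16 features as the tokenizer, the last one being shorter.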

And if we decode them, we can see the overlap:

In [None]:
for x in tokenized_example["input_ids"][:2]:
    print(tokenizer.decode(x))

<s>What can nuclear receptors regulate?</s></s>iNR-Drug: Predicting the Interaction of Drugs with Nuclear Receptors in Cellular Networking

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3975431/

SHA: ee55aea26f816403476a7cb71816b8ecb1110329

Authors: Fan, Yue-Nong; Xiao, Xuan; Min, Jian-Liang; Chou, Kuo-Chen
Date: 2014-03-19
DOI: 10.3390/ijms15034915
License: cc-by

Abstract: Nuclear receptors (NRs) are closely associated with various major diseases such as cancer, diabetes, inflammatory disease, and osteoporosis. Therefore, NRs have become a frequent target for drug development. During the process of developing drugs against these diseases by targeting NRs, we are often facing a problem: Given a NR and chemical compound, can we identify whether they are really in interaction with each other in a cell? To address this problem, a predictor called “iNR-Drug” was developed. In the predictor, the drug compound concerned was formulated by a 256-D (dimensional) vector derived from its molecu

Now this will give us some work to properly treat the answers: we need to find in which of those features the answer actually is, and where exactly in that feature. The models we will use require the start and end positions of these answers in the tokens, so we will also need to map parts of the original context to some tokens. Thankfully, the tokenizer we're using can help us with that by returning an `offset_mapping`:

In [None]:
tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 4), (5, 8), (9, 16), (17, 26), (27, 35), (35, 36), (0, 0), (0, 0), (0, 1), (1, 3), (3, 4), (4, 8), (8, 9), (10, 17), (17, 20), (21, 24), (25, 30), (30, 36), (37, 39), (40, 45), (46, 50), (51, 58), (59, 61), (61, 65), (65, 68), (69, 71), (72, 80), (81, 88), (88, 91), (91, 92), (92, 93), (93, 98), (98, 101), (101, 104), (104, 105), (105, 107), (107, 109), (109, 110), (110, 112), (112, 113), (113, 114), (114, 117), (117, 118), (118, 121), (121, 122), (122, 124), (124, 125), (125, 126), (126, 134), (134, 135), (135, 137), (137, 138), (138, 140), (140, 143), (143, 145), (145, 146), (146, 147), (147, 148), (148, 151), (151, 152), (153, 154), (154, 155), (155, 157), (157, 160), (160, 162), (162, 163), (163, 164), (164, 166), (166, 168), (168, 170), (170, 172), (172, 173), (173, 174), (174, 176), (176, 179), (179, 181), (181, 182), (182, 183), (183, 185), (185, 186), (186, 188), (188, 191), (191, 193), (193, 194), (194, 195), (195, 199), (199, 202), (202, 203), (204, 207), (207, 2

This gives, for each index of our input IDs, the start and end character in the original text that produced the token. The very first token (`[CLS]`, which this tokenizer writes as `<s>`) has (0, 0) because it doesn't correspond to any part of the question or context; the second token then corresponds to characters 0 to 3 of the question:

In [None]:
first_token_id = tokenized_example["input_ids"][0][1]
offsets = tokenized_example["offset_mapping"][0][1]
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

What What
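To see what an offset mapping encodes without loading a tokenizer, here is a minimal sketch that uses a toy word-and-punctuation splitter (an assumption for illustration; real subword tokenizers split differently). Each token is described by a `(start, end)` pair such that slicing the original string with it gives the token text back.

```python
import re


def toy_offsets(text):
    """Return (start, end) character offsets for a toy word/punctuation split."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


question = "What is your name?"
offsets = toy_offsets(question)
# Slicing the text with an offset pair recovers the corresponding token.
tokens = [question[start:end] for start, end in offsets]
print(offsets)
print(tokens)
```

The end offset is exclusive, which is why the first pair is `(0, 4)` for a four-letter word.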


So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which parts correspond to the context. This is where the `sequence_ids` method of our `tokenized_example` can be useful:

In [None]:
sequence_ids = tokenized_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, None, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

It returns `None` for the special tokens, then 0 or 1 depending on whether the corresponding token comes from the first sentence passed (the question) or the second (the context). Now with all of this, we can find the first and last token of the answer in one of our input features (or detect that the answer is not in this feature):

In [None]:
answers = example["answers"]
start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])

# Start token index of the current span in the text.
token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1

# End token index of the current span in the text.
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1

# Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
offsets = tokenized_example["offset_mapping"][0]
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    # Move the token_start_index and token_end_index to the two ends of the answer.
    # Note: we could go after the last offset if the answer is the last word (edge case).
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)
else:
    print("The answer is not in this feature.")

The answer is not in this feature.
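Since the answer is not in the first feature of this example, the branch that computes the positions is not exercised above. The sketch below runs the same loops on a small hand-built feature where the answer is present. The offsets and sequence ids are made up (a toy word-level split of "What is your name?" and "My name is Sylvain."), and `find_answer_token_span` is a hypothetical helper name.

```python
def find_answer_token_span(offsets, sequence_ids, start_char, end_char, context_id=1):
    """Return (start_token, end_token) of the answer, or None if it is outside this feature."""
    # First and last token of the context inside this feature.
    token_start_index = 0
    while sequence_ids[token_start_index] != context_id:
        token_start_index += 1
    token_end_index = len(offsets) - 1
    while sequence_ids[token_end_index] != context_id:
        token_end_index -= 1

    # The answer must lie entirely inside the context span of this feature.
    if not (offsets[token_start_index][0] <= start_char
            and offsets[token_end_index][1] >= end_char):
        return None

    # Walk inwards until the two indices frame the answer.
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    return token_start_index - 1, token_end_index + 1


context = "My name is Sylvain."
# Toy feature: <special> question <special><special> context <special>
offsets = [(0, 0), (0, 4), (5, 7), (8, 12), (13, 17), (17, 18), (0, 0), (0, 0),
           (0, 2), (3, 7), (8, 10), (11, 18), (18, 19), (0, 0)]
sequence_ids = [None, 0, 0, 0, 0, 0, None, None, 1, 1, 1, 1, 1, None]

start_char = context.index("Sylvain")
end_char = start_char + len("Sylvain")
span = find_answer_token_span(offsets, sequence_ids, start_char, end_char)
print(span)

# An answer located beyond the characters covered by this feature is reported as absent.
missing = find_answer_token_span(offsets, sequence_ids, 40, 47)
print(missing)
```

Both returned indices are inclusive, which is why the notebook slices with `end_position+1` when decoding.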


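When the answer does lie in the feature, we can double check that the decoded span is indeed the expected answer (the cell is commented out here because the answer is not in this first feature):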

In [None]:
# print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
# print(answers["text"][0])

For this notebook to work with any kind of model, we need to account for the special case where the model expects padding on the left (in which case we switch the order of the question and the context):

In [None]:
pad_on_right = tokenizer.padding_side == "right"

Now let's put everything together in one function we will apply to our training set. In the case of impossible answers (the answer is in another feature given by an example with a long context), we set the CLS index for both the start and end position. We could also simply discard those examples from the training set if the flag `allow_impossible_answers` is `False`. Since the preprocessing is already complex enough as it is, we've kept it simple for this part.

In [None]:
def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
features = prepare_train_features(datasets['train'][:5])
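Because one example can produce several features, the labels have to be looked up through the feature-to-example map. The sketch below is a simplified model of what `overflow_to_sample_mapping` contains; the feature counts, the answer strings and the helper name `build_sample_mapping` are invented for illustration.

```python
def build_sample_mapping(features_per_example):
    """For each feature, give the index of the example it was cut from."""
    mapping = []
    for example_index, n_features in enumerate(features_per_example):
        mapping.extend([example_index] * n_features)
    return mapping


# Hypothetical batch: a very long context, a short one, and a medium one.
features_per_example = [16, 1, 3]
sample_mapping = build_sample_mapping(features_per_example)

# Every feature fetches the answer of the example it came from.
answers_per_example = ["answer A", "answer B", "answer C"]
answer_for_feature = [answers_per_example[i] for i in sample_mapping]
print(len(sample_mapping), sample_mapping[-4:])
```

This is the lookup that `prepare_train_features` performs with `sample_mapping[i]` before computing the start and end positions of each feature.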

To apply this function on all the sentences (or pairs of sentences) in our dataset, we just use the `map` method of the `datasets` object we created earlier. This will apply the function on all the elements of all the splits in `datasets`, so our training, validation and testing data will be preprocessed in one single command. Since our preprocessing changes the number of samples, we need to remove the old columns when applying it.

In [None]:
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
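To illustrate why a batched function can change the number of rows (and why the old columns then have to be dropped), here is a toy stand-in for a batched map written in plain Python. It is only a sketch of the idea, not the 🤗 Datasets implementation; `toy_batched_map` and `split_in_chunks` are hypothetical names.

```python
def toy_batched_map(columns, fn, batch_size):
    """Apply fn to slices of a dict of columns and concatenate the results."""
    n_rows = len(next(iter(columns.values())))
    out = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            out.setdefault(name, []).extend(values)
    return out


def split_in_chunks(batch, chunk_size=3):
    """Cut each text into chunks of at most chunk_size words (one output row per chunk)."""
    chunks = []
    for text in batch["text"]:
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[i:i + chunk_size]))
    return {"chunk": chunks}


columns = {"text": ["a b c d e", "f g", "h i j k"]}
mapped = toy_batched_map(columns, split_in_chunks, batch_size=2)
print(len(columns["text"]), "rows in,", len(mapped["chunk"]), "rows out")
```

The output has more rows than the input, so the original `text` column could not be kept alongside it, which is the situation `remove_columns` handles in the call above.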

## Fine-tuning the model

Now that our data is ready for training, we can download the pretrained model and fine-tune it. Since our task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us:

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

The warning printed when loading the model is telling us we are throwing away some weights (the head used for pretraining) and randomly initializing some others (the new question-answering head). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
args = TrainingArguments(
    "test-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,
    weight_decay=0.01,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay.

Then we will need a data collator that will batch our processed examples together, here the default one will work:

In [None]:
from transformers import default_data_collator

data_collator = default_data_collator

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

We can now finetune our model by just calling the `train` method:

In [None]:
# trainer.train()

Since this training is particularly long, let's save the model just in case we need to restart.

In [None]:
# trainer.save_model("distil-covid-trained")

## Evaluation

Evaluating our model will require a bit more work, as we will need to map the predictions of our model back to parts of the context. The model itself predicts logits for the start and end position of our answers: if we take a batch from our validation dataloader, here is the output our model gives us:

In [None]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

The output of the model is a dict-like object that contains the loss (since we provided labels) and the start and end logits. We won't need the loss for our predictions, so let's have a look at the logits:

In [None]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([16, 500]), torch.Size([16, 500]))

We have one logit for each feature and each token. The most obvious way to predict an answer for each feature is to take the index of the maximum of the start logits as the start position and the index of the maximum of the end logits as the end position.

In [None]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([153,   0,   0, 413,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0], device='cuda:0'),
 tensor([175,  78,  71,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0], device='cuda:0'))

This will work great in a lot of cases, but what if this prediction gives us something impossible? The start position could be greater than the end position, or point to a span of text in the question instead of the context. In that case, we might want to look at the second best prediction to see if it gives a possible answer and select that instead.
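The batch printed above already contains such a case. The following sketch copies the argmax positions from that output into plain lists (so it runs without the model) and flags the features whose predicted start comes after the predicted end.

```python
# Argmax positions copied from the output above (16 features in the batch).
start_positions = [153, 0, 0, 413] + [0] * 12
end_positions = [175, 78, 71, 0] + [0] * 12

# A span is impossible when it starts after it ends.
impossible = [
    i for i, (start, end) in enumerate(zip(start_positions, end_positions))
    if start > end
]
# Features predicting index 0 for both ends point at the first (special) token.
points_at_first_token = [
    i for i, (start, end) in enumerate(zip(start_positions, end_positions))
    if start == 0 and end == 0
]
print(impossible, len(points_at_first_token))
```

Only the fourth feature (index 3) is inconsistent here: it starts at token 413 but ends at token 0.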

However, picking the second best answer is not as easy as picking the best one: is it the second best index in the start logits with the best index in the end logits? Or the best index in the start logits with the second best index in the end logits? And if that second best answer is not possible either, it gets even trickier for the third best answer.


To rank our answers, we will use the score obtained by adding the start and end logits. We won't try to order all the possible answers; instead we limit ourselves to the top candidates, controlled by a hyper-parameter we call `n_best_size`. We'll pick the best `n_best_size` indices in the start and end logits and gather all the answers this predicts. After checking if each one is valid, we will sort them by their score and keep the best one. Here is how we would do this on the first feature in the batch:

In [None]:
n_best_size = 20

In [None]:
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )
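As a sanity check on the pairing logic, here is a minimal sketch of the same loop on four made-up logits, written in plain Python so it can be run without the model. The logit values and the helper name `n_best_spans` are assumptions for illustration.

```python
def n_best_spans(start_logits, end_logits, n_best_size):
    """Pair the top start and end indices and score each ordered (start <= end) pair."""
    start_indexes = sorted(range(len(start_logits)),
                           key=lambda i: start_logits[i], reverse=True)[:n_best_size]
    end_indexes = sorted(range(len(end_logits)),
                         key=lambda i: end_logits[i], reverse=True)[:n_best_size]
    candidates = []
    for start_index in start_indexes:
        for end_index in end_indexes:
            if start_index <= end_index:
                candidates.append({
                    "start": start_index,
                    "end": end_index,
                    "score": start_logits[start_index] + end_logits[end_index],
                })
    return candidates


start_logits = [0.5, 2.0, 0.25, 1.5]
end_logits = [0.25, 0.5, 3.0, 1.0]
candidates = n_best_spans(start_logits, end_logits, n_best_size=2)
best = max(candidates, key=lambda c: c["score"])
print(len(candidates), best)
```

With `n_best_size=2` there are four pairs to consider, and the pair (start 3, end 2) is dropped because it would start after it ends.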

And then we can sort the `valid_answers` according to their `score` and only keep the best one. The only point left is how to check a given span is inside the context (and not the question) and how to get back the text inside. To do this, we need to add two things to our validation features:
- the ID of the example that generated the feature (since each example can generate several features, as seen before);
- the offset mapping that will give us a map from token indices to character positions in the context.

That's why we will re-process the validation set with the following function, slightly different from `prepare_train_features`:

In [None]:
def prepare_validation_features(examples):
    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
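
To make the masking step more concrete, here is a minimal sketch on hand-written toy data (not real tokenizer output). The `sequence_ids`, the offsets and the helper `span_text` are assumptions made up for illustration; only the list comprehension mirrors the function above.

```python
# Toy illustration (hand-written values, not real tokenizer output):
# one feature laid out as [CLS] question [SEP] context [SEP].
context = "Paris is the capital of France."

# None for special tokens, 0 for question tokens, 1 for context tokens.
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, None]
# (start_char, end_char) of each token inside the text it came from.
offset_mapping = [
    (0, 0),                      # [CLS]
    (0, 5), (6, 14),             # question tokens (offsets into the question)
    (0, 0),                      # [SEP]
    (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), (30, 31),  # context tokens
    (0, 0),                      # [SEP]
]

# Same masking as in prepare_validation_features: keep offsets only for context tokens.
context_index = 1
masked = [
    (o if sequence_ids[k] == context_index else None)
    for k, o in enumerate(offset_mapping)
]


def span_text(start_token, end_token):
    """Hypothetical helper: return the context text for a token span, or None if it leaves the context."""
    if masked[start_token] is None or masked[end_token] is None:
        return None
    return context[masked[start_token][0]: masked[end_token][1]]


print(span_text(7, 9))  # tokens 7..9 are all context tokens
print(span_text(1, 5))  # token 1 belongs to the question, so the span is rejected
```

With the `None` entries in place, rejecting a span that touches the question or a special token is a simple `is None` test.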

And like before, we can apply that function to our validation set easily:

In [None]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)


Now we can grab the predictions for all features by using the `Trainer.predict` method:

In [None]:
raw_predictions = trainer.predict(validation_features)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

We can now refine the test we had before: since we set the offset mapping to `None` for every token that is not part of the context, it's easy to check whether an answer is fully inside the context. We also eliminate very long answers from consideration (with a hyper-parameter we can tune).

In [None]:
max_answer_length = 60

In [None]:
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        # Both checks passed, so the span lies inside the context: map the token indices back to characters.
        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char: end_char]
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 11.464323,
  'text': 'Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide.'},
 {'score': 9.37315,
  'text': 'Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide'},
 {'score': 7.5647163, 'text': 'worldwide.'},
 {'score': 6.1717076, 'text': 'Mother-to-child transmission (MTCT)'},
 {'score': 5.685243, 'text': 'Mother-to-child transmission'},
 {'score': 5.4735436, 'text': 'worldwide'},
 {'score': 5.3105655,
  'text': ': Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide.'},
 {'score': 4.5358906,
  'text': 'transmission (MTCT) is the main cause of HIV-1 infection in children worldwide.'},
 {'score': 4.413592, 'text': '.'},
 {'score': 4.3732033,
  'text': 'MTCT) is the main cause of HIV-1 infection in children worldwide.'},
 {'score': 3.938302, 'text': 'Mother-to-child transmission (MTCT'},
 {'score': 3.5990171,
  'text': ') is the main cause of HIV-1 i

We can compare to the actual ground-truth answer:

In [None]:
datasets["validation"][0]["answers"]

{'answer_start': [370],
 'text': ['Mother-to-child transmission (MTCT) is the main cause of HIV-1 infection in children worldwide.']}

Our model picked the right answer as the most likely one!
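
The candidate search used for this feature can be sketched on toy numbers. This is a minimal, standard-library-only illustration: `sorted(range(...))` stands in for `np.argsort`, the logits and offsets are made-up values, and `top_indices` and `candidate_spans` are hypothetical helper names, not part of any library.

```python
def top_indices(logits, n_best_size):
    """Indices of the n_best_size largest logits, best first."""
    # Ascending argsort (stand-in for np.argsort), then the same reversed slice as in the notebook.
    order = sorted(range(len(logits)), key=lambda i: logits[i])
    return order[-1 : -n_best_size - 1 : -1]


def candidate_spans(start_logits, end_logits, offset_mapping, n_best_size, max_answer_length):
    """Score every (start, end) pair among the best indices and drop the invalid ones."""
    spans = []
    for s in top_indices(start_logits, n_best_size):
        for e in top_indices(end_logits, n_best_size):
            # Reject spans touching the question or special tokens (masked as None).
            if offset_mapping[s] is None or offset_mapping[e] is None:
                continue
            # Reject spans of negative length or longer than max_answer_length tokens.
            if e < s or e - s + 1 > max_answer_length:
                continue
            spans.append((start_logits[s] + end_logits[e], s, e))
    return sorted(spans, reverse=True)


# Made-up logits for a 5-token feature; tokens 0 and 1 are outside the context.
start_logits = [5.0, 0.1, 3.0, 0.5, 0.2]
end_logits = [4.0, 0.3, 0.2, 2.5, 1.0]
offset_mapping = [None, None, (0, 5), (6, 8), (9, 12)]

for score, s, e in candidate_spans(start_logits, end_logits, offset_mapping, n_best_size=3, max_answer_length=2):
    print(score, s, e)
```

Note that the pair with the highest raw score here, token 0 for both start and end, is discarded because its offset is masked, which is exactly the role of the `None` entries.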

As noted in a code comment earlier, this was easy on the first feature because we knew it came from the first example. For the other features, we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather all the answers from all the features generated by a given example, then pick the best one. The following code builds a map from each example index to the indices of its corresponding features:

In [None]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
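
As a quick sanity check of what this map looks like, here is the same logic run on a handful of made-up IDs (the IDs `q1`, `q2`, `q3` and the two lists are assumptions for illustration only):

```python
import collections

# Made-up data: three examples, six features produced from them.
example_ids = ["q1", "q2", "q3"]
feature_example_ids = ["q1", "q1", "q2", "q3", "q3", "q3"]

example_id_to_index = {k: i for i, k in enumerate(example_ids)}
features_per_example = collections.defaultdict(list)
for i, example_id in enumerate(feature_example_ids):
    features_per_example[example_id_to_index[example_id]].append(i)

print(dict(features_per_example))
# {0: [0, 1], 1: [2], 2: [3, 4, 5]}
```

Example 0 was split into two features, example 1 fit in a single feature, and example 2 needed three.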

We're almost ready for our post-processing function. The last bit to deal with is the impossible answer (when `squad_v2 = True`). The code above only keeps answers that are inside the context, so we also need to grab the score for the impossible answer (whose start and end indices both point to the CLS token). When one example gives several features, we should predict the impossible answer only when all the features give it a high score, since a single feature could predict the impossible answer just because the answer isn't in the part of the context it has access to. That is why the score of the impossible answer for one example is the *minimum* of the scores for the impossible answer in each feature generated by the example.

We then predict the impossible answer when that score is greater than the score of the best non-impossible answer. Putting it all together gives us the following post-processing function:

In [None]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction: keep the lowest null score seen across this example's features.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where we have no non-null prediction at all, we create a fake prediction
            # to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
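
The null-answer rule described in the text (take the *minimum* null score across an example's features, then compare it with the best span) can be isolated in a few lines. This is a minimal sketch with made-up scores; `pick_answer` is a hypothetical helper, not something the notebook or the library defines.

```python
def pick_answer(best_span_text, best_span_score, null_scores_per_feature):
    """Return the best span unless every feature prefers the impossible answer."""
    # One feature with a low null score is enough to veto the impossible answer,
    # which is why we aggregate with min() rather than max().
    min_null_score = min(null_scores_per_feature)
    return best_span_text if best_span_score > min_null_score else ""


# Made-up scores: the second feature contains the answer, so its null score is low.
print(repr(pick_answer("MTCT", 5.0, [8.0, 2.0, 7.5])))
# Made-up scores: every feature gives the impossible answer a higher score than the best span.
print(repr(pick_answer("MTCT", 5.0, [8.0, 6.0])))
```

In the first call the span is kept because one feature scored the null answer below the span; in the second the empty string is returned.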

And we can apply our post-processing function to our raw predictions:

In [None]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Post-processing 300 example predictions split into 4149 features.



In [None]:
# import pickle
import json

# with open('output-distil-covid.pickle', 'wb') as handle:
#     pickle.dump(final_predictions, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('roberta-covid.json', 'w', encoding='utf-8') as f:
    json.dump(final_predictions, f, indent=4, ensure_ascii=False)
    

In [None]:
# Only the modules actually used by the helpers below.
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

In [None]:
def f1_score(prediction, ground_truth):
    """Token-overlap F1 between two (already normalized) answer strings."""
    prediction_tokens = prediction.split()
    ground_truth_tokens = ground_truth.split()
    # Multiset intersection: number of tokens shared by both answers.
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    """Return True when the two answer strings are identical."""
    return (prediction == ground_truth)

In [None]:
# Normalize each predicted answer; final_predictions keeps the order of the validation examples.
my_predictions = [normalize_answer(text) for text in final_predictions.values()]

In [None]:
# Normalize the first reference answer of each validation example.
orig = [normalize_answer(example["answers"]["text"][0]) for example in datasets["validation"]]

In [None]:
em = 0
for i in range(0,len(orig)):
  if orig[i] == my_predictions[i]: em+=1

print("Exact Match {}".format(em/len(orig)))

Exact Match 0.4666666666666667


In [None]:
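# Note: this import shadows the token-level `f1_score` defined above. scikit-learn's version treats each
# normalized answer string as a class label, so the value below is a classification-style macro F1 that
# only rewards exact string matches. It is not the SQuAD token-overlap F1.
from sklearn.metrics import f1_score

f1 = f1_score(orig, my_predictions, average='macro')
print("f1_score {} macro".format(f1))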

f1_score 0.30866965620328846 macro


In [None]:
orig[1]

'dcsignr plays crucial role in mtct of hiv1 and that impaired placental dcsignr expression increases risk of transmission'

In [None]:
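# Here scikit-learn compares the two token lists position by position (each token is a class label),
# so this call only works when both lists have the same length.
f1 = f1_score(orig[1].split(), my_predictions[1].split(), average='macro')
f1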

1.0
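
For comparison, a SQuAD-style token-overlap F1 averaged over all examples could be computed along these lines. This is a minimal sketch under stated assumptions: `token_f1` and `mean_token_f1` are hypothetical names chosen to avoid clashing with scikit-learn's `f1_score`, the two string pairs are made-up toy inputs, and only one reference answer per example is used.

```python
from collections import Counter


def token_f1(prediction, ground_truth):
    """Token-overlap F1 between two normalized answer strings."""
    pred_tokens = prediction.split()
    truth_tokens = ground_truth.split()
    # Multiset intersection: number of tokens the two answers share.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_token_f1(predictions, references):
    """Average token-overlap F1 over aligned lists of predictions and references."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


# Made-up, already normalized toy strings (not taken from the dataset).
predictions = ["mother to child transmission", "placental expression"]
references = ["mother to child transmission mtct", "dcsignr expression"]

print(mean_token_f1(predictions, references))
```

In the notebook this would be called as `mean_token_f1(my_predictions, orig)`. Unlike the macro F1 above, a partially correct answer earns partial credit here.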