# Passage Retrieval with BERT on the CORD-19 Dataset
## Vassilis Panagakis

In [1]:
! pip install -U sentence-transformers

import scipy.spatial
import numpy as np
import os, json
import glob
import re
import torch
import pandas as pd
import transformers


# Passage retrieval on CORD-19 dataset

## Load Data

In [2]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
# get "/comm_use_subset" directory path on google drive 
dir_path = 'gdrive/My Drive/Colab Notebooks/comm_use_subset'

json_articles = glob.glob(os.path.join(dir_path, "*.json"))

## Data Pre-processing


**Initial number of articles**

In [4]:
len(json_articles)

9000

The `cord-19_2020-03-13` version of the CORD-19 dataset contains 9000 articles.
To reduce the response time of our model, we filter the articles using keywords such as **RNA virus, SARS, coronavirus, COVID, SARS-CoV-2, 2019-nCoV, vaccine, Antibody-Dependent Enhancement, naproxen, clarithromycin, minocycline** and more. An article is included in our database only if its title contains at least one of the keywords.
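
The filter described above is a case-insensitive substring match on each title. A minimal sketch of that check follows; the shortened keyword list and the sample titles are assumptions made for illustration, not the data used in the notebook.

```python
# Illustrative subset of keywords; the titles below are invented examples.
keywords = ['coronavirus', 'SARS', 'vaccine']

def title_matches(title, keywords):
    """Return True if any keyword occurs in the title, ignoring case."""
    lowered = title.lower()
    return title != '' and any(keyword.lower() in lowered for keyword in keywords)

sample_titles = [
    'A new coronavirus associated with human respiratory disease',
    'Soil chemistry of alpine meadows',
    '',
]
kept = [t for t in sample_titles if title_matches(t, keywords)]
print(kept)
```

Because this is substring matching, a short keyword also matches inside longer words (for example, `virus` matches `coronavirus`), so broad keywords keep many articles.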

### Cleanse Data

In [None]:
if not os.path.exists('gdrive/My Drive/Colab Notebooks/covid19Data.csv'): 
    # cleanse data based on following keywords
    keywords = ['persistence','decontamination','RNA virus',' SARS','coronavirus', 'COVID', 'SARS-Cov-2', 
                '-CoV', '2019-nCoV','coronavirus vaccine','Antibody-Dependent Enhancement', 'prophylaxis clinical',
                'asymptomatic', 'symptoms', 'presymptomatic', 'virus', 'MERS', 'contagious illness', 
                'incubation period', 'pathogen', 'patient zero', 'PPE', 'social distancing', 'self-isolation', 
                'self-quarantine', 'medicine', 'super spreader', 'antibody', 'outbreak', 'epidemic', 'pandemic',
                'mask', 'health professionals', 'N95', 'disease', 'immunity', 'contagious virus', 'COVID-19'] 

    # initialize lists to store filtered titles and ids
    titles = []
    ids = []

    for json_article in json_articles: # traverse each json article
        with open(json_article) as f:  # close the file after reading it
            text = json.load(f)

        # clean title
        title = text['metadata']['title']  
        title = re.sub(r'[^\x00-\x7F]',' ', title)

        # append article only if it contains any of the keywords in its title
        if title != '' and any(keyword.lower() in title.lower() for keyword in keywords):
            titles.append(title)
            ids.append(text['paper_id'])

### Store filtered data to csv file

In [None]:
key_df = pd.DataFrame({'title': titles, 'id': ids})
meta_df = pd.read_csv('gdrive/My Drive/Colab Notebooks/all_sources_metadata_2020-03-13.csv') # load metadata csv file

articles_df = pd.merge(meta_df, key_df)
articles_df = articles_df.drop_duplicates(subset='title') # remove duplicate articles based on title
articles_df = articles_df.dropna(subset=['abstract'])     # remove articles with no 'abstract' field
articles_df = articles_df.reset_index(drop=True)

articles_df.to_csv('gdrive/My Drive/Colab Notebooks/covid19Data.csv', index = False, header=True)

### Load filtered data

In [None]:
# load filtered data from csv file
articles_df = pd.read_csv('gdrive/My Drive/Colab Notebooks/covid19Data.csv')

articles_df.head()

Unnamed: 0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,id
0,e9239100c5493ea914dc23c3d7a262f4326022ac,CZI,Distinct Roles for Sialoside and Protein Recep...,10.1128/mBio.02764-19,,,cc-by,Coronaviruses (CoVs) are common human and anim...,2020,"Qing, Enya; Hantak, Michael; Perlman, Stanley;...",mBio,3005811000.0,#2427,True,3fa90782b0cd99871663f5317bb69d255cfde50f
1,c9fee561c2a3834645dbb61dc4ae6448051da492,CZI,Comprehensive Genomic Characterization Analysi...,10.3389/fmicb.2019.03036,,,cc-by,Porcine delta coronavirus (PDCoV) is a novel e...,2020,"Liui, Junli; Wang, Fangfang; Du, Liuyang; Li, ...",Frontiers in Microbiology,3003968000.0,#5462,True,c9fee561c2a3834645dbb61dc4ae6448051da492
2,655537fc8cc52bccf43cf7189ab060d3097caa7a,CZI,Potential Factors Influencing Repeated SARS Ou...,10.3390/ijerph17051633,,,cc-by,Within last 17 years two widespread epidemics ...,2020,"Sun, Zhong; Thilakavathy, Karuppiah; Kumar, S....",International Journal of Environmental Researc...,2615949000.0,#3296,True,655537fc8cc52bccf43cf7189ab060d3097caa7a
3,f294f0df7468a8ac9e27776cc15fa20297a9f040,CZI,Systematic Comparison of Two Animal-to-Human T...,10.3390/v12020244,,,cc-by,After the outbreak of the severe acute respira...,2020,"Xu, Jiabao; Zhao, Shizhe; Teng, Tieshan; Abdal...",Viruses,2163319000.0,#1449,True,f294f0df7468a8ac9e27776cc15fa20297a9f040
4,5734e3b81e16fe1976a129c5a0872716f3dd50b8,CZI,A new coronavirus associated with human respir...,10.1038/s41586-020-2008-3,,32015508.0,cc-by,"Emerging infectious diseases, such as SARS and...",2020,"Wu, Fan; Zhao, Su; Yu, Bin; Chen, Yan-Mei; Wan...",Nature,3003217000.0,#258,True,5734e3b81e16fe1976a129c5a0872716f3dd50b8


**Number of articles after filtering**

In [None]:
titles = articles_df['title'].tolist()
ids = articles_df['id'].tolist()
len(titles)

3203

## Title Retrieval

In [None]:
from sentence_transformers import SentenceTransformer
from datetime import datetime

# function that returns the k article titles closest to a query, based on cosine similarity
def k_closest(embedder, question, titles, articles, k):
    start = datetime.now() # start time counter

    query_embeddings = embedder.encode([question])  # query embeddings
    title_embeddings = embedder.encode(titles)      # title embeddings

    dist = scipy.spatial.distance.cdist(query_embeddings, title_embeddings, "cosine")[0]  # cosine distances (1 - cosine similarity)

    neighbors = zip(range(len(dist)), dist)
    neighbors = sorted(neighbors, key=lambda x: x[1]) # sort neighbors from highest to lowest cosine similarity

    # initialize neighbors list
    closest_ids = []
    closest_titles = []
    closest_scores = []
    closest_abstracts = []
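    abstracts = list(articles.abstract)
    article_ids = list(articles['id'])  # take ids from the DataFrame argument instead of the global `ids`

    for i, d in neighbors[0:k]:
        closest_ids.append(article_ids[i])
        closest_titles.append(titles[i])
        closest_scores.append(round((1-d), 4))
        closest_abstracts.append(abstracts[i])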
    
    end = datetime.now()  # end time counter

    time_dif = (end - start).total_seconds() # count time difference in seconds

    print("Execution Time: {0:4f} seconds\n".format(time_dif))

    closest_df = pd.DataFrame({
        'id': closest_ids,
        'cosine_similarity': closest_scores,
        'title': closest_titles,
        'abstract': closest_abstracts
    })
    
    return closest_df
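
The ranking step inside `k_closest` can be sketched without any model: compute the cosine distance between the query vector and every title vector, sort by ascending distance, and report `1 - distance` as the similarity. The 2-dimensional vectors below are invented toy values, not embeddings produced by a sentence transformer, and `top_k` is a hypothetical helper name.

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity (the quantity scipy's "cosine" metric returns)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k):
    """Return (index, similarity) pairs for the k vectors most similar to the query."""
    dists = [(i, cosine_distance(query_vec, v)) for i, v in enumerate(doc_vecs)]
    dists.sort(key=lambda pair: pair[1])  # smallest distance first
    return [(i, round(1 - d, 4)) for i, d in dists[:k]]

# Toy "embeddings" (invented values)
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
print(top_k(query, docs, 2))
```

Note that `k_closest` encodes all titles on every call. Encoding the titles once per model and reusing the result would likely remove most of the per-query time, since only the query would then need to be encoded.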

### Device

In [None]:
# enable gpu for faster execution
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device available for running: ")
print(device)

Device available for running: 
cuda


### BERT-base

In [None]:
sentence_model1 = SentenceTransformer('bert-base-nli-mean-tokens')
sentence_model1.to(device)

100%|██████████| 405M/405M [00:08<00:00, 45.1MB/s]


SentenceTransformer(
  (0): Transformer(
    (auto_model): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
       

#### Suggested questions

In [None]:
query1 = 'What are the coronaviruses?'

bbQ1_df = k_closest(sentence_model1, query1, titles, articles_df, 1)
bbQ1_df[['cosine_similarity', 'title']]

Execution Time: 12.332064 seconds



Unnamed: 0,cosine_similarity,title
0,0.8506,Unanswered questions about the Middle East res...


In [None]:
query2 = 'What is Coronavirus Disease 2019?'

bbQ2_df = k_closest(sentence_model1, query2, titles, articles_df, 5)
bbQ2_df[['cosine_similarity', 'title']]

Execution Time: 3.978883 seconds



Unnamed: 0,cosine_similarity,title
0,0.8758,An interim review of the epidemiological chara...
1,0.757,"Potential Rapid Diagnostics, Vaccine and Thera..."
2,0.7162,Q&A: The novel coronavirus outbreak causing CO...
3,0.6963,Molecular Diagnosis of a Novel Coronavirus (20...
4,0.6861,Croup Is Associated with the Novel Coronavirus...


In [None]:
query3 = 'What is caused by SARS-COV2?'

bbQ3_df = k_closest(sentence_model1, query3, titles, articles_df, 10)
bbQ3_df[['cosine_similarity', 'title']]

Execution Time: 12.161577 seconds



Unnamed: 0,cosine_similarity,title
0,0.7361,Porcine Hemagglutinating Encephalomyelitis Vir...
1,0.7202,"Q&A: What are pathogens, and what have they do..."
2,0.7199,The role of CXCL10 in the pathogenesis of expe...
3,0.7151,Antagonizing Interferon-Mediated Immune Respon...
4,0.7142,Trypsin-independent porcine epidemic diarrhea ...
5,0.7086,Sialic Acid Binding Properties of Soluble Coro...
6,0.7068,Virus-induced ER stress and the unfolded prote...
7,0.7019,Biochemical Characterization of Middle East Re...
8,0.7,Host Modulators of H1N1 Cytopathogenicity
9,0.699,HACE1 Negatively Regulates Virus-Triggered Typ...


#### Extra questions

In [None]:
query4 = 'What are most common underlying diseases in covid-19 patients?'

bbQ4_df = k_closest(sentence_model1, query4, titles, articles_df, 1)
bbQ4_df[['cosine_similarity', 'title']]

Execution Time: 12.183343 seconds



Unnamed: 0,cosine_similarity,title
0,0.7019,A Comparative Study of Clinical Presentation a...


In [None]:
query5 = 'what are the public measures to control the spread of covid-19?'

bbQ5_df = k_closest(sentence_model1, query5, titles, articles_df, 3)
bbQ5_df[['cosine_similarity', 'title']]

Execution Time: 12.258709 seconds



Unnamed: 0,cosine_similarity,title
0,0.7171,Including the public in pandemic planning: a d...
1,0.6988,Estimated effectiveness of symptom and risk sc...
2,0.6944,"Q&A: What are pathogens, and what have they do..."


### DistilBERT

In [None]:
sentence_model2 = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
sentence_model2.to(device)

100%|██████████| 245M/245M [00:03<00:00, 73.9MB/s]


SentenceTransformer(
  (0): Transformer(
    (auto_model): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn): FFN(
              (dro

#### Suggested questions

In [None]:
query1 = 'What are the coronaviruses?'

dbQ1_df = k_closest(sentence_model2, query1, titles, articles_df, 1)
dbQ1_df[['cosine_similarity', 'title']]

Execution Time: 3.450956 seconds



Unnamed: 0,cosine_similarity,title
0,0.6567,Genotyping coronaviruses associated with felin...


In [None]:
query2 = 'What is Coronavirus Disease 2019?'

dbQ2_df = k_closest(sentence_model2, query2, titles, articles_df, 5)
dbQ2_df[['cosine_similarity', 'title']]

Execution Time: 2.100247 seconds



Unnamed: 0,cosine_similarity,title
0,0.7681,An interim review of the epidemiological chara...
1,0.6637,"Potential Rapid Diagnostics, Vaccine and Thera..."
2,0.6166,Q&A: The novel coronavirus outbreak causing CO...
3,0.5782,Molecular Diagnosis of a Novel Coronavirus (20...
4,0.5722,Regulatory T Cells in Arterivirus and Coronavi...


In [None]:
query3 = 'What is caused by SARS-COV2?'

dbQ3_df = k_closest(sentence_model2, query3, titles, articles_df, 10)
dbQ3_df[['cosine_similarity', 'title']]

Execution Time: 3.432332 seconds



Unnamed: 0,cosine_similarity,title
0,0.5973,Surface vimentin is critical for the cell entr...
1,0.5943,The Role of Severe Acute Respiratory Syndrome ...
2,0.585,SARS-CoV Pathogenesis Is Regulated by a STAT1 ...
3,0.5805,Analysis of Intraviral Protein-Protein Interac...
4,0.5744,The SARS-Unique Domain (SUD) of SARS Coronavir...
5,0.5524,Genetic lesions within the 3a gene of SARS-CoV
6,0.5468,Identification of Residues of SARS-CoV nsp1 Th...
7,0.5356,The SARS-Coronavirus-Host Interactome: Identif...
8,0.5337,Different residues in the SARS-CoV spike prote...
9,0.5223,Inhibition of SARS Pseudovirus Cell Entry by L...


#### Extra questions

In [None]:
query4 = 'What are most common underlying diseases in covid-19 patients?'

dbQ4_df = k_closest(sentence_model2, query4, titles, articles_df, 1)
dbQ4_df[['cosine_similarity', 'title']]

Execution Time: 3.407676 seconds



Unnamed: 0,cosine_similarity,title
0,0.657,Q&A: The novel coronavirus outbreak causing CO...


In [None]:
query5 = 'what are the public measures to control the spread of covid-19?'

dbQ5_df = k_closest(sentence_model2, query5, titles, articles_df, 3)
dbQ5_df[['cosine_similarity', 'title']]

Execution Time: 3.402274 seconds



Unnamed: 0,cosine_similarity,title
0,0.6664,Estimated effectiveness of symptom and risk sc...
1,0.5969,Identification of COVID-19 Can be Quicker thro...
2,0.5772,Q&A: The novel coronavirus outbreak causing CO...


### BERT base vs DistilBERT

**The original BERT models are very large: they contain many layers and connections, so they are not energy-efficient and need costly GPU servers to be served at scale. This makes them difficult to deploy in large-scale production and created a need for smaller BERT variants. Several compression techniques have been used to address this problem. Two of the most significant are quantization, where network weights are approximated with lower precision, and weight pruning, where some of the network's connections are removed.**

**In this homework we apply another important technique, distillation. Distillation is a compression technique in which a small model (the student) is trained to reproduce the behavior of a larger model (the teacher). Here, the student network (DistilBERT) is trained to mimic the full output distribution of the teacher network (BERT-base). DistilBERT is a smaller version of BERT in which the token-type embeddings and the pooler are removed. The rest of the architecture is identical, while the number of layers is reduced by a factor of two. According to its authors, DistilBERT has about 40% fewer parameters than BERT-base and is about 60% faster at inference time. For these reasons, we compare the `bert-base-nli-mean-tokens` sentence transformer with the `distilbert-base-nli-stsb-mean-tokens` sentence transformer, in order to investigate the trade-off between cosine similarity score and execution time for the two models.**
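
To make "mimic the full output distribution" concrete, here is a minimal sketch of a soft-target distillation loss: the cross-entropy between the teacher's and the student's temperature-softened distributions. The logits and the temperature value are invented for illustration, and this shows only the soft-target term; the actual DistilBERT training objective combines it with further loss terms.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.5]        # invented teacher logits
close_student = [3.8, 1.1, 0.4]  # student that nearly matches the teacher
far_student = [0.5, 4.0, 1.0]    # student that disagrees with the teacher

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close, loss_far)
```

The loss is smaller for the student whose distribution is closer to the teacher's, which is the signal the student is trained to minimize.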

**We run the same 5 queries on both models. For each query we ask for a different number k of closest article titles, in order to study the range of scores and times. After executing all the queries we observe a clear contrast between the two models. The cosine similarity scores of the BERT-base model are consistently higher than the corresponding scores of the DistilBERT model: for the top-ranked title of each query, the difference ranges from about 0.04 to about 0.19. The titles that each model returns are not identical, nor in the same order, but some titles are returned by both models. On the other hand, the execution times of the DistilBERT model are usually less than half of the corresponding BERT-base times. The number k of requested titles has little visible effect on the execution time, because `k_closest` encodes all titles on every call regardless of k. The execution times are also relatively short because of the reduced database we use after filtering the articles. To sum up, the theoretical knowledge presented in the previous paragraphs is consistent with our experiments. Essentially, the DistilBERT model gives faster but lower-scoring predictions, due to its reduced number of layers and parameters, while the larger BERT-base model produces higher-scoring but slower results. There are many more criteria that can be used to compare such complicated models, but score metrics and execution time are the ones that must definitely be mentioned.**
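
The comparison can be checked with a few lines of arithmetic. The sketch below copies the top-ranked cosine similarity and the execution time of each query from the outputs above and computes the per-query score gap and speedup; the dictionary layout and variable names are our own.

```python
# (top-1 cosine similarity, execution time in seconds) per query, copied from the outputs above
bert_base = {'Q1': (0.8506, 12.332064), 'Q2': (0.8758, 3.978883), 'Q3': (0.7361, 12.161577),
             'Q4': (0.7019, 12.183343), 'Q5': (0.7171, 12.258709)}
distilbert = {'Q1': (0.6567, 3.450956), 'Q2': (0.7681, 2.100247), 'Q3': (0.5973, 3.432332),
              'Q4': (0.6570, 3.407676), 'Q5': (0.6664, 3.402274)}

# Difference in top-1 score (BERT-base minus DistilBERT) and ratio of execution times
score_gaps = {q: bert_base[q][0] - distilbert[q][0] for q in bert_base}
speedups = {q: bert_base[q][1] / distilbert[q][1] for q in bert_base}

for q in bert_base:
    print(f"{q}: score gap = {score_gaps[q]:.4f}, speedup = {speedups[q]:.2f}x")
```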

## Passage Retrieval

In [None]:
!pip install colorama

import colorama
import re 
from transformers import BertTokenizer, BertForQuestionAnswering

Collecting colorama
  Downloading https://files.pythonhosted.org/packages/44/98/5b86278fbbf250d239ae0ecb724f8572af1c91f4a11edf4d36a206189440/colorama-0.4.4-py2.py3-none-any.whl
Installing collected packages: colorama
Successfully installed colorama-0.4.4


In [None]:
# function that gets a question and an article's body text 
# and returns the article's passage that answers the given question and its score
def retrieve_passage(model, tokenizer, question, text):
    # tokenize combined question and text string
    input_ids = tokenizer.encode(question, text)

    sep_ind = input_ids.index(tokenizer.sep_token_id) # get index of first [SEP] token

    segA_toks = sep_ind + 1 # segment A tokens + [SEP] token 
    segB_toks = len(input_ids) - segA_toks # segment B tokens

    seg_ids = [0]*segA_toks + [1]*segB_toks # construct the list of 0s and 1s
    assert len(seg_ids) == len(input_ids) # every input token must have a segment id
    
    # truncate to BERT's 512-token limit and run the model (no gradients needed for inference)
    input_ids = input_ids[:512]
    seg_ids = seg_ids[:512]
    with torch.no_grad():
        start_scores, end_scores = model(torch.tensor([input_ids]).to(device),
                                         token_type_ids=torch.tensor([seg_ids]).to(device),
                                         return_dict=False)

    tokens = tokenizer.convert_ids_to_tokens(input_ids) # get tokens based on ids

    # get the start token and end token indices
    start_tok_ind = torch.argmax(start_scores).item()
    end_tok_ind = torch.argmax(end_scores).item()
    
    if start_tok_ind <= 0 or end_tok_ind <= 0 or end_tok_ind <= start_tok_ind:
        answer = "None"
        score = -99999.0
    
    else:
        answer = tokens[start_tok_ind]  # answer's first token is the start token

        for i in range(start_tok_ind + 1, end_tok_ind + 1): # traverse the rest of the tokens

            # if it is a subword token, construct the whole token
            if tokens[i][0:2] == '##':
                answer += tokens[i][2:]

            # else add token to the answer after a whitespace
            else:
                answer += ' ' + tokens[i]

        # remove [CLS] and [SEP] tokens
        answer = answer.replace('[CLS]', '')
        answer = answer.replace('[SEP]', '').strip()

        # define score as the average value of the best start and end tokens
        score = (start_scores.max() + end_scores.max()) / 2
        score = score.item()

    return answer, score
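
Two steps of `retrieve_passage` can be illustrated on their own: building the segment ids (0 for the question part up to and including the first `[SEP]`, 1 for the rest) and joining WordPiece tokens back into text. The token list below is hand-written for illustration and is an assumption, not real tokenizer output; `build_segment_ids` and `merge_wordpieces` are hypothetical helper names.

```python
def build_segment_ids(tokens, sep_token='[SEP]'):
    """Segment 0 covers [CLS] + question + first [SEP]; segment 1 covers the rest."""
    first_sep = tokens.index(sep_token)
    seg_a = first_sep + 1
    return [0] * seg_a + [1] * (len(tokens) - seg_a)

def merge_wordpieces(tokens):
    """Join WordPiece tokens into text: '##' pieces are glued to the previous token."""
    text = ''
    for tok in tokens:
        if tok.startswith('##'):
            text += tok[2:]
        elif text:
            text += ' ' + tok
        else:
            text = tok
    return text

# Hand-written token sequence (assumed for illustration)
tokens = ['[CLS]', 'what', 'is', 'it', '?', '[SEP]',
          'corona', '##virus', '##es', 'are', 'rna', 'viruses', '[SEP]']
print(build_segment_ids(tokens))
print(merge_wordpieces(tokens[6:9]))
```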

In [None]:
# function that gets a question and all articles' abstracts
# and returns the best passage of each abstract that answers the given question, with its score
def retrieve_all_passages(model, tokenizer, question, abstracts):
    total_answers = []
    total_scores = []

    for i, abstract in enumerate(abstracts):  # get best answer-passage from each article
        answer, score = retrieve_passage(model, tokenizer, question, abstract)
        total_answers.append(answer)
        total_scores.append(score)

    return total_answers, total_scores

In [None]:
# function that displays the passage of each one of k articles that answers a given question
def display_passages(question, articles, answers, scores, best_indices, k):
    print("\n*** The answer-passage is highlighted with red color in the abstract text of each article ***\n")
    print("Question: " + question)
    print()

    for i, ind in enumerate(best_indices):
        article = articles.iloc[ind] # get article in ind position
        
        print("Title: " + article['title'])
        print("Score: " + str(scores[ind]))
        
        abstract = article['abstract']
        
        # cleanse the passage
        passage = answers[ind]
        passage = re.sub(' -', '-', passage)
        passage = re.sub('- ', '-', passage)
        passage = re.sub(' ,', ',', passage)
        passage = re.sub(r'\s([?.!"](?:\s|$))', r'\1', passage)
        passage = re.sub(r'\( ', '(', passage)  # raw strings avoid invalid escape sequences
        passage = re.sub(r' \)', ')', passage)
        
        ins_passage = re.compile(re.escape(passage), re.IGNORECASE)
        new_abstract = ins_passage.sub('\033[31m' + passage + '\033[39m', abstract) # change passage's color to red
        print("Abstract: " + new_abstract)
        print('\n')

### DistilBERT

In [None]:
model0 = BertForQuestionAnswering.from_pretrained('distilbert-base-uncased')
model0.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertForQuestionAnswering: ['distilbert.embeddings.word_embeddings.weight', 'distilbert.embeddings.position_embeddings.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias', 'distilbert.transformer.layer.0.attention.q_lin.weight', 'distilbert.transformer.layer.0.attention.q_lin.bias', 'distilbert.transformer.layer.0.attention.k_lin.weight', 'distilbert.transformer.layer.0.attention.k_lin.bias', 'distilbert.transformer.layer.0.attention.v_lin.weight', 'distilbert.transformer.layer.0.attention.v_lin.bias', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.0.attention.out_lin.bias', 'distilbert.transformer.layer.0.sa_layer_norm.weight', 'distilbert.transformer.layer.0.sa_layer_norm.bias', 'distilbert.transformer.layer.0.ffn.lin1.weight', 'distilbert.transformer.layer.0.ffn.lin1.bias', 'distilbert.transformer.layer.0.ffn.l

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

In [None]:
tokenizer0 = BertTokenizer.from_pretrained('distilbert-base-uncased')

In [None]:
query1 = 'What are the coronaviruses?'

q0_answers, q0_scores = retrieve_all_passages(model0, tokenizer0, query1, articles_df.abstract)
q0_best_indices = [i[0] for i in sorted(enumerate(q0_scores), key=lambda x:-x[1])]

display_passages(query1, articles_df, q0_answers, q0_scores, q0_best_indices[:1], 1)


*** The answer-passage is highlighted with red color in the abstract text of each article ***

Question: What are the coronaviruses?

Title: Identifying Live Bird Markets with the Potential to Act as Reservoirs of Avian Influenza A (H5N1) Virus: A Survey in Northern Viet Nam and Cambodia
Score: 1.3374799489974976
Abstract: Wet markets are common in many parts of the world and may promote the emergence, spread and maintenance of livestock pathogens, including zoonoses. A survey was conducted in order to assess the potential of Vietnamese and Cambodian live bird markets (LBMs) to sustain circulation of highly pathogenic avian influenza virus subtype H5N1 (HPAIV H5N1). Thirty Vietnamese and 8 Cambodian LBMs were visited, and structured interviews were conducted with the market managers and 561 Vietnamese and 84 Cambodian traders. Multivariate and cluster analysis were used to construct a typology of traders based on their poultry management practices[31m. as a result of those practices 

**For the previous task, title retrieval, we experimented with two pretrained sentence-transformer models, both of which use mean-token pooling (`bert-base-nli-mean-tokens`, `distilbert-base-nli-stsb-mean-tokens`). However, these models cannot be loaded directly for our next task, question answering, which needs a BERT tokenizer and a question-answering model. That is why, in our first attempt to build a QA BERT model, we use a similar pretrained model, `distilbert-base-uncased`. As mentioned before, DistilBERT is a smaller version of BERT in which the token-type embeddings and the pooler are removed; it has about 40% fewer parameters than BERT-base, which makes its inference about 60% faster.**

**That is the main reason that led us to use DistilBERT in our first experiment. Time is a significant factor because of the large number of articles that must be checked for the model to find the best passage that answers each question, even though we have already filtered out some of the articles. For our first experiment we posed our primary question to the model, namely 'What are the coronaviruses?'. Although the model's response time was indeed fast, the outcome was very disappointing: the model did not return a usable answer to the question. As the output above shows, no meaningful passage is highlighted in the article's abstract. The returned article itself also seems irrelevant, judging from its title, which refers to live bird markets. A bad outcome could be anticipated from the score the model produced for this question, which is just 1.3374. The score is calculated as the average of the best start-token and end-token scores and, as we will see below, it reaches a value of around 8 in our best experiments. The warning printed when the model was loaded points to the likely cause: the `distilbert-base-uncased` checkpoint was loaded into `BertForQuestionAnswering`, so its weights were not used, and this checkpoint is in any case not fine-tuned for question answering. The model used here is therefore effectively untrained for the task, and the result says little about DistilBERT itself. So we turn to a model that has been fine-tuned for question answering, in order to achieve more satisfying results.**
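
The score described above averages the best start score and the best end score, each chosen independently, so the two positions may not form a valid span. A common alternative, sketched here under our own assumptions, is to search for the best pair with `start <= end`. The function name `best_span`, the `max_len` limit and the scores are all invented for illustration; this is not the method used in the notebook.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = None
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = (s_score + end_scores[e]) / 2  # same averaging as in retrieve_passage
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Invented scores: the independent argmaxes (start=3, end=1) would give an invalid span
start = [0.1, 2.0, 0.3, 5.0, 0.2]
end = [0.0, 6.0, 0.1, 1.0, 4.0]
span = best_span(start, end)
print(span)
```

With this search, a passage is returned whenever any valid span exists, instead of falling back to "None" when the two independent maxima are out of order.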

### BERT-large-uncased + finetuned SQuAD

In [None]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model.to(device)

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-12,

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

#### Suggested questions

In [None]:
query1 = 'What are the coronaviruses?'

q1_answers, q1_scores = retrieve_all_passages(model, tokenizer, query1, articles_df.abstract)
q1_best_indices = [i[0] for i in sorted(enumerate(q1_scores), key=lambda x:-x[1])]

display_passages(query1, articles_df, q1_answers, q1_scores, q1_best_indices[:1], 1)


*** The answer-passage is highlighted with red color in the abstract text of each article ***

Question: What are the coronaviruses?

Title: Infectious Bronchitis Virus Nonstructural Protein 4 Alone Induces Membrane Pairing
Score: 8.640386581420898
Abstract: [31mpositive-strand rna viruses[39m, such as coronaviruses, induce cellular membrane rearrangements during replication to form replication organelles allowing for efficient viral RNA synthesis. Infectious bronchitis virus (IBV), a pathogenic avian Gammacoronavirus of significant importance to the global poultry industry, has been shown to induce the formation of double membrane vesicles (DMVs), zippered endoplasmic reticulum (zER) and tethered vesicles, known as spherules. These membrane rearrangements are virally induced; however, it remains unclear which viral proteins are responsible. In this study, membrane rearrangements induced when expressing viral non-structural proteins (nsps) from two different strains of IBV were comp

In [None]:
query2 = 'What is Coronavirus Disease 2019?'

q2_answers, q2_scores = retrieve_all_passages(model, tokenizer, query2, articles_df.abstract)
q2_best_indices = [i[0] for i in sorted(enumerate(q2_scores), key=lambda x:-x[1])]

display_passages(query2, articles_df, q2_answers, q2_scores, q2_best_indices[:5], 5)

Token indices sequence length is longer than the specified maximum sequence length for this model (608 > 512). Running this sequence through the model will result in indexing errors



*** The answer-passage is highlighted with red color in the abstract text of each article ***

Question: What is Coronavirus Disease 2019?

Title: Genetic manipulation of porcine deltacoronavirus reveals insights into NS6 and NS7 functions: a novel strategy for vaccine design
Score: 8.00501823425293
Abstract: Porcine deltacoronavirus (PDCoV) is an emerging swine coronavirus that causes [31msevere diarrhea[39m, resulting in high mortality in neonatal piglets. Despite widespread outbreaks in many countries, no effective PDCoV vaccines are currently available. Here, we generated, for the first time, a full-length infectious cDNA clone of PDCoV. We further manipulated the infectious clone by replacing the NS6 gene with a green fluorescent protein (GFP) to generate rPDCoV-ΔNS6-GFP; likewise, rPDCoV-ΔNS7 was constructed by removing the ATG start codons of the NS7 gene. Growth kinetics studies suggest that rPDCoV-ΔNS7 could replicate similarly to that of the wild-type PDCoV, whereas rPDCoV
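The tokenizer warning in this cell's output (608 > 512) suggests that at least one question and abstract pair exceeds BERT's 512-token limit. How `retrieve_all_passages` handles such inputs is not shown here, so the following is only a sketch of one common workaround: splitting the abstract into overlapping windows that each fit the limit. The function name `split_into_windows` and the `overlap` value are assumptions made for illustration. Hugging Face tokenizers also offer built-in options for this (`truncation`, `stride`, `return_overflowing_tokens`).

```python
def split_into_windows(context_tokens, question_len, max_len=512, overlap=64):
    """Split context tokens into overlapping windows that fit the model limit.

    Each model input has the form [CLS] question [SEP] window [SEP],
    so three positions are reserved for special tokens.
    """
    budget = max_len - question_len - 3  # tokens available for the context
    step = budget - overlap              # how far each new window advances
    if budget <= 0 or step <= 0:
        raise ValueError("question or overlap is too large for max_len")

    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + budget])
        if start + budget >= len(context_tokens):
            break  # the last window reached the end of the context
        start += step
    return windows

# Toy example: a 600-token abstract and an 8-token question
tokens = [f"tok{i}" for i in range(600)]
windows = split_into_windows(tokens, question_len=8)
print([len(w) for w in windows])  # [501, 163]
```

Each window would then be scored separately, and the best-scoring span across all windows kept as the article's answer.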

In [None]:
query3 = 'What is caused by SARS-COV2?'

q3_answers, q3_scores = retrieve_all_passages(model, tokenizer, query3, articles_df.abstract)
q3_best_indices = [i[0] for i in sorted(enumerate(q3_scores), key=lambda x:-x[1])]

display_passages(query3, articles_df, q3_answers, q3_scores, q3_best_indices[:10], 10)


*** The answer-passage is highlighted with red color in the abstract text of each article ***

Question: What is caused by SARS-COV2?

Title: The Disulfide Bonds in Glycoprotein E2 of Hepatitis C Virus Reveal the Tertiary Organization of the Molecule
Score: 7.8293867111206055
Abstract: Hepatitis C virus (HCV), a major cause of [31mchronic liver disease[39m in humans, is the focus of intense research efforts worldwide. Yet structural data on the viral envelope glycoproteins E1 and E2 are scarce, in spite of their essential role in the viral life cycle. To obtain more information, we developed an efficient production system of recombinant E2 ectodomain (E2e), truncated immediately upstream its trans-membrane (TM) region, using Drosophila melanogaster cells. This system yields a majority of monomeric protein, which can be readily separated chromatographically from contaminating disulfide-linked aggregates. The isolated monomeric E2e reacts with a number of conformation-sensitive monocl

#### Extra questions

In [None]:
query4 = 'What are most common underlying diseases in covid-19 patients?'

q4_answers, q4_scores = retrieve_all_passages(model, tokenizer, query4, articles_df.abstract)
q4_best_indices = [i[0] for i in sorted(enumerate(q4_scores), key=lambda x:-x[1])]

display_passages(query4, articles_df, q4_answers, q4_scores, q4_best_indices[:1], 1)


*** The answer-passage is highlighted with red color in the abstract text of each article ***

Question: What are most common underlying diseases in covid-19 patients?

Title: Neurologic Alterations Due to Respiratory Virus Infections
Score: 7.386350631713867
Abstract: Central Nervous System (CNS) infections are one of the most critical problems in public health, as frequently patients exhibit neurologic sequelae. Usually, CNS pathologies are caused by known neurotropic viruses such as measles virus (MV), herpes virus and human immunodeficiency virus (HIV), among others. However, nowadays respiratory viruses have placed themselves as relevant agents responsible for CNS pathologies. Among these neuropathological viruses are the human respiratory syncytial virus (hRSV), the influenza virus (IV), the coronavirus (CoV) and the human metapneumovirus (hMPV). These viral agents are leading causes of acute respiratory infections every year affecting mainly children under 5 years old and also 

In [None]:
query5 = 'what are the public measures to control the spread of covid-19?'

q5_answers, q5_scores = retrieve_all_passages(model, tokenizer, query5, articles_df.abstract)
q5_best_indices = [i[0] for i in sorted(enumerate(q5_scores), key=lambda x:-x[1])]

display_passages(query5, articles_df, q5_answers, q5_scores, q5_best_indices[:3], 3)


*** The answer-passage is highlighted with red color in the abstract text of each article ***

Question: what are the public measures to control the spread of covid-19?

Title: Local risk perception enhances epidemic control
Score: 7.057101249694824
Abstract: As infectious disease outbreaks emerge, public health agencies often enact [31mvaccination and social distancing measures[39m to slow transmission. Their success depends on not only strategies and resources, but also public adherence. Individual willingness to take precautions may be influenced by global factors, such as news media, or local factors, such as infected family members or friends. Here, we compare three modes of epidemiological decision-making in the midst of a growing outbreak using network-based mathematical models that capture plausible heterogeneity in human contact patterns. Individuals decide whether to adopt a recommended intervention based on overall disease prevalence, the proportion of social contacts inf

**As mentioned above, lighter BERT models such as DistilBERT appear too weak for a demanding task like question answering, so we needed a stronger model for our QA task. We therefore use the pretrained `bert-large-uncased-whole-word-masking-finetuned-squad` model, which is BERT-large already fine-tuned on the SQuAD benchmark. SQuAD (the Stanford Question Answering Dataset) is a reading comprehension dataset created at Stanford for QA tasks. It consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. In some cases the question may be unanswerable. SQuAD offers around 150,000 questions, which is not large by deep learning standards but is still enough for a simpler QA task like ours. We should also keep in mind that BERT-large is a very large model, with 24 layers and a hidden (embedding) size of 1024, for a total of about 340M parameters. As a result, the combination of BERT-large and SQuAD fine-tuning leads to a much slower execution than the DistilBERT one.**
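To make the span-based answer format concrete, here is a minimal sketch of how an answer span and its score can be derived from the start and end logits that a SQuAD-style model produces for every token. It assumes the score is the sum of the chosen start and end logits, which is a common convention but is not confirmed by this notebook. The function name `best_span`, the `max_answer_len` cap, and the toy logits are all hypothetical.

```python
import numpy as np

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the span with the highest logit sum."""
    start_logits = np.asarray(start_logits, dtype=float)
    end_logits = np.asarray(end_logits, dtype=float)
    best = (0, 0, -np.inf)
    for s in range(len(start_logits)):
        # the end may not precede the start, and the span length is capped
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Toy tokens and logits (made up for illustration, not model output)
tokens = ["positive", "-", "strand", "rna", "viruses", ",", "such", "as"]
start_logits = [4.0, 0.1, 0.5, 0.2, 0.3, -1.0, -2.0, -2.0]
end_logits = [0.2, 0.1, 0.4, 1.0, 4.5, -1.0, -2.0, -2.0]

s, e, score = best_span(start_logits, end_logits)
print(tokens[s:e + 1], score)
```

Searching over valid (start, end) pairs, rather than taking the two argmax positions independently, avoids returning a span whose end comes before its start.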

**In our first experiment with the new model we can already observe a clear improvement in both the score and the terms of the retrieved passage, even though the model takes much longer to respond. For the question 'What are the coronaviruses?' the model responds with a very convincing passage from the article 'Infectious Bronchitis Virus Nonstructural Protein 4 Alone Induces Membrane Pairing', which is highlighted in red in the article's body text (abstract). According to this article coronaviruses are 'positive-strand rna viruses', which is a fairly accurate answer with a score of 8.6403. This is the highest score among all the questions we experiment with, which suggests that the question is simple and fundamental, since an answer to it can be found in many of the dataset's articles. In the following experiments we pose more complex questions and ask for more than one possible answer. For instance, in our third experiment we pose the question 'What is caused by SARS-COV2?' and ask for the 10 best answers. The best answer, with a score of 7.8293, is 'chronic liver disease'. This span is correct with respect to its own abstract, where it names what hepatitis C virus causes, but it does not answer a question about SARS-CoV-2. The third answer, with almost the same score (7.6833), is 'severe acute respiratory syndrome', a.k.a. 'SARS', which is also inaccurate, since SARS is the disease caused by SARS-CoV rather than SARS-CoV-2. Therefore, a high score does not automatically imply a correct answer. This is a reasonable conclusion, especially when we look for the best passages across multiple articles. It is confirmed by our second example, where we ask 'What is Coronavirus Disease 2019?' and the best answer is 'severe diarrhea' with a high score of 8.0050: the model answers with high confidence but in practice returns an inaccurate passage. The paradox is that all the following answers to this question also have high scores, between 7 and 8. 
Another possibility is that better answers to the query do exist, but the model is misled by the wording of the question and ends up answering a different one. An important factor that can mislead the model is the filtering we apply to the dataset. For example, if a query's keyword is not among the filtering keywords, there is a good chance that articles that could answer the query have been filtered out of the database. The keywords can therefore substantially restrict the kinds of questions the model can answer. Yet this does not seem to be the case in our second experiment, as the keywords 'coronavirus' and 'disease' are used in our filtering process. In the end, it is hard to know why the model fails to answer such a simple question after having already answered a very similar one, 'What are the coronaviruses?'. To sum up, a BERT model that is fine-tuned on the SQuAD dataset leads to much better QA results. However, many factors can expose this model's weaknesses.**
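The point about filtering keywords can be checked quickly for any query. The sketch below lists which filter keywords occur in a query, using case-insensitive substring matching. The keyword list is only an illustrative subset of the ones named in the pre-processing section, the matching rule is an assumption (the notebook filters on article titles and may match differently), and the function name `covered_keywords` is hypothetical.

```python
# Illustrative subset of the title-filter keywords (assumption: not the full list)
FILTER_KEYWORDS = ["RNA virus", "SARS", "coronavirus", "COVID",
                   "SARS-CoV-2", "2019-nCoV", "vaccine"]

def covered_keywords(query, keywords=FILTER_KEYWORDS):
    """Return the filter keywords that occur in the query, ignoring case."""
    lowered = query.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

for query in ["What are the coronaviruses?",
              "What is Coronavirus Disease 2019?",
              "What is caused by SARS-COV2?"]:
    print(query, "->", covered_keywords(query))
```

A query that matches no keyword at all would be a warning sign that the relevant articles may have been filtered out before retrieval.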