<b><u>DISCLAIMER</u></b>:  THIS SET OF CODE IS FOR USE IN A NON-CLINICAL RESEARCH SETTING.  IT IS NOT INTENDED FOR USE TO TREAT OR DIAGNOSE PATIENTS, DRIVE CLINICAL MANAGEMENT OR INFORM CLINICAL MANAGEMENT.

This software is provided on an “as is” basis without any warranty or liability whatsoever. Northrop Grumman Systems Corporation (NGSC) expressly disclaims, to the maximum extent permissible by applicable law, all warranties, express, implied and statutory, including without limitation any implied warranty of merchantability, fitness for a particular purpose, non-infringement, or arising from course of performance, dealing, usage, or trade. NGSC shall not be liable to the Government or any user of the software for any incidental, consequential, special or other damages, including loss of profits, revenue, or data, resulting, directly or indirectly, from use of the software.

© Northrop Grumman Systems Corporation 

In [None]:
from IPython.display import Image
Image(filename="../input/covid19-images/NGC.PNG", width=500, height=500)

# Information Retrieval with Semantic Similarity and BERT

In this notebook the powerful pre-trained [BERT](https://arxiv.org/abs/1810.04805) Language Model is used to compute semantic similarity scores over all sentences in the dataset in regard to some of the tasks presented (queries). Essentially this is a form of fine-grained information retrieval where the most relevant sentences are returned for each task (in contrast to high-level information retrieval where an entire article is returned). Before using BERT for semantic similarity computation, word vectors are learned over the dataset using the famous [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model. Although these word embeddings aren't used throughout the remainder of the notebook they may prove beneficial to other researchers and challenge participants as they try to tackle downstream tasks.

In [None]:
from IPython.display import Image
Image(filename="../input/covid19-images/covid-19.jpg", width=500, height=500)

Source: [Image](http://www.sci-news.com/medicine/guidelines-management-icu-patients-covid-19-08295.html)

## Project Overview 
The goal of this project is to utilize state-of-the-art NLP techniques for text mining and information retrieval using transformer-based LMs. This is accomplished by feeding a pre-trained BERT model with sentences and queries while computing similarity scores. By using this technique we extract the most relevant sentences for each task question. Hopefully this tool provides helpful information to medical personnel and other researchers as they seek to quickly get answers to many questions about the deadly COVID-19 virus.  

Note that this notebook was created using an older iteration of the dataset, about 33,000 total articles; thus the stages for reading in the data may need to be adjusted when working with the latest release. This was done due to the time constraint of the challenge as well as the computation requirement for using BERT. As this is the case the notebook has been loaded with the older dataset under the "cord1933k" folder. 

Also note that the additional data used is <b><u>only</u></b> for runtime speedup purposes and does not pull any information other than that contained in CORD-19. The data in the "cord19cleaneddata" folder is the original data saved in csv files for quick loading to avoid data pre-processing steps. Likewise the "cord19results" folder contains the results from running this model over various queries across the corpus using outside GPU hardware resources. 

One final thing to mention is that in various parts of this notebook the dataset has been condensed with slicing. This is only for runtime speedup purposes and thus the answers derived from BERT and word2vec may be affected. <b><u>For the best results change the code where stated to compute over ALL data</u></b>.

Two useful sources while developing this code base were [COVID-19 Literature Clustering](https://www.kaggle.com/maksimeren/covid-19-literature-clustering) and [BERT Semantic Text Similarity](https://github.com/AndriyMulyar/semantic-text-similarity)

## Notebook Setup 

This notebook is partitioned into the following sections:
1. Background Information (Word2Vec / BERT / Semantic Similarity)
2. Approach Overview (Pros / Cons)
3. Loading CORD-19 Data
4. Data Pre-Processing & Cleaning 
5. Computing Word Vectors with Word2Vec
6. Visualizing Learned Word Vectors, Computing Cosine Similarity Between Words, and Extracting the Most Similar Words
7. Computing Most Relevant Sentences / Articles
8. Results

Note that all sections except for 5 and 6 are self-contained meaning that they aren't dependent on running earlier sections of code <b><u>given that the library imports in section 1 have been executed</u></b>. In other words if a viewer would just like to look at the results they can skip to section 8 and run the code blocks there, if they simply want to use BERT they can skip to section 7 and run all code sequentially within it, etc. To allow the notebook to function in this manner intermediate processing of the dataset has been saved in the "cord19cleaneddata" folder. Likewise all results have been stored under the "cord19results" folder since massive GPU hardware resources are required to get lots of answers quickly. 

To run section 6, section 5 must first be executed. 

<b><u>Again, to view the results ONLY please go to section 8 and directly run the code blocks there skipping all other sections.</u></b>

## 1. Background Information

Before diving into the code and results there are a number of key concepts that will be covered as they are used throughout the notebook. For those well-versed in the field of NLP this background knowledge can be skipped. 

### Word2Vec

Word2vec is a family of models for learning vectors (embeddings) for the words in a corpus. It was one of the major breakthroughs that started the Deep Learning revolution in NLP, as these embeddings can be fed into neural networks to better accomplish downstream tasks. Embeddings are typically learned in one of two ways: in the skip-gram model a single word is used to predict the words around it, while in the continuous bag-of-words (CBOW) model the context around a word is used to predict the word itself. Both techniques map words to a vector space (typically 100, 200, or 300 dimensional) in which words that appear in similar contexts end up close together. 

For an in-depth tutorial on this subject see [Word2vec](https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1).
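
To make the difference between the two training setups concrete, the sketch below builds the training examples each one would see for a short sentence. It is only an illustration of the idea: the function names and the toy sentence are made up here, and real implementations add steps such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context_words, center) examples: the neighbors predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ["the", "virus", "spreads", "quickly"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one example per (word, neighbor) pair, while CBOW produces one example per word with all of its neighbors grouped together.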

### BERT

BERT, or Bidirectional Encoder Representations from Transformers, is one of the major recent accomplishments in the field of NLP. At the time of its release in 2018 this architecture set the state of the art on eleven NLP tasks, including the GLUE benchmark (which covers tasks such as natural language inference and sentence similarity) and SQuAD question answering. BERT builds on the encoder of the [Transformer](https://arxiv.org/pdf/1706.03762.pdf) architecture and conditions on both left and right context (bidirectional) in every layer. It is pre-trained without labeled data on the BooksCorpus and English Wikipedia using a [Masked Language Model](https://www.quora.com/What-is-a-masked-language-model-and-how-is-it-related-to-BERT) objective, in which some input tokens are hidden and the model learns to predict them from the surrounding context. 

In [None]:
from IPython.display import Image
Image(filename="../input/covid19-images/BERT.png", width=500, height=500)

Source: [BERT Explained](https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270)

After learning the LM in an unsupervised fashion it can be fine-tuned on various tasks. This has allowed the architecture to produce incredible scores as it already understands the concept of language extremely well. Thus, instead of having to learn both a language and a task during supervised training, the model only has to learn the task. For an in-depth overview of BERT see [Illustrated BERT](http://jalammar.github.io/illustrated-bert/).
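
The masked language model objective mentioned above can be illustrated with a few lines of code. This is a simplified sketch, not BERT's actual pre-processing: the function name and the toy sentence are assumptions, and the real procedure selects about 15% of tokens but only replaces most of those with the mask token, leaving some unchanged or swapping in a random token.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[position] = token  # the model is trained to recover this token
        else:
            masked.append(token)
    return masked, targets


tokens = "the virus spreads through respiratory droplets".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

Because the hidden words come from the text itself, no human labels are needed, which is what allows pre-training on very large corpora.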

### Semantic Similarity

Semantic similarity is a metric defined over a set of documents or terms in which the distance between two items reflects how close they are in meaning, as opposed to lexical similarity, which is estimated from their surface form (string format). In other words, the goal of semantic textual similarity is to take two sentences or passages as input and output a score, with higher scores for pairs that are more similar in meaning and lower scores for pairs that are less similar. Transformer-based models such as BERT, [XLNet](https://arxiv.org/abs/1906.08237), and [RoBERTa](https://arxiv.org/abs/1907.11692) have essentially set the state-of-the-art on this task. To compute semantic similarity scores over sentences and task queries the following architecture is used:

In [None]:
from IPython.display import Image
Image(filename="../input/covid19-images/Bert_STS.PNG", width=500, height=500)
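
To see why a surface-level (string-based) measure is not enough, the sketch below scores sentence pairs by word overlap alone. The Jaccard measure and the example sentences are chosen here purely for illustration and are not part of the notebook's pipeline.

```python
def jaccard_similarity(a, b):
    """Word-overlap similarity: shared words divided by all distinct words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Same meaning, almost no shared words: overlap is low.
paraphrase = jaccard_similarity(
    "the virus spreads through coughing",
    "respiratory droplets transmit the pathogen",
)

# Opposite meaning, mostly shared words: overlap is high.
opposite = jaccard_similarity(
    "masks reduce transmission",
    "masks do not reduce transmission",
)

print(paraphrase, opposite)
```

The overlap score ranks the contradictory pair above the paraphrase, which is the kind of failure a model trained for semantic similarity is meant to avoid.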

## 2. Approach Overview

The approach implemented here has a number of pros and cons. In future iterations the cons can be addressed, making the system faster and its results more powerful. 

### Pros

1. Fine-Grained Results: Often contain a direct answer to the query posed
2. Query Scaling: Works on any input and is not task specific 
3. Reporting Potential: Allows for the N most relevant sentences to be computed where a report can then be generated
4. Dataset Reduction: Derive the N most relevant articles
5. Leveraging Powerful Bert LM: Takes advantage of the current state-of-the-art NLP LM technique
6. Literature Backing: Bert-based models set the bar in STS (Semantic Textual Similarity) Tasks

### Cons

1. Compute Time: BERT is a large model (roughly 110 million parameters for BERT-base); scoring one query against every sentence in the corpus takes 7-8 hours on a GPU
2. Hardware Requirements: GPU hardware is critical for fast computation time   
3. Classification Context Level: Currently each sentence is scored on its own; in the future, computing scores across sets of sentences or paragraphs could help ensure proper context
4. Query Wording: Wording can impact results

Note that for the first con (currently the most problematic) there is an obvious solution that could be implemented going forward: use a technique similar to [Sentence-BERT](https://arxiv.org/abs/1908.10084), where sentence embeddings are pre-computed across the corpus. At runtime the only computation required is then the query embedding and the similarity scores. Thus, instead of sending all the sentences through the model for each query, we only send the query. If this system were deployed in the real world this would be a requirement; otherwise the computation time is simply too large. 
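
The pre-compute-then-query pattern described above can be sketched as follows. The encoder here is a deliberately crude stand-in (a bag-of-words count vector named `toy_embed`, invented for this example) so that the code runs without a model; in practice it would be replaced by a real sentence encoder.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


corpus = [
    "the incubation period is about five days",
    "hospitals reported shortages of protective equipment",
    "transmission occurs through respiratory droplets",
]

# Offline step: embed every sentence once and keep the vectors.
corpus_embeddings = [toy_embed(sentence) for sentence in corpus]


def search(query, top_n=1):
    """Online step: embed only the query, then score it against the cached vectors."""
    q = toy_embed(query)
    scored = sorted(zip((cosine(q, e) for e in corpus_embeddings), corpus), reverse=True)
    return [sentence for _, sentence in scored[:top_n]]


print(search("what is the incubation period"))
```

The expensive encoding of the corpus happens once; each new query costs one encoding plus a cheap similarity computation per cached vector.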

## 3. Loading CORD-19 Data

The first step before doing any data processing will be to load in all the libraries that will be used throughout this notebook. As you can see a variety of these are used; pandas for storing data frames, nltk for tokenization, torch for BERT model description/architecture, etc. 

In [None]:
#Installs needed
!pip install langdetect
!pip install semantic-text-similarity

#Libraries needed
import pandas as pd 
import glob
import json
import re 
import numpy as np
import copy 
import torch 
import matplotlib.pyplot as plt
from langdetect import detect
from semantic_text_similarity.models import ClinicalBertSimilarity
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
print("Done")

After setting up our environment by loading in all the necessary libraries we load the CORD-19 metadata. This provides a lot of relevant information that can be used for pre-processing and analysis. 

In [None]:
#Read in the metadata
print("Reading in metadata.")
metadata_path = '../input/cord1933k/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
print("Metadata read.")
print()

Since the actual text is contained within four separate folders of json files (biorxiv_medrxiv, comm_use_subset, custom_license, and noncomm_use_subset) this code goes through all these folders and extracts file names for article loading. 

In [None]:
#Get all file names 
all_json = glob.glob('../input/cord1933k/**/*.json', recursive=True)
print(str(len(all_json))+" total file paths fetched.")
print()

Next we create a class description for a json file reader used in loading articles. This class functions by loading the file and then extracting the paper id, article abstract, and text body. 

In [None]:
#Reader class for files 
class FileReader:
    """Loads one CORD-19 json file and exposes its paper id, abstract, and body text."""
    def __init__(self, file_path):
        with open(file_path, encoding='utf-8') as file:
            content = json.load(file)
        self.paper_id = content['paper_id']
        # Each entry holds one paragraph; paragraphs are joined with newlines.
        # .get() guards against files that have no 'abstract' or 'body_text' key.
        self.abstract = '\n'.join(entry['text'] for entry in content.get('abstract', []))
        self.body_text = '\n'.join(entry['text'] for entry in content.get('body_text', []))
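
For readers who have not opened one of the json files, the sketch below shows the layout the reader relies on and parses a tiny made-up document. Only the three fields used in this notebook are included; the sample values and the helper name `read_paper` are assumptions for illustration.

```python
import json
import os
import tempfile

# Minimal document with the three fields the reader relies on (contents are made up).
sample = {
    "paper_id": "abc123",
    "abstract": [{"text": "First abstract paragraph."}],
    "body_text": [{"text": "Intro paragraph."}, {"text": "Methods paragraph."}],
}


def read_paper(file_path):
    """Return (paper_id, abstract, body_text) with paragraphs joined by newlines."""
    with open(file_path, encoding="utf-8") as file:
        content = json.load(file)
    abstract = "\n".join(entry["text"] for entry in content.get("abstract", []))
    body_text = "\n".join(entry["text"] for entry in content.get("body_text", []))
    return content["paper_id"], abstract, body_text


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(sample, file)
    paper_id, abstract, body_text = read_paper(path)

print(paper_id, repr(body_text))
```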

Here we read the json article files into a Python dictionary. From each article we extract the paper id, abstract, and text body, and from the metadata we look up the title, authors, journal, and url. Articles without a matching metadata entry are skipped. After reading in the articles we convert the dictionary to a pandas data frame for easier use and access. 

<b><u>Currently this is only done over the first 1000 articles due to runtime. To perform this computation over the entire dataset change line 4 from</u></b>

"for idx, entry in enumerate(all_json[:1000]):"

<b><u>to</u></b>

"for idx, entry in enumerate(all_json):"

In [None]:
#Read in all text files and store in a Pandas data frame 
print("Reading in text files.")
dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'title': [], 'authors': [], 'journal': [], 'url': []}
for idx, entry in enumerate(all_json[:1000]):
    print(f'Processing index: {idx} of {len(all_json)}', end='\r')
    content = FileReader(entry)
    
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue
    
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
    dict_['title'].append(meta_data.iloc[0]['title'])
    dict_['authors'].append(meta_data.iloc[0]['authors'])
    dict_['journal'].append(meta_data.iloc[0]['journal'])
    dict_['url'].append(meta_data.iloc[0]['url'])
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'title', 'authors', 'journal', 'url'])
dict_ = None
print("Text files read.")
print()

To aid in pre-processing and data cleaning we compute the number of words in both the abstract and text body of articles and add these to the pandas data frame. 

In [None]:
#Get a count for the number of words in articles and abstracts 
print("Getting abstract and body word counts.")
df_covid['abstract_word_count'] = df_covid['abstract'].apply(lambda x: len(x.strip().split()))
df_covid['body_word_count'] = df_covid['body_text'].apply(lambda x: len(x.strip().split()))
print("Counts computed.")
print()

Finally we save the data frame for quicker access later (for example if we want to change the pre-processing steps) as the computations above take some time for completion.

In [None]:
#Saving the dataframe 
print("Saving the dataframe.")
df_covid.to_csv('covidData.csv') 
print("Dataframe saved.")

## 4. Data Pre-Processing & Cleaning

Before computing word vectors or using BERT for semantic similarity we first do a number of pre-processing steps to clean up the data. This consists of removing articles that don't meet certain criteria as well as non-English text. 

The code below can be modified to load the file saved at the end of the section above if edits are made to it. Otherwise the pre-computed data frame can be loaded from "cord19cleaneddata". 

<b><u>Currently we only load in and use the first 1000 articles due to runtime. To use the entire dataset remove line 6 that reads</u></b>

"df_covid = df_covid.head(1000)"

In [None]:
#Read in the saved data 
print("Loading the dataframe.")
df_covid = pd.read_csv('../input/cord19cleaneddata/covidData.csv')
print("Dataframe loaded.")
print()
df_covid = df_covid.head(1000)

Next we perform data cleaning by removing all articles with a text body shorter than 1000 words. This number is arbitrary and can be modified. The thinking here is that this may help to remove less refined texts for better downstream results. 

In [None]:
#Remove all articles that have fewer than the number of words specified 
min_word_count = 1000
print("Removing all articles with fewer than "+str(min_word_count)+" words.")
indexNames = df_covid[df_covid['body_word_count'] < min_word_count].index
df_covid = df_covid.drop(indexNames)
df_covid = df_covid.reset_index(drop=True)
print("Articles cleaned.")
print()

As another pre-processing step we remove non-English articles.

In [None]:
#Remove all non-English articles
print("Removing all non-English articles")
index = 0
indexNames = []
while(index < len(df_covid)):
    print(f'Processing index: {index} of {len(df_covid)}', end='\r')
    language = detect(df_covid.iloc[index]['body_text'])
    if(language != 'en'):
        indexNames.append(index)
    index += 1
df_covid = df_covid.drop(indexNames)
df_covid = df_covid.reset_index(drop=True)
print("All non-English articles removed. Total article count is now: "+str(len(df_covid)))
print()

Again we save the pre-processed data for faster computation of downstream tasks. 

In [None]:
#Save the cleaned dataset 
print("Saving the dataframe.")
df_covid.to_csv('covidDataCleaned.csv') 
print("Dataframe saved.")

## 5. Computing Word Vectors with Word2Vec

Here we learn word vectors over all tokens in the CORD-19 data set. Although these vectors are not directly used in the remainder of this notebook they may prove beneficial to other researchers/groups in downstream tasks. 

First we read in the pre-processed data set. The code below can be modified to load the file saved at the end of the section above if edits are made to it. Otherwise the pre-computed data frame can be loaded from "cord19cleaneddata".  

<b><u>Currently we only load in and use the first 5000 articles due to runtime. To use the entire dataset remove line 6 that reads</u></b>

"df_covid = df_covid.head(5000)"

In [None]:
#Read in the saved data 
print("Loading the dataframe.")
df_covid = pd.read_csv('../input/cord19cleaneddata/covidDataCleaned.csv')
print("Dataframe loaded.")
print()
df_covid = df_covid.head(5000)

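Next we extract the text body from articles and do a little more pre-processing by converting all text to lowercase and removing every character other than letters, digits, whitespace, and the sentence-ending punctuation marks (. ! ?), which are kept so that sentences can still be split later. 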

In [None]:
#Get the article text and pre-process by converting to lowercase and removing weird characters 
print("Pre-processing articles by converting to lowercase and removing random characters.")
articleTexts = df_covid.drop(["paper_id", "abstract", "title", "authors", "journal", "url", "abstract_word_count", "body_word_count"], axis=1)
articleTexts['body_text'] = articleTexts['body_text'].apply(lambda x: x.lower())
articleTexts['body_text'] = articleTexts['body_text'].apply(lambda x: re.sub(r'[^a-z0-9.!?\s]','',x))
text_list = list(articleTexts['body_text'])
print("Pre-processing complete.")
print()
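
The effect of this cleaning step is easiest to see on a single string. The sketch below applies the same lowercase-then-filter pattern to a made-up sentence; the helper name `clean_text` is introduced only for this example.

```python
import re


def clean_text(text):
    """Lowercase, then keep only a-z, 0-9, whitespace, and the characters . ! ?"""
    return re.sub(r"[^a-z0-9.!?\s]", "", text.lower())


cleaned = clean_text("SARS-CoV-2 (COVID-19) spreads quickly, doesn't it?")
print(cleaned)
```

Note that hyphens are deleted rather than replaced with spaces, so hyphenated terms collapse into single tokens (for example "COVID-19" becomes "covid19"), which is why such forms show up in the vocabulary later on.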

Here text is tokenized into a list of sentences with each element in the list containing a list of words. This is the format [gensim](https://radimrehurek.com/gensim/models/word2vec.html) expects as input for learning word2vec embeddings. 

In [None]:
#Extract all sentences - gensim expects a sequence of sentences as input
#Example: sentences = [['first', 'sentence'], ['second', 'sentence']]
print("Tokenizing articles into sentences and words.")
sentences = []
for index, text in enumerate(text_list):
    sentenceList = sent_tokenize(text)
    for sentence in sentenceList:
        wordList = word_tokenize(sentence)
        sentences.append(wordList)
    print("Processing article "+str(index)+" of "+str(len(text_list)), end="\r")
print("A total of "+str(len(sentences))+" sentences have been tokenized for word2vec training.")
print()
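
The nested-list shape that gensim expects can be shown on a toy input. To keep this sketch free of downloads it uses a rough regex-based splitter as a stand-in for nltk's `sent_tokenize` and `word_tokenize`; the helper names are assumptions and the splitting rules are far simpler than nltk's.

```python
import re


def simple_sentences(text):
    """Rough stand-in for a sentence tokenizer: split after . ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def to_word_lists(texts):
    """Flatten articles into one list of sentences, each a list of word tokens."""
    sentences = []
    for text in texts:
        for sentence in simple_sentences(text):
            sentences.append(re.findall(r"[a-z0-9]+|[.!?]", sentence))
    return sentences


articles = ["the virus spreads. masks help!", "testing matters."]
word_lists = to_word_lists(articles)
print(word_lists)
```

Article boundaries are not preserved: all sentences from all articles end up in one flat list, which is fine for word2vec because it only looks at words within a sentence.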

After tokenizing the text word2vec is trained. The embedding size used is 100 with a window size of 5 and minimum word count of 10. For potentially more expressive embeddings the vector size can be increased to 200, 300, etc. Note that it is not recommended to go above 500 due to the limited size of the corpus. We save the learned word vectors for use in other downstream tasks. 

In [None]:
#Train & save the word2vec model 
print("Training word2vec.")
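#Note: the names below follow gensim 3.x; gensim 4.x renamed size to vector_size and wv.vocab to wv.key_to_index
model = Word2Vec(sentences, size=100, window=5, min_count=10, workers=4)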
print("Word count:", len(list(model.wv.vocab)))
model.save("word2vec.model")
print("Finished training and saving word2vec.")

## 6. Visualizing Learned Word Vectors, Computing Cosine Similarity between Words, and Extracting the Most Similar Words

Using the learned word vectors we now run a few analyses. First we visualize some of the more prominent words, using t-SNE for dimensionality reduction and matplotlib for plotting. T-Distributed Stochastic Neighbor Embedding (t-SNE) maps the 100-dimensional vectors down to a 2-dimensional space while trying to keep similar vectors close together. 

Load in the saved word vectors (skip this step if running the notebook sequentially as these will already be in RAM).

In [None]:
#Load the trained word2vec model 
print("Loading the pre-trained word2vec model.")
model = Word2Vec.load("word2vec.model")
print("Model loaded.")
print()

The following function is used for performing dimensionality reduction and creating the plot. 

In [None]:
#From: https://methodmatters.github.io/using-word2vec-to-analyze-word/
#Define the function to compute the dimensionality reduction and then produce the biplot  
def tsne_plot(model, words):
    "Creates a TSNE model and plots it"
    labels = []
    tokens = []
    
    print("Getting embeddings.")
    for word in model.wv.vocab:  
        if(word in words):
            tokens.append(model.wv[word])
            labels.append(word)
    print("Embeddings extracted.")
    print()
        
    print("Performing dimensionality reduction with t-sne.")
    tsne_model = TSNE(perplexity=5, n_components=2, init='pca', n_iter=2500, verbose=0)
    new_values = tsne_model.fit_transform(tokens)
    print("Dimensionality reduction complete.")
    print()
    
    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(8, 8))
    for i in range(len(x)):
        if(labels[i] in words):
            plt.scatter(x[i],y[i])
            plt.annotate(labels[i], xy=(x[i], y[i]), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
    plt.show()
    return 

Create the list of words to visualize and plot them. 

In [None]:
#List of words to visualize in the plot
words = ['china', 'italy', 'taiwan', 'india', 'japan', 'france', 
         'spain', 'canada', 'infection', 'disease', 'pathogen', 
         'organism', 'bacteria', 'virus', 'covid19', 'coronavirus', 
         'healthcare', 'doctor', 'nurse', 'specialist', 'hospital', 
         'novel', 'human', 'sars', 'covid', 'wuhan', 'case', 
         'background', 'dynamic', 'pneumonia', 'outbreak', 'pandemic', 
         'syndrome', 'contact', 'wash', 'hands', 'cough', 
         'respiratory', 'case', 'fear', 'spike', 'curve', 
         'transmission', 'seasonal', 'genome', 'dna', 'testing', 
         'asymptomatic', 'global', 'spread', 'diagnosis']
  
#Call the function on our dataset  
tsne_plot(model, words)

Next we get the top N similar words to some input based on the learned word vectors. Here we get the 3 most similar words to 'facemask'. Note that the variables below can be changed to get different count values for other words (given that they are present in the corpus). 

In [None]:
#Word to compare against and number of similar words to print out 
word = 'facemask'
similarCount = 3

Compute the results.

In [None]:
#Get and print the results 
results = model.wv.most_similar(positive=[word], topn=similarCount)
print("Input word:", word)
print("Top "+str(similarCount)+" similar words are:")
#Use separate loop names so the query variable 'word' is not overwritten
for index, (similarWord, score) in enumerate(results):
    print(str(index+1)+". "+similarWord+" --- Score: "+str(score))

Last we will compute the cosine similarity between two words using our word vectors. Cosine similarity is the cosine of the angle between two vectors (their dot product divided by the product of their norms); it ranges from -1 to 1, with higher values indicating words that are more similar in meaning. 

First we set the words and get the embedding value for each. The variables 'word1' and 'word2' can be changed to compute the cosine similarity between other sets of words. 

In [None]:
#Words to compute cosine similarity over  
word1 = 'china'
word2 = 'wuhan'

#Get the word embeddings 
embedding1 = model.wv[word1]
embedding2 = model.wv[word2]

Here we compute the cosine similarity. 

In [None]:
#Compute the cosine similarity and print the results 
cosineSimilarity = np.sum(embedding1*embedding2) / (np.sqrt(np.sum(np.square(embedding1)))*np.sqrt(np.sum(np.square(embedding2))))
print("Word1: "+word1+" --- Word2: "+word2)
print("Cosine similarity: "+ str(cosineSimilarity))
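
To build intuition for the range of the score, the sketch below evaluates cosine similarity on three hand-picked pairs of small vectors using only the standard library. The vectors are made up and stand in for word embeddings.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1 regardless of their length, unrelated (orthogonal) vectors score 0, and vectors pointing in opposite directions score close to -1.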

## 7. Computing Most Relevant Sentences / Articles

To use BERT for semantic similarity computation and information retrieval, having a GPU is recommended (for faster computation speed). For multi-GPU systems the code below selects a specific GPU device; this is particularly useful when running multiple jobs in parallel, with each mapped to separate hardware. <b><u>Note that for systems without a GPU the following code may need to be commented out.</u></b>  

In [None]:
#Set the GPU device (skipped when no CUDA device is available)
device = 0
if torch.cuda.is_available():
    torch.cuda.set_device(device)

Read in the pre-processed and cleaned data frame. 

The code below can be modified to load the file saved at the end of section 4 if edits are made to it. Otherwise the pre-computed data frame can be loaded from "cord19cleaneddata".

<b><u>Currently we only load in and use the first 500 articles due to runtime. To use the entire dataset remove line 6 that reads</u></b>

"df_covid = df_covid.head(500)"

In [None]:
#Read in the saved data 
print("Loading the dataframe.")
df_covid = pd.read_csv('../input/cord19cleaneddata/covidDataCleaned.csv')
print("Dataframe loaded.")
print()
df_covid = df_covid.head(500)

Next we set the batch size, i.e. the number of (query, sentence) pairs scored in parallel. A batch size of 500 uses about 10.6GB of GPU memory. On GPUs with less memory available this number will need to be reduced to avoid a "CUDA out of memory" error. 

In [None]:
#Variable to store the batch size
batchSize = 500 

To avoid training a model we load in a pre-trained BERT semantic similarity model from [BERT Semantic Similarity](https://github.com/AndriyMulyar/semantic-text-similarity). Thankfully this model has been pre-trained and fine-tuned on the [MED-STS](https://arxiv.org/abs/1808.09397) data set making it perfect to be used in this setting. <b><u>Change device to 'cpu' if no GPU is available system wide.</u></b>

In [None]:
#Load the model
print("Loading BERT semantic similarity model.")
model = ClinicalBertSimilarity(device='cuda', batch_size=batchSize) #defaults to GPU prediction
print("Model loaded.")
print()

Here we create a variable to store the 'query' we want to search our corpus over. This variable expects a list, so to compute similarities over multiple questions you can uncomment some of the queries below or add your own; separate the entries with commas whenever more than one is active, since Python silently joins adjacent strings that are not separated by a comma. Currently the code is set up to search for the most relevant sentences/articles relative to the first task of the challenge: "What is known about transmission, incubation, and environmental stability?"

On another note we have added the word 'coronavirus' to many of the tasks to aid our semantic search capabilities in returning better results. 

In [None]:
#The primary questions that attempt to be answered  
#Each entry ends with a comma so that uncommented queries stay separate list elements
primaryQuestions = [
    "What is known about transmission, incubation, and environmental stability of coronavirus",
    #"What do we know about coronavirus risk factors",
    #"What do we know about coronavirus genetics, origin, and evolution",
    #"What do we know about vaccines and therapeutics for coronavirus",
    #"What has been published about coronavirus medical care",
    #"What do we know about non-pharmaceutical interventions for coronavirus",
    #"What do we know about diagnostics and surveillance of coronavirus",
    #"In what ways does geography affect virality",
    #"What has been published about ethical and social science considerations regarding coronavirus",
    #"What has been published about information sharing and inter-sectoral collaboration",
]

Next article text is extracted from the pandas data frame and placed in a list. 

In [None]:
#Extract the text from the articles 
articleTexts = df_covid.drop(["paper_id", "abstract", "title", "authors", "journal", "url", "abstract_word_count", "body_word_count"], axis=1)
text_list = list(articleTexts['body_text'])

Create a function to compare a computed score value to a list of current best scores. This function will later be used in determining when to add new sentences/articles to the best results list. 

In [None]:
#Get the index of where the prediction ranks among the current best predictions 
def computeScoreIndex(predictionScore, answerScores):
    index = 0
    while(index < len(answerScores)):
        if(predictionScore > answerScores[index]):
            break
        index += 1
    return index 
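
The ranking logic can be checked on a small stream of scores. The sketch below re-implements the same rank-and-insert idea under different names (`rank_of` and `insert_ranked` are made up for this example) and compares the outcome with the standard library's `heapq.nlargest`, which is one alternative way to obtain a top-N list when all scores are available at once.

```python
import heapq


def rank_of(score, best_scores):
    """Index where score would be inserted into a list sorted best-first."""
    index = 0
    while index < len(best_scores) and score <= best_scores[index]:
        index += 1
    return index


def insert_ranked(score, best_scores, limit):
    """Insert score at its rank and keep only the best `limit` entries."""
    position = rank_of(score, best_scores)
    if position < limit:
        best_scores.insert(position, score)
        del best_scores[limit:]
    return best_scores


stream = [1.5, 4.2, 0.3, 3.7, 2.9]
best = []
for value in stream:
    insert_ranked(value, best, limit=3)

print(best)
print(heapq.nlargest(3, stream))
```

The incremental version suits this notebook because scores arrive article by article; a new score that ties with an existing one is placed after it, so earlier results keep their position.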

Create another function useful in making batches to send through our BERT semantic similarity model. 

In [None]:
#Function to make a batch to send through the semantic similarity model 
def makeBatch(query, sentences):
    batch = []
    index = 0
    while(index < len(sentences)):
        batch.append((query, sentences[index]))
        index += 1
    return batch 
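
How the batch size and the (query, sentence) pairing fit together can be sketched without the model. The helper names below are hypothetical and the sentences are placeholders; the point is only that a long sentence list is cut into chunks no larger than the batch size, and each chunk is paired with the query before scoring.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def make_pairs(query, sentences):
    """Pair the query with every sentence in the chunk."""
    return [(query, sentence) for sentence in sentences]


sentences = [f"sentence {i}" for i in range(7)]
batch_lengths = []
for chunk in iter_batches(sentences, batch_size=3):
    pairs = make_pairs("example query", chunk)
    batch_lengths.append(len(pairs))

print(batch_lengths)
```

The last batch is simply shorter when the number of sentences is not a multiple of the batch size, so no padding is needed.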

Specify another function that helps with saving results to a file. It converts score values from floats to strings so that the results can be saved as json files. 

In [None]:
#Set scores to strings for saving 
def convertScores(answers):
    for key in answers.keys():
        for scoreIndex, score in enumerate(answers[key]['scores']):
            answers[key]['scores'][scoreIndex] = str(answers[key]['scores'][scoreIndex])
    return answers

### Top Sentences

This section is for computing the most relevant <b><u>sentences</u></b> in the corpus regarding some query. There are various code differences between getting sentences and articles, thus they are broken up into separate sections that can be executed in either order.  

First a dictionary data structure is created to store the top N results (in this case 10) relative to each query. This dictionary has a key for each query; underneath that are keys for article titles, most similar sentences, and semantic similarity score values. Each of these is a list containing the top results in order from best to worst. 

Note that the first element within the 'titles' list corresponds to the title of the article that is first in the 'sentences' list. Likewise, this applies to 'scores' too. 

In [None]:
#Dictionary to store answers and variable to specify the number of answers to store 
answerCount = 10
questionResponses = {'titles': [], 'sentences': [], 'scores': []}
answers = {}
for query in primaryQuestions:
    answers[query] = copy.deepcopy(questionResponses)
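
To make the layout concrete, the following is a small sketch that fills the same structure with made-up values (the query, titles, sentences, and scores are all hypothetical) and reads one row per rank back out of the parallel lists.

```python
import copy

# Made-up query and results, used only to illustrate the layout
template = {"titles": [], "sentences": [], "scores": []}
example_answers = {"example query": copy.deepcopy(template)}

entry = example_answers["example query"]
entry["titles"].extend(["Article A", "Article B"])
entry["sentences"].extend(["Most similar sentence.", "Second most similar sentence."])
entry["scores"].extend([0.93, 0.71])

# The three lists are parallel, so zip recovers one (title, sentence, score) row per rank
rows = list(zip(entry["titles"], entry["sentences"], entry["scores"]))
for rank, (title, sentence, score) in enumerate(rows, start=1):
    print(rank, title, sentence, score)

# deepcopy gives every query its own lists, so the template itself stays empty
print(template)
```

Without the deep copy, every query would share the same three list objects and the results for different queries would be mixed together.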

Here we create a function to update the dictionary of top sentence results based on new score values from our BERT model. 

In [None]:
#Function to update the most relevant answers based on model predictions 
def updateAnswersSentences(query, title, sentences, predictions, answers, answerCount):
    #Get the top answerCount prediction scores 
    topIndices = predictions.argsort()[-answerCount:][::-1]
    for index in topIndices:
        #Case where lists are empty
        if(len(answers[query]['scores']) == 0):
            answers[query]['titles'].append(title)
            answers[query]['sentences'].append(sentences[index])
            answers[query]['scores'].append(predictions[index])
        #Case where lists have length shorter than answerCount 
        elif(len(answers[query]['scores']) < answerCount):
            scoreIndex = computeScoreIndex(predictions[index], answers[query]['scores'])
            answers[query]['titles'].insert(scoreIndex, title)
            answers[query]['sentences'].insert(scoreIndex, sentences[index])
            answers[query]['scores'].insert(scoreIndex, predictions[index])
        #Case where lists are full 
        else:
            scoreIndex = computeScoreIndex(predictions[index], answers[query]['scores'])
            #Check to see if an item should be bumped out of the list 
            if(scoreIndex < answerCount):
                answers[query]['titles'].insert(scoreIndex, title)
                answers[query]['sentences'].insert(scoreIndex, sentences[index])
                answers[query]['scores'].insert(scoreIndex, predictions[index])
                answers[query]['titles'] = answers[query]['titles'][:answerCount]
                answers[query]['sentences'] = answers[query]['sentences'][:answerCount]
                answers[query]['scores'] = answers[query]['scores'][:answerCount]
    return answers 
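
The three cases above all reduce to the same idea: find where the new score belongs, insert it there, and trim the lists back to the limit. A minimal sketch of that idea follows; `insert_ranked` and its arguments are hypothetical names used for illustration, and the scores are made up.

```python
def insert_ranked(scores, items, score, item, limit):
    """Insert (score, item) into parallel best-to-worst lists, keeping at most limit entries."""
    # Position of the first stored score that the new score beats (end of list if none)
    position = next((i for i, s in enumerate(scores) if score > s), len(scores))
    if position < limit:
        scores.insert(position, score)
        items.insert(position, item)
        # Drop anything pushed past the limit
        del scores[limit:]
        del items[limit:]


top_scores, top_items = [], []
for score, item in [(0.5, "a"), (0.9, "b"), (0.7, "c"), (0.6, "d"), (0.95, "e")]:
    insert_ranked(top_scores, top_items, score, item, limit=3)

print(top_scores)  # [0.95, 0.9, 0.7]
print(top_items)   # ['e', 'b', 'c']
```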

Next we perform computation by looping through all queries and articles, tokenizing sentences, and computing score values for each while maintaining the top results. 

In [None]:
#Loop through all queries   
for queryIndex, query in enumerate(primaryQuestions):
    #Loop through all articles 
    for textIndex, text in enumerate(text_list):
        #Tokenize the article to sentences and loop through all sentences computing prediction scores for each 
        sentences = sent_tokenize(text)
        batchIndex = 0
        while(batchIndex*batchSize < len(sentences)):
            batchSentences = None
            batch = None
            #Check to make sure the batch size won't go out of bounds in regard to the sentences 
            if((batchIndex*batchSize)+batchSize > len(sentences)):
                batchSentences = sentences[batchIndex*batchSize:len(sentences)]
                batch = makeBatch(query, batchSentences)
            else:
                batchSentences = sentences[batchIndex*batchSize:(batchIndex*batchSize)+batchSize]
                batch = makeBatch(query, batchSentences)
            predictions = model.predict(batch)
            answers = updateAnswersSentences(query, df_covid.iloc[textIndex]["title"], batchSentences, predictions, answers, answerCount)
            batchIndex += 1
        print("Processing query "+str(queryIndex+1)+" of "+str(len(primaryQuestions))+" --- Article "+str(textIndex+1)+" of "+str(len(text_list)), end='\r')
    print()
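
The batching pattern in the loop above can be sketched on its own. In the example below, `iter_batches` and `make_pairs` are hypothetical helper names, and `fake_predict` is a word-overlap stand-in used only so the example runs without the BERT model; it is not how the notebook's model computes similarity.

```python
def iter_batches(sentences, batch_size):
    """Yield consecutive slices of at most batch_size sentences."""
    for start in range(0, len(sentences), batch_size):
        # Slicing past the end of a list is safe in Python, so the last
        # (possibly shorter) batch needs no special case
        yield sentences[start:start + batch_size]


def make_pairs(query, sentences):
    """Pair the query with every sentence in the batch."""
    return [(query, sentence) for sentence in sentences]


def fake_predict(batch):
    """Hypothetical stand-in for model.predict: fraction of query words found in the sentence."""
    scores = []
    for query, sentence in batch:
        query_words = set(query.lower().split())
        sentence_words = set(sentence.lower().split())
        scores.append(len(query_words & sentence_words) / len(query_words))
    return scores


query = "virus incubation period"
sentences = [
    "The incubation period was five days.",
    "Masks were distributed.",
    "The virus spreads by droplets.",
]

all_scores = []
for batch_sentences in iter_batches(sentences, batch_size=2):
    all_scores.extend(fake_predict(make_pairs(query, batch_sentences)))

print(all_scores)
```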

The top sentence results are now saved to a json file. 

In [None]:
#Save the results
answers = convertScores(answers)
jsonData = json.dumps(answers)
with open("topSentences.json","w") as f:
    f.write(jsonData)

Read in the results (skip if the results are already in RAM). 

In [None]:
#Read in the json file of results 
filename = "topSentences.json"
answers = None
with open(filename, 'r') as myfile:
    answers = json.load(myfile)
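
Because the scores were converted to strings before saving, they are still strings after loading. The sketch below, using made-up data in the same layout, shows one way to turn them back into floats when numeric comparisons are needed; the variable names are hypothetical.

```python
import json

# Made-up results in the same layout as the saved file (scores stored as strings)
saved = {"example query": {"titles": ["Article A", "Article B"], "scores": ["0.93", "0.71"]}}
loaded = json.loads(json.dumps(saved))

# Convert the string scores back to floats before doing any arithmetic or comparisons
for entry in loaded.values():
    entry["scores"] = [float(score) for score in entry["scores"]]

print(loaded["example query"]["scores"])  # [0.93, 0.71]
```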

And finally we print the results. This returns the most relevant sentences in regard to our initial input query!

In [None]:
#Print the results 
for query in answers.keys():
    print("Query: "+query)
    resultCount = len(answers[query]['scores'])
    index = 0
    while(index < resultCount):
        print(str(index+1)+". Article: "+str(answers[query]['titles'][index]))
        print("Sentence: "+answers[query]['sentences'][index])
        print("Score: "+str(answers[query]['scores'][index]))
        print()
        index += 1

### Top Articles 

This section is for computing the most relevant articles in the corpus.

A dictionary data structure is created to store the top results. This dictionary has a key for each query; underneath that are keys for article titles and semantic similarity score values. Each of these is a list ordered from best to worst.

In [None]:
#Dictionary to store answers and variable to specify the number of answers to store 
answerCount = 10
questionResponses = {'titles': [], 'scores': []}
answers = {}
for query in primaryQuestions:
    answers[query] = copy.deepcopy(questionResponses)

Similar to our process for getting the top sentences, we create a function to update the dictionary of top article results based on new score values from BERT. 

In [None]:
#Function to update the most relevant answers based on model predictions 
def updateAnswersArticles(query, title, articleScore, answers, answerCount):
    #Case where lists are empty
    if(len(answers[query]['scores']) == 0):
        answers[query]['titles'].append(title)
        answers[query]['scores'].append(articleScore)
    #Case where lists have length shorter than answerCount 
    elif(len(answers[query]['scores']) < answerCount):
        scoreIndex = computeScoreIndex(articleScore, answers[query]['scores'])
        answers[query]['titles'].insert(scoreIndex, title)
        answers[query]['scores'].insert(scoreIndex, articleScore)
    #Case where lists are full 
    else:
        scoreIndex = computeScoreIndex(articleScore, answers[query]['scores'])
        #Check to see if an item should be bumped out of the list 
        if(scoreIndex < answerCount):
            answers[query]['titles'].insert(scoreIndex, title)
            answers[query]['scores'].insert(scoreIndex, articleScore)
            answers[query]['titles'] = answers[query]['titles'][:answerCount]
            answers[query]['scores'] = answers[query]['scores'][:answerCount]
    return answers 

Perform computation and get the top article results.  

In [None]:
#Loop through all queries   
for queryIndex, query in enumerate(primaryQuestions):
    #Loop through all articles 
    for textIndex, text in enumerate(text_list):
        #Tokenize the article to sentences and loop through all sentences computing prediction scores for each 
        #Create a variable to store all score values 
        sentences = sent_tokenize(text)
        sentenceScores = np.array([])
        batchIndex = 0
        while(batchIndex*batchSize < len(sentences)):
            batchSentences = None
            batch = None
            #Check to make sure the batch size won't go out of bounds in regard to the sentences 
            if((batchIndex*batchSize)+batchSize > len(sentences)):
                batchSentences = sentences[batchIndex*batchSize:len(sentences)]
                batch = makeBatch(query, batchSentences)
            else:
                batchSentences = sentences[batchIndex*batchSize:(batchIndex*batchSize)+batchSize]
                batch = makeBatch(query, batchSentences)
            predictions = model.predict(batch)
            sentenceScores = np.append(sentenceScores, predictions)
            batchIndex += 1
        #Skip articles with no sentences: the mean of an empty array is nan, which would corrupt the ranking 
        if(sentenceScores.size > 0):
            articleScore = np.mean(sentenceScores)
            answers = updateAnswersArticles(query, df_covid.iloc[textIndex]["title"], articleScore, answers, answerCount)
        print("Processing query "+str(queryIndex+1)+" of "+str(len(primaryQuestions))+" --- Article "+str(textIndex+1)+" of "+str(len(text_list)), end='\r')
    print()
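
The article score used above is the mean of the article's sentence scores. A minimal sketch of that aggregation in plain Python follows; the function name, article titles, and scores are made up for illustration, and returning None for an article without sentences is an assumption of this sketch.

```python
def article_score(sentence_scores):
    """Mean of the sentence scores, or None when the article has no sentences."""
    if not sentence_scores:
        return None
    return sum(sentence_scores) / len(sentence_scores)


# Made-up sentence scores for three hypothetical articles
articles = {
    "Article A": [0.2, 0.4, 0.9],
    "Article B": [0.6, 0.7],
    "Article C": [],
}

scored = {title: article_score(scores) for title, scores in articles.items()}

# Rank the articles that have a score, best first
ranked = sorted(
    (title for title, score in scored.items() if score is not None),
    key=lambda title: scored[title],
    reverse=True,
)
print(ranked)  # ['Article B', 'Article A']
```

Note that a single highly relevant sentence (0.9 in "Article A") can be outweighed by the rest of the article when the mean is used.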

Save the results.

In [None]:
#Save the results
answers = convertScores(answers)
jsonData = json.dumps(answers)
with open("topArticles.json","w") as f:
    f.write(jsonData)

Read in the results (skip if the results are already in RAM). 

In [None]:
#Read in the json file of results 
filename = "topArticles.json"
answers = None
with open(filename, 'r') as myfile:
    answers = json.load(myfile)

Print the results; this returns the most relevant articles in regard to our query!

In [None]:
#Print the results 
for query in answers.keys():
    print("Query: "+query)
    resultCount = len(answers[query]['scores'])
    index = 0
    while(index < resultCount):
        print(str(index+1)+". Article: "+str(answers[query]['titles'][index]))
        print("Score: "+str(answers[query]['scores'][index]))
        print()
        index += 1

## 8. Results

Results of using BERT for information retrieval and semantic similarity are as follows. Here we give the results in regard to the most relevant sentences for each query followed by the most relevant articles. <b><u>To view the results run the code blocks contained below. Results have been added to the dataset and computed beforehand on additional hardware (due to runtime constraints) over ALL ~33k articles in "cord1933k".</u></b> 

To add context to the answers, each highly ranked sentence is shown with a window of 2 sentences on either side (the 2 sentences before it and the 2 after it). Here we report all results (~100) for sentences; some directly answer the query while others are slightly out of context. There are numerous ways to advance the techniques currently in use for better results. For discussion please contact the authors.

For articles only the 10 most relevant are reported.
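
The context window described above can be sketched as a simple slice around the index of the ranked sentence. The function name and the sample sentences are hypothetical; the code that produced the saved results is not shown in this notebook and may differ.

```python
def context_window(sentences, index, window=2):
    """Return the sentence at index together with up to `window` sentences on each side."""
    # Clamp the start at 0; slicing already clamps the end of the list
    start = max(0, index - window)
    return sentences[start:index + window + 1]


# Made-up article split into sentences
sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]

print(context_window(sentences, 3))  # ['s1', 's2', 's3', 's4', 's5']
print(context_window(sentences, 0))  # ['s0', 's1', 's2']
```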

#### Query: What is known about transmission, incubation, and environmental stability of coronavirus? (Task 1)

In [None]:
#Load in pandas and set options 
import pandas as pd 
import numpy as np
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_rows', 100)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topSentences_0.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topArticles_0.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results.head(10))

#### Query: How long are people contagious with SARS-CoV-2 even after recovery? (Task 1 - Subtask)

In [None]:
#Load in pandas and set options 
import pandas as pd 
import numpy as np
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_rows', 100)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topSentences_11.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topArticles_11.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results.head(10))

#### Query: What is the prevalence of asymptomatic shedding and transmission of SARS-CoV-2 particularly in children? (Task 1 - Subtask)

In [None]:
#Load in pandas and set options 
import pandas as pd 
import numpy as np
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_rows', 100)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topSentences_12.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topArticles_12.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results.head(10))

#### Query: What is the physical science of coronavirus such as the charge distribution, adhesion to hydrophilic/phobic surfaces, and environmental survival to inform decontamination efforts for affected areas? (Task 1 - Subtask)

In [None]:
#Load in pandas and set options 
import pandas as pd 
import numpy as np
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_rows', 100)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topSentences_13.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topArticles_13.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results.head(10))

#### Query: What is the persistence of coronavirus on surfaces of different material such as copper, stainless steel, and plastic? (Task 1 - Subtask)

In [None]:
#Load in pandas and set options 
import pandas as pd 
import numpy as np
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_rows', 100)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topSentences_15.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topArticles_15.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results.head(10))

#### Query: What is the natural history of SARS-CoV-2 and its shedding from an infected person? (Task 1 - Subtask)

In [None]:
#Load in pandas and set options 
import pandas as pd 
import numpy as np
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_rows', 100)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topSentences_16.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topArticles_16.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results.head(10))

#### Query: What is the immune response and immunity to SARS-CoV-2? (Task 1 - Subtask)

In [None]:
#Load in pandas and set options 
import pandas as pd 
import numpy as np
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_rows', 100)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topSentences_17.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topArticles_17.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results.head(10))

#### Query: What do we know about vaccines and therapeutics for coronavirus? (Task 4)

In [None]:
#Load in pandas and set options 
import pandas as pd 
import numpy as np
pd.set_option('display.max_colwidth', 10000)
pd.set_option('display.max_rows', 100)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topSentences_3.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results)

#Read in the saved data 
results = pd.read_csv('../input/cord19results/topArticles_3.csv')
results = results.drop(['Unnamed: 0'], axis=1)
results.index = np.arange(1, len(results)+1)
display(results.head(10))