<center>
<img src="https://www.schwarzwaelder-bote.de/media.media.c5d3a492-5f32-4bcc-83f3-27e779ad4d46.original1024.jpg" width="1000" align="center"/>
 <br><br>
 <h1>  <span style="color:green"> COVID-19 Open Research Dataset Challenge (CORD-19) </span>  </h1>
</center>
<h2> <span style="color:green"> Introduction </span>  </h2>
The aim of this notebook is to provide a robust algorithm that can help the medical research community to find useful information about COVID-19. 

<h3> <span style="color:green"> Approach </span> </h3> 
We designed a pipeline that consists of three parts: document retrieval, information extraction, and creation of an HTML report for each query. 

<br><br><br>
![pipeline.png](https://user-images.githubusercontent.com/28005338/79320047-6ee29e00-7f09-11ea-887c-3b3f1cdfb09f.png)
<br><br><br>

The idea is to use the powerful BioBERT embeddings together with a traditional document retrieval technique (BM25+) that overcomes some of their weaknesses.

**BM25** (“Best Match 25”) is a successor to TF-IDF. It is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework.

**BioBERT** is a pre-trained biomedical language representation model for biomedical text mining.
<br><br><br>
<h3> <span style="color:green"> Difference with other approaches </span> </h3> 

We have seen that most approaches rely entirely either on embedding techniques to find the answer or on document retrieval techniques to find the most relevant documents. We take the strengths of both approaches and combine them to overcome their respective weaknesses.

<h3> <span style="color:green"> Conclusions </span> </h3> 
By combining document retrieval with embedding comparison we get results we could not get otherwise, and we also reduce the computing time because fewer documents are compared.

<h3> <span style="color:green"> Next steps </span> </h3> 
We need to further process the output to obtain a more robust algorithm. Our main idea is to add a pretrained SNLI or QA model to improve the output for researchers who need concrete, easy-to-read information. We will also try to introduce topic selection in order to filter documents before information retrieval.

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<h2 id="tocheading"> <span style="color:green"> Table of Contents </span> </h2>
<div id="toc"></div>

<h2> <span style="color:green"> Load Data </span>  </h2>

The module `read_data.py` handles the data preprocessing and creates the files needed for this notebook:

- `processed_data_v5.csv`: csv that contains the metadata, text and tagging of the documents.
- `term_frequency.p`: dictionary with term frequency for each document.
- `document_frequency.p`: dictionary with the document frequency for each term in the dataset.
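
As an illustration of what the two frequency dictionaries contain, here is a minimal sketch. It is not the code from `read_data.py`: the function name `build_frequency_dicts`, the whitespace tokenizer and the toy documents are assumptions made only for this example.

```python
from collections import Counter

def build_frequency_dicts(docs):
    """Return (term_frequency, document_frequency) for a {doc_id: text} mapping."""
    term_frequency = {}
    document_frequency = Counter()
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        counts = Counter(tokens)
        term_frequency[doc_id] = dict(counts)
        # a term counts once per document, however often it occurs in it
        document_frequency.update(counts.keys())
    return term_frequency, dict(document_frequency)

# Toy documents, for illustration only
docs = {
    'paper_a': 'incubation period of the virus',
    'paper_b': 'the virus spreads during the incubation period',
}
term_freq, doc_freq = build_frequency_dicts(docs)
print(term_freq['paper_b']['the'])  # occurrences of 'the' in paper_b
print(doc_freq['virus'])            # number of documents containing 'virus'
```

Both dictionaries could then be stored with `pickle`, which is what the `.p` extension of the provided files suggests.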

Although you can create these files using `read_data.py`, there is no need because they are provided in this notebook. The preprocessing done and the tagging are explained in the following sections.


In [None]:
!git clone https://github.com/Apunti/covid19-kaggle.git

In [None]:
cd covid19-kaggle

In [None]:
!pip install langdetect

In [None]:
!git clone https://github.com/dmis-lab/biobert.git

In [None]:
cd biobert

In [None]:
with open('requirements.txt', 'r') as f:
    lines = f.readlines()

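with open('requirements.txt', 'w') as f:
    for line in lines:
        if 'gpu' in line:
            # Pin a TensorFlow 1.x GPU build; the trailing newline keeps the next requirement on its own line
            f.write('tensorflow-gpu==1.14.0   # GPU version of TensorFlow >= 1.11.0. (should be under 2.0.0)\n')
        else:
            f.write(line)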
        

In [None]:
pip install -r requirements.txt

In [None]:
cd ..

In [None]:
from utils.read_data import save_dictionaries, read_new_files
import pandas as pd
import os
import nltk
nltk.download('stopwords')

found = False

# Look for the processed csv in the output folder first, then in the input
# folder, and only rebuild it from the raw dataset if neither copy exists.
try:
    print('check in output folder')
    df = pd.read_csv('processed_data.csv', sep=';')
    found = True
    print('The csv is in the output folder.')
except FileNotFoundError:
    try:
        print('check in input folder')
        df = pd.read_csv('../../input/processed-data-v8/processed_data_v8_2.csv', sep=';')
        print('The csv is in the input folder.')
    except FileNotFoundError:
        print('Creating csv.')
        data_path = '../../input/CORD-19-research-challenge'
        df = read_new_files(data_path)

if not found:
    save_csv_path = 'processed_data.csv'
    df.to_csv(save_csv_path, sep=';', index=False)
    
from shutil import copyfile
if not os.path.isdir('Data/ranking_dict'):
    input_dir = '../../input/ranking-dict-v8/'
    if os.path.isdir(input_dir):
        print('Copying dictionaries to output')
        os.makedirs('Data/ranking_dict', exist_ok=True)
        
        new_path = 'Data/ranking_dict/'
        for filename in os.listdir(input_dir):
            copyfile(input_dir + filename, new_path + filename)
    else:
        print('Creating dictionaries for document retrieval.')
        save_dictionaries(df)

In [None]:
import numpy as np

bert_encodings = np.load('../../input/bert-encodings/bert_encodings.npy', allow_pickle=True)
bert_encodings = bert_encodings.item()

<h3> <span style="color:green"> Data Preprocessing </span>  </h3>
We go through each row of the metadata and, if it has full text, append it to our dataframe, extracting the text from the pmc folder if possible, or from the pdf one otherwise. The preprocessing that we do for each row is the following:

- We get rid of the documents that are not written in English, using the package `langdetect`.
- We get rid of the texts that have fewer than 200 words.
- The abstract and the text are strings with the paragraphs separated by new line character (`\n`). 
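
The filtering rules above can be sketched as follows. This is only an illustration, not the code in `read_data.py`: `keep_document` and `join_paragraphs` are hypothetical helpers, and the language detector is passed in as a function so that the sketch runs without extra packages. In the notebook that function would be `langdetect.detect`, which returns a language code such as `'en'`.

```python
def keep_document(text, detect_language, min_words=200):
    """Return True if the text is long enough and detected as English."""
    if len(text.split()) < min_words:
        return False
    try:
        return detect_language(text) == 'en'
    except Exception:
        # language detectors can fail on texts without usable features
        return False

def join_paragraphs(paragraphs):
    """Store a document as a single string with one paragraph per line."""
    return '\n'.join(p.strip() for p in paragraphs if p.strip())

# Stand-in detector for the example; in practice: from langdetect import detect
fake_detect = lambda text: 'en'
long_text = ' '.join(['word'] * 250)
print(keep_document(long_text, fake_detect))    # long enough and English
print(keep_document('too short', fake_detect))  # rejected by the length rule
```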

<h3> <span style="color:green"> Tagging </span>  </h3>
Following the contribution of Andy White's [notebook](https://www.kaggle.com/ajrwhite/covid-19-thematic-tagging-with-regular-expressions), we decided to add a disease tag and a design tag. We think the disease tag is needed because we are mostly interested in the COVID-19 information, and we thought that the level of evidence of the studies is really useful for the medical research community when looking for answers. The level of evidence is based on the study design, following the [guides](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/137027) of Savanna Reid. The main weakness of this tagging is that it relies on looking for keywords throughout the document, so it is not an optimal classification.
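
To make the keyword-based tagging and its weakness concrete, here is a small sketch. The patterns are hypothetical examples written for this illustration; the real keyword lists come from the notebook and the guide linked above. `tag_disease_covid` is the column name used later in this notebook, while `tag_design_rct` is an invented name.

```python
import re

# Hypothetical keyword patterns, for illustration only
TAG_PATTERNS = {
    'tag_disease_covid': re.compile(r'covid[\s-]?19|sars[\s-]?cov[\s-]?2|2019[\s-]?ncov', re.IGNORECASE),
    'tag_design_rct': re.compile(r'randomi[sz]ed controlled trial', re.IGNORECASE),
}

def tag_document(text):
    """Return a {tag: bool} dictionary, True when any keyword matches."""
    return {tag: bool(pattern.search(text)) for tag, pattern in TAG_PATTERNS.items()}

print(tag_document('A randomised controlled trial in SARS-CoV-2 patients'))
# The weakness: a negated mention is still tagged, because only keywords are checked
print(tag_document('This study was not a randomized controlled trial'))
```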

<h2> <span style="color:green"> Document Retrieval </span> </h2>

The aim of the document retrieval is to select the most relevant articles for a query. To do so, for a given query, we rank each document using BM25+ scoring.

BM25 is a retrieval model based on the probabilistic retrieval framework. Its main advantage, and the reason for its popularity, is its efficiency. The BM25 score is built from two main components: TF (term frequency) and IDF (inverse document frequency). In addition, it normalizes for document length and makes the contribution of the term frequency concave (it saturates, much like using a logarithmic TF instead of the raw TF). Thanks to these heuristics, BM25 often achieves better performance than TF-IDF.
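
As a rough illustration of how these components fit together, the sketch below scores one document with the BM25+ formula as it is commonly written. It is not the code from `ranking.py`: the function name, the parameter values `k1`, `b` and `delta`, and the toy frequencies are assumptions chosen for the example.

```python
import math

def bm25_plus_score(query_terms, doc_tf, doc_freq, n_docs, avg_doc_len,
                    k1=1.5, b=0.75, delta=1.0):
    """Score one document against a query with BM25+ (common textbook form)."""
    doc_len = sum(doc_tf.values())
    # document length normalization: longer-than-average documents are penalized
    norm = 1 - b + b * doc_len / avg_doc_len
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # only query terms that occur in the document contribute
        idf = math.log((n_docs + 1) / doc_freq[term])
        # saturating (concave) TF component, plus the BM25+ lower bound delta
        score += idf * (tf * (k1 + 1) / (tf + k1 * norm) + delta)
    return score

# Toy term frequencies per document and document frequencies per term
doc_a = {'incubation': 3, 'period': 2, 'virus': 1}
doc_b = {'incubation': 1, 'virus': 5}
doc_c = {'climate': 4, 'virus': 2}
toy_doc_freq = {'incubation': 2, 'period': 1, 'virus': 3, 'climate': 1}

query = ['incubation', 'period']
score_a = bm25_plus_score(query, doc_a, toy_doc_freq, n_docs=3, avg_doc_len=6)
score_b = bm25_plus_score(query, doc_b, toy_doc_freq, n_docs=3, avg_doc_len=6)
score_c = bm25_plus_score(query, doc_c, toy_doc_freq, n_docs=3, avg_doc_len=6)
print(score_a, score_b, score_c)
```

In this toy case the document that matches both query terms ranks first, and the document that matches none scores zero.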

The class that handles the document retrieval can be found in the module `ranking.py` in the output folder.

<h2> <span style="color:green"> Information Extraction </span> </h2>

The aim of the information extraction is to select the most relevant paragraphs/sentences from an article.

For now, we are considering two approaches:

1) BM25 + Word2Vec Embedding across all documents.

2) BM25 + BioBERT + Word2Vec Embedding: for each of the top-k documents, take the most similar sentence embedding across the top-k paragraphs.

Let us take a look at each of them.

1) Word2vec learns word vectors from patterns of word co-occurrence over large corpora of text. The main idea, called the distributional hypothesis, is that similar words appear in similar contexts, and the purpose of Word2vec is to place the vectors of similar words close together in vector space. Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. The output of Word2vec is a vocabulary in which each item has a vector attached to it.

We simply embed the query and each sentence with Word2vec and take the sentence with the greatest cosine similarity to the query.
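
The comparison can be sketched in a few lines. This is an illustration under assumptions, not the code of `Embedding_retrieval`: we assume a sentence embedding is the average of its word vectors, and we use a tiny hand-made dictionary of 2-dimensional vectors instead of the trained model.

```python
import math

def average_embedding(tokens, word_vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def closest_sentence(query, sentences, word_vectors):
    """Return (sentence, similarity) for the sentence most similar to the query."""
    query_vec = average_embedding(query.lower().split(), word_vectors)
    best, best_sim = None, -1.0
    for sentence in sentences:
        sent_vec = average_embedding(sentence.lower().split(), word_vectors)
        if query_vec is None or sent_vec is None:
            continue
        sim = cosine_similarity(query_vec, sent_vec)
        if sim > best_sim:
            best, best_sim = sentence, sim
    return best, best_sim

# Hand-made toy vectors, for illustration only
toy_vectors = {
    'incubation': [1.0, 0.0], 'period': [0.9, 0.1], 'days': [0.8, 0.2],
    'climate': [0.0, 1.0], 'humidity': [0.1, 0.9],
}
sentences = ['incubation lasted five days', 'climate and humidity matter']
best_sentence, best_similarity = closest_sentence('incubation period', sentences, toy_vectors)
print(best_sentence, best_similarity)
```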

2) BioBERT is a pre-trained biomedical language representation model for biomedical text mining, and it is fine-tuned with the COVID-19 dataset. BERT constructs the vector for a word as follows:

- BERT has a fixed-size vocabulary of words/subwords (WordPiece embeddings); any input word is mapped to these words/subwords. For instance, common words like “the” and even uncommon ones like “quantum” or “constantinople” are present in the BERT vocabulary (base and large model vocab), so the mapping is direct for these words.

- During training, BERT learns the vector representation for its fixed-size vocab using its attention heads (which are essentially a set of learnt matrices) and other transformation matrices (all of which are also learnt during training). The trained model, along with the learnt vectors for its vocab, is then used during eval/testing to dynamically construct the vector for any word in a sentence, by first mapping words to the fixed vocab words/subwords and encoding the position of each word in the input sentence into its representation.

- Each layer of the BERT model has multiple attention heads (12 heads in base, and 16 in large), and a non-linear feed-forward layer takes these attention head outputs and allows them to interact with each other before they are fed to the next layer, which performs the same operation.
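
The mapping of a word to the fixed vocabulary mentioned in the first point can be illustrated with the greedy longest-match-first procedure used by WordPiece tokenizers. The sketch below is a simplification with a tiny made-up vocabulary; the real BioBERT vocabulary lives in `vocab.txt` and the real tokenizer also handles casing, punctuation and very long words.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Split a single word into the longest matching vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Made-up toy vocabulary, for illustration only
toy_vocab = {'the', 'quantum', 'corona', '##virus', '##es'}
print(wordpiece_tokenize('quantum', toy_vocab))        # direct mapping
print(wordpiece_tokenize('coronaviruses', toy_vocab))  # split into subwords
print(wordpiece_tokenize('xyz', toy_vocab))            # unknown word
```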

Firstly, we must decide in which documents we should look for the answer. Here we use the document retrieval to take the most relevant documents. Then, we choose the most relevant paragraphs and extract the sentences from the selected paragraphs. We use BioBERT embeddings to compare the query with the paragraphs and we take the ones with the greatest cosine similarity. Once we have the relevant paragraphs, we select the sentence that contains the answer by comparing sentence embeddings based on Word2Vec embeddings trained on the whole dataset. We take the average embedding of the words within the paragraph.

Let us take a look at the following queries to better understand the performance of each approach:
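
The coarse-to-fine flow described above (documents, then paragraphs, then sentences) can be sketched independently of the embedding models. In the sketch below both similarity functions are replaced by a simple token overlap, which is only a stand-in and not an embedding; `extract_answers`, `top_k` and `overlap` are hypothetical names, and splitting sentences on `.` is a deliberate simplification.

```python
def top_k(items, score, k):
    """Return the k items with the highest score."""
    return sorted(items, key=score, reverse=True)[:k]

def overlap(query, text):
    """Stand-in similarity: Jaccard overlap of lowercase tokens (NOT an embedding)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

def extract_answers(query, ranked_docs, doc_k=3, par_k=3, sent_k=3,
                    paragraph_sim=overlap, sentence_sim=overlap):
    """ranked_docs is a list of (doc_id, text), already sorted by retrieval score."""
    answers = []
    for doc_id, text in ranked_docs[:doc_k]:
        paragraphs = [p for p in text.split('\n') if p.strip()]
        # step 1: keep the paragraphs closest to the query (BioBERT in the notebook)
        best_pars = top_k(paragraphs, lambda p: paragraph_sim(query, p), par_k)
        # step 2: keep the closest sentences of those paragraphs (Word2Vec in the notebook)
        sents = [s.strip() for p in best_pars for s in p.split('.') if s.strip()]
        for sent in top_k(sents, lambda s: sentence_sim(query, s), sent_k):
            answers.append((doc_id, sent, sentence_sim(query, sent)))
    return answers

toy_ranked_docs = [
    ('paper_a', 'The incubation period was five days. Patients recovered.\n'
                'Funding was provided by a grant.'),
]
answers = extract_answers('incubation period', toy_ranked_docs, par_k=1, sent_k=1)
print(answers)
```

Restricting the sentence comparison to a few paragraphs of a few documents is what keeps the number of embedding comparisons small.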

In [None]:
# Given a query
query1 = 'Incubation period'
query2 = 'Prevalence of asymptomatic shedding and transmission'

In [None]:
import matplotlib.pyplot as plt
from get_result import filtered_query, remove_punct
from ranking import Ranking

# We can calculate and plot the scores of the documents talking about COVID19
ranking = Ranking()
covid_documents = df[(df.after_dec == True) & (df.tag_disease_covid == True)].paper_id

query1 = filtered_query(query1)
scores1 = ranking.get_bm25_scores(query1, covid_documents)

query2 = filtered_query(query2)
scores2 = ranking.get_bm25_scores(query2, covid_documents)

# Plot the results
fig, axs = plt.subplots(1,2, sharey=True, tight_layout=False, figsize =(15,5))
axs[0].hist(scores1.values(), bins=20, color='g')
axs[0].set_xlabel('Scores')
axs[0].set_ylabel('Number of documents')
axs[0].set_title('Incubation period')
axs[1].hist(scores2.values(), bins=20, color='g')
axs[1].set_xlabel('Scores')
axs[1].set_ylabel('Number of documents')
axs[1].set_title('Prevalence of asymptomatic shedding and transmission')

plt.show()


We can see the distribution of documents over the scores. Different queries give rise to different distributions. Queries whose words appear in few documents, but many times in each of them, produce a rather sparse distribution.

Conversely, if the query words are common and tend to appear with similar frequency across documents, many documents will have similar scores. Both cases can be seen in the two queries plotted above.

Thus, since the first approach relies heavily on the number of appearances of each word, it may perform better with more specific queries, while BioBERT should perform better on general queries.

In the Results section we check the results of each approach for these queries.

We create an `Embedding_retrieval` object, which loads an `Embedding` object already pretrained on the whole dataset.

In [None]:
from information_retrieval import *
from extract_features_refactored import FeatureExtractor

fe = FeatureExtractor(bert_config_file ="../../input/biobert-pretrained/biobert_v1.1_pubmed/bert_config.json",
                      init_checkpoint = "../../input/biobert-pretrained/biobert_v1.1_pubmed/model.ckpt-1000000",
                      vocab_file = "../../input/biobert-pretrained/biobert_v1.1_pubmed/vocab.txt",
                      batch_size = 32, # Batch size for predictions
                      max_seq_length = 128, # Sequences longer than this will be truncated, and sequences shorter than this will be padded.
                      verbose=0)

inforet = Embedding_retrieval(path = 'wordvectors.kv')


For each document, we call `get_closest_sentence` from the `Embedding_retrieval` class to get the most relevant sentences.

In [None]:
# Get ranked documents for query 1
sorted_papers1 = sorted(scores1.items(), key=lambda x:x[1], reverse=True)
ranking_nearest1 = [a for a,b in sorted_papers1]

# Get ranked documents for query 2
sorted_papers2 = sorted(scores2.items(), key=lambda x:x[1], reverse=True)
ranking_nearest2 = [a for a,b in sorted_papers2]

# First we calculate the best sentences across all the documents

# Get closest sentences for query 1
doc_info1 = []

for paper_id in ranking_nearest1[:1000]:
    actual_doc = []
    row = df.loc[df.paper_id == paper_id]
    if len(row.text.values) == 0:  # skip papers without text
        continue
    text = row.text.values[0]
    
    # Get the similar sentences and append it to the document information
    similar_sent = inforet.get_closest_sentence(query1, paper_id, text, topk=10)
    for sent, sim in similar_sent:
        actual_doc.append((paper_id, sent, sim))
        
    doc_info1.append(actual_doc)
    
    
# Get closest sentences for query 2
doc_info2 = []

for paper_id in ranking_nearest2[:1000]:
    actual_doc = []
    row = df.loc[df.paper_id == paper_id]
    if len(row.text.values) == 0:  # skip papers without text
        continue
    text = row.text.values[0]
    
    # Get the similar sentences and append it to the document information
    similar_sent = inforet.get_closest_sentence(query2, paper_id, text, topk=10)
    for sent, sim in similar_sent:
        actual_doc.append((paper_id, sent, sim))
        
    doc_info2.append(actual_doc)
    

# Then we calculate the best sentences across the top-k paragraphs of the top-k documents
doc_k = 3
par_k = 3
sent_k = 3

# Get closest sentences for query 1
bert_query1 = fe.prepare_embedding_csv(query1, None, False).values

doc_info_bert1 = []
for paper_id in ranking_nearest1[:doc_k]:
    actual_doc = []
    row = df.loc[df.paper_id == paper_id]
    if len(row.text.values) == 0 or not row['after_dec'].values[0]:
        continue

    # get 'topk' closest paragraphs
    similar_par = fe.get_closest_sentence(bert_query1, paper_id, row.text.values[0], par_k, encodings_dict = bert_encodings)
    top_paragraphs = ''
    if len(similar_par) < par_k:
        print('DOC TOO SHORT')  # reported once; top_paragraphs stays empty
    else:
        for n_par in range(par_k):
            top_paragraphs += similar_par[n_par][0] + '\n'
    similar_sent = inforet.get_closest_sentence(query1, paper_id, top_paragraphs, sent_k)
    for sent, sim in similar_sent:
        actual_doc.append((paper_id, sent, sim))
    doc_info_bert1.append(actual_doc)

# Get closest sentences for query 2
bert_query2 = fe.prepare_embedding_csv(query2, None, False).values

doc_info_bert2 = []
for paper_id in ranking_nearest2[:doc_k]:
    actual_doc = []
    row = df.loc[df.paper_id == paper_id]
    if len(row.text.values) == 0 or not row['after_dec'].values[0]:
        continue

    # get 'topk' closest paragraphs
    similar_par = fe.get_closest_sentence(bert_query2, paper_id, row.text.values[0], par_k, encodings_dict = bert_encodings)
    top_paragraphs = ''
    if len(similar_par) < par_k:
        print('DOC TOO SHORT')  # reported once; top_paragraphs stays empty
    else:
        for n_par in range(par_k):
            top_paragraphs += similar_par[n_par][0] + '\n'
    similar_sent = inforet.get_closest_sentence(query2, paper_id, top_paragraphs, sent_k)
    for sent, sim in similar_sent:
        actual_doc.append((paper_id, sent, sim))
    doc_info_bert2.append(actual_doc)

<h2> <span style="color:green"> Results </span>  </h2>

Before showing the output for the task questions, we will show a result for an example query to be able to discuss the strengths and weaknesses of our approach. Right now, our approach consists of two main information retrieval parts:

- Ranking + Word2Vec Embedding across all documents.
- Ranking + BioBERT + Word2Vec Embedding: for each of the top-k documents, take the most similar sentence embedding across the top-k paragraphs.

The first approach performs better for specific queries, while the second approach performs better for general queries (see plots above). We will see these differences in the following examples using the queries plotted before.


We will first evaluate the first query: `Incubation period`

Printing the top sentences from top documents:

In [None]:
print(query1)
for i, doc in enumerate(doc_info_bert1):
    if i > 5:
        break
    print('DOCUMENT {}: {}'.format(i+1, doc[0][0]))
    print('-------------------------')
    print('\n')
    for j, item in enumerate(doc[:3]):
        _, sent, similarity = item
        print('Sentence {}: {}'.format(j+1, similarity))
        print(sent)
        print('\n')

Printing the most similar sentences across all the documents.

In [None]:
import itertools
all_sent = list(itertools.chain.from_iterable(doc_info1))
all_sent = sorted(all_sent, key=lambda x:x[2], reverse=True)
for paper_id, sentence, similarity in all_sent[:5]:
    print(str(similarity) + '  [{}]'.format(paper_id))
    print(sentence)
    print('\n')

Then we will evaluate the second query: `Prevalence of asymptomatic shedding and transmission`

Printing the top sentences from top documents:

In [None]:
print(query2)
for i, doc in enumerate(doc_info_bert2):
    if i > 5:
        break
    print('DOCUMENT {}: {}'.format(i+1, doc[0][0]))
    print('-------------------------')
    print('\n')
    for j, item in enumerate(doc[:3]):
        _, sent, similarity = item
        print('Sentence {}: {}'.format(j+1, similarity))
        print(sent)
        print('\n')

Printing the most similar sentences across all the documents.

In [None]:
import itertools
all_sent = list(itertools.chain.from_iterable(doc_info2))
all_sent = sorted(all_sent, key=lambda x:x[2], reverse=True)
for paper_id, sentence, similarity in all_sent[:5]:
    print(str(similarity) + '  [{}]'.format(paper_id))
    print(sentence)
    print('\n')

<h2> <span style="color:green"> Conclusions </span>  </h2>

Analysing the results for the two selected queries, we can see that the first approach is more accurate for the query "incubation period", while the second is better for the query "prevalence asymptomatic shedding transmission".

The reason is that our results (for now) depend on how specific the words in the given query are. For queries that contain more specific/rare terminology (like incubation period), we get better results with the most similar sentences across all the documents. On the other hand, for queries that contain more general terms (age, human, climate...), we get better results by taking into account only the most relevant documents, instead of comparing the embeddings across all of them.

For this reason, it makes sense to compare the results of the two approaches for each query and select the appropriate one based on the word distribution (see the plots presented before).
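
One possible way to automate this choice, shown here only as a hypothetical sketch, is to measure how rare the query terms are using the document frequencies we already store. The function names, the threshold of 2.0 and the toy document frequencies below are all assumptions; this heuristic has not been evaluated in the notebook.

```python
import math

def mean_idf(query_terms, doc_freq, n_docs):
    """Average inverse document frequency of the query terms found in the corpus."""
    idfs = [math.log((n_docs + 1) / doc_freq[t]) for t in query_terms if doc_freq.get(t)]
    return sum(idfs) / len(idfs) if idfs else 0.0

def choose_approach(query_terms, doc_freq, n_docs, threshold=2.0):
    """Rare (specific) terms: search all documents; common terms: top documents only."""
    if mean_idf(query_terms, doc_freq, n_docs) >= threshold:
        return 'word2vec_all_documents'
    return 'biobert_top_documents'

# Made-up document frequencies, for illustration only
toy_df = {'incubation': 50, 'period': 400, 'human': 6000, 'age': 5000}
print(choose_approach(['incubation', 'period'], toy_df, n_docs=10000))
print(choose_approach(['human', 'age'], toy_df, n_docs=10000))
```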

Thus, in future steps we are going to include this comparison in order to increase accuracy.

<h2> <span style="color:green"> Task Answers </span>  </h2>

<h3> <span style="color:green"> Question selection based on COVID-19 medical dictionary </span>  </h3>
We use the **COVID-19** medical dictionary published by [Savanna Reid](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/137027), which introduces a list of research questions that fall under each task in the Kaggle competition and links these research questions to three groups of concepts:

- study designs,
- outcome variables, and
- between-group differences (comparing outcomes in groups with/without an intervention or a risk factor)

It also introduces each type of study design represented in the CORD-19 data, explains the meaning of the study design, and lists important synonyms and abbreviations for identifying that study design.

It also provides information on the **level of evidence** (i.e., reliability) corresponding to that choice of study design, and lists both generic keywords that can be used to identify that study design, and best-practices keywords that can be used to rank results within the group of papers conforming to that study design.

Based on the information provided in the dictionary, we tag our results according to their **level of evidence**:

- 1: the highest,
- 6: the lowest level of evidence.


We create a dictionary with tasks, questions and subquestions from an Excel file; it will be used for generating the HTML reports.

In [None]:
path = 'Kaggle COVID-19 medical dictionary.xlsx'
med_dict = pd.read_excel(path, sheet_name=2)

tasks  = med_dict["subtask"].unique()
questions  = med_dict.Question.unique()
subquestions = med_dict.Subquestion.unique()

res_dict = {}
for task in tasks:
    res_dict.update( {task : [{x : [y for y in (med_dict[med_dict.Question==x].Subquestion.unique())]} for x in (med_dict[med_dict["subtask"]==task].Question.unique())]}) 

Now we create json output files for all subquestions.

In [None]:
from get_result import *

n_task = 0
print('Task {}'.format(n_task))
for n_question, question_dict in enumerate(res_dict[tasks[n_task]]):
    for n_subquestion, subquestion in enumerate(list(question_dict.values())[0]):
        print(subquestion)
        get_answers_files(df, subquestion, ranking, ('BM25', inforet), doc_k = 3, sent_k =3, 
                                         only_top_doc = False, 
                                         task = n_task,
                                         question = n_question,
                                         subquestion = n_subquestion, method = 'word2vec')
        
n_task = 0
print('Task {}'.format(n_task))
for n_question, question_dict in enumerate(res_dict[tasks[n_task]]):
    for n_subquestion, subquestion in enumerate(list(question_dict.values())[0]):
        print(subquestion)
        get_answers_files(df, subquestion, ranking, ('BERT', fe), doc_k = 3, sent_k =3, par_k = 4, 
                                         inforet_sentence = inforet, only_top_doc = True, 
                                         task = n_task,
                                         question = n_question,
                                         subquestion = n_subquestion, method = 'BERT',
                                         encodings_dict = bert_encodings)


<h3> <span style="color:green"> HTML report for a selected query </span> </h3>

To make the proposed answers easier to present and read, we created an interactive tool for browsing the questions. Its structure is the following:

![example](https://user-images.githubusercontent.com/28005338/79470146-59e93600-8001-11ea-93f1-51d3d474e979.png)

The menu lets the user choose a question of interest from the list of proposed questions, as well as one of its subquestions to make the search query more precise. For each query, the 3 most relevant documents are shown, each with its 3 most relevant phrases. Each result consists of:

- Title
- Authors
- Publication date
- Level of evidence (if available)
- Study design (if identified)
- The three most relevant phrases
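
To give an idea of how one such result could be turned into HTML, here is a minimal sketch. It is not the code from `create_html.py`: the function `render_result`, the dictionary keys and the placeholder values are assumptions made for this illustration.

```python
from html import escape

def render_result(result):
    """Render one result (a dict with hypothetical keys) as an HTML fragment."""
    parts = [
        '<div class="result">',
        '<h4>{}</h4>'.format(escape(result['title'])),
        '<p>{} ({})</p>'.format(escape(result['authors']), escape(result['publish_time'])),
    ]
    # optional fields are only shown when they are available
    if result.get('level_of_evidence') is not None:
        parts.append('<p>Level of evidence: {}</p>'.format(result['level_of_evidence']))
    if result.get('design'):
        parts.append('<p>Study design: {}</p>'.format(escape(result['design'])))
    parts.append('<ol>')
    for phrase in result['phrases'][:3]:  # at most three phrases per document
        parts.append('<li>{}</li>'.format(escape(phrase)))
    parts.append('</ol>')
    parts.append('</div>')
    return '\n'.join(parts)

# Placeholder values, for illustration only
example_result = {
    'title': 'Incubation <period> estimates',
    'authors': 'A. Author; B. Author',
    'publish_time': '2020-03-01',
    'level_of_evidence': 3,
    'design': None,
    'phrases': ['phrase 1', 'phrase 2', 'phrase 3', 'phrase 4'],
}
snippet = render_result(example_result)
print(snippet)
```

Escaping the text fields keeps characters such as `<` in titles from breaking the generated page.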

The following code creates the HTML visualization.

In [None]:
from create_html import *

array_bert = []
html = create_html(res_dict, tasks[0], path_bert = 'json_answers/BERT/task_0/', path_word2vec = 'json_answers/word2vec/task_0/', array_bert = array_bert)

In [None]:
from IPython.display import HTML
display(HTML(html))