<center>
<img src="https://www.schwarzwaelder-bote.de/media.media.c5d3a492-5f32-4bcc-83f3-27e779ad4d46.original1024.jpg" width="1000" align="center"/>
 <br><br>
 <h1>  <span style="color:green"> COVID-19 Open Research Dataset Challenge (CORD-19) </span>  </h1>
    <h2>  <span style="color:green"> TASK : summary tables about material studies </span>  </h2>

<br>

**The aim of this notebook is to provide a robust algorithm that can help the medical community to find useful information about COVID-19**
</center>

<h2> <span style="color:green"> Approach </span> </h2> 

We designed a pipeline that consists of three parts: 
- document retrieval, 
- information extraction and 
- creating a table of results for each material. 

![retriv](https://user-images.githubusercontent.com/28005338/84590315-c7fd7e00-ae35-11ea-935d-462d1b15434a.png)

<h2> <span style="color:green"> Document Retrieval </span> </h2>

> The aim of the document retrieval is to select the most relevant articles for a query. To do so, for a given query, we rank each document using BM25+ scoring. 

**BM25+** (“Best Match 25+”) builds on TF-IDF (term frequency–inverse document frequency). It is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. The score grows with the number of times a query word appears in the document, with diminishing returns, and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general. BM25+ is a variant of BM25 adapted to long documents: it adds a lower bound to the contribution of each matched term so that long documents are not over-penalized.
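To make the scoring concrete, here is a minimal pure-Python sketch of BM25+ in its commonly cited formulation. The function name `bm25_plus_scores`, the parameter values `k1=1.5`, `b=0.75`, `delta=1.0` and the whitespace tokenization are illustrative assumptions; the `Ranking` class used later in this notebook may differ in its details.

```python
import math
from collections import Counter


def bm25_plus_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75, delta=1.0):
    """Return one BM25+ score per document (parameter values are assumptions)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents get a larger normalization factor.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # only terms present in the document contribute
            idf = math.log((n_docs + 1) / doc_freq[term])
            # delta is the lower bound that BM25+ adds to plain BM25.
            score += idf * (tf * (k1 + 1) / (tf + k1 * length_norm) + delta)
        scores.append(score)
    return scores


corpus = [
    "the man went out for a walk".split(),
    "the children sat around the fire".split(),
]
scores = bm25_plus_scores("children fire".split(), corpus)
print(scores)  # the second document should receive the higher score
```

Setting `delta=0` in this sketch reduces it to plain BM25 with the same IDF.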

**Example**

*Document A: "The man went out for a walk"*

*Document B: "The children sat around the fire"*

First we create a dictionary of words and their occurrences for each document in the corpus (the collection of documents):

![freq](https://user-images.githubusercontent.com/28005338/84591334-c3d55e80-ae3d-11ea-9b59-addde9b0fc76.png)

Here row 0 stands for document A and row 1 for document B.

**Term Frequency (TF)** is defined as *the number of times a word appears in a document divided by the total number of words in the document.* Every document has its own term frequency.

**Inverse Document Frequency (IDF)** is defined as *the log of the number of documents divided by the number of documents that contain the word w.* Inverse document frequency determines the weight of rare words across all documents in the corpus.

Lastly, the **TF-IDF** score is simply *the TF multiplied by IDF*.
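
Written as formulas, the three definitions above read as follows. The symbols $N$ (number of documents in the corpus) and $n_w$ (number of documents containing the word $w$) are notation introduced here for illustration, and the base of the logarithm is left open, since the text does not fix it.

$$
\mathrm{tf}(w, d) = \frac{\text{number of times } w \text{ appears in } d}{\text{total number of words in } d},
\qquad
\mathrm{idf}(w) = \log\frac{N}{n_w},
\qquad
\text{tf-idf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
$$

For the word "fire" in document B, for example, $\mathrm{tf} = 1/6$ and $\mathrm{idf} = \log(2/1)$.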

![fff](https://user-images.githubusercontent.com/28005338/84591503-3dba1780-ae3f-11ea-8e30-d0f365a4a30a.png)

As one can see, the TF-IDF score for the word "fire" is much higher for document B (row 1) than for document A (row 0), whereas "walk" is more significant for document A. 
Thus, if the query is **"Who is sitting around the fire?"**, our document retrieval will select document B as the document most related to the query.
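
The example can be reproduced with a few lines of Python. This is only a sketch under simplifying assumptions: words are lowercased and split on whitespace, and the natural logarithm is used, so the absolute values may differ from those in the table above even though the ordering is the same.

```python
import math
from collections import Counter

documents = {
    "A": "The man went out for a walk",
    "B": "The children sat around the fire",
}

# Lowercase and split on whitespace (a simplifying assumption).
tokenized = {name: text.lower().split() for name, text in documents.items()}


def term_frequency(tokens):
    """Count of each word divided by the total number of words in the document."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


def inverse_document_frequency(tokenized_docs):
    """Log of (number of documents / number of documents containing the word)."""
    n_docs = len(tokenized_docs)
    vocabulary = {word for tokens in tokenized_docs.values() for word in tokens}
    return {
        word: math.log(n_docs / sum(word in tokens for tokens in tokenized_docs.values()))
        for word in vocabulary
    }


idf = inverse_document_frequency(tokenized)
tfidf = {
    name: {word: tf * idf[word] for word, tf in term_frequency(tokens).items()}
    for name, tokens in tokenized.items()
}

print("fire in B:", tfidf["B"]["fire"])
print("fire in A:", tfidf["A"].get("fire", 0.0))
print("the in A:", tfidf["A"]["the"])
```

The word "the" occurs in both documents, so its IDF is log(2/2) = 0 and it contributes nothing to either score.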

Note that certain words are needed to form sentences but add no semantic meaning to the text. For example, the most commonly used word in the English language is **the**, which represents 7% of all words written or spoken. You cannot deduce anything about a text from the fact that it contains the word **the**. On the other hand, words like **fatality** and **death** could be used to determine a death rate, for example. In natural language processing, such uninformative words are referred to as stop words, and we filter all of them out.
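
A minimal sketch of stop-word filtering is shown below. The stop-word list here is a tiny hand-written one chosen for illustration (an assumption); real lists, such as those shipped with common NLP libraries, are much longer.

```python
import re

# A tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "who"}


def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stop_words("Who is sitting around the fire?"))
# → ['sitting', 'around', 'fire']
```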

<h2> <span style="color:green"> Information Extraction </span> </h2>

> The aim of the information extraction is to find an answer to a given question in a given article. First we need to extract the paragraph(s) that contain(s) the answer and then find the answer in a given paragraph.

We use a BERT-based baseline for SQuAD. 

**BERT** (Bidirectional Encoder Representations from Transformers) is a recent model introduced by researchers at Google AI Language. BERT makes use of Transformer, an attention mechanism that learns contextual relations between words in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task.

The chart below is a high-level description of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.

During pre-training, before word sequences are fed into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence.

<img src="https://user-images.githubusercontent.com/28005338/84592125-87a4fc80-ae43-11ea-9d18-09678cd9e43f.png" width="450" align="center"/>
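
The masking step can be sketched as follows. This is a simplified illustration, not BERT's actual pre-processing code: it works on whitespace-separated words instead of WordPiece tokens and always substitutes `[MASK]`, whereas the original BERT recipe sometimes keeps the selected token or swaps in a random one. The function name `mask_tokens` and the fixed `seed` are assumptions.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace roughly mask_prob of the tokens with [MASK]; return masked tokens and targets."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    targets = {}
    for position in positions:
        targets[position] = masked[position]  # the word the model must predict
        masked[position] = "[MASK]"
    return masked, targets


tokens = "the children sat around the fire and told stories until late at night".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```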

Generally speaking, during training BERT learns to understand a language and the relations between words. BERT then has to be fine-tuned for a specific task: *the model receives a question regarding a text sequence and is required to mark the answer in the paragraph*.

For this we use **SQuAD** (the Stanford Question Answering Dataset), a question answering dataset containing 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.

Every question answering example is tokenized, and multiple instances are then generated per example by concatenating a “[CLS]” token, the tokenized question, a “[SEP]” token, tokens from the content of the document, and a final “[SEP]” token, limiting the total size of each instance to 512 tokens. For each document we generate all possible instances by stepping through the document content. For each training instance, start and end token indices are computed to represent the target answer span.
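
The instance construction described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code used for fine-tuning: words stand in for WordPiece tokens, the `stride` of 128 is an arbitrary choice, and the helper names `build_instances` and `answer_span` are hypothetical.

```python
def build_instances(question_tokens, doc_tokens, max_len=512, stride=128):
    """Build [CLS] question [SEP] chunk [SEP] instances of at most max_len tokens."""
    # Room left for document tokens after [CLS], the question, and two [SEP] tokens.
    room = max_len - len(question_tokens) - 3
    if room <= 0:
        raise ValueError("question is too long for the chosen max_len")
    offset = len(question_tokens) + 2  # index of the first document token in an instance
    instances = []
    start = 0
    while True:
        chunk = doc_tokens[start:start + room]
        tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + chunk + ["[SEP]"]
        instances.append({"tokens": tokens, "doc_start": start, "offset": offset})
        if start + room >= len(doc_tokens):
            break  # the last chunk reached the end of the document
        start += stride
    return instances


def answer_span(instance, answer_start, answer_end):
    """Map document-level answer indices to instance-level indices, or None if outside."""
    chunk_len = len(instance["tokens"]) - instance["offset"] - 1
    lo = answer_start - instance["doc_start"]
    hi = answer_end - instance["doc_start"]
    if lo < 0 or hi >= chunk_len:
        return None
    return lo + instance["offset"], hi + instance["offset"]


question = "who is sitting around the fire ?".split()
document = "the children sat around the fire".split()
instances = build_instances(question, document)
span = answer_span(instances[0], 0, 1)  # "the children" covers document tokens 0..1
print(instances[0]["tokens"])
print(span)
```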

The chart below is a high-level description of fine-tuning BERT:
<img src="https://user-images.githubusercontent.com/28005338/84592410-d5baff80-ae45-11ea-970e-f93d3ac717a8.png" align="center"/>

**Example continuation**

After document selection we are left with Document B:

*Document B: "The children sat around the fire"*

And now the idea is to extract the answer for the following question:

*Query: "Who is sitting around the fire?"*

Then the model output will be

<img src="https://user-images.githubusercontent.com/28005338/84592625-6c3bf080-ae47-11ea-9cd3-5820fdede06f.png" align="center"/>

Since we are working with long documents, the model first selects only those paragraphs that contain the answer and then performs the extraction on each of them.
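
One way to handle long documents is to slide an overlapping window over the text and keep the best-scoring answer. The sketch below mirrors the window size (250 words), stride (150 words), minimum tail length (127 words) and score threshold (0.3) that appear in `get_answer_from_doc` further down; the helper names and the `qa_fn` callable, which stands in for the question answering model, are assumptions made for illustration.

```python
def sliding_windows(words, seq_length=250, stride=150, min_tail=127):
    """Split a list of words into overlapping windows of at most seq_length words."""
    windows = []
    start = 0
    while len(words) - start > seq_length:
        windows.append(" ".join(words[start:start + seq_length]))
        start += stride
    tail = words[start:]
    # Keep the remaining words only if they form a reasonably long passage.
    if len(tail) > min_tail:
        windows.append(" ".join(tail))
    return windows


def best_answer(question, document, qa_fn, threshold=0.3):
    """Run qa_fn on every window and return (answer, window) for the best score, or None."""
    candidates = []
    for window in sliding_windows(document.split(" ")):
        result = qa_fn(question, window)  # expected to return {'answer': ..., 'score': ...}
        candidates.append((result["score"], result["answer"], window))
    if not candidates:
        return None
    score, answer, window = max(candidates, key=lambda candidate: candidate[0])
    return (answer, window) if score > threshold else None
```

The overlap between consecutive windows reduces the chance that an answer is cut in half at a window boundary.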

<h2> <span style="color:green"> Difference with other approaches </span> </h2> 

We have seen that most approaches rely entirely either on embedding techniques to find the answer or on document retrieval techniques to find the most relevant documents. We try to combine the strengths of both approaches and improve on them using recent advances in NLP.

<h2> <span style="color:green"> Installing and importing required libraries </span> </h2>  

In [None]:
!git clone https://github.com/Apunti/covid19-kaggle.git

In [None]:
cd covid19-kaggle

In [None]:
!git clone https://github.com/huggingface/transformers

In [None]:
cd transformers

In [None]:
pip install .

In [None]:
cd ..

In [None]:
#from qa_pipeline import *

In [None]:
from transformers import pipeline
from ranking import Ranking
from get_result import get_level_evidence, get_design

import pandas as pd

def get_answer_from_doc(query, doc, qa_model):

    output = []
    
    seq_length = 250
    stride = 150
    splitted_doc = doc.split(' ')
    
    i = 0
    while len(splitted_doc) > seq_length:
        paragraph = ' '.join(splitted_doc[:seq_length])
        m_input = {'question': query,
                 'context': paragraph}
        try:
            output_dict = qa_model(m_input)
        except Exception:
            # Retry the same window once; if it fails again, skip to the next window.
            if i != 0:
                splitted_doc = splitted_doc[stride:]
                i = 0
            else:
                i += 1
            continue
        answer = output_dict['answer']
        score = output_dict['score']

        output.append((answer, score, paragraph))
        
        splitted_doc = splitted_doc[stride:]
        i = 0
        
    # Process the remaining tail only if it is long enough to be a useful context.
    if len(splitted_doc) > 127:

        paragraph = ' '.join(splitted_doc)
        m_input = {'question': query,
                 'context': paragraph}
        print('processing paragraph...', end= '')
        try:
            output_dict = qa_model(m_input)
        except Exception:
            # Skip the tail if the model fails on it.
            pass
        else:
            answer = output_dict['answer']
            score = output_dict['score']

            output.append((answer, score, paragraph))
    
    sorted_output = sorted(output, key=lambda x: x[1], reverse=True)
    
    if len(sorted_output) == 0:
        sorted_output = [('-', 0, '-')]
        
    #print('### ANSWER ###')
    #print(sorted_output[0])

    if sorted_output[0][1] > 0.3:
        return sorted_output[0][0], sorted_output[0][2]
    else:
        return '-', sorted_output[0][2]
    
def get_documents(dataset, ranking, query, top_k = 10):

    similar = ranking.most_similar(query, dataset, k = top_k, func='bm25', data='text')
    print('similar length: {}'.format(len(similar)))

    return similar

def get_information(row):
    date = row['date'].values[0]
    url = row['url'].values[0]
    authors = row['authors'].values[0]
    if len(authors) > 20:
        authors = authors[:20] + ' [...]'
    title = row['title'].values[0]
    if len(title) > 60:
        title = title[:60] + ' [...]'
    design = get_design(row)
    level_of_evidence = get_level_evidence(row)
    
    new_line = {'date': date,
                'title': '<a href="' + url + '">' + title + '</a>',
                'authors': authors,
                'design': design,
                'level_of_evidence': level_of_evidence}
    
    return new_line
    

def get_csv(df, csv_path, risk_factor, questions, top_k = 1, device = -1, dict_path = 'Data/ranking_dict'):

    dataset = df #pd.read_csv(df_path, sep=';')

    ranking = Ranking('texts', path= dict_path)
    qa_model = pipeline('question-answering', device=device, model='bert-large-uncased-whole-word-masking-finetuned-squad')

    print('All loaded')

    all_query = questions[0] #' '.join(questions)
    documents = get_documents(dataset, ranking, all_query, top_k = top_k)
    print('Length documents: {}'.format(len(documents)))

    columns = ['date', 'title', 'authors', 'design', 'level_of_evidence'] + questions
    rows = []

    print('Starting docs for {}: \n'.format(risk_factor))

    for doc in documents:
        skip = False
        row = dataset.loc[dataset.text == doc]
        new_line = get_information(row)
        for query in questions:
            answer, paragraph = get_answer_from_doc(query, doc, qa_model)
            if answer == '-' or not isinstance(answer, str):
                # Drop the document if any question has no confident answer.
                skip = True
                break
            # Show a short excerpt around the answer and highlight the answer itself.
            index = paragraph.find(answer)
            start = max(index - 50, 0)  # a negative start would slice from the end
            end = index + max(70, len(answer))  # never cut the answer itself
            excerpt = '[...] ' + paragraph[start:end] + ' [...]'
            new_line[query] = excerpt.replace(answer, f"<mark>{answer}</mark>")
        if skip:
            continue
        rows.append(new_line)

    # DataFrame.append was removed in pandas 2.0, so build the table from the collected rows.
    results = pd.DataFrame(rows, columns=columns)
        
    results.to_csv(csv_path, sep=';', index=False)

<h2> <span style="color:green"> Data Preprocessing </span>  </h2>
We go through each row of the metadata and, if the row has full text, append it to our dataframe, taking the text from the PMC folder when available and from the PDF folder otherwise. For each row we apply the following preprocessing:

- We discard documents that are not written in English, using the `langdetect` package.
- We discard texts that have fewer than 200 words.
- The abstract and the text are stored as strings, with paragraphs separated by the newline character (`\n`). 
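
The filtering rules above could look roughly like the sketch below. The helper names `keep_document` and `join_paragraphs` are hypothetical, and the language detector is passed in as an argument so that the logic can be tried without `langdetect` installed; when no detector is given, the sketch falls back to `langdetect.detect`.

```python
def keep_document(text, detect_language=None, min_words=200):
    """Return True if the text has at least min_words words and is detected as English."""
    if len(text.split()) < min_words:
        return False
    if detect_language is None:
        # Imported lazily so the function can be tried with a stand-in detector.
        from langdetect import detect as detect_language
    return detect_language(text) == "en"


def join_paragraphs(paragraphs):
    """Store a document body as a single string with one paragraph per line."""
    return "\n".join(p.strip() for p in paragraphs if p.strip())


# Usage with a stand-in detector that always answers "en" (for illustration only).
print(keep_document("word " * 250, detect_language=lambda text: "en"))
print(repr(join_paragraphs(["First paragraph. ", "", "Second paragraph."])))
```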

<h3> <span style="color:green"> Tagging </span>  </h3>

Following the [notebook](https://www.kaggle.com/ajrwhite/covid-19-thematic-tagging-with-regular-expressions) by Andy White, we decided to add a disease tag and a design tag. We think the disease tag is needed because we are mostly interested in COVID-19 information, and we thought that the level of evidence of the studies would be really useful for the medical research community when looking for answers. The level of evidence is based on the study design, following the [guidelines](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/137027) by Savanna Reid. The main weakness of this tagging is that it relies on keyword search through the document, so the classification is not optimal.
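
As a rough illustration of keyword-based tagging, the sketch below tags a text with regular expressions. The patterns and design categories are hypothetical examples chosen for this sketch; they are not the ones from the referenced notebook or from the processed dataset.

```python
import re

# Hypothetical keyword patterns, for illustration only.
DISEASE_PATTERN = re.compile(r"covid[\s-]?19|sars[\s-]?cov[\s-]?2|2019[\s-]?ncov", re.IGNORECASE)
DESIGN_PATTERNS = {
    "systematic review": re.compile(r"systematic review|meta[\s-]?analysis", re.IGNORECASE),
    "randomized controlled trial": re.compile(r"randomi[sz]ed controlled trial", re.IGNORECASE),
    "case series": re.compile(r"case (series|report)", re.IGNORECASE),
}


def tag_document(text):
    """Return a disease flag and the first study design whose keywords match the text."""
    is_covid = bool(DISEASE_PATTERN.search(text))
    design = next(
        (name for name, pattern in DESIGN_PATTERNS.items() if pattern.search(text)),
        "unknown",
    )
    return {"covid19": is_covid, "design": design}


print(tag_document("A randomised controlled trial in patients with COVID-19"))
```

A text that mentions a design only in passing, for instance when citing another study, is tagged the same way as one that reports it, which is the weakness noted above.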

In [None]:
import pandas as pd

df = pd.read_csv('/kaggle/input/processed-data-v8/processed_data_v8_2.csv', sep=';')
print(df.columns)

<h2> <span style="color:green"> Results </span>  </h2>

In [None]:
materials = ['plastic', 'stainless steel', 'cardboard', 'stool']

d = {}
for material in materials:
    d[material] = ['Up to how many days is the persistence of the virus on ' + material]
                   #'How long is the persistence and stability of the virus on ' + material,
                   #'How long remains viral activity on ' + material,
                   #'What is the viral shedding in ' + material]

In [None]:
import os

# exist_ok avoids an error when the cell is run more than once.
os.makedirs('csv', exist_ok=True)

In [None]:
for key in d:
    get_csv(df, 'csv/' + key + '.csv', key, d[key], top_k = 10, device = -1, dict_path='/kaggle/input/ranking-dict-v8')

In [None]:
from IPython.display import HTML, display
pd.set_option('display.max_colwidth', 200)

In [None]:
for material in materials:
    display(HTML('<h2>{}</h2>'.format(material.upper())))
    table = pd.read_csv("csv/{}.csv".format(material), sep=';')
    df_table=HTML(table.to_html(escape=False,index=False))#, col_space=150))
    display(df_table)