<font size="6" >👑 A QA model to answer them all</font>

<br>

<font size="4">An attempt to answer all the tasks' questions with a single Question-Answering model</font>

<br><br>

<font size="3">
    <strong>Why:</strong> at the time of writing, there are more than 580 notebooks on the <a href="https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks" >COVID-19 Open Research Dataset</a> challenge. The main reason why we are here, united, is that we want to help the research community <strong>find answers</strong>. We need to find answers now, as a deep understanding of the coronavirus infectious disease may save lives!
</font>

<br>
 
<font size="3">
  <strong>Simple and intuitive:</strong> the second main reason we are here is to learn and grow together. This notebook has been designed and conceived to be easy-to-understand for beginners but, hopefully, full of valuable insights also for advanced Kagglers. The notebook runs in less than 5 minutes so feel free to fork and work on your own!
</font>

<br>

<font size="3">
  <strong>Goal:</strong> in just a few lines of code we develop from <strong>start to finish</strong> a universal question-answering system able to answer (almost) any kind of question related to coronavirus. In particular, the notebook will attempt to answer all the questions from the <a href="https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks">CORD-19 TASKS</a>. We then visualize the answers in a nice and readable format.
</font>


<br>

<font size="3">
If you have questions or feedback please leave a comment. Disclaimer: work in progress.
</font>

# 1. Web application and call-to-action

Update: the model's output can now be visualized here: [Korono - Question and Answering model for COVID-19 paper](https://jbesomi.github.io/Korono/). This simple website lets you choose a task and a question, then proposes a collection of relevant papers with the answer to the question highlighted.

**I would like to know what you think about it**. Also, if you feel like helping, I'm open to any kind of suggestions and feedback, and I'm **looking for collaborators to push this project further**. Anyone is warmly welcome to help; just leave a comment below and I'm sure we will find a way to work together!


<a href="https://jbesomi.github.io/Korono/">
    <img src="https://i.imgur.com/WUvp09u.png" alt="Korono - Question and Answering model">
</a>

# 2. Introduction

### 2.1 Question-Answering (QA) model

In machine learning, a question-answering model works with three elements: the `question`, the `context` and the `answer`. The model inputs are the `question` and the `context`, and the model output is the `answer`. In most cases, but not all, the `answer` is contained in the `context`. For simplicity, throughout the notebook, we will assume that this is always the case.

Many datasets exist to train QA models. One of the most popular is the Stanford Question Answering Dataset, also known as [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/). It contains thousands of tuples of the type (`question`, `context`, `answer`) used to teach the model what it means to both **find** and **return** an answer. During training, the model learns and exploits linguistic properties of the language.
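
To make the tuple format concrete, here is a minimal sketch with two hand-written triples. They are illustrative only and are not taken from SQuAD; the first one reuses the example sentence that appears later in this notebook, the second is made up. The check at the end expresses the assumption stated above, namely that each answer is a literal span of its context.

```python
# Two illustrative (question, context, answer) triples (not real SQuAD records).
SENTENCE = ("Coronavirus disease (COVID-19) is an infectious disease "
            "caused by a newly discovered coronavirus")

triples = [
    ("What is coronavirus?", SENTENCE, "infectious disease"),
    ("What causes COVID-19?", SENTENCE, "a newly discovered coronavirus"),
]

# Extractive QA assumes that every answer can be found verbatim in its context.
answers_in_context = [answer in context for _, context, answer in triples]
print(answers_in_context)
```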

### 2.2 Using a search engine to produce the context

In general, the `context` is quite limited, about one page. In our case, instead, we are dealing with more than 40k papers. **We therefore need to reduce the size of the context**. We do so by selecting the papers that are most similar to the `question`. In the code, a very simple ranking algorithm, [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25), is used. As you will see, even though Okapi BM25 is quite old (its origins date back to the 1980s), it does a great job. In the future, I plan to compare the Okapi solution against more recent approaches such as transformers.
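
For readers curious about what BM25 computes, the following is a rough, dependency-free sketch of the textbook scoring formula. The function name `bm25_scores`, the toy documents and the parameter values `k1=1.5` and `b=0.75` are assumptions made for illustration; the `rank_bm25` package used later in the notebook may differ in details such as the IDF smoothing.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # Smoothed inverse document frequency (always positive).
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            f = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            norm = f + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "incubation period of the virus".split(),
    "the virus spreads by droplets".split(),
    "weather and humidity".split(),
]
toy_scores = bm25_scores("incubation period".split(), toy_docs)
best = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(best, toy_scores)
```

Documents that share no term with the query get a score of zero, which is why the search engine below can discard them.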


### 2.3 From (context, question) to answer with transformers

This is the most interesting and magical part of the whole notebook. Given a question `q` (we use the terms query and question interchangeably), the previous section gives us a list of contexts. Now, for each context and for the same query `q`, we ask a pre-trained and pretty powerful transformer model which part of the context **best answers** the query.

You may ask how we can build such a powerful model in just a few lines of code. The reason is that we make use of the great [Hugging Face Transformers library](https://github.com/huggingface/transformers) (you need to check it out if you haven't!), which lets us work with such big and complex neural networks with ease.

The raw output is messy and hard to read. That's why, for each task and for each question, we visualize the context with the answer highlighted in a friendly way.

### 2.4 Summarization abstract for each question [Coming Soon]

The code for the summarization has been written but hasn't been tested and visualized yet.

### 2.5 Acknowledgement

This notebook has been inspired by the great work of:

- https://www.kaggle.com/dgunning/building-a-cord19-research-engine-with-bm25 by DwightGunning
- https://www.kaggle.com/dirktheeng/anserini-bert-squad-for-semantic-corpus-search by Dirk


# 3. Load dataframe

In [None]:
"""
Libraries
"""

!pip install rank_bm25 -q

import numpy as np
import pandas as pd 
from pathlib import Path, PurePath

import nltk
from nltk.corpus import stopwords
import re
import string
import torch

from rank_bm25 import BM25Okapi # Search engine

In [None]:
"""
Load metadata df
"""

input_dir = PurePath('../input/CORD-19-research-challenge')
metadata_path = input_dir / 'metadata.csv'
metadata_df = pd.read_csv(metadata_path, low_memory=False)
metadata_df = metadata_df.dropna(subset=['abstract', 'title']) \
                            .reset_index(drop=True)

# 4. Covid Search Engine

We define a Python class `CovidSearchEngine`. It has two main methods, `__init__` and `search(query, num)`. The `__init__` method is called only once, when the class is instantiated. It stores and indexes the dataframe passed as an argument. Once the indexing is complete, we can search for similar papers simply by invoking `search(query, num)`, where `num` is the maximum number of results to return.


The snippet of code below shows how it works:


```python
    metadata_df = pd.read_csv(metadata_path)
    metadata_df = metadata_df.dropna(subset=['abstract', 'title'])
    cse = CovidSearchEngine(metadata_df) # Covid Search Engine
    cse.search("what is coronavirus?", 10) # top 10 papers
```
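
Before indexing, every abstract and every query goes through the same text cleaning (the `preprocess` method of the class below). As a rough sketch of that step without the NLTK dependency, the example below uses a plain whitespace split and a tiny hand-written stopword list; `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names, and unlike the class, this version keeps the terms in their original order.

```python
import string

# Tiny illustrative stopword list; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"what", "is", "the", "of", "a", "in"}


def toy_preprocess(text):
    """Lowercase, drop punctuation, split on whitespace, filter tokens."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = []
    for word in text.split():
        if len(word) > 1 and word not in TOY_STOPWORDS and not word.isnumeric():
            if word not in tokens:  # keep each term once, preserving order
                tokens.append(word)
    return tokens


terms = toy_preprocess(
    "What is the incubation period of COVID-19? 14 days, the studies say."
)
print(terms)
```

Note that removing punctuation turns `COVID-19` into the single term `covid19`, and that purely numeric tokens such as `14` are dropped.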

In [None]:
from rank_bm25 import BM25Okapi

english_stopwords = set(stopwords.words('english'))  # a set gives fast membership tests

class CovidSearchEngine:
    """
    Simple CovidSearchEngine.
    """
    
    def remove_special_character(self, text):
        #Remove special characters from text string
        return text.translate(str.maketrans('', '', string.punctuation))

    def tokenize(self, text):
        # tokenize text
        words = nltk.word_tokenize(text)
        return list(set([word for word in words 
                         if len(word) > 1
                         and not word in english_stopwords
                         and not word.isnumeric() 
                        ])
                   )
    
    def preprocess(self, text):
        # Clean and tokenize text input
        return self.tokenize(self.remove_special_character(text.lower()))


    def __init__(self, corpus: pd.DataFrame):
        self.corpus = corpus
        self.columns = corpus.columns
        
        raw_search_str = self.corpus.abstract.fillna('') + ' ' \
                            + self.corpus.title.fillna('')
        
        self.index = raw_search_str.apply(self.preprocess).to_frame()
        self.index.columns = ['terms']
        self.index.index = self.corpus.index
        self.bm25 = BM25Okapi(self.index.terms.tolist())
    
    def search(self, query, num):
        """
        Return the top `num` results that best match the query
        """
        # obtain scores
        search_terms = self.preprocess(query) 
        doc_scores = self.bm25.get_scores(search_terms)
        
        # sort by scores
        ind = np.argsort(doc_scores)[::-1][:num] 
        
        # select top results and returns
        results = self.corpus.iloc[ind][self.columns].copy()  # copy before adding a column
        results['score'] = doc_scores[ind]
        results = results[results.score > 0]
        return results.reset_index()

We can now instantiate the `cse` object:

In [None]:
cse = CovidSearchEngine(metadata_df)

# 5. Question-Answering model

As mentioned in the introduction, we make use of a pre-trained question-answering model. The first step consists of importing the dependencies and downloading the model.

In [None]:
"""
Download pre-trained QA model
"""

import torch
from transformers import BertTokenizer
from transformers import BertForQuestionAnswering

torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

BERT_SQUAD = 'bert-large-uncased-whole-word-masking-finetuned-squad'

model = BertForQuestionAnswering.from_pretrained(BERT_SQUAD)
tokenizer = BertTokenizer.from_pretrained(BERT_SQUAD)

model = model.to(torch_device)
model.eval()

print()

Now, we define a function `answer_question(question, context)` that, given a question and a paper abstract, returns the span of text that best answers the question.

For instance, given as `question` _"what is coronavirus?"_ and as `context` _"Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus"_ we expect to obtain as `answer` _"infectious disease"_.
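
To see how a span is obtained, it helps to know that the model assigns to every token a score for being the start of the answer and a score for being its end. The sketch below mimics this with hand-made numbers: the tokens and the scores are invented for illustration and do not come from the real model.

```python
# Toy tokens and made-up scores; a real model would produce the scores.
tokens = ["[CLS]", "what", "is", "coronavirus", "?", "[SEP]",
          "an", "infectious", "disease", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.5, 3.1, 0.7, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.2, 0.9, 2.8, 0.0]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


start_index = argmax(start_scores)
end_index = argmax(end_scores)

# The answer is the run of tokens from the best start to the best end.
toy_answer = " ".join(tokens[start_index:end_index + 1])
print(start_index, end_index, toy_answer)
```

Because the two positions are chosen independently, the end can fall before the start; in that case the slice is empty and no answer is returned.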


In [None]:
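def answer_question(question, context):
    # Answer a question given the question and a context
    encoded_dict = tokenizer.encode_plus(
                        question, context,
                        add_special_tokens = True,
                        max_length = 256,
                        truncation = True,        # cut inputs longer than max_length
                        padding = 'max_length',   # replaces the deprecated pad_to_max_length
                        return_tensors = 'pt'
                   )
    
    input_ids = encoded_dict['input_ids'].to(torch_device)
    token_type_ids = encoded_dict['token_type_ids'].to(torch_device)
    attention_mask = encoded_dict['attention_mask'].to(torch_device)
    
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(input_ids,
                        token_type_ids=token_type_ids,
                        attention_mask=attention_mask)
    # Positional indexing works whether the model returns a tuple or an output object
    start_scores, end_scores = outputs[0], outputs[1]

    all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    start_index = torch.argmax(start_scores)
    end_index = torch.argmax(end_scores)
    
    answer = tokenizer.convert_tokens_to_string(all_tokens[start_index:end_index+1])
    answer = answer.replace('[CLS]', '')
    return answer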

# 6. Tasks and questions

In this section, we store in a `dict` object a list of `tasks` and their `questions`. In the next parts, we will ask our model to answer them all.

In [None]:
# adapted from https://www.kaggle.com/dirktheeng/anserini-bert-squad-for-semantic-corpus-search

covid_kaggle_questions = {
"data":[
          {
              "task": "What is known about transmission, incubation, and environmental stability?",
              "questions": [
                  "Is the virus transmitted by aerosol, droplets, food, close contact, fecal matter, or water?",
                  "How long is the incubation period for the virus?",
                  "Can the virus be transmitted asymptomatically or during the incubation period?",
                  "How does weather, heat, and humidity affect the transmission of 2019-nCoV?",
                  "How long can the 2019-nCoV virus remain viable on common surfaces?"
              ]
          },
          {
              "task": "What do we know about COVID-19 risk factors?",
              "questions": [
                  "What risk factors contribute to the severity of 2019-nCoV?",
                  "How does hypertension affect patients?",
                  "How does heart disease affect patients?",
                  "How does copd affect patients?",
                  "How does smoking affect patients?",
                  "How does pregnancy affect patients?",
                  "What is the fatality rate of 2019-nCoV?",
                  "What public health policies prevent or control the spread of 2019-nCoV?"
              ]
          },
          {
              "task": "What do we know about virus genetics, origin, and evolution?",
              "questions": [
                  "Can animals transmit 2019-nCoV?",
                  "What animal did 2019-nCoV come from?",
                  "What real-time genomic tracking tools exist?",
                  "What geographic variations are there in the genome of 2019-nCoV?",
                  "What efforts are being made in Asia to prevent further outbreaks?"
              ]
          },
          {
              "task": "What do we know about vaccines and therapeutics?",
              "questions": [
                  "What drugs or therapies are being investigated?",
                  "Are anti-inflammatory drugs recommended?"
              ]
          },
          {
              "task": "What do we know about non-pharmaceutical interventions?",
              "questions": [
                  "Which non-pharmaceutical interventions limit transmission?",
                  "What are most important barriers to compliance?"
              ]
          },
          {
              "task": "What has been published about medical care?",
              "questions": [
                  "How does extracorporeal membrane oxygenation affect 2019-nCoV patients?",
                  "What telemedicine and cybercare methods are most effective?",
                  "How is artificial intelligence being used in real time health delivery?",
                  "What adjunctive or supportive methods can help patients?"
              ]
          },
          {
              "task": "What do we know about diagnostics and surveillance?",
              "questions": [
                  "What diagnostic tests (tools) exist or are being developed to detect 2019-nCoV?"
              ]
          },
          {
              "task": "Other interesting questions",
              "questions": [
                  "What is the immune system response to 2019-nCoV?",
                  "Can personal protective equipment prevent the transmission of 2019-nCoV?",
                  "Can 2019-nCoV infect patients a second time?"
              ]
          }
   ]
}

# 7. Compute answers

Next, we define the function `get_results(question)`. 

Given a question, `get_results(question)` returns a _JSON_ object with the following format:

```json
{ 
    "question": "What is coronavirus?",
    "results": [
        {
            "context": "Coronavirus disease (COVID-19) is an infectious disease ...",
            "answer": "infectious disease",
            "start_index": 37,
            "end_index": 55
        },
        ...
    ]
}
```

Here, `start_index` and `end_index` point at the start and end character of the answer in the context. These two values will be useful later to highlight the answer in the context.
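
As a small illustration of how the two indices are obtained and used, the sketch below locates an answer in a context with a case-insensitive search and wraps it in a `<span>` tag. It mirrors the logic of the helper functions defined in this notebook, but the variable names and the sample strings are chosen only for this example.

```python
import re

context = ("Coronavirus disease (COVID-19) is an infectious disease "
           "caused by a newly discovered coronavirus")
answer = "Infectious Disease"  # casing may differ from the context

# Case-insensitive search; re.escape keeps punctuation in the answer literal.
match = re.search(re.escape(answer.lower()), context.lower())
if match:
    start_index, end_index = match.start(), match.end()
else:
    # Fallback when the answer is not found: span the whole context.
    start_index, end_index = 0, len(context)

highlighted = (context[:start_index]
               + "<span class='answer'>"
               + context[start_index:end_index]
               + "</span>"
               + context[end_index:])
print(start_index, end_index)
print(highlighted)
```

The fallback that spans the whole context is what later allows the display code to recognize and skip the contexts in which no answer was found.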

The helper functions `get_all_context`, `get_all_answers` and `create_output_results` are here to make the code more readable.

In [None]:
NUM_CONTEXT_FOR_EACH_QUESTION = 10


def get_all_context(query, num_results):
    # Return the `num_results` papers that best match the query
    
    papers_df = cse.search(query, num_results)
    return papers_df['abstract'].str.replace("Abstract", "").tolist()


def get_all_answers(question, all_contexts):
    # Ask the same question to all contexts (all papers)
    
    all_answers = []
    
    for context in all_contexts:
        all_answers.append(answer_question(question, context))
    return all_answers


def create_output_results(question, 
                          all_contexts, 
                          all_answers, 
                          summary_answer='', 
                          summary_context=''):
    # Return results in json format
    
    def find_start_end_index_substring(context, answer):   
        search_re = re.search(re.escape(answer.lower()), context.lower())
        if search_re:
            return search_re.start(), search_re.end()
        else:
            return 0, len(context)
        
    output = {}
    output['question'] = question
    output['summary_answer'] = summary_answer
    output['summary_context'] = summary_context
    results = []
    for c, a in zip(all_contexts, all_answers):

        span = {}
        span['context'] = c
        span['answer'] = a
        span['start_index'], span['end_index'] = find_start_end_index_substring(c,a)

        results.append(span)
    
    output['results'] = results
        
    return output

    
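def get_results(question, 
                summarize=False, 
                num_results=NUM_CONTEXT_FOR_EACH_QUESTION,
                verbose=True):
    # Retrieve the contexts, answer the question on each of them
    # and package everything in the output format

    all_contexts = get_all_context(question, num_results)
    
    all_answers = get_all_answers(question, all_contexts)
    
    # Summaries stay empty unless explicitly requested
    summary_answer = ''
    summary_context = ''
    if summarize:
        # Experimental: relies on `get_summary`, defined at the end of the notebook
        summary_answer = get_summary(' '.join(all_answers))
        summary_context = get_summary(' '.join(all_contexts))
    
    return create_output_results(question, 
                                 all_contexts, 
                                 all_answers,
                                 summary_answer,
                                 summary_context)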

We now **iterate** over all **tasks** and all **questions** and store the results into `all_tasks`:

In [None]:
all_tasks = []

for i, t in enumerate(covid_kaggle_questions['data']):
    print("Answering questions to task {}. ...".format(i+1))
    answers_to_question = []
    for q in t['questions']:
        answers_to_question.append(get_results(q, verbose=False))
    task = {}
    task['task'] = t['task']
    task['questions'] = answers_to_question
    
    all_tasks.append(task)

all_answers = {}
all_answers['data'] = all_tasks

# 8. Show results

First, we define a helper function, `dh()` (dh stands for _display html_), to render `html` tags properly.

In [None]:
from IPython.display import display, Markdown, Latex, HTML

def layout_style():
    style = """
        div {
            color: black;
        }
        .single_answer {
            border-left: 3px solid #dc7b15;
            padding-left: 10px;
            font-family: Arial;
            font-size: 16px;
            color: #777777;
            margin-left: 5px;

        }
        .answer{
            color: #dc7b15;
        }
        .question_title {
            color: grey;
            display: block;
            text-transform: none;
        }      
        div.output_scroll { 
            height: auto; 
        }
    """
    return "<style>" + style + "</style>"

def dm(x): display(Markdown(x))
def dh(x): display(HTML(layout_style() + x))

Next, we define some helper functions to visualize the tasks (and the questions). 

The function names are self-explanatory: `display_single_context`, `display_question_title`, `display_all_contexts`, `display_task_title` and `display_single_task`.

In [None]:
def display_single_context(context, start_index, end_index):
    
    before_answer = context[:start_index]
    answer = context[start_index:end_index]
    after_answer = context[end_index:]

    content = before_answer + "<span class='answer'>" + answer + "</span>" + after_answer

    return dh("""<div class="single_answer">{}</div>""".format(content))

def display_question_title(question):
    # Show the title as-is: str.capitalize() would lowercase the rest of the text
    return dh("<h2 class='question_title'>{}</h2>".format(question))


def display_all_contexts(index, question):
    
    def answer_not_found(context, start_index, end_index):
        return (start_index == 0 and len(context) == end_index) or (start_index == 0 and end_index == 0)

    display_question_title(str(index + 1) + ". " + question['question'].capitalize())
    
    # display context
    for i in question['results']:
        if answer_not_found(i['context'], i['start_index'], i['end_index']):
            continue # skip contexts where no answer was found
        display_single_context(i['context'], i['start_index'], i['end_index'])

def display_task_title(index, task):
    task_title = "Task " + str(index) + ": " + task
    return dh("<h1 class='task_title'>{}</h1>".format(task_title))

def display_single_task(index, task):
    
    display_task_title(index, task['task'])
    
    for i, question in enumerate(task['questions']):
        display_all_contexts(i, question)

Now, we can invoke the function `display_single_task` for all eight tasks and visualize the answers generated by the model. As you can see, some of the answers are more relevant and to the point than others. In the next update, I will describe the findings in more detail and try to further improve the results. If you notice something relevant in the answer spans, let me know in the comment box!

In [None]:
task = 1
display_single_task(task, all_tasks[task-1])

In [None]:
task = 2
display_single_task(task, all_tasks[task-1])

In [None]:
task = 3
display_single_task(task, all_tasks[task-1])

In [None]:
task = 4
display_single_task(task, all_tasks[task-1])

In [None]:
task = 5
display_single_task(task, all_tasks[task-1])

In [None]:
task = 6
display_single_task(task, all_tasks[task-1])

In [None]:
task = 7
display_single_task(task, all_tasks[task-1])

In [None]:
task = 8
display_single_task(task, all_tasks[task-1])

# 9. Export solutions

We save all the obtained answers in a _JSON_ file. This file might come in handy for further analysis, and right now it is used to visualize the same results in the [interactive interface](https://jbesomi.github.io/Korono/).

In [None]:
import json
with open("covid_kaggle_answer_from_qa.json", "w") as f:
    json.dump(all_answers, f)
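
For further analysis, the exported file can be read back and traversed task by task. The sketch below shows the idea on a small stand-in object that follows the same nesting (`data`, `questions`, `results`); the file name `toy_answers.json` and the sample record are made up for this example.

```python
import json
import os
import tempfile

# Small stand-in for `all_answers`, following the structure built above.
toy_answers = {"data": [{
    "task": "Toy task",
    "questions": [{
        "question": "What is coronavirus?",
        "results": [{"context": "COVID-19 is an infectious disease",
                     "answer": "infectious disease",
                     "start_index": 15, "end_index": 33}],
    }],
}]}

# Write the object to a temporary file, then load it back.
path = os.path.join(tempfile.mkdtemp(), "toy_answers.json")
with open(path, "w") as f:
    json.dump(toy_answers, f)

with open(path) as f:
    loaded = json.load(f)

# Walk tasks -> questions -> results and collect the answer spans.
spans = [r["context"][r["start_index"]:r["end_index"]]
         for task in loaded["data"]
         for q in task["questions"]
         for r in q["results"]]
print(spans)
```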

# 10. Conclusions

I hope you learned something along the way and had fun reading this notebook 👍

Like you, I'm here to learn: your feedback and opinions are what help me create better content. Please share your thoughts in the comment box.

Thank you 🤗

##### Future releases: text summarization

In [None]:
def get_summary(text):
    """
    Get summary
    """
    
    
    from transformers import BartTokenizer, BartForConditionalGeneration

    # 'facebook/bart-large-cnn' is the current Hub name of 'bart-large-cnn'
    BART_CNN = 'facebook/bart-large-cnn'
    tokenizer_summarize = BartTokenizer.from_pretrained(BART_CNN)
    model_summarize = BartForConditionalGeneration \
            .from_pretrained(BART_CNN).to(torch_device)
    model_summarize.eval()
    
    # Truncate inputs that exceed the maximum input length
    answers_input_ids = tokenizer_summarize.batch_encode_plus(
        [text], return_tensors='pt', max_length=1024, truncation=True
    )['input_ids']
    
    answers_input_ids = answers_input_ids.to(torch_device)
    
    summary_ids = model_summarize.generate(answers_input_ids,
                                           num_beams=4,
                                           max_length=5,
                                           early_stopping=True
                                          )
        
    return tokenizer_summarize.decode(summary_ids.squeeze(), 
                                      skip_special_tokens=True, 
                                      clean_up_tokenization_spaces=False)