# Topic Model Driven Neural Passage Retrieval Approach

## Task 5: **What has been published about medical care?**


## Note on the Notebook
This notebook relies on another submitted notebook ([covid-ixa](https://www.kaggle.com/aotegi/neural-question-answering-for-cord19-task1)), as both share the IR/QA components. With respect to the [covid-ixa](https://www.kaggle.com/aotegi/neural-question-answering-for-cord19-task1) notebook, the contribution of this notebook is the use of LDA models to select a subset of potentially relevant papers and to dynamically build an IR index for each task description.

## Objective
Our goal is to use neural textual Question Answering (QA) techniques to directly find specific answers to the scientific questions listed in the [COVID-19 Open Research Dataset Challenge](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).
For this purpose the system returns short replies, and only answers when the quality of both the retrieval and the answers is satisfactory. The user can access the whole paragraphs and documents if further details are needed. We adapted the scientific questions given in each task description to make them more amenable to current technology. The system is not tailored to specific questions, and can be readily used to answer any other question.


## Approach
We use the freely available [CORD-19 dataset](https://pages.semanticscholar.org/coronavirus-research), which contains metadata for over 51,000 scientific papers (full text is also available for around 40,000 of them) about COVID-19, SARS-CoV-2, and related coronaviruses.

The implemented system has three main components. The first component is an LDA-based **recommender** system, which helps to identify papers topically related to the task description and automatically discards those papers that are not useful for the researcher. The LDA model uses abstracts to learn a topic model of the [CORD-19 dataset](https://pages.semanticscholar.org/coronavirus-research). In this notebook we use a pre-computed LDA model with 170 topics. You can find how to build the LDA model used for the submission in the following notebook: [COVID-19 LDA Fitting](https://www.kaggle.com/oierldl/covid-19-lda-fitting).

The second component is an **information retrieval** (IR) system, based on the classical BM25F search algorithm. This system indexes not only the abstracts, but also the paragraphs of the full text of the papers.
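
To make the ranking idea concrete, here is a minimal, dependency-free sketch of plain BM25 scoring over a single text field. It is not the BM25F variant used by the notebook (BM25F additionally weights several fields, such as title and body), and the toy paragraphs, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with plain BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF (the "1 +" keeps it positive for very common terms).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalised through `b`.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy paragraphs (hypothetical, not taken from CORD-19).
paragraphs = [
    "incubation period of the virus in humans",
    "mechanical ventilation outcomes in elderly patients",
    "the incubation period ranged from two to fourteen days",
]
scores = bm25_scores("incubation period", paragraphs)
ranking = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Paragraphs that contain none of the query terms score zero, and among matching paragraphs the shorter one is preferred because of the length normalisation.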

The third component of the system is the **question answering** (QA) system, which automatically answers questions posed in natural language. The implemented system is based on neural network techniques. More specifically, we use the [SciBERT language representation model](https://arxiv.org/abs/1903.10676), a pretrained language model based on [BERT](https://arxiv.org/abs/1810.04805) but trained on a large corpus of scientific text, including text from the biomedical domain. BERT has shown successful results in many NLP tasks, such as QA. Following this approach, we fine-tuned SciBERT for QA using the [SQuAD2.0 dataset](https://arxiv.org/abs/1806.03822), a reading comprehension dataset widely used in the QA research community.

Note that the system is identical for all the tasks. Given a set of questions related to a task, it returns answers for those questions without any additional tuning.

## Modular Code
The notebook makes use of some custom Python libraries in order to improve the modularity of the code and simplify the information given in the notebook. To run the code, the following utility scripts must be added: `indexer`, `qa`, `recommender`, `match`, and `utils`. You can add utility scripts via `File -> Add utility script`. See Section [Install packages and load libraries](#libraries) for further details on how to correctly set up the notebook.

## Pros and cons


Positive aspects of the system:
* The system filters out papers unrelated to the task topic, and can therefore focus on the relevant information.
* The system tries to fight information overload: A) it returns specific answers to the questions; B) it returns answers only when relevant, trying to avoid low-quality answers.
* Given the questions, it is completely automatic and does not need any tuning. The better the quality of the questions, the better the answers the system provides.
* The system can easily be used to explore other tasks and information needs, as new questions directly return answers from the document collection.
* We experimented with different fine-tuning strategies according to two broad types of questions. 
* The system is complementary to labor-intensive information extraction techniques that try to find answers to specific tasks using hand-annotation or manually built rules.

Limitations and possible improvements (cons):
* We still need to test whether the topic model actually helps to filter out irrelevant papers.
* The interface could be richer, allowing more in-depth exploration in cases where the user would like to explore additional documents and answers. 
* Currently the system relies only on the information available in the metadata file and full texts of the CORD19 dataset. We have not used any external source or other related dataset.
* The speed can easily be improved. It is limited by the 5 GB of storage space available, which makes the system slow in getting abstracts and full documents. Producing larger and richer indices would speed up the system considerably.
* The system can be easily improved with a more sophisticated Information Retrieval module (see to-dos)
* The system can be easily improved by incorporating domain-specific annotated development data (see to-dos) and a continuous learning component to keep learning thanks to the feedback of some hand-selected expert users (see to-dos)
* We also plan to improve the system with a confidence measure for the answers: in the future we would like to combine the IR and QA scores into a unified measure that automatically assesses the quality of the answers (see to-dos).

Some of the limitations are software-engineering tasks which do not add to the technical and scientific part of our system. We will thus focus on the more challenging and hopefully more effective improvements listed in the following to-dos:
* We will test more sophisticated IR modules for paragraph retrieval (we plan to evaluate on the [TREC-COVID challenge](https://ir.nist.gov/covidSubmit/))
* We plan to collaborate with third parties to exploit domain-specific development data ([COVID-QA project](https://github.com/deepset-ai/COVID-QA/tree/master/data/question-answering)) 
* We plan to add a continuous learning component to keep learning thanks to the feedback of hand-selected expert users, using Human In The Loop strategies.
* We also plan to improve the system with a confidence measure in the answers.



## Sections
1. [Install packages and load libraries](#libraries)
2. [TM-IR/QA options](#options)
3. [Load info from metadata file](#files)
4. [Start the recommender system](#recommender)
5. [Define a function to parse task descriptions](#parse)
6. [Define questions for all the tasks](#questions)
7. [Select a subset of documents](#subset)
8. [Create an IR index and define retrieval function](#index)
9. [Question Answering system](#qa)
10. [Results of passage retrieval](#results)

------

## 1. Install packages and load libraries<a class="anchor" id="libraries"></a>

Although most of the libraries are already installed in the Kaggle Docker image, we need to install a few Python libraries to run the code properly. In addition, we create the output folder for storing the results of the IR/QA systems.

To run the code, the following utility scripts must be added: `indexer`, `qa`, `recommender`, `match`, and `utils`. You can add utility scripts via `File -> Add utility script`.

In [None]:
# Set-up: install scispacy and the en_core_sci_sm model
! pip install scispacy
! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz

# Loading the model by name ('en_core_sci_sm') did not work,
# so we download and unpack it manually
! mkdir /kaggle/working/scispacy-models
! wget https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz 
! mv en_core_sci_sm-0.2.4.tar.gz /kaggle/working/scispacy-models 
! tar xvfz /kaggle/working/scispacy-models/en_core_sci_sm-0.2.4.tar.gz -C /kaggle/working/scispacy-models

# Search engine library
!pip install Whoosh 

# Model main folder for the output:
! mkdir /kaggle/working/indexes

In [None]:
import numpy as np
import pandas as pd
import json
import spacy

from time import asctime
from collections import Counter

# object displaying in different formats
from IPython.display import display, HTML

# TM
from gensim.models import LdaModel, AuthorTopicModel, LdaMulticore
from gensim.corpora import MmCorpus

# IR
from whoosh.index import * # full-text indexing and searching

# BERTQA
import torch # optimized tensor library for deep learning using GPUs and CPUs
from transformers import BertTokenizer, BertForQuestionAnswering, BasicTokenizer # transformers: large-scale transformer models like BERT, and usage scripts for them
from transformers.data.metrics.squad_metrics import _get_best_indexes

# Custom libraries
from indexer import Indexer
from question_answering import QuestionAnswering
from recommender import Recommender
from match import Match

## 2. TM-IR/QA options<a class="anchor" id="options"></a>
To make it easier to control the main arguments needed to run the task, in this section we define the paths, models, and similar options. More concretely, we set the following:

- `task`: Task number (1-10). See [task definitions](#questions).
- `corpus_folder`: Path to the dataset 
- `metadata_file`: Path to the metadata file.
- `lda_model`: Path to LDA model file.
- `lda_mmcorpus`: Path to the serialized corpus used to train the LDA model. This is required to vectorize the input for the recommender system.
- `lda_type`: Topic Model type. In this notebook we set to "lda".
- `index_folder`: Path to index folder created by the IR system.
- `subset_size`: Number of papers selected for indexing.
- `qa_model`: Path to the pretrained QA model.
- `use_covid_synonyms`: Whether to add COVID-19 related terms to the task description (True/False).
- `only_covid`: Whether to retrieve only documents related to COVID-19 (True/False).
- `spacy_model`: Path to, or name of, the spaCy model.


In [None]:
arguments = {'task': 5, 
             'corpus_folder' : '/kaggle/input/CORD-19-research-challenge/',
             'metadata_file' : '/kaggle/input/CORD-19-research-challenge/metadata.csv',
             'lda_model': '/kaggle/input/covid19-abstract-lda-model/lda.topics-170.fr-5-50.all-abstracts.model/lda.topics-170.fr-5-50.all-abstracts.model',
             'lda_mmcorpus': '/kaggle/input/covid19-abstract-lda-model/lda.topics-170.fr-5-50.all-abstracts.model/corpus.mm',
             'lda_type': 'lda',
             'index_folder': '/kaggle/working/indexes/covid-19',
             'subset_size': 2000, 
             'qa_model' : '/kaggle/input/scibertqasquad/checkpoint-31500/',
             'use_covid_synonyms': True,
             'only_covid': True,
             'spacy_model': '/kaggle/working/scispacy-models/en_core_sci_sm-0.2.4/en_core_sci_sm/en_core_sci_sm-0.2.4'}

## 3. Load information in the metadata file<a class="anchor" id="files"></a>

The CORD-19 dataset includes a metadata file (CSV) of research papers related to coronavirus and COVID-19. In this section we first load the information in the metadata file into a dataframe object. As we are not interested in all the metadata, we select just some of the columns of the CSV file, such as title, publication time, abstract, or journal.

Note: this version of the notebook takes v7 of the dataset (from 2020-04-10).

CORD-19 v7 includes information on 51,078 papers, but some of them are repeated (they have the same *cord_uid*). Thus, we filter out the repeated ones.
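
The filtering applied in this section can be illustrated on a tiny in-memory CSV. This is only a sketch that uses the standard library rather than pandas; the rows and the helper name `dedupe_and_filter` are made up for illustration, and the sketch assumes that `publish_time` always starts with a four-digit year, which the real file does not guarantee. The column names `cord_uid`, `abstract`, and `publish_time` come from the metadata file.

```python
import csv
import io

# Made-up rows that mimic the relevant metadata columns.
SAMPLE_CSV = """cord_uid,title,abstract,publish_time
abc123,Paper A,Some abstract,2020-03-01
abc123,Paper A (duplicate),Some abstract,2020-03-01
def456,Paper B,,2019-11-20
ghi789,Paper C,Another abstract,1948-05-02
jkl012,Paper D,Third abstract,2015-01-10
"""


def dedupe_and_filter(csv_text, above_year=1950, below_year=2020):
    """Keep the first row per cord_uid, then drop rows without an abstract
    or with a publication year outside [above_year, below_year]."""
    seen = set()
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["cord_uid"] in seen:
            continue  # repeated cord_uid: keep only the first occurrence
        seen.add(row["cord_uid"])
        if not row["abstract"]:
            continue  # no abstract, nothing for the topic model to use
        year = int(row["publish_time"][:4])
        if above_year <= year <= below_year:
            kept.append(row)
    return kept


papers = dedupe_and_filter(SAMPLE_CSV)
print([p["cord_uid"] for p in papers])
```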

In [None]:
# Read the CSV file, apply some filtering, and select the fields of interest
def load_metadata(path_to_metadata, below_year=2020, above_year=1950):
    # Select the fields of interest from the metadata file
    fields = ['cord_uid','title', 'authors', 'publish_time', 'abstract', 'journal','url', 'has_pdf_parse',
              'has_pmc_xml_parse', 'pmcid', 'full_text_file', 'sha']
    # Load the selected fields from the metadata file into a dataframe
    df_mdata = pd.read_csv(path_to_metadata, skipinitialspace=True, index_col='cord_uid', usecols=fields)

    # WARNING: cord_uid is described as unique, but c4u0gxp5 is repeated, so we keep only the first occurrence
    df_mdata = df_mdata.loc[~df_mdata.index.duplicated(keep='first')]
    
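    # Parse publication dates (invalid ones become NaT) and derive the year
    df_mdata['publish_time'] = pd.to_datetime(df_mdata['publish_time'], errors="coerce")
    df_mdata['publish_year'] = df_mdata['publish_time'].dt.year
    # Keep only papers that have both an abstract and an author list
    df_mdata = df_mdata[df_mdata['abstract'].notna()]
    df_mdata = df_mdata[df_mdata['authors'].notna()]
    df_mdata['authors'] = df_mdata['authors'].apply(lambda row: str(row).split('; '))

    # Keep papers published within [above_year, below_year]; rows without a valid date are dropped
    relevant_time = df_mdata.publish_year.between(above_year, below_year)
    df_mdata = df_mdata[relevant_time]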

    return df_mdata

In [None]:
df_mdata = load_metadata(arguments['metadata_file'])
print("Number of papers loaded from metadata (after filtering out the repeated ones):", len(df_mdata))

## 4. Start the recommender system<a class="anchor" id="recommender"></a>

In this section we prepare the recommender system that ranks scientific papers according to their topic distribution. As mentioned above, the recommender uses an LDA model of the [CORD-19 dataset](https://pages.semanticscholar.org/coronavirus-research) to sort the papers according to their topical relatedness to the task description.

Once the description has been parsed and its topic distribution obtained, the recommender applies the Jensen-Shannon metric to compute the similarity between the task description and the abstract of each paper. For this purpose, texts are represented with the 170 topics induced by the LDA model (further details in [COVID-19 LDA Fitting](https://www.kaggle.com/oierldl/covid-19-lda-fitting)).

The recommender system is implemented in a utility script outside this notebook. You can check the details [in this notebook](https://www.kaggle.com/oierldl/recommender). As the recommender makes use of a precomputed LDA model and the serialized corpus used to estimate it, we first define some helper functions to load everything that is required.
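
As a rough illustration of the Jensen-Shannon comparison described in this section, the sketch below computes the Jensen-Shannon distance between small topic distributions and ranks papers by it. It is not the code of the `recommender` utility script: the three-topic vectors and paper identifiers are invented, and the base-2 logarithm (which bounds the distance to the range 0 to 1) is an assumption.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; 0 * log(0) counts as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    # Mixture distribution; it is non-zero wherever p or q is non-zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Invented topic distributions over three topics.
task_topics = [0.7, 0.2, 0.1]
paper_topics = {
    "paper_a": [0.6, 0.3, 0.1],
    "paper_b": [0.1, 0.1, 0.8],
}

# Smaller distance means the paper is topically closer to the task description.
ranked = sorted(
    paper_topics,
    key=lambda pid: jensen_shannon_distance(task_topics, paper_topics[pid]),
)
print(ranked)
```

Unlike the plain Kullback-Leibler divergence, this measure is symmetric and stays finite when one distribution assigns zero probability to a topic, which makes it convenient for comparing sparse topic vectors.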


In [None]:
def load_model(path_to_model, model_type):
    if model_type == 'lda':
        return LdaModel.load(path_to_model)
    elif model_type == 'lmc':
        return LdaMulticore.load(path_to_model)
    else:
        return AuthorTopicModel.load(path_to_model)

def load_docind(path_to_docind):
    with open(path_to_docind, 'r', encoding='utf-8') as f:
        docind = [line.rstrip('\n') for line in f]
    return docind

Next, we load the LDA model, generate the word index needed to parse the description, and load the serialized corpus.

In [None]:
# load lda
print('[INFO - {}] Loading model ({})'.format(asctime(), arguments['lda_type']))
tm_model = load_model(arguments['lda_model'], arguments['lda_type'])

# generate word index
word2id = {w: i for i, w in tm_model.id2word.items()}

# load metadata corpus
print('[INFO - {}] Loading serialized corpus'.format(asctime()))
corpus = MmCorpus(arguments['lda_mmcorpus'])
docind = load_docind(arguments['lda_mmcorpus'] + '.docind') # is this needed?

# load recommender to compute similarity
print('[INFO - {}] Starting the recommender'.format(asctime()))
recommender = Recommender(tm_model, df_mdata, corpus, arguments['lda_type'])
print('[INFO - {}] Recommender ready to use'.format(asctime()))

## 5. Define a function to parse task descriptions<a class="anchor" id="parse"></a>
In this section we define the functions that extract the most representative terms of the description and, from those terms, obtain the topic distribution that summarizes the description. We define two functions:

- `parse_description()`: It takes a description as input and returns a counter of terms. The function first applies a NER model (scispacy) to recognize multiword entities, which we treat as single tokens. After that, the function lemmatizes the tokens, discarding punctuation marks, stopwords, and numbers.

- `max_pooling()`: It takes a set of terms and their topic distributions, and returns a single topic distribution that summarizes all the input distributions. The function applies max-pooling over the topics, selecting the highest value for each topic, and then normalizes the result.

Some descriptions can be generic and describe questions that do not apply only to COVID-19. Therefore, we decided to add a set of synonyms and related terms for COVID-19 in order to refine the topic distribution of the task description.
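
A small worked example may help to see what the pooling step does. The numbers are invented, and plain lists stand in for the NumPy arrays used in the notebook: each row is the topic distribution of one term, the column-wise maximum keeps the strongest evidence for each topic, and the result is renormalised so that it sums to one.

```python
# Invented per-term topic weights: 3 terms x 4 topics.
term_topics = [
    [0.10, 0.00, 0.30, 0.00],  # e.g. "incubation"
    [0.05, 0.20, 0.00, 0.00],  # e.g. "transmission"
    [0.00, 0.10, 0.10, 0.00],  # e.g. "virus"
]

# Column-wise maximum: the highest weight any term assigns to each topic.
pooled = [max(column) for column in zip(*term_topics)]

# Renormalise so the pooled vector is again a probability distribution.
total = sum(pooled)
distribution = [value / total for value in pooled]

print(pooled)
print([round(value, 3) for value in distribution])
```

A topic that no term supports (the last column here) stays at zero, while a topic needs only one strongly associated term to receive a high weight.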

In [None]:
def parse_description(desc, model_name='en_core_sci_sm'):
    nlp = spacy.load(model_name)

    # extract all entities in corpus:
    doc = nlp(desc)
    matcher = Match()
    entvocab = set([entity.text for entity in doc.ents if len(entity.text.split(' ')) > 1])
    matcher.matchinit_from_list(entvocab)

    # mark multiword expressions
    desc = matcher.match(desc)
    doc = nlp(desc)

    # create a counter of identified terms
    terms = Counter()
    for term in doc:
        if not term.is_stop and not term.is_punct and not term.like_num:
            terms[term.lemma_] += 1
    return terms

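def max_pooling(terms, word2id, model):
    # One row per term, one column per topic
    nrows = len(terms)
    ncolmns = model.num_topics
    topicmatrix = np.zeros([nrows, ncolmns])
    for i, term in enumerate(terms.keys()):
        if term in word2id:
            for topic_id, prob in model.get_term_topics(word2id[term], minimum_probability=0.0):
                topicmatrix[i, topic_id] = prob
    # Keep the highest value of each topic across all terms
    max_pool = np.max(topicmatrix, 0)
    # Normalize to a distribution (skipped if no term was found in the vocabulary)
    total = np.sum(max_pool)
    if total > 0:
        max_pool = max_pool / total
    return max_pool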

In the following code snippet we show the result of parsing a short description as an example.

In [None]:
print('[INFO - {}] Parsing task description'.format(asctime()))
desc = "What is known about transmission, incubation, and environmental stability? " + \
       "Range of incubation periods for the disease in humans " + \
       "Tools to monitor phenotypic change and potential adaptation of the virus"

synonyms = ['coronavirus 2019', 'coronavirus disease 19', 'cov2', 'cov-2', 'covid', 'ncov 2019', '2019ncov',
            '2019-ncov', '2019 ncov', 'novel coronavirus', 'sarscov2', 'sars-cov-2', 'sars cov 2',
            'severe acute respiratory syndrome coronavirus 2', 'wuhan coronavirus', 'wuhan pneumonia',
            'wuhan virus']
if arguments['use_covid_synonyms']:
    desc = desc + " " + " ".join(synonyms)
terms = parse_description(desc, arguments['spacy_model'])
print(terms.most_common(10))
print(max_pooling(terms, word2id, tm_model))

## 6. Define questions for all the tasks<a class="anchor" id="questions"></a>
In this section we define the questions that will be used as input for the TM-IR/QA system.
As some of the subquestions defined by the organizers for each task are too complex for the QA system, we refined them.
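
The tasks are stored as a list of dictionaries, each with a `task` title and a list of `questions`. As a sketch of how such a structure can be consumed, the snippet below picks the questions for a 1-based task number like the one in `arguments['task']`. The helper `questions_for_task` and the shortened two-task list are hypothetical and only illustrate the lookup; the notebook may select the questions differently.

```python
def questions_for_task(tasks, task_number):
    """Return the title and questions of the task with the given 1-based number."""
    if not 1 <= task_number <= len(tasks):
        raise ValueError("task number must be between 1 and {}".format(len(tasks)))
    entry = tasks[task_number - 1]
    return entry['task'], entry['questions']


# Shortened stand-in for the full task list (hypothetical subset).
example_tasks = [
    {
        'task': "Task 1 - What is known about transmission, incubation, and environmental stability?",
        'questions': ["Seasonality of transmission", "Viral shedding"],
    },
    {
        'task': "Task 2 - What do we know about COVID-19 risk factors?",
        'questions': ["Which are the main risk factors?"],
    },
]

title, questions = questions_for_task(example_tasks, 2)
print(title)
for question in questions:
    print(" -", question)
```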

In [None]:
tasks = [
    {
        'task': "Task 1 - What is known about transmission, incubation, and environmental stability?",
        'questions': [
            "Range of incubation periods for the disease in humans",
            "Range of incubation periods for the disease in humans depending on age",
            "Range of incubation periods for the disease in humans depending on health status",
            "How long are individuals contagious?",
            "Prevalence of asymptomatic shedding and transmission",
            "Prevalence of asymptomatic shedding and transmission in children",
            "Seasonality of transmission",
            "Charge distribution",
            "Adhesion to hydrophilic/phobic surfaces",
            "Environmental survival to inform decontamination efforts for affected areas",
            "Viral shedding",
            "Persistence and stability on nasal discharge",
            "Persistence and stability on sputum",
            "Persistence and stability on urine",
            "Persistence and stability on fecal matter",
            "Persistence and stability on blood",
            "Persistence of virus on surfaces of different materials",
            "Persistence of virus on copper",
            "Persistence of virus on stainless steel",
            "Persistence of virus on plastic",
            "Natural history of the virus",
            "Shedding the virus from an infected person",
            "Implementation of diagnostics to improve clinical processes",
            "Implementation of products to improve clinical processes",
            "Disease models, including animal models for infection, disease and transmission",
            "Tools to monitor phenotypic change and potential adaptation of the virus",
            "Studies to monitor phenotypic change and potential adaptation of the virus",
            "Immune response and immunity",
            "Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings",
            "Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings",
            "Role of the environment in transmission"
         ]
    },
    {
        'task': "Task 2 - What do we know about COVID-19 risk factors?",
        'questions': [
            "Which are the main risk factors?",
            "Does smoking increase risk for COVID-19?",
            "Is a pre-existing pulmonary disease a risk factor for COVID-19?",
            "Do co-infections increase risk for COVID-19?",
            "Does a respiratory or viral infection increase risk for COVID-19?",
            "Are neonates at increased risk of COVID-19?",
            "Are pregnant women at increased risk of COVID-19?",
            "Is there any socio-economic factor associated with increased risk for COVID-19?",
            "Is there any behavioral factor associated with increased risk for COVID-19?",
            "What is the basic reproductive number?",
            "What is the incubation period?",
            "What are the modes of transmission?",
            "What are the environmental factors?",
            "Risk of fatality among symptomatic hospitalized patients",
            "Risk of fatality among high-risk patient groups",
            "Susceptibility of populations",
            "Public health mitigation measures that could be effective for control"
        ]
    },
    {
        'task': "Task 3 - What do we know about virus genetics, origin, and evolution?",
        'questions': [
            "Real-time tracking of whole genomes to inform the development of diagnostics",
            "Real-time tracking of whole genomes to inform the development of therapeutics",
            "Real-time tracking of whole genomes to track variations of the virus over time",
            "Mechanism for coordinating the rapid dissemination of whole genomes to inform the development of diagnostics",
            "Mechanism for coordinating the rapid dissemination of whole genomes to inform the development of therapeutics",
            "Mechanism for coordinating the rapid dissemination of whole genomes to track variations of the virus over time",
            "Which geographic and temporal diverse sample sets are accessed to understand geographic differences?",
            "Which geographic and temporal diverse sample sets are accessed to understand genomic differences?",
            "Is there more than one strain in circulation?",
            "Is any multi-lateral agreement leveraged such as the Nagoya Protocol?",
            "Is there evidence that livestock could be infected and serve as a reservoir after the epidemic appears to be over?",
            "Has there been any field surveillance to show that livestock could be infected?",
            "Has there been any genetic sequencing to show that livestock could be infected?",
            "Has there been any receptor binding to show that livestock could be infected?",
            "Is there evidence that farmers are infected?",
            "Is there evidence that farmers could have played a role in the origin?",
            "What are the results of the surveillance of mixed wildlife-livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia?",
            "What are the results of the experimental infections to test host range for this pathogen?",
            "Which are the animal hosts?",
            "Is there evidence of continued spill-over to humans from animals?",
            "Which are the socioeconomic and behavioral risk factors for the spill-over to humans from animals?",
            "Sustainable risk reduction strategies"
        ]
    },
    {
        'task': "Task 4 - What do we know about vaccines and therapeutics?",
        'questions': [
            "What is known about the effectiveness of drugs being developed to treat COVID-19 patients?",
            "What is known about the effectiveness of drugs tried to treat COVID-19 patients?",
            "Show results of clinical and bench trials to investigate less common viral inhibitors against COVID-19",
            "Show results of clinical and bench trials to investigate naproxen against COVID-19",
            "Show results of clinical and bench trials to investigate clarithromycin against COVID-19",
            "Show results of clinical and bench trials to investigate minocycline against COVID-19",
            "Which are the methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients?",
            "What is known about the use of best animal models and their predictive value for a human vaccine?",
            "Capabilities to discover a therapeutic for the disease",
            "Clinical effectiveness studies to discover therapeutics, to include antiviral agents",
            "Which are the models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics?",
            "Efforts targeted at a universal coronavirus vaccine",
            "Efforts to develop animal models and standardize challenge studies",
            "Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers",
            "Approaches to evaluate risk for enhanced disease after vaccination",
            "Assays to evaluate vaccine immune response",
            "Process development for vaccines, alongside suitable animal models"
        ]
    },
    {
        'task': "Task 5 - What has been published about medical care?",
        'questions': [
            "Resources to support skilled nursing facilities",
            "Resources to support long term care facilities",
            "Mobilization of surge medical staff to address shortages in overwhelmed communities",
            "Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS)",
            "Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) for viral etiologies",
            "What are the outcomes of Extracorporeal membrane oxygenation (ECMO) of COVID-19 patients?",
            "What are the outcomes for COVID-19 after mechanical ventilation adjusted for age?",
            "What is known of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19?",
            "What is known of the frequency, manifestations, and course of cardiomyopathy?",
            "What is known of the frequency, manifestations, and course of cardiac arrest?",
            "Application of regulatory standards (e.g., EUA, CLIA)",
            "Ability to adapt care to crisis standards of care level",
            "Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks",
            "Which are the best telemedicine practices?",
            "Which are the facilitators to expand the telemedicine practices?",
            "Which are the specific actions to expand the telemedicine practices?",
            "Guidance on the simple things people can do at home to take care of sick people and manage disease",
            "Which are the oral medications that might potentially work?",
            "Use of artificial intelligence in real-time health care delivery to evaluate interventions",
            "Use of artificial intelligence in real-time health care delivery to evaluate risk factors",
            "Use of artificial intelligence in real-time health care delivery to evaluate outcomes",
            "Which are the challenges, solutions and technologies in hospital flow and organization?",
            "Which are the challenges, solutions and technologies in workforce protection?",
            "Which are the challenges, solutions and technologies in workforce allocation?",
            "Which are the challenges, solutions and technologies in community-based support resources?",
            "Which are the challenges, solutions and technologies in payment?",
            "Which are the challenges, solutions and technologies in supply chain management to enhance capacity, efficiency, and outcomes?",
            "Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials",
            "What has been done to develop a core clinical outcome set to maximize usability of data across a range of trials?",
            "Can adjunctive or supportive intervention (e.g. steroids, high flow oxygen)  improve the clinical outcomes of infected patients?"
        ]
    },
    {
        'task': "Task 6 - What do we know about non-pharmaceutical interventions?",
        'questions': [
            "Which is the best way to scale up NPIs in a more coordinated way to give us time to enhance our health care delivery system capacity to respond to an increase in cases?",
            "Which is the best way to mobilize resources to geographic areas where critical shortfalls are identified?",
            "Rapid design and execution of experiments to examine and compare NPIs currently being implemented",
            "What is known about the efficacy of school closures?",
            "What is known about the efficacy of travel bans?",
            "What is known about the efficacy of bans on mass gatherings?",
            "What is known about the efficacy of social distancing approaches?",
            "Which are the methods to control the spread in communities?",
            "Models of potential interventions to predict costs and benefits depending on race",
            "Models of potential interventions to predict costs and benefits depending on income",
            "Models of potential interventions to predict costs and benefits depending on disability",
            "Models of potential interventions to predict costs and benefits depending on age",
            "Models of potential interventions to predict costs and benefits depending on geographic location",
            "Models of potential interventions to predict costs and benefits depending on immigration status",
            "Models of potential interventions to predict costs and benefits depending on housing status",
            "Models of potential interventions to predict costs and benefits depending on employment status",
            "Models of potential interventions to predict costs and benefits depending on health insurance status",
            "Policy changes necessary to enable the compliance of individuals with limited resources and the underserved with NPIs",
            "Why do people fail to comply with public health advice?",
            "Which is the economic impact of any pandemic?",
            "How can we mitigate risks to critical government services in a pandemic?",
            "Alternatives for food distribution and supplies in a pandemic",
            "Alternatives for household supplies in a pandemic",
            "Alternatives for health diagnoses, treatment, and needed care in a pandemic"
        ]
    },
    {
        'task': "Task 7 - What do we know about diagnostics and surveillance?",
        'questions': [
            "Which are the sampling methods to determine asymptomatic disease?",
            "What can we do for early detection of disease?",
            "Is the use of screening of neutralizing antibodies such as ELISAs valid for early detection of disease?",
            "Which are the existing diagnostic platforms?",
            "Which are the existing surveillance platforms?",
            "Recruitment, support, and coordination of local expertise and capacity",
            "How states might leverage universities and private laboratories for testing purposes?",
            "Which are the best ways for communications to public health officials and the public?",
            "What is the speed, accessibility, and accuracy of a point-of-care test?",
            "What is the speed, accessibility, and accuracy of rapid bed-side tests?",
            "Rapid design and execution of targeted surveillance experiments calling for all potential testers using PCR in a defined area to start testing and report to a specific entity",
            "Separation of assay development issues from instruments",
            "Which is the role of the private sector to help quickly migrate assays?",
            "What has been done to track the evolution of the virus?",
            "Latency issues and when there is sufficient viral load to detect the pathogen",
            "What is needed in terms of biological and environmental sampling?",
            "Use of diagnostics such as host response markers (e.g., cytokines) to detect early disease or predict severe disease progression",
            "Policies and protocols for screening and testing",
            "Policies to mitigate the effects on supplies associated with mass testing, including swabs and reagents",
            "Technology roadmap for diagnostics",
            "Which are the barriers to developing and scaling up new diagnostic tests?",
            "How future coalition and accelerator models could provide critical funding for diagnostics?",
            "How future coalition and accelerator models could provide critical funding for opportunities for a streamlined regulatory environment?",
            "New platforms and technology (CRISPR) to improve response times",
            "New platforms and technology to employ more holistic approaches",
            "Coupling genomics and diagnostic testing on a large scale",
            "What is needed for rapid sequencing and bioinformatics to target regions of the genome that will allow specificity for a particular variant?",
            "What is needed for sequencing with advanced analytics for unknown pathogens?",
            "What is needed for distinguishing naturally-occurring pathogens from intentional?",
            "What is known about One Health surveillance of humans and potential sources of future spillover or ongoing exposure for this organism and future pathogens?"
        ]
    },
    {
        'task': "Task 8 - Help us understand how geography affects virality",
        'questions': [
            "Are there geographic variations in the rate of COVID-19 spread?",
            "Are there geographic variations in the mortality rate of COVID-19?",
            "Is there any evidence to suggest geographic based virus mutations?"
        ]
    },
    {
        'task': "Task 9 - What has been published about ethical and social science considerations?",
        'questions': [
            "Articulate and translate existing ethical principles and standards to salient issues in COVID-2019",
            "Embed ethics across all thematic areas, engage with novel ethical issues that arise and coordinate to minimize duplication of oversight",
            "Support sustained education, access, and capacity building in the area of ethics",
            "Establish a team at WHO that will be integrated within multidisciplinary research and operational platforms and that will connect with existing and expanded global networks of social sciences",
            "Develop qualitative assessment frameworks to systematically collect information related to local barriers and enablers for the uptake and adherence to public health measures for prevention and control",
            "How the burden of responding to the outbreak and implementing public health measures affects the physical and psychological health of those providing care for Covid-19 patients?",
            "Identify the underlying drivers of fear, anxiety and stigma that fuel misinformation and rumor, particularly through social media"
        ]
    },
    {
        'task': "Task 10 - What has been published about information sharing and inter-sectoral collaboration?",
        'questions': [
            "Which are the methods for coordinating data-gathering with standardized nomenclature?",
            "Sharing response information among planners, providers, and others",
            "Understanding and mitigating barriers to information-sharing",
            "How to recruit, support, and coordinate local expertise and capacity relevant to public health emergency response?",
            "Integration of federal/state/local public health surveillance systems",
            "Value of investments in baseline public health response infrastructure preparedness",
            "Modes of communicating with target high-risk populations (elderly, health care workers)",
            "Risk communication and guidelines that are easy to understand and follow",
            "Communication that indicates potential risk of disease to all population groups",
            "Misunderstanding around containment and mitigation",
            "Action plan to mitigate gaps and problems of inequity in the Nation’s public health capability, capacity, and funding to ensure all citizens in need are supported and can access information, surveillance, and treatment",
            "Measures to reach marginalized and disadvantaged populations",
            "Data systems and research priorities and agendas incorporate attention to the needs and circumstances of disadvantaged populations and underrepresented minorities",
            "Mitigating threats to incarcerated people from COVID-19, assuring access to information, prevention, diagnosis, and treatment",
            "Understanding coverage policies (barriers and opportunities) related to testing, treatment, and care"
        ]
    }
]

## 7. Select a subset of documents<a class="anchor" id="subset"></a>

In this section we carry out the first step of the process. Given a task, we rank the papers with the recommender system and select the most relevant ones as the subset to be indexed. We take the 2000 papers most similar to the task description as the subset.

Note that the task number is defined in `arguments` ([go to section to change task number](#options)). In our case, the task description is the title of the task joined with the set of questions defined for it.

In [None]:
print("[{}] Task number selected: {}".format(asctime(), arguments['task']))
task = tasks[arguments['task'] - 1] 
print(task['task'])

# join task title and question as description
print('[INFO - {}] Parsing task description'.format(asctime()))
desc = task['task'] + " " + " ".join(task['questions'])
synonyms = ['coronavirus 2019', 'coronavirus disease 19', 'cov2', 'cov-2', 'covid', 'ncov 2019', '2019ncov',
            '2019-ncov', '2019 ncov', 'novel coronavirus', 'sarscov2', 'sars-cov-2', 'sars cov 2',
            'severe acute respiratory syndrome coronavirus 2', 'wuhan coronavirus', 'wuhan pneumonia',
            'wuhan virus']
if arguments['use_covid_synonyms']:
    desc = desc + " " + " ".join(synonyms)

terms = parse_description(desc)
task_topic_dist = max_pooling(terms, word2id, tm_model)

print('[INFO - {}] Ranking documents and subsetting'.format(asctime()))
similars, distances = recommender.k_nearest_docs(arguments['subset_size'], task_topic_dist, recommender.corpus_topic_dist, 
                                                 only_covid=arguments['only_covid'])
subset = df_mdata.loc[similars].copy()
subset['similarity'] = 1 - distances

We can inspect the subset with the following code.

In [None]:
subset[['title', 'authors', 'journal', 'similarity']].head(10)
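
The ranking step above can be pictured as a nearest-neighbour search over topic distributions. The following is only a minimal sketch: the distance actually used inside `recommender.k_nearest_docs` is not shown in this notebook, so the Jensen-Shannon distance, the helper names and the toy values below are all assumptions made for illustration.

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two discrete distributions, in [0, 1]."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero numerator contribute nothing
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))

def k_nearest(k, query_dist, corpus_dists):
    """Return (doc_ids, distances) of the k documents closest to query_dist."""
    scored = sorted((jensen_shannon_distance(query_dist, dist), doc_id)
                    for doc_id, dist in corpus_dists.items())
    top = scored[:k]
    return [doc_id for _, doc_id in top], [dist for dist, _ in top]

# Toy corpus with three topics (hypothetical values)
corpus = {
    "paper_a": [0.8, 0.1, 0.1],
    "paper_b": [0.1, 0.8, 0.1],
    "paper_c": [0.7, 0.2, 0.1],
}
task_dist = [0.75, 0.15, 0.10]

ids, dists = k_nearest(2, task_dist, corpus)
# Same convention as the notebook: similarity = 1 - distance
similarities = [1 - d for d in dists]
```

With a distance bounded in [0, 1], `1 - distance` can be read directly as a similarity score, which matches how the `similarity` column is computed above.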

## 8. Create an IR index and define retrieval function<a class="anchor" id="index"></a>

The second component of our approach is the information retrieval (IR) system. An information retrieval system is a tool that searches a collection of documents for those that are relevant to an information need. This system has two main modules: (1) the indexing system and (2) the query system. Both modules are implemented in a utility script that contains the [`Indexer`](https://www.kaggle.com/oierldl/indexer) Python class.

The first module is in charge of creating the primary data structure of the system, which is the index. The second module is the one users interact with: they submit a query based on their information need, and the module uses the index to retrieve documents for that query. In this section we create an index on the subset selected by the recommender system. [`Indexer`](https://www.kaggle.com/oierldl/indexer) also contains the query module. For the implementation of these modules we used the [Whoosh library](https://pypi.org/project/Whoosh/), which contains functions for indexing text and then searching the index.

The index is a data structure that makes it possible to search for information in a document collection very efficiently. In short, it lists, for every word, all documents that contain it. We index the papers in the subset, using not only the abstracts that are in the metadata file but also the full text provided in PMC or PDF JSON format. As shorter documents work better for the answering system that we will develop later, we do not index the whole text of a paper as a single unit. Instead, the indexing unit is an abstract or each of the paragraphs of the full text (as marked in the JSON files).
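
To make the idea of "for every word, all documents that contain it" concrete, here is a minimal inverted index in plain Python. It is a toy sketch, not how Whoosh stores its index: there is no ranking, no stemming and no on-disk storage, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of the documents that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Two toy indexing units (hypothetical ids and texts)
docs = {
    "p1##abs": "school closures reduce transmission",
    "p2##abs": "travel bans delay transmission",
}
index = build_inverted_index(docs)
```

Looking up a word is a single dictionary access, which is why searching the index is much faster than scanning every paragraph for every query.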

In order to create an index, we must define its schema, which is done in the [`Indexer`](https://www.kaggle.com/oierldl/indexer). The schema lists the fields in the index. A field is a piece of information for each document in the index, for example, id, path of the document, title and text. We define the type of these last two fields as “TEXT”, which means that they will be searchable. As is common practice, we also apply the Stemming Analyzer to these text fields. With this analyzer the text is tokenized, all the tokens are converted to lowercase, a stopword filter removes words that are too common, and finally a stemming algorithm is applied.
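
The four analysis steps just described can be illustrated with a toy pipeline. This is only a sketch under loud assumptions: the stopword list is a tiny made-up sample and `toy_stem` is a crude suffix stripper, not the stemming algorithm that Whoosh's `StemmingAnalyzer` applies.

```python
import re

# Tiny illustrative stopword list (assumption; real lists are much longer)
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "are", "for"}
# Crude stand-in for a real stemmer: strip at most one of these suffixes
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, lowercase, drop stopwords, then stem the remaining tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(token) for token in tokens if token not in STOPWORDS]

terms = analyze("The Testing of infected patients")
```

Because the same analysis is applied to both the indexed text and the query, a query term such as "tests" can match a paragraph that only contains "testing".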

If you check the constructor of the [`Indexer`](https://www.kaggle.com/oierldl/indexer), you will find the following code, which defines the schema used to create the index.
```
# Schema definition:
# - id: type ID, unique, stored; cord_uid + "##abs" for abstract, and "##pmc-N" or "##pdf-N" for paragraphs in body text (Nth paragraph)
# - path: type ID, stored; path to the JSON file (only for papers with full text)
# - title: type TEXT processed by StemmingAnalyzer; not stored; title of the paper
# - text: type TEXT processed by StemmingAnalyzer; not stored; content of the abstract section or the paragraph

schema = Schema(id = ID(stored=True,unique=True),
                path = ID(stored=True),
                title = TEXT(analyzer=analysis.StemmingAnalyzer()),
                text = TEXT(analyzer=analysis.StemmingAnalyzer())
               )
```

The module that creates the index is implemented by `Indexer.create_index()`, while the query module (used in the following sections) is implemented by `Indexer.retrieve_documents()`. Please check the [covid-ixa](https://www.kaggle.com/enekoagirre/covid-ixa) notebook, which shares the IR/QA components, for further details on how to create an index with Whoosh.

We index the papers related to the task description, using not only the abstracts that are in the metadata file but also the full text provided in PMC or PDF JSON format. As shorter documents work better for the answering system, we index the paragraphs of the documents (as marked in the JSON files), and the abstract when full text is not available.
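
As a rough illustration of how a paper could be split into indexing units, the sketch below builds ids that follow the scheme in the schema comment (`##abs`, `##pmc-N`, `##pdf-N`). The function name, its arguments, the 0-based paragraph numbering and the choice to fall back to the abstract only when there is no full text are assumptions; the real logic lives in `Indexer.create_index()`.

```python
def iter_index_units(cord_uid, abstract, body_paragraphs, source="pmc"):
    """Yield (unit_id, text) pairs, one per indexing unit of a paper.

    body_paragraphs is a list of paragraph strings taken from the full-text
    JSON file; it is empty when no full text is available.
    source is "pmc" or "pdf", depending on which JSON file was parsed.
    """
    if body_paragraphs:
        # One indexing unit per paragraph of the full text
        for n, paragraph in enumerate(body_paragraphs):
            yield "{}##{}-{}".format(cord_uid, source, n), paragraph
    elif abstract:
        # No full text: fall back to the abstract as the only unit
        yield "{}##abs".format(cord_uid), abstract

# Hypothetical paper with two full-text paragraphs
units = list(iter_index_units("abc123", "An abstract.", ["First paragraph.", "Second paragraph."]))
```

Keeping the `cord_uid` as a prefix of every unit id makes it easy to map a retrieved paragraph back to the metadata of its paper.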

Indexing could take several minutes.

In [None]:
print('[INFO - {}] Indexing sub-corpus'.format(asctime()))
indexer = Indexer(subset, arguments['index_folder'], arguments['corpus_folder'])
indexer.create_index()
print('[INFO - {}] Index calculated'.format(asctime()))

## 9. Question Answering system<a class="anchor" id="qa"></a>

The third main component of the system is the QA system. Given a question in natural language and a paragraph, this system returns the answer to the question found in the paragraph, or “I don’t know” otherwise. Our implementation of this system is based on neural network techniques. The implementation details are given below.

In this section we load the question-answering class ([QuestionAnswering](https://www.kaggle.com/oierldl/question-answering)), which contains a function (`QuestionAnswering.extract_answers()`) that, given a question, a dataframe with the relevant paragraphs (returned by the `indexer.retrieve_documents()` function), the maximum number of answers to extract and the maximum length of an answer, extracts specific answers from all the relevant paragraphs.

This function returns the dataframe with relevant paragraphs, but with additional data. The best answers are added for each paragraph, specifying the answer itself (text), the score, and the start and end index that define the position of the answer in the paragraph.

For the implementation of these functions we took the [SciBERT language representation model](https://arxiv.org/abs/1903.10676) and fine-tuned it for QA using the [SQuAD2.0](https://arxiv.org/abs/1806.03822) and [QuAC](https://arxiv.org/abs/1808.07036) datasets. We performed this fine-tuning externally and made the resulting [model publicly available in Kaggle](https://www.kaggle.com/jonander95/bertsquadquac), so we only need to load it here.

Following the usual reading comprehension method, we use BERT as a pointer network. This kind of network selects an answer start and end index given a question and a context. In order to extract the correct answer span, we take the highest-probability pairs of start and end indexes. As the input length of the BERT model is fixed, we use a sliding window approach for sequences that are longer than 384 subtokens.
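
The span selection and the sliding window described above can be sketched as follows. This is a simplified illustration, not the code of the `QuestionAnswering` class: the function names are hypothetical, scores are plain per-token numbers rather than model outputs, the stride of 128 is an assumed value, and the window ignores the question and special tokens that also occupy part of the real 384-subtoken input.

```python
def best_spans(start_scores, end_scores, max_answer_length, n_best=5):
    """Return the n_best (score, start, end) spans with start <= end.

    A span's score is the sum of its start score and its end score, and a
    span covers at most max_answer_length tokens.
    """
    candidates = []
    for start, start_score in enumerate(start_scores):
        last = min(len(end_scores), start + max_answer_length)
        for end in range(start, last):
            candidates.append((start_score + end_scores[end], start, end))
    candidates.sort(key=lambda candidate: candidate[0], reverse=True)
    return candidates[:n_best]

def sliding_windows(n_tokens, window=384, stride=128):
    """Return overlapping (start, end) token ranges that cover the sequence."""
    windows = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return windows

# Toy scores for a four-token context
start_scores = [0.1, 2.0, 0.3, 0.2]
end_scores = [0.0, 0.5, 1.5, 0.1]
top_span = best_spans(start_scores, end_scores, max_answer_length=3, n_best=1)[0]
```

Overlapping windows mean that an answer cut in half by one window boundary can still appear whole in the next window.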

For further details of the QA component, please visit the [covid-ixa](https://www.kaggle.com/enekoagirre/covid-ixa) notebook.

In [None]:
print('[INFO - {}] Loading pretrained BERT QA'.format(asctime()))
qa_model = QuestionAnswering(arguments['qa_model'])

## 10. Results of the passage retrieval<a class="anchor" id="results"></a>
In this last section, we want to show the results for the task. For that purpose, we will run the above functions to first retrieve relevant paragraphs from the papers, and then extract specific answers from them.

We set the maximum number of paragraphs that the IR system returns to 20, but we discard paragraphs for which the QA system returns “I don’t know”. Moreover, we decided not to show any results for questions where at least 85% of the paragraphs (17 out of 20) receive an “I don’t know” answer. For the rest of the questions, we show the best answer string for each of the best five paragraphs, that is, five specific answers per question. Additionally, next to each answer, we show some extra information: the title of the paper from which the answer was extracted (with a link to access the online version on the web), the journal and the date of publication. Under the answer we show the paragraph from which the answer was extracted. In this paragraph the best 5 answers are highlighted using different shades of color (the darker the shade, the better the answer).

In [None]:
# Creates the HTML code to show all the answers colored gradually in the paragraph
def color_snippet(text,marks):
    # Set colors for answers
    colors = ['#ffebcc','#ffd699', '#ffc266', '#ffad33','#ff9900']
    
    # Create HTML code to show the colored paragraph
    html = '<blockquote>'
    current_mark = 0
    for i,mark in enumerate(marks):
        if current_mark != mark:
            if current_mark != 0:
                html += '</span>'
            if mark > 0:
                html += '<span style="background-color: {}">'.format(colors[mark-1])
            current_mark = mark
        html += text[i]
    if current_mark != 0:
        html += '</span>'
    html += '</blockquote>' 
    return html


# Set number of this task
ntask = arguments['task']

# Show title of the task
task_title = tasks[ntask-1]['task']
html = "<p><h2>" + task_title + "</h2></p><br>"

# Set input parameters of the functions above
# Maximum number of documents to retrieve
max_n_docs = 20
# Maximum number of answers to extract
max_n_answers = 5
# Maximum answer length
max_answer_length = 30
# Number of "Cannotanswer" paragraphs (out of max_n_docs) from which a question is declared unanswerable
threshold = 17

# Iterate over all the questions in a task and call the functions above
for nq,question in enumerate(tasks[ntask-1]['questions']):
    # Call the function to retrieve relevant paragraphs of papers
    df_ir_results = indexer.retrieve_documents(question, topn=max_n_docs)
    # Call the function to extract answers from paragraphs
    df_qa_results = qa_model.extract_answers(question, df_ir_results, max_n_answers, max_answer_length)

    # Show the question
    html += '<br><p><font color="#C28A08"><h3>{}</h3></font>'.format(question)
    
    # Count how many null ("Cannotanswer" or empty) answers are extracted for a question
    n_cannotanswer = 0
    for ind in df_qa_results.index:
        answer = df_qa_results['qa_answers'][ind][0] 
        #Take SQuAD and QuAC cases into account
        if answer['text'] == 'Cannotanswer' or len(answer['text'])==0:
            n_cannotanswer += 1
            
    if n_cannotanswer < threshold:
        # Set maximum number of results to show
        max_n_results = 5
        n_results = 0
        for ind in df_qa_results.index:
            if n_results == max_n_results:
                break
            answers = df_qa_results['qa_answers'][ind]
            # If the first answer is non-null, show the answer
            #if answers[0]['text'] != 'CANNOTANSWER':
            
            if answers[0]['text'] != 'Cannotanswer' and len(answers[0]['text']) != 0:
                html += '<br><b>{}</b> ({}, {}, {})<br>'.format(answers[0]['text'], df_qa_results['date'][ind], df_qa_results['journal'][ind], df_qa_results['title'][ind])
            
                # Color the paragraph to highlight the answers
                marks = [0] * len(df_qa_results['text'][ind])
               
                for n_ans, answer in enumerate(answers):
                    if answer['text'] != 'Cannotanswer':
                        level = 5 - n_ans
                        start = answer['start_index']
                        if answer['end_index'] >= len(marks):
                            end = len(marks)-1
                        else:
                            end = answer['end_index']
                       
                        for i in range(start,end):
                            if marks[i] < level:
                                marks[i] = level
                html += color_snippet(df_qa_results['text'][ind], marks)
                n_results += 1        
        html += '<hr>'
    else:
        html += '<br><font color="red">No suitable answers found.</font><br>'

# Display the HTML string that contains all the answers
display(HTML(html))

# Save the HTML output, creating the output folder if it does not exist
html_dir = "/kaggle/working/html"
os.makedirs(html_dir, exist_ok=True)
with open(os.path.join(html_dir, "task" + str(ntask) + ".html"), "w") as html_file:
    html_file.write(html)