# Summary of the approach

This notebook attempts to answer the questions in all 10 tasks except **Sample task with sample submission**, whose main task is phrased too broadly to fit our approach. Our approach combines the classic search algorithm BM25 with BERT, a state-of-the-art deep learning model for NLP. The strategy is summarized in the following sections. 

### Retrieve relevant articles for COVID-19 (BM25) 
We use the broad terms in each main task question as the query to retrieve relevant articles from a BM25 search engine. For example, the main question for task 1 is **What is known about transmission, incubation, and environmental stability?**, from which we MANUALLY extract 3 terms: **transmission**, **incubation** and **environmental stability**. We use the Whoosh library (https://whoosh.readthedocs.io/en/latest/index.html) as the indexing engine to enable fast search over the title, abstract, and full_text of all documents. 
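
For intuition, the BM25 ranking can be sketched in plain Python. This is only a minimal illustration of the standard Okapi BM25 formula, not Whoosh's implementation; the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Rare terms receive a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length with b
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents (assumed for illustration, not taken from the dataset)
toy_docs = [
    'incubation period of the virus'.split(),
    'transmission of the virus in hospitals'.split(),
    'economic impact of school closures'.split(),
]
toy_scores = bm25_scores(['incubation', 'transmission'], toy_docs)
```

In this toy example the third document contains neither query term and scores zero, while the shorter of the two matching documents ranks first because of the length normalization.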

### Define COVID-19 key words
To find COVID-19 related articles, we defined a list of key words. An article is considered COVID-19 related if any of its fields (title, abstract, or full text) mentions any of the key words. To make sure the list covers all key words for COVID-19, we trained a word2vec model on all full texts to obtain phrase embeddings, then looked up synonyms for COVID-19 in the model. We used an iterative approach: starting from one key word, we look for its synonyms, add the new words or phrases to the key word list, and repeat the process with each newly found key word until no new key words are found. Here is the list of synonyms for COVID-19.

['ncov',
 'covid19',
 'covid-19',
 'sars cov2',
 'sars cov-2',
 'sars-cov-2',
 'sars coronavirus 2',
 '2019-ncov',
 '2019 novel coronavirus',
 '2019-ncov sars',
 'cov-2',
 'cov2',
 'novel coronvirus',
 'coronavirus 2019-ncov']
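
The iterative search described above can be sketched as a simple worklist loop. This is a minimal sketch under stated assumptions: `expand_key_words` and the `toy_neighbours` table are hypothetical names, and the table stands in for the synonym lookups that would come from the trained word2vec model.

```python
def expand_key_words(seed, similar_terms):
    """Collect key words reachable from `seed` by repeatedly following synonym links.

    `similar_terms` is a hypothetical stand-in for a word2vec lookup: it maps a
    term to the candidate synonyms that were accepted for it.
    """
    found = [seed]
    queue = [seed]
    while queue:
        term = queue.pop(0)
        for candidate in similar_terms.get(term, []):
            if candidate not in found:
                # A new key word: keep it and search for its synonyms as well
                found.append(candidate)
                queue.append(candidate)
    return found


# Toy neighbour table (assumed for illustration, not taken from the trained model)
toy_neighbours = {
    'covid-19': ['covid19', 'sars-cov-2'],
    'sars-cov-2': ['sars cov2', 'covid-19'],
    'covid19': ['2019-ncov'],
}
key_words = expand_key_words('covid-19', toy_neighbours)
```

The loop stops once every newly found key word has been looked up and no lookup returns an unseen term.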

### Assign documents to subtasks
We retrieve the top 50 articles using the broad terms extracted from the main task. Then, for each article, we take the following steps. 
1. We split the full text into sentences, then use a pre-trained deep learning model called sentence-transformers to encode the sentences into embeddings (vectors).
2. We apply sentence-transformers to all subtasks to turn them into a set of sentence embeddings as well.
3. We compute the pairwise sentence-subtask cosine similarities from these embeddings.
4. We sum the cosine similarities for each subtask. The summed scores indicate how related each subtask is to the article. 
5. We normalize the summed cosine similarities by applying softmax. The output can be interpreted as how strongly each subtask is related to the article, and is referred to as confidence in the rest of the notebook.
6. We print the articles in the order generated by the BM25 search algorithm. In addition, we print the confidence associated with each subtask and its top matching sentences, skipping subtasks with a weak confidence, defined as conf < 0.01. 
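
Steps 3 to 5 can be illustrated with a small, dependency-free sketch. The notebook itself uses `scipy` for the cosine distance and the softmax; the helper names and the toy 2-dimensional "embeddings" below are assumptions made only for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def softmax(values):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def subtask_confidences(sentence_embeddings, subtask_embeddings):
    """Return one confidence per subtask for a single article."""
    # For each subtask, sum its cosine similarity with every sentence of the article
    sums = [
        sum(cosine_similarity(sentence, subtask) for sentence in sentence_embeddings)
        for subtask in subtask_embeddings
    ]
    # Normalize the summed similarities across subtasks
    return softmax(sums)


# Toy embeddings (assumed values, only for illustration)
toy_sentences = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3]]
toy_subtasks = [[1.0, 0.0], [0.0, 1.0]]
toy_conf = subtask_confidences(toy_sentences, toy_subtasks)
```

Here the first toy subtask points in nearly the same direction as all three sentences, so it receives the larger confidence. Because the similarities are summed over all sentences before the softmax, the confidences tend to be more peaked for longer articles.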

**What is BERT and what is sentence-transformers?** 
BERT is a state-of-the-art deep learning language model for NLP tasks. It uses a self-attention mechanism to create context-dependent word embeddings, and sentence-transformers (https://github.com/UKPLab/sentence-transformers) leverages these context-aware embeddings to create sentence embeddings. 

### Interactive mode
In addition to the search strategy, we introduced an interactive mode that lets users run free-text searches, and we incorporated the BERT question-answering model to pinpoint the answer (one or more sentences) to a given subtask question. 


## Discussion
We noticed that sentence-transformers did not work well for very short or incomplete sentences, such as the subtask **Seasonality of Transmission** in Task 1. Because a short sentence provides limited context, sentence-transformers could not encode it into a useful sentence embedding. As a consequence, the cosine similarities computed for such subtasks tend to be low, and the matching sentences for **Seasonality of Transmission** are not very useful. We therefore proposed a complementary approach, **Interactive mode**, to handle such cases. 

## Future work
Currently, we search with the query terms as given and have not yet implemented query expansion in this notebook. We plan to incorporate it if we get to the next round of the submission. One way to find related terms or synonyms is to use word/phrase embeddings, which could be obtained by training a word2vec model on all abstracts or full texts extracted from the dataset. 
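
As a rough idea of what query expansion could look like, the sketch below ORs each query term with its synonyms. It is not implemented in this notebook: `expand_query` and the `toy_synonyms` table are hypothetical, and in practice the synonyms would come from the embedding model. The output uses the same boolean operators as the queries defined later in the notebook.

```python
def expand_query(terms, synonyms):
    """Build a boolean query string in which each term is OR-ed with its synonyms."""
    groups = []
    for term in terms:
        alternatives = [term] + [s for s in synonyms.get(term, []) if s != term]
        # Quote multi-word phrases so that they are searched as a phrase
        quoted = [f'"{a}"' if ' ' in a else a for a in alternatives]
        groups.append('(' + ' OR '.join(quoted) + ')')
    return ' OR '.join(groups)


# Toy synonym table (assumed for illustration)
toy_synonyms = {'transmission': ['spread'], 'incubation': ['incubation period']}
expanded = expand_query(['transmission', 'incubation'], toy_synonyms)
```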

Currently, the queries are defined manually by extracting the main concepts from the tasks. We want to use NLP tools such as spaCy to automatically extract the important concepts (primarily noun chunks) from the tasks. 

We would like to explore the citation connections between articles. The hypothesis is that similar articles tend to reference the same set of articles. By exploring those relationships, we could generate article embeddings and cluster similar articles together. 
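
One simple way to test this hypothesis, before building any embeddings, would be to measure the overlap between the reference lists of two articles. The sketch below uses the Jaccard index for that purpose; this measure and the names `jaccard_similarity` and `toy_refs` are our assumptions for illustration, not something implemented in this notebook.

```python
def jaccard_similarity(refs_a, refs_b):
    """Share of references that two articles have in common (0.0 to 1.0)."""
    set_a, set_b = set(refs_a), set(refs_b)
    if not set_a and not set_b:
        # Two articles without references carry no evidence of similarity
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Toy reference lists keyed by article (assumed for illustration)
toy_refs = {
    'article_1': ['ref a', 'ref b', 'ref c'],
    'article_2': ['ref b', 'ref c', 'ref d'],
    'article_3': ['ref x'],
}
close_pair = jaccard_similarity(toy_refs['article_1'], toy_refs['article_2'])
distant_pair = jaccard_similarity(toy_refs['article_1'], toy_refs['article_3'])
```

Pairwise scores of this kind could serve as the input to a clustering step.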


## Acknowledgement
This notebook was inspired by the following notebooks. 

https://www.kaggle.com/jonathanbesomi/a-qa-model-to-answer-them-all

https://www.kaggle.com/danielwolffram/topic-modeling-finding-related-articles

https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings


# Install Python libraries

In [None]:
!pip install whoosh
!pip install ipywidgets
!pip install sentence-transformers

In [None]:
import os
import json
import re
import logging
from tqdm import tqdm
import itertools
from itertools import chain

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import spatial
from scipy.special import softmax
from scipy.spatial.distance import cosine

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS as stop_words
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
from collections import Counter
from sklearn.metrics import confusion_matrix
from gensim.models import Word2Vec

In [None]:
from whoosh.fields import Schema, TEXT, ID, BOOLEAN
from whoosh.index import create_in
from whoosh.qparser import MultifieldParser
import whoosh.index as search_index
from whoosh import highlight
import math

In [None]:
import torch
from transformers import BertTokenizer
from transformers import BertForQuestionAnswering
from sentence_transformers import SentenceTransformer
import random
import time
import datetime

In [None]:
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, HTML

In [None]:
class DataLoader:
    """
    Data loader that builds the dataset from metadata.csv and the 4 data folders containing the JSON files for the articles
    """
    data_folder = '../input/CORD-19-research-challenge/'
    metadata_path = os.path.join(data_folder, 'metadata.csv')
    default_dataset_path = 'merged_dataset.pickle'
    key_words = ['ncov',
        'covid19',
        'covid-19',
        'sars cov2',
        'sars cov-2',
        'sars-cov-2',
        'sars coronavirus 2',
        '2019-ncov',
        '2019 novel coronavirus',
        '2019-ncov sars',
        'cov-2',
        'cov2',
        'novel coronvirus',
        'coronavirus 2019-ncov']
    
    def __init__(self, dataset_path):
        
        ## If dataset_path is None, set dataset_path to self.default_dataset_path
        if dataset_path is None:
            dataset_path = self.default_dataset_path
        
        try:
            self.dataset = pd.read_pickle(dataset_path)
            self.dataset_path = dataset_path
            self.build = False
        except FileNotFoundError:
            self.dataset_path = self.default_dataset_path
            self.build = True
    
    def build_if_not_exist(self):
        """
        The method builds the dataset that merges the information in metadata.csv and json files
        
        """
        if self.build and not os.path.exists(self.dataset_path):
            
            print(f'Building the dataset at {self.dataset_path}')
            
            dataset = self._load_articles_from_folder()
            metadata = self._load_metadata()
            
            print(f'There are {len(pd.concat([metadata["title"], dataset["title"]]).drop_duplicates())} articles in the dataset')

            dataset = self._merge_with_metadata(metadata, dataset)
            dataset = dataset[(dataset['title'] != '') & dataset['title'].notna()] \
                .reset_index().drop(['index'], axis=1).reset_index().rename(columns={'index': 'id'})

            dataset['full_text'] = dataset['full_text'].astype(str)
            dataset['abstract'] = dataset['abstract'].astype(str)
            # Flag an article as COVID-19 related if any key word occurs in its title or abstract
            dataset['is_covid_19'] = (dataset['title'] + dataset['abstract'].astype(str)).str.lower() \
                .apply(lambda t: any(key in t for key in self.key_words))
            
            dataset.to_pickle(self.dataset_path)
            
            print(f'Total records in the merged dataset {len(dataset)}')
        
        return self
    
    def get_dataset(self):
        """
        Get the dataset from the dataloader
        """
        if hasattr(self, 'dataset'):
            return self.dataset
        else:
            try:
                self.build_if_not_exist()
                self.dataset = pd.read_pickle(self.dataset_path)
                return self.dataset
            except FileNotFoundError:
                print(f'dataset doesn`t exist at {self.dataset_path}')
        
        return None
    
    def _load_metadata(self):
        
        metadata = pd.read_csv(self.metadata_path)
        # Report the size before and after de-duplication so the two counts are comparable
        print(f'Total records in metadata {len(metadata)}')
        metadata = metadata[['sha', 'title', 'abstract']].drop_duplicates()
        print(f'Total records in metadata after dropping the duplicates {len(metadata)}')
        return metadata
    
    def _load_articles_from_folder(self):
        
        articles = []

        for (dirpath, dirnames, filenames) in os.walk(self.data_folder):
            if filenames:
                for filename in tqdm([f for f in filenames if re.search(r'.+\.json$', f) is not None]):
                    file_path = os.path.join(dirpath, filename)
                    paper_id, title, abstract, full_text, reference = self._load_article_from_json(file_path)
                    articles.append((paper_id, title, abstract, full_text, reference))
        
        return pd.DataFrame(articles, columns=['sha', 'title', 'abstract', 'full_text', 'reference'])
    
    def _load_article_from_json(self, file_path):
        """
        Load the article json file
        """
        with open(file_path) as json_file: 

            data = json.load(json_file)
            paper_id = data['paper_id']
            title = data['metadata']['title'].replace('\n', '')
            abstract = '\n'.join([a['text'] for a in data.get('abstract', [])])
            full_text = '\n'.join([body_text.get('text', '') for body_text in data.get('body_text', [])])
            reference = '\n'.join([v['title'] for k, v in data.get('bib_entries', dict()).items()])

        return (paper_id, title, abstract, full_text, reference)
    
    def _merge_with_metadata(self, metadata, dataset):
    
        def _coalease(_df, _col):
            _col_x = _col + '_x'
            _col_y = _col + '_y'
            _df[_col] = _df[_col_x].combine_first(_df[_col_y])
            _df.drop(_col_x, axis=1, inplace=True)
            _df.drop(_col_y, axis=1, inplace=True)

            return _df

        if 'full_text' not in metadata.columns:
            metadata['full_text'] = np.nan

        if 'reference' not in metadata.columns:
            metadata['reference'] = np.nan

        _combined = metadata.merge(dataset, on='sha', how='outer')
        _combined = _coalease(_combined, 'title')
        _combined = _coalease(_combined, 'abstract')
        _combined = _coalease(_combined, 'full_text')
        _combined = _coalease(_combined, 'reference')

        _combined = _combined.merge(dataset, on='title', how='outer')
        _combined = _coalease(_combined, 'sha')
        _combined = _coalease(_combined, 'abstract')
        _combined = _coalease(_combined, 'full_text')
        _combined = _coalease(_combined, 'reference')

        return _combined.sort_values(['title', 'sha', 'full_text'], na_position='last') \
            .drop_duplicates(['title'], keep='first')


In [None]:
class SearchEngine:
    
    lemmatizer = WordNetLemmatizer() 
    default_index_folder = 'index'
    batch_size = 10000
    
    def __init__(self, index_folder, dataset):
        """
        Initialize SearchEngine with an index folder and the pandas dataset that contains the article raw data
        """
        if index_folder is None or not os.path.exists(index_folder):
            if not os.path.exists(self.default_index_folder):
                os.mkdir(self.default_index_folder)
                self.built = True
            else:
                self.built = False
            self.index_folder = self.default_index_folder
        else:
            self.index_folder = index_folder
            self.built = False
        
        self.dataset = dataset
    
    def search_articles(self, query, num):
        """
        Search for the article in the index
        """
        #query = ' '.join(self._tokenize_text(query))
        ix = search_index.open_dir(self.index_folder)
        with ix.searcher() as searcher:
            parser = MultifieldParser(['title', 'abstract', 'full_text'], ix.schema)
            processed_query = query + ' AND (is_covid_19:TRUE)'
            parsed_query = parser.parse(processed_query)
            results = searcher.search(parsed_query, limit=num)
            results.fragmenter = highlight.ContextFragmenter(surround=200)
            return [self._process_hit(result) for result in results]
        
        return []
        
    def _process_hit(self, hit):
        """
        Extract the article data from the dataset based on the paper_id in the search hit
        """
        paper_id = int(hit['paper_auto_id'])
        record = self.dataset[self.dataset['id'] == paper_id]
        title = record['title'].iloc[0]
        abstract = record['abstract'].iloc[0]
        full_text = record['full_text'].iloc[0]
        reference = record['reference'].iloc[0]
        score = hit.score
        abstract_highlight = hit.highlights('abstract', text=abstract)        
        full_text_highlight = hit.highlights('full_text', text=full_text)        
        reference_highlight = hit.highlights('reference', text=reference)        

        return {'title': title, 
                'abstract': abstract, 
                'full_text': full_text, 
                'score': score, 
                'abstract_highlight': abstract_highlight, 
                'full_text_highlight': full_text_highlight, 
                'reference_highlight': reference_highlight}

    def build_index_if_not_exist(self):
        """
        Build the index in the default folder if the provided index_folder does not have index
        """
        if self.built:
            print(f'Building the index at {self.index_folder}')
            num_of_batches = math.ceil(len(self.dataset) / self.batch_size)
            
            for i in range(num_of_batches):
                print(f'Building index for batch no. {i+1} for {self.batch_size} records')
                self._build_index_batch(self.dataset[i * self.batch_size: (1 + i) * self.batch_size])
        else:
            print(f'The index exists at {self.index_folder}')
        
        return self
    
    def _build_index_batch(self, batch):
        
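        # create_in() always starts a new, empty index, so only the first batch may call it;
        # later batches open the existing index and append to it
        if search_index.exists_in(self.index_folder):
            ix = search_index.open_dir(self.index_folder)
        else:
            ix = create_in(self.index_folder, self._get_schema())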
        writer = ix.writer(limitmb=2048)
        
        for _, row in tqdm(batch.replace(np.nan, '', regex=True).iterrows()):
    
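            # Access the fields by column name rather than by position
            paper_auto_id = str(row['id'])
            title = row['title']
            abstract = row['abstract']
            full_text = row['full_text']
            reference = row['reference']
            is_covid_19 = row['is_covid_19']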
            writer.add_document(paper_auto_id=paper_auto_id,
                                title=title,
                                abstract=abstract,
                                full_text=full_text,
                                reference=reference, 
                                is_covid_19=is_covid_19)

        writer.commit()
        
    def _get_schema(self):
        schema = Schema(
            paper_auto_id=ID(stored=True), 
            title=TEXT(stored=True, phrase=True), 
            abstract=TEXT(stored=False, phrase=True),
            full_text=TEXT(stored=False, phrase=True),
            reference=TEXT(stored=False, phrase=True),
            is_covid_19=BOOLEAN(stored=False)
        )
        return schema
    
    def _tokenize_text(self, text):
        return [self.lemmatizer.lemmatize(w) for w in word_tokenize(text.lower()) if not w in stop_words and w.isalpha()]

In [None]:
class BertQA:
    
    def __init__(self):
        BERT_SQUAD = 'bert-large-uncased-whole-word-masking-finetuned-squad'
        self.model = BertForQuestionAnswering.from_pretrained(BERT_SQUAD)
        self.tokenizer = BertTokenizer.from_pretrained(BERT_SQUAD)
        self.torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model.to(self.torch_device)
        self.model.eval()
        
    def answer_question(self, question, context):
        # answer the question given the question and its context
        encoded_dict = self.tokenizer.encode_plus(
                            question, context,
                            add_special_tokens = True,
                            max_length = 512,
                            pad_to_max_length = True,
                            return_tensors = 'pt'
                       )

        input_ids = encoded_dict['input_ids'].to(self.torch_device)
        token_type_ids = encoded_dict['token_type_ids'].to(self.torch_device)

        start_scores, end_scores = self.model(input_ids, token_type_ids=token_type_ids)

        all_tokens = self.tokenizer.convert_ids_to_tokens(input_ids[0])
        start_index = torch.argmax(start_scores)
        end_index = torch.argmax(end_scores)

        answer = self.tokenizer.convert_tokens_to_string(all_tokens[start_index:end_index+1])
        answer = answer.replace('[CLS]', '').replace('[SEP]', '')
        return answer
    
#Full credit goes to https://www.kaggle.com/jonathanbesomi/a-qa-model-to-answer-them-all. That's where I got this code snippet.

In [None]:
#Load the dataset from the specified location, if the dataset is missing, it builds the dataset from scratch. 
data_loader = DataLoader(None)
dataset = data_loader.build_if_not_exist().get_dataset()

#Initialize the search engine and build the index from scratch if index folder doesn't exist in the input folder. 
search_engine = SearchEngine(None, dataset).build_index_if_not_exist()

#Initialize the sentence transformer model downloaded from https://github.com/UKPLab/sentence-transformers. 
sentence_transformers_model = SentenceTransformer('bert-large-nli-mean-tokens')

In [None]:
def perform_task(main_query, subtask_list, num):
    
    def get_top_hits(sen_ques_dataframe, sent_i, num):
        return sen_ques_dataframe[sen_ques_dataframe['question_index'] == sent_i] \
            .sort_values(['cos'], ascending=False)['sentence'].iloc[0:num].to_list()
    
    def tokenize_text(text):
        return [lemmatizer.lemmatize(w) for w in word_tokenize(text.lower()) if w not in stop_words and w.isalpha()]
        
    lemmatizer = WordNetLemmatizer()
    
    question_embeddings = sentence_transformers_model.encode(subtask_list)
    results = search_engine.search_articles(main_query, num)
    
    html_string = ''

    for index, result in enumerate(results):

        title = result['title']
        abstract = result['abstract']
        score = result['score']
        full_text = result['full_text']
        full_text = abstract if full_text == '' or full_text is None else full_text
        original_sentences = re.split(r'[\.?!]\s+', full_text)
        original_sentences = [sent for sent in original_sentences if len(tokenize_text(sent)) >= 5]

        if len(original_sentences) > 0:

            sentences = [' '.join(tokenize_text(sent)) for sent in original_sentences] 
            sentence_embeddings = sentence_transformers_model.encode(sentences)
            sentence_zip = enumerate(zip(original_sentences, sentence_embeddings))
            task_zip = enumerate(zip(subtask_list, question_embeddings))
            sen_ques = [(sen_t[0], sen_t[1][0], sen_t[1][1], ques_t[0], ques_t[1][0], ques_t[1][1]) \
                 for sen_t, ques_t in itertools.product(sentence_zip, task_zip)]

            sen_ques_dataframe = pd.DataFrame(sen_ques, columns=['sentence_index', 
                                                                 'sentence', 
                                                                 'sentence_embeddings', 
                                                                 'question_index', 
                                                                 'question', 
                                                                 'question_embeddings'])

            sen_ques_dataframe['cos'] = sen_ques_dataframe.apply(lambda x: 1 - cosine(x.sentence_embeddings, x.question_embeddings), axis=1)
            sen_ques_dataframe['rank'] = sen_ques_dataframe.groupby('sentence_index')['cos'].rank("dense", ascending=False)
            confidences = softmax(sen_ques_dataframe.groupby('question_index')['cos'].sum())
            
            confidences = confidences.sort_values(ascending=False).reset_index().to_records(index=False)
            cosine_sum = sen_ques_dataframe[sen_ques_dataframe['rank'].astype(int) == 1]['cos'].sum()

            html_string += f'<h2>{index+1}. {title}</h2><br><b>Abstract</b>: {abstract}<br><br>'
            for conf in confidences:
                if conf[1] < 0.01: 
                    break
                html_string += f'<b>Subtask:</b> {subtask_list[conf[0]]}<br>'
                html_string += f'<b>Confidence:</b> {round(conf[1], 2)}<br>'
                html_string += f'<b>Top matching sentences:</b><br>'
                html_string += '<br>'.join(get_top_hits(sen_ques_dataframe, conf[0], 3))
                html_string += '<br><br>'
            html_string += '<br>'

    return html_string

# What is known about transmission, incubation, and environmental stability?

In [None]:
task_1_search_query = '(transmission OR (incubation AND period) OR (environmental AND factor))'

In [None]:
task1 = ["Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.",
"Prevalence of asymptomatic shedding and transmission (e.g., particularly children).",
"Seasonality of transmission.",
"Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).",
"Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).",
"Persistence of virus on surfaces of different materials (e,g., copper, stainless steel, plastic).",
"Natural history of the virus and shedding of it from an infected person",
"Implementation of diagnostics and products to improve clinical processes",
"Disease models, including animal models for infection, disease and transmission",
"Tools and studies to monitor phenotypic change and potential adaptation of the virus",
"Immune response and immunity",
"Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings",
 "Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings",
"Role of the environment in transmission"]

In [None]:
html_string = perform_task(task_1_search_query, task1, 50)

In [None]:
display(HTML(html_string))

# What do we know about COVID-19 risk factors?

In [None]:
task_2_searh_query = '(risk AND factors)'

In [None]:
task2 = ['Data on potential risks factors',
'Smoking, pre-existing pulmonary disease',
'Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities',
'Neonates and pregnant women',
'Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.',
'Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors', 
'Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups',
'Susceptibility of populations',
'Public health mitigation measures that could be effective for control']

In [None]:
html_string = perform_task(task_2_searh_query, task2, 50)

In [None]:
display(HTML(html_string))

# What do we know about virus genetics, origin, and evolution?

In [None]:
task_3_search_query = '((virus AND genetics) OR origin OR evolution)'

In [None]:
task3 = ['Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time.',
'Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged.',
'Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.',
'Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.',
'Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.',
'Experimental infections to test host range for this pathogen.',
'Animal host(s) and any evidence of continued spill-over to humans',
'Socioeconomic and behavioral risk factors for this spill-over',
'Sustainable risk reduction strategies']

In [None]:
html_string = perform_task(task_3_search_query, task3, 50)

In [None]:
display(HTML(html_string))

# What do we know about non-pharmaceutical interventions?

In [None]:
task_4_search_query = '((non-pharmaceutical AND interventions) OR (traditional AND medicine) OR (alternative AND medicine) OR (herbal AND medicine))'

In [None]:
task4 = ["Guidance on ways to scale up NPIs in a more coordinated way (e.g., establish funding, infrastructure and authorities to support real time, authoritative (qualified participants) collaboration with all states to gain consensus on consistent guidance and to mobilize resources to geographic areas where critical shortfalls are identified) to give us time to enhance our health care delivery system capacity to respond to an increase in cases.",
"Rapid design and execution of experiments to examine and compare NPIs currently being implemented. DHS Centers for Excellence could potentially be leveraged to conduct these experiments.",
"Rapid assessment of the likely efficacy of school closures, travel bans, bans on mass gatherings of various sizes, and other social distancing approaches.",
"Methods to control the spread in communities, barriers to compliance and how these vary among different populations..",
"Models of potential interventions to predict costs and benefits that take account of such factors as race, income, disability, age, geographic location, immigration status, housing status, employment status, and health insurance status.",
"Policy changes necessary to enable the compliance of individuals with limited resources and the underserved with NPIs.",
"Research on why people fail to comply with public health advice, even if they want to do so (e.g., social or financial costs may be too high).",
"Research on the economic impact of this or any pandemic. This would include identifying policy and programmatic alternatives that lessen/mitigate risks to critical government services, food distribution and supplies, access to critical household supplies, and access to health diagnoses, treatment, and needed care, regardless of ability to pay."]

In [None]:
html_string = perform_task(task_4_search_query, task4, 50)

In [None]:
display(HTML(html_string))

# What do we know about vaccines and therapeutics?

In [None]:
task_5_search_query = '(vaccines OR therapeutics)'

In [None]:
task5 = ["Effectiveness of drugs being developed and tried to treat COVID-19 patients. Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.",
"Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.",
"Exploration of use of best animal models and their predictive value for a human vaccine.",
"Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.",
"Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.",
"Efforts targeted at a universal coronavirus vaccine.",
"Efforts to develop animal models and standardize challenge studies",
"Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers",
"Approaches to evaluate risk for enhanced disease after vaccination",
"Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models [in conjunction with therapeutics]"]

In [None]:
html_string = perform_task(task_5_search_query, task5, 50)

In [None]:
display(HTML(html_string))

# What has been published about ethical and social science considerations?

In [None]:
task_6_sbearch_query = '((ethical AND considerations) OR (social OR science OR considerations))'

In [None]:
task6 = ["Efforts to articulate and translate existing ethical principles and standards to salient issues in COVID-2019", 
"Efforts to embed ethics across all thematic areas, engage with novel ethical issues that arise and coordinate to minimize duplication of oversight",
"Efforts to support sustained education, access, and capacity building in the area of ethics",
"Efforts to establish a team at WHO that will be integrated within multidisciplinary research and operational platforms and that will connect with existing and expanded global networks of social sciences.",
"Efforts to develop qualitative assessment frameworks to systematically collect information related to local barriers and enablers for the uptake and adherence to public health measures for prevention and control. This includes the rapid identification of the secondary impacts of these measures. (e.g. use of surgical masks, modification of health seeking behaviors for SRH, school closures)",
"Efforts to identify how the burden of responding to the outbreak and implementing public health measures affects the physical and psychological health of those providing care for Covid-19 patients and identify the immediate needs that must be addressed.",
"Efforts to identify the underlying drivers of fear, anxiety and stigma that fuel misinformation and rumor, particularly through social media."]

In [None]:
html_string = perform_task(task_6_sbearch_query, task6, 50)

In [None]:
display(HTML(html_string))

# What do we know about diagnostics and surveillance?

In [None]:
task_7_search_query = '(diagnostics OR surveillance)'

In [None]:
task7 = ["How widespread current exposure is to be able to make immediate policy recommendations on mitigation measures. Denominators for testing and a mechanism for rapidly sharing that information, including demographics, to the extent possible. Sampling methods to determine asymptomatic disease (e.g., use of serosurveys (such as convalescent samples) and early detection of disease (e.g., use of screening of neutralizing antibodies such as ELISAs).",
"Efforts to increase capacity on existing diagnostic platforms and tap into existing surveillance platforms.",
"Recruitment, support, and coordination of local expertise and capacity (public, private—commercial, and non-profit, including academic), including legal, ethical, communications, and operational issues.",
"National guidance and guidelines about best practices to states (e.g., how states might leverage universities and private laboratories for testing purposes, communications to public health officials and the public).",
"Development of a point-of-care test (like a rapid influenza test) and rapid bed-side tests, recognizing the tradeoffs between speed, accessibility, and accuracy.",
"Rapid design and execution of targeted surveillance experiments calling for all potential testers using PCR in a defined area to start testing and report to a specific entity. These experiments could aid in collecting longitudinal samples, which are critical to understanding the impact of ad hoc local interventions (which also need to be recorded).",
"Separation of assay development issues from instruments, and the role of the private sector to help quickly migrate assays onto those devices.",
"Efforts to track the evolution of the virus (i.e., genetic drift or mutations) and avoid locking into specific reagents and surveillance/detection schemes.",
"Latency issues and when there is sufficient viral load to detect the pathogen, and understanding of what is needed in terms of biological and environmental sampling.",
"Use of diagnostics such as host response markers (e.g., cytokines) to detect early disease or predict severe disease progression, which would be important to understanding best clinical practice and efficacy of therapeutic interventions.",
"Policies and protocols for screening and testing.",
"Policies to mitigate the effects on supplies associated with mass testing, including swabs and reagents.",
"Technology roadmap for diagnostics.",
"Barriers to developing and scaling up new diagnostic tests (e.g., market forces), how future coalition and accelerator models (e.g., Coalition for Epidemic Preparedness Innovations) could provide critical funding for diagnostics, and opportunities for a streamlined regulatory environment.",
"New platforms and technology (e.g., CRISPR) to improve response times and employ more holistic approaches to COVID-19 and future diseases.",
"Coupling genomics and diagnostic testing on a large scale.",
"Enhance capabilities for rapid sequencing and bioinformatics to target regions of the genome that will allow specificity for a particular variant.",
"Enhance capacity (people, technology, data) for sequencing with advanced analytics for unknown pathogens, and explore capabilities for distinguishing naturally-occurring pathogens from intentional.",
"One Health surveillance of humans and potential sources of future spillover or ongoing exposure for this organism and future pathogens, including both evolutionary hosts (e.g., bats) and transmission hosts (e.g., heavily trafficked and farmed wildlife and domestic food and companion species), inclusive of environmental, demographic, and occupational risk factors."]

In [None]:
html_string = perform_task(task_7_search_query, task7, 50)

In [None]:
display(HTML(html_string))

# What has been published about medical care?

In [None]:
task_8_search_query = '(medical AND care)'

In [None]:
task8 = ["Resources to support skilled nursing facilities and long term care facilities.",
"Mobilization of surge medical staff to address shortages in overwhelmed communities",
"Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies",
"Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients",
"Outcomes data for COVID-19 after mechanical ventilation adjusted for age.",
"Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest.",
"Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level.",
"Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks.",
"Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries.",
"Guidance on the simple things people can do at home to take care of sick people and manage disease.",
"Oral medications that might potentially work.",
"Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually.",
"Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes.",
"Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials",
"Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials",
"Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)"]

In [None]:
html_string = perform_task(task_8_search_query, task8, 50)

In [None]:
display(HTML(html_string))

# What has been published about information sharing and inter-sectoral collaboration?

In [None]:
task_9_search_query = '((information AND sharing) OR (inter-sectoral AND collaboration))'

In [None]:
task9 = ["Methods for coordinating data-gathering with standardized nomenclature.",
"Sharing response information among planners, providers, and others.",
"Understanding and mitigating barriers to information-sharing.",
"How to recruit, support, and coordinate local (non-Federal) expertise and capacity relevant to public health emergency response (public, private, commercial and non-profit, including academic).",
"Integration of federal/state/local public health surveillance systems.",
"Value of investments in baseline public health response infrastructure preparedness",
"Modes of communicating with target high-risk populations (elderly, health care workers).",
"Risk communication and guidelines that are easy to understand and follow (include targeting at risk populations’ families too).",
"Communication that indicates potential risk of disease to all population groups.",
"Misunderstanding around containment and mitigation.",
"Action plan to mitigate gaps and problems of inequity in the Nation’s public health capability, capacity, and funding to ensure all citizens in need are supported and can access information, surveillance, and treatment.",
"Measures to reach marginalized and disadvantaged populations.",
"Data systems and research priorities and agendas incorporate attention to the needs and circumstances of disadvantaged populations and underrepresented minorities.",
"Mitigating threats to incarcerated people from COVID-19, assuring access to information, prevention, diagnosis, and treatment.",
"Understanding coverage policies (barriers and opportunities) related to testing, treatment, and care"]

In [None]:
html_string = perform_task(task_9_search_query, task9, 50)

In [None]:
display(HTML(html_string))

# Interactive search
You can search for any keywords. Matching articles are retrieved from the index, and the pretrained BertQuestionAnswering model is then used to extract an answer from each one. BertQuestionAnswering does not always find an answer; it works best when the query is phrased as a simple question, e.g. Is hypertension a risk factor for COVID-19?
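When the model finds nothing useful, it may return an empty string or a span that simply repeats words from the query. The rendering code in this notebook hides such answers. That check can be isolated as a small function, sketched here under the assumption that answers and queries are plain strings. The name `is_informative_answer` is hypothetical, and unlike the notebook's inline check this sketch also strips surrounding whitespace first.

```python
def is_informative_answer(answer: str, query: str) -> bool:
    """Decide whether a QA answer is worth showing next to the query.

    Hypothetical helper for illustration. An answer is rejected when it
    is empty (after stripping whitespace) or when it is a case-insensitive
    substring of the query, i.e. it only echoes the question.
    """
    cleaned = answer.strip()
    if not cleaned:
        return False
    return cleaned.lower() not in query.lower()


if __name__ == "__main__":
    question = "Is hypertension a risk factor for COVID-19?"
    for candidate in ["", "hypertension", "older age and hypertension"]:
        print(repr(candidate), "->", is_informative_answer(candidate, question))
```

The substring test is a cheap heuristic: it removes answers that add no information beyond the question, at the cost of occasionally hiding a short answer that happens to appear verbatim in the query.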

In [None]:
bert_qa = BertQA()

In [None]:
def render_search_results(query, num=10):
    """Search the index and display highlighted evidence and BERT answers as HTML."""
    results = search_engine.search_articles(query, num)
    html_string = f'<h2>Query: {query}</h2><br><br>'
    if len(results) == 0:
        html_string = 'There is no match. '

    for i, result in enumerate(results):
        title = result['title']
        html_string += f'<h2>{i+1}. {title}</h2><br>'

        # Each section pairs a heading with its highlighted snippets and
        # the raw text that is passed to the question-answering model.
        sections = [
            ('Abstract evidence', result['abstract_highlight'], result['abstract']),
            ('Full Text evidence', result['full_text_highlight'], result['full_text']),
        ]
        for heading, highlight, body in sections:
            if highlight == '':
                continue
            html_string += f'<h3>{heading}</h3>'
            # Highlights are joined with "..."; list each snippet separately.
            for j, sent in enumerate(highlight.split("...")):
                html_string += f'{j+1}. ...{sent}...<br>'
            bert_answer = bert_qa.answer_question(query, body)
            # Skip empty answers and answers that merely echo part of the query.
            if bert_answer != '' and bert_answer.lower() not in query.lower():
                html_string += f'<br>Bert answer: {bert_answer}<br>'
            html_string += '<br>'

    display(HTML(html_string))

In [None]:
text = widgets.Text(
    value='Seasonality of transmission.',
    placeholder='Type a search query and press Enter',
    disabled=False,
    layout=widgets.Layout(width='60%', overflow_y='auto')
)
output = widgets.Output()
hbox = widgets.HBox([widgets.Label('Search articles: '), text])
vbox = widgets.VBox([hbox, output])

display(vbox)

def callback(wdgt):
    with output:
        output.clear_output()
        render_search_results(wdgt.value)

text.on_submit(callback)

## Discussion