# Overview of Project: Summary, Methods Used, and Findings

## Task: Create summary tables that address relevant factors related to COVID-19

In this notebook, we used insight from our [initial findings](https://www.kaggle.com/ralphlozanoc1/team-alcapone-submission) (utilizing Specter embeddings of titles and abstracts from the COVID-19 dataset for clustering, and then evaluating the data for similarities and separation) to develop a plan for further evaluation of the dataset. In particular, we have built a system that creates summary tables using texts from relevant literature, given a query. 

Below, we will show the following: 

* Step I: Retrieval of relevant documents for a query
    * Create indexes - [leveraging Anserini to create indexes for documents](https://github.com/castorini/anserini)
    * Retrieve the top 100 relevant documents based on similarity using query-document feature vectors (bag of words) 
* Step II: Re-Rank to improve initial set of scores
    * Get embeddings using XLNet
    * Measure semantic similarity between a query and document pair using cosine similarity
* Step III: QA System 
    * Reformat and generate training data extracted from manually labeled [QA dataset](https://github.com/deepset-ai/COVID-QA/blob/master/data/question-answering/200423_covidQA.json)
    * Finetune [SQuAD](https://github.com/huggingface/transformers/tree/master/examples/question-answering) with a [RoBERTa-based-SQuAD](https://huggingface.co/deepset/roberta-base-squad2#list-files) model using our modified training data
    * Run QA system to get relevant factors, the respective excerpt, and associated evidence.

We also provide the code for reproducing these results. Note that the code is not suited to running in a Kaggle kernel because of memory limitations. To run it elsewhere, change the input and output paths accordingly.

In [None]:
from IPython.display import display, Image
display(Image(filename='/kaggle/input/images/image1.png', embed=True, width=1000))

# Document Retrieval 

As the first step for document retrieval, we leveraged Anserini to take advantage of its low-latency Lucene indexes. We followed the steps outlined [here](https://github.com/castorini/anserini/blob/master/docs/experiments-cord19.md) to create document indexes for full text. Under the hood, Anserini uses a BM25 scoring system, which builds on the default bag-of-words approach, to calculate similarity between query and document pairs.
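
To make the scoring idea concrete, here is a minimal sketch of a textbook BM25 scorer over whitespace tokens. It is not Anserini's or Lucene's implementation: the function name `bm25_scores`, the toy documents, and the parameter values `k1=0.9` and `b=0.4` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each whitespace-tokenized document against the query with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 controls term-frequency saturation, b controls length normalization.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, not taken from CORD-19.
docs = [
    "quarantine of exposed individuals reduced transmission",
    "school closures and workplace distancing",
    "seasonal humidity and temperature effects",
]
scores = bm25_scores("quarantine exposed individuals", docs)
```

Only the first toy document shares terms with the query, so it is the only one with a non-zero score. A document that expressed the same idea with different words would score zero, which is the gap the re-ranking step is meant to address.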

The top 100 documents were retrieved for each of the queries given under the task. Here is an example of documents retrieved for the query "Effectiveness of case isolation/isolation of exposed individuals (i.e. quarantine)":

In [None]:
import pandas as pd
query_1_retrieved = pd.read_csv("/kaggle/input/retrievaloutput/doc_scores_1_retrieved.csv")
query_1_retrieved[['query', 'docid', 'score']]

# Re-Rank to Improve Initial Scores

Since BM25 relies on the frequency of word occurrences, it does not capture semantic similarity well. To address this, we re-ranked the retrieved documents using the cosine similarity (one minus the cosine distance) between the embedding vectors of each document and query pair.

We used a pre-trained XLNet model from the [Transformers library](https://huggingface.co/transformers/) to generate embeddings of text in queries and documents. To generate a single fixed-length embedding, we obtained vectors for each token in the input text from the last hidden state of the model and calculated the mean. For documents, we embedded each section separately and averaged the section embeddings to get a single vector representation.
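
The two levels of averaging can be sketched with plain Python lists. The 3-dimensional toy vectors and the helper name `mean_pool` are assumptions for illustration; the real embeddings come from XLNet and have 768 dimensions.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy "token vectors" for two sections of one hypothetical document.
section_a_tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
section_b_tokens = [[0.0, 4.0, 4.0]]

# First pool tokens into one vector per section, then pool the sections.
section_vectors = [mean_pool(section_a_tokens), mean_pool(section_b_tokens)]
doc_vector = mean_pool(section_vectors)
```

Note that averaging the section means gives every section equal weight regardless of its length, so the result generally differs from averaging all tokens of the document at once.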

Once embedding vectors were created for a query-document pair, we calculated the similarity score as the cosine similarity between them. Using this similarity score, the retrieved documents were re-ranked so that the documents most relevant to the query would appear at the top of the summary table. Here is an example table that shows documents reranked for the same query:

In [None]:
query_1_reranked = pd.read_csv("/kaggle/input/rerankeddocs/sim_scores_1_retrieved.csv")
combined = pd.merge(left=query_1_retrieved[['query', 'docid', 'score']], right=query_1_reranked, left_on=['docid', 'query', 'score'], right_on=['docid', 'query', 'score'])
combined['old_rank'] = combined['score'].rank(ascending=False)
combined['new_rank'] = combined['sim_score'].rank(ascending=False)
combined.rename(columns={'score': 'bm25_score', 'sim_score': 'cosine_score'})
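
As a rough illustration of the re-ranking step, the sketch below computes cosine similarity by hand and sorts document ids by it. The 2-dimensional vectors, the document ids, and the helper names `cosine_similarity` and `rerank` are hypothetical and stand in for the XLNet embeddings used in the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(query_vec, doc_vecs):
    """Return document ids sorted by descending similarity to the query."""
    scored = {docid: cosine_similarity(query_vec, vec) for docid, vec in doc_vecs.items()}
    return sorted(scored, key=scored.get, reverse=True)


query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [0.0, 1.0], "doc_b": [2.0, 0.0], "doc_c": [1.0, 1.0]}
order = rerank(query_vec, doc_vecs)
```

Because cosine similarity ignores vector length, `doc_b` ranks first even though it is twice as long as the query vector; only the direction matters.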

# Extract answers:
As the next step in building summary tables, we used two different approaches to extract answers for the columns in the summary table. For `Study Type`, which is a categorical variable, we used a string search method. For the rest of the columns (`Factors`, `Excerpt` and `Measure of Evidence`), we used a QA system to populate column values.

## String Search
We used string search to infer the value for the `Study Type` column. We used the [Aho-Corasick algorithm](https://pypi.org/project/pyahocorasick/), which leverages a trie-based data structure for fast and efficient string searching. First, we identified associated words or phrases for each of the category labels. Then, we selected the category label with the maximum number of occurrences of its associated words/phrases. If no matches were found for any of the associated words, the column is set to _null_. Here is an example for the same query from above:

In [None]:
query_1_retrieved[['query', 'docid', 'study_type']].fillna("")
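
The label selection logic can be sketched without the Aho-Corasick dependency. The version below swaps in plain `str.count` for the automaton, so it illustrates only the "pick the label with the most matches" rule and not the fast multi-pattern search; the function name `infer_study_type` and the sample text are assumptions.

```python
def infer_study_type(text, patterns):
    """Return the label(s) whose phrases occur most often in text, or [] if none match."""
    lowered = text.lower()
    # Total number of (non-overlapping) phrase occurrences per label.
    counts = {
        label: sum(lowered.count(phrase) for phrase in phrases)
        for label, phrases in patterns.items()
    }
    best = max(counts.values())
    if best == 0:
        return []
    # Ties are kept, so more than one label can be returned.
    return [label for label, count in counts.items() if count == best]


patterns = {
    "systematic review": ["systematic review", "meta-analysis"],
    "case series": ["case series"],
}
text = "This systematic review and meta-analysis included one case series."
labels = infer_study_type(text, patterns)
no_labels = infer_study_type("no study keywords here", patterns)
```

With many phrases and long documents, a single pass with an Aho-Corasick automaton avoids rescanning the text once per phrase, which is presumably why it was preferred here.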

## QA System 

Here, we built an automated QA (Question Answering) System capable of identifying and retrieving relevant content when posed with a series of questions. To begin, we finetuned the original RoBERTa SQuAD model using a manually labeled Covid-QA dataset as a starting point.

RoBERTa has a limit of 512 tokens per input sequence, so we reformatted the manually labeled data into 500-word document slices, each of which includes the answer to a particular question. We then used a 65/35 train/eval split to subset our data for finetuning. Details on how this was performed and how it can be replicated are covered in the associated code portion of the notebook [below](https://www.kaggle.com/katieflannigan/c1nlp-cord19-submission-6-16#Using-the-Training-Data-and-running-the-SQuAD-FineTuner). Our SQuAD finetuner was subsequently run on this dataset and produced the following metrics:

    exact: 39.75155279503105,
    f1: 72.57937317850947,
    total: 483,
    HasAns_exact: 39.75155279503105,
    HasAns_f1: 72.57937317850947,
    HasAns_total: 483,
    best_exact: 39.75155279503105,
    best_exact_thresh: 0.0,
    best_f1: 72.57937317850947, 
    best_f1_thresh: 0.0
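
To help read the gap between the `exact` and `f1` values, here is a simplified sketch of the two metrics for a single prediction. The official SQuAD evaluation also strips punctuation and articles before comparing; this version only lowercases and splits on whitespace, and the example strings are hypothetical.

```python
from collections import Counter


def exact_match(prediction, truth):
    """1.0 if the token sequences are identical, else 0.0."""
    return float(prediction.lower().split() == truth.lower().split())


def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "isolation of exposed individuals"
truth = "isolation of exposed individuals and quarantine"
em = exact_match(prediction, truth)
f1 = token_f1(prediction, truth)
```

A prediction that captures most of a long answer span gets no credit under exact match but substantial credit under F1, which is consistent with an F1 score well above the exact-match score.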

Using this finetuned SQuAD model, we asked a series of questions against the ranked documents created earlier, to identify relevant factors, the respective excerpt from the document, and associated evidence. The questions used to extract relevant factors can be found below in the code section. A simple example (for Query 1) is as follows:

    What factors are related to quarantine and isolation?

In order to identify the respective excerpt from the document we dynamically re-phrased questions to the following format:

    What is relevant to <identified factors>?

Finally, to identify the associated evidence, we first looked to the acknowledgements (if present) in the research paper itself, and if none were provided, posed the following question to the QA system:

    How was this study performed?

An example of the output of this process for Query 1 can be seen below:

In [None]:
pd.read_csv("/kaggle/input/qaoutput/qa_1.csv")

# Results

We built summary tables for documents retrieved for the following set of queries, and placed the resulting tables in the `output` folder.
* Effectiveness of case isolation/isolation of exposed individuals (i.e. quarantine)
* Effectiveness of community contact reduction
* Effectiveness of inter/inner travel restriction
* Effectiveness of school distancing
* Effectiveness of workplace distancing
* Effectiveness of a multifactorial strategy prevent secondary transmission
* Seasonality of transmission
* How does temperature and humidity affect the transmission of 2019-nCoV?
* Significant changes in transmissibility in changing seasons?
* Effectiveness of personal protective equipment (PPE)

In [None]:
pd.read_csv("/kaggle/input/summarytables/Effectiveness_of_case_isolation_isolation_of_exposed_individuals_quarantine.csv", index_col=0).fillna("")

### Summary of Findings and Areas of Improvement
   
  Here, we take a closer look at a few of the summary tables produced from the queries above: the factors identified, the associated excerpts, and any misplaced evidence.
  
  #### Query 1
  To provide an example of our results and findings, we will walk through the first query "Effectiveness of case isolation/isolation of exposed individuals (i.e. quarantine)". This query, like the others not shown here, was rephrased to optimize the QA system's performance. In our case, query 1 was modified to "What factors are related to quarantine and isolation?". At a cursory glance, many of the factors can be broadly categorized into:
* The query-indicated factors themselves (isolation, quarantine): isolation, distancing, and contact, or lack thereof
* Individuals: direction from a third party (including authorities/healthcare providers) 
* Economy: toll on non-disease factors like travel and construction 
* Science and Global Effects: references to biology, vaccinations, other diseases
* Statistical measures: median, mean, rates/proportions

In [None]:
pd.read_csv("/kaggle/input/qaoutput/qa_1.csv")[:21]

This query was successful: its results (associated excerpts and evidence) are relevant to quarantine and isolation, and they are appropriate to home in on for more information on the intended query. However, there are some minor exceptions. For example, some results apply more broadly to the disease affecting the world economy as a whole, instead of just isolation-specific material for COVID-19 (Result 14, Query 1, Factor):

    "individual.Traffic control and social distancing in each city!"

The associated excerpt/evidence here are more focused on economic factors such as country population and vehicular traffic during a pandemic. While interesting and tangentially related, and certainly a goal of such research projects as this, it may not be the most applicable result for the query in question. 

In addition, some key factors do not readily present as a category under which evidence can be summarized. One example is Result 20, Query 1, Factor: **'growth rate behaves slightly above the rate'**. This factor seems to be the opposite of what one would expect when paired with the parameters 'isolation' and 'quarantine', so we take a look at the associated excerpt:

*"The peak position is also sensitive to this parameter, as well as the curve variance. In relation to the peak, we observed that lower values of guarantee a postponement in the occurrence of the peak of infection, however, the duration of the effects of the epidemic is longer.The following table (Table 3) Table 3 . Characteristic values of peaks of infection in different scenarios (graph in Figure 3). The following comparison was made for the same value, varying the proportions of individuals in quarantine or isolation, also for a single scenario. In the graph of Figure 4 , we can see the behaviour of the peaks of infection for three specific values of and .As these values are higher, this means that the containment measures for tackling the pandemic have been more effective within the specific scenario"*


Ultimately, we conclude that this does provide relevant content for the intended query. However, the key factor could not be categorized without further considering the excerpt or evidence. 

#### Query 6




In [None]:
pd.read_csv("/kaggle/input/qaoutput/qa_6.csv")

Query 6 was rephrased as "What factors are related to preventing secondary transmission?". We noted that certain key factors are overwhelmingly relevant, homing in on particular biological factors, species/animals, and risk factors that play into secondary transmission.

We hypothesize that the phrase "preventing secondary transmission" is more specific than phrases included in other queries (e.g., in Query 1, "isolation/quarantine"), which partially accounts for the more natural results/key factor categories here. Query 9 ("transmission effects by season") is similar to Query 6 in that its wording is more targeted by nature.

#### Query 10 



In [None]:
pd.read_csv("/kaggle/input/qaoutput/qa_10.csv")[55:60]

Lastly, we provide another example that deviates slightly from what one might naturally think of when prompted with the question "What factors are related to personal protective equipment PPE?". However, it still feeds into our overall conclusion that the returned information is useful. The more open-ended nature of the phrase "PPE- Personal Protective Equipment", we believe, leads to the increased ambiguity of this result set. For example, consider the factor retrieved for Result 57, Query 10, Factor: “personal digital assistants (PDAs) were provided to”. The associated excerpt relates to patient experience, and the evidence ultimately links an appropriate window of information detailing PPE wear and consumption. However, the excerpt prominently details a small study relying more on patient satisfaction due to wait time, as opposed to a direct link to PPE. While patient satisfaction is important, and relates to PPE, it is not the expected answer.

# Conclusion: 

In conclusion, each of our tables returned relevant information pertaining to the queries. Despite some results that may not be as targeted as one would hope, relevant information is provided when querying this particular retrieval and QA system. We feel confident that the summary tables produced are able to provide necessary research material and background for someone looking for information on an aspect of COVID-19.

Our system has two important limitations: it does not provide satisfactory answers for the columns `Study Type` and `Influential`. For `Study Type`, which is a categorical variable, we suggest training a multi-label classification model. Similarly, since our QA system extracts answers from text in the input document, it is not suited to answering questions such as *"Was this factor found to be influential in the experiments/models?"*, which require the ability to synthesize answers and therefore go beyond extraction-based question answering. For this particular value, we suggest training a binary classification model on appropriate texts from scientific literature that study the impact of certain factors on the outcome of experiments.

However, despite these limitations, the system successfully brings together several components: information retrieval, document ranking, and question answering. These components work together to extract relevant papers and present information from the corpus in a concise, summarized manner.

# Code

In [None]:
# imports
from pyserini.search import pysearch
from transformers import XLNetTokenizer, XLNetModel, XLNetConfig, modeling_utils
import torch
import json
import pandas as pd
from scipy import spatial
import csv
import ahocorasick
import operator
import random

## Retrieval, Reranking and String Search

In [None]:
# define constants
NUM_DOCS = 100

# the paths should point to location where the lucene indexes generated by Anserini are stored
PARAGRAPH_INDEX = "./indexes/lucene-index-cord19-paragraph-2020-05-19" 
DOCUMENT_INDEX = "./indexes/lucene-index-cord19-full-text-2020-05-19"


QUERY = [
    "Effectiveness of case isolation isolation of exposed individuals quarantine",
    "Effectiveness of community contact reduction",
    "Effectiveness of inter inner travel restriction",
    "Effectiveness of school distancing",
    "Effectiveness of workplace distancing",
    "Effectiveness of a multifactorial strategy prevent secondary transmission",
    "Seasonality of transmission",
    "How does temperature and humidity affect the transmission of 2019-nCoV",
    "Significant changes in transmissibility in changing seasons",
    "Effectiveness of personal protective equipment PPE",
]


def get_relevant_documents(query, number_docs, index, print_results=False):
    """Retrieve top k documents from given index, for a query"""
    
    searcher = pysearch.SimpleSearcher(index)
    hits = searcher.search(query, number_docs)
    if print_results:
        for i in range(len(hits)):
            print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
    
    return hits


def get_xlnet_embedding(input_text):
    """create embedding for input_text using pretrained xlnet model"""
    
    tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
    model = XLNetModel.from_pretrained('xlnet-base-cased')
    model.eval()
    
    with torch.no_grad():
        input_ids = torch.tensor(tokenizer.encode(input_text, add_special_tokens=False)).unsqueeze(0)  # Batch size 1
        model_output = model(input_ids)
        summarized_output = model_output[0].mean(1)
    
    return summarized_output


def embed_document(doc):
    """create single embedding for a document"""
    
    contents = doc.contents.split("\n")
    text_vecs = []
    for c in contents:
        # ignore short paragraphs
        if len(c) > 50:
            text_vecs.append(get_xlnet_embedding(c))
    # concatenate paragraph vectors and calculate their mean
    X = torch.cat(text_vecs, dim=0)
    doc_mean = torch.mean(X,dim=0).reshape(1,768)
    return doc_mean


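def get_cosine_distance(vec_a, vec_b):
    """calculate cosine similarity (1 - cosine distance) between given vectors"""
    
    # flatten the (1, 768) tensors, since scipy expects 1-D arrays
    a = vec_a.detach().numpy().ravel()
    b = vec_b.detach().numpy().ravel()
    result = 1 - spatial.distance.cosine(a,b)
    return result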

In [None]:

# define dictionary to link category labels to associated words and phrases
PATTERN = {
    "systematic review": ["systematic review", "meta-analysis", "meta analysis"],
    "prospective observational study": ["prospective observational"], 
    "retrospective observational study": ["retrospective observational"],
    "observational study": ["observational study"],
    "cross-sectional study": ["cross-sectional study", "cross sectional study"],
    "case series": ["case series"],
    "expert review": ["expert review"],
    "editorial review": ["editorial review"],
    "simulation": ["simulation study", "simulation based study"],
    "model based": ["model based study", "modeling experiment", "mathematical model", "statistical model"]
}


def make_automations(input_pattern):
    """Make one AhoCorasick automaton per category label in the given pattern"""
    
    automatons = []
    for term, keywords in input_pattern.items():
        auto = ahocorasick.Automaton()
        for keyword in keywords:
            value = (len(keyword), term)
            auto.add_word(keyword, value)
        # finalize and store the automaton once all keywords for the label are added
        auto.make_automaton()
        automatons.append(auto)
    return automatons


def search_pattern_in_text(input_text_list, pattern):
    """given a list of input texts, find occurrences of the phrases in pattern"""

    # one set of match end positions per category label
    pattern_set = {}
    for key,_ in pattern.items():
        pattern_set[key] = set()
    
    auto_list = make_automations(pattern)

    for input_text in input_text_list:
        for auto in auto_list:
            # each match yields the (length, label) value stored in the automaton
            for end_index, (length, label) in auto.iter(input_text.lower()):
                if label in pattern_set:
                    pattern_set[label].add(end_index)

    return pattern_set


def get_study_type(pattern_dict):
    """infer study type based on the number of occurrences of phrases in PATTERN"""
    
    result = {}
    for key, value in pattern_dict.items():
        result[key] = len(value)

    # get the maximum value associated with a key
    max_value = max(result.items(), key=operator.itemgetter(1))[1]

    # if no matches were found, return empty list
    if max_value == 0:
        return []

    # else, return list with all keys that have the maximum value
    return [key for key, value in result.items() if value == max_value]


In [None]:
def write_csv_files_for_retrived(query_list):
    """function to create csv file for documents retrieved by queries in query_list """
    
    for i,q_text in enumerate(query_list):
        out_filename = "doc_scores_" + str(i+1) + "_retrieved.csv"
        docs_retrieved = get_relevant_documents(q_text, NUM_DOCS, DOCUMENT_INDEX, True)

        with open(out_filename,'w') as fileobj:
            newFile = csv.writer(fileobj)
            fields = ['query', 'docid', 'score', 'body_text', 'study_type']
            newFile.writerow(fields)
            # the query embedding is not needed here; only BM25 scores are written
            for doc in docs_retrieved:
                doc_json = json.loads(doc.raw)
                if doc_json['has_full_text']:
                    body_text = doc_json['body_text']
                else:
                    body_text = 'Not Available'
                contents = doc.contents.split("\n")
                study_type = get_study_type(search_pattern_in_text(contents, PATTERN))
                row = [q_text, doc.docid, doc.score, body_text, study_type if study_type else None]
                
                newFile.writerow(row)

In [None]:
def write_csv_files_for_sim_scores(query_list):
    """function to create csv file with cosine similarity scores for queries in query_list """

    
    for i,q_text in enumerate(query_list):
        print ("query: %s"% q_text)
        out_filename = "sim_scores_" + str(i+1) + "_retrieved.csv"
        query_vec = get_xlnet_embedding(q_text)
        docs_retrieved = get_relevant_documents(q_text, NUM_DOCS, DOCUMENT_INDEX, True)
        
        doc_ids = []
        with open(out_filename,'w') as fileobj:
            newFile = csv.writer(fileobj)
            fields = ['query', 'docid', 'score', 'sim_score']
            newFile.writerow(fields)
            for doc in docs_retrieved:
                # to handle documents appearing multiple times in the retrieved list
                # one of the instances where this happens is when same paper is published in 
                # multiple journals
                if doc.docid in doc_ids:
                    print ("docid %s already present, skipping it."%doc.docid)
                    continue
                else:
                    doc_ids.append(doc.docid)
                    doc_vector = embed_document(doc)
                    doc_sim_score = get_cosine_distance( query_vec, doc_vector)
                row = [q_text, doc.docid, doc.score, doc_sim_score]
                newFile.writerow(row)

In [None]:
write_csv_files_for_retrived(QUERY)

In [None]:
write_csv_files_for_sim_scores(QUERY)

## QA System

## Training Data Generation and SQuAD finetuning

### Purpose/Objective

This code segment leverages a [manually labeled Question Answering dataset](https://github.com/deepset-ai/COVID-QA/blob/master/data/question-answering/200423_covidQA.json) on Covid research papers to generate training and evaluation data for finetuning a [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) model for automated QA.

### Reading in the Manually Labeled Data

The manually labeled dataset we are utilizing can be found [here](https://github.com/deepset-ai/COVID-QA/blob/master/data/question-answering/200423_covidQA.json).

We are reading the data in from this json file into a single list which will be used to create train and eval data subsets.

In [None]:
qa = "/kaggle/input/200423_covidQA.json" # Adjust to point to data file

with open(qa, 'r') as f:
    data = json.load(f)
data = [item for topic in data['data'] for item in topic['paragraphs']]

### Data Manipulation and Reformatting

This section reformats and restructures the QA data to work around the 512-token limit of BERT-based models.
We split each document into 500-word slices that include the respective answer.
Ultimately, instead of parsing the entire document, we only process chunks of the document which include a valid answer according to the manually labeled dataset.

From experimentation, we have found that the system performs best when the answer to the question
is towards the end of the window of the text and have structured it accordingly.

Example:

(text)
(text)
(text)
(text)
(ANSWER)

We start from the word after the listed answer and construct each window as encompassing the 499 words before it, creating our 500 word window.

In [None]:
new_data = []

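def get_indices(string, substring, actual):
    """Return (start, end) word indices of the first occurrence of substring in string.

    string and substring are lists of words; actual (the raw answer text) is unused.
    """
    target = " ".join(substring)
    for i in range(len(string)):
        ss = " ".join(string[i:i+len(substring)])
        if target in ss:
            return (i, i+len(substring))

    # as before, fail with an IndexError when the answer is not found
    raise IndexError("answer not found in context")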


for datum in data:
    # Reformatting Context
    orig_context = datum['context'].replace("\n\n", " ").replace("\n", " ").replace("  ", "").replace(u"\u202f", "")
    split_context = orig_context.split(" ")
    orig_qas = datum['qas']

    for qas in orig_qas:
        for answer in qas['answers']:

            # Reformatting Answer
            a = answer['text'].replace("\n\n", " ").replace("\n", " ").replace("  ", "").replace(u"\u202f", "")
            if a[0] == " ":
                a = a[1::]
            if a[-1] == " ":
                a = a[0:-1]

            indices = get_indices(split_context, a.split(), a)
            start = indices[1] - 499
            end = indices[1] + 1
            if start < 0:
                end = end + (0-start)
                start = 0

            context_subset = " ".join(split_context[start:end])

            new_data.append({
                'title': str(datum['document_id']),
                'paragraphs': [{
                    'context': context_subset,
                    'document_id': datum['document_id'],
                    'qas': [{
                        'question': qas['question'],
                        'id': qas['id'],
                        'answers': [{
                            'text': a,
                            'answer_start': context_subset.find(a),
                        }],
                        'is_impossible': False
                    }]
                }]
            })
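
The window arithmetic above can be checked on a small example. The helper below mirrors the `start`/`end` computation with a configurable window size; the name `answer_window` and the 20-word toy context are assumptions for illustration.

```python
def answer_window(words, answer_end, size=500):
    """Return up to `size` words containing index `answer_end`, placed as late in the window as possible."""
    start = answer_end - (size - 1)
    end = answer_end + 1
    if start < 0:
        # Not enough preceding words: shift the window right to keep its size.
        end += -start
        start = 0
    return words[start:end]


words = [f"w{i}" for i in range(20)]
late = answer_window(words, answer_end=12, size=5)   # enough context before the answer
early = answer_window(words, answer_end=2, size=5)   # answer near the start of the document
```

When the answer sits near the start of the document, the window is extended to the right instead, so the answer is no longer at the end of the slice.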

### Subsetting our Data to Create a Train/Eval Split

Since the data is originally ordered with questions and answers from the same article and same parts of the article together,
we shuffle the data.

We also create a 65/35 train/eval split of our data: 65% is used to train/finetune our model, and the other 35% is used to evaluate its performance.

Finally, we write these two subsets to corresponding json files which will be used by the SQuAD finetuning script:

In [None]:
shuffled_data = new_data
random.shuffle(shuffled_data)

train = shuffled_data[0:int(len(shuffled_data)*0.65)]
test = shuffled_data[int(len(shuffled_data)*0.65)::]

with open('qa_train.json', 'w') as trainfile:
    json.dump({"data": train}, trainfile)

with open('qa_test.json', 'w') as testfile:
    json.dump({"data": test}, testfile)
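
Because `random.shuffle` is not seeded above, the split differs from run to run. If a reproducible split is wanted, one option is the sketch below; the function name `split_train_eval` and the seed value are assumptions and are not part of the original pipeline.

```python
import random


def split_train_eval(items, train_fraction=0.65, seed=0):
    """Shuffle a copy of items with a fixed seed and split it into train/eval lists."""
    shuffled = list(items)
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


train_items, eval_items = split_train_eval(range(100))
```

Shuffling a copy also avoids reordering the original list in place.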

## Using the Training Data and running the SQuAD FineTuner

1) The SQuAD finetuning script needs to be downloaded locally and can be found here:
    - https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py

2) As an alternative to `--model_name_or_path deepset/roberta-base-squad2`, one can also run this by pointing to a local `roberta_model/` folder which includes the files found here:
    - https://huggingface.co/deepset/roberta-base-squad2#list-files

3) The sample command below is based on the training filenames specified above and should be modified to match any changes.
    The command also lists the output directory in which the new finetuned model and associated files will be generated/saved.
    This directory is essential to running the QA system in the next step.

In [None]:
!python /kaggle/input/run_squad.py --model_type roberta --model_name_or_path deepset/roberta-base-squad2 --do_train --do_eval --train_file /kaggle/output/qa_train.json --predict_file /kaggle/output/qa_test.json --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 512 --output_dir /kaggle/output/finetuned_roberta_output/ --max_answer_length 512 --max_query_length 512

## Running an Automated Question-Answering System

### Purpose/Objective

In this section, we utilize the finetuned SQuAD model to build an automated Question Answering system.

In [None]:
from ast import literal_eval
import csv
import operator
import os
from simpletransformers.question_answering import QuestionAnsweringModel as qam
import sys
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import torch

csv.field_size_limit(sys.maxsize)

### Initializing Input Data

The variable 'roberta_finetuned_model' points to the output directory specified in the run_squad.py command.
This should be adjusted accordingly to match the filepath in which the model output is located.

The filepaths in the 'queries' list point to CSV files of Covid research papers ranked 1-100 by relevance to their associated question.

'query_1' documents are relevant to the first question below, and so on.

In [None]:
roberta_finetuned_model = "/kaggle/output/finetuned_roberta_output"

input_path = "/kaggle/input/retrievaloutput/"
query_1 = input_path + "doc_scores_1_retrieved.csv"
query_2 = input_path + "doc_scores_2_retrieved.csv"
query_3 = input_path + "doc_scores_3_retrieved.csv"
query_4 = input_path + "doc_scores_4_retrieved.csv"
query_5 = input_path + "doc_scores_5_retrieved.csv"
query_6 = input_path + "doc_scores_6_retrieved.csv"
query_7 = input_path + "doc_scores_7_retrieved.csv"
query_8 = input_path + "doc_scores_8_retrieved.csv"
query_9 = input_path + "doc_scores_9_retrieved.csv"
query_10 = input_path + "doc_scores_10_retrieved.csv"

queries = [
    query_1, query_2, query_3, query_4, query_5, query_6, query_7, query_8, query_9, query_10
]
questions = [
    "What factors are related to quarantine and isolation?",
    "What factors are related to community contact reduction?",
    "What factors are related to travel restrictions?",
    "What factors are related to school distancing?",
    "What factors are related to workplace distancing?",
    "What factors are related to preventing secondary transmission?",
    "What factors are related to seasonality of transmission?",
    "What factors are related to temperature, humidity, and transmission?",
    "What factors are related to changing transmission by season?",
    "What factors are related to personal protective equipment PPE?"
]

### Initialize Finetuned QA Pipeline and Output Directory

In [None]:
nlp = pipeline('question-answering', model=roberta_finetuned_model, tokenizer='deepset/roberta-base-squad2')

output_dir = "/kaggle/output/qa_csv_output/"
# create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

### Output Formatting Helper Function

In [None]:
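def write_csv_table(data: list, query: str):
    """
    Writes the QA results for one query to `{output_dir}{query}.csv`.

    Each item in `data` is a tuple of
    (question, document ID, factor answer, excerpt answer, evidence answer),
    where the last three are dicts with an 'answer' key.
    """
    # newline="" lets the csv module control line endings itself
    with open(f"{output_dir}{query}.csv", "w", newline="") as fileobj:
        writer = csv.writer(fileobj)
        fields = ['Query/Question', 'Document ID', "Key Factor", "Excerpt", "Evidence"]
        writer.writerow(fields)

        for datum in data:
            writer.writerow([
                datum[0], # Question
                datum[1], # Document ID
                datum[2]['answer'], # Key Factor
                datum[3]['answer'], # Excerpt
                datum[4]['answer'], # Evidence
            ])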

### QA System Execution

This cell runs the QA system. For each retrieved document, it poses a series of questions to get answers regarding:
- Relevant factors associated with the query/question being analyzed
- Excerpt(s) related to the identified factors
- Evidence pertaining to the Factors and Excerpts found
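
The chained questioning can be illustrated with a minimal sketch. It assumes only that the QA model is a callable taking a `{'question', 'context'}` dict and returning a dict with `answer`, `score`, `start`, and `end` keys, as the pipeline used in this notebook does. The names `ask_chain` and `toy_qa` are hypothetical and exist only for this illustration, and the sketch leaves out the acknowledgements shortcut and the sentence-window widening applied in the cell below.

```python
from typing import Callable, Dict, List

# Assumed interface: a callable that maps {'question', 'context'} to an answer dict
QAModel = Callable[[Dict[str, str]], Dict]


def ask_chain(qa: QAModel, question: str, context: str) -> Dict[str, Dict]:
    """Ask for a factor, then an excerpt about that factor, then evidence."""
    factor = qa({'question': question, 'context': context})
    # The second question is built from the answer to the first one
    excerpt = qa({
        'question': f"What is relevant to {factor['answer']}?",
        'context': context,
    })
    evidence = qa({'question': "How was this study performed?", 'context': context})
    return {'factor': factor, 'excerpt': excerpt, 'evidence': evidence}


asked: List[str] = []  # records every question the toy model receives


def toy_qa(inputs: Dict[str, str]) -> Dict:
    """Toy stand-in for a real model: always answers with the first sentence."""
    asked.append(inputs['question'])
    context = inputs['context']
    end = context.find(". ")
    if end == -1:
        end = len(context)
    return {'answer': context[:end], 'score': 1.0, 'start': 0, 'end': end}


result = ask_chain(
    toy_qa,
    "What factors are related to travel restrictions?",
    "Border closures delayed spread. We fitted a simple model.",
)
print(asked[1])
```

Swapping `toy_qa` for the finetuned pipeline should leave `ask_chain` unchanged, since it depends only on the assumed input and output shapes.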

In [None]:
for data_file, question in zip(queries, questions):

    answers = []
    doc_count = 0

    data = []

    with open(data_file, "r") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            if row[0] == 'query':
                # Header row: keep it unchanged
                data.append(row)
            else:
                # Skip documents whose text is not available
                if row[3] == "Not Available":
                    continue
                data.append([row[0], row[1], float(row[2]), literal_eval(row[3])])
    headers = data[0]
    data = data[1:]

    # Process each document and predict the best answer for it
    for doc in data:

        acknowledgements = []

        # Aggregate the document's separated text sections into one string
        full_text = ""
        for text in doc[3]:
            if text['section'] == "Acknowledgments":
                acknowledgements.append(text['text'])
            full_text += text['text']

        doc_count += 1


        # Get Answer for Relevant Factors
        best_factor = nlp({
            'question': question,
            'context': full_text
        })

        best_excerpt = nlp({
            'question': f"What is relevant to {best_factor['answer']}?",
            'context': full_text
        })

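        # Widen the answer span to a window of surrounding sentences
        # (roughly one sentence before the answer and four after it).
        # sentence_index counts the sentences up to the end of the answer.
        sentence_index = len(full_text[0:best_excerpt['end']].split(". "))
        sentences = full_text.split(". ")
        sent_start = sentence_index - 2 if sentence_index >= 2 else 0
        # Character offset of the first sentence in the window
        new_start = full_text.find(sentences[sent_start])
        sent_end = sentence_index + 3 if sentence_index < len(sentences)-3 else len(sentences)-1
        new_end = new_start + len(". ".join(sentences[sent_start:sent_end+1]))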

        # Get Excerpt associated with identified Factors
        excerpt = {
            'score': best_excerpt['score'],
            'start': new_start,
            'end': new_end,
            'answer': full_text[new_start:new_end]
        }
        
        # Identify Evidence related to Excerpt and Factors
        if acknowledgements:
            proof = "".join(acknowledgements)
            proof_start = full_text.find(proof)

            evidence = {
                'answer': proof,
                'start': proof_start,
                'end': proof_start+len(proof)
            }
        else:
            evidence = nlp({
                'question': "How was this study performed?",
                'context': full_text
            })

            proof_idx = len(full_text[0:evidence['end']].split(". "))
            proof_start_idx = proof_idx - 2 if proof_idx >=2 else 0
            proof_start = full_text.find(sentences[proof_start_idx])
            proof_end_idx = proof_idx + 3 if proof_idx < len(sentences)-3 else len(sentences)-1
            proof_end = proof_start + len(". ".join(sentences[proof_start_idx:proof_end_idx+1]))

            evidence = {
                'score': evidence['score'],
                'start': proof_start,
                'end': proof_end,
                'answer': full_text[proof_start:proof_end]
            }

        # Store Factors, Excerpt, and Evidence into a list
        answers.append((question, doc[1], best_factor, excerpt, evidence))
        
    # Write CSV Output
    write_csv_table(answers, question.replace(" ", "_"))

### Create Summary Tables

Lastly, the intermediate tables from the steps outlined above are combined with the metadata table to create the final summary table in the expected output format.
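
As a rough illustration of this merge, the sketch below joins four small made-up tables on the document ID and sorts the result by similarity score. The column names mirror those used in this notebook, but every value is toy data invented for the example. The default inner join drops any document that is missing from one of the tables.

```python
import pandas as pd

# Toy stand-ins for the intermediate tables (all values are made up)
retrieved = pd.DataFrame({"docid": ["a", "b"], "study_type": ["modeling", "review"]})
reranked = pd.DataFrame({"docid": ["a", "b"], "sim_score": [0.2, 0.9]})
metadata = pd.DataFrame({
    "cord_uid": ["a", "b", "c"],
    "title": ["Paper A", "Paper B", "Paper C"],
})
qa = pd.DataFrame({"Document ID": ["a", "b"], "Key Factor": ["masks", "quarantine"]})

# Chain inner joins on the document ID
docs = retrieved.merge(reranked, on="docid")
docs = docs.merge(metadata, left_on="docid", right_on="cord_uid")
summary = docs.merge(qa, left_on="docid", right_on="Document ID")

# Most relevant documents first, then rename to the output column names
summary = summary.sort_values("sim_score", ascending=False).reset_index(drop=True)
summary = summary.rename(columns={"title": "Study", "Key Factor": "Factors"})
print(summary[["Study", "Factors", "sim_score"]])
```

Note that "Paper C" never appears in `summary`, because it has no row in the retrieval, re-ranking, or QA tables.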

In [None]:
def create_summary_table(path_for_retrived, path_for_reranked, path_for_qa, metadata):
    """Create the summary table by merging the intermediate tables with the metadata."""
    
    # Load only the columns needed from each intermediate table
    retrieved = pd.read_csv(path_for_retrived, usecols=['docid', 'study_type'])
    reranked = pd.read_csv(path_for_reranked, usecols=['docid', 'sim_score'], index_col=0)
    qa = pd.read_csv(path_for_qa, usecols = ['Document ID', 'Key Factor', 'Excerpt', 'Evidence'])
    metadata = pd.read_csv(metadata)

    # Join retrieval output, re-ranking scores, metadata, and QA answers on the document ID
    docs = pd.merge(left=retrieved, right=reranked, left_on=['docid'], right_on=['docid'])
    docs_metadata = pd.merge(left=docs, right=metadata, left_on=['docid'], right_on=['cord_uid'])
    # Placeholder columns expected by the output format
    qa = qa.assign(Influential="", added_on="")
    summary_table = pd.merge(left=docs_metadata, right=qa, left_on=['docid'], right_on=['Document ID'])
    
    
    # Sort summary table by similarity score to get most relevant documents at top
    # Select only relevant columns after sorting
    output_columns = [
        'publish_time', 'title', 'url', 'source_x', 'study_type', 'Key Factor',
        'Influential', 'Excerpt', 'Evidence', 'added_on', 'doi', 'Document ID'
    ]
    final_summary_table = summary_table.sort_values('sim_score', ascending=False).reset_index(drop=True)[output_columns]
    # Rename to the expected output headers; the source column is 'Key Factor' (singular)
    final_summary_table = final_summary_table.rename(
        columns={
            "publish_time": "Date",
            "title": "Study",
            "url": "Study Link",
            "source_x": "Journal",
            "study_type": "Study Type",
            "Key Factor": "Factors",
            "Evidence": "Measure of Evidence",
            "added_on": "Added on",
            "doi": "DOI",
            "Document ID": "CORD_UID"
        }
    )
    return final_summary_table

In [None]:
# Iterate through the list of queries and create a CSV file for each query
for i, q_text in enumerate(QUERY):
    
    retrieved_path = f"/kaggle/input/retrievaloutput/doc_scores_{i+1}_retrieved.csv"
    reranked_path = f"/kaggle/input/rerankeddocs/sim_scores_{i+1}_retrieved.csv"
    qa_path = f"/kaggle/input/qaoutput/qa_{i+1}.csv"
    metadata_path = "/kaggle/input/CORD-19-research-challenge/metadata.csv"
    
    output_file_name = q_text.replace(" ", "_") + ".csv"
    output_df = create_summary_table(retrieved_path, reranked_path, qa_path, metadata_path)
    output_df.to_csv(output_file_name)
