COVID-19 Open Research Dataset (CORD-19) Analysis
======

COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, about COVID-19 and the coronavirus family of viruses. The dataset can be found on [Semantic Scholar](https://pages.semanticscholar.org/coronavirus-research) and there is a research challenge on [Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

BERT (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks: https://github.com/google-research/bert. We use BioBERT, a derivative of the BERT model, to answer the set of questions listed in the challenge.

The first step is to compute sentence embeddings with FastText in gensim and find a context for each question.

The second step is to feed each question and its context to the BioBERT model.

### Installing the Gensim library, which is used to load FastText embeddings for sentence vectors to find context

Note: the version must be pinned, as 3.8.0 had compatibility issues.

In [None]:
! pip install gensim==3.4.0

In [None]:
import os
import pandas as pd
from gensim.models.fasttext import FastText as FT_gensim
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from uuid import uuid4
import torch
import re
import json
from tqdm import tqdm
import datetime
import pprint
import random
import string
import sys

### Indexing all 45,000 research papers into pandas dataframe with Title, Abstract and Full text body

In [None]:
# main folder of Covid19 Dataset
dirs = ["biorxiv_medrxiv", "comm_use_subset", "custom_license", "noncomm_use_subset"]
docs = []
base_path = "/kaggle/input/CORD-19-research-challenge"
for d in dirs:
    for file in tqdm(os.listdir(f"{base_path}/{d}/{d}/pdf_json")):
        file_path = f"{base_path}/{d}/{d}/pdf_json/{file}"
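        # Close the file handle as soon as the JSON is parsed
        with open(file_path, "rb") as f:
            json_file = json.load(f)

        title = json_file["metadata"]["title"]
        # Some files have no "abstract" section; fall back to an empty string
        try:
            abstract = "\n\n".join([text["text"] for text in json_file["abstract"]])
        except (KeyError, TypeError):
            abstract = ""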
        full_text = "\n\n".join([text["text"] for text in json_file["body_text"]])
        docs.append([title, abstract, full_text])
# Pandas Dataframe containing the title, abstract and body text
papers_df = pd.DataFrame(docs, columns = ["title", "abstract", "full_text"])
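
The loader above relies on three fields of each parsed JSON file: `metadata.title`, a list of `abstract` paragraphs, and a list of `body_text` paragraphs. A minimal sketch of that extraction on a hand-made record follows; the sample record and the helper name `extract_fields` are illustrative assumptions, not part of the dataset or the notebook.

```python
def extract_fields(record):
    """Return (title, abstract, full_text) from a CORD-19-style dict."""
    title = record.get("metadata", {}).get("title", "")
    # "abstract" may be missing; fall back to an empty list
    abstract = "\n\n".join(p["text"] for p in record.get("abstract", []))
    full_text = "\n\n".join(p["text"] for p in record.get("body_text", []))
    return title, abstract, full_text


# Hand-made record that mimics the fields used above (assumed shape)
sample = {
    "metadata": {"title": "A sample paper"},
    "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}],
}
fields = extract_fields(sample)
print(fields)
```

Using `dict.get` with a default avoids the `try`/`except` around the abstract, at the cost of silently accepting records that lack a title.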

### Eliminate empty papers

In [None]:
papers_df = papers_df.dropna()
papers_df

### Upload Pretrained FastText Embeddings Model

In [None]:
# Upload the model 
model_load_name = 'final_model_gensim.pt'
path = F"/kaggle/input/similaritymodels2new/Similarity/FinalModel/{model_load_name}"
model = FT_gensim.load(path)

### Function to compare similarity between the input query and each paper title or abstract

In [None]:
def token_similarity(token1, token2): 
    """
    calculate similarity between sentences based on their embeddings
        ----------------
    Args : 
        token1 : String
        token2 : String
        ---------------
    returns:
        float, the cosine similarity between the two embeddings (0 if it cannot be computed)
    """
    try :
        # keep only letters, digits and whitespace, then lowercase
        token1 = re.sub(r'[^a-zA-Z0-9\s]', '' , token1).lower()
        token2 = re.sub(r'[^a-zA-Z0-9\s]', '' , token2).lower()
        return model.similarity(token1, token2)
    except Exception: 
        return 0

def get_context(query, search_on, model = model, df = papers_df):
    """
    Apply the similarity function between the query and every paper abstract or every paper title, and return the paper closest to the query.
        ----------------
    Args : 
        query : String
        search_on : String, either "title" or "abstract"
        model : Gensim model Object
        df : pandas Dataframe
        ---------------
    returns:
        Tuple containing the full_text, title and similarity score of the paper closest to the query.
    """
    
    if search_on in ["title", "abstract"]:
        df["similarity_to_query"] = df[search_on].apply(lambda x : token_similarity(x, query))
        result = df.nlargest(1, ['similarity_to_query']).reset_index(drop = True)
        return result["full_text"][0].replace("\n", " "), result["title"][0], result["similarity_to_query"][0]
    else :
        raise Exception("search_on argument should be in ['title', 'abstract']")
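
The retrieval step boils down to scoring every title (or abstract) against the query and keeping the highest score. Below is a minimal sketch of that idea with plain cosine similarity over toy vectors. The vectors and the names `cosine` and `best_match` are made up for illustration; the notebook itself obtains its scores from the FastText model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(query_vec, candidates):
    """Return (name, score) of the candidate closest to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return max(scored, key=lambda pair: pair[1])


# Toy 3-dimensional "embeddings" (hypothetical values)
titles = {
    "risk factors": [0.9, 0.1, 0.0],
    "vaccine trials": [0.0, 0.2, 0.9],
}
name, score = best_match([1.0, 0.0, 0.0], titles)
print(name, round(score, 3))
```

Taking the single best score mirrors `df.nlargest(1, ...)` above; returning the top few candidates instead would be a small change to `best_match`.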

### Test the query

In [None]:
query = "what are risk factors COVID-19?"
search_on = "title"
import time
t1 = time.time()
context = get_context(query, search_on, model, papers_df)
t2 = time.time()
print(f"query took {t2-t1} seconds")
print(context)

# BioBERT: a BERT model trained on a biomedical corpus

BioBERT is a biomedical language representation model designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. The code for fine-tuning it is on GitHub: https://github.com/dmis-lab/biobert

We use the question-answering side of the BioBERT model to find answers for the COVID-19 challenge.

### Task-1 Questions to be answered

In [None]:

question_list = """ What is known about transmission, incubation, and environmental stability
What do we know about natural history, transmission, and diagnostics for the virus 
What have we learned about infection prevention and control
What is Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.
What is Prevalence of asymptomatic shedding and transmission (e.g., particularly children).
What is Seasonality of transmission.
What is Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).
What is Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).
What is Persistence of virus on surfaces of different materials (e.g., copper, stainless steel, plastic).
What is Natural history of the virus and shedding of it from an infected person
What is Implementation of diagnostics and products to improve clinical processes
What is Disease models, including animal models for infection, disease and transmission
What is Tools and studies to monitor phenotypic change and potential adaptation of the virus
What is Immune response and immunity
What is Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings
What is Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings
What is Role of the environment in transmission
What are the potential risks factors of Covid-19
What is the influence of Smoking, pre-existing pulmonary disease on Covid-19
Will Co-infections and other co-morbidities co-existing respiratory/viral infections make the virus more transmissible or virulent 
Risk factors associated with Neonates and pregnant women
What are the risk factors associated with Socio-economic and behavioral factors to understand the economic impact of Covid19.
What are the Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
What is the Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
What is Susceptibility of populations for covid19 risk factors
What are Public health mitigation measures that could be effective for control of covid19
What do we know about virus genetics, origin, and evolution
What do we know about the virus origin and management measures at the human-animal interface
Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time.
Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged.
Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.
Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.
Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.
Experimental infections to test host range for this pathogen.
Animal host(s) and any evidence of continued spill-over to humans
Socioeconomic and behavioral risk factors for this spill-over
Sustainable risk reduction strategies
What do we know about vaccines and therapeutics
What has been published concerning research and development and evaluation efforts of vaccines and therapeutics
What is Effectiveness of drugs being developed and tried to treat COVID-19 patients.
What is Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.
Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
Exploration of use of best animal models and their predictive value for a human vaccine.
Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.
Efforts targeted at a universal coronavirus vaccine.
Efforts to develop animal models and standardize challenge studies
Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers
Approaches to evaluate risk for enhanced disease after vaccination
Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models [in conjunction with therapeutics]
What has been published about medical care
What has been published concerning surge capacity and nursing homes
What has been published concerning efforts to inform allocation of scarce resources
What do we know about personal protective equipment
What has been published concerning alternative methods to advise on disease management
What has been published concerning processes of care? What do we know about the clinical characterization and management of the virus
Resources to support skilled nursing facilities and long term care facilities.
Mobilization of surge medical staff to address shortages in overwhelmed communities
Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies
Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients
Outcomes data for COVID-19 after mechanical ventilation adjusted for age.
Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, possible cardiomyopathy and cardiac arrest.
Application of regulatory standards (EUA, CLIA) and ability to adapt care to crisis standards of care level.
Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks.
Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries.
Guidance on the simple things people can do at home to take care of sick people and manage disease.
Oral medications that might potentially work.
Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually.
Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes.
Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials
Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials
Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)
What do we know about the effectiveness of non-pharmaceutical interventions
What is known about equity and barriers to compliance for non-pharmaceutical interventions
How can we use Guidance on ways to scale up NPIs in a more coordinated way (e.g., establish funding, infrastructure and authorities to support real time, authoritative (qualified participants) collaboration with all states to gain consensus on consistent guidance and to mobilize resources to geographic areas where critical shortfalls are identified) to give us time to enhance our health care delivery system capacity to respond to an increase in cases.
Rapid design and execution of experiments to examine and compare NPIs currently being implemented. DHS Centers for Excellence could potentially be leveraged to conduct these experiments.
Rapid assessment of the likely efficacy of school closures, travel bans, bans on mass gatherings of various sizes, and other social distancing approaches.
Methods to control the spread in communities, barriers to compliance and how these vary among different populations.
Models of potential interventions to predict costs and benefits that take account of such factors as race, income, disability, age, geographic location, immigration status, housing status, employment status, and health insurance status.
Policy changes necessary to enable the compliance of individuals with limited resources and the underserved with NPIs.
Research on why people fail to comply with public health advice, even if they want to do so (e.g., social or financial costs may be too high).
Research on the economic impact of this or any pandemic. This would include identifying policy and programmatic alternatives that lessen/mitigate risks to critical government services, food distribution and supplies, access to critical household supplies, and access to health diagnoses, treatment, and needed care, regardless of ability to pay.
What do we know about diagnostics and surveillance
What has been published concerning systematic, holistic approach to diagnostics (from the public health surveillance perspective to being able to predict clinical outcomes)
How widespread current exposure is to be able to make immediate policy recommendations on mitigation measures. Denominators for testing and a mechanism for rapidly sharing that information, including demographics, to the extent possible. Sampling methods to determine asymptomatic disease (e.g., use of serosurveys (such as convalescent samples) and early detection of disease (e.g., use of screening of neutralizing antibodies such as ELISAs).
Efforts to increase capacity on existing diagnostic platforms and tap into existing surveillance platforms.
Recruitment, support, and coordination of local expertise and capacity (public, private—commercial, and non-profit, including academic), including legal, ethical, communications, and operational issues.
National guidance and guidelines about best practices to states (e.g., how states might leverage universities and private laboratories for testing purposes, communications to public health officials and the public).
Development of a point-of-care test (like a rapid influenza test) and rapid bed-side tests, recognizing the tradeoffs between speed, accessibility, and accuracy.
Rapid design and execution of targeted surveillance experiments calling for all potential testers using PCR in a defined area to start testing and report to a specific entity. These experiments could aid in collecting longitudinal samples, which are critical to understanding the impact of ad hoc local interventions (which also need to be recorded).
Separation of assay development issues from instruments, and the role of the private sector to help quickly migrate assays onto those devices.
Efforts to track the evolution of the virus (i.e., genetic drift or mutations) and avoid locking into specific reagents and surveillance/detection schemes.
Latency issues and when there is sufficient viral load to detect the pathogen, and understanding of what is needed in terms of biological and environmental sampling.
Use of diagnostics such as host response markers (e.g., cytokines) to detect early disease or predict severe disease progression, which would be important to understanding best clinical practice and efficacy of therapeutic interventions.
Policies and protocols for screening and testing.
Policies to mitigate the effects on supplies associated with mass testing, including swabs and reagents.
Technology roadmap for diagnostics.
Barriers to developing and scaling up new diagnostic tests (e.g., market forces), how future coalition and accelerator models (e.g., Coalition for Epidemic Preparedness Innovations) could provide critical funding for diagnostics, and opportunities for a streamlined regulatory environment.
New platforms and technology (e.g., CRISPR) to improve response times and employ more holistic approaches to COVID-19 and future diseases.
Coupling genomics and diagnostic testing on a large scale.
Enhance capabilities for rapid sequencing and bioinformatics to target regions of the genome that will allow specificity for a particular variant.
Enhance capacity (people, technology, data) for sequencing with advanced analytics for unknown pathogens, and explore capabilities for distinguishing naturally-occurring pathogens from intentional.
One Health surveillance of humans and potential sources of future spillover or ongoing exposure for this organism and future pathogens, including both evolutionary hosts (e.g., bats) and transmission hosts (e.g., heavily trafficked and farmed wildlife and domestic food and companion species), inclusive of environmental, demographic, and occupational risk factors.
What has been published concerning ethical considerations for research
What has been published concerning social sciences at the outbreak response
Efforts to articulate and translate existing ethical principles and standards to salient issues in COVID-2019
Efforts to embed ethics across all thematic areas, engage with novel ethical issues that arise and coordinate to minimize duplication of oversight
Efforts to support sustained education, access, and capacity building in the area of ethics
Efforts to establish a team at WHO that will be integrated within multidisciplinary research and operational platforms and that will connect with existing and expanded global networks of social sciences.
Efforts to develop qualitative assessment frameworks to systematically collect information related to local barriers and enablers for the uptake and adherence to public health measures for prevention and control. This includes the rapid identification of the secondary impacts of these measures. (e.g. use of surgical masks, modification of health seeking behaviors for SRH, school closures)
Efforts to identify how the burden of responding to the outbreak and implementing public health measures affects the physical and psychological health of those providing care for Covid-19 patients and identify the immediate needs that must be addressed.
Efforts to identify the underlying drivers of fear, anxiety and stigma that fuel misinformation and rumor, particularly through social media.
What has been published about information sharing and inter-sectoral collaboration
What has been published about data standards and nomenclature
What has been published about governmental public health
What do we know about risk communication
What has been published about communicating with high-risk populations
What has been published to clarify community measures
What has been published about equity considerations and problems of inequity
Methods for coordinating data-gathering with standardized nomenclature.
Sharing response information among planners, providers, and others.
Understanding and mitigating barriers to information-sharing.
How to recruit, support, and coordinate local (non-Federal) expertise and capacity relevant to public health emergency response (public, private, commercial and non-profit, including academic).
Integration of federal/state/local public health surveillance systems.
Value of investments in baseline public health response infrastructure preparedness
Modes of communicating with target high-risk populations (elderly, health care workers).
Risk communication and guidelines that are easy to understand and follow (include targeting at risk populations’ families too).
Communication that indicates potential risk of disease to all population groups.
Misunderstanding around containment and mitigation.
Action plan to mitigate gaps and problems of inequity in the Nation’s public health capability, capacity, and funding to ensure all citizens in need are supported and can access information, surveillance, and treatment.
Measures to reach marginalized and disadvantaged populations.
Data systems and research priorities and agendas incorporate attention to the needs and circumstances of disadvantaged populations and underrepresented minorities.
Mitigating threats to incarcerated people from COVID-19, assuring access to information, prevention, diagnosis, and treatment.
Understanding coverage policies (barriers and opportunities) related to testing, treatment, and care
"""

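# Split into one question per line, dropping blank entries such as the one after the final newline
question_list = [q.strip() for q in question_list.split("\n") if q.strip()]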
qa_dataframe = pd.DataFrame({"Questions": question_list})
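
Splitting a triple-quoted block on newlines keeps blank entries, for instance the empty string after the final newline, and a blank query would still be sent through the context search. A small sketch of a stricter split follows; the helper name `split_questions` and the two-line sample are illustrative only.

```python
def split_questions(block):
    """Split a newline-separated block into stripped, non-empty lines."""
    return [line.strip() for line in block.split("\n") if line.strip()]


sample_block = """ What is Seasonality of transmission.
What is Immune response and immunity
"""
naive = sample_block.split("\n")
clean = split_questions(sample_block)
print(len(naive), len(clean))
```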

Extract the context for each question with the FastText model, using the data ingestion module above

In [None]:
qa_dataframe["Context"] = qa_dataframe["Questions"].apply(lambda x : get_context(x, "title", model, papers_df)[0])

In [None]:
qa_dataframe

## BIOBERT Fine-tuning and Prediction

### The BERT model was trained with TensorFlow 1.x, so its pre-trained weights cannot be reused for transfer learning under TensorFlow 2.0; we therefore install TensorFlow 1.x

In [None]:
!pip install tensorflow==1.15.2

In [None]:
import tensorflow as tf
print(tf.__version__)

TPU support is not available for Tensorflow version 1.x in Kaggle.

# Test Data preparation

In [None]:
covidQuestions = qa_dataframe

Create temp directory to store temporary files

In [None]:
! mkdir temp

In [None]:
covidQuestions.to_json('temp/QnAInput.json')

Format the dataframe to create a JSON as input to BIOBERT QnA model

In [None]:
#Test File

import json
from uuid import uuid4

def restructure_json_Test(input_data):
    """
    Restructure input data to be compatible with BIOBERT QnA test data format.
        ----------------
    Args : 
        input_data : dict with "Questions" and "Context" entries, each keyed by row index
        ---------------
    returns:
        formatted data in dictionary
    """
    paragraphs = []
    for i in input_data["Questions"].keys():
        # one paragraph per question, each with a unique question id
        qas = [{"id" : str(uuid4()),
                "question" : input_data["Questions"][i],
               }]
        paragraphs.append({"qas" : qas, "context" : input_data["Context"][i]})
    output = dict()
    output["data"] = [{"paragraphs" : paragraphs, "title" : "BioASQ6b"}]
    output["version"] = "BioASQ6b"
    return output
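
For reference, `DataFrame.to_json` with its default orientation writes one object per column, keyed by row index, which is why the function above iterates over `input_data["Questions"].keys()`. Here is a minimal sketch of the input and output shapes on a hand-written two-row example. The sample texts and the helper `to_squad_style` are illustrative assumptions, and fixed ids replace `uuid4` so the result is reproducible.

```python
import json


def to_squad_style(input_data, title="BioASQ6b"):
    """Build a SQuAD-like dict with one paragraph per question."""
    paragraphs = []
    for key in input_data["Questions"]:
        paragraphs.append({
            "qas": [{"id": f"q{key}", "question": input_data["Questions"][key]}],
            "context": input_data["Context"][key],
        })
    return {"data": [{"paragraphs": paragraphs, "title": title}], "version": title}


# Column-oriented shape: {column: {row_index: value}}
input_data = {
    "Questions": {"0": "What is the incubation period?", "1": "Is it seasonal?"},
    "Context": {"0": "Text of paper A.", "1": "Text of paper B."},
}
result = to_squad_style(input_data)
print(json.dumps(result, indent=2))
```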

In [None]:
with open('temp/QnAInput.json',encoding='ISO-8859-1' ) as f:
    input_data = json.load(f)

Save the restructured data in JSON format

In [None]:
outputData = restructure_json_Test(input_data)
with open('temp/QnAInputTest.json', 'w',  encoding='ISO-8859-1') as json_file:
    json.dump(outputData, json_file)

Verify all files used for prediction

In [None]:
! ls -l '/kaggle/input/biobertcode/GitRepo/BIOBERT/bioasq-biobert/run_factoid.py'
! ls -l '/kaggle/input/biobertconfig/biobertconfig/vocab.txt'
! ls -l '/kaggle/input/biobertconfig/biobertconfig/bert_config.json'
! ls -l '/kaggle/input/biobertmodel2/BERT-pubmed-1000000-SQuAD2/model.ckpt-14470.index'
! ls -l 'temp/QnAInputTest.json'

# Prediction using Fine Tuned BIOBERT model

BioBERT model has been uploaded to https://www.kaggle.com/varshnes/biobertmodel2

In [None]:
! python /kaggle/input/biobertcode/GitRepo/BIOBERT/bioasq-biobert/run_factoid.py \
     --do_train=False \
     --do_predict=True \
     --vocab_file=/kaggle/input/biobertconfig/biobertconfig/vocab.txt \
     --bert_config_file=/kaggle/input/biobertconfig/biobertconfig/bert_config.json \
     --init_checkpoint=/kaggle/input/biobertmodel2/BERT-pubmed-1000000-SQuAD2/model.ckpt-14470.index \
     --max_seq_length=512 \
     --max_answer_length=256 \
     --train_batch_size=12 \
     --learning_rate=5e-6 \
     --doc_stride=128 \
     --num_train_epochs=1.0 \
     --do_lower_case=False \
     --train_file=$BIOASQ_INPUT_DIR/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json \
     --predict_file=temp/QnAInputTest.json \
     --output_dir=/kaggle/output/kaggle/factoid_output/prediction/

Check the prediction file in output folder

In [None]:
! ls -l /kaggle/output/kaggle/factoid_output/prediction/predictions.json

In [None]:
with open('/kaggle/output/kaggle/factoid_output/prediction/predictions.json',encoding='ISO-8859-1' ) as f:
    output_data = json.load(f)

In [None]:
output_data

In [None]:
with open('temp/QnAInputTest.json',encoding='ISO-8859-1' ) as f:
    output_data_questions = json.load(f)

### Let's go through the questions one by one and print each question next to its predicted answer

In [None]:
paragraphs = output_data_questions['data'][0]['paragraphs']

for index, item in enumerate(paragraphs):
    # avoid shadowing the built-in id()
    question_id = item['qas'][0]['id']
    question = item['qas'][0]['question']
    answer = output_data[question_id]
    print("Question ", str(index + 1), " : " , question ,"\n Answer : ", answer)
    print("\n")
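
The loop above assumes every question id appears in `predictions.json`. Below is a defensive sketch of the same pairing that falls back to a placeholder when an id is missing. The sample data, the `pair_answers` helper, and the placeholder text are assumptions for illustration, as is the id-to-answer-string layout of the predictions.

```python
def pair_answers(paragraphs, predictions, missing="<no prediction>"):
    """Return (question, answer) pairs, tolerating ids without a prediction."""
    pairs = []
    for item in paragraphs:
        qa = item["qas"][0]
        pairs.append((qa["question"], predictions.get(qa["id"], missing)))
    return pairs


paragraphs = [
    {"qas": [{"id": "a1", "question": "Is it seasonal?"}], "context": "..."},
    {"qas": [{"id": "b2", "question": "What is the host?"}], "context": "..."},
]
predictions = {"a1": "unclear"}
pairs = pair_answers(paragraphs, predictions)
for number, (question, answer) in enumerate(pairs, start=1):
    print(f"Question {number}: {question}\n Answer: {answer}\n")
```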