# Competition: COVID-19 Open Research Dataset Challenge (CORD-19)

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

# Task: What is known about transmission, incubation, and environmental stability?

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=568

****

This notebook uses the spaCy natural language processing library to rank thousands of research papers by their semantic similarity to each question specified in the task. A semantic similarity score is a **number that typically falls between `0` and `1`**, with a higher score indicating greater similarity. The results are found to be relevant to the goals of the task, and the top results are presented at the end of this notebook.

Histograms and boxplots are used to analyze the semantic similarity scores. Apart from low outliers, the distribution of scores may be considered approximately normal.

A major advantage of this approach is its simplicity, and no disadvantages are readily apparent. Computing the semantic similarity between a question and a research paper is accomplished with the following *"simple"* (thanks to spaCy) technique:

In [None]:
# SpaCy setup.
import spacy
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_lg")
nlp.max_length = 2_000_000  # maximum number of characters allowed in a single text

# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
similarity = doc1.similarity(doc2)

similarity
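
As spaCy's documentation describes it, the default `Doc.similarity` is the cosine similarity between the two documents' vectors, and a document's vector is the average of its token vectors. The following is a minimal sketch of that idea in plain Python, not spaCy's implementation. The names `mean_vector`, `cosine_similarity`, `doc_vector`, and `toy_vectors` are hypothetical, and the three-dimensional vectors are made up for illustration.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "word vectors" (an assumption for illustration only).
toy_vectors = {
    "i": [0.1, 0.3, 0.0],
    "like": [0.4, 0.1, 0.2],
    "fast": [0.0, 0.5, 0.1],
    "food": [0.2, 0.6, 0.7],
    "pizza": [0.3, 0.5, 0.8],
}

def doc_vector(text):
    """Represent a text as the mean of its toy word vectors."""
    return mean_vector([toy_vectors[word] for word in text.lower().split()])

toy_similarity = cosine_similarity(
    doc_vector("I like fast food"), doc_vector("I like pizza")
)
print(round(toy_similarity, 3))
```

Cosine similarity can in principle be negative when two vectors point in opposite directions, which is why the scores are best read as relative rankings rather than as probabilities.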

## Notebook setup

In [None]:
!spacy info

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from spacy.matcher import PhraseMatcher
from spacy.matcher import Matcher
from IPython.display import display, HTML
import re
import matplotlib.pyplot as plt
import math

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# The Data

The dataset used in this notebook is preprocessed and contains similarity scores for all records.

The complete dataset can be found here: https://www.kaggle.com/hvillacorte/covid19-literature

In [None]:
data = pd.read_json("../input/covid19-literature/df_covid_3.json")

In [None]:
display(HTML(f"""<h2>Data description</h2><br/>
<ul>
    <li>
        The dataset contains metadata and content from 
        <b>{data.shape[0]:,} scientific research papers</b>.
    </li>
    <li>
        <b>Each record contains the following variables:
        </b> {", ".join(data.columns.values)}
    </li>
    <li>
        The <code>body_text</code> feature contains the 
        scientific content and ranges in size from 
        <b>{data.body_word_count.min():,} to 
        {data.body_word_count.max():,} words</b> with an
        <b>average of {round(data.body_word_count.mean()):,}
        words</b>.
    </li>
</ul>"""))

display(HTML("<p><b>Here is a random sample of"
             " four records from the dataset:</b></p>"))
data.sample(4)

# List of questions

These questions are derived from the task and coded as a list of Python dictionaries.

In [None]:
questions = [
    {"question": "Range of incubation periods for the disease in humans"
                 " (and how this varies across age and health status) and"
                 " how long individuals are contagious, even after recovery."},
    {"question": "Prevalence of asymptomatic shedding and transmission"
                 " (e.g., particularly children)."},
    {"question": "Seasonality of transmission."},
    {"question": "Physical science of the coronavirus (e.g., charge distribution,"
                 " adhesion to hydrophilic/hydrophobic surfaces, environmental"
                 " survival to inform decontamination efforts for affected areas"
                 " and provide information about viral shedding)."},
    {"question": "Persistence and stability on a multitude of substrates and"
                 " sources (e.g., nasal discharge, sputum, urine,"
                 " fecal matter, blood)."},
    {"question": "Persistence of virus on surfaces of different materials"
                 " (e.g., copper, stainless steel, plastic)."},
    {"question": "Natural history of the virus and shedding of it from"
                 " an infected person."},
    {"question": "Implementation of diagnostics and products to improve"
                 " clinical processes."},
    {"question": "Disease models, including animal models for infection,"
                 " disease and transmission."},
    {"question": "Tools and studies to monitor phenotypic change and"
                 " potential adaptation of the virus."},
    {"question": "Immune response and immunity."},
    {"question": "Effectiveness of movement control strategies to prevent secondary"
                 " transmission in health care and community settings."},
    {"question": "Effectiveness of personal protective equipment (PPE) and its"
                 " usefulness to reduce risk of transmission in"
                 " health care and community settings."},
    {"question": "Role of the environment in transmission."}
]

## Generate keywords

Keywords are generated from each question and added to that question's dictionary. spaCy filters the keywords on lexical attributes: punctuation, symbols, particles, adverbs, stop words, and tokens of two characters or fewer are dropped. The keywords play no part in the semantic similarity scores, which are already computed. They are highlighted when the top results are displayed at the end of this notebook, simply as a visual aid and as *evidence of task completion and accuracy of the methods used*.

In [None]:
display(HTML("<h3>Questions and their keywords</h3>"))

for i, q in enumerate(questions):
    qdoc = nlp(q["question"])
    key_tokens = [token.text.lower() for token in qdoc
                  if token.pos_ not in ["PUNCT","SYM","PART","ADV"]
                  and not token.is_stop
                  and len(token.text) > 2]
        
    questions[i]["keywords"] = key_tokens
    keywords = ", ".join(key_tokens)
    display(HTML(f"""<p><strong>Question #{i+1}:</strong> {qdoc.text}<br/>
                        <strong>Keywords:</strong> {keywords}
                     </p>"""))

# Distribution of similarity scores

## Histograms

In [None]:
def draw_histograms(df, variables, n_rows, n_cols):
    """Plots histograms
    :param df: pandas dataframe
    :param variables: columns to plot
    :param n_rows: number of rows
    :param n_cols: number of columns
    :see: https://stackoverflow.com/questions/29530355/plotting-multiple-histograms-in-grid#answer-29530596
    """
    fig=plt.figure(figsize=(30,n_rows*8))
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        df[var_name].hist(bins=100,ax=ax)
        ax.set_title(
            f"Distribution of question #{i+1} similarity scores"
            , fontdict={"fontsize":34}
        )
    plt.show()

draw_histograms(
    data,
    [f"q{i}_similarity" for i, q in enumerate(questions)]
    , math.ceil(len(questions)/2)
    , 2
)

The distribution of each question's semantic similarity scores shows an overall negative skew, with an approximately normal shape in the higher ranges for the majority of the distributions.
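
The skew visible in the histograms can also be expressed as a single number. The sketch below computes the Fisher-Pearson skewness coefficient on made-up scores; the function name `sample_skewness` and the sample values are assumptions for illustration and are not taken from the dataset. A negative result means the long tail lies on the low side.

```python
import statistics

def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5, using population moments."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical: no skew to report
    return m3 / m2 ** 1.5

# Made-up scores: most cluster near 0.9, a few trail off toward 0.
made_up_scores = [0.91, 0.93, 0.92, 0.90, 0.94, 0.89, 0.92, 0.35, 0.10]
skew = sample_skewness(made_up_scores)
print(f"skewness = {skew:.2f}")
```

On the notebook's data, `data["q0_similarity"].skew()` should give a comparable figure; pandas applies a bias adjustment, so the two values will differ slightly.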

## Boxplots

In [None]:
plt.figure(figsize=(30,20))
plt.title(
    "Quantile-based distributions of each question's similarity scores"
    , fontdict={"fontsize":34}
)
data.boxplot(column=[f"q{i}_similarity" for i, q in enumerate(questions)])

The boxplots indicate that the vast majority of the similarity scores are in the upper ranges and that the strong negative skew of the distributions is due to outliers in the lower ranges. If these outliers are set aside, the distributions are for the most part approximately normal. The outliers may be attributable to non-English, missing, or corrupted text, or the papers may be genuinely very dissimilar to the question. This could be investigated further, but we are really only interested in identifying the top results. Future efforts to improve the dataset may include language detection and translation.

Additionally, high outliers in the boxplots indicate that in many cases the top results are markedly more similar to the question than the bulk of the papers.
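
The points drawn as outliers in a boxplot follow a simple rule: by Matplotlib's default, the whiskers reach at most 1.5 times the interquartile range (IQR) beyond the first and third quartiles, and anything further out is plotted individually. The sketch below applies that rule to made-up scores. The function names `iqr_fences` and `split_outliers` and the sample values are hypothetical.

```python
import statistics

def iqr_fences(values, whisker=1.5):
    """Return the (low, high) cut-offs beyond which a value counts as an outlier."""
    # method="inclusive" interpolates linearly between data points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr

def split_outliers(values, whisker=1.5):
    """Separate the values lying below the low fence and above the high fence."""
    low, high = iqr_fences(values, whisker)
    low_outliers = [v for v in values if v < low]
    high_outliers = [v for v in values if v > high]
    return low_outliers, high_outliers

# Made-up scores: a tight cluster near 0.9 with two low values and one high value.
made_up_scores = [0.10, 0.35, 0.88, 0.89, 0.90, 0.90, 0.91, 0.91, 0.92, 0.93, 0.99]
low_outliers, high_outliers = split_outliers(made_up_scores)
print("low outliers:", low_outliers)
print("high outliers:", high_outliers)
```

Because the cluster is so tight, the IQR is small, so even a score only modestly above the rest is flagged as a high outlier. That is consistent with the top results standing apart from the bulk of the papers.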

# Top results

The top results, along with a table of contents, will now be displayed. Keywords and numerical tokens and phrases are highlighted as follows.

* <mark style="background:yellow">keyword</mark>
* <mark style="background:lightblue">number with time interval</mark>
* <mark style="background:lightgreen">percentage</mark>
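
The three highlighting conventions listed above can be sketched with the standard `re` module alone. This is a simplified illustration under stated assumptions, not the notebook's exact logic: the helper names `highlight_keyword` and `highlight_numbers`, the two compiled patterns, and the sample sentence are all hypothetical, and keywords are matched as whole words rather than by lemma.

```python
import re

# A number followed by a unit of time, e.g. "5.2 days" or "2 weeks".
TIME_PATTERN = re.compile(
    r"\b(\d+(?:\.\d+)?)\s+(days?|hours?|weeks?|months?|years?)\b"
)
# A number followed by a percent sign, e.g. "97.5%" or "12 %".
PERCENT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s?%")

def highlight_keyword(text, keyword):
    """Wrap whole-word, case-insensitive occurrences of keyword in a yellow mark."""
    # re.escape keeps characters such as "." or "+" in a keyword literal.
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", flags=re.IGNORECASE)
    return pattern.sub(
        lambda m: f'<mark style="background:yellow">{m.group(0)}</mark>', text
    )

def highlight_numbers(text):
    """Mark time intervals in light blue and percentages in light green."""
    text = TIME_PATTERN.sub(
        r'<mark style="background:lightblue">\1 \2</mark>', text
    )
    text = PERCENT_PATTERN.sub(
        r'<mark style="background:lightgreen">\1%</mark>', text
    )
    return text

sample = "Incubation lasted 5.2 days in 97.5% of cases; shedding continued for 2 weeks."
highlighted = highlight_numbers(highlight_keyword(sample, "incubation"))
print(highlighted)
```

Keywords are highlighted first so that the later numeric patterns never have to skip over text already rewritten by an earlier numeric pass.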

The number of results displayed is controlled by the variable `num_results` in the following code block. To increase or decrease the number of results, edit its value and rerun the code below (there is no need to rerun the entire notebook).

In [None]:
# Adjust this variable as desired to increase
# or decrease the number of results displayed.
num_results = 3

# Generate table of contents.
display(HTML('<h2 id="toc">Table of Contents</h2>'))
for i, q in enumerate(questions):
    display(HTML(
        f'<b>{i+1}</b>. <a href="#q{i}">{q["question"]}</a><br/>'
    ))
    
# Generate the display for the top results per question.
for i, q in enumerate(questions):
    sim_id = f"q{i}_similarity"
    
    # Create the PhraseMatcher object. The shared vocabulary is the first argument.
    # attr='LEMMA' matches on lemmas, so inflected forms of a keyword also match.
    matcher = PhraseMatcher(nlp.vocab, attr='LEMMA')
    patterns = [
        nlp(term) for term in q["keywords"]
        + ["covid-19","covid19","coronavirus","corona"]
    ]
    # "MENU" is just a name for this set of patterns.
    # Passing the patterns as a list works in spaCy v2.2.2+ and v3.
    matcher.add("MENU", patterns)
    
    display(HTML(
        f"""<h1 id="q{i}">{q["question"]}</h1>
        <a href="#toc" title="Table of Contents">Back to top ↑</a>"""
    ))

    
    dat = data.sort_values(by=sim_id, ascending=False)
    
    # Use a separate name for the row label so the question index `i` is not overwritten.
    for row_label, row in dat[:num_results].iterrows():
        
        doc = nlp(row["body_text"])
        
        """
        # Highlight sentences that meet a question similarity threshold.
        sentences = [{"q": sent, "sim": nlp(q["question"]).similarity(sent)}
                                 for sent in doc.sents]        
        sentence_similarities = [s["sim"] for s in sentences]        
        threshold = np.quantile(sentence_similarities, .75)
        for s in sentences:
            if s["sim"] >= .9:
                row.body_text = row.body_text.replace(
                    s["q"].text
                    , f"<mark>{s['q']}</mark>")
        """
        
        matches = matcher(doc)        

        if matches:
            excerpt = row["body_text"]
            
            for m in matches:
                match = doc[m[1]:m[2]].text
                # Escape the matched text so regex metacharacters are treated literally.
                excerpt = re.sub(
                    f'([ -])(?!<mark style="background:yellow">){re.escape(match)}'
                    '(?!</mark>)([ ,.?!-])'
                    , f'\\1<mark style="background:yellow">{match}</mark>\\2'
                    , excerpt)
        else:
            if row["body_text"]:
                excerpt = row["body_text"]
            elif row["abstract"]:
                excerpt = row["abstract"]
            else:
                excerpt = "NO TEXT FOUND!"
                
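        # The hyphen goes last in the character class so it is a literal hyphen,
        # not a range from "." to "−" that would also match letters.
        excerpt = re.sub(
            ' (?!<mark style="background:lightblue">)([0-9.−-]+)?'
            ' (days?|hours?|weeks?|months?|years?)([ ,.?!])'
            , ' <mark style="background:lightblue">\\1 \\2</mark>\\3'
            , excerpt)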
        
        excerpt = re.sub(
            '(?!<mark style="background:lightgreen">)([0-9.-]+) ?(%)'
            , '<mark style="background:lightgreen">\\1 \\2</mark>'
            , excerpt)
        
        display(HTML(f"""<p>
                <strong>Title:</strong> {row["title"]}<br/>
                <strong>Authors:</strong> {row["authors"]}<br/>
                <strong>Journal:</strong> {row["journal"]}<br/>
                <strong>ID:</strong> {row["paper_id"]}<br/>
                <strong>Similarity score:</strong> {row[sim_id]}<br/>
                <strong>Number of keyword matches:</strong> {len(matches)}<br/>
                <strong>Question:</strong> {q["question"]}<br/>
                <a href="#toc" title="Table of Contents">Back to top ↑</a>
            </p>
            <blockquote>{excerpt}</blockquote>"""))