# Introduction <a class="anchor" id="intro-header"></a>

This notebook provides search and information extraction functionality for factors related to SARS-CoV-2 and Covid-19, with a worked example of extracting information on how temperature and humidity affect the transmission of 2019-nCoV. It filters relevant papers using an Apache Lucene based search engine over an index compiled on the full text of publications in the Kaggle competition metadata, and then mines the unstructured text with a [GrapeNLP](https://github.com/GrapeNLP) grammar with fuzzy matching capabilities.

The techniques provided here would be of interest to researchers and policy makers seeking to automatically find answers to questions such as the effect of temperature and humidity on disease transmission. Answers to such questions would help policy makers tailor their response to the pandemic based on geographic and seasonal differences. This would aid ongoing COVID-19 response efforts worldwide and attempts at economic recovery. These techniques can also be reconfigured to explore other factors.

This submission is able to: (1) recreate the target tables; (2) append new rows to the old tables in order to add: (A) newly published articles; or (B) previously overlooked articles.

It will use the following workflow:

* [Notebook parameters](#parameters-header) - Set the parameters to apply to the entire notebook
* [Install libraries and load metadata](#install-header) - Install the necessary components and load the CORD-19 metadata.
* [Load Lucene index and searcher](#load-lucene-header) - Load the CORD-19 Lucene index and instantiate the Lucene searcher
* [Search for relevant publications](#search-header) - Search for papers that may contain relevant Covid-19 factors. To implement this component, we use PyLucene, which is a Python extension for accessing Java Lucene.
* [Extraction grammar description](#grammar-description-header) - Overview of the GrapeNLP grammar used to find and extract the target datapoints.
* [Load grammar](#load-grammar-header) - Instantiation of the grammar engine.
* [Extract information from text](#extract-header) - Extract datapoints from full text of papers.
* [Display results](#display-header) - Save target CSV tables and display them.
* [Conclusion](#conclusion-header) - Conclusions and future directions.

# Notebook parameters <a class="anchor" id="parameters-header"></a>

Set the parameters to use across the entire notebook.

In [None]:
relevant_factor_terms = {
    'temperature_or_humidity': ['air', 'clammy', 'climate', 'cool', 'cold', 'hotness', 'humid', 'humidity', 'precipitation', 'rainfall', 'temperature', 'temperatures', 'warm']
}

covid19_synonyms = ["coronavirus disease 19", "sars cov 2", "2019 ncov", "2019ncov", "coronavirus 2019", "wuhan pneumonia", "wuhan virus", "wuhan coronavirus", "covid19", "covid-19"]

# Max documents to return per Lucene query (set high enough to return all possible matches)
MAX_SEARCH_RESULTS = 1000000

# Minimum overall grammar score for accepting a match. We allow for roughly 40 inserted tokens
# (about the length of 2 sentences), which add -15*40 = -600 points to the overall match score
MIN_GRAMMAR_SCORE = -600
# Left and right context size in characters to extract upon a grammar match (roughly 40 words, assuming 5 chars per word on average)
CONTEXT_SIZE = 200
# Styles for highlighting the matched factors and the surrounding excerpt within the full text of the papers
FACTORS_STYLE = "background-color: #EC1163"
EXCERPT_STYLE = "background-color: #80FF32"

LUCENE_INDEX_DIR = "documentLevel"
LUCENE_BASE_DIR = "/kaggle/working"
COVID_FULLTEXT_DF = "../input/covidfulltext/metadata_and_fulltext_2020-04-17.csv"

# Install Libraries <a class="anchor" id="install-header"></a>

A number of external components are used, including Apache Lucene with pre-compiled indexes and the [GrapeNLP](https://github.com/GrapeNLP) grammar engine for information extraction. In this kernel, we provide an installation, configuration and compilation package for PyLucene 8.1.1 as an external dataset called “compiledlucene”, which provides all the software dependencies required to install and deploy PyLucene. To install GrapeNLP, we reuse the libgrapenlp dataset, which provides the required Debian packages, and then install the GrapeNLP Python interface package, pygrapenlp, from PyPI. More details on how to install and use the GrapeNLP grammar engine in a Kaggle notebook can be found here: https://www.kaggle.com/javiersastre/grapenlp-grammar-engine-in-a-kaggle-notebook

In [None]:
# Install and import relevant libraries
!python -m easy_install ../input/compiledlucene/bk/lucene-8.1.1-py3.6-linux-x86_64.egg
!cp -r ../input/compiledlucene/bk/JCC-3.7-py3.6-linux-x86_64.egg /opt/conda/lib/python3.6/site-packages/
import sys
sys.path
sys.path.append('/opt/conda/lib/python3.6/site-packages/JCC-3.7-py3.6-linux-x86_64.egg')
sys.path.append('/opt/conda/lib/python3.6/site-packages/lucene-8.1.1-py3.6-linux-x86_64.egg')

In [None]:
!dpkg -i ../input/libgrapenlp/libgrapenlp_2.8.0-0ubuntu1_xenial_amd64.deb
!dpkg -i ../input/libgrapenlp/libgrapenlp-dev_2.8.0-0ubuntu1_xenial_amd64.deb
!pip install pygrapenlp

In [None]:
import sys, os, lucene, threading, time, html
from datetime import datetime
from java.nio.file import Paths
from org.apache.lucene.analysis.miscellaneous import LimitTokenCountAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import FieldInfo, IndexWriter, IndexWriterConfig, IndexOptions
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.search import IndexSearcher

In [None]:
from collections import OrderedDict
from pygrapenlp import u_out_bound_trie_string_to_string
from pygrapenlp.grammar_engine import GrammarEngine

In [None]:
from tqdm.auto import tqdm
import pandas as pd
from IPython.display import display, HTML, Image

# Load Lucene index <a class="anchor" id="load-lucene-header"></a>

Build the Lucene index over the loaded metadata and full text using IndexWriter. This component manages an index over a dynamic collection of documents and provides very rapid updates to the index as documents are added to and deleted from the collection. The index provides a mapping from terms to documents, which is called an “inverted index”. Document indexing consists of first constructing a Lucene Document that contains the fields to be indexed, then adding that Document to the inverted index (see figure below). The index is maintained as a set of segments in a storage abstraction, here a SimpleFSDirectory, which provides an interface similar to an OS file system.
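
To make the term-to-document mapping concrete, here is a minimal pure-Python sketch of an inverted index. It is only an illustration of the idea, not how Lucene stores its segments; the helper `build_inverted_index` and the toy documents are hypothetical and not part of this notebook's pipeline.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy collection (not taken from CORD-19)
toy_documents = {
    "doc1": "humidity affects transmission",
    "doc2": "temperature affects survival",
}
toy_index = build_inverted_index(toy_documents)

# Looking up a term returns the documents that mention it
print(sorted(toy_index["affects"]))   # ['doc1', 'doc2']
print(sorted(toy_index["humidity"]))  # ['doc1']
```

A search for a term then becomes a dictionary lookup instead of a scan over every document, which is why queries over the full collection stay fast.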

In [None]:
# Load indexed metadata 
final_df = pd.read_csv(COVID_FULLTEXT_DF)
final_df.shape

In [None]:
%%capture --no-display
# This section builds the index and takes about 4 minutes to run
class Ticker(object):

    def __init__(self):
        self.tick = True

    def run(self):
        while self.tick:
            sys.stdout.write('.')
            sys.stdout.flush()
            time.sleep(1.0)

class IndexFiles(object):
    """Usage: python IndexFiles <doc_directory>"""

    def __init__(self, root, storeDir, analyzer):
        ##print("before store")
        if not os.path.exists(storeDir):
            os.mkdir(storeDir)
        ##print("after store")

        store = SimpleFSDirectory(Paths.get(storeDir))
        ##print(storeDir)
        analyzer = LimitTokenCountAnalyzer(analyzer, 1048576)
        ##print("after analyzer ")

        config = IndexWriterConfig(analyzer)
        ##print("after config")

        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
        ##print("before writer")
        writer = IndexWriter(store, config)
        ##print("after writer")
        self.indexDocs(root, writer)
        ticker = Ticker()
        ##print ('commit index')
        threading.Thread(target=ticker.run).start()
        writer.commit()
        writer.close()
        ticker.tick = False
        ##print ('done')

    def indexDocs(self, root, writer):

        t1 = FieldType()
        t1.setStored(True)
        t1.setTokenized(False)
        t1.setStoreTermVectors(True)
        t1.setStoreTermVectorOffsets(True)
        t1.setStoreTermVectorPositions(True)
        t1.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)
        

        t2 = FieldType()
        t2.setStored(True)
        t2.setTokenized(True)
        t2.setStoreTermVectors(True)
        t2.setStoreTermVectorOffsets(True)
        t2.setStoreTermVectorPositions(True)
        t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)
                
        i = 1
        for index, row in tqdm(final_df.iterrows(), desc='Indexing: ', total=len(final_df.index)):
            print ("adding ", i , "th document:", row['paper_id'])
            try :
                doc = Document()
                doc.add(Field("paper_id", row['paper_id'], t1))
                doc.add(Field("title", row['title'], t2))
                doc.add(Field("doi",row['doi'], t1))
                doc.add(Field("pmcid", row['pmcid'], t1))
                doc.add(Field("publish_time", row['publish_time'], t1))
                doc.add(Field("journal", row['journal'], t1))
                doc.add(Field("url", row['url'], t1))
                
                if len(row['text']) > 0:
                    doc.add(Field("full_text", row['text'], t2))
                else :
                    print ("warning: no fulltext available in %s" % row['title'])
                    
                if len(row['abstract_y']) > 0:
                    doc.add(Field("abstract", row['abstract_y'], t2))
                else :
                    print ("warning: no abstract available in %s" % row['title'])
                writer.addDocument(doc)
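            except (RuntimeError, TypeError, NameError):
                # Rows that raise one of these errors are skipped silently
                # (e.g. len() on a missing NaN value raises TypeError)
                pass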
            i=i+1
            

lucene.initVM()
start = datetime.now()
try:
    IndexFiles(LUCENE_BASE_DIR, os.path.join(LUCENE_BASE_DIR, LUCENE_INDEX_DIR),StandardAnalyzer())
    end = datetime.now()
    print (end - start)
except (RuntimeError, TypeError, NameError):
    print ("Failed: ")
    raise

# Search relevant publications <a class="anchor" id="search-header"></a>

The following section uses Apache Lucene to search for publications that contain expressions of temperature and/or humidity as well as mentions of Covid-19.

In [None]:
def make_or_term_query(terms):
    quoted_terms = terms.copy()
    for i in range(len(quoted_terms)):
        if ' ' in quoted_terms[i]:
            quoted_terms[i] = '"' + quoted_terms[i] + '"'
    return ' OR '.join(quoted_terms)

def make_and_query(subqueries):
    return "(" + ") AND (".join(subqueries) + ")"

def search(searcher, analyzer, query_expression):
    query = QueryParser("full_text", analyzer).parse(query_expression)
    scoreDocs = searcher.search(query, MAX_SEARCH_RESULTS).scoreDocs
    results = []
    for scoreDoc in scoreDocs:
        doc = searcher.doc(scoreDoc.doc)
        result = {
            'date': doc.get("publish_time"),
            'study': doc.get("title"),
            'study_link': doc.get("url"),
            'journal': doc.get("journal"),
            'paper_id': doc.get('paper_id'),
            'paper_full_text': doc.get('full_text'),
            'pmcid': doc.get("pmcid")
        }
        results.append(result)
    return pd.DataFrame(results)

In [None]:
directory = SimpleFSDirectory(Paths.get(os.path.join(LUCENE_BASE_DIR, LUCENE_INDEX_DIR)))
searcher = IndexSearcher(DirectoryReader.open(directory))
analyzer = StandardAnalyzer()

In [None]:
covid_19_query = make_or_term_query(covid19_synonyms)
relevant_factor_queries = {relevant_factor: make_and_query([make_or_term_query(terms), covid_19_query]) for relevant_factor, terms in relevant_factor_terms.items()}
relevant_factor_queries
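
As a quick illustration of what the query builders produce, the sketch below re-declares the two helpers so that it runs on its own and applies them to short, hypothetical term lists (`toy_factor_terms` and `toy_disease_terms` are made up for this example and are much shorter than the notebook's lists).

```python
def make_or_term_query(terms):
    # Quote multi-word terms so that the query parser treats them as phrases
    quoted_terms = ['"' + term + '"' if ' ' in term else term for term in terms]
    return ' OR '.join(quoted_terms)

def make_and_query(subqueries):
    # Wrap each subquery in parentheses and require all of them
    return "(" + ") AND (".join(subqueries) + ")"

# Hypothetical shortened term lists
toy_factor_terms = ['humidity', 'temperature']
toy_disease_terms = ['sars cov 2', 'covid19']

toy_query = make_and_query([
    make_or_term_query(toy_factor_terms),
    make_or_term_query(toy_disease_terms),
])
print(toy_query)
# (humidity OR temperature) AND ("sars cov 2" OR covid19)
```

A paper therefore has to mention at least one factor term and at least one Covid-19 synonym in its full text to be returned.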

In [None]:
papers_by_relevant_factor = {}
for relevant_factor, query_expression in tqdm(relevant_factor_queries.items(), desc='Searching: '):
    results = search(searcher, analyzer, query_expression)
    papers_by_relevant_factor[relevant_factor] = results
    print("{}: {} documents found".format(relevant_factor, len(results.index)))

# Extraction Grammar Description <a class="anchor" id="grammar-description-header"></a>

Grammars have greater expressive power than regular expressions, so we have developed a [GrapeNLP](https://github.com/GrapeNLP) grammar for the extraction of causal relations between temperature and humidity related factors and effects such as spread of the virus or death. This grammar works much like a regular expression with extraction groups: it tries to match the entire text of each paper, and upon a match it extracts the segments of the matched text that are bounded in the grammar by XML tags. The XML tags act like parentheses in regular expressions for defining extraction groups, so the matched portion of text in each group can later be retrieved by tag name instead of by group index. We extract the following segments from the text:

* The detected relevant factor (e.g. temperature)
* The minimum excerpt of text that contains the causal relation statement

Apart from these, we generate an empty XML tag when negation of the causal relation is present, in order to classify the excerpt as influential or non-influential.
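
The analogy with regular expressions can be sketched with Python's `re` module and named groups. The pattern below is a deliberately tiny, hypothetical stand-in: the group names mirror the tags described above, but the actual GrapeNLP grammar is far richer and is not a regular expression.

```python
import re

# Hypothetical toy pattern; group names mirror the grammar's XML tags
pattern = re.compile(
    r"(?P<excerpt>"
    r"(?P<factors>temperature|humidity)\s+"
    r"(?P<negation>does not\s+)?"
    r"affects?\s+transmission"
    r")"
)

text = "We found that humidity does not affect transmission in this cohort."
match = pattern.search(text)

# Groups are retrieved by name rather than by index
print(match.group("factors"))                    # humidity
print(match.group("excerpt"))                    # humidity does not affect transmission
print("N" if match.group("negation") else "Y")   # N (negation present, so non-influential)
```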

The grammars can be structured into reusable components (equivalent to non-terminal symbols of context-free grammars) as illustrated by the grammar axiom:

In [None]:
Image('../input/relevant-factor-grammars/axiom_temp_or_humidity.png')

This grammar relies on the sub-grammar **causal_relation** to detect the causal or negated causal relation. The matched fragment of text expressing the causal relation is delimited by &lt;excerpt&gt; XML tags in order to extract it. Since the grammar engine performs exact matching on the entire paper full text, a call to grammar **null_insert** is added before and after the causal relation to match the text from the beginning of the paper up to the relation and from the relation to the end of the paper. The **null_insert** grammar recognizes an arbitrary sequence of 0, 1 or more tokens:

In [None]:
Image('../input/relevant-factor-grammars/null_insert.png')

Apart from generating XML tags, the grammar engine also generates a matching score per token, which by default depends on the specificity of the lexical mask used to match each token: lexical mask &lt;TOKEN&gt; is given 0 points by default, while lexical masks requiring a specific word, digit or symbol are given 14 points. The grammar engine generates an efficient representation of all possible matches as a kind of weighted finite-state automaton with output, then uses a Viterbi-like algorithm to efficiently extract the match with the highest overall score.

The **causal_relation** grammar below detects expressions of correlations or results of a factor or list of factors (grammar **list_of_factors**) with an effect (grammar **effect**). If present, it also detects negation particles and generates an empty &lt;negation&gt; tag for classifying the relation as non-influential. Meta code &lt;E&gt; represents the empty string, and is used to make a box optional (e.g. tokens "positive" and "positively" may appear but are not mandatory).

In [None]:
Image('../input/relevant-factor-grammars/causal_relation.png')

Grammar **penalizing_insert** is similar to grammar **null_insert**: it recognizes 0, 1 or more arbitrary tokens, but overrides the default score of 0 per token with a score of -15, allowing causal expressions to contain inserts of unknown tokens while favoring matches with as few of them as possible.

In [None]:
Image('../input/relevant-factor-grammars/penalizing_insert.png')

To avoid recognizing overly large chunks of text as causal relations, matches whose overall grammar score falls below a minimum threshold (MIN_GRAMMAR_SCORE in the notebook parameters) are rejected.
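
The effect of the threshold can be approximated with a little arithmetic. The sketch below assumes a simplified additive model in which a match score is just the sum of the per-token scores quoted above (14 for a specific token, -15 for a penalized insert token); the function names are hypothetical and this is not the grammar engine's own code.

```python
SPECIFIC_TOKEN_SCORE = 14   # per token matched by a specific word, digit or symbol
INSERT_TOKEN_SCORE = -15    # per token matched by penalizing_insert
MIN_GRAMMAR_SCORE = -600    # threshold from the notebook parameters

def estimate_match_score(specific_tokens, inserted_tokens):
    """Simplified additive score model (an assumption made for illustration)."""
    return specific_tokens * SPECIFIC_TOKEN_SCORE + inserted_tokens * INSERT_TOKEN_SCORE

def is_accepted(score, threshold=MIN_GRAMMAR_SCORE):
    """A match is kept only if its score reaches the threshold."""
    return score >= threshold

# A compact expression with few unknown tokens between factor and effect
tight = estimate_match_score(specific_tokens=4, inserted_tokens=10)   # 56 - 150 = -94
# A match that stretches over many unknown tokens
loose = estimate_match_score(specific_tokens=4, inserted_tokens=60)   # 56 - 900 = -844

print(tight, is_accepted(tight))   # -94 True
print(loose, is_accepted(loose))   # -844 False
```

Raising the threshold towards 0 favors precision by keeping only compact expressions, while lowering it favors recall.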

The **list_of_factors** grammar recognizes lists of 1 or more relevant factors, also allowing unknown tokens in between. However, the matched list must start and end with a known factor. The matched list of factors is delimited by tag &lt;factors&gt; to extract it.

In [None]:
Image('../input/relevant-factor-grammars/list_of_factors.png')

Grammar **temp_or_humidity** detects one single relevant factor, in our case different expressions related to temperature and humidity:

In [None]:
Image('../input/relevant-factor-grammars/temp_or_humidity.png')

Finally, grammar **effect** detects different expressions of effects such as spread, transmission or death. It also recognizes optional modifiers such as "increases" or "decreases", and in the latter case generates the &lt;negation&gt; tag to classify the relation as non-influential:

In [None]:
Image('../input/relevant-factor-grammars/effect.png')

# Load grammar <a class="anchor" id="load-grammar-header"></a>

In [None]:
bin_delaf_pathname = os.path.join('..', 'input', 'test-delaf', 'dictionary.bin')
grammar_dir = os.path.join('..', 'input', 'relevant-factor-grammars')
relevant_factors_grammar_pathname = os.path.join(grammar_dir, 'grammar_relevant_factors.fst2')
relevant_factors_grammar_engine = GrammarEngine(relevant_factors_grammar_pathname, bin_delaf_pathname)

# Extract datapoints <a class="anchor" id="extract-header"></a>

We apply here the loaded grammar to the full text of each paper found, for each relevant factor defined in the first section of this notebook.

The grammar engine is implemented in C++, and the corresponding native match object is accessed from Python using SWIG. We use the following function to convert this native object into a Python dictionary, which is easier to process. Upon multiple matches of the same expression, we retrieve the top scored match. The grammar not only matches expressions but also tags the fragments to extract (what we call "segments" of the input text). For the top scored match, this function returns a Python dictionary of all the tagged text segments, using the label of the tag as key and the text segment itself as value. The total match score is also returned so that a minimum score can be enforced.

In [None]:
def grapenlp_results_to_python_dic(sentence, native_results):
    top_segments = OrderedDict()
    score = -sys.maxsize - 1 # Minimum integer value
    if not native_results.empty():
        top_native_result = native_results.get_elem_at(0)
        score = top_native_result.w
        top_native_result_segments = top_native_result.ssa
        for i in range(0, top_native_result_segments.size()):
            native_segment = top_native_result_segments.get_elem_at(i)
            native_segment_label = native_segment.name
            segment_label = u_out_bound_trie_string_to_string(native_segment_label)
            segment = OrderedDict()
            segment['value'] = sentence[native_segment.begin:native_segment.end]
            segment['start'] = native_segment.begin
            segment['end'] = native_segment.end
            top_segments[segment_label] = segment
    return top_segments, score

The function below generates the final datapoints to present in the summary table, given the extracted segments of text:

In [None]:
def parse_segments(text, segments):
    study_type = None
    factors = None
    influential = 'Y'
    excerpt = None
    measure_of_evidence = None

    if 'study_type' in segments:
        study_type = segments['study_type']['value']
    if 'factors' in segments:
        factors = segments['factors']['value']
    if 'negation' in segments:
        influential = 'N'
    if 'measure_of_evidence' in segments:
        measure_of_evidence = segments['measure_of_evidence']['value']

    if factors and 'excerpt' in segments:
        excerpt_start = segments['excerpt']['start']
        excerpt_end = segments['excerpt']['end']
        factors_start = segments['factors']['start']
        factors_end = segments['factors']['end']
        factors_value = segments['factors']['value']
        
        left_excerpt = text[excerpt_start:factors_start]
        right_excerpt = text[factors_end:excerpt_end]
        # Clamp at 0: a negative start index would slice from the end of the text
        left_context = text[max(0, excerpt_start - CONTEXT_SIZE):excerpt_start]
        right_context = text[excerpt_end:excerpt_end + CONTEXT_SIZE]
        excerpt = left_context + \
                  '<span style="' + EXCERPT_STYLE + '">' + left_excerpt + '</span>' + \
                  '<span style="' + FACTORS_STYLE + '">' + factors + '</span>' + \
                  '<span style="' + EXCERPT_STYLE + '">' + right_excerpt + '</span>' + \
                  right_context

    return study_type, factors, influential, excerpt, measure_of_evidence
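
The slicing and highlighting step can be exercised in isolation on a toy sentence. The helper `highlight` below is a hypothetical, self-contained sketch of the same idea, with a smaller context size so that the output stays short; it clamps the left context at index 0 so that an excerpt near the start of the text does not produce a negative slice index.

```python
CONTEXT_SIZE = 20  # smaller than the notebook's 200, for a short toy output

def highlight(text, excerpt_start, excerpt_end, factors_start, factors_end,
              excerpt_style="background-color: #80FF32",
              factors_style="background-color: #EC1163"):
    """Hypothetical helper: wrap the excerpt and the factors in styled spans."""
    left_context = text[max(0, excerpt_start - CONTEXT_SIZE):excerpt_start]
    right_context = text[excerpt_end:excerpt_end + CONTEXT_SIZE]
    return (left_context
            + '<span style="' + excerpt_style + '">' + text[excerpt_start:factors_start] + '</span>'
            + '<span style="' + factors_style + '">' + text[factors_start:factors_end] + '</span>'
            + '<span style="' + excerpt_style + '">' + text[factors_end:excerpt_end] + '</span>'
            + right_context)

toy_text = "High humidity reduces transmission of the virus."
# Offsets that the grammar engine would normally report for each segment
factors_start = toy_text.index("humidity")
factors_end = factors_start + len("humidity")
excerpt_start = toy_text.index("High")
excerpt_end = toy_text.index(" of the virus")

highlighted = highlight(toy_text, excerpt_start, excerpt_end, factors_start, factors_end)
print(highlighted)
```

Here the excerpt starts at offset 0, so the left context is empty and the output begins directly with the first highlighted span.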

This function applies the grammar to the entire paper and returns the summary table as a Pandas dataframe:

In [None]:
def extract_from_text(papers, grammar_engine):
    """Extract the target datapoints for the given relevant factor from the full text of each paper."""
    result_df = pd.DataFrame(columns=['Date', 'Study','Study Link','Journal','Study Type','Factors','Influential', 'Score', 'Excerpt', 'Measure of Evidence'])
    context = {}
    matched_papers = 0
    for id, paper in tqdm(papers.iterrows(), total=len(papers.index)):
        text = paper['paper_full_text']
        native_matches = grammar_engine.tag(text, context)
        segments, score = grapenlp_results_to_python_dic(text, native_matches)
        if segments:
            matched_papers += 1
            if score >= MIN_GRAMMAR_SCORE:
                print("Score:", score)
                study_type, factors, influential, excerpt, measure_of_evidence = parse_segments(text, segments)
                if excerpt:
                    paper_data_dict = {
                        'Date': html.escape(paper.date),
                        'Study': html.escape(paper.study),
                        'Study Link': '<a href="' + paper.study_link + '">' + html.escape(paper.study_link) + '</a>',
                        'Journal': html.escape(paper.journal),
                        'Study Type': study_type,
                        'Factors': factors,
                        'Influential': influential,
                        'Score': score,
                        'Excerpt': excerpt,
                        'Measure of Evidence': measure_of_evidence
                    }
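                    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
                    result_df = pd.concat([result_df, pd.DataFrame([paper_data_dict])], ignore_index=True)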
    matches_above_threshold = len(result_df.index)
    total_papers = len(papers.index)
    display(HTML("<p>total papers/total matches/matches above threshold: {}/{}/{}</p>".format(total_papers, matched_papers, matches_above_threshold)))
    return result_df

Here we apply the extraction function to all the papers returned by Lucene. While this notebook focuses on temperature and humidity, the loop below could be used in future work to extract tables for other relevant factors.

In [None]:
%%capture --no-display
# For each relevant factor
tables_by_relevant_factor = {}
for relevant_factor, papers in tqdm(papers_by_relevant_factor.items(), desc='Extracting: '):
    table = extract_from_text(papers, relevant_factors_grammar_engine)
    tables_by_relevant_factor[relevant_factor] = table

# Display tables <a class="anchor" id="display-header"></a>

Given a list of papers and a grammar engine instance, the extraction function above returns a DataFrame for the target summary table, containing an entry per paper in which an effect was found close to a mention of the factor. The table includes the target datapoints, indicating whether the factor was influential or not depending on the absence or presence of the &lt;negation&gt; tag. The excerpt of text with some left and right context is included for manual review, as well as the overall matching score, which can be used to fine-tune the matching threshold. The cell below displays each table and saves it as a CSV file.

In [None]:
%%capture --no-display
for relevant_factor, table in tables_by_relevant_factor.items():
    display(HTML(table.to_html(escape=False)))
    table.to_csv(relevant_factor + ".csv")

# Conclusion <a class="anchor" id="conclusion-header"></a>

We set out to provide a reproducible pipeline to accurately search and extract valuable information from Covid-19 research studies for relevant factors. We have demonstrated the ability to automatically extract tables for relevant factors related to temperature and humidity. This pipeline enables researchers, clinicians, policy makers and many others to rapidly mine tens of thousands of publications in order to find highly relevant factors from studies related to Covid-19. The pipeline is fully automatic and highly configurable, and can include grammars for many other factors. Our inspection of the results shows that they are highly relevant, and we can tune precision and recall by adjusting the grammar threshold for accepting a match.

## Pros and cons of this approach

### Pros

Our approach provides the following features:

* We provide a robust grammar engine to extract factors related to Covid-19.
* The generated results are highly relevant.
* We can tune precision and recall by tuning the grammar threshold for accepting a match.
* We automate all of the steps needed to extract this crucial information in one reusable pipeline.
* The system is highly configurable, allowing researchers to target other factors and issues.
* The analysis is highly reproducible, allowing other researchers to replicate and reuse the pipeline.
* The system can be re-run as more data becomes available.

### Cons

The following are the shortfalls of our approach:

* More factors to be explored

## Future Work

* Extend approach to more factors
* Further refinements of the grammar to improve precision and recall