# COVID-19 Open Research Dataset (CORD-19) ETL

<p align="center">
    <img src="https://pages.semanticscholar.org/hs-fs/hubfs/covid-image.png?width=300&name=covid-image.png"/>
</p>

***NOTE: There is a [Report Builder Notebook](https://www.kaggle.com/davidmezzetti/cord-19-report-builder) that runs on a prebuilt model. If you just want to try this out without a full build, this is the best choice.***

COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses. The dataset can be found on [Semantic Scholar](https://pages.semanticscholar.org/coronavirus-research) and [Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

This notebook uses the [paperetl](https://github.com/neuml/paperetl) project to load the raw CORD-19 dataset into a SQLite database. paperetl also supports loading data into Elasticsearch and exporting output to JSON/YAML files. [CORD-19 Analysis with Sentence Embeddings](https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings) uses this notebook as the source data for analysis tasks.

# Install

[paperetl](https://github.com/neuml/paperetl) can be installed directly from GitHub using pip as follows. This project also depends on scispacy, which must be installed separately.

In [None]:
# Install versioned packages for compatibility
!pip install spacy==2.3.2

# Install paperetl project
!pip install paperetl

# Install scispacy model
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_md-0.2.5.tar.gz

This notebook requires Internet connectivity to be enabled. To modify the Python code in a copy of this notebook, fork the GitHub project and update the pip install command above to point to the new repository location.

# Build SQLite articles database

The raw CORD-19 data is stored across a metadata.csv file and json files with the full text. This project uses [SQLite](https://www.sqlite.org/index.html) to aggregate and store the merged content.

The ETL process transforms the csv/json files into a SQLite database. The process iterates over each row in metadata.csv, extracts the column data and ensures it is not a pure duplicate (using the sha hash). This process will also load the full text if available. 
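
The duplicate check described above can be sketched in a few lines. This is a minimal illustration, not the paperetl implementation: the `unique_rows` helper is hypothetical, the column names `sha` and `title` are assumed, and the sample rows are made up.

```python
import csv
import io

def unique_rows(handle):
    """Yield metadata rows, skipping rows whose sha hash was already seen."""
    seen = set()
    for row in csv.DictReader(handle):
        sha = row.get("sha") or None
        # Rows without a sha cannot be compared by hash, so they are kept
        if sha is not None:
            if sha in seen:
                continue
            seen.add(sha)
        yield row

# Tiny in-memory stand-in for metadata.csv (made-up values)
sample = io.StringIO(
    "sha,title\n"
    "abc123,First article\n"
    "abc123,First article\n"
    ",Article without full text\n"
    "def456,Second article\n"
)
titles = [row["title"] for row in unique_rows(sample)]
print(titles)
```

Keeping the seen hashes in a set makes each duplicate check a constant-time lookup, which matters when iterating over a large metadata file.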

## Tagging
Articles are tagged based on keyword matches. The only tag at this time is COVID-19, which is applied when the article text matches any of the following regular expressions.

>2019[\-\s]?n[\-\s]?cov, 2019 novel coronavirus, coronavirus 2(?:019)?, coronavirus disease (?:20)?19, covid(?:[\-\s]?(?:20)?19)?, n\s?cov[\-\s]?2019, sars[\-\s]cov-?2, wuhan (?:coronavirus|cov|pneumonia)

Credit to [@ajrwhite](https://www.kaggle.com/ajrwhite) and his [notebook](https://www.kaggle.com/ajrwhite/covid-19-thematic-tagging-with-regular-expressions) for helping to develop this list.
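
As a rough sketch of how the expressions listed above could be applied, the snippet below joins them into a single pattern and tests a piece of text. The `tags` helper is hypothetical, and case-insensitive matching is an assumption; the actual ETL code may normalize text differently.

```python
import re

# Regular expressions copied from the list above
PATTERNS = [
    r"2019[\-\s]?n[\-\s]?cov",
    r"2019 novel coronavirus",
    r"coronavirus 2(?:019)?",
    r"coronavirus disease (?:20)?19",
    r"covid(?:[\-\s]?(?:20)?19)?",
    r"n\s?cov[\-\s]?2019",
    r"sars[\-\s]cov-?2",
    r"wuhan (?:coronavirus|cov|pneumonia)",
]

# Assumption: matching ignores case
COVID19 = re.compile("|".join(f"(?:{p})" for p in PATTERNS), re.IGNORECASE)

def tags(text):
    """Return the list of tags for a piece of article text."""
    return ["COVID-19"] if COVID19.search(text) else []

print(tags("Clinical features of SARS-CoV-2 infection"))
print(tags("Seasonal influenza surveillance in 2018"))
```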

## Study Design
Additional metadata is parsed out of the article to derive information on the study design. The models referenced in this section are [available as a dataset](https://www.kaggle.com/davidmezzetti/cord19-study-design). 

### Design Type
The full text is analyzed with a machine learning model to determine a design type for the study behind the article. The model has a pre-defined vocabulary, and each feature is the count of one of these keywords in the text. A Random Forest Classifier is trained on this feature set and used to predict study design labels.

Credit to [@savannareid](https://www.kaggle.com/savannareid) for developing the keywords to use with this method. The keywords can be found in this [domain dictionary](https://docs.google.com/spreadsheets/d/1t2e3CHGxHJBiFgHeW0dfwtvCG4x0CDCzcTFX7yz9Z2E/edit#gid=389064679). More details on deriving a study design can be found in [this discussion](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/139355).
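
To illustrate the keyword-count features, here is a minimal sketch that turns text into a count vector. The four keywords are placeholders rather than entries taken from the domain dictionary, and the `features` helper is hypothetical. A vector like this is the kind of input a Random Forest implementation (scikit-learn's `RandomForestClassifier`, for instance, which is an assumption about tooling) could be trained on.

```python
import re

# Hypothetical vocabulary; the real keywords live in the domain dictionary
VOCABULARY = ["randomized", "cohort", "case series", "simulation"]

def features(text, vocabulary=VOCABULARY):
    """Return one count per vocabulary keyword, in vocabulary order."""
    text = text.lower()
    return [
        len(re.findall(r"\b" + re.escape(keyword) + r"\b", text))
        for keyword in vocabulary
    ]

vector = features("A randomized trial. Patients were randomized into a cohort.")
print(vector)
```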

### Attribute Type Detection and Extraction
Additionally, the full text is analyzed with a second machine learning model to identify study metadata for the study behind the article. The model combines several features, including a TF-IDF vector of the text elements and Natural Language Processing (NLP) elements. The NLP features are built from entity, part of speech and dependency labels extracted with [scispacy](https://allenai.github.io/scispacy/). scispacy has been pretrained on medical articles and has good detection on articles in this dataset. A Logistic Regression Classifier is trained on this feature set and used to predict attribute labels.

Depending on the attribute type, NLP is used for further extraction. One example is the sample size. Given the sentence "34 patients were enrolled", the logic takes the token "patients" and uses dependency labels to extract the associated number (34) as the sample size.
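
The snippet below is a deliberately simplified stand-in for that idea. It does not use a dependency parse; it swaps in a regular expression that looks for a number shortly before a participant noun. The noun list and the `sample_size` helper are hypothetical.

```python
import re

# Hypothetical list of nouns that denote study participants
SUBJECTS = ("patients", "participants", "subjects", "cases")

# A number, then up to two intervening words, then a participant noun
PATTERN = re.compile(
    r"\b(\d[\d,]*)\s+(?:\w+\s+){0,2}?(" + "|".join(SUBJECTS) + r")\b",
    re.IGNORECASE,
)

def sample_size(sentence):
    """Return the number attached to a participant noun, or None."""
    match = PATTERN.search(sentence)
    return int(match.group(1).replace(",", "")) if match else None

print(sample_size("34 patients were enrolled"))
```

A dependency-based approach is more robust than this shortcut because it follows the grammatical link between the number and the noun instead of relying on word distance.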

Another example is risk factor extraction, such as the odds ratio of hypertension within a study. Using the detected entities, the logic looks for the statistic that matches a topic (such as hypertension) within a text section. The process can currently only extract reported statistics; it cannot calculate statistics from lower-level data.

## Grammar Labels
The title, abstract and full-text fields are tokenized into sentences. Linguistic rules are used to label each sentence to help identify concise, data-driven statements. 

There are currently two basic linguistic rules.

1. *QUESTION*: Sentence ending in a '?' mark
2. *FRAGMENT*: Less informative/incomplete statements. Acceptable sentences have the following structure.
  - At least one nominal subject noun/proper noun AND
  - At least one action/verb AND
  - At least 5 words
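
The two rules above can be sketched as a small function over pre-tagged tokens. This is an illustration, not the code in grammar.py: the `label` helper is hypothetical, tokens are assumed to be `(text, pos, dep)` tuples using Universal POS tags and an `nsubj` dependency label, punctuation is assumed not to count as a word, and returning `None` for acceptable sentences is also an assumption.

```python
def label(tokens):
    """Label a sentence given (text, pos, dep) tuples for its tokens."""
    # Rule 1: sentence ending in a question mark
    if tokens and tokens[-1][0] == "?":
        return "QUESTION"

    # Rule 2: acceptable sentences need a nominal subject, a verb and 5+ words
    words = [text for text, pos, _ in tokens if pos != "PUNCT"]
    has_subject = any(
        dep == "nsubj" and pos in ("NOUN", "PROPN") for _, pos, dep in tokens
    )
    has_verb = any(pos == "VERB" for _, pos, _ in tokens)
    if not (has_subject and has_verb and len(words) >= 5):
        return "FRAGMENT"

    return None

# Hand-tagged example sentences (tags written by hand, not produced by a parser)
statement = [
    ("The", "DET", "det"), ("study", "NOUN", "nsubj"), ("enrolled", "VERB", "ROOT"),
    ("34", "NUM", "nummod"), ("adult", "ADJ", "amod"), ("patients", "NOUN", "dobj"),
    (".", "PUNCT", "punct"),
]
question = [
    ("Is", "AUX", "ROOT"), ("it", "PRON", "nsubj"), ("airborne", "ADJ", "acomp"),
    ("?", "PUNCT", "punct"),
]
fragment = [("Table", "NOUN", "ROOT"), ("2", "NUM", "nummod")]

print(label(statement), label(question), label(fragment))
```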

Important source files to highlight:
- ETL Process -> [execute.py](https://github.com/neuml/paperetl/blob/master/src/python/paperetl/cord19/execute.py)
- Linguistic Rules -> [grammar.py](https://github.com/neuml/paperetl/blob/master/src/python/paperetl/grammar.py)
- Study Design Model -> [design.py](https://github.com/neuml/paperetl/blob/master/src/python/paperetl/study/design.py)
- Attribute Model -> [attribute.py](https://github.com/neuml/paperetl/blob/master/src/python/paperetl/study/attribute.py)
- Sample Size Extraction -> [sample.py](https://github.com/neuml/paperetl/blob/master/src/python/paperetl/study/sample.py)

In [None]:
import os
import shutil

from paperetl.cord19.execute import Execute as Etl

# Copy study design models locally (exist_ok avoids an error if the cell is re-run)
os.makedirs("cord19q", exist_ok=True)
shutil.copy("../input/cord19-study-design/attribute", "cord19q")
shutil.copy("../input/cord19-study-design/design", "cord19q")

# Copy previous articles database locally for predictable performance
shutil.copy("../input/cord-19-etl/cord19q/articles.sqlite", "/tmp")

# Build SQLite database for metadata.csv and json full text files
Etl.run("../input/CORD-19-research-challenge", "cord19q", "cord19q", "../input/cord-19-article-entry-dates/entry-dates.csv", False, "/tmp/articles.sqlite")

Upon completion, a database named articles.sqlite will be stored in the output directory under a sub-folder named cord19q.
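
One quick way to inspect the result is to list the tables in the database with Python's built-in `sqlite3` module. This sketch makes no assumption about the schema paperetl creates. The `list_tables` helper is hypothetical, and the path simply follows the output location described above.

```python
import os
import sqlite3

def list_tables(path):
    """Return the names of all tables in a SQLite database, sorted by name."""
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]

# Only inspect the database if the ETL process has already produced it
path = "cord19q/articles.sqlite"
if os.path.exists(path):
    print(list_tables(path))
```

The existence check matters because `sqlite3.connect` creates an empty database file when the path does not exist.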