# CORD-19 - Data extraction functions

Amid the emerging COVID-19 pandemic, Kaggle has launched the COVID-19 Open Research Dataset Challenge (CORD-19) as a general call to any data scientist who can help extract relevant information to deal with the virus. Since we are still in the early stages of this analytics challenge, the idea of this kernel is to provide quality-of-life functions that extract certain information from COVID-19 papers. The content is far from being particularly creative or perfect, but it will hopefully save time for other people interested in the challenge.

TABLE OF CONTENTS

1. [Filter papers by word occurrences](#section1)
2. [Extract the conclusions section](#section2)

Disclaimer: This kernel is still under construction. 

Import required libraries:

In [None]:
import numpy as np 
import pandas as pd
pd.set_option('display.width', 30) 
import matplotlib.pyplot as plt
import time
import warnings 
warnings.filterwarnings('ignore')
from collections import Counter
from nltk.corpus import stopwords

# NLP libraries
import spacy
from spacy.lang.en import English

# There's a large number of input files, it's nice to check out the list
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

I load the output files from [xhlulu's kernel](https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv), which converts the JSON files (parsed as dictionaries) into a readable CSV format. Go check it out and give it some credit!

In [None]:
biorxiv = pd.read_csv("/kaggle/input/cord-19-eda-parse-json-and-generate-clean-csv/biorxiv_clean.csv")
biorxiv.shape

In [None]:
biorxiv.head(5)

## 1. Filter papers by word occurrences <a id="section1"></a>

General studies such as word frequency analysis require the full set of scientific papers. However, when dealing with specific tasks or topics, it's useful to select only the subset of papers that contain certain words. Despite being very simple, the function defined in this section returns the list of paper_id values of the papers whose text contains every word in a given list. Note that the match is a case-sensitive substring match.

In [None]:
# Filter papers containing all words in list
def filter_papers_word_list(word_list):
    papers_id_list = []
    for idx, paper in biorxiv.iterrows():
        if all(x in paper.text for x in word_list):
            papers_id_list.append(paper.paper_id)

    return papers_id_list
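
Since the check above is case-sensitive, a paper that only writes "Coronavirus" at the start of a sentence would be missed. The following is a minimal sketch of a case-insensitive variant, assuming a DataFrame with `paper_id` and `text` columns like `biorxiv`. The function name `filter_papers_case_insensitive` and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd

def filter_papers_case_insensitive(df, word_list):
    """Return paper_id values whose text contains every word, ignoring case."""
    # Start with every row selected, then narrow the mask one word at a time.
    mask = pd.Series(True, index=df.index)
    for word in word_list:
        # regex=False treats the word as a literal substring; na=False drops missing texts.
        mask &= df["text"].str.contains(word, case=False, regex=False, na=False)
    return df.loc[mask, "paper_id"].tolist()

# Toy frame standing in for biorxiv (hypothetical data, for illustration only)
toy = pd.DataFrame({
    "paper_id": ["a", "b", "c"],
    "text": ["Coronavirus transmission in cities", "coronavirus genome", None],
})
print(filter_papers_case_insensitive(toy, ["coronavirus", "transmission"]))
```

Because the comparison runs column-wise instead of row by row, this version also avoids the `iterrows` loop, which may help on larger subsets of the dataset.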

Let's see an example related to the challenge task [What is known about transmission, incubation, and environmental stability?](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=568), tackling the last bullet: **Role of the environment in transmission**. We look for papers that contain the words "coronavirus", "environment" and "transmission". (Thanks to [Marília](https://www.kaggle.com/mpwolke) for her insightful comment!)

In [None]:
pd.set_option("display.max_colwidth", 100000) # Raise the maximum column width so that long text fields are displayed in full

biorxiv_coronavirus = filter_papers_word_list(["coronavirus"])
print("Papers containing coronavirus: ", len(biorxiv_coronavirus))

biorxiv_environment = filter_papers_word_list(["environment"])
print("Papers containing environment: ", len(biorxiv_environment))

biorxiv_environmental = filter_papers_word_list(["environmental"])
print("Papers containing environmental: ", len(biorxiv_environmental))

print("Intersection of environment and environmental: ", len(set(biorxiv_environment) & set(biorxiv_environmental)))

biorxiv_environment_transmission = filter_papers_word_list(["coronavirus", "environment", "transmission"])
print("Number of papers containing coronavirus, environment and transmission: ", len(biorxiv_environment_transmission))

**Observations**:

* Of the 803 biorxiv papers in this challenge, the word "coronavirus" appears in only about half (412). This suggests that a large number of papers are not strictly related to COVID-19, although they may contain useful information to answer some of the challenge tasks.
* Papers with "environmental" are a subset of those containing "environment". This is guaranteed here, because the filter matches substrings and "environmental" contains "environment". However, this may not be the case for other word families. In the next subsection we cover an alternative that finds papers based on the lemma of each word.
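
The subset check between two lists of paper IDs can be written directly with Python set operators. This is a small sketch; the helper name `compare_id_lists` and the IDs are hypothetical and are not results from the dataset.

```python
def compare_id_lists(ids_a, ids_b):
    """Summarise how two lists of paper IDs overlap."""
    set_a, set_b = set(ids_a), set(ids_b)
    return {
        "intersection": len(set_a & set_b),   # IDs present in both lists
        "only_in_a": len(set_a - set_b),      # IDs found only in the first list
        "only_in_b": len(set_b - set_a),      # IDs found only in the second list
        "b_subset_of_a": set_b <= set_a,      # True when every ID of b is also in a
    }

# Hypothetical IDs, for illustration only
environment_ids = ["p1", "p2", "p3"]
environmental_ids = ["p2", "p3"]
print(compare_id_lists(environment_ids, environmental_ids))
```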

### 1.1. Alternative filter: word lemmatization

Filtering by literal words may lead to losing some papers just because the word appears in a different form. For example, above we looked for the word "environment", and we had to check whether the word "environmental" was present in some papers where "environment" was not. To deal with this, we can **transform all words into their lemma**, and then filter the papers. 

Notice that **this procedure is very time consuming**, since we need to first transform all texts (including the ones that do not contain any information about our word list).

To do this, we will use the spaCy library:

In [None]:
nlp = spacy.load('en_core_web_lg')

def lemmatizer(text):
    # text is a spaCy Doc; return its lemmas joined into a single string
    return ' '.join(token.lemma_ for token in text)

# Filter papers containing the lemmas of all words in list
def filter_papers_word_list_lemma(word_list):
    papers_id_list = []
    # Split the lemmatized query back into words; iterating over the string itself would loop over single characters
    word_list_lemma = lemmatizer(nlp(' '.join(word_list))).split()
    for idx, paper in biorxiv.iterrows():
        paper_lemma = lemmatizer(nlp(paper.text))  # lemmatize each paper only once, not once per query word
        if all(w in paper_lemma for w in word_list_lemma):
            papers_id_list.append(paper.paper_id)

    return papers_id_list
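
Since lemmatizing every paper is the expensive step, one option is to do it once and reuse the result for several queries. The sketch below illustrates that idea under stated assumptions: `build_lemma_index`, `filter_by_lemmas` and `toy_lemmatize` are hypothetical names, and `toy_lemmatize` is a crude stand-in that does not use spaCy.

```python
def build_lemma_index(texts_by_id, lemmatize):
    """Lemmatize each paper once and store its lemmas as a set."""
    return {paper_id: set(lemmatize(text)) for paper_id, text in texts_by_id.items()}

def filter_by_lemmas(lemma_index, query_lemmas):
    """Return IDs of papers whose lemma set contains every query lemma."""
    wanted = set(query_lemmas)
    return [paper_id for paper_id, lemmas in lemma_index.items() if wanted <= lemmas]

def toy_lemmatize(text):
    # Stand-in for a real lemmatizer (NOT spaCy): lowercases and strips a trailing "s".
    return [w[:-1] if w.endswith("s") else w for w in text.lower().split()]

# Hypothetical texts, for illustration only
papers = {"p1": "droplets persist in indoor environments", "p2": "droplets and genomes"}
index = build_lemma_index(papers, toy_lemmatize)  # the slow step, done a single time
print(filter_by_lemmas(index, ["droplet", "environment"]))
print(filter_by_lemmas(index, ["droplet"]))
```

With spaCy, the `lemmatize` argument could be something like `lambda text: [t.lemma_ for t in nlp(text)]`. Matching on sets of lemmas also compares whole tokens, so a query lemma no longer matches inside a longer word.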

In [None]:
#ts = time.time() # I comment this part so that commit time does not skyrocket

#biorxiv_environment_lemma = filter_papers_word_list_lemma(["coronavirus","environment"])
#print("Papers containing the lemmas of coronavirus and environment: ", len(biorxiv_environment_lemma))

#print("Time spent: ", time.time() - ts)

## 2. Extract the conclusions section <a id="section2"></a>

Most scientific papers contain a Conclusion section, which consists of a summary of the main observations and results from the study. To reduce the amount of data to analyze, it may prove useful to focus on the conclusions instead of searching the full text of each paper. 

In [None]:
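def extract_conclusion(dataset, papers_id_list):
    # Copy the selection so that adding a column does not touch the original DataFrame or raise a SettingWithCopyWarning
    data = dataset.loc[dataset['paper_id'].isin(papers_id_list)].copy()
    conclusion = []
    for idx, paper in data.iterrows():
        paper_text = paper.text
        if "\nConclusion\n" in paper_text:
            # Keep the text that follows the first "Conclusion" heading (up to any later one)
            conclusion.append(paper_text.split('\nConclusion\n')[1])
        else:
            conclusion.append("No Conclusion section")
    data['conclusion'] = conclusion

    return data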

pd.reset_option('^display.', silent=True)
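
The exact match on "\nConclusion\n" will miss headings such as "Conclusions", "CONCLUSION" or "5. Conclusions". The following is a hedged sketch of a more tolerant heading match using a regular expression. The pattern and the `split_conclusion` helper are assumptions made for illustration and have not been checked against the real dataset.

```python
import re

# A line holding only "Conclusion" or "Conclusions" in any letter case,
# optionally preceded by a section number such as "5." or "5".
CONCLUSION_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?conclusions?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_conclusion(text):
    """Return the text after the last conclusion-like heading, or None if there is none."""
    matches = list(CONCLUSION_HEADING.finditer(text))
    if not matches:
        return None
    return text[matches[-1].end():].strip()

# Hypothetical paper text, for illustration only
sample = "Intro\nSome text\n5. CONCLUSIONS\nMasks help.\n"
print(split_conclusion(sample))
```

Using the last matching heading is a design choice: it assumes the conclusion sits near the end of the paper. A sentence that merely starts with "In conclusion," is not treated as a heading, because the pattern must match the whole line.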

Let's now extract the Conclusion section from all papers containing the words "coronavirus", "environment" and "transmission":

In [None]:
environ_trans_conclusion = extract_conclusion(biorxiv, biorxiv_environment_transmission)
environ_trans_conclusion.head(5)

Done, our DataFrame now has a conclusion column. We are able to process conclusions independently, but remember that this is convenient only in certain situations. With the help of the word_bar_graph_function from [Paul Mooney](https://www.kaggle.com/paultimothymooney/most-common-words-in-the-cord-19-dataset), let's study which are the most frequent words in the Conclusion section of papers containing "coronavirus", "environment" and "transmission":

In [None]:
pd.set_option('display.width', 100000)

def word_bar_graph_function(df,column,title):
    # adapted from https://www.kaggle.com/benhamner/most-common-forum-topic-words
    topic_words = [ z.lower() for y in
                       [ x.split() for x in df[column] if isinstance(x, str)]
                       for z in y]
    word_count_dict = dict(Counter(topic_words))
    popular_words = sorted(word_count_dict, key = word_count_dict.get, reverse = True)
    stop_words = set(stopwords.words("english"))  # build the stopword set once instead of once per word
    popular_words_nonstop = [w for w in popular_words if w not in stop_words]
    # Top 50 words (or fewer if the subset is small), reversed so the most frequent bar is drawn at the top
    top_words = popular_words_nonstop[0:50][::-1]
    plt.barh(range(len(top_words)), [word_count_dict[w] for w in top_words])
    plt.yticks(range(len(top_words)), top_words)  # bars are centered on their position, so no offset is needed
    plt.title(title)
    plt.show()

plt.figure(figsize=(10,10))
word_bar_graph_function(environ_trans_conclusion, "conclusion", "Most common words in papers with environment & transmission")

We observe that some of these words are merely situational (e.g. conclusion, section, medrxiv), but others may be particularly common in the subset of papers we have filtered. For example, words like "epicenter", "humidity", "city", "distance" and "lockdown" seem closely related to the transmission of the virus and to environmental effects, and they are probably less frequent in other articles. 

Let's compare this case with papers containing the word "susceptibility":

In [None]:
biorxiv_susceptibility = filter_papers_word_list(["coronavirus", "susceptibility"])
susceptibility_conclusion = extract_conclusion(biorxiv, biorxiv_susceptibility)
plt.figure(figsize=(10,10))
word_bar_graph_function(susceptibility_conclusion, "conclusion", "Most common words in papers with susceptibility")

As expected, in this case there is no trace of the words "epicenter", "humidity", "city", "distance" or "lockdown". Instead, we now see high frequencies for words like "reporting", "rate", "scenarios", "diagnostic" and "protocol", which are more related to the population's susceptibility to contagion.
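
The visual comparison of the two bar charts can also be made programmatically. The sketch below ranks words by how much more often they occur in one set of conclusions than in another. The helper names and the example sentences are hypothetical, and it uses raw counts, so it assumes the two subsets are of comparable size.

```python
from collections import Counter

def word_counts(texts):
    """Count lowercase whitespace-separated tokens across a list of strings."""
    return Counter(word for text in texts for word in text.lower().split())

def distinctive_words(texts_a, texts_b, top_n=5):
    """Words ranked by how much more often they occur in texts_a than in texts_b."""
    counts_a, counts_b = word_counts(texts_a), word_counts(texts_b)
    # Counter subtraction keeps only the words with a positive difference.
    difference = counts_a - counts_b
    return [word for word, _ in difference.most_common(top_n)]

# Hypothetical conclusions, for illustration only
env = ["humidity and lockdown reduce transmission", "humidity matters"]
sus = ["diagnostic protocol and reporting rate", "transmission rate"]
print(distinctive_words(env, sus, top_n=1))
```

In the notebook, the inputs could be the `conclusion` columns of `environ_trans_conclusion` and `susceptibility_conclusion`, ideally after removing stopwords as in `word_bar_graph_function`.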