<div align="right"> <span style="font-size:0.8em"> Przemysław Bedełek, Mateusz Marciniewicz </span></div>

# <div align="center"> What do we know about COVID-19 risk factors? </div>  
# <div align="center"> <span style="color:gray; font-size:0.5em;">(COVID-19 Open Research Dataset Challenge - task 2)</span></div>
  
    
     
### 1. **Interpretation**
This notebook addresses the 2<sup>nd</sup> task of the challenge.  
In our approach, we search the dataset for articles that address the following questions:  
> * Data on potential **risk factors** - smoking, pulmonary diseases, co-infections etc.
* **Transmission dynamics** of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
* **Severity of disease**, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
* **Susceptibility of populations**
* Public health **mitigation measures** that could be effective for control

### 2. **Approach**
Some readers may not be familiar with NLP methods (yet).  
To make our code easier to follow, here is a brief step-by-step description of our solution: 
> 1. Data preprocessing:
    * removal of stopwords
    * removal of redundant information - links, references, two-letter words and letter-digit sequences
    * tokenization
    * stemming
2. Articles evaluation using [TF-IDF](https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089): 
    * Text vectorizing 
    * calculating each article's TF-IDF value based on initial keywords (arbitrarily chosen with hope for success :) )
    * collecting articles with the best TF-IDF values
    * searching for extra keywords based on collected articles
    * removal of the "least informative" keywords (those appearing in too many articles)
    * re-collecting the best articles
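
The ranking idea in step 2 can be sketched in a few lines of plain Python. This is a simplified illustration, assuming the textbook formula tf * log(N / df) without the smoothing and normalization that library implementations usually add; the toy documents and the helper names `tfidf_scores` and `rank_documents` are invented for this example.

```python
import math
from collections import Counter

# Invented, already tokenized toy documents
docs = [
    ['smoke', 'risk', 'lung', 'risk'],
    ['transmiss', 'droplet', 'mask'],
    ['risk', 'mask', 'age'],
]

def tfidf_scores(documents):
    """Return one {token: tf-idf} dict per document (plain tf * log(N / df))."""
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(token for doc in documents for token in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores

def rank_documents(documents, keywords):
    """Rank document indices by the summed tf-idf value of the given keywords."""
    scores = tfidf_scores(documents)
    totals = [sum(doc_scores.get(k, 0.0) for k in keywords) for doc_scores in scores]
    return sorted(range(len(documents)), key=lambda i: totals[i], reverse=True)

ranking = rank_documents(docs, ['risk', 'smoke'])
print(ranking)
```

The first document ranks highest because it contains both keywords, and the second ranks last because it contains neither.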
    
### 3. **Pros and cons of our solution**
> Advantages:
    * Simplicity - the main idea behind the solution is to use basic nltk text preprocessing and TF-IDF, which are both well documented
    and easy to understand.
    * Potential for universal usage - it can be used for any set of initial keywords and should still return relevant articles.
> Disadvantages:
    * The algorithm is sensitive to the quality of the provided keywords.
    * The text preprocessing is quite harsh: limiting the number of tokens in each article increases the computation speed and removes noise, but it may also cause some loss of information in certain articles.
    * There is no duplicate removal in the final articles.


### 4. **Result**

The final result of our computations will be:
> * for each problem above, a set of top *n* best matching articles
* word clouds of the most frequently occurring words in each topic
* plots evaluating how accurate our approach was







# <div align="center"> Let's begin! </div>

# 1. Setup

In [None]:
# Libraries used throughout the notebook
import glob
import json
import os
import string
from datetime import datetime

import nltk
import numpy as np  # linear algebra
import pandas as pd  # data processing

# Input data files are available in the "../input/" directory.

# Record the start time so the total run time can be reported at the end
start = datetime.now()
print(f'Execution started at: {start.strftime("%m/%d/%Y, %H:%M:%S")}')

* Create an auxiliary class that stores an article's file path and its main text

In [None]:
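class Article:
    def __init__(self, file_path):
        # Close the file as soon as the JSON is parsed
        with open(file_path) as json_file:
            content = json.load(json_file)
        self.paper_id = file_path
        # Join paragraphs with a space so that words at paragraph
        # boundaries are not glued together
        self.body_text = ' '.join(paragraph['text'] for paragraph in content['body_text'])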
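
A minimal sketch of what the class extracts from one file, assuming the JSON layout read in the code above (`body_text` as a list of paragraphs, each with a `text` field). The sample document and the helper name `extract_body_text` are invented for illustration, and the sketch separates paragraphs with a space.

```python
import json

# Invented sample mimicking the fields read by the Article class
sample_json = '''
{
  "metadata": {"title": "Sample title"},
  "body_text": [
    {"text": "First paragraph."},
    {"text": "Second paragraph."}
  ]
}
'''

def extract_body_text(content):
    """Concatenate the text of all body paragraphs, separated by spaces."""
    return ' '.join(paragraph['text'] for paragraph in content['body_text'])

content = json.loads(sample_json)
body_text = extract_body_text(content)
print(body_text)
```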

* Prepare paths to read files

In [None]:
root_path = '/kaggle/input/CORD-19-research-challenge'
sub_dirs = ['/biorxiv_medrxiv/biorxiv_medrxiv/pdf_json',
            '/comm_use_subset/comm_use_subset/pdf_json',
            '/comm_use_subset/comm_use_subset/pmc_json',
            '/custom_license/custom_license/pdf_json',
            '/custom_license/custom_license/pmc_json',
            '/noncomm_use_subset/noncomm_use_subset/pdf_json',
            '/noncomm_use_subset/noncomm_use_subset/pmc_json'
           ]

* Create an absolute file path for each article


In [None]:
all_paths = []

# Collect the JSON files from every sub-directory into one flat list
for sub_dir in sub_dirs:
    all_paths.extend(glob.glob(f'{root_path}{sub_dir}/*.json'))

len(all_paths)


* Load articles. Only 10000 articles will be loaded in order to accelerate the computation.


In [None]:
#Loading articles
# To speed up computing, we work on a smaller number than 50k. However, feel free to type a bigger one.
# min() guards against asking for more files than were found.
articles_number = min(10000, len(all_paths))
articles = []
for index in range(articles_number):
    if index % max(1, articles_number//5) == 0:
        print(f'{index/(articles_number)*100}% of files processed: {index}')
    articles.append(Article(all_paths[index]))
print('Files loading finished')    

articles[3].body_text


# 2. Text preprocessing

* Removal of digits and punctuation characters
* Tokenization
* Strings with length under 3 characters are omitted as they don't carry much meaning 

In [None]:
# In the first step of data preprocessing the punctuation and numerical characters are removed.
# Besides that, words under 3 letters are omitted as they don't carry any significant meaning.
for i in range(articles_number):
    articles[i].body_text = articles[i].body_text.replace('-',' ')
    articles[i].body_text = articles[i].body_text.translate(str.maketrans('','',string.punctuation + string.digits))
    articles[i].body_text = [w.lower() for w in articles[i].body_text.split() if len(w)>2 and w.isalpha()]
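
To see what these steps do to a single sentence, here is a small self-contained sketch that wraps the same three operations in a hypothetical helper, `clean_text`. The sample sentence is invented.

```python
import string

def clean_text(text):
    """Apply the cleaning steps of the cell above to a single string."""
    # Split hyphenated words into separate words
    text = text.replace('-', ' ')
    # Drop every punctuation character and digit
    text = text.translate(str.maketrans('', '', string.punctuation + string.digits))
    # Keep lower-cased alphabetic tokens of at least 3 characters
    return [w.lower() for w in text.split() if len(w) > 2 and w.isalpha()]

tokens = clean_text('COVID-19 spread in 2020: a well-known R0 of 2.5!')
print(tokens)
```

Note how the numbers disappear entirely and short tokens such as "in", "of" and the remainder of "R0" are dropped.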

#### The noise words list consists mostly of words that refer to copyrights and usage licences. They don't carry any information as they are present in almost all articles.
* Removal of stopwords, noise words and http links

In [None]:
from nltk.corpus import stopwords 
noise_words = ['medrxiv','biorxiv','covid','sars','preprint','authorfunder','license','available','copyright','peer','granted','perpetuityis','display','coronavirus','doi','also']

# Build the set of words to drop once; set membership tests are much faster
# than rebuilding and scanning a list for every single word
words_to_remove = set(stopwords.words('english') + noise_words)

for index in range(articles_number):    
    articles[index].body_text = [word for word in articles[index].body_text if word not in words_to_remove and word[:4] != 'http']  
    if index % (articles_number//5) == 0:
        print(f'{index/(articles_number)*100}% of files processed: {index}')
print('Stopwords removal finished')
    


* Stemming using the snowball stemmer

In [None]:
# Stemming using the nltk SnowballStemmer
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

for i in range(len(articles)):    
    articles[i].body_text = [stemmer.stem(word) for word in articles[i].body_text]
" ".join(articles[3].body_text)


# 3. Articles evaluation

#### Creating a sparse matrix containing the TF-IDF values for all articles. The articles are already tokenized, so the tokenizer in the TfidfVectorizer is replaced with a dummy function. We use CSR and CSC sparse matrices to minimize the cost of iterating over articles (rows) and keywords (columns), respectively, in the TF-IDF sparse matrix.





In [None]:
# Tf-idf function
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import coo_matrix
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt

# The text is already tokenized so the default tf-idf tokenizer is not needed here.
# It will be swapped with the dummy function below

def dummy_fun(tokens):
    # Identity function: return the already tokenized article unchanged
    return tokens

tfidf = TfidfVectorizer(analyzer='word',
                        tokenizer=dummy_fun,
                        preprocessor=dummy_fun,
                        token_pattern=None)

tfidf_matrix = coo_matrix(tfidf.fit_transform([article.body_text for article in articles]))
tfidf_csr = tfidf_matrix.tocsr()
tfidf_csc = tfidf_matrix.tocsc()

print('Matrix size: (articles, unique_tokens) = ' + str(tfidf_matrix.shape))
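
The helper functions further down read the `indptr`, `indices` and `data` arrays of the sparse matrices directly. The sketch below shows how those three arrays describe the rows of a CSR matrix, using plain Python lists instead of scipy. The toy matrix and the helper names `to_csr` and `row_entries` are invented for illustration.

```python
# Toy dense matrix: 3 articles (rows) x 4 tokens (columns)
dense = [
    [0.0, 0.5, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.1, 0.0],
]

def to_csr(matrix):
    """Build the three CSR arrays (indptr, indices, data) from a dense matrix."""
    indptr, indices, data = [0], [], []
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0.0:
                indices.append(col)   # column of the non-zero entry
                data.append(value)    # the entry itself
        indptr.append(len(indices))   # where the next row starts
    return indptr, indices, data

def row_entries(indptr, indices, data, row):
    """Return the (column, value) pairs stored for one row."""
    start, end = indptr[row], indptr[row + 1]
    return list(zip(indices[start:end], data[start:end]))

indptr, indices, data = to_csr(dense)
print(indptr)
print(row_entries(indptr, indices, data, 2))
```

Row `r` occupies positions `indptr[r]` up to (but not including) `indptr[r + 1]` in both `indices` and `data`, which is why an empty row costs nothing to skip. CSC stores the same information column by column, so the same slicing returns the articles that contain a given token.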

* Declaring initial keywords that will identify articles from different topics.

In [None]:
# In the second task there are 5 major fields in which we want to harvest information 
topics = ['Risk factors', 'Transmission dynamics', 'Severity of disease', 'Susceptibility of population', 'Mitigation measures']

# Listed below are the arbitrarily chosen keywords for each topic
init_keywords = {
 'Risk factors':['risk', 'factors', 'smoke', 'tobacco', 'cigarette', 'pneumonic', 'pulmonary', 'coexisting', 'coinfections', 'comorbidities', 'preexisting', 'chronic', 'neonates', 'mother', 'child', 'pregnancy', 'cancer', 'addiction', 'rich', 'poor', 'background', 'welfare', 'prosperity', 'immune'],
 'Transmission dynamics': ['reproductive', 'number', 'incubation', 'period', 'serial', 'interval', 'transmission', 'spread', 'environment', 'circumstances', 'respiratory', 'droplets'],
 'Severity of disease': ['fatality', 'risk', 'severe', 'hospitalize', 'mortality', 'death', 'rate', 'serious', 'mild'],
 'Susceptibility of population': ['susceptibility', 'receptivity', 'sensitivity', 'age', 'old', 'young', 'ill', 'cold'],
 'Mitigation measures': ['mitigate', 'measures', 'action', 'public', 'health', 'healthcare', 'reaction', 'counteraction', 'flatten', 'capacity', 'mask', 'gloves', 'soap', 'lockdown', 'wash', 'clean', 'sterile', 'prevent', 'slow', 'fast', 'block']}

keywords = init_keywords.copy()

In [None]:
# Stemming the keywords so they match the stemmed tokens from the tfidf vect

for topic in topics:    
    keywords[topic] = [stemmer.stem(word) for word in keywords[topic]]

# Remove those initial keywords that don't appear in articles
# (vocabulary_ maps each token to its column index in the tf-idf matrix,
# so the lookup avoids rebuilding and scanning the feature list for every word)
for topic in topics:    
    keywords[topic] = [word for word in keywords[topic] if word in tfidf.vocabulary_]
    
# Getting indices of our keywords in the tfidf_array
keywords_indices = {}

for topic in topics:    
    keywords_indices[topic] = [tfidf.vocabulary_[word] for word in keywords[topic]]




* Create auxiliary functions

In [None]:
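def get_tfidf_value(article,keyword):
    # Positions of the article's non-zero entries in the CSR arrays
    start_index = tfidf_csr.indptr[article]
    end_index = tfidf_csr.indptr[article+1]
    # Iterate over positions, not over the column indices stored at them
    for i in range(start_index, end_index):
        if tfidf_csr.indices[i] == keyword:
            return tfidf_csr.data[i]
    return 0.0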


In [None]:
def evaluate_article(article, keywords):
    start_index = tfidf_csr.indptr[article] 
    end_index = tfidf_csr.indptr[article+1] 
    article_value = 0
    matching_indices = [i for i in range(start_index,end_index) if tfidf_csr.indices[i] in keywords]
    for i in matching_indices:
        article_value += tfidf_csr.data[i]
    return article_value    
 

In [None]:
def evaluate_keyword(articles, keyword):
    articles_indices = [article[0] for article in articles]
    start_index = tfidf_csc.indptr[keyword]
    end_index = tfidf_csc.indptr[keyword+1]
    keyword_value = 0
    matching_indices = [i for i in range(start_index,end_index) if tfidf_csc.indices[i] in articles_indices]
    for i in matching_indices:
        keyword_value += tfidf_csc.data[i]
    return keyword_value  

In [None]:
#Insert adds an item into a list sorted by value in descending order
def insert(items, doc, doc_value): 
    global top_number
    if len(items) < top_number: 
        # Keep the list sorted while it is still filling up
        items.append([doc,doc_value])
        items.sort(key=lambda item: item[1], reverse=True)
        return items
    for index in range(top_number):
        if items[index][1] <= doc_value:
            items = items[:index] + [[doc,doc_value]] + items[index:]
            return items[:top_number]
    return items
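
Keeping only the highest-scoring items can also be done with the standard library. The sketch below uses `heapq.nlargest` as an alternative, not as what this notebook does; the helper name `top_n` and the scores are invented for illustration.

```python
import heapq

def top_n(scored_items, n):
    """Return the n [item, score] pairs with the highest scores, best first."""
    return heapq.nlargest(n, scored_items, key=lambda pair: pair[1])

# Invented [article index, score] pairs
scores = [[0, 0.10], [1, 0.50], [2, 0.30], [3, 0.20]]
best = top_n(scores, 3)
print(best)
```

This approach needs all scores collected up front, whereas the incremental insertion above keeps only `top_number` items in memory at any time.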

In [None]:
def get_best_articles(keywords_indices):
    best_articles = {topic: [] for topic in topics}
    for topic in topics:
        for article in range(len(articles)):        
            article_value = evaluate_article(article, keywords_indices[topic])
            best_articles[topic] = insert(best_articles[topic],article,article_value)
    return best_articles

In [None]:
def evaluate_col(col):
    start_index = tfidf_csc.indptr[col]
    end_index = tfidf_csc.indptr[col+1]
    value = 0
    for i in tfidf_csc.data[start_index:end_index]:
        value += i
        
    return value

* Collect articles with the best TF-IDF values

In [None]:
#Best articles contains the most informative articles in each topic
top_number = 20 # number of top articles that we would like to assign to each topic
best_articles = get_best_articles(keywords_indices)

* Search for extra keywords based on collected articles

In [None]:
# Extra keywords holds the most important keywords in each topic
extra_keywords = {topic: [] for topic in topics}
n_tokens = tfidf_matrix.shape[1]  # number of unique tokens (columns)
for topic in topics:
    for keyword in range(n_tokens):
        keyword_value = evaluate_keyword(best_articles[topic],keyword)
        extra_keywords[topic] = insert(extra_keywords[topic],keyword,keyword_value)

        

In [None]:
top20_keywords = {}
for topic in topics:
    top20_keywords[topic] =  [tfidf.get_feature_names()[extra_keywords[topic][doc][0]] for doc in range(len(extra_keywords[topic]))]
    print(f'{topic}: {top20_keywords[topic][1:-1]}')


Amongst the new keywords there are some words like infect, cov or case. Those words appear in all topics, hence they don't provide us with valuable information. Therefore they need to be removed.

In [None]:
# Some of the extra_keywords don't hold much information as they appear in articles from various fields, e.g. case, cov

to_remove = {topic: [] for topic in topics}

# Collect these keywords
for topic in topics:
    for keyword in extra_keywords[topic]:
        keyword_value = evaluate_col(keyword[0])/(0.3 * len(articles))
        if(keyword_value > keyword[1]/top_number):
            to_remove[topic].append(keyword)
    print(f'{topic}: { [tfidf.get_feature_names()[to_remove[topic][doc][0]] for doc in range(len(to_remove[topic]))] }')
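
The removal rule in the cell above compares two averages: the keyword's summed TF-IDF value over the whole corpus divided by 30% of the number of articles, and its summed value over the top articles divided by `top_number`. A worked sketch with invented numbers follows; the helper name `is_uninformative` and its parameter names are hypothetical.

```python
def is_uninformative(corpus_sum, top_sum, n_articles, top_number, corpus_fraction=0.3):
    """Compare a keyword's average tf-idf weight in the corpus and in the top articles.

    corpus_sum: sum of the keyword's tf-idf values over all articles
    top_sum:    sum of its tf-idf values over the top articles of one topic
    """
    corpus_average = corpus_sum / (corpus_fraction * n_articles)
    top_average = top_sum / top_number
    # A keyword is dropped when it is, on average, heavier in the corpus
    # than in the articles selected for the topic
    return corpus_average > top_average

# Invented numbers: 10000 articles, 20 top articles per topic
common_word = is_uninformative(corpus_sum=450.0, top_sum=2.0, n_articles=10000, top_number=20)
specific_word = is_uninformative(corpus_sum=30.0, top_sum=4.0, n_articles=10000, top_number=20)
print(common_word, specific_word)
```

For the first word, 450 / 3000 = 0.15 exceeds 2 / 20 = 0.1, so it is removed. For the second, 30 / 3000 = 0.01 is well below 4 / 20 = 0.2, so it is kept.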


In [None]:
for topic in topics: 
    extra_keywords[topic] = [keyword for keyword in extra_keywords[topic] if keyword not in to_remove[topic]]
    print(f'{topic}: { [tfidf.get_feature_names()[extra_keywords[topic][doc][0]] for doc in range(len(extra_keywords[topic]))] }')    

extra_keywords_indices = {}
for topic in topics:
    extra_keywords_indices[topic] = [word[0] for word in extra_keywords[topic]]
    
for topic in topics:
    keywords_indices[topic] += extra_keywords_indices[topic]
    keywords_indices[topic] = list(set(keywords_indices[topic]))   

new_articles = get_best_articles(keywords_indices)

keywords = {}
for topic in topics:
    keywords[topic] = {}
    for i in keywords_indices[topic]:
        keyword_value = 0
        for article in new_articles[topic]:
            keyword_value += get_tfidf_value(article[0],i)
        if keyword_value != 0.0:
            keywords[topic][tfidf.get_feature_names()[i]] = keyword_value



* Normalize TF-IDF values

In [None]:
for topic in topics: 
    best_articles[topic] = [ [item[0], item[1]/len(init_keywords[topic])] for item in best_articles[topic]]
    new_articles[topic] = [ [item[0], item[1]/len(keywords_indices[topic])] for item in new_articles[topic]]


#### Distribution of the best articles in each topic. The y-axis represents the average tf-idf value for the article per keyword. 

In [None]:
import seaborn as sns

def create_boxplot(articles):
    data = {}
    for topic in topics:
        data[topic] = [item[1] for item in articles[topic]]
    sns.set(palette='Blues_d', style="whitegrid")
    df = pd.DataFrame.from_dict(data)
    boxplot = df.boxplot(figsize=(16,8))
    
create_boxplot(best_articles)

### The boxplot above shows the distribution before the addition of extra keywords.

In [None]:
for topic in topics: 
    print(len(init_keywords[topic]))
    print(len(keywords_indices[topic]))

* Create a boxplot for the new set of best articles in order to compare the two sets

In [None]:
create_boxplot(new_articles)

# 4. Result
* Create a wordcloud of keywords for each topic

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

word_clouds = {}
for topic in topics:
    print(topic)
    word_clouds[topic] = WordCloud(background_color='white').generate_from_frequencies(keywords[topic])
    plt.figure(figsize=(16,8))
    plt.imshow(word_clouds[topic])
    plt.axis('off')
    plt.show()

In [None]:
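def present_article(file_path):
    # Close the file as soon as the JSON is parsed
    with open(file_path) as json_file:
        content = json.load(json_file)
    title = content['metadata']['title']

    print(f'\nTitle: {title}\n')

    # Show only the first 300 characters of the body
    body_text = ' '.join(paragraph['text'] for paragraph in content['body_text'])
    print(f'Text: {body_text[:300]}')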
   
    


In [None]:
for topic in topics:
    print(f'{topic}: \n')
    
    for i in range(3):
        present_article(articles[new_articles[topic][i][0]].paper_id)

In [None]:
end = datetime.now()
total = end - start
print(f'Execution finished at: {end.strftime("%m/%d/%Y, %H:%M:%S")} \nDuration: {total}')