# Coronavirus Online Paper Search by Using Keywords

*by Yusuf Güven*

This notebook finds the most frequent words in the Kaggle coronavirus article database. An article is eligible for the search only if it contains all of the keywords. Each eligible article is split into words, and the frequency of each word is counted across the whole group of eligible articles. The most frequent words are then listed and visualized in a word cloud.

Text in red marks the parameters that can be changed freely to reach the desired results.

## 1. Setting the Environment and Understanding the Data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory     
# Any results you write to the current directory are saved as output.

import os

# files is the dataframe that contains the file name and full path of every file under the input directory.
# Rows are collected in a list first because DataFrame.append was removed in pandas 2.0.
rows = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        rows.append({"name": filename, "path": os.path.join(dirname, filename)})
files = pd.DataFrame(rows, columns=["name", "path"])


In [None]:
# df is the dataframe of the metadata file given in the covid 19 open research database
df=pd.read_csv("../input/CORD-19-research-challenge/metadata.csv")
print(df.shape)
df.head()

In [None]:
files.head(10)

In [None]:
files.shape

Only .json files are needed for the analysis. Files with other extensions are removed from the data frame below.

In [None]:
# Match on the file extension; str.contains(".json") would treat "." as a regex wildcard.
indices = files[~files["name"].str.endswith(".json")].index
files.drop(index=indices, inplace=True)
files.shape

In [None]:
files.reset_index(drop=True, inplace=True)
files.head(3) # We now have the name of every file in json format and its path (including the file name).
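
The walk-and-filter steps above can also be done in a single pass. The following is a minimal sketch, assuming only the file name and full path are needed; `list_json_files` is a hypothetical helper name that is not part of the notebook.

```python
from pathlib import Path


def list_json_files(root):
    """Return (name, path) pairs for every .json file under root, sorted by path."""
    return sorted(
        ((p.name, str(p)) for p in Path(root).rglob("*.json") if p.is_file()),
        key=lambda pair: pair[1],
    )
```

The result could be passed to `pd.DataFrame(list_json_files('/kaggle/input'), columns=["name", "path"])` to obtain the same two columns without a separate drop step.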

## 2. Searching Through Files with Keywords

In [None]:
# necessary modules for keyword search and visualization
import json
from pandas import json_normalize # the pandas.io.json import path was deprecated in pandas 1.0
import collections
#!pip install wordcloud
# if wordcloud is not installed, install it by uncommenting the line above
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
stopwords = set(STOPWORDS) #stopwords are the words like "the", "he'll", "i", etc that won't be used in the search.

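Any number of keywords can be added to the list, separated by commas. The search checks whether all of the keywords occur in the article; if any one of them is missing, the article is excluded from the search. The article text is lowercased before matching, so keywords are best written in lowercase.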

**<font color="red"> Update the keywords below for iterations: </font>**

In [None]:
key_words=["pseudoknots"]

In [None]:
# This function searches the title and the body of the article for the keywords.
def article_search(path, key_words):
    with open(path) as f:
        article_json = json.load(f)
    title = json_normalize(article_json['metadata'])
    article_title = title.title[0].lower()
    article = article_title + "\n"
    text = json_normalize(article_json['body_text'])
    for j in range(text.shape[0]):
        article = article + "\n" + text.text[j].lower()

    # Check whether the article contains all keywords.
    # The article text is lowercase, so each keyword is lowercased before matching.
    article_chosen = all(key_word.lower() in article for key_word in key_words)
    if article_chosen:
        return article_chosen, article_title, article
    else:
        return article_chosen, "", ""
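
The all-keywords rule can be tried on plain strings without reading any files. This is a minimal sketch; `contains_all_keywords` is a hypothetical helper and the sample sentence is made up for illustration. Note that the match is by substring, so a short keyword also matches inside longer words.

```python
def contains_all_keywords(text, key_words):
    """Return True when every keyword occurs in text as a case-insensitive substring."""
    lowered = text.lower()
    return all(key_word.lower() in lowered for key_word in key_words)


# Made-up sample sentence, for illustration only.
sample = "RNA pseudoknots stimulate ribosomal frameshifting in coronaviruses."
print(contains_all_keywords(sample, ["pseudoknots"]))             # True
print(contains_all_keywords(sample, ["pseudoknots", "vaccine"]))  # False
print(contains_all_keywords(sample, ["knot"]))                    # True (substring match)
```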

The articles that contain all the keywords are eligible for the analysis. Each selected article is split into words, and the words are counted. At the end, the words are sorted from the most common to the rarest. The last words in this ordering are expected to appear only once.

In [None]:
wordcount={} # defining dictionary
article_titles=[] # will be used for listing the selected articles at the end of the notebook
for i in range(files.shape[0]):
    path=files.loc[i,"path"]
    contains_keywords, article_title, article = article_search(path, key_words)
    
    if contains_keywords:
        article_titles.append(article_title)
        for word in article.split():
            word = word.replace(".","")
            word = word.replace(",","")
            word = word.replace("\"","")
            word = word.replace("“","")
            word = word.replace("(","")
            word = word.replace(")","")
            word = word.replace("<","")
            word = word.replace(">","")
            if word not in stopwords:
                if word not in wordcount:
                    wordcount[word] = 1
                else:
                    wordcount[word] += 1
            
word_dict = collections.Counter(wordcount) # final dictionary
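
The counting step can also be written more compactly with `str.translate` and `collections.Counter`. The sketch below is an illustration under stated assumptions, not the notebook's method: `count_words`, `SAMPLE_STOPWORDS` and `STRIP_TABLE` are hypothetical names, the stopword set is a small stand-in for wordcloud's `STOPWORDS`, and, unlike the loop above, tokens that become empty after stripping are skipped.

```python
import collections

# Illustrative stand-in; the notebook uses wordcloud's STOPWORDS instead.
SAMPLE_STOPWORDS = {"the", "a", "of", "in", "and"}
# The same characters the loop above strips, plus the closing curly quote.
STRIP_TABLE = str.maketrans("", "", '.,"“”()<>')


def count_words(articles, stopwords=SAMPLE_STOPWORDS):
    """Count non-stopword tokens across all articles after stripping punctuation."""
    counts = collections.Counter()
    for article in articles:
        for word in article.lower().split():
            word = word.translate(STRIP_TABLE)
            # Skip tokens that were punctuation only, and skip stopwords.
            if word and word not in stopwords:
                counts[word] += 1
    return counts
```

A single translation table removes all listed characters in one call per word, which replaces the chain of `replace` calls.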

In [None]:
len(word_dict) # number of words in the dictionary

**<font color="red"> Update the starting and ending words range below for iterations: </font>**

In [None]:
start = 0 # index of the first word in the range. 0 is the most common word. Negative values count back from the rarest word.
end = 100 # index just past the last word in the range. It should be greater than start; with a negative start, use a negative end as well.

The words in the selected range are listed below. Rare words can also be examined by setting the range accordingly. Important rare words should be added to the keywords in order to look for them in the other articles. The search can be iterated by changing the keywords and the dictionary range.

In [None]:

for word, count in word_dict.most_common()[start:end]:
    print(word, ": ", count)
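
How `start` and `end` interact with the ranked list is easiest to see on a tiny counter. The sketch below uses made-up counts and a hypothetical helper, `select_range`; it relies only on ordinary Python list slicing over `Counter.most_common()`.

```python
import collections

# Made-up counts, for illustration only.
demo_counts = collections.Counter(
    {"virus": 5, "rna": 4, "protein": 3, "frameshift": 2, "pseudoknot": 1}
)


def select_range(counter, start, end):
    """Return the (word, count) pairs ranked from position start up to, not including, end."""
    return counter.most_common()[start:end]


print(select_range(demo_counts, 0, 2))   # [('virus', 5), ('rna', 4)]
print(select_range(demo_counts, -2, 5))  # [('frameshift', 2), ('pseudoknot', 1)]
print(select_range(demo_counts, -2, 2))  # [] because position -2 lies after position 2
```

The last call shows that a negative `start` combined with a small positive `end` selects nothing once the dictionary is longer than `end`.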

## 3. Visualization

The code blocks below create a word cloud of the words in the range selected above to visualize the search.

In [None]:
# build the source text by repeating each word as many times as it was counted
word_cloud_source=""
for word, count in word_dict.most_common()[start:end]:
    word_cloud_source=word_cloud_source+(word+" ")*count

# instantiate a word cloud object
word_cloud = WordCloud(
    background_color='white',
    max_words=abs(end-start),
    stopwords=stopwords,
    collocations=False
)

# generate the word cloud
word_cloud.generate(word_cloud_source)

In [None]:
# display the cloud
fig = plt.figure()
fig.set_figwidth(30) # set width
fig.set_figheight(25) # set height

plt.imshow(word_cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## 4. The List of Related Articles

Once the second part has produced the desired results, the details of the selected articles can be looked up. The code below generates this list.

In [None]:
indices = df[df.title.str.lower().isin(article_titles)].index
indices
df.loc[indices]

The data directory contains more files than there are articles listed in the metadata csv. For this reason, the check below looks for selected article titles that were not matched in the metadata. These remaining titles, which exist in the file directory but not in the metadata file, are listed below:

In [None]:
remaining_titles = pd.Series(article_titles)
# Boolean mask: True where the selected title was matched in the metadata.
in_metadata = remaining_titles.isin(df.loc[indices, "title"].str.lower())
remaining_titles[~in_metadata]
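
The same check can be expressed without pandas as a set lookup. This is a minimal sketch; `titles_missing_from_metadata` is a hypothetical helper, and non-string entries (such as missing titles read from a csv) are assumed to be safe to ignore.

```python
def titles_missing_from_metadata(selected_titles, metadata_titles):
    """Return selected titles, in order, with no case-insensitive match in metadata_titles."""
    # Non-string entries (for example NaN from a csv) are ignored.
    known = {title.lower() for title in metadata_titles if isinstance(title, str)}
    return [title for title in selected_titles if title.lower() not in known]
```

For example, `titles_missing_from_metadata(article_titles, df["title"])` would return the titles found in the json files but absent from the metadata.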