## Open questions and literature ranking related to COVID-19 

# Goal
Given the large volume of literature and the rapid spread of COVID-19, it is difficult for health professionals to keep up with new information on the virus. Can 
clustering similar research articles together simplify the search for related publications? How can the content of the clusters be qualified? Is clustering alone 
sufficient to find the papers related to the question/s, or do we need a classification algorithm to solve this problem?

By using an unsupervised clustering algorithm (k-means) for labelling, in combination with dimensionality reduction for visualization, the collection of 
literature can be represented by a scatter plot. On this plot, publications on highly similar topics will share a label and will be plotted near each other.

In order to answer the question/s with the most similar papers, we used the supervised k-nearest neighbors (KNN) classification algorithm, which takes its data 
labels from the clustering stage.

This is a difficult time in which health care workers, sanitation staff, and many other essential personnel are out there keeping the world afloat. While adhering 
to quarantine protocol, the Kaggle CORD-19 competition has given us an opportunity to help in the best way we can as computer science students. It should be noted, 
however, that we are not epidemiologists, and it is not our place to gauge the importance of these papers. This tool was created to make it easier for trained 
professionals to sift through the many publications related to the virus and make their own determinations.


# Approach:

- Parse the text from the body of each document using Natural Language Processing (NLP).
- Turn each document instance $d_i$ into a feature vector $X_i$ using Term Frequency–inverse Document Frequency (TF-IDF).
- Apply dimensionality reduction to each feature vector $X_i$ using t-Distributed Stochastic Neighbor Embedding (t-SNE), so that similar research articles lie close together in 
a two-dimensional embedding $Y_1$.
- Use Principal Component Analysis (PCA) to project $X$ down to the number of dimensions that retains 95% of the variance while removing noise and 
outliers, giving the embedding $Y_2$.
- Apply k-means clustering on $Y_2$, where $k$ is 20, to label each cluster on $Y_1$ (a suitable $k$ was determined by the elbow method).
- Apply KNN to the given question/s to find the 10 most related papers in the dataset, with the aid of the clustering labels.
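
The steps above can be sketched end to end on a tiny corpus. This is only an illustrative sketch: the six `toy_docs`, the question string, and the choice of 2 clusters and 3 neighbours are assumptions made for the example, not values from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy corpus standing in for the processed abstracts
toy_docs = [
    "virus genome sequence evolution",
    "genome mutation virus origin",
    "viral genome phylogenetic evolution",
    "hospital patient care ventilator",
    "patient intensive care hospital",
    "ventilator care patient outcome",
]

# TF-IDF feature vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(toy_docs)

# PCA keeping 95% of the variance
pca = PCA(n_components=0.95, random_state=42)
X_reduced = pca.fit_transform(X.toarray())

# k-means labels (k=2 only because the toy corpus is tiny)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_reduced)

# KNN trained on the PCA features with the cluster labels
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_reduced, labels)

# A question goes through the same vectorizer and PCA before the lookup
question = "virus genome evolution"
q = pca.transform(vectorizer.transform([question]).toarray())
distances, indices = knn.kneighbors(q, n_neighbors=3)
nearest_docs = [toy_docs[i] for i in indices[0]]
print(nearest_docs)
```

The question has to pass through the same fitted `vectorizer` and `pca` as the documents, otherwise its vector would not live in the space the KNN model was trained on.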


# Pros:
- Training a generalized model to respond faster to questions related to the Kaggle tasks.

- Using an unsupervised clustering technique to obtain labelled data, which are then used to train KNN, a simple and efficient supervised classification technique.


# Cons:
- The results do not depend on the meaning of the question; related papers are found through word matching.
- Choosing abstract data to train the model means dropping multiple documents which don't have an abstract.


# Table of Contents
1. Loading the data
2. Pre-processing
3. Vectorization
4. PCA & k-means Clustering
5. Dimensionality Reduction with t-SNE
6. Visualization Plot for clustering
7. Classify using KNN
8. Conclusion
9. Citation/Sources

# 1) Loading the Data
Load the data following the notebook by Ivan Ega Pratama, from Kaggle.
#### Cite: [Dataset Parsing Code | Kaggle, COVID EDA: Initial Exploration Tool](https://www.kaggle.com/ivanegapratama/covid-eda-initial-exploration-tool)

### Loading Metadata

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import glob
import json
from IPython.display import Image
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

Let's load the metadata of the dataset. The 'title' and 'journal' attributes may be useful later, when we cluster the articles, to see what kinds of articles cluster together.

In [None]:
root_path = '/kaggle/input/CORD-19-research-challenge/'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head()

In [None]:
meta_df.info()

### Fetch All JSON File Paths

Get the paths to all JSON files:

In [None]:
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)

### Helper Functions

File Reader Class

In [None]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            # Abstract
            if not 'abstract' in content.keys():
                self.abstract = ''
                self.body_text= ''
                return
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'
first_row = FileReader(all_json[0])
print(first_row)

This helper function adds a line break after a word once the running character count exceeds a given length. This is for the interactive plot, so that the hover tool fits the screen.

In [None]:
def get_breaks(content, length):
    data = ""
    words = content.split(' ')
    total_chars = 0

    # add break every length characters
    for i in range(len(words)):
        total_chars += len(words[i])
        if total_chars > length:
            data = data + "<br>" + words[i]
            total_chars = 0
        else:
            data = data + " " + words[i]
    return data
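
As a point of comparison, a similar effect can be sketched with the standard library's `textwrap` module. This is an alternative sketch, not the notebook's helper: `wrap_for_hover` and the example sentence are hypothetical, and unlike `get_breaks` it counts spaces and breaks before a line would exceed the width.

```python
import textwrap

def wrap_for_hover(content, width=40):
    """Hypothetical alternative to get_breaks: wrap text at `width`
    characters and join the resulting lines with HTML <br> tags."""
    return "<br>".join(textwrap.wrap(content, width=width))

# Made-up example sentence for illustration
example = "Coronavirus disease 2019 is an infectious disease caused by a novel coronavirus"
print(wrap_for_hover(example, width=40))
```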

### Load the Data into DataFrame

Using the helper functions, let's read the articles into a DataFrame that can be used easily:

In [None]:
dict_ = {'paper_id': [], 'doi':[], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [], 'abstract_summary': []}
for idx, entry in enumerate(all_json):
    if idx % (len(all_json) // 10) == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    
    try:
        content = FileReader(entry)
    except Exception as e:
        continue  # invalid paper format, skip
    
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue
    
    dict_['abstract'].append(content.abstract)
    dict_['paper_id'].append(content.paper_id)
    dict_['body_text'].append(content.body_text)
    
    # also create a column for the summary of abstract to be used in a plot
    if len(content.abstract) == 0: 
        # no abstract provided
        dict_['abstract_summary'].append("Not provided.")
    elif len(content.abstract.split(' ')) > 100:
        # abstract provided is too long for plot, take first 100 words append with ...
        info = content.abstract.split(' ')[:100]
        summary = get_breaks(' '.join(info), 40)
        dict_['abstract_summary'].append(summary + "...")
    else:
        # abstract is short enough
        summary = get_breaks(content.abstract, 40)
        dict_['abstract_summary'].append(summary)
        
    try:
        # authors are stored as one ';'-separated string
        authors = meta_data['authors'].values[0].split(';')
        if len(authors) > 2:
            # if more than 2 authors, take them all with html tag breaks in between
            dict_['authors'].append(get_breaks('. '.join(authors), 40))
        else:
            # authors will fit in plot
            dict_['authors'].append(". ".join(authors))
    except Exception as e:
        # authors value is Null (not a string), keep it as-is
        dict_['authors'].append(meta_data['authors'].values[0])
    
    # add the title information, add breaks when needed
    try:
        title = get_breaks(meta_data['title'].values[0], 40)
        dict_['title'].append(title)
    # if title was not provided
    except Exception as e:
        dict_['title'].append(meta_data['title'].values[0])
    
    # add the journal information
    dict_['journal'].append(meta_data['journal'].values[0])
    
    # add doi
    dict_['doi'].append(meta_data['doi'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'doi', 'abstract', 'body_text', 'authors', 'title', 'journal', 'abstract_summary'])
df_covid.head()

In [None]:
df_covid

## Some feature engineering
Adding word count columns for both abstract and body_text can be useful parameters later:

In [None]:
df_covid['abstract_word_count'] = df_covid['abstract'].apply(lambda x: len(x.strip().split()))  # word count in abstract
df_covid['body_word_count'] = df_covid['body_text'].apply(lambda x: len(x.strip().split()))  # word count in body
df_covid['body_unique_words']=df_covid['body_text'].apply(lambda x:len(set(str(x).split())))  # number of unique words in body
df_covid.head()

In [None]:
df_covid.info()

In [None]:
df_covid['abstract'].describe(include='all')

## Handle Possible Duplicates

When we look at the unique values above, we can see that there are duplicates. This may have been caused by authors submitting the same article to multiple journals. Let's remove the duplicates from our dataset:

(Thank you Desmond Yeoh for recommending the below approach on Kaggle)

In [None]:

df_covid.drop_duplicates(['abstract', 'body_text'], inplace=True)
df_covid['abstract'].describe(include='all')

In [None]:
df_covid['body_text'].describe(include='all')

It looks like we didn't have duplicates. Instead, the repeated values came from articles without abstracts.
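
The difference between true duplicates and missing abstracts can be checked on a small frame. The sketch below uses a made-up four-row DataFrame (`toy`) purely for illustration: two rows share an empty abstract but have different bodies, so they are not duplicates, while the last two rows are identical.

```python
import pandas as pd

# Hypothetical rows: two empty abstracts with different bodies, plus one true duplicate pair
toy = pd.DataFrame({
    "abstract": ["", "", "A study of transmission.", "A study of transmission."],
    "body_text": ["text a", "text b", "text c", "text c"],
})

# Same call as in the notebook: a row is dropped only if abstract AND body_text both repeat
deduped = toy.drop_duplicates(["abstract", "body_text"])

# Empty abstracts survive de-duplication and still count as one repeated value
n_empty = int((deduped["abstract"] == "").sum())
print(len(toy), len(deduped), n_empty)
```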

## Take a Look at the Data:

#### We need to clean up the text to improve any clustering or classification efforts. First, let's drop null values

In [None]:
df_covid.dropna(inplace=True)


In [None]:
df_covid.columns

In [None]:
df_covid.head()

In [None]:
df_covid.describe()

# Data Pre-processing

In [None]:
df = df_covid

### Handling multiple languages
We are going to determine the language of each paper in the dataframe. Not all of the sources are in English, and the language needs to be identified so that we know how to handle these instances.

In [None]:
from tqdm import tqdm
from langdetect import detect
from langdetect import DetectorFactory

# set seed
DetectorFactory.seed = 0

# hold label - language
languages = []

# go through each text
for ii in tqdm(range(0,len(df))):
    # split by space into a list; the first 50 words are joined back with spaces for detection
    text = df.iloc[ii]['body_text'].split(" ")
    
    lang = "en"
    try:
        if len(text) > 50:
            lang = detect(" ".join(text[:50]))
        elif len(text) > 0:
            lang = detect(" ".join(text[:len(text)]))
    # ugh... the beginning of the document was not in a good format
    except Exception as e:
        all_words = set(text)
        try:
            lang = detect(" ".join(all_words))
        # what!! :( let's see if we can find any text in abstract...
        except Exception as e:
            
            try:
                # let's try to label it through the abstract then
                lang = detect(df.iloc[ii]['abstract_summary'])
            except Exception as e:
                lang = "unknown"
                pass
    
    # get the language    
    languages.append(lang)

In [None]:
from pprint import pprint

languages_dict = {}
for lang in set(languages):
    languages_dict[lang] = languages.count(lang)
    
print("Total: {}\n".format(len(languages)))
pprint(languages_dict)

Let's take a look at the language distribution in the dataset:

In [None]:
df['language'] = languages
plt.bar(range(len(languages_dict)), list(languages_dict.values()), align='center')
plt.xticks(range(len(languages_dict)), list(languages_dict.keys()))
plt.title("Distribution of Languages in Dataset")
plt.show()


We will be dropping any language that is not English. Attempting to translate foreign texts gave the following problems:

1. API calls were limited

2. Translating the language may not carry over the true semantic meaning of the text


In [None]:
df = df[df['language'] == 'en'].copy()  # copy so that later column assignments do not act on a view
df.info()

In [None]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz
import en_core_sci_lg

In [None]:
#NLP 
from IPython.utils import io
import spacy
from spacy.lang.en.stop_words import STOP_WORDS


### Stopwords

Part of the preprocessing will be finding and removing stopwords (common words that will act as noise in the clustering step).

In [None]:
import string

punctuations = string.punctuation
stopwords = list(STOP_WORDS)
stopwords[:10]

In [None]:
custom_stop_words = [
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure', 
    'rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 
    'al.', 'Elsevier', 'PMC', 'CZI', 'www'
]

for w in custom_stop_words:
    if w not in stopwords:
        stopwords.append(w)

### Next, let's create a function that will process the text data for us.
For this purpose we will be using the spaCy library. This function will convert text to lower case, remove punctuation, and find and remove stopwords. For the parser, we will use en_core_sci_lg, a model for processing biomedical, scientific or clinical text.

In [None]:
import re
# Parser
parser = en_core_sci_lg.load(disable=["tagger", "ner"])
parser.max_length = 7000000

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    mytokens = [ re.sub('[0-9%]','',word) for word in mytokens ]
    mytokens = " ".join([i for i in mytokens])
    return mytokens

#### To focus the features on the main words related to each topic, we chose to work with the processed text of the abstract column
       

In [None]:
from tqdm import tqdm
import numpy as np
import math

In [None]:
tqdm.pandas()
df["processed_text"] = df["abstract"].progress_apply(spacy_tokenizer)

### Remove all samples with an empty abstract

In [None]:
df['abstract'] = df['abstract'].replace('', np.nan)  # assign back instead of an in-place call on a column
df.dropna(inplace=True)
df.info()

### visualizing processed text

In [None]:
df['processed_word_count'] = df["processed_text"].apply(lambda x: len(x.strip().split()))
sns.distplot(df['processed_word_count'])


In [None]:
df['abstract_word_count'].describe()

# Vectorization

Now that we have pre-processed the data, it is time to convert it into a format that can be handled by our algorithms. For this purpose we will be using tf-idf. This will convert our string-formatted data into a measure of how important each word is to a document relative to the literature as a whole.
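
To make the weighting concrete, here is a small hand-rolled sketch of tf-idf on three made-up documents. It follows the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation; the documents and the helper name `tfidf_vector` are assumptions for illustration only.

```python
import math
from collections import Counter

# Hypothetical mini corpus
docs = [
    "virus transmission virus",
    "virus vaccine",
    "patient care",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: in how many documents each term occurs
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Term count times idf for every vocabulary term, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

A term that appears in fewer documents ("vaccine") gets a larger idf than one that appears in many ("virus"), while repeating a term inside one document raises its weight there.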

Vectorize our data. We will be clustering based on the content of the processed text. The maximum number of features will be limited to 4096, essentially acting as a noise filter. Additionally, more features cause painfully long runtimes.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
text = df['processed_text'].values
vectorizer = TfidfVectorizer(max_features=4096)
X = vectorizer.fit_transform(text) 
terms = vectorizer.get_feature_names()

### `terms` holds the features chosen by the vectorizer; let's take a look at some of them

In [None]:
terms[0:20]

# PCA  & Clustering

Let's see how much we can reduce the dimensions while still keeping 95% of the variance. We will apply Principal Component Analysis (PCA) to our vectorized data. The reason for this is that by keeping a large number of dimensions with PCA, you don’t destroy much of the information, but hopefully will remove some noise/outliers from the data, and make the clustering problem easier for k-means. Note that X_reduced is used for k-means (and later for KNN), while t-SNE still uses the original feature vector X that was generated through tf-idf on the NLP-processed text.


In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95, random_state=42)
X_reduced= pca.fit_transform(X.toarray())
X_reduced.shape
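
The effect of passing a fraction as `n_components` can be seen on synthetic data. In this sketch the data are made up (5 latent directions embedded in 50 features plus a little noise), so the numbers say nothing about the CORD-19 corpus; the point is only that PCA keeps the smallest number of components whose cumulative explained variance reaches the requested fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples whose signal lives in 5 latent directions
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
data = latent @ mixing + 0.05 * rng.normal(size=(200, 50))

# A float in (0, 1) asks for that share of the variance to be retained
pca_demo = PCA(n_components=0.95, random_state=42)
reduced = pca_demo.fit_transform(data)

retained = pca_demo.explained_variance_ratio_.sum()
print(reduced.shape, round(float(retained), 3))
```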

To separate the literature, k-means will be run on the vectorized text. Given the number of clusters, k, k-means starts from k initial centroids, assigns each vector to its nearest centroid, and then moves each centroid to the mean of the vectors assigned to it. These two steps are repeated until the assignments stabilize.
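
The assign-then-update loop can be written in a few lines of NumPy. This is a bare-bones sketch for intuition, not scikit-learn's implementation (which adds smarter initialisation and convergence checks); `lloyd_kmeans` and the two synthetic blobs are assumptions for the example.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means: random initial centroids, then alternate assign/update."""
    rng = np.random.default_rng(seed)
    # pick k distinct data points as starting centroids (fancy indexing copies)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its members
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two synthetic, well-separated blobs of 20 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = lloyd_kmeans(points, k=2)
print(labels)
```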

In [None]:
from sklearn.cluster import KMeans

In [None]:
Image(filename='/kaggle/input/kaggle-resources/kmeans.PNG', width=800, height=800)

[source](https://en.wikipedia.org/wiki/K-means_clustering)

### How many clusters? 

To find the best k value for k-means we'll look at the distortion at different k values. Distortion measures how far the points lie from their assigned centers; the commented code below computes it as the mean Euclidean distance from each point to its nearest centroid. When distortion is plotted against k there will be a k value after which decreases in distortion are minimal. This is the desired number of clusters.

##### Uncomment these cells to see the elbow plot; they take some time to run

In [None]:
# from sklearn import metrics
# from scipy.spatial.distance import cdist
# %matplotlib inline
# from matplotlib import pyplot as plt

# # run kmeans with many different k
# distortions = []
# K = range(10, 35)
# for k in tqdm(K):
#     k_means = KMeans(n_clusters=k, random_state=42).fit(X_reduced)
#     distortions.append(sum(np.min(cdist(X_reduced, k_means.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
#     #print('Found distortion for {} clusters'.format(k))

In [None]:
# X_line = [K[0], K[-1]]
# Y_line = [distortions[0], distortions[-1]]

# # Plot the elbow
# plt.plot(K, distortions, 'b-')
# plt.plot(X_line, Y_line, 'r')
# plt.xlabel('k')
# plt.ylabel('Distortion')
# plt.title('The Elbow Method showing the optimal k')
# plt.show()

In this plot we can see that the better k values are between 18 and 25. After that, the decrease in distortion is not as significant. For simplicity, we will use k=20.
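
The elbow was read off the plot by eye. One common heuristic that matches the red line drawn in the plot is to pick the k whose point lies farthest from the straight line joining the first and last points of the curve. The sketch below is an assumption about how that could be automated, not what the notebook does, and the distortion values are synthetic, with a kink deliberately placed at k=20.

```python
import numpy as np

def elbow_by_chord(ks, distortions):
    """Return the k whose (k, distortion) point is farthest from the chord
    joining the first and last points of the curve."""
    ks = np.asarray(ks, dtype=float)
    d = np.asarray(distortions, dtype=float)
    # rescale both axes to [0, 1] so neither dominates the distance
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (d - d.min()) / (d.max() - d.min())
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    # perpendicular distance from every point to the line through the endpoints
    gap = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    gap = gap / np.hypot(x1 - x0, y1 - y0)
    return int(ks[gap.argmax()])

# Synthetic curve for illustration only: steep decrease up to k=20, flat afterwards
ks = list(range(10, 35))
distortions = [100 - 5 * (k - 10) if k <= 20 else 50 - 0.5 * (k - 20) for k in ks]

print(elbow_by_chord(ks, distortions))
```

On real distortion curves the bend is usually much softer than in this synthetic example, so the result is best treated as a starting point next to the visual check.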

### Run k-means

Now that we have an appropriate k value, we can run k-means on the PCA-processed feature vector (X_reduced). 

In [None]:
k = 20
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X_reduced)
df['y'] = y_pred

In [None]:
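# k-means was fitted on the PCA output, so map the centroids back to TF-IDF term space
# before ranking; otherwise the indices refer to principal components, not to `terms`
order_centroids = pca.inverse_transform(kmeans.cluster_centers_).argsort()[:, ::-1]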

#### let's see each cluster with its features

In [None]:
for i in range(k):
    print('Cluster %d:' % i)
    # the 20 highest-ranked entries of this cluster's centroid
    for ind in order_centroids[i, :20]:
        print('%s' % terms[ind])

# Dimensionality Reduction with t-SNE

Using t-SNE we can reduce our high-dimensional feature vectors to 2 dimensions. By using the 2 dimensions as x,y coordinates, the processed_text can be plotted.

t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space.
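
Because the full run is slow, it can help to try the call on a small synthetic array first. The sketch below uses made-up data (two groups of 30 points in 20 dimensions) and a small perplexity chosen only because the sample is tiny; it is not the setting used for the corpus.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two hypothetical groups of 30 points each in 20 dimensions
group_a = rng.normal(loc=0.0, scale=0.5, size=(30, 20))
group_b = rng.normal(loc=4.0, scale=0.5, size=(30, 20))
features = np.vstack([group_a, group_b])

# perplexity has to be smaller than the number of samples
tsne_demo = TSNE(n_components=2, perplexity=10, random_state=42)
embedded = tsne_demo.fit_transform(features)

# one (x, y) coordinate pair per input row
print(embedded.shape)
```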

In [None]:
# from sklearn.manifold import TSNE

# tsne = TSNE(verbose=1, perplexity=100, random_state=42)
# X_embedded = tsne.fit_transform(X.toarray())

So that step took a while! Let's take a look at what our data looks like when compressed to 2 dimensions.

In [None]:
# # sns settings
# sns.set(rc={'figure.figsize':(15,15)})

# # colors
# palette = sns.color_palette("bright", 1)

# # plot
# sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], palette=palette)
# plt.title('t-SNE with no Labels')
# plt.savefig("t-sne_covid19.png")
# plt.show()

This looks pretty bland. There are some clusters we can immediately detect, but the many instances closer to the center are harder to separate. t-SNE did a good job at reducing the dimensionality, but now we need some labels. Let's use the clusters found by k-means as labels. This will help visually separate different concentrations of topics.

In [None]:
# %matplotlib inline
# from matplotlib import pyplot as plt
# import seaborn as sns

# # sns settings
# sns.set(rc={'figure.figsize':(15,15)})

# # colors
# palette = sns.hls_palette(20, l=.4, s=.9)

# # plot
# sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=y_pred, legend='full', palette=palette)
# plt.title('t-SNE with Kmeans Labels')
# plt.savefig("improved_cluster_tsne.png")
# plt.show()

### Using KNN to get the nearest neighbours to the question

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(X_reduced,df['y'].values )

In [None]:
def predict_nearest_neighbour(model,sentence):
    print("Prediction")
    sentence = spacy_tokenizer(sentence)
    print (sentence)
    X = vectorizer.transform([sentence])
    X = pca.transform(X.toarray())
    predicted = model.kneighbors(X, 100)
    return predicted
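
It may help to see what `kneighbors` returns on a toy example. In this sketch the six 2-D points, their cluster labels and the query are all made up; the call returns a pair of arrays (distances, row positions), each sorted from nearest to farthest, while `predict` returns the majority cluster label among the neighbours.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points with cluster labels from an earlier clustering step
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
cluster_labels = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(points, cluster_labels)

query = np.array([[0.2, 0.1]])

# neighbour lookup: uses only the stored points, not their labels
distances, indices = model.kneighbors(query, n_neighbors=3)

# classification: majority label among the 3 nearest points
predicted_cluster = model.predict(query)[0]

print(indices[0], distances[0].round(3), predicted_cluster)
```

The neighbour lookup itself ignores the labels, so the cluster labels only matter when `predict` is called.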

### Use knn to get the nearest neighbours to different questions

In [None]:
sentence = "What has been published about ethical and social science considerations?"
sentence_1 = 'What do we know about virus genetics, origin and evolution ?'
sentence_2 = 'What is known about transmission, incubation, and environmental stability?'
sentence_3 = 'Create summary tables that address risk factors related to COVID-19'
sentence_4 = 'What do we know about COVID-19 risk factors?'
sentence_5 = 'What has been published about medical care?'
sentence_6 = 'What do we know about diagnostics and surveillance?'
sentence_7 = 'What do we know about vaccines and therapeutics?'

#### For question 1 

In [None]:
res = predict_nearest_neighbour(knn_model, sentence_1)
print (res)

In [None]:
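inds =[]
# res[0] holds the distances and res[1] the matching row positions, nearest first;
# only neighbours whose distance to the question exceeds 1 are kept
for i,dist in enumerate(res[0][0]):
    if dist > 1:
        inds.append(res[1][0][i])
print (inds)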
        
    

#### Ten nearest documents related to the question

In [None]:
for index in inds[:10]:  # at most ten, in case fewer neighbours passed the filter
    print (df.iloc[index]['abstract'])
    print ("virus appeared", df.iloc[index]['abstract'].count('virus'), 'times')
    print ("genetic appeared", df.iloc[index]['abstract'].count('genetic'), 'times')
    print ("origin appeared", df.iloc[index]['abstract'].count('origin'), 'times')
    print ("evolution appeared", df.iloc[index]['abstract'].count('evolution'), 'times')

#### The documents' summary table

In [None]:
for index in inds[:10]:  # at most ten, in case fewer neighbours passed the filter
    print (df.iloc[index])

### Trying question 4

In [None]:
res = predict_nearest_neighbour(knn_model, sentence_4)
print (res)

In [None]:
inds =[]
for i,dist in enumerate(res[0][0]):
    if dist > 1:
        inds.append(res[1][0][i])
print (inds)

In [None]:
for index in inds[:10]:  # at most ten, in case fewer neighbours passed the filter
    print (df.iloc[index]['abstract'])
    print ("risk appeared", df.iloc[index]['abstract'].count('risk'), 'times')
    print ("factor appeared", df.iloc[index]['abstract'].count('factor'), 'times')
    # the raw abstract keeps its original casing, so lowercase it before counting
    print ("COVID-19 appeared", df.iloc[index]['abstract'].lower().count('covid-19'), 'times')

In [None]:
for index in inds[:10]:  # at most ten, in case fewer neighbours passed the filter
    print (df.iloc[index])

## Save the models to pickle files

In [None]:
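import pickle

# use context managers so the files are closed after writing
with open("k-means_abstract.pkl", "wb") as f:
    pickle.dump(kmeans, f)
with open("knn_model_abstract.pkl", "wb") as f:
    pickle.dump(knn_model, f)

# with open("k-means_abstract.pkl", "rb") as f:
#     kmeans = pickle.load(f)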
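
Answering a new question from the saved files also needs the fitted vectorizer and PCA, because `predict_nearest_neighbour` calls both before querying the KNN model. One possible way to keep them together is to pickle a single dictionary. The sketch below uses placeholder strings instead of the real fitted objects, and the file name `query_pipeline.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical placeholders; in the notebook these would be the fitted
# vectorizer, pca and knn_model objects.
bundle = {
    "vectorizer": "fitted TfidfVectorizer",
    "pca": "fitted PCA",
    "knn_model": "fitted KNeighborsClassifier",
}

# Write everything needed for querying into one file (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "query_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Later: load the bundle and unpack the parts again
with open(path, "rb") as f:
    restored = pickle.load(f)

print(sorted(restored))
```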

# Conclusion

After loading the data, we used **langdetect** to detect each document's language and removed the non-English documents.
We then used spacy_tokenizer in our data pre-processing, where it lemmatizes the words and lowercases them, and we removed numbers and null values.
We used a TF-IDF vectorizer for feature extraction and PCA for dimensionality reduction, since too many features lead to long running times.

We used k-means clustering as an unsupervised approach because we don't have labels. After clustering the documents, we used k-nearest neighbours to get the documents closest to our question.

# Future thoughts to consider

1) Topic modelling to discover the topic of each cluster, so that after applying the KNN algorithm we can take a majority vote over the 
    neighbours' labels to indicate the topic related to the question, making the results more accurate.

2) Using an alternative approach, such as an LSTM neural network, to classify based on the meaning of words instead of our 
    TF-IDF approach, which relies on word counts and their frequency across the whole document collection.
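
The majority vote mentioned in the first point could look roughly like the sketch below. The neighbour labels and the topic names are invented for illustration; in practice the labels would be the k-means cluster ids of the returned neighbours and the names would come from a topic model.

```python
from collections import Counter

def majority_label(neighbour_labels):
    """Return the cluster label that occurs most often among the neighbours."""
    return Counter(neighbour_labels).most_common(1)[0][0]

# Hypothetical cluster ids of the ten nearest documents
neighbour_labels = [3, 7, 3, 3, 12, 7, 3, 3, 5, 3]

# Hypothetical topic names per cluster, e.g. produced by topic modelling
cluster_topics = {3: "genomics and evolution", 7: "transmission", 12: "clinical care", 5: "vaccines"}

winner = majority_label(neighbour_labels)
print(winner, cluster_topics[winner])
```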