# CORD-19 Challenge: Drugs that treat COVID-19


## Introduction
In the CORD-19 Challenge, the idea is to use NLP to reduce the amount of manual work required to find specific information in the vast literature. The CORD-19 dataset contains the full text of a large number (40000+) of articles and publications, as well as the metadata of these articles, stored in a CSV file. The task at hand is therefore to reduce this large dataset into something that researchers can work through manually. 

To this end, I am following a divide-and-conquer approach: narrowing down the set of relevant articles in multiple stages, before a more in-depth analysis of the selected articles. In particular, the process flow that I present in this workbook is as follows:

***Step 1: Use topic modelling on the titles of the articles to separate the articles into different topics***

***Step 2: Find out the topics that are relevant to this task (drugs, vaccines and therapeutics), and select articles with the relevant topics***

***Step 3: Use Word2Vec to create a word vector space using the abstracts of the selected articles***

***Step 4: Put the abstracts into clusters, and select the cluster of articles that is most relevant to our task***

***Step 5: Use text summarisation to summarise the full text of these articles, and find out relevant facts about our task manually from the summaries.*** 







## Part 1: Title Topic Modelling

In most publications in the literature, the amount of detail or information you can obtain usually follows a hierarchy of:

**full text > abstract > titles**

Therefore, by running an initial topic modelling on the titles of the papers, I should be able to categorize the papers into relatively large topics, from which the topics of interest can be selected for a more granular analysis using the abstracts, and finally a more in-depth analysis of specific full texts. 

The workflow for this work is as follows:

***Step 1: Load in data***

***Step 2: Clean text using nltk, re and string***

***Step 3: Exploratory analysis and document matrix formation using CountVectorizer()***

***Step 4: Revisit text preprocessing step***

***Step 5: LDA topic modelling using Gensim***

We will start with loading the required packages

In [None]:
# Standard data processing packages: numpy and pandas
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Text cleaning and string formatting toolbox: nltk, string and re. In particular, stopwords and word_tokenize are great tools for turning text 
# into tokens and removing common words; WordNetLemmatizer allows reasonable lemmatization; re and string are used for removing punctuation and 
# other string wrangling
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import string


#CountVectorizer() for counting word tokens and creating document term matrix needed for Gensim. Also require the text package to modify 
#stopwords according to what we see in the EDA

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

#Gensim for topic modelling. Requires scipy.sparse to store the sparse matrix, and uses CoherenceModel to evaluate topics
from gensim import matutils, models
import scipy.sparse
from gensim.models import CoherenceModel
from gensim import corpora

#Visualisation tools. pyLDAvis to visualise the topics, matplotlib for general plotting
import pyLDAvis
import pyLDAvis.gensim  
import matplotlib.pyplot as plt

#Pickle for saving model
import pickle








### Loading in Data and Data preprocessing

First, we load in the metadata file which contains amongst other things, the title, abstract, and the SHA code which links to the full text json of all the articles available in the CORD-19 data. We will view the first few rows of the data to get an idea what the data looks like


In [None]:
#Load in data
meta_df = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
meta_df.head()


#See how big the dataset is
print("size of table is ", meta_df.shape)

We can see that the metadata contains a lot of important information such as publishing dates, doi (unique document identifier for academic publications) and authors. However, for our analysis, we really only need the title. We will grab the SHA, abstract and doi as well, which will be useful further down the track 



In [None]:
#Create new dataframe that just store the title and abstract to work on
text_df = meta_df[['sha','title','abstract', 'doi']].copy()

#Print out random title to see what we are in for
print(text_df.title[345])


OK, looks good. Let's start with some basic text cleaning. As a first pass, let's turn everything to lower case and remove all digits and punctuation. 

In [None]:
#Drop rows without a title first: casting NaN to str would otherwise turn it into the literal string "nan"
text_df = text_df[text_df['title'].notna()].copy()

#Define a text cleaner to do the cleaning we need - lower case, remove digits, and replace all punctuation with empty strings
#Note the explicit casting of the title elements into str is needed to use string operations without an error
def text_cleaner(text):
    text = str(text).lower()
    text = re.sub('[0-9]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    #The Unicode minus sign is not part of string.punctuation, so remove it separately
    text = re.sub("−", '', text)

    return text


text_df.title = text_df.title.apply(text_cleaner)
text_df.head()


OK, good. The next step is a quick analysis of the data. The quick analysis serves two purposes: to get an idea of what the data looks like, and to judge whether additional cleaning is required. We use CountVectorizer() from sklearn.feature_extraction.text, as it easily turns the text into a document term matrix (i.e. a bag of words with word frequencies), which allows the frequency of words to be viewed and puts the data in a form that is ready for Gensim.
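To make the idea of a document term matrix concrete, here is a minimal sketch in plain Python. The toy titles and variable names are invented for illustration and are not part of the notebook's data; `CountVectorizer` does the same job at scale, with stop word handling on top.

```python
from collections import Counter

# Toy titles standing in for the cleaned CORD-19 titles (illustrative only)
titles = [
    "coronavirus infection in cats",
    "vaccine trial for coronavirus",
    "vaccine safety",
]

# Vocabulary: every distinct word, sorted so that the column order is reproducible
vocab = sorted({word for title in titles for word in title.split()})

# One row per title, one column per vocabulary word, each cell a word count
rows = []
for title in titles:
    counts = Counter(title.split())
    rows.append([counts[word] for word in vocab])

# Column sums give the corpus-wide frequency of each word
totals = {word: sum(row[i] for row in rows) for i, word in enumerate(vocab)}
top_words = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
print(top_words[:3])
```

Each row of `rows` is one document and each column one word, which is the shape that the later steps expect before the matrix is transposed for Gensim.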


In [None]:
#Use CountVectorizer to vectorize the counts of each word in the title column. i.e. creating a document term matrix
cv = CountVectorizer(stop_words = "english")
text_cv = cv.fit_transform(text_df.title)
dtm = pd.DataFrame(text_cv.toarray(), columns = cv.get_feature_names())
dtm.head()

There are already interesting things to be seen here. In particular, it seems that not all articles are in English. There are Chinese articles in here that may require separate analysis down the track. It is also evident that there are chemical names with Greek letters that may require special attention, especially with hyphenation. However, as a first pass, let's see the most frequently occurring words in the titles.
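If the non-English titles need to be set aside later, one simple option is to flag titles that contain CJK ideographs. The helper below is a hypothetical sketch, not something the notebook uses; it only checks the basic CJK Unified Ideographs range and would miss other scripts.

```python
import re

# Hypothetical helper: flags a title containing CJK ideographs (U+4E00 to U+9FFF)
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")

def has_cjk(title):
    return bool(CJK_PATTERN.search(str(title)))

sample_titles = ["Coronavirus infection in cats", "新型冠状病毒肺炎", "β-coronavirus genome"]
flags = [has_cjk(t) for t in sample_titles]
print(flags)
```

Greek letters such as β fall outside this range, so chemical names are not flagged. With pandas, the same function could be applied to a title column through `apply`.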


In [None]:
#Identify top words by aggregating the table
top_words = dtm.sum(axis = 0).sort_values(ascending = False)
print(top_words[0:50])


It is clear that there are words that should be combined (virus vs viruses, infections vs infection, etc.), i.e. stemming or lemmatization is needed. However, after testing stemmers and lemmatizers, it seems that they do not really work on the different "virus" terms. So I am going to specify that replacement manually. 

Another manual replacement that I noticed is the term "severe acute respiratory syndrome" vs "SARS". This seems to occur frequently enough that I should make a special case for it. 

Finally, there are clearly terms such as "chapter" and "study" that can be considered additional stop words, since they do not offer any extra information. In fact, one can probably say the same for the term "virus", since it is so overwhelmingly common. Let's remove these from the texts as well. 
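The manual replacements described above can be tried out in isolation. The sketch below applies the plain pattern and, for comparison, a hypothetical word-boundary variant that is not part of the notebook. The plain pattern also rewrites "antiviral" to "antivirus"; whether that merge is desirable for this task is a judgment call.

```python
import re

def normalise_plain(title):
    # Same substitutions as described in the text above
    title = title.replace("severe acute respiratory syndrome", "sars")
    return re.sub("viral|viruses", "virus", title)

def normalise_bounded(title):
    # Hypothetical variant: \b restricts the match to whole words
    title = title.replace("severe acute respiratory syndrome", "sars")
    return re.sub(r"\b(?:viral|viruses)\b", "virus", title)

title = "antiviral therapy for severe acute respiratory syndrome viruses"
print(normalise_plain(title))
print(normalise_bounded(title))
```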


In [None]:
#combine the different "virus" forms and combine the term "severe acute respiratory syndrome" into "sars"
text_df.title = text_df.title.apply(lambda x:x.replace("severe acute respiratory syndrome", "sars"))
text_df.title = text_df.title.apply(lambda x:re.sub('viral|viruses', 'virus', x))

#Lemmatization for the rest of the words using wordnet lemmatizer from nltk. A new column "Tokens" is formed in the dataframe to store this
lemma = WordNetLemmatizer()
text_df['Tokens'] = text_df.title.apply(lambda x: word_tokenize(x))
text_df.Tokens = text_df.Tokens.apply(lambda x: " ".join([lemma.lemmatize(item) for item in x]))


#Add stop_words of "chapter", "study", "virus" and redo the countvectorizer. Stopwords can be manually formed using text.ENGLISH_STOP_WORDS
stop_words = text.ENGLISH_STOP_WORDS.union(["chapter","study","virus"])
cv2 = CountVectorizer(stop_words = stop_words)
text_cv_stemed = cv2.fit_transform(text_df.Tokens)
dtm = pd.DataFrame(text_cv_stemed.toarray(), columns = cv2.get_feature_names())
top_words = dtm.sum(axis = 0).sort_values(ascending = False)
print(top_words[0:50])

This looks much better. I can start to see keywords that may be important for the different tasks in the CORD-19 Challenge. Time to try topic modelling. 



### Topic Modelling

Topic modelling is essentially an unsupervised learning technique that tries to learn and group the text into different topics. One model commonly used for topic modelling is Latent Dirichlet Allocation, or LDA, which is readily available in Gensim.



In [None]:
#First, the dtm needs to be transposed into a term document matrix, and then into a sparse matrix
tdm = dtm.transpose()
sparse_counts = scipy.sparse.csr_matrix(tdm)

#Gensim provides tools to turn the sparse matrix into the corpus input needed for the LDA modelling. 
corpus = matutils.Sparse2Corpus(sparse_counts)

#We also need a lookup table that lets us refer back to the word from its word id in the document term matrix.
id2word = dict((v,k) for k, v in cv2.vocabulary_.items())

#Fitting an LDA model simply requires the corpus input, the id2word lookup, and the number of topics required
lda = models.LdaModel(corpus = corpus, id2word = id2word, num_topics=20, 
                                           passes=10,
                                           alpha='auto')

lda.print_topics()


Looking at the topics above, the result is actually not bad. The top words in the topics are fairly sensible, and some of them do seem to group into coherent topics, such as topic 3 with keywords such as rna, genome, and sequencing, which are all somewhat related. 

To optimise the LDA model, let's tune the number of topics by optimising the coherence score. 


In [None]:
#The coherence model uses a corpora.Dictionary object that has both a word2id and an id2word lookup table. We can create this dictionary as follows
d = corpora.Dictionary()
word2id = dict((k, v) for k, v in cv2.vocabulary_.items())
d.id2token = id2word
d.token2id = word2id

#Function to create LDA models and evaluate the coherence score for a range of values for the number of topics. Note that the coherence model
#needs the original text to calculate the coherence, i.e. the Tokens column in the table. The column needs to be tokenized as it was stored
#as strings in the dataframe.
def calculate_coherence(start, stop, step, corpus, text, id2word, dictionary):
    model_list = []
    coherence = []
    for num_topics in range(start, stop, step):
        lda = models.LdaModel(corpus = corpus, id2word = id2word, num_topics=num_topics, passes=10,alpha='auto', random_state = 1765)
        model_list.append(lda)
        coherence_model_lda = CoherenceModel(model=lda, texts=text, dictionary=dictionary, coherence='c_v')
        coherence_lda = coherence_model_lda.get_coherence()
        print("Coherence score for ", num_topics, " topics: ", coherence_lda)
        coherence.append(coherence_lda)
    
    return model_list, coherence

#Create and evaluate models with 10 - 80 topics in steps of 10
model_list, coherence_list = calculate_coherence(10, 90, 10, corpus, text_df.Tokens.apply(lambda x: word_tokenize(x)), id2word, d)


# Plot graph of coherence score as a function of number of topics to look at optimal number of topics. 
x = range(10, 90, 10)
plt.plot(x, coherence_list)

plt.title("Coherence Score vs Number of Topics")
plt.xlabel("Number of Topics")
plt.ylabel("Coherence score")
plt.show()

Based on the graph, the number of topics that gives the best coherence score is 60 (note that this may change due to the statistical nature of the models). The coherence score is around 0.54, which is reasonable, but not great. We can also view the topics and compare with before.


In [None]:
model_list[5].show_topics(num_topics=20, num_words=10, log=False, formatted=True)

Hmm... It's hard to tell whether the topics are better or not. There are topics that seem to make sense, but there are still some that do not. One way to visualise the topics to see whether they are good or bad is the pyLDAvis package, which is designed specifically for viewing LDA topic models.


In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model_list[5], corpus, d)
vis

What to watch for in the intertopic distance map is that a good model yields topics that are spread out across the principal component space, with minimal overlap between the topics. It can be seen from the map above that while a few topics are spread out and isolated, many of the topics concentrate in a band and overlap with each other. It is also noted, however, that despite this there is only limited overlap between the top 30 words of each topic. The most likely reason for the crowding is that LDA performs poorly on short text (see for example, https://arxiv.org/abs/1904.07695). Therefore, in hindsight, it might have been better to build the topic model directly on the abstracts. 

Nevertheless, let's assign topics to each title based on this analysis. In particular, we will take the top three topics and their contributions and store them in the dataframe. 


In [None]:
#Store the top 3 topics, the contribution of the most dominant topic, and the total contribution of the top three topics
topic_list = []
top_score = []
sum_score = []
#lda_model[corpus] gives the list of topics for each sentence (list of tokens) in the corpus as a list of tuples. We can then unpack that 
#and extract the top three topics and their scores
for i, row in enumerate(model_list[5][corpus]):
    top_topics = sorted(row, key=lambda tup: tup[1], reverse = True)[0:3]
    topic_list.append([tup[0] for tup in top_topics])
    top_score.append(top_topics[0][1])
    sum_score.append(sum([tup[1] for tup in top_topics]))

text_df['topics'] = topic_list
text_df['top_score'] = top_score
text_df['sum_scores'] = sum_score
text_df.head()

I will save the model and the data for later use. 

In [None]:
#Finally, save the model, the dictionary, and the updated dataframe for later use
with open("title_model", 'wb') as model_file:
    pickle.dump(model_list[5], model_file)

with open("dictionary", 'wb') as dict_file:
    pickle.dump(d, dict_file)

text_df.to_csv("topic_data.csv", index = False, header=True)

## Part 2: Finding relevant articles

Topic modelling via LDA performed on the titles of the CORD-19 data yields a number of topics ("title topics"), and each article or entry in the dataset is assigned its top three topics according to the topic model score. In this part of the workbook, the article selection is further refined, first by selecting articles with title topics that are relevant to the task/questions at hand, and then by identifying the most relevant articles within these topics using k-means clustering. In particular, the following steps will be taken:

1. *Extract keywords from the Tasks/questions*
2. *Select articles with title topics that are most relevant to these keywords*
3. *Map the abstracts of the selected articles using word vectors*
4. *Perform clustering analysis to find article clusters that are most relevant to the tasks/questions.*

We will focus on Vaccine and Therapeutics Task documented in the following link:
https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=561

Before starting, let's import all necessary libraries and packages

In [None]:
#Standard data analysis tools: numpy and pandas
import numpy as np 
import pandas as pd 

#nltk, re and string for text pre-processing
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import string

#Keyword is generated using TF-IDF provided by sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

#Models loading and unloading using pickle
import pickle

#Majority of the task is done using gensim
from gensim.corpora.dictionary import Dictionary
from gensim.parsing.preprocessing import remove_stopwords
from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec

#Speed up modelling time using multiprocessing
import multiprocessing

#sklearn tools for Kmeans clustering, model evaluation and visualisation of clusters 
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer

#For visualisation, we use matplotlib for plotting and wordcloud for visualisation of keywords
from matplotlib import pyplot as plt
from wordcloud import WordCloud

#Use Mode for finding most frequent cluster
from statistics import mode





### Extracting Task Keywords

Ideally, the keywords would be extracted fully automatically by web scraping the task page. However, the page is a tab whose content is generated by JavaScript, so it cannot be scraped using simple packages such as *beautifulsoup*. Since the page is relatively small, I decided it was easier to copy the questions from the task page manually into a list of strings, as below:



In [None]:
#Manually enter questions into a list
#The questions are found in https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=561
question_list = ["What do we know about vaccines and therapeutics?", \
                 "What has been published concerning research and development and evaluation efforts of vaccines and therapeutics?", \
                 "Effectiveness of drugs being developed and tried to treat COVID-19 patients.", \
                 "Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.", \
                 "Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.", \
                 "Exploration of use of best animal models and their predictive value for a human vaccine.",\
                 "Capabilities to discover a therapeutic for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.",\
                 "Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.", \
                 "Efforts targeted at a universal coronavirus vaccine.", \
                 "Efforts to develop animal models and standardize challenge studies", \
                 "Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers", \
                 "Approaches to evaluate risk for enhanced disease after vaccination", \
                 "Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models in conjunction with therapeutics"]

#Put list into a dataframe for easier viewing and manipulation
df_task = pd.DataFrame(question_list, columns = ['tasks'])
df_task.head()


With these sentences as a list of strings, we can now use TF-IDF to extract keywords that are important for these sentences. Since the goal is to find keywords that we can match with the title models, the same text pre-processing routine as that in the last workbook will be used. It is also noted we will only use a 1-gram approach as the title models were created without any use of n-gram with n > 1. 


In [None]:
#Define function to clean text
def text_cleaner2(text):
    #Convert to lower case
    text = str(text).lower()
    #Remove punctuations
    text = re.sub('[%s]' % re.escape(string.punctuation), " ", text)
    return text
df_task['tasks'] = df_task.tasks.apply(lambda x:text_cleaner2(x))
df_task.head() 

Looks good. Time to perform TF-IDF analysis. TF-IDF analysis can be done using sklearn's *TfidfVectorizer*
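As a rough illustration of what TF-IDF weighting does, the sketch below computes the weights by hand on three invented sentences. It uses a smoothed inverse document frequency of the form ln((1 + n) / (1 + df)) + 1, which is my understanding of sklearn's default, and it leaves out the L2 normalisation step, so the numbers will not match `TfidfVectorizer` output exactly.

```python
import math
from collections import Counter

# Invented sentences, standing in for the task questions
docs = [
    "vaccine trial for coronavirus",
    "vaccine safety in animal models",
    "therapeutic drug for coronavirus",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents each word appears
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))

def tfidf(tokens):
    # Term frequency times a smoothed inverse document frequency
    counts = Counter(tokens)
    return {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }

weights = tfidf(tokenised[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {weight:.3f}")
```

Words that occur in only one sentence, such as "trial", end up with a higher weight than words shared across sentences, which is the property that makes TF-IDF useful for picking out keywords.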


In [None]:
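#Create TFIDF model. Note that min_df was tuned to give optimal output
vectorizer = TfidfVectorizer(min_df=0.08, tokenizer = lambda x: x.split(), sublinear_tf=False, stop_words = "english")
tasks_keywords = vectorizer.fit_transform(df_task.tasks)

#Print keywords in order of importance, measured as the TF-IDF weight summed over all questions
#(vocabulary_ maps each word to its column index, so the index is used to look up the score)
scores = np.asarray(tasks_keywords.sum(axis=0)).ravel()
print(sorted([(k, scores[v]) for k, v in vectorizer.vocabulary_.items()], key = lambda x: x[1], reverse = True))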

The keywords above look reasonable, but there are some obviously useless words. Let's manually remove these: 'use', 'tried', 'studies', 'know', 'need', 'concerning', 'alongside'. We should also apply lemmatization similar to what was used for the title models (note that we only apply lemmatization now so that stopwords can be removed more easily by the TfidfVectorizer).


In [None]:
#Grab all keywords in the TF-IDF vectorizer
new_dict = vectorizer.vocabulary_
#manually remove useless words and put into a new keyword_list
stop_words = ["use", "tried", "studies", "know", "need", "concerning", "alongside"]
for word in stop_words:
    new_dict.pop(word, None)
keyword_list = list(new_dict.keys())

#Do the same processing as in the previous workbook that was used to form the title topics
#This includes the replacement of various keywords, removal of numbers and lemmatization
keyword_list = [x.replace("severe acute respiratory syndrome", "sars") for x in keyword_list]
keyword_list = [re.sub('viral|viruses', 'virus', x) for x in keyword_list]
keyword_list = [re.sub('[0-9]', '', x) for x in keyword_list]
lemma = WordNetLemmatizer()
keyword_list = [lemma.lemmatize(x) for x in keyword_list] 

print(keyword_list)

This keyword list looks slightly better. We are now ready to select articles by topic matching.

### Topic matching 

The next task is to match the title topics of all the articles with the keyword list that we have just extracted. Let's load the topic model and the processed metadata dataframe, which contains the corresponding title topics for each article.


In [None]:
#Load the LDA topic model
topic_model = pickle.load(open('/kaggle/working/title_model', "rb"))
word_key = pickle.load(open('/kaggle/working/dictionary' , "rb"))

#Load the dataframe with the title, abstract, and the corresponding title models
df_metadata = pd.read_csv('/kaggle/working/topic_data.csv')

df_metadata.head()


There are essentially two ways to find the relevant title topics based on the keyword list. The first is to simply lump all keywords into a single document and use the topic model to predict which topics that document falls into. The second is to go through all topics and pick out the topics that contain the keywords in our keyword list. We will do both here, starting with the topic prediction.


In [None]:
#Format the keyword list into a form acceptable by the gensim lda model
corpus_dict = Dictionary([keyword_list])
corpus = [corpus_dict.doc2bow(words) for words in [keyword_list]]
#predict the topic probabilities using the model
vector = topic_model[corpus]
topic_list = []
for topic in vector:
      print(topic)
     
topic_list = [tup[0] for tup in vector[0]]


It looks like the model predicts that these keywords could belong to 11 topics, which is not too bad for a topic model with 60 topics. Let's check the keywords for these topics.



In [None]:
for topic in topic_list:
      print(topic_model.show_topic(topic))

Unfortunately, the predicted topics do not appear to contain the keywords that we are after. So simple prediction using the LDA model with the keywords as a corpus does not yield a good result. We therefore have to go the harder but less risky way of selecting all topics that contain the keywords in the keyword list. 
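One possible contributor to the poor prediction, offered as an observation rather than a confirmed diagnosis: a `Dictionary` built only from the keyword list numbers its words independently of the vocabulary the topic model was trained with, so the same id can point to different words on each side. The toy vocabularies below are invented to show the effect in plain Python.

```python
# Hypothetical vocabulary the topic model was trained with (word -> id)
model_token2id = {"cell": 0, "protein": 1, "vaccine": 2, "therapeutic": 3}

keywords = ["vaccine", "therapeutic"]

# A fresh vocabulary built only from the keywords numbers them from zero
fresh_token2id = {word: i for i, word in enumerate(sorted(keywords))}

# Ids produced by the fresh vocabulary, then read back with the model's ids
model_id2token = {i: w for w, i in model_token2id.items()}
fresh_ids = [fresh_token2id[w] for w in keywords]
seen_by_model = [model_id2token[i] for i in fresh_ids]
print(seen_by_model)

# Looking the keywords up in the model's own vocabulary keeps them aligned
aligned_ids = [model_token2id[w] for w in keywords if w in model_token2id]
print([model_id2token[i] for i in aligned_ids])
```

In the notebook's terms, building the bag of words with the loaded dictionary (for example `word_key.doc2bow(keyword_list)`) would be one way to keep the ids aligned. This is an untested suggestion, and the result may still be poor for other reasons.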


In [None]:
word2id = word_key.token2id 
new_topic_list = []
#Initial test shows that two important keywords, naproxen and minocycline are not in the topic keywords
#Since these are antiinflammatory and antibiotics respectively, I have decided to manually add these keywords into the keyword_list
keyword_list.append('antiinflammatory')
keyword_list.append('antibiotic')
for word in keyword_list:
    try: 
        word_id = word2id[word]
        topics = topic_model.get_term_topics(word_id, minimum_probability=None)
        for topic in topics:
            new_topic_list.append(topic[0])
    except KeyError:
        print(word + " not in topic words")

new_topic_list = list(set(new_topic_list))

We see that there are a few keywords that do not exist in the topic model, but only a few, which is good. Most keywords can be assigned to a topic, and those topics are now in the list new_topic_list. Let's now choose from the metadata dataframe the articles whose title topics are included in new_topic_list. Since the topic list is relatively large, we add a criterion to limit the number of articles: all three top title topics of an article must be included in new_topic_list for it to be selected. This is done below:
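The selection rule described above can be sketched in plain Python. Because the topic lists were written to CSV, each one comes back as a string such as "[3, 17, 42]"; the sketch assumes that format and parses it with `ast.literal_eval`. The function names and topic ids are hypothetical.

```python
import ast

def parse_topics(topic_string):
    # The CSV stores each topic list as text such as "[3, 17, 42]"
    topics = ast.literal_eval(topic_string)
    return [int(t) for t in topics]

def all_topics_relevant(topic_string, relevant_topics, required=3):
    # Keep an article only if it has `required` topics and all are relevant
    topics = parse_topics(topic_string)
    return len(topics) == required and all(t in relevant_topics for t in topics)

relevant = {3, 17, 42, 55}  # illustrative topic ids, not taken from the model
print(all_topics_relevant("[3, 17, 42]", relevant))
print(all_topics_relevant("[3, 17, 8]", relevant))
print(all_topics_relevant("[3, 17]", relevant))
```

Articles with fewer than three assigned topics are rejected here, which matches the effect of filling missing topics with a placeholder id that never appears in the topic list.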


In [None]:
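#Function to extract the topics that were assigned previously. The topics column was saved to CSV, so each list is read back
#as a string such as "[3, 17, 42]". Missing topics are filled with the placeholder id 80, which is outside the range of the
#model's topic ids and therefore never matches the topic list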

def read_topics(topic_string):
    topic_list = topic_string.split(",")
    topic_dict = {}
    if (len(topic_list) == 3):
        topic1 = int(topic_list[0][1:])
        topic2 = int(topic_list[1])
        topic3 = int(topic_list[2][:-1])
    elif (len(topic_list) == 2):
        topic1 = int(topic_list[0][1:])
        topic2 = int(topic_list[1][:-1])
        topic3 = 80
    else:
        topic1 = 80
        topic2 = 80
        topic3 = 80
    topic_dict['topic1'] = topic1
    topic_dict['topic2'] = topic2
    topic_dict['topic3'] = topic3
    return topic_dict
        

#Separate out the three topics into separate columns for easier processing
df_metadata['topic_dict'] = df_metadata['topics'].apply(lambda x:read_topics(x))
df_metadata['topic1'] = [x['topic1'] for x in df_metadata['topic_dict']]
df_metadata['topic2'] = [x['topic2'] for x in df_metadata['topic_dict']]
df_metadata['topic3'] = [x['topic3'] for x in df_metadata['topic_dict']]
#With the relatively large topic list, I tighten the requirement so that all three top topics need to be in the topic list for an article 
#to be selected
df_metadata['Select'] = df_metadata['topic1'].isin(new_topic_list) & df_metadata['topic2'].isin(new_topic_list) & df_metadata['topic3'].isin(new_topic_list) 
df_selected = df_metadata[df_metadata['Select'] == True]
#Drop all rows without an abstract
df_selected = df_selected[df_selected['abstract'].notna()]
print(df_selected.shape)


We have successfully reduced the number of articles to 1/5 of the total. Furthermore, a quick look at the articles suggests that they are indeed on the right topics. We now extract the abstracts of these articles and train a word vector model to look more deeply into what information these articles contain. 

### Extracting Word Vectors from Abstracts 

Let's have a quick look at the selected articles (df_selected)

In [None]:
df_selected.head()

We need to perform text preprocessing as before. One thing to notice is that the word "Abstract" appears at the start of a lot of abstracts, so we need to remove it as well. It was also noted (after the first pass) that there are punctuation marks not in the punctuation list that need to be removed manually. Finally, we also remove the stopwords in the abstracts using gensim's remove_stopwords.


In [None]:
#Function to clean up abstract
def abstract_cleaner(text):
    #standard preprocessing - lower case and remove punctuation
    text = str(text).lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), "", text)
    #remove other punctuation formats that appears in the abstract
    text = re.sub("’", "", text)
    text = re.sub('“', "", text)
    text = re.sub('”', "", text)
    #remove the word abstract and other stopwords
    text = re.sub("abstract", "", text)
    text = remove_stopwords(text)
    #lemmatize and join back into a sentence
    text = " ".join([lemma.lemmatize(x) for x in word_tokenize(text)])
    
    return text

#Clean abstract
df_selected['abstract'] = df_selected['abstract'].apply(lambda x: abstract_cleaner(x))
df_selected.head()

Next, with many more words in the abstract compared to the titles, we will make use of bigrams using gensim's Phrases and Phraser



In [None]:
#Check for bi-grams - first split sentence into tokens
words = [abstract.split() for abstract in df_selected['abstract']]
#Check for phrases, with a phrase needing to appear at least 30 times to be counted as a phrase
phrases = Phrases(words, min_count=30, progress_per=10000)
#Form the bigram model
bigram = Phraser(phrases)
#Tokenise the sentences, using both words and bigrams. Tokenised_sentence is the word tokens that we will use to form word vectors
tokenised_sentences = bigram[words]
print(list(tokenised_sentences[0:5]))

Now with the sentences tokenised, we are ready to create word vectors. We use gensim's Word2Vec model for this job


In [None]:
#Make use of multiprocessing
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
#Create a word vector model. The number of dimensions chosen for word vector is 300
w2v_model = Word2Vec(window=2,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)
#Build vocab for the model
w2v_model.build_vocab(tokenised_sentences)
#Train model using the tokenised sentences
w2v_model.train(tokenised_sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print("Model trained")

Now that the model is trained, we can test how good it is using similarities and analogies. Since these articles are COVID-19 related, we can use similarities and analogies that are relevant to these topics. For example, what is the most similar term to COVID-19? If Tamiflu is a treatment for influenza, what is the analogue for SARS-CoV-2? 
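An analogy query is vector arithmetic: "oseltamivir is to influenza as X is to covid19" looks for the word nearest to oseltamivir - influenza + covid19. The sketch below shows this on invented 2-D vectors. It is a simplification of what a word vector library does, and the words and numbers are illustrative only.

```python
import math

# Toy 2-D "word vectors"; the values are invented purely for illustration
vectors = {
    "influenza":   (1.0, 0.0),
    "oseltamivir": (1.0, 1.0),
    "covid19":     (0.0, 0.2),
    "candidate":   (0.1, 1.2),
    "wuhan":       (-0.5, 0.1),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    # "a is to b as ? is to c": add the offset (a - b) to c
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("oseltamivir", "influenza", "covid19"))
```

In terms of a `most_similar`-style query, the words that are added (here the drug and the new disease) go in the positive list and the word that is subtracted (the original disease) goes in the negative list.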


In [None]:
print("Top words that are most similar to COVID19:", "\n", w2v_model.wv.most_similar(positive = ["covid19"]))
print("\n")
#Analogy as vector arithmetic: oseltamivir - influenza + covid19
print("Oseltamivir (Tamiflu) to influenza as what to COVID19:\n", 
      w2v_model.wv.most_similar(positive=["oseltamivir", "covid19"], negative=["influenza"], topn=20))


The word vectors look like they are trained reasonably well, with correct associations when tested by similarities and analogies. For example, the "COVID-19" similarity check produces terms such as "2019-ncov", "wuhan", "novel_coronavirus" etc, all of which describe COVID-19 very well. When we perform an analogy check between tamiflu/influenza and COVID-19, a range of drugs and treatments are returned, such as 'blg', which stands for the Chinese medicine Ban-Lan-Gen, a popular herbal treatment for SARS. Drugs such as the broad-spectrum antiviral arbidol, the malaria drug chloroquine, and the flu drug favipiravir are also in the list, all of which have been in the news as possible drugs for COVID-19. A number of antibiotics also appear, probably reflecting the need for antibiotics to treat secondary infections. 

Even with a simple analysis such as the one above, we have already found treatment and drug options that have been mentioned in the literature, and this is potentially enough information to serve as a brief literature survey to direct certain researchers. 

We now look deeper into the literature data using cluster analysis. 

### Cluster Analysis of Abstracts 

Since we are confident in the word vectors formed from these abstracts, we can now use them for further analysis. In particular, my approach here is to see whether the abstracts can be separated into clusters, and then select the cluster that is most similar to the questions/tasks set out at the start of the workbook. We can do this by first turning each abstract into a single vector. The simplest method is to average the word vectors of all words in the abstract to form a summary vector. Note that words in the abstract that are not in the word vector space (since a min-count was applied to the Phraser tokeniser) are assigned a zero vector. 
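One consequence of the zero-vector choice is worth seeing on a small example. The sketch below uses made-up three-dimensional vectors (`toy_wv` and `average_vector` are hypothetical names, not part of this workbook) and compares averaging with zeros for unknown words against averaging over known words only.

```python
import numpy as np

# Hypothetical 3-d vectors standing in for the trained Word2Vec vocabulary.
toy_wv = {
    "antiviral": np.array([1.0, 0.0, 2.0]),
    "drug":      np.array([3.0, 0.0, 0.0]),
}

def average_vector(tokens, wv, dim, skip_unknown=False):
    """Average the vectors of `tokens`; unknown tokens count as zeros unless skipped."""
    vectors = []
    for token in tokens:
        if token in wv:
            vectors.append(wv[token])
        elif not skip_unknown:
            vectors.append(np.zeros(dim))
    if not vectors:
        # Empty abstract, or nothing known when skipping unknown tokens.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

tokens = ["antiviral", "drug", "xyzzy", "plugh"]  # last two are out of vocabulary
with_zeros = average_vector(tokens, toy_wv, dim=3)
known_only = average_vector(tokens, toy_wv, dim=3, skip_unknown=True)
print(with_zeros)
print(known_only)
```

Both results point in the same direction, but the zero-padded average is shorter because the unknown words dilute it. Since k-means works with Euclidean distances, abstracts with many rare words may be pulled towards the origin. Whether that matters for this dataset is an open question rather than something tested here.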


In [None]:
#Turn abstract into a single vector by averaging word vectors of all words in the abstract
def abstract2vec(abstract):
    vector_list = []
    for word in abstract:
        #separate words that are in the word vector space from words that are not
        if word in w2v_model.wv.vocab:
            vector_list.append(w2v_model.wv[word])
        else:
            vector_list.append(np.zeros(300))
    #In case there are empty abstracts, to avoid error
    if (len(vector_list) == 0):
        return np.zeros(300)
    else:
        return sum(vector_list)/len(vector_list)

#Store tokens into dataframe and turn it into vectors
df_selected['sentences'] = tokenised_sentences
df_selected['avg_vector'] = df_selected['sentences'].apply(lambda x: abstract2vec(x))
df_selected.head()


The result looks OK. The next step is to perform k-means clustering using sklearn to find clusters of these articles. To do that, we first scale the vectors with a standard scaler. Furthermore, to get a rough estimate of the optimum number of clusters, we use the elbow method: we look for the inflection point in the decrease of the sum of squared distances from the centroid (`inertia_` in sklearn's KMeans) as the number of clusters increases.


In [None]:
# Turn data in an array as input to sklearn packages
X = np.array(df_selected.avg_vector.to_list())

#Perform standard scaling, as kmeans is strongly affected by scale
sc = StandardScaler()
X_scaled = sc.fit_transform(X)

#Form kmeans models with 2 to 97 clusters (in steps of 5), and record the inertia, which is the sum of squared distances 
#from each point to its nearest cluster centroid 
sum_square = []
for i in range(2,100,5):
    km_model = KMeans(init='k-means++', n_clusters=i, n_init=10)
    cluster = km_model.fit_predict(X_scaled)
    sum_square.append(km_model.inertia_)


x = range(2,100,5)
plt.figure(figsize=(20,10))
plt.plot(x,sum_square)
plt.scatter(x,sum_square)
plt.title('Sum of square as a function of cluster size')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of square distance from centroid')
plt.show()
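The elbow in a curve like this can also be estimated numerically rather than by eye. The sketch below is one crude heuristic, not something used in this workbook: it picks the point where the discrete second difference of the inertia is largest. The helper name `elbow_by_second_difference` and the inertia values are made up for illustration, and the method assumes evenly spaced cluster counts.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k at which the inertia curve bends most sharply.

    Uses the discrete second difference, so `ks` should be evenly spaced.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Large positive value: steep drop before this point, flat after it.
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Made-up inertia values for illustration only (not output from this notebook).
example_ks = [2, 7, 12, 17, 22, 27]
example_inertias = [1000.0, 700.0, 450.0, 400.0, 370.0, 350.0]
elbow_k = elbow_by_second_difference(example_ks, example_inertias)
print(elbow_k)
```

With the variables from the cell above, the call would be `elbow_by_second_difference(list(x), sum_square)`. On real data the curve is often smooth enough that this heuristic is unstable, so it is best treated as a starting point for the finer silhouette sweep rather than a final answer.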

It can be seen that the elbow seems to be around n = 10 - 30. We can zoom in on this region and use the silhouette score to find the optimal number of clusters. The number of clusters with the highest silhouette score in this range is taken as the optimum. 
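For reference, the silhouette score is usually defined as follows, where $a(i)$ is the mean distance from point $i$ to the other points in its own cluster, $b(i)$ is the mean distance from point $i$ to the points of the nearest other cluster, and $N$ is the number of points. The mean over all points is the quantity reported by `silhouette_score` in sklearn.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
\bar{s} = \frac{1}{N} \sum_{i=1}^{N} s(i),
\qquad
-1 \le s(i) \le 1
```

Values close to 1 indicate points that sit well inside their own cluster, values near 0 indicate points on the boundary between two clusters, and negative values suggest points that may have been assigned to the wrong cluster.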


In [None]:
#Sweep from 10 to 29 clusters (range around the elbow) and record the silhouette score
silhouette = []
for i in range(10,30,1):
    km_model = KMeans(init='k-means++', n_clusters=i, n_init=10, random_state = 1075)
    cluster = km_model.fit_predict(X_scaled)
    silhouette.append(silhouette_score(X_scaled, cluster))


#Plot to observe the maximum silhouette score across this range
x = range(10,30,1)
plt.figure(figsize=(20,10))
plt.plot(x,silhouette)
plt.scatter(x,silhouette)
plt.title('Silhouette score as a function of number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()


It looks like 17 clusters gives the best silhouette score (although this may change from run to run because of the randomness in the clustering; the optimum stays around this region). So let's rerun with 17 clusters and store the resulting cluster label for each abstract in our data table. 



In [None]:
km_model = KMeans(init='k-means++', n_clusters=17, n_init=10, random_state = 1075)
#Obtain cluster labels
cluster = km_model.fit_predict(X_scaled)
#store in dataframe
df_selected['cluster'] = cluster
df_selected.head()

We can visualise these clusters using t-SNE, which is well suited to visualising high-dimensional data in two dimensions. However, even for t-SNE, 300 dimensions is a lot to work with directly. That is why PCA is used to first reduce the dimension to 50 before applying t-SNE. Different colours are used to differentiate the clusters. The perplexity was chosen so that the clusters are visually separated as well as possible.


In [None]:
#Create Principal components
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)

#Create t-SNE components and stored in dataframe
X_embedded = TSNE(n_components=2, perplexity = 40).fit_transform(X_reduced)
df_selected['TSNE1'] = X_embedded[:,0]
df_selected['TSNE2'] = X_embedded[:,1]

#plot cluster label against TSNE1 and TSNE2 using different colour for each cluster
color = ['b','g','r','c','m','y','yellow','orange','pink','purple', 'deepskyblue', 'lime', 'aqua', 'grey', 'gold', 'yellowgreen', 'black']
plt.figure(figsize=(20,10))
plt.title("Clusters of abstract visualised with t-SNE")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
for i in range(17):
    plt.scatter(df_selected[df_selected['cluster'] == i].TSNE1, df_selected[df_selected['cluster'] == i].TSNE2, color = color[i])
plt.show()

It can be seen that the clusters look fairly well separated. Finally, we can now pass the questions through the same clustering model to see which cluster each task belongs to:

In [None]:
#Take the questions from the question list and clean using the same function
q_cleaned = [abstract_cleaner(x) for x in question_list]
#Create tokens from the bigram phraser
q_words = [q.split() for q in q_cleaned]
q_tokens = bigram[q_words]
#Turn tokens into a single summary word vector
q_vectors = [abstract2vec(x) for x in q_tokens]
#Predict cluster based on the summary word vector, applying the same scaling that the kmeans model was fitted on
question_cluster = km_model.predict(sc.transform(np.array(q_vectors)))
print(question_cluster)

It looks like the questions/tasks are mainly related to one cluster. Let's look at the keywords associated with this common cluster using a count vectorizer. We will first list the top keywords, and then plot them in a word cloud for better visual impact. 


In [None]:
#Make tokens back into a string
df_selected['sentence_str'] = df_selected['sentences'].apply(lambda x: " ".join(x))
#Perform Count Vectorize to obtain words that appeared most frequently in these abstracts
main_cluster = mode(question_cluster)
cv = CountVectorizer()
text_cv = cv.fit_transform(df_selected[df_selected['cluster'] == main_cluster].sentence_str)
dtm = pd.DataFrame(text_cv.toarray(), columns = cv.get_feature_names())
top_words = dtm.sum(axis = 0).sort_values(ascending = False)
topword_string = ",".join([x for x in top_words[0:50].index])
print("Main cluster: Top 50 words - " + topword_string + "\n")


In [None]:
#Create word cloud using the top words
wordcloud1 = WordCloud(width = 800, height = 600, 
                background_color ='white', 
                min_font_size = 10).generate(topword_string)

#Plot the WordCloud image
plt.figure(figsize=(10,6))
plt.imshow(wordcloud1) 
plt.title("Word Cloud for Main Cluster")
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show() 

Looks reasonable. In particular, vaccine, drugs and treatment are top keywords in the cluster. In hindsight, the model would probably perform better if we had removed some common words such as virus and disease, which could reduce the overlap between clusters. Nonetheless, these top-occurring words give us confidence that the selected cluster is relevant. Let's have a look at which articles have been selected. 


In [None]:
df_cluster = df_selected[df_selected['cluster'] == main_cluster]
print("Number of articles in main cluster: ", df_cluster.shape[0])


#Have a look at the first 10 articles from the main cluster:
print(list(df_cluster['title'][0:10]))

Based on the titles of the articles, the selection seems accurate overall. We have now narrowed down to about 700 articles, which is much more manageable than the 40k+ starting point. Of course, manually going through 700 articles is probably still too much. We can explore the use of NLP to further understand these articles in the next workbook. To end this workbook, we save the dataframe for the selected cluster for later use.


In [None]:
#Save data for later use
df_cluster.to_csv("Cluster.csv", index = False, header=True)

## Part 3: Text Summarization

Now that we have narrowed the selection down to a few hundred articles, we can look deeper into the articles using the full text instead of the abstract. In particular, I will use extractive summarization in gensim to create a summary of the full text of each selected article. A keyword search over these summaries then pulls out the relevant articles in a form that is compact enough for human processing. 

Again, let's start by loading the relevant packages.

In [None]:
#standard data packages
import numpy as np 
import pandas as pd 

#OS is required to look for the correct file in the directory
import os

#json for parsing the json full text files
import json

#Summarization using gensim summarizer, which uses an extractive summarization algorithm 
#(the gensim.summarization module was removed in gensim 4.0, so this import needs gensim 3.x)
from gensim.summarization.summarizer import summarize



### Summarization

To perform summarization, we look up the full text of each article selected previously and summarize them one by one using gensim. The key to this is really the parsing of the json files. The json format is described in the metadata.readme file. In particular, once read, each full text file becomes a nested dictionary, with the title under the "title" key of "metadata", the authors under the "authors" key of "metadata", and the body text under "body_text". We can therefore write a function to parse the json and store this information. Furthermore, the filename of the full text of each article is given by the "sha" value in the dataframe, which can be used to find the correct file to read. First, let's recall what the metadata dataframe looks like:

In [None]:
#Load file containing articles of the selected cluster
df_cluster = pd.read_csv("/kaggle/working/Cluster.csv")
df_cluster.head()

In [None]:
#Define function to find a file within the input path
def find(name, path):
    #filename will be sha number + .json
    filename = name + '.json'
    for root, dirs, files in os.walk(path):
        if filename in files:
            return os.path.join(root, filename)


#define a function to read the json file, and separate the relevant parts into body text, title and authors
def parse_json(name, path):
    filename = find(name, path)
    try:
        with open(filename, "r") as read_file:
            data = json.load(read_file)
        body_text = data["body_text"]
        #Title and authors are under the metadata tag
        title = data["metadata"]["title"]
        author = data["metadata"]["authors"]
        return body_text, title, author
    #Missing file (find returns None), unreadable file or unexpected json layout: return empty fields
    except (TypeError, OSError, ValueError, KeyError):
        return "", "", ""


#Load files for all articles and store it into the data frame  
df_cluster['data'] = df_cluster['sha'].apply(lambda x: parse_json(str(x), "/kaggle/input/"))
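Because `find` walks the whole input directory once per article, the lookup can become slow for several hundred files. A possible alternative, sketched below under the assumption that each full text file is named `<sha>.json`, is to walk the directory once and build a dictionary from file stem to path. The helper name `build_json_index` and the demo file name are hypothetical.

```python
import os
import tempfile

def build_json_index(root_dir):
    """Walk root_dir once and map each '<name>.json' stem to its full path."""
    index = {}
    for folder, _dirs, files in os.walk(root_dir):
        for filename in files:
            stem, ext = os.path.splitext(filename)
            if ext == ".json":
                # Keep the first path seen if the same stem appears twice.
                index.setdefault(stem, os.path.join(folder, filename))
    return index

# Small demonstration on a throwaway directory (file name is made up).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "subset", "pdf_json"))
    demo_path = os.path.join(tmp, "subset", "pdf_json", "abc123.json")
    with open(demo_path, "w") as handle:
        handle.write("{}")
    demo_index = build_json_index(tmp)
    found = demo_index.get("abc123")
    missing = demo_index.get("not_a_real_sha")

print(found == demo_path, missing)
```

In this workbook the index would be built once from "/kaggle/input/", and each lookup would then be a plain `index.get(sha)`, which returns `None` for articles without a full text file.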


Next, we perform extractive summarization using gensim on the full text of each file. Note that the body text is stored as a list of dictionaries, one per paragraph, with the actual text under the "text" key of each dict. Similarly, the authors are stored as a list of dicts, with each author's name under the "first" and "last" keys. We can write a function to handle each of these. 

gensim makes it very easy to do extractive summarization with a single command. Here, I have kept each summary to 200 words, so that some detail is retained but it is compact enough to be reviewed manually. An average person can read 200 words within 1 - 2 minutes, so even if a keyword search brings up 100 articles, it would take roughly two to three hours to review all the summaries, which is still doable. 

In [None]:
#Define function to combine the full text into a single document that is then passed through the summarizer
def extractive_summary(json_data, word_count):
    body_text = json_data[0]
    #Each paragraph is a dict with the actual text under the "text" key
    corpus = [paragraph["text"] for paragraph in body_text]
    document = " ".join(corpus)
    return summarize(document, word_count = word_count)

#Define function to extract the authors of the paper
def extract_authors(dict_list):
    name_list = []
    for item in dict_list:
        name_list.append(" ".join([item["first"], item["last"].upper()]))
    return name_list

df_cluster['summary'] = df_cluster['data'].apply(lambda x: extractive_summary(x, 200))
df_cluster['full_title'] = [x[1] for x in df_cluster['data']]
df_cluster['authors'] = [x[2] for x in df_cluster['data']]
df_cluster['authors'] = df_cluster['authors'].apply(lambda x: extract_authors(x))
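To give a feel for what "extractive" means, here is a deliberately simplified sketch in plain Python. It is not gensim's algorithm (gensim's `summarize` is based on a variation of TextRank). It only scores each sentence by how frequent its words are in the whole text and keeps the top-scoring sentences in their original order. The function name `frequency_summary` and the sample text are made up for illustration.

```python
import re
from collections import Counter

def frequency_summary(text, max_sentences=2):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore the original sentence order
    return " ".join(sentences[i] for i in keep)

sample_text = (
    "Remdesivir inhibits coronavirus replication in cell culture. "
    "The weather was pleasant that day. "
    "Coronavirus replication was also reduced in mice treated with remdesivir. "
    "We thank our colleagues."
)
toy_summary = frequency_summary(sample_text, max_sentences=2)
print(toy_summary)
```

The two sentences about remdesivir and coronavirus replication are kept, because their words recur across the text, while the off-topic sentences are dropped. No new sentences are written, which is the defining property of extractive (as opposed to abstractive) summarization.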

Finally, we can now extract information from this data table. I have created a search function that finds the relevant summaries by checking whether a summary contains all of the keywords in keyword_list and at least one of the words in optional_list. The function then prints the search results: the title, authors, DOI and summary. If a researcher wants to know more about a particular summary, they have the information they need (title, authors and DOI) to find the original article, for example to see the figures, which are not included in the json.

In [None]:
def find_information(keyword_list, optional_list):
    #Keep summaries containing every keyword and at least one optional term (case-sensitive substring match)
    df_answer =  df_cluster[df_cluster['summary'].apply(lambda x: all(substring in x for substring in keyword_list)) == True]
    df_answer = df_answer[df_answer['summary'].apply(lambda x: any(substring in x for substring in optional_list)) == True]
    for i in range(len(df_answer)):
        print("Title: ", df_answer['title'].iloc[i], '\n')
        print("Authors: ", ",".join(df_answer['authors'].iloc[i]), '\n')
        print("DOI: ", df_answer['doi'].iloc[i],'\n')
        print("Summary: ", df_answer['summary'].iloc[i], '\n')
    #Return the matches so that the caller can keep the result
    return df_answer
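Note that the `in` test used above is case-sensitive, so a search term such as "SARS-COV" does not match "SARS-CoV" in a summary. A minimal sketch of a case-insensitive variant of the same all/any logic is shown below. The helper name `summary_matches` and the example sentence are hypothetical, not part of the workbook.

```python
def summary_matches(summary, required, optional):
    """True if summary contains every required term and at least one optional term.

    Matching is by case-insensitive substring.
    """
    text = summary.casefold()
    has_required = all(term.casefold() in text for term in required)
    has_optional = any(term.casefold() in text for term in optional)
    return has_required and has_optional

example = "Several drugs inhibit SARS-CoV replication in vitro."
case_sensitive_hit = "SARS-COV" in example
case_insensitive_hit = summary_matches(example, ["drug"], ["SARS-COV"])
print(case_sensitive_hit, case_insensitive_hit)
```

Substring matching also means that "drug" matches "drugs", which is usually what is wanted here, while "vaccines" does not match a summary that only says "vaccine".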


### Results

Here, we use the search function above to see what we can learn about the questions/tasks given. This is done by searching for keywords and analysing the results manually. We start with drugs that can be used to treat COVID-19 and related viruses:


In [None]:

df_answer = find_information(["drug"], ["SARS-COV", "COVID-19", "2019-nCoV", "coronavirus", "-CoV", "SARS", "MERS"])


We can see some important information regarding development and research on antiviral drugs for COVID-19, for example:
1. The nucleotide prodrug GS-5734 (the development code for remdesivir), which has been used for Ebola, has been shown to inhibit SARS-CoV and MERS-CoV replication, so it can probably be applied to SARS-CoV-2. 
2. Mouse models can be used to evaluate vaccines, immunoprophylaxis and antiviral drugs, and the golden Syrian hamster model is highlighted in particular.
3. Type I interferons and lopinavir-ritonavir are potential anti-MERS-CoV agents and could therefore be tried on SARS-CoV-2.
4. A poly-ADP-ribose polymerase 1 (PARP1) inhibitor, CVL218, may serve as a potential drug that inhibits SARS-CoV-2 replication.
5. Remdesivir (the same compound as GS-5734 in item 1) is an antiviral drug candidate that is currently in human clinical trials for COVID-19. 
6. Another possible drug is niclosamide, which regulates multiple signalling pathways and is potentially effective against viral infections, but this still needs to be validated.
7. While chloroquine may have benefits in treating SARS-CoV-2, previous use in other acute viral diseases suggests that it may cause chronic arthralgia, so its potential detrimental effects call for caution.

Note: since every run of this notebook may differ due to the statistical nature of the models, what is returned by each search may be slightly different. Nevertheless, this demonstrates that through a series of NLP techniques, we are able to extract important information from the research database without spending days reading paper by paper. The information above is of medium detail; researchers who require more can easily follow the DOI displayed alongside each summary to the original paper. 

Let's do another try, this time with vaccines:




In [None]:
df_answer = find_information(["vaccines"], ["SARS-COV", "COVID-19", "2019-nCoV", "coronavirus", "-CoV", "SARS", "MERS"])

Unlike the last search, this one does not yield much specific information. However, it has identified a key paper (DOI: 10.1021/acscentsci.0c00272) that presents a patent analysis of coronavirus-related biologics from 2003 to the present, which would no doubt provide a lot of information that a technical person can make use of.  

## Conclusion

I have demonstrated here how NLP can be used to extract specific information from a large research database, through a combination of topic modelling, TF-IDF, Word2Vec, k-means clustering and extractive summarization. In hindsight, there are a number of shortfalls in this approach. In particular, as the topic modelling and clustering may not be optimal or accurate, there are very likely relevant articles that were not selected. A potential solution may be to work with the abstracts or even the full text right from the start, but this would inevitably require significantly more computing power and time, so there is a trade-off to be found. Other improvements can also be made, especially in the in-depth full-text analysis at the end. More sophisticated abstractive summarization, knowledge graphs, and extracting the relevant references in each paper to map or even pull in all related information could provide a more automated way of extracting even more detail from the text. Nevertheless, what is presented here hopefully provides a basis for what can be done and may inspire other, more elegant solutions to the problem. 