## **LDA-based Topic Modeling**

***

This notebook applies Natural Language Processing (NLP) and other AI techniques to generate insights in support of the ongoing fight against COVID-19. These approaches are increasingly urgent because new coronavirus literature is accumulating faster than the medical research community can keep up with.
<br>
<br>Language is unstructured data produced by people to be understood by other people. Text data is not random: it is governed by linguistic properties that make it understandable to people and also processable by computers.

***

**Methodology:** This notebook retrieves insights from a corpus of full-text COVID-19 research papers published in 2020. First, text-mining approaches are used to explore the corpus, including 1) a wordcloud, 2) a Word2Vec model to retrieve the words most similar to a given word (e.g., *"origin"*, *"symptom"*), and 3) a t-SNE visualization of semantic clusters in the corpus. Then, an unsupervised Latent Dirichlet Allocation (LDA) model is used to identify the main topics present in the corpus.
***
**Highlights:** 
1.  Text Data Loading and Preparation
2.  Wordcloud of COVID-19 Abstracts
3.  Word2Vec Model and Textual Similarities
4.  t-SNE Visualization of Semantic Clusters
5.  **Latent Dirichlet Allocation-based Topic Modeling**
***

**Pros:**
* Focus on new coronavirus literature
* Application of diverse text mining techniques
* **Automated LDA-based Topic Modeling**
* Insightful Data Visualization Tools

**Cons:**
* Reduced Scope: Analysis of 1,994 full-text research papers published in 2020.

***

In [None]:
from tqdm import tqdm
import json
import re
import fnmatch
import numpy as np
import pandas as pd
from pprint import pprint
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('stopwords')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer
import os
from time import time  # to time our operations
import operator
import sys
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# **Part 1: Data Extraction and Preparation**

### ** <font color=green> Function to Extract Text Data from JSON Files **

In [None]:
# Function to extract text data from JSON files

def extract_values(obj, key):
    """Pull all values of specified key from nested JSON."""
    arr = []

    def extract(obj, arr, key):
        """Recursively search for values of key in JSON tree."""
        if isinstance(obj, dict):
            for k, v in obj.items():
                if isinstance(v, (dict, list)):
                    extract(v, arr, key)
                elif k == key:
                    arr.append(v)
        elif isinstance(obj, list):
            for item in obj:
                extract(item, arr, key)
        return arr

    results = extract(obj, arr, key)
    return results
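
As a quick illustration of what `extract_values` returns, the sketch below runs it on a small, hypothetical paper record (a simplified stand-in, not an actual CORD-19 file). The function is restated with the same behavior so that the snippet runs on its own.

```python
def extract_values(obj, key):
    """Collect every scalar value stored under `key` in nested dicts and lists."""
    found = []

    def extract(node):
        if isinstance(node, dict):
            for k, v in node.items():
                if isinstance(v, (dict, list)):
                    extract(v)          # descend into nested containers
                elif k == key:
                    found.append(v)     # scalar value under the requested key
        elif isinstance(node, list):
            for item in node:
                extract(item)

    extract(obj)
    return found


# Hypothetical, heavily simplified record used only for illustration
toy_paper = {
    "paper_id": "abc123",
    "abstract": [{"text": "Background on the outbreak."}],
    "body_text": [
        {
            "text": "Patients presented with fever.",
            "cite_spans": [{"start": 0, "text": "[1]"}],
        }
    ],
}

toy_texts = extract_values(toy_paper, "text")
print(toy_texts)
```

Note that every `text` field is collected, wherever it sits in the tree, so in this toy record the citation marker nested under `cite_spans` is returned alongside the abstract and body paragraphs.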

### ** <font color=green> Data Exploration **

In [None]:
df = pd.read_csv("/kaggle/input/CORD-19-research-challenge/metadata.csv")
df=df[["title", "authors", "publish_time", "abstract","sha"]]
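try:
    df["publish_time"] = pd.DatetimeIndex(df["publish_time"]).year
except (ValueError, TypeError) as err:
    # Report unparseable dates so a failed conversion does not go unnoticed
    print("Could not convert publish_time to years:", err)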
plt.subplots(figsize = (10,6))
plt.hist(df["publish_time"],bins = 30, edgecolor ="black")
plt.title("Coronavirus-Related Academic Publications \n", fontsize = 24, fontweight = "bold")
plt.xlabel("Published Year")
plt.ylabel("Publications")
plt.savefig("COVID-19_Publications_Histogram.png")
plt.show()

### ** <font color=green> Retrieval of Relevant Research Papers **

In [None]:
df=df[df["publish_time"]==2020]
df=df.dropna(subset = ["sha"])
df=df.dropna(subset = ["abstract"])
df =df[df["abstract"].str.contains("COVID|covid|Covid|coronavirus|Coronavirus|2019-nCov|SARS-CoV-2", regex=True)]
print("Number of Retrieved Full-Text Papers", len(df))
display(df.head(5))

### ** <font color=green> Text Data Preprocessing (Tokenization, Stopword Removal, Bigrams/Trigrams) **

In [None]:
# Text data extraction and preprocessing(tokenization, stopword removal, and bigrams/trigrams)

t = time()

stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(["could", "medrxiv", "http", "license", "preprint"])
corpus1 = []
tokenizer = RegexpTokenizer(r'\w+')

# Words of 1 to 4 characters, together with any preceding non-word characters
shortword = re.compile(r'\W*\b\w{1,4}\b')

# Set of selected paper ids for fast membership tests
selected_ids = set(df["sha"].unique())

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if fnmatch.fnmatch(filename, '*.json'):

            path = os.path.join(dirname, filename)
            with open(path) as json_file:
                data = json.load(json_file)
            paper_id = data["paper_id"]

            if paper_id in selected_ids:
                data = extract_values(data, 'text')

                for sentence in data:
                    # Drop short words, lowercase, tokenize, then keep alphabetic non-stopwords
                    sentence = shortword.sub('', sentence).lower()
                    word_list = tokenizer.tokenize(sentence)
                    word_list1 = [word for word in word_list if word.isalpha()]
                    word_list2 = [word for word in word_list1 if word not in stopwords]
                    corpus1.append(word_list2)


# Build the bigram and trigram models
bigram = gensim.models.Phrases(corpus1, min_count=10, threshold=100)  # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[corpus1], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

# make_trigrams applies the bigram model first, then the trigram model
corpus2 = make_trigrams(corpus1)
# Keep only documents with more than four tokens (this also drops empty ones)
covid_corpus = [doc for doc in corpus2 if len(doc) > 4]
print('Time to process data: {} mins'.format(round((time() - t) / 60, 2)))
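
The effect of the cleaning steps above is easiest to see on a single sentence. The following is a minimal sketch using only the standard library: `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`, and `toy_stopwords` is a small illustrative subset rather than the full NLTK list.

```python
import re

# Same pattern as in the preprocessing cell: words of 1 to 4 characters
# plus any non-word characters immediately before them
SHORTWORD = re.compile(r'\W*\b\w{1,4}\b')


def preprocess(sentence, stopwords):
    """Drop short words, lowercase, tokenize on word characters, filter stopwords."""
    cleaned = SHORTWORD.sub('', sentence).lower()
    tokens = re.findall(r'\w+', cleaned)  # stand-in for RegexpTokenizer(r'\w+')
    return [tok for tok in tokens if tok.isalpha() and tok not in stopwords]


# Illustrative stopword subset (assumption); the notebook uses the NLTK English list
toy_stopwords = {"which", "could", "between"}
sample = "SARS-CoV-2, which could spread between people, was first reported in Wuhan."

sample_tokens = preprocess(sample, toy_stopwords)
print(sample_tokens)
```

One consequence worth keeping in mind: because every word of four characters or fewer is removed, short but meaningful terms such as the pieces of "SARS-CoV-2" never reach the corpus.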


In [None]:
print(covid_corpus[:3])
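
To give some intuition for the `min_count` and `threshold` arguments passed to `gensim.models.Phrases`, the sketch below reproduces the scoring rule from Mikolov et al. (2013), which gensim's default scorer is assumed to follow here: score = (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)). All counts are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Mikolov-style bigram score: high when a pair co-occurs more often than chance."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# Hypothetical counts, for illustration only
toy_threshold = 100
toy_min_count = 10
toy_vocab_size = 20000

frequent_pair = phrase_score(count_a=50, count_b=40, count_ab=30,
                             vocab_size=toy_vocab_size, min_count=toy_min_count)
rare_pair = phrase_score(count_a=50, count_b=40, count_ab=12,
                         vocab_size=toy_vocab_size, min_count=toy_min_count)

# A pair is merged into a single token only when its score exceeds the threshold
print(frequent_pair, frequent_pair > toy_threshold)
print(rare_pair, rare_pair > toy_threshold)
```

Under this rule, raising `threshold` or `min_count` yields fewer merged phrases, which matches the "higher threshold fewer phrases" comment in the cell above.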

# **Part 2: Wordcloud of 2020 Covid-19 Research Papers**

In [None]:
from collections import Counter

def return_sum(my_dict):
    """Return the sum of all values in a dictionary."""
    return sum(my_dict.values())

def dict_for_wordcloud(corpus):
    """Count lemmatized alphabetic tokens across all documents in the corpus."""
    lemmatizer = WordNetLemmatizer()

    # Flatten the corpus, keep purely alphabetic tokens, and lemmatize them
    lemmas = [lemmatizer.lemmatize(word)
              for doc in corpus
              for word in doc
              if word.isalpha()]

    words_dict = dict(Counter(lemmas))

    print('Total Number of Words:', return_sum(words_dict))
    print('Distinct words', len(words_dict))

    return words_dict


def plot_wordcloud(corpus):

    wordcloud = WordCloud(width=800, height=400, max_words=150, max_font_size=50, relative_scaling=0.5,
                          background_color="white").generate_from_frequencies(dict_for_wordcloud(corpus))
    
    # Display the generated image:
    plt.figure(figsize=(10,8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title('WordCloud of Covid-19 Research Papers \n', fontsize = 24, fontweight = "bold")
    plt.savefig('WordCloud of Covid-19 Research Papers.png')
    plt.show()


In [None]:
t = time()

plot_wordcloud(covid_corpus)    

print('Time to process data: {} mins'.format(round((time() - t) / 60, 2)))

In a wordcloud, the importance of each word is shown by its font size. In this section, a wordcloud of the most frequent words in the corpus of COVID-19 research papers is built. A number of preprocessing steps (e.g., tokenization, lemmatization) are required to build a wordcloud. As expected, words such as *"covid"*, *"patient"*, and *"infection"* are particularly prominent. Other words such as *"wuhan"* and *"protein"* are also frequent in the literature.

# **Part 3: Word2Vec Model and Textual Cosine Similarities**

A **Word2Vec** model (Word to Vector) was built using Gensim Python library to produce word embeddings. Using a large corpus of text as an input, a Word2vec model returns a vector (here, 100 dimensions) for each unique word in the corpus. The similarity between vectors is measured through the cosine similarity metric. Similar vectors represent words that are semantically related in the original corpus.
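
As a minimal sketch of the cosine similarity metric mentioned above, the snippet below compares three hypothetical 3-dimensional vectors (the actual embeddings have 100 dimensions and are learned from the corpus).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, for illustration only
toy_fever = [0.9, 0.1, 0.3]
toy_cough = [0.8, 0.2, 0.4]
toy_lockdown = [-0.2, 0.9, -0.1]

sim_related = cosine_similarity(toy_fever, toy_cough)
sim_unrelated = cosine_similarity(toy_fever, toy_lockdown)
print(round(sim_related, 3), round(sim_unrelated, 3))
```

Vectors pointing in nearly the same direction score close to 1, while vectors pointing in different directions score near 0 or below, regardless of their lengths.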

In [None]:
from gensim.models import word2vec, KeyedVectors
filename = 'testword2vec_Covid_10min_count'

#### Word2Vec Model ####

model = word2vec.Word2Vec(covid_corpus, size=100, window=8, min_count=10, workers=10)
model.train(covid_corpus, total_examples=len(covid_corpus), epochs=15)
model.wv.save(filename)
word_vectors = KeyedVectors.load(filename)

#### Example of Word Embedding ####

model.wv['coronavirus']


In [None]:
word = "origin"
print("Similar words to {}:".format(word))
print(word_vectors.most_similar(positive=word, topn=10))
print("\n")
word = "symptom"
print("Similar words to {}:".format(word))
print(word_vectors.most_similar(positive=word, topn=10))
print("\n")
word = "diagnostic"
print("Similar words to {}:".format(word))
print(word_vectors.most_similar(positive=word, topn=10))
print("\n")
word = "transmission"
print("Similar words to {}:".format(word))
print(word_vectors.most_similar(positive=word, topn=10))

For instance, it is apparent at a glance that the origin of coronaviruses is linked to bigrams such as *"natural_reservoir"*, *"seafood_markets"*, *"animal_reservoir"*, and *"zoonotic_origin"* (a zoonosis is an infectious disease caused by a pathogen that has jumped from non-human animals to humans). <br><br>
Similarly, regarding what is known about COVID-19 *symptoms*, it is worth looking at the words most similar to *"symptom"*, which include *"fever"*, *"cough"*, and *"fatigue"*.

# **Part 4: T-SNE Visualization of Semantic Clusters**

The **t-distributed Stochastic Neighbor Embedding (t-SNE)** dimensionality reduction technique was then applied to project each word embedding onto a 2D plane, where each word is plotted with its label. A **K-means** clustering algorithm was also applied using the *Scikit-learn* Python library to partition the n words into semantic clusters. To determine the optimal number of clusters K, the **elbow method** was used: the plot below shows the sum of squared distances for K from 1 to 29. If the plot looks like an arm, then the elbow of the arm is the optimal K. Here, **K = 7**.
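
In this notebook the elbow is read off the plot by eye. For readers who prefer a numeric check, one possible heuristic (an assumption, not the method used here) is to pick the K where the curve bends most sharply, i.e. where the second difference of the sum of squared distances is largest. The values in `toy_inertias` are made up for illustration.

```python
def find_elbow(ks, inertias):
    """Return the K with the largest second difference of the inertia curve."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        # Discrete second derivative at position i
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical sums of squared distances for K = 1..7
toy_ks = list(range(1, 8))
toy_inertias = [1000, 700, 420, 150, 130, 115, 105]

toy_elbow = find_elbow(toy_ks, toy_inertias)
print(toy_elbow)
```

On noisy curves this heuristic can be unstable, so it is best treated as a complement to visual inspection rather than a replacement.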

In [None]:
def Word2Vec_Sorted(model):
    ''' 
    Function to extract the word2vec embeddings 
    of the most frequent terms in the corpus
    '''
    # Extra terms to exclude, combined with the global stopword list
    extra_stopwords = ["also", "however", "could", "rights_reserved", "reviewed_https_biorxiv", "author_funder",
                       "copyright_holder", "without_permission", "reviewed", "author_funder_granted"]
    excluded = set(stopwords) | set(extra_stopwords)
    w2c = dict()

    for item in model.wv.vocab:
        w2c[item]=model.wv.vocab[item].count
    w2cSorted=dict(sorted(w2c.items(), key=lambda x: x[1],reverse=True))
    w2cSortedList = list(w2cSorted.keys())
    w2cSortedList = [word for word in w2cSortedList if word not in excluded]
    
    return w2cSortedList


#### Implementation of the elbow method to find the optimal number of clusters K ####

Sum_of_squared_distances = []
tokens = []

for word in Word2Vec_Sorted(model):
    tokens.append(model.wv[word])  # look up the embedding through the KeyedVectors

tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=500)
new_values1 = tsne_model.fit_transform(tokens)

K = range(1,30)

for k in tqdm(K):
    km = KMeans(n_clusters=k)
    km = km.fit(new_values1)
    Sum_of_squared_distances.append(km.inertia_)
    
#### Plot the "elbow" curve ####
plt.subplots(figsize = (10,6))
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal Number of Clusters K', fontsize = 24, fontweight = "bold")
plt.savefig("Elbow_Method_Optimal_K.png")
plt.show()

In [None]:
def tsne_plot(model, key_words):
    "Creates a TSNE model and plots it"
    
    labels = []
    tokens = []
    
    for word in Word2Vec_Sorted(model)[:300]:
        tokens.append(model.wv[word])  # look up the embedding through the KeyedVectors
        labels.append(word)

    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=500)
    new_values = tsne_model.fit_transform(tokens)
    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])

    clusters = KMeans(n_clusters=7)
    clusters.fit(new_values)
    y_kmeans = clusters.predict(new_values)
    
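    # One colour per K-means cluster (n_clusters=7, so labels run from 0 to 6)
    colmap = {0: 'red', 1: 'green', 2: 'blue', 3: 'black', 4: 'fuchsia', 5: 'orange', 6: 'grey'}

    # Percentage of words assigned to each cluster
    cluster_share = {}
    for i in range(len(colmap)):
        cluster_share[colmap[i]] = list(y_kmeans).count(i) / len(y_kmeans) * 100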
    
    plt.figure(figsize=(20,15))

    plt.title('Word2Vec Model Visualization')
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')

    for i in range(len(new_values)):
        plt.scatter(x[i], y[i], color=colmap[y_kmeans[i]], s=12)
        if labels[i] in key_words:
            plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom', fontsize = 12, weight="bold", color= 'red')
        else:
            plt.annotate(labels[i],
                         xy=(x[i], y[i]),
                         xytext=(5, 2),
                         textcoords='offset points',
                         ha='right',
                         va='bottom', fontsize=12)
    
 
    plt.savefig("tsne_visualization.png")
    plt.show()
    plt.close()

In [None]:
t = time()

#### TSNE Data Visualization ####

key_words = ['pangolin', 'origin', 'transmission', 'vaccine','symptom', 'environment', 'patient', 'outbreak']

tsne_plot(model, key_words)

print('Time to process data: {} mins'.format(round((time() - t) / 60, 2)))

From the above t-SNE visualization of the Word2Vec embeddings, several clusters can be distinguished, with recognizable themes such as *medical treatment, government policies and measures, vaccine research, epidemiological research, covid-19 detection, transmission, causes and consequences of the disease*... The distance between two words in the 2D plane is an indication of their semantic similarity.
***

# **Part 5: Latent Dirichlet Allocation-Based Topic Modeling**

### ** <font color=green> Literature Review **

In recent years, the Latent Dirichlet Allocation (LDA) method for topic modeling has steadily gained popularity in project management and engineering research. LDA is an unsupervised machine learning technique that can extract the core topics from a set of unlabeled documents. In LDA, each document d is viewed as a probability distribution θ_d over a set of K topics, and each topic k∈{1,…,K} is, in turn, represented as a probability distribution φ_k over the words in the vocabulary (Blei, Ng, & Jordan, 2003). Each word contributes to each topic with a certain probability. The notation is shown in the figures below: θ denotes a matrix with rows defined by documents and columns defined by topics, and θ_(d,k) represents the probability of topic k occurring in document d. Similarly, φ is a matrix with rows defined by topics and columns defined by words. A simplified representation of the LDA process is shown below.

In [None]:
from IPython.display import HTML, Image
Image(filename='/kaggle/input/figure/LDA_Process_1.png', width=500,height=500)

In [None]:
from IPython.display import HTML, Image
Image(filename='/kaggle/input/figure/LDA_Process_2.png', width=500,height=500)
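
The role of the two matrices described above can be made concrete with a toy example. Under LDA, the probability of word w in document d is the mixture p(w | d) = Σ_k θ_(d,k) · φ_(k,w). The sketch below uses hypothetical values for θ and φ with two documents, two topics, and a three-word vocabulary.

```python
import math

toy_vocab = ["virus", "patient", "policy"]

# theta: rows are documents, columns are topics (each row sums to 1)
toy_theta = [
    [0.8, 0.2],
    [0.1, 0.9],
]

# phi: rows are topics, columns are words (each row sums to 1)
toy_phi = [
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
]


def word_distribution(theta_d, phi):
    """p(w | d) = sum over topics k of theta[d][k] * phi[k][w]."""
    n_topics, n_words = len(phi), len(phi[0])
    return [sum(theta_d[k] * phi[k][w] for k in range(n_topics))
            for w in range(n_words)]


toy_word_probs = [word_distribution(theta_d, toy_phi) for theta_d in toy_theta]
for doc_id, probs in enumerate(toy_word_probs):
    print(doc_id, dict(zip(toy_vocab, (round(p, 2) for p in probs))))
```

The first document, dominated by the first topic, puts most of its mass on "virus" and "patient", while the second leans towards "policy". Training LDA amounts to estimating θ and φ from the observed word counts.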

### ** <font color=green> Grid-based Determination of the optimal number of topics  **

In this notebook, a grid search was implemented to find the number of topics K that produces the most coherent model. Specifically, the C_UMass coherence metric was computed after training baseline models for each K from 10 to 24. The coherence score C_UMass of a topic model averages the topic coherence scores of all topics in the model. Because of the logarithm in its definition, C_UMass returns negative values, with values closer to 0 indicating more coherent topics in terms of human interpretability.

In [None]:
from IPython.display import HTML, Image
Image(filename='/kaggle/input/figure/C_Umass_Formula.png', width=500,height=500)
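
For intuition, here is a minimal sketch of a UMass-style coherence score for a single topic, following the commonly cited form: the sum, over ordered pairs of top words, of log((D(w_m, w_l) + 1) / D(w_l)), where D counts the documents containing the given words. This is an illustrative assumption, and gensim's `CoherenceModel` may differ in details such as smoothing and aggregation, so the numbers are not comparable with the scores computed below. The toy documents are made up, and every top word is assumed to occur in at least one document.

```python
import math


def umass_coherence(top_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over pairs with w_l ranked before w_m."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score


# Hypothetical tokenized documents, for illustration only
toy_docs = [
    ["virus", "patient", "fever"],
    ["virus", "patient"],
    ["virus", "policy"],
    ["policy", "lockdown"],
]

coherent_topic = umass_coherence(["virus", "patient", "fever"], toy_docs)
mixed_topic = umass_coherence(["virus", "lockdown", "fever"], toy_docs)
print(round(coherent_topic, 3), round(mixed_topic, 3))
```

The topic whose words tend to appear in the same documents scores closer to 0 than the topic mixing unrelated words, which is the behaviour the grid search relies on.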

In [None]:
from tqdm.notebook import tqdm
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt
import seaborn as sns

#### Function to estimate the optimal number of topics (i.e., the one maximizing C_UMass) ####
#### Time-consuming function ####

def optimal_topic_number(start, end, text):
    
    Lda = gensim.models.ldamodel.LdaModel
    coherenceList_cv = []
    coherenceList_umass = []
        
    dictionary = corpora.Dictionary(text)
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in text]

    num_topics_list = np.arange(start,end)

    for num_topics in tqdm(num_topics_list):
        lda= Lda(doc_term_matrix, num_topics=num_topics,id2word = dictionary, 
                 passes=20,chunksize=10000,random_state=43)
        cm = CoherenceModel(model=lda, corpus=doc_term_matrix, 
                            dictionary=dictionary, coherence='u_mass')
        coherenceList_umass.append(cm.get_coherence())

        #cm_cv = CoherenceModel(model=lda, corpus=doc_term_matrix,
         #                      texts=text, dictionary=dictionary, coherence='c_v')
        #coherenceList_cv.append(cm_cv.get_coherence())


    plotData = pd.DataFrame({'Number of topics':num_topics_list,
                             'CoherenceScore':coherenceList_umass})
    f,ax = plt.subplots(figsize=(10,6))
    sns.set_style("darkgrid")
    sns.pointplot(x='Number of topics',y= 'CoherenceScore',data=plotData)
    plt.title('Topic coherence \n', fontsize = 24, fontweight = "bold")
    plt.savefig('Topic_Coherence.png')
    plt.show()
    index = coherenceList_umass.index(max(coherenceList_umass))
    return index

In [None]:
t = time()

#### Estimate of the optimal number of topics based on the existing corpus ####

start =10
end = 25

num_optimal_topics = start + optimal_topic_number( start, end, covid_corpus)
print("Optimal Number of Topics", num_optimal_topics)

print('Time to process data: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
t = time()

data_lemmatized = covid_corpus
id2word = corpora.Dictionary(data_lemmatized)
id2word.save('dictionary.gensim')
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=num_optimal_topics,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha='auto',
                                                per_word_topics=True)

lda_model.save('model.gensim')

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)


print('Time to process data: {} mins'.format(round((time() - t) / 60, 2)))

pyLDAvis is a Python package for interactive, web-based visualization of LDA models.

In [None]:
import warnings
warnings.filterwarnings('ignore')
import os 
%matplotlib inline
import pyLDAvis.gensim
import gensim
pyLDAvis.enable_notebook()

d = gensim.corpora.Dictionary.load('dictionary.gensim')
c = [d.doc2bow(text) for text in texts]  # bag-of-words corpus built with the loaded dictionary
lda = gensim.models.LdaModel.load('model.gensim')

data = pyLDAvis.gensim.prepare(lda, c, d, mds='tsne')


In [None]:
pyLDAvis.save_html(data, 'lda_{}topics.v0.html'.format(num_optimal_topics))
#from IPython.core.display import display, HTML
#display(HTML('lda_{}topics.v0.html'.format(num_optimal_topics)))

In [None]:
Lda = gensim.models.LdaModel
lda_final = Lda.load('model.gensim')
a = lda_final.show_topics(num_optimal_topics, formatted = False, num_words = 10)
b = lda_final.top_topics(c,dictionary=d,topn=10) # This orders the topics in the decreasing order of coherence score

topic2skillb = {}
topic2csb = {}
topic2skilla = {}
topic2csa = {}
num_topics =lda_final.num_topics
cnt =1

for ws in b:
    wset = set(w[1] for w in ws[0])
    topic2skillb[cnt] = wset
    topic2csb[cnt] = ws[1]
    cnt +=1

for ws in a:
    wset = set(w[0]for w in ws[1])
    topic2skilla[ws[0]+1] = wset
    
for i in range(1,num_topics+1):
    for j in range(1,num_topics+1):  
        if topic2skilla[i].intersection(topic2skillb[j])==topic2skilla[i]:
            topic2csa[i] = topic2csb[j]

finalData = pd.DataFrame([],columns=['Topic','words'])
finalData['Topic']=topic2skilla.keys()
finalData['Topic'] = finalData['Topic'].apply(lambda x: 'Topic'+str(x))
finalData['words']=topic2skilla.values()
finalData['cs'] = [topic2csa.get(topic) for topic in topic2skilla]  # look up each topic's coherence score by key
finalData.sort_values(by='cs',ascending=False,inplace=True)
finalData.to_csv('CoherenceScore.csv')
finalData

In [None]:
token_percent = data.topic_coordinates.sort_values(by ='topics').loc[:, ["topics", "Freq"]]
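df = token_percent.iloc[:, [1]].copy()  # copy to avoid chained assignment on a slice
# Name each row from its index label: label i becomes "Topic<i+1>"
df["Topic"] = ["Topic" + str(i + 1) for i in df.index]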
df

To aid in the task of topic interpretation, pyLDAvis enables users to adjust the relevance measure proposed by Sievert et al. (2015) to rank the words within topics.

In [None]:
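def get_relevant_words(vis,lam=0.3,topn=10):
    # Use the prepared data passed in as `vis`; copy so the original table is not modified
    a = vis.topic_info.copy()
    # Relevance: lam * log p(w|t) + (1 - lam) * log lift
    a['finalscore'] = a['logprob']*lam+(1-lam)*a['loglift']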
    a = a.loc[:,['Category','Term','finalscore']].groupby(['Category'])\
    .apply(lambda x: x.sort_values(by='finalscore',ascending=False).head(topn))
    a = a.loc[:,'Term'].reset_index().loc[:,['Category','Term']]
    a = a[a['Category']!='Default']
    a = a.to_dict('split')['data']
    d ={}
    for k,v in a: 
        if k not in d.keys():
            d[k] =set()
            d[k].add(v)
        else:
            d[k].add(v)
    finalData = pd.DataFrame([],columns=['Topic','words with Relevance'])
    finalData['Topic']=d.keys()
    finalData['words with Relevance']=d.values()
    return finalData
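
The score computed in `get_relevant_words` is the relevance measure: relevance(w | t) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)), where the ratio p(w | t) / p(w) is the lift. The sketch below shows how λ changes the ranking for two hypothetical terms: one that is frequent everywhere and one that is rarer but specific to the topic. All probabilities are made up for illustration.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical probabilities: "patient" is common across the corpus,
# "ace2" is rare overall but concentrated in this topic
toy_topic_probs = {"patient": 0.05, "ace2": 0.02}
toy_corpus_probs = {"patient": 0.04, "ace2": 0.002}


def rank_terms(lam):
    """Return the toy terms sorted by decreasing relevance for a given lambda."""
    return sorted(
        toy_topic_probs,
        key=lambda w: relevance(toy_topic_probs[w], toy_corpus_probs[w], lam),
        reverse=True,
    )


print(rank_terms(1.0))  # ranking by topic probability only
print(rank_terms(0.4))  # ranking that gives more weight to lift
```

With λ = 1 the ranking reduces to raw topic probability, while lower values favour terms that are distinctive for the topic, which is why a value such as 0.4 can surface more interpretable keywords.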

In [None]:
pd.set_option('max_colwidth', 170)

a = get_relevant_words(data,0.4,15).merge(finalData,how='left',on ='Topic').merge(df, how='left', on = 'Topic').sort_values(by='Freq',ascending=False)
a.rename(columns={'cs':'Coherence','Freq':'Frequency', 'words with Relevance':'Relevant Words'}, inplace=True)
b = a[['Topic', 'Frequency','Relevant Words', 'Coherence']]
b

After adjusting the relevance metric, the top 15 relevant words of each modeled topic aid topic interpretation. The corresponding topic frequencies and C_UMass coherence scores are also indicated. The identified topics are then easily interpretable and refer to themes such as *medical treatment, government policies and measures, vaccine research, epidemiological research, covid-19 detection, transmission, causes and consequences of the disease*....

**Many thanks for your time and consideration.**