# COVID-19 Topic Modeling & Article Similarity Matching

This Jupyter Notebook takes data published for the Kaggle COVID-19 challenge, builds a topic model on a subset of that data, and uses FAISS to find the articles in the training set that are most similar to an unseen article.

The source data can be found here: https://www.kaggle.com/covid19

Possible extensions include scaling up to the full dataset (removing the random sampling), improving the presentation of the results, and moving to more granular similarity matching at the sentence level.

First we'll import required packages, and install a few packages as well.

In [1]:
import numpy as np
import pandas as pd
import re
import json
import sys
import os
import ast
import random
pd.set_option('display.max_columns', 40)

In [2]:
#!{sys.executable} -m pip install nltk gensim wordcloud faiss-cpu
#!{sys.executable} -m pip install scikit-learn --upgrade

import nltk
import gensim
import wordcloud
import faiss
from nltk.corpus import stopwords
from helper_functions import * # Make sure this is in the same directory as your Jupyter Notebook

# Data Processing

Next let's load and process the articles. The first step is to load the articles, which have been downloaded onto my local machine. Due to memory-related issues, we'll randomly sample 20% of the articles from each source. This is not ideal, but the resulting sample is still large (>3000 articles).

These functions read data from my local machine: I've downloaded the data from Kaggle and load it from disk. This is not ideal for reproducibility, but you can change the file locations listed below if your files are stored elsewhere. Make sure those locations contain no data other than the Kaggle-provided files.

In [3]:
root_location = 'assignment_5_data' # Insert Root Location here

file_locations = [f'{root_location}/biorxiv_medrxiv/biorxiv_medrxiv', 
                  f'{root_location}/noncomm_use_subset/noncomm_use_subset', 
                  f'{root_location}/comm_use_subset/comm_use_subset',
                  f'{root_location}/custom_license/custom_license']

## Set up Stop Words
## Add Any other relevant options manually

stop_words = stopwords.words('english') + ['et', 'al', 'fig', 'etal', 'et al', 'et-al']

processed_articles = process_articles(file_locations, stop_words)
processed_articles.read_files()

print('Example File Name')
print(processed_articles.root_files[0])
print('Number of files')
print(len(processed_articles.root_files))
print('Example Article Information')
print(processed_articles.title_text[2])

Processing Files at the requested location
There are 177 files to process
There were 885 files in the dataset
Processing Files at the requested location
There are 470 files to process
There were 2353 files in the dataset
Processing Files at the requested location
There are 1823 files to process
There were 9118 files in the dataset
Processing Files at the requested location
There are 3391 files to process
There were 16959 files in the dataset
Example File Name
f9075c1ccf2a4debbe8b96394bf4b541f72dbc5c.json
Number of files
3391
Example Article Information
['biorxiv_medrxiv', 'The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3', ['VP3, and VP0 (which is further processed to VP2 and VP4 during virus assembly) (6). The P2 64 and P3 regions encode the non-structural proteins 2B and 2C and 3A, 3B (1-3) (VPg), 3C pro and 4 structural protein-coding region is replaced by reporter genes, allow the st

Next let's process the text. This involves removing links and other undesirable text, tokenizing words from the raw paragraph text, removing stop words (the, it, etc.), and finally stemming words so that derivatives of the same root word are grouped together.

There's no contextual meaning associated with these stems. This is why word-embedding-based approaches have grown in popularity: the vectors they create attempt to capture the contextual meaning of words based on the words around them, rather than just representing the raw text. Shifting to that type of approach is a logical extension of this initial work.
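The cleaning steps described above can be sketched in a few lines. This is only an illustration and not the code in `helper_functions`: the stop-word list, the regular expressions, and `toy_stem` (a crude stand-in for a real stemmer such as Porter) are all assumptions made for the example.

```python
import re

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list plus extras.
STOP_WORDS = {"the", "it", "is", "are", "in", "of", "and", "at", "see", "et", "al"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TOKEN_PATTERN = re.compile(r"[a-z]+")


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(raw_text):
    """Remove links, tokenize, drop stop words, then stem."""
    no_links = URL_PATTERN.sub(" ", raw_text.lower())
    tokens = TOKEN_PATTERN.findall(no_links)
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return [toy_stem(tok) for tok in kept]


print(clean_text("The viruses are replicating in cells, see https://example.org"))
```

Note how the stems (`virus`, `replicat`, `cell`) are not always dictionary words, which is why tokens such as *multicellular_eukaryot* appear later in the notebook.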

In [4]:
processed_articles.process_text()

Cleaning out Junk
Tokenizing words
Converting to list of words and removing stop words
Creating word stems


Next we'll create bigrams and trigrams of the data. This concatenates words that are frequently used with one another into a single token. As an example, *multicellular_eukaryot* is a bigram created because these two words appear together frequently in the text corpus. *oil_immers_object* is a trigram created for the same reason; the only difference is an additional word on the end.
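To make the idea concrete, here is a minimal sketch of bigram merging. It is a simplification under stated assumptions: `merge_frequent_bigrams` and its `min_count` threshold are hypothetical names, and pairs are selected by raw count only, whereas phrase-detection libraries typically score a pair relative to how often each word appears on its own.

```python
from collections import Counter


def merge_frequent_bigrams(documents, min_count=2):
    """Join adjacent token pairs that occur at least min_count times."""
    pair_counts = Counter()
    for tokens in documents:
        pair_counts.update(zip(tokens, tokens[1:]))

    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    merged_documents = []
    for tokens in documents:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(f"{tokens[i]}_{tokens[i + 1]}")
                i += 2  # skip the second word of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_documents.append(merged)
    return merged_documents


docs = [
    ["multicellular", "eukaryot", "cell"],
    ["multicellular", "eukaryot", "genom"],
    ["viral", "genom"],
]
print(merge_frequent_bigrams(docs))
```

Running the same function a second time on its own output is one way to obtain trigrams, since an already-merged bigram can then pair with a neighboring word.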

In [5]:
processed_articles.trigrams([art[3] for art in processed_articles.processed_article])
processed_articles.processed_trigrams[2][4]

Creating Bigrams



Next let's split the data into train and test sets. The test data is used to compute perplexity and also serves as a held-out set for the similarity matching later. The model with the lowest perplexity is used as the final model. Topic coherence is another method for evaluating the quality of topics, but it was not pursued in this example.

Often topics are manually reviewed to determine the appropriate number of topics, but in this case I don't have enough subject matter expertise to provide any value.
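For intuition on the perplexity metric mentioned above, the sketch below computes it from per-word probabilities as the exponential of the negative mean log-likelihood per word. The probability values are hypothetical, and sklearn's `perplexity` method computes an approximation from a variational bound rather than from explicit per-word probabilities like these.

```python
import math


def perplexity(word_probabilities):
    """exp of the negative mean log-likelihood per word."""
    total_log_likelihood = sum(math.log(p) for p in word_probabilities)
    return math.exp(-total_log_likelihood / len(word_probabilities))


# Hypothetical probabilities two models assign to the same held-out words.
confident_model = [0.25, 0.25, 0.25, 0.25]
uncertain_model = [0.05, 0.05, 0.05, 0.05]

print(perplexity(confident_model))
print(perplexity(uncertain_model))
```

The model that assigns higher probability to the held-out words gets the lower perplexity (roughly 4 versus 20 here), which is why the lowest value is preferred when comparing topic counts.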

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

train, test = train_test_splitter(processed_articles.processed_trigrams, 0.90)

vectorizer = CountVectorizer(min_df = 50, max_df = 0.8, max_features = 50000)
tf = vectorizer.fit_transform([t[4] for t in train]) ## Vectorize training set
tf_feature_names = vectorizer.get_feature_names_out() ## Pull out words for use in eval

# Transform test data for perplexity eval

tf_test = vectorizer.transform([t[4] for t in test])

# Modeling - LDA

Next let's build the topic model. We'll use Latent Dirichlet Allocation (LDA), specifically the implementation from sklearn. The key concept of LDA is that there are latent topics that exist within a document corpus based on the words used in each document. LDA is an unsupervised probabilistic model in which documents are probability distributions over latent topics, and topics are probability distributions over words. In simple terms, LDA aims to find collections of words that are disproportionately represented within documents relative to the total document corpus, and uses them to identify latent topics.

LDA is not a supervised model: a practitioner provides processed text and a number of topics, and LDA creates the specified number of topics based on the observed probability distributions of words within documents. LDA is a bag-of-words model evaluated within each document, so the order of words is immaterial.
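The bag-of-words point is easy to check directly. In this small sketch (the two token lists are made up for illustration), documents containing the same words in a different order produce identical count representations, which is all the model sees.

```python
from collections import Counter

doc_a = ["virus", "infect", "cell", "virus"]
doc_b = ["cell", "virus", "virus", "infect"]  # same words, different order

bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a)
print(bag_a == bag_b)
```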

One common way to fit LDA is collapsed Gibbs sampling (the sklearn implementation used here relies on variational Bayes instead, but the intuition carries over). For each word in each document, the sampler assumes that all topic assignments are correct except for that word's. It then calculates two proportions for each topic: the proportion of words in the current document that are assigned to the topic, and the proportion of assignments to the topic, over all documents, that come from this word. Multiplying the two proportions gives a weight for each topic, and these weights are used to update the probability that the word is assigned to each topic.
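The two proportions described above can be written out for a single word. This is a sketch of the intuition only, not of sklearn's variational algorithm: the counts are hypothetical, `topic_weights_for_word` is an illustrative name, and the Dirichlet smoothing terms that a real sampler adds to every count are left out for readability.

```python
def topic_weights_for_word(word, doc_topic_counts, topic_word_counts):
    """Score each topic for one word: P(topic | document) * P(word | topic)."""
    doc_total = sum(doc_topic_counts)
    weights = []
    for topic, count_in_doc in enumerate(doc_topic_counts):
        p_topic_given_doc = count_in_doc / doc_total
        topic_total = sum(topic_word_counts[topic].values())
        p_word_given_topic = topic_word_counts[topic].get(word, 0) / topic_total
        weights.append(p_topic_given_doc * p_word_given_topic)
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


# Hypothetical current assignments, excluding the word being re-assigned.
doc_topic_counts = [3, 1]  # words in this document assigned to topic 0 and topic 1
topic_word_counts = [
    {"virus": 8, "cell": 2},     # topic 0, counted over all documents
    {"virus": 1, "patient": 9},  # topic 1, counted over all documents
]

print(topic_weights_for_word("virus", doc_topic_counts, topic_word_counts))
```

Here topic 0 dominates both because the document already leans toward it and because "virus" makes up a large share of that topic's assignments. Without smoothing, a word that no topic has seen would give a zero normalizer, which is one reason real implementations add the prior terms.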

LDA trades off two competing goals to find the appropriate distributions. First, within each document it wants to allocate words to as few topics as possible. Second, within each topic it wants to assign high probability to as few words as possible.

The end result is that each document receives a distribution of scores across the topics that sums to 1. The higher the score for a particular topic, the more representative that topic is of the document. We can also evaluate each topic to see which words are most prevalent in absolute terms, and which are most prevalent relative to their usage in the overall corpus.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
perps = []
models = []

# Below was used for testing to find optimal number of topics based on test data perplexity. 
# Commented out for brevity's sake in the notebook - using number of topics with lowest perplexity from testing

'''

for num_top in range(20,40,1):
    
    #'online': Online variational Bayes method. In each EM update, use
    #mini-batch of training data to update the ``components_``
    #variable incrementally.
    # Controlled by learning offset and decay functions
    # Offset & decay good candidates for optimization
    
    lda = LatentDirichletAllocation(n_components=num_top,
                                    learning_method = 'online',
                                    verbose = 1,
                                    learning_offset = 15., # Downweights early iterations
                                    learning_decay = 0.75, # default = 0.7
                                    random_state = 100
                                   )
    print(f"{num_top} topics")
    print('fitting')
    ldamod = lda.fit(tf)
    perp = ldamod.perplexity(tf_test)
    print(f'Perplexity for {num_top}')
    print(perp)
    perps.append(perp)
    models.append(ldamod)

'''

lda = LatentDirichletAllocation(n_components=32,
                                learning_method = 'online',
                                verbose = 1,
                                learning_offset = 15., # Downweights early iterations
                                learning_decay = 0.75, # default = 0.7
                                random_state = 100
                               )

ldamod = lda.fit(tf)

In [None]:
evalinfo = LDA_Evaluator(lda_model = ldamod, vectorizer = vectorizer)

Next let's look at the word frequency within a topic to get a better understanding of topic composition. The first table shows the top 20 words by frequency within topic 0. The second shows the top 20 words by frequency relative to all other topics, again for topic 0. Both are based on the components_ attribute provided by the LDA implementation, which is described as a "pseudocount that represents the number of times word j was assigned to topic i."
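The difference between the two rankings can be illustrated on a toy pseudocount matrix. Everything here is an assumption made for the example: the matrix values, the word list, and in particular the definition of "relative" frequency as a word's count in one topic divided by its total across topics, which may differ from what `LDA_Evaluator` actually computes.

```python
# Hypothetical pseudocount matrix: rows are topics, columns are words.
words = ["virus", "cell", "patient", "protein"]
components = [
    [50.0, 30.0, 5.0, 15.0],  # topic 0
    [40.0, 2.0, 60.0, 0.5],   # topic 1
]


def top_words_raw(topic, n):
    """Words with the largest pseudocount within one topic."""
    ranked = sorted(zip(words, components[topic]), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def top_words_relative(topic, n):
    """Words whose pseudocount in this topic is largest as a share of all topics."""
    column_totals = [sum(column) for column in zip(*components)]
    shares = [count / total for count, total in zip(components[topic], column_totals)]
    ranked = sorted(zip(words, shares), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


print(top_words_raw(0, 2))
print(top_words_relative(0, 2))
```

In this toy case "virus" tops the raw ranking for topic 0 but drops out of the relative ranking because it is also common in topic 1, while "protein" rises because it appears almost exclusively in topic 0.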

In [None]:
# Evaluate words that are highest per topic
# Use topic 0 as an example

evalinfo.eval_raw_frequency(0, 20)

In [None]:
## Evaluate the words that show up the most relative to other topics for each topic

evalinfo.eval_rel_frequency(0, 20)

In [None]:
train_wc = wcEval(train, vectorizer, ldamod)
train_wc.raw_freq_wc()
train_wc.rel_freq_wc()

test_wc = wcEval(test, vectorizer, ldamod)

train_topic_fin_raw = [[t[0], t[2], primary_top] for t, primary_top in zip(train, train_wc.raw_primary_topic)]
train_topic_fin_rel = [[t[0], t[2], rel_primary_top] for t, rel_primary_top in zip(train, train_wc.rel_primary_topic)]

It can be challenging to interpret topics this way, especially when the number of topics is large. Another approach is to create a word cloud that displays the most frequently used words in a corpus visually. We'll create a word cloud for each topic by assigning each document to the topic on which it scores highest relative to the other topics. To determine the relative score, we divide a document's score for topic n by the average score for topic n across all documents, and select the topic with the highest resulting value. Certain topics have higher scores in general because of the probabilistic assignment, so this approach aims to produce a more even distribution of documents across topics.

In addition to generating the word cloud we also show the highest relative frequency words within that topic, as the word cloud itself will disproportionately weight higher frequency terms when generated. The relative frequency words underneath aim to provide more clarity into the topic.

Looking through the word clouds, some themes appear to emerge: some topics focus on data and clinical observations, while others focus on specific genomic sequences, among other themes. As a layperson I cannot make much sense of some of the differences, but the hope is that an expert could, and that this topic distribution helps them quickly interpret what a new document contains based simply on its topic assignment.
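The relative-score assignment used for the word clouds can be sketched as follows. This assumes the average for each topic is taken over documents, and the score matrix is made up; the `wcEval` helper may implement the details differently.

```python
# Hypothetical topic scores: one row per document, one column per topic.
topic_scores = [
    [0.70, 0.30],
    [0.60, 0.40],
    [0.90, 0.10],
]

n_docs = len(topic_scores)
n_topics = len(topic_scores[0])

# Average score of each topic across all documents.
topic_means = [sum(row[t] for row in topic_scores) / n_docs for t in range(n_topics)]


def raw_primary_topic(row):
    """Topic with the highest score for this document."""
    return max(range(n_topics), key=lambda t: row[t])


def relative_primary_topic(row):
    """Topic with the highest score relative to that topic's average."""
    return max(range(n_topics), key=lambda t: row[t] / topic_means[t])


print([raw_primary_topic(row) for row in topic_scores])
print([relative_primary_topic(row) for row in topic_scores])
```

With raw scores every document lands in topic 0 because that topic scores highly everywhere. Dividing by the topic average moves the first two documents to topic 1, where they score well above what is typical for that topic.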

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator

for t in np.arange(32):
    try:
        word_clouds(train_topic_fin_rel, t, 200, stop_words, evalinfo)
    except Exception as exc:  # keep going if one topic fails, but report why
        print(f'failure for topic {t}: {exc}')

# Similarity Matching

Finally let's create a mechanism to take a new document and find the 10 most similar documents in our training database. We'll use the FAISS package to create a fast indexing system. Faiss is a library for efficient similarity search and clustering of dense vectors that came out of Facebook Research: https://github.com/facebookresearch/faiss

We'll use the topic scores generated by the topic model as the dense vectors to pass into the faiss index. Then, after scoring a new document to get its topic scores, we'll look up the most similar documents available based on cosine similarity. For brevity's sake we'll just return the topic scores of the new document, its title, and the titles of the most similar articles. A layperson's quick review suggests that the titles returned often match up closely with the new document provided. The similarity score shown is the cosine similarity.
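An inner-product index can return cosine similarity because, once vectors are scaled to unit length, their inner product equals their cosine similarity. The sketch below checks this in plain Python; the two topic-score vectors are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


# Hypothetical topic-score vectors for a new article and a training article.
new_article = [0.7, 0.2, 0.1]
train_article = [0.6, 0.3, 0.1]

direct = cosine_similarity(new_article, train_article)
via_normalization = inner_product(l2_normalize(new_article), l2_normalize(train_article))

print(direct, via_normalization)
```

The equivalence only holds when both sides are normalized, so the query vectors need the same normalization as the vectors stored in the index.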

An extension of this work could be to do the same at the sentence or paragraph level, to find specific areas of similarity rather than comparing full documents. This could provide value for researchers looking for very specific topics rather than general similarities. A topic-model-based approach could still suffice if paragraphs, rather than sentences, were treated as documents, but topic models are typically not as useful on shorter texts. Instead, using an approach like BERT to create sentence or paragraph embeddings would be a logical place to start.

This is outside the scope of the assignment, but is shown purely for optional review and learning.

In [None]:
# FAISS - Create index based on training topic scores

ncentroids = 100  # number of clusters the index partitions the vectors into
matrix_conv = np.ascontiguousarray(train_wc.topic_scores.astype('float32'))
dim_index = matrix_conv.shape[1]  # one dimension per topic
faiss.normalize_L2(matrix_conv)  # unit-length rows, so inner product equals cosine similarity
quantizer = faiss.IndexFlatL2(dim_index)  # coarse quantizer that assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dim_index, ncentroids, faiss.METRIC_INNER_PRODUCT)
assert not index.is_trained
index.train(matrix_conv)  # learn the cluster centroids
assert index.is_trained

index.add(matrix_conv)  # store the normalized training vectors

In [None]:
# Create test data matrix, randomly sample from test set, and return the 10 most similar candidates that have document titles

test_matrix = np.ascontiguousarray(test_wc.topic_scores.astype('float32'))
faiss.normalize_L2(test_matrix)  # normalize queries too, so scores are cosine similarities

testD, testI = index.search(test_matrix, 50)
rand = random.randint(0, (len(test) - 1))

print(f"Topic Scores for new article: \n {test_wc.topic_scores.iloc[rand]}")
print(f"Title of new article: \n{test[rand][1]}\n")
print("Most Similar Articles:\n")
valid = 1

# Pair each neighbor with its own score so the printed similarity matches the title
for score, val in zip(testD[rand], testI[rand]):
    if valid > 10:
        break
    if test[rand][1] == '':
        break
    if val < 0:  # FAISS pads results with -1 when fewer than 50 neighbors are found
        continue
    if train[val][1] == test[rand][1]:
        continue
    elif train[val][1] == '':
        continue
    else:
        print(f"{valid}) Similarity Score = {np.around(score, 3)}\nTitle: {train[val][1]} \nPublication: {train[val][0]}\n")
        valid = valid + 1