# **CTM, NMF + Coherence Scores (LDA, CTM, NMF) on COVID Data**




# Installing Contextualized Topic Models

First, we install the contextualized topic models library.

In [None]:
%%capture
!pip install contextualized-topic-models==2.2.0

# Data

We are going to need some data. You should upload a file with one document per line. We assume you haven't run any preprocessing script.

However, if you want to first test the model without uploading your data, you can simply use the test file I'm putting here
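As a minimal sketch of the expected input format, the helper below reads a file with one document per line and skips empty lines. The function name `load_documents` and the file name `documents.txt` are hypothetical and only used for illustration.

```python
import tempfile
from pathlib import Path


def load_documents(path):
    """Read one document per line, dropping lines that are empty after stripping."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration on a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "documents.txt"
    sample_path.write_text("first review\n\nsecond review\n", encoding="utf-8")
    sample_documents = load_documents(sample_path)

print(sample_documents)
```

The resulting list of strings is the kind of input that the preprocessing step further down works on.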

# Importing what we need

In [None]:
import pandas as pd


In [None]:
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
from contextualized_topic_models.models.ctm import CombinedTM

## Preprocessing

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words (BoW). We may also want to keep only the most frequent words in the BoW, since a very large vocabulary might not help.

We don't discard the non-preprocessed texts, because we are going to use them as input for obtaining the contextualized document representations. 
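To make the idea concrete, here is a rough sketch of such a preprocessing step written with the standard library only. It is not the library's `WhiteSpacePreprocessing` implementation; the function name `simple_preprocess` and its parameters are assumptions made for illustration. It strips punctuation, removes stopwords, keeps only the most frequent words, and returns the cleaned documents together with the matching raw documents.

```python
import string
from collections import Counter


def simple_preprocess(documents, stopwords, vocabulary_size):
    """Return (bow_texts, raw_texts, vocabulary) for a list of raw documents."""
    strip_punctuation = str.maketrans("", "", string.punctuation)
    tokenized = [
        [tok for tok in doc.lower().translate(strip_punctuation).split()
         if tok not in stopwords]
        for doc in documents
    ]
    # Keep only the `vocabulary_size` most frequent words.
    counts = Counter(tok for doc in tokenized for tok in doc)
    vocabulary = {word for word, _ in counts.most_common(vocabulary_size)}

    bow_texts, raw_texts = [], []
    for raw, tokens in zip(documents, tokenized):
        kept = [tok for tok in tokens if tok in vocabulary]
        if kept:  # documents left empty are dropped from both lists
            bow_texts.append(" ".join(kept))
            raw_texts.append(raw)
    return bow_texts, raw_texts, sorted(vocabulary)


docs = ["The app works!", "App works, again.", "app the", "the the"]
bow, raw, vocab_demo = simple_preprocess(docs, stopwords={"the"}, vocabulary_size=2)
print(bow)
print(raw)
```

Keeping the raw and cleaned lists aligned matters because each bag-of-words row has to be paired with the contextual embedding of the same document.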

Let's pass our preprocessed and unpreprocessed data to our `TopicModelDataPreparation` object. This object takes care of creating the bag of words for you and of obtaining the contextualized representations of the documents. This operation allows us to create our training dataset.

Note: The code in this notebook uses the contextualized model "paraphrase-distilroberta-base-v1", which produces 768-dimensional embeddings. A multilingual model such as "distiluse-base-multilingual-cased" is only needed for cross-lingual predictions.

## Training our CTM

Finally, we can fit our new topic model. We will ask the model to find 25 topics in our collection (the `n_components` parameter of the CTM object).

# Topics

After training, it is time to look at our topics: we can use the `get_topic_lists` function to get the topics. It also accepts a parameter that allows you to select how many words you want to see for each topic.

When you look at the topics, check whether they make sense and are representative of the collection, which here consists of the COVID app messages loaded below. Notice that the topics are in English, because we trained the model on English documents.

In [None]:
data = pd.read_csv('sukanya_final_1gram.csv')

In [None]:
data = data['feat']

In [None]:
data

0               work
1                not
2            deinsed
3              again
4         vaccinated
             ...    
180528            my
180529         these
180530         south
180531      impaired
180532           who
Name: feat, Length: 180533, dtype: object

In [None]:
data = data.reset_index(drop = True)

In [None]:
# Convert every entry to a string (this also covers NaN or numeric values)
overall_data = [str(item) for item in data]

In [None]:
len(overall_data)

180533

In [None]:
stopwords = ['to','and','i','.','app','the','?','this','it',',']
documents = [i for i in overall_data if i not in stopwords]

In [None]:
nltk.download('stopwords')
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.




In [None]:
preprocessed_documents[:10]

['work',
 'vaccinated',
 'problem',
 'installed',
 'working',
 'worked',
 'wish',
 'know',
 'kept',
 'currently']

In [None]:
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)




In [None]:
tp.vocab[:10]

['ability',
 'able',
 'absolute',
 'absolutely',
 'accept',
 'acceptable',
 'access',
 'according',
 'account',
 'accuracy']

In [None]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=25, num_epochs=10)
ctm.fit(training_dataset) # run the model

Epoch: [10/10]	 Seen Samples: [789690/789690]	Train Loss: 16.50780354054083	Time: 0:00:37.793082: : 10it [06:15, 37.58s/it]


In [None]:
ctm.bow_size

2000

In [None]:
#all words in each topic
all_ctm = ctm.get_topic_lists(ctm.bow_size)
all_ctm

[['use',
  'get',
  'time',
  'easy',
  'us',
  'way',
  'think',
  'useful',
  'thing',
  'one',
  'wrong',
  'things',
  'everything',
  'put',
  'working',
  'give',
  'idea',
  'people',
  'take',
  'go',
  'tested',
  'contact',
  'results',
  'test',
  'never',
  'positive',
  'would',
  'location',
  'bad',
  'installed',
  'since',
  'covid',
  'download',
  'safe',
  'phone',
  'like',
  'google',
  'still',
  'alert',
  'someone',
  'bluetooth',
  'even',
  'ways',
  'keep',
  'work',
  'need',
  'really',
  'see',
  'day',
  'install',
  'says',
  'also',
  'help',
  'exposed',
  'tracking',
  'useless',
  'good',
  'virus',
  'exposure',
  'tracing',
  'great',
  'works',
  'notifications',
  'downloaded',
  'privacy',
  'notification',
  'android',
  'health',
  'well',
  'thank',
  'update',
  'back',
  'know',
  'want',
  'please',
  'months',
  'data',
  'everyone',
  'state',
  'information',
  'info',
  'job',
  'days',
  'battery',
  'could',
  'others',
  'code',
  

In [None]:
#20 words
top_20_ctm = ctm.get_topic_lists(20)
top_20_ctm

[['use',
  'get',
  'time',
  'easy',
  'us',
  'way',
  'think',
  'useful',
  'thing',
  'one',
  'wrong',
  'things',
  'everything',
  'put',
  'working',
  'give',
  'idea',
  'people',
  'take',
  'go'],
 ['even',
  'google',
  'completely',
  'far',
  'minutes',
  'start',
  'finally',
  'possible',
  'mask',
  'already',
  'gets',
  'county',
  'glad',
  'android',
  'numbers',
  'enter',
  'government',
  'least',
  'developer',
  'wifi'],
 ['something',
  'new',
  'since',
  'actually',
  'report',
  'well',
  'turn',
  'hope',
  'thanks',
  'much',
  'say',
  'love',
  'safe',
  'hours',
  'alert',
  'checked',
  'ago',
  'stars',
  'cases',
  'family'],
 ['told',
  'mind',
  'apple',
  'soon',
  'nj',
  'horrible',
  'utah',
  'reason',
  'option',
  'via',
  'twice',
  'screen',
  'difficult',
  'came',
  'masks',
  'matter',
  'page',
  'slow',
  'awesome',
  'hour'],
 ['notifications',
  'thank',
  'idea',
  'phone',
  'tracking',
  'download',
  'find',
  'number',
  'd

In [None]:
for i in top_20_ctm:
  print(i)

['use', 'get', 'time', 'easy', 'us', 'way', 'think', 'useful', 'thing', 'one', 'wrong', 'things', 'everything', 'put', 'working', 'give', 'idea', 'people', 'take', 'go']
['even', 'google', 'completely', 'far', 'minutes', 'start', 'finally', 'possible', 'mask', 'already', 'gets', 'county', 'glad', 'android', 'numbers', 'enter', 'government', 'least', 'developer', 'wifi']
['something', 'new', 'since', 'actually', 'report', 'well', 'turn', 'hope', 'thanks', 'much', 'say', 'love', 'safe', 'hours', 'alert', 'checked', 'ago', 'stars', 'cases', 'family']
['told', 'mind', 'apple', 'soon', 'nj', 'horrible', 'utah', 'reason', 'option', 'via', 'twice', 'screen', 'difficult', 'came', 'masks', 'matter', 'page', 'slow', 'awesome', 'hour']
['notifications', 'thank', 'idea', 'phone', 'tracking', 'download', 'find', 'number', 'downloaded', 'waste', 'department', 'installed', 'notification', 'stop', 'fix', 'notify', 'notified', 'found', 'available', 'wish']
['love', 'however', 'report', 'spread', 'recei

### **NMF**

In [None]:
import numpy as np

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
len(preprocessed_documents)

78969

In [None]:
tfidf_vectorizer = TfidfVectorizer(
    min_df=3,
    max_df=0.85,
    max_features=9114,
    #preprocessor=' '.join
)
tfidf = tfidf_vectorizer.fit_transform(preprocessed_documents)

In [None]:
from sklearn.decomposition import NMF

In [None]:
nmf = NMF(
    n_components=25,
    init='nndsvd',
).fit(tfidf)



In [None]:
H = nmf.components_

In [None]:
# scikit-learn >= 1.0 provides get_feature_names_out(); older versions use get_feature_names()
vocab = np.array(tfidf_vectorizer.get_feature_names_out())



In [None]:
#all words NMF
# scikit-learn >= 1.0 provides get_feature_names_out(); older versions use get_feature_names()
vocab = np.array(tfidf_vectorizer.get_feature_names_out())
# rank every vocabulary word by its weight in the topic, highest first
top_words = lambda t: [vocab[i] for i in np.argsort(t)[::-1]]
topics_words = ([top_words(t) for t in H])
nmf_topics = [', '.join(t) for t in topics_words]
nmf_topics





In [None]:
#top 20 NMF
num_words = 20
top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_words-1:-1]]
topics_words = ([top_words(t) for t in H])
nmf_20 = [', '.join(t) for t in topics_words]
nmf_20

['get, way, tested, idea, beta, code, battery, information, state, everyone, well, level, still, useless, designed, intended, integrated, got, data, keep',
 'covid, non, slow, infected, free, nj, specific, related, mi, intrusive, functional, invasive, working, response, locked, white, virginia, county, power, positive',
 'use, easy, see, well, data, designed, intended, integrated, sharing, downloaded, give, exposed, virus, sure, turned, political, helpful, used, trying, thing',
 'work, state, data, well, level, install, still, useless, designed, sharing, intended, county, integrated, enough, old, go, specific, year, virginia, privacy',
 'people, tested, idea, beta, battery, information, well, install, still, got, designed, intended, integrated, keep, enough, old, using, year, could, notification',
 'great, see, state, everyone, data, level, still, useless, sharing, county, enough, old, specific, year, using, days, virginia, non, working, intrusive',
 'good, still, enough, old, year, id
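The top-word extraction used above can be sketched without NumPy as follows: each row of the topic-word weight matrix is ranked by weight and the highest-weighted vocabulary entries are kept. The helper name `top_words_per_topic` and the toy weights are illustrative assumptions, not values from this notebook.

```python
def top_words_per_topic(weights, vocabulary, num_words):
    """Return the `num_words` highest-weighted vocabulary terms for each topic row."""
    topics = []
    for row in weights:
        # Indices of the row sorted by descending weight.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        topics.append([vocabulary[i] for i in ranked[:num_words]])
    return topics


# Toy topic-word matrix: 2 topics over a 3-word vocabulary.
toy_weights = [[0.1, 0.7, 0.2],
               [0.5, 0.0, 0.4]]
toy_vocab = ["app", "battery", "covid"]
toy_topics = top_words_per_topic(toy_weights, toy_vocab, num_words=2)
print(toy_topics)
```

Returning lists of words directly avoids joining the words into a string and splitting them again before computing coherence.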

In [None]:
#postprocessing for coherence score
# split the comma-joined topic strings back into lists of words
nmf_20_list = [topic.split(', ') for topic in nmf_20]
nmf_all_list = [topic.split(', ') for topic in nmf_topics]
print(nmf_20_list)
print(nmf_all_list)

[['get', 'way', 'tested', 'idea', 'beta', 'code', 'battery', 'information', 'state', 'everyone', 'well', 'level', 'still', 'useless', 'designed', 'intended', 'integrated', 'got', 'data', 'keep'], ['covid', 'non', 'slow', 'infected', 'free', 'nj', 'specific', 'related', 'mi', 'intrusive', 'functional', 'invasive', 'working', 'response', 'locked', 'white', 'virginia', 'county', 'power', 'positive'], ['use', 'easy', 'see', 'well', 'data', 'designed', 'intended', 'integrated', 'sharing', 'downloaded', 'give', 'exposed', 'virus', 'sure', 'turned', 'political', 'helpful', 'used', 'trying', 'thing'], ['work', 'state', 'data', 'well', 'level', 'install', 'still', 'useless', 'designed', 'sharing', 'intended', 'county', 'integrated', 'enough', 'old', 'go', 'specific', 'year', 'virginia', 'privacy'], ['people', 'tested', 'idea', 'beta', 'battery', 'information', 'well', 'install', 'still', 'got', 'designed', 'intended', 'integrated', 'keep', 'enough', 'old', 'using', 'year', 'could', 'notificatio

### **Coherence** **Scores**

In [None]:
data2 = pd.read_csv('sukanya_final_csv.csv')
data2 = data2['message']
data2

0       Currently cannot get the app to work. Installe...
1       I thoroughly appreciate this app. I have a Sam...
2       All the app does is share anon covid test resu...
3       The app seemed to just sit idle till today. To...
4       Thank you for trying to keep us updated. The a...
                              ...                        
7617    My wife and I both use this app. We both got C...
7618    This app does not work. I was with my friend f...
7619    Been trying to do my part and report my positi...
7620    What?s the point of having the app if it doesn...
7621    Well I believe this app was designed with good...
Name: message, Length: 7622, dtype: object

In [None]:
# whitespace-tokenize every raw message
overall = [message.split() for message in data2]

In [None]:
overall

In [None]:
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary

Coherence Scores for Top 20 words for LDA for 25 topics

In [None]:
lda_20 = pd.read_csv('top_20_lda.csv')
lda_20 = lda_20['termy']
lda_20

0     get, error, when, up, screen, update, page, en...
1     on, bluetooth, location, off, turn, gps, which...
2     people, more, be, will, if, use, great, only, ...
3     we, help, can, safe, us, keep, our, will, spre...
4     you, if, are, your, know, or, what, can, just,...
5     as, no, an, there, well, out, way, find, pleas...
6     be, would, but, like, could, symptoms, better,...
7     contact, <newline>, tracing, google, health, a...
8     about, cases, how, information, county, users,...
9     positive, have, been, with, who, tested, peopl...
10    exposure, since, last, been, now, update, days...
11    not, it's, does, do, what, work, know, i'm, or...
12    at, location, been, locations, places, minutes...
13    my, results, get, test, me, can't, am, see, ab...
14    code, positive, test, pin, get, report, verifi...
15    me, notifications, try, turn, something, again...
16    was, had, we, were, out, said, days, after, to...
17    my, phone, on, battery, had, installed, is

In [None]:
lda_20_topics = []
for i in range(len(lda_20)):
  lda_20_topics.append(lda_20[i].split(', '))

lda_20_topics

In [None]:
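topics = lda_20_topics

# each entry of preprocessed_documents is used as one token, so wrapping the list
# treats the whole preprocessed corpus as a single tokenized document
texts = [preprocessed_documents]

dictionary = Dictionary(texts)
# bag-of-words corpus built from the whitespace-tokenized raw messages (used by u_mass)
corpus = [dictionary.doc2bow(i) for i in overall]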
cm1 = CoherenceModel(topics=topics,corpus = corpus,dictionary = dictionary, coherence='u_mass')
coherence1 = cm1.get_coherence()
print("u_mass:",coherence1)
cm2 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_v')
coherence2 = cm2.get_coherence()
print('c_v:',coherence2)
cm3 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_uci')
coherence3 = cm3.get_coherence()
print('c_uci:',coherence3)
cm4 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_npmi')
coherence4 = cm4.get_coherence()
print('c_npmi:',coherence4)

u_mass: -5.5762587560959
c_v: 0.4199408850173454
c_uci: -0.19355815180898894
c_npmi: 0.031995084831385856
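To give some intuition for the `u_mass` value printed above, here is a simplified sketch of a UMass-style coherence for a single topic. It is an illustration under stated assumptions, not gensim's exact implementation: for every ordered pair of topic words it takes the log of the smoothed conditional document probability and averages the results. The function name `umass_topic_coherence` and the toy documents are hypothetical.

```python
import math


def umass_topic_coherence(topic, documents, epsilon=1e-12):
    """Average log((P(w_i, w_j) + epsilon) / P(w_j)) over word pairs with j < i."""
    doc_sets = [set(doc) for doc in documents]
    num_docs = len(doc_sets)

    def doc_prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / num_docs

    scores = []
    for i in range(1, len(topic)):
        for j in range(i):
            p_j = doc_prob(topic[j])
            if p_j == 0:
                continue  # the conditioning word never occurs; skip the pair
            joint = doc_prob(topic[i], topic[j])
            scores.append(math.log((joint + epsilon) / p_j))
    return sum(scores) / len(scores) if scores else float("nan")


toy_documents = [["app", "works"], ["app", "works", "great"], ["app", "crash"]]
toy_score = umass_topic_coherence(["app", "works"], toy_documents)
print(toy_score)
```

Because each term is the log of a conditional probability, the score is at most about zero, and values closer to zero indicate that the topic words co-occur more often.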


Coherence Scores for all top words for LDA for 25 topics

In [None]:
lda_all = pd.read_csv('all_topics_covid.csv')
lda_all = lda_all['termy']
lda_all_topics = []
for i in range(len(lda_all)):
  lda_all_topics.append(lda_all[i].split(', '))

lda_all_topics

In [None]:
topics = lda_all_topics

texts = [preprocessed_documents]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(i) for i in overall]
cm1 = CoherenceModel(topics=topics,corpus = corpus,dictionary = dictionary, coherence='u_mass')
coherence1 = cm1.get_coherence()
print("u_mass:",coherence1)
cm2 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_v')
coherence2 = cm2.get_coherence()
print('c_v:',coherence2)
cm3 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_uci')
coherence3 = cm3.get_coherence()
print('c_uci:',coherence3)
cm4 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_npmi')
coherence4 = cm4.get_coherence()
print('c_npmi:',coherence4)

u_mass: -6.518233247452165
c_v: 0.374612120932157
c_uci: -0.9500200980942881
c_npmi: -0.0009012527528714959


Coherence Scores for top 20 words for CTM for 25 topics

In [None]:
topics = top_20_ctm
texts = [preprocessed_documents]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(i) for i in overall]
cm1 = CoherenceModel(topics=topics,corpus = corpus,dictionary = dictionary, coherence='u_mass')
coherence1 = cm1.get_coherence()
print("u_mass:",coherence1)
cm2 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_v')
coherence2 = cm2.get_coherence()
print('c_v:',coherence2)
cm3 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_uci')
coherence3 = cm3.get_coherence()
print('c_uci:',coherence3)
cm4 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_npmi')
coherence4 = cm4.get_coherence()
print('c_npmi:',coherence4)

u_mass: -7.8961796586606825
c_v: 0.20486189909717775
c_uci: -2.4325133435890525
c_npmi: -0.09764744956025172


Coherence Scores for all top words for CTM for 25 topics

In [None]:
topics = all_ctm
texts = [preprocessed_documents]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(i) for i in overall]
cm1 = CoherenceModel(topics=topics,corpus = corpus,dictionary = dictionary, coherence='u_mass')
coherence1 = cm1.get_coherence()
print("u_mass:",coherence1)
cm2 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_v')
coherence2 = cm2.get_coherence()
print('c_v:',coherence2)
cm3 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_uci')
coherence3 = cm3.get_coherence()
print('c_uci:',coherence3)
cm4 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_npmi')
coherence4 = cm4.get_coherence()
print('c_npmi:',coherence4)

u_mass: -7.8961796586606825
c_v: 0.20486189909717775
c_uci: -2.4325133435890525
c_npmi: -0.09764744956025172
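The scores for the full CTM word lists are identical to the top-20 scores. A likely explanation is that gensim's `CoherenceModel` has a `topn` parameter (default 20) and only scores the first `topn` words of each topic, so passing longer lists does not change the result unless `topn` is raised. The sketch below illustrates that truncation with a hypothetical helper, `truncate_topics`, and toy word lists.

```python
def truncate_topics(topics, topn=20):
    """Keep only the first `topn` words of every topic (illustrative helper)."""
    return [topic[:topn] for topic in topics]


# Toy topic with 50 ranked words standing in for a full-vocabulary topic list.
full_topic_lists = [[f"word{i}" for i in range(50)]]
top_20_lists = [topic[:20] for topic in full_topic_lists]

# After truncation the "all words" input is the same as the top-20 input.
same_input = truncate_topics(full_topic_lists) == top_20_lists
print(same_input)
```

The same pattern appears in the NMF results below. To score more words per topic, a larger `topn` value can be passed when constructing `CoherenceModel`.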


Coherence Scores for top 20 words for NMF for 25 topics

In [None]:
topics = nmf_20_list
texts = [preprocessed_documents]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(i) for i in overall]
cm1 = CoherenceModel(topics=topics,corpus = corpus,dictionary = dictionary, coherence='u_mass')
coherence1 = cm1.get_coherence()
print("u_mass:",coherence1)
cm2 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_v')
coherence2 = cm2.get_coherence()
print('c_v:',coherence2)
cm3 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_uci')
coherence3 = cm3.get_coherence()
print('c_uci:',coherence3)
cm4 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_npmi')
coherence4 = cm4.get_coherence()
print('c_npmi:',coherence4)

u_mass: -8.543957168828163
c_v: 0.23162437985959322
c_uci: -3.434846070441137
c_npmi: -0.12887877264186487


Coherence Scores for all top words for NMF for 25 topics

In [None]:
topics = nmf_all_list
texts = [preprocessed_documents]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(i) for i in overall]
cm1 = CoherenceModel(topics=topics,corpus = corpus,dictionary = dictionary, coherence='u_mass')
coherence1 = cm1.get_coherence()
print("u_mass:",coherence1)
cm2 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_v')
coherence2 = cm2.get_coherence()
print('c_v:',coherence2)
cm3 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_uci')
coherence3 = cm3.get_coherence()
print('c_uci:',coherence3)
cm4 = CoherenceModel(topics=topics,texts = texts,dictionary = dictionary, coherence='c_npmi')
coherence4 = cm4.get_coherence()
print('c_npmi:',coherence4)

u_mass: -8.543957168828163
c_v: 0.23162437985959322
c_uci: -3.434846070441137
c_npmi: -0.12887877264186487
