# Lab6.2: Topic modeling using gensim

In this notebook, we demonstrate how LDA models can be built and applied using the *gensim* package.

Credits:

This notebook is an adaptation of a blog post by Susan Li:

https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24



The data set we’ll use is a list of over one million news headlines published over a period of 15 years. It can be downloaded from:

https://www.kaggle.com/therohk/million-headlines/data

We read the CSV file using the pandas framework.

In [1]:
import pandas as pd

#### Adapt the path below to point to your local copy of the data set
data = pd.read_csv('./data/abcnews-date-text.csv', on_bad_lines='warn')
# Copy the column so that adding 'index' does not trigger a SettingWithCopyWarning
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text

Let's have a look at the data:

In [3]:
print(len(documents))
print(documents[:5])

1244184
                                       headline_text  index
0  aba decides against community broadcasting lic...      0
1     act fire witnesses must be aware of defamation      1
2     a g calls for infrastructure protection summit      2
3           air nz staff in aust strike for pay rise      3
4      air nz strike to affect australian travellers      4


We are going to use the *gensim* package to build our LDA models from the data.
Before building the model, we are going to preprocess the texts.

## Data Pre-processing
We will perform the following steps:

* Tokenization: split each headline into words, lowercase them, and remove punctuation.
* Words with 3 or fewer characters are removed.
* All stopwords are removed.
* Words are lemmatized: each word is reduced to its dictionary form (lemma). Called without a part-of-speech tag, as in the code below, the WordNet lemmatizer treats every word as a noun, so 'wants' becomes 'want' but 'decides' stays unchanged.
* Stemming (reducing words to their root form) is a common additional step. The code below imports a stemmer but does not apply it.
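
The filtering steps above can be sketched without any external library. This is a simplified illustration and not gensim's implementation: the regular expression stands in for `simple_preprocess`, `TOY_STOPWORDS` is a tiny hypothetical stand-in for gensim's much larger `STOPWORDS` set, and lemmatization is left out. The default `min_len=4` mirrors the `len(token) > 3` test used in the notebook's code.

```python
import re

# Hypothetical mini stopword list; gensim's STOPWORDS set is much larger.
TOY_STOPWORDS = {"the", "for", "and", "with", "against", "must", "of", "to", "in", "be"}

def toy_preprocess(text, min_len=4):
    """Lowercase, split on non-letters, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

print(toy_preprocess("Air NZ staff in Aust strike for pay rise"))
# ['staff', 'aust', 'strike', 'rise']
```

Short tokens such as 'air', 'nz' and 'pay' disappear because of the length filter, not because they are stopwords.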

In order to apply these processing steps, we first load the gensim and nltk libraries.

In [4]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/axelehrnrooth/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [5]:
def lemmatize_stemming(text):
    # Despite its name, this helper only lemmatizes; no stemmer is applied.
    return lemmatizer.lemmatize(text)

def preprocess(text):
    """Tokenize a headline, then drop stopwords and tokens of 3 or fewer characters."""
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [6]:
doc_sample = documents[documents['index'] == 4310].values[0][0]
print('original document: ')
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['ratepayers', 'group', 'wants', 'compulsory', 'local', 'govt', 'voting']


 tokenized and lemmatized document: 
['ratepayer', 'group', 'want', 'compulsory', 'local', 'govt', 'voting']


We now apply the preprocessing to all the headlines and print the first 10 results.

In [7]:
processed_docs = documents['headline_text'].map(preprocess)
### print the first 10 results
processed_docs[:10]

0          [decides, community, broadcasting, licence]
1                         [witness, aware, defamation]
2           [call, infrastructure, protection, summit]
3                          [staff, aust, strike, rise]
4              [strike, affect, australian, traveller]
5               [ambitious, olsson, win, triple, jump]
6          [antic, delighted, record, breaking, barca]
7    [aussie, qualifier, stosur, waste, memphis, ma...
8             [aust, address, security, council, iraq]
9                       [australia, locked, timetable]
Name: headline_text, dtype: object

## Bag of Words on the Data set
We create a dictionary from ‘processed_docs’ that maps each unique word in the training set to an integer id and records how often each word occurs.
We use gensim's *Dictionary* class to build it from the headlines and then print the first entries.

In [8]:
dictionary = gensim.corpora.Dictionary(processed_docs)
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcasting
1 community
2 decides
3 licence
4 aware
5 defamation
6 witness
7 call
8 infrastructure
9 protection
10 summit


## Gensim filter_extremes
Filter out tokens that appear in

* fewer than 15 documents (absolute number), or
* more than 50% of the documents (a fraction of the total corpus size, not an absolute number).

After these two steps, keep only the 100000 most frequent tokens.

In [9]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
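
To make the three parameters concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is an illustration under our own assumptions (the function name `toy_filter_extremes` and the toy documents are hypothetical), not gensim's actual implementation, although it follows the same idea: count in how many documents each token occurs, drop tokens that are too rare or too common, and cap the vocabulary size.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below, no_above, keep_n):
    """Return the set of tokens that survive document-frequency filtering."""
    # Document frequency: number of documents each token occurs in.
    df = Counter(token for doc in docs for token in set(doc))
    max_df = no_above * len(docs)
    kept = [t for t, n in df.items() if no_below <= n <= max_df]
    # Keep at most keep_n tokens, most frequent first (ties broken alphabetically).
    kept.sort(key=lambda t: (-df[t], t))
    return set(kept[:keep_n])

docs = [
    ["police", "court", "murder"],
    ["police", "election"],
    ["police", "court"],
    ["police", "water", "farmer"],
]
print(sorted(toy_filter_extremes(docs, no_below=2, no_above=0.5, keep_n=10)))
# ['court']
```

Here 'police' is removed because it occurs in all four documents (more than 50%), and the words that occur only once fall below `no_below=2`.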

## Gensim doc2bow
For each document we create a list of (token id, count) pairs reporting which dictionary words occur in it and how many times.
Gensim provides the *doc2bow* function to create this BoW vector representation for a document.
We save the result to ‘bow_corpus’ and then check the document we selected earlier.

In [10]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(164, 1), (241, 1), (615, 1), (891, 1), (4173, 1), (4174, 1), (4175, 1)]

Preview Bag Of Words for our sample preprocessed document.

In [11]:
bow_doc_4310 = bow_corpus[4310]
for token_id, count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 164 ("govt") appears 1 time.
Word 241 ("group") appears 1 time.
Word 615 ("local") appears 1 time.
Word 891 ("want") appears 1 time.
Word 4173 ("compulsory") appears 1 time.
Word 4174 ("ratepayer") appears 1 time.
Word 4175 ("voting") appears 1 time.
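
The BoW representation itself is simple enough to sketch in a few lines. The function below is a hypothetical stand-in for *doc2bow*, written only to show the idea: count the tokens, translate them to ids, and silently skip tokens that are not in the vocabulary. The three-word `token2id` mapping reuses ids from the output above but is otherwise made up.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens not in the vocabulary are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical three-word vocabulary using ids from the output above.
token2id = {"govt": 164, "group": 241, "local": 615}
print(toy_doc2bow(["local", "govt", "local", "mayor"], token2id))
# [(164, 1), (615, 2)]
```

Note that 'mayor' leaves no trace in the result. The same happens to words in new documents that were filtered out of, or never added to, the dictionary.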


## TF-IDF
We create a tf-idf model object by applying models.TfidfModel to ‘bow_corpus’ and save it to ‘tfidf’. We then apply the transformation to the entire corpus and call the result ‘corpus_tfidf’. Finally, we preview the TF-IDF scores for our first document.


In [12]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.6161125947380649),
 (1, 0.3308772069039591),
 (2, 0.5681053683635203),
 (3, 0.43379930266554434)]
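
The weighting behind these scores can be sketched in plain Python. The version below assumes the scheme that we understand to be gensim's default, namely term count times log2(N / document frequency), followed by scaling each document vector to unit length. Treat it as an illustration of that assumption rather than a copy of the library code. The unit-length step is at least consistent with the output above, where the squares of the four scores sum to about 1.

```python
import math
from collections import Counter

def toy_tfidf(bow_corpus):
    """Weight each (id, count) pair by count * log2(N / df), then L2-normalize per document."""
    n_docs = len(bow_corpus)
    # Document frequency of each token id.
    df = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    result = []
    for doc in bow_corpus:
        weights = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Terms with zero weight (present in every document) are dropped.
        result.append([(tid, w / norm) for tid, w in weights if w > 0])
    return result

# Toy corpus: token 0 occurs in every document, so it carries no information.
corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 2)], [(0, 1), (1, 1), (2, 1)]]
for doc in toy_tfidf(corpus):
    print(doc)
```

In this toy corpus, token 0 vanishes from every document, which is the effect to keep in mind when comparing topics built from BoW counts with topics built from TF-IDF weights.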


## Running LDA using Bag of Words
Train our lda model using gensim.models.LdaMulticore and save it to ‘lda_model’. This takes a while.
Look at the documentation of *gensim* for further details:

https://radimrehurek.com/gensim/models/ldamulticore.html

As parameters, we pass the corpus as BoW vectors (a list of lists of tuples), the predefined number of topics, the dictionary that maps ids back to words, the number of passes over the corpus, and the number of worker processes used for training.

In [13]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

For each topic, we will explore the words occurring in that topic and their relative weights.

In [14]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.055*"covid" + 0.038*"victoria" + 0.037*"coronavirus" + 0.034*"case" + 0.023*"child" + 0.016*"scott" + 0.015*"island" + 0.012*"border" + 0.010*"deal" + 0.010*"beach"
Topic: 1 
Words: 0.024*"restriction" + 0.024*"canberra" + 0.023*"life" + 0.019*"water" + 0.016*"police" + 0.015*"missing" + 0.015*"country" + 0.015*"concern" + 0.014*"claim" + 0.013*"farmer"
Topic: 2 
Words: 0.040*"sydney" + 0.030*"election" + 0.017*"lockdown" + 0.012*"andrew" + 0.011*"state" + 0.011*"president" + 0.011*"commission" + 0.010*"say" + 0.010*"biden" + 0.009*"australia"
Topic: 3 
Words: 0.043*"queensland" + 0.026*"south" + 0.017*"north" + 0.016*"victorian" + 0.015*"australia" + 0.015*"indigenous" + 0.015*"morrison" + 0.013*"west" + 0.013*"student" + 0.013*"school"
Topic: 4 
Words: 0.036*"police" + 0.030*"woman" + 0.027*"court" + 0.024*"death" + 0.023*"donald" + 0.019*"murder" + 0.018*"people" + 0.016*"year" + 0.015*"charged" + 0.015*"face"
Topic: 5 
Words: 0.030*"government" + 0.020*"health" +

Can you distinguish different topics using the words in each topic and their corresponding weights?

## Running LDA using TF-IDF

In [15]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.011*"climate" + 0.009*"hill" + 0.009*"david" + 0.007*"grand" + 0.006*"america" + 0.006*"change" + 0.005*"australia" + 0.005*"murray" + 0.005*"capital" + 0.005*"sunday"
Topic: 1 Word: 0.017*"coronavirus" + 0.015*"covid" + 0.012*"country" + 0.009*"market" + 0.008*"lockdown" + 0.008*"hour" + 0.007*"price" + 0.006*"australian" + 0.006*"business" + 0.006*"farm"
Topic: 2 Word: 0.011*"australia" + 0.010*"scott" + 0.009*"world" + 0.008*"league" + 0.007*"update" + 0.006*"south" + 0.006*"korea" + 0.006*"coronavirus" + 0.005*"sentenced" + 0.005*"australian"
Topic: 3 Word: 0.013*"government" + 0.010*"health" + 0.008*"federal" + 0.007*"budget" + 0.006*"election" + 0.006*"funding" + 0.006*"say" + 0.005*"mental" + 0.005*"labor" + 0.005*"school"
Topic: 4 Word: 0.011*"speaks" + 0.007*"vaccine" + 0.007*"christmas" + 0.007*"history" + 0.006*"august" + 0.006*"october" + 0.006*"coronavirus" + 0.006*"george" + 0.005*"know" + 0.005*"quarantine"
Topic: 5 Word: 0.028*"trump" + 0.013*"live" + 0

Again, can you distinguish different topics using the words in each topic and their corresponding weights? Do you observe any differences with the BoW version? Do these differences make sense given that the *tfidf* method weights words by their information value?

## Performance evaluation by classifying sample document using LDA Bag of Words model
We will check where our test document would be classified.

In [16]:
processed_docs[4310]

['ratepayer', 'group', 'want', 'compulsory', 'local', 'govt', 'voting']

Document 4310 is already available as a BoW vector in ‘bow_corpus’. We can pass it directly to our *lda_model* to get the score of each topic for this document. We represent each topic by printing its 10 highest-weighted words.

In [17]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.7306936979293823	 
Topic: 0.030*"government" + 0.020*"health" + 0.015*"tasmania" + 0.014*"plan" + 0.013*"federal" + 0.011*"say" + 0.011*"care" + 0.010*"council" + 0.010*"call" + 0.010*"regional"

Score: 0.1692260056734085	 
Topic: 0.035*"coronavirus" + 0.019*"news" + 0.018*"china" + 0.017*"covid" + 0.017*"australia" + 0.017*"record" + 0.016*"market" + 0.015*"australian" + 0.013*"live" + 0.011*"coast"

Score: 0.012510393746197224	 
Topic: 0.022*"national" + 0.017*"change" + 0.015*"premier" + 0.015*"return" + 0.014*"tasmanian" + 0.013*"work" + 0.012*"rural" + 0.011*"show" + 0.011*"say" + 0.011*"risk"

Score: 0.012510163709521294	 
Topic: 0.043*"queensland" + 0.026*"south" + 0.017*"north" + 0.016*"victorian" + 0.015*"australia" + 0.015*"indigenous" + 0.015*"morrison" + 0.013*"west" + 0.013*"student" + 0.013*"school"

Score: 0.012510121800005436	 
Topic: 0.040*"trump" + 0.020*"vaccine" + 0.017*"australia" + 0.015*"test" + 0.014*"open" + 0.013*"world" + 0.013*"final" + 0.012*"roya

Our test document gets its highest score (about 0.73) for the topic with words such as 'government', 'council' and 'regional'. For a headline about compulsory local government voting, this is a plausible classification.

### Analyzing our LDA model

Now that we have a trained model let’s visualize the topics for interpretability. 
To do so, we’ll use a popular visualization package, *pyLDAvis* which is designed to help interactively with:

1. Better understanding and interpreting individual topics, and
2. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its top most frequent and/or “relevant” terms, using different values of the λ parameter. This can help when you’re trying to assign a human interpretable name or “meaning” to each topic.
For (2), exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.
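
The λ parameter mentioned under (1) controls the "relevance" ranking proposed by the authors of LDAvis: relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). With λ = 1, terms are ranked purely by their probability within the topic. Lower values favour terms that are unusually frequent in the topic compared to the corpus as a whole. The sketch below shows the effect on two terms; the probabilities are hypothetical numbers chosen for illustration, not values taken from our model.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a term for a topic: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical probabilities: (p(w|t), p(w)) for two terms in one topic.
terms = {"police": (0.036, 0.030), "murder": (0.019, 0.004)}

for lam in (1.0, 0.3):
    ranked = sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)
    print(lam, ranked)
# 1.0 ['police', 'murder']
# 0.3 ['murder', 'police']
```

'police' is more probable within the topic, but it is common everywhere. 'murder' is rarer overall and therefore more specific to the topic, so it moves to the top once λ is lowered.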

You need to install *pyldavis* through the command line, following the instructions:

https://anaconda.org/conda-forge/pyldavis

WARNING: running the next cell takes a long time and you need some memory to run it. However, the result is spectacular.

In [19]:
%matplotlib inline
import pyLDAvis
# Note: in recent pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(topic_model=lda_model, corpus=bow_corpus, dictionary=dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)



## Some other useful functions

In [20]:
#get the top 20 words and their weights for a specific topic
topic_id=1
top_terms=20
for wordid, score in lda_model.get_topic_terms(topic_id, top_terms):
    print(wordid, ":", dictionary[wordid], ":", score)

363 : restriction : 0.023614423
1287 : canberra : 0.02361012
1465 : life : 0.022850947
55 : water : 0.018874085
237 : police : 0.01588662
313 : missing : 0.015460977
426 : country : 0.0153653445
368 : concern : 0.014701052
372 : claim : 0.014269243
455 : farmer : 0.013329839
2214 : officer : 0.012007995
3361 : john : 0.011676877
1737 : party : 0.011647861
541 : search : 0.011593503
293 : river : 0.010622957
342 : western : 0.0103224795
4403 : amid : 0.010024075
396 : campaign : 0.009943682
888 : crisis : 0.009788474
277 : investigation : 0.009403193


In [21]:
#### Utility function to get the id for a word

def get_id_for_word(dictionary, word):
    # token2id maps each word to its id; return -1 for unknown words
    return dictionary.token2id.get(word, -1)

In [22]:
top_terms=20
index=get_id_for_word(dictionary,'market')
# get_term_topics returns (topic_id, probability) pairs for the given word id
for topic_id, topic_score in lda_model.get_term_topics(index):
    print("Topic:", topic_id)
    for wordid, score in lda_model.get_topic_terms(topic_id, top_terms):
        print(wordid, ":", dictionary[wordid], ":", score)


Topic: 7
18888 : coronavirus : 0.035308406
1363 : news : 0.019467585
1300 : china : 0.017615708
18889 : covid : 0.017266054
37 : australia : 0.017129906
26 : record : 0.016907472
1188 : market : 0.015830228
16 : australian : 0.015199825
367 : live : 0.0132650295
224 : coast : 0.011249922
225 : gold : 0.010654345
12 : rise : 0.010280104
41 : million : 0.009715714
852 : price : 0.009497092
3204 : street : 0.009474546
5235 : quarantine : 0.008613039
567 : industry : 0.008337505
110 : rate : 0.0076725883
693 : high : 0.0074844514
2262 : wall : 0.0071285265


## Saving and loading your model for re-use

Building a model takes time. Once you have a stable model, you can save it to disk and reload it later.

In [25]:
# Save model to disk.
temp_file = "./model"
lda_model.save(temp_file)

# Load a previously saved model from disk (load is a class method).
loaded_lda = gensim.models.LdaMulticore.load(temp_file)

## Testing model on unseen document

In [26]:
unseen_document = 'How a Pentagon deal became an identity crisis for Google'

In order to compare any new text against the topic model, we first need to process it in the same way as we processed the input texts for the model.
We apply the same preprocessing function and next apply the *doc2bow* function to represent it using the same vector representation as we used for modeling.

In [27]:
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
print(bow_vector)


[(888, 1), (1088, 1), (1921, 1), (5666, 1), (12406, 1)]


We can now pass this representation of the unseen document into the model to compare it against all the topics.
Indexing the model with this vector returns the topic ids together with a score for the new document. We print the scores and the topics with their top 5 words.

In [28]:
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic_id {}\t Topic: {}".format(score, index, lda_model.print_topic(index, 5)))

Score: 0.35016191005706787	 Topic_id 0	 Topic: 0.055*"covid" + 0.038*"victoria" + 0.037*"coronavirus" + 0.034*"case" + 0.023*"child"
Score: 0.18343740701675415	 Topic_id 1	 Topic: 0.024*"restriction" + 0.024*"canberra" + 0.023*"life" + 0.019*"water" + 0.016*"police"
Score: 0.18336418271064758	 Topic_id 7	 Topic: 0.035*"coronavirus" + 0.019*"news" + 0.018*"china" + 0.017*"covid" + 0.017*"australia"
Score: 0.18297043442726135	 Topic_id 2	 Topic: 0.040*"sydney" + 0.030*"election" + 0.017*"lockdown" + 0.012*"andrew" + 0.011*"state"
Score: 0.016677673906087875	 Topic_id 3	 Topic: 0.043*"queensland" + 0.026*"south" + 0.017*"north" + 0.016*"victorian" + 0.015*"australia"
Score: 0.016677673906087875	 Topic_id 4	 Topic: 0.036*"police" + 0.030*"woman" + 0.027*"court" + 0.024*"death" + 0.023*"donald"
Score: 0.016677673906087875	 Topic_id 5	 Topic: 0.030*"government" + 0.020*"health" + 0.015*"tasmania" + 0.014*"plan" + 0.013*"federal"
Score: 0.016677673906087875	 Topic_id 6	 Topic: 0.021*"crash" +

This text matches best with topic 0 (score of about 0.35), although the score is not very high!
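
The repeated score of about 0.0167 at the bottom of the output deserves a remark: it does not mean that those topics are weakly present in the text. A plausible explanation, assuming the model uses a symmetric document-topic prior of 1/num_topics (which we understand to be gensim's default), is that a topic to which no word of the document is assigned still receives the prior mass alpha, normalized by the total prior mass plus the document length. The helper below is our own sketch of that arithmetic, not a gensim function.

```python
def topic_floor(num_topics, doc_length, alpha=None):
    """Approximate score of a topic to which none of the document's words are assigned."""
    if alpha is None:
        alpha = 1.0 / num_topics  # assumed symmetric prior
    return alpha / (num_topics * alpha + doc_length)

print(round(topic_floor(10, 5), 6))  # unseen document: 5 tokens in its BoW vector
# 0.016667
print(round(topic_floor(10, 7), 6))  # document 4310: 7 tokens
# 0.0125
```

Both values are close to the smallest scores printed for the unseen document (about 0.01668) and for document 4310 (about 0.01251). Scores at this floor are best read as "no evidence for this topic".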

### Updating the model with a new document

We can also use the unseen documents to extend our model and update the topics. This is useful when processing texts in a stream.

In [29]:
# New, already tokenized documents; doc2bow ignores words that are not in the dictionary.

other_texts = [['computer', 'time', 'graph'],['survey', 'response', 'eps'],['human', 'system', 'computer']]
other_corpus = [dictionary.doc2bow(text) for text in other_texts]

# Update the model by incrementally training on the new corpus.
lda_model.update(other_corpus)  # update the LDA model with additional documents





## End of this notebook