# Topic Modeling
Topic modeling is a text mining technique for identifying co-occurring keywords in order to summarize large collections of text.

## Topic Modeling with LDA

### Dirichlet Process and Dirichlet Distribution

<p>A Dirichlet process is a distribution over distributions. It is written DP(α, G), where G is the base distribution and α is the concentration parameter that controls how closely draws from DP(α, G) follow the base distribution G. This flexibility makes the Dirichlet process a versatile way to represent a wide range of probability distributions. It is used in the hierarchical Dirichlet process (HDP) topic-modeling algorithm.</p>


<p>The Dirichlet distribution can be viewed as the finite-dimensional special case of the Dirichlet process, so the number of topics must be specified explicitly. It is used in the LDA topic-modeling algorithm.</p>
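To make the role of the concentration parameter more concrete, here is a minimal sketch that samples from a symmetric Dirichlet distribution using only the standard library. The helper `sample_dirichlet`, the value of `num_topics`, and the three α values are assumptions chosen for illustration; they do not come from the exercise.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
num_topics = 5  # hypothetical number of topics, for illustration only

# Small alpha tends to give sparse mixtures; large alpha tends toward uniform ones.
for alpha in (0.1, 1.0, 10.0):
    sample = sample_dirichlet([alpha] * num_topics, rng)
    print(alpha, [round(p, 3) for p in sample])
```

Each sample is a probability vector over the topics, which is why the number of topics has to be fixed before sampling.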


### Latent Dirichlet Allocation (LDA)

<p>Instead of using matrix factorization, as we did for LSA, we can use a generative model called LDA. LDA is considered an advancement over probabilistic LSA, which is prone to overfitting because it does not probabilistically model the distribution of the documents. LDA is a three-level hierarchical generative statistical model that maps documents to topics, which in turn are mapped to words, all in a probabilistic way. It has two concentration parameters: one at the document level, governing each document's distribution over topics, and one at the topic level, governing each topic's distribution over words.</p>
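The generative story described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the inference procedure gensim runs: the vocabulary, the number of topics, and the concentration values `alpha` and `eta` are all hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, topic_word, alpha, doc_length, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    num_topics = len(topic_word)
    theta = sample_dirichlet([alpha] * num_topics, rng)  # document-level topic mixture
    words = []
    for _ in range(doc_length):
        z = rng.choices(range(num_topics), weights=theta)[0]  # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]      # pick a word from that topic
        words.append(w)
    return theta, words

rng = random.Random(42)
vocab = ["court", "polic", "elect", "vote", "rain", "storm"]  # toy vocabulary
num_topics = 3
eta = 0.5  # topic-level concentration (assumed value)

# One word distribution per topic, each drawn from a Dirichlet prior.
topic_word = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]

theta, doc = generate_document(vocab, topic_word, alpha=0.3, doc_length=8, rng=rng)
print([round(p, 3) for p in theta])
print(doc)
```

Fitting LDA reverses this process: given only the observed words, it estimates the topic mixtures and the per-topic word distributions.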


<p>In this exercise, we will use the gensim LDA model to analyze the news headlines in <code>abcnews-date-text.csv</code>. For simplicity, we will assume that the corpus has ten topics.</p>


Note<br>
The dataset used for this exercise can be found at https://packt.live/2PbvMds.

In [1]:
import pandas as pd

data = pd.read_csv('abcnews-date-text.csv')

In [9]:
# Copy the column so that adding 'index' does not write to a view of `data`
data_text = data[['headline_text']].copy()
data_text['index'] = data_text.index
documents = data_text

In [10]:
documents

Unnamed: 0,headline_text,index
0,aba decides against community broadcasting lic...,0
1,act fire witnesses must be aware of defamation,1
2,a g calls for infrastructure protection summit,2
3,air nz staff in aust strike for pay rise,3
4,air nz strike to affect australian travellers,4
...,...,...
1244179,two aged care residents die as state records 2...,1244179
1244180,victoria records 5;919 new cases and seven deaths,1244180
1244181,wa delays adopting new close contact definition,1244181
1244182,western ringtail possums found badly dehydrate...,1244182


In [11]:
len(documents)

1244184

In [12]:
documents[:5]

Unnamed: 0,headline_text,index
0,aba decides against community broadcasting lic...,0
1,act fire witnesses must be aware of defamation,1
2,a g calls for infrastructure protection summit,2
3,air nz staff in aust strike for pay rise,3
4,air nz strike to affect australian travellers,4


## Data Preprocessing

In [14]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

np.random.seed(2018)  # fix NumPy's global seed for reproducibility

In [15]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\visha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Lemmatization example

In [16]:
print(WordNetLemmatizer().lemmatize('went', pos='v'))

go


### Stemmer example

In [19]:
stemmer = SnowballStemmer('english')
original_words = ['caresses', 'files', 'dies', 'mules', 'denied', 'died', 'agreed',
                  'owned', 'humbled', 'sized', 'meeting', 'stating', 'sizeing', 'itemization',
                  'sensational', 'traditional', 'reference', 'colonizer', 'plotted'
                 ]
singles = [stemmer.stem(word) for word in original_words]
pd.DataFrame(data={'original word': original_words, 'stemmed': singles})

Unnamed: 0,original word,stemmed
0,caresses,caress
1,files,file
2,dies,die
3,mules,mule
4,denied,deni
5,died,die
6,agreed,agre
7,owned,own
8,humbled,humbl
9,sized,size


### Write functions to lemmatize and stem the dataset

In [24]:
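lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

def lemmatize_stemming(text):
    # Lemmatize the token as a verb, then reduce the lemma to its stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop stopwords and tokens of 3 characters or fewer
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result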

### Select a document to preview after preprocessing

In [21]:
doc_sample = documents[documents['index'] == 4310].values[0][0]
doc_sample

'ratepayers group wants compulsory local govt voting'

In [25]:
print('original document: ')
words = doc_sample.split(' ')
print(words)
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['ratepayers', 'group', 'wants', 'compulsory', 'local', 'govt', 'voting']


 tokenized and lemmatized document: 
['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']


In [26]:
processed_docs = documents['headline_text'].map(preprocess)

In [27]:
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

### Bag of words on the dataset
Create a dictionary from `processed_docs` that maps each unique token to an integer id and keeps track of how often each token appears in the training set.

In [28]:
dictionary = gensim.corpora.Dictionary(processed_docs)
dictionary

<gensim.corpora.dictionary.Dictionary at 0x2b78f17ea90>

In [29]:
# Print the first 11 (id, token) pairs
for count, (k, v) in enumerate(dictionary.iteritems()):
    print(k, v)
    if count >= 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


### Filter out tokens that appear in
- fewer than 15 documents (absolute number), or
- more than 50% of the documents (`no_above=0.5` is a fraction of the total corpus size, not an absolute number).

After these two steps, keep only the 100000 most frequent remaining tokens.


In [30]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
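The filtering rules above can be illustrated with a simplified pure-Python sketch. It is not gensim's implementation: the function name `filter_vocabulary` and the toy corpus are hypothetical, and ranking by document frequency is an assumption about what "most frequent" means here.

```python
from collections import Counter

def filter_vocabulary(docs, no_below, no_above, keep_n):
    """Keep tokens whose document frequency lies within the given bounds."""
    num_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token at most once per document
    kept = [tok for tok, df in doc_freq.items()
            if df >= no_below and df <= no_above * num_docs]
    # Rank by document frequency (ties broken alphabetically) and truncate
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

toy_docs = [
    ["strike", "staff", "rise"],
    ["strike", "travel"],
    ["strike", "vote", "staff"],
    ["vote", "local"],
]
kept_tokens = filter_vocabulary(toy_docs, no_below=2, no_above=0.5, keep_n=10)
print(kept_tokens)  # ['staff', 'vote']
```

In this toy corpus, "strike" is dropped because it appears in three of four documents (more than 50%), while "rise", "travel", and "local" are dropped because they appear in only one.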

### Gensim doc2bow
For each document, we create a bag-of-words representation: a list of (token id, count) pairs reporting which words appear and how many times. Save this to `bow_corpus`, then check the document we selected earlier.

In [31]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(162, 1), (240, 1), (292, 1), (589, 1), (838, 1), (3571, 1), (3572, 1)]

In [33]:
bow_doc_4310 = bow_corpus[4310]

for token_id, token_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} times.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 162 ("govt") appears 1 times.
Word 240 ("group") appears 1 times.
Word 292 ("vote") appears 1 times.
Word 589 ("local") appears 1 times.
Word 838 ("want") appears 1 times.
Word 3571 ("compulsori") appears 1 times.
Word 3572 ("ratepay") appears 1 times.
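To show what a bag-of-words conversion does conceptually, here is a minimal sketch using `collections.Counter`. The helper `doc_to_bow` and the small `token2id` mapping are hypothetical stand-ins; the ids differ from the ones gensim assigned above.

```python
from collections import Counter

def doc_to_bow(tokens, token2id):
    """Convert tokens to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"govt": 0, "group": 1, "vote": 2, "local": 3}  # toy vocabulary
tokens = ["group", "vote", "govt", "vote", "ratepay"]      # "ratepay" is not in the vocabulary

bow = doc_to_bow(tokens, token2id)
print(bow)  # [(0, 1), (1, 1), (2, 2)]
```

Word order is discarded and tokens missing from the vocabulary are ignored, which is why filtering the dictionary earlier also shrinks every document's representation.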


### TF-IDF

In [34]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [35]:
corpus_tfidf = tfidf[bow_corpus]

In [36]:
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5844216176085719),
 (1, 0.38716866963787633),
 (2, 0.5013820927104505),
 (3, 0.5071171375845095)]
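The weights printed above form a unit-length vector (their squares sum to about 1). The sketch below reproduces that behavior on toy numbers, assuming a raw term count multiplied by a base-2 logarithmic inverse document frequency, followed by L2 normalization. These weighting choices are assumptions for illustration, and `tfidf_vector` is a hypothetical helper rather than part of gensim.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by count * log2(N / df), then L2-normalize."""
    weights = [(tid, count * math.log2(num_docs / doc_freq[tid]))
               for tid, count in bow]
    weights = [(tid, w) for tid, w in weights if w > 0.0]  # drop zero-weight tokens
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []

doc_freq = {0: 1, 1: 2, 2: 4}  # toy document frequencies
num_docs = 4

vec = tfidf_vector([(0, 1), (1, 2), (2, 1)], doc_freq, num_docs)
print(vec)
```

Token 2 appears in every toy document, so its inverse document frequency is zero and it drops out of the vector entirely.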


## Running LDA using Bag of Words

In [37]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

For each topic, we will explore the words occurring in that topic and their relative weights.

In [38]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nwords: {}'.format(idx, topic))

Topic: 0 
words: 0.029*"vaccin" + 0.020*"health" + 0.017*"hospit" + 0.016*"scott" + 0.015*"farmer" + 0.013*"break" + 0.011*"show" + 0.011*"pandem" + 0.010*"beach" + 0.010*"citi"
Topic: 1 
words: 0.044*"polic" + 0.030*"death" + 0.024*"home" + 0.020*"crash" + 0.018*"die" + 0.016*"lockdown" + 0.015*"perth" + 0.015*"shoot" + 0.015*"woman" + 0.015*"investig"
Topic: 2 
words: 0.042*"trump" + 0.026*"record" + 0.023*"test" + 0.023*"donald" + 0.021*"open" + 0.018*"market" + 0.016*"australian" + 0.014*"final" + 0.010*"guilti" + 0.010*"year"
Topic: 3 
words: 0.031*"elect" + 0.019*"say" + 0.016*"busi" + 0.014*"state" + 0.013*"speak" + 0.013*"minist" + 0.013*"labor" + 0.012*"andrew" + 0.011*"quarantin" + 0.011*"north"
Topic: 4 
words: 0.047*"queensland" + 0.019*"world" + 0.017*"canberra" + 0.015*"time" + 0.014*"win" + 0.013*"premier" + 0.011*"assault" + 0.010*"game" + 0.010*"australian" + 0.009*"sexual"
Topic: 5 
words: 0.038*"case" + 0.027*"charg" + 0.027*"court" + 0.023*"news" + 0.022*"murder" + 
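Each topic is printed as a single string of `weight*"word"` terms. If the weights are needed as numbers, one option is to parse that string, as in the sketch below; `parse_topic` is a hypothetical helper, and the sample string is shortened from the output above. Gensim's `show_topic` method returns (word, weight) pairs directly and may be the simpler route.

```python
import re

def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic = '0.044*"polic" + 0.030*"death" + 0.024*"home"'
parsed = parse_topic(topic)
print(parsed)  # [('polic', 0.044), ('death', 0.03), ('home', 0.024)]
```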

## Running LDA using TF-IDF

In [39]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [40]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} \nwords: {}'.format(idx, topic))

Topic: 0 
words: 0.015*"polic" + 0.014*"crash" + 0.012*"kill" + 0.011*"woman" + 0.011*"death" + 0.009*"lockdown" + 0.009*"shoot" + 0.008*"dead" + 0.008*"charg" + 0.008*"die"
Topic: 1 
words: 0.015*"interview" + 0.010*"weather" + 0.008*"peter" + 0.006*"extend" + 0.006*"june" + 0.005*"cancer" + 0.005*"april" + 0.005*"toni" + 0.005*"great" + 0.005*"storm"
Topic: 2 
words: 0.013*"charg" + 0.012*"morrison" + 0.012*"court" + 0.010*"murder" + 0.010*"polic" + 0.009*"alleg" + 0.009*"michael" + 0.008*"video" + 0.007*"accus" + 0.007*"steal"
Topic: 3 
words: 0.025*"trump" + 0.014*"donald" + 0.011*"australia" + 0.011*"world" + 0.009*"final" + 0.009*"scott" + 0.007*"leagu" + 0.006*"open" + 0.006*"cricket" + 0.006*"biden"
Topic: 4 
words: 0.026*"covid" + 0.023*"coronavirus" + 0.011*"countri" + 0.010*"queensland" + 0.010*"victoria" + 0.009*"case" + 0.008*"hour" + 0.008*"coast" + 0.008*"health" + 0.008*"australia"
Topic: 5 
words: 0.015*"drum" + 0.009*"tuesday" + 0.009*"wednesday" + 0.008*"live" + 0.00

### Classification of the topics
In the evaluations below, the test document receives its highest score for the topic the model assigns to it, and that topic's top words are consistent with the headline, so the classification looks accurate.

### Performance evaluation by classifying a sample document using the LDA bag-of-words model

In [41]:
processed_docs[4310]

['ratepay', 'group', 'want', 'compulsori', 'local', 'govt', 'vote']

In [42]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print('\nScore: {}\t \nTopic: {}'.format(score, lda_model.print_topic(index, 10)))




Score: 0.5550630688667297	 
Topic: 0.033*"govern" + 0.023*"live" + 0.015*"worker" + 0.015*"indigen" + 0.014*"communiti" + 0.014*"work" + 0.012*"care" + 0.011*"street" + 0.011*"call" + 0.011*"age"

Score: 0.21477612853050232	 
Topic: 0.031*"elect" + 0.019*"say" + 0.016*"busi" + 0.014*"state" + 0.013*"speak" + 0.013*"minist" + 0.013*"labor" + 0.012*"andrew" + 0.011*"quarantin" + 0.011*"north"

Score: 0.1426113247871399	 
Topic: 0.042*"trump" + 0.026*"record" + 0.023*"test" + 0.023*"donald" + 0.021*"open" + 0.018*"market" + 0.016*"australian" + 0.014*"final" + 0.010*"guilti" + 0.010*"year"

Score: 0.012508495710790157	 
Topic: 0.021*"nation" + 0.019*"coast" + 0.017*"adelaid" + 0.017*"tasmania" + 0.015*"victorian" + 0.013*"region" + 0.012*"concern" + 0.012*"hous" + 0.012*"gold" + 0.012*"school"

Score: 0.012506960891187191	 
Topic: 0.038*"case" + 0.027*"charg" + 0.027*"court" + 0.023*"news" + 0.022*"murder" + 0.019*"face" + 0.019*"alleg" + 0.018*"peopl" + 0.015*"trial" + 0.014*"morrison"


### Performance evaluation by classifying a sample document using the LDA TF-IDF model

In [43]:
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print('\nScore: {}\t \nTopic: {}'.format(score, lda_model_tfidf.print_topic(index, 10)))




Score: 0.5961377620697021	 
Topic: 0.018*"govern" + 0.010*"climat" + 0.009*"andrew" + 0.009*"chang" + 0.007*"john" + 0.006*"social" + 0.005*"right" + 0.005*"jam" + 0.005*"say" + 0.005*"control"

Score: 0.3038157522678375	 
Topic: 0.026*"covid" + 0.023*"coronavirus" + 0.011*"countri" + 0.010*"queensland" + 0.010*"victoria" + 0.009*"case" + 0.008*"hour" + 0.008*"coast" + 0.008*"health" + 0.008*"australia"

Score: 0.012508079409599304	 
Topic: 0.012*"elect" + 0.008*"liber" + 0.007*"david" + 0.007*"turnbul" + 0.007*"say" + 0.006*"parliament" + 0.006*"labor" + 0.006*"histori" + 0.006*"parti" + 0.006*"minist"

Score: 0.012507021427154541	 
Topic: 0.019*"news" + 0.012*"market" + 0.008*"vaccin" + 0.008*"care" + 0.008*"coronavirus" + 0.007*"share" + 0.007*"age" + 0.006*"wall" + 0.006*"busi" + 0.006*"australian"

Score: 0.01250616554170847	 
Topic: 0.010*"royal" + 0.010*"stori" + 0.009*"friday" + 0.009*"monday" + 0.009*"sport" + 0.009*"christma" + 0.008*"commiss" + 0.008*"financ" + 0.007*"brief

### Testing the model on an unseen document

In [44]:
unseen_document = 'How a Pentagon deal became an identity crisis for Google'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print('\nScore: {}\t \nTopic: {}'.format(score, lda_model.print_topic(index, 10)))



Score: 0.3500813841819763	 
Topic: 0.042*"trump" + 0.026*"record" + 0.023*"test" + 0.023*"donald" + 0.021*"open" + 0.018*"market" + 0.016*"australian" + 0.014*"final" + 0.010*"guilti" + 0.010*"year"

Score: 0.34990212321281433	 
Topic: 0.074*"coronavirus" + 0.070*"australia" + 0.064*"covid" + 0.031*"victoria" + 0.022*"south" + 0.017*"chang" + 0.016*"restrict" + 0.013*"island" + 0.012*"water" + 0.011*"west"

Score: 0.18328313529491425	 
Topic: 0.033*"melbourn" + 0.023*"protest" + 0.022*"attack" + 0.019*"australian" + 0.017*"arrest" + 0.016*"border" + 0.015*"kill" + 0.014*"royal" + 0.014*"polic" + 0.012*"abus"

Score: 0.01668340340256691	 
Topic: 0.033*"govern" + 0.023*"live" + 0.015*"worker" + 0.015*"indigen" + 0.014*"communiti" + 0.014*"work" + 0.012*"care" + 0.011*"street" + 0.011*"call" + 0.011*"age"

Score: 0.01667807251214981	 
Topic: 0.031*"elect" + 0.019*"say" + 0.016*"busi" + 0.014*"state" + 0.013*"speak" + 0.013*"minist" + 0.013*"labor" + 0.012*"andrew" + 0.011*"quarantin" + 0