# Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a popular approach for topic modeling. It works by identifying the key topics within a set of text documents, and the key words that make up each topic.

Under LDA, each document is assumed to have a mix of underlying (latent) topics, each topic with a certain probability of occurring in the document. Individual text documents can therefore be represented by the topics that make them up.

In this way, LDA topic modeling can be used to categorize or classify documents based on their topic content.
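
The "mix of latent topics" idea can be made concrete with a small generative sketch. The code below uses only the standard library and is a toy illustration of the story LDA assumes, not how gensim trains a model. The names `sample_dirichlet`, `generate_document`, and `toy_topics` are hypothetical, as are the topic-word probabilities.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate a toy document: sample a topic mixture, then one word per position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        # Choose a topic for this position, then a word from that topic.
        topic_id = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[topic_id].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics: each maps (stemmed) words to probabilities.
toy_topics = [
    {"rain": 0.5, "flood": 0.3, "storm": 0.2},
    {"court": 0.4, "charg": 0.4, "polic": 0.2},
]

rng = random.Random(2018)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic mixture:", theta)
print("document:", words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word probabilities.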

Each LDA topic model requires:

1. A set of documents for training the model—the training corpus
2. A dictionary of words to form the vocabulary used in the model—this can be derived from the training corpus


Once a model has been trained, it can be applied to a new set of documents to identify the topics in those new documents.

## Dataset

In [1]:
import pandas as pd

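# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
data_text = data[['headline_text']].copy()  # copy so the new column is not set on a view
data_text['index'] = data_text.index
documents = data_text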

In [2]:
len(documents)

1103665

In [36]:
documents.head(10)

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |
| 5 | ambitious olsson wins triple jump | 5 |
| 6 | antic delighted with record breaking barca | 6 |
| 7 | aussie qualifier stosur wastes four memphis match | 7 |
| 8 | aust addresses un security council over iraq | 8 |
| 9 | australia is locked into war timetable opp | 9 |


## Pre-processing

In [4]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

np.random.seed(2018)  # seed NumPy's global random generator



In [5]:
np.version.version

'1.19.5'

In [5]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Rahul\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
print(WordNetLemmatizer().lemmatize('went', pos='v'))

go


In [7]:
stemmer = SnowballStemmer('english')
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]
pd.DataFrame(data = {'original word': original_words, 'stemmed': singles})

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [8]:
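lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize as a verb first ('went' -> 'go'), then stem the result.
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and tokens of 3 characters or fewer,
    # then lemmatize and stem what remains.
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result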

## Reviewing a Pre-Processed Document

In [9]:
doc_sample = documents[documents['index'] == 4310].values[0][0]

print('original document: ')
print(doc_sample.split(' '))
print('\n\n tokenized and lemmatized document: ')
print(preprocess(doc_sample))

original document: 
['rain', 'helps', 'dampen', 'bushfires']


 tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


In [10]:
processed_docs = documents['headline_text'].map(preprocess)

In [11]:
processed_docs[:10]

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

In [37]:
documents.head()

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


## Bag of Words on the Data set

Create a gensim dictionary from `processed_docs`. It maps each unique token in the training set to an integer id and records how many documents each token appears in.

In [12]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [13]:
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit
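
To show what such an id-to-token mapping involves, here is a plain-Python sketch that assigns ids to tokens and tracks document frequencies. It is an illustration only, not gensim's implementation. `build_dictionary` and `toy_docs` are hypothetical names, and sorting tokens within each document before assigning ids is an assumption made here so that the toy ids line up with the listing above.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Map each unique token to an integer id and count documents per token."""
    token2id = {}
    doc_freq = Counter()
    for doc in tokenized_docs:
        unique_tokens = set(doc)
        # Assumption: new tokens get ids in sorted order within each document.
        for token in sorted(unique_tokens):
            if token not in token2id:
                token2id[token] = len(token2id)
        # Each document counts at most once per token.
        doc_freq.update(unique_tokens)
    return token2id, doc_freq


toy_docs = [
    ["decid", "communiti", "broadcast", "licenc"],
    ["wit", "awar", "defam"],
    ["staff", "aust", "strike", "rise"],
    ["strike", "affect", "australian", "travel"],
]

token2id, doc_freq = build_dictionary(toy_docs)
for token, token_id in sorted(token2id.items(), key=lambda kv: kv[1])[:7]:
    print(token_id, token)
print("documents containing 'strike':", doc_freq["strike"])
```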


## Gensim doc2bow

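Before building the corpus, `filter_extremes` trims the dictionary: it drops tokens that appear in fewer than 15 documents or in more than 50% of all documents, then keeps at most the 100,000 most frequent of the remaining tokens.

For each document, `doc2bow` then produces a list of `(token id, count)` pairs recording which dictionary words occur in the document and how many times. We save the result to `bow_corpus` and check the document we selected earlier.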

In [14]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
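
The effect of the three thresholds can be sketched in plain Python. This is a simplified illustration under the assumption that filtering depends only on document frequency; gensim's exact rounding and tie-breaking may differ. The function below is a hypothetical stand-in, not the library method, and the frequencies are made up.

```python
def filter_extremes_sketch(doc_freq, num_docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the tokens that survive frequency-based filtering."""
    kept = [
        (token, df)
        for token, df in doc_freq.items()
        # Keep tokens that are neither too rare nor too common.
        if no_below <= df <= no_above * num_docs
    ]
    # Most frequent first; ties broken alphabetically (an arbitrary choice here).
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [token for token, _ in kept[:keep_n]]


# Made-up document frequencies for a corpus of 100 documents.
toy_doc_freq = {"polic": 40, "rain": 20, "olsson": 2, "say": 90}
surviving = filter_extremes_sketch(toy_doc_freq, num_docs=100)
print(surviving)  # ['polic', 'rain']
```

In this toy corpus "olsson" is too rare (2 documents) and "say" too common (90% of documents), so both are dropped.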

In [15]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]

[(76, 1), (112, 1), (483, 1), (4014, 1)]

In [16]:
bow_doc_4310 = bow_corpus[4310]

for token_id, count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id, dictionary[token_id], count))

Word 76 ("bushfir") appears 1 time.
Word 112 ("help") appears 1 time.
Word 483 ("rain") appears 1 time.
Word 4014 ("dampen") appears 1 time.
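
The bag-of-words conversion itself is simple enough to sketch without gensim. The helper `doc_to_bow` below is a hypothetical stand-in for `doc2bow`. The ids are the four shown in the output above, and `unseenword` is a made-up token included to show that words missing from the dictionary are dropped.

```python
from collections import Counter


def doc_to_bow(tokens, token2id):
    """Convert a token list to sorted (token id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Ids taken from the output above for document 4310.
token2id = {"bushfir": 76, "help": 112, "rain": 483, "dampen": 4014}

bow = doc_to_bow(["rain", "help", "dampen", "bushfir", "unseenword"], token2id)
print(bow)  # [(76, 1), (112, 1), (483, 1), (4014, 1)]
```

Because unknown tokens are ignored, a new document is always expressed in the vocabulary fixed at training time.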


In [17]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)

In [18]:
corpus_tfidf = tfidf[bow_corpus]

In [19]:
from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5892908644709983),
 (1, 0.38929657403503015),
 (2, 0.4964985198530063),
 (3, 0.5046520328695662)]
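
The TF-IDF weights above replace raw counts with values that favour rarer tokens. The sketch below assumes the common scheme of term count times log2(N / document frequency), followed by scaling each document vector to unit length, which is understood to be the default behaviour of `TfidfModel`; treat that as an assumption rather than a guarantee. The function name and the frequencies are hypothetical.

```python
import math


def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token id, count) pairs by count * log2(N / df), then L2-normalize."""
    weights = [
        (token_id, count * math.log2(num_docs / doc_freq[token_id]))
        for token_id, count in bow
    ]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0.0:
        return weights  # nothing to scale
    return [(token_id, w / norm) for token_id, w in weights]


# Made-up document frequencies in a corpus of 1000 documents.
toy_doc_freq = {0: 10, 1: 100}
vec = tfidf_vector([(0, 1), (1, 1)], toy_doc_freq, num_docs=1000)
print(vec)
```

Token 0 appears in fewer documents than token 1, so it receives the larger weight. The unit-length scaling is consistent with the output above, where the squares of the four weights sum to about 1.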


In [20]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [21]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.027*"elect" + 0.022*"nation" + 0.016*"rural" + 0.014*"say" + 0.014*"chang" + 0.014*"labor" + 0.013*"deal" + 0.010*"talk" + 0.010*"vote" + 0.010*"leader"
Topic: 1 
Words: 0.020*"australia" + 0.020*"world" + 0.018*"women" + 0.017*"final" + 0.015*"win" + 0.015*"indigen" + 0.014*"time" + 0.013*"leagu" + 0.012*"australian" + 0.011*"lead"
Topic: 2 
Words: 0.026*"crash" + 0.023*"die" + 0.016*"investig" + 0.015*"feder" + 0.015*"polic" + 0.015*"road" + 0.015*"protest" + 0.015*"hospit" + 0.013*"help" + 0.012*"public"
Topic: 3 
Words: 0.028*"melbourn" + 0.021*"perth" + 0.016*"tasmanian" + 0.016*"tasmania" + 0.015*"home" + 0.013*"sydney" + 0.012*"return" + 0.011*"guilti" + 0.011*"dead" + 0.011*"record"
Topic: 4 
Words: 0.021*"test" + 0.021*"market" + 0.019*"hous" + 0.015*"share" + 0.015*"price" + 0.013*"fall" + 0.013*"rise" + 0.013*"power" + 0.012*"busi" + 0.012*"news"
Topic: 5 
Words: 0.034*"polic" + 0.030*"charg" + 0.028*"court" + 0.025*"queensland" + 0.022*"murder" + 0.019*"f

In [22]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)

In [23]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.009*"sport" + 0.007*"kill" + 0.007*"octob" + 0.007*"monday" + 0.007*"stori" + 0.006*"presid" + 0.006*"east" + 0.006*"syria" + 0.006*"islam" + 0.005*"south"
Topic: 1 Word: 0.011*"hobart" + 0.010*"grandstand" + 0.009*"plead" + 0.008*"wednesday" + 0.007*"climat" + 0.007*"shark" + 0.005*"beach" + 0.005*"plane" + 0.005*"insid" + 0.005*"money"
Topic: 2 Word: 0.013*"govern" + 0.007*"budget" + 0.007*"fund" + 0.006*"say" + 0.006*"council" + 0.006*"labor" + 0.006*"plan" + 0.006*"elect" + 0.005*"hill" + 0.005*"cut"
Topic: 3 Word: 0.007*"energi" + 0.007*"mental" + 0.006*"bushfir" + 0.006*"health" + 0.006*"emerg" + 0.006*"cancer" + 0.005*"disabl" + 0.005*"tree" + 0.005*"patient" + 0.005*"pacif"
Topic: 4 Word: 0.009*"john" + 0.009*"royal" + 0.008*"abbott" + 0.008*"commiss" + 0.008*"farm" + 0.007*"coal" + 0.007*"juli" + 0.007*"peter" + 0.006*"andrew" + 0.006*"toni"
Topic: 5 Word: 0.020*"interview" + 0.014*"drum" + 0.013*"donald" + 0.010*"abus" + 0.009*"child" + 0.008*"friday" + 0.008

## Classification of the topics

In [24]:
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [25]:
for index, score in sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.3450404405593872	 
Topic: 0.028*"south" + 0.026*"north" + 0.024*"coast" + 0.017*"west" + 0.014*"train" + 0.014*"gold" + 0.014*"victoria" + 0.013*"flood" + 0.013*"polit" + 0.011*"darwin"

Score: 0.295529305934906	 
Topic: 0.026*"crash" + 0.023*"die" + 0.016*"investig" + 0.015*"feder" + 0.015*"polic" + 0.015*"road" + 0.015*"protest" + 0.015*"hospit" + 0.013*"help" + 0.012*"public"

Score: 0.21928861737251282	 
Topic: 0.020*"australia" + 0.020*"world" + 0.018*"women" + 0.017*"final" + 0.015*"win" + 0.015*"indigen" + 0.014*"time" + 0.013*"leagu" + 0.012*"australian" + 0.011*"lead"

Score: 0.020022952929139137	 
Topic: 0.031*"death" + 0.021*"year" + 0.017*"turnbul" + 0.016*"leav" + 0.015*"island" + 0.014*"china" + 0.013*"royal" + 0.012*"forc" + 0.011*"drum" + 0.010*"commiss"

Score: 0.020021997392177582	 
Topic: 0.026*"trump" + 0.024*"govern" + 0.020*"say" + 0.015*"school" + 0.014*"countri" + 0.013*"health" + 0.013*"fund" + 0.011*"donald" + 0.011*"hour" + 0.010*"water"

Score: 0.0

In [26]:
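# Note: the query here is the raw BoW vector; corpus_tfidf[4310] would match the
# TF-IDF weighting this model was trained on (and would change the scores below).
for index, score in sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))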


Score: 0.4246877133846283	 
Topic: 0.010*"turnbul" + 0.010*"weather" + 0.008*"australia" + 0.008*"storm" + 0.007*"ash" + 0.007*"leagu" + 0.007*"michael" + 0.006*"tasmania" + 0.006*"rugbi" + 0.006*"celebr"

Score: 0.4152609705924988	 
Topic: 0.007*"energi" + 0.007*"mental" + 0.006*"bushfir" + 0.006*"health" + 0.006*"emerg" + 0.006*"cancer" + 0.005*"disabl" + 0.005*"tree" + 0.005*"patient" + 0.005*"pacif"

Score: 0.020007260143756866	 
Topic: 0.013*"govern" + 0.007*"budget" + 0.007*"fund" + 0.006*"say" + 0.006*"council" + 0.006*"labor" + 0.006*"plan" + 0.006*"elect" + 0.005*"hill" + 0.005*"cut"

Score: 0.020006943494081497	 
Topic: 0.025*"countri" + 0.023*"hour" + 0.012*"podcast" + 0.011*"christma" + 0.006*"june" + 0.006*"action" + 0.006*"univers" + 0.005*"april" + 0.005*"explain" + 0.005*"legal"

Score: 0.020006701350212097	 
Topic: 0.009*"john" + 0.009*"royal" + 0.008*"abbott" + 0.008*"commiss" + 0.008*"farm" + 0.007*"coal" + 0.007*"juli" + 0.007*"peter" + 0.006*"andrew" + 0.006*"toni

## Comparison on Basis of BoW and TFIDF

In [43]:
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [61]:
Bow_index, Bow_topic_score = sorted(lda_model[bow_corpus[4310]], key=lambda tup: -1*tup[1])[0]

In [63]:
print("\nTopic: ",lda_model.print_topic(Bow_index, 10))


Topic:  0.028*"south" + 0.026*"north" + 0.024*"coast" + 0.017*"west" + 0.014*"train" + 0.014*"gold" + 0.014*"victoria" + 0.013*"flood" + 0.013*"polit" + 0.011*"darwin"


In [56]:
tfidf_index, tfidf_topic_score = sorted(lda_model_tfidf[bow_corpus[4310]], key=lambda tup: -1*tup[1])[0]

In [64]:
print("\nTopic: ", lda_model_tfidf.print_topic(tfidf_index, 10))


Topic:  0.010*"turnbul" + 0.010*"weather" + 0.008*"australia" + 0.008*"storm" + 0.007*"ash" + 0.007*"leagu" + 0.007*"michael" + 0.006*"tasmania" + 0.006*"rugbi" + 0.006*"celebr"
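
When comparing the two models, the top topic alone can hide how close the runner-up is. The sketch below, with the hypothetical helper `top_two_margin`, reports the best topic together with its lead over the second-best. The scores are rounded from the listings earlier in this notebook, and the topic ids are placeholders rather than the models' real ids.

```python
from operator import itemgetter


def top_two_margin(topic_scores):
    """Return (best topic id, best score, lead over the runner-up)."""
    ranked = sorted(topic_scores, key=itemgetter(1), reverse=True)
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_id, best_score, best_score - runner_up


# (topic id, score) pairs; ids are placeholders, scores rounded from the output.
bow_scores = [(0, 0.3450), (1, 0.2955), (2, 0.2193)]
tfidf_scores = [(0, 0.4247), (1, 0.4153), (2, 0.0200)]

bow_best = top_two_margin(bow_scores)
tfidf_best = top_two_margin(tfidf_scores)
print("BoW:    topic {} score {:.4f} lead {:.4f}".format(*bow_best))
print("TF-IDF: topic {} score {:.4f} lead {:.4f}".format(*tfidf_best))
```

For this document the TF-IDF model's two leading topics are almost tied, so its top pick is less decisive than the single printed topic suggests.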
