**Latent Dirichlet Allocation**

This notebook is part of my learning journey, which I've been documenting from Udacity's Natural Language Processing Nanodegree program. The program helped me a lot in learning advanced data science topics such as PySpark. Thank you so much, Udacity, for providing such quality content.

LDA is used to assign the text in a document to one or more topics. It builds a topic-per-document model and a words-per-topic model, both with Dirichlet priors.

Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.

LDA assumes that every chunk of text we feed into it contains words that are somehow related. Therefore, choosing the right corpus of data is crucial.

It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.
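The generative story described above can be illustrated with a small sketch. This is a toy illustration using only the standard library, not the gensim implementation; the vocabulary, the two topics, and the helper names `sample_dirichlet` and `generate_document` are hypothetical and chosen only for this example.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one toy document following LDA's generative assumptions."""
    n_topics = len(topic_word_probs)
    # 1. Draw this document's topic mixture from a Dirichlet prior.
    theta = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(range(n_topics), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
vocab = ["rain", "drought", "court", "charg"]
topic_word_probs = [
    [0.5, 0.5, 0.0, 0.0],  # a "weather"-like topic
    [0.0, 0.0, 0.5, 0.5],  # a "crime"-like topic
]
rng = random.Random(400)
theta, words = generate_document(topic_word_probs, vocab, alpha=0.5, n_words=6, rng=rng)
print(theta, words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that most plausibly produced them.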

Load the dataset

The dataset we'll use is a list of over one million news headlines published over a period of 15 years. We'll start by loading it from the abcnews-date-text.csv file.

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display
from tqdm import tqdm
from collections import Counter
import ast

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sb

from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob
import scipy.stats as stats

from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE

from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook
output_notebook()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
'''
Load the dataset from the csv and save it to 'data_text'
'''
import pandas as pd
# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
data = pd.read_csv('/content/drive/MyDrive/LDA Topic modeling/abcnews-date-text.csv', on_bad_lines='skip')
# we only need the headlines from the data
data_text = data[:300000][['headline_text']].copy()  # copy to avoid SettingWithCopyWarning
data_text['index'] = data_text.index
documents = data_text






In [None]:
data.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


Data Preprocessing

We will perform the following steps:

Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.

Words that have 3 or fewer characters are removed.

All stopwords are removed.

Words are lemmatized - words in third person are changed to first person and verbs in past and future tenses are changed into present.

Words are stemmed - words are reduced to their root form.

In [None]:
'''
Loading Gensim and nltk libraries
'''

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
'''
Write a function to perform the preprocessing steps on the entire dataset
'''
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()  # create once instead of once per token

def lemmatize_stemming(text):
    # Lemmatize as a verb, then stem the result
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, drop stopwords and tokens of 3 or fewer characters, then lemmatize and stem
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [None]:
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
'''
Preview a document after preprocessing
'''
document_num = 4310
doc_sample = documents[documents['index'] == document_num].values[0][0]

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['rain', 'helps', 'dampen', 'bushfires']


Tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


Let's now preprocess all the news headlines we have. To do that, we use the map function from pandas to apply preprocess() to the headline_text column.

In [None]:
# preprocess all the headlines, saving the list of results as 'processed_docs'
processed_docs = documents['headline_text'].map(preprocess)

In [None]:
'''
Preview 'processed_docs'
'''
processed_docs.head()

0     [decid, communiti, broadcast, licenc]
1                        [wit, awar, defam]
2    [call, infrastructur, protect, summit]
3               [staff, aust, strike, rise]
4      [strike, affect, australian, travel]
Name: headline_text, dtype: object

Bag of words on the dataset

Now let's create a dictionary from 'processed_docs' that maps each unique word to an integer id and records how many documents each word appears in. To do that, let's pass processed_docs to gensim.corpora.Dictionary() and call it 'dictionary'.

In [None]:
'''
Create a dictionary from 'processed_docs' that maps each unique word to an integer id
using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [None]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


Gensim filter_extremes

filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

Filter out tokens that appear in

1. less than no_below documents (absolute number) or

2. more than no_above documents (fraction of total corpus size, not absolute number).

3. after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if None).
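The three filtering rules can be sketched in plain Python as follows. This is only an illustration of the logic described above, not gensim's actual code; the function name `filter_extremes_sketch` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    # Rules 1 and 2: drop tokens that are too rare or too common.
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Rule 3: keep only the keep_n most frequent survivors (all if keep_n is None).
    kept.sort(key=lambda tok: -doc_freq[tok])
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

toy_docs = [["rain", "help"], ["rain", "fire"], ["rain", "help"], ["rain", "court"]]
print(filter_extremes_sketch(toy_docs, no_below=2, no_above=0.5))  # {'help'}
```

In the toy corpus, "rain" is dropped because it appears in every document, while "fire" and "court" are dropped because each appears in only one.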

In [None]:
'''
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents
- after that, keep at most the 100000 most frequent words
'''
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)

Gensim doc2bow

doc2bow(document)

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
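To make the bag-of-words format concrete, here is a minimal plain-Python sketch of the conversion. It is not gensim's implementation; `doc2bow_sketch` is a hypothetical helper, and the token ids are copied from this notebook's output for document 4310. As with doc2bow's default behaviour, tokens missing from the dictionary are skipped.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Count known tokens and return sorted (token_id, token_count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Ids taken from the sample document's output in this notebook.
token2id = {"bushfir": 71, "help": 107, "rain": 462, "dampen": 3530}
print(doc2bow_sketch(["rain", "help", "dampen", "bushfir"], token2id))
# [(71, 1), (107, 1), (462, 1), (3530, 1)]
```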

In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [None]:
'''
Checking Bag of Words corpus for our sample document --> (token_id, token_count)
'''
bow_corpus[document_num]

[(71, 1), (107, 1), (462, 1), (3530, 1)]

In [None]:
'''
Preview BOW for our sample preprocessed document
'''
# document_num is 4310, the sample document previewed earlier
bow_doc_4310 = bow_corpus[document_num]

for token_id, token_count in bow_doc_4310:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))

Word 71 ("bushfir") appears 1 time.
Word 107 ("help") appears 1 time.
Word 462 ("rain") appears 1 time.
Word 3530 ("dampen") appears 1 time.


TF-IDF on our document set

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality.
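As a rough sketch of what the transformation does to one bag-of-words vector, the example below weights each count by log2(N / df) and then scales the vector to unit length. This assumes gensim's default TfidfModel settings and is not its actual code; `tfidf_sketch` and the document frequencies are hypothetical.

```python
import math

def tfidf_sketch(bow, doc_freq, n_docs):
    """Weight (token_id, count) pairs by count * log2(n_docs / df), then L2-normalize."""
    weighted = [(tid, cnt * math.log2(n_docs / doc_freq[tid])) for tid, cnt in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []  # every token occurs in all documents, so nothing is left to weight
    return [(tid, w / norm) for tid, w in weighted]

# Hypothetical numbers: token 0 occurs in 2 of 8 documents, token 1 in 4 of 8.
scores = tfidf_sketch([(0, 1), (1, 1)], doc_freq={0: 2, 1: 4}, n_docs=8)
print(scores)
```

The rarer token receives the larger score, and the output has the same dimensionality as the input, just with float weights instead of integer counts.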

In [None]:
'''
Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
'''
from gensim import corpora, models


tfidf = models.TfidfModel(bow_corpus)

In [None]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
corpus_tfidf = tfidf[bow_corpus]

In [None]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5959813347777092),
 (1, 0.39204529549491984),
 (2, 0.48531419274988147),
 (3, 0.5055461098578569)]

Running LDA using Bag of Words


In [None]:
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=10, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)

In [None]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.022*"closer" + 0.021*"test" + 0.020*"lead" + 0.017*"talk" + 0.014*"south" + 0.013*"law" + 0.012*"take" + 0.012*"timor" + 0.011*"open" + 0.010*"clash"


Topic: 1 
Words: 0.091*"polic" + 0.028*"seek" + 0.025*"investig" + 0.022*"miss" + 0.016*"search" + 0.015*"probe" + 0.013*"region" + 0.011*"offic" + 0.011*"bodi" + 0.011*"shoot"


Topic: 2 
Words: 0.016*"record" + 0.014*"australia" + 0.014*"break" + 0.013*"look" + 0.013*"drought" + 0.012*"rain" + 0.012*"dead" + 0.012*"price" + 0.010*"sydney" + 0.009*"crew"


Topic: 3 
Words: 0.051*"water" + 0.033*"warn" + 0.015*"industri" + 0.015*"continu" + 0.014*"urg" + 0.013*"farmer" + 0.012*"busi" + 0.012*"begin" + 0.011*"worker" + 0.010*"threat"


Topic: 4 
Words: 0.016*"elect" + 0.016*"iraq" + 0.014*"council" + 0.014*"howard" + 0.013*"reject" + 0.013*"market" + 0.013*"deal" + 0.013*"labor" + 0.012*"say" + 0.012*"plan"


Topic: 5 
Words: 0.040*"charg" + 0.035*"court" + 0.034*"face" + 0.022*"kill" + 0.020*"murder" + 0.020*"accus" + 0.020*"fo

Running LDA using TF-IDF

In [None]:
'''
Define lda model using corpus_tfidf, again using gensim.models.LdaMulticore()
'''

lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                       num_topics=10, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)

In [None]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.019*"kill" + 0.016*"iraq" + 0.011*"troop" + 0.010*"firefight" + 0.010*"blaze" + 0.008*"bomb" + 0.007*"timor" + 0.007*"crew" + 0.007*"blast" + 0.006*"attack"


Topic: 1 Word: 0.012*"price" + 0.011*"teen" + 0.010*"market" + 0.008*"climat" + 0.007*"restrict" + 0.007*"rise" + 0.007*"eas" + 0.007*"water" + 0.007*"level" + 0.006*"profit"


Topic: 2 Word: 0.011*"hick" + 0.009*"drink" + 0.008*"condit" + 0.007*"bird" + 0.006*"driver" + 0.006*"retir" + 0.006*"polic" + 0.006*"award" + 0.006*"perth" + 0.005*"flag"


Topic: 3 Word: 0.013*"opposit" + 0.008*"rais" + 0.008*"govt" + 0.007*"busi" + 0.007*"chang" + 0.007*"baghdad" + 0.006*"lebanon" + 0.006*"hill" + 0.006*"law" + 0.006*"cancer"


Topic: 4 Word: 0.016*"govt" + 0.011*"water" + 0.011*"plan" + 0.010*"council" + 0.010*"fund" + 0.010*"urg" + 0.007*"health" + 0.007*"group" + 0.007*"union" + 0.007*"indigen"


Topic: 5 Word: 0.045*"closer" + 0.010*"murray" + 0.007*"tiger" + 0.006*"recycl" + 0.006*"miner" + 0.006*"kangaroo" + 0.005

Performance evaluation by classifying sample document using LDA Bag of Words model

In [None]:
'''
Text of sample document 4310
'''
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [None]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''
document_num = 4310

# Our test document is document number 4310
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.48499351739883423	 
Topic: 0.016*"record" + 0.014*"australia" + 0.014*"break" + 0.013*"look" + 0.013*"drought" + 0.012*"rain" + 0.012*"dead" + 0.012*"price" + 0.010*"sydney" + 0.009*"crew"

Score: 0.35498127341270447	 
Topic: 0.091*"polic" + 0.028*"seek" + 0.025*"investig" + 0.022*"miss" + 0.016*"search" + 0.015*"probe" + 0.013*"region" + 0.011*"offic" + 0.011*"bodi" + 0.011*"shoot"

Score: 0.02000931277871132	 
Topic: 0.051*"water" + 0.033*"warn" + 0.015*"industri" + 0.015*"continu" + 0.014*"urg" + 0.013*"farmer" + 0.012*"busi" + 0.012*"begin" + 0.011*"worker" + 0.010*"threat"

Score: 0.020006787031888962	 
Topic: 0.018*"return" + 0.017*"hold" + 0.014*"question" + 0.014*"resid" + 0.014*"work" + 0.012*"firefight" + 0.011*"blaze" + 0.011*"rais" + 0.010*"unit" + 0.010*"titl"

Score: 0.020003389567136765	 
Topic: 0.060*"govt" + 0.027*"council" + 0.025*"fund" + 0.021*"plan" + 0.020*"urg" + 0.016*"boost" + 0.013*"servic" + 0.012*"rise" + 0.012*"health" + 0.012*"defend"

Score: 0.0

The document has its highest probability (about 0.48) under the topic that we assigned as Topic X, whose top words include "drought" and "rain". That is a plausible match for a headline about rain and bushfires.
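When only the best match is needed rather than the full ranking, the list of (topic_id, score) pairs returned for a document can be reduced with max. The sketch below is a minimal illustration; `dominant_topic` is a hypothetical helper, and the example pairs are rounded values shaped like the output above.

```python
def dominant_topic(topic_scores):
    """Return the (topic_id, score) pair with the highest score."""
    return max(topic_scores, key=lambda pair: pair[1])

# Hypothetical (topic_id, score) pairs shaped like the model's output.
example_scores = [(1, 0.355), (2, 0.485), (3, 0.020)]
print(dominant_topic(example_scores))  # (2, 0.485)
```

With the trained model, the same helper could be applied as dominant_topic(lda_model[bow_corpus[document_num]]).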

Performance evaluation by classifying sample document using LDA TF-IDF model

In [None]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
# Our test document is document number 4310
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.5793165564537048	 
Topic: 0.012*"nuclear" + 0.011*"rain" + 0.009*"drought" + 0.009*"north" + 0.008*"cyclon" + 0.008*"farmer" + 0.008*"storm" + 0.008*"wind" + 0.007*"damag" + 0.007*"farm"

Score: 0.26067036390304565	 
Topic: 0.013*"rudd" + 0.009*"control" + 0.007*"council" + 0.007*"light" + 0.007*"plan" + 0.006*"qanta" + 0.006*"news" + 0.006*"govt" + 0.006*"propos" + 0.006*"illeg"

Score: 0.02000400796532631	 
Topic: 0.019*"kill" + 0.016*"iraq" + 0.011*"troop" + 0.010*"firefight" + 0.010*"blaze" + 0.008*"bomb" + 0.007*"timor" + 0.007*"crew" + 0.007*"blast" + 0.006*"attack"

Score: 0.020002862438559532	 
Topic: 0.016*"govt" + 0.011*"water" + 0.011*"plan" + 0.010*"council" + 0.010*"fund" + 0.010*"urg" + 0.007*"health" + 0.007*"group" + 0.007*"union" + 0.007*"indigen"

Score: 0.020001530647277832	 
Topic: 0.026*"polic" + 0.020*"charg" + 0.016*"investig" + 0.016*"court" + 0.015*"murder" + 0.015*"crash" + 0.013*"search" + 0.012*"woman" + 0.011*"jail" + 0.011*"miss"

Score: 0.020001

The document has its highest probability (about 58%) under the topic that we assigned as topic X, whose top words include "rain", "drought", "cyclon" and "storm".

Testing model on unseen document

In [None]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.4200000762939453	 Topic: 0.022*"closer" + 0.021*"test" + 0.020*"lead" + 0.017*"talk" + 0.014*"south"
Score: 0.2199999839067459	 Topic: 0.018*"return" + 0.017*"hold" + 0.014*"question" + 0.014*"resid" + 0.014*"work"
Score: 0.2199922502040863	 Topic: 0.040*"charg" + 0.035*"court" + 0.034*"face" + 0.022*"kill" + 0.020*"murder"
Score: 0.020004000514745712	 Topic: 0.038*"crash" + 0.025*"jail" + 0.021*"road" + 0.018*"die" + 0.016*"death"
Score: 0.02000368759036064	 Topic: 0.036*"report" + 0.024*"opposit" + 0.023*"power" + 0.014*"win" + 0.012*"say"
Score: 0.019999999552965164	 Topic: 0.091*"polic" + 0.028*"seek" + 0.025*"investig" + 0.022*"miss" + 0.016*"search"
Score: 0.019999999552965164	 Topic: 0.016*"record" + 0.014*"australia" + 0.014*"break" + 0.013*"look" + 0.013*"drought"
Score: 0.019999999552965164	 Topic: 0.051*"water" + 0.033*"warn" + 0.015*"industri" + 0.015*"continu" + 0.014*"urg"
Score: 0.019999999552965164	 Topic: 0.016*"elect" + 0.016*"iraq" + 0.014*"council" + 0.014*

The model assigns the unseen document to the X category with about 42% probability.