# Latent Dirichlet Allocation (LDA)

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of
discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each
item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in
turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of
text modeling, the topic probabilities provide an explicit representation of a document. Read more in <a href="http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">"Latent Dirichlet Allocation" paper. </a>

<hr>

# NLP Pipeline

Every NLP model has 3 sections that have to be fullfiled:
<ol>
  <li>Text Processing </li>
  <li>Feature Extraction</li>
  <li>Modeling</li>
</ol>
Below you can follow these steps.

<hr>

# 1. Text Processing

In the text processing section we do the following:

<ol>
  <li>**Tokenization:** Splitting the text into words + Lowercasing the words + Deleting the punctuation. </li>
  <li>**Eliminating irelavants:** Words with less than 3 alphabets + Stopwords. </li>
  <li>**Lemmatization:** Converting third person to first person + Converting every verb time to persent. </li>
  <li>**Stemming:** Finding the root of each word.</li>
</ol>

### 1.1. Loading the Data

In [1]:
# Importing the libraries
import numpy as np
import pandas as pd

# Seed
np.random.seed(300)

In [2]:
# Loading the dataset
data = pd.read_csv(filepath_or_buffer = "./abcnews-date-text.csv", error_bad_lines = False)
data.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [3]:
# Getting the second column
data_text = data[:300000][["headline_text"]]
data_text.head()

Unnamed: 0,headline_text
0,aba decides against community broadcasting lic...
1,act fire witnesses must be aware of defamation
2,a g calls for infrastructure protection summit
3,air nz staff in aust strike for pay rise
4,air nz strike to affect australian travellers


In [4]:
# Adding a 'index' column
data_text['index'] = data_text.index
data_text.head()

Unnamed: 0,headline_text,index
0,aba decides against community broadcasting lic...,0
1,act fire witnesses must be aware of defamation,1
2,a g calls for infrastructure protection summit,2
3,air nz staff in aust strike for pay rise,3
4,air nz strike to affect australian travellers,4


In [5]:
# Making our documents
documents = data_text

# Total number of documents
print("Total number of documents: ", len(documents))

Total number of documents:  300000


In [6]:
documents.head()

Unnamed: 0,headline_text,index
0,aba decides against community broadcasting lic...,0
1,act fire witnesses must be aware of defamation,1
2,a g calls for infrastructure protection summit,2
3,air nz staff in aust strike for pay rise,3
4,air nz strike to affect australian travellers,4


### 1.2. Preprocessing the Raw Text

In [7]:
# Importing the NLP libraries
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

In [8]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /Users/soheil/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
# Lemmatization and Stemming
def lemmatize_stemming(text):
    """
    Applying the lemmatization and stemming to the give text.
    
    Parameters
    ----------
    text : Raw text for lemmatization and stemming
    
    Return
    ------
    Lemmatized and stemmed text
    
    """
    # Initializing the stemmer
    stemmer = SnowballStemmer(language = 'english')
    
    # Applying lemmatization
    lemmatized_word = WordNetLemmatizer().lemmatize(word = text, pos = 'v')
    
    # Applying stemming
    stemmed_word = stemmer.stem(word = lemmatized_word)
    
    return stemmed_word

In [10]:
# Preprocessing
def preprocess(text):
    """
    Preprocessing the text, including: lowercasing, eliminating punctuation, eliminating words less than 3,
    eliminating the stopwords, and at the end applying the lemmatization and stemming.
    
    Parameters
    ----------
    text : Raw text to apply preprocessing
    
    Return
    ------
    Preprocessed text
    
    """
    # initializing any empty list for the results
    result = []
    
    # Iterating through tokens + Lowercasing and eliminating the punctuation
    for token in gensim.utils.simple_preprocess(text):
        
        # Eliminating stopwords 
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            
            # Eliminating words less than 3
            if len(token)>3:
                
                # Applying lemmatization and stemming
                lemm_stem_word = lemmatize_stemming(text = token)
                
                # Appending the preprocessed word
                result.append(lemm_stem_word)
                
    return result

In [11]:
document_number = 10

# Sampling a document 
sample_doc = documents[documents['index'] == document_number].values[0][0]
print("==> Sample Document: ", sample_doc)

# Tokenizing the sample
sample_words = [s for s in sample_doc.split(' ')]
print("==> Original Words: ", sample_words)

# Preprocessing the sample
preprocessed_words = preprocess(sample_doc)
print("==> Preprocessed Words: ", preprocessed_words)

==> Sample Document:  australia to contribute 10 million in aid to iraq
==> Original Words:  ['australia', 'to', 'contribute', '10', 'million', 'in', 'aid', 'to', 'iraq']
==> Preprocessed Words:  ['australia', 'contribut', 'million', 'iraq']


In [12]:
# Preprocessing the whole document
processed_documents = documents['headline_text'].map(preprocess)

In [13]:
processed_documents.head(10)

0            [decid, communiti, broadcast, licenc]
1                               [wit, awar, defam]
2           [call, infrastructur, protect, summit]
3                      [staff, aust, strike, rise]
4             [strike, affect, australian, travel]
5               [ambiti, olsson, win, tripl, jump]
6           [antic, delight, record, break, barca]
7    [aussi, qualifi, stosur, wast, memphi, match]
8            [aust, address, secur, council, iraq]
9                         [australia, lock, timet]
Name: headline_text, dtype: object

# 2. Feature Extraction

### 2.1. Bag of Words

In [14]:
# Making a dictionary cotaining words and their integer ids
dictionary = gensim.corpora.Dictionary(processed_documents)

In [15]:
# Printing first 10 items in dictionary
count = 0
for key, value in dictionary.iteritems():
    print(key, value)
    count += 1
    if count == 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect


In [16]:
# Deletting very rare and very common words
dictionary.filter_extremes(no_below = 15, # Removing words with less than 15 (absolute number)
                           no_above = 0.1, # Words appearing more than 10% of documents (fraction of size, not absolute number)
                           keep_n = 100000) # keeping the most frequent tokens

In [17]:
# Applying Bag-of-Words for each document (a list of words)
bow_corpus = [dictionary.doc2bow(document = doc) for doc in processed_documents]

print("Bag-of-Words of our sample document: ", bow_corpus[document_number]) # document_number = 10

Bag-of-Words of our sample document:  [(34, 1), (36, 1), (39, 1), (40, 1)]


In [18]:
# Printing out the BoW for our sample document
bow_10 = bow_corpus[document_number]

for i in range(len(bow_10)):
    print('The word "{}" with word id of {} repeated {} times'.format(dictionary[bow_10[i][0]],
                                                                    bow_10[i][0],
                                                                    bow_10[i][1]))

The word "iraq" with word id of 34 repeated 1 times
The word "australia" with word id of 36 repeated 1 times
The word "contribut" with word id of 39 repeated 1 times
The word "million" with word id of 40 repeated 1 times


### 2.2. TF-IDF (Term Frequency, Inverse Document Frequency)

In [19]:
# Importing the libraries
from gensim import corpora
from gensim import models
from pprint import pprint

In [20]:
# Creating a TF-IDF model
tdidf = models.TfidfModel(corpus = bow_corpus)

In [21]:
# Applying TF-IDF to the entire corpus
tdidf_corpus = tdidf[bow_corpus]

print("1st TD-IDF: ", tdidf_corpus[0])

1st TD-IDF:  [(0, 0.5959813347777092), (1, 0.39204529549491984), (2, 0.48531419274988147), (3, 0.5055461098578569)]


In [22]:
# Preview TF-IDF for first document
for doc in tdidf_corpus:
    pprint(doc)
    break

[(0, 0.5959813347777092),
 (1, 0.39204529549491984),
 (2, 0.48531419274988147),
 (3, 0.5055461098578569)]


# 3. Modeling

### 3.1. Runnung LDA (Latent Dirichlet Allocation) Using Bag-of-words

In [23]:
# Creating a LDA model
# LdaModel = Mono core /  LdaMulticore = Multi core
lda_model_bow = gensim.models.LdaMulticore(corpus = bow_corpus,
                                       num_topics = 10,
                                       id2word = dictionary, 
                                       passes = 2,  # Iteration
                                       workers = 2) # Using 2 processor

In [24]:
# Exploring words in each topic and it weight
for index, topic in lda_model_bow.print_topics(-1):
    print("Topic: {} \nWords: {}".format(index, topic), "\n")

Topic: 0 
Words: 0.022*"lead" + 0.021*"open" + 0.017*"take" + 0.014*"aussi" + 0.013*"test" + 0.013*"england" + 0.012*"injur" + 0.011*"victori" + 0.011*"vote" + 0.010*"unit" 

Topic: 1 
Words: 0.055*"polic" + 0.028*"death" + 0.023*"investig" + 0.022*"continu" + 0.022*"miss" + 0.016*"price" + 0.014*"search" + 0.014*"probe" + 0.014*"road" + 0.012*"warn" 

Topic: 2 
Words: 0.040*"charg" + 0.035*"court" + 0.033*"face" + 0.022*"accus" + 0.021*"jail" + 0.021*"murder" + 0.020*"polic" + 0.016*"case" + 0.013*"trial" + 0.013*"world" 

Topic: 3 
Words: 0.044*"govt" + 0.023*"fund" + 0.021*"urg" + 0.014*"health" + 0.013*"labor" + 0.013*"seek" + 0.013*"opposit" + 0.013*"servic" + 0.012*"help" + 0.012*"boost" 

Topic: 4 
Words: 0.028*"kill" + 0.024*"crash" + 0.023*"attack" + 0.021*"drug" + 0.017*"die" + 0.015*"nuclear" + 0.013*"dead" + 0.012*"strike" + 0.011*"north" + 0.011*"guilti" 

Topic: 5 
Words: 0.035*"say" + 0.020*"farmer" + 0.020*"drought" + 0.016*"howard" + 0.016*"deal" + 0.015*"talk" + 0.015

### 3.2. Running LDA (Latent Dirichlet Allocation) Using TF-IDF (Term Frequency, Inverse Document Frequency)

In [25]:
# Defining the LDA model
lda_model_tfidf = gensim.models.LdaMulticore(corpus = bow_corpus, 
                                       num_topics = 10, 
                                       id2word = dictionary, 
                                       passes = 2,
                                       workers = 2)

In [26]:
# Exploring words in each topic and it weight
for index, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} \nWords: {}".format(index, topic), "\n")

Topic: 0 
Words: 0.022*"govt" + 0.019*"urg" + 0.018*"defend" + 0.016*"indigen" + 0.016*"driver" + 0.015*"rise" + 0.015*"public" + 0.015*"council" + 0.015*"hold" + 0.014*"road" 

Topic: 1 
Words: 0.023*"hospit" + 0.020*"worker" + 0.016*"centr" + 0.013*"power" + 0.012*"storm" + 0.011*"threat" + 0.011*"delay" + 0.010*"govt" + 0.010*"action" + 0.009*"plan" 

Topic: 2 
Words: 0.027*"say" + 0.025*"govt" + 0.024*"call" + 0.020*"labor" + 0.015*"howard" + 0.013*"deni" + 0.013*"inquiri" + 0.012*"rule" + 0.011*"sale" + 0.010*"chief" 

Topic: 3 
Words: 0.057*"water" + 0.023*"nation" + 0.022*"opposit" + 0.017*"elect" + 0.017*"win" + 0.012*"resid" + 0.011*"reject" + 0.011*"terror" + 0.011*"plan" + 0.010*"river" 

Topic: 4 
Words: 0.039*"kill" + 0.020*"iraq" + 0.015*"attack" + 0.015*"aust" + 0.014*"troop" + 0.013*"final" + 0.012*"open" + 0.011*"die" + 0.010*"clash" + 0.010*"bomb" 

Topic: 5 
Words: 0.035*"crash" + 0.018*"lead" + 0.017*"world" + 0.012*"take" + 0.011*"aussi" + 0.011*"timor" + 0.009*"re

# 4. Evaluation

### 4.1. Classifying Sample Document Using Bag-of-Words

In [27]:
# Text of sample document 4310
document_number = 4310
processed_documents[document_number]

['rain', 'help', 'dampen', 'bushfir']

In [28]:
# Checking which topic does our document belongs to
for index, score in sorted(lda_model_bow[bow_corpus[document_number]], key = lambda tup: (-1)*tup[1]):
    print("\nScore: {} \nTopic: {}".format(score, lda_model_bow.print_topic(topicno = index, topn = 10)))


Score: 0.4195842742919922 
Topic: 0.035*"say" + 0.020*"farmer" + 0.020*"drought" + 0.016*"howard" + 0.016*"deal" + 0.015*"talk" + 0.015*"iraq" + 0.013*"final" + 0.013*"rain" + 0.010*"trade"

Score: 0.22040049731731415 
Topic: 0.028*"kill" + 0.024*"crash" + 0.023*"attack" + 0.021*"drug" + 0.017*"die" + 0.015*"nuclear" + 0.013*"dead" + 0.012*"strike" + 0.011*"north" + 0.011*"guilti"

Score: 0.2199997901916504 
Topic: 0.055*"polic" + 0.028*"death" + 0.023*"investig" + 0.022*"continu" + 0.022*"miss" + 0.016*"price" + 0.014*"search" + 0.014*"probe" + 0.014*"road" + 0.012*"warn"

Score: 0.02001262456178665 
Topic: 0.044*"govt" + 0.023*"fund" + 0.021*"urg" + 0.014*"health" + 0.013*"labor" + 0.013*"seek" + 0.013*"opposit" + 0.013*"servic" + 0.012*"help" + 0.012*"boost"

Score: 0.0200028233230114 
Topic: 0.050*"plan" + 0.033*"water" + 0.032*"council" + 0.017*"govt" + 0.015*"closer" + 0.014*"group" + 0.014*"work" + 0.013*"push" + 0.012*"consid" + 0.012*"industri"

Score: 0.020000021904706955 
T

### 4.2. Classifying Sample Document Using TF-IDF

In [29]:
# Checking which topic does our document belongs to
for index, score in sorted(lda_model_tfidf[bow_corpus[document_number]], key = lambda tup: (-1)*tup[1]):
    print("\nScore: {} \nTopic: {}".format(score, lda_model_tfidf.print_topic(topicno = index, topn = 10)))


Score: 0.43032628297805786 
Topic: 0.027*"say" + 0.025*"govt" + 0.024*"call" + 0.020*"labor" + 0.015*"howard" + 0.013*"deni" + 0.013*"inquiri" + 0.012*"rule" + 0.011*"sale" + 0.010*"chief"

Score: 0.4096580445766449 
Topic: 0.028*"fund" + 0.021*"boost" + 0.017*"govt" + 0.016*"farmer" + 0.016*"drought" + 0.015*"health" + 0.014*"price" + 0.013*"probe" + 0.012*"break" + 0.012*"school"

Score: 0.020008672028779984 
Topic: 0.057*"water" + 0.023*"nation" + 0.022*"opposit" + 0.017*"elect" + 0.017*"win" + 0.012*"resid" + 0.011*"reject" + 0.011*"terror" + 0.011*"plan" + 0.010*"river"

Score: 0.020002365112304688 
Topic: 0.023*"hospit" + 0.020*"worker" + 0.016*"centr" + 0.013*"power" + 0.012*"storm" + 0.011*"threat" + 0.011*"delay" + 0.010*"govt" + 0.010*"action" + 0.009*"plan"

Score: 0.02000197395682335 
Topic: 0.022*"govt" + 0.019*"urg" + 0.018*"defend" + 0.016*"indigen" + 0.016*"driver" + 0.015*"rise" + 0.015*"public" + 0.015*"council" + 0.015*"hold" + 0.014*"road"

Score: 0.020000927150249

### 4.3. Testing the model on the onseen document

In [30]:
unseen_document = "My favorite sports activities are running and swimming."

In [31]:
# Preprocessing the new document
processed_text = preprocess(text = unseen_document)
print("Processed text: ", processed_text)

Processed text:  ['favorit', 'sport', 'activ', 'run', 'swim']


In [33]:
# Creating a Bag-of-Word
bow_vector = dictionary.doc2bow(document = processed_text)
print("Bag-of-Words for given text: ", bow_vector)

Bag-of-Words for given text:  [(372, 1), (1001, 1), (1632, 1), (2573, 1)]


In [35]:
# Checking which topic does our document belongs to
for index, score in sorted(lda_model_bow[bow_vector], key=lambda tup: -1*tup[1]):
    print("\nScore: {} Topic: {}".format(score, lda_model_bow.print_topic(topicno = index, topn = 5)))


Score: 0.5976608991622925 Topic: 0.023*"concern" + 0.015*"power" + 0.014*"close" + 0.013*"firefight" + 0.013*"centr"

Score: 0.24232117831707 Topic: 0.022*"lead" + 0.021*"open" + 0.017*"take" + 0.014*"aussi" + 0.013*"test"

Score: 0.020006127655506134 Topic: 0.022*"australia" + 0.020*"aust" + 0.018*"return" + 0.015*"play" + 0.012*"announc"

Score: 0.020003166049718857 Topic: 0.050*"plan" + 0.033*"water" + 0.032*"council" + 0.017*"govt" + 0.015*"closer"

Score: 0.020002849400043488 Topic: 0.055*"polic" + 0.028*"death" + 0.023*"investig" + 0.022*"continu" + 0.022*"miss"

Score: 0.020002581179142 Topic: 0.019*"win" + 0.019*"arrest" + 0.018*"rule" + 0.017*"protest" + 0.017*"polic"

Score: 0.020002515986561775 Topic: 0.040*"charg" + 0.035*"court" + 0.033*"face" + 0.022*"accus" + 0.021*"jail"

Score: 0.020000645890831947 Topic: 0.028*"kill" + 0.024*"crash" + 0.023*"attack" + 0.021*"drug" + 0.017*"die"

Score: 0.019999999552965164 Topic: 0.044*"govt" + 0.023*"fund" + 0.021*"urg" + 0.014*"hea

In [87]:
index = 14

doc = documents[documents['index'] == index]
doc = np.array(doc)[0][0]

print("=> News :\n", doc, "\n")

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(doc))
print("=> Topics: ")
for index, score in (sorted(lda_model_tfidf[bow_vector], key=lambda tup: -1*tup[1])):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf.print_topic(index, 1)))

=> News :
 big plan to boost paroo water supplies 

=> Topics: 
Score: 0.6207208633422852	 Topic: 0.057*"water"
Score: 0.21926729381084442	 Topic: 0.028*"fund"
Score: 0.020007101818919182	 Topic: 0.033*"plan"
Score: 0.020002758130431175	 Topic: 0.022*"govt"
Score: 0.02000199817121029	 Topic: 0.023*"hospit"
Score: 0.020000003278255463	 Topic: 0.027*"miss"
Score: 0.020000001415610313	 Topic: 0.027*"say"
Score: 0.020000001415610313	 Topic: 0.039*"kill"
Score: 0.020000001415610313	 Topic: 0.035*"crash"
Score: 0.020000001415610313	 Topic: 0.075*"polic"
