LATENT DIRICHLET ALLOCATION

Johann Peter Gustav Lejeune Dirichlet was a German mathematician of the 1800s who contributed widely to modern mathematics.

A probability distribution, the Dirichlet distribution, is named after him.

LDA is based on this probability distribution.

LDA was first published in 2003 as a graphical model for topic discovery in the Journal of Machine Learning Research by David Blei, Andrew Ng and Michael I. Jordan.

Assumptions of LDA for topic modeling:
=> Documents with similar topics use similar groups of words.
=> Latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus.

=> Documents are probability distributions over latent topics.
=> Topics themselves are probability distributions over words.

LDA represents documents as mixtures of topics that spit out words with certain probabilities.

It assumes that documents are produced in the following fashion:
=> Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics).
 Eg: A document is 60% business, 20% politics, 20% food. This would be the document's topic distribution.
=> Generate each word in the document by:
===> First picking a topic according to the multinomial distribution that you sampled previously (60% business, 20% politics, 20% food).
===> Using the topic to generate the word itself (according to the topic's multinomial distribution).
===> For example, if we selected the food topic, we might generate the word "apple" with 60% probability, "home" with 30% probability, and so on.

=> Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
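
The generative story above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the vocabulary, the three topic-word distributions, and the Dirichlet parameter `alpha` are made-up toy values, and the helper names `sample_dirichlet` and `generate_document` are hypothetical. It shows the forward (generating) direction only, not how LDA infers topics.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    # Step 1: choose the document's topic mixture from a Dirichlet.
    topic_mixture = sample_dirichlet(alpha, rng)
    topic_ids = range(len(topic_word_probs))
    words = []
    for _ in range(n_words):
        # Step 2a: pick a topic according to the document's mixture.
        k = rng.choices(topic_ids, weights=topic_mixture)[0]
        # Step 2b: pick a word according to that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word_probs[k])[0])
    return topic_mixture, words


# Toy vocabulary and topics (illustrative values only; each row sums to 1).
vocab = ["market", "stock", "vote", "senate", "apple", "recipe"]
topic_word_probs = [
    [0.45, 0.45, 0.02, 0.02, 0.04, 0.02],  # "business"
    [0.02, 0.02, 0.47, 0.47, 0.01, 0.01],  # "politics"
    [0.02, 0.02, 0.02, 0.02, 0.46, 0.46],  # "food"
]

rng = random.Random(42)
topic_mixture, words = generate_document(
    topic_word_probs, vocab, alpha=[0.5, 0.5, 0.5], n_words=12, rng=rng
)
print([round(p, 2) for p in topic_mixture])
print(" ".join(words))
```

Fitting LDA runs this story in reverse: given only the words, it estimates topic mixtures and topic-word distributions that make the observed documents likely.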


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../UPDATED_NLP_COURSE/05-Topic-Modeling/npr.csv")

In [3]:
len(df)

11992

In [4]:
df.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [5]:
print(df['Article'][0])

In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing o

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')

Here, max_df=0.9 => ignores words that appear in more than 90% of the documents
min_df=2 => ignores words that appear in fewer than 2 documents, i.e. we keep only words that appear in at least 2 documents
stop_words='english' => removes all English stop words

We could do the above steps ourselves by tokenizing each document with spaCy and removing the stop words manually. However, we shall stick to CountVectorizer for now.

Since this is unsupervised learning, there is no train/test split: we don't have any labels to test against.

In [8]:
dtm = cv.fit_transform(df['Article'])

In [9]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [10]:
from sklearn.decomposition import LatentDirichletAllocation

In [11]:
LDA = LatentDirichletAllocation(n_components = 7, random_state = 42)

In [12]:
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

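# Grab the vocabulary of words

Note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions use get_feature_names_out(), which returns a NumPy array instead of a list.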

In [18]:
len(cv.get_feature_names())

54777

In [19]:
type(cv.get_feature_names())

list

In [21]:
cv.get_feature_names()[900]

'5s'

# Grab the topics

In [22]:
len(LDA.components_)

7

In [23]:
type(LDA.components_)

numpy.ndarray

In [24]:
LDA.components_.shape

(7, 54777)

In [25]:
LDA.components_

array([[8.64332806e+00, 2.38014333e+03, 1.42900522e-01, ...,
        1.43006821e-01, 1.42902042e-01, 1.42861626e-01],
       [2.76191749e+01, 5.36394437e+02, 1.42857148e-01, ...,
        1.42861973e-01, 1.42857147e-01, 1.42906875e-01],
       [7.22783888e+00, 8.24033986e+02, 1.42857148e-01, ...,
        6.14236247e+00, 2.14061364e+00, 1.42923753e-01],
       ...,
       [3.11488651e+00, 3.50409655e+02, 1.42857147e-01, ...,
        1.42859912e-01, 1.42857146e-01, 1.42866614e-01],
       [4.61486388e+01, 5.14408600e+01, 3.14281373e+00, ...,
        1.43107628e-01, 1.43902481e-01, 2.14271779e+00],
       [4.93991422e-01, 4.18841042e+02, 1.42857151e-01, ...,
        1.42857146e-01, 1.43760101e-01, 1.42866201e-01]])
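
The entries of components_ are not probabilities. The scikit-learn documentation describes them as pseudo-counts that become a per-topic word distribution once each row is normalized. A minimal sketch of that normalization, using a made-up 2-topic, 4-word matrix in place of LDA.components_:

```python
import numpy as np

# Hypothetical pseudo-count matrix standing in for LDA.components_
# (2 topics x 4 vocabulary words).
components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.2, 0.2, 3.6, 4.0],
])

# Divide each row by its sum so every topic becomes a distribution over words.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(3))
print(topic_word_probs.sum(axis=1))  # each row now sums to 1
```

Normalizing does not change the ordering within a row, so the argsort-based top-word lookups in this notebook give the same words either way.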

In [26]:
single_topic = LDA.components_[0]

In [27]:
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993])

In [28]:
import numpy as np

In [29]:
arr = np.array([10,200,1])

In [30]:
arr

array([ 10, 200,   1])

In [31]:
arr.argsort()

array([2, 0, 1])

.argsort() returns the index positions that would sort the array in ascending order. Above, index 2 holds the smallest value (1) and index 1 holds the largest (200).
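
Because the sort is ascending, the largest values sit at the end of the argsort() result. A short sketch (toy values, not from the NPR data) of taking the top entries and reversing them so the largest comes first:

```python
import numpy as np

weights = np.array([10, 200, 1, 50])

ascending = weights.argsort()    # indices ordered from smallest to largest value
top_two = ascending[-2:][::-1]   # indices of the two largest values, largest first

print(ascending)  # [2 0 3 1]
print(top_two)    # [1 3]
```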

In [32]:
len(single_topic)

54777

In [35]:
# Top 10 values (10 greatest values):
# argsort() sorts ascending, so the last 10 indices point at the 10 largest values
single_topic.argsort()[-10:]  # [-10:] slices from the 10th-from-last position to the end
# The result holds the vocabulary indices of the 10 highest-weighted words for this single topic

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993])

In [36]:
top_ten_words = single_topic.argsort()[-10:]

In [38]:
for index in top_ten_words:
    print(cv.get_feature_names()[index])

new
percent
government
company
million
care
people
health
said
says


In [39]:
top_twenty_words = single_topic.argsort()[-20:]

In [40]:
for index in top_twenty_words:
    print(cv.get_feature_names()[index])

president
state
tax
insurance
trump
companies
money
year
federal
000
new
percent
government
company
million
care
people
health
said
says


# Grab the highest probability words per topic

In [42]:
for i, topic in enumerate(LDA.components_):
    print(f"The top 15 words for the topic #{i}")
    print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]])
    print("\n")
    print("\n")

The top 15 words for the topic #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']




The top 15 words for the topic #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']




The top 15 words for the topic #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']




The top 15 words for the topic #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']




The top 15 words for the topic #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']




The top 15 words for the topic #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 

In [43]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [44]:
topic_results = LDA.transform(dtm)

In [46]:
# Each row is one article's probability distribution over the 7 topics
topic_results.shape

(11992, 7)

In [48]:
# This is the probability distribution of article 0 over the 7 topics
topic_results[0]

array([1.61040465e-02, 6.83341493e-01, 2.25376318e-04, 2.25369288e-04,
       2.99652737e-01, 2.25479379e-04, 2.25497980e-04])

In [49]:
topic_results[0].round(2)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [51]:
# .argmax() returns the index position of the highest value, which here is the most probable topic
topic_results[0].argmax()

1
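
The same idea extends to every article at once with axis=1. The sketch below uses a made-up 4-document, 3-topic matrix (not the real topic_results) to show the dominant topic per row, how strongly it dominates, and how many documents land in each topic:

```python
import numpy as np

# Hypothetical document-topic matrix: 4 documents x 3 topics, rows sum to 1.
doc_topics = np.array([
    [0.02, 0.68, 0.30],
    [0.10, 0.85, 0.05],
    [0.70, 0.20, 0.10],
    [0.25, 0.25, 0.50],
])

dominant = doc_topics.argmax(axis=1)   # index of the most probable topic per document
strength = doc_topics.max(axis=1)      # probability of that dominant topic
counts = np.bincount(dominant, minlength=doc_topics.shape[1])  # documents per topic

print(dominant)  # [1 1 0 2]
print(strength)
print(counts)    # [1 2 1]
```

Keeping the strength alongside the label helps spot articles such as the first row, where a second topic (0.30) is also substantial and a single hard label loses information.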

In [53]:
df['topic'] = topic_results.argmax(axis=1)

In [54]:
df.head()

Unnamed: 0,Article,topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2


# Non Negative Matrix Factorization

Non-negative matrix factorization is an unsupervised algorithm that simultaneously performs dimensionality reduction and clustering. We can use it in conjunction with TF-IDF to model topics across documents.
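
To give a feel for what the factorization does, here is a minimal NumPy sketch that approximates a non-negative matrix V as W @ H using the classic multiplicative update rules. This is an illustration only: it is not the solver scikit-learn's NMF uses by default, and the function name `nmf_multiplicative`, the toy matrix, and the iteration count are assumptions.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Approximate V (docs x terms) as W @ H with non-negative W and H."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    # Random positive starting points.
    W = rng.random((n_docs, n_components)) + eps   # document-topic weights
    H = rng.random((n_components, n_terms)) + eps  # topic-term weights
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy matrix with two obvious groups of documents and terms.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

W, H = nmf_multiplicative(V, n_components=2)
error = np.linalg.norm(V - W @ H)
print(W.round(2))
print(H.round(2))
print(round(float(error), 3))
```

The rows of H play the role of topics (weights over terms), which is the dimensionality reduction part, and the largest entry in each row of W assigns a document to a topic, which is the clustering part.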

In [55]:
import pandas as pd
npr = pd.read_csv("../UPDATED_NLP_COURSE/05-Topic-Modeling/npr.csv")

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [57]:
tfidf = TfidfVectorizer(max_df= 0.95, min_df = 2, stop_words='english')
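
For intuition about the weights this vectorizer produces, the sketch below computes TF-IDF by hand in pure Python. It assumes the smoothed formula documented for TfidfVectorizer's defaults, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document. The helper name `tfidf_rows` and the toy documents are hypothetical, and the max_df, min_df and stop-word filtering are omitted.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per non-empty document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency (assumed default formula).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        weights = {t: term_counts[t] * idf[t] for t in term_counts}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = [
    "trump election vote".split(),
    "health care vote".split(),
    "health study zika".split(),
]
rows = tfidf_rows(docs)
# "trump" appears in one document and "vote" in two, so "trump" gets the larger weight.
print({t: round(w, 3) for t, w in rows[0].items()})
```

Unlike the raw counts from CountVectorizer, these weights are floats, which is why the document-term matrix in this section has dtype float64.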

In [58]:
dtm = tfidf.fit_transform(npr['Article'])

In [59]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [60]:
from sklearn.decomposition import NMF

In [61]:
nmf_model = NMF(n_components=7, random_state=42)

In [63]:
nmf_model.fit(dtm)



NMF(n_components=7, random_state=42)

In [65]:
tfidf.get_feature_names()[2000]

'africa'

In [66]:
for index, topic in enumerate(nmf_model.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{index}")
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print("\n")

THE TOP 15 WORDS FOR TOPIC #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #5
['love', 've', 'don', 'album', 'way', 'time', 'song', '

In [67]:
topic_results=nmf_model.transform(dtm)

In [68]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 0, 4, 3])

In [69]:
npr['topic'] = topic_results.argmax(axis=1)

In [70]:
npr.head()

Unnamed: 0,Article,topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6


In [71]:
mytopic_dict = {0:'health', 1:'election', 2:'legis', 3: 'poli',4: 'election', 5: 'music',6:'edu'}
npr['topic_label'] = npr['topic'].map(mytopic_dict)

In [72]:
npr.head()

Unnamed: 0,Article,topic,topic_label
0,"In the Washington of 2016, even when the polic...",1,election
1,Donald Trump has used Twitter — his prefe...,1,election
2,Donald Trump is unabashedly praising Russian...,1,election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,poli
4,"From photography, illustration and video, to d...",6,edu
