LATENT DIRICHLET ALLOCATION

Johann Peter Gustav Lejeune Dirichlet was a German mathematician in the 1800s who contributed widely to modern mathematics.

A probability distribution, the Dirichlet distribution, is named after him.

LDA is based on this probability distribution.

In 2003, LDA was published as a graphical model for topic discovery in the Journal of Machine Learning Research by David Blei, Andrew Ng and Michael I. Jordan.

Assumptions of LDA for topic modeling:
=> Documents with similar topics use similar groups of words.
=> Latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus.

=> Documents are probability distributions over latent topics.
=> Topics themselves are probability distributions over words.

LDA represents documents as mixtures of topics that spit out words with certain probabilities.

It assumes that documents are produced in the following fashion:
=> Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics).
 Eg: A document is 60% business, 20% politics, 20% food. This is the document's topic distribution.
=> Generate each word in the document by:
===> First picking a topic according to the multinomial distribution sampled in the previous step (60% business, 20% politics, 20% food).
===> Using the topic to generate the word itself (according to the topic's multinomial distribution).
===> For example, if we selected the food topic, we might generate the word "apple" with 60% probability, "home" with 30% probability, and so on.

=> Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
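
The generative story above can be sketched in a few lines of plain Python. This is only an illustration of the assumed process, not how scikit-learn fits the model: the topic names, the tiny vocabularies, their probabilities and the Dirichlet parameters are all made-up values, and the helper names (sample_dirichlet, generate_document) are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, topic_word_probs, alphas, rng):
    """Generate one document following the LDA generative story."""
    topic_names = list(topic_word_probs)
    # Step 1: choose the document's topic mixture.
    theta = sample_dirichlet(alphas, rng)
    words = []
    for _ in range(n_words):
        # Step 2a: pick a topic according to the document's topic mixture.
        topic = rng.choices(topic_names, weights=theta, k=1)[0]
        # Step 2b: pick a word according to that topic's word distribution.
        vocab = list(topic_word_probs[topic])
        probs = list(topic_word_probs[topic].values())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words


# Illustrative (made-up) topics; each inner dict is a distribution over words.
topic_word_probs = {
    "business": {"market": 0.5, "stocks": 0.3, "trade": 0.2},
    "politics": {"election": 0.6, "senate": 0.4},
    "food": {"apple": 0.6, "home": 0.3, "recipe": 0.1},
}

rng = random.Random(42)
theta, words = generate_document(10, topic_word_probs, [1.0, 1.0, 1.0], rng)
print({name: round(p, 2) for name, p in zip(topic_word_probs, theta)})
print(" ".join(words))
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixture of each document and the word distribution of each topic.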


In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("../UPDATED_NLP_COURSE/05-Topic-Modeling/npr.csv")

In [4]:
len(df)

11992

In [5]:
df.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [8]:
print(df['Article'][0])

In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing o

In [9]:
from sklearn.feature_extraction.text import CountVectorizer



In [10]:
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english')

Here, max_df=0.9 => ignores words that appear in more than 90% of the documents
min_df=2 => ignores words that appear in fewer than 2 documents, i.e. we keep only words that appear in at least 2 documents
stop_words='english' => removes all English stop words

We could do the above steps ourselves by tokenizing each document with spaCy and removing stop words. However, we shall stick to CountVectorizer for now.
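
To make the max_df and min_df rules concrete, here is a rough sketch of document-frequency filtering done by hand on a toy corpus. It is a simplification under stated assumptions: it splits on whitespace and skips stop-word removal, whereas CountVectorizer uses its own tokenizer. The function name filter_vocabulary and the four toy documents are made up for illustration.

```python
from collections import Counter


def filter_vocabulary(docs, max_df=0.9, min_df=2):
    """Keep terms whose document frequency is within [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not term frequency).
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )


# Toy corpus (made up): "market" is in every document, most other words in only one.
docs = [
    "market stocks trade",
    "market election vote",
    "market stocks recipe",
    "market election soup",
]

vocab = filter_vocabulary(docs)
print(vocab)  # ['election', 'stocks']
```

"market" is dropped because it appears in 4 of 4 documents (more than 90%), and "trade", "vote", "recipe" and "soup" are dropped because each appears in only one document.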

Since this is unsupervised learning, there is no train/test split: we don't have labels to test against.

In [11]:
dtm = cv.fit_transform(df['Article'])

In [12]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
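
As a quick sanity check on the output above, the reported shape and number of stored elements give a sense of how sparse the document-term matrix is. The numbers below are copied from the repr above; the variable names are only for illustration.

```python
# Values copied from the sparse matrix summary shown above.
n_docs, n_terms = 11992, 54777
n_stored = 3033388

# Fraction of cells in the matrix that hold a non-zero count.
density = n_stored / (n_docs * n_terms)
# Average number of distinct vocabulary terms per article.
avg_terms_per_doc = n_stored / n_docs

print(f"density: {density:.4%}")
print(f"average distinct terms per document: {avg_terms_per_doc:.1f}")
```

Only about 0.46% of the cells are non-zero (roughly 253 distinct terms per article), which is why the matrix is stored in a sparse format.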

In [13]:
from sklearn.decomposition import LatentDirichletAllocation

In [14]:
LDA = LatentDirichletAllocation(n_components = 7, random_state = 42)

In [15]:
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)