# Latent Dirichlet Allocation

## Data

We will use articles from NPR (National Public Radio), obtained from their website, [www.npr.org](http://www.npr.org).

In [2]:
import pandas as pd

npr = pd.read_csv('D:\\ML-Datasets\\NLP-Datasets\\npr.csv')

# Check the first few rows
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


Notice that the articles come with no topic labels. Let's use LDA to try to discover clusters of articles that share a topic.

In [4]:
# Let's explore the first article
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

In [5]:
# Check the number of articles
len(npr)

11992

## Preprocessing

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
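
To make the two thresholds concrete, here is a minimal pure-Python sketch of the document-frequency rule described above. It only illustrates the filtering logic on a made-up four-document corpus and is not scikit-learn's implementation; the helper name `filter_by_document_frequency` and the whitespace tokenization are assumptions for the example.

```python
from collections import Counter


def filter_by_document_frequency(docs, max_df=1.0, min_df=1):
    """Return the sorted vocabulary that survives the max_df/min_df thresholds.

    Hypothetical helper for illustration only: tokenizes on whitespace and
    treats a float threshold as a proportion of documents, an int as a count.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term at least once
    df = Counter(term for doc in docs for term in set(doc.lower().split()))

    # Convert proportional thresholds to absolute document counts
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df

    # Keep terms whose document frequency is neither strictly higher than
    # max_count nor strictly lower than min_count
    return sorted(t for t, count in df.items() if min_count <= count <= max_count)


docs = [
    "the senate passed the health bill",
    "the hospital expanded health coverage",
    "the band released the album",
    "the election results surprised the senate",
]

# "the" appears in 4/4 documents (> 75%), so max_df=0.75 removes it;
# words that appear in only one document fall below min_df=2.
vocabulary = filter_by_document_frequency(docs, max_df=0.75, min_df=2)
print(vocabulary)
```

In this toy corpus only the words shared by exactly two documents survive, which is the kind of mid-frequency vocabulary the thresholds are meant to keep.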

In [6]:
# We want to ignore words that appear in almost every document, since they do not help
# LDA tell the topics apart

# max_df=0.95 drops words that appear in more than 95% of the documents, and min_df=2 drops words
# that appear in fewer than 2 documents, so a word must occur in at least 2 documents to be kept

cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [7]:
# Now we will apply the count vectorizer and generate a document-term matrix (DTM)
dtm = cv.fit_transform(npr['Article'])

# Check the dtm
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
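
The summary above also tells us how sparse the matrix is. The short calculation below uses only the three numbers printed in that output to estimate the density and the average number of distinct vocabulary terms per article; the variable names are illustrative.

```python
# Figures copied from the sparse-matrix summary above
n_documents = 11992
n_terms = 54777
stored_elements = 3033388

# Fraction of document/term cells that hold a non-zero count
density = stored_elements / (n_documents * n_terms)

# Average number of distinct vocabulary terms stored per article
avg_terms_per_document = stored_elements / n_documents

print(f"density: {density:.2%}")
print(f"average distinct terms per article: {avg_terms_per_document:.0f}")
```

Fewer than half a percent of the cells are non-zero, which is why the matrix is kept in a sparse format.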

## LDA

In [8]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=7, random_state=42)

In [9]:
# This can take a while: we are dealing with a large number of documents here
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=7, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

## Interpret the LDA results

In [10]:
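# Get the vocabulary of words
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions,
# call cv.get_feature_names_out() instead

len(cv.get_feature_names()) # This matches the number of columns in our sparse matrix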

54777

In [11]:
# Get the feature name at an arbitrary index
cv.get_feature_names()[25000]

'infiltrated'

In [12]:
# Print 20 randomly chosen words from the vocabulary

import random

feature_names = cv.get_feature_names()  # build the list once instead of on every iteration

for _ in range(20):
    random_word_id = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_word_id])

internationalwomensday
shout
bonus
iceberg
payoff
ashes
associate
briggs
20k
stillbirths
reform
cases
innovate
warbelow
guangdong
pastorius
heilemann
rainfall
estimations
glorified


In [13]:
# Grab the topics

len(LDA.components_) # This should match the number of topics (n_components) we specified when creating the model

7

In [14]:
# components_ is a NumPy array with one row per topic; each entry is a word's weight for that topic
# (an unnormalized pseudo-count, not a probability)
type(LDA.components_)

numpy.ndarray

In [17]:
# Check the shape of the numpy array
LDA.components_.shape

(7, 54777)
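
Each row of `components_` holds unnormalized word weights for one topic. According to the scikit-learn documentation, dividing a row by its sum turns it into a distribution over the vocabulary. The sketch below shows that normalization on a made-up 2-topic, 4-word weight table in plain Python; the numbers and the helper name `normalize_rows` are illustrative, not values from the fitted model.

```python
def normalize_rows(weights):
    """Divide every row by its sum so that each row adds up to 1."""
    return [[value / sum(row) for value in row] for row in weights]


# Made-up topic-word weights: 2 topics (rows) x 4 vocabulary words (columns)
toy_components = [
    [6.0, 2.0, 1.0, 1.0],
    [0.5, 0.5, 3.0, 4.0],
]

topic_word_probs = normalize_rows(toy_components)

for topic_id, row in enumerate(topic_word_probs):
    print(topic_id, [round(p, 2) for p in row])
```

With NumPy arrays, the same step can be written as `LDA.components_ / LDA.components_.sum(axis=1)[:, np.newaxis]`. Normalizing does not change the ranking of words within a topic, so the top-word lists are the same either way.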

In [18]:
# Now let's grab the highest-weighted words in a topic (we will do it for the first topic)

# Get the weights of all words for this topic
single_topic = LDA.components_[0]

# argsort() returns the index positions that would sort the array in ascending order,
# so the last 10 positions are the indices of the 10 highest-weighted words,
# ordered from lowest to highest weight
top_ten_words = single_topic.argsort()[-10:]

In [19]:
for index in top_ten_words:
    print(cv.get_feature_names()[index])

new
percent
government
company
million
care
people
health
said
says
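
Since `argsort()` orders indices from the smallest to the largest value, the list above runs from the tenth-ranked word up to the top-ranked one. The following plain-Python sketch mimics `argsort()[-k:]` on a made-up five-word vocabulary and then reverses the result so the strongest word comes first; the vocabulary, weights and helper name are assumptions for illustration.

```python
def top_k_indices(weights, k):
    """Plain-Python stand-in for weights.argsort()[-k:] (ascending order)."""
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    return ascending[-k:]


# Made-up vocabulary and topic weights, for illustration only
toy_vocabulary = ["care", "health", "music", "said", "vote"]
toy_weights = [40.0, 75.0, 2.0, 90.0, 5.0]

top_three = top_k_indices(toy_weights, 3)
ascending_words = [toy_vocabulary[i] for i in top_three]         # weakest of the top 3 first
descending_words = [toy_vocabulary[i] for i in top_three[::-1]]  # strongest first

print(ascending_words)
print(descending_words)
```

On the fitted model, the same reversal would be `single_topic.argsort()[-10:][::-1]`.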


#### Now we will set up a loop that prints the 15 highest-weighted words for each of the seven topics

In [21]:
for index, topic in enumerate(LDA.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{index}")
    # Use a separate name for the word index so it is not confused with the topic index
    print([cv.get_feature_names()[word_index] for word_index in topic.argsort()[-15:]])
    print("\n")

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',

In [22]:
# Now we will attach the topic numbers to the original articles

# To do this, we call the LDA model's transform method on the DTM, which returns
# each article's distribution over the topics
topic_results = LDA.transform(dtm)

# Now we will check the shape
topic_results.shape

(11992, 7)

In [23]:
# The first row holds the probabilities of the first article belonging to each of the 7 topics
topic_results[0]

array([1.61040465e-02, 6.83341493e-01, 2.25376318e-04, 2.25369288e-04,
       2.99652737e-01, 2.25479379e-04, 2.25497980e-04])

In [24]:
# Round the values to make them easier to read
topic_results[0].round(2) # It has a 68% probability of belonging to topic #1 (index 1)

array([0.02, 0.68, 0.  , 0.  , 0.3 , 0.  , 0.  ])

In [25]:
# We will grab the index position of the highest probability
topic_results[0].argmax()

1
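
Each row of `topic_results` is a probability distribution over the seven topics, so its entries should add up to 1, and `argmax()` simply picks the position of the largest entry. The check below re-enters the values printed above as a plain Python list to confirm both points.

```python
# Topic probabilities for the first article, copied from the output above
first_article = [
    1.61040465e-02, 6.83341493e-01, 2.25376318e-04, 2.25369288e-04,
    2.99652737e-01, 2.25479379e-04, 2.25497980e-04,
]

# The probabilities across the 7 topics sum to (approximately) 1
total = sum(first_article)

# Plain-Python equivalent of argmax(): index of the largest probability
dominant_topic = max(range(len(first_article)), key=lambda i: first_article[i])

print(round(total, 6), dominant_topic)
```

Note that the runner-up, topic #4 at about 30%, is far from negligible: taking the argmax keeps only the single most likely topic and discards this mixture.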

In [26]:
# Associate each document with its most probable topic

npr['Topic'] = topic_results.argmax(axis=1)

npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
5,I did not want to join yoga class. I hated tho...,3
6,With a who has publicly supported the debunk...,3
7,"I was standing by the airport exit, debating w...",2
8,"If movies were trying to be more realistic, pe...",3
9,"Eighteen years ago, on New Year’s Eve, David F...",2
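
Once every article carries a topic index, a quick tally shows how the articles are spread across topics. The sketch below counts the ten `Topic` values shown in the table above using only the standard library.

```python
from collections import Counter

# Topic assignments of the first ten articles, copied from the table above
first_ten_topics = [1, 1, 1, 1, 2, 3, 3, 2, 3, 2]

topic_counts = Counter(first_ten_topics)

for topic_id, count in sorted(topic_counts.items()):
    print(f"Topic #{topic_id}: {count} articles")
```

On the full DataFrame, `npr['Topic'].value_counts()` would give the same kind of summary for all 11,992 articles.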
