In natural language processing, **Latent Dirichlet Allocation (LDA)** is a generative statistical model that explains a set of observations through unobserved groups, where each group explains why some parts of the data are similar. LDA is an example of a topic model: observations (e.g., words) are collected into documents, and each word's presence is attributable to one of the document's topics. Each document is assumed to contain only a small number of topics. (source: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
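The generative story described above can be sketched with NumPy. This is a toy illustration only and is not part of the notebook's pipeline: the sizes, the prior values `alpha` and `beta`, and all variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sizes, chosen arbitrarily for illustration
n_topics, vocab_size, doc_length = 3, 8, 20
alpha = np.full(n_topics, 0.5)   # assumed document-topic prior
beta = np.full(vocab_size, 0.5)  # assumed topic-word prior

# one word distribution per topic, shape (n_topics, vocab_size)
topic_word = rng.dirichlet(beta, size=n_topics)

# topic mixture for a single document, shape (n_topics,)
doc_topic = rng.dirichlet(alpha)

# for each word position: draw a topic, then draw a word id from that topic
topic_assignments = rng.choice(n_topics, size=doc_length, p=doc_topic)
words = np.array([rng.choice(vocab_size, p=topic_word[z]) for z in topic_assignments])
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates plausible values for `topic_word` and for each document's `doc_topic`.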

In [43]:
# import libs
import pandas as pd

In [44]:
# get the data
npr_df = pd.read_csv('npr.csv')
npr_df.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [45]:
npr_df['Article'][8]

'If movies were trying to be more realistic, perhaps the way to summon Batman shouldn’t have been the    —   it should have been the bat squeak. New research from the Bat Lab for   at Tel Aviv University found that bats are ”vocalizing” more information than many researchers previously thought. And researchers were able to decipher what the bats were squeaking to each other about  —   often they were bickering over things like food, sleep and mating. ”It’s not as if now we can understand everything. It’s not as if we have a    dictionary,” says Dr. Yossi Yovel, a   at Tel Aviv University and a member of the Bat Lab. ”But what we’ve found is that this cacophony that you could hear . .. actually contains much more information than previously believed. So, all of [this] shouting, all of these vocalizations that were previously all categorized as aggressive vocalizations, we can now divide them,” Yovel tells NPR’s Scott Simon. ”For example, we can classify whether the bats are arguing over

In [46]:
# feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer

"""

max_df : float or int, default=1.0
    When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.

min_df : float or int, default=1
    When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.

"""
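# a float threshold is a proportion of documents, an int is an absolute count:
# drop terms found in more than 90% of the articles or in fewer than 2 articles
idf = TfidfVectorizer(max_df=0.9, min_df=2, stop_words='english')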
dtm = idf.fit_transform(npr_df['Article'])
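LDA is formulated over word counts, and the scikit-learn examples pair `LatentDirichletAllocation` with `CountVectorizer` rather than with TF-IDF weights. TF-IDF up-weights rare terms, which may be one reason several topics printed later in this notebook are dominated by rare proper nouns. A minimal sketch of the count-based variant follows. The four-sentence corpus and the names `docs`, `count_vec`, `counts`, `toy_lda` and `doc_topics` are made up for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy corpus, invented for illustration only
docs = [
    "the court ruled on the election case",
    "the election results went to court",
    "bats use sound to find food at night",
    "researchers recorded bats fighting over food",
]

# raw term counts instead of TF-IDF weights
count_vec = CountVectorizer(stop_words='english')
counts = count_vec.fit_transform(docs)      # sparse matrix of integer counts

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = toy_lda.fit_transform(counts)  # shape (n_docs, n_components)
```

On the NPR data the same swap would presumably keep `max_df=0.9, min_df=2, stop_words='english'` and only change the vectorizer class; the resulting topics would differ from the outputs recorded in this notebook.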

In [47]:
# LDA 
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=9, random_state=42)
lda.fit(dtm)

LatentDirichletAllocation(n_components=9, random_state=42)

In [48]:
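# note: get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer versions only offer get_feature_names_out(), which returns a NumPy array
print("length: ",len(idf.get_feature_names()))
print("type: ",type(idf.get_feature_names()))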

length:  54777
type:  <class 'list'>




In [49]:
idf.get_feature_names_out()

array(['00', '000', '00000', ..., 'ángel', 'émigrés', 'überfunky'],
      dtype=object)

In [50]:
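# grab a random word from the vocabulary
import random
feature_names = idf.get_feature_names_out()
word_ind = random.randint(0, len(feature_names) - 1)  # randint is inclusive on both ends
feature_names[word_ind]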

'ailment'

In [51]:
# grab the topics
len(lda.components_)

9

In [52]:
type(lda.components_)

numpy.ndarray

In [53]:
lda.components_.shape

(9, 54777)

In [54]:
single_topic = lda.components_[0]
"""
argsort returns the index positions that would sort this array in ascending order

Input unsorted array :  [2 0 1 5 4 1 9]
Output sorted array indices :  [1 2 5 0 4 3 6]
"""
single_topic.argsort()

array([ 4635, 28075, 27321, ..., 42098,  6983, 27108])
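The `argsort` behaviour can be checked on the small array from the docstring above. This is a standalone sketch; the names `weights`, `order` and `top_3` are illustrative, and `kind='stable'` is passed only so that the tie between the two `1` values is broken deterministically.

```python
import numpy as np

weights = np.array([2, 0, 1, 5, 4, 1, 9])

# indices that would sort the array in ascending order
order = np.argsort(weights, kind='stable')

# the last entries of `order` point at the largest values;
# reversing the slice lists them from largest to smallest
top_3 = order[-3:][::-1]
top_3_values = weights[top_3]
```

The same slicing idea, `argsort()[-10:]`, is what selects the highest-weighted words of a topic in the next cells.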

In [55]:
single_topic.argsort()[-10:] # indices of the 10 highest-weighted words, in ascending order of weight

array([12529, 30642,  9534, 36656, 27843,  4518,  2307, 42098,  6983,
       27108])

In [56]:
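top_ten_words = single_topic.argsort()[-10:]
feature_names = idf.get_feature_names_out()  # fetch the vocabulary once, not on every iteration
"""
The 10 highest-weighted words of a topic, printed from lowest to highest weight; ideally they suggest a common theme.
"""
for index in top_ten_words:
  print(feature_names[index])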



cáceres
mcgee
cladding
philae
lander
backpage
albee
rosetta
brezler
ketamine


In [57]:
# grab the highest-weighted words per topic (topics are numbered from 1 in this printout)
feature_names = idf.get_feature_names_out()  # fetch the vocabulary once
for i, topic in enumerate(lda.components_):
  print(f"The top 15 words for topic #{i + 1}")
  print([feature_names[index] for index in topic.argsort()[-15:]])
  print("\n")

The top 15 words for topic #1
['kjellberg', 'windham', 'mantis', 'tuareg', 'klingon', 'cáceres', 'mcgee', 'cladding', 'philae', 'lander', 'backpage', 'albee', 'rosetta', 'brezler', 'ketamine']


The top 15 words for topic #2
['guillo', 'takata', 'carlile', 'durst', 'amt', 'brt', 'rayos', 'puig', 'hoboken', 'hekmati', 'ameera', 'korematsu', 'cernan', 'kerber', 'crapser']


The top 15 words for topic #3
['heffernan', 'fernández', 'bald', 'noriega', 'ammon', 'harney', 'dalí', 'gula', 'insanity', 'yanez', 'ants', 'malheur', 'occupiers', 'refuge', 'bundy']


The top 15 words for topic #4
['mammography', 'probuphine', 'rubella', 'radium', 'kalanick', 'herrera', 'gingerbread', 'amaranth', 'headset', 'listeria', 'somaliland', 'neanderthals', 'wilmore', 'dyslexia', 'gulen']


The top 15 words for topic #5
['intelligence', 'reports', 'security', 'attack', 'police', 'russian', 'comey', 'fbi', 'syria', 'isis', 'president', 'said', 'court', 'trump', 'russia']


The top 15 words for topic #6
['black

In [58]:
topic_res = lda.transform(dtm)

In [59]:
topic_res.shape

(11992, 9)

In [60]:
topic_res

array([[0.00655119, 0.00655119, 0.00655135, ..., 0.00655119, 0.00655119,
        0.63097523],
       [0.00836978, 0.00836966, 0.00837048, ..., 0.00836966, 0.00836966,
        0.59658806],
       [0.00904441, 0.00904441, 0.00904441, ..., 0.00904441, 0.00904441,
        0.56248219],
       ...,
       [0.00725201, 0.00725201, 0.00725206, ..., 0.00753302, 0.00725308,
        0.8723203 ],
       [0.00770954, 0.00770954, 0.00770954, ..., 0.00770954, 0.00770954,
        0.93831418],
       [0.00893004, 0.00893004, 0.00893004, ..., 0.00893004, 0.00893004,
        0.92855641]])

In [61]:
topic_res[3] # these are essentially probabilities of belonging to each topic. the 4th article (ind=3) belongs to topic index 4 (the 5th topic) with probability of about 0.58, and to topic index 8 with about 0.34

array([0.01148878, 0.01148878, 0.01148878, 0.01148878, 0.58362811,
       0.01148878, 0.01148878, 0.01148878, 0.33595042])
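To make the reading of this row concrete, the sketch below copies the printed values into a standalone array and checks that they behave like a probability distribution. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# row copied from the output of topic_res[3] above
row = np.array([0.01148878, 0.01148878, 0.01148878, 0.01148878, 0.58362811,
                0.01148878, 0.01148878, 0.01148878, 0.33595042])

total = row.sum()              # approximately 1.0, up to rounding of the printed values
best_topic = row.argmax()      # 0-based index of the largest weight
best_weight = row[best_topic]
runner_up = row.argsort()[-2]  # 0-based index of the second-largest weight
```

Here `best_topic` is 4 and `runner_up` is 8, so this article is a mix of two topics rather than a clean member of one; taking only the `argmax` discards that information.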

In [62]:
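# label each article with its most probable topic; argmax returns a 0-based index,
# so label 4 here corresponds to "topic #5" in the printout above
npr_df['topics'] = topic_res.argmax(axis=1)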

In [63]:
npr_df.head(35)

Unnamed: 0,Article,topics
0,"In the Washington of 2016, even when the polic...",8
1,Donald Trump has used Twitter — his prefe...,8
2,Donald Trump is unabashedly praising Russian...,8
3,"Updated at 2:50 p. m. ET, Russian President Vl...",4
4,"From photography, illustration and video, to d...",8
5,I did not want to join yoga class. I hated tho...,8
6,With a who has publicly supported the debunk...,8
7,"I was standing by the airport exit, debating w...",8
8,"If movies were trying to be more realistic, pe...",8
9,"Eighteen years ago, on New Year’s Eve, David F...",8
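A quick way to see how unevenly the articles are spread over the topics is to count the labels. The sketch below uses only the ten labels displayed above, copied into a standalone Series, so it does not reflect the full dataset; on the real frame the equivalent call would presumably be `npr_df['topics'].value_counts()`.

```python
import pandas as pd

# topic labels copied from the ten rows displayed above
topics = pd.Series([8, 8, 8, 4, 8, 8, 8, 8, 8, 8], name='topics')

counts = topics.value_counts()               # number of articles per topic label
share = topics.value_counts(normalize=True)  # same, as a fraction of all rows
```

In this small sample nine of the ten articles fall under label 8. If the full dataset shows a similar imbalance, that may be a hint to revisit the feature extraction or the number of topics.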
