In [1]:
"""
Topic modeling allows us to analyze large volumes
of text by clustering documents into topics.

If we have unlabeled data, then we can 'discover' labels.
In the case of text data, this means 'discovering' clusters
of documents, grouped together by topic.

Assumptions of Latent Dirichlet Allocation:
    1. Documents are probability distributions over latent (undiscovered) topics.
       --> a document can belong to multiple topics with different probabilities.
       
    2. Topics themselves are probability distributions over words.

How Latent Dirichlet Allocation assumes we produce documents:
    1. We first decide the number of words N the document will have.
    2. Choose a mixture of topics for the document (according to a Dirichlet
       distribution over a fixed set of K topics).
       e.g. 60% business, 20% politics, 20% food.
    3. Then we generate each word in the document by picking a topic
       according to the multinomial distribution that we sampled previously
       (60% business, 20% politics, 20% food).
    4. Using the topic, we generate the word itself (as per the topic's distribution).
       e.g. if we selected the food topic, we might generate the word "biryani"
       with 60% probability, "home" with 30% probability, and so on.
    5. Assuming this type of generative model for a collection of documents, LDA
       then tries to backtrack from the documents to find a set of topics that are
       likely to have generated the collection.
    
We have to tell LDA how many topics (K) need to be discovered.

For every word in every document and for each topic t, we calculate:
    p(word w | topic t) --> proportion of assignments to topic t, over all documents, that come from word w.
    p(topic t | document d) --> proportion of words in document d that are currently assigned to topic t.
"""

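The generative story described above can be sketched in a few lines of NumPy. This is only an illustrative toy, not how scikit-learn fits LDA: the vocabulary, the `topic_word` matrix, the `alpha` prior, and the helper `generate_document` are all made-up assumptions, and the document length is fixed instead of being sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and topics (not taken from the NPR data).
vocab = ["market", "stocks", "vote", "senate", "biryani", "home"]
topics = ["business", "politics", "food"]

# Hypothetical per-topic word distributions; each row sums to 1.
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05, 0.00, 0.00],  # business
    [0.05, 0.05, 0.50, 0.40, 0.00, 0.00],  # politics
    [0.00, 0.00, 0.00, 0.10, 0.60, 0.30],  # food
])


def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Generate one document following the LDA generative assumptions."""
    # Choose the document's topic mixture from a Dirichlet distribution.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word according to the mixture.
        z = rng.choice(len(theta), p=theta)
        # Pick the word itself from that topic's word distribution.
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return theta, words


theta, words = generate_document(8, [0.5, 0.5, 0.5], topic_word, vocab, rng)
print(dict(zip(topics, theta.round(2))))
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates a plausible `topic_word` matrix and a `theta` for each document.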

In [2]:
file_path = "/home/viper/Downloads/UPDATED_NLP_COURSE/TextFiles/npr.csv"

In [3]:
import pandas as pd

In [4]:
npr = pd.read_csv(file_path)

In [5]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

In [63]:
cv = CountVectorizer(max_df=0.9, min_df=2, stop_words='english') 
# max_df = 0.9 => discard words that appear in more than 90% of the documents.
# min_df = 2 => only keep words that appear in at least 2 documents.
# stop_words = 'english' => remove common English stop words (e.g. 'the', 'is') from the vocabulary.

In [64]:
document_term_metric = cv.fit_transform(npr['Article'])

In [65]:
document_term_metric.shape

(11992, 54777)

In [40]:
from sklearn.decomposition import LatentDirichletAllocation

In [41]:
lda = LatentDirichletAllocation(n_components=7, random_state=42)

In [66]:
lda.fit(document_term_metric)

In [None]:
# Grab the vocabulary of words

In [67]:
len(cv.get_feature_names_out()) # array of all the words in the vocabulary (across all documents)

54777

In [68]:
import random

In [69]:
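feature_names = cv.get_feature_names_out()
for i in range(10):
    # randrange excludes the upper bound; randint(0, n) could return n and raise IndexError
    index = random.randrange(len(feature_names))
    print(feature_names[index])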

smallness
debilitating
emerges
faction
community
lakeland
nagging
antihistamine
deadlocks
loin


In [70]:
# Grab the topics

In [71]:
len(lda.components_)

7

In [72]:
type(lda.components_)

numpy.ndarray

In [73]:
lda.components_.shape # (topics, words): per-topic word weights (unnormalized pseudo-counts, not probabilities)

(7, 54777)
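The rows of `lda.components_` are word weights, not probabilities. If actual per-topic word probabilities are wanted, each row can be divided by its sum. A minimal sketch, using a small made-up array (`toy_components`) as a stand-in for `lda.components_`:

```python
import numpy as np

# Hypothetical stand-in for lda.components_: 2 topics x 4 words of pseudo-counts.
toy_components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [0.2, 0.8, 6.0, 3.0],
])

# Normalize each row so it sums to 1, giving p(word | topic).
topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)
print(topic_word_probs.round(2))
```

Normalizing does not change the ranking of words within a topic, so `argsort` gives the same top words either way.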

In [74]:
single_topic = lda.components_[2]

In [75]:
single_topic.argsort() # returns the indices that would sort the array in ascending order (lowest weight first)

array([34110, 24223, 23865, ..., 36283, 28659, 42993])
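Because `argsort` sorts in ascending order, the highest-weight words sit at the end of the index array. A small sketch with made-up weights (`toy_weights` and `toy_words` are illustrative assumptions) shows how slicing from the end picks the top entries, and how reversing the slice lists them from most to least important:

```python
import numpy as np

# Hypothetical weights for four words in one topic.
toy_words = np.array(["loin", "food", "water", "lakeland"])
toy_weights = np.array([0.1, 5.0, 2.5, 0.3])

order = toy_weights.argsort()      # indices from lowest to highest weight
top_two = order[-2:]               # last two indices = two largest weights
top_two_desc = top_two[::-1]       # reversed: largest weight first

print(order)                       # [0 3 2 1]
print(toy_words[top_two])          # ['water' 'food']
print(toy_words[top_two_desc])     # ['food' 'water']
```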

In [76]:
top_twenty_words = single_topic.argsort()[-20:]

In [77]:
# top_twenty_words holds vocabulary indices, ordered from lowest to highest weight
for index in top_twenty_words:
    print(cv.get_feature_names_out()[index])

little
know
don
year
make
way
world
family
home
day
time
water
city
new
years
food
just
people
like
says


In [78]:
# Grab the highest probability words per topic

In [79]:
for i, topic in enumerate(lda.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{i+1}")
    print([cv.get_feature_names_out()[index] for index in topic.argsort()[-15:]])
    print()

THE TOP 15 WORDS FOR TOPIC #1
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']

THE TOP 15 WORDS FOR TOPIC #2
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']

THE TOP 15 WORDS FOR TOPIC #3
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']

THE TOP 15 WORDS FOR TOPIC #4
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']

THE TOP 15 WORDS FOR TOPIC #5
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']

THE TOP 15 WORDS FOR TOPIC #6
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think', 'peo

In [80]:
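# Assigning topic numbers to Articles
# Note: argmax returns 0-based indices, so Topic k below corresponds to "TOPIC #k+1" in the printout above.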

In [82]:
topic_results = lda.transform(document_term_metric)

In [92]:
topic_results[0].round(3).argmax()

1

In [93]:
npr['Topic'] = topic_results.argmax(axis=1)
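Each row returned by `lda.transform` is a distribution over topics for one document, and `argmax(axis=1)` keeps only the most probable topic. The sketch below uses a made-up 3 x 3 array (`toy_doc_topic`) in place of `topic_results`; the extra `TopicProb` column is an assumption added here to show how strong the winning topic is.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic probabilities: 3 documents x 3 topics, rows sum to 1.
toy_doc_topic = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.45, 0.25],
])

toy_df = pd.DataFrame({"Article": ["doc a", "doc b", "doc c"]})
toy_df["Topic"] = toy_doc_topic.argmax(axis=1)    # index of the most probable topic
toy_df["TopicProb"] = toy_doc_topic.max(axis=1)   # probability of that topic
print(toy_df)
```

The third document is assigned topic 1 with a probability of only 0.45, a reminder that a single hard label hides how mixed a document may be.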

In [94]:
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2


In [95]:
npr['Topic'].value_counts()

5    2458
1    2004
4    1943
2    1838
3    1485
0    1313
6     951
Name: Topic, dtype: int64
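Topic numbers are hard to read on their own, so one possible next step is to map them to names chosen by a person after inspecting the top words. The sketch below uses a toy frame and invented labels (`toy_labels` is purely hypothetical; LDA itself never produces names).

```python
import pandas as pd

# Hypothetical topic assignments for five documents.
toy_topics = pd.DataFrame({"Topic": [1, 1, 4, 0, 1]})

# Hypothetical human-chosen names, keyed by 0-based topic index.
toy_labels = {0: "economy", 1: "security", 4: "elections"}

toy_topics["TopicLabel"] = toy_topics["Topic"].map(toy_labels)
label_counts = toy_topics["TopicLabel"].value_counts()
print(label_counts)
```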
