Latent Dirichlet Allocation (LDA)<br>
- named for the Dirichlet distribution, which it uses as the prior over topic mixtures
- assumptions: documents with similar topics use similar groups of words, so latent topics can be found by searching for groups of words that frequently occur together in documents across the corpus
- treats documents as probability distributions over latent topics, and topics as probability distributions over words
- this is unsupervised learning!

LDA represents documents as mixtures of topics that spit out words with certain probabilities. 

It assumes that documents are produced in the following fashion:<br>
- decide on the number of words N the document will have
- choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics)
    - e.g. 60% business, 20% politics, 20% food
- generate each of the N words by:
    - picking a topic according to the document's topic mixture
    - using that topic to generate the word itself (according to the topic's multinomial distribution), e.g. if the food topic is picked we may generate the word "apple" with 60% probability
<br>
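The generative story above can be sketched in a few lines of NumPy. This is only an illustration and not part of the original notebook: the vocabulary, the topic-word probabilities and the prior `alpha` are all made-up values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 topics over a 5-word vocabulary.
vocab = np.array(["market", "vote", "apple", "bank", "bread"])
topic_word = np.array([
    [0.50, 0.05, 0.05, 0.35, 0.05],  # "business"
    [0.10, 0.70, 0.05, 0.10, 0.05],  # "politics"
    [0.05, 0.05, 0.60, 0.00, 0.30],  # "food": "apple" with 60% probability
])
alpha = np.full(3, 0.5)  # symmetric Dirichlet prior (an assumed value)


def generate_document(n_words):
    # Draw the document's topic mixture from the Dirichlet prior.
    theta = rng.dirichlet(alpha)
    # For each word position, pick a topic according to that mixture ...
    topics = rng.choice(len(theta), size=n_words, p=theta)
    # ... then draw the word from the chosen topic's word distribution.
    words = [str(rng.choice(vocab, p=topic_word[t])) for t in topics]
    return theta, topics, words


theta, topics, words = generate_document(n_words=8)
print(theta.round(2))
print(words)
```

Fitting LDA is the reverse problem: only `words` is observed, and both `theta` and `topic_word` have to be inferred.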

Under these assumptions, LDA works backwards from the observed documents to infer a set of topics that is likely to have generated them

<b>We need to choose the number of topics K before we begin</b>

To start, go through each document and randomly assign each word in the document to one of the K topics. This random assignment already gives you both topic representations of all the documents and word distributions of all the topics (these won't initially make any sense!)<br>
Then iterate over every word in every document to improve these topics.<br>
For every word in every document and for each topic t we calculate:
- P(topic t | document d) = proportion of words in document d that are currently assigned to topic t
- P(word w | topic t) = proportion of assignments to topic t over all documents that come from this word w

Reassign w a new topic, where we choose topic t with probability proportional to P(topic t | document d) * P(word w | topic t). This product is essentially the probability that topic t generated word w<br>
After repeating this many times, the assignments eventually reach a roughly steady state
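The reassignment loop described above can be sketched on a toy corpus. This is a simplified illustration under stated assumptions: the corpus, `K` and the helper names are hypothetical, the counts include the word being reassigned, and no smoothing priors are used. Practical Gibbs samplers handle both differently, and scikit-learn's `LatentDirichletAllocation` fits the model with variational Bayes rather than this sampling scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy corpus: each document is a list of word ids (vocabulary size V).
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3], [0, 1, 2, 3]]
K, V = 2, 4

# Randomly assign every word occurrence to one of the K topics.
assign = [rng.integers(K, size=len(doc)) for doc in docs]


def count_topic_word(docs, assign):
    # counts[t, w] = number of occurrences of word w currently assigned to topic t
    counts = np.zeros((K, V))
    for doc, topics in zip(docs, assign):
        for word, topic in zip(doc, topics):
            counts[topic, word] += 1
    return counts


def sweep(docs, assign):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            counts = count_topic_word(docs, assign)  # recounted each time for clarity
            # P(topic t | document d): share of words in d assigned to t
            p_topic_given_doc = np.bincount(assign[d], minlength=K) / len(doc)
            # P(word w | topic t): share of topic t's assignments that are word w
            topic_totals = np.maximum(counts.sum(axis=1), 1)  # guard against empty topics
            p_word_given_topic = counts[:, w] / topic_totals
            weights = p_topic_given_doc * p_word_given_topic
            # Sample the new topic with probability proportional to the product.
            assign[d][i] = rng.choice(K, p=weights / weights.sum())


for _ in range(20):
    sweep(docs, assign)

print([a.tolist() for a in assign])
```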

<b>It gives us a topic number for each document (its most probable topic). We can then look at the most common words within each topic, and it is up to us to determine what that topic means</b>

In [1]:
import pandas as pd

In [2]:
npr = pd.read_csv('npr.csv') # data on articles 

In [7]:
type(npr['Article']) # the 'Article' column is a pandas Series holding the article texts

pandas.core.series.Series

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
cv = CountVectorizer(max_df = 0.9, min_df = 2, stop_words = 'english') 
# max_df=0.9 ignores words that appear in more than 90% of the documents
# min_df=2 ignores words that appear in fewer than 2 documents
# both accept a float (proportion of documents) or an integer (absolute number of documents)
# stop_words='english' removes common English stop words

In [9]:
dtm = cv.fit_transform(npr['Article']) # dtm is document term matrix

In [13]:
from sklearn.decomposition import LatentDirichletAllocation

In [14]:
LDA = LatentDirichletAllocation(n_components = 7, random_state = 42) 
# n_components is the number of topics
# random_state fixes the random seed so that re-running gives the same output;
# this matters because the model is initialised randomly

In [15]:
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

In [18]:
# Grab the vocab of words
cv.get_feature_names_out() # all the words in the documents

array(['00', '000', '00000', ..., 'ángel', 'émigrés', 'überfunky'],
      dtype=object)

In [28]:
# Grab the topics
print(LDA.components_.shape) # (n_topics, n_words) array of word weights per topic; normalise a row to get probabilities

# argsort returns indices ordered from smallest to largest value,
# so [-15:] gives the indices of the 15 highest-weighted words (most important last)
feature_names = cv.get_feature_names_out() # look the vocabulary up once, outside the loop

for i,topic in enumerate(LDA.components_):
    print(f"The top 15 words for topic #{i}")
    print([feature_names[index] for index in topic.argsort()[-15:]])
    print("\n")

(7, 54777)
The top 15 words for topic #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


The top 15 words for topic #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


The top 15 words for topic #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


The top 15 words for topic #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


The top 15 words for topic #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


The top 15 words for topic #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know
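The `argsort()[-15:]` idiom used to pick the top words is easier to follow on a tiny array. The weights and words below are made up for illustration.

```python
import numpy as np

# Hypothetical weights of five words within a single topic.
weights = np.array([0.1, 2.5, 0.7, 3.2, 1.4])
words = np.array(["tea", "court", "rain", "vote", "party"])

# argsort orders indices from smallest to largest weight,
# so the last three indices belong to the three largest weights.
top3 = weights.argsort()[-3:]
print(words[top3].tolist())        # ['party', 'court', 'vote']

# Reverse the slice to list the most important word first.
print(words[top3[::-1]].tolist())  # ['vote', 'court', 'party']
```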

In [31]:
topic_results = LDA.transform(dtm)
print(topic_results.shape) # one row per article: its probability distribution over the 7 topics

(11992, 7)


In [32]:
print(topic_results[0].round(2))

[0.02 0.68 0.   0.   0.3  0.   0.  ]


In [34]:
print(topic_results[0].argmax()) # this is the topic number of highest probability

1


In [35]:
npr['Topic'] = topic_results.argmax(axis=1)
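Taking `argmax(axis=1)` keeps only the dominant topic and discards the rest of the mixture. A short sketch, using the first row printed above plus one made-up row, shows how the top probability could be kept alongside the label (the `strength` name is an assumption, not something from the notebook).

```python
import numpy as np

doc_topic = np.array([
    [0.02, 0.68, 0.00, 0.00, 0.30, 0.00, 0.00],  # first article: mostly topic 1, partly topic 4
    [0.10, 0.10, 0.50, 0.10, 0.10, 0.05, 0.05],  # hypothetical second article
])

dominant = doc_topic.argmax(axis=1)  # index of the most probable topic in each row
strength = doc_topic.max(axis=1)     # how much of the document that topic explains

print(dominant.tolist())  # [1, 2]
print(strength.tolist())  # [0.68, 0.5]
```

A low `strength` value suggests the article is a genuine mixture, so a single label may be misleading.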

In [37]:
npr

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2
...,...,...
11987,The number of law enforcement officers shot an...,1
11988,"Trump is busy these days with victory tours,...",4
11989,It’s always interesting for the Goats and Soda...,3
11990,The election of Donald Trump was a surprise to...,4


Non-negative Matrix Factorization (NMF)<br>
- unsupervised algorithm
- simultaneously performs dimensionality reduction and clustering
- can be used in conjunction with TF-IDF to model topics across documents

Given a non-negative matrix A, find a k-dimensional approximation in terms of non-negative factors W and H<br>

A (n x m) ≈ W (n x k) . H (k x m)<br>

A is the data matrix, W is the basis vectors, H is the coefficient matrix<br>

The rows in A are features and the columns are objects. The rows in W are features. The columns in H are objects<br>

Each basis vector can be interpreted as a cluster. The memberships of objects in these clusters are encoded by H<br>

As in LDA, we choose the number of topics k ourselves, and we must interpret each topic from its coefficient values
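The factorization can be checked on a small random matrix. This is a minimal sketch with made-up data. Note that scikit-learn uses the transposed convention to the one above: rows of the input are objects (samples) and columns are features, so `fit_transform` returns the per-object weights and `components_` holds the k basis vectors over the features.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((6, 4))  # hypothetical non-negative data: 6 objects, 4 features

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(A)  # shape (6, 2): weight of each object on each component
H = model.components_       # shape (2, 4): each component expressed over the features

print(W.shape, H.shape)
print(bool((W >= 0).all() and (H >= 0).all()))  # both factors stay non-negative
print(np.linalg.norm(A - W @ H))                # reconstruction error of the rank-2 approximation
```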

1. Construct vector space model for documents (after stopword filtering), resulting in a term-document matrix A
2. Apply TF-IDF term weight normalisation to A
3. Normalize TF-IDF vectors to unit length
4. Initialise factors using NNDSVD on A
5. Apply Projected Gradient NMF to A

In [38]:
npr = pd.read_csv('npr.csv')

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [40]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [41]:
dtm = tfidf.fit_transform(npr['Article'])

In [42]:
from sklearn.decomposition import NMF

In [43]:
nmf_model = NMF(n_components=7, random_state=42)

In [44]:
nmf_model.fit(dtm)



NMF(n_components=7, random_state=42)

In [47]:
for index,topic in enumerate(nmf_model.components_):
    print(f"The top 15 words for topic #{index}")
    # get words with highest coefficient values
    print([tfidf.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print("\n")

The top 15 words for topic #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


The top 15 words for topic #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


The top 15 words for topic #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


The top 15 words for topic #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


The top 15 words for topic #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


The top 15 words for topic #5
['love', 've', 'don', 'al

In [48]:
topic_results = nmf_model.transform(dtm)
npr['Topic'] = topic_results.argmax(axis=1)

In [49]:
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6


In [50]:
mytopics = {0:'health',1:'election',2:'legis',3:'poli',4:'election',5:'lifestyle',6:'edu'}
npr['Topic Label'] = npr['Topic'].map(mytopics)

In [51]:
npr.head()

Unnamed: 0,Article,Topic,Topic Label
0,"In the Washington of 2016, even when the polic...",1,election
1,Donald Trump has used Twitter — his prefe...,1,election
2,Donald Trump is unabashedly praising Russian...,1,election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,poli
4,"From photography, illustration and video, to d...",6,edu


In [3]:
npr['Article'].iloc[4] # fifth article, selected by position rather than by index label