# TOPIC MODELING

- Discover topics in a text corpus (a collection of documents)

- The scikit-learn package is a popular choice in Python: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
- The Gensim package is also popular in Python: http://radimrehurek.com/gensim/



## Algorithms

- Non-negative matrix factorization (NMF)

- Latent Dirichlet Allocation (LDA)

# 1. Non-negative Matrix Factorization (NMF or NNMF)

NMF is a group of algorithms in multivariate analysis and linear algebra in which a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have **no negative elements**.

This non-negativity makes the resulting matrices easier to inspect. 

Also, in applications such as processing of audio spectrograms or muscular activity, 
non-negativity is inherent to the data being considered. 

Since the problem is not exactly solvable in general, it is commonly approximated numerically.

Source:
https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
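
The factorization described above can be sketched on a tiny matrix. This is a minimal illustration, assuming scikit-learn's `NMF` class; the matrix values and the settings `init="nndsvda"`, `random_state=0`, and `max_iter=1000` are choices made for this example only.

```python
import numpy as np
from sklearn.decomposition import NMF

# A small non-negative matrix V (4 rows, 3 columns); the values are made up.
V = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],
    [0.0, 3.0, 0.0],
    [0.0, 1.0, 0.0],
])

# Factorize V into W (4 x 2) and H (2 x 3).
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)
H = model.components_

print(W.shape, H.shape)
print(np.round(W @ H, 2))                  # numerical approximation of V
print((W >= 0).all() and (H >= 0).all())   # both factors are non-negative
```

The product `W @ H` is only an approximation of `V`, which is why the result is found numerically rather than solved exactly.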

### Fig. 1. A matrix V is factorized into two matrices W and H
![NMF](NMF.png)

### Fig. 2. Matrix factorization: hidden topics in 1K tweets
![MF](MF.png)

## (1) Open the JSON file and build corpus_contents, a list of the text of 1K tweets, for TF-IDF vectorization

In [None]:
import json
import numpy as np
from pprint import pprint

In [None]:
# The with statement closes the file automatically
with open('tweet_stream_halloween_1000.json') as infile:
    data = json.load(infile)

In [None]:
data[0].keys()

In [None]:
data[0]['text']

In [None]:
corpus_contents = []

for t in data:
    corpus_contents.append(t['text'])

In [None]:
pprint(corpus_contents)

In [None]:
len(corpus_contents)
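
The same extraction can be written as a list comprehension. The sketch below uses a hypothetical in-memory JSON string in place of the tweet file, and it assumes that some records might lack a `text` field; neither detail comes from the actual data set.

```python
import json

# Hypothetical stand-in for the contents of the tweet file: a JSON array of objects.
sample_json = (
    '[{"id": 1, "text": "Happy Halloween!"},'
    ' {"id": 2, "text": "Trick or treat"},'
    ' {"id": 3}]'
)

data = json.loads(sample_json)

# Keep only the records that actually carry a "text" field.
corpus_contents = [t["text"] for t in data if "text" in t]

print(len(corpus_contents))
print(corpus_contents)
```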

## (2)  Vectorize the corpus with TfidfVectorizer and create a list of unique words.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer instance that removes English stop words (stop_words='english')
# and ignores terms that appear in fewer than 2 documents (min_df=2; an integer is a document count, not a percentage).
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 2)

# Using the TfidfVectorizer instance,
# tokenize all the strings in the corpus
# and return the document-term TF-IDF matrix (one row vector per string)
doc_term_matrix = vectorizer.fit_transform(corpus_contents)

print(doc_term_matrix.shape) # 1000 documents (tweets) and 887 unique words
print(doc_term_matrix)

In [None]:
# Create the array of unique vocabulary terms; we will use it later
unique_words = vectorizer.get_feature_names_out() 
print(len(unique_words)) # number of unique words
print(unique_words[:10])

## (3) NMF Decomposition using document-term matrix with TfidfVectorizer

Scikit-learn NMF
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

In [None]:
from sklearn import decomposition

# Set the desired number of topics
num_topics = 5

# Create an NMF model object (named clf here, although NMF is a
# decomposition model rather than a classifier) with the assigned number of topics
clf = decomposition.NMF(n_components = num_topics)

# Fit the NMF model to the document-term TF-IDF matrix
# and return the document-topic matrix
# (number of documents x number of topics).

doc_top_matrix = clf.fit_transform(doc_term_matrix)

print(doc_top_matrix.shape) # check the shape of the matrix
print(doc_top_matrix)

> **< name of the NMF object >.components_** holds the topic-term matrix, with shape (number of topics, number of terms).

In [None]:
top_term_matrix = clf.components_

print(top_term_matrix.shape)
print(top_term_matrix)
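
One way to read the document-topic matrix is to take, for each document, the topic with the largest weight. The sketch below does this on a made-up four-document corpus; the names `X`, `W`, `H`, and `dominant_topic` and the NMF settings are assumptions for this example.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus with two obvious themes.
docs = [
    "pumpkin candy pumpkin",
    "candy pumpkin treat",
    "ghost haunted house",
    "haunted ghost night",
]

X = TfidfVectorizer().fit_transform(docs)   # documents x terms

model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(X)                  # documents x topics
H = model.components_                       # topics x terms

print(X.shape, W.shape, H.shape)

# Index of the highest-weighted topic for each document.
dominant_topic = W.argmax(axis=1)
print(dominant_topic)
```

The topic numbering itself is arbitrary; only the grouping of documents is meaningful.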

## (4) Now let's try to see the constructed topics!

In [None]:
import numpy as np

topic_1 = clf.components_[0]
topic_1[:10]



In [None]:
# We need the indices of the top keywords of each topic!
# How do we find them? If we sort the weights, we lose their original indices...
# The np.argsort function returns the indices that would sort the array!

num_top_words = 5 # I want to see top 5 words of each topic

# get the indices of 5 largest weights (from smaller to larger)
np.argsort(topic_1)[-num_top_words:]

In [None]:
# But I need the top 5 words in descending order of weight!
np.argsort(topic_1)[-num_top_words:][::-1] # [::-1] reverses the order
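
The slicing pattern is easier to follow on a short array. The weights below are made up for illustration.

```python
import numpy as np

weights = np.array([0.10, 0.70, 0.05, 0.40, 0.90])

order = np.argsort(weights)   # indices from smallest to largest weight
print(order)                  # [2 0 3 1 4]

top3 = order[-3:][::-1]       # indices of the three largest weights, largest first
print(top3)                   # [4 1 3]
print(weights[top3])          # [0.9 0.7 0.4]
```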

In [None]:
# Weights at the indices returned by the previous cell (the indices may differ on your run)
print(topic_1[330], topic_1[370], topic_1[321], topic_1[634], topic_1[747])

In [None]:
# finally, we can use unique_words that we created in section (2)
print(unique_words[330], unique_words[370], unique_words[321], unique_words[634], unique_words[747])

In [None]:
import numpy as np

topic_words = []
num_top_words = 5 # I want to see top 5 words of each topic

# go over each component/topic
for topic in clf.components_:

    # get the indices of 5 largest weights (from smaller to larger)
    word_idx = np.argsort(topic)[-num_top_words:]
    
    temp_lst = []
    # let's see the words corresponding to the indices
    for idx in word_idx[::-1]: # reverse with [::-1] to visit the largest weights first
        temp_lst.append(unique_words[idx]) # append one keyword of the topic to temp_lst

    topic_words.append(temp_lst) # append the topic's keyword list to topic_words

In [None]:
from pprint import pprint
pprint(topic_words)
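
The loop can also be wrapped in a small reusable function. `top_words_per_topic` is a hypothetical helper name, not part of scikit-learn, and the toy weights and vocabulary are made up; this is a sketch of one possible vectorized form.

```python
import numpy as np

def top_words_per_topic(components, vocabulary, num_top_words=5):
    """Return, for each topic row, its highest-weighted words in descending order."""
    vocabulary = np.asarray(vocabulary)
    # Sort each row in ascending order, keep the last columns, then reverse them.
    top_idx = np.argsort(components, axis=1)[:, -num_top_words:][:, ::-1]
    return [vocabulary[row].tolist() for row in top_idx]

# Made-up topic-term weights and vocabulary, for illustration only.
toy_components = np.array([
    [0.9, 0.1, 0.5, 0.0],
    [0.0, 0.8, 0.2, 0.6],
])
toy_vocab = ["pumpkin", "ghost", "candy", "haunted"]

result = top_words_per_topic(toy_components, toy_vocab, num_top_words=2)
print(result)   # [['pumpkin', 'candy'], ['ghost', 'haunted']]
```

With the notebook's variables, the call would presumably be `top_words_per_topic(clf.components_, unique_words)`.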

## (5) Summary

In [None]:
import json
import numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition

with open('tweet_stream_halloween_1000.json') as infile:
    data = json.load(infile)

corpus_contents = []

for t in data:
    corpus_contents.append(t['text'])
    
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 2)
doc_term_matrix = vectorizer.fit_transform(corpus_contents)

unique_words = vectorizer.get_feature_names_out() 

num_topics = 5

clf = decomposition.NMF(n_components = num_topics)
doc_top_matrix = clf.fit_transform(doc_term_matrix)
top_term_matrix = clf.components_

topic_words = []
num_top_words = 5 

for topic in clf.components_:
    word_idx = np.argsort(topic)[-num_top_words:]
    temp_lst = []
    for idx in word_idx[::-1]: 
        temp_lst.append(unique_words[idx])
    topic_words.append(temp_lst) 
    
pprint(topic_words)

## (6) Practice: Customizing Stopwords for Topic Modeling with 1K Tweets

In [None]:
from sklearn.feature_extraction import text 

In [None]:
my_additional_stop_word_list = ['rt', 'https']

In [None]:
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_word_list)

In [None]:
stop_words

In [None]:
import json
import numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text 
from sklearn import decomposition

with open('tweet_stream_halloween_1000.json') as infile:
    data = json.load(infile)

corpus_contents = []

for t in data:
    corpus_contents.append(t['text'])

my_additional_stop_word_list = ['rt', 'https']
my_stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_word_list)

# my_stop_words is a frozenset; pass it as a list, which TfidfVectorizer accepts
vectorizer = TfidfVectorizer(stop_words = list(my_stop_words), min_df = 2)
doc_term_matrix = vectorizer.fit_transform(corpus_contents)

unique_words = vectorizer.get_feature_names_out() 

num_topics = 5

clf = decomposition.NMF(n_components = num_topics)
doc_top_matrix = clf.fit_transform(doc_term_matrix)
top_term_matrix = clf.components_

topic_words = []
num_top_words = 5 

for topic in clf.components_:
    word_idx = np.argsort(topic)[-num_top_words:]
    temp_lst = []
    for idx in word_idx[::-1]: 
        temp_lst.append(unique_words[idx])
    topic_words.append(temp_lst) 
    
pprint(topic_words)