# Activity: NMF for Topic Modeling   
### Fetch and preprocess documents

In [2]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF
from scipy import linalg
import pandas as pd

Here we load some documents from scikit-learn's "20 newsgroups" text dataset. See https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html for details about this dataset if you are interested.     
-- The scikit-learn 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics. Here we choose 4 of those topics and keep the first 10 documents of the shuffled training set.  
-- If you cannot download the dataset using UW Jupyter-Lab, please upload the "20news-bydate_py3.pkz" file to the same folder/directory as this file. 

In [3]:
# Set number of documents to analyze
NUM = 10
# Pick categories 
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
# Remove headers, footers and quotes
remove = ('headers', 'footers', 'quotes')
# Get data
data_ALL = fetch_20newsgroups(data_home = './', categories = categories, subset = 'train', shuffle = True, random_state = 42, remove = remove)

Show the document categories: 

In [4]:
from pprint import pprint
pprint(list(data_ALL.target_names))

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


We can have a peek at the content of some articles:

In [5]:
print("\n--------------------\n".join(data_ALL.data[:3]))

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych
--------------------


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.
--------------------

 >In article <1993

Finally, we need to vectorize the documents, i.e., transform them into a document-word matrix.  
Here, we apply TF-IDF vectorization and remove English stop words from the documents:

In [6]:
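# Vectorize documents using TF-IDF
vectorizer = TfidfVectorizer(stop_words = 'english')
vectors = vectorizer.fit_transform(data_ALL.data[:NUM])
# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead
dictionary = vectorizer.get_feature_names_out()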
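
To make the document-word matrix concrete, here is a minimal sketch on a made-up three-sentence corpus. The sentences and the names `toy_docs`, `toy_vectorizer`, and `toy_vectors` are illustrative assumptions, not part of the dataset. Each row is a document, each column is a vocabulary term, and with the default settings each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "The orbit of the lunar probe",
    "Save the texture file format",
    "The lunar orbit file",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_vectors = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x terms

# vocabulary_ maps each kept term to its column index
print(sorted(toy_vectorizer.vocabulary_))
print(toy_vectors.shape)

# With the default norm="l2", every row has unit Euclidean length
row_norms = np.sqrt(toy_vectors.multiply(toy_vectors).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

Stop words such as "the" never get a column, which is why the vocabulary is smaller than the number of distinct words in the sentences.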

### Classifying Documents Using SVD and NMF 

In [7]:
# Show topic - top words
num_top_words = 8
def show_topics(a):
    top_words = lambda t: [dictionary[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]
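
The slice `np.argsort(t)[:-num_top_words-1:-1]` in `show_topics` can look cryptic. The sketch below walks through it on a small made-up weight vector; `toy_vocab`, `weights`, and `n_top` are hypothetical names used only for this illustration.

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only
toy_vocab = ["file", "lunar", "orbit", "save", "texture"]
weights = np.array([0.1, 0.7, 0.5, 0.0, 0.2])
n_top = 3

# argsort returns indices in ascending order of weight
ascending = np.argsort(weights)

# Stepping backwards from the end and stopping after n_top items
# gives the indices of the n_top largest weights, largest first
top_idx = ascending[:-n_top - 1:-1]
top_words = [toy_vocab[i] for i in top_idx]
print(top_words)  # ['lunar', 'orbit', 'texture']
```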

In [10]:
def createList(k):
    # Build the column labels "topic 1", ..., "topic k"
    # (always returns a list, including when k == 1)
    return ["topic " + str(i) for i in range(1, k + 1)]

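First, we will use SVD to analyze the data:  
-- $A = U\Sigma V^T$, where $A$ is the document-word matrix. The rows of $V^T$ are the topics (weights over words), and $U\Sigma$ is the topic-encoded matrix (documents by topics).  
-- Here we use `TruncatedSVD` and keep only the 4 most important components. 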

In [16]:
# Do SVD
com_num = 4

svd = TruncatedSVD(n_components = com_num)
lsa = svd.fit_transform(vectors)
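
As a rough illustration of what the two SVD outputs mean, the sketch below uses plain NumPy on a small random matrix standing in for the document-word matrix (the matrix `A`, the rank `k`, and the variable names are assumptions for this example). The document coordinates correspond to $U_k\Sigma_k$, which is what `TruncatedSVD.fit_transform` returns, and the topic-word weights correspond to the rows of $V_k^T$, which `TruncatedSVD` stores in `components_`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # toy stand-in: 6 documents x 5 terms, all entries positive
k = 2                   # number of components to keep

U, s, Vt = np.linalg.svd(A, full_matrices=False)

doc_topic = U[:, :k] * s[:k]  # U_k Sigma_k: one row per document
topic_term = Vt[:k, :]        # V_k^T: one row per topic

# Projecting A onto the top-k right singular vectors gives the same coordinates
projected = A @ topic_term.T
print(np.allclose(doc_topic, projected))

# Even though A has no negative entries, the coordinates do
print((doc_topic < 0).any())
```

Singular vectors are orthogonal to one another, so beyond the first component they must mix positive and negative entries. This is why negative values appear in the SVD topic-encoded matrix.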

Display the topic-encoded data:  
-- The columns represent the topics, from topic 1 to topic 4.  
-- The rows represent the documents.   
-- We can interpret the topic-encoded matrix as representing how each document is classified into one or more of the 4 topics. 

In [17]:
# Display Topic-Encoded-Matrix
topic_encoded_df = pd.DataFrame(lsa, columns = createList(com_num))
display(topic_encoded_df.round(decimals = 2))

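|   | topic 1 | topic 2 | topic 3 | topic 4 |
|---|---|---|---|---|
| 0 | 0.3 | -0.21 | 0.15 | 0.63 |
| 1 | 0.11 | 0.05 | -0.19 | -0.18 |
| 2 | 0.19 | 0.44 | 0.44 | -0.07 |
| 3 | 0.36 | 0.36 | -0.46 | 0.16 |
| 4 | 0.51 | -0.41 | 0.11 | 0.25 |
| 5 | 0.26 | 0.47 | -0.44 | 0.22 |
| 6 | 0.37 | -0.09 | -0.08 | -0.37 |
| 7 | 0.23 | 0.37 | 0.55 | 0.12 |
| 8 | 0.52 | -0.33 | -0.05 | -0.37 |
| 9 | 0.24 | 0.21 | 0.18 | -0.4 |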


Next let us see what the topics consist of:  
-- Each line is the top-8 words of the corresponding topic.  
-- Given the top words, what do you think the topics are?  

In [18]:
# Show topic - top words
show_topics(svd.components_[:4])

['lunar orbit file does oh darling clementine like',
 'sq arm theists list wingate challenges just mb',
 'arm sq ll say com mb used talking',
 'file prj 3ds does format orientation save texture']

Second, we will use NMF to analyze the data:  
-- $A = WH$, where $W$ is the topic-encoded matrix (documents by topics) and the rows of $H$ are the topics (weights over words).  
-- To match our SVD result, we set $k = 4$ in NMF.   
-- You can change $k$ manually in the following block. 

In [19]:
# Do NMF

# Set component number
# You can change the value k here:
k = 4

nmf = NMF(n_components = k)
lsa = nmf.fit_transform(vectors)
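
For comparison with SVD, here is a minimal sketch of the two NMF factors on a small random non-negative matrix. The matrix `A_toy` and the settings `init="nndsvd"`, `random_state=0`, and `max_iter=1000` are assumptions chosen to make the example reproducible; they are not the settings used in the cell above. In scikit-learn, `fit_transform` returns the document-by-topic factor and `components_` holds the topic-by-word factor.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A_toy = rng.random((6, 5))  # toy stand-in: 6 documents x 5 terms, non-negative

toy_nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=1000)
W = toy_nmf.fit_transform(A_toy)  # documents x topics
H = toy_nmf.components_           # topics x terms

# Both factors are non-negative by construction
print((W >= 0).all(), (H >= 0).all())

# W @ H only approximates A_toy; report the relative Frobenius error
rel_err = np.linalg.norm(A_toy - W @ H) / np.linalg.norm(A_toy)
print(round(rel_err, 3))
```

Because neither factor can contain negative entries, each document is described purely as an additive mix of topics, which tends to make the encoded matrix sparser and easier to read than the SVD one.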

Display the topic-encoded data:  
-- The interpretation of the matrix is similar to that of SVD.  

In [20]:
topic_encoded_df = pd.DataFrame(lsa, columns = createList(k))
display(topic_encoded_df.round(decimals=2))

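|   | topic 1 | topic 2 | topic 3 | topic 4 |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.18 |
| 1 | 0.0 | 0.15 | 0.0 | 0.06 |
| 2 | 0.0 | 0.0 | 0.67 | 0.0 |
| 3 | 0.0 | 0.67 | 0.0 | 0.0 |
| 4 | 0.43 | 0.0 | 0.0 | 0.06 |
| 5 | 0.0 | 0.67 | 0.0 | 0.0 |
| 6 | 0.33 | 0.01 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.69 | 0.01 |
| 8 | 0.52 | 0.0 | 0.0 | 0.0 |
| 9 | 0.02 | 0.01 | 0.43 | 0.0 |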


Again, let us see what the topics consist of:  
-- Each line is the top-8 words of the corresponding topic.  
-- Given the top words, what do you think the topics are?  

In [21]:
# Show topic - top words
show_topics(nmf.components_[:k])

['lunar orbit clementine oh darling exploration blurb city',
 'challenges wingate list theists mr quite bobby peace',
 'arm sq processing com mb say ll used',
 'file 3ds prj format save texture orientation information']

### Questions:   
1. Compare the results from SVD and NMF, especially the topic-encoded matrices. How do you interpret the negative values in the SVD matrix? 
2. Remember that the NMF result is not unique and changes with the number of components $k$. Try different values of $k$ and see what results you get. 
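
One possible way to organize the experiment in question 2 is a loop over several values of $k$. The sketch below runs on a random non-negative matrix as a stand-in for `vectors`; the names `A_toy`, `k_try`, and `errors`, and the settings `init="nndsvd"`, `random_state=0`, and `max_iter=1000`, are assumptions for illustration. To use it in this activity, you would replace `A_toy` with `vectors`.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A_toy = rng.random((10, 30))  # toy stand-in for `vectors`: 10 documents x 30 terms

errors = {}
for k_try in (2, 4, 6):
    model = NMF(n_components=k_try, init="nndsvd", random_state=0, max_iter=1000)
    W = model.fit_transform(A_toy)  # documents x k_try topics
    # reconstruction_err_ is the Frobenius norm of A_toy - W @ H after fitting
    errors[k_try] = model.reconstruction_err_
    print(k_try, W.shape, round(errors[k_try], 3))
```

Printing the top words for each fitted model alongside the error makes it easier to judge whether a larger $k$ splits topics in a meaningful way.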