# Topic Modeling with NMF (Using scikit-learn)

Dataset available at http://mlg.ucd.ie/datasets/bbc.html

Topic modeling is a key tool for discovering latent semantic structure in document collections. Probabilistic models such as Latent Dirichlet Allocation (LDA) are widely used for this task.

Non-negative Matrix Factorization (NMF) is an effective alternative. This notebook provides a simple example of using the NMF implementation from scikit-learn to find topics in a small collection of news articles.
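
For reference, the problem NMF solves can be written compactly. The form below assumes the Frobenius-norm loss with no regularization, and the symbols n (documents), m (terms) and k (topics) are introduced here only for illustration:

```latex
\min_{W \ge 0,\; H \ge 0} \; \frac{1}{2}\,\lVert A - WH \rVert_F^2,
\qquad
A \in \mathbb{R}_{\ge 0}^{n \times m},\quad
W \in \mathbb{R}_{\ge 0}^{n \times k},\quad
H \in \mathbb{R}_{\ge 0}^{k \times m}
```

Row i of W holds the weights of document i over the k topics, and row j of H holds the weights of topic j over the m terms. Because every entry is non-negative, each document is described as an additive mix of topics.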

First, import the required modules from the standard library, scikit-learn, and NumPy.

In [1]:
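import os
import codecs
import numpy as np
from sklearn import decomposition
# ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text;
# the separate sklearn.feature_extraction.stop_words module is not available in newer releases
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS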

In [2]:
# Folder containing the .txt articles (macOS example); change it to match your machine
dir_data=r'/Users/viral.parikh/Desktop/External_Datasets/kaggle/bbc/bbc_small'

Create a corpus by reading the plain-text files into a list of strings. The sample collection of 1162 BBC news articles used in this example can be downloaded from the dataset link above.

In [3]:
file_paths = [os.path.join(dir_data, fname) for fname in os.listdir(dir_data) if fname.endswith(".txt")]
documents = []
for file_path in file_paths:
    with codecs.open(file_path, "r", encoding="utf8", errors="ignore") as f:
        documents.append(f.read())
print("Read a corpus of %d documents" % len(documents))

Read a corpus of 1162 documents


Apply tokenization and vectorization to build a document-term matrix A for the documents.

In [4]:
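# Other useful options include max_features, tokenizer, and ngram_range.
# The stop words are passed as a list because recent scikit-learn releases may reject other container types.
tfidf_vectorizer = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS), lowercase=True, strip_accents="unicode", use_idf=True, norm="l2", min_df=5)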

A = tfidf_vectorizer.fit_transform(documents)

print(A.shape)

(1162, 6045)
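
To make the vectorizer settings concrete, here is a minimal sketch on a made-up three-document corpus (the sentences and variable names are assumptions for illustration, not part of the BBC data). It shows that `norm="l2"` gives every row unit length and that raising `min_df` drops terms that occur in too few documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the BBC collection
toy_docs = [
    "the bank raised rates as the economy slowed",
    "the club won the league cup",
    "higher rates hit the housing market and the economy",
]

# min_df=1 keeps every term that survives stop-word removal
vec_all = TfidfVectorizer(stop_words="english", norm="l2", min_df=1)
toy_A = vec_all.fit_transform(toy_docs)

# min_df=2 keeps only terms that occur in at least two documents
vec_shared = TfidfVectorizer(stop_words="english", norm="l2", min_df=2)
toy_A_shared = vec_shared.fit_transform(toy_docs)

# With norm="l2", each non-empty row has Euclidean length 1
row_norms = np.linalg.norm(toy_A.toarray(), axis=1)

print(toy_A.shape, toy_A_shared.shape)
print(sorted(vec_shared.vocabulary_))
print(row_norms)
```

On the full corpus, `min_df=5` plays the same role: it removes very rare terms, which keeps the matrix smaller and the topics less noisy.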


Store the list of all terms for later use. The index of each term in this list matches its column in the document-term matrix.

In [5]:
num_terms = len(tfidf_vectorizer.vocabulary_)
print(num_terms)

6045


In [6]:
terms = [""] * num_terms

In [7]:
# vocabulary_ maps each term to its column index; invert it into a list ordered by column
for term, index in tfidf_vectorizer.vocabulary_.items():
    terms[index] = term

print("Created document-term matrix of size %d x %d" % (A.shape[0], A.shape[1]))

Created document-term matrix of size 1162 x 6045
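
The inversion of the term-to-column mapping can be checked on a small dictionary. The sketch below uses a hypothetical three-term vocabulary (an assumption for illustration) and shows an equivalent one-liner that sorts the terms by their column index.

```python
# Hypothetical vocabulary in the same shape as TfidfVectorizer.vocabulary_:
# each term maps to the column it occupies in the document-term matrix
toy_vocabulary = {"market": 2, "bank": 0, "league": 1}

# Loop version, as used in this notebook
toy_terms = [""] * len(toy_vocabulary)
for term, index in toy_vocabulary.items():
    toy_terms[index] = term

# Equivalent one-liner: sort the terms by their column index
toy_terms_sorted = sorted(toy_vocabulary, key=toy_vocabulary.get)

print(toy_terms)
print(toy_terms_sorted)
```

Newer scikit-learn releases (1.0 and later) also provide `get_feature_names_out()` on the fitted vectorizer, which returns the terms in column order directly.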


Apply NMF with SVD-based initialization (NNDSVD) to the document-term matrix A to generate 6 topics, and get the factors W and H from the resulting model. W holds the document-topic weights and H holds the topic-term weights.

In [13]:
model = decomposition.NMF(init="nndsvd", n_components=6, max_iter=200)
W = model.fit_transform(A)

In [14]:
print(W.shape)
W

(1162, 6)


array([[-0.        ,  0.02920229,  0.15534348,  0.00041028, -0.        ,
         0.04660398],
       [ 0.00872395, -0.        ,  0.15982083,  0.00684455,  0.01738618,
         0.00053172],
       [-0.        , -0.        ,  0.0419963 , -0.        ,  0.09228407,
        -0.        ],
       ..., 
       [ 0.01349794,  0.12488964,  0.02907738,  0.00093087,  0.00422005,
         0.0348679 ],
       [-0.        ,  0.20169135, -0.        , -0.        , -0.        ,
         0.00902608],
       [ 0.01178719,  0.03192502,  0.00473044,  0.00313964,  0.03883936,
         0.20023313]])
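
Each row of W can be read as one document's weights over the topics. As a rough sketch, the largest entry in a row can be taken as that document's dominant topic, and dividing a row by its sum gives relative topic shares. The small matrix below is made up for illustration and is not the W computed above.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents, 3 topics
W_example = np.array([
    [0.00, 0.03, 0.16],
    [0.01, 0.20, 0.00],
    [0.12, 0.02, 0.03],
    [0.00, 0.00, 0.00],  # a document with no weight on any topic
])

# Index of the largest weight in each row (an all-zero row falls back to topic 0)
dominant_topic = W_example.argmax(axis=1)

# Relative topic shares per document; all-zero rows are left as zeros
row_sums = W_example.sum(axis=1, keepdims=True)
topic_share = np.divide(W_example, row_sums,
                        out=np.zeros_like(W_example), where=row_sums > 0)

print(dominant_topic)
print(topic_share.round(2))
```

The all-zero row is worth checking for in practice, since `argmax` silently assigns such a document to topic 0.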

In [15]:
H = model.components_
print("Generated factor W of size %s and factor H of size %s" % (str(W.shape), str(H.shape)))

Generated factor W of size (1162, 6) and factor H of size (6, 6045)
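
A quick sanity check on the factors is that both are non-negative and that their product approximates the input. The sketch below does this on a small made-up matrix (the data and variable names are assumptions for illustration), using the same `init="nndsvd"` setting as above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical 5 x 4 non-negative matrix built from two underlying patterns
A_small = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 6.0],
    [2.0, 1.0, 1.0, 3.0],
])

small_model = NMF(init="nndsvd", n_components=2, max_iter=500)
W_small = small_model.fit_transform(A_small)   # shape (5, 2)
H_small = small_model.components_              # shape (2, 4)

# Frobenius norm of the difference between the input and its reconstruction
residual = np.linalg.norm(A_small - W_small.dot(H_small))

print(W_small.shape, H_small.shape)
print(residual, small_model.reconstruction_err_)
```

The same check applies to the factors from the BBC corpus, although A is sparse there, so it would need converting with `A.toarray()` before subtracting.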


In [16]:
print(H.shape)
H

(6, 6045)


array([[  2.64024097e-02,   1.60538034e-02,   1.52582236e-02, ...,
          0.00000000e+00,   0.00000000e+00,   5.71339566e-04],
       [  7.71661041e-02,   4.92047458e-02,   4.23580669e-02, ...,
          3.76352143e-05,   6.81016309e-03,   0.00000000e+00],
       [  1.04295564e-01,   7.52195989e-02,   2.21729951e-02, ...,
          0.00000000e+00,   0.00000000e+00,   1.08614941e-02],
       [  5.49892816e-03,   4.10755009e-02,   4.94313569e-03, ...,
          1.52466446e-02,   2.19416981e-03,   6.28886334e-03],
       [  7.64601474e-02,   1.41698988e-02,   2.43940915e-02, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  2.82531445e-02,   2.22283426e-02,   1.51616239e-02, ...,
          0.00000000e+00,   7.34599675e-03,   2.79332146e-03]])

Print the top 10 ranked terms for each topic by sorting the values in each row of the H factor in descending order (i.e. the weights of each of the 6045 terms for each of the 6 topics found by NMF).

In [17]:
for topic_index in range(H.shape[0]):
    top_indices = np.argsort(H[topic_index,:])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic %d: %s" % (topic_index, ", ".join(term_ranking)))

Topic 0: mr, blair, labour, brown, election, party, prime, minister, chancellor, howard
Topic 1: mobile, phone, music, digital, people, technology, broadband, video, games, phones
Topic 2: growth, economy, year, sales, bank, said, economic, prices, 2004, market
Topic 3: chelsea, game, club, league, arsenal, united, cup, liverpool, players, mourinho
Topic 4: said, government, law, lord, mr, police, court, lords, rights, home
Topic 5: microsoft, virus, software, search, users, spyware, security, mail, program, windows
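
The same argsort idiom can be applied to the columns of W to list the documents most strongly associated with a topic, which can help when judging whether a topic is coherent. The sketch below uses made-up weights and document names (both are assumptions for illustration), and `top_documents` is a hypothetical helper rather than a scikit-learn function.

```python
import numpy as np

# Hypothetical document names and document-topic weights (4 documents, 2 topics)
doc_names = ["politics_01", "tech_01", "business_01", "tech_02"]
W_docs = np.array([
    [0.30, 0.01],
    [0.02, 0.25],
    [0.10, 0.05],
    [0.00, 0.40],
])

def top_documents(W, names, topic_index, k):
    """Return the names of the k documents with the largest weight for one topic."""
    top_indices = np.argsort(W[:, topic_index])[::-1][:k]
    return [names[i] for i in top_indices]

for topic_index in range(W_docs.shape[1]):
    ranked = top_documents(W_docs, doc_names, topic_index, 2)
    print("Topic %d: %s" % (topic_index, ", ".join(ranked)))
```

With the real data, `file_paths` could serve as the list of names, since its order matches the rows of W.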


References

1. http://derekgreene.com/nmf-topic/
2. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
3. http://derekgreene.com/slides/nmf_insight_workshop.pdf
4. http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html#example-applications-topics-extraction-with-nmf-py