## Example code for LDA and NMF in topic models

See: https://towardsdatascience.com/improving-the-interpretation-of-topic-models-87fd2ee3847d
And: https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730#.vivglhmhv

Difference between LDA and NMF as used here:
* LDA is fit on a term-frequency (TF) matrix of raw counts
* NMF is fit on a TF-IDF matrix
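
To make the TF versus TF-IDF distinction concrete, here is a minimal sketch on a hypothetical three-sentence corpus (`toy_docs` is invented for illustration and is not part of the 20 newsgroups data). It uses both vectorizers with their default settings and compares how strongly a rare word ("mat") is weighted relative to a common one ("the") in the first document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

count_vec = CountVectorizer()
tf_toy = count_vec.fit_transform(toy_docs).toarray()      # raw integer term counts

tfidf_vec = TfidfVectorizer()
tfidf_toy = tfidf_vec.fit_transform(toy_docs).toarray()   # counts reweighted by IDF, rows L2-normalised

# Look up column indices through each vectorizer's own vocabulary mapping.
tf_the, tf_mat = count_vec.vocabulary_["the"], count_vec.vocabulary_["mat"]
ti_the, ti_mat = tfidf_vec.vocabulary_["the"], tfidf_vec.vocabulary_["mat"]

# Weight of the rare word relative to the common word in document 0.
tf_ratio = tf_toy[0, tf_mat] / tf_toy[0, tf_the]
tfidf_ratio = tfidf_toy[0, ti_mat] / tfidf_toy[0, ti_the]

row_norm = (tfidf_toy[0] ** 2).sum() ** 0.5

print("TF ratio mat/the:    ", tf_ratio)
print("TF-IDF ratio mat/the:", round(float(tfidf_ratio), 3))
```

"the" occurs in every toy document, so TF-IDF shrinks its weight relative to "mat", which occurs in only one.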

LDA is a probabilistic graphical model, while NMF is a linear-algebra method that factorises the document-term matrix into two non-negative matrices.
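
As a rough illustration of the linear-algebra view, the sketch below factorises a small random non-negative matrix `V` into `W @ H` using the classic multiplicative update rules. This is a swapped-in textbook technique chosen for brevity: scikit-learn's `NMF` uses a coordinate-descent solver by default, so this is not what the notebook's `NMF(...)` call does internally. The matrix sizes, `k` and the iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative "document-term" matrix: 6 documents x 5 terms.
V = rng.random((6, 5))
k = 2                                  # number of topics (assumed)

W = rng.random((6, k))                 # document-topic weights
H = rng.random((k, 5))                 # topic-term weights
eps = 1e-10                            # guards against division by zero

err_start = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Multiplicative updates keep every entry non-negative.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_end = np.linalg.norm(V - W @ H)

print("reconstruction error before:", err_start)
print("reconstruction error after: ", err_end)
```

The rows of `H` play the role of topics (weights over terms) and the rows of `W` describe how much of each topic a document contains.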

Good discussion:
http://nbviewer.jupyter.org/github/dolaameng/tutorials/blob/master/topic-finding-for-short-texts/topics_for_short_texts.ipynb

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

In [3]:
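def display_topics(model, feature_names, no_top_words):
    """Print the no_top_words highest-weighted terms for each topic of a fitted model."""
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        # argsort is ascending, so the reversed tail slice yields the largest weights first
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))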
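
The slice `topic.argsort()[:-no_top_words - 1:-1]` is compact but easy to misread. The sketch below applies it to a hypothetical weight vector (`topic` and `names` are made-up values) and checks that it matches the more explicit "reverse, then take the first n" form.

```python
import numpy as np

# Hypothetical topic-term weights and vocabulary, for illustration only.
topic = np.array([0.1, 0.7, 0.05, 0.4, 0.9])
names = ["apple", "bike", "car", "dog", "egg"]
n = 3

# argsort() sorts ascending; slicing [:-n-1:-1] walks backwards from the
# last position and stops after n items, giving the n largest weights.
top_compact = topic.argsort()[:-n - 1:-1]

# Equivalent, more explicit form.
top_explicit = topic.argsort()[::-1][:n]

top_terms = [names[i] for i in top_compact]
print(top_terms)  # ['egg', 'bike', 'dog']
```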

In [12]:
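def display_topics2(H, W, feature_names, documents, no_top_words, no_top_documents):
    """Print the top terms of each topic, followed by the documents that weight it most."""
    # H: topic-term matrix (one row per topic); W: document-topic matrix (one row per document)
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        # Sort documents by their weight on this topic, largest first
        top_doc_indices = np.argsort(W[:, topic_idx])[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])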
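
The document lookup in `display_topics2` can be traced on a small hand-made matrix. In this sketch `W_toy` and `docs_toy` are hypothetical stand-ins for the document-topic matrix and the raw documents.

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents (rows) x 2 topics (columns).
W_toy = np.array([
    [0.9, 0.0],
    [0.1, 0.8],
    [0.5, 0.3],
    [0.0, 0.6],
])
docs_toy = ["doc A", "doc B", "doc C", "doc D"]
n_top = 2

top_docs = {}
for topic_idx in range(W_toy.shape[1]):
    # Column topic_idx holds every document's weight on that topic;
    # sort descending and keep the first n_top document indices.
    order = np.argsort(W_toy[:, topic_idx])[::-1][:n_top]
    top_docs[topic_idx] = [docs_toy[i] for i in order]

print(top_docs)  # {0: ['doc A', 'doc C'], 1: ['doc B', 'doc D']}
```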

In [4]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

no_features = 1000

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [5]:
# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# LDA needs raw term counts because it is a probabilistic model of word occurrences
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_feature_names = tf_vectorizer.get_feature_names_out()

no_topics = 20

In [6]:
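# Run NMF
# Note: the `alpha` argument was removed in scikit-learn 1.2. On newer versions pass
# `alpha_W` (and optionally `alpha_H`) instead; results may then differ from the output shown.
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)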

In [14]:
nmf_W = nmf.transform(tfidf)
nmf_H = nmf.components_

In [8]:
# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)

In [15]:
lda_W = lda.transform(tf)
lda_H = lda.components_
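
For LDA the two matrices have a probabilistic reading. The sketch below fits `LatentDirichletAllocation` on a small random count matrix (`toy_counts` is a hypothetical stand-in for `tf`, and the sizes are arbitrary) and checks that each row of `transform()` sums to 1. The rows of `components_` are unnormalised, so they are divided by their row sums to obtain per-topic word distributions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Hypothetical count matrix: 8 documents x 6 terms, every count at least 1.
toy_counts = rng.integers(1, 5, size=(8, 6))

toy_lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(toy_counts)

# Document-topic matrix: each row is a distribution over topics.
doc_topic = toy_lda.transform(toy_counts)

# components_ holds unnormalised topic-term weights; normalise each row
# to read it as a distribution over terms.
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(doc_topic.sum(axis=1))    # every entry is approximately 1
print(topic_word.sum(axis=1))   # every entry is approximately 1
```

The NMF matrices carry no such constraint: their entries are non-negative weights, not probabilities.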

In [16]:
no_top_words = 3
display_topics(nmf, tfidf_feature_names, no_top_words)
display_topics(lda, tf_feature_names, no_top_words)

Topic 0:
people time right
Topic 1:
window problem using
Topic 2:
god jesus bible
Topic 3:
game team year
Topic 4:
new 00 sale
Topic 5:
thanks mail advance
Topic 6:
windows file files
Topic 7:
edu soon cs
Topic 8:
key chip clipper
Topic 9:
drive scsi drives
Topic 10:
just ll thought
Topic 11:
does know anybody
Topic 12:
card video monitor
Topic 13:
like sounds looks
Topic 14:
don know want
Topic 15:
car cars engine
Topic 16:
ve got seen
Topic 17:
use used using
Topic 18:
think don lot
Topic 19:
com list dave
Topic 0:
people gun state
Topic 1:
time question book
Topic 2:
mr line rules
Topic 3:
key chip keys
Topic 4:
edu com cs
Topic 5:
use does window
Topic 6:
windows thanks know
Topic 7:
bike water effect
Topic 8:
don just like
Topic 9:
car new price
Topic 10:
file available program
Topic 11:
ax max b8f
Topic 12:
government law privacy
Topic 13:
card bit memory
Topic 14:
drive scsi disk
Topic 15:
god jesus people
Topic 16:
year game team
Topic 17:
10 00 15
Topic 18:
armenian israel arm

In [21]:
no_top_words = 2
no_top_documents = 1
display_topics2(nmf_H, nmf_W, tfidf_feature_names, documents, no_top_words, no_top_documents)
display_topics2(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)

Topic 0:
people time
Accounts of Anti-Armenian Human Right Violations in Azerbaijan #012
                 Prelude to Current Events in Nagorno-Karabakh

        +---------------------------------------------------------+
        |                                                         |
        |  I saw a naked girl with her hair down. They were       |
        |  dragging her. She kept falling because they were       |
        |  pushing her and kicking her. She fell down, it was     |
        |  muddy there, and later other witnesses who saw it from |
        |  their balconies told us, they seized her by the hair   |
        |  and dragged her a couple of blocks, as far as the      |
        |  mortgage bank, that's a good block and a half or two   |
        |  from here. I know this for sure because I saw it       |
        |  myself.                                                |
        |                                                         |
        +----------------------