## Non-Negative Matrix Factorization for Topic Modeling

### Why use TF-IDF vectors for an NMF topic model?

A **TF-IDF** (Term Frequency-Inverse Document Frequency) vectorizer is generally recommended as the input to **Non-negative Matrix Factorization (NMF)**, for reasons that follow from how **NMF** works and which terms **TF-IDF** _emphasizes_. The points below explain why TF-IDF is typically a better choice than raw term counts when applying NMF to topic modeling:

#### 1. Emphasis on Unique and Relevant Terms
> TF-IDF assigns higher weights to terms that are distinctive of a document, that is, frequent within it but rare across the rest of the corpus, rather than to terms that are frequent across many documents.
Since NMF tries to capture distinctive, high-level features (topics) by decomposing the document-term matrix, using TF-IDF helps it focus on words that are more informative or relevant for differentiating topics. This way, common words that appear across multiple topics don't dominate the topics.
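
The weighting can be checked on a toy corpus. The sketch below is illustrative only: the three sentences are made up, and it assumes scikit-learn's default `TfidfVectorizer` settings. The word "report" appears in every document, while "rocket" appears in only one, so both have the same raw count in the first document but different TF-IDF weights.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "report" and "on" occur in every document.
toy_docs = [
    "report on rocket launch",
    "report on engine repair",
    "report on election results",
]

# Raw term counts.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)

# TF-IDF weights with default settings (smoothed IDF, L2-normalized rows).
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_docs)

# Compare the two terms in the first document.
count_row = counts[0].toarray().ravel()
tfidf_row = tfidf[0].toarray().ravel()

count_report = count_row[count_vec.vocabulary_["report"]]
count_rocket = count_row[count_vec.vocabulary_["rocket"]]
weight_report = tfidf_row[tfidf_vec.vocabulary_["report"]]
weight_rocket = tfidf_row[tfidf_vec.vocabulary_["rocket"]]

print(f"counts: report={count_report}, rocket={count_rocket}")
print(f"tf-idf: report={weight_report:.3f}, rocket={weight_rocket:.3f}")
```

With raw counts the two terms are indistinguishable in that document; after TF-IDF the rarer term carries the larger weight.
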
#### 2. Alignment with NMF's Matrix Decomposition
> NMF decomposes a document-term matrix into two smaller matrices (document-topic and topic-term matrices), aiming to represent each document as a combination of several key topics.
NMF fits these two matrices by minimizing the reconstruction error of the input matrix, so large entries have the most influence on the result. The TF-IDF weighting scheme shrinks the entries of ubiquitous terms and boosts rare but important ones, so the factorization spends its capacity on the distinctive terms rather than on the most frequent ones.
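
The decomposition itself can be illustrated on a small random matrix. This is a minimal sketch, not the model used later in this notebook: the matrix `X_toy`, its size, and the choice of two components are arbitrary assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative "document-term" matrix: 6 documents, 8 terms.
rng = np.random.default_rng(0)
X_toy = rng.random((6, 8))

# Factorize into 2 topics: X_toy is approximated by W_toy @ H_toy.
toy_model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W_toy = toy_model.fit_transform(X_toy)  # document-topic weights, shape (6, 2)
H_toy = toy_model.components_           # topic-term weights, shape (2, 8)

# Frobenius norm of the residual measures how well the product fits X_toy.
approx = W_toy @ H_toy
residual = np.linalg.norm(X_toy - approx)

print(f"{W_toy.shape=}, {H_toy.shape=}")
print(f"residual={residual:.3f}, input norm={np.linalg.norm(X_toy):.3f}")
```

Both factors contain only non-negative entries, which is what lets each row of `H_toy` be read as a weighted list of terms and each row of `W_toy` as a mix of topics.
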
#### 3. Improved Interpretability of Topics
> Using TF-IDF tends to lead to more interpretable topics because it reduces the impact of very common words that might not be specific to any topic. For instance, terms like "data" or "report" in a business-related corpus may be ubiquitous but are not necessarily informative about any specific topic. TF-IDF reduces their weight, making it easier for NMF to identify topics based on distinctive words.
As a result, the top terms in each topic produced by NMF with TF-IDF often represent cleaner, more specific themes.
#### 4. Performance in Extracting Topics from Sparse Matrices
> Document-term matrices are often very sparse. TF-IDF does not change which entries are nonzero, but it rescales them: ubiquitous terms are down-weighted and, with scikit-learn's default L2 normalization, every document row is scaled to unit length.
NMF tends to perform better with this pre-processing, as it provides a more balanced input for the factorization algorithm, allowing it to uncover latent topics more effectively than with raw counts.
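
One concrete sense in which the input becomes "more balanced" is document length. The sketch below uses two made-up documents and scikit-learn defaults: with raw counts the longer document has a much larger row norm, whereas TF-IDF rows are L2-normalized, and the result is still a SciPy sparse matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical corpus: one long, repetitive document and one short document.
toy_docs = [
    "engine engine engine repair repair manual manual manual manual",
    "engine repair",
]

counts = CountVectorizer().fit_transform(toy_docs)
tfidf = TfidfVectorizer().fit_transform(toy_docs)


def row_norms(matrix):
    """Return the L2 norm of each row of a sparse matrix as a 1-D array."""
    squared = matrix.multiply(matrix).sum(axis=1)
    return np.sqrt(np.asarray(squared, dtype=float).ravel())


count_norms = row_norms(counts)
tfidf_norms = row_norms(tfidf)

print(f"count row norms:  {count_norms}")
print(f"tf-idf row norms: {tfidf_norms}")
print(f"still sparse: {sparse.issparse(tfidf)}")
```

Because every TF-IDF row has the same length, a long post does not dominate the reconstruction error simply by containing more words.
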

In [18]:
%pip install -q pyLDAvis



In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

In [None]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

In [2]:
# Load dataset, download ALL categories.
# To download only 3 categories instead, use:
# newsgroups = fetch_20newsgroups(subset='all', categories=['sci.space', 'rec.autos', 'talk.politics.mideast'])
newsgroups = fetch_20newsgroups(subset='all')
documents = newsgroups.data
print(f"{len(documents)=}")

len(documents)=18846


In [3]:
categories = newsgroups.target_names
print(f"{len(categories)=}")
print(categories)

len(categories)=20
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [12]:
# Text preprocessing with TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=4000, ngram_range=(1,2), min_df=10, max_df=0.3)
doc_vectors = vectorizer.fit_transform(documents)

# Apply NMF for topic modeling
n_topics = len(categories)
model = NMF(n_components=n_topics, random_state=42)
W = model.fit_transform(doc_vectors)
H = model.components_

print(f"{W.shape=}")
print(f"{H.shape=}")



W.shape=(18846, 20)
H.shape=(20, 4000)


In [13]:
# Display the key terms for each topic
key_term_map = []
num_top_words = 20
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    key_terms = ",".join([feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]])
    key_term_map.append(key_terms)
    print(f"Topic {topic_idx}: {key_terms}")

Topic 0: people,don,think,right,government,gun,time,israel,know,said,did,car,way,good,say,make,want,ve,years,going
Topic 1: windows,file,dos,program,window,files,thanks,help,use,graphics,using,mit,ms,problem,software,au,mail,ftp,version,pc
Topic 2: ca,game,team,games,hockey,year,players,baseball,play,season,canada,toronto,player,win,nhl,espn,league,cs,fans,roger
Topic 3: cleveland,cwru,cwru edu,freenet,cleveland freenet,freenet edu,case western,western reserve,reserve university,organization case,ins cwru,reserve,ins,western,university cleveland,case,usa lines,po cwru,po,usa
Topic 4: key,clipper,chip,encryption,keys,escrow,clipper chip,government,algorithm,security,nsa,crypto,phone,des,public,secure,use,secret,law,wiretap
Topic 5: ohio state,ohio,state,state edu,magnus,magnus acs,acs ohio,acs,state university,edu organization,university lines,ryan,cis,oil,sale,distribution,drugs,john,distribution usa,flat
Topic 6: uiuc,uiuc edu,cso,cso uiuc,illinois,university illinois,urbana,uxa,uxa c



In [14]:
# Print some documents and their dominant topics
for i in range(5):
    topic_distribution = W[i]  # non-negative topic weights, not normalized probabilities
    probable_topic = topic_distribution.argmax()
    print(f"Document {i+1}: {probable_topic=}  Topic Terms: {key_term_map[probable_topic]}")
    print(documents[i][:1000] + "...")  # Print a snippet of the document

    print("=" * 60)

Document 1: probable_topic=9  Topic Terms: cmu,cmu edu,andrew,andrew cmu,mellon,carnegie,carnegie mellon,pittsburgh,pittsburgh pa,pa,pa lines,mellon pittsburgh,edu subject,cs cmu,edu reply,engineering,computer,reply,pens,science
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp



### Visualize the Model using `pyLDAvis`

In [16]:
pyLDAvis.lda_model.prepare(model, doc_vectors, vectorizer)

