# Latent Dirichlet Allocation (LDA) Topic Modelling

## Why we use CountVectorizer for LatentDirichletAllocation

Latent Dirichlet Allocation (LDA) is usually fitted on the output of `CountVectorizer` rather than `TfidfVectorizer`, because LDA models how often each term occurs in each document rather than a weighted importance score (as TF-IDF does). Here’s why `CountVectorizer` is preferred for LDA:

#### 1. Probabilistic Nature of LDA
LDA is a generative probabilistic model that assumes each document is a mixture of topics and each topic is a probability distribution over words. Inference relies on simple term frequency counts:

- LDA assigns probabilities to words based on how often they appear in documents, which helps determine which topics are likely present.
- Since LDA calculates these probabilities from raw counts, `CountVectorizer` provides exactly what it needs.
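
The generative story above can be sketched in a few lines of NumPy. This is only an illustration under assumed values: the vocabulary, the number of topics, and the Dirichlet parameters `alpha` and `eta` are hypothetical and not taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sizes, chosen for illustration only.
vocab = ["car", "engine", "wheel", "orbit", "rocket", "launch"]
n_topics, doc_length = 2, 20

# Assumed Dirichlet hyperparameters.
alpha = np.full(n_topics, 0.5)    # prior over topics within a document
eta = np.full(len(vocab), 0.5)    # prior over words within a topic

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(eta, size=n_topics)  # shape (n_topics, n_words)

# Each document has its own mixture of topics.
doc_topic = rng.dirichlet(alpha)                # shape (n_topics,)

# For every word slot: pick a topic, then pick a word from that topic.
topic_assignments = rng.choice(n_topics, size=doc_length, p=doc_topic)
word_ids = np.array(
    [rng.choice(len(vocab), p=topic_word[z]) for z in topic_assignments]
)

# What the model observes is just the integer count of each word.
counts = np.bincount(word_ids, minlength=len(vocab))
print(dict(zip(vocab, counts.tolist())))
```

The final `counts` vector has the same form as one row of the matrix produced by `CountVectorizer`: non-negative integers that sum to the document length.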

#### 2. TF-IDF’s Focus on Rare Terms
`TfidfVectorizer` applies a weighting scheme that down-weights terms appearing in many documents (like "car" or "engine" in an automotive corpus) and up-weights rare terms, which might not always suit LDA’s probabilistic model:

- TF-IDF emphasizes terms that are unique or rare within a corpus, rather than terms that are frequently discussed within a particular topic.
- Because LDA does not focus on distinguishing unique terms but rather on uncovering word patterns that reflect underlying topics, the raw count matrix works better.
#### 3. Interpretable Topic Composition
Using CountVectorizer provides a direct interpretation of LDA topics by showing the most frequent and contextually relevant words associated with each topic. This aligns well with LDA's goal of representing topics as a set of words likely to co-occur, which is easier to interpret with raw counts.

## Why we should NOT use TF-IDF Vectorizer for LatentDirichletAllocation
> Using **TF-IDF Vectorizer** with **Latent Dirichlet Allocation (LDA)** is generally not recommended because it doesn't align with the probabilistic foundation of LDA, which relies on raw word frequencies to model topics effectively. However, in some specific cases, TF-IDF may still be useful when combined with LDA:

### Reducing the Influence of Common Words Across All Topics:

In certain cases, common terms that are not truly informative for topic distinction may dominate the dataset (e.g., "report," "data," "result").
TF-IDF can help by downweighting terms that appear across many documents, allowing LDA to better capture distinctive patterns of terms for specific topics. This could improve topic differentiation if there is a large volume of non-informative common words.

### Small or Highly Homogeneous Corpora:

For smaller corpora or very homogeneous datasets, where most documents are quite similar, raw counts may not differentiate well between topics, since all documents may use similar vocabulary.
TF-IDF can increase the weight of rarer terms, potentially allowing LDA to identify subtle topic distinctions.

### Exploratory Analysis:

If you are experimenting with topic modeling and interested in finding rare, unique words that define topics more than general themes, using TF-IDF may reveal different patterns, though they might be less coherent.
However, this should be seen more as an exploratory method rather than a best practice for topic modeling with LDA.

### Limitations of Using TF-IDF with LDA

- **Misalignment with LDA’s assumptions:** LDA assumes a multinomial distribution of terms based on raw counts. TF-IDF introduces non-integer weights that can distort the intended probabilistic topic-word distribution.
- **Loss of co-occurrence patterns:** By prioritizing rare terms, TF-IDF may weaken the signal of words that naturally co-occur in topics, leading to less coherent topics.

## References
### 1. **Original Paper on LDA**  
   - **"Latent Dirichlet Allocation" by David M. Blei, Andrew Y. Ng, and Michael I. Jordan (2003)**: This foundational paper describes LDA as a generative probabilistic model that relies on the frequency of words across documents to infer topics. It explains that LDA assumes a multinomial distribution of words, which aligns with raw count data and does not involve TF-IDF weighting.  
   - **Reference**: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). *Latent Dirichlet Allocation*. Journal of Machine Learning Research, 3, 993-1022. [Available here](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf).

### 2. **Scikit-Learn Documentation**  
   - **Scikit-Learn's LDA Documentation** recommends using `CountVectorizer` for LDA topic modeling because the underlying probabilistic model of LDA assumes raw term frequencies. Scikit-Learn highlights the difference in alignment between count data for LDA versus TF-IDF for models like NMF.  
   - **Reference**: [Scikit-Learn Topic Extraction with LDA](https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation)

### 3. **Text Mining and Topic Modeling Textbooks**  
   - **"Text Mining with R: A Tidy Approach" by Julia Silge and David Robinson (2017)**: This book explains the difference between LDA and other topic modeling methods and explicitly notes that LDA benefits from raw term counts because of its probabilistic foundations.  
   - **Reference**: Silge, J., & Robinson, D. (2017). *Text Mining with R: A Tidy Approach*. O’Reilly Media.  

   - **"Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward Loper**: In explaining topic modeling, this book underscores that probabilistic models like LDA assume word distributions based on term frequencies. It notes that using TF-IDF weights with LDA is generally discouraged.  
   - **Reference**: Bird, S., Klein, E., & Loper, E. (2009). *Natural Language Processing with Python*. O'Reilly Media.

### 4. **Technical Blogs and Articles on LDA and Topic Modeling**  
   - **Towards Data Science** and **Machine Learning Mastery** articles on topic modeling frequently recommend Count Vectorizer for LDA and explain why TF-IDF can misrepresent term distributions for probabilistic models.
   - **References**:  
      - Brownlee, J. (2019). *A Gentle Introduction to LDA for Topic Modeling*. [Machine Learning Mastery](https://machinelearningmastery.com/latent-dirichlet-allocation-for-topic-modeling/).  
      - Pande, T. (2020). *Topic Modeling with LDA and Python*. [Towards Data Science](https://towardsdatascience.com/topic-modeling-with-latent-dirichlet-allocation-lda-697bcd9197b9).  

These references explain why LDA’s generative model favors raw counts and why TF-IDF’s emphasis on rare terms can distort LDA’s topic-word distributions. Using `CountVectorizer` with LDA is thus the standard practice among both researchers and practitioners.

In [None]:
%pip install -q pyLDAvis


In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation


In [None]:
import pyLDAvis
import pyLDAvis.lda_model
pyLDAvis.enable_notebook()

In [None]:
# Alternative: load only 3 categories
# newsgroups = fetch_20newsgroups(subset='all', categories=['sci.space', 'rec.autos', 'talk.politics.mideast'])
# Load the full dataset (all categories)
newsgroups = fetch_20newsgroups(subset='all')
documents = newsgroups.data
print(f"{len(documents)=}")


len(documents)=18846


In [None]:
categories = newsgroups.target_names
print(f"{len(categories)=}")
print(categories)

len(categories)=20
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']



In [None]:
# Option 1 (recommended in the discussion above): raw term counts with CountVectorizer
# vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2), max_features=4000)
# Option 2 (used for the outputs below): TF-IDF weights with TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3), max_features=5000, min_df=10, max_df=0.5)

doc_vectors = vectorizer.fit_transform(documents)

# Apply LDA for topic modeling
n_topics = 12
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
W = lda_model.fit_transform(doc_vectors)  # document-topic matrix, shape (n_documents, n_topics)
H = lda_model.components_                 # topic-term matrix, shape (n_topics, n_features)

print(f"{W.shape=}")
print(f"{H.shape=}")


W.shape=(18846, 12)
H.shape=(12, 5000)
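
To make the meaning of the two matrices concrete, the following sketch refits a small LDA model on a made-up corpus (the `docs` list and the two-topic setting are assumptions for illustration) and checks how `W` and `H` are normalized in scikit-learn.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two obvious themes.
docs = [
    "car engine wheel car",
    "engine car brake wheel",
    "rocket orbit launch rocket",
    "orbit launch moon rocket",
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
W = lda.fit_transform(X)  # document-topic matrix, shape (n_docs, n_topics)
H = lda.components_       # topic-term matrix, shape (n_topics, n_terms)

# Each row of W is a topic distribution for one document and sums to 1.
print(W.sum(axis=1))

# Rows of H are unnormalized pseudo-counts; dividing each row by its sum
# yields a probability distribution over the vocabulary for that topic.
topic_word = H / H.sum(axis=1, keepdims=True)
print(topic_word.sum(axis=1))
```

The same normalization can be applied to the notebook's `H` when per-topic word probabilities are needed instead of raw weights. Ranking the top terms of a topic does not require it, since dividing a row by a constant leaves the order unchanged.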


In [None]:
# Display the key terms for each topic
key_term_map = []
num_top_words = 20
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    # Indices of the num_top_words largest weights, highest first
    top_indices = topic.argsort()[::-1][:num_top_words]
    key_terms = ",".join(feature_names[i] for i in top_indices)
    key_term_map.append(key_terms)
    print(f"Topic {topic_idx}: {key_terms}")

Topic 0: indiana,virginia,sandvik,cramer,indiana edu,optilink,virginia edu,kent,apple com,ucs indiana,ucs indiana edu,clayton,apple,carleton,university virginia,newton apple com,sandvik newton,sandvik newton apple,newton apple,ucs
Topic 1: access,digex,access digex,uk,nasa,mit,mit edu,com,nasa gov,window,gov,nyx,ac,digex net,access digex net,du edu,du,ac uk,space,cs du edu
Topic 2: game,ca,hockey,buffalo,team,pitt,upenn,upenn edu,pitt edu,duke,buffalo edu,espn,gordon banks,games,geb,duke edu,banks,gordon,nhl,sas
Topic 3: god,people,com,don,think,just,article,say,jesus,like,believe,know,christian,time,good,does,bible,did,way,right
Topic 4: uiuc,uiuc edu,israel,cso,cso uiuc edu,cso uiuc,fbi,israeli,jews,batf,people,government,uchicago,uchicago edu,arab,university illinois,illinois,urbana,waco,atf
Topic 5: com,car,like,just,article,don,hp,new,good,know,university,hp com,time,netcom,gun,space,think,distribution,ve,people
Topic 6: com,sun,sun com,berkeley,portal,dod,nasa,ca,att,uci,bike,ber


In [None]:
# Print some documents and their most probable topics
for i in range(5):
    topic_distribution = W[i]
    probable_topics = topic_distribution.argsort()[-3:]  # top 3 topics, in ascending order of weight
    probable_topic = probable_topics[-1]                 # the single most probable topic
    print(f"Document {i+1}: {probable_topics=}  Topic Terms: {key_term_map[probable_topic]}")
    print(documents[i][:1000] + "...")  # Print a snippet of the document
    print("=" * 60)

Document 1: probable_topics=array([3, 8, 2])  Topic Terms: game,ca,hockey,buffalo,team,pitt,upenn,upenn edu,pitt edu,duke,buffalo edu,espn,gordon banks,games,geb,duke edu,banks,gordon,nhl,sas
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very dis


## Visualize the Model

In [None]:
pyLDAvis.lda_model.prepare(lda_model, doc_vectors, vectorizer)
