## 🧩 Topic Modeling
*(Personal Practice Notes)*

Topic modeling is an NLP technique used to discover **hidden themes (topics)**
within a collection of documents.

Documents can be:
- rows in a DataFrame
- items in a list
- individual text files


## 1️⃣ What is Topic Modeling?

Topic modeling scans a collection of documents to identify **patterns of word usage**.

Based on these patterns, documents that discuss similar ideas
are grouped together into **topics**.

Key characteristics:
- Topic modeling is an example of **unsupervised learning**
- No labeled data is required
- Algorithms discover structure automatically


## 2️⃣ Why Topic Modeling Works

Topic modeling algorithms:
- identify recurring word patterns
- learn what each document is mostly about
- group documents that share similar word distributions

This helps uncover the **main themes**
that run through a collection of documents.


## 3️⃣ Common Topic Modeling Algorithms

Two widely used topic modeling techniques are:

1. **Latent Dirichlet Allocation (LDA)**
2. **Latent Semantic Analysis (LSA)**

Both aim to uncover latent topics, but they use different mathematical approaches.


## 4️⃣ Latent Dirichlet Allocation (LDA)

LDA is a probabilistic topic modeling technique.

Key ideas:
- Each document is a mixture of topics
- Each topic is a mixture of words
- Topics are inferred based on word co-occurrence patterns
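
The two "mixture" ideas above can be made concrete with a small generative sketch. This is only an illustration of the intuition, not how LDA is trained: the topic names, word probabilities, and the `generate_document` helper are all made up for this example.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "finance": {"market": 0.5, "stock": 0.3, "bank": 0.2},
}

def generate_document(topic_mixture, n_words, rng):
    """Sample each word by first picking a topic, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# A document that is 80% "sports" and 20% "finance".
doc = generate_document({"sports": 0.8, "finance": 0.2}, n_words=10, rng=rng)
print(doc)
```

LDA works in the opposite direction: it only sees the words and tries to infer topic mixtures and word distributions that could plausibly have produced them.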


In [None]:
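# libraries used: pandas, re, nltk, gensim, matplotlib
# note: NLTK's stopword list and tokenizer data must be downloaded once, e.g. nltk.download('stopwords')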
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from gensim.models import LsiModel
import gensim
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt




In [None]:
data = pd.read_csv("../../data/news_articles.csv")

In [None]:
data.head()

In [None]:
# the title column holds the article headlines; the content column holds the full article text
data.info()

In [None]:
#Assigning and Cleaning the text data
articles = data["content"]

In [None]:
# lowercase, then drop every character that is not a word character or whitespace
articles = articles.str.lower().apply(lambda x: re.sub(r"[^\w\s]", "", x))

In [None]:
#removing stopwords
en_stopwords = set(stopwords.words('english'))  # a set makes the membership test below fast
articles = articles.apply(lambda x: ' '.join(word for word in x.split() if word not in en_stopwords))
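
To see what the cleaning and stopword steps above do to a single string, here is a small standalone sketch. The `clean_text` helper and the mini stopword set are assumptions made for this example; the notebook itself uses NLTK's full English list.

```python
import re

# Hypothetical mini stopword list (the notes use NLTK's full English list).
stop_words = {"the", "a", "and", "don't", "why"}

def clean_text(text):
    """Lowercase, strip punctuation, then drop stopwords."""
    lowered = text.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(w for w in no_punct.split() if w not in stop_words)

cleaned = clean_text("The investors don't know why stocks fell 3% today!")
print(cleaned)  # investors dont know stocks fell 3 today
```

One thing this makes visible: because punctuation is stripped before stopwords are removed, "don't" becomes "dont" and no longer matches the apostrophe form in the stopword list. Removing stopwords before stripping punctuation would avoid that, if it matters for the corpus.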



In [None]:
#tokenizing the text
articles = articles.apply(lambda x: word_tokenize(x))



In [None]:
# stemming the text; with this much text, stemming is chosen because it is fast
ps = PorterStemmer()
articles = articles.apply(lambda tokens: [ps.stem(token) for token in tokens]) 

In [None]:
articles

In [None]:
#creating the dictionary and corpus needed for Topic Modeling
# each distinct token in the articles is assigned a unique integer id, which lets the models refer to words by id
dictionary = corpora.Dictionary(articles)
print(dictionary)

In [None]:
# convert each document into bag-of-words format
# doc2bow takes a single tokenized article, looks up each token in the dictionary,
# and returns a list of (token_id, count) pairs for that article
doc_term = [dictionary.doc2bow(doc) for doc in articles]
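
The dictionary and bag-of-words steps can be mimicked in plain Python to see what the data looks like. This is a rough stand-in, not gensim's implementation: the toy documents, `token2id`, and `to_bow` are made up for this example, and gensim may assign ids in a different order.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, not from the news dataset).
docs = [
    ["market", "stock", "market"],
    ["team", "game", "team", "score"],
]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Each document ends up as a short list of pairs, so word order is lost and only the counts remain. That is all the topic models below get to see.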


In [None]:
print(doc_term)

In [None]:
#to begin modeling we decide how many topics we want to extract from the articles. For this example, we choose 2 topics.
num_topics = 2

In [None]:
lda_model = gensim.models.LdaModel(corpus=doc_term, id2word=dictionary, num_topics=num_topics)

In [None]:
lda_model.print_topics(num_topics=num_topics, num_words=5)


## 5️⃣ Latent Semantic Analysis (LSA)

LSA is based on **linear algebra** and **dimensionality reduction**.

It projects documents into a lower-dimensional latent space, where similar documents can be found using:
- clustering
- similarity scores (for example, cosine similarity)

LSA rests on two key ideas:


### 🔹 Distributional Hypothesis

The distributional hypothesis states:

> Words with similar meanings tend to appear in similar contexts.

In other words:
- words that appear in similar contexts (surrounded by similar words)
- often share related meanings
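
A quick way to check this intuition is to count each word's neighbours and compare the counts. The sketch below is a toy illustration only: the sentences, the one-word window, and the helper names are assumptions for this example.

```python
import math
from collections import Counter, defaultdict

# Toy sentences (hypothetical). "cat" and "dog" share neighbours; "car" does not.
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "cat", "chases", "mice"],
    ["the", "dog", "chases", "birds"],
    ["the", "car", "needs", "fuel"],
]

def context_vectors(sentences, window=1):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = context_vectors(sentences)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_car = cosine(vectors["cat"], vectors["car"])
print(cat_dog, cat_car)
```

"cat" and "dog" never appear in the same sentence here, yet they score as more similar than "cat" and "car" because their neighbours match. That is the sense of "similar contexts" that LSA relies on.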


### 🔹 Singular Value Decomposition (SVD)

SVD reduces high-dimensional text data into a lower-dimensional space
that captures the most important patterns.

Mathematically:

M = U Σ Vᵀ

Where:
- **M** = document-term matrix
- **U** = document-topic matrix
- **Σ (Sigma)** = diagonal matrix of singular values, giving the importance of each latent topic
- **Vᵀ** = topic-term matrix
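
To make the factorization concrete, here is a small NumPy sketch on a toy document-term matrix. NumPy is an assumption here (the notebook itself does not import it), and the counts are made up. It only illustrates the decomposition and the truncation step, not what gensim does internally.

```python
import numpy as np

# Toy document-term matrix M: rows are documents, columns are terms (hypothetical counts).
M = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

# U: document-topic, singular_values: diagonal of Sigma, Vt: topic-term.
U, singular_values, Vt = np.linalg.svd(M, full_matrices=False)

# Multiplying the three factors back together recovers M.
reconstructed = U @ np.diag(singular_values) @ Vt

# Keeping only the k largest singular values gives a rank-k approximation.
k = 2
M_k = U[:, :k] @ np.diag(singular_values[:k]) @ Vt[:k, :]

# Each document expressed in the k-dimensional latent "topic" space.
doc_topic = U[:, :k] * singular_values[:k]

print(singular_values)
print(doc_topic.shape)
```

Dropping the smaller singular values is the dimensionality reduction step: the documents are then compared in the k-dimensional space instead of the full term space.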


## 6️⃣ Building an LSA Model

We use gensim's `LsiModel` for this. LSI (Latent Semantic Indexing) is another name for LSA.


In [None]:
lsmodel = LsiModel(doc_term, num_topics=num_topics, id2word=dictionary)
print(lsmodel.print_topics(num_topics=num_topics, num_words=5))

## 7️⃣ Determining the Optimal Number of Topics

The number of topics has to be chosen before training, and it strongly affects how interpretable the resulting topics are.

We use **topic coherence** to evaluate topic quality.

Coherence measures:
- how meaningful the top words in a topic are
- how often those words appear together in documents

Higher coherence scores indicate more interpretable topics.
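
The "appear together" idea can be sketched with a very simplified score: for each pair of top words, the share of documents containing both out of the documents containing either. This is a deliberately simplified stand-in, not the `c_v` measure that gensim's `CoherenceModel` computes. The documents and helper names are made up for this example.

```python
from itertools import combinations

# Toy documents as sets of tokens (hypothetical).
docs = [
    {"market", "stock", "bank"},
    {"market", "stock", "trade"},
    {"team", "game", "score"},
    {"team", "game", "coach"},
]

def pair_cooccurrence(word_a, word_b, docs):
    """Share of documents containing both words, among those containing either."""
    both = sum(1 for d in docs if word_a in d and word_b in d)
    either = sum(1 for d in docs if word_a in d or word_b in d)
    return both / either if either else 0.0

def simple_coherence(top_words, docs):
    """Average pairwise co-occurrence over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(pair_cooccurrence(a, b, docs) for a, b in pairs) / len(pairs)

coherent = simple_coherence(["market", "stock", "bank"], docs)
mixed = simple_coherence(["market", "team", "bank"], docs)
print(coherent, mixed)
```

A topic whose top words come from the same documents scores higher than one that mixes unrelated words, which is the behaviour the real coherence measures aim for.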


In [None]:
coherence_values = []
model_list = []

min_topics = 2
max_topics = 11

for num_topics_i in range(min_topics, max_topics+1):
    model = LsiModel(doc_term, num_topics=num_topics_i, id2word=dictionary, random_seed=0)
    model_list.append(model)
    # 'c_v' scores how often the top words of a topic appear together in the documents
    coherence_model = CoherenceModel(model=model, texts=articles, dictionary=dictionary, coherence='c_v')
    coherence_values.append(coherence_model.get_coherence())

## 8️⃣ Visualizing Coherence Scores

We plot coherence scores against the number of topics
to identify the optimal topic count.


In [None]:
plt.plot(range(min_topics, max_topics+1), coherence_values)
plt.xlabel("Number of Topics")
plt.ylabel("Coherence Score")
plt.legend(["coherence_values"], loc='best')  # pass a list; a bare string would be split into single characters
plt.show()

## 9️⃣ Training the Final LSA Model

Based on coherence scores, we select the optimal number of topics
and train the final LSA model.
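
Instead of reading the best value off the plot by eye, the topic count with the highest coherence can be picked programmatically. A minimal sketch, assuming the scores are stored in order starting at `min_topics`; the `best_num_topics` helper and the example scores are hypothetical, not results from this notebook.

```python
def best_num_topics(coherence_values, min_topics):
    """Return the topic count whose coherence score is highest."""
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return min_topics + best_index

# Hypothetical scores for 2, 3, 4 and 5 topics.
example_scores = [0.31, 0.42, 0.38, 0.35]
print(best_num_topics(example_scores, min_topics=2))  # 3
```

The highest score is a starting point rather than a final answer. It is still worth printing the topics for the chosen count and checking that they make sense.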


In [None]:
final_n_topics = 3  # set this to the topic count with the best coherence score; 3 is used here
lsmodel_f = LsiModel(doc_term, num_topics=final_n_topics, id2word=dictionary)
print(lsmodel_f.print_topics(num_topics=final_n_topics, num_words=5))

## ✅ Final Takeaways

- Topic modeling uncovers hidden themes in text
- It is an unsupervised learning technique
- LDA uses probabilistic modeling
- LSA uses linear algebra and dimensionality reduction
- Topic coherence helps evaluate topic quality
- Selecting the right number of topics is crucial for interpretability
